This InsectMobileREADME.txt file was generated on 2023-10-09 by the authors of the paper

GENERAL INFORMATION

1. Title of Dataset: InsectMobile_2023_README.txt

2. Author Information is excluded to comply with GDPR

3. Date of data collection: 2018-06-01 - 2019-06-30

4. Geographic location of data collection: Denmark (DK)

5. Information about funding sources that supported the collection of the data: Aage V. Jensens Naturfond & The Danish Ministry of Higher Education and Science (7072-00014B) (DK) and The German Research Foundation (DFG FZT 118, 202548816) (DE)

SHARING/ACCESS INFORMATION

1. Licenses/restrictions placed on the data: Please cite the dataset if used in another context.

2. Links to publications that cite or use the data: DOI: 10.1111/icad.12694

3. Links to other publicly accessible locations of the data: -

4. Links/relationships to ancillary data sets: -

5. Was data derived from another source? yes
   A. If yes, list source(s): Please see the supplementary file of the paper for a list of data sources for the environmental data.

6. Recommended citation for this dataset: Svenningsen et al. 2023, Insect biomass shows a stronger decrease than species richness along urban gradients, Insect Conservation and Diversity

DATA & FILE OVERVIEW

1. File List:

   1.1 asvs_BIN_comb.csv: matched ASV table and samples
       - summarized reads per total sample (abbreviations of samples)
       - contains only sequences longer than 200 bp, so the ASV table should only have ASVs with the correct read length
       - rows that contain only zeros (ASVs that are not present in the subsetted samples) were removed

   1.2 taxonomy_BIN_combo.csv: table of taxonomy classifications of samples
       - Data have been matched against a 99% clustered version of the BOLD Public Database v2022-02-22 public data (COI-5P sequences)
       - Names were matched to their current taxonomic status to make sure we have the accepted names
       - We checked how frequent the different match types are across sequences and across BINs

   1.3 data_richness_BINs.csv: metadata of sampling and calculations of biomass and estimated richness of flying insects

   1.4 archived_insectmobil_script.R: R script for the analysis of flying insect biomass and richness

   These data tables contain all data necessary to run the ecological/statistical analyses from the study "Insect biomass shows a stronger decrease than species richness along urban gradients" (Svenningsen et al. 2023, DOI:10.1111/icad.12694).

2. Relationship between files, if important: -

3. Additional related data collected that was not included in the current data package: -

4. Are there multiple versions of the dataset? no

5. Explanation of the variables:

- asvs_BIN_comb.csv: Sample name combined with the information whether the sample was the early (A: 12-15) or the late (B: 17-20) trip that was driven (see data_richness_BINs.csv).

- taxonomy_Bin_comb.csv: In the taxonomy_Bin_comb file, multiple variables exist.
    occurrenceId: unique identifier for each sequence read
    marker: genomic marker
    identity: threshold of the OTU
    bitScore: quality of an alignment
    expectValue: number of hits
    matchType: sequences with an exact match to the reference database
    scientificName: all returned matches were then matched against the GBIF backbone taxonomy by their identifier (e.g. BOLD:ADJ8357)
    kingdom, phylum, class, order, family, genus, species: taxonomy assignments with the GBIF sequence ID tool
    sequence: description of sequences
    type: short classification of scientific name strings
    parsed, parsedpartially: parsed scientific name string into a structured representation of a name
    scientificname, genusorabove, specificepithet: taxonomic classification after parsing
    canonicalname, canonicalnamecomplete, canonicalnamewithmarker: taxonomic classifications with additional author names
    rankmarker: classification depth
    taxonRank: rank of taxon
    infraspecificEpithet: definition of an intraspecific taxon

- data_richness_BINs:
    RouteID_JB: the ID of the route where the sample was taken
    Year: year when the sample was taken
    PIDRouteID: the ID of the volunteer sampling the route, and whether it was route 1 or 2 the pilot (=volunteer) drove (each pilot received two routes).
    SampleID: the ID of the volunteer sampling the route, and whether it was route 1 or 2 the pilot (=volunteer) drove (each pilot received two routes), and whether it was the early (A: 12-15) or the late (B: 17-20) trip that was driven
    PID: the ID of the volunteer sampling the route
    DOFAtlasQuadrantID:
    subLandUseType: additional descriptions of other values of land use types
    Date: the date when the sample was taken
    StartTime: the time when the sampling began
    EndTime: the time when the sampling ended
    Wind: average wind speed and the corresponding range on the Beaufort Wind Scale (m/s)
    Temperature: the average temperature, in 5 degree Celsius increments, at which the sample was taken
    Notes: additional notes added by the pilot
    PilotNotes: further notes by the pilot about the sampling
    utm_x: coordinate in UTM
    utm_y: coordinate in UTM
    decimalLongitude: coordinate in longitude
    decimalLatitude: coordinate in latitude
    eventTime: duration of the complete sampling event
    year: year when the sample was taken
    month: month when the sample was taken
    day: day when the sample was taken
    Time_band: whether the sample was taken during midday (12-15) or in the evening (17-20)
    Route_length: route length in meters.
      NB, based on the presumed and preplanned route the volunteers were supposed to take
    Distance_driven: the full distance the volunteers drove (back to starting point) in meters.
      NB, based on the presumed and preplanned route the volunteers were supposed to take
    yDay: the day of the year on which the sample was taken
    Time_driven: the time it took for the volunteers to sample
    Velocity: -
    concentrationUnit: unit in which DNA concentration was measured
    biomassUncertainty: documentation of potential uncertainties in the biomass measurements
    totalBiomass_mg: the total biomass of the sample (small + large size fraction)
    meanDNAconc: mean concentration of the DNA within the sample

    Agriculture_50, Forest_50, Heathland_50, Open.uncultivated.land_50, Unspecified.land.cover_50, Urban_50, Wetland_50,
    Agriculture_250, Forest_250, Heathland_250, Open.uncultivated.land_250, Unspecified.land.cover_250, Urban_250, Wetland_250,
    Agriculture_500, Forest_500, Heathland_500, Open.uncultivated.land_500, Unspecified.land.cover_500, Urban_500, Wetland_500,
    Agriculture_1000, Forest_1000, Heathland_1000, Open.uncultivated.land_1000, Unspecified.land.cover_1000, Urban_1000, Wetland_1000,
    urbGreenAreaHa_1000, urbGreenAreaHa_500, urbGreenAreaHa_250, urbGreenAreaHa_50,
    urbGreenPropArea_1000, urbGreenPropArea_500, urbGreenPropArea_250, urbGreenPropArea_50:
      All columns with either agriculture, forest, heathland, open.uncultivated.land, unspecified.land.cover, wetland, urbGreenAreaHa and urbGreenPropArea followed by _50, _250, _500, or _1000 are proportional land covers used for analysis. The proportional land cover was calculated for a 50 m, 250 m, 500 m, or 1000 m buffer zone around each route. The land cover range is 0-1, but be aware that it was transformed to 0-100 (%) for the majority of the analyses.
      PropArea is the proportional land cover (0-1 transformed to 0-100 in the analysis) of the dominant land cover

    hegnLength_1000, hegnLength_500, hegnLength_250, hegnLength_50,
    byHegnLength_1000, byHegnLength_500, byHegnLength_250, byHegnLength_50,
    hegnMeterPerHa_1000, hegnMeterPerHa_500, hegnMeterPerHa_250, hegnMeterPerHa_50,
    byHegnMeterPerHa_1000, byHegnMeterPerHa_500, byHegnMeterPerHa_250, byHegnMeterPerHa_50:
      hegnLength, byHegnLength, hegnMeterPerHa are variables used for the land use intensity analysis. The proportional land cover was calculated for a 50 m, 250 m, 500 m, or 1000 m buffer zone around each route. The land cover range is 0-1, but be aware that it was transformed to 0-100 (%) for the majority of the analyses.

    Num_trafficLights: how many traffic lights there were on the route
    Diversity_1000, Diversity_250, Diversity_50, Diversity_500: calculated Shannon diversity of proportional land cover (0-1 transformed to 0-100 in the analysis)
    richness_rarefied: rarefied richness per sample (number of taxa per sample)
    richness_rarefied_shannon: rarefied Shannon diversity per sample
    n_reads: number of reads in each sample
    obs_richness: observed species richness
    richness_est: estimated species richness per sample
    est_richness_lci: lower bound of the confidence interval of the bias-corrected Chao1 richness estimate
    est_richness_uci: upper bound of the confidence interval of the bias-corrected Chao1 richness estimate
    est_richness_model: description of the richness estimator model
    numberTime: sorted time data, standardized around each time band
    cyDay: centered day of the year on which the sample was taken
    cTL: centered number of traffic lights
    cnumberTime: centered time around each time band

Explanations of variables can also be found in the supplementary information of the paper (DOI: 10.1111/icad.12694)

METHODOLOGICAL INFORMATION

1. Description of methods used for collection/generation of data: Please see the associated publication and supplementary material for a description of the methods

2. Methods for processing the data: Please see the associated publication and supplementary material for a description of the methods

3. Instrument- or software-specific information needed to interpret the data: Please see the associated publication and supplementary material for a description of the software-specific information

---

**NOTE ON BIOINFORMATICS**

Two sequencing platforms were used to generate the data: HiSeq 4000 and NovaSeq 6000. NovaSeq assigns quality scores differently from HiSeq: it [simplifies the error rates](https://www.illumina.com/content/dam/illumina-marketing/documents/products/appnotes/novaseq-hiseq-q30-app-note-770-2017-010.pdf) by binning the 40 possible quality scores into just 4 categories, which [vastly reduces the amount of information dada2 can work off of to infer errors in the data](https://github.com/benjjneb/dada2/issues/791). This discrepancy between platforms was not dealt with during the dada2 processing, as we were unaware of the problem when we ran the bioinformatics. It seems to affect how many rare species/sequences are detected/retained: fewer rare species are detected from NovaSeq data if the error rate step in the dada2 pipeline is not updated to accommodate the fewer quality scores.

---
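For readers who have the raw reads at hand (they are not part of this data package), the quality-score binning described in the note above can be checked directly. The following is only a minimal sketch in Python: it assumes unwrapped four-line FASTQ records with Phred+33 encoding, the function names are hypothetical, and the two records are made up for illustration. A file with very few reads could show few distinct scores by chance, so the result is a hint rather than proof of the platform.

```python
import io


def distinct_phred_scores(fastq_handle, offset=33):
    """Collect the distinct Phred scores found in the quality lines.

    Assumes unwrapped FASTQ (exactly four lines per record) and
    Phred+33 encoding; both are assumptions of this sketch.
    """
    scores = set()
    for line_no, line in enumerate(fastq_handle):
        if line_no % 4 == 3:  # fourth line of each record holds the qualities
            scores.update(ord(ch) - offset for ch in line.strip())
    return scores


def looks_binned(fastq_handle, max_bins=4):
    """True if the reads use at most `max_bins` distinct quality scores."""
    return len(distinct_phred_scores(fastq_handle)) <= max_bins


# Made-up records, for illustration only.
many_scores = "@r1\nACGTACGT\n+\nIHGFEDCB\n"   # eight distinct scores
few_scores = "@r1\nACGTACGT\n+\nFFF,F:F#\n"    # four distinct scores

print(looks_binned(io.StringIO(many_scores)))  # False
print(looks_binned(io.StringIO(few_scores)))   # True
```

With real data, an open file handle (for example from `gzip.open(path, "rt")`) can be passed in place of the `io.StringIO` objects.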
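The land cover columns of data_richness_BINs.csv are stored as proportions (0-1) and were rescaled to 0-100 (%) for most analyses, and the Diversity_* columns are described as the Shannon diversity of proportional land cover. The sketch below illustrates both steps in Python on a single made-up row. It is not the authors' code (see archived_insectmobil_script.R for that): the choice of the seven land cover classes, the natural logarithm, the renormalisation of the values, and the function names are all assumptions of this example.

```python
import csv
import io
import math

# Land cover classes as they appear in the column names (suffix = buffer in m).
CLASSES = [
    "Agriculture", "Forest", "Heathland", "Open.uncultivated.land",
    "Unspecified.land.cover", "Urban", "Wetland",
]


def land_cover_percent(row, buffer_m):
    """Rescale the 0-1 proportions of one row to 0-100 (%) for one buffer."""
    return {name: 100.0 * float(row[f"{name}_{buffer_m}"]) for name in CLASSES}


def shannon_diversity(values):
    """Shannon index H = -sum(p * ln p); values are renormalised to sum to 1."""
    total = sum(values)
    if total <= 0:
        return 0.0
    return -sum((v / total) * math.log(v / total) for v in values if v > 0)


# One made-up row with the 50 m buffer columns only, for illustration.
toy_csv = (
    "Agriculture_50,Forest_50,Heathland_50,Open.uncultivated.land_50,"
    "Unspecified.land.cover_50,Urban_50,Wetland_50\n"
    "0.5,0.25,0,0,0,0.25,0\n"
)
row = next(csv.DictReader(io.StringIO(toy_csv)))

percent = land_cover_percent(row, 50)
diversity = shannon_diversity(list(percent.values()))
print(percent["Agriculture"])  # 50.0
print(round(diversity, 4))     # 1.0397
```

Because the index is computed on renormalised values, it gives the same result for the 0-1 and the 0-100 scale.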
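The est_richness_lci and est_richness_uci columns refer to bias-corrected Chao1 richness estimates. As a reminder of what the point estimate of that estimator looks like, here is a minimal Python sketch of the standard bias-corrected Chao1 formula, S_obs + f1(f1 - 1) / (2(f2 + 1)), where f1 and f2 are the numbers of taxa seen exactly once and twice. Treating read counts per BIN as the abundances is an assumption of this example, the counts are made up, and the confidence intervals are not computed here; the authors' own calculation is in archived_insectmobil_script.R.

```python
from collections import Counter


def chao1_bias_corrected(counts):
    """Bias-corrected Chao1 richness: S_obs + f1*(f1-1) / (2*(f2+1)).

    `counts` holds one abundance per taxon; zeros (absent taxa) are ignored.
    """
    present = [c for c in counts if c > 0]
    frequency = Counter(present)
    s_obs = len(present)        # observed richness
    f1 = frequency[1]           # taxa observed exactly once (singletons)
    f2 = frequency[2]           # taxa observed exactly twice (doubletons)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Made-up abundances for one sample, for illustration only.
reads_per_bin = [10, 1, 1, 2, 5, 1, 0]
print(chao1_bias_corrected(reads_per_bin))  # 7.5
```

In this toy sample six taxa are observed, three of them once and one twice, so the estimate adds 3 * 2 / (2 * 2) = 1.5 unseen taxa to the observed richness.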