DATASETS GENERAL INFORMATION

1. Title: Diatoms and plants sedimentary ancient DNA from Lake Satagay (Central Yakutia, Siberia) covering the last 10,800 years

2. Author Information:
Corresponding authors:
Izabella Baisheva - Alfred Wegener Institut, Potsdam - izabella.baisheva@awi.de
Kathleen R. Stoof-Leichsenring - Alfred Wegener Institut, Potsdam - kathleen.stoof-leichsenring@awi.de
Co-authors:
Luidmila Pestryakova - North-Eastern Federal University of Yakutsk
Ramesh Glückler - Alfred Wegener Institut, Potsdam
Ulrike Herzschuh - Alfred Wegener Institut, Potsdam

3. Date of data collection: Lacustrine sediment core EN18224-4 was retrieved on August 3, 2018

4. Geographic location of data collection: Lake Satagay (N 63.07816°, E 117.99806°), Central Yakutia (Sakha Republic), East Siberia, Russia

5. Funding sources that supported the collection of the data:
Izabella Baisheva - German Academic Exchange Service e.V. (DAAD)
Luidmila Pestryakova - Russian Ministry of Education and Science FSRG-2020-0019
Ramesh Glückler - AWI INSPIRES (International Science Program for Integrative Research)
Ulrike Herzschuh - European Research Council Glacial Legacy: 772852

6. Citation for this dataset: Baisheva, Izabella et al. (2022), Diatoms and plants sedimentary ancient DNA from Lake Satagay (Central Yakutia, Siberia) covering the last 10,800 years, Dryad, Dataset, https://doi.org/10.5061/dryad.vq83bk3w5

The datasets are prepared for the manuscripts:
Baisheva et al. (2022): "Permafrost-thaw lake development in Central Yakutia – Sedimentary ancient DNA and element analyses from a Holocene sediment record" (submitted)
Glückler et al. (2022): "Holocene wildfire and vegetation dynamics in Central Yakutia, Siberia, reconstructed from lake-sediment proxies" (preprint).

##################################################################################

DATASETS STRUCTURE OVERVIEW:

DIRECTORY 1*.
Data: Diatoms and plants sedimentary ancient DNA from Lake Satagay (Central Yakutia, Siberia) covering the last 10,800 years (data)

Folder APMG_32 contains several subfolders and files of different formats:
  Subfolder: 00.APMG-32_Metadata
    File: APMG-32_Metadata.xlsx
  Subfolder: 01.Raw_data_APMG-32
    Files: 210602_NB501850_A_L1-4_APMG-32_R1.fastq.gz
           210602_NB501850_A_L1-4_APMG-32_R2.fastq.gz
  Subfolder: 04.Final_resampled_data_APMG-32
    Files: APMG-32_identitylevel0.98_wideformat.csv

Folder APMG_33 contains several subfolders and files of different formats:
  Subfolder: 00.APMG-33_Metadata
  Subfolder: 01.Raw_data_APMG_33 – Illumina sequencing raw data.
    Files: 210602_NB501850_A_L1-4_APMG-33_R1.fastq.gz
           210602_NB501850_A_L1-4_APMG-33_R2.fastq.gz
  Subfolder: 04.Final_datasets_APMG-33
    Files: APMG-33_identitylevel100_wideformat.csv - Final dataset.
           APMG-33_macrophytes_resampled_scientific_name.csv
           APMG-33_terrestrial_families.csv

DIRECTORY 2*. Scripts: Diatoms and plants sedimentary ancient DNA from Lake Satagay (Central Yakutia, Siberia) covering the last 10,800 years (scripts and supporting tag files).

Folder APMG_32
  Subfolder: 02.Reference_data_rbcl – Database used for taxonomic assignment of diatoms.
    Files: rbcl_embl143_db.fasta
           Obi3_rbcL_database_build.sh - Script for the conversion step.
  Subfolder: 03.OBITools_APMG-32
    Files: APMG_32_metabarcoding_rbcL_obi3_Dryad.sh
           APMG-32_embl143_rbcL.csv
           APMG32_tagfile.txt

*The datasets were uploaded into two separate directories, one containing data and one containing scripts, because the datasets include the processing of the raw sequencing data with bioinformatics tools. Each directory contains two main folders: APMG_32 (Diatoms) and APMG_33 (Plants). After downloading, the files of APMG_32 and APMG_33 from both directories have to be merged into the same folder, so that the dataset structure matches the one given in the Usage notes.

##################################################################################

DETAILED DESCRIPTION OF DATA - DIRECTORY 1.

DIRECTORY 1.
Data: Diatoms and plants sedimentary ancient DNA from Lake Satagay (Central Yakutia, Siberia) covering the last 10,800 years_DRYAD (data)

######## Folder APMG_32 ########

Subfolder: 00. APMG-32_Metadata - Metadata information including lake geographic coordinates, sample depths and ages, laboratory codes, and the tag combinations of forward and reverse primers used to enable demultiplexing of the sequencing data

File: APMG-32_Metadata.xlsx
Format: .xlsx
Description: The metadata contains information on the sequencing of our core (run number, type, device, mode, forward and reverse tags, read length). It also includes information on individual samples: name, type, age, depth, extraction number, and PCR number, as well as sediment core name and core section number.

DATA-SPECIFIC INFORMATION FOR: APMG-32_Metadata.xlsx

1. Number of variables: 32

2. Number of rows: 243

3. Variable List:
Sample Name (automatic) - generated automatically by combining PCRexperimentNumberCombined, ExtractionNumber, CoreName, MeanCollectionDepth(cm), SampleType(sample/blank)
SampleType(sample/blank) - indicates type of amplified item: sample or blank
Sequence Run Number - Identification number of sequenced batch
Sequencing Device - Short name of device
Read Length - Length of DNA
Mode of Sequencing - paired end
Raw file Forward - File name
Raw file Reverse - File name
Tags - combination of tags
Forward Primer (without tag) - rbcL primer for the amplification of diatoms, targeting a diagnostic short diatom metabarcode (primer: diat_rbcL705, Stoof-Leichsenring et al. 2012)
Reverse Primer (without tag) - rbcL primer for the amplification of diatoms, targeting a diagnostic short diatom metabarcode (primer: diat_rbcL808, Stoof-Leichsenring et al. 2012)
Genetic Approach (Metabarcoding/Shotgun) - Type of our genetic approach is Metabarcoding
Quality Filter - Illuminapairened
Extraction Number - Number of extracted sample
Experiment Number - Number of PCR experiment
PCR Number Single - Number of PCR in PCR batch
PCR experiment Number Combined - Number of prepared PCR sample
Core Name - Identification code of sediment core
Section Name (Core Name) - Identification code of subsampled sediment core sample (usually core ID and cm)
Section Type (Core or Bulk Segment) - indicates type of core: Core or Bulk Segment
Collection Core Depth upper (cm) = composite depth - Upper cm of composite depth
Collection Core Depth lower (cm) = composite depth - Lower cm of composite depth
Mean Collection Depth round (cm) - Upper composite depth, entered here to have a one-digit number (usually it is two digits)
Section Age round (yr BP) - Information on age of samples
Mean Collection Depth (cm) - Upper composite depth, entered here to have a one-digit number (usually it is two digits)
Section Age (yr BP) - Age of samples
Collection Type - Indicates origin of samples
Collection Substrate (Lake sediment/ Permafrost/ Lake water/ Soil) - Indicates origin of samples
Collection Date (Day_Month_Year,10_05_2003)
Collection Latitude (Decimal)
Collection Longitude (Decimal)
Site Name - In our case this is the lake name; because our working group already had another lake with the same name, our site name carries an additional 2.0

4. Missing data codes: None

5. Abbreviations used:
PCR - polymerase chain reaction
BP - before present (usually BP = before 1950 AD)
cm - centimeter
DNA - Deoxyribonucleic acid
rbcL - the chloroplast gene rbcL, which codes for the large subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO or RuBPCase)
ID - Identification number

6. Other relevant information: None

Subfolder: 01. Raw_data_APMG-32 – Illumina sequencing raw data.
Files:
210602_NB501850_A_L1-4_APMG-32_R1.fastq.gz
210602_NB501850_A_L1-4_APMG-32_R2.fastq.gz
Format: Illumina fastq format (.gz archived).
Description: The sequence files are compressed as .gz archives. Before the data can be used with the Obitools script (file “APMG_32_metabarcoding_rbcL_obi3_Dryad.sh”, provided in Directory 2, Folder APMG-32), the archives need to be uncompressed into .fastq files.

Subfolder: 04. Final_resampled_data_APMG-32:
Files:
APMG-32_identitylevel0.98_wideformat.csv - Final count data.
APMG-32_final_resampled_scientific_name.csv - Final dataset with a filtering threshold of 98%, resampled to the minimal number of counts (n=2050), including diatoms and Nannochloropsis.
Format: .csv
Description:

DATA-SPECIFIC INFORMATION FOR: APMG-32_identitylevel0.98_wideformat.csv

1. Number of variables: 249

2. Number of rows: 425

3. Variable List:
best_identity - shows % of best match with the database; matches greater than or equal to 98% were used in our study
NUC_SEQ_arc - Unique code of amplified sequence
scientific_name - the best possible assignment, which can vary between family, genus or species level
best_family - assigned name on family level, if assigned (there can be NAs)
best_genus - assigned name on genus level, if assigned (there can be NAs)
best_species - assigned name on species level, if assigned (there can be NAs)
Sample columns, starting with MERGED_sample.IB009P.04_IB009P_EIB004.EIB024_EN18224.4_42_sample - Sample Name x 243 times

4. Missing data codes: NA: No amplification

5. Abbreviations used: none

6. Other relevant information: In our study we chose to show the results on a scientific name level, therefore this file is not the final dataset. The final count data was resampled (https://github.com/StefanKruse/R_Rarefaction) to the minimum number of counts (n = 2050, derived from sample 65) to normalize the dataset prior to subsequent statistical analyses. This process was repeated 100 times and then the mean value was calculated.
The final data set includes 422 ASVs, which were grouped into 43 unique taxa names.

DATA-SPECIFIC INFORMATION FOR: APMG-32_final_resampled_scientific_name.csv

1. Number of variables: 44

2. Number of rows: 63

3. Variable List:
Age - Age of samples
and 43 taxa names at the scientific name level

4. Missing data codes: None

5. Abbreviations used: None

6. Other relevant information: The file APMG-32_final_resampled_scientific_name.csv was used for further statistical analyses in Baisheva et al. (2022): "Permafrost-thaw lake development in Central Yakutia – Sedimentary ancient DNA and element analyses from a Holocene sediment record" (submitted)

######## Folder APMG_33 ########

Subfolder: 00. APMG-33_Metadata

File: APMG-33_Satagay2_metadata.xlsx
Format: .xlsx
Description: Contains information on sequencing (run number, type, device, mode, forward and reverse tags, read length). It also includes information on individual samples: name, type, age, depth, extraction number, and PCR number, as well as sediment core name and core section number.

DATA-SPECIFIC INFORMATION FOR: APMG-33_Satagay2_metadata.xlsx

1. Number of variables: 32

2. Number of rows: 231

3. Variable List:
Sample Name (automatic) - generated automatically by combining PCRexperimentNumberCombined, ExtractionNumber, CoreName, MeanCollectionDepth(cm), SampleType(sample/blank)
SampleType(sample/blank) - indicates type of amplified item: sample or blank
Sequence Run Number - Identification number of sequenced batch
Sequencing Device - Short name of device
Read Length - Length of DNA
Mode of Sequencing - paired end
Raw file Forward - File name
Raw file Reverse - File name
Tags - combination of tags
Forward Primer (without tag) - primer targeting the chloroplast trnL P6 loop (primer: g, Taberlet et al. 2007)
Reverse Primer (without tag) - primer targeting the chloroplast trnL P6 loop (primer: h, Taberlet et al. 2007)
Genetic Approach (Metabarcoding/Shotgun) - Type of our genetic approach is Metabarcoding
Quality Filter - Illuminapairened
Extraction Number - Number of extracted sample
Experiment Number - Number of PCR experiment
PCR Number Single - Number of PCR in PCR batch
PCR experiment Number Combined - Number of prepared PCR sample
Core Name - Identification code of sediment core
Section Name (Core Name) - Identification code of subsampled sediment core sample (usually core ID and cm)
Section Type (Core or Bulk Segment) - indicates type of core: Core or Bulk Segment
Collection Core Depth upper (cm) = composite depth - Upper cm of composite depth
Collection Core Depth lower (cm) = composite depth - Lower cm of composite depth
Mean Collection Depth round (cm) - Upper composite depth, entered here to have a one-digit number (usually it is two digits)
Section Age round (yr BP) - Information on age of samples
Mean Collection Depth (cm) - Upper composite depth, entered here to have a one-digit number (usually it is two digits)
Section Age (yr BP) - Age of samples
Collection Type - Indicates origin of samples
Collection Substrate (Lake sediment/ Permafrost/ Lake water/ Soil) - Indicates origin of samples
Collection Date (Day_Month_Year,10_05_2003)
Collection Latitude (Decimal)
Collection Longitude (Decimal)
Site Name - In our case this is the lake name; because our working group already had another lake with the same name, our site name carries an additional 2.0

4. Missing data codes: None

5. Abbreviations used:
PCR - polymerase chain reaction
yr BP - year before present (usually BP = before 1950 AD)
cm - centimeter
DNA - Deoxyribonucleic acid
trnL - the chloroplast trnL P6 loop (primers: g-h, Taberlet et al. 2007)
ID - Identification number

6. Other relevant information: The metadata is needed to create a tagfile and to indicate sample type, age and depth.

Subfolder: 01. Raw_data_APMG_33 – Illumina sequencing raw data.
Files:
210602_NB501850_A_L1-4_APMG-33_R1.fastq.gz
210602_NB501850_A_L1-4_APMG-33_R2.fastq.gz
Format: Illumina fastq format.
Description: The sequence files are compressed as .gz archives. The archives can be uncompressed on a Linux OS using the gzip -d command.

Subfolder: 04. Final_datasets_APMG-33 - EMBL and Arctic assignments were merged into one dataset and filtered with a 100% threshold. The final datasets are separated into macrophytes and terrestrial plants.
Files:
APMG-33_identitylevel100_wideformat.csv - Final count data.
APMG-33_macrophytes_resampled_scientific_name.csv - Final dataset of separated macrophytes, resampled to the minimal number of counts (n=1653).
APMG-33_terrestrial_families.csv - Final dataset of separated terrestrial plants.
Format: .csv
Description: The file “APMG-33_macrophytes_resampled_scientific_name.csv” from the output data was used for further statistical analyses in Baisheva et al. (2022): "Permafrost-thaw lake development in Central Yakutia – Sedimentary ancient DNA and element analyses from a Holocene sediment record" (submitted). The file “APMG-33_terrestrial_families.csv” of separated terrestrial plants data was used for further statistical analyses in Glückler et al. (2022): "Holocene wildfire and vegetation dynamics in Central Yakutia, Siberia, reconstructed from lake-sediment proxies" (preprint).
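The resampling to the minimum number of counts mentioned above can be illustrated with a minimal Python sketch. It is not the R script from https://github.com/StefanKruse/R_Rarefaction that was used for these datasets. The function names, the toy counts, and the choice to draw reads without replacement are assumptions made only for illustration.

```python
import random


def rarefy_mean(counts, depth, repeats=100, seed=0):
    """Mean per-taxon count over `repeats` random subsamples of `depth` reads.

    Assumption: reads are drawn without replacement.
    """
    if depth > sum(counts.values()):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    taxa = list(counts)
    # Build a pool with one entry per read, labelled by taxon index.
    pool = [i for i, taxon in enumerate(taxa) for _ in range(counts[taxon])]
    totals = [0] * len(taxa)
    for _ in range(repeats):
        for i in rng.sample(pool, depth):
            totals[i] += 1
    return {taxon: totals[i] / repeats for i, taxon in enumerate(taxa)}


def rarefy_table(table, repeats=100, seed=0):
    """Rarefy every sample to the smallest total read count in the table."""
    depth = min(sum(counts.values()) for counts in table.values())
    rarefied = {
        sample: rarefy_mean(counts, depth, repeats, seed)
        for sample, counts in table.items()
    }
    return depth, rarefied


# Toy counts (made up for illustration, not taken from the dataset).
toy = {
    "sample_A": {"taxon_1": 30, "taxon_2": 10, "taxon_3": 0},
    "sample_B": {"taxon_1": 5, "taxon_2": 5, "taxon_3": 10},
}
depth, rarefied = rarefy_table(toy)
print(depth)
print(rarefied["sample_B"])
```

In this sketch the sample with the fewest reads keeps its original counts, and every other sample is reduced to the same total, so the samples become comparable before statistical analysis.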
DATA-SPECIFIC INFORMATION FOR: APMG-33_identitylevel100_wideformat.csv

1. Number of variables: 242

2. Number of rows: 155

3. Variable List:
best_identity - shows % of best match with the database; only matches equal to 100% were used in our study
NUC_SEQ_arc - Unique code of amplified sequence
scientific_name - the best possible assignment, which can vary between family, genus or species level
best_family - assigned name on family level, if assigned (there can be NAs)
best_genus - assigned name on genus level, if assigned (there can be NAs)
best_species - assigned name on species level, if assigned (there can be NAs)
BEST_IDENTITY_arc - shows % of best match with the database; only matches equal to 100% were used in our study
SCIENTIFIC_NAME_arc - the best possible assignment, which can vary between family, genus or species level
family_name_arc - assigned name on family level, if assigned (there can be NAs)
BEST_IDENTITY_embl - shows % of best match with the database; only matches equal to 100% were used in our study
SCIENTIFIC_NAME_embl - the best possible assignment, which can vary between family, genus or species level
family_name_embl - assigned name on family level, if assigned (there can be NAs)
Sample columns, starting with MERGED_sample:IB004P.01_IB004P_EIB001.EIB021_EN18224.4_0_sample - Sample Name x 230 times

4. Missing data codes: NA: No amplification

5. Abbreviations used:
embl - matched against the EMBL Nucleotide Sequence Database
arc - matched against the Arctic and Boreal vascular plant and bryophytes database (Willerslev et al., 2014; Soininen et al., 2015)

6. Other relevant information: Only 100% matches with either the Arctic or the EMBL database were considered. If both the Arctic and the EMBL matches were 100%, the taxonomic name of the assignment against the Arctic database was selected. In our study we chose to show the results on a scientific name level, therefore this file is not the final dataset.

DATA-SPECIFIC INFORMATION FOR: APMG-33_macrophytes_resampled_scientific_name.csv

1. Number of variables: 18

2. Number of rows: 61

3. Variable List:
depth - depth of samples
Age - Age of samples
and 16 taxa names at the scientific name level

4. Missing data codes: None

5. Abbreviations used: None

6. Other relevant information: Terrestrial plants were excluded and the aquatic plant data was resampled (https://github.com/StefanKruse/R_Rarefaction) to the minimum number of counts (n = 1653, derived from sample 61) to normalize the dataset prior to subsequent statistical analyses. This process was repeated 100 times and then the mean value was calculated. The final data set consists of 14 unique taxa names, including ten submerged plant types and two emerged macrophytes. The file APMG-33_macrophytes_resampled_scientific_name.csv was used for further statistical analyses in Baisheva et al. (2022): "Permafrost-thaw lake development in Central Yakutia – Sedimentary ancient DNA and element analyses from a Holocene sediment record" (submitted)

DATA-SPECIFIC INFORMATION FOR: APMG-33_terrestrial_families.csv

1. Number of variables: 27

2. Number of rows: 60

3. Variable List:
depth - depth of samples
and 26 taxa names at the family level

4. Missing data codes: NA: No amplification

5. Abbreviations used: NA: No amplification

6. Other relevant information: Aquatic plants were excluded and terrestrial plants are given at the family level. The file APMG-33_terrestrial_families.csv was used for further statistical analyses in Glückler et al. (2022): "Holocene wildfire and vegetation dynamics in Central Yakutia, Siberia, reconstructed from lake-sediment proxies" (preprint)

##################################################################################

DETAILED DESCRIPTION OF DATA - DIRECTORY 2.

DIRECTORY 2. Scripts: Diatoms and plants sedimentary ancient DNA from Lake Satagay (Central Yakutia, Siberia) covering the last 10,800 years_DRYAD (scripts and supporting tag files).

######## Folder APMG_32 ########

Subfolder: 02. Reference_data_rbcl – Database used for taxonomic assignment of diatoms.
Files:
rbcl_embl143_db.fasta
Obi3_rbcL_database_build.sh - Script for the conversion step.
Format: .fasta and .sh. To use the rbcL database in the Obitools script (APMG_32_metabarcoding_rbcL_obi3_Dryad.sh), the file rbcl_embl143_db.fasta needs to be converted to an obi3 database.
FILES:
APMG_32_metabarcoding_rbcL_obi3_Dryad.sh - Script to run the OBITools3 pipeline with short descriptions and output data.
Other relevant information: We ran our raw data on the Ollie server

Subfolder: 03. OBITools_APMG-32 – The metabarcoding pipeline for analyzing the raw sequencing data using OBITools3.
Files:
APMG_32_metabarcoding_rbcL_obi3_Dryad.sh - Script to run the OBITools3 pipeline with short descriptions and output data.
APMG-32_embl143_rbcL.csv - Output file.
APMG32_tagfile.txt - File containing the primer combinations for demultiplexing with Obitools3 (see script: APMG_32_metabarcoding_rbcL_obi3_Dryad.sh).
Format: .csv, .txt and .sh
Other relevant information: We ran our raw data on the Ollie server

######## Folder APMG_33 ########

Subfolder: 02. Reference_database_plants – Reference databases to run the OBITools pipeline, with a short instruction and scripts for the conversion step.
FILES:
arctborbryo_gh.fasta
gh_embl143_db_97.fasta
Obi3_arctborbryo_database_build.sh
Obi3_embl_database_build.sh
Format: .fasta and .sh. To use the arctborbryo and embl143 databases in the Obitools script (APMG-33_obi3_script.sh), the .fasta files need to be converted to obi3 databases.
Other relevant information: We ran our raw data on the Ollie server

Subfolder: 03. OBITools_APMG-33 – The metabarcoding pipeline for analyzing the raw sequencing data using OBITools3.
FILES:
APMG33_arc_anno.csv - Output file.
APMG33_embl143_anno.csv - Output file.
APMG-33_obi3_script.sh
APMG-33_tagfile.txt
Format: .csv, .txt and .sh. OBITools_APMG-33 has two outputs because the taxonomic assignment was run against both the EMBL and the Arctic databases.
Other relevant information: We ran our raw data on the Ollie server
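
The two annotation outputs are combined according to the rule stated for APMG-33_identitylevel100_wideformat.csv: only 100% matches are kept, and the Arctic name is preferred when both databases match at 100%. The following Python sketch illustrates that rule only; it is not the code used to produce the datasets. The function name, the example values, and the assumption that identity is stored as a fraction (1.0 = 100%) are hypothetical.

```python
def choose_assignment(arc_identity, arc_name, embl_identity, embl_name,
                      threshold=1.0):
    """Pick one taxonomic name for a sequence from two database matches.

    Assumption: identities are fractions, so 1.0 means a 100% match.
    Returns (name, source) or None if neither match reaches the threshold.
    """
    # The Arctic database is checked first, so it wins when both match.
    if arc_identity is not None and arc_identity >= threshold:
        return arc_name, "arc"
    if embl_identity is not None and embl_identity >= threshold:
        return embl_name, "embl"
    return None


# Made-up example values, not taken from the dataset.
both = choose_assignment(1.0, "name_from_arc", 1.0, "name_from_embl")
embl_only = choose_assignment(0.97, "name_from_arc", 1.0, "name_from_embl")
neither = choose_assignment(0.98, "name_from_arc", 0.99, "name_from_embl")
print(both, embl_only, neither)
```

Sequences for which the function returns None would be dropped under the 100% threshold.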