This Zenodo repository contains the following data:
---------------------------------------------------
Written by Eveline Pinseel in March 2023, and updated in March 2024

# FILE: Pinseel_S.marinoi_pop_gen_data_analysis_MIN40_MAF5.html
Overview of all code used to analyze the SNP data of Skeletonema marinoi
This file contains the full analysis for a minimum coverage filter of 40X and a minimum allele frequency of 5%
This is an html file: open in web browser

# FILE: Pinseel_S.marinoi_pop_gen_data_analysis_MIN20_MAF5.html
Overview of all code used to analyze the SNP data of Skeletonema marinoi
This file contains the analysis for a minimum coverage filter of 20X and a minimum allele frequency of 5%
This is an html file: open in web browser

# FILE: Pinseel_S.marinoi_pop_gen_environmental_data_Baltic.html
Overview of all code used to process and interpolate the environmental data across the Baltic Sea
Includes figures of all environmental gradients (annual/seasonal/bloom period)
This is an html file: open in web browser

# FILE: Pinseel_S.marinoi_rna-seq_reanalysis_Skmarinoi8x3.html
Overview of all code used to reanalyze the RNA-seq data of Skeletonema marinoi (Skmarinoi8x3 experiment)
Original data published here: https://doi.org/10.1038/s41396-022-01230-x
This is an html file: open in web browser

# FILE: 4. rbeta_simulated_genotypes_for_LFMM.csv
csv file with the simulated genotypes from the rbeta function for all SNPs
These simulated genotypes were used for LFMM

# FOLDER: 01.Skeletonema_marinoi_genome_v1.1.2
Files of the reference genome of Skeletonema marinoi, ref 1.1.2
1. Skeletonema_marinoi_Ref_v1.1.2.fst: fasta file of the genome
2. Sm_ManualCuration.v1.1.2.gff: GFF file of the genome
3. Smarinoi_Ref1.1.2_full-annotation.csv: functional annotation of the genome
4. Smarinoi_Ref1.1.2_GOterms.txt: GO terms of the genes in the genome

# FOLDER: 02.Skmarinoi8x3_rna-seq_reanalysis
Output files from the reanalysis of the Skmarinoi8x3 experiment
Original data published here: https://doi.org/10.1038/s41396-022-01230-x
1. Skmarinoi8x3_core_response_genes.txt: list of the core response genes
2. Skmarinoi8x3_reanalysis_ref1.1.2_gene-level_counts_FINAL.txt: read count data calculated by HTSeq
3. Skmarinoi8x3_RQ1e2_all_DE_genes.txt: DE genes (indicated with '1') for the average and genotype-specific responses
4. Skmarinoi8x3_RQ3_all_DE_genes.txt: DE genes (indicated with '1') for the interaction-effects, testing all contrasts (8-24, 8-16, 16-24)
5. Skmarinoi8x3_RQ3_logFC_all_reduced_set.csv: logFC of interaction-effect genes, including non-significant genes
   This RQ3 test only included the 8-24 contrast and might thus have a different number of DE genes than file #4 due to a smaller FDR penalty
6. Skmarinoi8x3_RQ3_logFC_significant_reduced_set.csv: logFC of interaction-effect genes, only including significant genes/contrasts
   This RQ3 test only included the 8-24 contrast and might thus have a different number of DE genes than file #4 due to a smaller FDR penalty

# FOLDER: 03.environmental_data
1. baltic_env_data.csv: environmental data (annual/seasonal/bloom periods) for each sampling locality obtained by interpolating the original environmental data in R
2. GPS_Smarinoi8x3.txt: GPS coordinates of the sampling localities for each pool
3. ICES_Low_resolution_CTD_bottle_data_Baltic_2000-2021_0-10m.csv: ICES (HELCOM) data used for interpolating environmental data across the Baltic Sea
4. Sharkweb_2000-2021_Physical-Chemical_Profile_Chlorophyll_HELCOMonly_0-10m.csv: Sharkweb data used for interpolating environmental data across the Baltic Sea
5. Skmarinoi_PCA_axes_for_GEA_seasons_inner-Baltic.csv: PCA axis scores from environmental data, used for LFMM and BayPass AUX

# FOLDER: 04.biophysics_seascape_connectivity_model
The output of the biophysics model on seascape connectivity
The included .docx file describes the content of the output files in detail
Data are arranged in two folders:
1. matrices: contains the raw output of the seascape connectivity model
2. heatmaps: heatmaps made from the matrices

# FOLDER: 05.SNP_data
1. Sync files and allele frequency files (files ending in .fz, .fz.txt, and .sync):
   - Files with 'complete' in the name only include SNPs that were covered in each pool
   - Files with 'all' in the name include all SNPs, including those not covered in one or more pools
   - Files with 'NEUTRAL' in the name include only fully covered SNPs that were not identified as an outlier SNP for their corresponding filtering strategy
   - Files with 'outer-inner' in the name include only fully covered SNPs after summarizing all data for the outer and inner Baltic
   - The names of the files refer to their corresponding filtering strategies:
     (i) min20 MAC4 = minimum coverage 20X, minimum allele count 4, minimum allele frequency 0.1%
     (ii) min20 MAC4 MAF5% = as in (i) but with MAF of 5%
     (iii) min40 MAC4 MAF5% = as in (ii) but with minimum coverage 40X
2. rbeta_simulated_genotypes_for_LFMM_COV40_MAF5%_no-missing-data_inner-baltic.RData: RData object with the simulated genotypes from the rbeta function at minimum coverage 40 and MAF 5%. This file can be opened in R
3. rbeta_simulated_genotypes_for_LFMM_COV20_MAF5%_no-missing-data_inner-baltic.RData: RData object with the simulated genotypes from the rbeta function at minimum coverage 20 and MAF 5%. This file can be opened in R
4. contig_length_info.txt: length of each contig
5. contrast_file_baltic: contrast file for the C2-model analysis in BayPass
6. Skmarinoi_PoolSeq_depth-by-contigx2_PoolSNP.txt: average coverage of each contig for each pool.
   Used for SNP calling with PoolSNP

# FOLDER: 06.baypass_input
Input files for the BayPass runs, separated in four folders:
1. genobayfiles_full_dataset_MIN20_MAF5%: full dataset at minimum coverage 20X and MAF 5%
2. genobayfiles_full_dataset_MIN40_MAF5%: full dataset at minimum coverage 40X and MAF 5%
3. genobayfiles_inner_MIN20_MAF5%: inner Baltic only at minimum coverage 20X and MAF 5%
4. genobayfiles_inner_MIN40_MAF5%: inner Baltic only at minimum coverage 40X and MAF 5%

# FOLDER: 07.output
Collection of output files from several analyses:
1. BayPass_inner_GEA_MIN20_MAF5%_output_files: output of the BayPass AUX model (GEA test, inner Baltic) at minimum coverage 20X
2. BayPass_inner_GEA_MIN40_MAF5%_output_files: output of the BayPass AUX model (GEA test, inner Baltic) at minimum coverage 40X
3. BayPass_outer-inner_output_files: output of the BayPass C2-model (outer/inner Baltic test)
4. PoPoolation1_output: output of PoPoolation1
5. FST_outliers_MIN20_MAF5%_top10%.txt: top 10% FST outliers at minimum coverage 20X
6. FST_outliers_MIN40_MAF5%_top10%.txt: top 10% FST outliers at minimum coverage 40X
7. Skmarinoi_Poolfstat_pairwiseFst_MIN40_MAF5%.txt: pairwise Fst values between pools, calculated using 'neutral' SNPs, at minimum coverage 40X
8. Skmarinoi_PoolSeq_A-P_min20_MAC4_MAF5%_output-filtered_complete_outer-inner_sed.fet: output of Fisher Exact Test at minimum coverage 20X
9. Skmarinoi_PoolSeq_A-P_min40_MAC4_MAF5%_output-filtered_complete_outer-inner_sed.fet: output of Fisher Exact Test at minimum coverage 40X

# FOLDER: 08.outliers
Lists with all the outlier SNPs, separated by category
Categories are clear from file names.
Details on file names:
outer-vs-inner = outer vs inner (refers to the Fst/BayPass outlier tests for the North Sea versus the Baltic Sea)
union = all outliers found in the union of two or more tests (e.g., GEA.PC1.union: all outliers from LFMM on PC1 AND/OR BayPass on PC1)
intersect = all outliers found in the intersection of two or more tests (e.g., GEA.PC1.intersect: all outliers found by BOTH LFMM on PC1 AND BayPass on PC1)
ALL = all outlier SNPs
The four subfolders in this folder contain files with the SNPs separated by type (e.g., exon, UTR, ...).
These folders are also structured to include 20X and 40X minimum coverage data.

# FOLDER: 09.outlier_gene_lists
Lists of genes with outlier SNPs, separated in different categories
Categories are clear from file names. Details on file names:
OVI = outer vs inner (refers to the Fst/BayPass outlier tests for the North Sea versus the Baltic Sea)
union = all outliers found in the union of two or more tests (e.g., GEA.PC1.union: all outliers from LFMM on PC1 AND/OR BayPass on PC1)
intersect = all outliers found in the intersection of two or more tests (e.g., GEA.PC1.intersect: all outliers found by BOTH LFMM on PC1 AND BayPass on PC1)
Folder also contains:
Outliers_Skmarinoi_ALL.gene-variants.final.restructured_min20_MAF5%.csv and Outliers_Skmarinoi_ALL.gene-variants.final.restructured_MIN40_MAF5%.csv: these files contain the annotation (missense/synonymous/etc.) for all SNPs that are located in genes, at minimum coverage 20X and 40X respectively.

# FOLDER: 10.outlier_genes_annotated
SNP and functional gene annotations of all outlier SNPs and genes
Full details on how these files are structured and obtained can be found in the data analysis html in the main folder of the Zenodo repository

# FOLDER: 11.GO_enrichment
GO enrichment results on outlier genes. For each filtering strategy, data are structured in six folders:
1. inner_Baltic_GEA: GO enrichment on outlier genes from GEA analysis
2. missense_SNPs: GO enrichment on outlier genes with missense SNPs (including splice region variants and start/stop codon variants)
3. multiple_SNPs: GO enrichment on outlier genes with multiple outlier SNPs
4. outer_vs_inner_Baltic: GO enrichment on outlier genes from the outer/inner Baltic Fst/BayPass test (= North Sea versus Baltic Sea)
5. RNA-seq_overlap_outliers: outliers that overlap with DE genes from the Skmarinoi8x3 experiment (Pinseel et al. 2022 ISME J.)
6. UTR_SNPs: GO enrichment on outlier genes that have outlier SNPs in UTR regions

# FOLDER: 12.scripts
Contains two scripts which were used to analyze SNP data
Also available on GitHub: https://github.com/evelinepinseel/Skeletonema_marinoi_population_genomics_Baltic

# FOLDER: 13.snpEff_annotations
Results of snpEff SNP annotation for each pool, as well as all pools together

# FOLDER: 14.machine_learning
This folder contains all the scripts and details used to run the machine learning script
The machine learning script was used to estimate the number of cells per chain from image data
This ensured that an approximately equal number of cells were pooled for each locality
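
# NOTE: illustration of the union/intersect naming used in folders 08 and 09
The 'union' and 'intersect' categories in the outlier file names correspond to ordinary set operations on the outlier lists of two tests.
The Python sketch below is only an illustration and is not taken from the analysis scripts in this repository.
The SNP identifiers and the helper read_ids are hypothetical, and the sketch assumes one identifier per line.

```python
def read_ids(lines):
    """Collect non-empty, stripped identifiers from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


# Hypothetical SNP identifiers standing in for two outlier lists,
# e.g. one from LFMM on PC1 and one from BayPass on PC1.
lfmm_pc1 = read_ids(["contig1_100", "contig1_250", "contig7_42"])
baypass_pc1 = read_ids(["contig1_250", "contig3_9", "contig7_42"])

# union: outliers found by LFMM AND/OR BayPass
union_pc1 = lfmm_pc1 | baypass_pc1

# intersect: outliers found by BOTH LFMM and BayPass
intersect_pc1 = lfmm_pc1 & baypass_pc1

print(sorted(union_pc1))
print(sorted(intersect_pc1))
```

To apply the same idea to real lists, read_ids can be given an open file handle instead of a Python list, provided the file holds one identifier per line.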