Scripts & Data of "Spatiotemporal patterns of genetic diversity in the world's coral reefs"

Selmoni, Oliver; Schuman, Meredith C.

doi:10.5281/zenodo.19205727

Published March 24, 2026 | Version v1

Resource type: Dataset (open access)

Affiliation: University of Zurich

Contact: oliver.selmoni@gmail.com

This repository includes scripts and data to reproduce the analysis of the "Spatiotemporal patterns of genetic diversity in the world’s coral reefs" study.

Data

This folder includes the data used in the analysis.

GEO_ENVIRONMENTAL: This is where the raw environmental data should be included. Specifically:

A folder named "RecifsDB": includes data from RECIFS, publicly available at https://github.com/Oselmoni/RECIFS/tree/master/public/DB
A folder named "MPAdata": includes a shapefile of the world's Marine Protected Areas, publicly available at https://www.protectedplanet.net/en/thematic-areas/wdpa?tab=WDPA
A table named "Reefs_Human_Pressure.csv": summarizes human impacts on the world's reefs, publicly available at https://doi.org/10.1111/conl.12858
A GeoJSON file named "atlantic_indian_oceans.geojson": describes the separation between oceans.

KMERS: This is where raw k-mer counts are stored. The main folder includes one subfolder per dataset used in the analysis, and every dataset includes multiple subfolders corresponding to different k-mer filtering strategies. The standard strategy used for downstream analysis in all datasets is k31 (k-mers of length 31), noDDP (reads not de-duplicated), S_all (all reads retained). "A_kmers_table.rda" is an R data object that includes the k-mer counts for the dataset; "kmers_unclassified.rda" is an R data object that includes information on the mapping of k-mers against the reference genome. The remaining geojson files provide summary statistics on k-mers across all samples in a dataset.
QC: This folder includes information on sequencing read quality. There is a sub-folder for each dataset, providing the multiqc_report in HTML format for all samples after quality filtering and trimming. There is also a multiqc_report for raw sequencing reads (before quality check).
R_ANALYSIS: This folder includes all inputs & outputs of data analysis in R. It is divided into folders for each processing step:

1_PROCESS_KMC: For every dataset, this folder provides R data objects for the normalized k-mer matrix ("KMN.rda"), the Bray-Curtis distance matrix ("BCD.rda"), and a table summarizing filtering statistics ("TABF.rda").
2_GEO_CLUSTERS: Includes a table describing the geographic distribution of all samples, encoded as the R data object "samples_GEO.rda".
3_ENV_DATA: Includes tables describing the environmental variation across samples and sites, both encoded as R data objects.
4_GDIS_REEFS: Includes the intermediary outputs of the generalized linear mixed models (GLMMs) characterizing genetic distances across all datasets, and the intermediary outputs of the analysis of local effects using penalized regression models. Key outputs are the "wb_GDIS.rda" R data object, a table storing all the genetic distances across all datasets, and the "MODSbw/dataset+sampling+geo+time+latitude.rda" R data object, which stores the best model describing genetic distances.
5_BCD_v_PI: Includes the intermediary output of the analysis linking nucleotide diversity with k-mer diversity, stored in a table encoded as an R data object ("sites_PI_BCD.rda").
6_GLOBAL_PREDICTIONS: Includes intermediary data used to characterize environmental variation across reefs worldwide, stored as tables encoded in the R data object format.
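
For orientation, the relationship between a normalized k-mer matrix and a Bray-Curtis distance matrix (the kinds of objects stored in "KMN.rda" and "BCD.rda") can be sketched in base R on toy counts. This is only an illustration, not the authors' pipeline: the normalization shown here (relative frequencies per sample), the samples-in-rows layout, and the function name bray_curtis are all assumptions.

```r
# Toy k-mer count matrix: rows = samples, columns = k-mers (hypothetical layout).
counts <- matrix(
  c(10, 0, 5,
     8, 2, 5,
     0, 9, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("kmerA", "kmerB", "kmerC"))
)

# Assumed normalization: convert counts to relative frequencies per sample.
kmn <- counts / rowSums(counts)

# Bray-Curtis dissimilarity between all pairs of rows:
# BC(i, j) = sum(|x_i - x_j|) / sum(x_i + x_j)
bray_curtis <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
    }
  }
  as.dist(d)
}

bcd <- bray_curtis(kmn)
print(round(bcd, 3))
```

Values close to 0 indicate samples with similar k-mer profiles, and values close to 1 indicate samples sharing few k-mers.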

samples_metadata: This folder includes metadata about samples from each dataset. For every dataset, there are tables listing the sampling site coordinates as provided in the original publications, SRA run tables with the metadata associated with the sequencing reads, and tidy tables in which metadata were processed and standardized across datasets. There is also a set of .txt files (one per dataset) indicating the correspondence between biosample ID and SRA archive ID. The readClipping.txt table indicates the clipping values used to trim the sequencing reads of each dataset. The region_table.csv and taxa_table.csv files provide information about the sampling location and taxa of every dataset, respectively.
SNPS: For every dataset, this folder includes the genotypes called as single nucleotide polymorphisms (SNPs), stored as a genotype matrix in an R data object (GT.rda). The bamlist.txt file indicates the samples used to build the genotype matrix.
SCRATCH: This folder is not provided due to size limitations. It includes the raw sequencing data (all publicly available, see links in the manuscript). The content of this folder is indicated in the "SCRATCH_structure.txt" file, and can be reconstructed from raw data using the "commands_for_pipeline.txt" script, in the Scripts/PREPROCESSING folder.
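
Many of the files described in this section are R data objects (".rda"), which restore objects under the names they were saved with. A cautious way to inspect such a file is to load it into a fresh environment. The sketch below is self-contained and writes a toy file to a temporary directory first; the helper name load_rda and the toy table are hypothetical, and the object names stored inside the repository's files are not known from this description.

```r
# Hypothetical helper: load an .rda file into its own environment so that
# restored objects do not overwrite anything in the global workspace.
load_rda <- function(path) {
  env <- new.env()
  load(path, envir = env)
  as.list(env)
}

# Toy stand-in for a file such as "samples_GEO.rda" (contents are invented).
toy_path <- file.path(tempdir(), "toy_samples_GEO.rda")
toy_samples <- data.frame(
  sample = c("s1", "s2"),
  lon = c(165.3, 166.1),
  lat = c(-21.5, -22.3)
)
save(toy_samples, file = toy_path)

objs <- load_rda(toy_path)
print(names(objs))          # names of the objects stored in the file
str(objs[["toy_samples"]])  # structure of one restored object
```

With the real archive, toy_path would be replaced by the path to the file of interest, for example one of the ".rda" files in the R_ANALYSIS sub-folders.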

FiguresTables

This folder includes all figures and tables produced by the R scripts at the different processing steps.

Scripts

PREPROCESSING: These are the scripts to pre-process raw sequencing reads into k-mers. The processing is performed in 7 steps, controlled by the master scripts included in the R_generate_masters folder. These scripts must be called in order, following the example in the "commands_for_pipeline.txt" file. The master scripts then submit jobs to the Slurm Workload Manager; the instructions for these jobs are in the "Slurm" folder. The software and tools required by the pipeline are listed in the "Software_and_tools.txt" document. The "R" folder includes additional scripts called by the master scripts during the pipeline.
R_ANALYSIS: This folder includes all the scripts to run the R data analysis. They are organized in multiple sub-folders, ordered as follows:

1_PROCESS_KMC: These are the scripts to analyze the genetic structure of datasets. Every dataset has a dedicated script ("datasetID.R"), then there are five general scripts (starting with "AAA...") that will be called for every dataset.
2_GEO_CLUSTERS: Includes the script to merge geographic positions across all datasets.
3_ENV_DATA: Includes the scripts to process environmental data and extract environmental information for every sample and every sampling site.
4_GDIS_REEFS: The "GLMM_genetic_distances.R" script merges genetic distances from all datasets and then runs the GLMM to explain their variation. The "PRM..." scripts analyze the spatial effects of the GLMM and link them to environmental variables using a penalized regression model (the "logistic" version is the one thoroughly discussed in the paper; the "continuous" version is an alternative where effects are not factorized).
5_BCD_v_PI: The script in this folder runs the comparison between nucleotide diversity and k-mer-based diversity.
6_GLOBAL_PREDICTIONS: This folder includes a script to extract environmental data at a global scale for the variables of interest ("calculate_env_ww.R"), and the script to perform the global prediction of genetic diversity patterns ("predict_local_effects.R").
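
The comparison run in 5_BCD_v_PI relies on nucleotide diversity estimated from SNP genotypes. For readers unfamiliar with the quantity, a common per-SNP estimator can be sketched in base R on a toy genotype matrix. This is an illustration under stated assumptions, not the authors' script: the genotype encoding (alternate allele counts 0, 1, 2 for diploid samples), the samples-in-rows layout, and the function name site_pi are hypothetical.

```r
# Toy genotype matrix: rows = samples, columns = SNPs.
# Assumed encoding: number of alternate alleles per diploid genotype (0, 1, 2).
gt <- matrix(
  c(0, 1, 2,
    0, 1, 2,
    0, 0, 2,
    0, 2, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), paste0("snp", 1:3))
)

# Per-SNP nucleotide diversity with the usual finite-sample correction:
# pi = (2n / (2n - 1)) * 2 * p * (1 - p), where p is the alternate allele
# frequency and n the number of genotyped diploid samples (NA = missing).
site_pi <- function(g) {
  n <- colSums(!is.na(g))
  p <- colSums(g, na.rm = TRUE) / (2 * n)
  (2 * n / (2 * n - 1)) * 2 * p * (1 - p)
}

pi_per_snp <- site_pi(gt)
print(pi_per_snp)
print(mean(pi_per_snp))  # average over the SNPs in the matrix
```

Monomorphic SNPs (all genotypes 0 or all 2) contribute zero, so only variable sites raise the average.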

Files

Name	Size	Checksum
Archive.zip	3.8 GB	md5:4da81f7b72459bdb5f538e3549161228
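
The MD5 checksum listed above can be used to confirm that the archive downloaded intact. A minimal sketch in R follows; the helper name verify_md5 is hypothetical, and the location of the downloaded "Archive.zip" in the commented call is an assumption. The demonstration itself runs on an empty temporary file so that it is self-contained.

```r
# Hypothetical helper: compare a file's MD5 checksum with an expected value.
verify_md5 <- function(path, expected) {
  observed <- unname(tools::md5sum(path))
  identical(tolower(observed), tolower(expected))
}

# Self-contained demonstration on an empty temporary file, whose MD5 is
# the well-known value d41d8cd98f00b204e9800998ecf8427e.
tmp <- tempfile()
file.create(tmp)
print(verify_md5(tmp, "d41d8cd98f00b204e9800998ecf8427e"))

# For the downloaded archive (the path is an assumption about where it was saved):
# verify_md5("Archive.zip", "4da81f7b72459bdb5f538e3549161228")
```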

Additional details

Programming language: R

