Published August 29, 2024 | Version v1
Dataset Open

Machine Learning based identification of putative coral pathogens in endangered Caribbean staghorn coral

Description

Supplementary Files

SupplementaryFile1.csv.gz – Metadata for field collected samples with columns:

  • “sample_id” – individual sample names.
  • “health” – “H” healthy and “D” diseased fragments.
  • “year” – year fragment collected.
  •  “season” – season fragment collected (“S” July and “W” January)
  • “site” – location fragment collected from
  • “lib.size” – total number of sequenced reads
  • “norm.factors” – factor used to normalize read counts of ASVs

SupplementaryFile2.csv.gz – Metadata for tank collected samples with columns:

  •  “sample_id” – individual sample names.
  • “geno” – fragment genotype
  •  “fragment_id” – fragment identification tracked through repeated sampling
  •  “tank_id” – tank identification
  • “time_treat” – concatenated metric for sampling time, exposure, and disease outcome separated by “_”
    •  Time – 0, 2, 8
    • Exposure – “D” Diseased, “N” Healthy
    • Disease Outcome - “D” Diseased, “H” Healthy
  •  “lib.size” – total number of sequenced reads
  • “norm.factors” – factor used to normalize read counts of ASVs

SupplementaryFile3.fasta – FASTA file including complete 16s sequences named with ASV identifier and taxonomy.

SupplementaryFile4.csv.gz – Matrix of the number of reads of each ASV sequenced in each sample. Combined both field and tank samples.

SupplementaryFile5.csv.gz – Matrix of the log2 CPM of each ASV sequenced in each sample. Combined both field and tank samples.

SupplementaryFile6.csv.gz – Complete results for each ASV association.

  •  “top_classification” – lowest taxonomic classification with more than 80% confidence.
  • “taxonomy” – Full taxonomy including confidence in each taxonomic level.
  • “passedFilter” – indicates taxa filtered from analysis due to rarity and/or lack of observations across sample times.
  •  “rank_*” – machine learning model rankings, median ranking, and model estimated ranking along with standard error, confidence interval, and FDR adjusted p-value used to identify important ASVs.
  • “ml_retained” – Indicates if the ASV was of above average importance to ML models. NA values indicate ASVs which were filtered prior to ML modelling.
  • “fieldModel_*” – ANOVA table results for each ASV testing the effects of health, year, season and all possible interactions indicating:
    • Sums of squares, mean squares, numerator and denominator degrees of freedom, F statistic, p-value, and FDR corrected p-value.
    • NA values are filled for ASVs filtered prior to differential abundance analysis.
  • “diffAbundance_healthAssociation” – Marks the health association of ASVs from differential abundance analysis of field samples: “H” health, “D” diseased, “N” none, NA – filtered prior to differential abundance analysis.
  • “fieldLogFC_*” – Post-hoc contrasts for ML retained ASVs testing the significance of the log2 fold-change between disease and healthy fragments within each sampling time (year: 2016, 2017 & season: “S” July, “W” January) showing:
    • Mean estimate, standard error, degrees of freedom, lower and upper 95% confidence interval, t-statistic, p-value, FDR adjusted p-value.
    • NA values are filled for ASVs which were not marked as important by ML models.
  •  “field_consistent” – Indicates if the ASV was consistently healthy or disease associated across sampling times. NA values are filled for ASVs which were not marked as important by ML models.
  • “tankModel_*” – ANOVA table results for each ASV testing the effect of the combination of time, disease exposure, and disease outcome, indicating:
    • Sums of squares, mean squares, numerator and denominator degrees of freedom, F statistic, p-value, and FDR corrected p-value.
    • NA values are filled for ASVs filtered prior to tank experimental analysis.
  •  “tankLogFC_*” – Post-hoc contrasts for ASVs tested in tank exposure experiments.
    • Contrasts include:
      • Post-exposure diseased vs healthy outcome regardless of exposure (DvH)
      • Post-exposure diseased vs healthy exposure regardless of outcome (DvN)
      • Post-exposure disease exposed corals with disease symptoms compared to disease exposed but still healthy corals (DDvDH)
      • Post-exposure disease exposed corals with disease symptoms compared to healthy exposed and still healthy corals (DDvNH)
      • Post-exposure disease exposed corals which stay healthy compared to healthy exposed and still healthy corals (DHvNH)
      • Pre-exposure compared to Post-exposure in corals with the disease regardless of exposure (PostvPreD)
      • Pre-exposure compared to Post-exposure in corals without the disease regardless of exposure (PostvPreH)
    • Mean estimate, standard error, degrees of freedom, lower and upper 95% confidence interval, t-statistic, p-value, FDR adjusted p-value.
    • NA values are filled for ASVs which were not consistently associated with healthy or diseased corals in the field experiment.
  • “pathogen_classification” – Indicates the predicted microbial classification based on the tank results. Pathogen, Opportunist, Commensal
    • NA values are filled for ASVs which were not consistently associated with healthy or diseased corals in the field experiment.

Files

Files (6.9 MB)

Name Size Download all
md5:31673ec3cb952aa5aa50dae0d94e3f4a
7.2 kB Download
md5:63887219011b3ce3a9b4bef426181afe
1.7 kB Download
md5:d8ffa7b8c9a9222e22a532ca1060e498
6.1 MB Download
md5:f12d72fd1dc6776b1210c99765f939de
58.7 kB Download
md5:2edd82d2d9d9a3ee1353b7b7a2698716
204.6 kB Download
md5:16f346cabd6c6cbab1dbc1d3b89a25cd
497.6 kB Download

Additional details

Funding

U.S. National Science Foundation
Coral-microbial interactions as determinants of disease dynamics 1458158
U.S. National Science Foundation
Multi-omic bases of coral disease resistance 1924145

Software

Repository URL
https://github.com/VollmerLab/WBD_ML_pathogen
Programming language
R