Data from: Viral tagging reveals discrete populations in Synechococcus viral genome sequence space
Deng, Li;
Ignacio-Espinoza, J. Cesar;
Gregory, Ann C.;
Poulos, Bonnie T.;
Weitz, Joshua S.;
Hugenholtz, Philip;
Sullivan, Matthew B.
Microbes and their viruses drive myriad processes across ecosystems ranging from oceans and soils to bioreactors and humans. Despite this importance, microbial diversity is only now being mapped at scales relevant to nature, while the viral diversity associated with any particular host remains little researched. Here we quantify host-associated viral diversity using viral-tagged metagenomics, which links viruses to specific host cells for high-throughput screening and sequencing. In a single experiment, we screened 10^7 Pacific Ocean viruses against a single strain of Synechococcus and found that naturally occurring cyanophage genome sequence space is statistically clustered into discrete populations. These population-based, host-linked viral ecological data suggest that, for this single host and seawater sample alone, there are at least 26 double-stranded DNA viral populations with estimated relative abundances ranging from 0.06 to 18.2%. These populations include previously cultivated cyanophage and new viral types missed by decades of isolate-based studies. Nucleotide identities of homologous genes mostly varied by less than 1% within populations, even in hypervariable genome regions, and by 42–71% between populations, which provides benchmarks for viral metagenomics and genome-based viral species definitions. Together these findings showcase a new approach to viral ecology that quantitatively links objectively defined environmental viral populations, and their genomes, to their hosts.
RandomizationsX1500
To estimate the variability within a population from the available metagenomic data, random candidatus genomes (CGs) were generated using a series of custom Perl scripts, as follows. First, we recruited reads to each CG, requiring at least 95% identity and a coverage of 95% of the entire length of the read. Each read was assigned non-redundantly to a CG and aligned to it using MUSCLE with default parameters. For each CG population, we generated 100 random CG sequences from the metagenomic reads recruited to the consensus sequences, with each base assigned with a probability given by its relative abundance in the underlying metagenomic sequence data. Here we show the result of 1500 randomizations.
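The sampling step described above can be sketched in Python. This is only an illustration, not the authors' Perl scripts: it assumes the recruited reads have already been aligned and padded to a common width, that relative abundance is computed per alignment column, and that columns with no coverage are written as N. The function names and the toy reads are hypothetical.

```python
import random
from collections import Counter

def column_base_counts(aligned_reads):
    """Count A/C/G/T per alignment column; gaps and other symbols are ignored."""
    width = len(aligned_reads[0])
    return [
        Counter(read[i] for read in aligned_reads if read[i] in "ACGT")
        for i in range(width)
    ]

def sample_random_genome(columns, rng):
    """Draw one base per column with probability equal to its observed frequency."""
    bases = []
    for counts in columns:
        if not counts:
            bases.append("N")  # assumption: uncovered columns become N
            continue
        letters = list(counts)
        weights = [counts[b] for b in letters]
        bases.append(rng.choices(letters, weights=weights, k=1)[0])
    return "".join(bases)

# Toy alignment (hypothetical data): three reads padded to the same width.
reads = ["ACGT", "ACGA", "AC-T"]
columns = column_base_counts(reads)
rng = random.Random(42)  # fixed seed so runs are repeatable
random_genomes = [sample_random_genome(columns, rng) for _ in range(100)]
```

In this toy case the first three columns are invariant, so every sampled sequence starts with ACG, while the last column is drawn as T or A in proportion to the two T reads and one A read.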
ANI_2_PCA
Matrix of ANI values obtained from a comparison of each candidatus genome with the reference genome. This file is used as the input for the PCA shown as a figure in the manuscript.
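As a rough sketch of how a PCA could be run on such a matrix, the following uses NumPy's singular value decomposition. The layout (one row per genome, one column per comparison), the toy ANI values, and the function name are assumptions for illustration; the software used for the published figure is not stated here.

```python
import numpy as np

def pca(matrix, n_components=2):
    """Project rows onto the leading principal components via SVD."""
    X = np.asarray(matrix, dtype=float)
    centered = X - X.mean(axis=0)  # center each column
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per component
    return scores, explained[:n_components]

# Toy ANI matrix (hypothetical values): four genomes, three comparisons.
ani = [
    [99.0, 60.0, 55.0],
    [98.5, 61.0, 54.0],
    [58.0, 99.2, 57.0],
    [59.0, 98.8, 56.0],
]
scores, explained = pca(ani)
```

Plotting the two columns of scores against each other gives the usual PCA scatter, with genomes that share similar ANI profiles falling close together.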
Viral Tagged Metagenome 454
This is identical to VT_MG.fna as it appears under CAM_P_0001068 in CAMERA.
VT_MG.fna
Community Metagenome
Identical to Comm_MG.fna under CAM_P_0001068.
Comm_MG.fna
GP23_Sequences
gp23 sequences amplified from the isolates; these data are incorporated into Table 1.
DATA-FIGURES
Tabulated data for all the figures in the manuscript.
Rarefaction files
The zip archive includes the script and tables used to generate the rarefaction curves and richness index. The tables have the columns Read, Protein, and Protein Cluster.
RAREFACTION.zip
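A rarefaction curve over tables of this shape can be sketched as below. This is not the script shipped in the archive: it assumes each row is a (read, protein, protein cluster) triple already loaded in memory, subsamples reads without replacement, and reports the mean number of distinct protein clusters at each depth. The function name, the replicate count, and the toy rows are hypothetical.

```python
import random

def rarefaction_curve(rows, depths, replicates=20, seed=0):
    """Mean number of distinct protein clusters seen at each subsampling depth."""
    clusters = [cluster for _read, _protein, cluster in rows]
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        richness = [
            len(set(rng.sample(clusters, depth)))  # sample without replacement
            for _ in range(replicates)
        ]
        curve.append((depth, sum(richness) / replicates))
    return curve

# Toy table (hypothetical data): ten reads falling into three protein clusters.
rows = (
    [(f"read{i}", "protA", "PC1") for i in range(5)]
    + [(f"read{i}", "protB", "PC2") for i in range(5, 8)]
    + [(f"read{i}", "protC", "PC3") for i in range(8, 10)]
)
curve = rarefaction_curve(rows, depths=[1, 5, 10])
```

A curve that flattens as depth increases suggests that most protein clusters in the sample have been observed; each requested depth must not exceed the number of rows.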
ConsensusCGs
Assembly and gene predictions (CDS and amino acid sequences) for the 26 candidatus genomes referred to in the manuscript.
VT_MG_IL
FASTQ sequencing data of the simplified metagenome after a viral tagging experiment.