Published September 18, 2023 | Version v1.1
Dataset Open

Trove of Gut Virus Genomes (TGVG)

  • 1. Baylor College of Medicine

Description

 

TGVG_v1.1.genomes.all.fna

Sequences from the Gut Virome Database, the Cenote Human Virome Database, the Metagenomic Gut Virus catalog, and the Gut Phage Database were downloaded and dereplicated at 95% average nucleotide identity (ANI) across 85% alignment fraction (AF) using anicalc.py and aniclust.py from the CheckV (version 0.9.0) package, in line with metagenomic virus sequence community standards. Exemplar sequences from each cluster or singleton were kept and run through Cenote-Taker 2 (version 2.1.5) to predict virus hallmark genes within each sequence using the ‘virion’ hallmark gene database. Sequences were kept if they 1) encoded direct terminal repeats (a signature of a complete virus genome) and one or more virus hallmark genes and were at least 1.5 kilobases long, or 2) encoded 2 or more virus hallmark genes and were over 12 kilobases long. Sequences passing this threshold were run through CheckV to remove flanking host (bacterial) sequences and to quantify the virus gene/bacterial gene ratio for each contig. Sequences with 3 or fewer virus genes and 3 or more bacterial genes after pruning were discarded. Finally, the remaining sequences were dereplicated again with the CheckV scripts at 95% ANI and 85% AF to yield the Trove of Gut Virus Genomes: 110,296 genomes/genome fragments, each representing a viral species-level genome bin (SGB).
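The retention rules described above can be sketched as a small Python predicate. This is only an illustration, not the authors' pipeline code: the `Contig` record and its field names are hypothetical, and the per-sequence counts are assumed to have already been parsed from Cenote-Taker 2 and CheckV output. Treating the 1.5 kilobase bound as inclusive and the 12 kilobase bound as exclusive is also an assumption based on the wording of the description.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    """Hypothetical per-sequence summary; field names are illustrative only."""
    length: int           # sequence length in bases
    has_dtr: bool         # direct terminal repeats detected
    hallmark_genes: int   # virus hallmark genes predicted for the sequence
    virus_genes: int      # virus genes counted after host pruning
    bacterial_genes: int  # bacterial genes counted after host pruning


def passes_hallmark_filter(c: Contig) -> bool:
    # Rule 1: terminal repeats, at least one hallmark gene, at least 1.5 kb
    # (inclusive bound is an assumption).
    rule1 = c.has_dtr and c.hallmark_genes >= 1 and c.length >= 1_500
    # Rule 2: two or more hallmark genes and longer than 12 kb.
    rule2 = c.hallmark_genes >= 2 and c.length > 12_000
    return rule1 or rule2


def passes_gene_ratio_filter(c: Contig) -> bool:
    # Discard only when both conditions hold: 3 or fewer virus genes
    # AND 3 or more bacterial genes after pruning.
    return not (c.virus_genes <= 3 and c.bacterial_genes >= 3)


def keep(c: Contig) -> bool:
    # A sequence is retained only if it passes both filtering stages.
    return passes_hallmark_filter(c) and passes_gene_ratio_filter(c)
```

For example, `keep(Contig(length=2_000, has_dtr=True, hallmark_genes=1, virus_genes=5, bacterial_genes=0))` evaluates to `True`, while a 20 kilobase contig with a single hallmark gene and no terminal repeats is rejected by the first stage.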

 

TGVG_v1.1_metadata.tsv

For each sequence in the Trove of Gut Virus Genomes, CheckV was used to estimate completeness and iPHoP (version 1.1.0) was used to predict the bacterial/archaeal host genus. BACPHLIP (version 0.9.3) was run on each sequence predicted to be 90% or more complete to predict phage virulence.
vConTACT2 (version 0.11.3) was used to cluster viral SGBs from the Trove of Gut Virus Genomes into virus clusters. In addition to viral SGBs with the vConTACT2 label “Singleton”, viral SGBs with the vConTACT2 labels “Unassigned”, “Outlier”, “Overlap”, and “Clustered/Singleton” were also considered “Singletons” for downstream analysis. The geNomad (version 1.5.2) taxonomy module was run on each sequence to obtain taxonomic assignments at the phylum, class, order, and family levels.
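As a hedged sketch of how the two conventions above might be applied to metadata values in Python: the function names are hypothetical, and the column layout of TGVG_v1.1_metadata.tsv is not assumed here. Stripping a parenthesised suffix from labels such as "Overlap (...)" is an assumption about how vConTACT2 may report overlapping clusters.

```python
# vConTACT2 labels that the description treats as "Singletons" downstream.
SINGLETON_LIKE = {
    "Singleton",
    "Unassigned",
    "Outlier",
    "Overlap",
    "Clustered/Singleton",
}


def is_singleton(vcontact2_label: str) -> bool:
    """Return True if a vConTACT2 label counts as a Singleton downstream."""
    # Assumption: labels may carry a parenthesised suffix, which is dropped.
    base = vcontact2_label.split(" (", 1)[0].strip()
    return base in SINGLETON_LIKE


def eligible_for_virulence_prediction(completeness_percent: float) -> bool:
    """Virulence was predicted only for sequences 90% or more complete."""
    return completeness_percent >= 90.0
```

With these helpers, `is_singleton("Outlier")` and `is_singleton("Clustered/Singleton")` both return `True`, whereas `is_singleton("Clustered")` returns `False`.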
 

Files

Files (1.3 GB)

md5:8b6a05d7892f5951a213aef8a3a02870
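The md5 checksum listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; the local file path passed to `verify_download` is up to the user and is not specified by this record.

```python
import hashlib

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "8b6a05d7892f5951a213aef8a3a02870"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads avoid loading a 1.3 GB file into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's md5 matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download` with the path of the downloaded file returns `False` if the file is truncated or corrupted.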