"Genome binning of viral entities from bulk metagenomics data" - CAMISIM simulated datasets and genomes
Description
Genome binning of viral entities from bulk metagenomics data
Authors
Joachim Johansen1,2, Damian R. Plichta2, Jakob Nybo Nissen1,3, Marie Louise Jespersen1,4, Shiraz A. Shah5, Ling Deng6, Jakob Stokholm5,6, Hans Bisgaard5, Dennis Sandris Nielsen6, Søren Sørensen7, Simon Rasmussen1
Affiliations
1 Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
2 Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
3 Statens Serum Institut, Viral & Microbial Special diagnostics, Copenhagen, Denmark
4 National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
5 Copenhagen Prospective Studies on Asthma in Childhood (COPSAC), Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
6 Section of Food Microbiology and Fermentation, Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark
7 Section of Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark
Methods description
We compared the viral binning performance of VAMB and MetaBAT2, using the official CAMI consortium method to create assemblies and metagenome profiles. To this end, we generated three metagenome compositions with up to 308 reference genomes: (1) a mixture of bacteria, plasmids and viruses, to test binning in complex, high-diversity samples; (2) crAss-like viruses only, to test binning of highly similar viruses (high relatedness); and (3) small viruses (<6,000 bp), including members of the Microviridae family, to address size bias.
Dataset A contained bacteria (N=8), plasmids (N=20) and viruses (N=280), 308 genomes in total. Dataset B contained only crAss-like viruses (N=80). Dataset C contained small viruses (N=50, <6,000 bp). Bacterial genomes were sampled from the NCBI RefSeq genome repository (2021), plasmids from the PLSDB database (v. 2021_06_23) and viral genomes from the MGV database (Nayfach et al., Nature Microbiology 2021).
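The genome counts above can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary layout and the names `DATASETS` and `genome_count` are assumptions made for this example and are not taken from the CAMISIM configuration or the deposited files.

```python
# Genome counts per dataset, as stated in the methods description.
# The structure and key names are illustrative assumptions only.
DATASETS = {
    "A": {"bacteria": 8, "plasmids": 20, "viruses": 280},
    "B": {"crAss-like viruses": 80},
    "C": {"small viruses (<6,000 bp)": 50},
}


def genome_count(dataset: str) -> int:
    """Return the number of reference genomes in one dataset."""
    return sum(DATASETS[dataset].values())


# Total per dataset, and the size of the largest composition.
totals = {name: genome_count(name) for name in DATASETS}
largest = max(totals.values())

print(totals)   # {'A': 308, 'B': 80, 'C': 50}
print(largest)  # 308
```

The largest composition (dataset A) accounts for the "up to 308 reference genomes" quoted in the description.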
Files
supplementary_data_5_Gold_standard_virus.zip
Total size: 3.3 GB

| MD5 checksum | Size |
|---|---|
| 1464b081d84313067d33ffbae75b499d | 106.6 MB |
| 1fecf997f86bd246cc5cfac5b33085fa | 3.2 GB |
| de2bc7a633d945481b06e9ccca678016 | 17.1 MB |
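After downloading, a file can be checked against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `matches_checksum` are hypothetical helpers introduced here, and the record as shown does not state which checksum belongs to which file name, so that pairing has to be taken from the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum("downloaded_file.zip", "1464b081d84313067d33ffbae75b499d")` returns `True` only if the local file (the path here is a placeholder) has that digest. A mismatch usually indicates an incomplete or corrupted download.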