Published December 27, 2022 | Version v2.1
Dataset Open

VEBA Microeukaryotic Protein Database (VDB-Microeukaryotic_v2.1)

Authors/Creators

  • 1. J. Craig Venter Institute

Description

Version:

VDB_Microeukaryotic_v2.1

 

Description:

A protein database is required for eukaryotic gene calling with MetaEuk, and the resulting gene calls can also be leveraged for MAG annotation. Many eukaryotic protein databases exist, such as MMETSP, EukZoo, and EukProt, yet these are limited to marine environments, include prokaryotic sequences, or include sequences from organisms that would not be expected to be binned out of metagenomes, such as metazoans. While it may be possible to bin fragments of higher eukaryotic genomes, this is often not the objective of metagenomic studies where microorganisms are the focus. We combined and dereplicated MMETSP, EukZoo, EukProt, and NCBI non-redundant (NR), retaining only microeukaryotes such as protists and fungi. This optimized microeukaryotic database ensures that only eukaryotic exons expected to be represented in metagenomes are used for eukaryotic gene modeling, and the resulting MetaEuk reference targets are used for eukaryotic MAG classification. Compared with a database that also includes prokaryotic or metazoan proteins, this targeted database is smaller and requires fewer computational resources for eukaryotic gene modeling and classification.

 

Contents:

* target_conversion.v1-v2.tsv.gz - Target conversion between version 1 (e.g., MMETSP_1) and version 2 (e.g., e1268482492fa135d4326159b595e0a9).

* reference.eukaryota_odb10.list - Reference proteins identified as markers using BUSCO's eukaryota_odb10 HMM set and its score cutoffs [New in v2.1]

* source_taxonomy_with_database.tsv.gz - Patch for `source_taxonomy.tsv.gz` with an additional column prepended that includes database information.  `Source_ID` has been replaced with `id_source`. Future versions will follow this format.

* VDB-Microeukaryotic_v2.tar.gz

-rw-r--r-- 1 jespinoz jcl110  11G Dec 20 21:12 reference.faa.gz

-rw-r--r-- 1 jespinoz jcl110 1.3G Dec 20 22:30 humann_uniref50_annotations.tsv.gz

-rw-r--r-- 1 jespinoz jcl110 936M Dec 20 21:34 target_to_source.dict.pkl.gz

-rw-r--r-- 1 jespinoz jcl110 583K Dec 20 21:48 source_to_lineage.dict.pkl.gz

-rw-r--r-- 1 jespinoz jcl110 508K Dec 20 21:51 source_taxonomy.tsv.gz

-rw-r--r-- 1 jespinoz jcl110  910 Dec 20 22:24 RELEASE_NOTES

-rw-r--r-- 1 jespinoz jcl110  352 Dec 20 22:36 md5_checksums

 

* reference.faa.gz - The main protein FASTA file, which is the dereplicated combination of NR (protists and fungi only), MMETSP, EukZoo, and EukProt. Only complete lineages are included because the database is partly used for classification.

* humann_uniref50_annotations.tsv.gz - HUMAnN reference database annotations in the following tab-separated format:

[id_target]<tab>[id_uniref50]<tab>[length]<tab>[lineage]

* Files ending in .pkl.gz are gzipped Python pickled dictionaries

* target_to_source.dict.pkl.gz maps the identifiers in the FASTA file to their original source identifiers

* source_to_lineage.dict.pkl.gz has the mapping between source identifiers and lineage strings (e.g., c__Aconoidasida;o__Haemosporida;f__Haemoproteidae;g__Haemoproteus;s__Haemoproteus sp. hCWT4)

* source_taxonomy.tsv.gz has the taxonomy for each source identifier
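
The two pickled dictionaries above can be chained to go from a protein identifier to a lineage string. The following is a minimal sketch of that lookup, assuming both files unpickle to plain Python `dict` objects as described; the function names (`load_pickled_dict`, `parse_lineage`, `target_lineage`) are hypothetical and not part of VEBA.

```python
import gzip
import pickle


def load_pickled_dict(path):
    """Load a gzip-compressed pickle file and return the stored object."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


def parse_lineage(lineage):
    """Split a lineage such as 'c__X;o__Y;f__Z' into {rank_prefix: name}."""
    ranks = {}
    for field in lineage.split(";"):
        prefix, separator, name = field.partition("__")
        if separator:  # ignore fields that lack the 'x__' prefix
            ranks[prefix] = name
    return ranks


def target_lineage(id_target, target_to_source, source_to_lineage):
    """Resolve a target identifier to its lineage string, or None if unmapped."""
    id_source = target_to_source.get(id_target)
    if id_source is None:
        return None
    return source_to_lineage.get(id_source)


# Example usage (paths assume the archive was extracted into the working directory):
# target_to_source = load_pickled_dict("target_to_source.dict.pkl.gz")
# source_to_lineage = load_pickled_dict("source_to_lineage.dict.pkl.gz")
# lineage = target_lineage("e1268482492fa135d4326159b595e0a9",
#                          target_to_source, source_to_lineage)
```

Unpickling executes arbitrary code, so only load these files after confirming their checksums. `target_to_source.dict.pkl.gz` is 936M compressed, so loading it may require a substantial amount of memory.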

 

Citation:

* Espinoza, J.L., Dupont, C.L. VEBA: a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes. BMC Bioinformatics 23, 419 (2022). https://doi.org/10.1186/s12859-022-04973-8

Previous version: 

Espinoza, Josh (2022): Microeukaryotic Protein Database (VDB_Microeukaryotic_v1). figshare. Dataset. https://doi.org/10.6084/m9.figshare.19668855.v2 

Release Notes:

Total:

 * Number of sequences (i.e., targets) = 46345614

 * Number of unique source organisms = 44647

 * Number of targets with UniRef50 hits = 37312663

 * Number of unique UniRef50 hits = 6864185

 * Number of unique taxa:

     * class = 161

     * order = 495

     * family = 1281

     * genus = 5752

     * species/strains = 42880

 

The following classes have been removed as these were flagged as higher eukaryotes:

 * Anthocerotopsida

 * Anthozoa

 * Appendicularia

 * Bivalvia

 * Bryopsida

 * Crustacea

 * Demospongiae

 * Echinoidea

 * Gastropoda

 * Ginkgoopsida

 * Gnetopsida

 * Lycopodiopsida

 * Magnoliopsida

 * Marchantiopsida

 * Myxozoa

 * Polychaeta

 * Polypodiopsida

 * Tentaculata

 

Additional Notes:

 * Protein identifiers (i.e., targets) have been relabeled to their hash identifiers. Calculated via `cat [query.fasta] | seqkit fx2tab -s -n > id_to_hash.tsv`.

 * The only difference between v2 and v2.1 is that v2.1 adds a list of reference identifiers determined to be BUSCO eukaryota_odb10 markers.
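
The hash relabeling described in the first note can be approximated in Python. This is a sketch only: it assumes the `seqkit fx2tab -s` hash is an MD5 digest of the sequence (consistent with the 32-character hexadecimal example identifier) and that the sequence is lowercased before hashing unless case sensitivity is requested. The case handling is an assumption and should be checked against a few entries in the database before relying on it. The function names are hypothetical.

```python
import hashlib


def sequence_md5(sequence, case_sensitive=False):
    """Return the MD5 hex digest of a sequence.

    Assumption: the sequence is lowercased first unless case_sensitive is True.
    """
    seq = sequence.strip()
    if not case_sensitive:
        seq = seq.lower()
    return hashlib.md5(seq.encode("ascii")).hexdigest()


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def id_to_hash(lines, case_sensitive=False):
    """Map each FASTA header to the hash of its sequence."""
    return {
        header: sequence_md5(seq, case_sensitive)
        for header, seq in read_fasta(lines)
    }
```

Because the identifier depends only on the sequence, identical proteins from different sources receive the same identifier, which fits the dereplicated design of the database.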

 

MD5 Checksums:

7df1897abfcac5e56f59b508ccd51d49  humann_uniref50_annotations.tsv.gz
caca03766299396618ce95e6d23c9adf  reference.faa.gz
844b25ad5d93c707335a4b72268e6b76  RELEASE_NOTES
e1027b61b0766f46b063f8ce0e83344d  source_taxonomy.tsv.gz
49648a6b988df13b083b154596692a33  source_to_lineage.dict.pkl.gz
a3ac333dc6f021ea5bdaf7126b7cf07d  target_to_source.dict.pkl.gz
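
The checksums above can be verified after extracting the archive. The sketch below assumes the `md5_checksums` file uses the same two-column `checksum  filename` layout shown here; the function names (`file_md5`, `parse_checksums`, `verify_directory`) are hypothetical.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse 'checksum  filename' lines into a {filename: checksum} dict."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            checksum, name = parts
            expected[name] = checksum.lower()
    return expected


def verify_directory(directory, expected):
    """Return {filename: 'ok' | 'mismatch' | 'missing'} for each expected file."""
    results = {}
    for name, checksum in expected.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            results[name] = "missing"
        elif file_md5(path) == checksum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


# Example usage (assumes the archive was extracted into the working directory):
# with open("md5_checksums") as handle:
#     expected = parse_checksums(handle.read())
# print(verify_directory(".", expected))
```

Reading in chunks keeps memory use small even for the 11G `reference.faa.gz`.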

Files (14.8 GB)

md5:7bbda52b40258f2f0c987bd1df5672ff  25.0 MB
md5:731c4af5771c413f30d02f70b88cbdee  65 Bytes
md5:5708ac88e5e12ee88d957f220b699da3  566.5 kB
md5:a5ab8793ea901778aafe73017223f45a  71 Bytes
md5:da1555713adbd7e7c210212f91d91931  1.1 GB
md5:0edb525d4e52d2c486b8f6b975dfbfbe  64 Bytes
md5:7a69d6133449a9e24d3bc0a7b54de157  13.7 GB
md5:e77fa2f2d5e4abb992b22f28a77a914b  64 Bytes