VEBA Microeukaryotic Protein Database (VDB-Microeukaryotic_v2.1)
Description
Version:
VDB_Microeukaryotic_v2.1
Description:
MetaEuk requires a protein database for eukaryotic gene calling, and the resulting hits can also be leveraged for MAG annotation. Many eukaryotic protein databases exist, such as MMETSP, EukZoo, and EukProt, yet these are limited to marine environments, include prokaryotic sequences, or include sequences from organisms, such as metazoans, that would not be expected to be binned out of metagenomes. While it may be possible to bin fragments of higher eukaryotic genomes, this is often not the objective of metagenomic studies, where microorganisms are the focus. We combined and dereplicated MMETSP, EukZoo, EukProt, and NCBI non-redundant (NR), retaining only microeukaryotes such as protists and fungi. This optimized microeukaryotic database ensures that only eukaryotic exons expected to be represented in metagenomes are used for eukaryotic gene modeling, and the resulting MetaEuk reference targets are used for eukaryotic MAG classification. Restricting the database to microeukaryotes also reduces its size and the computational resources needed for eukaryotic gene modeling and classification, compared with a database that includes additional prokaryotic or metazoan proteins.
Contents:
* target_conversion.v1-v2.tsv.gz - Target conversion between version 1 (e.g., MMETSP_1) and version 2 (e.g., e1268482492fa135d4326159b595e0a9).
* reference.eukaryota_odb10.list - Reference proteins identified with BUSCO's eukaryota_odb10 HMM set and filtered by the corresponding score cutoffs [New in v2.1]
* source_taxonomy_with_database.tsv.gz - Patch for `source_taxonomy.tsv.gz` with an additional column prepended that includes database information. `Source_ID` has been replaced with `id_source`. Future versions will follow this format.
* VDB-Microeukaryotic_v2.tar.gz
-rw-r--r-- 1 jespinoz jcl110 11G Dec 20 21:12 reference.faa.gz
-rw-r--r-- 1 jespinoz jcl110 1.3G Dec 20 22:30 humann_uniref50_annotations.tsv.gz
-rw-r--r-- 1 jespinoz jcl110 936M Dec 20 21:34 target_to_source.dict.pkl.gz
-rw-r--r-- 1 jespinoz jcl110 583K Dec 20 21:48 source_to_lineage.dict.pkl.gz
-rw-r--r-- 1 jespinoz jcl110 508K Dec 20 21:51 source_taxonomy.tsv.gz
-rw-r--r-- 1 jespinoz jcl110 910 Dec 20 22:24 RELEASE_NOTES
-rw-r--r-- 1 jespinoz jcl110 352 Dec 20 22:36 md5_checksums
* reference.faa.gz - The main protein FASTA file, a dereplicated combination of NR (protists and fungi only), MMETSP, EukZoo, and EukProt. Only sequences with complete lineages are included because the database is used in part for classification.
* humann_uniref50_annotations.tsv.gz - HUMAnN reference database annotations in the following format:
[id_target]<tab>[id_uniref50]<tab>[length]<tab>[lineage]
* Files ending in .pkl.gz are gzipped Python pickled dictionaries.
* target_to_source.dict.pkl.gz maps identifiers in the FASTA file to their original source identifiers
* source_to_lineage.dict.pkl.gz maps source identifiers to lineage strings (e.g., c__Aconoidasida;o__Haemosporida;f__Haemoproteidae;g__Haemoproteus;s__Haemoproteus sp. hCWT4)
* source_taxonomy.tsv.gz has the taxonomy for each source identifier
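The mapping files described above can be chained to go from a protein identifier to a lineage string. The following is a minimal sketch, not part of the VEBA code base: the function names are hypothetical, and it assumes that both pickles deserialize to plain dictionaries with string keys and values and that the annotations file has exactly the tab-separated layout shown above. Whether the annotations file starts with a header line is not stated here, so the reader returns raw string fields and leaves that check to the caller.

```python
import gzip
import pickle
from typing import Dict, Iterator, List, Optional


def load_pickled_dict(path: str) -> Dict[str, str]:
    """Load a gzipped pickle (e.g., target_to_source.dict.pkl.gz) into memory."""
    # Only unpickle files obtained from a source you trust;
    # unpickling can execute arbitrary code.
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


def target_lineage(
    id_target: str,
    target_to_source: Dict[str, str],
    source_to_lineage: Dict[str, str],
) -> Optional[str]:
    """Resolve a target (protein) identifier to its lineage string.

    Returns None when either lookup fails.
    """
    id_source = target_to_source.get(id_target)
    if id_source is None:
        return None
    return source_to_lineage.get(id_source)


def iter_annotations(path: str) -> Iterator[List[str]]:
    """Yield the tab-separated fields of each non-empty line.

    Assumed field order: id_target, id_uniref50, length, lineage.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line.split("\t")
```

A typical call would load both dictionaries once with `load_pickled_dict` and then pass them to `target_lineage` for each identifier of interest. Given the number of targets in this release, the target-to-source dictionary is likely to need a substantial amount of memory once loaded.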
Citation:
* Espinoza, J.L., Dupont, C.L. VEBA: a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes. BMC Bioinformatics 23, 419 (2022). https://doi.org/10.1186/s12859-022-04973-8
Previous version:
Espinoza, Josh (2022): Microeukaryotic Protein Database (VDB_Microeukaryotic_v1). figshare. Dataset. https://doi.org/10.6084/m9.figshare.19668855.v2
Release Notes:
Total:
* Number of sequences (i.e., targets) = 46345614
* Number of unique source organisms = 44647
* Number of targets with UniRef50 hits = 37312663
* Number of unique UniRef50 hits = 6864185
* Number of unique taxa:
* class = 161
* order = 495
* family = 1281
* genus = 5752
* species/strains = 42880
The following classes have been removed as these were flagged as higher eukaryotes:
* Anthocerotopsida
* Anthozoa
* Appendicularia
* Bivalvia
* Bryopsida
* Crustacea
* Demospongiae
* Echinoidea
* Gastropoda
* Ginkgoopsida
* Gnetopsida
* Lycopodiopsida
* Magnoliopsida
* Marchantiopsida
* Myxozoa
* Polychaeta
* Polypodiopsida
* Tentaculata
Additional Notes:
* Protein identifiers (i.e., targets) have been relabeled with the MD5 hash of their sequence, calculated via `cat [query.fasta] | seqkit fx2tab -s -n > id_to_hash.tsv`.
* The only difference between v2 and v2.1 is that v2.1 adds a list of reference identifiers that were identified as BUSCO eukaryota_odb10 markers.
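For readers without seqkit at hand, the relabeling step can be approximated in Python. This is a sketch under stated assumptions rather than the procedure used to build the database: it assumes the hash is an MD5 digest over the sequence exactly as written (case-sensitive, with line breaks removed), and the function names `sequence_md5` and `fasta_id_to_hash` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Optional


def sequence_md5(sequence: str) -> str:
    """Return the hexadecimal MD5 digest of a sequence string."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def fasta_id_to_hash(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header to the MD5 digest of its sequence.

    The full header line (without the leading '>') is used as the key,
    which is assumed to mirror `seqkit fx2tab -n` output.
    """
    id_to_hash: Dict[str, str] = {}
    name: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close out the previous record before starting a new one.
            if name is not None:
                id_to_hash[name] = sequence_md5("".join(chunks))
            name = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # Close out the final record.
    if name is not None:
        id_to_hash[name] = sequence_md5("".join(chunks))
    return id_to_hash
```

Passing an open text file handle as `lines` processes a FASTA file record by record. Identical sequences produce identical hashes, which is what makes the hash usable as a dereplicated identifier.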
MD5 Checksums:
7df1897abfcac5e56f59b508ccd51d49 humann_uniref50_annotations.tsv.gz
caca03766299396618ce95e6d23c9adf reference.faa.gz
844b25ad5d93c707335a4b72268e6b76 RELEASE_NOTES
e1027b61b0766f46b063f8ce0e83344d source_taxonomy.tsv.gz
49648a6b988df13b083b154596692a33 source_to_lineage.dict.pkl.gz
a3ac333dc6f021ea5bdaf7126b7cf07d target_to_source.dict.pkl.gz
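After extracting the archive, the checksums listed above can be compared against the local files. The sketch below uses only the Python standard library; the function names are hypothetical, and it assumes the checksum text follows the `<md5> <filename>` layout shown above.

```python
import hashlib
from typing import Dict


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text: str) -> Dict[str, str]:
    """Parse '<md5> <filename>' lines into a {filename: md5} dictionary."""
    expected: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            # md5sum may prefix the filename with '*' in binary mode.
            expected[parts[1].lstrip("*")] = parts[0]
    return expected


def verify(path: str, expected_md5: str) -> bool:
    """Return True when the file's MD5 digest matches the expected value."""
    return file_md5(path) == expected_md5.lower()
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte `reference.faa.gz`.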
Files
(14.8 GB)
| MD5 | Size |
|---|---|
| 7bbda52b40258f2f0c987bd1df5672ff | 25.0 MB |
| 731c4af5771c413f30d02f70b88cbdee | 65 Bytes |
| 5708ac88e5e12ee88d957f220b699da3 | 566.5 kB |
| a5ab8793ea901778aafe73017223f45a | 71 Bytes |
| da1555713adbd7e7c210212f91d91931 | 1.1 GB |
| 0edb525d4e52d2c486b8f6b975dfbfbe | 64 Bytes |
| 7a69d6133449a9e24d3bc0a7b54de157 | 13.7 GB |
| e77fa2f2d5e4abb992b22f28a77a914b | 64 Bytes |