VEBA Microeukaryotic Protein Database (VDB-Microeukaryotic_v2.1)

Espinoza, Josh L

doi:10.1186/s12859-022-04973-8

Published December 27, 2022 | Version v2.1

Dataset Open

Espinoza, Josh L¹

1. J. Craig Venter Institute

Version:

VDB_Microeukaryotic_v2.1

Description:

A protein database is required for eukaryotic gene calling with MetaEuk, and the resulting hits can also be leveraged for MAG annotation. Many eukaryotic protein databases exist, such as MMETSP, EukZoo, and EukProt, yet these are limited to marine environments, include prokaryotic sequences, or include sequences from organisms that would not be expected to be binned out of metagenomes, such as metazoans. While it may be possible to bin fragments of higher eukaryotic genomes, this is often not the objective of metagenomic studies where microorganisms are the focus. We combined and dereplicated MMETSP, EukZoo, EukProt, and NCBI non-redundant (NR), retaining only microeukaryotes such as protists and fungi. This optimized microeukaryotic database ensures that only eukaryotic exons expected to be represented in metagenomes are used for eukaryotic gene modeling, and the resulting MetaEuk reference targets are used for eukaryotic MAG classification. Compared with a database that also includes prokaryotic or metazoan proteins, this targeted microeukaryotic protein database is smaller and requires fewer computational resources for eukaryotic gene modeling and classification.

Contents:

* target_conversion.v1-v2.tsv.gz - Target conversion between version 1 (e.g., MMETSP_1) and version 2 (e.g., e1268482492fa135d4326159b595e0a9).

* reference.eukaryota_odb10.list - Reference proteins filtered using BUSCO's eukaryota_odb10 HMM set and its associated score cutoffs [New in v2.1]

* source_taxonomy_with_database.tsv.gz - Patch for `source_taxonomy.tsv.gz` with an additional column prepended that includes database information. `Source_ID` has been replaced with `id_source`. Future versions will follow this format.

* VDB-Microeukaryotic_v2.tar.gz - Archive containing the following files:

-rw-r--r-- 1 jespinoz jcl110 11G Dec 20 21:12 reference.faa.gz

-rw-r--r-- 1 jespinoz jcl110 1.3G Dec 20 22:30 humann_uniref50_annotations.tsv.gz

-rw-r--r-- 1 jespinoz jcl110 936M Dec 20 21:34 target_to_source.dict.pkl.gz

-rw-r--r-- 1 jespinoz jcl110 583K Dec 20 21:48 source_to_lineage.dict.pkl.gz

-rw-r--r-- 1 jespinoz jcl110 508K Dec 20 21:51 source_taxonomy.tsv.gz

-rw-r--r-- 1 jespinoz jcl110 910 Dec 20 22:24 RELEASE_NOTES

-rw-r--r-- 1 jespinoz jcl110 352 Dec 20 22:36 md5_checksums

* reference.faa.gz - The main FASTA protein file, which is the dereplicated combination of NR (protists and fungi only), MMETSP, EukZoo, and EukProt. Only sequences with complete lineages are included because this file is also used for classification.

* humann_uniref50_annotations.tsv.gz - HUMAnN reference database annotations in the following tab-separated format:

[id_target]<tab>[id_uniref50]<tab>[length]<tab>[lineage]

* Files ending in .pkl.gz are gzipped Python pickled dictionaries.

* target_to_source.dict.pkl.gz maps identifiers in the FASTA file to their original source identifiers.

* source_to_lineage.dict.pkl.gz maps source identifiers to lineage strings (e.g., c__Aconoidasida;o__Haemosporida;f__Haemoproteidae;g__Haemoproteus;s__Haemoproteus sp. hCWT4).

* source_taxonomy.tsv.gz has the taxonomy for each source identifier.
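
The two pickled dictionaries can be chained to go from a FASTA identifier to a lineage string. The following is a minimal sketch, not part of the VEBA code base: the function names (`load_gzipped_pickle`, `parse_lineage`, `target_to_lineage`) are hypothetical, and it assumes the dictionaries are plain `dict` objects keyed as described above. As with any pickle, only load files obtained from a trusted source.

```python
import gzip
import pickle


def load_gzipped_pickle(path):
    """Load a gzipped pickle file (e.g., target_to_source.dict.pkl.gz)."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


def parse_lineage(lineage):
    """Split a lineage string such as 'c__X;o__Y' into {rank_prefix: name}."""
    ranks = {}
    for field in lineage.split(";"):
        prefix, _, name = field.partition("__")
        ranks[prefix] = name
    return ranks


def target_to_lineage(id_target, target_to_source, source_to_lineage):
    """Resolve a FASTA identifier to a lineage string via its source identifier."""
    id_source = target_to_source[id_target]
    return source_to_lineage[id_source]


# Toy in-memory example; both identifiers below are placeholders, not real entries.
toy_target_to_source = {"hypothetical_target_hash": "hypothetical_source_id"}
toy_source_to_lineage = {
    "hypothetical_source_id": (
        "c__Aconoidasida;o__Haemosporida;f__Haemoproteidae;"
        "g__Haemoproteus;s__Haemoproteus sp. hCWT4"
    )
}
lineage = target_to_lineage(
    "hypothetical_target_hash", toy_target_to_source, toy_source_to_lineage
)
print(parse_lineage(lineage)["g"])
```

With the real files, the dictionaries would be loaded once with `load_gzipped_pickle("target_to_source.dict.pkl.gz")` and `load_gzipped_pickle("source_to_lineage.dict.pkl.gz")`. Note that the target dictionary is 936M compressed, so it may need substantial memory.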

Citation:

* Espinoza, J.L., Dupont, C.L. VEBA: a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes. BMC Bioinformatics 23, 419 (2022). https://doi.org/10.1186/s12859-022-04973-8

Previous version:

Espinoza, Josh (2022): Microeukaryotic Protein Database (VDB_Microeukaryotic_v1). figshare. Dataset. https://doi.org/10.6084/m9.figshare.19668855.v2

Release Notes:

Total:

* Number of sequences (i.e., targets) = 46345614

* Number of unique source organisms = 44647

* Number of targets with UniRef50 hits = 37312663

* Number of unique UniRef50 hits = 6864185

* Number of unique taxa:

  * class = 161

  * order = 495

  * family = 1281

  * genus = 5752

  * species/strains = 42880
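
A few summary ratios follow directly from the totals above. The short calculation below is only illustrative arithmetic on the reported counts (the variable names are ours): roughly 80.5% of targets have a UniRef50 hit, there are about 1,038 targets per source organism on average, and each unique UniRef50 hit is shared by about 5.4 targets on average.

```python
# Counts copied from the release notes above
n_targets = 46345614
n_sources = 44647
n_targets_with_uniref50 = 37312663
n_unique_uniref50 = 6864185

# Derived ratios
fraction_annotated = n_targets_with_uniref50 / n_targets
targets_per_source = n_targets / n_sources
targets_per_uniref50 = n_targets_with_uniref50 / n_unique_uniref50

print(f"Targets with a UniRef50 hit: {fraction_annotated:.1%}")
print(f"Mean targets per source organism: {targets_per_source:.0f}")
print(f"Mean targets per unique UniRef50 hit: {targets_per_uniref50:.1f}")
```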

The following classes have been removed as these were flagged as higher eukaryotes:

* Anthocerotopsida

* Anthozoa

* Appendicularia

* Bivalvia

* Bryopsida

* Crustacea

* Demospongiae

* Echinoidea

* Gastropoda

* Ginkgoopsida

* Gnetopsida

* Lycopodiopsida

* Magnoliopsida

* Marchantiopsida

* Myxozoa

* Polychaeta

* Polypodiopsida

* Tentaculata

Additional Notes:

* Protein identifiers (i.e., targets) have been relabeled to their sequence hash identifiers, calculated via `cat [query.fasta] | seqkit fx2tab -s -n > id_to_hash.tsv`.

* The only difference between v2 and v2.1 is that v2.1 now has a list of reference identifiers that are determined to be one of the BUSCO eukaryota_odb10 markers.

MD5 Checksums:

7df1897abfcac5e56f59b508ccd51d49 humann_uniref50_annotations.tsv.gz
caca03766299396618ce95e6d23c9adf reference.faa.gz
844b25ad5d93c707335a4b72268e6b76 RELEASE_NOTES
e1027b61b0766f46b063f8ce0e83344d source_taxonomy.tsv.gz
49648a6b988df13b083b154596692a33 source_to_lineage.dict.pkl.gz
a3ac333dc6f021ea5bdaf7126b7cf07d target_to_source.dict.pkl.gz
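
After extracting the archive, the files can be checked against these values. The sketch below uses only the Python standard library; the helper names (`md5sum`, `verify_checksums`) are hypothetical, and it assumes the `md5_checksums` file uses the same `<md5> <filename>` layout shown above, one entry per line.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file, directory="."):
    """Return {filename: bool} for each '<md5> <filename>' line in checksum_file."""
    results = {}
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, filename = line.split(maxsplit=1)
        # Tolerate the '*' binary-mode marker that md5sum may emit
        filename = filename.strip().lstrip("*")
        results[filename] = md5sum(Path(directory) / filename) == expected.lower()
    return results
```

For example, `verify_checksums("md5_checksums", directory=".")` run inside the extracted directory would return a dictionary in which every value should be `True` if the download is intact. Reading in chunks keeps memory use low for the 11G `reference.faa.gz`.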

Files (14.8 GB):

Name	MD5	Size
reference.eukaryota_odb10.list	7bbda52b40258f2f0c987bd1df5672ff	25.0 MB
reference.eukaryota_odb10.list.md5	731c4af5771c413f30d02f70b88cbdee	65 Bytes
source_taxonomy_with_database.tsv.gz	5708ac88e5e12ee88d957f220b699da3	566.5 kB
source_taxonomy_with_database.tsv.gz.md5	a5ab8793ea901778aafe73017223f45a	71 Bytes
target_conversion.v1-v2.tsv.gz	da1555713adbd7e7c210212f91d91931	1.1 GB
target_conversion.v1-v2.tsv.gz.md5	0edb525d4e52d2c486b8f6b975dfbfbe	64 Bytes
VDB-Microeukaryotic_v2.tar.gz	7a69d6133449a9e24d3bc0a7b54de157	13.7 GB
VDB-Microeukaryotic_v2.tar.gz.md5	e77fa2f2d5e4abb992b22f28a77a914b	64 Bytes
