Published November 15, 2023 | Version 3
Dataset Open

VEBA Microeukaryotic Protein Database (MicroEuk100/90/50, Version 3)

  • 1. J. Craig Venter Institute

Description

Microeukaryotic protein database of protist and fungal sequences for VEBA.

 

Number of sequences:

 * MicroEuk100 = 79,920,431 (19 GB)

 * MicroEuk90  = 51,767,730 (13 GB)

 * MicroEuk50  = 29,898,853 (6.5 GB)

 

Number of source organisms per dataset:

* MycoCosm = 2503

* PhycoCosm = 174

* EnsemblProtists = 233

* MMETSP = 759

* TARA_SAGv1 = 8

* EukProt = 366

* EukZoo = 27

* TARA_SMAGv1 = 389

* NR_Protists-Fungi = 48217
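
As a quick consistency check, the per-dataset counts above can be summed and compared with the 52,676 source organisms quoted for MicroEuk100.faa.gz in the file listing. A minimal Python sketch (the variable names are illustrative and not part of VEBA):

```python
# Source organisms per dataset, copied from the list above.
source_organisms = {
    "MycoCosm": 2503,
    "PhycoCosm": 174,
    "EnsemblProtists": 233,
    "MMETSP": 759,
    "TARA_SAGv1": 8,
    "EukProt": 366,
    "EukZoo": 27,
    "TARA_SMAGv1": 389,
    "NR_Protists-Fungi": 48217,
}

# Total across all datasets; the file listing quotes 52,676 for MicroEuk100.
total = sum(source_organisms.values())
print(f"{total:,}")  # 52,676
```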

 

Files:

MicroEuk_v3.tar.gz = 25 GB

-rw-rw---- 1 jespinoz jcl110  19G Nov 15 14:57 MicroEuk100.faa.gz - Main fasta file with 79,920,431 protein sequences from 52,676 source organisms.  Uses md5 hash identifiers.

-rw-rw---- 1 jespinoz jcl110 2.0G Nov 15 14:59 identifier_mapping.proteins.tsv.gz - Protein identifier mappings between datasets, original identifiers, source organisms, and md5 hash identifiers.

-rw-rw---- 1 jespinoz jcl110 1.7G Nov 15 16:10 MicroEuk90_clusters.tsv.gz - MMseqs2 clustering of MicroEuk100 that yields MicroEuk90

-rw-rw---- 1 jespinoz jcl110 1.5G Nov 15 14:57 MicroEuk100.list.gz - List of md5 hash protein identifiers in MicroEuk100

-rw-rw---- 1 jespinoz jcl110 1.1G Nov 15 16:10 MicroEuk50_clusters.tsv.gz - MMseqs2 clustering of MicroEuk90 that yields MicroEuk50

-rw-rw---- 1 jespinoz jcl110  13M Nov 15 23:39 MicroEuk100.eukaryota_odb10.list.gz - MicroEuk100 protein identifiers with hits to BUSCO's eukaryota_odb10 markers using the provided score thresholds

-rw-rw---- 1 jespinoz jcl110 1.5M Nov 15 14:58 source_taxonomy.tsv.gz - Source taxonomy, lineage, dataset, and notes for each source organism
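
Because MicroEuk90 is clustered from MicroEuk100 and MicroEuk50 from MicroEuk90, the two cluster tables can in principle be chained to find the MicroEuk50 representative of any MicroEuk100 protein. The sketch below is illustrative only. It assumes each table has two tab-separated columns (representative, then member) with no header row, as in typical MMseqs2 cluster TSV output, and that each representative also appears as a member of its own cluster. The layout of the actual files should be checked before relying on this. The function names are hypothetical and not part of VEBA.

```python
import gzip
from typing import Dict, Iterable


def read_clusters(lines: Iterable[str]) -> Dict[str, str]:
    """Map each member identifier to its cluster representative.

    Assumption: two tab-separated columns per line, representative then member.
    """
    member_to_rep: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        representative, member = line.split("\t")[:2]
        member_to_rep[member] = representative
    return member_to_rep


def read_clusters_gz(path: str) -> Dict[str, str]:
    """Same as read_clusters, but reads a gzip-compressed TSV from disk."""
    with gzip.open(path, "rt") as handle:
        return read_clusters(handle)


def chain(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, str]:
    """Compose two member->representative maps (e.g. 100->90, then 90->50)."""
    return {m: second[rep] for m, rep in first.items() if rep in second}


# Toy data standing in for the real tables (identifiers are made up).
to_90 = read_clusters(["p1\tp1\n", "p1\tp2\n", "p3\tp3\n"])
to_50 = read_clusters(["p1\tp1\n", "p1\tp3\n"])
to_50_from_100 = chain(to_90, to_50)
print(to_50_from_100)  # {'p1': 'p1', 'p2': 'p1', 'p3': 'p1'}
```

With the real files, read_clusters_gz("MicroEuk90_clusters.tsv.gz") would hold tens of millions of entries in memory, so a streaming or on-disk approach may be preferable at full scale.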

 

For more information and citations, please visit the main GitHub repository: 

https://github.com/jolespin/veba

Files (26.2 GB)

md5:fae810faf99499dc7dcc27b66974f0b6
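
After downloading, the archive can be checked against the md5 value listed above. The following is a minimal Python sketch using only the standard library. The local filename MicroEuk_v3.tar.gz is an assumption taken from the Files section of the description, and the function names are hypothetical.

```python
import hashlib

# Checksum as listed on this record.
EXPECTED = "md5:fae810faf99499dc7dcc27b66974f0b6"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with an expected value ('md5:' prefix optional)."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Example usage (assumes the archive was saved under this name):
# print(matches_checksum("MicroEuk_v3.tar.gz", EXPECTED))
```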