Published March 6, 2020 | Version v. 03_2020_137
Dataset Open

Darwin: an amino acid sequence collection of complete proteomes from eukaryotes with different phylogenetic affinities (v. 03_2020_137)

  • 1. The Sainsbury Laboratory

Description

Background

Every time we find an interesting gene in an organism of interest, the first question is often “how widely is this gene distributed in the eukaryotic kingdom?”. Naturally, one could use NCBI BLAST search against the non-redundant sequence database provided by GenBank to answer this question. However, it can be cumbersome to parse the results and assign them to taxonomic units. It is also not straightforward to get an overview of which eukaryotic groups are represented in the results. Top BLAST hits can be crowded with sequences from closely-related organisms making it difficult gain an overview of the overall distribution across eukaryotes. To streamline this process, we developed an in-house database of complete eukaryotic proteomes. We tagged each sequence with a eukaryotic group handle (two-character symbol) and combined them into a single data set searchable by standalone BLAST on one’s own computer. We named this data set “Darwin” to reflect the diverse nature of the sequences it contains. 

Methods

We downloaded predicted proteomes in FASTA format from different sources such as GenBank, Joint Genome Institute (Depart of Energy, USA), Broad Institute (Massachusetts Institute of Technology, USA), Phytozome and a number of other specialized websites catering for a specific organism such as the Arabidopsis Information Resource (TAIR), or the Saccharomyces Genome Database (SGD). All the organisms we included in Darwin are listed in Table 1. To reduce redundancy, we took care not to include the same species more than once unless subspecies were known to show wide diversity. Each sequence header was tagged with a eukaryotic group handle composed of two-character symbols (based on Keeling et al., 2005). These handles clearly appear in BLAST output and can be parsed easily. We combined sequences from all proteomes into a single data set and named it “Darwin”.

Results

The current version of Darwin (v. 03_2020_137) contains 2,601,132 amino acid sequences from 137 eukaryotes (Table 1, Data file 1). The sizes of the proteomes were diverse, ranging from ~4000 sequences in some alveolates to 60,000-76,000 in plants. Darwin represents most of the supergroups of eukaryotic kingdom described in Keeling et al., (2005) except those in Rhizaria whose genomes were not available at the time of data set construction. The data set contains larger numbers of proteomes from fungi and plants reflecting areas of interest in our group. 

Conclusions

Darwin is provided as a text fasta file that can be formatted for BLAST searches on standalone computers. The results from the BLAST searches can be parsed to determine how widely a gene of interest is distributed among different eukaryotes. Simple counting of the eukaryotic group handles would also yield an overview of the distribution across taxa. Darwin is also useful for rapidly finding out whether a gene is missing in particular taxa.

Reference

Keeling PJ, Burger G, Durnford DG, Lang BF, Lee RW, Pearlman RE, Roger AJ, Gray MW (2005) The tree of eukaryotes. Trends Ecol. Evol. 20: 670-676

Files

Darwin_data_set_2020March_writeup_for_Zenodo.pdf

Files (1.2 GB)

Name Size Download all
md5:017914ffe7e690b3d50aeebcca590b1e
111.1 kB Preview Download
md5:b3d0aea1bf1e9c31b7411fb64fc216d5
1.2 GB Download
md5:699df0ccab29a2c3611345bca54f3ade
58.4 kB Download

Additional details

Funding

BLASTOFF – Retooling plant immunity for resistance to blast fungi 743165
European Commission
Maximizing the potential for sustainable and durable resistance to the wheat yellow rust pathogen BB/J012017/1
UK Research and Innovation
NGRB – Next generation disease resistance breeding in plants 294608
European Commission
Development of novel blast resistant wheat varieties for Bangladesh by genome editing BB/P023339/2
UK Research and Innovation
A tool for haplotype reconstruction from polyploid genomes BB/L018535/1
UK Research and Innovation