LukProt - an animal evolution-centric eukaryotic protein database
Description
LukProt is the EukProt database extended with additional species, mostly from undersampled animal taxa and some holozoan taxa. The database is composed of sequences translated from annotated genomes, transcriptomes or ESTs. Its main purposes are to consolidate sequences from undersampled animal taxa and to provide usable search tools. The publication associated with LukProt can be found here: https://doi.org/10.1093/gbe/evae231.
The current version of the database (v1.5.1) is based on EukProt v3. The home of all public versions of LukProt is this page (Zenodo).
Proteomes that are novel in LukProt are denoted as LPXXXXX and those coming from AniProtDB are called APXXXXX. The sequence IDs from EukProt are conserved in LukProt. This means that each sequence is assigned an ID in the following format:
(A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY
where XXXXX is a number from 00001 to 99999 and YYYYYY is a number from 000001 to 999999. Each sequence receives a number that is unique within its taxon. All the IDs are compatible with the BLAST v5 "-parse_seqids" option and the database can be readily deployed, for example on a server running SequenceServer. Within each of the source FASTA files, the source sequence identifier was kept after a blank space, so that it can still be retrieved if needed.
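The ID convention above can be handled with plain POSIX shell pattern matching. The following is a minimal sketch, not part of LukProt: the function names are hypothetical, the example IDs are invented for illustration, and the check only tests the digit counts (it does not reject the out-of-range value 00000).

```shell
# Hypothetical helper functions; the names are illustrative only.

# Succeed if the string looks like (A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY.
lukprot_is_valid_id() {
  case "$1" in
    [AEL]P[0-9][0-9][0-9][0-9][0-9]_?*_P[0-9][0-9][0-9][0-9][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Proteome ID: everything before the first underscore.
lukprot_proteome_id() {
  printf '%s\n' "${1%%_*}"
}

# Per-taxon sequence number: the digits after the last "_P".
lukprot_sequence_number() {
  printf '%s\n' "${1##*_P}"
}

# Source identifier: the text after the first blank in a FASTA header line.
# Prints an empty line if the header carries no source identifier.
lukprot_source_id() {
  header=${1#">"}
  case "$header" in
    *' '*) printf '%s\n' "${header#* }" ;;
    *) printf '\n' ;;
  esac
}
```

For example, `lukprot_proteome_id "LP00001_Genus_species_P000001"` extracts the proteome part of an invented ID, and `lukprot_source_id` can be fed a header line taken from one of the source FASTA files.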
A public BLAST server for searching LukProt is available at: https://lukprot.hirszfeld.pl/.
Comparison of EukProt v2/v3, LukProt v1.4.1 and LukProt v1.5.1 in their main areas of difference (number of proteomes per taxogroup):
Taxogroup | EukProt v2 | EukProt v3 | LukProt v1.4.1 | LukProt v1.5.1 |
---|---|---|---|---|
Holozoa (excluding Metazoa) | 31 | 40 | 39 | 43 |
Ctenophora | 2 | 2 | 35 | 38 |
Porifera | 4 | 5 | 30 | 47 |
Placozoa | 2 | 2 | 3 | 6 |
Cnidaria | 3 | 5 | 65 | 88 |
Bilateria | 51 | 51 | 94 | 142 |
Included with the database are:
- ready to use main database files:
  - LukProt_v1.5.1_single_species_FASTA.7z – FASTA files with the sequences, 7-zipped, uncompressed size: 17.6 GB
    - to concatenate all into one file, run this in the parent directory: `for file in $(find . -type f -name "*.fasta"); do awk 'FNR==1{print ""}1' "$file" >> LukProt_v1.5.1.fa; done`. This creates a single FASTA file with all the sequences in the parent directory. `awk` is used to insert a line break between files, because `cat` would sometimes merge the last sequence of one file with the first header of the next.
  - LukProt_v1.5.1_full_BLAST_db.7z – a preformatted, full BLAST database (NCBI BLAST database format version: v5, masked with segmasker), uncompressed size: 28.3 GB
  - LukProt_v1.5.1_taxogroup_BLAST_db.7z – a collection of BLAST databases where each taxogroup is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.3 GB
  - LukProt_v1.5.1_single_species_BLAST_db.7z – a collection of BLAST databases where each proteome is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.4 GB
- auxiliary database files:
  - LukProt_v1.5.1.cdhit70.7z – the full database clustered at 70% identity using CD-HIT with the following command: `cd-hit -g 1 -d 0 -T 20 -M 90000 -c 0.7 -uL 0.2 -uS 0.9 -s 0.2`, uncompressed sizes: FASTA file - 11 GB, clstr file - 2.5 GB
  - LukProt_IDs_mapped.txt.gz – a text file mapping the LukProt IDs to the AniProtDB IDs and to the EukProt IDs that differ
  - BUSCO_tables.ods – a spreadsheet with full result tables generated by BUSCO analysis
  - OMAmer_output.zip – a folder with full results of OMAmer analyses (includes per-sequence taxonomy classification)
  - OMArk_output.zip – a folder with the results of all OMArk analyses
- metadata:
  - README.md – a README file describing the metadata
  - LukProt_metadata_sheet.ods – main metadata file. A spreadsheet with information about each proteome (in an open .ods format, most compatible with LibreOffice)
  - LukProt_metadata_other.zip – an archive with other metadata files, documented in the README. Contents include:
    - the LukProt taxonomy in various formats
    - supporting scripts for data manipulation and visualization
    - a recoloring script (modified by LFS, originally by Dr. Celine Petitjean). The script is in the public domain and is reuploaded here only for convenience.
    - other files - see README
  - changelog.md – database changelog
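Two quick sanity checks may be useful when working with the files listed above: counting FASTA records before and after concatenation, and listing the representative sequences in the CD-HIT cluster file. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the second function relies on the standard CD-HIT `.clstr` layout, in which the representative of each cluster is the member line ending in `*`.

```shell
# Hypothetical helper functions; the names are illustrative only.

# Count FASTA records (lines starting with ">") across one or more files.
# awk reads each file separately, so a missing trailing newline in one file
# cannot merge a sequence line with the next file's header.
count_fasta_records() {
  awk '/^>/ { n++ } END { print n + 0 }' "$@"
}

# Print the representative sequence ID of every cluster in a CD-HIT .clstr file.
# Member lines look like: "0<TAB>120aa, >SEQ_ID... *" (representative)
# or "1<TAB>118aa, >SEQ_ID... at 95.00%" (other members).
cdhit_representatives() {
  awk '/\*$/ {
    id = $3
    sub(/^>/, "", id)        # drop the leading ">"
    sub(/\.\.\.$/, "", id)   # drop the trailing "..."
    print id
  }' "$1"
}
```

Running `count_fasta_records` on the individual `*.fasta` files and again on the concatenated file should give the same number; a lower count in the concatenated file would suggest that a header was merged into a sequence line.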
Words of caution:
- The database was synchronized with EukProt v3 in version v1.5.1. This means that identifiers were modified in comparison to LukProt v1.4.1. The convention is not expected to change in future updates.
- Many proteomes, especially transcriptome-based ones, may contain contamination from other species. In addition, the translation algorithms often introduce errors (e.g. a transcript may not represent a full-length protein). For this reason, to get accurate sequences from each organism, users are directed to the source data and to the included OMAmer, OMArk and BUSCO data for details.
- The taxonomy differs from UniEuk/EukMap, but UniEuk data were integrated where possible.
- A few NCBI taxids are missing and will be added in due course.
- Proteomes from NCBI and UniProt will be updated to current versions.
- A number of proteomes listed in some metadata are unpublished and were held back.
- While the database contains metadata that present a particular phylogeny of animals, holozoans and other eukaryotes, no particular claims or hypotheses are made by the author(s). However, in the future, efforts will be made to name clades officially once they are more firmly established.
Please report any problems or suggestions to Lukasz Sobala: lukasz.sobala (at) hirszfeld.pl.
Acknowledgements:
- Andrew E. Allen Lab for creating the original PhyloDB.
- Daniel Richter et al. for creating EukProt and keeping it updated.
- Members of the Multicellgenome Lab, especially Michelle Leger (for donating her database), for the bioinformatics support and for doing great science.
- All the authors of the original data.
- National Science Centre of Poland for funding of the project 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity", which enabled the creation of this database.