Published September 24, 2024 | Version v1.5.1.rev2
Dataset Open

LukProt - an animal evolution-centric eukaryotic protein database

  • 1. Hirszfeld Institute of Immunology and Experimental Therapy, PAS

Description

LukProt is the EukProt database with additional species added, mostly from undersampled animal taxa and some holozoan taxa. The database is composed of sequences translated from annotated genomes, transcriptomes or ESTs. The main purposes of the database are to consolidate sequences from undersampled animal taxa and to provide usable search tools. The publication associated with LukProt can be found here: https://doi.org/10.1093/gbe/evae231.

The current version of the database (v1.5.1) is based on EukProt v3. The home of all public versions of LukProt is this page (Zenodo).

Proteomes that are novel in LukProt are denoted as LPXXXXX and those coming from AniProtDB are called APXXXXX. The sequence IDs from EukProt are conserved in LukProt. This means that each sequence is assigned an ID in the following format:

(A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY

where XXXXX is a number from 00001 to 99999 and YYYYYY is a number from 000001 to 999999. Each sequence is assigned a number that is unique within its taxon. All the IDs are compatible with the BLAST v5 "-parse_seqids" option, so the database can be readily deployed, for example on a server running SequenceServer. Within each of the source FASTA files, the source sequence identifier was kept after a blank space, so that it can still be retrieved if needed.
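The ID layout above can be split into its parts with a regular expression. The following is only a minimal sketch: the function name parse_header, the example header and the assumption that the species/strain part may contain any characters are illustrative and not taken from the database itself.

```python
import re

# Pattern for the ID layout described above. Which characters may occur in the
# species/strain part is an assumption, so that part is matched permissively.
LUKPROT_ID = re.compile(
    r"^(?P<proteome>[AEL]P\d{5})_(?P<taxon>.+)_P(?P<number>\d{6})$"
)


def parse_header(header):
    """Split a FASTA header into LukProt ID parts and the source identifier."""
    text = header.lstrip(">").strip()
    # The source identifier, if present, follows the first blank space.
    lukprot_id, _, source_id = text.partition(" ")
    match = LUKPROT_ID.match(lukprot_id)
    if match is None:
        raise ValueError(f"not a LukProt-style ID: {lukprot_id!r}")
    return {
        "proteome": match["proteome"],
        "taxon": match["taxon"],
        "number": int(match["number"]),
        "source_id": source_id or None,
    }


# Hypothetical header, made up for illustration only.
example = parse_header(">LP00001_Genus_species_P000042 original_id_1")
print(example)
```

The first two characters of the proteome field (AP, EP or LP) indicate whether the proteome comes from AniProtDB, EukProt or is novel in LukProt.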

A public BLAST server providing LukProt search is available at: https://lukprot.hirszfeld.pl/.

Comparison of EukProt v2/v3, LukProt v1.4.1 and LukProt v1.5.1 in their main areas of difference:

Taxogroup                     EukProt v2   EukProt v3   LukProt v1.4.1   LukProt v1.5.1
Holozoa (excluding Metazoa)   31           40           39               43
Ctenophora                    2            2            35               38
Porifera                      4            5            30               47
Placozoa                      2            2            3                6
Cnidaria                      3            5            65               88
Bilateria                     51           51           94               142

Included with the database are:

  • ready to use main database files:
    • LukProt_v1.5.1_single_species_FASTA.7z – a FASTA file with the sequences - 7-zipped, uncompressed size: 17.6 GB
      • to concatenate all into one file, run this in the parent directory: for file in $(find . -type f -name "*.fasta"); do awk 'FNR==1{print ""}1' "$file" >> LukProt_v1.5.1.fa; done. This will create a single FASTA file with all the sequences in the parent directory. awk is used to insert an empty line before the contents of every file, because cat would sometimes merge the last sequence of one file with the header of the first sequence of the next.
    • LukProt_v1.5.1_full_BLAST_db.7z – a preformatted, full BLAST database (NCBI BLAST database format version: v5, masked with segmasker), uncompressed size: 28.3 GB
    • LukProt_v1.5.1_taxogroup_BLAST_db.7z – a collection of BLAST databases where each taxogroup is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.3 GB
    • LukProt_v1.5.1_single_species_BLAST_db.7z – a collection of BLAST databases where each proteome is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.4 GB
  • auxiliary database files:
    • LukProt_v1.5.1.cdhit70.7z – the full database clustered at 70% identity using CD-HIT with the following command: cd-hit -g 1 -d 0 -T 20 -M 90000 -c 0.7 -uL 0.2 -uS 0.9 -s 0.2; uncompressed sizes: FASTA file - 11 GB, clstr file - 2.5 GB
    • LukProt_IDs_mapped.txt.gz – a text file mapping the LukProt IDs to the AniProtDB IDs and EukProt IDs that are different
    • BUSCO_tables.ods – a spreadsheet with full result tables generated by BUSCO analysis
    • OMAmer_output.zip – a folder with full results of OMAmer analyses (includes per-sequence taxonomy classification)
    • OMArk_output.zip – a folder with the results of all OMArk analyses
  • metadata:
    • README.md – a README file describing the metadata
    • LukProt_metadata_sheet.ods – main metadata file. A spreadsheet with information about each proteome (in an open .ods format, most compatible with LibreOffice)
    • LukProt_metadata_other.zip – an archive with other metadata files, documented in the README. Contents include:
      • the LukProt taxonomy in various formats
      • supporting scripts for data manipulation and visualization
      • a recoloring script (modified by LFS, originally by Dr. Celine Petitjean). The script is in the public domain and is reuploaded here only for convenience.
      • other files - see README
  • changelog.md – database changelog
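The concatenation step described for LukProt_v1.5.1_single_species_FASTA.7z can also be sketched in Python. This is a hedged alternative to the awk loop, not the author's method: the function name concatenate_fasta is hypothetical, and instead of inserting an empty line before each file it adds a line break only when a file does not already end with one.

```python
from pathlib import Path


def concatenate_fasta(root, output):
    """Join every *.fasta file found under root into one output file.

    Returns the number of non-empty files that were written.
    """
    root, output = Path(root), Path(output)
    written = 0
    with output.open("wb") as out:
        for path in sorted(root.rglob("*.fasta")):
            if path.resolve() == output.resolve():
                continue  # never read the file that is being written
            last = b""
            with path.open("rb") as src:
                # Stream in 1 MiB chunks so large proteomes are not loaded whole.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            if not last:
                continue  # empty file, nothing was copied
            if last != b"\n":
                out.write(b"\n")  # keep the next header on its own line
            written += 1
    return written


# Example call, run from the directory holding the extracted FASTA files:
# concatenate_fasta(".", "LukProt_v1.5.1.fa")
```

Files are processed in sorted path order, so repeated runs give the same output; the awk loop follows whatever order find returns.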

Words of caution:

  • The database has been synchronized to EukProt v3 in version v1.5.1. This means that identifiers were modified in comparison to LukProt v1.4.1. The convention is not expected to change any more in future updates.
  • Many proteomes, especially those that are transcriptome-based, may contain contamination from different species. In addition, the translation algorithms often introduce errors (e.g. the transcript may not represent a full-length protein). For this reason, to get accurate sequences from each organism, users are directed to the source data and to the included OMAmer, OMArk and BUSCO data for details.
  • The taxonomy is different to UniEuk/EukMap, but UniEuk data were integrated where possible.
  • A few NCBI taxids are missing and will be added in due course.
  • Proteomes from NCBI and UniProt will be updated to current versions.
  • A number of proteomes that appear in some metadata are unpublished and were held back.
  • While the database contains metadata that present a particular phylogeny of animals, holozoans and other eukaryotes, no particular claims or hypotheses are made by the author(s). However, in the future, efforts will be made to name clades officially, once they are more firmly established.

Please report any problems or suggestions to Lukasz Sobala: lukasz.sobala (at) hirszfeld.pl.

Acknowledgements:

  • Andrew E. Allen Lab for creating the original PhyloDB.

  • Daniel Richter et al. for creating EukProt and keeping it updated.

  • Members of the Multicellgenome Lab, especially Michelle Leger (for donating her database), for the bioinformatics support and for doing great science.

  • All the authors of the original data.

  • National Science Centre of Poland for funding of the project 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity", which enabled the creation of this database.

Files (29.7 GB)

MD5 checksum                           Size
md5:e5622f0ec615bdadc24c59d12e92060d   13.5 MB
md5:b413079bee3e16bc34cb87891add8ac6   2.1 kB
md5:f128142282b40dba9e3bdb4eb31f204b   38.9 MB
md5:0ff1fcab6a667c0f0d794644789718f1   6.1 MB
md5:fbfb9135266ef77d8f07b21cd24ff217   886.5 kB
md5:149b70757cc4ef2eceee791fdd8a79e4   4.4 GB
md5:8831ca047c7142d98ddb07808e238311   6.3 GB
md5:a2d07ee76473cdaf262070f732799d4e   6.1 GB
md5:57f80efc4d9eef5cef3156cf305bed11   5.6 GB
md5:a7d9b28c1df6ae5af894cfbca20e3748   5.9 GB
md5:d2f6106fe681f00282ca32220fb46976   1.1 GB
md5:20ca89d4ee5caac2476868413d7a1a65   207.2 MB
md5:1cde49a5f87878807e0b55717375a35d   3.9 kB
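
A downloaded archive can be checked against its MD5 checksum before unpacking. The sketch below uses only the Python standard library; the function names md5_of and verify are hypothetical, and which checksum belongs to which file has to be taken from the record page, since the pairing is not shown in the list above.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":")[-1].strip().lower()
    return md5_of(path) == expected_hex


# Example call with a hypothetical local file name and a placeholder checksum:
# verify("downloaded_file.7z", "md5:<checksum from the record page>")
```

Reading in chunks keeps memory use low even for the multi-gigabyte archives.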

Additional details

Funding

The role of glycosylation in the emergence of animal multicellularity 2020/36/C/NZ8/00081
National Science Center

Software

Development Status
Active