LukProt - an animal evolution-centric eukaryotic protein database

doi:10.5281/zenodo.13829058

Published September 24, 2024 | Version v1.5.1.rev2

Dataset Open

Sobala, Łukasz F.¹

1. Hirszfeld Institute of Immunology and Experimental Therapy, PAS

LukProt is the EukProt database with additional species added, mostly from undersampled animal taxa and some holozoan taxa. The database is composed of sequences translated from annotated genomes, transcriptomes or ESTs. The main purposes of the database are to consolidate sequences from undersampled animal taxa and to provide usable search tools. The publication associated with LukProt can be found here: https://doi.org/10.1093/gbe/evae231.

The current version of the database (v1.5.1) is based on EukProt v3. The home of all public versions of LukProt is this page (Zenodo).

Proteomes that are novel in LukProt are denoted as LPXXXXX and those coming from AniProtDB are called APXXXXX. The sequence IDs from EukProt are conserved in LukProt. This means that each sequence is assigned an ID in the following format:

(A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY

where XXXXX is a number from 00001 to 99999 and YYYYYY is a number from 000001 to 999999. Each sequence is assigned a number that is unique within its taxon. All the IDs are compatible with the BLAST v5 "-parse_seqids" option and the database can be readily deployed, for example on a server running SequenceServer. Within each of the source FASTA files, the source sequence identifier was kept after a blank space, so that it can still be retrieved if needed.
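The ID format above can be split into its parts with plain POSIX shell parameter expansion. The following is a minimal sketch, not part of the LukProt distribution: the function names parse_lukprot_id and source_id_from_header, and the example ID, are hypothetical. It assumes that the taxon part (species, epithet and optional strain) may itself contain underscores, and that the sequence number is always the final _P plus six digits.

```sh
# Split a LukProt sequence ID into three tab-separated fields:
# proteome ID, taxon name (with optional strain), per-taxon sequence number.
# Returns 1 if the ID does not follow (A/E/L)PXXXXX_..._PYYYYYY.
parse_lukprot_id() {
    id=$1
    case $id in
        [AEL]P[0-9][0-9][0-9][0-9][0-9]_?*_P[0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) return 1 ;;
    esac
    proteome=${id%%_*}    # everything before the first underscore
    rest=${id#*_}         # taxon name plus the trailing _PYYYYYY
    taxon=${rest%_P*}     # drop the final _PYYYYYY (shortest suffix match)
    seqnum=${id##*_P}     # digits after the last _P
    printf '%s\t%s\t%s\n' "$proteome" "$taxon" "$seqnum"
}

# Print the original source identifier kept after the first blank
# in a FASTA header line. Returns 1 if the header has no blank.
source_id_from_header() {
    header=${1#>}
    case $header in
        *" "*) printf '%s\n' "${header#* }" ;;
        *) return 1 ;;
    esac
}
```

For instance, parse_lukprot_id LP00001_Genus_species_P000042 (a made-up ID) prints the three fields LP00001, Genus_species and 000042 separated by tabs.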

A public BLAST server for searching LukProt is available at: https://lukprot.hirszfeld.pl/.

Comparison of EukProt v2/v3, LukProt v1.4.1 and LukProt v1.5.1 in their main areas of difference:

Taxogroup	EukProt v2	EukProt v3	LukProt v1.4.1	LukProt v1.5.1
Holozoa (excluding Metazoa)	31	40	39	43
Ctenophora	2	2	35	38
Porifera	4	5	30	47
Placozoa	2	2	3	6
Cnidaria	3	5	65	88
Bilateria	51	51	94	142

Included with the database are:

ready to use main database files:
- LukProt_v1.5.1_single_species_FASTA.7z – FASTA files with the sequences, 7-zipped, uncompressed size: 17.6 GB
  - to concatenate all into one file, run this in the parent directory: for file in $(find . -type f -name "*.fasta"); do awk 'FNR==1{print ""}1' "$file" >> LukProt_v1.5.1.fa; done. This will create a single FASTA file with all the sequences in the parent directory. awk is used to insert an empty line at the start of each file's output, because plain cat can merge the last sequence of one file with the first header of the next when a file does not end with a newline.
- LukProt_v1.5.1_full_BLAST_db.7z – a preformatted, full BLAST database (NCBI BLAST database format version: v5, masked with segmasker), uncompressed size: 28.3 GB
- LukProt_v1.5.1_taxogroup_BLAST_db.7z – a collection of BLAST databases where each taxogroup is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.3 GB
- LukProt_v1.5.1_single_species_BLAST_db.7z – a collection of BLAST databases where each proteome is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.4 GB
auxiliary database files:
- LukProt_v1.5.1.cdhit70.7z – the full database clustered at 70% identity using CD-HIT with the following command: cd-hit -g 1 -d 0 -T 20 -M 90000 -c 0.7 -uL 0.2 -uS 0.9 -s 0.2, uncompressed sizes: fasta file - 11 GB, clstr file - 2.5 GB
- LukProt_IDs_mapped.txt.gz – a text file mapping the LukProt IDs to the AniProtDB and EukProt IDs where these differ
- BUSCO_tables.ods – a spreadsheet with full result tables generated by BUSCO analysis
- OMAmer_output.zip – a folder with full results of OMAmer analyses (includes per-sequence taxonomy classification)
- OMArk_output.zip – a folder with the results of all OMArk analyses
metadata:
- README.md – a README file describing the metadata
- LukProt_metadata_sheet.ods – main metadata file. A spreadsheet with information about each proteome (in an open .ods format, most compatible with LibreOffice)
- LukProt_metadata_other.zip – an archive with other metadata files, documented in the README. Contents include:
  - the LukProt taxonomy in various formats
  - supporting scripts for data manipulation and visualization
  - a recoloring script (modified by LFS, originally by Dr. Celine Petitjean). The script is in the public domain and is reuploaded here only for convenience.
  - other files - see README
changelog.md – database changelog
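Once one of the BLAST database archives has been extracted (for example with 7z x LukProt_v1.5.1_full_BLAST_db.7z), it can be queried locally with the standard NCBI BLAST+ tools. The wrappers below are only a sketch: the function names are hypothetical, the database base name inside the archive is not stated here and has to be supplied by the user, and the E-value cutoff of 1e-5 is an arbitrary illustrative choice. Retrieval by ID with blastdbcmd assumes the database was built with -parse_seqids.

```sh
# Hypothetical wrapper: protein search against an extracted LukProt database.
# $1 = database base name (path without file extension; an assumption,
#      it depends on where and how the archive was extracted)
# $2 = query FASTA file
# $3 = output file (tabular BLAST output, format 6)
search_lukprot() {
    blastp -db "$1" -query "$2" -evalue 1e-5 -outfmt 6 -out "$3"
}

# Hypothetical wrapper: fetch a single sequence by its LukProt ID.
# $1 = database base name, $2 = sequence ID
fetch_lukprot_seq() {
    blastdbcmd -db "$1" -entry "$2"
}
```

A call would look like search_lukprot path/to/db_basename query.fasta hits.tsv, where all three arguments are placeholders.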

Words of caution:

The database has been synchronized to EukProt v3 in version v1.5.1. This means that identifiers were modified in comparison to LukProt v1.4.1. The convention is not expected to change any more in future updates.
Many proteomes, especially transcriptome-based ones, may contain contamination from other species. In addition, translation algorithms often introduce errors (e.g. a transcript may not represent a full-length protein). For this reason, users who need accurate sequences from a given organism are directed to the source data and to the included OMAmer, OMArk and BUSCO results for details.
The taxonomy is different to UniEuk/EukMap, but UniEuk data were integrated where possible.
A few NCBI taxids are missing and will be added in due course.
Proteomes from NCBI and UniProt will be updated to current versions.
A number of proteomes that appear in some metadata files are unpublished and were held back.
While the database contains metadata that present a particular phylogeny of animals, holozoans and other eukaryotes, no particular claims or hypotheses are made by the author(s). However, in the future efforts will be made to name clades officially, once they are more firmly established.

Please report any problems or suggestions to Lukasz Sobala: lukasz.sobala (at) hirszfeld.pl.

Acknowledgements:

Andrew E. Allen Lab for creating the original PhyloDB.
Daniel Richter et al. for creating EukProt and keeping it updated.
Members of the Multicellgenome Lab, especially Michelle Leger (for donating her database), for the bioinformatics support and for doing great science.
All the authors of the original data.
National Science Centre of Poland for funding of the project 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity", which enabled the creation of this database.

Files (29.7 GB)

Name	MD5	Size
BUSCO_tables.ods	e5622f0ec615bdadc24c59d12e92060d	13.5 MB
changelog.md	b413079bee3e16bc34cb87891add8ac6	2.1 kB
LukProt_IDs_mapped.txt.gz	f128142282b40dba9e3bdb4eb31f204b	38.9 MB
LukProt_metadata_other.zip	0ff1fcab6a667c0f0d794644789718f1	6.1 MB
LukProt_metadata_sheet.ods	fbfb9135266ef77d8f07b21cd24ff217	886.5 kB
LukProt_v1.5.1.cdhit70.7z	149b70757cc4ef2eceee791fdd8a79e4	4.4 GB
LukProt_v1.5.1_full_BLAST_db.7z	8831ca047c7142d98ddb07808e238311	6.3 GB
LukProt_v1.5.1_single_species_BLAST_db.7z	a2d07ee76473cdaf262070f732799d4e	6.1 GB
LukProt_v1.5.1_single_species_FASTA.7z	57f80efc4d9eef5cef3156cf305bed11	5.6 GB
LukProt_v1.5.1_taxogroup_BLAST_db.7z	a7d9b28c1df6ae5af894cfbca20e3748	5.9 GB
OMAmer_output.zip	d2f6106fe681f00282ca32220fb46976	1.1 GB
OMArk_output.zip	20ca89d4ee5caac2476868413d7a1a65	207.2 MB
README.md	1cde49a5f87878807e0b55717375a35d	3.9 kB
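
Because the archives are several gigabytes each, it can be worth checking a download against the md5 value listed in the file table before extracting it. The following is a minimal sketch with a hypothetical function name, verify_md5. It assumes the GNU coreutils md5sum command is available; other systems may provide a differently named tool.

```sh
# Compare the md5 checksum of a downloaded file with an expected value.
# Usage: verify_md5 FILE EXPECTED_MD5
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
verify_md5() {
    file=$1
    expected=$2
    if [ ! -r "$file" ]; then
        printf 'cannot read %s\n' "$file" >&2
        return 2
    fi
    # Read from stdin so the output is just "<hash>  -"
    actual=$(md5sum < "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        printf 'OK       %s\n' "$file"
    else
        printf 'MISMATCH %s (got %s)\n' "$file" "$actual" >&2
        return 1
    fi
}
```

For example, verify_md5 README.md 1cde49a5f87878807e0b55717375a35d checks the README against the value given in the table.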

Additional details

Cites: Software: 10.5281/zenodo.10654583 (DOI); Dataset: 10.6084/m9.figshare.10001870.v3 (DOI); Dataset: 10.6084/m9.figshare.7108433.v1 (DOI); Dataset: 10.6084/m9.figshare.6233573.v1 (DOI); Dataset: 10.6084/m9.figshare.5848068.v1 (DOI); Dataset: 10.6084/m9.figshare.6125030.v1 (DOI); Dataset: 10.6084/m9.figshare.6124802.v1 (DOI); Dataset: 10.6084/m9.figshare.5426494.v1 (DOI); Dataset: 10.6084/m9.figshare.8299529.v2 (DOI); Dataset: 10.6084/m9.figshare.20497143.v1 (DOI); Dataset: 10.6084/m9.figshare.6819812 (DOI); Dataset: 10.7939/R3S000 (DOI); Dataset: 10.7910/DVN/INLEPM (DOI); Dataset: 10.5061/DRYAD.TN0F3 (DOI); Dataset: 10.5061/DRYAD.R2N70 (DOI); Dataset: 10.5061/DRYAD.JDFN2Z3CV (DOI); Dataset: 10.5061/DRYAD.DNCJSXM47 (DOI); Dataset: 10.5061/DRYAD.7717Q (DOI); Dataset: 10.5282/UBM/DATA.202 (DOI); Dataset: 10.7939/R3794177K (DOI); Dataset: 10.7939/R30R9M73W (DOI); Dataset: 10.7910/DVN/25071 (DOI); Dataset: 10.7910/DVN/24737 (DOI); Dataset: 10.6084/M9.FIGSHARE.1334306.V3 (DOI); Dataset: 10.6084/M9.FIGSHARE.6771635.V2 (DOI); Dataset: 10.5061/DRYAD.50DC6 (DOI); Dataset: 10.5061/DRYAD.6CM1166 (DOI); Dataset: 10.6084/m9.figshare.22126232.v1 (DOI); Dataset: 10.5524/100483 (DOI)
Has part: Dataset: 10.6084/m9.figshare.12417881.v3 (DOI); Dataset: https://research.nhgri.nih.gov/aniprotdb/ (URL)
Is cited by: Journal article: 10.1093/glycob/cwad041 (DOI)
Is described by: Preprint: 10.1101/2024.01.30.577650 (DOI); Publication: 10.1093/gbe/evae231 (DOI)
Is referenced by: Peer review: 10.24072/pci.genomics.100368 (DOI)

Funding: National Science Center, grant 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity"

Development Status: Active

