High-quality large curated dataset of protein sequences (1.83 million) and their corresponding Position Specific Scoring Matrices
- 1. Technical University of Munich
- 2. Al Akhawayn University in Ifrane
Description
As part of his master's thesis at the Rostlab at the Technical University of Munich (TUM), Mr. Issar Arab developed the first language model that encodes evolutionary information of proteins explicitly. Pre-training involved creating a novel high-quality dataset of protein sequences (around 1.83 million proteins, or ~0.8 billion amino acids) with their corresponding Position Specific Scoring Matrices (PSSMs). These matrices reflect the relative frequency of each amino acid at each position in a protein and are derived from evolutionarily related proteins.
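To make the idea of a PSSM concrete, the following is a minimal sketch that builds a toy log-odds matrix from a small hand-written alignment. It is an illustration only: the function name `toy_pssm`, the uniform background distribution, the add-one pseudocount, and the example alignment are all assumptions made for this sketch, and the matrices in the dataset were produced by the PredictProtein pipeline, not by this code.

```python
import math
from collections import Counter

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Hypothetical helper for illustration: scores are log2 of the
    pseudocount-smoothed column frequency over a uniform background.
    """
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")

    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        # Count residues in this column, skipping gaps and unknown symbols.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Made-up alignment of four related sequences; '-' marks a gap.
toy_alignment = ["MKTAY", "MKSAY", "MRTAF", "MK-AY"]
pssm = toy_pssm(toy_alignment)
```

Each entry of `pssm` corresponds to one position of the protein: residues that occur often in a column receive positive scores, and residues that never occur receive negative ones.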
Mr. Arab makes this work publicly available to help other researchers use AI to learn representations of protein evolutionary information more explicitly. The set of sequences was derived by extracting all PSSMs from the PredictProtein (PP) cache that were also part of the UniProt Reference Cluster with 50% sequence identity (UniRef50 2019_12). The overlap between PP and UniRef50 was further filtered to include only high-quality samples, e.g. only multiple sequence alignments with a certain number of aligned sequences were considered. The processing led to a training set of 1.83 million sequences, a validation set of 879 sequences, and a test set of 879 sequences. The training set is reduced to 40% sequence identity with respect to the validation and test sets, and contains sequences ranging from 18 to 9858 residues in length.
Refer to the Jupyter notebook for a detailed description of the files' structure and a Python code snippet to correctly manipulate this data.
To access the full original work, please visit the following link: Manuscript
Note: The dataset was recently used to fine-tune a protein sequence language model (PEvoLM). The work was presented at the CIBCB'23 conference. If you use PEvoLM or this dataset in your work, please cite the following publication:
- Issar Arab, PEvoLM: Protein Sequence Evolutionary Information Language Model, IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Eindhoven, Netherlands, (2023), pp. 1-8, doi:10.1109/CIBCB56990.2023.10264890
Files
Dataset.zip (9.9 GB)
File (MD5 checksum) | Size |
---|---|
md5:c7a908265b25dcdbdab98aac626d7a70 | 9.9 GB |
md5:c60c494d6164cf8e7410d2d973f7566d | 3.7 kB |
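Because the archive is large, it can be worth checking a download against the MD5 checksums listed for the files above before unpacking it. The snippet below is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local file path is whatever location you saved the download to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading avoids loading a multi-gigabyte file into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, assuming the 9.9 GB entry is the archive saved locally as `Dataset.zip`, `matches_checksum("Dataset.zip", "c7a908265b25dcdbdab98aac626d7a70")` should return `True` for an intact download.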
Additional details
References
- The UniProt Consortium, UniProt: a worldwide hub of protein knowledge, Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D506–D515, https://doi.org/10.1093/nar/gky1049
- Rost, B., Yachdav, G., & Liu, J. (2004). The PredictProtein server. Nucleic Acids Research, 32(Web Server issue), W321–W326. doi:10.1093/nar/gkh377