Published September 15, 2020 | Version v1
Dataset Open

High-quality large curated dataset of protein sequences (1.83 million) and their corresponding Position Specific Scoring Matrices

  • 1. Technical University of Munich
  • 2. Al Akhawayn University in Ifrane

Description

As part of his master's thesis at the Rostlab at the Technical University of Munich (TUM), Mr. Issar Arab developed the first language model that encodes evolutionary information of proteins explicitly. The pre-training involved the creation of a novel high-quality dataset of protein sequences (around 1.83 million proteins, or ~0.8 billion amino acids) with their corresponding Position Specific Scoring Matrices (PSSMs). These matrices reflect the relative frequency of each amino acid at each position in a protein and are derived from evolutionarily related proteins.
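To make the idea of a PSSM concrete, the following minimal sketch computes per-position relative frequencies and log-odds scores from a toy alignment. It is an illustration only, not the pipeline used to build this dataset: the function names, the pseudocount, the uniform background distribution, and the toy sequences are all assumptions, and production tools typically add sequence weighting and empirical background frequencies that are omitted here.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_frequencies(msa, pseudocount=1.0):
    """Return, for each alignment column, the relative frequency of each residue.

    `msa` is a list of equal-length aligned sequences (hypothetical toy input).
    Gaps and non-standard symbols are skipped; a pseudocount avoids zeros.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    freqs = []
    for pos in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in msa:
            residue = seq[pos]
            if residue in counts:
                counts[residue] += 1.0
        total = sum(counts.values())
        freqs.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return freqs


def log_odds(freqs, background=None):
    """Convert frequencies to log2-odds scores against a background distribution.

    A uniform background is assumed here purely for simplicity.
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    return [
        {aa: math.log2(col[aa] / background[aa]) for aa in AMINO_ACIDS}
        for col in freqs
    ]


# Toy alignment of four evolutionarily related (made-up) sequences.
toy_msa = ["MKV", "MRV", "MKI", "M-V"]
freqs = column_frequencies(toy_msa)
pssm = log_odds(freqs)  # one row of 20 scores per sequence position
```

Each row of `pssm` corresponds to one position of the protein; positive scores mark residues that are over-represented at that position among the aligned sequences, negative scores mark under-represented ones.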

Mr. Arab makes this work publicly available to help other researchers use AI to learn representations of protein evolutionary information more explicitly. The set of sequences was derived by extracting from the PredictProtein (PP) cache all PSSMs for proteins that were also part of the UniProt Reference Cluster with 50% sequence identity (UniRef50 2019_12). The overlap between PP and UniRef50 was further filtered to include only high-quality samples; for example, only multiple sequence alignments with a certain number of aligned sequences were considered. The processing yielded a training set of 1.83 million sequences, a validation set of 879 instances, and a test set of 879 entries. The training set is redundancy-reduced to 40% sequence identity with respect to the validation and test sets, and contains sequences ranging from 18 to 9858 residues in length.
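The selection logic described above (intersect the PP cache with UniRef50, then keep only sufficiently deep alignments) can be sketched roughly as follows. This is a hypothetical illustration, not the author's code: the function name, the input data structures, the identifiers, and in particular the `min_aligned` threshold are assumptions, since the description does not state the actual cut-off. The 40% redundancy reduction against the validation and test sets is not covered here.

```python
def select_candidates(pp_cache_depths, uniref50_ids, min_aligned):
    """Return identifiers present in both sources with a deep enough alignment.

    pp_cache_depths: dict mapping a protein identifier to the number of
        sequences in its multiple sequence alignment (hypothetical structure).
    uniref50_ids: set of identifiers found in the UniRef50 release.
    min_aligned: minimum number of aligned sequences; the real threshold is
        not given in the dataset description, so the caller must choose one.
    """
    overlap = pp_cache_depths.keys() & uniref50_ids  # proteins in both sources
    return sorted(pid for pid in overlap if pp_cache_depths[pid] >= min_aligned)


# Made-up identifiers and depths, with an arbitrary threshold for illustration.
example_depths = {"P1": 500, "P2": 3, "P3": 900}
example_uniref = {"P1", "P2", "P4"}
selected = select_candidates(example_depths, example_uniref, min_aligned=100)
```

In this toy run, `P3` is dropped because it is absent from the UniRef50 set and `P2` is dropped because its alignment is too shallow, leaving only `P1`.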

Refer to the Jupyter notebook for a detailed description of the files' structure and a Python code snippet to correctly manipulate this data.

To access the full original work, please visit the following link:  Manuscript 

Note: The dataset was recently used to fine-tune a protein sequence language model (PEvoLM). The work was presented at the CIBCB'23 conference. If you use PEvoLM or this dataset in your work, please cite the following publication:

- Issar Arab, PEvoLM: Protein Sequence Evolutionary Information Language Model, IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Eindhoven, Netherlands, (2023), pp. 1-8, doi:10.1109/CIBCB56990.2023.10264890

Files

Dataset.zip

Files (9.9 GB)

md5:c7a908265b25dcdbdab98aac626d7a70, 9.9 GB
md5:c60c494d6164cf8e7410d2d973f7566d, 3.7 kB
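Since the record lists MD5 checksums, a downloaded file can be checked for corruption before use. The sketch below is one way to do that with Python's standard library. It assumes the 9.9 GB checksum belongs to Dataset.zip, which the listing suggests but does not state explicitly; the function names and the local file path are hypothetical.

```python
import hashlib

# Checksum listed next to the 9.9 GB file (assumed here to be Dataset.zip).
EXPECTED_MD5 = "c7a908265b25dcdbdab98aac626d7a70"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 checksum."""
    return md5_of_file(path) == expected.lower()


# Example call (the path is a placeholder for wherever the file was saved):
# verify("Dataset.zip")
```

Reading in chunks keeps memory use small, which matters for an archive of this size.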

Additional details

References

  • The UniProt Consortium, UniProt: a worldwide hub of protein knowledge, Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D506–D515, https://doi.org/10.1093/nar/gky1049
  • Rost, B., Yachdav, G., & Liu, J. (2004). The PredictProtein server. Nucleic acids research, 32(Web Server issue), W321–W326. doi:10.1093/nar/gkh377