High-quality large curated dataset of protein sequences (1.83 million) and their corresponding Position Specific Scoring Matrices

Arab, Issar; Heinzinger, Michael; Rost, Burkhard; Cavalli-Sforza, Violetta

doi:10.5281/zenodo.4300971

Published September 15, 2020 | Version v1

Dataset Open

1. Technical University of Munich
2. Al Akhawayn University in Ifrane

As part of his master's thesis at the Rostlab, located at the Technical University of Munich (TUM), Mr. Issar Arab developed the first language model that encodes evolutionary information of proteins explicitly. The pre-training involved the creation of a novel high-quality dataset of protein sequences (around 1.83 million proteins, or ~0.8 billion amino acids) with their corresponding Position Specific Scoring Matrices (PSSMs). These matrices reflect the relative frequency of each amino acid at each position in a protein and are derived from evolutionarily related proteins.
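To make the idea of a per-position amino acid profile concrete, here is a minimal sketch that computes relative frequencies column by column from a toy alignment. It is an illustration only: the function name `column_frequencies` and the toy sequences are hypothetical, the sketch does not reproduce how the PSSMs in this dataset were generated, and production PSSMs typically add steps such as pseudocounts, sequence weighting, and log-odds scoring. The actual file layout is described in the accompanying Jupyter notebook, not here.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment):
    """Return one dict per alignment column mapping amino acid -> relative frequency.

    Gaps and non-standard symbols are ignored when counting a column.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for pos in range(length):
        residues = [seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS]
        counts = Counter(residues)
        total = len(residues)
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Hypothetical toy alignment of four related sequences ("-" marks a gap).
toy_alignment = ["MKV", "MRV", "MKI", "MK-"]
profile = column_frequencies(toy_alignment)

for pos, column in enumerate(profile, start=1):
    observed = {aa: round(freq, 2) for aa, freq in column.items() if freq > 0}
    print(pos, observed)
```

Each column of the result sums to 1 over the observed residues, so conserved positions show one dominant amino acid while variable positions spread their weight across several.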

Mr. Arab makes this work publicly available to help other researchers leverage AI to learn representations of protein evolutionary information more explicitly. The set of sequences was derived by extracting all PSSMs from the PredictProtein (PP) cache whose proteins were also part of the UniProt Reference Cluster with 50% sequence identity (UniRef50 2019_12). The overlap between PP and UniRef50 was further filtered to include only high-quality samples, e.g. only multiple sequence alignments with a certain number of aligned sequences were considered. The processing led to a training set of 1.83 million sequences, a validation set of 879 instances, and a test set of 879 entries. The training set is reduced to 40% sequence identity with respect to the validation and test sets, and contains sequences ranging between 18 and 9858 residues in length.

Refer to the Jupyter notebook for a detailed description of the files' structure and a Python code snippet to correctly manipulate this data.

To access the full original work, please visit the following link: Manuscript

Note: The dataset was recently used to fine-tune a protein sequence language model (PEvoLM). The work was presented at the CIBCB'23 conference. If you use PEvoLM or this dataset in your work, please cite the following publication:

- Issar Arab, PEvoLM: Protein Sequence Evolutionary Information Language Model, IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Eindhoven, Netherlands, (2023), pp. 1-8, doi:10.1109/CIBCB56990.2023.10264890

Files

Files (9.9 GB)

Name	MD5	Size
Dataset.zip	c7a908265b25dcdbdab98aac626d7a70	9.9 GB
Script_snippet_to_manipulate_the_dataset.ipynb	c60c494d6164cf8e7410d2d973f7566d	3.7 kB
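
Because the archive is large, it can be worth checking a download against the MD5 checksums listed above before unpacking it. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify` are hypothetical, and the sketch assumes the files were saved under their original names in the current working directory.

```python
import hashlib

# Checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "Dataset.zip": "c7a908265b25dcdbdab98aac626d7a70",
    "Script_snippet_to_manipulate_the_dataset.ipynb": "c60c494d6164cf8e7410d2d973f7566d",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected_md5.lower()


# Example usage (assumes the file has already been downloaded):
# if not verify("Dataset.zip", EXPECTED_MD5["Dataset.zip"]):
#     raise SystemExit("Checksum mismatch: the download may be incomplete.")
```

A mismatch usually points to an interrupted or corrupted download, in which case fetching the file again is the simplest remedy.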

Additional details

The UniProt Consortium (2019). UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research, 47(D1), D506–D515. doi:10.1093/nar/gky1049
Rost, B., Yachdav, G., & Liu, J. (2004). The PredictProtein server. Nucleic Acids Research, 32(Web Server issue), W321–W326. doi:10.1093/nar/gkh377
