Data for 'VespaG: Expert-guided protein language models enable accurate and blazingly fast fitness prediction'

Marquet, Céline; Schlensok, Julius; Abakarova, Marina; Rost, Burkhard; Laine, Elodie

doi:10.5281/zenodo.11085958

Published April 29, 2024 | Version v1

Dataset Open

Datasets used for development of VespaG and VespaG predictions generated with https://github.com/JSchlensok/VespaG.

Uploads contain:

Performance summaries for ProteinGym [1]:
- Spearman and Pearson correlation for VespaG: proteingym_performance_vespag.csv (columns: 'DMS_id', 'Spearman', 'Pearson')
- Spearman correlation for evaluated methods VespaG, GEMME [2], VESPA [3], TranceptEVE [4], AlphaMissense [5], PoET [6]: proteingym_spearman_allmethods.csv (columns: 'DMS_id', 'Trancept EVE-L', 'VESPA', 'VespaG', 'GEMME', 'AlphaMissense', 'PoET', 'UniProt_ID', 'coarse_selection_type' (function), 'taxon')
FASTA files with sequences for all training sets (vespag_fasta_training_datasets.zip, containing seq_all9k.fasta, seq_human5k.fasta, seq_droso4k.fasta, seq_ecoli2k.fasta, seq_virus1k.fasta) and the test set (proteingym_217.fasta)
VespaG predictions for the test set: vespag_proteingym_rawpreds_by_training_dataset.zip with raw_preds_ecoli.csv, raw_preds_human.csv, raw_preds_virus.csv, raw_preds_all.csv, raw_preds_droso.csv (columns: 'DMS_id', 'mutation', 'DMS_score', 'VespaG'). Each file holds predictions from a model trained on a different training set. The final VespaG model was trained on a subset of the human proteome, so its raw predictions for the ProteinGym benchmark are in raw_preds_human.csv (used to calculate the performances above).
GEMME predictions for the training sets: gemme_rawpreds_by_training_dataset.zip with folders 'human', 'droso', 'ecoli', 'virus', 'all', one per training FASTA file (each containing GEMME mutational landscape output files named 'ID' + '_normPred_evolCombi.txt')
ESM-2 embeddings [7] for test set (proteingym_217_esm2.h5)
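
To illustrate how the raw prediction files relate to the performance summaries, the following minimal Python sketch computes one Spearman correlation per assay between the 'DMS_score' and 'VespaG' columns, grouped by 'DMS_id'. It assumes the CSV files are comma-separated, have a header row, and contain no missing values. The helper names (average_ranks, spearman, spearman_per_assay) and the toy rows are hypothetical, and this is not necessarily the exact procedure used to produce proteingym_performance_vespag.csv.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_per_assay(handle):
    """Group rows by 'DMS_id' and correlate 'DMS_score' with 'VespaG'."""
    by_assay = defaultdict(lambda: ([], []))
    for row in csv.DictReader(handle):
        measured, predicted = by_assay[row["DMS_id"]]
        measured.append(float(row["DMS_score"]))
        predicted.append(float(row["VespaG"]))
    return {dms: spearman(m, p) for dms, (m, p) in by_assay.items()}


# Toy stand-in for raw_preds_human.csv (invented values, same column names).
toy_csv = io.StringIO(
    "DMS_id,mutation,DMS_score,VespaG\n"
    "assay_A,M1A,0.1,-3.0\n"
    "assay_A,M1C,0.5,-2.0\n"
    "assay_A,M1D,0.9,-1.0\n"
    "assay_B,K2A,1.0,-0.5\n"
    "assay_B,K2C,0.2,-0.1\n"
    "assay_B,K2D,0.6,-0.3\n"
)
per_assay = spearman_per_assay(toy_csv)
```

For the real data, the in-memory toy_csv would be replaced by an opened file, for example open("raw_preds_human.csv", newline=""), after extracting the archive.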

For details on VespaG see:

VespaG: Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction

Céline Marquet, Julius Schlensok, Marina Abakarova, Burkhard Rost, Elodie Laine

bioRxiv 2024.04.24.590982; doi: https://doi.org/10.1101/2024.04.24.590982

For more information on data usage and generation please see https://github.com/JSchlensok/VespaG.

Abstract:

Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast single amino acid variant effect predictor, leveraging embeddings of protein Language Models as input to a minimal deep learning model. To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from the human proteome applying the multiple sequence alignment-based effect predictor GEMME as a pseudo standard-of-truth. Assessed against the ProteinGym Substitution Benchmark (217 multiplex assays of variant effect with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48 +/- 0.01, matching state-of-the-art methods such as GEMME, TranceptEVE, PoET, AlphaMissense, and VESPA. VespaG reached its top-level performance several orders of magnitude faster, predicting all mutational landscapes of the human proteome in 30 minutes on a consumer laptop (12-core CPU, 16 GB RAM).

[1] Notin, Pascal, et al. "ProteinGym: large-scale benchmarks for protein fitness prediction and design." Advances in Neural Information Processing Systems 36 (2024).

[2] Laine, Elodie, Yasaman Karami, and Alessandra Carbone. "GEMME: a simple and fast global epistatic model predicting mutational effects." Molecular biology and evolution 36.11 (2019): 2604-2619.

[3] Marquet, Céline, et al. "Embeddings from protein language models predict conservation and variant effects." Human genetics 141.10 (2022): 1629-1647.

[4] Notin, Pascal, et al. "TranceptEVE: Combining family-specific and family-agnostic models of protein sequences for improved fitness prediction." bioRxiv (2022): 2022-12.

[5] Cheng, Jun, et al. "Accurate proteome-wide missense variant effect prediction with AlphaMissense." Science 381.6664 (2023): eadg7492.

[6] Truong Jr, Timothy, and Tristan Bepler. "PoET: A generative model of protein families as sequences-of-sequences." Advances in Neural Information Processing Systems 36 (2024).

[7] Lin, Zeming, et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379.6637 (2023): 1123-1130.

Files

Files (2.2 GB)

Name	MD5	Size
gemme_rawpreds_by_training_dataset.zip	2bcdcf7183e1203779e72ac61196ac83	1.2 GB
proteingym_217.fasta	d6eb5c0393041329db0aae192f593440	92.7 kB
proteingym_217_esm2.h5	0aa1df362afedbe4d76f7e4edc9a5196	882.6 MB
proteingym_performance_vespag.csv	7ff52a6f863a07c6e45a141b4f7ba82b	14.4 kB
proteingym_spearman_allmethods.csv	12befd1b740a934ec7c4b60aefcaf283	26.7 kB
vespag_fasta_training_datasets.zip	c20de9f0b1f843406427a46303d898ce	4.9 MB
vespag_proteingym_rawpreds_by_training_dataset.zip	b14e2decf4186b6597712d1f28b8c80e	163.7 MB
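
The MD5 checksums in the file listing above can be used to confirm that a download is complete. The sketch below uses only the Python standard library. The function names (md5_of_file, check_downloads) are hypothetical, and it assumes the files were saved under their original names in a single directory.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "gemme_rawpreds_by_training_dataset.zip": "2bcdcf7183e1203779e72ac61196ac83",
    "proteingym_217.fasta": "d6eb5c0393041329db0aae192f593440",
    "proteingym_217_esm2.h5": "0aa1df362afedbe4d76f7e4edc9a5196",
    "proteingym_performance_vespag.csv": "7ff52a6f863a07c6e45a141b4f7ba82b",
    "proteingym_spearman_allmethods.csv": "12befd1b740a934ec7c4b60aefcaf283",
    "vespag_fasta_training_datasets.zip": "c20de9f0b1f843406427a46303d898ce",
    "vespag_proteingym_rawpreds_by_training_dataset.zip": "b14e2decf4186b6597712d1f28b8c80e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True (match), False (mismatch) or None (not found)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == checksum if path.is_file() else None
    return results
```

Calling check_downloads(".") in the download directory returns one entry per file. A value of False suggests a corrupted or incomplete download that should be fetched again.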

Additional details

Is supplement to: Preprint: 10.1101/2024.04.24.590982 (DOI)

Repository URL: https://github.com/JSchlensok/VespaG
Programming language: Python
