Published April 2024 | Version v1
Dataset Open

The BenchStab dataset: a dataset for comparing mutational predictors of stability

  • 1. Loschmidt Laboratories
  • 2. Masaryk University
  • 3. Brno University of Technology
  • 4. International Clinical Research Center, St. Anne's University Hospital Brno
  • 5. Enantis (Czechia)

Description

This dataset is part of BenchStab, a command-line tool for querying and benchmarking web-based protein stability predictors. We created it to independently evaluate 18 structure-enabled and 4 sequence-based predictors of the change in stability upon mutation. We recommend excluding this dataset from the training and validation of future stability predictors.

The dataset consists of single-point mutations and their experimentally determined ΔΔG values from FireProtDB; only records with both a ΔΔG measurement and a PDB accession code were used. Using UniRef50 clusters, we eliminated all records similar to the data in the training set of any predictor considered in BenchStab. This resulted in 289 records for 36 proteins, with 28% of the records displaying a stabilizing effect (negative ΔΔG; see DDG distribution.png for the exact distribution). SCOP fold-based structure clustering further confirmed that the folds of 25 of our proteins were not present in the training sets.

The file dataset.csv contains the mutation specifications (including the chain) and the ground-truth ΔΔG values reported in the literature, alongside accession codes from FireProtDB (experiment ID), UniProt, and the Protein Data Bank, as well as UniRef50 cluster IDs. The file benchstab_input.csv contains the same data in the input format of the BenchStab tool.
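The summary figures quoted above (number of records, number of proteins, share of stabilizing mutations) can be recomputed from dataset.csv with a few lines of Python. The sketch below is only an illustration: the column names `ddg` and `uniprot` are assumptions, not the documented header of the file, so check the actual header and pass the matching names.

```python
import csv


def summarize(rows, ddg_col="ddg", protein_col="uniprot"):
    """Count records, distinct proteins, and the fraction of stabilizing mutations.

    `rows` is any iterable of dict-like records (e.g. a csv.DictReader).
    The default column names are hypothetical; adjust them to the real header.
    """
    n_records = 0
    n_stabilizing = 0
    proteins = set()
    for row in rows:
        ddg = float(row[ddg_col])
        n_records += 1
        # The dataset description treats a negative ddG as stabilizing.
        if ddg < 0:
            n_stabilizing += 1
        proteins.add(row[protein_col])
    fraction = n_stabilizing / n_records if n_records else 0.0
    return {
        "records": n_records,
        "proteins": len(proteins),
        "stabilizing_fraction": fraction,
    }


def summarize_csv(path, **column_names):
    """Convenience wrapper that reads the CSV file at `path`."""
    with open(path, newline="", encoding="utf-8") as handle:
        return summarize(csv.DictReader(handle), **column_names)
```

For example, `summarize_csv("dataset.csv", ddg_col=..., protein_col=...)` with the real column names should give values comparable to those stated in the description, provided each row is one record and proteins are identified by a single accession column.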

For more statistics and details about the dataset, please read the supplement of the paper or get in touch with us.

Files

Files (39.0 kB)

Checksum Size
md5:47be804c4c99d8f88f77936ae9070426 3.6 kB
md5:da611ce2f6f364c1b24a7f16ad861206 14.2 kB
md5:5840969f837f59076a1ad13877e63f74 21.3 kB
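
The MD5 checksums listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are illustrative, and since the listing does not show which checksum belongs to which file, compare each download against all three values.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

A call such as `matches_checksum("dataset.csv", "md5:47be804c4c99d8f88f77936ae9070426")` returns `True` only if that checksum is the one for the given file.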

Additional details

Related works

Is derived from
Data paper: 10.1093/nar/gkaa981 (DOI)
Is supplement to
Journal article: 10.1093/bioinformatics/btae553 (DOI)