The BenchStab dataset: a dataset for comparing mutational predictors of stability
Authors/Creators
Description
This dataset is part of BenchStab, a command-line tool for querying and benchmarking web-based protein stability predictors. We created the dataset to independently evaluate 18 structure-enabled and 4 sequence-based predictors of stability change upon mutation. We recommend excluding this dataset from the training and validation of future stability predictors.
The dataset consists of single-point mutations and their experimentally determined ΔΔG values from FireProtDB, restricted to records that have both a ΔΔG measurement and a PDB accession code. Using UniRef50 clusters, we removed all records similar to data in the training set of any predictor considered in BenchStab. This left 289 records for 36 proteins, of which 28% display a stabilizing effect (negative ΔΔG; see DDG distribution.png for the exact distribution). We further confirmed, using SCOP fold-based structure clustering, that the folds of 25 of our proteins were not present in the training sets.
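The cluster-based exclusion and the stabilizing fraction described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the record keys `uniref50` and `ddg`, the function names, and the toy values are all assumptions made for illustration. The only convention taken from the description is that a negative ΔΔG counts as stabilizing.

```python
# Hypothetical record layout: each record is a dict with a "uniref50"
# cluster ID and a "ddg" value. These keys are assumptions, not the
# actual column names of dataset.csv.

def exclude_training_clusters(records, training_clusters):
    """Keep only records whose UniRef50 cluster is absent from the training sets."""
    blocked = set(training_clusters)
    return [r for r in records if r["uniref50"] not in blocked]


def stabilizing_fraction(records):
    """Fraction of records with negative ddG (stabilizing under the sign convention above)."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["ddg"] < 0) / len(records)


# Made-up toy data, for illustration only.
toy_records = [
    {"uniref50": "UniRef50_A", "ddg": -0.5},
    {"uniref50": "UniRef50_B", "ddg": 1.2},
    {"uniref50": "UniRef50_C", "ddg": 0.3},
    {"uniref50": "UniRef50_C", "ddg": -1.1},
]

# Pretend that cluster UniRef50_B appears in some predictor's training set.
kept = exclude_training_clusters(toy_records, {"UniRef50_B"})
print(len(kept), stabilizing_fraction(kept))
```

Filtering at the cluster level rather than by exact sequence match is what removes close homologs of training proteins, not just identical entries.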
The file dataset.csv contains the mutation specifications (including the chain) and the ground-truth ΔΔG values reported in the literature, alongside accession codes from FireProtDB (experiment ID), UniProt, and the Protein Data Bank, as well as UniRef50 cluster IDs. The file benchstab_input.csv contains the same data in the input format of the BenchStab tool.
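As a rough sketch of how a file like dataset.csv might be read and summarized per protein, the example below parses an inline CSV sample with the Python standard library. The header names (`pdb_id`, `chain`, `mutation`, `ddg`) and the rows are hypothetical placeholders. Check the actual header of dataset.csv before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical header and made-up rows; the real dataset.csv may use
# different column names and contains further columns (accession codes,
# UniRef50 cluster IDs).
SAMPLE_CSV = """pdb_id,chain,mutation,ddg
1ABC,A,L10A,1.4
1ABC,A,G25P,-0.3
2XYZ,B,K7E,0.0
"""


def load_records(handle):
    """Read CSV rows into dicts, converting the ddG column to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["ddg"] = float(row["ddg"])
        rows.append(row)
    return rows


records = load_records(io.StringIO(SAMPLE_CSV))

# Count how many mutation records each structure contributes.
per_protein = Counter(r["pdb_id"] for r in records)
print(per_protein)
```

For a file on disk, the same function can be called as `load_records(open("dataset.csv", newline=""))`, provided the column names are adjusted to match the real header.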
For more statistics and details about the dataset, please read the supplement of the paper or get in touch with us.
Files
dataset.csv
Additional details
Related works
- Is derived from
- Data paper: 10.1093/nar/gkaa981 (DOI)
- Is supplement to
- Journal article: 10.1093/bioinformatics/btae553 (DOI)