Published July 26, 2023 | Version 1.0.0
Dataset Open

FireProtDB + PDB Structural Protein Stability Dataset

  • 1. Department of Biochemistry and Biophysics, University of North Carolina School of Medicine; Division of Chemical Biology and Medicinal Chemistry, University of North Carolina Eshelman School of Pharmacy
  • 2. Division of Chemical Biology and Medicinal Chemistry, University of North Carolina Eshelman School of Pharmacy
  • 3. Department of Biochemistry and Biophysics, University of North Carolina School of Medicine; Department of Bioinformatics and Computational Biology, University of North Carolina School of Medicine
  • 4. Department of Biochemistry and Biophysics, University of North Carolina School of Medicine; Department of Bioinformatics and Computational Biology, University of North Carolina School of Medicine; Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine

Description

Dataset compiled and curated for use in the ThermoMPNN paper: https://doi.org/10.1073/pnas.2314853121

A dataset for training models that predict the change in thermodynamic stability (ddG) caused by a protein point mutation, given a wild-type protein structure (PDB) file. The data were assembled by matching sequence-based ddG measurements in FireProtDB to structures from the RCSB Protein Data Bank (PDB). For details, see the Methods section of our manuscript.

Citing this work: If you choose to use this dataset for your own research, please cite this repository and the ThermoMPNN paper: https://doi.org/10.1073/pnas.2314853121.

 

Contents:

pdbs/ directory contains all PDB files

csvs/ directory contains all CSVs with mutation data

csvs/4_fireprotDB_bestpH.csv is the main (full) dataset file with 3,438 mutations across 100 proteins.

csvs/fireprot_splits.pkl contains the dataset splits (train/val/test) used in our study

csvs/splits/ contains CSVs for each of the splits (train/val/test/homologue-free), indexed from the full dataset CSV.
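The internal layout of csvs/fireprot_splits.pkl is not described here, so it is worth inspecting before use. The sketch below loads the file with Python's standard pickle module and reports what it finds. It assumes, without confirmation from this record, that the object is a dict mapping split names to collections of entries; summarize_splits is a hypothetical helper name. Only unpickle files from a source you trust, and note that if the file stores NumPy or pandas objects, those libraries must be installed for loading to succeed.

```python
import pickle


def summarize_splits(path):
    """Load a pickled splits object and summarize its contents.

    Assumption: the object is a dict of split name -> collection.
    Anything else is reported by type name only.
    """
    with open(path, "rb") as handle:
        splits = pickle.load(handle)

    if isinstance(splits, dict):
        # Report the size of each split where a length is defined.
        return {
            name: len(members) if hasattr(members, "__len__") else None
            for name, members in splits.items()
        }
    return type(splits).__name__
```

After unpacking the archive, summarize_splits("csvs/fireprot_splits.pkl") would show the split names and sizes, provided the dict assumption holds.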

Important CSV columns:

  • pdb_id_corrected: corresponds to the PDB in the pdbs/ directory (after curation and disambiguation)
  • ddG: ddG value for mutation (mutant - WT)
  • wild_type: wild-type amino acid (1-letter code)
  • mutation: mutant amino acid (1-letter code)
  • pdb_position: 0-based index of the mutated residue in the PDB file (may differ from the position in the original FireProtDB sequence entry)
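The columns above can be combined into conventional mutation labels such as A1G. The following is a minimal sketch using only the Python standard library. The sample rows are invented for illustration and are not taken from the dataset, mutation_labels is a hypothetical helper name, and adding 1 to pdb_position assumes the value is a 0-based index into the structure's residue sequence, as the column description suggests.

```python
import csv
import io


def mutation_labels(csv_text):
    """Return (pdb_id, label, ddG) tuples from CSV text.

    The label uses a 1-based position, i.e. pdb_position + 1.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # int(float(...)) tolerates positions written as "41.0";
        # whether the real file does this is not known.
        index0 = int(float(row["pdb_position"]))
        label = f'{row["wild_type"]}{index0 + 1}{row["mutation"]}'
        records.append((row["pdb_id_corrected"], label, float(row["ddG"])))
    return records


# Made-up rows for illustration only; not real dataset entries.
SAMPLE = """pdb_id_corrected,wild_type,mutation,pdb_position,ddG
1ABC,A,G,0,-1.2
1ABC,L,P,41,3.5
"""

for record in mutation_labels(SAMPLE):
    print(record)
```

To run this on the real data, read the text of csvs/4_fireprotDB_bestpH.csv and pass it to mutation_labels. The ddG sign follows the convention stated above (mutant - WT).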

 

Files

fireprot_upload.zip (5.4 MB)
md5:25cf04760665946b54b768eb7b6dfd1d
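The MD5 checksum listed for the archive can be used to confirm that a download completed without corruption. The sketch below computes the digest with Python's hashlib, reading the file in chunks so the whole archive does not need to be held in memory. The function names and the local file path are illustrative assumptions. MD5 detects accidental corruption but is not a safeguard against deliberate tampering.

```python
import hashlib

# Checksum as listed on this record for the uploaded archive.
EXPECTED_MD5 = "25cf04760665946b54b768eb7b6dfd1d"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

For example, archive_is_intact("fireprot_upload.zip") should return True for an undamaged copy saved under that name in the current directory.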