FireProtDB + PDB Structural Protein Stability Dataset
Authors/Creators
- 1. Department of Biochemistry and Biophysics, University of North Carolina School of Medicine; Division of Chemical Biology and Medicinal Chemistry, University of North Carolina Eshelman School of Pharmacy
- 2. Division of Chemical Biology and Medicinal Chemistry, University of North Carolina Eshelman School of Pharmacy
- 3. Department of Biochemistry and Biophysics, University of North Carolina School of Medicine; Department of Bioinformatics and Computational Biology, University of North Carolina School of Medicine
- 4. Department of Biochemistry and Biophysics, University of North Carolina School of Medicine; Department of Bioinformatics and Computational Biology, University of North Carolina School of Medicine; Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine
Description
Dataset compiled and curated for use in the ThermoMPNN paper: https://doi.org/10.1073/pnas.2314853121
This dataset is intended for training models that predict thermodynamic stability changes (ddG) of protein point mutations given a wild-type protein structure (PDB) file. The data were assembled by matching sequence-based ddG measurements in FireProtDB to structures from the RCSB Protein Data Bank (PDB). For details, see the Methods section of our manuscript.
Citing this work: If you choose to use this dataset for your own research, please cite this repository and the ThermoMPNN paper: https://doi.org/10.1073/pnas.2314853121.
Contents:
- pdbs/: directory containing all PDB files
- csvs/: directory containing all CSV files with mutation data
- csvs/4_fireprotDB_bestpH.csv: the main (full) dataset file, with 3,438 mutations across 100 proteins
- csvs/fireprot_splits.pkl: the dataset splits (train/val/test) used in our study
- csvs/splits/: CSV files for each of the splits (train/val/test/homologue-free), indexed from the full dataset CSV
Important CSV columns:
- pdb_id_corrected: corresponds to the PDB in the pdbs/ directory (after curation and disambiguation)
- ddG: ddG value for mutation (mutant - WT)
- wild_type: wild-type amino acid (1-letter code)
- mutation: mutant amino acid (1-letter code)
- pdb_position: 0-based index of the mutated residue in the PDB file (may differ from the position in the original FireProtDB sequence entry)
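As a minimal sketch of how these columns fit together, the following Python snippet builds conventional mutation labels (wild-type residue, position, mutant residue) from rows that use the column names listed above. The helper name `mutation_labels` and the example rows are hypothetical and are not taken from the dataset. The sketch assumes that adding 1 to `pdb_position` gives a 1-based index into the residues of the structure file, which is not necessarily the residue number recorded in the PDB file.

```python
import csv
import io


def mutation_labels(rows):
    """Build labels such as 'A42G' from rows with the columns described above.

    Hypothetical helper: the label uses a 1-based position, so 1 is added
    to the 0-based pdb_position column.
    """
    labels = []
    for row in rows:
        # int(float(...)) tolerates positions stored as "41" or "41.0"
        position = int(float(row["pdb_position"])) + 1
        labels.append(f'{row["wild_type"]}{position}{row["mutation"]}')
    return labels


# Made-up rows for illustration only; they are not taken from the dataset.
example_csv = io.StringIO(
    "pdb_id_corrected,wild_type,pdb_position,mutation,ddG\n"
    "XXXX,A,41,G,1.2\n"
    "XXXX,L,0,V,-0.3\n"
)
rows = list(csv.DictReader(example_csv))
labels = mutation_labels(rows)
print(labels)  # ['A42G', 'L1V']
```

To apply the same helper to the real data, one would presumably open csvs/4_fireprotDB_bestpH.csv from the extracted archive and pass the resulting `csv.DictReader` rows in place of the example rows.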
Files
fireprot_upload.zip (5.4 MB)
md5:25cf04760665946b54b768eb7b6dfd1d
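The MD5 checksum listed for fireprot_upload.zip can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names are hypothetical, and the local path in the usage note is an assumption about where the archive was saved.

```python
import hashlib

# Checksum listed for fireprot_upload.zip in this record.
EXPECTED_MD5 = "25cf04760665946b54b768eb7b6dfd1d"


def md5_of_stream(handle, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Return the hex MD5 digest of the file at the given path."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


def matches_expected(path, expected=EXPECTED_MD5):
    """Check a downloaded file against the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("fireprot_upload.zip")` should return `True` for an uncorrupted download saved under that name in the current directory.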