Published September 30, 2021 | Version 0.1
Dataset Open

Data for "Training data composition affects performance of protein structure analysis algorithms" by A. Derry, K. A. Carpenter, & R. B. Altman

  • 1. Stanford University

Description

This repository contains all data used in "Training data composition affects performance of protein structure analysis algorithms" by A. Derry, K. A. Carpenter, & R. B. Altman, published at the Pacific Symposium on Biocomputing 2022.

The data consists of the following files:

  • ema_zenodo_data.tar.gz: train, validation, and test splits for the Estimation of Model Accuracy task, in LMDB format
  • design_zenodo_data.tar.gz: train, validation, and test splits for the Protein Sequence Design task, in JSON format
  • enz_cat_res_zenodo_data.tar.gz: train, validation, and test splits for the Catalytic Residue and Enzyme Prediction task, in TFRecord format

Details on dataset construction can be found in our paper, and dataloaders can be found in our GitHub repo.
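The archives listed above are gzip-compressed tar files, so they need to be unpacked before any dataloader can read them. The following is a minimal sketch using only the Python standard library. The function name `extract_archive` and the destination directory are illustrative assumptions, and the internal layout of each archive is not documented here, so the sketch returns the member names for inspection.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return its member names.

    Hypothetical helper for illustration; not part of the authors' code.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter when this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest, **kwargs)
    return names
```

For example, `extract_archive("design_zenodo_data.tar.gz", "data/design")` would unpack the Protein Sequence Design splits into `data/design` (a directory name chosen here only as an example) and return the list of extracted paths.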

Reference

A. Derry*, K. A. Carpenter*, & R. B. Altman, "Training data composition affects performance of protein structure analysis algorithms", 2021.

Dataset References

Datasets used were derived from the following works:

Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K., & Moult, J. (2019). Critical assessment of methods of protein structure prediction (CASP)—Round XIII. In Proteins: Structure, Function and Bioinformatics (Vol. 87, Issue 12, pp. 1011–1020). https://doi.org/10.1002/prot.25823

Ingraham, J., Garg, V. K., Barzilay, R., & Jaakkola, T. (2019). Generative Models for Graph-Based Protein Design. https://openreview.net/pdf?id=SJgxrLLKOE

Furnham, N., Holliday, G. L., de Beer, T. A. P., Jacobsen, J. O. B., Pearson, W. R., & Thornton, J. M. (2014). The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic Acids Research, 42 (Database issue), D485–D489.

Files (44.5 GB)

Checksum                               Size
md5:af57f5786c885324f02877f4f7a3bcaf   189.8 MB
md5:e1f29e0672f34ed3f848a36a3fe0d5c4   4.8 GB
md5:89065cfedd5bc306bd4977cdf9229fa8   39.5 GB
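Because the largest archive is tens of gigabytes, it can be worth checking a download against the MD5 checksums listed above before extracting it. The sketch below is a minimal example using the Python standard library. The helper names `file_md5` and `matches_listed_checksum` are illustrative assumptions, and since the listing does not say which checksum belongs to which archive, the check only tests membership in the listed set.

```python
import hashlib

# Checksums copied from the file listing; the listing does not state
# which archive each one belongs to, so they are kept as an unordered set.
LISTED_MD5 = {
    "af57f5786c885324f02877f4f7a3bcaf",
    "e1f29e0672f34ed3f848a36a3fe0d5c4",
    "89065cfedd5bc306bd4977cdf9229fa8",
}


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """Return True if the file's MD5 digest is one of the listed checksums."""
    return file_md5(path) in LISTED_MD5
```

A call such as `matches_listed_checksum("ema_zenodo_data.tar.gz")` returning `False` would suggest a corrupted or incomplete download.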