METL Rosetta datasets
University of Wisconsin-Madison
Description
This repository contains the biophysical attributes used to pretrain METL-Local and METL-Global models. We provide raw Rosetta data as well as processed Rosetta datasets that have duplicates, outliers, and NaN values removed.
Users of these datasets should cite both METL and Rosetta.
Raw Rosetta data
Raw Rosetta data is provided as SQLite databases (.db files). There is a separate database for each local dataset and one for the global dataset. Note that the raw GB1-IgG binding data contains only the binding scores, whereas the processed GB1-IgG binding dataset listed below contains both the binding and standard scores. That processed dataset was created by combining the raw GB1-IgG binding data with the raw GB1 standard data.
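Because the table and column names inside the raw databases are not listed on this page, a reasonable first step is to inspect a downloaded .db file before querying it. The following is a minimal sketch using Python's standard sqlite3 module. The helper names list_tables and preview_table are hypothetical and not part of any METL repository; the metl-pub instructions remain the authoritative guide for reading these files.

```python
import sqlite3
from pathlib import Path


def _connect_read_only(db_path):
    """Open an SQLite file read-only so the downloaded data is never modified."""
    # as_uri() percent-encodes the absolute path; mode=ro also prevents
    # SQLite from creating an empty file when the path is mistyped.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def list_tables(db_path):
    """Return the names of all tables in the database, sorted by name."""
    conn = _connect_read_only(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()


def preview_table(db_path, table, limit=5):
    """Return (column_names, rows) for the first `limit` rows of one table."""
    conn = _connect_read_only(db_path)
    try:
        # Identifiers cannot be bound as SQL parameters, so quote the name.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (limit,))
        columns = [d[0] for d in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()
```

Calling list_tables on a downloaded database and then preview_table on one of the returned names shows which columns are available without loading the whole multi-gigabyte file into memory.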
Processed Rosetta datasets
Each processed Rosetta dataset has its own directory containing the following:
- The dataset in three formats (.tsv, .db, and .h5 files), all containing the same data
- A list of PDB files corresponding to the variants in the dataset (pdb_fns.txt)
- A splits directory containing train, validation, and test splits we used for pretraining
- Standardization parameters computed on the train set (in the splits directory)
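To illustrate what "standardization parameters computed on the train set" means in practice, the sketch below fits a per-column mean and standard deviation on training rows only and then applies them to any split. It is an assumption-laden illustration, not the authors' code: the file format of the provided parameters is not described here, the column name total_score is used only as an example, and the choice of population standard deviation is an assumption.

```python
import statistics


def fit_standardization(train_rows, columns):
    """Compute (mean, std) per column using the training rows only."""
    params = {}
    for col in columns:
        values = [float(row[col]) for row in train_rows]
        # Population standard deviation; an assumption for this sketch.
        params[col] = (statistics.fmean(values), statistics.pstdev(values))
    return params


def standardize(rows, params):
    """Apply train-set parameters to rows from any split (train, val, or test)."""
    standardized = []
    for row in rows:
        new_row = dict(row)
        for col, (mean, std) in params.items():
            # Guard against constant columns, where std is zero.
            new_row[col] = (float(row[col]) - mean) / std if std > 0 else 0.0
        standardized.append(new_row)
    return standardized


# Illustrative values only, not taken from the datasets.
train = [{"total_score": "1.0"}, {"total_score": "2.0"}, {"total_score": "3.0"}]
test = [{"total_score": "2.0"}]
params = fit_standardization(train, ["total_score"])
scaled_test = standardize(test, params)
```

Fitting on the train split and reusing those parameters for validation and test keeps information from held-out variants out of the preprocessing step. When pretraining with the provided splits, the parameters shipped in the splits directory should be preferred over recomputed ones.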
Processed Rosetta datasets can be used directly with the main metl GitHub repository to pretrain models. That repository also contains a small example dataset.
Our metl-pub GitHub repository has a mapping from the dataset names to these filenames and instructions for reading the files.
Files
Total size: 150.2 GB
MD5 checksum | Size |
---|---|
md5:a5536c91289cca054ad4de07fe8494e0 | 11.9 GB |
md5:81677115c1318e7720ebf3b866443a81 | 13.6 GB |
md5:68760620b893be9e8aacb15c7df4f01b | 4.4 GB |
md5:2c6efa9bc2d6a8f3e8b801f62397907a | 7.3 GB |
md5:02221f43363c23b06b156900fbd86957 | 20.4 GB |
md5:6eeb625c4c934873808cf4d2a11fcf3e | 12.2 GB |
md5:aa384c9984a9d8499b8468f2108e8c0e | 11.9 GB |
md5:78a91362359a06070917529b41b82a63 | 12.5 GB |
md5:be071466be3c08fe7fec2431ea404b91 | 12.1 GB |
md5:039141a2693c5e3907ff34cf19df8ee0 | 5.3 GB |
md5:3c734d96bc636a477338e3b638f44e9f | 5.7 GB |
md5:59443943105b662adfbdb77dceb04a1e | 1.3 GB |
md5:ca4d2e90e81fd2a0d5f00017e5f0b145 | 2.9 GB |
md5:74aad528e294433b0c55e0b8c5b75213 | 8.5 GB |
md5:f0c30f43bf4ce47deb0423ffd00fefc8 | 5.1 GB |
md5:284e7bf4351814cb72a92c5c18a43de1 | 5.0 GB |
md5:e0295dd42447b2795771963afe201867 | 5.2 GB |
md5:bbefcd91863e18c1ac72af8f1c5f7b06 | 5.0 GB |
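Given the size of these files, it can be worth checking each download against its listed md5 value before use. The sketch below uses Python's standard hashlib module and reads the file in chunks so that multi-gigabyte archives are never held in memory at once. The function names are hypothetical helpers for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:a553...94e0'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as matches_checksum(downloaded_path, "md5:...") with the value copied from the file listing returns True only if the download is byte-for-byte intact.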
Additional details
Related works
- Is supplement to
- Preprint: 10.1101/2024.03.15.585128 (DOI)
- Software: https://github.com/gitter-lab/metl (URL)
- Software: https://github.com/gitter-lab/metl-sim (URL)
- Software: https://github.com/gitter-lab/metl-pub (URL)
Funding
- Collaborative Research: MFB: Integrating Deep Learning and High-throughput Experimentation to Rapidly Navigate Protein Fitness Landscapes for Non-native Enzyme Catalysis 2226383
- National Science Foundation
- Collaborative Research: MFB: Integrating Deep Learning and High-throughput Experimentation to Rapidly Navigate Protein Fitness Landscapes for Non-native Enzyme Catalysis 2226451
- National Science Foundation
- A Machine Learning Platform for Adaptive Chemical Screening R01GM135631
- National Institutes of Health
- Data-driven analysis of protein structure, function, and regulation R35GM119854
- National Institutes of Health
References
- Sam Gelman, Bryce Johnson, Chase Freschlin, Sameer D'Costa, Anthony Gitter, Philip A Romero. Biophysics-based protein language models for protein engineering. bioRxiv, 2024. doi:10.1101/2024.03.15.585128