Published February 24, 2025 | Version v3
Dataset Open

METL Rosetta datasets

  • 1. University of Wisconsin-Madison

Description

This repository contains the biophysical attributes used to pretrain METL-Local and METL-Global models. We provide raw Rosetta data as well as processed Rosetta datasets that have duplicates, outliers, and NaN values removed.

Users of these datasets should cite both METL and Rosetta.

The repository also contains packaged conda environment files needed to generate new Rosetta simulation data in the OSPool with our Jupyter notebook.

Raw Rosetta data

Raw Rosetta data are provided as SQLite databases (.db files), with a separate database for each local dataset and one for the global dataset. Note that the raw GB1-IgG binding data contains only the binding scores, whereas the processed GB1-IgG binding dataset listed below contains both the binding and standard scores. The processed dataset was created by combining the raw GB1-IgG binding data with the raw GB1 standard data.
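Because the raw data are SQLite databases, their layout can be inspected with Python's standard library before any heavier tooling is set up. The following is a minimal sketch: the table and column names inside the .db files are not documented here, so the function discovers them from the database itself rather than assuming them. The function name describe_sqlite is our own, not part of any METL code.

```python
import sqlite3
from contextlib import closing


def describe_sqlite(db_path):
    """Return {table_name: [column names]} for every table in the database.

    Note: sqlite3.connect creates an empty database if db_path does not
    exist, so check the path first when working with downloaded files.
    """
    schema = {}
    with closing(sqlite3.connect(db_path)) as conn:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        for (table,) in tables:
            # PRAGMA table_info yields one row per column; index 1 is the name.
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in columns]
    return schema
```

Calling describe_sqlite on one of the downloaded .db files returns the table names and their columns, which is usually enough to write a first SELECT query.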

Processed Rosetta datasets

Each processed Rosetta dataset has its own directory containing the following:

  • The dataset in three formats (.tsv, .db, and .h5 files), all containing the same data
  • A list of PDB files corresponding to the variants in the dataset (pdb_fns.txt)
  • A splits directory containing train, validation, and test splits we used for pretraining
  • Standardization parameters computed on the train set (in the splits directory)
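The standardization parameters are computed on the train set only, so that validation and test data are scaled with the same fixed values and no information leaks from them into training. The sketch below illustrates that idea with z-score scaling on toy numbers. It is an assumption that the stored parameters are per-attribute means and standard deviations, and the exact estimator (for example, population versus sample standard deviation) and file layout used in the splits directory are not specified here.

```python
import math


def fit_standardizer(train_values):
    """Compute the mean and population standard deviation from training values only."""
    n = len(train_values)
    mean = sum(train_values) / n
    variance = sum((v - mean) ** 2 for v in train_values) / n
    return mean, math.sqrt(variance)


def standardize(values, mean, std):
    """Apply z-score scaling using fixed, train-derived parameters."""
    if std == 0:
        # A constant attribute has no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy numbers for illustration, not taken from the datasets.
train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]
mean, std = fit_standardizer(train)
test_scaled = standardize(test, mean, std)  # scaled with train statistics
```

The same mean and std are reused for every split, which is why they are stored alongside the splits rather than recomputed per file.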

Processed Rosetta datasets can be used directly with the main metl GitHub repository to pretrain models. That repository also contains a small example dataset.

Our metl-pub GitHub repository has a mapping from the dataset names to these filenames and instructions for reading the files.
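For a quick look at a processed .tsv file without loading several gigabytes into memory, the header and the first few rows can be streamed with the standard library. This is only a sketch: it assumes the file is tab-separated with a header row, and the helper name peek_tsv is hypothetical. The metl-pub instructions remain the authoritative reference for reading these files.

```python
import csv
from itertools import islice


def peek_tsv(path, n_rows=5):
    """Return (column_names, first n_rows rows as dicts) without reading the whole file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        # islice stops after n_rows, so the rest of the file is never read.
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows
```

All values come back as strings; numeric columns need an explicit float() conversion before use.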

conda environment files

clean_pdb_2025-02-13.tar.gz and metl-sim_2025-02-13.tar.gz are packaged conda environment files from the metl-sim GitHub repository.
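Since both environment files are gzip-compressed tarballs, they can be unpacked into a directory of their own with Python's tarfile module. The sketch below shows only the extraction step, and the helper name extract_environment is hypothetical. Any activation or post-extraction steps depend on how the environments were packaged, so follow the metl-sim repository for those. Only extract archives you trust.

```python
import tarfile
from pathlib import Path


def extract_environment(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

For example, extract_environment("metl-sim_2025-02-13.tar.gz", "metl-sim-env") would unpack the archive into a new metl-sim-env directory; the directory name is an arbitrary choice.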

Files (169.9 GB)

Checksum  Size
md5:e3a73b6567d0b2c4f2c3268f583447bc  37.1 MB
md5:7ea8b7dd3ba1b005e71dc500b00f2248  482.8 MB
md5:a5536c91289cca054ad4de07fe8494e0  11.9 GB
md5:81677115c1318e7720ebf3b866443a81  13.6 GB
md5:68760620b893be9e8aacb15c7df4f01b  4.4 GB
md5:2c6efa9bc2d6a8f3e8b801f62397907a  7.3 GB
md5:02221f43363c23b06b156900fbd86957  20.4 GB
md5:6eeb625c4c934873808cf4d2a11fcf3e  12.2 GB
md5:aa384c9984a9d8499b8468f2108e8c0e  11.9 GB
md5:7f12864413763552acb441c64ff9e3c3  13.6 GB
md5:78a91362359a06070917529b41b82a63  12.5 GB
md5:be071466be3c08fe7fec2431ea404b91  12.1 GB
md5:039141a2693c5e3907ff34cf19df8ee0  5.3 GB
md5:3c734d96bc636a477338e3b638f44e9f  5.7 GB
md5:59443943105b662adfbdb77dceb04a1e  1.3 GB
md5:ca4d2e90e81fd2a0d5f00017e5f0b145  2.9 GB
md5:74aad528e294433b0c55e0b8c5b75213  8.5 GB
md5:f0c30f43bf4ce47deb0423ffd00fefc8  5.1 GB
md5:284e7bf4351814cb72a92c5c18a43de1  5.0 GB
md5:bbffaa0ea1291acbc12ae74a0346f277  5.6 GB
md5:e0295dd42447b2795771963afe201867  5.2 GB
md5:bbefcd91863e18c1ac72af8f1c5f7b06  5.0 GB
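The md5 values in the file listing can be used to confirm that a multi-gigabyte download completed without corruption. The following is a minimal sketch using Python's hashlib. It reads the file in chunks so memory use stays small, and it accepts the expected value with or without the "md5:" prefix shown above. The helper names are our own.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 against an expected value such as 'md5:abc123...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A result of False from matches_checksum usually means the download was truncated or the checksum belongs to a different file in the listing.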

Additional details

Funding

U.S. National Science Foundation
Collaborative Research: MFB: Integrating Deep Learning and High-throughput Experimentation to Rapidly Navigate Protein Fitness Landscapes for Non-native Enzyme Catalysis (award 2226383)
U.S. National Science Foundation
Collaborative Research: MFB: Integrating Deep Learning and High-throughput Experimentation to Rapidly Navigate Protein Fitness Landscapes for Non-native Enzyme Catalysis (award 2226451)
National Institutes of Health
A Machine Learning Platform for Adaptive Chemical Screening (grant R01GM135631)
National Institutes of Health
Data-driven analysis of protein structure, function, and regulation (grant R35GM119854)

References

  • Sam Gelman, Bryce Johnson, Chase R Freschlin, Arnav Sharma, Sameer D'Costa, John Peters, Anthony Gitter, Philip A Romero. Biophysics-based protein language models for protein engineering. Nature Methods 22, 2025.