METL Rosetta datasets

Gelman, Sam; D'Costa, Sameer; Romero, Philip; Gitter, Anthony

doi:10.5281/zenodo.14916528

Published February 24, 2025 | Version v3

Dataset Open

1. University of Wisconsin-Madison

This repository contains the biophysical attributes used to pretrain the METL-Local and METL-Global models. We provide the raw Rosetta data as well as processed Rosetta datasets from which duplicates, outliers, and NaN values have been removed.

Users of these datasets should cite both METL and Rosetta.

The repository also contains packaged conda environment files needed to generate new Rosetta simulation data in the OSPool with our Jupyter notebook.

Raw Rosetta data

Raw Rosetta data is provided as SQLite databases (.db files). There is a separate database for each of the local datasets and one for the global dataset. Note that the raw GB1-IgG binding data contains only the binding scores, whereas the processed GB1-IgG binding dataset listed below contains both the binding and standard scores. The processed dataset was created by combining the raw GB1-IgG binding data with the raw GB1 standard data.
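Because the raw data are SQLite databases, a first look at their contents needs only the Python standard library. The following is a minimal sketch, not the authors' own tooling: it lists every table in a database together with its row count. The function name summarize_sqlite and the example path are hypothetical, and no particular table or column names are assumed.

```python
import sqlite3
from pathlib import Path


def summarize_sqlite(db_path):
    """Return {table_name: row_count} for every table in a SQLite database."""
    # Open read-only through a file URI so the database is never modified.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in tables:
            # Table names are read from the file itself; quote them defensively.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()


# Hypothetical usage after extracting one of the raw archives:
# print(summarize_sqlite("path/to/extracted/database.db"))
```

Counting rows scans each table, so this may take a while on the multi-gigabyte databases.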

Processed Rosetta datasets

Each processed Rosetta dataset has its own directory containing the following:

The dataset in three formats (.tsv, .db, and .h5 files), all containing the same data
A list of PDB files corresponding to the variants in the dataset (pdb_fns.txt)
A splits directory containing train, validation, and test splits we used for pretraining
Standardization parameters computed on the train set (in the splits directory)
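The last item above refers to standardization parameters computed on the train set. As an illustration of that general idea, the sketch below fits a per-column mean and standard deviation on training rows only and then applies them to any split. It is not the authors' code: the function names, the column name total_score, and the choice of population standard deviation are all assumptions, and the file format of the provided parameters is not modeled here.

```python
import statistics


def fit_standardization(train_rows, columns):
    """Compute (mean, std) per column using the training rows only."""
    params = {}
    for col in columns:
        values = [row[col] for row in train_rows]
        # Assumption: population standard deviation; the source does not say which.
        params[col] = (statistics.fmean(values), statistics.pstdev(values))
    return params


def apply_standardization(rows, params):
    """Return copies of rows with each fitted column converted to a z-score."""
    out = []
    for row in rows:
        scaled = dict(row)
        for col, (mean, std) in params.items():
            # Constant columns have std == 0; map them to 0.0 instead of dividing.
            scaled[col] = (row[col] - mean) / std if std > 0 else 0.0
        out.append(scaled)
    return out


# Illustrative values only (not taken from the datasets):
train = [{"total_score": 1.0}, {"total_score": 3.0}]
params = fit_standardization(train, ["total_score"])
held_out = apply_standardization([{"total_score": 4.0}], params)
```

Fitting on the train split alone keeps information from the validation and test splits out of the scaling step.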

Processed Rosetta datasets can be used directly with the main metl GitHub repository to pretrain models. That repository also contains a small example dataset.

Our metl-pub GitHub repository has a mapping from the dataset names to these filenames and instructions for reading the files.
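The metl-pub instructions remain the authoritative guide for reading these files. For a quick look at a .tsv file without loading several gigabytes into memory, a sketch along the following lines may be useful. The function name peek_tsv is hypothetical, and the sketch assumes only that the file is tab-separated with a header row.

```python
import csv
from itertools import islice


def peek_tsv(path, n_rows=5):
    """Return the header and the first n_rows records of a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # islice stops after n_rows, so the rest of the file is never read.
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows


# Hypothetical usage on an extracted processed dataset:
# header, first_rows = peek_tsv("path/to/dataset.tsv", n_rows=3)
```

All values come back as strings; numeric columns would need to be converted before use.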

conda environment files

clean_pdb_2025-02-13.tar.gz and metl-sim_2025-02-13.tar.gz are packaged conda environment files from the metl-sim GitHub repository.
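As a rough illustration, a packaged environment archive can be unpacked into its own directory with the Python standard library. This is a sketch under assumptions rather than the documented workflow: the function name unpack_archive and the destination directory are hypothetical, and the metl-sim repository should be consulted for how the environments are meant to be used.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and list the top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Only extract archives from a source you trust.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


# Hypothetical usage:
# unpack_archive("metl-sim_2025-02-13.tar.gz", "metl-sim-env")
```

If the archives were created with conda-pack (an assumption; this page does not say), the usual next steps would be to source bin/activate inside the extracted directory and then run conda-unpack.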

Files (169.9 GB)

Name	MD5	Size
clean_pdb_2025-02-13.tar.gz	e3a73b6567d0b2c4f2c3268f583447bc	37.1 MB
metl-sim_2025-02-13.tar.gz	7ea8b7dd3ba1b005e71dc500b00f2248	482.8 MB
processed-avgfp.tar.gz	a5536c91289cca054ad4de07fe8494e0	11.9 GB
processed-dlg4.tar.gz	81677115c1318e7720ebf3b866443a81	13.6 GB
processed-gb1-binding.tar.gz	68760620b893be9e8aacb15c7df4f01b	4.4 GB
processed-gb1.tar.gz	2c6efa9bc2d6a8f3e8b801f62397907a	7.3 GB
processed-global.tar.gz	02221f43363c23b06b156900fbd86957	20.4 GB
processed-grb2.tar.gz	6eeb625c4c934873808cf4d2a11fcf3e	12.2 GB
processed-pab1.tar.gz	aa384c9984a9d8499b8468f2108e8c0e	11.9 GB
processed-pten.tar.gz	7f12864413763552acb441c64ff9e3c3	13.6 GB
processed-tem-1.tar.gz	78a91362359a06070917529b41b82a63	12.5 GB
processed-ube4b.tar.gz	be071466be3c08fe7fec2431ea404b91	12.1 GB
raw-avgfp.tar.gz	039141a2693c5e3907ff34cf19df8ee0	5.3 GB
raw-dlg4.tar.gz	3c734d96bc636a477338e3b638f44e9f	5.7 GB
raw-gb1-binding.tar.gz	59443943105b662adfbdb77dceb04a1e	1.3 GB
raw-gb1.tar.gz	ca4d2e90e81fd2a0d5f00017e5f0b145	2.9 GB
raw-global.tar.gz	74aad528e294433b0c55e0b8c5b75213	8.5 GB
raw-grb2.tar.gz	f0c30f43bf4ce47deb0423ffd00fefc8	5.1 GB
raw-pab1.tar.gz	284e7bf4351814cb72a92c5c18a43de1	5.0 GB
raw-pten.tar.gz	bbffaa0ea1291acbc12ae74a0346f277	5.6 GB
raw-tem-1.tar.gz	e0295dd42447b2795771963afe201867	5.2 GB
raw-ube4b.tar.gz	bbefcd91863e18c1ac72af8f1c5f7b06	5.0 GB
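
Since the archives are large, it can be worth checking each download against the MD5 checksum listed for it before extracting. The sketch below computes the digest in chunks so that a multi-gigabyte file is never held in memory. The function names md5_of_file and verify_md5 are hypothetical helpers, not part of any METL tool.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Usage with a checksum taken from the file list (the local path is an assumption):
# verify_md5("raw-gb1.tar.gz", "ca4d2e90e81fd2a0d5f00017e5f0b145")
```

A mismatch most likely indicates an incomplete or corrupted download, in which case the file should be fetched again.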

Additional details

Related works

Is supplement to:
Preprint: 10.1101/2024.03.15.585128 (DOI)
Software: https://github.com/gitter-lab/metl (URL)
Software: https://github.com/gitter-lab/metl-sim (URL)
Software: https://github.com/gitter-lab/metl-pub (URL)
Publication: 10.1038/s41592-025-02776-2 (DOI)

Funding

U.S. National Science Foundation
Collaborative Research: MFB: Integrating Deep Learning and High-throughput Experimentation to Rapidly Navigate Protein Fitness Landscapes for Non-native Enzyme Catalysis (2226383)

U.S. National Science Foundation
Collaborative Research: MFB: Integrating Deep Learning and High-throughput Experimentation to Rapidly Navigate Protein Fitness Landscapes for Non-native Enzyme Catalysis (2226451)

National Institutes of Health
A Machine Learning Platform for Adaptive Chemical Screening (R01GM135631)

National Institutes of Health
Data-driven analysis of protein structure, function, and regulation (R35GM119854)

References

Sam Gelman, Bryce Johnson, Chase R Freschlin, Arnav Sharma, Sameer D'Costa, John Peters, Anthony Gitter, Philip A Romero. Biophysics-based protein language models for protein engineering. Nature Methods 22, 2025.

