Published April 23, 2024 | Version 1.0
Dataset Open

METL Rosetta datasets

  • 1. University of Wisconsin-Madison

Description

This repository contains the biophysical attributes used to pretrain the METL-Local and METL-Global models. We provide the raw Rosetta data as well as processed Rosetta datasets from which duplicates, outliers, and NaN values have been removed.

Users of these datasets should cite both METL and Rosetta.

Raw Rosetta data

The raw Rosetta data is provided as SQLite databases (.db files). There is a separate database for each local dataset and one for the global dataset. Note that the raw GB1-IgG binding data contains only the binding scores, whereas the processed GB1-IgG binding dataset listed below contains both the binding and standard scores. That processed dataset was created by combining the raw GB1-IgG binding data with the raw GB1 standard data.
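Because the table and column names inside each .db file are not listed here, a reasonable first step is to inspect the schema. The following is a minimal sketch using only Python's standard sqlite3 module; the function name list_tables and the example path are hypothetical and not part of the METL code.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file."""
    # Build a file: URI and open it read-only so inspection cannot modify the download.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            )
        ]
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        return {
            name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
            for name in names
        }
    finally:
        conn.close()
```

Calling list_tables("path/to/raw_dataset.db") (a placeholder path) returns a dictionary that can be printed to see which tables and score columns a given database holds before writing any queries against it.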

Processed Rosetta datasets

Each processed Rosetta dataset has its own directory containing the following:

  • The dataset in three formats (.tsv, .db, and .h5 files), all containing the same data
  • A list of PDB files corresponding to the variants in the dataset (pdb_fns.txt)
  • A splits directory containing train, validation, and test splits we used for pretraining
  • Standardization parameters computed on the train set (in the splits directory)
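To illustrate why the standardization parameters are computed on the train set only, here is a small sketch of z-score standardization in plain Python. It assumes the parameters are a per-attribute mean and standard deviation and uses the population standard deviation; the actual parameter file format and conventions in the splits directory may differ, and the function names are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Compute the mean and (population) standard deviation from the train split only."""
    return fmean(train_values), pstdev(train_values)


def standardize(values, mean, std):
    """Apply train-set parameters to any split: z = (x - mean) / std."""
    if std == 0:
        raise ValueError("standard deviation is zero; attribute is constant")
    return [(v - mean) / std for v in values]


# Hypothetical values for one Rosetta attribute in each split.
train_scores = [1.0, 2.0, 3.0]
test_scores = [2.0, 4.0]

mean, std = fit_standardizer(train_scores)
train_z = standardize(train_scores, mean, std)
test_z = standardize(test_scores, mean, std)  # reuses the train parameters
```

Reusing the train-set mean and standard deviation for the validation and test splits keeps information from those splits out of the preprocessing step.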

Processed Rosetta datasets can be used directly with the main metl GitHub repository to pretrain models. That repository also contains a small example dataset.

Our metl-pub GitHub repository has a mapping from the dataset names to these filenames and instructions for reading the files.

Files

Total size: 150.2 GB

MD5 checksum                      Size
a5536c91289cca054ad4de07fe8494e0  11.9 GB
81677115c1318e7720ebf3b866443a81  13.6 GB
68760620b893be9e8aacb15c7df4f01b   4.4 GB
2c6efa9bc2d6a8f3e8b801f62397907a   7.3 GB
02221f43363c23b06b156900fbd86957  20.4 GB
6eeb625c4c934873808cf4d2a11fcf3e  12.2 GB
aa384c9984a9d8499b8468f2108e8c0e  11.9 GB
78a91362359a06070917529b41b82a63  12.5 GB
be071466be3c08fe7fec2431ea404b91  12.1 GB
039141a2693c5e3907ff34cf19df8ee0   5.3 GB
3c734d96bc636a477338e3b638f44e9f   5.7 GB
59443943105b662adfbdb77dceb04a1e   1.3 GB
ca4d2e90e81fd2a0d5f00017e5f0b145   2.9 GB
74aad528e294433b0c55e0b8c5b75213   8.5 GB
f0c30f43bf4ce47deb0423ffd00fefc8   5.1 GB
284e7bf4351814cb72a92c5c18a43de1   5.0 GB
e0295dd42447b2795771963afe201867   5.2 GB
bbefcd91863e18c1ac72af8f1c5f7b06   5.0 GB
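Given the size of these files, it can be worth checking a download against its listed MD5 checksum. The sketch below uses Python's standard hashlib module and reads the file in chunks so that multi-gigabyte archives do not need to fit in memory. The function names and the example path are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected_md5):
    """Compare a file's digest with a listed value such as 'md5:a553...'."""
    # Accept the checksum with or without the 'md5:' prefix used in the listing.
    expected = expected_md5.strip().lower().split(":")[-1]
    return md5_of_file(path) == expected
```

For example, matches_listing("path/to/download", "md5:a5536c91289cca054ad4de07fe8494e0") would return True only if the file at that placeholder path is the one carrying the first checksum in the listing.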

Additional details

Funding

  • National Science Foundation, award 2226383: Collaborative Research: MFB: Integrating Deep Learning and High-throughput Experimentation to Rapidly Navigate Protein Fitness Landscapes for Non-native Enzyme Catalysis
  • National Science Foundation, award 2226451: Collaborative Research: MFB: Integrating Deep Learning and High-throughput Experimentation to Rapidly Navigate Protein Fitness Landscapes for Non-native Enzyme Catalysis
  • National Institutes of Health, award R01GM135631: A Machine Learning Platform for Adaptive Chemical Screening
  • National Institutes of Health, award R35GM119854: Data-driven analysis of protein structure, function, and regulation

References

  • Sam Gelman, Bryce Johnson, Chase Freschlin, Sameer D'Costa, Anthony Gitter, Philip A Romero. Biophysics-based protein language models for protein engineering. bioRxiv, 2024. doi:10.1101/2024.03.15.585128