Large-scale discovery, analysis, and design of protein energy landscapes

Ramos Ferrari, Allan Jhonathan; Dixit, Sugyan; Thibeault, Jane; Mario Garcia; Houliston, Scott; Ludwig, Robert; Notin, Pascal; Phoumyvong, Claire; Martell, Cydney; Jung, Michelle D.; Tsuboyama, Kotaro; Carter, Lauren; Arrowsmith, Cheryl; Guttman, Miklos; Rocklin, Gabriel

doi:10.5281/zenodo.14983481

Published March 6, 2025 | Version v1

Dataset Open

1. Northwestern University
2. University of Toronto
3. University of Colorado Anschutz Medical Campus
4. Harvard University
5. Gates Medical Research Institute
6. University of Washington

*** IMPORTANT! Please register to use these data so that we can continue to release new, useful datasets! This will take 10 seconds! ***

This repository contains datasets generated for our study on protein energy landscapes using our multiplex hydrogen-deuterium exchange (mHDX) analysis. The datasets include raw and processed HDX data, NMR results, curated subsets, and machine learning splits with interpretable and deep learning-derived features. These resources support various analyses, including protein stability assessment, EX1 kinetics evaluation, and predictive modeling.

Available Datasets:

Dataset_0_InitialOrder: Initial DNA sequences from all libraries (15,715 unique sequences).
Dataset_1_UnfilteredData: Minimally filtered HDX data based on confident identifications and PO score < 50 (8,293 unique sequences).
Dataset_2_SuccessfulHDX: Proteins passing quality control metrics, including EX1 kinetics (5,778 unique sequences).
Dataset_3_MeasurablyStable: Proteins reaching full deuteration with ΔGunfold > 2 kcal/mol and passing EX1 kinetics filter (3,590 unique sequences).
Dataset_4_HDXNMR: HDX-NMR results per condition, including average ΔGopen per position (16 unique sequences).
Dataset_5_MesophilicThermophilic: Subset of proteins from natural domains, classified as mesophilic or thermophilic by the source organism's optimal growth temperature (thermophilic if >40°C) (1,637 unique sequences).
Dataset_6_splits_interpretable: Machine learning splits with interpretable features (3,193 unique sequences).
Dataset_6_splits_esm2: Machine learning splits with ESM2-derived features (3,465 unique sequences).
Dataset_6_splits_unirep: Machine learning splits with Unirep-derived features (3,465 unique sequences).
Dataset_6_splits_saprot: Machine learning splits with SaProt-derived features (3,465 unique sequences).
Dataset_7_mHDX_cDNA: Subset of Dataset_2 (best PO scored candidate, EX1 kinetics excluded) overlapping with cDNA proteolysis assay data from Tsuboyama et al. (2023) (4,464 unique sequences).
Dataset_8_PDFs: Comprehensive plots generated using the mhdx_pipeline and hdxrate_pipeline, visualizing time-dependent mass distributions and fits to exchange rates. A Jupyter notebook is included to facilitate navigation. (Note: This dataset is split into eight parts for uploading purposes, .zip.part_aa through .zip.part_ah. Please concatenate the parts in order before unzipping.)
Dataset_9_AlphaFoldModels: AlphaFold 2 models, relaxed with Rosetta, for the sequences in Dataset_2_SuccessfulHDX (5,778 unique sequences).
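
The note on Dataset_8_PDFs asks for the parts to be concatenated before unzipping. A minimal Python sketch of that step follows, assuming all eight parts have been downloaded into one directory under their original names. The function names `join_parts` and `extract_archive` are illustrative and not part of the dataset or its pipelines.

```python
import shutil
import zipfile
from pathlib import Path


def join_parts(directory, stem="Dataset_8_PDFs.zip"):
    """Concatenate <stem>.part_aa, <stem>.part_ab, ... into <stem>.

    Parts are joined in lexicographic order, which matches the
    aa..ah suffixes used for this upload.
    """
    directory = Path(directory)
    parts = sorted(directory.glob(stem + ".part_*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {stem}.part_* in {directory}")
    out_path = directory / stem
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in large chunks so a multi-GB part is never held in memory.
                shutil.copyfileobj(src, out, length=16 * 1024 * 1024)
    return out_path


def extract_archive(archive_path, destination):
    """Unzip the re-assembled archive into `destination`."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(destination)


# Example (hypothetical paths):
#   archive = join_parts("downloads")
#   extract_archive(archive, "downloads/Dataset_8_PDFs")
```

On a Unix-like shell the same concatenation can usually be done with `cat Dataset_8_PDFs.zip.part_* > Dataset_8_PDFs.zip`. The full archive is roughly 77 GB, so check that enough free disk space is available for the parts, the joined file, and the extracted contents.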

Files

Files (81.4 GB)

Name	MD5	Size
Dataset_0_InitialOrder.json	d2c8496eda46dea8301a48cea54dae44	4.0 MB
Dataset_1_UnfilteredData.json	c6af9e5a83ed3ce221d4bc975a3c8442	603.7 MB
Dataset_2_SuccessfulHDX.json	0abb247295ddacde54b09a3041bf132a	1.1 GB
Dataset_3_MeasurablyStable.json	ad28ea20dd7db895e146ad1ec0c568bd	285.6 MB
Dataset_4_HDXNMR.json	207a51dc018f2512c479d539fe8e2f42	101.4 kB
Dataset_5_MesophilicThermopholic.json	dae7bf89d5b9a486d04bc1ad17456d3a	1.2 MB
Dataset_6_splits_esm2_features.json	7052c4865e3958a43f5e27457a0959d3	717.6 MB
Dataset_6_splits_interpretable_features.json	a0d81081aa3e20935141aaca99b69f6d	486.4 MB
Dataset_6_splits_saprot_features.json	ca979e6e38abcedabbb10276cc401f32	89.9 MB
Dataset_6_splits_unirep_features.json	0f16792b2bde50144eb6bebcbd8ef3b1	132.5 MB
Dataset_7_mHDX_cDNA.json	9fb80c6e466d6eb6b3f817645334c2b2	343.4 MB
Dataset_8_PDFs.zip.part_aa	aef994c2beed3cf963dd4d080e4b3829	10.7 GB
Dataset_8_PDFs.zip.part_ab	e8c12e2cf8e96383038caeee4cd3a647	10.7 GB
Dataset_8_PDFs.zip.part_ac	efae48ebb299266e01d347eaeafb8415	10.7 GB
Dataset_8_PDFs.zip.part_ad	a67956e2b713db694dc2d3f6891804f3	10.7 GB
Dataset_8_PDFs.zip.part_ae	b9f5a28df5fee24ffc033377f19cd1d6	10.7 GB
Dataset_8_PDFs.zip.part_af	0ce0d49870f59fe7dbb45e6a9198f1ff	10.7 GB
Dataset_8_PDFs.zip.part_ag	945fc039aa2c7fa0841785f4a59497b1	10.7 GB
Dataset_8_PDFs.zip.part_ah	f6f3847ad79cde518f64220cc6cc79fb	2.4 GB
Dataset_9_AlphaFoldModels.zip	668bbd0838a5b7c3e7cbbb3bcf490ea6	99.9 MB
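
The md5 values listed for each file can be used to confirm that a download completed without corruption, which is especially worthwhile for the multi-gigabyte parts. Below is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `verify_download` are illustrative, not something shipped with the dataset.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """True if the file's MD5 matches the value listed for it."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example, using the checksum listed for Dataset_4_HDXNMR.json
# (the local path is an assumption):
#   verify_download("Dataset_4_HDXNMR.json", "207a51dc018f2512c479d539fe8e2f42")
```

Checking each `Dataset_8_PDFs.zip.part_*` file individually before concatenating makes it possible to re-download only the part that failed.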

Additional details

National Institutes of Health
High-Throughput Discovery of Protein Energy Landscapes in Natural and Designed Proteomes DP2-GM140927
Fundação de Amparo à Pesquisa do Estado de São Paulo
High-throughput discovery of energy landscapes in natural and designed proteins 20/14421-1

Updated: 2025-03-06

Programming language: Python

	All versions	This version
Views	419	419
Downloads	2,020	2,020
Data volume	8.2 TB	8.2 TB
