Published March 6, 2025 | Version v1
Dataset Open

Large-scale discovery, analysis, and design of protein energy landscapes

Description

*** IMPORTANT! Please register to use these data so that we can continue to release new, useful datasets! This will take 10 seconds! ***

This repository contains datasets generated for our study on protein energy landscapes using our multiplex hydrogen-deuterium exchange (mHDX) analysis. The datasets include raw and processed HDX data, NMR results, curated subsets, and machine learning splits with interpretable and deep learning-derived features. These resources support various analyses, including protein stability assessment, EX1 kinetics evaluation, and predictive modeling.

Available Datasets:

  • Dataset_0_InitialOrder: Initial DNA sequences from all libraries (15,715 unique sequences).
  • Dataset_1_UnfilteredData: Minimally filtered HDX data based on confident identifications and PO score < 50 (8,293 unique sequences).
  • Dataset_2_SuccessfulHDX: Proteins passing quality control metrics, including EX1 kinetics (5,778 unique sequences).
  • Dataset_3_MeasurablyStable: Proteins reaching full deuteration with ΔGunfold > 2 kcal/mol and passing EX1 kinetics filter (3,590 unique sequences).
  • Dataset_4_HDXNMR: HDX-NMR results per condition, including average ΔGopen per position (16 unique sequences).
  • Dataset_5_MesophilicThermophilic: Subset of proteins from natural domains classified as mesophilic or thermophilic based on optimal growth temperature (thermophilic if >40°C) (1,637 unique sequences).
  • Dataset_6_splits_interpretable: Machine learning splits with interpretable features (3,193 unique sequences).
  • Dataset_6_splits_esm2: Machine learning splits with ESM2-derived features (3,465 unique sequences).
  • Dataset_6_splits_unirep: Machine learning splits with Unirep-derived features (3,465 unique sequences).
  • Dataset_6_splits_saprot: Machine learning splits with SaProt-derived features (3,465 unique sequences).
  • Dataset_7_mHDX_cDNA: Subset of Dataset_2_SuccessfulHDX (best PO-scored candidate, EX1 kinetics excluded) that overlaps with the cDNA proteolysis assay data from Tsuboyama et al. (2023) (4,464 unique sequences).
  • Dataset_8_PDFs: Comprehensive plots generated using the mhdx_pipeline and hdxrate_pipeline, visualizing time-dependent mass distributions and fits to exchange rates. A Jupyter notebook is included to facilitate navigation. (Note: This dataset is split into eight parts for uploading purposes — .zip_part_aa through .zip_part_ah. Please concatenate the parts before unzipping.)
  • Dataset_9_AlphaFoldModels: Rosetta-relaxed AlphaFold 2 models of the proteins in Dataset_2_SuccessfulHDX (5,778 unique sequences).
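
Because Dataset_8_PDFs is distributed as split parts that must be concatenated before unzipping, a short Python sketch of that step may be useful. It is only an illustration: the function names are hypothetical, and the base file name used in the usage comment (Dataset_8_PDFs.zip_part_aa, and so on) is an assumption, since only the .zip_part_aa through .zip_part_ah suffixes are stated above. The parts are joined in lexicographic order, which matches the aa, ab, ... suffix sequence.

```python
import shutil
import zipfile
from pathlib import Path


def concatenate_parts(directory, part_prefix, output_name):
    """Join split archive parts into one file, in lexicographic order.

    `part_prefix` is the shared start of the part file names
    (hypothetical example: "Dataset_8_PDFs.zip_part_").
    """
    directory = Path(directory)
    output_path = directory / output_name
    # Sorting puts the parts in aa, ab, ac, ... order; skip the output file
    # in case an earlier run left it in the same directory.
    parts = sorted(p for p in directory.glob(part_prefix + "*") if p != output_path)
    if not parts:
        raise FileNotFoundError(f"no parts matching {part_prefix}* in {directory}")
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so multi-gigabyte parts are never held in memory.
                shutil.copyfileobj(src, out)
    return output_path


def extract_archive(archive_path, destination):
    """Unzip the reassembled archive into `destination`."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(destination)
    return destination


# Example usage (file names are assumed, adjust to the downloaded names):
# archive = concatenate_parts(".", "Dataset_8_PDFs.zip_part_", "Dataset_8_PDFs.zip")
# extract_archive(archive, "Dataset_8_PDFs")
```

The reassembled archive is roughly the sum of the part sizes, so it is worth confirming that enough free disk space is available for both the parts and the joined file before running this.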

Files

Dataset_0_InitialOrder.json

Files (81.4 GB)

Checksum Size
md5:d2c8496eda46dea8301a48cea54dae44 4.0 MB
md5:c6af9e5a83ed3ce221d4bc975a3c8442 603.7 MB
md5:0abb247295ddacde54b09a3041bf132a 1.1 GB
md5:ad28ea20dd7db895e146ad1ec0c568bd 285.6 MB
md5:207a51dc018f2512c479d539fe8e2f42 101.4 kB
md5:dae7bf89d5b9a486d04bc1ad17456d3a 1.2 MB
md5:7052c4865e3958a43f5e27457a0959d3 717.6 MB
md5:a0d81081aa3e20935141aaca99b69f6d 486.4 MB
md5:ca979e6e38abcedabbb10276cc401f32 89.9 MB
md5:0f16792b2bde50144eb6bebcbd8ef3b1 132.5 MB
md5:9fb80c6e466d6eb6b3f817645334c2b2 343.4 MB
md5:aef994c2beed3cf963dd4d080e4b3829 10.7 GB
md5:e8c12e2cf8e96383038caeee4cd3a647 10.7 GB
md5:efae48ebb299266e01d347eaeafb8415 10.7 GB
md5:a67956e2b713db694dc2d3f6891804f3 10.7 GB
md5:b9f5a28df5fee24ffc033377f19cd1d6 10.7 GB
md5:0ce0d49870f59fe7dbb45e6a9198f1ff 10.7 GB
md5:945fc039aa2c7fa0841785f4a59497b1 10.7 GB
md5:f6f3847ad79cde518f64220cc6cc79fb 2.4 GB
md5:668bbd0838a5b7c3e7cbbb3bcf490ea6 99.9 MB
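
The MD5 checksums listed above can be used to confirm that a download completed without corruption, which matters most for the multi-gigabyte parts. The following is a minimal sketch using Python's standard hashlib module; the helper names are hypothetical, and the file name in the usage comment is a placeholder to be replaced with the name of the file actually downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a listed checksum such as 'md5:d2c8...'."""
    # Accept the value with or without the 'md5:' prefix used in the listing.
    expected_hex = expected.strip().lower().split(":", 1)[-1]
    return md5_of_file(path) == expected_hex


# Example usage (placeholder file name, checksum taken from the listing):
# ok = matches_checksum("downloaded_file.json", "md5:d2c8496eda46dea8301a48cea54dae44")
# print("checksum OK" if ok else "checksum mismatch, download again")
```

Reading in fixed-size chunks keeps memory use constant regardless of file size. MD5 is adequate here for detecting transfer errors, though it is not a defence against deliberate tampering.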

Additional details

Funding

National Institutes of Health
High-Throughput Discovery of Protein Energy Landscapes in Natural and Designed Proteomes DP2-GM140927
Fundação de Amparo à Pesquisa do Estado de São Paulo
High-throughput discovery of energy landscapes in natural and designed proteins 20/14421-1

Dates

Updated
2023-03-06

Software

Programming language
Python