Published March 6, 2025
| Version v1
Dataset
Open
Large-scale discovery, analysis, and design of protein energy landscapes
Authors/Creators
Description
*** IMPORTANT! Please register to use these data so that we can continue to release new useful datasets! This will take 10 seconds! ***
This repository contains datasets generated for our study on protein energy landscapes using our multiplex hydrogen-deuterium exchange (mHDX) analysis. The datasets include raw and processed HDX data, NMR results, curated subsets, and machine learning splits with interpretable and deep learning-derived features. These resources support various analyses, including protein stability assessment, EX1 kinetics evaluation, and predictive modeling.
Available Datasets:
- Dataset_0_InitialOrder: Initial DNA sequences from all libraries (15,715 unique sequences).
- Dataset_1_UnfilteredData: Minimally filtered HDX data based on confident identifications and PO score < 50 (8,293 unique sequences).
- Dataset_2_SuccessfulHDX: Proteins passing quality control metrics, including EX1 kinetics (5,778 unique sequences).
- Dataset_3_MeasurablyStable: Proteins reaching full deuteration with ΔGunfold > 2 kcal/mol and passing EX1 kinetics filter (3,590 unique sequences).
- Dataset_4_HDXNMR: HDX-NMR results per condition, including average ΔGopen per position (16 unique sequences).
- Dataset_5_MesophilicThermophilic: Subset of proteins from natural domains classified as mesophilic or thermophilic based on optimal growth temperature (>40°C) (1,637 unique sequences).
- Dataset_6_splits_interpretable: Machine learning splits with interpretable features (3,193 unique sequences).
- Dataset_6_splits_esm2: Machine learning splits with ESM2-derived features (3,465 unique sequences).
- Dataset_6_splits_unirep: Machine learning splits with UniRep-derived features (3,465 unique sequences).
- Dataset_6_splits_saprot: Machine learning splits with SaProt-derived features (3,465 unique sequences).
- Dataset_7_mHDX_cDNA: Subset of Dataset_2 (best PO scored candidate, EX1 kinetics excluded) overlapping with cDNA proteolysis assay data from Tsuboyama et al. (2023) (4,464 unique sequences).
- Dataset_8_PDFs: Comprehensive plots generated using the mhdx_pipeline and hdxrate_pipeline, visualizing time-dependent mass distributions and fits to exchange rates. A Jupyter notebook is included to facilitate navigation. (Note: This dataset is split into eight parts for uploading purposes, .zip_part_aa through .zip_part_ah. Please concatenate the parts before unzipping.)
- Dataset_9_AlphaFoldModels: AlphaFold 2 models, relaxed with Rosetta, for proteins from Dataset_2_SuccessfulHDX (5,778 unique sequences).
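The note on Dataset_8_PDFs asks for the parts to be concatenated before unzipping. A minimal sketch of that step in Python follows. It assumes the parts sit in one directory and share a stem such as Dataset_8_PDFs (the exact file names are an assumption, as are the helper names find_parts, concatenate_parts, and reassemble_and_extract, which are illustrative and not part of the repository).

```python
import shutil
import zipfile
from pathlib import Path


def find_parts(directory, stem):
    """Return files named '<stem>.zip_part_*' sorted by suffix (aa, ab, ...)."""
    return sorted(Path(directory).glob(f"{stem}.zip_part_*"))


def concatenate_parts(part_paths, output_path):
    """Append the parts, in the order given, into a single archive file."""
    output_path = Path(output_path)
    with output_path.open("wb") as out:
        for part in part_paths:
            with Path(part).open("rb") as src:
                # Stream in chunks so multi-gigabyte parts never sit in memory.
                shutil.copyfileobj(src, out)
    return output_path


def reassemble_and_extract(directory, stem, dest):
    """Join all parts for `stem` into '<stem>.zip', then unzip into `dest`."""
    parts = find_parts(directory, stem)
    if not parts:
        raise FileNotFoundError(f"no parts found for {stem!r} in {directory}")
    archive = concatenate_parts(parts, Path(directory) / f"{stem}.zip")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return archive
```

With the parts downloaded into the current directory, a call such as reassemble_and_extract(".", "Dataset_8_PDFs", "Dataset_8_PDFs") would produce the joined archive and extract it. Sorting by name is enough here because the suffixes aa through ah are already in alphabetical order.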
Files
(81.4 GB)
| Checksum | Size |
|---|---|
| md5:d2c8496eda46dea8301a48cea54dae44 | 4.0 MB |
| md5:c6af9e5a83ed3ce221d4bc975a3c8442 | 603.7 MB |
| md5:0abb247295ddacde54b09a3041bf132a | 1.1 GB |
| md5:ad28ea20dd7db895e146ad1ec0c568bd | 285.6 MB |
| md5:207a51dc018f2512c479d539fe8e2f42 | 101.4 kB |
| md5:dae7bf89d5b9a486d04bc1ad17456d3a | 1.2 MB |
| md5:7052c4865e3958a43f5e27457a0959d3 | 717.6 MB |
| md5:a0d81081aa3e20935141aaca99b69f6d | 486.4 MB |
| md5:ca979e6e38abcedabbb10276cc401f32 | 89.9 MB |
| md5:0f16792b2bde50144eb6bebcbd8ef3b1 | 132.5 MB |
| md5:9fb80c6e466d6eb6b3f817645334c2b2 | 343.4 MB |
| md5:aef994c2beed3cf963dd4d080e4b3829 | 10.7 GB |
| md5:e8c12e2cf8e96383038caeee4cd3a647 | 10.7 GB |
| md5:efae48ebb299266e01d347eaeafb8415 | 10.7 GB |
| md5:a67956e2b713db694dc2d3f6891804f3 | 10.7 GB |
| md5:b9f5a28df5fee24ffc033377f19cd1d6 | 10.7 GB |
| md5:0ce0d49870f59fe7dbb45e6a9198f1ff | 10.7 GB |
| md5:945fc039aa2c7fa0841785f4a59497b1 | 10.7 GB |
| md5:f6f3847ad79cde518f64220cc6cc79fb | 2.4 GB |
| md5:668bbd0838a5b7c3e7cbbb3bcf490ea6 | 99.9 MB |
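Each listed file carries an md5 checksum that can be compared against a local download, which is especially useful for the multi-gigabyte parts. A minimal sketch using Python's standard hashlib module follows. The helper names md5_of_file and matches_listing are illustrative and not part of the repository, and pairing a checksum with a particular local file is left to the reader.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a local file against a listed value such as 'md5:d2c8...'."""
    # Accept the value with or without the 'md5:' prefix used in the listing.
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as matches_listing("path/to/downloaded_file", "md5:...") returns True when the download is intact. For Dataset_8_PDFs, checking each part before concatenation avoids rebuilding a large archive from a corrupted piece.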
Additional details
Funding
- National Institutes of Health
- High-Throughput Discovery of Protein Energy Landscapes in Natural and Designed Proteomes DP2-GM140927
- Fundação de Amparo à Pesquisa do Estado de São Paulo
- High-throughput discovery of energy landscapes in natural and designed proteins 20/14421-1
Dates
- Updated: 2023-03-06
Software
- Programming language
- Python