Raw data for SimpleFold-Turbo preprint
Authors/Creators
- National Institute of Standards and Technology (NIST)
Description
SimpleFold-Turbo: Adaptive Inference Caching Yields 14-fold Acceleration of Flow-matching Protein Structure Prediction
General Information for Raw Data
Description: This dataset contains all benchmarking data, predicted structures, and analysis results for the SimpleFold-Turbo manuscript. SimpleFold-Turbo applies TeaCache-style adaptive step-skipping to SimpleFold flow-matching models at six model scales (100M–3B parameters), evaluated on a structurally diverse subset of 300 CATH domains.
Total size: ~572 MB (553 MB predicted structure files)
Benchmark Set
| File | Description |
|---|---|
| CATH300.csv | Benchmark set of 300 CATH domains: name, sequence, and length |
| diverse_cath_300.fasta | Sequences for the 300 domains in FASTA format |
| diverse_cath_300.json | Extended metadata for each domain (CATH classification, structural annotations) |
| ss_content.json | Secondary structure content (helix/sheet/coil fractions) per domain |
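A minimal sketch of reading the FASTA file with the standard library only. It assumes each header line starts with `>` followed by the CATH domain identifier, and that a sequence may wrap over several lines; the identifiers and sequences in the demo are made up for illustration.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {identifier: sequence}.

    The identifier is the header text after '>' up to the first whitespace.
    """
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            records[current] += line  # join wrapped sequence lines
    return records


# Illustrative input only; not taken from the dataset.
demo = read_fasta([">dom1 extra text", "ACDE", "FG", "", ">dom2", "KLM"])
lengths = {name: len(seq) for name, seq in demo.items()}
```

With the real file, the same function can be called as `read_fasta(open("diverse_cath_300.fasta"))`, and the resulting lengths compared against the length column of CATH300.csv.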
Core Benchmark Results
| File | Description |
|---|---|
| cath_benchmark_full.csv | Per-protein benchmark results (3,600 rows): model, TeaCache threshold, TM-score, RMSD, lDDT, inference time, and cache hit rate |
| cath_benchmark_full.json | Same data in JSON format |
| gt_comparison.csv | Side-by-side quality comparison (baseline vs. TeaCache) against ground-truth experimental structures: TM-score, RMSD, and lDDT differences |
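The per-protein rows can be aggregated per model with the standard `csv` module. The sketch below runs on a small in-memory sample; the column names `model` and `tm_score` are assumptions for illustration and should be checked against the actual header row of cath_benchmark_full.csv.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Average a numeric column within each group of rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(float(row[value_key]))
    return {name: mean(values) for name, values in groups.items()}


# Made-up sample with hypothetical column names; not real benchmark values.
SAMPLE = """model,tm_score
simplefold_100M,0.50
simplefold_100M,0.75
simplefold_3B,0.90
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
result = mean_by_group(rows, "model", "tm_score")
```

For the real file, replace `io.StringIO(SAMPLE)` with `open("cath_benchmark_full.csv", newline="")` and adjust the two key names to the actual headers.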
Dual Sweep (Uniform vs. Adaptive Step-Skipping)
| File | Description |
|---|---|
| dual_sweep_simplefold_{100M,360M,700M,1.1B,1.6B,3B}.json | Per-protein results for each model scale (5,100 entries each). Each entry records method (uniform or adaptive), condition (number of steps or threshold), inference time, cache hit rate, computed steps, and quality metrics (RMSD, TM-score) |
| dual_sweep_summary.csv | Aggregated summary across all models and conditions (30,600 rows) |
| uniform_vs_adaptive.csv | Head-to-head comparison of uniform vs. adaptive skipping at matched compute budgets |
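The entry counts quoted above can be cross-checked with simple arithmetic. This sketch assumes every domain appears exactly once per condition in each per-model file, which the description does not state explicitly; under that assumption the per-model entry count must be a multiple of 300.

```python
N_DOMAINS = 300            # CATH domains in the benchmark set
N_MODELS = 6               # model scales, 100M to 3B
ENTRIES_PER_MODEL = 5_100  # entries in each dual_sweep_simplefold_*.json
SUMMARY_ROWS = 30_600      # rows in dual_sweep_summary.csv


def conditions_per_model(entries: int, domains: int) -> int:
    """Conditions per model, assuming one entry per (domain, condition) pair."""
    quotient, remainder = divmod(entries, domains)
    if remainder:
        raise ValueError("entry count is not a multiple of the domain count")
    return quotient


total_entries = N_MODELS * ENTRIES_PER_MODEL
n_conditions = conditions_per_model(ENTRIES_PER_MODEL, N_DOMAINS)
```

The same check applies after loading a file: `len(entries)` for any single model scale should equal `ENTRIES_PER_MODEL`.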
Threshold and Step Sweeps
| File | Description |
|---|---|
| threshold_sweep.json | Per-protein results across TeaCache threshold values (2,400 entries) |
| threshold_summary.csv | Aggregated: mean time, speedup, cache hit rate, and quality loss per threshold |
| uniform_step_sweep.json | Per-protein results for uniform step counts (2,700 entries) |
Mechanistic Analyses
| File | Description |
|---|---|
| skip_patterns.json | Timestep-resolved skip/compute patterns across the denoising trajectory. Includes per-step skip rates and a summary of always-computed warmup steps (11), always-skipped steps (200), and variable steps (289) |
| warmup_comparison.json | Analysis of warmup phase: compares the first 11 (always-computed) steps to full 500-step trajectories across 300 proteins |
| clustering_results.json | Clustering of denoising timesteps into two regimes based on skip behavior, with secondary-structure correlation |
| crystallization_results.json | Atom-level settling ("crystallization") analysis: per-protein statistics on when atomic coordinates stabilize during denoising (20 proteins) |
| dimensionality_control.csv | Cache hit rate vs. chain length and embedding dimensionality (synthetic and empirical) |
| dimensionality_control.json | Full dimensionality control experiment data including Pearson correlations |
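The three step categories reported for skip_patterns.json can be derived from per-step skip rates. The sketch below assumes the rates are available as a plain list of fractions between 0 and 1, one per timestep; the actual JSON layout may differ, and the toy input is made up.

```python
from collections import Counter


def classify_steps(skip_rates):
    """Count timesteps by skip behavior across proteins.

    A rate of 0.0 means the step was computed for every protein,
    1.0 means it was skipped for every protein, anything else is variable.
    """
    labels = []
    for rate in skip_rates:
        if rate == 0.0:
            labels.append("always_computed")
        elif rate == 1.0:
            labels.append("always_skipped")
        else:
            labels.append("variable")
    return Counter(labels)


# Toy input for illustration only.
toy = classify_steps([0.0, 0.0, 0.3, 0.9, 1.0])

# Counts as reported in the file description; they cover the full trajectory.
REPORTED = {"always_computed": 11, "always_skipped": 200, "variable": 289}
total_steps = sum(REPORTED.values())
```

The reported counts sum to 500, matching the 500-step trajectories mentioned for warmup_comparison.json.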
Predicted Structures
structures.zip:
30,811 PDB files (~553 MB compressed). Organized as structures/{model}/{method}_{condition}/{domain}.pdb, where model is one of simplefold_{100M,360M,700M,1.1B,1.6B,3B}, method is uniform or adaptive, and condition is the step count or threshold value.
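The directory layout can be parsed back into its components with a regular expression. This is a sketch under the layout stated above; the example path uses a hypothetical threshold value and domain identifier, and the exact formatting of condition values inside the archive is an assumption.

```python
import re
from typing import NamedTuple


class StructurePath(NamedTuple):
    model: str
    method: str
    condition: str
    domain: str


# structures/{model}/{method}_{condition}/{domain}.pdb
_PATTERN = re.compile(
    r"^structures/(?P<model>simplefold_[^/]+)/"
    r"(?P<method>uniform|adaptive)_(?P<condition>[^/]+)/"
    r"(?P<domain>[^/]+)\.pdb$"
)


def parse_structure_path(path: str) -> StructurePath:
    """Split an archive member path into model, method, condition, domain."""
    match = _PATTERN.match(path)
    if match is None:
        raise ValueError(f"unexpected path layout: {path!r}")
    return StructurePath(**match.groupdict())


# Hypothetical example path; the condition and domain values are illustrative.
example = parse_structure_path(
    "structures/simplefold_1.1B/adaptive_0.1/1abcA00.pdb"
)
```

Applied to the names returned by `zipfile.ZipFile("structures.zip").namelist()`, this allows grouping the PDB files by model, method, or condition without extracting the archive.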
File Formats
- CSV files use comma delimiters with a header row
- JSON files are either arrays of per-protein result objects or dictionaries with descriptive top-level keys
- FASTA follows standard format with CATH domain identifiers as headers
- PDB files follow standard Protein Data Bank format
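Because the JSON files come in two shapes, a loader may want to branch on the top-level type before processing. A minimal sketch follows; the keys in the sample strings are invented for illustration.

```python
import json


def describe_json(text: str) -> dict:
    """Report whether a JSON document is a record array or a keyed dictionary."""
    data = json.loads(text)
    if isinstance(data, list):
        return {"shape": "array", "n_records": len(data)}
    if isinstance(data, dict):
        return {"shape": "dict", "keys": sorted(data)}
    raise TypeError(f"unexpected top-level JSON type: {type(data).__name__}")


# Illustrative inputs only; real files have their own keys.
array_info = describe_json('[{"a": 1}, {"a": 2}]')
dict_info = describe_json('{"summary": {}, "per_step": []}')
```

For a file on disk, pass `open(path).read()` as the argument.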
Reproducing the Figures
The Python scripts used to generate all manuscript figures from these data files are included in the publication/ directory of the GitHub repository (figure1.py, figure2.py, figure_supplement.py).
Files
(598.9 MB)
| Checksum | Size |
|---|---|
| md5:443287e3c95caa9fead28b9319f92632 | 43.8 kB |
| md5:be503e33ba3174838e85a7bfe472b9a7 | 223.0 kB |
| md5:2ab174947847e3a8d759291def1c504d | 1.6 MB |
| md5:231eda6e5eb85a284d69f76f8036b1ce | 6.3 kB |
| md5:ab5bca7b716db99bdc3386761b2856b5 | 135.1 kB |
| md5:46a26a766197acb8215682ac07fb30b3 | 619 Bytes |
| md5:a1e287a4af002bffdbc9afded11fb7c3 | 2.1 kB |
| md5:75fb990af4a0effd999c7f37128b9a5a | 42.7 kB |
| md5:4e4ae766363d5d29ce7ee8ac4a80f7b5 | 77.4 kB |
| md5:5672863fbb3f171e4d780de2164cb86c | 1.8 MB |
| md5:b588e9a7e77f7544cf12a02298840a20 | 1.8 MB |
| md5:c90eafd6377787c7d6b93fbb10ef65d8 | 1.8 MB |
| md5:6a869a1f51270c6e19b08b5d019c431e | 1.8 MB |
| md5:807ae8596319cce7e11eb97f1516e7f4 | 1.8 MB |
| md5:3aff5ae3a416165e1da005d2bbe79245 | 1.8 MB |
| md5:fe8ebf0928959e96eeeee4bdf3977660 | 3.4 MB |
| md5:7713b29c08c03770865b022680fbb5e5 | 21.3 kB |
| md5:d1b317bd1d1b1ece2d073096d3ebe1cd | 1.8 MB |
| md5:51e59f25fd2418fcbff073bb80a2a255 | 46.5 kB |
| md5:6d3fd49466adbbf778c087c66446960f | 579.6 MB |
| md5:7c77b77c7c21f6da979b8de8e61a8c6d | 532 Bytes |
| md5:48c07ff2144cef7a491c5d51b9db70f9 | 582.9 kB |
| md5:d5be0e751ce8a9da0fd63278739d2236 | 537.1 kB |
| md5:fcede1643efc179645f248e01c3b96ff | 1.2 kB |
| md5:3af24ea025731e38bfe0f47a94ccc41b | 53.1 kB |
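Downloaded files can be checked against the MD5 values in the listing above. The sketch below hashes a binary stream in chunks so that large files such as structures.zip do not need to fit in memory; the function names are illustrative, and the demo uses an in-memory stream rather than a real file.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_listing(stream, listed: str) -> bool:
    """Compare a stream's digest with a listed value such as 'md5:<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_stream(stream) == expected


# In-memory demo; for a real file use open(path, "rb") as the stream.
demo_digest = md5_of_stream(io.BytesIO(b"abc"))
demo_ok = matches_listing(io.BytesIO(b"abc"), "md5:" + demo_digest)
```

For example, `matches_listing(open("structures.zip", "rb"), "md5:...")` with the checksum listed for the archive should return `True` for an intact download.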
Additional details
Related works
- Is supplement to: Preprint, DOI 10.64898/2026.04.07.714835
Dates
- Created: 2026-04-07
Software
- Repository URL: https://github.com/usnistgov/simplefold-turbo
- Development Status: Active
References
- Taghon, G. (2026). SimpleFold-Turbo: Adaptive inference caching yields 14-fold acceleration of flow-matching protein structure prediction. bioRxiv. doi:10.64898/2026.04.07.714835