Published April 7, 2026 | Version v1
Dataset Open

Raw data for SimpleFold-Turbo preprint

  • 1. National Institute of Standards and Technology (NIST)

Contributors

  • 1. National Institute of Standards and Technology

Description

SimpleFold-Turbo: Adaptive Inference Caching Yields 14-fold Acceleration of Flow-matching Protein Structure Prediction

General Information for Raw Data

Description: This dataset contains all benchmarking data, predicted structures, and analysis results for the SimpleFold-Turbo manuscript. SimpleFold-Turbo applies TeaCache-style adaptive step-skipping to SimpleFold flow-matching models across six model scales (100M–3B parameters), evaluated on a structurally diverse subset of 300 CATH domains.

Total size: ~572 MB (553 MB predicted structure files)

Benchmark Set

File | Description
CATH300.csv | Benchmark set of 300 CATH domains: name, sequence, and length
diverse_cath_300.fasta | Sequences for the 300 domains in FASTA format
diverse_cath_300.json | Extended metadata for each domain (CATH classification, structural annotations)
ss_content.json | Secondary structure content (helix/sheet/coil fractions) per domain
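As a minimal sketch of how the sequence file could be read without third-party dependencies, the parser below turns FASTA text into a dictionary of identifier to sequence. The function name read_fasta and the two records in the example are placeholders invented for illustration, not entries from the dataset.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {identifier: sequence}."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Keep the first whitespace-delimited token as the identifier.
            header = line[1:].split()[0]
            records[header] = ""
        elif header is not None:
            # Sequence lines may wrap, so append to the current record.
            records[header] += line
    return records


# Made-up two-record example (placeholders, not real CATH domains).
example = """>domainA
MKTAYIAK
QRQISFVK
>domainB
GSHMLEDP
"""
domains = read_fasta(example.splitlines())
lengths = {name: len(seq) for name, seq in domains.items()}
print(lengths)
```

Passing an open file handle for diverse_cath_300.fasta in place of example.splitlines() should yield one record per domain, and the resulting lengths can be cross-checked against the length column of CATH300.csv.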

Core Benchmark Results

File | Description
cath_benchmark_full.csv | Per-protein benchmark results (3,600 rows): model, TeaCache threshold, TM-score, RMSD, lDDT, inference time, and cache hit rate
cath_benchmark_full.json | Same data in JSON format
gt_comparison.csv | Side-by-side quality comparison (baseline vs. TeaCache) against ground-truth experimental structures: TM-score, RMSD, and lDDT differences
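The per-protein rows lend themselves to grouped averages, for example mean TM-score per model and threshold. The sketch below uses only the standard library. The column names model, threshold, and tm_score are assumptions made for illustration, as are the sample values; check the actual header row of cath_benchmark_full.csv before reusing them.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_keys, value_key):
    """Average value_key over rows that share the same values for group_keys."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in group_keys)
        groups[key].append(float(row[value_key]))
    return {key: mean(values) for key, values in groups.items()}


# Tiny in-memory stand-in for the CSV; column names and numbers are hypothetical.
sample = io.StringIO(
    "model,threshold,tm_score\n"
    "simplefold_100M,0.0,0.80\n"
    "simplefold_100M,0.0,0.70\n"
    "simplefold_100M,0.1,0.60\n"
)
summary = mean_by_group(csv.DictReader(sample), ["model", "threshold"], "tm_score")
for key, value in sorted(summary.items()):
    print(key, round(value, 3))
```

To run this on the real file, replace the StringIO object with open("cath_benchmark_full.csv", newline="") and adjust the key names to match the header.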

Dual Sweep (Uniform vs. Adaptive Step-Skipping)

File | Description
dual_sweep_simplefold_{100M,360M,700M,1.1B,1.6B,3B}.json | Per-protein results for each model scale (5,100 entries each). Each entry records method (uniform or adaptive), condition (number of steps or threshold), inference time, cache hit rate, computed steps, and quality metrics (RMSD, TM-score)
dual_sweep_summary.csv | Aggregated summary across all models and conditions (30,600 rows)
uniform_vs_adaptive.csv | Head-to-head comparison of uniform vs. adaptive skipping at matched compute budgets
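The brace pattern in the dual-sweep file name expands to six files, and the stated counts are mutually consistent: 6 × 5,100 = 30,600. The short sketch below spells out that arithmetic. The figure of 17 conditions per model is an inference that assumes every one of the 300 domains appears once under every condition; the record does not state it directly.

```python
MODEL_SCALES = ["100M", "360M", "700M", "1.1B", "1.6B", "3B"]
ENTRIES_PER_MODEL = 5_100  # stated per dual-sweep file
N_DOMAINS = 300            # size of the CATH benchmark set

# Expand the {100M,...,3B} pattern into concrete file names.
dual_sweep_files = [f"dual_sweep_simplefold_{scale}.json" for scale in MODEL_SCALES]

# Total entries across all six files; matches the stated summary row count.
total_entries = ENTRIES_PER_MODEL * len(MODEL_SCALES)

# Assumption: each domain is evaluated once per condition.
conditions_per_model, remainder = divmod(ENTRIES_PER_MODEL, N_DOMAINS)

print(dual_sweep_files)
print(total_entries, conditions_per_model, remainder)
```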

Threshold and Step Sweeps

File | Description
threshold_sweep.json | Per-protein results across TeaCache threshold values (2,400 entries)
threshold_summary.csv | Aggregated: mean time, speedup, cache hit rate, and quality loss per threshold
uniform_step_sweep.json | Per-protein results for uniform step counts (2,700 entries)

Mechanistic Analyses

File | Description
skip_patterns.json | Timestep-resolved skip/compute patterns across the denoising trajectory. Includes per-step skip rates and a summary of the 500 steps by category: always-computed warmup steps (11), always-skipped steps (200), and variable steps (289)
warmup_comparison.json | Analysis of the warmup phase: compares the first 11 (always-computed) steps to full 500-step trajectories across 300 proteins
clustering_results.json | Clustering of denoising timesteps into two regimes based on skip behavior, with secondary-structure correlation
crystallization_results.json | Atom-level settling ("crystallization") analysis: per-protein statistics on when atomic coordinates stabilize during denoising (20 proteins)
dimensionality_control.csv | Cache hit rate vs. chain length and embedding dimensionality (synthetic and empirical)
dimensionality_control.json | Full dimensionality control experiment data, including Pearson correlations

Predicted Structures

structures.zip:
30,811 PDB files (~553 MB compressed). Organized as structures/{model}/{method}_{condition}/{domain}.pdb, where model is one of simplefold_{100M,360M,700M,1.1B,1.6B,3B}, method is uniform or adaptive, and condition is the step count or threshold value.
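The directory layout encodes the model, method, and condition in the path itself, so a file's provenance can be recovered without opening it. The following is a minimal sketch of such a parser. The names StructureEntry and parse_structure_path are illustrative, and the example path uses placeholder values for the condition and domain.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class StructureEntry(NamedTuple):
    model: str
    method: str
    condition: str
    domain: str


def parse_structure_path(path: str) -> StructureEntry:
    """Split structures/{model}/{method}_{condition}/{domain}.pdb into its fields."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4 or parts[0] != "structures" or not parts[3].endswith(".pdb"):
        raise ValueError(f"unexpected layout: {path}")
    # The method never contains an underscore, so split on the first one only.
    method, _, condition = parts[2].partition("_")
    if method not in {"uniform", "adaptive"} or not condition:
        raise ValueError(f"unexpected method/condition directory: {parts[2]}")
    domain = parts[3][: -len(".pdb")]
    return StructureEntry(parts[1], method, condition, domain)


# Placeholder path; the condition value and domain name are made up.
entry = parse_structure_path("structures/simplefold_100M/adaptive_0.1/domainA.pdb")
print(entry)
```

Applied to the name list from zipfile.ZipFile("structures.zip").namelist(), after skipping directory entries, this gives a quick way to count structures per model and condition.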

File Formats

- CSV files use comma delimiters with a header row
- JSON files are either arrays of per-protein result objects or dictionaries with descriptive top-level keys
- FASTA follows standard format with CATH domain identifiers as headers
- PDB files follow standard Protein Data Bank format
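Because the JSON files come in two shapes, it can help to inspect the top-level structure before writing any analysis code. The helper below is a sketch of one way to do that; describe_json is a hypothetical name and the sample payloads are invented for illustration.

```python
import json


def describe_json(text: str):
    """Report the top-level shape of a JSON document.

    Returns ("array", number_of_items) for a list, or
    ("dict", sorted_top_level_keys) for an object.
    """
    data = json.loads(text)
    if isinstance(data, list):
        return "array", len(data)
    if isinstance(data, dict):
        return "dict", sorted(data)
    raise TypeError(f"unexpected top-level JSON type: {type(data).__name__}")


# Invented payloads standing in for the two shapes described above.
array_shape = describe_json('[{"domain": "a"}, {"domain": "b"}]')
dict_shape = describe_json('{"summary": {}, "per_step": []}')
print(array_shape)
print(dict_shape)
```

For a file on disk, pass pathlib.Path(name).read_text() as the argument.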

Reproducing the Figures

The Python scripts used to generate all manuscript figures from these data files are included in the publication/ directory of the GitHub repository (figure1.py, figure2.py, figure_supplement.py).

Files (598.9 MB)

Checksum | Size
md5:443287e3c95caa9fead28b9319f92632 | 43.8 kB
md5:be503e33ba3174838e85a7bfe472b9a7 | 223.0 kB
md5:2ab174947847e3a8d759291def1c504d | 1.6 MB
md5:231eda6e5eb85a284d69f76f8036b1ce | 6.3 kB
md5:ab5bca7b716db99bdc3386761b2856b5 | 135.1 kB
md5:46a26a766197acb8215682ac07fb30b3 | 619 Bytes
md5:a1e287a4af002bffdbc9afded11fb7c3 | 2.1 kB
md5:75fb990af4a0effd999c7f37128b9a5a | 42.7 kB
md5:4e4ae766363d5d29ce7ee8ac4a80f7b5 | 77.4 kB
md5:5672863fbb3f171e4d780de2164cb86c | 1.8 MB
md5:b588e9a7e77f7544cf12a02298840a20 | 1.8 MB
md5:c90eafd6377787c7d6b93fbb10ef65d8 | 1.8 MB
md5:6a869a1f51270c6e19b08b5d019c431e | 1.8 MB
md5:807ae8596319cce7e11eb97f1516e7f4 | 1.8 MB
md5:3aff5ae3a416165e1da005d2bbe79245 | 1.8 MB
md5:fe8ebf0928959e96eeeee4bdf3977660 | 3.4 MB
md5:7713b29c08c03770865b022680fbb5e5 | 21.3 kB
md5:d1b317bd1d1b1ece2d073096d3ebe1cd | 1.8 MB
md5:51e59f25fd2418fcbff073bb80a2a255 | 46.5 kB
md5:6d3fd49466adbbf778c087c66446960f | 579.6 MB
md5:7c77b77c7c21f6da979b8de8e61a8c6d | 532 Bytes
md5:48c07ff2144cef7a491c5d51b9db70f9 | 582.9 kB
md5:d5be0e751ce8a9da0fd63278739d2236 | 537.1 kB
md5:fcede1643efc179645f248e01c3b96ff | 1.2 kB
md5:3af24ea025731e38bfe0f47a94ccc41b | 53.1 kB
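
The MD5 values listed above can be used to confirm that a download is intact. The sketch below hashes a binary stream in chunks with the standard hashlib module, so even the large structures archive does not need to fit in memory. The helper names are illustrative, and the example checks an empty in-memory stream rather than a dataset file.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(stream, expected: str) -> bool:
    """Compare a stream's digest to a value written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_stream(stream) == expected_hex


# Self-check on an empty stream, whose MD5 digest is a well-known constant.
empty_ok = matches_checksum(io.BytesIO(b""), "md5:d41d8cd98f00b204e9800998ecf8427e")
print(empty_ok)
```

For a downloaded file, open it in binary mode, for example open(path, "rb"), and pass the handle together with the matching md5 value from the listing.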

Additional details

Related works

Is supplement to
Preprint: 10.64898/2026.04.07.714835 (DOI)

Dates

Created
2026-04-07

References

  • Taghon, G. (2026). SimpleFold-Turbo: Adaptive inference caching yields 14-fold acceleration of flow-matching protein structure prediction. bioRxiv. doi:10.64898/2026.04.07.714835