Raw data for "Efficient protein structure generation with sparse denoising models"
Description
This repository contains model parameters and protein structures described in the manuscript "Efficient protein structure generation with sparse denoising models".
model source code
"salad-0.1.0.tar.gz" contains the snapshot of the salad code-base used in the manuscript.
model parameters
The parameters for the salad (sparse all-atom denoising) models described in the manuscript are contained in "salad_params.tar.gz". This unpacks to a directory "params/", which contains pickled parameter files for a number of model variants:
- "default_vp-200k.jax" (and "default_vp_timeless-200k.jax", "default_vp_minimal_timeless-200k.jax"):
  - checkpoints at 200,000 training steps for diffusion models with variance-preserving (VP) noise of fixed standard deviation (10 Å).
  - "timeless" and "minimal_timeless" files contain parameters for ablated models without diffusion time features and with reduced pair features, as described in the manuscript.
- "default_vp_scaled-200k.jax" (and "timeless" / "minimal_timeless" variants):
  - checkpoints at 200,000 training steps for VP noise with input-dependent standard deviation (VP-scaled in the manuscript).
- "default_ve_scaled-200k.jax" (and "timeless" / "minimal_timeless" variants):
  - checkpoints at 200,000 training steps for variance-expanding (VE) noise.
- "multimotif_vp-200k.jax":
  - checkpoint at 200,000 training steps for a model with multi-motif conditioning and VP noise.
- "default_vp-pdb256-200k.jax":
  - checkpoint for a model trained on proteins with 50 to 256 amino acids from the PDB.
- "default_vp-synthete-256-200k.jax":
  - checkpoint for a model trained on proteins with 50 to 256 amino acids generated using random secondary structure conditioning with the default_vp-200k.jax checkpoint and ProteinMPNN redesign.
In addition to salad model parameters, we also provide the parameters for the autoencoder models described in the manuscript in "ae_params.tar.gz". This unpacks to a directory "ae_params/", which contains the following checkpoints:
- "small_none-200k.jax": sparse decoder with neighbour selection based only on predicted coordinates.
- "small_inner-200k.jax": sparse decoder with neighbour selection based on per-block distogram predictions and predicted coordinates.
- "small_semiequivariant-200k.jax": same as "small_inner-200k.jax", with the addition of using non-equivariant features on top of the usual equivariant features (relative orientation / distance).
- "small_nodist_vq-200k.jax": same as "small_none-200k.jax", with vector quantization (VQ).
- "small_vq-200k.jax": same as "small_inner-200k.jax", with VQ
- "small_semiequivariant_vq-200k.jax": same as "small_semiequivariant-200k.jax", with VQ
- "small_vq_e2-500k.jax": same as "small_vq-200k.jax", with double encoder depth and 500k training steps.
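As a minimal sketch of how the checkpoints listed above might be inspected, the following assumes that each ".jax" file is an ordinary Python pickle, as the description states. The helper names load_params and list_checkpoints are hypothetical and are not part of the salad code-base; the structure of the unpickled object is not documented here, so the sketch only returns it as-is.

```python
import pickle
from pathlib import Path


def load_params(path):
    """Load one pickled parameter file and return the unpickled object.

    Assumption: the ".jax" files are plain Python pickles. Unpickling
    array types usually requires numpy and/or jax to be importable.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def list_checkpoints(directory):
    """Return the sorted names of all '*.jax' files in a directory."""
    return sorted(p.name for p in Path(directory).glob("*.jax"))
```

For example, after unpacking "salad_params.tar.gz", load_params("params/default_vp-200k.jax") would return the stored parameters. Pickle files can execute code when loaded, so only unpickle archives obtained from a trusted source.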
generated proteins
The protein structures generated using salad, together with their corresponding ProteinMPNN-designed sequences and ESMfold-predicted structures, are contained in "data_package.tar.gz". This archive unpacks to a directory "data_package/", which contains one subdirectory for each protein design task described in the manuscript:
monomers/
This directory contains subdirectories named "<model configuration>-<number of amino acids>-<number of denoising steps>s-esm". For example, the subdirectory for the checkpoint "default_vp-200k.jax" with 400-amino-acid designs and 500 denoising steps is "default_vp-400-500s-esm". The corresponding directories for VE models are split into directories prefixed as follows:
- "ve_large_100": VE diffusion starting from noise with standard deviation 100 Å
- "ve_large_80": VE diffusion starting from noise with standard deviation 80 Å
- "ve_domain_100": VE diffusion starting from domain-shaped noise as described in the manuscript with standard deviation 100 Å
- "ve_domain_80": VE diffusion starting from domain-shaped noise as described in the manuscript with standard deviation 80 Å
In addition, there are subdirectories with "random" in their name, instead of a number of steps, e.g. "default_vp_scaled-200-random-esm/". These subdirectories contain data generated using random secondary structure conditioning.
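The naming scheme described above can be parsed mechanically. The following is a small sketch, assuming every subdirectory name ends in "-<number of amino acids>-<steps>s-esm" or "-<number of amino acids>-random-esm"; the DesignRun type and parse_run_name function are hypothetical names introduced for illustration only.

```python
import re
from typing import NamedTuple, Optional


class DesignRun(NamedTuple):
    config: str                # model configuration, e.g. "default_vp"
    num_residues: int          # number of amino acids
    num_steps: Optional[int]   # denoising steps, None for "random" runs
    random_ss: bool            # True for random secondary structure runs


# Greedy config group, then length, then either "<steps>s" or "random".
_PATTERN = re.compile(
    r"^(?P<config>.+)-(?P<length>\d+)-(?:(?P<steps>\d+)s|(?P<random>random))-esm/?$"
)


def parse_run_name(name: str) -> DesignRun:
    """Split a monomers/ subdirectory name into its components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised subdirectory name: {name!r}")
    steps = match.group("steps")
    return DesignRun(
        config=match.group("config"),
        num_residues=int(match.group("length")),
        num_steps=int(steps) if steps is not None else None,
        random_ss=match.group("random") is not None,
    )
```

With this sketch, parse_run_name("default_vp-400-500s-esm") yields the configuration "default_vp", 400 amino acids and 500 denoising steps. Names that do not follow the pattern raise a ValueError rather than being guessed at.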
Each subdirectory has the same underlying structure:
- "backbones/": directory containing PDB files of salad-generated backbones
- "predictions/": directory containing PDB files of the predicted structure for the best sequence according to ESMfold pLDDT and scRMSD for each designable backbone (a backbone with at least 1 predicted structure with scRMSD < 2 Å and pLDDT > 70).
- "scores.csv": comma-separated file of structure-prediction metrics for each ProteinMPNN sequence generated for each backbone in "backbones". This file has the following columns:
  - "name": base name of the backbone PDB file in "backbones"
  - "index": index of the sequence corresponding to this row (0th, 1st, etc.)
  - "sequence": amino acid sequence whose structure was predicted in this row
  - "sc_rmsd": root mean square deviation (RMSD) between the salad backbone and the predicted structure for this row
  - "sc_tm": TM score between the salad backbone and the predicted structure for this row
  - "plddt": pLDDT of the predicted structure for this row
  - "ptm": pTM of the predicted structure for this row
  - "pae": predicted aligned error for this row
  - for complexes (irrelevant for this study):
    - "ipae": mean interface pAE for this row
    - "mpae": minimum interface pAE for this row
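The designability criterion stated above (at least one sequence with scRMSD < 2 Å and pLDDT > 70 per backbone) can be recomputed from "scores.csv". The sketch below uses only the documented columns "name", "sc_rmsd" and "plddt"; it assumes that "plddt" is stored on the same 0 to 100 scale as the threshold. The function names are hypothetical.

```python
import csv


def designable_backbones(rows, max_sc_rmsd=2.0, min_plddt=70.0):
    """Return the names of backbones with at least one passing sequence."""
    passing = set()
    for row in rows:
        # Assumption: plddt is on a 0-100 scale, matching the cutoff of 70.
        if float(row["sc_rmsd"]) < max_sc_rmsd and float(row["plddt"]) > min_plddt:
            passing.add(row["name"])
    return passing


def designability(path):
    """Fraction of backbones in a scores.csv file that are designable."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    names = {row["name"] for row in rows}
    if not names:
        return 0.0
    return len(designable_backbones(rows)) / len(names)
```

Calling designability on a "scores.csv" file from one of the subdirectories returns a value between 0 and 1. If the stored pLDDT values turn out to be fractions between 0 and 1, pass min_plddt=0.7 to designable_backbones instead.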
comparison/
Same as "monomers/", but contains data generated using RFdiffusion and Genie 2 for protein sizes between 50 and 400 amino acids.
shape/
This directory contains subdirectories named "ve-seg-<letter>-1-esm/", one for each letter in (S, A, L, D). These contain the same types of files as the subdirectories of "monomers/". In "shape/", the designed structures are conditioned on a letter shape, e.g. backbones in "ve-seg-A-1-esm/backbones/" were generated to be shaped like the letter "A".
motif/
This directory contains generated structures for the motif-scaffolding benchmark described by Lin et al., 2024 [1]. It contains two subdirectories:
- "cond/": contains results generated using motif-conditioned models with the checkpoint "multimotif_vp-200k.jax"
- "nocond/": contains results generated using structure-editing for motif-scaffolding with the checkpoint "default_vp-200k.jax"
Each of these subdirectories has the same structure as the directories "monomers/" and "shape/", with one subdirectory per motif PDB file in the motif-scaffolding benchmark, e.g. "cond/multimotif_vp-1bcf.pdb-esm/" or "nocond/default_vp-1bcf.pdb-esm/". These directories contain the usual "backbones/" and "predictions/" subdirectories, as well as a file "motif_scores.csv". This has fields analogous to "scores.csv", plus two additional fields for motif RMSD:
- "motif_rmsd_ca": CA-only RMSD between the ESMfold predicted structure and the input motif
- "motif_rmsd_bb": full-backbone (N, CA, C) RMSD between the ESMfold predicted structure and the input motif
A designed sequence-structure pair is only considered successful if sc_rmsd < 2 Å, plddt > 70 and motif_rmsd_bb < 1 Å.
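The success criterion above can be applied row by row to "motif_scores.csv". This sketch assumes that the file carries the "name", "index", "sc_rmsd" and "plddt" columns of "scores.csv" (the text only says the fields are analogous) and that "plddt" is on a 0 to 100 scale; the function names are hypothetical.

```python
import csv


def is_motif_success(row, max_sc_rmsd=2.0, min_plddt=70.0, max_motif_rmsd=1.0):
    """Check one row against sc_rmsd < 2 Å, plddt > 70, motif_rmsd_bb < 1 Å."""
    return (
        float(row["sc_rmsd"]) < max_sc_rmsd
        and float(row["plddt"]) > min_plddt
        and float(row["motif_rmsd_bb"]) < max_motif_rmsd
    )


def successful_designs(path):
    """Return (name, index) pairs of all successful rows in motif_scores.csv."""
    with open(path, newline="") as handle:
        # Assumption: "name" and "index" columns exist as in scores.csv.
        return [
            (row["name"], row["index"])
            for row in csv.DictReader(handle)
            if is_motif_success(row)
        ]
```

Note that the criterion uses the full-backbone motif RMSD ("motif_rmsd_bb"), not the CA-only value.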
sym/
This directory contains generated structures for symmetric repeat proteins using both VP and VE models with structure-editing. Subdirectories are named by model type ("default_vp", "default_ve_minimal_timeless"), symmetry ("C<number>" for cyclic symmetry, e.g. "C3"; "screw" for screw symmetry) and, additionally, the screw radius (e.g. "r10" for a 10 Å radius), screw angle in degrees (e.g. "a120") and screw translation in Å (e.g. "t10"), resulting in names such as "default_ve_minimal_timeless-screw-100-t12-r0-a180-1-esm/". These have the same directory structure and "scores.csv" file as the "monomers/" and "shape/" directories.
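The symmetry parameters encoded in these names can be pulled out with a small parser. The sketch below only interprets the tokens that are explained above ("screw", "C<number>", "r…", "a…", "t…") and deliberately ignores all other tokens, since their meaning is not documented here. The function name and the dictionary keys are hypothetical.

```python
import re

# Single-letter prefixes documented in the description.
_KEYS = {"r": "radius_angstrom", "a": "angle_degrees", "t": "translation_angstrom"}
_PARAM = re.compile(r"([rat])(\d+(?:\.\d+)?)")
_CYCLIC = re.compile(r"C(\d+)")


def parse_symmetry_tokens(name):
    """Extract the documented symmetry tokens from a sym/ directory name."""
    info = {}
    for token in name.rstrip("/").split("-"):
        cyclic = _CYCLIC.fullmatch(token)
        param = _PARAM.fullmatch(token)
        if token == "screw":
            info["symmetry"] = "screw"
        elif cyclic:
            info["symmetry"] = "cyclic"
            info["order"] = int(cyclic.group(1))
        elif param:
            info[_KEYS[param.group(1)]] = float(param.group(2))
        # Any other token (model type, plain numbers, "esm") is skipped.
    return info
```

Applied to "default_ve_minimal_timeless-screw-100-t12-r0-a180-1-esm/", this returns screw symmetry with a translation of 12 Å, a radius of 0 Å and an angle of 180 degrees.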
confchange/
This directory contains generated structures for designed multi-state proteins. In our manuscript we compare two different approaches to multi-state design using salad which are reflected in two subdirectories of "confchange/":
- "default_vp-parent-split-af2": running independent denoising processes with distinct secondary structure constraints, followed by tied ProteinMPNN sequence design
- "default_vp-parent-split-constrained-af2": running coupled denoising processes in which substructures shared across states are kept aligned, followed by tied ProteinMPNN sequence design
Both share the same directory structure:
- "backbones/{parent, child1, child2, partial_parent, partial_child1, partial_child2}": generated successful backbones, and partially successful backbones ("partial_"), for the three designed states (the full parent structure and two child structures resulting from splitting the parent sequence into its N- and C-terminal parts).
- "predictions/{parent, child1, child2, partial_parent, partial_child1, partial_child2}": the best AlphaFold 2 predicted structures for each successful backbone.
- "scores_{parent, child1, child2}.csv": scores files as above, generated using AlphaFold 2 predictions for each designed state.
To directly compare with the work of Lisanza et al., 2024 [2], successful designs were selected using AlphaFold 2 structure prediction, with cutoffs pLDDT > 75 and scRMSD < 3 Å for all states.
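The selection rule above (pLDDT > 75 and scRMSD < 3 Å in all states) could be re-applied to the three scores files roughly as follows. This is a sketch under a strong assumption that is not confirmed by the description: rows in "scores_parent.csv", "scores_child1.csv" and "scores_child2.csv" are taken to describe the same design when they share the same "name" and "index" values. It also assumes "plddt" is on a 0 to 100 scale. The function names are hypothetical.

```python
import csv
from pathlib import Path


def passing_keys(path, min_plddt=75.0, max_sc_rmsd=3.0):
    """Return the (name, index) pairs in one scores file that pass both cutoffs."""
    with open(path, newline="") as handle:
        return {
            (row["name"], row["index"])
            for row in csv.DictReader(handle)
            if float(row["plddt"]) > min_plddt and float(row["sc_rmsd"]) < max_sc_rmsd
        }


def successful_multistate(directory, states=("parent", "child1", "child2")):
    """Designs that pass the cutoffs in every state.

    Assumption: a design is identified by the same (name, index) pair
    in all three scores files.
    """
    per_state = [
        passing_keys(Path(directory) / f"scores_{state}.csv") for state in states
    ]
    return set.intersection(*per_state)
```

If the files use a different pairing between states, only the key construction in passing_keys needs to change; the intersection across states stays the same.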
References
[1] Lin, Y., Lee, M., Zhang, Z., & AlQuraishi, M. (2024). Out of Many, One: Designing and Scaffolding Proteins at the Scale of the Structural Universe with Genie 2. arXiv preprint arXiv:2405.15489.
[2] Lisanza, S. L., Gershon, J. M., Tipps, S. W., Sims, J. N., Arnoldt, L., Hendel, S. J., ... & Baker, D. (2024). Multistate and functional protein design using RoseTTAFold sequence space diffusion. Nature Biotechnology, 1-11.
Files
(4.1 GB)
MD5 checksum | Size |
---|---|
md5:62d7c677cff9e19f8c5ab9a478e33247 | 227.2 MB |
md5:b78c2d9c7df8cf05e760267bcc89e04c | 3.4 GB |
md5:7103f565deb147168aaff4c2cf38cf23 | 18.4 MB |
md5:370dcd1d273b0fe318257247f3b3fe00 | 488.1 MB |
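The MD5 checksums listed above can be used to check that a downloaded archive is intact. The following is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, and which checksum belongs to which archive has to be taken from the record page, since the file names are not reproduced in the listing above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use low for the multi-gigabyte archive. A mismatch usually indicates an incomplete or corrupted download.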
Additional details
Dates
- Created: 2025-01-24 (initial upload date)
Software
- Repository URL: https://github.com/mjendrusch/salad
- Programming language: Python
- Development status: Active