Published April 17, 2025 | Version 0.1.0
Dataset Open

Underlying data for "AlphaDesign: A de novo protein design framework based on AlphaFold"

Description

This repository contains the underlying data for the manuscript "AlphaDesign: A de novo protein design framework based on AlphaFold".

code.tar.gz

Contains the source code for AlphaDesign under a CC BY-NC-SA 4.0 license, as well as the code for novobench under an Apache 2.0 license. It also includes README files for both and a directory of example inputs and expected outputs.

designs.tar.gz

This archive contains de novo designed protein structures, their corresponding designed sequences and AlphaFold / ESMfold-based scores for each designed sequence-structure pair. Scores are provided in the CSV format described below.

*_scores.csv file format

The *_scores.csv files contain values of relevant scores for quantifying the number of successful designs and selecting designs for experimental validation. They contain the following columns in order:

  • result: base name of the backbone PDB-file for this row, e.g. backbone_0.pdb

  • index: unique integer index of the designed sequence for this row.

  • sequence: designed sequence for this row.

  • sc_rmsd: self-consistent RMSD (scRMSD) between the designed backbone and the predicted structure of this row's sequence.

  • sc_tm: self-consistent TM-score (scTM) between designed and predicted structure.

  • plddt: mean pLDDT over all non-templated positions in the predicted structure.

  • pae: mean pAE of the predicted structure.

  • ipae: mean interface pAE of the predicted structure, if multiple chains are present. Otherwise inf.

  • mpae: minimum interface pAE of the predicted structure, if multiple chains are present. Otherwise inf.
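As a minimal sketch of how such a file could be read, the following Python snippet parses the columns listed above using only the standard library. It assumes the files are comma-separated; a header row is skipped if one is present. The function names and the threshold parameters are hypothetical, the thresholds are deliberately left for the caller to choose and are not the selection criteria used in the manuscript, and the two rows in the demo are invented for illustration.

```python
import csv
import io

# Column order as listed in the description above.
COLUMNS = ["result", "index", "sequence", "sc_rmsd", "sc_tm",
           "plddt", "pae", "ipae", "mpae"]
NUMERIC = COLUMNS[3:]


def read_scores(handle):
    """Parse a *_scores.csv stream into a list of dicts with float scores."""
    rows = []
    for record in csv.reader(handle):
        if not record or record[0] == "result":
            continue  # skip blank lines and a header row, if present
        row = dict(zip(COLUMNS, record))
        row["index"] = int(row["index"])
        for key in NUMERIC:
            # float() also parses the "inf" placeholder used for ipae / mpae.
            row[key] = float(row[key])
        rows.append(row)
    return rows


def filter_designs(rows, *, max_sc_rmsd, min_plddt):
    """Keep rows passing caller-chosen thresholds (hypothetical helper)."""
    return [r for r in rows
            if r["sc_rmsd"] <= max_sc_rmsd and r["plddt"] >= min_plddt]


# Invented example rows, for illustration only (not taken from the dataset).
demo = io.StringIO(
    "backbone_0.pdb,0,MKV,1.2,0.9,85.0,4.0,inf,inf\n"
    "backbone_0.pdb,1,MKL,3.5,0.6,60.0,9.0,inf,inf\n"
)
demo_rows = read_scores(demo)
demo_kept = filter_designs(demo_rows, max_sc_rmsd=2.0, min_plddt=70.0)
```

For a real file, pass an open file object instead of the in-memory demo, e.g. `read_scores(open(path, newline=""))`.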

subdirectories

The archive has the following directory structure:

├── binder_targets

│   ├── ...

├── binders

│   ├── pd1-100

│   ├── ...

├── complexes

│   ├── heterodimer-50-1.4

│   ├── ...

├── monomers

│   ├── camsol

│   ├── monomer-100-1.4

│   ├── ...

├── multistate

│   ├── confchange-tm0.1-50

│   └── ...

└── rf_diffusion

    ├── monomer-50

    └── ...

The monomers, complexes, binders and multistate subdirectories of the archive contain files for each type and size of protein design. For example, for de novo designed complexes:

complexes

├── heterodimer-50-1.4

│   ├── alpha_design

│   ├── alphafold_redesign_scores.csv

│   ├── esmfold_raw_scores.csv

│   ├── esmfold_redesign_scores.csv

│   ├── mpnn_redesign

│   └── redesign_template.yaml

├── …

Here, the alpha_design directory contains PDB-files of the designed structure candidates from the first step of the AlphaDesign pipeline. The mpnn_redesign directory contains a FASTA-file for each PDB-file in alpha_design. This FASTA-file contains the set of redesigned sequences generated using the ADM for this PDB-file. redesign_template.yaml contains the instructions used for redesigning the sequence for each PDB-file.
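The redesigned sequences can be read with any FASTA parser. The following is a minimal standard-library sketch, assuming the files follow the usual FASTA convention of a ">" header line followed by one or more sequence lines. The function name is hypothetical, and no assumption is made about what the header lines contain.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them per record.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

To read a file from the archive, pass its contents, e.g. `parse_fasta(pathlib.Path(fasta_path).read_text())`, where `fasta_path` is whichever FASTA-file you want to inspect.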

In addition, these subdirectories contain *_scores.csv files. For monomers and complexes, there are three of these:

  • esmfold_raw_scores.csv: scores for raw sequence-structure pairs from the first step of the AlphaDesign pipeline using ESMfold.

  • esmfold_redesign_scores.csv: scores for ADM-redesigned sequence-structure pairs using ESMfold.

  • alphafold_redesign_scores.csv: scores for ADM-redesigned sequence-structure pairs using AlphaFold.

For binders and multistate, only scores of ADM-redesigned sequence-structure pairs using AlphaFold are reported. Additionally, for multistate, scores are reported for all concurrently designed states of each protein:

  • alphafold_redesign_scores_state_X.pdb: scores using AlphaFold for designed state “X”. State 0 corresponds to the complex / bound state, and states 1 / 2 correspond to the two monomeric states for conformation-changing de novo designs. For bispecific binder designs, state 0 corresponds to the complex with the first target, state 1 to the second.

monomers additionally contains a camsol subdirectory:

monomers/camsol

├── camsol_intrinsic_raw_monomer_sequences.txt

├── camsol_intrinsic_redesigned_monomer_sequences.txt

├── raw_sequence.fasta

└── redesigned_sequence.fasta

This directory contains the raw sequences for all designed monomers from the first step of the AlphaDesign pipeline (raw_sequence.fasta), the corresponding best ADM-redesigned sequences (redesigned_sequence.fasta) and CamSol solubility scores for both (camsol_intrinsic_raw_monomer_sequences.txt; camsol_intrinsic_redesigned_monomer_sequences.txt).

The binder_targets subdirectory contains PDB structures of all the target proteins used for binder design in this work. The RcaT structures here (RcaTSen2_active_domain.pdb and RcaTEco1_active_domain.pdb) contain contiguous cropped structures around the putative active site of these RcaT homologs.

Finally, the rf_diffusion directory and its subdirectories contain structures and sequences generated using RFdiffusion and ProteinMPNN for the comparison between RFdiffusion and AlphaDesign. There are subdirectories for the following types of designs:

  • monomer-<50, 100, 200, 300>: contain monomer designs with 50 to 300 amino acids

  • homomer-<2, 3, 4>: contain homooligomers with 50 amino acid monomers and 2 - 4 subunits.

  • heterodimer: contains heterodimers with 50 amino acid monomers

Each subdirectory contains an alphafold_scores.csv file with AlphaFold scores (as above) for each design, a design subdirectory with designed protein backbones in PDB format, and a protein_mpnn directory with designed amino acid sequences in FASTA format.

MD_input.tar.gz

Input files for preparing ensemble all-atom molecular dynamics (MD) simulations of a subset of designed RcaT-Sen2 binder complexes in this work. The directory has the following structure:

├── RcaT_bispecifics

│   └── uncropped

│       └── ... # designed systems

├── RcaT_conf_change

│   └── cropped

│       └── ... # designed systems

├── RcaT_Sen2

│   └── ... # designed systems

└── README

Each subdirectory contains MD input files for a specific class of designed RcaT-Sen2 binders:

  • RcaT_bispecifics: bispecific binders for RcaT-Sen2 and RcaT-Eco1

  • RcaT_conf_change: RcaT-Sen2 binders designed to change conformation upon binding

  • RcaT_Sen2: monospecific binders to RcaT-Sen2

Each of the designed_systems subdirectories has the following contents:

<design>

├── build

│   ├── <design>.pdb

│   └── build.tleap

├── ensemble

│   └── out

│       └── 1

│           └── prod

└── eq

    ├── out_eq1

    │   └── ref-min-10.in

    ├── ...

    │   └── ...

    └── out_eq11

        └── ref-equil-NPT.in

These files have the following function:

  • build/<design>.pdb: PDB-format structure of the designed monomer or target-binder complex.

  • build/build.tleap: tleap input file for solvating the structure, adding ions and producing Amber parameters and coordinates in the form of .crd, .pdb and .prmtop files.

  • ensemble/out/1/prod: Amber input file for running a single-replica production trajectory of an ensemble simulation. To produce additional replicas (2, ..., N), make copies of this directory in the same path.

  • eq/out_eq<N>/<equilibration>.in: Amber input files for each separate equilibration step (11 in total) performed before running production. These equilibration steps are described in detail in the methods section of this work.
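Setting up additional replicas can be scripted. The sketch below is one possible way to do it in Python, under the assumption that the directory to copy is ensemble/out/1 and that replicas are numbered 2 to N next to it. The function name and its arguments are hypothetical and not part of the archive.

```python
import shutil
from pathlib import Path


def make_replicas(design_dir, n_replicas):
    """Copy ensemble/out/1 to ensemble/out/2 ... ensemble/out/<n_replicas>.

    Assumption: replica directories sit side by side under ensemble/out and
    are named by their replica number.
    """
    out_dir = Path(design_dir) / "ensemble" / "out"
    template = out_dir / "1"
    created = []
    for replica in range(2, n_replicas + 1):
        target = out_dir / str(replica)
        if target.exists():
            continue  # never overwrite an existing replica directory
        shutil.copytree(template, target)
        created.append(target)
    return created
```

Existing replica directories are skipped rather than overwritten, so the function can be re-run safely after some replicas have already been simulated.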

MD_outputs.tar.gz

Time-series statistics extracted from all-atom explicit solvent molecular dynamics (MD) runs. This directory has the following structure:

└── <design type>

    └── <target - # amino acids>

        └── <design>

            ├── ensemble_timeseries

            │   └── <(1 - 50)>.dat

            └── prodigy_ensemble

                └── <(1 - 50)>

                    ├── mean.dat

                    └── time.dat

Each designed binder for which we ran ensemble MD has two directories associated with it:

  • ensemble_timeseries: This directory contains statistics about each time-step in each replica trajectory (e.g. 1.dat for the first replica trajectory in the ensemble). The data are given in fixed-width format with the following columns:

    • global RMSD, global RMSF, monomer 1 RMSD, monomer 2 RMSD, monomer 1 RMSF, monomer 2 RMSF, global intra-chain contacts, global inter-chain contacts, global total number of contacts, monomer 1 intra-chain contacts, monomer 2 intra-chain contacts, monomer 1 - monomer 2 interface contacts

  • prodigy_ensemble: This directory contains per-replica means of PRODIGY IC values and predicted binding affinity (mean.dat) and per-replica per-time-step outputs from PRODIGY (time.dat).
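A replica time series can be loaded with a few lines of Python. The sketch below assumes that each line of an ensemble_timeseries file holds exactly the twelve values listed above, separated by whitespace, with no additional columns; lines starting with "#" are skipped. The column identifiers and function names are hypothetical labels derived from that list, not names found in the files.

```python
from statistics import fmean

# Hypothetical identifiers for the twelve columns, in the order listed above.
TIMESERIES_COLUMNS = [
    "global_rmsd", "global_rmsf",
    "monomer1_rmsd", "monomer2_rmsd",
    "monomer1_rmsf", "monomer2_rmsf",
    "global_intra_contacts", "global_inter_contacts", "global_total_contacts",
    "monomer1_intra_contacts", "monomer2_intra_contacts",
    "interface_contacts",
]


def read_timeseries(text):
    """Parse the contents of one <replica>.dat file into a list of dicts."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank and comment lines
        if len(fields) != len(TIMESERIES_COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(TIMESERIES_COLUMNS)} "
                f"values, found {len(fields)}"
            )
        rows.append(dict(zip(TIMESERIES_COLUMNS, map(float, fields))))
    return rows


def column_mean(rows, name):
    """Mean of one column over all time steps of a replica."""
    return fmean(row[name] for row in rows)
```

The explicit column-count check is there so that a file with a different layout fails loudly instead of being silently mislabeled.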

Files (248.8 MB)

md5:37a8c6d2d2679ef68f5e0a7c0cb152df  13.1 MB
md5:f2bbcdaf97c147fdda2a140f228d8a84  175.8 MB
md5:846e833fc14550e4828750da2faa6ed2  1.6 MB
md5:d2561bd2067618bb6a6bd80209c24873  58.4 MB
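Downloaded archives can be checked against the MD5 checksums in this listing. The following Python sketch computes the digest of a local file and compares it with a value in the "md5:<hex>" form used here. The function names are hypothetical, and since the listing as reproduced here does not show which checksum belongs to which archive, that pairing has to be taken from the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing(archive_path, "md5:<hex value from the listing>")` returns True only if the local file is byte-identical to the published one.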