Underlying data for "AlphaDesign: A de novo protein design framework based on AlphaFold"
Authors/Creators
- Jendrusch, Michael (Researcher) 1
- Yang, Alessio Ling Jie (Researcher)
- Cacace, Elisabetta (Researcher) 2
- Bobonis, Jacob (Researcher) 3, 4
- Voogdt, Carlos (Researcher) 1
- Kaspar, Sarah (Researcher) 1
- Schweimer, Kristian (Researcher) 5
- Perez-Borrajero, Cecilia (Researcher) 1
- Lapouge, Karine (Researcher) 1
- Scheurich, Jacob (Researcher) 1
- Remans, Kim (Researcher) 1
- Hennig, Janosch (Researcher) 5, 1
- Typas, Athanasios (Researcher) 1
- Korbel, Jan (Supervisor) 1
- Sadiq, S. Kashif (Researcher) 6
Description
This repository contains the underlying data for the manuscript "AlphaDesign: A de novo protein design framework based on AlphaFold".
code.tar.gz
Contains the source code for AlphaDesign under a CC BY-NC-SA 4.0 license, as well as the code for novobench under an Apache 2.0 license. It also includes README files for both, and a directory of example inputs and expected outputs.
designs.tar.gz
This archive contains de novo designed protein structures, their corresponding designed sequences and AlphaFold / ESMfold-based scores for each designed sequence-structure pair. Scores are provided in the CSV format described below.
*_scores.csv file format
The *_scores.csv files contain values of relevant scores for quantifying the number of successful designs and selecting designs for experimental validation. They contain the following columns in order:
- result: base name of the backbone PDB-file for this row, e.g. backbone_0.pdb
- index: unique integer index of the designed sequence for this row.
- sequence: designed sequence for this row.
- sc_rmsd: self-consistent RMSD (scRMSD) between the designed backbone and the predicted structure of this row's sequence.
- sc_tm: self-consistent TM-score (scTM) between designed and predicted structure.
- plddt: mean pLDDT over all non-templated positions in the predicted structure.
- pae: mean pAE of the predicted structure.
- ipae: mean interface pAE of the predicted structure, if multiple chains are present. Otherwise inf.
- mpae: minimum interface pAE of the predicted structure, if multiple chains are present. Otherwise inf.
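As a rough illustration of how these columns might be consumed, the following Python sketch loads a *_scores.csv file and keeps the rows below an scRMSD cutoff. It assumes a comma-separated file with a single header row. The function names load_scores and filter_designs, the 2.0 cutoff and the example values are illustrative assumptions, not the selection criteria used in the manuscript.

```python
import csv
import io

# Column order as listed above.
COLUMNS = ["result", "index", "sequence", "sc_rmsd", "sc_tm",
           "plddt", "pae", "ipae", "mpae"]
FLOAT_COLUMNS = COLUMNS[3:]


def load_scores(handle, has_header=True):
    """Read score rows from an open text handle into a list of dicts."""
    reader = csv.reader(handle)
    if has_header:  # assumption: the first line names the columns
        next(reader, None)
    rows = []
    for fields in reader:
        if not fields:
            continue
        row = dict(zip(COLUMNS, fields))
        row["index"] = int(row["index"])
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])  # float() also parses "inf"
        rows.append(row)
    return rows


def filter_designs(rows, max_sc_rmsd=2.0):
    """Keep rows whose scRMSD is at most max_sc_rmsd (hypothetical cutoff)."""
    return [row for row in rows if row["sc_rmsd"] <= max_sc_rmsd]


# Made-up example data, not taken from the archive.
example = io.StringIO(
    "result,index,sequence,sc_rmsd,sc_tm,plddt,pae,ipae,mpae\n"
    "backbone_0.pdb,0,MKV,1.2,0.9,0.85,4.1,inf,inf\n"
    "backbone_0.pdb,1,MRV,3.5,0.6,0.70,9.8,inf,inf\n"
)
rows = load_scores(example)
kept = filter_designs(rows)
print([row["index"] for row in kept])  # prints [0]
```

For a file on disk, pass an open handle instead, e.g. `with open(path, newline="") as handle: rows = load_scores(handle)`.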
subdirectories
The archive has the following directory structure:
├── binder_targets
│   ├── ...
├── binders
│   ├── pd1-100
│   ├── ...
├── complexes
│   ├── heterodimer-50-1.4
│   ├── ...
├── monomers
│   ├── camsol
│   ├── monomer-100-1.4
│   ├── ...
├── multistate
│   ├── confchange-tm0.1-50
│   └── ...
└── rf_diffusion
    ├── monomer-50
    └── ...
The following subdirectories of the archive (monomers, complexes, binders, multistate) contain files for each type and size of protein design. E.g. for de novo designed complexes:
complexes
├── heterodimer-50-1.4
│ ├── alpha_design
│ ├── alphafold_redesign_scores.csv
│ ├── esmfold_raw_scores.csv
│ ├── esmfold_redesign_scores.csv
│ ├── mpnn_redesign
│ └── redesign_template.yaml
├── …
Here, the alpha_design directory contains PDB-files of the designed structure candidates from the first step of the AlphaDesign pipeline. The mpnn_redesign directory contains a FASTA-file for each PDB-file in alpha_design, holding the set of redesigned sequences generated using the ADM for that PDB-file. redesign_template.yaml contains the instructions used for redesigning the sequence for each PDB-file.
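The redesign FASTA files described above can be read with a few lines of Python. This is a minimal sketch of a generic FASTA reader. The function name read_fasta and the example headers are hypothetical, and the actual header format used in the archive is not specified here.

```python
import io


def read_fasta(handle):
    """Return a list of (header, sequence) tuples from an open text handle."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example content, not taken from the archive.
example_fasta = io.StringIO(">design_a\nMKVLA\nGHT\n>design_b\nMRVLS\n")
records = read_fasta(example_fasta)
print(records)  # prints [('design_a', 'MKVLAGHT'), ('design_b', 'MRVLS')]
```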
In addition, these subdirectories contain *_scores.csv files. For monomers and complexes, there are three of these:
- esmfold_raw_scores.csv: scores for raw sequence-structure pairs from the first step of the AlphaDesign pipeline using ESMfold.
- esmfold_redesign_scores.csv: scores for ADM-redesigned sequence-structure pairs using ESMfold.
- alphafold_redesign_scores.csv: scores for ADM-redesigned sequence-structure pairs using AlphaFold.
For binders and multistate, only scores of ADM-redesigned sequence-structure pairs using AlphaFold are reported. Additionally, for multistate, scores are reported for all concurrently designed states of each protein:
- alphafold_redesign_scores_state_X.csv: scores using AlphaFold for designed state "X". State 0 corresponds to the complex / bound state, and states 1 / 2 correspond to the two monomeric states for conformation-changing de novo designs. For bispecific binder designs, state 0 corresponds to the complex with the first target, state 1 to the complex with the second.
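When working with multistate designs, it can help to collect the per-state score files by their state number. The sketch below extracts the state index from file names of the form described above. The helper names state_index and group_by_state and the example paths are hypothetical, and the pattern deliberately accepts any file extension.

```python
import re
from pathlib import PurePosixPath

# Matches e.g. "alphafold_redesign_scores_state_0.<ext>".
STATE_PATTERN = re.compile(r"^alphafold_redesign_scores_state_(\d+)\.\w+$")


def state_index(filename):
    """Return the integer state encoded in a file name, or None."""
    match = STATE_PATTERN.match(PurePosixPath(filename).name)
    return int(match.group(1)) if match else None


def group_by_state(filenames):
    """Map state index -> file name, sorted by state index."""
    states = {}
    for name in filenames:
        index = state_index(name)
        if index is not None:
            states[index] = name
    return dict(sorted(states.items()))


# Hypothetical listing of one multistate design directory.
listing = [
    "confchange-tm0.1-50/alphafold_redesign_scores_state_1.csv",
    "confchange-tm0.1-50/alphafold_redesign_scores_state_0.csv",
    "confchange-tm0.1-50/redesign_template.yaml",
]
by_state = group_by_state(listing)
print(list(by_state))  # prints [0, 1]
```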
monomers additionally contains a camsol subdirectory:
monomers/camsol
├── camsol_intrinsic_raw_monomer_sequences.txt
├── camsol_intrinsic_redesigned_monomer_sequences.txt
├── raw_sequence.fasta
└── redesigned_sequence.fasta
This directory contains the raw sequences for all designed monomers from the first step of the AlphaDesign pipeline (raw_sequence.fasta), the corresponding best ADM-redesigned sequences (redesigned_sequence.fasta) and CamSol solubility scores for both (camsol_intrinsic_raw_monomer_sequences.txt; camsol_intrinsic_redesigned_monomer_sequences.txt).
The binder_targets subdirectory contains PDB structures of all the target proteins used for binder design in this work. The RcaT structures here (RcaTSen2_active_domain.pdb and RcaTEco1_active_domain.pdb) contain contiguous cropped structures around the putative active site of these RcaT homologs.
Finally, the rf_diffusion directory and its subdirectories contain structures and sequences generated using RFdiffusion and ProteinMPNN for the comparison between RFdiffusion and AlphaDesign. There are subdirectories for the following types of designs:
- monomer-<50, 100, 200, 300>: contain monomer designs with 50 to 300 amino acids.
- homomer-<2, 3, 4>: contain homooligomers with 50 amino acid monomers and 2 - 4 subunits.
- heterodimer: contains heterodimers with 50 amino acid monomers.
Each subdirectory contains an alphafold_scores.csv file with AlphaFold scores (as above) for each design, a design subdirectory with designed protein backbones in PDB format, and a protein_mpnn directory with designed amino acid sequences in FASTA format.
MD_input.tar.gz
Input files for preparing ensemble all-atom molecular dynamics (MD) simulations of a subset of designed RcaT-Sen2 binder complexes in this work. The directory has the following structure:
├── RcaT_bispecifics
│   └── uncropped
│       └── ... # designed systems
├── RcaT_conf_change
│   └── cropped
│       └── ... # designed systems
├── RcaT_Sen2
│   └── ... # designed systems
└── README
Each subdirectory contains MD input files for a specific class of designed RcaT-Sen2 binders:
- RcaT_bispecifics: bispecific binders for RcaT-Sen2 and RcaT-Eco1
- RcaT_conf_change: RcaT-Sen2 binders designed to change conformation upon binding
- RcaT_Sen2: monospecific binders to RcaT-Sen2
Each designed-system subdirectory (marked "# designed systems" above) has the following contents:
<design>
├── build
│   ├── <design>.pdb
│   └── build.tleap
├── ensemble
│   └── out
│       └── 1
│           └── prod
└── eq
    ├── out_eq1
    │   └── ref-min-10.in
    ├── ...
    │   └── ...
    └── out_eq11
        └── ref-equil-NPT.in
These files have the following function:
- build/<design>.pdb: PDB-format structure of the designed monomer or target-binder complex.
- build/build.tleap: tleap input file for solvating the structure, adding ions and producing Amber parameters and coordinates in the form of .crd, .pdb and .prmtop files.
- ensemble/out/1/prod: Amber input file for running a single replica production trajectory of an ensemble simulation. To produce additional replicas (2, ..., N), make copies of the ensemble/out/1 directory in the same path.
- eq/out_eq<N>/<equilibration>.in: Amber input files for each separate equilibration step (11 in total) performed before running production. These equilibration steps are described in detail in the methods section of this work.
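The replica-copying step described above could be scripted roughly as follows. This is a sketch under the assumption that replica directories are simply named 2, 3, ..., N next to the provided directory 1. The function name make_replicas is hypothetical, and the demonstration runs on a temporary placeholder directory rather than on the real input files.

```python
import shutil
import tempfile
from pathlib import Path


def make_replicas(out_dir, n_replicas, template="1"):
    """Copy out_dir/<template> to out_dir/2 ... out_dir/<n_replicas>."""
    out_dir = Path(out_dir)
    source = out_dir / template
    created = []
    for replica in range(2, n_replicas + 1):
        target = out_dir / str(replica)
        if target.exists():
            continue  # never overwrite an existing replica directory
        shutil.copytree(source, target)
        created.append(target)
    return created


# Demonstration on a throwaway directory that mimics ensemble/out/1/prod.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "ensemble" / "out"
    (out / "1").mkdir(parents=True)
    (out / "1" / "prod").write_text("placeholder input\n")
    new_dirs = make_replicas(out, n_replicas=3)
    replica_names = sorted(path.name for path in out.iterdir())
    copied_ok = (out / "3" / "prod").read_text() == "placeholder input\n"

print(replica_names)  # prints ['1', '2', '3']
```

Skipping existing targets keeps the helper safe to re-run after some replicas have already produced output.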
MD_outputs.tar.gz
Time-series statistics extracted from all-atom explicit solvent molecular dynamics (MD) runs. This directory has the following structure:
└── <design type>
    └── <target - # amino acids>
        └── <design>
            ├── ensemble_timeseries
            │   └── <(1 - 50)>.dat
            └── prodigy_ensemble
                └── <(1 - 50)>
                    ├── mean.dat
                    └── time.dat
Each designed binder for which we ran ensemble MD has two directories associated with it:
- ensemble_timeseries: This directory contains per-time-step statistics for each replica trajectory (e.g. 1.dat for the first replica trajectory in the ensemble). The data are given in fixed-width format with the following columns: global RMSD, global RMSF, monomer 1 RMSD, monomer 2 RMSD, monomer 1 RMSF, monomer 2 RMSF, global intra-chain contacts, global inter-chain contacts, global total number of contacts, monomer 1 intra-chain contacts, monomer 2 intra-chain contacts, monomer 1 - monomer 2 interface contacts.
- prodigy_ensemble: This directory contains per-replica means of PRODIGY IC values and predicted binding affinity (mean.dat) and per-replica per-time-step outputs from PRODIGY (time.dat).
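A replica time series could be loaded along the following lines. This sketch assumes that each line of a .dat file holds exactly the twelve columns listed above, in that order, separated by whitespace, and that lines starting with "#" can be skipped. The short column names, the function names and the example numbers are illustrative assumptions.

```python
import io

# Hypothetical short names for the twelve columns listed above.
TIMESERIES_COLUMNS = [
    "global_rmsd", "global_rmsf",
    "mon1_rmsd", "mon2_rmsd", "mon1_rmsf", "mon2_rmsf",
    "global_intra_contacts", "global_inter_contacts", "global_total_contacts",
    "mon1_intra_contacts", "mon2_intra_contacts", "interface_contacts",
]


def read_timeseries(handle):
    """Parse whitespace-separated rows into one dict per time step."""
    frames = []
    for line_number, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # assumption: blank and comment lines carry no data
        if len(fields) != len(TIMESERIES_COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(TIMESERIES_COLUMNS)} "
                f"columns, found {len(fields)}"
            )
        frames.append(dict(zip(TIMESERIES_COLUMNS, map(float, fields))))
    return frames


def column_mean(frames, name):
    """Average one column over all time steps."""
    return sum(frame[name] for frame in frames) / len(frames)


# Made-up numbers, not taken from the archive.
example_dat = io.StringIO(
    "1.0 0.5 0.8 0.9 0.4 0.6 200 30 230 110 90 30\n"
    "3.0 0.5 1.1 1.3 0.4 0.6 198 28 226 108 90 28\n"
)
frames = read_timeseries(example_dat)
print(column_mean(frames, "global_rmsd"))  # prints 2.0
```

The explicit column-count check is there so that a layout differing from these assumptions fails loudly instead of silently mislabeling values.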
Files
(248.8 MB)
| Name | Size |
|---|---|
| md5:37a8c6d2d2679ef68f5e0a7c0cb152df | 13.1 MB |
| md5:f2bbcdaf97c147fdda2a140f228d8a84 | 175.8 MB |
| md5:846e833fc14550e4828750da2faa6ed2 | 1.6 MB |
| md5:d2561bd2067618bb6a6bd80209c24873 | 58.4 MB |
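The MD5 checksums listed above can be used to check a downloaded archive for corruption. The following is a minimal sketch using Python's hashlib. The function names md5_of_file and matches_checksum are hypothetical, and the demonstration hashes a small temporary file, since the mapping between checksums and archive names is not given in this listing.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Demonstration on a throwaway file with known content.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    sample_digest = md5_of_file(sample)
    sample_ok = matches_checksum(sample, "md5:5d41402abc4b2a76b9719d911017c592")

print(sample_digest)  # prints 5d41402abc4b2a76b9719d911017c592
```

For a real download, call matches_checksum with the path of the archive and the corresponding md5 entry from the table.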