Underlying data for "AlphaDesign: A de novo protein design framework based on AlphaFold"

Jendrusch, Michael; Yang, Alessio Ling Jie; Cacace, Elisabetta; Bobonis, Jacob; Voogdt, Carlos; Kaspar, Sarah; Schweimer, Kristian; Perez-Borrajero, Cecilia; Lapouge, Karine; Scheurich, Jacob; Remans, Kim; Hennig, Janosch; Typas, Athanasios; Korbel, Jan; Sadiq, S. Kashif

doi:10.5281/zenodo.15208893

Published April 17, 2025 | Version 0.1.0

Dataset Open

1. European Molecular Biology Laboratory
2. ETH Zurich
3. Centre for Microbiology and Environmental Systems Science
4. University of Vienna
5. University of Bayreuth
6. Heidelberg Institute for Theoretical Studies

This repository contains the underlying data for the manuscript "AlphaDesign: A de novo protein design framework based on AlphaFold".

code.tar.gz

Contains the source code for AlphaDesign under a CC BY-NC-SA 4.0 license, as well as the code for novobench under an Apache 2.0 license. It also includes README files for both, and a directory of example inputs and expected outputs.

designs.tar.gz

This archive contains de novo designed protein structures, their corresponding designed sequences and AlphaFold / ESMfold-based scores for each designed sequence-structure pair. Scores are provided in the CSV format described below.

*_scores.csv file format

The *_scores.csv files contain values of relevant scores for quantifying the number of successful designs and selecting designs for experimental validation. They contain the following columns in order:

result: base name of the backbone PDB-file for this row, e.g. backbone_0.pdb
index: unique integer index of the designed sequence for this row.
sequence: designed sequence for this row.
sc_rmsd: self-consistent RMSD (scRMSD) between the designed backbone and the predicted structure of this row's sequence.
sc_tm: self-consistent TM-score (scTM) between designed and predicted structure.
plddt: mean pLDDT over all non-templated positions in the predicted structure.
pae: mean pAE of the predicted structure.
ipae: mean interface pAE of the predicted structure, if multiple chains are present. Otherwise inf.
mpae: minimum interface pAE of the predicted structure, if multiple chains are present. Otherwise inf.
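The column layout above can be read with the standard library alone. The following is a minimal sketch, assuming each *_scores.csv file starts with a header row that uses the column names listed above. The function names and the thresholds passed to select are illustrative and are not taken from the manuscript.

```python
import csv

# Columns parsed as floats; float("inf") also handles the "inf" entries
# used for ipae / mpae when only a single chain is present.
FLOAT_COLUMNS = ("sc_rmsd", "sc_tm", "plddt", "pae", "ipae", "mpae")


def read_scores(lines):
    """Parse an iterable of CSV lines (e.g. an open file) into typed dicts.

    Assumes a header row with the column names described above.
    """
    rows = []
    for row in csv.DictReader(lines):
        parsed = dict(row)
        parsed["index"] = int(row["index"])
        for name in FLOAT_COLUMNS:
            parsed[name] = float(row[name])
        rows.append(parsed)
    return rows


def select(rows, max_sc_rmsd, min_plddt):
    """Keep rows within caller-supplied cutoffs (hypothetical criteria)."""
    return [
        r for r in rows
        if r["sc_rmsd"] <= max_sc_rmsd and r["plddt"] >= min_plddt
    ]
```

A file can be passed in directly, for example by opening it with open(path, newline="") and handing the file object to read_scores. The cutoffs should be chosen to match the scale on which pLDDT is actually reported in the files.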

subdirectories

The archive has the following directory structure:

├── binder_targets
│   ├── ...
├── binders
│   ├── pd1-100
│   ├── ...
├── complexes
│   ├── heterodimer-50-1.4
│   ├── ...
├── monomers
│   ├── camsol
│   ├── monomer-100-1.4
│   ├── ...
├── multistate
│   ├── confchange-tm0.1-50
│   └── ...
└── rf_diffusion
    ├── monomer-50
    └── ...

The monomers, complexes, binders and multistate subdirectories of the archive contain files for each type and size of protein design. For example, for de novo designed complexes:

complexes
├── heterodimer-50-1.4
│   ├── alpha_design
│   ├── alphafold_redesign_scores.csv
│   ├── esmfold_raw_scores.csv
│   ├── esmfold_redesign_scores.csv
│   ├── mpnn_redesign
│   └── redesign_template.yaml
├── …

Here, the alpha_design directory contains PDB-files of the designed structure candidates from the first step of the AlphaDesign pipeline. The mpnn_redesign directory contains a FASTA-file for each PDB-file in alpha_design. This FASTA-file contains the set of redesigned sequences generated using the ADM for this PDB-file. redesign_template.yaml contains the instructions used for redesigning the sequence for each PDB-file.

In addition, these subdirectories contain *_scores.csv files. For monomers and complexes, there are three of these:

esmfold_raw_scores.csv: scores for raw sequence-structure pairs from the first step of the AlphaDesign pipeline using ESMfold.
esmfold_redesign_scores.csv: scores for ADM-redesigned sequence-structure pairs using ESMfold.
alphafold_redesign_scores.csv: scores for ADM-redesigned sequence-structure pairs using AlphaFold.
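To compare how the two predictors score the same redesigned sequences, the two redesign tables can be joined. The sketch below assumes that a (result, index) pair identifies the same sequence in esmfold_redesign_scores.csv and alphafold_redesign_scores.csv, and that both files have a header row. Neither assumption is stated in this description, so it should be checked against the files. The function names are illustrative.

```python
import csv


def load_by_key(lines):
    """Index the rows of a *_scores.csv stream by (result, index)."""
    return {
        (row["result"], int(row["index"])): row
        for row in csv.DictReader(lines)
    }


def paired_sc_rmsd(esmfold, alphafold):
    """Return {key: (ESMfold scRMSD, AlphaFold scRMSD)} for shared keys.

    Keys found in only one of the two tables are skipped.
    """
    shared = sorted(esmfold.keys() & alphafold.keys())
    return {
        key: (float(esmfold[key]["sc_rmsd"]), float(alphafold[key]["sc_rmsd"]))
        for key in shared
    }
```

The same pattern applies to any other score column by replacing sc_rmsd in the lookup.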

For binders and multistate, only scores of ADM-redesigned sequence-structure pairs using AlphaFold are reported. Additionally, for multistate, scores are reported for all concurrently designed states of each protein:

alphafold_redesign_scores_state_X.pdb: scores using AlphaFold for designed state “X”. State 0 corresponds to the complex / bound state, and states 1 / 2 correspond to the two monomeric states for conformation-changing de novo designs. For bispecific binder designs, state 0 corresponds to the complex with the first target, state 1 to the second.

monomers additionally contains a camsol subdirectory:

monomers/camsol
├── camsol_intrinsic_raw_monomer_sequences.txt
├── camsol_intrinsic_redesigned_monomer_sequences.txt
├── raw_sequence.fasta
└── redesigned_sequence.fasta

This directory contains the raw sequences for all designed monomers from the first step of the AlphaDesign pipeline (raw_sequence.fasta), the corresponding best ADM-redesigned sequences (redesigned_sequence.fasta) and CamSol solubility scores for both (camsol_intrinsic_raw_monomer_sequences.txt; camsol_intrinsic_redesigned_monomer_sequences.txt).
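The two FASTA files can be read without extra dependencies. This is a minimal sketch of a generic FASTA reader. It assumes standard FASTA records (a ">" header line followed by one or more sequence lines) and makes no assumption about how the headers in raw_sequence.fasta and redesigned_sequence.fasta are named.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)
```

For example, dict(read_fasta(open("raw_sequence.fasta"))) gives a header-to-sequence mapping, provided the headers are unique.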

The binder_targets subdirectory contains PDB structures of all the target proteins used for binder design in this work. The RcaT structures here (RcaTSen2_active_domain.pdb and RcaTEco1_active_domain.pdb) contain contiguous cropped structures around the putative active site of these RcaT homologs.

Finally, the rf_diffusion directory and its subdirectories contain structures and sequences generated using RFdiffusion and ProteinMPNN for the comparison between RFdiffusion and AlphaDesign. There are subdirectories for the following types of designs:

monomer-<50, 100, 200, 300>: contain monomer designs with 50, 100, 200 or 300 amino acids.
homomer-<2, 3, 4>: contain homooligomers with 50 amino acid monomers and 2 to 4 subunits.
heterodimer: contains heterodimers with 50 amino acid monomers.

Each subdirectory contains an alphafold_scores.csv file with AlphaFold scores (as above) for each design, a design subdirectory with designed protein backbones in PDB format, and a protein_mpnn directory with designed amino acid sequences in FASTA format.

MD_input.tar.gz

Input files for preparing ensemble all-atom molecular dynamics (MD) simulations of a subset of designed RcaT-Sen2 binder complexes in this work. The directory has the following structure:

├── RcaT_bispecifics
│   └── uncropped
│       └── ... # designed systems
├── RcaT_conf_change
│   └── cropped
│       └── ... # designed systems
├── RcaT_Sen2
│   └── ... # designed systems
└── README

Each subdirectory contains MD input files for a specific class of designed RcaT-Sen2 binders:

RcaT_bispecifics: bispecific binders for RcaT-Sen2 and RcaT-Eco1
RcaT_conf_change: RcaT-Sen2 binders designed to change conformation upon binding
RcaT_Sen2: monospecific binders to RcaT-Sen2

Each of the designed_systems subdirectories has the following contents:

├── build
│   ├── <design>.pdb
│   └── build.tleap
├── ensemble
│   └── out
│       └── 1
│           └── prod
└── eq
    ├── out_eq1
    │   └── ref-min-10.in
    ├── ...
    │   └── ...
    └── out_eq11
        └── ref-equil-NPT.in

These files have the following function:

build/<design>.pdb: PDB-format structure of the designed monomer or target-binder complex.
build/build.tleap: tleap input file for solvating the structure, adding ions and producing Amber parameters and coordinates in the form of .crd, .pdb and .prmtop files.
ensemble/out/1/prod: Amber input file for running a single-replica production trajectory of an ensemble simulation. To produce additional replicas (2, ..., N), make copies of the ensemble/out/1 directory in the same path.
eq/out_eq<N>/<equilibration>.in: Amber input files for each separate equilibration step (11 in total) performed before running production. These equilibration steps are described in detail in the methods section of the manuscript.
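Copying the replica directory can be scripted. The sketch below assumes the layout shown above, with ensemble/out/1 as the template directory. The function name make_replicas and the choice to skip existing directories are illustrative and are not part of the archive.

```python
import shutil
from pathlib import Path


def make_replicas(out_dir, n_replicas):
    """Copy replica directory '1' to '2' .. 'n_replicas' inside out_dir.

    Existing replica directories are left untouched, so the function can
    be re-run safely. Returns the list of directories it created.
    """
    out_dir = Path(out_dir)
    template = out_dir / "1"
    created = []
    for i in range(2, n_replicas + 1):
        target = out_dir / str(i)
        if target.exists():
            continue  # never overwrite an existing replica
        shutil.copytree(template, target)
        created.append(target)
    return created
```

Calling make_replicas("ensemble/out", 50) inside one designed-system directory would give 50 replica directories in total, matching the 1 to 50 numbering used in the MD output archive.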

MD_output.tar.gz

Time-series statistics extracted from all-atom explicit solvent molecular dynamics (MD) runs. This directory has the following structure:

└── <design type>
    └── <target - # amino acids>
        └── <design>
            ├── ensemble_timeseries
            │   └── <(1 - 50)>.dat
            └── prodigy_ensemble
                └── <(1 - 50)>
                    ├── mean.dat
                    └── time.dat

Each designed binder for which we ran ensemble MD has two directories associated with it:

ensemble_timeseries: This directory contains statistics about each time-step in each replica trajectory (e.g. 1.dat for the first replica trajectory in the ensemble). The data are given in fixed-width format with the following columns:

global RMSD, global RMSF, monomer 1 RMSD, monomer 2 RMSD, monomer 1 RMSF, monomer 2 RMSF, global intra-chain contacts, global inter-chain contacts, global total number of contacts, monomer 1 intra-chain contacts, monomer 2 intra-chain contacts, monomer 1 - monomer 2 interface contacts
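The twelve columns can be given names when loading a replica file. This is a minimal sketch, assuming whitespace-separated numeric fields in the order listed above, with one time step per line and no header row. The short column names are illustrative. The parser raises an error if a line has a different number of fields, so a mismatch with the real files is noticed rather than silently misread.

```python
COLUMNS = (
    "global_rmsd", "global_rmsf",
    "monomer1_rmsd", "monomer2_rmsd",
    "monomer1_rmsf", "monomer2_rmsf",
    "global_intra_contacts", "global_inter_contacts",
    "global_total_contacts",
    "monomer1_intra_contacts", "monomer2_intra_contacts",
    "interface_contacts",
)


def read_timeseries(lines):
    """Parse lines of a <replica>.dat file into one dict per time step."""
    records = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines, if any
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {number}: expected {len(COLUMNS)} fields, got {len(fields)}"
            )
        records.append(dict(zip(COLUMNS, map(float, fields))))
    return records


def column_mean(records, name):
    """Mean of one named column over all parsed time steps."""
    return sum(r[name] for r in records) / len(records)
```

For instance, column_mean(read_timeseries(open("1.dat")), "interface_contacts") would give the mean number of interface contacts over the first replica trajectory.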

prodigy_ensemble: This directory contains per-replica means of PRODIGY IC values and predicted binding affinity (mean.dat) and per-replica per-time-step outputs from PRODIGY (time.dat).

Files

Files (248.8 MB)

Name	MD5	Size
code.tar.gz	37a8c6d2d2679ef68f5e0a7c0cb152df	13.1 MB
designs.tar.gz	f2bbcdaf97c147fdda2a140f228d8a84	175.8 MB
MD_input.tar.gz	846e833fc14550e4828750da2faa6ed2	1.6 MB
MD_output.tar.gz	d2561bd2067618bb6a6bd80209c24873	58.4 MB
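The MD5 checksums listed for the four archives can be used to check downloads. The following is a minimal sketch using Python's hashlib. The checksums are copied from the file listing, while the function names and the chunk size are illustrative choices.

```python
import hashlib

# Checksums as given in the file listing of this record.
EXPECTED_MD5 = {
    "code.tar.gz": "37a8c6d2d2679ef68f5e0a7c0cb152df",
    "designs.tar.gz": "f2bbcdaf97c147fdda2a140f228d8a84",
    "MD_input.tar.gz": "846e833fc14550e4828750da2faa6ed2",
    "MD_output.tar.gz": "d2561bd2067618bb6a6bd80209c24873",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at path matches the listed checksum for name."""
    return md5_of(path) == EXPECTED_MD5[name]
```

For example, verify("designs.tar.gz", "designs.tar.gz") returns True only if the downloaded archive matches the listed checksum.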
