Published August 8, 2023 | Version 1
Software Open

Colabfold Batch AlphaFold-2-multimer structure analysis pipeline

  • 1. Harvard Medical School

Description

This python script allows one to find contacts between residues in multimeric structure files produced as output from Alphafold2 via the Colabfold pipeline https://github.com/sokrypton/ColabFold/tree/main/colabfold. It integrates both physical proximity and Alphafold confidence metrics such as the predicted Alignment Error(pAE) and the predicted Local Distance Difference Test (pLDDT) to determine whether a pair of residues is a valid contact. It's external dependencies are numpy and pandas. 

Running this script will produce one or more folders each containing 3 comma seperated value (CSV) files that you can then open with a standard text editor or any spreadhseet program. 

The 3 files are: summary.csv, interfaces.csv, and contacts.csv. 


usage: colabfold_analysis.py [-h] [--distance DISTANCE] [--pae PAE] [--pae-mode {min,avg}]
                             [--plddt PLDDT] [--combine-all]
                             [input [input ...]]

positional arguments:
  input                 One or more folders with PDB files and pAE JSON files output by Colabfold.
                        Note that '.done.txt' marker files produced by Colabfold are used to find
                        the names of complexes to analyze.

optional arguments:
  -h, --help            show this help message and exit
  
  --distance DISTANCE   Maximum distance in Angstroms that any two atoms in two residues in
                        different chains can have for them be considered in contact for the
                        analysis. Default is 8 Angstroms.
                        
  --pae PAE             Maximum predicted Angstrom Error (pAE) value in Angstroms allowed for a
                        contact(pair of residues) to be considered in the analysis. Valid values
                        range from 0 (best) to 30 (worst). Default is 15.
                        
  --pae-mode {min,avg}  How to combine the dual PAE values (x, y) and (y, x) into a single PAE
                        value for a residue pair (x, y). Default is 'min'.
                        
  --plddt PLDDT         Minimum pLDDT values required by both residues in a contact in order for
                        that contact to be included in the analysis. Values range from 0 (worst)
                        to 100 (best). Default is 50.
                        
  --aas AAS             A string representing what amino acids contacts to look/filter for. Allows you
                        to limit what contacts to include in the analysis. By default is blank meaning
                        all amino acids. A value of K would be for any lysine lysine pairs. KR would be
                        RR, KR, RK, or RR pairs, etc
                        
  --name-filter NAME_FILTER
                        An optional string that allows one to only analyze complexes that contain
                        that string in their name
                        
  --combine-all         Combine the analysis from multiple folders specified by the input argument
  
  --ignore-pae          Ignore PAE values and just analyze the PDB files. Overides any other PAE
                        settings.

EXAMPLES: 

python3 colabfold_analysis.py my_exciting_colabfold_output_folder

python3 colabfold_analysis.py my_exciting_colabfold_output_folder --pae 12 --plddt 50 --pae-mode avg

python3 colabfold_analysis.py folder1 folder2 folder3 --pae 12 --plddt 50 --pae-mode avg --combine-all

python3 colabfold_analysis.py folder1 --aas DEHKR

python3 colabfold_analysis.py folder1 --ignore-pae --name-filter MCM

python3 colabfold_analysis.py folder_? --distance 10 --plddt 60 --pae-mode min --combine-all

 

 

summary.csv

Summarizes all the findings per complex across all models that were run for it. Each row is a summary for one complex.

complex_name avg_n_models max_n_models num_contacts_with_max_n_models num_unique_contacts best_model_num best_pdockq best_plddt_avg best_pae_avg
name of the complex avg number of
models per contact
max number of models any
contact was seen in
number of unique contacts that
were seen max model number of times
number of unique contacts
across all models anlayzed
model number of prediction
producing strongest interaction
score (pdockq)
highest pdockq score recorded
across all predictions for this
complex
the average pLDDT values
across the interface for the
model with the highest pDOCKQ
the average pAE values
across the interface for the
model with the highest pDOCKQ

 

 

interfaces.csv

Shows the statistics for each prediction made for each complex. Each row is 1 prediction (structure/JSON score file)

complex_name model_num pdockq ncontacts plddt_min plddt_avg plddt_max pae_min pae_avg pae_max distance_avg
name of the complex AF model number predicted DOCKQ interface accuracy score
ranges from 0 worst to best 1
number of contacts seen in prediction Min residue pair pLDDT observed in the interface Average pair pLDDT observed in the interface Max residue pair pLDDT observed in the interface Min residue pair PAE observed in the interface Average residue pair PAE observed in the interface Max residue pair PAE observed in the interface Average distance between closest atoms in residue pairs in the interface

 

contacts.csv

A comprehensive table of all residue contact pairs between all chains that met the contact criteria specified during the run. Each row is 1 pair of interacting residues in different chains.

complex_name model_num aa1_chain aa1_index aa2_chain aa1_plddt aa2_index aa2_type aa2_plddt aa1_type pae min_distance
Name of the complex AlphaFold model number chain residue 1 is in Index of residue 1 within its chain chain residue 2 is in pLDDT for aa1 Index of residue 2 within its chain 1 letter code for residue 2 pLDDT for aa2 1 letter code for residue 1 Combined pAE value for residue pair calculated using specified "pae_mode" Minimum distance in angstroms between the 2 residues.

Files

Files (39.2 kB)

Name Size Download all
md5:db790ab56f846228272f1fb5cdf36c15
39.2 kB Download