Documentation

This repository hosts the data associated with the GitHub repository AsEP-dataset.

Dataset Files

The zip file asep-dataset.zip contains the following files:

asepv1-AbDb-IDs.txt: a text file containing the AbDb identifiers of the 1723 antibody-antigen pairs in the dataset.
asepv1_interim_graphs.tar.gz: contains 1723 .pt files. The .pt files are named using the AbDb identifier, e.g. 1a14_0P.pt. Each file stores a dictionary with the following key-value pairs:
- abdbid (str): a string representing the antibody AbDb identifier
- seqres (Dict[str, Union[Dict[str, str], OrderedDict[str, str]]]): a dictionary with the following key-value pairs:
  - ab (OrderedDict[str, str]): an ordered dictionary mapping the chain labels H and L to the corresponding sequence strings. The letters H and L stand for heavy and light chain, respectively, and are reserved for antibody sequences.
  - ag (Dict[str, str]): a dictionary mapping each chain label to the corresponding sequence string. The chain labels come from the PDB file and are not reserved for any specific sequence type.
- mapping (Dict[str, Dict[str, numpy.ndarray]]): a dictionary with the following key-value pairs:
  - ab:
    - seqres2cdr: a binary numpy array of shape (L, ) where L is the length of the antibody sequence. The array is a mask indicating the CDR positions (label 1) in the antibody sequence.
  - ag:
    - seqres2surf: a binary numpy array of shape (L, ) where L is the length of the antigen sequence. The array is a mask indicating the surface residues (label 1) in the antigen sequence.
    - seqres2epitope: a binary numpy array of shape (L, ) where L is the length of the antigen sequence. The array is a mask indicating the epitope residues (label 1) in the antigen sequence.
- embedding: a dictionary with the following key-value pairs:
  - ab:
    - igfold: a pytorch tensor of shape (L_ab, 512) where L_ab is the length of the antibody sequence (heavy + light chain). The embedding is computed using the AntiBERTy model provided in IgFold.
    - esm2: a pytorch tensor of shape (L_ab, 480) where L_ab is the length of the antibody sequence. The embedding is computed using the ESM2 esm2_t12_35M_UR50D model.
  - ag:
    - esm2: a pytorch tensor of shape (L_ag, 480) where L_ag is the length of the antigen sequence. The embedding is computed using the ESM2 esm2_t12_35M_UR50D model.
- edges: a dictionary with the following key-value pairs:
  - ab: a pytorch sparse coo tensor of shape (L_cdr, L_cdr) where L_cdr is the total number of residues in the CDR loops. The tensor represents the binary edges between the CDR residues with 1 indicating an edge between two residues.
  - ag: a pytorch sparse coo tensor of shape (L_surf, L_surf) where L_surf is the number of antigen surface residues. The tensor represents the binary edges between the surface residues with 1 indicating an edge between two residues.
- stats: metadata about the antibody-antigen pair. The dictionary contains the following key-value pairs:
  - cdr: an integer denoting the number of CDR residues.
  - surf: an integer denoting the number of surface residues.
  - epitope: an integer denoting the number of epitope residues.
  - epitope2surf_ratio: a float denoting the ratio of epitope residues to surface residues in the antigen.
- Nb: an integer denoting the number of nodes in the antibody graph.
- Ng: an integer denoting the number of nodes in the antigen graph.
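As a minimal sketch of how one of these dictionaries might be inspected, the helper below derives the CDR, surface and epitope counts from the binary masks and recomputes the epitope-to-surface ratio. It assumes that the entries under stats equal the number of 1s in the corresponding masks, which the description above suggests but does not state explicitly. The names load_graph and summarize_sample are hypothetical, and the toy sample uses made-up mask values rather than real data.

```python
from pathlib import Path


def load_graph(path):
    """Load one interim graph file (a dictionary saved with PyTorch)."""
    import torch  # imported lazily so summarize_sample works without PyTorch

    # The files hold NumPy arrays and plain Python objects as well as tensors,
    # so full unpickling is requested (the weights_only argument needs
    # PyTorch 1.13 or later). Only do this for files you trust.
    return torch.load(Path(path), map_location="cpu", weights_only=False)


def summarize_sample(sample):
    """Count mask positions labelled 1 and recompute the epitope/surface ratio."""
    mapping = sample["mapping"]
    n_cdr = int(sum(mapping["ab"]["seqres2cdr"]))
    n_surf = int(sum(mapping["ag"]["seqres2surf"]))
    n_epitope = int(sum(mapping["ag"]["seqres2epitope"]))
    return {
        "abdbid": sample["abdbid"],
        "cdr": n_cdr,
        "surf": n_surf,
        "epitope": n_epitope,
        "epitope2surf_ratio": n_epitope / n_surf if n_surf else float("nan"),
    }


# Toy stand-in with invented masks (NOT taken from the dataset); plain lists
# are used here, whereas the real files store NumPy arrays.
toy_sample = {
    "abdbid": "toy_sample",
    "mapping": {
        "ab": {"seqres2cdr": [0, 1, 1, 0]},
        "ag": {
            "seqres2surf": [1, 1, 0, 1, 1],
            "seqres2epitope": [0, 1, 0, 1, 0],
        },
    },
}
summary = summarize_sample(toy_sample)
print(summary)
```

With the archive extracted, something like summarize_sample(load_graph("1a14_0P.pt")) could then be compared against the stats entry of the same file; the exact path depends on where the archive was unpacked.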
structures.tar.gz: contains 1723 pdb structures, one for each antibody-antigen pair in the asepv1_interim_graphs.tar.gz file. The pdb files are named using the AbDb identifier, e.g. 1a14_0P.pdb.
split_dict.pt: train/val/test split dictionary. All indices correspond to the AbDb identifiers in the asepv1-AbDb-IDs.txt file.
- epitope_ratio: train/val/test split based on the epitope ratio of the antigen.
  - train: pytorch LongTensor of shape (1383,) containing the indices of the training set samples.
  - val: pytorch LongTensor of shape (170,) containing the indices of the validation set samples.
  - test: pytorch LongTensor of shape (170,) containing the indices of the test set samples.
- epitope_group: train/val/test split based on the epitope group of the antigen.
  - train: pytorch LongTensor of shape (1383,) containing the indices of the training set samples.
  - val: pytorch LongTensor of shape (170,) containing the indices of the validation set samples.
  - test: pytorch LongTensor of shape (170,) containing the indices of the test set samples.
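The split indices can be translated back into AbDb identifiers roughly as sketched below. This assumes that asepv1-AbDb-IDs.txt lists one identifier per line and that the indices are 0-based positions in that list; neither detail is stated above, so both should be verified against the actual files. The helper names and the toy identifiers are hypothetical.

```python
def read_abdb_ids(path):
    """Read identifiers from a text file, assuming one per line."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def ids_for_split(ids, indices):
    """Map split indices to identifiers (assumes 0-based indexing).

    `indices` may be a plain list or a 1-D LongTensor; int() handles both.
    """
    return [ids[int(i)] for i in indices]


# The documented split sizes add up to the full dataset: 1383 + 170 + 170.
SPLIT_SIZES = {"train": 1383, "val": 170, "test": 170}
assert sum(SPLIT_SIZES.values()) == 1723

# Toy stand-in for the real identifier list and split (values are invented).
toy_ids = ["id_a", "id_b", "id_c", "id_d"]
toy_split = {"train": [0, 2], "val": [1], "test": [3]}
train_ids = ids_for_split(toy_ids, toy_split["train"])
print(train_ids)
```

With the real files, the index tensors would come from the dictionary stored in split_dict.pt (for example the train entry under epitope_ratio), and the identifier list from read_abdb_ids applied to asepv1-AbDb-IDs.txt.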

Docker Images

ESMFold

Due to size constraints, the image is available for download from either of the following:

GitHub esmfold-docker-image
DockerHub biochunan/esmfold-image

EpiPred

The original code was collected from the OPIG Tools page/EpiPred, and the Python code was downloaded from here.

A copy of the containerized EpiPred is provided as epipred.tar.

Load the image using the following command:

docker load -i epipred.tar

MaSIF-Site

The docker image for MaSIF-Site was obtained from DockerHub pablogainza/masif.

A copy is provided as masif.tar.

Load the image using the following command:

docker load -i masif.tar

Benchmark

Benchmark experiments are provided in the benchmark.zip file. The file contains the following directories:

benchmark/
├── README.md
├── abag_dataset/
├── ESMBind/
├── ESMFold/
├── EpiPred/
├── MaSIF-Site/
└── walle-inference/

Each directory contains a README file with instructions on how to run the benchmark experiments.
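To find the per-directory instructions without unpacking the whole archive, the README entries can be listed directly from the zip file. The sketch below uses only the Python standard library; the function name list_readmes is hypothetical, and matching on file names that start with "readme" is an assumption, since only the top-level README.md is named above.

```python
import zipfile
from pathlib import PurePosixPath


def list_readmes(zip_path):
    """Return the archive paths of all README files, sorted, without extracting."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(
            name
            for name in zf.namelist()
            # Directory entries end with "/" and their last path component is
            # the directory name, so they are not matched here.
            if PurePosixPath(name).name.lower().startswith("readme")
        )
```

Calling list_readmes("benchmark.zip") should then return one path per README found in the archive.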