Documentation

This repository hosts the data associated with the GitHub repository AsEP-dataset.

Dataset Files

The zip file asep-dataset.zip contains the following files:

  • asepv1-AbDb-IDs.txt: a text file containing the AbDb identifiers of the 1723 antibody-antigen pairs in the dataset.
  • asepv1_interim_graphs.tar.gz: contains 1723 .pt files. The .pt files are named using the AbDb identifier, e.g. 1a14_0P.pt. Each file is a dictionary with the following key-value pairs:
    • abdbid (str): a string representing the antibody AbDb identifier
    • seqres (Dict[str, Union[Dict[str, str], OrderedDict[str, str]]]): a dictionary with the following key-value pairs:
      • ab (OrderedDict[str, str]): an ordered dictionary mapping the chain labels H and L to the corresponding sequence strings. The letters H and L stand for the heavy and light chains, respectively, and are reserved for antibody sequences.
      • ag (Dict[str, str]): a dictionary mapping each chain label to the corresponding sequence string. The chain label letters come from the PDB file and are not reserved for any specific sequence type.
    • mapping (Dict[str, Dict[str, numpy.ndarray]]): a dictionary with the following key-value pairs:
      • ab:
        • seqres2cdr: a binary numpy array of shape (L, ) where L is the length of the antibody sequence. The array is a mask indicating the CDR positions (label 1) in the antibody sequence.
      • ag:
        • seqres2surf: a binary numpy array of shape (L, ) where L is the length of the antigen sequence. The array is a mask indicating the surface residues (label 1) in the antigen sequence.
        • seqres2epitope: a binary numpy array of shape (L, ) where L is the length of the antigen sequence. The array is a mask indicating the epitope residues (label 1) in the antigen sequence.
    • embedding: a dictionary with the following key-value pairs:
      • ab:
        • igfold: a pytorch tensor of shape (L_ab, 512) where L_ab is the length of the antibody sequence (heavy + light chain). Embedding is computed using the AntiBERTy model provided in IgFold.
        • esm2: a pytorch tensor of shape (L_ab, 480) where L_ab is the length of the antibody sequence. Embedding is computed using the ESM2 esm2_t12_35M_UR50D model.
      • ag:
        • esm2: a pytorch tensor of shape (L_ag, 480) where L_ag is the length of the antigen sequence. Embedding is computed using the ESM2 esm2_t12_35M_UR50D model.
    • edges: a dictionary with the following key-value pairs:
      • ab: a pytorch sparse coo tensor of shape (L_cdr, L_cdr) where L_cdr is the number of CDR residues. The tensor represents the binary edges between the CDR residues with 1 indicating an edge between two residues.
      • ag: a pytorch sparse coo tensor of shape (L_surf, L_surf) where L_surf is the number of antigen surface residues. The tensor represents the binary edges between the surface residues with 1 indicating an edge between two residues.
    • stats: metadata about the antibody-antigen pair. The dictionary contains the following key-value pairs:
      • cdr: an integer denoting the number of CDR residues.
      • surf: an integer denoting the number of surface residues.
      • epitope: an integer denoting the number of epitope residues.
      • epitope2surf_ratio: a float denoting the ratio of epitope residues to surface residues in the antigen.
    • Nb: an integer denoting the number of nodes in the antibody graph.
    • Ng: an integer denoting the number of nodes in the antigen graph.
  • structures.tar.gz: contains 1723 PDB structures, one for each antibody-antigen pair in the asepv1_interim_graphs.tar.gz file. The PDB files are named using the AbDb identifier, e.g. 1a14_0P.pdb.
  • split_dict.pt: train/val/test split dictionary. All indices correspond to the AbDb identifiers in the asepv1-AbDb-IDs.txt file.
    • epitope_ratio: train/val/test split based on the epitope ratio of the antigen.
      • train: pytorch LongTensor of shape (1383,) containing the indices of the training set samples.
      • val: pytorch LongTensor of shape (170,) containing the indices of the validation set samples.
      • test: pytorch LongTensor of shape (170,) containing the indices of the test set samples.
    • epitope_group: train/val/test split based on the epitope group of the antigen.
      • train: pytorch LongTensor of shape (1383,) containing the indices of the training set samples.
      • val: pytorch LongTensor of shape (170,) containing the indices of the validation set samples.
      • test: pytorch LongTensor of shape (170,) containing the indices of the test set samples.
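
To illustrate how these fields fit together, the following minimal Python sketch recomputes the epitope-to-surface ratio from the two antigen masks and resolves split indices to AbDb identifiers. It runs on toy data rather than the real files. The function names are hypothetical, and the sketch assumes that split indices are 0-based positions in the line order of asepv1-AbDb-IDs.txt, which the description above does not state explicitly.

```python
from typing import Iterable, List, Sequence


def epitope_to_surface_ratio(
    seqres2surf: Sequence[int], seqres2epitope: Sequence[int]
) -> float:
    """Ratio of epitope residues to surface residues, from two binary masks.

    Both masks are expected to have one entry per antigen residue,
    with label 1 marking surface (or epitope) positions.
    """
    if len(seqres2surf) != len(seqres2epitope):
        raise ValueError("masks must have the same length")
    n_surf = sum(int(x) for x in seqres2surf)
    n_epitope = sum(int(x) for x in seqres2epitope)
    if n_surf == 0:
        raise ValueError("mask contains no surface residues")
    return n_epitope / n_surf


def ids_for_split(abdb_ids: Sequence[str], indices: Iterable[int]) -> List[str]:
    """Map split indices to AbDb identifiers.

    ASSUMPTION: indices are 0-based positions in the identifier list.
    """
    return [abdb_ids[int(i)] for i in indices]


# Toy data standing in for the real masks, identifier list, and split tensors.
toy_surf = [1, 1, 0, 1, 1]
toy_epitope = [0, 1, 0, 1, 0]
toy_ids = ["id_a", "id_b", "id_c", "id_d"]  # placeholder identifiers
toy_split = {"train": [0, 2], "val": [1], "test": [3]}

ratio = epitope_to_surface_ratio(toy_surf, toy_epitope)
train_ids = ids_for_split(toy_ids, toy_split["train"])
```

With the real data, the masks would come from a loaded graph dictionary, for example graph["mapping"]["ag"]["seqres2surf"] after calling torch.load on a .pt file, and the result could be compared against graph["stats"]["epitope2surf_ratio"]. The indices would come from split_dict["epitope_ratio"]["train"] and its siblings. Depending on the PyTorch version, torch.load may need weights_only=False to unpickle the NumPy arrays stored in these files.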

Docker Images

ESMFold

Downloadable from either (due to size constraints):

EpiPred

The original code was collected from the OPIG Tools page (EpiPred), and the Python code was downloaded from here.

A copy of the containerized EpiPred is provided as epipred.tar.

Load the image using the following command:

docker load -i epipred.tar

MaSIF-Site

The docker image for MaSIF-Site was obtained from DockerHub pablogainza/masif.

A copy is provided as masif.tar.

Load the image using the following command:

docker load -i masif.tar

Benchmark

Benchmark experiments are provided in the benchmark.zip file. The file contains the following directories:

benchmark/
├── README.md
├── abag_dataset/
├── ESMBind/
├── ESMFold/
├── EpiPred/
├── MaSIF-Site/
└── walle-inference/

Each directory contains a README file with instructions on how to run the benchmark experiments.
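
As a quick sanity check after extracting benchmark.zip, a small Python sketch like the one below can confirm that the directories listed above are present. The function name and the root path passed to it are hypothetical; only the directory names are taken from the listing.

```python
from pathlib import Path
from typing import List

# Directory names as shown in the benchmark/ listing.
EXPECTED_DIRS = [
    "abag_dataset",
    "ESMBind",
    "ESMFold",
    "EpiPred",
    "MaSIF-Site",
    "walle-inference",
]


def missing_benchmark_dirs(root: str) -> List[str]:
    """Return the expected benchmark subdirectories that are absent under root."""
    base = Path(root)
    return [name for name in EXPECTED_DIRS if not (base / name).is_dir()]
```

Calling missing_benchmark_dirs("benchmark") from the extraction directory returns an empty list when the layout is complete.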