This repository hosts the data associated with the GitHub repository AsEP-dataset.
The zip file asep-dataset.zip contains the following files:
- asepv1-AbDb-IDs.txt: a text file containing the AbDb identifiers of the 1723 antibody-antigen pairs in the dataset.
- asepv1_interim_graphs.tar.gz: contains 1723 .pt files. The .pt files are named using the AbDb identifier, e.g. 1a14_0P.pt. Each file is a dictionary with the following key-value pairs:
  - abdbid (str): a string representing the antibody AbDb identifier.
  - seqres (Dict[str, Union[Dict[str, str], OrderedDict[str, str]]]): a dictionary with the following key-value pairs:
    - ab (OrderedDict[str, str]): an ordered dictionary mapping the chain labels H and L to the corresponding sequence strings. The letters H and L stand for the heavy and light chain, respectively, and are reserved for antibody sequences.
    - ag (Dict[str, str]): a dictionary mapping each chain label to the corresponding sequence string. The chain label comes from the PDB file and is not reserved for any specific sequence type.
  - mapping (Dict[str, Dict[str, numpy.ndarray]]): a dictionary with the following key-value pairs:
    - ab:
      - seqres2cdr: a binary numpy array of shape (L,), where L is the length of the antibody sequence. The array is a mask indicating the CDR positions (label 1) in the antibody sequence.
    - ag:
      - seqres2surf: a binary numpy array of shape (L,), where L is the length of the antigen sequence. The array is a mask indicating the surface residues (label 1) in the antigen sequence.
      - seqres2epitope: a binary numpy array of shape (L,), where L is the length of the antigen sequence. The array is a mask indicating the epitope residues (label 1) in the antigen sequence.
  - embedding: a dictionary with the following key-value pairs:
    - ab:
      - igfold: a PyTorch tensor of shape (L_ab, 512), where L_ab is the length of the antibody sequence (heavy + light chain). The embedding is computed using the AntiBERTy model provided in IgFold.
      - esm2: a PyTorch tensor of shape (L_ab, 480), where L_ab is the length of the antibody sequence. The embedding is computed using the ESM2 esm2_t12_35M_UR50D model.
    - ag:
      - esm2: a PyTorch tensor of shape (L_ag, 480), where L_ag is the length of the antigen sequence. The embedding is computed using the ESM2 esm2_t12_35M_UR50D model.
  - edges: a dictionary with the following key-value pairs:
    - ab: a PyTorch sparse COO tensor of shape (L_cdr, L_cdr), where L_cdr is the total length of the CDR loops. The tensor represents the binary edges between the CDR residues, with 1 indicating an edge between two residues.
    - ag: a PyTorch sparse COO tensor of shape (L_surf, L_surf), where L_surf is the number of antigen surface residues. The tensor represents the binary edges between the surface residues, with 1 indicating an edge between two residues.
  - stats: metadata about the antibody-antigen pair. The dictionary contains the following key-value pairs:
    - cdr: an integer denoting the number of CDR residues.
    - surf: an integer denoting the number of surface residues.
    - epitope: an integer denoting the number of epitope residues.
    - epitope2surf_ratio: a float denoting the ratio of epitope residues to surface residues in the antigen.
    - Nb: an integer denoting the number of nodes in the antibody graph.
    - Ng: an integer denoting the number of nodes in the antigen graph.
- structures.tar.gz: contains 1723 PDB structures, each corresponding to an antibody-antigen pair in the asepv1_interim_graphs.tar.gz file. The PDB files are named using the AbDb identifier, e.g. 1a14_0P.pdb.
- split_dict.pt: train/val/test split dictionary. All indices correspond to the AbDb identifiers in the asepv1-AbDb-IDs.txt file.
  - epitope_ratio: train/val/test split based on the epitope ratio of the antigen.
    - train: PyTorch LongTensor of shape (1383,) containing the indices of the training set samples.
    - val: PyTorch LongTensor of shape (170,) containing the indices of the validation set samples.
    - test: PyTorch LongTensor of shape (170,) containing the indices of the test set samples.
  - epitope_group: train/val/test split based on the epitope group of the antigen.
    - train: PyTorch LongTensor of shape (1383,) containing the indices of the training set samples.
    - val: PyTorch LongTensor of shape (170,) containing the indices of the validation set samples.
    - test: PyTorch LongTensor of shape (170,) containing the indices of the test set samples.
Downloadable from either (due to size constraints):
The original EpiPred code was collected from the OPIG Tools page/EpiPred, and the Python code was downloaded from here.
A copy of the containerized EpiPred is provided as epipred.tar.
Load the image using the following command:
docker load -i epipred.tar
The Docker image for MaSIF-Site was obtained from Docker Hub (pablogainza/masif).
A copy is provided as masif.tar.
Load the image using the following command:
docker load -i masif.tar
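If both archives sit in the same directory, the two load commands above can be wrapped in a small script. The following is only a sketch: the helper names docker_load_command and load_images are hypothetical, and it assumes the docker CLI is on the PATH. It runs the same `docker load -i` command for each archive that exists and skips any that are missing.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def docker_load_command(tar_path: str) -> List[str]:
    """Build the argument list for `docker load -i <tar_path>`."""
    return ["docker", "load", "-i", str(tar_path)]


def load_images(
    tar_paths: Iterable[str], runner: Callable = subprocess.run
) -> List[List[str]]:
    """Run `docker load -i` for every archive that exists.

    `runner` defaults to subprocess.run; passing another callable makes it
    possible to dry-run the function without Docker installed.
    Returns the list of commands that were executed.
    """
    executed = []
    for tar_path in tar_paths:
        if not Path(tar_path).is_file():
            print(f"skipping {tar_path}: file not found")
            continue
        cmd = docker_load_command(tar_path)
        # check=True raises CalledProcessError if docker exits with an error.
        runner(cmd, check=True)
        executed.append(cmd)
    return executed


# Example call (archive names taken from the text above):
# load_images(["epipred.tar", "masif.tar"])
```

Running `docker image ls` afterwards lists the images that were loaded.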
Benchmark experiments are provided in the benchmark.zip file. The file contains the following directories:
benchmark/
├── README.md
├── abag_dataset/
├── ESMBind/
├── ESMFold/
├── EpiPred/
├── MaSIF-Site/
└── walle-inference/
Each directory contains a README file with instructions on how to run the benchmark experiments.
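As an illustration of the asep-dataset.zip contents described earlier, the sketch below shows one way to inspect a per-complex .pt dictionary and to resolve split indices to AbDb identifiers. It rests on several assumptions that the description does not confirm: that asepv1-AbDb-IDs.txt holds one identifier per line, that the split indices are 0-based positions in that file, and that the .pt files can be read with torch.load. The helper names (load_pt, read_ids, summarize_sample, split_ids) are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def load_pt(path: str) -> Any:
    """Load a .pt file with PyTorch (imported lazily; the other helpers do not need it)."""
    import torch

    # weights_only=False because the files hold Python dicts and numpy arrays
    # rather than bare tensors; only use this on files you trust.
    return torch.load(path, map_location="cpu", weights_only=False)


def read_ids(path: str) -> List[str]:
    """Read AbDb identifiers, assuming one identifier per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def summarize_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Derive a few counts from the binary masks of one per-complex dictionary."""
    mapping = sample["mapping"]
    n_cdr = sum(int(x) for x in mapping["ab"]["seqres2cdr"])
    n_surf = sum(int(x) for x in mapping["ag"]["seqres2surf"])
    n_epitope = sum(int(x) for x in mapping["ag"]["seqres2epitope"])
    return {
        "abdbid": sample["abdbid"],
        "ab_chains": list(sample["seqres"]["ab"].keys()),
        "ag_chains": list(sample["seqres"]["ag"].keys()),
        "n_cdr": n_cdr,
        "n_surf": n_surf,
        "n_epitope": n_epitope,
        # Guard against an empty surface mask to avoid division by zero.
        "epitope2surf_ratio": n_epitope / n_surf if n_surf else float("nan"),
    }


def split_ids(ids: Sequence[str], indices: Sequence[int]) -> List[str]:
    """Map split indices to AbDb identifiers (assumes 0-based indexing)."""
    return [ids[int(i)] for i in indices]


# Example usage after extracting the archives (adjust the paths as needed):
# ids = read_ids("asepv1-AbDb-IDs.txt")
# split = load_pt("split_dict.pt")
# train_ids = split_ids(ids, split["epitope_ratio"]["train"].tolist())
# sample = load_pt("1a14_0P.pt")
# print(summarize_sample(sample))
```

The counts are recomputed from the masks rather than read from stats, so they can be compared against the stored stats values as a sanity check.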