ANABAG: ANnotated Antibody AntiGen dataset
Description
ANABAG (ANnotated AntiBody AntiGen)
ANABAG is a curated dataset of antibody–antigen complexes. It includes:
Files Included
This dataset is provided as four archives to accommodate different computational requirements:
1. data.tar.gz (Full Dataset, ~30 GB)
The complete ANABAG dataset containing all biological units (BUs) with comprehensive features and structures:
- Initial chain structures: Renumbered, chain-standardized format with the antigen (AG) first and the antibody (AB) second
- Formatted structures: Identical formatting except for chain naming: the AG is chain A and the AB is chain B
- Heteroatom files: Identical to the initial chain structures, with all non-protein atoms included (cofactors, glycans, water, etc.)
- Rosetta-processed data: Energy-minimized structures (relax) and associated features
- Note: Some Rosetta calculations did not complete successfully; these BUs lack Rosetta-specific outputs
2. light_version.tar.gz (Light Version, ~7 GB)
A streamlined version for users who need core structural data without additional processing:
- Initial chain version of each biological unit
- Associated features and annotations
- Excludes: Heteroatom files, Rosetta features, and relaxed structures
- Ideal for initial exploration and machine learning applications that don't require heteroatoms
3. formated_structures_only.tar.gz (Minimal Version, ~4 GB)
The most compact version containing essential structural information:
- Initial chain version of each biological unit only
- Suitable for quick access and overview of available complexes
- Recommended for users with limited storage or bandwidth
4. per_residue_files.tar.gz (Per-Residue Features, ~3 GB)
The per-residue features, provided as two tab-separated files:
- per_residue_information_AG.tsv containing all features for antigen residues
- per_residue_information_AB.tsv containing all features for antibody residues
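Because the archives range from about 3 GB to about 30 GB, it can help to inspect an archive and extract only the files of interest rather than unpacking everything. The following is a minimal sketch using Python's standard `tarfile` module. The helper names `list_members` and `extract_matching` are hypothetical, and the internal directory layout of the archives is not described here, so any path or suffix passed to these helpers is an assumption to check against the actual archive contents.

```python
import tarfile


def list_members(archive_path, limit=20):
    """Return up to `limit` member names from a .tar.gz without extracting it."""
    names = []
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar:  # members are read one at a time from the stream
            names.append(member.name)
            if len(names) >= limit:
                break
    return names


def extract_matching(archive_path, suffix, dest):
    """Extract only regular files whose name ends with `suffix` into `dest`."""
    with tarfile.open(archive_path, mode="r:gz") as tar:
        wanted = [
            m for m in tar.getmembers()
            if m.isfile() and m.name.endswith(suffix)
        ]
        tar.extractall(path=dest, members=wanted)
    return [m.name for m in wanted]
```

For example, `extract_matching("per_residue_files.tar.gz", ".tsv", "anabag_tsv")` would unpack only the TSV tables, assuming they are stored with a `.tsv` suffix as the file names above suggest.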
Note: All structures (except heteroatom files) include modeled regions, where gaps of up to 12 residues were modeled using Modeller and Disgro. Each residue is annotated in the 'Stat_res_pdbm' column as either 'Modelled' or 'Solved', allowing users to filter on experimental vs. modeled content. The 'Distance_interface' column (in Ångströms) enables filtering of modeled residues (or any residue) by their proximity to the binding interface.
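The filtering described in this note can be sketched with Python's standard `csv` module. This is an illustration under stated assumptions, not the official tooling: the column names 'Stat_res_pdbm' and 'Distance_interface' and the values 'Solved' and 'Modelled' come from the note above, while the helper names `read_tsv` and `filter_residues`, the default arguments, and the handling of non-numeric distances are hypothetical choices.

```python
import csv


def read_tsv(path):
    """Yield each row of a tab-separated file as a dict keyed by column name."""
    with open(path, newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def filter_residues(rows, keep_status=("Solved",), max_distance=None):
    """Yield rows by residue status and, optionally, interface distance.

    rows         -- iterable of dicts with 'Stat_res_pdbm' and
                    'Distance_interface' keys
    keep_status  -- statuses to keep ('Solved' and/or 'Modelled')
    max_distance -- if given, keep only residues at most this many
                    Angstroms from the interface
    """
    for row in rows:
        if row["Stat_res_pdbm"] not in keep_status:
            continue
        if max_distance is not None:
            try:
                distance = float(row["Distance_interface"])
            except (TypeError, ValueError):
                # Assumption: rows without a numeric distance are dropped
                # whenever a distance cutoff is requested.
                continue
            if distance > max_distance:
                continue
        yield row
```

A call such as `filter_residues(read_tsv("per_residue_information_AG.tsv"), max_distance=5.0)` would keep experimentally solved antigen residues within 5 Å of the interface; the 5 Å cutoff is only an example value.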
Usage and Tools
ANABAG can be used directly or through our companion tools available at: DSIMB/anabag-handler
These scripts enable users to:
- Filter biological units based on specific criteria (pH range, experimental technique, resolution, secondary structures, etc.)
- Extract subsets for specialized analyses
- Convert between different structural formats
- Generate machine learning-ready features
For detailed usage instructions and examples, please refer to the GitHub repository documentation.
Files
(43.5 GB)
| MD5 checksum | Size |
|---|---|
| 673a143aec60b3750179f361e266d3c2 | 29.9 GB |
| 33a2f0e6da3dc618224eec3da4770aa2 | 3.6 GB |
| 8bd4decb1963ce654333ea1e8f9b2c86 | 6.8 GB |
| ee4eb39515f73b4bf273171cf579bdab | 3.1 GB |
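After downloading, an archive can be checked against the MD5 checksums listed above. The sketch below uses Python's standard `hashlib` module and reads the file in chunks so that multi-gigabyte archives do not need to fit in memory. The helper names `md5_of_file` and `verify` are hypothetical. The file names are not shown next to the checksums here, so pairing a checksum with an archive (for example by matching the listed sizes) is an assumption to confirm on the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches `expected_md5`."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For instance, `verify("data.tar.gz", "673a143aec60b3750179f361e266d3c2")` would check the full dataset, assuming the 29.9 GB entry corresponds to `data.tar.gz`.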
Additional details
Funding
- Agence Nationale de la Recherche
- EMULATE - Developing, Validating and applying computer simulation methods to Enhance the MolecULAr undersTandIng and tO eNgineer functionalized biomaterials ANR-20-CE06-0029