ANABAG: ANnotated Antibody AntiGen dataset

Grandguillaume, Ilyas

doi:10.5281/zenodo.17065788

Published September 5, 2025 | Version 2.0

Dataset Open

Grandguillaume, Ilyas (Contact person)

Contributors

Project member (2):

1. Université Paris Cité
2. Universidade de São Paulo

ANABAG (ANnotated AntiBody AntiGen)

ANABAG is a curated dataset of antibody–antigen complexes. It includes:

- 3D structural data (in several formats)

- Per-sequence and per-residue features

- Frequent updates (monthly on the GitHub repository)

Analysis and prediction of antibody–antigen (Ab–Ag) interactions often overlook critical structural features such as glycosylation and physicochemical conditions such as pH and salt concentration. The field also lacks standardized criteria for selecting complexes based on structural properties and sequence identity. Dataset construction commonly removes redundancy with sequence identity thresholds, which can inadvertently exclude complexes that share identical sequences but have alternative binding modes. More precise Ab–Ag modeling and antibody engineering require richer structural and physical information in both physics-based and machine learning models.

To address these limitations, we present ANABAG, a curated dataset of Ab–Ag complexes annotated at the residue level with UniProt sequence information and enriched with a wide range of structural and physicochemical features. Complexes can be filtered flexibly using descriptors available at both the complex and residue levels. Selected features are ready to use in machine learning workflows, and the structural files are compatible with antibody design and docking pipelines such as Rosetta and HADDOCK. The complete dataset is available on Zenodo, and all accompanying scripts and usage documentation are available on GitHub.

Files Included

This dataset is provided as four archives to accommodate different computational requirements:

1. data.tar.gz (Full Dataset, ~30 GB)

The complete ANABAG dataset containing all biological units (BUs) with comprehensive features and structures:

Initial chain structures: Renumbered, chain-standardized format with antigen (AG) first and antibody (AB) second
Formatted structures: Identical to the initial chain structures except for chain naming: AG is chain A and AB is chain B
Heteroatom files: Identical to the initial chain structures, with all non-protein atoms included (cofactors, glycans, water, etc.)
Rosetta-processed data: Energy-minimized structures (relax) and associated features
- Note: Some Rosetta calculations did not complete successfully; these BUs lack Rosetta-specific outputs

2. light_version.tar.gz (Light Version, ~7 GB)

A streamlined version for users who need core structural data without additional processing:

Initial chain version of each biological unit
Associated features and annotations
Excludes: Heteroatom files, Rosetta features, and relaxed structures
Ideal for initial exploration and machine learning applications that don't require heteroatoms

3. formated_structures_only.tar.gz (Minimal Version, ~4 GB)

The most compact version containing essential structural information:

Initial chain version of each biological unit only
Suitable for quick access and overview of available complexes
Recommended for users with limited storage or bandwidth

4. per_residue_files.tar.gz (Per-Residue Features, ~3 GB)

The per-residue feature tables:

per_residue_information_AG.tsv containing all features for antigen residues
per_residue_information_AB.tsv containing all features for antibody residues
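
For users with limited storage, a single file can be pulled out of an archive without unpacking everything. The sketch below is a minimal illustration using Python's standard `tarfile` module. The function name `extract_matching` is hypothetical, and because the internal directory layout of the archives is not described here, the code matches on file-name suffix only and writes the results flat into the destination directory.

```python
import shutil
import tarfile
from pathlib import Path


def extract_matching(archive_path, suffix, dest_dir):
    """Extract only the regular files whose name ends with `suffix`.

    Files are written flat into `dest_dir` under their base name, so the
    archive's internal directory layout (an unknown here) is ignored.
    Members sharing a base name would overwrite each other.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            target = dest / Path(member.name).name
            # Copy the member's bytes without extracting its stored path.
            with tar.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append(target)
    return written


# Example call (paths are placeholders for a local download):
# extract_matching("per_residue_files.tar.gz", "per_residue_information_AG.tsv", "anabag")
```

The archive is still read sequentially from the start, so this saves disk space rather than download time.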

Note: All structures (except heteroatom files) include modeled regions: gaps of up to 12 residues were modeled using MODELLER and DiSGro. Each residue is annotated in the 'Stat_res_pdbm' column as either 'Modelled' or 'Solved', so users can filter on experimental versus modeled content. The 'Distance_interface' column (in ångströms) enables filtering of modeled residues (or any residue) by proximity to the binding interface.
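
The filtering described in this note can be sketched with Python's standard `csv` module. The column names 'Stat_res_pdbm' and 'Distance_interface' and the value 'Solved' come from the description above. Everything else is an assumption: that the .tsv files are tab-delimited with a header row, that distances parse as plain numbers, and the function name `iter_solved_near_interface`, which is hypothetical. Rows are streamed one at a time because the per-residue tables are large.

```python
import csv


def iter_solved_near_interface(tsv_file, max_distance):
    """Yield rows for experimentally solved residues whose distance to the
    interface is at most `max_distance` (in angstroms)."""
    # Assumption: tab-delimited text with a header row.
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        if row["Stat_res_pdbm"] != "Solved":
            continue  # drop residues annotated as 'Modelled'
        try:
            distance = float(row["Distance_interface"])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric distance
        if distance <= max_distance:
            yield row


# Example call (the path is a placeholder for a local copy):
# with open("per_residue_information_AG.tsv", newline="") as handle:
#     for row in iter_solved_near_interface(handle, 5.0):
#         print(row)
```

The 5.0 Å cutoff in the example is arbitrary; choose a threshold that suits the analysis.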

Usage and Tools

ANABAG can be used directly or through our companion tools, available in the GitHub repository DSIMB/anabag-handler.

These scripts enable users to:

Filter biological units based on specific criteria (pH range, experimental technique, resolution, secondary structures, etc.)
Extract subsets for specialized analyses
Convert between different structural formats
Generate machine learning-ready features

For detailed usage instructions and examples, please refer to the GitHub repository documentation.

Files (43.5 GB)

Name	MD5	Size
data.tar.gz	673a143aec60b3750179f361e266d3c2	29.9 GB
formated_structures_only.tar.gz	33a2f0e6da3dc618224eec3da4770aa2	3.6 GB
light_version.tar.gz	8bd4decb1963ce654333ea1e8f9b2c86	6.8 GB
per_residue_files.tar.gz	ee4eb39515f73b4bf273171cf579bdab	3.1 GB
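
Because the archives are large, it is worth checking a download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using Python's standard `hashlib` module. The checksums are copied from the file list; the helper names `md5_of` and `is_intact` are hypothetical.

```python
import hashlib

# MD5 checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "data.tar.gz": "673a143aec60b3750179f361e266d3c2",
    "formated_structures_only.tar.gz": "33a2f0e6da3dc618224eec3da4770aa2",
    "light_version.tar.gz": "8bd4decb1963ce654333ea1e8f9b2c86",
    "per_residue_files.tar.gz": "ee4eb39515f73b4bf273171cf579bdab",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in fixed-size chunks so
    that multi-gigabyte archives never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, name):
    """Compare a local file against the checksum listed for `name`."""
    return md5_of(path) == EXPECTED_MD5[name]


# Example call (the path is a placeholder for a local download):
# print(is_intact("light_version.tar.gz", "light_version.tar.gz"))
```

A mismatch usually indicates an incomplete or corrupted download, so the file should be fetched again.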

Additional details

Funding

Agence Nationale de la Recherche
EMULATE - Developing, Validating and applying computer simulation methods to Enhance the MolecULAr undersTandIng and tO eNgineer functionalized biomaterials (grant ANR-20-CE06-0029)

