Published September 5, 2025 | Version 2.0
Dataset Open

ANABAG: ANnotated Antibody AntiGen dataset

  • 1. Université Paris Cité
  • 2. Universidade de São Paulo

Description

ANABAG (ANnotated AntiBody AntiGen)

ANABAG is a curated dataset of antibody–antigen complexes. It includes:

- 3D structural data (in various formats)
- Per-sequence and per-residue features
- Frequent updates (monthly on the GitHub repository)
 
Analyses and predictions of antibody–antigen (Ab–Ag) interactions often overlook critical structural features such as glycosylation, as well as physicochemical conditions such as pH and salt concentration. The field also lacks standardized criteria for selecting complexes based on structural properties and sequence identity. Dataset construction commonly removes redundancy with sequence identity thresholds, which can inadvertently exclude complexes that share identical sequences but adopt alternative binding modes. To enable more precise Ab–Ag modeling and antibody engineering, it is essential to incorporate richer structural and physical information into both physics-based and machine learning models. To address these limitations, we present ANABAG, a new curated dataset of Ab–Ag complexes annotated at the residue level with UniProt sequence information and enriched with a wide range of structural and physicochemical features. The dataset allows flexible filtering of complexes using a variety of descriptors available at both the complex and residue levels. Selected features are ready to use in machine learning workflows, while the structural files are compatible with antibody design and docking pipelines such as Rosetta or HADDOCK. The complete dataset is available on Zenodo, and all accompanying scripts and usage documentation can be accessed via GitHub.
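
The redundancy issue described above can be illustrated with a toy example. This is a minimal sketch, not the procedure used to build ANABAG: the record fields (`antigen_seq`, `antibody_seq`, `epitope`), the identifiers, and the sequences are hypothetical, and exact sequence identity stands in for an identity threshold. It shows how deduplicating on sequence alone discards a complex with a different binding mode, while adding the set of interface residues to the key keeps it.

```python
def deduplicate(complexes, key):
    """Keep the first complex seen for each distinct key value."""
    seen = set()
    kept = []
    for entry in complexes:
        k = key(entry)
        if k not in seen:
            seen.add(k)
            kept.append(entry)
    return kept

def sequence_key(entry):
    # Redundancy defined by the pair of sequences only.
    return (entry["antigen_seq"], entry["antibody_seq"])

def sequence_and_epitope_key(entry):
    # Redundancy defined by sequences plus the antigen residues at the interface.
    return sequence_key(entry) + (frozenset(entry["epitope"]),)

# Hypothetical records: identical sequences, two distinct binding modes.
complexes = [
    {"id": "bu_1", "antigen_seq": "MKTAYIAK", "antibody_seq": "EVQLVESG", "epitope": {10, 11, 12}},
    {"id": "bu_2", "antigen_seq": "MKTAYIAK", "antibody_seq": "EVQLVESG", "epitope": {45, 46, 47}},
    {"id": "bu_3", "antigen_seq": "MKTAYIAK", "antibody_seq": "EVQLVESG", "epitope": {10, 11, 12}},
]

by_sequence = [c["id"] for c in deduplicate(complexes, sequence_key)]
by_mode = [c["id"] for c in deduplicate(complexes, sequence_and_epitope_key)]
print(by_sequence)  # ['bu_1']
print(by_mode)      # ['bu_1', 'bu_2']
```

With the sequence-only key, `bu_2` is dropped even though it binds a different region of the antigen; the second key removes only the true duplicate `bu_3`.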
 

Files Included

This dataset is provided as four archives to accommodate different computational requirements:

1. data.tar.gz (Full Dataset, ~30 GB)

The complete ANABAG dataset containing all biological units (BUs) with comprehensive features and structures:

  • Initial chain structures: Renumbered, chain-standardized format with antigen (AG) first and antibody (AB) second
  • Formatted structures: Same formatting as the initial chain structures except for chain naming: AG is chain A and AB is chain B
  • Heteroatom files: Identical to the initial chain structures, with all non-protein atoms included (cofactors, glycans, water, etc.)
  • Rosetta-processed data: Energy-minimized structures (relax) and associated features
    • Note: Some Rosetta calculations did not complete successfully; these BUs lack Rosetta-specific outputs

2. light_version.tar.gz (Light Version, ~7 GB)

A streamlined version for users who need core structural data without additional processing:

  • Initial chain version of each biological unit
  • Associated features and annotations
  • Excludes: Heteroatom files, Rosetta features, and relaxed structures
  • Ideal for initial exploration and machine learning applications that don't require heteroatoms

3. formated_structures_only.tar.gz (Minimal Version, ~4 GB)

The most compact version containing essential structural information:

  • Initial chain version of each biological unit only
  • Suitable for quick access and overview of available complexes
  • Recommended for users with limited storage or bandwidth

4. per_residue_files.tar.gz (Per-Residue Features, ~3 GB)

Per-residue features only, split into one tab-separated file per binding partner:

  • per_residue_information_AG.tsv containing all features for antigen residues
  • per_residue_information_AB.tsv containing all features for antibody residues

Note: All structures (except heteroatom files) include modeled regions: gaps of up to 12 residues were modeled using MODELLER and DiSGro. Each residue is annotated in the 'Stat_res_pdbm' column as either 'Modelled' or 'Solved', allowing users to filter on experimental vs. modeled content. The 'Distance_interface' column (in Ångströms) enables filtering of modeled residues (or any residue) by their proximity to the binding interface.
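
The filtering described in this note can be sketched with the standard `csv` module. The column names 'Stat_res_pdbm' and 'Distance_interface' and the values 'Solved' and 'Modelled' come from the note above; the `Residue` column, the sample rows, and the 8 Å cutoff are hypothetical placeholders, since the full column layout of the TSV files is not listed here.

```python
import csv
import io

def solved_near_interface(tsv_handle, max_distance=8.0):
    """Yield rows for experimentally solved residues within
    `max_distance` Angstroms of the binding interface."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        if row["Stat_res_pdbm"] != "Solved":
            continue  # drop modeled residues
        try:
            distance = float(row["Distance_interface"])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric distance
        if distance <= max_distance:
            yield row

# Tiny in-memory stand-in for per_residue_information_AG.tsv.
sample = io.StringIO(
    "Residue\tStat_res_pdbm\tDistance_interface\n"
    "A10\tSolved\t3.2\n"
    "A11\tModelled\t2.9\n"
    "A12\tSolved\t15.4\n"
    "A13\tSolved\tNA\n"
)
kept = [row["Residue"] for row in solved_near_interface(sample, max_distance=8.0)]
print(kept)  # ['A10']
```

For the real files, replace the in-memory sample with `open("per_residue_information_AG.tsv", newline="")`. Reading row by row keeps memory use low for files of this size.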

Usage and Tools

ANABAG can be used directly or through our companion tools available at: DSIMB/anabag-handler

These scripts enable users to:

  • Filter biological units based on specific criteria (pH range, experimental technique, resolution, secondary structures, etc.)
  • Extract subsets for specialized analyses
  • Convert between different structural formats
  • Generate machine learning-ready features

For detailed usage instructions and examples, please refer to the GitHub repository documentation.

Files (43.5 GB)

  • data.tar.gz (29.9 GB), md5:673a143aec60b3750179f361e266d3c2
  • formated_structures_only.tar.gz (3.6 GB), md5:33a2f0e6da3dc618224eec3da4770aa2
  • light_version.tar.gz (6.8 GB), md5:8bd4decb1963ce654333ea1e8f9b2c86
  • per_residue_files.tar.gz (3.1 GB), md5:ee4eb39515f73b4bf273171cf579bdab
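
The md5 checksums listed above can be used to confirm that a download completed without corruption. A minimal sketch with Python's standard `hashlib` module follows. The function names are illustrative, and the local file path in the usage comment is an assumption about where the archive was saved.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected

# Hypothetical usage, assuming the archive sits in the current directory:
# matches_checksum("per_residue_files.tar.gz",
#                  "md5:ee4eb39515f73b4bf273171cf579bdab")
```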

Additional details

Funding

Agence Nationale de la Recherche
EMULATE - Developing, Validating and applying computer simulation methods to Enhance the MolecULAr undersTandIng and tO eNgineer functionalized biomaterials ANR-20-CE06-0029