Published June 1, 2026 | Version v1

AVIDbase repository of biologically accurate and standardized nanobody-antigen complexes

  • 1. University of Ljubljana
  • 2. Novartis (Slovenia)

Description

Abstract

Accurate structural data in a standardized format is one of the key factors behind the success of machine learning (ML)-based methods for protein design and structure prediction. However, their application to nanobody-antigen complexes has lower success rates than for globular protein complexes, partly due to the limited amount of high-quality structural data. While several dedicated databases already exist, automated assembly pipelines frequently overlook various artifacts, which act as additional noise and limit the effectiveness of ML applications. Common issues include incorrect biological assembly, redundancy bias, inclusion of crystal contacts, strained geometry due to crystal packing, as well as missing density or post-translational modifications near the interface. To address these issues, we present AVIDbase (Antigen-VHH Interface Database), a highly curated dataset of nanobody-antigen structures. In addition to correcting structural artifacts, the dataset provides a nonredundant set of structures with standardized chain identifiers, harmonized metadata, and cleaned atomic coordinates in a ready-to-use format for ML applications.

Repository contents

AVIDbase is an annotated database of all nanobody-antigen interfaces found within publicly available crystal structures deposited in the PDB as of February 9th, 2026. The database contains nanobody complexes with protein and ordered peptide antigens.

The repository consists of 3 files:

  • AVIDbase.xlsx - an Excel file containing the final nanobody-antigen interface database. The first column contains a unique ID associated with a single complex structure. The file consists of the following sheets:
    • AVIDbase-nr - subset of nonredundant complexes (determined based on identity of CDR1-3 sequence, antigen name, and antigen source organism). The nonredundant complexes were chosen based on ranking by resolution and interface integrity (i.e. interface proximity of missing density segments, post-translational modifications, obligate cofactors or bound substrate).
    • AVIDbase-prot - subset of AVIDbase-nr containing complexes with high or medium interface integrity (no or negligible missing density at the interface, and no or negligible interface contacts with PTM species or cofactors) and with a crystallographic resolution ≤ 3.5 Å.
    • AVIDbase-low - remaining AVIDbase-nr complexes, i.e. those with low interface integrity or crystallographic resolution > 3.5 Å.
    • AVIDbase-r - redundant complexes not found in AVIDbase-nr. Contains redundant complexes found in the same PDB structure as the representative AVIDbase-nr complex (intra_redundancy column) or in a separate PDB structure (inter_redundancy column).
  • annotation_input_table_MANUAL.xlsx - an Excel file describing all automatically identified interfaces of nanobody structures in the PDB, subjected to manual review of both the crystal structure and associated literature. It is the input file used to generate both the final AVIDbase tables and the standardized structures found in AVIDbase.zip. Complexes that were incorrectly assigned, complexes defined by incomplete biological assemblies, etc. were manually edited. Columns that were manually edited are highlighted in yellow:
    • is_edited - Boolean index indicating whether the specified complex was manually edited.
    • edit_comment - free text column used by the authors to provide additional context for certain modifications or exclusion of certain complexes.
    • keep - Boolean index indicating whether the specified complex is retained in the final database. Rows with keep=0 have an accompanying justification under edit_comment.
    • assembly_strategy - notes how the corresponding complex structure was obtained:
      • bio_assembly - from a given biological assembly specified in the corresponding PDB structure, with the specific ID noted under bio_assembly_id.
      • asym - from the asymmetric unit, with bio_assembly_id set to 0.
      • from_symmetry - via symmetry operations to obtain the unit cell.
    • ag_auth_chain_id - comma-separated list of chain IDs comprising the biologically relevant antigen assembly. Duplicate chain IDs are differentiated by a symmetry index, e.g. A_0,A_1.
    • expected_cofactors - comma-separated list of non-protein, non-glycan residue names expected in the structure.
    • ag_name - custom naming convention for the antigen, created to maximize harmonization between different PDB structures containing the same antigen under different names.
  • AVIDbase.zip - .zip file containing standardized .pdb structures belonging to each subset of AVIDbase, with the following layout:
    • AVIDbase-nr
      • AVIDbase-low - all structures listed in AVIDbase-low with no additional processing.
      • AVIDbase-prot - all structures listed in AVIDbase-prot, additionally processed by removing all noncanonical moieties (i.e. the protein-only subset).
      • AVIDbase-prot-relaxed - AVIDbase-prot structures additionally relaxed with Rosetta.
      • full - all structures listed in AVIDbase-nr with no additional processing.
    • AVIDbase-r
      • full - all structures listed in AVIDbase-r with no additional processing.
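As a rough illustration of how the annotation columns described above might be consumed, the sketch below parses an ag_auth_chain_id value such as A_0,A_1 into chain IDs with optional symmetry indices and filters rows on the keep column. The function names (parse_chain_ids, kept_rows) and the dictionary-per-row representation are hypothetical and not part of AVIDbase; how keep is encoded in the spreadsheet (0/1, True/False, or text) is also an assumption, so several encodings are accepted.

```python
from typing import List, NamedTuple, Optional


class ChainRef(NamedTuple):
    chain_id: str
    symmetry_index: Optional[int]  # None when the entry has no "_<n>" suffix


def parse_chain_ids(field: str) -> List[ChainRef]:
    """Split a comma-separated chain list such as 'A_0,A_1' or 'A,B'."""
    refs = []
    for token in field.split(","):
        token = token.strip()
        if not token:
            continue
        chain, sep, suffix = token.rpartition("_")
        if sep and chain and suffix.isdigit():
            refs.append(ChainRef(chain, int(suffix)))
        else:
            refs.append(ChainRef(token, None))
    return refs


def kept_rows(rows):
    """Keep rows whose 'keep' value looks true (assumed encodings: 1, 1.0, True)."""
    truthy = {"1", "1.0", "true"}
    return [r for r in rows if str(r.get("keep")).strip().lower() in truthy]


# Example with made-up rows (not taken from the database):
rows = [
    {"keep": 1, "ag_auth_chain_id": "A_0,A_1"},
    {"keep": 0, "ag_auth_chain_id": "B"},
]
for row in kept_rows(rows):
    print(parse_chain_ids(row["ag_auth_chain_id"]))
```

The rows could come from any spreadsheet reader that yields one mapping per row; the parsing logic itself depends only on the column descriptions given above.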

Available structural data

All nanobody-antigen structures are processed by:

  • renumbering the nanobody under the IMGT scheme with ANARCII
  • removing N- and C-terminal expression tags
  • standardizing chain IDs:
    • the nanobody chain ID is set to H, with only one nanobody per structure,
    • antigen chain IDs are set in alphabetical order (e.g. A,B,C),
    • the cofactor chain ID is set to X,
    • glycan chain IDs are set to the lowercase equivalent of the covalently linked antigen chain ID.
  • replacing all selenomethionine residues with methionines.
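The chain ID convention above lends itself to a simple sanity check. The following is a minimal sketch that reads the chain identifier from column 22 of ATOM/HETATM records (per the standard PDB format) and assigns each chain a role according to the stated rules. The function names are hypothetical, and the sketch assumes that no antigen chain is itself labelled H or X, which the description does not state explicitly.

```python
def chain_roles(pdb_lines):
    """Map each chain ID found in ATOM/HETATM records to a role."""
    roles = {}
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 22:
            continue
        chain = line[21]  # PDB column 22 holds the chain identifier
        if chain in roles:
            continue
        if chain == "H":
            roles[chain] = "nanobody"
        elif chain == "X":
            roles[chain] = "cofactor"
        elif chain.islower():
            roles[chain] = "glycan"
        elif chain.isupper():
            roles[chain] = "antigen"
        else:
            roles[chain] = "unknown"
    return roles


def glycans_without_antigen(roles):
    """Return glycan chain IDs whose uppercase counterpart is not an antigen chain."""
    return sorted(
        c for c, r in roles.items()
        if r == "glycan" and roles.get(c.upper()) != "antigen"
    )
```

For a structure file opened with `open(path)`, `chain_roles(fh)` would be expected to report a nanobody role for chain H, and `glycans_without_antigen` should return an empty list if every glycan chain has a matching antigen chain.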

Rosetta-relaxed structures additionally:

  • are repacked and minimized, with mitigated crystallographic clashes,
  • contain explicit hydrogen atoms placed to optimize hydrogen bonding,
  • include per-residue Rosetta beta_nov16 score terms at the end of the structure file.
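The per-residue score terms can be read back without Rosetta. The sketch below assumes the relaxed files use the usual Rosetta PDB output layout, in which the scores sit between `#BEGIN_POSE_ENERGIES_TABLE` and `#END_POSE_ENERGIES_TABLE` lines, with a `label` header row, `weights` and `pose` summary rows, and one whitespace-separated row per residue. That layout is an assumption about these particular files, and parse_pose_energies is a hypothetical helper name.

```python
def parse_pose_energies(pdb_text):
    """Return {residue_label: {score_term: value}} from a Rosetta energies table.

    Assumes the standard '#BEGIN_POSE_ENERGIES_TABLE' block; returns an empty
    dict when no such block is present.
    """
    in_table = False
    header = None
    per_residue = {}
    for line in pdb_text.splitlines():
        if line.startswith("#BEGIN_POSE_ENERGIES_TABLE"):
            in_table = True
            continue
        if line.startswith("#END_POSE_ENERGIES_TABLE"):
            break
        if not in_table:
            continue
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "label":
            header = fields[1:]  # score term names, typically ending in 'total'
        elif fields[0] in ("weights", "pose"):
            continue  # summary rows, not per-residue values
        elif header is not None and len(fields) == len(header) + 1:
            per_residue[fields[0]] = dict(zip(header, map(float, fields[1:])))
    return per_residue
```

With the table parsed, something like `sorted(scores, key=lambda k: scores[k]["total"], reverse=True)[:10]` would list the ten highest-scoring (least favorable) residues, provided a `total` column is present.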

Files

Total size: 300.9 MB

  • md5:fe0b07c6a36dd25c37d17f028270345a - 258.0 kB
  • md5:d72ef58a783cbc0c3450d6c26d928a35 - 864.6 kB
  • md5:ea715341accd5e1fdb6f53e429487382 - 299.8 MB (AVIDbase.zip)
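The md5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard hashlib module; md5_of_file and verify_md5 are hypothetical helper names, and which checksum belongs to which file must be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 with an expected value, with or without 'md5:' prefix."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (assumes the 299.8 MB entry corresponds to AVIDbase.zip):
# verify_md5("AVIDbase.zip", "md5:ea715341accd5e1fdb6f53e429487382")
```

Reading in chunks keeps memory use flat for the roughly 300 MB archive.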