Published January 5, 2026 | Version v2
Dataset Open

Antibody and Nanobody Design Dataset (ANDD)

Authors/Creators

  • 1. Fudan University

Description

Title: Antibody and Nanobody Design Dataset (ANDD): A Comprehensive Resource with Sequence, Structure, and Binding Affinity Data

DOI: 10.5281/zenodo.18151718

Resource Type: Dataset

Publisher: Zenodo

Publication Year: 2025

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Overview (Abstract):

The Antibody and Nanobody Design Dataset (ANDD) is a unified, large-scale dataset created to overcome the limitations of data fragmentation and incompleteness in antibody and nanobody research. It integrates sequence, structure, antigen information, and binding affinity data from 15 diverse sources, including OAS, PDB, SAbDab, and others. ANDD comprises 48,683 antibody/nanobody sequences, structural data for 24,941 entries, antigen sequences for 12,575 entries, and a total of 9,557 binding affinity values for antibody/nanobody-antigen pairs. A key innovation is the augmentation of experimental affinity data with 2,271 high-quality predictions generated by the ANTIPASTI model. This makes ANDD the largest available dataset of its kind, providing a robust foundation for training and validating deep learning models in therapeutic antibody and nanobody design.

Keywords: Dataset, Antibody Design, Nanobody Design, VHH, Deep Learning, Protein Engineering, Binding Affinity, Therapeutic Antibodies, Computational Biology

Methods (Data Curation and Processing):

The ANDD was constructed through a rigorous multi-step process:

  1. Data Collection: Data was aggregated from 15 primary sources, including both antibody/nanobody-specific databases (e.g., OAS, SAbDab, INDI, sdAb-DB) and general protein databases (e.g., PDB, UniProt, PDBbind).
  2. Integration and Standardization: Data from disparate sources was consolidated into a single consistent format to resolve format inconsistencies. Entries were manually validated to exclude non-relevant data (e.g., T-cell receptors).
  3. Affinity Data Augmentation: The ANTIPASTI deep learning model was used to predict and add binding affinity values for entries that had structural data but lacked experimental affinity measurements.
  4. Manual Curation: Web-based data and information from publicly available patents targeting key antigens (HER2, IL-6, CD45, SARS-CoV-2 RBD) were manually extracted to enhance completeness.
  5. Hierarchical Organization: Data is organized in a hierarchical structure, offering four progressively detailed levels: Sequence-only, Sequence+Structure, Sequence+Structure+Antigen, and Sequence+Structure+Antigen+Affinity.
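
The four levels in step 5 are nested, so an entry can be assigned to the most detailed level it satisfies. The following is a minimal sketch of that idea, not the authors' implementation. The level names come from the description above; the dictionary keys ("sequence", "structure", "antigen", "affinity") and the function names are hypothetical and do not correspond to actual ANDD column names.

```python
from typing import Mapping, Optional

# Level names as given in the dataset description.
LEVELS = (
    "Sequence-only",
    "Sequence+Structure",
    "Sequence+Structure+Antigen",
    "Sequence+Structure+Antigen+Affinity",
)


def _present(entry: Mapping[str, object], key: str) -> bool:
    """True if the key exists and holds a non-empty value."""
    value = entry.get(key)
    return value is not None and str(value).strip() != ""


def annotation_level(entry: Mapping[str, object]) -> Optional[str]:
    """Return the most detailed nested level an entry satisfies.

    The keys below are hypothetical placeholders for this sketch.
    Returns None when the entry has no sequence at all.
    """
    if not _present(entry, "sequence"):
        return None
    level = 0
    # Levels are cumulative: stop at the first missing data type.
    for key in ("structure", "antigen", "affinity"):
        if not _present(entry, key):
            break
        level += 1
    return LEVELS[level]
```

Because the levels are cumulative, an entry with an affinity value but no antigen sequence would still be reported as Sequence+Structure under this sketch.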

Data Specifications and Format:

The dataset is distributed in four parts:

  1. ANDD.csv: A comprehensive spreadsheet containing all annotated metadata for each entry.
  2. All_structures/ folder: A directory containing the corresponding PDB structure files for entries with structural data.
  3. Quality control report: A QC report evaluating the data quality of ANDD.
  4. Data dictionary: A description of all fields, controlled terms, units, and allowed values.

The ANDD.csv file includes the following key fields (a full description is available in the Data Record section of the paper):

  • General Info: Source, Update_Date, PDB_ID, Experimental_Method, Ab_or_Nano, Source_Organism.
  • Chain Details: Entity IDs, Asym IDs, Database Accession Codes, and Macromolecule Names for Heavy (H) and Light (L) chains.
  • Antigen Details: Ag_Name, Ag_Seq, Ag_Source Organism, and relevant database identifiers.
  • Sequence Data: Full amino acid sequences for H/L chains and individual CDR regions (H1-H3, L1-L3).
  • Affinity Data: Experimentally measured or predicted Affinity_Kd(M) and ∆Gbinding(kJ), and the Affinity_Method.
  • Mutation Data: Annotation of any amino acid mutations (Ab/Nano_mutation).
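
As an illustration of working with these fields, the sketch below keeps only the rows whose Kd value parses as a positive, finite number. It uses the standard-library csv module and assumes the column header is spelled exactly Affinity_Kd(M), as in the field list above. The exact headers should be checked against the data dictionary. The two sample rows are invented for the demonstration and are not ANDD records.

```python
import csv
import io
import math
from typing import Dict, List, TextIO


def rows_with_affinity(handle: TextIO, kd_column: str = "Affinity_Kd(M)") -> List[Dict[str, str]]:
    """Return rows whose Kd column holds a positive, finite number.

    The default column name is taken from the field list; treat it as an
    assumption until verified against the data dictionary.
    """
    kept = []
    for row in csv.DictReader(handle):
        raw = (row.get(kd_column) or "").strip()
        try:
            kd = float(raw)
        except ValueError:
            continue  # empty or non-numeric cell
        if math.isfinite(kd) and kd > 0:
            kept.append(row)
    return kept


# Invented sample rows for demonstration only (not real ANDD data).
sample = io.StringIO(
    "PDB_ID,Ab_or_Nano,Affinity_Kd(M)\n"
    "XXXX,example,1.0e-9\n"
    "YYYY,example,\n"
)
selected = rows_with_affinity(sample)
print(len(selected))  # number of rows that carry a usable Kd value
```

For the real file, the same function can be called on a handle opened with `open("ANDD.csv", newline="", encoding="utf-8")`. The encoding is an assumption.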

Technical Validation:

The quality of ANDD has been ensured through extensive validation:

  1. Manual Curation: A rigorous manual review process was conducted to check for accuracy and consistency between sequence, structure, and affinity data across randomly selected entries.
  2. Affinity Validation with AlphaBind: The experimental Kd values were validated by comparing them against enrichment ratios predicted by the AlphaBind model, showing a significant correlation (Pearson’s r = 0.750).
  3. Cross-Mapping Validation: The internal consistency between Kd and ∆Gbinding values within the dataset was confirmed, showing a perfect correlation (Pearson’s r = 1.000) as per thermodynamic principles.
  4. Proof-of-Concept Application: The dataset's utility was demonstrated by fine-tuning the DiffAb generative model on a subset of ANDD. The fine-tuned model showed significant improvements in generating nanobodies with better predicted binding affinity, structural diversity, and developability metrics.
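
The cross-mapping check in item 3 rests on the standard thermodynamic relation ∆G = RT ln(Kd), under which ∆G is exactly linear in ln(Kd). The sketch below illustrates that relation. It is not the authors' validation script. The temperature (298.15 K), the unit (kJ/mol), and the use of log-transformed Kd for the correlation are assumptions, since the record does not state them. The example Kd values are made up.

```python
import math
from typing import Sequence

R_KJ_PER_MOL_K = 8.314462618e-3  # molar gas constant in kJ/(mol*K)
ASSUMED_TEMPERATURE_K = 298.15   # assumption: temperature is not stated in the record


def kd_to_delta_g(kd_molar: float, temperature_k: float = ASSUMED_TEMPERATURE_K) -> float:
    """Binding free energy in kJ/mol from a dissociation constant in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KJ_PER_MOL_K * temperature_k * math.log(kd_molar)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up Kd values for illustration (not taken from ANDD).
example_kds = [1e-6, 1e-8, 1e-9, 5e-10]
log_kds = [math.log(kd) for kd in example_kds]
delta_gs = [kd_to_delta_g(kd) for kd in example_kds]

print(round(kd_to_delta_g(1e-9), 2))  # about -51.37 kJ/mol at 298.15 K
print(pearson_r(log_kds, delta_gs))   # 1.0 up to floating-point rounding
```

Under these assumptions a 1 nM binder corresponds to roughly -51.4 kJ/mol. The correlation is exact only against ln(Kd); against untransformed Kd values it is generally below 1.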

Potential Uses:

ANDD is designed to accelerate research in computational biology and drug discovery, including:

  • Training and benchmarking deep learning models for de novo antibody/nanobody sequence and structure generation.
  • Developing and validating predictive models for antibody-antigen binding affinity.
  • Studying structure-function relationships in antibody-antigen interactions.
  • Facilitating the design of optimized therapeutic antibodies and nanobodies with improved specificity and efficacy.

Access and License:

The ANDD dataset is publicly available for download under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. Users are free to share and adapt the material for any purpose, even commercially, provided appropriate credit is given to the original authors and this data descriptor is cited.

Files

ANDD_pdb.zip

Files (2.2 GB)

  • 2.2 GB, md5:e93b4134817ed3ddbdd5ebe3380df9da
  • 13.3 MB, md5:d9c977f192f575077e39a731876d2a69
  • 3.5 kB, md5:d74a5e3901b18cc5d95c8f9b558e1f11
  • 858.8 kB, md5:6a4296b4ee754c6f95704befeeb02a6e
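
A downloaded file can be checked against the MD5 checksums listed above. The sketch below uses Python's standard hashlib module and reads the file in chunks so that the 2.2 GB archive does not have to fit in memory. The function names are illustrative, and the expected checksum must be taken from the listing entry that matches the file being verified.

```python
import hashlib
from pathlib import Path
from typing import Union


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Union[str, Path], expected: str) -> bool:
    """Compare a file's MD5 with an expected value such as 'md5:<hex>'."""
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("ANDD_pdb.zip", "md5:<checksum from the listing>")` returns True only if the local copy is byte-identical to the published file.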