Published August 26, 2025 | Version v2025.08.26
Dataset Open

Data-optimal scaling of paired antibody language models

  • Scripps Research Institute

Description

Motivation: Antibody language models (AbLMs) play a critical role in exploring the extensive sequence diversity of antibody repertoires, significantly enhancing therapeutic discovery. However, the optimal strategy for scaling these models, particularly concerning the interplay between model size and data availability, remains underexplored, especially in contrast to natural language processing, where data is abundant. This study aims to systematically investigate scaling laws in AbLMs to define optimal scaling thresholds and maximize their potential in antibody engineering and discovery.

Results: This study pretrained ESM-2-architecture models at five model sizes (8 million to 650 million parameters) and three training-data scales (Quarter, Half, and Full datasets, with the full set comprising ~1.6 million paired antibody sequences). Performance was evaluated using cross-entropy loss and downstream tasks, including per-position amino acid prediction, antibody specificity classification, and native heavy-light chain pairing recognition. The findings reveal that increasing model size does not monotonically improve performance: with the full dataset, for instance, the optimal model size is estimated to be ~152M parameters. The 350M-parameter model trained on the full dataset (350M-F) often demonstrated optimal or near-optimal performance in downstream tasks, such as achieving the highest accuracy in predicting mutated CDRH3 regions.

Conclusion: These results underscore that in data-constrained domains like paired AbLMs, strategically balancing model capacity with dataset size is crucial: simply increasing model parameters without a proportional increase in diverse training data can lead to diminishing returns or even impaired generalization. Our findings also highlight the importance of generating additional high-quality, paired antibody sequence data to improve AbLM performance.
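The loss-based estimate of an optimal model size can be illustrated with a short sketch. This is not the authors' pipeline: the checkpoint ID is a placeholder, the intermediate model sizes are assumed (only 8M, 350M, and 650M are named in this record), and the loss values are illustrative stand-ins to be replaced with measurements on held-out paired sequences such as those in Fixed_data_profiles.zip. The sketch scores a sequence by masking one position at a time with a HuggingFace masked-LM head, then fits a quadratic in log parameter count and reads off the vertex.

```python
# Minimal sketch (not the authors' pipeline): per-position masked cross-entropy
# scoring with a HuggingFace ESM-2-style checkpoint, followed by a quadratic
# fit of loss vs. log10(parameter count).
import numpy as np
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "your-org/ablm-350M-full"  # placeholder; use the checkpoint you downloaded
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

def avg_ce_loss(sequence: str) -> float:
    """Average cross-entropy over all residues, masking one position at a time."""
    enc = tokenizer(sequence, return_tensors="pt")
    input_ids = enc["input_ids"]
    losses = []
    for pos in range(1, input_ids.shape[1] - 1):  # skip the special tokens at the ends
        masked = input_ids.clone()
        masked[0, pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked, attention_mask=enc["attention_mask"]).logits
        ce = torch.nn.functional.cross_entropy(
            logits[0, pos].unsqueeze(0), input_ids[0, pos].unsqueeze(0))
        losses.append(ce.item())
    return float(np.mean(losses))

# Quadratic fit of average loss against log10(parameters). Sizes other than 8M,
# 350M, and 650M are assumed; the losses are placeholders, not measured results.
sizes = np.array([8e6, 35e6, 150e6, 350e6, 650e6])
mean_losses = np.array([2.10, 1.95, 1.88, 1.90, 1.97])  # replace with measured values
a, b, _ = np.polyfit(np.log10(sizes), mean_losses, 2)
optimum = 10 ** (-b / (2 * a))  # vertex of the fitted parabola
print(f"estimated loss-optimal model size: ~{optimum / 1e6:.0f}M parameters")
```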

Files: The following files are included in this repository:

  • model_weights.zip: Model weights for all pretrained AbLMs in the study. The models can also be downloaded from HuggingFace.
  • train-eval-test.zip: Contains the datasets used to train all models, including sequences obtained from Jaffe et al., sequences downloaded from OAS, and an additional set of internally generated sequences. The archive is organized into three subdirectories (Full_data, Half_data, and Quarter_data), each providing the corresponding training dataset. Within Full_data, the data are further divided into training, eval, and test subfolders, which contain train_dataset.csv, validation_dataset.csv, and test_dataset.csv, respectively (see the loading sketch after this list).
  • HD_vs_COV.csv.zip: Paired antibody sequences used for the binary antibody specificity classification task (a fine-tuning sketch for this task follows the Code note below). The Coronavirus (CoV) antibody sequences were sourced from the CoV-AbDab database.
  • hd-0_CoV-1_flu-2.csv.zip: Paired antibody sequences used for the 3-way antibody specificity classification task, distinguishing Healthy Donor (HD)-, Coronavirus (CoV)-, and Influenza (Flu)-specific antibodies. CoV-specific sequences were sourced from CoV-AbDab, Flu-specific antibodies were obtained from Wang et al., and healthy-donor antibodies were obtained from the Ng et al. control dataset.
  • Fixed_data_profiles.zip: Paired antibody sequences from 10 independent donors, not present in the training, evaluation, or test datasets, used to assess model performance with a masked language modeling objective via quadratic regression on average cross-entropy loss.
  • Per-residue_pred.zip: Paired antibody heavy-chain sequences (mutated and unmutated) from 10 independent donors, sampled at 1,000 sequences per donor, used to evaluate the performance of AbLMs in the residue identity prediction task.
  • Native_paired.zip: Natively paired and randomly shuffled antibody sequences used for the binary classification of native versus shuffled pairs.
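
The sketch below shows one way to load the released CSVs. It assumes the zip archives above have been extracted in place and contain same-named CSV files; because this record does not specify column names, the code only prints them rather than assuming any.

```python
# Minimal loading sketch; paths and file names are assumptions based on the
# archive descriptions above, and no column names are assumed.
import pandas as pd

# Full training/validation/test splits from train-eval-test.zip
train = pd.read_csv("train-eval-test/Full_data/training/train_dataset.csv")
val = pd.read_csv("train-eval-test/Full_data/eval/validation_dataset.csv")
test = pd.read_csv("train-eval-test/Full_data/test/test_dataset.csv")

# Binary (HD vs. CoV) and 3-way (HD/CoV/Flu) specificity datasets
hd_vs_cov = pd.read_csv("HD_vs_COV.csv")
three_way = pd.read_csv("hd-0_CoV-1_flu-2.csv")

for name, df in [("train", train), ("validation", val), ("test", test),
                 ("HD vs. CoV", hd_vs_cov), ("HD/CoV/Flu", three_way)]:
    print(f"{name}: {len(df)} rows, columns = {list(df.columns)}")
```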

Code: The code for model training and evaluation is available under the MIT license on GitHub.
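
For orientation, the following is a minimal fine-tuning sketch for the binary specificity task referenced above. It is independent of the linked repository: the checkpoint ID is a placeholder, the "sequence" and "label" column names are assumptions about HD_vs_COV.csv, and the hyperparameters are illustrative.

```python
# Minimal fine-tuning sketch for HD-vs-CoV specificity classification; the
# checkpoint ID, column names, and hyperparameters are assumptions, not the
# repository's exact configuration.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "your-org/ablm-350M-full"  # placeholder pretrained AbLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

df = pd.read_csv("HD_vs_COV.csv")  # assumed columns: "sequence" and "label" (0 = HD, 1 = CoV)
splits = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)

def tokenize(batch):
    return tokenizer(batch["sequence"], truncation=True, padding="max_length", max_length=320)

splits = splits.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hd_vs_cov", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
)
trainer.train()
```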


Files (27.3 GB)

  • md5:302925801ae1309c1716f94753133bca (4.2 MB)
  • md5:be6cee36bfcc93e8a3778dacc29c545b (277.2 kB)
  • md5:407828a0ae72dbeafad3bfad4bc3c776 (1.8 MB)
  • md5:f79fb508923a4cf79c6fd438018c606d (27.0 GB)
  • md5:a46dad1c2dc3a924773717cc6fa1cb49 (25.4 MB)
  • md5:024695aeec48d789ac2e7c38cc30d6ec (1.7 MB)
  • md5:5a3cb1c66404151854a940ec6b41964f (219.9 MB)

Additional details

Software

Repository URL: https://github.com/brineylab/AbLMs-scaling-laws/
Programming language: Python
Development Status: Active