
Published May 16, 2025 | Version v2025.05.16
Dataset | Restricted

Scaling laws in antibody language models reveal data-constrained optima

  • Scripps Research Institute

Description

Motivation: Antibody language models (AbLMs) play a critical role in exploring the extensive sequence diversity of antibody repertoires, significantly enhancing therapeutic discovery. However, the optimal strategy for scaling these models, particularly concerning the interplay between model size and data availability, remains underexplored, especially in contrast to natural language processing where data is abundant. This study aims to systematically investigate scaling laws in AbLMs to define optimal scaling thresholds and maximize their potential in antibody engineering and discovery.

Results: This study pretrained ESM-2 architecture models across five distinct parameterizations (8 million to 650 million parameters) and three training data scales (Quarter, Half, and Full datasets, with the full set comprising ~1.6 million paired antibody sequences). Performance was evaluated using cross-entropy loss and downstream tasks, including per-position amino acid identity prediction, antibody specificity classification, and native heavy-light chain pairing recognition. Findings reveal that increasing model size does not monotonically improve performance; for instance, with the full dataset, loss began to increase beyond ~163M parameters. The 350M-parameter model trained on the full dataset (350M-F) often demonstrated optimal or near-optimal performance in downstream tasks, such as achieving the highest accuracy in predicting mutated CDRH3 regions.
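
For context on how such a loss can be computed, below is a minimal sketch of per-position cross-entropy scoring with an ESM-2-style masked language model via the HuggingFace transformers API. This is an illustrative evaluation, not necessarily the exact protocol used in the study; the checkpoint name is a public ESM-2 placeholder rather than one of the AbLMs released here, and the sequence is a made-up heavy-chain fragment.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Placeholder public ESM-2 checkpoint; substitute one of the released
    # AbLM weights from model_weights.zip or HuggingFace.
    checkpoint = "facebook/esm2_t12_35M_UR50D"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()

    # Illustrative heavy-chain fragment, not taken from the datasets here.
    sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"
    input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]

    losses = []
    loss_fn = torch.nn.CrossEntropyLoss()
    with torch.no_grad():
        # Mask one residue at a time and score the model's prediction for it;
        # positions 0 and -1 are the CLS/EOS special tokens and are skipped.
        for pos in range(1, input_ids.shape[1] - 1):
            masked = input_ids.clone()
            masked[0, pos] = tokenizer.mask_token_id
            logits = model(input_ids=masked).logits
            losses.append(loss_fn(logits[0, pos].unsqueeze(0),
                                  input_ids[0, pos].unsqueeze(0)))

    print(f"mean per-position cross-entropy: {torch.stack(losses).mean().item():.4f}")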

Conclusion: These results underscore that in data-constrained domains like antibody sequences, strategically balancing model capacity with dataset size is crucial, as simply increasing model parameters without a proportional increase in diverse training data can lead to diminishing returns or even impaired generalization.

Files

The following files are included in this repository:

  • model_weights.zip: Model weights for all pre-trained AbLMs in the study. The models can also be downloaded from HuggingFace.
  • train-eval-test.zip: The datasets used to train all models, with sequences obtained from Jaffe et al. and Hurtado et al. The compressed folder contains three subfolders (Full_data, Half_data, and Quarter_data), each holding the training data for the corresponding models. The Full_data subfolder is further organized into training, eval, and test subdirectories, which contain the train_dataset.csv, validation_dataset.csv, and test_dataset.csv files, respectively (see the loading sketch after this list).
  • HD_vs_COV.csv.zip: Paired antibody sequences used for the binary antibody specificity classification task (healthy donor vs. coronavirus). The coronavirus (CoV) antibody sequences were sourced from the CoV-AbDab database.
  • hd-0_CoV-1_flu-2.csv.zip: Paired antibody sequences used for the three-way antibody specificity classification task, distinguishing healthy donor (HD), coronavirus (CoV), and influenza (Flu) specific antibodies. The influenza-specific antibody sequences in this dataset were sourced from Wang et al.
  • shuffled_data.csv.zip: The dataset used for the native vs. shuffled paired antibody sequence classification task, derived from test_dataset.csv.
  • per_position_inference.zip: The dataset used for per-residue prediction by the full-data models, including both unmutated and mutated antibody sequences.
  • test_datasets.zip: A compressed folder containing twelve distinct test sets that were not used during model training. These datasets were used to evaluate the pretrained models and to generate cross-entropy loss curves. The data originates from in-house laboratory sources and a study by Ng et al.
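
To get started with these files, the short sketch below loads the main training split and the binary-classification data with pandas. The file paths follow the folder layout described above (after extracting the archives locally); the CSV schemas are an assumption, so check the headers after extraction.

    import pandas as pd

    # Full-data training split from train-eval-test.zip; the path follows
    # the folder layout described above.
    train_df = pd.read_csv("Full_data/training/train_dataset.csv")

    # Paired sequences for the binary specificity task (from HD_vs_COV.csv.zip).
    hd_vs_cov = pd.read_csv("HD_vs_COV.csv")

    # Column names are not documented in this record; inspect them directly.
    print(train_df.shape, hd_vs_cov.shape)  # full dataset: ~1.6M paired sequences overall
    print(train_df.columns.tolist())
    print(hd_vs_cov.columns.tolist())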

Code: The code for model training and evaluation is available under the MIT license on GitHub.

 

Access: The record is publicly accessible, but files are restricted to users with access.

Additional details

Software

  • Repository URL: https://github.com/brineylab/AbLMs-scaling-laws/tree/main
  • Programming language: Python
  • Development Status: Active