Improving antibody language models with native pairing
Description
Motivation. Existing large language models designed to predict antibody structure and function have been trained exclusively with unpaired antibody sequences. This is a substantial drawback, as each antibody represents a unique pairing of heavy and light chains that both contribute to antigen recognition. The cost of generating large datasets of natively paired antibody sequences is orders of magnitude higher than the cost of unpaired sequences, and the paucity of available paired antibody sequence datasets precludes training a state-of-the-art language model using only paired training data. Here, we sought to determine whether and to what extent natively paired training data improves model performance.
Results. Using a unique and recently reported dataset of approximately 1.6 × 10⁶ natively paired human antibody sequences, we trained two baseline antibody language model (BALM) variants: BALM-paired and BALM-unpaired. We quantify the superiority of BALM-paired over BALM-unpaired, and we show that BALM-paired's improved performance can be attributed at least in part to its ability to learn cross-chain features that span natively paired heavy and light chains. Additionally, we fine-tuned the general protein language model ESM-2 using these paired antibody sequences and report that the fine-tuned model, but not base ESM-2, demonstrates a similar understanding of cross-chain features.
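For readers who want to reproduce the ESM-2 fine-tuning step, the sketch below shows one way to run masked-language-model fine-tuning of the public 650M-parameter ESM-2 checkpoint on the paired sequence files from this record. It is a minimal sketch under stated assumptions, not the paper's training code: the checkpoint name (facebook/esm2_t33_650M_UR50D) is the public HuggingFace release, the hyperparameters are illustrative, and train.txt/eval.txt (from train-test-eval_paired.tar.gz, described below) are assumed to contain one ready-to-tokenize paired sequence per line.

```python
# Minimal fine-tuning sketch, not the training code from the paper:
# the checkpoint is the public ESM-2 release, hyperparameters are
# illustrative, and train.txt/eval.txt are assumed to hold one
# ready-to-tokenize paired sequence per line.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "facebook/esm2_t33_650M_UR50D"  # public 650M-parameter ESM-2
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# One paired heavy/light sequence per line (from train-test-eval_paired.tar.gz).
data = load_dataset("text", data_files={"train": "train.txt", "eval": "eval.txt"})
tokenized = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard BERT-style masking: 15% of residues are selected for prediction.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="esm2-650m-paired-ft",
        per_device_train_batch_size=8,
        learning_rate=1e-5,
        num_train_epochs=3,
        fp16=True,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["eval"],
    data_collator=collator,
)
trainer.train()
```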
Files. The following files are included in this repository:
- BALM-paired.tar.gz: Model weights for the BALM-paired model (a loading sketch appears below, after the file list).
- BALM-unpaired.tar.gz: Model weights for the BALM-unpaired model.
- ESM2-650M_paired-fine-tuned.tar.gz: Model weights for the 650M-parameter ESM-2 model after fine-tuning with natively paired antibody sequences.
- jaffe-paired-dataset_airr-annotation.tar.gz: All natively paired antibody sequences from the Jaffe dataset, annotated with abstar and filtered to remove duplicate and unproductive sequences. The annotated sequences are provided in an AIRR-compliant format (see the loading sketch after this list).
- train-test-eval_paired.tar.gz: Datasets used to train, test, and evaluate the BALM-paired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line. This dataset was also used to fine-tune the 650M-parameter ESM-2 variant.
- train-test-eval_unpaired.tar.gz: Datasets used to train, test, and evaluate the BALM-unpaired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line.
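Because the annotations follow the AIRR Rearrangement schema, they should load with ordinary TSV tooling. A minimal sketch, assuming the archive extracts to a single AIRR-compliant TSV (the extracted file name below is hypothetical; the column names come from the AIRR schema, which encodes booleans such as productive as T/F in TSV):

```python
# Minimal sketch of loading the AIRR-format annotations with pandas.
# The extracted file name is hypothetical; adjust to match the archive contents.
import pandas as pd

airr = pd.read_csv("jaffe-paired-dataset_airr-annotation.tsv", sep="\t", low_memory=False)

# Keep productive rearrangements ("T"/"F" is the AIRR TSV boolean encoding)
# and inspect a few commonly used annotation fields from the AIRR schema.
productive = airr[airr["productive"] == "T"]
print(productive[["sequence_id", "locus", "v_call", "j_call", "junction_aa"]].head())
```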
Code: All code used for model training, testing, and figure generation is available under the MIT license on GitHub.
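If the model tarballs extract to HuggingFace-format checkpoint directories (an assumption; this record does not state the serialization format), the released weights can be loaded and queried as in the sketch below, which masks a single residue of one test sequence and predicts it. The same pattern applies to BALM-unpaired.tar.gz and ESM2-650M_paired-fine-tuned.tar.gz.

```python
# Minimal sketch: extract BALM-paired.tar.gz and run a single-residue
# masked prediction. Assumes the archive extracts to a HuggingFace-format
# checkpoint directory; the extraction path below is hypothetical.
import tarfile

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

with tarfile.open("BALM-paired.tar.gz") as tar:
    tar.extractall("BALM-paired")  # adjust if the archive nests a subdirectory

tokenizer = AutoTokenizer.from_pretrained("BALM-paired")
model = AutoModelForMaskedLM.from_pretrained("BALM-paired")
model.eval()

# One paired input sequence per line (from train-test-eval_paired.tar.gz).
with open("test.txt") as f:
    seq = f.readline().strip()

enc = tokenizer(seq, return_tensors="pt", truncation=True)
masked = enc["input_ids"].clone()
pos = 10  # arbitrary residue position to mask
true_id = masked[0, pos].item()
masked[0, pos] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=masked, attention_mask=enc["attention_mask"]).logits
pred_id = logits[0, pos].argmax().item()
print(f"true: {tokenizer.decode([true_id])}  predicted: {tokenizer.decode([pred_id])}")
```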
Files (9.1 GB)

| Size | Checksum |
|---|---|
| 1.1 GB | md5:977b8467e5e1a27b0eaa7b69c1b26300 |
| 96.2 MB | md5:bd3d3c41cc106036f8047ce4f13eafb2 |
| 7.2 GB | md5:109faed067216dd0881fff4a584cce23 |
| 543.9 MB | md5:9dd72461dfa7fb0dccc8b96dd5e8a6c3 |
| 43.5 MB | md5:1febfb05d50934ac3c841e6b25bc21fd |
| 43.3 MB | md5:6d5bf2476c697b39755e0a0661faaa79 |