Published July 16, 2024 | Version v2024.07.16
Dataset · Open Access

Improving antibody language models with native pairing

  • Scripps Research

Description

Motivation. Existing large language models designed to predict antibody structure and function have been trained exclusively with unpaired antibody sequences. This is a substantial drawback, as each antibody represents a unique pairing of heavy and light chains that both contribute to antigen recognition. The cost of generating large datasets of natively paired antibody sequences is orders of magnitude higher than the cost of unpaired sequences, and the paucity of available paired antibody sequence datasets precludes training a state-of-the-art language model using only paired training data. Here, we sought to determine whether and to what extent natively paired training data improves model performance.

Results. Using a unique and recently reported dataset of approximately 1.6 × 10⁶ natively paired human antibody sequences, we trained two baseline antibody language model (BALM) variants: BALM-paired and BALM-unpaired. We quantify the superiority of BALM-paired over BALM-unpaired and show that its improved performance can be attributed, at least in part, to its ability to learn cross-chain features that span natively paired heavy and light chains. Additionally, we fine-tuned the general protein language model ESM-2 using these paired antibody sequences and report that the fine-tuned model, but not the base ESM-2, demonstrates a similar understanding of cross-chain features.
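
For orientation, the ESM-2 fine-tuning described above follows the standard masked-language-modeling recipe and can be reproduced with off-the-shelf tooling. The sketch below is a minimal illustration using Hugging Face Transformers, not the authors' training script (which is in the GitHub repository linked below); the checkpoint identifier is the public 650M-parameter ESM-2 release, while the sequence-length cutoff and batch size are placeholder assumptions.

```python
# A minimal masked-LM fine-tuning sketch; an illustration, not the
# authors' training script. The tokenization cutoff and batch size
# below are assumed values.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "facebook/esm2_t33_650M_UR50D"  # public 650M-parameter ESM-2
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# train.txt / eval.txt from train-test-eval_paired.tar.gz:
# one paired heavy+light input sequence per line.
data = load_dataset("text", data_files={"train": "train.txt", "eval": "eval.txt"})
tokenized = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard masked-language-modeling objective (15% of tokens masked).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="esm2-650M_paired-ft", per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["eval"],
    data_collator=collator,
)
trainer.train()
```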

Files. The following files are included in this repository:

  • BALM-paired.tar.gz: Model weights for the BALM-paired model.
  • BALM-shuffled.tar.gz: Model weights for the BALM-shuffled model.
  • BALM-unpaired.tar.gz: Model weights for the BALM-unpaired model.
  • ESM2-650M_paired-fine-tuned.tar.gz: Model weights for the 650M-parameter ESM-2 model after fine-tuning with natively paired antibody sequences.
  • jaffe-paired-dataset_airr-annotation.tar.gz: All natively paired antibody sequences from the Jaffe dataset were annotated with abstar and subsequently filtered to remove duplicate and unproductive sequences. The annotated sequences are provided in an AIRR-compliant format.
  • test-dataset_annotated.tar.gz: Two CSV files, both with sequences annotated in an AIRR-compliant format. lc-coherence_test-unique_annotated.csv contains all sequences from the test dataset, and fig3-20kembeddings_annotated.csv contains the 20,000 test-set sequences used for the Figure 2 UMAP embeddings. In both files, heavy and light chains can be paired via their shared pair_id (see the sketch after this list).
  • train-test-eval_paired.tar.gz: Datasets used to train, test, and evaluate the BALM-paired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line. This dataset was also used to fine-tune the 650M-parameter ESM-2 variant.
  • train-test-eval_shuffled.tar.gz: Datasets used to train, test, and evaluate the BALM-shuffled model. Compressed folder containing three CSV files, each with two columns: one for the heavy chain and one for the light chain.
  • train-test-eval_unpaired.tar.gz: Datasets used to train, test, and evaluate the BALM-unpaired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line.
  • classification-datasets.tar.gz: Three classification datasets used to train classification models in Figure 5. The datasets are: flu-0_cov-1.csv, hd-0_cov-1.csv, and hd-0_flu-1_cov-2.csv. CoV antibody sequences were obtained from CoV-AbDab, Flu antibody sequences were obtained from Wang et al., and healthy donor antibody sequences were obtained from Hurtado et al.
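
Because the annotated test files store heavy and light chains as separate AIRR rows linked by a shared pair_id, re-pairing them reduces to a join. The following pandas sketch illustrates this under the assumption that the standard AIRR locus column (IGH/IGK/IGL) is present in these CSVs; the actual column set of this export may differ.

```python
# A minimal sketch for re-pairing heavy and light chains by pair_id.
# Assumes the standard AIRR "locus" column is present in the export.
import pandas as pd

df = pd.read_csv("lc-coherence_test-unique_annotated.csv")

heavy = df[df["locus"] == "IGH"]
light = df[df["locus"].isin(["IGK", "IGL"])]

# One row per antibody: heavy- and light-chain annotations side by side.
pairs = heavy.merge(light, on="pair_id", suffixes=("_heavy", "_light"))
print(f"{len(pairs)} paired antibodies recovered")
```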

Code: All code used for model training, testing, and figure generation is available under the MIT license on GitHub. An archived version of the GitHub repository (from the time of manuscript publication) is included here as code-archive.zip.


Files (11.4 GB)

Size      MD5 checksum
1.1 GB    977b8467e5e1a27b0eaa7b69c1b26300
1.1 GB    b7db1ba702509a5e4bb1e6e63ff6dc8d
1.1 GB    53ca98f80ef5b85935fbb1f270ed6128
1.8 MB    625b24cc6165fd868e189bd287b17600
23.5 kB   e9bf4f2486016fee609dd09ada7503f5
7.2 GB    109faed067216dd0881fff4a584cce23
543.9 MB  9dd72461dfa7fb0dccc8b96dd5e8a6c3
64.7 MB   c56a491e63089e8777f23b84ed2d90ff
43.5 MB   1febfb05d50934ac3c841e6b25bc21fd
43.3 MB   eeb0a8698a7eee743ef3b17c1d99872f
43.3 MB   6d5bf2476c697b39755e0a0661faaa79

Additional details

Software

Repository URL: https://github.com/brineylab/BALM-paper
Programming language: Python
Development Status: Active