Improving antibody language models with native pairing
Description
Motivation. Existing large language models designed to predict antibody structure and function have been trained exclusively with unpaired antibody sequences. This is a substantial drawback, as each antibody represents a unique pairing of heavy and light chains that both contribute to antigen recognition. The cost of generating large datasets of natively paired antibody sequences is orders of magnitude higher than the cost of unpaired sequences, and the paucity of available paired antibody sequence datasets precludes training a state-of-the-art language model using only paired training data. Here, we sought to determine whether and to what extent natively paired training data improves model performance.
Results. Using a unique and recently reported dataset of approximately 1.6 × 10⁶ natively paired human antibody sequences, we trained two baseline antibody language model (BALM) variants: BALM-paired and BALM-unpaired. We quantify the superiority of BALM-paired over BALM-unpaired, and we show that BALM-paired's improved performance can be attributed at least in part to its ability to learn cross-chain features that span natively paired heavy and light chains. Additionally, we fine-tuned the general protein language model ESM-2 using these paired antibody sequences and report that the fine-tuned model, but not base ESM-2, demonstrates a similar understanding of cross-chain features.
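For readers who want to reproduce the ESM-2 fine-tuning step, the sketch below shows one way to run masked-language-model fine-tuning of the public 650M-parameter ESM-2 checkpoint on the paired sequence files from this record. It is a minimal sketch under stated assumptions, not the paper's training code: the checkpoint name (facebook/esm2_t33_650M_UR50D) is the public HuggingFace release, the hyperparameters are illustrative, and train.txt/eval.txt (from train-test-eval_paired.tar.gz, described below) are assumed to contain one ready-to-tokenize paired sequence per line.

```python
# Minimal fine-tuning sketch, not the training code from the paper:
# the checkpoint is the public ESM-2 release, hyperparameters are
# illustrative, and train.txt/eval.txt are assumed to hold one
# ready-to-tokenize paired sequence per line.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "facebook/esm2_t33_650M_UR50D"  # public 650M-parameter ESM-2
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# One paired heavy/light sequence per line (from train-test-eval_paired.tar.gz).
data = load_dataset("text", data_files={"train": "train.txt", "eval": "eval.txt"})
tokenized = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard BERT-style masking: 15% of residues are selected for prediction.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="esm2-650m-paired-ft",
        per_device_train_batch_size=8,
        learning_rate=1e-5,
        num_train_epochs=3,
        fp16=True,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["eval"],
    data_collator=collator,
)
trainer.train()
```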
Files. The following files are included in this repository:
- BALM-paired.tar.gz: Model weights for the BALM-paired model (a loading sketch appears below, after the file list).
- BALM-unpaired.tar.gz: Model weights for the BALM-unpaired model.
- ESM2-650M_paired-fine-tuned.tar.gz: Model weights for the 650M-parameter ESM-2 model after fine-tuning with natively paired antibody sequences.
- jaffe-paired-dataset_airr-annotation.tar.gz: All natively paired antibody sequences from the Jaffe dataset, annotated with abstar and filtered to remove duplicate and unproductive sequences. The annotated sequences are provided in an AIRR-compliant format (see the loading sketch after this list).
- train-test-eval_paired.tar.gz: Datasets used to train, test, and evaluate the BALM-paired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line. This dataset was also used to fine-tune the 650M-parameter ESM-2 variant.
- train-test-eval_unpaired.tar.gz: Datasets used to train, test, and evaluate the BALM-unpaired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line.
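Because the annotations follow the AIRR Rearrangement schema, they should load with ordinary TSV tooling. A minimal sketch, assuming the archive extracts to a single AIRR-compliant TSV (the extracted file name below is hypothetical; the column names come from the AIRR schema, which encodes booleans such as productive as T/F in TSV):

```python
# Minimal sketch of loading the AIRR-format annotations with pandas.
# The extracted file name is hypothetical; adjust to match the archive contents.
import pandas as pd

airr = pd.read_csv("jaffe-paired-dataset_airr-annotation.tsv", sep="\t", low_memory=False)

# Keep productive rearrangements ("T"/"F" is the AIRR TSV boolean encoding)
# and inspect a few commonly used annotation fields from the AIRR schema.
productive = airr[airr["productive"] == "T"]
print(productive[["sequence_id", "locus", "v_call", "j_call", "junction_aa"]].head())
```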
Code: All code used for model training, testing, and figure generation is available under the MIT license on GitHub.
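If the model tarballs extract to HuggingFace-format checkpoint directories (an assumption; this record does not state the serialization format), the released weights can be loaded and queried as in the sketch below, which masks a single residue of one test sequence and predicts it. The same pattern applies to BALM-unpaired.tar.gz and ESM2-650M_paired-fine-tuned.tar.gz.

```python
# Minimal sketch: extract BALM-paired.tar.gz and run a single-residue
# masked prediction. Assumes the archive extracts to a HuggingFace-format
# checkpoint directory; the extraction path below is hypothetical.
import tarfile

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

with tarfile.open("BALM-paired.tar.gz") as tar:
    tar.extractall("BALM-paired")  # adjust if the archive nests a subdirectory

tokenizer = AutoTokenizer.from_pretrained("BALM-paired")
model = AutoModelForMaskedLM.from_pretrained("BALM-paired")
model.eval()

# One paired input sequence per line (from train-test-eval_paired.tar.gz).
with open("test.txt") as f:
    seq = f.readline().strip()

enc = tokenizer(seq, return_tensors="pt", truncation=True)
masked = enc["input_ids"].clone()
pos = 10  # arbitrary residue position to mask
true_id = masked[0, pos].item()
masked[0, pos] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=masked, attention_mask=enc["attention_mask"]).logits
pred_id = logits[0, pos].argmax().item()
print(f"true: {tokenizer.decode([true_id])}  predicted: {tokenizer.decode([pred_id])}")
```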
Files (9.1 GB)

| Size | Checksum |
|---|---|
| 1.1 GB | md5:977b8467e5e1a27b0eaa7b69c1b26300 |
| 96.2 MB | md5:bd3d3c41cc106036f8047ce4f13eafb2 |
| 7.2 GB | md5:109faed067216dd0881fff4a584cce23 |
| 543.9 MB | md5:9dd72461dfa7fb0dccc8b96dd5e8a6c3 |
| 43.5 MB | md5:1febfb05d50934ac3c841e6b25bc21fd |
| 43.3 MB | md5:6d5bf2476c697b39755e0a0661faaa79 |