Published August 16, 2023 | Version v2023.08.16
Dataset (Open Access)

Improving antibody language models with native pairing

  • Scripps Research

Description

Motivation. Existing large language models designed to predict antibody structure and function have been trained exclusively with unpaired antibody sequences. This is a substantial drawback, as each antibody represents a unique pairing of heavy and light chains that both contribute to antigen recognition. The cost of generating large datasets of natively paired antibody sequences is orders of magnitude higher than the cost of unpaired sequences, and the paucity of available paired antibody sequence datasets precludes training a state-of-the-art language model using only paired training data. Here, we sought to determine whether and to what extent natively paired training data improves model performance.

Results. Using a unique and recently reported dataset of approximately 1.6 × 10⁶ natively paired human antibody sequences, we trained two baseline antibody language model (BALM) variants: BALM-paired and BALM-unpaired. We quantify the superiority of BALM-paired over BALM-unpaired, and we show that BALM-paired's improved performance can be attributed at least in part to its ability to learn cross-chain features that span natively paired heavy and light chains. Additionally, we fine-tuned the general protein language model ESM-2 using these paired antibody sequences and report that the fine-tuned model, but not base ESM-2, demonstrates a similar understanding of cross-chain features.

Files. The following files are included in this repository:

  • BALM-paired.tar.gz: Model weights for the BALM-paired model.
  • BALM-unpaired.tar.gz: Model weights for the BALM-unpaired model.
  • ESM2-650M_paired-fine-tuned.tar.gz: Model weights for the 650M-parameter ESM-2 model after fine-tuning with natively paired antibody sequences.
  • jaffe-paired-dataset_airr-annotation.tar.gz: All natively paired antibody sequences from the Jaffe dataset, annotated with abstar and subsequently filtered to remove duplicate or unproductive sequences. The annotated sequences are provided in an AIRR-compliant format.
  • train-test-eval_paired.tar.gz: Datasets used to train, test, and evaluate the BALM-paired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line. This dataset was also used to fine-tune the 650M-parameter ESM-2 variant.
  • train-test-eval_unpaired.tar.gz: Datasets used to train, test, and evaluate the BALM-unpaired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line.
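Because each train.txt, test.txt, and eval.txt file stores one input sequence per line, loading a split is a simple line-by-line read. The sketch below is illustrative only: the helper name `load_sequences`, the demo file `demo_train.txt`, and its two example sequences are hypothetical stand-ins for the actual extracted files.

```python
from pathlib import Path

def load_sequences(path):
    """Read one input sequence per line, dropping blank lines and surrounding whitespace."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Minimal demo with a synthetic file standing in for an extracted train.txt;
# in practice you would point this at e.g. train-test-eval_paired/train.txt.
demo = Path("demo_train.txt")
demo.write_text("EVQLVESGGGLVQPGG\nQVQLQQSGAELARPGA\n")
seqs = load_sequences(demo)
print(len(seqs))  # 2
```

The same loader works for both the paired and unpaired archives, since their splits share the one-sequence-per-line layout.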

Code: All code used for model training, testing, and figure generation is available under the MIT license on GitHub.

Files (9.1 GB total)

  • md5:977b8467e5e1a27b0eaa7b69c1b26300 (1.1 GB)
  • md5:bd3d3c41cc106036f8047ce4f13eafb2 (96.2 MB)
  • md5:109faed067216dd0881fff4a584cce23 (7.2 GB)
  • md5:9dd72461dfa7fb0dccc8b96dd5e8a6c3 (543.9 MB)
  • md5:1febfb05d50934ac3c841e6b25bc21fd (43.5 MB)
  • md5:6d5bf2476c697b39755e0a0661faaa79 (43.3 MB)