Published October 27, 2025 | Version v1
Dataset Open

Data for: Nucleotide context models outperform protein language models for predicting antibody affinity maturation

  • 1. Fred Hutch Cancer Center
  • 2. Rockefeller University
  • 3. University of California, Berkeley
  • 4. Indiana University Bloomington
  • 5. University of Washington
  • 6. Howard Hughes Medical Institute

Description

This record provides data from various studies as described in Johnson, M.M. et al. (2025). Nucleotide context models outperform protein language models for predicting antibody affinity maturation.

The EPAM software to reproduce the results in that publication can be found here.

Model data files

The models_setup.tar.gz archive contains the model parameter files needed for human heavy chain S5F, Replay SHM and DMS, and Thrifty-prod in EPAM. The archive should be extracted in the root directory of the EPAM repository. The files corresponding to each model are listed below.

model files
S5F
  • mutability rates: data/S5F/hh_s5f_muts.csv
  • substitution rates: data/S5F/hh_s5f_subs.csv
ReplaySHM (+ DMS)
  • heavy chain SHM rates: data/gcreplay/chigy_hc_mutation_rates_nt.csv
  • light chain SHM rates: data/gcreplay/chigy_lc_mutation_rates_nt.csv
  • DMS measurements: data/gcreplay/final_variant_scores.csv
Thrifty-prod
  • thrifty-models/models/cnn_ind_lrg-v1wyatt-simple-0.*
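
As a minimal sketch of the extraction step, the Python below unpacks an archive into a repository root and reports which of the files listed above are absent. It assumes the archive stores paths relative to the repository root (for example data/S5F/...), which this record does not state explicitly. The names extract_archive, missing_model_files, and THRIFTY_GLOB are illustrative and are not part of EPAM.

```python
import tarfile
from pathlib import Path

# File paths listed in this record for the S5F and ReplaySHM (+ DMS) models.
EXPECTED_FILES = [
    "data/S5F/hh_s5f_muts.csv",
    "data/S5F/hh_s5f_subs.csv",
    "data/gcreplay/chigy_hc_mutation_rates_nt.csv",
    "data/gcreplay/chigy_lc_mutation_rates_nt.csv",
    "data/gcreplay/final_variant_scores.csv",
]

# Thrifty-prod is given in the record as a wildcard pattern, so it is
# checked with a glob rather than an exact file name.
THRIFTY_GLOB = "thrifty-models/models/cnn_ind_lrg-v1wyatt-simple-0.*"


def extract_archive(archive, dest):
    """Extract a .tar.gz archive into dest (only use with trusted archives)."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)


def missing_model_files(repo_root):
    """Return the expected paths or patterns with no match under repo_root."""
    root = Path(repo_root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    if not any(root.glob(THRIFTY_GLOB)):
        missing.append(THRIFTY_GLOB)
    return missing
```

For example, after calling extract_archive("models_setup.tar.gz", ".") from the EPAM repository root, an empty list from missing_model_files(".") indicates that every path listed above is present.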

PCP data files

The pcps.tar.gz archive contains dataframes of parent-child pairs of B cell receptor sequences from various data sets.

File: data set
  • pcp_inputs/tang-deepshm-prod_pcp_2024-08-08_MASKED_NI_noN_no-naive.csv: human heavy chain data from Tang et al. (2022) and Vergani et al. (2017)
  • pcp_inputs/wyatt-10x-1p5m_paired-igh_fs-all_pcp_2024-11-22_NI_noN_no-naive.csv: human heavy chain data from Jaffe et al. (2022)
  • pcp_inputs/rodriguez-airr-seq-race-prod_pcp_2024-07-28_MASKED_NI_noN_no-naive.csv: human heavy chain data from Rodriguez et al. (2023)
  • pcp_inputs/ford-flairr-seq-prod_pcp_2024-07-26_MASKED_NI_noN_no-naive.csv: human heavy chain data from Ford et al. (2023)
  • pcp_gcreplay_inputs/igh/gctrees_2025-01-10-full_igh_pcp_NoBackMuts.csv: mouse heavy chain data from DeWitt et al. (2025)
  • pcp_gcreplay_inputs/igk/gctrees_2025-01-10-full_igk_pcp_NoBackMuts.csv: mouse light chain data from DeWitt et al. (2025)
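
This record does not list the column names of the PCP files, so a reasonable first step is to inspect a file's header before writing any analysis code. The sketch below does that with the Python standard library only. The function name preview_csv is hypothetical, and the code makes no assumption about which columns exist.

```python
import csv
from itertools import islice


def preview_csv(path, n_rows=5):
    """Return (column_names, first n_rows rows as dicts) for a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Read only the first few rows so large files are not loaded fully.
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows
```

Calling preview_csv on one of the paths above (after extracting pcps.tar.gz) returns the header and a handful of rows, which is usually enough to decide how to load the full file.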

Dataframes and tables for manuscript plots

The dataframes_and_tables.tar.gz archive contains analysis results for generating plots for figures in the manuscript. The archive should be extracted in the notebooks/ directory of the EPAM repository.

aaprobs data files

The aaprobs.tar.gz archive contains the amino acid probabilities computed on PCP data sets with EPAM. These are in the epam_output/ directory of the archive. The files are organized as follows:

  • Results for models on human repertoire data sets have the file path pattern: epam_output/<data set>/<model>/combined_aaprob.hdf5
  • Results with each ESM-1v model and the ESM-1v ensemble on the Rodriguez et al. (2023) data set have the file path pattern: epam_output/rodriguez-airr-seq-race-prod_pcp_2024-07-28_MASKED_NI_noN_no-naive/<esm-model>/<model>/combined_aaprob.hdf5
  • Results for non-ESM models on the Replay data set, DeWitt et al. (2025), have the file path pattern: epam_output/gcreplay/<chain>/gctrees_2025-01-10-full_<chain>_pcp_NoBackMuts/<model>/aaprob.hdf5
  • Results for ESM-related models on the Replay data set, DeWitt et al. (2025), have the file path pattern: epam_output/gcreplay_esm/<chain>/gctrees_2025-01-10-full_<chain>_pcp_NoBackMuts/<model>/<esm-model>/aaprob.hdf5
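
The four path patterns above can be written as small helper functions, which makes the differing order of the <model> and <esm-model> components explicit. This is an illustrative sketch: the function names are hypothetical, and the chain values igh and igk are an assumption taken from the PCP file names in this record rather than a documented list.

```python
from pathlib import PurePosixPath

# Data set directory name for the Rodriguez et al. (2023) ESM-1v results.
RODRIGUEZ = "rodriguez-airr-seq-race-prod_pcp_2024-07-28_MASKED_NI_noN_no-naive"


def human_aaprob_path(data_set, model):
    """Pattern: epam_output/<data set>/<model>/combined_aaprob.hdf5"""
    return PurePosixPath("epam_output") / data_set / model / "combined_aaprob.hdf5"


def rodriguez_esm_aaprob_path(esm_model, model):
    """Pattern: epam_output/<Rodriguez data set>/<esm-model>/<model>/combined_aaprob.hdf5"""
    return (PurePosixPath("epam_output") / RODRIGUEZ / esm_model / model
            / "combined_aaprob.hdf5")


def replay_aaprob_path(chain, model, esm_model=None):
    """Replay patterns; chain is assumed to be 'igh' or 'igk'."""
    pcp_name = f"gctrees_2025-01-10-full_{chain}_pcp_NoBackMuts"
    if esm_model is None:
        # Non-ESM models live under gcreplay/.
        return PurePosixPath("epam_output/gcreplay") / chain / pcp_name / model / "aaprob.hdf5"
    # ESM-related models live under gcreplay_esm/, with <model> before <esm-model>.
    return (PurePosixPath("epam_output/gcreplay_esm") / chain / pcp_name
            / model / esm_model / "aaprob.hdf5")
```

Note that the human-data files are named combined_aaprob.hdf5 while the Replay files are named aaprob.hdf5, so a single glob over one file name will not find both.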

The archive also contains the anarci/ directory, which maps clonal families in the human data to IMGT numbering as determined with the ANARCI tool, Dunbar et al. (2016).

The aaprobs files, ANARCI files, and PCP files can be used to regenerate the dataframes and tables (i.e., those found in dataframes_and_tables.tar.gz) used to produce the manuscript plots. The aaprobs.tar.gz and pcps.tar.gz archives should be extracted in the notebooks/ directory of the EPAM repository.

Files (95.5 GB)

Checksum (size)
  • md5:147d5b5115eab9aaed26ea74535a7e54 (77.8 GB)
  • md5:6a3c094fe9cf8acc2b7be31ced070c4b (17.7 GB)
  • md5:1b196db12c86ee3fc11e6083e9d0a73c (779.2 kB)
  • md5:c9f5a5798e4a387b14a66a27f97c1acf (54.0 MB)
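
Because the largest archive is tens of gigabytes, it is worth checking a download against its MD5 checksum before extracting it. The sketch below computes the digest in chunks so the file is never held in memory at once. The function names md5_of_file and matches_checksum are hypothetical, and the file listing here does not show which checksum belongs to which archive name, so that pairing has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

A call such as matches_checksum("downloaded_archive.tar.gz", "md5:...") returns True only when the computed digest equals the listed one; the file name in that call is a placeholder.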

Additional details

Funding

  • National Institutes of Health: R01-AI146028
  • National Institutes of Health: R56-HG013117
  • National Institutes of Health: R01-HG013117
  • Office of Research Infrastructure Programs: S10OD028685

Software