Data for: Nucleotide context models outperform protein language models for predicting antibody affinity maturation
Description
This record provides data from various studies as described in Johnson, M.M. et al. (2025). Nucleotide context models outperform protein language models for predicting antibody affinity maturation.
The EPAM software needed to reproduce the results in that publication is available at https://github.com/matsengrp/epam.
Model data files
The models_setup.tar.gz archive contains the model parameter files needed for human heavy chain S5F, Replay SHM and DMS, and Thrifty-prod in EPAM. The archive should be extracted in the root directory of the EPAM repository. The files corresponding to each model are listed below.
| model | files |
|---|---|
| S5F | |
| ReplaySHM (+ DMS) | |
| Thrifty-prod | |
PCP data files
The pcps.tar.gz archive contains dataframes of parent-child pairs of B cell receptor sequences from various data sets.
| file | data set |
|---|---|
| pcp_inputs/tang-deepshm-prod_pcp_2024-08-08_MASKED_NI_noN_no-naive.csv | human heavy chain data from Tang et al. (2022) and Vergani et al. (2017) |
| pcp_inputs/wyatt-10x-1p5m_paired-igh_fs-all_pcp_2024-11-22_NI_noN_no-naive.csv | human heavy chain data from Jaffe et al. (2022) |
| pcp_inputs/rodriguez-airr-seq-race-prod_pcp_2024-07-28_MASKED_NI_noN_no-naive.csv | human heavy chain data from Rodriguez et al. (2023) |
| pcp_inputs/ford-flairr-seq-prod_pcp_2024-07-26_MASKED_NI_noN_no-naive.csv | human heavy chain data from Ford et al. (2023) |
| pcp_gcreplay_inputs/igh/gctrees_2025-01-10-full_igh_pcp_NoBackMuts.csv | mouse heavy chain data from DeWitt et al. (2025) |
| pcp_gcreplay_inputs/igk/gctrees_2025-01-10-full_igk_pcp_NoBackMuts.csv | mouse light chain data from DeWitt et al. (2025) |
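As a first check after extracting pcps.tar.gz, it can help to look at a file's column names and row count. The sketch below uses only the Python standard library. The helper name `summarize_pcp_csv` is hypothetical, and no column names are assumed because this record does not list them.

```python
import csv


def summarize_pcp_csv(path):
    """Return (column_names, row_count) for a CSV file.

    Rows are streamed one at a time, so large files are not
    loaded into memory all at once.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        # First row is treated as the header; an empty file yields [].
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, `summarize_pcp_csv("pcp_inputs/ford-flairr-seq-prod_pcp_2024-07-26_MASKED_NI_noN_no-naive.csv")` would report the columns and number of parent-child pairs in the Ford et al. (2023) file, assuming the working directory is where the archive was extracted.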
Dataframes and tables for manuscript plots
The dataframes_and_tables.tar.gz archive contains analysis results for generating plots for figures in the manuscript. The archive should be extracted in the notebooks/ directory of the EPAM repository.
aaprobs data files
The aaprobs.tar.gz archive contains the amino acid probabilities computed on PCP data sets with EPAM. These are in the epam_output/ directory of the archive. The files are organized as follows:
- Results for models on human repertoire data sets have the file path pattern:
  epam_output/<data set>/<model>/combined_aaprob.hdf5
- Results with each ESM-1v model and the ESM-1v ensemble on the Rodriguez et al. (2023) data set have the file path pattern:
  epam_output/rodriguez-airr-seq-race-prod_pcp_2024-07-28_MASKED_NI_noN_no-naive/<esm-model>/<model>/combined_aaprob.hdf5
- Results for non-ESM models on the Replay data set (DeWitt et al. (2025)) have the file path pattern:
  epam_output/gcreplay/<chain>/gctrees_2025-01-10-full_<chain>_pcp_NoBackMuts/<model>/aaprob.hdf5
- Results for ESM-related models on the Replay data set (DeWitt et al. (2025)) have the file path pattern:
  epam_output/gcreplay_esm/<chain>/gctrees_2025-01-10-full_<chain>_pcp_NoBackMuts/<model>/<esm-model>/aaprob.hdf5
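The four path patterns above can be written as small path-building helpers, which may be convenient when looping over models or chains. This is a minimal sketch: the function names are hypothetical, and the actual directory names used for `<model>` and `<esm-model>` are not given in this record, so they are left as arguments. The chain values `igh` and `igk` are taken from the Replay PCP file names.

```python
from pathlib import PurePosixPath

# Data set and file-name pieces copied from the patterns in this record.
RODRIGUEZ_DATA_SET = (
    "rodriguez-airr-seq-race-prod_pcp_2024-07-28_MASKED_NI_noN_no-naive"
)
REPLAY_PREFIX = "gctrees_2025-01-10-full"


def human_aaprob_path(data_set, model):
    """Models on human repertoire data sets."""
    return PurePosixPath("epam_output", data_set, model, "combined_aaprob.hdf5")


def rodriguez_esm_aaprob_path(esm_model, model):
    """ESM-1v models and ensemble on the Rodriguez et al. (2023) data set."""
    return PurePosixPath(
        "epam_output", RODRIGUEZ_DATA_SET, esm_model, model, "combined_aaprob.hdf5"
    )


def replay_aaprob_path(chain, model):
    """Non-ESM models on the Replay data set; chain is 'igh' or 'igk'."""
    pcp_name = f"{REPLAY_PREFIX}_{chain}_pcp_NoBackMuts"
    return PurePosixPath("epam_output", "gcreplay", chain, pcp_name, model, "aaprob.hdf5")


def replay_esm_aaprob_path(chain, model, esm_model):
    """ESM-related models on the Replay data set; chain is 'igh' or 'igk'."""
    pcp_name = f"{REPLAY_PREFIX}_{chain}_pcp_NoBackMuts"
    return PurePosixPath(
        "epam_output", "gcreplay_esm", chain, pcp_name, model, esm_model, "aaprob.hdf5"
    )
```

Note that the ESM-related patterns place `<esm-model>` before `<model>` for the Rodriguez data set but after it for the Replay data set; the helpers keep that ordering as written.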
The archive also contains an anarci/ directory with mappings of human data clonal families to IMGT numbering, determined with the ANARCI tool (Dunbar et al., 2016).
The aaprobs files, ANARCI files, and PCP files can be used to regenerate the dataframes and tables (i.e., those in dataframes_and_tables.tar.gz) used to produce the manuscript plots. The aaprobs.tar.gz and pcps.tar.gz archives should be extracted in the notebooks/ directory of the EPAM repository.
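The extraction locations stated in this record can be collected into one helper. The sketch below uses Python's standard `tarfile` module; the function name `extract_archive` and the mapping name are hypothetical, and the destinations are the ones given above (repository root for models_setup.tar.gz, notebooks/ for the other three archives).

```python
import tarfile
from pathlib import Path

# Destination of each archive, relative to the EPAM repository root,
# as stated in this record.
ARCHIVE_DESTINATIONS = {
    "models_setup.tar.gz": ".",
    "dataframes_and_tables.tar.gz": "notebooks",
    "aaprobs.tar.gz": "notebooks",
    "pcps.tar.gz": "notebooks",
}


def extract_archive(archive_path, repo_root):
    """Extract one archive into its documented location; return that directory.

    Raises KeyError if the archive name is not one of the four in this record.
    """
    archive_path = Path(archive_path)
    destination = Path(repo_root) / ARCHIVE_DESTINATIONS[archive_path.name]
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(destination)
    return destination
```

A call such as `extract_archive("pcps.tar.gz", "path/to/epam")` would place the pcp_inputs/ and pcp_gcreplay_inputs/ directories under notebooks/ in the repository, where `path/to/epam` stands for a local clone.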
Files
(95.5 GB)
| Name | Size |
|---|---|
| md5:147d5b5115eab9aaed26ea74535a7e54 | 77.8 GB |
| md5:6a3c094fe9cf8acc2b7be31ced070c4b | 17.7 GB |
| md5:1b196db12c86ee3fc11e6083e9d0a73c | 779.2 kB |
| md5:c9f5a5798e4a387b14a66a27f97c1acf | 54.0 MB |
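Because the largest archive is 77.8 GB, it is worth checking a download against the MD5 values listed above before extracting it. The following is a minimal sketch using Python's standard `hashlib`; the function names are hypothetical. The listing here does not pair each checksum with an archive name, so the check only tests membership in the set of listed values.

```python
import hashlib

# MD5 checksums listed in this record's Files section.
LISTED_MD5 = {
    "147d5b5115eab9aaed26ea74535a7e54",
    "6a3c094fe9cf8acc2b7be31ced070c4b",
    "1b196db12c86ee3fc11e6083e9d0a73c",
    "c9f5a5798e4a387b14a66a27f97c1acf",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks
    so that very large archives are never held in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """True if the file's MD5 equals one of the checksums in this record."""
    return md5_of_file(path) in LISTED_MD5
```

If `matches_listed_checksum("aaprobs.tar.gz")` returns False, the download is likely incomplete or corrupted and should be repeated.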
Additional details
Funding
- National Institutes of Health: R01-AI146028
- National Institutes of Health: R56-HG013117
- National Institutes of Health: R01-HG013117
- Office of Research Infrastructure Programs: S10OD028685
Software
- Repository URL: https://github.com/matsengrp/epam