Data for: A sitewise model of natural selection on individual antibodies via a transformer-encoder
Description
This record provides processed B-cell receptor (BCR) data from various studies as described in Matsen IV, F. et al. (2025). A sitewise model of natural selection on individual antibodies via a transformer-encoder. Molecular Biology and Evolution. The software to reproduce the results in that publication can be found here.
BCR sequences are clustered into clonal families, and a germline sequence is inferred for each family. For each clonal family, a phylogenetic tree is inferred and ancestral sequences are reconstructed. The parent and child sequences on each branch of a tree form a "parent-child pair" (PCP); these PCPs are used for training or evaluating models. Only heavy chain sequences are considered in this work.
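As a rough illustration of how PCPs arise from a tree, the sketch below walks a toy tree and emits one pair per branch. The tree representation (a child-to-parent dictionary), the node names, the toy sequences, and the helper names `build_pcps` and `count_differences` are all hypothetical and are not taken from the publication's software. The equal-length check is an assumption of this toy example only.

```python
from typing import Dict, List, Optional, Tuple

PCP = Tuple[str, str, str, str]  # (parent_name, parent_seq, child_name, child_seq)


def build_pcps(parent_of: Dict[str, Optional[str]],
               sequences: Dict[str, str]) -> List[PCP]:
    """Return one parent-child pair per branch of the tree.

    parent_of maps each node name to its parent's name; the root
    (the naive sequence in this toy example) maps to None.
    """
    pcps = []
    for child, parent in parent_of.items():
        if parent is None:
            # The root has no incoming branch, so it is never a child.
            continue
        pcps.append((parent, sequences[parent], child, sequences[child]))
    return pcps


def count_differences(parent_seq: str, child_seq: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(parent_seq) != len(child_seq):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(parent_seq, child_seq))


# Toy tree: naive -> anc1 -> {leaf1, leaf2}. Sequences are made up.
toy_parent_of = {"naive": None, "anc1": "naive", "leaf1": "anc1", "leaf2": "anc1"}
toy_sequences = {
    "naive": "GAGGTGCAG",
    "anc1": "GAGGTACAG",
    "leaf1": "GAGGTACAA",
    "leaf2": "GAGCTACAG",
}
toy_pcps = build_pcps(toy_parent_of, toy_sequences)
```

In this toy tree there are three branches, so `toy_pcps` holds three pairs, and only the first one has the naive sequence as its parent.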
PCP data files
The dnsm_data.tar.gz archive contains dataframes of PCPs from various datasets and simulations.
Processed datasets are found in the DATA_DIR/v3/ directory:
file name | description |
---|---|
wyatt-10x-1p5m_fs-all-NoWinCheck_igh_pcp_2024-10-29_NI_noN_no-naive.csv.gz | PCPs of BCR sequences from Jaffe et al. (2022). |
tang-deepshm-prod-NoWinCheck_igh_pcp_2024-10-29_MASKED_NI_noN_no-naive.csv.gz | PCPs of BCR sequences from Tang et al. (2022) and Vergani et al. (2017). |
rodriguez-airr-seq-race-prod-NoWinCheck_igh_pcp_2024-11-12_MASKED_NI_noN_no-naive.csv.gz | PCPs of BCR sequences from Rodriguez et al. (2023). |
The corresponding IMGT numbering schemes, one per clonal family, are found for each dataset in DATA_DIR/v3/anarci/.
Simulation datasets are found in the DATA_DIR/simulations/v3/ directory:
file name | description |
---|---|
v3convert_dnsm_jaffe+tang_SIM_rodriguez-6-2-25_NI_no-naive.csv.gz | PCPs from simulation of BCR sequences along trees inferred from Rodriguez et al. (2023). |
v3convert_dnsm_jaffe+tang_SIM_v2tang-2025-6-3_NI_no-naive_CONCAT_dnsm_jaffe+tang_SIM_v1jaffebulk-2025-6-3_NI_no-naive.csv.gz | PCPs from simulation of BCR sequences along trees inferred from Jaffe et al. (2022), and Tang et al. (2022) and Vergani et al. (2017). |
v3convert_dnsm_jaffe+tang_SIM_v2tang-2025-6-3_NI_no-naive_CONCAT_dnsm_jaffe+tang_SIM_v1jaffebulk-2025-6-3_NI_no-naive_downsample_50k.csv.gz | 50k subset of PCPs from simulation along trees inferred from Jaffe et al. (2022), and Tang et al. (2022) and Vergani et al. (2017). |
v3convert_dnsm_jaffe+tang_SIM_v2tang-2025-6-3_NI_no-naive_CONCAT_dnsm_jaffe+tang_SIM_v1jaffebulk-2025-6-3_NI_no-naive_downsample_100k.csv.gz | 100k subset of PCPs from simulation along trees inferred from Jaffe et al. (2022), and Tang et al. (2022) and Vergani et al. (2017). |
v3convert_dnsm_jaffe+tang_SIM_v2tang-2025-6-3_NI_no-naive_CONCAT_dnsm_jaffe+tang_SIM_v1jaffebulk-2025-6-3_NI_no-naive_downsample_250k.csv.gz | 250k subset of PCPs from simulation along trees inferred from Jaffe et al. (2022), and Tang et al. (2022) and Vergani et al. (2017). |
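To see which PCP files the archive actually holds without unpacking it, a minimal sketch using Python's standard `tarfile` module is shown here. The helper name `list_pcp_files` is hypothetical, and the internal directory layout of the archive is not assumed: the function simply reports every regular member whose name ends in `.csv.gz`.

```python
import tarfile
from typing import List


def list_pcp_files(archive_path: str) -> List[str]:
    """Return the sorted names of gzipped CSV members in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            # Skip directories and anything that is not a gzipped CSV.
            if member.isfile() and member.name.endswith(".csv.gz")
        )


# Example call, assuming the archive has been downloaded to the
# current working directory:
# for name in list_pcp_files("dnsm_data.tar.gz"):
#     print(name)
```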
PCP data format
column | description |
---|---|
sample_id | sample label, where a sample corresponds to an individual |
family | clonal family label within a sample |
parent_name | label of the parent sequence |
parent_heavy | parent heavy chain sequence |
child_name | label of the child sequence |
child_heavy | child heavy chain sequence |
branch_length | branch length computed in IQ-TREE |
depth | the number of edges that the child sequence is away from the naive sequence in the inferred tree |
distance | sum of branch lengths from the naive sequence to the child sequence |
v_gene_heavy | inferred heavy chain V gene |
j_gene_heavy | inferred heavy chain J gene |
cdr1_codon_start_heavy | position of the first nucleotide of the first codon in heavy chain CDR1 |
cdr1_codon_end_heavy | position of the first nucleotide of the last codon in heavy chain CDR1 |
cdr2_codon_start_heavy | position of the first nucleotide of the first codon in heavy chain CDR2 |
cdr2_codon_end_heavy | position of the first nucleotide of the last codon in heavy chain CDR2 |
cdr3_codon_start_heavy | position of the first nucleotide of the first codon in heavy chain CDR3 |
cdr3_codon_end_heavy | position of the first nucleotide of the last codon in heavy chain CDR3 |
parent_is_naive | True/False whether the parent sequence is the naive sequence of the clonal family |
child_is_leaf | True/False whether the child sequence is a leaf node of the tree |
rate_scale_heavy | correction factor for the heavy chain evolution rate (present in simulation files only; not used in the analysis) |
rate_scale_light | correction factor for the light chain evolution rate (present in simulation files only; not used in the analysis) |
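The columns above can be read with the Python standard library alone. The sketch below streams rows from a gzipped CSV and extracts a CDR's nucleotides from the parent sequence. Because the `*_codon_end_heavy` columns give the first nucleotide of the last codon, the slice extends three nucleotides past that position. Two points are assumptions rather than facts stated in this record: whether positions are 0-based (exposed here as the `zero_based` parameter) and whether the True/False columns are serialized as the text `True` and `False`. The helper names are hypothetical.

```python
import csv
import gzip
from typing import Dict, Iterator


def read_pcp_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield each row of a gzipped PCP CSV as a dict of strings."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle)


def parse_flag(value: str) -> bool:
    """Parse a True/False column (assumes the text 'True' or 'False')."""
    return value.strip().lower() == "true"


def cdr_nucleotides(row: Dict[str, str], cdr: str = "cdr3",
                    zero_based: bool = True) -> str:
    """Return the nucleotides of one heavy chain CDR from the parent sequence.

    The end column marks the FIRST nucleotide of the LAST codon, so the
    slice must run three nucleotides beyond it. The indexing base is an
    assumption; set zero_based=False if positions turn out to be 1-based.
    """
    start = int(row[f"{cdr}_codon_start_heavy"])
    end = int(row[f"{cdr}_codon_end_heavy"])
    offset = 0 if zero_based else 1
    return row["parent_heavy"][start - offset:end - offset + 3]
```

A typical use would be to iterate over `read_pcp_rows(...)`, skip rows where `parse_flag(row["parent_is_naive"])` is true if only non-naive parents are wanted, and compare the CDR lengths returned under both indexing conventions against the companion numbering files to confirm which convention applies.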
Files
Name | Size |
---|---|
md5:55c219c7a2b123d6b20f85a1e0516bbb | 132.9 MB |
Additional details
Dates
- Accepted: 2025-07-09