Published July 15, 2025 | Version v1
Dataset Open

Data for: A sitewise model of natural selection on individual antibodies via a transformer-encoder

  • 1. Fred Hutch Cancer Center
  • 2. University of Washington
  • 3. Howard Hughes Medical Institute
  • 4. University of Utah
  • 5. University of California, Berkeley
  • 6. Indiana University Bloomington

Description

This record provides processed B-cell receptor (BCR) data from various studies as described in Matsen IV, F. et al. (2025). A sitewise model of natural selection on individual antibodies via a transformer-encoder. Molecular Biology and Evolution. The software to reproduce the results in that publication can be found here.

BCR sequences are clustered into clonal families and germline sequences are inferred. For each clonal family, phylogenetic tree inference and reconstruction of ancestral sequences are performed. The parent and child sequences on each branch of a tree form a "parent-child pair" (PCP); PCPs are used for training or evaluating models. Only heavy chain sequences are considered in this work.
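As an illustration of how branches map to PCPs, here is a minimal sketch on a toy tree. The tree layout, node names, and sequences are hypothetical and are not taken from the dataset; the actual pipeline used for the publication may represent trees differently.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy tree: node name -> (parent name, or None for the root; sequence).
# The sequences are made up and far shorter than real heavy chain sequences.
TOY_TREE: Dict[str, Tuple[Optional[str], str]] = {
    "naive": (None, "ATGGCA"),
    "anc1": ("naive", "ATGGCT"),
    "leaf1": ("anc1", "ATGGCT"),
    "leaf2": ("anc1", "CTGGCT"),
}


def parent_child_pairs(
    tree: Dict[str, Tuple[Optional[str], str]]
) -> List[Tuple[str, str, str, str]]:
    """Return one (parent_name, parent_seq, child_name, child_seq) per branch."""
    pairs = []
    for child_name, (parent_name, child_seq) in tree.items():
        if parent_name is None:
            continue  # the root has no incoming branch, so it is never a child
        parent_seq = tree[parent_name][1]
        pairs.append((parent_name, parent_seq, child_name, child_seq))
    return pairs


pcps = parent_child_pairs(TOY_TREE)
for pcp in pcps:
    print(pcp)
```

A tree with four nodes has three branches, so this toy example yields three PCPs.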

 

PCP data files

The dnsm_data.tar.gz archive contains dataframes of PCPs from various datasets and simulations.

Processed datasets are found in the DATA_DIR/v3/ directory:

file name description
wyatt-10x-1p5m_fs-all-NoWinCheck_igh_pcp_2024-10-29_NI_noN_no-naive.csv.gz PCPs of BCR sequences from Jaffe et al. (2022).
tang-deepshm-prod-NoWinCheck_igh_pcp_2024-10-29_MASKED_NI_noN_no-naive.csv.gz PCPs of BCR sequences from Tang et al. (2022) and Vergani et al. (2017).
rodriguez-airr-seq-race-prod-NoWinCheck_igh_pcp_2024-11-12_MASKED_NI_noN_no-naive.csv.gz PCPs of BCR sequences from Rodriguez et al. (2023).

The corresponding IMGT numbering schemes, one per clonal family, are found for each dataset in DATA_DIR/v3/anarci/.
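Once the archive has been extracted, the PCP files are gzipped CSVs and can be read with the Python standard library. The sketch below assumes the file has a header row with the column names documented in this record; the function name read_pcp_rows is hypothetical, and the path passed to it depends on where DATA_DIR ends up after extraction.

```python
import csv
import gzip
from typing import Dict, List


def read_pcp_rows(path: str) -> List[Dict[str, str]]:
    """Read a gzipped PCP CSV into a list of dicts keyed by column name.

    All values are returned as strings; convert numeric and True/False
    columns as needed.
    """
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle))
```

For the larger files it may be preferable to iterate over the reader instead of materializing a list, or to load the file with a dataframe library.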

Simulation datasets are found in the DATA_DIR/simulations/v3/ directory:

file name description
v3convert_dnsm_jaffe+tang_SIM_rodriguez-6-2-25_NI_no-naive.csv.gz PCPs from simulation of BCR sequences along trees inferred from Rodriguez et al. (2023).
v3convert_dnsm_jaffe+tang_SIM_v2tang-2025-6-3_NI_no-naive_CONCAT_dnsm_jaffe+tang_SIM_v1jaffebulk-2025-6-3_NI_no-naive.csv.gz PCPs from simulation of BCR sequences along trees inferred from Jaffe et al. (2022), and Tang et al. (2022) and Vergani et al. (2017).
v3convert_dnsm_jaffe+tang_SIM_v2tang-2025-6-3_NI_no-naive_CONCAT_dnsm_jaffe+tang_SIM_v1jaffebulk-2025-6-3_NI_no-naive_downsample_50k.csv.gz 50k subset of PCPs from simulation along trees inferred from Jaffe et al. (2022), and Tang et al. (2022) and Vergani et al. (2017).
v3convert_dnsm_jaffe+tang_SIM_v2tang-2025-6-3_NI_no-naive_CONCAT_dnsm_jaffe+tang_SIM_v1jaffebulk-2025-6-3_NI_no-naive_downsample_100k.csv.gz 100k subset of PCPs from simulation along trees inferred from Jaffe et al. (2022), and Tang et al. (2022) and Vergani et al. (2017).
v3convert_dnsm_jaffe+tang_SIM_v2tang-2025-6-3_NI_no-naive_CONCAT_dnsm_jaffe+tang_SIM_v1jaffebulk-2025-6-3_NI_no-naive_downsample_250k.csv.gz 250k subset of PCPs from simulation along trees inferred from Jaffe et al. (2022), and Tang et al. (2022) and Vergani et al. (2017).

 

PCP data format

column description
sample_id sample label, where a sample corresponds to an individual
family clonal family label within a sample
parent_name label of the parent sequence
parent_heavy parent heavy chain sequence
child_name label of the child sequence
child_heavy child heavy chain sequence
branch_length branch length computed in IQ-TREE
depth number of edges between the naive sequence and the child sequence in the inferred tree
distance sum of branch lengths from the naive sequence to the child sequence
v_gene_heavy inferred heavy chain V gene
j_gene_heavy inferred heavy chain J gene
cdr1_codon_start_heavy position of the first nucleotide of the first codon in heavy chain CDR1
cdr1_codon_end_heavy position of the first nucleotide of the last codon in heavy chain CDR1
cdr2_codon_start_heavy position of the first nucleotide of the first codon in heavy chain CDR2
cdr2_codon_end_heavy position of the first nucleotide of the last codon in heavy chain CDR2
cdr3_codon_start_heavy position of the first nucleotide of the first codon in heavy chain CDR3
cdr3_codon_end_heavy position of the first nucleotide of the last codon in heavy chain CDR3
parent_is_naive True/False whether the parent sequence is the naive sequence of the clonal family
child_is_leaf True/False whether the child sequence is a leaf node of the tree
rate_scale_heavy correction factor for the heavy chain evolution rate (present in simulation files only; not used in the analysis)
rate_scale_light correction factor for the light chain evolution rate (present in simulation files only; not used in the analysis)
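The columns above can be combined to pull out a region or count substitutions along a branch. The sketch below is illustrative only and makes assumptions this record does not state: positions are treated as 0-based, parent and child sequences are assumed to have equal length, and the True/False columns are assumed to arrive as text. The helper names and the toy row are hypothetical.

```python
from typing import Dict


def cdr3_nucleotides(row: Dict[str, str], seq_column: str = "parent_heavy") -> str:
    """Slice the CDR3 nucleotides out of a sequence column.

    Assumes 0-based positions. The end column marks the first nucleotide
    of the last codon, so the slice runs three nucleotides past it.
    """
    start = int(row["cdr3_codon_start_heavy"])
    end = int(row["cdr3_codon_end_heavy"])
    return row[seq_column][start : end + 3]


def nucleotide_differences(row: Dict[str, str]) -> int:
    """Count sites at which the parent and child sequences differ."""
    parent, child = row["parent_heavy"], row["child_heavy"]
    if len(parent) != len(child):
        raise ValueError("parent and child sequences differ in length")
    return sum(p != c for p, c in zip(parent, child))


def as_bool(value: str) -> bool:
    """Interpret a True/False column that was read as text."""
    return value.strip().lower() == "true"


# Hypothetical toy row; real sequences are much longer.
toy_row = {
    "parent_heavy": "AAACCCGGGTTT",
    "child_heavy": "AAACCTGGGTTT",
    "cdr3_codon_start_heavy": "3",
    "cdr3_codon_end_heavy": "6",
    "parent_is_naive": "True",
    "child_is_leaf": "False",
}

print(cdr3_nucleotides(toy_row))                 # CCCGGG
print(cdr3_nucleotides(toy_row, "child_heavy"))  # CCTGGG
print(nucleotide_differences(toy_row))           # 1
```

If the positions turn out to be 1-based, subtract one from both start and end before slicing.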

 

Files

Files (132.9 MB)

Name Size
md5:55c219c7a2b123d6b20f85a1e0516bbb
132.9 MB
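The listed MD5 checksum can be used to confirm that a download completed intact. A minimal sketch, assuming the archive was saved locally as dnsm_data.tar.gz (the local file name and the helper name md5_of_file are assumptions):

```python
import hashlib

# Checksum listed for the archive in this record.
EXPECTED_MD5 = "55c219c7a2b123d6b20f85a1e0516bbb"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example usage (uncomment after downloading the archive):
# assert md5_of_file("dnsm_data.tar.gz") == EXPECTED_MD5
```

Reading in chunks keeps memory use constant regardless of the archive size.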

Additional details

Dates

Accepted
2025-07-09