Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage

Naqvi, Sahin

doi:10.5281/zenodo.14633030

Published January 11, 2025 | Version v2

Dataset Open

Naqvi, Sahin (Contact person)^{1, 2}

1. Stanford University School of Medicine
2. Boston Children's Hospital

Processed data and code for "Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage," Naqvi et al. 2025.

The record is organized into the following subfolders, each tar'ed and gzipped (except modisco_reports, which is a zip archive):

data_analysis.tar.gz - Processed data for modulation of TWIST1 levels and calculation of RE responsiveness to TWIST1 dosage

atac_design.txt - design matrix for ATAC-seq TWIST1 titration samples
all.sub.150bpclust.greater2.500bp.merge.TWIST1.titr.ATAC.counts.txt - ATAC-seq counts from all samples over all reproducible ATAC-seq peak regions, as defined in Naqvi et al. 2023
atac_deseq_fitmodels_moded50.R - R code for calculating the new version of ED50 and the response to full depletion from TWIST1 titration data (note: uses the drm.R function from 10.5281/zenodo.7689948; install the drc package with this version to avoid errors)

baseline_models.tar.gz - Code and data for training baseline models to predict RE responsiveness to SOX9/TWIST1 dosage

{sox9|twist1}.{0v100|ed50}.{train|valid|test}.txt - Data for each prediction task (ED50 or effect of full TF depletion, for SOX9 or TWIST1), split into training, validation, and test folds
HOCOMOCOv11_core_HUMAN_mono_jaspar_format.all.sub.150bpclust.greater2.500bp.merge.minus300bp.p01.maxscore.mat.cpg.gc.basemean.txt.gz - matrix of predictors for all REs. Quantitative encoding of PWM match for all HOCOMOCO motifs + CpG + GC content, plus unperturbed ATAC-seq signal
train_baseline.R - R code to train baseline (LASSO regression or random forest) models using predictor matrix and the provided training data.
- Note: training the random forest to predict the effect of full TF depletion is computationally intensive because it is trained across all REs; expect a CPU run to take ~6 hrs.
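
The brace notation used for file names in this description, such as {sox9|twist1}.{0v100|ed50}.{train|valid|test}.txt, stands for every combination of the listed alternatives. As a minimal sketch (the helper name expand_pattern is hypothetical and not part of the archive), the twelve baseline data file names can be enumerated like this:

```python
from itertools import product


def expand_pattern(fields):
    """Join every combination of per-field alternatives with dots."""
    return [".".join(combo) for combo in product(*fields)]


# Alternatives taken from the pattern in the description above.
baseline_files = expand_pattern([
    ["sox9", "twist1"],          # transcription factor
    ["0v100", "ed50"],           # full-depletion effect or ED50
    ["train", "valid", "test"],  # data fold
    ["txt"],                     # file extension
])

for name in baseline_files:
    print(name)
```

The same helper applies to the other patterns in this record by swapping in their alternatives.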

chrombpnet_models.tar.gz - Remainder of code, data, and models for fine-tuning and interpreting ChromBPNet models to predict RE responsiveness to SOX9/TWIST1 dosage

Fine-tuning code, data, models
- {all|sox9.direct|twist1.bound.down}.{train|valid|test}.{ed50|0v100.log2fc}.txt - Data for each prediction task (ED50 or effect of full TF depletion, for SOX9 or TWIST1), split into training, validation, and test folds
- pretrained.unperturbed.chrombpnet.h5 - Pretrained model of unperturbed ATAC-seq signal in CNCCs, obtained by running ChromBPNet (https://github.com/kundajelab/chrombpnet) on DMSO-treated SOX9/TWIST1-tagged ATAC-seq data
- finetune_chrombpnet.py - code for fine-tuning the pretrained model for any of the relevant prediction tasks (ED50 or effect of full TF depletion, for SOX9 or TWIST1)
- best.model.chrombpnet.{0v100|ed50}.{sox9|twist1}.h5 - output of finetune_chrombpnet.py, best model after 10 training epochs for the indicated task
- chrombpnet.{0v100|ed50}.{sox9|twist1}.contrib.{h5|bw} - contribution scores for the indicated predictive model, obtained by running chrombpnet contribs_bw on the corresponding model h5 file.
- chrombpnet.{0v100|ed50}.{sox9|twist1}.contrib.modisco.{h5|bw} - TF-MoDISco output from the corresponding contribution score file
Interpretation code, data, models
- contrib_h5_to_projshap_npy.py - code to convert contrib .h5 files into .npy files containing projected SHAP scores (required because the CWM matching code takes this format of contribution scores)
- sox9.direct.10col.bed, twist1.bound.down.10col.uniq.bed - regions over which CWMs will be matched (likely direct targets of each TF)
- match_cwms.py - Python code to match individual CWM instances. Takes as input: modisco .h5 file, SHAP .npy file, bed file of regions to be matched. Output is a bed file of all CWM matches (not pruned, contains many redundant matches).
- chrombpnet.ed50.{sox9|twist1}.contrib.perc05.matchperc10.allmatch.bed - output of match_cwms.py
- take_max_overlap.py - code to merge output of match_cwms.py into clusters, and then take the maximum (length-normalized) match score in each cluster as the representative CWM match of that cluster. Requires upstream bedtools commands to be piped in, see example usage in file.
- chrombpnet.ed50.{sox9|twist1}.contrib.perc05.matchperc10.allmatch.maxoverlap.bed - output of take_max_overlap.py. These CWM instances are the ones used throughout the paper.
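
The clustering step described for take_max_overlap.py (merge overlapping matches into clusters, then keep the match with the highest length-normalized score in each) can be illustrated with a small pure-Python sketch. This is not the archived script, which expects bedtools output to be piped in; the Match fields, the function name, and the rule that two matches overlap when they share at least one base (half-open BED coordinates) are assumptions made for illustration only.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    chrom: str
    start: int   # 0-based, half-open BED coordinates (assumed)
    end: int
    name: str
    score: float


def take_max_per_cluster(matches: List[Match]) -> List[Match]:
    """Cluster overlapping matches; keep the best length-normalized one per cluster."""
    def normalized(m: Match) -> float:
        return m.score / (m.end - m.start)

    best: List[Match] = []
    cluster_chrom, cluster_end = None, -1
    for m in sorted(matches, key=lambda m: (m.chrom, m.start, m.end)):
        if m.chrom == cluster_chrom and m.start < cluster_end:
            # Overlaps the current cluster: extend it and update its representative.
            cluster_end = max(cluster_end, m.end)
            if normalized(m) > normalized(best[-1]):
                best[-1] = m
        else:
            # Starts a new cluster.
            cluster_chrom, cluster_end = m.chrom, m.end
            best.append(m)
    return best


matches = [
    Match("chr1", 100, 110, "A", 5.0),   # normalized score 0.5
    Match("chr1", 105, 125, "B", 12.0),  # normalized score 0.6, overlaps A
    Match("chr1", 200, 210, "C", 3.0),
    Match("chr2", 100, 110, "D", 1.0),
]
representatives = take_max_per_cluster(matches)
print([m.name for m in representatives])  # ['B', 'C', 'D']
```

Normalizing by length keeps longer CWMs from winning a cluster simply because they sum contribution over more positions.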

modisco_reports.zip - TF-MoDISco reports from running on the fine-tuned ChromBPNet models

modisco_report_{sox9|twist1}_{0v100|ed50} - folders containing images of discovered CWMs and HTML/PDF summary reports from running TF-MoDISco on the indicated fine-tuned ChromBPNet model

chrombpnet_models_supp.tar.gz - Alternative ChromBPNet models to predict SOX9/TWIST1 ED50 using varying definitions of direct targets

best.model.chrombpnet.ed50.twist1.3hdn.h5 - TWIST1 direct targets defined using response to full 3h depletion (as was done for SOX9 throughout the rest of the paper)
best.model.chrombpnet.ed50.sox9.v5chip.h5 - SOX9 direct targets defined using V5 ChIP-seq from SOX9-tagged lines (as was done for TWIST1 throughout the paper)

mirny_model.tar.gz - Code and data for analyzing and fitting Mirny model of TF-nucleosome competition to observed RE dosage response curves

twist1.strong.multi.only.ed50.cutoff.true.hill.txt - ED50 and signed Hill coefficients for all TWIST1-dependent REs with only buffering Coordinators (mostly one or two) and no other TFs' binding sites. "ed50_new" is the ED50 calculation used in this paper.
twist1.strong.weak{1|2|3}.ed50.cutoff.true.hill.txt - ED50 and signed Hill coefficients for all TWIST1-dependent REs with only buffering Coordinators (mostly one or two), the indicated number of sensitizing (weak) Coordinators, and no other TFs' binding sites. "ed50_new" is the ED50 calculation used in this paper.
MirnyModelAnalysis.py - Python code for analysis of the Mirny model of TF-nucleosome competition. Contains implementations of analytic solutions, as well as code to fit the model to the observed ED50 and Hill coefficients in the provided data files.
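
For readers unfamiliar with the two summary statistics in these files, the generic Hill equation below shows how an ED50 and a Hill coefficient together define a dose-response curve. It is a textbook sketch only: it is not the Mirny model, not the fitting code in MirnyModelAnalysis.py, and it does not reproduce the sign convention of the "signed" Hill coefficients. The function name and the example values are illustrative assumptions.

```python
def hill_response(dose: float, ed50: float, n: float) -> float:
    """Fraction of the maximal response at a given dose (generic Hill equation).

    ed50 is the dose giving a half-maximal response; n > 0 is the Hill
    coefficient, with larger values giving a steeper, more switch-like curve.
    """
    return dose**n / (ed50**n + dose**n)


# Illustrative values only (not taken from the data files).
for n in (1.0, 2.0, 4.0):
    curve = [round(hill_response(d, ed50=0.3, n=n), 3) for d in (0.1, 0.3, 1.0)]
    print(n, curve)
```

Whatever the value of n, the response at dose = ed50 is exactly half of the maximum, which is why the two parameters can be read separately as "where" and "how sharply" an RE responds.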

nucleoatac.tar.gz - Output files from running NucleoATAC on merged ATAC-seq from each of 5 TWIST1 dosages

TWIST1_{dosage}_merge.nucmap_combined.bed.gz - see NucleoATAC docs for output format
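
Because several of the tar.gz archives described above are large, it can be useful to inspect their contents before unpacking. A minimal sketch using Python's standard tarfile module follows; the function name list_archive is hypothetical, and the archive is assumed to be in the working directory.

```python
import tarfile


def list_archive(path):
    """Return the member names of a gzipped tar archive without extracting it."""
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()


if __name__ == "__main__":
    # Assumes the archive has already been downloaded to the current directory.
    for member in list_archive("data_analysis.tar.gz"):
        print(member)
```

Note that modisco_reports.zip is a zip archive rather than a tarball, so it would need the zipfile module instead.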

Files (14.0 GB)

Name	MD5	Size
baseline_models.tar.gz	cfe33c3c49f01e440a045ffaca18d26f	390.4 MB
chrombpnet_models.tar.gz	4bef6bbb1259de6c93040efcd3f21a7d	13.4 GB
chrombpnet_models_supp.tar.gz	106a3d8a9cb5bfd05e6e3c60a85f8fca	119.1 MB
data_analysis.tar.gz	8a6f86288bddbc46a81334ed1b5eb0e3	5.5 MB
mirny_model.tar.gz	e68a075bd88c5288e0a2e10753a9f746	108.4 kB
modisco_reports.zip	f2e130e59cd862596e955b459834c23b	49.2 MB
nucleoatac.tar.gz	6d32e8337fcfbfbc68a1072d43847a2c	82.3 MB
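
The MD5 checksums listed above can be used to confirm that each download completed intact. The following is a minimal sketch using only the Python standard library; the function names and the assumption that the files sit in a single download directory are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "baseline_models.tar.gz": "cfe33c3c49f01e440a045ffaca18d26f",
    "chrombpnet_models.tar.gz": "4bef6bbb1259de6c93040efcd3f21a7d",
    "chrombpnet_models_supp.tar.gz": "106a3d8a9cb5bfd05e6e3c60a85f8fca",
    "data_analysis.tar.gz": "8a6f86288bddbc46a81334ed1b5eb0e3",
    "mirny_model.tar.gz": "e68a075bd88c5288e0a2e10753a9f746",
    "modisco_reports.zip": "f2e130e59cd862596e955b459834c23b",
    "nucleoatac.tar.gz": "6d32e8337fcfbfbc68a1072d43847a2c",
}


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each file name to True (match), False (mismatch), or None (not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5sum(path) == expected) if path.exists() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in verify_downloads().items():
        print(f"{name}: {labels[status]}")
```

Reading in chunks keeps memory use flat, which matters for the 13.4 GB chrombpnet_models.tar.gz archive.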

Additional details

National Institutes of Health
Mapping and prediction of quantitative transcription factor dosage effects to understand variation in craniofacial morphology and disease K99DE032729
