Published January 11, 2025 | Version v2
Dataset Open

Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage

  • 1. Stanford University School of Medicine
  • 2. Boston Children's Hospital

Description

Processed data and code for "Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage," Naqvi et al 2025.

The directory is organized into the following subfolders, each archived as a .tar.gz file (modisco_reports is a .zip):

data_analysis.tar.gz - Processed data for modulation of TWIST1 levels and calculation of RE responsiveness to TWIST1 dosage

  • atac_design.txt - design matrix for ATAC-seq TWIST1 titration samples
  • all.sub.150bpclust.greater2.500bp.merge.TWIST1.titr.ATAC.counts.txt - ATAC-seq counts from all samples over all reproducible ATAC-seq peak regions, as defined in Naqvi et al 2023
  • atac_deseq_fitmodels_moded50.R - R code for calculating the new version of ED50 and the response to full depletion from TWIST1 titration data (note: uses the drm.R function from 10.5281/zenodo.7689948; install drc with this version to avoid errors)

baseline_models.tar.gz - Code and data for training baseline models to predict RE responsiveness to SOX9/TWIST1 dosage

  • {sox9|twist1}.{0v100|ed50}.{train|valid|test}.txt - Training/testing/validation data (ED50 or full TF depletion effect for SOX9 or TWIST1), split into train/test/validation folds
  • HOCOMOCOv11_core_HUMAN_mono_jaspar_format.all.sub.150bpclust.greater2.500bp.merge.minus300bp.p01.maxscore.mat.cpg.gc.basemean.txt.gz - matrix of predictors for all REs. Quantitative encoding of PWM match for all HOCOMOCO motifs + CpG + GC content, plus unperturbed ATAC-seq signal
  • train_baseline.R - R code to train baseline (LASSO regression or random forest) models using predictor matrix and the provided training data.
    • Note: training the random forest to predict the effect of full TF depletion is computationally intensive because it runs across all REs; if doing this, plan to run on CPU for ~6 hrs.
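
The brace notation in the file-name pattern above expands to one file per combination of TF, prediction target, and split. The following is a minimal sketch that enumerates those names; the function name baseline_split_files is hypothetical and is not part of train_baseline.R, and the values are copied from the pattern in the listing.

```python
from itertools import product

# Values copied from the pattern {sox9|twist1}.{0v100|ed50}.{train|valid|test}.txt
FACTORS = ("sox9", "twist1")
TARGETS = ("0v100", "ed50")  # full TF depletion effect, or ED50
SPLITS = ("train", "valid", "test")


def baseline_split_files():
    """Return every file name matched by the pattern (hypothetical helper)."""
    return [
        f"{tf}.{target}.{split}.txt"
        for tf, target, split in product(FACTORS, TARGETS, SPLITS)
    ]


if __name__ == "__main__":
    for name in baseline_split_files():
        print(name)
```

This yields 2 x 2 x 3 = 12 file names, which can be checked against the contents of the extracted archive.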

chrombpnet_models.tar.gz - Remainder of code, data, and models for fine-tuning and interpreting ChromBPNet models to predict RE responsiveness to SOX9/TWIST1 dosage

  • Fine-tuning code, data, models
    • {all|sox9.direct|twist1.bound.down}.{train|valid|test}.{ed50|0v100.log2fc}.txt - Training/testing/validation data (ED50 or full TF depletion effect for SOX9 or TWIST1), split into train/test/validation folds
    • pretrained.unperturbed.chrombpnet.h5 - Pretrained model of unperturbed ATAC-seq signal in CNCCs, obtained by running ChromBPNet (https://github.com/kundajelab/chrombpnet) on DMSO-treated SOX9/TWIST1-tagged ATAC-seq data
    • finetune_chrombpnet.py - code for fine-tuning the pretrained model for any of the relevant prediction tasks (ED50 or effect of full TF depletion, for SOX9 or TWIST1)
    • best.model.chrombpnet.{0v100|ed50}.{sox9|twist1}.h5 - output of finetune_chrombpnet.py, best model after 10 training epochs for the indicated task
    • chrombpnet.{0v100|ed50}.{sox9|twist1}.contrib.{h5|bw} - contribution scores for the indicated predictive model, obtained by running chrombpnet contribs_bw on the corresponding model h5 file.
    • chrombpnet.{0v100|ed50}.{sox9|twist1}.contrib.modisco.{h5|bw} - TF-MoDISco output from the corresponding contribution score file
  • Interpretation code, data, models
    • contrib_h5_to_projshap_npy.py - code to convert contrib .h5 files into .npy files containing projected SHAP scores (required because the CWM matching code takes this format of contribution scores)
    • sox9.direct.10col.bed, twist1.bound.down.10col.uniq.bed - regions over which CWMs will be matched (likely direct targets of each TF)
    • match_cwms.py - Python code to match individual CWM instances. Takes as input: modisco .h5 file, SHAP .npy file, bed file of regions to be matched. Output is a bed file of all CWM matches (not pruned, contains many redundant matches).
    • chrombpnet.ed50.{sox9|twist1}.contrib.perc05.matchperc10.allmatch.bed - output of match_cwms.py 
    • take_max_overlap.py - code to merge output of match_cwms.py into clusters, and then take the maximum (length-normalized) match score in each cluster as the representative CWM match of that cluster. Requires upstream bedtools commands to be piped in, see example usage in file. 
    • chrombpnet.ed50.{sox9|twist1}.contrib.perc05.matchperc10.allmatch.maxoverlap.bed - output of take_max_overlap.py. These CWM instances are the ones used throughout the paper.
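
The pruning step described for take_max_overlap.py (cluster redundant matches, then keep the match with the highest length-normalized score in each cluster) can be illustrated with a small standalone sketch. This is not the authors' script: the Match type and the max_per_cluster function are hypothetical, clusters are assumed to be runs of overlapping intervals on the same chromosome, and "length-normalized" is assumed to mean score divided by interval length. The real script expects bedtools output piped in; see the example usage in that file.

```python
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    """One CWM match, BED-style (hypothetical record layout)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive; assumed to be greater than start
    name: str
    score: float


def normalized(match: Match) -> float:
    # Assumption: length normalization is score / interval length.
    return match.score / (match.end - match.start)


def max_per_cluster(matches: List[Match]) -> List[Match]:
    """Keep the best-scoring match from each run of overlapping matches."""
    best: List[Match] = []
    cluster_chrom: Optional[str] = None
    cluster_end = -1
    for m in sorted(matches, key=lambda m: (m.chrom, m.start, m.end)):
        if m.chrom == cluster_chrom and m.start < cluster_end:
            # Overlaps the current cluster: extend it, keep the better match.
            cluster_end = max(cluster_end, m.end)
            if normalized(m) > normalized(best[-1]):
                best[-1] = m
        else:
            # No overlap: this match opens a new cluster.
            cluster_chrom, cluster_end = m.chrom, m.end
            best.append(m)
    return best
```

Sorting by chromosome and start position first means each cluster can be built in a single pass, which is the same idea as merging sorted intervals.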

modisco_reports.zip - TF-MoDISco reports from running on the fine-tuned ChromBPNet models

  • modisco_report_{sox9|twist1}_{0v100|ed50}: folders containing images of discovered CWMs and HTMLs/PDFs of summarized reports from running TF-MoDISco on the indicated fine-tuned ChromBPNet model

chrombpnet_models_supp.tar.gz - Alternative ChromBPNet models to predict SOX9/TWIST1 ED50 using varying definitions of direct targets

  • best.model.chrombpnet.ed50.twist1.3hdn.h5 - TWIST1 direct targets defined using response to full 3h depletion (as was done for SOX9 throughout the rest of the paper)
  • best.model.chrombpnet.ed50.sox9.v5chip.h5 - SOX9 direct targets defined using V5 ChIP-seq from SOX9-tagged lines (as was done for TWIST1 throughout the paper)

mirny_model.tar.gz - Code and data for analyzing and fitting Mirny model of TF-nucleosome competition to observed RE dosage response curves

  • twist1.strong.multi.only.ed50.cutoff.true.hill.txt - ED50 and signed Hill coefficients for all TWIST1-dependent REs with only buffering Coordinators (mostly one or two) and no other TFs' binding sites. "ed50_new" is the ED50 calculation used in this paper.
  • twist1.strong.weak{1|2|3}.ed50.cutoff.true.hill.txt - ED50 and signed Hill coefficients for all TWIST1-dependent REs with only buffering Coordinators (mostly one or two), the indicated number of sensitizing (weak) Coordinators, and no other TFs' binding sites. "ed50_new" is the ED50 calculation used in this paper.
  • MirnyModelAnalysis.py - Python code for analysis of the Mirny model of TF-nucleosome competition. Contains implementations of analytic solutions, as well as code to fit the model to the observed ED50 and Hill coefficients in the provided data files.
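
The data files above report an ED50 and a Hill coefficient per RE. As a reminder of what those two parameters describe, here is a generic Hill-curve sketch. It is neither the Mirny model nor the authors' fitting code; the function name hill_response is hypothetical, the Hill coefficient is assumed positive, and the sign convention used in the provided files is not covered here.

```python
def hill_response(dosage: float, ed50: float, hill: float) -> float:
    """Fraction of maximal response at a given TF dosage (generic Hill curve).

    dosage and ed50 must share units (for example, percent of unperturbed
    TF level) and be positive; hill is assumed positive in this sketch.
    """
    return dosage ** hill / (ed50 ** hill + dosage ** hill)


if __name__ == "__main__":
    # By construction the response is half-maximal when dosage equals ED50.
    print(hill_response(25.0, ed50=25.0, hill=2.0))
    # A larger Hill coefficient makes the transition around ED50 steeper.
    for h in (1.0, 2.0, 4.0):
        print(h, hill_response(50.0, ed50=25.0, hill=h))
```

In this parameterization ED50 sets where along the dosage axis the response reaches half its maximum, and the Hill coefficient sets how sharply the response changes around that point.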

nucleoatac.tar.gz - Output files from running NucleoATAC on merged ATAC-seq from each of 5 TWIST1 dosages

  • TWIST1_{dosage}_merge.nucmap_combined.bed.gz - see NucleoATAC docs for output format
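
Once the archive is extracted, the .bed.gz files can be streamed without decompressing them to disk. The sketch below reads any gzipped, tab-delimited BED-like file and yields raw string fields; the helper name read_bed_gz is hypothetical, and no column meanings are assumed, since those are defined in the NucleoATAC documentation.

```python
import gzip
from typing import Iterator, List


def read_bed_gz(path: str) -> Iterator[List[str]]:
    """Yield the tab-separated fields of each data line in a gzipped BED file.

    Blank lines and lines starting with '#' are skipped. Fields are left as
    strings; convert columns as needed after checking the format docs.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line or line.startswith("#"):
                continue
            yield line.split("\t")
```

A typical use is to loop over read_bed_gz(path) for one of the TWIST1_{dosage}_merge.nucmap_combined.bed.gz files and convert the start and end columns to integers.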

Files

Files (14.0 GB)

Checksum - Size
md5:cfe33c3c49f01e440a045ffaca18d26f - 390.4 MB
md5:4bef6bbb1259de6c93040efcd3f21a7d - 13.4 GB
md5:106a3d8a9cb5bfd05e6e3c60a85f8fca - 119.1 MB
md5:8a6f86288bddbc46a81334ed1b5eb0e3 - 5.5 MB
md5:e68a075bd88c5288e0a2e10753a9f746 - 108.4 kB
md5:f2e130e59cd862596e955b459834c23b - 49.2 MB
md5:6d32e8337fcfbfbc68a1072d43847a2c - 82.3 MB
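
Because the largest archive is over 13 GB, it is worth checking each download against its listed MD5 checksum before extracting. The following is a minimal sketch using only the Python standard library; the function names md5_of_file and matches_listing are hypothetical, and the file is read in chunks so that it never has to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, matches_listing(downloaded_path, "md5:cfe33c3c49f01e440a045ffaca18d26f") returns True only if the downloaded file is the one with that checksum in the listing.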

Additional details

Funding

National Institutes of Health
Mapping and prediction of quantitative transcription factor dosage effects to understand variation in craniofacial morphology and disease (K99DE032729)