Code and data for "ChromoGen: Diffusion model predicts single-cell chromatin conformations"
Description
This dataset includes all code and data required to reproduce the results of:
Greg Schuette, Zhuohan Lao, and Bin Zhang. ChromoGen: Diffusion model predicts single-cell chromatin conformations, 16 July 2024, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-4630850/v1]
File descriptions:
chromogen_code.tar.gz
contains all code and, as of its upload date, is identical to the corresponding GitHub repo. Note that:
- Some or all of the code inside chromogen_code.tar.gz/ChromoGen/recreate_results/train/EPCOT/, chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/EPCOT, and chromogen_code.tar.gz/ChromoGen/src/model/Embedder was adapted from that provided in the original EPCOT paper, Zhang et al. (2023).
- chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/Figure_4/domain_boundary_support/PostAnalysisTools.py was adopted from Bintu et al. (2018); our only change was translating the code from Python 2 to Python 3.
- Several of the Jupyter Notebooks within chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/ visualize Hi-C and DNase-seq data from Rao et al. (2014) and The ENCODE Project Consortium (2012), respectively, though this dataset excludes the experimental data itself. See chromogen_code.tar.gz/README.md for instructions on obtaining the data.
- Dip-C data from Tan et al. (2018) are also visualized throughout these notebooks. This dataset excludes the raw Dip-C data but includes a post-processed version of it (see conformations.tar.gz and outside_data.tar.gz below).
- The files within chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/conformations/MDHomopolymer were originally used for Schuette et al. (2023), though these scripts are first made available here (the first author of both works created these files).
epcot_final.pt
contains the fine-tuned EPCOT parameters. Note that the pre-trained parameters, which are not included in this dataset, came from Zhang et al. (2023) and served as the starting point for our fine-tuning.
chromogen.pt
contains the complete set of ChromoGen model parameters, including both the relevant fine-tuned EPCOT parameters and all diffusion model parameters.
conformations.tar.gz
contains all conformations analyzed in the manuscript, including the Dip-C conformations formatted in an HDF5 file, all ChromoGen-inferred conformations, and the MD-generated homopolymer conformations. Descriptively named subdirectories organize the data. Note that:
- conformations.tar.gz/conformations/MDHomopolymer/DUMP_FILE.dcd is from Schuette et al. (2023), though it is first made available here.
- conformations.tar.gz/conformations/DipC/processed_data.h5 represents our post-processed version of the 3D genome structures predicted by Dip-C in Tan et al. (2018).
outside_data.tar.gz
contains two subdirectories:
- inputs contains our post-processed genome assembly file. Its sole content, hg19.h5, is a post-processed version of the FASTA-formatted hg19 human genome assembly created by Church et al. (2011), which we downloaded from the UCSC Genome Browser (Kent et al. (2002) and Nassar et al. (2023)). This dataset does NOT include the FASTA file itself.
- training_data contains the Dip-C conformations post-processed by our pipeline. This duplicates conformations.tar.gz/conformations/DipC/processed_data.h5, described above.
embeddings.tar.gz
contains the sequence embeddings created by our fine-tuned EPCOT model for each region included in the diffusion model's training set. These embeddings are needed only for training.
chromogen_code.tar.gz/ChromoGen/README.md and the README.md file on our GitHub repo (identical at the time of this dataset's publication) explain the content of each file in greater detail. They also explain how to use the code to reproduce our results or to make your own structure predictions.
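Because the tarballs are large, it can help to look at how one is organized before unpacking it. The following is a minimal sketch using only Python's standard library; the function name `top_level_summary` is hypothetical and is not part of the ChromoGen code.

```python
import tarfile
from collections import Counter


def top_level_summary(tar_path):
    """Count archive members under each top-level entry without extracting."""
    counts = Counter()
    # "r:*" lets tarfile detect the compression (gzip for the .tar.gz files).
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            # Drop empty and "." components so "./a/b" and "a/b" agree.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                counts[parts[0]] += 1
    return dict(counts)
```

Calling `top_level_summary("conformations.tar.gz")` from the directory holding the download would report how many members sit under each top-level directory of that archive.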
You can download and organize all the files in this dataset as intended by running the following in bash:
```bash
# Download the code and expand the tarball whose contents define the
# larger file structure of the repository this dataset is archiving.
wget https://zenodo.org/records/14218666/files/chromogen_code.tar.gz
tar -xvzf chromogen_code.tar.gz
rm chromogen_code.tar.gz
# Enter the top-level directory of the repo, create the subdirectories
# that'll contain the data, and cd to the data directory
cd ChromoGen
mkdir -p recreate_results/downloaded_data/models
cd recreate_results/downloaded_data
# Download all the data in the proper locations
wget https://zenodo.org/records/14218666/files/conformations.tar.gz &
wget https://zenodo.org/records/14218666/files/embeddings.tar.gz &
wget https://zenodo.org/records/14218666/files/outside_data.tar.gz &
cd models
wget https://zenodo.org/records/14218666/files/chromogen.pt &
wget https://zenodo.org/records/14218666/files/epcot_final.pt &
cd ..
wait
# Untar the three tarballs
tar -xvzf conformations.tar.gz &
tar -xvzf embeddings.tar.gz &
tar -xvzf outside_data.tar.gz &
wait
# Remove the now-unneeded tarballs
rm conformations.tar.gz embeddings.tar.gz outside_data.tar.gz
```
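After the script finishes, a quick check that the key files landed where expected can catch an interrupted download. This is a minimal sketch, not part of the ChromoGen code: `missing_paths` and `EXPECTED` are hypothetical names. The two model paths follow from the script above, while the two conformations paths assume that conformations.tar.gz unpacks into a top-level conformations directory, as the file descriptions suggest.

```python
from pathlib import Path

# Paths relative to ChromoGen/recreate_results/downloaded_data.
# Assumption: conformations.tar.gz expands to a "conformations" directory.
EXPECTED = (
    "models/chromogen.pt",
    "models/epcot_final.pt",
    "conformations/DipC/processed_data.h5",
    "conformations/MDHomopolymer/DUMP_FILE.dcd",
)


def missing_paths(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]
```

An empty list from `missing_paths("ChromoGen/recreate_results/downloaded_data")` means all four files are present; any entries returned name the files to download or extract again.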
Files (47.6 GB)
MD5 checksum | Size |
---|---|
md5:ec721f447cc327d0e0f9e0d8cdcbb4b6 | 2.3 GB |
md5:80d0351772d4c7bc1c9d39cdbf079a9c | 17.3 MB |
md5:7bbcd5f8ddb6c3012f4526b02dc39285 | 9.9 GB |
md5:c73b767256cb85119c861c0804b98c55 | 33.3 GB |
md5:01df3a20f4fa223772a9d2c400130656 | 52.5 MB |
md5:5cba449c19175708f7939deb51cdacd8 | 2.1 GB |
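The MD5 checksums listed above can be used to confirm that a download is intact. The sketch below uses only Python's standard library; `md5_hex` and `matches_listing` are hypothetical helper names. The listing here does not pair checksums with file names, so the checksum for a given file has to be read from the record page itself.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Compute a file's MD5 digest in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listing entry such as 'md5:<hex digest>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[4:]
    return md5_hex(path) == expected
```

For example, `matches_listing("chromogen.pt", "md5:<value from the record page>")` returns True only when the downloaded file matches the published checksum.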
Additional details
Related works
- Is supplement to: Preprint: 10.21203/rs.3.rs-4630850/v1 (DOI)
Dates
- Other: 2024-12-04
Software
- Repository URL: https://github.com/ZhangGroup-MITChemistry/ChromoGen
- Programming language: Python
References
- Zhang et al. "A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome." Nucleic Acids Research (2023) DOI: 10.1093/nar/gkad436
- Bintu et al. "Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells." Science (2018) DOI: 10.1126/science.aau1783
- Rao et al. "A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping." Cell (2014) DOI: 10.1016/j.cell.2014.11.021
- The ENCODE Project Consortium "An integrated encyclopedia of DNA elements in the human genome." Nature (2012) DOI: 10.1038/nature11247
- Tan et al. "Three-dimensional genome structures of single diploid human cells." Science (2018) DOI: 10.1126/science.aat5641
- Schuette et al. "Efficient Hi-C inversion facilitates chromatin folding mechanism discovery and structure prediction." Biophysical Journal (2023) DOI: 10.1016/j.bpj.2023.07.017
- Church et al. "Modernizing reference genome assemblies." PLOS Biology (2011) DOI: 10.1371/journal.pbio.1001091
- Kent et al. "The human genome browser at UCSC." Genome Research (2002) 12(6):996-1006
- Nassar et al. "The UCSC Genome Browser database: 2023 update." Nucleic Acids Research (2023) PMID: 36420891, DOI: 10.1093/nar/gkac1072