Code and data for "ChromoGen: Diffusion model predicts single-cell chromatin conformations"

Schuette, Greg; Lao, Zhuohan; Zhang, Bin

doi:10.5281/zenodo.14218666

Published December 4, 2024 | Version 1.0.0

Dataset Open

1. Massachusetts Institute of Technology (MIT)
2. Massachusetts Institute of Technology

Contributors

Researchers:

Supervisor:

Zhang, Bin²


This dataset includes all code and data required to reproduce the results of:

Greg Schuette, Zhuohan Lao, and Bin Zhang. ChromoGen: Diffusion model predicts single-cell chromatin conformations, 16 July 2024, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-4630850/v1]

File descriptions:

1. chromogen_code.tar.gz contains all code and, as of its upload date, is identical to the corresponding GitHub repo. Note that:
   1. Some or all of the code inside chromogen_code.tar.gz/ChromoGen/recreate_results/train/EPCOT/, chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/EPCOT, and chromogen_code.tar.gz/ChromoGen/src/model/Embedder was adapted from that provided in the original EPCOT paper, Zhang et al. (2023).
   2. chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/Figure_4/domain_boundary_support/PostAnalysisTools.py was adopted from Bintu et al. (2018); our only change was translating the code from Python 2 to Python 3.
   3. Several of the Jupyter Notebooks within chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/ visualize Hi-C and DNase-seq data from Rao et al. (2014) and The ENCODE Project Consortium (2012), respectively, though this dataset excludes the experimental data itself. See chromogen_code.tar.gz/README.md for instructions on obtaining the data.
   4. Dip-C data from Tan et al. (2018) are visualized throughout these notebooks as well. This dataset excludes the raw Dip-C data, though it does include a post-processed version of the data (see bullets 4-5).
   5. The files within chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/conformations/MDHomopolymer were originally used for Schuette et al. (2023), though we first make those scripts available here (the first author of both works created these files).
2. epcot_final.pt contains the fine-tuned EPCOT parameters. The pre-trained parameters, which are not included in this dataset, came from Zhang et al. (2023) and served as the starting point for our fine-tuning.
3. chromogen.pt contains the complete set of ChromoGen model parameters: the relevant fine-tuned EPCOT parameters and all diffusion model parameters.
4. conformations.tar.gz contains all conformations analyzed in the manuscript, including the Dip-C conformations formatted in an HDF5 file, all ChromoGen-inferred conformations, and the MD-generated homopolymer conformations. Descriptively named subdirectories organize the data. Note that:
   1. conformations.tar.gz/conformations/MDHomopolymer/DUMP_FILE.dcd is from Schuette et al. (2023), though it is first made available here.
   2. conformations.tar.gz/conformations/DipC/processed_data.h5 represents our post-processed version of the 3D genome structures predicted by Dip-C in Tan et al. (2018).
5. outside_data.tar.gz contains two subdirectories:
   1. inputs contains our post-processed genome assembly file. Its sole content, hg19.h5, is a post-processed version of the FASTA-formatted hg19 human genome assembly created by Church et al. (2011), which we downloaded from the UCSC genome browser (Kent et al. (2002) and Nassar et al. (2023)). This dataset does NOT include the FASTA file itself.
   2. training_data contains the Dip-C conformations post-processed by our pipeline. This is a duplicate of the file described in bullet 4.2.
6. embeddings.tar.gz contains the sequence embeddings created by our fine-tuned EPCOT model for each region included in the diffusion model's training set. These embeddings are needed only during training.
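
Because the archives are organized into descriptively named subdirectories, it can help to look at an archive's directory layout before extracting it. The following is a minimal Python sketch using only the standard library; the function name list_archive_dirs and its max_depth parameter are illustrative and are not part of the ChromoGen code.

```python
import tarfile
from pathlib import PurePosixPath


def list_archive_dirs(tar_path, max_depth=2):
    """Return the sorted directory paths, up to max_depth levels deep,
    that appear in a (possibly gzip-compressed) tar archive."""
    dirs = set()
    # Mode "r:*" lets tarfile detect the compression automatically.
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            parts = PurePosixPath(member.name).parts
            # A non-directory entry contributes only its parent directories.
            if not member.isdir():
                parts = parts[:-1]
            for depth in range(1, min(len(parts), max_depth) + 1):
                dirs.add("/".join(parts[:depth]))
    return sorted(dirs)
```

For example, list_archive_dirs("conformations.tar.gz") would return the top two levels of that archive's directory tree. Note that the sketch still streams through the whole compressed file, so it takes a while on the larger archives, but it writes nothing to disk.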

chromogen_code.tar.gz/ChromoGen/README.md and the README.md file on our GitHub repo (identical at the time of this dataset's publication) explain the content of each file in greater detail. They also explain how to use the code to reproduce our results or to make your own structure predictions.

You can download and organize all the files in this dataset as intended by running the following in bash:
# Download the code and expand the tarball whose contents define the
# larger file structure of the repository this dataset is archiving.
wget https://zenodo.org/records/14218666/files/chromogen_code.tar.gz
tar -xvzf chromogen_code.tar.gz
rm chromogen_code.tar.gz

# Enter the top-level directory of the repo, create the subdirectories
# that'll contain the data, and cd to it
cd ChromoGen
mkdir -p recreate_results/downloaded_data/models
cd recreate_results/downloaded_data

# Download all the data in the proper locations
wget https://zenodo.org/records/14218666/files/conformations.tar.gz &
wget https://zenodo.org/records/14218666/files/embeddings.tar.gz &
wget https://zenodo.org/records/14218666/files/outside_data.tar.gz &
cd models
wget https://zenodo.org/records/14218666/files/chromogen.pt &
wget https://zenodo.org/records/14218666/files/epcot_final.pt &
cd ..
wait

# Untar the three tarballs
tar -xvzf conformations.tar.gz &
tar -xvzf embeddings.tar.gz &
tar -xvzf outside_data.tar.gz &
wait

# Remove the now-unneeded tarballs
rm conformations.tar.gz embeddings.tar.gz outside_data.tar.gz
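
Once the commands above have finished, a quick existence check can catch an interrupted download or extraction. The Python sketch below rests on assumptions: the two model paths follow from the download commands, while the two conformation paths assume the tarball unpacks into a top-level conformations directory, as the archive-internal paths quoted in the file descriptions suggest. The name missing_paths is hypothetical and not part of the ChromoGen code.

```python
from pathlib import Path

# Paths relative to ChromoGen/recreate_results/downloaded_data.
# Assumption: conformations.tar.gz unpacks into "conformations/".
EXPECTED = [
    "models/chromogen.pt",
    "models/epcot_final.pt",
    "conformations/MDHomopolymer/DUMP_FILE.dcd",
    "conformations/DipC/processed_data.h5",
]


def missing_paths(data_dir, expected=EXPECTED):
    """Return the entries of `expected` that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling missing_paths("ChromoGen/recreate_results/downloaded_data") returns an empty list when all four files are in place; otherwise it lists the paths that still need to be fetched or extracted.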

Files (47.6 GB)

Name	MD5	Size
chromogen.pt	ec721f447cc327d0e0f9e0d8cdcbb4b6	2.3 GB
chromogen_code.tar.gz	80d0351772d4c7bc1c9d39cdbf079a9c	17.3 MB
conformations.tar.gz	7bbcd5f8ddb6c3012f4526b02dc39285	9.9 GB
embeddings.tar.gz	c73b767256cb85119c861c0804b98c55	33.3 GB
epcot_final.pt	01df3a20f4fa223772a9d2c400130656	52.5 MB
outside_data.tar.gz	5cba449c19175708f7939deb51cdacd8	2.1 GB
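
The MD5 checksums listed above can be used to confirm that each download completed intact. The following is a minimal Python sketch using only the standard library; the checksum values are copied from this record, while the names md5_of and verify are illustrative and not part of the ChromoGen code.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed in this record's file table.
CHECKSUMS = {
    "chromogen.pt": "ec721f447cc327d0e0f9e0d8cdcbb4b6",
    "chromogen_code.tar.gz": "80d0351772d4c7bc1c9d39cdbf079a9c",
    "conformations.tar.gz": "7bbcd5f8ddb6c3012f4526b02dc39285",
    "embeddings.tar.gz": "c73b767256cb85119c861c0804b98c55",
    "epcot_final.pt": "01df3a20f4fa223772a9d2c400130656",
    "outside_data.tar.gz": "5cba449c19175708f7939deb51cdacd8",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, checksums=CHECKSUMS):
    """Map each file name to True (match), False (mismatch), or None (absent)."""
    results = {}
    for name, expected in checksums.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results
```

A file reported as None was simply not found in the given directory. Since the download script deletes the tarballs after extraction and places the two .pt files in a separate models directory, verify would need to be run on each directory before the tarballs are removed.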

Additional details

Related works

Is supplement to: Preprint: 10.21203/rs.3.rs-4630850/v1 (DOI)

Funding

National Institutes of Health
Probing and Perturbing Transcriptional Condensates with Multiscale Modeling and Deep Learning R35GM133580

Dates

Other: 2024-12-04

Software

Repository URL: https://github.com/ZhangGroup-MITChemistry/ChromoGen
Programming language: Python

References

Zhang et al. "A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome." Nucleic Acids Research (2023) DOI: 10.1093/nar/gkad436
Bintu et al. "Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells." Science (2018) DOI: 10.1126/science.aau1783
Rao et al. "A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping." Cell (2014) DOI: 10.1016/j.cell.2014.11.021
The ENCODE Project Consortium. "An integrated encyclopedia of DNA elements in the human genome." Nature (2012) DOI: 10.1038/nature11247
Tan et al. "Three-dimensional genome structures of single diploid human cells." Science (2018) DOI: 10.1126/science.aat5641
Schuette et al. "Efficient Hi-C inversion facilitates chromatin folding mechanism discovery and structure prediction." Biophysical Journal (2023) DOI: 10.1016/j.bpj.2023.07.017
Church et al. "Modernizing reference genome assemblies." PLOS Biology (2011) DOI: 10.1371/journal.pbio.1001091
Kent et al. "The human genome browser at UCSC." Genome Research (2002) 12(6):996-1006
Nassar et al. "The UCSC Genome Browser database: 2023 update." Nucleic Acids Research (2023) PMID: 36420891, DOI: 10.1093/nar/gkac1072

