Published December 4, 2024 | Version 1.0.0
Dataset Open

Code and data for "ChromoGen: Diffusion model predicts single-cell chromatin conformations"

  • 1. Massachusetts Institute of Technology (MIT)
  • 2. Massachusetts Institute of Technology

Contributors

Supervisor:

  • 1. Massachusetts Institute of Technology (MIT)
  • 2. Massachusetts Institute of Technology

Description

This dataset includes all code and data required to reproduce the results of:

Greg Schuette, Zhuohan Lao, and Bin Zhang. ChromoGen: Diffusion model predicts single-cell chromatin conformations, 16 July 2024, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-4630850/v1]

File descriptions:

  1. chromogen_code.tar.gz contains all code and, as of its upload date, is identical to the corresponding GitHub repo. Note that:
    1. Some or all of the code inside chromogen_code.tar.gz/ChromoGen/recreate_results/train/EPCOT/, chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/EPCOT, and chromogen_code.tar.gz/ChromoGen/src/model/Embedder was adapted from that provided in the original EPCOT paper, Zhang et al. (2023). 
    2. chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/Figure_4/domain_boundary_support/PostAnalysisTools.py was adapted from Bintu et al. (2018); our only change was translating the code from Python 2 to Python 3.
    3. Several of the Jupyter Notebooks within chromogen_code.tar.gz/ChromoGen/recreate_results/create_figures/ visualize Hi-C and DNase-seq data from Rao et al. (2014) and The ENCODE Project Consortium (2012), respectively, though this dataset excludes the experimental data itself. See chromogen_code.tar.gz/README.md for instructions on obtaining the data.
    4. Dip-C data from Tan et al. (2018) are visualized throughout these notebooks, as well. This dataset excludes the raw Dip-C data, though it does include a post-processed version of the data (see bullets 4-5).
    5. The files within chromogen_code.tar.gz/ChromoGen/recreate_results/generate_data/conformations/MDHomopolymer were originally used for Schuette et al. (2023), though the scripts are first made available here (the first author of both works created these files).
  2. epcot_final.pt contains the fine-tuned EPCOT parameters. Note that the pre-trained parameters (not included in this dataset) came from Zhang et al. (2023) and served as the starting point for our fine-tuning.
  3. chromogen.pt contains the complete set of ChromoGen model parameters, including both the relevant fine-tuned EPCOT parameters and all diffusion model parameters.
  4. conformations.tar.gz contains all conformations analyzed in the manuscript, including the Dip-C conformations formatted in an HDF5 file, all ChromoGen-inferred conformations, and the MD-generated homopolymer conformations. Descriptively named subdirectories organize the data. Note that:
    1. conformations.tar.gz/conformations/MDHomopolymer/DUMP_FILE.dcd is from Schuette et al. (2023), though it is first made available here.
    2. conformations.tar.gz/conformations/DipC/processed_data.h5 represents our post-processed version of the 3D genome structures predicted by Dip-C in Tan et al. (2018). 
  5. outside_data.tar.gz contains two subdirectories:
    1. inputs contains our post-processed genome assembly file. Its sole content, hg19.h5, is a post-processed version of the FASTA-formatted hg19 human genome assembly created by Church et al. (2011), which we downloaded from the UCSC genome browser (Kent et al. (2002) and Nassar et al. (2023)). This dataset does NOT include the FASTA file itself.
    2. training_data contains the Dip-C conformations post-processed by our pipeline. This is a duplicate of the file described in bullet 4.2.
  6. embeddings.tar.gz contains the sequence embeddings created by our fine-tuned EPCOT model for each region included in the diffusion model's training set. These embeddings are only needed for training.
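
Since several of the tarballs described above are large, it can help to look at a tarball's top-level layout before extracting it. The following is a minimal sketch using Python's standard tarfile module. The function name top_level_entries and its depth parameter are hypothetical and are not part of the ChromoGen code.

```python
import sys
import tarfile


def top_level_entries(tarball_path, depth=2):
    """Return sorted member paths, truncated to at most `depth` components."""
    entries = set()
    # "r:*" lets tarfile detect the compression (gzip for the .tar.gz files).
    with tarfile.open(tarball_path, "r:*") as archive:
        for member in archive:
            # Drop empty and "." components so "./a/b" and "a/b" look the same.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                entries.add("/".join(parts[:depth]))
    return sorted(entries)


if __name__ == "__main__":
    # Usage (hypothetical file name): python list_tarball.py conformations.tar.gz
    for tarball in sys.argv[1:]:
        print(tarball)
        for entry in top_level_entries(tarball):
            print("  " + entry)
```

Listing a gzip-compressed tarball still reads through the whole archive, so this may be slow for the larger files.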

chromogen_code.tar.gz/ChromoGen/README.md and the README.md file on our GitHub repo (identical at the time of this dataset's publication) explain the content of each file in greater detail. They also explain how to use the code to reproduce our results or to make your own structure predictions.

You can download and organize all the files in this dataset as intended by running the following in bash:
# Download the code and expand the tarball whose contents define the
# larger file structure of the repository this dataset is archiving.
wget https://zenodo.org/records/14218666/files/chromogen_code.tar.gz
tar -xvzf chromogen_code.tar.gz
rm chromogen_code.tar.gz

# Enter the top-level directory of the repo, create the subdirectories
# that'll contain the data, and cd to it
cd ChromoGen
mkdir -p recreate_results/downloaded_data/models
cd recreate_results/downloaded_data

# Download all the data in the proper locations
wget https://zenodo.org/records/14218666/files/conformations.tar.gz &
wget https://zenodo.org/records/14218666/files/embeddings.tar.gz &
wget https://zenodo.org/records/14218666/files/outside_data.tar.gz &
cd models
wget https://zenodo.org/records/14218666/files/chromogen.pt &
wget https://zenodo.org/records/14218666/files/epcot_final.pt &
cd ..
wait

# Untar the three tarballs
tar -xvzf conformations.tar.gz &
tar -xvzf embeddings.tar.gz &
tar -xvzf outside_data.tar.gz &
wait

# Remove the now-unneeded tarballs
rm conformations.tar.gz embeddings.tar.gz outside_data.tar.gz
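
After the bash commands above finish, a quick check that the expected files landed in the right place can catch an interrupted download. The following Python sketch checks only the paths that this description names explicitly. It assumes the tarballs expand into the directory layout implied by those paths, and the function name missing_files is hypothetical.

```python
import sys
from pathlib import Path

# Paths relative to ChromoGen/recreate_results/downloaded_data that the
# description names explicitly. Other extracted contents are not checked.
EXPECTED_RELATIVE_PATHS = (
    "models/chromogen.pt",
    "models/epcot_final.pt",
    "conformations/DipC/processed_data.h5",
    "conformations/MDHomopolymer/DUMP_FILE.dcd",
)


def missing_files(data_dir):
    """Return the expected relative paths that are not regular files."""
    data_dir = Path(data_dir)
    return [rel for rel in EXPECTED_RELATIVE_PATHS
            if not (data_dir / rel).is_file()]


if __name__ == "__main__":
    # Default assumes the script is run from the directory containing ChromoGen/.
    default_dir = "ChromoGen/recreate_results/downloaded_data"
    target = sys.argv[1] if len(sys.argv) > 1 else default_dir
    missing = missing_files(target)
    if missing:
        print("Missing files under " + target + ":")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected files found under " + target)
```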

Files

Files (47.6 GB)

Size       MD5 checksum
2.3 GB     ec721f447cc327d0e0f9e0d8cdcbb4b6
17.3 MB    80d0351772d4c7bc1c9d39cdbf079a9c
9.9 GB     7bbcd5f8ddb6c3012f4526b02dc39285
33.3 GB    c73b767256cb85119c861c0804b98c55
52.5 MB    01df3a20f4fa223772a9d2c400130656
2.1 GB     5cba449c19175708f7939deb51cdacd8
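
The MD5 checksums in the file listing above can be used to confirm that a download completed without corruption. Because the listing does not pair each checksum with a file name, the sketch below only tests whether a file's MD5 digest appears among the listed values. The function names md5_of_file and check_downloads are hypothetical.

```python
import hashlib
import sys

# MD5 checksums copied from the file listing above.
LISTED_MD5 = {
    "ec721f447cc327d0e0f9e0d8cdcbb4b6",
    "80d0351772d4c7bc1c9d39cdbf079a9c",
    "7bbcd5f8ddb6c3012f4526b02dc39285",
    "c73b767256cb85119c861c0804b98c55",
    "01df3a20f4fa223772a9d2c400130656",
    "5cba449c19175708f7939deb51cdacd8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(paths):
    """Map each path to True if its MD5 digest is one of the listed checksums."""
    return {str(path): md5_of_file(path) in LISTED_MD5 for path in paths}


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_md5.py chromogen_code.tar.gz ...
    for name, ok in check_downloads(sys.argv[1:]).items():
        print(("OK       " if ok else "MISMATCH ") + name)
```

The bash commands in the description delete each tarball after extraction, so any checksum comparison has to happen before that step.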

Additional details

Related works

Is supplement to
Preprint: 10.21203/rs.3.rs-4630850/v1 (DOI)

Funding

National Institutes of Health
Probing and Perturbing Transcriptional Condensates with Multiscale Modeling and Deep Learning R35GM133580

Dates

Other
2024-12-04

Software

Repository URL
https://github.com/ZhangGroup-MITChemistry/ChromoGen
Programming language
Python

References

  • Zhang et al. "A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome." Nucleic Acids Research (2023) DOI: 10.1093/nar/gkad436
  • Bintu et al. "Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells." Science (2018) DOI: 10.1126/science.aau1783
  • Rao et al. "A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping." Cell (2014) DOI: 10.1016/j.cell.2014.11.021
  • The ENCODE Project Consortium "An integrated encyclopedia of DNA elements in the human genome." Nature (2012) DOI: 10.1038/nature11247
  • Tan et al. "Three-dimensional genome structures of single diploid human cells." Science (2018) DOI: 10.1126/science.aat5641
  • Schuette et al. "Efficient Hi-C inversion facilitates chromatin folding mechanism discovery and structure prediction." Biophysical Journal (2023) DOI: 10.1016/j.bpj.2023.07.017
  • Church et al. "Modernizing reference genome assemblies." PLOS Biology (2011) DOI: 10.1371/journal.pbio.1001091
  • Kent et al. "The human genome browser at UCSC." Genome Research (2002) 12(6):996-1006
  • Nassar et al. "The UCSC Genome Browser database: 2023 update." Nucleic Acids Research (2023) PMID: 36420891, DOI: 10.1093/nar/gkac1072