Published April 22, 2022 | Version: Nov 18, 2020
Dataset Open

PMD hypomethylation human (hg19) neural network scores

  • 1. Hebrew University

Contributors

Data manager:

Description

Global loss of DNA methylation in mammalian genomes occurs cumulatively as a mitotic process during aging and cancer, primarily in Partially Methylated Domains (PMDs). It has been shown that local sequence context (100bp) has a strong effect on the rate of demethylation of individual CpG dinucleotides within PMDs. Here, we train a deep learning model to characterize this sequence dependence further, finding that methylation loss can be predicted from a CpG’s 150bp sequence context alone with an AUC of 0.95. We use re-methylation rates of newly synthesized DNA to show that CpGs with fast-loss sequence context are inefficiently re-methylated. Interestingly, we find that the 10% of CpGs predicted to have the “slowest” rate of loss lose almost no DNA methylation in healthy cell types. These same slow-loss CpGs lose a significant amount of DNA methylation in cancer, suggesting that they could be responsible for deregulation of genes and transposable elements that are associated with DNA hypomethylation in cancer.

This directory contains the Nov. 18, 2020 version of the human (hg19) CpG hypomethylation Neural network scores in a single tab-delimited (bedgraph) file:
multitissue-nn-scores.allCGs.0based.hg19.bedgraph.gz
with the following columns:
1: chromosome (hg19)
2: start coord (hg19, 0-based)
3: end coord (hg19, 0-based)
4: multi-tissue NN score (0-1). Scores close to 0 are classified as slow-loss CpGs; scores close to 1 are classified as fast-loss CpGs.
5: number of CpGs in the 150 bp window (including the central CpG, so the minimum is 1).
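The column layout above can be read with a few lines of Python. The following is a minimal sketch, not the authors' own tooling: the names `CpGScore`, `parse_score_line`, `read_scores`, and `label` are hypothetical, the file is assumed to have no header row (comment and `track`/`browser` lines are skipped defensively), and the 0.1 / 0.9 cutoffs in `label` are illustrative assumptions only, since the record does not state a score threshold.

```python
import gzip
from typing import Iterator, NamedTuple


class CpGScore(NamedTuple):
    chrom: str    # column 1: chromosome (hg19)
    start: int    # column 2: start coordinate (0-based)
    end: int      # column 3: end coordinate
    score: float  # column 4: multi-tissue NN score in [0, 1]
    n_cpgs: int   # column 5: CpGs in the 150 bp window (>= 1)


def parse_score_line(line: str) -> CpGScore:
    """Parse one tab-delimited row of the five-column score file."""
    chrom, start, end, score, n_cpgs = line.rstrip("\n").split("\t")[:5]
    return CpGScore(chrom, int(start), int(end), float(score), int(n_cpgs))


def read_scores(path: str) -> Iterator[CpGScore]:
    """Stream records from the gzip-compressed file without loading it whole."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Assumption: skip blank lines and any bedgraph header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            yield parse_score_line(line)


def label(score: float, slow_max: float = 0.1, fast_min: float = 0.9) -> str:
    """Bin a score; the default cutoffs are illustrative, not from the dataset."""
    if score <= slow_max:
        return "slow-loss"
    if score >= fast_min:
        return "fast-loss"
    return "intermediate"
```

For example, `for rec in read_scores("multitissue-nn-scores.allCGs.0based.hg19.bedgraph.gz"): ...` iterates over the whole file one CpG at a time, which keeps memory use low for a genome-wide table.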

 

The full version of the NN scores, with additional details, is in the file zhou-bian.allCGs.1based.hg19.tsv.gz

Each row is a CpG and provides (1) the chromosome, (2) the coordinate of the corresponding C on the forward (Watson) strand of the reference genome, in one-based coordinates, (3) the neural network score, (4) the number of CpGs within the 150 bp sequence centered on this CpG, including the center CpG, (5) whether the CpG is within a CpG island (0, no; 1, yes), and (6) whether the CpG is within the ENCODE blacklist (0, no; 1, yes).
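Because this file uses one-based coordinates while the bedgraph file is 0-based, joining the two requires a coordinate shift. The sketch below parses one row with the fields listed above and converts the one-based C position into a 0-based, half-open interval. The names `CpGRecord`, `parse_full_row`, and `to_zero_based_interval` are hypothetical, and the default `width=2` (covering both bases of the CG dinucleotide) is an assumption: the record does not say whether the bedgraph intervals span one base or two, so this should be checked against the actual file.

```python
from typing import NamedTuple, Tuple


class CpGRecord(NamedTuple):
    chrom: str
    pos: int            # one-based coordinate of the C on the forward strand
    score: float        # neural network score
    n_cpgs: int         # CpGs in the 150 bp window, including this one
    in_cgi: bool        # within a CpG island
    in_blacklist: bool  # within the ENCODE blacklist


def parse_full_row(line: str) -> CpGRecord:
    """Parse one tab-delimited row of the full (one-based) score table."""
    f = line.rstrip("\n").split("\t")
    return CpGRecord(f[0], int(f[1]), float(f[2]), int(f[3]),
                     f[4] == "1", f[5] == "1")


def to_zero_based_interval(pos_1based: int, width: int = 2) -> Tuple[int, int]:
    """Convert a one-based position to a 0-based, half-open (start, end) pair.

    width=2 covers the C and the following G; width=1 covers the C only.
    """
    start = pos_1based - 1
    return start, start + width
```

With these definitions, `to_zero_based_interval(rec.pos)` yields the `(start, end)` pair to compare against columns 2 and 3 of the bedgraph file.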

Here the CpG islands are the union of the Irizarry (Irizarry et al. 2009, Nat Genet), Takai-Jones (Takai and Jones 2002, PNAS), and Gardiner-Garden (Gardiner-Garden and Frommer 1987, J Mol Biol) CGI sets. The blacklist was downloaded from https://github.com/Boyle-Lab/Blacklist/tree/master/lists.

Additional files are included here:
zhou_pmds.0based.hg19.bed.gz: Input PMD CpGs from the Zhou (multi-tissue) dataset
bian_pmds.crc01.0based.hg19.bed.gz: Input PMD CpGs from the Bian (intra-tumor) dataset
zhou_bian_train_test_data.tar.gz: All training and test CpGs, including labels and sequence windows.

 

 

Files (2.2 GB)

Checksum Size
md5:1f7b977f3a003fdc58e2f4b34f56107a 32.5 MB
md5:d7244cc8c4f888437c108748d34b9386 19.2 MB
md5:e1a0953b4937fde7c62bc2d1ee0fe0c1 16.1 MB
md5:308d38ec9d5be6351d2faad3afa02779 14.9 MB
md5:d6900d8a706295fa8f368fdbdbbc0c47 432.4 MB
md5:6fd670dbeb70a54940817cd4919e583b 694.0 MB
md5:d67cffd08da657b3b259095a54979389 911.9 MB
md5:648d6689cdbd8eb76b2c6d8774b52fb3 38.9 MB
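
A downloaded file can be checked against the md5 values listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_chunks`, `md5_of_file`, and `matches_listed_md5` are hypothetical. The listing here does not show which checksum belongs to which file name, so the expected value has to be taken from the record page for the specific file.

```python
import hashlib
from typing import Iterable, Iterator


def md5_of_chunks(chunks: Iterable[bytes]) -> str:
    """Return the hex MD5 digest of a sequence of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def _read_in_chunks(path: str, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield a file's contents in 1 MiB pieces so large files fit in memory."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def md5_of_file(path: str) -> str:
    """Compute the hex MD5 digest of a file on disk."""
    return md5_of_chunks(_read_in_chunks(path))


def normalize_md5(listed: str) -> str:
    """Strip an optional 'md5:' prefix and lowercase the hex digest."""
    listed = listed.strip().lower()
    return listed[4:] if listed.startswith("md5:") else listed


def matches_listed_md5(path: str, listed: str) -> bool:
    """Compare a local file's digest with a checksum as shown in the listing."""
    return md5_of_file(path) == normalize_md5(listed)
```

Usage would look like `matches_listed_md5(local_path, "md5:...")`, where `local_path` is the downloaded file and the second argument is the checksum shown for that same file.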

Additional details

Related works