maxATAC Data
Creators
- 1. University of Cincinnati
- 2. Cincinnati Children's Hospital Medical Center
Description
Abstract
Transcription factors read the genome, fundamentally connecting DNA sequence to gene expression across diverse cell types. Determining how, where, and when TFs bind chromatin will advance our understanding of gene regulatory networks and cellular behavior. The 2017 ENCODE-DREAM in vivo Transcription-Factor Binding Site (TFBS) Prediction Challenge highlighted the value of chromatin accessibility data to TFBS prediction, establishing state-of-the-art methods for TFBS prediction from DNase-seq. However, the more recent Assay-for-Transposase-Accessible-Chromatin (ATAC)-seq has surpassed DNase-seq as the most widely-used chromatin accessibility profiling method. Furthermore, ATAC-seq is the only such technique available at single-cell resolution from standard commercial platforms. While ATAC-seq datasets grow exponentially, suboptimal motif scanning is unfortunately the most common method for TFBS prediction from ATAC-seq. To enable community access to state-of-the-art TFBS prediction from ATAC-seq, we (1) curated an extensive benchmark dataset (127 TFs) for ATAC-seq model training and (2) built “maxATAC”, a suite of user-friendly, deep neural network models for genome-wide TFBS prediction from ATAC-seq in any cell type. With models available for 127 human TFs, maxATAC is the first collection of high-performance TFBS prediction models for ATAC-seq.
Repository Overview
This repository contains all of the processed data used by maxATAC for model training and benchmarking. Each directory is provided as a .tar.gz archive.
In this repository you will find the directories:
- ATAC_Peaks: ATAC-seq peak files called with MACS2. These files are generated for the hg38 reference genome and have the extension .bed.gz.
- ATAC_Signal_File: ATAC-seq signal files. The signal has been read-depth normalized and then min-max normalized to the range 0 to 1, using the 99th percentile value as the maximum. These files are bigwig files with a .bw extension.
- ChIP_Binding_File: ChIP-seq signal tracks. These files are binary signal tracks, in bigwig format, of the peaks found in the ChIP_Peaks directory.
- ChIP_Peaks: ChIP-seq peak files. This directory contains the ENCODE IDR peak sets and the peak sets created in the maxATAC publication. These files have the extension .bed.gz.
- Full_Models: Current set of 127 maxATAC TF models. This directory includes the thresholding information and the .h5 model files.
- hg38: The hg38 reference genome information used in the publication.
- Prediction_and_Benchmarking: All of the chr1 predictions used for benchmarking in a round-robin training approach.
- Tn5_CutSites: Tn5 cut sites that have been shifted +4 on the (+) strand and -5 on the (-) strand. The cut sites were then slopped 20 bp using bedtools slop. These files are bzipped bed files, and each file represents an individual biological replicate.
- scATAC: Data used for scATAC-seq based predictions.
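The min-max normalization described for ATAC_Signal_File can be illustrated with a minimal Python sketch. It is not the maxATAC implementation: the function names, the use of the signal minimum as the lower bound, the linear-interpolation percentile, and the optional `clip` flag are all assumptions made for illustration, and read-depth normalization is assumed to have been applied already.

```python
from typing import List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def minmax_normalize(signal: Sequence[float], q: float = 99.0,
                     clip: bool = False) -> List[float]:
    """Scale a signal so its minimum maps to 0 and its q-th percentile to 1.

    Assumption: values above the percentile stay above 1 unless clip=True.
    """
    low = min(signal)
    high = percentile(signal, q)
    if high == low:
        # Flat signal: nothing to scale.
        return [0.0] * len(signal)
    scaled = [(value - low) / (high - low) for value in signal]
    if clip:
        scaled = [min(1.0, value) for value in scaled]
    return scaled
```

Using a high percentile rather than the absolute maximum keeps a handful of extreme pileups from compressing the rest of the track toward zero.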
For additional details please see the maxATAC GitHub Repository and bioRxiv pre-print.
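The Tn5 cut-site processing described for the Tn5_CutSites directory (shift +4 on the (+) strand, -5 on the (-) strand, then extend by 20 bp) can be sketched in Python as follows. This is a hedged illustration, not the pipeline used to build the files: the function names are hypothetical, reads are assumed to be in 0-based half-open BED coordinates, the cut site is assumed to be represented as a 1 bp interval, and the extension mimics the clamping to chromosome bounds that `bedtools slop` performs.

```python
from typing import Tuple


def shifted_cut_site(start: int, end: int, strand: str) -> int:
    """Return the shifted Tn5 cut position for one aligned read.

    Assumption: (+) reads are shifted +4 from their start,
    (-) reads are shifted -5 from their end.
    """
    if strand == "+":
        return start + 4
    if strand == "-":
        return end - 5
    raise ValueError(f"unknown strand: {strand!r}")


def slopped_cut_site(chrom: str, start: int, end: int, strand: str,
                     chrom_size: int, slop: int = 20) -> Tuple[str, int, int]:
    """Extend the 1 bp cut site by `slop` bp on each side.

    Coordinates are clamped to [0, chrom_size].
    """
    cut = shifted_cut_site(start, end, strand)
    new_start = max(0, cut - slop)
    new_end = min(chrom_size, cut + 1 + slop)
    return chrom, new_start, new_end
```

Under these assumptions, each read contributes a 41 bp window centered on its shifted cut site unless the window runs into a chromosome boundary.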
Files
(90.4 GB)
MD5 checksum | Size |
---|---|
md5:fd12014f134fcf93e55fc86dd5f37e4e | 182.7 MB |
md5:812ceab5e34aa4305a07d8aa74488a2e | 45.3 GB |
md5:9d51b5d5035bd92ab9cac8a45273167b | 74.1 MB |
md5:0c621d94ff1434042b05238f6d9139df | 96.4 MB |
md5:bbd0a3b8a5652025f88f0be9a07f452b | 229.6 MB |
md5:10cede63a8f21f3b57c658993d5614a9 | 12.4 kB |
md5:533d0f68cec0be1ba41d2e6ac07ffe1e | 19.8 GB |
md5:067e815d2e3ef7ea984e26ad07380848 | 4.2 GB |
md5:0ae4807352bc1dd850cdeb43230976b1 | 20.6 GB |
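The md5 values listed above can be used to confirm that a downloaded archive is intact. The following is a minimal sketch using Python's standard `hashlib`. The helper names are hypothetical, and because this listing does not pair checksums with file names, the caller is assumed to know which checksum belongs to which download.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks.

    Chunked reads keep memory use flat for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:fd12...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listed_md5("downloaded_archive.tar.gz", listed_value)` returns `True` only when the download matches; both the path and `listed_value` here are placeholders.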
Additional details
Related works
- Has part
- Dataset: https://github.com/MiraldiLab/maxATAC_data (URL)
- Is derived from
- Preprint: 10.1101/2022.01.28.478235 (DOI)
- Is referenced by
- Software: https://github.com/MiraldiLab/maxATAC (URL)