Published June 27, 2022 | Version V1.0
Dataset Open

maxATAC Data

Description

Abstract

Transcription factors (TFs) read the genome, fundamentally connecting DNA sequence to gene expression across diverse cell types. Determining how, where, and when TFs bind chromatin will advance our understanding of gene regulatory networks and cellular behavior. The 2017 ENCODE-DREAM in vivo Transcription-Factor Binding Site (TFBS) Prediction Challenge highlighted the value of chromatin accessibility data to TFBS prediction, establishing state-of-the-art methods for TFBS prediction from DNase-seq. However, the more recent Assay-for-Transposase-Accessible-Chromatin (ATAC)-seq has surpassed DNase-seq as the most widely used chromatin accessibility profiling method. Furthermore, ATAC-seq is the only such technique available at single-cell resolution from standard commercial platforms. While ATAC-seq datasets grow exponentially, suboptimal motif scanning is unfortunately the most common method for TFBS prediction from ATAC-seq. To enable community access to state-of-the-art TFBS prediction from ATAC-seq, we (1) curated an extensive benchmark dataset (127 TFs) for ATAC-seq model training and (2) built “maxATAC”, a suite of user-friendly, deep neural network models for genome-wide TFBS prediction from ATAC-seq in any cell type. With models available for 127 human TFs, maxATAC is the first collection of high-performance TFBS prediction models for ATAC-seq.

Repository Overview

This repository contains all of the processed training data used by maxATAC for model training and benchmarking. All directories are provided as .tar.gz archives.

In this repository you will find the directories:

ATAC_Peaks: ATAC-seq peak files called with MACS2. These files were generated for the hg38 reference genome and have the extension .bed.gz.

ATAC_Signal_File: ATAC-seq signal files. Each file has been read-depth normalized and min-max normalized to the range [0, 1], using the 99th-percentile value as the maximum. These files are presented as bigwig files with a .bw extension.

ChIP_Binding_File: ChIP-seq signal tracks. These files are binary signal tracks, in bigwig format, of the peaks found in the ChIP_Peaks directory.

ChIP_Peaks: ChIP-seq peak files. This directory contains the ENCODE IDR peak sets and peak sets created in the maxATAC publication. These files have the extension .bed.gz.

Full_Models: Current set of 127 maxATAC TF models. This directory includes the information for thresholding and the .h5 model files.

hg38: This directory includes the hg38 reference genome information that was used in this publication. 

Prediction_and_Benchmarking: This directory contains all of the predictions for chr1 used for benchmarking in a round-robin training approach. 

Tn5_CutSites: This directory contains the Tn5 cut sites that have been shifted +4 on the (+) strand and -5 on the (-) strand. The cut sites were then slopped 20 bp using bedtools slop. These files are presented as bed files that have been bzipped. Each file represents an individual biological replicate. 

scATAC: This directory includes data used for scATAC-seq based predictions.
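
The min-max normalization described for ATAC_Signal_File can be illustrated with a minimal sketch. It is not the maxATAC implementation: the function names, the linear-interpolation percentile, and the clipping of values above the 99th percentile to 1 are all assumptions made for illustration, and the read-depth normalization step is not shown.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty sequence.

    Hypothetical helper; the percentile definition used by maxATAC may differ.
    """
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower_idx = math.floor(rank)
    upper_idx = math.ceil(rank)
    frac = rank - lower_idx
    return ordered[lower_idx] + (ordered[upper_idx] - ordered[lower_idx]) * frac


def minmax_normalize(signal, upper_q=99.0):
    """Scale a signal to [0, 1], using the upper_q-th percentile as the maximum.

    Assumption: values above the percentile are clipped to 1.0 so that the
    output stays inside [0, 1].
    """
    lower = min(signal)
    upper = percentile(signal, upper_q)
    span = upper - lower
    if span <= 0:
        # Flat signal: nothing to scale, return zeros.
        return [0.0 for _ in signal]
    return [min(1.0, max(0.0, (x - lower) / span)) for x in signal]
```

Using a high percentile rather than the absolute maximum keeps a few extreme values from compressing the rest of the track toward zero.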

 

For additional details please see the maxATAC GitHub Repository and bioRxiv pre-print. 
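
The coordinate arithmetic behind the Tn5_CutSites files can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the publication: reads are assumed to be in 0-based, half-open BED coordinates, the cut site is assumed to be a 1 bp interval at the shifted 5' end of the read, the 20 bp slop is assumed to apply to both sides, and the function names are hypothetical.

```python
def tn5_cut_site(start, end, strand):
    """Return the shifted Tn5 cut site as a 1 bp (start, end) BED interval.

    Assumes 0-based, half-open coordinates: +4 from the read start on the
    (+) strand, -5 from the read end on the (-) strand.
    """
    if strand == "+":
        pos = start + 4
    elif strand == "-":
        pos = end - 5
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    return pos, pos + 1


def slop(start, end, chrom_size, flank=20):
    """Extend an interval by `flank` bp on each side, clipped to the chromosome.

    Mirrors the effect of bedtools slop with an equal extension on both
    sides (an assumption about how the tool was invoked).
    """
    return max(0, start - flank), min(chrom_size, end + flank)


# Example: a (+) strand read spanning 100-150 on a 1000 bp chromosome.
cut_start, cut_end = tn5_cut_site(100, 150, "+")
window = slop(cut_start, cut_end, chrom_size=1000)
```

Under these assumptions each cut site becomes a 41 bp window centered on the shifted position, except near chromosome ends where the window is truncated.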

Files (90.4 GB)

Checksum Size
md5:fd12014f134fcf93e55fc86dd5f37e4e 182.7 MB
md5:812ceab5e34aa4305a07d8aa74488a2e 45.3 GB
md5:9d51b5d5035bd92ab9cac8a45273167b 74.1 MB
md5:0c621d94ff1434042b05238f6d9139df 96.4 MB
md5:bbd0a3b8a5652025f88f0be9a07f452b 229.6 MB
md5:10cede63a8f21f3b57c658993d5614a9 12.4 kB
md5:533d0f68cec0be1ba41d2e6ac07ffe1e 19.8 GB
md5:067e815d2e3ef7ea984e26ad07380848 4.2 GB
md5:0ae4807352bc1dd850cdeb43230976b1 20.6 GB
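
Because several of these archives are tens of gigabytes, it may be worth checking a download against its listed MD5 before extracting it. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the archive path passed in is whatever name the file was saved under locally.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed_checksum.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat regardless of archive size. A call such as matches_listing(local_path, "md5:10cede63a8f21f3b57c658993d5614a9") returns True only when the local file's digest equals the listed one.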

Additional details

Related works

Has part
Dataset: https://github.com/MiraldiLab/maxATAC_data (URL)
Is derived from
Preprint: 10.1101/2022.01.28.478235 (DOI)
Is referenced by
Software: https://github.com/MiraldiLab/maxATAC (URL)