Published June 23, 2018 | Version v1
Dataset Open

BRIDGES singletons annotated with local genomic features

  • 1. University of Michigan

Description

This dataset is a tarball archive of 8,192 tab-delimited files, one per 7-mer sequence motif. Each file contains six singleton indicator columns plus the status or value of 15 genomic features at every site in hg19 that is centered on the 7-mer motif named in the filename or on its reverse complement (e.g., ACGATGC_annotated.txt covers sites at both 5’-ACGATGC-3’ and 5’-GCATCGT-3’ motifs).
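The strand collapsing described above can be illustrated with a short sketch. It only shows how a 7-mer pairs with its reverse complement and why that pairing yields 8,192 groups; the function names are illustrative, and which member of each pair is used in the filename is not stated here, so the sketch makes no assumption about it.

```python
from itertools import product

# Map each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(motif: str) -> str:
    """Return the reverse complement of a DNA motif (upper-case ACGT only)."""
    return motif.translate(COMPLEMENT)[::-1]

def strand_pairs(k: int = 7) -> set:
    """Group all k-mers into unordered {motif, reverse complement} pairs."""
    pairs = set()
    for bases in product("ACGT", repeat=k):
        motif = "".join(bases)
        pairs.add(frozenset((motif, reverse_complement(motif))))
    return pairs

# The example from the description: ACGATGC pairs with GCATCGT.
print(reverse_complement("ACGATGC"))

# No odd-length motif equals its own reverse complement, so the
# 4**7 = 16,384 possible 7-mers collapse into exactly half as many pairs.
print(len(strand_pairs(7)))
```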

Each file contains the following columns:

  • AT_CG [indicator if site carries an A>C or T>G singleton (1) or not (0) in the BRIDGES data]

  • AT_GC [indicator if site carries an A>G or T>C singleton (1) or not (0) in the BRIDGES data]

  • AT_TA [indicator if site carries an A>T or T>A singleton (1) or not (0) in the BRIDGES data]

  • GC_AT [indicator if site carries a G>A or C>T singleton (1) or not (0) in the BRIDGES data]

  • GC_CG [indicator if site carries a G>C or C>G singleton (1) or not (0) in the BRIDGES data]

  • GC_TA [indicator if site carries a G>T or C>A singleton (1) or not (0) in the BRIDGES data]

  • DP [average depth of coverage at site]

  • H3K4me1 [indicator if site is within an H3K4me1 broad peak (1) or not (0)]

  • H3K4me3 [indicator if site is within an H3K4me3 broad peak (1) or not (0)]

  • H3K9ac [indicator if site is within an H3K9ac broad peak (1) or not (0)]

  • H3K9me3 [indicator if site is within an H3K9me3 broad peak (1) or not (0)]

  • H3K27ac [indicator if site is within an H3K27ac broad peak (1) or not (0)]

  • H3K27me3 [indicator if site is within an H3K27me3 broad peak (1) or not (0)]

  • H3K36me3 [indicator if site is within an H3K36me3 broad peak (1) or not (0)]

  • EXON [indicator if site is within an exon (1) or not (0)]

  • CpGI [indicator if site is within a CpG island (1) or not (0)]

  • RR [average recombination rate in the 10kb window centered at the site]

  • LAMIN [indicator if site is within a lamin-associated domain (1) or not (0)]

  • DHS [indicator if site is within a DNase Hypersensitive region (1) or not (0)]

  • TIME [replication timing at the site]

  • GC [average GC content in the 10kb window centered at the site]
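As a rough illustration of working with these columns, the sketch below parses tab-delimited rows into named fields using only the Python standard library. It assumes the columns appear in the order listed above and that the caller knows whether a header row is present; neither is confirmed by this description. The `read_sites` helper and the demo row are hypothetical, and the demo values are synthetic, not taken from the dataset.

```python
import csv
import io

# Column names in the order listed in the description. Whether the real
# files use exactly this order is an assumption.
OUTCOMES = ["AT_CG", "AT_GC", "AT_TA", "GC_AT", "GC_CG", "GC_TA"]
FEATURES = ["DP", "H3K4me1", "H3K4me3", "H3K9ac", "H3K9me3", "H3K27ac",
            "H3K27me3", "H3K36me3", "EXON", "CpGI", "RR", "LAMIN", "DHS",
            "TIME", "GC"]
COLUMNS = OUTCOMES + FEATURES

def read_sites(handle, has_header=False):
    """Yield one dict per site from a tab-delimited text stream."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row if the caller says there is one
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
        yield dict(zip(COLUMNS, map(float, row)))

# Synthetic demo row (NOT real data), one value per column.
values = ["0", "1", "0", "0", "0", "0", "32.5", "0", "0", "0", "1", "0",
          "0", "1", "0", "0", "1.2", "1", "0", "0.4", "0.41"]
demo = io.StringIO("\t".join(values) + "\n")
for site in read_sites(demo):
    print(site["AT_GC"], site["DP"], site["GC"])
```

Streaming row by row, rather than loading a whole file, keeps memory use flat, which may matter given the 28.8 GB archive size.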

Note that the chromosome and position of each site have been removed to protect sample privacy.

Each file is then passed to an R script (available at https://github.com/carjed/smaug-genetics) to estimate the effect of each feature on the relative mutation rate using a logistic regression model (e.g., AT_GC ~ DP + ... + GC). Each of the features used is available from data in the public domain; the provenance of these features is described in the associated paper, and additional scripts for processing the feature data can be found at https://github.com/carjed/smaug-genetics.
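To make the modeling step concrete, here is a minimal pure-Python sketch of fitting a logistic regression by Newton-Raphson. It is not the authors' R script, whose details are in the repository above; the function names are illustrative, and the data are a synthetic 2x2 table with a single binary feature (labeled DHS for illustration only), chosen because the fitted coefficient can then be checked against the closed-form log odds ratio.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_logistic(X, y, n_iter=25):
    """Return [intercept, coef_1, ...] fitted by Newton-Raphson."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, yi in zip(rows, y):
            mu = sigmoid(sum(b * v for b, v in zip(beta, x)))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * x[j]
                for k in range(p):
                    hess[j][k] += w * x[j] * x[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

# Synthetic 2x2 table (NOT BRIDGES data): (DHS, singleton) -> number of sites.
counts = {(0, 0): 810, (0, 1): 90, (1, 0): 70, (1, 1): 30}
X, y = [], []
for (dhs, singleton), n in counts.items():
    X.extend([[float(dhs)]] * n)
    y.extend([singleton] * n)

beta = fit_logistic(X, y)
log_odds_ratio = math.log((30 / 70) / (90 / 810))
print(beta[1], log_odds_ratio)  # the two values should agree closely
```

With several features, as in AT_GC ~ DP + ... + GC, each row of `X` would simply hold one value per feature; a statistics package would normally replace the hand-written solver.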

 

The BRIDGES whole-genome sequencing study is described at https://doi.org/10.1101/108290

Files

Total size: 28.8 GB
md5:26f0ca32ce19f85aa5b681531f0c1c0c
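The listed checksum can be used to confirm that a download completed intact. The sketch below computes an MD5 digest in chunks so that the 28.8 GB archive never has to fit in memory. The helper names are illustrative, and the archive filename is not given on this page, so any path passed to `verify_file` is the reader's own.

```python
import hashlib
import io

# Checksum as listed for the archive on this page.
EXPECTED_MD5 = "26f0ca32ce19f85aa5b681531f0c1c0c"

def md5_of_stream(handle, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in 1 MiB chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def verify_file(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` has the expected MD5 digest."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == expected.lower()

# Self-check on in-memory bytes, so the sketch runs without the archive.
print(md5_of_stream(io.BytesIO(b"hello")))
```

After downloading, calling `verify_file` with the local path of the tarball should return True if the file matches the checksum above.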

Additional details

Related works

Is supplement to
10.1101/108290 (DOI)