BRIDGES singletons annotated with local genomic features
Description
This dataset consists of a tarball archive containing 8,192 tab-delimited files (one per 7-mer sequence motif). Each file contains the status or value of 15 different genomic features at every site in hg19 that is centered on the 7-mer sequence motif indicated in the filename or on the reverse complement of that motif (e.g., ACGATGC_annotated.txt includes information for sites at 5’-ACGATGC-3’ and sites at 5’-GCATCGT-3’ motifs).
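The pairing of a motif with its reverse complement can be sketched in a few lines of Python. The helper names below are hypothetical, and the rule used to pick which strand names the file (the one with A or C at the central position) is an assumption: it is consistent with the ACGATGC example and with the count of 8,192 files, but it is not stated in this description.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif: str) -> str:
    """Return the reverse complement of a DNA motif (ACGATGC -> GCATCGT)."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def annotated_filename(motif: str) -> str:
    """Guess the file that holds sites for a 7-mer motif.

    ASSUMPTION: the file is named after whichever strand has A or C
    at the central (4th) position. This convention is not confirmed
    by the dataset description.
    """
    motif = motif.upper()
    if len(motif) != 7 or set(motif) - set("ACGT"):
        raise ValueError("expected a 7-mer over the alphabet ACGT")
    canonical = motif if motif[3] in "AC" else reverse_complement(motif)
    return f"{canonical}_annotated.txt"


print(reverse_complement("ACGATGC"))   # GCATCGT
print(annotated_filename("GCATCGT"))   # ACGATGC_annotated.txt

# Collapsing all 4**7 = 16,384 motifs onto one strand leaves 8,192 names.
names = {annotated_filename("".join(p)) for p in product("ACGT", repeat=7)}
print(len(names))                      # 8192
```

Because 7-mers have odd length, no motif is its own reverse complement, so every file covers exactly two motifs.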
Each file contains the following columns:
- AT_CG [indicator if site carries an A>C or T>G singleton (1) or not (0) in the BRIDGES data]
- AT_GC [indicator if site carries an A>G or T>C singleton (1) or not (0) in the BRIDGES data]
- AT_TA [indicator if site carries an A>T or T>A singleton (1) or not (0) in the BRIDGES data]
- GC_AT [indicator if site carries a G>A or C>T singleton (1) or not (0) in the BRIDGES data]
- GC_CG [indicator if site carries a G>C or C>G singleton (1) or not (0) in the BRIDGES data]
- GC_TA [indicator if site carries a G>T or C>A singleton (1) or not (0) in the BRIDGES data]
- DP [average depth of coverage at site]
- H3K4me1 [indicator if site is within a H3K4me1 broad peak (1) or not (0)]
- H3K4me3 [indicator if site is within a H3K4me3 broad peak (1) or not (0)]
- H3K9ac [indicator if site is within a H3K9ac broad peak (1) or not (0)]
- H3K9me3 [indicator if site is within a H3K9me3 broad peak (1) or not (0)]
- H3K27ac [indicator if site is within a H3K27ac broad peak (1) or not (0)]
- H3K27me3 [indicator if site is within a H3K27me3 broad peak (1) or not (0)]
- H3K36me3 [indicator if site is within a H3K36me3 broad peak (1) or not (0)]
- EXON [indicator if site is within an exon (1) or not (0)]
- CpGI [indicator if site is within a CpG island (1) or not (0)]
- RR [average recombination rate in the 10kb window centered at the site]
- LAMIN [indicator if site is within a Lamin-Associated Domain (1) or not (0)]
- DHS [indicator if site is within a DNase Hypersensitive region (1) or not (0)]
- TIME [replication timing at the site]
- GC [average GC content in the 10kb window centered at the site]
Note that the chromosome and position of each site have been removed to protect sample privacy.
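As a rough illustration of how one of these files might be read, the sketch below streams a tab-delimited table row by row and tabulates the singleton fraction by the value of one indicator column. It assumes each file has a header row with the column names listed above and that indicators are written literally as 0 and 1; neither is confirmed here. The function name and the toy rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io


def singleton_rate_by_indicator(handle, outcome, feature):
    """Fraction of sites with outcome == 1, split by a 0/1 feature column.

    ASSUMPTION: the file has a header row and 0/1 indicator values.
    Rows are streamed, so large files never have to fit in memory.
    """
    counts = {"0": [0, 0], "1": [0, 0]}  # feature value -> [singletons, sites]
    for row in csv.DictReader(handle, delimiter="\t"):
        bucket = counts[row[feature]]
        bucket[0] += int(row[outcome])
        bucket[1] += 1
    return {k: (s / n if n else None) for k, (s, n) in counts.items()}


# Synthetic toy rows, for illustration only (three of the columns).
toy = (
    "AT_GC\tDP\tCpGI\n"
    "1\t30.5\t0\n"
    "0\t28.1\t0\n"
    "0\t31.0\t1\n"
    "1\t29.7\t1\n"
    "0\t30.2\t0\n"
)
rates = singleton_rate_by_indicator(io.StringIO(toy), "AT_GC", "CpGI")
print(rates)
```

For a real file, the `io.StringIO` object would be replaced by something like `open("ACGATGC_annotated.txt", newline="")`.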
Each file is then passed to an R script (available at https://github.com/carjed/smaug-genetics) to estimate the effect of each feature on the relative mutation rate using a logistic regression model (e.g., AT_GC ~ DP + ... + GC). Each of the features is derived from data in the public domain; the provenance of these features is described in the associated paper, and additional scripts for processing the feature data can be found at https://github.com/carjed/smaug-genetics.
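To make the modelling step concrete, here is a minimal pure-Python sketch of logistic regression fitted by batch gradient ascent on synthetic data with a single feature. It is not the authors' R script, which should be consulted for the actual analysis; the function name, learning rate, and simulated coefficients are all illustrative assumptions.

```python
import math
import random


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic regression by batch gradient ascent on the mean
    log-likelihood. X holds feature rows without an intercept column;
    the result is [intercept, coef_1, ..., coef_p]."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            resid = yi - 1.0 / (1.0 + math.exp(-z))  # observed minus fitted
            grad[0] += resid
            for j, x in enumerate(xi):
                grad[j + 1] += resid * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Synthetic data only: one standardized feature with a true intercept
# of -1.0 and a true slope of 2.0 (values chosen for illustration).
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(400)]
ys = [
    1 if random.random() < 1.0 / (1.0 + math.exp(-(-1.0 + 2.0 * x))) else 0
    for x in xs
]
beta = fit_logistic([[x] for x in xs], ys)
print(beta)  # [intercept, slope]
```

With real files, each row of `X` would hold the feature columns (DP through GC) and `y` one of the six singleton indicators. An established implementation such as R's `glm` with a binomial family is preferable in practice, since it also reports standard errors.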
The BRIDGES whole-genome sequencing study is described at https://doi.org/10.1101/108290
Files
(28.8 GB)
Name | Size |
---|---|
md5:26f0ca32ce19f85aa5b681531f0c1c0c | 28.8 GB |
Additional details
Related works
- Is supplement to: 10.1101/108290 (DOI)