Predicting mutation rate variations with DNA shape

Zian Liu

Last updated: 6/9/2023

Introduction

This is the Zenodo tarball for the Liu Z and Samee MAH 2023 publication: Structural Underpinnings of Mutation Rate Variations in the Human Genome. The manuscript has been accepted by Nucleic Acids Research and will be released online in the near future. An earlier version is also available on bioRxiv (https://www.biorxiv.org/content/10.1101/2021.01.15.426837v2). See the bottom of the page for citation information.

Per NAR regulations, the permanent repository is stored on Zenodo. For recent changes and correspondence, please see our Codeberg repository at https://codeberg.org/sameelab/mutprediction-with-shape.

More details on data storage for the Zenodo archive

Per NAR regulations, we set up the Zenodo archive as a one-stop shop for reproducing our entire study. Because of this, you might notice a large, almost excessive number of data files compared to the cleaner, more streamlined git repo on Codeberg.

Most of the data files are the original data that we produced when first running the pipelines. For the most part there is no harm in using them directly, but please read the rest of the README and cite/acknowledge the data sources accordingly. We removed a few deprecated pieces of intermediate data that 1) are no longer used in the pipeline, 2) contribute minimally or not at all to the research, and/or 3) are excessively large in file size.

Because of the way the archive is set up, you may find some data files that don't seem to correspond to anything in particular; most likely we did not include that analysis in our final manuscript. There is also a small chance that you will encounter a serious error while running the scripts because of missing data. If that happens, please contact us.

Workflow

Our workflow is documented in the .ipynb notebooks located in the notebook/ directory. Make sure to download the Python library script along with the numbered notebooks for the individual steps.

Installation

The program runs on Python 3. The following packages are required:

numpy
pandas
joblib
scikit-learn (imported as sklearn)

along with their dependencies.
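
As a quick sanity check before opening the notebooks, the package list above can be verified with a short script. This is only a minimal sketch and is not part of the archive: the helper name missing_packages is hypothetical, and it only checks that each package can be found for import, not which version is installed.

```python
import importlib.util

# Import names of the packages listed above.
# Note: scikit-learn is installed as "scikit-learn" but imported as "sklearn".
REQUIRED = ["numpy", "pandas", "joblib", "sklearn"]


def missing_packages(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

If anything is reported missing, install it with your usual package manager before running the notebooks.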

The TFBS analysis additionally requires the following command-line tools:

bedtools
bedops
R

Make sure these are installed if you need to run the TFBS analysis.
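
The command-line tools can be checked in a similar way. The sketch below assumes the executables are named bedtools, bedops, and R on your system and simply looks them up on the PATH; the helper name missing_tools is hypothetical and not part of the archive.

```python
import shutil

# Executable names assumed for the tools listed above.
TOOLS = ["bedtools", "bedops", "R"]


def missing_tools(names):
    """Return the executables in `names` that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All TFBS command-line tools were found.")
```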

What are the input data?

The mutation rate data are available here, but we strongly encourage you to request them from Dr. Benjamin F. Voight first; the data are also available from Dr. Voight's GitHub.

As you might have noticed, we included an input mutation rate data file in our example script directory. We strongly discourage you from using this data directly for other purposes. This input data was generated by one of our in-production pipelines and then re-formatted to match the format of the Aggarwala and Voight data. It is intended as a toy dataset, and we do not currently have documentation for how to generate it. If you are interested, please stay tuned, as we plan to release our pipeline on the Samee Lab GitHub, or contact us and we will be happy to pass the data (as well as the steps to generate it) on to you.

The TFBS data are from the Kheradpour and Kellis 2014 paper and used to be accessible from this webpage. We have retained some intermediate data from this paper in our various data directories; as long as you properly acknowledge the project's authors, you are welcome to use our pre-processed data as you wish. We noticed that the website has been down: if you need access to the original 2014 data but cannot obtain it, we will try our best to help.

For the DNAshape reference table, we have included a 7-mer reference table in the "data_input" directory. We also have a repo named DNAshapeR_reference, which contains scripts for extracting the reference table from the DNAshapeR package. Please make sure to cite the four DNAshapeR papers when using this Excel spreadsheet.

For the DNAshapeR package, please visit Tsu-Pei Chiu's GitHub page for more information.

How can I run this for myself?

We have included our Jupyter notebooks as reference documents. Separately, we have prepared an example pipeline in the "pipeline_example" directory. The "Publication_note.ipynb" document in the archive folder is an older version of our pipeline that ran everything together.

To run our model using the example pipeline, call:

python main.py input_mutation_file reference_dnashape_file.xlsx

from the example directory. Make sure that "python" refers to Python version 3. The included README file explains what to do in more detail, and the script file is annotated for you to follow.
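
If "python" on your system might point to a different interpreter, one option is to launch the script from Python itself so that the same Python 3 interpreter is reused. The following is a minimal sketch under that assumption; the helper names build_command and run_pipeline are hypothetical, and the two file arguments are the same placeholders used in the command above.

```python
import subprocess
import sys


def build_command(mutation_file, shape_file, script="main.py"):
    """Build the example-pipeline command using the current interpreter."""
    if sys.version_info < (3,):
        raise RuntimeError("Python 3 is required")
    # sys.executable is the interpreter running this script, so the
    # pipeline is launched with the same Python 3 installation.
    return [sys.executable, script, mutation_file, shape_file]


def run_pipeline(mutation_file, shape_file, workdir="pipeline_example"):
    """Run main.py from the example directory and raise if it fails.

    The file paths are interpreted relative to `workdir`.
    """
    cmd = build_command(mutation_file, shape_file)
    return subprocess.run(cmd, cwd=workdir, check=True)
```

For example, run_pipeline("input_mutation_file", "reference_dnashape_file.xlsx") would be equivalent to the command above, with both placeholders replaced by your actual file names.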

Where are your TFBS analyses?

We have included our TFBS analysis scripts in the tfbs-analysis/ directory. Please read the directory-specific README for more information, and please don't hesitate to reach out to Zian if you need help with anything.

Citations

If you are using the input data from Aggarwala and Voight, please make sure to cite:

Aggarwala, V. & Voight, B. F. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nature Genetics 48, 349–355 (2016).

If you are using the data from Kheradpour and Kellis, please make sure to cite:

Kheradpour, P. & Kellis, M. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Research 42(5), 2976–2987 (2014).

If you are using any data pertinent to the DNAshape method, the DNAshapeR package, or our curated DNA shape tables, please make sure to cite all four of the following:

Chiu, T.-P. et al. DNAshapeR: an R/Bioconductor package for DNA shape prediction and feature encoding. Bioinformatics 32, 1211–1213 (2016).
Chiu, T.-P., Rao, S., Mann, R. S., Honig, B. & Rohs, R. Genome-wide prediction of minor-groove electrostatic potential enables biophysical modeling of protein–DNA binding. Nucleic Acids Res 45, 12565–12576 (2017).
Li, J. et al. Expanding the repertoire of DNA shape features for genome-scale studies of transcription factor binding. Nucleic Acids Res 45, 12877–12887 (2017).
Rao, S. et al. Systematic prediction of DNA shape changes due to CpG methylation explains epigenetic effects on protein–DNA binding. Epigenetics & Chromatin 11, 6 (2018).

For all other uses of our work, please note that our manuscript is still undergoing final processing. In the meantime, please cite one of the following:

Liu, Z. & Samee, M. A. H. Mutation rate variations in the human genome are encoded in DNA shape. bioRxiv 2021.01.15.426837. doi: https://doi.org/10.1101/2021.01.15.426837
This data repository that you downloaded from Zenodo

Contact

Please contact Md. Abul Hassan Samee (samee@bcm.edu) for questions related to our manuscript.

Please contact Zian Liu (zian.liu@bcm.edu) for questions specifically related to our research. Note that if you are accessing this page after 2023 and you don't hear back from Zian within 2 days, please email Dr. Samee directly, as Zian may have graduated.