This is the Zenodo tarball for the Liu Z and Samee MAH 2023 publication: Structural Underpinnings of Mutation Rate Variations in the Human Genome. The manuscript has been accepted by Nucleic Acids Research and will be released online in the near future. An earlier version was also posted to bioRxiv (https://www.biorxiv.org/content/10.1101/2021.01.15.426837v2). See the bottom of the page for citation information.
Per NAR regulations, the permanent repository is stored on Zenodo. For recent changes and correspondence, please check our Codeberg repository at https://codeberg.org/sameelab/mutprediction-with-shape.
Per NAR regulations, we set up the Zenodo archive as a one-stop shop that allows the reproduction of our entire study. Because of this, you might notice a large, almost excessive number of data files compared to the cleaner, more streamlined git repo on Codeberg.
Most of the data files are the original data that we produced when first running the pipelines. For the most part there is no harm in using them directly, but please do read the rest of the README and cite/acknowledge the data sources accordingly. We removed a few deprecated pieces of intermediate data that 1) are no longer used in the pipeline, 2) have minimal/no contribution to the research, and/or 3) are excessively large in file size.
Because of the way the archive is set up, there is a good chance you will find some data files that don't seem to match anything in particular; most likely we didn't include that analysis in our final manuscript. There is also a small chance that you will encounter a serious error while running the scripts because of missing data. If that happens, please contact us.
Our workflow is documented in the various .ipynb notebooks located in the notebook/ directory. Make sure to download the Python library script and the individual numbered notebooks for the individual steps.
The program runs on Python version 3. The following packages are required:
along with their dependencies.
You might notice that the TFBS analysis requires a new set of command-line tools:
Make sure to have these if you need to run the TFBS analysis.
For the mutation rate data, we have made it available here, but we strongly encourage you to request it from Dr. Benjamin F. Voight first; the data is also available from Dr. Voight's GitHub.
As you might have noticed, we included an input mutation rate data file in our example script directory. We strongly discourage you from using this data directly for other purposes. This input data was generated by one of our in-production pipelines and then re-formatted to match the format of the Aggarwala and Voight data. It is intended to be a toy dataset, and we do not currently have documentation for how to generate it. If you are interested, please stay tuned, as we plan to release our pipeline to the Samee Lab GitHub, or contact us and we will be more than happy to pass the data (as well as the steps to generate it) to you.
The TFBS data are from the Kheradpour and Kellis 2014 paper, and used to be accessible from this webpage. We have retained some intermediate data from this paper in our various data directories; as long as you properly acknowledge the project's authors, you are welcome to use our pre-processed data as you wish. We noticed that the website has been down: if you need access to the original 2014 data but cannot obtain it, we will try our best to help.
For the DNAshape reference table, we have included a 7-mer reference table in the "data_input" directory. We also have a repo named DNAshapeR_reference, which contains scripts for extracting the reference table from the DNAshapeR package. Please make sure to cite the four DNAshapeR papers when using this Excel spreadsheet.
For the DNAshapeR package, please visit Tsu-Pei Chiu's GitHub page for more information.
We have included our Jupyter notebooks as reference documents. We have separately prepared an example pipeline in the "pipeline_example" directory. The "Publication_note.ipynb" document from the archive folder is an older version of our pipeline that used to run everything together.
To run our model using the example pipeline, call:
python main.py input_mutation_file reference_dnashape_file.xlsx
from the example directory. Make sure that the python command refers to Python version 3. The included README file explains what to do in more detail, and the script file is well annotated for you to follow.
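As a minimal sketch of the call above, the following helper checks the interpreter version before launching the example script. The function names (is_python3, build_command, run_example) are hypothetical and are not part of our pipeline; only main.py and its two positional arguments come from the command shown above. The sketch assumes it is run from the example directory.

```python
import subprocess
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given interpreter version is Python 3."""
    return version_info[0] == 3


def build_command(mutation_file, shape_file, script="main.py"):
    """Build the argument list for the documented call.

    sys.executable is the interpreter running this helper, so the child
    process uses the same Python rather than whatever `python` is on PATH.
    """
    return [sys.executable, script, str(mutation_file), str(shape_file)]


def run_example(mutation_file, shape_file):
    """Run main.py from the current directory; raise if it exits non-zero."""
    if not is_python3():
        raise RuntimeError("The example pipeline expects Python 3.")
    subprocess.run(build_command(mutation_file, shape_file), check=True)
```

Calling run_example("input_mutation_file", "reference_dnashape_file.xlsx") would then be equivalent to the command above; both file names are the placeholders from that command and should be replaced with your own paths.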
We have included our TFBS analysis scripts in the tfbs-analysis/ directory. Please read the directory-specific README for more information, and please don't hesitate to reach out to Zian if you need help with anything.
If you are using the input data from Aggarwala and Voight, please make sure to cite:
If you are using the data from Kheradpour and Kellis, please make sure to cite:
If you are using any data pertinent to the DNAshape method, the DNAshapeR package, or our curated DNA shape tables, please make sure to cite all four of the following:
For all other usages pertinent to our work, our manuscript is currently still undergoing final processing. In the meantime, please choose one of the following to cite:
Please contact Md. Abul Hassan Samee (samee@bcm.edu) for questions related to our manuscript.
Please contact Zian Liu (zian.liu@bcm.edu) for questions specifically related to our research. Note that if you are accessing this page after 2023 and you don't hear back from Zian within 2 days, please email Dr. Samee directly, as Zian may have graduated.