Published October 2, 2020 | Version v3
Preprint | Open access

Sustainable data analysis with Snakemake

  • 1. Algorithms for reproducible bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
  • 2. Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
  • 3. EMBL-EBI, Hinxton, UK
  • 4. Broad Institute of MIT and Harvard, Cambridge, USA
  • 5. Stanford University Research Computing Center, Stanford University
  • 6. Biomedical Informatics, Harvard Medical School, Harvard University, Boston, USA
  • 7. Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health (BIH), Center for Digital Health, Berlin, Germany
  • 8. Biozentrum, University of Basel, Switzerland & SIB Swiss Institute of Bioinformatics / ELIXIR Switzerland, Lausanne, Switzerland
  • 9. Microsoft Singapore, Singapore
  • 10. CUBI – Core Unit Bioinformatics, Berlin Institute of Health, Berlin, Germany
  • 11. Genome Informatics, University of Duisburg-Essen
  • 12. Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany

Description

Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the use of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically but also methodologically valid.

Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent, and show how the popular workflow management system Snakemake can be used to fulfill all these needs.

Notes

This work was supported by the Netherlands Organisation for Scientific Research (NWO) (VENI grant 016.Veni.173.076, Johannes Köster), the German Research Foundation (SFB 876, Johannes Köster), and Google LLC (Vanessa Sochat and Johannes Köster).

Files

main.pdf

Files (2.6 MB)

md5:78cc3e24789042dba4cebeb0098312e1 (730.8 kB)
md5:95ffc1b78b98aa751a646b9451e8c5fc (1.7 MB)
md5:d7cd94a38706a6994b935d90e5496e76 (159.1 kB)