Sustainable data analysis with Snakemake
Creators
- 1. Algorithms for reproducible bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
- 2. Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- 3. EMBL-EBI, Hinxton, UK
- 4. Broad Institute of MIT and Harvard, Cambridge, USA
- 5. Stanford University Research Computing Center, Stanford University
- 6. Biomedical Informatics, Harvard Medical School, Harvard University, Boston, USA
- 7. Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health (BIH), Center for Digital Health, Berlin, Germany
- 8. Biozentrum, University of Basel, Switzerland & SIB Swiss Institute of Bioinformatics / ELIXIR Switzerland, Lausanne, Switzerland
- 9. Microsoft Singapore, Singapore
- 10. CUBI – Core Unit Bioinformatics, Berlin Institute of Health, Berlin, Germany
- 11. Genome Informatics, University of Duisburg-Essen
- 12. Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
Description
Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the use of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency.
The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically but also methodologically valid.
Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent, and show how the popular workflow management system Snakemake can be used to fulfill all these needs.