Published November 16, 2017 | Version v1
Journal article Open

Robust cross-platform workflows: how technical and scientific communities collaborate to develop, test and share best practices for data analysis

  • 1. University of Rostock, Institute for Biostatistics and Informatics in Medicine and Ageing Research, Rostock, Germany and Debian Project
  • 2. School of Chemical Engineering, UNSW Sydney, NSW 2052, Australia and Debian Project
  • 3. QvarnLabs, Helsinki, Finland and Debian Project
  • 4. University Center for Information Technology, University of Oslo, Oslo, Norway and Debian Project
  • 5. Harvard School of Public Health, Boston, Massachusetts, USA
  • 6. University Medical Center Utrecht, Utrecht, The Netherlands
  • 7. eScience Lab, School of Computer Science, The University of Manchester, Manchester, UK; Common Workflow Language Project and Apache Software Foundation
  • 8. Max-Planck-Institute for Evolutionary Biology, Plön, Germany
  • 9. Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
  • 10. Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
  • 11. Debian Project, Wernigerode, Germany
  • 12. Common Workflow Language Project and Debian Project

Description

Information integration and workflow technologies for data analysis have always been major fields of investigation in bioinformatics. A range of popular workflow suites is available to support analyses in computational biology. Commercial providers tend to offer prepared applications that run remotely from their clients. However, for most academic environments with local expertise, novel data collection techniques or novel data analysis, it is essential to have the full flexibility of open source tools and open source workflow descriptions.

Workflows in data-driven sciences such as computational biology have grown considerably in complexity. New tools and new releases with additional features arrive at an enormous pace, and new reference data and concepts for quality control keep emerging. A well-abstracted workflow, and its exchange across work groups, has an enormous impact on the efficiency of research and on the further development of the field. High-throughput sequencing adds to the avalanche of data available in the field; efficient computation and, in particular, parallel execution motivate the transition from traditional scripts and Makefiles to workflows.

Here we review the existing software development and distribution model, with a focus on the role of integration testing, and discuss the effect of the Common Workflow Language (CWL) on how distributions of open source scientific software swiftly and reliably provide the tools demanded for the execution of such formally described workflows. We contend that, relieved of the technical differences between execution on local machines, clusters or the cloud, communities also gain the technical means to test workflow-driven interaction across several software packages.

Notes

Author's accepted version; for the published version in Data Science and Engineering, see https://doi.org/10.1007/s41019-017-0050-4

Files

robust-cross-platform-2017-10-30.pdf

Files (265.0 kB)

md5:37e16cac55156cf719f5bf572fb2a850
265.0 kB

Additional details

Related works

Is identical to
10.1007/s41019-017-0050-4 (DOI)

Funding

Ageing with elegans – Validating C. elegans healthspan model for better understanding factors causing health and disease, to develop evidence based prevention, diagnostic, therapeutic and other strategies. 633589
European Commission
BioExcel – Centre of Excellence for Biomolecular Research 675728
European Commission
H2020 – COST at a turning point: A unique framework for pan-European S&T cooperation as clear demonstration of European values 681463
European Commission