There is a newer version of the record available.

Published May 23, 2019 | Version v5
Preprint Open

Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv

  • 1. The University of Melbourne; Common Workflow Language project
  • 2. The University of Manchester; Common Workflow Language project
  • 3. The University of Melbourne
  • 4. The University of Manchester
  • 5. Common Workflow Language project
  • 1. Curoverse; Common Workflow Language
  • 2. EMBL-EBI
  • 3. Harvard School of Public Health
  • 4. RTI International
  • 5. University of California, Santa Cruz

Description

Background: The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). However, there are still several challenges associated with the effective sharing, publication and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms.

Results: Based on best practice recommendations identified from literature on workflow design, sharing and publishing, we define a hierarchical provenance framework to achieve uniformity in the provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realise this framework, we present CWLProv, a standards-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We utilise open source community-driven standards: interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric Research Objects (RO) generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups.
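
As a rough illustration of what "structured provenance representation using the W3C PROV model" can look like for a single workflow run, the sketch below builds a small dictionary loosely following the PROV-JSON layout (entities, one activity, and `used` / `wasGeneratedBy` relations). It is not the CWLProv format itself: the function name `build_run_provenance`, the `ex:` prefix, the example.org namespace and the file names are all hypothetical and chosen only for this example.

```python
import json


def build_run_provenance(run_id, inputs, outputs):
    """Return a PROV-JSON-style dict describing one workflow run.

    Hypothetical helper for illustration only; it does not reproduce
    the files that CWLProv writes into a Research Object.
    """
    activity = f"ex:{run_id}"
    doc = {
        "prefix": {"ex": "https://example.org/"},  # assumed namespace
        "entity": {},
        "activity": {activity: {}},
        "used": {},
        "wasGeneratedBy": {},
    }
    # Each input is an entity that the run (an activity) used.
    for i, name in enumerate(inputs):
        entity = f"ex:{name}"
        doc["entity"][entity] = {}
        doc["used"][f"_:u{i}"] = {
            "prov:activity": activity,
            "prov:entity": entity,
        }
    # Each output is an entity that was generated by the run.
    for i, name in enumerate(outputs):
        entity = f"ex:{name}"
        doc["entity"][entity] = {}
        doc["wasGeneratedBy"][f"_:g{i}"] = {
            "prov:entity": entity,
            "prov:activity": activity,
        }
    return doc


doc = build_run_provenance("run1", ["reads"], ["alignment"])
print(json.dumps(doc, indent=2))
```

The point of the sketch is only that retrospective provenance links each output back to the run and inputs that produced it; a real CWLProv Research Object additionally bundles the CWL workflow definition, the data files and richer metadata.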

Conclusions: The underlying principles of the standards utilised by CWLProv enable semantically-rich and executable Research Objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, re-use the methods for partial re-runs, or reproduce the analysis to validate the published findings.

Notes

Accepted to appear in GigaScience (GIGA-D-18-00483). Revised following review comments.

Files

CWLProv.pdf

Files (6.2 MB)

Name Size
md5:c5e7c25951f819a88094ba2f1c622d72 6.1 MB
md5:aaf4529963efcd0c554bc35c541ce024 135.6 kB

Additional details

Funding

IBISBA 1.0 – Industrial Biotechnology Innovation and Synthetic Biology Accelerator 730976
European Commission
BioExcel – Centre of Excellence for Biomolecular Research 675728
European Commission