Peter Amstutz
Pau Ruiz Safont
Pjotr Prins
Brad Chapman
Christopher Ball
Lon Blauvelt
Farah Zaib Khan
Stian Soiland-Reyes
Richard O. Sinnott
Andrew Lonie
Carole Goble
Michael R. Crusoe
2018-12-04
<p><strong>Background:</strong> The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable <strong>A</strong>utomation, <strong>S</strong>caling, <strong>A</strong>daption and <strong>P</strong>rovenance support (<strong>ASAP</strong>). However, there are still several challenges associated with the effective sharing, publication and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms.</p>
<p><strong>Results:</strong> Based on best practice recommendations identified from the literature on workflow design, sharing and publishing, we define a <em>hierarchical provenance framework</em> to achieve uniformity in provenance and to support comprehensive and fully re-executable workflows equipped with domain-specific information. To realise this framework, we present <a href="https://w3id.org/cwl/prov/"><strong>CWLProv</strong></a>, a standards-based format for representing any workflow-based computational analysis, producing workflow output artefacts that satisfy the various levels of provenance. We utilise open-source, community-driven standards: interoperable workflow definitions in <a href="https://www.commonwl.org/">Common Workflow Language</a> (CWL), structured provenance representation using the <a href="https://www.w3.org/TR/prov-overview/">W3C PROV</a> model, and resource aggregation and sharing as workflow-centric <a href="http://www.researchobject.org/">Research Objects</a> (RO) generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of <em>CWLProv</em> and an evaluation using real-life genomic workflows developed by independent groups.</p>
<p><strong>Conclusions:</strong> The underlying principles of the standards utilised by <em>CWLProv</em> enable semantically rich and executable Research Objects that capture computational workflows with retrospective provenance, such that any platform supporting CWL will be able to understand the analysis, re-use the methods for partial re-runs, or reproduce the analysis to validate the published findings.</p>
Submitted to GigaScience (GIGA-D-18-00483)
https://doi.org/10.5281/zenodo.1966881
oai:zenodo.org:1966881
Zenodo
https://doi.org/10.1109/BigData.2016.7840618
https://doi.org/10.5281/zenodo.592090
https://doi.org/10.5281/zenodo.51314
https://zenodo.org/record/1304969
https://doi.org/10.17632/xnwncxpw42.1
https://doi.org/10.17632/6wtpgr3kbj.1
https://doi.org/10.17632/97hj93mkfd.3
https://doi.org/10.5281/zenodo.1471376
https://doi.org/10.5281/zenodo.1471585
https://doi.org/10.5281/zenodo.1471589
https://zenodo.org/communities/ro
https://zenodo.org/communities/linkeddata
https://zenodo.org/communities/eu
https://doi.org/10.5281/zenodo.1208477
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Provenance
Common Workflow Language
CWL
Research Object
RO
BagIt
Interoperability
Scientific Workflows
Containers
Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv
info:eu-repo/semantics/preprint