Abstract (accepted for poster and talk at BOSC2018)
The automation of data analysis through scientific workflows is now widely adopted across many fields of research. Computationally driven, data-intensive experiments using workflows enable Automation, Scaling, Adaptation and Provenance support (ASAP). However, challenges remain in the effective sharing, publication, understandability and reproducibility of such workflows, owing to incomplete capture of provenance and dependence on particular technical (software) platforms.
We present CWLProv, an approach for retrospective provenance capture that uses open-source, community-driven standards and applies and customizes workflow-centric Research Objects (ROs). The ROs are produced as an output of enacting a workflow defined in the Common Workflow Language (CWL) using the reference implementation, cwltool.
The approach aggregates and annotates all the resources involved in the scientific investigation, including inputs, outputs, the workflow specification, command-line tool specifications and input parameter settings. The resources are linked within the RO to enable re-enactment of an analysis without depending on external resources. The workflow provenance profile is represented in the W3C PROV-N and PROV-JSON formats and captures retrospective provenance of the workflow enactment.
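To illustrate the kind of record involved, the following minimal Python sketch builds a PROV-JSON-style document for a single run: inputs and outputs become entities, the run becomes an activity, and `used` and `wasGeneratedBy` relations link them. It is not the CWLProv implementation; the function names, the content-hash identifiers and the `data` and `run` prefixes are assumptions chosen for illustration only.

```python
import hashlib
import json


def content_id(data: bytes) -> str:
    """Identify a resource by its content hash rather than by a file path (assumed scheme)."""
    return "data:" + hashlib.sha1(data).hexdigest()


def build_prov_json(inputs: dict, outputs: dict, run_id: str = "run:example") -> dict:
    """Build a minimal PROV-JSON-style document for one workflow enactment."""
    doc = {
        # Hypothetical namespace prefixes, used only in this sketch.
        "prefix": {"data": "urn:hash::sha1:", "run": "urn:example:run:"},
        "entity": {},
        "activity": {run_id: {"prov:label": "workflow enactment"}},
        "used": {},
        "wasGeneratedBy": {},
    }
    # Each input is an entity that the run activity used.
    for i, (name, data) in enumerate(sorted(inputs.items())):
        eid = content_id(data)
        doc["entity"].setdefault(eid, {"prov:label": name})
        doc["used"][f"_:u{i}"] = {"prov:activity": run_id, "prov:entity": eid}
    # Each output is an entity that was generated by the run activity.
    for i, (name, data) in enumerate(sorted(outputs.items())):
        eid = content_id(data)
        doc["entity"].setdefault(eid, {"prov:label": name})
        doc["wasGeneratedBy"][f"_:g{i}"] = {"prov:entity": eid, "prov:activity": run_id}
    return doc


if __name__ == "__main__":
    example = build_prov_json({"reads.txt": b"ACGT"}, {"count.txt": b"4"})
    print(json.dumps(example, indent=2, sort_keys=True))
```

Identifying resources by content hash keeps the record independent of where the files were stored, which matches the aim of re-enacting an analysis without external resources. A real provenance profile would also record timestamps, agents and per-step activities.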
The workflow-centric RO produced as an output of a CWL workflow enactment is expected to be interoperable, reusable, shareable and portable across different platforms. Our work describes the need and motivation for CWLProv and the lessons learned in applying it to ROs using CWL in the bioinformatics domain. The complete capture of provenance, together with the aggregated resources used in a workflow enactment, will mitigate workflow decay and allow applications of provenance to make experiments transparent, reproducible and authentic.
We believe that the underlying principles of the standards used to implement CWLProv will result in semantically rich, executable workflow objects that any platform supporting CWL and CWLProv will be able to reproduce. We ultimately aim to achieve a solution that is compliant with all four dimensions of the FAIR principles. Currently, CWLProv is implemented using the reference implementation, cwltool. This study can be extended to support provenance capture on other platforms supporting CWL, to demonstrate interoperability of analysis methods.
FZK is funded by MIRS and MIFRS scholarships. SSR is funded by BioExcel CoE (www.bioexcel.eu), a project funded by the European Union under contract H2020-EINFRA-2015-1-675728. SSR and MRC are members of the leadership team for Common Workflow Language at the Software Freedom Conservancy.