Poster Open Access

CWLProv – Interoperable retrospective provenance capture and its challenges

Farah Zaib Khan; Stian Soiland-Reyes; Richard O. Sinnott; Andrew Lonie; Michael R. Crusoe

Abstract (accepted for poster and talk at BOSC2018)

The automation of data analysis in the form of scientific workflows is now widely adopted in many fields of research. Computationally driven, data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). However, challenges remain in the effective sharing, publication, understandability and reproducibility of such workflows, owing to incomplete capture of provenance and dependence on particular technical (software) platforms.

We present CWLProv, an approach for retrospective provenance capture that applies and customizes workflow-centric Research Objects (ROs) using open-source, community-driven standards. The ROs are produced as an output of enacting a workflow defined in the Common Workflow Language (CWL) with the reference implementation, cwltool.

The approach aggregates and annotates all the resources involved in the scientific investigation, including inputs, outputs, the workflow specification, command line tool specifications and input parameter settings. The resources are linked within the RO to enable re-enactment of an analysis without depending on external resources. The workflow provenance profile is represented in the W3C standardized PROV-N and PROV-JSON formats and captures retrospective provenance of the workflow enactment.
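To make the PROV-JSON representation concrete, the following is a minimal sketch, not CWLProv's actual output. It assumes a single workflow run modelled as one PROV activity, with each input and output file as a PROV entity. The function name `build_prov_document`, the `ex:` prefix, its example.org namespace and the file names are all hypothetical and chosen only for illustration.

```python
import json


def build_prov_document(run_id, inputs, outputs):
    """Return a PROV-JSON-style dict for one hypothetical workflow run.

    Assumption: the run is a single activity; every input is an entity it
    used, and every output is an entity it generated.
    """
    activity = f"ex:{run_id}"
    doc = {
        # Hypothetical namespace, for illustration only.
        "prefix": {"ex": "https://example.org/run/"},
        "entity": {},
        "activity": {activity: {"prov:label": run_id}},
        "used": {},
        "wasGeneratedBy": {},
    }
    for i, name in enumerate(inputs):
        entity = f"ex:{name}"
        doc["entity"][entity] = {"prov:label": name}
        # Relations are keyed by blank-node identifiers in PROV-JSON.
        doc["used"][f"_:u{i}"] = {
            "prov:activity": activity,
            "prov:entity": entity,
        }
    for i, name in enumerate(outputs):
        entity = f"ex:{name}"
        doc["entity"][entity] = {"prov:label": name}
        doc["wasGeneratedBy"][f"_:g{i}"] = {
            "prov:entity": entity,
            "prov:activity": activity,
        }
    return doc


doc = build_prov_document("run1", ["reads.fastq"], ["aligned.bam"])
print(json.dumps(doc, indent=2))
```

A real provenance profile would carry considerably more detail, such as timestamps, agents and per-step activities. The sketch only shows how "used" and "wasGeneratedBy" relations tie data to the enactment that consumed or produced it.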

The workflow-centric RO produced as an output of a CWL workflow enactment is expected to be interoperable, reusable, shareable and portable across different platforms. Our work describes the need and motivation for CWLProv and the lessons learned in applying it to ROs using CWL in the bioinformatics domain. Complete capture of provenance, along with the aggregated resources used in a workflow enactment, will mitigate workflow decay and allow applications of provenance to make experiments transparent, reproducible and authentic.

We believe that the underlying principles of the standards used to implement CWLProv will result in semantically rich, executable workflow objects, such that any platform supporting CWL and CWLProv will be able to reproduce them. We ultimately aim to achieve a solution that is compliant with all four dimensions of the FAIR principles. Currently CWLProv is implemented using the reference implementation, cwltool. This study can be extended to support provenance capture on other platforms supporting CWL, to demonstrate interoperability of analysis methods.

FZK is funded by MIRS and MIFRS scholarships. SSR is funded by BioExcel CoE (www.bioexcel.eu), a project funded by the European Union contract H2020-EINFRA-2015-1-675728. SSR and MRC are members of the leadership team for Common Workflow Language at the Software Freedom Conservancy.

Files (4.5 MB)

2018-06-25-bosc2018-cwlprov.pdf (1.4 MB), md5:506981254680b623416b89e29f7fa16a
2018-06-25-bosc2018-cwlprov.svg (3.1 MB), md5:b55f6d5bd6ee55bde193ecdc79fcb991