Poster Open Access
Abstract (accepted for poster and talk at BOSC2018)
The automation of data analysis in the form of scientific workflows is a widely adopted practice in many fields of research. Computationally driven, data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). However, a number of challenges remain for the effective sharing, publication, understandability and reproducibility of such workflows, owing to the incomplete capture of provenance and the dependence on particular technical (software) platforms.
We present CWLProv, an approach for retrospective provenance capture that applies and customizes workflow-centric Research Objects (ROs) using open source, community-driven standards. The ROs are produced as an output of the enactment of a workflow defined in the Common Workflow Language (CWL), using the reference implementation, cwltool.
The approach aggregates and annotates all the resources involved in the scientific investigation, including inputs, outputs, the workflow specification, command line tool specifications and input parameter settings. The resources are linked within the RO to enable re-enactment of an analysis without depending on external resources. The workflow provenance profile captures the retrospective provenance of the workflow enactment and is represented in PROV-N and PROV-JSON, two serializations of the W3C PROV data model.
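To make the idea of a retrospective provenance record more concrete, the following is a minimal sketch of a PROV-JSON-style document for a single workflow enactment, built with the Python standard library only. It is an illustration under stated assumptions and not the CWLProv implementation: the function name `build_prov_document`, the `ex:` prefix, the run identifier and the file names are all hypothetical, and a real profile would carry far richer annotations (agents, timestamps, checksums, per-step activities).

```python
import json


def build_prov_document(run_id, workflow_file, inputs, outputs):
    """Return a PROV-JSON-style dict for one workflow run (illustrative only)."""
    run = f"ex:{run_id}"
    doc = {
        # Hypothetical namespace, used only for this example.
        "prefix": {"ex": "http://example.org/"},
        "entity": {},
        "activity": {run: {"prov:label": f"Enactment of {workflow_file}"}},
        "used": {},
        "wasGeneratedBy": {},
    }
    # The workflow specification and every input are entities used by the run.
    for i, name in enumerate([workflow_file, *inputs]):
        entity = f"ex:{name}"
        doc["entity"][entity] = {"prov:label": name}
        doc["used"][f"_:u{i}"] = {"prov:activity": run, "prov:entity": entity}
    # Every output is an entity generated by the run.
    for i, name in enumerate(outputs):
        entity = f"ex:{name}"
        doc["entity"][entity] = {"prov:label": name}
        doc["wasGeneratedBy"][f"_:g{i}"] = {
            "prov:entity": entity,
            "prov:activity": run,
        }
    return doc


if __name__ == "__main__":
    # Hypothetical file names for a small alignment workflow.
    document = build_prov_document(
        "run1", "workflow.cwl", ["reads.fastq"], ["aligned.bam"]
    )
    print(json.dumps(document, indent=2))
```

The sketch records only which entities a run used and generated; keeping those relations alongside the aggregated files is what would let a later reader trace each output back to the inputs and workflow specification that produced it.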
The workflow-centric RO produced as an output of a CWL workflow enactment is expected to be interoperable, reusable, shareable and portable across different platforms. Our work describes the need and motivation for CWLProv and the lessons learned in applying it to ROs using CWL in the bioinformatics domain. The complete capture of provenance, along with the aggregated resources used in a workflow enactment, will mitigate workflow decay and allow applications of provenance to make experiments transparent, reproducible and authentic.
We believe that the underlying principles of the standards used to implement CWLProv will result in semantically rich, executable workflow objects that any platform supporting CWL and CWLProv will be able to reproduce. We ultimately aim to achieve a solution that is compliant with all four dimensions of the FAIR principles. Currently, CWLProv is implemented using the reference implementation, cwltool. This study can be extended to support provenance capture on other platforms supporting CWL, to demonstrate interoperability of analysis methods.