Published August 16, 2017 | Version v1
Journal article Open

Using YesWorkflow hybrid queries to reveal data lineage from data curation activities

  • 1. University of Illinois Urbana-Champaign, Champaign, United States of America
  • 2. Museum of Comparative Zoology, Harvard University, Cambridge, MA, United States of America
  • 3. Museum of Comparative Zoology, Harvard University, Cambridge, United States of America
  • 4. Agriculture and Agri-Food Canada, Ottawa, Canada
  • 5. University of Massachusetts, Boston, Boston, United States of America; Museum of Comparative Zoology, Harvard University, Cambridge, United States of America

Description

The YesWorkflow toolkit (McPhillips et al. 2015a, McPhillips et al. 2015b) was designed to annotate data curation workflows written as conventional scripts (e.g., Python, R, Java), but it can also be used to annotate YAML-based Kurator workflow configuration files. From an annotated file alone, YesWorkflow can render a top-level graphical view of the workflow structure (prospective provenance), including system inputs and outputs, actors, the connections among those actors, and the data expected to pass over those connections.
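As a minimal sketch of what such an annotated script might look like, the Python fragment below marks up a small curation step with YesWorkflow's `@begin`, `@end`, `@in`, and `@out` comment keywords. The block names, data names, and the `country` field are hypothetical and chosen only for illustration; they do not come from the article.

```python
# @begin clean_records
# @in raw_records
# @out cleaned_records

# @begin normalize_country
# @in raw_records
# @out normalized_records
def normalize_country(raw_records):
    """Trim whitespace and title-case the (hypothetical) 'country' field."""
    return [{**r, "country": r["country"].strip().title()} for r in raw_records]
# @end normalize_country

# @begin drop_empty
# @in normalized_records
# @out cleaned_records
def drop_empty(normalized_records):
    """Discard records whose country is empty after normalization."""
    return [r for r in normalized_records if r["country"]]
# @end drop_empty

# @end clean_records


def clean_records(raw_records):
    """Run the two annotated steps in the order the annotations describe."""
    return drop_empty(normalize_country(raw_records))
```

The annotations live entirely in comments, so the script runs unchanged whether or not it is ever analyzed. Here the shared data name `normalized_records` is what ties the output of one block to the input of the next.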

YesWorkflow also supports dynamic analysis of, and reporting on, the results of a workflow run (retrospective provenance) at various levels of granularity (e.g., the actor, script, data, record, file, and function levels), provided that it has been configured at each level. YesWorkflow includes an @Log annotation, which describes the semantic structure of a log message emitted within an actor in the workflow. The annotation allows the log message to be linked to the actor that created it, and parts of the message to be linked to the data passed between actors. After a run of the workflow, YesWorkflow can analyze the log messages and construct a store of facts, which can be queried and reasoned over to make statements about the evolving paths taken by particular data elements through the workflow and about the assertions made about those data elements within the workflow.
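The idea of turning structured log messages into queryable facts can be illustrated with a short sketch. The `{name}` placeholder syntax, the function names, and the fact layout below are assumptions made for this example; they are not the actual @Log syntax or YesWorkflow's implementation.

```python
import re


def template_to_regex(template):
    """Compile a template such as 'record {record_id}: ...' into a regex.

    Assumption: placeholders are written as {name}; this syntax is
    illustrative and is not taken from YesWorkflow.
    """
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)        # literal text between placeholders
        else:
            pattern += f"(?P<{part}>.+?)"     # named group for the placeholder
    return re.compile("^" + pattern + "$")


def extract_facts(actor, template, log_lines):
    """Return one fact (a dict) per log line that matches the template."""
    regex = template_to_regex(template)
    facts = []
    for line in log_lines:
        match = regex.match(line.strip())
        if match:
            facts.append({"actor": actor, **match.groupdict()})
    return facts


if __name__ == "__main__":
    lines = ["record 42: set country to Canada", "unrelated message"]
    for fact in extract_facts("normalize_country",
                              "record {record_id}: set {field} to {value}",
                              lines):
        print(fact)
```

Each fact keeps the name of the actor that emitted the message alongside the values captured from it, which is the link a later query would need in order to relate log content to the workflow structure.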

Provenance, like other metadata, rarely appears actionable or immediately useful to those who are expected to provide it. However, by refactoring runtime observables generated from retrospective provenance and integrating them with context information from prospective provenance analysis into hybrid queries, we show how both elements can yield hybrid visualizations that reveal "the plot" of the whole execution. In this way, a comprehensive workflow graph and a customizable data lineage report, backed by meaningful provenance artifacts, become actionable for a workflow run. Queries run over the facts that YesWorkflow extracts from log messages after a workflow run, combined with the facts extracted from the annotated workflow itself, allow for powerful visualizations of the retrospective provenance of a workflow run and of particular data records within a branching workflow.
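A hybrid query of this kind can be sketched as a join between two sets of facts. The example below uses an in-memory SQLite database as a stand-in for the fact store; the table names, column names, and sample data are hypothetical, and this is not how the article's queries are implemented.

```python
import sqlite3


def lineage_report(channels, log_facts, record_id):
    """Join prospective facts (channels) with retrospective facts (log_facts).

    channels:  (source_actor, data_name, sink_actor) tuples from the annotations
    log_facts: (sequence, actor, record_id, message) tuples from a run's logs
    Both layouts are assumptions made for this sketch.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE channel (src TEXT, data TEXT, dst TEXT)")
    db.execute(
        "CREATE TABLE log_fact (seq INTEGER, actor TEXT, record_id TEXT, message TEXT)"
    )
    db.executemany("INSERT INTO channel VALUES (?, ?, ?)", channels)
    db.executemany("INSERT INTO log_fact VALUES (?, ?, ?, ?)", log_facts)
    rows = db.execute(
        """
        SELECT l.actor, l.message, c.data, c.dst
        FROM log_fact AS l
        JOIN channel AS c ON c.src = l.actor
        WHERE l.record_id = ?
        ORDER BY l.seq
        """,
        (record_id,),
    ).fetchall()
    db.close()
    return rows


if __name__ == "__main__":
    channels = [
        ("normalize_country", "normalized_records", "drop_empty"),
        ("drop_empty", "cleaned_records", "output"),
    ]
    log_facts = [
        (1, "normalize_country", "42", "set country to Canada"),
        (2, "normalize_country", "43", "set country to Peru"),
        (3, "drop_empty", "42", "kept"),
    ]
    for row in lineage_report(channels, log_facts, "42"):
        print(row)
```

Each returned row pairs what an actor asserted about the record (retrospective) with the data name and downstream actor it feeds (prospective). In a branching workflow, an actor with several outgoing channels would yield one row per channel, so a real query would also need a fact recording which branch the record actually took.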

Files

BISS_article_20380.pdf