Bridging the Gap between Process and Procedural Provenance for Statistical Data

doi:10.5281/zenodo.7405415

Published November 30, 2022 | Version v1

Presentation Open

1. University of Illinois at Urbana-Champaign
2. University of California at Santa Barbara
3. Metadata Technology North America
4. University of Michigan
5. Colectica

We show how two models of provenance can work together to answer basic questions about data provenance, such as “What computed variables were affected by values of variable X?” The W3C PROV data model is a standard for describing the activities and persons that produce digital artifacts. PROV associates processes with their inputs and outputs, but it cannot describe how data are changed within a process: it has no language for program components such as mathematical expressions or joins of data tables. Structured Data Transformation Language (SDTL) provides machine-actionable representations of data transformation commands in the five most widely used statistical analysis applications. SDTL is a procedural language in which commands are executed sequentially, so it describes the inner workings of programs that are black boxes in PROV. However, SDTL is detailed and verbose, and even simple queries can become very complicated when expressed in SDTL. Combining PROV and SDTL allows us to answer questions about data preparation and management at a level of detail not available in PROV alone. Our bridge between PROV and SDTL rests on two pillars: ProvONE, an extension of PROV, and Structured Data Transformation History (SDTH), a simplified view of SDTL.
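As an illustration of the kind of question quoted above, the following minimal sketch treats variable-level derivations as a directed graph and walks it to find every computed variable affected by a given variable. The record layout, the variable names, and the helper functions `build_edges` and `affected_by` are all hypothetical; they are assumptions made for this example and do not reproduce the actual SDTL, SDTH, PROV, or ProvONE schemas.

```python
from collections import defaultdict, deque

# Hypothetical derivation records: (command text, input variables, output variable).
# Illustrative only; this is NOT the real SDTL or SDTH representation.
STEPS = [
    ("compute bmi = weight / height**2", ["weight", "height"], "bmi"),
    ("compute obese = bmi >= 30", ["bmi"], "obese"),
    ("compute age_group = floor(age / 10)", ["age"], "age_group"),
    ("compute risk = obese * age_group", ["obese", "age_group"], "risk"),
]


def build_edges(steps):
    """Map each input variable to the variables computed directly from it."""
    edges = defaultdict(set)
    for _command, inputs, output in steps:
        for name in inputs:
            edges[name].add(output)
    return edges


def affected_by(variable, steps):
    """Return every variable that depends, directly or indirectly, on `variable`."""
    edges = build_edges(steps)
    seen = set()
    queue = deque([variable])
    # Breadth-first walk over the "was derived from" edges.
    while queue:
        current = queue.popleft()
        for derived in edges.get(current, ()):
            if derived not in seen:
                seen.add(derived)
                queue.append(derived)
    return seen


if __name__ == "__main__":
    print(sorted(affected_by("weight", STEPS)))
```

In this toy graph, asking about `weight` reaches `bmi`, then `obese`, then `risk`. A process-level record that lists only the script's input and output files could not distinguish that chain from the unrelated `age_group` branch, which is the gap a variable-level view is meant to close.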

Files (110.9 kB)

Name	Size
Bridging Process and Procedural Provenance_ v3.pptx (md5:bf10636a332e06d8e7d1ce1ded9c6a6d)	110.9 kB

	All versions	This version
Views	56	56
Downloads	12	12
Data volume	1.4 MB	1.4 MB
