Stanford InfoLab Publication Server

Provenance in Data-Oriented Workflows

Ikeda, Robert (2012) Provenance in Data-Oriented Workflows. PhD thesis, Stanford University.




Data-processing tasks are commonly managed using data-oriented workflows, in which input data sets are processed by a graph of transformations to produce output data. In data-oriented workflows, it can be useful to track data provenance (also sometimes called lineage), which describes where data came from and how it has been manipulated and combined. We begin by giving a new general definition of provenance, introducing the notions of correctness, precision, and minimality. We then: (1) Describe a wrapper-based approach for capturing provenance in workflows in which all transformations are either map or reduce functions; (2) Describe a provenance-based approach for selectively refreshing one or more elements in the output data, i.e., computing the latest values of particular output elements based on modified input data; (3) Show how logical provenance, i.e., provenance information stored at the transformation level, can often capture precise provenance relationships in a compact fashion; (4) Describe our prototype system called Panda (for Provenance And Data) that supports refresh in data-oriented workflows, as well as debugging and drill-down using logical provenance. Overall, our work provides a comprehensive foundation, set of algorithms, and prototype system for provenance in data-oriented workflows.
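As a rough illustration of items (1) and (2), the sketch below wraps a map function and a reduce function so that every output element carries the identifiers of the input records it was derived from, and then uses that provenance to recompute a single output element. It is a minimal sketch under stated assumptions, not the thesis's algorithms and not Panda's API: the names map_with_provenance, reduce_with_provenance, refresh, tokenize, and count are hypothetical, and the word-count workflow is chosen only for brevity.

```python
from collections import defaultdict


def map_with_provenance(map_fn, records):
    """Apply map_fn to each (record_id, value) pair and tag every
    output with the id of the input record that produced it."""
    outputs = []
    for rec_id, value in records:
        for out in map_fn(value):
            outputs.append((out, {rec_id}))
    return outputs


def reduce_with_provenance(reduce_fn, keyed):
    """Group ((key, value), sources) pairs by key, reduce each group,
    and union the provenance of everything in the group."""
    groups = defaultdict(list)
    prov = defaultdict(set)
    for (key, value), sources in keyed:
        groups[key].append(value)
        prov[key] |= sources
    return {k: (reduce_fn(k, vals), prov[k]) for k, vals in groups.items()}


def refresh(key, result, current_inputs, map_fn, reduce_fn):
    """Recompute one output element from the latest versions of the
    inputs in its recorded provenance. Simplifying assumption: inputs
    that newly contribute to this key are not detected."""
    sources = result[key][1]
    relevant = [(i, current_inputs[i]) for i in sorted(sources)
                if i in current_inputs]
    mapped = [p for p in map_with_provenance(map_fn, relevant)
              if p[0][0] == key]
    return reduce_with_provenance(reduce_fn, mapped).get(key)


# Hypothetical word-count workflow used only to exercise the wrappers.
def tokenize(line):
    return [(word, 1) for word in line.split()]


def count(key, values):
    return sum(values)


lines = [("line1", "a b a"), ("line2", "b c")]
mapped = map_with_provenance(tokenize, lines)
result = reduce_with_provenance(count, mapped)
```

Here result["b"] records both line1 and line2 as its provenance, so a refresh of "b" re-reads only those two records rather than the whole input. A full treatment would also have to handle inputs that begin contributing to an output after modification, which this sketch deliberately leaves out.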

Item Type: Thesis (PhD)
ID Code: 1062
Deposited By: Robert Ikeda
Deposited On: 07 Dec 2012 00:02
Last Modified: 07 Dec 2012 00:05
