Stanford InfoLab Publication Server

Provenance for Generalized Map and Reduce Workflows

Ikeda, Robert and Park, Hyunjung and Widom, Jennifer. Provenance for Generalized Map and Reduce Workflows. In: CIDR 2011.

We consider a class of workflows, which we call generalized map and reduce workflows (GMRWs), where input data sets are processed by an acyclic graph of map and reduce functions to produce output results. We show how data provenance (also sometimes called lineage) can be captured for map and reduce functions transparently. The captured provenance can then be used to support backward tracing (finding the input subsets that contributed to a given output element) and forward tracing (determining which output elements were derived from a particular input element). We provide formal underpinnings for provenance in GMRWs, and we identify properties that are guaranteed to hold when provenance is applied recursively. We have built a prototype system that supports provenance capture and tracing as an extension to Hadoop. Our system uses a wrapper-based approach, requiring little if any user intervention in most cases, and retaining Hadoop's parallel execution and fault tolerance. Performance numbers from our system are reported.
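
To make the ideas of wrapper-based capture and of backward and forward tracing concrete, here is a minimal in-memory sketch in Python. It is not the authors' Hadoop extension: the names run_map, run_reduce, backward, and forward, the element-ID scheme, and the word-count workflow are all hypothetical and chosen only for illustration. The sketch assumes that each map output derives from the single input element that produced it, and that each reduce output derives from every input element sharing its key.

```python
from collections import defaultdict

def run_map(map_fn, inputs):
    """Wrap a map function: each output's provenance is the one input it came from."""
    outputs, prov = [], {}
    for in_id, value in inputs:
        for j, out in enumerate(map_fn(value)):
            out_id = f"{in_id}.m{j}"          # hypothetical ID scheme
            outputs.append((out_id, out))
            prov[out_id] = {in_id}
    return outputs, prov

def run_reduce(reduce_fn, inputs):
    """Wrap a reduce function: each output's provenance is its whole key group."""
    groups = defaultdict(list)
    for in_id, (key, value) in inputs:
        groups[key].append((in_id, value))
    outputs, prov = [], {}
    for key, members in groups.items():
        out_id = f"r:{key}"                   # hypothetical ID scheme
        result = reduce_fn(key, [v for _, v in members])
        outputs.append((out_id, (key, result)))
        prov[out_id] = {i for i, _ in members}
    return outputs, prov

def backward(out_id, prov_stages):
    """Backward tracing: walk the stages last to first, collecting contributing inputs."""
    frontier = {out_id}
    for prov in reversed(prov_stages):
        frontier = set().union(*(prov[o] for o in frontier))
    return frontier

def forward(in_id, prov_stages):
    """Forward tracing: walk the stages first to last, collecting derived outputs."""
    frontier = {in_id}
    for prov in prov_stages:
        frontier = {o for o, srcs in prov.items() if srcs & frontier}
    return frontier

# Toy two-stage workflow (word count): one map stage followed by one reduce stage.
docs = [("d1", "a b"), ("d2", "b c")]
mapped, map_prov = run_map(lambda line: [(w, 1) for w in line.split()], docs)
reduced, red_prov = run_reduce(lambda key, vals: sum(vals), mapped)
stages = [map_prov, red_prov]

print(sorted(backward("r:b", stages)))  # inputs that contributed to the count for "b"
print(sorted(forward("d1", stages)))    # outputs derived from document d1
```

In this sketch the user-supplied map and reduce functions are left unchanged and all bookkeeping lives in the wrappers, which mirrors the transparency goal stated in the abstract. Tracing through the whole workflow is done by applying the per-stage provenance one stage at a time.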

Item Type: Conference or Workshop Item (Paper)
ID Code: 985
Deposited By: Robert Ikeda
Deposited On: 30 Sep 2010 10:20
Last Modified: 19 Jan 2011 17:14
