Stanford InfoLab Publication Server

Managing Uncertain Data

Das Sarma, Anish (2009) Managing Uncertain Data. PhD thesis, Stanford University.




The ubiquity of uncertain data in modern-day applications (such as information extraction, data integration, sensor and RFID networks, and scientific experiments) has resulted in a growing need for techniques to deal with such data. This thesis addresses challenges in managing uncertain data in a principled, usable, and scalable fashion. We identify and explore a fundamental tension between usability and expressiveness in models for representing uncertain data. We propose a space of models for representing uncertain data, place the models in an expressiveness hierarchy, and study how the models relate to each other in terms of closure properties. We also address important problems of uniqueness testing, equivalence checking, minimization, and approximation in our space of models. For a representative model in our space (called \urm), we study database design theory: We provide a sound and complete axiomatization of functional dependencies (FDs) for \urm data, describe lossless decompositions, and give algorithms and complexity results for testing, finding, and inferring FDs. To address the usability-expressiveness tradeoff, we show that by adding lineage (provenance) to the \urm model, we obtain a complete (intuitively, a fully expressive) data model, which we call the Uncertainty-Lineage Database (ULDB) model. We study properties of ULDBs including membership, extraction, and minimization. We develop techniques for query processing over ULDBs and show that lineage can be exploited for efficient confidence computation in ULDBs. Then, we present an extension to ULDBs that allows a seamless incorporation of data modifications and a lightweight versioning capability. Finally, we look at uncertain data management in the context of data integration. Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant up-front effort. We present a completely self-configuring data integration system based on a probabilistic framework. The system produces high-quality query answers with no human intervention.

Item Type: Thesis (PhD)
ID Code: 945
Deposited By: Anish Das Sarma
Deposited On: 02 Nov 2009 15:16
Last Modified: 02 Nov 2009 15:16
