Stanford InfoLab Publication Server

Data Integration with Dependent Sources

Das Sarma, Anish and Dong, Xin and Halevy, Alon (2008) Data Integration with Dependent Sources. Technical Report. Stanford InfoLab.




Data integration addresses the problem of providing a uniform interface for answering queries over a multitude of data sources. The vast majority of previous work on data integration assumes that the sources are created independently of each other. However, in environments with many sources (e.g., the World-Wide Web), we often see that sources are created by copying all or parts of the content from other sources. In such contexts, we can take advantage of the dependency between sources to answer queries more efficiently and to return first answers more quickly. This paper introduces a formalism for representing dependencies between data sources, and considers a set of basic algorithmic optimization problems for answering queries over dependent sources. These problems include (1) Coverage Problem: What is the number of answer tuples covered by a set of sources? (2) Cost Minimization Problem: What is the minimum cost we must incur to get all answer tuples? (3) Maximum Coverage Problem: Given a bound on the cost, how can we get the maximum possible coverage? (4) Source Ordering Problem: For a set of data sources, what is the best order to query them so as to retrieve answer tuples as fast as possible? We provide algorithms and complexity results for each of them. Since, as we show, most of the problems tend to be intractable in general, we propose approximation algorithms, and identify useful restricted classes of dependencies yielding tractable optimal solutions.

Item Type:Techreport (Technical Report)
ID Code:894
Deposited By:Anish Das Sarma
Deposited On:10 Dec 2008 09:56
Last Modified:08 Dec 2009 03:54

Download statistics

Repository Staff Only: item control page