Stanford InfoLab Publication Server

Entity Resolution with Evolving Rules

Whang, Steven Euijong and Garcia-Molina, Hector (2010) Entity Resolution with Evolving Rules. In: PVLDB, September 13-17, 2010, Singapore.


This is the latest version of this item.

PDF - Published Version (276 KB)

Abstract

Entity resolution (ER) identifies database records that refer to the same real-world entity. In practice, ER is not a one-time process, but is constantly improved as the data, schema and application are better understood. We address the problem of keeping the ER result up-to-date when the ER logic "evolves" frequently. A naïve approach that re-runs ER from scratch may not be tolerable for resolving large datasets. This paper investigates when and how we can instead exploit previous "materialized" ER results to save redundant work with evolved logic. We introduce algorithm properties that facilitate evolution, and we propose efficient rule evolution techniques for two clustering ER models: match-based clustering and distance-based clustering. Using real data sets, we illustrate the cost of materializations and the potential gains over the naïve approach.
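
As a rough illustration of reusing a materialized result in match-based clustering, the sketch below is not the paper's algorithm. It assumes a simple transitive-closure clusterer and a new rule that is strictly tighter than the old one, so that every new cluster lies inside some old cluster and only pairs within old clusters need to be compared. The names `cluster`, `evolve_stricter`, `same_name`, and `same_name_and_zip`, as well as the record layout `(id, name, zip)`, are hypothetical.

```python
from itertools import combinations


def cluster(records, match):
    """Transitive-closure clustering: records linked by a chain of matches share a cluster."""
    parent = {r: r for r in records}

    def find(r):
        # Union-find lookup with path halving.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in combinations(records, 2):
        if match(a, b):
            parent[find(a)] = find(b)

    groups = {}
    for r in records:
        groups.setdefault(find(r), []).append(r)
    return [sorted(g) for g in groups.values()]


def evolve_stricter(old_clusters, new_match):
    """Re-resolve under a stricter rule by comparing only pairs inside each old cluster.

    Assumption: new_match(a, b) implies the old rule also matched a and b.
    """
    result = []
    for c in old_clusters:
        result.extend(cluster(c, new_match))
    return result


# Hypothetical rules: the new rule adds a condition, so it is stricter than the old one.
def same_name(a, b):
    return a[1] == b[1]


def same_name_and_zip(a, b):
    return a[1] == b[1] and a[2] == b[2]


records = [(1, "ann", "94305"), (2, "ann", "94305"), (3, "ann", "10001"), (4, "bob", "10001")]
materialized = cluster(records, same_name)                  # earlier ER result
incremental = evolve_stricter(materialized, same_name_and_zip)
from_scratch = cluster(records, same_name_and_zip)          # naive re-run
```

In this toy case the incremental run compares three pairs (those inside the one multi-record old cluster) instead of all six, and yields the same partition as the from-scratch run. The shortcut does not apply when the new rule is looser or unrelated to the old one.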

Item Type: Conference or Workshop Item (Paper)
Projects: SERF
ID Code: 974
Deposited By: Steven Whang
Deposited On: 02 Jul 2010 10:43
Last Modified: 08 Jul 2010 00:54
