Stanford InfoLab Publication Server

Generic Entity Resolution with Data Confidences

Menestrina, David and Benjelloun, Omar and Garcia-Molina, Hector (2005) Generic Entity Resolution with Data Confidences. Technical Report. Stanford.




We consider the Entity Resolution (ER) problem (also known as deduplication or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. Our approach to the ER problem is generic, in the sense that the functions for comparing and merging records are viewed as black boxes. In this context, managing numerical confidences along with the data makes the ER problem more challenging to define (e.g., how should confidences of merged records be combined?) and more expensive to compute. In this paper, we propose a sound and flexible model for the ER problem with confidences, along with efficient algorithms to solve it. We validate our algorithms through experiments that show significant performance improvements over naive schemes.
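
To make the "black-box" framing in the abstract concrete, here is a minimal Python sketch of a naive resolution loop that takes the comparison and merge functions as parameters. It is not the report's algorithm. The names `resolve`, `share_value`, and `union_min_conf`, the record layout, and the rule of taking the minimum confidence on merge are all assumptions made for illustration. The sketch also discards the source records after each merge, a simplification that may not be appropriate under the report's confidence model.

```python
from typing import Callable, FrozenSet, List, Tuple

# Hypothetical record layout: (set of attribute values, confidence in [0, 1]).
Record = Tuple[FrozenSet[str], float]


def resolve(
    records: List[Record],
    match: Callable[[Record, Record], bool],
    merge: Callable[[Record, Record], Record],
) -> List[Record]:
    """Naive loop: merge matching pairs until no two remaining records match."""
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        r = pending.pop()
        # Look for an already-resolved record that the black-box match accepts.
        partner = next((s for s in resolved if match(r, s)), None)
        if partner is None:
            resolved.append(r)
        else:
            # Assumption: the merged record replaces both of its sources.
            resolved.remove(partner)
            pending.append(merge(r, partner))
    return resolved


def share_value(a: Record, b: Record) -> bool:
    """Illustrative match function: records match if they share any value."""
    return bool(a[0] & b[0])


def union_min_conf(a: Record, b: Record) -> Record:
    """Illustrative merge function: union the values, keep the lower confidence."""
    return (a[0] | b[0], min(a[1], b[1]))


records = [
    (frozenset({"name:J. Doe", "phone:555-1234"}), 0.9),
    (frozenset({"name:John Doe", "phone:555-1234"}), 0.6),
    (frozenset({"name:A. Smith"}), 0.8),
]
result = resolve(records, share_value, union_min_conf)
```

With these illustrative functions, the two records that share a phone number collapse into one record carrying the lower of their confidences, and the third record is left unchanged. Swapping in a different `merge` shows why the question of how to combine confidences affects the definition of the result.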

Item Type: Techreport (Technical Report)
Uncontrolled Keywords: Entity Resolution, Deduplication, Record Linkage
Subjects: Computer Science > Data Integration and Mediation
Related URLs: Project Homepage
ID Code: 699
Deposited By: Import Account
Deposited On: 28 Nov 2005 16:00
Last Modified: 22 Dec 2008 18:21
