Whang, Steven and Benjelloun, Omar and Garcia-Molina, Hector (2009) Generic Entity Resolution with Negative Rules. VLDB Journal.
This is the latest version of this item.
Abstract
Entity Resolution (ER), also known as deduplication or merge-purge, is the process of identifying records that refer to the same real-world entity and merging them together. In practice, ER results may contain "inconsistencies," either due to mistakes by the match and merge function writers or changes in the application semantics. To remove the inconsistencies, we introduce "negative rules" that disallow inconsistencies in the ER solution; we call the resulting problem ER-N. A consistent solution is then derived based on the guidance from a domain expert. The inconsistencies can be resolved in several ways, leading to accurate solutions. We formalize ER-N, treating the match, merge, and negative rules as black boxes, which permits expressive and extensible ER-N solutions. We identify important properties for the rules that, if satisfied, enable much more efficient ER-N. We develop and evaluate two algorithms that find an ER-N solution based on guidance from the domain expert: the GNR algorithm, which does not assume the properties, and the ENR algorithm, which exploits them.
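As a rough illustration of the black-box view described in the abstract, the sketch below runs a naive match-and-merge loop and then flags merged records that break a negative rule. It is not the GNR or ENR algorithm from the paper, and it omits the domain expert's guidance entirely. The record representation and the functions `match`, `merge`, `violates`, `resolve`, and `inconsistent` are hypothetical names and rules chosen only for this example.

```python
from itertools import combinations

# Assumption: a record is a frozenset of (attribute, value) pairs.

def match(r, s):
    """Hypothetical match rule: records match if they share a name or phone value."""
    keys = {"name", "phone"}
    return any(pair in s for pair in r if pair[0] in keys)

def merge(r, s):
    """Hypothetical merge rule: union of the attribute-value pairs."""
    return r | s

def violates(r):
    """Hypothetical negative rule on one record: at most one gender value."""
    genders = {value for attr, value in r if attr == "gender"}
    return len(genders) > 1

def resolve(records):
    """Naive ER: merge one matching pair at a time until no pair matches."""
    result = set(records)
    changed = True
    while changed:
        changed = False
        for r, s in combinations(result, 2):
            if match(r, s):
                result -= {r, s}
                result.add(merge(r, s))
                changed = True
                break  # restart the scan on the updated set
    return result

def inconsistent(records):
    """Return the records that the negative rule disallows."""
    return [r for r in records if violates(r)]

r1 = frozenset({("name", "Pat"), ("gender", "F")})
r2 = frozenset({("name", "Pat"), ("phone", "555"), ("gender", "M")})
r3 = frozenset({("name", "Lee"), ("gender", "M")})

resolved = resolve([r1, r2, r3])
flagged = inconsistent(resolved)
```

Here `r1` and `r2` merge because they share a name, and the merged record carries two gender values, so the negative rule flags it. Deciding how to repair such a record (for example, which source records to keep apart) is the part the paper delegates to the domain expert.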
Item Type: | Article | |
---|---|---|
Uncontrolled Keywords: | generic entity resolution, integrity constraint, negative rule, data cleaning | |
Subjects: | Computer Science > Data Integration and Mediation | |
Projects: | SERF | |
Related URLs: | Project Homepage | http://infolab.stanford.edu/serf/ |
ID Code: | 902 | |
Deposited By: | Steven Whang | |
Deposited On: | 17 Jan 2009 00:29 | |
Last Modified: | 22 Apr 2009 22:15 | |
Available Versions of this Item
- Generic Entity Resolution with Negative Rules. (deposited 12 Apr 2007 17:00)
- Generic Entity Resolution with Negative Rules. (deposited 17 Jan 2009 00:29) [Currently Displayed]