Stanford InfoLab Publication Server

Bufoosh: Buffering Algorithms for Generic Entity Resolution

Kawai, Hideki and Garcia-Molina, Hector and Benjelloun, Omar and Larson, Tait and Menestrina, David and Thavisomboon, Suttipong (2006) Bufoosh: Buffering Algorithms for Generic Entity Resolution. Technical Report. Stanford.

Abstract

Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicate records that refer to the same real-world entity (the match process) and derives composite information about that entity (the merge process). Even though the match process is expensive, disk I/O is likely to dominate the cost when memory is limited or the record set is very large. In this paper, we propose Bufoosh, a set of buffering algorithms for ER based on lazy disk updates and locality-aware match scheduling. Our evaluation on Yahoo! shopping data shows that the algorithms can reduce disk I/O dramatically.
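
To make the match and merge steps described in the abstract concrete, the following is a minimal in-memory sketch of generic entity resolution. It is not Bufoosh itself: it keeps every record in memory and does no buffering, lazy disk update, or match scheduling. The record representation (a set of "attribute=value" strings) and the `match`, `merge`, and `resolve` functions are assumptions made for illustration and are not taken from the report.

```python
from typing import FrozenSet, List

# Hypothetical record type: a set of "attribute=value" strings.
Record = FrozenSet[str]


def match(a: Record, b: Record) -> bool:
    """Assumed match rule: two records match if they share any value."""
    return not a.isdisjoint(b)


def merge(a: Record, b: Record) -> Record:
    """Assumed merge rule: the composite record is the union of values."""
    return a | b


def resolve(records: List[Record]) -> List[Record]:
    """Repeatedly match and merge until no two resolved records match."""
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        # Look for an already resolved record that matches the current one.
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # Replace the partner with the merged record and re-examine it,
            # since the merged record may now match other records.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


records = [
    frozenset({"name=Ann", "phone=1"}),
    frozenset({"phone=1", "email=a@x"}),
    frozenset({"name=Bob"}),
]
result = resolve(records)
```

In this sketch the first two records share a phone value, so they are merged into one composite record, while the third stays separate. Every comparison here touches records that are already in memory; the report is concerned with the case where they are not, and where the order of comparisons and the timing of writes determine how much disk I/O is needed.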

Item Type: Techreport (Technical Report)
Uncontrolled Keywords: entity resolution, record linkage, deduplication, object identification, buffering algorithm
Subjects: Computer Science > Active Databases
Computer Science > Databases and the Web
Miscellaneous
Projects: Miscellaneous
Related URLs: Project Homepage http://infolab.stanford.edu/
ID Code: 780
Deposited By: Import Account
Deposited On: 12 Sep 2006 17:00
Last Modified: 18 Dec 2008 14:48
