Stanford InfoLab Publication Server

Bufoosh: Buffering Algorithms for Generic Entity Resolution

Kawai, Hideki and Garcia-Molina, Hector and Benjelloun, Omar and Larson, Tait and Menestrina, David and Thavisomboon, Suttipong (2006) Bufoosh: Buffering Algorithms for Generic Entity Resolution. Technical Report. Stanford.




Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicate records that refer to the same real-world entity (the match process) and derives composite information about that entity (the merge process). Although the match process is expensive, disk I/O is likely to dominate the cost when memory is limited or the record set is very large. In this paper, we propose Bufoosh, a set of buffering algorithms for ER based on lazy disk updates and locality-aware match scheduling. Our evaluation on Yahoo! shopping data shows that the algorithms can reduce disk I/O dramatically.
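
As a rough illustration of the terms used in the abstract, the sketch below shows a naive match/merge loop and a write-back buffer that defers disk writes. It is not the Bufoosh algorithm, which this record does not spell out, and it does not attempt locality-aware match scheduling. The names `match`, `merge`, `resolve`, and `LazyBuffer`, the shared-attribute match rule, and the in-memory dictionary standing in for a disk are all assumptions made for the example.

```python
from collections import OrderedDict

def match(r1, r2):
    # Assumed match rule: two records match if they share any (attribute, value) pair.
    return bool(r1 & r2)

def merge(r1, r2):
    # Assumed merge rule: the composite record is the union of both records' pairs.
    return r1 | r2

def resolve(records):
    """Naive generic ER: merge matching records until no matches remain."""
    pending = list(records)
    resolved = []
    while pending:
        r = pending.pop()
        partner = next((s for s in resolved if match(r, s)), None)
        if partner is None:
            resolved.append(r)
        else:
            # The merged record may match others, so it goes back to pending.
            resolved.remove(partner)
            pending.append(merge(r, partner))
    return resolved

class LazyBuffer:
    """Write-back buffer: updates reach the simulated disk only on eviction or flush."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # key -> value, all entries are dirty
        self.disk = {}              # stand-in for on-disk storage
        self.disk_writes = 0

    def _write(self, key, value):
        self.disk[key] = value
        self.disk_writes += 1

    def put(self, key, value):
        self.pages[key] = value
        self.pages.move_to_end(key)
        while len(self.pages) > self.capacity:
            # Evict the least recently updated entry and write it out.
            old_key, old_value = self.pages.popitem(last=False)
            self._write(old_key, old_value)

    def flush(self):
        for key, value in self.pages.items():
            self._write(key, value)
        self.pages.clear()
```

With this sketch, updating the same buffered record several times costs a single simulated disk write at flush time rather than one write per update, which is the general intuition behind deferring disk updates.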

Item Type: Techreport (Technical Report)
Uncontrolled Keywords: entity resolution, record linkage, deduplication, object identification, buffering algorithm
Subjects: Computer Science > Active Databases; Computer Science > Databases and the Web
Related URLs: Project Homepage
ID Code: 780
Deposited By: Import Account
Deposited On: 12 Sep 2006 17:00
Last Modified: 18 Dec 2008 14:48
