Kawai, Hideki and Garcia-Molina, Hector and Benjelloun, Omar and Larson, Tait and Menestrina, David and Thavisomboon, Suttipong (2006) Bufoosh: Buffering Algorithms for Generic Entity Resolution. Technical Report. Stanford.
Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicate records that refer to the same real-world entity (the match process) and derives composite information about that entity (the merge process). Although the match process is expensive, disk I/O is likely to be the dominant cost when memory is limited or the record set is very large. In this paper, we propose Bufoosh, buffering algorithms for ER that are based on lazy disk update and locality-aware match scheduling. Our evaluation results using Yahoo! shopping data show that the algorithms can reduce disk I/O dramatically.
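The idea of a lazy disk update can be illustrated with a small write-back buffer. The sketch below is not the Bufoosh algorithm from the report; it is a minimal illustration under stated assumptions: a plain dictionary stands in for the disk, eviction is least-recently-used, and the names `LazyBuffer`, `get`, `put`, and `flush` are hypothetical. It shows only that a record updated several times while buffered can cost a single disk write. Locality-aware match scheduling is not modeled here.

```python
from collections import OrderedDict


class LazyBuffer:
    """Toy write-back buffer: a record is written to 'disk' only when it is
    evicted or explicitly flushed (an assumption, not the report's design)."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk            # dict standing in for on-disk storage
        self.cache = OrderedDict()  # record id -> (record, dirty flag), LRU order
        self.reads = 0              # simulated disk reads
        self.writes = 0             # simulated disk writes

    def get(self, rid):
        # A buffer hit costs no disk I/O.
        if rid in self.cache:
            self.cache.move_to_end(rid)
            return self.cache[rid][0]
        # A miss costs one disk read.
        self.reads += 1
        record = self.disk[rid]
        self._insert(rid, record, dirty=False)
        return record

    def put(self, rid, record):
        # Lazy update: mark the record dirty instead of writing it now.
        self._insert(rid, record, dirty=True)

    def _insert(self, rid, record, dirty):
        if rid in self.cache:
            # Once dirty, a record stays dirty until it is written back.
            dirty = dirty or self.cache[rid][1]
        self.cache[rid] = (record, dirty)
        self.cache.move_to_end(rid)
        # Evict least-recently-used records, writing back only dirty ones.
        while len(self.cache) > self.capacity:
            old_id, (old_record, old_dirty) = self.cache.popitem(last=False)
            if old_dirty:
                self.disk[old_id] = old_record
                self.writes += 1

    def flush(self):
        # Write back every dirty record that is still buffered.
        for rid, (record, dirty) in list(self.cache.items()):
            if dirty:
                self.disk[rid] = record
                self.writes += 1
                self.cache[rid] = (record, False)
```

In this toy model, calling `put` repeatedly on the same buffered record, as happens when a record is merged several times in a row, changes no disk state until that record is evicted or `flush` is called.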
|Item Type:||Techreport (Technical Report)|
|Uncontrolled Keywords:||entity resolution, record linkage, deduplication, object identification, buffering algorithm|
|Subjects:||Computer Science > Active Databases; Computer Science > Databases and the Web|
|Related URLs:||Project Homepage||http://infolab.stanford.edu/|
|Deposited By:||Import Account|
|Deposited On:||12 Sep 2006 17:00|
|Last Modified:||18 Dec 2008 14:48|