Kawai, Hideki and Garcia-Molina, Hector and Benjelloun, Omar and Larson, Tait and Menestrina, David and Thavisomboon, Suttipong (2006) Bufoosh: Buffering Algorithms for Generic Entity Resolution. Technical Report. Stanford.
Abstract
Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicate records that refer to the same real-world entity (the match process) and derives composite information about that entity (the merge process). Even though the cost of the match process is high, disk I/O is likely to dominate the cost when memory is limited or the record set is very large. In this paper, we propose Bufoosh, a set of buffering algorithms for ER based on lazy disk updates and locality-aware match scheduling. Our evaluation on Yahoo! shopping data shows that the algorithms can reduce disk I/O dramatically.
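The match and merge steps described in the abstract can be illustrated with a minimal in-memory sketch. This is not the Bufoosh algorithm from the report: it does no buffering or disk I/O, and the record layout, the `match` rule (shared email value), the `merge` rule (attribute-wise union), and the `resolve` function name are all hypothetical choices made only for illustration.

```python
from typing import Dict, FrozenSet, List

# Assumed record layout: attribute name -> set of observed values.
Record = Dict[str, FrozenSet[str]]


def match(a: Record, b: Record) -> bool:
    """Hypothetical match rule: two records match if they share an email value."""
    return bool(a.get("email", frozenset()) & b.get("email", frozenset()))


def merge(a: Record, b: Record) -> Record:
    """Hypothetical merge rule: union the values of every attribute."""
    keys = set(a) | set(b)
    return {k: a.get(k, frozenset()) | b.get(k, frozenset()) for k in keys}


def resolve(records: List[Record]) -> List[Record]:
    """Repeatedly match and merge until no two remaining records match."""
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        # Look for an already-resolved record that matches the current one.
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # The merged record may match others, so it is re-examined.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


if __name__ == "__main__":
    sample = [
        {"name": frozenset({"A. Smith"}), "email": frozenset({"a@x.com"})},
        {"name": frozenset({"Alice Smith"}),
         "email": frozenset({"a@x.com", "alice@y.org"})},
        {"name": frozenset({"Alice S."}), "email": frozenset({"alice@y.org"})},
        {"name": frozenset({"Bob"}), "email": frozenset({"bob@z.net"})},
    ]
    for record in resolve(sample):
        print(sorted(record["name"]), sorted(record["email"]))
```

Every comparison in this sketch touches records held in memory. When the record set does not fit in memory, each comparison or merge may instead require a disk read or write, which is the cost the abstract says the buffering algorithms aim to reduce.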
Item Type: | Techreport (Technical Report) | |
---|---|---|
Uncontrolled Keywords: | entity resolution, record linkage, deduplication, object identification, buffering algorithm | |
Subjects: | Computer Science > Active Databases; Computer Science > Databases and the Web; Miscellaneous | |
Projects: | Miscellaneous | |
Related URLs: | Project Homepage | http://infolab.stanford.edu/ |
ID Code: | 780 | |
Deposited By: | Import Account | |
Deposited On: | 12 Sep 2006 17:00 | |
Last Modified: | 18 Dec 2008 14:48 | |