Kawai, Hideki and Garcia-Molina, Hector and Benjelloun, Omar and Larson, Tait and Menestrina, David and Thavisomboon, Suttipong (2006) Bufoosh: Buffering Algorithms for Generic Entity Resolution. Technical Report. Stanford.
Abstract
Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicate records that refer to the same real-world entity (the match process) and derives composite information about the entity (the merge process). Even though the cost of the match process is high, disk I/O is likely to be the dominant cost when memory is limited or the record set is very large. In this paper, we propose buffering algorithms for ER, called Bufoosh, based on lazy disk updates and locality-aware match scheduling. Our evaluation results on Yahoo! shopping data show that the algorithms can reduce disk I/O dramatically.
| Item Type: | Techreport (Technical Report) | |
|---|---|---|
| Uncontrolled Keywords: | entity resolution, record linkage, deduplication, object identification, buffering algorithm | |
| Subjects: | Computer Science > Active Databases; Computer Science > Databases and the Web; Miscellaneous | |
| Projects: | Miscellaneous | |
| Related URLs: | Project Homepage | http://infolab.stanford.edu/ |
| ID Code: | 780 | |
| Deposited By: | Import Account | |
| Deposited On: | 12 Sep 2006 17:00 | |
| Last Modified: | 18 Dec 2008 14:48 | |

