Kawai, Hideki and Garcia-Molina, Hector and Benjelloun, Omar and Menestrina, David and Whang, Euijong and Gong, Heng (2006) P-Swoosh: Parallel Algorithm for Generic Entity Resolution. Technical Report. Stanford.
Abstract
Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicate records that refer to the same real-world entity (the match process) and derives composite information about the entity (the merge process). A merged record can in turn match other records, so the process is recursive. Since ER is typically compute-intensive, it is important to distribute the ER workload across multiple processors. In this paper, we propose a parallel algorithm for ER, P-Swoosh, which uses generic match and merge functions and allows load balancing between processors. Our evaluation results using Yahoo! shopping data demonstrate almost linear scalability from 2 to 15 processors.
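The match/merge recursion described in the abstract can be illustrated with a minimal sequential sketch. This is not P-Swoosh itself (the abstract gives no algorithmic detail, and nothing here is parallel); it only shows how generic match and merge functions interact when a merged record is compared again. The record representation (a frozen set of `attr=value` strings) and the names `match`, `merge`, and `resolve` are assumptions made for illustration.

```python
from typing import Callable, FrozenSet, Iterable, List

# Hypothetical record representation: a set of "attribute=value" strings.
Record = FrozenSet[str]


def match(a: Record, b: Record) -> bool:
    """Illustrative match function: two records match if they share any value."""
    return bool(a & b)


def merge(a: Record, b: Record) -> Record:
    """Illustrative merge function: the composite record is the union of values."""
    return a | b


def resolve(
    records: Iterable[Record],
    match_fn: Callable[[Record, Record], bool],
    merge_fn: Callable[[Record, Record], Record],
) -> List[Record]:
    """Sequential generic ER: merge matching records until none match."""
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        # Look for an already-resolved record that matches the current one.
        partner = next((r for r in resolved if match_fn(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # The merged record goes back into the queue so that it can
            # match further records (the recursive step).
            resolved.remove(partner)
            pending.append(merge_fn(current, partner))
    return resolved


if __name__ == "__main__":
    data = [
        frozenset({"name=J. Doe", "phone=555"}),
        frozenset({"phone=555", "email=jd@x"}),
        frozenset({"email=jd@x", "alias=Johnny"}),
        frozenset({"name=Alice"}),
    ]
    for record in resolve(data, match, merge):
        print(sorted(record))
```

In the sample data, the first and third records share no value and are linked only through a merged record, which is why merged records must be re-examined. A parallel version would additionally have to decide how records and comparisons are divided among processors, which this sketch does not attempt.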
| Item Type: | Techreport (Technical Report) | |
|---|---|---|
| Uncontrolled Keywords: | entity resolution, deduplication, record linkage, object identification, information integration, distributed system, parallel algorithm | |
| Subjects: | Computer Science > Databases and the Web; Computer Science > Distributed Systems; Computer Science > Data Integration and Mediation | |
| Projects: | Miscellaneous | |
| Related URLs: | Project Homepage | http://infolab.stanford.edu/ |
| ID Code: | 784 | |
| Deposited By: | Import Account | |
| Deposited On: | 12 Sep 2006 17:00 | |
| Last Modified: | 18 Dec 2008 14:50 | |

