Benjelloun, Omar and Garcia-Molina, Hector and Gong, Heng and Kawai, Hideki and Larson, Tait E. and Menestrina, David and Thavisomboon, Sutthipong (2007) D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution. Technical Report. Stanford. (Publication Note: Extended version of paper published in ICDCS 2007.)
This is the latest version of this item.
Entity Resolution (ER) matches and merges records that refer to the same real-world entities, and is typically a compute-intensive process due to complex matching functions and high data volumes. We present a family of algorithms, D-Swoosh, for distributing the ER workload across multiple processors. The algorithms use generic match and merge functions, and ensure that new merged records are distributed to processors that may have matching records. We perform a detailed performance evaluation on a testbed of 15 processors, for cases where application knowledge can eliminate some comparisons and where all records must be matched. Our experiments use actual comparison shopping data provided by Yahoo!.
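The abstract refers to generic match and merge functions without showing what they look like. The following is a minimal single-process sketch of that idea, not the D-Swoosh algorithms themselves and not the distribution scheme described in the report. The record model (a frozen set of "attribute=value" strings), the sample match and merge rules, and the names match, merge, and resolve are all hypothetical and chosen only for illustration.

```python
from typing import Callable, FrozenSet, Iterable, List

# Assumption: a record is modeled as a frozen set of "attribute=value" strings.
Record = FrozenSet[str]


def match(a: Record, b: Record) -> bool:
    """Hypothetical match rule: records match if they share any attribute value."""
    return not a.isdisjoint(b)


def merge(a: Record, b: Record) -> Record:
    """Hypothetical merge rule: the merged record is the union of both records."""
    return a | b


def resolve(
    records: Iterable[Record],
    match_fn: Callable[[Record, Record], bool] = match,
    merge_fn: Callable[[Record, Record], Record] = merge,
) -> List[Record]:
    """Repeatedly match and merge until no two remaining records match.

    The match and merge functions are treated as black boxes, which is the
    sense in which the procedure is "generic".
    """
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        # Look for an already-resolved record that matches the current one.
        partner = next((r for r in resolved if match_fn(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # A merged record may match records that neither parent matched,
            # so it goes back into the pending list for further comparison.
            resolved.remove(partner)
            pending.append(merge_fn(current, partner))
    return resolved
```

In a distributed setting, the comparison work would be spread across processors, and each merged record would have to be routed to every processor that might hold a matching record; this sketch does not attempt that.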
|Item Type:||Techreport (Technical Report)|
|Additional Information:||Extended version of paper published in ICDCS 2007.|
|Uncontrolled Keywords:||Entity Resolution, Information Integration, Data Cleaning|
|Subjects:||Computer Science > Distributed Systems; Computer Science > Data Integration and Mediation|
|Related URLs:||Project Homepage||http://infolab.stanford.edu/|
|Deposited By:||Import Account|
|Deposited On:||25 Jun 2007 17:00|
|Last Modified:||10 Dec 2008 16:56|
Available Versions of this Item
- D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution. (deposited 14 Mar 2006 16:00)
- D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution. (deposited 25 Jun 2007 17:00) [Currently Displayed]