Stanford InfoLab Publication Server

D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution

Benjelloun, Omar and Garcia-Molina, Hector and Gong, Heng and Kawai, Hideki and Larson, Tait E. and Menestrina, David and Thavisomboon, Sutthipong (2007) D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution. Technical Report. Stanford. (Publication Note: Extended version of paper published in ICDCS 2007.)





Entity Resolution (ER) matches and merges records that refer to the same real-world entities, and is typically a compute-intensive process due to complex matching functions and high data volumes. We present a family of algorithms, D-Swoosh, for distributing the ER workload across multiple processors. The algorithms use generic match and merge functions, and ensure that new merged records are distributed to processors that may have matching records. We perform a detailed performance evaluation on a testbed of 15 processors, for cases where application knowledge can eliminate some comparisons and where all records must be matched. Our experiments use actual comparison shopping data provided by Yahoo!.
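The "generic match and merge functions" mentioned in the abstract can be illustrated with a minimal single-process sketch. This is not the distributed D-Swoosh family itself, and it is not taken from the report: the record representation, the function names (`resolve`, `share_value`, `union`), and the sample data are all assumptions made for illustration. It only shows the basic idea that ER code can treat matching and merging as black boxes supplied by the application, and that each merged record must be compared again because it may match records its parts did not.

```python
from typing import Callable, FrozenSet, Iterable, List

# Hypothetical record representation: a set of "attribute=value" strings.
Record = FrozenSet[str]


def resolve(
    records: Iterable[Record],
    match: Callable[[Record, Record], bool],
    merge: Callable[[Record, Record], Record],
) -> List[Record]:
    """Compare records pairwise; replace each matching pair with its merge."""
    pending = list(records)        # records still to be compared
    resolved: List[Record] = []    # records with no match among themselves
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # The merged record goes back to pending so it is re-compared.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


# Illustrative (assumed) match and merge functions.
def share_value(a: Record, b: Record) -> bool:
    return bool(a & b)             # match if any attribute=value pair is shared


def union(a: Record, b: Record) -> Record:
    return a | b                   # merge by keeping every value seen


records = [
    frozenset({"name=J. Doe", "phone=555-0100"}),
    frozenset({"name=John Doe", "phone=555-0100"}),
    frozenset({"name=John Doe", "email=jd@example.com"}),
    frozenset({"name=A. Smith"}),
]
result = resolve(records, share_value, union)
```

With these sample inputs the three Doe records collapse into one merged record and the Smith record is left alone. A distributed variant would additionally have to decide which processors receive each newly merged record, which is the problem the report addresses.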

Item Type: Techreport (Technical Report)
Additional Information: Extended version of paper published in ICDCS 2007.
Uncontrolled Keywords: Entity Resolution, Information Integration, Data Cleaning
Subjects: Computer Science > Distributed Systems; Computer Science > Data Integration and Mediation
Related URLs: Project Homepage
ID Code: 856
Deposited By: Import Account
Deposited On: 25 Jun 2007 17:00
Last Modified: 10 Dec 2008 16:56
