Menestrina, David and Whang, Steven Euijong and Garcia-Molina, Hector (2009) Evaluating Entity Resolution Results (Extended version). Technical Report. Stanford InfoLab.
Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise $F_1$, cluster $F_1$) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad hoc fashion, without careful thought as to what defines a good result for the specific application at hand. In this paper, our contributions are twofold. First, we conduct an extensive survey of existing ER measures, showing that they often conflict with each other by ranking the results of ER algorithms differently. Second, we propose a new distance measure for ER, called "merge distance", which is inspired by the edit distance of strings and uses cluster splits and merges as its basic operations. A significant advantage of merge distance is that the cost functions for splits and merges can be configured to adjust two important parameters: sensitivity to error type and sensitivity to cluster size. This flexibility enables us to clearly understand the characteristics of a defined merge distance measure. Surprisingly, the widely used pairwise $F_1$ measure and a state-of-the-art clustering measure called Variation of Information are both special cases of our merge distance measure. We present an efficient linear-time algorithm that correctly computes the merge distance measure for a large class of cost functions that satisfy reasonable properties.
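For readers unfamiliar with the pairwise $F_1$ measure mentioned in the abstract, the following is a minimal sketch of how it is commonly computed. It is not taken from the report. The function names `matched_pairs` and `pairwise_f1`, the representation of a clustering as a list of sets, and the handling of empty pair sets are all assumptions made for illustration.

```python
from itertools import combinations


def matched_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each unordered pair one canonical (a, b) form.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(result, gold):
    """Pairwise F1 of an ER result against a gold-standard clustering.

    Both arguments are lists of sets of record identifiers
    (an assumed representation, not one prescribed by the report).
    """
    result_pairs = matched_pairs(result)
    gold_pairs = matched_pairs(gold)

    # Assumed convention: two clusterings with no matched pairs agree fully.
    if not result_pairs and not gold_pairs:
        return 1.0

    true_positives = len(result_pairs & gold_pairs)
    # Assumed convention: precision or recall is 0 when undefined.
    precision = true_positives / len(result_pairs) if result_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [{"a", "b", "c"}, {"d"}]
result = [{"a", "b"}, {"c", "d"}]
score = pairwise_f1(result, gold)
```

In this example the gold standard contains three matched pairs and the result contains two, one of which is correct, so precision is 1/2, recall is 1/3, and the score is 0.4. Because the measure counts pairs, a cluster of size n contributes n(n-1)/2 pairs, so errors in large clusters weigh more heavily than errors in small ones.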
Item Type: Techreport (Technical Report)
Deposited By: David Menestrina
Deposited On: 17 Jun 2009 17:29
Last Modified: 02 Jul 2010 10:53