Stanford InfoLab Publication Server

Evaluating Entity Resolution Results (Extended version)

Menestrina, David and Whang, Steven Euijong and Garcia-Molina, Hector (2009) Evaluating Entity Resolution Results (Extended version). Technical Report. Stanford InfoLab.

Warning: There is a more recent version of this item available.



Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise $F_1$, cluster $F_1$) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad hoc fashion, without careful thought as to what defines a good result for the specific application at hand. In this paper, our contributions are twofold. First, we conduct an extensive survey of existing ER measures, showing that they can often conflict with each other by ranking the results of ER algorithms differently. Second, we propose a new distance measure for ER (called "merge distance"), inspired by the edit distance of strings, that uses cluster splits and merges as its basic operations. A significant advantage of merge distance is that the cost functions for splits and merges can be configured to adjust two important parameters: sensitivity to error type and sensitivity to cluster size. This flexibility enables us to clearly understand the characteristics of a defined merge distance measure. Surprisingly, the widely used pairwise $F_1$ measure and a state-of-the-art clustering measure called Variation of Information are both special cases of our merge distance measure. We present an efficient linear-time algorithm that correctly computes the merge distance measure for a large class of cost functions that satisfy reasonable properties.
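As a rough illustration of two measures named in the abstract, the sketch below computes pairwise $F_1$ and Variation of Information for a pair of clusterings, using their standard textbook definitions. It is not taken from the report and does not implement merge distance or the linear-time algorithm; the function names (`same_cluster_pairs`, `pairwise_f1`, `variation_of_information`) and the toy data are assumptions made for this example, and both clusterings are assumed to partition the same set of records.

```python
from collections import Counter
from itertools import combinations
from math import log


def same_cluster_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical (a, b) order.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(result, gold):
    """Harmonic mean of pairwise precision and recall against the gold standard."""
    result_pairs = same_cluster_pairs(result)
    gold_pairs = same_cluster_pairs(gold)
    if not result_pairs or not gold_pairs:
        # Convention chosen for this sketch: perfect only if both are empty.
        return 1.0 if result_pairs == gold_pairs else 0.0
    correct = len(result_pairs & gold_pairs)
    precision = correct / len(result_pairs)
    recall = correct / len(gold_pairs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def variation_of_information(result, gold):
    """VI = H(result | gold) + H(gold | result), in nats."""
    label_r = {rec: i for i, cluster in enumerate(result) for rec in cluster}
    label_g = {rec: j for j, cluster in enumerate(gold) for rec in cluster}
    n = len(label_r)
    # Contingency counts: records falling in result cluster i and gold cluster j.
    joint = Counter((label_r[rec], label_g[rec]) for rec in label_r)
    size_r = Counter(label_r.values())
    size_g = Counter(label_g.values())
    vi = 0.0
    for (i, j), n_ij in joint.items():
        vi -= (n_ij / n) * (log(n_ij / size_r[i]) + log(n_ij / size_g[j]))
    return vi


# Hypothetical toy data: the result wrongly merges "c" into the first entity.
gold = [{"a", "b"}, {"c", "d"}]
result = [{"a", "b", "c"}, {"d"}]

print(pairwise_f1(result, gold))
print(variation_of_information(result, gold))
```

In this toy case the result has three same-cluster pairs, of which one is correct, and the gold standard has two, so pairwise precision is 1/3 and recall is 1/2. Variation of Information is zero exactly when the two clusterings are identical and grows as they diverge.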

Item Type: Techreport (Technical Report)
ID Code: 930
Deposited By: David Menestrina
Deposited On: 17 Jun 2009 17:29
Last Modified: 02 Jul 2010 10:53
