Menestrina, David and Whang, Steven Euijong and Garcia-Molina, Hector (2010) Evaluating Entity Resolution Results. In: PVLDB, September 13-17, 2010, Singapore.
This is the latest version of this item.
Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise $F_1$, cluster $F_1$) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad hoc fashion, without careful thought as to what defines a good result for the specific application at hand. In this paper, our contributions are twofold. First, we analyze existing ER measures, showing that they can often conflict with each other by ranking the results of ER algorithms differently. Second, we explore a new distance measure for ER (called "generalized merge distance" or $GMD$) inspired by the edit distance of strings, using cluster splits and merges as its basic operations. A significant advantage of $GMD$ is that the cost functions for splits and merges can be configured, enabling us to clearly understand the characteristics of a defined $GMD$ measure. Surprisingly, a state-of-the-art clustering measure called Variation of Information is a special case of our configurable $GMD$ measure, and the widely used pairwise $F_1$ measure can be directly computed using $GMD$. We present an efficient linear-time algorithm that correctly computes the $GMD$ measure for a large class of cost functions that satisfy reasonable properties.
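As a concrete illustration of one measure named in the abstract, the following is a minimal sketch of pairwise $F_1$ using its commonly used definition: precision and recall are taken over the pairs of records that are placed in the same cluster. It is not the paper's $GMD$ algorithm, and it computes pairwise $F_1$ directly rather than through $GMD$. The function names, the representation of an ER result as a list of sets of record identifiers, and the convention of returning 0.0 when either side has no pairs are all assumptions made for this example.

```python
from itertools import combinations


def same_entity_pairs(clusters):
    """Return every unordered pair of records that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical (smaller, larger) order.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(result, gold):
    """Pairwise F1 of an ER result against a gold-standard clustering.

    Assumption for this sketch: if either clustering contains no pairs
    (all singletons), or no pair is shared, the score is 0.0.
    """
    result_pairs = same_entity_pairs(result)
    gold_pairs = same_entity_pairs(gold)
    if not result_pairs or not gold_pairs:
        return 0.0
    true_positives = len(result_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(result_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: the result keeps a and b together but misses c.
result = [{"a", "b"}, {"c"}]
gold = [{"a", "b", "c"}]
score = pairwise_f1(result, gold)
```

In this example the result contains one pair and the gold standard three, so precision is 1 and recall is 1/3. Materializing all pairs is quadratic in cluster size, which is acceptable for illustration but not for large clusters.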
|Item Type:|Conference or Workshop Item (Paper)|
|Deposited By:|Steven Whang|
|Deposited On:|02 Jul 2010 10:53|
|Last Modified:|08 Jul 2010 00:57|
Available Versions of this Item
- Evaluating Entity Resolution Results (Extended version). (deposited 17 Jun 2009 17:29)
- Evaluating Entity Resolution Results. (deposited 02 Jul 2010 10:53) [Currently Displayed]