Stanford InfoLab Publication Server

Evaluating Entity Resolution Results

Menestrina, David and Whang, Steven Euijong and Garcia-Molina, Hector (2010) Evaluating Entity Resolution Results. In: PVLDB, September 13-17, 2010, Singapore.

This is the latest version of this item.


Abstract

Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise $F_1$, cluster $F_1$) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad hoc fashion, without careful thought as to what defines a good result for the specific application at hand. In this paper, our contributions are twofold. First, we analyze existing ER measures, showing that they can often conflict with each other by ranking the results of ER algorithms differently. Second, we explore a new distance measure for ER (called "generalized merge distance" or $GMD$) inspired by the edit distance of strings, using cluster splits and merges as its basic operations. A significant advantage of $GMD$ is that the cost functions for splits and merges can be configured, enabling us to clearly understand the characteristics of a defined $GMD$ measure. Surprisingly, a state-of-the-art clustering measure called Variation of Information is a special case of our configurable $GMD$ measure, and the widely used pairwise $F_1$ measure can be directly computed using $GMD$. We present an efficient linear-time algorithm that correctly computes the $GMD$ measure for a large class of cost functions that satisfy reasonable properties.
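
The relationship between pairwise $F_1$ and a split/merge distance can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: the function names (pairs, pairwise_f1, split_merge_cost), the toy clusterings, and the cost function f(x, y) = x * y (which counts the record pairs an operation joins or separates) are all assumptions made here for illustration. The sketch also assumes that both clusterings partition the same records and that the cost functions make the order of operations irrelevant, which holds for x * y.

```python
from collections import Counter
from itertools import combinations


def pairs(clustering):
    """Set of unordered record pairs that share a cluster."""
    return {frozenset(p) for c in clustering for p in combinations(sorted(c), 2)}


def pairwise_f1(result, gold):
    """Pairwise F1 of an ER result against a gold standard."""
    rp, gp = pairs(result), pairs(gold)
    tp = len(rp & gp)  # pairs that both clusterings agree on
    if tp == 0:
        return 0.0
    precision = tp / len(rp)
    recall = tp / len(gp)
    return 2 * precision * recall / (precision + recall)


def split_merge_cost(result, gold, f_split, f_merge):
    """Cost of turning `result` into `gold` by splits, then merges.

    Hypothetical helper: each result cluster is split into its overlaps
    with the gold clusters, and the pieces are merged into gold clusters.
    Each record is visited once.
    """
    label = {r: j for j, c in enumerate(gold) for r in c}
    gathered = Counter()  # records of each gold cluster merged so far
    cost = 0
    for cluster in result:
        parts = Counter(label[r] for r in cluster)
        remaining = len(cluster)
        for j, size in parts.items():
            remaining -= size
            if remaining > 0:  # peel this piece off the rest of the cluster
                cost += f_split(size, remaining)
            if gathered[j] > 0:  # join it to the piece already assembled
                cost += f_merge(size, gathered[j])
            gathered[j] += size
    return cost


def pair_count(x, y):
    return x * y


def zero(x, y):
    return 0


# Toy data (hypothetical).
result = [{"a", "b", "c"}, {"d"}, {"e", "f"}]
gold = [{"a", "b"}, {"c", "d"}, {"e", "f"}]

# Splits with cost x*y count result pairs that the gold standard separates;
# merges with cost x*y count gold pairs that the result missed.
wrong = split_merge_cost(result, gold, pair_count, zero)
missed = split_merge_cost(result, gold, zero, pair_count)
precision = (len(pairs(result)) - wrong) / len(pairs(result))
recall = (len(pairs(gold)) - missed) / len(pairs(gold))
f1_from_costs = 2 * precision * recall / (precision + recall)
```

On this toy input, f1_from_costs agrees with pairwise_f1(result, gold), which is one way to see why a pair-counting cost function lets a split/merge distance recover pairwise precision and recall.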

Item Type: Conference or Workshop Item (Paper)
Projects: SERF
ID Code: 975
Deposited By: Steven Whang
Deposited On: 02 Jul 2010 10:53
Last Modified: 08 Jul 2010 00:57
