Stanford InfoLab Publication Server

Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing

Verroios, Vasilis and Garcia-Molina, Hector Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing. Technical Report. Stanford InfoLab.

BibTeXDublinCoreEndNoteHTML
WarningThere is a more recent version of this item available.

[img]PDF - Draft Version
1004Kb

Abstract

Given a set of records, entity resolution algorithms find all the records referring to each entity. In this paper, we study the problem of top-k entity resolution: finding all the records referring to the k largest (in terms of records) entities. Top-k entity resolution is driven by many modern applications that operate over just the few most popular entities in a dataset. We propose a novel approach, based on locality-sensitive hashing (LSH), that can very rapidly and accurately process massive datasets. Our key insight is to adaptively decide how much processing each record requires to ascertain if it refers to a top-k entity or not: the less likely a record is to refer to a top-k entity, the less it is processed. The heavily reduced amount of processing for the vast majority of records that do not refer to top-k entities, leads to significant speedups. Our experiments with web images, web articles, and scientific publications show a 2x to 25x speedup compared to the traditional approach for high-dimensional data.

Item Type:Techreport (Technical Report)
ID Code:1149
Deposited By:vasilis verroios
Deposited On:20 Nov 2016 22:46
Last Modified:01 Mar 2017 15:25

Available Versions of this Item

Download statistics

Repository Staff Only: item control page