Stanford InfoLab Publication Server

Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing

Verroios, Vasilis and Garcia-Molina, Hector Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing. Technical Report. Stanford InfoLab.


This is the latest version of this item.

[img]PDF - Draft Version


Given a set of records, entity resolution algorithms find all the records referring to each entity. In this paper, we study the problem of top-k entity resolution: finding all the records referring to the k largest (in terms of records) entities. Top-k entity resolution is driven by many modern applications that operate over just the few most popular entities in a dataset. We propose a novel approach, based on locality-sensitive hashing (LSH), that can very rapidly and accurately process massive datasets. Our key insight is to adaptively decide how much processing each record requires to ascertain if it refers to a top-k entity or not: the less likely a record is to refer to a top-k entity, the less it is processed. The heavily reduced amount of processing for the vast majority of records that do not refer to top-k entities, leads to significant speedups. Our experiments with web images, web articles, and scientific publications show a 2x to 25x speedup compared to the traditional approach for high-dimensional data.

Item Type:Techreport (Technical Report)
ID Code:1157
Deposited By:vasilis verroios
Deposited On:28 Jan 2018 17:21
Last Modified:28 Jan 2018 17:21

Available Versions of this Item

Download statistics

Repository Staff Only: item control page