Verroios, Vasilis and Garcia-Molina, Hector Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing. Technical Report. Stanford InfoLab.
BibTeX | DublinCore | EndNote | HTML |
There is a more recent version of this item available. |
PDF - Draft Version 1004Kb |
Abstract
Given a set of records, entity resolution algorithms find all the records referring to each entity. In this paper, we study the problem of top-k entity resolution: finding all the records referring to the k largest (in terms of records) entities. Top-k entity resolution is driven by many modern applications that operate over just the few most popular entities in a dataset. We propose a novel approach, based on locality-sensitive hashing (LSH), that can very rapidly and accurately process massive datasets. Our key insight is to adaptively decide how much processing each record requires to ascertain if it refers to a top-k entity or not: the less likely a record is to refer to a top-k entity, the less it is processed. The heavily reduced amount of processing for the vast majority of records that do not refer to top-k entities, leads to significant speedups. Our experiments with web images, web articles, and scientific publications show a 2x to 25x speedup compared to the traditional approach for high-dimensional data.
Item Type: | Techreport (Technical Report) |
---|---|
ID Code: | 1149 |
Deposited By: | vasilis verroios |
Deposited On: | 20 Nov 2016 22:46 |
Last Modified: | 01 Mar 2017 15:25 |
Available Versions of this Item
- Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing. (deposited 20 Nov 2016 22:46) [Currently Displayed]
Download statistics
Repository Staff Only: item control page