Verroios, Vasilis and Garcia-Molina, Hector Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing. Technical Report. Stanford InfoLab.
BibTeX | DublinCore | EndNote | HTML |
This is the latest version of this item.
PDF - Draft Version 1319Kb |
Abstract
Given a set of records, entity resolution algorithms find all the records referring to each entity. In this paper, we study the problem of top-k entity resolution: finding all the records referring to the k largest (in terms of records) entities. Top-k entity resolution is driven by many modern applications that operate over just the few most popular entities in a dataset. We propose a novel approach, based on locality-sensitive hashing (LSH), that can very rapidly and accurately process massive datasets. Our key insight is to adaptively decide how much processing each record requires to ascertain if it refers to a top-k entity or not: the less likely a record is to refer to a top-k entity, the less it is processed. The heavily reduced amount of processing for the vast majority of records that do not refer to top-k entities, leads to significant speedups. Our experiments with web images, web articles, and scientific publications show a 2x to 25x speedup compared to the traditional approach for high-dimensional data.
Item Type: | Techreport (Technical Report) |
---|---|
ID Code: | 1157 |
Deposited By: | vasilis verroios |
Deposited On: | 28 Jan 2018 17:21 |
Last Modified: | 28 Jan 2018 17:21 |
Available Versions of this Item
- Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing. (deposited 20 Nov 2016 22:46)
- Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing. (deposited 01 Mar 2017 15:25)
- Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing. (deposited 28 Jan 2018 17:21) [Currently Displayed]
- Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing. (deposited 01 Mar 2017 15:25)
Download statistics
Repository Staff Only: item control page