Venetis, Petros and Sismanis, Yannis and Reinwald, Berthold (2012) CRSI: A Compact Randomized Similarity Index for Set-Valued Features. In: EDBT, March 26-30, 2012, Berlin, Germany.
|PDF - Published Version|
We propose a similarity index for set-valued features and study algorithms for executing various set similarity queries on it. Such queries are fundamental for many application areas, including data integration and cleaning, data profiling as well as near duplicate document detection. In this paper, we focus on Jaccard similarity and present estimators that work for arbitrary similarity thresholds based on a single similarity index. We show how to build this similarity index a-priori, without knowledge about query similarity thresholds, based on recently proposed synopses for multiset operations. The index is deployed using existing disk-based inverted indexing implementations and our algorithms exploit available techniques, like skip-lists, to further optimize the query performance. The index has provably small space footprints, is orders of magnitude smaller and faster to create/incrementally maintain than exact solutions, and the algorithms provide approximate answers, with an error that is controlled by a user-specified parameter. We prove the error bounds of our algorithms analytically, and, finally, we demonstrate the performance of the algorithms and verify their accuracy experimentally.
|Item Type:||Conference or Workshop Item (Paper)|
|Related URLs:||Author Homepage||http://i.stanford.edu/~venetis/|
|Deposited By:||Petros Venetis|
|Deposited On:||12 Feb 2012 16:21|
|Last Modified:||12 Feb 2012 16:21|
Repository Staff Only: item control page