Stanford InfoLab Publication Server

SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

Theobald, Martin and Siddharth, Jonathan and Paepcke, Andreas (2008) SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections. In: 31st annual international ACM SIGIR conference on Research and development in information retrieval 2008 (SIGIR 2008), July 20 - 24, 2008, Singapore, Singapore.





Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor natural-language portions of Web pages over advertisements and navigational bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative "Gold Set" of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
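
The first contribution, pairing a stopword antecedent with a short chain of the content terms that follow it, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stopword list, the antecedent set, the `chain_length` parameter, and the function names `spot_signatures` and `jaccard` are all assumptions made for illustration, and plain set-based Jaccard similarity stands in for whatever similarity measure and pruning strategy the paper actually uses.

```python
import re

# Hypothetical stopword list and antecedent set, chosen for illustration only;
# the abstract does not specify which stopwords serve as antecedents.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "for", "at", "in", "on",
             "that", "and", "off", "from", "into"}
ANTECEDENTS = {"a", "an", "the", "is"}


def spot_signatures(text, chain_length=2):
    """Return the set of 'antecedent:term:term' signatures found in text.

    Each signature is a stopword antecedent followed by the next
    `chain_length` non-stopword tokens (an assumed, simplified scheme).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    signatures = set()
    for i, token in enumerate(tokens):
        if token not in ANTECEDENTS:
            continue
        # Collect the content terms after the antecedent, skipping stopwords.
        chain = [t for t in tokens[i + 1:] if t not in STOPWORDS][:chain_length]
        if len(chain) == chain_length:
            signatures.add(":".join([token] + chain))
    return signatures


def jaccard(a, b):
    """Set-based Jaccard similarity; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two pages sharing article text but differing in navigational boilerplate.
page_1 = "Click here. At a rally to kick off the campaign trail today"
page_2 = "Subscribe now! At a rally to kick off the campaign trail today"
similarity = jaccard(spot_signatures(page_1), spot_signatures(page_2))
```

Because phrases such as "Click here" or "Subscribe now" contain none of the assumed antecedents, they produce no signatures, which is the intuition behind favoring natural-language text over navigational noise.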

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: Stopword Signatures, High-dimensional Similarity Search, Optimal Partitioning, Inverted Index Pruning
Projects: Digital Libraries
Related URLs: Project Homepage
ID Code: 860
Deposited By: Import Account
Deposited On: 24 Apr 2008 17:00
Last Modified: 10 Dec 2008 16:27
