Stanford InfoLab Publication Server

SpotSigs: Near Duplicate Detection in Web Page Collections

Jonathan, Siddharth and Paepcke, Andreas (2007) SpotSigs: Near Duplicate Detection in Web Page Collections. Masters thesis, Stanford University.

Warning: There is a more recent version of this item available.



Motivated by our work with political scientists, we present an algorithm that detects near-duplicate Web pages. These scientists analyze Web archives of news sites. The archives were collected with crawlers and contain a large number of pages that look very different because the frame around their core content differs, even though the news stories in the pages are nearly identical. The close proximity of unrelated items on the pages makes the detection of content overlap difficult. Our SpotSigs algorithm generates signatures that are spread across each document. The positions of these signatures are determined by the placement of common words, such as 'is' and 'the', in the documents. The method of computing the signatures can be varied. Using hash collisions, the algorithm detects overlap among the signatures of matching content. We study how the different SpotSigs parameters affect precision and recall. We propose and evaluate variants of SpotSigs on a test bed of 2168 Web pages and study the tradeoffs involved. A further motivation was to keep pre-processing requirements low for near-duplicate detection; to this end, we do not remove ads, client-side scripts, or other HTML formatting elements from the documents. On this data set, SpotSigs obtains a precision of over 93% and a recall of over 85% for near-duplicate detection.
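The idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract states only that signatures are anchored at common words and matched through hash collisions; everything else here is an assumption made for illustration: the anchor set, the signature shape (an anchor plus the next `chain_len` non-anchor words), the Jaccard score, the `min_overlap` threshold, and the function names `spot_signatures` and `near_duplicates`.

```python
import re
from collections import defaultdict

# Assumption: the abstract names only 'is' and 'the' as example anchor words.
ANCHORS = frozenset({"is", "the"})


def spot_signatures(text, anchors=ANCHORS, chain_len=2):
    """Return the set of signatures anchored at common words in `text`.

    Hypothetical scheme: each signature is an anchor word followed by the
    next `chain_len` non-anchor words. No HTML cleaning is attempted.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    sigs = set()
    for i, word in enumerate(words):
        if word not in anchors:
            continue
        chain = []
        for nxt in words[i + 1:]:
            if nxt not in anchors:
                chain.append(nxt)
                if len(chain) == chain_len:
                    break
        # Skip anchors too close to the end of the text to fill a chain.
        if len(chain) == chain_len:
            sigs.add(":".join([word] + chain))
    return sigs


def near_duplicates(docs, min_overlap=0.5):
    """Return sorted (id_a, id_b, score) for document pairs that share signatures.

    `docs` maps a document id to its text. Signatures are used as dict keys,
    so documents with the same signature collide in the same hash bucket.
    The score is the Jaccard overlap of the two signature sets (an assumed
    similarity measure).
    """
    sig_sets = {doc_id: spot_signatures(text) for doc_id, text in docs.items()}

    buckets = defaultdict(set)  # signature -> ids of documents containing it
    for doc_id, sigs in sig_sets.items():
        for sig in sigs:
            buckets[sig].add(doc_id)

    shared = defaultdict(int)  # (id_a, id_b) -> number of shared signatures
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1

    pairs = []
    for (a, b), count in shared.items():
        score = count / len(sig_sets[a] | sig_sets[b])
        if score >= min_overlap:
            pairs.append((a, b, score))
    return sorted(pairs)
```

Calling `near_duplicates` on a dictionary of page texts returns the pairs whose signature overlap meets the threshold. Because signatures are taken only where anchor words occur, surrounding frame text that contains no anchors contributes no signatures, which is one plausible reading of why the approach tolerates differing page frames.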

Item Type: Thesis (Masters)
Uncontrolled Keywords: Algorithms, Near duplicate detection, Theory
Subjects: Computer Science
Computer Science > Databases and the Web
Computer Science > Digital Libraries
Digital Libraries
Related URLs: Project Homepage, Project Homepage
ID Code: 821
Deposited By: Import Account
Deposited On: 23 Jan 2008 16:00
Last Modified: 10 Dec 2008 17:09
