Jonathan, Siddharth and Paepcke, Andreas (2007) SpotSigs: Near Duplicate Detection in Web Page Collections. Masters thesis, Stanford University.
Motivated by our work with political scientists, we present an algorithm that detects near-duplicate Web pages. These scientists analyze Web archives of news sites. The archives were collected with crawlers and contain a large number of pages that look very different because the frame around their core content differs, even though the news stories in the pages are nearly identical. The close proximity of unrelated items on the pages makes content overlap difficult to detect. Our SpotSigs algorithm generates signatures that are spread across each document. The positions of these signatures are determined by the placement of common words, such as 'is' and 'the', in the documents. The method of computing the signatures can be varied. Using hash collisions, the algorithm detects overlap among the signatures of matching contents. We propose and evaluate variants of SpotSigs on a test bed of 2168 Web pages, and we study how the different SpotSigs parameters affect precision and recall and what tradeoffs are involved. A further motivation was to keep pre-processing requirements for near-duplicate detection low; to this end, we do not remove ads, client-side scripts, or other HTML formatting elements from the documents. On this data set, SpotSigs obtains a precision of over 93% and a recall of over 85% for near-duplicate detection.
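The idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract states only that signatures are anchored at common words such as 'is' and 'the' and that overlap is found through hash collisions. Everything else here is an assumption made for illustration: the stopword list, the `chain_length` parameter, the tokenizer, the names `spot_signatures` and `jaccard`, and the use of Jaccard similarity to score overlap.

```python
import re

# Anchor words named in the abstract; the real list is not given there.
ANTECEDENTS = {"is", "the"}
# Assumed stopword list, used only to skip filler words after an anchor.
STOPWORDS = ANTECEDENTS | {"a", "an", "of", "and", "to", "in", "that", "was", "are"}


def spot_signatures(text, chain_length=2):
    """Return a set of signatures, one per anchor-word occurrence.

    Each signature is the anchor word followed by the next `chain_length`
    non-stopword tokens. `chain_length` is a hypothetical parameter.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    signatures = set()
    for i, token in enumerate(tokens):
        if token not in ANTECEDENTS:
            continue
        chain = []
        for following in tokens[i + 1:]:
            if following in STOPWORDS:
                continue
            chain.append(following)
            if len(chain) == chain_length:
                break
        # Keep only complete chains; anchors near the end are dropped.
        if len(chain) == chain_length:
            # Tuples are hashable, so set membership relies on hash lookups.
            signatures.add((token, *chain))
    return signatures


def jaccard(a, b):
    """Overlap score between two signature sets (an assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Same story wrapped in two different page frames, plus an unrelated page.
story = "The mayor announced that the city budget is growing fast this year."
page_a = "Home News Sports. " + story
page_b = "Login Subscribe Weather. " + story + " Contact us"
page_c = "The recipe is simple and the cake tastes great."

sig_a = spot_signatures(page_a)
sig_b = spot_signatures(page_b)
sig_c = spot_signatures(page_c)

print(jaccard(sig_a, sig_b))  # shared story despite different frames
print(jaccard(sig_a, sig_c))  # unrelated content
```

In this toy input the navigation text contains no anchor words, so it contributes no signatures, and the two framed copies of the story yield the same signature set. Real pages with ads and scripts would not separate this cleanly, which is why the thesis evaluates precision and recall across parameter settings.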
|Item Type:||Thesis (Masters)|
|Uncontrolled Keywords:||Algorithms, Near duplicate detection, Theory|
Computer Science > Databases and the Web
Computer Science > Digital Libraries
|Related URLs:||Project Homepage, Project Homepage||http://infolab.stanford.edu/, http://www-diglib.stanford.edu/diglib/pub/|
|Deposited By:||Import Account|
|Deposited On:||23 Jan 2008 16:00|
|Last Modified:||10 Dec 2008 17:09|