Stanford InfoLab Publication Server

Similarity Search on the Web: Evaluation and Scalability Considerations

Haveliwala, Taher and Gionis, Aristides and Klein, Dan and Indyk, Piotr (2001) Similarity Search on the Web: Evaluation and Scalability Considerations. Technical Report. Stanford.

Warning: There is a more recent version of this item available.

Abstract

Allowing users to find pages on the web similar to a particular query page is a crucial component of modern search engines. A variety of techniques and approaches exist to support "Related Pages" queries. In this paper we discuss shortcomings of previous approaches and present a unifying approach that puts special emphasis on the use of text, both within anchors and surrounding anchors. In the central contribution of our paper, we present a novel technique for automating the evaluation process, allowing us to tune our parameters to maximize the quality of the results. Finally, we show how to scale our approach to millions of web pages, using the established Locality-Sensitive-Hashing technique.
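
As a rough illustration of the scaling idea mentioned in the abstract, the sketch below shows a generic min-hash form of Locality-Sensitive Hashing over token sets (for example, words drawn from anchor text). It is not the authors' implementation: the function names, the `bands` and `rows` parameters, and the use of a seeded BLAKE2b hash are all assumptions made for this example, and the report's own weighting and parameter choices are not reproduced here.

```python
import hashlib
from collections import defaultdict


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed (hypothetical helper)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes):
    """Min-hash signature: for each seeded hash, keep the minimum over the tokens.

    `tokens` must be a non-empty collection of strings.
    """
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def build_lsh_index(docs, bands=8, rows=4):
    """Split each signature into bands and bucket pages by (band number, band values)."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        sig = minhash_signature(tokens, bands * rows)
        for b in range(bands):
            index[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    return index


def candidates(index, tokens, bands=8, rows=4):
    """Return pages that share at least one band bucket with the query tokens."""
    sig = minhash_signature(tokens, bands * rows)
    found = set()
    for b in range(bands):
        found |= index.get((b, sig[b * rows:(b + 1) * rows]), set())
    return found


# Toy data: made-up token sets standing in for text associated with each page.
docs = {
    "page_a": {"stanford", "database", "research", "group"},
    "page_b": {"stanford", "database", "research", "group"},
    "page_c": {"cooking", "recipes", "pasta"},
}
index = build_lsh_index(docs)
related = candidates(index, docs["page_a"])
```

Pages with highly overlapping token sets are likely to share a bucket in at least one band, so a "Related Pages" query only needs to compare the query page against its bucket-mates rather than against every page in the collection. Raising `rows` makes buckets stricter, while raising `bands` gives similar pages more chances to collide.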

Item Type: Techreport (Technical Report)
Uncontrolled Keywords: web search, related pages, similarity search, clustering
Subjects: Computer Science > Data Mining; Computer Science > Digital Libraries
Projects: Miscellaneous
Related URLs: http://infolab.stanford.edu/, http://www-nlp.stanford.edu/ (project homepages)
ID Code: 526
Deposited By: Import Account
Deposited On: 25 Feb 2001 16:00
Last Modified: 27 Dec 2008 09:57
