Haveliwala, Taher and Gionis, Aristides and Indyk, Piotr (2000) Scalable Techniques for Clustering the Web (Extended Abstract). In: Third International Workshop on the Web and Databases (WebDB 2000), May 18-19, 2000, Dallas, Texas.
Abstract
Clustering is one of the most crucial techniques for dealing with the massive amount of information present on the web. Clustering can either be performed once offline, independent of search queries, or performed online on the results of search queries. Our offline approach aims to efficiently cluster similar pages on the web, using the technique of Locality-Sensitive Hashing (LSH), in which web pages are hashed in such a way that similar pages have a much higher probability of collision than dissimilar pages. Our preliminary experiments on the Stanford WebBase have shown that the hash-based scheme can be scaled to millions of URLs.
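The collision property described in the abstract can be illustrated with a small sketch. This is a generic MinHash-with-banding form of LSH, not necessarily the scheme used in the paper; the function names, the word-shingle representation of a page, and the parameters `k`, `bands`, and `rows` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a page's text (assumed page model)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _seeded_hash(seed, item):
    """Deterministic 64-bit hash of a string, varied by an integer seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=20):
    """One minimum per seeded hash function; similar sets share many minima."""
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def candidate_pairs(pages, bands=10, rows=2):
    """Map url -> text to the set of url pairs that collide in at least one band."""
    buckets = defaultdict(set)
    for url, text in pages.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            # Pages land in the same bucket only if a whole band of minima agrees.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(url)
    pairs = set()
    for urls in buckets.values():
        pairs.update(combinations(sorted(urls), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical pages, used only to exercise the sketch.
    pages = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy dog",
        "c": "completely different words appear in this other page entirely",
    }
    print(candidate_pairs(pages))
```

In this sketch, two pages whose shingle sets have Jaccard similarity s become a candidate pair with probability 1 - (1 - s^rows)^bands, so raising `rows` makes collisions stricter and raising `bands` makes them more likely. Only candidate pairs need an exact similarity check, which is what lets a hash-based scheme avoid comparing every pair of pages.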
| Field | Value |
|---|---|
| Item Type | Conference or Workshop Item (Paper) |
| Uncontrolled Keywords | WebBase, search, information retrieval, clustering, related pages |
| Subjects | Computer Science > Data Mining Miscellaneous |
| Projects | MIDAS Digital Libraries |
| Related URLs | Project Homepage: http://www-diglib.stanford.edu/diglib/pub/, Project Homepage: http://infolab.stanford.edu/midas/midas.html |
| ID Code | 445 |
| Deposited By | Import Account |
| Deposited On | 06 May 2000 17:00 |
| Last Modified | 27 Dec 2008 14:35 |