Shivakumar, N. and Garcia-Molina, H. (1998) Finding near-replicas of documents on the web. In: International Workshop on the Web and Databases (WebDB 1998), March 27-28, 1998, Valencia, Spain.
Abstract
We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers, web archivers, and the presentation of search results, among other applications. We report statistics on how common replication is on the web, and on the cost of computing this overlap information for a relatively large subset of the web: about 24 million web pages, corresponding to about 150 gigabytes of textual information.
Item Type: | Conference or Workshop Item (Paper) | |
---|---|---|
Uncontrolled Keywords: | SCAM, web experiments | |
Subjects: | Computer Science > Databases and the Web | |
Projects: | Digital Libraries | |
Related URLs: | Project Homepage | http://www-diglib.stanford.edu/diglib/pub/ |
ID Code: | 325 | |
Deposited By: | Import Account | |
Deposited On: | 25 Feb 2000 16:00 | |
Last Modified: | 29 Dec 2008 11:45 | |