Stanford InfoLab Publication Server

Finding near-replicas of documents on the web

Shivakumar, N. and Garcia-Molina, H. (1998) Finding near-replicas of documents on the web. In: International Workshop on the Web and Databases (WebDB 1998), March 27-28, 1998, Valencia, Spain.




We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers and web archivers and to improve the presentation of search results, among other uses. We report statistics on how common replication is on the web, and on the cost of computing this overlap information for a relatively large subset of the web: about 24 million web pages, corresponding to about 150 gigabytes of textual information.
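To make the idea of "overlap between all pairs of documents" concrete, the following is a minimal sketch, not the method used in the paper. It assumes that overlap is measured as the number of shared k-word chunks, that chunks are fingerprinted with a truncated MD5 hash, and that an in-memory inverted index is acceptable; the names `chunk_fingerprints` and `pairwise_overlap` and the parameter `k` are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunk_fingerprints(text, k=4):
    """Return the set of fingerprints of all k-word chunks in `text`.

    Assumption: chunks are overlapping word windows; the paper may chunk
    documents differently.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        chunks = [" ".join(words)]
    else:
        chunks = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    # Truncated MD5 is used here only as a compact, stable fingerprint.
    return {hashlib.md5(c.encode("utf-8")).hexdigest()[:16] for c in chunks}


def pairwise_overlap(docs, k=4):
    """Map each document pair sharing at least one chunk to its shared-chunk count.

    `docs` maps a document id to its text. An inverted index from
    fingerprint to document ids avoids comparing pairs with nothing in common.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for fp in chunk_fingerprints(text, k):
            index[fp].add(doc_id)

    shared = defaultdict(int)
    for doc_ids in index.values():
        for a, b in combinations(sorted(doc_ids), 2):
            shared[(a, b)] += 1
    return dict(shared)


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy cat",
        "c": "completely unrelated text about something else entirely",
    }
    print(pairwise_overlap(docs, k=4))
```

Only pairs that share at least one fingerprint ever appear in the result, which is why an inverted index is a common starting point when the number of documents makes a direct all-pairs comparison impractical. At the scale reported above, such an index would not fit in memory as written here, so this sketch illustrates the computation rather than its cost.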

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: SCAM, web experiments
Subjects: Computer Science > Databases and the Web
Projects: Digital Libraries
Related URLs: Project Homepage
ID Code: 325
Deposited By: Import Account
Deposited On: 25 Feb 2000 16:00
Last Modified: 29 Dec 2008 11:45
