Stanford InfoLab Publication Server

Parallel Crawlers

Cho, Junghoo and Garcia-Molina, Hector (2002) Parallel Crawlers. Technical Report. Stanford.


This is the latest version of this item.



In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.

Item Type: Techreport (Technical Report)
Uncontrolled Keywords: Web crawler, parallelism, distributed crawler
Subjects: Computer Science > Databases and the Web
Projects: Digital Libraries
Related URLs: Project Homepage
ID Code: 733
Deposited By: Import Account
Deposited On: 18 Feb 2002 16:00
Last Modified: 25 Dec 2008 08:38
