Cho, Junghoo and Garcia-Molina, Hector (2002) Parallel Crawlers. Technical Report. Stanford.
This is the latest version of this item.
In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
|Item Type:||Techreport (Technical Report)|
|Uncontrolled Keywords:||Web crawler, paralellism, distributed crawler|
|Subjects:||Computer Science > Databases and the Web|
|Related URLs:||Project Homepage||http://www-diglib.stanford.edu/diglib/pub/|
|Deposited By:||Import Account|
|Deposited On:||18 Feb 2002 16:00|
|Last Modified:||25 Dec 2008 08:38|
Available Versions of this Item
- Parallel Crawlers. (deposited 02 Oct 2001 17:00)
- Parallel Crawlers. (deposited 18 Feb 2002 16:00) [Currently Displayed]
Repository Staff Only: item control page