Stanford InfoLab Publication Server

Parallel Crawlers

Cho, Junghoo and Garcia-Molina, Hector (2002) Parallel Crawlers. Technical Report. Stanford.

This is the latest version of this item.

Abstract

In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.

Item Type: Techreport (Technical Report)
Uncontrolled Keywords: Web crawler, parallelism, distributed crawler
Subjects: Computer Science > Databases and the Web
Projects: Digital Libraries
Related URLs: Project Homepage: http://www-diglib.stanford.edu/diglib/pub/
ID Code: 733
Deposited By: Import Account
Deposited On: 18 Feb 2002 16:00
Last Modified: 25 Dec 2008 08:38
