Stanford InfoLab Publication Server

Crawling the Hidden Web

Raghavan, Sriram and Garcia-Molina, Hector (2000) Crawling the Hidden Web. Technical Report. Stanford.

Warning: There is a more recent version of this item available.



Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high-quality content "hidden" behind search forms, in large searchable electronic databases. In this paper, we provide a framework for addressing the problem of extracting content from this hidden Web. At Stanford, we have built a task-specific hidden Web crawler called the Hidden Web Exposer (HiWE). We describe the architecture of HiWE and present a number of novel techniques that went into its design and implementation. We also present results from experiments we conducted to test and validate our techniques.
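The distinction the abstract draws, between pages reachable by hyperlinks and content reachable only through search forms, can be illustrated with a small sketch. This is not HiWE's implementation; the class `LinkAndFormScanner`, the function `scan`, and the sample page are hypothetical names invented for illustration, using only Python's standard `html.parser` module. The sketch separates the links a conventional crawler would follow from the forms a hidden Web crawler would additionally have to fill in and submit.

```python
from html.parser import HTMLParser


class LinkAndFormScanner(HTMLParser):
    """Hypothetical helper: collects hyperlinks and search forms from one page."""

    def __init__(self):
        super().__init__()
        self.links = []            # hrefs a link-following crawler would visit
        self.forms = []            # forms that act as entry points to hidden content
        self._current_form = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "form":
            self._current_form = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current_form)
        elif tag in ("input", "select", "textarea") and self._current_form is not None:
            # Record only named fields, since unnamed ones are not submitted.
            name = attrs.get("name")
            if name:
                self._current_form["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current_form = None


def scan(html):
    """Return (links, forms) found in the given HTML string."""
    scanner = LinkAndFormScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.links, scanner.forms


# Made-up sample page: one ordinary link and one search form.
SAMPLE = """
<html><body>
  <a href="/about.html">About</a>
  <form action="/search" method="POST">
    <input type="text" name="q">
    <select name="year"><option>1999</option><option>2000</option></select>
    <input type="submit" value="Go">
  </form>
</body></html>
"""

links, forms = scan(SAMPLE)
print("links:", links)
print("forms:", forms)
```

A crawler limited to the publicly indexable Web would use only `links` here; the entries in `forms` mark where a form-aware crawler would need to choose field values and issue a query to reach the underlying database.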

Item Type: Techreport (Technical Report)
Uncontrolled Keywords: Crawling, Hidden Web, Content extraction, HTML Forms
Subjects: Computer Science > Databases and the Web
Projects: Digital Libraries
Related URLs: Project Homepage
ID Code: 456
Deposited By: Import Account
Deposited On: 07 Dec 2000 16:00
Last Modified: 27 Dec 2008 15:30
