Raghavan, Sriram and Garcia-Molina, Hector (2001) Crawling the Hidden Web. In: 27th International Conference on Very Large Data Bases (VLDB 2001), September 11-14, 2001, Rome, Italy.
This is the latest version of this item.
Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of Web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high-quality content "hidden" behind search forms, in large searchable electronic databases. In this paper, we address the problem of designing a crawler capable of extracting content from this hidden Web. We introduce a generic operational model of a hidden Web crawler and describe how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford. We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages. We also present results from experiments conducted to test and validate our techniques.
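To make the idea concrete, the sketch below illustrates one way a hidden-Web crawler might analyze a search form and generate submissions from a table of known labels and candidate values, in the spirit of the paper's form-analysis step. This is not the authors' HiWE implementation; the form HTML, the field names, and the value table are hypothetical, and only a crude name-based match is used in place of LITE's layout-based label extraction.

```python
# Minimal sketch (not the HiWE code) of filling out a search form from a
# table of candidate values. The value table and example form are invented.
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode


class FormParser(HTMLParser):
    """Collects the action URL and the names of text inputs in the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action", "")
        elif tag == "input" and attrs.get("type", "text") == "text":
            if attrs.get("name"):
                self.fields.append(attrs["name"])


# Hypothetical label -> candidate values table, loosely analogous to the
# LVS (Label Value Set) table described in the paper.
VALUE_TABLE = {
    "author": ["Knuth", "Ullman"],
    "subject": ["databases", "information retrieval"],
}


def candidate_queries(form_html):
    """Yield (action, query string) pairs for each combination of known values."""
    parser = FormParser()
    parser.feed(form_html)
    matched = [(name, VALUE_TABLE[name]) for name in parser.fields if name in VALUE_TABLE]
    if not matched:
        return
    names = [name for name, _ in matched]
    for combo in product(*(values for _, values in matched)):
        yield parser.action, urlencode(dict(zip(names, combo)))


if __name__ == "__main__":
    sample_form = """
    <form action="/search">
      <input type="text" name="author">
      <input type="text" name="subject">
    </form>
    """
    for action, query in candidate_queries(sample_form):
        print(f"GET {action}?{query}")
```

A real crawler would, as the paper describes, infer field labels from the rendered page layout rather than from attribute names, and would then fetch and analyze the response pages; the sketch only shows the query-generation step.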
Item Type: Conference or Workshop Item (Paper)
Subjects: Computer Science > Databases and the Web
Related URLs: Project Homepage, http://www-diglib.stanford.edu/diglib/pub/
Deposited By: Import Account
Deposited On: 17 May 2001 17:00
Last Modified: 27 Dec 2008 10:44
Available Versions of this Item
- Crawling the Hidden Web. (deposited 07 Dec 2000 16:00)
- Crawling the Hidden Web. (deposited 17 May 2001 17:00) [Currently Displayed]