Stanford InfoLab Publication Server

Extracting Structured Data from Web Pages

Arasu, Arvind and Garcia-Molina, Hector (2002) Extracting Structured Data from Web Pages. Technical Report. Stanford.

Abstract

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way on all its book pages. The values used to generate the pages (e.g., the author and title) typically come from a database. In this paper, we study the problem of automatically extracting these database values from the web pages without any learning examples or other similar human input. We formally define the notion of a template and propose a model that describes how values are encoded into pages using a template. We present an extraction algorithm that constructs the template from sets of words that have similar occurrence patterns in the input pages. The constructed template is then used to extract values from the pages. We show experimentally that the extracted values make semantic sense in most cases.
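
As a rough illustration of grouping words by occurrence pattern, the Python sketch below treats tokens that appear the same number of times in every input page as candidate template tokens, and everything else as extracted values. It is a simplified reading of the abstract and not the algorithm from the report: the function names, the whitespace tokenization, the "same count on every page" rule, and the sample pages are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def template_token_sets(pages):
    """Group tokens by their per-page occurrence counts.

    Returns a dict mapping an occurrence vector (one count per page) to the
    set of tokens sharing it. Only vectors with a non-zero count on every
    page are kept, since such tokens plausibly belong to the shared template.
    """
    counts = [Counter(page.split()) for page in pages]
    vocabulary = set().union(*counts)
    groups = defaultdict(set)
    for token in vocabulary:
        vector = tuple(c[token] for c in counts)
        if all(n > 0 for n in vector):
            groups[vector].add(token)
    return dict(groups)


def extract_values(page, template_tokens):
    """Return the tokens of a page that are not part of the template."""
    return [tok for tok in page.split() if tok not in template_tokens]


# Hypothetical pages that share one layout but carry different values.
pages = [
    "<b> Title: </b> Databases <i> Author: </i> Ullman",
    "<b> Title: </b> Compilers <i> Author: </i> Aho",
    "<b> Title: </b> Algorithms <i> Author: </i> Knuth",
]

groups = template_token_sets(pages)
template_tokens = set().union(*groups.values())
values = [extract_values(page, template_tokens) for page in pages]
# values == [["Databases", "Ullman"], ["Compilers", "Aho"], ["Algorithms", "Knuth"]]
```

In this toy input every template token occurs exactly once per page, so a single occurrence vector, (1, 1, 1), captures the whole layout. Real pages would need a proper HTML tokenizer and a more tolerant notion of "similar" occurrence patterns.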

Item Type: Techreport (Technical Report)
Uncontrolled Keywords: Automatic Data Extraction
Subjects: Computer Science > Databases and the Web; Computer Science > Data Integration and Mediation
Projects: Miscellaneous; Digital Libraries
Related URLs: Project Homepage (http://infolab.stanford.edu/), Project Homepage (http://www-diglib.stanford.edu/diglib/pub/)
ID Code: 548
Deposited By: Import Account
Deposited On: 10 Jul 2002 17:00
Last Modified: 25 Dec 2008 08:30
