Hammer, J. and Garcia-Molina, H. and Cho, J. and Aranha, R. and Crespo, A. (1997) Extracting Semistructured Information from the Web. Technical Report. Stanford InfoLab. (Publication Note: In Proceedings of the Workshop on Management of Semistructured Data. Tucson, Arizona, May 1997)
Abstract
We describe a configurable tool for extracting semistructured data from a set of HTML pages and for converting the extracted information into database objects. The input to the extractor is a declarative specification that states where the data of interest is located on the HTML pages, and how the data should be packaged into objects. We have implemented the Web extractor using the Python programming language, stressing efficiency and ease of use. We also describe various ways of improving the functionality of our current prototype. The prototype is installed and running in the TSIMMIS testbed as part of a DARPA I3 (Intelligent Integration of Information) technology demonstration, where it is used for extracting weather data from various WWW sites.
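To make the idea of a declarative extraction specification more concrete, the following is a minimal sketch in Python. It is not the specification language or the implementation from the report: the `SPEC` layout, the `extract` function, the field names, and the sample HTML are all hypothetical, chosen only to show a specification that says where the data sits on a page and how it is packaged into objects.

```python
import re

# Hypothetical declarative specification (an assumption for illustration,
# not the syntax used in the report). "record" says where each data row is
# located; "fields" says how a row is packaged into an object.
SPEC = {
    "record": r"<tr>(.*?)</tr>",
    "fields": {
        "city": r'<td class="city">(.*?)</td>',
        "temp_f": r'<td class="temp">(-?\d+)</td>',
    },
}


def extract(html, spec):
    """Apply the spec to an HTML string and return a list of dict objects."""
    objects = []
    for row in re.findall(spec["record"], html, flags=re.DOTALL):
        obj = {}
        for name, pattern in spec["fields"].items():
            match = re.search(pattern, row, flags=re.DOTALL)
            if match:
                obj[name] = match.group(1).strip()
        if obj:  # skip rows (such as header rows) where no field matched
            objects.append(obj)
    return objects


# Made-up sample page; the values are not taken from any real weather site.
SAMPLE_HTML = """
<table>
<tr><th>City</th><th>Temp</th></tr>
<tr><td class="city">Tucson</td><td class="temp">91</td></tr>
<tr><td class="city">Palo Alto</td><td class="temp">68</td></tr>
</table>
"""

weather = extract(SAMPLE_HTML, SPEC)
```

Keeping the patterns in a data structure rather than in code means a new site could, in principle, be supported by writing a new specification without touching the extraction routine.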
Item Type: | Techreport (Technical Report) | |
---|---|---|
Subjects: | Computer Science > Databases and the Web; Computer Science > Semistructured Data | |
Projects: | TSIMMIS | |
Related URLs: | Project Homepage | http://infolab.stanford.edu/tsimmis/tsimmis.html |
ID Code: | 250 | |
Deposited By: | Import Account | |
Deposited On: | 25 Feb 2000 16:00 | |
Last Modified: | 02 Jan 2009 17:09 | |