Mase, H. (1998) Experiments on Automatic Web Page Categorization for IR system. Technical Report. Stanford InfoLab.
Abstract
This paper describes keyword-based Web page categorization. Our goal is to embed our categorization technique in information retrieval (IR) systems to facilitate end-users' search tasks. In such systems, search results must be categorized quickly while keeping accuracy high. Our categorization system uses a knowledge base (KB) to assign categories to Web pages. The KB contains a set of characteristic keywords, weighted by category, and is generated automatically from training texts. In the keyword-based approach, the algorithms that extract keywords and assign weights to them require careful consideration, because they strongly affect both categorization accuracy and processing speed. Furthermore, we must take two characteristics of Web pages into account: (1) text length varies widely, which makes it harder to use statistics such as word frequency to calculate keyword weights, and (2) a huge number of distinct words are used, which makes the KB larger and processing therefore slower. We propose five methods for normalizing the word frequency distribution to improve categorization accuracy, and three methods for filtering unimportant words out of the KB to speed up processing. We performed experiments to compare these methods in terms of both accuracy and KB size, using 15 categories, 10,311 Web pages for KB generation, and 939 pages for testing. The results show that our methods can generate KBs with a range of accuracy values and sizes, so that end-users can select the most appropriate KB according to their preferences for accuracy and speed.
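The abstract does not spell out the five normalization methods or the three filtering methods, so the following is only a minimal sketch of the general idea it describes: a KB of weighted keywords per category, built from training texts, then used to score a page. The length normalization used here (relative word frequency per page, averaged over a category's pages), the `min_weight` filter, and all function names are illustrative assumptions, not the report's actual algorithms.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_kb(training, min_weight=0.0):
    """Build {category: {word: weight}} from (category, text) pairs.

    Assumption: each word's weight is its relative frequency within a page
    (count / page length), averaged over the category's training pages, so
    that long pages do not dominate short ones. Words whose weight falls
    below min_weight are dropped to keep the KB small (hypothetical filter).
    """
    sums = defaultdict(Counter)
    pages_per_category = Counter()
    for category, text in training:
        words = tokenize(text)
        if not words:
            continue
        pages_per_category[category] += 1
        for word, count in Counter(words).items():
            sums[category][word] += count / len(words)

    kb = {}
    for category, totals in sums.items():
        n_pages = pages_per_category[category]
        kb[category] = {
            word: total / n_pages
            for word, total in totals.items()
            if total / n_pages >= min_weight
        }
    return kb


def categorize(kb, text):
    """Return the best-scoring category, or None if no keyword matches."""
    counts = Counter(tokenize(text))
    scores = {
        category: sum(weights.get(w, 0.0) * c for w, c in counts.items())
        for category, weights in kb.items()
    }
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0.0:
        return None
    return best


# Toy training data, invented purely for illustration.
training = [
    ("sports", "football match goal team"),
    ("sports", "team wins match"),
    ("computers", "software code compiler"),
    ("computers", "code runs on hardware"),
]
full_kb = build_kb(training)
small_kb = build_kb(training, min_weight=0.15)
```

Raising `min_weight` shrinks the KB, so fewer keywords are looked up per page at the risk of missing some matches. This mirrors the accuracy-versus-speed trade-off the abstract describes, although the report's own filters may work quite differently.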
|Item Type:|Techreport (Technical Report)|
|Related URLs:|Project Homepage: http://infolab.stanford.edu/|
|Deposited By:|Import Account|
|Deposited On:|25 Feb 2000 16:00|
|Last Modified:|29 Dec 2008 11:20|