Stanford InfoLab Publication Server

Hierarchically classifying documents using very few words

Koller, Daphne and Sahami, Mehran (1997) Hierarchically classifying documents using very few words. Technical Report. Stanford InfoLab.




The proliferation of topic hierarchies for text documents has resulted in a need for tools that automatically classify new documents within such hierarchies. One can use existing classifiers by ignoring the hierarchical structure, treating the topics as separate classes. Unfortunately, in the context of text categorization, we are faced with a large number of classes and a huge number of relevant features needed to distinguish between them. Consequently, we are restricted to using only very simple classifiers, both because of computational cost and the tendency of complex models to overfit. We propose an approach that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree. As we show, each of these smaller problems can be solved accurately by focusing only on a very small set of features, those relevant to the task at hand. This set of relevant features varies widely throughout the hierarchy, so that, while the overall relevant feature set may be large, each classifier only examines a small subset. The use of reduced feature sets allows us to utilize more complex (probabilistic) models, without encountering the computational and robustness difficulties described above.
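The decomposition described in the abstract can be illustrated with a small sketch: one classifier per internal node of the topic tree, each restricted to a handful of words chosen for that node alone. This is not the authors' implementation. The toy tree, the documents, the `NodeClassifier` class, the helper names, and the root label `"root"` are all hypothetical. The feature score (spread of smoothed word probabilities across classes) and the Bernoulli naive Bayes model are simple stand-ins chosen for brevity, not the feature selection method or probabilistic models used in the report.

```python
import math
from collections import Counter


class NodeClassifier:
    """Bernoulli naive Bayes restricted to the k most discriminative words."""

    def __init__(self, k):
        self.k = k

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        count = Counter(labels)
        doc_freq = {c: Counter() for c in self.classes}
        for words, label in zip(docs, labels):
            doc_freq[label].update(words)

        def p(word, c):
            # Laplace-smoothed estimate of P(word present | class c).
            return (doc_freq[c][word] + 1) / (count[c] + 2)

        def score(word):
            # Stand-in relevance score: spread of P(word | class) across classes.
            probs = [p(word, c) for c in self.classes]
            return max(probs) - min(probs)

        vocab = set().union(*docs)
        # Keep only the k best words for this node (ties broken alphabetically).
        self.features = sorted(vocab, key=lambda w: (-score(w), w))[: self.k]
        self.log_prior = {c: math.log(count[c] / len(labels)) for c in self.classes}
        self.prob = {c: {w: p(w, c) for w in self.features} for c in self.classes}
        return self

    def predict(self, words):
        def log_posterior(c):
            return self.log_prior[c] + sum(
                math.log(self.prob[c][w] if w in words else 1 - self.prob[c][w])
                for w in self.features
            )
        return max(self.classes, key=log_posterior)


def train_hierarchy(tree, docs, paths, k):
    """Fit one small classifier per internal node, using only the documents below it."""
    models = {}
    for node in tree:
        node_docs, node_labels = [], []
        for words, path in zip(docs, paths):
            full = ("root",) + path
            if node in full[:-1]:
                node_docs.append(words)
                node_labels.append(full[full.index(node) + 1])
        models[node] = NodeClassifier(k).fit(node_docs, node_labels)
    return models


def classify(tree, models, words, node="root"):
    """Walk from the root to a leaf, consulting one node classifier per level."""
    while node in tree:
        node = models[node].predict(words)
    return node


# Hypothetical two-level topic tree and toy documents (sets of words).
tree = {
    "root": ["science", "sports"],
    "sports": ["soccer", "tennis"],
    "science": ["biology", "physics"],
}
data = [
    ({"ball", "goal", "team", "match"}, ("sports", "soccer")),
    ({"ball", "goal", "league", "player"}, ("sports", "soccer")),
    ({"ball", "racket", "serve", "match"}, ("sports", "tennis")),
    ({"ball", "racket", "court", "player"}, ("sports", "tennis")),
    ({"theory", "quantum", "energy", "experiment"}, ("science", "physics")),
    ({"theory", "particle", "energy", "lab"}, ("science", "physics")),
    ({"theory", "cell", "gene", "experiment"}, ("science", "biology")),
    ({"theory", "cell", "protein", "lab"}, ("science", "biology")),
]
docs = [words for words, _ in data]
paths = [path for _, path in data]
models = train_hierarchy(tree, docs, paths, k=2)

for node, model in models.items():
    print(node, model.features)
print(classify(tree, models, {"ball", "goal", "stadium"}))
print(classify(tree, models, {"theory", "cell", "microscope"}))
```

In this toy setup each node consults only two words, and the selected words differ from node to node, which mirrors the abstract's point that the relevant feature set varies across the hierarchy even though the overall vocabulary is larger.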

Item Type: Techreport (Technical Report)
Additional Information: Previous number = SIDL-WP-1997-0059
Subjects: Computer Science > Digital Libraries
Projects: Digital Libraries
ID Code: 291
Deposited By: Import Account
Deposited On: 28 Oct 2001 16:00
Last Modified: 04 Jan 2009 11:22
