Stanford InfoLab Publication Server

Web Content Categorization Using Link Information

Gyongyi, Zoltan and Garcia-Molina, Hector and Pedersen, Jan (2006) Web Content Categorization Using Link Information. Technical Report. Stanford.




Document categorization is one of the foundational problems in (web) information retrieval. Even though web documents are hyperlinked, most proposed classification techniques take little advantage of the link structure and rely primarily on text features, as it is not immediately clear how to make link information intelligible to supervised machine learning algorithms. This paper introduces a link-based approach to classification, which can be used in isolation or in conjunction with text-based classification. Various large-scale experimental results indicate that link-based classification is on par with text-based classification, and the combination of the two offers the best of both worlds.

Item Type: Techreport (Technical Report)
Uncontrolled Keywords: web search, hypertext categorization, web link structure analysis
Subjects: Computer Science > Data Mining; Computer Science > Databases and the Web
Related URLs: Project Homepage
ID Code: 782
Deposited By: Import Account
Deposited On: 19 Jul 2006 17:00
Last Modified: 18 Dec 2008 14:46
