Stanford InfoLab Publication Server

Clustering the Tagged Web

Ramage, Daniel and Heymann, Paul and Manning, Christopher D. and Garcia-Molina, Hector (2008) Clustering the Tagged Web. In: Second ACM International Conference on Web Search and Data Mining (WSDM 2009), February 9-12, 2009, Barcelona, Spain.



Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
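
The first approach named in the abstract, K-means over a vector space extended with tags, can be illustrated with a minimal sketch. This is not the paper's implementation: the toy pages, the `tag_weight` parameter, the choice to normalize words and tags separately before concatenating them, the use of cosine similarity, and the first-k centroid initialization are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def unit_vector(tokens, prefix):
    """L2-normalized sparse term-frequency vector; keys are prefix + token."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {prefix + t: c / norm for t, c in counts.items()}

def combine(words, tags, tag_weight=0.5):
    """Concatenate word and tag vectors into one space (hypothetical weighting)."""
    vec = {k: v * math.sqrt(1 - tag_weight)
           for k, v in unit_vector(words, "w:").items()}
    vec.update({k: v * math.sqrt(tag_weight)
                for k, v in unit_vector(tags, "t:").items()})
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for k, v in vec.items():
            total[k] += v
    return {k: v / len(vectors) for k, v in total.items()}

def kmeans(vectors, k, iterations=10):
    """Plain K-means with cosine similarity; centroids start as the first k vectors."""
    centroids = [dict(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its most similar centroid.
        assignment = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment

# Toy data (invented): each page is (page-text words, user tags).
pages = [
    (["python", "code", "tutorial"], ["programming", "python"]),
    (["recipe", "pasta", "sauce"], ["cooking", "food"]),
    (["code", "compiler", "language"], ["programming", "software"]),
    (["recipe", "bread", "oven"], ["food", "baking"]),
]
vectors = [combine(words, tags) for words, tags in pages]
labels = kmeans(vectors, 2)
print(labels)
```

Prefixing keys with `w:` and `t:` keeps a word and an identically spelled tag in separate dimensions, and normalizing each part before concatenation keeps a long page text from drowning out a handful of tags.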

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: Collaborative tagging systems, clustering, information retrieval, k-means clustering, latent Dirichlet allocation
ID Code: 890
Deposited By: Paul Heymann
Deposited On: 27 Nov 2008 23:12
Last Modified: 30 Dec 2008 16:13
