Goldszmidt, Moises and Sahami, Mehran (1998) A Probabilistic Approach to Full-Text Document Clustering. Technical Report. Stanford InfoLab.
Clustering text documents requires a suitable function for measuring the distance between documents. In this paper, we explore a function for scoring document similarity based on probabilistic considerations: similarity is scored according to the expectation of the same words appearing in two documents. This score enables the investigation of different smoothing methods for estimating the probability of a word appearing in a document for purposes of clustering. Our experimental results show that the effectiveness of these smoothing methods depends on the degree of separability between the clusters. Furthermore, we show that the cosine coefficient widely used in information retrieval can be associated with a particular form of probabilistic smoothing in our model. We also introduce a specific scoring function that outperforms the cosine coefficient and its extensions, such as TFIDF weighting, in our experiments with document clustering tasks. This new scoring function is based on normalizing (in the probabilistic sense) the cosine similarity score and adding a scaling factor based on the characteristics of the corpus being clustered. Finally, our experiments indicate that our model, which assumes an asymmetry between positive (word appearance) and negative (word non-appearance) information in the document clustering task, outperforms standard mixture models that weight such information equally.
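To make the contrast in the abstract concrete, the following is a minimal Python sketch, not the report's actual formulation. The abstract does not state the scoring formula, so the sketch assumes one plausible reading: the score is the sum over words of the product of the two documents' word probabilities, with additive smoothing standing in for the smoothing methods the report compares. The function names, the `alpha` parameter, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def word_probs(tokens, vocab, alpha=0.5):
    """Estimate P(word | document) over a shared vocabulary.

    Additive smoothing with pseudo-count `alpha` is an assumption made
    for illustration; it is not taken from the report.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def expected_overlap(p, q):
    """Assumed score: sum over words of P(w | d1) * P(w | d2)."""
    return sum(p[w] * q[w] for w in p)


def cosine(tokens_a, tokens_b):
    """Standard cosine coefficient on raw term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy documents (hypothetical): doc1 and doc2 share a topic, doc3 does not.
doc1 = "clustering text documents with word probabilities".split()
doc2 = "probabilities of word appearance in text documents".split()
doc3 = "stock market prices fell sharply today".split()
vocab = sorted(set(doc1) | set(doc2) | set(doc3))

p1, p2, p3 = (word_probs(d, vocab) for d in (doc1, doc2, doc3))

score_related = expected_overlap(p1, p2)
score_unrelated = expected_overlap(p1, p3)
cos_related = cosine(doc1, doc2)
cos_unrelated = cosine(doc1, doc3)
```

In this sketch only shared word appearances raise the probabilistic score, while the cosine coefficient drops to zero for documents with no words in common and the smoothed score stays small but positive. Changing `alpha` shows how the choice of smoothing shifts the gap between related and unrelated pairs.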
|Item Type:||Techreport (Technical Report)|
|Additional Information:||Previous number = SIDL-WP-1998-0091|
|Subjects:||Computer Science > Digital Libraries|
|Related URLs:||Project Homepage||http://www-diglib.stanford.edu/diglib/pub/|
|Deposited By:||Import Account|
|Deposited On:||30 Oct 2001 16:00|
|Last Modified:||29 Dec 2008 10:47|