STANFORD DIGITAL LIBRARIES TECHNOLOGIES
PROJECTS
DOCUMENTS
PEOPLE
SEMINARS
TESTBED
RESOURCES

Web Clustering

One of the difficult challenges in Web-related research is clustering or classifying web pages. Clustering refers to grouping pages into categories, in a fashion similar to Yahoo or the Open Directory. These two directories, however, are maintained entirely by human editors, using no automated techniques. Manual techniques do not scale to the entire web: although there are over 1 billion pages on the web (http://www.inktomi.com/webmap/), Yahoo and the Open Directory each have fewer than 2 million URLs in their hierarchies, which is less than 0.2% of the web.

We are currently investigating techniques to efficiently cluster the entire web. Traditional IR approaches are not appropriate in this context, due to both the enormous size and the hyperlinked nature of the web. We plan to use recently developed techniques for similarity search in high-dimensional spaces (for instance, http://theory.stanford.edu/~indyk/vldb99.ps) to allow for offline clustering of the web. Even with these newer techniques, the resource requirements will be large, especially as precision requirements are raised. Supercomputing resources will be a valuable asset in performing clustering and other mining operations on the contents of the web. Such resources will allow us to explore and evaluate more of the available clustering options as we develop the most effective techniques.
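
To give a rough sense of what hashing-based similarity search in a high-dimensional space can look like, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's implementation: it uses random-hyperplane signatures over term counts, which is one common form of locality-sensitive hashing and not necessarily the scheme in the reference above. The function names (make_hyperplanes, signature, bucket_pages), the whitespace tokenizer, and the sample pages are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def make_hyperplanes(vocab, num_bits, seed=0):
    """Draw num_bits random hyperplanes, one Gaussian weight per vocabulary term."""
    rng = random.Random(seed)
    return [{term: rng.gauss(0.0, 1.0) for term in vocab} for _ in range(num_bits)]


def signature(text, planes):
    """Map a page's term-count vector to a tuple of bits, one bit per hyperplane."""
    counts = Counter(text.lower().split())  # assumption: naive whitespace tokenizer
    bits = []
    for plane in planes:
        # Sorted iteration keeps the floating-point summation order reproducible.
        dot = sum(counts[t] * plane.get(t, 0.0) for t in sorted(counts))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def bucket_pages(pages, num_bits=8, seed=0):
    """Group page ids by signature; pages sharing a bucket are candidate neighbors."""
    # Assumption: the vocabulary fits in memory, which would not hold at web scale.
    vocab = sorted({t for text in pages.values() for t in text.lower().split()})
    planes = make_hyperplanes(vocab, num_bits, seed)
    buckets = defaultdict(list)
    for page_id, text in pages.items():
        buckets[signature(text, planes)].append(page_id)
    return dict(buckets)


if __name__ == "__main__":
    # Hypothetical sample pages, for illustration only.
    sample = {
        "p1": "digital library search engine",
        "p2": "search engine digital library",
        "p3": "pasta recipes for a quick dinner",
    }
    for sig, ids in bucket_pages(sample).items():
        print(sig, ids)
```

Pages whose term vectors point in similar directions tend to receive the same bits, so only pages within a bucket need to be compared in detail. Raising num_bits makes buckets smaller and more precise but misses more true neighbors, which is one reason resource requirements grow as precision requirements are raised.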

Questions or Comments? Send email to dlwebmaster@db.stanford.edu