Stanford InfoLab Publication Server

Duplicate Detection in Information Dissemination

Yan, T. (1995) Duplicate Detection in Information Dissemination. Technical Report. Stanford InfoLab.




Our experience with the SIFT [YGM95] information dissemination system (in use by over 7,000 users daily) has identified an important and generic dissemination problem: duplicate information. In this paper we explain why duplicates arise, quantify the problem, and discuss why it impairs information dissemination. We then propose a Duplicate Removal Module (DRM) for an information dissemination system. Duplicate removal operates on a per-user, per-document basis: each document read by a user generates a request, or duplicate restraint. In wide-area environments, the number of restraints handled is very large. We consider the implementation of a DRM, examining alternative algorithms and data structures that may be used. We present a performance evaluation of the alternatives and answer important design questions such as: Which implementation is the best? With the best scheme, how expensive will duplicate removal be? How much memory is required? How fast can restraints be processed?
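To make the per-user, per-document idea concrete, the following is a minimal sketch, not the report's implementation. It assumes each document carries a key that is equal for duplicates (how duplicates are recognized is not stated in the abstract), and it stores restraints in an in-memory hash set. The names DuplicateRemovalModule, add_restraint, and filter_unread are hypothetical.

```python
class DuplicateRemovalModule:
    """Illustrative DRM: remembers (user, document key) pairs already read."""

    def __init__(self):
        # Each duplicate restraint is one (user_id, doc_key) pair.
        # A hash set is only one possible structure; the report compares
        # several alternatives that are not reproduced here.
        self._restraints = set()

    def add_restraint(self, user_id, doc_key):
        """Record that user_id has read the document identified by doc_key."""
        self._restraints.add((user_id, doc_key))

    def filter_unread(self, user_id, doc_keys):
        """Return the keys from doc_keys that user_id has not read yet."""
        return [k for k in doc_keys if (user_id, k) not in self._restraints]

    def restraint_count(self):
        """Number of restraints currently held in memory."""
        return len(self._restraints)


drm = DuplicateRemovalModule()
drm.add_restraint("alice", "doc-1")
pending_for_alice = drm.filter_unread("alice", ["doc-1", "doc-2"])
pending_for_bob = drm.filter_unread("bob", ["doc-1", "doc-2"])
```

In this sketch memory grows with the number of (user, document) pairs, which is why the abstract's questions about memory requirements and restraint processing speed matter at wide-area scale.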

Item Type: Techreport (Technical Report)
Subjects: Computer Science
Related URLs: Project Homepage
ID Code: 108
Deposited By: Import Account
Deposited On: 25 Feb 2000 16:00
Last Modified: 08 Dec 2008 13:51
