Yan, T. (1995) Duplicate Detection in Information Dissemination. Technical Report. Stanford InfoLab.
BibTeX | DublinCore | EndNote | HTML |
![]()
| PDF 345Kb |
Abstract
Our experience with the SIFT [YGM95] information dissemination system (in use by over 7,000 users daily) has identified an important and generic dissemination problem: duplicate information. In this paper we explain why duplicates arise, we quantify the problem, and we discuss why it impairs information dissemination. We then propose a Duplicate Removal Module (DRM) for an information dissemination system. The removal of duplicates operates on a per user, per document basis { each document read by a user generates a request, or a duplicate restraint. In wide-area environments, the number of restraints handled is very large. We consider the implementation of a DRM, examining alternative algorithms and data structures that may be used. We present a performance evaluation of the alternatives and answer important design questions such as: Which implementation is the best? With "best" scheme, how expensive will duplicate removal be? How much memory is required? How fast can restraints be processed?
Item Type: | Techreport (Technical Report) | |
---|---|---|
Subjects: | Computer Science | |
Projects: | Miscellaneous | |
Related URLs: | Project Homepage | http://infolab.stanford.edu/ |
ID Code: | 108 | |
Deposited By: | Import Account | |
Deposited On: | 25 Feb 2000 16:00 | |
Last Modified: | 08 Dec 2008 13:51 |
Download statistics
Repository Staff Only: item control page