Duplicate Detection in Information Dissemination

Yan, T. (1995) Duplicate Detection in Information Dissemination. Technical Report. Stanford InfoLab.

Abstract

Our experience with the SIFT [YGM95] information dissemination system (in use by over 7,000 users daily) has identified an important and generic dissemination problem: duplicate information. In this paper we explain why duplicates arise, we quantify the problem, and we discuss why it impairs information dissemination. We then propose a Duplicate Removal Module (DRM) for an information dissemination system. The removal of duplicates operates on a per-user, per-document basis: each document read by a user generates a request, or a duplicate restraint. In wide-area environments, the number of restraints handled is very large. We consider the implementation of a DRM, examining alternative algorithms and data structures that may be used. We present a performance evaluation of the alternatives and answer important design questions such as: Which implementation is the best? With the "best" scheme, how expensive will duplicate removal be? How much memory is required? How fast can restraints be processed?
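
The per-user, per-document restraint idea can be illustrated with a minimal sketch. It is not the report's implementation: the abstract does not say which algorithms or data structures the report evaluates, nor how duplicates are identified. The class name, the method names, the in-memory set per user, and the content-hash `fingerprint` used as a duplicate key are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hypothetical duplicate key: hash of case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class DuplicateRemovalModule:
    """Toy DRM keeping one in-memory set of restraints per user (an assumption)."""

    def __init__(self) -> None:
        # user_id -> set of fingerprints of documents that user has already read
        self._restraints = defaultdict(set)

    def add_restraint(self, user_id: str, text: str) -> None:
        """Record that user_id has read this document."""
        self._restraints[user_id].add(fingerprint(text))

    def should_deliver(self, user_id: str, text: str) -> bool:
        """Return False if a restraint for this user and document exists."""
        restraints = self._restraints.get(user_id)
        return restraints is None or fingerprint(text) not in restraints

    def restraint_count(self) -> int:
        """Total number of (user, document) restraints currently stored."""
        return sum(len(s) for s in self._restraints.values())


drm = DuplicateRemovalModule()
drm.add_restraint("alice", "Duplicate  Detection report")
print(drm.should_deliver("alice", "duplicate detection report"))
print(drm.should_deliver("bob", "duplicate detection report"))
```

In this sketch the stored state grows with the number of (user, document) pairs, which is why the abstract's questions about memory use and restraint-processing speed matter at wide-area scale.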

Item Type:	Techreport (Technical Report)
Subjects:	Computer Science
Projects:	Miscellaneous
Related URLs:	Project Homepage	http://infolab.stanford.edu/
ID Code:	108
Deposited By:	Import Account
Deposited On:	25 Feb 2000 16:00
Last Modified:	08 Dec 2008 13:51
