Stanford InfoLab Publication Server

dSCAM: Finding Document Copies across Multiple Databases

Garcia-Molina, H. and Gravano, L. and Shivakumar, N. (1996) dSCAM: Finding Document Copies across Multiple Databases. In: Proceedings of 4th International Conference on Parallel and Distributed Information Systems (PDIS'96), Miami Beach, Florida.




The advent of the Internet has made the illegal dissemination of copyrighted material easy. An important problem is how to automatically detect when a "new" digital document is "suspiciously close" to existing ones. The SCAM project at Stanford University has addressed this problem when there is a single registered-document database. However, in practice, text documents may appear in many autonomous databases, and one would like to discover copies without having to exhaustively search all of them. Our approach, dSCAM, is a distributed version of SCAM that keeps succinct metainformation about the contents of the available document databases. Given a suspicious document S, dSCAM uses this metainformation to prune all databases that cannot contain any document close enough to S, so that the search can focus on the remaining sites. We also study how to query the remaining databases so as to minimize different querying costs. We empirically study the pruning and searching schemes, using a collection of 50 databases and two sets of test documents.
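
The pruning idea in the abstract can be illustrated with a minimal sketch. This is not the dSCAM algorithm itself: the summary format (a per-database count of documents containing each word), the closeness rule (fraction of the suspicious document's distinct words that are shared), and the names `build_summary`, `candidate_databases`, and `min_shared_fraction` are all assumptions made for illustration. The point it shows is that a small summary can safely rule out a database: if the summary covers too few words of S, no single document in that database can cover enough of them either.

```python
from collections import Counter


def build_summary(documents):
    """Hypothetical per-database summary: word -> number of documents containing it."""
    summary = Counter()
    for text in documents:
        # Count each word once per document.
        summary.update(set(text.lower().split()))
    return summary


def candidate_databases(suspicious, summaries, min_shared_fraction=0.5):
    """Return names of databases that cannot be pruned for this suspicious document.

    Assumed closeness rule (illustrative only): a stored document is "close"
    to S if it contains at least min_shared_fraction of S's distinct words.
    A database whose summary covers fewer words than that cannot hold a
    close document, so skipping it loses no matches under this rule.
    """
    words = set(suspicious.lower().split())
    keep = []
    for name, summary in summaries.items():
        shared = sum(1 for w in words if summary[w] > 0)
        if words and shared / len(words) >= min_shared_fraction:
            keep.append(name)
    return keep


summaries = {
    "db1": build_summary(["the quick brown fox jumps", "hello world"]),
    "db2": build_summary(["completely unrelated text here"]),
}
remaining = candidate_databases("the quick brown fox", summaries)
```

Under these assumptions the filter is conservative: it may keep a database that turns out to hold no copy, but it never discards one that does. Only the databases in `remaining` would then be queried in full.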

Item Type: Conference or Workshop Item (Paper)
Subjects: Computer Science > Digital Libraries
Projects: Digital Libraries
Related URLs: Project Homepage
ID Code: 199
Deposited By: Import Account
Deposited On: 25 Feb 2000 16:00
Last Modified: 08 Dec 2008 15:19
