Precision and Recall of GlOSS Estimators for Database Discovery

Gravano, L. and Garcia-Molina, H. and Tomasic, A. (1994) Precision and Recall of GlOSS Estimators for Database Discovery. In: Third International Conference on Parallel and Distributed Information Systems (PDIS 1994), September 28-30, 1994, Austin, Texas.

Abstract

Luis Gravano, Héctor García-Molina, Anthony Tomasic
Computer Science Department, Stanford University, Stanford, CA 94305-2140
{gravano,hector,tomasic}@cs.stanford.edu

1 Overview

On-line information vendors offer access to multiple databases. In addition, the advent of a variety of INTERNET tools [1, 2] has provided easy, distributed access to many more databases. The result is thousands of text databases from which a user may choose for a given information need (a user [...]

This paper, an abridged version of [3], presents a framework for (and analyzes a solution to) this problem, which we call the text-database discovery problem (see [3] for a survey of related work).

Our solution to the text-database discovery problem is to build a service that can suggest potentially good databases to search. A user's query goes through two steps. First, the query is presented to our server (dubbed GlOSS, for Glossary-Of-Servers Server), which selects a set of promising databases to search. Second, the query is actually evaluated at the chosen databases.

GlOSS gives a hint of which databases might be useful for the user's query, based on word-frequency information for each database. This information indicates, for each database and each keyword in the database vocabulary, how many documents at that database actually contain the keyword, for each field designator (Sections 2 and [...]

For example, a Computer-Science library could report that "Knuth" (keyword) occurs as an author (field designator) in 180 documents, that the keyword "computer" occurs in the title of 25,548 documents, and so on. This information is orders of magnitude smaller than a full index (see [4]), since for each keyword/field-designation pair we only need to keep its frequency, not the identities of the documents that contain it.

To evaluate the set of databases that GlOSS returns for a given query, Section 4 presents a framework based on the precision and recall me[...]
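
The two-step idea described in the abstract (consult per-database frequency summaries, then search only the promising databases) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The estimator here assumes that query keywords occur independently within a database, an assumption the abstract itself does not state. The names `estimate_matches` and `suggest_databases`, the database names, the database sizes, and every frequency other than the "Knuth" and "computer" counts quoted above are hypothetical.

```python
from typing import Dict, List, Tuple

# (field designator, keyword) -> number of documents that contain the keyword in that field
Summary = Dict[Tuple[str, str], int]


def estimate_matches(summary: Summary, db_size: int,
                     query: List[Tuple[str, str]]) -> float:
    """Estimate how many documents satisfy every (field, keyword) pair.

    Assumption (not from the abstract): keywords occur independently,
    so the estimate is db_size * product(frequency / db_size).
    """
    if db_size <= 0:
        return 0.0
    estimate = float(db_size)
    for pair in query:
        estimate *= summary.get(pair, 0) / db_size
    return estimate


def suggest_databases(summaries: Dict[str, Summary], sizes: Dict[str, int],
                      query: List[Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Step 1: rank databases whose estimate is non-zero, best first."""
    scored = [(name, estimate_matches(summaries[name], sizes[name], query))
              for name in summaries]
    promising = [(name, score) for name, score in scored if score > 0]
    return sorted(promising, key=lambda item: item[1], reverse=True)


# Frequencies for "cs_library" come from the abstract's example;
# the sizes and the second database are made up for illustration.
summaries = {
    "cs_library": {("author", "knuth"): 180, ("title", "computer"): 25548},
    "medical_library": {("title", "computer"): 40},
}
sizes = {"cs_library": 100_000, "medical_library": 50_000}
query = [("author", "knuth"), ("title", "computer")]

# Step 2 (evaluating the query at the chosen databases) is outside this sketch.
for name, score in suggest_databases(summaries, sizes, query):
    print(f"{name}: about {score:.1f} matching documents expected")
```

Because only one integer is stored per keyword/field pair, the summaries stay small, which is the property the abstract highlights when comparing them with a full index.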

Item Type:	Conference or Workshop Item (Paper)
Subjects:	Computer Science
Projects:	Miscellaneous
Related URLs:	Project Homepage	http://infolab.stanford.edu/
ID Code:	64
Deposited By:	Import Account
Deposited On:	25 Feb 2000 16:00
Last Modified:	05 Feb 2009 15:26
