Stanford Integrated Digital Library Annual Program Plan and Report
Reporting Period Feb. 1995 - Feb. 1996

1. Progress Report

A. Project Summary

DATE PREPARED: Feb 1, 1996
ORGANIZATION: Stanford University
PRINCIPAL INVESTIGATORS: Hector Garcia-Molina (hector@db.stanford.edu, 415-723-0685/415-725-2588), Terry Winograd (winograd@cs.stanford.edu, 415-723-2780)
PARTICIPATING INVESTIGATOR: Daphne Koller (koller@cs.stanford.edu, 415-723-6598)
TITLE OF EFFORT: Stanford Integrated Digital Library
ACCESS INFORMATION: http://www-diglib.stanford.edu

OBJECTIVE: To research, design, and implement technology that will allow users to interact with large numbers of heterogeneous services in digital libraries. To develop enabling technology that will allow economically feasible deployment of digital libraries.

APPROACH: We envision digital libraries as collections of autonomous, publication-related services, such as remotely used services for summarization, indexing, copy detection, payment, search, format conversion, etc. We use distributed object technology for our communication infrastructure and to ensure interaction-model interoperability among digital library services. We build 'library service proxies' (LSPs), which are CORBA objects that represent digital library services. Computer programs interact with these LSPs via remote method calls. Differences in service interaction models are 'smoothed out' by building appropriate interfaces to the proxies. This assists programmers in building digital library patron modules that interact with multiple services at a time. To help end users deal with heterogeneity, our user interface effort includes exploration of a drag-and-drop desktop which is user-configurable to reflect task-specific needs. Users build 'task component networks', which are interconnected visual representations of online services.
Users drag services onto the desktop and configure them to accomplish individual, recurring tasks, such as 'keeping up-to-date on company X'. Visual feedback allows users to monitor multiple services that may be running simultaneously on their behalf. For query interoperability we give users one rich front-end query language. For a given query, we then compute a best possible target-native query, and a post-filtering query. We submit the native query to the respective source and apply the filter query to the result documents to ensure full equivalence of result sets to the original user queries when possible. For payment interoperability we are developing InterPay, a layered architecture that allows (i) easy user-level customization of payment control and (ii) easy program-level integration of multiple payment schemes. It distinguishes between the client side and the service proxy side of a payment transaction. On the client side, it provides for 'payment agents' which are user-programmable software components which monitor incoming invoices and, if necessary, interact with the user to make payment decisions. 'Payment capabilities' are components which accomplish actual fund transfers by interacting with diverse online payment services. In order to help users make sense of documents returned from searches, our SenseMaker component uses recursive clustering techniques. Result documents are clustered along multiple dimensions, such as common origin, common authors or similar titles. Users can apply such clustering repeatedly on successive clusters to gain an understanding of the document set. SCAM addresses copyright problems in its exploration of algorithms for efficiently detecting partial overlap among the contents of documents. Individual documents are tested against existing collections of reference documents. We measure these algorithms for performance sensitivity to variations in parameters such as document comparison 'chunk size' and collection size. 
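The chunk-based comparison at the heart of this approach can be sketched as follows. This is a hypothetical simplification for illustration only: the chunk granularity, similarity measure, and threshold are stand-ins, not SCAM's actual parameters or data structures.

```python
# Sketch of chunk-based overlap detection in the spirit of SCAM.
# A document is decomposed into fixed-size word "chunks"; a new
# document is flagged when enough of its chunks also occur in a
# registered reference document. (Illustrative simplification.)

def chunks(text, size):
    """Split a document into overlapping word chunks of the given size."""
    words = text.split()
    return {" ".join(words[i:i + size])
            for i in range(0, max(len(words) - size + 1, 1))}

def overlap(doc, reference, chunk_size=3):
    """Fraction of the new document's chunks that occur in a reference."""
    d, r = chunks(doc, chunk_size), chunks(reference, chunk_size)
    return len(d & r) / len(d) if d else 0.0

def flag_copies(doc, collection, threshold=0.5, chunk_size=3):
    """Return ids of reference documents whose overlap exceeds the threshold."""
    return [ref_id for ref_id, ref in collection.items()
            if overlap(doc, ref, chunk_size) >= threshold]
```

A larger chunk size makes matches more specific but misses lightly edited copies, which is exactly the sensitivity the evaluation above measures.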
Our network-accessible InterBib service supports bibliography maintenance by converting and merging extended Refer and BibTex bibliographies to hyperlinked MIF and HTML. Users can also deliver Framemaker documents with embedded citation keys and corresponding bibliographic databases to the service. InterBib returns a new document with citations resolved and a bibliography appended. Resource discovery is assisted by the results of our efforts in query-based source selection. We are exploring efficient and feasible designs that maintain some amount of statistical information about the contents of multiple sources. Queries are tested against this limited central information to find the sources that will most likely yield a large set of results.

PROGRESS: We made major headway in the development of a protocol for the delivery of search services and documents. The protocol emphasizes flexibility in dynamically moving document collections and service computation among multiple resources. Our user interface has evolved into a prototype desktop for interacting with digital libraries. A query translation prototype was constructed for two very different target engines. We have carefully evaluated several of the developed algorithms and systems.

RECENT ACCOMPLISHMENTS:
* Development of Digital Library Interoperability Protocol
* Construction of drag-and-drop digital library desktop
* SenseMaker: recursive categorization of result documents
* Query translation for heterogeneous boolean search engines
* Interoperability with University of Michigan DL project
* Hired Tom Schirmer as full-time programmer
* Prof. Daphne Koller joined as participating investigator
* MIT Press and Hitachi joined as industrial partners

PLANS:
* Extension of query translation to additional search models
* Interoperability with UC Santa Barbara and University of Illinois
* Categorization of search results based on generalized relations among documents
* InfoBus accessibility through Z39.50
* View-based metadata and structural interoperability
* Continued evaluation of algorithms and interfaces
* Start of production branch for testbed software

TECHNOLOGY TRANSITION, SHARING, PARTNERING, ETC.: The University of Michigan and Stanford successfully accessed each other's resources through remote method calls and the DL interoperability protocol. Our distributed object approach has inspired several institutions to explore a similar route. We are acting as brokers and designers for an informal agreement among major search engine providers and their users. The agreement is intended to address interoperability issues in three areas: (i) resource discovery, (ii) query submission, and (iii) rank merging of result sets from the various engines. If this effort succeeds, the engine providers will include a standardized interface to their engines to support clients interacting with multiple engines of the various vendors. The selective dissemination project SIFT is being moved to a startup company: PANGEA Reference Systems.

B. Significant Event

Our efforts in integrating multiple heterogeneous Digital Library sources and services at the operational and interface levels were well represented during the dedication of Stanford's new Computer Science building. As part of the festivities, we demonstrated our testbed prototype to Bill Gates, Chairman of Microsoft Corporation, and to numerous internationally renowned academic and business leaders. Feedback on the demonstration was very encouraging. (See slide in appendix and separate file)

C. Quad Chart

(See slides in appendix and separate file)

D.
Gantt Chart

We include with this report the Gantt charts for our project (see appendix). The first set of charts is identical to the ones included in our previous annual report, except that we have indicated the progress made on each task through February 29, 1996 (progress is indicated by a dark line inside the shaded task rectangle). Since our plans for the third funding period have advanced substantially from what we envisioned a year ago, we have a second set of Gantt charts detailing the tasks for the third period, March 1, 1996 to February 28, 1997. As can be seen from the first set of charts, we made substantial progress on most of the tasks we had planned a year ago. In the Testbed category, we set up the basic infrastructure for our Information Bus, and connected various sources to it. We made substantial progress on some of our tasks for Phase II (third funding period), including the development of the Interoperability Protocol that has allowed us to exchange information with the University of Michigan. In the research categories we also made substantial progress. For example, we implemented a full user interface, developed our InterPay scheme for coordinating payments and charges to and from heterogeneous services, and developed the basic query translation algorithms. The only area where we did not complete most of the tasks was the Testbed Evaluation category. This is because we re-directed our evaluation efforts based on the feedback we received at the previous site visit. From that feedback, it became clear that end-user evaluations were premature for the second period of our technology-driven project. Hence, during the second period we evaluated particular components of our system, as discussed in Section H of this report. In the second set of Gantt charts we show our planned tasks for the third funding period.
We have broken these down by the same categories, except that we deleted the InfoBus category because it was redundant with the Testbed category (our testbed has become the embodiment of the InfoBus). The charts also show the main faculty, graduate students, and partners involved in each task, although there are many more interactions than can be represented here.

E. Supporting Transparencies

(See slides in appendix and separate file)

F. Selected Visits and Other Outside Contacts

Interviews:
* San Jose Mercury News
* Communications Week
* IDG News Service
* Uniforum Monthly
* IEEE Software
* Discover Magazine
* Wired Magazine
* LAN Times

Meetings with companies, universities and other organizations:
* At Home
* Cybercash
* Dow Jones
* Chemical
* Fulcrum
* Hewlett-Packard
* Hitachi
* IBM
* Infoseek
* Interval Research
* Microsoft Corporation
* NEC
* Oracle
* PLS
* SUN Microsystems
* Telecom, France
* Transarc
* Verity
* WAIS
* Xerox Palo Alto Research Center
* University of California at Berkeley
* Carnegie-Mellon University
* University of Colima, Mexico
* Cornell
* University of Illinois
* University of Michigan
* MIT Press
* Ngee Ann Polytechnic Library, Singapore
* University of Pennsylvania
* Princeton
* University of California at Santa Barbara
* CNRI
* CommerceNet (continuing contact)
* NSF, Arpa, NASA (several PI meetings, coordination meetings, etc.)
* Delegation of Japanese professors interested in supercomputing

Outside presentations:
* AAAI Spring Workshop, Stanford: "Information Gathering" (Terry Winograd)
* AAAS Meeting, Atlanta: "Overview of Stanford Digital Libraries Project" (Hector Garcia-Molina)
* AT&T: "Overview of Stanford Digital Libraries Project" (Hector Garcia-Molina)
* AAAI 1995 Spring Symposium on Information Gathering from Heterogeneous Distributed Environments: "Learning to Surf" (Marko Balabanovic)
* Allerton meeting, DL overview (Vicky Reich)
* ASIS meeting, DL overview (Vicky Reich)
* ARPA Information Technology Office meeting: "Overview of Stanford Digital Libraries Project" (Andreas Paepcke)
* University of California, Berkeley, Digital Libraries Seminar: "Challenges and Pitfalls on the Way to the Digital Library of the Future" (Hector Garcia-Molina)
* CARL, DL overview (Vicky Reich)
* Coalition for Networked Information: "Interoperability in the Stanford Digital Library" (Rebecca Lasher, Andreas Paepcke, Vicky Reich)
* DAGS '95: "Transaction protection for information buyers and sellers" (Steve Ketchpel) (Best Student Paper)
* Data Engineering Conference, Taipei, Keynote Invited Talk: "Challenges and Pitfalls on the Way to the Digital Library of the Future" (Hector Garcia-Molina)
* DL '95: "InterPay: Managing Multiple Payment Mechanisms in Digital Libraries" (Steve Ketchpel)
* German National Research Center GMD, Bonn, Germany: "The Stanford Digital Library Project and Annotation Service" (Martin Röscheisen)
* European Conference on Object-Oriented Programming (ECOOP '95), Denmark, panel organizer and presentation: "Object Technology and the World-Wide Information Infrastructure" (Andreas Paepcke)
* University of Illinois, Chicago, Distinguished Lecture: "Overview of the Stanford Digital Library Project" (Hector Garcia-Molina)
* University of Illinois, Urbana-Champaign, DLI Meeting, April '95 (Hector Garcia-Molina, Scott Hassan, Rebecca Lasher, David Levy (Xerox), Larry Masinter (Xerox), Vicky Reich, Bernie Rous (ACM), Bob Wasilaus (Navy), Terry Winograd)
* University of Illinois, Urbana-Champaign: "Interoperability in Digital Libraries" (Andreas Paepcke)
* Interval Research (Michelle Baldonado, Andreas Paepcke)
* Marino Institute, focus group (Vicky Reich)
* NSF/ARPA/NASA Digital Library Initiative Workshop: "Boolean Query Mapping Across Heterogeneous Information Sources" (Kevin Chang), "Interoperability in Digital Libraries" (Andreas Paepcke)
* SIGIR Conference, keynote address: "Digital vs. Libraries: Binding the Two Cultures" (Terry Winograd)
* ACM SIGMOD '95: "Copy Detection Mechanisms for Digital Documents" (S. Brin, J. Davis, and H. Garcia-Molina)
* Stanford University Center for the Study of Language and Information Industrial Affiliates Meeting: "Adaptive Information Retrieval on the World-Wide Web" (Marko Balabanovic)
* Stanford Center for the Study of Language and Information lecture series on Intelligent Agents: "Learning to Surf" (Marko Balabanovic)
* Stanford Library Association: "The Stanford Digital Library Project" (Rebecca Lasher, Vicky Reich, Terry Winograd)
* Stanford University satellite-televised guest lectures on DL user interface, NCSTRL and DL interoperability protocol (Steve Cousins, Rebecca Lasher, Andreas Paepcke)
* Stanford University Forum panel presentation: "The Future of Databases on the Information Superhighway" (Andreas Paepcke)
* University of California Santa Barbara: "Overview of Stanford Digital Libraries Project" (Hector Garcia-Molina)
* University of Utah, Distinguished Lecture Series: "Challenges and Pitfalls on the Way to the Digital Library of the Future" (Hector Garcia-Molina)
* VLDB (Conference on Very Large Databases) Panel: "The Future of Digital Journals" (Hector Garcia-Molina)
* VLDB Presentation: "Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies" (Luis Gravano)
* VLDB Presentation: "Copy Removal in SIFT" (Tak Woon Yan)
* WWW Conference: "Beyond Browsing: Shared Comments, SOAPs, Trails and On-line Communities" (Martin Röscheisen)
* Xerox PARC: DL overviews (Terry Winograd, Michelle Baldonado, Steve Cousins, Hector Garcia-Molina, Scott Hassan, Narayanan Shivakumar, Andreas Paepcke)

Regular meetings and seminars:
* Weekly digital library seminar, open to the public
* Weekly project meetings to coordinate work
* Executive committee meetings as needed
* Meetings with industrial partners, several times per year
* Digital Library related research paper discussion group

G.
Information Compiled and Provided

* Digital Library Initiative survey (Rebecca Lasher)
* Glossary of digital library related terms (Rebecca Lasher)
* Digital Library annotated bibliography (Steve Ketchpel)
* Information about the testbed, including pointers to relevant technologies

H. Extended Abstracts Of Project Components

Economic Issues -- Steven Ketchpel

The Stanford Digital Library goal of interoperation is important in commercial transactions. At one level, we need standardized interfaces to various commercial payment systems. At another level, interoperation among systems at different sites owned by different users is stymied by a lack of trust between the systems. Projects underway at Stanford during 1995 worked to address both of these problems. The development of the InterPay architecture and its implementation in the prototype showed the interoperability of different payment mechanisms. An exploration of the desired properties for exchanges between mistrusting parties led to several protocol suggestions, which may enable risk-reduced purchases from unknown parties. This work was expanded to the case of complex exchanges involving multiple parties that are mutually distrusting. These three projects are described in more detail below. In addition to these research and development efforts at Stanford, the digital library project was represented at CommerceNet meetings. Initial consultations are underway to leverage the digital library research into a broader commercial environment.

The InterPay architecture was presented at Digital Libraries '95. It proposes a method for managing financial interactions with for-pay digital library services. The approach accommodates multiple payment mechanisms, interaction models, and charging policies. Key components of the model are payment agents and payment capabilities, which encapsulate payment policy and the mechanism details of payment on behalf of the user.
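The division of labor between payment agents and payment capabilities might be sketched as follows. The class names, interfaces, and the auto-pay policy here are illustrative assumptions for exposition, not InterPay's actual API.

```python
# Illustrative sketch: a payment agent is a user-programmable policy
# object that monitors incoming invoices and decides whether and how
# to pay; a payment capability wraps one concrete fund-transfer
# mechanism. (Hypothetical names, not InterPay's real interfaces.)

class PaymentCapability:
    """Wraps one payment mechanism (e.g., an account, e-cash)."""
    def __init__(self, name):
        self.name = name

    def transfer(self, amount, payee):
        # Stands in for the real interaction with an online payment service.
        return f"paid {amount} to {payee} via {self.name}"

class PaymentAgent:
    """Pays small invoices automatically; asks the user about the rest."""
    def __init__(self, capabilities, auto_limit, ask_user):
        self.capabilities = capabilities   # preferred mechanisms, in order
        self.auto_limit = auto_limit       # pay without asking below this amount
        self.ask_user = ask_user           # callback for larger invoices

    def handle_invoice(self, amount, payee):
        if amount <= self.auto_limit or self.ask_user(amount, payee):
            return self.capabilities[0].transfer(amount, payee)
        return None  # user declined the payment

agent = PaymentAgent([PaymentCapability("account")], auto_limit=1.00,
                     ask_user=lambda amount, payee: False)
```

Because the policy lives in the agent and the mechanism in the capability, swapping in a new payment service does not disturb the user-level payment rules.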
Collection agents and collection capabilities provide similar encapsulation for the service provider. The architecture supports interactions ranging from individual users directly interacting with the service provider to institutional users accessing information brokers via a corporate library. The prototype development system implements the InterPay architecture, allowing access to real services under varying payment policies. It makes use of three different payment capabilities: 1) an account-based mechanism integrated directly into the library prototype; 2) DigiCash's E-cash payment system; and 3) the First Virtual payment system. Although these three mechanisms have different protocols and transport mechanisms, the library user is largely removed from the particulars of the payment mechanism, while a consistent user interface provides summary information on transactions made with all three mechanisms.

Although existing payment mechanisms protect the parties from snoopers who might be intercepting network messages, most do not provide much protection from misconduct by the other party involved in the negotiation, such as reneging on promises to pay or not providing the promised goods. At the Dartmouth Institute for Advanced Graduate Study, we presented three different approaches that address this deficiency. The first relies on the message delivery level for automatic acknowledgment of messages. The second makes use of a trusted third party which acts as an intermediary for the transfer of information, so that the seller can prove that the information was sent (even if the buyer claims it was never subsequently received). The third approach provides greater security, enabling the prevention of fraudulent transactions, rather than just providing proof after the fact. This approach places greater demands on the third party, essentially turning it into an escrow agent.
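The escrow variant can be illustrated with a toy sketch: the exchange completes only once both the payment and the goods have been deposited, so neither party can renege mid-transaction. The class and its methods are hypothetical, and a real protocol would add authentication, signed receipts, and timeouts.

```python
# Toy sketch of the escrow-style third party described above.
# (Hypothetical simplification of a fair-exchange protocol.)

class EscrowAgent:
    def __init__(self):
        self.payment = None
        self.goods = None

    def deposit_payment(self, amount):
        self.payment = amount
        return self._try_settle()

    def deposit_goods(self, item):
        self.goods = item
        return self._try_settle()

    def _try_settle(self):
        # Release both sides together, or report that we are still waiting.
        if self.payment is not None and self.goods is not None:
            return {"buyer_receives": self.goods,
                    "seller_receives": self.payment}
        return None  # waiting for the other party

escrow = EscrowAgent()
first = escrow.deposit_payment(10)     # seller has not delivered yet
result = escrow.deposit_goods("document")  # now both sides are in
```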
The transaction protection mechanisms described above apply only to exchanges between a pair of parties. When the transactions are more complex (perhaps requiring a broker or multiple service providers), these mechanisms are no longer sufficient. Therefore, we have isolated a new formal model called the "distributed commerce transaction" which addresses this shortcoming. This model includes a language for specifying these commercial exchange problems, and sequencing graphs, a formalism for determining whether a given exchange may occur. A new algorithm generates a feasible execution sequence of pairwise exchanges between parties (when it exists), thereby reducing the more complicated, distributed problem to individual pairwise exchanges where the approaches described above are applicable. Finally, the addition of indemnities may guarantee the actions of an untrusted party, further facilitating commercial transactions.

Task-oriented, Direct Manipulation Interface to the Digital Library Testbed -- Steve Cousins

If the goal of the larger project is to "glue together" many different digital library sources and services, the goal of our interface is to hide the "joints" from the end users as much as is possible and appropriate. Our interface is based on scenarios and published studies of library use. The most important lesson is that library use is part of a larger task context. Library users have goals that they want to achieve, and individual library activities are only important as a means of achieving those goals. Another lesson is that the problem is often not solved in a single session. Studies of library use almost uniformly conclude that systems should save result sets automatically for later use. Finally, there is much more to digital libraries than search. Libraries, and especially digital libraries, are made up of many services.
They range from search and retrieval, to services which help us understand what we have found, to mechanisms which help us manage our results, to services which help us pass on our newly-acquired conclusions to others. Our primary goal is to support user tasks. The tasks we have in mind are composite entities which are not instantaneously completed. For example, a user might want to buy a color printer. The corresponding library task would involve searching for information about color printers, retrieving promising articles, sifting through those articles, annotating relevant passages, compiling new documents such as lists of desirable features, and perhaps sharing the results with colleagues or the world. A larger task would be a professor preparing a course. Her work would involve accessing materials in the digital library, and potentially adding new materials such as an annotated bibliography or a syllabus which other professors could access. This task might be divided into sub-tasks for the various topics covered in the course. A digital library interface needs to provide affordances for the various components of each task. Each instance of a task should persist across time, since the task is unlikely to be completed in a single session. Based on our reading of library-use studies, we believe that user tasks require a tool that falls between a "scrapbook" and a completely-automated, custom application. Since user tasks involve an increasingly rich variety of services, our next goal is to design the interface to integrate the results from a broad array of services. In our example of the professor teaching a course, relevant services include document summarization, bibliography creation, and "sense-making" (understanding the results of broad searches). We use the term "service" to refer to computational objects which take digital library objects as input. Examples include documents, queries, and collections.
We are working with a list of about 30 types of services, ranging from complex information visualization services to (conceptually) straightforward translation services. Library services differ widely in the amount of time they require, so our third goal is to design the interface to handle widely varying time scales. The interface needs to let the user know before initiating a service whether it will take milliseconds or hours to complete. While a service is working, the interface needs to provide feedback on the progress of the service, and a means for interrupting the service. If the user has moved on to another task, the interface should continue to accept results from running services and compile them into a meaningful form for when the user returns to this task. Our fourth goal is to make the system extensible. The number of available services is constantly growing. Ideally, adding a new service to a task needs to be as easy as dropping a "service card" onto the interface for a task. The service card would describe the parameters needed to invoke the service. It would either contain the service (for example as a Java applet) or would point to a network object which would perform the service. When appropriate, it would also include a fee schedule. Service cards would be exchanged via electronic mail or retrieved from catalogs of services. Finally, the interface needs to support sharing and reuse of information processing knowledge. Bonnie Nardi has described how "local developers" of spreadsheet macros pop up in many different organizations. We expect that with a well-designed task-based interface, individuals could share expertise in informal and semi-formal ways. An individual who spent a lot of time on configuring her "color printer" research task might want to share that with a colleague looking to buy a new ethernet card. More formally, a digital librarian's job description could include the creation of specialized task templates for use by his patrons. 
Task representations also facilitate reuse by an individual, and could be used to manage a history of digital library activities. We have implemented an interface which begins to achieve these goals on top of the evolving InfoBus infrastructure. We have chosen to use direct manipulation, with a relatively straightforward mapping between library objects and screen representations. The basic types of objects in our interface are queries, documents, collections, and services. Services are activated by dropping queries, documents, or collections onto them. Collections support multiple views; we have currently implemented a tabular view and a simple graphical view. Clicking on documents causes them to become activated; they currently respond by instructing a Netscape web browser to display their contents. The interface we are building takes advantage of InfoBus protocols. For example, although we distinguish services and collections in the user interface, our implementation often keeps a link between a collection and the service that generated it, so that the user does not always have to wait until a collection is fully materialized before using it. Our current prototype provides an interface to the InfoBus search services and to services for sense-making, for summarizing documents, for doing copy detection, and for creating bibliographies from collections of document descriptions. The interface is running in prototype form, and has been demonstrated at the DLI workshop and to many visitors. We are continuing to add functionality and services to the interface, and to make it more robust.

Digital Library Interoperability Protocol -- Scott Hassan, Andreas Paepcke

Our base technology for managing interoperability problems in our testbed makes use of CORBA distributed objects.
We use them for three purposes: (i) to provide the software engineering advantages of object-oriented programming in a distributed setting, (ii) to help provide more unified interaction models among multiple, autonomous services, and (iii) to hide the complexity of remote interprocess communication. We build 'library service proxy' objects for the independent publication-related services we are interested in. These proxies hide the service-specific interaction models and communication protocols. Client objects communicate with service proxies through the Digital Library Interoperability Protocol (DLIOP). We have developed this protocol and deployed it in experiments both within Stanford and with the University of Michigan Digital Library project. The DLIOP combines advantages of Z39.50 and HTTP in an object-centric approach that affords significant amounts of flexibility for implementors. For example, here are some of the areas for which the protocol carefully preserves implementation flexibility:

* Services can cache result sets of searches for possible future use. Alternatively, clients may in addition or instead cache some or all of the information.
* It is possible to instantiate and materialize the objects in a result set (e.g., documents) at various points in time and at various locations. For instance, a pre-fetching strategy may materialize documents at the client side before their contents are requested. An on-demand-only scheme can instead choose to wait until an application program asks for the contents of a given document. Where and when document objects are materialized is constrained only by end user needs, not by the protocol.
* Processing tasks can dynamically be off-loaded to other machines, including the client computer, even while a client/server interaction is in progress.

Existing information access protocols typically do not provide flexibility across all these dimensions.
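The materialization flexibility described above can be sketched as follows; the class, method names, and the simple caching policy are illustrative assumptions, not the protocol's actual interfaces.

```python
# Sketch of pre-fetch vs. on-demand materialization of a result item.
# The protocol leaves the choice open: a client may materialize the
# document eagerly, or defer the remote fetch until first access and
# then reuse the local copy. (Hypothetical names and policy.)

class ResultItem:
    def __init__(self, doc_id, fetch):
        self.doc_id = doc_id
        self._fetch = fetch      # callable that retrieves contents remotely
        self._contents = None    # not materialized yet

    def prefetch(self):
        """Eager strategy: materialize before the client asks."""
        self._contents = self._fetch(self.doc_id)

    def contents(self):
        """On-demand strategy: materialize on first access, then reuse."""
        if self._contents is None:
            self._contents = self._fetch(self.doc_id)
        return self._contents

fetch_log = []
def fetch(doc_id):
    fetch_log.append(doc_id)     # stands in for a remote call to the proxy
    return f"contents of {doc_id}"

item = ResultItem("doc42", fetch)
item.contents()   # first access triggers the remote fetch
item.contents()   # served from the local copy; no second fetch
```

Either strategy satisfies the same client-side interface, which is exactly the freedom the protocol tries to preserve.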
For interoperability in our testbed setting, we prefer a protocol that does not fix these choices, allowing, for example, a provider to asynchronously and incrementally push information and associated management responsibility to the client. The protocol takes advantage of the encapsulation properties of object-oriented programming by providing a client-side result collection object which hides all the state and processing details of interacting with the remote service proxies. The result collection communicates queries to the proxy. The proxy then asynchronously adds result items to the collection, while client programs simultaneously extract them from the same collection object. Information is streamed from the service, through the collection, to the client program. Client programs thus have the illusion that their queries are all handled locally, and that the act of creating a result collection for a query simply produces a set of results within that collection. Result item references within result collections are called 'access capabilities'. These in turn each contain multiple 'access options' which may be tried sequentially when dereferencing the capabilities. The first options are generally cache pointers into the proxy. When they succeed, they ensure fast response time. If dereferencing is delayed until after the remote proxy has chosen to discard its cache, the other access options provide for successively more expensive, but longer-lived access. See our IEEE Computer article for details.

Evaluation Activities -- Luis Gravano, Frankie James, Narayanan Shivakumar

The evaluation of mechanisms and systems is a central component of our project. In the initial stages of the project, our technologies have not been fielded and used by significant numbers of users. Thus, our evaluations have not yet looked at "end-user satisfaction". Instead, our evaluations have focused on the effectiveness of individual pieces of technology.
The major completed evaluations performed this year are as follows:

* For gGLOSS, we used logs of real user queries to evaluate the success of gGLOSS in suggesting good sources. The suggestions produced by gGLOSS on the real queries were compared against the results of actually submitting the queries to all the sources directly. The results can be found in our VLDB'95 paper.
* For copy detection, we have not only analyzed the run time performance of the various schemes, but we have also evaluated the accuracy of the detection. For the latter, we compared the duplicates flagged by our prototype service to those that were identified as documents with significant overlap by a real person reading the documents. This study highlights the tradeoff between high detection accuracy and good run time performance. For details see our paper in DL'96.

In addition, we are currently planning or carrying out evaluations of our query translation algorithms, our interoperability protocol, our audio interfaces, and SenseMaker. In the first two cases, the focus will be on performance (e.g., computation cost, network traffic generated). In the last two cases, the experiments will involve human subjects.

Audio Interfaces to Hypertext -- Frankie James

Every day, more and more information is being made available online to the general public in the form of electronic documents. Since the advent of the WWW, hypertext (in particular, HTML) has become the medium of choice for the presentation of these documents. This is because HTML can be used to present not only the text of a document, but also much of its structure. The ability to use this structure in a generic (multimodal) way would mean that electronic documents could be accessible to everyone, even non-standard users such as blind users or users connecting to the WWW via the telephone. Traditionally, blind computer users have accessed documents and information through ASCII text files.
This method preserves the textual content of a document well, but has problems with visual content such as tables, figures, and font changes. Visual content can be subdivided into elements that indicate structure (such as tables, or the use of typeface and type style to denote headings) and elements that are more purely visual, such as pictures. Either kind can be a fundamental part of a document, and as much of it as possible should be preserved for blind users. HTML, with its use of markup tags, explicitly represents the structural visual content of a document as well as the textual content. This content is needed not only for getting a sense of the document's overall structure but, since we are dealing with hypertext, also for navigation between documents. That is, with hypertext we are interested not only in reading and browsing a single document, but also in navigating and browsing within the larger space of multiple documents. Although these tasks are currently limited to the visual domain, they should be possible in other modalities such as audio. Presenting document structure in audio will need to be based on a body of work in the fields of typography (to see what it is we are really representing), communications, and HCI for blind users (specifically, GUI access). The current focus of our work is on implementing an experiment which tests four different auditory interfaces to a set of HTML pages to see what parts of the audio space are more or less useful for representing particular document structures and HTML tags. The audio space is too broad to advocate only one type of structuring technique; therefore, the test interfaces incorporate many different elements, including voice changes, speaker changes, and non-speech sound effects.
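One way to make such structure-to-audio mappings concrete is a declarative table from HTML tags to audio cues. The sketch below is purely illustrative: the tag set, cue names, and rendering steps are our own assumptions, not the four interfaces actually being tested.

```python
# Illustrative mapping from HTML structural tags to hypothetical audio cues.
# Speaker changes and non-speech sounds stand in for visual typography.
AUDIO_CUES = {
    "h1": {"speaker": "announcer", "pitch": "low", "pre_sound": "chime"},
    "h2": {"speaker": "announcer", "pitch": "mid", "pre_sound": None},
    "a":  {"speaker": "second_voice", "pitch": "default", "pre_sound": "click"},
    "em": {"speaker": "default", "pitch": "high", "pre_sound": None},
}

DEFAULT_CUE = {"speaker": "default", "pitch": "default", "pre_sound": None}

def render_plan(fragments):
    """Turn (tag, text) fragments into an ordered list of audio steps:
    optional non-speech sounds followed by speech in the chosen voice."""
    plan = []
    for tag, text in fragments:
        cue = AUDIO_CUES.get(tag, DEFAULT_CUE)
        if cue["pre_sound"]:
            plan.append(("play", cue["pre_sound"]))
        plan.append(("speak", cue["speaker"], cue["pitch"], text))
    return plan

# A hypothetical page: heading, body text, and a hyperlink.
page = [("h1", "Annual Report"),
        ("p", "Progress this year..."),
        ("a", "next section")]
plan = render_plan(page)
```

A table like this makes alternative cue assignments easy to swap in and out, which is essentially what comparing several candidate audio interfaces requires.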
However, a primary interest in this experiment (and our future research) is to explore the usefulness of speaker changes for marking various kinds of HTML structure in a document, since this technique has proven useful in traditional radio broadcasts but is largely unexplored in computer interface design. We anticipate that these experiments will first of all give a clearer picture of the issues involved in presenting document structure in the audio domain. This should be different from other presentations of information in audio, since documents have for so long been grounded in the visual. The experiments should also provide insight into how hypertext documents are used in general, as opposed to how they are used given that we are currently constrained to using them in a visual environment. Once we determine the cognitive tasks involved in using hypertext, these can be mapped directly into either the visual or audio modality, depending on the preferences or abilities of the individual user.

SenseMaker -- Michelle Baldonado

We have developed a relation-centric user interaction model for browsing in the digital library. The motivation for this new model comes from our belief that the digital library of the future will follow the current trend of providing access to more and more heterogeneous search services. Accordingly, users of the future digital library who are engaged in an information seeking task will find that more and more citations and documents match their specifications. Possible solutions to this problem include filtering the results that are returned and narrowing both the supplied query and the set of search services to which that query is submitted. However, when a user does not have a well-defined goal in mind, but instead is engaging in an exploratory search (browsing), answering the question of how to filter results or to narrow queries and search service sets can be quite difficult.
As an alternative, we propose that digital library interfaces for browsing should make the inter-result relation, rather than the individual result or document, their primary unit of interaction. By moving to a model where the unit of interaction is the relation, we can build improved interfaces both for result analysis and for search expansion. At the analysis level, users can "make sense" of their results by looking at them in terms of the relations that hold among the documents they describe (e.g., which documents share similar titles? which documents share common authors?). This approach is valuable in that it allows the user to cut down on the number of entities that must be scanned. In addition, it gives the user a feel for the common characteristics of the results. At the search expansion level, users can ask for new results that fit into a relation of interest (e.g., find me more documents with a title like this, find me more documents by this author). This approach offers a user a natural way of expanding her search results without needing to drop down to the level of individual results.

Readers familiar with browsing models developed to support full-text clustering (notably, Scatter/Gather from Xerox PARC) and relevance feedback will notice that both of these strategies fit into the relation-centric model that we are proposing. Browsing models based on full-text clustering can be viewed as revolving around the question "What documents have similar full texts?" Likewise, models based on relevance feedback can be viewed as revolving around the question "What documents have full texts similar to the full texts of these documents, but dissimilar from the full texts of these other documents?" In fact, we argue that our model is both a generalization from and a unification of the models developed to support these two strategies.
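The analysis-level operations above can be sketched as grouping result records by the equivalence classes a relation induces. The records, field names, and the crude case-insensitive notion of "similar title" below are all illustrative assumptions, not SenseMaker's implementation:

```python
from collections import defaultdict

def group_by_relation(results, key_fn):
    """Group result records by the value that defines 'the same'
    under a given relation; each group is one unit of interaction."""
    groups = defaultdict(list)
    for record in results:
        groups[key_fn(record)].append(record)
    return dict(groups)

# Hypothetical result records gathered from several search services.
results = [
    {"title": "Digital Libraries Today", "author": "Smith", "source": "Lycos"},
    {"title": "Digital libraries today", "author": "Jones", "source": "Dialog"},
    {"title": "Copy Detection",          "author": "Smith", "source": "Inspec"},
]

# "Common author" relation: exact match on the author field.
by_author = group_by_relation(results, key_fn=lambda r: r["author"])

# "Similar title" relation, crudely approximated by case-insensitive match.
by_title = group_by_relation(results, key_fn=lambda r: r["title"].lower())
```

The same `group_by_relation` call works for any relation a user picks, which is the sense in which the relation, not the individual record, becomes the unit of interaction.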
An important consequence of a generalized relation-centric model is that it moves the user away from a focus on full-text similarity to a whole family of multi-dimensional relations. We believe that this step up in abstraction will be especially important in a heterogeneous digital library which includes a combination of citations, abstracts, documents, multimedia, etc. In addition to the theoretical work on developing a relation-centric user interaction model, we have also implemented a prototype tool, SenseMaker, to experiment with the interaction strategies suggested by the model. Currently, SenseMaker allows users to perform relation-centric interactive analysis of results from heterogeneous search services. It mediates between the user and six distinct search services (WebCrawler, Lycos, Inktomi, InfoSeek's Web database, Dialog's 275 database, and Folio's Inspec database). It communicates with these services using the Stanford interoperability protocol, and can be accessed via both Web and non-Web interfaces. In the coming year, we will extend SenseMaker to include the capability for relation-based feedback interactions, as well as to incorporate relation-based duplicate detection.

Boolean Query Translation -- Kevin Chen-Chuan Chang

Emerging Digital Libraries can provide a wealth of information. However, there is also a wealth of search engines behind these libraries, each with a different query language. Our goal is to provide a front-end to a collection of Digital Libraries that hides, as much as possible, this heterogeneity. As a first step, we focus on translating Boolean queries from a generalized form into queries that use only the functionality and syntax provided by a particular target search engine. We initially look at Boolean queries because they are used by most current commercial systems; eventually we will incorporate other types of queries, such as vector-space and probabilistic-model ones.
To illustrate our approach, suppose that a user is interested in documents discussing multiprocessors and distributed systems. Say the user's query is originally formulated as: "Title Contains multiprocessor And distributed (W) system". This query selects documents with the three given words in the title field; furthermore, the (W) proximity operator specifies that the word "distributed" must immediately precede "system". Now assume that the user wishes to query a source which does not understand the (W) operator. In this case, our approach is to approximate the predicate "distributed (W) system" by the closest predicate supported by the source, for example "distributed And system". This predicate requires that the two words appear in matching documents, but in any position. Thus, the native query sent to the source, in the syntax it understands, is: "Find Title multiprocessor And distributed And system". The native query will return a preliminary result set that is a superset of what the user expects. Therefore, an additional post-filtering step is required at the front-end to eliminate from the preliminary result those documents that do not have the words "distributed" and "system" occurring next to each other. In particular, the required filter query is: "Title Contains distributed (W) system". The problem of multiple and heterogeneous IR systems has been observed since the early 1970s. Since then, many solutions have been proposed to address it. Our approach differs from others mainly in that the front-end language is uniform across underlying sources while still providing powerful search features not necessarily supported by all systems. Our research work started with a feature analysis of the query languages of some typical text retrieval systems, including Dialog, WAIS, STN, BRS, and Stanford-Folio. Based on this study of Boolean systems, we designed a front-end Boolean language.
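The translate-then-filter idea illustrated above can be sketched as follows. The predicate encoding and function names are our own hedged stand-ins, not the project's actual algorithms, which operate on a full Boolean language and a query normal form:

```python
def translate(predicates, supports_proximity):
    """Split a conjunctive title query into a native query the target
    understands plus a filter query applied to the preliminary results.
    Each predicate is ('word', w) or ('adjacent', w1, w2) for (W)."""
    native, filters = [], []
    for pred in predicates:
        if pred[0] == "adjacent" and not supports_proximity:
            # Weaken (W) to plain conjunction for the native query ...
            native += [("word", pred[1]), ("word", pred[2])]
            # ... and keep the stronger predicate for post-filtering.
            filters.append(pred)
        else:
            native.append(pred)
    return native, filters

def matches(title, predicates):
    """Post-filter: check a result title against the filter predicates."""
    words = title.lower().split()
    for pred in predicates:
        if pred[0] == "word" and pred[1] not in words:
            return False
        if pred[0] == "adjacent" and (pred[1], pred[2]) not in zip(words, words[1:]):
            return False
    return True

# "Title Contains multiprocessor And distributed (W) system"
query = [("word", "multiprocessor"), ("adjacent", "distributed", "system")]
native, filters = translate(query, supports_proximity=False)

# Preliminary results from the weakened native query: a superset.
preliminary = [
    "Multiprocessor distributed system design",    # satisfies (W)
    "Distributed multiprocessor system overview",  # words present, not adjacent
]
final = [t for t in preliminary if matches(t, filters)]
```

Here the second title survives the native query but is removed by the filter, which is exactly the extra-document cost the minimal-translation work tries to bound.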
Meanwhile, we have completed theoretical work for translating front-end Boolean queries into target-specific queries and the corresponding filter queries required to carry out features not supported by a source. We have proposed transformation algorithms, based on a query normal form, that solve this mapping problem by generating native queries that are minimal in terms of the number of extra documents retrieved, together with filter queries that are optimal in the sense of least processing effort.

gGlOSS -- Luis Gravano

As large numbers of text databases have become available on the Internet, it is becoming harder to locate the right sources for given queries. To address this problem, we designed gGlOSS, a generalized Glossary-Of-Servers Server, that keeps statistics on the available databases to estimate which ones are potentially the most useful for a given query. During the reporting period, we extended our original GlOSS beyond its previous capability, which focused on databases using the Boolean model of document retrieval. gGlOSS also covers databases using the more sophisticated vector-space retrieval model. We evaluated our new techniques experimentally using real-user queries and 53 databases. Also, we further generalized our approach by showing how to build a hierarchy of gGlOSS brokers. The top level of the hierarchy is so small that it could be widely replicated, even at end-user workstations. We presented our main results in a paper that appeared at the VLDB'95 conference in Zurich, Switzerland.

We are acting as brokers and designers for an informal agreement among major search engine providers and their users. The agreement is intended to address interoperability issues in three areas: (i) resource discovery, (ii) query submission and (iii) rank merging of result sets from the various engines. If this effort succeeds, the engine providers will include a standardized interface to their engines to support clients interacting with multiple engines of the various vendors.
Participants are Microsoft Corporation, InfoSeek, WAIS Inc., Fulcrum, Verity and PLS. We have met with each of the companies to gather their input. After consolidating this data into a design, we will go through a second round of consultation with each participant to work out the final agreement. It is then expected that the search engine companies will provide the agreed-upon interfaces.

Adaptive Search Agents -- Marko Balabanovic

Work has been completed on the design of a new architecture for adaptive information searching agents. Given a group of users, the "Fab" system will adapt over time to deliver documents interesting to individuals within the group. We hope to take advantage of common interests within the group to allow more efficient scaling, so that eventually the system can serve a large number of users. Currently the documents are found by the system on the World-Wide Web. The implementation of the system, started in August 1995, is nearing completion. A major difficulty with these systems has been finding valid ways to evaluate them, as the commonly used models from the information retrieval literature do not apply to this domain. We have recently completed an experimental design which will allow measurement of the improvement over time and absolute performance of the system in a scientifically and statistically valid way. The first experiment is scheduled to start at the beginning of March 1996.

I. Financial Report

Not publicly available.

J. Bibliography

[1] Marko Balabanovic and Yoav Shoham. Learning Information Retrieval Agents: Experiments with Automated Web Browsing. In Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Resources, March, 1995.
[2] M. Balabanovic, Y. Shoham, and Y. Yun. An Adaptive Agent for Automated Web Browsing. Journal of Visual Communication and Image Representation, 6(4), December, 1995.
[3] Michelle Q Wang Baldonado and Terry Winograd.
Techniques and Tools for Making Sense out of Heterogeneous Search Service Results. Number SIDL-WP-1995-0019. Stanford University, 1995.
[4] Michelle Q Wang Baldonado and Terry Winograd. A User Interaction Model for Browsing Based on Category-Level Operations. Number SIDL-WP-1996-0029. Stanford University, 1996.
[5] S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In Proceedings of SIGMOD '95, 1995.
[6] Edward Chang and Hector Garcia-Molina. Reducing Initial Latency in a Multimedia Storage System. Submitted for publication, 1996.
[7] Kevin Chen-Chuan Chang, Hector Garcia-Molina, and Andreas Paepcke. Boolean Query Mapping Across Heterogeneous Information Sources. Invited to IEEE Transactions on Knowledge and Data Engineering, 1996.
[8] Luis Gravano and Hector Garcia-Molina. Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. In Proceedings of VLDB '95, 1995.
[9] Steven Ketchpel. Transaction Protection for Information Buyers and Sellers. In DAGS '95, 1995.
[10] Steve B. Cousins, Steven P. Ketchpel, Andreas Paepcke, Hector Garcia-Molina, Scott W. Hassan, and Martin Roscheisen. InterPay: Managing Multiple Payment Mechanisms in Digital Libraries. In DL '95 Proceedings, 1995.
[11] Steven Ketchpel and Hector Garcia-Molina. Making Trust Explicit in Distributed Commerce Transactions. In Proceedings of the International Conference on Distributed Computing Systems, 1996.
[12] Daphne Koller and Mehran Sahami. Toward Optimal Feature Selection. Submitted for publication, 1996.
[13] Clifford Lynch and Hector Garcia-Molina. IITA Digital Libraries Workshop Report. Marianne Siroker, Stanford University, GATES 436, Stanford, CA 94305, May, 1995. Available on request.
[14] Andreas Paepcke, Steve B. Cousins, Hector Garcia-Molina, Scott W. Hassan, Steven K. Ketchpel, Martin Roscheisen, and Terry Winograd. Towards Interoperability in Digital Libraries: Overview and Selected Highlights of the Stanford Digital Library Project.
to appear in IEEE Computer Magazine, May, 1996.
[15] M. Roscheisen, C. Mogensen, and T. Winograd. Interaction Design for Shared World-Wide Web Annotations. In Proceedings of CHI '95, 1995.
[16] M. Roscheisen, C. Mogensen, and T. Winograd. Beyond Browsing: Shared Comments, SOAPs, Trails and On-line Communities. In Proceedings of the World-Wide Web Conference '95, Darmstadt, Germany, 1995.
[17] Martin Roscheisen, Terry Winograd, and Andreas Paepcke. Content Ratings and Other Third-Party Value-Added Information: Defining an Enabling Platform. CNRI D-Lib Magazine (http://www.dlib.org/dlib/august95/08contents.html), August, 1995.
[18] M. Roscheisen, C. Mogensen, and T. Winograd. A Platform for Third-Party Value-Added Information Providers: Architecture, Protocols, and Usage Examples. Stanford University, May, 1995.
[19] Narayanan Shivakumar and Hector Garcia-Molina. SCAM: A Copy Detection Mechanism for Digital Documents. In DL '95 Proceedings, 1995.
[20] N. Shivakumar and H. Garcia-Molina. The SCAM Approach to Copy Detection in Digital Libraries. CNRI D-Lib Magazine (http://www.dlib.org/dlib/november95/11contents.html), November, 1995.
[21] N. Shivakumar and H. Garcia-Molina. Building a Scalable and Accurate Copy Detection Mechanism. In Proceedings of DL '96, 1996.
[22] The Stanford Digital Library Project. Special Issue of the Communications of the ACM, 1995.

Other materials: Video: We have prepared a 10-minute videotape on WebWriter, entitled "WebWriter: Interface Development on the World Wide Web," shot September 22, 1995.

2. Plans and Direction

We plan to continue aggressive pursuit of all five of our program thrusts during the upcoming reporting period. In the infrastructure area, we will focus on service proxy stability and support for proxy maintenance. In the SenseMaker project, we will focus on the use of relations for defining what users consider 'the same document' for the purpose of duplicate elimination.
The query translation work will expand to include non-Boolean queries. SCAM will enable users to compare sets of documents against other sets of documents, rather than just a single document being tested against a set of reference documents. We will ensure that Z39.50 servers are reachable from the InfoBus, and that Z39.50 clients can access the InfoBus. We plan to explore learning algorithms for deployment in the document rank merging problem and to investigate new techniques that use the words appearing in retrieved documents to cluster them or classify them according to topic. We plan to apply decision-theoretic and economic criteria in deciding which data sources should be accessed to respond to a given query. We will install a separate testbed software branch which will lag behind the development branch in features, but which will be kept stable and running for demonstration purposes. We plan to expand our cooperation with the University of Illinois and the University of California at Santa Barbara by exchanging access to services and collections with both these institutions. Existing access to and from the University of Michigan will be expanded to go beyond search by including the remote access of publication-related services, such as summarization.

________________

I certify that to the best of my knowledge (1) the statements herein (excluding scientific hypotheses and scientific opinions) are true and complete, and (2) the text and graphics in this report as well as any accompanying publications or other documents, unless otherwise indicated, are the original work of the signatories or individuals working under their supervision. I understand that the willful provision of false information or concealing a material fact in this report(s) or any other communication submitted to NSF is a criminal offense (U.S. Code, Title 18, Section 1001).

Project Director Signature: _____________________________________

Enclosures: Supporting transparencies; Excerpts of papers published in 1995