Stanford Integrated Digital Library Annual Program Plan and Report
Reporting Period Feb. 1995 - Feb. 1996

1. Progress Report

A. Project Summary

DATE PREPARED: Feb 1, 1996
ORGANIZATION: Stanford University
PRINCIPAL INVESTIGATORS: Hector Garcia-Molina (hector@db.stanford.edu, 415-723-0685/415-725-2588), Terry Winograd (winograd@cs.stanford.edu, 415-723-2780)
PARTICIPATING INVESTIGATOR: Daphne Koller (koller@cs.stanford.edu, 415-723-6598)
TITLE OF EFFORT: Stanford Integrated Digital Library
ACCESS INFORMATION: http://www-diglib.stanford.edu

OBJECTIVE: To research, design, and implement technology that will allow users to interact with large numbers of heterogeneous services in digital libraries. To develop enabling technology that will allow economically feasible deployment of digital libraries.

APPROACH: We envision digital libraries as collections of autonomous, publication-related services, such as remotely used services for summarization, indexing, copy detection, payment, search, format conversion, etc. We use distributed object technology for our communication infrastructure and to ensure interaction-model interoperability among digital library services. We build 'library service proxies' (LSPs), which are CORBA objects that represent digital library services. Computer programs interact with these LSPs via remote method calls. Differences in service interaction models are 'smoothed out' by building appropriate interfaces to the proxies. This assists programmers in building digital library patron modules that interact with multiple services at a time. To help end users deal with heterogeneity, our user interface effort includes exploration of a drag-and-drop desktop which is user-configurable to reflect task-specific needs. Users build 'task component networks', which are interconnected visual representations of online services.
Users drag services onto the desktop and configure them to accomplish individual, recurring tasks, such as 'keeping up-to-date on company X'. Visual feedback allows users to monitor multiple services that may be running simultaneously on their behalf. For query interoperability we give users one rich front-end query language. For a given query, we then compute a best possible target-native query, and a post-filtering query. We submit the native query to the respective source and apply the filter query to the result documents to ensure full equivalence of result sets to the original user queries when possible. For payment interoperability we are developing InterPay, a layered architecture that allows (i) easy user-level customization of payment control and (ii) easy program-level integration of multiple payment schemes. It distinguishes between the client side and the service proxy side of a payment transaction. On the client side, it provides for 'payment agents' which are user-programmable software components which monitor incoming invoices and, if necessary, interact with the user to make payment decisions. 'Payment capabilities' are components which accomplish actual fund transfers by interacting with diverse online payment services. In order to help users make sense of documents returned from searches, our SenseMaker component uses recursive clustering techniques. Result documents are clustered along multiple dimensions, such as common origin, common authors or similar titles. Users can apply such clustering repeatedly on successive clusters to gain an understanding of the document set. SCAM addresses copyright problems in its exploration of algorithms for efficiently detecting partial overlap among the contents of documents. Individual documents are tested against existing collections of reference documents. We measure these algorithms for performance sensitivity to variations in parameters such as document comparison 'chunk size' and collection size. 
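The chunk-based comparison at the heart of this approach can be sketched as follows. This is a hypothetical simplification for illustration only: the chunk granularity, similarity measure, and threshold are stand-ins, not SCAM's actual parameters or data structures.

```python
# Sketch of chunk-based overlap detection in the spirit of SCAM.
# A document is decomposed into fixed-size word "chunks"; a new
# document is flagged when enough of its chunks also occur in a
# registered reference document. (Illustrative simplification.)

def chunks(text, size):
    """Split a document into overlapping word chunks of the given size."""
    words = text.split()
    return {" ".join(words[i:i + size])
            for i in range(0, max(len(words) - size + 1, 1))}

def overlap(doc, reference, chunk_size=3):
    """Fraction of the new document's chunks that occur in a reference."""
    d, r = chunks(doc, chunk_size), chunks(reference, chunk_size)
    return len(d & r) / len(d) if d else 0.0

def flag_copies(doc, collection, threshold=0.5, chunk_size=3):
    """Return ids of reference documents whose overlap exceeds the threshold."""
    return [ref_id for ref_id, ref in collection.items()
            if overlap(doc, ref, chunk_size) >= threshold]
```

A larger chunk size makes matches more specific but misses lightly edited copies, which is exactly the sensitivity the evaluation above measures.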
Our network-accessible InterBib service supports bibliography maintenance by converting and merging extended Refer and BibTex bibliographies to hyperlinked MIF and HTML. Users can also deliver Framemaker documents with embedded citation keys and corresponding bibliographic databases to the service. InterBib returns a new document with citations resolved and a bibliography appended. Resource discovery is assisted by the results of our efforts in query-based source selection. We are exploring efficient and feasible designs that maintain some amount of statistical information about the contents of multiple sources. Queries are tested against this limited central information to find the sources that will most likely yield a large set of results.

PROGRESS: We made major headway in the development of a protocol for the delivery of search services and documents. The protocol emphasizes flexibility in dynamically moving document collections and service computation among multiple resources. Our user interface has evolved into a prototype desktop for interacting with digital libraries. A query translation prototype was constructed for two very different target engines. We have carefully evaluated several of the developed algorithms and systems.

RECENT ACCOMPLISHMENTS:
* Development of Digital Library Interoperability Protocol
* Construction of drag-and-drop digital library desktop
* SenseMaker: recursive categorization of result documents
* Query translation for heterogeneous boolean search engines
* Interoperability with University of Michigan DL project
* Hired Tom Schirmer as full-time programmer
* Prof. Daphne Koller joined as participating investigator
* MIT Press and Hitachi joined as industrial partners

PLANS:
* Extension of query translation to additional search models
* Interoperability with UC Santa Barbara and University of Illinois
* Categorization of search results based on generalized relations among documents
* InfoBus accessibility through Z39.50
* View-based metadata and structural interoperability
* Continued evaluation of algorithms and interfaces
* Start of production branch for testbed software

TECHNOLOGY TRANSITION, SHARING, PARTNERING, ETC.: The University of Michigan and Stanford successfully accessed each other's resources through remote method calls and the DL interoperability protocol. Our distributed object approach has inspired several institutions to explore a similar route. We are acting as brokers and designers for an informal agreement among major search engine providers and their users. The agreement is intended to address interoperability issues in three areas: (i) resource discovery, (ii) query submission, and (iii) rank merging of result sets from the various engines. If this effort succeeds, the engine providers will include a standardized interface to their engines to support clients interacting with multiple engines of the various vendors. The selective dissemination project SIFT is being moved to a startup company: PANGEA Reference Systems.

B. Significant Event

Our efforts in integrating multiple heterogeneous Digital Library sources and services at the operational and interface levels were well represented during the dedication of Stanford's new Computer Science building. As part of the festivities, we demonstrated our testbed prototype to Bill Gates, Chairman of Microsoft Corporation, and to numerous internationally renowned academic and business leaders. Feedback on the demonstration was very encouraging. (See slide in appendix and separate file)

C. Quad Chart

(See slides in appendix and separate file)

D.
Gantt Chart

We include with this report the Gantt charts for our project (see appendix). The first set of charts is identical to the ones included in our previous annual report, except that we have indicated the progress made on each task through February 29, 1996 (progress is indicated by a dark line inside the shaded task rectangle). Since our plans for the third funding period have advanced substantially from what we envisioned a year ago, we have a second set of Gantt charts detailing the tasks for the third period, March 1, 1996 to February 28, 1997. As can be seen from the first set of charts, we made substantial progress on most of the tasks we had planned a year ago. In the Testbed category, we set up the basic infrastructure for our Information Bus, and connected various sources to it. We made substantial progress on some of our tasks for Phase II (third funding period), including the development of the Interoperability Protocol that has allowed us to exchange information with the University of Michigan. In the research categories we also made substantial progress. For example, we implemented a full user interface, developed our InterPay scheme for coordinating payments and charges to and from heterogeneous services, and developed the basic query translation algorithms. The only area where we did not complete most of the tasks was the Testbed Evaluation category. This is because we re-directed our evaluation efforts based on the feedback we received at the previous site visit. From that feedback, it became clear that end-user evaluations were premature for the second period of our technology-driven project. Hence, during the second period we evaluated particular components of our system, as discussed in Section H of this report. In the second set of Gantt charts we show our planned tasks for the third funding period.
We have broken these down by the same categories, except that we deleted the InfoBus category because it was redundant with the Testbed category (our testbed has become the embodiment of the InfoBus). The charts also show the main faculty, graduate students, and partners involved in each task, although there are many more interactions than can be represented here.

E. Supporting Transparencies

(See slides in appendix and separate file)

F. Selected Visits and Other Outside Contacts

Interviews:
* San Jose Mercury News
* Communications Week
* IDG News Service
* Uniforum Monthly
* IEEE Software
* Discover Magazine
* Wired Magazine
* LAN Times

Meetings with companies, universities and other organizations:
* At Home
* Cybercash
* Dow Jones
* Chemical
* Fulcrum
* Hewlett-Packard
* Hitachi
* IBM
* Infoseek
* Interval Research
* Microsoft Corporation
* NEC
* Oracle
* PLS
* SUN Microsystems
* Telecom, France
* Transarc
* Verity
* WAIS
* Xerox Palo Alto Research Center
* University of California at Berkeley
* Carnegie-Mellon University
* University of Colima, Mexico
* Cornell
* University of Illinois
* University of Michigan
* MIT Press
* Ngee Ann Polytechnic Library, Singapore
* University of Pennsylvania
* Princeton
* University of California at Santa Barbara
* CNRI
* CommerceNet (continuing contact)
* NSF, Arpa, NASA (several PI meetings, coordination meetings, etc.)
* Delegation of Japanese professors interested in supercomputing

Outside presentations:
* AAAI Spring Workshop, Stanford: "Information Gathering" (Terry Winograd)
* AAAS Meeting, Atlanta: "Overview of Stanford Digital Libraries Project" (Hector Garcia-Molina)
* AT&T: "Overview of Stanford Digital Libraries Project" (Hector Garcia-Molina)
* AAAI 1995 Spring Symposium on Information Gathering from Heterogeneous Distributed Environments: "Learning to Surf" (Marko Balabanovic)
* Allerton meeting, DL overview (Vicky Reich)
* ASIS meeting, DL overview (Vicky Reich)
* ARPA Information Technology Office meeting: "Overview of Stanford Digital Libraries Project" (Andreas Paepcke)
* University of California, Berkeley, Digital Libraries Seminar: "Challenges and Pitfalls on the Way to the Digital Library of the Future" (Hector Garcia-Molina)
* CARL, DL overview (Vicky Reich)
* Coalition for Networked Information: "Interoperability in the Stanford Digital Library" (Rebecca Lasher, Andreas Paepcke, Vicky Reich)
* DAGS '95: "Transaction protection for information buyers and sellers" (Steve Ketchpel) (Best Student Paper)
* Data Engineering Conference, Taipei, Keynote Invited Talk: "Challenges and Pitfalls on the Way to the Digital Library of the Future" (Hector Garcia-Molina)
* DL '95: "InterPay: Managing Multiple Payment Mechanisms in Digital Libraries" (Steve Ketchpel)
* German National Research Center GMD, Bonn, Germany: "The Stanford Digital Library Project and Annotation Service" (Martin Röscheisen)
* European Conference on Object-Oriented Programming (ECOOP '95), Denmark, panel organizer and presentation: "Object Technology and the World-Wide Information Infrastructure" (Andreas Paepcke)
* University of Illinois, Chicago, Distinguished Lecture: "Overview of the Stanford Digital Library Project" (Hector Garcia-Molina)
* University of Illinois, Urbana-Champaign, DLI Meeting, April '95 (Hector Garcia-Molina, Scott Hassan, Rebecca Lasher, David Levy (Xerox), Larry Masinter (Xerox), Vicky Reich, Bernie Rous (ACM), Bob Wasilaus (Navy), Terry Winograd)
* University of Illinois, Urbana-Champaign: "Interoperability in Digital Libraries" (Andreas Paepcke)
* Interval Research (Michelle Baldonado, Andreas Paepcke)
* Marino Institute, focus group (Vicky Reich)
* NSF/ARPA/NASA Digital Library Initiative Workshop: "Boolean Query Mapping Across Heterogeneous Information Sources" (Kevin Chang), "Interoperability in Digital Libraries" (Andreas Paepcke)
* SIGIR Conference, keynote address: "Digital vs. Libraries: Binding the Two Cultures" (Terry Winograd)
* ACM SIGMOD '95: "Copy Detection Mechanisms for Digital Documents" (S. Brin, J. Davis, and H. Garcia-Molina)
* Stanford University Center for the Study of Language and Information Industrial Affiliates Meeting: "Adaptive Information Retrieval on the World-Wide Web" (Marko Balabanovic)
* Stanford Center for the Study of Language and Information lecture series on Intelligent Agents: "Learning to Surf" (Marko Balabanovic)
* Stanford Library Association: "The Stanford Digital Library Project" (Rebecca Lasher, Vicky Reich, Terry Winograd)
* Stanford University satellite-televised guest lectures on DL user interface, NCSTRL and DL interoperability protocol (Steve Cousins, Rebecca Lasher, Andreas Paepcke)
* Stanford University Forum panel presentation: "The Future of Databases on the Information Superhighway" (Andreas Paepcke)
* University of California Santa Barbara: "Overview of Stanford Digital Libraries Project" (Hector Garcia-Molina)
* University of Utah, Distinguished Lecture Series: "Challenges and Pitfalls on the Way to the Digital Library of the Future" (Hector Garcia-Molina)
* VLDB (Conference on Very Large Databases) Panel: "The Future of Digital Journals" (Hector Garcia-Molina)
* VLDB Presentation: "Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies" (Luis Gravano)
* VLDB Presentation: "Copy Removal in SIFT" (Tak Woon Yan)
* WWW Conference: "Beyond Browsing: Shared Comments, SOAPs, Trails and On-line Communities" (Martin Röscheisen)
* Xerox PARC: DL overviews (Terry Winograd, Michelle Baldonado, Steve Cousins, Hector Garcia-Molina, Scott Hassan, Narayanan Shivakumar, Andreas Paepcke)

Regular meetings and seminars:
* Weekly digital library seminar, open to the public
* Weekly project meetings to coordinate work
* Executive committee meetings as needed
* Meetings with industrial partners, several times per year
* Digital Library related research paper discussion group

G.
Information Compiled and Provided

* Digital Library Initiative survey (Rebecca Lasher)
* Glossary of digital library related terms (Rebecca Lasher)
* Digital Library annotated bibliography (Steve Ketchpel)
* Information about the testbed, including pointers to relevant technologies

H. Extended Abstracts Of Project Components

Economic Issues -- Steven Ketchpel

The Stanford Digital Library goal of interoperation is important in commercial transactions. At one level, we need standardized interfaces to various commercial payment systems. At another level, interoperation among systems at different sites owned by different users is stymied by a lack of trust between the systems. Projects underway at Stanford during 1995 worked to address both of these problems. The development of the InterPay architecture and its implementation in the prototype showed the interoperability of different payment mechanisms. An exploration of the desired properties for exchanges between mistrusting parties led to several protocol suggestions, which may enable risk-reduced purchases from unknown parties. This work was expanded to the case of complex exchanges involving multiple parties that are mutually distrusting. These three projects are described in more detail below. In addition to these research and development efforts at Stanford, the digital library project was represented at CommerceNet meetings. Initial consultations are underway to leverage the digital library research into a broader commercial environment.

The InterPay architecture was presented at Digital Libraries '95. It proposes a method for managing financial interactions with for-pay digital library services. The approach accommodates multiple payment mechanisms, interaction models, and charging policies. Key components of the model are payment agents and payment capabilities, which encapsulate payment policy and the mechanism details of payment on behalf of the user.
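The division of labor between payment agents and payment capabilities might be sketched as follows. The class names, interfaces, and the auto-pay policy here are illustrative assumptions for exposition, not InterPay's actual API.

```python
# Illustrative sketch: a payment agent is a user-programmable policy
# object that monitors incoming invoices and decides whether and how
# to pay; a payment capability wraps one concrete fund-transfer
# mechanism. (Hypothetical names, not InterPay's real interfaces.)

class PaymentCapability:
    """Wraps one payment mechanism (e.g., an account, e-cash)."""
    def __init__(self, name):
        self.name = name

    def transfer(self, amount, payee):
        # Stands in for the real interaction with an online payment service.
        return f"paid {amount} to {payee} via {self.name}"

class PaymentAgent:
    """Pays small invoices automatically; asks the user about the rest."""
    def __init__(self, capabilities, auto_limit, ask_user):
        self.capabilities = capabilities   # preferred mechanisms, in order
        self.auto_limit = auto_limit       # pay without asking below this amount
        self.ask_user = ask_user           # callback for larger invoices

    def handle_invoice(self, amount, payee):
        if amount <= self.auto_limit or self.ask_user(amount, payee):
            return self.capabilities[0].transfer(amount, payee)
        return None  # user declined the payment

agent = PaymentAgent([PaymentCapability("account")], auto_limit=1.00,
                     ask_user=lambda amount, payee: False)
```

Because the policy lives in the agent and the mechanism in the capability, swapping in a new payment service does not disturb the user-level payment rules.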
Collection agents and collection capabilities provide similar encapsulation for the service provider. The architecture supports interactions ranging from individual users directly interacting with the service provider to institutional users accessing information brokers via a corporate library. The prototype development system implements the InterPay architecture, allowing access to real services under varying payment policies. It makes use of three different payment capabilities: 1) an account-based mechanism integrated directly into the library prototype; 2) DigiCash's E-cash payment system; and 3) the First Virtual payment system. Although these three mechanisms have different protocols and transport mechanisms, the library user is largely removed from the particulars of the payment mechanism, while a consistent user interface provides summary information on transactions made with all three mechanisms.

Although existing payment mechanisms protect the parties from snoopers who might be intercepting network messages, most do not provide much protection from misconduct by the other party involved in the negotiation, such as reneging on promises to pay or not providing the promised goods. At the Dartmouth Institute for Advanced Graduate Study, we presented three different approaches that address this deficiency. The first relies on the message delivery level for automatic acknowledgment of messages. The second makes use of a trusted third party which acts as an intermediary for the transfer of information, so that the seller can prove that the information was sent (even if the buyer claims it was never subsequently received). The third approach provides greater security, enabling the prevention of fraudulent transactions, rather than just providing proof after the fact. This approach places greater demands on the third party, essentially turning it into an escrow agent.
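The escrow variant can be illustrated with a toy sketch: the exchange completes only once both the payment and the goods have been deposited, so neither party can renege mid-transaction. The class and its methods are hypothetical, and a real protocol would add authentication, signed receipts, and timeouts.

```python
# Toy sketch of the escrow-style third party described above.
# (Hypothetical simplification of a fair-exchange protocol.)

class EscrowAgent:
    def __init__(self):
        self.payment = None
        self.goods = None

    def deposit_payment(self, amount):
        self.payment = amount
        return self._try_settle()

    def deposit_goods(self, item):
        self.goods = item
        return self._try_settle()

    def _try_settle(self):
        # Release both sides together, or report that we are still waiting.
        if self.payment is not None and self.goods is not None:
            return {"buyer_receives": self.goods,
                    "seller_receives": self.payment}
        return None  # waiting for the other party

escrow = EscrowAgent()
first = escrow.deposit_payment(10)     # seller has not delivered yet
result = escrow.deposit_goods("document")  # now both sides are in
```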
The transaction protection mechanisms described above apply only to exchanges between a pair of parties. When the transactions are more complex (perhaps requiring a broker or multiple service providers), these mechanisms are no longer sufficient. Therefore, we have isolated a new formal model called the "distributed commerce transaction" which addresses this shortcoming. This model includes a language for specifying these commercial exchange problems, and sequencing graphs, a formalism for determining whether a given exchange may occur. A new algorithm generates a feasible execution sequence of pairwise exchanges between parties (when it exists), thereby reducing the more complicated, distributed problem to individual pairwise exchanges where the approaches described above are applicable. Finally, the addition of indemnities may guarantee the actions of an untrusted party, further facilitating commercial transactions.

Task-oriented, Direct Manipulation Interface to the Digital Library Testbed -- Steve Cousins

If the goal of the larger project is to "glue together" many different digital library sources and services, the goal of our interface is to hide the "joints" from the end users as much as is possible and appropriate. Our interface is based on scenarios and published studies of library use. The most important lesson is that library use is part of a larger task context. Library users have goals that they want to achieve, and individual library activities are only important as a means of achieving those goals. Another lesson is that the problem is often not solved in a single session. Studies of library use almost uniformly conclude that systems should save result sets automatically for later use. Finally, there is much more to digital libraries than search. Libraries, and especially digital libraries, are made up of many services.
They range from search and retrieval, to services which help us understand what we have found, to mechanisms which help us manage our results, to services which help us pass on our newly-acquired conclusions to others. Our primary goal is to support user tasks. The tasks we have in mind are composite entities which are not instantaneously completed. For example, a user might want to buy a color printer. The corresponding library task would involve searching for information about color printers, retrieving promising articles, sifting through those articles, annotating relevant passages, compiling new documents such as lists of desirable features, and perhaps sharing the results with colleagues or the world. A larger task would be a professor preparing a course. Her work would involve accessing materials in the digital library, and potentially adding new materials such as an annotated bibliography or a syllabus which other professors could access. This task might be divided into sub-tasks for the various topics covered in the course. A digital library interface needs to provide affordances for the various components of each task. Each instance of a task should persist across time, since the task is unlikely to be completed in a single session. Based on our reading of library-use studies, we believe that user tasks require a tool that falls between a "scrapbook" and a completely-automated, custom application. Since user tasks involve an increasingly rich variety of services, our next goal is to design the interface to integrate the results from a broad array of services. In our example of the professor teaching a course, relevant services include document summarization, bibliography creation, and "sense-making" (understanding the results of broad searches). We use the term "service" to refer to computational objects which take digital library objects as input. Examples include documents, queries, and collections.
We are working with a list of about 30 types of services, ranging from complex information visualization services to (conceptually) straightforward translation services. Library services differ widely in the amount of time they require, so our third goal is to design the interface to handle widely varying time scales. The interface needs to let the user know before initiating a service whether it will take milliseconds or hours to complete. While a service is working, the interface needs to provide feedback on the progress of the service, and a means for interrupting the service. If the user has moved on to another task, the interface should continue to accept results from running services and compile them into a meaningful form for when the user returns to this task. Our fourth goal is to make the system extensible. The number of available services is constantly growing. Ideally, adding a new service to a task needs to be as easy as dropping a "service card" onto the interface for a task. The service card would describe the parameters needed to invoke the service. It would either contain the service (for example as a Java applet) or would point to a network object which would perform the service. When appropriate, it would also include a fee schedule. Service cards would be exchanged via electronic mail or retrieved from catalogs of services. Finally, the interface needs to support sharing and reuse of information processing knowledge. Bonnie Nardi has described how "local developers" of spreadsheet macros pop up in many different organizations. We expect that with a well-designed task-based interface, individuals could share expertise in informal and semi-formal ways. An individual who spent a lot of time on configuring her "color printer" research task might want to share that with a colleague looking to buy a new ethernet card. More formally, a digital librarian's job description could include the creation of specialized task templates for use by his patrons. 
Task representations also facilitate reuse by an individual, and could be used to manage a history of digital library activities. We have implemented an interface which begins to achieve these goals on top of the evolving InfoBus infrastructure. We have chosen to use direct manipulation, with a relatively straightforward mapping between library objects and screen representations. The basic types of objects in our interface are queries, documents, collections, and services. Services are activated by dropping queries, documents, or collections onto them. Collections support multiple views; we have currently implemented a tabular view and a simple graphical view. Clicking on documents causes them to become activated; they currently respond by instructing a Netscape web browser to display their contents. The interface we are building takes advantage of InfoBus protocols. For example, although we distinguish services and collections in the user interface, our implementation often keeps a link between a collection and the service that generated it, so that the user does not always have to wait until a collection is fully materialized before using it. Our current prototype provides an interface to the InfoBus search services and to services for sense-making, for summarizing documents, for doing copy detection, and for creating bibliographies from collections of document descriptions. The interface is running in prototype form, and has been demonstrated at the DLI workshop and to many visitors. We are continuing to add functionality and services to the interface, and to make it more robust.

Digital Library Interoperability Protocol -- Scott Hassan, Andreas Paepcke

Our base technology for managing interoperability problems in our testbed makes use of CORBA distributed objects.
We use them for three purposes: (i) to provide the software engineering advantages of object-oriented programming in a distributed setting, (ii) to help provide more unified interaction models among multiple, autonomous services, and (iii) to hide the complexity of remote interprocess communication. We build 'library service proxy' objects for the independent publication-related services we are interested in. These proxies hide the service-specific interaction models and communication protocols. Client objects communicate with service proxies through the Digital Library Interoperability Protocol (DLIOP). We have developed this protocol and deployed it in experiments both within Stanford and with the University of Michigan Digital Library project. The DLIOP combines advantages of Z39.50 and HTTP in an object-centric approach that affords significant amounts of flexibility for implementors. For example, here are some of the areas for which the protocol carefully preserves implementation flexibility:

* Services can cache result sets of searches for possible future use. Alternatively, clients may in addition or instead cache some or all of the information.
* It is possible to instantiate and materialize the objects in a result set (e.g., documents) at various points in time and at various locations. For instance, a pre-fetching strategy may materialize documents at the client side before their contents are requested. An on-demand-only scheme can instead choose to wait until an application program asks for the contents of a given document. Where and when document objects are materialized is constrained only by end user needs, not by the protocol.
* Processing tasks can dynamically be off-loaded to other machines, including the client computer, even while a client/server interaction is in progress.

Existing information access protocols typically do not provide flexibility across all these dimensions.
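The materialization flexibility described above can be sketched as follows; the class, method names, and the simple caching policy are illustrative assumptions, not the protocol's actual interfaces.

```python
# Sketch of pre-fetch vs. on-demand materialization of a result item.
# The protocol leaves the choice open: a client may materialize the
# document eagerly, or defer the remote fetch until first access and
# then reuse the local copy. (Hypothetical names and policy.)

class ResultItem:
    def __init__(self, doc_id, fetch):
        self.doc_id = doc_id
        self._fetch = fetch      # callable that retrieves contents remotely
        self._contents = None    # not materialized yet

    def prefetch(self):
        """Eager strategy: materialize before the client asks."""
        self._contents = self._fetch(self.doc_id)

    def contents(self):
        """On-demand strategy: materialize on first access, then reuse."""
        if self._contents is None:
            self._contents = self._fetch(self.doc_id)
        return self._contents

fetch_log = []
def fetch(doc_id):
    fetch_log.append(doc_id)     # stands in for a remote call to the proxy
    return f"contents of {doc_id}"

item = ResultItem("doc42", fetch)
item.contents()   # first access triggers the remote fetch
item.contents()   # served from the local copy; no second fetch
```

Either strategy satisfies the same client-side interface, which is exactly the freedom the protocol tries to preserve.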
For interoperability in our testbed setting, we prefer a protocol that does not fix these choices, allowing, for example, a provider to asynchronously and incrementally push information and associated management responsibility to the client. The protocol takes advantage of the encapsulation properties of object-oriented programming by providing a client-side result collection object which hides all the state and processing details of interacting with the remote service proxies. The result collection communicates queries to the proxy. The proxy then asynchronously adds result items to the collection, while client programs simultaneously extract them from the same collection object. Information is streamed from the service, through the collection, to the client program. Client programs thus have the illusion that their queries are all handled locally, and that the act of creating a result collection for a query simply produces a set of results within that collection. Result item references within result collections are called 'access capabilities'. These in turn each contain multiple 'access options' which may be tried sequentially when dereferencing the capabilities. The first options are generally cache pointers into the proxy. When they succeed, they ensure fast response time. If dereferencing is delayed until after the remote proxy has chosen to discard its cache, the other access options provide for successively more expensive, but longer-lived access. See our IEEE Computer article for details.

Evaluation Activities -- Luis Gravano, Frankie James, Narayanan Shivakumar

The evaluation of mechanisms and systems is a central component of our project. In the initial stages of the project, our technologies have not been fielded and used by significant numbers of users. Thus, our evaluations have not yet looked at "end-user satisfaction". Instead, our evaluations have focused on the effectiveness of individual pieces of technology.
The major completed evaluations performed this year are as follows:

* For gGLOSS, we used logs of real user queries to evaluate the success of gGLOSS in suggesting good sources. The suggestions produced by gGLOSS on the real queries were compared against the results of actually submitting the queries to all the sources directly. The results can be found in our VLDB'95 paper.
* For copy detection, we have not only analyzed the run time performance of the various schemes, but we have also evaluated the accuracy of the detection. For the latter, we compared the duplicates flagged by our prototype service to those that were identified as documents with significant overlap by a real person reading the documents. This study highlights the tradeoff between high detection accuracy and good run time performance. For details see our paper in DL'96.

In addition, we are currently planning or carrying out evaluations of our query translation algorithms, our interoperability protocol, our audio interfaces, and SenseMaker. In the first two cases, the focus will be on performance (e.g., computation cost, network traffic generated). In the last two cases, the experiments will involve human subjects.

Audio Interfaces to Hypertext -- Frankie James

Every day, more and more information is being made available online to the general public in the form of electronic documents. Since the advent of the WWW, hypertext (in particular, HTML) has become the medium of choice for the presentation of these documents. This is because HTML can be used to present not only the text of a document, but also much of its structure. The ability to use this structure in a generic (multimodal) way would mean that electronic documents could be accessible to everyone, even non-standard users such as blind users or users connecting to the WWW via the telephone. Traditionally, blind computer users have accessed documents and information through ASCII text files.
This method preserves the textual content of a document well, but has problems with visual content such as tables, figures, and font changes. Visual content can be subdivided into elements that indicate structure (such as tables, or the use of typeface and type style to denote headings) and elements that are more purely visual, such as pictures. Either kind can be a fundamental part of a document, and as much of it as possible should be preserved for blind users. HTML, with its use of markup tags, explicitly represents the structural visual content of a document as well as the textual content. This content is needed not only for getting a sense of the document's overall structure but, since we are dealing with hypertext, also for navigation between documents. That is, with hypertext we are interested not only in reading and browsing a single document, but also in navigating and browsing within the larger space of multiple documents. Although these tasks are currently limited to the visual domain, they should be possible in other modalities such as audio. Presenting document structure in audio will need to be based on a body of work in the fields of typography (to see what it is we are really representing), communications, and HCI for blind users (specifically, GUI access). The current focus of our work is on implementing an experiment which tests four different auditory interfaces to a set of HTML pages to see what parts of the audio space are more or less useful for representing particular document structures and HTML tags. The audio space is too broad to advocate only one type of structuring technique; therefore, the test interfaces incorporate many different elements, including voice changes, speaker changes, and non-speech sound effects.
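One way to make such structure-to-audio mappings concrete is a declarative table from HTML tags to audio cues. The sketch below is purely illustrative: the tag set, cue names, and rendering steps are our own assumptions, not the four interfaces actually being tested.

```python
# Illustrative mapping from HTML structural tags to hypothetical audio cues.
# Speaker changes and non-speech sounds stand in for visual typography.
AUDIO_CUES = {
    "h1": {"speaker": "announcer", "pitch": "low", "pre_sound": "chime"},
    "h2": {"speaker": "announcer", "pitch": "mid", "pre_sound": None},
    "a":  {"speaker": "second_voice", "pitch": "default", "pre_sound": "click"},
    "em": {"speaker": "default", "pitch": "high", "pre_sound": None},
}

DEFAULT_CUE = {"speaker": "default", "pitch": "default", "pre_sound": None}

def render_plan(fragments):
    """Turn (tag, text) fragments into an ordered list of audio steps:
    optional non-speech sounds followed by speech in the chosen voice."""
    plan = []
    for tag, text in fragments:
        cue = AUDIO_CUES.get(tag, DEFAULT_CUE)
        if cue["pre_sound"]:
            plan.append(("play", cue["pre_sound"]))
        plan.append(("speak", cue["speaker"], cue["pitch"], text))
    return plan

# A hypothetical page: heading, body text, and a hyperlink.
page = [("h1", "Annual Report"),
        ("p", "Progress this year..."),
        ("a", "next section")]
plan = render_plan(page)
```

A table like this makes alternative cue assignments easy to swap in and out, which is essentially what comparing several candidate audio interfaces requires.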
However, a primary interest in this experiment (and our future research) is to explore the usefulness of speaker changes for marking various kinds of HTML structure in a document, since this technique has proven useful in traditional radio broadcasts but is largely unexplored in computer interface design. We anticipate that these experiments will first of all give a clearer picture of the issues involved in presenting document structure in the audio domain. This should be different from other presentations of information in audio, since documents have for so long been grounded in the visual. The experiments should also provide insight into how hypertext documents are used in general, as opposed to how they are used given that we are currently constrained to using them in a visual environment. Once we determine the cognitive tasks involved in using hypertext, these can be mapped directly into either the visual or audio modality, depending on the preferences or abilities of the individual user.

SenseMaker -- Michelle Baldonado

We have developed a relation-centric user interaction model for browsing in the digital library. The motivation for this new model comes from our belief that the digital library of the future will follow the current trend of providing access to more and more heterogeneous search services. Accordingly, users of the future digital library who are engaged in an information seeking task will find that more and more citations and documents match their specifications. Possible solutions to this problem include filtering the results that are returned and narrowing both the supplied query and the set of search services to which that query is submitted. However, when a user does not have a well-defined goal in mind, but instead is engaging in an exploratory search (browsing), answering the question of how to filter results or to narrow queries and search service sets can be quite difficult.
As an alternative, we propose that digital library interfaces for browsing should make the inter-result relation, rather than the individual result or document, their primary unit of interaction. By moving to a model where the unit of interaction is the relation, we can build improved interfaces both for result analysis and for search expansion. At the analysis level, users can "make sense" of their results by looking at them in terms of the relations that hold among the documents they describe (e.g., which documents share similar titles? which documents share common authors?). This approach is valuable in that it allows the user to cut down on the number of entities that must be scanned. In addition, it gives the user a feel for the common characteristics of the results. At the search expansion level, users can ask for new results that fit into a relation of interest (e.g., find me more documents with a title like this, find me more documents by this author). This approach offers a user a natural way of expanding her search results without needing to drop down to the level of individual results.

Readers familiar with browsing models developed to support full-text clustering (notably, Scatter/Gather from Xerox PARC) and relevance feedback will notice that both of these strategies fit into the relation-centric model that we are proposing. Browsing models based on full-text clustering can be viewed as revolving around the question "What documents have similar full texts?" Likewise, models based on relevance feedback can be viewed as revolving around the question "What documents have full texts similar to the full texts of these documents, but dissimilar from the full texts of these other documents?" In fact, we argue that our model is both a generalization from and a unification of the models developed to support these two strategies.
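The analysis-level operations above can be sketched as grouping result records by the equivalence classes a relation induces. The records, field names, and the crude case-insensitive notion of "similar title" below are all illustrative assumptions, not SenseMaker's implementation:

```python
from collections import defaultdict

def group_by_relation(results, key_fn):
    """Group result records by the value that defines 'the same'
    under a given relation; each group is one unit of interaction."""
    groups = defaultdict(list)
    for record in results:
        groups[key_fn(record)].append(record)
    return dict(groups)

# Hypothetical result records gathered from several search services.
results = [
    {"title": "Digital Libraries Today", "author": "Smith", "source": "Lycos"},
    {"title": "Digital libraries today", "author": "Jones", "source": "Dialog"},
    {"title": "Copy Detection",          "author": "Smith", "source": "Inspec"},
]

# "Common author" relation: exact match on the author field.
by_author = group_by_relation(results, key_fn=lambda r: r["author"])

# "Similar title" relation, crudely approximated by case-insensitive match.
by_title = group_by_relation(results, key_fn=lambda r: r["title"].lower())
```

The same `group_by_relation` call works for any relation a user picks, which is the sense in which the relation, not the individual record, becomes the unit of interaction.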
An important consequence of a generalized relation-centric model is that it moves the user away from a focus on full-text similarity to a whole family of multi-dimensional relations. We believe that this step up in abstraction will be especially important in a heterogeneous digital library which includes a combination of citations, abstracts, documents, multimedia, etc. In addition to the theoretical work on developing a relation-centric user interaction model, we have also implemented a prototype tool, SenseMaker, to experiment with the interaction strategies suggested by the model. Currently, SenseMaker allows users to perform relation-centric interactive analysis of results from heterogeneous search services. It mediates between the user and six distinct search services (WebCrawler, Lycos, Inktomi, InfoSeek's Web database, Dialog's 275 database, and Folio's Inspec database). It communicates with these services using the Stanford interoperability protocol, and can be accessed via both Web and non-Web interfaces. In the coming year, we will extend SenseMaker to include the capability for relation-based feedback interactions, as well as to incorporate relation-based duplicate detection.

Boolean Query Translation -- Kevin Chen-Chuan Chang

Emerging Digital Libraries can provide a wealth of information. However, there is also a wealth of search engines behind these libraries, each with a different query language. Our goal is to provide a front-end to a collection of Digital Libraries that hides, as much as possible, this heterogeneity. As a first step, we focus on translating Boolean queries from a generalized form into queries that use only the functionality and syntax provided by a particular target search engine. We initially look at Boolean queries because they are used by most current commercial systems; eventually we will incorporate other types of queries, such as vector-space and probabilistic-model ones.
To illustrate our approach, suppose that a user is interested in documents discussing multiprocessors and distributed systems. Say the user's query is originally formulated as: "Title Contains multiprocessor And distributed (W) system". This query selects documents with the three given words in the title field; furthermore, the (W) proximity operator specifies that the word "distributed" must immediately precede "system". Now assume that the user wishes to query a source which does not understand the (W) operator. In this case, our approach is to approximate the predicate "distributed (W) system" by the closest predicate supported by the source, for example "distributed And system". This predicate requires that the two words appear in matching documents, but in any position. Thus, the native query sent to the source, in the syntax it understands, is: "Find Title multiprocessor And distributed And system". The native query will return a preliminary result set that is a superset of what the user expects. Therefore, an additional post-filtering step is required at the front-end to eliminate from the preliminary result those documents that do not have the words "distributed" and "system" occurring next to each other. In particular, the required filter query is: "Title Contains distributed (W) system". The problem of multiple and heterogeneous IR systems has been observed since the early 1970s. Since then, many solutions have been proposed to address it. Our approach differs from others mainly in that the front-end language is uniform across underlying sources while still providing powerful search features not necessarily supported by all systems. Our research work started with a feature analysis of the query languages of some typical text retrieval systems, including Dialog, WAIS, STN, BRS, and Stanford-Folio. Based on this study of Boolean systems, we designed a front-end Boolean language.
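The translate-then-filter idea illustrated above can be sketched as follows. The predicate encoding and function names are our own hedged stand-ins, not the project's actual algorithms, which operate on a full Boolean language and a query normal form:

```python
def translate(predicates, supports_proximity):
    """Split a conjunctive title query into a native query the target
    understands plus a filter query applied to the preliminary results.
    Each predicate is ('word', w) or ('adjacent', w1, w2) for (W)."""
    native, filters = [], []
    for pred in predicates:
        if pred[0] == "adjacent" and not supports_proximity:
            # Weaken (W) to plain conjunction for the native query ...
            native += [("word", pred[1]), ("word", pred[2])]
            # ... and keep the stronger predicate for post-filtering.
            filters.append(pred)
        else:
            native.append(pred)
    return native, filters

def matches(title, predicates):
    """Post-filter: check a result title against the filter predicates."""
    words = title.lower().split()
    for pred in predicates:
        if pred[0] == "word" and pred[1] not in words:
            return False
        if pred[0] == "adjacent" and (pred[1], pred[2]) not in zip(words, words[1:]):
            return False
    return True

# "Title Contains multiprocessor And distributed (W) system"
query = [("word", "multiprocessor"), ("adjacent", "distributed", "system")]
native, filters = translate(query, supports_proximity=False)

# Preliminary results from the weakened native query: a superset.
preliminary = [
    "Multiprocessor distributed system design",    # satisfies (W)
    "Distributed multiprocessor system overview",  # words present, not adjacent
]
final = [t for t in preliminary if matches(t, filters)]
```

Here the second title survives the native query but is removed by the filter, which is exactly the extra-document cost the minimal-translation work tries to bound.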
Meanwhile, we have completed theoretical work for translating front-end Boolean queries into target-specific queries and the corresponding filter queries required to carry out features not supported by a source. We have proposed transformation algorithms, based on a query normal form, that solve this mapping problem by generating native queries that are minimal in terms of the number of extra documents retrieved, together with filter queries that are optimal in the sense of least processing effort.

gGlOSS -- Luis Gravano

As large numbers of text databases have become available on the Internet, it is becoming harder to locate the right sources for given queries. To address this problem, we designed gGlOSS, a generalized Glossary-Of-Servers Server, that keeps statistics on the available databases to estimate which ones are potentially the most useful for a given query. During the reporting period, we extended our original GlOSS beyond its previous capability, which focused on databases using the Boolean model of document retrieval. gGlOSS also covers databases using the more sophisticated vector-space retrieval model. We evaluated our new techniques experimentally using real-user queries and 53 databases. Also, we further generalized our approach by showing how to build a hierarchy of gGlOSS brokers. The top level of the hierarchy is so small that it could be widely replicated, even at end-user workstations. We presented our main results in a paper that appeared at the VLDB'95 conference in Zurich, Switzerland.

We are acting as brokers and designers for an informal agreement among major search engine providers and their users. The agreement is intended to address interoperability issues in three areas: (i) resource discovery, (ii) query submission and (iii) rank merging of result sets from the various engines. If this effort succeeds, the engine providers will include a standardized interface to their engines to support clients interacting with multiple engines of the various vendors.
Participants are Microsoft Corporation, InfoSeek, WAIS Inc., Fulcrum, Verity and PLS. We have met with each of the companies to gather their input. After consolidating this data into a design, we will go through a second round of consultation with each participant to work out the final agreement. It is then expected that the search engine companies will provide the agreed-upon interfaces.

Adaptive Search Agents -- Marko Balabanovic

Work has been completed on the design of a new architecture for adaptive information searching agents. Given a group of users, the "Fab" system will adapt over time to deliver documents interesting to individuals within the group. We hope to take advantage of common interests within the group to allow more efficient scaling, so that eventually the system can serve a large number of users. Currently the documents are found by the system on the World-Wide Web. The implementation of the system, started in August 1995, is nearing completion. A major difficulty with these systems has been finding valid ways to evaluate them, as the commonly used models from the information retrieval literature do not apply to this domain. We have recently completed an experimental design which will allow measurement of the improvement over time and absolute performance of the system in a scientifically and statistically valid way. The first experiment is scheduled to start at the beginning of March 1996.

I. Financial Report

Not publicly available.

J. Bibliography

[1] Marko Balabanovic and Yoav Shoham. Learning Information Retrieval Agents: Experiments with Automated Web Browsing. In Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Resources, March, 1995.
[2] M. Balabanovic, Y. Shoham, and Y. Yun. An Adaptive Agent for Automated Web Browsing. Journal of Visual Communication and Image Representation, 6(4), December, 1995.
[3] Michelle Q Wang Baldonado and Terry Winograd.
Techniques and Tools for Making Sense out of Heterogeneous Search Service Results. Number SIDL-WP-1995-0019. Stanford University, 1995.
[4] Michelle Q Wang Baldonado and Terry Winograd. A User Interaction Model for Browsing Based on Category-Level Operations. Number SIDL-WP-1996-0029. Stanford University, 1996.
[5] S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In Proceedings of SIGMOD '95, 1995.
[6] Edward Chang and Hector Garcia-Molina. Reducing Initial Latency in a Multimedia Storage System. Submitted for publication, 1996.
[7] Kevin Chen-Chuan Chang, Hector Garcia-Molina, and Andreas Paepcke. Boolean Query Mapping Across Heterogeneous Information Sources. Invited to IEEE Transactions on Knowledge and Data Engineering, 1996.
[8] Luis Gravano and Hector Garcia-Molina. Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. In Proceedings of VLDB '95, 1995.
[9] Steven Ketchpel. Transaction Protection for Information Buyers and Sellers. In DAGS '95, 1995.
[10] Steve B. Cousins, Steven P. Ketchpel, Andreas Paepcke, Hector Garcia-Molina, Scott W. Hassan, and Martin Roscheisen. InterPay: Managing Multiple Payment Mechanisms in Digital Libraries. In DL '95 Proceedings, 1995.
[11] Steven Ketchpel and Hector Garcia-Molina. Making Trust Explicit in Distributed Commerce Transactions. In Proceedings of the International Conference on Distributed Computing Systems, 1996.
[12] Daphne Koller and Mehran Sahami. Toward Optimal Feature Selection. Submitted for publication, 1996.
[13] Clifford Lynch and Hector Garcia-Molina. IITA Digital Libraries Workshop Report. Marianne Siroker, Stanford University, GATES 436, Stanford, CA 94305, May, 1995. Available on request.
[14] Andreas Paepcke, Steve B. Cousins, Hector Garcia-Molina, Scott W. Hassan, Steven K. Ketchpel, Martin Roscheisen, and Terry Winograd. Towards Interoperability in Digital Libraries: Overview and Selected Highlights of the Stanford Digital Library Project.
to appear in IEEE Computer Magazine, May, 1996.
[15] M. Roscheisen, C. Mogensen, and T. Winograd. Interaction Design for Shared World-Wide Web Annotations. In Proceedings of CHI '95, 1995.
[16] M. Roscheisen, C. Mogensen, and T. Winograd. Beyond Browsing: Shared Comments, SOAPs, Trails and On-line Communities. In Proceedings of the World-Wide Web Conference '95, Darmstadt, Germany, 1995.
[17] Martin Roscheisen, Terry Winograd, and Andreas Paepcke. Content Ratings and Other Third-Party Value-Added Information: Defining an Enabling Platform. CNRI D-Lib Magazine (http://www.dlib.org/dlib/august95/08contents.html), August, 1995.
[18] M. Roscheisen, C. Mogensen, and T. Winograd. A Platform for Third-Party Value-Added Information Providers: Architecture, Protocols, and Usage Examples. Stanford University, May, 1995.
[19] Narayanan Shivakumar and Hector Garcia-Molina. SCAM: A Copy Detection Mechanism for Digital Documents. In DL '95 Proceedings, 1995.
[20] N. Shivakumar and H. Garcia-Molina. The SCAM Approach to Copy Detection in Digital Libraries. CNRI D-Lib Magazine (http://www.dlib.org/dlib/november95/11contents.html), November, 1995.
[21] N. Shivakumar and H. Garcia-Molina. Building a Scalable and Accurate Copy Detection Mechanism. In Proceedings of DL '96, 1996.
[22] The Stanford Digital Library Project. Special Issue of the Communications of the ACM, 1995.

Other materials: Video: We have prepared a 10-minute videotape on WebWriter, entitled "WebWriter: Interface Development on the World Wide Web," shot September 22, 1995.

2. Plans and Direction

We plan to continue aggressive pursuit of all five of our program thrusts during the upcoming reporting period. In the infrastructure area, we will focus on service proxy stability and support for proxy maintenance. In the SenseMaker project, we will focus on the use of relations for defining what users consider 'the same document' for the purpose of duplicate elimination.
The query translation work will expand to include non-Boolean queries. SCAM will enable users to compare sets of documents against other sets of documents, rather than just a single document being tested against a set of reference documents. We will ensure that Z39.50 servers are reachable from the InfoBus, and that Z39.50 clients can access the InfoBus. We plan to explore learning algorithms for deployment in the document rank merging problem and to investigate new techniques that use the words appearing in retrieved documents to cluster them or classify them according to topic. We plan to apply decision-theoretic and economic criteria in deciding which data sources should be accessed to respond to a given query. We will install a separate testbed software branch which will lag behind the development branch in features, but which will be kept stable and running for demonstration purposes. We plan to expand our cooperation with the University of Illinois and the University of California at Santa Barbara by exchanging access to services and collections with both these institutions. Existing access to and from the University of Michigan will be expanded to go beyond search by including the remote access of publication-related services, such as summarization.

________________

I certify that to the best of my knowledge (1) the statements herein (excluding scientific hypotheses and scientific opinions) are true and complete, and (2) the text and graphics in this report as well as any accompanying publications or other documents, unless otherwise indicated, are the original work of the signatories or individuals working under their supervision. I understand that the willful provision of false information or concealing a material fact in this report(s) or any other communication submitted to NSF is a criminal offense (U.S. Code, Title 18, Section 1001).

Project Director Signature: _____________________________________

Enclosures: Supporting transparencies; Excerpts of papers published in 1995