Digital Library Project
Stanford University
Annual Report. Feb 1, 1997
Reporting Period: Feb 1, 1996-Feb 1, 1997
Administrative matters
InfoBus Architecture and Testbed
User Interfaces
STARTS Internet Meta-Search Proposal
Miscellaneous Activities
References to Papers Produced in 1996
Other Publications

1. Administrative

Osaka Gas Information Systems, a wholly owned subsidiary of Japan's Osaka Gas company, decided to send Yosuke Akamatsu as a visitor to our Digital Library project for one year. He is working on our economics thrust.

Many of us attended the University of Michigan DLI meeting in 1996, and in December we hosted the meeting of DLI participants. We also spent time preparing an additional pre-conference workshop for key people from the Digital Library Initiative and the leaders of the National Digital Library Federation (directors of the nation's largest research libraries). Approximately 55 people attended. The workshop strengthened ties and encouraged communication between the NDLF leadership and the NSF/DARPA/NASA Digital Library projects. The most productive discussions between the two groups centered on search and discovery methods and on metadata. A handful of NDLF people stayed for the meeting on the 16th and 17th.

Carl Lagoze and David Fielding of Cornell University, and Jim Davis of Xerox PARC joined us part time. They bring rich, relevant experience from their work on the NCSTRL project.

Several students spent their summer at surrounding companies, mostly continuing Digital Library research, though usually with their host companies' goals driving their direction. This has enriched the project, as the students bring new ideas and feedback to us.

Twenty-one working papers from the Stanford Digital Library Project have been made into technical notes or technical reports. This gives the DigLib papers more visibility and archives them in the library.

2. InfoBus Architecture and Testbed

We put considerable effort into exploring the possibilities of Java for the InfoBus. Our interest lies particularly in using Java to distribute InfoBus access software. In this scheme, all access software is obtained through the Web as a Java applet. Once at the client site, the applet assumes the role of a CORBA ORB, the core part of the CORBA system, which facilitates communication with other objects. This mode of deployment is nearing completion and will be the basis for our library user studies.

Work on the Zserver was completed. This is an InfoBus proxy which behaves towards clients like a Z39.50 server but does not itself contain any data. Instead, it translates incoming Z39.50 requests into InfoBus requests. These are forwarded through the InfoBus to any of the InfoBus-accessible sources. The resulting information is returned to the client as if it came from a Z39.50 service. The advantage is that standard Z39.50 clients can access the InfoBus without using our software. The Zserver completes our Z39.50 interoperability suite, complementing the University of Michigan Z39.50-client-to-InfoBus software reported on earlier.
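The translation step can be illustrated with a minimal Python sketch. It is not the Zserver implementation: the `InfoBusQuery` type, the function names, and the InfoBus attribute names are hypothetical. Only the Bib-1 use-attribute numbers (4 for title, 1003 for author, 1016 for any) come from the Z39.50 Bib-1 attribute set.

```python
from dataclasses import dataclass

# Bib-1 "use" attribute numbers mapped to hypothetical InfoBus attribute names.
BIB1_TO_INFOBUS = {4: "title", 1003: "author", 1016: "any"}


@dataclass
class InfoBusQuery:
    """Hypothetical stand-in for an InfoBus request."""
    attribute: str
    term: str


def translate_z3950(use_attribute: int, term: str) -> InfoBusQuery:
    """Translate one Z39.50 attribute/term pair into an InfoBus-style query."""
    try:
        attribute = BIB1_TO_INFOBUS[use_attribute]
    except KeyError:
        raise ValueError(f"unsupported Bib-1 use attribute: {use_attribute}")
    return InfoBusQuery(attribute=attribute, term=term)


def zserver_handle(use_attribute, term, search_source):
    """Answer a Z39.50-style request by forwarding it to an InfoBus source.

    The proxy holds no data of its own; `search_source` is any callable
    that accepts an InfoBusQuery and returns results.
    """
    query = translate_z3950(use_attribute, term)
    return search_source(query)
```

A caller would pass in whatever source it wants searched, for example `zserver_handle(1003, "Paepcke", some_source)`, where `some_source` is assumed to wrap an InfoBus-accessible collection.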

The University of California, Santa Barbara was added as an InfoBus repository. We can search over the Alexandria collection and obtain map metadata and GIF thumbnails of maps. Similarly, Project Alexandria has demonstrated access to our other sources through the InfoBus.

We constructed a socket-based interface to the InfoBus which allows for using our DLIOP protocol through the Unix socket interface. CMU has used it successfully to access the InfoBus.

We developed a Web proxy generator. This is a facility which allows users to define the behavior of Web-based search services. Using this description, the facility automatically constructs an InfoBus proxy which can then provide access to the service through the InfoBus.
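As a rough illustration of the idea, the following Python sketch builds a request function from a declarative service description. The description format, the field names, and the example URL are all assumptions made for this sketch; the actual generator's description language is not shown in this report.

```python
from urllib.parse import urlencode


def make_search_proxy(description):
    """Build a proxy function from a declarative description of a Web search service."""
    base_url = description["base_url"]
    query_param = description["query_param"]
    fixed_params = description.get("fixed_params", {})

    def build_request_url(query_text):
        # Combine the service's fixed parameters with the user's query text.
        params = dict(fixed_params)
        params[query_param] = query_text
        return base_url + "?" + urlencode(params)

    return build_request_url


# Hypothetical service description; the URL and parameter names are placeholders.
example_description = {
    "base_url": "http://search.example.com/find",
    "query_param": "q",
    "fixed_params": {"format": "html"},
}
proxy = make_search_proxy(example_description)
```

Here `proxy("digital library")` returns the URL a generated proxy would request. A real proxy would also need to fetch the page and parse the result list, which this sketch leaves out.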

Work progressed on further integrating the SCAM service into the InfoBus architecture. When this work is finished, SCAM will be accessible via method calls. In addition, a Java interface is being developed.

The second version of InterBib was released. It now provides support for RTF and for bibliography search. The RTF support means that users may now submit MS Word and HTML documents with BibTeX or Refer bibliographies to InterBib and have reference lists added to the documents. Version one supported only FrameMaker. The search facility allows users to search over all bibliographies that were previously submitted to InterBib, were parsed correctly, and were released for public access by the submitters. InterBib was also fixed to properly handle submissions originating from Macintosh computers. Previously, only submissions from PC and Unix machines were handled properly. InterBib is available at

Work progressed on allowing views to be added to objects. This facility helps when multiple services interact with document objects and each service would like to 'pretend' that the documents have different attributes and behavior. With document views, each module can attach a different view to the document objects it operates on. This is like dynamically adding a superclass to the object's class. We are also exploring additional uses of views, such as access control to documents and the dynamic addition of payment capabilities.
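The "dynamically adding a superclass" analogy can be approximated in Python by rebinding a single instance to a class created at run time. This is only a sketch of the analogy, not the InfoBus view mechanism; `Document`, `PayableView`, and `attach_view` are hypothetical names.

```python
class Document:
    def __init__(self, title, text):
        self.title = title
        self.text = text


class PayableView:
    """Hypothetical view that adds payment-related behavior."""
    price_cents = 50

    def quote(self):
        return f"{self.title}: {self.price_cents} cents"


def attach_view(obj, view_class):
    """Give one object the behavior of view_class without touching other instances."""
    original = type(obj)
    # Create a class that inherits from both the view and the original class,
    # then rebind this single instance to it.
    combined = type(
        f"{original.__name__}With{view_class.__name__}",
        (view_class, original),
        {},
    )
    obj.__class__ = combined
    return obj


doc = attach_view(Document("InfoBus Overview", "..."), PayableView)
other = Document("Plain", "...")
```

After this, `doc.quote()` works while `other` is unchanged, so two modules can hold different views of documents of the same underlying class.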

We added a stand-alone collection service to the InfoBus. This allows InfoBus clients to store documents and other objects in DLIOP compliant collections with a variety of underlying storage managers. This facility is beginning to get used throughout the system.
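One way to picture a collection with interchangeable storage managers is the small Python sketch below. The class and method names are hypothetical and do not reflect the DLIOP interfaces; the point is only that the collection delegates persistence to whatever storage object it is given.

```python
class MemoryStore:
    """Hypothetical storage manager that keeps objects in a dictionary."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key):
        return self._items[key]


class Collection:
    """Collection front end that delegates to any storage manager with put/get."""

    def __init__(self, storage):
        self._storage = storage
        self._next_id = 0

    def add(self, document):
        # Assign a simple sequential handle and store the document under it.
        handle = f"doc-{self._next_id}"
        self._next_id += 1
        self._storage.put(handle, document)
        return handle

    def fetch(self, handle):
        return self._storage.get(handle)
```

Swapping `MemoryStore` for a file-backed or database-backed class with the same `put` and `get` methods would leave `Collection` clients unchanged.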

The ability for search proxies to support subcollections was added. This allows convenient access to external services with multiple collection offerings. Example: Knight-Ridder's Dialog Information Service.

A new proxy to the Xerox PARC document summarization service was developed.

A new proxy for the NCSTRL collection was added to the InfoBus by our colleagues at Cornell.

A proxy was constructed which uses a converter from the New Zealand Digital Library to convert PostScript to approximate text.

We constructed a DLITE component and proxy to TextBridge, a remotely accessible Xerox OCR service. The intent is for users to send a document image, for which the service then returns an OCRed copy. The technical aspect of this work is complete, and we are negotiating with Xerox regarding release of the facility outside the Xerox firewall.

Work progressed on making the InfoBus testbed thread-safe. We have several proxies and parts of SenseMaker running with threads. More work is needed in this area.

3. Economics

Work began on InterPay II. It is intended to move beyond interoperability of payment to the notion of interoperability among 'shopping models'. Thus, our interoperability concerns for online financial activities are now expanding to include issues such as the sequencing of component actions: offers, invoice delivery, negotiation, verification, payment, document delivery, etc.

Our framework for distributed commerce transactions was presented at an international conference on distributed computing systems. Follow-up work has improved the basic architecture and algorithm, which has been proven sound and complete. Extensions for direct trust and deadlines are in progress.

In the SCAM work, we considered the case when documents are located in several distributed text databases. In this scenario, we studied several liberal and conservative techniques that helped us "prune" away databases that were very unlikely to have document copies. A new paper on dSCAM presents algorithms that compute "minimal" queries to retrieve potential copies of illegal documents in autonomous text databases.

Work on relationship management for access control (R-MANAGE) moved from design towards implementation. We have been augmenting the basic testbed infrastructure with rights management facilities in order to deal with access control issues such as licensing and copyrights. The basic approach is based on a contract framework, adding both persons ("epers") and relationships ("commpacts") as first-class citizens to interface and architecture.

An "e-person" (or "epers") is a persistent, access-controlled electronic representation of a person that provides a single point of reference for everything related to this person, including controlled ways to get hold of personal information, to request approvals, to leave behind notifications, to automatically set up certain standard relationships (such as accounts with content providers), etc. An epers is essentially a generalization of a Unix account.

A "commpact" is a "communication pact" between e-persons. In many cases it will represent a legal contract, but it can also encapsulate less formal agreements, such as privacy-related ones. Technically, a commpact is a set of rights and obligations. For example, the obligation to pay a certain sum for a subscription to a newsletter might be one such promise in a subscription commpact. Actions are then evaluated (authorized) with respect to the context of a previously established commpact. R-MANAGE integrates with InterPay II at this action level. Much of its functionality is available to users as part of a separate "relationship manager" task in the task interface.
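The idea of authorizing an action against a previously established commpact can be sketched in Python as follows. The `Commpact` class, its fields, and in particular the rule that all obligations must be fulfilled before a right can be exercised are assumptions made for illustration, not the R-MANAGE design.

```python
from dataclasses import dataclass, field


@dataclass
class Commpact:
    """Hypothetical commpact: a set of rights and obligations between parties."""
    parties: tuple
    rights: set = field(default_factory=set)       # actions the commpact permits
    obligations: set = field(default_factory=set)  # promises not yet fulfilled

    def fulfill(self, obligation):
        # Mark a promise as kept by removing it from the open obligations.
        self.obligations.discard(obligation)

    def authorizes(self, actor, action):
        """Allow an action if the actor is a party, the right exists,
        and (by assumption) no obligation is still open."""
        return (
            actor in self.parties
            and action in self.rights
            and not self.obligations
        )
```

In the newsletter example, a subscription commpact would list "read-newsletter" as a right and "pay-subscription" as an obligation; under the rule assumed here, reading is authorized only after the payment obligation has been fulfilled.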

The following economics-related interface components (with their corresponding backend proxies) are available in the DLITE interface:

DL proxies now also have an "owner" with whom users can contract for usage terms and conditions.

Authentication: Based on the person representation and public-key credentials (RSA/MD5) issued by the home provider, a "network login" facility has been added to the testbed. Both the browser (Netscape via cookies) and the DLITE task viewers are thus able to convey who is using them, and testbed services can securely identify their users.

4. User Interfaces

In parallel to the Java experiments at the infrastructure level, the user interface thrust expanded its scope to explore how the easy delivery of InfoBus access software would affect user-level interaction with the InfoBus. We focused in particular on the possibility of collaboration among users, some of whom might be mobile. Our driving scenario includes a user on the road who needs to consult with a reference librarian at home. Challenges being addressed include striking a good balance between making screen interactions visible to all parties and working within bandwidth and latency constraints.

SenseMaker has evolved in that it now uses a structured hierarchical attribute interlingua for computing choices for grouping results into categories. In the previous prototype, we were using a hierarchy of target sources instead.

We began to develop a conceptual model for information finding that can be thought of as a Recursive Extensible Active Card catalog for Heterogeneity (REACH). It explains the categorization activity enabled by SenseMaker as the task of creating virtual card catalogs.

We completed the initial design of a new front-end interface for SenseMaker. This design introduces the concept of the "hi-citer." A hi-citer describes an information object through a delineated sequence of attribute values with special highlighting properties. Given a set of hi-citers, highlighting an attribute value in one causes that same attribute to be highlighted in the other hi-citers in the set. Hi-citers allow for fast skimming and the quick comparison of "citations" in a heterogeneous environment. We plan to implement the new design in the coming quarter.
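The linked-highlighting behavior can be sketched in a few lines of Python. This is a text-only illustration under assumed names (`HiCiter`, `highlight`) and an assumed rendering convention (asterisks for highlighting); the actual front-end design is graphical and is not reproduced here.

```python
class HiCiter:
    """Describes one information object as an ordered sequence of (attribute, value) pairs."""

    def __init__(self, fields):
        self.fields = list(fields)
        self.highlighted = set()  # attribute names currently highlighted

    def render(self):
        # Highlighted values are wrapped in asterisks; fields are separated by " | ".
        return " | ".join(
            f"*{value}*" if name in self.highlighted else str(value)
            for name, value in self.fields
        )


def highlight(citer_set, attribute):
    """Highlighting an attribute in one hi-citer highlights it in every hi-citer of the set."""
    for citer in citer_set:
        if any(name == attribute for name, _ in citer.fields):
            citer.highlighted.add(attribute)
```

Because the hi-citers in a set may describe heterogeneous objects (a paper and a video clip, say), the sketch highlights an attribute only in those hi-citers that actually carry it.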

Threads were added to the current version of SenseMaker. This allows multiple users to access SenseMaker over the Web at once.

We added support for new proxies in SenseMaker. This broadens the types of sources with which SenseMaker can communicate. An interesting new example is Informedia, the CMU video search service. We are also working to allow for third-party bundling in SenseMaker.

Formal testing of our audio-based Web interface technology has been completed. Analysis of test results is ongoing.

We ran several user studies for our DLITE interface in which users were asked to complete a bibliographic task with different versions of DLITE. Several changes have been made to the system in response to these studies.

We added another interface component to DLITE for helping users compose fielded queries. It is aimed at novice users who need fielded, but keyword-only, query entry. This work was undertaken in response to our user testing.

Subcollection support was added to our source constructor. This allows users to create interface components that represent subcollections of external services. Dropping queries into these will cause searches in those corresponding subcollections.

Our WebWriter and InterBib systems were incorporated into the testbed and into DLITE.

5. Searching

We constructed first versions of interfaces to CSQuest and CMU's Informedia project. CSQuest is a concept space over Inspec which was constructed at the University of Arizona (see University of Illinois's DL project for an explanation of this term). We are planning to use this facility for our interactive query development module.

The CMU proxy searches over Informedia's video text, title and abstract database, not the video itself. We are planning to move this first implementation to CMU's evolving RPC interface.

Our front-end query language was extended to enable more effective queries to UCSB's ADL map service. Specifically, our data types now include numeric values, and the attribute set has been extended to include various spatial aspects. We also extended the query translation mechanism to add missing features, such as negation and stemming.

We added a query translator for NCSTRL's Dienst server, which uses a STARTS-like query language.

Translation service to AltaVista (a Boolean search engine on the Web) was added to the testbed, and the query translation implementation was ported to SunOS and Solaris.

Our SenseMaker search interface progressed further this quarter. Recall that SenseMaker users "make sense" out of their result collections by looking at them through multiple views. Within a view, complexity is reduced through user-directed "merging" and "bundling" of results. Now, SenseMaker users can also contextually evolve the direction of the search process once they have made sense of the current collection of results. They can expand upon, limit, or replace the current collection of results. Examples of expand actions implemented this quarter include:

  1. Query-by-example. Users can point to "bundles" of related results and ask for them to serve as examples of what is to be found.

  2. Query refinement. Users can change their queries directly and can also accept suggestions as to new terms they might use in their query. These suggestions are obtained from the CSQuest proxy.

Two user studies were conducted to test the SenseMaker interface.

In our query translation project, we conducted experiments to measure the cost of our query translation approach. We compared the selectivity of front-end and translated queries to understand the post-filtering cost. Specifically, the experiment was designed to measure selectivity degeneration with respect to the following translation schemes:

In support of several subprojects, we designed a set of metadata components which will work on the InfoBus and which will satisfy several of our metadata needs. In particular, we added support for multiple attribute models in our front-end query language. Users can use attrModel.attrName notation to specify the search attributes in queries. The client of the query translator can specify a default attribute model. To support multiple attribute models, the query translator internally keeps no knowledge of the various models (except a default one); it relies on attribute model proxies to provide attribute details of standard models such as Bib-1.
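How an attrModel.attrName reference might be resolved against a default model can be sketched in Python. The model names (`bib1`, `dc`), the attribute sets, and the `resolve_attribute` function are illustrative assumptions; in the architecture described above, the attribute details would come from attribute model proxies rather than from a local table.

```python
DEFAULT_MODEL = "bib1"  # hypothetical name for the client's default attribute model

# Local stand-in for attribute model proxies: each model lists the attributes it defines.
ATTRIBUTE_MODELS = {
    "bib1": {"title", "author"},
    "dc": {"title", "creator"},
}


def resolve_attribute(reference, default_model=DEFAULT_MODEL):
    """Split 'attrModel.attrName' into (model, name), using the default model if none is given."""
    model, separator, name = reference.partition(".")
    if not separator:
        # No model prefix was supplied, so fall back to the default model.
        model, name = default_model, reference
    if name not in ATTRIBUTE_MODELS.get(model, set()):
        raise KeyError(f"unknown attribute {name!r} in model {model!r}")
    return model, name
```

With these assumptions, a bare `author` resolves to the default model, while `dc.creator` names the model explicitly.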

We have designed a new algorithm for automatically classifying text documents into an existing topic hierarchy. That is, given a pre-existing hierarchy of documents, e.g., the Yahoo collection, our algorithm learns how to take new documents and insert them into their appropriate place in the hierarchy. The algorithm utilizes some of our previous work on feature selection for text documents.

We have also started a collaborative effort with researchers at MIT, whose goal is to improve the accuracy of ad hoc queries to a corpus. The idea is to combine latent semantic indexing with machine learning techniques. The former teases out some set of themes that are dominant in the corpus, while the latter determines how important the various themes are. For example, some themes may correspond to topics, while others may correspond only to stylistic differences.

6. Agents

The Fab adaptive information retrieval system has been running "live" since the end of March. Several user tests have been performed.

A proxy for Fab was added to the testbed, and an interface has been built to allow use of Fab from within our DLITE interface.

7. STARTS Proposal for Meta-Search Support

We have been active in the definition of a proposal to support metasearching on the Internet. The proposal addresses three problems encountered by services that search multiple, heterogeneous search engines to satisfy a given query: finding promising collections, submitting appropriate forms of the query to the corresponding engines, and merging result rankings.
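The third problem, merging result rankings, arises because each engine scores documents on its own scale. The Python sketch below shows one simple approach, min-max normalization of scores within each engine; it is a stand-in chosen for illustration and is not the mechanism defined in the STARTS proposal. The function name and the input format are assumptions.

```python
def merge_rankings(results_by_engine):
    """Merge per-engine ranked lists into one ranking.

    `results_by_engine` maps an engine name to a list of (doc_id, score) pairs.
    Scores are min-max normalized within each engine so that engines with
    different score scales become comparable; a document returned by several
    engines keeps its best normalized score.
    """
    merged = {}
    for results in results_by_engine.values():
        if not results:
            continue
        scores = [score for _, score in results]
        low, high = min(scores), max(scores)
        span = high - low
        for doc_id, score in results:
            normalized = 1.0 if span == 0 else (score - low) / span
            merged[doc_id] = max(merged.get(doc_id, 0.0), normalized)
    # Sort by descending score, breaking ties by document identifier.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))
```

Normalizing raw scores ignores differences in how engines compute them, which is one reason a metasearcher benefits from sources exporting additional information about their rankings.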

We held a one day workshop with several major search engine providers and consumers to reach agreement on a final draft. This draft is available at

The Z39.50 community is working on a Z39.50 profile based on STARTS. Our Cornell colleagues have just completed a reference implementation of the protocol (

8. Miscellaneous Activities

Prof. Jerry Saltzer of MIT visited for one month, meeting individually with project team members and attending the seminars and weekly technical design meetings.

8.1 Visitors and Industry Contacts

8.2 Public Presentations and Meetings Attended

We organized a one-day workshop where several major search engine providers and consumers discussed the STARTS proposal.

8.3 Local Events

8.4 Regular Meetings/Seminars

9. References for 1996

[1] Marko Balabanovic and Yoav Shoham. Combining Content-Based and Collaborative Recommendation. Communications of the ACM, 40(3), March, 1997.

[2] Marko Balabanovic. An Adaptive Web Page Recommendation Service. In Proceedings of the First International Conference on Autonomous Agents, February, 1997.

[3] Michelle Q Wang Baldonado and Steve B. Cousins. Addressing heterogeneity in the networked information environment. Review of Information Networking, to appear.

[4] Michelle Q Wang Baldonado and Terry Winograd. SenseMaker: An Information-Exploration Interface Supporting the Contextual Evolution of a User's Interests. In Proceedings of the Conference on Human Factors in Computing Systems, 1997.

[5] Michelle Baldonado, Chen-Chuan K. Chang, Luis Gravano, and Andreas Paepcke. The Stanford Digital Library Metadata Architecture. International Journal of Digital Libraries, 1(2), February, 1997. See also

[6] Michelle Baldonado, Chen-Chuan K. Chang, Luis Gravano, and Andreas Paepcke. Metadata for Digital Libraries: Architecture and Design Rationale. Submitted to DL97, 1997. At

[7] Edward Chang and Héctor García-Molina. Reducing Initial Latency in a Multimedia Storage System. In Third International Workshop of Multimedia Database Systems, 1996.

[8] Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. Boolean Query Mapping Across Heterogeneous Information Sources. IEEE Transactions on Knowledge and Data Engineering, 8(4):515-521, Aug, 1996.

[9] Edward Chang and Hector Garcia-Molina. Minimizing Memory Use In Video Servers. Number SIDL-WP-1996-0045. Stanford University, December, 1996.

[10] Chen-Chuan K. Chang and Hector Garcia-Molina. Evaluating the Cost of Boolean Query Mapping. Submitted to DL97, 1997. At

[11] Steve B. Cousins, Scott W. Hassan, Andreas Paepcke, and Terry Winograd. A Distributed Interface for the Digital Library. Number SIDL-WP-1996-0037. Stanford University, 1996. Accessible at

[12] Steve B. Cousins, Andreas Paepcke, Terry Winograd, Eric A. Bier, and Ken Pier. The Digital Library Integrated Task Environment (DLITE). Submitted to DL97, 1997. Accessible at

[13] Arturo Crespo and Eric A. Bier. WebWriter: A Browser-Based Editor for Constructing Web Applications. In Proceedings of the Sixth World-Wide Web Conference, 1996.

[14] Arturo Crespo, Bay-Wei Chang, and Eric A. Bier. Responsive Interaction for a Large Web Application: The Meteor Shower Architecture in the WebWriter II Editor. In Proceedings of the Seventh World-Wide Web Conference, 1997.

[15] Luis Gravano, Narayanan Shivakumar, and Hector Garcia-Molina. dSCAM: Finding Document Copies Across Multiple Databases. In Proceedings of the 4th International Conference on Parallel and Distributed Information Systems, 1996.

[16] Luis Gravano, Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. STARTS: Stanford Proposal for Internet Meta-Searching. In Proceedings of the International Conference on Management of Data, 1997.

[17] Kenichi Kamiya, Martin Röscheisen, and Terry Winograd. Grassroots: A System Providing a Uniform Framework for Communicating, Structuring, Sharing Information, and Organizing People. In Proceedings of the Sixth World-Wide Web Conference, 1996. Also published in part as a short paper for CHI'96 (conference companion).

[18] Steven Ketchpel and Héctor García-Molina. Making Trust Explicit in Distributed Commerce Transactions. In Proceedings of the International Conference on Distributed Computing Systems, 1996.

[19] Steven Ketchpel, Hector Garcia-Molina, Andreas Paepcke, Scott Hassan, and Steve Cousins. UPAI: A Universal Payment Application Interface. In USENIX 2nd e-commerce workshop, 1996.

[20] Steven P. Ketchpel, Hector Garcia-Molina, and Andreas Paepcke. Shopping Models: A Flexible Architecture for Information Commerce. Submitted to DL97, 1997. At

[21] Ron Kohavi and Mehran Sahami. Error-Based and Entropy-Based Discretization of Continuous Features. In Second International Conference on Knowledge Discovery in Databases, 1996. At

[22] D. Koller and Y. Shoham. Information agents: A new challenge for AI. IEEE Expert:8-10, June, 1996.

[23] Daphne Koller and Mehran Sahami. Toward Optimal Feature Selection. 1996. Submitted for publication.

[24] D. Koller and M. Sahami. Hierarchically Classifying Documents Using Very Few Words. Submitted to ICML97, 1997.

[25] Andreas Paepcke, Steve B. Cousins, Héctor García-Molina, Scott W. Hassan, Steven K. Ketchpel, Martin Röscheisen, and Terry Winograd. Towards Interoperability in Digital Libraries: Overview and Selected Highlights of the Stanford Digital Library Project. IEEE Computer Magazine, May, 1996.

[26] Andreas Paepcke. Searching is Not Enough: What We Learned On-Site. D-Lib Magazine, May, 1996.

[27] Andreas Paepcke. Information Needs in Technical Work Settings and their Implications for the Design of Computer Tools. Computer Supported Cooperative Work: The Journal of Collaborative Computing, 5:63-92, 1996.

[28] Martin Röscheisen and Terry Winograd. A Communication Agreement Framework of Access/Action Control. In Proceedings of the 1996 IEEE Symposium on Research in Security and Privacy, 1996.

[29] M. Sahami, M. Hearst, and E. Saund. Applying the Multiple Cause Mixture Model to Text Categorization. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 435-443. Morgan Kaufmann, 1996. At

[30] Mehran Sahami. Learning Limited Dependence Bayesian Classifiers. In Second International Conference on Knowledge Discovery in Databases, 1996. At

[31] Narayanan Shivakumar and Héctor García-Molina. Building a Scalable and Accurate Copy Detection Mechanism. In Proceedings of the Third Annual Conference on the Theory and Practice of Digital Libraries, 1996.

[32] Tak Woon Yan, Matthew Jacobsen, Héctor García-Molina, and Umeshwar Dayal. From User Access Patterns to Dynamic Hypertext Linking. Submitted to the Fifth World-Wide Web Conference, 1996.

10. Other Publications

With the help of Xerox PARC, a new, updated videotape of the DLITE interface was produced.