Digital Library Project
Stanford University
Quarterly Report. Nov 1, 1997
Reporting Period: Aug 1, 1997-Oct 31, 1997

Content:

Administrative matters
InfoBus Architecture and Testbed
Economics
User Interfaces
Searching
Miscellaneous Activities
References to Papers Produced During Reporting Period

1. Administrative

The new academic year started. The students have returned from their summer jobs. Many of them came back refreshed, with new ideas to try in the project.

Steve Cousins, creator of our DLITE interface to digital libraries, has graduated and was hired by Xerox PARC. He is our liaison there, and we are staying in touch with him regularly.

Michelle Baldonado has passed her PhD oral examination, and will also begin work at Xerox PARC on January 1. These are the first two students to graduate under the Digital Library project.

Meanwhile, several new Master's and PhD students have joined us. Henry Berg and Neill Daswani (PhD track) will be working in the areas of UI and economics, respectively. Henry will be exploring uses of personal digital assistants and large-area displays for digital libraries. Neill comes to us with strong experience in Web servers.

Mehul Desai will work towards his Master's degree. He is building an initial connection between the PalmPilot PDA and the InfoBus. Samir Raiyani is examining alternatives for information push and subscription technologies on the InfoBus. Yuhua Liu has taken over our DLITE user interface work.

Gerard Rodriguez joined us for a year. He is initially working on our technology transfer kit.

The new owners of Knight-Ridder's Dialog Information Service have renewed their partnership with the project. This will allow us to continue our use of this rich source for our InfoBus experiments and prototypes.

The InfoBus is now being used at Xerox PARC for some very interesting infrastructure R&D. We are very happy with this piece of technology transfer.

3. InfoBus Architecture and Testbed

We started to build better facilities that will help third parties to provide collections and services for the InfoBus more easily. This will include a sample library service proxy written in Java. We will pay particular attention to providing good defaults, and powerful base classes to cover the most common usage of our DLIOP protocol. As part of the same effort, we are ensuring that the InfoBus is properly compatible with other CORBA implementations.

We have overhauled the Dialog Information Service proxy to handle subcollection searching more seamlessly, and to include the metadata architecture. This will allow us to include many of the 600 Dialog databases more effectively. We are adding Consumer Reports, the movie review database, a database of publicly funded projects, and several others.

We have first results for our new ACM collections proxy. This is an exciting new addition to the InfoBus, because it will provide full-text search over the rich resources of the ACM proceedings and journals, as ACM brings them online.

We added the Library of Congress server to the InfoBus, as the second generation of our Z39.50 proxy for the InfoBus stabilized.

We completed work on making query translation an InfoBus service. Before we had this facility, every client of query translation needed to load the appropriate code. As partners outside of Stanford began to contribute services, we wanted as little code as possible to be required in the distribution package. The query translation proxy helps solve this problem for us.

4. Economics

In our economics-related work, we developed a novel representation of multi-good sourcing strategies with deadlines (described in the "Competitive Sourcing" paper submitted to DCS '98). We also worked out a design for combining our FIRM rights management system with our Shopping Model architecture.

In our work on distributed transactions, we developed new ways to handle cases where there is trust between intermediaries that are in turn trusted by clients, or where multiple pairwise transactions use the same trusted intermediary.

5. User Interfaces

We created a "museum" version of the Java-based SenseMaker for the NSF exhibition space. SenseMaker was also modified to make the SONIA profile capability accessible: when users request bundling by similar content, they are asked if they want to use an old profile as the basis for bundling; after bundling by similar content, users are asked if they want to save the bundling in a profile. SONIA in turn maintains and uses profiles -- SenseMaker just provides the user interface for them.

Also for SenseMaker, we conducted a think-aloud user study involving 5 people. We looked to see if organizational structure had an effect on user behavior. Analysis of the videos showed that it did. For example, with results bundled by Web site, subjects paid more attention to the inferences they could make about Web sites.

We also conducted a think-aloud user study involving 3 people. In particular, we looked to see if they could understand, without any training, the expand/limit/return/specify collection-creation action ontology that is used in SenseMaker. Two of the three subjects figured it out and one had trouble. The results are promising but show that the interface could still be improved.

We have started a new effort designing a construction kit for creating specialized user interfaces that help users search over multiple collections at once. The construction kit draws on our metadata architecture to collate the schemas of the target sources users interactively express interest in. The kit then automatically generates user interface code that lets end users formulate searches more easily.

6. Searching

We continued our implementation of the SONIA document clustering system. The student most involved in this work spent the summer as an intern at SRI. SRI has shown interest in using SONIA and other possible off-shoots from the Stanford Digital Libraries project.

Also jointly with SRI, we developed a theoretical analysis of document clustering techniques that have since been implemented in the SONIA system. A paper (jointly written with Moises Goldszmidt at SRI) on this work is in preparation.

Clustering results include a new algorithm for unsupervised feature selection, allowing us to considerably improve the set of words that we use for clustering, and some new algorithms for performing the clustering itself.

We began development of an interface to SONIA which supports hierarchical clustering and classification of documents.

We completed a series of simulated experiments for testing our Fab automated Web page retrieval system with data from Yahoo. We studied exploitative versus exploratory document recommendation strategies. We demonstrated how the system can increase the speed of learning user profiles at the expense of showing the user potentially less interesting documents. The end result is a UI addition to allow users to control this tradeoff. Such controls have been integrated into the Fab system.
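The tradeoff between exploitative and exploratory recommendation can be illustrated with a small sketch. This is not the Fab implementation; it is a generic epsilon-greedy style selection, and the function name `recommend`, the `explore_rate` parameter, and the sample scores are all assumptions made for illustration. A higher `explore_rate` shows the user more documents that the current profile does not rank highly, which may speed up profile learning at the cost of less interesting recommendations.

```python
import random

def recommend(scored_docs, k, explore_rate, rng=random):
    """Choose k documents from (doc, predicted_score) pairs.

    Each slot is filled exploratively (a uniformly random remaining document)
    with probability explore_rate, and exploitatively (the highest-scored
    remaining document) otherwise. explore_rate is a hypothetical user control.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    remaining = [doc for doc, _ in ranked]
    chosen = []
    for _ in range(min(k, len(remaining))):
        if rng.random() < explore_rate:
            pick = rng.choice(remaining)   # exploratory choice
        else:
            pick = remaining[0]            # exploitative choice: best remaining
        remaining.remove(pick)
        chosen.append(pick)
    return chosen

# Hypothetical candidate pages with scores predicted from a user profile.
candidates = [("page_a", 0.9), ("page_b", 0.7), ("page_c", 0.2), ("page_d", 0.1)]
safe = recommend(candidates, k=2, explore_rate=0.0)   # purely exploitative
mixed = recommend(candidates, k=2, explore_rate=0.5, rng=random.Random(0))
```

With `explore_rate=0.0` the call always returns the two highest-scored pages; any value above zero lets lower-scored pages appear.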

One problem we encounter in our SCAM work is that we need to manipulate and search over 'chunks' of information, like characters, words, sentences, paragraphs, etc. In order to facilitate these tasks, we established a new framework for describing chunking primitives, and generalized it to a class of approximate predicates that can be used in Digital Libraries and database systems.
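As a rough illustration of what chunking primitives might look like, the sketch below splits text into characters, words, sentences, and paragraphs. The function names, the regular expressions, and the `overlap` helper are assumptions for illustration only; they are not the SCAM framework and do not represent its approximate predicates.

```python
import re

def chunk_characters(text):
    """Every character is its own chunk."""
    return list(text)

def chunk_words(text):
    """Runs of word characters, with punctuation dropped."""
    return re.findall(r"\w+", text)

def chunk_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a crude heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_paragraphs(text):
    """Split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def overlap(a, b, chunker):
    """Fraction of the distinct chunks of `a` that also occur in `b`."""
    chunks_a, chunks_b = set(chunker(a)), set(chunker(b))
    return len(chunks_a & chunks_b) / len(chunks_a) if chunks_a else 0.0

sentences = chunk_sentences("One two. Three four! Five?")
shared = overlap("a b c d", "a b x y", chunk_words)
```

Passing the chunker as a parameter lets the same search or comparison code run at any granularity.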

We formulated indexing and querying in SCAM as special cases of iceberg queries. These are queries that extract the most frequent occurrences of a class of items, like 'the most frequently used word', or 'the most frequently occurring pair of words'. We proposed new techniques for optimizing this class of queries. We then combined notions of approximate predicates and iceberg queries to efficiently compute a syntactic clustering of over 250,000 Stanford web pages, obtained from the Stanford BackRub Web crawler.
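To make the notion of an iceberg query concrete, the following sketch counts items and keeps only those that occur at least a given number of times. It is the naive baseline that holds a counter for every item in memory, not the optimization techniques we proposed; the `threshold` parameter, the function names, and the sample documents are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def iceberg_counts(items, threshold):
    """Return {item: count} for items that occur at least `threshold` times."""
    counts = Counter(items)
    return {item: n for item, n in counts.items() if n >= threshold}

def word_pairs(documents):
    """Yield each unordered pair of distinct words that share a document."""
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        yield from combinations(words, 2)

# Hypothetical three-document collection.
docs = [
    "digital library search",
    "digital library proxy",
    "library search service",
]
frequent_words = iceberg_counts((w for d in docs for w in d.split()), threshold=2)
frequent_pairs = iceberg_counts(word_pairs(docs), threshold=2)
```

The number of candidate pairs grows quadratically with the vocabulary, which is why a naive counter per item does not scale and why optimizing this class of queries matters.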

We have begun to deploy a new experiment for what we call value-based filtering. This term covers mechanisms that help users filter information based on the value attached to information by human beings, either explicitly or implicitly. In particular, we have developed an HTTP proxy that monitors the pages accessed by our group. The proxy records only the URLs of the pages accessed, not the origin of each request. As users access pages, they are shown how many others have accessed the same pages in the past. We are now collecting data through this new facility.
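The bookkeeping behind such a proxy can be sketched as follows. This is a minimal illustration, not the deployed proxy: the class name `AccessCounter` and its `record` method are hypothetical, and the HTTP handling itself is omitted. The point it shows is that the requester's address is never stored, so the reported figure is a count of earlier accesses rather than of distinct users.

```python
from collections import Counter

class AccessCounter:
    """Counts page accesses by URL only (hypothetical helper)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, url, client_address=None):
        """Return how many earlier accesses `url` had, then count this one.

        client_address is accepted only to show that it is deliberately
        discarded: nothing about the requester is kept.
        """
        previous = self._counts[url]
        self._counts[url] += 1
        return previous

counter = AccessCounter()
first = counter.record("http://example.org/a", client_address="10.0.0.1")
second = counter.record("http://example.org/a", client_address="10.0.0.2")
other = counter.record("http://example.org/b")
```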

We continued to work on developing a framework for translating attributes and their values. The focus during this reporting period has been to deal with documents that have hierarchical structure. Our previous system had the limitation of flat sets of attribute-value pairs.
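The difference between flat and hierarchical attribute translation can be sketched as below. The mapping tables, attribute names, and the `translate` function are assumptions for illustration and do not reflect the actual translation framework; the sketch only shows how a translation over attribute-value pairs can recurse into nested structure instead of stopping at a flat set.

```python
def translate(record, attribute_map, value_rules=None):
    """Translate attribute names and values, recursing into nested records.

    attribute_map maps source attribute names to target names; value_rules
    maps source attribute names to functions that convert their values.
    Attributes without an entry are passed through unchanged.
    """
    value_rules = value_rules or {}
    translated = {}
    for name, value in record.items():
        target = attribute_map.get(name, name)
        if isinstance(value, dict):
            # Hierarchical case: translate the nested attributes as well.
            translated[target] = translate(value, attribute_map, value_rules)
        else:
            convert = value_rules.get(name, lambda v: v)
            translated[target] = convert(value)
    return translated

# Hypothetical source record and mapping tables.
source = {"author": {"last": "Smith", "first": "Ann"}, "year": "1997"}
names = {"author": "creator", "last": "family_name",
         "first": "given_name", "year": "date"}
result = translate(source, names, value_rules={"year": int})
```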

7. Miscellaneous Activities

7.1 Visitors and Industry Contacts

7.2 Public Presentations and Meetings Attended

7.3 Regular Meetings/Seminars

8. References

We published the following papers.

[1] E. Bauer, D. Koller, and Y. Singer. Update Rules for Parameter Estimation in Bayesian Networks. In Proceedings of the 13th Annual Conference on Uncertainty in AI (UAI), 1997.

[2] Edward Chang and Hector Garcia-Molina. Reducing Initial Latency in Media Servers. In IEEE Multimedia. Vol. 4. 1997.

[3] Edward Chang and Hector Garcia-Molina. Effective Memory Use in a Media Server. In Proceedings of the 23rd Very Large Data Base (VLDB) Conference, 1997.

[4] Edward Chang and Hector Garcia-Molina. MEDIC: A Memory & Disk Cache for Multimedia Clients. Number SIDL-WP-1997-0076. Stanford University, October, 1997.

[5] Arturo Crespo and Hector Garcia-Molina. Awareness Services for Digital Libraries. In Lecture Notes in Computer Science. Vol. 1324. 1997.

[6] Joachim Hammer, Hector Garcia-Molina, Junghoo Cho, Arturo Crespo, and Rohan Aranha. Extracting Semistructured Information from the Web. In Proceedings of the Workshop on Management of Semistructured Data, 1997.

[7] Steven Ketchpel and Hector Garcia-Molina. A Sound and Complete Algorithm for Distributed Commerce Transactions. Number SIDL-WP-1997-0074. Stanford Digital Library Project, 1997.

[8] Steven Ketchpel and Hector Garcia-Molina. Competitive Sourcing for Internet Commerce. Number SIDL-WP-1997-0075. Stanford Digital Library Project, 1997.

[9] Mehran Sahami. Applications of Machine Learning to Information Access. In AAAI-97, Proceedings of the Fourteenth National Conference on Artificial Intelligence, 1997.

[10] Mehran Sahami, Salim Yusufali, and Michelle Q. Wang Baldonado. Real-time Full-text Clustering of Networked Documents. In AAAI-97, Proceedings of the Fourteenth National Conference on Artificial Intelligence, 1997.