Digital Library Project
Stanford University
Annual Report. April 1, 2000
Reporting Period: April 1, 1999 - April 1, 2000

Contents:

Administrative matters
Interoperability
Information Overload
Mobile Access to Digital Libraries
Archiving
Miscellaneous Activities
References to Papers Produced in 1999
Other Publications

[Photo: Some of the group members, and this year's site visit]

1. Administrative

The project got off to a good start. Several new graduate students are growing into their roles on the project. Because funding arrived late, we missed some student recruiting opportunities, but the students we were able to commit to are very good and have begun to make their mark in the areas they are addressing.

We hosted numerous visitors who are interested in Stanford's Digital Library project, and in the NSF initiative as a whole. We also hosted the first year's site visit, conducted by Mike Lesk and Steve Griffin of NSF, and Eugene Miya of NASA. It was exciting to show off our first results to the NSF visitors. For some students, this was their first public speaking experience; everyone did very well. We also had the opportunity to present some of our work to Ruzena Bajcsy, assistant director of NSF's CISE directorate.

We constructed a Web site that allows us to inform the public of our progress as we go along. Our publications page is receiving regular interest.

We met several times with our Interlib colleagues from the University of California at Berkeley, UC Santa Barbara, the San Diego Supercomputer Center, and the California Digital Library. These meetings ensure that our developments are well coordinated and that our efforts complement each other. Our contact with the larger community was furthered by the DL kickoff meeting hosted by Cornell.

A new faculty member, Chris Manning, joined our InfoLab (of which the Digital Library Project is a part). He is working on Web page classification using machine learning methods, and on statistical machine translation. We look forward to fruitful collaboration with him.

The following sections summarize activities in the various thrusts of the project.

2. Interoperability

2.1 SDLIP: A Digital Library Interoperability Protocol

We invested considerable effort in our new Simple Digital Library Interoperability Protocol (SDLIP). Drawing on our DLI1 experience with interoperability, we designed a protocol that is very simple to implement. The goal is to provide a common communication vehicle for the Interlib project.

Clients use SDLIP to request searches over information sources. Result documents are either returned synchronously or streamed from service to client as they become available. Implementations can be built over HTTP- or CORBA-based transports; in fact, a search service can be accessible through both kinds of transport at the same time. Implementations for the IETF's HTTP-based DASL protocol and for CORBA are available.
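
The two delivery modes can be illustrated with a minimal Python sketch. It is not the SDLIP interface itself: the class name, the method names, and the in-memory collection are hypothetical, chosen only to contrast a synchronous call that returns all results at once with a streamed call that hands over results as they become available.

```python
from typing import Iterator, List


class ToySearchService:
    """Hypothetical search service; not the actual SDLIP API."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents

    def search_stream(self, term: str) -> Iterator[str]:
        # Streamed delivery: yield each match as soon as it is found.
        for doc in self.documents:
            if term.lower() in doc.lower():
                yield doc

    def search_sync(self, term: str) -> List[str]:
        # Synchronous delivery: collect every match, then return them together.
        return list(self.search_stream(term))


service = ToySearchService([
    "Crawling the Web",
    "Archiving digital objects",
    "Web page repositories",
])
sync_results = service.search_sync("web")

# A streaming client can start working on the first hit before the rest arrive.
stream = service.search_stream("web")
first_streamed = next(stream)
```

In a real deployment the two methods would sit behind an HTTP or CORBA transport; the sketch leaves the transport out entirely.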

After an initial design, all of our partners met here at Stanford to discuss the details. We also invited a participant in the IETF proposal for a Distributed Authoring, Searching, and Locating protocol (DASL). He was able to provide valuable information about this IETF effort, and he helped us in our goal of being a powerful complement to existing efforts.

We have implemented examples of using SDLIP over both HTTP and CORBA. These examples are the foundation of our ongoing effort towards making all of our collections available via SDLIP. For example, Ray Larson at UC Berkeley implemented SDLIP access to all of Berkeley's Melville collections, and to the holdings of the California Digital Library (CDL). Similarly, we implemented access to several collections at the San Diego Supercomputer Center, the collection of Computer Science Technical Reports (NCSTRL), to Z39.50 services, and AMICO, a collection of online paintings and associated metadata. Multiple trips to our Interlib partners have helped move these implementations forward.

A D-Lib Magazine article on SDLIP is helping to publicize the effort.

2.2 DietORB: Miniaturized CORBA for PDAs

A second interoperability effort has created a miniaturized CORBA client for PDAs. We carefully evaluated which portions of CORBA were critically needed to access CORBA servers. Based on this analysis, we constructed a client for Palm Pilots. It can access CORBA-based search services over wireless links. The system was transferred successfully to a medical informatics company.

2.3 Service Mediation

Our visitor, Sergey Melnik, has introduced an RDF based mediation architecture for digital library services. RDF is an emerging W3C standard for encoding models in XML. This service mediation effort aims to express interfaces to DL services as RDF graphs. Example services are query translation, document format translations, and document summarization services. Each RDF model represents both the methods available on a service, and the order in which they can be invoked. In order to use multiple services with similar functionality but differing interfaces from a single client, our mediation service attempts to transform the RDF graph that represents the client's interface to the graphs that represent the different services. This effort is still in an experimental phase, but a publication explaining the principles will appear in DL00 (see references section below).

3. Information Overload

Our 'information overload' thrust has three sub-efforts: WebBase, Smart Crawling, and our Information Mural.

3.1 WebBase: A Database for Many Millions of Web Pages

We are developing facilities for storing and using many millions of Web pages. Emphasis is on providing as many advanced 'database-like' facilities as we can for HTML's semistructured format. In addition to the basic archiving machinery, we are developing facilities for building different kinds of indexes and search facilities. Researchers will be able to stream all or part of the collection across multicast channels. Computer programs at the researchers' sites can then analyze the stream, and build novel indexes that are re-integrated into WebBase. Examples of such indexes might be document reading level, document 'genre', or document length. We are investigating a variety of challenges in this multicasting context. For example, as multiple researchers request multiple items from the collection, a scheduler must construct optimal multicast channels that do not overtax communication channels, but also avoid overloading clients with items they do not need.

Our current system includes 40 million pages. We have indexes for forward and backward links, and ranking. After carefully examining available options for text indexing software, we are building and analyzing a text indexer for the collection.

During the reporting period we built the hardware, with contributions of many gigabytes from Quantum Corporation. The hardware is specially set up to allow for updates of new Web snapshots with minimal disruption of service.

3.2 Smart Crawling

Crawlers usually follow a breadth-first pattern. This results in broad coverage, but is less useful for building collections with particular goals. For example, the curator of a collection might wish to optimize her use of network and computing resources in such a way that her collection is as fresh as it can be. This goal calls for a crawler that re-crawls sites at different rates, depending on their respective rate of change. We are analyzing different Web site characteristics, and are developing specialized crawlers that optimize data collection for different collection goals.

During the reporting period, we implemented a highly parallel Web crawler, which has collected 42 million pages. Based on Web change statistics that we compiled from 720,000 web pages over a 4 month period, we studied the optimal refresh strategy for a Web crawler that specializes on freshness.
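
The refresh question can be made concrete with a small Python sketch. It rests on assumptions the report does not state: each page is taken to change as a Poisson process, and each page is re-crawled at fixed intervals. Under those assumptions the time-averaged probability that a local copy is up to date is (1 - e^(-r)) / r, where r is the expected number of changes per refresh interval. The page change rates and the crawl budget below are hypothetical.

```python
import math
from typing import List


def expected_freshness(change_rate: float, refresh_rate: float) -> float:
    """Time-averaged probability that a local copy is up to date.

    Assumes (our assumption, not the report's) Poisson changes at
    `change_rate` per day and periodic re-crawls at `refresh_rate` per day.
    """
    if refresh_rate <= 0:
        return 0.0
    if change_rate == 0:
        return 1.0
    ratio = change_rate / refresh_rate  # expected changes per refresh interval
    return (1.0 - math.exp(-ratio)) / ratio


def average_freshness(change_rates: List[float], refresh_rates: List[float]) -> float:
    pairs = zip(change_rates, refresh_rates)
    return sum(expected_freshness(c, r) for c, r in pairs) / len(change_rates)


change_rates = [9.0, 1.0]  # hypothetical pages: one volatile, one stable
budget = 2.0               # hypothetical total re-crawls per day

# Policy A: every page receives the same share of the budget.
uniform = [budget / len(change_rates)] * len(change_rates)
# Policy B: the budget is split in proportion to each page's change rate.
total_rate = sum(change_rates)
proportional = [budget * c / total_rate for c in change_rates]

uniform_freshness = average_freshness(change_rates, uniform)
proportional_freshness = average_freshness(change_rates, proportional)
```

For these particular numbers the uniform policy yields the higher average freshness, because most re-crawls spent on the very volatile page are stale again almost immediately. The sketch compares only two fixed policies and does not search for an optimal one.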

In order to avoid unnecessary crawling, we devised an algorithm to identify mirror sites on the Web automatically.

3.3 Information Mural

At the user interface level, we are working on a physical room that supports collaborative information exploration. This effort is also supported by other funding sources. A first version of the room is now complete. It includes three large, high resolution displays that cover an entire wall. The displays are touch sensitive. The room also includes a table with a built-in display for supporting round-table discussions with computer support.

The initial emphasis of the room will be the exploration of I/O devices that support collaboration.

4. Mobile Access to Digital Libraries

Our mobile access thrust attempts to integrate digital libraries more tightly into everyday life. Three projects are part of this thrust: our Power Browser, Information-based Collaboration for Emergency Medical Transport, and Information Tiling.

4.1 Power Browser

The Power Browser is a novel way of browsing the Web from a personal digital assistant (PDA). The system is implemented on the Palm Pilot device with wireless connection. Three outstanding features are a tree widget, a dynamically generated site search capability, and site-specific keyword completion.

The tree widget displays descriptions of the links on Web pages. This is similar to how the Macintosh Finder or Windows File Explorer display hierarchical file systems. The links on each page are shown at the same indentation level. The user can expand the tree using gestures with the pen. An incremental crawler retrieves the necessary information.

Incremental crawling is also used to provide users with local site search facilities, even for sites that do not themselves offer such a facility. Once the user enters a site, a crawler builds an index for the site, which can be used for searching.

The index is, however, also used to support search keyword completion on the user interface. Once users have entered two or three characters, a proxy sends a list of matching keywords to the PDA. The PDA displays the keywords for the user to select from.
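
Keyword completion over a site index can be sketched as a prefix lookup in a sorted word list. The Python below is an illustration under our own assumptions, not the Power Browser implementation: the function name, the result limit, and the sample keywords are hypothetical.

```python
import bisect
from typing import List


def complete(sorted_keywords: List[str], prefix: str, limit: int = 5) -> List[str]:
    """Return up to `limit` keywords that start with `prefix`."""
    # Binary search finds the first keyword that is >= the prefix.
    start = bisect.bisect_left(sorted_keywords, prefix)
    matches: List[str] = []
    for word in sorted_keywords[start:]:
        if not word.startswith(prefix) or len(matches) == limit:
            break
        matches.append(word)
    return matches


# Hypothetical keywords extracted from a crawled site.
site_keywords = sorted(
    {"library", "librarian", "link", "archive", "crawler", "libretto"}
)
suggestions = complete(site_keywords, "lib")
```

Because the list is sorted, all keywords sharing a prefix are adjacent, so the proxy only needs to scan a short run of entries before sending the candidates to the PDA.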

User studies showed improvements of more than 40% in both task completion speed and the number of pen strokes required. We have nearly completed our port to the Palm VII device, which has a wireless connection built in.

An invention disclosure has been filed.

4.2 Information-Centric Collaboration in Emergency Transport

We are working with the Stanford hospital in studying ways to use information flow for improving task-based collaboration. Stanford is a regional hub for the emergency treatment of infants and children. When outlying hospitals are unable to care for a particular child, a highly specialized transport team is dispatched from Stanford to fetch the child via ambulance or helicopter. This process involves many participants: the referring hospital's doctors and staff, parents, Stanford's dispatcher, a medical coordinator, neo-natal or pediatric intensive care unit staff and doctors, and the transport team. Information is constantly created and consumed at different locations. The flow of this information is vital if patient care is to be optimized.

During the reporting period, we thoroughly analyzed the process through on-site interviews and participation in actual patient transports. We found that more than 700 pieces of information are involved in a single transport. We are building tablet-based facilities that help all parties involved keep track of the information and transmit vital data from the ambulance to central repositories. The facilities enable thorough preparation by the receiving staff.

We face many user interface challenges. On the one extreme, the information must be manipulable on a small tablet screen in a bumpy ambulance. On the other extreme, the information must also be made suitable for long-term analysis by state supervisory agencies.

We constructed a first prototype, and are working closely with the hospital staff to work out details.

4.3 Information Tiling

Information tiling refers to the ability to define 'collages' of information from different sources. A tile might be all the information a user likes to find on her screen in the morning, such as news, stock information, and the weather. Or it might be the information she needs as she begins a particular task, such as the treatment of a patient, or as she enters some geographic area, such as a city or airport during travel. The goal is to have the components of each tile update themselves continuously, to build a database of this temporally ordered information, and to develop summaries suitable for presentation on handheld devices.

In this part of the project we identified research topics around distributing tiles to roaming devices. We are currently investigating what kinds of delivery guarantees we can make, and at what cost. For example, we are exploring delivery infrastructure that can guarantee that a tile is delivered exactly once, even if the user roams through many geographic locations.
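
One common way to obtain exactly-once behavior is to combine retransmission with duplicate suppression on the receiving side. The Python sketch below shows that general technique only; it is not a description of the infrastructure under investigation, and the class, the function, and the simulated loss of acknowledgements are hypothetical.

```python
from typing import List, Set


class RoamingDevice:
    """Hypothetical receiver that displays each tile at most once."""

    def __init__(self) -> None:
        self.seen_ids: Set[str] = set()
        self.displayed: List[str] = []

    def receive(self, tile_id: str, content: str) -> bool:
        # Duplicates are acknowledged but not displayed a second time.
        if tile_id not in self.seen_ids:
            self.seen_ids.add(tile_id)
            self.displayed.append(content)
        return True  # acknowledgement sent back to the sender


def deliver(device: RoamingDevice, tile_id: str, content: str, lost_acks: int) -> int:
    """Resend until an acknowledgement gets through; return the number of sends.

    `lost_acks` simulates how many acknowledgements vanish in transit,
    for example while the device moves between access points.
    """
    sends = 0
    while True:
        sends += 1
        acked = device.receive(tile_id, content)
        if acked and sends > lost_acks:
            return sends


device = RoamingDevice()
sends_needed = deliver(device, "tile-42", "weather: sunny", lost_acks=2)
```

The sender retries, which gives at-least-once delivery, and the receiver filters by tile identifier, which gives at-most-once display. Together they yield the exactly-once effect, at the cost of remembering identifiers on the device.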

5. Archiving

Our archiving thrust is working toward the long-term preservation of digital information. We have conducted a thorough investigation into existing digital archiving approaches. Based on these studies, we decided to focus initially on three aspects of the problem:

Tools for cost/benefit analyses of archiving solutions
Automated, continuous local archiving of entire file systems
Large-scale, fully automated, wide-area replication of archives

We have constructed a simulator, ArchSim, which allows us to help designers of archives make decisions around the tradeoffs involved. For example, designers must decide whether to buy more reliable equipment, or whether to change it frequently instead. Our simulator uses statistical methods to provide the data that is necessary to make this decision, and others like it.
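
The flavor of such a tradeoff study can be conveyed with a toy Monte Carlo simulation in Python. This is not ArchSim: the failure model (a yearly failure probability that grows linearly with equipment age), the rule that any failure loses the data, and all numbers are our own simplifying assumptions.

```python
import random


def loss_probability(base_failure: float, aging: float, replace_every: int,
                     years: int, trials: int = 2000, seed: int = 0) -> float:
    """Estimate the chance of losing the archive within `years`.

    Hypothetical model: a single device fails in a given year with
    probability base_failure + aging * age, and a failure loses the data.
    """
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    losses = 0
    for _ in range(trials):
        age = 0
        for _ in range(years):
            if age == replace_every:
                age = 0  # swap in fresh equipment
            failure_chance = min(1.0, base_failure + aging * age)
            if rng.random() < failure_chance:
                losses += 1
                break
            age += 1
    return losses / trials


# Option 1: more reliable equipment, kept for the whole 20-year period.
keep_reliable = loss_probability(0.005, 0.005, replace_every=20, years=20)
# Option 2: less reliable equipment, replaced every two years.
replace_often = loss_probability(0.02, 0.005, replace_every=2, years=20)
```

With these made-up parameters frequent replacement loses data less often, but the ranking depends entirely on the assumed failure rates, and a real study would also weigh purchase and migration costs.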

For the automated file system archiving portion of the archiving thrust, we developed our InfoMonitor. It observes changes in a file system, such as a library catalog, or online document repository, and replicates changes into a local archive. A demonstration of the system shows how the documents that comprise a 2GB Web site can be archived effectively.
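
A minimal Python sketch of the observe-and-replicate idea follows. It is an illustration under our own assumptions rather than the InfoMonitor design: it polls the file tree instead of observing it continuously, it ignores deletions, and the function names and directory layout are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


def snapshot(root: Path) -> Dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*")) if path.is_file()
    }


def archive_changes(root: Path, archive: Path, previous: Dict[str, str]) -> List[str]:
    """Copy new or modified files into the archive; return their paths."""
    current = snapshot(root)
    changed = [name for name, digest in current.items()
               if previous.get(name) != digest]
    for name in changed:
        # Each version is stored under its digest, so nothing is overwritten.
        shutil.copyfile(root / name, archive / current[name])
    previous.clear()
    previous.update(current)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "site"
    vault = Path(tmp) / "archive"
    site.mkdir()
    vault.mkdir()
    state: Dict[str, str] = {}

    (site / "index.html").write_text("v1")
    first_pass = archive_changes(site, vault, state)
    second_pass = archive_changes(site, vault, state)  # nothing has changed
    (site / "index.html").write_text("v2")
    third_pass = archive_changes(site, vault, state)
    archived_versions = len(list(vault.iterdir()))
```

Storing versions by content digest means an unchanged file is never copied twice, and both versions of the edited page remain in the archive.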

Wide-area replication, finally, includes the detection of changes in all participating sites, and the consequent replication of information from the modified archives to their mirrors. We use the notion of 'replication contracts' between geographically distributed, autonomously administered sites to set up replication paths among sites. An example might be the agreement that two museums replicate particularly important portions of their online holdings. Our prototype is called the Stanford Archival Vault (SAV). It demonstrates how two or more institutions could automatically and continuously monitor the others' holdings to ensure that nothing is lost, even with a local disaster destroying one organization's records. We tested and measured performance on repositories up to 10 GB (270,000 archived objects).
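
The replication idea can be sketched in a few lines of Python. This is not the SAV implementation: the class, the content-digest keys, and the reading of a 'contract' as a simple two-way mirroring agreement are assumptions made for illustration.

```python
import hashlib
from typing import Dict


class ArchiveSite:
    """Hypothetical archive whose objects are keyed by a content digest."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: Dict[str, bytes] = {}

    def store(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.objects[digest] = data
        return digest


def replicate(source: ArchiveSite, mirror: ArchiveSite) -> int:
    """Copy every object the mirror lacks; return how many were copied."""
    missing = source.objects.keys() - mirror.objects.keys()
    for digest in missing:
        mirror.objects[digest] = source.objects[digest]
    return len(missing)


museum_a = ArchiveSite("museum-a")
museum_b = ArchiveSite("museum-b")
painting = museum_a.store(b"scan of painting")
museum_b.store(b"catalog record")

# A two-way 'contract': each site mirrors the other's holdings.
copied_to_b = replicate(museum_a, museum_b)
copied_to_a = replicate(museum_b, museum_a)

museum_a.objects.clear()  # simulate a local disaster at site A
restored = replicate(museum_b, museum_a)
```

Comparing sets of digests lets each site detect what the other is missing without transferring the objects themselves, and the same routine serves both routine mirroring and recovery after a loss.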

6. Miscellaneous Activities

[Photo: Database Workshop With Over 80 Participants from Industry & Academia]
[Photo: Two of Our Visitors: Sergey Melnik and Jun Hirai]

6.1 Visitors and Industry Contacts

Jun Hirai of Toshiba Corporation spent the year with us, working on the WebBase project. We were very happy to have him. He contributed greatly to the project. Hisao Mase of Hitachi also spent time with the project. Sergey Melnik from the University of Leipzig is with us as well. He is contributing prolifically to the project, mainly in the areas of SDLIP, WebBase, and service mediation.

Prof. Lockeman of the University of Karlsruhe also visited over the summer, participating in our discussions and contributing his views.

In addition, we hosted various other visitors, among them:

Steve Griffin, NSF
Mike Lesk, NSF
Ruzena Bajcsy, NSF
Arno Puder, Deutsche Telekom
Dr. Wolfenstetter, Deutsche Telekom
Samir Raiyani
Representatives from IBM
Representatives from Toshiba
Dr. Mamoru Sugie of Hitachi
Prof. Lamersdorf of the University of Hamburg
Doug Sery of MIT Press
Brigette Wild of Dialog Corporation
Michael White, WhoWhere
Representative from Morgan Kaufmann
Claudia Munc of IBM Almaden
Holger Mayer, University of Rostock, Germany
Zolt Silberer, ISI
Prof. Schweppe, Germany
David Levy, Xerox PARC
Marti Hearst, UC Berkeley
Chung Le, File Maker Pro
Dr. Steve Cousins, Xerox PARC
Dr. Michelle Baldonado, Xerox PARC
Brewster Kahle, Alexa
Representatives from AltaVista
Representatives from Verity

6.2 Public Presentations and Meetings Attended

As noted, we organized a one-day workshop to finalize our SDLIP design. We also visited NASA Ames on the kind invitation of Eugene Miya. The visit was very exciting for all of us. It demonstrated the wide spectrum of activities at NASA, and its rich history.

We attended a meeting at NPACI, communicating our DL efforts to that community. Other meetings include a datamining workshop at IBM Almaden, and a meeting of the Special Interest Group on information retrieval.

Here is a list of project-related presentations the group members gave during the reporting period.

Arvind Arasu
- Power Browser presentation to NSF NPACI meeting in Washington D.C.
Karen Butler
- Presented emergency transport information analysis to prenatal advisory board, a panel of emergency transport physicians and other stakeholders.
Junghoo Cho
- How to crawl the Web, Compaq SRC, February 14, 2000
- How to crawl the Web, Stanford Database group workshop, March 14, 2000
Brian Cooper
- Talk and demo on SAV and InfoMonitor for Mike Lesk
- Talk and demo on SAV and InfoMonitor for NSF site visit
- Talk and demo on SAV and InfoMonitor for Ruzena Bajcsy
- Talk and demo on SAV and InfoMonitor for Database Workshop
- Poster on SAV at Internet Archive Colloquium
- Demo and informal talk on SAV and InfoMonitor for visitors from the Stanford Linear Accelerator Center (SLAC).
- Discussed SAV and Web archiving with Jeneane Harlick, San Mateo County Times. Ms. Harlick wrote an article discussing the issue of web archiving and mentions our SAV system; the article appeared on 4/25/2000.
Arturo Crespo
- At CISCO Systems, presented work conducted with Junghoo Cho on Web Extraction (ProxyGen).
Hector Garcia-Molina
- InterLib Digital Libraries Technologies, U. California Berkeley, Sept 27, 1999.
- How to Crawl the Web, Microsoft Corporation, January 28, 2000.
- Barriers to Digital Libraries, Intel Corp., Beaverton, OR, February 8, 2000.
- How to Crawl the Web, Compaq SRC, February 8, 2000.
- Evaluating Archival Repositories, Internet Archive Workshop, March 8, 2000.
- Barriers to Digital Libraries, Intel Corp., Santa Clara, April 12, 2000.
- Distinguished Lecture Series at: U. Illinois Urbana, U. Chicago (November 1999).
- Distinguished Lecture Series at: U. California Santa Barbara (March 23,2000).
Wang Lam
- Intel New Business Group
Andreas Paepcke
- DietORB presentation at NPACI meeting at SDSC
- Invited talk at University of Jena, Germany
- Invited talk at San Diego Supercomputer Center
- Invited talk at Columbia University
- Invited talk at Humboldt University, Berlin, Germany
- Demo/Presentation to representatives of Cadence Design Systems

6.3 Local Events

NSF/ARPA/NASA site visit
Stanford Forum. A large gathering of industry representatives. Presented archiving work and Power Browser.
Hosted SDLIP planning meeting.
The DL Project participated in hosting the industrial database workshop here at Stanford. The meeting attracted more than 80 people from numerous companies. Pictures are available.

6.4 Regular Meetings/Seminars

Weekly Digital Library student meeting
Executive committee meetings when required
Weekly technical design meetings

7. References for 1999/2000

The references below cover only the reporting period, and the period between DLI1 and DLI2. For a more complete list of the group's publications, please visit our Web site at http://www-diglib.stanford.edu.

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

8. Other Publications

We produced a video tape describing our Power Browser. It was shown at CHI2000, WWW9, at an NPACI meeting in Washington D.C., and to several visitors.

9. References

[1] Michelle Baldonado, Steve Cousins, B. Lee, and Andreas Paepcke. Notable: An Annotation System for Networked Handheld Devices. In Proceedings of the Conference on Human Factors in Computing Systems, pp. 210-211, 1999.

[2] Onn Brandman, Junghoo Cho, Hector Garcia-Molina, and Narayanan Shivakumar. Crawler-Friendly Web Servers. In Proceedings of the Workshop on Performance and Architecture of Web Servers (PAWS), Santa Clara, California, June, 2000. Held in conjunction with ACM SIGMETRICS 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0138.

[3] Onn Brandman, Hector Garcia-Molina, and Andreas Paepcke. Where Have You Been? A Comparison of Three Web Tracking Technologies. Submitted for publication, 1999. Available at http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1999-0105.

[4] Orkut Buyukkokten, Junghoo Cho, Hector Garcia-Molina, Luis Gravano, and Narayanan Shivakumar. Exploiting Geographical Location Information of Web Pages. In Proceedings of Workshop on Web Databases (WebDB'99), June, 1999. Held in conjunction with ACM SIGMOD'99. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0136.

[5] Orkut Buyukkokten, Hector Garcia-Molina, Andreas Paepcke, and Terry Winograd. Power Browser: Efficient Web Browsing for PDAs. In Proceedings of the Conference on Human Factors in Computing Systems, 2000.

[6] Orkut Buyukkokten, Hector Garcia-Molina, and Andreas Paepcke. Focused Web Searching with PDAs. In Proceedings of the Ninth World-Wide Web Conference, 2000.

[7] Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. Predicate Rewriting for Translating Boolean Queries in a Heterogeneous Information System. ACM Transactions on Information Systems, 17(1):1-39, January, 1999. Available at http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1996-0028.

[8] Junghoo Cho and Hector Garcia-Molina. Synchronizing a Database to Improve Freshness. In Proceedings of the International Conference on Management of Data, 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-1999-0116.

[9] Junghoo Cho, Narayanan Shivakumar, and Hector Garcia-Molina. Finding Replicated Web Collections. In Proceedings of the International Conference on Management of Data, 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-1999-0118.

[10] Junghoo Cho and Hector Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. Submitted for publication, 1999. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-1999-0129.

[11] Junghoo Cho and Hector Garcia-Molina. Estimating Frequency of Change. Submitted for publication, 2000.

[12] Junghoo Cho, Narayanan Shivakumar, and Hector Garcia-Molina. Computing Document Clusters on the Web. In Proceedings of the International Conference on Management of Data, 1998.

[13] Brian Cooper and Hector Garcia-Molina. InfoMonitor: Unobtrusively Archiving a World Wide Web Server. Submitted for publication, 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-1999-0125.

[14] Brian Cooper, Arturo Crespo, and Hector Garcia-Molina. Implementing a Reliable Digital Object Archive. Submitted for publication, 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-1999-0127.

[15] Arturo Crespo and Hector Garcia-Molina. Modeling Archival Repositories for Digital Libraries. Submitted for publication, 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-1999-0130.

[16] Arturo Crespo, Orkut Buyukkokten, and Hector Garcia-Molina. Efficient Query Subscription Processing in a Multicast Environment. In Proceedings of the 16th International Conference on Data Engineering, 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0137.

[17] Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. WebBase: A repository of web pages. In Proceedings of the Ninth World-Wide Web Conference, 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-1999-0124.

[18] Yongqiang Huang and Hector Garcia-Molina. Exactly-Once Semantics in a Replicated Messaging System. Submitted for publication, 2000.

[19] Sergey Melnik, Hector Garcia-Molina, and Andreas Paepcke. A Mediation Infrastructure for Digital Library Services. In Proceedings of the Fifth ACM International Conference on Digital Libraries, 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-1999-0126.

[20] Andreas Paepcke, Hector Garcia-Molina, Gerard Rodriguez, and Junghoo Cho. Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies. SIGMOD Record, 29(1), March, 2000. Available at http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1998-0099.

[21] Andreas Paepcke, Michelle Baldonado, Chen-Chuan K. Chang, Steve Cousins, and Hector Garcia-Molina. Using distributed objects to build the Stanford digital library Infobus. IEEE Computer, 32(2):80-87, February, 1999. Similar version available at http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1998-0096.

[22] The Simple Digital Library Interoperability Protocol. Stanford University, University of California Berkeley, University of California Santa Barbara, San Diego Supercomputer Center, and California Digital Library, 2000. Available at http://www-diglib.stanford.edu/testbed/doc2/SDLIP/.

[23] Andreas Paepcke, Robert Brandriff, Greg Janee, Ray Larson, Bertram Ludaescher, Sergey Melnik, and Sriram Raghavan. Search Middleware and the Simple Digital Library Interoperability Protocol. D-Lib Magazine, March, 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0135.