Stanford Digital Library
Research and Education Activities
NSF Progress Report - June 2002

One of the serious concerns that carry over from physical libraries to digital libraries is the loss of collections over time. We are investigating solutions for the long-term archiving of large digital collections. Our current focus is on safeguarding collections at the bit level, and on the economic factors of this problem.

During the reporting period we succeeded in analyzing mechanisms for trading disk space efficiently. In particular, we were able to show that space auctions without the complications of currency yielded high collection reliability at reasonable cost, along with good utilization of system-wide disk space resources.
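
The bid-matching idea behind such currency-free space auctions can be sketched as follows. The function names, the simple lowest-bid rule, and the notion of prices quoted in reciprocal storage space are illustrative assumptions, not the precise mechanism of [1, 2].

```python
def choose_host(collection_size, bids):
    """A site seeking a remote replica solicits bids from its peers.
    Each bid is a 'price' quoted in reciprocal storage space (bytes of
    local space the bidder demands in return), so no currency changes
    hands. The trading site accepts the cheapest bid.

    bids: {site_name: space_demanded_in_bytes}
    Returns the winning site and the space it must be granted locally.
    """
    if not bids:
        raise ValueError("no bids received")
    winner = min(bids, key=bids.get)
    return winner, bids[winner]

# Example: three peers bid to host a 10 GB collection.
site, price = choose_host(10 * 2**30,
                          {"peerA": 12 * 2**30,
                           "peerB": 9 * 2**30,
                           "peerC": 15 * 2**30})
```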

We also constructed a simulator for guiding library administrators through the tradeoffs between equipment expense and reliability. The model also computes (probabilistically) when equipment should be replaced to minimize the likelihood of collection service failure. Research published during the reporting period: [1, 2, 3, 4, 5].
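
The kind of question the simulator answers can be illustrated with a small Monte Carlo sketch. The wear-out model, all parameter names, and the specific numbers below are invented for illustration and do not reproduce our actual model.

```python
import random

def survival_probability(copies=3, years=10, base_fail_prob=0.03,
                         wearout=0.5, replace_after=None, trials=10000):
    """Monte Carlo estimate of the probability that at least one copy
    of a collection is still readable after the given number of years.
    Failure probability grows with equipment age (a toy wear-out
    model), so proactive replacement can pay off."""
    survived = 0
    for _ in range(trials):
        ages = [0] * copies          # -1 marks a failed copy
        for _ in range(years):
            for i in range(copies):
                if ages[i] < 0:
                    continue
                if replace_after is not None and ages[i] >= replace_after:
                    ages[i] = 0      # proactive replacement resets age
                p = min(1.0, base_fail_prob * (1 + wearout * ages[i]))
                if random.random() < p:
                    ages[i] = -1
                else:
                    ages[i] += 1
        if any(a >= 0 for a in ages):
            survived += 1
    return survived / trials

random.seed(42)
one_copy = survival_probability(copies=1)
three_copies = survival_probability(copies=3)
replaced = survival_probability(copies=1, replace_after=3)
```

Comparing such estimates across replication levels and replacement schedules against equipment cost is the tradeoff the simulator explores.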

User Interfaces to Information

We continued our development of the PowerBrowser. This browser enables users to browse and search the Web from small, handheld devices. We needed to rethink the display and pre-processing of information in order to avoid the trap of trying simply to miniaturize full-desktop Web pages. In particular, we extended our tree browsing techniques to the small-display email domain. Research published during the reporting period: [6, 7].

We also began a new effort to enable non-technical end users to browse collections of photographs. Our self-imposed goal for this effort is a facility that can operate at the scale of a lifetime's worth of photographs. Our initial approach has used time clustering and cluster-based sampling to prepare informative photograph summary screens for the user. Our time clustering is recursive, which enables the user to modify the time-based zoom level for viewing (years vs. months vs. days, etc.). Research published during the reporting period: [8].
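
The recursive clustering step can be sketched as follows; the gap-based grouping rule and all names here are illustrative simplifications of the published approach [8].

```python
from datetime import datetime, timedelta

def cluster_by_gap(times, max_gap):
    """Group datetimes: a new cluster starts whenever the gap between
    consecutive photos exceeds max_gap."""
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters

def recursive_clusters(times, gaps):
    """Apply successively finer gap thresholds (e.g. months, then
    hours), producing the nested structure behind a time-based zoom."""
    if not gaps:
        return sorted(times)
    return [recursive_clusters(c, gaps[1:])
            for c in cluster_by_gap(times, gaps[0])]

# Three photos one June morning, two on a December evening.
shots = [datetime(2002, 6, 1, h) for h in (9, 10, 11)] + \
        [datetime(2002, 12, 24, h) for h in (18, 19)]
tree = recursive_clusters(shots, [timedelta(days=30), timedelta(hours=6)])
```

The outer level of `tree` separates the two events (a months-scale view); descending a level yields the finer, hours-scale grouping.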

Also in the user interface area of our progress has been the continued development of a medical tablet application. The application helps the teams that are responsible for the safe transport of critically ill infants and children from outlying hospitals to the Stanford Hospital's Neonatal and Pediatric Intensive Care Units. We devised an information gathering and sharing system that allows timely information flow to all involved parties, particularly to the IC unit physicians. As part of this effort we invented a novel input technique for numerical data, using only gross motor skill movement. During this reporting period we focused particularly on pilot user testing by nurses and physicians at the Stanford Hospital, and on information security issues.

Information Collection, Storage, and Distribution

Our WebBase project has progressed to the point where we can collaborate with other academic institutions. The implementation has also progressed far enough for us to publish research that required WebBase to carry out. WebBase is a highly customizable facility for crawling, storing, indexing, and finally distributing Web content at high speed. We have investigated several crawling algorithms. The challenge at the indexing stage has been the unification of traditional inverted text indexes (computed in a distributed environment) with other, less common indexes over, for example, the collection's link graph. A challenge yet to be addressed is support for queries over these mixed index structures.
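
For readers unfamiliar with the first of these index types, an inverted text index maps each term to the documents containing it. The toy single-machine sketch below, with invented names, only conveys the idea; the real system computes such indexes in a distributed environment over far larger inputs.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """pages: {doc_id: text}. Returns {term: sorted list of doc ids},
    the structure a search engine consults to find documents
    containing a query term."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index({
    1: "digital library collections",
    2: "digital archives",
})
```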

While we made progress throughout the entire system, high-speed distribution of the collection has advanced particularly far during the past year. We now distribute simple WebBase client software to interested universities. Our colleagues can then write plugin code that analyzes Web pages with a novel algorithm they are researching. Once the plugin is in place, the WebBase client requests a stream of pages from WebBase and passes the stream to the plugin. In fact, remote researchers may provide any number of plugins, perhaps to test variations of their algorithm, or to explore unrelated analysis approaches. The WebBase client feeds the stream to all of the plugins at high speed.
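
In outline, the client-side fan-out works as sketched below; the class and method names are invented for illustration and do not reflect the actual WebBase client API.

```python
class PagePlugin:
    """Interface a researcher implements: receive one page at a time."""
    def handle(self, url, html):
        raise NotImplementedError

class LinkCounter(PagePlugin):
    """Example analysis: count anchor tags across the stream."""
    def __init__(self):
        self.links = 0
    def handle(self, url, html):
        self.links += html.lower().count("<a ")

def feed(stream, plugins):
    """Fan the requested page stream out to all registered plugins,
    so several analyses run over a single pass of the collection."""
    for url, html in stream:
        for plugin in plugins:
            plugin.handle(url, html)

counter = LinkCounter()
feed([("http://example.org/a", "<a href='/x'>x</a> <a href='/y'>y</a>"),
      ("http://example.org/b", "<p>no links</p>")],
     [counter])
```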

This distribution machinery represents a significant saving in time and effort for these research colleagues. Instead of needing to deploy a crawler, monitor it carefully, and retain all of the pages for subsequent analysis, they can instantly request our current collection of 200 million pages. We can serve the stream at the speed of the network or of the client's plugin processing, whichever is the bottleneck.

Our collaborations have included the Digital Library Projects at Columbia University and UC Berkeley, as well as the University of Washington.

Our own use of WebBase has resulted in an algorithm, topic-sensitive PageRank, that yields significantly higher precision than the original PageRank algorithm. PageRank was also developed by our Digital Library Project (under DLI 1) and is used (in combination with other ranking methods) by the Google search engine. Research published during the reporting period: [9, 10, 11, 12, 13].
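
The core idea can be illustrated with ordinary PageRank power iteration in which the uniform teleport vector is replaced by one biased toward pages of a given topic. This sketch uses invented names and omits the dangling-page and convergence details of the real algorithm [9].

```python
def pagerank(links, n, teleport, damping=0.85, iters=100):
    """Power iteration on a link graph {source: [targets]}.
    A uniform teleport vector gives ordinary PageRank; concentrating
    the teleport mass on pages of one topic biases the ranking toward
    that topic. Dangling pages are ignored for simplicity."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - damping) * teleport[i] for i in range(n)]
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
        rank = new
    return rank

# A three-page cycle: 0 -> 1 -> 2 -> 0.
links = {0: [1], 1: [2], 2: [0]}
uniform = pagerank(links, 3, [1/3] * 3)
biased = pagerank(links, 3, [1.0, 0.0, 0.0])   # topic mass on page 0
```

With the uniform vector all three pages tie by symmetry; with the biased vector, page 0 (and pages it links toward) gain rank.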

An altogether new effort that was not anticipated in our initial proposal is an investigation into optimization problems in peer-to-peer (P2P) networks. During the reporting period we developed novel index technology that retains many of the advantages of P2P distributed information storage while reducing the amount of network traffic generated by queries. Research published during the reporting period: [3].
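
The flavor of such an index can be conveyed by a sketch in which each node records, per neighbor, how many documents on each topic are reachable through that neighbor, and forwards a query only along the most promising link. The names and the simple count-based goodness measure are illustrative, not the exact scheme of [3].

```python
def best_neighbor(routing_index, topic):
    """routing_index: {neighbor: {topic: reachable_document_count}}.
    Forward the query to the single neighbor whose subtree promises
    the most documents on the topic, instead of flooding all
    neighbors with the query."""
    return max(routing_index,
               key=lambda n: routing_index[n].get(topic, 0))

ri = {"nodeA": {"databases": 50, "theory": 5},
      "nodeB": {"databases": 3, "theory": 40}}
```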

In addition, see [14, 15, 16] for publications on reliable delivery of information to mobile users.

Training and Development

Same as for the ITR report.

Other Specific Products
  1. WebBase, a research tool that collects, indexes, stores, and is able to distribute Web pages at high speed. The facility allows the research community to develop and test Web-related algorithms without the inconvenience, time, and effort of first assembling a large collection of Web pages.
  2. A tablet-based information entry system for teams of nurses, respiratory specialists, physicians, and administrators to share in real time the data that is collected during the transport of critically ill infants, children, and high-risk maternity patients. We are developing this system in close cooperation with physicians and nurses at the Stanford Hospital.
Contributions to Education and Human Resources

Same as for the ITR report, plus: our Digital Library Project guided one student from his undergraduate research project, through a subsequent master's program, to his first scientific research publication [8].

Contributions Beyond Science and Engineering

Please see our description of the tablet-based information sharing among medical emergency workers above.


[1] Brian Cooper and Hector Garcia-Molina. Creating Trading Networks of Digital Archives. In Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries, 2001.

[2] Brian F. Cooper and Hector Garcia-Molina. Bidding for storage space in a peer-to-peer data preservation system (Extended version). Technical Report 2002-22, Stanford University, March 2002.

[3] Arturo Crespo and Hector Garcia-Molina. Routing Indices For Peer-to-Peer Systems. In Proceedings of the 22nd International Conference on Distributed Computing Systems, 2002.

[4] Brian F. Cooper, Mayank Bawa, Neil Daswani, and Hector Garcia-Molina. Protecting the PIPE from Malicious Peers. Technical report, 2002.

[5] Brian F. Cooper and Hector Garcia-Molina. Peer-to-Peer Data Trading to Preserve Information. ACM Transactions on Information Systems, 20(1), April 2002.

[6] Oliver Kaljuvee, Orkut Buyukkokten, Hector Garcia-Molina, and Andreas Paepcke. Efficient Web Form Entry on PDAs. In Proceedings of the Tenth International World-Wide Web Conference, 2001.

[7] Orkut Buyukkokten, Hector Garcia-Molina, and Andreas Paepcke. Seeing the Whole in Parts: Text Summarization for Web Browsing on Handheld Devices. In Proceedings of the Tenth International World-Wide Web Conference, 2001.

[8] Adrian Graham, Hector Garcia-Molina, Andreas Paepcke, and Terry Winograd. Time as Essence for Photo Browsing Through Personal Digital Libraries. Technical Report 2002-4, Stanford University, January 2002.

[9] Taher H. Haveliwala. Topic-Sensitive PageRank. In Proceedings of the Eleventh International World-Wide Web Conference, 2002.

[10] Wang Lam and Hector Garcia-Molina. Multicasting a Web Repository. In Fourth International Workshop on the Web and Databases (WebDB 2001), pp. 25-30, 2001.

[11] Junghoo Cho and Hector Garcia-Molina. Parallel Crawlers. Technical Report 2002-9, Stanford University, February 2002.

[12] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the Web. ACM Transactions on Internet Technology, 2001. Submitted for publication.

[13] Wang Lam and Hector Garcia-Molina. Multicasting a Web Repository. In Fourth International Workshop on the Web and Databases (WebDB 2001), pp. 25-30, 2001.

[14] Yongqiang Huang and Hector Garcia-Molina. Exactly-once Semantics in a Replicated Messaging System. In Proceedings of the 17th International Conference on Data Engineering, pp. 3-12, 2001.

[15] Yongqiang Huang and Hector Garcia-Molina. Publish/Subscribe in a Mobile Environment. In Proceedings of the Second ACM International Workshop on Data Engineering for Wireless and Mobile Access, pp. 27-34, 2001.

[16] Yongqiang Huang and Hector Garcia-Molina. Replicated condition monitoring. In Proceedings of the 20th ACM Symposium on Principles of Distributed Computing, pp. 229-237, 2001.