Script for NSF Exhibit on Digital Libraries

May-September 1997

Introduction

The Digital Libraries Initiative (DLI) is a federally sponsored research program to understand and foster our use of digital information at home, at school, and at work, now and tomorrow. This spans a broad range of questions, and the sections of this exhibit take up three of the most basic: What kinds of materials live in digital collections? Where do collections live? And how do we find and use information?

Six university-led public/private partnerships are examining these and related issues. In this exhibit, we invite you to explore some of the questions motivating this research and to examine some of the findings. But these are dynamic projects in a fast-changing world; take a moment to visit the web pages of each project to see what's new.

Three Sponsoring Agencies

National Science Foundation <http://www.nsf.gov/>

Defense Advanced Research Projects Agency <http://www.darpa.mil/>

National Aeronautics and Space Administration <http://www.nasa.gov/>

This exhibit was organized by D-Lib and the National Science Foundation on behalf of the Digital Libraries Initiative (DLI). D-Lib is a forum for researchers and developers of advanced digital libraries, sponsored by DARPA on behalf of the DLI and coordinated by the Corporation for National Research Initiatives.

1. What kinds of materials live in digital collections?

Almost any kind of information can exist in digital form - music, images, text, motion pictures, speech, and so on - and the universe of this information is expanding. Some information is created digitally, like satellite images or remote sensing data. Other material must be converted - the purpose of the large projects now scanning historical collections, corporate archives, and technical journals. But different kinds of digital data have different storage and other requirements, and some, like video, pose particularly complex retrieval problems.

A "document" can take many forms but is characterized by three properties: content, or what you are trying to say; structure, or how the content is organized; and format, or how the content and structure are encoded so that we can store, find, and use these documents. The "California Dams" research shows how structure can be independent of content. In this demo, you can look at the same content - information about all the dams in the state of California - as an image of a page, as text that can be searched, or as a tabular display of a subset of the information, depending on your query. Try it!
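To make the distinction concrete, here is a small sketch in Python. The dam records and the two rendering functions are invented for illustration - the actual "California Dams" demo works differently - but the principle is the same: one body of content, more than one structure and format.

    # One content, two presentations. The records and rendering functions
    # are hypothetical stand-ins for the "California Dams" demo.

    dams = [
        {"name": "Oroville", "county": "Butte", "height_ft": 770},
        {"name": "Shasta", "county": "Shasta", "height_ft": 602},
    ]

    def as_text(records):
        """Render the content as running text, suitable for full-text search."""
        return " ".join(
            f"{r['name']} Dam, in {r['county']} County, is {r['height_ft']} feet high."
            for r in records
        )

    def as_table(records, fields):
        """Render a chosen subset of the content as a simple table."""
        header = " | ".join(fields)
        rows = [" | ".join(str(r[f]) for f in fields) for r in records]
        return "\n".join([header] + rows)

    print(as_text(dams))
    print(as_table(dams, ["name", "height_ft"]))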

The formats in which we store documents can differ substantially in size. For example, the text of a printed page amounts to only a few thousand characters, while a high-resolution scanned image of that same page can run to a megabyte or more.

Retrospective conversion is the process by which information in print format can be expressed in digital form - either as a sequence of characters or as a digitized image. Many rare and unique materials already exist in print on paper, and one important aspect of building large digital libraries is scanning, or digitizing, these materials and then indexing them for storage and future retrieval. Scanning is a lot harder than it looks: Researchers at the Alexandria Project have found that reliably scanning an aerial photograph can require 12 separate steps and 15-20 minutes. The item must then be indexed for future use - a separate process requiring several additional steps. How to automate these processes, and reduce the need for intensive human involvement, is an area of research.

Video embodies several media that can be taken apart, interpreted separately - letting us employ different tools for different components - and then re-integrated. For example, some of the research at Carnegie Mellon is devoted to automatic speech recognition, converting speech to text so that we can search the text with one set of tools while searching images with other strategies. Storing and searching different media are separate research issues; a first step is to take the complex "document" apart so that we can differentiate among the components.
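Here is a sketch, in Python, of the "take it apart, index each piece" idea. Every step below is a stub standing in for real machinery (demultiplexers, speech recognizers, image analyzers); the Informedia system is far more sophisticated.

    # Decompose a video "document" so each medium gets its own index.
    # All three helpers are placeholders, not real components.

    def split_video(video):
        """Separate a video into its component media (stubbed here)."""
        return {"audio": video + ".audio",
                "frames": [video + f".frame{i}" for i in range(3)]}

    def transcribe(audio):
        """Stand-in for automatic speech recognition: audio -> searchable text."""
        return "placeholder transcript of the soundtrack"

    def describe_frame(frame):
        """Stand-in for image analysis: frame -> visual features."""
        return {"frame": frame, "features": "placeholder color/texture features"}

    def index_video(video):
        parts = split_video(video)
        return {
            "text_index": transcribe(parts["audio"]),  # searched with text tools
            "image_index": [describe_frame(f) for f in parts["frames"]],
        }

    print(index_video("news_clip.mpg"))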

Standard Generalized Markup Language (SGML) is a set of codes that lets us subdivide a document into components (like chapters and paragraphs). Recognizing this underlying structure means that we can partition documents in consistent ways, save them efficiently, and retrieve only the relevant parts. Combined with style information, SGML also lets us preserve display, so that a page in an engineering journal appears on the screen the same way it appears in print.
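A small illustration: the sketch below uses XML, a later, simplified descendant of SGML, so that Python's standard library can parse it. The tag names are invented, not drawn from any real journal's markup.

    # Because the structure is explicit in the markup, we can retrieve
    # just the relevant part instead of shipping the whole document.
    import xml.etree.ElementTree as ET

    doc = """<article>
      <title>Dams of California</title>
      <chapter id="1"><para>Oroville is the tallest dam in the state.</para></chapter>
      <chapter id="2"><para>Shasta impounds the Sacramento River.</para></chapter>
    </article>"""

    root = ET.fromstring(doc)

    # Fetch only chapter 2, leaving the rest of the document behind.
    for chapter in root.iter("chapter"):
        if chapter.get("id") == "2":
            print(chapter.find("para").text)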

2. Where do collections live?

Digital information is stored on computers across the country and throughout the world. Advances in computing and communications technologies mean that separate computers can be networked, and users can find and use information any place and at any time, restricted only by conditions applied by the owners of the information. But some kinds of documents, like images, are so large that simply downloading them can pose performance problems. And in a rapidly changing environment, not all networks, computers, and collections can or will "speak" the same language.

So two important groups of questions are: How do we build systems that can interoperate - that is, let users work across heterogeneous collections and systems without worrying about compatibility or learning different procedures? And how can we store material so that people can find what they want more easily and efficiently - either by partitioning it or by describing it?

What goes on behind the scenes?

Heterogeneity exists at many levels - from the search systems that end-users see down to the switches and routers that process and manage the flow of bits and bytes over the network. To cope with rapid change and increasing variety, researchers at Stanford are devising sets of computing specifications, or rules, called "protocols". Protocols, like zoning codes in architecture or grammar in language, do not require similar programs and systems to be identical; they establish a design "envelope", or framework, within which variations in the specifics can co-exist.
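As a rough illustration of the "envelope" idea, the Python sketch below fixes one operation that every search service must offer while leaving each service's internals free to vary. It is an invented example, not Stanford's actual protocol suite.

    # The protocol fixes the operations; the internals are free to differ.
    from abc import ABC, abstractmethod

    class SearchProtocol(ABC):
        @abstractmethod
        def search(self, query: str) -> list[str]:
            """Return identifiers of matching items."""

    class ImageLibrary(SearchProtocol):
        def search(self, query):
            return [f"image matching '{query}'"]  # internally: texture/color matching

    class TextLibrary(SearchProtocol):
        def search(self, query):
            return [f"document mentioning '{query}'"]  # internally: full-text index

    # A client written against the protocol works with both, unchanged.
    for library in (ImageLibrary(), TextLibrary()):
        print(library.search("dams"))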

Another approach to coping with rapid change is the use of "agents". We can think of an "agent" as a program that provides a service and can adapt to new information without significant - or any - re-programming. Like protocols, agents exist behind the scenes - we need never see them. Researchers at the University of Michigan are working on the notion of societies of agents: collections of computer programs, each providing a specific service, that team up to achieve a goal.
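The toy Python "society" below gives the flavor: each agent provides one narrow service, and a simple broker chains them together, so a new agent can join without reprogramming the rest. The Michigan architecture is far richer than this.

    # Each agent offers one service; the broker routes work among them.
    # All names and data here are invented for illustration.

    class QueryAgent:
        service = "parse"
        def handle(self, task):
            return {"terms": task["query"].lower().split()}

    class CollectionAgent:
        service = "search"
        def handle(self, task):
            catalog = {"dams": ["California Dams report"], "floods": ["Flood atlas"]}
            return [hit for term in task["terms"] for hit in catalog.get(term, [])]

    def broker(agents, query):
        """Chain the agents; adding a new agent needs no reprogramming here."""
        task = {"query": query}
        parser = next(a for a in agents if a.service == "parse")
        searcher = next(a for a in agents if a.service == "search")
        task["terms"] = parser.handle(task)["terms"]
        return searcher.handle(task)

    print(broker([QueryAgent(), CollectionAgent()], "Dams"))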

What's stored where? What's processed where?

Using agents also means that demand on the network can be reduced - and a crowded network is increasingly an issue. Even without the congestion that comes from rapid growth in the number of users, some kinds of documents are so large and used so infrequently that we want to store them only once and download them as needed. But retrieving such a document in its entirety would tie up lines unnecessarily and might deliver to the desktop far more than we need.

Researchers at the Alexandria Project at the University of California, Santa Barbara, are experimenting with a set of mathematical techniques called "wavelets" for storing, partitioning, and retrieving extremely large images. These techniques support progressive resolution of portions of an image, so that users can browse a coarse version of it or zoom in on a detail. This has several implications for performance: In terms of storage, the lower-resolution data, which are accessed more frequently than the full-resolution data, can be kept on faster devices for efficient browsing. At the desktop, only some of the data need be transferred to reconstruct an image stored elsewhere.
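The Python sketch below shows the core idea with a one-level Haar-style decomposition: the 2x2 block averages form a coarse, quarter-size image good enough for browsing, and the detail coefficients are fetched only if the user zooms in. Alexandria's wavelet machinery is far more elaborate than this.

    import numpy as np

    def haar_level(image):
        """Split an even-sized image into a coarse half-resolution image
        plus the details needed to reconstruct it exactly."""
        a = image[0::2, 0::2]; b = image[0::2, 1::2]
        c = image[1::2, 0::2]; d = image[1::2, 1::2]
        coarse = (a + b + c + d) / 4.0                   # browse this first
        details = (a - coarse, b - coarse, c - coarse)   # fetch on zoom
        return coarse, details

    def reconstruct(coarse, details):
        da, db, dc = details
        a, b, c = coarse + da, coarse + db, coarse + dc
        d = 4 * coarse - a - b - c        # the fourth pixel is implied
        image = np.empty((2 * coarse.shape[0], 2 * coarse.shape[1]))
        image[0::2, 0::2], image[0::2, 1::2] = a, b
        image[1::2, 0::2], image[1::2, 1::2] = c, d
        return image

    image = np.arange(16, dtype=float).reshape(4, 4)
    coarse, details = haar_level(image)
    assert np.allclose(reconstruct(coarse, details), image)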

Engineers at the Alexandria Project, like researchers at Berkeley and Carnegie Mellon, are investigating retrieval of images. Alexandria's approach is based on characteristics of texture and color. Images can be segmented and the segments compared, so that someday, not too far off, we will be able to ask this question of a collection of aerial photographs: "Show me all the images with cornfields in them."
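A miniature of the idea in Python: reduce each image (or image segment) to a color histogram and rank candidates by histogram similarity. The toy arrays below stand in for real photographs, and real systems add texture features and segmentation.

    import numpy as np

    def histogram(image, bins=4):
        """Normalized gray-level histogram: a crude content signature."""
        h, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
        return h / h.sum()

    def similarity(h1, h2):
        """Histogram intersection: 1.0 means identical distributions."""
        return float(np.minimum(h1, h2).sum())

    rng = np.random.default_rng(0)
    cornfield = rng.uniform(0.5, 0.8, (8, 8))   # toy stand-in for a cornfield
    water = rng.uniform(0.0, 0.3, (8, 8))       # toy stand-in for open water
    query = rng.uniform(0.5, 0.8, (8, 8))       # "show me images like this one"

    for name, img in [("cornfield", cornfield), ("water", water)]:
        print(name, round(similarity(histogram(query), histogram(img)), 2))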

Describing Data: What is Metadata?

Metadata is data about data: A metadata record can describe a collection or an individual item - image, text, database, video clip, and so on. We can store the metadata records separately from the material they describe so that when users request information about images, documents, or collections, less data is sent. This means that the system responds more quickly and the demand on the network is reduced.

Not all metadata records are the same, because what we need to describe a video is different from what we need to describe a database. But we need enough similarity among records so that we can find related material across different media. So one cluster of research questions is: "What are the minimum requirements for all items? And what are the additional requirements for collections of related items?" Because creating metadata is labor-intensive, a second set of questions asks: "What can we automate? And how do we do it?"
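One way to picture the "minimum plus extensions" answer is sketched below: every record supplies the same core fields, so cross-media searches work, while each medium adds what it alone needs. The field names are illustrative, not drawn from any particular metadata standard.

    from dataclasses import dataclass

    @dataclass
    class CoreRecord:                 # the minimum every item must supply
        title: str
        creator: str
        date: str
        media_type: str

    @dataclass
    class VideoRecord(CoreRecord):    # what a video additionally needs
        duration_seconds: int

    @dataclass
    class DatasetRecord(CoreRecord):  # what a database additionally needs
        fields: tuple

    items = [
        VideoRecord("Dam opening", "KXYZ News", "1996", "video",
                    duration_seconds=90),
        DatasetRecord("California dams", "State of CA", "1995", "dataset",
                      fields=("name", "county", "height")),
    ]

    # A cross-media search needs only the shared core.
    print([i.title for i in items if "dam" in i.title.lower()])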

3. How do we find and use information?

Metadata can let us improve performance. But, like progressive retrieval of images, it is also a tool for rapid browsing of materials. Browsing is one way to select information - but before we can browse a handful of images, we have to find relevant material in the first place. In the expanding universe of heterogeneous information, finding relevant material is a problem with many facets. How do we look for information - what concepts and words do we use? And once we bring resources to the desktop, what tools can we use to work with them?

Human/Computer Interfaces

Human-computer interface design deals with how the display is organized. DLITE, developed at Stanford, helps users integrate the results of many disparate services, supports sharing and reuse of information, and is "extensible", meaning that it is designed to encourage others to build additional capabilities onto it.

PAD++ was originally developed with funding from DARPA but is being integrated into the research program at the University of Michigan. The Highly-Interactive Computing Research Group at Michigan is also studying use of digital libraries in schools. With support from the University of Michigan, the Michigan Department of Education, and the Ann Arbor Public Schools as well as from NSF, NASA, and the DLI, these researchers are undertaking a broad range of investigations in learning and technology.

When is meaning the same? And when is it different?

We usually look for information by submitting a "query". Problems frequently arise because the same word can have different meanings, depending on its context or the intent of the searcher. The interface to the spatial collections of the Alexandria Digital Library shows you the notion of location as a way to search spatially organized collections without resorting to words. But the Interspace shows you that the same idea, "California", can also be a way to find related environmental information. They're the same - but they're also different.

How do we find what's useful?

Stanford University's SenseMaker is one way to help users find related materials. The program is designed to run over a group of documents stored locally or on the web, and find the ones that are similar.

Another way to select what's useful from what's not is through iterative searching - taking words and concepts from one set of documents and asking the system to search again. This is the approach embodied in IODyne, an experimental system developed at the University of Illinois at Urbana-Champaign, which uses traditional tools, like subject thesauri, and new tools, like concept spaces, to get at documents - or parts of documents - where the concepts are the same but the words may be different.
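The Python sketch below shows the flavor of a concept space: a table of related terms - written by hand here, but derived statistically from term co-occurrence in the Illinois research - that expands a query so documents using different words can still be found.

    # All terms and documents below are invented for illustration.
    concept_space = {
        "flood": {"inundation", "levee", "overflow"},
        "dam": {"reservoir", "impoundment"},
    }

    documents = {
        "doc1": "levee failure caused widespread inundation",
        "doc2": "the reservoir behind the new impoundment",
        "doc3": "annual rainfall statistics",
    }

    def expand(query_terms):
        """Add each query term's related terms from the concept space."""
        expanded = set(query_terms)
        for term in query_terms:
            expanded |= concept_space.get(term, set())
        return expanded

    def search(query_terms):
        terms = expand(query_terms)
        return [d for d, text in documents.items() if terms & set(text.split())]

    print(search({"flood"}))  # finds doc1, though "flood" never appears in it
    print(search({"dam"}))    # finds doc2 the same way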

We're used to abstracts and summaries as tools to help us find documents that are useful. We're less used to the notion of visual abstracts, but researchers at Carnegie Mellon's Informedia project have devised a way to provide them. Try it!

One of the powerful advantages of digital materials is that once we find relevant information, we can work with it at the desktop without resorting to yellow markers, scissors, and re-typing. But because it is so easy to manipulate the information, we need tools that will help us authenticate materials as well as manage them. The SCAM program, developed by researchers at Stanford University, asks the question: Are these documents similar? SCAM has already proved useful in identifying instances of plagiarism. Try it!
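One standard way to ask "are these documents similar?" is to compare the overlapping word sequences ("shingles") the documents contain, as the Python sketch below does. This is a generic technique shown for illustration, not SCAM's actual algorithm.

    def shingles(text, k=3):
        """All overlapping k-word sequences in the text."""
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def overlap(a, b, k=3):
        """Jaccard similarity of the two documents' shingle sets."""
        sa, sb = shingles(a, k), shingles(b, k)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    original = "the wavelet transform supports progressive retrieval of large images"
    suspect = "as is well known the wavelet transform supports progressive retrieval"
    unrelated = "metadata records describe collections and individual items"

    print(round(overlap(original, suspect), 2))    # substantial: a shared passage
    print(round(overlap(original, unrelated), 2))  # zero: nothing in common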

At the desktop, at home, at school, and at work, relevant information in digital form promises to enable us to explore relationships among different kinds of information. One example of these tools is Berkeley's "multivalent document model", which supports annotation, overlays of different kinds of information, zooming in on details, and backing off for a broader view. It's already been used in flood recovery efforts in California. Try it!

Learn more about the DLI!

©1997 Corporation for National Research Initiatives