Terence R. Smith
Director, Alexandria Digital Library Project
University of California at Santa Barbara
Santa Barbara, CA 93106
D-Lib Magazine, July/August 1996
The research in this paper was supported in part by NSF IRI94-11330.
Libraries are organized to facilitate access to controlled collections of information. Traditional libraries (TL's) possess three organizational characteristics that, together, provide a basis for such access. These are
As currently conceived, digital libraries (DL's) are libraries in which the controlled collections are in digital form and access to the information in the collections is based almost entirely on digital technology. From a user's point of view, digital technology changes the three organizational characteristics of TL's. First, the organization of information into physical IO's is replaceable with a more flexible organization into logical IO's. Second, the single physical organization of a collection of IO's is replaceable with multiple logical organizations of IO's.
The third, and most significant, set of changes occurs in the meta-information environment of a library. In terms of advantages, having the IO's in digital form permits the use of digital technology in extracting information from the IO's. The extracted information may satisfy a user's ultimate need for information or it may be employed by ``digital librarians'' in characterizing the IO's in the collection. In the latter case, this meta-information may be employed in providing access to the information encoded in the IO's. In terms of disadvantages, important interactions between librarians and users that occur in the meta-information environments of TL's may be lost with the near-automation of information access in DL's.
The goal of this essay is to suggest a framework for the design of the meta-information environments for DL's that takes advantage of digital technology and compensates for the loss of direct user-librarian interactions.
In the remainder of this essay, we briefly examine the use of the terms ``metadata'' and ``meta-information''. We then employ a simple scenario of library use in order to characterize the meta-information environment of a TL. We generalize this characterization to the meta-information environment of libraries in general. The environment is modeled in terms of a set of high-level services which are, in turn, supported by sets of lower level services, some of which are provided by an extensible set of ``knowledge representation systems''. Finally, we examine the implications of this general characterization in terms of a design for the meta-information environment of a DL. In particular, we suggest a design that is implementable within a distributed object framework.
The term ``metadata'' has been applied in a large variety of contexts. For example, the topics of papers at a recent conference on metadata ranged from metadata in data dictionaries and its use in controlling the operations of database management systems, to metadata used for describing scientific datasets and supporting data sharing among scientists, to metadata used in DL's to support user access to information.
The concept of metadata, when applied in the context of current libraries, digital or traditional, typically refers to information that
More generally, however, if one surveys the many contexts in which it has been applied, it becomes apparent that the concept associated with the term ``metadata'' is the principal focus of an emerging area of the information sciences whose goal is to discover appropriate methods for the modeling of various classes of IO's. Since a model of an IO is itself typically an IO, and since the concept that is generally associated with the term ``data'' is subsumed by the concept associated with the term ``information object'', it seems preferable to use the term ``meta-information'' and to define it as a model of an information object.
To motivate a general characterization of meta-information in the context of DL's, we briefly examine a ``typical'' usage scenario of a TL. We employ this scenario as a basis for constructing a general model of the meta-information environment of TL's that may be generalized to encompass the case of DL's.
For the sake of concreteness, let us assume a user whose interest is in finding information on condor re-introduction programs in California. In order to access such information in a TL, the user may engage in a variety of activities. The four most important activities include consulting a librarian; consulting available catalog and reference materials; browsing through the open collections of the library; and processing the information that has been accessed.
Let us assume that the user begins a search by consulting a librarian, and indicates an initial interest in discovering whether programs for re-introducing condors from captive breeding populations have been a success. Several important processes may co-occur during these interactions. First, the librarian may build a ``cognitive model'' of the user that is employed in helping the user. As an example, the librarian may note the user's level of knowledge about the use of a library, and discover that the user does not understand the value of subject heading catalogs in searching for references to information on the decline of the condors.
Second, the librarian may build a cognitive model of the user's information requirements, or ``query'', typically in an iterative process during which the user may change the initial query. The librarian may discover, for example, that the user would like to know the locations of the release sites in order to visit them. Third, and depending on the context of the query, the librarian may also construct a model of the user's information processing requirements. In terms of our example, these might include estimating the time to hike to the release sites.
In conjunction with these emerging models of the user's knowledge base and information needs, the librarian employs a cognitive model of the library's information resources to determine an appropriate set of actions that will lead to the satisfaction of the user's information needs. Three classes of activities are worthy of note. First, the librarian may direct the user to meta-information, such as the subject catalog, that points directly to IO's of interest. Second, the librarian may guide the user to ``general'' meta-information that can be used in a less direct manner in finding IO's of interest. For example, the user may be directed to a gazetteer in order to find the geographical coordinates of the release sites, whose names the librarian may happen to know. These coordinates may then be used in accessing the appropriate maps from the library's map collection. Third, the librarian may suggest that the user browse in the ornithology section of the library to look for books that may be relevant to the topic of condors. In so doing, the user may assess meta-information in the form of titles and tables of contents.
Before leaving the library, the user may employ the relevant maps to estimate the time it would take to hike to the condor release areas.
The preceding example, which is by no means artificial, emphasizes the fact that the meta-information accessed by users of TL's in satisfying their information needs is not restricted to the meta-information in the author, title, and subject catalogs. In particular, the scenario was devised to emphasize that, during search, a user may conceivably employ as meta-information almost all the information sources in a library. Such sources range from the librarian's general knowledge of the world to information encoded in the IO's on the stacks.
Based on the scenario, we are justified in defining the meta-information environment of a TL to be
In order to analyze further the manner in which the preceding sets of services provide support for user access to information, it is useful to introduce the concept of knowledge representation systems (KRS's). We argue that an important component of the functionality of the six sets of meta-information services in TL's is provided by a diverse set of KRS's. This conceptualization in terms of KRS's provides a useful theoretical framework for the design and analysis of DL's.
A KRS may be defined as a system for representing and reasoning about the knowledge in some domain of discourse, and is generally comprised of:
In general, we may view the KRS's of a library as providing a diverse set of services that are of particular value in the modeling of both IO's and user queries. They are, for example, of particular significance in supporting the modeling of IO's in terms of their content, since, in principle, the content of library materials may refer to any representable aspect of our knowledge.
In order to gain further insight into the nature and significance of KRS's, we provide examples of their use in supporting key sets of services in the meta-information environments of TL's.
Thesauri are an important class of KRS's that are employed in constructing models of the subject matter (or ``content'') of IO's for the catalog systems of TL's. The motivation for the use of thesauri is the difficulties that arise from using a KRS based on natural language (NL) in this context. These difficulties arise from the syntactic and semantic complexity and the high levels of ambiguity that are typically associated with general expressions in NL. The KRL of a thesaurus, on the other hand, is designed to possess a restricted syntax and semantics that permits the representation of restricted domains of discourse in an unambiguous manner. These restrictions result in the construction of many domain-specific thesauri, which in essence represents a ``divide-and-conquer'' approach to building unambiguous representations of a complex world.
For the present purpose, we may use a highly simplified view of a thesaurus that is abstracted from the ANSI-NISO standard for thesauri.
Other classes of KRS's that are also employed in the modeling of IO's for the catalog systems of TL's include subject headings and descriptive cataloging systems. The Library of Congress Subject Headings now bear a strong apparent similarity to thesauri, but differ in that a single term does not necessarily denote a single concept. The descriptive cataloging that is used to represent such contextual information about IO's as title and author may also be interpreted in terms of KRS's. In particular, the KRL that is employed for most of the descriptive cataloging in TL's is specified by the Anglo-American Cataloguing Rules (AACR2) and the MARC interchange format for exchanging such information between libraries.
In TL's, there is a variety of KRS's that may aid a user in expressing a query that is answerable in terms of the catalog. A gazetteer is a good example of such a KRS. It is essentially a set of terms that represent classes of features on the surface of the Earth, such as rivers and towns, together with a large set of named instances of such features, such as ``Ohio River''. The spatial coordinates of the feature instances on the surface of the Earth are provided as an essential component of a gazetteer. One may therefore view a gazetteer as a geographic thesaurus of limited extent, in which large numbers of class instances are given, and a function is defined on these instances that assigns geographic coordinates to the instances.
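The view of a gazetteer as feature classes, named instances, and a coordinate-assigning function can be sketched in a few lines of Python. This is only an illustration: the class and method names (`Feature`, `Gazetteer`, `coordinates`, `instances_of`) are hypothetical, and the coordinates are rough placeholder values rather than authoritative data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str           # named instance, e.g. "Ohio River"
    feature_class: str  # class term, e.g. "river" or "town"
    lat: float          # placeholder latitude in decimal degrees
    lon: float          # placeholder longitude in decimal degrees


class Gazetteer:
    """A set of named feature instances plus a function to coordinates."""

    def __init__(self, features):
        # Index instances by lower-cased name for case-insensitive lookup.
        self._by_name = {f.name.lower(): f for f in features}

    def coordinates(self, name):
        """Return (lat, lon) for a named feature, or None if it is unknown."""
        feature = self._by_name.get(name.lower())
        return (feature.lat, feature.lon) if feature else None

    def instances_of(self, feature_class):
        """Return the sorted names of all instances of one feature class."""
        return sorted(
            f.name for f in self._by_name.values()
            if f.feature_class == feature_class
        )


# Illustrative entries only; the coordinates are approximate placeholders.
gazetteer = Gazetteer([
    Feature("Ohio River", "river", 37.0, -89.1),
    Feature("Santa Barbara", "town", 34.42, -119.70),
])
```

A user who knows only a place name could call `gazetteer.coordinates("Ohio River")` and pass the resulting pair to a map-retrieval service.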
In TL's with electronic catalogs, KRS's may be employed in representing user queries. A simple example is the use of the terms of the KRL of some thesaurus in order to represent the content that a user wishes to find in acceptable IO's. In the case of representing queries, the user is frequently permitted to define the content of IO's in terms of boolean expressions of the terms from acceptable thesauri. The reasoning procedures of the thesaurus may be used to expand the representation of the query by replacing, for example, one synonym with another, or a narrow term with a broad term.
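As a minimal sketch of such query expansion, the following Python fragment treats a query as a conjunction of terms and expands each term into a group of acceptable alternatives. The miniature thesaurus, its terms, and the function names are all assumptions made for illustration; they are not taken from any actual thesaurus or catalog system.

```python
# Hypothetical miniature thesaurus; terms and relations are illustrative only.
SYNONYMS = {
    "condor": {"California condor"},
    "reintroduction": {"re-introduction"},
}
BROADER = {
    "condor": "vulture",
    "vulture": "bird of prey",
}


def expand_term(term, use_broader=False):
    """Return the term plus its synonyms and, optionally, its broader term."""
    expanded = {term} | SYNONYMS.get(term, set())
    if use_broader and term in BROADER:
        expanded.add(BROADER[term])
    return expanded


def expand_query(terms, use_broader=False):
    """Expand an AND-query: each term becomes an OR-group of alternatives."""
    return [expand_term(t, use_broader) for t in terms]


def matches(expanded_query, io_terms):
    """True if every OR-group shares at least one term with the IO's terms."""
    return all(group & io_terms for group in expanded_query)


query = expand_query(["condor", "reintroduction"], use_broader=True)
```

With broadening enabled, an IO indexed under the broader term "vulture" satisfies the query; without it, the same IO is not matched.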
Finally, we note that in relation to their interactions in the meta-information environment of a TL, it is not unreasonable to view a librarian as providing the services of a large set of KRS's, each focused on a specific domain of discourse. These KRS's are employed in the various roles played by the librarian in the meta-information environment of a library.
The meta-information environments of current DL's may be viewed as special cases of the preceding model. In terms of the testbed for the Alexandria Digital Library (ADL), for example, the system provides services that: support access to models of IO's in terms of USMARC and Federal Geographic Data Committee (FGDC) standards; support the construction of models of user queries in terms of regions of interest, defined in part by the services of a background map and in part by the services of a gazetteer, as well as models of IO's based on USMARC/FGDC standards; support the computation of exact matches between query and IO models; and support a simple workspace involving a local cache in which users may save retrieved items.
It currently appears reasonable, therefore, to use the general model of the meta-information environment of a TL developed above as a basis for designing the meta-information environment of a DL.
Figure 1 illustrates a high-level design for a meta-information environment for DL's. The design is based on the model developed above and is intended to be extensible. It views the meta-information environment of a DL as a set of high-level services that provide the essential functionality of a library. We view these high-level services, in turn, as being supported by the services of an appropriate set of KRS's. Such services may be implemented within a distributed object framework, which may be based upon standards such as CORBA. We note that the Figure is intended to be neither exhaustive in showing all possible meta-information services, nor indicative of the flow of processing.
Figure 1: A High-Level Design for the Meta-information Environment of a DL
We briefly summarize the main clusters of services.
As noted above, we envisage the high-level services of a DL as being supported, in part, by other sets of services that are provided by various KRS's. The services of a given KRS may support several sets of high-level services, as in the case of the services of a thesaurus supporting the modeling of both queries and IO's. We now provide a few examples of classes of KRS's that may be of value in supporting the high-level services of the meta-information environment of a DL.
Services of particular importance in the meta-information environment of a DL are those supporting the construction of models of both user queries and the IO's of the library. Digital technology makes it possible to construct relatively complete and complex models of queries and IO's. Important categories of characteristics of IO's that may be modeled by meta-information include, for example, the access path of the IO; the type of the IO (such as book, map, or video); the logical structure of the IO (including such structural components as title page, preface, chapters, and index if it is a book); the representation of the IO, including its form (HTML file, PostScript file, or GIF file) and its language (English, French, or Arabic); the context of the IO (including such information as author, publisher, and lineage); the content of the IO; the terms and conditions of access to, and use of, the IO; evaluative information about the IO, particularly with respect to its value in various applications; and the relations of the IO to other IO's.
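One way to make these categories concrete is a single record type with one field per category. The Python sketch below is an assumption about how such a model might be laid out, not a description of any existing system; the class name `IOModel`, the field names, and the sample values (including the access path) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IOModel:
    """Meta-information about one information object (IO)."""
    access_path: str                                        # where the IO is found
    io_type: str                                            # e.g. "book", "map", "video"
    logical_structure: list = field(default_factory=list)   # e.g. chapters, index
    form: str = ""                                          # e.g. "html", "postscript", "gif"
    language: str = ""                                      # e.g. "English"
    context: dict = field(default_factory=dict)             # author, publisher, lineage
    content: set = field(default_factory=set)               # subject terms
    terms_and_conditions: str = ""                          # access and use conditions
    evaluations: list = field(default_factory=list)         # evaluative notes
    relations: dict = field(default_factory=dict)           # links to other IO's


# Hypothetical example values, for illustration only.
condor_map = IOModel(
    access_path="maps/condor-release-sites",
    io_type="map",
    form="gif",
    language="English",
    content={"condor", "release site"},
)
```

Keeping each category in its own field lets different KRS's populate different parts of the same model, for example a thesaurus supplying `content` and a descriptive cataloging system supplying `context`.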
An example of a characteristic of an aggregate of IO's that may be modeled by meta-information is the number of items in the aggregate that possess specified values for a given characteristic of the individual IO's.
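Such an aggregate characteristic amounts to a frequency count over the individual IO models. The following Python sketch assumes, purely for illustration, that each IO model is a plain dictionary; the function name and the sample collection are hypothetical.

```python
from collections import Counter


def count_by_characteristic(collection, characteristic):
    """Count how many IO models carry each value of one characteristic."""
    return Counter(
        model[characteristic]
        for model in collection
        if characteristic in model
    )


# Illustrative collection of three IO models.
collection = [
    {"type": "map", "language": "English"},
    {"type": "map", "language": "French"},
    {"type": "book", "language": "English"},
]
type_counts = count_by_characteristic(collection, "type")
```

Here `type_counts["map"]` gives the number of maps in the aggregate, which is itself a piece of meta-information about the collection as a whole.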
The services of an extensible set of KRS's may be employed in constructing models of queries and IO's in terms of such categories of meta-information. These KRS's include digital versions of some of the KRS's mentioned in the context of TL's, such as thesauri, subject headings, and gazetteers. Digital technology, however, makes it possible to support a wide variety of other KRS's. We briefly discuss a few of these possibilities.
Finally, we note a few of the issues that relate to the provision of the services of KRS's. Since a DL with heterogeneous holdings will generally need to employ several KRS's of different types, it is important that designs for the meta-information environment allow the easy addition of new KRS's and removal of old KRS's. This is facilitated by distributed object technology. A related research issue of some interest concerns whether it is best to use a large number of relatively small KRS's, or a small number of relatively large KRS's.
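The requirement that KRS's be easy to add and remove can be illustrated with a small registry. The Python sketch below runs in a single process and is not a distributed object implementation; it only shows the shape of the interface. The names `KRSRegistry`, `TermListKRS`, and `recognize` are hypothetical.

```python
class TermListKRS:
    """A trivial KRS that recognizes terms from a fixed vocabulary."""

    def __init__(self, terms):
        self._terms = {t.lower() for t in terms}

    def recognize(self, text):
        """Return the vocabulary terms that occur as words in the text."""
        return self._terms & set(text.lower().split())


class KRSRegistry:
    """Holds named KRS's so they can be added or removed at run time."""

    def __init__(self):
        self._systems = {}

    def register(self, name, krs):
        self._systems[name] = krs

    def unregister(self, name):
        # Removing an unknown name is treated as a no-op.
        self._systems.pop(name, None)

    def names(self):
        return sorted(self._systems)

    def describe(self, text):
        """Ask every registered KRS which of its terms it finds in the text."""
        return {name: krs.recognize(text) for name, krs in self._systems.items()}


registry = KRSRegistry()
registry.register("birds", TermListKRS(["condor", "vulture"]))
registry.register("places", TermListKRS(["california"]))
description = registry.describe("Condor release sites in California")
registry.unregister("places")
```

Because callers depend only on the `recognize` method, a new KRS can be registered without changes to the services that use the registry.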
Another important research issue concerns the construction of semantic mappings between the KRL's of different KRS's. It is possible to employ different sets of KRS's for modeling user queries and for modeling IO's. There is therefore a need for translation during the application of matching services. One approach to constructing such mappings involves the use of human experts working in a top-down manner, which is likely to be a time-consuming and controversial process. An approach that is promising in terms of automation involves bottom-up techniques based on empirical analyses of the use of language.
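In its simplest form, a semantic mapping of this kind is a table from terms of one KRL to terms of another, applied before matching. The Python sketch below assumes such a hand-built table; the mapping entries and the function name `translate` are hypothetical, and the sketch says nothing about how a mapping would be derived empirically.

```python
# Hypothetical mapping from a query vocabulary to a catalog vocabulary.
QUERY_TO_CATALOG = {
    "condor": {"Gymnogyps californianus"},
    "release site": {"reintroduction area"},
}


def translate(query_terms, mapping):
    """Translate query terms into the catalog vocabulary.

    Returns a pair: the translated terms, and the query terms for which
    the mapping has no entry (so the caller can decide how to handle them).
    """
    translated = set()
    unmapped = set()
    for term in query_terms:
        if term in mapping:
            translated |= mapping[term]
        else:
            unmapped.add(term)
    return translated, unmapped


translated, unmapped = translate({"condor", "hiking"}, QUERY_TO_CATALOG)
```

Reporting the unmapped terms separately makes the gaps in a mapping visible, which is where either expert review or empirical techniques would need to be applied.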
The meta-information environment of a library is the aspect of library structure that is likely to be most affected by DL technology. It is important to design meta-information environments for DL's that simultaneously compensate for the loss of many of the services of librarians and take advantage of the ability to apply digital processing to information objects in the collection of DL's. In particular, the essay suggests the importance of a top-down component that takes the perspective of the user in the process of designing such environments. The approach to design suggested in the essay involves the implementation of a meta-information environment in terms of six basic sets of services that are, at least in part, supported by services from a variety of knowledge representation systems. Such an environment is probably best implemented within a distributed object framework.