Opinion

D-Lib Magazine
December 2002

Volume 8 Number 12

ISSN 1082-9873

A Framework for Digital Library Research

Broadening the Vision

 

Dagobert Soergel
University of Maryland
<ds52@umail.umd.edu>

(This Opinion piece presents the opinions of the author. It does not necessarily reflect the views of D-Lib Magazine, its publisher, the Corporation for National Research Initiatives, or its sponsor.)

Digital library (DL) research and development needs a framework that can be used as a perspective on existing research and practice and, more importantly, as a structured vision for the development of new ideas. As distinct from the DELOS brainstorming report [1], which offers its own agenda for the next phase of DL research (and a somewhat ad-hoc roster for EU-NSF working groups), the framework offered here is based on a very broad view of digital libraries that takes full advantage of the possibilities offered by the integration of computer and telecommunication technology. When engine-driven vehicles were first introduced, they were built in the shape of a horse-drawn carriage and indeed were called "horseless carriages"; it took some time to take full advantage of the new technology and engineer the modern automobile. Much of DL practice is still at the stage of the "horseless carriage"; we must move on to the modern automobile.

No claim is made that all or even most of the ideas in this commentary are new; there are many forward-looking leaders in the DL community; indeed, many of the ideas come from or were inspired by project presentations at the EU-NSF DL all projects meeting in March 2002 in Rome [2]. It is hoped that bringing these ideas together in a systematic framework will lead to changes in how DLs are viewed and implemented. The framework consists of three overarching guiding principles and eleven specific themes and areas of research and development.

Guiding principles

  • Some see the DL field focused on serving research, scholarship, and education, but in order to achieve their full benefit for society and a concomitant viable business model, DLs must also support practice (in medicine, law, business, and government, for example [3]).
  • Some see DLs primarily as a means for accessing information, but in order to reach their full potential, DLs must go beyond that and support new ways of intellectual work. This requires development of the two components of the total system of intellectual work: the computer system component, through innovative system development; and the user component, through user education and training in the use of new methods. (Eventually this will happen in K-12 and higher education.) There are two corollaries to this principle: (1) Information access must be embedded seamlessly into an integrated system that supports all of a user's work, information access as well as information use and application, and new thought. (2) Systems must go beyond paper-based limitations. Many systems today use paper-based metaphors and thus import paper-based limitations into system functions; similarly, user expectations of how systems work are often shaped by their experience with paper and the limitations on information manipulation paper imposes. When the means of transportation changed from horse-drawn carriages to automobiles, people had to and did take driver education.
  • Some see DLs as providing services primarily to individual users, but DLs must also support collaboration and communities of practice.

Themes for DL research and development

Theme 1. DLs must integrate access to materials with access to tools to process these materials (DL = materials + tools).

A digital library should provide access to materials and objects and to the tools needed to process and present these materials in ways that serve the user's ultimate purpose. A whole host of tools are needed in this area, from NLP techniques (including summarization, information extraction, translation, and automatic speech recognition), to statistical and scientific computation and modeling, to graphical rendering and visualization. To give just a few examples: The Digimorph DL [4] provides CAT scan data from biological specimens and the tools to manipulate these data in many ways to separate out bone structures and organs or to show 3-D images that can be rotated. Perseus [5] provides access to maps and pictures of historic London and a tool that lets the user take a virtual 3-D walk through the streets of the historic city. ARION [6] provides access to oceanographic data and to algorithms to process these data, allowing the user to set up a total process by specifying, for each step, the data to be used (data resulting from the previous step and/or data retrieved from the DL) and the algorithm(s) to be applied; the system then executes the process, fetching data and algorithms from the DL as needed. CHLT [7] is developing an integrated reading environment to assist students and humanities scholars with reading, understanding, and interpreting classical texts.
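A process specification of the kind described for ARION, in which each step names the data it uses and the algorithm to apply, can be sketched in a few lines. This is only an illustration under stated assumptions: `Step`, `run_process`, `DATA_STORE`, `ALGORITHM_STORE`, and the sample temperature series are hypothetical names and data invented here, not ARION's actual interface.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical in-memory stand-ins for the DL's data and algorithm stores.
DATA_STORE: Dict[str, Any] = {
    "temps_station_a": [11.0, 12.5, 13.0],
    "temps_station_b": [10.0, 11.5, 12.0],
}
ALGORITHM_STORE: Dict[str, Callable[..., Any]] = {
    "merge": lambda *series: [x for s in series for x in s],
    "mean": lambda values: sum(values) / len(values),
}

PREVIOUS = "<previous>"  # marker: use the result of the preceding step


@dataclass
class Step:
    algorithm: str                                    # algorithm name in the DL
    inputs: List[str] = field(default_factory=list)   # data set names or PREVIOUS


def run_process(steps: List[Step]) -> Any:
    """Execute the steps in order, fetching data and algorithms as needed."""
    result: Any = None
    for i, step in enumerate(steps):
        args = []
        for name in step.inputs:
            if name == PREVIOUS:
                if i == 0:
                    raise ValueError("the first step has no previous result")
                args.append(result)
            else:
                args.append(DATA_STORE[name])
        result = ALGORITHM_STORE[step.algorithm](*args)
    return result


# Step 1 merges two retrieved series; step 2 averages the merged result.
process = [
    Step("merge", ["temps_station_a", "temps_station_b"]),
    Step("mean", [PREVIOUS]),
]
overall_mean = run_process(process)
```

The sketch handles only a linear chain of steps; a real system would presumably also need branching, remote retrieval, and error handling.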

Theme 2. DLs should support individual and community information spaces.

A DL should support users who work with materials and create their own individual or community information spaces through a process of selection, annotation, contribution, and collaboration. Annotation includes incorporating new structures; for example, adding links, preferably anchored at both ends, between existing objects. Users should be able to contribute new materials, such as the papers they are working on or images they scan from their own collections. User annotations and contributions can be private or public and shared—in a work group, on a company intranet, or on the public Web—for collaboration. Digital libraries become collaboratories or shared information spaces, a medium that scholarly or practice communities can use to store their information and to communicate with each other, adding to the store in the process (the third of the guiding principles). Through keeping detailed search and work histories, the system can capture contributions, relieving the user from having to do extra work [8]. For example, if the user pastes a document passage into his or her own work as a quotation (such as a lawyer quoting from a legal case) or includes a part of an image as an illustration, the system should establish the proper two-way link in the background. Some of these capabilities have been implemented in CYCLADES [9]. Developing a tool set for annotation and "information space development" requires careful task analysis; taking a careful look at the practices in humanities and legal scholarship would provide a useful starting point.
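The background creation of a two-way link when a user pastes a quotation might look roughly like the following sketch. All names (`Anchor`, `LinkStore`, `paste_quotation`) and the sample documents are hypothetical, invented for illustration; the sketch assumes documents are plain strings addressed by character offsets and does not describe CYCLADES or any other cited system.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Anchor:
    doc_id: str
    start: int   # character offset where the passage begins
    end: int     # character offset just past the passage


class LinkStore:
    """Indexes each link from both ends so either document can find the other."""

    def __init__(self) -> None:
        self._by_quoting_doc: Dict[str, List[Tuple[Anchor, Anchor]]] = {}
        self._by_quoted_doc: Dict[str, List[Tuple[Anchor, Anchor]]] = {}

    def add_link(self, quoting: Anchor, quoted: Anchor) -> None:
        pair = (quoting, quoted)
        self._by_quoting_doc.setdefault(quoting.doc_id, []).append(pair)
        self._by_quoted_doc.setdefault(quoted.doc_id, []).append(pair)

    def sources_of(self, doc_id: str) -> List[Anchor]:
        """Passages elsewhere that this document quotes."""
        return [quoted for _, quoted in self._by_quoting_doc.get(doc_id, [])]

    def quoted_in(self, doc_id: str) -> List[Anchor]:
        """Passages elsewhere that quote this document."""
        return [quoting for quoting, _ in self._by_quoted_doc.get(doc_id, [])]


def paste_quotation(store: LinkStore, documents: Dict[str, str],
                    source: Anchor, target_doc: str, position: int) -> Anchor:
    """Insert the quoted passage and record the link without user effort."""
    passage = documents[source.doc_id][source.start:source.end]
    text = documents[target_doc]
    documents[target_doc] = text[:position] + passage + text[position:]
    quoting = Anchor(target_doc, position, position + len(passage))
    store.add_link(quoting, source)
    return quoting


documents = {
    "case_123": "The court held that the contract was void.",
    "brief": "As stated: . We agree.",
}
store = LinkStore()
source = Anchor("case_123", 4, 14)   # the passage "court held"
pasted = paste_quotation(store, documents, source, "brief", 11)
```

Character offsets go stale when a document is edited later, so a production design would need more robust anchoring than this sketch provides.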

Both Theme 1 and Theme 2 address the second guiding principle, creating new ways for intellectual work through seamless integration of information access with information use and information application / users' work.

Theme 3. Digital libraries need semantic structure.

To convey meaning to users, especially students, to support search and retrieval, to provide knowledge-based support in the user interface, and to support agents that perform work for the user, a DL needs semantic structure: an ontology both in the broad sense of a conceptual schema for a domain (which includes metadata standards) and in the narrower sense of a classification of subjects and values of other entity types. This is a recurring, if not sufficiently heeded, theme in the literature, with emphasis on the need for harmonization (and, to the extent feasible, standardization) across disciplines, languages, cultures, and time (for historical materials). Semantic structures are expensive to create, so reuse, re-purposing, and adaptation of existing schemes are important, along with computer-assisted methods for aligning different schemes; automatic discovery of terms, concepts, definitions, and relationships from text and from user interactions; and collaborative development and maintenance of ontologies. (For a general reference and an example, see [10].) Semantic structure is an area of prime importance; it needs its own EU-NSF Working Group.

Theme 4. DLs need linked data structures for powerful navigation and search.

This is actually a special case of Theme 3, but it is important enough to discuss as a separate theme. Knowledge is richly interlinked, and computers are extremely good at following links and computing on links; so let's seize the opportunity. This idea is best illustrated by examples:

  • Support a pharmaceutical chemist's work by providing links from a chemical substance to known biological effects, to toxic effects, to reactions in which the substance participates, to methods of synthesis, etc.—in each case, with further links to supporting data or documents.
  • Support language scholars through archives of lemmatized text with two-way links to dictionary entries (as done in Perseus).
  • Link dictionary entries between one dictionary and another within a single language, across time, and/or across languages (Perseus, Grimm Woerterbuch [11]).
  • Use discourse structures for better retrieval and to aid in historical and other text interpretation.
  • Support retrieval of historical documents based on linked commentary. (Commentaries can get at the reasons for historical actions where in the documents themselves the reasons are hidden rather than made explicit, as in Nazi Germany, see COLLATE [12]).
  • Support user exploration and interpretation through links across different kinds of data and objects (Perseus).
  • Provide a context navigation interface to support better understanding of oral history interviews (for example, of Holocaust survivors): link from place names mentioned by the survivor to gazetteer information and images of the place from the time of the events recounted, link from a place to events that happened at that place, link from the interview to historical events at the time, link to Nazi policies that led to the personal fate of the survivor, etc.

There should be links across disciplines and across digital libraries. Much of this requires typed links. For this we need a link taxonomy (closely related to metadata standards), and we need research into the intuitive display of typed links. (On typed links see, for example, [13].)
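To make the idea of typed links and a link taxonomy concrete, here is a minimal sketch based on the pharmaceutical chemistry example. The link type names, the tiny taxonomy, the `TypedLinkGraph` class, and all object identifiers are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# A toy link taxonomy: each link type maps to its broader type.
LINK_TAXONOMY: Dict[str, str] = {
    "has_toxic_effect": "has_biological_effect",
    "has_biological_effect": "related_to",
    "synthesized_by": "related_to",
    "supported_by": "related_to",
}


def is_a(link_type: str, wanted: str) -> bool:
    """True if link_type equals wanted or is narrower than it."""
    current: Optional[str] = link_type
    while current is not None:
        if current == wanted:
            return True
        current = LINK_TAXONOMY.get(current)
    return False


class TypedLinkGraph:
    """Stores typed links and indexes them in both directions."""

    def __init__(self) -> None:
        self._out: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
        self._in: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, source: str, link_type: str, target: str) -> None:
        self._out[source].append((link_type, target))
        self._in[target].append((link_type, source))

    def follow(self, source: str, wanted: str) -> List[str]:
        """Targets reachable from source via links of the wanted type."""
        return [t for lt, t in self._out.get(source, []) if is_a(lt, wanted)]

    def backlinks(self, target: str, wanted: str) -> List[str]:
        """Sources that point to target via links of the wanted type."""
        return [s for lt, s in self._in.get(target, []) if is_a(lt, wanted)]


graph = TypedLinkGraph()
graph.add("substance:X", "has_toxic_effect", "effect:liver_damage")
graph.add("substance:X", "has_biological_effect", "effect:lowers_blood_pressure")
graph.add("substance:X", "synthesized_by", "method:M1")
graph.add("effect:liver_damage", "supported_by", "doc:study_17")

all_effects = graph.follow("substance:X", "has_biological_effect")
toxic_only = graph.follow("substance:X", "has_toxic_effect")
```

Because toxic effects are declared a narrower kind of biological effect, a query for biological effects also returns them, which is one way a link taxonomy can pay off in navigation and search.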

Theme 5. DLs should support powerful search that combines information across databases.

We need to move beyond simple controlled-vocabulary and free-text retrieval in single databases to retrieval in the complex interrelated structures described in Theme 4, working across databases. This includes techniques such as:

  • Queries that include attributes of related objects (for example, descriptors in cited and citing documents; descriptors in encompassing objects—hierarchical inheritance);
  • Retrieval of composite objects (texts made up of sections, food products made up of ingredients), considering component attributes and interrelationships;
  • Retrieval based on attributes derived on the fly, such as retrieval of census tracts based on median income (which is particularly difficult if the base attributes must be obtained from several systems).
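Retrieval on an attribute derived on the fly can be illustrated with the census tract example. In this sketch the two dictionaries stand in for two separate systems (a tract registry and an income data source); their contents, and the function names, are hypothetical.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple

# Hypothetical system 1: registry of census tracts.
TRACTS: Dict[str, str] = {"T1": "Riverside", "T2": "Hilltop", "T3": "Old Town"}

# Hypothetical system 2: household incomes reported per tract.
HOUSEHOLD_INCOMES: Dict[str, List[int]] = {
    "T1": [30_000, 42_000, 55_000],
    "T2": [80_000, 95_000, 120_000, 150_000],
    "T3": [],  # no income data available for this tract
}


def median_income(tract_id: str) -> Optional[float]:
    """Derive the attribute at query time; None if no base data exists."""
    incomes = HOUSEHOLD_INCOMES.get(tract_id, [])
    return median(incomes) if incomes else None


def tracts_with_median_income_at_least(threshold: float) -> List[Tuple[str, str, float]]:
    """Retrieve tracts by an attribute that neither system stores directly."""
    hits = []
    for tract_id, name in TRACTS.items():
        value = median_income(tract_id)
        if value is not None and value >= threshold:
            hits.append((tract_id, name, value))
    return hits


wealthier_tracts = tracts_with_median_income_at_least(50_000)
```

Computing the median for every tract at query time is the simplest possible strategy; with base data spread over remote systems, caching or precomputing derived values would likely be necessary.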

In addition to more sophisticated retrieval algorithms, this involves interoperability and distributed access to heterogeneous systems, both on a technical and on an intellectual level. While a lot of work is being done in this area, there are still few widespread operational solutions; we need more efforts that actually build on each other. Problems include:

  • Dealing with searches where the values of search attributes to be combined are obtained from different systems and combining search results from multiple systems: unified ranking (the fusion problem);
  • Comparing and consolidating search results (simplest case: detecting duplicate documents; more complex case: combining several dictionary entries, including comparison of definitions);
  • Discovering inconsistencies and contradictions in results;
  • Creating a unified report combining different types of data found from different systems, such as a report combining structural, functional, toxicity, and economic data about a chemical gleaned from several databases. (In the early eighties, there was a system called microCSIN doing just that.)
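One simple way to approach unified ranking is reciprocal rank fusion, shown below. This technique is not named in this commentary; it is offered only as one possible illustration of the fusion problem. The sketch assumes that duplicate documents have already been detected and share one identifier across systems, and the document identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each list adds 1 / (k + rank) to a document's score.

    k dampens the influence of top ranks; 60 is a commonly used default.
    Ties are broken alphabetically so the result is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


system_a = ["d1", "d2", "d3"]   # ranked results from one system
system_b = ["d2", "d4", "d1"]   # ranked results from another system
fused = reciprocal_rank_fusion([system_a, system_b])
```

The method uses only rank positions, so it sidesteps the fact that relevance scores from different systems are usually not comparable; it does nothing, however, for the harder problems of consolidating content or detecting contradictions.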

Finally, this theme encompasses efforts now pursued under the heading "Semantic Web": using statistical and artificial intelligence techniques to derive from the results found a final answer to the user's question or to present data gleaned from the results in a form adapted to the user's background and purpose (see also Theme 1).

Theme 6. DL interfaces should guide users through complex tasks.

User interfaces are very important in facilitating user interaction with the complex functions described above and letting users—with some training—take advantage of new methods of intellectual work. Perseus, cited often in this commentary for its exemplary data structures and tools, still has a poor interface that stands in the way of full exploitation of its riches. Other issues include customization / personalization and methods for deriving user profiles automatically. This area also deserves its own EU-NSF working group.

Theme 7. The DL field should provide ready-made tools for building and using semantically rich digital libraries.

The Web has dramatically lowered the threshold for making materials accessible. Now we need to lower the threshold for creating or converting materials and presenting them in meaningfully structured collections—we need a sophisticated digital library toolkit for the masses. Organizations with useful collections, or user communities that want to create a collective store of materials of interest and a platform for collaboration, need affordable and easy-to-use technology. Industry has created—or will create—DL systems for publishers and others who can afford them, but the research community should produce high-quality freeware. An example of this is the Greenstone DL package [14]. Greenstone concentrates, so far, on traditional DL functions and does not yet include annotation or collaboration or the other functions discussed above. Ideally, we would have a toolbox to which different people could contribute programs, LINUX-style, including tools for functions such as:

  • Authoring support, document templates;
  • Digitization (for example, META-E, University of Innsbruck [15], and Gamera [16]) and format conversion (such as OCR, automated speech recognition, PDF to text);
  • (Automated) content acquisition, harvesting content;
  • Annotation and multiple views on documents (for example, multi-valent documents [17]);
  • Automatic markup;
  • Automatic link generation, including parsing bibliographic citation and establishing citation links (for example, CiteSeer [18]);
  • Automatic metadata generation, including named entity recognition and automatic subject indexing/classification;
  • Efficient manual or computer-assisted creation or editing of markup, links and metadata;
  • Ontology/taxonomy creation by system administrators or users, alone or in collaboration (automatic identification of terms and concepts from text and user interactions, automatic discovery of taxonomy structure, reuse of existing taxonomies, editing);
  • Search and navigation;
  • Building interfaces and publications;
  • Collecting user interaction data (search and work histories) to be used to support users' work and for requirements analysis and evaluation; other evaluation data;
  • Tools for all kinds of processing, as discussed under Theme 1. (Such tools can be used for pre-processing or for post-processing.)

Theme 8. Innovative DL design should be informed by studies of user requirements and user behavior.

User requirements analysis methodology is well understood, but actual task-based user requirements studies, undertaken from the point of view of developing new methods of intellectual work, remain important both for policy and for system design at all levels. User behavior studies are important for two purposes: (1) improving system design and usability and (2) determining needs for user education. (Remember that a DL and its users are two components of one overall system. Both components may require changes for best overall performance. If DLs are merely "user-centric", they may never take the user to the future place of improved methods of intellectual work.)

Theme 9. DL evaluation needs to consider new functionality.

Evaluation frameworks for DLs (as well as for other information systems) must be adapted to take into account the many new functions and the changed environment of DLs. Such a framework is the basis for devising an evaluation strategy that tests components and functions. Some of these will be amenable to quantitative measurement; others will require qualitative approaches. Conducting evaluation studies with real users is important; relying solely on TREC-style evaluations can be very misleading. Some of these issues have been addressed in a DELOS workshop [19].

Theme 10. Legal/organizational issues of information access and rights management need to be addressed using new technology.

Access to some or all information may be restricted to certain users, particularly if we think of individual and group information spaces, so good computer security is needed. But even if the information is, in principle, universally accessible, the information producers may want to be compensated, so rights management is a huge issue; the trick is to balance fair compensation for intellectual and editorial contributions with ease and universality of access. Another issue in this area is protection of human subject information (which may be in user profiles and use histories as well as in the content of a DL, such as interviews in an oral history collection, survey data, or questions and answers in a medical ask-the-expert site). In many ways, the expanded DL functionality envisioned here can come to full fruition only if the security and rights management issues are resolved. Luckily, while computer access has created many new problems in access control, computers can be used to solve these problems; for example, it is now feasible to collect micro-payments from users and aggregate payments to bill information users and pay information producers.
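The aggregation of micro-payments mentioned above amounts to summing many small charges in two directions. The following sketch assumes a hypothetical `UsageEvent` record and made-up fee amounts; it ignores taxes, operating costs, and any share retained by the DL itself.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class UsageEvent(NamedTuple):
    user: str             # who accessed the item
    producer: str         # who holds the rights to the item
    fee_millicents: int   # integer units avoid floating-point rounding errors


def aggregate(events: List[UsageEvent]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Roll individual micro-payments up into user bills and producer payouts."""
    bills: Dict[str, int] = defaultdict(int)
    payouts: Dict[str, int] = defaultdict(int)
    for event in events:
        bills[event.user] += event.fee_millicents
        payouts[event.producer] += event.fee_millicents
    return dict(bills), dict(payouts)


events = [
    UsageEvent("alice", "publisher_a", 250),
    UsageEvent("alice", "author_b", 100),
    UsageEvent("bob", "publisher_a", 250),
]
bills, payouts = aggregate(events)
```

Since every fee is added once to a bill and once to a payout, the two totals always balance, which is a useful sanity check for any accounting scheme of this kind.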

Theme 11. DLs need sustainable business models.

Intellectual and technical excellence by themselves do not guarantee the success of a DL; the DL must have a business model for creating the income necessary for its operation. Most paper libraries are funded as public utilities or infrastructure components in organizations. The business model in this case is to make a convincing case for the library's usefulness to the funding organizations. This model applies to many DLs as well, particularly to DLs that grew out of or are extensions of paper libraries. But DLs also grow out of publishing houses, and here the business model is one of selling information access. Furthermore, computerized tracking of use and services allows for a charge-back model—say within a company—presumably supporting only services that users are willing to pay for, at least in "funny money". (Whether this is the best option for the company or actually leads to harmful under-use of information is another question.) With the "value adding" expansion of functionality suggested here, there are also many opportunities to create DLs of direct value to commercial organizations; such DLs can be sustained (and even made profitable) from user fees. On the other hand, a DL in the humanities must likely rely on public or grant support unless it is used in courses as a textbook replacement and could thus derive income from student purchases. So there is no one-size-fits-all business model, but rather a variety of business models mirroring the variety of information providers (such as authors and publishers) and intermediaries (such as libraries) that get into the digital library business.

Conclusion

To advance digital libraries to their full potential, a broad-based digital library research and development framework is needed both to evaluate and integrate existing research and practice and to provide a structured vision for what digital libraries can be. This commentary presented such a framework by integrating ideas from many sources. Because all the elements of this framework interact and depend on each other, digital library research must address not some but all of the eleven themes outlined here.

Notes and References

[1] DELOS brainstorming report (San Cassiano, Alta Badia, Italy, June 2001), ERCIM-02-W02, <http://delos-noe.iei.pi.cnr.it/activities/researchforum/Brainstorming/brainstorming-report.pdf>.

[2] EU-NSF DL all projects meeting in March 2002 in Rome, <http://www.ercim.org/publication/Ercim_News/enw50/thanos.html>.

[3] See, for example, ERCIM News, No.48, January 2002, Special Theme: e-Government, <http://www.ercim.org/publication/Ercim_News/enw48/>.

[4] The Digimorph Digital Library was established by Tim Rowe at the University of Texas at Austin. See <http://digimorph.org/aboutdigimorph.phtml>.

[5] See the Perseus Digital Library (Gregory Crane, Tufts University) at <http://www.perseus.tufts.edu/PR/vr.ann.html>.

[6] ARION, Forth, <http://dlforum.external.forth.gr:8080>, and <http://dlforum.external.forth.gr:8080/papers/ARION-paper_v52.pdf>.

[7] CHLT (Cultural Heritage Language Technologies) Jeffrey Rydberg-Cox, University of Missouri, Kansas City, <http://www.chlt.org/>.

[8] Komlodi, Anita Hajnalka; Soergel, Dagobert. Attorneys interacting with legal information systems: tools for mental model building and task integration. ASIST 2002, p. 152-163, <http://www.research.umbc.edu/~komlodi/papers/komlodi_soergel_asis2002.pdf>. See also <http://www.research.umbc.edu/~komlodi/papers/akomlodi_dissertation.pdf>.

[9] See CYCLADES, IEI-CNR, at <http://www.ercim.org/cyclades/>.

[10] Soergel, Dagobert. SemWeb. An environment for integrated access to distributed ontological and lexical knowledge bases and their collaborative development and maintenance. A proposal <http://www.clis.umd.edu/faculty/soergel/soergelSEMMULS7.html>. See also, Atlas for Science Literacy <http://www.project2061.org/tools/atlas/default.htm>.

[11] DFG-Projekt "DWB auf CD-ROM und im Internet", <http://www.DWB.uni-trier.de>.

[12] COLLATE, FhG IPSI, <http://collate.de>.

[13] Kopak, Richard W. Functional link typing in hypertext. ACM Computing Surveys, December 1999; 31(4es). Available from <http://www.acm.org/dl>. See also, <http://www.cs.brown.edu/memex/ACM_HypertextTestbed/papers/41.html>. For a brief discussion of implementation, see <http://www.w3.org/2000/02/rdf-xlink>.

[14] Greenstone DL package, <http://www.greenstone.org/english/home.html>.

[15] META-E, University of Innsbruck, <http://heds.herts.ac.uk/conf2002/heds2002_metae.pdf>.

[16] Droettboom, M., K. MacMillan, I. Fujinaga, G. S. Choudhury, T. DiLauro, M. Patton, and T. Anderson. 2002. Using the Gamera framework for the recognition of cultural heritage materials. Joint Conference on Digital Libraries. JCDL 2002 Proceedings p. 11-17. Available from <http://www.acm.org/dl> and from <http://dkc.jhu.edu/gamera/papers/>.

[17] The Multivalent Browser; A Platform for New Ideas, <http://www.cs.berkeley.edu/~phelps/Multivalent>.

[18] CiteSeer, <http://www.neci.nec.com/homepages/lawrence/citeseer.html>.

[19] DELOS Workshop on Evaluation of digital libraries: Testbeds, measurements, and metrics, <http://www.sztaki.hu/conferences/deval/>.

 

Copyright 2002 Dagobert Soergel

DOI: 10.1045/december2002-soergel