The Digital Library Research Agenda

What's Missing -- and How Humanities Textbase Projects Can Help

Allen Renear
Scholarly Technology Group, Brown University
Providence, Rhode Island
http://www.stg.brown.edu

D-Lib Magazine, July/August 1997

ISSN 1082-9873

Introduction and Overview

This is a good time to take stock of how we are doing in the development of a digital library "research agenda". For over ten years now, the notion of "electronic" or "virtual" libraries has been a fairly focused and self-conscious area of effort, and the NSF/NASA/ARPA Digital Library Initiative, which has given some formality to the notion of a digital libraries research agenda, is finishing its third year. So how are we doing?

Three questions can be asked about the topics on any applied research agenda:

Are they theoretically deep?
Can they be usefully investigated with current techniques and methodologies?
Are they of timely practical consequence?

At first glance, the answers to these questions, with respect to digital library research, seem emphatically positive: thoughtful researchers and well-administered, highly capable institutions have brought powerful analytical methods to bear on important problems -- and with extremely promising results. A wide range of critical issues has been taken up (retrieval, copyright, economics, multimedia, metadata, etc.), and prototypes and pilot projects decisively demonstrate the practical value of the systems and techniques being developed. No one can fail to have a sense that we are in the midst of a revolution in how we create, organize, and communicate knowledge.

However, there is also an important question to ask about the agenda as a whole: namely, has anything important (i.e., deep, tractable, and practical) been left out? And, in fact, if we blink twice and take a fresh and critical look at current digital libraries research, another, perhaps modified picture may appear -- one in which something is, indeed, missing.

At the heart of knowledge management, historically, currently, and in all the optimistic scenarios of the informatic future, are rich, structured, highly functional documents. This has been the case for literally thousands of years; it is evident today when we walk through a library and browse a book or scientific journal; and it is exhibited in almost every one of the many recent achievements in publishing and textual communication. In fact, one could very plausibly argue that the deepest theoretical achievements, as well as the most dramatic practical innovations, in digital knowledge management have almost all been based on insights into document structure.

Of course, this area is not being entirely neglected by current digital library research; much critical work on documents is being done at a number of sites. But overall, I would argue, the attention being given to research topics in the area of document structure and functionality is not nearly what would seem to be warranted either by the theoretical centrality of the document as a feature of knowledge management, or by its evident fertility as an area of historically productive research and development. Within the computer science and engineering communities in particular, there seems to be relatively little growth in the amount of research devoted to these issues -- Berkeley's work on multivalent documents and a few other examples notwithstanding. This means that, proportionately, the role of document structure studies in digital library research seems to be diminishing.

However, over the last 30 years a community has been evolving which, though still small and not particularly well institutionalized, is focused on research and development in the core issues of knowledge representation and document structure. This is the portion of the humanities and social science computing community that is developing SGML "textbases". I would suggest that this community, in conjunction with other traditional sources of document-oriented research (such as the hypertext, hypermedia, and SGML communities), can help maintain document studies at the center of the digital library research agenda.

Why Documents?

Documents have been at the heart of knowledge management for thousands of years. In fact, the history of knowledge technologies seems to divide into two initial phases: before and after the development of the document. Before documents, an oral culture relied on mnemonic devices of formula, narrative, meter, melody, dance, ritual, and so on in order to develop, organize, and transmit knowledge. With the development of textual communication came new and vastly more effective devices for managing information, ones which had an enormous impact on cultural and economic life. Much of what we take for granted in thinking about digital libraries is the result of reflecting on how complicated modern documents actually work as vehicles of knowledge. This includes the work of the hypertext titans -- Bush, Engelbart, and Nelson -- and continues into the SGML and hypertext research of the last thirty years. As a result, it is not implausible to say that most of the major achievements in the realm of digital libraries are also based on insights into document structure. These include such things as formal grammars for documents (such as SGML and HTML), modular separation of form and content (in stylesheets), hypertext data models, multiple views, etc. How, specifically, does a focus on document structure as a topic fare against our research agenda evaluation questions? Rather well:

1) The notion of a document is theoretically deep.

Information is most naturally and typically organized into document-like structures, traditionally, currently, and in any foreseeable future. These structures are visually evident in the documents we see around us, of course, but such familiar things as titles, sections, equations, citations, etc. are only the most obvious document elements. As processing tools become more powerful and fine-grained, these structures are naturally elaborated to include more subtle and discipline-specific objects. These elements not only play powerful predictive and explanatory roles in our reasoning about documents, but they are also the salient features of documents as knowledge representation structures, and as such are the foundation for the functionality of the digital library. Ontologically speaking, the stuff of which documents are made is the stuff of knowledge itself.

2) Document structures are susceptible to current methodologies and techniques.

Many current techniques and methods (themselves often based on new results in basic science in other fields) can be brought directly to bear upon the investigation of document structure. These include formal grammars and computability theory, parser and compiler construction theory, theories of abstraction and indirection, object orientation, text linguistics, discourse analysis, and so on. In fact, what has made a focus on document structure so productive is that it has specifically evolved to take advantage of these recently developed techniques and methods.

3) A focus on documents is practical.

Perhaps in the distant future, new techniques of complementary decomposition and "connectedness" will create knowledge structures so different from those we are used to that the notion of a document will become attenuated beyond recognition. Although I suspect that the essential nature of the document secures it an enduring role in human knowledge organization, in fact the relevance of documents to current work is independent of this possibility: for in the foreseeable future there can be no doubt that knowledge management will be, as it is now, dominated by "document-like objects".

Have Documents Yielded All They Have to Give?

It might be argued that while documents are perhaps in some sense central to digital library research, document structure, as exemplified by the work of the SGML and hypertext research communities, is not. On this view, these communities have contributed useful data models and formalisms, which were important, even critical, at the time, but the action has now moved elsewhere.

Far from it. Much of the work up until now has revolved around the syntax and formal structure of documents and issues related to representing this structure and providing functionality. And while much more remains to be done (as major problems with SGML attest), the most central, and the hardest, work of all is ahead of us: developing the semantics and pragmatics of document structures as high-performance vehicles of knowledge representation. It is a daunting project, but if we are successful there is enormous functionality to be achieved.

How Humanities Textbase Projects Can Help

In the ecology of the digital libraries community, humanities textbase projects have, of course, a natural role as developers of content and of domain-specific methods, techniques, and knowledge. But they can also play another role: they are in a unique position to help keep the exploration of models and tools for document structure a central part of the digital libraries research agenda.

Father Roberto Busa began his digital library (St. Thomas Aquinas) in 1947. Since then, computing humanists have been wrestling with making the knowledge structures implicit in documents tractable to mechanical processing. Not surprisingly, computing humanists became deeply involved in the hypertext and SGML research communities that developed in the 1970s and 1980s. And because, for computing humanists, the text was always the main thing, the source of insight and the center of disciplinary activity, there was a general conviction that the key to using computers to support disciplinary work lay in a deep understanding of the nature of documents and in developing tools based on that understanding.

This community and its agenda were given a focus in the late 1980s, when the Text Encoding Initiative (TEI) became a rallying point for the interdisciplinary community involved in modelling documents. TEI researchers and the textbase projects related to the TEI brought a wide variety of disciplinary methods, and problems, to bear in the development of encoding techniques and theories (see, for example, David Chesnutt's story in this issue). They dealt with theoretical problems, such as concurrent non-hierarchical structures (which cannot be modeled in SGML), and practical problems, such as version control and optimizing performance. Increasingly, they are now turning their attention beyond the problems of syntax to the difficult but critical problems of formalizing semantics and pragmatics in deep, discipline-specific ways. Although some of their achievements have been published, most remain buried in encoding manuals, project documentation, trip reports, email logs, and (yes) an oral tradition.

Conclusion

Arguably, a digital library is, if not a collection of documents, at least a structure of "document-like" material. If this is so, then the investigation of documents -- of their essential nature and composition -- must be at the core of the digital libraries research agenda. In helping the digital library community keep this focus, humanities textbase projects can play a valuable role.

hdl:cnri.dlib/july97-renear