Jeffrey A. Rydberg-Cox
One could argue that there are no true libraries with millions of volumes.1 Buildings exist that do contain and provide access to collections of this size, but no single human being can make productive use of more than a fraction of such collections. Digital libraries make available even richer collections and have already begun to increase the set of documents we can examine. Books locked away in special collections and rare book libraries, where a handful of visitors used them in previous years, now attract thousands of readers from around the world. But even large-scale digital libraries serve users with particular needs at particular times. To understand the impact of digital libraries we need to think not only about global access via an OPAC or a search engine but also about access to thematically coherent clusters of information.
This article considers the needs of an area of study that stands to benefit particularly from emerging comprehensive digital libraries, early modern printed books. These books, from Gutenberg to the middle of the eighteenth century, are printed with fixed letter forms and are the earliest documents that lend themselves to optical character recognition. Immense challenges lie before us as we struggle to make these new materials accessible in digital libraries; actual OCR technology, for example, is optimized for contemporary print and performs poorly with most early printed books. In no area of historical study do the limitations of print publication and distribution present so many obstacles and so constrain our ability to understand this period.
Most of these texts cannot be read and understood by those who do not happen to be experts in the field of early modern studies. In fact, the experts themselves quite often find themselves at a loss facing the linguistic variety and multifarious learning present in these texts. Online access to these materials will not have a substantial impact unless online knowledge resources can be used to make them intelligible to larger audiences. The prospect of a vast digital collection of European books printed before 1750 therefore poses a great challenge. The scope of this challenge has not yet been properly identified, and current digitization technologies have not even begun to formulate a response.
The scope of this challenge begins to emerge in the above extract from Jan Jonston's Historiae Naturalis De Piscibus Et Cetis Libri V published in Frankfurt in 1650. While all aspects of this text may not correspond to modern usage, this work exhibits many of the conventions that we find in contemporary publications (page numbers, explicit division of the work into 'chapters' and 'sections'). This work, for example, includes explicit source citations in the marginalia rather than in the text. This makes it difficult to economically extract crucial data using conventional OCR or data entry technologies. Books from earlier periods exhibit even fewer modern conventions and correspondingly pose even more problems associated with text acquisition and encoding.
Traditionally, important pre-1750 texts have been painstakingly edited and annotated in order to make them intelligible to the modern reader. This level of editing and annotation cannot be achieved on a scale as large as that envisioned by Google Library, the Open Content Alliance or the planned European Digital Library. There are probably more than 2 million pre-1750 books (bibliographical items, journals included) totaling almost half a billion pages. This "literary ocean," as it has been called, fortunately has its islands and its ports. The islands (trodden ground) are the classical texts of antiquity, the middle ages and the early modern period that have been edited and explained so they can be appreciated by those interested today. The ports are those reference works both early modern and contemporary that contain and summarize the store of information that was available to writers and readers in early modern Europe.
In order to support exploration of a vast collection of digitised pre-1750 books as envisaged by the European Digital Library Initiative, classical texts and reference books from the early modern period must be offered in an integrated online environment. These books represent the core of humanistic education that was in part engraved in the memory of every man who had attended a grammar school and finished a liberal arts course. They constitute the knowledge and assumptions that are present in any writing of the period, whether in Latin or in the vernacular. These books must be carefully edited, annotated and cut down to bits of information so they can feed the process of exploring the unknown texts of the unfathomed literary ocean.
Designing digital collections of the early modern period
We are less concerned with relating these source collections to current systems of knowledge than we are with revealing them as historical phenomena that shed light upon the very different habits of thought at play in their own time. We are thus dealing not only with collection development assembling and describing objects but also with historical analysis of these works both individually and in collections.
Image books that are accompanied by transcriptions are essential for this process. We focus upon everything that is visible. We are, therefore, not only interested in an electronic text that is abstracted from an original book, but also in the concrete appearance of the book as well. A transcription of the text from Jonston's Historiae Naturalis De Piscibus that does not also include the engravings (illustrated below) fundamentally fails to deliver the essence of the work to readers in a digital environment.
Digital editions of early modern texts must navigate a course between the unique identity of every early modern copy and the desire for authoritative editions.2 A digital edition that records many unique features of its source text often must also serve as a surrogate for a complete edition. Each digitization project must decide which details from their particular exemplar lie within their areas of concern. CAMENA,1 for example, does not provide any analysis of the binding or transcribe handwritten comments found within the text.
The first, comprehensive step towards recording the historical source material is bibliographic description. The resulting metadata should make it possible to organize the materials according to period, place, language, literary genre, producer and audience. Since these digitized materials will later be used in contexts far removed from those in which they were digitized, it is essential that we follow, wherever possible, internationally recommended best practices and guidelines for metadata.
Classification systems such as the Dewey Decimal system (used by the German Portal Digitalisierte Drucke) or the American Library of Congress classification scheme do not capture the conceptual structures of early modern writing and publication. We need another organizational scheme that reflects the dominant figures of thought of our period. Classification based on form (e.g., dictionaries, miscellaneous writings, general vs. special treatises), space and time, competes in library classification schemes with ontology-based concept hierarchies (e.g., genus species) that would be more appropriate for our sources.
A historically appropriate organizational scheme must emerge from the sources themselves, especially from their own thematic divisions. We should capture such patterns that of course may vary by time, place, and social circumstances (with a particular emphasis in the early modern period on religious affiliations). We are thus not looking for a normative, uniquely fashioned system and certainly not a typical library classification scheme. The system must go deeper and be more flexible, since our goal is to represent what we actually find in the sources.
Although classification schemes such as DDC and LC are of limited use, they do contribute in two ways. First, they provide to those who are not immersed in the early modern period a familiar and internally coherent overview and point of entry to the materials. Second, these schemes allow us to link recent scholarly literature with the early modern sources. The categories from these systems should, however, be used as keywords associated with particular documents and should not appear at the higher levels as the primary or dominant organizational structure for the collections.
Aside from the bibliographic metadata we need to identify the structural metadata, i.e., front, body, back, along with at least the major divisions of the text, and to capture their heads, roughly corresponding to a table of contents and providing a framework by which we can make individual page images available. To a sizable collection of structural metadata, especially tables of contents and summaries, we can then apply automatic text similarity measures and thereby recognize text families. Such filiations can trace the influence of major primary sources (e.g., works by Aristotle, Cicero, Hieronymus, Augustinus, Albertus Magnus, Thomas of Aquinas) or the specific purposes of a given document (e.g., instruction, sermon, etc.).
Where good early modern printed indices are available and the significance of the work justifies the expense, these should be carefully transcribed and linked with the appropriate pages of text. Indices in early modern books are often simpler in structure than the main text and can thus often be captured with OCR. Beyond the content of individual works, indices also document the keywords associated with a particular thematic field. Furthermore, indices often provide semantic and factual pieces of information, adding brief glosses, synonyms, summaries and similar data to their entries. These indices can be combined with the automatically generated clusters of 'text families' to provide a conceptually coherent entry point into a body of texts that is based on a classification scheme used in the period of our texts.
Some publications contain texts of immense cultural significance that warranted repeated manually produced editions in the past and now would be candidates for careful digitization and tagging. Moreover, contemporary reference works, which contain information and ideas current at the time, can be equally important, since these provide the context by which readers can understand a broad range of publications. In these cases, we should go beyond the bibliographic and structural metadata and provide richer semantic markup as well. Automatic linking tools such as SCALE3 can connect keywords and phrases in a source text to entries in encyclopedias and similar reference works. Such links can provide a first pass at semantic markup, with named entity identification systems applied to disambiguate ambiguous references in a given context. Users will have access to the cumulative and up-to-date content of the indices. Over the long term, a community driven system such as a Wiki or Noosphere can collect corrections and additions to the automatically generated markup.4
In the long run, our goal is to open up the early modern period to new audiences. Given the more specialized nature of our undertaking, we will be more dependent upon academic partners than more general projects such as Wikipedia or Project Gutenberg: the initial barriers of language and technical knowledge raise major challenges to amateur collaborators.
The services created in this way must be compared against Google Book Search, Eighteenth Century Collections Online,5 and similar portals to academic content. Extra materials will compensate for the inevitable gaps in full text collection: enriching of key words with lexical and factual data; linking and categorization of key words; communication of similar texts not only through bibliographic metadata but in a fine grained fashion down to the level of individual chapters and reference work entries.
The program outlined here is conceived of as an added value enterprise. It can make use of digitized resources that emerge from various sources and is thus not limited to current or planned projects and funding priorities. Furthermore, this program lays a core foundation on which to build fundamentally broader and more elaborate information sources.6 Since the digitization of early modern books is most widely disseminated in the European Union, this work is well suited to the EU i2010 initiative. The European dimension of the project is clear: a common learned language (Latin), texts distributed across boundaries, interrelatedness of texts produced in different countries, the commonly cited texts of the Respublica Litterarum. Essential and pressing is acquisition of digital copies of key source texts, ancient and medieval as well as early modern, and early modern information resources. This should include recently republished and reedited primary materials as well as digital versions of original sources. At this point, the electronic texts are no longer mere supplements to the traditional, expensive print publications but become our core resources.
Universities will need to recognize and reward contributions in this new environment, supporting the collaborative editors and commentators required by the emerging digital environment. In this world, the fixed edition produced by a single editor or editorial team may no longer be the dominant form of publication.
Implications for emerging digital libraries
A range of projects in Europe and elsewhere in the world has begun creating page images of early modern books. These efforts have provided a laboratory within which to gauge both challenges and opportunities. Large scale book scanning opens up the possibilities for both creating new services and replacing early efforts.
Digitization projects quite often have produced page images that are inconsistent and/or inadequate. Pages are imperfectly captured, with content near the folds particularly prone to being undecipherable. Color and resolution are inadequate for many interpretive tasks, especially when we are studying illustrations or the numerous texts that were printed poorly in the first place. Above all, only the original text appearance including the layout and the material quality is an object of historical research. Given the linguistic, typographical and structural complexity of many early printed books, the production of machine readable full text will not be available in the near future. Major projects such as Google Library, Open Content Alliance and EU i2010 provide an opportunity to create collections that are not only large in scope but also consistently high quality.
We need more flexible OCR tools if we are to generate usable transcripts of many, if not most, early modern books. The GAMERA project and general research agenda outlined in Choudhury's contribution to this issue are fundamental to any large scale digital library of early modern culture. Centuries ago, observers pointed out some of the problems that bedevil OCR today: "the complex range and variety of typefaces today renders typography amazingly ornate in comparison with those crude and simplistic beginnings of this art. A single page may display ten, twelve or more distinct varieties of this sort."7
Pre-standardized language complicates automatic tagging and linking. For all the variation in contemporary documents, most languages have approved spellings that published materials strive to follow. In the early modern period, authors freely used whatever spellings they chose, often spelling the same word differently from one page to the next. The lack of standardization includes not only spelling but higher level constructs: quotations, literary citations, synonyms and etymologies all appear in widely varying forms. Contemporary academic publications at least try to follow normative patterns. Automated analysis of early modern texts, even if cleanly transcribed, must deal with even greater complexity.
If tools can be developed that deal with these complexities, the potential reward is sizable because learned books of the modern period have an immense amount of internal structure and data. While format may be inconsistent, the underlying content is rich. One 17th century observer commented: "Rarely in this era does a book come out in which, aside from the front matter directed at the reader and others ... summaries or synopses, as they call them, lists of cited authors, subject indices, marginal notes, chapter headings or inscriptions, and other things of this sort are not added."8 If we can extract data records from these rich sources, they can provide the seeds for further textual analysis, supporting named entity analysis and query expansion for information retrieval.
Linguistic tools for the Latin of early modern Europe are in their infancy. Aside from the issues outlined above, two major challenges face humans and computers alike. First, we have no comprehensive dictionary of Neo-Latin. Readers must cope with neologisms or, often much harder to decipher, idioms and turns of expression of particular groups. Second, aside from morphological analyzers such as Morpheus the Latin morphological analyzer found in the Perseus Digital Library we have few computational tools for Latin. Even Morpheus does not use contextual clues to prioritize analyses, and we are not aware of any substantive work on named entity recognition in Latin. We do not yet have mature electronic authority lists for the Greco-Roman world, much less the people, places, etc. of the early modern period. Promising preliminary work in Latin syntactic analysis has appeared,9 but this work remains at an early stage of development. Machine translation could revolutionize early modern studies, opening up the field to whole new audiences and research questions, but no substantive work has so far taken place in this regard.
Digital libraries have the potential to transform fields such as early modern studies, where problems of physical access to sources and intellectual access to their contents have hampered our ability to contemplate major topics. Collection development strategies must, however, be forward thinking and pro-active: we must build collections that can be used with the technology at our disposal today but that can also become not only larger but more intelligent over time. Our work with early modern materials documents concrete decisions to collect sources that are of interest not only for their own sake but also because they shed light on the collection as a whole. We thus need not only poetry and various categories of literary prose but the encyclopedias, indices, marginalia and other materials that shed light on the period and the ideas that this period produced. We need comprehensive digital libraries, but we need carefully structured reference works as well.
1. The work reflected here was supported by a variety of funders. The Deutsche Forschungsgemeinschaft supported the CAMENA and TERMINI projects: <http://www.uni-mannheim.de/mateo/camenahtdocs/camena_e.html>; <http://www.uni-mannheim.de/mateo/camenahtdocs/camenaref_e.html>. Support from the National Endowment for the Humanities Division of Preservation and Access has funded Dr. Rydberg-Cox's research on the problem of digitizing early printed books: <http://daedalus.umkc.edu/incunables/index.html>, which in turn builds upon the results of the Cultural Heritage Language Technology project <http://www.chlt.org>, funded by the National Science Foundation and the European Union: <doi:10.1045/may2005-rydberg-cox>.
2. In some cases, scholarship has tracked minute changes that emerge during the printing process, with errors in the typesetting fixed as the printers notice them: see, for example, Charlton Hinman, The Printing and Proof-reading of the First Folio of Shakespeare (Oxford 1963).
4. While Wikipedia, <http://www.wikipedia.org>, is the most famous example of a wiki, many projects already use wikis to maintain documentation, generate shared discussions and similar tasks; the Noosphere tends to support more editorial control, with individual authors owning particular articles: <doi:10.1045/october2003-krowne>. The National Endowment for the Humanities funded in the fall of 2005 a community driven project, Pleiades: An Online Workspace for Ancient Geography: <http://www.unc.edu/awmc/pleiadesannc1.html>.
6. TERMINI, <http://www.uni-mannheim.de/mateo/termini/index_e.html>.
7. "Sed et multiplex typorum diversitas et varietas hodie Typographiam mirifice exornat, prae rudibus illis et simplicioribus huius artis principiis, cum una saepe pagina eius generis decem, duodecim aut amplius conspiciendas exhibeat diversas species." (Bernhard von Mallinckrodt: De ortu ac progressu artis typographicae dissertatio historica. Köln, 1640, p. 105).
8. Bernhard von Mallinckrodt: De ortu ac progressu artis typographicae dissertatio historica. Köln, 1640. Cap. XVII. "Comparatio antiquae et hodiernae typographiae. Quarta differentia (p. 104): Rarus enim hoc tempore [...] liber exit, cui non praeter eiusmodi praeliminares ad Lectorem aliosque [...] etiam Summaria ut vocant seu Synopses, Auctorum allegatorum Catalogi, rerum et materiarum Indices, marginales notationes, capitum Lemmata seu inscriptiones, aliaque similia accedant, quorum omnium usum et necessitatem non attinet pluribus commendare."
9. Koster, C.H.A. "Constructing a Parser for Latin," in Computational Linguistics and Intelligent Text Processing: 6th International Conference, CICLing 2005, Mexico City, Mexico, February 13-19, 2005.
Copyright © 2006 Wolfgang Schibel and Jeffrey A. Rydberg-Cox