Digital Conversion of Research Library Materials

A Case for Full Informational Capture

Stephen Chapman and Anne R. Kenney
Department of Preservation and Conservation
Cornell University Library, Ithaca, NY
[email protected], [email protected]

D-Lib Magazine, October 1996

ISSN 1082-9873

  • Abstract
  • Introduction
  • Retrospective Conversion of Collections to Digital Image Form
  • Full Informational Capture
  • Costs and Benefits
  • References

    Abstract

    Digital collections will remain viable over time only if they meet baseline standards of quality and functionality. This paper advocates a strategy to select research materials based on their intellectual value, and to define technical requirements for retrospective conversion to digital image form based on their informational content. In a rapidly changing world, the original document is the least changeable. Defining conversion requirements according to curatorial judgments of meaningful document attributes may be the surest guarantee of building digital collections with sufficient richness to be useful for the long-term. The authors argue that there are compelling economic, access, and preservation reasons to support this approach, and present a case study to illustrate these points.



    Introduction

    The principal discussions associated with digital library development to date have focused on the promise of technology to help libraries respond to the challenges of the information explosion, spiraling storage costs, and increasing user demands. Economic models comparing the digital to the traditional library show that digital will become more cost-effective provided the following four assumptions prove true:

    • libraries will share electronic resources rather than each maintain comprehensive local collections;
    • the costs of maintaining and distributing electronic collections will be lower than those for paper;
    • user demands for timely access to information will be met; and
    • the information converted will retain its value over time.[1]

    These four assumptions -- resource sharing, lower maintenance and distribution costs, meeting user demands for timely access, and continuing value of information -- presume that electronic files will have relevant content and meet baseline measures of functionality over time. Of course, even if all digital collections were created to meet a common standard and level of functionality, there is no guarantee that use would follow. Last year, the Australian cultural community issued a statement of principles pertaining to long-term access to "digital objects." One of their working assumptions is that not all digitized information should be saved, and that resources should be devoted to retaining digital materials "only for as long as they are judged to have continuing value and significance."[2] This statement may reflect a realistic approach and may serve as a good yardstick against which to measure the acquisition of current and prospective electronic resources. But it also signals a cautionary note regarding efforts to retrospectively convert paper- and film-based research library materials to digital image form.

    There has been a great deal of recent activity -- most notably at the Library of Congress -- to make nineteenth century materials accessible electronically to students, scholars, and the public as quickly as possible. The Department of Preservation and Conservation in the Cornell University Library has been a leading participant in such efforts, having created over 1 million digital images in the past five years. From our experience, we have developed a set of working principles -- which we have freely characterized as our "prejudices" in another forum [3] -- that govern the conversion of research collections into digital image form. Among these are the following beliefs. We believe that digital conversion efforts will be economically viable only if they focus on selecting and creating electronic resources for long-term use. We believe that retrospective sources should be selected carefully based on their intellectual content; that digital surrogates should effectively capture that intellectual content; and that access should be offered to those surrogates in a more timely, usable, or cost-effective manner than is possible with the original source documents.[4] In essence, we believe that long-term utility should be defined by the informational value and functionality of digital images, not limited by technical decisions made at the point of conversion or anywhere else along the digitization chain.

    This paper defines and advocates a strategy of "full informational capture" to ensure that digital objects rich enough to be useful over time will be created in the most cost-effective manner. It argues for the retrospective conversion of historically valuable paper- and film-based materials to digital image form. Currently, scanning is the most cost-effective means to create digital files, and digital imaging is the only electronic format that can accurately render the information, page layout, and presentation of source documents, including text, graphics, and evidence of age and use. By producing digital images, one can create an authentic representation of the original at minimal cost, and then derive the most useful version and format (e.g., marked-up text) for transmission and use.


    Retrospective Conversion of Collections to Digital Image Form

    Converting collections into digital images begins with project planning, and then ostensibly follows a linear progression of tasks: select/prepare, convert/catalog, store/process, distribute, and maintain over time. It is risky, however, to proceed with any step before fully considering the relationship between conversion -- where quality, throughput, and cost are primary considerations -- and access, where processibility, speed, and usability are desirable. Informed project management recognizes the interrelationships among each of the various processes, and appreciates that decisions made at the beginning affect all subsequent steps. An excessive concern with user needs, current technological capabilities, image quality, or project costs alone may compromise the ultimate utility of digital collections. At the outset, therefore, those involved in planning a conversion project should ask, "How good do the digital images need to be to meet the full range of purposes they are intended to serve?" Once the general objectives for quality and functionality have been set, a number of interrelated factors -- among them the attributes of the source documents, the conversion specifications chosen, and the capabilities of the available scanning technology -- will collectively determine whether these benchmarks are met.

    Michael Ester of Luna Imaging has suggested that the best means to ensure an image collection's longevity is to develop a standard of archival quality for scanned images that incorporates the functional range of an institution's services and programs: "It should be possible to use the archival image in any of the contexts in which [the source] would be used, for example, for direct viewing, prints, posters, and publications, as well as to open new electronic opportunities and outlets."[5]

    If it is true that the value of digital images depends upon their utility, then we must recognize that this is a relative value defined by a given state of technology and immediate user needs. Pegging archival quality to current institutional services and products may not be prudent, considering the fast-changing pace of technological development. Geoffrey Nunberg of Xerox PARC has cautioned that digital library design should remain flexible "to accommodate the new practices and new document types that are likely to emerge over the coming years." For these reasons, he argues, "it would be a serious error to predicate the design . . . on a single model of the work process or on the practices of current users alone, or to presuppose a limited repertory of document types and genres."[6]

    It is difficult to conceive of the full range of image processing and presentation applications that will be available in the future -- but it is safe to say they will be considerable. Image formats, compression schemes, network transmission, monitor and printer design, computing capacity, and image processing capabilities, particularly for the automatic extraction of metadata (OCR) and visual searching (QBIC), are all likely to improve dramatically over the next decade. We can expect each of these developments to influence user needs, and lead users to expect more of electronic information. As long as functionality is measured against how well an image performs on "today's technology," then we would expect the value of the digital collections we are currently creating to decrease over time.

    Fortunately, functionality is not solely determined by the attributes of the hardware and software needed to make digital objects human-readable. Image quality is equally important. We believe that a more realistic, and safer, means for accommodating inevitable change is to create digital images that are capable of conveying all significant information contained in the original documents. This position brings us back to the source documents themselves as the focal point for conversion decisions -- not current users' needs, current service objectives, current technical capabilities, or current visions of what the future may hold. By placing the document at the center of digital imaging initiatives, we provide a touchstone against which to measure all other variables. In the rapidly changing technological and information environment, the original document is the least changeable -- by defining requirements against it, we can hope to control the circumstances under which digital imaging can satisfy current objectives and meet future needs.


    Full Informational Capture

    The "full informational capture" approach to digital conversion is designed to ensure high quality and functionality while minimizing costs. The objective is not to scan at the highest resolution and bit depth possible, but to match the conversion process to the informational content of the original -- no more, no less. At some point, for instance, continuing to increase resolution will not result in any appreciable gain in image quality, only a larger file size. The key is to identify, but not exceed, the point at which sufficient settings have been used to capture all significant information present in the source document. We advocate full informational capture in the creation of digital images and sufficient indexing at the point of conversion as the surest guarantee for providing long-term viability.[7]

    Begin with the source document

    James Reilly, Director of the Image Permanence Institute, describes a strategy for scanning photographs that begins with "knowing and loving your documents." Before embarking on large-scale conversion, he recommends choosing a representative sample of photographs and, in consultation with those with curatorial responsibility, identifying key features that are critical to the documents' meaning. As one becomes a connoisseur of the originals, the task of defining the value of the digital surrogates becomes one of determining how well they reflect the key meaningful features. In other words, digital conversion requires both curatorial and technical competency in order to correlate subjective attributes of the source to the objective specifications that govern digital conversion (e.g., resolution, bit depth, enhancements, and compression).

    Table 1 lists some of the document attributes that are essential in defining digital conversion requirements; each will have a direct impact on scanning costs. Where color is essential, for example, scanning might be 20 times more expensive than black and white capture.

    Table 1. Selected Attributes of Source Documents to Assess for "Significance"[8]

    bound and unbound printed materials

    • size of document (w x h, in inches)
    • size of details (in mm)
    • text characteristics (holograph, printed)
    • medium and support (e.g., pencil on paper)
    • illustrations (content and process used)
    • tones, including color
    • dynamic range, density, and contrast

    photographs

    • format (35 mm, 4" x 5", etc.)
    • detail and edge reproduction
    • noise
    • dynamic range
    • tone reproduction
    • color reproduction

    Once key source attributes have been identified and translated to digital requirements, image quality should be confirmed by comparing the full resolution digital images on-screen and in print to the original documents under magnification[9] -- not by appealing to some vaguely defined concept of what's "good enough," or to how well the digital files serve immediate needs. If the digital image is not faithful to the original, what is sufficient for today's purposes may well prove inadequate for tomorrow's. Consider, for example, that while photographs scanned at 72 dpi often look impressive on today's computer monitors, this resolution shortchanges image quality in a printed version, and will likely be inadequate to support emerging visual searching techniques.

    Case study: the brittle book

    To illustrate the full informational capture approach, let's consider a book published in 1914 entitled Farm Management, by Andrew Boss. This brittle monograph contains text, line art, and halftone reproductions of photographs. Curatorial review established that text, including captions and tables, and all illustrations were essential to the meaning of the book. For this volume, the size of type ranged from 1.7 mm for the body of the text to 0.9 mm for tables; photographs were reproduced as 150 line screen halftones. Although many pages did not contain illustrations, captions, or tables, a significant number did, and resolution requirements to meet our quality objectives were pegged to those pages so as to avoid the labor and expense of page-by-page review. We determined that 600 dpi bitonal scanning would fully capture all text-based material, and that the halftones could be captured at the same resolution provided that enhancement filters were used.
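    The resolution choice above can be checked with the quality index (QI) benchmarking method cited in note [7]. The sketch below assumes the bitonal form of the published formula (QI = 0.039 x dpi x h / 3, with h the height of the smallest significant character in mm); the constant and the interpretive thresholds are assumptions to be verified against the tutorial itself.

```python
# Benchmarking sketch: estimate the quality index (QI) that a given scanning
# resolution yields for the smallest significant character on a page.
# Assumes the bitonal quality-index formula from the Cornell benchmarking
# tutorial: QI = 0.039 * dpi * h / 3, where h is character height in mm.

def quality_index_bitonal(dpi: int, char_height_mm: float) -> float:
    """Approximate QI for bitonal scanning (on the microfilm-derived scale,
    roughly: 3.6 marginal, 5.0 good, 8.0 excellent -- assumed thresholds)."""
    return 0.039 * dpi * char_height_mm / 3

def required_dpi_bitonal(target_qi: float, char_height_mm: float) -> int:
    """Invert the formula to find the resolution needed for a target QI."""
    return round(target_qi * 3 / (0.039 * char_height_mm))

# The Farm Management case: the smallest significant type is 0.9 mm.
qi_at_600 = quality_index_bitonal(600, 0.9)    # about 7.0
dpi_for_good = required_dpi_bitonal(5.0, 0.9)  # about 427 dpi for QI 5
print(f"QI at 600 dpi: {qi_at_600:.1f}")
print(f"dpi needed for QI 5: {dpi_for_good}")
```

    On these assumptions, 600 dpi comfortably exceeds the "good" threshold for 0.9 mm characters, which is consistent with the decision to peg capture of the whole volume to its most demanding pages.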

    For comparison purposes, we scanned two representative pages -- one pure text, and one containing text and a halftone -- at various resolutions to determine the tradeoffs amongst file size, image quality, and functionality. We evaluated image quality on screen and in print, created derivatives for network access, and ran the images through two OCR programs to generate searchable text.

    Table 2. Tradeoffs in the Digital Conversion of a Text Page

    scanning          uncompressed   compressed   visual assessment     OCR
    specification     file size      file size    of quality            result*

    300 dpi, 1-bit    380 Kb         31 Kb        legibility achieved   33 errors
    600 dpi, 1-bit    1.5 Mb         61 Kb        fidelity achieved     15 errors

    *We used two OCR programs for this case study (Xerox Textbridge 2.01, and Calera WordScan 3.1); the error count refers to word errors.

    Table 2 compares the capture of the text page at 300 and 600 dpi, resolutions in common use for flatbed scanning today. The smallest significant characters on the page measure 0.9 mm, and the numbers listed in two tables are 1.6 mm in height (see Example 1). Note that although the file size for the uncompressed 600 dpi image is four times greater than the 300 dpi version, the 600 dpi Group 4 compressed file is only twice as large. There was no observable difference in the scanning times on our Xerox XDOD scanner between the two resolutions, and we would expect similar throughput rates from 600 and 300 dpi bitonal scanning.[10] Rapid improvements in computing and storage, as well as significant developments in scanning technology in recent years, have been closing the cost gap between high-quality and medium-quality digital image capture.
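    The file sizes in Table 2 follow from simple arithmetic: an uncompressed 1-bit image holds dpi-squared pixels per square inch, eight to the byte, so doubling the resolution quadruples the raw file. A minimal sketch, assuming a hypothetical 5" x 7.5" page (the source does not state the page dimensions):

```python
# Sketch of the file-size arithmetic behind Table 2. Uncompressed bitonal
# (1-bit) file size grows with the square of resolution:
#   bytes = (width_in * dpi) * (height_in * dpi) / 8
# The page dimensions below are a hypothetical example, not measurements
# of the Farm Management volume.

def bitonal_size_bytes(width_in: float, height_in: float, dpi: int) -> int:
    pixels = (width_in * dpi) * (height_in * dpi)
    return int(pixels / 8)  # 8 pixels per byte at 1 bit per pixel

w, h = 5.0, 7.5  # assumed page size in inches
size_300 = bitonal_size_bytes(w, h, 300)
size_600 = bitonal_size_bytes(w, h, 600)

# Doubling the resolution quadruples the uncompressed size ...
assert size_600 == 4 * size_300
# ... yet Table 2 shows the Group 4 *compressed* 600 dpi file is only about
# twice the 300 dpi file (61 Kb vs. 31 Kb): compression narrows the gap.
print(f"300 dpi: {size_300 / 1024:.0f} Kb, 600 dpi: {size_600 / 1024:.0f} Kb")
```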

    Developments in image processing are also moving forward, but at a slower rate. The OCR programs we used for our examples are optimized to process 300-400 dpi bitonal images, with type sizes ranging from 6-72 points. It is striking that although the 300 dpi file produces a legible print (see the postscript file in Example 1), the OCR error rate was more than double that of the 600 dpi file. In the case of the 600 dpi image, we noted that the majority of errors were generated on the 0.9 mm text, but not on the numeric characters contained in the chart. We concluded, therefore, that the file was rich enough to be processed, and that the errors were attributable to the OCR programs' inability to "read" the font size.

    Visual distinctions between fidelity (full capture) to the original and legibility (fully readable) can be subtle. The sample portions selected for Example 1 reveal the quality differences between the two scanning resolutions. Note that in the 600 dpi version, character details have been faithfully rendered. The italicized word "Bulletin" successfully replicates the flawed "i" of the original; the crossbar in the "e" in "carbohydrates" is complete; and the "s" is broken as in the original. In contrast, the 300 dpi version provided legible text, but not characters faithful to the original. Note the pronounced jaggies of "i" in "Bulletin," the incomplete rendering of the letter "e," and the filled in "s" in "carbohydrates."

    Table 3. Tradeoffs in the Digital Conversion of a Mixed Page: Text and Halftone

    scanning                         uncompressed   compressed   visual assessment    OCR
    specification                    file size      file size    of quality           result

    300 dpi, 1-bit, descreened       380 Kb         53 Kb                             18 errors, 94.1%
    600 dpi, 1-bit, text setting     1.5 Mb         77 Kb        fidelity (text)      2 errors, 99.3%
    600 dpi, 1-bit, descreened       1.5 Mb         127 Kb       fidelity (text/ht)   2 errors, 99.3%
    900 dpi, 1-bit, text setting     3.35 Mb        155 Kb       fidelity (text)      could not OCR*
    1,200 dpi, 1-bit, text setting   5.96 Mb        215 Kb       fidelity (text)      could not OCR
    300 dpi, 8-bit grayscale         2.98 Mb        606 Kb       fidelity (text/ht)   5 errors, 98.4%

    *Neither OCR program used could accommodate images with resolutions above 600 dpi; only WordScan could process the grayscale image.

    For the mixed page, two scanning methods captured both the text and the halftone with fidelity. The first, 600 dpi bitonal with a descreening filter, resulted in a lossless compressed file of 127 Kb, while the 300 dpi 8-bit version with JPEG lossy compression produced a file almost five times as large. Scanning times between the two varied considerably, with grayscale capture taking four times longer. Depending on equipment used, higher grayscale scanning speeds are possible, but we know of no vendor offering competitive pricing for 300 dpi 8-bit and 600 dpi 1-bit image capture.

    For this page, the body text measures 1.7 mm, and the caption text is 1.4 mm. We found that both the 600 dpi bitonal and the 300 dpi 8-bit versions could produce single-pass OCR text files that would lend themselves to full information retrieval applications. On the other hand, the 300 dpi 1-bit version resulted in an accuracy rate of 94.1%, slightly below the 95% rate identified by the National Agricultural Library and others as the threshold for cost effectiveness when accuracy is required.[11]
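    The accuracy percentages in Table 3 are consistent with a word-level error rate, accuracy = 1 - (word errors / total words). A minimal sketch of that arithmetic; the total of roughly 305 words on the sample page is inferred from the published percentages, not stated in the source:

```python
# Sketch of the OCR-accuracy arithmetic behind Table 3, assuming word-level
# accuracy: accuracy = 1 - (word errors / total words). TOTAL_WORDS is an
# inferred value that reproduces the published percentages.

def ocr_accuracy(errors: int, total_words: int) -> float:
    return 100.0 * (1 - errors / total_words)

TOTAL_WORDS = 305  # inferred, not stated in the source
for errors, reported in [(18, 94.1), (2, 99.3), (5, 98.4)]:
    # Each computed rate rounds to the figure reported in Table 3.
    assert round(ocr_accuracy(errors, TOTAL_WORDS), 1) == reported

# The cost-effectiveness threshold cited from the National Agricultural
# Library: below 95% accuracy, manual correction outweighs the benefit.
def cost_effective(errors: int, total_words: int, threshold: float = 95.0) -> bool:
    return ocr_accuracy(errors, total_words) >= threshold

assert not cost_effective(18, TOTAL_WORDS)  # 94.1% falls short
assert cost_effective(2, TOTAL_WORDS)       # 99.3% clears the threshold
```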

    The difference in visual quality among the digital files is most evident in the halftone rendering on this page. (See Example 2, including the postscript file.) Without descreening, resolution alone could not eliminate moiré patterns or replicate the simulated range of tones present in the original halftone. Had the curator determined that illustrations were not significant, we could have saved approximately 19% in storage costs for these pages by scanning at 600 dpi with no enhancement, and thus improved throughput. We have found that sophisticated image enhancement capabilities must be incorporated in bitonal scanners to capture halftones, line engravings, and etchings.

    We also created derivative files from several high-resolution images in a manner analogous to the processes used in Cornell's prototype digital library. These are intended to balance legibility, completeness, and speed of delivery to the user. Examples 2a-2c show the impact on legibility for body text, caption, and halftone. Each image is approximately 24 Kb and would be delivered to the user at the same speed. With respect to image quality, the text in all three cases is comparable, but the halftone derived from the richer 600 dpi bitonal scan is superior to either the 300 dpi descreened image or the unenhanced 600 dpi version.[12] As network bandwidths, compression processes, and monitor resolutions improve, we would expect to create higher resolution images, which could also be derived from the 600 dpi image. In the meantime, the quality of the 100 dpi derivatives has been judged by users sufficient to support on-screen browsing and minimize print requirements.
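    One common way to derive a low-resolution screen image from a high-resolution bitonal master is "scaling to gray": averaging each block of black/white pixels into a single grayscale value, which smooths character edges. The sketch below illustrates that technique with a 6:1 reduction (600 dpi to 100 dpi); it is a stand-in illustration, not the University of Michigan TIF2GIF utility cited in the text.

```python
# Sketch of deriving a 100 dpi screen image from a 600 dpi bitonal master
# by scaling to gray: each 6x6 block of 1-bit pixels is averaged into one
# grayscale pixel. This is an illustrative stand-in, not TIF2GIF.

def scale_to_gray(bitmap, factor=6):
    """bitmap: list of rows of 0 (white) / 1 (black) pixels, with dimensions
    divisible by factor. Returns rows of 0-255 grayscale values."""
    out = []
    for y in range(0, len(bitmap), factor):
        row = []
        for x in range(0, len(bitmap[0]), factor):
            block = [bitmap[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            ink = sum(block) / (factor * factor)  # fraction of black pixels
            row.append(round(255 * (1 - ink)))    # 255 = white, 0 = black
        out.append(row)
    return out

# A fully black 6x6 block becomes one black pixel; half ink coverage
# becomes mid-gray, approximating the tone the bitonal master can only
# simulate with dot patterns.
solid = [[1] * 6 for _ in range(6)]
assert scale_to_gray(solid) == [[0]]
half = [[1] * 3 + [0] * 3 for _ in range(6)]
assert scale_to_gray(half) == [[128]]
```

    This is one reason the derivative from the richer 600 dpi master looks better than one from a 300 dpi scan: each output pixel averages four times as many samples of the original page.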


    Costs and Benefits

    There are compelling preservation, access, and economic reasons for creating a digital master in which all significant information contained in the source document is fully represented. The most obvious argument for full informational capture can be made in the name of preservation. Under some circumstances, the digital image can serve as a replacement for the original, as is the case with brittle books or in office backfile conversion projects. In these cases, the digital image must fully represent all significant information contained in the original as the image itself becomes the source document and must satisfy all research, legal, and fiscal requirements. If the digital image is to serve as a surrogate for the original (which can then be stored away under proper environmental controls), the image must be rich enough to reduce or eliminate users' needs to view the original.

    It may seem ironic, however, that an equally strong case can be made for the access and economic advantages of full informational capture. As Nancy Gwinn has recently stated in an article about funding digital conversion projects, "Everything libraries do in the digital world will be more visible to more people."[13] If observers of digital library development are correct that use will (and must) increase with enhanced access, then the strengths and weaknesses of digital conversion will become readily apparent -- to librarians, users, and funders. If access in the digital world emphasizes ease of use over physical ownership, then present and future users must have confidence that the information they receive is authentic. Their needs will not likely be served by a static version of a digital image, and it is natural to anticipate that many will soon prefer alternative formats such as PDFs derived from the digital images. Digital masters faithful to originals should be used to create these multiple images and formats because the quality of any derivative can be no better than that of the master from which it is made; because the expense of selecting, retrieving, and handling the source documents need be incurred only once; and because new formats and higher-quality derivatives can be generated as technologies improve, without returning to the originals.

    In an ideal world, we would be able to create digital masters from hard copy sources without regard to cost, and to produce multiple derivatives tailored to the individual user upon request. In the real world, of course, costs must be taken into account when selecting, converting, and making accessible digital collections. Those looking for immediate cost recovery in choosing digital over hard copy or analog conversion will be disappointed to learn that preliminary findings indicate that these costs can be staggering. Outside funding may be available to initiate such efforts, but the investments for continuing systematic conversion of collections to develop a critical mass of retrospective digitized material, electronic access requirements, and long-term maintenance of digital libraries will fall to institutions both individually and collectively. These costs will be sustainable only if the benefits accrued are measurable, considerable, and enduring.

    Table 4 compares the average time per volume spent to reformat brittle books via photocopy, microfilm, and digital imaging from paper or microfilm.[14] The total costs for any of these reformatting processes are comparable. What is notable are the shifts in time spent in activities before, during, and after actual conversion. Only in the case of scanning microfilm do we observe that the actual conversion times fall below half of the total. In this case, many of the pre-conversion activities were subsumed in the process of creating the original microfilm. Perhaps more significantly, we see an increasing percentage of time spent in post-conversion activities associated with digital reformatting efforts. These increases reflect time spent in creating digital objects from individual image files, providing a logical order and sequence, as well as minimal access points. These steps are analogous to the binding of photocopy pages and to the processing of microfilm images onto reels of film.

    Table 4. (Mean) Average Times Spent on Reformatting

    medium for reformatting:

    • Preservation photocopy
    • Microfilm (RLG median times)
    • Digital images (CLASS)
    • Digital images (COM Project)
    • Digital images (Project Open Book)
    Table 4 does not translate times to costs, but the times are indicative of the range of labor costs associated with reformatting projects. The actual costs of retrospective conversion will vary widely, according to the condition, formats, content, and volume of the original collections; the choice of scanning technologies; scanning in-house versus outsourcing; the level of metadata needed to provide basic access; and the range of searching/processing functions to be supported following conversion. Despite these differences, it seems clear that the costs of creating high-quality images that will endure over time will be less than the costs associated with creating lower-quality images that fail to meet long-term needs. As Michael Lesk has noted, "since the primary scanning cost is not the machine, but the labor of handling the paper, using a better quality scan . . . is not likely to increase a total project cost very much."[15] Should the paper handling capabilities of overhead and flatbed scanners improve, we would expect to see high-resolution conversion times decrease further, thereby allowing managers to shift funds to accommodate the greater percentage of post-conversion activities.
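    Lesk's observation can be made concrete with a toy cost model in which paper handling dominates the per-volume total; every figure below is hypothetical, chosen only to illustrate the shape of the argument, not drawn from the projects discussed.

```python
# Toy per-volume cost model illustrating Lesk's point: when the labor of
# handling the paper dominates, a higher-quality scan barely moves the
# total. All dollar figures are hypothetical.

def per_volume_cost(handling: float, scanning: float, post: float) -> float:
    """handling: retrieval, preparation, page turning; scanning: capture
    itself; post: structuring and minimal indexing of the image files."""
    return handling + scanning + post

baseline = per_volume_cost(handling=30.0, scanning=8.0, post=12.0)
high_quality = per_volume_cost(handling=30.0, scanning=11.0, post=12.0)

# The scan step costs ~38% more, yet the per-volume total rises only ~6%.
increase = (high_quality - baseline) / baseline
assert increase < 0.07
print(f"total cost increase: {increase:.1%}")
```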

    Although the costs associated with reformatting and basic access will be high, these digital conversion efforts will pay off when the digital collections serve many users, and institutions develop mechanisms for sharing responsibility for distribution and archiving. Given these considerations, creating images that reflect full informational capture may prove to be the best means to ensure flexibility to meet user needs, and to preserve historically valuable materials in an increasingly digital world.

    References


    [1] See, for example, Michael Lesk, Substituting Images for Books: The Economics for Libraries, January 1996; and Task Force on Archiving of Digital Information, Preserving Digital Information (final report), commissioned by the Commission on Preservation and Access and The Research Libraries Group, Inc., June 1996. In the Yale cost model, projected cost recovery rates in the digital library assume that demand will increase by 33% because access to electronic information is "easier" and more timely (Task Force).

    [2] Draft Statement of Principles on the Preservation of and Long-Term Access to Australian Digital Objects, 1995.

    [3] Anne R. Kenney and Stephen Chapman, Digital Imaging for Libraries and Archives, Ithaca, NY: Cornell University Library, June 1996, iii.

    [4] Reformatting via digital or analog techniques presumes that the informational content of a document can somehow be captured and presented in another format. Obviously for items of intrinsic or artifactual value, a copy can only serve as a surrogate, not as a replacement.

    [5] Michael Ester, "Issues in the Use of Electronic Images for Scholarship in the Arts and the Humanities," Networking in the Humanities, ed. Stephanie Kenna and Seamus Ross, London: Bowker Saur, 1995, p. 114.

    [6] Geoffrey Nunberg, "The Digital Library: A Xerox PARC Perspective," Source Book on Digital Libraries, Version 1.0, ed. Edward A. Fox, p. 82.

    [7] Cornell has been developing a benchmarking approach to determining resolution requirements for the creation and presentation of digital images. This approach is documented in Kenney and Chapman, Tutorial: Digital Resolution Requirements for Replacing Text-Based Material: Methods for Benchmarking Image Quality, Washington, DC: Commission on Preservation and Access, April 1995, and also in Digital Imaging for Libraries and Archives.

    [8] For more information on assessing significant information in printed materials, see Anne R. Kenney, with Michael A. Friedman and Sue A. Poucher, Preserving Archival Material Through Digital Technology: A Cooperative Demonstration Project, Cornell University Library, 1993, Appendix VI; for assessment of photographs, see James Reilly and Franziska Frey, Recommendations for the Evaluation of Digital Images Produced from Photographic, Microphotographic, and Various Paper Formats, report presented to the National Digital Library Program, June 1996, and Franziska Frey, "Image Quality Issues for the Digitization of Photographic Collections," IS&T Reporter, Vol. 11, No. 2, June 1996.

    [9] See "Verifying predicted quality," Digital Imaging for Libraries and Archives, pp. 28-31.

    [10] Responses to Cornell and University of Michigan RFPs for imaging services over the past several years indicate that the cost differential between 600 and 300 dpi scanning is narrowing rapidly.

    [11] In their Text Digitization Project, the National Agricultural Library determined that using OCR programs to generate searchable text for their image database required minimum accuracy rates of 95% to be considered cost-effective (conversation with Judi Zidar, NAL). Kevin Drum of Kofax, a leading provider of conversion services, argues that "if you cannot get 95% accuracy, the cost of manually correcting OCR errors outweighs the benefit." See Tim Meadows, "OCR Falls Into Place," Imaging Business, Vol. 2, No. 9, September 1996, p. 40.

    [12] Randy Frank, one of the leading figures in the TULIP and JSTOR journal projects, acknowledges that a primary benefit of capturing journal pages at 600 dpi instead of 300 dpi is the quality of the resulting derivative converted on-the-fly using the University of Michigan TIF2GIF conversion utility. Conversation with Anne R. Kenney, March 1996.

    [13] Nancy E. Gwinn, "Mapping the Intersection of Selection and Funding," Selecting Library and Archive Collections for Digital Reformatting: Proceedings from an RLG Symposium, Mountain View: The Research Libraries Group, Inc., August 1996, 58.

    [14] Figures used to calculate percentages of time for reformatting activities were obtained from the following sources: for preservation photocopy and the CLASS project, see Anne R. Kenney and Lynne K. Personius, The Cornell/Xerox/Commission on Preservation and Access Joint Study in Digital Preservation, Commission on Preservation and Access, 1992, p. 41-43; for microfilm, see Patricia A. McClung, "Costs Associated with Preservation Microfilming: Results of the Research Libraries Group Study," Library Resources & Technical Services, October/December 1986, p. 369; for the Cornell Digital-to-COM Project, project files from time studies; and for Project Open Book, see Paul Conway, Yale University Library's Project Open Book: Preliminary Research Findings, D-Lib Magazine, February 1996. Because actual times were not available for title identification, retrieval, circulation, searching, and curatorial review (common pre-conversion activities), these were drawn from the RLG study (McClung, Table 1) for comparative purposes. Yale and Cornell will issue a joint report later this year detailing the times and costs associated with the full range of activities in the hybrid approach of creating preservation quality microfilm and digital masters. Note that this table does not include times for queuing, cataloging, or any activities related to delivering information to the user, only times associated with activities from selection to storage of archival masters that have been minimally ordered and structured.

    [15] Michael Lesk, Substituting Images for Books: The Economics for Libraries.


    Copyright © 1996 Stephen Chapman and Anne R. Kenney

    Editorial corrections at the request of the authors, November 1, 1996.
