D-Lib Magazine
November 2004

Volume 10 Number 11

ISSN 1082-9873

Assessing the Durability of Formats in a Digital Preservation Environment

The INFORM Methodology

 

Andreas Stanescu
Software Architect
Digital Collections and Preservation Services
OCLC Online Computer Library Center, Inc.
Dublin, OH
<andreas_stanescu@oclc.org>

Introduction

There are two necessary components in any measuring system—the units of measurement themselves and the process of applying those units of measurement [1]. For example, two easily understood measuring systems are the ones used for distance measurement: the English system and the metric system. Both can be used interchangeably to measure the distance between point A and point B. The process of applying the units of measurement can be thought of as finding the path from point A to point B and counting the number of measurement units along that path. Then, if needed, a method of converting from one measurement system to another can be used to communicate the result to different audiences.

Imagine, for a moment, how difficult it was for people to communicate before the meter and the yard were invented. Two people looking at the same mountain could have two very different opinions on how far the mountain was. Moreover, they could not communicate that knowledge to a third person not present at the scene since they had none of the words necessary to encode that information. Once the English system and the metric system were invented and conversions between the two were established, people across the world could receive the raw measurement between point A and point B and use that information to make individual decisions based on personal preferences.

Similarly, digital preservation is in dire need of measurement and communication tools, including for the aspect of digital preservation discussed in this article: measuring the preservation durability of digital formats. People in charge of preserving our national heritage or public records must make decisions today that will have impacts well into the future. It is cheaper and safer to analyze and compare potential actions before they are taken, and a poor preservation decision made today can lead to content loss or to expensive salvaging efforts later. Yet, until now there have been no objective means to identify which digital format is most apt to withstand the passage of time. There has been no way to compare two format specifications. There has been no scale to measure the preservation durability of format specifications. Finally, there have been no metrics to communicate the measurements to a wide audience, recognizing differences in awareness, expertise, language and interests.

This article presents a methodology for measuring the preservation durability of digital formats. The methodology gives digital preservation archivists the tools to measure and identify formats suited for preservation, but it lets individuals apply their own interpretation of the results when making their ultimate preservation decisions. The author hopes that institutions embracing preservation will use this methodology and contribute their expertise to enhance it over time. Preservation durability does not have as clear a definition as the meter [2], but the author hopes that the method described here will be the first step towards creating a useful definition for it.

Digital Preservation with Respect to Formats Used

The mission of digital preservation is to preserve access to mountains of journals, articles, research papers, photographs, maps, public records and many other published works that are available only in digital form. Many of us have found boxes of 5 1/4 inch floppy disks or old backup tapes that are either unreadable by modern drives or unintelligible to modern software programs. Some of the most famous examples of digital records almost lost are the digital version of the Domesday Book and the data from the 1976 Viking landings on Mars [3], but one wonders how many other, less famous examples exist [4].

Current conditions

Research shows [5] that longevity issues for digital media are two-fold: on one hand, the media (e.g., CD-ROM, tape, disk) must withstand the passage of time; on the other hand, even if the media can be read by modern drives, the information must be stored in formats (e.g., PDF, TIFF, JPEG) that can be understood by modern programs. A well-known database of formats [6] lists about 1000 digital formats and interim versions, many of which have not been used in more than a decade. Given the speed at which formats come and go, how can modern librarians and archivists identify those formats most apt to survive the passage of time?

Some digital archives have decided to automatically convert all PDFs to page-by-page TIFF images [7], related by a METS-based [8] XML document. Others have developed new, specialized formats and conversion processes to address preservation concerns [9].

At OCLC, a method has been developed for analyzing digital formats and quantifying preservation actions before they are undertaken. This method compares formats and preservation approaches and tracks their potential loss over time, providing an objective basis for evaluating preservation decisions. The INFORM methodology offers one possible way of identifying the formats most apt to survive and of tracking their inevitable march to obsolescence.

Two other prominent projects in the digital preservation domain have recently unveiled their findings.

  • The Virtual Remote Control project [10] attempts to discover how specific web resources (documents and web sites) change over time in order to anticipate their potential disappearance. While similar in intent, the INFORM method presented here analyzes digital formats rather than specific web resources.
  • Koninklijke Bibliotheek's e-Depot defines a Preservation Subsystem that ensures access to the digital object over time, but it doesn't discuss how one decides to convert PDF files to JPEGs [11]. In this context, the INFORM methodology could be used to feed risk information into the decision process.

The INFORM Methodology

INFORM (INvestigation of FOrmats based on Risk Management) is a methodology developed at OCLC for investigating and measuring the risk factors of digital formats and providing guidelines for preservation action plans. In other words, INFORM attempts to discover specific threats to preservation and measure their possible impact on preservation decisions. Moreover, by repeating the process, one can detect changes in the threat model over time and act on them accordingly.

A comprehensive approach to digital format assessment must include the following considerations: (1) risk assessment; (2) significant properties of the format under consideration; (3) the features of the format as defined in the format specification. The method of incorporating the latter two aspects in a preservation decision will be detailed at a later time.

There are six classes of risk that must be assessed:

  1. Digital object format - risks introduced by the specification itself, but also including compression algorithms, proprietary (closed) vs. open formats, DRM (copy protection), encryption, digital signatures.
  2. Software - risks introduced by necessary software components such as operating systems, applications, library dependencies, archive implementations, migration programs, implementations of compression algorithms, encryption and digital signatures.
  3. Hardware - risks introduced by necessary hardware components including type of media (CD, DVD, magnetic disk, tape, WORM), CPU, I/O cards, peripherals.
  4. Associated organizations - risks related to the organizations supporting in some fashion the classes identified above, including the archive, beneficiary community, content owners, vendors, open source community.
  5. Digital archive - risks introduced by the digital archive itself (i.e., architecture, processes, organizational structures).
  6. Migration and derivative-based preservation plans - risks introduced by the migration process itself, not covered in any other category.

It is very important to analyze the digital format specification in the proper context. Each specification has some dependencies on software and hardware, some more than others. The INFORM method described here takes the guesswork out of the decision-making process and records the logic behind why, for example, TIFF would be a preferred choice. The documentation resulting from this methodology becomes part of the record of the preservation actions undertaken over time.

Sometimes, specific software or hardware is required to render an object in a given digital format. Newer technologies—such as DRM (digital rights management), encryption and digital signatures—rely on proprietary, secret software (often protected by law) and hardware components. Other times, the specification may have lost its currency among commercial and open source developers, with only one or two products still supporting the aging format. In this case, the fact that its software and community dependencies are unique in their respective classes poses significant risk to the format's longevity.

Specific risks have been identified for each of the classes listed above. In general, they are:

  • Whether or not royalties or license fees are or may be requested;
  • Whether the source or specification can be independently inspected;
  • Whether revisions have maintained support for backward compatibility;
  • Whether it is complex or poorly documented;
  • Whether it is widely accepted or simply a niche format;
  • Whether competing or similar formats or components exist;
  • Whether embedded metadata can be mapped to other formats;
  • Whether DRM, encryption or digital signatures can be used;
  • Whether applicable expertise can be easily found;
  • Whether revisions happen so fast that the archive cannot keep up with demand;
  • Whether extensions, such as executable sections or narrowly supported features, can be added;
  • Whether authenticity can be easily compromised during transformations, either accidentally or maliciously; and
  • Whether the associated organization or community is too small, in danger of collapsing, unique in its class or not easily replaceable.

Each risk factor is assessed in terms of the probability of its occurring and, if it occurs, the impact it would have in terms of damage to the objects stored in this format. Probability is measured by a 5-point scale, going from very low chance (equivalent to less than a 1% chance) to very high chance (equivalent to more than 26%). Impact is measured by another 5-point scale, going from minor (no, or insignificant, data loss) to catastrophic (unavoidable complete data loss) [12]. The result is called the risk exposure, and it is the combination of the two inputs: probability and impact. For example, a risk assessment of 1D (where 1 represents probability and D represents impact) would yield a risk exposure of "very low probability of unavoidable partial data loss".
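The way a two-character assessment such as "1D" expands into a risk exposure can be sketched in a few lines of Python. This is only an illustration, not part of the published methodology: the article gives the endpoints of each scale and the meaning of impact level D, so the middle probability labels, the placeholder text for impact levels B and C, and the function name `describe_exposure` are all assumptions made for this example.

```python
# Probability levels 1-5. Only "very low" and "very high" come from the
# article; the three middle labels are assumed for illustration.
PROBABILITY_LABELS = {
    1: "very low",
    2: "low",        # assumed label
    3: "medium",     # assumed label
    4: "high",       # assumed label
    5: "very high",
}

# Impact levels A-E. Levels A, D and E are described in the article;
# B and C are placeholders because the article does not describe them.
IMPACT_LABELS = {
    "A": "no, or insignificant, data loss",
    "B": "impact level B",  # placeholder
    "C": "impact level C",  # placeholder
    "D": "unavoidable partial data loss",
    "E": "unavoidable complete data loss",
}


def describe_exposure(code: str) -> str:
    """Expand an assessment such as '1D' into a readable risk exposure."""
    if len(code) != 2 or not code[0].isdigit():
        raise ValueError(f"expected a digit followed by a letter, got {code!r}")
    probability = int(code[0])
    impact = code[1].upper()
    if probability not in PROBABILITY_LABELS or impact not in IMPACT_LABELS:
        raise ValueError(f"assessment out of range: {code!r}")
    return f"{PROBABILITY_LABELS[probability]} probability of {IMPACT_LABELS[impact]}"


print(describe_exposure("1D"))
# prints: very low probability of unavoidable partial data loss
```

Keeping the two scales as separate lookup tables mirrors the text: probability and impact are assessed independently and only combined when the exposure is reported.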

The measuring process

Applying the units of measurement, in this case the risk factors, is akin to finding a path from point A to point B and then counting how many units fit on that path. But in the case of measuring how effectively a particular digital format can be preserved, the measurement process isn't as simple as counting meters.

With the INFORM methodology, an archive first defines the preservation access system for the digital format to be measured. This consists of the definition of the software and hardware components needed to render objects saved in that format, collectively known as the rendering platform. The platform selected must accommodate multiple formats, not only to reduce costs through reuse, but also to allow container formats, such as office documents or HTML pages, to gather all of their dependencies on the same rendering platform.

Once the components of the rendering platform are known, all dependent associated organizations are identified and described. This step is important since changes in organizational structure can affect the viability of the digital format in the long term.

The next step is to send template questionnaires to reviewers. This group of people may consist of computer scientists, format experts, librarians, hardware specialists, lawyers, open source developers, archivists, authors, etc. The exact composition of the reviewer group, however, is a matter of availability, time, money and other constraints.

Besides rating the probability and impact of each risk factor, the reviewers are asked to document the experience on which their opinions are based and to justify those opinions. The larger the group, the less biased the overall results will be; the more diverse the group, the more trustworthy the results; and the more informed and experienced the group, the more reliable the results. The power of this method is in its scalability: it allows gathering whatever measurements are possible now and improving them as time passes and more resources become available.

Collating the results

Once reviewers complete the questionnaires, their assessments are split into three categories: 1) risks that probably don't need any immediate action, 2) risks that may require planning and possible action in the near future, and 3) risks that should be investigated immediately. Then, the results are collated and reported in two different formats. These two formats communicate two important items of information: one is the distribution of the risk exposures; the second is a discrete risk value determined by averaging all of the risk exposure results.

The first method of reporting the assessments immediately points out into which category most risks fall. More importantly, though, it clearly identifies how many risks are rated the highest and may need immediate attention.
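The distribution report can be sketched as follows. The article does not say how a risk exposure is mapped onto the three categories, so this example assumes a numeric score (probability level multiplied by impact rank, giving 1 to 25) and two invented thresholds; the names `categorize` and `distribution` and the sample assessments are likewise hypothetical.

```python
from collections import Counter

# Assumed numeric rank for each impact level (A = least severe).
IMPACT_RANK = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

CATEGORIES = (
    "no immediate action",
    "plan for possible action",
    "investigate immediately",
)


def categorize(code: str) -> str:
    """Assign one assessment, e.g. '1D', to one of the three categories."""
    score = int(code[0]) * IMPACT_RANK[code[1]]  # assumed scoring, range 1..25
    if score <= 5:    # assumed threshold
        return CATEGORIES[0]
    if score <= 12:   # assumed threshold
        return CATEGORIES[1]
    return CATEGORIES[2]


def distribution(codes: list[str]) -> dict[str, int]:
    """Count how many assessments fall into each category."""
    counts = Counter(categorize(code) for code in codes)
    return {name: counts.get(name, 0) for name in CATEGORIES}


# Hypothetical reviewer assessments for a single format.
assessments = ["1D", "2B", "3C", "5E", "4D", "1A"]
for name, count in distribution(assessments).items():
    print(f"{name}: {count}")
```

An archive with a lower tolerance for risk would simply lower the two thresholds; the counting logic stays the same.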

The second method uses a simple summarization process to come up with a discrete result that can be used to compare two different formats, or to compare two different assessments at two different points in time, in an attempt to find a trend. This method can objectively determine if, for example, the plan to migrate all JPEG [13] still images to TIFF [14] format is indeed the best course of action at this time. If JPEG compares favorably to TIFF in terms of preservation risks, then migration costs can be postponed until the comparison changes in some meaningful way. It may never do that, but only time will tell.

Stop for a minute and look at this example. It doesn't say that the two formats compare well in terms of features and that JPEG is therefore just as suitable as TIFF. As a matter of fact, the lossy compression algorithm used in JPEG objects is generally considered an impediment to preservation, so when given the choice, TIFF objects should be created instead. However, if a collection already has a large number of JPEG objects, converting them to TIFF will not magically improve the quality of the pictures, so the differences in the feature set cannot be used in the migration decision-making process.
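The second report, a single averaged risk value used to compare two formats, might look like the sketch below. It rests on assumptions the article does not spell out: each exposure is scored as probability level times impact rank, "compares favorably" is read as "not higher than the target's value by more than a chosen margin", and the assessments shown are invented and say nothing about any real format.

```python
# Assumed numeric rank for each impact level (A = least severe).
IMPACT_RANK = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}


def exposure_score(code: str) -> int:
    """Assumed scoring of one assessment such as '2B': probability x impact."""
    return int(code[0]) * IMPACT_RANK[code[1]]


def risk_value(codes: list[str]) -> float:
    """Average all exposure scores into one discrete risk value."""
    if not codes:
        raise ValueError("at least one assessment is required")
    return sum(exposure_score(code) for code in codes) / len(codes)


def can_postpone_migration(current: list[str], target: list[str],
                           margin: float = 1.0) -> bool:
    """True when the current format is not meaningfully riskier than the target.

    The margin expresses the archivist's own tolerance; 1.0 is an arbitrary
    default chosen for this example.
    """
    return risk_value(current) <= risk_value(target) + margin


# Hypothetical assessments for the format already held and the migration target.
current_format = ["2B", "1D", "3B"]
target_format = ["1B", "1D", "2C"]

print(f"current format: {risk_value(current_format):.2f}")
print(f"target format:  {risk_value(target_format):.2f}")
print("postpone migration:", can_postpone_migration(current_format, target_format))
```

Because the comparison uses only risk values, it deliberately ignores feature differences between the two formats, which is the point made in the paragraph above.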

Using the results in preservation decisions

The results of the measuring process described above can be used to make preservation decisions. However, each decision is unique to the archivist making it, because it is filtered through the archivist's own aversion to risk or through requirements placed upon the archivist by the collection owners. This is very similar to investing money: the broker has a method for reporting investment risk to investors, but exactly how risky an investment is lies in the eye of the beholder. What is risky to one person may not be risky to another. Similarly, a digital format that one person considers obsolete may not seem so to another.

Next steps

Clearly, the process of measuring the preservation durability of digital formats is best done in collaboration with digital archives, global registries, institutional repositories, digital libraries, digital format experts, software and hardware developers and researchers. OCLC hopes to find interested parties in the digital preservation community to share in this process.

Over time, the method and its application can be honed to offer more precise results and to lead to more reliable preservation decisions. The results should be kept in a publicly available format registry to allow widespread participation and collaboration. Lastly, the history of results should point out trends in technologies and allow participants to make informed decisions as early or as late as necessary, by observing the rate of decay of the technologies needed to render a given digital format.
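Observing a trend in the history of results could be as simple as the sketch below, which computes the average yearly change in a format's risk value between its earliest and latest assessments. The data layout (a list of date and risk-value pairs), the function name `yearly_rate_of_change` and the sample values are assumptions for illustration; the article does not prescribe how the rate of decay should be calculated.

```python
from datetime import date


def yearly_rate_of_change(history: list[tuple[date, float]]) -> float:
    """Average change in risk value per year between first and last assessment.

    A positive result means the risk is growing, i.e. the format is decaying.
    """
    if len(history) < 2:
        raise ValueError("at least two dated assessments are required")
    ordered = sorted(history)  # sort by assessment date
    first_day, first_value = ordered[0]
    last_day, last_value = ordered[-1]
    years = (last_day - first_day).days / 365.25
    if years <= 0:
        raise ValueError("assessments must be made on different dates")
    return (last_value - first_value) / years


# Hypothetical assessment history for one format.
history = [
    (date(2004, 11, 1), 4.0),
    (date(2005, 11, 1), 4.8),
    (date(2006, 11, 1), 6.0),
]
print(f"risk value changes by {yearly_rate_of_change(history):+.2f} per year")
```

A registry holding many such histories could rank formats by this rate to show which ones are approaching obsolescence fastest.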

References

[1] Hille, Helmut. Fundamentals of a Theory of Measurement. Lecture delivered on the Spring-Meeting of the German Physical Society in March 1997, at Ludwig-Maximilians-University, Munich. Available at <http://www.helmut-hille.de/theory.html>.

[2] Unit of length (meter). The NIST Reference on Constants, Units, and Uncertainty. Available at <http://physics.nist.gov/cuu/Units/meter.html>.

[3] Jesdanun, Anick. Digital memory threatened as file formats evolve. 2003. HoustonChronicle.com. Available at <http://www.chron.com/cs/CDA/story.hts/tech/1739675>.

[4] Besser, Howard. Digital longevity. Chapter in Handbook for Digital Projects: A Management Tool for Preservation and Access (pages 155-166). 2000. Andover, MA: Northeast Document Conservation Center. Available at <http://www.gseis.ucla.edu/~howard/Papers/sfs-longevity.html>.

[5] Jackson, Julian. Digital Longevity: the lifespan of digital files. Available at <http://www.dpconline.org/graphics/events/digitallongevity.html>.

[6] Wotsit's format: the programmer's resource. Available at <http://www.wotsit.org/>. Quoted by Lars R. Clausen, "Handling File Formats", Arhus, Denmark. May 2004.

[7] Goethals, Andrea. Action Plan: PDF 1.2. FCLA. May 2003. Available at <http://www.fcla.edu/digitalArchive/pdfs/action_plans/pdf_1_2.pdf>.

[8] METS: Metadata Encoding & Transmission Standard. Official web site available at <http://www.loc.gov/standards/mets/>.

[9] van Wijngaarden, Hilde, and Erik Oltmans. "Digital Preservation and Permanent Access: The UVC for Images." Proceedings of the Imaging Science & Technology Archiving Conference, San Antonio, USA, April 2004. Available at <http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/uvc-ist.pdf>.

[10] Virtual Remote Control. Project home page available at <http://irisresearch.library.cornell.edu/VRC/risk.html>.

[11] Oltmans, Erik, Raymond J. van Diessen, and Hilde van Wijngaarden. "Preservation Functionality in a Digital Archive." Paper presented at the Joint Conference on Digital Libraries, Tucson, Arizona, June 7-11, 2004. Conference home page available at <http://www.jcdl2004.org/>.

[12] Lawrence, Gregory, et al. 2000. Risk Management of Digital Information: A File Format Investigation. Washington, D.C.: Council on Library and Information Resources. Available at <http://www.clir.org/pubs/reports/pub93/contents.html>.

[13] W3C. JPEG JFIF Specification. Feb. 1996. Available at <http://www.w3.org/Graphics/JPEG/>.

[14] Adobe Systems Incorporated. TIFF Revision 6.0. June 1992. Available at <http://partners.adobe.com/asn/developer/pdfs/tn/TIFF6.pdf>.

Copyright © 2004 OCLC Online Computer Library Center, Inc. Used with permission.

doi:10.1045/november2004-stanescu