AIHT: Conceptual Issues from Practical Tests


D-Lib Magazine
December 2005

Volume 11 Number 12

ISSN 1082-9873

AIHT

Conceptual Issues from Practical Tests

Clay Shirky
Interactive Telecommunications Program
New York University
<clay@shirky.com>

Introduction

Ten years after the Web turned every institution into an accidental publisher, the simple difficulties of long-term storage are turning them into accidental archivists as well. For digital preservation to flourish, those institutions must be able to implement preservation tools without having to create them from scratch. The Archive Ingest and Handling Test (AIHT), a project of the National Digital Information Infrastructure and Preservation Program (NDIIPP), was created with the idea that by giving a moderately complex digital archive to a variety of participants, we would be able to better understand which aspects of digital preservation were institution-specific, and which aspects were more general.

The design of the test, as the original Call for Proposals put it, was:

The essence of preservation is institutional commitment. Because the number of individuals and organizations that produce digital material is far larger and growing much faster than the number of institutions committed to preserving such material, any practical preservation strategy will require mechanisms for continuous transfer of content from the wider world into the hands of preserving institutions.

The Archive Ingest and Handling Test (AIHT) is designed to test the feasibility of transferring digital archives in toto from one institution to another. The purpose is to test the stresses involved in the wholesale transfer, ingestion, management, and export of a relatively modest digital archive, whose content and form are both static. The test is designed to assess the process of digital ingest, to document useful practices, to discover which parts of the handling of digital material can be automated, and to identify areas that require further research or development.

The archive in question was a copy of George Mason University's collection of materials gathered after the attacks of September 11th, 2001. The archive itself was small, containing over 57,000 files and totaling roughly 12 gigabytes in size. It was, however, varied enough in its formats and available metadata to provide a moderately complex test: it included a wide mix of formats, though these were weighted heavily toward formats common to email and the Web. Approximately one third of the archive was in ASCII text, either in email or plain text; approximately one third was in HTML; and the remaining files were mainly image files (29%), most of them in GIF or JPEG format, with a few audio files (4%) and video files (0.2%). There were also a number of non-standard file extensions, and a smattering of rarer file types, such as Photoshop (.psd) files. Many of these files were accompanied by little metadata other than what could be derived from direct examination of the filesystem or the files themselves.

The AIHT was overseen by Martha Anderson of the Library of Congress, and was completed earlier this year. What follows are overall observations from the operation of that test, in cooperation with George Mason University (GMU), the donor of the original archive; Harvard University, testing the Harvard Digital Repository Service; Johns Hopkins University, testing both DSpace and Fedora; Old Dominion University, designing "self-archiving objects" based on MPEG21 DIDL; and Stanford University, testing the Stanford Digital Repository.

Startup: The transfer of the digital archive from GMU to the Library of Congress

We had imagined that the startup phase, in which the archive was transferred from GMU to the Library of Congress, would be trivially easy. In fact, we discovered that even seemingly simple events like the transfer of an archive are fraught with low-level problems, problems that are in the main related to differing institutional cultures and expectations. Because we had expected the handoff to be trouble-free, the lessons learned during this phase were particularly surprising.

Identifiers Aren't

There were several sorts of identifiers mixed together in the system – URLs, file names, file extensions, query strings, and digests.¹ None of these identifiers worked perfectly, even for the limited service they had to perform during the transfer. URLs were not easily convertible to file names, because different file systems would silently alter or delete locally disallowed characters. Extensions were both variable (.jpg, .jpeg, .JPEG) and often pointed to unrecreatable context (it is impossible to know the structure of the CGI program if all you have is the resulting output), and in some cases the labels were simply incorrect (GIF files labeled .jpg). Query strings were similarly altered by the receiving file systems and also suffered from the loss of context in the absence of the interpreting program. Digests, which are designed to point uniquely to collections of bits, only referred to atomic uniqueness, not contextual uniqueness.
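The URL-to-file-name problem can be illustrated with a minimal sketch. The set of disallowed characters and the two sample identifiers below are assumptions made for illustration; they are not taken from the GMU archive or from any particular file system used in the test.

```python
import re

# Characters assumed to be disallowed by the receiving file system.
# This set mirrors common Windows restrictions and is only an assumption.
DISALLOWED = re.compile(r'[<>:"/\\|?*]')


def to_local_name(url_path: str) -> str:
    """Mimic a file system that silently deletes disallowed characters."""
    return DISALLOWED.sub("", url_path)


# Two distinct, made-up identifiers as they might appear on a source site.
first = "story.cgi?id=1:2"
second = "story.cgi?id=12"

# Once '?' and ':' are dropped, the two names can no longer be told apart.
collide = to_local_name(first) == to_local_name(second)
print(first != second, collide)
```

Under these assumptions, two different source identifiers map to one local name, so the local name alone cannot be used to recover the original URL.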

The identifiers that come from donors or other external sources must be examined carefully and cannot simply be taken at face value. Furthermore, during ingest, there need to be ways of independently evaluating the information that may be contained in the identifiers, such as determining whether files labeled .jpg are really in JPEG format. One of the participants in the AIHT labeled this GOGI: Garbage Out, Garbage In, an inversion of the Garbage In, Garbage Out maxim of the computer science community. GOGI is not a function of failed diligence by the donor, but rather a result of natural (and inevitable) differences in technical environments and human judgment. As a result, balancing the risk of context loss with the expense of doing a theoretically ideal ingest will be a difficult and continual tradeoff for preserving institutions.
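One way to evaluate an extension independently is to compare it with the leading bytes of the file. The sketch below is not the method used by any AIHT participant; the function names and the small extension table are hypothetical, and only a few well-known image signatures are covered.

```python
from pathlib import Path
from typing import Optional

# Leading "magic" bytes for a few common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"\x89PNG\r\n\x1a\n": "png",
}

# Extension spellings treated as equivalent labels (illustrative, not exhaustive).
EXTENSION_LABELS = {".jpg": "jpeg", ".jpeg": "jpeg", ".gif": "gif", ".png": "png"}


def sniff_format(header: bytes) -> Optional[str]:
    """Guess the format from the first bytes of a file, or None if unknown."""
    for magic, fmt in SIGNATURES.items():
        if header.startswith(magic):
            return fmt
    return None


def label_matches_content(name: str, header: bytes) -> bool:
    """True only when the extension and the sniffed format agree."""
    claimed = EXTENSION_LABELS.get(Path(name).suffix.lower())
    return claimed is not None and claimed == sniff_format(header)


# A JPEG header under an upper-case extension, and a GIF mislabeled as .jpg.
print(label_matches_content("photo.JPEG", b"\xff\xd8\xff\xe0" + b"\x00" * 8))
print(label_matches_content("banner.jpg", b"GIF89a" + b"\x00" * 8))
```

A check like this only confirms the container signature. It says nothing about whether the rest of the file is valid or renderable.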

Phase I – Ingest and Markup

Phase I of the AIHT simply required that each participating institution take possession of a hard drive containing the GMU archive. Once in possession of this drive, they were to do whatever it took to move that material into one or more local stores, including producing the metadata required for their system. Though this phase of the work did not require any formal collaboration between the participants, we discovered that there were a number of issues common to two or more of the participants, often revolving around the difference between the diverse and unspecified data of the GMU archive and the clean data that many of the systems had been built around.

Requirements Aren't

Because the cost and value of digital content are strongly affected by the quality of the metadata (as per the GOGI principle above), several of the participants designed strong policies and requirements, including processes for interviewing the donor, forms to be filled out about the donated objects, and so on. The design of the AIHT made many of those requirements moot. Even at a few seconds of examination per object, ingest of the GMU archive would take weeks.

Declaring that a piece of metadata is required is really an assertion that content without that metadata is not worth preserving, or that the institution will expend whatever resources are necessary to capture any missing but required metadata. In practice, many of these requirements turned out to be unenforceable. The desirability of a digital object will be more closely related to the value of its contents than to the quality of its metadata. It is obvious that many kinds of contents would be much easier to preserve with, say, the creator's view of playback environments attached, but it is much less obvious that any piece of content without such metadata would not be worth acquiring. As a result, many of the proposed required fields turned out to be desired fields. The general move here is from a fixed target – all preserved content will have X fields of metadata – to a flexible one – most metadata is optional, but some kinds of metadata are more important than other kinds.

Donations as Triage

The size of the GMU archive with which we worked is a hundred thousand times smaller than what Google indexes today, a corpus that is itself only a small subset of the total digital world. For a metadata requirement to work, the receiving institution must be able to force the donor to bear the cost of markup and cleanup of the data, or must be both able and willing to refuse to take noncompliant material.

Both of these strategies, shifting cost or refusing delivery, are more appealing in theory than in practice. The essential asymmetry of a donation to a preserving institution is that if donors had the energy and ability to canonicalize the metadata around the donated objects, they would also be equipped to preserve those objects themselves. The corollary is that donors, especially of bulk collections, are likely to force preserving institutions into a choice between accepting imperfectly described data or receiving no data at all.

Designing a particularly high and rigid bar for metadata production for donors may therefore be counterproductive. Instead, preserving institutions should prepare themselves for a kind of triage: In cases where the unique value of the data is low and the capabilities of the donor institution are high, insist on the delivery of clean, well-formatted data; where value is high and capabilities are low, accept the archive, knowing that both cost and loss will be generated in preparing the metadata after ingest; and where value and donor capabilities are both moderate, share the burden of cleaning and marking up the data as equitably as possible.
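The triage rule above can be written down as a small decision function. This is only a sketch: the three-level scale and the names `Level` and `triage` are hypothetical, and the text gives guidance for only three combinations, so every other combination is returned as undecided rather than guessed.

```python
from enum import Enum


class Level(Enum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


def triage(value: Level, donor_capability: Level) -> str:
    """Map the unique value of the data and the donor's capability to a stance."""
    if value is Level.LOW and donor_capability is Level.HIGH:
        return "insist on clean, well-formatted data"
    if value is Level.HIGH and donor_capability is Level.LOW:
        return "accept the archive; prepare metadata after ingest"
    if value is Level.MODERATE and donor_capability is Level.MODERATE:
        return "share the cleanup and markup burden with the donor"
    # The article does not prescribe a stance for the remaining combinations.
    return "no rule given; decide case by case"


print(triage(Level.HIGH, Level.LOW))
```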

Even Small Errors Create Management Problems in Large Archives

The AIHT was designed around an archive that was small in size but varied in format – many file types, various forms of donation or acquisition, and so on. With roughly 60,000 files, a process that correctly handles 99% of cases still generates 600 exceptions – exceptions that, if they require human effort, will require hours of work to handle.

We found such exceptions in almost every operation during the AIHT. They affected the creation of filenames during ingest, the reading of MIME types during inspection of the objects, assessments of the validity of file encodings, and transformation of the objects from one format to another. They affected one-off processes, home-grown tools, and commercial and open-source software. No class of operations was immune.

The math here is simple and onerous: Even a small percentage of exceptions in operations on a large archive can create a large problem, because it inflates staff costs. And, as noted above, the GMU archive is itself an insignificant fraction of the material that potentially can be preserved. The size and complexity of an archive can easily grow by an order of magnitude. This stands in marked contrast to the difficulty of making tools and processes 10 times better. Yet a 99.9% efficient process running on an archive of 600,000 items generates the same 600 exceptions, and so creates exactly the same problems for an institution, as does a 99%/60,000 combination.
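The arithmetic can be checked with a short sketch. The function names are hypothetical, and the figure of five minutes of staff time per exception is an assumption chosen for illustration, not a measurement from the test.

```python
def expected_exceptions(n_items: int, success_rate: float) -> int:
    """Number of items a process with the given success rate fails to handle."""
    return round(n_items * (1.0 - success_rate))


def staff_hours(exceptions: int, minutes_each: float) -> float:
    """Human effort needed if every exception takes `minutes_each` to resolve."""
    return exceptions * minutes_each / 60.0


baseline = expected_exceptions(60_000, 0.99)           # the 99%/60,000 case
grown_same_tool = expected_exceptions(600_000, 0.99)   # archive 10x larger
grown_better_tool = expected_exceptions(600_000, 0.999)  # and tool 10x better

print(baseline, grown_same_tool, grown_better_tool)
print(staff_hours(baseline, minutes_each=5))  # assumed 5 minutes per exception
```

In this sketch, a tenfold growth in the archive has to be matched by a tenfold drop in the failure rate just to keep the exception count, and therefore the staff cost, where it started.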

Since the volume of digital material being produced yearly, in both number of objects and total size, continues to grow dramatically, the core strategy here is again triage: Reduce the exceptions that are easy to fix through the improvement of tools; apply human effort to those exceptions in which fixing one error saves either a large number of files or files of great significance; and be prepared to declare some number of files beyond the current economic threshold of preservation. These files may be assigned to a kind of purgatory where they are not deleted, but neither do they fall inside an institution's radius of commitment for preservation. The value of such a "store without preserving" strategy is that technologies or processes eventually may make the material recoverable at an acceptable cost.

Phase II – Export

The Export phase of the AIHT was in many ways the heart of the test. The key issue being tested was whether or not the GMU archive, once ingested into one of the participants' systems, could be easily shared with another participant. The three broad questions of the test were:

How difficult would it be to package an entire archive for export?
How difficult would it be for a different institution to take in data marked up using someone else's standards?
And how much gain, if any, would there be in such sharing over raw ingest of the data?

Multiple Expressions Both Create and Destroy Value

There are multiple ways to describe and store a given piece of digital data. This is true both at the content level (e.g., canonicalizing all JPEG files to JPEG2000, or storing binary data in Base64) and at the metadata level, where the metadata format, fields, and contents are all variable. These multiple expressions create value by recording content in multiple ways, thus limiting the risk of catastrophic failure, and by allowing different institutions to preserve their own judgment, observations, and so on, thus maximizing the amount and heterogeneity of available context.

However, in several ways multiple expressions also destroy value. At the level of bit storage, altered expression of the content itself will defeat all forms of bit-level comparison, such as the commonly used MD5 digests. Because of the pressure to use simple, well-understood tools, the loss of digest-style comparison will create significant pressure on validation of content held in different preservation regimes, especially when the content is visual or audible in nature and thus not amenable to simple comparison of the rendered product.
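A minimal sketch of how re-expressing content defeats digest comparison, using Base64 as the altered expression. The sample bytes are made up, and the helper name `md5_hex` is hypothetical.

```python
import base64
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


# Made-up content standing in for a small binary file.
original = b"GIF89a" + bytes(range(16))

# The same content, stored in a different expression.
as_base64 = base64.b64encode(original)

# Bit-level comparison fails across expressions...
digests_match = md5_hex(original) == md5_hex(as_base64)

# ...and succeeds only after converting back to the original expression.
round_trip_match = md5_hex(original) == md5_hex(base64.b64decode(as_base64))

print(digests_match, round_trip_match)
```

Base64 is a lossless and reversible encoding, so the comparison can be restored here by decoding. A lossy or one-way change of expression, such as a conversion between image formats, would offer no such route back to the original bits.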

At the level of the digital item (as opposed to the bits used to store that item) multiple expressions also defeat simple translation of semantics. There is not now, and will never be, a single markup standard for digital content, because metadata reflects a worldview. Metadata is not merely data about an object. It is data about an object in a particular context, created by a particular individual or organization. Since organizations differ in outlook, capabilities, and audiences served, the metadata produced by those organizations will necessarily reflect those different contexts. Several metadata fields, such as original file name and format type, will appear in almost any archive, but other fields, reflecting curatorial judgment or notes specific to the local preservation environment, will not be readily translated or used outside their original context. At the archive level, multiple expressions increase the cost of ingesting archives from other institutions, because of the cognitive costs associated with understanding and translating the various forms of data and metadata in the original archive.

There is no perfect solution. The ideal number of forms of expression is more than one, to allow for some variability, but less than one per institution, so that some interoperability is possible. Basic grammatical interoperability would be an enormous victory for digital preservation. The goal should be to reduce, where possible, the number of grammars in use to describe digital data and to maximize overlap, or at least ease of translation, between commonly used fields. But it should not be to create a common superset of all possible metadata.

Phase III – Migration of Format

Phase III tested migration of format, simulating format change over time. Maintaining digital materials for a short term is relatively simple, as the most serious of the playback issues – altered software, operating systems, and hardware for rendering files – do not appear in the short term. Over the long haul, however, merely keeping the same content in the same format actually increases the risk of loss, as the continued alteration of the playback environment may make the content unrenderable even though the underlying bits have been perfectly preserved.

Playback Drift: The Silent Killer

Many technologists studying the problem of digital preservation begin by thinking about the related issue of long-term bit preservation: How, given a string of binary data, can you guarantee its preservation for a hundred years?

Long-term bit storage is a difficult and interesting problem, but preserving the mere digits is not in fact the goal of digital preservation. We have many examples of perfectly stored but difficult to read bits today, such as GIS data commingled with proprietary and undocumented applications written in FORTRAN. This is a nested set of issues: What format is the data written in? What applications can understand or interpret that format? What operating systems can run those applications? What hardware can run those operating systems? Depending on how far into the future you want to project, you can even imagine asking questions such as: What sorts of energy can power that hardware?

This bundle of issues presents the would-be preservationist with a kind of "playback drift," the tendency of a fixed set of binary data to stop functioning or being interpreted in the expected or hoped-for manner, because the complex ecosystem of applications, operating systems, and hardware changes, even though the data may be stored perfectly over decades. Indeed, the better the long-term bit preservation becomes, the greater the danger of playback drift.

Many of the thorniest issues in digital preservation are affected in some way by playback drift, from questions of provenance (is a migrated file "the same" when used as evidence in a court case?) to copyright (is format conversion a violation of the DMCA?). And because playback drift is really a complex of problems, there is no single strategy that will work in all cases. It is critical, however, for institutions that want to preserve data beyond the timeline of a few years to factor playback risk into their calculations.

Tool Behavior Is Variable

When considering the viability of a piece of data in a particular format, there is a two-by-two matrix of possibilities. The first axis is correctness: The data either does or does not conform with some externally published standard. The second axis is rendering: The data either does or does not render in software intended to play that format.

                        Renders in Software                       Does Not Render
Conforms to Standard    Conforms to Standard / Renders            Conforms to Standard / Does Not Render
Does Not Conform        Does Not Conform / Renders                Does Not Conform / Does Not Render

The sad truth is that all four quadrants of that matrix are occupied – in addition to the unsurprising categories of correct/renders and incorrect/doesn't render, there is data that fails to conform to the standard but renders in software, and data that passes the standard but doesn't render. You can see the latter two categories at work on the Web today, where noncompliant HTML is rendered by browsers designed around Postel's Law ("be liberal in the data you accept and rigorous in the data you send out") and where some compliant XHTML pages do not render correctly in some browsers.

It is troubling that the variability extends even to tools intended to do the conformance checking, where tools meant to validate certain formats themselves have variable implementations, failing or even crashing while reading otherwise compliant files. As a result, there is no absolute truth, even in a world of well-defined standards, and institutions will need to determine, on a format-by-format basis, how to define viability – by format, by playback, or both.
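The choice of how to define viability can be made explicit in a small sketch. The names `Check` and `is_viable` and the three policy strings are hypothetical; the two booleans are assumed to come from whatever validator and rendering software an institution trusts for a given format.

```python
from typing import NamedTuple


class Check(NamedTuple):
    conforms: bool  # passes validation against the published standard
    renders: bool   # plays back in software intended for the format


def is_viable(check: Check, policy: str) -> bool:
    """Apply one of three viability definitions: 'format', 'playback', or 'both'."""
    if policy == "format":
        return check.conforms
    if policy == "playback":
        return check.renders
    if policy == "both":
        return check.conforms and check.renders
    raise ValueError(f"unknown policy: {policy!r}")


# Noncompliant HTML that a browser still displays.
tag_soup = Check(conforms=False, renders=True)

for policy in ("format", "playback", "both"):
    print(policy, is_viable(tag_soup, policy))
```

The same file is viable under one definition and not under the others, which is why the policy has to be chosen deliberately for each format.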

Conclusions

Because the AIHT was conceived of as a test that assumed the autonomy of its participants, most of the conclusions from the test are related to specific institutional assumptions, practices, or goals and will be most relevant to institutions that are taking on similar problems in similar contexts. There are, however, two larger conclusions that we believe will be relevant to many institutions undertaking digital preservation as a function or goal:

Data-centric Is Better Than Tool-centric or Process-centric at Large Scale

Because the NDIIPP has always been conceived of as a multi-participant effort, and because we hosted several meetings with interested parties from many different types of institutions, we have never believed that homogenization of institutional methods or goals was either possible or desirable. In particular, we believe that having many different strategies for preservation, ranging from active management to passive spreading of multiple copies, provides the best hedge against unforeseen systemic failure. The "bug" that led to the loss of the ancient Library of Alexandria was a lack of offsite backup.

As a result, we have become convinced that data-centric strategies for shared effort are far more scalable than either tool- or environment-centric strategies. A data-centric strategy assumes that the interaction between institutions will mainly be in the passing of a bundle of data from one place to another – that data will leave its original context and be interpreted in the new context of the receiving institution. Specifying the markup of the data itself removes the need for identical tools to be held by sender and receiver, and the need to have a sender and receiver with the same processes in place for handling data.

By focusing standardization efforts on data, and allowing tools and processes to grow in varied ways around that data, we believe we can maximize the spread of content to varied environments while minimizing the cost of doing so. We also believe that that strategy will be more feasible in the short run – because of cost – and better in the long run – because of variety of strategy – than trying to get all the potential partners in a loose network of preservation to converge on either particular technologies or practices.

Preservation Is an Outcome

Preservation is an outcome. When data remains accessible after a certain period, then it has been preserved; when not, then not. In some cases, data can be preserved without there being anyone who has preservation as their principal goal, as with Google's preservation of much of Usenet's history. It is also true that if material doesn't remain accessible despite the efforts of a preserving institution, as with the stored but now unreadable Landsat satellite data, then it has not been preserved.

Having an institutional commitment to preservation, and backing that up with good staff and good tools, only raises the likelihood of preservation; it does not guarantee it. Casting digital data to the winds lowers the chance that it will be preserved, but it does not mean it will automatically be destroyed. Because the efficacy of preservation can only be assessed after the fact, using a variety of strategies to preserve, even within a single institution, may well be a better strategy than putting all efforts toward one single preservation system.

A final conclusion from the AIHT is that there is a pressing need for continual comparative testing of preservation tools and technologies. Every phase of the AIHT exposed significant low-level issues. Institution-scale digital preservation tools are new and have typically been tested in only a few environmental settings, generally in the environment of their creation. No matter how rigorous these tests may be, at the least such a situation creates the risk of lack of portability.

Continual testing of tools for managing digital content, and publication of the results, will be critical to driving adoption of such tools. This will bring about several obvious benefits: Continual testing is an essential precursor to continual improvement. A steady stream of bug reports and feature requests will be valuable input to the creators and modifiers of such tools. A well-known public test site or network of such sites will create an environment for the creators of preservation tools to assess one another's work, including examining the interfaces such systems must support in any larger federation of preserving institutions. The ability to study the platforms being tested will begin to give potential users of such systems a view into what is available and appropriate in various contexts.

Note

1. The URLs were the unique identifiers as taken from sites spidered or donated from the Web. File (and directory) names were either generated from these URLs, recreating some of the directory structure of the spidered or donated sites, or were generated upon being added to the GMU collection. File extensions came in several types, including labels of content type (e.g., .txt, .doc); instructions to the original webserver (e.g., .cgi, .asp); version numbers for older, undisplayed content (e.g., .jpg.2); and user-generated notation (e.g., .data). Query strings are content appended to a URL after a '?' symbol, and act as additional instructions to a webserver, but in this instance were also preserved as part of the file names during transfer. Digests were MD5 checksums of the content.


doi:10.1045/december2005-shirky
