Digital Preservation File Format Policies of ARL Member Libraries: An Analysis
Kyle Rimkus, Thomas Padilla, Tracy Popp and Greer Martin
Whether overseeing institutional repositories, digital library collections, or digital preservation services, repository managers often establish file format policies intended to extend the longevity of collections under their care. While concerted efforts have been made in the library community to encourage common standards, digital preservation policies regularly vary from one digital library service to another. In the interest of gaining a broad view of contemporary digital preservation practice in North American research libraries, this paper presents the findings of a study of file format policies at Association of Research Libraries (ARL) member institutions. It is intended to present the digital preservation community with an assessment of the level of trust currently placed in common file formats in digital library collections and institutional repositories. Beginning with a summary of file format research to date, the authors describe the research methodology they used to collect and analyze data from the file format policies of ARL Library repositories and digital library services. The paper concludes with a presentation and analysis of findings that explore levels of confidence placed in image, text, audio, video, tabular data, software application, presentation, geospatial, and computer program file formats. The data show that file format policies have evolved little beyond the document and image digitization standards of traditional library reformatting programs, and that current approaches to file format policymaking must evolve to meet the challenges of research libraries' expanding digital repository services.
Nearly twenty years ago, in their seminal publication Preserving Digital Information: Report of the Task Force on Archiving of Digital Information, Waters and Garrett wrote on the important role trusted file formats would soon go on to play in the burgeoning field of digital preservation:
Another migration strategy for digital archives with large, complex, and diverse collections of digital materials is to migrate digital objects from the great multiplicity of formats used to create digital materials to a smaller, more manageable number of standard formats that can still encode the complexity of structure and form of the original (Waters and Garrett, 1996, 28).
Indeed, the identification of such "standard formats" would soon begin to occupy the attention of many information professionals working in digital libraries. Risk Management of Digital Information: A File Format Investigation, for example, details Cornell University Library's efforts, in the late 1990s, to develop file format migration policies based on principles of risk management. While the report's authors note that at the time of the study itself, few cultural memory organizations were even willing to risk endorsing specific file formats (Lawrence, et al., 2000, 1), many of the file formats singled out for preservation purposes during this period, particularly those intended for use in digital reformatting efforts, are still held in high regard today.
As a case in point, the United States National Archives and Records Administration (NARA) published a best practices document in 1998 for its Electronic Access Project to digitize selected archival materials for online access that endorsed the Tagged Image File Format (TIFF) for production master files, a recommendation echoed to this day by many practitioners in the field of digital library imaging (Rieger, 2008). Similarly, the Standard Generalized Markup Language (SGML) and its successor the Extensible Markup Language (XML) began to garner trust in text encoding circles (Cohen and Rosenzweig, n.d.), while Waveform Audio, or the WAVE file format, gained traction for use in digital audio preservation (Bamberger and Brylawski, 2010).
Why, then, are some file formats considered better suited to preservation than others? Open file formats are generally preferred to closed, proprietary formats because the way they encode content is transparent. On the other hand, adoption of a proprietary file format by a broad community of content creators, disseminators, and users is often considered a reliable indicator of that format's longevity. Additional qualities such as complexity, the presence of digital rights management controls, and external dependencies are also seen as relevant factors to consider when assessing file formats for preservation (Rog and van Wijk, 2008, 3-4). There is, however, no failsafe formula for file format policy decisions. While Stanford University prototyped an Empirical Walker to combine machine-automated and human assessments of file formats in use in their own digital preservation repository (Anderson, 2005), and the Online Computer Library Center (OCLC) developed the INFORM methodology to assess the long-term reliability of file formats considered for use in digital preservation environments (Stanescu, 2005), how to weigh the relative value of the preservation qualities of file formats often differs from one institution to another.
These considerations came to the fore with the advent of institutional repositories in the early 2000s. As institutional repository managers sought to strike a balance between lowering barriers to deposit and acquiring content that would stand the test of time, they often expressed their file format policies, in contrast to the prescriptive requirements of digitization guidelines, as recommendations. The implementation of the DSpace institutional repository software platform at its original home institution, the Massachusetts Institute of Technology (MIT), is an excellent example. Its policy differentiates file formats by the categories of "supported," "known," and "unsupported" (MIT Libraries, 2013). Likewise, the Illinois Digital Environment for Access to Learning and Scholarship (IDEALS) at the University of Illinois at Urbana-Champaign categorizes file formats as "highest confidence-full support," "moderate confidence-intermediate support," and "low confidence-basic preservation only" (Illinois Digital Environment for Access to Learning and Scholarship, 2013).
Elaborating the specific terms of preservation services to a designated community of users is a key concept in the Open Archival Information System (OAIS) specification (Consultative Committee for Space Data Systems, 2002) and the framework for a Trusted Digital Repository (TDR) (Consultative Committee for Space Data Systems, 2011) and its predecessor Trusted Repositories Audit & Certification (TRAC) (Dale and Ambacher, 2007). Widespread knowledge of these and similar frameworks and models has spurred the development, in certain quarters, of repository services built expressly for the digital preservation function. Of all digital library services, these repositories generally feature the most carefully conceived file format policies. The Florida Digital Archive (FDA), a digital preservation service available to all libraries affiliated with the state university system in Florida, provides its users with detailed action plans for specific file formats, as well as assessments of high, medium, or low file format confidence levels. In addition, the FDA developed extensive guidelines related to preservation risks such as encryption, password protection, compression, proprietary fonts, and digital rights management controls (Florida Virtual Campus, 2013).
This trajectory would suggest that digital library file format policies have become more expansive over time to meet the changing needs of evolving repository services. But is this borne out by today's digital preservation file format policies in research libraries? This question, informed by the trends in file format development summarized above, furnished the focal point of this study.
By gathering and assessing data on the level of confidence currently placed in file formats by member libraries of the ARL, a non-profit organization of North American academic research libraries whose membership requirements include institutional commitment to sustaining significant research collections, including those in digital format (Association of Research Libraries, 2013), this study seeks to contribute evidence of value to the profession's evolving discussion of best practices in digital preservation. The paper's authors collected data from October 2012-June 2013, and began by identifying a data model to reflect the terms and relationships designated below in bold, and fully defined in Appendix I:
Using a locally developed database designed to reflect the data model described above, the authors proceeded as follows: for each ARL Library in the official list of 125 institutions, they browsed websites to identify every Repository or Digital Library Service with an online presence; next, they browsed the websites for each Repository or Digital Library Service in search of public documentation on File Format Policies; finally, for each accepted File Format identified in a File Format Policy, they assigned a Confidence Level based on the policy's wording: High Confidence for file formats whose encoded content was guaranteed functional preservation, and Medium Confidence for file formats only guaranteed bit-level preservation or designated as acceptable but not preferred (a full explanation of these distinctions is available in Appendix I: Definitions of Terms). For those Repository or Digital Library Services without readily available public documentation, the authors requested information via email from a service manager identified on the library website. This approach afforded the authors a comprehensive view of exactly how much digital preservation policy information institutions are making available on their websites; their findings are summarized below.
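The relationships among these terms can be illustrated with a minimal Python sketch. It is not the authors' database, whose schema is not reproduced here; the class names, field names, the `tally` helper, and the example institution are all hypothetical, chosen only to mirror the terms defined in Appendix I.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class ConfidenceLevel(Enum):
    # The paper uses "Medium" and "Moderate" interchangeably for the second level.
    HIGH = "High Confidence"
    MODERATE = "Moderate Confidence"


@dataclass
class FormatRecommendation:
    file_format: str              # e.g. "TIFF"
    format_type: str              # e.g. "Image"
    confidence: ConfidenceLevel


@dataclass
class FileFormatPolicy:
    source: str                   # hypothetical field: "website" or "email"
    recommendations: List[FormatRecommendation] = field(default_factory=list)


@dataclass
class RepositoryService:
    name: str
    policy: Optional[FileFormatPolicy] = None   # None when no policy was found


@dataclass
class ARLLibrary:
    name: str
    services: List[RepositoryService] = field(default_factory=list)


def tally(libraries: List[ARLLibrary]) -> Dict[str, Counter]:
    """Count High and Moderate recommendations per file format across all policies."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for library in libraries:
        for service in library.services:
            if service.policy is None:
                continue  # services without a policy contribute nothing
            for rec in service.policy.recommendations:
                counts[rec.file_format][rec.confidence] += 1
    return counts


# Hypothetical example data, not drawn from the study.
example = ARLLibrary(
    "Example University Library",
    [
        RepositoryService(
            "Example Institutional Repository",
            FileFormatPolicy(
                "website",
                [
                    FormatRecommendation("TIFF", "Image", ConfidenceLevel.HIGH),
                    FormatRecommendation("RealAudio", "Audio", ConfidenceLevel.MODERATE),
                ],
            ),
        ),
        RepositoryService("Example Digitization Unit"),  # no public policy
    ],
)
counts = tally([example])
```

Nesting each policy under its service, and each service under its library, keeps the per-format counts traceable to the institution and policy they came from.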
A spreadsheet of the file format policy data collected for this paper is available in Appendix II of this paper. The data were drawn from the file format policies of 118, or 47%, of the 253 ARL Repository or Digital Library Services identified by the authors following the methodology described above. They discovered 73 of these File Format Policies on publicly available websites, whereas 45 were provided to them by repository managers in response to direct email queries. 174 file formats appear in these 118 policies. By type, they break down into the categories Application (14), Audio (19), Computer programs (17), Geospatial (6), Image (28), Presentation (10), Spreadsheet/database (28), Text/document (36), Video (15).
The five most commonly occurring file formats in all policies (see Table 1 for more information) are the Tagged Image File Format (extension TIFF, or TIF) (115), the Waveform Audio File Format (WAV) (80), the Portable Document Format (PDF) (74), JPEG (JPG, JPEG) (70), and Plain text document (TXT, ASC) (69). The five most frequently occurring file formats given High Confidence in all policies are the Tagged Image File Format (TIFF, TIF) (88), Plain text document (TXT, ASC) (52), the Portable Document Format (PDF) (49), the Waveform Audio File Format (WAV) (47), and the Extensible Markup Language (XML) (47). The five most frequently occurring file formats given Medium Confidence in all policies are Quicktime (MOV, QT) (47), Microsoft Excel (XLS) (39), Microsoft Word (DOC) (38), Microsoft Powerpoint (PPT) (38), and RealAudio (RAM, RA, RM) (35).
Table 1. Top 15 File Formats Listed by Occurrence
From the data referenced above, the authors calculated a level of Relative Confidence for each file format. This number, expressed as a percentage, was arrived at by subtracting the number of Moderate Confidence recommendations from the number of High Confidence recommendations for a particular file format, and then dividing the difference by the total number of recommendations for that format. A positive percentage indicates a greater proportion of High Confidence recommendations relative to Moderate Confidence recommendations for a given file format; a negative percentage indicates the reverse. To weed out false positives, this percentage was only calculated for file formats that appear in at least 10 policies. The five file formats with the highest Relative Confidence values (Table 2) are Comma Separated Values (CSV) (73%), the Machine Readable Cataloging Record (MARC) (68%), the Tagged Image File Format (TIFF, TIF) (53%), the Audio Interchange File Format (AIF, AIFC, AIFF) (53%), and Plain text document (TXT, ASC) (51%).
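The calculation can be checked with a short Python sketch using counts reported in this paper. The function name and signature are illustrative. The sketch also assumes that every recommendation for a format is either High or Moderate, so the Moderate count equals the total minus the High count, and it treats the total number of recommendations as the number of policies in which a format appears.

```python
from typing import Optional

MIN_POLICIES = 10  # threshold stated in the text


def relative_confidence(high: int, moderate: int,
                        min_policies: int = MIN_POLICIES) -> Optional[float]:
    """Return (high - moderate) / total as a percentage, or None below the threshold."""
    total = high + moderate  # assumption: only two confidence levels exist
    if total < min_policies:
        return None  # too few policies to report a meaningful value
    return (high - moderate) / total * 100


# TIFF: 115 occurrences, 88 of them High Confidence (so 27 Moderate).
tiff = relative_confidence(high=88, moderate=115 - 88)
# Plain text: 69 occurrences, 52 of them High Confidence (so 17 Moderate).
plain_text = relative_confidence(high=52, moderate=69 - 52)
# RealAudio: 35 occurrences, none of them High Confidence.
realaudio = relative_confidence(high=0, moderate=35)

print(f"TIFF: {tiff:.0f}%")              # TIFF: 53%
print(f"Plain text: {plain_text:.0f}%")  # Plain text: 51%
print(f"RealAudio: {realaudio:.0f}%")    # RealAudio: -100%
```

The TIFF and plain text results agree with the values reported for Table 2, while RealAudio, which received no High Confidence ratings, falls at the bottom of the scale.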
Table 2. File formats identified with positive relative confidence values
Regarding this paper's central question of the effect expanding digital library services have had on digital preservation file format policies, the data show that practitioners place high levels of confidence in trusted formats for documents and images with origins in library reformatting programs. These categories feature eight and five formats with positive Relative Confidence values, respectively (Tables 3 and 4).
Table 3. Document file formats with more than 10 occurrences
Table 4. Image formats with more than 10 occurrences
By contrast, digital preservation managers appear to take a less generous view of file format types that do not have their roots in longstanding library digitization efforts. The categories of application, computer program, geospatial, and presentation files do not count a single format among them with a positive Relative Confidence value. The spreadsheet/database and video categories (Table 5) have one positively ranked file format each. For audio formats, there are only two.
Table 5. Video formats with more than 10 occurrences
These results point to a common compromise repository managers make for file formats they are not accustomed to managing within internal digital production workflows, namely, guaranteeing them "bit-level" preservation storage without implying that the content their files encode will stand the test of time.
Table 6. Top 15 file formats listed by order of occurrence in policies expressing medium confidence
The implications of these findings are explored in the next section of this paper.
The data gathered in the course of this study suggest that, as of mid-2013, research library professionals in North America appear to trust only 18 file formats in all (Table 2). The numbers, however, only tell part of the story. In the course of their research, the authors learned as much from the data gathering process (reviewing the way file format policies were expressed online or the way that repository managers described their approach to file format management in emails) as from the data themselves.
Despite the intense focus on digital preservation in recent years, for instance, only a meager number of repositories have taken the step of formulating thorough file format policies. In addition to the example furnished by the Florida Digital Archive cited above, Deep Blue at the University of Michigan ("Deep Blue Preservation and Format Support", 2013), Boston University's digital preservation policy (Boston University, 2013), and the University of Minnesota's Digital Conservancy ("University of Minnesota Digital Conservancy", 2013) furnish examples of thoughtfully conceived approaches to file format policymaking.
It is also clear that many institutions are relying on the judgment of perceived experts to inform their own file format policy decisions, and that they are looking in particular to the creators of broadly adopted repository management software platforms for guidance. The Massachusetts Institute of Technology (MIT), home to development of the open source DSpace institutional repository software prior to the establishment of the DuraSpace not-for-profit, is a case in point. Numerous repository managers identified in this study either referred to MIT's file format policies (MIT Libraries, 2013) as those they had adopted for their own use, or presented actual charts and terminology breaking down file format policies in a manner nearly identical to the MIT DSpace model.
Comments made by repository managers during the data gathering period would imply that Archivematica is poised to play a similar role for the growing number of institutions that deploy it. Archivematica is an open source suite of digital preservation "microservices" that enable collection managers to oversee such digital preservation actions as file format normalization and the management of content in accordance with the OAIS concept of submission, archival, and dissemination information packages (Archivematica, 2013). Several digital preservation managers referred to Archivematica's ongoing file format policy registry and associated migration paths as the policies they intended to adopt at their own institutions.
As far as the future of digital preservation policy management is concerned, it bears emphasis that contemporary file format policies are very much rooted in relatively small-scale data management practices, such as stewarding files through digitization workflows or curating a university's research publications. In many cases, bit-level preservation services are offered to obviate the need to make hard decisions about unappealing file formats. For example, the RealAudio format appears 35 times in all identified format policies, but is promised exclusively Moderate Confidence, or bit-level support, without a single High Confidence rating. This is not to be read as an endorsement of RealAudio as a preservation file format so much as an acknowledgment that RealAudio files exist within many academic libraries' designated communities of users, and that 35 repositories have taken it upon themselves to preserve them as-is. Bit-level support, however, is not necessarily a vote of confidence for the preservation characteristics of a file format. Especially in the case of institutional repositories, the provision of a storage service for all commonly encountered file types is more often than not a recognition that file format use frequently extends well beyond a short-list of preferred archival formats. In this respect, bit-level support for everything that comes into a given repository implies a compromise with a social reality rather than a hard-line application of digital preservation format assessment methodologies.
It is instructive to view these trends in light of recent research from the world of large-scale, long-term web archiving. In Formats over Time: Exploring UK Web History, Andrew N. Jackson presents a file format analysis of 2.5 billion resources harvested in the .uk domain for the Archives of the United Kingdom, with the conclusion that "most file formats last much longer than five years, that network effects appear to stabilise formats, and that new formats appear at a stable, manageable rate" (Jackson, 2012, 4). In particular, Jackson's research highlights the persistence on the web of image formats such as JPEG, TIFF, PNG, and GIF, all of which rank highly in this study, while pointing to the decline and near disappearance over time of the once common X BitMap (XBM) format, which, interestingly, does not figure at all in any known ARL policies.
Despite the web's importance as an indicator of file formats trusted for sharing access to digital information, it also conceals an entire world of digital content production from view. To remain with image formats, few photographers or graphic designers begin their work in the GIF, PNG, or JPEG format, even if these are what they eventually use for the web distribution of their images. Rather, their files more frequently begin their lives in proprietary production master formats such as RAW, Digital Negative (DNG), or Photoshop Document (PSD). Such production file formats are likely to be found in collections of electronic records, not to mention a broad variety of other file formats saved on donors' hard drives, as libraries and archives begin to increase the acquisition of born digital materials. This is not dissimilar to the challenges libraries and their collaborators in information technology face as they articulate strategies to effectively steward scientific data and the broad variety of files produced throughout the research process in different disciplines. The way that managers of these emergent services craft their own file format policies will certainly have a significant influence on the future of digital preservation planning.
These looming frontiers notwithstanding, traditional notions of file format recommendations in libraries are beginning to receive scrutiny and challenge. De Vorsey and McKinney, in writing of the digital collections stewarded by the National Library of New Zealand, take issue with efforts to anoint certain file formats as "archival." In practice, they observed considerable variance between specimens of even the most common "preservation" file formats, these most often resulting from differing interpretations of format standards by the software that encoded them. As a result, they advocate shifting the focus from file formats per se, and instead matching file profiles against application profiles to determine an institution's ability to provide access to content:
Our experience with New Zealand's documentary heritage is that files contain multifarious properties. These are based on the world of possibilities that the format standard describes, but can also include non-standard properties. The range of possibilities and relationships between them is such that it is quite meaningless to purely measure a file's adherence to a format standard (De Vorsey and McKinney, 2010, 43).
Such developments would suggest that the already challenging prospect of file format policymaking for research library collections is about to become even more daunting. At present, ARL member file format policies largely reflect a high level of confidence with a limited number of file formats used in library digitization programs and the web transmission of scholarly communication. Outside of these file formats, however, policies indicate a much lower level of confidence in their respective repositories' abilities to provide adequate preservation services for file formats in the categories of application, computer program, geospatial, and presentation, and, to a lesser extent, audio, tabular data, and video. As libraries and archives begin to set their sights on collections of heterogeneous files such as born-digital electronic records and research data, this is expected to spur on further evolution not only in the file formats that appear in digital preservation policies, but in the way file format policies are articulated and implemented.
The authors wish to acknowledge the Research and Publication Committee of the University of Illinois at Urbana-Champaign Library, which provided support for the completion of this research.
 Anderson, Richardson, Hannah Frost, Nancy Hoebelheinrich, and Keith Johnson. 2005. "The AIHT at Stanford University: Automated Preservation Assessment of Heterogeneous Digital Collections." D-Lib Magazine 11 (12) (December): 10. http://doi.org/10.1045/december2005-johnson
 Bamberger, Rob, and Sam Brylawski. 2010. "The State of Recorded Sound Preservation in the United States: A National Legacy at Risk in the Digital Age". Council on Library and Information Resources and The Library of Congress.
 Cohen, Daniel J., and Roy Rosenzweig. 2013. "Digital History: A Guide to Gathering, Preserving and Presenting the Past on the Web."
 Consultative Committee for Space Data Systems. 2002. "Reference Model for an Open Archival Information System (OAIS)". CCSDS Secretariat.
 Consultative Committee for Space Data Systems. 2011. Audit and Certification of Trustworthy Digital Repositories: Recommended Practice. Recommended Practice Issue 1. Washington, DC: CCSDS Secretariat.
 Dale, Robin L., and Bruce Ambacher. 2007. "Trusted Repositories Audit & Certification: Criteria and Checklist". Chicago: Online Computer Library Center and The Center for Research Libraries.
 De Vorsey, Kevin, and Peter McKinney. 2010. "Digital Preservation in Capable Hands: Taking Control of Risk Assessment at the National Library of New Zealand." Information Standards Quarterly 22 (2): 41-44.
 Derrot, Sophie, Louise Fauduet, Clément Oury, and Sébastien Peyrard. 2013. "Preservation Is Knowledge: A Community-driven Preservation Approach." In iPres2012: Proceedings of the 9th International Conference on Preservation of Digital Objects, 11-18. Toronto, ON, Canada: University of Toronto Faculty of Information.
 Florida Libraries Virtual Campus. 2013. "Florida Digital Archive: FDA File Preservation Strategies by Format."
 Kenney, Anne R. 1996. Digital Imaging for Libraries and Archives. Ithaca, N.Y.: Dept. of Preservation and Conservation, Cornell University Library.
 Lawrence, Gregory W., et al. 2000. Risk Management of Digital Information: A File Format Investigation. Washington, D.C.: Council on Library and Information Resources.
 Pearson, D., and C. Webb. 2008. "Defining File Format Obsolescence: A Risky Journey." National Library of Australia Staff Papers.
 PREMIS Editorial Committee. 2012. "PREMIS Data Dictionary for Preservation Metadata: Version 2.2."
 Rieger, Oya Y. 2008. "Preservation in the Age of Large-Scale Digitization: A White Paper". Washington, D.C.: Council on Library and Information Resources.
 Rog, J., and C. Van Wijk. 2008. "Evaluating File Formats for Longterm Preservation." Koninklijke Bibliotheek 2: 12-14.
 Stanescu, Andreas. 2005. "Assessing the Durability of Formats in a Digital Preservation Environment: The INFORM Methodology." OCLC Systems & Services 21 (1) (March 1): 61-81. http://doi.org/10.1108/10650750510578163
 Waters, Donald, and John Garrett. 1996. "Preserving Digital Information: Report of the Task Force on Archiving of Digital Information". The Commission on Preservation and Access and The Research Libraries Group.
Appendix I: Definitions of Terms
ARL Library
One of the 125 member libraries listed in the online membership directory of the Association of Research Libraries (Association of Research Libraries, 2013) during this study's data collection period of October 2012-June 2013.
Repository or Digital Library Service
Any digital library repository or production unit that serves the preservation planning function of recommending file formats for the long-term viability of digital content. This includes institutional repositories that manage digital items submitted by a community of external users, often research faculty, for long-term access; digital production units that generate content for digital library collections, often through the digital reformatting of analog materials in libraries and archives; and digital preservation repositories with a clear charge to maintain enduring access to digital content.
File Format Policy
An official statement of preference for specific file formats over others, sometimes expressed as a recommendation, other times as a set of requirements for deposit into a digital library collection or repository.
File Format
A standardized way to structure the data stored in a computer file, or a self-contained data-stream or package of related data-streams made available as a discrete entity to a computer's operating system and its programs. That is, in this study, the term File Format is used in a broad sense to encompass discrete data packages that store homogenous content (e.g. a text file), as well as complex digital objects composed of several file or bitstream objects encased in a wrapper or bundling file format (e.g. H.264-encoded video stored within a QuickTime file wrapper). Essentially, anything that appears to a modern operating system's file browser as a packet of information represented by a character string, a dot, and an extension is considered a representative example of a file format.
Many File Format Policies contain stipulations that go beyond the File Format level. A repository may accept JPEG2000 files, for example, but only on the condition that they were created utilizing a lossless compression algorithm; TIFF files as long as they are version 6.0 of the standard; or document formats such as the Portable Document Format (PDF), provided that they do not contain embedded media content. As important as these distinctions are, this study focuses on File Format designations understood in a very broad sense, that of a packet of data represented on the file system level by a dot and an extension, and considers such refinements, while important, out of scope.
File Format Type
A categorization of file formats based on common use categories, as defined below:
Confidence Level
Few repository or digital library policies designate confidence levels in file formats in quite the same way. Many differentiate between service levels, guaranteeing, for example, content migration for trusted file formats but bit-level preservation services for others. Others rank file formats by levels of confidence in their long-term accessibility. As such, it is not uncommon to encounter repository policies that categorize file formats using subjective terms. In seeking to find common ground across so much variety, the authors of this study settled on two categories: high confidence and moderate confidence, and created the guidelines below to differentiate between them:
File format policy data collected for this article and submitted to D-Lib Magazine is available here in PDF.
The data is also downloadable in PDF, XLS and CSV formats from the IDEALS institutional repository at http://hdl.handle.net/2142/47421.
About the Authors