Volume 17, Number 5/6
Institutional Repositories and Digital Preservation: Assessing Current Practices at Research Libraries
University of Massachusetts Amherst
In spring 2010, authors from the University of Massachusetts Amherst conducted a national survey on digital preservation of Institutional Repository (IR) materials among Association of Research Libraries (ARL) member institutions. Examining the current practices of digital preservation of IR materials, the survey of 72 research libraries reveals the challenges and opportunities of implementing digital preservation for IRs in a complex environment with rapidly evolving technology, practices, and standards. Findings from this survey will inform libraries about the current state of digital preservation for IRs.
Digital preservation is a significant problem facing libraries. Libraries are struggling with how to preserve the scholarly and cultural record now that this information is increasingly being produced in digital formats. In the age of print, information was relatively simple to preserve since paper is a durable format when made properly and stored under the proper conditions. However, now that we have entered the digital age, preserving information has become a more complex task. Digital information is fragile and faces many threats, including technological obsolescence and the deterioration of digital storage media. The ultimate irony, as pointed out by Paul Conway, is that, "as our capacity to record information has increased exponentially over time, the longevity of the media used to store the information has decreased equivalently." For example, illuminated manuscripts have lasted for over 1,000 years, but a CD can degrade in as little as 15 years.
Perhaps an even greater threat than the deterioration of storage media is technological obsolescence. In an article titled "Digital Longevity: the lifespan of digital files," Julian Jackson states, "the rate of change in computing technologies is such that information can be rendered inaccessible within a decade." In many cases, software upgrades may not support legacy file formats, and without the intervention of digital preservation techniques the information will no longer be accessible. If the digital scholarly record is to be preserved, libraries need to establish new best practices for preservation. For their part, creators need to be more proactive about archiving their work. The relatively recent development of institutional repositories (IRs) offers some promise in ensuring the long-term preservation of digital scholarship.
However, there has been some debate about whether IRs were intended to provide long-term preservation of digital scholarship. In her foreword to the 2007 Census of Institutional Repositories, Abby Smith writes, "A conspicuous fact about institutional repositories, confirmed by the MIRACLE Project findings, is that there is no consensus on what institutional repositories are for."  She goes on to say:
For example, many institutions that plan or pilot test repositories are motivated by the desire to change the dynamics of scholarly communication ... Other institutions identify stewardship of digital assets, especially their preservation, as a key function of a repository. Yet survey data confirm that repositories are not yet providing key preservation services, such as guaranteeing the integrity of file formats for future use. 
Perhaps one of the most often quoted definitions of an institutional repository is from Clifford Lynch's 2003 essay "Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age." In this essay, Lynch defines IRs as:
A set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members. It is most essentially an organizational commitment to the stewardship of these digital materials, including long-term preservation where appropriate, as well as organization and access or distribution. 
This study aims to find out whether long-term preservation is part of the mission of institutional repositories at Association of Research Libraries member institutions, and if so, what plans IRs have to provide long-term preservation of their content.
This study investigated the following questions related to digital preservation of IR content:
- Is preservation part of the mission and goals of IRs?
- What preservation policies exist for IRs?
- What preservation strategies are IRs currently implementing?
- Are the necessary rights and agreements in place to preserve the content of IRs?
- Are all of the materials in IRs of sufficient quality and importance to warrant long-term preservation?
- Do IRs currently have the necessary sustainability in terms of funding and staffing to carry out long-term preservation of their contents?
The authors of this study decided to survey ARL libraries because we expected that the majority would have IRs. We also expected that most ARL libraries would at least be thinking about digital preservation at this point, if not actively taking measures to ensure long-term preservation of the contents of their IRs.
The growing body of literature available on digital preservation and institutional repositories comes from a diverse group of scholars representing equally diverse perspectives. This literature review provided insight into different facets of the authors' survey, such as digital preservation methods and strategies, content recruitment and sustainability issues related to institutional repositories, and opportunities and challenges concerning digital preservation in the context of institutional repositories. However, very few articles were found which examine current digital preservation practices of institutional repositories in the United States. Librarian Charles W. Bailey, Jr.'s "Institutional Repository Bibliography"  offers a comprehensive view of the publication record on Institutional Repository topics, the majority of which focus on best practices, predictions, and opinion papers, as opposed to statistical analysis. Compared with the large number of articles listed in the section on general literature related to IRs, the subsection "Institutional Repository Digital Preservation Issues"  has only a small number of publications listed.
With digital content increasing exponentially in the current Information Age, libraries have come to realize the importance of digital preservation. Paul Wheatley states that "careful consideration must be given to the preservation needs of materials to be archived within an institutional repository." In their 2008 article, Nancy Y. McGovern and Aprille C. McKay also described several significant opportunities for digital preservation offered by IRs, including digital content management, opportunities for content creators to learn about their role in digital preservation, and faculty legacy preservation.
Long-term digital preservation came to scholars' attention even before the birth of IRs in 2002. In 1996, Don Waters and John Garrett wrote a landmark report calling attention to the need for digital preservation by stating, "Failure to look for trusted means and methods of digital preservation will certainly exact a stiff, long-term cultural penalty." During the same year, the Digital Preservation Coalition was established in the United Kingdom, and in the United States the Library of Congress developed a national strategy for preserving digital information. In 2002, the Consultative Committee for Space Data Systems (CCSDS) published the Recommendation for Space Data System Standards Reference Model for an Open Archival Information System (OAIS). The OAIS model provides a comprehensive framework for all functions required for digital preservation including ingest, storage, retrieval, and long-term preservation of digital objects. However, implementation of digital preservation in IRs is still in its infancy. As pointed out by Karen Markey and others, "it may not be surprising that there is a gap between the claims of stewardship, or aspirations for stewardship, by institutional repositories and their current ability to preserve digital assets. Organizational models for digital preservation are only now emerging and they are quite diverse ... Implementation of digital preservation in IRs, however, is still in its infancy."
With IR software gradually integrating support for preservation, there seems to be more hope for IR managers in implementing digital preservation for IRs. However, it is not sufficient to rely only on software, since various facets have to be considered when preserving digital content. As Eliot Wilczek and Kevin Glick state in their article, "it seems obvious that no existing software application could serve on its own as a trustworthy preservation system. Preservation is the act of physically and intellectually protecting and technically stabilizing the transmission of the content and context of electronic records across space and time, in order to produce copies of those records that people can reasonably judge to be authentic. To accomplish this, the preservation system requires natural and juridical people, institutions, applications, infrastructure, and procedures." Similarly, Nancy Y. McGovern and Aprille C. McKay point out the challenges for digital preservation in the context of IRs, including "little control over what is ingested into the IR; deposit of materials in less-optimal formats, with poor metadata and insufficient intellectual property rights clearance; and digital content that is difficult or costly to preserve."
As the preservation of IR content is becoming a bigger concern among IR managers, an assessment of current practices is needed. In 2005, Anne Kenney and Ellie Buckley from Cornell University conducted a "Survey of Institutional Readiness" on developing digital preservation programs. The survey found that "only about one third of institutions have developed, approved and implemented digital preservation policies."  Five years later, what is the status of digital preservation practices in the context of IRs among ARL libraries? The survey results presented in this paper attempt to find out.
Findings and Analysis
The survey contained six sections with a total of twenty-four questions, which aimed to investigate current practices in relation to the existence of digital preservation policies, digital preservation strategies, rights to preserve the content, content quality, and sustainability. As mentioned before, the survey was sent out to ARL libraries. The ARL website listed 125 libraries in May of 2010. Of these, the authors limited their survey to the 72 academic libraries that had institutional repositories. Fifty-two percent of the surveys were returned. Of the surveys returned, 43 percent were returned completely filled out. The responses were collected and analyzed using online survey analysis tools and spreadsheets.
The first section of the survey covered two general questions. The first question asked what platform survey respondents used for their IRs. DSpace was the most popular, with 57.9 percent of survey respondents using it for their IR. Other IR platforms in use were Digital Commons (26.3 percent), ContentDM (5.3 percent), and DigiTool (2.6 percent), with the remaining 7.9 percent choosing "other." Among the 7.9 percent who chose "other," three respondents specified the platform they were using. One IR used a Digital Commons back-end with an XTF-based front-end, and another reported using a "thoroughly modified Greenstone" system. The third respondent used various systems to make up their IR, including ETD-db for electronic theses and dissertations, VT ImageBase for digital images, and ContentDM for archival and scholarly collections.
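The number of respondents who answered the platform question is not reported above. As a rough consistency check, the short Python sketch below converts the reported percentages into whole-respondent counts and looks for totals, up to the 72 libraries surveyed, that reproduce every percentage after rounding to one decimal place. The function names are illustrative, and the resulting total is an inference from the published percentages rather than a figure given by the survey.

```python
# Percentages as reported for the platform question.
reported_percentages = {
    "DSpace": 57.9,
    "Digital Commons": 26.3,
    "ContentDM": 5.3,
    "DigiTool": 2.6,
    "Other": 7.9,
}


def implied_counts(percentages, total):
    """Convert each percentage to the nearest whole number of respondents."""
    return {name: round(pct * total / 100) for name, pct in percentages.items()}


def is_consistent(percentages, total):
    """True if whole counts out of `total` reproduce every reported percentage."""
    counts = implied_counts(percentages, total)
    if sum(counts.values()) != total:
        return False
    return all(
        round(100 * counts[name] / total, 1) == pct
        for name, pct in percentages.items()
    )


def consistent_totals(percentages, max_total):
    """List every respondent total from 1 to max_total that fits the percentages."""
    return [n for n in range(1, max_total + 1) if is_consistent(percentages, n)]


if __name__ == "__main__":
    # 72 is the number of libraries the survey was sent to.
    for total in consistent_totals(reported_percentages, 72):
        print(total, implied_counts(reported_percentages, total))
```

Under these assumptions the only total that fits is 38 respondents (22 DSpace, 10 Digital Commons, 2 ContentDM, 1 DigiTool, 3 other), which agrees with the three respondents who specified another platform.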
The second question in this section asked whether preservation was part of the mission of the IR. For the vast majority, 97.4 percent, preservation was part of the mission of the IR. Only 2.6 percent of respondents reported that preservation was not a part of the mission of the IR. One of the respondents who answered No commented that preservation would eventually be part of the mission of the IR. If respondents answered no, they were thanked for their time and exited from the survey. The rest of the questions were related to digital preservation, and most would not be applicable for an IR that did not have preservation as one of its goals.
Developing preservation policies ought to be the first step toward guaranteeing preservation actions. The strategies for preserving IR content and the decisions about what content requires short, medium, or long term preservation should be driven by preservation policies. With IR content growing rapidly, it is important to look at how policies have been developed to guide the implementation of digital preservation for IR content.
In this survey, 51.5 percent of respondents indicated that their IRs have preservation policies. Encouragingly, this result shows an increase in digital preservation policy development since the 2003-2005 Cornell survey. For further investigation, the authors asked whether or not the IR provides long-term preservation to all submitted content. Seventy-eight percent of respondents indicated that they are committed to providing long-term preservation for their IR content. In examining the policies provided by the respondents, the authors found that many institutions guarantee preservation only for certain file formats; 90.0 percent of policies clearly identified supported or recommended file formats, while the rest of the institutions briefly say they are committed to long-term digital preservation of all materials housed in their IRs. From the policies provided, the most commonly supported file formats are listed in the Appendix, Table 1.
The third section of the survey asked several questions about the strategies employed to preserve IR content. Ninety percent of respondents reported that their IR content is at least backed up and stored in a secure storage system. Sixty-three percent of the respondents reported that they had a checksum algorithm to detect errors in the data stored in their IR. However, other digital preservation strategies such as migration, emulation, and refreshing were reported by only half, or less, of the institutions surveyed (See Figure 1). In the comments on this question, one respondent mentioned that the list of digital preservation strategies being used is a "developing list" and another respondent said that this was "in development."
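To illustrate what a checksum-based error check involves, here is a minimal Python sketch. It is not drawn from any respondent's system: the choice of SHA-256, the manifest layout, and the function names are all assumptions made for illustration. A checksum is recorded for each file at deposit time and recomputed later; any mismatch or missing file signals that the stored data has changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (hypothetical manifest format)."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_fixity(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the manifest entries that are missing or no longer match their checksum."""
    failures = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_path)
    return failures
```

In practice the manifest would be stored separately from the content it describes, and the verification would be run on a schedule so that errors are caught while a good backup copy still exists.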
The survey went on to ask whether digital preservation strategies were handled internally by the IR system itself or with external systems and services. The data show that many institutions are taking advantage of some features of their IR system that support digital preservation. In addition, these libraries supplement the limited preservation features of most IR systems with external preservation systems and services (See Figure 2). The comments reveal some of the external systems currently being used to support digital preservation. They include LOCKSS, MetaArchive, DuraCloud, iRODS, CDL curation services, and InterPARES, as well as Bepress backup for Digital Commons repositories and campus IT backup. Checksums were mentioned as a preservation feature internal to the DSpace repository system.
The next question asked whether the institution had a digital preservation system in place for its IR content and other digital collections. The largest percentage, 39.3 percent, had no digital preservation system in place. The next largest category, 32.1 percent, was those that had a private LOCKSS network in place. Another 28.6 percent had a custom designed digital preservation system, and 10.7 percent shared the use of a digital preservation system with other institutions.
Encouragingly, 58.6 percent of respondents reported recording preservation metadata about the digital objects in their IRs. Some of the most frequently collected types of preservation metadata included technical information needed to preserve the resource, rights information, provenance or ownership history, and authorized change histories of the resource (See Figure 3). However, consistency might be an issue, particularly if the IR is primarily collecting user-supplied metadata. One respondent pointed out that "not all collections have preservation metadata; it varies based on the sophistication of the collection." Another respondent commented that they "are working on standards and best practices that address all types of metadata." In this section of the survey the authors also wanted to know whether the IR system could export all of its content and all of its metadata, since this is key for migrating to a new or better system in the future. Most respondents, 96.7 percent, reported that the IR system was able to export all of its content, and 93.3 percent reported that their IR system was able to export all of its metadata. Data about which IR systems could not export all their content and all their metadata were not collected.
Rights and Agreements
Copyright and intellectual property are also important issues to consider when thinking about the stewardship of scholarly materials. When Open Access (OA) was first conceived of as a solution to the scholarly communication problem, the IR was developed as a way to implement OA in academia. Therefore, acquiring the rights from content contributors and copyright holders to distribute the content freely is an integral part of collecting content for IRs. However, securing the necessary rights and agreements to preserve the materials is also important, because implementing long-term digital preservation strategies, such as migrating to new formats in the future, may necessarily involve changing the content to some extent. Since preservation and access go hand in hand, the survey sought to find out whether IRs have the necessary agreements in place with content contributors and copyright holders to preserve and provide access to submitted content.
Among the repositories surveyed, 72.4 percent indicated that they had made agreements with content contributors to provide preservation services for submitted content. These agreements were usually made during the deposit process. The types of agreements include online click-through agreements, written agreements, policies, MOUs, and verbal agreements. However, making agreements with content contributors is only the first step, because for a significant portion of IR content, the content creator or contributor may not necessarily be the copyright holder. The survey results show that while most IRs ask for permission from contributors to preserve content, not all will necessarily ask for the same permission from the copyright holders, such as publishers. When asked whether or not the IR secures permission from content contributors, 96.7 percent of respondents answered yes (see Figure 4). However, only 56.7 percent indicated that they would ask for the same permissions from copyright holders if they were different from content contributors (see Figure 5). The comments section revealed that many institutions do not consider providing copyright clearance on behalf of content contributors to be part of their responsibilities. Most agreements provided by survey respondents state that content contributors need to warrant that they either own the copyright of the submitted content or that they have permission to submit the work if the copyright is owned by another party.
The most important roles that IRs play are to collect, manage, and disseminate the digital scholarship that their communities produce. Collecting content is the first step to building an IR, and since their inception this is what IR managers have primarily focused their efforts on. Digital scholarship can be collected in different ways, and how it is collected may affect its quality as well as the ability to preserve it. It is worth investigating how content is collected and how quality is ensured since different levels of preservation effort will be made depending on both the initial quality of the content and its format.
Eighty percent of IRs reported that they have a collection policy in place. From the links to policies provided in the comments section, we discovered that collection policies mostly include selection criteria (such as the nature and type of the materials that can be submitted), recommended file formats, and procedures (such as withdrawal, access, and preservation). As to how content is deposited in the IR, the survey asked about three methods: self-archiving by the author, deposit by a third party on behalf of the author, and deposit by repository staff. The results showed that content is deposited in the IR using all three methods in 92.0 percent of surveyed institutions. The next question asked survey respondents to indicate rough proportions for each type of deposit method. The answers varied widely, but the overall pattern showed that repository staff are still depositing much of the content that goes into IRs.
As we discussed, no matter how content is deposited in the IR, the quality of deposited content should be examined before digital preservation actions are considered, as the initial quality of deposited content can directly affect the success of digital preservation efforts. If the quality of the content cannot be assured, then significant problems may arise. These problems may include format obsolescence, poor quality or unreadable images or scans, insufficient metadata to manage and preserve the materials, etc. For this reason, the last question in this section examined whether or not IRs have mechanisms in place to ensure the quality of submitted content. Consistent with our expectations, 83.3 percent of respondents are using authentication mechanisms (see Figure 6). Authentication mechanisms allow an administrator to define resources that can be accessed and to track users as well as submitted content. In addition, 70.0 percent provide submission guidelines, and 66.7 percent indicated that repository staff review submitted content. These are all important actions to take in order to ensure that high quality content, worthy of preservation, is being submitted to the IR. Results show that only 20.0 percent of respondents are also using a peer review system with their IRs. It is not clear to us what content is subject to peer review, but we imagine that it would include the types of materials that typically employ peer review such as journal articles and conference proceedings. For previously published materials, most likely peer review occurred prior to deposit in the IR.
The last section of this survey looked at sustainability issues for IRs as this has a direct impact on the preservation of their content. The first question asked if the IR had sustainable long-term funding. At this point the majority of IRs, 63.3 percent, do have sustainable long-term funding. However, there are still a significant number of IRs whose funding situation is uncertain; 13.3 percent of respondents reported that their IRs do not have sustainable long-term funding, and 23.3 percent reported that they didn't know if their IRs had sustainable funding. Comments about this question ranged from "as long as the library decides it's a worthwhile project" to "the library's new strategic plan includes a long term commitment to the IR" and "it is funded out of the library budget."
The next question asked if the IR had adequate and sustainable staffing. The data show that this is still a problem area for many IRs. Answers to this question are split right down the middle; 48.3 percent responded that they have adequate staffing, 48.3 percent responded that they do not have adequate staffing, and 3.4 percent said they did not know whether they had adequate staffing or not. One respondent commented that "At a keep-alive level, there is adequate staffing unless we lose staffing lines. As content increases and increased formats are handled that must be migrated, it's not clear that we could handle it with our existing staff." Another reported that their "staffing is less than one FTE," and still another commented that their "success means [they] need more than one full-time staff and one part-time student worker, but budget does not allow for it." Numerous respondents had comments to make about this question, which further emphasizes the fact that adequate staffing levels are a concern for many IR managers.
When asked what level of digital preservation the IR was currently providing, 20.0 percent responded that the IR was providing short-term preservation. Short-term preservation was defined as access either for a defined period of time while use is predicted or until materials become inaccessible because of changes in technology. Medium-term preservation was defined as continued access beyond changes in technology for a defined period of time but not indefinitely, and was reported by 36.7 percent. To the authors' surprise, 43.3 percent reported that they were currently providing long-term digital preservation, defined as access to the content for an indefinite period of time. Although 43.3 percent report that their IRs are currently providing long-term digital preservation, numerous comments show a slightly different picture. One respondent wrote, "We continue to develop standards and best practices. Long term preservation is definitely our goal." Another said, "By the end of this year, we should have detailed preservation policies and procedures in place. As part of the strategic plan implementation, we will work on implementing preservation policies and procedures." Still another commented, "We aim for long term preservation, but I think we need a better preservation plan in place." It is hard to tell with complete accuracy whether 43.3 percent are actually providing long-term preservation today, but these comments seem to suggest that IRs may be engaged in a planning process to provide long-term preservation rather than providing it in a fully operational way.
Responses to the last survey question strengthen the theory that most IRs are currently in a planning mode rather than a fully operational mode for providing long-term digital preservation. When asked if the IR was currently engaged in planning a process to provide long-term digital preservation of its content, 67.7 percent answered yes; 16.7 percent said no; and only 16.7 percent reported that they were already providing long-term digital preservation. Comparing the 16.7 percent from this question against the 43.3 percent who reported that the IR was currently providing long-term preservation in the previous question suggests that long-term digital preservation is really more of a goal than a reality for most IRs at this point.
The results of the survey show that an increasing number of research libraries have started to move digital preservation programs ahead by developing preservation policies. The growing awareness about making agreements and securing permissions for preserving IR content signifies another step forward, although some concerns may remain when the responsibilities of seeking permissions are assigned to content contributors. Content contributors may be frustrated if they do not have sufficient knowledge of copyright issues or if they lack the time to secure the necessary permissions from copyright holders to self-archive their previously published works. These issues impede the ability of an IR to collect content as well as to preserve content. An innovative approach needs to be developed to address these concerns. Assuring quality of content and collecting content in formats that can more easily be preserved is another area that might need more consideration. A list of supported file formats could offer preservation guidance to content contributors; however, it may narrow the scope of content for IRs. Collection policies, such as selection criteria and submission guidelines, are helpful for guiding decisions about preservation efforts and ensuring that the content of IRs is worth the cost and effort that it will take to preserve. Since the IR is still in a stage of development at many institutions, lack of sustainable funding and adequate staffing could present an obstacle in implementing successful digital preservation programs. It will be important to address these sustainability issues as part of the planning process for building a digital preservation program. Despite these challenges it is very encouraging to see a large number of digital preservation policies being developed and an increasing number of digital preservation strategies being implemented for IRs. We expect to see great steps forward in the next five years.
During the process of the survey and preparation of this paper, we received a lot of support from our colleagues and friends. Here we would like to thank Robert McGeachin and Sandra Tucker from Texas A&M University Library for sharing their IR managers' email list with us. We also want to thank our colleague Stephen McGinty from the W.E.B. Du Bois Library at the University of Massachusetts Amherst, and Dr. Marta Deyrup from Seton Hall University Library for their insightful comments on the paper.
Conway, Paul. Preservation in the Digital World. Washington, D.C.: Council on Library and Information Resources, March 1996. http://www.clir.org/pubs/abstract/pub62.html.
 Jackson, Julian. Digital Longevity: the lifespan of digital files. York: Digital Preservation Coalition. http://www.dpconline.org/events/previous-events/306-digital-longevity.
Smith, Abby. Foreword to Census of Institutional Repositories in the United States MIRACLE Project Research Findings, by Karen Markey, Soo Young Rieh, Beth St. Jean, Jihyun Kim, and Elizabeth Yakel. Washington, D.C.: Council on Library and Information Resources, February 2007. http://www.clir.org/pubs/reports/pub140/contents.html#fore.
 Lynch, Clifford A. Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age. Washington, D.C.: Association of Research Libraries, February 2003. http://www.arl.org/bm~doc/br226ir.pdf.
 Bailey Jr., Charles W. Institutional Repository Bibliography. http://digital-scholarship.org/irb/.
 Wheatley, Paul. "Institutional Repositories in the context of digital preservation," Microform & Imaging Review 33 (2004), 135-46. http://dx.doi.org/10.1515/MFIR.2004.135. doi:10.1515/MFIR.2004.135
 McGovern, Nancy Y., and Aprille C. McKay, "Leveraging short-term opportunities to address long-term obligations: A perspective on Institutional Repositories and Digital Preservation Programs," Library Trends 57, no.2 (2008): 262-79. http://muse.jhu.edu/journals/library_trends/v057/57.2.mcgovern.html.
 Waters, Donald, and John Garrett, Preserving Digital Information: Report of the Task Force on Archiving of Digital Information (Washington D.C.: The Commission on Preservation and Access, 1996), 68. http://www.clir.org/pubs/abstract/pub63.html.
Markey, Karen, Soo Young Rieh, Beth St. Jean, Jihyun Kim, and Elizabeth Yakel, Census of Institutional Repositories in the United States MIRACLE Project Research Findings. Washington, D.C.: Council on Library and Information Resources, February 2007. Accessed May 27, 2010. http://www.dspacedev2.org/images/LinkTo/clir%20report.pdf.
 Wilczek, Eliot, and Kevin Glick, Fedora and the Preservation of University Records. 2006. Accessed May 2, 2010. http://dca.lib.tufts.edu/features/nhprc/reports/index.html.
Kenney, Anne, and Ellie Buckley. "Developing Digital Preservation Programs: the Cornell Survey of Institutional Readiness, 2003-2005." August 15, 2005. Accessed May 15, 2010. http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070519/viewer/file1088.html#article0.
Appendix
Table 1: Most commonly supported file formats
Text File Formats
- Plain Text (US-ASCII, UTF-8)
- .odt, .ods, .odp
Image File Formats
About the Authors
Yuan Li is the Scholarly Communication Librarian at Syracuse University (SU). Prior to joining the SU Library, Yuan worked as Digital Initiatives Librarian at the University of Rhode Island; Digital Repository Resident Librarian at the University of Massachusetts, Amherst; Digital Initiative Developer in the Graduate School of Library & Information Studies at the University of Rhode Island, and as Metadata Developer in the Special Collections and Archives Unit of the University of Rhode Island Library. Yuan holds an MLIS from the University of Rhode Island and a Master of Engineering degree in Applied Computer Science from the National Computer System Engineering Research Institute of China. She also holds a Bachelor of Engineering degree in Computer Science and Technology from Yanshan University (China).
Meghan Banach is the Bibliographic Access and Metadata Coordinator at the University of Massachusetts Amherst. In addition to providing leadership for the Bibliographic Access and Metadata Unit of the Information Resources Management Department, she is a member of the UMass Amherst Scholarly Communication Team and focuses primarily on the management of electronic theses and dissertations in the institutional repository. She also chairs the UMass Amherst Digital Creation and Preservation Working Group and serves on the Metadata Working Group. Her research interests center on managing, preserving, and providing access to digital materials. She holds an MLIS with an Archives Management Concentration from the Simmons College Graduate School of Library and Information Science and a BA in History from Mount Holyoke College.