D-Lib Magazine
September 2003

Volume 9 Number 9

ISSN 1082-9873

The Digital Preservation of e-Prints

 

Stephen Pinfield
Information Services
University of Nottingham, United Kingdom
<Stephen.Pinfield@Nottingham.ac.uk>

Hamish James
Arts and Humanities Data Service
King's College London, United Kingdom
<hamish.james@ahds.ac.uk>

Introduction

The first question that needs to be addressed when discussing the digital preservation of e-prints is "do we need to preserve e-prints at all?" Whilst many people in the digital preservation community may instinctively answer this question "yes", or at least "some of them", many people from the e-prints community would say "no", or at least "preservation should not be a priority". This paper addresses the question of whether or not e-prints should be preserved and then goes on to make some comments about the practical issues that arise from the suggested answer.

"E-prints" are being defined here as electronic versions of research papers or similar research output. They may be pre-prints (drafts of papers before they have been refereed) or post-prints (after they have been refereed). They may also include material such as chapters from scholarly books or conference papers, which may not be formally refereed but are nevertheless important research output. A movement is beginning to develop amongst some stakeholders in the scholarly communication process where e-prints are being deposited in open-access online repositories so that the literature is freely available to users. The enormous potential of these repositories has been enhanced by the development of the Open Archives Initiative (OAI) Protocol for Metadata Harvesting, which facilitates interoperability between repository servers. As the name "Open Archives Initiative" implies, online repositories are often referred to as "archives" but this loose use of the term "archive" does not necessarily imply a curation or long-term preservation function. The question is: should e-print archives perform a truly archival function by preserving their contents for the long term?
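To make the interoperability point concrete, the following is a minimal sketch of how a harvester might form an OAI Protocol for Metadata Harvesting request. The six verbs and the `metadataPrefix` argument come from the protocol itself; the function name `build_oai_request` and the repository address are assumptions made purely for illustration and do not refer to any real service.

```python
from urllib.parse import urlencode

# The six request verbs defined by the OAI Protocol for Metadata Harvesting.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_oai_request(base_url, verb, **arguments):
    """Return the URL for an OAI-PMH request against a repository's base URL.

    `build_oai_request` is a hypothetical helper, not part of any library.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    # Skip arguments that were left unset so only real parameters are sent.
    params.update({k: v for k, v in arguments.items() if v is not None})
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address, used only for illustration.
url = build_oai_request(
    "http://eprints.example.ac.uk/oai",
    "ListRecords",
    metadataPrefix="oai_dc",  # unqualified Dublin Core
)
print(url)
```

A request of this kind returns descriptive metadata about the e-prints; it says nothing about whether the repository will keep the files themselves usable, which is exactly the gap between "archive" in the OAI sense and preservation.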

The question of preservation

One way of addressing this question is to consider the arguments of a leading advocate of e-prints, Stevan Harnad. Harnad's views on digital preservation of e-prints have been expressed in a number of publications and e-mail discussion lists [1]. They might be summarised as follows:

  1. E-prints are "duplicates" of the conventionally published literature. In other words, they do not replace the traditional journal literature but complement it.
  1. E-prints are about "immediate access". They allow rapid dissemination of the literature (in contrast to journals where there is often a long delay between acceptance of a paper and its publication). When housed in open-access repositories, e-prints are also free at the point of use (in contrast to the journal literature which is usually hidden behind a "tollgate").
  1. Effort should be concentrated on filling the repositories. Anyone who has worked in this area knows that getting the content is a major priority and a major challenge.
  1. If this is the case, Harnad argues, preservation should not be a priority. In fact, it may be an unnecessary distraction.
  1. Putting an emphasis on preservation may detract from or slow down efforts to fill repositories. This might be the case if repository managers delay making repositories available until there are watertight preservation policies in place, or if they put additional barriers in the way of authors before they can submit papers (such as insisting on particular file formats or metadata standards).
  1. When there is a critical mass of content, it can be "retro-fitted for more rigorous preservation." This will particularly apply if and when e-prints become the primary outlet for scholarly communication.
  1. In the meantime, preservation is not an urgent priority. After all, the largest e-print repository, arxiv.org, was set up in 1991 and all of its contents are still accessible.
  1. Preservation efforts should be concentrated on the conventionally published versions of papers rather than on e-print repositories. Harnad has a striking analogy to illustrate this point: preserving e-prints is like a museum curator preserving a copy of an artefact rather than the original artefact itself.

Weighing the importance of OAI against the Open Archival Information System (OAIS) reference model (a standard for digital preservation), Harnad says "Forget about OAIS for now! The OAI-compliance of the Eprint Archives is enough for now" [2].

In contrast, many would argue that digital preservation should be an important aspect of the service delivered by e-print repositories. Peter Hirtle, in a discussion that includes e-prints and other information resources, suggests, "An OAI system that complied with the OAIS reference model, and which offered assurances of long-term accessibility, reliability, and integrity, would be a real benefit to scholarship" [3]. Of course, the difference between this and Stevan Harnad's view should not be exaggerated. Harnad does not say "never", and Hirtle does not say digital preservation should dominate the agenda. However, there is at the very least a difference of emphasis. Some repository managers would go further than Hirtle and suggest that it is irresponsible to set up a repository without having preservation policies in place.

A number of possible reasons for preserving e-prints might be suggested:

  1. Preserving (open) access. This is perhaps the most important reason. It would be an irony if a paper in an open-access e-print repository could be accessed today and yet in ten years' time had been allowed to decay so that it was inaccessible. Preservation work (such as migrating to new versions of software) may be necessary to preserve open access to the content.
  1. Where e-prints are commonly cited. Posting an e-print creates a community of users (or potential users). In some disciplines, citing e-prints may be the norm and so users would expect access to that e-print to be preserved so that citations to it would continue to be valid. Even after a paper has been formally published, researchers may expect to be able to cite the post-print in an e-print repository since access to it may be easier (and toll-free).
  1. Where e-prints contain or sit alongside more than the conventionally published paper. In some cases, e-prints may have more detail or additional data associated with them, which means that they should be preserved in their own right. Take an example. A well-known paper by Steve Lawrence providing evidence that open-access papers are more frequently cited was published by Nature [4]. An e-print of this paper is also available that is more detailed and gives more data than the Nature version [5]. It is certainly useful for both versions to be available and therefore reasonable to suggest that both versions of the paper should be preserved. They each make an important and different contribution to the scholarly literature.
  1. Where e-prints form part of a specific collection. Repository managers may want to preserve e-prints which form part of a coherent collection and where the existence of the collection itself adds value. For example, institutional repositories may wish to preserve all papers in a particular subject area, such as local or regional studies.
  1. Guarantees of preservation may attract authors to submit papers. It has been suggested that making authors aware that repositories will preserve their papers may be the carrot that will attract them to deposit their papers. As yet, however, there does not seem to be any specific evidence for this.

The question arises: can these arguments in favour of the preservation of e-prints be reconciled with the emphasis on filling repositories? Given that much of the debate seems to imply that a choice must be made between devoting resources and effort to filling repositories and devoting them to digital preservation, what is the way forward for repository managers?

Possible Way Forward

Perhaps the key point here is that filling repositories and digital preservation need not be mutually exclusive. Rather than being "either...or..." it may be "both...and...". Filling repositories is crucial but there is no reason why work on preservation cannot run in parallel.

Filling repositories is certainly a major priority. You cannot preserve e-prints if they are not there in the first place! It is also a major challenge. Achieving "buy in" from researchers (which requires a culture shift on their part) is not easy. Authors will need to be persuaded to deposit their work in institutional or subject repositories, and depositing should be made as easy as possible for them. They can be encouraged, for example, by "self-archiving by proxy" policies (where repository administrators carry out the depositing procedure on behalf of the author) [6]. However, work on preservation can also begin now. The work may be limited to a certain extent, since it is clear that, whatever happens, such work should not be allowed to discourage authors. Authors should certainly not have unnecessary demands placed on them as part of the depositing process. But even bearing this in mind, work can begin now behind the scenes on preservation issues alongside efforts to fill the repositories.

The UK SHERPA project has the construction of a series of institutional OAI-compliant e-print repositories and the investigation of the preservation of this content as its two main aims [7]. The full project title "Securing a Hybrid Environment for Research Preservation and Access" highlights this dual track approach of access and preservation. The hybridity referred to in the acronym is one where the conventionally published literature can coexist with open-access e-print repositories. SHERPA is funded by the Joint Information Systems Committee (JISC) and the Consortium of University Research Libraries (CURL). It is part of the JISC-funded FAIR programme (Focus on Access to Institutional Resources) [8]. SHERPA involves a partnership of a number of leading research libraries who are initiating the development of e-print repositories within their own institutions. At the same time, they are also investigating the implementation of standards-based preservation. Libraries are the natural home for initiating this activity since they have a tradition of managing access to information resources and preserving them for the future. But the project recognises that libraries cannot act in isolation and so a large part of project activity will be devoted to advocacy within the academic community.

As part of its early work on digital preservation, SHERPA has begun to address some of the practical issues. One of the SHERPA partners, the Arts and Humanities Data Service (AHDS), recently led work on a JISC-funded study into the feasibility and requirements of preserving e-prints [9]. This report considered detailed issues such as file formats and metadata in the wider context of the e-print lifecycle, organisational and sustainability issues. Some of the key points about how e-prints can be preserved are discussed below.

Practical Issues: Technical Challenges

Digital preservation is often viewed as primarily a technical problem — and it would be foolish to pretend that preserving e-prints does not involve technical challenges. However, compared with the difficulties of preserving resources such as the BBC Domesday [10] system, the preservation of e-prints is relatively straightforward from a technical point of view. As with any electronic resource, the key challenge of preserving e-prints is overcoming frequent hardware and software obsolescence, ensuring that information ultimately encoded in a very user-unfriendly way as a series of 0s and 1s can continue to be decoded into more human readable forms.

This task is easier for e-prints than some other types of electronic resource because e-prints are, essentially, "paper documents made electronic". It is the method of delivery (rather than the file formats themselves) that distinguishes them from other types of electronic material. Currently, e-prints will usually contain only text and static images, which are among the simplest forms of electronic content to preserve. They seldom contain dynamic content, such as audio or real-time simulations, again probably because e-prints are still closely tied to traditional ideas of paper pre-prints, where such types of content are impossible. They are usually written and stored in ways that are designed to facilitate paper printing and publication. The file formats, metadata requirements and software applications used to manage and view e-prints are also used to manage and disseminate other forms of electronic content, which means that the e-prints community faces few technical problems that are unique to e-prints and can draw on solutions developed for other resources.

To keep the information in an e-print (as any electronic resource) accessible through more than one generation of hardware and software, it is important to know how the information was originally encoded. A strategy can then be developed to decode it in the future. Unfortunately, very little of this preservation metadata is currently collected for e-prints, to the extent that an e-print repository may not even be able to tell exactly what file formats it holds. You may know that it is HTML, but do you know which version of HTML it is, and is it actually valid HTML at all? It makes sense to collect this type of information when the e-print is first submitted, rather than trying to work it out years later once the formats have passed into history and information and expertise in them is rare. To collect this type of information easily, e-print repositories will need automated tools that can identify file formats. Some tools do exist, but it is fair to say that there is room for much more work, both in developing the tools and integrating them into e-print repository management software.
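As an illustration of the kind of automated identification described above, the following is a minimal sketch that inspects the first bytes of a submitted file. It is not one of the existing tools: the function name `identify_format` and the small signature table are assumptions chosen for the example, and a real tool would need a far larger registry of formats. Note that it only reads what an HTML file declares about itself; it does not check that the file is valid HTML.

```python
import re

# Leading byte signatures for a few formats a repository might receive.
# This table is illustrative and far from complete.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"%!PS", "PostScript"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
]

# Captures the public identifier of an HTML document type declaration,
# for example "-//W3C//DTD HTML 4.01//EN".
DOCTYPE = re.compile(rb'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def identify_format(data: bytes) -> dict:
    """Return a small preservation-metadata record for the given file content."""
    head = data[:1024]
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            record = {"format": name}
            if name == "PDF":
                # A PDF header carries its version, as in "%PDF-1.4".
                match = re.match(rb"%PDF-(\d+\.\d+)", head)
                if match:
                    record["version"] = match.group(1).decode("ascii")
            return record
    match = DOCTYPE.search(head)
    if match:
        return {"format": "HTML", "version": match.group(1).decode("ascii", "replace")}
    if b"<html" in head.lower():
        # HTML with no declaration: the version cannot be read from the file.
        return {"format": "HTML", "version": "undeclared"}
    try:
        data.decode("ascii")
        return {"format": "ASCII text"}
    except UnicodeDecodeError:
        return {"format": "unknown"}


print(identify_format(b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n<html></html>'))
```

Recording the result at the point of deposit, alongside the descriptive metadata, would give a repository the basic inventory of formats that the paragraph above notes is often missing.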

The limitations of print on paper are of course artificial in an electronic environment. Once researchers become comfortable with e-prints as a way of disseminating their work, they may start to create files that look more and more like multimedia products. While this is not yet a real issue, it may soon become one. Such a change in habits would increase the technical challenges involved in preserving e-prints. For example, the inclusion of more dynamic content would make preservation more difficult. Dynamic content tends to involve inherently more complex formats making it more difficult to separate data from the software that enables users to access the content. At present, e-prints are very self-contained, but there is potential for the text of an e-print that presents the analysis of data to be integrated more closely with the dataset itself. This would mean that the successful preservation of the e-print would be inextricably linked with the preservation of the data (and perhaps also the software tools for its analysis). Such developments will need to be carefully monitored.

Practical Issues: Organisational and Managerial Challenges

Many of the technical issues associated with the preservation of e-prints are becoming clearer, but there is more uncertainty when it comes to wider organisational and managerial issues. One key uncertainty is that of institutional commitment. Some very useful work has already been done in identifying incentives for institutions and their members to move ahead in this area. However, a great deal more needs to be done to get the issues on the agenda of more faculty, institutional managers, and other stakeholders in the scholarly communication process.

Funding is a particular challenge. The costs of digital preservation in general are still difficult to calculate, and it is unclear as yet how much of the work will be funded. It is equally unclear how open-access in general will be funded. Establishing costing and funding models for digital preservation of open-access materials is therefore doubly difficult. During the feasibility study mentioned previously, it became clear that any discussion of long-term preservation is premature for many institutional repositories while they rely on short-term project funding. More stable funding streams and more reliable costing models need to be established.

Organisational models for managing and preserving e-prints also need to be further investigated. The model that is being tested out by a number of projects (including SHERPA) is an institutional one. Institutions provide access to the information assets created by their members. However, digital preservation may not necessarily be an activity carried out by all institutions. Models emerging from early digital preservation activity suggest that consortia or supra-institutional agencies may be better suited to manage the complex issues involved. If this is the case, what is the relationship between the institutional repository (which manages access to content) and the preservation agency (which preserves it)? Within the SHERPA project we plan to consider this question as well as to analyse more distributed approaches to preservation.

One key question associated with the preservation of e-prints is "which ones should be preserved?" This is where selection and retention criteria are important since it is essential to recognise that this is not necessarily an "all or nothing" situation. One key aspect of preservation has always been selection: which documents are to be preserved and why? This applies as much in the print world as in the electronic. But whereas many of the issues are well understood for print on paper, they are still unclear for electronic resources. Further work needs to be done on this in relation to e-prints but a number of possible criteria might be suggested. It might be possible, for example, for a policy to be developed where only post-prints are preserved rather than pre-prints. A slightly more sophisticated version of this (so as not to exclude disciplines that do not have formal peer review practices as the absolute norm) would be the preservation of papers accepted for publication, as opposed to those not (yet) accepted. Alternatively, e-prints may be preserved where they are accompanied by other data, or where the e-print itself is a fuller version of the conventionally published paper. There may be local priorities for preservation. For instance, items may be preserved on a particular subject. Authors themselves may be asked to identify from their own works which ones they would regard as important for preservation. A pragmatic approach adopted by some institutions already is that papers submitted in particular file formats will be preserved.
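The criteria suggested above could be recorded as an explicit policy rather than applied case by case. The following is a minimal sketch under stated assumptions: the class names `EPrint` and `RetentionPolicy`, their fields, and the rule that any single criterion is enough to justify preservation are all hypothetical, chosen only to show that selection need not be "all or nothing".

```python
from dataclasses import dataclass


@dataclass
class EPrint:
    """Hypothetical summary of a deposited e-print (field names are illustrative)."""
    title: str
    file_format: str
    accepted_for_publication: bool = False
    has_accompanying_data: bool = False
    fuller_than_published_version: bool = False
    nominated_by_author: bool = False
    subjects: frozenset = frozenset()


@dataclass
class RetentionPolicy:
    """Hypothetical local policy: formats and subjects the repository commits to."""
    preserved_formats: frozenset = frozenset()
    priority_subjects: frozenset = frozenset()

    def reasons_to_preserve(self, eprint: EPrint) -> list:
        """List every criterion from the policy that the e-print satisfies."""
        reasons = []
        if eprint.accepted_for_publication:
            reasons.append("accepted for publication")
        if eprint.has_accompanying_data:
            reasons.append("accompanied by other data")
        if eprint.fuller_than_published_version:
            reasons.append("fuller than the published version")
        if eprint.subjects & self.priority_subjects:
            reasons.append("local subject priority")
        if eprint.nominated_by_author:
            reasons.append("nominated by author")
        if eprint.file_format in self.preserved_formats:
            reasons.append("supported file format")
        return reasons

    def should_preserve(self, eprint: EPrint) -> bool:
        # Assumption: meeting any one criterion is sufficient.
        return bool(self.reasons_to_preserve(eprint))


policy = RetentionPolicy(
    preserved_formats=frozenset({"PDF"}),
    priority_subjects=frozenset({"regional studies"}),
)
paper = EPrint("Accepted paper", "PDF", accepted_for_publication=True)
print(policy.reasons_to_preserve(paper))
```

Keeping the reasons, and not just a yes or no, means a later review of the policy can see why each item was retained.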

This last suggestion is interesting because, unlike the others, it bases selection on a technical characteristic of the e-print rather than the e-print's subject matter or publication status. Such a policy will itself have significant cost implications. In very general terms, the more complex a file format, the greater the potential cost of preserving content held in that file format. Thus an ASCII text file should be cheaper to maintain than an HTML file. For a start, since it lacks mark-up, the ASCII file will be smaller in storage terms. It requires less sophisticated software to view, and has also proven more stable than HTML (ASCII text has remained the same during the period of time HTML has gone through four versions, and development has moved on into XHTML).

Conclusion

Digital information is lost when it is left unattended while hardware, software and media continue to develop. Without intervention, an e-print may be subject to media degradation within a few years. Even if the e-print is securely backed up, a few more years will see the e-print's content become inaccessible as software and hardware change. Without a strong institutional commitment, institutional e-print repositories will be unable to preserve their holdings, and they may also struggle to convince faculty to deposit work. At present the apparent assumption among parts of the e-print community is that decisions about preservation can be left until later, but this does not fit well with much advice on digital preservation, which emphasises taking action early in the life-cycle of an electronic resource to make it simpler and less expensive to maintain in the future.

Nevertheless, repository managers should not lose sight of the immediate importance of filling the repositories. There is certainly no reason to delay moving forward on this issue. It is essential for the increasing number of faculty and information professionals who recognise the potential importance of e-print repositories to seek to bring about change in their local communities and beyond. Filling the repositories also requires significant institutional commitment. The question of digital preservation should not be ignored but nor should it act as a brake on getting content in place.

Filling the repositories and preserving their content need not be mutually exclusive. They are two parts of the same ultimate objective — providing easy access to the literature, now and in the future. Even in the short term, the provision of access might further the requirements of preservation. The widespread use of Dublin Core and OAI as a framework for the discovery of e-prints also serves to help standardise metadata, which is also used in preservation. Restricted file format choices aid the reader and repository manager now, and will make preservation more feasible by simplifying the problem.

As SHERPA and other similar projects begin to test out many of these issues, it is important to retain a clear vision of the purpose of e-print repositories. E-print repositories are about improving the scholarly communication process by providing easy access to the literature. If preservation of e-prints is to be important at all, it must be preservation for access.

Notes and References

[1] Stevan Harnad, "For whom the gate tolls? How and why to free the refereed research literature online through author/institution self-archiving, now". 2001. Available at: <http://www.cogsci.soton.ac.uk/~harnad/Tp/resolution.htm>. See also Harnad's contributions to the September98 forum discussion list, for example <http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2681.html>. Stevan Harnad has confirmed in a personal communication that this is an accurate summary of his views.

[2] Stevan Harnad, contribution to the September98 forum discussion list <http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2681.html>.

[3] Peter Hirtle, "Editorial: OAI and OAIS: What's in a name?" D-Lib Magazine 7(4) April 2001 <doi:10.1045/april2001-editorial>.

[4] Steve Lawrence, "Free online availability substantially increases a paper's impact". Nature, 411, 6837, p. 521, 2001. Nature: webdebates. Available at: <http://www.nature.com/nature/debates/e-access/Articles/lawrence.html>.

[5] Steve Lawrence, "Online or invisible". Available at: <http://www.neci.nec.com/~lawrence/papers/online-nature01>.

[6] See <http://www.eprints.org/self-faq/#libraries-do>.

[7] SHERPA <http://www.sherpa.ac.uk>. See also John MacColl and Stephen Pinfield, "Climbing the scholarly publishing mountain with SHERPA". Ariadne, 33, September-October 2002. Available at: <http://www.ariadne.ac.uk/issue33/sherpa/>.

[8] See Stephen Pinfield, "Open Archives and UK Institutions: An Overview". D-Lib Magazine 9(3) March 2003. Available at <doi:10.1045/march2003-pinfield>.

[9] Hamish James et al. Feasibility and requirements study on preservation of e-prints. Report commissioned by the Joint Information Systems Committee (JISC), 2003. Available at: <http://www.jisc.ac.uk/uploaded_documents/e-prints_report_1-0.pdf>.

[10] CAMILEON <http://www.si.umich.edu/CAMILEON/domesday/domesday.html>.

Copyright © Stephen Pinfield and Hamish James

DOI: 10.1045/september2003-pinfield