D-Lib Magazine
May 1998

ISSN 1082-9873

Archives in a New Paradigm of Scientific Publishing

Physical Review Online Archives (PROLA)


Timothy Thomas
Computer Research and Applications Group
Los Alamos National Laboratory
Los Alamos, New Mexico
trt@lanl.gov

Introduction

A new vision of scientific publishing is emerging in the Physics community. It is based on three fundamental elements: a preprint server, an electronic peer-reviewed, edited journal, and an electronic archive of past published papers. The preprint server offers speed, openness and flexibility. The journal offers validated, certified statements of accepted progress. The online archive offers a desktop-accessible statement of the established foundations of scientific truth in Physics.

This vision began to be articulated by a variety of people at Los Alamos National Laboratory about five years ago -- just as the tools that would enable it were emerging. The preprint server was first off the block, since it was based on the already well-understood list-serve technology. A prototype of the archives, based on the then-new Wide Area Information Server (WAIS) technology, followed. The American Physical Society (APS)[1] took on the task of incubating an electronic version of the existing journal Physical Review based on the new Standard Generalized Markup Language (SGML). Now, after a long story of many difficulties overcome, that long-ago vision is almost ready for realization -- and it is remarkably close to the original idea. True, the technology has dramatically advanced, primarily with the advent of the Web, reduction of storage costs, and improvements in retrieval engines, but the basic vision remains the same.

Unexpectedly, the hardest problem turned out to be the archive, and not because of technical problems, though there were plenty of those, but because of legal, business-case, cultural, and institutional problems. The fundamental technical problem was how to produce, navigate, store, index, and distribute the massive number of page images required. That was solved early, and with each turn of the technological wheel, got easier, faster, and cheaper. The fundamental institutional problem was that the owner of the material was the publisher, who traditionally viewed its business exclusively as publishing a journal, not as operating an archive.

Archiving has always been the responsibility of libraries, which unfortunately are also the most important customers of a scientific publisher. The basic conundrum is that if publishers make an electronic archive available, they will undercut one of the most important activities of their major customers, that of archiving the historical record. An online archive makes it unnecessary to go to the library to access old copies of the journal; therefore, the library loses its incentive to subscribe to the journal. Furthermore, an archive searchable by title, author, full text, etc. moves the task of locating the information from the library catalogue to the desk of the user. It also assigns the responsibility for supporting that service to the publisher, not the librarian. Thus, no matter how you look at it, an electronic archive, offered by the publisher, is a threat to the publisher's major customer -- not a situation that encourages rapid innovation.

When, some years ago at Los Alamos, we implemented the ability to hyperlink between any article and its errata, comments and references, this conflict with library functioning was strongly highlighted. Naturally, the demonstrated tool only worked between articles owned by the APS. Referencing other publishers' holdings would require reciprocal agreements. APS took a very cautious view of those negotiations, arguing it was a publisher, not a library. As a professional society, APS had enough problems trying to convert the production cycle to electronic form. However, had the society vigorously pursued those agreements, the conflict with the traditional library function of bridging the gaps between different publishers' journals would have been starkly illustrated. Thus, the sequence of deployments turned out to be: pre-print server, e-journal, archives.

The Problems with Libraries as Electronic Archivists

Why wasn't the archiving task originally given to a library? A library would no doubt have found the problem of enormous interest and well within its technical scope. The answer is simply that the publisher suspected that the huge store of already published journals (which the society owned, but received only a tiny income from) could be a major source of funds in the future. Possibly this new revenue stream could replace the funds that would be lost when the libraries canceled their subscriptions in response to available e-journals and electronic archives. True, the library might be encouraged to continue a subscription with the justification of maintaining access for its narrow community, but for a global system for scientific literature, the subscription method would be a very inefficient and awkward way of maintaining access. A country-wide support payment or some pay-per-view method would ensure greater access at much less cost and effort. Once the electronic archive is up and reliable, the libraries will no longer be able to justify heating, air-conditioning, and guarding the paper copies they currently hold, and will cull the paper. With the paper will very likely go the feeling of responsibility for maintaining the collection. These revolutionary possibilities inspired caution, not bold action.

The Problems with APS as Electronic Archivists

The APS is a not-for-profit publisher. The society functions on behalf of its members. And the members, having heard of the APS-supported Physical Review OnLine Archives (PROLA) project[2], were beginning to demand access to the service. After all, the pre-print server, while admittedly a somewhat easier technical challenge, was storming ahead. The APS's answer to this demand was to continue funding the PROLA project as a pilot at Los Alamos. Thus, the society was visibly working the problem, and could show the functioning system to anyone interested. They could then take the time necessary to figure out how to extend their publishing business to include archiving. Meanwhile, the Los Alamos team kept adding more and more tools to PROLA, such as a system that guessed the reference you were looking for by permuting the information you entered, and tools that estimated the value of articles by posting the number of times an article was referenced, or accessed. As part of the project, physicists at Los Alamos began to use the system, and its potential value became more apparent.

At that point, APS hired Mark Doyle, who had worked at Los Alamos on the pre-print system, and assigned him the task of filling in the missing articles and porting PROLA to APS. The existing PROLA was deeply integrated into the computer infrastructure at Los Alamos, and could not be easily moved. This was because, a few years ago, sufficient high-speed disk storage could not be obtained at a reasonable cost, so the tape/robot mass storage system at Los Alamos was used. With the advent of RAID technology this problem was solved. APS purchased a BoxHill 200-gigabyte disk farm and began the porting.

There was also another problem that became very apparent at this point -- that of integrating the now complete online e-journals with PROLA. To accomplish this, the first idea was to specify an "archive time". E-journal issues would be published, held on the publication server for a fixed time, then moved to PROLA. This would allow the archives and the e-journals to have a separate organization and style. However, it quickly became apparent that this idea made little sense. Rather, it made more sense to simply add new articles directly to the archive as they appeared and make the archive seamless with the journal, as it is on the library shelves. However, in order to make it seamless, the organization and look-and-feel of the two systems would have to be harmonious, and they were not.

PROLA, which had developed independently of the electronic journal, was organized differently. At its core was a doc-info page for each article, which contained the bibliographic data and links to all the elements related to an article (such as references, abstract, page images, printable versions, PDF versions, errata, comments, lists of articles that referenced that article, etc.). In front of that page were three navigation methods: browsing, searching, and retrieving. Browsing primarily used a table of contents to find the doc-info page of interest; searching used a fielded WAIS index; and retrieval used a simple database lookup. The digital version of the journal Phys. Rev., on the other hand, was centered around a fixed, "authorized" version of the full article in SGML displayed via a PDF version. The articles were accessible via direct links from the table of contents, and browsing was organized around "issues". These differences arose naturally from the differences between an archival library and a published journal. In order to unify the two systems, something would have to give. Since the publisher owned the material, it was the archival organization that had to go; otherwise APS would have had to redesign their "published journal" version to make it more like an archival version. That conflicted with all their instincts, and they inevitably imposed the digital journal system on the archive to make it more seamless with what they viewed as their major activity -- publishing current journal issues.
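The doc-info organization described above can be illustrated with a minimal sketch. It assumes a small in-memory store; the names DocInfo, Archive, browse, search, and retrieve are hypothetical and are not taken from PROLA, and the substring match in search merely stands in for the fielded WAIS index.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DocInfo:
    """Hypothetical doc-info record: bibliographic data plus links to related elements."""
    doc_id: str
    title: str
    authors: List[str]
    volume: int
    first_page: int
    # Maps an element name ("abstract", "errata", ...) to its location.
    links: Dict[str, str] = field(default_factory=dict)


class Archive:
    """Three navigation methods in front of the same doc-info records (illustrative only)."""

    def __init__(self) -> None:
        self._by_id: Dict[str, DocInfo] = {}
        self._by_citation: Dict[Tuple[int, int], DocInfo] = {}
        self._toc: Dict[int, List[DocInfo]] = {}

    def add(self, doc: DocInfo) -> None:
        self._by_id[doc.doc_id] = doc
        self._by_citation[(doc.volume, doc.first_page)] = doc
        self._toc.setdefault(doc.volume, []).append(doc)

    def browse(self, volume: int) -> List[DocInfo]:
        # Browsing: walk a table of contents, ordered by first page.
        return sorted(self._toc.get(volume, []), key=lambda d: d.first_page)

    def search(self, field_name: str, term: str) -> List[DocInfo]:
        # Searching: a case-insensitive substring match on one named field,
        # standing in for a fielded full-text index.
        term = term.lower()
        hits = []
        for doc in self._by_id.values():
            value = getattr(doc, field_name)
            text = " ".join(value) if isinstance(value, list) else str(value)
            if term in text.lower():
                hits.append(doc)
        return hits

    def retrieve(self, volume: int, first_page: int) -> Optional[DocInfo]:
        # Retrieving: a direct lookup by an already known citation.
        return self._by_citation.get((volume, first_page))
```

Whichever route a reader takes, all three methods return the same record, which is the sense in which the doc-info page sat at the core of the archival organization.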

I believe that this phase will be temporary, and that as the usage patterns emerge, it will become obvious to APS that the archival format is the more natural for the electronic world. At that point, they will reinstate the doc-info page, and again hang off of it the vast variety of dynamic and diverse elements that make up an archival version.

At first, it wasn't clear how dynamic this archival version really was. We initially thought that an archive was a static system, but soon discovered that it was constantly changing. Every time we added a new article, all the articles that it referenced had to be updated to link back to the new article in the "articles that reference this article" list that hung off each doc-info page. Then, there was the constantly improving text version. For example, the Web continues to support more and more elements required for the display of mathematical formulas. Then, there were the constantly evolving image-processing tools, and printing tools, and so on. One thing is now clear: an electronic archive is not static and will require continuous support and updating to stay modern and useful.
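The back-linking behavior described above, where adding one article changes the records of every article it references, can be sketched as follows. This is only an illustration under assumed names (CitationIndex, add_article, cited_by are hypothetical), not the mechanism PROLA actually used.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class CitationIndex:
    """Keeps each article's reference list and the reverse 'referenced by' list."""

    def __init__(self) -> None:
        self.references: Dict[str, List[str]] = {}
        self.cited_by: Dict[str, List[str]] = defaultdict(list)

    def add_article(self, doc_id: str, references: Iterable[str]) -> List[str]:
        """Register a new article and return the older articles whose
        'articles that reference this article' list had to change."""
        refs = list(references)
        self.references[doc_id] = refs
        touched = []
        for ref in refs:
            if doc_id not in self.cited_by[ref]:
                self.cited_by[ref].append(doc_id)
                touched.append(ref)
        return touched
```

The returned list makes the cost visible: every addition to the archive is also an update to some number of existing records, which is why the archive could not be treated as static.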

Some General Problems with Electronic Archives

Occasionally, it is suggested that maintaining accessibility to current forms of electronic media far into the future will be a problem for electronic archives. Our experience is, "We wish!" We wish we could leave the system alone long enough for the media to age. In fact, we have been constantly required to upgrade the tools and storage methods, not because the old ones don't work, but because users continuously demand new functions and better service. This expectation of constant improvement is much more intense in cyberspace than in the old paper-based archival world. This is because the Web instantly exposes the typical user to the most advanced technology every time he visits his favorite commercial site. No archivist of the future will have the luxury of contemplating her collection as it slowly ages; she will be scrambling to add new features and meet new demands.

I suspect this fear of media aging is based on second-hand experience with essentially unused archives, those which are simply put on tapes and moved to a warehouse. In those cases media aging may be an issue. But PROLA's version of an archive is nothing like putting old back-up tapes in a warehouse. Rather, it is a vision of a heavily used, global system that provides an essential and constantly evolving functionality in the support of science.

From the publisher's point of view, the biggest problem posed by an electronic archive was the unknown financial impact. What would the added costs be? What income could be generated, and how would it impact current financing systems? Initially, several alternative business models were considered, including pay-per-view, the country license, site subscriptions, and individual subscriptions. In the end, it was judged that potential usage patterns were insufficiently known, and that no permanent decisions could be made at that time. In the interim it was decided to explore the idea of adding free access to PROLA to all subscriptions to the electronic journal. However, that decision is still under review. A free/pay line, which separates material that is freely available from that which requires payment, is very likely to be established in line with the existing e-journals. Most likely, abstract and bibliographic elements will be free, and access to the full article will require payment. What to do with comments, errata, reference-to and reference-by lists, etc., is still under active consideration. On the costs side, again, insufficient information is available. Particularly vexing are the unknown costs of new hardware required to keep abreast of rapidly changing technology. Initially, the requirement for continued staffing support was underestimated, since the archive was viewed as much more static than it actually turned out to be.

Up to now, all we have are guesses about both sides of the financial equation. However, we do know, with absolute certainty, that the costs of a global electronic archive are going to be much, much less than those of the existing system of paper copies in thousands of distributed locations -- and the availability and usefulness are going to be much, much greater. It is this certainty that helps drive the project forward, in spite of the manifest social and financial difficulties.

PROLA is a global system by nature, both from a technical and from a customer point of view. As such, it probably should have a global financing support system. Interestingly, the European physicists I talked to tended to view the natural financial system as that of a country license, while the Americans tended to think the library license was most natural. The fact that American libraries are actually conduits for funds from the government to the publisher is often ignored by American users, who seem to prefer to imagine that the libraries have independent sources of funds from their customers. I believe a global information system like PROLA should have a global financial system to support it. However, that said, I cannot identify any mechanism to provide that global support aside from pay-per-view, which would not be a good way to provide universal access.

Physics Archives in Particular

From my perspective as a cognitive psychologist, physicists have an odd view of their literature. Some have told me, for example, that no paper over five years old is of any value whatsoever. In this view, the pre-print server is by far the most important leg of the triumvirate, the published papers are only for tenure, and the archives are useless.

From my view, archives actually play an essential role in any science. Science is after all a social activity. The social nature of science is critical for establishing truth. Truth in Physics is established by the submission of ideas and results with their associated mathematical, logical, and experimental evidence to a public forum for evaluation. That public forum is made up of competent scientific peers who share a common understanding and are potentially capable of replicating the results and confirming the mathematics. Furthermore, if a new result can be shown to be in conflict with earlier established and confirmed results, then a resolution of the conflict is required before the new idea can be accepted as true.

The archives play a critical role in this process of establishing truth. They provide a quick and efficient way to locate prior results, to identify related derivations, to reference prior ideas, and in general, they provide a context in which to frame the debate. Without any archives, the arguments and objections that properly greet any new idea would be very difficult to resolve, or even to carry on, since we would have to rely on faulty human memory for holding all the accumulated knowledge of our predecessors.

When considering the role of archives, I tried to discover if the existing paper archives were used at all, and if so, what they were used for. The first question was easy to answer. At Los Alamos, they have a complete collection of Phys. Rev. from its inception, and they have a simple system to crudely reveal usage. They put a colored sticker on the journal's binding when it is re-shelved for the first time in a given year. With this system, a glance at the shelves shows that virtually all the bound journals are removed at least once in every year, so they are being used -- for something. Perhaps users are checking spelling, perhaps looking for content, perhaps validating required "literature searches", or perhaps something not yet identified. One of the advantages of PROLA is that when it is fully up and working, these questions should be much easier to answer, and the publisher will have a real economic reason to answer them.

Conclusion

In spite of many difficulties and many obstacles, the new vision of scientific publishing is becoming a reality in Physics. The pre-print server is already functioning, the electronic journals are now available, and the electronic archives are about to be deployed. I expect that these new tools and methods will, when they are fully utilized, dramatically speed the course of innovation and discovery in Physics, greatly increase the global participation in Physics, and alter the scientific landscape in ways not yet anticipated.

These are certainly interesting times!


[1] The author is the PROLA Program Manager at Los Alamos, and can in no way speak for the American Physical Society. APS is currently in the process of a major redesign of PROLA to make it more compatible with their existing systems. Los Alamos's responsibility was to design and implement a prototype system. The system that will eventually be deployed, and how it will be financially supported, is entirely the responsibility of the American Physical Society.

The opinions, views, and interpretations herein expressed are those of the Author and do not necessarily reflect those of the Los Alamos National Laboratory (LANL) or the Government.

[2] The details of the PROLA system are described in a technical article to follow in the next issue of this journal.

hdl:cnri.dlib/may98-thomas