Creating a Networked Computer Science Technical Report Library

James R. Davis
Design Research Institute
Xerox Corporation
502 Rhodes Hall
Cornell University
Ithaca, NY 14853
[email protected]

D-Lib Magazine, September 1995

Computer scientists have long used the Internet to transport reports and documentation of many kinds, including, but not limited to, technical reports about computer science. The material available on the Internet has grown in size and, equally important, has become better organized. Three pioneering systems, UCSTRI, WATERS, and Dienst, have brought us within reach of a true computer science technical report library, with a collection built from the technical reports of the nation's computer science and engineering universities and research laboratories. This article describes the Networked Computer Science Technical Report Library (NCSTRL), an attempt to reach this goal.

First Efforts to Build a Library

NCSTRL (pronounced "ancestral", and hopefully connoting a bountiful progeny) is a direct descendant of two earlier systems: WATERS [French], which was sponsored by the National Science Foundation (NSF), and Dienst [Lagoze], which was created as part of the CSTR project sponsored by the Department of Defense's Advanced Research Projects Agency (ARPA). Both were inspired by the pioneering Unified Computer Science Technical Reports Index (UCSTRI [VanH]) system from Indiana University.

Until recently, the most common method for sharing files between two users was the File Transfer Protocol (FTP). FTP allows a user on one machine to log into a second machine and then move files between the two. FTP was often used in a simplified form called "anonymous" FTP, which allowed read-only copying without logging in. With anonymous FTP, a site could "publish" (or make available) materials simply by placing them in anonymous FTP directories. Any user on the Internet who knew of their existence could copy the files. As institutions began to place documents online in this way, a rudimentary online collection began to grow.

This collection was not yet a library, for it was disorganized and lacked effective means of being searched. There was no way, for example, for users to know which machines had anonymous FTP directories, or what was in those directories. It soon became common for FTP directories to include small "README" files containing indexes or annotated listings of the contents. This helped somewhat, provided one knew at least the name of the machine with the collection. Visiting, say, Cornell's FTP server, you would first list the contents. If you spotted a file with a likely name (e.g. "index" or "README"), you could copy it, then read it, looking for interesting material, noting the names of interesting files, and then go copy those. Finally, you were ready to read the files. It was a cumbersome process, and if you wanted to search several sites, you had to repeat the process at each one.

At this point, several new protocols appeared: WAIS, Gopher, and HTTP. While they differ from each other in many ways, they are similar in that each supports some form of search. Where FTP helps you copy a file whose name you know, these three protocols help you learn the name of that file. For example, you could send WAIS a list of words likely to be found in the title of a report of interest, and it could tell you the names of files with those words in the title. The unifying idea of these three systems is that of a catalog record, which is to a file in an FTP directory as a card catalog card is to a book.

The concept of a catalog record was integral to Marc VanHeyningen's Unified Computer Science TR Index (UCSTRI), which was the first (May 1993) comprehensive attempt to solve three limitations of an (FTP-based) Internet library and take advantage of World Wide Web technology. You no longer had to know the names of each FTP site, you no longer had to search each in turn, and you didn't have to know the file name. UCSTRI is "unified" because it gathers README files from nearly 200 FTP directories and uses them as catalog records. A user sends UCSTRI a search query, consisting of one or more keywords, and UCSTRI replies with those portions of the README files that match the keywords. A user could now search 200 sites with a single command. Since UCSTRI also includes a World Wide Web URL for each document, a user can jump directly to documents of interest and follow threads to linked resources.
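The unified search described above can be illustrated with a minimal sketch. It is not UCSTRI's actual code. The host names, README fragments, and the rule that an entry must contain every keyword are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ReadmeEntry:
    """One fragment of a site's README file describing a single report."""
    site: str   # FTP host the README was gathered from (hypothetical)
    text: str   # free-form, human-written description
    url: str    # link to the report itself


def search(entries, query):
    """Return the entries whose text contains every keyword, ignoring case."""
    keywords = query.lower().split()
    if not keywords:
        return []
    return [e for e in entries
            if all(k in e.text.lower() for k in keywords)]


# Made-up fragments standing in for README files gathered from many sites.
INDEX = [
    ReadmeEntry("ftp.site-a.example",
                "TR93-12.ps  A survey of distributed file systems",
                "ftp://ftp.site-a.example/pub/tr/TR93-12.ps"),
    ReadmeEntry("ftp.site-b.example",
                "tr-41.ps.Z: Notes on parallel garbage collection",
                "ftp://ftp.site-b.example/reports/tr-41.ps.Z"),
    ReadmeEntry("ftp.site-b.example",
                "tr-42.ps.Z: Distributed shared memory",
                "ftp://ftp.site-b.example/reports/tr-42.ps.Z"),
]

if __name__ == "__main__":
    # One query covers every gathered site at once.
    for hit in search(INDEX, "distributed"):
        print(hit.site, hit.url)
```

Because the entries are plain text written for human readers, a match of this kind can only be as good as the README files themselves.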

UCSTRI succeeded in showing that the Internet "collection" was large enough to be a useful resource, once organized for searching. UCSTRI's weakness lay in the "catalog records" it used. README files are intended for human eyes, and are thus written in natural language rather than a structured format that would better suit an automated system. Beyond this technical problem, UCSTRI also faced a social problem, namely that there was rarely any institutional support for maintaining the accuracy and completeness of the files in the same manner that a librarian would maintain a card catalog.

The next systems, WATERS and Dienst, improved on UCSTRI. WATERS, which also appeared in the summer of 1993, mandated a uniform file format for cataloging, and included a tool to help site administrators maintain the catalog files. Using a single format (an extension of refer) made search more reliable. The maintenance tool (techrep) reduced the effort required to keep the refer files current. But these benefits also raised the cost to participate relative to UCSTRI. Sites distributing works with UCSTRI didn't have to do anything they weren't already doing, because UCSTRI copied the index files itself, so the effective cost was zero. Sites running WATERS, by contrast, had to install and run some code to participate. Perhaps because of this cost, WATERS did not unleash a flood of new reports.
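To show why a structured format makes search more reliable than free text, here is a small sketch that parses refer-style records. It assumes only the basic refer conventions (a percent sign and a one-letter code starting each field, blank lines between records). The extensions WATERS added are not described in this article and are not modeled, and the sample records are invented.

```python
def parse_refer(text):
    """Parse refer-style text into a list of records.

    Each record maps a one-letter field code to a list of values,
    because a field such as %A (author) may repeat within a record.
    Lines that do not start with '%' are ignored in this sketch.
    """
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        if line.startswith("%") and len(line) >= 2:
            code, value = line[1], line[2:].strip()
            current.setdefault(code, []).append(value)
    if current:
        records.append(current)
    return records


# Invented sample: %A author, %T title, %R report number, %D date.
SAMPLE = """\
%A A. Author
%A B. Author
%T An Example Technical Report
%R TR 95-01
%D 1995

%A C. Author
%T Another Example Report
%R TR 95-02
%D 1995
"""

records = parse_refer(SAMPLE)

if __name__ == "__main__":
    for record in records:
        print(record["R"][0], "|", record["T"][0], "|", ", ".join(record["A"]))
```

With fields separated in this way, a search can be restricted to titles or authors, which is not possible with README text.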

The Dienst system, which began in October 1993, introduced further innovations beyond those of UCSTRI. First, the search system allowed searches by author, title, or abstract. Second, its user interface allowed users to view documents in several different ways. Users could read documents a page at a time on the screen, or browse miniature page images, or print documents.

Dienst's real power came from features not directly apparent to end users. First, where UCSTRI and WATERS dealt with files, Dienst handled documents, which represent the same content independent of the format. You may be reading this article with a Web browser, as HTML, or you may have printed it out, as PostScript. If your vision is impaired, you might store it as an audio file. All three formats would be stored in distinct files and would have different names, hiding the underlying unity in content. Dienst would consider them all variations of the same document. Second, the Dienst architecture is richer than that of UCSTRI or WATERS, comprising three underlying services: the Index Server, which searches catalogs to find documents; the Repository Server, which stores documents in various file formats; and the User Interface Server, which mediates all human use of the system. Separating the human interface from the other two components makes it easy for other researchers to create systems that attach to the Dienst collection and perform new kinds of services on it, such as the contents alerting service provided by SIFT [Yan].
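The distinction between a document and its files, and the separation of index from repository, can be sketched as follows. This is an illustration only, not the Dienst protocol or its code. The class names, the identifier "demo/davis95", and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """One logical document, independent of how it is stored."""
    doc_id: str
    title: str
    author: str
    abstract: str = ""
    formats: dict = field(default_factory=dict)  # format name -> file name


class Repository:
    """Stores documents and hands back the file for a requested format."""

    def __init__(self):
        self._docs = {}

    def add(self, doc):
        self._docs[doc.doc_id] = doc

    def all(self):
        return list(self._docs.values())

    def file_for(self, doc_id, fmt):
        return self._docs[doc_id].formats[fmt]


class Index:
    """Searches bibliographic fields and returns document ids, not files."""

    SEARCHABLE = ("author", "title", "abstract")

    def __init__(self, docs):
        self._docs = list(docs)

    def search(self, field_name, word):
        if field_name not in self.SEARCHABLE:
            raise ValueError("cannot search field: " + field_name)
        word = word.lower()
        return [d.doc_id for d in self._docs
                if word in getattr(d, field_name).lower()]


# This article in three formats, all variations of one document.
article = Document(
    doc_id="demo/davis95",
    title="Creating a Networked Computer Science Technical Report Library",
    author="James R. Davis",
    formats={"html": "davis95.html",
             "postscript": "davis95.ps",
             "audio": "davis95.au"},
)
repo = Repository()
repo.add(article)
index = Index(repo.all())

if __name__ == "__main__":
    # A front end asks the index for ids, then the repository for a format.
    for doc_id in index.search("title", "library"):
        print(doc_id, repo.file_for(doc_id, "postscript"))
```

Since the index returns identifiers rather than file names, any other program can use the same two services without going through the human interface.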

Why NCSTRL?

If all the ancestors of NCSTRL were successful, why do we still lack a universal online library of computer science technical reports? To answer this, members of the WATERS group and the Dienst group met for a two-day workshop on April 7-8, 1995. We decided there were two main reasons. First, we had relied too much on word of mouth to inform departments of the possibilities of our projects. Second, many departments that were aware of the systems believed the perceived costs of installing either package still outweighed the perceived benefits.

The possible costs included the risk of obsolescence. The two services were incompatible, so sites were apparently reluctant to choose, opting instead to wait to see which service was likely to prevail. The benefits were, we thought, undervalued. To an extent, the Web is still seen as more of a playground than a vehicle for serious academic work, and some potential sites might not have recognized the benefits of participating in a project like NCSTRL.

We believe NCSTRL provides scholarly and financial advantages to all its users. Researchers will be able to easily search a body of material that is now diffuse and both slow and difficult to access. Authors will gain a wider audience than they now enjoy. In particular, since NCSTRL (like the earlier systems) searches all sites, authors at less well-known institutions have an equal chance of at least having their reports noticed. Both these advantages grow as more sites participate. Departments will gain a clean, effective management system for their technical reports and will eliminate much of their current copying and mailing costs. The savings at Cornell alone are estimated to be in the thousands of dollars.

At the April meeting, we determined that it was possible to create a single unified CS technical report service, combining the collections of both Dienst and WATERS, and capable of expansion to those sites currently running neither. This was the beginning of NCSTRL. The initial NCSTRL meeting also included a representative from the Computing Research Association (CRA), an association of more than 160 North American academic departments of computer science and computer engineering, industrial laboratories engaging in basic computing research, and affiliated professional societies. The CRA represents nearly all the potential contributors to NCSTRL, and its participation makes it more likely that NCSTRL will become universal where the earlier systems were not. We demonstrated a prototype of NCSTRL to the CRA board in July of this year, and the board strongly endorsed the concept of unified electronic access to CS technical report literature and encouraged member institutions to catalog their reports using the NCSTRL system.

A Technical Infrastructure Based on Interoperating Distributed Servers

The NCSTRL architecture combines the power and flexibility of Dienst with the ease of installation of WATERS. The technology underlying NCSTRL is a network of interoperating digital library servers. The digital library servers provide three services: repository services that store and provide access to documents, index services that allow searches over bibliographic records, and user interface services that provide the human front-end for the other services. The services interoperate using an open protocol, so that other software systems can use them as well. The power of NCSTRL comes from the architecture, while the ease of installation comes from having two levels, Lite and Standard. The Lite version is intended for sites with few resources, and will have a lower startup investment, while the Standard version will offer greater functionality. Sites participating in NCSTRL will be able to install either. No matter which they install, the complete technical report collection will be available to all parties. NCSTRL will have a uniform user interface, hiding almost all the underlying diversity. Users should not need to know which level of software a site is running, and departments will have a smooth upgrade path from the basic to the advanced should they desire additional capability.

The NCSTRL Lite software imposes about the same cost of participation as WATERS does. Participating departments (or sites) need only put their reports online (via FTP or HTTP), and edit a catalog file with the techrep tool. The three high-level library services (repository, index, UI) will be performed for the Lite sites by a dedicated, central NCSTRL server, which acts as an NCSTRL gateway for all the Lite sites. The user interface and index services for a Lite site execute on the central server, and are therefore not customizable or extensible by the sites, unlike sites running Standard. The Lite user interface does not provide the thumbnail browsing and online reading found in the Standard user interface, but otherwise the distinction between Lite and Standard sites is invisible to the user. All documents from all institutions are searchable through any of the user interface gateways. The results of searches are selectable links to the repository copies of documents that match the search criteria.
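The way a gateway can hide the difference between Lite and Standard sites may be sketched as below. This is a simplified illustration under stated assumptions, not the NCSTRL software. The function names, titles, and URLs are hypothetical, and a real index service would be reached over the network rather than called directly.

```python
def make_index(entries):
    """Return a search function over (title, url) pairs."""
    def search(word):
        w = word.lower()
        return [(title, url) for title, url in entries if w in title.lower()]
    return search


# Index held on the central server on behalf of all Lite sites (made-up data).
lite_gateway_index = make_index([
    ("Scheduling on Small Clusters", "ftp://ftp.lite-a.example/tr/95-03.ps"),
    ("Cluster File Layouts", "http://www.lite-b.example/tr/cfl.ps"),
])

# Index run by a Standard site itself (made-up data).
standard_site_index = make_index([
    ("Fault-Tolerant Cluster Membership", "http://dienst.std-a.example/TR95-7"),
])


def search_all(word, indexes):
    """Query every index service and merge the results into one list of links."""
    results = []
    for search in indexes:
        results.extend(search(word))
    return results


links = search_all("cluster", [lite_gateway_index, standard_site_index])

if __name__ == "__main__":
    # The user sees one list and cannot tell which kind of site holds a report.
    for title, url in links:
        print(title, "->", url)
```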

The NCSTRL Standard software will, for the most part, be an extended version of Dienst, with two exceptions. First, where Dienst used an ad-hoc protocol for location-independent naming of documents, NCSTRL will use the handle server architecture developed at CNRI, providing scalable, extensible location-independent naming for documents [Arms]. Second, NCSTRL will have a small amount of fault tolerance by replicating index records on a Backup server. Thus, when a site is down, it will still be possible to identify interesting reports stored there although the full texts may be temporarily unavailable.
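The fault tolerance described above amounts to falling back on a replicated index when a site cannot be reached. The following sketch shows the idea under simplifying assumptions. The class names, the SiteDown exception, and the identifiers are hypothetical, and the handle system itself is not modeled.

```python
class SiteDown(Exception):
    """Raised when a site's index server cannot be reached."""


class IndexServer:
    """Holds index records mapping a document id to its title."""

    def __init__(self, records, up=True):
        self.records = dict(records)
        self.up = up  # simulated availability, for the example only

    def search(self, word):
        if not self.up:
            raise SiteDown()
        w = word.lower()
        return sorted(i for i, title in self.records.items()
                      if w in title.lower())


def search_with_backup(primary, backup, word):
    """Return (matching ids, whether full texts are currently reachable)."""
    try:
        return primary.search(word), True
    except SiteDown:
        # The backup holds only replicated index records, so reports can be
        # identified but their full texts may be temporarily unavailable.
        return backup.search(word), False


RECORDS = {
    "demo/TR95-01": "Replicated Index Records",
    "demo/TR95-02": "Naming Documents",
}
site = IndexServer(RECORDS, up=False)   # the site is down
backup = IndexServer(RECORDS)           # replica on the Backup server

ids, full_text_available = search_with_backup(site, backup, "index")

if __name__ == "__main__":
    print(ids, "full text available:", full_text_available)
```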

Interoperation of digital library services through an open protocol is an important aspect of the system. It allows access to the collection and its indexing data not only from the user interface services, but from other value-added services. We foresee that the open architecture of NCSTRL will make it a significant corpus for a variety of research projects. The open architecture also makes NCSTRL extensible, and therefore capable of migrating to more advanced technologies as these come along. To ensure that NCSTRL is designed and operated so as to be capable of such migration, a working group for NCSTRL was set up as part of the D-Lib Forum.

Participating in NCSTRL

Participation in NCSTRL is open to all academic departments awarding a Ph.D. in computer science or engineering and to research facilities of industry and government. Further information is available at http://willow.tc.cornell.edu/Info/about-ncstrl.html.

Further Plans for NCSTRL

We view NCSTRL as a pilot system, not a research system. When NCSTRL opens (on November 1, 1995), most of its software will be literally copied from the two earlier systems on which it is based. Users of those existing systems may scarcely notice the change. But there are profound limitations in those systems that we must eventually transcend. NCSTRL is missing many features that will be important in digital libraries, for example, support for multimedia, annotation, and replication of storage. Some of these, though perhaps not all, will be required for the user population we wish to serve.

In the long run, we expect NCSTRL per se to disappear. A unified collection of computer science technical reports, however valuable in its own right, is far more valuable when it is part of a unified collection of all knowledge. There are many interesting digital library projects underway now, and in the coming years we hope to see a consensus national or international digital library infrastructure appear. (Many of the researchers involved with NCSTRL are working to help define this infrastructure.) When this infrastructure is mature enough to offer as high a level of service as NCSTRL, NCSTRL will flow into it, as a river to a sea, and cease to be.

Acknowledgments

This work was supported in part by the Advanced Research Projects Agency under Grant No. MDA972-92-J-1029 with the Corporation for National Research Initiatives (CNRI) and by the National Science Foundation under Grant No. NSF-CDA-9308259. Its content does not necessarily reflect the position or the policy of the Government or CNRI, and no official endorsement should be inferred.

References

[Arms]: William Y. Arms, "Handles and the Handle System," July 27, 1995. Corporation for National Research Initiatives, Reston, Va. hdl://cnri/handle/intro; http://WWW.CNRI.Reston.VA.US/home/cstr/handle-intro.html
[French]: Jim French, Ed Fox, Kurt Maly, and Alan Selman, "Wide Area Technical Report Service: Technical Reports Online," Communications of the ACM, 38(4), April 1995, p. 45.
[Lagoze]: Carl Lagoze and Jim Davis, "Dienst: An Architecture for Distributed Document Libraries," Communications of the ACM, 38(4), April 1995, p. 47.
[VanH]: Marc VanHeyningen, "The Unified Computer Science Technical Report Index: Lessons in Indexing Diverse Resources," 2nd International World Wide Web Conference, WWW'94, Oct. 1994, pp. 535-543.
[Yan]: T. Yan and H. Garcia-Molina, "SIFT -- A tool for Wide-Area Information Dissemination," Proc. 1995 USENIX Technical Conference, New Orleans, 1995, pp. 177-186. http://sift.stanford.edu.

hdl://cnri.dlib/september95-davis