California Digital Library
A virtual union catalog is a possible alternative to the centralized database of distributed resources found in many library systems. Such a catalog would not be maintained in a single location but would be created in real time by searching each local campus or affiliate library's catalog through the Z39.50 protocol. This would eliminate the redundancy of record storage as well as the expense of loading and maintaining access to the central catalog. This article describes a test implementation of a virtual union catalog for the University of California system. It describes some of the differences between the virtual catalog and the existing, centralized union catalog (MELVYL). The research described in the paper suggests enhancements that must be made if the virtual union catalog is to become a reasonable service alternative to the MELVYL® catalog.
The University of California Union Catalog
The University of California, with its nine campuses located throughout the state, adopted the goal of "One University, One Library" in 1977. Under this goal, the resources of these geographically distributed libraries would be treated as a single collection available to the entire scholarly community of the University. The first step toward this goal was the development of a union catalog for the libraries. After early attempts at a book catalog and a subsequent microfiche version, in 1982 the union catalog came into being as an online public access system known as MELVYL®. This centralized database is built from catalog records sent by all the cataloging departments of participating libraries to the California Digital Library (CDL) where the MELVYL catalog is housed. Participating libraries include the California State Library, the Center for Research Libraries, and a number of affiliated institute libraries. In all, there are twenty-nine separate (and diverse) input streams that feed into the union catalog on either weekly or monthly update schedules.
Parallel to the MELVYL database, which contains records for monographs and non-book format materials, is the state of California's union database of serials. This includes serials records for the University of California, the California State Universities, other public and private research libraries such as Stanford and University of Southern California, as well as the union lists of public libraries, law libraries and medical libraries. For this database there are thirty-seven different input streams representing nearly 600 libraries that are updated anywhere from weekly to yearly.
Functions of the Union Catalog
First and foremost, the MELVYL catalog is a document discovery tool for end-users. At a time when most other catalogs were limiting users to left-anchored exact heading matching, MELVYL offered keyword searching on titles and subjects as well as a sophisticated personal name algorithm that could retrieve an AACR heading based on a variety of user input.
The catalog also turned out to be an important tool for the libraries themselves and was soon incorporated into inter-library loan, collection development and even cataloging functions. One particular aspect of the catalog that has proven to have added value beyond our original intentions relates to the unique way that records from different sources are merged and stored.
For each unique title we created a merged record that could contain all the uniquely contributed fields by each cataloging source. This means that our underlying record can have multiple 100 or 245 fields, as well as a variety of other USMARC fields. Naturally, we only show the end-user a single view of this record, but all the variant fields in the record can contribute to record access. This means that if a single library adds a subject heading in their own local catalog, when added to the MELVYL catalog that subject heading provides access to all copies of that title.
It also means (and this is the part that we didn't anticipate) that if one campus contributes a full catalog record and another creates only a minimal record, the latter library gains all the functionality of the full cataloging of the former. This became an important side effect of our merged record when libraries were undergoing retrospective conversion, and again when AACR2 necessitated updating large numbers of name headings. While not a substitute for bringing their own local catalogs up to date, at least union catalog users were benefiting from efforts made by any library in the University of California system, and the libraries themselves may have had greater options in terms of where to put their limited resources during those times of change.
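The merging behavior described above can be illustrated with a minimal sketch. The data layout, the `merge_records` function and the sample field values are assumptions made for illustration and are not the MELVYL implementation; only the USMARC tags (100, 245, 650) come from the standard.

```python
from collections import defaultdict


def merge_records(contributions):
    """Merge the records contributed for one title, keeping every distinct field.

    `contributions` maps a source library code to a dict of
    USMARC tag -> list of field values (a hypothetical, simplified layout).
    """
    merged = defaultdict(list)
    for source, fields in contributions.items():
        for tag, values in fields.items():
            for value in values:
                if value not in merged[tag]:  # keep each variant only once
                    merged[tag].append(value)
    return dict(merged)


# Illustrative input: one campus sends full cataloging, another a minimal record.
contributions = {
    "campus_a": {
        "100": ["Twain, Mark, 1835-1910"],
        "245": ["Adventures of Huckleberry Finn"],
        "650": ["Mississippi River -- Fiction"],
    },
    "campus_b": {
        "245": ["Adventures of Huckleberry Finn"],
    },
}

merged = merge_records(contributions)
# The subject heading supplied by campus_a now gives access to campus_b's copy too.
```

In this simplified model the minimal record from `campus_b` inherits the author and subject access points of the fuller record, which is the side effect described in the text.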
Today, all campuses have integrated library systems and the quality of the records in those systems and input to the MELVYL catalog is quite high. Still, we keep finding new uses for the union catalog. Recently it became the basis for a patron-initiated request system that allows cross-library lending with minimal staff interaction in the ILL departments. The periodicals file is linked in rather clever ways to both locally-mounted and remote abstracting and indexing databases so that users can go from a retrieved citation to a list of libraries that carry the periodical title. The union catalog is also becoming the university's catalog of Internet-accessible resources.
The Virtual Union Catalog Concept
The current MELVYL centralized database model was developed nearly 20 years ago, prior to the widespread availability of networks and distributed databases. It was also developed before the participating libraries had OPACs of their own. It runs on a large mainframe computer using locally-developed software, some of which is actually over twenty years old. Hardware maintenance costs are high and database update and maintenance functions are labor-intensive. As part of an ongoing process of service evaluation, which includes the evaluation of service efficiency and cost effectiveness, studies were initiated to determine if it is reasonable to seek alternatives to the centrally housed union catalog that could achieve many of the same goals of quality user service, 24x7 availability and excellent response time.
One possible alternative to the central database is a virtual union catalog. Such a catalog would not be maintained in a single location but would be created in real time by searching each local campus or affiliate library's catalog through the Z39.50 protocol. This would eliminate the redundancy of record storage as well as the expense of loading and maintaining access to the central catalog. A distributed catalog makes obvious sense in our current environment where every library has its own database and retrieval interface. The widespread use of Z39.50 and its implementation in nearly all modern library systems means that there should not be major technological barriers to a distributed solution. Or so it seems.
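The broadcast model can be sketched in a few lines. This is only an outline under stated assumptions: each target is represented by a plain Python callable standing in for a Z39.50 client session, the catalog names are placeholders, and no real protocol library is used.

```python
from concurrent.futures import ThreadPoolExecutor


def broadcast_search(query, targets, max_workers=8, timeout=30):
    """Send the same query to every target catalog and collect the results.

    `targets` maps a catalog name to a callable that takes the query and
    returns a list of records (a stand-in for a Z39.50 session).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(search, query) for name, search in targets.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=timeout)
            except Exception as exc:  # one unreachable catalog must not sink the search
                results[name] = exc
    return results


def make_stub(records):
    """Build a fake catalog that does a case-insensitive substring match."""
    return lambda query: [r for r in records if query.lower() in r.lower()]


def offline_catalog(query):
    raise ConnectionError("catalog offline")


targets = {
    "campus_a": make_stub(["Union catalogs", "Library networks"]),
    "campus_b": make_stub(["Union catalogs and Z39.50"]),
    "campus_c": offline_catalog,
}
results = broadcast_search("union", targets)
```

Note that the caller receives one result set per catalog, plus an error for the catalog that could not be reached; nothing in this model merges or reconciles the sets.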
We are hardly the first to think of, much less implement, a distributed catalog solution. Even just over the last year this technology has moved from the "gee whiz" to the "of course" stage. The ability to send queries to one or more other library catalogs is a regular feature, although the actual implementation details vary. Consortia similar in characteristics to the University of California are actively using this technology. The Committee on Institutional Cooperation (CIC), for example, has published a report of their experience with their virtual catalog implementation.
The CIC report expressed some dissatisfaction with the virtual catalog as a discovery tool and advocated more development on the part of library automation vendors. Some of the problems that they encountered were foreseen, such as inconsistency in results between catalogs that had defined their indexes differently, but there was no way to quantify the dis-ease that the librarians at these institutions experienced. The University of California is unique in that we have a current centralized catalog so we can compare the results between these two catalog technologies.
Although some of the advantages and disadvantages of the virtual union catalog approach could be anticipated without a test implementation, there is nothing to compare to actual experience with a new technology. And many of the results presented here would not have been foreseen by study of other systems that have attempted the same design.
There are no absolute measures of OPAC effectiveness that we could use to evaluate the virtual union catalog, but because we do have a centralized union catalog, we are able to make comparisons between the MELVYL catalog, with which we are familiar, and a virtual union catalog. We expected there to be many differences, so the goal was not to rate the virtual union catalog against MELVYL but to describe the differences and determine if the virtual union catalog could provide a reasonable service alternative to the MELVYL catalog.
Campus main libraries were contacted and given the opportunity to volunteer to participate in the comparison. For this test, no attempt was made to cover all campus input sources (affiliated libraries, special libraries) or non-UC sources. The assumption was that the "main" library was the best target for our purposes.
Six campus libraries chose to participate. These included three different library systems: Innovative Interfaces (four sites), DRA WebCat, and OCLC SiteSearch (one site each).
Catalog Search Capabilities
A preliminary analysis of the search capabilities for participating systems was performed. Not only did we have three different "brands" of library system to connect to, but the search capabilities available through Z39.50 are also not the same as those of the local OPAC. Indexes available via Z39.50 for the six participating libraries are listed in Figure 1. Note that differences occur not only between library system "brands" but also within different installations of the same vendor system due to configuration choices made by the libraries.
Figure 1 - Fields Available Through Z39.50
It was actually more difficult finding indexes common to all of the participating systems than we had anticipated. Some of the systems had an overall keyword search that combined keywords from a range of access fields but did not do keyword searching on individual heading types. Yet two of the systems (MELVYL and the SiteSearch implementation) did have index-specific keyword searching (title keyword, subject keyword) but were lacking a general keyword search analogous to the others. When we included author searching, we knew that we were going to see a great variation in how those systems processed queries.
In the end we selected:
To test the keyword search, we had to simulate it on the two systems that didn't have that index by searching a combination of title words and subject words. As anticipated, the results were not easily comparable.
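One way to picture the simulated keyword search is as the union of a title keyword search and a subject keyword search. The sketch below assumes that interpretation; the function names and the toy data are hypothetical, and the article does not say exactly how the two searches were combined.

```python
def simulated_keyword_search(words, title_keyword_search, subject_keyword_search):
    """Approximate a general keyword search on a system that lacks one.

    Both search arguments are hypothetical callables that take a list of
    words and return a set of record identifiers.
    """
    return title_keyword_search(words) | subject_keyword_search(words)


def keyword_index(field_by_id):
    """Build a toy keyword search over one field: all words must be present."""
    def search(words):
        return {
            record_id
            for record_id, text in field_by_id.items()
            if all(word in text.split() for word in words)
        }
    return search


# Toy records, invented for illustration only.
titles = {1: "digital libraries", 2: "cataloging rules", 3: "network protocols"}
subjects = {1: "information retrieval", 2: "libraries automation", 3: "computer networks"}

hits = simulated_keyword_search(
    ["libraries"], keyword_index(titles), keyword_index(subjects)
)
```

A true general keyword index would usually also draw on other fields, such as notes or author names, which is one reason the results are not easily comparable.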
So before even beginning our test, we had had to limit ourselves to catalog search functions that would provide only a minimum of searching capabilities for known item and subject access. This, in itself, was an interesting lesson in distributed searching and we went into the test phase with even less hope that we would be able to show that the virtual union catalog could be a viable public service tool.
The Search Queries
We wanted to see how the virtual union catalog stood up under real user queries. To get these queries we selected a single file, representing about one day's searching, from the MELVYL search logs. This included all commands that were issued to the system during that time span, so although we started with many tens of thousands of log entries, in the end we had a rather small set of viable searches. We selected only those searches that represented the three indexes we wished to test. Of those, we eliminated the searches that received a zero result. This gave us a set of searches that we knew would retrieve some records on the MELVYL catalog. In a later step, we also removed searches that did not get at least one hit among the six libraries that were part of the study, since our actual study group was only a subset of the MELVYL coverage.
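The selection steps described above amount to a simple filter over the log. The sketch below assumes a hypothetical log layout (a list of dictionaries with `command`, `index`, `terms` and `hits` keys) and placeholder index names; the actual MELVYL log format is not described in this article.

```python
def select_test_queries(log_entries, wanted_indexes):
    """Keep only searches on the tested indexes that retrieved at least one record."""
    selected = []
    for entry in log_entries:
        if entry.get("command") != "find":  # skip displays, help requests, etc.
            continue
        if entry.get("index") not in wanted_indexes:
            continue
        if entry.get("hits", 0) == 0:  # drop zero-result searches
            continue
        selected.append((entry["index"], entry["terms"]))
    return selected


# Invented log entries, for illustration only.
log = [
    {"command": "find", "index": "author", "terms": "twain, mark", "hits": 12},
    {"command": "display", "terms": "1-5"},
    {"command": "find", "index": "subject", "terms": "whales", "hits": 40},
    {"command": "find", "index": "exact_title", "terms": "walden", "hits": 0},
    {"command": "find", "index": "keyword", "terms": "union catalogs", "hits": 7},
]

queries = select_test_queries(log, {"author", "exact_title", "keyword"})
```

Each filter removes a large share of the log, which is consistent with tens of thousands of entries shrinking to a small set of viable searches.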
Test 1 - Record "Explosion"
One of the questions we needed to answer on the prospect of moving from a centralized catalog of merged records to distributed catalogs has to do with the total number of records that would be retrieved through the distributed method. The MELVYL catalog has about 10 million titles representing 18 million "copies", but about 2/3 of the catalog is made up of records with only one holding. The other 1/3, or roughly 3 million records, account for the other 12 million holdings.
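The figures above allow a back-of-the-envelope estimate of the "explosion". The calculation below uses the approximate numbers from the text and assumes, as a simplification, that a distributed search would return one record for every holding.

```python
titles = 10_000_000    # merged bibliographic records (approximate, from the text)
holdings = 18_000_000  # "copies" attached to those records (approximate)

# About two thirds of the titles have a single holding.
single_holding_titles = titles * 2 / 3
multi_holding_titles = titles - single_holding_titles
multi_holdings = holdings - single_holding_titles

# Average number of records a distributed search returns per merged record.
explosion_factor = holdings / titles

# Average number of holdings for a title held by more than one library.
avg_holdings_per_shared_title = multi_holdings / multi_holding_titles

print(f"explosion factor: {explosion_factor:.1f}")
print(f"holdings per shared title: {avg_holdings_per_shared_title:.1f}")
```

Under these assumptions a distributed search would return, on average, a little under twice as many records as the merged catalog. The rounded figures quoted in the text (3 million records, 12 million holdings) give about four holdings per shared title; the unrounded fractions give slightly less.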
Because we can limit a MELVYL search to an individual campus, we were able to crudely approximate the effect of searching each catalog separately by running our test queries nine times, each time limiting the results to a single campus. We could then compare the total of these searches against the total retrieved in the merged database. The results varied based on the index used, but were not greatly different from what one would expect from the overall catalog composition:
Test 2 - Searching Against Campus Z39.50 Servers
After removing any searches that returned zero results in Phase 1, the same searches were run against the six campus catalogs through their Z39.50 server function. This was done using an automated search program, and we were pleased that the results were generally quick. We did have to be careful not to overload the campus servers, because our search engine was going against their public catalog and could potentially send searches fast enough to negatively affect actual users. Fortunately, by now our searches had been reduced to short lists and we didn't have problems.
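A simple way to avoid overloading a public catalog is to enforce a minimum interval between searches. The sketch below is an assumption about how such a throttle could look, not a description of the search program we used; the `search` callable and the interval value are placeholders.

```python
import time


def run_throttled(queries, search, min_interval=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Run each query in turn, waiting at least `min_interval` seconds between sends.

    `search` is a hypothetical callable that sends one query to one server.
    `sleep` and `clock` can be replaced in tests.
    """
    results = []
    last_sent = None
    for query in queries:
        if last_sent is not None:
            wait = min_interval - (clock() - last_sent)
            if wait > 0:
                sleep(wait)
        last_sent = clock()
        results.append((query, search(query)))
    return results


# Usage with a stand-in search function that returns a hit count of zero.
results = run_throttled(["hamlet", "macbeth"], lambda query: 0, min_interval=0.01)
```

A fixed interval is the crudest possible policy; a production client would more likely adapt to server response times.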
A sample of results is given in Figure 2. For each search, we knew how many items were retrieved when the search on the MELVYL catalog was limited to that campus' holdings. These were then compared to the numbers retrieved from that campus' online catalog. A zero in any column means that the results were the same; positive means that more records were retrieved using Z39.50 against the campus catalog, and negative means that fewer records were retrieved from the campus system than from MELVYL. Most notable about these results is the lack of any consistency between the MELVYL retrievals and the local system retrievals. Within the same library system, some searches will retrieve many more items than the same search on MELVYL, and some will receive many fewer.
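The sign convention used in the comparison can be stated compactly. The counts in this sketch are invented for illustration and are not taken from our results.

```python
def compare_counts(melvyl_counts, z3950_counts):
    """Return, per search, the Z39.50 count minus the campus-limited MELVYL count.

    0 means the same number of records, a positive value means more records
    from the campus catalog, a negative value means fewer.
    """
    return {
        query: z3950_counts[query] - melvyl_counts[query]
        for query in melvyl_counts
    }


# Invented counts for one campus.
melvyl_counts = {"search 1": 12, "search 2": 40, "search 3": 7}
z3950_counts = {"search 1": 12, "search 2": 55, "search 3": 2}

differences = compare_counts(melvyl_counts, z3950_counts)
```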
We had expected there to be differences, and explainable ones, but the results of this test exhibited a much wider range of variation than we had anticipated. However, the numbers alone were only an indication that there was something there worth investigating.
Figure 2 Comparison of Searches, Z39.50 vs. Union Catalog
Names with initials
Order of names
Different forms of the same name
Test 3 - Qualitative Analysis of Search Differences
There seemed to be no pattern or consistency to the search results we had received. To understand why, a group of campus librarians (see acknowledgements) undertook to analyze the differences. They did this by manually repeating a selection of the test searches and looking at the resulting retrievals.
What they found, as you might have guessed, was that just about every imaginable difference that could occur between library catalogs did indeed manifest itself in our sample.
As expected, author searching turned up numerous reasons for differences in results. The format of author names as input in USMARC records is rigorously standardized. What isn't standard is how our systems index those names, nor how library system user interfaces deal with the variety of name forms that users will input at the query line.
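To show how two systems can index the same standardized heading and still answer the same query differently, here is a small sketch with two hypothetical matching strategies. Neither is the algorithm of MELVYL or of any vendor system; they only illustrate the kind of divergence we observed.

```python
import re


def normalize(name):
    """Lowercase a name and keep only its alphabetic words."""
    return " ".join(re.findall(r"[a-z]+", name.lower()))


def left_anchored_match(query, heading):
    """Strategy A: the query must match the start of the heading, in order."""
    return normalize(heading).startswith(normalize(query))


def any_order_match(query, heading):
    """Strategy B: every query word must match a heading word, in any order.

    A single-letter query word is treated as an initial.
    """
    heading_words = normalize(heading).split()

    def matches(query_word):
        if len(query_word) == 1:
            return any(word.startswith(query_word) for word in heading_words)
        return query_word in heading_words

    return all(matches(word) for word in normalize(query).split())


heading = "Twain, Mark, 1835-1910"
for query in ["Twain, Mark", "Mark Twain", "Twain, M", "M. Twain"]:
    print(query, left_anchored_match(query, heading), any_order_match(query, heading))
```

In this sketch, "Mark Twain" and "M. Twain" retrieve the heading under strategy B but not under strategy A, so the same user input yields different result counts on two catalogs holding identical records.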
Exact title searching should yield fairly consistent results. All of the catalogs are referring the query to a heading index and are searching from left-to-right. But even within this limitation, differences arose.
Comparison of keyword searching between the MELVYL catalog and the local catalogs via Z39.50 is of limited accuracy because MELVYL does not have a keyword index that combines words from a wide selection of fields. We included this search, however, because we had no other way of testing a subject search; many of the systems did not have a subject index and we felt that it was important to include subject searching in our test.
Other differences that we found between local systems and MELVYL weren't particular to a specific type of search. Among these were:
Requirements for a Virtual Union Catalog
The scope of this project was not sufficient to provide a full test of functional requirements for a virtual union catalog, but some important general areas have been identified which would require further analysis and testing prior to planning for the production use of this architecture.
Database Consistency & Search Accuracy
For a virtual union catalog to be feasible, the participating databases must offer a uniform set of indexes and search functions that retrieve comparable items from each catalog. In the current environment, it is not possible to formulate a search that yields predictable results from the databases. Evidence of this lack of consistency and its effect on search accuracy and predictability is a significant result of this test. This means that the first step in creating a virtual union catalog is to create compatible local catalogs that are designed to support the distributed environment. It appears that a common use of Z39.50 in libraries today is not a distribution of our catalogs but a kind of harvesting in disparate databases. While this is an obvious statement of fact, we still seem to harbor a somewhat illogical hope that this harvesting will inexplicably yield consistent and accurate results.
The MELVYL union catalog serves the entire University of California community as well as the larger research library community. It is essential that the catalog be available as close to 24 x 7 as possible. As part of a virtual union catalog, local system downtime, scheduled or unscheduled, would impact the availability of the catalog as a whole.
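The effect of local downtime on the whole can be estimated with a simple model. The sketch below assumes that a complete answer requires every participating catalog to be up and that outages are independent; the 99% figure is an arbitrary illustration, not a measured availability.

```python
import math


def combined_availability(availabilities):
    """Probability that all catalogs are up at once, assuming independent outages."""
    return math.prod(availabilities)


# Six catalogs, each assumed (for illustration) to be available 99% of the time.
six_catalogs = combined_availability([0.99] * 6)
print(f"{six_catalogs:.4f}")
```

Under these assumptions, six catalogs that are each available 99% of the time are all available together only about 94% of the time, and the figure falls further as more sources are added.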
Capacity Planning for Campus OPACs and the Network
The development of a virtual union catalog design would have important implications for local system search capacity and network load. Each search that is now directed only to the centralized union catalog would instead be broadcast to all of the campus catalogs and potentially all contributing systems. Local campus systems would each need to be able to respond to an additional 300,000 searches per week, based upon current MELVYL catalog activity. Network capacity planning would be required to accommodate the increased bi-directional traffic between the libraries.
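The load figure can be put in more operational terms with some simple arithmetic. The weekly total comes from the text; the choice of six catalogs matches our test group and is otherwise an assumption, as is the use of a flat weekly average, which ignores daily peaks.

```python
searches_per_week = 300_000  # current MELVYL activity, from the text
catalogs = 6                 # size of the test group; an assumption for a real deployment

hours_per_week = 7 * 24
seconds_per_week = hours_per_week * 3600

# Extra load on each campus system, averaged over the whole week.
per_catalog_per_hour = searches_per_week / hours_per_week
per_catalog_per_second = searches_per_week / seconds_per_week

# Total searches crossing the network when every query is broadcast.
total_broadcast_per_week = searches_per_week * catalogs

print(f"{per_catalog_per_hour:.0f} searches per hour per catalog")
print(f"{total_broadcast_per_week:,} broadcast searches per week")
```

Because usage is concentrated in working hours, peak load on each campus system would be well above this weekly average.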
Sorting, Merging and Duplicate Removal
Searches issued against the union catalog retrieve a set of records that have been merged to eliminate duplicate bibliographic records and sorted prior to input into the database. Broadcast searches return a set of records without merging or sorting. Version 3.0 of the Z39.50 protocol includes a sort function but few systems currently support this feature. Even with that sort in place, the union catalog interface would have to merge the retrieved sets as well as remove duplicate bibliographic information while maintaining individual holdings data. Because searches across our libraries often retrieve large result sets, sorting and merging is expected to be technologically challenging.
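The merge step that a virtual union catalog interface would need can be outlined as follows. This is a deliberately naive sketch: the record layout is hypothetical, and duplicates are detected by a crude normalized title and author key, whereas real duplicate detection across catalogs is considerably more involved.

```python
import re


def match_key(text):
    """Crude normalization used for duplicate detection and sorting."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def merge_result_sets(result_sets):
    """Combine per-catalog result sets into one sorted, de-duplicated list.

    `result_sets` maps a catalog name to a list of records, each a dict with
    'title', 'author' and 'location' keys (a hypothetical layout). Holdings
    from every catalog are kept on the merged entry.
    """
    merged = {}
    for catalog, records in result_sets.items():
        for record in records:
            key = (match_key(record["title"]), match_key(record["author"]))
            entry = merged.setdefault(
                key,
                {"title": record["title"], "author": record["author"], "holdings": []},
            )
            entry["holdings"].append((catalog, record["location"]))
    return sorted(
        merged.values(),
        key=lambda e: (match_key(e["title"]), match_key(e["author"])),
    )


# Invented result sets from two catalogs.
result_sets = {
    "campus_a": [
        {"title": "Walden", "author": "Thoreau, Henry David", "location": "Main Stacks"},
        {"title": "Moby Dick", "author": "Melville, Herman", "location": "Main Stacks"},
    ],
    "campus_b": [
        {"title": "Walden.", "author": "Thoreau, Henry David", "location": "Science Library"},
    ],
}

combined = merge_result_sets(result_sets)
```

Everything here happens in memory after all records have been retrieved, which is exactly what becomes difficult when each catalog returns a large result set.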
I want to thank the following librarians who did the painstaking analysis of the search results:
Nancy Kushigian, University of California, Davis
Appendix: Z39.50 Search Results
To view the differences between searches in MELVYL with the campus AT and the same searches performed on the campus catalog through Z39.50, please see the attached Appendix.
Copyright © 2000 Karen Coyle