University of California San Francisco
Few digital libraries begin with the drama that accompanied creation of the University of California San Francisco's (UCSF) Tobacco Control Archives (TCA). As chronicled in Stanton Glantz's The Cigarette Papers Online (Glantz et al.), UCSF's extensive digital collections of tobacco industry documents began in 1993 with an anonymous donation of documents from the files of the Brown & Williamson tobacco company. After contentious legal maneuvering in which the University of California refused to bow to tobacco industry pressure, the Library and Center for Knowledge Management scanned the documents and made them available on the Internet via the fledgling World Wide Web in 1995. In the years following its first online collection of tobacco industry documents, the Library added several other tobacco industry document collections to its online collection (Table 1).
From a technical perspective, the TCA collections reflect their distinct origins and the historical development of the web. There are significant differences in the quality of document images and in the extent, type and quality of their metadata. The early Brown & Williamson document images, for example, are very poor, and the metadata was created when no one expected the tobacco documents library to grow to its present size and importance. In fact, the Brown & Williamson collection began as a CD-ROM project and was quickly replaced with web access as that medium became available. Through grant funding, however, the Library has been able to index subsequent collections and create searchable text files through optical character recognition (OCR). The collections are individually searchable using SWISH, an open source search engine, through a web interface. Work with these early collections included experiments to test efficient methods to create digital collections and to understand the value of access to a worldwide community.
Based upon success in this prior work and growing interest among health policy researchers and funding agencies in these collections, a new opportunity emerged. In 2000, The American Legacy Foundation (ALF) (Foundation), dedicated to a "legacy" of reducing the deleterious personal and social effects of tobacco product use, awarded funding to UCSF to create a permanent collection of over 4 million tobacco industry documents. This material, representing seven tobacco companies (American Tobacco, Philip Morris, R.J. Reynolds, Brown & Williamson, Lorillard, the Tobacco Institute, and the Center for Tobacco Research), was released as part of the Master Settlement Agreement (MSA) with the National Association of Attorneys General (NAAG). The 1998 settlement (Multistate Settlement with the Tobacco Companies) required companies to provide paper copies of their documents to the Minnesota State Depository and to provide online access until 2008. The purpose of the Legacy Tobacco Documents Library (LTDL) is to preserve these documents beyond the term of the MSA and to provide a way to search all the documents through a single interface. The primary intended users of the LTDL are public health advocates, medical professionals, lawyers, and others interested in research on tobacco control.
The project started with the award of the ALF grant in January 2001 and the delivery of 40 computer tapes. The tapes, provided by the National Association of Attorneys General, were created in July 1999 and contained document images and document records produced by the seven tobacco companies involved in the Master Settlement Agreement. The four million tobacco documents, consisting of twenty million pages, date from the 1950s through 1999. The documents include tobacco industry science reports, marketing reports, correspondence, budgets, meeting minutes, and other business-related documents. Table 2 lists the major collections and their sizes.
The project team had to make the "build vs. buy" decision for the LTDL and opted not to buy but to license the University of Michigan's Digital Library Extension Service (DLXS) (Digital Library Extension Service) software for this project. The team was impressed with the performance of XPAT, the search engine around which DLXS is built. The team also determined that while DLXS was somewhat idiosyncratic and had little documentation, it met most of the project's requirements and would allow the team to meet its objective of building the library in one year. The LTDL was implemented using DLXS's bibliographic modules. These modules permitted searching across all tobacco industry collections or limiting the selection to one company's documents. Additional features included fielded searching, Boolean searches, truncation and date range searching. A popular bookbag option allows users to select, collect and email citations.
DLXS was also chosen for this project because it was developed in a server and application environment similar to that already in use at the UCSF Library/Center for Knowledge Management. The LTDL runs on Sun Enterprise 450 multi-processor servers running Solaris with over four terabytes (TB) of disk storage on Sun T3 disk arrays, which provide high performance and high reliability. The web server is Apache, document records are stored in PostgreSQL relational databases, and most application system components use Perl for scripting. DLXS was a good fit with the staff's preference for open-source tools and existing expertise in using them.
The road to DLXS implementation was not entirely smooth, however, and extensive, collegial support from the University of Michigan's DLXS team was needed to bridge gaps in DLXS documentation and to cope with its somewhat idiosyncratic code. In some ways, DLXS was as opaque as the human genome: the functions of most components were clear, but the interrelationships between components were often confusing, and there seemed to be a programmatic equivalent of "junk genes" in the hundreds of Perl scripts in the distribution. Patience prevailed, fortunately, and resulted in a successful implementation. The only major local modification was a rewritten document viewing program. The LTDL team wanted to offer users the choice of TIF, PDF or GIF viewing of the documents. Most documents were delivered as TIFs, so that is the most "original" image format. Most users prefer PDFs, and a page-by-page GIF viewer was provided for users with slow computers or network connections.
In addition to installing and configuring the middleware, one of the many tasks involved in building the LTDL was mapping document record fields for cross-collection searching while maintaining the integrity of the original data and optimizing its accessibility. In the LTDL, cross-collection searching of documents from all the tobacco companies is based on fields common to all the collections. When searching an individual collection, however, all the fields in records are searchable. Table 3 shows an example of mapping between the Philip Morris and RJ Reynolds fields for cross-collection searching. As shown, the number of fields searched is somewhat reduced when a user combines two or more collections.
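The mapping approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LTDL implementation (which was built on DLXS): the field names in FIELD_MAPS and the helper functions to_common_record and searchable_view are hypothetical, and the actual Philip Morris and RJ Reynolds field names are those given in Table 3.

```python
# Hypothetical field names for illustration only; the real Philip Morris and
# RJ Reynolds field names are those listed in the article's Table 3.
FIELD_MAPS = {
    "pm": {"TITLE": "title", "AUTHOR": "author", "DOCDATE": "date"},
    "rjr": {"Document Title": "title", "Author Name": "author", "Date": "date"},
}


def to_common_record(collection, record):
    """Project one collection-specific record onto the shared fields.

    The original record is not modified, so the archival data keeps
    every field it arrived with.
    """
    mapping = FIELD_MAPS[collection]
    return {
        mapping[field]: value
        for field, value in record.items()
        if field in mapping
    }


def searchable_view(collection, record, cross_collection):
    """Return the fields a search would see for this record."""
    if cross_collection:
        # Cross-collection search: only the fields common to all collections.
        return to_common_record(collection, record)
    # Single-collection search: every field in the record is searchable.
    return dict(record)
```

Keeping the mapping in a separate table, rather than renaming fields in the stored records, is one way to reconcile the two goals named above: the original data stays intact, while a cross-collection search sees a smaller, uniform set of fields.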
Throughout the project, the Library was committed to maintaining the integrity of the original data for archival purposes. In creating the XML that is indexed by XPAT and used for searching, however, several small changes were made to enhance access to the documents through cross-collection searching. For example, users often search for names mentioned in the documents, but some of the names in the collection records were prefixed with Xs, such as XXMARY. Since the search engine matches whole words rather than arbitrary strings and supports truncation only at the end of a word, users would never find these documents in a search. Rather than remove the Xs, a program looked for all such occurrences and added the name without the extra characters. In another case, dates were normalized in the XML to allow efficient searching of the document sets.
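The two adjustments described above might look something like the following Python sketch. It is a minimal illustration, not the program the Library actually ran: the function names are hypothetical, the rule that a prefix consists of two or more Xs is an assumption based on the single XXMARY example, and the list of input date formats is assumed because the article does not say which formats occurred in the records.

```python
import re
from datetime import datetime

# Assumption: a "prefixed" name is two or more Xs followed by an uppercase
# word that does not itself start with X (the source gives only XXMARY).
_X_PREFIXED = re.compile(r"\b(X{2,})([A-WYZ][A-Z]*)\b")

# Assumption: these input formats are illustrative; the article does not
# list the date formats found in the original records.
_DATE_FORMATS = ("%Y%m%d", "%m/%d/%Y", "%Y-%m-%d")


def expand_x_prefixed(text):
    """Keep each X-prefixed name and add the bare name right after it."""
    return _X_PREFIXED.sub(lambda m: f"{m.group(0)} {m.group(2)}", text)


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```

For example, expand_x_prefixed("XXMARY SMITH") returns "XXMARY MARY SMITH": the original token is preserved, and the added token makes the record findable by a search for MARY. Dates that match none of the assumed formats come back as None so they can be reviewed rather than silently altered.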
The Legacy Tobacco Documents Library was launched on January 30, 2002, one year to the day since the work began. The site has been enthusiastically received and frequently used. In April 2002, the Library released version 1.1, which provides date range searching in the advanced search option and short, persistent bookmarks for more convenient citations. Instead of navigating the search interface, users can access documents directly using bookmarks such as the one in the document reference shown below.
After the initial release of the LTDL, the Library began the task of collecting (or spidering) documents from tobacco industry sites released since July 1999, as this is roughly the latest date of the documents on the original tapes. The effort involves three separate components:
There are technical challenges in this work that involve writing programs to collect documents, resolving field mapping and provenance questions, and integrating new documents into the existing collections.
To date, the spidering effort has resulted in the acquisition of over 500,000 additional tobacco industry documents, many of which, surprisingly, have document dates that precede 1999. Additional documents for three companies were released in July 2002, and spidered documents for the other four companies will be released incrementally as data management chores associated with them are completed. Since the Library plans to continue collecting and releasing documents for the duration of the MSA, new challenges face the Library in providing users information about the differences between the nominal "document date" field, another field that holds the date a document was produced by a tobacco company, and yet another field that is the date added to LTDL. Team members are currently working with researchers to develop an optimal way of presenting this otherwise confusing information.
One of the biggest tasks facing the LTDL developers is upgrading to a newer version of DLXS. The LTDL was built using DLXS version 4. The current version is version 9, and version 10 is under development. Unfortunately, the "giant leap" in versions and the need to sort out local modifications, such as those made to enhance document viewing options, make this migration a daunting task. Nevertheless, the team expects to undertake the effort because newer DLXS middleware promises to support searching of both structured and unstructured (OCR text, for example) data. The newer DLXS middleware is also reported to be more tightly coded and better documented than the version in use.
Most important, the UCSF Library/Center for Knowledge Management is committed to the long-term preservation and management of its tobacco industry documents collections. The Library will continue to work with researchers and the digital library community to resolve the intellectual and technical challenges presented in creating and expanding this large collection of digital documents over the next several years.
The work described in this paper was supported by The American Legacy Foundation, the Robert Wood Johnson Foundation, the California Tobacco-Related Disease Research Program and the American Medical Association. Robert Horton, Minnesota State Archivist, was very helpful in planning many aspects of this work. UCSF project team members were Albert Jew, John Kunze, Robert Mason, Cynthia Rider, Heidi Schmidt, Celia White and William White.
[Multistate Settlement] Multistate Settlement with the Tobacco Industry. 1998. Web Document. The National Association of Attorneys General. Available: <http://www.library.ucsf.edu/tobacco/litigation/msa.pdf>.
Copyright © Heidi Schmidt, Karen Butter, and Cynthia Rider