Evaluation of Digital Repository Software at the National Library of Medicine

D-Lib Magazine
May/June 2009

Volume 15 Number 5/6

ISSN 1082-9873

Jennifer L. Marill
National Library of Medicine
<jennifer.marill@nih.gov>

Edward C. Luczak
Computer Sciences Corporation
<edward.luczak@nih.gov>

Abstract

The National Institutes of Health (NIH) National Library of Medicine® (NLM) undertook an 18-month project to evaluate, test and recommend digital repository software and systems to support NLM's collection and preservation of a wide variety of digital objects. This article outlines the methodology NLM used to analyze the landscape of repository software and select three systems for in-depth testing. Finally, the article discusses the evaluation results and next steps for NLM. This project followed an earlier NLM working group, which created functional requirements and identified key policy issues for an NLM digital repository to aid in building NLM's collection in the digital environment.

Introduction

In order to fulfill the National Library of Medicine's (NLM) mandate to collect, preserve and make accessible the scholarly and professional literature in the biomedical sciences, irrespective of format, the Library has deemed it essential to develop a robust repository infrastructure to manage a large amount of material in a variety of digital formats. A number of NLM's Library Operations program areas need such a digital repository to support their existing digital collections and to expand the ability to manage a growing amount of digitized and born-digital resources.

In May 2007, the Associate Director for Library Operations approved the creation of the Digital Repository Evaluation and Selection Working Group (WG) to evaluate commercial systems and open source software and select one (or a combination of systems and software) for use as an NLM digital repository. The group's work followed an earlier Digital Repository Working Group, which created functional requirements and identified key policy issues for an NLM digital repository to aid in building NLM's collection in the digital environment.
Project Scope, Deliverables and Working Guidelines

The scope of the Digital Repository Evaluation and Selection project was to perform an extensive evaluation of existing commercial and open source digital repository systems and software. The evaluation included those systems and software already identified by the Digital Repository Working Group, as well as any new or previously overlooked software. The evaluation was to include hands-on testing against a set of functional requirements based on the Open Archival Information System (OAIS) model – ingest, archival storage, data management, administration, preservation planning, and access – as specified in the NLM Digital Repository Policies and Functional Requirements Specification [1]. The evaluation was also to include an assessment of the systems and software based on a set of non-functional requirements.

The primary deliverable of this project was a recommendation on which system or suite of software to implement for NLM's digital repository. The recommendation needed to take into account software costs and staffing resources necessary for a pilot or initial implementation. The WG was also charged with forwarding any policy issues that needed management resolution for either this project or implementation. Policy issues related to the priorities for digital preservation were outside the scope of this project.

The full project team held weekly 1.5-hour meetings. Sub-groups were created to meet separately and conduct specific analysis and testing tasks. Working documents and correspondence were posted on a project wiki. The WG included staff from many areas of Library Operations, including the History of Medicine Division, the Public Services Division, and the Technical Services Division; the Office of Computer and Communications Systems; and one staff member each from the National Center for Biotechnology Information and from the NIH Library.
The project was originally expected to conclude within nine months to one year; however, the time required for hands-on testing extended the project to nearly 18 months.

The following working guidelines were developed to help further define the goals and scope of the NLM digital repository:

Institutional Resource – The NLM digital repository will be a resource that will enable NLM's Library Operations to preserve and provide long-term access to digital objects in the Library's collections.
Contents – The NLM digital repository will contain a wide variety of digital objects, including manuscripts, pamphlets, monographs, images, movies, audio, and other items. The repository will include digitized representations of physical items, as well as born-digital objects. NLM's PubMed Central® will continue to manage and preserve the biomedical and life sciences journal literature. NIH's Computer Information Technology Branch will continue to manage and preserve the Department of Health and Human Services/NIH videocasts.
Future Growth – The NLM digital repository should provide a platform and flexible development environment that will enable NLM to explore and implement innovative digital projects and user services utilizing the Library's digital objects and collections. For example, NLM could consider utilizing the repository as a publishing platform, a scientific e-learning/e-research tool, or to selectively showcase NLM collections in a very rich online presentation.
Resources – Staff from NLM's Office of Computer and Communications Systems (OCCS) will provide system architecture and software development resources to assist in the implementation and maintenance of the NLM digital repository. Staff from NLM's Library Operations will define the repository requirements and capabilities, and manage the lifecycle of NLM digital content.

Project Timeline

The Working Group held its kick-off meeting June 12, 2007 and completed all work by December 2, 2008.
The project was divided into the following phases:

Phase 1: Completed September 25, 2007. An initial evaluation was conducted of 10 systems, and 3 were selected for in-depth testing.
Phase 2: Completed October 22, 2007. A test plan was developed and a test data set was assembled containing a wide range of content types.
Phase 3: Completed October 13, 2008. Three systems were installed at NLM and hands-on testing and scoring of each was performed. On average, each system required 85 testing days or just over 4 months from start of installation to completion of scoring.
Phase 4: Completed December 2, 2008. The final report was completed and submitted.

Initial Evaluation of Ten Systems and Software

Based on the work of the previous NLM Digital Repository Working Group, the WG scanned the literature and conducted investigations to construct a list of ten systems and software for initial evaluation. The ten systems included:

Open Source: DAITSS [2], DSpace [3], EPrints [4], Fedora [5], Greenstone [6], Keystone DLS [7].
Commercial: ArchivalWare [8], CONTENTdm [9], DigiTool [10], VITAL [11].

The WG then developed a set of "Master Evaluation Criteria" to provide a decision method for narrowing the ten systems to three for detailed consideration. At this point, tool functionality as described in available software documentation was considered one of many factors in this down-selection process. The Master Evaluation Criteria included:

Functionality – Degree of satisfaction by design analysis of the requirements enumerated in the NLM Functional Requirements Specification [1].
Scalability – Ability for the repository to scale to manage large collections of digital objects.
Extensibility – Ability to integrate external tools with the repository to extend the functionality of the repository, via provided software interfaces (APIs), or by modifying the code-base (open source software).
Interoperability – Ability for the repository to interoperate with other repositories (both within NLM and outside NLM) and with the NLM integrated library system.
Ease of deployment – Simplicity of installation and ease of integration with other needed software.
System security – Ability of the system to meet HHS/NIH/NLM security requirements.
System performance – Overall performance and response time (accomplished via load testing). System availability (24x7 both internally and externally).
Physical environment – Ability to deploy multiple instances for offsite and disaster recovery; ability to function with the NIH off-site backup facility; ability for components to reside at different physical locations; ability for development, testing and production environments.
Platform support – Operating system and database requirements. Staff expertise to deal with required infrastructure.
Demonstrated successful deployments – Relative number of satisfied users or organizations.
System support – Quality of documentation and responsiveness of support staff or developer/user community (open source) to assist with problems.
Strength of development community – Reliability and support track record of the company providing the software; or size, productivity, and cohesion of the open source developer community.
Stability of development organization – Viability of the company providing the software; or stability of the funding sources and organizations developing open source software.
Strength of technology roadmap for the future – Technology roadmap that defines a system evolution path incorporating innovations and "next practices" that are likely to deliver value.

Each criterion was equally assessed on a scale of 0 (none of the criterion is present) to 3 (high level of criterion is present).
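To make the rating scheme concrete, the following Python sketch shows one way ratings on the 0-to-3 scale might be combined to shortlist candidates. It is an illustration under stated assumptions, not the Working Group's actual procedure: the article does not say how the ratings were combined, so an unweighted mean is assumed here, and the function names, system names, and ratings are all hypothetical. Only the criterion names come from the list above.

```python
# Ratings use the article's scale: 0 = criterion absent ... 3 = high level present.
VALID_RATINGS = range(0, 4)


def overall_rating(ratings: dict[str, int]) -> float:
    """Unweighted mean of the per-criterion ratings (an assumed combining rule)."""
    for criterion, value in ratings.items():
        if value not in VALID_RATINGS:
            raise ValueError(f"{criterion}: rating {value!r} is outside 0-3")
    return sum(ratings.values()) / len(ratings)


def shortlist(candidates: dict[str, dict[str, int]], keep: int = 3) -> list[str]:
    """Return the names of the `keep` highest-rated candidates, best first."""
    ranked = sorted(
        candidates,
        key=lambda name: overall_rating(candidates[name]),
        reverse=True,
    )
    return ranked[:keep]


# Hypothetical ratings for made-up systems; NOT the Working Group's scores.
example = {
    "System A": {"Functionality": 3, "Scalability": 2, "Extensibility": 3},
    "System B": {"Functionality": 1, "Scalability": 1, "Extensibility": 2},
    "System C": {"Functionality": 2, "Scalability": 3, "Extensibility": 2},
    "System D": {"Functionality": 2, "Scalability": 2, "Extensibility": 1},
}

print(shortlist(example))
```

A mean rather than a sum keeps the result on the same 0-to-3 scale as the individual ratings, which makes candidates comparable even if a criterion could not be rated for one of them.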
After the functional and non-functional criteria above were addressed, the cost of software deployment, including the initial cost of software plus the cost of software integration, modifications, and enhancements, was also considered on a scale of 0 (highest cost) to 3 (lowest cost).

In order to conduct these initial investigations, the WG was divided into four subgroups and each subgroup evaluated two or three of the ten systems. Each subgroup presented its research findings and initial ratings to the full WG. The basis for each rating was discussed, and an effort was made to ensure that the criteria were evaluated consistently across all ten tools. The subgroups finalized their ratings to reflect input received from discussions with the full Working Group. All ten systems were ranked and three were identified for further consideration and in-depth testing: DigiTool, DSpace, and Fedora. Because Fedora has a limited user interface, the WG selected Fez [12], a Web interface to Fedora, to enable more effective testing.

In-Depth Testing of Three Systems

Using a staggered schedule, DSpace 1.4.2, DigiTool 3.0, and Fedora 2.2/Fez 2 Release Candidate 1 were installed on NLM servers for extensive hands-on testing. The WG established a ground rule that the latest production versions of each system would be installed and tested. OCCS conducted demonstrations and tutorials for DSpace and Fedora, and Ex Libris provided training for DigiTool, so that members could familiarize themselves with the functionalities of each system.

A Consolidated Digital Repository Test Plan [13] was created based on the requirements enumerated in the NLM Digital Repository Policies and Functional Requirements Specification. The Test Plan contained 129 specific tests. Each test could be scored from 0 to 3, indicating the extent to which the test element could be successfully demonstrated or documented (0=none, 1=low, 2=moderate, 3=high).
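As a rough illustration of this scale, the Python sketch below totals a set of test scores and computes the highest total a plan of 129 tests could yield. The function names and the sample scores are hypothetical; the article does not publish per-test results.

```python
TEST_COUNT = 129   # number of tests in the Test Plan, per the article
MAX_PER_TEST = 3   # 0=none, 1=low, 2=moderate, 3=high


def max_total(test_count: int = TEST_COUNT, max_per_test: int = MAX_PER_TEST) -> int:
    """Highest possible total: every test scored at the top of the scale."""
    return test_count * max_per_test


def total_score(scores: list[int]) -> int:
    """Sum the per-test scores after checking each lies on the 0-3 scale."""
    for index, score in enumerate(scores):
        if score not in range(MAX_PER_TEST + 1):
            raise ValueError(f"test {index}: score {score!r} is outside 0-3")
    return sum(scores)


# Hypothetical scores for five tests; NOT results from the evaluation.
sample = [3, 2, 0, 1, 3]
print(total_score(sample), "of", max_total(test_count=len(sample)))
print("maximum for the full plan:", max_total())
```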
Each system could receive a total score of 387 if all tests were scored as 3 (high). All the test elements were represented in a spreadsheet for convenience. Four subgroups of the WG (Access, Metadata and Standards, Preservation and Workflows, Technical Infrastructure) were formed to evaluate specific aspects of each system. Each test was allocated to one of the four subgroups, which was tasked with conducting that test on all three systems. Scores were added up for each subgroup's set of test elements. A cumulative score for each system was calculated by totaling the four subgroup scores.

In addition to the hands-on testing, the WG contacted numerous users and customers of all the software. Information was elicited about software use, the size and nature of the repository collections, the size and skill sets of the repository teams, etc.

Recommendations and Next Steps

After completion of all testing, the WG recommended that NLM select Fedora as the core system for the NLM digital repository. The WG was highly impressed with a number of Fedora capabilities, including the strong technology roadmap, the excellent underlying data model that can handle NLM's diverse materials, the active development community, Fedora's adherence to standards, and Fedora's use by leading institutions and libraries with similar digital project goals. Fedora is also seen as a low-risk choice for now, as it is open source and no license fees are involved.

The WG also recommended that work should begin immediately on a Fedora pilot project using four identified collections of materials from NLM and the NIH Library. Most of these collections already have content files and metadata for loading into a repository. After an initial pilot phase of approximately six to eight months, the effort will be evaluated. NLM senior staff concurred with this recommendation and work has already begun on the pilot implementation.
Implementation of the pilot using Fedora will provide real-world experience with actual NLM collections. The four pilot collections contain a variety of digital formats: digitized monographs on cholera dating from 1830 to 1890; digitized motion pictures of a historical nature; digitized images from important historical anatomical atlases; and a selection of annual reports from NIH Institutes and Centers. The pilot will focus on Submission Information Package (SIP) creation, developing data models for the above material, and understanding metadata needs. The pilot will also investigate "companion" tools that work with Fedora, focusing on three areas: administrative interface tools (e.g., Fez, Muradora [14]); file identification, verification and characterization tools (e.g., JHOVE [15], DROID [16]); and user access tools such as page turning software.

As each pilot collection is completed, NLM intends to evaluate its work with the following types of questions:

How workable is Fedora in the NLM environment?
Should NLM investigate partnerships with other Fedora users?
How actively should NLM participate in the Fedora community?

NLM has much work ahead of it, but the value of its in-depth evaluation and selection process has been significant. Using a set of well-defined evaluation criteria and test cases has enabled NLM to perform hands-on testing and develop an in-depth understanding of repository software prior to undertaking its initial implementation.

Acknowledgments

Members of the Digital Repository Evaluation and Selection Working Group were: Diane Boehr, Brooke Dine, John Doyle, Laurie DuQuette, Jenny Heiland-Luedtke, Felix Kong, Kathy Kwan, Edward Luczak (contractor), Jennifer Marill (chair), Michael North, Deborah Ozga, John Rees, and Doron Shalvi (contractor).

For More Information

See <http://www.nlm.nih.gov/digitalrepository/> for more information about NLM's efforts and to find the full report, Recommendations on NLM Digital Repository Software [17].
Notes & References

1. National Library of Medicine. Digital Repository Policies and Functional Requirements Specification. March 16, 2007. <http://www.nlm.nih.gov/digitalrepository/NLM-DigRep-Requirements-rev032007.pdf>.
2. DAITSS. <http://daitss.fcla.edu/>.
3. DSpace. <http://www.dspace.org/>.
4. EPrints. <http://www.eprints.org/>.
5. Fedora. <http://www.fedora-commons.org/>.
6. Greenstone. <http://www.greenstone.org/>.
7. Keystone DLS. <http://www.indexdata.com/keystone/>.
8. ArchivalWare™ (PTFS). <http://www.archivalware.net/>.
9. CONTENTdm® (OCLC). <http://www.contentdm.com>.
10. DigiTool® (Ex Libris). <http://www.exlibrisgroup.com/digitool.htm>.
11. VITAL (VTLS). <http://www.vtls.com/products/vital>.
12. Fez. <http://sourceforge.net/projects/fez/>.
13. National Library of Medicine. Digital Repository Test Plan. <http://www.nlm.nih.gov/digitalrepository/Consolidated-DR-Testplan-Template.xls>.
14. Muradora. <https://fedora-commons.org/confluence/display/MURADORA/Muradora>.
15. JHOVE: JSTOR/Harvard Object Validation Environment. <http://hul.harvard.edu/jhove/>.
16. DROID: Digital Record Object Identification. <http://droid.sourceforge.net/>.
17. National Library of Medicine. Recommendations on NLM Digital Repository Software. December 2, 2008. <http://www.nlm.nih.gov/digitalrepository/DRESWG-Report.pdf>.

doi:10.1045/may2009-marill