Challenges for Service Providers When Importing Metadata in Digital Libraries

D-Lib Magazine
April 2002

Volume 8 Number 4

ISSN 1082-9873

Marilyn McClelland
David McArthur
Sarah Giersch
CollegisEduprise
Morrisville, NC 27560 USA

Gary Geisler
University of North Carolina at Chapel Hill
Chapel Hill, NC 27599 USA

Abstract

Much of the usefulness of digital libraries lies in their ability to provide services for data from distributed repositories, and many research projects are investigating frameworks for interoperability. In this paper, we report on the experiences and lessons learned by iLumina after importing IMS metadata. iLumina utilizes the IMS metadata specification, which allows for a rich set of metadata (Dublin Core has a simpler metadata scheme that can be mapped onto a subset of the IMS metadata). Our experiences identify questions regarding intellectual property rights for metadata, protocols for enriched metadata, and tips for designing metadata services.

Introduction

Much work has been done in the digital library community to facilitate the exchange of metadata between digital libraries. Specifications like Dublin Core, IEEE Learning Object Metadata (LOM), or IMS [1] have been developed for tagging metadata in a standard XML format [IMS Specification]. Frameworks are being developed to provide interoperability for digital libraries [Arms, Fulker, IMS Model, and Lagoze] with many using the protocol for metadata harvesting developed by the Open Archives Initiative (OAI) [Van de Sompel, Warner].

For the discussion here, we consider a simple, high-level view of the interaction of metadata entities, shown in Figure 1, which illustrates the logical separation of data providers and service providers to distinguish the different roles in handling metadata [Van de Sompel, Warner]. Our digital library project, iLumina (http://www.ilumina-dlib.org), is both a data provider and a service provider. Prior to the winter of 2002, iLumina consisted only of metadata that had been created by iLumina contributors. During the winter of 2002, iLumina imported metadata from another data provider. In this article we discuss challenges that arose as we worked to integrate iLumina services with imported metadata.
Figure 1: High Level View of Metadata Entities

We also discuss here the obstacles and challenges we did not adequately anticipate that relate primarily to the integration of the imported data into the services provided by iLumina. Our goal for this article is to discuss the 'lessons learned' that identify relevant considerations for a better design of the repository as well as services that could generalize to other implementers. Some of these challenges, such as the use of different vocabularies, have been discussed elsewhere; but this paper reports on specific, tangible experiences. In some cases, we can suggest alternative solutions. In other cases, the import experience identifies unanswered questions that need to be addressed by the digital library community. Perhaps sharing our experiences can provide a guide to the hazards that need to be considered in the design of repository persistence and digital library services.

We start with an overview of the iLumina project to provide a context for our import experiences. Next we share the issues and challenges we encountered and the unanswered questions we identified. Finally, we discuss some of the lessons we learned that will help shape the redesign of our repository and services.

Brief Overview of iLumina

iLumina is a digital library of undergraduate teaching materials for science, mathematics, technology and engineering (SMETE) education, now being developed by Eduprise, the University of North Carolina at Wilmington (UNCW), Georgia State University, Grand Valley State, and Virginia Tech [2]. Our experience suggests that faculty across the country have created a wealth of digital resources for teaching, often small or granular in size, and are willing to share them. However, faculty lack a repository where they can submit their materials, find related ones, create new content, and collectively improve both the quantity and quality of digital teaching resources.
iLumina will provide such a repository, along with related community and user services. To promote such sharing, iLumina not only provides services to facilitate the creation and acquisition of resources, but also offers tools to end-users that will enable them to find and make use of digital objects.

One related theme is that IMS metadata can underpin tools that enable resource sharing. iLumina uses IMS metadata tools to create rich and standardized descriptions of the learning objects that it manages. However, for this to work effectively, the costs of creating such rich metadata must be (relatively) low in comparison to its benefits. Many have argued that minimalist metadata (such as Dublin Core) is not only easier to create than IMS/IEEE descriptions, but also more cost-effective. One question the iLumina project is addressing is whether this is true.

Architecturally, iLumina is both distributed (content) and centralized (metadata). That is, we encourage content providers to maintain their own materials, but we gather and manage their metadata so end-users can come to a single site, the iLumina portal, to find these distributed collections. iLumina implements the full IMS metadata information model (including recent updates) and maintains the data through relational mappings into a SQL Server database. Building on this model, we have developed a set of tools to create, manage, search and view IMS metadata and the iLumina resources they describe. This has enabled us to catalog numerous library resources (currently over 800, and increasing steadily). Many of these resources are actually collections embedded in the larger iLumina collection, and we provide tools to describe and view (with metadata) not only individual resources, but the special collections themselves.
IMS metadata provides a foundation for a number of core library services already in place, including basic (and advanced) search, browsing, and a contribution form allowing new content providers to catalog their materials for iLumina (although materials are reviewed before being finally admitted to the library). We have also moved ahead on an extended set of services, such as personal collections [Geisler] and ratings, although most of these are still in the development stage.

Maintaining profile or identity data for users is one of the services iLumina provides to simplify contributions. A registered user of the iLumina site can create and edit a profile of personal information. When the user logs on and chooses to make a contribution, this profile information is used to automatically populate the contributor's fields in the metadata specification. Thus, users do not have to reenter their personal information for each contribution.

iLumina acquires metadata both by cataloging metadata from individuals and by importing metadata in batches. While importing yields large amounts of metadata, it does not invite users to participate in the library environment. Therefore, a web-based Contribution Form was designed so that registered users can play an active role in contributing a resource and its metadata to iLumina. To minimize cataloging errors, we collect metadata using a combination of drop-down lists containing controlled vocabularies and fill-in-the-blank fields.

One of the challenges of acquiring metadata is to minimize the cost of metadata creation for authors. Some solutions we have implemented to contain these costs include: requiring that only a limited number of elements be completed (Title, Discipline); automating completion of some elements (Contributor); designing efficient interfaces; educating users through point-of-need help screens; and training undergraduate student catalogers (i.e., cheap labor!) to aid authors.
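The first two cost-containment measures above (a small set of required elements, and automatic completion of the Contributor element from a stored profile) can be sketched as follows. This is a minimal illustration rather than iLumina's implementation: the `Profile` fields, the dictionary layout of a record, and the function name are assumptions made for the example.

```python
from dataclasses import dataclass

# The two elements the text names as required on the contribution form.
REQUIRED_ELEMENTS = ("title", "discipline")


@dataclass
class Profile:
    """Hypothetical registered-user profile; the field names are assumptions."""
    name: str
    email: str
    organization: str


def build_contribution(profile, form_values):
    """Check the required elements and fill the contributor from the profile."""
    missing = [
        element
        for element in REQUIRED_ELEMENTS
        if not (form_values.get(element) or "").strip()
    ]
    if missing:
        raise ValueError("missing required elements: " + ", ".join(missing))

    record = dict(form_values)
    # The contributor is copied from the profile, so the user never retypes it.
    record["contributor"] = {
        "name": profile.name,
        "email": profile.email,
        "organization": profile.organization,
    }
    return record
```

Keeping the required set this small lowers the cost for authors, at the price of sparser records that catalogers may need to complete later.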
Before contributions to iLumina are exposed publicly through iLumina tools, each contribution is assigned to an appropriate Area Editor, based on the discipline of the resource. The Area Editor decides whether to accept or reject the contribution. Only resources that have been accepted will be publicly available through the iLumina portal and services.

Undergraduate math and science students are hired and trained as catalogers for iLumina. After a minimal amount of training, these students have proven to be effective catalogers. In some cases, student catalogers fill out the iLumina contribution form for the resource author. In other cases, authors fill in a minimal amount of metadata and student catalogers complete the remaining metadata. Student catalogers also clean up metadata by using the iLumina browse tools to pinpoint anomalies in the data.

The Import Experience

Under the Open Archives Initiative (OAI), numerous repositories support a standard protocol for harvesting their metadata. Metadata can be harvested in a variety of standard formats like Dublin Core (the required OAI format) or IMS. Since iLumina was developed to support the IMS standard, we wanted to import metadata in the IMS format. Although Dublin Core can be mapped to IMS, it corresponds to only a subset of IMS, so we expected that resources cataloged using IMS might have richer metadata. Thus we selected an OAI data provider that exports in IMS format.

After importing metadata from an OAI data provider and working to integrate it with iLumina services, we realized that opportunities exist to enrich, or add value to, the metadata.
The metadata can be enriched by:

- Validating that the metadata format adheres to the IMS Metadata specification (or DC format)
- Validating the cataloging, including editing typos, incomplete data, misplaced element values, or cataloging errors
- Providing additional cataloging or adding metadata by populating unpopulated elements (if the original record is sparse)
- Editing metadata content (sometimes for clarity, in other cases to fit field size)
- Updating contributor personal information
- Exposing metadata through iLumina search/browse/portal services
- Providing a web interface to permit the author to edit metadata
- Guaranteeing a minimal quality level of data by reviewing a learning resource and its metadata before exposing them through iLumina services
- Exposing metadata for use in building special collections
- Developing special collection metadata

One of the services provided by iLumina is a browse facility that provides a profile of the learning resources (as shown in Figure 2). Thus, some of the functionality of the iLumina site is built on the expectation that data will be present in certain fields to drive the browse functionality. For example, the browse page provides views of the metadata based on media type and learning resource type. The media type is derived by mapping the data type (essentially a MIME type) in /lom/technical/format [3]. In the data we imported into iLumina, however, neither /lom/technical/format nor /lom/educational/learningresourcetype is populated, so the imported data is not visible in these browse views of the metadata. This is not an error (since learning resource type is not a required IMS metadata element), but it highlights differences in the cataloging of metadata and the risks of building site (browse) functionality on the expectation that specific (non-mandatory) elements will have values in imported metadata records.
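One way to anticipate this problem is to check, at import time, whether the elements that drive the browse views are actually populated. The sketch below is an illustration under simplifying assumptions, not iLumina's code: it assumes the record is an un-namespaced XML document whose root element is `lom` (real IMS instances normally declare an XML namespace, which would have to be included in the paths), and the constant, sample record, and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Paths (relative to the <lom> root) that the browse views depend on.
BROWSE_ELEMENTS = ("technical/format", "educational/learningresourcetype")

# Hypothetical record: format is populated, learning resource type is absent.
SAMPLE = (
    "<lom>"
    "<general><title><langstring>Cell Division</langstring></title></general>"
    "<technical><format>text/html</format></technical>"
    "</lom>"
)


def unpopulated_browse_elements(lom_xml):
    """Return the browse-driving element paths that carry no text content."""
    root = ET.fromstring(lom_xml)
    missing = []
    for path in BROWSE_ELEMENTS:
        nodes = root.findall(path)
        # An element counts as populated if it, or any descendant, has text.
        populated = any("".join(node.itertext()).strip() for node in nodes)
        if not populated:
            missing.append("/lom/" + path)
    return missing


print(unpopulated_browse_elements(SAMPLE))
```

A report like this could be used to route sparse records to catalogers, or to place them in a catch-all browse category instead of silently hiding them.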
Figure 2: Browse Relies on Population of Certain IMS Elements

To some extent, the problems that arise from missing elements in imported metadata, or from elements whose values come from vocabularies different from those used in the importing digital library environment, might be solved if catalogers were free to modify or enrich imported metadata by adding values for unpopulated elements. The problem is the lack of clarity about the conventions and/or courtesies surrounding the use of, and freedom to change, the shared metadata [Bird].

iLumina discussed intellectual property rights early on and developed separate rights statements for our resources and our metadata. Our perception was that the rights statements within the IMS specification pertain to the rights of the learning resource and not the metadata. Basically, our metadata rights statement gives users permission to copy and redistribute for non-commercial use. After working through importing metadata and integrating it with our services, we uncovered additional issues related to the intellectual property rights of the metadata, as indicated by the opportunities to enrich metadata listed above.

Another import problem occurred when the size of an imported element exceeded the field size in the database. For example, IMS suggests that implementations support at least 2000 characters for /lom/general/description. In our database, description has a field size of 8000 characters. However, one of the imported items had a description larger than 8000 characters. To support a field size larger than 8000 characters, we would have to switch to a different datatype in the database and retest all the interfaces to the database. Even if we redesigned the database to accommodate this particular metadata item, sizing fields to fit whatever record happens to arrive is an arbitrary way to determine the field size to support.
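A pre-import check along the following lines could flag oversize values before they reach the database. It is only a sketch: the record is assumed to be a mapping from element paths to strings, the limit is assumed to be counted in characters, and the names are hypothetical. Only the 8000-character limit for the description comes from the case described above.

```python
# Column sizes per element path; only the description limit is from the text.
FIELD_LIMITS = {"/lom/general/description": 8000}


def oversize_fields(record, limits=FIELD_LIMITS):
    """Return {path: (actual_length, limit)} for each value that will not fit."""
    return {
        path: (len(value), limits[path])
        for path, value in record.items()
        if path in limits and len(value) > limits[path]
    }


# Hypothetical imported record whose description is one character too long.
record = {
    "/lom/general/title": "Cell Division",
    "/lom/general/description": "x" * 8001,
}
print(oversize_fields(record))
```

Reporting the offending paths and lengths leaves the policy decision (what to do with a record that does not fit) separate from the detection step.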
Our options are to reject the metadata instance if any field size is exceeded, or to truncate fields to fit our database, knowing that we will then not be able to roundtrip the XML instance. Neither option is very appealing.

Questions relating to the metadata intellectual property rights that we did not resolve include:

- Under what circumstance can imported metadata be altered (validated, edited, corrected, updated, enhanced, reused, and/or truncated)?
- Should there be a formal mechanism to communicate metadata alterations to the data provider?
- How can it be indicated to users that the metadata has been altered from the original?
- In addition to the above, how can it be indicated which pieces of the metadata have been altered?
- What makes sense in supporting versioning of the metadata?
- How should copyright information of the learning resource be distinguished from the copyright information for the metadata?
- Should the original cataloger be credited; if so, how?
- As metadata is shared, will it contain an electronic trail of passing from one repository to another?
- How should a harvester choose to request either the 'enriched' metadata or the 'original' metadata?

Currently, in IMS, IEEE LOM and OAI there is no specified policy for editing or exporting metadata that was imported from another repository. As a starting point, iLumina can make individual agreements with other repositories that specify under what circumstances iLumina can make changes to the metadata and whether or not iLumina can export the revised metadata. iLumina should be able to roundtrip the original metadata and provide it for export on request. For enriched metadata, iLumina's modification of the metadata could be reflected only in a vague way, via the IMS specification, by adding a meta-metadata entry for validation with a 'known' iLumina entity (LOM Jan18-02 mandates only one creator, so this would eliminate the option of two creators).
There is no mechanism that would track specific modifications. For example, if iLumina fleshed out the metadata by adding previously unpopulated tags or correcting errors after validating values, the meta-metadata entry would only indicate iLumina as the metadata contributor and provide a date. There would be no details on the specific modifications.

On the browse page, the imported data is revealed under the discipline element. The problem here is not that the imported data omits the discipline element, but rather that it often uses a discipline classification scheme different from that used by iLumina. More generally, the imported metadata had several elements that used different vocabularies from iLumina. Therefore, even though the other repository also uses a controlled vocabulary, it is a different vocabulary, so the imported metadata does not fully exploit the services provided by iLumina.

This is not a problem unique to iLumina and the libraries from which we have imported materials. Currently, although many OAI data providers use controlled vocabularies, there is little standardization on which controlled vocabulary is used [Liu]. This situation complicates the ability of the service provider to develop generic services for data generated by a variety of providers. In light of diverse controlled vocabularies, in the next stage of iLumina development more care will be taken to develop a flexible architecture.

To recap, as a result of our experience importing metadata to iLumina, the following issues have been identified:

- Opportunities to enrich metadata exist, but little has been done to address the intellectual property rights for metadata or the protocol for services for enriched metadata.
- Evolving standards for LOM, NSDL and OAI are increasing the need for a flexible metadata architecture that can accommodate changes.
- Answers are needed for the questions:
  - Do service providers need an agreement with data providers on enriching metadata?
  - How should original versus enriched metadata be harvested? Should this be part of standards?
  - When metadata is enriched, how should changes be cited or multiple catalogers credited?
  - How can we tease out the implications of all this for persistence architecture (versioning, roundtrip original, enriched, multiple XML schemas)?
- Repositories use controlled vocabularies but provide few policies on the appropriate standards for vocabularies: how can these standards be established?

Lessons Learned

*Metadata will be incomplete and contain errors.*

This sounds obvious now, but we did not appreciate the full impact until working through the import. Some of our surprise was likely due to our forgetting that others take a different approach to cataloging. iLumina has trained undergraduate students as iLumina catalogers and they work cheap! Input forms make use of list boxes and automated population of data to reduce errors. In addition, we used the iLumina browse tools to highlight data inconsistencies for the catalogers to edit. The required reviews by Area Editors also provide another opportunity for validation.

The problem we encountered is that some site functionality is based on metadata values. For example, we planned to identify imported data by looking at the field /lom/metametadata/catalogentry/source, but that field was not populated in the imported metadata. In our next version of iLumina, we will be aware of the side effects of missing or incomplete metadata on functionality.

*Lesson 1 - Don't count on accuracy in data!*

*Specialized vocabularies complicate importing data.*

Although iLumina uses controlled vocabularies, a few of these vocabularies were developed based on the specific needs of iLumina. In some cases, fields serve a dual role for administrative purposes as well as metadata content. For example, iLumina has developed an internal use for the element 'status' with the vocabulary set (submitted, accepted, rejected).
Resources are marked as 'accepted' after passing review by the Area Editor. Only entities with a value of 'accepted' will be exposed through browse and search. In order for the imported data to be visible, 'status' must have a value of 'accepted'. However, the imported data used the LOM defined vocabulary (Draft, Final, Revised, Unavailable) for status. Consequently, we had the problem that if we didn't change the value of the imported element, it would not be exposed through the iLumina site. Yet, for reasons outlined above, we didn't want to be altering the original metadata in an unconstrained way. In looking at our use of status more closely, we now see that it would be more appropriate to establish a separate, administrative field for the review status. For future revisions, we will separate administrative data from the metadata [NSDL]. We took two lessons from this experience.

*Lesson 2 - Avoid developing specialized, custom vocabularies.*

*Lesson 3 - Establish separate 'administrative' fields for data to tie into functionality of the website.*

*Metadata specifications are evolving.*

The iLumina database schema was developed using an entity relationship diagram for the first version of the IMS metadata specification. After mapping the fields in the XML instance of the imported data to the database schema, shortfalls in the database schema came to light. The original design of the iLumina database overlooked datatypes defined in the LOM specification, like langstring and vocabulary. This did not cause problems while all metadata was being created by iLumina catalogers, as the iLumina cataloging is done in English and the iLumina vocabularies are known. The import of others' data highlighted the omission of the datatypes in the schema for persistence of the metadata and the limitations of the current design for future use.

Another problem is that we had not made provisions for storing non-IMS metadata.
For example, Dublin Core can be mapped into IMS, so iLumina could store DC metadata, but with our current model, the XML generated for export would be in IMS format. One solution is to store the intact XML document in a single field in the database. This appears to be the policy that the National Science, Technology, Engineering and Mathematics Digital Library (NSDL) is evolving for sharing and transacting metadata across its collections, since it now plans to retain native metadata and to map it into the common NSDL schema, based on Dublin Core. Storing the XML instance ensures our ability to roundtrip the original document and eliminates the challenge of mapping various XML schemas into the iLumina database schema.

*Lesson 4 - Maintain the original XML instance of imported data to preserve all mappings and to be able to roundtrip the original.*

Conclusions

We have discussed iLumina and the obstacles we encountered when importing metadata into our library. We are less naïve after the import experience. We have gathered ideas to use in the next rewrite of iLumina to build a more effective system.

Some of our challenges are likely due to our relative inexperience in the implementation of IMS metadata and the transacting of metadata across repositories. However, many of the difficulties we face, such as the evolving processes for handling intellectual property rights, are common to everyone in the digital library community. Indeed, these problems are likely to become even more acute in the immediate future, as a growing community of distributed content, metadata and service providers begins to adopt the OAI metadata harvesting protocol.

The National Science, Technology, Engineering and Mathematics Digital Library (NSDL) is a case in point [Saulnier, Zia]. Scheduled for debut in the fall of 2002, the NSDL hopes to become, over the next five years, the largest library of digital teaching and learning materials in the nation.
The National Science Foundation (NSF), through its Division of Undergraduate Education, is funding the initial development of the NSDL, providing support to over 60 coordinated projects (including iLumina) that will contribute collections and services. Because the NSDL is a massively distributed project, common metadata standards and processes for transacting and reusing these descriptions will be essential to knit together the diverse pieces. An initial set of standards for NSDL has been established, and working groups in this area are already tackling some of the hard problems that remain. This article outlines some of those challenges and underscores the point that not all the problems will be solved merely by adopting a common metadata element schema.

Notes

1. IMS originally was the acronym for Instructional Management System. The organization retained the IMS initials and adopted the name IMS Global Learning Consortium, Inc. The consortium involves industry, academic, and government communities in the design and use of open specifications to provide interoperability for distributed learning services and technologies. In addition to learning resource metadata specifications, IMS has specifications in a number of other areas. For more information on IMS specifications, see the IMS website at <http://www.imsproject.org>.

2. This work is funded as part of the NSF DLI-Phase 2, grant #0002935. For details on the project see the iLumina website at <http://www.ilumina-dlib.org>.

3. This is the standard XPath syntax for hierarchical IMS/IEEE LOM metadata elements. In the case of /lom/technical/format, for example, the XML root element is "lom" with a child metadata element "technical". The metadata element "technical" has a child metadata element with the name "format"; other children of "lom" include educational, general, and relation.

References

[Arms] Arms, W. Y., D. Hillmann, C. Lagoze, D. Krafft, R. Marisa, J. Saylor, C. Terrizzi, and H. Van de Sompel, "A Spectrum of Interoperability: The Site for Science Prototype for the NSDL". D-Lib Magazine, 8(1), January 2002, <http://www.dlib.org/dlib/january02/arms/01arms.html>.

[Bird] Bird, S., "White Paper on Machine-Readable Rights Information". Prepared for the Technical Committee of the Open Archives Initiative, October 2001, <http://www.ldc.upenn.edu/sb/oai/rights.html>.

[Fulker] Fulker, D., and G. Janee, "Components of an NSDL Architecture: Technical Scope and Functional Model". Submitted to the Joint Conference on Digital Libraries, Portland, 2002.

[Geisler] Geisler, G., S. Giersch, D. McArthur, and M. McClelland, "Creating Virtual Collections in Digital Libraries: Benefits and Implementation Issues". Accepted as a long paper for the Joint Conference on Digital Libraries, Portland, 2002.

[IMS Model] "IMS Digital Repositories Interoperability Information Model", version 1.0 Base Specification, February 5, 2002, <http://www.imsproject.org/digitalrepositoriesteam.html>.

[IMS Specification] "IMS Learning Resource Meta-data Specification", <http://www.imsproject.org/metadata/index.html>.

[Lagoze] Lagoze, C. (ed.), "Core Services in the Architecture of the National Digital Library for Science Education (NSDL)". Submitted to the Joint Conference on Digital Libraries, Portland, 2002.

[Liu] Liu, X., K. Maly, M. Zubair, and M. Nelson, "Arc - An OAI Service Provider for Digital Library Federation". D-Lib Magazine, 7(4), April 2001, <http://www.dlib.org/dlib/april01/liu/04liu.html>.

[NSDL] NSDL: Functional Requirements Draft - working version 1.0 as of January 13, 2002, <http://repository.comm.nsdlib.org/cgi-bin/wiki.pl?FunctionalRequirementsDraft>.

[Saulnier] Saulnier, B., "NSDL Portal Power". Cornell Engineering Magazine, 7(3), Fall 2001, <http://nsdl.comm.nsdlib.org/portalpower.html>.

[Van de Sompel] Van de Sompel, H. and C. Lagoze, "The Open Archives Initiative Protocol for Metadata Harvesting", 2001, <http://www.openarchives.org/OAI/openarchivesprotocol.html>.

[Warner] Warner, S., "Exposing and Harvesting Metadata Using the OAI Metadata Harvesting Protocol: A Tutorial". HEP Libraries Webzine, Issue 4, June 2001, <http://library.cern.ch/HEPLW/4/papers/3/>.

[Zia] Zia, L., "Growing a National Learning Environments and Resources Network for Science, Mathematics, Engineering and Technology Education". D-Lib Magazine, 7(3), 2001, <http://www.dlib.org/dlib/march01/zia/03zia.html>.

Copyright 2002 © Marilyn McClelland, David McArthur, Sarah Giersch, and Gary Geisler

DOI: 10.1045/april2002-mcclelland