Generation of XML Records across Multiple Metadata Standards

This paper describes the process that Eisenhower National Clearinghouse (ENC) staff went through to develop crosswalks between metadata based on three different standards and the generation of the corresponding XML records. ENC needed to generate different flavors of XML records so that metadata would be displayed correctly in catalog records generated through different digital library interfaces. The crosswalk between USMARC, IEEE LOM, and DC-ED is included, as well as examples of the XML records.

1. Introduction

The Eisenhower National Clearinghouse (ENC) has created the largest collection of K-12 mathematics and science curriculum resources in the nation. Begun in 1992, ENC was authorized by federal legislation as part of the Excellence in Mathematics, Science and Engineering Education Act of 1990 and funded by the U.S. Department of Education. ENC was tasked with planning, developing, organizing, and implementing a national clearinghouse for K-12 science and mathematics resources. In order to meet these objectives, ENC created an online database containing searchable bibliographic catalog records that included detailed descriptions of traditional format and digital K-12 science and mathematics curriculum resources. Underlying the ENC records is a standard library framework for indexing and storing documents (USMARC) with required fields from accepted cataloging standards such as the Anglo-American Cataloging Rules, 2nd edition [1]. The resulting schema provided a description of resources that combined standard and non-traditional, value-added features [2]. Examples of traditional fields include title, author, edition, physical description, series, grade level, and general materials designator (GMD). Some of the non-traditional fields included ENC-defined science and mathematics subject identifiers, equipment specifications, ordering information, evaluation data, extensive abstracts, pedagogical type, geographical focus, and physical media type.

In 1996, the National Science Foundation (NSF) began studying the development of a national digital library for science, technology, engineering and mathematics (STEM) education. Since then, NSF has developed the National Science, Technology, Engineering, and Mathematics Education Digital Library (NSDL) program to develop a national digital library that will constitute an online network of learning environments and resources for science, technology, engineering, and mathematics education at all levels [3]. The first ENC project funded through NSF NSDL in 2000 was the Learning Matrix¹, which focuses on improving the preparation of math and science teachers by supporting faculty who teach math and science courses in two- and four-year colleges [4].

Two other ENC collections were funded by NSF NSDL in 2001. The Gender and Science Digital Library (GSDL)², covering K-16, is a collaboration between ENC and the Educational Development Center (EDC), Gender and Diversities Institute. The International Technology Education Association (ITEA) and ENC are collaborating on the K-12 National Digital Library on Technological Literacy (ICON)³. In the fall of 2002, a K-12 collection building proposal was also funded by NSF NSDL to allow ENC to work with selected federal agencies to create both collection-level and object-level metadata for digital resources developed through federal funds. The digital resources cataloged as part of this project, the Federal Education Digital Resources Library (FEDRL), are accessible through ENC Online and the NSDL portal.

Metadata for these NSF-funded digital library collections are entered through a web-based Cataloging Tool⁴, and are based on the IEEE Learning Object Metadata (LOM) standard. Additional metadata elements have been added based on the collection's content and audience needs. The LOM schema, built on the work done by the Dublin Core group and with origins in the ARIADNE⁵ and IMS⁶ projects, was designed to describe educational objects [5]. The schema has a wide variety of elements (almost 80 elements and sub-elements) that can describe digital objects in a detailed way. The indexing protocols for each NSF-funded digital collection cataloged by ENC and its collaborators follow a modified POOL-IMS Version 1.0 [6] that is in turn based on the IMS Learning Resource Metadata Specification, an XML-compliant schema for indexing learning objects. The POOL-IMS Version 1.0 is a modification of the CanCore Learning Resource Metadata Specification [7].

In addition to traditional bibliographic data such as title and author, resources described using IEEE/LOM metadata include a wide range of information that conveys their possible educational use. For example, in the case of software, the description might include how interactive the resource is; other data cataloged might be the audience for whom a resource was developed, where the learning will take place, or the level of difficulty of the material.

The infrastructure ENC has in place is being used to facilitate building and providing access to these mathematics and science digital resources. The content of all ENC collections is being delivered through a content management system, Vignette CMS 6. Vignette has been in use by ENC since December 1999 and is being used to develop the database foundation for the NSDL collections. All metadata created by ENC are placed into SQL relational databases. The Cataloging Tool provides the interface between the metadata that are being entered by content specialists and catalog librarians and the database in which the data are stored. XML records are being generated for all the collections. The data in these XML records are both searched by ENC's search engine, Autonomy, and harvested for the NSDL OAI (Open Archives Initiative) repository.

2. Problem Statement

Because the native metadata for the ENC collections follow different metadata standards (USMARC and IEEE 1484.12.1-2002 Learning Object Metadata (LOM) Standard) and the metadata to be harvested via the NSDL OAI repository follow the Dublin Core metadata standard [8, 9], ENC needed to develop crosswalks⁷ between these three standard metadata schemas. ENC also needed to generate different flavors of XML records so that metadata would be displayed correctly in catalog records generated through different digital library interfaces. XML is an open, text-based markup language that adds structural and semantic information to data according to a specific schema, such as USMARC. These XML records are searched by the Autonomy search engine with the metadata displayed in two different formats: the format used for the ENC DL libraries (Learning Matrix, ICON, and GSDL) and that used for ENC Online⁸. The XML records are also exported in a Dublin Core format, so they are available to the NSDL OAI harvester.

XML records generated by the Learning Matrix, ICON, and GSDL are based on the IMS Learning Resource Metadata Specification and are the most straightforward to produce: there is a one-to-one correspondence between the metadata that are entered in the cataloging tool and those that are displayed as part of the catalog record. ENC also has to generate a USMARC XML record from the digital library metadata to be searched via ENC Online. This requires the IEEE LOM metadata to be crosswalked to the USMARC metadata standard. A third flavor of XML record is generated from both USMARC and the IEEE LOM metadata. These XML records have been crosswalked to DC-ED so that they are harvestable by the NSDL and searchable through the NSDL.org⁹ interface. A fourth type of XML record is generated so that IEEE LOM metadata can be displayed in a USMARC format via the ENC Online interface. In the future, an XML record will be generated in the IEEE LOM format based on the USMARC metadata used to describe ENC resources. An overview of this process can be found in Table 1.

Table 1. Metadata Crosswalk between Native Metadata, XML Record Format, and Search Interface for ENC Digital Library Collections

3. Development of the Crosswalks

The crosswalk development/XML record team grew as time went on. Originally, the database developers for the ENC MARC records and the digital library LOM records were working independently. It quickly became clear that this task required the collaboration of the database developers, technical specialists who had insight into the indexing, search, and display of the data, and someone who understood the nature and content of the data. The team faced several hurdles, including the changing nature of the IEEE LOM Standard as it went through its revision stages [5]. For example, the datatype format¹⁰ changed in some fields from Langstring¹¹ to Undefined. After countless hours and drafts, the crosswalks were tested and XML records were generated. Sample OAI and IEEE LOM XML records can be found in the appendix at the end of this paper.

The first crosswalk developed was from the ENC USMARC to the IEEE LOM schema. The main goal was to preserve as much of the data from the ENC USMARC records as possible in the new schema while following the IEEE LOM Standard. This was done by implementing vocabularies with ENC as the source and adding new ENC extensions. The LOM Standard recommends specific values for the data but allows for adding terms from different vocabularies as long as the values are identified with the source of each vocabulary.

As the standard states, the use of the recommended vocabularies results in the greatest interoperability potential [5]. That is why every effort was made to crosswalk the values from the USMARC records to the standards-based vocabularies. However, there were many fields in which the use of the recommended vocabularies did not make sense. For example, Grade Level in the ENC USMARC record was crosswalked to Learning Context in the IEEE LOM Format. The values recommended by the standard are "school," "higher education," "training," and "other." These values did not meet the level of specificity of the data in the USMARC records, so the individual grade level vocabulary was identified as ENCgrade with the source ENClearningcontext.
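The grade-level mapping described above can be sketched in Python. This is an illustration only, not ENC's actual code: the function name `grade_to_context`, the sample grade terms, and the fallback to the recommended value "other" are assumptions made for the example. The element name `context` and the `source` attribute value follow the sample record in Figure 2 of the appendix.

```python
import xml.etree.ElementTree as ET

# Values the LOM standard recommends for Learning Context (as quoted in the text).
LOM_RECOMMENDED = {"school", "higher education", "training", "other"}

# Hypothetical sample of ENC grade-level terms; the full ENC list is not given here.
ENC_GRADE_TERMS = {"Grade 5", "Grade 9", "Undergraduate lower division"}


def grade_to_context(grade_level: str) -> ET.Element:
    """Return a LOM-style <context> element for a USMARC grade-level value."""
    elem = ET.Element("context")
    if grade_level in ENC_GRADE_TERMS:
        # Keep the specific ENC term and name the vocabulary it comes from.
        elem.set("source", "enclearningcontext")
        elem.text = grade_level
    elif grade_level.lower() in LOM_RECOMMENDED:
        # A recommended value needs no source attribute in this sketch.
        elem.text = grade_level.lower()
    else:
        # Assumed fallback for values found in neither vocabulary.
        elem.text = "other"
    return elem


print(ET.tostring(grade_to_context("Undergraduate lower division"), encoding="unicode"))
```

Carrying the vocabulary name in an attribute keeps the more specific ENC term in the record while still telling a consuming system that the value is not one of the recommended ones.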

Some of the data in the ENC USMARC record are not appropriate for any of the fields in the IEEE LOM record. Examples of this are the table of contents, media type, and primary series name. These data add to the value of the record from both a bibliographic and a content perspective. When there was not a good fit between the nature of the data and the fields offered in the IEEE LOM schema, extensions were added within elements that had similar categories of data. The extensions allow ENC to preserve valuable data that would otherwise be lost in the IEEE LOM schema.
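One way to picture the extension mechanism is the Python sketch below. It is hedged throughout: the paper names the three fields but does not say which LOM category hosted each extension, so the `EXTENSION_HOST` table, the child element names, and the function `add_extensions` are all hypothetical. Only the use of an `extension` element nested inside an existing element is taken from the sample records in the appendix.

```python
import xml.etree.ElementTree as ET

# Hypothetical placement: which LOM category hosts each extension field.
EXTENSION_HOST = {
    "table_of_contents": "general",
    "media_type": "technical",
    "primary_series_name": "general",
}


def add_extensions(lom: ET.Element, marc_fields: dict) -> None:
    """Attach fields that have no LOM home as <extension> children of a related category."""
    for name, value in marc_fields.items():
        host_name = EXTENSION_HOST.get(name)
        if host_name is None:
            continue  # handled by the regular crosswalk, not by an extension
        host = lom.find(host_name)
        if host is None:
            host = ET.SubElement(lom, host_name)
        ext = ET.SubElement(host, "extension")
        ET.SubElement(ext, name).text = value


lom = ET.Element("lom")
add_extensions(lom, {"media_type": "Web site", "title": "Human blood"})
print(ET.tostring(lom, encoding="unicode"))
```

In this sketch the title is skipped because it has a regular LOM field, while the media type is preserved inside an extension instead of being discarded.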

The second crosswalk was between the IEEE LOM and the Dublin Core (DC-ED) schemas. This was a little more straightforward because the fields in the latter schema are a subset of the former. In this case, some of the fields that required an extension when crosswalking from ENC USMARC to IEEE LOM had a corresponding field in Dublin Core. For example, media type required an extension when crosswalking from USMARC to IEEE LOM but could be mapped to Format in Dublin Core.
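A table-driven version of this second crosswalk might look like the following Python sketch. The flat field names on the LOM side and the function `lom_to_dc` are assumptions for illustration. Only the media type to Format pairing is stated in the text; the other pairings simply mirror elements that appear in the sample records and are not ENC's published mapping. The namespace URIs are those used in the sample OAI record in the appendix.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC_NS)
ET.register_namespace("oai_dc", OAI_DC_NS)

# Hypothetical flat LOM field names mapped to Dublin Core element names.
LOM_TO_DC = {
    "title": "title",
    "description": "description",
    "keyword": "subject",
    "media_type": "format",  # an ENC extension in LOM, a core element in DC
}


def lom_to_dc(lom_record: dict) -> ET.Element:
    """Build an oai_dc:dc element from a dict of LOM field name -> list of values."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for lom_field, values in lom_record.items():
        dc_name = LOM_TO_DC.get(lom_field)
        if dc_name is None:
            continue  # no Dublin Core counterpart in this sketch
        for value in values:
            ET.SubElement(root, f"{{{DC_NS}}}{dc_name}").text = value
    return root


record = {
    "title": ["Human blood"],
    "media_type": ["Text/HTML", "Image/GIF"],
    "difficulty": ["medium"],
}
print(ET.tostring(lom_to_dc(record), encoding="unicode"))
```

Repeatable values become repeated Dublin Core elements, as in the sample OAI record, and LOM fields with no entry in the table (here `difficulty`) are left out of the Dublin Core output.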

The last crosswalk that was developed mapped the data from the IEEE LOM schema into the ENC USMARC schema. In this last case, many valuable data were lost. It did not make sense to add extensions for the metadata that were in the LOM records into the USMARC records. This is because the IEEE LOM-crosswalked-into-USMARC records were only going to be used in the context of ENC Online and had to be operable within that system. Since none of the ENC USMARC records have metadata for categories such as level of interactivity or typical learning time, and those metadata are not displayed or searched through ENC Online, it would be fruitless to crosswalk them from LOM to USMARC. As a result, the USMARC XML records for the resources that were originally cataloged in the IEEE LOM schema have much less data. Table 2 includes the crosswalk between USMARC, IEEE LOM, and DC-ED.
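The lossy nature of this last crosswalk can be illustrated with a short Python sketch. The mapping table below is hypothetical and is not ENC's crosswalk: it pairs three illustrative LOM field names with standard MARC fields (245 $a title, 520 $a summary, 856 $u electronic location), and the function `lom_to_marc` and the sample values are assumptions. Fields without a target are reported as dropped instead of being carried over.

```python
# Hypothetical LOM field -> (MARC tag, subfield code) pairs, for illustration only.
LOM_TO_MARC = {
    "title": ("245", "a"),
    "description": ("520", "a"),
    "location": ("856", "u"),
}


def lom_to_marc(lom_record: dict):
    """Return (MARC fields, dropped LOM field names) for a flat LOM record."""
    marc_fields = []
    dropped = []
    for field, value in lom_record.items():
        target = LOM_TO_MARC.get(field)
        if target is None:
            # No USMARC home and no extension is added, so the value is lost.
            dropped.append(field)
        else:
            tag, code = target
            marc_fields.append((tag, code, value))
    return marc_fields, dropped


fields, dropped = lom_to_marc({
    "title": "Human blood",
    "location": "http://anthro.palomar.edu/blood/default.htm",
    "interactivity_level": "medium",  # sample value
    "typical_learning_time": "PT1H",  # sample value
})
print(fields)
print("dropped:", dropped)
```

Listing the dropped fields makes the cost of the crosswalk visible: educational metadata such as level of interactivity and typical learning time do not reach the USMARC record.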

4. Conclusion

ENC is not unique in its need to produce different flavors of XML records to conform to multiple schemas. Just as ENC chose the IEEE LOM schema, digital libraries should choose a schema that best embodies the nature of their resources and their cataloging goals. Crosswalks that extend interoperability are essential so that the digital library collections can be accessible through a variety of portals and search interfaces. As more organizations share what they have learned as they strive for maximum interoperability of their records that richly describe digital resources, the development of crosswalks will be better understood and more easily accomplished.

Notes

7. Crosswalks, which are also called maps, show how data in one metadata schema can be expressed in another. The act of translating the data from one schema to another is called crosswalking.

10. Datatype is an indicator that the values of the data elements have specific features, such as a string of characters or a period of time.

11. Langstring is a datatype that associates a specific language with a string of characters.

References

[1] American Library Association. (2002). Anglo-American cataloging rules, 2nd ed. Chicago: American Library Association.

[2] Plummer, K. (2000). Cataloging K-12 math and science curriculum resources on the internet: A non-traditional approach. In Metadata and organizing educational resources on the internet. (ed: Jane Greenberg). The Haworth Information Press, an imprint of The Haworth Press, Inc., pp 53-65.

[3] Zia, L. (March 2001). Growing a national learning environments and resources network for science, mathematics, engineering, and technology education: current issues and opportunities for the NSDL program. D-Lib Magazine 7(3). Retrieved June 15, 2003, from <doi:10.1045/march2001-zia>.

[8] Dublin Core Metadata Initiative: Dublin Core Metadata Element Set, Version 1.1: Reference Description. Retrieved June 15, 2003, from <http://dublincore.org/documents/dces>.

Appendix: Sample OAI and IEEE LOM XML Records

Figure 1 below is an example of an OAI XML record generated from IEEE LOM metadata. Figure 2 provides an example of an IEEE LOM XML record generated from IEEE LOM educational metadata, while Figure 3 is an example of an IEEE LOM XML record generated from IEEE LOM general metadata.

Figure 1. Sample OAI XML Record Generated From IEEE LOM Metadata

</header>

<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">

<dc:title>Human blood</dc:title>

<dc:description>This is the opening page of a biological anthropology tutorial that discusses human blood components, with lessons on ABO blood types and Rh blood types. There are links to a table of 25 human blood types and a table of the frequency of ABO blood types in United States. The tutorial presents many charts, diagrams, photos, and illustrations, and includes the audio capability to listen to pronunciations of most vocabulary words. At the end of each section is a practice quiz with feedback to let students know if they have answered correctly. There are links to related Internet sites and a web exploration suggestion.</dc:description>

<dc:publisher>Palomar College (San Marcos, CA), Behavioral Sciences Department, Anthropology Program</dc:publisher>

<dc:contributor>Dennis ONeil</dc:contributor>

<dc:type>Image</dc:type>

<dc:type>Sound</dc:type>

<dc:type>Text</dc:type>

<dc:format>Text/HTML</dc:format>

<dc:format>Audio/Basic</dc:format>

<dc:format>Image/GIF</dc:format>

<dc:format>Image/JPEG</dc:format>

<dc:language>English</dc:language>

<dc:coverage />

<dc:creator />

<dc:subject>Blood</dc:subject>

<dc:subject>Blood types</dc:subject>

<dc:subject>Integrated/interdisciplinary approaches</dc:subject>

<dc:subject>Social sciences</dc:subject>

<dc:subject>Anthropology</dc:subject>

<dc:subject>Attributes of design</dc:subject>

<dc:subject>Creative process</dc:subject>

<dc:subject>Science</dc:subject>

<dc:subject>Life Science</dc:subject>

<dc:subject>Genetics</dc:subject>

<dc:relation />

<dc:source />

<dc:identifier>http://anthro.palomar.edu/blood/default.htm</dc:identifier>

<dc:identifier />

</oai_dc:dc>

</metadata>

<about>

<baseURL>http://www.enc.org/terms</baseURL>

<identifier>Eisenhower National Clearinghouse Rights Information</identifier>

<metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>

</originDescription>

</provenance>

</about>

</record>

Figure 2. Sample IEEE LOM XML Record Generated From IEEE LOM Metadata (IEEE LOM Element 5.0 Educational Metadata Only)

<interactivitytype>mixed</interactivitytype>

<learningresourcetype source="encdlwebpedagogicaltype">Tutorials</learningresourcetype>

<learningresourcetype source="encdlwebpedagogicaltype">Course Materials</learningresourcetype>

<intendedenduserrole>Learner</intendedenduserrole>

<context source="enclearningcontext">Undergraduate lower division</context>

</typicalagerange>

<difficulty>medium</difficulty>

</typicallearningtime>

<langstring lang="en">The typical learning time is an estimate of the length of time it might take to read this tutorial and take the accompanying interactive quizzes. The time will vary depending on the students previous experience with this material.</langstring>

</description>

</educational>

Figure 3. Sample IEEE LOM XML Record Generated From IEEE LOM Metadata (IEEE LOM Element 1.0 General Metadata Only)

<entry>http://anthro.palomar.edu/blood/default.htm</entry>

</identifier>

</identifier>

<title>

<langstring lang="en">Human blood : an introduction to its components and types</langstring>

</extension>

</title>

<langstring lang="en">This is the opening page of a biological anthropology tutorial that discusses human blood components, with lessons on ABO blood types and Rh blood types. There are links to a table of 25 human blood types and a table of the frequency of ABO blood types in United States. The tutorial presents many charts, diagrams, photos, and illustrations, and includes the audio capability to listen to pronunciations of most vocabulary words. At the end of each section is a practice quiz with feedback to let students know if they have answered correctly. There are links to related Internet sites and a web exploration suggestion.</langstring>

</description>

</extension>

</keyword>

</coverage>

</general>

D-Lib Magazine
September 2003

Volume 9 Number 9

ISSN 1082-9873