Alaaeldin M. Hafez
The purpose of this article is to raise and address a number of issues related to the conversion of Federal Geographic Data Committee metadata into MARC21 and Dublin Core. We present an analysis of 466 FGDC metadata records housed in the National Biological Information Infrastructure (NBII) node of the FGDC Clearinghouse, with special emphasis on the length of fields and the total length of records in this set. One of our contributions is a 34-element crosswalk, a proposal that takes into consideration the constraints of the MARC21 standard as implemented in OCLC's WorldCat and the realities of user behavior.
This paper describes a continuing digital library research project at the Energy and Environmental Information Resources Center to enhance access to Federal Geographic Data Committee (FGDC) data sets. It presents a mapping of selected FGDC metadata elements into Dublin Core (DC) and MARC21 metadata that is based on standard crosswalks [Mangan 1997; LC 1999]. The FGDC elements included in our mapping are referred to as "essential FGDC metadata." They provide the basis for a converter being developed to import FGDC metadata into the Online Computer Library Center's WorldCat, its Cooperative Online Resource Catalog (CORC) project, and into local MARC-based library catalogs. We also analyze a data set of 466 FGDC records: 1) as a criterion for selecting essential FGDC elements, and 2) in terms of FGDC record length, because record and field lengths are a limitation for records in WorldCat and often in local library systems.
One impetus for this work is our discovery in 1998 that more than 50% of the queries directed at the National Biological Information Infrastructure (NBII) node of the FGDC Clearinghouse retrieved zero (0) hits for the user. To us, that number represents a failure in the system architecture. A follow-up analysis of NBII log files for the period from July 1998 to March 1999 substantiated the earlier finding.
We are following two research threads: the first is to create an alternative Clearinghouse model that makes management and maintenance of the metadata easier for the individuals who take the time to create FGDC-compatible metadata; the second is to convert existing and future metadata to more widely used metadata standards for inclusion in systems other than the Clearinghouse. Our metadata converter model addresses both concerns. (The permanent URL for our converter project is: <http://eeirc.nwrc.gov/converter>.) Before describing our project, however, it may be useful to offer some important definitions for readers who are not professional librarians.
WorldCat is an international bibliographic database of more than 40 million records maintained by the Online Computer Library Center (OCLC) in Dublin, Ohio, and used by more than 34,000 libraries worldwide [OCLC 1999]. WorldCat records are in MARC21 format, which is the current version of the MARC (Machine Readable Cataloging) standard originally developed in the 1960s by the Library of Congress [LC 1998]. MARC21 is used in the United States and Canada. There are also other national and international MARC standards such as UKMARC and UNIMARC.
The Cooperative Online Resource Catalog (CORC) is an initiative sponsored by OCLC to foster the creation and sharing by libraries of metadata for Internet resources. Some of the main features of CORC are the integration of Dublin Core and MARC21 metadata into a single system that provides for both shared and local metadata for digital and physical items, editing in DC and MARC21 views, import and export of DC and MARC21 records, RDF/XML import and export, authority control, assisted (DDC) classification and subject heading assignment, automated keyword extraction and data extraction, link maintenance, and Unicode support [CORC 1999].
The Dublin Core Metadata Initiative is well known to researchers in the digital library community [DC 1999]. The first Dublin Core Metadata Workshop was sponsored by OCLC and the National Center for Supercomputing Applications in March 1995. Since that time, six more workshops have taken place, with the Seventh Dublin Core Workshop (DC-7) being held October 25-27, 1999 at Die Deutsche Bibliothek in Frankfurt, Germany [DC-7].
2. Analysis of FGDC Metadata
FGDC metadata is based on the "Content Standard for Digital Geospatial Metadata" [FGDC 1998]. The standard is available in several electronic formats, for example, as a hypertext image map [CSDGM Image Map 1998]. FGDC metadata has a hierarchic structure of more than 300 elements, including 199 data entry elements, that are organized into seven information sections and three supporting sections called templates:
Information sections: 1. Identification Information; 2. Data Quality Information; 3. Spatial Data Organization Information; 4. Spatial Reference Information; 5. Entity and Attribute Information; 6. Distribution Information; 7. Metadata Reference Information. Templates: 8. Citation Information; 9. Time Period Information; 10. Contact Information.
Sections and elements are either mandatory, mandatory if applicable, or optional. Templates are not used alone, but are inserted into information sections at appropriate places. Some data elements are repeatable, as are the templates. Only the Identification and Metadata Reference sections (sections 1 and 7) are mandatory in a fully compliant FGDC metadata record.
After an FGDC record is created in one of the available editors, such as the MetaMaker program created by FGDC [MetaMaker 1999] or the Army Corps of Engineers' CorpsMet [CorpsMet 1999], the structured ASCII text file is run through a parser, which first checks its syntax and then outputs three versions of the record (text, HTML, and SGML). All three versions are then sent to a node within the FGDC Clearinghouse, where Isite software indexes the SGML version of the record. The nodes are searched through one of the Clearinghouse web sites: the user submits a request through a web form, the request is passed to a Z39.50 client that broadcasts it to all the selected nodes, and the results are returned to the user's browser as one set (see <http://18.104.22.168/>). It should be noted that the FGDC Clearinghouse has published statements indicating that, on average, 10% or more of the nodes are not functioning at any given time. We believe that the percentage may be even higher. The reader may check the status of the FGDC Clearinghouse nodes at any time at the following location: <http://22.214.171.124/serverstatus.html>.
The model we are pursuing to address this reliability problem eliminates the complicated Z39.50-based nodes. Instead, researchers will register the location of their metadata files with a central search engine/converter. During this registration process, a unique persistent identifier will be assigned to each full metadata record. At that point, the content of the file will be ingested into this centralized portal, which will offer features such as searching, browsing, and conversion to MARC21 or other metadata standards. We will be reporting on a working prototype of this model in the months ahead.
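The registration step can be pictured with a small sketch. This is not the prototype itself: the class name `MetadataRegistry`, its methods, and the use of random UUIDs as identifiers are all assumptions made for illustration. The point it shows is that an identifier is assigned once per registered location and survives later re-ingestion of the file.

```python
import uuid


class MetadataRegistry:
    """Hypothetical central registry; names and ID scheme are assumptions."""

    def __init__(self):
        self._id_by_location = {}   # file location -> persistent identifier
        self._content_by_id = {}    # persistent identifier -> ingested content

    def register(self, location, content):
        """Ingest `content` found at `location` and return its identifier.

        Registering the same location again updates the stored content
        but keeps the identifier that was assigned the first time.
        """
        identifier = self._id_by_location.get(location)
        if identifier is None:
            identifier = uuid.uuid4().hex
            self._id_by_location[location] = identifier
        self._content_by_id[identifier] = content
        return identifier

    def lookup(self, identifier):
        """Return the most recently ingested content for an identifier."""
        return self._content_by_id[identifier]
```

In this sketch, re-indexing never changes the address of a record, which is the property the Z39.50 node design lacks.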
Analyzing the NBII Data Set for Record Length
FGDC records provide considerably more information than is usually found in library online catalogs. This applies to both the kind and amount of information that they convey. Thus, one of our goals was to determine how much FGDC records may exceed the field and record length limits of MARC21 records in OCLC's WorldCat database. While MARC21 records have a theoretical maximum record length of 99,999 ASCII characters and a maximum field size of 9,999 ASCII characters, in WorldCat the maximum record length is 4096 characters and the maximum field size is 1230 characters in a variable field. WorldCat also has a maximum of 50 variable fields.
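The WorldCat limits just quoted can be expressed as a simple check. The following is a minimal sketch, not part of our converter: the function name, the representation of a record as a list of (tag, value) pairs, and the approximation of record length as the sum of field value lengths (ignoring MARC leader and directory overhead) are assumptions.

```python
# Limits quoted in the text for OCLC WorldCat.
WORLDCAT_MAX_RECORD_CHARS = 4096
WORLDCAT_MAX_FIELD_CHARS = 1230
WORLDCAT_MAX_VARIABLE_FIELDS = 50


def check_worldcat_limits(fields):
    """Return a list of limit violations for one record.

    `fields` is a hypothetical list of (tag, value) pairs. Record length
    is approximated as the total length of all field values.
    """
    problems = []
    record_length = sum(len(value) for _tag, value in fields)
    if record_length > WORLDCAT_MAX_RECORD_CHARS:
        problems.append(
            f"record length {record_length} exceeds {WORLDCAT_MAX_RECORD_CHARS}"
        )
    if len(fields) > WORLDCAT_MAX_VARIABLE_FIELDS:
        problems.append(
            f"{len(fields)} variable fields exceed {WORLDCAT_MAX_VARIABLE_FIELDS}"
        )
    for tag, value in fields:
        if len(value) > WORLDCAT_MAX_FIELD_CHARS:
            problems.append(
                f"field {tag} length {len(value)} exceeds {WORLDCAT_MAX_FIELD_CHARS}"
            )
    return problems
```

An empty result means the record fits within all three limits under these simplifying assumptions.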
Accordingly, we obtained a data set of 466 metadata records from the National Biological Information Infrastructure (NBII) of the U.S. Geological Survey. The output of the SGML format maps into a flat text file of 444 element cells for each record. A summary of results is presented in Table 1. For those interested in examining the data in more detail, see Appendix A: NBII Data.
It is clear that the FGDC metadata in this set is, essentially, of a different type than the typical catalog record found in WorldCat. The largest record in the set contains 28,042 bytes, nearly seven times the WorldCat record length limit. The largest field value is about eight times larger than the WorldCat field limit. In fact, about 74% of the records in this set exceed the maximum record size, while 51% of the records have at least one field that exceeds the field length limit in WorldCat, usually the abstract (element 14 in our output), the distributor liability statement (element 376), or the process description (element 135).
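Summary figures of this kind can be derived with a few lines of code. The sketch below is illustrative only: the function name and the representation of each record as a list of field value strings are assumptions, and it does not reproduce the NBII data or the values in Table 1.

```python
def summarize_lengths(records, max_record=4096, max_field=1230):
    """Summarize record and field lengths for a non-empty set of records.

    `records` is a hypothetical list of records, each a non-empty list of
    field value strings. Record length is the sum of its field lengths.
    """
    total = len(records)
    record_lengths = [sum(len(value) for value in rec) for rec in records]
    over_record = sum(1 for n in record_lengths if n > max_record)
    over_field = sum(
        1 for rec in records if any(len(value) > max_field for value in rec)
    )
    return {
        "records": total,
        "largest_record": max(record_lengths),
        "largest_field": max(len(value) for rec in records for value in rec),
        "pct_over_record_limit": 100.0 * over_record / total,
        "pct_with_overlong_field": 100.0 * over_field / total,
    }
```

As a check on the arithmetic in the text, 28,042 divided by 4,096 is roughly 6.8, which is the "nearly seven times" figure.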
How does this relate to records in OCLC's CORC system, which allows the import and export of both MARC21 and Dublin Core records? According to Thomas Hickey, CORC Project Manager at OCLC, the differences in size between FGDC records in the NBII dataset and MARC21 bibliographic records in WorldCat should not be a problem for the CORC system. The only real limitation to record size in CORC is what browsers can handle. There have been some problems with records having tens of thousands of bytes, but the average FGDC record is well below this range. WorldCat may adopt CORC's XML system sometime in the future, but for now, moving very long CORC records into WorldCat would require an algorithm to cut or drop fields in order to make the record fit. In other words, the WorldCat record would display abbreviated data in some fields, but the CORC system would display the entire record. The newer Dublin Core, XML/RDF, and FGDC standards do not have field or record length limitations.
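One possible form of such a cut-or-drop algorithm is sketched here. The policy is an assumption, not OCLC's or ours: earlier fields take priority, each value is cut to the field limit, and later fields are shortened or dropped once the record budget is spent. The function name and the (tag, value) representation are likewise hypothetical.

```python
def fit_to_limits(fields, max_record=4096, max_field=1230, max_fields=50):
    """Return a trimmed copy of `fields` that respects the given limits.

    `fields` is a hypothetical list of (tag, value) pairs in priority
    order. Values are truncated, and trailing fields are dropped, so that
    no field exceeds `max_field`, the total length does not exceed
    `max_record`, and at most `max_fields` fields remain.
    """
    fitted = []
    remaining = max_record
    for tag, value in fields[:max_fields]:
        if remaining <= 0:
            break  # record budget is spent; drop the remaining fields
        value = value[:min(max_field, remaining)]
        fitted.append((tag, value))
        remaining -= len(value)
    return fitted
```

A production converter would more likely rank fields by importance before trimming, so that a long abstract does not crowd out access points such as title or subject.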
Criteria for Mapping and Converting FGDC Elements
It is not our intention to map and convert all 300-plus FGDC elements (or 195 data entry elements). Rather, we selected a smaller number of elements that we refer to as "essential FGDC metadata" for a fully compliant FGDC record. Elements were selected for three reasons: 1) they are required (mandatory) for the production of a fully compliant FGDC record; 2) they are search keys such as author, title, subject, and date that are commonly found in online library catalogs; 3) they are fields commonly used by creators of FGDC metadata that may be used as search keys by persons interested in FGDC geospatial data sets. The first two criteria are determined, respectively, from mandatory elements in the "Content Standard for Digital Geospatial Metadata" (CSDGM) and by generally accepted library practice for the selection of access points in online catalogs. The third criterion is based on a frequency analysis of the NBII data set for actual usage of FGDC elements by persons who created the metadata records. The results of this analysis are presented in Table 2.
Columns 1 and 2, respectively, give the tag number and name of each essential FGDC element as found in the CSDGM. Column 3 gives the number of times each essential element was used in the sample set (out of a possible 466 times).
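A frequency analysis of this kind amounts to counting, for each element, the number of records in which it carries a value. The sketch below assumes each record is available as a dictionary from element name to value; that representation and the function name are assumptions, and the sample element names are illustrative.

```python
from collections import Counter


def element_usage(records):
    """Count the records in which each element has a non-blank value.

    `records` is a hypothetical list of dicts mapping an FGDC element
    name to its string value.
    """
    usage = Counter()
    for record in records:
        for element, value in record.items():
            if value and value.strip():
                usage[element] += 1
    return usage
```

Calling `element_usage(records).most_common()` lists elements from most to least frequently used, which is the ordering needed to apply the third selection criterion.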
3. FGDC to MARC21/DC Crosswalk
The following table (Table 3) presents our crosswalk from FGDC to Dublin Core and MARC21. It consists of 34 essential FGDC elements and is based on standard crosswalks [LC 1999; Mangan 1997]. It includes mandatory elements from the Identification and Metadata Reference sections, as well as specific elements from the Spatial Reference, Distribution, Citation, Time Period, and Contact sections. Our crosswalk has similarities as well as differences with the "Metadata Entry System" for minimally compliant metadata that has been proposed recently by the Federal Geographic Data Committee [FGDC 1999]. We recommend that the reader compare those guidelines with the elements in our crosswalk. Appendix B contains a detailed discussion of the essential FGDC metadata elements.
The crosswalk and converter represent the current state of an evolving process rather than a final product. The converter software is written in C by one of us (Alaaeldin Hafez). It has a modular and adaptable design; that is, it is easy to add, change, or delete particular features within its general design. However, even the best machine conversion may require some human intervention: librarians may want to edit records produced by the converter in order to adapt them to their local automated library systems. Appendix B also includes our reasons why there are temporary blank spaces in the crosswalk in Table 3.
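The table-driven idea behind a crosswalk converter can be sketched briefly. This is not our C converter and not Table 3: the three mappings shown are an illustrative subset chosen to agree with the standard crosswalks cited above, and the function name, record representation, and output format are assumptions.

```python
# Illustrative subset only; NOT the 34-element crosswalk of Table 3.
# FGDC element name -> (Dublin Core element, MARC tag, MARC subfield)
CROSSWALK = {
    "Title": ("Title", "245", "a"),
    "Abstract": ("Description", "520", "a"),
    "Online Linkage": ("Identifier", "856", "u"),
}


def convert(fgdc_record):
    """Map one FGDC record to a Dublin Core dict and MARC-style lines.

    `fgdc_record` is a hypothetical dict of FGDC element name -> value.
    Elements missing from CROSSWALK are skipped.
    """
    dublin_core = {}
    marc_lines = []
    for element, value in fgdc_record.items():
        if element not in CROSSWALK:
            continue
        dc_name, tag, subfield = CROSSWALK[element]
        dublin_core.setdefault(dc_name, []).append(value)
        marc_lines.append(f"{tag} ${subfield} {value}")
    return dublin_core, marc_lines
```

Keeping the mapping in a table rather than in program logic is what makes it easy to add, change, or delete elements as the crosswalk evolves.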
The realities of mapping FGDC to MARC21 and Dublin Core standards are most clearly understood by examining the record and field length limits of the OCLC WorldCat system. It is our supposition that there are others who are interested in putting FGDC records into their local MARC21 library systems to increase the access points and availability of this valuable metadata. The whole notion of cooperative cataloging mandates that we look for least common denominators for our metadata standards. While some library automation systems do not impose the same kind of limits as WorldCat, it would be counterproductive to design individual crosswalks for each library vendor's system.
It appears that the CORC project's success will translate, over time, into a new way of storing metadata at OCLC. Given OCLC's leadership in the field, there is a good chance that the XML-based record structure will be adopted by vendors. Considering the massive investment of equipment and training in libraries, however, completion of a migration away from MARC is years in the future. Therefore, metadata conversion efforts ought to treat the OCLC WorldCat limits on field length, record length, and number of fields as constants for now.
One of the core issues we would like to highlight is the lack of a persistent URI for FGDC metadata. As the system is currently designed, an SGML version of the record is dumped into a Z39.50 database server. Each time the system re-indexes, the address for the record is changed. This design flaw embedded in the FGDC Clearinghouse model violates a core rule of networked information. No stronger statement of this is available than that made by Tim Berners-Lee of the World Wide Web Consortium:
"The most fundamental specification of Web architecture, while one of the simpler, is that of the Universal Resource Identifier, or URI. The principle that anything, absolutely anything, "on the Web" should [be] identified distinctly by an otherwise opaque string of characters is core to the universality." [Berners-Lee 1998].Until the problem of dynamic metadata locations is addressed, it will not be possible to create meta-metadata for the FGDC records on a large scale. There are other problems with the Z39.50 FGDC Clearinghouse system, such as slow response time and unreliable search results. These are liabilities that cause the metadata searcher and creator to lose faith in the system, thus accelerating the need to export the metadata into other systems with better user interfaces. Solutions must take metadata maintenance into consideration.
An area ripe for empirical investigation is the study of the preferences and habits of scientists when searching FGDC metadata. Myke Gluck and Bruce Fraser, for example, have shown that the appearance or format of metadata records has a very large effect on the user's perception of relevance [Gluck 1998].
Another fruitful area of digital library research is to study the relationship between metadata and scholarly electronic journals. We believe FGDC metadata should be peer reviewed and included in the institutional reviews of scientists for promotion and tenure.
More discussion and critical analysis is due. We hope our effort here will stimulate an exchange of ideas.
[1.] By way of background, Adam Chandler is a systems librarian, Dan Foley is a cataloger, and Alaaeldin M. Hafez is a computer scientist. Our library, the Energy and Environmental Information Resources Center (EE-IR Center), is a digital special library of text, numeric, and geospatial data. It was formed as a partnership between the National Wetlands Research Center (NWRC) of the U.S. Geological Survey, and the Center for Advanced Computer Studies of the University of Louisiana (CACS/ULL). Both partners are located in Lafayette, Louisiana. The EE-IR Center is funded by the Office of Scientific and Technical Information (OSTI) of the U.S. Department of Energy. The scope of the collection pertains to energy and the environment of Louisiana, especially the wetland areas of South Louisiana. An area of special interest is pollution and contamination of the Lower Mississippi Watershed and offshore in the Gulf of Mexico. For more information, see [Foley 1999].
The EE-IR Center is located in the NWRC Library. Other Center personnel are NWRC Librarian Judy Buys and GIS Specialist Suzanne Harrison. The work presented in this paper is funded by U.S. Dept. of Energy Grant No. DOE-FG02-97ER1220. The principal investigators for our digital library project under this grant are Dr. Vijay Raghavan, CACS/USL, and Gaye Farris, Branch Chief, Technical and Informatics Branch, NWRC.
[2.] We are grateful to Susan Stitt of NBII for supplying this data set to us.
[3.] For readers unfamiliar with the MARC21 bibliographic format, the best introduction is "Understanding MARC Bibliographic: Machine Readable Cataloging" [Furrie 1998]. Throughout this paper, a three-digit number indicates a MARC tag for a particular MARC field. Fields have subfields $a, $b, $c, etc., where the dollar sign ($) is a subfield indicator. For example, the notation 856 $u refers to an Electronic Location and Access field (856) having a subfield ($u) that contains a Uniform Resource Locator (URL).
[Berners-Lee 1998] Berners-Lee, Tim. (1998). "Web Architecture from 50,000 feet." Retrieved 5 May 1999 from: <http://www.w3.org/DesignIssues/Architecture.html>
[CSDGM Image Map 1998] CSDGM Image Map 1998. (1998). "An Image Map of the Content Standard for Digital Geospatial Metadata: Version 2, 1998 (FGDC-STD-001 June 1998)." Available at: <http://www.its.nbs.gov/fgdc.metadata/version2/>
[CORC 1999] CORC -- Cooperative Online Resource Catalog. Available at: <http://www.oclc.org/oclc/research/projects/corc/>
[CorpsMet 1999] United States. Army. Corps of Engineers. (1999). "CorpsMet." Available at the Corps' "Geospatial Data Clearinghouse Node" Web page: <http://corpsgeo1.usace.army.mil>
[DC-7] 7th Dublin Core Metadata Workshop, October 25-27, 1999, Die Deutsche Bibliothek Frankfurt am Main, Germany. Available at: <http://www.ddb.de/partner/dc7conference/>
[DC 1999] Dublin Core Metadata Initiative. (1999). Available at: <http://purl.org/DC/>
[FGDC 1998] Federal Geographic Data Committee. (1998). "Content Standard for Digital Geospatial Metadata, Version 2, 1998." Available at: <http://www.fgdc.gov/metadata/contstan.html>
[FGDC 1999] Federal Geographic Data Committee. (1999). "Metadata Elements Included in the Metadata Entry System." Retrieved 9 September 1999 from: <http://www.fgdc.gov/clearinghouse/metadataesystem/mes_description.html>
[Foley 1999] Foley, Dan. (1999). "Metadata in a Digital Special Library: the Energy and Environmental Information Resources Center in Lafayette, Louisiana." Journal of Southern Academic and Special Librarianship: 01 [iuicode: <http://www.icaap.org/iuicode?62.01.02.04>]
[Furrie 1998] Furrie, Betty. (1998). "Understanding MARC Bibliographic: Machine Readable Cataloging." Fifth edition reviewed and edited by the Network Development and MARC Standards Office, Library of Congress. Available at: <http://lcweb.loc.gov/marc/umb/>
[Gluck 1998] Gluck, Myke, and Bruce Fraser. (1998). "Usability of Geospatial Metadata or Space-Time Matters." Presented in the "Theory and Practice of the Organization of Image and Other Visuo-Spatial Data for Retrieval: From Indexing to Metadata" Session. American Association for Information Science 1998 Annual Meeting, Pittsburgh, Pennsylvania, 25-29 October 1998.
[Iannella 1999] Iannella, Renato. (1999). "DC Agent Qualifiers: DC Working Draft, 12 November 1999." Available at: <http://www.mailbase.ac.uk/lists/dc-agents/files/wd-agent-qual.html>
[LC 1998] Library of Congress. Network Development and MARC Standards Office. (1998). "MARC 21: Harmonized USMARC and CAN/MARC." 22 October 1998. Available at: <http://lcweb.loc.gov/marc/annmarc21.html>
[LC 1999] Library of Congress. Network Development and MARC Standards Office. (1999). "Dublin Core/MARC/GILS Crosswalk." Available at: <http://lcweb.loc.gov/marc/dccross.html>
[Mangan 1997] Mangan, Elizabeth. (1997). "Crosswalk: FGDC Content Standards for Digital Geospatial Metadata to USMARC." Available at: <http://alexandria.sdc.ucsb.edu/public-documents/metadata/fgdc2marc.html>
[MetaMaker 1999] MetaMaker. (1999). U.S. Geological Survey. Available at: <http://www.emtc.usgs.gov/metamaker/nbiimker.html>
[OCLC 1999] OCLC Online Computer Library Center, Inc. [home page] (1999). <http://www.oclc.org/oclc/menu/home1.htm>
7. Contact Information
Alaaeldin M. Hafez
Copyright © 2000 Adam Chandler, Dan Foley and Alaaeldin M. Hafez
D-Lib Magazine Access Terms and Conditions