Metadata Clean Sweep: A Digital Library Audit Project
R. Niccole Westbrook, University of Houston Libraries
As digital library collections grow in size, metadata issues such as inconsistency, incompleteness and poor quality become increasingly difficult to manage over time. Unfortunately, successful user search and discoverability of digital collections rely almost entirely on the accuracy and robustness of metadata. This paper discusses the pilot of an ongoing digital library metadata audit project that was collaboratively launched by library school interns and full-time staff to alleviate poor recall, poor precision and metadata inconsistencies across digital collections currently published in the University of Houston Digital Library. Interns and staff designed a multi-step project that included metadata review of sample items from each collection, systematic revision of previously published metadata and recommendations for future metadata procedures and ongoing metadata audit initiatives. No such metadata audit efforts had previously been conducted on the UH Digital Library, and the project yielded data that provided staff with the opportunity to significantly improve the overall quality and consistency of metadata for collections published over the nearly three-year life of the repository. This article also contains lessons learned and suggestions on how a similar metadata audit project could be implemented in other libraries hosting digital collections.
The University of Houston Digital Library (UHDL) is a multimedia digital collection of materials documenting the history of the University of Houston, the City of Houston and the State of Texas, as well as other historically and culturally significant materials. Launched nearly three years ago, the UHDL contains metadata generated using a set of data dictionaries created by a metadata librarian in the Digital Services Department (DS) in the University of Houston Libraries around the inception of the UHDL. After three years, the UHDL contained nearly 15,000 items, many of which were known to contain inconsistent, outdated or incomplete metadata. Additionally, past decisions about metadata treatment had occasionally been made at the collection or even item level, and the UHDL lacked metadata consistency across collections. Therefore, DS initiated an audit project, the goal of which was to identify and correct metadata issues across all collections in the UHDL.
DS is a small department that relies heavily on library school interns to design and conduct high-level projects (Westbrook and Reilly, 2011, p. 21-25). Two students selected during the Fall 2011 semester were designated project leaders for the UHDL metadata audit. The metadata audit project was designated as a remote project, meaning that the two student interns never visited the DS offices. Rather, all communication and deliverables were managed through online tools such as Skype, Google Docs, Google Calendar and Blogger (Westbrook, 2012). These two interns, working closely with one DS librarian and a staff person, drafted a project plan and executed the first steps of the UHDL audit initiative.
There were four steps in the UHDL pilot audit project, each of which simultaneously drove the evolution of the pilot project and served as a learning opportunity for the interns assigned to the project. The first step was a literature review in which interns familiarized themselves with existing literature pertaining to metadata audit projects and metadata quality analysis. This step was crucial to creating a worthwhile framework for assessing metadata quality that could be easily implemented at other institutions and also served to give the library school intern team leaders the expertise and background they would need to propose and execute a successful audit pilot. The second step was the planning step in which interns became familiar with project planning strategies and collaboratively drafted a project plan outlining the goals, steps and resources involved in the audit pilot. This step served as a communication check point between DS staff and intern project leaders to ensure that the direction and steps of the project meshed well with existing UHDL procedures and goals. It also emphasized to the interns the importance of making project planning a step and provided them with the opportunity to practice project planning and execution in a practical setting. Finally, allowing the interns to plan the project from start to finish gave them authentic authority over the nature and outcome of the project. In the third step, the interns conducted data collection and analysis based on the steps outlined in the project plan. This first sweep of metadata published in the UHDL will be used as a baseline for expanding on the pilot project described in this paper. This step also gave the interns an opportunity to critically evaluate Dublin Core metadata associated with digital objects and make high-level strategic suggestions regarding procedures and workflows, an excellent management skill.
Finally, the pilot project concluded with the implementation of corrections and changes identified by the interns. This step served as a final check point in which DS staff had the opportunity to critically evaluate suggestions made by the interns before incorporating them into live metadata or existing metadata procedures. This paper reports a summary of the project design, methods, findings and lessons learned by the participants.
Initially, project leads conducted a survey of existing literature in search of documented best practices or case studies of previously conducted digital library metadata audit projects to use as models for the UHDL project. A survey of previously conducted metadata quality assessment frameworks showed that most of these frameworks were ad hoc, intuitive, and incomplete and therefore did not produce robust and systematic measurement models (Stvilia, Gasser, Twidale, & Smith, 2007, p. 2). Though some of the available case studies successfully captured an individual organization's perception of metadata quality, not all aligned well with the general structure of metadata quality by schema or standard, ultimately limiting their reuse at other institutions (Stvilia, et al., 2007, p. 2). Many organizations choose to develop their own metadata quality assessments based on local standards or practices. Because of the lack of a universal framework and a lack of literature on metadata audit project planning, DS chose to develop its own framework, which is reported here. While the UHDL is a CONTENTdm repository using the Dublin Core metadata schema, the project reported here is theoretical enough that the general ideas could be applied to any digital library collection or repository.
One underlying theme was culled from existing literature. Considering the significance of metadata quality, it is advisable for digital libraries to evaluate their metadata collections frequently as the evaluation results can help identify defects in processes and practices for metadata collection and creation (Ma, Lu, Lin and Galloway, 2009, p. 1). This paper reports the first steps of what will be an increasingly sophisticated ongoing metadata audit initiative in DS.
A preliminary survey of the literature did provide a general framework for how to evaluate the overall quality of metadata. When evaluating the quality of metadata associated with digital objects, there are two main things to consider: the quality of the information available about the objects themselves, and the quality of the metadata created to describe the objects (Beall, 2005, p. 2). Because items slated for digitization come with a finite amount of information, the main focus of metadata quality review for this project was the quality of the metadata created to describe the objects. In general, "high-quality metadata should allow digital users to intuitively conduct the tasks such as identifying, describing, managing and searching data" (Ma, et al., 2009, p. 1). Where errors exist in metadata, access to material available through a digital library can be easily blocked (Beall, 2005, p. 10). Beall argues that strategies can be developed for dealing with typographical errors and other data quality problems that can lead to improved data quality and user access to digital objects. Clean data bears a high cost, but in the context of digital libraries, "the benefit of this cost is accurate, error-free data and consistent access to that data. Data quality control is an essential part of digital library management" (Beall, 2005, p. 17). One goal of the UHDL metadata audit project was therefore to focus on metadata created to describe digital objects and on decreasing the number of errors in publicly available records.
The concepts of completeness and consistency also emerged from the literature as potential foci for metadata audit efforts. Shreeves et al. define completeness as a relational information quality dimension: the degree to which an information object matches the ideal representation for a given digital object (Shreeves, Knutson, Stvilia, & Twidale, 2005, p. 9). When evaluating the completeness of a record, judgment should be made on the basis of sufficiency for use in the aggregated database by meeting the requirements of finding, identifying, selecting, obtaining, and collocating within that database (Shreeves et al., 2005, p. 9). There are two forms of consistency to consider when evaluating the quality of metadata records. Semantic consistency is the "extent to which the collections use the same values (vocabulary control) and elements for conveying the same concepts and meanings throughout" (Shreeves et al., 2005, p. 11). Structural consistency, by contrast, is the extent to which similar elements of a metadata record are represented with the same structure and format. Based on the findings above, a framework was developed that focused on eliminating typographical and factual errors and improving overall completeness and consistency of metadata across collections in the UHDL.
Design of Project Plan
Once the literature search provided the skeleton of a strategy for evaluating metadata quality, the next step was to devise a project plan that would outline the goals, steps and desired outcomes of the audit pilot (Appendix). The intern project leads collaborated to design a plan for the pilot audit of the metadata audit initiative, and set forth a timeline for the project deliverables. Once they had drafted a project proposal, which included how the work would be organized, how many hours would be devoted to the project over the course of the first semester, what additional resources would be needed, and when and how the components would be delivered, the project leaders presented their plan to DS staff for approval and revision.
With 15,000 total records in the UHDL, one crucial goal of the pilot audit was to define a way to efficiently and effectively sample a representative subset of each collection published in the UHDL. The method needed to be reasonable given the resources at hand, and was aimed at identifying global changes that could be applied on the collection level as well as collections that were good candidates for full or partial metadata revision. It was therefore determined that an initial evaluation of the metadata for six hundred items would be conducted based on Dublin Core guidelines, using specifically the standards outlined in the UHDL Dublin Core Metadata Guidelines & Best Practices document, a locally created guide to DS metadata creation. Twenty items were evaluated from each collection within the UHDL according to the following breakdown:
In total, thirty collections were sampled. The quality control audit was performed using several evaluative processes. Each individual record was reviewed for completeness, defined as possessing some data in the following Core Element fields:
All fields were assessed for typographical errors and all hyperlinks were checked to confirm their functionality. Titles were additionally reviewed to ensure that they followed appropriate Dublin Core style guidelines. The Creator and Contributor elements, if present, were examined for their compliance with Library of Congress Subject Heading (LCSH) formatting, a practice that was known to have been inconsistently followed during UHDL metadata creation. Descriptions were assessed to determine their accuracy and their adherence to local style formatting rules. For instance, only the first word and proper nouns should be capitalized, and a period should be present at the end of each sentence. Date elements were evaluated for compliance with formatting set forth in the UHDL DC guidelines. Keywords and Subject terms were reviewed to ensure that they appropriately represented the content and that they met a locally defined level of robustness. Records were also checked for the presence of file name elements, which are crucial to DS workflows such as delivery of high resolution images to patrons. Finally, hyperlinks contained within each collection that directed users to items in the UH Library Catalog or finding aids in Archon were checked to ensure they were not broken. The consistency of elements both within and between collections was assessed, and any other unusual occurrences in the metadata were noted by the reviewers. These observations were presented to DS staff in the form of a document containing suggestions for larger issues that should be addressed in a subsequent stage of the ongoing metadata audit initiative, as well as best metadata creation practices to incorporate into existing documentation moving forward.
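Several of the record-level checks described above lend themselves to automation. The following is a minimal Python sketch, not the procedure used in the pilot, in which records were reviewed manually. The field names in `REQUIRED_FIELDS`, the year-month-day date pattern and the dictionary record layout are all assumptions made for illustration; an actual implementation would take its required elements and date formats from the local metadata guidelines. The capitalization check is only a rough proxy for the local style rule.

```python
import re

# Hypothetical required fields; the real list comes from local guidelines.
REQUIRED_FIELDS = ("title", "description", "date", "subject", "file_name")

# Assumed date format for illustration: YYYY, YYYY-MM or YYYY-MM-DD.
DATE_PATTERN = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def audit_record(record):
    """Return a list of issue strings for one metadata record (a dict of strings)."""
    issues = []

    # Completeness: every required field must hold some data.
    for field in REQUIRED_FIELDS:
        if not record.get(field, "").strip():
            issues.append(f"missing {field}")

    description = record.get("description", "").strip()
    if description:
        # Style rule from the text: a period closes the description.
        if not description.endswith("."):
            issues.append("description lacks terminal period")
        # Rough proxy for the casing rule: the first character is not lowercase.
        if description[0].islower():
            issues.append("description does not start with a capital")

    date = record.get("date", "").strip()
    if date and not DATE_PATTERN.match(date):
        issues.append("date not in expected format")

    return issues


# Hypothetical record with several problems.
sample = {
    "title": "Campus aerial view",
    "description": "aerial photograph of the campus",
    "date": "circa 1950",
    "subject": "Aerial photographs",
    "file_name": "",
}
for issue in audit_record(sample):
    print(issue)
```

Checks such as factual accuracy of descriptions or the appropriateness of subject terms still require human judgment; a script of this kind would only narrow the set of records a reviewer needs to open.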
Data Collection and Analysis
In the next stage, intern leaders analyzed data collected about the 600 item records reviewed. Record level errors and quality issues were compiled in an Excel spreadsheet, and individual progress on the audit was reported in an online management system used by DS (Westbrook, 2012). Intern leaders also offered broad recommendations for both metadata creation and workflow improvements. These recommendations will be considered in the development of future digital library collections, and subsequent quality audits will follow the model of this project.
The results of the data analysis for the UHDL revealed record level issues, collection level issues and issues that spanned more than one collection. As the project was divided equally between the two project leaders, analysis of each batch roughly corresponded to half of the items audited. The first batch review consisted of 263 items. Missing descriptions accounted for the largest number of errors (37), followed by incorrect casing in the Title field (14), typographical errors in the description field (13), errors in the description field's terminal punctuation (13), and capitalization/casing errors in descriptions (11). Additional instances of repeat mistakes occurred in the form of typographical errors in other fields (7) and broken links (4). Four records were found to have unusual date ranges, which necessitated review of the date field standards contained in the data dictionaries. In addition to item level errors, the data analysis also surfaced collection level anomalies. Occasional inconsistencies were identified in the title field. For instance, some title fields led with a date and/or ID number, while the other items in the same collection did not contain this data. Another collection level anomaly concerned the consistency of publisher data. In one collection the same publisher was entered at least four different ways: city first, publisher first, full address present and publisher name only. Data analysis also revealed problems that spanned more than one collection. In the second batch review of 300 items, the Genre field of every item sampled was not in compliance with Dublin Core standards, demonstrating the need for a UHDL-wide change in metadata creation practices and for revision of existing metadata. During the Implementation of Suggested Corrections phase of the pilot audit, plans were made to address all three categories of problems that were encountered.
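The publisher inconsistency described above is the kind of collection level anomaly that can be surfaced by counting the distinct values a field takes within each collection. The sketch below is illustrative only: the record layout, the field names, the function names and the sample values are hypothetical, and the pilot audit identified these variants through manual review rather than with a script.

```python
from collections import defaultdict


def distinct_values_by_collection(records, field):
    """Map each collection name to the set of distinct non-empty values in `field`."""
    values = defaultdict(set)
    for record in records:
        value = record.get(field, "").strip()
        if value:
            values[record["collection"]].add(value)
    return dict(values)


def flag_inconsistent(records, field, max_variants=1):
    """Return collections whose `field` takes more than `max_variants` distinct values."""
    by_collection = distinct_values_by_collection(records, field)
    return {
        name: sorted(variants)
        for name, variants in by_collection.items()
        if len(variants) > max_variants
    }


# Hypothetical records: one publisher entered four different ways.
records = [
    {"collection": "A", "publisher": "Houston, TX: Example Press"},
    {"collection": "A", "publisher": "Example Press, Houston, TX"},
    {"collection": "A", "publisher": "Example Press, 1 Main St., Houston, TX"},
    {"collection": "A", "publisher": "Example Press"},
    {"collection": "B", "publisher": "Another Publisher"},
    {"collection": "B", "publisher": "Another Publisher"},
]
for name, variants in flag_inconsistent(records, "publisher").items():
    print(name, len(variants), "variants")
```

A collection may legitimately contain items from several publishers, so a flagged collection is a prompt for human review rather than proof of an error.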
There were also general recommendations put forth by the intern project leaders. In some cases, more descriptive titles may be useful in helping users locate items, especially in recall of previously viewed items. Similarly, it was found that the number of subject headings applied to records is inconsistent and some collections could benefit from the addition of more subject headings, which also aid in search and discovery. Within one collection, it was even reported that several photos appeared to be partially cut off. In this way, the audit pilot extended beyond strictly metadata review to encompass overall quality of items in the UHDL. At least one more recent collection did not contain evidence of any errors or inconsistencies and was in compliance with Dublin Core standards. This collection could perhaps be used as a model or template for future quality evaluation. This finding also suggests that current workflows and best practices in place for metadata creation in DS are reasonably effective and most of the work to improve the overall quality of items in the UHDL will consist of elevating early collections to the standards now in place.
Implementation of Suggested Corrections
Once Data Analysis was complete, digital services staff worked with student workers in the office to review the proposed metadata changes and incorporate them into the live site. At this time, a distinction was made between individual errors and inconsistencies that needed to be changed in single records and more programmatic reviews and revisions that needed to take place at the collection and UHDL level. Data Analysis generated a spreadsheet shared among project participants in Google Docs that provided a list of single items across all collections that needed one-time corrections. In some instances, the pilot audit also identified certain collections that were more prone to particular errors in just one or two fields, making it possible to conduct partial metadata revisions on every record in a collection. There was a third category of collections that needed full metadata revisions. Finally, issues were discovered that would require DS staff to consider metadata creation practices broadly and make a decision about what standards to keep in order to correct inconsistencies that had inadvertently grown over time due to lack of concrete standards or best practices documentation on that specific topic.
In review of the project, there are several aspects that contributed to success and others that the project team might do differently in the future. First, due to the small number of full-time staff available and the large number of DS projects, it was not practical to set aside the amount of staff time that might be needed for even a pilot audit project. While the project still required close oversight from DS staff, designating interns as project leads allowed DS to conduct a pilot audit much sooner and more quickly than would have been possible relying solely on staff resources. Other institutions might consider relying on intern help, even at the project design level, in order to conduct audit projects where resources are scarce.
At the project planning level, the form of deliverable created by the interns was useful in that it captured many levels of information about item records and collections and will therefore act as a template for subsequent audit initiatives. The spreadsheet proved valuable in creating a list of correction notes for single items. DS staff also encouraged interns to input copious notes on the overall state of collections and inconsistencies they discovered between collections. Because the UHDL was developed in collaboration with a variety of stakeholders over the course of three years, these inconsistencies were many. For institutions interested in conducting similar projects, creating robust record-keeping mechanisms early in the project might contribute to the success of the project.
Finally, sharing the spreadsheet through Google Docs allowed DS staff working on the project to check in at any time on the progress of the project and to review specific questions with interns working remotely when necessary. Using a spreadsheet to collect suggestions rather than allowing interns to make changes to live metadata also ensured that DS staff had an opportunity to review metadata suggestions before implementing them. This was especially important because the interns on the project were just learning about metadata standards and quality, while staff had more expertise to lend. For institutions interested in relying heavily on intern help to conduct digital collections audit projects, structuring intern work outside the live, public-facing system can be an important mechanism for maintaining control over the changes made. Additionally, this added layer of protection can help with buy-in if stakeholders are hesitant to give leadership roles at the intern level.
There were also project strategies that were less successful. First of all, for smaller collections consisting of 50 or fewer items, a sample size of 10 would have been sufficient for gauging whether that collection required significant metadata review or whether minor issues existed. For the pilot project, a sample size of 20 items per collection was applied across the board and this sample size was overkill for smaller collections.
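The sample-size adjustment suggested above can be expressed as a simple rule. The sketch below assumes simple random sampling without replacement, which is an assumption made for illustration, since the selection method used in the pilot is not described here. The function names, the default thresholds passed as parameters and the optional seed are likewise illustrative.

```python
import random


def sample_size(collection_size, small_threshold=50, small_n=10, default_n=20):
    """Use 10 items for collections of 50 or fewer items, 20 items otherwise."""
    n = small_n if collection_size <= small_threshold else default_n
    # Never ask for more items than the collection holds.
    return min(n, collection_size)


def draw_sample(item_ids, seed=None):
    """Draw a random sample of item identifiers without replacement."""
    rng = random.Random(seed)
    return rng.sample(item_ids, sample_size(len(item_ids)))


# Hypothetical collections of 40 and 900 items.
small = [f"item-{i}" for i in range(40)]
large = [f"item-{i}" for i in range(900)]
print(len(draw_sample(small, seed=1)), len(draw_sample(large, seed=1)))
```

Passing a seed makes the sample reproducible, which may help if staff later need to revisit exactly the records an intern reviewed.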
Additionally, rather than each worker creating a list of found errors and then compiling the lists, it may be more efficient to maintain a single list of errors. Doing so might help to ensure that record-keeping performed by more than one individual results in a consistent style, format and content. The project team noticed several mistakes that were introduced in the list merge step. Because Google Docs allows for simultaneous collaborative editing of a document, having a single, centralized list of errors would help standardize reporting.
Finally, over the course of the semester, interns spent the first half of their internship experience reviewing literature and drafting the project plan because no such plans were available in existing literature as a model. For the pilot project, this step was necessary since interns were unable to identify a viable model for digital collection audit projects. If DS were to conduct this type of project again, more time could be spent in Data Collection and Analysis by using the plan created by intern leads in the Fall 2011 semester as a basis for future projects. Similarly, as more institutions embark on digital library metadata audit projects and publish case studies and findings, the community will have a greater foundation for planning and implementing successful initiatives.
Future Directions and Conclusion
While the pilot audit predictably did not solve all of the metadata issues within the UHDL, many small measures were immediately implemented to improve both metadata creation practices and the quality of existing metadata. Since the pilot audit, DS has implemented an internal Quality Control workflow at the time of metadata creation, and the efficacy of this program will be assessed as more collections are published. DS staff also continue periodic batch reviews that extend beyond the records sampled for the pilot audit itself, focusing particularly on the collections and fields found in the pilot audit to have the highest incidence of error.
Because access to content is the primary goal of the metadata of a digital library, managing errors that can affect access should be an ongoing priority for the digital library. A more thorough evaluation of quality, assessing each individual record, would be useful but would require larger staff and additional time. Future audits will likely be conducted in collaboration with staff from the UH Libraries Cataloging and Metadata Services Department, especially in conjunction with a metadata specialist whom the UH Libraries are currently in the process of hiring. Interdepartmental audit projects may be able to use field and collection level data gathered by this pilot audit as a starting point. For instance, DS hopes to target collections that this pilot audit identified as needing further review as a preliminary step to larger, more thorough interdepartmental audit projects. In the case of these collections that have been flagged as having a high incidence of errors, particular emphasis can be placed on fields that have already been identified by this pilot audit as prone to error in specific collections.
Future UHDL metadata audit projects might re-evaluate what constitutes quality metadata, and to what extent metadata can be "good enough" for local retrieval needs, perhaps allowing for some errors to persist in light of the limited resources available for audit initiatives. As the UHDL grows and efforts are increased to share content on a variety of platforms, determining how the quality of the metadata affects interoperability may guide these decisions. An additional consideration for future projects is who is best suited to perform quality audits (those associated with the creation of the metadata, or external individuals) and whether using human, automated, or combined evaluation is most efficient for determining the quality of the metadata. This can also include considering how many staff members are necessary for a good quality evaluation, and whether sampling of the metadata records is sufficient. How the communication of metadata creation guidelines affects the quality of the product will also be investigated. Investing in quality assurance and continuing to build best practice and quality standard guidelines will help ensure that, as the UHDL continues to grow, quality of item records available to the user does not diminish.
Audits such as the one conducted for this project are part of a larger picture, one which includes ongoing metadata maintenance of a growing digital collection. Successful metadata maintenance must include both correction of found mistakes and correction to practices that lead to those mistakes. The project yielded valuable results that can be applied to improve the quality of metadata of the UHDL and can serve as a model for future audits for this and other institutions.
 Barton, J., Currier, S., & Hey, J.M.N. (2003). Building quality assurance into metadata creation: An analysis based on the learning objects and e-prints communities of practice. In: S. Sutton et al. (Eds.), DC-2003: Proceedings of the International DCMI Metadata Conference and Workshop, September 28-October 2, 2003, Seattle, Washington, USA. Syracuse, NY: Information Institute of Syracuse.
 Beall, J. (2005). Metadata and Data Quality Problems in the Digital Library. Journal of Digital Information, 6(3), 1-20.
 Bruce, T.R., & Hillmann, D.I. (2004). The continuum of metadata quality: Defining, expressing, exploiting. In D.I. Hillmann & E.L. Westbrooks (Eds.), Metadata in Practice (pp. 238-256). Chicago, IL: American Library Association.
 Dushay, N. & Hillmann, D. (2003). Analyzing metadata for effective use and re-use. Paper presented at the DCMI Metadata Conference and Workshop, Seattle, WA.
 Kelly, B., Closier, A. & Hiom, D. (2005). Gateway standardization: A quality assurance framework for metadata, Library Trends, 53(4), 637-50.
 Ma, S., Lu, C., Lin X., & Galloway, M. (2009). Evaluating the Metadata Quality of the IPL. Proceedings of the American Society for Information Science and Technology, 46(1), 1-17. http://dx.doi.org/10.1002/meet.2009.1450460249
 Park, J-R. (2009). Metadata Quality in Digital Repositories: A Survey of the Current State of the Art. Cataloging & Classification Quarterly, 47(3), 213-228. http://dx.doi.org/10.1080/01639370902737240
 Shreeves, S. L., Knutson, E. M., Stvilia, B., Palmer, C. L., Twidale, M. B. & Cole, T. W. (2005). Is "Quality" Metadata "Shareable" Metadata? The Implications of Local Metadata Practices for Federated Collections. Proceedings of Twelfth National Conference of the Association of College and Research Libraries. Chicago, IL: Association of College and Research Libraries.
 Stvilia, B., Gasser, L., Twidale, M. B. & Smith, L. C. (2007). A framework for information quality assessment. Journal of the American Society for Information Science and Technology, 58, 1720-1733.
 Westbrook, R.N. (2012). Online Management System: Wielding Web 2.0 Tools to Collaboratively Manage and Track Projects. Journal of Library Innovation, forthcoming.
 Westbrook, R.N. & Reilly, M. (2011). Growing a Mutually-Beneficial Digital Curation Internship Program that is Sustainable and Low Cost. Archiving 2011: Preservation Strategies and Imaging Technologies for Cultural Heritage Institutions and Memory Organizations, 21-25.
 Wisneski, R., & Dressler, V. (2009). Implementing TEI Projects and Accompanying Metadata for Small Libraries: Rationale and Best Practices. Journal of Library Metadata 9(3-4), 264-288.
Metadata Pilot Audit Project Proposal
1. Statement of Need
2. Project Design and Evaluation Plan
3. Project Resources
4. Impact/Future Implications
About the Authors