Harvard's Perspective on the Archive Ingest and Handling Test

Search | Back Issues | Author Index | Title Index | Contents

D-Lib Magazine
December 2005

Volume 11 Number 12

ISSN 1082-9873

Harvard's Perspective on the Archive Ingest and Handling Test

Stephen Abrams, Digital Library Program Manager

Stephen Chapman, Preservation Librarian for Digital Initiatives

Dale Flecker, Associate Director for Planning and Systems

Sue Kreigsman, Digital Projects Librarian

Julian Marinus, Digital Library Software Engineer

Gary McGath, Digital Library Software Engineer

Robin Wendler, Metadata Analyst

Harvard University Library
{stephen_abrams,steve_chapman,dale_flecker,suzanne_kriegsman,julian_marinus,gary_mcgath, robin_wendler}@harvard.edu

Introduction

The Harvard University Library (HUL) has operated a preservation repository, the Digital Repository Service (DRS), for the past five years, with over 3.2 million digital objects (12 TB) currently under managed storage [1]. By both policy and design, the DRS is intended for highly "curated" digital assets; that is, those that are owned and submitted by known users, created according to well-known workflows and meeting well-known technical specifications, and in a small set of approved formats. In the future, however, we feel that most, if not all, of these restrictions will be gradually eliminated, thus positioning the DRS much more as an institutional repository. In view of this, the Archive and Ingest Handling Test (AIHT), organized by the Library of Congress, and requiring the import, export, and manipulation of a sizeable test corpus of unknown provenance, was an excellent opportunity for us to investigate a number of issues we can expect to face in the future.

The most significant of these issues include:

Automated extraction of technical metadata from digital objects
Automated generation of DRS-compliant Submission Information Packages (SIPs)
Identification of explicit and implicit constraints on acceptable digital objects throughout the current repository architecture and workflows
Systematic preservation migrations
Post-migration quality assurance testing
Metadata models for capturing provenance information
Repository export

The AIHT project team drew upon the expertise of HUL staff in the following areas:

Digital reformatting
Administrative and financial reporting
Repository design, implementation, and workflow
Format identification, validation, and characterization
Metadata analysis

The AIHT project was successful on two levels. On a large scale, the project validated the supposition that significant bodies of digital material can be transferred easily between preservation institutions making use of radically different technological infrastructures. More parochially, the project uncovered a number of areas of DRS design, practice, and policy that will require some level of enhancement as HUL expands the services and scope of the repository to better serve the needs of the university for reliable long-term access to its manifold digital assets.

Phase I: Ingest

During the ingest phase a number of scaling issues arose almost immediately. The test corpus was delivered as a single 9 GB compressed archive file (911da.tar.gz) on an NTFS-formatted disk. As the DRS ingest process is implemented on a Unix server, this necessitated staging the data through a Windows platform. However, many of the most commonly used Windows decompression tools have a file size limitation of 4 GB. Once an appropriate tool was installed, the archive file was decompressed and expanded to a locally-attached NAS disk. However, although this disk had a total capacity of over 100 GB, any single file was limited to 2 GB, so the initial attempt failed. It was therefore necessary to transfer the original compressed archive by FTP to a Unix environment, where it was decompressed (a file size-related patch was necessary for the Unix tool as well) and expanded into 57,450 individual files, which were then FTP'ed back to the Windows platform. While the single 9 GB file was transferred in approximately 3 hours, moving 12 GB of individual files took over 35 hours, with numerous error restarts. (In particular, the FTP client balked at transferring zero length files.) Given the unknown provenance of the test data, we were surprised, albeit pleasantly, to discover that all of the files were virus free.

The DRS draws a fundamental distinction between primary content and metadata about that content. Content is stored in a RAID-based file system, with automatic replication to an off-site robotic tape library; administrative, technical, and structural metadata are stored in an Oracle 9i relational database. (In the HUL infrastructure, descriptive metadata is stored in public access catalogs external to the repository. As issues of access were out of scope for the AIHT project, no descriptive metadata was generated or stored.) Access to metadata stored in the repository is mediated through a Java API.

Flow chart showing the DRS architecture

Figure 1. DRS architecture

DRS SIPs take the form of an arbitrary number of individual content files along with a single XML-formatted control file containing loader directives and administrative and technical metadata about the content files. Minimum standards for required DRS technical metadata have been established for text, image, and audio content types and are enforced through a variety of mechanisms, including the SIP schema, database table-level constraints, and the repository API. (Image metadata is consistent with the NISO Z39.87 data dictionary [2]; audio metadata is consistent with the evolving AES-X098B schema [3].) Thus, the main technical challenge of Phase I was to generate the necessary control file and populate it with appropriate metadata. This was accomplished using an extension to the standard JHOVE Audit output handler [4].

The default high-level behavior of the Audit handler can be summarized as follows:

audit(directory)
{
for (all files in directory) {
if (file is directory) {
audit(file);
}
else {
validate and characterize file;
output summary file information;
}
}
}

A new output handler, known as DSIP (DRS SIP handler), was created by extending the Audit class such that the output format conforms to the schema for the DRS SIP control file [5].

At the time that the project began, JHOVE modules were available for the following formats: ASCII, GIF, JPEG, PDF, TIFF, UTF-8, and XML. As part of the project, new modules were developed for the AIFF, HTML, JPEG 2000, and WAVE formats. (All of these modules were subsequently included in the May 26, 2005, production release of JHOVE.) This module set was sufficient to support over 93% of the files in the AIHT corpus. The distribution of the remaining 7% showed a very long and shallow tail, with more than 90 additional formats suggested by unique file extensions. It should be further noted that the format of the files as determined by JHOVE differed, in some cases significantly, from the MIME types supplied in the original transfer manifest from the Library of Congress and from the formats implied by the file extensions.

**Table 1**
Format	MIME type	By manifest	By extension	As validated by JHOVE
AIFF	audio/x-aiff	162	151	162
ASCII/UTF-8	text/plain	20,207	10,822	30,887
GIF	image/gif	1,337	1,320	1,339
HTML	text/html	16,667	16,579	3,649
JPEG	image/jpeg	12,752	12,763	12,576
PDF	application/pdf	1,663	1,664	1,659
TIFF	image/tiff	1,538	1,533	1,537
WAVE	audio/x-wave	2,015	2,016	2,015
XML	text/xml	0	1	8
Unknown	application/octet-stream	1,141	10,643	3,618
57,492			57,492	57,450

The HTML format appears to be particularly problematic: over 78% of the files putatively identified as HTML in the transfer manifest were determined to be malformed by JHOVE. These files were ingested into the DRS as plain text files. Files whose format could not be determined by JHOVE were ingested as opaque objects of type "application/octet-stream".

Note the discrepancy in the total number of files: 57,492 documented in the transfer manifest vs. 57,450 found on the file system. In fact, the manifest recorded information about 49 files not found on the file system and the file system contained 7 files not recorded in the manifest. The majority of these discrepancies can be traced to file system incompatibilities between NTFS and UFS with regard to pathname length, case, and character set.

As mentioned previously, the DRS is currently implemented to enforce a number of constraints based on the assumption that all depositors will follow DRS technical specifications for object creation. In the context of the AIHT project these constraints had to be relaxed to permit the ingestion of the test data. The more significant of these changes included:

Permit 0 length files (34 files had a length of 0)
Accept previously unsupported formats: AIFF, HTML, PDF, and WAVE
In audio metadata, accept a bit depth of 8 (rather than the standard of 16 previously established for internal audio conversion projects)
In image metadata, accept bits/sample of 1,1,1 (a highly unusual configuration for raster data organized into three 1-bit channels found in 27 files)

Phase II: Export and Re-import

Prior to the AIHT project, the DRS API did not include export functionality: sets of files could be requested by name (or other uniquely identifying properties) but there was no facility for bulk export. A simple Dissemination Information Package (DIP) format was established in the form of a compressed archive file (harvard.tar.gz) consisting of the 57,450 individual content files and a single METS file [6] containing administrative and technical metadata. The administrative metadata included the original pathname of the file, its size, MIME type, and MD5 checksum. Technical metadata for images was formatted in the MIX schema [7]; for audio, the AES-X098B schema was used; and for text, the NYU text METS extension schema (textmd.xsd).

For the re-import step of Phase II, the DIPs from the other three participating institutions were evaluated for consistency with the DRS architecture. The Johns Hopkins DIP did not include technical metadata while the Old Dominion DIP was based on MPEG-21 DIDL (Digital Item Description Language) [8], an unfamiliar technology; therefore we decided to import the Stanford DIP. This consisted of a METS file and two parallel directory trees: one for the content files and one for the technical metadata about those files. The technical metadata for each content file was stored in its own separate file. Due to this distributed organization, we were not able to use a simple XSL stylesheet transformation, but rather had to create a Perl script to generate the DRS SIP; the process would have been simpler if the DIP METS file directly contained all of the metadata.

Stanford also used JHOVE to generate much of its technical metadata. Unfortunately, however, they were working with an older version of the code, pre-release 1.0 (beta 2) of 2004-07-19. HUL used the then current internal development version, pre-release 1.0 (beta 3) of 2005-02-04, which corrected a number of known errors in the previous version as well as providing several new format modules. This version disparity led to some inconsistencies in the reporting of the technical metadata between the initial ingest and the re-import of the Stanford DIP.

**Table 2**
Format	MIME type	Original DIP	Stanford DIP
AIFF	audio/x-aiff	162	162
ASCII/UTF-8	text/plain	30,887	30,910
GIF	image/gif	1,339	0
HTML	text/html	3,649	1,222
JPEG	image/jpeg	12,576	10,766
PDF	application/pdf	1,659	1,662
TIFF	image/tiff	1,537	1,537
WAVE	audio/x-wave	2,015	0
XML	text/xml	8	5
Unknown	application/octet-stream	3,618	11,186
57,450			57,450

Post-mortem investigation disclosed the following reasons for these discrepancies:

ASCII/UTF-8: The Stanford DIP characterized 23 HTML files as plain text. The Harvard DSIP tool, making use of the HTML module newly available in the beta 3 pre-release, characterized these files as HTML.
GIF: The beta 2 GIF module contained a known error in which the image bits/sample property was not reported. As this is a required property for DRS image metadata, all Stanford GIF files were ingested into the DRS as opaque objects of type "application/octet-stream".
HTML: The HTML module introduced in the beta 3 pre-release was used to characterize the files based on parsing the content. As this facility was not available in beta 2, the Stanford characterization was based on other, less definitive, criteria, such as file extension.
JPEG: The Stanford DIP characterized 1,810 fewer files as JPEG. The beta 2 JPEG module contained a known error in which files containing an optional EXIF metadata segment were erroneously reported as invalid. This error was corrected in the beta 3 pre-release.
PDF: The Stanford DIP characterized 3 additional files as PDF that DSIP rejected as being not well-formed. One of the files had a PDF header that did not start at byte offset 0. Although this is a technical violation of the PDF syntax [9], the looser requirements of the Acrobat reader allow the header to start anywhere in the first 1024 bytes. The PDF module in the current production release of JHOVE (1.0 of 2005-05-26) follows this looser restriction with appropriate notification.
WAVE: The beta 2 WAVE module contained a known error in which the audio data encoding property was not reported. As this is a required property for DRS audio metadata, all Stanford WAVE files were ingested as opaque objects of type "application/octet-stream".
XML: The Stanford import characterized 3 fewer files as XML. The beta 2 XML module did not report the character encoding for these files. As this is a required property for DRS text metadata, these files were ingested as opaque objects of type "application/octet-stream".

These discrepancies point out the importance of standardizing on processing tools and criteria for format well-formedness and validity. They also highlight an unfortunate feature of the current DRS, which only permits deposit of objects accompanied by minimally required metadata. Objects without this minimal metadata can be ingested only by being downgraded to opaque objects of type "application/octet-stream". Future enhancements to the repository will relax this restriction.

Phase III: Format Migration

HUL chose to perform a migration of various image files to the JPEG 2000 format [10]. There is great local interest at Harvard in the retrospective conversion of substantial numbers of existing TIFF images to enhance their utility by permitting the dynamic image manipulation facilitated by the JPEG 2000 format. The three goals that guided the design of the migration were:

To preserve fully the integrity of the GIF, JPEG, and TIFF source data when transformed into the JPEG 2000 (JP2) format
To maximize the utility of the new JP2 objects
To minimize migration costs

The source population of 15,452 GIF, JPEG, and TIFF files was subdivided into 25 classes based on format, color space, compression, bits/sample, and image size. Each class had a corresponding set of unique codec parameters. In keeping with our established goals of preserving integrity and maximizing usability, the codec configuration used the following specifications:

RLCP (resolution-layer-component-position) progression order
Tile size 1024×1024
Reversible 5-3 wavelet transform
Decomposition levels based on source image maximum pixel size
Two quality layers: 50% (35dB pSNR) and 100% (full quality)
Reversible channel quantization
Highest quality coding predictor offset
Grayscale and sRGB colorspaces for single- and multi-channel images, respectively

These ensured lossless compression and fast decoding and permitted the dynamic manipulation (pan and zoom) of the resulting images. Since none of the advanced color management features of the JPEG 2000 JPX profile was required – none of the JPEG files included embedded color profiles and none of the TIFF files made use of the WhitePoint or PrimaryChromaticities colorimetry tags – the JP2 profile was used instead. As the codec did not accept GIF as a source format, the 1,339 GIF files were first transformed to equivalent uncompressed RGB TIFFs.

The initial attempt at transformation failed for a significant proportion of the source images (5,031 out of 15,452, or 32.5%):

**Table 3**
Format	Color space	Compression	Bits / sample	Maximum pixel-size	Source files			JPEG 2000 files
GIF	Palette	LZW	8	1-300	963	1,339	1,339	391	706	706
				30 -600	171			126
				601-1200	204			188
				1201-2400	1			1
JPEG	YCbCr	DCT	8	10-300	2	67	12,576	2	66	8,183
				301-600	12			12
				601-1200	37			36
				1201-2400	13			13
				2401-4800	3			3
			8,8,8	1-300	3,390	12,501		1,922	8,117
				301-600	3,061			1,798
				601-1200	3,729			2,749
				1201-2400	2,107			1,440
				2401-4800	210			207
				4801-9600	4			1
			8,8,8,8	601-1200	8	8		0	0
TIFF	Bitonal	None	8	2401-4800	6	6	1,537	6	6	1,532
	RGB	None	8,8,8	1-300	561	1,510		561	1,510
				301-600	2			2
				601-1200	927			927
				1201-2400	2			2
				2401-4800	18			18
		LZW	8,8,8	1201-2400	3	16		2	11
		LZW	8,8,8	2401-4800	13	16		9	11
		PackBits	8,8,8,8	601-1200	5	5		5	5
15,452						15,452	15,452	10,421	10,421	10,421

Post-mortem investigation revealed the following limitations and errors in the codec, all of which were quickly addressed by the codec vendor in a subsequent release.

GIF (633 failures): All of these images made use of transparent background color. The derived TIFF files maintained the transparent background color, which unfortunately was not a feature of TIFF supported by the codec.
JPEG (4,393 files): The 4,385 single-channel and three-channel images were subject to a program bug in the codec. Additionally, the codec did not support four-channel JPEG source images, so all 8 of those files were rejected.
TIFF (5 failures): All of these images were subject to a program bug in the codec.

The total file size of the resulting JPEG 2000 files increased 1.4 GB over that of the original source files, a 28% increase. In general, the resulting file sizes for images derived from GIF and JPEG sources increased by a factor of three, while those derived from TIFF sources decreased by a factor of three.

**Table 4**
Source				JPEG 2000
Format	Files	Total	Average	Total	Average
GIF	706	35 MB	51 KB	111 MB	161 KB
JPEG	8,183	1.7 GB	214 KB	5.3 GB	678 KB
TIFF	1,532	3.3 GB	2.2 MB	1.1 GB	700 KB
10,421		5.0 GB	501 KB	6.4 GB	647 KB

Since approximately 19% of the original images were in the RGB color space, an additional codec option to perform an RGB-to-YUV color space transformation could have been specified. Since this would have the effect of removing correlations between the color channels, we would have expected to see somewhat better compression ratios.

Automated and manual quality assurance testing was performed subsequent to the migration. JHOVE was used to verify that all of the resulting JPEG 2000 images met the established specifications. Manual side-by-side viewing of before and after images was performed on a small, through representative, sample under ISO 3664 calibrated viewing conditions [11]. A commercial image processing application was used to perform pixel-by-pixel comparisons of source and target images. For all TIFF-to-JPEG 2000 migrations, which did not require a color space transform, the comparison showed exact numerical equivalence; for all JPEG-to-JPEG 2000 migrations, which did involve a YCbCr-to-sRGB color space transform, the comparison revealed small errors (on the order of σ= 0.02), presumably the result of numerical round-off. In all cases, however, the JPEG 2000 images were considered to be visually lossless by trained observers of the Harvard College Library Digital Imaging Group (HCL-DIG). Thus, at best the transformation process can be considered mathematically lossless and at worst, perceptually lossless.

The current DRS data model does not provide a mechanism to capture arbitrary provenance metadata. Based on our Phase III investigation, the HUL team recommended the use of the PREMIS event model for this purpose [12]. Unfortunately, the project schedule did not allow sufficient time for HUL to attempt to implement any changes to the DRS. (In addition to the necessary table model and API changes, this would have required significant alteration of existing work flows.) Provenance metadata for the derived JPEG 2000 images were maintained in the DRS in the form of discursive text inserted into an existing administrative note field.

Conclusions

In general, the test proceeded smoothly along the lines defined in the project proposal. There were no major difficulties that necessitated significant changes to the original HUL project plan. The test provided HUL with important information regarding potential changes to its Digital Repository Service that will increase its ability to accept a wider range of digital resources generated by agents and processes beyond HUL's control or influence. JHOVE proved to be a fundamentally important infrastructure component, used by all project participants for the automated validation of content files, extraction of technical metadata, and generation of repository-compliant SIPs. The format migration process was fully automated once the appropriate codec specifications were developed. Our results suggest that for images, effective post-transformation QA testing can be performed in an automated manner.

All AIHT participants successfully completed all required project phases, thus, the project was successful in validating a key assumption of the evolving National Digital Information Infrastructure Preservation Program (NDIIPP) infrastructure [13]: namely, that significant bodies of digital content can be transferred without loss between institutions utilizing radically different preservation architectures and technologies. However, the scaling problems uncovered during the project underscore the importance of identifying and deploying robust tools and protocols in preservation workflows. In general, the limiting factor in large-scale data transfer appears to be the number of objects, rather than their individual or total size. This suggests that appropriate SIPs and DIPs for such transfers should be arbitrarily nestable container objects. The design of such objects would be a fruitful area for collaborative standardization within the digital library and preservation communities.

References

[1] Harvard University Library, Overview: Digital Repository Service (DRS), August 11, 2004 <http://hul.harvard.edu/ois/systems/drs/>.

[2] NISO Z39.87 / AIIM 20-2002, Data Dictionary-Technical Metadata for Digital Still Images, Draft Standard for Trial Use, June 1, 2002-December 31, 2003 <http://www.niso.org/standards/standard_detail.cfm?std_id=731>.

[3] AES-X098B, Administrative and structural metadata for audio objects (work in progress by the Audio Engineering Society SC-03-06 Working Group on Digital Library and Archive Systems).

[4] Harvard University Library, JHOVE - JSTOR/Harvard Object Validation Environment, May 26, 2005 <http://hul.harvard.edu/jhove/>.

[5] Harvard University Library, DRS User Manual for Data Loading, October 4, 2005 <http://hul.harvard.edu/ois/systems/drs/load_manual/wwhelp/wwhimpl/js/html/wwhelp.htm>.

[6] Library of Congress, METS: Metadata Encoding and Transmission Standard Official Web Site, January 25, 2005 <http://www.loc.gov/standards/mets/>.

[7] Library of Congress, MIX: NISO Metadata for Images in XML Schema Official Web Site, August 30, 2004 <http://www.loc.gov/standards/mix/>.

[8] ISO/IEC 21000-2:2005, Information technology - Multimedia framework (MPEG-21) - Part 2: Digital Item Declaration, October 6, 2005.

[9] Adobe Systems Incorporated, PDF Reference: Adobe Portable Document Format Version 1.6 (5th ed.; Berkeley: Peachpit Press, 2005) <http://partners.adobe.com/public/developer/en/pdf/PDFReference16.pdf>.

[10] ISO/IEC 15444-1, Information technology - JPEG 2000 image coding system: Core coding system, September 23, 2004.

[11] ISO 3664:2000, Viewing conditions - Graphic technology and photography, August 15, 2005.

[12] OCLC/RLG, Data Dictionary for Preservation Metadata: Final Report of the PREMIS Working Group, May 2005 <http://www.oclc.org/research/projects/pmwg/premis-final.pdf>.

[13] Library of Congress, Preserving Our Digital Heritage: Plan for the National Digital Information Infrastructure Preservation Program, October 2002 <http://www.digitalpreservation.gov/repor/ndiipp_plan.pdf>.

D-Lib Magazine Access Terms and Conditions

doi:10.1045/december2005-abrams

D-Lib MagazineDecember 2005

Volume 11 Number 12 ISSN 1082-9873

Harvard's Perspective on the Archive Ingest and Handling Test

Introduction

Phase I: Ingest

Phase II: Export and Re-import

Phase III: Format Migration

Conclusions

References

Copyright © 2005 Stephen Abrams, Stephen Chapman, Dale Flecker, Sue Kreigsman, Julian Marinus, Gary McGath, and Robin Wendler

D-Lib Magazine
December 2005

Volume 11 Number 12

ISSN 1082-9873

Copyright © 2005 Stephen Abrams, Stephen Chapman, Dale Flecker, Sue Kreigsman,
Julian Marinus, Gary McGath, and Robin Wendler