Reagan Moore, Chaitan Baru, Arcot Rajasekar, Bertram Ludaescher, Richard Marciano, Michael Wan, Wayne Schroeder, and Amarnath Gupta
[This is the second of a two-part article. The first part appeared in the March 2000 issue of D-Lib Magazine. Part 1 described persistence issues and provided a generic description of the scalable technology for managing media and context migration.]
Abstract
"Collection-Based Persistent Digital Archives: Part 2" describes the creation of a one million message persistent E-mail collection. It discusses the four major components of a persistent archive system: support for ingestion, archival storage, information discovery, and presentation of the collection. The technology to support each of these processes is still rapidly evolving, and opportunities for further research are identified.
1. Collection Support, General Requirements
Persistent archives can be characterized by two phases: the archiving of the collection, and the retrieval or instantiation of the collection onto new technology. The processes used to ingest a collection, transform it into an infrastructure-independent form, and store the collection in an archive comprise the persistent storage steps of a persistent archive. The processes used to recreate the collection on new technology, optimize the database, and recreate the user interface comprise the retrieval steps of a persistent archive. The two phases form a cycle that can be used for migrating data collections onto new infrastructure as technology evolves. The technology changes can occur at the system level, where archive, file, compute, and database software evolves, or at the information-model level, where formats, programming languages, and practices change.
1.1 Collection Process Definition
The initial data set ingestion and collection creation can be seen as a process in which:
(a) objects are captured, wrapped as XML digital objects, and categorized in a (relational) database system,
(b) the collection is ingested into an archival-storage system using containers to hold the digital objects along with all pertinent meta-data and software modules.
The migration cycle can be seen as the reverse of the ingestion process in which:
(a) containers are brought out of deep-store and loaded into a (possibly NEW) database system (that can be relational or hierarchical or object oriented),
(b) the database is queried to form (possibly NEW) containers that are placed back into a (possibly NEW) archival storage system.
Note the similarities in steps (a) and (b) of the ingestion and migration processes. In order to build a persistent collection, we consider a solution that "abstracts" all aspects of the data and its preservation. In this approach, data objects and processes are codified by raising them above their machine- and software-dependent forms to an abstract format that can be used to recreate the objects and the processes in any desired new form.
Data objects are abstracted by marking the contents of each digital object with tags that define the digital object structure. Tags are also used to mark the attributes that are used to organize the collection and define the collection context. Processes are abstracted such that one can create a new "procedure" in a language of choice. Examples are the ingestion procedures themselves. They comprise "abstract load modules" for building the collection. Similarly, the querying procedures can be represented as "abstract mappings from a definition language to a query language" and visualization presentation procedures can be cast as "style-sheet abstractions".
The multiple migration steps can be broadly classified into a definition phase and a loading phase. The definition phase is infrastructure independent, whereas the loading phase is geared towards materializing the processes needed for migrating the objects onto new technology.
We illustrate these steps by providing a detailed description of the actual process used to ingest and load a million-record E-mail collection at the San Diego Supercomputer Center (SDSC). Note that the SDSC processes were written to use the available object-relational databases for organizing the meta-data. In the future, it may be possible to go directly to XML-based databases.
I. Ingestion/Storage Phase
The SDSC infrastructure uses object-relational databases to organize information. This makes data ingestion more complex by requiring the mapping of the XML DTD semi-structured representation onto a relational schema. Two aspects of the abstraction of objects need to be captured: relationships that exist in and among the data, and hierarchical structures that exist in the data. These were captured in two different types of abstractions: through a relational Data Definition Language (DDL), and through a semi-structured Document Type Definition. The relational abstraction is a mature technology that facilitates querying about the meta-data, whereas the semi-structured abstraction is our chosen uniform information model. Hence, our process used both of these technologies to manage digital objects. In the future, with the emergence of XML-based database systems, only the semi-structured representation will be needed. In the model below, only the XML-DTD was stored as part of the abstract object; instead of storing the DDL, we stored the procedure for creating a DDL from a DTD. A system-dependent DDL was created using the DTD and the DTD-to-DDL mapping procedure with the addition of system-specific information. The software that creates the system-dependent DDL comprises the instantiation program between the digital objects stored in the archive, and the collection that is being assembled on new technology.
The steps used to store the persistent archive were:
Define Digital Object
II. Load Phase
The load phase uses the information models that were archived in the ingestion phase. The information models are read out of the archive and used to drive the database instantiation software:
Create generator for the Database-DDL --- (E)
Create generator for the database Loader --- (F)
Create generators for presentation interface and storage --- (G)
Generate Containers and Store
In the ingestion phase, the relational and semi-structured organization of the meta-data is defined. No database is actually created, only the mapping between the relational organization and the object DTD. Note that the collection relational organization does not have to encompass all of the attributes that are associated with a digital object. Separate information models are used to describe the objects and the collections. It is possible to take the same set of digital objects and form a new collection with a new relational organization.
In the load phase, the mappings between the relational and semi-structured representations of the objects and collections are combined with the semi-structured information model to generate the relational representation for the new database on the target system. The information is encapsulated as a software script that can be used to create the new database tables for organizing the attributes.
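The generation of a table-creation script from a mapping can be sketched as follows. This is a minimal illustration, not the SDSC software: the mapping contents, the table and column names, and the function `generate_ddl` are all hypothetical, and the SQL types are placeholders for the system-specific information that a real target database would require.

```python
# Hypothetical mapping from DTD element names to (table, column, SQL type).
# The actual SDSC mapping and schema are not given in the article.
DTD_TO_RELATIONAL = {
    "Message-ID": ("message", "message_id", "VARCHAR(255)"),
    "From": ("message", "sender", "VARCHAR(255)"),
    "Subject": ("message", "subject", "VARCHAR(255)"),
    "Date": ("message", "posted", "VARCHAR(64)"),
}


def generate_ddl(mapping):
    """Emit one CREATE TABLE statement per table named in the mapping."""
    tables = {}
    for table, column, sql_type in mapping.values():
        tables.setdefault(table, []).append(f"    {column} {sql_type}")
    statements = []
    for table, columns in tables.items():
        statements.append(
            f"CREATE TABLE {table} (\n" + ",\n".join(columns) + "\n);"
        )
    return "\n\n".join(statements)


# The generated text is the "software script" that would be run
# against the new database to create its tables.
ddl_script = generate_ddl(DTD_TO_RELATIONAL)
print(ddl_script)
```

Because only the mapping is archived, retargeting to a different database would mean changing the type names in the mapping or the generator, not the stored objects.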
A second script is created that is used to parse the digital objects that are retrieved from the archive, and load the associated meta-data into the new database tables.
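A loader of this kind might look like the sketch below. It assumes a hypothetical tagged-object layout and a hypothetical `message` table, and it uses Python's built-in `sqlite3` module as a stand-in for the target database; none of these details come from the SDSC implementation.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical tagged objects, as they might come back from the archive.
OBJECTS = [
    "<message><required name='Message-ID'>&lt;1@example.org&gt;</required>"
    "<required name='Subject'>First post</required><body>Hi</body></message>",
    "<message><required name='Message-ID'>&lt;2@example.org&gt;</required>"
    "<required name='Subject'>Re: First post</required><body>Hello</body></message>",
]


def load_objects(conn, objects):
    """Parse each XML object and insert its meta-data into the message table."""
    rows = []
    for text in objects:
        root = ET.fromstring(text)
        fields = {item.get("name"): item.text for item in root.iter("required")}
        rows.append((fields.get("Message-ID"), fields.get("Subject")))
    conn.executemany(
        "INSERT INTO message (message_id, subject) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the new database
conn.execute("CREATE TABLE message (message_id TEXT PRIMARY KEY, subject TEXT)")
loaded = load_objects(conn, OBJECTS)
subjects = [
    row[0]
    for row in conn.execute("SELECT subject FROM message ORDER BY message_id")
]
```

Only the meta-data attributes are loaded into the tables here; the message bodies stay in the archived objects.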
Steps (A),(B),(C) and (D) can be interpreted as abstract mark-up formats for digital objects and steps (E),(F) and (G) can be interpreted as abstract procedures. The formats and procedures can be combined to support migration of the collection onto new software and hardware systems, as well as migration onto new information models or data formats and new procedure languages.
In a system-level migration process, a database is created using (X), the database is instantiated from the copy of the container(s) in the archival storage system using (Y), and the data in the database is stored into a new archival storage system using (Z').
In a format-level migration process, new versions of (A),(B),(C),(D) are created based on the prior values, a database is created using the original (X) and instantiated with the prior (Y), and then the data in the database is reformatted and stored in the archival storage system using the new (Z').
In a language-level migration process, new versions of (E),(F),(G) are created based on the original values, and stored as part of the packaged container. The data itself is not migrated.
2. Persistent Archive Demonstration
The steps required to ingest a collection, archive the data and the collection description, and then re-create a database and support queries against the collection have been demonstrated as part of a persistent archive prototype. A collection of 1 million Newsgroup (Usenet) messages was created from technical topic groups, including Computer Science, Science, Humanities, and Social Science. The Usenet collection was chosen because RFC 1036 provides a standard for Usenet messages that defines both required and optional attributes.
The 1-million record E-mail collection was ingested, archived, and dynamically rebuilt within a single day. This was possible because all steps of the process were automated. The demonstration is scalable such that archiving of 40-million E-mail records can be done within a month. The steps for the 1-million record demonstration included assembling the collection, tagging each message using XML, archival storing of the digital objects, instantiating a new collection, indexing the collection, presenting the collection through a Web interface, and supporting queries against the collection. The required Usenet meta-data attributes form a core set that can be applied to all E-mail messages.
2.1 Ingestion Process
A typical raw E-mail record contains three kinds of keywords: required attributes, such as Path, From, Newsgroups, Subject, Date, and Message-ID; optional keywords, such as Organization, Mime-Version, and Content-Type; and other user-defined keywords that follow a defined syntax. The keywords can be used to identify a specific message, and therefore constitute the attributes used to organize the E-mail collection.
A DTD was derived which reflects the RFC 1036 structure. Each of the required, optional, and other keyword items was associated with a seqno attribute that records the sequence in which the various keywords appear in the original document. The order of appearance of keywords may be different in different documents. In the ingestion step, each E-mail record was parsed into a form that conforms to the standard DTD. As a result, the messages can be displayed using Microsoft Notepad, an XML viewer, by applying the DTD. This provides the ability to impose a presentation style on the objects in a collection.
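The tagging step can be illustrated with a short sketch. The element names (`message`, `required`, `optional`, `other`, `body`) and the function `tag_message` are hypothetical, since the article does not reproduce the DTD; only the keyword names and the seqno attribute are taken from the text.

```python
import xml.etree.ElementTree as ET

# Required header keywords named in the text (from RFC 1036).
REQUIRED = {"Path", "From", "Newsgroups", "Subject", "Date", "Message-ID"}
# Optional keywords the text gives as examples, not a complete list.
OPTIONAL = {"Organization", "Mime-Version", "Content-Type"}


def tag_message(raw):
    """Wrap one raw message as XML, recording header order in seqno."""
    head, _, body = raw.partition("\n\n")
    msg = ET.Element("message")  # element names are hypothetical
    seqno = 0
    for line in head.splitlines():
        keyword, sep, value = line.partition(":")
        if line[:1].isspace() or not sep:
            continue  # continuation lines are ignored in this sketch
        seqno += 1
        if keyword in REQUIRED:
            kind = "required"
        elif keyword in OPTIONAL:
            kind = "optional"
        else:
            kind = "other"
        item = ET.SubElement(msg, kind, name=keyword, seqno=str(seqno))
        item.text = value.strip()
    ET.SubElement(msg, "body").text = body
    return msg


raw = (
    "Path: news.example.org!not-for-mail\n"
    "From: alice@example.org\n"
    "Newsgroups: comp.databases\n"
    "Subject: XML and archives\n"
    "X-Custom: demo\n"
    "\n"
    "Hello, world.\n"
)
element = tag_message(raw)
xml_text = ET.tostring(element, encoding="unicode")
```

Keeping seqno on each item means the original header order can be reconstructed even if a later representation groups the keywords by kind.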
Loading sources with regular structure into a relational database (RDB) has several benefits:
We created 3 hand-crafted tables for the Newsgroup collection for the relational database implementation:
Note that the semi-structured nature of the E-mail messages is more easily represented with an XML semi-structured representation.
To store the collection, the E-mail messages were aggregated into 25 containers, each holding about 40,000 records. Note that during ingestion, unique tags were added to define the beginning and end of each message. This is required because it is possible for an E-mail message to contain an encapsulated E-mail message, making it difficult to create digital objects by explicit analysis of the complete collection. In the current implementation of the data handling system, a "container" file can be registered with SRB/MCAT (the SDSC data handling system). SRB/MCAT then provides access mechanisms for retrieving individual messages from such containers.
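The container scheme can be sketched in a few lines. This is an assumption-laden illustration rather than the SRB/MCAT mechanism: the marker syntax, the in-memory index, and the functions `pack` and `fetch` are hypothetical. The point it shows is that explicit begin and end tags, plus an index of offsets, let a single message be pulled from a container even when a message body itself contains another message.

```python
CONTAINER_SIZE = 40_000  # records per container, as in the text


def begin_tag(n):
    return f"<<<BEGIN-OBJECT {n:07d}>>>\n"  # marker syntax is hypothetical


def end_tag(n):
    return f"<<<END-OBJECT {n:07d}>>>\n"


def pack(messages, container_size=CONTAINER_SIZE):
    """Group messages into containers; return (containers, index).

    index maps message number -> (container number, start offset, length),
    with offsets counted in characters.
    """
    containers, index = [], {}
    parts, offset = [], 0
    for n, message in enumerate(messages):
        record = begin_tag(n) + message + end_tag(n)
        index[n] = (len(containers), offset, len(record))
        parts.append(record)
        offset += len(record)
        if len(parts) == container_size:
            containers.append("".join(parts))
            parts, offset = [], 0
    if parts:
        containers.append("".join(parts))
    return containers, index


def fetch(containers, index, n):
    """Return message n without scanning the rest of its container."""
    c, start, length = index[n]
    record = containers[c][start:start + length]
    return record[len(begin_tag(n)):-len(end_tag(n))]


# Small demonstration: 10 messages, 4 per container.
messages = [f"Subject: note {i}\n\nbody {i}\n" for i in range(10)]
containers, index = pack(messages, container_size=4)
```

With the figures in the text, one million messages at 40,000 per container gives the 25 containers reported.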
The ingestion process was carried out on a workstation, an SGI Indigo 2 (MIPS R10000 processor, 195 MHz, 128 Mbytes of memory). The collection size was 2.6 Gbytes. It took 12 hours to assemble the collection from the Newsgroup storage, 1 hour 39 minutes to parse the raw collection into XML conforming to the DTD, and 1 hour to store the containers into HPSS.
2.2 Instantiation Process
The instantiation process included the retrieval of the data sets from the HPSS archive, the creation of a load file for insertion into an Oracle database, the optimization of the Oracle index, and support for a query against the collection. The index optimization step decreased the time needed to do a query against the collection from 20 minutes to 1 second.
The time needed to create a database holding the collection included 1 hour to retrieve the data from the archive, 2 hours 40 minutes to build the load file, 4 hours to load the database, and 4 hours to optimize the database index.
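As a quick check on these figures, the step durations can be summed and the query speedup from section 2.2 computed directly. The dictionary labels are informal names chosen for this sketch; the durations are the ones reported in the text.

```python
from datetime import timedelta

# Step durations reported in the text for the instantiation process.
instantiation = {
    "retrieve from archive": timedelta(hours=1),
    "build load file": timedelta(hours=2, minutes=40),
    "load database": timedelta(hours=4),
    "optimize index": timedelta(hours=4),
}

total = sum(instantiation.values(), timedelta())
hours, remainder = divmod(int(total.total_seconds()), 3600)
print(f"instantiation total: {hours} h {remainder // 60} min")

# Query time before and after index optimization (20 minutes vs 1 second).
speedup = timedelta(minutes=20) / timedelta(seconds=1)
print(f"query speedup: {speedup:.0f}x")
```

The four instantiation steps total 11 hours 40 minutes, and the index optimization corresponds to a 1200-fold reduction in query time.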
3. Remaining Technical Issues
The four major components of the persistent archive system are support for ingestion, archival storage, information discovery, and presentation of the collection. The first two components focus on the ingestion of data into collections. The last two focus on access to the resulting collections. The technology to support each of these processes is still rapidly evolving. Hence consensus on standards has not been reached for many of the infrastructure components. At the same time, many of the components are active areas of research. To reach consensus on a feasible collection-based persistent archive, continued research and development is needed. Examples of the many related issues are listed below:
3.1 Research Opportunities
Important research areas that we suggest pursuing include:
Multiple communities across academia, the federal government, and standards groups are exploring strategies for managing very large archives. The persistent archive community needs to maintain interactions with these communities to track development of new strategies for data management and storage. The technology proposed by SDSC for implementing persistent archives builds upon interactions with many of these groups. Explicit interactions include collaborations with Federal planning groups, the Computational Grid, the digital library community, and individual federal agencies.
The proposed persistent archive infrastructure combines elements from supercomputer centers, digital libraries, and distributed computing environments. The synergy that is achieved can be traced to identification of the unique capabilities that each environment provides, and the construction of interoperability mechanisms for integrating the environments. The result is a system that allows the upgrade of the individual components, with the ability to scale the capabilities of the system by adding resources. By differentiating between the storage of the information content and the storage of the bits that comprise the digital objects, it is possible to create an infrastructure-independent representation for data organized by collections. Collection-based persistent archives that can manage the massive amounts of information confronting government agencies are now feasible.
The data management technology has been developed through multiple federally sponsored projects, including the DARPA project F19628-95-C-0194 "Massive Data Analysis Systems," the DARPA/USPTO project F19628-96-C-0020 "Distributed Object Computation Testbed," the Data Intensive Computing thrust area of the NSF project ASC 96-19020 "National Partnership for Advanced Computational Infrastructure," the NASA Information Power Grid project, and the DOE ASCI/ASAP project "Data Visualization Corridor." Additional projects related to the NSF Digital Library Initiative Phase II and the California Digital Library at the University of California will also support the development of information management technology. This work was supported by a NARA extension to the DARPA/USPTO Distributed Object Computation Testbed, project F19628-96-C-0020.
Moore, R., "Enabling Petabyte Computing," in The Unpredictable Certainty: Information Infrastructure through 2000, National Academy Press, 1997.
Foster, I., Kesselman, C., "The Grid: Blueprint for a New Computing Infrastructure," Chapter 5, "Data-intensive Computing," Morgan Kaufmann, San Francisco, 1999.
Baru, C., "Archiving Meta-data," 2nd European Conference on Research and Advanced Technology for Digital Libraries (poster), Sept. 19-23, 1998, Crete, Greece.
Baru, C., et al., "A data handling architecture for a prototype federal application," Proceedings of the IEEE Conference on Mass Storage Systems, College Park, MD, March 1998.
Copyright © Reagan Moore, Chaitan Baru, Arcot Rajasekar, Bertram Ludaescher, Richard Marciano, Michael Wan, Wayne Schroeder, and Amarnath Gupta