The Warwick Framework

A Container Architecture for Diverse Sets of Metadata

Carl Lagoze
Digital Library Research Group, Cornell University

D-Lib Magazine, July/August 1996

ISSN 1082-9873

This paper is an abbreviated version of The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata. It describes a container architecture for aggregating logically, and perhaps physically, distinct packages of metadata. This "Warwick Framework" is the result of the April 1996 Metadata II Workshop in Warwick, U.K.

Introduction and Motivation

With the rapid increase in the number and variety of networked resources, there is a growing need for an architecture for associating diverse types of metadata with those resources. This requirement is increasingly obvious in the current World Wide Web, where the primary tools for finding networked resources are "web-crawlers" or "spiders" that index the full-text of HTML pages. While the value of these tools should not be underestimated, their shortcomings become obvious when one, for example, searches for documents about "Mercury" and finds a mixture of pages about the planet Mercury, the element Mercury, the Greek God Mercury, and articles from the San Jose Mercury-News. More importantly, these tools are completely useless for the many non-textual documents - images, audio, video, and executable programs (accessible through CGI scripts) - that populate the Web.

A series of metadata workshops, the first in March 1995 in Dublin, Ohio and the second in April 1996 in Warwick, U.K., were convened to address this issue and propose solutions. The Dublin Workshop resulted in the Dublin Core, a set of thirteen metadata elements intended to describe the essential features of networked documents. The Dublin Core metadata set is meant to be both simple enough for easy use by creators and maintainers of Web documents and sufficiently descriptive to assist in the discovery and location of networked resources. The thirteen elements of the Dublin Core include familiar descriptive data such as author, title, and subject. A few fields in the Core, such as coverage and relationship, are less familiar.

The Warwick Workshop was convened a year later to build on the Dublin results and provide a more concrete and operationally useable formulation of the Dublin Core, in order to promote greater interoperability among content providers, content catalogers and indexers, and automated resource discovery and description systems. The April 1996 workshop also was an opportunity to assess a year of experimentation with the Dublin Core by a number of researchers and developers.

While there was consensus among the attendees that the concept of a simple metadata set is useful, there were a number of fundamental questions concerning the real utility of the Dublin Core as it was defined at the end of the preceding workshop. Does the very loosely defined Dublin Core really qualify as a "standard" that can be read and processed programmatically? Should the number of the core elements be expanded, to increase semantic richness, or reduced, to improve ease-of-use by authors and/or web publishers? Will authors reliably attach core metadata elements to their content? Should a core metadata set be restricted to only descriptive cataloging information or should it include other types of metadata such as administrative information, linkage data, and the like? What is the relationship of the Dublin Core to other developing work in metadata schemes, particularly in those areas such as rights management information (terms and conditions)?

The workshop attendees concluded that the answer to these questions and the route to progress on the metadata issue lay in the formulation of a higher-level context for the Dublin Core. This context should define how the Core can be combined with other sets of metadata in a manner that addresses the individual integrity, distinct audiences, and separate realms of responsibility of these distinct metadata sets.

The result of the Warwick Workshop is a container architecture, known as the Warwick Framework. The framework is a mechanism for aggregating logically, and perhaps physically, distinct packages of metadata. This is a modularization of the metadata issue with a number of notable characteristics.

It allows the designers of individual metadata sets to focus on their specific requirements, without concerns for generalization to ultimately unbounded scope.
It allows the syntax of metadata sets to vary in conformance with semantic requirements, community practices, and functional (processing) requirements for the kind of metadata in question.
It separates management of and responsibility for specific metadata sets among their respective "communities of expertise".
It promotes interoperability by allowing tools and agents to selectively access and manipulate individual packages and ignore others.
It permits access to the different metadata sets that are related to the same object to be separately controlled.
It flexibly accommodates future metadata sets by not requiring changes to existing sets or the programs that make use of them.

The separation of metadata sets into packages does not imply that packages are completely semantically distinct. In fact, it is a feature of the Warwick Framework that an individual container may hold packages, each managed and maintained by distinct parties, which have complex semantic overlap.

Examining the Metadata Issue in Context

The organizers of the 1995 Dublin Metadata Workshop intentionally limited its scope, avoiding, as the workshop report states, "the size and complexity of the resource description problem". While this strategy was effective for reaching consensus at the first workshop, it became obvious at the second workshop that it was an impediment to moving beyond the Dublin Workshop results. By the end of the first day of the Warwick Workshop, three questions had surfaced, each of which made clear the need to broaden our perspective.

Should the number of elements in the Dublin Core be expanded or contracted? Some workshop attendees felt that in order for the Core to succeed as a tool for authors, its number of elements should be restricted to only the most basic descriptive elements. Others saw the need for new fields such as terms and conditions or administrator.
Should the syntax of the Core be strictly defined or left unstructured? Many attendees wanted to avoid the painful syntax wars that are familiar to those who have participated in standards efforts. However, without a stricter definition of syntax, the Dublin Core does not provide the level of interoperability for which it was intended.
Should the Core be targeted solely at the existing WWW architecture, or extend that architecture? There is a strong argument for specifying a metadata standard that can be implemented within the existing World Wide Web framework (browsers, servers, HTML specification, etc.). However, the Web is clearly not the model for the optimal information infrastructure, and many of its flaws are the subject of active discussion in the IETF, W3C, and other venues. Many of the Workshop attendees felt that it was important to describe a metadata framework that extends existing WWW technology and provides guidance on how that technology might evolve.

We can answer these questions by stepping back from our focus on core metadata elements and examining some of the general principles of metadata.

Metadata takes a variety of forms, both specialized and general.

Descriptive cataloging is but one of many classes of metadata. Yet, even if we restrict ourselves to this category, we observe that a variety of cataloging methodologies and interchange formats exists, and with legitimate reason. The Anglo-American cataloging rules (AACR2) and MARC interchange format (and its numerous variations) are well established in the library world. MARC records are generally the domain of professional catalogers because of the complex rules and arcane structure of the MARC record. In addition there are a number of simpler descriptive rules, such as those suggested by the Dublin Core. These are usable by the majority of authors, but do not offer the degree of precision and organization that characterizes library cataloging. Finally, there are domain-specific formats such as the Content Standard for Digital Geospatial Metadata (CSDGM) that is the result of work by the Federal Geographic Data Committee (FGDC).

Descriptive cataloging alone, however, does not cover the complete set of descriptive information required in the information infrastructure. We list below some of the other metadata types that are required for real-world applications.

terms and conditions - This is metadata that describes the conditions for use of an object. Terms and conditions might include an access list of who can view the object, a "conditions of use" statement that might be displayed before access to the object is allowed, a schedule (tariff) of prices and fees for use of the object, or a definition of permitted uses of an object.
administrative data - This is metadata related to the management of an object in a particular server or repository. Some examples of information stored in administrative data are date of last modification, date of creation, and the administrator's identity.
content ratings - This is a description of attributes of an object within a multidimensional scaled rating scheme, as assigned by some rating authority; an example might be the suitability of the content for various audiences. The technical subcommittee of PICS (Platform for Internet Content Selection) in the IETF is an effort to create a framework for defining such content ratings.
provenance - This is data defining the source or origin of some content object, for example the location of some physical artifact from which the content was scanned. It might also include a summary of all algorithmic transformations that have been applied to the object (filtering, decimation, etc.).
linkage or relationship data - This is data about the relationship of a content object to other objects; examples are the relationships between a set of articles and a containing journal, between a translation and the work in the original language, between a subsequent edition and the original work, and between the components of a multimedia work.
structural data - This is data defining the logical components of complex or compound objects and how to access those components. A simple example is a table of contents. A more complex example is the list of components of a software suite.

New metadata sets will develop as the networked information infrastructure matures.

The range of metadata needed to describe and manage objects is likely to continue to expand as we become more sophisticated in the ways in which we characterize and retrieve objects and also more demanding in our requirements to control the use of networked information objects. The architecture must be sufficiently flexible to incorporate new semantics without requiring a rewrite of existing metadata sets.

Different communities will propose, design, and be responsible for different types of metadata.

Each logically distinct metadata set may represent the interests of and domain of expertise of a specific community; for example, catalogers should create and maintain descriptive cataloging sets and parties with legal and business expertise should oversee terms and conditions metadata sets. The syntax and notation of each should be determined by the responsible party and fit the semantic requirements of the type of metadata. For example, textual representations might be sufficient for descriptive cataloging data, but are inappropriate for terms and conditions metadata, which may be expressible only through executable (or interpretable) programs.

There are many "users" of metadata.

Just as there are disparate sources of metadata, different metadata sets are used by and may be restricted to distinct communities of users and agents. Machine readability may be a high priority for some types of metadata, whereas others may be targeted for human readability. The terminology in some types of metadata may be domain specific. Each "user" of metadata should be able to directly access that metadata that is relevant to it. From the opposite perspective, there may be reason to selectively restrict access to certain types of metadata associated with an object to certain communities of users or agents. Finally, metadata related to an object may have an independent existence as separately owned and separately priced intellectual property.

Metadata and data have similar behaviors and characteristics.

Strictly partitioning the information universe into data and metadata is misleading. What may appear to be metadata in one context, may look very much like data in another. For example, some critic's review of a movie qualifies as metadata - it is a description of the content, the movie. However, the review itself is intellectual content that can stand alone as data in many instances. Like other data it may have associated metadata and, notably, terms and conditions that protect it as an intellectual object. This recursive relationship of data and metadata may nest to an arbitrarily deep level.

The metadata sets associated with an object may be physically collocated or may be referenced indirectly.

If we allow for the fact that metadata for an object consists of logically distinct and separately administered components, then we should also provide for the distribution of these components among several servers or repositories. The references to distributed components should be via a reliable persistent name scheme, such as that proposed for Universal Resources Names (URNs) and Handles. We note that indirect reference to distributed components also implies that individual metadata sets may be shared. For example, assume a repository with many content objects, some of which have common terms and conditions for access (e.g. a university digital library with a site license for a set of periodicals). We should be able to express this by linking, by a name reference, one encoding of the terms and conditions to the set of objects. Similarly, we should be able to modify the terms and conditions for the set of objects by changing the one shared encoding. The shared terms and conditions metadata may reside in a repository managed by an outside provider that specializes in intellectual property management.
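
The sharing described above can be illustrated with a minimal sketch. It assumes a simple in-memory table standing in for a URN or Handle resolver; the names, identifiers, and the text of the terms are all hypothetical and serve only to show that many content objects can point at one encoding by name.

```python
# Hypothetical name-to-metadata table standing in for a URN or Handle resolver.
shared_metadata = {
    "urn:x-example:terms/site-license": "Access limited to campus network addresses.",
}

# Each content object stores only the persistent name, not a copy of the terms.
content_objects = {
    "journal-a/vol1": {"terms": "urn:x-example:terms/site-license"},
    "journal-b/vol7": {"terms": "urn:x-example:terms/site-license"},
}


def terms_for(object_id: str) -> str:
    """Resolve the terms and conditions for one content object by name."""
    name = content_objects[object_id]["terms"]
    return shared_metadata[name]


def update_terms(name: str, new_text: str) -> None:
    """Change the single shared encoding; every object that references it follows."""
    shared_metadata[name] = new_text
```

Because the objects hold a name rather than a copy, one call to update_terms changes what terms_for returns for every object in the set.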

The Warwick Framework Architecture

The result of this analysis at the Warwick Workshop is an architecture, the Warwick Framework, for aggregating multiple sets of metadata. The Warwick Framework has two fundamental components. A container is the unit for aggregating the typed metadata sets, which are known as packages.

A container may be either transient or persistent. In its transient form, it exists as a transport object between and among repositories, clients, and agents. In its persistent form, it exists as a first-class object in the information infrastructure. That is, it is stored on one or more servers and is accessible from these servers using a globally accessible identifier (URI). We note that a container may also be wrapped within another object (i.e., one that is a wrapper for both data and metadata). In this case the "wrapper" object will have a URI rather than the metadata container itself.

Independent of the implementation, the only operation defined for a container is one that returns a sequence of packages in the container. There is no provision in this operation for ordering the members of this sequence and thus no way for a client to assume that one package is more significant or "better" than another. At the container level, each package is an opaque bit stream. One implication of these properties is that any encoding (transfer syntax) for a container must allow the recipient of the container to skip over unknown packages within the container (in other words, the size of each package must be self-describing at the container level).
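
One way to meet the skipping requirement is a length-prefixed encoding, sketched below. This is an illustration only, not a transfer syntax proposed by the workshop: the 4-byte big-endian length prefix, the function names, and the byte prefixes used to recognize package types are all assumptions.

```python
import struct
from typing import Iterable, Iterator, List

# Assumed container-level framing: a 4-byte big-endian length before each package.
_LENGTH = struct.Struct(">I")


def encode_container(packages: Iterable[bytes]) -> bytes:
    """Concatenate opaque packages, each preceded by its own length."""
    out = bytearray()
    for payload in packages:
        out += _LENGTH.pack(len(payload))
        out += payload
    return bytes(out)


def get_packages(data: bytes) -> Iterator[bytes]:
    """The single container operation: yield each package as an opaque bit stream."""
    offset = 0
    while offset < len(data):
        (size,) = _LENGTH.unpack_from(data, offset)
        offset += _LENGTH.size
        yield data[offset:offset + size]
        offset += size  # the length prefix lets us step past any package


def recognized(data: bytes, prefixes: Iterable[bytes]) -> List[bytes]:
    """Keep only packages a client understands; here a byte prefix stands in for the type."""
    wanted = tuple(prefixes)
    return [p for p in get_packages(data) if p.startswith(wanted)]
```

The container code never interprets a payload. A client that does not understand a package simply ignores it, and the length prefix carries the reader to the next one.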

Each package is a typed object; its type may be inferred after access by a client or agent. Packages are of three types:

metadata set - These are packages that contain actual metadata. Some examples of this are packages that are MARC records, Dublin Core records, and encoded terms and conditions. A potential problem is the ability of clients and agents to recognize and process the semantics of the many metadata sets. In addition, clients and agents will need to adapt to new metadata types as they are introduced. Initial implementations of the Warwick Framework will probably include a set of well known metadata sets, in the same manner that most Web browsers have native handlers for a set of well-known MIME types. Extending the Framework implementations to handle an extensible collection of metadata sets will rely on a type registry scheme.
indirect - This is a package that is an indirect reference to another object in the information infrastructure. While the indirection could be done using URLs, we emphasize that the existence of a reliable URN implementation is necessary to avoid the problems of dangling references that plague the Web. We note three possibly obvious, but important, points about this indirection. First, the target of the indirect package is a first-class object, and thus may have its own metadata and, significantly, its own terms and conditions for access. Second, the target of the indirect package may also be indirectly referenced by other containers (i.e., sharing of metadata objects). Finally, the target of the indirection may be in a different repository or server than the container that references it.
container - This is a package that is itself a container. There is no defined limit for this recursion.

The figure below shows a simple example of a Warwick Framework container. The container in this example contains three logical packages of metadata. The first two, a Dublin Core record and a MARC record, are contained within the container as a pair of packages. The third metadata set, which defines the terms and conditions for access to the content object, is referenced indirectly via a URI in the container. Note that the syntax for terms and conditions metadata is not yet defined.
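
The three package types and the example container just described can be modeled in a few lines. The sketch below is one possible reading, not a defined interface: the class names, the resolve callback, the URN, and the placeholder payloads are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional, Set, Union


@dataclass
class MetadataSet:
    set_type: str   # e.g. "dublin-core"; the labels here are illustrative
    payload: bytes  # opaque at the container level


@dataclass
class Indirect:
    name: str       # persistent name (URN or URI) of another first-class object


@dataclass
class Container:
    packages: List["Package"] = field(default_factory=list)

    def get_packages(self) -> List["Package"]:
        """The only container operation: return the packages, with no implied ranking."""
        return list(self.packages)


Package = Union[MetadataSet, Indirect, Container]


def walk(container: Container,
         resolve: Callable[[str], Package],
         _seen: Optional[Set[str]] = None) -> Iterator[MetadataSet]:
    """Yield every metadata set reachable through nesting and indirection."""
    seen = set() if _seen is None else _seen
    for pkg in container.get_packages():
        while isinstance(pkg, Indirect):
            if pkg.name in seen:   # guard against cycles; a shared target is visited once
                pkg = None
                break
            seen.add(pkg.name)
            pkg = resolve(pkg.name)
        if isinstance(pkg, MetadataSet):
            yield pkg
        elif isinstance(pkg, Container):
            yield from walk(pkg, resolve, seen)


# Hypothetical store standing in for a remote repository.
store = {
    "urn:x-example:terms/site-license": MetadataSet("terms-and-conditions", b"placeholder"),
}

# A Dublin Core record, a MARC record, and an indirect terms-and-conditions package.
example = Container([
    MetadataSet("dublin-core", b"title=Example"),
    MetadataSet("marc", b"placeholder"),
    Indirect("urn:x-example:terms/site-license"),
])
```

Calling walk(example, store.__getitem__) visits the two embedded records and then the indirectly referenced terms and conditions, which may live in a different repository from the container.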

The mechanisms for associating a Warwick Framework container with a content object (i.e., a document) depend on the implementation of the Framework.

The reverse linkage, that which ties a container to a piece of intellectual content, is also relevant. Anyone can, in fact, create descriptive data for a networked resource, without permission or knowledge of the owner or manager of that resource. This metadata is fundamentally different from that metadata that the owner of a resource chooses to link or embed with the resource. We, therefore, informally distinguish between two categories of metadata containers, which both have the same implementation.

An internally-referenced metadata container is the metadata that the author or maintainer of a content object has selected as the preferred description(s) for the object. This metadata is associated with the content by either embedding it as part of the structure that holds the content or referencing it via a URI. An internally-referenced metadata container referenced via a URI is, by nature, a first-class networked object, and may have its own metadata container associated with it. In addition, an internally-referenced metadata container may back-reference the content that it describes via a linkage metadata element within the container.
An externally-referenced metadata container is metadata that has been created and is maintained by an authority separate from the creator or maintainer of the content object. In fact, the creator of the object may not even be aware of this metadata. There may be an unlimited number of such externally-referenced metadata containers. For example, libraries, indexing services, ratings services, and the like may compose sets of metadata for content objects that exist on the net. As we stated earlier, these externally-referenced metadata containers are themselves first-class network objects, accessible through a URI and having some associated metadata. The linkage to the content that one of these externally-referenced containers purports to describe will be via a linkage metadata element within the container. There is no requirement, nor is it expected, that the content object will reference these externally-referenced containers in any way.

The following figure shows an example of this relationship. Three metadata containers are shown. The one internally-referenced metadata container is embedded in the content object (it does not have a URI, nor does it have a linkage package that references the content). The two externally-referenced metadata containers are independent objects. They each have a URI and reference the content object via its URI.

The internally-referenced metadata container in this illustration could also be indirectly referenced by the content. In this case it would have its own URI (say URI₄) and would have a linkage package referencing URI₃ (the content).
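
The two directions of linkage can be sketched as follows. This is a simplified model under stated assumptions: the class and field names are hypothetical, a single linkage field stands in for a linkage package, and the URI strings are placeholders rather than real identifiers.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class MetadataContainer:
    uri: Optional[str]      # None when the container is embedded in the content object
    linkage: Optional[str]  # URI of the content it describes, if it has a linkage package
    label: str


@dataclass
class ContentObject:
    uri: str
    embedded: Optional[MetadataContainer] = None  # internally-referenced, embedded form


def containers_describing(content: ContentObject,
                          known: Iterable[MetadataContainer]) -> List[MetadataContainer]:
    """Collect the embedded container plus any known container linking to the content."""
    found = []
    if content.embedded is not None:
        found.append(content.embedded)
    # External containers point at the content; the content does not point back.
    found.extend(c for c in known if c.linkage == content.uri)
    return found


# Placeholder identifiers for one content object and three independent containers.
document = ContentObject("URI3", MetadataContainer(None, None, "author's description"))
catalog = [
    MetadataContainer("URI1", "URI3", "library catalog record"),
    MetadataContainer("URI2", "URI3", "ratings service record"),
    MetadataContainer("URI9", "URI8", "record for some other object"),
]
```

The content object knows only about its embedded container. Discovering the external ones requires consulting whatever service holds them, which is why the lookup takes the known containers as an argument.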

Open Issues in the Warwick Framework

Time at the Warwick workshop did not permit a full exploration of all the issues involved in the proposed framework. There are several topics that urgently call for more detailed and extended examination prior to finalizing the framework. We briefly summarize those issues here.

Semantic interaction of overlapping sets - Certainly the most fundamental question about the Warwick Framework is the semantic interaction and overlap of the multiple metadata sets that may exist in a container. While packages are to some extent logically distinct, they may have semantics that overlap in complex ways. For example, a container may contain two descriptive cataloging metadata packages: one MARC and the other Dublin Core. A more complex example is a container that contains multiple terms and conditions metadata sets at different levels of recursion in a container.
In the end, the semantics of the metadata associated with an object need to be understood by the "consumers" of the metadata - the clients and agents that access objects and the users that configure these clients and agents. We run the danger, with the full expressiveness of the Warwick Framework, of creating such complexity that the metadata is effectively useless. Finding the appropriate balance is a central design problem.
Type Registry - The Framework design requires that packages are strongly typed. An agent or client will be able to determine the type of the metadata in a package; definers of specific metadata sets should ensure that the set of operations and semantics of those operations will be strictly defined for a package of a given type. We expect that a limited set of metadata types will be widely used and "understood" by browsers and agents. However, the type system must be extensible, and some method that allows existing clients and agents to process new types must be a part of a full implementation of the Framework.
Data Encoding - The Warwick Framework presents two data encoding problems. At the container level, what is the syntax for transferring sets of packages? This syntax must be independent from the syntax of the packages themselves, which are opaque at this level. The more difficult data encoding problems exist at the package level. Some metadata sets can be adequately expressed in ASCII, as a set of attribute/value pairs. Others require more expressive syntax; for example, rules that describe the terms and conditions for access to an object are best expressed via some type of executable program or agent. There is a need to agree on one or more syntaxes for the various metadata sets.
Efficiency - The power of the Warwick Framework lies in its recursive and distributed characteristics. These characteristics make the model highly expressive, but an actual implementation may be quite inefficient. Even in the context of the relatively simple World Wide Web, the Internet is often unbearably slow and unreliable. Connections often fail or time out due to high load, server failure, and the like. In a full implementation of the Warwick Framework, access to a "document" might require negotiation across distributed repositories. The performance of this distributed architecture is difficult to predict and is prone to multiple points of failure. Efficient operation of this distributed architecture will depend on an improved network infrastructure using caching, data or object replication, dynamic load balancing, and other methods being examined in distributed systems research.
Repository Access - It is clear that some protocol work will need to be done to support container and package interchange and retrieval. We foresee the need for various forms of retrieval. The simple form is retrieval of a container for an object. A more complex form is retrieval of only those containers that include packages of a specific set of types. The requirements for this protocol have not been explored in any detail. Some examination of the relationship between the Warwick Framework and ongoing work in repository architectures would likely be fruitful.
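
The type registry issue above can be made concrete with a small sketch. It assumes a registry that maps a type name to a handler function; the type names, handler signatures, and the process function are hypothetical and do not describe an agreed design.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[bytes], str]

# Hypothetical registry mapping a package type name to the code that understands it.
_registry: Dict[str, Handler] = {}


def register(type_name: str) -> Callable[[Handler], Handler]:
    """Decorator that adds a handler for one metadata type."""
    def decorator(fn: Handler) -> Handler:
        _registry[type_name] = fn
        return fn
    return decorator


@register("dublin-core")
def handle_dublin_core(payload: bytes) -> str:
    return "Dublin Core record: " + payload.decode("ascii")


def process(packages: List[Tuple[str, bytes]]) -> Tuple[List[str], List[str]]:
    """Dispatch each typed package to its handler; report the types that were skipped."""
    results, skipped = [], []
    for type_name, payload in packages:
        handler = _registry.get(type_name)
        if handler is None:
            skipped.append(type_name)  # unknown types are ignored, not treated as errors
        else:
            results.append(handler(payload))
    return results, skipped
```

A new metadata type is supported by registering one more handler, with no change to process or to the handlers already present. How a deployed client would discover and obtain such handlers is exactly the part that remains open.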

Implementing the Warwick Framework

Simplicity of design and rapid deployment were primary considerations in the design of the Dublin Core. At first glance it may seem that, with the Warwick Framework, we have forsaken this motivation and have proposed an architecture that does not fit with the current world of HTML, HTTP, and WWW browsers. In fact, the basic notion of the Framework, the ability to place a number of metadata sets in a container, can be expressed in the context of the existing WWW infrastructure.

We miss an important opportunity, however, if we constrain the design and possible implementations according to the existing Web. This infrastructure will surely evolve and may even be replaced by a more powerful information infrastructure. Research and development of such an infrastructure is being undertaken in the NSF/DARPA/NASA Joint Digital Library Initiative, other international digital library research projects, and a number of other venues.

The complete version of this paper provides details on a number of possible implementations. We briefly summarize these below.

HTML - Rapid deployment of the Warwick Framework will only occur if the initial implementation requires no change to existing WWW software. A limited implementation of the Framework is possible in HTML 2.0, which is transparent to existing browsers, spiders, and HTML authoring tools. This implementation takes advantage of two tags in HTML 2.0:
- The META tag is used to embed metadata within the HEAD portion of HTML documents. We propose an encoding for the value of the NAME attribute that groups a number of META tags into a single metadata set.
- The LINK tag provides for both indirect linking to the reference definition of a metadata scheme and indirect linking to a set of metadata.
MIME - MIME is the set of standards (RFC-1522 and others) that were originally created to allow varying content in e-mail messages. The capabilities of MIME can be used for a straightforward implementation of the container/package architecture of the Warwick Framework. WWW browsers already have limited support for MIME, and their level of support is likely to increase over time.
The proposed MIME implementation of the Warwick Framework exploits the multipart type in MIME, which is used for messages that include multiple components, each with a possibly different type. The body parts of a MIME multipart message directly correspond to the packages in a Warwick Framework container. In addition, the MIME content-type message/external-body can be used to implement the Warwick Framework "indirect package".
SGML - SGML is the meta-language that is used to define HTML. By this, we mean that SGML is a language that is used to define other languages, typically ones for marking up textual documents. Those languages are defined by preparing a Document Type Definition (DTD).
Implementing the Warwick Framework in SGML requires a DTD that can handle the container/package architecture, and can deal with indirect packages and metadata sets. This DTD should be capable of including packages that have their own DTDs; for example, the Dublin Core DTD being prepared as one of the results of the Warwick Workshop. The framework DTD must also be able to incorporate metadata packages that do not conform to any DTD.
The proposed Warwick Framework DTD uses the <container> element and the %PackageTypes parameter entity to implement the container/package hierarchy. Parameter entities are essentially text substitution macros for portions of a DTD. Packages that have their own DTDs are easily included using the SGML idiom of overriding the definition of the %md-set parameter entity, and by providing the required DTD fragment in the document's declaration subset. Packages in a non-SGML format can be incorporated by use of the NOTATION attribute on the <package> element.
Distributed Object - An object-oriented implementation of the Warwick Framework is appropriate due to the strong typing, information hiding, and inheritance hierarchy that is inherent in the object-oriented model. Distributed object technology extends the object abstraction by providing non-local access to objects - a client of an object may be located in a different address space or different machine than the server that contains the actual implementation of the object. CORBA is one well-known example of a distributed object architecture.
The Warwick Framework container and package abstractions can be implemented as classes in an object type hierarchy. The class MetaDataContainer is an object with one method - GetPackages - that returns the set of packages in the container. These packages are of type MetaDataPackage, which is the root of a type hierarchy that sub-types to all the possible manifestations of a package in the Warwick Framework.
The Kahn/Wilensky Framework, a result of the DARPA-funded Computer Science Technical Reports Project, proposes a distributed information infrastructure into which an object implementation of the Warwick Framework fits. Kahn/Wilensky proposes that the information in the infrastructure is stored as digital objects, which are content-independent packages encapsulating intellectual content, or the data of the object, and associated descriptive material (e.g., metadata). Work on a distributed object implementation of Kahn/Wilensky, using ILU, is currently underway in the Cornell Digital Library Research Group. This implementation incorporates the full Warwick Framework abstraction into a digital object, permitting arbitrary aggregations of metadata and content within a first-class (named) object. Each element of the aggregation may, itself, be a first-class object with independent administration, descriptive data, and rules for access.
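
Of the implementations summarized above, the MIME mapping is the easiest to sketch with standard tools. The example below uses Python's email package to build a multipart message whose body parts play the role of packages. It is an illustration under assumptions, not the encoding given in the complete paper: the content types chosen for each package, the URL access type on the external body, the URN, and the placeholder payloads are all hypothetical.

```python
from email.message import Message
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from typing import List


def build_container() -> MIMEMultipart:
    """Build a multipart message whose body parts stand for Warwick Framework packages."""
    container = MIMEMultipart("mixed")

    # Package 1: a textual metadata set (attribute/value pairs, contents invented).
    container.attach(MIMEText("title: An Example Document\nauthor: A. Author\n", "plain"))

    # Package 2: a binary metadata set; the bytes are a placeholder, not a real MARC record.
    container.attach(MIMEApplication(b"placeholder bytes", "marc"))

    # Package 3: an indirect package, expressed as message/external-body.
    # The access type and the name it points at are assumptions for illustration.
    indirect = Message()
    indirect.add_header(
        "Content-Type",
        "message/external-body",
        access_type="URL",  # add_header writes this parameter as access-type
        URL="urn:x-example:terms/site-license",
    )
    container.attach(indirect)
    return container


def package_types(container: MIMEMultipart) -> List[str]:
    """Return the content type of each body part, in the order they were attached."""
    return [part.get_content_type() for part in container.get_payload()]
```

A recipient that does not recognize a body part's content type can pass over it, since the multipart boundaries delimit each part independently of its contents.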

Acknowledgments

This paper would not have been possible without the contributions of C. Lynch and R. Daniel, Jr., the co-authors of the complete Warwick Framework paper. In addition, the author wishes to thank the organizers of the metadata workshops, especially S. Weibel, whose efforts provided an essential forum for this and other related work. The ideas here draw extensively from discussions at the Warwick workshop; they also reflect the influence of work done on the still-incomplete White Paper on Networked Information Discovery and Retrieval by C. Lynch, A. Michaelson, C. Preston, and C. Summerhill that is being prepared for the Coalition for Networked Information. We would also like to acknowledge the extensive work of E. Miller, J. Knight, M. Tomlinson, L. Burnard, C.M. Sperberg-McQueen, and L. Quin on the HTML, MIME, and SGML implementation proposals described here.