The Warwick Framework
A Container Architecture for Diverse Sets of Metadata
Digital Library Research Group, Cornell University
D-Lib Magazine, July/August 1996
This paper is an abbreviated version of The Warwick Framework: A
Container Architecture for Aggregating Sets of Metadata.
It describes a container architecture for aggregating logically,
and perhaps physically, distinct packages of metadata.
This "Warwick Framework" is the result of the April
1996 Metadata II Workshop in Warwick, U.K.
Introduction and Motivation
With the rapid increase in the number and variety of networked
resources, there is a growing need for an architecture for associating
diverse types of metadata with those resources. This requirement
is increasingly obvious in the current World Wide Web, where the
primary tools for finding networked resources are "web-crawlers"
or "spiders" that index the full text of HTML pages.
While the value of these tools should not be underestimated, their
shortcomings become obvious when one, for example, searches for
documents about "Mercury" and finds a mixture of pages
about the planet Mercury, the element mercury, the Roman god Mercury,
and articles from the San Jose Mercury News. More importantly,
these tools are completely useless for the many non-textual documents
- images, audio, video, and executable programs (accessible through
CGI scripts) - that populate the Web.
A series of metadata workshops, the first in March 1995 in Dublin,
Ohio, and the second in April 1996 in Warwick, U.K., were convened
to address this issue and propose solutions. The Dublin Workshop
resulted in the Dublin Core,
a set of thirteen metadata elements intended to describe the essential
features of networked documents. The Dublin Core metadata set
is meant to be both simple enough for easy use by creators and
maintainers of Web documents and sufficiently descriptive to assist
in the discovery and location of networked resources. The thirteen
elements of the Dublin Core include familiar descriptive data
such as author, title, and subject. A few
fields in the Core, such as coverage and relationship,
are less familiar.
The Warwick Workshop was convened a year later to build on the
Dublin results and provide a more concrete and operationally useable
formulation of the Dublin Core, in order to promote greater interoperability
among content providers, content catalogers and indexers, and
automated resource discovery and description systems. The April
1996 workshop also was an opportunity to assess a year of experimentation
with the Dublin Core by a number of researchers and developers.
While there was consensus among the attendees that the concept
of a simple metadata set is useful, there were a number of fundamental
questions concerning the real utility of the Dublin Core as it
was defined at the end of the preceding workshop. Does the very
loosely defined Dublin Core really qualify as a "standard"
that can be read and processed programmatically? Should the number
of the core elements be expanded, to increase semantic richness,
or reduced, to improve ease-of-use by authors and/or web publishers?
Will authors reliably attach core metadata elements to their content?
Should a core metadata set be restricted to only descriptive cataloging
information or should it include other types of metadata such
as administrative information, linkage data, and the like? What
is the relationship of the Dublin Core to other developing work
in metadata schemes, particularly in those areas such as rights
management information (terms and conditions)?
The workshop attendees concluded that the answer to these questions
and the route to progress on the metadata issue lay in the formulation
of a higher-level context for the Dublin Core. This context should
define how the Core can be combined with other sets of metadata
in a manner that addresses the individual integrity, distinct
audiences, and separate realms of responsibility of these distinct
metadata sets.
The result of the Warwick Workshop is a container architecture,
known as the Warwick Framework. The framework is a mechanism for
aggregating logically, and perhaps physically, distinct packages
of metadata. This is a modularization of the metadata issue with
a number of notable characteristics.
- It allows the designers of individual metadata sets to focus
on their specific requirements, without concerns for generalization
to ultimately unbounded scope.
- It allows the syntax of metadata sets to vary in conformance
with semantic requirements, community practices, and functional
(processing) requirements for the kind of metadata in question.
- It separates management of and responsibility for specific
metadata sets among their respective "communities of expertise".
- It promotes interoperability by allowing tools and agents
to selectively access and manipulate individual packages and ignore
others.
- It permits access to the different metadata sets that are
related to the same object to be separately controlled.
- It flexibly accommodates future metadata sets by not requiring
changes to existing sets or the programs that make use of them.
The separation of metadata sets into packages does not imply that
packages are completely semantically distinct. In fact, it is
a feature of the Warwick Framework that an individual container
may hold packages, each managed and maintained by distinct parties,
which have complex semantic overlap.
Examining the Metadata Issue in Context
The organizers of the 1995 Dublin Metadata Workshop intentionally
limited its scope, avoiding, as the workshop report states,
"the size and complexity of the resource description problem".
While this strategy was effective for reaching consensus at the
first workshop, it became obvious at the second workshop that
it was an impediment to moving beyond the Dublin Workshop results.
By the end of the first day of the Warwick Workshop, three questions
had surfaced, each of which made clear the need to broaden our
- Should the number of elements in the Dublin Core be expanded
or contracted? Some workshop attendees felt that in order
for the Core to succeed as a tool for authors, its number of elements
should be restricted to only the most basic descriptive elements.
Others saw the need for new fields such as terms and conditions
- Should the syntax of the Core be strictly defined or left
unstructured? Many attendees wanted to avoid the painful syntax
wars that are familiar to those who have participated in standards
efforts. However, without a stricter definition of syntax, the
Dublin Core does not provide the level of interoperability for
which it was intended.
- Should the Core be targeted solely at the existing WWW
architecture, or extend that architecture? There is a strong
argument for specifying a metadata standard that can be implemented
within the existing World Wide Web framework (browsers, servers,
HTML specification, etc.). However, the Web is clearly not the
model for the optimal information infrastructure, and many of
its flaws are the subject of active discussion in the IETF, W3C,
and other venues. Many of the Workshop attendees felt that it
was important to describe a metadata framework that extends existing
WWW technology and provides guidance on how that technology might
We can answer these questions by stepping back from our focus
on core metadata elements and examining some of the general principles
Metadata takes a variety of forms, both specialized and general.
Descriptive cataloging is but one of many classes of metadata.
Yet, even if we restrict ourselves to this category, we observe
that a variety of cataloging methodologies and interchange formats
exist, and with legitimate reason. The Anglo-American Cataloguing
Rules (AACR2) and the MARC interchange format
(and its numerous variations) are well established in the library
world. MARC records are generally the domain of professional catalogers
because of the complex rules and arcane structure of the MARC
record. In addition there are a number of simpler descriptive
rules, such as that suggested by the Dublin Core. These are usable
by the majority of authors, but do not offer the degree of precision
and organization that characterizes library cataloging. Finally,
there are domain-specific formats such as the Content Standard for Digital
Geospatial Metadata (CSDGM),
which is the result of work by the Federal Geographic Data Committee.
Descriptive cataloging alone, however, does not cover the complete
set of descriptive information required in the information infrastructure.
We list below some of the other metadata types that are required
for real-world applications.
- terms and conditions - This is metadata that describes
the conditions for use of an object. Terms and conditions might
include an access list of who can view the object, a "conditions
of use" statement that might be displayed before access to
the object is allowed, a schedule (tariff) of prices and fees
for use of the object, or a definition of permitted uses of an
object.
- administrative data - This is metadata related to the
management of an object in a particular server or repository.
Some examples of information stored in administrative data are
date of last modification, date of creation, and the administrator's
- content ratings - This is a description of attributes
of an object within a multidimensional scaled rating scheme, as
assigned by some rating authority; an example might be the suitability
of the content for various audiences. The technical subcommittee
of PICS (Platform for Internet Content Selection)
in the IETF is an effort to create a framework for defining such
- provenance - This is data defining the source or origin
of some content object, for example the location of some physical
artifact from which the content was scanned. It might also include
a summary of all algorithmic transformations that have been applied
to the object (filtering, decimation, etc.).
- linkage or relationship data - This is data about the
relationship of a content object to other objects; examples are
the relationships between a set of articles and a containing journal,
between a translation and the work in the original language, between
a subsequent edition and the original work, and between the components
of a multimedia work.
- structural data - This is data defining the logical
components of complex or compound objects and how to access those
components. A simple example is a table of contents. A more complex
example is the list of components of a software suite.
New metadata sets will develop as the networked information
The range of metadata needed to describe and manage objects is
likely to continue to expand as we become more sophisticated in
the ways in which we characterize and retrieve objects and also
more demanding in our requirements to control the use of networked
information objects. The architecture must be sufficiently flexible
to incorporate new semantics without requiring a rewrite of existing
Different communities will propose, design, and be responsible
for different types of metadata.
Each logically distinct metadata set may represent the interests
and domain of expertise of a specific community; for example,
catalogers should create and maintain descriptive cataloging sets
and parties with legal and business expertise should oversee terms
and conditions metadata sets. The syntax and notation of each
should be determined by the responsible party and fit the semantic
requirements of the type of metadata. For example, textual representations
might be sufficient for descriptive cataloging data, but are inappropriate
for terms and conditions metadata, which may be expressible only
through executable (or interpretable) programs.
There are many "users" of metadata.
Just as there are disparate sources of metadata, different metadata
sets are used by and may be restricted to distinct communities
of users and agents. Machine readability may be a high priority
for some types of metadata, whereas others may be targeted for
human readability. The terminology in some types of metadata may
be domain specific. Each "user" of metadata should be
able to directly access the metadata that is relevant to it.
From the opposite perspective, there may be reason to selectively
restrict access to certain types of metadata associated with an
object to certain communities of users or agents. Finally, metadata
related to an object may have an independent existence as separately
owned and separately priced intellectual property.
Metadata and data have similar behaviors and characteristics.
Strictly partitioning the information universe into data and metadata
is misleading. What may appear to be metadata in one context
may look very much like data in another. For example, some critic's
review of a movie qualifies as metadata - it is a description
of the content, the movie. However, the review itself is intellectual
content that can stand alone as data in many instances. Like other
data it may have associated metadata and, notably, terms and conditions
that protect it as an intellectual object. This recursive relationship
of data and metadata may nest to an arbitrarily deep level.
The metadata sets associated with an object may be physically
collocated or may be referenced indirectly.
If we allow for the fact that metadata for an object consists
of logically distinct and separately administered components,
then we should also provide for the distribution of these components
among several servers or repositories. The references to distributed
components should be via a reliable persistent name scheme, such
as that proposed for Uniform Resource Names (URNs)
and Handles. We note that indirect reference
to distributed components also implies that individual metadata
sets may be shared. For example, assume a repository with many
content objects, some of which have common terms and conditions
for access (e.g., a university digital library with a site
license for a set of periodicals). We should be able to express
this by linking, by a name reference, one encoding of the terms
and conditions to the set of objects. Similarly, we should be
able to modify the terms and conditions for the set of objects
by changing the one shared encoding. The shared terms and conditions
metadata may reside in a repository managed by an outside provider
that specializes in intellectual property management.
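The sharing described above can be illustrated with a minimal sketch. It assumes a simple in-memory table in place of a real URN or Handle resolver; the names `resolver`, `periodicals`, and `terms_for`, as well as the `urn:example:` identifier, are hypothetical and are not part of the Framework.

```python
# Hypothetical name-resolution table standing in for a URN or Handle resolver.
resolver = {
    "urn:example:terms/site-license": {"access": "campus-only"},
}

# Each content object stores a name reference, not a copy of the terms.
periodicals = {
    "journal-a": {"terms_ref": "urn:example:terms/site-license"},
    "journal-b": {"terms_ref": "urn:example:terms/site-license"},
}


def terms_for(object_id):
    """Resolve the terms-and-conditions metadata for a content object."""
    return resolver[periodicals[object_id]["terms_ref"]]


# Changing the one shared encoding changes the terms for every
# object that references it by name.
resolver["urn:example:terms/site-license"]["access"] = "campus-and-alumni"
```

Because both periodicals hold only the name, neither record needs to be touched when the shared terms change.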
The Warwick Framework Architecture
The result of this analysis at the Warwick Workshop is an architecture,
the Warwick Framework, for aggregating multiple sets of metadata.
The Warwick Framework has two fundamental components. A container
is the unit for aggregating the typed metadata sets, which are
known as packages.
A container may be either transient or persistent. In its transient
form, it exists as a transport object between and among repositories,
clients, and agents. In its persistent form, it exists as a first-class
object in the information infrastructure. That is, it is stored
on one or more servers and is accessible from these servers using
a globally accessible identifier (URI). We note that a container
may also be wrapped within another object (i.e., one that
is a wrapper for both data and metadata). In this case the "wrapper"
object will have a URI rather than the metadata container itself.
Independent of the implementation, the only operation defined
for a container is one that returns a sequence of packages in
the container. There is no provision in this operation for ordering
the members of this sequence and thus no way for a client to assume
that one package is more significant or "better" than
another. At the container level, each package is an opaque bit stream.
One implication of these properties is that any encoding (transfer
syntax) for a container must allow the recipient of the container
to skip over unknown packages within the container (in other words,
the size of each package must be self describing at the container
level).
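One way to satisfy this requirement is a length-prefixed encoding. The sketch below is only an illustration under stated assumptions: the Framework does not define a transfer syntax, and the header layout (a two-byte name length and a four-byte payload length), the function names, and the decision to carry a type label in the header are all hypothetical.

```python
import struct

# Assumed header: 2-byte type-name length, 4-byte payload length, big-endian.
HEADER = struct.Struct(">HI")


def encode_container(packages):
    """Serialize (type_name, payload_bytes) pairs with explicit lengths."""
    out = bytearray()
    for type_name, payload in packages:
        name = type_name.encode("utf-8")
        out += HEADER.pack(len(name), len(payload))
        out += name + payload
    return bytes(out)


def decode_container(data, known_types):
    """Yield (type_name, payload) for known types and skip all others."""
    offset = 0
    while offset < len(data):
        name_len, payload_len = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        type_name = data[offset:offset + name_len].decode("utf-8")
        offset += name_len
        if type_name in known_types:
            yield type_name, data[offset:offset + payload_len]
        # The payload is never interpreted here; its declared length
        # is enough to step over it.
        offset += payload_len


blob = encode_container([
    ("dublin-core", b"title=Example"),
    ("x-unknown", b"\x00\x01\x02"),
    ("marc", b"245 $a Example"),
])
recognized = list(decode_container(blob, {"dublin-core", "marc"}))
```

The recipient never parses the `x-unknown` payload, which is the property the container encoding must guarantee.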
Each package is a typed object; its type may be inferred after
access by a client or agent. Packages are of three types:
- metadata set - These are packages that contain actual
metadata. Some examples of this are packages that are MARC records,
Dublin Core records, and encoded terms and conditions. A potential
problem is the ability of clients and agents to recognize and
process the semantics of the many metadata sets. In addition,
clients and agents will need to adapt to new metadata types as
they are introduced. Initial implementations of the Warwick Framework
will probably include a set of well known metadata sets, in the
same manner that most Web browsers have native handlers for a
set of well-known MIME types. Extending the Framework implementations
to handle an extensible collection of metadata sets will rely on a type registry
- indirect - This is a package that is an indirect reference
to another object in the information infrastructure. While the
indirection could be done using URLs, we emphasize that the existence
of a reliable URN implementation is necessary to avoid the problems
of dangling references that plague the Web. We note three possibly
obvious, but important, points about this indirection. First,
the target of the indirect package is a first-class object, and
thus may have its own metadata and, significantly, its own terms
and conditions for access. Second, the target of the indirect
package may also be indirectly referenced by other containers
(i.e., sharing of metadata objects). Finally, the target
of the indirection may be in a different repository or server
than the container that references it.
- container - This is a package that is itself a container.
There is no defined limit for this recursion.
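The three package types can be modeled as a small recursive data structure. This is a sketch under assumptions, not a prescribed implementation: the class names, the `resolve` callback standing in for URN resolution, and the example type names are hypothetical, and the traversal does no cycle detection.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class MetadataSet:
    type_name: str  # illustrative label, e.g. "dublin-core"
    payload: bytes  # actual metadata, opaque at the container level


@dataclass
class Indirect:
    reference: str  # URI or URN naming another first-class object


@dataclass
class Container:
    packages: List["Package"] = field(default_factory=list)

    def get_packages(self):
        """The single container operation: return the packages."""
        return list(self.packages)


Package = Union[MetadataSet, Indirect, Container]


def collect_metadata_sets(container, resolve):
    """Walk nested containers and follow indirect packages via `resolve`."""
    found = []
    for package in container.get_packages():
        if isinstance(package, MetadataSet):
            found.append(package)
        elif isinstance(package, Indirect):
            target = resolve(package.reference)
            if isinstance(target, Container):
                found.extend(collect_metadata_sets(target, resolve))
            else:
                found.append(target)
        elif isinstance(package, Container):
            found.extend(collect_metadata_sets(package, resolve))
    return found


# Hypothetical remote object reachable only by name.
remote = {"urn:example:terms": MetadataSet("terms-and-conditions", b"...")}
inner = Container([MetadataSet("marc", b"...")])
outer = Container([
    MetadataSet("dublin-core", b"..."),
    Indirect("urn:example:terms"),
    inner,
])
names = [m.type_name for m in collect_metadata_sets(outer, remote.__getitem__)]
```

The order of `names` carries no meaning, since the container operation gives clients no basis for treating one package as more significant than another.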
The figure below shows a simple example of a Warwick Framework
container. The container in this example contains three logical
packages of metadata. The first two, a Dublin Core record and
a MARC record, are contained within the container as a pair of
packages. The third metadata set, which defines the terms and
conditions for access to the content object, is referenced indirectly
via a URI in the container. Note that the syntax for terms and
conditions metadata is not yet defined.
The mechanisms for associating a Warwick Framework container with
a content object (i.e., a document) depend on the implementation
of the Framework.
The reverse linkage, that which ties a container to a piece of
intellectual content, is also relevant. Anyone can, in fact, create
descriptive data for a networked resource, without permission
or knowledge of the owner or manager of that resource. This metadata
is fundamentally different from the metadata that the owner of
a resource chooses to link or embed with the resource. We, therefore,
informally distinguish between two categories of metadata containers,
which both have the same implementation.
- An internally-referenced metadata container is the
metadata that the author or maintainer of a content object has
selected as the preferred description(s) for the object. This
metadata is associated with the content by either embedding it
as part of the structure that holds the content or referencing
it via a URI. An internally-referenced metadata container referenced
via a URI is, by nature, a first-class networked object, and may
have its own metadata container associated with it. In addition,
an internally-referenced metadata container may back-reference
the content that it describes via a linkage metadata element
within the container.
- An externally-referenced metadata container is metadata
that has been created and is maintained by an authority separate
from the creator or maintainer of the content object. In fact,
the creator of the object may not even be aware of this metadata.
There may be an unlimited number of such externally-referenced metadata
containers. For example, libraries, indexing services, ratings
services, and the like may compose sets of metadata for content
objects that exist on the net. As we stated earlier, these externally-referenced
metadata containers are themselves first-class network objects,
accessible through a URI and having some associated metadata.
The linkage to the content that one of these externally-referenced
containers purports to describe will be via a linkage metadata
element within the container. There is no requirement, nor is
it expected, that the content object will reference these externally-referenced
containers in any way.
The following figure shows an example of this relationship. Three
metadata containers are shown. The one internally-referenced metadata
container is embedded in the content object (it does not have
a URI, nor does it have a linkage package that references the
content). The two externally-referenced metadata containers are
independent objects. They each have a URI and reference the content
object via its URI.
The internally-referenced metadata container in this illustration
could also be indirectly referenced by the content. In this case
it would have its own URI (say URI4) and would have
a linkage package referencing URI3 (the content).
Open Issues in the Warwick Framework
Time at the Warwick workshop did not permit a full exploration
of all the issues involved in the proposed framework. There are
several topics that urgently call for more detailed and extended
examination prior to finalizing the framework. We briefly summarize
those issues here.
- Semantic interaction of overlapping sets - Certainly
the most fundamental question about the Warwick Framework is the
semantic interaction and overlap of the multiple metadata sets
that may exist in a container. While packages are to some extent
logically distinct, they may have semantics that overlap in complex
ways. For example, a container may contain two descriptive cataloging
metadata packages: one MARC and the other Dublin Core. A more
complex example is a container that contains multiple terms and
conditions metadata sets at different levels of recursion in a
container.
In the end, the semantics of the metadata associated with an object
need to be understood by the "consumers" of the metadata
- the clients and agents that access objects and the users that
configure these clients and agents. We run the danger, with the
full expressiveness of the Warwick Framework, of creating such
complexity that the metadata is effectively useless. Finding the
appropriate balance is a central design problem.
- Type Registry - The Framework design requires that
packages are strongly typed. An agent or client will be able to
determine the type of the metadata in a package; definers of specific
metadata sets should ensure that the set of operations and semantics
of those operations will be strictly defined for a package of
a given type. We expect that a limited set of metadata types will
be widely used and "understood" by browsers and agents.
However, the type system must be extensible, and some method that
allows existing clients and agents to process new types must be
a part of a full implementation of the Framework.
- Data Encoding - The Warwick Framework presents two
data encoding problems. At the container level, what is the syntax
for transferring sets of packages? This syntax must be independent
from the syntax of the packages themselves, which are opaque at
this level. The more difficult data encoding problems exist at
the package level. Some metadata sets can be adequately expressed
in ASCII, as a set of attribute/value pairs. Others require more
expressive syntax; for example, rules that describe the terms
and conditions for access to an object are best expressed via
some type of executable program or agent. There is a need to agree
on one or more syntaxes for the various metadata sets.
- Efficiency - The power of the Warwick Framework lies
in its recursive and distributed characteristics. These give the
model great expressiveness, but an actual implementation may be quite
inefficient. Even in the context of the relatively simple World
Wide Web, the Internet is often unbearably slow and unreliable.
Connections often fail or time out due to high load, server failure,
and the like. In a full implementation of the Warwick Framework,
access to a "document" might require negotiation across
distributed repositories. The performance of this distributed
architecture is difficult to predict and is prone to multiple
points of failure. Efficient operation of this distributed architecture
will depend on an improved network infrastructure using caching,
data or object replication, dynamic load balancing, and other
methods being examined in distributed systems research.
- Repository Access - It is clear that some protocol
work will need to be done to support container and package interchange
and retrieval. We foresee the need for various forms of retrieval.
The simple form is retrieval of a container for an object. A more
complex form is retrieval of only those containers that include
packages of a specific set of types. The requirements for this
protocol have not been explored in any detail. Some examination
of the relationship between the Warwick Framework and ongoing
work in repository architectures would likely be fruitful.
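The type registry issue raised above can be made concrete with a small sketch. It assumes a local, in-process registry keyed by type name. The `register` decorator, the `process` function, and the "name: value" encoding of the Dublin Core package are hypothetical; a real registry would need to be distributed and extensible in ways not shown here.

```python
# Local stand-in for a type registry: type name -> handler function.
handlers = {}


def register(type_name):
    """Decorator that records a handler for one package type."""
    def decorator(func):
        handlers[type_name] = func
        return func
    return decorator


@register("dublin-core")
def parse_dublin_core(payload):
    # Assumed encoding: one "name: value" pair per line of ASCII text.
    lines = payload.decode("ascii").splitlines()
    pairs = (line.split(":", 1) for line in lines if ":" in line)
    return {name.strip(): value.strip() for name, value in pairs}


def process(type_name, payload):
    """Dispatch on package type; unknown types are left uninterpreted."""
    handler = handlers.get(type_name)
    return handler(payload) if handler else None


record = process("dublin-core", b"title: Example\nsubject: metadata")
ignored = process("x-future-type", b"\x00")
```

A client built this way keeps working when a new type appears; it simply returns nothing for that package until a handler is registered.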
Implementing the Warwick Framework
Simplicity of design and rapid deployment were primary considerations
in the design of the Dublin Core. At first glance it may seem
that, with the Warwick Framework, we have forsaken this motivation
and have proposed an architecture that does not fit with the current
world of HTML, HTTP, and WWW browsers. In fact, the basic notion
of the Framework, the ability to place a number of metadata sets
in a container, can be expressed in the context of the existing
We miss an important opportunity, however, if we constrain the
design and possible implementations according to the existing
Web. This infrastructure will surely evolve and may even be replaced
by a more powerful information infrastructure. Research and development
of such an infrastructure is being undertaken in the NSF/DARPA/NASA Joint Digital
Library Initiative,
other international digital library research projects, and a number
of other venues.
The complete version of this paper
provides details on a number of possible implementations. We briefly
summarize these below.
- HTML - Rapid deployment of the Warwick Framework will
only occur if the initial implementation requires no change to
existing WWW software. A limited implementation of the Framework
is possible in HTML 2.0, which is transparent to existing browsers,
spiders, and HTML authoring tools. This implementation takes advantage
of two tags in HTML 2.0:
- The META tag is used to embed metadata within the HEAD portion
of HTML documents. We propose an encoding for the value of the
NAME attribute that groups a number of META tags into a single
- The LINK tag provides for both indirect linking to a reference
definition for a metadata scheme and for indirect linking to a
set of metadata.
- MIME - MIME is the set of standards (RFC-1522
and others) that were originally created to allow varying content
in e-mail messages. The capabilities of MIME can be used for a
straightforward implementation of the container/package architecture
of the Warwick Framework. WWW browsers already have limited support
for MIME, and their level of support is likely to increase over
time.
The proposed MIME implementation of the Warwick Framework exploits the
multipart type in MIME, which is used for messages
that include multiple components, each with a possibly different
type. The body parts of a MIME multipart message directly correspond
to the packages in a Warwick Framework container. In addition,
the MIME content-type
message/external-body can be
used to implement the Warwick Framework "indirect package".
- SGML - SGML is the meta-language
that is used to define HTML. By this, we mean that SGML is a language
that is used to define other languages, typically ones for marking
up textual documents. Those languages are defined by preparing
a Document Type Definition (DTD).
Implementing the Warwick Framework in SGML requires a DTD that
can handle the container/package architecture, and can deal with
indirect packages and metadata sets. This DTD should be capable
of including packages that have their own DTDs; for example, the
Dublin Core DTD being prepared as one of the results of the Warwick
Workshop. The framework DTD must also be able to incorporate metadata
packages that do not conform to any DTD.
The proposed Warwick Framework DTD uses the
element and the
%PackageTypes parameter entity to
implement the container/package hierarchy. Parameter entities
are essentially text substitution macros for portions of a DTD.
Packages that have their own DTDs are easily included using the
SGML idiom of overriding the definition of
parameter entity, and by providing the required DTD fragment in
the document's declaration subset. Packages in a non-SGML format
can be incorporated by use of the
- Distributed Object - An object-oriented implementation
of the Warwick Framework is appropriate due to the strong typing,
information hiding, and inheritance hierarchy that is inherent
in the object-oriented model. Distributed object technology extends
the object abstraction by providing non-local access to objects
- a client of an object may be located in a different address
space or different machine than the server that contains the actual
implementation of the object. CORBA is one
well-known example of a distributed object architecture.
The Warwick Framework container and package abstractions can be
implemented as classes in an object type hierarchy. The class
is an object with one method -
GetPackages - that
returns the set of packages in the container. These packages are
MetaDataPackage, which is the root of a type
hierarchy that sub-types to all the possible manifestations of
a package in the Warwick Framework.
The Kahn/Wilensky Framework, a result
of the DARPA-funded Computer Science Technical Reports Project,
proposes a distributed information infrastructure into which an
object implementation of the Warwick Framework fits. Kahn/Wilensky
proposes that the information in the infrastructure be stored as digital
objects, which are content-independent packages encapsulating
intellectual content, or the data of the object, and associated
descriptive material (e.g., metadata). Work on a distributed
object implementation of Kahn/Wilensky, using ILU,
is currently underway in the Cornell Digital Library Research Group.
This implementation incorporates the full Warwick Framework abstraction
into a digital object, permitting arbitrary aggregations of metadata
and content within a first-class (named) object. Each element
of the aggregation may, itself, be a first-class object with independent
administration, descriptive data, and rules for access.
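Of the implementations summarized above, the MIME mapping is the easiest to sketch with existing tools. The example below uses Python's standard `email` package to build a multipart message whose body parts play the role of packages. It is a hedged illustration only: the `text/x-dublin-core` media type, the sample field values, and the `http://example.org/terms` address are hypothetical, and the URL access type for `message/external-body` is an assumption about how the indirect package might be expressed.

```python
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# The multipart message plays the role of the container.
container = MIMEMultipart("mixed")

# A metadata-set package carried directly as a body part.
# "x-dublin-core" is a made-up subtype used only for illustration.
dublin_core = MIMEText("title: Example\nauthor: A. Author\n", "x-dublin-core")
container.attach(dublin_core)

# An indirect package: the body part names where the metadata lives
# rather than carrying it.
indirect = MIMEBase(
    "message",
    "external-body",
    access_type="URL",
    URL="http://example.org/terms",
)
container.attach(indirect)

# A client can enumerate package types without interpreting any package.
package_types = [part.get_content_type() for part in container.get_payload()]
```

Each body part declares its own content type, so a recipient can select the parts it understands and pass over the rest.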
This paper would not have been possible without the contributions
of C. Lynch and R. Daniel, Jr., the co-authors of the complete
Warwick Framework paper. In addition, the author wishes to thank
the organizers of the metadata workshops, especially S. Weibel,
whose efforts provided an essential forum for this and other related
work. The ideas here draw extensively from discussions at the
Warwick workshop; they also reflect the influence of work done
on the still-incomplete White Paper on Networked Information Discovery
and Retrieval by C. Lynch, A. Michaelson, C. Preston, and C. Summerhill
that is being prepared for the Coalition for Networked Information.
We would also like to acknowledge the extensive work of E. Miller,
J. Knight, M. Tomlinson, L. Burnard, C.M. Sperberg-McQueen, and
L. Quin on the HTML, MIME, and SGML implementation proposals described
- Lagoze, Carl and Lynch, Clifford and
Daniel, Ron, Jr. June, 1996. The Warwick Framework: A Container
Architecture for Aggregating Sets of Metadata. Cornell Computer
Science Technical Report TR96-1593.
- Weibel, Stuart. July, 1995. Metadata:
The Foundations of Resource Description. D-Lib Magazine. http://www.dlib.org/dlib/July95/07
- Weibel, Stuart and Godby, Jean and Miller,
Eric and Daniel, Ron. 1995. OCLC/NCSA Metadata Workshop Report.
- The Library of Congress. Machine-Readable
- Federal Geographic Data Committee. Content
Standards for Digital Geospatial Metadata. http://geochange.er.us
- The Federal Geographic Data Committee.
- Miller, Jim and Resnick, Paul and Singer, David, Rating Services and
Rating Systems (and their Machine Readable Descriptions), Platform for Internet
Content Selection Version 1.1, May 1996,
- Universal Resource Names. http://union.ncsa.uiuc.edu/HyperN
- Corporation for National Research Initiatives.
The Handle System. http://www.handle.net.
- The NSF/DARPA/NASA Digital Library Initiative.
- MIME. RFC-1522.
- Marchal, Benoît. A Gentle Introduction to SGML.
- Object Management Group.
- Robert Kahn and Robert Wilensky. A
Framework for Distributed Object Services. May 13, 1995.
- Corporation for National Research Initiatives.
Computer Science Technical Reports Project.
- Xerox Palo Alto Research Laboratory. Inter-Language
- Cornell Digital Library Research Group.
Copyright © 1996 Carl Lagoze