W3C and Digital Libraries

James S. Miller
World Wide Web Consortium
Cambridge, Massachusetts
jmiller@w3.org

D-Lib Magazine, November 1996
ISSN 1082-9873

It was my distinct pleasure to listen to a talk by William Y. Arms, given at the AusWeb '96 conference this past July. Bill highlighted a number of areas where the digital library community was working hard to solve serious problems with the World Wide Web (WWW) infrastructure. In particular, he discussed in detail the difficulties involved in solving two problems:

  1. creating persistent, globally unique names for the objects in digital collections; and
  2. creating, encoding, and transmitting metadata, the descriptive information about those objects.

As always, Bill made an excellent case for the need for both of these items. He further argued that the existing WWW protocols do not provide solutions to these problems. After hearing him out, I found that I agree completely with the need for persistent, globally unique names and for metadata. But I do not agree that the existing WWW protocols fail to offer solutions. Indeed, I believe that most of the engineering needed to solve these problems is already complete, but there is work of a societal nature that remains to be done. This work can be done only by the institutions most directly involved with the problems.

This article, prepared at his request, is my response to his excellent talk. It is, mostly, a call for the digital library community to join with the Web community to solve the remaining engineering problems, but suggests that the real problems are not primarily technical.

Introducing the W3C

To explain my viewpoint, I must first describe both the World Wide Web Consortium (W3C) and my own work at the Consortium. W3C is a formalized collaboration among internationally known research organizations: the U.S. office is a research group at the Massachusetts Institute of Technology's (MIT) Laboratory for Computer Science (LCS); the European office is a pair of research groups at INRIA (Institut National de Recherche en Informatique et en Automatique, the French national computer science laboratory); and the Japan/Korea office is a research group at Keio University. The Consortium is funded primarily by membership fees from companies, organizations, and government offices. In addition, each W3C office is free to seek additional funding through traditional research or development grants.

All three offices share a common Director, Tim Berners-Lee, the inventor of the Web. Jean-François Abramatic is the Chairman of the organization and the manager of the W3C team worldwide. W3C has about 30 full-time staff members at its three offices, plus about five additional engineers who have been seconded to the W3C by their employers. Our goal is to "help the Web reach its maximum potential," which we do by working with our member companies to evolve the specifications that underlie the Web. Our work primarily involves working with companies in a pre-competitive environment, helping to define which parts of the specifications are critical to ensure the long-term growth and interoperability of the Web, and which parts are best left to competition in the marketplace. We develop and refine both specifications and reference code (all of which we make freely available to the public).

We also initiate joint projects whose aim is to use the existing Web infrastructure in new ways. These projects lead both to modifications to the infrastructure and to the design of protocols or agreements on the use of the infrastructure for particular new applications.

Our technical work is organized in three broad areas:

  1. Architecture, directed by Dan Connolly at our US office. The core of this work consists of two specifications: HTTP (Hyper Text Transfer Protocol, the primary transfer protocol for the Web) and URLs (Uniform Resource Locators, the naming mechanism for the Web). The architecture group produces two sets of reference code: libwww, a C-based library originally written at CERN (European Laboratory for Particle Physics), and Jigsaw, a Java-based Web server. In addition, the architecture group has been working on connecting HTTP to the evolving distributed object infrastructure (known as http-ng); improving the performance and reliability of HTTP ("industrial-strength Web"); and the transmission of real-time audio and video.
  2. User Interface, directed by Vincent Quint at our European office. Here, the main focus is on HTML (Hyper Text Markup Language) and related specifications. The user interface group works on the development of Cascading Style Sheets, internationalization, new graphics formats, improvements to the handling of fonts, printing, etc. W3C provides a reference code system, Amaya, which is both a Web browser and an editor.
  3. Technology and Society, directed by Jim Miller at our US office. The main focus of this group is not on existing Web specifications, but rather on applications of the Web for particular purposes, as well as for evolving the specifications in response to society's needs. Work in this area includes the PICS (Platform for Internet Content Selection) system for labeling information on the Web, as well as work on electronic payment negotiation (JEPI, the Joint Electronic Payment Initiative) and security (Digital Signature Initiative). The technology and society area will also be addressing issues such as the protection of intellectual property on the Web, access for the visually impaired, and protecting individual privacy.

The W3C has over 150 corporate members, drawn about 50% from the U.S. and 50% from Europe; we are expanding rapidly in Japan at the current time. We have been successful in simultaneously moving the Web specifications forward while making sure that the major browsers and servers interoperate. Ensuring this graceful evolution is not easy, and constitutes the majority of our work at the W3C. We sponsor workshops, run on-going working groups, elect editorial review boards, and manage cross-industry projects to make sure that all member companies have an opportunity to understand and help evolve the technology, and to make sure that the major technology innovators will work together when necessary to ensure interoperability.

It is against this background that I listened to Bill's talk and found myself in both agreement and disagreement. Let me cover his two points individually.

Persistent names

At the core of this problem lie two issues. The first is the guarantee of perpetual existence, the issue Bill emphasized in his talk. What form of name can be created that can be used forever, outlasting the Internet and current media? The second is the guarantee of uniqueness: two distinct names refer to two distinct objects.

From an engineer's point of view (and good engineering design is the hallmark of the Web: it innovates only when necessary, but remains extremely flexible) the former is easily answered. To an engineer, there is no forever. Instead, there is a fixed lifetime and a mechanism for moving forward before that lifetime expires. This is precisely the work of OCLC on its PURL server, and it can be combined with the work of Hyper-G to allow updates of referring documents as the naming system moves forward (i.e. the federating of separately maintained PURL servers).
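
To make this concrete, here is a minimal sketch of the idea behind a PURL-style resolver, written in Python. The name-to-location table and the module layout are my own illustrative assumptions, not OCLC's implementation; the point is only that a persistent name is an entry in a table that an institution promises to maintain, and that resolution is an ordinary HTTP redirect to the current location.

    # Minimal sketch of a PURL-style resolver (illustrative only, not OCLC's code).
    # A persistent name such as /purl/lib/12345 is an entry in a table that the
    # maintaining institution updates whenever the underlying document moves;
    # clients simply follow an ordinary HTTP redirect to the current location.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical name-to-location table; in practice this would be a replicated,
    # persistently stored database maintained by the institution.
    PURL_TABLE = {
        "/purl/lib/12345": "http://www.example.org/archive/report-1996.html",
    }

    class PurlResolver(BaseHTTPRequestHandler):
        def do_GET(self):
            target = PURL_TABLE.get(self.path)
            if target is None:
                self.send_error(404, "Unknown persistent name")
                return
            # The persistent name never changes; only this redirect target does.
            self.send_response(302)
            self.send_header("Location", target)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), PurlResolver).serve_forever()

Federating such servers, in the spirit of Hyper-G, is then a matter of agreeing on how one resolver hands off or replicates another's table; nothing in the Web's protocols needs to change.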

What we need to move forward on persistent names, then, is not new technology or engineering. Instead, there must be one or more entities that take institutional charge of the issuing and resolving of unique names, and a mechanism that will allow this entire set of names to be moved forward as the technology progresses. While changes to the Web itself might help make the problem simpler or more robust, the need for institutional commitment to the naming system can not be "engineered away."

Thus my first response: The Digital Library community must identify institutions of long standing that will take the responsibility for resolving institutionalized names into current names for the foreseeable future.

Once this is done, W3C can help explore with these institutions any remaining naming issues. But without the institutions to back the names, there can be no true progress.

What might such an institution be? And what would be the costs? I propose that a small consortium of well-known (perhaps national) libraries could work to provide the computing infrastructure. What is needed is a small set of replicated servers, perhaps two per site with three sites. Each would need roughly 64 megabytes of memory and 4 gigabytes of disk space to resolve about 10,000,000 names. At current prices, this would cost about $80,000. And if we add 4 gigabytes of disk per year to each system (thus supporting 10,000,000 new names per year), the additional cost would be only on the order of $10,000 per year. One funding model for this infrastructure might be a charge for creating a permanent copy of a document, on the order of $1.00 per megabyte. This could cover a notarization step (proving the document existed as of a certain date), copying the data itself, permanent archival and updating as media changes and naming systems change, and the creation of the permanent name itself.
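
The arithmetic behind these figures can be laid out explicitly. In the sketch below, the per-server and per-disk prices are my own assumptions, chosen only so that the totals match the 1996 estimates given above.

    # Back-of-envelope check of the cost estimate above.  The per-unit prices are
    # illustrative assumptions chosen to be consistent with the stated totals.
    sites = 3
    servers_per_site = 2
    servers = sites * servers_per_site                 # 6 replicated servers

    cost_per_server = 13000                            # assumed price of a 64 MB / 4 GB machine
    initial_cost = servers * cost_per_server           # 78,000 -- "about $80,000"

    names_per_4gb_disk = 10000000
    bytes_per_name = 4e9 / names_per_4gb_disk          # roughly 400 bytes of record per name

    disk_per_year_per_server = 1600                    # assumed price of one additional 4 GB disk
    yearly_cost = servers * disk_per_year_per_server   # 9,600 -- "on the order of $10,000 per year"

    print(initial_cost, bytes_per_name, yearly_cost)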

Meta Data

There are three separate subproblems that underlie the "metadata problem." The first, and by far the hardest, is the question of what the metadata elements should be. This entails hard decisions about what information must be captured, who can capture it reliably, what will be useful for searching both today and 1,000 years hence, standardization of names, and canonicalization (i.e. standardization of representation) of the metadata values. Again, these problems can be addressed only by institutional agreement, and are subject to modification over time. In fact, the set of workshops organized by OCLC and partners over the past year is directly addressing this very hard problem, and the form of the results (the Dublin core and the Warwick framework) is becoming clear.

The remaining problems are technical and much easier to solve once the first has been addressed: the encoding of the metadata into a form that can be used by a computer, and the retrieval and transmission of that metadata. I argue that the Internet already has a system, now in the process of wide deployment, that addresses both of these issues: PICS (the Platform for Internet Content Selection). While PICS was initially created to address the problem of child protection on the Internet, it is important to look beyond this use at the actual technology and see that PICS is, fundamentally, a metadata system. At a recent meeting, several members of the digital library community did exactly this and suggested that PICS (with slight modifications) may well form a base for encoding and transmitting metadata derived from the Dublin core and the Warwick framework.

The key to understanding PICS as a metadata system is to look carefully at the three things that it specifies:

  1. A method for naming and describing a metadata system. This is provided by what PICS calls (for historical reasons) a "rating service description file." This is in a text format defined by Rating Services and Rating Systems (and Their Machine Readable Descriptions). The name of a metadata system (i.e. an agreed upon set of metadata elements) is a URL, which can be made persistent by using a PURL server to provide a guarantee of longevity. The URL uniquely identifies this particular package of metadata elements, and the ratings service description file connects this name with both a human and computer readable description of the metadata system. The computer readable description is embedded in the rating service description file itself, while the human readable description is left as a free format document accessible from the Web.
  2. A method for encoding metadata. This is the "label list" format specified in PICS Label Distribution Label Syntax and Communication Protocols. It allows several different metadata systems to be transmitted in a single list (each metadata system is known, for historical reasons, as a "label"). Each metadata system is identified by its name (i.e. the URL as specified in the rating service description file) so that automatic tools can correlate the metadata system with its description. In addition, the PICS encoding contains several optional metadata elements of its own (date of last modification, expiration date of the metadata, date on which the metadata was created, etc.) as well as the ability both to be digitally signed (so users can be assured of the authenticity of the metadata) and to contain a "cryptographic hash" of the document it describes, so that a PICS label can be detached from the document and still be reliably linked with the document to which it applies. PICS does have one limitation, which the W3C PICS Working Group will be addressing in the near future: it permits only numeric values for metadata elements. While this is sufficient for any controlled vocabulary system, it is not sufficient for the more general needs of a full metadata service.
  3. Three methods of distribution. Metadata in the PICS format can be transmitted inside an HTML document (using the META element of the document header), along with a document sent by any RFC-822 transport (email, news, HTTP), or from a trusted third-party server (known as a "label bureau"). The format of the metadata is identical in all three cases, allowing software that reads the metadata to use a single parser for all three transmission mechanisms. In addition, if HTTP is used to request a document from a Web server, PICS specifies a mechanism for the user to specify which forms of metadata are desired (by using the URLs that name the metadata sets). The server can then embed or attach a PICS label containing only the metadata sets requested. All three methods are described in detail in PICS Label Distribution Label Syntax and Communication Protocols; a brief sketch of the label format and two of these distribution methods follows this list.
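
To make the encoding and distribution concrete, the sketch below composes a PICS 1.1-style label in Python and shows two of the embedding points just described: the META element of an HTML document head and an RFC-822 style header. The service URL and the category names ("era" and "pages") are invented for this illustration, and, as noted in point 2, the values must be numeric; the authoritative grammar is given in PICS Label Distribution Label Syntax and Communication Protocols.

    # Sketch: composing a PICS 1.1-style label for a hypothetical metadata system.
    # The service URL and category names are invented; consult the PICS label
    # syntax specification for the authoritative grammar.
    SERVICE_URL = "http://purl.example.org/metadata/bib-v1"   # hypothetical, PURL-backed name

    def pics_label(document_url, ratings):
        # PICS 1.1 permits only numeric rating values (the limitation noted above).
        pairs = " ".join("%s %s" % (name, value) for name, value in ratings.items())
        return ('(PICS-1.1 "%s" labels on "1996.11.01T12:00-0500" '
                'for "%s" ratings (%s))' % (SERVICE_URL, document_url, pairs))

    label = pics_label("http://www.example.org/report.html", {"era": 19, "pages": 42})

    # 1. Embedded in the document itself, via the META element of the HTML head:
    html_head = "<META http-equiv=\"PICS-Label\" content='%s'>" % label

    # 2. Attached to the document by a server or label bureau as an RFC-822 header:
    http_header = "PICS-Label: %s" % label

    print(html_head)
    print(http_header)

Because the label carries the URL of the metadata system it uses, a program reading either form can fetch the corresponding rating service description file and interpret the values automatically.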

Thus my second response: The Digital Library community must identify the sets of metadata that are important. Once that is done, encoding and distributing metadata using PICS will leverage the existing infrastructure of the Web so that deployment and use of metadata will be a natural extension of existing systems.

Conclusion

The World Wide Web Consortium is interested in working with the Digital Library community to make the vision of a worldwide, searchable information space a reality. The Consortium is prepared to work with institutions that can commit to supporting such an infrastructure. We will help ensure that the infrastructure is sound from an engineering point of view, as well as work towards its universal adoption and integration with the existing Web information space. Toward that end, W3C will work with both the W3C member companies and the PICS community to improve the existing PICS infrastructure to support the full metadata needs of the Digital Library community. But the hardest problems to be solved are not technological: they are problems of our social and institutional structure that can only be solved by cooperation and agreement within the Digital Library community itself. And these processes are well underway.

Copyright © 1996 James S. Miller


hdl:cnri.dlib/november96-miller