D-Lib Magazine
June 1998

ISSN 1082-9873

Resolving DOI Based URNs Using Squid

An Experimental System at UKOLN


Andy Powell
UKOLN, University of Bath
Bath, UK
www.ukoln.ac.uk/ukoln/staff/a.powell.html
a.powell@ukoln.ac.uk

Introduction

The Digital Object Identifier (DOI) and the Uniform Resource Name (URN) are two initiatives attempting to define long term identifiers for information resources. These initiatives are related, in that they both try to overcome the limitations of the Uniform Resource Locator (URL) insofar as it is used to "identify" resources on the Internet. The URL does not provide a stable, long term identifier; it simply provides the current location of the resource (or of a copy of the resource). If the resource moves, its URL changes. It is likely that a formal method of encoding DOIs as URNs will be developed in the future.

This article describes an experimental system that allows DOIs encoded as URNs to be resolved on behalf of Web browsers by Squid [1]. A method of encoding a DOI as a URN is described below. Squid is a freely available caching proxy Web server that is widely used throughout the Internet community. It is based on code originally developed by the Harvest project [2] and continues to be developed under the auspices of the National Laboratory for Applied Network Research (NLANR) cache project. Recent versions of Squid provide some support for URNs [3], albeit at a reasonably trivial level. This support was primarily introduced to allow Squid to return lists of sites for mirrored resources.

UKOLN has developed software to extend this support for URNs, enabling Squid to resolve DOI based URNs into URLs and return those URLs to the requesting Web browser in the form of an HTTP redirect. The software relies on recent beta versions of Squid version 1.2 and on some support for URNs in the Web browser. At the time of writing, only Netscape Navigator version 4 appears to offer the appropriate support to allow URN resolution in this way.

By using this experimental system, staff at UKOLN can type DOI based URNs directly into the Location: field of their Netscape Navigator browsers.

Uniform Resource Names (URNs)

The IETF URN Working Group [4] is currently defining a persistent identifier for information resources known as the Uniform Resource Name (URN). The URN and the more familiar URL together make up the set of resource identifiers known as Uniform Resource Identifiers (URIs) that are used to identify and locate information on the Web. The requirements for URNs are defined by RFC 1737 [5].

The working group will define the mechanics that enable global scope, persistence, and legacy support for URNs and the requirements for namespaces to support this structure. Members of the working group have developed two experimental protocols for resolving URNs into URLs, one using the Domain Name System (DNS) [6], the other using HTTP [7]. The experimental system described here uses the second of these protocols to allow Squid to resolve DOI based URNs.

URNs have the following syntax:

    "urn:" NID ":" NSS

where NID is the Namespace Identifier, and NSS is the Namespace Specific String. Some examples of URNs are given later. The syntax is defined fully in [8].
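
As an illustration of this syntax, the following JavaScript sketch splits a URN into its NID and NSS. The function name parseUrn is hypothetical, and the check is deliberately simplified: it does not enforce the full character rules given in [8].

```javascript
// Hypothetical helper: split a URN into its NID and NSS parts.
// Simplified sketch; the full character rules in [8] are not enforced.
function parseUrn(urn) {
    // "urn:" (matched case-insensitively), then the NID up to the next
    // colon, then the rest of the string as the NSS.
    var match = /^urn:([^:]+):(.*)$/i.exec(urn);
    if (match === null) {
        return null; // not of the form "urn:" NID ":" NSS
    }
    // The NID is case-insensitive, so it is normalised to lower case here.
    return { nid: match[1].toLowerCase(), nss: match[2] };
}
```

Given a URN such as urn:issn:1082-9873, this sketch reports an NID of "issn" and an NSS of "1082-9873".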

Digital Object Identifiers (DOIs)

The DOI System [9] is an identification system for digital media developed originally by the Association of American Publishers (AAP) in collaboration with the Corporation for National Research Initiatives (CNRI); it is now governed by The International DOI Foundation. Designed to provide persistent and reliable identification of digital objects, the DOI system is based on The Handle System [10], which has been developed by CNRI. The DOI has attracted significant attention from the publishing community as an important new identifier for electronic intellectual content.

The DOI is made up of two parts, the prefix and the suffix, separated by a forward slash ('/'). The prefix is assigned by the Directory Manager. In the future there may be many Directory Managers; however, at the time of writing there is only one. The prefix is also made up of two parts, separated by a full-stop ('.'). The first part indicates the Directory Manager who has assigned the DOI and is currently always "10". The second part indicates the publisher who will be registering DOIs using this prefix. The suffix is assigned by the publisher. It may be based on an existing standard identification scheme, for example a Serial Item and Contribution Identifier (SICI) [11] or Publisher Item Identifier (PII) [12], or it may be based on some proprietary in-house scheme.

Two example DOIs are shown below. The first identifies the Digital Object Identifier System home page on the Web. The prefix is "10.1000" and the suffix is "1".

    10.1000/1

The second identifies an article entitled "Developmental expression of a DNA repair gene in Arabidopsis" published in an Elsevier Science journal. In this case, the suffix is based on a PII.

    10.1016/S0921877797000232
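
The structure described above can be sketched in JavaScript as follows. The function name parseDoi and the field names are hypothetical. The sketch assumes that the first forward slash separates prefix from suffix and that the first full-stop separates the two parts of the prefix; it performs no other validation.

```javascript
// Hypothetical helper: split a DOI into the parts described above.
function parseDoi(doi) {
    var slash = doi.indexOf("/");   // first '/' separates prefix and suffix
    if (slash < 1 || slash === doi.length - 1) {
        return null;                // prefix or suffix is missing
    }
    var prefix = doi.substring(0, slash);
    var suffix = doi.substring(slash + 1);
    var dot = prefix.indexOf(".");  // '.' separates the two prefix parts
    if (dot < 1 || dot === prefix.length - 1) {
        return null;                // prefix is not of the form "10.xxxx"
    }
    return {
        prefix: prefix,
        suffix: suffix,
        directoryManager: prefix.substring(0, dot), // currently always "10"
        publisher: prefix.substring(dot + 1)
    };
}
```

For the second example above, the sketch gives a prefix of "10.1016" (Directory Manager "10", publisher "1016") and a suffix of "S0921877797000232".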

DOIs and their associated URLs are held in the DOI Directory, a distributed database based on the Handle System. Currently, DOIs are used on the Web by embedding them into URLs as shown below:

    http://dx.doi.org/10.1000/1
    http://dx.doi.org/10.1016/S0921877797000232

A Web browser resolves a DOI by contacting the Web server at dx.doi.org. The Web server queries the DOI Directory and uses the URL associated with the DOI to return an HTTP redirect to the Web browser, causing it to retrieve the required resource.

DOI based URNs

RFC 2288 [13] shows how several existing bibliographic identifiers can be encoded as URNs. It describes URNs based on the International Standard Book Number (ISBN), the International Standard Serial Number (ISSN) and the SICI. For example, the ISSN for D-Lib Magazine could be used as the basis for the following URN:

    urn:issn:1082-9873

Although RFC 2288 doesn't cover DOIs, it seems reasonable to infer from it that they will be encoded as URNs in the following way:

    urn:doi:10.1000/1
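
Under that inferred encoding, converting between a DOI, its URN form and the dx.doi.org URL form shown earlier is a matter of adding or removing a fixed prefix. The JavaScript sketch below illustrates this; the function names are hypothetical, and no escaping of special characters is attempted because no formal encoding has yet been defined.

```javascript
var DOI_URN_PREFIX = "urn:doi:";            // inferred encoding, not a standard
var DOI_HTTP_PREFIX = "http://dx.doi.org/"; // URL form currently used on the Web

// Hypothetical helper: DOI -> URN
function doiToUrn(doi) {
    return DOI_URN_PREFIX + doi;
}

// Hypothetical helper: DOI -> URL at dx.doi.org
function doiToHttpUrl(doi) {
    return DOI_HTTP_PREFIX + doi;
}

// Hypothetical helper: URN -> DOI, or null if the URN is not DOI based
function urnToDoi(urn) {
    var head = urn.substring(0, DOI_URN_PREFIX.length).toLowerCase();
    if (head !== DOI_URN_PREFIX) {
        return null;
    }
    return urn.substring(DOI_URN_PREFIX.length);
}
```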

URN support in Web browsers

One of the impediments to the experimental deployment of URNs is the lack of support for URN resolution protocols in mainstream Web browsers. This is not surprising because the protocols are at an early stage of development. However, Netscape Navigator version 4 does contain some support for URNs: if an HTTP proxy server has been appropriately configured (see below), it will pass URNs on to that proxy for resolution.

The HTTP proxy can be configured manually using the "Edit", "Preferences" menu (select "Advanced", then "Proxies"). Alternatively, the proxy configuration can be set automatically using an auto-proxy config file (a small piece of JavaScript) identified by a URL. This has the advantage that all the configuration information for a site can be held in one place.

The auto-proxy config file in use at UKOLN is shown below:

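function FindProxyForURL(url, host)
{
    // URNs are passed to the Squid server that can resolve them
    if (url.substring(0, 4) == "urn:")
        return "PROXY resolver.ukoln.ac.uk:3128";
    // Unqualified host names and selected domains are contacted directly
    if (isPlainHostName(host) ||
        dnsDomainIs(host, ".bath.ac.uk") ||
        dnsDomainIs(host, ".ariadne.ac.uk") ||
        dnsDomainIs(host, ".niss.ac.uk") ||
        dnsDomainIs(host, ".ukoln.ac.uk"))
        return "DIRECT";
    else
        // Everything else goes through the normal caching proxy
        return "PROXY wwwcache.bath.ac.uk:3128";
}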

Notice that this configuration file explicitly sets a proxy server for URNs and that the server is different from the normal HTTP and FTP proxy server (though this need not be the case). Notice also that, by comparing the first two components of the URN (e.g., by checking for "urn:doi:" rather than simply "urn:"), it would be possible to use different proxy servers to resolve different URN namespaces.
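
A per-namespace choice of proxy might be sketched as follows. This is not part of the configuration in use at UKOLN: the function name findProxyForUrn and the table are hypothetical, and issn-resolver.example.org is an invented host name used only for illustration. Such a function could be called at the start of FindProxyForURL, falling through to the existing checks when it returns null.

```javascript
// Hypothetical table mapping URN namespaces to proxies.
// resolver.ukoln.ac.uk is the host used in the file above; the second
// host name is invented for illustration.
var URN_PROXIES = [
    { prefix: "urn:doi:",  proxy: "PROXY resolver.ukoln.ac.uk:3128" },
    { prefix: "urn:issn:", proxy: "PROXY issn-resolver.example.org:3128" }
];

// Return the proxy for a URN in a known namespace, or null otherwise.
function findProxyForUrn(url) {
    var lower = url.toLowerCase();
    for (var i = 0; i < URN_PROXIES.length; i++) {
        var prefix = URN_PROXIES[i].prefix;
        // Compare the first two components, e.g. "urn:doi:"
        if (lower.substring(0, prefix.length) === prefix) {
            return URN_PROXIES[i].proxy;
        }
    }
    return null; // not a URN, or a namespace with no dedicated proxy
}
```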

URN support in Squid

Squid versions 1.2 beta 9 and later provide some support for URNs [14]. This support is provided primarily as a mechanism for resolving URNs into lists of URLs for mirrored resources. Depending on the way in which it is configured, Squid may attempt to determine the "nearest" of the mirror sites and return a single URL (as an HTTP redirect) or it may simply return a list of URLs from which the end user may choose.

As supplied, Squid assumes that a URN takes the form

    urn:fqdn:url-path

Given such a URN, Squid converts it into a list of URLs as specified in RFC 2169 [7] by making the query

    http://fqdn/uri-res/N2L?urn

In other words, Squid connects to a CGI script, /uri-res/N2L, on the host named by fqdn to obtain a list of URLs for url-path.

Consider a trivial example. If Squid is given the URN

    urn:www.apache.org:

to resolve, it connects to the N2L script running on www.apache.org. The script returns a list of mirror sites for the Apache HTTP server project.
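
The mapping from a URN to the query that Squid makes can be sketched in JavaScript as below. This models only the convention described above; it is not taken from the Squid source code, and the function name n2lQueryUrl is hypothetical.

```javascript
// Hypothetical helper: build the N2L query URL for a URN of the form
// urn:fqdn:url-path, following the convention described above.
function n2lQueryUrl(urn) {
    // The component after "urn:" is treated as a host name.
    var match = /^urn:([^:]+):/i.exec(urn);
    if (match === null) {
        return null; // not of the expected form
    }
    // The whole URN is appended, unmodified, as the query string.
    return "http://" + match[1] + "/uri-res/N2L?" + urn;
}
```

For the URN urn:www.apache.org: this sketch produces http://www.apache.org/uri-res/N2L?urn:www.apache.org: as the query URL.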

Resolving DOIs using Squid

Given a URN of the form

    urn:doi:10.1000/1

to resolve, Squid will treat doi as the fully-qualified domain name of a host running an N2L CGI script. Of course, doi isn't a fully-qualified domain name. However, Squid can be fooled into thinking that doi is a valid host name by adding an entry for doi to the /etc/hosts file on the machine on which it is running. (The /etc/hosts file is one of the methods used by UNIX machines to resolve host names.)

The machine associated with the doi name must be running an N2L CGI script that can resolve DOIs. Such a script is described below. If the resolution of the DOI is successful, the N2L script returns a list of one or more URLs (though currently DOIs always resolve to a single URL). If a single URL is returned, Squid issues an HTTP redirect to the Web browser by adding a Location: header to the reply. In the case of an error or a failed DOI lookup, N2L returns zero URLs, in which case Squid returns an HTML error message to the Web browser.
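
The three outcomes described in this article (no URLs, a single URL, several URLs) can be summarised in a short JavaScript sketch. It is a model of the behaviour as described here, not Squid's actual code; the function name and the shape of the returned object are hypothetical, and the sketch ignores Squid's optional selection of the "nearest" mirror.

```javascript
// Hypothetical model of how the list of URLs returned by N2L is
// turned into a reply for the Web browser.
function replyForUrls(urls) {
    if (urls.length === 0) {
        // Error or failed lookup: an HTML error message is returned
        return { kind: "error" };
    }
    if (urls.length === 1) {
        // Single URL: HTTP redirect carrying a Location: header
        return { kind: "redirect", location: urls[0] };
    }
    // Several URLs: a list from which the end user may choose
    return { kind: "list", urls: urls };
}
```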

The N2L script

A simple N2L (name to location) CGI Perl script to resolve a DOI into a URL has been developed by UKOLN. It is based on the example N2L script provided with Squid. It performs some minimal syntax checking of the supplied URN and then calls the hdlres command (see below) to query the Handle system and resolve the supplied DOI (or Handle) into one or more URLs.

The N2L script is available for downloading [15]. It can be queried directly using a URL of the form:

    http://resolver.ukoln.ac.uk/uri-res/N2L?urn:doi:10.1000/1
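
The UKOLN script itself is written in Perl and is not reproduced here. The JavaScript sketch below mirrors only the steps described above: a minimal syntax check followed by a lookup. All names are hypothetical. The resolveHandle argument stands in for the call to the hdlres command, and the stub resolver returns an illustrative value rather than the result of a real Handle lookup.

```javascript
// Hypothetical model of the N2L steps: check the URN, then resolve it.
// resolveHandle(doi) stands in for the call to hdlres and should return
// an array of URLs, or null if the lookup fails.
function handleN2lQuery(query, resolveHandle) {
    var prefix = "urn:doi:";
    // Minimal syntax check: must be a DOI based URN...
    if (query.substring(0, prefix.length).toLowerCase() !== prefix) {
        return [];
    }
    var doi = query.substring(prefix.length);
    // ...whose DOI has a non-empty prefix followed by '/'
    if (doi.indexOf("/") < 1) {
        return [];
    }
    var urls = resolveHandle(doi);
    return urls === null ? [] : urls; // zero URLs signals failure
}

// Stub resolver for illustration only; the mapping is not a real lookup.
function stubResolver(doi) {
    return doi === "10.1000/1" ? ["http://www.doi.org/"] : null;
}
```

Passing the query string urn:doi:10.1000/1 together with stubResolver yields a one-element list, while an unknown or malformed URN yields an empty list.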

The hdlres command

The hdlres command is a very simple Handle resolver. Given a DOI to resolve, it simply returns the URL associated with that DOI. It is based on the hdl_test command supplied with the Handle client library code. Currently hdlres does very little checking of the DOI (or Handle) that it is given to resolve or of the results returned from the Handle system.

The C source code for hdlres is available [16]. A copy of the Handle System Client Library is required to build the hdlres command.

Summary

When an end user types a DOI based URN into their Web browser's Location: field, the URN is passed to Squid for resolution. Squid connects to the N2L CGI script running on the doi machine, which in turn connects to a Handle server using the hdlres command. Assuming that the DOI is valid, the Handle system returns the URL associated with the DOI and Squid uses this as the basis for an HTTP redirect sent back to the end user's Web browser.

Conclusions and Further Work

The experimental system described in this article shows that the use of a Web cache such as Squid may be one method of deploying URNs without the need to integrate support for URN resolution protocols into every Web browser. However, the system requires further work. Some ideas on the areas that might be investigated in the future are outlined below.

Squid currently returns the URL associated with a DOI based URN to the Web browser as an HTTP redirect, thus causing the browser to retrieve the resource identified by the URN. There are three related problems with this approach. Firstly, repeated accesses to the same resource cause the DOI to be resolved each time. Secondly, the Web browser has no knowledge of the URN used to identify the resource after it has been retrieved. It only knows the URL that the DOI based URN resolved to. So, for example, it is the URL that is displayed in the browser's Location: field and that will be stored if the resource is bookmarked. Finally, and more fundamentally, Squid maintains no knowledge that a particular URN is associated with a particular resource. Squid may well cache the resource, but only after it has been retrieved by the browser as a result of the HTTP redirect. The cached resource is associated with its URL, not with its URN.

The first problem is partially relieved by the use of a caching Handle server [17] co-located with the Squid server. In this way, there is a local cache of DOI to URL mappings and a corresponding improvement in DOI resolution response times.
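
The idea of a local cache of DOI to URL mappings can be illustrated with a small JavaScript sketch. It is not a description of the caching Handle server [17]: the function name cachingResolver is hypothetical, the resolve argument stands in for whatever performs the real lookup, and the sketch never expires its entries.

```javascript
// Hypothetical wrapper: remember successful DOI -> URL lookups so that
// repeated requests for the same DOI do not trigger a new resolution.
function cachingResolver(resolve) {
    var cache = {}; // DOI -> URL
    return function (doi) {
        if (Object.prototype.hasOwnProperty.call(cache, doi)) {
            return cache[doi];      // answered from the local cache
        }
        var url = resolve(doi);     // fall back to the real lookup
        if (url !== null) {
            cache[doi] = url;       // only successful lookups are kept
        }
        return url;
    };
}
```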

The second and third problems may be solved with a change to the way in which URNs are resolved by Squid. Rather than returning an HTTP redirect to the browser, it may be possible for Squid to resolve the URN, retrieve the resource, cache it (such that the cached resource is associated with the URN) and finally return the resource directly to the browser. In such a system, subsequent requests for the same URN might be satisfied by using the cached copy of the resource directly. However, the details of how such a caching system might work are beyond the scope of this article.

Finally, the system described here is not limited to DOI based URNs. Indeed, the example N2L script available from UKOLN has already been extended to support the IETF URN namespace [18].

References

  1. Squid Internet Object Cache
    http://squid.nlanr.net/
  2. Harvest
    http://www.tardis.ed.ac.uk/harvest/
  3. UKOLN Metadata Resources - URNs
    http://www.ukoln.ac.uk/metadata/resources/urn/
  4. IETF URN Working Group
    http://www.ietf.org/html.charters/urn-charter.html
  5. Functional Requirements for Uniform Resource Names - K. Sollins, L. Masinter
    ftp://ftp.isi.edu/in-notes/rfc1737.txt
  6. Resolution of Uniform Resource Identifiers using the Domain Name System - R. Daniel, M. Mealling
    ftp://ftp.isi.edu/in-notes/rfc2168.txt
  7. A Trivial Convention for using HTTP in URN Resolution - R. Daniel
    ftp://ftp.isi.edu/in-notes/rfc2169.txt
  8. URN Syntax - R. Moats
    ftp://ftp.isi.edu/in-notes/rfc2141.txt
  9. The Digital Object Identifier (DOI) System
    http://www.doi.org/
  10. The Handle System
    http://www.handle.net/
  11. Serial Item and Contribution Identifier
    http://sunsite.Berkeley.EDU/SICI/
  12. Publisher Item Identifier
    http://www.elsevier.nl/inca/homepage/about/pii/
  13. Using Existing Bibliographic Identifiers as Uniform Resource Names - C. Lynch, C. Preston, R. Daniel
    ftp://ftp.isi.edu/in-notes/rfc2288.txt
  14. URN support in Squid
    http://squid.nlanr.net/Squid/urn-support.html
  15. N2L CGI script
    http://www.ukoln.ac.uk/metadata/software-tools/#N2L
  16. Source code for hdlres
    http://www.ukoln.ac.uk/metadata/software-tools/#hdlres
  17. Caching Handle Server
    http://www.handle.net/download.html
  18. A URN Namespace for IETF Documents - R. Moats
    http://www.ietf.org/internet-drafts/draft-ietf-urn-ietf-05.txt

Acknowledgments

Thanks to Rachel Heery (UKOLN) and Laurence Lannon (CNRI) for their comments on earlier versions of this article and particularly to Martin Hamilton (Loughborough University) whose idea formed the basis for much of this work.

UKOLN is funded by the British Library Research and Innovation Centre, the Joint Information Systems Committee of the Higher Education Funding Councils, as well as by project funding from the JISC's Electronic Libraries Programme and the European Union. UKOLN also receives support from the University of Bath where it is based.

Copyright © 1998 Andy Powell

hdl:cnri.dlib/june98-powell