UK Office for Library and Information Networking
University of Bath, Bath BA2 7AY, UK
D-Lib Magazine, July/August 1996
Lorcan Dempsey describes some recent metadata and resource discovery initiatives in the UK's Electronic Libraries Programme (eLib) and within the European Union's Fourth Framework Programme for research and technological development. A particular focus will be the quality controlled, subject-based, information gateways within eLib, the ROADS project which supports them, and the large-scale Desire project within Europe. These share some common features and development strands. An opening section will place these developments within a broader view of the evolution of metadata services and some trends and possible developments will be highlighted in a closing section.
This issue of D-Lib Magazine reports the significant steps that are being taken to address the issue of resource description in a distributed, electronic environment. A companion article lays out the results of the Second Invitational Workshop on Metadata, jointly organised by the OCLC Office of Research and UKOLN (Dempsey and Weibel). Important steps were taken there in preparing the Dublin Core Metadata Element Set for operational use and in identifying the need for a higher level container architecture -- named the Warwick Framework -- for the aggregation of metadata information objects. UKOLN sought this collaboration because it is now involved in several important European initiatives in this area and recognised the importance of creating points of contact with work being done elsewhere. This article aims to introduce readers to this European work.
Metadata is data which describes attributes of a resource. Typically, it supports a number of functions: location, discovery, documentation, evaluation, selection and others. These activities may be carried out by users (human users or their agents) or by client programs (especially selection between locations and formats). It is recognised that in an indefinitely large resource space, effective management of networked information will increasingly rely on effective management of metadata. Increased commercialisation and complexity of information resources makes this need all the greater.
It is unlikely that some monolithic metadata format will be universally used. This is for a number of more or less well known reasons. There is a variety of types of metadata. There is traditional descriptive information of the kind found in library catalogues, which typically includes such attributes as author, title, some indication of intellectual content and so on. There is information that might help a client application make a decision based on format (where certain local browser equipment is available) or on location (to save bandwidth). There are different types of user: a user as customer wishes to know the terms under which an object is available; a user as researcher may wish to have some extended documentation about a particular resource, its provenance for example. There are different types of resource. Some resources may have a fugitive existence, existing to satisfy some temporary need and are only ever minimally described, if at all; some are important and valuable scholarly or commercial resources, where the value of extensive description is recognised. Some resources may be simple; some may be complex in various ways. There will be many different information providers, some commercial 'yellow pages' type services, some scholarly or research oriented services, in different organisational configurations with different target audiences and products. Metadata may be closely coupled with the object it describes as an intrinsic part of its composition; or it may have no intrinsic link with it at all. And so on ...
At the same time, the technical and service environments in which metadata services are being deployed are the subject of rapid change and development. Thus, the nature of the problem to be solved suggests a variety of solutions.
It is helpful to consider a metadata spectrum, along which data becomes successively fuller, more structured, more specialised, and more expensive to create. Three bands along this spectrum are suggested in the table below, which outlines services that use metadata. This division into bands is presented for purposes of analysis only -- as a way of presenting a view over an existing situation -- and not as a set of fundamental distinctions. At one end is data used in services which currently support location and limited discovery. Terse metadata allows searching for known items and for topics. This is the area inhabited by the webcrawlers which automatically extract data from resources. They currently provide web-forms access onto proprietary databases. There are moves to enhance the metadata on which they operate and to develop frameworks for interworking.
At the other end of the spectrum is metadata which supports much richer functionality. It is often associated with research or scholarly activity, requires specialised knowledge to create and maintain, and caters for specialist domain-specific requirements. Metadata formats being developed here can express a variety of formal and intellectual relationships, and are often part of a larger framework which provides for the encoding of the content they describe. Typically, these initiatives are looking at the creation of a variety of SGML document type definitions to cater for metadata and 'content'. They include the Inter-university Consortium for Political and Social Research (ICPSR) SGML codebook initiative to describe social science data sets, the Encoded Archival Description (EAD), the Content Standard for Digital Geospatial Metadata (FGDC), and the Consortium for the Computer Interchange of Museum Information (CIMI). The Text Encoding Initiative can also be included here, and influenced some of these other developments. There has been some interest in the Z39.50 profile for access to digital collections for search and retrieval of such metadata. This has been taken furthest with CIMI, where a companion profile is being developed for specific use with museum data.
The middle band contains services which contain full enough descriptions to allow a user to assess the potential utility or interest of a resource without actually having to retrieve it or connect to it, but not so full or complex as to require very specialist staff to create. Typically, but not essentially, these descriptions are manually created, or are manual enhancements of automatically extracted descriptions, and they include a variety of descriptive and other attributes. They may be created to be loaded directly into a discovery service or may be harvested. Typically they are simple enough to be created by non-specialist users, or not to require significant discipline-specific knowledge. Descriptions tend to be of discrete objects and do not capture multiple relationships between objects. Again, typically, they involve some selectivity in what they describe and may have more or less explicit criteria for selection. For these reasons, they may be expensive to create, again driving an interest in author- or publisher- generated description and automatic extraction techniques. Examples of services in this category are OCLC's NetFirst, and the subject-based information gateways being developed in the UK Electronic Libraries Programme and in Desire.
Against this analytic background, one can note some likely future directions, especially across the boundaries of these bands. Author- or site-produced metadata will become more important for crawlers and the middle band of services. This metadata may be harvested unselectively, or only from selected sites, depending on the service environment. An important motivation for such harvesting is to overcome some of the deficiencies of current crawlers without a provider incurring the cost of record creation. In some respects, the crawlers will assume characteristics of the middle band as presented above.
At the same time, communities using the richer 'documentation' formats will wish to disclose information about their resources to a wider audience. How best to achieve this will have to be worked out: perhaps 'discovery' records will be exported into other systems.
These directions suggest that the middle band will become more important as a general-purpose access route, maybe with links to richer domain-specific records in some cases. Incidentally, the current interest in the Dublin Core Metadata Element Set can perhaps partly be explained by the fact that it has been positioned so that its target uses coincide with the trends outlined above: it provides the basis for a 'discovery' record; it can provide the basis for embedded content description which can be harvested by robot; and it aims to provide the basis for interoperability across richer description models.
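To make the second of these uses concrete: an author or site administrator might embed simple Dublin Core style description directly in an HTML page, where a harvesting robot can extract it as a 'discovery' record. The sketch below is illustrative only -- conventions for naming Dublin Core elements within HTML were still being worked out at the time of writing, and the resource described is hypothetical.

```html
<!-- Sketch: Dublin Core style elements embedded as HTML META tags,
     harvestable by a robot. The DC.* naming is illustrative. -->
<head>
<title>Coastal Erosion in Northern Europe</title>
<meta name="DC.title"   content="Coastal Erosion in Northern Europe">
<meta name="DC.creator" content="A. N. Author">
<meta name="DC.subject" content="geomorphology; coastal management">
<meta name="DC.date"    content="1996-07">
</head>
```

Embedded description of this kind lets a provider improve its representation in crawler indexes without a separate record-creation effort, which is precisely the motivation sketched above.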
And what about MARC? MARC will clearly continue to be used where integration with existing bibliographic systems is important. It is more complex than most of the 'discovery' formats mentioned here. Some of the issues to do with MARC are discussed elsewhere [Heery, 1996. Further reading, ref. 3.].
Examples
  Crawler services: Current web crawler services.
  Middle-band services: OCLC's NetFirst service. The UK subject-based information gateway services.
  'Documentation' formats: Metadata is associated with 'collection-based' resources - archives, electronic text collections, museum collections, geospatial reference data, social science data sets, and so on.

Data creation
  Crawler services: Extracted from resources by software robot. The data is typically terse and relatively uncontrolled. Crawler indexes are thus relatively cheap to create.
  Middle-band services: Manual, or manual enrichment of robot-extracted data. Data is of medium fullness, comprising basic descriptive data and some other types of data. Creation requires basic information and domain knowledge.
  'Documentation' formats: Manual. The data is typically very full, covering not only generic attributes (e.g. provenance, archival responsibility) but fuller domain-specific requirements. The resources will often be scholarly and research materials justifying and requiring specialist information and domain knowledge to create metadata, which is therefore typically expensive to create.

Coverage
  Crawler services: Aim for comprehensive coverage within a defined area. Web-based resources are central, though some attention is also given to newsgroups.
  Middle-band services: Often selective within a particular subject or geographic domain. Promote a filtered, quality-controlled approach. Typically, the metadata exists independently of the object it describes.
  'Documentation' formats: Quite a specific focus, typically collection oriented. The metadata may be included in the same larger framework as the 'content' it describes.

Granularity and aggregation
  Crawler services: Individual object level. Discrete objects tend to be indexed, with some facilities for tracking links between objects.
  Middle-band services: Often (but not necessarily) at server level for 'cost' reasons. Limited structuring devices to represent the variety of relationships that might exist between objects.
  'Documentation' formats: Typically at collection level and at various lower levels. Expressive enough to capture a variety of intellectual and formal relationships at different levels.

Metadata formats
  Crawler services: There is some movement to agree on ways of sharing indexing information between services, perhaps based on Harvest's internal format, SOIF.
  Middle-band services: Tend to be simple attribute-value pairs, although a decision was recently made to represent the Dublin Core as an SGML Document Type Definition. A requirement is not to impose a large overhead on creators. Examples are IAFA templates, the Dublin Core and RFC 1807.
  'Documentation' formats: SGML Document Type Definitions are in preparation in a number of domains. Typically, metadata is one component of a larger framework for document description. Examples are TEI, CIMI, FGDC, the ICPSR SGML codebook initiative, and EAD.

Search and retrieve
  Crawler services: Proprietary search engines accessible through web forms.
  Middle-band services: There has been some development of distributed search-and-retrieve approaches (Dienst, WHOIS++, LDAP, ...). Some are available through web interfaces onto proprietary engines.
  'Documentation' formats: This area is still under development. Some approaches are considering the use of Z39.50, specifically the profile for access to digital collections.
The Electronic Libraries Programme (eLib) is a three-year programme to modernise library services in the UK higher education community. The programme is spending approximately 15 million pounds, and about 60 individual projects are currently in progress. Funding comes from the Joint Information Systems Committee of the Higher Education Funding Councils. UKOLN maintains the eLib information pages where further information about the programme and individual projects can be found.
eLib is organised around a number of project areas. One of these is Access to Network Resources. A major focus of this area is metadata and resource discovery services. Several subject-based information gateways have been funded, and are in various stages of start-up. They include:
These are all organised in various ways but share certain characteristics. They are very much located in the middle band as outlined above: they aspire to provide discovery services to UK higher education. In some ways, they are rather like early abstracting and indexing services. They are developing databases of resource descriptions from which other services are derived. They are defining selection and quality criteria which determine which resources are included. They are aiming to have comprehensive coverage of resources within UK higher education, but take different approaches to resources elsewhere. They are focusing on the server level - it is difficult to do otherwise and sustain this level of description - but aspire to move to the object level, however that might be achieved. Typically, their staff effort goes into management and administration, 'cataloguing', technical development, and promotion.
Given the variety, volume, and volatility of relevant resources, it is likely that the subject services will consider how to lever their effort in a number of ways: by collaboration (with similar initiatives elsewhere, with commercial services, with a range of interested organisations), by more proactive use of gathering, discovery and alerting technologies, and so on.
The services are considering various technical approaches and clearly have to make choices about metadata formats. The base service will have to provide web access to a searchable database of descriptions and some organised browsing facilities. Some of the services are using the ROADS software, described in the next section.
There are also some other projects in the Access to Network Resources area. CAIN and RUDI are looking at creating multimedia resources in the areas of conflict studies and urban design respectively. They are also looking at 'information gateways' as described here, but unlike the projects mentioned above, this is not their central purpose.
ROADS is a development project, also in the Access to Network Resources area of eLib.
A major objective of ROADS is to provide a set of software tools for the eLib subject-based services described above. These tools will allow the construction of distributed services and provide some data creation and other tools. An early motivation was that it should be possible to avoid the fragmentation of the current bibliographic environment and allow users to have unified access to a distributed resource. ROADS is being developed in concert by the University of Loughborough (technical development), UKOLN at the University of Bath (metadata issues, requirements), SOSIG at the University of Bristol (coordination and project management, liaison with services and information providers, requirements), and Bunyip (technical consultancy).
The development of ROADS has been guided by several strategic decisions. Wheels should not be reinvented: solutions should be based on Internet standards. Where those are being formed, ROADS experience will be put into their development. The system needs to support services in the middle band identified above, and certain decisions flow from that.
For this type of application, what is required is a description which is simple to create yet full enough for effective retrieval and relevance judgement. This implies a description which falls between the terseness of the crawlers and the fullness of a research library catalogue record. ROADS is looking at IAFA/WHOIS++ templates because they offer a basis for international consensus and because we can influence their development. There is also a range of template types: for users, for services, for documents, and so on, allowing a fuller range of objects to be represented than with some other formats.
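For illustration, a description in the IAFA attribute-value style might look something like the following. This is a sketch only: the resource and addresses are invented, and the fields shown are indicative of the style rather than a definitive template.

```
Template-Type: DOCUMENT
Title:         Social Statistics Briefings
URI:           http://www.example.ac.uk/stats/
Description:   A collection of briefing papers on European
               social statistics, updated quarterly.
Keywords:      social science; statistics; Europe
Admin-Email:   webmaster@example.ac.uk
```

The flat attribute-value structure is simple enough for non-specialist creators, which is central to the cost argument made above.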
ROADS introduced the concept of 'trusted information provider', arguing that the collection of descriptions from approved authors and web-site administrators will be essential to a sustainable service. Again, a simple record structure will facilitate this and data creation and gathering tools are being prepared as part of the project.
A distributed framework is required: the user is best served by a solution which allows autonomously managed services to be accessible individually or as a navigable unit. Furthermore, we think this should be done in a network-friendly way which allows for multiple different configurations of service. This aspiration is typical of directory-type applications, and we see this application as a distributed 'yellow pages' directory. The project is using WHOIS++ because it is a leading candidate for this type of service and because, through use of the Common Indexing Protocol (also known as centroids), it supplies a mechanism for selecting which servers to search.
Any solution should not unduly prejudge how a service wishes to deliver its offerings; areas for customisation are identified at the requirements stage.
The choice of IAFA and WHOIS++ was governed by pragmatic reasons: there is no ideological adherence to them as the 'one true path'. In the timescales of the project, it was important to have something which was immediately serviceable.
Importantly, the project acknowledges that there is no one solution. There will be several protocols and formats and many more on-the-wire conversions between them. It is in this context that ROADS includes a component which looks at interoperability with MARC and Z39.50. This will result in a report on semantic interoperability among USMARC, IAFA/WHOIS++ templates, and the Z39.50 BIB-1 attribute set, and also a small demonstrator system. The latter will involve integrated access to a selection of bibliographic resources and a subject-based network service using Z39.50 and WHOIS++. A Z39.50/WHOIS++ gateway is being developed in this context.
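The kind of semantic mapping such interoperability work involves can be sketched as a simple crosswalk between IAFA-style attributes, USMARC fields, and BIB-1 use attributes. The table below is purely illustrative -- it is not the project's actual mapping, and the attribute names on the IAFA side are indicative -- though the USMARC tags and BIB-1 use attribute numbers shown are the standard ones.

```python
# Illustrative crosswalk between an IAFA-style attribute, the
# corresponding USMARC field, and the Z39.50 BIB-1 use attribute.
# This is a sketch of the technique, not the ROADS project's mapping.
CROSSWALK = {
    # IAFA attribute: (USMARC field, BIB-1 use attribute)
    "Title":       ("245", 4),     # BIB-1 use 4 = Title
    "Author-Name": ("100", 1003),  # BIB-1 use 1003 = Author
    "Keywords":    ("650", 21),    # BIB-1 use 21 = Subject heading
    "Description": ("520", 62),    # BIB-1 use 62 = Abstract
}

def iafa_query_to_bib1(attribute, term):
    """Translate a search on an IAFA attribute into the BIB-1 use
    attribute a WHOIS++/Z39.50 gateway would forward."""
    marc_field, use_attribute = CROSSWALK[attribute]
    return {"use": use_attribute, "term": term}

print(iafa_query_to_bib1("Title", "coastal erosion"))
# A gateway would build a Z39.50 query with use attribute 4.
```

In practice such mappings are lossy in both directions -- MARC makes distinctions the simpler formats cannot express -- which is exactly why a report on semantic interoperability is needed rather than a mechanical table.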
At a wider level, ROADS aims to provide a collaborative framework through which the UK community can help contribute to the design and development of the future of such services. This is at the service and organisational level, and at the technical level where numerous issues are outstanding. Standards for the creation of descriptions and for the search and retrieve of such descriptions are still under development and open to influence.
What stage is the project at? OMNI and SOSIG are currently in production mode using ROADS version 0. This was developed so that services would have something to work with from an early stage. It provides a simple search engine, several tools for creation and processing of IAFA records, and a web interface. The services offer a search service and can generate browsable pages by subject (class number) and other categories. ROADS Version 1 is due to go into alpha test shortly and incorporates WHOIS++. At this stage, the services will continue to provide standalone servers accessed over the web. It includes a range of other functional enhancements. In early 1997, ROADS Version 2 will be released, which will implement the Common Indexing Protocol. At this stage, we will test the ability of the system to provide unified access to a metadata resource distributed over several subject-specific servers. Each server will generate a centroid, an inverted-index style representation of its content, which will be collected in an index server. The centroids provide 'forward knowledge' about the contents of servers in the distributed directory so that sensible decisions can be taken about navigation. A fuller description of the technical architecture of ROADS and its implementation of WHOIS++ can be found elsewhere [Knight and Hamilton, 1996. Further reading, ref. 4.].
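The centroid idea can be sketched quite simply: each subject server reduces its records to the set of distinct words occurring in each attribute, and an index server uses these reduced representations to decide which servers are worth searching. The code below illustrates the principle only -- the Common Indexing Protocol defines its own wire format and exchange mechanics, and the server names and records here are hypothetical.

```python
# Sketch of the centroid ('forward knowledge') idea behind the
# Common Indexing Protocol. Illustrative only: the real protocol
# defines its own formats; servers and records here are invented.

def make_centroid(records):
    """Reduce a server's records to {attribute: set of distinct words}."""
    centroid = {}
    for record in records:
        for attribute, value in record.items():
            centroid.setdefault(attribute, set()).update(value.lower().split())
    return centroid

def route_query(centroids, attribute, word):
    """Return the servers whose centroid suggests they may hold matches."""
    return [server for server, c in centroids.items()
            if word.lower() in c.get(attribute, set())]

# Two hypothetical subject servers publish centroids to an index server.
centroids = {
    "sosig": make_centroid([{"Title": "European social statistics"}]),
    "omni":  make_centroid([{"Title": "Clinical medicine resources"}]),
}
print(route_query(centroids, "Title", "statistics"))  # -> ['sosig']
```

A centroid can claim only that a word occurs somewhere on a server, not that a relevant record exists, so routing decisions based on it may return false positives but never miss a server that holds the word.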
Further development may be possible at this stage, and will certainly take place within the Desire project. Although ROADS aims to provide the eLib subject services with a systems framework, other users are not discouraged, and the project has a secondary aim of demonstrating the efficacy of a distributed solution in one compelling service environment. If successful, the project hopes to extend its scope to other directory applications. Users would benefit enormously from a simple-to-maintain, working directory service for distributing information about courses, research interests, 'white pages' data, and so on.
Further details of the ROADS system and its partners are available on the ROADS web pages.
DESIRE addresses the needs of European researchers to locate and retrieve information relevant to their research activities. It is a project within the Telematics for Research area of the European Union's Fourth Framework Programme (the EU organises its research and technological development activity in five-year cycles or 'framework programmes'). Desire is a large project examining many issues to do with the use of the web (security, authoring, caching, training and other areas); it is coordinated by SURFnet in the Netherlands.
One component of Desire looks at Indexing and Cataloguing of web resources. Within this, a dual approach has been taken. A robot-based web index is being developed to assist in locating information objects in an indefinitely large information space: it aims to provide exhaustive, 'vacuum-cleaner' coverage of web pages, working with a network-friendly, distributed approach developed within the Nordic Web Index project at the University of Lund. At the same time several information gateways are being developed in particular subject areas (social science, art, engineering) which are based on quality-controlled resource catalogues containing full descriptions of resources which meet specified quality criteria. The systems framework for the information gateways will be an enhancement of the software developed in ROADS. Multilingual support will be added, for example, as well as some functional enrichment. The engineering gateway will be provided by the University of Lund and the art one by Koninklijke Bibliotheek, the National Library of the Netherlands. The social science gateway will build on SOSIG, giving it an important additional dimension as it extends coverage into Europe. Lund, the National Library of the Netherlands, and the ROADS group are the partners in the Indexing and Cataloguing subconsortium within Desire.
Work began on Desire in early 1996. Among the first deliverables of Desire will be state of the art reports on metadata formats (coordinated by UKOLN) and on robot-based web indexing (coordinated by the University of Lund). These reports will be made publicly available on the web later this year.
Experience to date with the IAFA templates being used in ROADS has shown that they are flexible enough to support a range of requirements. There are some outstanding issues to be resolved, especially where various structuring devices are required. The content of the records is being refined in use, and that experience will be fed into future versions of ROADS, into Desire, and into wider metadata discussions. The distributed solution has yet to be tested, but again will provide valuable working experience.
As noted, the subject services have focused on web sites rather than on individual information objects within them. What of the user who wants to find an image of the Mona Lisa? They have no direct access to such an object at this level of description. Of course, this is one reason that we are interested in author- or site-created metadata.
For the moment, especially given the dual focus within Desire, this problem has prompted reflection on linkages between the quality-controlled description-based services, and robot-generated indexes. For example, if one locates a server of interest through the former, then a search could be generated on a web index to retrieve data about objects on those particular servers. However, although potentially useful, this does not address the Mona Lisa problem. One has still to locate a potentially relevant server unaided.
Another approach would be to crawl those resources which have been added to the information gateways, creating an index at object level which complements the server-level descriptions. In this case, the searcher for Mona Lisa would have a better chance of finding it and would know that it was part of a resource which had met certain selection and quality criteria.
These issues will be explored within ROADS and Desire.
ROADS is designed to provide a federating solution for a particular class of application. However, the world is very clearly multiprotocol and multiformat, and will continue to be so. A variety of interworking strategies will be required, at various levels. The Common Indexing Protocol (centroids) and the Warwick Framework are interesting in this respect. For ROADS, a defined issue is interworking with Z39.50 and MARC resources. However, this is one part only of a much wider picture. There are also UK initiatives in the archives, electronic texts, and other communities which will be generating metadata of the 'documentation' kind (perhaps EAD, TEI and MARC, for example). This will involve some discussion within UK higher education of how to bring this variety into the same context of use, and ROADS will contribute to that discussion.
eLib is one manifestation of the UK belief that central action is helpful in moving higher education information systems forward. The emphasis of the eLib subject-based services has been on discovery. They identify and describe resources of potential interest to the communities they serve. It might now be interesting to attend in an organised way to disclosure - the publication of descriptions of one's own resources. This will be the initial focus of imminent UK activities in the archives community and of such initiatives as the newly set up Arts and Humanities Data Service which will be sponsoring the creation of collections of digital materials and their description. It is also, of course, typically what libraries have done through their catalogues.
Discovery and disclosure are intimately linked. An organised approach to the disclosure of resources, in which universities, research organisations and other information repository managers make descriptions available for reuse in several different service scenarios, would be valuable in a number of contexts, not least as an exercise in self-promotion in an increasingly competitive environment. The Higher Education Funding Councils might have an interest in the organised disclosure of the intellectual output of UK universities; and each university might have an interest in the organised disclosure of its own output. In this inevitably distributed environment, lessons learned in ROADS will be of interest.
The contents of this issue of D-Lib Magazine are testament to the burgeoning activity in this area and its importance. It is an inherently international activity. We hope that we can continue to build on the multinational cooperation witnessed at the second metadata workshop at Warwick and contribute to and benefit from the ongoing discussion and development.
My thanks to colleagues Rachel Heery and Michael Day for helpful input. The 'Mona Lisa problem' is a formulation of Chris Rusbridge's. UKOLN is funded by the Joint Information Systems Committee of the Higher Education Funding Councils and by the British Library Research and Innovation Centre. This article draws on work carried out in the ROADS and Desire projects. Any views expressed in this article are the author's own.