Cross-searching Subject Gateways: the Query Routing and Forward Knowledge Approach

D-Lib Magazine
January 1998

ISSN 1082-9873


John Kirriemuir, Dan Brickley
Institute for Learning and Research Technology
University of Bristol, UK

Susan Welsh
OMNI: Organising Medical Networked Information
Nottingham University, UK

Jon Knight, Martin Hamilton
Loughborough University of Technology,
Loughborough, UK

Abstract

A subject gateway, in the context of network-based resource access, can be defined as some facility that allows easier access to network-based resources in a defined subject area. The simplest types of subject gateways are sets of Web pages containing lists of links to resources.
Some gateways index their lists of links and provide a simple search facility. More advanced gateways offer a much enhanced service via a system consisting of a resource database and various indexes, which can be searched and/or browsed through a Web-based interface. Each entry in the database contains information about a network-based resource, such as a Web page, Web site, mailing list or document. Entries are usually created by a cataloguer manually identifying a suitable resource, describing the resource using a template, and submitting the template to the database for indexing.
Subject gateways are also known as subject-based information gateways (SBIGs), subject-based gateways, subject index gateways, virtual libraries, clearing houses, subject trees, pathfinders and other variations thereof. This paper describes the characteristics of some of the subject gateways currently accessible through the Web, and compares them to automatic "vacuum cleaner" type search engines, such as AltaVista. The application of WHOIS++, centroids, query routing, and forward knowledge to searching several of these subject gateways simultaneously is outlined. The paper concludes by looking at some of the issues facing subject gateway development in the near future. It touches on many of the issues mentioned in a previous paper in D-Lib Magazine, especially regarding resource-discovery related initiatives and services [1].

Characteristics of Subject Gateways
There are a considerable number of Web-based gateways that can be used to locate network-based resources in some particular subject area. Nearly all of these gateways have unique features, additional subject-based services, and different approaches to how information about network-based resources is stored in the resource description database.
Basic gateway facilities
Most subject gateways allow the end-user to either search or browse the database of resource descriptions. For example, the AstroWeb Astronomy gateway [2] consists of a browsable multi-level menu of sub-areas and resources, as well as a WAIS-based search mechanism. In addition, most gateways offer the options of case-sensitive searching and stemming, where resource descriptions containing variations of a term are also located (for example, a search for painter also finds painted, paints and other terms beginning with paint). ADAM, the Art, Design, Architecture and Media information gateway [3], is one example.
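As a rough illustration of the two search options just mentioned, the following Python sketch treats stemming as prefix matching on a crudely derived stem. The functions crude_stem and search and the sample descriptions are hypothetical; a real gateway would rely on its own indexing software and a proper stemming algorithm.

```python
def crude_stem(term):
    """Strip a few common English suffixes (an assumption, not a real stemmer)."""
    for suffix in ("ers", "er", "ed", "ing", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def search(descriptions, term, case_sensitive=False, stemming=False):
    """Return the descriptions containing a word that matches the term."""
    if not case_sensitive:
        term = term.lower()
    stem = crude_stem(term) if stemming else term
    hits = []
    for text in descriptions:
        words = text.split() if case_sensitive else text.lower().split()
        words = [w.strip(".,;:()") for w in words]
        if stemming:
            # Any word beginning with the stem counts as a variation.
            matched = any(w.startswith(stem) for w in words)
        else:
            matched = term in words
        if matched:
            hits.append(text)
    return hits


descriptions = ["Works painted by Turner", "A guide to paints", "Sculpture archive"]
stemmed_hits = search(descriptions, "painter", stemming=True)
exact_hits = search(descriptions, "painter")
```

With stemming switched on, the query painter is reduced to paint and so locates the first two descriptions; without it, no description contains the exact word.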
Additional searching facilities
Some gateways provide extra facilities for enhanced searching. SOSIG, the Social Science Information Gateway [4], incorporates a thesaurus containing social science terminology. This gives users the option of generating alternative terms/keywords with which to search the resource catalogue. SOSIG also allows users to search on resources that are located in distinct geographic areas [5], such as in the whole world, just in Europe or just in the UK. EEVL, the Edinburgh Engineering Virtual Library [6], allows users to search on a subset of types of resources, such as electronic journals, mailing lists and/or conference announcements. PAW [7], the Physics Around the World gateway, allows users to select the level of suitability for educational physics resources e.g. kids, school or university. Aqueous [8], a gateway dedicated to water related Web sites, allows two levels of searching (deep dive and shallow dive).
Additional subject-related services
Some subject gateways have built a considerable number of related services and information sources around their core searchable/browsable gateway. Biz/ed [9], the Business and Economics information gateway for students, teachers and lecturers, contains large amounts of business and economic information, such as company financial data and economic datasets from sources such as the Office for National Statistics. Eldis [10], the Electronic Development and Environment Information Service, provides access to several related bibliographic databases. History [11], a gateway to (not surprisingly) network-based history resources, provides details of historians and their research and teaching interests. OMNI [12], a medical and health subject gateway, has a section which allows users to purchase health-related CD-ROMs. OMNI also allows users to search across other databases of resources [13], such as a database of dental resources described by DERWEB.
Resource cataloguing
The key difference between subject gateways and the popular automated large-scale Web indexing systems such as AltaVista is the quality of the results which the end-user receives. This is dependent on the nature of the cataloguing process. For example, we searched AltaVista and OMNI on the term epilepsy on January 4th 1998; from the results, we observed the descriptions of the first three hits:
AltaVista search results

nbsp;The Future of Epilepsy Surgery for Children. BY JOSEPH R. MADSEN, M.D. Nikki, now a college sophomore, was born with a very large fluid collection in.

Dallas Epilepsy Association" "2906 Swiss Avenue", "Dallas" Contact: Volunteer Coordinator "(214) 823-8809 " Services: "Counseling. Financial assistance...

EPILEPSY INFORMATION. Canine epilepsy is a problem in Border Collies. Both primary (potentially genetic) and secondary (trauma, disease, drugs, etc)...

OMNI search results

The Neuroscience Consortium at Birmingham has been formed by a number of departments within the University with the aim of developing Birmingham as a centre of excellence in neuroscience. Apart from general information on the Consortium, the main feature of the Consortium home page is coverage of the various research groups at Birmingham with interests in neuroscience, giving details of individual research projects, staff and publications. The subject interests of these groups include autonomic function, basal ganglia, degeneration and trophins, epilepsy and prion diseases.

This NSE site provides a wide range of basic information prepared by professionals working in the field of epilepsy, and is updated regularly. Topics covered include diagnosis, treatment, seizures, living with epilepsy (driving, pregnancy, school, work, safety in the home etc.) and NSE services.

This guide to paediatric epilepsy, written by Timothy F.Hoban of the Loyola University Medical Center, covers causes, clinical manifestations, guidelines for treatment and clinical case studies, including absence epilepsy, febrile seizures and infantile spasms.

Here, we can see that the resources catalogued in OMNI have been described in a "human readable" fashion, whilst the entries in AltaVista are presented more as "raw data". OMNI points directly to the home page or start point of a resource, while AltaVista often points to pages without context, leaving users to find their own way.
AltaVista records are created by an automatic process and typically consist of a mixture of metadata offered by the author of the page (if this is available) and text picked up from the page itself. In contrast, OMNI records offer information created by a cataloguer, which is designed to highlight the main features of a resource in an easily readable, concise fashion.
In addition, cataloguing by hand allows keywords to be added to the record which enables more relevant results to be retrieved and offers the opportunity to develop thesaurus-based searching.
AltaVista indexes individual pages, not resources. As an illustration of the difference between a page and a resource, consider that an online textbook could consist of many web pages, hyperlinked together via a table of contents. The AltaVista software does not know which set of pages on a server constitutes a resource and, when it encounters a large collection of pages, is likely to index only a random sample [14]. Subject gateways such as OMNI, on the other hand, catalogue at the resource level, and will therefore describe a resource composed of many pages in a much more coherent fashion.
Lastly, the resources described in a subject gateway are likely to have been hand-picked and catalogued with a particular audience in mind. The OMNI gateway, for example, includes only biomedical resources of interest to the higher education and research community, and catalogues resources with this in mind. Thus, resources included in OMNI are indexed and classified in a similar way to books in an academic medical library, and are selected so that they are at an appropriate level for students, researchers, lecturers, etc. This tailored and selective approach is not possible for a service such as AltaVista, which successfully serves a much broader community.
ROADS-based gateways
Several of the gateways mentioned in this paper use the software from the ROADS (Resource Organisation And Discovery in Subject-based services) initiative [15]. The software, which is freely available, enables a Webmaster to set up a subject gateway. As well as the software, the ROADS project provides support for people who are either setting up a gateway from scratch or converting their existing resource catalogue into a ROADS-compatible form. Records in ROADS databases are stored in templates based on the IAFA template [16]. The combination of ROADS and the ROADS templates allows gateways to write a flexible but (if they wish) comprehensive entry for every resource they catalogue.
The software toolset includes various facilities to assist cataloguers in data entry, such as a specialised cataloguer's interface, and also provides database and indexing facilities as well as various optional tools to assist in database and data integrity management. For example, a customisable link checker is included, which automatically checks the URLs of all the resources catalogued, at whatever time intervals the gateway maintainer specifies, e.g. daily, weekly or monthly. The link checker can be configured to report not only which links have failed, but also what type of failure was encountered, as well as which resources have failed consecutive link checks.
Currently, there are eleven ROADS-based subject gateways which are accessible [17] to the general public. Another seven are in various stages of construction, and will "go live" over the next few months. In addition, the ROADS project is in negotiation with several organisations about either setting up a gateway from scratch, converting their existing collection of resource descriptions into a ROADS-compatible format to build a new gateway, or cross-searching (as will be explained later) their non-ROADS gateway with other subject gateways.
Subject areas covered by subject gateways

As can be seen from the examples so far, many subject areas are covered by subject gateways. Some subject areas are without a subject gateway; for example, there are no gateways of the scope and type discussed so far that cover subject areas such as music or religious studies. However, some subject areas are covered by more than one gateway. Looking at engineering, we have already mentioned EEVL [6], but in addition, there is EELS [18], the Swedish-based Engineering Electronic Library, which has also catalogued a large number of engineering resources, and WWEVL [19], the Wastewater Engineering section of the Virtual Library.
The subject area covered by the largest number of subject gateways is probably that of health and medicine. We have already mentioned OMNI, the main UK gateway to biomedical networked resources. In addition, there are medical/health gateways such as Six Senses [20], Medical Matrix [21], Healthweb [22], Hon [23] and Medweb [24]. This leaves someone looking for quality medical resources with several dilemmas. Which subject gateways should they choose? And out of those chosen, in which order should they be visited? A medical/health gateway that is most suitable for one subtype of resource, i.e. one that contains many catalogued entries for quality resources of that subtype, may not be so suitable for another subtype; therefore, should a user poll each of a selection of health/medical subject gateways for each individual query in order to get a good level of subject gateway coverage?
The same issues arise for people involved in inter-disciplinary resource discovery. For example, a student could be writing an essay on "The Socio-economic implications of vaccination programmes". From this title we can see that they would be interested in relevant quality resources that may be located through either a social science, economics or medical subject gateway. However, it would be time consuming to search several gateways in each of these areas. What the student really requires is some mechanism where they can execute a single cross-search of several subject gateways in these areas, and have a cumulative listing of the results presented to him/her. Recent research and development in such cross-searching of subject gateways has been undertaken; the technical mechanisms of such a system are described in the next section.

Query routing and forward knowledge

The increased availability of networked databases is rapidly leading to the situation where many users will need to query multiple distributed databases in order to locate all of the information that they require. Unfortunately, this has traditionally meant that the end user has to either query each of these databases individually or else use a standardized search and retrieval protocol client (such as a Z39.50 [25] client) that has been pre-configured to search a set of remote database servers. The first of these places the burden of locating the remote databases and learning each database's query interface on the end user. The second means that remote database servers and network links are often used unnecessarily, even when their databases hold no information relevant to a user's query.

What is needed to improve this situation is for the remote databases to be able to let each other have some knowledge of the sort of data that they hold in advance of the end user's query being processed. This is known as "forward knowledge" and can be used to provide "query routing" from a single initial database server on to other servers that are likely to hold relevant information. If this forward knowledge contains information about the query language in use at the remote databases, the end user's client might also be able to translate the user's initial query into a form that is appropriate for the remote server and translate the results into a standard display format.
The Common Indexing Protocol (CIP v3) [26] is intended to fulfill this need for forward knowledge to permit efficient query routing to take place. CIP v3 is based upon the concept of centroids, which stems from the WHOIS++ directory access protocol [27] (indeed WHOIS++ is known as CIP v1). A directory access protocol such as WHOIS++ is designed to allow queries to be made against a directory database of people or services (so-called white pages and yellow pages services respectively, named after the two types of telephone directories in the USA), which can be useful for allowing email addresses to be looked up, for example [28]. Whilst CIP v3 itself is not tied to any particular database or access protocol, in order to understand the basic principles behind CIP v3, this paper will first outline the simpler generation and use of centroids in WHOIS++.
A centroid can be thought of as a summary of the information known about by a WHOIS++ server. A WHOIS++ server holds its own information in the form of textual attribute-value pair based records, each of which is derived from a template with a specific template type (such as a USER or a DOCUMENT). For example, the information derived from a medical subject gateway about one resource could be:
 Template: DOCUMENT
 Handle: 0000001
 Title: Warts, and the treatment of warts
 Author: Daniel Smith
 Publisher: Knobbly Books
 
A centroid is generated by taking all of the records of the same template type and then listing, for each attribute, the unique terms that occur in any instance of that attribute. For example, for the single record above, the following centroid would result:
  Template:     DOCUMENT
  Handle:       0000001
  Title:        Warts
                and
                the
                treatment
                of 
  Author:       Daniel
                Smith
  Publisher:    Knobbly 
                Books
 
Note that this is a simplification of the actual WHOIS++ protocol response -- see RFC1913 for a detailed description of the WHOIS++ protocol's centroid support. Also, notice that the word "warts" only appears once in the centroid for the Title attribute, even though it appears twice in the actual title. If the database held another record such as:
 Template: DOCUMENT
 Handle: 0000002
 Title: Warts - self treatment using kitchen appliances
 Author: Daniel Brown
 Publisher: Medi Books
 
then the resulting centroid generated from both of these records would be:
  Template:     DOCUMENT
  Handle:       0000001
                0000002
  Title:        Warts
                and
                the
                treatment
                of
                self
                using
                kitchen
                appliances
  Author:       Daniel
                Smith
                Brown
  Publisher:    Knobbly
                Books
                Medi
 
The discarding of duplicate values from each attribute in a collection of records of the same template type saves space when the centroid is generated over a large number of records, especially if the bulk of the attributes in each record contain natural language text. This is particularly effective when generating centroids from gateways that concentrate on one subject. For example, a medical/health subject gateway would contain many instances of the same medical/health terms, such as names of common diseases and illnesses. Additionally, the WHOIS++ protocol allows centroids to be compressed, and allows a polling server to retrieve a centroid covering only the records that have changed since the last poll.
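The centroid generation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the WHOIS++ wire format: the function name build_centroid, the use of dictionaries for records, and the case-insensitive comparison of terms are all assumptions made for the example.

```python
import re


def build_centroid(records):
    """Summarise records of one template type as unique terms per attribute."""
    centroid = {}
    for record in records:
        for attribute, value in record.items():
            terms = centroid.setdefault(attribute, [])
            seen = {t.lower() for t in terms}
            # Split the value into words and keep each term only once.
            for word in re.findall(r"\w+", value):
                if word.lower() not in seen:
                    terms.append(word)
                    seen.add(word.lower())
    return centroid


# The two DOCUMENT records used in the text.
records = [
    {"Handle": "0000001", "Title": "Warts, and the treatment of warts",
     "Author": "Daniel Smith", "Publisher": "Knobbly Books"},
    {"Handle": "0000002",
     "Title": "Warts - self treatment using kitchen appliances",
     "Author": "Daniel Brown", "Publisher": "Medi Books"},
]
centroid = build_centroid(records)
for attribute, terms in centroid.items():
    print(f"{attribute}: {' '.join(terms)}")
```

The two titles contain thirteen words between them, but the Title entry of the centroid holds only nine terms, which is the space saving the paragraph above refers to.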
In order to make use of centroids, a WHOIS++ server must gather one or more centroids from other WHOIS++ servers. A WHOIS++ server that can only generate a centroid but not gather them can only act as a leaf node in a WHOIS++ mesh. A server that can gather and make use of centroids is known as an index server. An index server gathers the centroids by polling another WHOIS++ server. The polling mechanism in WHOIS++ allows either the index server to pull the centroid from the other server, or the other server to push the centroid to the index server. The former allows the polling server to determine when it will receive centroids to update its index. The latter allows the polled server to determine when it is convenient for it to generate and send the centroid, which in some cases might be an expensive operation in terms of CPU power and/or network bandwidth. In these cases, the polled server may wish to generate centroids for all of its polling servers in off-peak hours to minimise the impact on its other operations. The second method also allows the polled server to send centroid updates as soon as its database has changed, which ensures that the WHOIS++ mesh is kept up to date. The method employed is decided by the respective server administrators, based on the needs and the limitations of their server implementation (some servers only implement one of the methods).
Once an index server has gathered its centroids from the servers it polls, it can start making use of them to help WHOIS++ clients route users' queries through the WHOIS++ mesh. When an index server receives a query, it first attempts to locate matching records in its own local database. It then looks in the centroids that it has gathered to determine whether any of the servers it has polled have the required terms in their centroids. If they have, the server returns both its own records and a set of referrals to the servers that have promising matches in their centroids. Note that even if all of the terms in a user's query appear in the centroid from a polled server, it does not mean that that server will generate any hits from the query. This is because the centroid is merely a summary of the data held in the remote server and the duplicate removal that gave us a smaller centroid than the original data represents a loss of information. The WHOIS++ client actually needs to make a search of the remote server in order to find out what records, if any, do match the user's query.
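The routing step, and the loss of information it involves, can be illustrated with a small Python sketch. The Server class, its method names and the sample titles are hypothetical, and the centroid is flattened to a single set of terms rather than being kept per attribute as in WHOIS++.

```python
import re


def words(text):
    """Lower-cased word list for simple term matching."""
    return re.findall(r"\w+", text.lower())


class Server:
    """A WHOIS++-like server holding records (here just title strings)."""

    def __init__(self, name, titles):
        self.name = name
        self.titles = titles
        self.gathered = {}  # polled server name -> set of centroid terms

    def centroid(self):
        # Flattened centroid: every unique term in any local record.
        return {w for title in self.titles for w in words(title)}

    def poll(self, other):
        # Pull the other server's centroid and keep it as forward knowledge.
        self.gathered[other.name] = other.centroid()

    def search(self, terms):
        # A record matches only if it contains every query term.
        terms = [q.lower() for q in terms]
        return [t for t in self.titles if all(q in words(t) for q in terms)]

    def query(self, terms):
        # Local hits plus referrals to servers whose centroid has all terms.
        terms = [q.lower() for q in terms]
        referrals = [name for name, c in self.gathered.items()
                     if all(q in c for q in terms)]
        return self.search(terms), referrals


leaf = Server("leaf", ["Treatment of warts", "Safe kitchen appliances"])
index = Server("index", ["Epilepsy in children"])
index.poll(leaf)

hits, referrals = index.query(["warts", "appliances"])
# The leaf's centroid contains both terms, so a referral is issued,
# but no single record on the leaf holds both terms.
false_positive = leaf.search(["warts", "appliances"])
```

Here the index server refers the client to the leaf server, yet searching the leaf returns nothing, because the two terms come from different records. This is the reason the client must still search the remote server itself.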
Once the WHOIS++ client has received the initial index server's response, it stores any matching records and then looks at the referrals. A WHOIS++ client can either allow the end user to choose which servers to ask next, or it can automatically follow the referrals by passing the query onto the polled servers [30]. It is important to note that these polled servers may also be index servers themselves and so they may return both records and referrals to still further servers. This chaining of referrals allows a WHOIS++ client to exhaustively search all WHOIS++ servers that are likely to have relevant information in a WHOIS++ mesh. As this is a true mesh (as opposed to a directed graph or tree for example), the WHOIS++ client must keep track of the servers that it has already queried in order to prevent it from asking the same server the same thing twice and getting stuck in a loop.
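A client that follows referrals automatically might be sketched as below. The mesh is modelled as a dictionary of hypothetical query functions, each returning a pair of matching records and referrals; the function names and the three-server mesh are assumptions for illustration only.

```python
from collections import deque


def follow_referrals(mesh, start, query):
    """Query every reachable server once, collecting all matching records."""
    visited = []              # servers already asked, in the order asked
    pending = deque([start])  # servers still to be asked
    results = []
    while pending:
        server = pending.popleft()
        if server in visited:
            continue          # never ask the same server the same thing twice
        visited.append(server)
        hits, referrals = mesh[server](query)
        results.extend(hits)
        pending.extend(r for r in referrals if r not in visited)
    return results, visited


def make_server(hits, referrals):
    """Build a stand-in server that always gives the same answer."""
    return lambda query: (hits, referrals)


# A small mesh containing cycles: A refers to B and C, and both refer back.
mesh = {
    "A": make_server(["record from A"], ["B", "C"]),
    "B": make_server(["record from B"], ["A", "C"]),
    "C": make_server([], ["A"]),
}
results, visited = follow_referrals(mesh, "A", "warts")
```

Without the visited list, the referrals from B and C back to A would send the client round the mesh indefinitely.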
CIP v3 enhances this simple centroid mechanism by abstracting itself from the confines of the WHOIS++ protocol. CIP v3 can carry centroid-like summaries of databases that are not constrained to be formed from attribute-value pair based records and that can be accessed via any number of different search and retrieval or database access protocols. The index data that CIP v3 passes between servers makes use of MIME to allow binary data to be used. There are likely to be a number of MIME types devised that are dedicated to passing different sorts of index data around between servers that have different sorts of back-end databases and access protocols. Some servers may be able to accept more than one MIME type [31] [32] [33] [34] [35], and multi-protocol clients (such as web browsers or special purpose multi-protocol gateways) will be able to use the different access protocols to seamlessly access radically different databases on behalf of the end user.
CIP v3 allows multiple protocols and database formats to be handled by separating the metadata required to process the CIP index information (the header) from the index data itself (the payload). Along with the MIME content type for the payload, the header contains a unique identifier for the dataset from which the index has been generated (known as the Dataset Identifier or DSI) and a Uniform Resource Identifier (URI) that is used as the basis for referrals generated from the dataset.
CIP v3 takes the collapsing of the payload body further than the original centroids of WHOIS++. It allows index servers to pass copies of indexes retrieved from other CIP v3 index servers. These can either be passed straight through (so that the destination index server gets exactly the same centroid as the intermediate index server got from the original polled server) or they can be aggregated. An aggregated index is where two or more indexes are merged together by the intermediate index server before being sent to another index server. This potentially reduces the amount of data in the index, making the index transfer consume less time and bandwidth (which may be an important factor for indexes gathered from large datasets). However, it does mean that the referral chain through the index server mesh is lengthened. There is also the danger in multi-protocol meshes that the client may be delivered an aggregated index that forces the referral chain to pass through a section of the index server mesh that uses a protocol that the client does not speak. The index server manager must therefore carefully weigh the benefit of reduced index sizes against the costs of longer referral chains and multi-protocol problems.
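The size reduction gained by aggregation can be seen in a short Python sketch. It models each index as a mapping from attribute names to sets of terms, which is an assumption made for the example and not the CIP v3 payload format; the function names and sample terms are likewise hypothetical.

```python
def aggregate(indexes):
    """Merge several per-attribute term indexes into one aggregated index."""
    merged = {}
    for index in indexes:
        for attribute, terms in index.items():
            merged.setdefault(attribute, set()).update(terms)
    return merged


def size(index):
    """Number of terms carried by an index."""
    return sum(len(terms) for terms in index.values())


# Indexes gathered by an intermediate index server from two polled servers.
index_a = {"Title": {"warts", "treatment", "epilepsy"}}
index_b = {"Title": {"epilepsy", "treatment", "seizures"}}

aggregated = aggregate([index_a, index_b])
passed_through = size(index_a) + size(index_b)  # terms sent without merging
merged_size = size(aggregated)                  # terms sent after merging
```

The merged index carries four terms instead of six, but it no longer records which polled server contributed which term, so any referral based on it has to go via the intermediate server. That is the lengthened referral chain mentioned above.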
A cross-searching demonstrator, allowing people to cross-search a number of subject-based gateways, is available for exploration [36].
Issues and developments
The implementation of cross-searching is complicated by various issues and by the development of other, associated resource discovery technologies. Gateways wishing to offer a unified service should consider the following:
Duplicate results
If a system were set up that allowed several gateways in the same subject area to be cross-searched, then inevitably there would be some duplication of results, as some resources would be catalogued by more than one gateway. Even gateways covering seemingly different subject areas, such as SOSIG (social sciences) and OMNI (medicine/health), often have a few "common" catalogued resources. There are several approaches to resolving duplicates; the most promising are listed below.

The simplest method is to leave them in - though this would mean unnecessarily long lists of results, and removes one of the advantages of subject gateways over the larger "whole web" indexes, i.e. the lack of duplication of information in the results of a search.

Amalgamate the results - if a resource is catalogued by two or more gateways, then amalgamate the result and present it as one hit i.e. present one title, one URL, but all of the individual gateway descriptions.

Only include the resource description that is the longest out of the two or more that describe the resource. This may mean that services that provide more concise resource descriptions never see their entries, when duplicated by other gateways, appear in the results of a cross-search. In addition, some users may prefer more concise resource descriptions.

Let the user decide, before searching, which gateway should provide the descriptions for any duplicate resources.
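Two of the approaches listed above, amalgamating the results and keeping only the longest description, can be sketched as follows. The sketch assumes that duplicates are recognised by an exact match on URL, which is a simplification; the gateway names, URLs and descriptions are invented for the example.

```python
def amalgamate(results):
    """Present each resource once, keeping every gateway's description."""
    merged = {}
    for hit in results:
        entry = merged.setdefault(
            hit["url"],
            {"url": hit["url"], "title": hit["title"], "descriptions": {}},
        )
        entry["descriptions"][hit["gateway"]] = hit["description"]
    return list(merged.values())


def keep_longest(results):
    """Present each resource once, keeping only its longest description."""
    best = {}
    for hit in results:
        current = best.get(hit["url"])
        if current is None or len(hit["description"]) > len(current["description"]):
            best[hit["url"]] = hit
    return list(best.values())


results = [
    {"gateway": "gateway-a", "url": "http://example.org/vaccines",
     "title": "Vaccination guide", "description": "Short note."},
    {"gateway": "gateway-b", "url": "http://example.org/vaccines",
     "title": "Vaccination guide",
     "description": "A much longer description of the guide."},
    {"gateway": "gateway-a", "url": "http://example.org/other",
     "title": "Other resource", "description": "Another resource."},
]
combined = amalgamate(results)
longest = keep_longest(results)
```

Both functions reduce the three hits to two. The second shows the drawback noted above: the entry from gateway-a for the shared resource never appears, because gateway-b wrote more.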

Differing collection development policies
We have previously looked at the progress made by various gateways in defining resource cataloguing guidelines. However, the problem of there being no global standard or unified way of either:

deciding whether a resource is of sufficient relevance and quality to be catalogued by a subject gateway (resource selection criteria), or

describing how resource descriptions should be written (cataloguing rules),
means that the combined results of a cross-search will contain links to resources of a differing minimum "quality", as well as resource descriptions that are inconsistent in how they are written.
However, effort has been undertaken to identify different criteria for resource selection, as well as the role of classification schemes in Internet resource description and discovery. This has been undertaken by the DESIRE project, and comprehensive reports [37] are available for future gateways and clumps of gateways to use when deciding on their approach to defining resource selection and classification criteria. In addition, the eLib subject services using ROADS are addressing the issue of adherence to an agreed cataloguing policy. Draft guidelines [38] have been produced to assist in a shared approach to HOW data elements are described and WHICH particular data elements are used. The development of generic cataloguing standards will also be important as an aid to semantic interoperability.
Hybrid service cross-searching
Web-based gateways where people can search or browse a catalogue of resource descriptions are not the only tools that are of use to people seeking information in a particular subject area. Services and resources such as library OPACs are also of use. Ideally, searches should be unified across a range of resources, with duplicates "weeded out" of a homogenised set of search results, as people are likely to be more attracted to searching across a range of media in one go, through one unified interface, than through a range of different interfaces. The need for development of these types of "hybrid" system has been identified by the eLib programme [39]. In previous experiments, a centroid was generated from a university library OPAC, enabling cross-searching between the OPAC and various subject gateways.
Cross-browsing and RDF
While cross-searching has been described and demonstrated through this paper and associated work, the problem of cross-browsing a selection of subject gateways has not been addressed. Many gateway users prefer to browse, rather than search. Though browsing usually takes longer than searching, it can be more thorough, as it is not dependent on the user's terms matching keywords in resource descriptions (even when a thesaurus is used, it is possible for resources to be "missed" if they are not described in great detail).
As a "quick fix", a group of gateways may create a higher level menu that points to the various browsable menus amongst the gateways. However, this would not be a truly hierarchical menu system, as some gateways maintain browsable resource menus in the same atomic (or lowest level) subject area. One method of enabling cross-browsing is by the use of RDF.
The World Wide Web Consortium has recently published a preliminary draft specification for the Resource Description Framework (RDF) [40]. RDF is intended to provide a common framework for the exchange of machine-understandable information on the Web. The specification provides an abstract model for representing arbitrarily complex statements about networked resources, as well as a concrete XML-based syntax for representing these statements in textual form. RDF relies heavily on the notion of standard vocabularies, and work is in progress on a 'schema' mechanism that will allow user communities to express their own vocabularies and classification schemes within the RDF model.
RDF's main contribution may be in the area of cross-browsing rather than cross-searching, which is the focus of the CIP. RDF promises to deliver a much-needed standard mechanism that will support cross-service browsing of highly-organised resources. There are many networked services available which have classified their resources using formal systems like MeSH or UDC. If these services were to each make an RDF description of their collection available, it would be possible to build hierarchical 'views' of the distributed services offering a user interface organised by subject-classification rather than by physical location of the resource.
Multilingual issues

With the growth of the Web outside the English speaking regions of the world, the need has arisen to provide better handling of multilingual issues within metadata. Subject gateways are beginning to want to generate metadata for resources in local languages and/or in multiple languages, to make the gateways as easy as possible for their users to use. This means a number of things. Firstly, the cataloguers have to be able to enter information in the appropriate character sets and end users have to be able to enter local characters into the web based forms. Many widely deployed browsers still only implement US-ASCII entry in web forms, but newer browsers will allow the use of Unicode characters.
A similar character set issue needs to be addressed for the indexing information, and, whilst WHOIS++ was limited to ISO-8859-1 characters, the newer CIP v3 draft proposals mandate the use of Unicode. However, allowing users and cataloguers to make use of multiple languages and character sets runs the danger of increasing "false positives", where the same word means different things in different languages. To overcome this, the end user would need to specify the language(s) being used when making a search, and the CIP index data would need to have some form of tagging specifying which language an index term was in.
Also, for practical reasons, if a service wishes to take part in a multinational indexing mesh it would be well advised to include an English version of at least the major sections of metadata (description, keywords and title) as English is the de facto international language of business and science and so would make the resource available to the widest population on the Net.
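The language tagging of index terms suggested above can be sketched in a few lines of Python. This is a toy in-memory index, not the CIP index format: the class name, the two-letter language tags and the example URLs are assumptions made for illustration. The example uses the word "gift", which means a present in English but poison in German.

```python
from collections import defaultdict


class TaggedIndex:
    """Toy inverted index whose keys are (language, term) pairs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, resource, language, terms):
        """Index each term of a resource under the given language tag."""
        for term in terms:
            self._postings[(language, term.casefold())].add(resource)

    def search(self, term, languages=None):
        """Return matching resources, optionally restricted to some languages."""
        term = term.casefold()
        hits = set()
        for (language, indexed_term), resources in self._postings.items():
            if indexed_term == term and (languages is None or language in languages):
                hits |= resources
        return hits


index = TaggedIndex()
# Hypothetical resources: same index term, different meaning per language.
index.add("http://en.example/presents", "en", ["gift", "present"])
index.add("http://de.example/toxikologie", "de", ["Gift", "Toxikologie"])

# Without a language restriction, both resources match "gift".
print(sorted(index.search("gift")))
# Restricting the search to German removes the English false positive.
print(sorted(index.search("gift", languages={"de"})))
```

Keying the index on the language as well as the term keeps the false positive out of the result set only when the user states a language; an unrestricted search still returns both resources.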
Acknowledgements
The authors acknowledge the assistance of Lisa Gray in identifying health/medical gateways on the Web, and Debra Hiom, Paul Hofman, Rachel Heery, Andy Powell, Emma Worsfold and Lorcan Dempsey for helpful comments.
References
[1] ROADS to DESIRE: Some UK and Other European Metadata and Resource Discovery Projects, Lorcan Dempsey, UKOLN (UK Office for Library and Information Networking), http://www.dlib.org/dlib/july96/07dempsey.html
[2] AstroWeb Astronomy Subject Gateway, http://www.cv.nrao.edu/fits/www/astronomy.html
[3] ADAM; the Art, Design, Architecture and Media information gateway, http://www.adam.ac.uk/advanced/
[4] SOSIG; the SOcial Science Information Gateway, http://www.sosig.ac.uk/
[5] SOSIG advanced searching facilities; search for resources according to whether they are based in Europe, the UK or anywhere in the world, http://www.sosig.ac.uk/roads/cgi/searchex.pl
[6] EEVL, the Edinburgh Engineering Virtual Library. EEVL allows you to search for resources of a particular type e.g. mailing list, http://www.eevl.ac.uk/search.html
[7] PAW, the Physics Around the World gateway, which allows people to search on resources suitable for people of specific educational levels, http://www.tp.umu.se/TIPTOP/paw/search.html
[8] Aqueous, a gateway dedicated to water-related resources. Though there is little quality control, as people can submit resource descriptions which are automatically indexed by the gateway, the gateway contains descriptions of a large number of resources of varying degrees of association to water, http://www.aqueous.com/
[9] Biz/ed, a dedicated business and economics information gateway for students, teachers and lecturers, http://www.bized.ac.uk/
[10] Eldis, the Electronic Development and environment Information Service, consists of a central gateway containing details of network-based documents on environmental and development issues. Eldis offers several other services, with an emphasis on bibliographic sources in this subject field, http://www.ids.ac.uk/eldis/eldis.html
[11] History, a gateway of network-based historical resources, with additional information on history academics and their research interests, and recent publications in the field, http://ihr.sas.ac.uk/ihr/info1.html
[12] OMNI, Organising Medical Networked Information, is a gateway allowing access to a catalogue of large numbers of medical and health resource descriptions, http://omni.ac.uk/
[13] The OMNI database search interface, which allows you to search across several databases of health/medical related resources compiled by external organisations, http://omni.ac.uk/general-info/sw-search.html
[14] The issue of how much of the Web large search engines such as AltaVista actually index is an interesting one. Here, the Webmaster of a site queries what proportion of large Web sites AltaVista indexes, http://www5.zdnet.com/anchordesk/talkback/talkback_11638.html. The chief technical officer of AltaVista responds to the points in the query, http://www5.zdnet.com/anchordesk/talkback/talkback_13066.html.
[15] The ROADS initiative is funded by the JISC (Joint Information Systems Committee) http://www.jisc.ac.uk/ through the eLib programme http://www.ukoln.ac.uk/services/elib/. The ROADS (Resource Organisation And Discovery in Subject-based services) Web information service is shared amongst the three project partners across the interconnected ROADS Web sites. The ROADS software can be picked up from the Loughborough http://www.roads.lut.ac.uk/ partner site, which also contains pointers to related mailing lists. The UKOLN ROADS site http://www.ukoln.ac.uk/roads/ contains the ROADS template registry and various ROADS software tools, such as data format conversion scripts. The ILRT ROADS site http://www.ilrt.bris.ac.uk/roads/ contains various background information about the project, background information on cross-searching, and various guides to the ROADS software and project.
[16] The ROADS template registry. The different types of template which can be used to store resource information are defined here, http://www.ukoln.ac.uk/roads/templates/
[17] The ROADS users list. This list contains descriptions of, and pointers to, all of the ROADS-based gateways that are publicly accessible, http://www.ilrt.bris.ac.uk/roads/who/
[18] EELS, the Engineering Electronic Library, Sweden. This covers engineering resources in such sub-areas as physics, mathematics, energy technology, nuclear technology, light and optical technology, http://www.ub2.lu.se/eel/eelhome.html
[19] WWEVL, the Wastewater Engineering Virtual Library. This allows you to browse or search across several hundred resources. The results of a search indicate the sub-categories which each catalogued resource falls into, http://www.cleanh2o.com/cleanh2o/ww/welcome.html. WWEVL is one subject within the Virtual Library initiative, http://www.mth.uea.ac.uk/VL/Overview.html
[20] The Six Senses gateway contains medical sites reviews and gives a composite "review score" for each site, http://www.sixsenses.com/
[21] Medical Matrix, a large US-based medical/health gateway aimed mainly at physicians and healthcare workers, http://www.medmatrix.org/index.asp
[22] Healthweb, a medical/health gateway with a catalogue built by the collaborative effort of librarians from several academic medical centers, http://www.healthweb.org/
[23] HON, the Health On the Net Foundation medical gateway, http://www.hon.ch/
[24] MedWeb, a gateway maintained by the Emory University Health Science Center Library, http://www.cc.emory.edu/WHSCL/medweb.html
[25] The Z39.50 Maintenance Agency Official Text for Z39.50 (1995), http://lcweb.loc.gov/z3950/agency/1995doce.html
[26] The Architecture of the Common Indexing Protocol (CIP). An Internet draft, December 1997, by J. Allen and M. Mealling, ftp://ietf.org/internet-drafts/draft-ietf-find-cip-arch-01.txt
[27] Architecture of the WHOIS++ service, RFC 1835, August 1995, by P. Deutsch, R. Schoultz, P. Faltstrom and C. Weider, http://src.doc.ic.ac.uk/rfc/rfc1835.txt
[28] The Little Black Book - Mail Bonding with OSI Directory Services, Prentice Hall, 1992, by M.T. Rose.
[29] Architecture of the Whois++ Index Service, RFC 1913, February 1996 by C. Weider, J. Fullton and S. Spero, http://src.doc.ic.ac.uk/rfc/rfc1913.txt
[30] How to Interact with a Whois++ Mesh, RFC 1914, February 1996, by P. Faltstrom, R. Schoultz and C. Weider, http://src.doc.ic.ac.uk/rfc/rfc1914.txt
[31] Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies, RFC 2045, November 1996, by N. Freed and N. Borenstein, http://src.doc.ic.ac.uk/rfc/rfc2045.txt
[32] Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types, RFC 2046, November 1996, by N. Freed and N. Borenstein, http://src.doc.ic.ac.uk/rfc/rfc2046.txt
[33] MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text, RFC 2047, November 1996, by K. Moore, http://src.doc.ic.ac.uk/rfc/rfc2047.txt
[34] Multipurpose Internet Mail Extension (MIME) Part Four: Registration Procedures, RFC 2048, November 1996, by N. Freed, J. Klensin and J. Postel, http://src.doc.ic.ac.uk/rfc/rfc2048.txt
[35] Multipurpose Internet Mail Extensions (MIME) Part Five: Conformance Criteria and Examples, RFC 2049, November 1996, by N. Freed and N. Borenstein, http://src.doc.ic.ac.uk/rfc/rfc2049.txt
[36] A demonstrator model allowing people to cross-search subject gateways is available through http://www.ilrt.bris.ac.uk/roads/cross/
[37] Three reports resulting from work undertaken in the DESIRE (Development of a European Service for Information on Research and Education) project are available on the UKOLN metadata site, http://www.ukoln.ac.uk/metadata/DESIRE/. These reports include "Selection criteria for quality controlled information gateways" and "The role of classification schemes in Internet resource description and discovery".
[38] ROADS Cataloguing Guidelines. Draft (v. 0.1) by Michael Day, UKOLN, http://www.ukoln.ac.uk/roads/cataloguing/cataloguing-rules.html
[39] A press release from the eLib Phase 3 Programme: Hybrid Libraries and Large Scale Resource Discovery (CLUMPS) and Digital Preservation, http://www.ukoln.ac.uk/services/elib/background/pressreleases/summary2.html
[40] The World Wide Web Consortium RDF (Resource Description Framework) Web site, http://www.w3.org/Metadata/RDF/

Copyright © 1998 John Kirriemuir, Dan Brickley, Susan Welsh, Jon Knight, Martin Hamilton

hdl:cnri.dlib/january98-kirriemuir
