Defining Collections in Distributed Digital Libraries

D-Lib Magazine
November 1998

ISSN 1082-9873

Carl Lagoze
Cornell University

David Fielding
Cornell University

1. Order and Chaos in Global Information Space

The World Wide Web provides unprecedented access to globally distributed content. The success of the Web, both in terms of number of resources and use of those resources, is largely due to three defining characteristics of the Web. Because of its universality, anyone can participate in the Web, as author, publisher, or consumer, with a minimal investment of hardware and expertise. Because of its uniformity, resources, services, and users participate on the Web as equals in a common information space. Finally, because of decentralization, the Web is fundamentally anarchistic beyond basic agreement at the technical level on protocols and transport mechanisms.

These principles that are fundamental to the Web's success are also the root of the problems that frequently confound its users. As many have found, universality often translates to "quantity without quality", where content from Nobel Prize winners co-exists with content from the winners of a local first-grade writing contest. Uniformity frequently means that specialized and domain-specific tools, technologies, and guidance essential for using many classes of information (e.g., geo-spatial, statistical, scientific) are difficult or impossible to find. Decentralization frequently means that it is difficult to impose the organizational structures necessary to ensure information integrity -- i.e., reliability and accessibility, security and privacy for content and users, and survivability (preservation) of information.

This apparent paradox in the utility of the Web, in fact, reflects the highly variable manner in which people seek out and use information in their daily lives. At times their motivations may resemble those of shoppers in a busy commercial district hoping to stumble upon the perfect gift. Such serendipitous browsing is often served by the lack of organization in Web space. In contrast, there are other times when people wish to undertake more focused, discipline-specific information tasks or when they wish to purposely screen out certain genres of information (e.g., to protect their children from inappropriate content). In these situations greater levels of organization, selection, and specialization than are currently available on the Web are more appropriate.

The challenge in designing digital library architectures and systems is to accommodate these different models of information behavior. Selection, organization, and specialization should be permitted without being imposed. In addition, mechanisms for introducing selection, organization, and specialization should be flexible, extensible, and independent of other characteristics of the digital library, such as how content and services are physically distributed or how and by whom the components of the digital library are managed.

In this paper, we describe a design for a digital library collection service. The collection service is an independent mechanism for introducing structure into a distributed information space. Because it is independent of other services and mechanisms in the digital library, the collection service neither constrains other organizational models nor imposes structure where it is neither needed nor desired.

The motivation for the collection service design lies in traditions well established in the library community, where collection development serves three important roles:

Selection - defining a set of resources that are members of the collection. These may be all the resources in the library (the collection of the Cornell University Library) or a subset of the total resources (the South East Asia Collection of the Cornell University Library). In the traditional library setting, selection usually implies physical containment or demarcation (e.g., a special set of shelves or room in the library where the members of the collection reside).

Specialization - designating a set of resource discovery aids or cataloging techniques, which are tailored to the characteristics of the collection or the audience to which the collection is targeted.

Administration - establishing a set of management and preservation policies that conform to the collection characteristics.

The collection service architecture adapts these traditional collection-related roles to the distributed and dynamic nature of digital libraries. First, it defines collection membership through criteria rather than containment -- resources become members of the collection because they conform to a set of formal criteria (for example, subject classification, language, or genre). Such criteria allow automatic and/or dynamic selection of resources from a set of distributed information sources, based on either metadata about those resources or the content within the resources themselves. Second, by providing query routing and query pre-processing and post-processing facilities, the collection service facilitates resource discovery that is tailored to the characteristics of the collection (rather than to the features of a specific search engine). Finally, the collection service acts as a distributed metadata repository, storing, disseminating, and processing data relevant to the management and administration of objects in the collection.

The collection service is one of several services in the component-based digital library architecture that we are developing and experimenting with as part of our research. Other services include a repository service for storing digital content, a naming service for registering and resolving unique names for objects, and an index service that processes queries for the discovery of content.

Defining the collection service as a service distinct from other services, notably index and repository services, is significant for a number of reasons.

Physical location of resources, which is relatively static, is unrelated to their membership in a collection -- the resources that are members of a collection may be distributed across multiple repositories.

A single object may exist in multiple collections, which are defined by multiple collection services under separate administration.

Specification of what resources are in collections and how collections are administered is distinct from definitions of query capabilities, which are defined by index services.

From both the technical and legal perspective, collection definition is relatively lightweight. Repository and index services require machinery for storing digital objects and processing queries on them. The operation of this machinery depends on rights to access disseminations of that content. In contrast, the logically distinct collection service defined in this paper is fundamentally a simple query routing mechanism that requires no access to the content of individual digital objects.

The life-cycle of collections is independent of that of the objects within them. While some collections may be relatively stable, resembling the research libraries that are today the backbone of scholarly research, others could be created in response to short-term, yet important needs -- for example, medical or natural emergencies.

The remainder of this paper is structured as follows. Section 2, which follows, summarizes the component-based digital library architecture that is the context for the design described in this paper. Section 3 describes a collection abstraction that is appropriate for the new networked information environment. Section 4 describes the status of our implementation of the collection abstraction. We then give some concluding remarks in Section 5.

2. Establishing the Context: Component-based Distributed Digital Libraries

Over the past four years the Cornell Digital Library Research Group (CDLRG) has been researching the technology and deployment of distributed digital libraries. Our work on digital library architecture is based on the following principles:

Open Architecture - Following well-known software engineering principles, the functionality of a digital library system is available in the form of distinct function units (services), each of which has operational semantics exposed through an open protocol.

Federation - Digital Library Systems are managed aggregations of these functional units (or services) and the resources to which they provide access. New functionality can be added to these systems through the implementation of value-added services, which interact with existing services using established protocols.

Distribution - The components (and content) of a digital library may be spread over the global Internet, but are presented to the user as a single uniform system.

The initial result of this work is Dienst [LSDK95], the technical foundation for the Networked Computer Science Technical Reference Library [DL99] (NCSTRL - pronounced "ancestral"), a digital library of computer science research reports. More recently, we have been designing the Cornell Reference Architecture for Distributed Digital Libraries [LP98] (CRADDL - pronounced "cradle"), a set of components that form the core of a digital library infrastructure. CRADDL is being implemented in the CORBA distributed object framework; CRADDL services are deployed as CORBA objects, and service requests are expressed as method requests to those objects.

CRADDL defines a core set of digital library services, which interact as shown in Figure 1. By core, we mean the set of services that are necessary to provide basic digital library functionality: object naming and storage, object discovery, and user access. Because the architecture is open (its functionality is exposed through service-based protocols), other services can be added to enhance this core functionality.

Figure 1 - Interaction of core digital library services

A brief summary of these services and their interactions is as follows.

Content in the architecture is stored in the form of digital objects, which aggregate one or more byte streams, associate content-specific behaviors with these aggregations, and secure access to these behaviors through rights management mechanisms.

The repository service provides the mechanisms for the deposit and storage of, and access to, digital objects. A digital object is considered contained within a repository if the URN of that object resolves to the respective repository (and, thus, access to the object is only available via a service request to that repository). The CDLRG is currently building FEDORA [PL98], a digital object and repository architecture that builds on earlier work in digital object infrastructure [KW95] and metadata architecture [DLP98].

Digital Objects are identified by globally-unique names -- URNs -- that are registered with the naming service. The naming service is able to resolve a URN to one or more physical locations. The CDLRG is working with the Handle System developed by the Corporation for National Research Initiatives.

The index service provides the mechanism for discovery of digital objects via query. Individual index servers index actual or surrogate information on sets of digital objects (that may be distributed across multiple repository servers). Queries to these index servers return results sets that contain the URNs of digital objects that match the query (and possibly other meta-information that facilitates the display and handling of the results set). The CDLRG is developing index servers based on the STARTS protocol [GC97], jointly developed with the Stanford University Digital Library Project.

The collection service, the subject of this paper, provides the mechanisms for the aggregation of access to sets of digital objects and services into meaningful (from some community's perspective) collections.

User interface services or gateways provide human-centered entry points to the functionality of the digital library. The design of a user interface gateway can be highly customized for a specific community using mechanisms such as language, help facilities, and graphical aids. Each user interface gateway uses the information provided by one or more collection services to permit searching for and access to objects within those collections.

Of these five services, only the user interface is accessed directly by a human. The others are used by programs, in particular other CRADDL services, but also by other digital library or publishing systems. This modular design allows easy integration of higher-level digital library services (summarization services, payment services, and the like) with existing CRADDL services, or evolution of existing services as the architecture matures.

The modular design also creates a hierarchy of selection mechanisms in the digital library architecture, which facilitates and encourages the creation of customized digital libraries:

Creators of digital resources select the content they are interested in making available.

Repository managers may adopt policies that implicitly select the digital objects that can be deposited into the repository. These policies may be motivated by legal considerations (no pornographic or libelous content), quality judgements (only objects created by certain parties), or any other criteria.

Administrators of index servers select the digital objects that are indexed in that server. For example, one index server may index all the digital objects in a selected set of repositories, while another may index specific digital objects for which they have signed licensing agreements.

Collection services apply broader (not digital object specific) selection mechanisms against the query interfaces of one or more index services.

User interface gateways select one or more collections that users can search over and access objects within.

3. Defining a Collection in a Distributed Digital Library

Earlier in this paper, we defined three roles that collection development serves in the traditional library: selection, specialization, and administration. From the standpoint of user visibility, selection dominates these roles; the quality and usefulness of a library is generally determined by the resources available from it. Without a doubt, the models for collection selection and containment, as used in the traditional library sense, are challenged in the digital library.

First, and most obvious, the network makes it irrelevant whether the physical bits that make up a digital resource are located on a disk drive in the library or across the world. In fact, the notion of physical location for an individual resource in a digital library is ill-defined. A single resource, as perceived by a user, may actually be an aggregation of physical bit streams (or programmatically produced bit streams) from widely distributed sources. For example, consider a multimedia encyclopedia, which is a composition of text, images, audio, moving images, and live data feeds that reside on, or are produced from, distributed servers.

A more subtle and, from the policy point of view, more troublesome issue arises from the difficulty of placing a distinct boundary around the resources contained in a digital collection. Consider the problem of linkages among resources. For example, if Object A is contained in a collection, are objects B, C, and D that are linked to from Object A also contained in the collection? If so, are all objects transitively linked to Object A via other objects also contained? This issue has been explored by others as a problem of defining the "control zone" that libraries establish [RA96]. Solutions to this problem have important implications in areas such as legal liability and public service responsibilities.
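
The boundary question raised above can be made concrete with a small sketch. The following Python fragment is illustrative only: the link graph, the object names beyond A through D, and the `reachable` helper are hypothetical. It contrasts three possible boundaries around Object A: the object alone, the object plus its direct links, and the full transitive closure.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Hypothetical link graph: each object maps to the objects it links to.
LINKS: Dict[str, List[str]] = {
    "A": ["B", "C", "D"],
    "B": ["E"],
    "C": [],
    "D": ["F"],
    "E": ["A"],   # links may form cycles
    "F": ["G"],
    "G": [],
}

def reachable(start: str, links: Dict[str, List[str]],
              max_depth: Optional[int] = None) -> Set[str]:
    """Breadth-first traversal from `start`.

    max_depth=0 returns only the start object, max_depth=1 adds directly
    linked objects, and max_depth=None returns the transitive closure.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links beyond the chosen boundary
        for target in links.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

print(sorted(reachable("A", LINKS, 0)))     # Object A only
print(sorted(reachable("A", LINKS, 1)))     # A plus B, C, D
print(sorted(reachable("A", LINKS, None)))  # everything transitively linked
```

Even in this tiny graph the transitive boundary takes in every object, which suggests why an unbounded notion of containment is hard to reconcile with policy responsibilities.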

Finally, traditional library policies for selection and acquisition are challenged by the different model of "publication" on the Internet. Traditional publishing, that which involves physical media (e.g., books, maps), is characterized by a relatively small number of publishing authorities (due to a high cost of entry) and a relatively low frequency of publication. Standard library practices rely on these characteristics to make the acquisitions and collection administration process manageable. For example, library acquisitions departments may "trust" the quality of certain publishers or series from certain publishers. To shortcut the overhead of item-by-item selection, libraries may in selected cases adopt blanket acquisitions policies for those series.

These selection and acquisition techniques are not appropriate for networked publishing for a number of reasons. On the net, cost of entry is relatively low and, in effect, anyone can become a publisher. There is no way to assume the legitimacy of these publishing authorities. (In fact, recent experiences in Internet news publication have shown that the time pressures of Web publication often challenge the quality standards of supposedly "legitimate" publishers.) Because of the negligible cost of publishing, the frequency at which new resources appear is orders of magnitude greater than in traditional publishing. Finally, many of these resources are ephemeral, disappearing due to whim or poor administration by the publishing authority.

With these characteristics of networked resources in mind, we suggest the following definitions, both logical and operational, for a digital library collection and containment within that collection.

A collection is logically defined as a set of criteria for selecting resources from the broader information space. The nature of these criteria may vary in complexity. A static and degenerate example is simply a list of resource identifiers -- for example, URNs or ISBNs. Another rather restrictive example is the set of resources that are stored in a specific repository. More interesting criteria are those that allow dynamic growth of the collection from resources that appear in multiple repositories. One simple example of this is the set of resources with a metadata field value that matches a certain controlled vocabulary value – for example, a Dublin Core subject element with the value "computer science". Even more interesting are criteria that employ advanced Natural Language techniques such as vocabulary analysis, for example determining the age-appropriateness of materials, or genre analysis, for example creating collections of newspaper articles or scholarly reports [PMRC98].
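
One way to picture the range of criteria described above is as predicates over resource metadata. The sketch below is a minimal illustration under assumptions: the `Resource` record, its field names, the sample URNs, and the helper functions are all hypothetical and are not taken from any implementation described in this paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass(frozen=True)
class Resource:
    """Hypothetical metadata record for one digital object."""
    urn: str
    repository: str
    subject: str  # e.g., the value of a Dublin Core subject element

# A collection criterion is modeled as a predicate over a resource.
Criterion = Callable[[Resource], bool]

def identifier_list(urns: Iterable[str]) -> Criterion:
    """Static, degenerate criterion: an explicit list of identifiers."""
    allowed = frozenset(urns)
    return lambda r: r.urn in allowed

def in_repository(name: str) -> Criterion:
    """Restrictive criterion: everything stored in one repository."""
    return lambda r: r.repository == name

def subject_equals(value: str) -> Criterion:
    """Dynamic criterion: a metadata field matches a controlled-vocabulary value."""
    return lambda r: r.subject == value

def members(criterion: Criterion, space: Iterable[Resource]) -> List[str]:
    """URNs of the resources in the information space that satisfy the criterion."""
    return [r.urn for r in space if criterion(r)]

# A toy information space spread over two repositories.
space = [
    Resource("urn:x:1", "repo-a", "computer science"),
    Resource("urn:x:2", "repo-a", "economics"),
    Resource("urn:x:3", "repo-b", "computer science"),
]

print(members(subject_equals("computer science"), space))
print(members(in_repository("repo-a"), space))
print(members(identifier_list(["urn:x:2"]), space))
```

Only the subject-based criterion spans repositories and would pick up new matching resources without any change to the collection definition.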

From an operational perspective, membership in a digital library collection is defined in terms of resource discovery: a collection provides tools for resource discovery, and the resources in the collection are those that can be directly found using those resource discovery tools. (Note that this definition omits the resources transitively accessed through linkages from these discovered objects.) In the context of the service-based architecture described in Section 2, collection-specific resource discovery tools have the following characteristics.

They direct queries only to those index servers that can return objects in the collection.

They employ filtering techniques, either within the queries or through post-processing the results of the queries, to select only those objects in the respective index servers that fit the collection criteria.

They employ resource discovery aids that are specialized for the collection. Examples of such aids are domain-specific stop-word lists, stemming algorithms, or thesauri.

An example, shown in Figure 2, illustrates both this logical and operational definition. At the bottom of the figure are five repositories that provide access to a number of digital resources. The red-shaded circles in the repositories are resources about computer science and the green-shaded circles are resources about economics. (For the sake of this simple example, we can say that the aboutness of a resource is determined by the value of a controlled-vocabulary metadata field -- e.g., Dublin Core subject -- associated with the resource.) As illustrated, objects fitting these subject classifications are only located within some repositories and are mixed in with objects that do not fit the collection criteria.

Illustrated above the repositories is a set of index servers that download information (via some protocol) from these repositories. Each index server is administratively configured to access only certain repositories (based on quality judgements, licensing agreements, or other reasons). Discovery of resources about computer science involves querying only index servers "1" and "3", incorporating filters in the query of the form "subject equals computer science". Similarly, discovery of resources about economics involves querying only index servers "2" and "3" with an appropriate query filter.

Selective query routing and filtering offer the important advantage of facilitating focused resource discovery. The need for focused resource discovery is one of the primary motivators of efforts to establish standards for Web metadata, where the goal is creating mechanisms that enable queries to focus on certain semantic characteristics of resources (e.g., author, title, date of publication). Our concern here is permitting resource discovery to focus on a specific category of resources. For example, assume that an author "Joseph Halpern" publishes documents in both computer science and economics. The mechanism shown above makes it possible for user interfaces (and users) to designate that a search for documents that match the query "author equals Joseph Halpern" should only return resources from a specific collection, computer science. The combination of semantic focus and collection focus makes it possible for networked resource discovery to move beyond the "high recall without precision" problem that characterizes current web search services.
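
The routing and filtering just described can be sketched in a few lines of Python. This is a toy model, not the mechanism of any particular system: the index server contents, the URNs, and the `search` function are hypothetical, while the server numbering and the routing choices follow the example in the text.

```python
from typing import Dict, List, NamedTuple

class Record(NamedTuple):
    urn: str
    author: str
    subject: str

# Hypothetical contents of the three index servers in the example.
INDEX_SERVERS: Dict[str, List[Record]] = {
    "1": [Record("urn:x:a", "Joseph Halpern", "computer science"),
          Record("urn:x:b", "A. Author", "history")],
    "2": [Record("urn:x:c", "Joseph Halpern", "economics")],
    "3": [Record("urn:x:d", "Joseph Halpern", "computer science"),
          Record("urn:x:e", "Joseph Halpern", "economics")],
}

# Each collection names the index servers to query and the subject filter to add.
COLLECTIONS = {
    "computer science": {"servers": ["1", "3"], "subject": "computer science"},
    "economics": {"servers": ["2", "3"], "subject": "economics"},
}

def search(collection: str, author: str) -> List[str]:
    """Route an author query to the collection's index servers only,
    adding the collection's subject filter to the query."""
    spec = COLLECTIONS[collection]
    hits: List[str] = []
    for server in spec["servers"]:          # selective routing
        for rec in INDEX_SERVERS[server]:
            if rec.author == author and rec.subject == spec["subject"]:  # filter
                hits.append(rec.urn)
    return hits

print(search("computer science", "Joseph Halpern"))
print(search("economics", "Joseph Halpern"))
```

The same author query returns disjoint result sets depending on the collection chosen, and index server "2" is never contacted for a computer science search.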

This definition of collection, and the resources that are members of it, has a number of advantages:

Location and Administrative Independence -- There is no linkage between the membership of a resource in a collection and its location in a repository nor its collocation with other member objects. A corollary of this is that collections can be created, and subsequently shut down, on demand: resources do not need to be moved to physical locations; in fact no changes need to be made to the objects themselves.

Dynamic Membership -- Since collection membership is criteria based, rather than resource based, new resources (that fit the criteria) automatically become part of the collection (by virtue of the fact that they are returned by queries against the collection). The same is true for the deletion of resources and their corresponding deletion from the collection.

Extensibility -- The concept of collection criteria for determining collection membership is inherently quite powerful. In addition to applying simple static criteria based on metadata characteristics (e.g., the controlled vocabulary in the example above), there is the opportunity to employ more dynamic and contextual criteria as they are developed -- for example, criteria based on analysis of link topology [GKR98].

4. Collection Service Implementations

In the previous section we described a collection both conceptually, as a criterion for resource membership, and operationally, as tools for resource discovery. In this section, we describe implementations of this definition. First, we describe the implementation of the collection service in Dienst and its deployment in NCSTRL. Second, we describe work-in-progress to implement this as a CORBA service in a more component-based system.

4.1 The Dienst Collection Service

The NCSTRL collection currently provides access to over 24,000 computer science research reports from over 120 institutions. Discovery of documents and access to those documents involve the interoperation of over forty servers communicating via the Dienst protocol and proxy servers operating through FTP and HTTP.

The NCSTRL collection is logically and administratively divided into publishing authorities. Each publishing authority has control over addition and administration of documents in their own sub-collection repositories. The metadata fields (e.g., title, author, abstract) for each document in these repositories are then indexed by one or more index servers. The metadata is accessed through Dienst protocol requests to the respective repository. The Dienst collection service allows the federation of these index and repository servers into a single uniform collection. The Dienst protocol requests [DIENST] defined for the collection service give access to the following information:

The list of publishing authorities that are part of the collection. These are the organizations and repositories that are members of NCSTRL (e.g., Cornell Computer Science Department, D-Lib Magazine, the CoRR artificial intelligence collection).

The network location. The address and port of the Dienst index servers that store indexing information for each organization. For example, indexing information for Cornell Computer Science may be stored at foo.ncstrl.org port 80 and bar.ncstrl.org port 8083.

Meta information about each of the index servers. At present, this meta-information indicates whether the index server should be considered primary or secondary. However, our intention is to expand this meta-information to include data about last update of the index and performance information that could be used for performance-based query routing [DFL98].

The correspondence of index servers to repository servers. This provides the index servers with information on the repository servers from which they should download meta information for indexing.
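
The four kinds of information listed above could be held in a structure along the following lines. This Python sketch is an assumption-laden illustration and does not reproduce the Dienst protocol's message format: the class names, the `role` field values, the assignment of primary and secondary roles, and the repository host `repo.example.org` are hypothetical. Only the two index server addresses come from the example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class IndexServer:
    host: str
    port: int
    role: str                      # "primary" or "secondary"
    repositories: Tuple[str, ...]  # repository servers this index downloads from

@dataclass
class CollectionInfo:
    # Publishing authority name -> index servers holding its indexing information.
    authorities: Dict[str, List[IndexServer]] = field(default_factory=dict)

    def publishing_authorities(self) -> List[str]:
        """The list of publishing authorities that are part of the collection."""
        return sorted(self.authorities)

    def servers_for(self, authority: str, role: str = "primary") -> List[IndexServer]:
        """Index servers for one authority, restricted to the given role."""
        return [s for s in self.authorities.get(authority, []) if s.role == role]

info = CollectionInfo({
    "Cornell Computer Science": [
        IndexServer("foo.ncstrl.org", 80, "primary", ("repo.example.org",)),
        IndexServer("bar.ncstrl.org", 8083, "secondary", ("repo.example.org",)),
    ],
})

print(info.publishing_authorities())
print(info.servers_for("Cornell Computer Science"))
print(info.servers_for("Cornell Computer Science", role="secondary"))
```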

Within NCSTRL, information from the collection service is used by user interface gateways to the collection. Each user interface service is configured with the host and port number of a collection server. Periodically (every hour) each user interface gateway contacts a collection server to obtain collection information, as described above. The requesting user interface server then stores the collection information internally in a table. The user interface gateway later uses this information to create a search interface (for example, showing a list of publishing authorities from which a user may choose those to which searches should be restricted) and to determine to which index servers searches should be routed.
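
The hourly refresh and the routing table can be sketched as follows. This is a simplified Python illustration, not the Dienst implementation: the `GatewayCache` class, the shape of the table, and the `fetch` callable are hypothetical, and the sketch refreshes lazily on access rather than running a background timer.

```python
import time
from typing import Callable, Dict, List, Optional

REFRESH_SECONDS = 3600  # the text describes an hourly refresh

class GatewayCache:
    """Holds collection information inside a user interface gateway.

    `fetch` stands in for the protocol request to the collection server and
    returns a table mapping publishing authority -> index server addresses.
    """

    def __init__(self, fetch: Callable[[], Dict[str, List[str]]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._clock = clock
        self._table: Dict[str, List[str]] = {}
        self._fetched_at: Optional[float] = None

    def table(self) -> Dict[str, List[str]]:
        """Return the cached table, contacting the collection server if stale."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= REFRESH_SECONDS:
            self._table = self._fetch()
            self._fetched_at = now
        return self._table

    def route(self, chosen_authorities: List[str]) -> List[str]:
        """Index servers to which a search restricted to these authorities goes."""
        servers: List[str] = []
        for authority in chosen_authorities:
            for server in self.table().get(authority, []):
                if server not in servers:  # avoid querying a server twice
                    servers.append(server)
        return servers

def example_fetch() -> Dict[str, List[str]]:
    # Stand-in for the collection server's reply.
    return {"Cornell Computer Science": ["foo.ncstrl.org:80", "bar.ncstrl.org:8083"]}

cache = GatewayCache(example_fetch)
print(sorted(cache.table()))                       # authorities offered in the search form
print(cache.route(["Cornell Computer Science"]))   # where the search is sent
```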

Figure 3 illustrates the interaction between a collection server, user interface servers, and index servers in Dienst. As shown, each user interface server queries the collection server (via protocol) for collection information. For a specific query, an individual user interface (labeled UI₁ in the figure) uses this collection information to determine which index servers should process the query.

The Dienst collection service has a number of limitations. First, collection criteria are hard-wired into the implementation. As described above, the NCSTRL collection is partitioned into sub-collections that correspond to the partitioning of the name space for documents in NCSTRL (i.e., each sub-collection corresponds to a Handle System naming authority). Second, the Dienst protocol and server implementation limits the ability of user interface servers to interact with more than one collection (and its associated set of sub-collections). This has prevented us from expanding the NCSTRL service to include additional scholarly collections; for example, physics, mathematics, etc. Finally, the Dienst architecture incorrectly conflates the functions of the user interface service with query routing. Although the collection service provides information for query routing, the actual dispatch of queries to index servers takes place in the user interface service. This limits our capacity to perform query routing that is highly collection specific.

4.2 The CRADDL Collection Service

CRADDL is a component or service based digital library architecture that we are currently developing as a reference implementation of our research results and as a testbed for future research. Two CRADDL services, the FEDORA repository architecture and the STARTS index server implementation, are in the prototype phase and are available for interoperability testing. In this section, we describe our initial work on the design and prototyping of the collection service.

The CRADDL collection service is implemented as a set of distributed servers that act as a metadata repository for collection specific information, and that perform collection specific query routing in the manner described in Section 3. Each collection service maps to a single collection; in effect, a collection exists and is accessible in the digital library infrastructure if there is a collection service for it.

4.2.1 The collection service and user interface servers

Similar to Dienst, the main consumers of collection services are user interface gateways. Each user interface gives human-friendly access to one or more collections through interaction with the collection services corresponding to those collections. The interaction between user interface gateways and collection services, through defined protocol requests, involves the exchange of a number of types of metadata about the collection:

A collection description including the name of the collection and a free-text description of the collection. User interfaces may use this information to assist users in choosing among collections for queries.

The elements of the collection hierarchy. Collections in CRADDL have a hierarchical structure. This expands the capabilities of the Dienst/NCSTRL collection, which is sub-divided into a one-level deep hierarchy corresponding to the member publishing authorities.

Query capability and customization information. The purpose of this is to facilitate the creation of collection-specialized query forms by a user interface. Our present model for representing and transmitting this type of information is the "Source Metadata" provided by STARTS, which provides information on query fields supported (e.g., title, author) and query modifiers supported (e.g., ‘>’, ‘>=’) at an indexing site.
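
The three kinds of metadata listed above might be combined in a single recursive structure, sketched here in Python. Everything in this fragment is hypothetical: the `CollectionNode` class, the sub-collection names, and the field identifiers are chosen for illustration, and the structure is not the STARTS Source Metadata format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CollectionNode:
    name: str
    description: str = ""                                   # free-text description
    query_fields: List[str] = field(default_factory=list)   # e.g., title, author
    modifiers: List[str] = field(default_factory=list)      # e.g., '>', '>='
    children: List["CollectionNode"] = field(default_factory=list)

    def paths(self, prefix: str = "") -> List[str]:
        """Flatten the collection hierarchy into slash-separated paths."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        result = [here]
        for child in self.children:
            result.extend(child.paths(here))
        return result

cs = CollectionNode(
    name="computer science",
    description="Research reports in computer science.",
    query_fields=["title", "author", "acm-classification"],
    modifiers=[">", ">="],
    children=[
        CollectionNode("theory"),
        CollectionNode("systems", children=[CollectionNode("networks")]),
    ],
)

print(cs.paths())        # hierarchy deeper than one level
print(cs.query_fields)   # fields a user interface could offer on a query form
```

A user interface could walk `paths()` to present the hierarchy and read `query_fields` and `modifiers` to build a collection-specific query form.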

In addition, the interaction between a user interface service and the collection service involves submission of query requests and the return of corresponding result sets. This interaction is further described in Section 4.2.2.

Figure 4 gives an example of how a user interface service might use the collection metadata provided by collection servers. The user interface server shown subscribes to three collections: computer science, physics and economics. People using this user interface can choose the collection to which they wish to direct their search, as shown in the first example form at the left bottom of the figure. (On a help screen, the user interface service might display the collection service-supplied descriptions of the collections.) Following the choice of a collection, in this case, a user has chosen computer science, a query screen tailored to that collection is presented (the user interface uses information provided by the respective collection service to create this screen). As shown in the bottom right of the figure, this customized screen for computer science might include a choice of ACM classification as a query feature.

Figure 4 - User interface interaction with collection services

One open issue in the interaction between user interface servers and collection services is how user interface servers "learn" about collections and their associated services. We plan to examine methods whereby user interface servers can discover new collections (in the spirit of the old WAIS directory of servers).

4.2.2 Components of the collection service

Each collection service consists of two types of servers: a central collection server (CCS) and one or more collection query routers (CQR).

The CCS serves as the central point of management of the collection. Collection management involves creation and modification of:

Collection criteria - As defined in Section 3, these are the characteristics of resources that should be included in the collection. We are currently considering RDF graphs [RDF98] of Dublin Core metadata as a simple but limited method for expressing collection criteria. Using these, we can express a Dublin Core criterion such as "subject equals computer science". We are also considering using RDF Schema Specifications [RDF98b], when they are further developed, since they will be able to express graph domain and range constraints with which more complex collection criteria could be expressed.

Index server tables - This is the set of index servers that are used for searches by the collection service.

Collection metadata - This is metadata, as defined in Section 4.2.1, such as collection description, query capabilities, and the like.
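To make the notion of a collection criterion concrete, the following Python sketch tests resources against a criterion such as "subject equals computer science". It is a simplification under stated assumptions: metadata records are plain dictionaries keyed by Dublin Core element name rather than RDF graphs, and the names `matches` and `cs_criteria`, along with the sample records, are hypothetical.

```python
from typing import Dict, List

# Hypothetical representations: a criterion set maps a Dublin Core element
# to the value it must contain; a record maps each element to its values.
Criteria = Dict[str, str]
Record = Dict[str, List[str]]


def matches(record: Record, criteria: Criteria) -> bool:
    """Return True if the record satisfies every criterion.

    Comparison is case-insensitive equality against any of the values
    the record holds for the element.
    """
    for element, required in criteria.items():
        values = [v.lower() for v in record.get(element, [])]
        if required.lower() not in values:
            return False
    return True


cs_criteria: Criteria = {"subject": "computer science"}

# Invented sample records, for illustration only.
records: List[Record] = [
    {"title": ["A Paper on Caching"], "subject": ["Computer Science"]},
    {"title": ["Notes on Inflation"], "subject": ["economics"]},
]

members = [r["title"][0] for r in records if matches(r, cs_criteria)]
print(members)
```

Simple equality tests of this kind are what makes the approach limited; range or type constraints would need a richer criteria language.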

Each CQR provides 1) local, replicated access to collection metadata and 2) query routing tailored for local conditions. The former (replication of metadata) makes sense from the standpoint of reliability. The logic for the latter (localized query routing) is as follows. We assume that index servers will be distributed globally with replication of individual index servers. As is well known, global connectivity varies dramatically. We can model patterns of global connectivity through the notion of a connectivity region [LFP98]. A connectivity region is defined as a group of nodes on the network that among themselves have good connectivity, relative to nodes outside of the region. (We note that connectivity regions do not necessarily correspond to geographic regions, due to peculiarities in the global telecommunications networks.) Localized query routing is then defined as dispatching queries, if possible, to those index servers that are within a single connectivity region. In case of index server failure, a backup index server should be chosen from another region with which there is relatively good connectivity.
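The routing rule described above can be sketched in Python as follows. This is a minimal sketch, not the deployed routing logic: it treats the listed servers as replicas of a single index, assumes each server has already been assigned to a named region, and assumes the backup regions are supplied in order of decreasing connectivity. The names `IndexServer` and `choose_index_server` and the region labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexServer:
    name: str
    region: str
    up: bool = True  # assumed to be maintained by some health check


def choose_index_server(
    servers: List[IndexServer],
    local_region: str,
    backup_regions: List[str],
) -> Optional[IndexServer]:
    """Prefer a live replica in the local region; otherwise fall back to
    the first live replica found in the backup regions, tried in order."""
    for region in [local_region, *backup_regions]:
        for server in servers:
            if server.region == region and server.up:
                return server
    return None  # no replica is reachable


replicas = [
    IndexServer("ix-1", "region-a"),
    IndexServer("ix-2", "region-b"),
    IndexServer("ix-3", "region-c"),
]

first_choice = choose_index_server(replicas, "region-a", ["region-c", "region-b"])
replicas[0].up = False  # simulate failure of the local index server
fallback = choose_index_server(replicas, "region-a", ["region-c", "region-b"])
print(first_choice.name, fallback.name)
```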

The CQR is the mechanism for performing this localized query routing. Each connectivity region for a collection has a corresponding CQR. When a user interface subscribes to a collection, it contacts the CCS, from which it obtains a list of CQRs. Based on its own analysis of connection characteristics to the available CQRs, it then chooses a CQR to represent its "local" connectivity region. The user interface then uses that CQR for collection-specific queries, which the CQR routes to index servers in the connectivity region.
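One plausible way for a user interface to choose among the CQRs is to pick the one with the lowest measured round-trip time. The Python sketch below assumes that such measurements have already been taken; the text does not specify which connection characteristics are analyzed, so the metric, the function name `choose_cqr`, and the sample values are all assumptions.

```python
from typing import Dict


def choose_cqr(round_trip_ms: Dict[str, float]) -> str:
    """Return the name of the CQR with the lowest measured round-trip time.

    `round_trip_ms` maps each CQR named by the CCS to a measurement made
    by the user interface itself.
    """
    if not round_trip_ms:
        raise ValueError("the CCS returned no CQRs")
    return min(round_trip_ms, key=lambda name: round_trip_ms[name])


# Hypothetical measurements for three CQRs of one collection.
measurements = {"cqr-a": 180.0, "cqr-b": 35.0, "cqr-c": 240.0}
local_cqr = choose_cqr(measurements)
print(local_cqr)
```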

Figure 5 - Distributed collection service and connectivity regions

Figure 5 illustrates this regional architecture and service interactions with it. The three gray circles represent connectivity regions. Each region has a collection query router (CQR), pictured in yellow, and a set of index servers, pictured in red. (Note that index servers are assigned to a region in the context of a specific collection. Another collection might assign an index server to a completely different region.) The CQRs communicate with the central collection server (CCS), shown as the blue rectangle, to obtain copies of collection metadata. When a user interface server, shown in green, subscribes to the collection, it first contacts the CCS. Once the user interface chooses a CQR, it then submits queries to that CQR, which then dispatches those queries to regional index servers (the communication links shown as black arrows).

In Dienst and NCSTRL we implemented a limited version of this regional architecture in which regions are statically configured. In reality, connectivity between nodes on the Internet is highly dynamic. The configuration of regions -- the index servers that are members of a region -- should adapt to changing connectivity and server load. We are currently exploring methods for sharing load information among the CQRs and the CCS to allow dynamic region configuration. This information sharing is shown in Figure 5 by the red communication arrows between the CCS and CQRs.

5. Conclusion

The physical proximity or collocation of resources is irrelevant to networked information systems. Globally distributed content can be immediately and uniformly available. The current World Wide Web demonstrates the advantages of such universal access, yet it also shows its flaws. Attributes of the traditional library such as organization, specialization, and selection have been shown, in many situations, to be necessary for effective resource discovery and use.

We have described in this paper a mechanism that facilitates such organization, specialization, and selection in a distributed information space. The logical independence of this mechanism, the collection service, from other digital library services allows the organizational dimension to be independent from the physical distribution of content and the administration of that content, and it allows the coexistence of several organizational schemes. Moreover, it does not prohibit the dissemination of, discovery of, and access to content and services in the relatively chaotic fashion that makes the current World Wide Web such a success.

Finally, we have described two implementations of the collection service. The first, which is somewhat limited, is deployed as part of the globally distributed NCSTRL collection. The second, more powerful implementation is currently under development as part of our digital library architecture research.

Acknowledgements

The work described in this paper was funded by the Defense Advanced Research Projects Agency under Grant No. MDA 972-96-1-006 with the Corporation for National Research Initiatives. This paper does not necessarily represent the views of CNRI or DARPA. We would also like to acknowledge the contributions of the other members of the Cornell Digital Library Research Group: Naomi Dushay, Sandra Payette, and Dean Krafft. Finally, we would like to thank Jim Davis, whose initial design of Dienst made this work possible.

References

[LSDK95] C. Lagoze, E. Shaw, J. R. Davis, and D. B. Krafft, "Dienst Implementation Reference Manual", Cornell Computer Science Technical Report TR95-1514, May 1995, http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/ncstrl.cornell/TR95-1514.

[DL99] J. R. Davis and C. Lagoze, "NCSTRL: Design and Deployment of a Globally Distributed Digital Library", to appear in IEEE Computer, February 1999.

[LP98] C. Lagoze and S. Payette, "An Infrastructure for Open-Architecture Digital Libraries", Cornell Computer Science Technical Report TR98-1690, June 1998, http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/ncstrl.cornell/TR95-1514.

[PL98] S. Payette and C. Lagoze, "Flexible and Extensible Digital Object and Repository Architecture (FEDORA)", Second European Conference on Research and Advanced Technology for Digital Libraries (ECDL98), Heraklion, Crete, September 1998.

[KW95] R. H. Kahn and R. Wilensky, "A Framework for Distributed Object Services", Corporation for National Research Initiatives, http://www.cnri.reston.va.us/cstr/arch/k-w.html.

[DLP98] R. Daniel Jr., C. Lagoze, and S. Payette, "A Metadata Architecture for Digital Libraries", Advances in Digital Libraries 1998, Santa Barbara, April 1998.

[GC97] L. Gravano, K. Chang, H. Garcia-Molina, C. Lagoze, and A. Paepcke, "STARTS: Stanford Protocol for Internet Retrieval and Search", January 1997, http://www-db.stanford.edu/~gravano/starts.html.

[RA96] R. Atkinson, "Library Functions, Scholarly Communication, and the Foundation of the Digital Library: Laying Claim to the Control Zone", The Library Quarterly, July 1996.

[GKR98] D. Gibson, J. Kleinberg, and P. Raghavan, "Inferring Web Communities from Link Topology", in Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh 1998.

[PMRC98] A. Paepcke, H. Garcia-Molina, G. Rodriquez, and J. Cho, "Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies", Stanford University Technical Report SIDL-WP-1998-0099.

[LFP98] C. Lagoze, D. Fielding, and S. Payette, "Making Global Digital Libraries Work: Collection Services, Connectivity Regions, and Collection Views", ACM Digital Libraries ’98, Pittsburgh, June 1998.

[DIENST] J. Davis and C. Lagoze, "Dienst Protocol Version 4.1", http://www.cs.cornell.edu/NCSTRL/protocol.html.

[DFL98] N. Dushay, J. L. French, and C. Lagoze, "Distributed Searching: Predicting Performance of Remote Indexers", forthcoming.

[RDF98] O. Lassila and R. R. Swick eds., "Resource Description Framework (RDF) Model and Syntax Specification", W3C Working Draft 08 October 1998, http://www.w3.org/TR/WD-rdf-syntax/.

[RDF98b] D. Brickley, R. V. Guha, and A. Layman eds., "Resource Description Framework (RDF) Schema Specification", W3C Working Draft 14 August 1998, http://www.w3.org/TR/WD-rdf-schema/.

Copyright © 1998 Carl Lagoze and David Fielding

hdl:cnri.dlib/november98-lagoze
