
Conference Report


D-Lib Magazine
January/February 2009

Volume 15 Number 1/2

ISSN 1082-9873

A Workshop Series for Grid/Repository Integration


Andreas Aschenbrenner
State and University Library, Goettingen

Tobias Blanke
King's College London, Arts and Humanities e-Science Support Centre (AHeSSC)

Neil P Chue Hong

Nicholas Ferguson
OGF.eeig, OGF Europe

Mark Hedges
King's College London, Centre of e-Research (CeRch)



Researchers and information scientists are often forced to move back and forth between different digital environments in their daily work. Institutional or thematic repositories have become a prevalent mechanism to manage publications, and increasingly also to manage research outputs and primary data. Integration of the user's natural work environment with repositories is improving (albeit slowly), and repository-based research environments are emerging (e.g., eSciDoc [1]). The natural habitat for many scientific users, however, is e-Infrastructure such as the "grid". At the same time, these users are employing repositories – often home-grown systems – to store their research outputs and publications.

A workshop series supported by OGF-Europe [2], DReSNeT [3] and OMII [4] set out to reduce this fragmentation and explore the interfaces between grid- and repository-based architectures. Past workshops include:

Each of these four workshops was attended by between 50 and 100 participants from a variety of backgrounds. Despite a tangible terminology gap between repository managers and some (scientific) users, the commonalities between requirements and existing systems were astonishing.

High on the agenda of all participants is preservation. Funding organizations or legal frameworks often demand the preservation of primary data for 10 years or more. Another factor is the inherent value of the primary data, such as irreproducible climate data. As Flavia Donno analyzed in her presentation at OGF23, the scientific and the repository communities need to share their experiences in preservation with each other and work towards open standards such as OAIS [5].

Participants at all four workshops emphasized various aspects of preservation. Two related discussions concerned:
  1. The capture of provenance data, covering both the documentation of the (organizational) context and the technical audit trail, since data is created through a number of steps in a scientific workflow;
  2. The preservation of the creation process and the discourse about scientific results, e.g., by integrating Wikis, Blogs, etc., directly into the trusted repository [6].
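
The technical audit trail mentioned in the first point could be captured along the following lines. This is only a minimal sketch in Python: the `ProvenanceRecord` and `AuditTrail` structures, the `run_step` helper, and the step and agent names are hypothetical and are not taken from any system presented at the workshops.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


def checksum(data: bytes) -> str:
    """Return a SHA-256 digest so inputs and outputs can be verified later."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProvenanceRecord:
    """One entry of the audit trail (hypothetical structure)."""
    step: str            # name of the workflow step
    agent: str           # who or what ran the step (organizational context)
    input_sha256: str    # digest of the data before the step
    output_sha256: str   # digest of the data after the step
    timestamp: str       # UTC time at which the step finished


@dataclass
class AuditTrail:
    records: List[ProvenanceRecord] = field(default_factory=list)

    def run_step(self, step: str, agent: str,
                 func: Callable[[bytes], bytes], data: bytes) -> bytes:
        """Run one workflow step and record its provenance as a side effect."""
        result = func(data)
        self.records.append(ProvenanceRecord(
            step=step,
            agent=agent,
            input_sha256=checksum(data),
            output_sha256=checksum(result),
            timestamp=datetime.now(timezone.utc).isoformat(),
        ))
        return result


# Usage: two toy steps; the user never writes provenance metadata by hand.
trail = AuditTrail()
raw = b"  temp=12.1,12.4,11.9  "
cleaned = trail.run_step("strip-whitespace", "workflow-engine", bytes.strip, raw)
final = trail.run_step("uppercase", "workflow-engine", bytes.upper, cleaned)
```

Because each record stores the digest of both its input and its output, consecutive records can be chained and checked, which is one simple way to make the trail auditable.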

Many of the measures used to ensure preservation are also conducive to good scientific practice in general. This includes preservation of the context of data and publications. In her talk at OGF23, Francoise Genova demonstrated how the Strasbourg Astronomical Observatory links publications and the data upon which they are based. Reliable metadata are, of course, essential for this. Indeed, a recurring topic at the workshops was metadata in all its aspects, from semi-automatic metadata creation to ensuring metadata quality.

All these issues – preservation, good scientific practice, metadata and collection management – emphasize the tight integration of the repository with the user's work environment where the data is created. Reliable audit trails, metadata, etc. can only be created if the users' work environments and repositories connect seamlessly. Moreover, as Andrew Treloar (Australian National Data Service, ANDS [7]) put it in the discussion at IEEE e-Science 2008: "Repositories are just the plumbing". All the tedious bits of repository management need to be hidden from the user wherever possible. Ideally, metadata should not be user-created but should simply be there; systems should interface on-the-fly; and data should not be accessioned into the repository, but be born and sustained in the repository without the user noticing.

The federation of distinct repositories is obviously an important factor for achieving this. Users must not be required to ingest data multiple times, e.g., into their institutional repository and into a thematic repository too. This should be the goal regardless of whether in the future there will be only one virtual repository based on multiple distinct repository systems, or whether there will be buckets of federated repositories addressing specific needs (e.g., data safety for confidential medical records). Because of this, the OAI-ORE protocol [8] has been received with high interest by the scientific community, as it resembles protocols dubbed "grid" in many ways. OAI-ORE allows for virtualization of digital repositories just as, for example, the OGSA-DAI [9] protocol virtualizes distributed databases.
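
To illustrate the idea of virtualizing distinct repositories, the sketch below builds a single aggregation that points at items held in two separate repositories, so that an object is ingested once and only referenced elsewhere. It is a hedged illustration in Python, not an implementation of OAI-ORE: the terms `ore:ResourceMap`, `ore:Aggregation`, `ore:describes` and `ore:aggregates` come from the ORE vocabulary, but the `example.org` URIs, the function name and the plain-dictionary layout are assumptions made for this example.

```python
import json
from typing import Dict, List


def build_resource_map(rem_uri: str, aggregation_uri: str,
                       resources: List[str]) -> Dict:
    """Describe one aggregation whose members may live in different repositories.

    Duplicate URIs are dropped (keeping first occurrence), reflecting the goal
    that a resource is referenced rather than ingested a second time.
    """
    unique = list(dict.fromkeys(resources))
    return {
        "@id": rem_uri,
        "@type": "ore:ResourceMap",
        "ore:describes": {
            "@id": aggregation_uri,
            "@type": "ore:Aggregation",
            "ore:aggregates": [{"@id": uri} for uri in unique],
        },
    }


# Usage with made-up URIs: a paper in an institutional repository and its
# dataset in a thematic repository, presented as one virtual object.
resource_map = build_resource_map(
    "https://example.org/rem/42",
    "https://example.org/aggregation/42",
    [
        "https://institutional.example.org/papers/42.pdf",
        "https://thematic.example.org/datasets/42.csv",
        "https://institutional.example.org/papers/42.pdf",  # duplicate, ignored
    ],
)
print(json.dumps(resource_map, indent=2))
```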

Essentially, all kinds of infrastructure need to care – as David de Roure's eloquent analysis at IEEE e-Science 2008 shows – more about the roll-in of users than about the roll-out of services. Users are understandably highly sensitive about tools being "delivered" to them by somebody who might be less proficient in the area of the users' work. In this respect, the users don't want to be "served"; rather, they are looking for simple and pragmatic environments with which they can interface. Hence, better interweaving of existing infrastructures is vital to ensure user satisfaction and eventually the persistence of each of those infrastructures.

How do we get there?

Of course, repositories and the grid are not the only kinds of infrastructure available, but interweaving them is an essential step towards the simple and pragmatic plumbing the user is seeking. The repository community has for a long time acknowledged that other components besides repositories are needed to achieve a suitable information environment. Format registries (e.g., Pronom [10], GDFR [11]) or a persistent identifier resolver (e.g., Pilin [12]) are some of the components that were identified during the workshop series. Grid-based workflow engines, storage services, and other modules could interface with repositories in a very similar way.

All possible technological considerations are, of course, caught between the underlying trends of experimentation and standardization. We may need more experimentation to reach the desired level of integration. However, the time for pure experimentation is coming to an end for both repositories and the grid. Funding institutions were represented at the workshops, and they are looking for operational, trustworthy scientific repositories with many of the features described above. For example, the Australian National Data Service (ANDS) has recently been commissioned (Andrew Treloar, IEEE e-Science 2008); the call for DataNet by the NSF in the USA [13] is being evaluated (Ed Seidel, IEEE e-Science 2008); and the latest call for e-Infrastructure scientific repositories by the European Commission [14] is just out (Krystyna Marek, OGF23).

The next pragmatic step is the upcoming workshop in this series of grid/repository integration workshops. It will take place at the Open Grid Forum 25, which will be held in Catania on March 2-6, 2009 [15]. "Infrastructure" may be of many kinds, but let's ensure that these infrastructures can interweave.


1. eSciDoc, <>.

2. OGF Europe, <>.

3. DReSNeT, <>.

4. OMII-UK, <>.

5. Open Archival Information System (OAIS) Reference Model, <>.

6. cf. Andreas Hense's talk at IEEE e-Science 2008, <>.

7. Australian National Data Service, <>.

8. Open Archives Initiative Object Reuse and Exchange (OAI-ORE), <>.

9. OGSA-DAI, <>.

10. PRONOM, <>.

11. Global Digital Format Registry, <>.

12. PILIN, <>.

13. DataNet forum, <>.

14. e-Infrastructure Call for Participation, <>.

15. OGF25/EGEE User Forum, <>.

Copyright © 2009 Andreas Aschenbrenner, Tobias Blanke, Neil P. Chue Hong, Nicholas Ferguson, and Mark Hedges
