Long-term Preservation for Spatial Data Infrastructures: a Metadata Framework and Geo-portal Implementation
Arif Shaon, Science and Technology Facilities Council, Didcot, UK
With growing concerns about environmental problems, and an exponential increase in computing capabilities over the last decade, the geospatial community has been producing increasingly voluminous and diverse environmental datasets. Long-term preservation of these environmental data exposed through uniform and interoperable Spatial Data Infrastructures (SDIs) is not typically addressed, but is highly important for applications that require continued access to both current and historical data, e.g., for monitoring climate change. The work presented in this article investigates the requirements for ensuring sustained access to environmental data from the perspective of a preservation-aware SDI. We take INSPIRE as an exemplar for our analysis and model development. In addition, we present an implementation approach in the form of a Geo-Portal that incorporates a preservation profile of the ISO 19115 metadata standard.
Keywords: preservation; geospatial; geo-portal; metadata; INSPIRE; ISO 19115
The European Commission INSPIRE Directive requires public authorities across Europe to provide access to their environmental datasets in a uniform and interoperable manner. To ensure this level of interoperability, the directive mandates the adoption of common Implementing Rules (IR) for metadata, data specifications, network services, and data sharing through a pan-European Spatial Data Infrastructure (SDI). While this is an effective way of ensuring interoperability across disparate datasets, it does not guarantee sustainability of those datasets over an indefinite period of time (for example, ensuring compatibility with future technology or ensuring continued access even after a provider has ceased to exist). Interoperability's scope must extend to the temporal dimension.
In general, sustained access to environmental data is becoming increasingly important and difficult, especially in light of its growing volume. For example, satellite Earth observation (EO) data can grow at rates of terabytes per day (Janée, et al., 2008). Historical geospatial data is also increasing in value for monitoring and analyzing social, environmental (e.g., global climate change) and economic changes that occur over time (McGarva, et al., 2009). Without effective long-term preservation, these environmental data (both current and historical) face the risk of becoming unusable over time. From this perspective, there is a pressing need for the long-term preservation of the data made available through SDIs like INSPIRE.
In this article, we investigate the requirements for developing a preservation-aware SDI based on the OAIS reference model, a widely adopted ISO standard for digital preservation (CCSDS, 2002). In addition, we present an implementation approach in the form of a Geo-Portal that incorporates a preservation profile of the ISO 19115 metadata standard.
2. The Main Challenges of Preserving Environmental Information
In general, environmental data share the preservation challenges common to all digital information (McGarva, et al., 2009). These challenges are further complicated by some of the characteristics of environmental datasets, such as diverse and highly structured data formats, and the need for special domain knowledge for accurate interpretation. Moreover, in the context of SDIs such as INSPIRE, state-of-the-art service-oriented infrastructures adopt common exchange formats (application schemas) that reflect domain-specific conceptual data models ('feature types') rather than directly reflecting underlying database storage schemas. These application schemas and their relationships (e.g., mapping) with the corresponding persistence datasets would need to be preserved to ensure appropriate accessibility and re-use of those datasets in the future.
On the positive side, it should be possible, in principle, to apply existing widely adopted preservation mechanisms and standards, such as the OAIS reference model described in Section 3 below, to the long-term preservation of geospatial data. In fact, a number of European archives are currently adopting, or are looking to adopt, the OAIS model and other related specifications for the long-term preservation of their geospatial datasets (Bos, et al., 2010). These organisations would, however, significantly benefit from a best-practice implementation profile of the OAIS model for geospatial datasets and an INSPIRE-compliant metadata model for describing and sharing the relevant preservation aspects (see Section 4) of such datasets through the INSPIRE SDI, neither of which exist at present.
3. The OAIS Reference Model
The Reference Model for an Open Archival Information System (OAIS) is an ISO standard (ISO 14721:2003) for addressing issues associated with the long-term preservation of digitally encoded information (CCSDS, 2002). The OAIS describes a number of conceptual models in order to aid formulation of a suitable preservation strategy for digital objects. Of particular importance among the OAIS models is the Information Model that broadly describes the metadata requirements associated with retaining a digital object over the long-term (See Figure 1). We consider the different components of the OAIS information model from the perspective of long-term preservation of geospatial datasets.
Figure 1: A Partial View of the OAIS Information Model
3.1 Content Information
This is the set of information that needs to be preserved over the long-term. In the case of spatial datasets, it should be the 'original' version of a dataset rather than a domain specific representation of that dataset. For example, in the INSPIRE SDI, where geospatial datasets are mapped on to 'application schema' to represent particular facets of phenomena on the earth as 'geographic features' (e.g., a pan-European road transport network), the source dataset, rather than its 'mapped view', should form the 'Content Information'.
3.2 Preservation Description Information (PDI)
This type of information is needed to efficiently manage and preserve a digital object over an indefinite period of time. This includes various information about the life-cycle of a dataset, such as its provenance and versioning history, as well as reference and annotation-related information.
3.3 Representation Information (RI)
This is a component of the Content Information that is required to accurately render a preserved digital object on a future technological platform. This encompasses all levels of abstraction and refers to both the structural and semantic composition, such as recreating the original appearance of the digital object, or analysing it for a concordance (CCSDS, 2002). The use of RI can be recursive, especially in cases where meaningful interpretation of one RI element requires further RI (See Figure 1). The RI for a dataset may include information about its technical dependencies, such as software required to access the dataset, compatible operating platform, and so on.
With respect to an SDI, RI refers to the sustained ability to interpret the semantics of a digital dataset, i.e., how the digital objects relate to a conceptual model of some universe of discourse (ISO 19101:2002 Geographic information Reference model). For instance, a transport network dataset stored in a geo-database or a Shapefile will be meaningless unless the tables or digital objects can be interpreted as 'road features' defined in a relevant conceptual model.
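To illustrate why such a mapping is itself part of the RI, the following minimal sketch interprets one row of a source table as a 'road feature'. The column names, the property names and the `RoadLink` feature type are hypothetical and are not taken from any INSPIRE application schema.

```python
# Hypothetical mapping from source table columns to the properties of a
# conceptual 'road feature'; every name here is an illustrative assumption.
ROAD_MAPPING = {
    "feature_type": "RoadLink",
    "properties": {
        "rd_id": "identifier",
        "rd_nm": "name",
        "geom_wkt": "centrelineGeometry",
    },
}


def row_to_feature(row, mapping):
    """Interpret one source row as a feature using the given mapping."""
    properties = {
        feature_property: row[column]
        for column, feature_property in mapping["properties"].items()
    }
    return {"type": mapping["feature_type"], "properties": properties}


# A single row as it might be held in the source geo-database.
source_row = {"rd_id": 42, "rd_nm": "A34", "geom_wkt": "LINESTRING (0 0, 1 1)"}
feature = row_to_feature(source_row, ROAD_MAPPING)
```

Without `ROAD_MAPPING`, the row is only a set of opaquely named columns; with it, the same values can be read as a feature of the conceptual model.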
3.4 Packaging Information
This type of information is used to bind a data object and its associated metadata (such as PDI and Descriptive Information) into an identifiable unit or package for preservation. For example, if a data object is compressed before being ingested into an archive, the packaging information for that dataset would include information about the underlying structure of its compressed form.
3.5 Descriptive Information
This information is needed to facilitate efficient discovery and accessibility of a preserved data object, typically through a search and retrieval facility provided by the long-term preservation archive. Descriptive information about a data object may be derived from its PDI and other metadata. For a spatial dataset that is exposed as a 'feature type' through, for example, an OGC standardised Web Feature Service (WFS), the descriptive information could include the information (e.g., keywords, abstract) about that 'feature type' provided in the 'GetCapabilities' document of the WFS.
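As a rough illustration, the sketch below extracts such descriptive information from a capabilities document using the Python standard library. The embedded XML is a simplified, hand-written fragment in the style of a WFS 1.1.0 response; the feature type and its values are invented for the example, and a real document contains many more sections.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment in the style of a WFS 1.1.0
# capabilities document; the feature type and its values are invented.
CAPABILITIES = """<wfs:WFS_Capabilities
    xmlns:wfs="http://www.opengis.net/wfs"
    xmlns:ows="http://www.opengis.net/ows" version="1.1.0">
  <wfs:FeatureTypeList>
    <wfs:FeatureType>
      <wfs:Name>ex:RoadLink</wfs:Name>
      <wfs:Title>Road links</wfs:Title>
      <wfs:Abstract>Centrelines of a road transport network.</wfs:Abstract>
      <ows:Keywords>
        <ows:Keyword>transport</ows:Keyword>
        <ows:Keyword>roads</ows:Keyword>
      </ows:Keywords>
    </wfs:FeatureType>
  </wfs:FeatureTypeList>
</wfs:WFS_Capabilities>"""

NS = {
    "wfs": "http://www.opengis.net/wfs",
    "ows": "http://www.opengis.net/ows",
}


def descriptive_information(capabilities_xml):
    """Return one descriptive record per feature type in the document."""
    root = ET.fromstring(capabilities_xml)
    records = []
    for ft in root.findall("wfs:FeatureTypeList/wfs:FeatureType", NS):
        records.append({
            "name": ft.findtext("wfs:Name", default="", namespaces=NS),
            "title": ft.findtext("wfs:Title", default="", namespaces=NS),
            "abstract": ft.findtext("wfs:Abstract", default="", namespaces=NS),
            "keywords": [k.text for k in ft.findall("ows:Keywords/ows:Keyword", NS)],
        })
    return records


records = descriptive_information(CAPABILITIES)
```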
3.6 Designated Community/ Knowledge Base
This encompasses all identified potential consumers (human, software application, etc.) to whom the preserved data object is beneficial in terms of its accurate interpretation and proper utilisation. The level of recursion for a particular element of Representation Information (RI) about a data object is likely to depend on the level of knowledge that the designated community has about that element. For example, if the designated community has considerable understanding of the OGC Web Feature Service, then the representation information of a dataset that is exposed through WFS as 'feature types' could just include the service name 'OGC Web Feature Service'. Conversely, if the designated community has no understanding of WFS, the representation information of such a dataset would have to include a detailed implementation and use specification of the OGC WFS, among other related information.
A generic viewpoint assumption in an SDI for long-term preservation would define the user community of the SDI as the OAIS 'designated community', with the semantics of harmonised conceptual models that enable domain-specific representation (e.g., 'feature types') of a spatial dataset within the SDI constituting the OAIS 'knowledge base'.
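The dependence of RI recursion on the knowledge base can be sketched as a simple graph traversal. In the illustration below, the RI network and all item names are hypothetical; the only point is that recursion stops at anything the designated community already understands.

```python
# Hypothetical RI network: each item lists the further RI needed to
# interpret it. All names are illustrative assumptions.
RI_DEPENDENCIES = {
    "weather observations dataset": ["OGC Web Feature Service", "application schema"],
    "OGC Web Feature Service": ["WFS implementation specification"],
    "application schema": ["GML"],
    "WFS implementation specification": ["XML"],
    "GML": ["XML"],
    "XML": [],
}


def required_ri(item, knowledge_base, dependencies):
    """Collect the RI to record for `item`, not recursing into known items."""
    needed = []

    def visit(name):
        for dep in dependencies.get(name, []):
            if dep in needed:
                continue
            needed.append(dep)
            # Known items are named but need no further explanation.
            if dep not in knowledge_base:
                visit(dep)

    visit(item)
    return needed


expert_view = required_ri(
    "weather observations dataset",
    {"OGC Web Feature Service", "GML"},
    RI_DEPENDENCIES,
)
novice_view = required_ri("weather observations dataset", set(), RI_DEPENDENCIES)
```

For the community that already understands WFS, the service is only named; for the community with no such knowledge, the traversal continues down to the underlying specifications.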
4. A Preservation-aware Spatial Data Infrastructure
We have analysed the INSPIRE architecture in the context of the OAIS reference model with a view to determining the requirements for a preservation-aware SDI. Functionally, INSPIRE consists of the components shown in Figure 2 below.
Figure 2: A Preservation-aware SDI
An analysis of the applicability of the OAIS reference model to the INSPIRE SDI identifies the following three core requirements for ensuring sustained accessibility and usability of the data exposed through such SDIs.
4.1 Long-term preservation of geospatial data repositories
An effective and coherent approach is required to preserve the individual data repositories made available through the SDI over the long-term. This needs to address various complex issues, such as compatibility of data with future repository technology and ensuring its continued access even after its provider has ceased to exist. However, because this aspect is provider-specific, and dependent on the adoption of suitable preservation policies and strategies, it is not dealt with further here.
4.2 Preservation-aware Metadata Model
The ISO 19115 metadata model adopted in the INSPIRE SDI is sufficient for capturing enough of the context surrounding the data (data quality, maintenance, use/processing) to enable its effective discovery. However, the metadata elements defined in ISO 19115 do not capture other important preservation-related metadata specified in the OAIS Reference model, such as PDI and RI (See Section 3). For example, the ISO 19115 model does not address the mappings between a source geospatial data set and its canonical representation, which typically describes particular facets of phenomena on the earth as 'geographic features'. Such 'feature-based' representation of a geospatial dataset is usually described by an appropriate 'application schema' and exposed by the INSPIRE SDI. This type of information is a significant aspect of a geospatial dataset's RI, without which accurate interpretation and re-use of the dataset on a future technological platform may not be possible.
Therefore, a preservation-aware SDI would require a preservation-focused metadata model that would help capture accurate and sufficient description of all aspects (including the aforementioned preservation-related aspects) of a geospatial dataset, as well as being flexible for addition of future requirements. However, as RI of a dataset could be highly complex and detailed (depending on the requirement of the designated community), it may be sufficient for a preservation metadata model for an SDI to include only an overview of the RI associated with a dataset. Access to the complete set of RI could be provided through an RI repository or registry (See Figure 2), if supported by the data provider. There are other benefits in adopting such an approach that are discussed in Section 5.1.
4.3 Long-term curation of metadata catalogues
The metadata catalogues are instrumental in facilitating discovery of the datasets held in the repositories by enabling searching of the metadata that describe those datasets. However, without curation (proper management, quality assurance and preservation), the metadata, too, may become unusable over time. For example, it may become out of step with the data that it describes. Therefore, it is also crucial to apply effective long-term curation measures to the metadata catalogues within an SDI (Shaon and Woolf, 2008).
5. A Preservation profile of the ISO 19115 Metadata model
As stated above, the ISO 19115 metadata model is not sufficient for capturing and providing users with the information needed to enable accurate interpretation of geospatial data in the future. To address this issue, we have developed a preservation profile of ISO 19115 based on the metadata requirements specified in the OAIS reference model and the PREMIS data dictionary. The rationale of this profile is to enable recording preservation-related information about a geospatial dataset, while retaining the ability of the core ISO 19115 model to capture descriptive and contextual metadata about that dataset. The preservation profile incorporates the key preservation concepts into the core ISO 19115 model, as shown in Figure 3 below.
Figure 3: A preservation profile of ISO 19115 Metadata Model
5.1 Representation Information
The OAIS reference model defines the Representation Information (RI) about a digital object as the information required to enable access to preserved digital objects in a meaningful way (CCSDS, 2002). In ISO 19115, the only notable RI-related information defined is the information about the application schema(s) (the MD_ApplicationSchemaInformation class shown in Figure 3) used to create a particular feature view of a source geospatial dataset. The preservation profile extends this concept to incorporate information about the mappings between the source data and application schema along with the applications/software/services required to effectively apply the mappings.
Figure 4: Representation Information elements of the preservation profile of ISO 19115 Metadata Model
In particular, the preservation profile defines the PM_FeatureTypeMappingInformation class, shown in Figure 4 above, to record information about the mapping(s) between a source dataset and its canonical 'feature-based' representation. The preservation profile also defines additional elements (otherRepInfo and environmentInfo properties of PM_RepresentationInformation class shown in Figure 4) to enable capturing other data-specific RI (data formats, storage media), in the form of web-accessible resources (through HTTP URLs). It is envisaged that detailed RI about a geospatial dataset may not directly benefit its typical users, as they are likely to rely on the current data provider or preservation body to make the data available to them, generally through web services, which apply the aforementioned mappings. Nevertheless, this approach provides the users with the option to access the RI (made available on the web through, for example, an RI registry by the data provider/preservation body) about a dataset, which, if necessary, could be used to reconstruct and re-use that dataset on a future technological platform (See Figure 2). From an archivist's perspective, it is an important mechanism for providing access to the data in a consistent manner into the future. It also provides flexibility in terms of the metadata model/format used to capture data-specific RI without being constrained by the ISO 19115 model.
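The shape of this information can be sketched with plain data classes. The sketch below only loosely mirrors PM_RepresentationInformation and PM_FeatureTypeMappingInformation: the attribute types and cardinalities are assumptions, since they depend on the model in Figure 4, and the example.org URLs are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureTypeMappingInformation:
    """Loosely mirrors PM_FeatureTypeMappingInformation (types assumed)."""
    mapping_description: str
    mapping_file: str             # URL of the mapping document
    processing_application: str   # URL describing the software that applies it


@dataclass
class RepresentationInformation:
    """Loosely mirrors PM_RepresentationInformation (cardinalities assumed)."""
    feature_type_mappings: List[FeatureTypeMappingInformation] = field(default_factory=list)
    other_rep_info: List[str] = field(default_factory=list)    # URLs, e.g. data formats
    environment_info: List[str] = field(default_factory=list)  # URLs, e.g. platforms

    def online_resources(self):
        """Return every web reference a future user might need to follow."""
        urls = []
        for mapping in self.feature_type_mappings:
            urls += [mapping.mapping_file, mapping.processing_application]
        return urls + self.other_rep_info + self.environment_info


# Placeholder values only; none of these resources exist.
ri = RepresentationInformation(
    feature_type_mappings=[
        FeatureTypeMappingInformation(
            mapping_description="Source tables to feature view",
            mapping_file="http://example.org/ri/mapping.xml",
            processing_application="http://example.org/ri/wfs-setup.html",
        )
    ],
    other_rep_info=["http://example.org/ri/data-format.html"],
    environment_info=["http://example.org/ri/platform.html"],
)
resources = ri.online_resources()
```

Keeping only URLs in the record reflects the approach described above: the metadata carries an overview, while the detailed RI lives in whatever format the provider or preservation body chooses.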
5.2 Data life cycle information
Detailed information about important preservation-related changes and events occurring during the life-cycle of a dataset is essential for verifying the provenance of a dataset as well as the reliability of its preservation in the future. In addition, this type of information could contain a detailed history of every preservation measure (e.g., migration) applied to a dataset during its lifecycle, in order to assist its future curators in understanding and determining the updated preservation requirements for that dataset. For instance, a provider may choose to migrate an existing road transport dataset into a new database schema more closely reflecting an INSPIRE application schema (a process sometimes known as 'Extraction-Transformation-Load', or ETL); it is important to document this schema transformation for preservation purposes. Similarly, for quality assurance purposes, it is important to be able to verify the history of ownership of a dataset.
With this in mind, the preservation profile extends the LI_Lineage and LI_ProcessStep elements (See Figure 3) defined in the ISO 19115 model to capture detailed information about the lifecycle of a dataset. The dataset lifecycle information in the preservation profile is divided into two main categories: Dataset Provenance Information (change of ownership and/or preservation body) and Dataset Event Information (all major events, including preservation-related ones, such as major platform changes and preservation certification processes that have affected the data during its life cycle, useful for audit trailing and quality checking purposes).
Figure 5: Dataset Event information elements of the ISO 19115 Preservation Profile
Important among these elements is the PM_PreservationCertificationEvent (a specialised PM_Event class shown in Figure 5) defined to provide information about any certification examination(s) conducted, to ensure adequacy of the preservation measure(s) applied to a dataset. This should provide the users with some level of confidence in the preservation method(s) applied to, and consequently, in the longevity of the data.
In the OAIS, this type of information is referred to as 'Preservation Description Information' (See Section 3.2).
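The two categories of lifecycle information described in this section can be sketched as follows. The class and field names are simplified assumptions rather than the profile's actual elements, and the custodians, dates and events are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """A change of ownership or preservation body (simplified)."""
    custodian: str
    start: date


@dataclass
class Event:
    """A lifecycle event, e.g. a migration or a certification (simplified)."""
    kind: str
    when: date
    description: str


@dataclass
class DatasetLifecycle:
    provenance: List[ProvenanceRecord]
    events: List[Event]

    def custodian_on(self, day: date) -> Optional[str]:
        """Return the custodian responsible on a given day, if any."""
        current = None
        for record in sorted(self.provenance, key=lambda r: r.start):
            if record.start <= day:
                current = record.custodian
        return current

    def events_of_kind(self, kind: str) -> List[Event]:
        """Return events of one kind in chronological order (audit trail)."""
        matching = [e for e in self.events if e.kind == kind]
        return sorted(matching, key=lambda e: e.when)


# Invented history for a road transport dataset.
lifecycle = DatasetLifecycle(
    provenance=[
        ProvenanceRecord("Provider A", date(2001, 1, 1)),
        ProvenanceRecord("Archive B", date(2009, 6, 1)),
    ],
    events=[
        Event("certification", date(2010, 3, 1), "Preservation audit passed"),
        Event("migration", date(2008, 5, 1), "ETL into new database schema"),
    ],
)
```

A query such as `lifecycle.custodian_on(date(2005, 1, 1))` then supports the ownership checks mentioned above, while `events_of_kind("migration")` gives future curators the history of preservation measures.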
5.3 Data Authenticity Verification Information
The ISO 19115 model adopts a number of data quality related concepts, for example DQ_Elements as shown in Figure 3, from the ISO 19113 and 19114 standards (for representing the quality principles and evaluation procedures associated with geographic information) in order to provide a detailed account of the quality assurance measures applied to a dataset. The preservation profile adds to this the ability to detect unauthorised modifications to a dataset by recording its fixity information, such as a checksum and digital signature. This may be important, for instance, where major asset management or security programmes depend on the accuracy of information in a dataset, and it is important to be sure that data has not been altered.
Figure 6: Resource Authenticity Verification information elements of the ISO 19115 Preservation Profile
As illustrated in Figure 6 above, the preservation profile defines the PM_ResourceVerificationInformation class as a specialised DQ_Element class (of ISO 19115:2003 core). It is intended to record fixity information (PM_FixityInformation class), such as a checksum and digital signature (PM_SignatureInformation class) about a dataset to enable detection of unauthorised alterations made to that dataset.
In the context of the OAIS information model, this type of information is categorised as the 'Preservation Description Information' associated with a dataset.
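A checksum-based fixity check can be sketched in a few lines. The choice of SHA-256 and the function names below are assumptions made for illustration; the profile itself does not prescribe an algorithm here, and digital signatures are not covered by this sketch.

```python
import hashlib


def compute_checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal digest of the dataset content."""
    return hashlib.new(algorithm, data).hexdigest()


def verify_fixity(data: bytes, recorded_checksum: str, algorithm: str = "sha256") -> bool:
    """Check the content against the checksum recorded in the metadata."""
    return compute_checksum(data, algorithm) == recorded_checksum


# At ingest time, the checksum is computed and stored with the metadata.
original = b"station,temperature\nA1,12.5\n"
recorded = compute_checksum(original)

# Later, any alteration to the content changes the digest.
altered = b"station,temperature\nA1,13.5\n"
unchanged_ok = verify_fixity(original, recorded)
altered_ok = verify_fixity(altered, recorded)
```

A checksum alone shows that content has changed, not who changed it; that is why the profile pairs it with signature information.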
Annotation in the digital world has long been recognised as an effective means of adding value to digital information. It can, in effect, help establish collaborative links between data providers, data users and a preservation body. Thus, annotation has the potential to make a preservation process more efficient, thereby improving the quality of both data and metadata.
However, annotation without the intended context may become meaningless. For example, an annotation may be used to label particular map features with descriptive text, which may contain values of some attributes associated with those features (Bose & Reitsma, 2005). These attribute values alone, without the correct association with the corresponding map features (the annotation context), would be meaningless. For more complex and dynamic environmental datasets, it may be useful for users to be able to annotate specific features or attributes for collaborative analysis or interpretation, for instance in an emergency response scenario. While such annotation is not directly related to preservation, its long-term value is easy to appreciate, for example, during post-disaster audit of response capability.
Figure 7: Annotation elements of the ISO 19115 Preservation Profile
Therefore, the preservation profile defines as extensions to the MD_Usage elements of the core ISO 19115 (See Figure 7) a number of suitably structured elements to capture detailed annotation related information (PM_Annotation class) with traceability to the data context (PM_AnnotationContext class) to which the annotation refers.
6. Implementation of a Preservation-aware SDI
We have also implemented a web-based portal that demonstrates some underlying functions of a preservation-aware SDI. This portal implements the main features of a community Geo-Portal as implied by the INSPIRE directive, such as data discovery using geospatial metadata, data downloading (if applicable), metadata creation and validation and so on. The preservation focused Geo-Portal adds to these features the ability to annotate data and metadata, track different versions of a metadata record, and view preservation related information about a dataset, such as Representation Information and preservation history.
Figure 8: The Data Annotation Interface of the Prototype Geo-Portal
Figure 9: The Metadata Annotation Interface of the Prototype Geo-Portal
As Figures 8 and 9 demonstrate, the annotation interface of this Geo-Portal provides a convenient way of viewing and recording annotation-related information about both data and metadata, such as annotation context, annotator information and the actual annotation. In order to facilitate traceability of annotation context, the annotation interface enables annotate-able context for both data and metadata to be selected in a convenient and innovative manner. For annotating metadata records, it enables selecting particular elements or attributes to be annotated, which are then automatically identified using a unique XPath relative to the record (See Figure 9). For data annotation, the interface enables annotation of specific elements (such as an individual feature or attribute) of the dataset(s) chosen from a list of dataset overviews that are extracted from the original metadata record (See Figure 8).
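One way of deriving a unique path for an annotated metadata element is sketched below using the Python standard library. This is not the portal's implementation: the record is a simplified, non-namespaced stand-in for an ISO 19115 record, and the element names, the annotator address and the `Annotation` class are assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Simplified stand-in for a metadata record; real ISO 19115 records are
# namespaced and far richer. Element names here are assumptions.
RECORD = """<metadata>
  <title>Weather observations</title>
  <contact><name>Provider A</name></contact>
  <contact><name>Archive B</name></contact>
</metadata>"""


def element_paths(root):
    """Yield (path, element) pairs; each path is unique within the record."""
    def walk(parent, prefix):
        seen = {}
        for child in parent:
            # 1-based position among siblings that share the same tag.
            seen[child.tag] = seen.get(child.tag, 0) + 1
            path = f"{prefix}/{child.tag}[{seen[child.tag]}]"
            yield path, child
            yield from walk(child, path)

    yield from walk(root, ".")


@dataclass
class Annotation:
    context_path: str   # path of the annotated element, relative to the record
    annotator: str
    text: str


root = ET.fromstring(RECORD)
paths = dict(element_paths(root))

target_path = "./contact[2]/name[1]"
note = Annotation(target_path, "curator@example.org", "Custodian name needs updating.")

# The stored path can later be resolved back to the annotated element.
resolved = root.find(note.context_path)
```

Because the path is positional, it remains valid only while the record's structure is unchanged, which is one reason to track metadata versions alongside annotations.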
This Geo-Portal prototype demonstrates the feasibility of providing users with the information needed to locate and use a dataset along with assessing its quality, integrity and long-term preservation; or even potentially recreating an INSPIRE-conformant dataset in the future from source, if necessary, through a web-based, easy-to-use interface. The portal uses the aforementioned preservation profile of ISO 19115 as the underlying metadata format for describing the datasets to which it provides access.
The implementation of the Geo-Portal prototype is based on GeoNetwork, an open source and standards-based catalogue service that is widely adopted within the geospatial community for managing and sharing geospatial resources on the web.
6.1 A Test Case
We have tested the prototype Geo-Portal by recording and annotating preservation metadata about some weather observation datasets exposed by an OGC-compliant Web Feature Service (WFS). This WFS is built on the 'Complex Datastore' version of GeoServer, which enables representation of data from a relational database in a GML-based application schema (e.g., Climate Science Modelling Language, CSML) defined independently of the underlying database structure. This special edition of GeoServer was a research endeavour by the SEEGrid and GeoServer communities.
Considering the aforementioned special capability of the WFS, the datasets exposed by it provided ideal examples of 'feature-based' representations of source spatial datasets. We used the preservation profile of ISO 19115 (through the prototype Geo-Portal's metadata recording facility) to record useful Representation Information (RI) for some of these datasets. This RI included the mappings used to generate a "feature-based" canonical representation of a dataset as well as other metadata. The following XML snippet provides an example of such an RI.
Listing 1: an example of Representation Information recorded using the ISO 19115 Preservation Profile
Of particular note in the above XML snippet are the 'CI_OnlineResource' related metadata elements, such as 'mappingFile' and 'processingApplication'. These elements are defined to record references to web-based resources providing more comprehensive (and possibly complex) information about the aspects of the data that they represent. In the above XML snippet, the 'processingApplication' element points to a web-based document providing detailed information about the GeoServer WFS, such as the input parameters and computer platform required to apply the mappings (described by the 'mappingDescription' and 'mappingFile' elements) to the corresponding dataset. These web-based resources could be encoded in any format chosen by the preservation body/archive concerned. Thus, the preservation profile of ISO 19115 provides flexibility in terms of the metadata model/format used to capture data-specific RI without being constrained by the ISO 19115 model while ensuring the accessibility of such information in a uniform and coherent manner.
7. Conclusions and Future Direction
Long-term preservation of environmental data exposed through uniform and interoperable SDIs is not currently addressed in the INSPIRE Directive but is highly important for applications that require continued access to both current and historical data, for instance for monitoring climate change. The work presented in this article investigates the requirements for ensuring sustained access to environmental data from the perspective of a preservation-aware and INSPIRE-conformant SDI. In addition, the work demonstrates an approach to implementing such an SDI, providing the information needed to assess the quality, integrity and long-term preservation of data, as well as ensuring its effective discovery and use.
Future work in this area would need to focus on the implementation of efficient and interoperable preservation solutions for the data repositories made available through the SDI. In particular, this should include preserving mappings (potentially with varying levels of complexity) between a source dataset and its corresponding feature view(s) to enable accurate recreation of those feature views in the future. The emerging ESA LTDP initiative is expected to make significant contributions to this area as it aims to formulate a coordinated and coherent approach to the long-term preservation of EO space data archives across Europe. The work presented in this article may be relevant as an exploratory exercise for this ESA initiative.
 OGC Web Feature Service: OpenGIS Web Feature Service (WFS) Implementation Specification.
 PREMIS data dictionary: a framework for defining and describing a set of core preservation metadata (based on the OAIS reference model) that would be required to facilitate a long-term data preservation process in a digital archive (PREMIS, 2008).
 Bos, M., Gollin, H., Gerber, U., Leuthold, J., and Meyer, U., 2010. Archiving of geodata, A joint preliminary study by swisstopo and the Swiss Federal Archive, SWISS Archive.
 Bose, R. and Reitsma, F., 2005. Advancing Geospatial Data Curation. Conference on Ensuring Long-term Preservation and Adding Value to Scientific and Technical Data, online papers archived by the Institute of Geography, School of Geosciences, University of Edinburgh.
 Janée, G., Mathena, J. and Frew, J., 2008. A Data Model and Architecture for Long-term Preservation. Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL):134-144. http://dx.doi.org/10.1145/1378889.1378912.
 Consultative Committee for Space Data Systems (CCSDS), 2002. Reference Model for an Open Archival Information System (OAIS). Recommendation for Space Data Systems Standard, CCSDS Blue Book.
 Shaon, A. and Woolf, A. 2008. An OAIS Based Approach to Effective Long-term Digital Metadata Curation, Computer and Information Science, 1(2), 2-12.