Linda L. Hill
Computational models are created to simulate a set of processes observed in the natural world in order to gain an understanding of these processes and to predict the outcome of natural processes given a specific set of input parameters. Conceptual and theoretical modeling constructs are expressed as sets of algorithms and implemented as software packages. The modeling software packages, if adequately described for human understanding and machine processing, can become objects in digital library collections where they can be found and used in applications without the direct involvement of the creator. This amounts to the publishing of modeling software with accompanying metadata in the same way that other publications are treated in library collections. This paper addresses the requirements for a content standard to describe such computational models. This work is part of the Alexandria Digital Earth Prototype (ADEPT) project at the University of California, Santa Barbara, an NSF Digital Library II project (Alexandria Digital Library Project, 2001). The intent is to add modeling software packages as collection objects in the ADEPT collections to support research, education, and learning activities and to enable the matching of appropriate datasets in the digital library collections to modeling software.
The creation of computational modeling software has grown at an accelerating rate since the earliest applications to the modeling of real-world phenomena during the 1940s. Supported by increasingly powerful hardware, software, and networking environments, growing numbers and varieties of computational models are being developed to support research, development, and education in all areas. For a variety of historical and technical reasons, including major problems of interoperability at all semantic levels and weak support for "publishing" computational models, library-based mechanisms to support the widespread distribution and use of modeling software have been slow to develop. While distributed digital libraries (DLs) and the World Wide Web offer a natural infrastructure for such distribution, critical aspects of an effective infrastructure have yet to evolve. In particular, there are no generally accepted procedures for describing computational models in ways that support cataloging, search, selection, and use.
In this paper, we propose a content standard for describing computational models. This Content Standard for Computational Models (CSCM) was developed partly in response to the general need for such descriptions and partly in response to the immediate needs of the Alexandria Digital Earth Prototype (ADEPT) project at the University of California, Santa Barbara (UCSB). ADEPT is developing services that facilitate the construction of personalized digital collections that support learning in a variety of contexts. Since the project views computational models of environmental phenomena as critical DL resources for helping students understand and reason scientifically about natural and human-influenced phenomena, it is useful to provide a metadata framework to standardize the way in which modeling software is described so that models can be integrated into DLs with other types of information in support of education and learning.
Content Standards and Computational Modeling Software
The primary purpose of CSCM is to provide enough information that potential users of the model (other than its creators) have a reasonable chance of finding it in a distributed DL environment, evaluating its potential applicability for their purposes (e.g., research, education), obtaining it, running it successfully in some computational environment and with appropriate datasets, and understanding the results. Computational modeling software will process certain kinds of data and produce specified output; it will incorporate certain variables and parameters; it will have known limitations and will be more suitable for some uses than for others; it may operate only in some computational environments and may require that other software packages be simultaneously available; and its use may be subject to licensing agreements. It is important to provide potential users with an understanding of these aspects and also with a sense of the theoretical and computational choices made by the modeler to represent the real-world phenomenon. All of these characteristics need to be documented in metadata, along with contact information for obtaining the software or getting help in using it.
In relation to these primary goals, we note that the standard is not intended to define the manner in which the information is presented to a user, but to specify a description framework to support search, retrieval, and evaluation. The design of user interfaces and report presentations is an independent activity based on the metadata structure. We also note that the standard is not currently specified to the point of being able to fully support machine-machine analogs of such activities.
In developing a CSCM, one must resolve some issues that are generic to metadata standards and others that are specific to computational models. We have adopted the metadata design framework of the International Organization for Standardization's TC 211 group for their metadata standard for geographic information (International Organization for Standardization (ISO), 2000), which is in turn based on the U.S. Federal Geographic Data Committee's Content Standard for Digital Geospatial Metadata (U.S. Federal Geographic Data Committee, 1998). Furthermore, we assume that metadata based on the CSCM will co-exist in DLs with other metadata structures. The use, for example, of a relatively small number of search buckets into which heterogeneous metadata descriptions can be mapped provides a useful mechanism for supporting interoperability among different metadata representations (Frew et al., 1999).
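As a rough illustration of the search-bucket idea, the Python sketch below maps two differently structured metadata records into a small shared set of buckets and then searches across both. The bucket names, scheme names, field names, and sample records are all hypothetical; they are not taken from the CSCM or from the bucket definitions in Frew et al. (1999).

```python
# Hypothetical field-to-bucket mappings for two metadata schemes.
# None of these names come from the CSCM or from the ADEPT bucket definitions.
BUCKET_MAPPINGS = {
    "model_metadata": {
        "model_title": "title",
        "model_author": "originator",
        "theme_keywords": "subject",
    },
    "dataset_metadata": {
        "name": "title",
        "producer": "originator",
        "topic": "subject",
    },
}


def to_buckets(record, scheme):
    """Map one metadata record into the shared search buckets.

    Fields with no mapping are ignored; each bucket collects a list of values.
    """
    mapping = BUCKET_MAPPINGS[scheme]
    buckets = {}
    for field, value in record.items():
        bucket = mapping.get(field)
        if bucket is not None:
            buckets.setdefault(bucket, []).append(value)
    return buckets


def search(bucketed_records, bucket, term):
    """Return records whose given bucket contains the term (case-insensitive)."""
    term = term.lower()
    return [
        rec for rec in bucketed_records
        if any(term in str(value).lower() for value in rec.get(bucket, []))
    ]


# Invented sample records in the two hypothetical schemes.
model = to_buckets(
    {"model_title": "Erosion Model", "model_author": "Smith", "language": "C"},
    "model_metadata",
)
dataset = to_buckets(
    {"name": "Elevation grid", "producer": "Smith", "topic": "terrain"},
    "dataset_metadata",
)
hits = search([model, dataset], "originator", "smith")
```

In this sketch a single query on the originator bucket finds both the model and the dataset, even though the two records name that field differently.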
There exist many definitions of models in general and computational models in particular (Aris, 1978 (reprinted 1994); Benz, 1997; Chorley, 1967; Dee, 1994). An adequate core definition of computational models for current purposes is:
a set of computational codes, executable in some software/hardware environment, that transform a set of input data into a set of output data, with the input, output, and transformation typically having some interpretation in terms of real-world phenomena.
Two specific examples of models satisfying this broad definition have been described using this initial version of the CSCM to test the design of the content standard (see <http://www.alexandria.ucsb.edu/doc>). The first model (Smith, 2001) takes the form of a set of C-language codes that transform two initial input datasets into two output datasets. The input datasets represent a land surface and a flow of water over the surface; the transformation represents a time-dependent erosion process; and the two output datasets represent the land surface and water flow field at later times. The second example (Clarke, 2001) is a cellular automaton model of urban growth in which multiple datasets showing a variety of land cover properties for at least four urban time periods are used as input, and the output visualizes and predicts urban growth into the future by using urban growth coefficients. Model metadata will continue to be created for the ADEPT project, with modifications to the CSCM as necessary.
Several general considerations arise in deciding how to structure a content standard for computational modeling software. First, the syntactic and semantic complexity of many models makes it difficult to provide a definitive metadata description of reasonable length, more difficult than in the case of many other classes of digital objects. Hence, a specific strategy has been to assume that search, evaluation, and use are typically iterative processes, requiring that the metadata contain pointers to more detailed information, which in turn may contain other pointers. Second, it is useful to have a conceptual framework to help guide the design of a content standard for computational models. Drawing on various characterizations of models (Aris, 1978 (reprinted 1994); Benz, 1997; Chorley, 1967; Dee, 1994), we view models as generally having four increasingly specific levels of representation, in both syntactic and semantic terms. These are the conceptual, symbolic, algorithmic, and coding representations of the model.
The conceptual representation describes the model at the highest level. For the erosion model, for example, it would characterize the model in terms of land and water surfaces, the conservation of water flowing over a surface, and the conservation of sediment eroded from the surface and transported by the water. The symbolic representation is typically, but not always, in terms of some mathematical or logical language with an interpretation of the symbols in terms of real-world phenomena. In the case of the erosion model, this representation takes the form of two partial differential equations. The algorithmic representation provides a high-level view of how the symbolic representation is converted into a set of computations, while the coding representation of these algorithms provides codes that are, or can be compiled into, executables in some specific computing environment. The erosion model, for example, is specified at the algorithmic level by indicating that the water flow equation is transformed into a finite difference scheme using an upwind scheme and that the land surface erosion equation is transformed into a finite difference scheme using a Crank-Nicolson scheme. At the coding level, it is specified by a set of C-language programs and the environment in which they would run. Hence we may view the information represented in these four categories as moving from a high-level description of the model and its applicability to the details needed to execute it in a specific computing environment. The ADEPT CSCM provides a structure for these levels of description through narrative elements and elements for specific details of input and output variables, parameters, datasets, and processing flow.
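To make the step from the symbolic to the algorithmic and coding levels concrete, the sketch below applies a first-order upwind finite difference scheme to a generic one-dimensional advection equation, du/dt + c du/dx = 0 with c > 0. It is not the erosion model's code (which is written in C and solves different equations); the equation, the grid, the periodic boundary, and all names are assumptions chosen only for illustration.

```python
def upwind_step(u, c, dx, dt):
    """Advance u by one time step with a first-order upwind scheme.

    Assumes c > 0 (flow toward increasing x) and a periodic domain,
    so cell 0 takes its upstream value from the last cell.
    """
    courant = c * dt / dx
    if not 0.0 <= courant <= 1.0:
        raise ValueError("scheme is stable only for 0 <= c*dt/dx <= 1")
    # u[i - 1] with i == 0 wraps to u[-1], which gives the periodic boundary.
    return [u[i] - courant * (u[i] - u[i - 1]) for i in range(len(u))]


def run(u0, c, dx, dt, steps):
    """Apply upwind_step repeatedly, starting from a copy of u0."""
    u = list(u0)
    for _ in range(steps):
        u = upwind_step(u, c, dx, dt)
    return u


# A square pulse on a 20-cell periodic grid (illustrative values only).
initial = [1.0 if 5 <= i < 10 else 0.0 for i in range(20)]
shifted = run(initial, c=1.0, dx=1.0, dt=1.0, steps=3)  # Courant number 1.0
smeared = run(initial, c=1.0, dx=1.0, dt=0.5, steps=6)  # Courant number 0.5
```

At a Courant number of 1 the pulse moves exactly one cell per step. At 0.5 it is carried the same distance but spread out by the numerical diffusion typical of first-order upwind schemes. Choices of this kind (scheme, step sizes, boundary treatment) are what the algorithmic and coding levels of a model description would need to record.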
The CSCM consists of approximately 165 elements divided into ten sections:
An outline of the elements (version dated May 2001) is included as an appendix to this article. The standard continues to evolve through interaction with the ADEPT metadata and collection building efforts. This version and the latest version of the CSCM are available for download from the Alexandria Digital Library Project's website (under Documentation) at <http://www.alexandria.ucsb.edu>. Here you will also find examples of the use of the content standard, including those referenced above.
Content Standard Issues
Of the many issues related to the design of the CSCM, the following are highlighted because they were central to our internal discussions.
CSCM Descriptive Design
Identification (CSCM sections 1 and 3)
Fitness for Use (CSCM sections 2 and 9)
One of the key purposes of model documentation is to provide information about the calibration and validation tests that have been used, the experiments that have been run, the peer reviews that have been published, and the current known uses. Particularly useful is a citation to a dataset that can be used to test the model. Some of this information will accumulate through time as the model is used and may exist independent of the metadata description of the model itself. However, to the extent possible, having citations to external sources containing reviews and experiments will be very valuable for evaluation of fitness for use.
Access and Constraints (CSCM section 4)
Environment (CSCM section 5)
Functionality (CSCM sections 6, 7, and 8)
Metadata Documentation (CSCM section 10)
Definition of Elements
The definitional format used by the CSCM is recognized internationally by the geospatial community. A data element is the logically primitive item of metadata. Compound data elements are called entities. They consist of groups of data elements and other compound elements. Each element and entity is defined by the following characteristics:
Obligation / Condition
Obligation applies within a nested set of elements. If an entity (compound element) is Optional, elements in the set that are Mandatory only apply if the entity itself is selected to be used in the description of a model.
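The nesting rule can be sketched as a small recursive check in Python. The schema layout, the "M"/"O" codes, and the element names below are hypothetical simplifications rather than the CSCM's own encoding, and conditional obligations are not handled.

```python
def missing_mandatory(description, schema):
    """List paths of Mandatory elements that are absent from a description.

    schema maps an element name to a dict with:
      "obligation": "M" (Mandatory) or "O" (Optional)
      "children":   nested schema, present only for entities (compound elements)
    Mandatory children are checked only when their parent entity is in use.
    """
    missing = []
    for name, spec in schema.items():
        if name not in description:
            if spec["obligation"] == "M":
                missing.append(name)
            continue  # an unused entity imposes nothing on its children
        children = spec.get("children")
        if children:
            for path in missing_mandatory(description[name], children):
                missing.append(name + "/" + path)
    return missing


# Hypothetical schema: an Optional entity with one Mandatory child.
SCHEMA = {
    "title": {"obligation": "M"},
    "contact": {
        "obligation": "O",
        "children": {
            "name": {"obligation": "M"},
            "email": {"obligation": "O"},
        },
    },
}

no_contact = missing_mandatory({"title": "Erosion Model"}, SCHEMA)
partial_contact = missing_mandatory(
    {"title": "Erosion Model", "contact": {"email": "someone@example.org"}},
    SCHEMA,
)
```

Here a description that omits the optional contact entity is complete, while one that uses the entity but leaves out its mandatory name is reported as incomplete.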
The domain may also indicate that it is free from restrictions, so that any value that can be represented by the type of the data element can be assigned. Such unrestricted domains are indicated by the word free followed by the type of the data element.
Some domains can be partly, but not completely, specified with code lists. In these circumstances, the list includes the option of other as a valid value. When other is available for selection, a conditional element is provided where the other value can be entered.
In cases where domain values can be selected from external sources (e.g., from a classification scheme or thesaurus), compound elements are used to document the source of the terminology or classification notation.
For compound elements, the domain specifies the section and line numbers of the elements that make up the compound description.
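A minimal Python sketch of how free domains and code lists with an other option might be checked is given below. The tuple encoding of a domain, the function name, and the sample code list are assumptions made for illustration; domains drawn from external vocabularies and compound-element domains are left out.

```python
def check_value(value, domain, other_text=None):
    """Return a list of problems with a value under a simplified domain.

    domain is either ("free", python_type) or ("codelist", set_of_codes).
    A code list that includes "other" requires the accompanying
    conditional element (other_text) to be filled in when "other" is chosen.
    """
    kind, spec = domain
    if kind == "free":
        return [] if isinstance(value, spec) else ["expected " + spec.__name__]
    if kind == "codelist":
        if value not in spec:
            return ["value not in code list"]
        if value == "other" and not other_text:
            return ["'other' selected but no other-value supplied"]
        return []
    raise ValueError("unknown domain kind: " + str(kind))


# Hypothetical domains: a free-text element and a partly specified code list.
FREE_TEXT = ("free", str)
LANGUAGES = ("codelist", {"C", "Fortran", "other"})

ok_free = check_value("Simulates hillslope erosion", FREE_TEXT)
ok_code = check_value("C", LANGUAGES)
bare_other = check_value("other", LANGUAGES)
full_other = check_value("other", LANGUAGES, other_text="Pascal")
```

Treating the other value and its conditional element as a pair keeps the code list open-ended without losing the information that a cataloger actually entered.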
CSCM for the Modeling and Digital Library Communities
Digital library designers need to think in terms of all forms of scholarly knowledge. This includes more than text and data. Increasingly, image and geospatially oriented forms of information are being incorporated into DLs. Adding computational models is a natural extension and will facilitate awareness and use of modeling software for research and education. Programmatic services that "understand" the metadata of models and datasets, both existing in DLs, can begin assisting users in making good matches for experimentation. Digital library services that capture the output of modeling runs and facilitate documenting them will greatly enhance the potential re-use of model output for learning and training. The value of uniform descriptions of modeling software based on content standards will be recognized in this environment. Researchers and instructors will recognize the value of well-formed metadata for comparing and contrasting models and understanding the thought processes and system elements behind the models. Good documentation will also encourage reliable calibration and validation efforts in order for models to gain recognition as accurate re-creations of natural systems. We can expect the development of tools and services designed to ease the creation of documentation and the use of models to follow the adoption of content standards for computational models.
This work is funded by a National Science Foundation grant for the Alexandria Digital Earth Prototype Project (IIS-9817432; Smith and Goodchild, University of California at Santa Barbara). We also acknowledge the valuable discussions of CSCM design issues with members of the ADEPT research staff and with a group of professors and students who attended a half-day workshop on the UCSB campus. ADEPT staff member Tim Tierney has developed an XML metadata creation tool based on the CSCM.
Alexandria Digital Library Project. (2001). Alexandria Digital Earth Prototype (ADEPT). University of California, Santa Barbara. Available: <http://www.alexandria.ucsb.edu> .
Aris, R. (1978 (reprinted 1994)). Mathematical modelling techniques. London; San Francisco: Pitman (Dover, New York).
Benz, J., et al. (1997). Documentation of mathematical models in ecology; unpopular task? Ecomod, December, 1997, 1-7.
Chorley, R. J., & Haggett, P. (1967). Integrated Models in Geography. Worcester & London: Ebenezer Baylis & Sons Ltd.
Clarke, K. (2001). SLEUTH Urban Growth Model (version 2.1). National Center for Geographic Information Analysis, Santa Barbara. Available: <http://www.ncgia.ucsb.edu/projects/gig/project_gig.htm> .
Dee, D. P. (1994). Guidelines for Documenting the Validity of Computational Modeling Software (24 pp.). Delft, The Netherlands: International Association of Hydraulic Research.
Digital Library for Earth System Education. (2001). Homepage. Available: <http://www.dlese.org>.
Frew, J., Freeston, M., Hill, L., Janee, G., Larsgaard, M., & Zheng, Q. (1999). Generic query metadata for geospatial digital libraries, Proceedings of the Third IEEE Meta-Data Conference (Meta-Data '99), April 6-7, 1999, Bethesda, MD, sponsored by IEEE, NOAA, Raytheon ITSS Corp., and NIMA.
International Organization for Standardization (ISO). (2000). Geographic Information Metadata (CD 19115.3): International Organization for Standardization (ISO).
Smith, T. R. (2001). Erosion Model. Available: <http://www.alexandria.ucsb.edu/doc>.
U.S. Federal Geographic Data Committee. (1998). Content Standard for Digital Geospatial Metadata. Available: <http://fgdc.er.usgs.gov/metadata/contstan.html> [2001, May 11].
Appendix: Outline of Content Standard for Computational Models
Copyright 2001 Linda L. Hill, Scott J. Crosier, Terrance R. Smith and Michael Goodchild