Replication of Results
and the Need for Test Suites

William Y. Arms

Preliminary Draft
January 2, 1998

The overall objective

The current phase of digital library research is highly empirical. A researcher who is developing a new concept implements software that incorporates the concept, demonstrates it with some trial set of data, reports observations on the results, and encourages others to build on the work. This is an effective method of working during the early stages of an experimental field, but as the field matures, we need a more systematic methodology.

For example, three of the current DLI projects are working on image recognition. Each tackles a different aspect of the same problem: searching collections for images that match specified criteria. However, the three projects apply their work to different applications and different data. Therefore, any comparison of the three approaches is highly subjective.

There are two closely related needs:

Replication of results
It should be possible for other researchers to repeat experiments, with different data and different implementations, and to replicate the basic results.

Evaluation against repeatable criteria
Results should be evaluated against relevant, repeatable criteria, so that the strengths and weaknesses of alternative approaches can be compared and improvements measured.
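As a concrete, if simplified, illustration of repeatable criteria: in retrieval research, precision and recall are standard measures that any group can compute on a shared, labeled test set. The sketch below (the data and system names are invented for illustration) scores two hypothetical image-search approaches against the same set of relevant images for one query, so that a comparison rests on repeatable numbers rather than subjective impressions.

```python
# Hypothetical illustration: two retrieval systems scored against the
# same labeled test set, so their relative strengths can be compared
# on repeatable criteria.

def precision_recall(retrieved, relevant):
    """Standard precision/recall over sets of item identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Shared test data: the relevant images for one query (invented labels).
relevant = {"img01", "img04", "img07", "img09"}

# Results returned by two competing approaches on the same query.
system_a = {"img01", "img02", "img04", "img09"}
system_b = {"img01", "img03", "img04", "img05", "img07", "img08"}

for name, results in [("A", system_a), ("B", system_b)]:
    p, r = precision_recall(results, relevant)
    print(f"System {name}: precision={p:.2f} recall={r:.2f}")
```

Because the test data and the measures are fixed, a third group could rerun the same evaluation on its own implementation and replicate, or challenge, the comparison.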

The need for test suites

We hope that the D-Lib Metrics working group will help develop ways to measure the effectiveness of various aspects of digital library research. The next requirement is standard test data that researchers can use to evaluate their work.

I envisage a test suite consisting of standard sets of test data that represent the major categories of material in digital libraries. The requirements for the test suite are demanding:
