Levels of Abstraction
This month, Scientific American celebrates its 150th anniversary and has devoted its monthly issue to key technologies of the 21st Century. Not surprisingly, information technologies figure prominently not just in the section given over to them but also as a recurrent motif in the sections on transportation, medicine, manufacturing, energy and environment, and society. Collectively, these stories illustrate the range of information technologies -- from microprocessors to knowledge bases to computer-assisted design.
That we can talk about electronic information at multiple levels of abstraction reflects a fundamental attribute of software -- that the same type of encoded information can contain both the instructions to the machine and the data over which those instructions operate. This is the stored program concept, which was articulated by Alan Turing in 1937, and which forms a cornerstone of John von Neumann's computer architecture. The abstraction of the instructions from the physical configuration of the equipment, together with advances in information theory and hardware engineering, enabled computers to evolve from special-function to multi-purpose machines capable of handling greater quantities of data at higher speeds, while the software itself evolved into a complex hierarchy of abstractions and languages. These range from assembly language to operating systems to special-purpose end-user applications, where software wears a human face.
This progression, it seems to me, has two important and interrelated dimensions. First, the very model of computing itself, with its levels of abstraction, has a powerful parallel in libraries. The core idea of a "library" is a set of operations and procedures required to create, manage, and access a collection regardless of the format or media of objects in the collection. Indeed, the term "library" has already migrated to the world of software, where developers often use the term "software libraries" for sets of procedures and functions. Second, the many advances in technology (including software) have permitted us to create and capture new forms of digital data; witness the archives of remote sensing and other satellite-based information. So, if we take the notion of a software library and add to it the notion of a digital collection, we get a "digital library". And digital libraries, like software, can potentially support levels of abstraction from those internal to the software to those that facilitate human-computer interactions.
For example, the stored program idea and the related notion of binary code, which can be represented by pulses of electrons, have helped to create great masses of data, like satellite imagery or remote sensing data, as well as the tools that enable us to interpret the stream of signals and reconstruct the image. To move to the next level of abstraction, we want to be able to employ another collection of tools that will enable end-users to find information in the information -- all the frames, for example, that exhibit the moons around Jupiter from all of the Voyager fly-bys.
Clearly, we have made more progress with highly structured data like genetic code or protein sequences and with information whose native expression is alphanumeric (whether it originated on a word processor or from a scanner) than we have with the content of images or the deep structure of media-diverse information. But even the more developed applications require users to know quite a bit about the subject. Searching GenBank, for example, is extremely difficult unless you know what protein or genetic sequences are. Nevertheless, in the case of GenBank and similar databases, an agreed-upon vocabulary for identifying the phenomena, together with the power of digital representation, means that searching is now possible at a level and scope that were hitherto impossible. End-user searching of text, moreover, has advanced beyond strict Boolean operations based on controlled vocabularies to querying that seeks to capture degrees of relevance and similarity. We may not be able to ask a library of images to return all examples of Renaissance oil paintings portraying Mary, Anne, and the infant Jesus, but we can ask it to examine very long streams of telemetry signals and identify specific forms. If these forms have meaning for us, so much the better. Finding the forms is the first step; finding all of the paintings that are similar to Leonardo's Virgin of the Rocks may have to wait a day or three.