The Model Editions Partnership

"Smart Text" and Beyond

David R. Chesnutt
Department of History
University of South Carolina
Columbia, South Carolina

D-Lib Magazine, July/August 1997

ISSN 1082-9873

The Model Editions Partnership, a consortium of seven historical editions, is currently developing a series of prototypes which will be mounted on the World-Wide Web later this year. These small samples (equivalent to 150-200 pages) will demonstrate a variety of intellectual approaches to creating new editions for the Internet. Using a subset of the SGML markup system developed by the Text Encoding Initiative (TEI), the editors are preparing image editions (using images of historical manuscripts) and live text editions (using transcribed historical documents). A third approach uses an SQL database with CGI scripts to provide the user interface. The user interface for the SGML models uses software provided under a grant from Electronic Book Technologies.

The Partnership, now in its second year, is centered at the University of South Carolina and supported by a three-year grant from the National Historical Publications and Records Commission. In the first year, the Partners focused largely on developing a system design to meet the needs of scholarly editors (see a "Prospectus for Electronic Historical Editions," http://mep.cla.sc.edu). This year, they have concentrated on developing a TEI/SGML markup system and preparing the content of the prototypes. The final year will be devoted to launching the prototypes and documenting the experience.

As a successor to the Partnership, we are now preparing the ground to build an American Documentary Heritage database (ADH). Unlike the Partnership, the ADH would include modern editions from all disciplines which publish letters, diaries, journals, public records, and other documentary source materials. The goal is to make the best of modern scholarship available not only in colleges and universities, but also in public libraries and in high schools which offer advanced placement courses in history, literature and the arts. The project involves new kinds of partnerships with both publishers and libraries and posits a self-sustaining economic model. But that's another story. For the moment, I want to give you a sense of the variety of the sample editions underway; to elaborate on the Partnership's experience with the TEI/SGML markup; and then to conclude with a few comments about areas where collaborative research could be fruitful.

An Overview of the Prototypes

The scholars who head the Partner editions are all seasoned documentary editors whose work reflects the diversity of the field itself. Each of the prototypes reflects the kinds of material they work with and the editorial expertise they bring to the Partnership. This is a quick look at how the mini-editions are shaping up.

The Margaret Sanger Papers: Images of manuscripts from a variety of collections relating to Sanger's indictment for postal violations, with hypertext links to research files, a chronology of Sanger's life, and information relating to the sources.
The Lincoln Legal Papers: Images of original manuscripts from every court in which Lincoln practiced law, organized into case histories, plus indexes at the subject level and a unique search facility.
The Papers of General Nathanael Greene: A group of "live texts" of documents and abstracts related to the "Race for the Dan" from the published series, with links to the full texts of letters and documents abstracted in the printed series.
The Documentary History of the First Federal Congress: A selection linking live texts relating to the creation of the Executive Department drawn from many volumes in the series as well as unpublished letters on the subject.
The Documentary History of the Constitution and the Bill of Rights: A selection of live texts drawn from the volume A Necessary Evil with sophisticated markup for indexing and retrieval.
The Papers of Elizabeth Cady Stanton and Susan B. Anthony: A selection of live texts published in the first volume of the edition and images of original manuscripts and maps documenting their efforts to organize women in western New York.
The Papers of Henry Laurens: A selection of live texts from the published series documenting the seizure of power from the Royal authorities during the early stages of the American Revolution and utilizing the published index.

With the exception of the Lincoln Legal Papers, all of the texts and images are part of an SGML database.

Gluing it together with SGML

At first glance, the 1300-page TEI Guidelines would appear to be a formidable barrier to any scholar not schooled in technology. Bound in green wrappers and commonly referred to as the "green books," the two volumes provide markup schemes for texts ranging from simple poems to others which require word-level markup for linguistic analysis. The needs of documentary editors fall somewhere in between. By taking a subset of the TEI markup and making a few alterations, we have developed a markup system that seems far less intimidating. We are now developing tutorials with explicit examples designed for scholarly editors which, we hope, will lower the barriers even further. But why adopt a complicated SGML markup system in the first place, when something as easy to use as HTML is available?

From a scholar's point of view, HTML has two strikes against it. First, HTML was designed as formatting markup to determine the appearance of text on a computer screen. It's still driven today by that "presentational" aspect. For example, companies like Netscape and Microsoft support extensions of HTML that reflect the desires of commercial customers to create more attractive pages. Consequently, the new extensions resemble the typesetting markup used to control the appearance of books, journals, and other printed matter.

The second strike against HTML is its simplicity. For example, user searches cannot conveniently distinguish between Washington the president and Washington the nation's capital, or between information about Washington the person and documents written by Washington the person. Because of the ever-increasing quantity of material on the Web, intellectual access is a major issue. That is where complex markup systems like those developed by the TEI become the better alternative.

Consider the following example. Letters are the most common type of documents found in documentary editions. This is true for historical editions, literary editions, and editions which deal with the history of science, religion or any other discipline within the humanities.

<doc>

<head>Henry Laurens to George Washington</head>
<dateline>Philadelphia, July 11, 1778</dateline>
<docbody>

<salute>Sir.</salute>
I beg leave to refer Your Excellency to ....
The present Cover will convey ... Two Acts....
<list >

<item>1. Empowering Your Excellency to call in the Aid of such Militia ...</item>
<item>2. Intimating the desire of Congress that Your Excellency Co-operate with Vice Admiral Count d'Estaing ....</item>

</list>
Congress have directed me to propose ....
<closing>I have the honor to be.....</closing>
<signed>Henry Laurens.....</signed>

</docbody>
<sourcenote> ALS, Washington Papers, DLC; addresse....</sourcenote>

</doc>

With this kind of markup, we begin to see how the document is organized. And with good software, we can use style sheets to determine the letter's appearance -- as well as the appearance of all other letters. More striking, however, are the ways we can use the markup to infer information about the letter and other documents in the series.

If "George Washington" appears in the <head>, he is almost certainly the author or the addressee of the letter. If he appears in the <docbody>, he is almost certainly being talked about. And if he appears in the <sourcenote>, the reference is almost certainly to the location of the original letter. The more sophisticated SGML markup, in and of itself, describes only the structure of the letter. At the same time, the markup serves as a kind of "metadata" which allows users to construct more intelligent and rewarding searches. The search for "George Washington" in the <head> returns a list of all the letters to and from Washington; the search in the <docbody> element, a list of all the letters in which he is discussed; and the search in the <sourcenote> element, a list of all the locations of the original letters -- all of which are very useful to researchers and difficult, if not impossible, to replicate in HTML markup.
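
The element-scoped searches just described can be sketched in a few lines of Python. This is only an illustration, not the Partnership's software. It assumes the markup is well-formed enough to be read by the XML parser in Python's standard library, it wraps two abbreviated letters in a hypothetical <edition> element, and the second letter is invented for contrast. The function name search is likewise introduced here.

```python
import xml.etree.ElementTree as ET

# Two abbreviated letters using the element names from the sample above.
# The <edition> wrapper is hypothetical and the second letter is invented.
LETTERS = """
<edition>
  <doc>
    <head>Henry Laurens to George Washington</head>
    <docbody>I beg leave to refer Your Excellency to my last.</docbody>
    <sourcenote>ALS, Washington Papers, DLC.</sourcenote>
  </doc>
  <doc>
    <head>Henry Laurens to John Laurens</head>
    <docbody>I wrote to George Washington this morning.</docbody>
    <sourcenote>Invented source note.</sourcenote>
  </doc>
</edition>
"""

def search(root, element, phrase):
    """Return the <head> of every <doc> whose named element contains phrase."""
    hits = []
    for doc in root.iter("doc"):
        for node in doc.iter(element):
            text = "".join(node.itertext())
            if phrase.lower() in text.lower():
                hits.append(doc.findtext("head"))
                break  # one hit per document is enough
    return hits

root = ET.fromstring(LETTERS)
print(search(root, "head", "George Washington"))     # letters to or from him
print(search(root, "docbody", "George Washington"))  # letters discussing him
print(search(root, "sourcenote", "Washington"))      # locations of originals
```

The same phrase yields three different result lists depending only on the element searched, which is the distinction a flat HTML page cannot express.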

Markup as Metadata

"Metadata" is becoming a buzzword, and the connotations associated with it largely depend on which group is using the term. Librarians may mean one thing; computer scientists another; and archivists still another. But basically, it's just a convenient way of saying that one kind of data is being used to provide information about another kind of data.

Scholarly editions have many kinds of metadata. Printed volumes abound with footnotes or endnotes which provide additional information about the information contained in the documents. Editions frequently include biographical dictionaries which identify the cast of characters. And they usually include indexes which provide conceptual views of the text. All of these are "snapshots" of particular "views" of the text or additions to it. What makes these views and additions available in an electronic edition is the markup. One aim of the Partnership is to provide a range of prototypes which demonstrate some of the different ways of constructing intellectual views of the text.

"Some of the different ways" is a phrase which reflects the reality imposed by finite resources. From one point of view, the Partnership could be a "never ending story" -- which would undoubtedly give our funders cause for concern. As the editors have become more confident, their visions of the prototypes have expanded. For example, the documents drawn from the editions on the Ratification of the Constitution and the First Federal Congress contain diary entries and newspaper accounts which provide reports of speeches made during the constitutional and congressional debates. To enable users to locate all of the accounts of a particular individual's speech, we developed special markup features which allow us to connect the name of the individual with the report of his speech. A search for "James Madison" as speaker will return a list of all documents containing reports of his speeches. Or, the user can couple the search with dates to narrow the search to Madison's speeches within a certain time period. And, if we provide subject indexing for the topics of speeches, it becomes possible to connect the speaker search with the subject term.

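A speaker search coupled with dates and subject terms might look something like the following Python sketch. The special markup features themselves are not shown in this article, so everything here is an assumption made for illustration: the <report> element, its speaker, date and subject attributes, the document ids, and the sample records, which are invented placeholders rather than entries from either edition.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical markup: element and attribute names are assumptions,
# and the records are invented placeholders.
SAMPLE = """
<edition>
  <doc id="d1"><report speaker="James Madison" date="1789-06-08" subject="amendments">...</report></doc>
  <doc id="d2"><report speaker="James Madison" date="1789-08-15" subject="revenue">...</report></doc>
  <doc id="d3"><report speaker="Fisher Ames" date="1789-06-08" subject="amendments">...</report></doc>
</edition>
"""

def find_speeches(root, speaker, start=None, end=None, subject=None):
    """Return ids of documents reporting a speech that meets every criterion given."""
    hits = []
    for doc in root.iter("doc"):
        for report in doc.iter("report"):
            when = date.fromisoformat(report.get("date"))
            if report.get("speaker") != speaker:
                continue
            if start is not None and when < start:
                continue
            if end is not None and when > end:
                continue
            if subject is not None and report.get("subject") != subject:
                continue
            hits.append(doc.get("id"))
            break  # list each document once
    return hits

speeches = ET.fromstring(SAMPLE)
print(find_speeches(speeches, "James Madison"))
print(find_speeches(speeches, "James Madison", start=date(1789, 7, 1)))
print(find_speeches(speeches, "James Madison", subject="amendments"))
```

Each optional argument narrows the result further, mirroring the way a user might add a date range or a subject term to a speaker search.
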
Indexing to identify abstract concepts or to provide explicit entries for names, events, and other subjects, has been a time-honored task among documentary editors. Although making indexes is extraordinarily labor-intensive, indexes are the most valuable tools editors provide to create intellectual access. As we move toward the creation of a national database, we need to find ways to effectively make use of the indexes in previously published editions. The first challenge is to find ways to use technology to allow us to embed existing indexes within the texts themselves. Making hypertext links between an existing index and the page numbers in an electronic text can be automated fairly easily.
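
As a rough illustration of how such page-level linking could be automated, the Python sketch below splits a printed index line into its heading and page locators and maps each page to a link target. The index line is invented, the "page-N" anchor scheme is a hypothetical convention, and the parser assumes locators are written as full page numbers or full ranges (45-47, not 45-7).

```python
import re

def parse_index_line(line):
    """Split 'Heading, 12, 45-47' into ('Heading', [12, 45, 46, 47])."""
    match = re.match(r"^(.*?),\s*(\d[\d,\s-]*)$", line.strip())
    if not match:
        return line.strip(), []  # no page locators on this line
    heading, locators = match.groups()
    pages = []
    for part in locators.split(","):
        found = re.fullmatch(r"\s*(\d+)(?:-(\d+))?\s*", part)
        if not found:
            continue  # skip anything that is not a page or a range
        first = int(found.group(1))
        last = int(found.group(2) or first)
        pages.extend(range(first, last + 1))
    return heading, pages

def link_targets(line):
    """Map an index line to hypothetical page anchors such as 'page-12'."""
    heading, pages = parse_index_line(line)
    return heading, [f"page-{n}" for n in pages]

print(link_targets("Washington, George, 12, 45-47"))
```

Links produced this way resolve only to whole pages.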

But this gets messy when two documents appear on the same page -- making it unclear to which document the index is referring. A better solution would be to embed the indexing terms themselves within the relevant paragraph or sentence, then regenerate the index. This would make it easier for users to locate information and it would ensure that the indexing is not "lost" when the text is migrated as technology changes.
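
Once indexing terms are embedded in the text, regenerating the index becomes a matter of walking the markup. The Python sketch below shows the idea under stated assumptions: the empty <ix term="..."/> element, the <p n="..."> paragraphs, the <edition> wrapper and the document ids are all hypothetical names chosen for this example, not the Partnership's markup, and the sample is read as XML for convenience.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical embedded-index markup; only the quoted phrases come from
# the sample letter, everything else is invented for illustration.
SAMPLE = """
<edition>
  <doc id="d1">
    <p n="1"><ix term="Militia"/>Empowering Your Excellency to call in the Aid of such Militia ...</p>
    <p n="2"><ix term="Estaing, Count d'"/>Intimating the desire of Congress ...</p>
  </doc>
  <doc id="d2">
    <p n="1"><ix term="Militia"/>...</p>
  </doc>
</edition>
"""

def regenerate_index(root):
    """Collect every embedded term with the document and paragraph holding it."""
    index = defaultdict(list)
    for doc in root.iter("doc"):
        for para in doc.iter("p"):
            for marker in para.iter("ix"):
                index[marker.get("term")].append((doc.get("id"), para.get("n")))
    return dict(sorted(index.items()))

embedded = ET.fromstring(SAMPLE)
for term, places in regenerate_index(embedded).items():
    print(term, places)
```

Because each entry points to a document and paragraph rather than a page, the ambiguity of two documents sharing a page disappears, and the index can be rebuilt whenever the text is migrated.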

Reindexing even a single volume by hand does not seem feasible; a typical index contains about 8,000 to 10,000 page citations. What we need are clever programs which take an indexing entry, analyze the page on which it appears, present the indexer with a choice of suggested locations, and then put the entry in the appropriate place. Similar approaches could be used to solve the related problems of finding ways to link indexes across editions and to allow users to refine their searches so that the results can be as narrowly focused as the user may require.
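
A very simple version of such a program can be sketched in Python. It ranks the paragraphs on a page by how many words of the index entry they contain and returns the best candidates for the indexer to choose from. The word-overlap scoring is an assumption made for this sketch, a stand-in for whatever analysis a real tool would use, and the function names are introduced here. The paragraphs are phrases from the sample letter above.

```python
import re

def tokens(text):
    """Lower-case alphabetic words in a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def suggest_locations(entry, paragraphs, limit=3):
    """Rank paragraph numbers (1-based) by word overlap with the index entry."""
    wanted = tokens(entry)
    scored = []
    for number, paragraph in enumerate(paragraphs, start=1):
        overlap = len(wanted & tokens(paragraph))
        if overlap:
            scored.append((overlap, number))
    # Highest overlap first; earlier paragraphs win ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [number for _, number in scored[:limit]]

page = [
    "I beg leave to refer Your Excellency to ...",
    "Empowering Your Excellency to call in the Aid of such Militia ...",
    "Intimating the desire of Congress that Your Excellency Co-operate "
    "with Vice Admiral Count d'Estaing ...",
]
print(suggest_locations("Estaing, Count d'", page))
print(suggest_locations("Militia", page))
```

An entry that matches nothing returns an empty list, which is the case where the indexer would have to place the term by hand.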

Finding ways to use technology to help rebuild indexes, however, is a small example of a much larger and more important issue: developing generalized tools to add semantic content and access to metadata associated with texts. Like most projects, the Partnership developed ad hoc pattern-matching programs to automate certain types of markup. Given the work being done in other domains like ontology, content analysis, and linguistic analysis, much more sophisticated approaches for adding metadata to electronic text are possible. The inferences we draw from textual analysis can help us develop new views of the text. By the same token, we should be able to use those inferences to help us identify elements within the text and embed the markup which will allow users to retrieve that information. A simple case would be the development of an algorithm to identify dates which could then be used to develop a program which either automatically tagged the date or prompted the user for confirmation or intervention.
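
The date case is simple enough to sketch in Python. The pattern below recognizes only dates written like "July 11, 1778", and the <date value="..."> tag it emits is an assumed form chosen for illustration. Passing a confirm function turns automatic tagging into prompted tagging: the function is called with each candidate and the tag is added only if it returns true.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
DATE_PATTERN = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b")

def tag_dates(text, confirm=None):
    """Wrap recognized dates in a <date> tag; ask confirm(candidate) if given."""
    def replace(match):
        found = match.group(0)
        month = MONTHS.index(match.group(1)) + 1
        day, year = int(match.group(2)), int(match.group(3))
        if not 1 <= day <= 31:
            return found  # implausible day: leave the text alone
        if confirm is not None and not confirm(found):
            return found  # the user declined this candidate
        return f'<date value="{year:04d}-{month:02d}-{day:02d}">{found}</date>'
    return DATE_PATTERN.sub(replace, text)

print(tag_dates("Philadelphia, July 11, 1778"))
print(tag_dates("Philadelphia, July 11, 1778", confirm=lambda candidate: False))
```

The sketch does not check whether a date has already been tagged, so it should be run once over untagged text.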

A more interesting case would be the use of ontological clustering or content analysis to identify related portions of texts in different editions and then to establish links between those texts. To move toward these more sophisticated approaches will require a broad range of scholars working together in a collaborative effort. In short, those of us in the humanities need to be thinking about developing new initiatives with our colleagues in the sciences.
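
To give a sense of the simplest possible starting point, the Python sketch below proposes links between passages in two editions by comparing their word counts with cosine similarity. This is a far cruder technique than ontological clustering or serious content analysis, and it is named plainly as a substitute. The passages, identifiers and threshold are all invented for illustration.

```python
import math
import re
from collections import Counter

def vector(text):
    """Bag-of-words count of the lower-case words in a passage."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(n * n for n in a.values()))
            * math.sqrt(sum(n * n for n in b.values())))
    return dot / norm if norm else 0.0

def propose_links(edition_a, edition_b, threshold=0.3):
    """Return (id_a, id_b, score) for passage pairs at or above the threshold."""
    links = []
    for id_a, text_a in edition_a.items():
        for id_b, text_b in edition_b.items():
            score = cosine(vector(text_a), vector(text_b))
            if score >= threshold:
                links.append((id_a, id_b, round(score, 2)))
    return sorted(links, key=lambda link: -link[2])

# Invented passages standing in for two editions.
edition_a = {"a-1": "Congress empowered the General to call in the aid of the militia"}
edition_b = {
    "b-1": "the militia were called in to aid the General",
    "b-2": "prices of rice and indigo at Charleston",
}
print(propose_links(edition_a, edition_b))
```

Common words such as "the" weigh heavily in a score like this, which is one reason more sophisticated approaches, and the collaboration they require, are worth pursuing.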