Search   |    Back Issues   |    Author Index   |    Title Index   |    Contents

Featured Collection

spacer

D-Lib Magazine
March 2006

Volume 12 Number 3

ISSN 1082-9873

The Perseus Digital Library

Contributed by
Gregory Crane
Tufts University

New content and services for 19th century American documents

The Perseus Digital Library is releasing a collection that introduces significant new content and brings new technologies and services to a mainstream digital library. All XML source texts of the collection are available under an open source license. All software is likewise designed as open source.

XML Tagging

All of the markup in the example shown in Figure 1 was automatically added. Buell and Corinth are correctly interpreted George P. Buell and Corinth, MS. General John Pope – a dominant figure in this text – is never mentioned by his first name. The system thus mistakenly links Pope to "Curran Pope," named twice elsewhere in the document. The XML tagging includes both the rule used ("most common") and the number of times that Curran has been mentioned in the text (2).

Example of xmltagging

Figure 1.

Sample Page View

Figure 2 shows a representative page from the 19th century collection as seen in the Perseus digital library interface. The right hand side contains a range of added services (designed to become Fedora disseminators) relevant to the text. In this case, we see a summary of people, places and dates on a particular page of text. The digital library system can chunk documents in various ways and could, in this case, provide a chapter at a time.

Example of pageover

Figure 2.
View Larger Version

Personal Name Searching

Searching for people with the last name Webster calls up the list shown in Figure 3. Since Daniel and Noah Webster are pre-eminent figures in their domains, Webster often appears in documents without a forename. We have chosen to defer automatically aligning possible name variants (e.g., Daniel Webster and D. Webster) since we were not satisfied with initial results. We expect to reactivate this feature in a more sophisticated form in the future.

Example of xmltagging

Figure 3.

Frequency Searching

Almost all of the Richmonds in our collection denote Richmond, VA. The example shown in Figure 4 illustrates the results for Richmond, KY: the results include two figures. In the document with the most potential references to Richmond, KY, the system reports that there may be as many as 71 references but probably at least 16. The lower figure, which maximizes precision, reports phrases such as "Richmond, Kentucky", where the language makes it highly likely that we have the right Richmond. The higher figure includes all those Richmonds that could be Richmond, KY. Given the pervasive references to Richmond, VA, many of the 71 will be false positives.

Example of pageover

Figure 4.

Our named entity identification system tuned for 19th century historical materials has extracted from the 55 million word test collection 1.5 million personal names, 1 million place names, 600,000 dates, 500,000 organization names, and 12 million auomatically generated annotations. While we have extracted dates and places in the Perseus Digital Library since 2000, the new system is optimized for the challenges posed by American English, which contains both semantic ambiguity (e.g., Washington can be a person or a place) and which uses the same names over and over again (there were more than 150 places named Washington in mid nineteenth century America). This system lays the foundation for a general web service to tag named entities in documents that could process up to 1 billion words in a day. The current database of named entities from the initial test collection can be expanded to cover all pre-twentieth century American English publications. For the technical details and general performance of the system, see Crane and Jones.

A new release of the Perseus Digital Library System includes an initial interface whereby users can browse and search semantically:

  • Semantic searching: In the American collection, you can now search for Washington as a person or as a place. Likewise, you can search for a particular place. Thus, you can search only for [Richmond in KY], minimizing the far more numerous references to Richmond, VA.
  • Automatic metadata generation: Listing the most commonly cited people, places, dates and other organizations provides a window into the content of a document. This automated technology provides a scalable approach that can be applied to millions of documents. See a list of people, places and dates that J. B. Jones, A Rebel War Clerk's Diary," cited in September 1863.

Collection Description

Sources on 19th century American culture and especially the US Civil War: This release includes a test collection of 55 million words, with a number of core works on the Civil War. Our purpose was to create a testbed whereby we could study the problems and opportunities presented by various genres of 19th century American English. Content includes:

  • the works of John Greenleaf Whittier, illustrating not only the problems of representing an existing print edition as part of a larger digital collection but also of applying a named entity identification system, initially optimized for historical works, to a literary corpus.
  • George Bancroft's 10 volume History of the United States (1859), which reflects the nation's image of itself at the dawn of the Civil War and illustrates the challenges of managing a history that covers several centuries.

Acknowledgements

The foundations for this work were laid with support from the National Endowment for the Humanities and the National Science Foundation under the DLI-2 Initiative. Tufts University provided support for much of the data entry, and the open source policy championed by its president, Lawrence Bacow, has allowed us to make the XML source texts accessible. The Institute for Museum and Library Services and our collaborators at the University of Richmond in their Civil War Newspapers Project provided much of the support for developing the backend named entity analysis system and the digital library named entity searching and browsing environment. Gregory Crane scanned most of the American books, supervised their OCR, post-processing into TEI-conformant XML and was primarily responsible for the named entity analysis system. Adrian Packel built the named entity searching and browsing environment. Lisa Cerrato, Alison Jones, David Mimno, and Gabe Weaver all made major contributions to the work.

Copyright© 2006 Gregory Crane

Top | Contents | Home

doi:10.1045/march2006-featured.collection