Text Mining at an Institution with Limited Financial Resources

D-Lib Magazine

July/August 2016
Volume 22, Number 7/8

Drew E. VandeCreek
Northern Illinois University Libraries
drew@niu.edu

DOI: 10.1045/july2016-vandecreek

(This Opinion piece presents the opinions of the authors. It does not necessarily reflect the views of D-Lib Magazine, its publisher, the Corporation for National Research Initiatives, or the D-Lib Alliance.)

Abstract

The digital humanities are now coming to the attention of a growing number of scholars and librarians, including many at medium-sized and small institutions that lack significant financial resources. Should these individuals seek to explore text mining, one of the digital humanities' core activities, they are likely to confront the fact that their library cannot afford the typical expensive database products that contain large volumes of materials suitable for analysis. In this opinion piece, I suggest that vendors would benefit from increasing their customer base by offering potential users the opportunity to purchase discrete portions of data sets individually. This approach may prove practicable for libraries able to muster relatively modest sums for the purchase of single items. It also may represent a new source of revenue for vendors, or at least an opportunity to build trust and goodwill in the digital humanities community.

The Problem

The digital humanities' increasing prominence in academic life, marked by such things as advertisements for new positions and calls for papers, has brought the field to the attention of a large number of humanities scholars, librarians and administrators not employed at the larger institutions that have heretofore often led its development. Many have expressed an interest in the field. These individuals often do not have access to the financial resources that the field's leaders enjoy. This shortfall makes itself apparent in any number of ways: the lack of a technical infrastructure robust enough to support many types of digital humanities work; a lack of information technology professionals who understand, appreciate and can support the work; and an inability to attend professional development workshops at other institutions. Another potential problem for this new group of practitioners at non-elite institutions with limited resources will arise when they undertake text mining, one of the digital humanities' core activities, and confront the expense of acquiring a corpus of data to mine. In this article I discuss the problem, and propose a partial solution which, while far from ideal, could allow these practitioners to begin.

Text Mining: the Cost of Getting Started

I attended the University of Michigan's "Beyond CTRL+F: Text Mining Across the Disciplines 2016" workshop on February 1, 2016. I want to thank the University of Michigan Libraries for organizing and hosting the event. I enjoyed it. It must have taken a great deal of work.

When the workshop first came to my attention, I noticed that participants could attend at no charge. This was too good to be true. Working at a state university in the bankrupt state of Illinois, I of course have access to no financial support for professional development activities. I happily drove to Ann Arbor and stayed overnight at my own expense, then took part in the workshop. Without the free-admission policy, I might not have gone to the event.

The workshop began with a session devoted to "finding your corpus." This seemed reasonable. No one can perform text mining until they have some text. The session featured representatives of several vendors of subscription products providing access to large amounts of textual materials: ProQuest, JSTOR, Gale, Alexander Street Press (full disclosure — I edited an online product for Alexander Street Press and have cashed their checks) and several others. It dawned on me that the no-charge policy resulted, of course, from these vendors' sponsorship of the event. As sponsors, they enjoyed the opportunity to pitch their products to members of a captive audience who had expressed an interest in text mining.

Vendor representatives described how scholars and students might use their products for text-mining projects. They presented an impressive set of resources, but they did emphasize that library users were not simply to bring up one of their databases and begin to download the very large bodies of text they wanted to use. Vendors of online library resources typically offer their products for subscription with the proviso that library patrons not use them too much. From a vendor's point of view, a database user might download a very large amount of text and then turn around and put it on the web for free use. Thus, they monitor their product's use, and terminate access if they detect that a patron is downloading too much material.

Vendor representatives at the Ctrl+F event explained that their policies direct prospective text miners to use their products to discover potentially suitable text materials, then submit a request for a specific corpus, which they will then prepare and deliver for an extra fee in the range of $500-$1,000.

This made something very apparent to me: text mining is in many cases only practicable at its intended scale at institutions commanding the financial resources necessary to 1) subscribe to these products, and 2) go on to pay the additional fee. Of course Open Access entities like HathiTrust make text materials available at the scale required for text mining activities at no cost, but it is important to recognize that vendors of subscription-based products like those discussed at the Ctrl+F event also represent a major source of text materials that scholars will likely find very attractive.

I noticed that the "Beyond CTRL+F" event drew a significant number of scholars employed at institutions well outside the vendors' target audience: university libraries with budgets that allow them to purchase or subscribe to high-cost digital resources in the humanities. Those with whom I conversed often emphasized that they were happy to attend such an introductory-level event hosted by a major institution of high reputation. It offered an opportunity to get oriented in the field, to get started in the work. I suspect that a number of these individuals must have reached the same conclusion that I did: "I can only do this if I can find text available at no charge. I must direct my research toward questions that can be answered by reference to free-use data alone."

My Experience

I attended the Ctrl+F event as a digital humanities professional responsible for the encouragement and support of activities like text mining at my university. I am also a scholar of nineteenth- and early twentieth-century American intellectual and political history. I am interested in language and rhetoric in American political development. More specifically, I am interested in how Americans have talked about the federal government. What did they have to say about its scope of activity? How might Americans have understood what it did, or did not do? What language did they use to argue for more, or less, government involvement in the American economy and society? Did their language reflect the influence of major intellectual traditions like liberalism and republicanism in political thought, or perhaps romanticism and sensibility in literature and culture?

I turned to speeches and debates in Congress as a good source of arguments for and against specific state activities. This led me to the Congressional Record, a very large body of text that is available in a searchable text format from several sources. The Library of Congress' A Century of Lawmaking for a New Nation web site provides free access to full-text versions of the Congressional Record beginning with the year 1995. I needed access to full-text versions of the record from the nineteenth century. This led me to ProQuest Congressional, a subscription product providing a variety of Congressional materials. Unfortunately, my university library's subscription to ProQuest Congressional did not include materials from the Congressional Record before 1985. When our Acquisitions Department contacted ProQuest to inquire about the matter, they learned that we might purchase the back file materials for the nineteenth-century Congressional Record for a one-time payment of approximately $25,000. This was an all-or-nothing proposition: purchase the entire back file, or purchase nothing.

ProQuest's price was a complete non-starter at my financially strapped university.

I asked librarians at several institutions with large library budgets if they might acquire materials for me, in effect providing an inter-library loan, but found that vendors' contracts restrict use to individuals defined as members of an individual institution's user community.

I attempted to resolve my problem by asking vendors if they would sell me my preferred chunk of data by itself (the Congressional Record, 1873-1896), rather than an entire database product or back file, at a more reasonable price. ProQuest declined to negotiate, but Hein Online (another vendor of digitized government documents) agreed. I bought, at my own expense, the text of the Congressional Record for the period 1873-1896 for a price I could accept. I now have it available for research.

Upon completing this transaction, I discovered that the University of North Texas Libraries, which present a digitized version of the entire Congressional Record, would provide me with their uncorrected text data at no charge. I thank the University of North Texas Libraries for the use of their data, and recommend them to other students and scholars. Their collections include a large number of digitized Texas newspapers, as well as records of the Federal Communications Commission. However, like other not-for-profit providers of text data, North Texas offered uncorrected copy. With two versions of the same data in hand, I may have an opportunity to compare the results they produce in text-mining work. In any event, corrected text is clearly more useful than uncorrected materials.

The Vendors' Perspective

As I pondered the situation, I tried to take ProQuest's point of view. I understand that most library vendors are private concerns and need to make a profit for their investors. Their representatives sell that product in order to earn a living. Nevertheless, the Congressional Record is a government publication available at no charge in libraries and other depositories of federal materials. How could ProQuest charge so much for the use of it?

I imagined that from ProQuest's perspective, they are not selling access to a government publication in the public domain. They are selling access to a value-added version of it: a digitized, full-text searchable version of the materials available in an online format. Their costs include funds devoted to the initial digitization of materials originally published in an analog format; the markup and other technical work required to prepare the text for use with a search engine; the storage and preservation of the materials on a technical infrastructure requiring maintenance and upgrades; and the online service of the digital materials themselves, again on an infrastructure requiring maintenance and regular upgrades.

Of these costs, those devoted to digitization itself deserve specific discussion. Many librarians and humanities scholars have taken some part in the digitization of materials at some point in their careers. Experience with the process reveals that the various software products that convert typeset, analog materials to a digital format are far from foolproof. They often produce enough errors to compromise the materials' usefulness, at least to some degree. This is especially true of older materials, in which ink has often faded and pages have yellowed with age. In my experience, nineteenth-century materials digitized from an analog format usually have a very high error rate.

I examined a small sample of ProQuest's Congressional Record materials, which they courteously provided me. It contained a very small number of scanning errors, significantly fewer than those found in the portion of the UNT data that I reviewed, and about the same as the Hein materials. I tentatively determined that in my case vendors provide access to better text than that available for free.

If a researcher were to attempt to bring the freely available data up to the quality of the ProQuest materials, s/he would have to find a way to fix many of the errors in it, most likely by using a script that finds and replaces common scanning errors in a document. In my experience most humanities scholars and students cannot write search-and-replace scripts, nor do they know how to find ready-to-use scripts online and implement them, as many technologists and programmers do. I certainly do not. Most libraries at medium-sized and smaller institutions with limited resources lack access to this type of technical expertise.
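To make concrete what such a find-and-replace script involves, the following is a minimal Python sketch. The table of misreadings is hypothetical, chosen only for illustration and not drawn from the UNT, Hein or ProQuest data; a real project would have to build its own table by inspecting its corpus, and the function names are likewise assumptions.

```python
import re

# Hypothetical table of common scanning misreadings and their corrections.
# These pairs are illustrative assumptions, not taken from any data set
# discussed in this article.
OCR_FIXES = {
    "tbe": "the",
    "aud": "and",
    "Cougress": "Congress",
    "Seuate": "Senate",
}

# One pattern that matches any listed misreading, as a whole word only,
# so that "aud" inside "audit" is left alone. Matching is case-sensitive.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, OCR_FIXES)) + r")\b")


def fix_ocr_errors(text: str) -> str:
    """Replace each listed misreading with its correction."""
    return _PATTERN.sub(lambda m: OCR_FIXES[m.group(1)], text)


def rejoin_hyphenated_words(text: str) -> str:
    """Rejoin words split by a hyphen at the end of a line.

    Caveat: this also joins genuine compounds that happen to break
    across lines (for example "well-" / "known").
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


if __name__ == "__main__":
    sample = "The Seuate aud tbe House of Cougress debated the appro-\npriation."
    print(fix_ocr_errors(rejoin_hyphenated_words(sample)))
```

Even this small sketch suggests where the labor lies: the code is short, but assembling a reliable table of corrections for a large nineteenth-century corpus requires careful review of the text itself.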

Thus, when Hein and ProQuest charge fees for materials in the public domain, they charge for access to more accurate digitized text.

A Measure of Progress

My experience with Hein Online led me to draw a parallel to another experience I had with a vendor in a somewhat similar, but not identical, situation. In the past several years I have taken part in the activities of the Digital POWRR Project, an IMLS-funded activity that produced a study of digital preservation challenges and potential solutions at medium-sized and smaller colleges and universities lacking large financial resources. Our study included the review of a number of applications and tools available for use in digital preservation activities. Among them we found a comprehensive, all-in-one product called Preservica. They made no pricing information available online. We had to call for a quote.

When we contacted a Preservica sales representative to ask if they might make the product available to our study for testing at little or no cost, they immediately rejected us, explaining that Preservica is a version of a digital preservation product that the company originally sold to large corporations such as banks. They have now begun to market it to other very large institutions that need to preserve digital materials and have suitable budgets, ranging from universities to state and national governments. Apparently, medium-sized and smaller institutions with little money did not represent an attractive market segment.

The Digital POWRR Project published a white paper resulting from the study, "From Theory to Action: Good Enough Digital Preservation for Under-Resourced Cultural Heritage Institutions". It recommended that institutions unable to afford a product like Preservica adopt a one-step-at-a-time approach to digital preservation activities using sets of open-source tools in combinations suited to their particular needs.

Another thing occurred in the process of conducting the study. Through a frank and open exchange of views with members of the Digital POWRR team, Preservica executives became aware that they were leaving money on the table by adopting a call-for-quote stance and pricing their product at a level that put it well out of reach of smaller, less prosperous institutions. We urged them to adopt a more transparent pricing policy and become aware of this other market, which the response to our study has shown is vast. There are only so many institutions with the resources necessary to buy Preservica at their initial price level. What happens when they all have acquired or constructed a satisfactory digital preservation application? Where does the company find growth then?

Preservica executives changed their position, instituting a transparent, online pricing policy and devising versions of their product priced to suit more modest budgets. I want to suggest that vendors of large sets of humanities text materials do the same.

My Recommendation

I suggest that vendors of library database products recognize that they can contribute to future scholarship, ease a major, obvious inequity in the field and, perhaps, find a new source of revenue by making chunks of text data available for sale on an à la carte basis. In many cases, this would require them to offer libraries that do not subscribe to their products a free trial-period use so that researchers might identify materials of interest. It would also require the additional administrative work involved in processing a number of transactions involving lesser amounts of funds than those to which they are accustomed. I understand that vendors will raise these objections, but I believe they should investigate this potential sales model in a systematic fashion and determine if they can earn profits with it.

I submit that vendors would not need to understand this approach as a charity measure. I suspect that purveyors of large, online humanities text databases may well confront a situation similar to that which the Digital POWRR team perceived in Preservica's case. Once they have sold their products to the limited number of institutions able to afford them, where do they find growth? Of course they can grow by introducing new products, but do they not want to find revenue growth in legacy products as well?

Representatives of a number of vendors may reply to this observation by noting that they price their products on the basis of an institution's number of full-time enrolled students, or offer access to a limited number of simultaneous logins, measures that can help a smaller institution. This is not enough. It may prove to be a benefit to smaller institutions to some degree, but it is only a partial measure. It certainly does not help cases like mine — a large institution lacking the budget level to buy even these versions of products — and there are many such institutions. If vendors do not recognize and respond to the market made up of medium-sized and smaller institutions of lesser financial means, I fear that they will make a powerful contribution to the perpetuation of the existing situation: students and scholars at the wealthiest colleges and universities can do text mining work with access to very large collections of suitable materials, while others may never find their corpus. Those vendors will also, in my estimation, leave money on the table. Even if they cannot earn any profit from this type of sale, it may be worthwhile for them to sell materials at a modest loss in order to earn the trust and goodwill of the scholars, librarians, and other practitioners populating the digital humanities.

I ask vendors to consider the above proposition, and digital humanists and librarians at institutions of all sizes and financial conditions to raise these issues associated with access to their materials with vendors' sales representatives.

Acknowledgements

The author thanks Jim Millhorn of Northern Illinois University Libraries and Alix Keener of the University of Michigan Libraries for help in gathering information for this article.

About the Author

Drew E. VandeCreek is Director of Digital Scholarship and Co-Director of the Digital Convergence Lab at Northern Illinois University Libraries. He holds a Ph.D. in American History from the University of Virginia. He has secured funding for and directed the development of a number of online resources exploring nineteenth-century American history, available from the University Libraries Digital Collections.