Newspaper collections are the subject of an increasing number of large-scale digitisation projects. In Papers Past (http://paperspast.natlib.govt.nz), a collection of over a million newspaper pages, the introduction of full-text search has made a wealth of information findable that was previously hidden. The search feature is dependent on text extracted from the newspaper page images with Optical Character Recognition (OCR), so any improvement in OCR accuracy will add value to the collection by improving our users' chances of finding useful information.
The Papers Past newspapers were digitised from microfilm as 400 DPI bitonal images over a period of several years. For future newspapers, we wondered whether OCR accuracy would be improved by "going grey", and digitising to 8-bit greyscale instead. Accepted wisdom is that greyscale digitisation produces higher OCR accuracy than bitonal digitisation. To test this assumption, we digitised three reels of microfilmed historic newspapers in both bitonal and greyscale, had them OCRed, and carried out a hand-count of the OCR accuracy on a random set of text samples. The experiment had a clear and surprising outcome: using our existing business processes, there was no evidence of any improvement in OCR accuracy from greyscale digitisation.
The National Library of New Zealand ("the Library") has made a collection of digitised newspapers available through its Papers Past website at <http://paperspast.natlib.govt.nz>. At the time of writing the site provides access to nearly 1.2 million newspaper pages, comprising 225 thousand issues from 45 titles published between 1839 and 1920 [1].
The original Papers Past website debuted in 2001, and gave users access to scans of the newspaper pages, which could be viewed and printed, but not searched. In 2005 the Library ran a pilot project to investigate using optical character recognition (OCR) to generate full text and make the newspapers in Papers Past searchable. The pilot was successful, and the Library decided that all future content in Papers Past should be OCRed.
Papers Past was re-launched in September 2007 with a new interface and the ability to search the text of those titles whose content has been OCRed. Currently 24 of 45 titles (representing about 56% of the collection) are searchable, and the remainder will be OCRed over the next twelve months as part of our digitisation programme.
Offering full-text search has increased the usage of Papers Past dramatically, from 9,275 unique visitors and 14,680 visits in August 2007 to 117,000 unique visitors and 252,000 visits in June 2008. We attribute the site's increased popularity to three factors: improved user interface, improved functionality (i.e., full text search), and increased referrals from the Google search engine, which has indexed much of the OCRed text.
During the OCR pilot project we spoke to many vendors who recommended that we start using greyscale digitisation as it would increase OCR accuracy. While this was not an option for the million pages already digitised as bitonal images, the Library decided to investigate digitising newspapers as 400 DPI 8-bit greyscale images in future years. A new project was initiated to gather evidence that greyscale input images really do benefit OCR accuracy.
This article describes the new project, known internally as the Greyscale Evaluation Project. We began with the expectation that greyscale digitisation would deliver obvious improvements in OCR accuracy, but this proved not to be the case in our situation. This article describes how we evaluated the effect of greyscale digitisation on OCR accuracy, summarises our results, and discusses our thoughts about the outcome and the lessons we have drawn from it.
We decided that the best way of determining whether changing to greyscale scanning would have a positive impact on the overall level of OCR accuracy was to process some microfilmed newspapers in both bitonal and greyscale and compare the outputs. We concluded that a hand-count of samples taken at random would be the best way of making a fair comparison.
We started our experiment with the following hypothesis:
Greyscale scanning produces a significantly higher level of OCR accuracy than bitonal scanning.
2.2. Data preparation
Three rolls of microfilm were digitised and OCRed multiple times using a process that was identical except for digitisation parameters (bit depth and resolution). The full data preparation process was as follows:
Figure 1: DSC bitonal sample (Reel 28258)
Figure 2: DSC greyscale sample (Reel 28258)
The Library compared the hand-counted OCR accuracy rate for bitonal and greyscale for each sample, and then calculated the average accuracy rates. Table 1 summarises the data and results.
The overall average bitonal accuracy rate was 97.53%, whereas the greyscale rate was 3.43 percentage points lower at 94.10%. The images scanned at 300 DPI had a lower average OCR accuracy, so the 300 DPI data has been excluded from further analyses.
We also performed a direct comparison for the 400 DPI scans for each reel and found that in reel 28258 bitonal had the higher accuracy rate 34 times (85%), in reel 28259 bitonal had the higher accuracy rate 31 times (77.5%), and in reel 35903 bitonal had the higher accuracy rate 72 times (90%).
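The arithmetic behind these figures can be checked with a short sketch. The counts and averages below are taken from the text; the function names and the pooled figure across all three reels are our own illustrative additions, not part of the original analysis.

```python
# Head-to-head counts reported in the text:
# reel -> (samples where bitonal scored higher, total samples in that reel).
HEAD_TO_HEAD = {
    "28258": (34, 40),
    "28259": (31, 40),
    "35903": (72, 80),
}


def win_rate(wins: int, total: int) -> float:
    """Percentage of samples in which bitonal had the higher accuracy."""
    return 100.0 * wins / total


def accuracy_gap(bitonal_pct: float, greyscale_pct: float) -> float:
    """Difference between two accuracy rates, in percentage points."""
    return round(bitonal_pct - greyscale_pct, 2)


for reel, (wins, total) in HEAD_TO_HEAD.items():
    print(f"reel {reel}: bitonal higher in {wins}/{total} samples "
          f"({win_rate(wins, total):.1f}%)")

# Pooled figure across the three reels (derived here, not reported in the text).
pooled_wins = sum(wins for wins, _ in HEAD_TO_HEAD.values())
pooled_total = sum(total for _, total in HEAD_TO_HEAD.values())
print(f"all reels: {pooled_wins}/{pooled_total} "
      f"({win_rate(pooled_wins, pooled_total):.1f}%)")

# Gap between the overall average accuracy rates quoted above.
print(f"overall gap: {accuracy_gap(97.53, 94.10)} percentage points")
```

Expressing the gap in percentage points avoids confusing it with a relative change in the accuracy rate.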
2.4. Analysis of outliers
Following this experiment, we examined the variation by looking at every instance of 5% or more variation in accuracy between bitonal and greyscale.
The two Daily Southern Cross reels (28258 and 28259) had few samples in which the variation between greyscale and bitonal was greater than 5%: there were six instances out of 40 in 28258 and two out of 40 in 28259. The Colonist reel (35903) had twenty-five samples out of 80 in which the variation between greyscale and bitonal was greater than 5%.
In general, in cases where greyscale digitisation was more than 5% worse, the scans were characterised by pale text, blurry text, and poor contrast. This was a particular problem for reel 35903. Overall, there are only two instances where greyscale is more than 5% better.
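The outlier selection described in this section amounts to a simple filter over paired accuracy rates. The following is a minimal sketch of that filter, assuming each sample is recorded as a pair of hand-counted percentages. The sample identifiers and values are hypothetical and are not the experiment's data; `find_outliers` is an illustrative name.

```python
from typing import List, Tuple

# (sample id, bitonal accuracy %, greyscale accuracy %)
Sample = Tuple[str, float, float]


def find_outliers(samples: List[Sample],
                  threshold: float = 5.0) -> List[Tuple[str, float]]:
    """Return (sample id, greyscale minus bitonal) for every sample whose
    accuracy gap is at least `threshold` percentage points."""
    outliers = []
    for sample_id, bitonal, greyscale in samples:
        diff = greyscale - bitonal
        if abs(diff) >= threshold:
            outliers.append((sample_id, round(diff, 2)))
    return outliers


# Hypothetical values for illustration only; NOT the experiment's data.
samples = [
    ("s01", 98.0, 97.5),
    ("s02", 96.0, 88.0),
    ("s03", 91.0, 97.0),
    ("s04", 99.0, 95.5),
]

for sample_id, diff in find_outliers(samples):
    direction = "better" if diff > 0 else "worse"
    print(f"{sample_id}: greyscale {abs(diff):.1f} points {direction}")
```

Keeping the sign of the difference makes it easy to separate the cases where greyscale was worse from the few where it was better.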
3. (Lack of) conclusions
The greyscale evaluation was based on the assumption, stated in our hypothesis, that greyscale scanning produces a significantly higher level of OCR accuracy than bitonal scanning. Our results were therefore quite unexpected.
Our hypothesis is unsupported by the available evidence. The hand-count provided no evidence that greyscale digitisation improves OCR accuracy, and the analysis of outliers did not provide enough data to support any conclusions.
Based on vendor recommendations, and our own understanding of the OCR process, we expected to see significant and obvious evidence that greyscale was superior. However, this evidence did not materialise.
This experiment was designed to yield a practical outcome that could be applied in a large-scale and largely automated digitisation and OCR workflow. We are primarily interested in solutions that will help us in affordable ways in our commercial setting with current technology. We are less interested in ways that OCR can be improved on a page-by-page basis under the guidance of a human expert.
4.1. Why greyscale scanning is theoretically better
Prior to the experiment we reviewed the available commercial and academic literature. We also looked at best practice and the choices made by other similar projects. However, it was difficult to find much material published on the topic.
We found that there was general consensus among OCR vendors and other experts that greyscale images result in higher OCR accuracy than bitonal images, for several reasons:
All these reasons suggest that greyscale scanning will improve OCR accuracy, though most authorities we contacted noted these improvements will vary depending on the type of material. Several experts identified potential improvements that greyscale may offer in the future, but that are not yet commercially available, such as the ability to binarise a greyscale image after it has been zoned into articles to reduce the effects of variation within each newspaper page.
4.2. Possible explanations
In our experiment greyscale scanning was no better than bitonal scanning. We have considered a number of reasons that this might be the case.
In conclusion, we believe that the explanation is probably a mixture of these. The major benefits of greyscale digitisation are in handling pages that need substantial preparation, such as deskewing, whilst our pages are generally of good and consistent quality and therefore will not benefit from such preparation. The other obvious benefits of greyscale (superior binarisation and additional information) are both unproven with current technology, and apparently had little impact on OCR accuracy.
4.3. The costs and benefits of going grey
An obvious long-term advantage of changing to greyscale scanning, with regard to OCR quality, is that there is more information in the scans for the OCR program to work from. This additional information has not helped in the present experiment, which uses the best technology currently commercially available, but in the future advanced OCR software may be able to make better use of it. Greyscale scanning can therefore be seen as a hedge against future technology improvements.
A second advantage of greyscale scanning is the better representation of pictorial content, such as photographs and illustrations, that may appear in historic newspapers. While this is out of scope for the current experiment, it may drive our thinking in the future: Papers Past currently does not contain very much pictorial content, but this may change as we start including more 20th century material.
However, going grey also means more cost, effort and inconvenience. The OCR processing costs more as it requires vendors to use more resources (CPU and storage). Both storage and transport requirements increase because 8-bit greyscale TIFFs can be up to 80 times [3] as large as bitonal TIFFs when uncompressed, and even when compressed (with some loss of detail), they can be 20 to 40 times as large. This would dramatically increase our long-term storage costs and make the transfer of data to vendors problematic.
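To give a feel for what these multipliers mean at collection scale, here is a rough sketch that projects storage requirements. The multipliers (20, 40 and 80) come from the text; the page count and the average bitonal file size are hypothetical inputs chosen only for illustration, and `projected_storage_tb` is an illustrative name.

```python
# Size multipliers quoted in the text for 8-bit greyscale TIFFs
# relative to bitonal TIFFs.
COMPRESSED_FACTORS = (20, 40)
UNCOMPRESSED_FACTOR = 80


def projected_storage_tb(pages: int, bitonal_mb_per_page: float,
                         factor: float) -> float:
    """Projected storage in terabytes (taking 1 TB = 1,000,000 MB)."""
    return pages * bitonal_mb_per_page * factor / 1_000_000


# Hypothetical inputs: these are assumptions, not figures from the project.
PAGES = 1_000_000   # assumed number of pages
BITONAL_MB = 0.5    # assumed average bitonal file size per page, in MB

baseline = projected_storage_tb(PAGES, BITONAL_MB, 1)
print(f"bitonal baseline: {baseline:.1f} TB")
for factor in COMPRESSED_FACTORS:
    size = projected_storage_tb(PAGES, BITONAL_MB, factor)
    print(f"compressed greyscale at {factor}x: {size:.1f} TB")
uncompressed = projected_storage_tb(PAGES, BITONAL_MB, UNCOMPRESSED_FACTOR)
print(f"uncompressed greyscale at {UNCOMPRESSED_FACTOR}x: {uncompressed:.1f} TB")
```

Under these assumed inputs, a collection that fits in half a terabyte as bitonal images would need tens of terabytes in greyscale; the real figures depend on actual file sizes and compression settings.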
In light of New Zealand's low rate of broadband uptake, there may be other costs in the delivery of large greyscale images to Papers Past users. As they are much larger than bitonal images, they would either take much longer to download, or would have to be transformed in some way to be made accessible. With increasing access to broadband, this problem should diminish over time.
Going grey would also incur a certain amount of risk as we replace a known and successful workflow with one that is relatively unknown to us. For example, the OCR quality from greyscale images appears to be quite sensitive to changes in the digitisation parameters. In a preliminary experiment we digitised the same microfilm reels using a scanning process that sharpened the images (resulting in a halo effect around the letters), and the resulting OCR accuracy of the greyscale images was on average 10% worse than the bitonal equivalents. This suggests that the process is vulnerable to scanning errors that decrease image and OCR quality, and that changing to greyscale scanning may require ongoing monitoring above and beyond our current quality assurance procedures.
Finally, we believe that there are other ways to increase OCR accuracy that may be cheaper or more effective than changing to greyscale digitisation, such as re-filming. In 2006/2007 we re-filmed and re-digitised the New Zealand Gazette and Wellington Spectator, providing cleaner inputs to OCR, and resulting in an improvement in OCR accuracy of nearly 10% (as measured by average machine-estimated accuracy rates, not by hand-count). The National Library of Australia has reached similar conclusions, determining that the only way to significantly improve OCR accuracy is to improve the quality of the source materials or make manual adjustments to the process for each file, and ultimately attempting to solve the problem laterally, by asking users to voluntarily correct the OCR output after the fact [4].
4.3.1. Future considerations
Most major historical newspaper projects have chosen to digitise in greyscale, and most disagreement is over the resolution: for example, the British Library and the Bibliothèque nationale de France consider that 300dpi is sufficient, while the Library of Congress and the National Library of Australia require 400dpi. The Databank of Digital Daily Newspapers (DDD) project from the Koninklijke Bibliotheek found that scanning at 300ppi was the consensus [5]. In the meantime the debate has moved on to the benefits of colour scanning and direct digitisation.
As noted by Edwin Klijn, "Scanning from the originals is generally acknowledged to produce higher quality master images. There is some disagreement among the survey respondents as to whether one should scan in colour or greyscale. Scanning in colour produces a master that is closer to the original newspaper (more 'authentic') than greyscale. Also, according to some respondents colour images may lead to better OCR results, or at least provide better 'raw materials' to improve the OCR in due course." [6]
4.4. Open questions
We were surprised to find little documented evidence for the claim that greyscale digitisation provides a higher level of OCR accuracy than bitonal. This may be a result of commercial sensitivity on the part of OCR vendors, or the claim may be an assumption based on the undisputed advantage of greyscale scanning: the better representation of pictorial content. Whatever the reason, it has left us with several questions to place before the newspaper digitisation community:
The last of these questions is particularly important in high-volume settings. One of the major costs of greyscale OCR is the transportation of large quantities of data, but this can be reduced significantly by binarising at the time of digitisation and sending the resulting bitonal images to the OCR vendor. This approach has been adopted by the National Library of Australia, who have run experiments to compare binarisation programs and techniques and select the software that yields the best OCR results [7].
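For readers unfamiliar with binarisation, the sketch below shows one widely used global technique, Otsu's method, which picks the grey level that best separates dark from light pixels. It is offered purely as an illustration of what "binarising at the time of digitisation" involves. We do not know which programs or techniques the National Library of Australia compared, and this is not the method used in our own workflow. The pixel values are hypothetical.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the grey level t (0-255) that maximises the between-class
    variance, where pixels <= t form one class and pixels > t the other."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for level in range(256):
        weight_low += histogram[level]
        sum_low += level * histogram[level]
        if weight_low == 0:
            continue  # no pixels in the dark class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the dark class
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map 8-bit grey values to bitonal values: 0 (ink) at or below the
    threshold, 1 (paper) above it."""
    return [0 if value <= threshold else 1 for value in pixels]


# Hypothetical grey values: three dark "ink" pixels, three light "paper" pixels.
grey_pixels = [20, 25, 30, 200, 210, 220]
threshold = otsu_threshold(grey_pixels)
bitonal_pixels = binarise(grey_pixels, threshold)
print(threshold, bitonal_pixels)
```

A single global threshold is the simplest case; pages with uneven contrast are the situation in which locally adaptive binarisation might be expected to matter more.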
Another argument used in favour of greyscale is that users who are reading digitised newspapers on computer screens prefer to read text from greyscale scans. However, we suspect this is another assumption that is not tested and documented, and that may be untrue. In our limited experience, it is true of advanced users (such as image processing professionals), but regular users often express a preference for bitonal. We are considering a follow-up experiment to test this issue.
In this article, we have described an experiment to test the immediate benefit to our users of "going grey" by scanning historic newspapers for Papers Past in greyscale rather than bitonal.
Given our existing selection policy, digitisation methods, and vendors, we could find no evidence that using greyscale digitisation in Papers Past would increase OCR accuracy (which is not the same as saying we found bitonal scans are better than greyscale scans). As a result, the project team recommended that the Library continue its practice of bitonal digitisation for Papers Past for now, but that we be prepared to review this decision as more information becomes available, and as more pictorial content is selected for digitisation.
In the future we will investigate other ways of improving OCR accuracy for historic newspapers that are robust and reliable in a high-volume setting.
Our key message to anyone else with a treasure trove of bitonal scans is not to assume that their quality is too poor to OCR. You might be pleasantly surprised at the value of OCRing what you have now, rather than re-scanning in greyscale.
The authors would like to thank the following people:
Notes and References
1. Current collection statistics are available from: <http://paperspast.natlib.govt.nz/cgi-bin/paperspast?a=p&p=about>.
2. One way to estimate whether the chosen reels are "typical quality" is to compare their machine-estimated OCR accuracy for the bitonal scans to that of the wider Papers Past collection. The machine-estimated accuracy for the Daily Southern Cross was 95.526% for reel 28258 and 95.277% for reel 28259. This can be compared to the machine-estimated OCR accuracy rates for the first 13 titles OCRed in Papers Past, which were a mixture of poor and high quality titles. These range from 72.90% to 99.20%, with an average of 93.16%. Eleven of the 13 had average estimates in the 90-99% range. This suggests (but does not confirm) that the Daily Southern Cross data in the current experiment were slightly better than average quality.
3. Utah Digital Newspapers Digital Newspaper Project Handbook. Slide 34.
4. Holley, Rose. "Increasing the Accuracy of OCR". <http://www.nla.gov.au/ndp/project_details/documents/ANDP_IncreasingOCRaccuracy.pdf>.
7. Holley, Rose. Personal communication.
Copyright © 2009 Tracy Powell and Gordon Paynter