As the last frontier in optical character recognition (OCR) research, offline handwriting recognition has appealed to researchers in artificial intelligence and pattern recognition for many years. OCR is not sufficient for flawless transcription of handwritten document images, especially degraded ones. In contrast with the 10–15% word error rate of automatic speech recognition, that of handwriting recognition is 30–40% if the image quality is good and can be 70–80% for degraded images. As Figure 1 shows, text excerpted from a New York State Pre-hospital Care Report (PCR) is difficult to read due to several types of degradation: low contrast, intense ambient noise, and blemishes. However, research on creating indices of keywords and semantic structures from OCR output has led to emerging applications [1–4] to automate archival of degraded handwritten document images for fast information retrieval.
Figure 1. This image of a handwritten document exhibits several types of degradation.
To aid such efforts, we have explored new image pre-processing techniques and investigated the feasibility of adapting information retrieval formulations to the statistical configuration of handwriting recognition systems. We developed a technique using the Markov random field (MRF) [5] to separate text from background (binarization). We also used the top-n hypotheses returned by a handwriting recognizer [6] to estimate the occurrences of keywords more accurately [3,4].
Traditional binarization algorithms (such as Niblack's and Otsu's algorithms) do not model text shape well, and thus cannot effectively reduce the effects of intense noise. Our approach [5] trains an MRF that models the local connectivity of text strokes and polishes their edges according to the MRF potential. We also used the MRF to remove ruled lines without breaking text strokes. Figure 2 shows the binarized region of the text in Figure 1. Compared to existing binarization and line-removal algorithms, this method reduces the word error rate by about 15%.
Figure 2. The binarized region of the text shown in Figure 1.
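For comparison, the kind of traditional global thresholding that the MRF approach is measured against can be sketched in a few lines. The following is a minimal NumPy illustration of Otsu-style thresholding, not the MRF method described in this article. The function names `otsu_threshold` and `binarize` are hypothetical, and the sketch assumes an 8-bit grayscale image with dark text on a lighter background.

```python
import numpy as np

def otsu_threshold(gray):
    """Return an Otsu-style global threshold for a 2-D uint8 grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)            # weight of the class "pixel value <= k"
    mu = np.cumsum(prob * levels)   # cumulative (unnormalized) mean up to k
    mu_total = mu[-1]
    w1 = 1.0 - w0                   # weight of the class "pixel value > k"

    # Between-class variance for every candidate threshold k.
    numerator = (mu_total * w0 - mu) ** 2
    denominator = w0 * w1
    valid = denominator > 0         # skip thresholds that leave one class empty
    sigma_b = np.zeros(256)
    sigma_b[valid] = numerator[valid] / denominator[valid]

    return int(np.argmax(sigma_b))

def binarize(gray):
    """Return a boolean mask that is True for pixels classified as text (dark)."""
    return gray <= otsu_threshold(gray)
```

Because every pixel is compared with the same threshold regardless of its neighbors, nothing in this baseline models the shape or connectivity of strokes, which is the limitation the MRF-based approach is designed to address.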
The most important factor in search engine performance is the number of occurrences of query terms in a document. For instance, in the text, ‘Patient is experiencing trauma,’ the keyword ‘trauma’ appears, and thus the text is returned when we search the entire text corpus for articles relevant to a certain trauma. The more frequently a keyword appears, the more relevant the article. Vector model-based information retrieval uses a formulation of term frequency (TF) and inverse document frequency to compute a weighted sum of the occurrences of keywords that measures query-document similarity.
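The weighting described above can be illustrated with a small sketch. The article does not specify which TF-IDF variant is used, so the code below assumes one common choice: raw term counts multiplied by log(N/df), with cosine similarity as the query-document score. All function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (per-document TF-IDF dicts, IDF dict)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (document indices sorted by decreasing score, list of scores)."""
    vectors, idf = tfidf_vectors(docs)
    q = {term: tf * idf.get(term, 0.0) for term, tf in Counter(query).items()}
    scores = [cosine(q, vec) for vec in vectors]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

With clean text, the counts fed into `Counter` are exact. With recognizer output they are not, which is why the estimate of TF itself becomes the central problem for handwritten documents.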
Handwriting recognition errors have multiple causes, including incorrect segmentation of word images and misrecognition of words, and these errors can degrade information retrieval. A good language model (LM) can prevent some of them. Although the automatic transcription (usually the top-1 hypothesis returned by the handwriting recognizer) contains errors, corrections for many of them can be found among the top-n hypotheses. Unlike existing methods [1,2] that estimate TF from isolated word recognition results, our approach uses word sequence recognition hypotheses to better model segmentation errors and the LM.
To avoid high computational cost, we introduced a novel dynamic programming algorithm [4] that reduces the runtime of TF estimation to polynomial complexity. This mirrors handwriting recognition itself, where dynamic programming is used to find the optimal word sequence, that is, the one that maximizes the likelihood of the image features. In information retrieval tasks, given a query term t and a document d, we need to estimate the expected number of occurrences of t in d:
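$$E[n(t, d)] = \sum_{\mathbf{s}} \sum_{\mathbf{w}} P(\mathbf{s} \mid \mathbf{o}) \, P(\mathbf{w} \mid \mathbf{s}) \, n(t, \mathbf{w})$$

Here, $\mathbf{o}$ is the observed sequence of image features of document $d$, $\mathbf{s}$ is a series of word images, $\mathbf{w}$ is a series of terms, $P(\mathbf{s} \mid \mathbf{o})$ is the probability that $\mathbf{s}$ is a valid segmentation of $\mathbf{o}$, $P(\mathbf{w} \mid \mathbf{s})$ is the word sequence recognition probability, and $n(t, \mathbf{w})$ is the number of occurrences of $t$ in $\mathbf{w}$. We improved the mean average precision [4] from 11.7% to 20.4% when running 28 queries on 324 PCR forms.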
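To show why such an expectation can be computed in polynomial time, the sketch below runs a forward-backward pass over a small segmentation lattice. It is a simplified illustration under stated assumptions, not the algorithm of reference 4: each lattice edge stands for one candidate word image with one candidate word, the score of a path is the product of its edge scores (so no language-model context carries across words), and the lattice layout, scores, and the function name `expected_tf` are all hypothetical.

```python
def expected_tf(num_nodes, edges, term):
    """Expected number of occurrences of `term` over all lattice paths.

    Nodes 0..num_nodes-1 are cut points in the line image, ordered left to right.
    edges: list of (start, end, word, score) with start < end, where score
    combines the segmentation and word recognition scores for that hypothesis.
    Paths run from node 0 to node num_nodes-1.
    """
    last = num_nodes - 1

    # forward[i]: total score of all partial paths from node 0 to node i.
    forward = [0.0] * num_nodes
    forward[0] = 1.0
    for start, end, _, score in sorted(edges, key=lambda e: e[0]):
        forward[end] += forward[start] * score

    # backward[i]: total score of all partial paths from node i to the last node.
    backward = [0.0] * num_nodes
    backward[last] = 1.0
    for start, end, _, score in sorted(edges, key=lambda e: e[1], reverse=True):
        backward[start] += score * backward[end]

    total = forward[last]  # sum of the scores of all complete paths
    if total == 0.0:
        return 0.0

    # By linearity of expectation, the expected count is the sum of the
    # posterior probabilities of the edges labeled with the term.
    mass = sum(
        forward[start] * score * backward[end]
        for start, end, word, score in edges
        if word == term
    )
    return mass / total

# Two word images, plus one hypothesis that merges them into a single word.
lattice = [
    (0, 1, "severe", 0.7),
    (0, 1, "several", 0.3),
    (1, 2, "trauma", 0.6),
    (1, 2, "drama", 0.4),
    (0, 2, "trauma", 0.2),
]
estimate = expected_tf(3, lattice, "trauma")
```

Each edge is visited a constant number of times, so the cost grows linearly with the number of edges rather than with the number of paths, which can be exponential in the length of the line. A top-1 transcription would report a count of exactly 0 or 1 for this line, whereas the expectation weighs every hypothesis by its score.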
In future work, we will investigate applying our techniques to broader sources of document images, for example to automatically search historical manuscripts collected by libraries all over the world. Another possible application is automatic indexing of road signs and any other textual areas in video. We believe our methods can be extended to these potentially promising applications.
Huaigu Cao is currently a scientist at BBN Technologies. The research described in this article was done when he was pursuing his doctoral degree at the University at Buffalo.
Anurag Bhardwaj, Venu Govindaraju
Center for Unified Biometrics and Sensors (CUBS)
Department of Computer Science and Engineering,
School of Engineering and Applied Sciences
University at Buffalo
Anurag Bhardwaj is a PhD candidate.
Venu Govindaraju is a distinguished professor and the director of CUBS.