News Menu

Symbolic indirect correlation improves text recognition

A novel partial word matching system is more accurate than existing methods and can easily identify new words.

22 February 2008

Ashutosh Joshi

As tablet PCs and document transcription systems become ubiquitous, versatile and accurate handwriting recognition (HR) systems are increasingly critical. Yet the two main approaches suffer from serious limitations. The character-/sub-character approach segments the handwritten text into individual letters or strokes, and uses grammatical constraints to build a computer-readable transcript.¹ However, the accuracy of these systems is limited by segmentation errors, that is, mistakes in correctly identifying the start and end of each character or stroke. Word-based methods,² on the other hand, describe the shape of the word as a whole unit. Each one must be added to the shape dictionary before it can be identified. Consequently, updating the shape dictionary can be tedious and often makes such systems impractical.

To address these difficulties, the symbolic indirect correlation (SIC) classifier relies on partial word matches. These matches allow the classifier to eliminate the segmentation step used in character and sub-character systems. And, unlike word-based systems, SIC can identify new words that do not appear in a shape dictionary, without retraining the classifier or estimating complicated models for new text.

SIC can be applied to both speech and handwriting recognition. Here, we discuss the latter. The classifier compares features of handwritten input (the query) to those from a list of ink samples (reference words) that are labeled with the corresponding computer-readable spelling (the lexical reference string). It then proposes a tentative correspondence between sequences of two or more consecutive letters (lexical polygrams) in the reference set, and parts of the input. Based on this correspondence, the classifier scores each potential spelling of the query (the classes or class labels). The query is then assigned the class with the highest score.

We have shown that SIC can recognize query words, even when the reference set does not contain a single ink sample from the lexicon³ (a dictionary of all possible words that the input sample can be assigned to). To our knowledge, no other classifier can use a mutually exclusive lexicon and training set. In addition, as the size of the reference set increases, the error asymptotically approaches Bayes’ error, the theoretical limit for classifiers.

Figure 1. The SIC classifier uses a two-stage process to identify unknown handwriting samples.

Figure 1 shows a high-level overview of SIC. When a person writes in a tablet PC, for instance, the online ink can be viewed as a time-ordered string of (x, y) coordinates. From this input string, the classifier extracts relevant features: for example, online HR may use the curvature or height of the sample from point to point, while offline HR may employ the vectors defining the contour of the word image. Since the SIC methodology only requires that the features maintain the original sequence of the query signal, the exact attributes extracted can vary. To identify matching subsegments, these query attributes are then compared to ones from labeled handwriting samples in the reference set.⁴ In the same way, the classifier compares the transcript of the signal reference to the lexicon entries (the lexical class labels, shown in Figure 1).

The next step extracts second-level features, such as the starting positions of all subsegment matches that surpass a given threshold. These second-level features in the signal and lexical domains are then evaluated to assign the query to the correct class.

We have formulated both a graph-theoretic^3,5,6 and a probabilistic⁵ approach to perform the second-level comparison. Here, we discuss the probabilistic method. Polygram matches between the sample in question and a set of labeled reference words are used to derive the query likelihood for each class, P(query|class).^3,7 Then, like character-/sub-character recognition schemes, SIC builds a lexical transcript by imposing grammatical constraints on the assigned polygrams. Unlike these methods, however, unconstrained partial word matching does not rely on the error-prone segmentation of the query into letters or strokes. Also, large reference sets have several instances of the variant shapes of most letters, making our scheme more robust.

Figure 2. The classifier can identify the partial word match between the two letter strings ‘be’ and ‘th’ in the query ‘beneath’and the two-word reference string ‘these beauty.’

Figure 3. The lexical transcripts ‘beneath’ and ‘closer’ can be assigned many possible segmentations. A segmentation assignment is any path from a root to a leaf that does not terminate in an X.

Figure 2 shows an example of partial word matches between the query ‘beneath’ and two reference words, ‘these’ and ‘beauty.’ The dashed line under ‘beneath’ represents the part of the input that did not match any section of the reference words, while matches are shown in like colors. We assume correspondence between the matched and unmatched segments of the query and the lexical polygrams of a class. Figure 3 shows these correspondences as branches of a tree. Grammatical constraints help prune the tree.³ The classifier computes the likelihood of each branch from two probabilities learned by matching reference words against each other:

The input handwriting sample is then assigned to the class with the maximum likelihood.

The SIC classifier incorporates the advantages of both character-/sub-character-based and word-based systems. It uses only polygram statistics learned from the reference set, and thus avoids the pitfall of estimating the parameters of an assumed model. SIC also avoids segmentation errors common in character-/sub-character systems. Finally, because it uses only the feature sequence, rather than position, it can accommodate common nonlinear distortions, such as stretching and contraction in camera and tablet-based optical character recognition.

Ashutosh Joshi

Fair Isaac

Sunnyvale, CA

Ashutosh Joshi received his PhD in handwriting recognition from Rensselaer Polytechnic Institute, NY, in 2006. For the past two years, he has been applying pattern recognition techniques in the financial and retail sectors with Fair Isaac Corp.

References:

1. J. Hu, M. Brown, W. Turin, HMM based online handwriting recognition, PAMI-18, pp. 1039-1045, 1996.

2. P. Rao, A knowledge-based approach to script recognition without training, PAMI-17, pp. 1233-1239, 1995.

3. A. Joshi, Symbolic Indirect Correlation Classifier, Rensselaer Polytechnic Institute, Troy, NY, 2006.

4. D. Lopresti, G. Nagy, A. Joshi, Match graph generation for symbolic indirect correlation, Proc. SPIE 6067, pp. 606706, 2006.doi:10.1117/12.651827

5. G. Nagy, S. Seth, S. Mehta, Y. Lin, Indirect Symbolic Correlation approach to unsegmented text recognition, DIAR 03, 2003.doi:10.1117/12.529291

6. G. Nagy, A. Joshi, M. Krishnamoorthy, Y. Lin, D. Lopresti, S. Mehta, S. Seth, A nonparametric classifier for unsegmented text, Proc. SPIE 5296, pp. 102-108, 2004.

7. A. Joshi, G. Nagy, D. Lopresti, S. Seth, A maximum-likelihood approach to symbolic indirect correlation, 18th ICPR 3, pp. 99-103, 2006.