Proceedings Volume 2422

Document Recognition II

Luc M. Vincent, Henry S. Baird
cover
Proceedings Volume 2422

Document Recognition II

Luc M. Vincent, Henry S. Baird
View the digital version of this volume at SPIE Digital Libarary.

Volume Details

Date Published: 30 March 1995
Contents: 7 Sessions, 34 Papers, 0 Presentations
Conference: IS&T/SPIE's Symposium on Electronic Imaging: Science and Technology 1995
Volume Number: 2422

Table of Contents

icon_mobile_dropdown

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library. external link icon
View Session icon_mobile_dropdown
  • Recognition I
  • Recognition II
  • Document Layout Analysis
  • Error Analysis
  • Recognition of Non-Roman Text
  • Special Document Recognition Topics
  • Restoration
Recognition I
icon_mobile_dropdown
Public domain optical character recognition
Michael D. Garris, James L. Blue, Gerald T. Candela, et al.
A public domain document processing system has been developed by the National Institute of Standards and Technology (NIST). The system is a standard reference form-based handprint recognition system for evaluating optical character recognition (OCR), and it is intended to provide a baseline of performance on an open application. The system's source code, training data, performance assessment tools, and type of forms processed are all publicly available. The system recognizes the handprint entered on handwriting sample forms like the ones distributed with NIST Special Database 1. From these forms, the system reads hand-printed numeric fields, upper and lowercase alphabetic fields, and unconstrained text paragraphs comprised of words from a limited-size dictionary. The modular design of the system makes it useful for component evaluation and comparison, training and testing set validation, and multiple system voting schemes. The system contains a number of significant contributions to OCR technology, including an optimized probabilistic neural network (PNN) classifier that operates a factor of 20 times faster than traditional software implementations of the algorithm. The source code for the recognition system is written in C and is organized into 11 libraries. In all, there are approximately 19,000 lines of code supporting more than 550 subroutines. Source code is provided for form registration, form removal, field isolation, field segmentation, character normalization, feature extraction, character classification, and dictionary-based postprocessing. The recognition system has been successfully compiled and tested on a host of UNIX workstations. This paper gives an overview of the recognition system's software architecture, including descriptions of the various system components along with timing and accuracy statistics.
Character segmentation using visual interword constraints in a text page
Tao Hong, Jonathan J. Hull
Character segmentation is a critical preprocessing step for text recognition. In this paper a method is presented that utilizes visual inter-word constraints available in a text image to split word images into smaller image pieces. This method is applicable to machine-printed texts in which the same spacing is always used between identical pairs of characters. The visual inter- word constraints considered here include information about whether a word image is a sub- image of another word image. For example, given two word images A and B, which are `mathematical' and `the.' If the short word image B is found to be a sub-image of the long word image A, the longer image A is split into three pieces, A1, A2, and A3, where A2 matches B, A1 corresponds to `ma,' and A3 corresponds to `matical.' The image piece A1 can be further used to split A3 into two parts, `ma' and `tical.' This method is based purely on image processing using the visual context in a text page. No recognition is involved.
Dynamic classifier combination using neural network
Dar-Shyang Lee, Sargur N. Srihari
There is an increasing trend in recent OCR research to improve recognition performance by combining several complementary algorithms. Numerous successful applications of classifier combination have been described in the literature. However, most of the combination algorithms neglect the fact that classifier performances are dependent on various pattern and image characteristics. More effective combination can be achieved if that dependency information is used to dynamically combine classifiers. Two types of dynamic selections are distinguished. The postconditioned selection seeks better approximation to unconditioned classifier output distribution. The preconditioned selection captures the variations in the density function of classifier outputs conditioned on the inputs. Although both types of selections have the potential to improve combination performance, we argue that preconditioned selections have lower error bounds than postconditioned selections. The difficulties of applying preconditioned selections are identifying characteristic variables and estimating their effects on classifier outputs. A solution using neural network to achieve preconditioned selection is suggested. Two examples on handprinted digit recognition are used to illustrate the potential gain.
Sliding window technique for word recognition
Stephen W. Lam, Anuradha Bhate, Sargur N. Srihari
The proposed approach addresses the problem of recognition of touching characters by forming a closed loop system between segmentation and isolated character classification for mutually beneficial feedback. Multiple hypotheses are generated and ranked on the basis of various constraints such as classifier confidence, geometric or structural requirements and contextual information at every intermediate step. The method uses a variable width window sliding throughout the word and results in a tree structure with intermediate nodes representing validated characters. Each partial path is assigned a confidence value on the basis of segmentation confidence, recognition confidence and first order transition probability, if applicable. Final candidates for the field truth are ranked according to the value of path confidence. The proposed system is more robust since it uses all the available knowledge sources, including context, run time at every intermediate step.
High-performance OCR preclassification trees
Henry S. Baird, C. L. Mallows
We present an automatic method for constructing high-performance preclassification decision trees for OCR. Good preclassifiers prune the set of alternative classes to many fewer without erroneously pruning the correct class. We build the decision tree using greedy entropy minimization, using pseudo-randomly generated training samples derived from a model of imaging defects, and then `populate' the tree with many more samples to drive down the error rate. In [BM94] we presented a statistically rigorous stopping rule for population that enforces a user-specified upper bound on error: this works in practice, but is too conservative, driving the error far below the bound. Here, we describe a refinement that achieves the user- specified accuracy more closely and thus improves the pruning rate of the resulting tree. The method exploits the structure of the tree: the essential technical device is a leaf-selection rule based on Good's Theorem [Good53]. We illustrate its effectiveness through experiments on a pan-European polyfont classifier.
Generalizing edit distance for handwritten text recognition
Giovanni Seni, V. Kripasundar, Rohini K. Srihari
In this paper the Damerau-Levenshtein string difference metric is generalized in two ways to more accurately compensate for the types of errors that are present in the script recognition domain. First, the basic dynamic programming method for computing such a measure is extended to allow for merges, splits and two-letter substitutions. Second, edit operations are refined into categories according to the effect they have on the visual `appearance' of words. A set of recognizer-independent constraints is developed to reflect the severity of the information lost due to each operation. These constraints are solved to assign specific costs to the operations. Experimental results on 2,335 corrupted strings and a lexicon of 21,299 words show higher correcting rates than with the original form.
Stochastic attribute grammar model of document production and its use in document image decoding
Philip A. Chou, Gary E. Kopec
Document image decoding (DID) refers to the process of document recognition within a communication theory framework. In this framework, a logical document structure is a message communicated by encoding the structure as an ideal image, transmitting the ideal image through a noisy channel, and decoding the degraded image into a logical structure as close to the original message as possible, on average. Thus document image decoding is document image recognition where the recognizer performs optimal reconstruction by explicitly modeling the source of logical structures, the encoding procedure, and the channel noise. In previous work, we modeled the source and encoder using probabilistic finite-state automata and transducers. In this paper, we generalize the source and encoder models using context-free attribute grammars. We employ these models in a document image decoder that uses a dynamic programming algorithm to minimize the probability of error between original and reconstructed structures. The dynamic programming algorithm is a generalization of the Cocke-Younger-Kasami parsing algorithm.
Recognition II
icon_mobile_dropdown
Modified character-level deciphering algorithm for OCR in degraded documents
Chi Fang, Jonathan J. Hull
Modifications to a previous character-level deciphering algorithm for OCR are presented in this paper that are able to handle touching characters and are tolerant to mistakes made at the clustering stage. The objective of a character-level deciphering algorithm is to assign alphabetic identities to character patterns such that the character repetition pattern in an input text matches the letter repetition pattern provided by a language model. Degradation in document images usually causes the occurrence of touching characters and mistakes in clustering the character patterns, which pose difficulties for character-level deciphering algorithms. The modifications proposed in this paper tightly integrate visual constraints from characters and touching patterns with constraints from a language model to decode touching characters and to detect and reverse clustering mistakes. It provides a deciphering algorithm with robust performance under image degradation.
Separable source models for document image decoding
Anthony C. Kam, Gary E. Kopec
This paper defines separable models and examines their relationship to other types of Markov source models used for document image decoding. Loosely, a separable source is one that may be factored into a product of 1-dimensional models that represent horizontal and vertical structure, respectively. More formally, a separable model is a collection of Markov subsources that is similar to a recursive transition network. Associated with some of the nodes are position constraints that restrict entry to the nodes to certain regions of the image plane. The top-level subsource is a vertical model whose nodes are all tightly constrained to specific horizontal positions. The importance of separable models is that they form the basis for fast decoding algorithms based on heuristic search. In many situations, the natural structure of a class of document images leads the user to define a model that is recursive, but not necessarily separable. We describe an algorithm for converting a recursive source into an equivalent separable form, if one exists. A key factor determining the success of source separation is the number of tightly constrained nodes. We present a procedure for propagating user-supplied position constraints, typically given for a small subset of nodes, to the remaining nodes of the model.
Word recognition using a lexicon constrained by first/last character decisions
Sheila X. Zhao, Sargur N. Srihari
In lexicon based recognition of machine-printed word images, the size of the lexicon can be quite extensive. The recognition performance is closely related to the size of the lexicon. Recognition performance drops quickly when lexicon size increases. Here, we present an algorithm to improve the word recognition performance by reducing the size of the given lexicon. The algorithm utilizes the information provided by the first and last characters of a word to reduce the size of the given lexicon. Given a word image and a lexicon that contains the word in the image, the first and last characters are segmented and then recognized by a character classifier. The possible candidates based on the results given by the classifier are selected, which give us the sub-lexicon. Then a word shape analysis algorithm is applied to produce the final ranking of the given lexicon. The algorithm was tested on a set of machine- printed gray-scale word images which includes a wide range of print types and qualities.
Selection of image features for distribution-map classifiers
Tin Kam Ho
A useful metric for pattern classification can be derived from a series of one-dimensional maps of empirical class-conditional distributions. To achieve maximum classification accuracy, the distributions need to be projected into subspaces where they are locally dense and the classes are at least partially separable. We describe a method where useful projections are obtained automatically by a sequence of cluster and linear discriminant analyses. Each projection can be considered as a feature that contributes to the elimination of unresolved ambiguities. We discuss conditions under which the method can achieve maximum accuracy. In an experiment of applying the method to machine-printed character images, we show that the method yields a classifier that has very low error rate, and can be improved with additional training data.
Markov source model for printed music decoding
Gary E. Kopec, Philip A. Chou, David A. Maltz
This paper describes a Markov source model for a simple subset of printed music notation. The model is based on the Adobe Sonata music symbol set and a message language of our own design. Chord imaging is the most complex part of the model. Much of the complexity follows from a rule of music typography that requires the noteheads for adjacent pitches to be placed on opposite sides of the chord stem. This rule leads to a proliferation of cases for other typographic details such as dot placement. We describe the language of message strings accepted by the model and discuss some of the imaging issues associated with various aspects of the message language. We also point out some aspects of music notation that appear problematic for a finite-state representation. Development of the model was greatly facilitated by the duality between image synthesis and image decoding. Although our ultimate objective was a music image model for use in decoding, most of the development proceeded by using the evolving model for image synthesis, since it is computationally far less costly to image a message than to decode an image.
Document Layout Analysis
icon_mobile_dropdown
Extraction of text layout structures on document images based on statistical characterization
Su S. Chen, Robert M. Haralick, Ihsin T. Phillips
The textual structures like the characters, words, text lines, paragraphs on a document image are usually laid out in a very structured manner -- having preferred spatial relations. These spatial relations are rarely deterministic; instead, they describe correlations and likelihoods. Therefore, any realistic document layout analysis algorithm should utilize this type of knowledge in order to optimize its performances. In this paper, we first describe a method for automatically generating a large amount of almost 100% correct ground truth data for the document layout analysis. The bounding boxes for the characters, words, text lines, paragraphs are expressed in a hierarchy. Then based on these layout ground-truth, we build statistical models to model the layout structures for the words, text lines, paragraphs on document images. Finally, we described an algorithm that utilizes these statistical models to extract the text words on document images. The performance of the algorithm is evaluated and reported.
Document page decomposition using bounding boxes of connected components of black pixels
Jaekyu Ha, Ihsin T. Phillips, Robert M. Haralick
Segmentation of document images can be performed by projecting image pixels. This pixel projection approach is one of widely used top-down segmentation methods and is based on the assumption that the document image has been correctly deskewed. Unfortunately, the pixel projection approach is computationally inefficient. It is because each symbol is not treated as a computational unit. In this paper, we explain a new technique which is highly tactical in the profiling analysis. Instead of projecting image pixels, we first compute the bounding box of each connected component in a document image and then we project those bounding boxes. Using the new technique, this paper describes how to extract words, text lines, and text blocks (e.g., paragraphs). This bounding box projection approach has many advantages over the pixel projection approach. It is less computationally involved. When applied to text zones, it is also possible to infer from the projection profiles how bounding boxes (and, therefore, primitive symbols) are aligned and/or where significant horizontal and vertical gaps are present. Since the new technique manipulates only bounding boxes, it can be applied to any noncursive language documents.
Multithresholding for document image segmentation
Shriram V. Revankar
Document images contain information in several dominant intensity levels. Detecting these dominant levels helps segmenting the regions in the image that need to be processed differently, if subsequently the image was to be recognized or printed. In this paper we propose a multithresholding algorithm to segment different dominant intensity levels in a document image based on intensity histogram and intensity transition histograms of the image. The method involves determining the number of dominant intensity levels in the image and computing the necessary number of thresholds that facilitate the segmentation.
Automatic line detection in document images using recursive morphological transforms
Bin Kong, Su S. Chen, Robert M. Haralick, et al.
In this paper, we describe a system that detects lines of various types, e.g., solid lines and dotted lines, on document images. The main techniques are based on the recursive morphological transforms, namely the recursive opening and closing transforms. The advantages of the transforms are that they can perform binary opening and closing with any sized structuring element simultaneously in constant time per pixel, and that they offer a solution to morphological image analysis problems where the sizes of the structuring elements have to be determined after the examination of the image itself. The system is evaluated on about 1,200 totally ground-truthed IRS tax form images of different qualities. The line detection output is compared with a set of hand-drawn groundtruth lines. The statistics like the number and rate of correct detection, miss detection, and false alarm are calculated. The performance of 32 algorithms for solid line detection are compared to find the best one. The optimal algorithm tuning parameter settings could be estimated on the fly using a regression tree.
Text/graphics separation for technical papers
Oleg G. Okun, Sergey V. Ablameyko
One of the important operations in automatic analysis of technical papers is a text separation from graphics. In practice, a document skew often occurs both for initial document and for its image after scanning. Also text and graphic blocks can exist which have no rectangular shape. In these cases, the standard text/graphics separation methods such as projection profiles or run length smoothing are not always suitable. In this paper, we propose the text/graphics separation algorithm based on two simple and standard properties of technical paper pages. We call them as area and text compactness properties. The area property takes into account the geometrical relationships between text and graphics. The text compactness property reflects the spatial relationships between text components within block and between text and graphics. An application of both properties allows us to accurately perform the separation in the cases above. No skew correction is required before separation and text and graphic blocks can have arbitrary shape.
Error Analysis
icon_mobile_dropdown
Counting OCR errors in typeset text
Jonathan S. Sandberg
Frequently object recognition accuracy is a key component in the performance analysis of pattern matching systems. In the past three years, the results of numerous excellent and rigorous studies of OCR system typeset-character accuracy (henceforth OCR accuracy) have been published, encouraging performance comparisons between a variety of OCR products and technologies. These published figures are important; OCR vendor advertisements in the popular trade magazines lead readers to believe that published OCR accuracy figures effect market share in the lucrative OCR market. Curiously, a detailed review of many of these OCR error occurrence counting results reveals that they are not reproducible as published and they are not strictly comparable due to larger variances in the counts than would be expected by the sampling variance. Naturally, since OCR accuracy is based on a ratio of the number of OCR errors over the size of the text searched for errors, imprecise OCR error accounting leads to similar imprecision in OCR accuracy. Some published papers use informal, non-automatic, or intuitively correct OCR error accounting. Still other published results present OCR error accounting methods based on string matching algorithms such as dynamic programming using Levenshtein (edit) distance but omit critical implementation details (such as the existence of suspect markers in the OCR generated output or the weights used in the dynamic programming minimization procedure). The problem with not specifically revealing the accounting method is that the number of errors found by different methods are significantly different. This paper identifies the basic accounting methods used to measure OCR errors in typeset text and offers an evaluation and comparison of the various accounting methods.
Systematic bias in OCR experiments
In this paper, we examine the effects of systematic differences (bias) and sample size (variance) on computed OCR accuracy. We present results from large-scale experiments simulating several groups of researchers attempting to perform the same test, but using slightly different equipment and procedures. We first demonstrate that seemingly minor systematic differences between experiments can result in significant biases in the computed OCR accuracy. Then we show that while a relatively small number of pages is sufficient to obtain a precise estimate of accuracy in the case of `clean' input, real-world degradation can greatly increase the required sample size.
Preliminary evaluation of histogram-based binarization algorithms
Junichi Kanai, Kevin O. Grover
To date, most optical character recognition (OCR) systems process binary document images, and the quality of the input image strongly affects their performance. Since a binarization process is inherently lossy, different algorithms typically produce different binary images from the same gray scale image. The objective of this research is to study effects of global binarization algorithms on the performance of OCR systems. Several binarization methods were examined: the best fixed threshold value for the data set, the ideal histogram method, and Otsu's algorithm. Four contemporary OCR systems and 50 hard copy pages containing 91,649 characters were used in the experiments. These pages were digitized at 300 dpi and 8 bits/pixel, and 36 different threshold values (ranging from 59 to 199 in increments of 4) were used. The resulting 1,800 binary images were processed by all four OCR systems. All systems made approximately 40% more errors from images generated by Otsu's method than those of the ideal histogram method. Two of the systems made approximately the same number of errors from images generated by the best fixed threshold value and Otsu's method.
Recognition of Non-Roman Text
icon_mobile_dropdown
Divide-and-conquer approach to Japanese text segmentation
Stephen W. Lam, Qunfeng Liao, Sargur N. Srihari
This paper presents a robust text segmentation algorithm for printed Japanese documents. A divide-and-conquer approach is proposed to handle a large variety of image qualities and print styles. The approach can adapt its processing strategies according to the text quality, i.e., a method using diverse knowledge sources will be exploited to segment degraded text while a fast simple method will be used for good quality text. Since the algorithm can adaptively select the methods for different scenarios, the segmentation is highly efficient in terms of speed and accuracy. The segmenter has tree modules for image preprocessing, line segmentation, and character segmentation. The preprocessor uses the statistical information of the image connected components to globally estimate character size and uses projection profile to determine image quality. The line segmenter requires a `thresholding and smoothing' step prior to line extraction if the image is noisy. During character segmentation, the character segmenter first tries to locate components which contain touching characters. If touching characters exist, an algorithm which includes a profile-based splitting and classifier-based multiple hypothesis processing will be invoked to perform the segmentation.
Segmentation versus segmentation-free for recognizing Arabic text
May Allam
Arabic character recognition faces many technical difficulties, but the most challenging problem is the cursive characteristic of Arabic text. The present work explores two different approaches to solve the cursive problem. The first approach depends upon preprocessing to segment connected characters. The other approach is a segmentation-free technique where whole portions of connected characters are recognized without prior segmentation. Both methods employ hidden Markov models for classification. The present paper investigates the robustness of the two techniques and suggests how to combine them so that the weakness of one is compensated for by the strength of the other.
Robust recognition of degraded machine-printed characters using complementary similarity measure and error-correction learning
Norihiro Hagita, Minako Sawaki
Most conventional methods in character recognition extract geometrical features such as stroke direction, connectivity of strokes, etc., and compare them with reference patterns in a stored dictionary. Unfortunately, geometrical features are easily degraded by blurs, stains and the graphical background designs used in Japanese newspaper headlines. This noise must be removed before recognition commences, but no preprocessing method is completely accurate. This paper proposes a method for recognizing degraded characters and characters printed on graphical background designs. This method is based on the binary image feature method and uses binary images as features. A new similarity measure, called the complementary similarity measure, is used as a discriminant function. It compares the similarity and dissimilarity of binary patterns with reference dictionary patterns. Experiments are conducted using the standard character database ETL-2 which consists of machine-printed Kanji, Hiragana, Katakana, alphanumeric, an special characters. The results show that this method is much more robust against noise than the conventional geometrical feature method. It also achieves high recognition rates of over 92% for characters with textured foregrounds, over 98% for characters with textured backgrounds, over 98% for outline fonts, and over 99% for reverse contrast characters.
Visual similarity analysis of Chinese characters and its uses in Japanese OCR
Tao Hong, Stephen W. Lam, Jonathan J. Hull, et al.
Traditionally, a Chinese or Japanese optical character reader (OCR) has to represent each character category individually as one or more feature prototypes, or a structural description which is a composition of manually derived components such as radicals. Here we propose a new approach in which various kinds of visual similarities between different Chinese characters are analyzed automatically at the feature level. Using this method, character categories are related to each other by training on fonts; and character images from a text page can be related to each other based on the visual similarities they share. This method provides a way to interpret character images from a text page systematically, instead of a sequence of isolated character recognitions. The use of the method for post processing in Japanese text recognition is also discussed.
Special Document Recognition Topics
icon_mobile_dropdown
Spotting phrases in lines of imaged text
Francine R. Chen, Dan S. Bloomberg, Lynn D. Wilcox
A system that searches for user-specified phrases in imaged text is described. The search `phrases' can be word fragments, words, or groups of words. The imaged text can be composed of a number of different fonts and can contain graphics. A combination of morphology, simple statistical methods and hidden Markov modeling is used to detect and locate the phrases. The image is deskewed, and then bounding boxes are found for text-lines in the image using multiresolution morphology. Baselines, toplines and the x-height in a text-line are identified using simple statistical methods. The distance between baseline and x-height is used to normalize each hypothesized text-line bounding box, and the columns of pixel values in a normalized bounding box serve as the feature vector for that box. Hidden Markov models are crated for each user-specified search string and to represent all text and graphics other than the search strings. Phrases are identified using Viterbi decoding on a spotting network created from the models. The operating point of the system can be varied to trade off the percentage of words correctly spotted and the percentage of false alarms. Results are given using a subset of the UW English Document Image Database I.
Keyword spotting via word shape recognition
Jeff L. DeCurtins, Edward C. Chen
With the advent of on-line access to very large collections of document images, electronic classification into areas of interest has become possible. A first approach to classification might be the use of OCR on each document followed by analysis of the resulting ASCII text. But if the quality of a document is poor, the format unconstrained, or time is critical, complete OCR of each image is not appropriate. An alternative approach is the use of word shape recognition (as opposed to individual character recognition) and the subsequent classification of documents by the presence or absence of selected keywords. Use of word shape recognition not only provides a more robust collection of features but also eliminates the need for character segmentation (a leading cause of error in OCR). In this paper we describe a system we have developed for the detection of isolated words, word portions, as well as multi-word phrases in images of documents. It is designed to be used with large, changeable, keyword sets and very large document sets. The system provides for automated training of desired keywords and creation of indexing filters to speed matching.
Blur hit-miss transform and its use in document image pattern detection
Dan S. Bloomberg, Luc M. Vincent
The usefulness of the hit-miss transform (HMT) and related transforms for pattern matching in document image applications is examined. Although the HMT is sensitive to the types of noise found in scanned images, including both boundary and random noise, a simple extension, the blur HMT, is relatively robust. The noise immunity of the blur HMT derives from its ability to treat both types of noise together, and to remove them by appropriate dilations. In analogy with the Hausdorff metric for the distance between two sets, metric generalizations for special cases of the blur HMT are derived. Whereas Hausdorff uses both directions of the directed distances between two sets, a metric derived from a special case of the blur HMT uses just one direction of the directed distances between foreground and background components of two sets. For both foreground and background, the template is always the first of the directed sets. A less restrictive metric generalization, where the disjoint foreground and background components of the template need not be set complements, is also derived. For images with a random component of noise, the blur HMT is sensitive only to the size of the noise, whereas Hausdorff matching is sensitive to its location. It is also shown how these metric functions can be derived from the distance functions of the foreground and background of an image, using dilation by the appropriate templates. The blur HMT is implemented efficiently with Boolean image operations. The FG and BG images are dilated with structuring elements that depend on image noise and pattern variability, and the results are then eroded with templates derived from patterns to be matched. Subsampling the patterns on a regular grid can improve speed and maintain match quality, and examples are given that indicate how to explore the parameter space. The blur HMT can be used as a fast heuristic to avoid more expensive integer-based matching techniques. Truncated matches give the same result as full erosions and are much faster.
Script determination in document images
Larry Spitz
We have developed techniques for distinguishing which language is represented in an image of text. This work is restricted to an important subset of the world's languages, using techniques that should be applicable across even more comprehensive samples. The method first classifies the script into two broad classes: European and Asian. This classification is based on the spatial relationships of fiducial points related to the upward concavities in character structures. Script identification within the Asian class, (Japanese, Chinese, Korean) is performed by analysis of the optical density distribution of the text images.
Measuring document image skew and orientation
Dan S. Bloomberg, Gary E. Kopec, Lakshmi Dasari
Several approaches have previously been taken for identifying document image skew. At issue are efficiency, accuracy, and robustness. We work directly with the image, maximizing a function of the number of ON pixels in a scanline. Image rotation is simulated by either vertical shear or accumulation of pixel counts along sloped lines. Pixel sum differences on adjacent scanlines reduce isotropic background noise from non-text regions. To find the skew angle, a succession of values of this function are found. Angles are chosen hierarchically, typically with both a coarse sweep and a fine angular bifurcation. To increase efficiency, measurements are made on subsampled images that have been pre-filtered to maximize sensitivity to image skew. Results are given for a large set of images, including multiple and unaligned text columns, graphics, and large area halftones. The measured intrinsic angular error is inversely proportional to the number of sampling points on a scanline. This method does not indicate when text is upside-down, and it also requires sampling the function at 90 degrees of rotation to measure text skew in landscape mode. However, such text orientation can be determined (as one of four directions) by noting that Roman characters in all languages have many more ascenders than descenders, and using morphological operations to identify such pixels. Only a small amount of text is required for accurate statistical determination of orientation, and images without text are identified as such.
Evaluation of an automatic markup system
Kazem Taghva, Allen Condit, Julie Borsack
One predominant application of OCR is the recognition of full text documents for information retrieval. Modern retrieval systems exploit both the textual content of the document as well as its structure. The relationship between textual content and character accuracy have been the focus of recent studies. It has been shown that due to the redundancies in text, average precision and recall is not heavily affected by OCR character errors. What is not fully known is to what extent OCR devices can provide reliable information that can be used to capture the structure of the document. In this paper, we present a preliminary report on the design and evaluation of a system to automatically markup technical documents, based on information provided by an OCR device. The device we use differs from traditional OCR devices in that it not only performs optical character recognition, but also provides detailed information about page layout, word geometry, and font usage. Our automatic markup program, which we call Autotag, uses this information, combined with dictionary lookup and content analysis, to identify structural components of the text. These include the document title, author information, abstract, sections, section titles, paragraphs, sentences, and de-hyphenated words. A visual examination of the hardcopy is compared to the output of our markup system to determine its correctness.
Performance analysis of 2D-barcode-enhanced documents
Jackson D. He, Chung-Chi Jim Li, Sundeep Kumar, et al.
Barcode-enhanced documents are used in office automation and in applications involving shipping manifests, vehicle registration forms, photo-ID, etc. They consist of human readable text and machine readable barcodes, which duplicates the text in a secure machine-readable form. Two-dimensional barcodes are used when it is beneficial to encode a large amount of data. The advantage of barcode-enhanced document over OCR is that the rejection and misdecode rates are several orders of magnitude smaller. This paper analyzes the performance of 2-D barcode-enhanced documents, characterized by the rates of correct decode, rejection, and misdecode of the 2-D barcode. Primarily using PDF417 as an example, we show how these rates are determined by the parameters used in the encoding process and the decoder design, and from the environment. We find that in real-world applications the misdecode rates are extremely low, and thus for all practical purposes the limiting parameter is the rejection rate.
Restoration
icon_mobile_dropdown
Design and application of binary differencing filters
For given binary image and degradation processes, an optimal mean-absolute-error translation- invariant filter can be designed via the representation of such filters as a union of morphological hit-or-miss transforms. The present paper investigates a different optimization methodology by representing translation-invariant filters as differencing filters. Rather than employing structuring templates to build the entire output image, as is done with direct hit-or- miss representation, differencing filters only employ templates that locate value flips (black-to- white or white-to-black). Differencing filters play a central role in several digital document processing tasks. It is shown how differencing filters are statistically designed and applied for image restoration and resolution conversion.
Characterization of vibration-induced image defects in input scanners
Robert P. Loce, George Wolberg
Typical office scanners employ a moving linear-sensor array or moving optics. The velocity of the components is generally not constant in time. It may be modulated directly (at one or more frequencies) by dynamic errors of gears, timing belts, and motors, and indirectly by structural vibrations induced by gears, fans, etc. Nonuniform velocity is known to cause undesirable brightness fluctuation and warping in the sampled image. The present paper characterizes the image defects induced by nonuniform velocity. A companion paper utilizes the degradation information to develop an algorithm to restore the degraded image.
Restoration of images degraded by input scanner vibrations
George Wolberg, Robert P. Loce
Images scanned in the presence of mechanical vibrations are subject to artifacts such as brightness fluctuation and geometric warping. The goal of this work is to develop an algorithm to invert these distortions and produce an output digital image consistent with a scanner operating under ideal uniform motion conditions. The image restoration algorithm described in this paper applies to typical office scanners that employ a moving linear sensor array (LSA) or moving optics. The velocity of the components is generally not constant in time. Dynamic errors are introduced by gears, timing belts, motors, and structural vibrations. A companion paper characterizes the artifacts induced by nonuniform velocity. In this work, we make use of the instantaneous LSA velocity to reconstruct an underlying piecewise constant or piecewise linear model of the image irradiance function. The control points for the underlying model are obtained by solving a system of equations derived to relate the observed area samples with instantaneous LSA velocity and a spatially varying sampling kernel. An efficient solution exists for the narrow band diagonal matrix that results. The control points computed with this method fully define the underlying irradiance function. That function is then suitable for resampling under ideal scanning conditions to produce a restored output digital image.