Proceedings Volume 3651

Document Recognition and Retrieval VI

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 7 January 1999
Contents: 5 Sessions, 19 Papers, 0 Presentations
Conference: Electronic Imaging '99, 1999
Volume Number: 3651

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
Sessions
  • OCR Systems and Techniques
  • Handwriting Recognition
  • Models and Evaluations
  • Information Retrieval
  • Document Analysis
OCR Systems and Techniques
Text enhancement in digital video
One difficulty with using text from digital video for indexing and retrieval is that video images are often of low resolution and poor quality, and as a result the text cannot be recognized adequately by most commercial OCR software. Text image enhancement is therefore necessary to achieve reasonable OCR accuracy. Our enhancement consists of two main procedures: resolution enhancement based on Shannon interpolation, and separation of text from a complex image background. Experiments show that our enhancement approach improves OCR accuracy considerably.
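Shannon interpolation of a band-limited image can be realized discretely by zero-padding the Fourier spectrum, which is equivalent to sinc interpolation onto a finer grid. The sketch below illustrates that idea only; it is not the authors' pipeline, and the function name and plain-NumPy implementation are assumptions.

```python
import numpy as np

def shannon_upsample(image, factor=2):
    """Upsample a grayscale image by zero-padding its centered Fourier
    spectrum -- the discrete counterpart of Shannon (sinc) interpolation."""
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Embed the spectrum in a larger zero-filled array (no new frequency content).
    H, W = h * factor, w * factor
    padded = np.zeros((H, W), dtype=complex)
    top, left = (H - h) // 2, (W - w) // 2
    padded[top:top + h, left:left + w] = spectrum

    # Inverse transform; the factor**2 scaling preserves mean intensity.
    return np.real(np.fft.ifft2(np.fft.ifftshift(padded))) * factor * factor

# Example: enlarge a low-resolution video text patch fourfold before OCR.
patch = np.random.rand(32, 64)          # stand-in for a cropped text region
enlarged = shannon_upsample(patch, 4)   # 128 x 256 result
```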
Determining the resolution of scanned document images
Dan S. Bloomberg
Given the existence of digital scanners, printers, and fax machines, documents can undergo a history of sequential reproductions. One of the most important determiners of the quality of the resulting image is the set of underlying resolutions at which the images were scanned and binarized. In particular, a low-resolution scan produces a noticeable degradation of image quality and produces a set of printed fonts that cause omnifont OCR systems to operate with a relatively high error rate. This error rate can be reduced if the OCR system is trained on text with fax scanner degradations, but this also requires that the OCR system be able to determine in advance whether such degraded fonts are present.
Character string extraction from newspaper headlines with a background design by recognizing a combination of connected components
Hiroaki Takebe, Yutaka Katsuyama, Satoshi Naoi
In this paper we propose a new method of extracting a character string from images with a background design. In Japanese newspaper headlines, it is common for character components to be placed independently of background components. In view of this, we represent a character string candidate as a consistent combination of connected components and calculate its character string resemblance value. The resemblance value of a combination of connected components depends on its character recognition result and on the area of the rectangle it occupies. We then extract the combination of connected components that has the maximum resemblance value. We applied this method to 142 headline images. The results show that the method accurately extracted character strings from various kinds of images with background designs and that it has a favorable processing speed.
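A hedged sketch of the search described above: score each candidate combination of connected components by a recognizer confidence weighted by how well the components fill their enclosing rectangle, and keep the maximum. The scoring function and the `recognize` callback are illustrative assumptions, not the paper's exact formulation.

```python
from itertools import combinations

def bounding_box(components):
    """Smallest rectangle enclosing a set of connected components,
    each given as a box (x0, y0, x1, y1)."""
    xs0, ys0, xs1, ys1 = zip(*components)
    return min(xs0), min(ys0), max(xs1), max(ys1)

def resemblance(components, recognize):
    """Hypothetical string-resemblance value: recognizer confidence for the
    merged region, weighted by how much of the enclosing rectangle the
    components cover (penalizes grabbing background decoration)."""
    x0, y0, x1, y1 = bounding_box(components)
    rect_area = max(1, (x1 - x0) * (y1 - y0))
    cover = sum((c[2] - c[0]) * (c[3] - c[1]) for c in components) / rect_area
    return recognize(components) * cover

def best_string_candidate(components, recognize, max_size=6):
    """Search combinations of components and return the highest-scoring one."""
    best, best_score = None, float("-inf")
    for k in range(1, min(max_size, len(components)) + 1):
        for combo in combinations(components, k):
            score = resemblance(combo, recognize)
            if score > best_score:
                best, best_score = combo, score
    return best, best_score
```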
Text segmentation for automatic document processing
Dinesh P. Mital, Wee Leng Goh
There is considerable interest in designing automatic systems that can scan a given paper document and store it on electronic media for easier storage, manipulation, and access. Most documents contain graphics and images in addition to text. Thus, the document image has to be segmented to identify text and image regions so that appropriate techniques may be applied to each. In this paper, we present a new technique for image segmentation in which text and image regions in a given document image are automatically identified. The technique is based on the differential processing text extraction concept. The proposed technique is capable of analyzing complex document image layouts. The document image is processed using textural feature analysis. Results of the proposed method are presented on test images that demonstrate the robustness of the technique.
Postprocessing algorithm for the optical recognition of degraded characters
Haisong Liu, Minxian Wu, Guofan Jin, et al.
Research on document recognition and retrieval has grown rapidly in recent years, especially the processing of real-world documents, photocopies, faxes, and microfiches, in which the characters may be degraded. Optical character recognition has advantages in dealing with such degraded cases because it makes use of all the information contained in the degraded characters. In this paper, we apply a simple postprocessing algorithm, based on similarity measure techniques, to the optical correlation output plane to improve the discrimination ability of the pattern recognition procedure, especially for the optical recognition of degraded characters. The study is divided into two parts: the first recounts a computer-simulated example of pattern recognition using input images that may be blurred, rotated, or corrupted by additive Gaussian noise, and the second describes an incoherent-optical-correlator-based optoelectronic processor for character recognition.
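One plausible instance of such a similarity-measure postprocessing step is sketched below: compute the correlation plane for each reference character and compare the neighbourhood of the observed peak to the reference's autocorrelation peak with a cosine similarity, rather than ranking classes by raw peak height alone. The cosine measure and the window size are assumptions; the authors' actual measure may differ.

```python
import numpy as np

def correlate(image, reference):
    """Circular cross-correlation of an input image with a reference
    character, computed in the Fourier domain (same-shape arrays)."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.conj(np.fft.fft2(reference))))

def peak_patch(plane, size=9):
    """Extract a (size x size) window around the strongest correlation peak."""
    y, x = np.unravel_index(np.argmax(plane), plane.shape)
    half = size // 2
    padded = np.pad(plane, half, mode="edge")
    return padded[y:y + size, x:x + size]

def postprocess_scores(image, references, size=9):
    """Compare the shape of the observed peak to each reference's
    autocorrelation peak (cosine similarity, an illustrative choice)."""
    scores = {}
    for label, ref in references.items():
        obs = peak_patch(correlate(image, ref), size).ravel()
        tmpl = peak_patch(correlate(ref, ref), size).ravel()
        scores[label] = float(obs @ tmpl /
                              (np.linalg.norm(obs) * np.linalg.norm(tmpl) + 1e-12))
    return scores
```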
Handwriting Recognition
Word-level optimization of dynamic programming-based handwritten word recognition algorithms
Paul D. Gader, Wen-Tsong Chen
In the standard segmentation-based approach to lexicon-driven handwritten word recognition, character recognition algorithms are generally trained on isolated characters and individual character-class confidence scores are combined to estimate confidences in the various hypothesized identities for a word. In this paper, results from investigating alternatives to these standard methods are presented. We refer to these alternative methods as system-level optimization methods.
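The standard combination scheme that such methods are measured against can be sketched as a dynamic program over hypothesized segmentation points: each letter of a lexicon entry consumes one or more consecutive primitive segments, and the word confidence is the best total character score. The `char_score` callback and the `max_merge` limit are hypothetical names used only for illustration.

```python
def word_confidence(word, n_segments, char_score, max_merge=3):
    """Best match between a lexicon word and a sequence of primitive segments.

    dp[k][i] = best score for matching the first k letters of `word` with the
    first i primitive segments, each letter consuming 1..max_merge segments.
    char_score(i, j, c) is assumed to return a score (e.g. a log-confidence)
    for the merged slice covering segments i..j-1 read as character c.
    """
    NEG = float("-inf")
    dp = [[NEG] * (n_segments + 1) for _ in range(len(word) + 1)]
    dp[0][0] = 0.0
    for k, letter in enumerate(word, start=1):
        for i in range(1, n_segments + 1):
            for m in range(1, min(max_merge, i) + 1):
                prev = dp[k - 1][i - m]
                if prev > NEG:
                    dp[k][i] = max(dp[k][i], prev + char_score(i - m, i, letter))
    return dp[len(word)][n_segments]
```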
Maximum mutual information estimation of a simplified hidden MRF for offline handwritten Chinese character recognition
Understanding of handwritten Chinese characters is at such a primitive stage that models incorporate assumptions about handwritten Chinese characters that are simply false, so Maximum Likelihood Estimation (MLE) may not be an optimal method for handwritten Chinese character recognition. This concern motivates the research effort to consider alternative criteria. Maximum Mutual Information Estimation (MMIE) is an alternative method for parameter estimation that does not derive its rationale from presumed model correctness, but instead examines the pattern-modeling problem in an automatic recognition system from an information-theoretic point of view. The objective of MMIE is to find a set of parameters such that the resultant model allows the system to derive from the observed data as much information as possible about the class. We consider MMIE for recognition of handwritten Chinese characters using a simplified hidden Markov random field. MMIE provides improved performance over MLE in this application.
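In standard notation (not taken from the paper), with training images x_i, class labels c_i, and class priors P(c), the two criteria differ as follows: MLE maximizes the class-conditional likelihood of the data, while MMIE maximizes the empirical mutual information between observation and class.

```latex
\hat{\theta}_{\mathrm{MLE}}  = \arg\max_{\theta} \sum_{i} \log p_{\theta}(x_i \mid c_i),
\qquad
\hat{\theta}_{\mathrm{MMIE}} = \arg\max_{\theta} \sum_{i}
  \log \frac{p_{\theta}(x_i \mid c_i)\, P(c_i)}
            {\sum_{c} p_{\theta}(x_i \mid c)\, P(c)}
```

The MMIE denominator couples all classes, rewarding the correct class while penalizing its competitors, which is why the criterion remains informative when the simplified hidden MRF's assumptions do not hold exactly.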
Modeling the trade-off between completeness and consistency in genetic-based handwritten character prototyping
Claudio De Stefano, A. Della Cioppa, Angelo Marcelli
This paper presents a contribution to exploiting Learning Classifier Systems with niching for finding the set of prototypes to be used by an Optical Character Recognition system. In particular, we investigate the niching method based on explicit fitness sharing and address the problem of estimating the values of the configuration parameters that allow the system to provide an optimal set of prototypes, i.e., a set of prototypes that represents, for the problem at hand, the best compromise between completeness and consistency. The solution reported in the paper is based on a characterization of the system behavior as a function of the configuration parameters. This characterization, derived within the framework set by Learning Classifier System theory, has been used to design an experimental protocol for finding the desired prototypes with a small number of experiments. A large set of experiments using the National Institute of Standards and Technology database has been performed to validate the proposed solution. The experimental findings allow us to conclude that Learning Classifier Systems are a promising tool for handwritten character recognition.
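For reference, explicit fitness sharing is conventionally defined as below; the niche radius and kernel exponent are exactly the kind of configuration parameters whose estimation the paper characterizes (the paper's specific parameterization may differ).

```latex
f'_i = \frac{f_i}{\sum_{j} \operatorname{sh}(d_{ij})},
\qquad
\operatorname{sh}(d) =
\begin{cases}
1 - \left(d / \sigma_{\mathrm{share}}\right)^{\alpha}, & d < \sigma_{\mathrm{share}},\\[2pt]
0, & \text{otherwise},
\end{cases}
```

where f_i is the raw fitness of classifier i, d_{ij} is the distance between classifiers i and j, sigma_share is the niche radius, and alpha shapes the sharing kernel.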
Robust baseline-independent algorithms for segmentation and reconstruction of Arabic handwritten cursive script
Khaled Mostafa, Ahmed M. Darwish
The problem of cursive script segmentation is an essential one for handwritten character recognition. This is especially true for Arabic text, where cursive writing is the only mode, even for typewritten fonts. In this paper, we present a generalized segmentation approach for handwritten Arabic cursive scripts. The proposed approach is based on the analysis of the upper and lower contours of the word. The algorithm searches for local minima points along the upper contour and local maxima points along the lower contour of the word. These points are then marked as potential letter boundaries (PLB). A set of rules, based on the nature of Arabic cursive scripts, is then applied to both upper and lower PLB points to eliminate some of the improper ones. A matching process between upper and lower PLBs is then performed in order to obtain the minimum number of non-overlapping PLBs for each word. The output of the proposed segmentation algorithm is a set of labeled primitives that represent the Arabic word. In order to reconstruct the original word from its corresponding primitives and diacritics, a novel binding and dot assignment algorithm is introduced. The algorithm achieved a correct segmentation rate of 97.7% when tested on samples of loosely constrained handwritten cursive script words consisting of 7922 characters written by 14 different writers.
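A minimal sketch of the PLB detection step, assuming the upper and lower contours are available as 1-D arrays of y-coordinates indexed by column; the simple minimum-gap thinning stands in for the rule-based filtering described above, and all names are illustrative.

```python
import numpy as np

def local_extrema(profile, mode="min"):
    """Indices of strict local minima (or maxima) of a 1-D contour profile."""
    p = np.asarray(profile, dtype=float)
    if mode == "min":
        mask = (p[1:-1] < p[:-2]) & (p[1:-1] < p[2:])
    else:
        mask = (p[1:-1] > p[:-2]) & (p[1:-1] > p[2:])
    return np.where(mask)[0] + 1

def potential_letter_boundaries(upper, lower, min_gap=3):
    """Candidate boundaries: local minima of the upper contour and local
    maxima of the lower contour, thinned so no two candidates are closer
    than min_gap columns (a stand-in for the paper's rule set)."""
    candidates = sorted(set(local_extrema(upper, "min")) |
                        set(local_extrema(lower, "max")))
    plb = []
    for x in candidates:
        if not plb or x - plb[-1] >= min_gap:
            plb.append(int(x))
    return plb
```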
Models and Evaluations
The Bible, truth, and multilingual OCR evaluation
Tapas Kanungo, Philip Resnik
In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with ground truth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora with similar properties, such as the Koran and the Bhagavad Gita. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.
Federal Register document image database
Michael D. Garris, Stanley A. Janet, William W. Klein
A new, fully automated process has been developed at NIST to derive ground truth for document images. The method involves matching optical character recognition (OCR) results from a page with typesetting files for an entire book. Public domain software used to derive the ground truth is provided in the form of Perl scripts and C source code, and includes new, more efficient string alignment technology and a word-level scoring package. With this ground-truthing technology, it is now feasible to produce much larger data sets, at much lower cost, than was ever possible with previous labor-intensive, manual data collection projects. Using this method, NIST has produced a new document image database for evaluating Document Analysis and Recognition technologies and Information Retrieval systems. The database contains scanned images, SGML-tagged ground truth text, commercial OCR results, and image quality assessment results for pages published in the 1994 Federal Register. These data files are useful in a wide variety of experiments and research. There were roughly 250 issues, comprising nearly 69,000 pages, published in the Federal Register in 1994. This volume of the database contains the pages of 20 books published in January of that year. In all, 4711 page images are provided, 4519 of them with corresponding ground truth. This volume is distributed on two ISO-9660 CD-ROMs. Future volumes may be released, depending on the level of interest.
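The core string-alignment idea can be illustrated with a standard edit-distance (Needleman-Wunsch style) global alignment between OCR output and the typesetting text; NIST's released package is a more efficient implementation of the same idea, and the costs below are arbitrary illustrative choices.

```python
def align(ocr, truth, sub=1, gap=1):
    """Global alignment of an OCR string with ground-truth text by dynamic
    programming; returns two equal-length strings with '-' marking gaps."""
    n, m = len(ocr), len(truth)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap
    for j in range(1, m + 1):
        d[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr[i - 1] == truth[j - 1] else sub
            d[i][j] = min(d[i - 1][j - 1] + cost,
                          d[i - 1][j] + gap,
                          d[i][j - 1] + gap)
    # Trace back to recover the aligned strings.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if ocr[i - 1] == truth[j - 1] else sub):
            a.append(ocr[i - 1]); b.append(truth[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + gap:
            a.append(ocr[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(truth[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))

print(align("Federa1 Regster", "Federal Register"))
```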
OmniPage vs. Sakhr: paired model evaluation of two Arabic OCR products
Tapas Kanungo, Gregory A. Marton, Osama Bulbul
Characterizing the performance of Optical Character Recognition (OCR) systems is crucial for monitoring technical progress, predicting OCR performance, providing scientific explanations for system behavior, and identifying open problems. While research has been done in the past to compare the performance of two or more OCR systems, it has assumed that the accuracies achieved on individual documents in a dataset are independent when, in fact, they are not. In this paper we show that accuracies reported on any dataset are correlated and invoke the appropriate statistical technique, the paired model, to compare the accuracies of two recognition systems. Theoretically, we show that this method provides tighter confidence intervals than methods used in the OCR and computer vision literature. We also propose a new visualization method, which we call the accuracy scatter plot, for providing a visual summary of performance results. This method summarizes the accuracy comparisons over the entire corpus while simultaneously allowing the researcher to visually compare the performance on individual document images. Finally, we report accuracy and speed as a function of scanning resolution. Contrary to what one might expect, the performance of one of the systems degrades when the image resolution is increased beyond 300 dpi. Furthermore, the average time taken to OCR a document image, after increasing almost linearly as a function of resolution, becomes constant beyond 400 dpi. This behavior most likely occurs because the OCR algorithm resamples images at resolutions of 400 dpi and higher to a standard resolution. The two products that we compare are Arabic OmniPage 2.0 and the Automatic Page Reader 3.01 from Sakhr. The SAIC Arabic dataset was used for the evaluations. The statistical and visualization methods presented in this article are very general and can be used for comparing the accuracies of any two recognition systems, not just OCR systems.
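The paired model can be sketched as follows: evaluate both systems on the same pages, form per-document accuracy differences, and build the confidence interval on the mean difference, so that the positive correlation between the two systems' per-page accuracies shrinks the interval. The normal-approximation interval below is a simplification of the full analysis in the paper.

```python
import math

def paired_confidence_interval(acc_a, acc_b, z=1.96):
    """95% confidence interval (normal approximation) for the mean
    per-document accuracy difference between two OCR systems evaluated
    on the same pages. acc_a and acc_b are equal-length lists."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half

# Toy example: accuracies of systems A and B on the same five pages.
a = [0.91, 0.88, 0.95, 0.80, 0.97]
b = [0.89, 0.85, 0.94, 0.78, 0.96]
print(paired_confidence_interval(a, b))
```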
Information Retrieval
Multimodal browsing of images in Web documents
Francine R. Chen, Ullas Gargi, Les Niles, et al.
In this paper, we describe a system for performing browsing and retrieval on a collection of web images and associated text on an HTML page. Browsing is combined with retrieval to help a user locate interesting portions of the corpus, without the need to formulate a query well matched to the corpus. Multi-modal information, in the form of text surrounding an image and some simple image features, is used in this process. Using the system, a user progressively narrows a collection to a small number of elements of interest, similar to the Scatter/Gather system developed for text browsing. We have extended the Scatter/Gather method to use multi-modal features. With the use of multiple features, some collection elements may have unknown or undefined values for some features; we present a method for incorporating these elements into the result set. This method also provides a way to handle the case when a search is narrowed to a part of the space near a boundary between two clusters. A number of examples illustrating our system are provided.
Effectiveness of thesauri-aided retrieval
Kazem Taghva, Julie Borsack, Allen Condit
In this report, we describe the results of an experiment designed to measure the effects of automatic query expansion on retrieval effectiveness. In particular, we used a collection-specific thesaurus to expand the query by adding synonyms of the searched terms. Our preliminary results show no significant gain in average precision and recall.
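A minimal sketch of the kind of expansion being tested, assuming a collection-specific thesaurus represented as a term-to-synonyms mapping; the report's actual thesaurus construction and any term weighting are not reproduced here.

```python
def expand_query(terms, thesaurus):
    """Add collection-specific synonyms of each query term (illustrative;
    duplicates are suppressed, original terms are kept first)."""
    expanded = list(terms)
    for t in terms:
        expanded.extend(s for s in thesaurus.get(t, ()) if s not in expanded)
    return expanded

# Hypothetical collection-specific thesaurus.
thesaurus = {"repository": ["archive", "depository"],
             "drum": ["container", "barrel"]}
print(expand_query(["waste", "drum", "repository"], thesaurus))
```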
Document image recognition and retrieval: where are we?
This paper discusses survey data collected as a result of planning a project to evaluate document recognition and information retrieval technologies. In the process of establishing the project, a Request for Comment (RFC) was widely distributed throughout the document recognition and information retrieval research and development (R&D) communities, and based on the responses, the project was discontinued. The purpose of this paper is to present 'real' data collected from the R&D communities in regard to a 'real' project, so that we may all form our own conclusions about where we are, where we are heading, and how we are going to get there. Background on the project is provided and responses to the RFC are summarized.
Document Analysis
Guideline for specifying layout knowledge
Toyohide Watanabe
To date, many layout recognition/analysis methods have been proposed, but guidelines for the knowledge representation applicable to newly encountered documents are rarely discussed directly. This paper addresses that subject. Generally, documents can be categorized into appropriate document types on the basis of the features of their layout structures. The processing mechanisms are then assessed with a view to establishing criteria for selecting the knowledge representation means appropriate to each document type. First, we define the physical layout structure and logical layout structure in addition to the traditional concepts of layout structure and logical structure. Second, we define the document type on the basis of the relationship between the physical layout structure and the logical layout structure. Third, we clarify the knowledge representation means and the processing mechanisms for each document type. Finally, we give a criterion, or guideline, for choosing knowledge representation means and processing mechanisms with respect to the logical layout structure and physical layout structure. Our basic view is derived from the document understanding methods we have developed for several different kinds of documents.
Learning to identify hundreds of flex-form documents
Janusz Wnek
This paper presents an inductive document classifier (IDC) and its application to document identification. The most important features of the presented system are learning capability, handling large volumes of highly variant documents, and high performance. IDC learns new document types (variants) from examples. To this end, it automatically extracts discriminatory features from images of various document types, generates generalized descriptions, and stores them in the knowledge base. The classification of an unknown document is based on matching its description to all general rules in the knowledge base, and selecting the best matching document types as final classifications. Both learning and identification processes are fast and accurate. The speed is gained due to optimal image processing and feature construction procedures. Identification accuracy is very high despite the fact that the discriminatory features are generated solely based on page layout information. IDC operates in two separate components of an EDMS: Knowledge Base Maintainer (KBM) and Production Identifier (PI). KBM builds a knowledge base and maintains its integrity. PI utilizes learned knowledge during the identification processes.
New method for logical structure extraction of form document image
Bing Liu, Zao Jiang, Hong Zhao, et al.
Many methods for form document image analysis have been proposed, but few have treated the extraction of logical structure. A new method for the logical structure extraction of form documents is proposed in this paper. The algorithm consists of three phases: global division of the whole document, local logical structure analysis, and global re-division of the whole document. This GLG method emphasizes global layout structure analysis and achieves higher accuracy. It is robust in handling accidental direct adjacency between two unrelated cells. In addition, a logical structure tree is proposed to represent the logical structure of a form document.
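A logical structure tree of the kind proposed can be represented with a small recursive data type; the node labels and the example form below are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FormNode:
    """Node of a logical structure tree for a form: a region of the page,
    its role, its text if it is a leaf cell, and its sub-regions."""
    label: str                      # e.g. "form", "block", "item", "cell"
    text: Optional[str] = None      # cell contents, if a leaf
    children: List["FormNode"] = field(default_factory=list)

# The three-phase analysis would populate a tree such as:
invoice = FormNode("form", children=[
    FormNode("block", children=[
        FormNode("item", children=[FormNode("cell", "Name"),
                                   FormNode("cell", "J. Smith")]),
        FormNode("item", children=[FormNode("cell", "Date"),
                                   FormNode("cell", "1999-01-07")]),
    ]),
])
```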
Development of OCR system for portable passport and visa reader
Yury V. Visilter, Sergey Yu. Zheltov, Anton A. Lukin
Modern passport and visa documents include special machine-readable zones that satisfy the ICAO standards, which makes it possible to develop automatic passport and visa readers. However, such OCR systems face some special problems: low resolution of the character images captured by the CCD camera (down to 150 dpi), significant shifts and slopes (up to 10 degrees), rich paper texture under the character symbols, and non-homogeneous illumination. This paper presents the structure and some special aspects of an OCR system for a portable passport and visa reader. In our approach the binarization procedure is performed after the segmentation step and is applied to each character site separately. The character recognition procedure uses the structural information of the machine-readable zone. Special algorithms are developed for machine-readable zone extraction and character segmentation.
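The structural information in a machine-readable zone includes fixed field positions and ICAO 9303 check digits, which an OCR postprocessor can use to verify or correct character hypotheses. The check-digit rule itself (weights 7, 3, 1 and modulo 10) is standard; the sketch below is illustrative and not the paper's recognition algorithm.

```python
def mrz_check_digit(field):
    """ICAO 9303 check digit: digits map to their value, A-Z to 10-35,
    the filler '<' to 0; values are weighted cyclically by 7, 3, 1 and
    summed modulo 10."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if ch.isdigit():
            v = int(ch)
        elif ch.isalpha():
            v = ord(ch.upper()) - ord("A") + 10
        else:                 # filler character '<'
            v = 0
        total += v * weights[i % 3]
    return total % 10

# A recognized date field can be validated against its check digit:
# 5*7 + 2*3 + 0*1 + 7*7 + 2*3 + 7*1 = 103 -> 3.
assert mrz_check_digit("520727") == 3
```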