Proceedings Volume 6067

Document Recognition and Retrieval XIII

Volume Details

Date Published: 15 January 2006
Contents: 7 Sessions, 25 Papers, 0 Presentations
Conference: Electronic Imaging 2006
Volume Number: 6067

Table of Contents

  • Handwriting Recognition
  • Optical Character Recognition
  • Image Processing
  • Emerging Applications
  • Document Retrieval
  • Invited Paper II
  • Learning and Classification
Handwriting Recognition
Combining one- and two-dimensional signal recognition approaches to off-line signature verification
Siyuan Chen, Sargur Srihari
A signature verification method is described that combines recognition approaches for one-dimensional signals (e.g., speech and on-line handwriting) with those for two-dimensional images (e.g., holistic word recognition in OCR and off-line handwriting). In the one-dimensional approach, a sequence of data is obtained by tracing the exterior contour of the signature, which allows the application of string-matching algorithms. The upper and lower contours of the signature are first determined by ignoring small gaps between signature components. The contours are then combined into a single sequence that defines a pseudo-writing path. To match two signatures, a non-linear normalization method, viz. dynamic time warping, is applied to segment them into curves. Shape descriptors based on Zernike moments are extracted as features from each segment, and a harmonic distance is used for measuring signature similarity. The two-dimensional approach is based on features describing the word shape. When the two methods are combined, the overall performance is significantly better than that of either method alone. With a database of 1320 genuine signatures and 1320 forgeries, the combination method has an accuracy of 90%.
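As a point of reference for the alignment step, the following is a minimal sketch of standard dynamic time warping over two point sequences sampled along a pseudo-writing path. The quadratic-time recurrence and the function name are generic illustrations, not the authors' implementation (which also extracts Zernike-moment descriptors from each aligned segment).

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping between two point sequences.

    seq_a, seq_b: arrays of shape (n, d) sampled along the pseudo-writing
    path (the combined upper/lower contour).  Returns the accumulated
    alignment cost; backtracking the cost matrix (not shown) would yield
    the segment correspondences that features are extracted from.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(seq_a[i - 1]) - np.asarray(seq_b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j],      # skip a point in seq_a
                                 cost[i, j - 1],      # skip a point in seq_b
                                 cost[i - 1, j - 1])  # match the two points
    return cost[n, m]
```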
Spotting words in handwritten Arabic documents
Sargur Srihari, Harish Srinivasan, Pavithra Babu, et al.
The design and performance of a system for spotting handwritten Arabic words in scanned document images are presented. The three main components of the system are a word segmenter, a shape-based matcher for words, and a search interface. The user types a query in English within a search window; the system finds the equivalent Arabic word, e.g., by dictionary look-up, and locates word images in an indexed (segmented) set of documents. A two-step approach is employed in performing the search: (1) prototype selection: the query is used to obtain a set of handwritten samples of that word from a known set of writers (these are the prototypes), and (2) word matching: the prototypes are used to spot each occurrence of those words in the indexed document database. A ranking is performed on the entire set of test word images, where the ranking criterion is a similarity score between each prototype word and the candidate words based on global word shape features. A database of 20,000 word images contained in 100 scanned handwritten Arabic documents written by 10 different writers was used to study retrieval performance. Using five writers to provide prototypes and the other five for testing, on manually segmented documents, 55% precision is obtained at 50% recall. Performance increases as more writers are used for training.
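The ranking step described above reduces, in its simplest form, to scoring each candidate word image by its best similarity to any prototype of the query word. The sketch below is a hypothetical rendering of that idea; the feature extraction and similarity function (global word-shape features in the paper) are assumed to be supplied.

```python
def rank_candidates(prototypes, candidates, similarity):
    """Rank candidate word images by their best similarity to any prototype.

    prototypes: list of feature vectors for handwritten samples of the query word
    candidates: dict mapping word-image id -> feature vector
    similarity: callable(a, b) -> score, higher meaning more similar
    Returns candidate ids sorted from most to least likely match.
    """
    scores = {
        cid: max(similarity(proto, feat) for proto in prototypes)
        for cid, feat in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```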
HCCR by contour-based elastic mesh fuzzy feature
Offline handwritten Chinese character recognition is one of the difficult problems in pattern recognition because of large stroke distortion, writing anomalies, and the lack of stroke-order information. A basic characteristic of Chinese characters is that they are composed of four kinds of strokes: horizontal, vertical, 45-degree, and 135-degree. A Chinese character can be uniquely determined by the counts of the four directional strokes and their relative positions, and these features can be obtained from the character contour. In this paper, we first modify an existing contour extraction algorithm to obtain a strict single-pixel contour of the character, and then propose a contour-based elastic mesh fuzzy feature extraction method. Comparative experiments show that the performance of our approach is encouraging and comparable to that of other algorithms.
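To make the directional-stroke idea concrete, here is a toy sketch of a contour-based directional mesh feature: each contour step is quantized into one of the four stroke directions named above and accumulated per mesh cell, with cell borders placed at coordinate quantiles as a crude stand-in for an elastic mesh. The fuzzy membership weighting of the actual method is omitted, and all names are illustrative.

```python
import numpy as np

def directional_mesh_feature(contour, n_cells=4):
    """Toy contour-based directional mesh feature (no fuzzy weighting).

    contour: (n, 2) array of single-pixel contour points in tracing order.
    """
    pts = np.asarray(contour, dtype=float)
    steps = np.diff(pts, axis=0)
    angles = np.degrees(np.arctan2(steps[:, 1], steps[:, 0])) % 180.0
    # 0 = horizontal, 1 = 45 degrees, 2 = vertical, 3 = 135 degrees
    dir_bin = np.floor(((angles + 22.5) % 180.0) / 45.0).astype(int)

    # "Elastic" grid: borders at coordinate quantiles so each mesh row/column
    # covers roughly the same number of contour points.
    qs = np.linspace(0.0, 1.0, n_cells + 1)
    xb, yb = np.quantile(pts[:-1, 0], qs), np.quantile(pts[:-1, 1], qs)
    xi = np.clip(np.searchsorted(xb, pts[:-1, 0], side="right") - 1, 0, n_cells - 1)
    yi = np.clip(np.searchsorted(yb, pts[:-1, 1], side="right") - 1, 0, n_cells - 1)

    feat = np.zeros((n_cells, n_cells, 4))
    for cx, cy, d in zip(xi, yi, dir_bin):
        feat[cy, cx, d] += 1
    return feat.ravel()
```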
Optical Character Recognition
Partitioning of the degradation space for OCR training
Generally speaking, optical character recognition algorithms tend to perform better when presented with homogeneous data. This paper studies a method designed to increase the homogeneity of training data, based on an understanding of the types of degradations that occur during the printing and scanning process and of how these degradations affect the homogeneity of the data. While it has been shown that dividing the degradation space by edge spread improves recognition accuracy over dividing it by threshold or point spread function width alone, the challenge is in deciding how many partitions to use and at what values of edge spread the divisions should be made. Clustering of different types of character features, fonts, sizes, resolutions, and noise levels shows that edge spread is indeed a strong indicator of the homogeneity of character data clusters.
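One simple way to pose the "how many partitions, and where" question is as one-dimensional clustering of per-sample edge-spread estimates; partition boundaries then fall between neighbouring cluster centres. The sketch below is only an illustrative stand-in for that idea, not the partitioning procedure studied in the paper.

```python
import numpy as np

def edge_spread_partitions(edge_spreads, n_partitions=3, iters=50):
    """1-D k-means over edge-spread estimates; returns partition boundaries
    (midpoints between adjacent cluster centres)."""
    x = np.asarray(edge_spreads, dtype=float)
    centres = np.quantile(x, np.linspace(0.1, 0.9, n_partitions))
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centres[None, :]), axis=1)
        centres = np.array([x[labels == k].mean() if np.any(labels == k) else centres[k]
                            for k in range(n_partitions)])
    centres = np.sort(centres)
    return (centres[:-1] + centres[1:]) / 2.0
```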
Match graph generation for symbolic indirect correlation
Daniel Lopresti, George Nagy, Ashutosh Joshi
Symbolic indirect correlation (SIC) is a new approach for bringing lexical context into the recognition of unsegmented signals that represent words or phrases in printed or spoken form. One way of viewing the SIC problem is to find the correspondence, if one exists, between two bipartite graphs, one representing the matching of the two lexical strings and the other representing the matching of the two signal strings. While perfect matching cannot be expected with real-world signals, and while some degree of mismatch is allowed for in the second stage of SIC, such errors, if they are too numerous, can present a serious impediment to a successful implementation of the concept. In this paper, we describe a framework for evaluating the effectiveness of SIC match graph generation and examine the relatively simple, controlled case of synthetic images of text strings typeset both normally and in a highly condensed fashion. We quantify and categorize the errors that arise, and present a variety of techniques we have developed to visualize the intermediate results of the SIC process.
Toward quantifying the amount of style in a dataset
Xiaoli Zhang, Srinivas Andra
Exploiting style consistency in groups of patterns (pattern fields) generated by the same source has been demonstrated to yield higher accuracies in OCR applications. The accuracy gains obtained by a style-consistent classifier depend on the amount of style in a dataset in addition to the classifier itself. The computational complexity of style-based classifiers precludes their applicability in situations where datasets have small amounts of style. In this paper, we propose a correlation-based measure to quantify the amount of style in a dataset and demonstrate its use in determining the suitability of a style-consistent classifier on both simulated and real datasets.
Robust feature extraction for character recognition based on binary images
Optical Character Recognition (OCR) is a classical research field and has become one of the most successful applications in the area of pattern recognition. Feature extraction is a key step in the OCR process. This paper presents three algorithms for feature extraction from binary images: the Lattice with Distance Transform (DTL), Stroke Density (SD), and Co-occurrence Matrix (CM). The DTL algorithm improves the robustness of the lattice feature by using a distance transform to increase the separation between foreground and background and thus reduce the influence of stroke boundaries. The SD and CM algorithms extract robust stroke features based on the fact that humans recognize characters by their strokes, including length and orientation. SD reflects quantized stroke information, including length and orientation, while CM reflects the length and orientation of a contour; together, SD and CM describe strokes sufficiently. Since these three groups of feature vectors complement each other in describing characters, we integrate them and adopt a hierarchical algorithm to achieve optimal performance. Our methods are tested on the USPS (United States Postal Service) database and the Vehicle License Plate Number Pictures Database (VLNPD). Experimental results show that the methods achieve a high recognition rate at reasonable average running time. Under similar conditions, we also compared our results to the box method proposed by Hannmandlu [18]; our methods demonstrated better efficiency.
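As an illustration of the DTL idea, the sketch below computes a distance transform of a binary character image and pools it over a fixed lattice. The grid size and the pooling choice are assumptions, and the SD and CM features are not reproduced here.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dtl_feature(binary_char, grid=(8, 8)):
    """Distance-Transform Lattice style feature (illustrative version).

    binary_char: 2-D array with 1 = stroke (foreground), 0 = background.
    The distance transform spreads information away from stroke boundaries,
    which reduces sensitivity to small boundary noise; the transformed image
    is then averaged over a fixed lattice of cells.
    """
    dist = distance_transform_edt(binary_char == 0)   # distance to nearest stroke pixel
    h, w = dist.shape
    gy, gx = grid
    feat = np.zeros(grid)
    for i in range(gy):
        for j in range(gx):
            cell = dist[i * h // gy:(i + 1) * h // gy,
                        j * w // gx:(j + 1) * w // gx]
            feat[i, j] = cell.mean() if cell.size else 0.0
    return feat.ravel()
```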
Image Processing
DOCLIB: a software library for document processing
Most researchers would agree that research in the field of document processing can benefit tremendously from a common software library through which institutions are able to develop and share research-related software and applications across academic, business, and government domains. However, despite several attempts in the past, the research community still lacks a widely accepted standard software library for document processing. This paper describes a new library called DOCLIB, which tries to overcome the drawbacks of earlier approaches. Many of DOCLIB's features are unique either in themselves or in their combination with others, e.g., the factory concept for support of different image types, the juxtaposition of image data and metadata, or the add-on mechanism. We hope that DOCLIB will serve the needs of researchers better than previous approaches and be readily accepted by a larger group of scientists.
Address block features for image-based automated mail orientation
M. Shahab Khan, Hrishikesh B. Aradhye, Wayne T. Cruz
When mixed mail enters a postal facility, it must first be faced and oriented so that the address is readable by automated mail processing machinery. Existing US Postal Service (USPS) automated systems face and orient domestic mail by searching for fluorescing stamps on each mail piece. However, misplaced or partially fluorescing postage causes a significant fraction of mail to be rejected. Previously, rejected mail had to be faced and oriented by hand, thus increasing mail processing cost and time. Our earlier work successfully demonstrated the utility of machine-vision-based extraction of postal delimiters, such as cancellation marks and barcodes, for camera-based mail facing and orientation. Arguably, of all the localized information sources on the envelope image, the destination address block is the richest in content and the most structured in its form and layout. This paper focuses exclusively on the destination address block image and describes new vision-based features that can be extracted and used for mail orientation. Our results on real USPS datasets indicate robust performance. The algorithms described herein will be deployed nationwide on USPS hardware in the near future.
A robust stamp detection framework on degraded documents
Detecting documents that bear a certain stamp instance is an effective and reliable way to retrieve documents associated with a specific source. However, this unique problem has essentially remained unaddressed. In this paper, we present a novel stamp detection framework based on parameter estimation of connected edge features. Using robust basic-shape detectors, the approach is effective for stamps with analytically shaped contours, even when only limited samples are available. For elliptic or circular stamps, it efficiently exploits the orientation information from pairs of edge points to determine the center position and area, without computing all five parameters of an ellipse. Our approach takes into account the unique characteristics of stamp patterns; in particular, we introduce effective algorithms to address the problem that stamps often spatially overlay their background content. These give our approach significant advantages in detection accuracy and computational complexity over the traditional Hough transform method for locating candidate ellipse regions. Experimental results on real degraded documents demonstrate the robustness of this retrieval approach on a large document database consisting of both printed text and handwritten notes.
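For elliptic stamps, one classical way to estimate the center from pairs of edge points uses the fact that the line joining the intersection of the tangents at two points to the midpoint of their chord passes through the center of the ellipse. The sketch below accumulates center votes from random edge-point pairs under that construction; it is a generic illustration, not the paper's exact estimator.

```python
import numpy as np

def accumulate_centre_votes(points, gradients, img_shape, bins=64, n_pairs=5000, seed=0):
    """Vote for ellipse/circle centres from pairs of edge points.

    points: (n, 2) array of (x, y) edge coordinates.
    gradients: (n, 2) array of unit gradient vectors at those points
               (the tangent is the gradient rotated by 90 degrees).
    Returns a bins x bins accumulator whose peak approximates the centre.
    """
    h, w = img_shape
    acc = np.zeros((bins, bins))
    rng = np.random.default_rng(seed)
    for _ in range(n_pairs):
        i, j = rng.integers(0, len(points), size=2)
        if i == j:
            continue
        p1, p2 = points[i], points[j]
        t1 = np.array([-gradients[i][1], gradients[i][0]])
        t2 = np.array([-gradients[j][1], gradients[j][0]])
        A = np.column_stack([t1, -t2])
        if abs(np.linalg.det(A)) < 1e-6:          # near-parallel tangents
            continue
        a, _ = np.linalg.solve(A, p2 - p1)
        tang = p1 + a * t1                        # tangent intersection point
        mid = (p1 + p2) / 2.0                     # chord midpoint
        # The centre lies on the ray from the tangent intersection through
        # the chord midpoint, beyond the midpoint; vote along that ray.
        for lam in np.linspace(1.0, 4.0, 25):
            cx, cy = tang + lam * (mid - tang)
            bx, by = int(cx * bins / w), int(cy * bins / h)
            if 0 <= bx < bins and 0 <= by < bins:
                acc[by, bx] += 1
    return acc
```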
Adaptive pre-OCR cleanup of grayscale document images
Ilya Zavorin, Eugene Borovikov, Mark Turner, et al.
This paper describes new capabilities of ImageRefiner, an automatic image enhancement system based on machine learning (ML). ImageRefiner was initially designed as a pre-OCR cleanup filter for bitonal (black-and-white) document images. Using a single neural network, ImageRefiner learned which image enhancement transformations (filters) were best suited for a given document image and a given OCR engine, based on various image measurements (characteristics). The new release improves ImageRefiner in three major ways. First, to process grayscale document images, we have included three grayscale filters based on smart thresholding and noise filtering, as well as five image characteristics that are all byproducts of various thresholding techniques. Second, we have implemented additional ML algorithms, including a neural network ensemble and several "all-pairs" classifiers. Third, we have introduced a measure that evaluates overall performance of the system in terms of cumulative improvement of OCR accuracy. Our experiments indicate that OCR accuracy on enhanced grayscale images is higher than that of both the original grayscale images and the corresponding bitonal images obtained by scanning the same documents. We have noticed that the system's performance may suffer when document characteristics are correlated.
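The core learning task described here, selecting the enhancement filter expected to help a given OCR engine most based on measured image characteristics, can be framed as ordinary supervised classification. The following is a hypothetical sketch of that framing; the names and the choice of a single MLP are assumptions, and the paper additionally explores ensembles and all-pairs classifiers.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_filter_selector(image_features, best_filter_index):
    """Learn a mapping from image characteristics to the enhancement filter
    that maximised OCR accuracy on each training image.

    image_features: (n, d) array of measured characteristics per image
    best_filter_index: length-n array; which filter won on each image,
                       determined offline against ground-truthed text
    """
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
    clf.fit(np.asarray(image_features, dtype=float), best_filter_index)
    return clf

def choose_filter(clf, features):
    """Pick the filter to apply to a new document image before OCR."""
    return int(clf.predict(np.asarray(features, dtype=float).reshape(1, -1))[0])
```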
JBIG2 text image compression based on OCR
The JBIG2 (Joint Bi-level Image Experts Group) standard for bi-level image coding is drafted to allow encoder designs by individuals. In JBIG2, text images are compressed by pattern matching techniques. In this paper, we propose a lossy text image compression method based on OCR (optical character recognition) that compresses bi-level images into the JBIG2 format. By processing text images with OCR, we can obtain recognition results for characters and the confidence of these results. A representative symbol image can be generated for similar character image blocks from the OCR results, the sizes of the blocks, and the mismatches between blocks. This symbol image can replace all the similar image blocks, and thus a high compression ratio can be achieved. Experimental results show that our algorithm achieves improvements of 75.86% over lossless SPM and 14.05% over lossy PM and S on Latin character images, and 37.9% over lossless SPM and 4.97% over lossy PM and S on Chinese character images. Our algorithm leads to far fewer substitution errors than previous lossy PM and S and thus preserves acceptable decoded image quality.
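The grouping step, forming one representative symbol per set of blocks that OCR agrees on, can be sketched as follows. The block and OCR interfaces are hypothetical; the real encoder also checks pixel-level mismatch between blocks before merging and must emit the JBIG2 symbol dictionary, which is not shown.

```python
from collections import defaultdict

def group_blocks_by_ocr(blocks, ocr, min_confidence=0.9, size_tol=2):
    """Group character blocks that OCR labels identically with similar size.

    blocks: list of dicts with keys 'bitmap', 'w', 'h'
    ocr: callable(bitmap) -> (label, confidence)
    Returns (groups, singletons): groups maps a (label, w-bucket, h-bucket)
    key to block indices that may share one representative symbol; singletons
    are low-confidence blocks kept individually to limit substitution errors.
    """
    groups = defaultdict(list)
    singletons = []
    for idx, blk in enumerate(blocks):
        label, conf = ocr(blk['bitmap'])
        if conf < min_confidence:
            singletons.append(idx)
        else:
            groups[(label, blk['w'] // size_tol, blk['h'] // size_tol)].append(idx)
    return dict(groups), singletons
```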
Emerging Applications
Active document versioning: from layout understanding to adjustment
Xiaofan Lin, Hui Chao, Greg Nelson, et al.
This paper introduces a novel Active Document Versioning system that can extract the layout template and constraints from the original document and then automatically adjust the layout to accommodate new content. "Active" reflects several unique features of the system. First, the need for handcrafting adjustable templates is largely eliminated through layout understanding techniques that convert static documents into Active Layout Templates and accompanying constraints. Second, through linear text block modeling and a two-pass constraint solving algorithm, the system supports a rich set of layout operations, such as simultaneous optimization of text block width and height, integrated image cropping, and non-rectangular text wrapping. The system has been successfully applied to a wide range of professionally designed documents. This paper covers both the core algorithms and the implementation.
Graphic design principles for automated document segmentation and understanding
When designers develop a document layout, their objective is to convey a specific message and provoke a specific response from the audience. Design principles provide the foundation for identifying document components and the relations among them, and thereby for extracting implicit knowledge from the layout. Variable Data Printing enables the production of personalized printing jobs for which traditional proofing of all job instances can be unfeasible. This paper describes a rule-based system that uses design principles to segment and understand document content. The system uses the design principles of repetition, proximity, alignment, similarity, and contrast as the foundation of its strategy for document segmentation and understanding, which is closely tied to recognizing artifacts produced by violations of the constraints articulated in the document layout. There are two main modules in the tool: the geometric analysis module and the design rule engine. The geometric analysis module extracts explicit knowledge from the data provided in the document. The design rule module uses the information provided by the geometric analysis to establish logical units inside the document. We used a subset of XSL-FO sufficient for designing documents of adequate complexity. The system identifies components such as headers, paragraphs, lists, and images and determines the relations between them, such as header-paragraph, header-list, etc. The system provides accurate information about the geometric properties of the components, detects the elements of the documents, and identifies corresponding components between a proofed instance and the rest of the instances in a Variable Data Printing job.
A new document authentication method by embedding deformation characters
Document authentication decides whether a given document is from a specific individual or not. In this paper, we propose a new document authentication method in the physical domain (after the document is printed out) that works by embedding deformed characters. When an author writes a document to a specific individual or organization, a unique error-correcting code serving as his Personal Identification Number (PIN) is assigned, and some characters in the text lines are deformed according to this PIN. By doing so, the writer's personal information is embedded in the document. When the document is received, it is first scanned and recognized by an OCR module, and then the deformed characters are detected to recover the PIN, which can be used to decide the originality of the document. Document authentication can thus be viewed as a kind of communication problem in which the identity of a document from a writer is being "transmitted" over a channel. The channel consists of the writer's PIN, the document, and the encoding rule. Experimental results on deformed-character detection are very promising, and the availability and practicability of the proposed method are verified by a practical system.
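A toy version of the embedding/recovery round trip might look like the following, using a simple repetition code in place of the paper's unspecified error-correcting code; all interfaces are illustrative.

```python
def embed_pin(char_positions, pin_bits, repeat=3):
    """Mark which character positions should be printed with a deformed glyph.

    char_positions: indices of characters eligible for deformation
    pin_bits: the author's PIN as a list of 0/1 bits
    A repetition code stands in for the error-correcting code; returns the
    set of positions to deform (those carrying a 1 bit).
    """
    coded = [bit for bit in pin_bits for _ in range(repeat)]
    if len(coded) > len(char_positions):
        raise ValueError("document too short to carry the PIN")
    return {pos for pos, bit in zip(char_positions, coded) if bit == 1}

def recover_pin(detected_positions, char_positions, n_bits, repeat=3):
    """Majority-vote decode from the deformations detected after OCR."""
    coded = [1 if p in detected_positions else 0
             for p in char_positions[:n_bits * repeat]]
    return [int(sum(coded[i * repeat:(i + 1) * repeat]) > repeat // 2)
            for i in range(n_bits)]
```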
CAPTCHA challenge strings: problems and improvements
Jon Bentley, Colin Mallows
A CAPTCHA is a Completely Automated Public Turing test to tell Computers and Humans Apart. Typical CAPTCHAs present a challenge string consisting of a visually distorted sequence of letters and perhaps numbers, which in theory only a human can read. Attackers of CAPTCHAs have two primary points of leverage: Optical Character Recognition (OCR) can identify some characters, while nonuniform probabilities make other characters relatively easy to guess. This paper uses a mathematical theory of assurance to characterize the probability that a correct answer to a CAPTCHA is not just a lucky guess. We examine the three most common types of challenge strings (dictionary words, Markov text, and random strings) and find substantial weaknesses in each. We therefore propose improvements to Markov text, as well as new challenges based on the consonant-vowel-consonant (CVC) trigrams of psychology. Theory and experiment together quantify the problems in current challenges and the improvements offered by the proposed modifications.
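To make the CVC idea concrete, the following sketch generates a challenge from consonant-vowel-consonant trigrams and computes the single-try guessing probability under uniform generation, whose negative log base 2 is a rough assurance in bits. This is an illustrative construction, not the authors' generator or their formal assurance measure.

```python
import math
import random

CONSONANTS = "bcdfghjklmnpqrstvwxyz"
VOWELS = "aeiou"

def cvc_challenge(n_trigrams=3, rng=random):
    """Build a challenge string from CVC trigrams, e.g. 'dathumrel'."""
    return "".join(rng.choice(CONSONANTS) + rng.choice(VOWELS) + rng.choice(CONSONANTS)
                   for _ in range(n_trigrams))

def guess_probability(n_trigrams=3):
    """Chance of guessing a uniformly generated CVC challenge in one try."""
    per_trigram = len(CONSONANTS) ** 2 * len(VOWELS)   # 21 * 5 * 21 = 2205
    p = (1.0 / per_trigram) ** n_trigrams
    return p, -math.log2(p)                            # probability, assurance bits
```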
An automatically updateable web publishing solution: taking document sharing and conversion to enterprise level
Fuad Rahman, Yuliya Tarnikova, Rachmat Hartono, et al.
This paper presents a novel automatic web publishing solution, PageView(R). PageView(R) is a complete working solution for document processing and management. The principal aim of this tool is to allow workgroups to share, access, and publish documents on-line on a regular basis. For example, assume that a person is working on some documents. The user will, in some fashion, organize his work either in his own local directory or on a shared network drive. Now extend that concept to a workgroup. Within a workgroup, some users are working together on some documents, and they are saving them in a directory structure somewhere on a document repository. The next stage of this reasoning is a workgroup that is working on some documents and wants to publish them routinely on-line. It may happen that they are using different editing tools, different software, and different graphics tools, so the resulting documents may be in PDF, Microsoft Office(R), HTML, or WordPerfect format, just to name a few. In general, this process requires the documents to be converted to HTML, after which a web designer needs to work on that collection to make it available on-line. PageView(R) takes care of this whole process automatically, making the document workflow clean and easy to follow. The PageView(R) Server publishes documents, complete with the directory structure, for online use. The documents are automatically converted to HTML and PDF so that users can view the content without downloading the original files or having to download browser plug-ins. Once published, other users can access the documents as if they were accessing them from their local folders. The paper describes the complete working system and discusses possible applications within document management research.
Document Retrieval
Automatic redaction of private information using relational information extraction
Kazem Taghva, Russell Beckley, Jeffrey Coombs, et al.
We report on an attempt to build an automatic redaction system by applying information extraction techniques to the identification of private dates of birth. We conclude that automatic redaction is a promising concept although information extraction is significantly affected by the presence of OCR error.
Document clustering: applications in a collaborative digital library
Fuad Rahman, Aman Kumar, Yuliya Tarnikova, et al.
This paper introduces a document clustering method within a commercial document repository, FileShare(R). FileShare(R) is a commercial collaborative digital library offering facilities for sharing and accessing documents over a simple Internet browser (e.g. Microsoft(R) Internet Explorer(R), Netscape(R) or Opera(R)) within groups of people working on common projects. As the number of documents increases within a digital library, displaying these documents in this environment poses a huge challenge. This paper proposes a document clustering method that uses a modified version of the traditional K-Means algorithm to categorize documents by their themes using lexical chaining within the FileShare(R) repository. The proposed algorithm is unsupervised, and has shown very high accuracy in a typical experimental setup.
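The clustering backbone the abstract refers to is the classical K-Means algorithm over document feature vectors (for instance, term frequencies or lexical-chain strengths). The plain baseline is sketched below; the paper's modifications and the lexical chaining itself are not reproduced.

```python
import numpy as np

def kmeans_documents(doc_vectors, k, iters=20, seed=0):
    """Plain K-Means over document feature vectors.

    doc_vectors: (n_docs, n_features) array
    Returns (labels, centres): a cluster index per document and the final
    cluster centres.
    """
    X = np.asarray(doc_vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centres[c] = X[labels == c].mean(axis=0)
    return labels, centres
```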
Author name recognition in degraded journal images
Aliette de Bodard de la Jacopière, Laurence Likforman-Sulem
A method for extracting names in degraded documents is presented in this article. The documents targeted are images of photocopied scientific journals from various scientific domains. Due to the degradation, there is poor OCR recognition, and pieces of other articles appear on the sides of the image. The proposed approach relies on the combination of a low-level textual analysis and an image-based analysis. The textual analysis extracts robust typographic features, while the image analysis selects image regions of interest through anchor components. We report results on the University of Washington benchmark database.
Invited Paper II
Complex document information processing: prototype, test collection, and evaluation
G. Agam, S. Argamon, O. Frieder, et al.
Analysis of large collections of complex documents is an increasingly important need for numerous applications. Complex documents are documents that typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting); and include diagrams, graphics, tables and other non-textual elements. The state of the art today for a large document collection is essentially text search of OCR'd documents with no meaningful use of data found in images, signatures, logos, etc. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are also developing a roughly 42,000,000 page complex document test collection. The collection will include relevance judgments for queries at a variety of levels of detail and depending on a variety of content and structural characteristics of documents, as well as "known item" queries looking for particular documents.
Learning and Classification
Comparative evaluation of different classifiers for robust distorted-character recognition
Basil As-Sadhan, Ziad Al Bawab, Ammar El Seed, et al.
This paper investigates and compares the application of Support Vector Machines (SVM), Principal Component Analysis (PCA), Individual Principal Component Analysis (iPCA), Linear Discriminant Analysis (LDA), and the Single-Nearest-Neighbor Method (1-NNM) to distorted-character recognition. Applying SVM achieves a classification error rate of 2.15% on the Letter-Image Dataset [Frey and Slate 1991]. This error rate is statistically comparable to the best number in the literature on this dataset that the authors are aware of, 2%, which was achieved by a fully connected MLP neural network with AdaBoost, where training was performed on 20 machines [Schwenk and Bengio 1997]. In contrast, SVM training on a single machine takes less than 3.5 minutes. The features of the dataset and the errors committed by SVM were analyzed in an attempt to combine classifiers and reduce the error rate. We report the results achieved for the different techniques used.
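For orientation, a baseline like the cited SVM result can be sketched with off-the-shelf tools on the UCI Letter-Image data; the file name, the 16000/4000 split, and the kernel settings below are assumptions rather than the authors' exact configuration, so the resulting error rate will differ.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Each row of the UCI letter data: label letter, then 16 integer features.
data = np.genfromtxt("letter-recognition.data", delimiter=",", dtype=str)
y, X = data[:, 0], data[:, 1:].astype(float)
X_train, y_train = X[:16000], y[:16000]
X_test, y_test = X[16000:], y[16000:]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_train, y_train)
print("error rate: %.2f%%" % (100.0 * (1.0 - clf.score(X_test, y_test))))
```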
Style consistent nearest neighbor classifier
Srinivas Andra, Xiaoli Zhang
Most pattern classifiers are trained on data from multiple sources, so that they can accurately classify data from any source. However, in many applications, it is necessary to classify groups of test patterns, with patterns in each group generated by the same source. The co-occurring patterns in a group are statistically dependent due to the commonality of source. The dependence between these patterns introduces style context within a group that can be exploited to improve the classification accuracy. In this paper, we present a style consistent nearest neighbor classifier that exploits style context in groups of adjacent patterns to improve the classification accuracy. We demonstrate the efficacy of the proposed classifier on a dataset of machine-printed digits where the proposed classifier reduces the error rate by 64.5%.
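One heavily simplified reading of style context in a nearest-neighbour setting is to pick, for the whole test group, the single training source whose samples fit the group best in aggregate, and then label each pattern by its nearest neighbour within that source. The sketch below implements only this guess at the idea and is not the authors' decision rule.

```python
import numpy as np

def style_consistent_nn(group, train_X, train_y, train_source):
    """Label a group of co-occurring patterns under a shared-source assumption.

    group: (g, d) test patterns known to come from one source
    train_X, train_y, train_source: training features, labels, and source ids
    """
    group = np.asarray(group, dtype=float)
    train_X = np.asarray(train_X, dtype=float)
    best = None
    for s in sorted(set(train_source)):
        idx = [i for i, src in enumerate(train_source) if src == s]
        d = np.linalg.norm(group[:, None, :] - train_X[idx][None, :, :], axis=2)
        total = d.min(axis=1).sum()                 # aggregate fit of source s
        if best is None or total < best[0]:
            best = (total, idx, d.argmin(axis=1))
    _, idx, nearest = best
    return [train_y[idx[j]] for j in nearest]
```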
Optimally combining a cascade of classifiers
Conventional approaches to combining classifiers improve accuracy at the cost of increased processing. We propose a novel search-based approach to automatically combine multiple classifiers in a cascade to obtain the desired tradeoff between classification speed and classification accuracy. The search procedure only updates the rejection thresholds (one for each constituent classifier) in the cascade; consequently, no new classifiers are added and no training is necessary. A branch-and-bound version of depth-first search with efficient pruning is proposed for finding the optimal thresholds for the cascade. It produces optimal solutions under arbitrary user-specified speed and accuracy constraints. The effectiveness of the approach is demonstrated on handwritten character recognition by finding (a) the fastest possible combination given an upper bound on classification error, and (b) the most accurate combination given a lower bound on speed.
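A skeleton of such a threshold search might look as follows; the evaluation interface and the specific pruning rule (bounding on a partial assignment) are assumptions used only to illustrate the branch-and-bound structure.

```python
def search_thresholds(n_stages, candidate_thresholds, max_error, evaluate):
    """Depth-first branch-and-bound over per-stage rejection thresholds.

    evaluate: callable(partial_thresholds) -> (error_lower_bound, cost_lower_bound)
              valid for any completion of the partial assignment.
    Returns (best_thresholds, best_cost) meeting the error constraint, or
    (None, inf) if no assignment satisfies it.
    """
    best = {"cost": float("inf"), "thresholds": None}

    def dfs(prefix):
        error_lb, cost_lb = evaluate(prefix)
        if error_lb > max_error or cost_lb >= best["cost"]:
            return                                    # prune this branch
        if len(prefix) == n_stages:
            best["cost"], best["thresholds"] = cost_lb, prefix
            return
        for t in candidate_thresholds:
            dfs(prefix + (t,))

    dfs(())
    return best["thresholds"], best["cost"]
```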
Versatile document image content extraction
We offer a preliminary report on a research program to investigate versatile algorithms for document image content extraction, that is, locating regions containing handwriting, machine-print text, graphics, line art, logos, photographs, noise, etc. Solving this problem in its full generality requires coping with a vast diversity of document and image types. Automatically trainable methods are highly desirable, as is extremely high speed for processing large collections. Significant obstacles include the expense of preparing correctly labeled ("ground-truthed") samples, unresolved methodological questions in specifying the domain (e.g., what is a representative collection of document images?), and a lack of consensus among researchers on how to evaluate content-extraction performance. Our research strategy emphasizes versatility first: that is, we concentrate at the outset on designing methods that promise to work across the broadest possible range of cases. This strategy has several important implications: the classifiers must be trainable in reasonable time on vast data sets, and expensive ground-truthed data sets must be complemented by amplification using generative models. These and other design and architectural issues are discussed. We propose a trainable classification methodology that marries k-d trees and hash-driven table lookup, and we describe preliminary experiments.
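As a rough illustration of "k-d trees married to hash-driven table lookup", the sketch below quantizes each feature vector with axis-aligned quantile cuts and maps the resulting cell, via a hash table, to the majority class seen in that cell during training. The cut choice and the backoff rule for unseen cells are assumptions, not the authors' design.

```python
import numpy as np
from collections import Counter, defaultdict

class KDHashClassifier:
    """Quantile cuts per dimension define cells; a hash table maps each cell
    to its majority training class (illustrative, not the paper's classifier)."""

    def __init__(self, cuts_per_dim=4):
        self.cuts_per_dim = cuts_per_dim

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        qs = np.linspace(0, 1, self.cuts_per_dim + 1)[1:-1]
        self.cuts_ = np.quantile(X, qs, axis=0)          # per-dimension cut points
        table = defaultdict(Counter)
        for x, label in zip(X, y):
            table[self._key(x)][label] += 1
        self.table_ = {k: c.most_common(1)[0][0] for k, c in table.items()}
        self.default_ = Counter(y).most_common(1)[0][0]  # backoff for unseen cells
        return self

    def _key(self, x):
        return tuple(int(np.searchsorted(self.cuts_[:, d], x[d]))
                     for d in range(len(x)))

    def predict(self, X):
        return [self.table_.get(self._key(x), self.default_)
                for x in np.asarray(X, dtype=float)]
```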