Proceedings Volume 7874

Document Recognition and Retrieval XVIII


View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 24 January 2011
Contents: 10 Sessions, 42 Papers, 0 Presentations
Conference: IS&T/SPIE Electronic Imaging 2011
Volume Number: 7874

Table of Contents


All links to SPIE Proceedings will open in the SPIE Digital Library.
Sessions:
  • Front Matter: Volume 7874
  • Invited Presentation I
  • Content Analysis
  • Recognition
  • Segmentation
  • Writer Identification or Verification
  • Information Retrieval
  • Document Recognition
  • OCR Error and Binarization
  • Interactive Paper Session
Front Matter: Volume 7874
This PDF file contains the front matter associated with SPIE Proceedings Volume 7874, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and the Conference Committee listing.
Invited Presentation I
Scientific challenges underlying production document processing
Eric Saund
The Field of Document Recognition is bipolar. On one end lies the excellent work of academic institutions engaging in original research on scientifically interesting topics. On the other end lies the document recognition industry which services needs for high-volume data capture for transaction and back-office applications. These realms seldom meet, yet the need is great to address technical hurdles for practical problems using modern approaches from the Document Recognition, Computer Vision, and Machine Learning disciplines. We reflect on three categories of problems we have encountered which are both scientifically challenging and of high practical value. These are Doctype Classification, Functional Role Labeling, and Document Sets. Doctype Classification asks, "What is the type of page I am looking at?" Functional Role Labeling asks, "What is the status of text and graphical elements in a model of document structure?" Document Sets asks, "How are pages and their contents related to one another?" Each of these has ad hoc engineering approaches that provide 40-80% solutions, and each of them begs for a deeply grounded formulation both to provide understanding and to attain the remaining 20-60% of practical value. The practical need is not purely technical but also depends on the user experience in application setup and configuration, and in collection and groundtruthing of sample documents. The challenge therefore extends beyond the science behind document image recognition and into user interface and user experience design.
Content Analysis
Automated identification of biomedical article type using support vector machines
In Cheol Kim, Daniel X. Le, George R. Thoma
Authors of short papers such as letters or editorials often express complementary opinions, and sometimes contradictory ones, on related work in previously published articles. The MEDLINE® citations for such short papers are required to list bibliographic data on these "commented on" articles in a "CON" field. The challenge is to automatically identify the CON articles referred to by the author of the short paper (called "Comment-in" or CIN paper). Our approach is to use support vector machines (SVM) to first classify a paper as either a CIN or a regular full-length article (which is exempt from this requirement), and then to extract from the CIN paper the bibliographic data of the CON articles. A solution to the first part of the problem, identifying CIN articles, is addressed here. We implement and compare the performance of two types of SVM, one with a linear kernel function and the other with a radial basis kernel function (RBF). Input feature vectors for the SVMs are created by combining four types of features based on statistics of words in the article title, words that suggest the article type (letter, correspondence, editorial), size of body text, and cue phrases. Experiments conducted on a set of online biomedical articles show that the SVM with a linear kernel function yields a significantly lower false negative error rate than the one with an RBF. Our experiments also show that the SVM with a linear kernel function achieves a significantly higher level of accuracy, and lower false positive and false negative error rates by using input feature vectors created by combining all four types of features rather than any single type.
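As a hedged illustration of the classification stage described above (not the authors' actual implementation), the sketch below concatenates several precomputed feature groups into one input vector and compares a linear-kernel SVM against an RBF-kernel SVM using scikit-learn; the placeholder arrays stand in for the four feature types named in the abstract.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)
    n = 400
    # Placeholder feature groups: title-word statistics, article-type words,
    # body-text size, and cue phrases (all hypothetical values).
    X_title, X_type = rng.random((n, 20)), rng.random((n, 5))
    X_size, X_cues = rng.random((n, 1)), rng.random((n, 10))
    y = rng.integers(0, 2, n)  # 1 = CIN (comment-in) paper, 0 = regular article
    X = np.hstack([X_title, X_type, X_size, X_cues])  # combined feature vector

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel).fit(X_tr, y_tr)
        tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
        print(kernel, "false-negative rate:", fn / (fn + tp),
              "false-positive rate:", fp / (fp + tn))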
Introduction of statistical information in a syntactic analyzer for document image recognition
André O. Maroneze, Bertrand Coüasnon, Aurélie Lemaitre
This paper presents an improvement to a document layout analysis system, offering a possible solution to Sayre's paradox ("a letter must be recognized before it can be segmented; and it must be segmented before it can be recognized"). This improvement, based on stochastic parsing, allows the integration of statistical information, obtained from recognizers, during syntactic layout analysis. We show how this fusion of numeric and symbolic information in a feedback loop can be applied to syntactic methods to simplify document description. To limit combinatorial explosion during the exploration of solutions, we devised an operator that allows optional activation of the stochastic parsing mechanism. Our evaluation on 1250 handwritten business letters shows that this method improves global recognition scores.
High recall document content extraction
Chang An, Henry S. Baird
We report methodologies for computing high-recall masks for document image content extraction, that is, the location and segmentation of regions containing handwriting, machine-printed text, photographs, blank space, etc. The resulting segmentation is pixel-accurate, which accommodates arbitrary zone shapes (not merely rectangles). We describe experiments showing that iterated classifiers can increase recall of all content types, with little loss of precision. We also introduce two methodological enhancements: (1) a multi-stage voting rule; and (2) a scoring policy that views blank pixels as a "don't care" class with other content classes. These enhancements improve both recall and precision, achieving at least 89% recall and at least 87% precision among three content types: machine-print, handwriting, and photo.
Shape codebook based handwritten and machine printed text zone extraction
Jayant Kumar, Rohit Prasad, Huaigu Cao, et al.
In this paper, we present a novel method for extracting handwritten and printed text zones from noisy document images with mixed content. We use Triple-Adjacent-Segment (TAS) based features which encode local shape characteristics of text in a consistent manner. We first construct two codebooks of the shape features extracted from a set of handwritten and printed text documents respectively. We then compute the normalized histogram of codewords for each segmented zone and use it to train a Support Vector Machine (SVM) classifier. The codebook based approach is robust to the background noise present in the image and TAS features are invariant to translation, scale and rotation of text. In experiments, we show that a pixel-weighted zone classification accuracy of 98% can be achieved for noisy Arabic documents. Further, we demonstrate the effectiveness of our method for document page classification and show that a high precision can be achieved for the detection of machine printed documents. The proposed method is robust to the size of zones, which may contain text content at line or paragraph level.
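A generic bag-of-codewords pipeline of the kind described above might look like the sketch below, with k-means standing in for however the codebooks are actually built, random vectors standing in for TAS shape descriptors, and a single codebook used for brevity (the paper builds separate handwritten and printed codebooks).

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Placeholder TAS-like shape descriptors pooled from training documents.
    descriptors = rng.random((1000, 16))
    codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(descriptors)

    def zone_histogram(zone_descriptors, codebook, k=64):
        # Normalized histogram of codeword assignments for one segmented zone.
        words = codebook.predict(zone_descriptors)
        hist = np.bincount(words, minlength=k).astype(float)
        return hist / max(hist.sum(), 1.0)

    # Train an SVM on per-zone histograms (1 = handwritten zone, 0 = printed zone).
    zones = [rng.random((int(rng.integers(20, 60)), 16)) for _ in range(200)]
    X = np.array([zone_histogram(z, codebook) for z in zones])
    y = rng.integers(0, 2, len(zones))
    clf = SVC(kernel="linear").fit(X, y)
    print("training accuracy:", clf.score(X, y))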
Recognition
A MRF model with parameter optimization by CRF for on-line recognition of handwritten Japanese characters
Bilan Zhu, Masaki Nakagawa
This paper describes a Markov random field (MRF) model with weighting parameters optimized by conditional random field (CRF) for on-line recognition of handwritten Japanese characters. The model extracts feature points along the pen-tip trace from pen-down to pen-up and sets each feature point from an input pattern as a site and each state from a character class as a label. It employs the coordinates of feature points as unary features and the differences in coordinates between the neighboring feature points as binary features. The weighting parameters are estimated by CRF or the minimum classification error (MCE) method. In experiments using the TUAT Kuchibue database, the method achieved a character recognition rate of 92.77%, which is higher than the previous model's rate, and the method of estimating the weighting parameters using CRF was more accurate than using MCE.
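As a rough illustration of the model family described above (not the authors' exact formulation), such an MRF/CRF scores a labeling of the feature-point sites with a weighted sum of unary and pairwise feature functions, and the weights are what CRF or MCE training estimates:

    E(\mathbf{l} \mid \mathbf{x}) = \sum_{i}\sum_{k} \lambda_k\, f_k(x_i, l_i)
      + \sum_{(i,j)\in\mathcal{N}}\sum_{m} \mu_m\, g_m(x_i, x_j, l_i, l_j)

Recognition then selects the labeling minimizing E, equivalently maximizing the CRF probability P(\mathbf{l} \mid \mathbf{x}) \propto \exp(-E(\mathbf{l} \mid \mathbf{x})).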
Improving a HMM-based off-line handwriting recognition system using MME-PSO optimization
Mahdi Hamdani, Haikal El Abed, Tarek M. Hamdani, et al.
One of the non-trivial steps in the development of a classifier is the design of its architecture. This paper presents a new algorithm, Multi Models Evolvement (MME), using Particle Swarm Optimization (PSO). This algorithm is a modified version of basic PSO and is used for the unsupervised design of Hidden Markov Model (HMM) based architectures. The proposed algorithm is applied to an Arabic handwriting recognizer based on discrete-probability HMMs. After the optimization of their architectures, the HMMs are trained with the Baum-Welch algorithm. The validation of the system is based on the IfN/ENIT database. The performance of the developed approach is compared to the systems that participated in the competition on Arabic handwriting recognition organized at the 2005 International Conference on Document Analysis and Recognition (ICDAR). The final system is a combination of an optimized HMM with 6 other HMMs obtained by a simple variation of the number of states. An absolute improvement of 6% in word recognition rate, reaching about 81%, is achieved compared to the baseline system (ARAB-IfN). The proposed recognizer also outperforms most known state-of-the-art systems.
SemiBoost-based Arabic character recognition method
A SemiBoost-based character recognition method is introduced in order to incorporate information from unlabeled practical samples in the training stage. One of the key problems in semi-supervised learning is the criterion for unlabeled sample selection. In this paper, a criterion based on pair-wise sample similarity is adopted to guide the SemiBoost learning process. At each iteration, unlabeled examples are selected and assigned labels. The selected samples are used along with the original labeled samples to train a new classifier. The trained classifiers are integrated to make the final classifier. An empirical study on several similar Arabic character pairs with different degrees of similarity shows that the proposed method improves performance as unlabeled samples reveal the distribution of practical samples.
First experiments on a new online handwritten flowchart database
Ahmad-Montaser Awal, Guihuan Feng, Harold Mouchère, et al.
We propose in this paper a new online handwritten flowchart database and perform first experiments to establish a baseline benchmark on this dataset. The collected database consists of 419 flowcharts labeled at the stroke and symbol levels. In addition, an isolated database of graphical and text symbols was extracted from these collected flowcharts. We then tackle the problem of online handwritten flowchart recognition from two different points of view. Firstly, we consider that flowcharts are correctly segmented, and we propose different classifiers to perform two tasks, text/non-text separation and graphical symbol recognition. Tested on the extracted isolated test database, we achieve up to 90% and 98% in text/non-text separation and up to 93.5% in graphical symbol recognition. Secondly, we propose a global approach that performs flowchart segmentation and recognition. For the latter, we adopt a global learning schema and a recognition architecture that considers segmentation and recognition simultaneously. The global architecture is trained and tested directly with flowcharts. Results show the interest of such a global approach, but given the complexity of the flowchart segmentation problem, there is still considerable room to improve the global learning and recognition methods.
Segmentation
Segmenting texts from outdoor images taken by mobile phones using color features
Zongyi Liu, Hanning Zhou
Recognizing text in low-resolution images taken by mobile phones has wide applications. It has been shown that a good image binarization can substantially improve the performance of OCR engines. In this paper, we present a framework to segment text from outdoor images taken by mobile phones using color features. The framework consists of three steps: (i) initial processing, including image enhancement, binarization and noise filtering, where we binarize the input images in each RGB channel and apply component-level noise filtering; (ii) grouping components into blocks using color features, where we compute component similarities by dynamically adjusting the weights of the RGB channels and merge groups hierarchically; and (iii) block selection, where we use run-length features and choose a Support Vector Machine (SVM) as the classifier. We tested the algorithm on 13 outdoor images taken by an old-style LG-64693 mobile phone with 640x480 resolution. We compared the segmentation results with Tsar's algorithm, a state-of-the-art camera text detection algorithm, and show that our algorithm is more robust, particularly in terms of false alarm rates. In addition, we evaluated the impact of our algorithm on Abbyy's FineReader, one of the most popular commercial OCR engines on the market.
A perceptive method for handwritten text segmentation
This paper presents a new method to address the problem of handwritten text segmentation into text lines and words. We propose a method based on the cooperation among points of view that enables the localization of the text lines in a low-resolution image, and then associates the pixels at a higher level of resolution. Thanks to the combination of levels of vision, we can detect overlapping characters and re-segment the connected components during the analysis. We then propose a segmentation of lines into words based on the cooperation between digital data and symbolic knowledge. The digital data are obtained from distances inside a Delaunay graph, which gives a precise distance between connected components at the pixel level. We introduce structural rules in order to take into account generic knowledge about the organization of a text page. This cooperation among sources of information gives greater expressive power and ensures the global coherence of the recognition. We validate this work using the metrics and the database proposed for the segmentation contest of ICDAR 2009, and we show that our method obtains very good results compared to the other methods in the literature. More precisely, we are able to deal with slope and curvature, overlapping text lines and varied kinds of writing, which are the main difficulties met by the other methods.
Improved document image segmentation algorithm using multiresolution morphology
Page segmentation into text and non-text elements is an essential preprocessing step before optical character recognition (OCR). In case of poor segmentation, an OCR classification engine produces garbage characters due to the presence of non-text elements. This paper describes modifications to the text/non-text segmentation algorithm presented by Bloomberg [1], which is also available in his open-source Leptonica library [2]. The modifications result in significant improvements and achieve better segmentation accuracy than the original algorithm on the UW-III, UNLV, and ICDAR 2009 page segmentation competition test images and on circuit diagram datasets.
Writer Identification or Verification
Feature relevance analysis for writer identification
Imran Siddiqi, Khurram Khurshid, Nicole Vincent
This work presents an analytical study on the relevance of features in an existing framework for writer identification from offline handwritten document images. The identification system comprises a set of 15 features combining the orientation and curvature information in a writing with the well-known codebook based approach. This study aims to find the optimal feature subset to identify the author of a questioned document while maintaining acceptable identification rates. Employing a genetic algorithm with a wrapper method we carry out a feature selection mechanism and identify the most relevant features that characterize the writer of a handwritten document.
Using perturbed handwriting to support writer identification in the presence of severe data constraints
Jin Chen, Wen Cheng, Daniel Lopresti
Since real data is time-consuming and expensive to collect and label, researchers have proposed approaches using synthetic variations for the tasks of signature verification, speaker authentication, handwriting recognition, keyword spotting, etc. However, the limitation of real data is particularly critical in the field of writer identification: in forensics, adversaries cannot be expected to provide sufficient data to train a classifier. Therefore, it is unrealistic to always assume sufficient real data to train classifiers extensively for writer identification. In addition, this field differs from many others in that we strive to preserve as much inter-writer variation as possible, while model-perturbed handwriting might break such discriminability among writers. Building on work described in another paper, where human subjects were involved in calibrating realistic-looking transformations, we measured the effects of incorporating perturbed handwriting into the training dataset. Experimental results support our hypothesis that, with limited real data, model-perturbed handwriting improves the performance of writer identification. In particular, when only a single sample for each writer was available, incorporating perturbed data achieved a 36x performance gain.
Statistical characterization of handwriting characteristics using automated tools
We provide a statistical basis for reporting the results of handwriting examination by questioned document (QD) examiners. As a facet of QD examination, the analysis and reporting of handwriting examination suffers from the lack of statistical data concerning the frequency of occurrence of combinations of particular handwriting characteristics. QD examiners tend to assign probative values to specific handwriting characteristics and their combinations based entirely on the examiner's experience and power of recall. The research uses databases of handwriting samples that are representative of the US population. Feature lists of characteristics provided by QD examiners are used to determine which frequencies need to be evaluated. Algorithms are used to automatically extract those characteristics; e.g., a software tool for extracting most of the characteristics of the most common letter pair, "th", is functional. For each letter combination, the marginal and conditional frequencies of their characteristics are evaluated. Based on statistical dependencies of the characteristics, the probability of any given letter formation is computed. The resulting algorithms are incorporated into a system for writer verification known as CEDAR-FOX.
Information Retrieval
Keyword and image-based retrieval of mathematical expressions
Richard Zanibbi, Bo Yuan
Two new methods for retrieving mathematical expressions using conventional keyword search and expression images are presented. An expression-level TF-IDF (term frequency-inverse document frequency) approach is used for keyword search, where queries and indexed expressions are represented by keywords taken from LaTeX strings. TF-IDF is computed at the level of individual expressions rather than documents to increase the precision of matching. The second retrieval technique is a form of Content-Based Image Retrieval (CBIR). Expressions are segmented into connected components, and then components in the query expression and each expression in the collection are matched using contour and density features, aspect ratios, and relative positions. In an experiment using ten randomly sampled queries from a corpus of over 22,000 expressions, precision-at-k (k = 20) for the keyword-based approach was higher (keyword: μ = 84.0, σ = 19.0; image-based: μ = 32.0, σ = 30.7), but for a few of the queries better results were obtained using a combination of the two techniques.
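A minimal sketch of expression-level TF-IDF scoring, assuming each indexed expression has already been tokenized into keywords (for example, from its LaTeX string); this is illustrative only, with hypothetical helper names, and is not the authors' code.

    import math
    from collections import Counter

    def tfidf_index(expressions):
        # expressions: list of keyword lists, one list per indexed expression
        n = len(expressions)
        df = Counter(t for expr in expressions for t in set(expr))
        idf = {t: math.log(n / df[t]) for t in df}
        vectors = [{t: tf * idf[t] for t, tf in Counter(expr).items()}
                   for expr in expressions]
        return vectors, idf

    def score(query, vector, idf):
        # Dot-product relevance between a keyword query and one indexed expression.
        q = Counter(query)
        return sum(q[t] * idf.get(t, 0.0) * vector.get(t, 0.0) for t in q)

    # Toy usage: each expression is a bag of LaTeX-derived keywords.
    exprs = [["\\frac", "x", "y"], ["\\sqrt", "x"], ["\\sum", "i", "x"]]
    vecs, idf = tfidf_index(exprs)
    best = max(range(len(exprs)), key=lambda i: score(["\\sqrt", "x"], vecs[i], idf))
    print("best match:", exprs[best])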
Word spotting for handwritten documents using Chamfer Distance and Dynamic Time Warping
Raid M. Saabni, Jihad A. El-Sana
A large number of handwritten historical documents are held in libraries around the world. The desire to access, search, and explore these documents paves the way for a new age of knowledge sharing and promotes collaboration and understanding between human societies. Currently, the indexes for these documents are generated manually, which is very tedious and time consuming. Results produced by state-of-the-art techniques for converting complete images of handwritten documents into textual representations are not yet sufficient. Therefore, word-spotting methods have been developed to archive and index images of handwritten documents in order to enable efficient searching within documents. In this paper, we present a new matching algorithm to be used in word-spotting tasks for historical Arabic documents. We present a novel algorithm based on the Chamfer Distance to compute the similarity between shapes of word-parts. Matching results are used to cluster images of Arabic word-parts into different classes using the Nearest Neighbor rule. To compute the distance between two word-part images, the algorithm subdivides each image into equal-sized slices (windows). A modified version of the Chamfer Distance, incorporating geometric gradient features and distance transform data, is used as a similarity distance between the different slices. Finally, the Dynamic Time Warping (DTW) algorithm is used to measure the distance between two images of word-parts. By using DTW we enable our system to cluster similar word-parts, even though they are transformed non-linearly due to the nature of handwriting. We tested our implementation of the presented methods using various documents in different writing styles, taken from the Juma'a Al Majid Center in Dubai, and obtained encouraging results.
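To make the matching step concrete, here is a generic dynamic time warping sketch that aligns two sequences of per-slice feature vectors; plain Euclidean distance stands in for the modified Chamfer Distance the authors use between slices, so treat this as an assumption-laden illustration rather than their algorithm.

    import numpy as np

    def dtw_distance(a, b):
        # DTW between two sequences of slice feature vectors (1-D numpy arrays);
        # Euclidean distance replaces the paper's Chamfer-based slice similarity.
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # Toy usage: two word-part images represented as sequences of slice features.
    w1 = [np.array([0.1, 0.3]), np.array([0.2, 0.5]), np.array([0.9, 0.4])]
    w2 = [np.array([0.1, 0.3]), np.array([0.8, 0.5])]
    print(dtw_distance(w1, w2))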
Automatic identification of ROI in figure images toward improving hybrid (text and image) biomedical document retrieval
Daekeun You, Sameer Antani, Dina Demner-Fushman, et al.
Biomedical images are often referenced for clinical decision support (CDS), educational purposes, and research. They appear in specialized databases or in biomedical publications and are not meaningfully retrievable using primarily text-based retrieval systems. The task of automatically finding the images in an article that are most useful for the purpose of determining relevance to a clinical situation is quite challenging. An approach is to automatically annotate images extracted from scientific publications with respect to their usefulness for CDS. As an important step toward achieving the goal, we proposed figure image analysis for localizing pointers (arrows, symbols) to extract regions of interest (ROI) that can then be used to obtain meaningful local image content. Content-based image retrieval (CBIR) techniques can then associate local image ROIs with identified biomedical concepts in figure captions for improved hybrid (text and image) retrieval of biomedical articles. In this work we present methods that make robust our previous Markov random field (MRF)-based approach for pointer recognition and ROI extraction. These include use of Active Shape Models (ASM) to overcome problems in recognizing distorted pointer shapes and a region segmentation method for ROI extraction. We measure the performance of our methods on two criteria: (i) effectiveness in recognizing pointers in images, and (ii) improved document retrieval through use of extracted ROIs. Evaluation on three test sets shows 87% accuracy in the first criterion. Further, the quality of document retrieval using local visual features and text is shown to be better than using visual features alone.
Automatic extraction of numeric strings in unconstrained handwritten document images
M. Mehdi Haji, Tien D. Bui, Ching Y. Suen
Numeric strings such as identification numbers carry vital pieces of information in documents. In this paper, we present a novel algorithm for automatic extraction of numeric strings in unconstrained handwritten document images. The algorithm has two main phases: pruning and verification. In the pruning phase, the algorithm first performs a new segment-merge procedure on each text line, and then using a new regularity measure, it prunes all sequences of characters that are unlikely to be numeric strings. The segment-merge procedure is composed of two modules: a new explicit character segmentation algorithm which is based on analysis of skeletal graphs and a merging algorithm which is based on graph partitioning. All the candidate sequences that pass the pruning phase are sent to a recognition-based verification phase for the final decision. The recognition is based on a coarse-to-fine approach using probabilistic RBF networks. We developed our algorithm for the processing of real-world documents where letters and digits may be connected or broken in a document. The effectiveness of the proposed approach is shown by extensive experiments done on a real-world database of 607 documents which contains handwritten, machine-printed and mixed documents with different types of layouts and levels of noise.
Document Recognition
Unsupervised method to generate page templates
Hervé Déjean
In this paper, we propose a method for automatically inferring the different page templates used to lay out the document content. The first step of the method consists in performing a logical analysis of the document. Depending on the coverage of this step, a given number of document elements are labeled. Geometric relations are then computed between these labeled elements, and page template candidates are generated from frequently related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the previous steps of document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.
Font group identification using reconstructed fonts
Ideally, digital versions of scanned documents should be represented in a format that is searchable, compressed, highly readable, and faithful to the original. These goals can theoretically be achieved through OCR and font recognition, re-typesetting the document text with original fonts. However, OCR and font recognition remain hard problems, and many historical documents use fonts that are not available in digital forms. It is desirable to be able to reconstruct fonts with vector glyphs that approximate the shapes of the letters that form a font. In this work, we address the grouping of tokens in a token-compressed document into candidate fonts. This permits us to incorporate font information into token-compressed images even when the original fonts are unknown or unavailable in digital format. This paper extends previous work in font reconstruction by proposing and evaluating an algorithm to assign a font to every character within a document. This is a necessary step to represent a scanned document image with a reconstructed font. Through our evaluation method, we have measured a 98.4% accuracy for the assignment of letters to candidate fonts in multi-font documents.
How carefully designed open resource sharing can help and expand document analysis research
Bart Lamiroy, Daniel Lopresti, Henry Korth, et al.
Making datasets available for peer review of published document analysis methods or distributing large, commonly used document corpora for benchmarking are extremely useful and sound practices and initiatives. This paper shows that they cover only a very small segment of the possible uses of shared and commonly available research data. We develop a completely new paradigm for sharing and accessing common datasets, benchmarks and other tools, based on a very open and free community-based contribution model. The model is operational and has been implemented so that it can be tested on a broad scale. The new interactions that will arise from its use may spark innovative ways of conducting document analysis research on the one hand, but will also create very challenging interactions with other research domains.
Multiple-agent adaptation in whole-book recognition
In order to accurately recognize textual images of a book, we often employ various models, including an iconic model (for character classification), a dictionary (for word recognition), a character segmentation model, etc., which are derived from prior knowledge. Imperfections in these models inevitably affect recognition performance. In this paper, we propose an unsupervised learning technique that adapts multiple models on-the-fly on a homogeneous input data set to achieve better overall recognition accuracy fully automatically. The major challenge for this unsupervised learning process is how to make the models improve, rather than damage, one another. In our framework, models measure disagreements between their input data and output data. We propose a policy based on these disagreements to safely adapt multiple models simultaneously (or alternately). We construct a book recognition system based on this framework and demonstrate its feasibility.
OCR Error and Binarization
Ancient documents bleed-through evaluation and its application for predicting OCR error rates
V. Rabeux, N. Journet, J. P. Domenger
This article presents a way to evaluate the bleed-through defect on very old document images. We design measures to quantify and evaluate the verso ink bleeding through the paper onto the recto side. Measuring the bleed-through defect allows us to perform statistical analyses that are able to predict the feasibility of different post-scan tasks. In this article we illustrate our measures by creating two OCR error rate prediction models based on bleed-through evaluation: one for ABBYY FineReader, a very powerful commercial OCR engine, and one for OCRopus, which is sponsored by Google. Both prediction models appear to be very accurate according to various statistical indicators.
Binarization of camera-captured documents using a MAP approach
Xujun Peng, Srirangaraj Setlur, Venu Govindaraju, et al.
Document binarization is one of the initial and critical steps for many document analysis systems. Nowadays, with the success and popularity of hand-held devices, large efforts are being made to convert documents into digital format by using hand-held cameras. In this paper, we propose a Bayesian maximum a posteriori (MAP) estimation algorithm to binarize camera-captured document images. A novel adaptive segmentation surface estimation and normalization method is proposed as the preprocessing step in our work, followed by a Markov Random Field based refinement procedure to remove noise and smooth the binarized result. Experimental results show that our method performs better than other algorithms on document images with poor or uneven illumination.
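For readers unfamiliar with the MAP formulation, the generic decision rule behind such a binarizer assigns each pixel observation x the label that maximizes the posterior probability; this is the textbook form, not necessarily the exact model used in the paper:

    \hat{c}(x) = \arg\max_{c \in \{\text{text},\,\text{background}\}} P(c \mid x)
               = \arg\max_{c} \; p(x \mid c)\, P(c)

The MRF-based refinement step described above then enforces spatial smoothness of these per-pixel decisions.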
Statistical multi-resolution schemes for historical document binarization
In previous work, we proposed the application of the Expectation-Maximization (EM) algorithm to the binarization of historical documents by defining a multi-resolution framework. In this work, we extend the multi-resolution framework to the Otsu algorithm for effective binarization of historical documents. We compare the effectiveness of the EM-based binarization technique to the Otsu thresholding algorithm on historical documents. We demonstrate how the EM algorithm can be extended to perform an effective segmentation of historical documents by taking into account multiple features beyond the intensity of the document image. Experimental results, analysis and comparisons to known techniques are presented using the document image collection from the DIBCO 2009 contest.
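As a point of reference for the comparison described above, the following is a minimal single-resolution implementation of global Otsu thresholding (not the authors' multi-resolution extension); it selects the threshold that maximizes the between-class variance of the grayscale histogram.

    import numpy as np

    def otsu_threshold(gray):
        # Return the Otsu threshold for a uint8 grayscale image.
        hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
        prob = hist / hist.sum()
        best_t, best_var = 0, -1.0
        for t in range(1, 256):
            w0, w1 = prob[:t].sum(), prob[t:].sum()
            if w0 == 0 or w1 == 0:
                continue
            mu0 = (np.arange(t) * prob[:t]).sum() / w0
            mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
            var_between = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
            if var_between > best_var:
                best_var, best_t = var_between, t
        return best_t

    # Toy usage on a small synthetic image with two gray-level populations.
    gray = np.array([[10, 12, 200, 205], [11, 13, 198, 210]], dtype=np.uint8)
    binary = gray >= otsu_threshold(gray)  # True = brighter class
    print(otsu_threshold(gray), binary)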
Interactive Paper Session
A simple and effective figure caption detection system for old-style documents
Zongyi Liu, Hanning Zhou
Identifying figure captions has wide applications in producing high-quality e-books such as Kindle or iPad books. In this paper, we present a rule-based system to detect horizontal figure captions in old-style documents. Our algorithm consists of three steps: (i) segment images into regions of different types such as text and figures, (ii) search for the best caption region candidate based on heuristic rules such as region alignments and distances, and (iii) expand the caption regions identified in step (ii) with their neighboring text regions in order to correct over-segmentation errors. We test our algorithm using 81 images collected from old-style books, with each image containing at least one figure area. We show that the approach is able to correctly detect figure captions from images with different layouts, and we also measure its performance in terms of both precision and recall.
Reflowing-driven paragraph recognition for electronic books in PDF
Jing Fang, Zhi Tang, Liangcai Gao
When reading electronic books on handheld devices, content sometimes needs to be reflowed and recomposed to adapt to small-screen mobile devices. Given people's reading practice, it is reasonable to reflow the text content based on paragraphs. Hence, this paper addresses this requirement and proposes a set of novel methods for paragraph recognition in electronic books in PDF. The proposed methods consist of three steps, namely, physical structure analysis, paragraph segmentation, and reading order detection. We make use of the locally ordered property of PDF documents and the layout style of books to improve traditional page recognition results. In addition, we employ optimal bipartite graph matching to detect the paragraphs' reading order. Experiments show that our methods achieve high accuracy. It is noteworthy that the research has been applied in a commercial software package for Chinese e-book production.
Ruling line detection and removal
In this paper we present a procedure for removing ruling lines from a handwritten document image that does not require any preprocessing or postprocessing and does not break existing characters. We take advantage of common ruling line properties such as uniform width, predictable spacing, position relative to text, etc. The deletion procedure for a detected ruling line is based on the fact that the determinant formed from the coordinates of three collinear points is equal to zero. The system is evaluated on synthetic page images in five different languages and is compared to a previous method.
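The collinearity test mentioned above is the standard determinant condition: three points (x_1, y_1), (x_2, y_2), (x_3, y_3) lie on a single line exactly when

    \begin{vmatrix}
    x_1 & y_1 & 1 \\
    x_2 & y_2 & 1 \\
    x_3 & y_3 & 1
    \end{vmatrix} = 0

that is, when the signed area of the triangle they span vanishes; candidate ruling-line pixels can be tested against this condition before deletion.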
Natural scene logo recognition by joint boosting feature selection in salient regions
Wei Fan, Jun Sun, Satoshi Naoi, et al.
Logos are considered valuable intellectual property and a key component of the goodwill of a business. In this paper, we propose a natural scene logo recognition method which is segmentation-free and capable of processing images extremely rapidly while achieving high recognition rates. The classifiers for each logo are trained jointly, rather than independently. In this way, common features can be shared across multiple classes for better generalization. To deal with the large range of aspect ratios of different logos, a set of salient regions of interest (ROI) is extracted to describe each class. We ensure that the selected ROIs are both individually informative and pairwise weakly dependent by a Class Conditional Entropy Maximization criterion. Experimental results on a large logo database demonstrate the effectiveness and efficiency of our proposed method.
A framework to improve digital corpus uses: image-mode navigation
Loris Eynard, Vincent Malleron, Hubert Emptoz
In this paper, we propose a new system to enhance navigation inside digital corpora. The system is based on automatic indexing in image mode and provides the user with intuitive navigation in interactive time. Keywords and containers are extracted directly from the document images to create an Image Mode Index, which shows the keywords as cut-out images of their actual appearances. Our approach recreates a summary of the structured documents, following indications given by the creators of the documents themselves. Our system is detailed in the general case, and sample applications on a 19th-century handwritten corpus and an 18th-century machine-printed text corpus are provided. This approach, developed for documents that would otherwise be unreachable, can be applied to any corpus where keywords and containers can be identified.
Parameter calibration for synthesizing realistic-looking variability in offline handwriting
Wen Cheng, Dan Lopresti
Motivated by the widely accepted principle that the more training data, the better a recognition system performs, we conducted experiments asking human subjects to evaluate a mixture of real English handwritten text lines and text lines altered from existing handwriting with various degrees of distortion. The idea of generating synthetic handwriting is based on a perturbation method by T. Varga and H. Bunke that distorts an entire text line. Our experiments have two purposes. First, we want to calibrate distortion parameter settings for Varga and Bunke's perturbation model. Second, we intend to compare the effects of parameter settings on different writing styles: block, cursive and mixed. From the preliminary experimental results, we determined appropriate ranges for the amplitude parameters and found that parameter settings should be altered for different handwriting styles. With proper parameter settings, it should be possible to generate large amounts of training and testing data for building better off-line handwriting recognition systems.
Automatic segmentation of subfigure image panels for multimodal biomedical document retrieval
Beibei Cheng, Sameer Antani, R. Joe Stanley, et al.
Biomedical images are often referenced for clinical decision support (CDS), educational purposes, and research. The task of automatically finding the images in a scientific article that are most useful for determining relevance to a clinical situation is traditionally done using text and is quite challenging. We propose to improve this by associating image features from the entire image and from relevant regions of interest with biomedical concepts described in the figure caption or discussion in the article. However, images used in scientific article figures are often composed of multiple panels, where each sub-figure (panel) is referenced in the caption using alphanumeric labels, e.g., Figure 1(a), 2(c), etc. It is necessary to separate individual panels from a multi-panel figure as a first step toward automatic annotation of images. In this work we present methods that make our previously reported efforts more robust. Specifically, we address the limitation in segmenting figures that do not exhibit explicit inter-panel boundaries, e.g., illustrations, graphs, and charts. We present a novel hybrid clustering algorithm based on particle swarm optimization (PSO) with a fuzzy logic controller (FLC) to locate related figure components in such images. Results from our evaluation are very promising, with 93.64% panel detection accuracy for regular (non-illustration) figure images and 92.1% accuracy for illustration images. A computational complexity analysis also shows that PSO is an optimal approach with relatively low computation time. The accuracy of separating these two types of images is 98.11%, achieved using a decision tree.
A new method for perspective correction of document images
José Rodríguez-Piñeiro, Pedro Comesaña-Alfaro, Fernando Pérez-González, et al.
In this paper we propose a method for perspective distortion correction of rectangular documents. The scheme exploits the orthogonality of the document edges, allowing the aspect ratio of the original document to be recovered. The results obtained after correcting the perspective of several document images captured with a mobile phone are compared with those achieved by digitizing the same documents with several scanner models.
Robust keyword retrieval method for OCRed text
Yusaku Fujii, Hiroaki Takebe, Hiroshi Tanaka, et al.
Document management systems have become important because of the growing popularity of electronic filing of documents and scanning of books, magazines, manuals, etc., through a scanner or a digital camera, for storage or reading on a PC or an electronic book. Text information acquired by optical character recognition (OCR) is usually added to the electronic documents for document retrieval. Since texts generated by OCR generally include character recognition errors, robust retrieval methods have been introduced to overcome this problem. In this paper, we propose a retrieval method that is robust against both character segmentation and recognition errors. In the proposed method, the insertion of noise characters and dropping of characters in the keyword retrieval enables robustness against character segmentation errors, and character substitution in the keyword of the recognition candidate for each character in OCR or any other character enables robustness against character recognition errors. The recall rate of the proposed method was 15% higher than that of the conventional method. However, the precision rate was 64% lower.
Online medical symbol recognition using a Tablet PC
Amlan Kundu, Qian Hu, Stanley Boykin, et al.
In this paper we describe a scheme to enhance the usability of a Tablet PC's handwriting recognition system by including medical symbols that are not a part of the Tablet PC's symbol library. The goal of this work is to make handwriting recognition more useful for medical professionals accustomed to using medical symbols in medical records. To demonstrate that this new symbol recognition module is robust and expandable, we report results on both a medical symbol set and an expanded symbol test set which includes selected mathematical symbols.
Characterizing challenged Minnesota ballots
Photocopies of the ballots challenged in the 2008 Minnesota elections, which constitute a public record, were scanned on a high-speed scanner and made available on a public radio website. The PDF files were downloaded, converted to TIF images, and posted on the PERFECT website. Based on a review of relevant image-processing aspects of paper-based election machinery and on additional statistics and observations on the posted sample data, robust tools were developed for determining the underlying grid of the targets on these ballots regardless of skew, clipping, and other degradations caused by high-speed copying and digitization. The accuracy and robustness of a method based on both index-marks and oval targets are demonstrated on 13,435 challenged ballot page images.
A mask-based enhancement method for historical documents
This paper proposes a novel method for document enhancement. The method is based on the combination of two state-of-the-art filters through the construction of a mask. The mask is applied to a TV (Total Variation) regularized image in which background noise has been reduced. The masked image is then filtered by NL-means (Non-Local Means), which reduces the noise in the text areas located by the mask. The document images to be enhanced are real historical documents from several periods which include several defects in their background. These defects result from scanning, paper aging and bleed-through. We observe the improvement provided by this enhancement method through OCR accuracy.
Document image retrieval with morphology-based segmentation and features combination
Tiago C. Bockholt, George D. C. Cavalcanti, Carlos A. B. Mello
Digital libraries need more than just retrieval based on keywords, which can be inefficient for some applications. Thus, document retrieval based on the content of the digitized image version of the document can be a more appropriate approach. This paper discusses the retrieval of document images by means of identifying a variety of elements present in the body of the document image. We propose a new strategy to identify and combine features extracted from a document image. We also consider the task of constructing an optimized feature set to improve the retrieval performance and validate our experiments on an assorted database. Experimental results show that the proposed segmentation together with a wise feature combination increases the overall retrieval performance. Moreover, the retrieved images demonstrate the generality and effectiveness of our approach for efficient segmentation and classification of document images.
Boosting based text and non-text region classification
Layout analysis is a crucial process for document image understanding and information retrieval. Document layout analysis depends on page segmentation and block classification. This paper describes an algorithm for extracting blocks from document images and a boosting-based method to classify those blocks as machine-printed text or not. The feature vector fed into the boosting classifier consists of a four-direction run-length histogram and connected-component features from both the background and the foreground. Using this combination of features through a boosting classifier, we obtain an accuracy of 99.5% on our test collection.
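As a hedged illustration of the classification stage only (the run-length histogram and connected-component feature extraction is assumed to happen elsewhere), a boosting classifier over such per-block feature vectors can be trained as follows; all data here is synthetic placeholder material.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.default_rng(0)
    # Placeholder block features: a four-direction run-length histogram plus
    # foreground/background connected-component statistics, one vector per block.
    X = rng.random((500, 40))
    y = rng.integers(0, 2, 500)  # 1 = machine-printed text block, 0 = other

    clf = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)
    print("training accuracy:", clf.score(X, y))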
OMR of early plainchant manuscripts in square notation: a two-stage system
Carolina Ramirez, Jun Ohya
While Optical Music Recognition (OMR) of modern printed and handwritten documents is considered a solved problem, with many commercial systems available today, the OMR of ancient musical manuscripts still remains an open problem. In this paper we present a system for the OMR of degraded western plainchant manuscripts in square notation from the 14th to 16th centuries. The system has two main blocks: the first deals with symbol extraction and recognition, while the second acts as an error detection stage for the outputs of the first block. For symbol extraction we use widely known image-processing techniques, such as Sobel filtering and the Hough Transform, and an SVM for classification. The error detection stage is implemented with a hidden Markov model (HMM), which takes advantage of a priori knowledge about this specific kind of music.