Proceedings Volume 5296

Document Recognition and Retrieval XI

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 15 December 2003
Contents: 7 Sessions, 24 Papers, 0 Presentations
Conference: Electronic Imaging 2004
Volume Number: 5296

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.

Sessions:
  • Information Extraction
  • Invited Paper I
  • Information Retrieval
  • Multimedia Applications
  • Handwriting Recognition
  • Multilingual OCR
  • Document Image Analysis
Information Extraction
The impact of running headers and footers on proximity searching
Hundreds of experiments over the last decade on the retrieval of OCR documents performed by the Information Science Research Institute have shown that OCR errors do not significantly affect retrievability. We extend those results to show that in the case of proximity searching, the removal of running headers and footers from OCR text will not improve retrievability for such searches.
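Proximity searching requires two query terms to occur within a fixed number of words of each other. A minimal sketch (the function and sample strings are illustrative, not taken from the study) shows why running-header tokens interleaved with body text could, in principle, push matching terms outside the proximity window:

```python
def proximity_match(tokens, term_a, term_b, window=10):
    """True if term_a and term_b occur within `window` words of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

# A running header inserted mid-sentence pushes the terms apart:
clean = "the patent claims a novel engine design".split()
noisy = "the patent claims ANNUAL REPORT 1994 PAGE 7 a novel engine design".split()
```

With a window of 5 words, `clean` matches the query (patent, engine) while `noisy` does not; the paper's finding is that in practice such header/footer interference does not change retrievability enough to justify removal.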
Hierarchical logical structure extraction of book documents by analyzing tables of contents
Extracting the logical structure of book documents is important for the automatic construction of electronic document databases. The table of contents of a book plays an important role in representing the overall logical structure and reference information of the book. In this paper, a new method is proposed to extract the hierarchical logical structure of book documents, along with the reference information, by combining spatial and semantic information from the table of contents. Experimental results obtained from testing on various book documents demonstrate the effectiveness and robustness of the proposed approach.
Style-independent document labeling: design and performance evaluation
The Medical Article Records System, or MARS, has been developed at the U.S. National Library of Medicine (NLM) for automated data entry of bibliographic information from medical journals into MEDLINE, the premier bibliographic citation database at NLM. Currently, a rule-based algorithm (called ZoneCzar) is used for labeling important bibliographic fields (title, author, affiliation, and abstract) on medical journal article page images. While rules have been created for medical journals with regular layout types, new rules have to be created manually for any input journals with arbitrary or new layout types. It is therefore of interest to label journal articles independent of their layout styles. In this paper, we first describe a system (called ZoneMatch) for automated generation of crucial geometric and non-geometric features of important bibliographic fields based on string-matching and clustering techniques. The rule-based algorithm is then modified to use these features to perform style-independent labeling. We then describe a performance evaluation method for quantitatively evaluating our algorithm and characterizing its error distributions. Experimental results show that the labeling performance of the rule-based algorithm is significantly improved when the generated features are used.
Invited Paper I
The past, present, and future of web information retrieval
In this article we describe the approach taken by the first web search engines, discuss the state of the art, and present some of the challenges for the future.
Information Retrieval
Retrieving topical sentiments from online document collections
Matthew F. Hurst, Kamal Nigam
Retrieving documents by subject matter is the general goal of information retrieval and other content access systems. There are aspects of textual content, however, which form equally valid selection criteria. One such aspect is that of sentiment or polarity, indicating the author's opinion of or emotional relationship with some topic. Recent work in this area has treated polarity effectively as a discrete aspect of text. In this paper we present a lightweight but robust approach to combining topic and polarity, thus enabling content access systems to select content based on a certain opinion about a certain topic.
Adaptive color document image binarization for text retrieval
Yi Li, Zhiyan Wang, Haizan Zeng
This paper presents a decision-tree-based adaptive binarization method for text retrieval in color document images. The method extends Niblack's windowed thresholding technique and employs hue (H), saturation (S), and value (V). First, an observation window is retrieved and, based on the standard deviations of H, S, and V, a pre-defined decision tree selects the variables to be used. Second, the Karhunen-Loeve Transform (KLT) is used to eliminate correlation and reduce dimensionality. Finally, the center point of the window is classified based on a 2-D standard normal distribution. The results show that our binarization method produces better results on color document images than Niblack's method and global thresholding methods such as Otsu's. A comparison using a commercial OCR system shows that our method can be used in various situations for high-quality text retrieval.
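The abstract's starting point, Niblack's windowed thresholding, sets a local threshold from the mean and standard deviation of a window around each pixel. A minimal grayscale sketch (the window size, k value, and sample image are illustrative choices, not the paper's parameters):

```python
import math

def niblack_threshold(img, x, y, w=3, k=-0.2):
    """Local Niblack threshold at (x, y): T = mean + k * std over a w x w window."""
    h, wid = len(img), len(img[0])
    vals = [img[i][j]
            for i in range(max(0, y - w // 2), min(h, y + w // 2 + 1))
            for j in range(max(0, x - w // 2), min(wid, x + w // 2 + 1))]
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return mean + k * std

def binarize(img, w=3, k=-0.2):
    """1 = background (above local threshold), 0 = ink (below it)."""
    return [[1 if img[y][x] > niblack_threshold(img, x, y, w, k) else 0
             for x in range(len(img[0]))] for y in range(len(img))]
```

The paper's contribution is choosing, per window, which of the H, S, and V channels feed such a local decision, rather than thresholding a single intensity channel.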
Word image retrieval using binary features
Existing word image retrieval algorithms suffer from either low retrieval precision or high computational complexity. We present an effective and efficient approach to word image matching using gradient-based binary features. Experiments over a large database of handwritten word images show that the proposed approach consistently outperforms the existing best handwritten word image retrieval algorithm, Dynamic Time Warping (DTW) with profile-based shape features. Not only does the proposed approach achieve much higher retrieval accuracy, it is also 893 times faster than DTW.
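The DTW baseline the abstract compares against aligns two feature sequences of different lengths by warping one onto the other. A generic 1-D DTW sketch (not the paper's profile features), whose quadratic cost per comparison is what the binary-feature approach avoids:

```python
def dtw(a, b):
    """Dynamic Time Warping distance between two 1-D feature sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: match, insertion, deletion
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

A repeated sample in one sequence costs nothing under warping: `dtw([1, 2, 3], [1, 2, 2, 3])` is 0.0, which is why DTW tolerates variable-width handwriting but pays O(nm) per word pair.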
Multimedia Applications
SmartNails: display- and image-dependent thumbnails
Kathrin Berkner, Edward L. Schwartz, Christophe Marle
To overcome the poor readability of text and recognizability of image features in low-resolution thumbnails, a novel representation of compound document images, the SmartNail representation, is presented. SmartNails are replacements for or supplements to traditional thumbnails of compound documents and contain cropped and scaled image and text segments. Image- and text-based analyses are merged to generate a layout for a particular display size with selected readable text and recognizable image regions. The analysis is performed efficiently using information from document layout analysis and JPEG 2000 compressed file headers.
Automatic document navigation for digital content remastering
This paper presents a novel method of automatically adding navigation capabilities to re-mastered electronic books. We first analyze the need for a generic and robust system to automatically construct navigation links into re-mastered books. We then introduce the core algorithm based on text matching for building the links. The proposed method utilizes the tree-structured dictionary and directional graph of the table of contents to efficiently conduct the text matching. Information fusion further increases the robustness of the algorithm. The experimental results on the MIT Press digital library project are discussed and the key functional features of the system are illustrated. We have also investigated how the quality of the OCR engine affects the linking algorithm. In addition, the analogy between this work and Web link mining has been pointed out.
Slide identification for lecture movies by matching characters and images
Noriaki Ozawa, Hiroaki Takebe, Yutaka Katsuyama, et al.
Slide identification is very important when creating e-Learning materials from lecture movies. Simply detecting slide changes is not enough for e-Learning purposes, because identifying which slide is displayed in the frame is also important. A matching technique combined with a presentation file containing the answer information is very useful for identifying slides in a movie frame. We propose two methods for slide identification in this paper. The first is character-based, using the relationship between character codes and their coordinates. The other is image-based, using normalized correlation and dynamic programming. We used actual movies to evaluate the performance of these methods, both independently and in combination, and the experimental results show that they are very effective in identifying slides in lecture movies.
Talking about documents: revealing a missing link to multimedia meeting archives
In the context of multimedia meeting recording and analysis, we introduce a new kind of multimedia alignment, which aims at reunifying documents with all kinds of temporal media. The alignment proposed in this article uses the similarities between the documents' content and the speech transcript's content to provide temporal indexes for printable documents. Several document content alignment strategies are discussed in this article and evaluated at various levels of granularity.
Block adaptive binarization of business card images in PDA using modified quadratic filter
Ki Taeg Shin, Ick Hoon Jang, Nam Chul Kim, et al.
In this paper, we propose a block adaptive binarization (BAB) method using a modified quadratic filter (MQF) to binarize business card images acquired under poor conditions by personal digital assistant (PDA) cameras. In the proposed method, a business card image is first partitioned into 8×8 blocks, which are then classified into character blocks (CBs) and background blocks (BBs) for locally adaptive processing. Each CB is windowed with a 24×24 rectangular window centered on the CB, and the windowed block is enhanced by the preprocessing filter MQF, in which the threshold-selection scheme of the QF is modified. The 8×8 center block of the enhanced block is then binarized with the threshold. A binary image is obtained by tiling each binarized block in its original position. Experimental results show that the quality of binary images obtained by the proposed method is much better than that of conventional global binarization (GB) using the QF. In addition, the proposed method yields about a 43% improvement in character recognition rate over GB using the QF.
Handwriting Recognition
A nonparametric classifier for unsegmented text
George Nagy, Ashutosh Joshi, Mukkai Krishnamoorthy, et al.
Symbolic Indirect Correlation (SIC) is a new classification method for unsegmented patterns. SIC requires two levels of comparisons. First, the feature sequences from an unknown query signal and a known multi-pattern reference signal are matched. Then, the order of the matched features is compared with the order of matches between every lexicon symbol-string and the reference string in the lexical domain. The query is classified according to the best matching lexicon string in the second comparison. Accuracy increases as classified feature-and-symbol strings are added to the reference string.
Online handwriting recognition in a form-filling task: evaluating the impact of context-awareness
Giovanni Seni, Kimberly Rice, Eddy Mayoraz
Guiding a recognition task using a language model is commonly accepted as having a positive effect on accuracy and is routinely used in automated speech processing. This paper presents a quantitative study of the impact of word models on online handwriting recognition applied to form-filling tasks on handheld devices. Two types of word models are considered: a dictionary, typically from a few thousand up to a hundred thousand words; and a grammar or regular expression generating a language several orders of magnitude larger than the dictionary. We report that the improvement in accuracy obtained from a grammar is comparable to the gain provided by a dictionary. Finally, the impact of the word models on user acceptance of online handwriting recognition in a specific form-filling application is presented.
Group discriminatory power of handwritten characters
Catalin I. Tomai, Devika M. Kshirsagar, Sargur N. Srihari
Using handwritten characters, we address two questions: (i) what is the group identification performance of different alphabets (upper and lower case), and (ii) what are the best characters for the verification task (same-writer/different-writer discrimination), given demographic information about the writer such as ethnicity, age, or sex? The Bhattacharyya distance is used to rank characters by their group discriminatory power, and a k-NN classifier is used to measure the individual performance of characters for group identification. For the tasks of identifying the correct gender, age, ethnicity, or handedness, the accumulated performance of characters varies between 65% and 85%.
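For two groups whose feature values are modeled as univariate Gaussians, the Bhattacharyya distance used for the ranking has a closed form; a minimal sketch (the Gaussian model and parameters here are a generic illustration, not the paper's exact feature setup):

```python
import math

def bhattacharyya_gauss(m1, v1, m2, v2):
    """Bhattacharyya distance between two univariate Gaussians (mean, variance)."""
    return (0.25 * math.log(0.25 * (v1 / v2 + v2 / v1 + 2))
            + 0.25 * (m1 - m2) ** 2 / (v1 + v2))
```

Identical group distributions give distance 0; the further apart (or more differently spread) the two groups' feature distributions are, the larger the distance, so ranking characters by this value ranks them by group-separating power.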
Multilingual OCR
Word level script identification for scanned document images
In this paper, we compare the performance of three classifiers used to identify the script of words in scanned document images. In both training and testing, a Gabor filter is applied and 16 channels of features are extracted. Three classifiers (Support Vector Machines (SVM), Gaussian Mixture Model (GMM), and k-Nearest-Neighbor (k-NN)) are used to identify different scripts at the word level (glyphs separated by white space). These three classifiers are applied to a variety of bilingual dictionaries and their performance is compared. Experimental results show the capability of the Gabor filter to capture script features and the effectiveness of these three classifiers for script identification at the word level.
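The simplest of the three classifiers, k-NN, votes among the nearest labeled feature vectors. A minimal sketch (the feature vectors and script labels below are made-up illustrations, not the paper's 16-channel Gabor features):

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """k-NN: majority vote among the k nearest labeled feature vectors."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# hypothetical 2-D word features labeled with their script
train = [((0, 0), "latin"), ((0, 1), "latin"),
         ((5, 5), "arabic"), ((5, 6), "arabic"), ((6, 5), "arabic")]
```

In the paper's setting the training items would be 16-dimensional Gabor feature vectors per word image; SVM and GMM replace the voting rule with a margin-based or probabilistic decision over the same features.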
Comprehensive printed Tibetan/English mixed text segmentation method
Text segmentation plays a crucial role in a text recognition system. A comprehensive method is proposed to solve Tibetan/English text segmentation. Two algorithms, based on Tibetan inter-syllabic tshegs and a discriminant function respectively, are presented to perform skew detection before text line separation. Then a dynamic, recursive character segmentation algorithm integrating multi-level information is developed. Encouraging experimental results on a large-scale Tibetan/English mixed text set show the validity of the proposed method.
A general framework for multicharacter segmentation and its application in recognizing multilingual Asian documents
In this paper we propose a general framework for character segmentation in complex multilingual documents, an endeavor to combine the traditionally separate segmentation and recognition processes into a cooperative system. The framework contains three basic steps: dissection, local optimization, and global optimization, which are designed to fuse various properties of the segmentation hypotheses hierarchically into a composite evaluation that decides the final recognition results. Experimental results show that this framework is general enough to be applied to a variety of documents. A sample system based on this framework for recognizing Chinese, Japanese, and Korean documents is described, and its experimental performance is reported.
New statistical method for multifont printed Tibetan/English OCR
A Tibetan optical character recognition (OCR) system plays a crucial role in Chinese multi-language information processing. This paper proposes a new statistical method for multi-font printed Tibetan/English character recognition. A robust Tibetan character recognition kernel is carefully designed. Combined with previous English character recognition techniques, it reaches a recognition accuracy of 99.67% on a test set containing 206,100 multi-font printed characters, which shows the validity of the proposed method.
Design and development of an ancient Chinese document recognition system
The digitization of ancient Chinese documents presents new challenges to the OCR (Optical Character Recognition) research field due to the large character set of ancient Chinese, variant font types, and versatile document layout styles, as these documents are historical reflections of thousands of years of Chinese civilization. After analyzing the general characteristics of ancient Chinese documents, we present a solution for recognizing ancient Chinese documents with regular font types and layout styles. Building on previous work on multilingual OCR in the TH-OCR system, we focus on the design and development of two key technologies: character recognition and page segmentation. Experimental results show that the developed character recognition kernel, covering 19,635 Chinese characters, outperforms our original traditional Chinese recognition kernel; a benchmark test on printed ancient Chinese books shows that the proposed system is effective for regular ancient Chinese documents.
System for Oriya handwritten numeral recognition
N. Tripathy, M. Panda, U. Pal
To accommodate the variability in the writing styles of different individuals, a scheme for off-line isolated handwritten Oriya numeral recognition is presented here. Oriya is a popular script in India. The scheme is mainly based on features obtained from the water reservoir concept as well as topological and structural features of the numerals. Reservoir-based features such as the number of reservoirs, their sizes, heights, and positions, and the water flow direction; topological features such as the number of loops and the positions of their centres of gravity; the ratio of reservoir/loop height to numeral height; profile-based features; and features based on jump discontinuities are among the features used in the recognition scheme. The proposed scheme was tested on 3550 samples collected from individuals of various backgrounds, and we obtained an overall recognition accuracy of about 97.74%.
Document Image Analysis
Using mathematical morphology for document skew estimation
We propose a concise definition of the skew angle of a document based on mathematical morphology. This definition has the advantage of being applicable to both binary and grey-scale images. We then discuss various possible implementations of this definition and show that the results we obtain are comparable to those of other existing algorithms.
Adaptive inverse halftoning for scanned document images through multiresolution and multiscale analysis
This paper describes an efficient algorithm for inverse halftoning of scanned color document images to resolve problems with interference patterns such as moiré and graininess when the images are displayed or printed out. The algorithm is suitable for software implementation and useful for high-quality printing or display of scanned document images delivered via networks from unknown scanners. A multi-resolution approach is used to achieve practical processing speed under software implementation. Through data-driven, adaptive, multi-scale processing, the algorithm can cope with a variety of input devices and requires no information on the halftoning method or its properties (such as coefficients in dither matrices, filter coefficients of error diffusion kernels, screen angles, or dot frequencies). The effectiveness of the new algorithm is demonstrated through real examples of scanned color document images, as well as quantitative evaluations with synthetic data.
Automatic content extraction of filled-form images based on clustering component block projection vectors
Hanchuan Peng, Xiaofeng He, Fuhui Long
Automatic understanding of document images is a hard problem. Here we consider a sub-problem: automatically extracting content from filled-form images. Without pre-selected templates or sophisticated structural/semantic analysis, we propose a novel approach based on clustering component-block projection vectors. By combining spectral clustering and minimal-spanning-tree clustering, we generate highly accurate clusters, from which adaptive templates are constructed to extract the filled-in content. Our experiments show this approach is effective on a set of 1040 US IRS tax form images belonging to 208 types.
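The minimal-spanning-tree half of such a clustering step can be sketched as single-link clustering: build an MST over the feature vectors and cut the longest edges to leave k connected components. A generic illustration under that assumption, not the authors' exact procedure:

```python
def mst_clusters(points, k):
    """Cluster points into k groups: build an MST (Prim), cut the k-1 longest edges."""
    n = len(points)

    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    # Prim's algorithm: grow the tree from point 0
    in_tree = {0}
    edges = []
    best = {i: (dist(0, i), 0) for i in range(1, n)}
    while len(in_tree) < n:
        i = min(best, key=lambda x: best[x][0])
        d, parent_node = best.pop(i)
        in_tree.add(i)
        edges.append((d, parent_node, i))
        for j in best:
            dj = dist(i, j)
            if dj < best[j][0]:
                best[j] = (dj, i)

    # keep the n-k shortest edges (cutting the k-1 longest), then union components
    edges.sort()
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, a, b in edges[: n - k]:
        parent[find(a)] = find(b)

    # relabel components as 0..k-1 in order of first appearance
    remap = {}
    return [remap.setdefault(find(i), len(remap)) for i in range(n)]
```

For example, `mst_clusters([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` separates the two tight pairs into clusters `[0, 0, 1, 1]`. In the paper, the vectors being clustered would be the component-block projection vectors, and the MST result is combined with spectral clustering before templates are built.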