Proceedings Volume 6500

Document Recognition and Retrieval XIV


View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 28 January 2007
Contents: 12 Sessions, 33 Papers, 0 Presentations
Conference: Electronic Imaging 2007
Volume Number: 6500

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 6500
  • Invited Paper I
  • Classifiers
  • OCR
  • Image Processing
  • Handwriting Recognition
  • Digital Publishing Special Session I
  • Digital Publishing Special Session II
  • Information Extraction and Retrieval I
  • Invited Paper II
  • Information Extraction and Retrieval II
  • Segmentation
Front Matter: Volume 6500
This PDF file contains the front matter associated with SPIE Proceedings Volume 6500, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and the Conference Committee listing.
Invited Paper I
Industrial OCR approaches: architecture, algorithms, and adaptation techniques
Optical Character Recognition is much more than character classification. An industrial OCR application combines algorithms studied in detail by different researchers in the areas of image processing, pattern recognition, machine learning, language analysis, document understanding, data mining, and other artificial intelligence domains. There is no single perfect algorithm for any of the OCR problems, so modern systems try to adapt themselves to the actual features of the image or document to be recognized. This paper describes the architecture of a modern OCR system with an emphasis on this adaptation process.
Classifiers
Scale-controlled area difference shape descriptor
Mingqiang Yang, Kidiyo Kpalma, Joseph Ronsin
In this paper, we propose a shape representation and description well adapted to pattern recognition, particularly in the context of affine shape transformations. The proposed approach operates on a single closed contour. The parameterized contour is convolved with a Gaussian kernel. The curvature is calculated to determine the inflexion points, and the most significant ones are kept using a threshold defined by observing the segment length between two curvature zero-crossing points. This filtered and simplified shape is then registered with the original one. Finally, we separately calculate the areas between the pairs of segments corresponding to these two scale-space representations. The proposed descriptor is a vector whose components are derived from each segment and its corresponding area. This article develops the following new concepts: 1) comparing the same segment under different scale representations; 2) choosing the appropriate scales by applying a threshold to the shape's shortest segment; 3) proposing the algorithm and the conditions for merging and removing short segments. An experimental evaluation of robustness under affine transformations is presented on a shape database.
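To make the curvature step concrete, the following minimal sketch (not the authors' code) smooths a closed contour with a Gaussian kernel and locates the curvature zero-crossings; the contour sampling and the `sigma` value are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def curvature_zero_crossings(x, y, sigma=3.0):
    """Smooth a closed contour and locate inflexion points (curvature zero-crossings)."""
    # Convolve the parameterized contour with a Gaussian; 'wrap' keeps the curve closed.
    xs = gaussian_filter1d(np.asarray(x, float), sigma, mode="wrap")
    ys = gaussian_filter1d(np.asarray(y, float), sigma, mode="wrap")
    # First and second derivatives along the arc parameter.
    dx, dy = np.gradient(xs), np.gradient(ys)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    # Curvature of a planar parametric curve.
    kappa = (dx * ddy - dy * ddx) / (dx ** 2 + dy ** 2) ** 1.5
    # Indices where the curvature changes sign.
    zero_crossings = np.where(np.diff(np.sign(kappa)) != 0)[0]
    return xs, ys, kappa, zero_crossings
```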
Frequency coding: an effective method for combining dichotomizers
Srinivas Andra, George Nagy, Cheng-Lin Liu
Binary classifiers (dichotomizers) are combined for multi-class classification. Each region formed by the pairwise decision boundaries is assigned to the class with the highest frequency of training samples in that region. With more samples and classifiers, the frequencies converge to increasingly accurate non-parametric estimates of the posterior class probabilities in the vicinity of the decision boundaries. The method is applicable to non-parametric discrete or continuous class distributions dichotomized by either linear or non-linear classifiers (like support vector machines). We present a formal description of the method and place it in context with related methods. We present experimental results on machine-printed and handwritten digits that demonstrate the viability of frequency coding in a classification task.
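A minimal sketch of the region-frequency idea follows (an illustrative reconstruction, not the authors' implementation); `dichotomizers` is assumed to be a list of callables returning a signed score.

```python
from collections import Counter, defaultdict

def learn_region_labels(dichotomizers, X_train, y_train):
    """Assign each region (sign pattern of the dichotomizers) to the class
    with the highest frequency of training samples falling in that region."""
    region_counts = defaultdict(Counter)
    for x, y in zip(X_train, y_train):
        region = tuple(clf(x) >= 0 for clf in dichotomizers)
        region_counts[region][y] += 1
    return {r: c.most_common(1)[0][0] for r, c in region_counts.items()}

def classify(dichotomizers, region_labels, x, fallback=None):
    region = tuple(clf(x) >= 0 for clf in dichotomizers)
    return region_labels.get(region, fallback)  # unseen region -> fallback class
```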
A multi-evidence, multi-engine OCR system
Ilya Zavorin, Eugene Borovikov, Anna Borovikov, et al.
Although modern OCR technology is capable of handling a wide variety of document images, there is no single OCR engine that performs equally well on all documents for a given language script. Naturally, each OCR engine has its strengths and weaknesses, so different engines differ in accuracy across documents and in the errors they make on the same document image. While the idea of using multiple OCR engines to boost output accuracy is not new, most existing systems do not go beyond variations on majority voting. While this approach may work well in many cases, it has limitations, especially when the OCR technology used to process a given script has not yet fully matured. Our goal is to develop a system called MEMOE (for "Multi-Evidence Multi-OCR-Engine") that combines, in an optimal or near-optimal way, the output streams of one or more OCR engines together with various types of evidence extracted from these streams as well as from the original document images, to produce output of higher quality than that of the individual OCR engines or of majority voting applied to multiple OCR output streams. Furthermore, we aim to improve the accuracy of OCR output on images whose otherwise low accuracy would significantly impact downstream processing. The MEMOE system functions as an OCR engine, taking document images and some configuration parameters as input and producing a single output text stream. In this paper, we describe the design of the system and the various evidence types, and how they are incorporated into MEMOE in the form of filters. Results of initial tests on two corpora of Arabic documents show that, even in its initial configuration, the system is superior to a voting algorithm and that further improvement may be achieved by incorporating additional evidence types into the system.
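For reference, the majority-voting baseline that MEMOE is compared against can be sketched as below (character-level voting over pre-aligned output streams; the alignment step itself is assumed).

```python
from collections import Counter

def majority_vote(streams):
    """Character-level majority vote over pre-aligned OCR output streams
    (all strings assumed to have the same length after alignment)."""
    assert len({len(s) for s in streams}) == 1, "streams must be aligned"
    voted = []
    for chars in zip(*streams):
        winner, _ = Counter(chars).most_common(1)[0]
        voted.append(winner)
    return "".join(voted)

# e.g. majority_vote(["recognitlon", "recognition", "rec0gnition"]) == "recognition"
```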
OCR
Interaction for style-constrained OCR
Sriharsha Veeramachaneni, George Nagy
The error rate can be considerably reduced on a style-consistent document if its style is identified and the right style-specific classifier is used. Since in some applications both machines and humans have difficulty in identifying the style, we propose a strategy to improve the accuracy of style-constrained classification by enlisting the human operator to identify the labels of some characters selected by the machine. We present an algorithm to select the set of characters that is likely to reduce the error rate on unlabeled characters by utilizing the labels to reclassify the remaining characters. We demonstrate the efficacy of our algorithm on simulated data.
Reading text in consumer digital photographs
We present a distributed system to extract text contained in natural scenes within consumer photographs. The objective is to automatically annotate pictures in order to make consumer photo sets searchable based on image content. The system is designed to process a large volume of photos by quickly isolating candidate text regions and successively cascading them through a series of text recognition engines, which jointly decide whether or not a region contains text that is readable by OCR. In addition, a dedicated rejection engine is built on top of each text recognizer to adapt its confidence measure to the specifics of the task. The resulting system achieves a very high text retrieval rate and data throughput with a very low false detection rate.
Adding contextual information to improve character recognition on the Archimedes Palimpsest
The objective of the character recognition effort for the Archimedes Palimpsest is to provide a tool that allows scholars of ancient Greek mathematics to retrieve as much information as possible from the remaining degraded text. With this in mind, the current pattern recognition system does not output a single classification decision, as in typical target detection problems, but has been designed to provide intermediate results that allow the user to apply his or her own decisions (or evidence) to arrive at a conclusion. To achieve this result, a probabilistic network has been incorporated into our previous recognition system, which was based primarily on spatial correlation techniques. This paper reports on the revised tool and its recent success in the transcription process.
OCR result optimization based on pattern matching
OCR post-processing is a bottleneck in document image processing systems. Proofreading is necessary because the current recognition rate is not sufficient for publishing. The OCR system labels every recognition result as confident or unconfident. People only need to check unconfident characters, since the error rate of confident characters is low enough for publishing. However, the current algorithm marks too many characters as unconfident, so optimization of the OCR results is required. In this paper we propose an algorithm based on pattern matching to decrease the number of unconfident results. If an unconfident character matches a confident character well, its label can be changed to confident. Pattern matching uses the original character images, so it can reduce the problems caused by image normalization and scanning noise. We introduce WXOR, WAN, and four-corner based pattern matching to improve the matching, and confidence analysis to reduce errors among similar characters. Experimental results show that our algorithm achieves improvements of 54.18% on the first image set, which contains 102,417 Chinese characters, and 49.85% on the second image set, which contains 53,778 Chinese characters.
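The basic matching step can be sketched as follows (plain XOR dissimilarity only; the WXOR, WAN, and four-corner variants add weighting and structural codes, and the `threshold` is an illustrative assumption).

```python
import numpy as np

def xor_dissimilarity(a, b):
    """Fraction of differing pixels between two binarized glyph images of equal shape."""
    return np.logical_xor(a, b).mean()

def promote_unconfident(unconfident, confident, threshold=0.05):
    """Relabel an unconfident OCR result as confident if its glyph image closely
    matches a confidently recognized image of the same character."""
    promoted = []
    for img_u, char_u in unconfident:
        for img_c, char_c in confident:
            if char_u == char_c and img_u.shape == img_c.shape \
                    and xor_dissimilarity(img_u, img_c) < threshold:
                promoted.append((img_u, char_u))
                break
    return promoted
```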
Image Processing
Shape from parallel geodesics for distortion correction of digital camera document images
Katsuhito Fujimoto, Jun Sun, Hiroaki Takebe, et al.
Distortion correction methods for digital camera images of documents from thick volumes or curved pages are important for camera-based document recognition. In this paper we propose a novel distortion correction method for digital camera document images based on "shape from parallel geodesics." The method exploits the following observations: parallel lines corresponding to character strings or ruled table lines on the flattened page become parallel geodesics on the curved paper surface, and a smoothly curved page can be modeled by a ruled surface, i.e., a sweep surface of rulings. The projected geodesics and rulings appear in the input image under perspective transformation. The presented method extracts the projected geodesics, estimates the projected rulings in the input image, estimates the ruled surface that models the curved page, and generates the corrected image, in this order. The projected rulings are estimated from a condition derived solely from the parallelism of the geodesics, without requiring equal spacing. The method estimates the ruled surface model directly through numerical differentiation, integration, and matrix inversion, without any iterative calculation. We also report experiments that show the effectiveness of the proposed method.
Multispectral pattern recognition applied to x-ray fluorescence images of the Archimedes Palimpsest
The Archimedes Palimpsest is one of the most significant texts in the history of science. Much of the text has been read using images of reflected visible light and visible light produced by ultraviolet fluorescence. However, these techniques do not perform well on the four pages of the manuscript that are obscured by forged icons that were painted over these pages during the first half of the 20th century. X-ray fluorescence images of one of these pages have been processed using spectral pattern recognition techniques developed for environmental remote sensing to recover the original texts beneath the paint.
Degraded document image enhancement
G. Agam, G. Bal, G. Frieder, et al.
Poor quality documents are obtained in various situations such as historical document collections, legal archives, security investigations, and documents found in clandestine locations. Such documents are often scanned for automated analysis, further processing, and archiving. Due to the nature of such documents, degraded document images are often hard to read, have low contrast, and are corrupted by various artifacts. We describe a novel approach for the enhancement of such documents based on probabilistic models which increases the contrast, and thus, readability of such documents under various degradations. The enhancement produced by the proposed approach can be viewed under different viewing conditions if desired. The proposed approach was evaluated qualitatively and compared to standard enhancement techniques on a subset of historical documents obtained from the Yad Vashem Holocaust museum. In addition, quantitative performance was evaluated based on synthetically generated data corrupted under various degradation models. Preliminary results demonstrate the effectiveness of the proposed approach.
Handwriting Recognition
Curvelets based feature extraction of handwritten shapes for ancient manuscripts classification
Guillaume Joutel, Véronique Eglin, Stéphane Bres, et al.
The aim of this work is to provide a suitable assistance tool for palaeographers and historians to help them in their intuitive and empirical work of identifying writing styles (for medieval handwritings) and authenticating writers (for humanistic manuscripts). We propose a global approach to writer classification based on curvelet features related to two discriminative shape properties, curvature and orientation. These features reveal structural and directional micro-shapes as well as concavity, capturing the finest variations in the contour. The curvelet-based analysis leads to the construction of a compact log-polar signature for each writing. The relevance of the signature is quantified with a CBIR (content-based image retrieval) system that compares query images against candidate database images. The main experimental results are very promising, showing 78% good retrieval (precision) on the Middle Ages database and 89% on the humanistic database.
Interactive training for handwriting recognition in historical document collections
Douglas J. Kennard, William A. Barrett
We present a method of interactive training for handwriting recognition in collections of documents. As the user transcribes (labels) the words in the training set, words are automatically skipped if they appear to match words that are already transcribed. By reducing the amount of redundant training, better coverage of the data is achieved, resulting in more accurate recognition. Using word-level features for training and recognition in a collection of George Washington's manuscripts, the recognition ratio is approximately 2%-8% higher after training with our interactive method than after training the same number of words sequentially. Using our approach, less training is required to achieve an equivalent recognition ratio. A slight improvement in recognition ratio is also observed when using our method on a second data set, which consists of several pages from a diary written by Jennie Leavitt Smith.
Online handwritten mathematical expression recognition
Hakan Büyükbayrak, Berrin Yanikoglu, Aytül Erçil
We describe a system for recognizing online, handwritten mathematical expressions. The system is designed with a user interface for writing scientific articles, supporting the recognition of basic mathematical expressions as well as integrals, summations, matrices, etc. A feed-forward neural network recognizes symbols, which are assumed to be single-stroke, and a recursive algorithm parses the expression by combining the neural network output with the structure of the expression. Preliminary results show that writer-dependent recognition rates are very high (99.8%) while writer-independent symbol recognition rates are lower (75%). The interface associated with the proposed system integrates the built-in recognition capabilities of the Microsoft Tablet PC API for recognizing textual input and supports conversion of hand-drawn figures into PNG format. This enables the user to enter text and mathematics and to draw figures in a single interface. After recognition, all output is combined into one LaTeX source and compiled into a PDF file.
Recognition of degraded handwritten digits using dynamic Bayesian networks
We investigate in this paper the application of dynamic Bayesian networks (DBNs) to the recognition of handwritten digits. The main idea is to couple two separate HMMs into various architectures. First, a vertical HMM and a horizontal HMM are built observing the evolving streams of image columns and image rows, respectively. Then, two coupled architectures are proposed to model interactions between these two streams and to capture the 2D nature of character images. Experiments performed on the MNIST handwritten digit database show that coupled architectures yield better recognition performance than non-coupled ones. Additional experiments conducted on artificially degraded (broken) characters demonstrate that coupled architectures cope with such degradation better than non-coupled ones and than discriminative methods such as SVMs.
Digital Publishing Special Session I
Google Books: making the public domain universally accessible
Adam Langley, Dan S. Bloomberg
Google Book Search is working with libraries and publishers around the world to digitally scan books. Some of those works are now in the public domain and, in keeping with Google's mission to make all the world's information useful and universally accessible, we wish to allow users to download them all. For users, it is important that the files are as small as possible and of printable quality. This means that a single codec for both text and images is impractical. We use PDF as a container for a mixture of JBIG2 and JPEG2000 images which are composed into a final set of pages. We discuss both the implementation of an open source JBIG2 encoder, which we use to compress text data, and the design of the infrastructure needed to meet the technical, legal and user requirements of serving many scanned works. We also cover the lessons learnt about dealing with different PDF readers and how to write files that work on most of the readers, most of the time.
Pixel and semantic capabilities from an image-object based document representation
Michael Gormish, Kathrin Berkner, Martin Boliek, et al.
This paper reports on novel and traditional pixel and semantic operations using a recently standardized document representation called JPM. The JPM representation uses compressed pixel arrays for all visible elements on a page. Separate data containers called boxes provide the layout and additional semantic information. JPM and related image-based document representation standards were designed to obtain the most rate efficient document compression. The authors, however, use this representation directly for operations other than compression typically performed either on pixel arrays or semantic forms. This paper describes the image representation used in the JPM standard and presents techniques to (1) perform traditional raster-based document analysis on the compressed data, (2) transmit semantically meaningful portions of compressed data between devices, (3) create multiple views from one compressed data stream, and (4) edit high resolution document images with only low resolution proxy images.
Presentation of structured documents without a style sheet
Steven J. Harrington, Elizabeth Wayman
In order to present most XML documents for human consumption, formatting information must be introduced and applied. Formatting is typically done through a style sheet; however, one may wish to view a document without having a style sheet (either because a style sheet does not exist, is unavailable, or is inappropriate for the display device). This paper describes a method for formatting structured documents without a provided style sheet. The idea is to first analyze the document to determine structures and features that might be relevant to style decisions. A transformation can then be constructed to convert the original document to a generic form that captures the semantics to be expressed through formatting and style. In the second stage, styling is applied to the discovered structures and features by applying a predefined style sheet for the generic form. The document instance and, if available, the corresponding schema or DTD can be analyzed in order to construct the transformation. This paper describes the generic form used for formatting and techniques for generating transformations to it.
Content selection based on compositional image quality
Digital publishing workflows often require composition and balance within a document: photographs must be chosen according to the overall layout of the document they will be placed in, i.e., the composition within the photograph should balance the rest of the document layout. This paper presents a novel image retrieval method in which the document where the image is to be inserted is used as the query. The algorithm calculates a balance measure between the document and each of the images in the collection, retrieving the ones with the highest balance scores. The image visual weight map, used in the balance calculation, is successfully approximated by a new image quality map that takes into account sharpness, contrast, and chroma.
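A rough sketch of such a quality map is shown below; the equal weighting of sharpness, contrast, and chroma is an assumption for illustration, not the paper's formulation.

```python
import numpy as np
from scipy import ndimage

def visual_weight_map(rgb):
    """Per-pixel visual-weight map combining sharpness, local contrast, and chroma."""
    rgb = rgb.astype(float)
    gray = rgb.mean(axis=2)
    sharpness = np.abs(ndimage.laplace(gray))                 # edge energy
    contrast = ndimage.generic_filter(gray, np.std, size=9)   # local standard deviation
    chroma = rgb.max(axis=2) - rgb.min(axis=2)                # crude saturation
    normalize = lambda m: (m - m.min()) / (np.ptp(m) + 1e-9)
    return (normalize(sharpness) + normalize(contrast) + normalize(chroma)) / 3.0
```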
Digital Publishing Special Session II
Generic architecture for professional authoring environments to export XML-based formats
Professional authoring environments are used by graphic artists (GAs) during the design phase of any publication type. With the increasing demand for supporting Variable Data Print (VDP) designs, these authoring environments require enhanced capabilities. The recurring challenge is to provide flexible VDP features that can be represented using several VDP-enabling XML-based formats. Considering the different internal structures of the authoring environments, a common platform needs to be developed. The solution must at the same time empower the GA with a rich VDP feature set and generate a range of output formats that drive their respective VDP workflows. We have designed a common architecture to collect the required data from the hosting application and a generic internal representation that enables multiple XML output formats.
Cost-estimating for commercial digital printing
The purpose of this study is to document current cost-estimating practices used in commercial digital printing. A research study was conducted to determine the use of cost-estimating in commercial digital printing companies. This study answers the questions: 1) What methods are currently being used to estimate digital printing? 2) What is the relationship between estimating and pricing digital printing? 3) To what extent, if at all, do digital printers use full-absorption, all-inclusive hourly rates for estimating? Three different digital printing models were identified: 1) Traditional print providers, who supplement their offset presswork with digital printing for short-run color and versioned commercial print; 2) "Low-touch" print providers, who leverage the power of the Internet to streamline business transactions with digital storefronts; 3) Marketing solutions providers, who see printing less as a discrete manufacturing process and more as a component of a complete marketing campaign. Each model approaches estimating differently. Understanding and predicting costs can be extremely beneficial. Establishing a reliable system to estimate those costs can be somewhat challenging though. Unquestionably, cost-estimating digital printing will increase in relevance in the years ahead, as margins tighten and cost knowledge becomes increasingly more critical.
Information Extraction and Retrieval I
A novel approach for nonuniform list fusion
List fusion is a critical problem in information retrieval. Using uniform weights for list fusion ignores the correctness, importance, and individuality of the various detectors in a concrete application. In this paper, we propose a nonuniform, rational, optimized paradigm for TRECVid list fusion, which is expected to faithfully preserve the precision of the outcomes and reach the maximum Average Precision (A.P.). We therefore exhaustively search for the parameter set giving the best A.P. in the space spanned by the feature vectors. To accelerate the fusion of the input score lists, we train our model on the training data set and apply the learnt parameters to fuse new vectors. We take nonuniform rational blending functions into account; the advantage of this fusion is that the problem of weight selection is converted into a problem of parameter selection in the space associated with the nonuniform rational functions. The high precision, multi-resolution, controllable, and stable attributes of rational functions are helpful in parameter selection, and the space for fusion weight selection becomes large. The correctness of our proposal is compared against and verified with average and linear fusion results.
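The objective being maximized can be sketched as follows: a weighted linear fusion of score lists searched exhaustively for the best Average Precision (the paper's rational blending functions are replaced here by a plain weight grid for illustration).

```python
import numpy as np
from itertools import product

def average_precision(scores, labels):
    """Non-interpolated Average Precision for one ranked score list."""
    order = np.argsort(scores)[::-1]
    rel = np.asarray(labels)[order]
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precisions * rel).sum() / max(rel.sum(), 1))

def fuse_and_tune(score_lists, labels, grid=np.linspace(0, 1, 11)):
    """Exhaustive search over fusion weights (normalized to sum to 1) for the best A.P."""
    score_lists = [np.asarray(s, float) for s in score_lists]
    best_w, best_ap = None, -1.0
    for w in product(grid, repeat=len(score_lists)):
        if sum(w) == 0:
            continue
        w = np.array(w) / sum(w)
        fused = sum(wi * s for wi, s in zip(w, score_lists))
        ap = average_precision(fused, labels)
        if ap > best_ap:
            best_w, best_ap = w, ap
    return best_w, best_ap
```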
Identification of comment-on sentences in online biomedical documents using support vector machines
In Cheol Kim, Daniel X. Le, George R. Thoma
MEDLINE(R) is the premier bibliographic online database of the National Library of Medicine, containing approximately 14 million citations and abstracts from over 4,800 biomedical journals. This paper presents an automated method based on support vector machines to identify a "comment-on" list, which is a field in a MEDLINE citation denoting previously published articles commented on by a given article. For comparative study, we also introduce another method based on scoring functions that estimate the significance of each sentence in a given article. Preliminary experiments conducted on HTML-formatted online biomedical documents collected from 24 different journal titles show that the support vector machine with polynomial kernel function performs best in terms of recall and F-measure rates.
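A minimal sketch of the classifier setup, using scikit-learn with a polynomial kernel as in the paper but with illustrative bag-of-words features and toy data, might look like this.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Illustrative training data: sentences labeled 1 if they comment on a previously
# published article, 0 otherwise (the paper's actual features differ).
sentences = ["Comment on: Smith et al., 2005.", "We measured enzyme activity."]
labels = [1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SVC(kernel="poly", degree=2, C=1.0),  # polynomial kernel, as in the paper
)
model.fit(sentences, labels)
print(model.predict(["Comment on: Jones et al., 2006."]))
```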
Combining text clustering and retrieval for corpus adaptation
Application-relevant text data are very useful in various natural language applications. Using them can significantly improve vocabulary selection and language modeling, which are widely employed in automatic speech recognition, intelligent input methods, etc. In some situations, however, relevant data are hard to collect, and the scarcity of application-relevant training text makes such natural language processing difficult. In this paper, using only a small set of application-specific text and combining unsupervised text clustering with text retrieval techniques, the proposed approach finds relevant text in a large unorganized corpus, thereby adapting the training corpus toward the application area of interest. We use the performance of an n-gram statistical language model, trained on the retrieved text and tested on the application-specific text, to evaluate the relevance of the acquired text and thereby validate the effectiveness of our corpus adaptation approach. The language models trained on the ranked text bundles show well-discriminated perplexities on the application-specific text. Preliminary experiments on short-message text and a large unorganized corpus demonstrate the performance of the proposed methods.
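The perplexity criterion can be sketched as below with a simple add-one-smoothed bigram model (the paper's actual language-modeling toolkit and smoothing method are not specified here).

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count unigrams and bigrams from a tokenized training text."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def perplexity(test_tokens, unigrams, bigrams):
    """Bigram perplexity with add-one smoothing; lower means the training text
    is closer to the application-specific test text."""
    vocab = len(unigrams) + 1
    log_prob, n = 0.0, 0
    for w1, w2 in zip(test_tokens, test_tokens[1:]):
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / max(n, 1))

# Retrieved text bundles would be ranked by the perplexity their language models
# yield on the held-out application-specific text.
```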
Invited Paper II
Document recognition serving people with disabilities
Document recognition advances have improved the lives of people with print disabilities, by providing accessible documents. This invited paper provides perspectives on the author's career progression from document recognition professional to social entrepreneur applying this technology to help people with disabilities. Starting with initial thoughts about optical character recognition in college, it continues with the creation of accurate omnifont character recognition that did not require training. It was difficult to make a reading machine for the blind in a commercial setting, which led to the creation of a nonprofit social enterprise to deliver these devices around the world. This network of people with disabilities scanning books drove the creation of Bookshare.org, an online library of scanned books. Looking forward, the needs for improved document recognition technology to further lower the barriers to reading are discussed. Document recognition professionals should be proud of the positive impact their work has had on some of society's most disadvantaged communities.
Information Extraction and Retrieval II
Title extraction and generation from OCR'd documents
Kazem Taghva, Allen Condit, Steve Lumos, et al.
Extraction of metadata from documents is a tedious and expensive process. In general, documents are manually reviewed for structured data such as title, author, date, organization, etc. The purpose of extraction is to build metadata for documents that can be used when formulating structured queries. In many large document repositories, such as the National Library of Medicine (NLM) or university libraries, the extraction task is a daily process that spans decades. Although some automation is used during the extraction process, metadata extraction is generally a manual task. Aside from the cost and labor time, manual processing is error prone and requires many levels of quality control. Recent advances in extraction technology, as reported at the Message Understanding Conference (MUC), are comparable with extraction performed by humans. In addition, many organizations use historical data for lookup to improve the quality of extraction. For the large government document repository we are working with, the task involves extraction of several fields from millions of OCR'd and electronic documents. Since this project is time-sensitive, automatic extraction turns out to be the only viable solution. There are more than a dozen fields associated with each document that require extraction. In this paper, we report on the extraction and generation of the title field.
Content-based document image retrieval in complex document collections
G. Agam, S. Argamon, O. Frieder, et al.
We address the problem of content-based image retrieval in the context of complex document images. Complex documents typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting), and include diagrams, graphics, tables, and other non-textual elements. Large collections of such complex documents are commonly found in legal and security investigations. The indexing and analysis of large document collections is currently limited to textual features based on OCR data and ignores the structural context of the document as well as important non-textual elements such as signatures, logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the inherent complexity of offline handwriting recognition. We address important research issues concerning content-based document image retrieval and describe a prototype we are developing for integrated retrieval and aggregation of diverse information contained in scanned paper documents. Such complex document information processing combines several forms of image processing with textual/linguistic processing to enable effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are developing a test collection containing millions of document images.
Segmentation
A statistical approach to line segmentation in handwritten documents
Manivannan Arivazhagan, Harish Srinivasan, Sargur Srihari
A new technique to segment a handwritten document into distinct lines of text is presented. Line segmentation is the first and most critical pre-processing step for a document recognition/analysis task. The proposed algorithm starts by obtaining an initial set of candidate lines from the piece-wise projection profile of the document. The lines traverse around any obstructing handwritten connected component by associating it with the line above or below. The decision to associate such a component is made by (i) modeling the lines as bivariate Gaussian densities and evaluating the probability of the component under each Gaussian, or (ii) the probability obtained from a distance metric. The proposed method is robust to skewed documents and those with lines running into each other. Experimental results show that on 720 documents (including English, Arabic, and children's handwriting) containing a total of 11,581 lines, 97.31% of the lines were segmented correctly. In an experiment over 200 handwritten images with 78,902 connected components, 98.81% were associated with the correct lines.
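The Gaussian association step can be sketched as follows, assuming each candidate line and the obstructing component are given as arrays of (x, y) pixel coordinates (an illustrative reconstruction, not the authors' code).

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_line_density(points):
    """Model a text line as a bivariate Gaussian over its pixel (x, y) coordinates."""
    mean = points.mean(axis=0)
    cov = np.cov(points.T) + 1e-6 * np.eye(2)  # small regularizer for stability
    return multivariate_normal(mean, cov)

def assign_component(component_points, line_above_points, line_below_points):
    """Associate an obstructing connected component with the line (above or below)
    under which its pixels are more probable."""
    above = fit_line_density(line_above_points).logpdf(component_points).sum()
    below = fit_line_density(line_below_points).logpdf(component_points).sum()
    return "above" if above > below else "below"
```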
Segmentation and labeling of documents using conditional random fields
The paper describes the use of Conditional Random Fields (CRFs), utilizing contextual information, to automatically label extracted segments of scanned documents as machine print, handwriting, and noise. The result of such labeling can serve as an indexing step for a context-based image retrieval system or a biometric signature verification system. A simple region-growing algorithm is first used to segment the document into a number of patches. A label for each segmented patch is then inferred using a CRF model. The model is flexible enough to include signatures as a type of handwriting and isolate them from machine print and noise. The robustness of the model is due to the inherent way CRFs model neighboring spatial dependencies in the labels as well as in the observed data. Maximum pseudo-likelihood estimates for the parameters of the CRF model are learnt using conjugate gradient descent. Inference of labels is done by computing the probability of the labels under the model with Gibbs sampling. Experimental results show that this approach assigns correct labels to 95.75% of the data. The CRF-based model is shown to be superior to neural networks and naive Bayes.
Online medical journal article layout analysis
Jie Zou, Daniel Le, George R. Thoma
We describe a physical and logical layout analysis algorithm, which is applied to segment and label online medical journal articles (regular HTML and PDF-converted HTML files). For these articles, the geometric layout of the Web page is the most important cue for physical layout analysis. The key to physical layout analysis is therefore to render the HTML file in a Web browser, so that the visual information in zones (composed of one or more HTML DOM nodes), especially their relative positions, can be utilized. The recursive X-Y cut algorithm is adopted to construct a hierarchical zone tree structure. In logical layout analysis, both geometric and linguistic features are used. The HTML documents are modeled by a Hidden Markov Model with 16 states, and the Viterbi algorithm is then used to find the optimal label sequence, completing the logical layout analysis.
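The decoding step can be illustrated with a generic log-space Viterbi routine; the 16 states and the geometric/linguistic emission features of the paper are not reproduced here.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit):
    """Most probable state sequence for an HMM, with all inputs in log space.
    log_emit has shape (T, n_states): per-zone emission log-likelihoods."""
    T, n = log_emit.shape
    delta = np.full((T, n), -np.inf)
    back = np.zeros((T, n), dtype=int)
    delta[0] = log_start + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # rows: from-state, cols: to-state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```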
Transcript mapping for handwritten Arabic documents
Handwriting recognition research requires large databases of word images each of which is labeled with the word it contains. Full images scanned in, however, usually contain sentences or paragraphs of writing. The creation of labeled databases of images of isolated words is usually tedious, requiring a person to drag a rectangle around each word in the full image and type in the label. Transcript mapping is the automatic alignment of words in a text file with word locations in the full image. It can ease the creation of databases for research. We propose the first transcript mapping method for handwritten Arabic documents. Our approach is based on Dynamic Time Warping (DTW) and offers two primary algorithmic contributions. First is an extension to DTW that uses true distances when mapping multiple entries from one series to a single entry in the second series. Second is a method to concurrently map elements of a partially aligned third series within the main alignment. Preliminary results are provided.
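A standard DTW alignment between per-word image features and transcript-word features is sketched below; the paper's extensions (true distances for many-to-one mappings and concurrent mapping of a third series) are not reproduced.

```python
import numpy as np

def dtw_align(image_feats, word_feats, dist=lambda a, b: np.linalg.norm(a - b)):
    """Standard DTW between a sequence of word-image features and a sequence of
    transcript-word features; returns the aligned index pairs."""
    n, m = len(image_feats), len(word_feats)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(image_feats[i - 1], word_feats[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack to recover the warping path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]
```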
Document image content inventories
Henry S. Baird, Michael A. Moll, Chang An, et al.
We report an investigation into strategies, algorithms, and software tools for document image content extraction and inventory, that is, the location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc. We have developed automatically trainable methods, adaptable to many kinds of documents represented as bilevel, greylevel, or color images, that offer a wide range of useful tradeoffs of speed versus accuracy using methods for exact and approximate k-Nearest Neighbor classification. We have adopted a policy of classifying each pixel (rather than regions) by content type: we discuss the motivation and engineering implications of this choice. We describe experiments on a wide variety of document-image and content types, and discuss performance in detail in terms of classification speed, per-pixel classification accuracy, per-page inventory accuracy, and subjective quality of page segmentation. These show that even modest per-pixel classification accuracies (of, e.g., 60-70%) support usefully high recall and precision rates (of, e.g., 80-90%) for retrieval queries of document collections seeking pages that contain a given minimum fraction of a certain type of content.
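A minimal sketch of per-pixel k-NN content classification follows, using a small grey-value neighbourhood as the feature vector; the actual features and the exact/approximate k-NN structures of the paper are not reproduced.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def pixel_features(gray, half=2):
    """Feature vector per pixel: the flattened (2*half+1)^2 grey-level neighbourhood."""
    padded = np.pad(gray, half, mode="edge")
    h, w = gray.shape
    feats = [padded[y:y + 2 * half + 1, x:x + 2 * half + 1].ravel()
             for y in range(h) for x in range(w)]
    return np.array(feats)

def classify_pixels(train_gray, train_labels, test_gray, k=5):
    """Label every pixel of test_gray by k-NN against a per-pixel-labeled training page
    (labels such as handwriting, machine print, photograph, blank)."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(pixel_features(train_gray), train_labels.ravel())
    return knn.predict(pixel_features(test_gray)).reshape(test_gray.shape)
```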