
Proceedings Paper

Shape codebook based handwritten and machine printed text zone extraction
Author(s): Jayant Kumar; Rohit Prasad; Huaigu Cao; Wael Abd-Almageed; David Doermann; Premkumar Natarajan

Paper Abstract

In this paper, we present a novel method for extracting handwritten and printed text zones from noisy document images with mixed content. We use Triple-Adjacent-Segment (TAS) based features, which encode local shape characteristics of text in a consistent manner. We first construct two codebooks of the shape features, extracted from a set of handwritten and a set of printed text documents, respectively. We then compute the normalized histogram of codewords for each segmented zone and use it to train a Support Vector Machine (SVM) classifier. The codebook-based approach is robust to background noise in the image, and TAS features are invariant to translation, scale, and rotation of text. In experiments, we show that a pixel-weighted zone classification accuracy of 98% can be achieved for noisy Arabic documents. Further, we demonstrate the effectiveness of our method for document page classification and show that high precision can be achieved for the detection of machine-printed documents. The proposed method is robust to the size of zones, which may contain text content at the line or paragraph level.
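
The histogram-of-codewords step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: TAS feature extraction and SVM training are omitted, the features are stand-in numeric vectors, and the function names, the Euclidean nearest-codeword assignment, and the concatenation of the two per-codebook histograms into one zone descriptor are all assumptions made for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_codeword(feature, codebook):
    """Index of the codeword closest to `feature` (assumed Euclidean metric)."""
    return min(range(len(codebook)), key=lambda i: dist(feature, codebook[i]))


def normalized_histogram(features, codebook):
    """Count nearest-codeword assignments, then normalize so the bins sum to 1."""
    counts = [0] * len(codebook)
    for feature in features:
        counts[nearest_codeword(feature, codebook)] += 1
    total = sum(counts)
    if total == 0:
        # A zone with no features yields an all-zero histogram.
        return [0.0] * len(codebook)
    return [c / total for c in counts]


def zone_descriptor(features, handwritten_codebook, printed_codebook):
    """Hypothetical zone descriptor: the two histograms placed side by side."""
    return (normalized_histogram(features, handwritten_codebook)
            + normalized_histogram(features, printed_codebook))


# Toy 2-D stand-ins for shape features and codewords (not real TAS features).
handwritten_codebook = [(0.0, 0.0), (1.0, 1.0)]
printed_codebook = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
zone_features = [(0.1, 0.1), (0.9, 0.8), (0.2, 0.0), (0.1, 0.7)]

descriptor = zone_descriptor(zone_features, handwritten_codebook, printed_codebook)
print(descriptor)  # [0.75, 0.25, 0.25, 0.0, 0.75]
```

Because each histogram is normalized by the number of features in the zone, the descriptor has the same length and scale whether the zone holds a single line or a whole paragraph. A vector of this kind is what would then be passed to a classifier such as an SVM.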

Paper Details

Date Published: 24 January 2011
PDF: 8 pages
Proc. SPIE 7874, Document Recognition and Retrieval XVIII, 787406 (24 January 2011); doi: 10.1117/12.876725
Author Affiliations
Jayant Kumar, Univ. of Maryland, College Park (United States)
Rohit Prasad, Raytheon BBN Technologies (United States)
Huaigu Cao, Raytheon BBN Technologies (United States)
Wael Abd-Almageed, Univ. of Maryland, College Park (United States)
David Doermann, Univ. of Maryland, College Park (United States)
Premkumar Natarajan, Raytheon BBN Technologies (United States)


Published in SPIE Proceedings Vol. 7874:
Document Recognition and Retrieval XVIII
Editor(s): Gady Agam; Christian Viard-Gaudin
