Proceedings Paper

Ensemble methods with simple features for document zone classification

Paper Abstract

Document layout analysis is fundamental to document image understanding and information retrieval. It requires identifying the blocks extracted from a document image through feature extraction and block classification. In this paper, we focus on classifying the extracted blocks into five classes: text (machine printed), handwriting, graphics, images, and noise. We propose a new set of features for efficient classification of these blocks. We present a comparative evaluation of three ensemble-based classification algorithms (boosting, bagging, and combined model trees) alongside other known learning algorithms. Experimental results are reported for a set of 36,503 zones extracted from 416 document images randomly selected from the tobacco legacy document collection. The results verify the robustness and effectiveness of the proposed feature set in comparison with the commonly used Ocropus recognition features. When the proposed features are used in conjunction with the Ocropus feature set, the performance of the block classification system improves further, reaching a classification accuracy of 99.21%.
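
As a rough illustration of one of the ensemble techniques named in the abstract, the following is a minimal sketch of bagging (bootstrap aggregation) with majority voting. It is not the authors' implementation: the two features (black-pixel density and mean connected-component height), the toy zone data, the decision-stump base learner, and all function names are assumptions made for this example, and only two of the five zone classes are shown.

```python
import random
from collections import Counter

# Hypothetical toy zones: (black-pixel density, mean component height in px), label.
# These features and values are illustrative assumptions, not the paper's feature set.
ZONES = [
    ((0.10, 12.0), "text"), ((0.13, 11.0), "text"), ((0.16, 14.0), "text"),
    ((0.19, 10.0), "text"), ((0.22, 13.0), "text"), ((0.25, 12.5), "text"),
    ((0.60, 150.0), "image"), ((0.66, 210.0), "image"), ((0.72, 180.0), "image"),
    ((0.78, 260.0), "image"), ((0.84, 300.0), "image"), ((0.90, 240.0), "image"),
]


def train_stump(samples):
    """Fit a decision stump (one feature, one threshold) by exhaustive search."""
    best = None
    for f in range(len(samples[0][0])):
        for x, _ in samples:
            t = x[f]
            left = [y for xs, y in samples if xs[f] <= t]
            right = [y for xs, y in samples if xs[f] > t]
            # Each side predicts its majority label; an empty right side reuses the left label.
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best

    def predict(x):
        return left_label if x[f] <= t else right_label

    return predict


def train_bagging(samples, n_models=15, seed=0):
    """Train n_models stumps on bootstrap resamples and combine them by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(samples) items with replacement.
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train_stump(bootstrap))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


if __name__ == "__main__":
    classify = train_bagging(ZONES)
    print(classify((0.05, 11.0)))   # a sparse zone with small components
    print(classify((0.95, 220.0)))  # a dense zone with large components
```

Boosting differs from this sketch in that base learners are trained sequentially on reweighted data rather than independently on bootstrap resamples.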

Paper Details

Date Published: 23 January 2012
PDF: 9 pages
Proc. SPIE 8297, Document Recognition and Retrieval XIX, 829706 (23 January 2012); doi: 10.1117/12.912103
Tayo Obafemi-Ajayi, Univ. of Missouri (United States)
Gady Agam, Illinois Institute of Technology (United States)
Bingqing Xie, Illinois Institute of Technology (United States)

Published in SPIE Proceedings Vol. 8297:
Document Recognition and Retrieval XIX
Christian Viard-Gaudin; Richard Zanibbi, Editors

© SPIE.