Proceedings Paper

Classification of document page images based on visual similarity of layout structures
Author(s): Christian K. Shin; David Scott Doermann
Paper Abstract

Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify a document's type in the absence of domain-specific models. A document type or genre can be defined by the user based primarily on layout structure. Our approach classifies documents by the 'visual similarity' of their layout structure, using a supervised classifier built from examples of each class. We use image features such as the percentages of text and non-text (graphics, image, table, and ruling) content regions, column structures, variations in the point size of fonts, the density of content area, and various statistics on features of connected components, all of which can be derived from class samples without class knowledge. To obtain class labels for training samples, we conducted a user relevance test in which subjects ranked UW-I document images with respect to 12 representative images. We implemented our classification scheme using OC1, a decision tree classifier, and report our findings.
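
As an illustration of the kind of layout features the abstract lists, the following minimal Python sketch computes the percentage of text and non-text content area and the content density of a page from a list of labeled regions. It is not the authors' implementation: the `Region` type, the `layout_features` function, the feature names, and the assumption that regions are non-overlapping axis-aligned rectangles are all hypothetical choices made for this example. The resulting feature values could then be passed to a decision tree learner such as OC1.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    """A labeled rectangular content region on a page (hypothetical type)."""
    kind: str    # "text", "graphics", "image", "table", or "ruling"
    x: int
    y: int
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


def layout_features(regions: List[Region],
                    page_width: int,
                    page_height: int) -> Dict[str, float]:
    """Compute three simple layout features for one page.

    Assumes the regions do not overlap, so their areas can be summed.
    """
    page_area = page_width * page_height
    content_area = sum(r.area for r in regions)
    text_area = sum(r.area for r in regions if r.kind == "text")

    # A blank page has no content regions; report zeros rather than divide by 0.
    if content_area == 0:
        return {"text_pct": 0.0, "non_text_pct": 0.0, "density": 0.0}

    return {
        # Share of the content area occupied by text regions, in percent.
        "text_pct": 100.0 * text_area / content_area,
        # Share occupied by graphics, image, table, and ruling regions.
        "non_text_pct": 100.0 * (content_area - text_area) / content_area,
        # Fraction of the whole page covered by content regions.
        "density": content_area / page_area,
    }


# Example page: a text block above an image on a 100 x 100 page.
page = [
    Region("text", 0, 0, 100, 60),
    Region("image", 0, 60, 100, 20),
]
features = layout_features(page, page_width=100, page_height=100)
```

Here `features` describes a page whose content is three-quarters text and covers most of the page area. Column structure, font size variation, and connected-component statistics would need additional inputs and are left out of this sketch.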

Paper Details

Date Published: 22 December 1999
PDF: 9 pages
Proc. SPIE 3967, Document Recognition and Retrieval VII (22 December 1999); doi: 10.1117/12.373493
Christian K. Shin, Univ. of Maryland/College Park (United States)
David Scott Doermann, Univ. of Maryland/College Park (United States)

Published in SPIE Proceedings Vol. 3967:
Document Recognition and Retrieval VII
Daniel P. Lopresti; Jiangying Zhou, Editor(s)
