
Proceedings Paper

Document image content inventories
Author(s): Henry S. Baird; Michael A. Moll; Chang An; Matthew R Casey

Paper Abstract

We report an investigation into strategies, algorithms, and software tools for document image content extraction and inventory, that is, the location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc. We have developed automatically trainable methods, adaptable to many kinds of documents represented as bilevel, greylevel, or color images, that offer a wide range of useful tradeoffs of speed versus accuracy through exact and approximate k-Nearest Neighbor classification. We have adopted a policy of classifying each pixel (rather than each region) by content type; we discuss the motivation and engineering implications of this choice. We describe experiments on a wide variety of document-image and content types, and discuss performance in detail in terms of classification speed, per-pixel classification accuracy, per-page inventory accuracy, and subjective quality of page segmentation. These experiments show that even modest per-pixel classification accuracies (e.g., 60-70%) support usefully high recall and precision rates (e.g., 80-90%) for retrieval queries over document collections that seek pages containing a given minimum fraction of a certain type of content.
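
To make the idea of a per-page inventory and a minimum-fraction retrieval query concrete, the following is a minimal sketch, not the authors' implementation. It assumes each page has already been reduced to a flat list of per-pixel content labels by some classifier (the k-Nearest Neighbor step itself is not shown). The function names `page_inventory` and `pages_matching`, the label strings, and the toy pages are all hypothetical and chosen for illustration only.

```python
from collections import Counter


def page_inventory(pixel_labels):
    """Return the fraction of pixels assigned to each content type."""
    counts = Counter(pixel_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def pages_matching(inventories, content_type, min_fraction):
    """Return ids of pages with at least min_fraction of content_type."""
    return [
        page_id
        for page_id, inventory in inventories.items()
        if inventory.get(content_type, 0.0) >= min_fraction
    ]


# Toy per-pixel labels (hypothetical data, 100 pixels per page).
pages = {
    "page_a": ["text"] * 70 + ["blank"] * 30,
    "page_b": ["photo"] * 60 + ["text"] * 10 + ["blank"] * 30,
    "page_c": ["handwriting"] * 25 + ["blank"] * 75,
}

inventories = {pid: page_inventory(labels) for pid, labels in pages.items()}

# Query: pages whose area is at least half machine-printed text.
text_pages = pages_matching(inventories, "text", 0.5)
```

Because the query thresholds an aggregate over all pixels of a page, scattered individual pixel errors need not change whether a page crosses the threshold, which is consistent with the abstract's observation that modest per-pixel accuracy can still support useful retrieval.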

Paper Details

Date Published: 29 January 2007
PDF: 12 pages
Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000X (29 January 2007); doi: 10.1117/12.705094
Author Affiliations:
Henry S. Baird, Lehigh Univ. (United States)
Michael A. Moll, Lehigh Univ. (United States)
Chang An, Lehigh Univ. (United States)
Matthew R Casey, Lehigh Univ. (United States)


Published in SPIE Proceedings Vol. 6500:
Document Recognition and Retrieval XIV
Xiaofan Lin; Berrin A. Yanikoglu, Editor(s)
