Proceedings Paper

Document image content inventories
Author(s): Henry S. Baird; Michael A. Moll; Chang An; Matthew R Casey

Paper Abstract

We report an investigation into strategies, algorithms, and software tools for document image content extraction and inventory, that is, the location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc. We have developed automatically trainable methods, adaptable to many kinds of documents represented as bilevel, greylevel, or color images, that offer a wide range of useful tradeoffs of speed versus accuracy using methods for exact and approximate k-Nearest Neighbor classification. We have adopted a policy of classifying each pixel (rather than regions) by content type: we discuss the motivation and engineering implications of this choice. We describe experiments on a wide variety of document-image and content types, and discuss performance in detail in terms of classification speed, per-pixel classification accuracy, per-page inventory accuracy, and subjective quality of page segmentation. These show that even modest per-pixel classification accuracies (of, e.g., 60-70%) support usefully high recall and precision rates (of, e.g., 80-90%) for retrieval queries of document collections seeking pages that contain a given minimum fraction of a certain type of content.
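The pipeline the abstract outlines (classify each pixel by content type, summarize each page as an inventory, then query a collection for pages with at least a given fraction of some content type) can be illustrated with a minimal sketch. This is not the authors' implementation: the one-dimensional pixel features, the toy training samples, and the function names `knn_classify`, `page_inventory`, and `pages_with_content` are all assumptions made for illustration, and the classifier is a brute-force exact k-Nearest Neighbor vote rather than any of the faster approximate variants the paper investigates.

```python
from collections import Counter

def knn_classify(feature, training, k=3):
    """Exact brute-force k-NN: majority label among the k nearest samples.

    `training` is a list of (feature_vector, label) pairs (hypothetical format).
    """
    def sq_dist(sample):
        return sum((a - b) ** 2 for a, b in zip(feature, sample[0]))
    nearest = sorted(training, key=sq_dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def page_inventory(pixel_features, training, k=3):
    """Classify every pixel, then report the fraction of pixels per content type."""
    labels = [knn_classify(f, training, k) for f in pixel_features]
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

def pages_with_content(inventories, content_type, min_fraction):
    """Retrieval query: pages whose inventory holds at least min_fraction of content_type."""
    return [page_id for page_id, inv in inventories.items()
            if inv.get(content_type, 0.0) >= min_fraction]

# Toy data (assumed): one feature per pixel, e.g. a normalized local darkness value.
training = [
    ((0.05,), "blank"), ((0.10,), "blank"), ((0.15,), "blank"),
    ((0.85,), "text"), ((0.90,), "text"), ((0.95,), "text"),
]
pages = {
    "page_a": [(0.0,), (0.1,), (0.9,), (1.0,)],
    "page_b": [(0.0,), (0.1,), (0.2,), (0.9,)],
}

inventories = {pid: page_inventory(px, training) for pid, px in pages.items()}
hits = pages_with_content(inventories, "text", 0.4)
print(inventories)
print(hits)
```

Because the query thresholds an aggregate over all pixels of a page, individual pixel errors only change the answer when they move the page's fraction across the threshold, which is one plausible reading of why the abstract reports page-level retrieval rates above per-pixel accuracy.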

Paper Details

Date Published: 29 January 2007
PDF: 12 pages
Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000X (29 January 2007); doi: 10.1117/12.705094
Author Affiliations
Henry S. Baird, Lehigh Univ. (United States)
Michael A. Moll, Lehigh Univ. (United States)
Chang An, Lehigh Univ. (United States)
Matthew R Casey, Lehigh Univ. (United States)

Published in SPIE Proceedings Vol. 6500:
Document Recognition and Retrieval XIV
Xiaofan Lin; Berrin A. Yanikoglu, Editors

© SPIE