
Proceedings Paper

Versatile document image content extraction
Author(s): Henry S. Baird; Michael A. Moll; Jean Nonnemaker; Matthew R. Casey; Don L. Delorenzo

Paper Abstract

We offer a preliminary report on a research program to investigate versatile algorithms for document image content extraction, that is, locating regions containing handwriting, machine-print text, graphics, line-art, logos, photographs, noise, etc. Solving this problem in its full generality requires coping with a vast diversity of document and image types. Automatically trainable methods are highly desirable, as is extremely high speed for processing large collections. Significant obstacles include the expense of preparing correctly labeled ("ground-truthed") samples, unresolved methodological questions in specifying the domain (e.g., what is a representative collection of document images?), and a lack of consensus among researchers on how to evaluate content-extraction performance. Our research strategy emphasizes versatility first: that is, we concentrate at the outset on designing methods that promise to work across the broadest possible range of cases. This strategy has several important implications: the classifiers must be trainable in reasonable time on vast data sets, and expensive ground-truthed data sets must be complemented by amplification using generative models. These and other design and architectural issues are discussed. We propose a trainable classification methodology that marries k-d trees and hash-driven table lookup, and we describe preliminary experiments.
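
The abstract does not spell out how k-d trees and hash-driven table lookup are combined, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes that feature space is split by median cuts on cycling dimensions (the k-d part), and that both the cuts and the per-cell label counts are stored in hash tables keyed by the path string of left/right decisions (the table-lookup part). The names `train`, `classify`, `cuts`, `leaves`, and `max_depth` are hypothetical.

```python
from collections import Counter


def train(samples, labels, max_depth=4):
    """Build a k-d style partition and store it in two hash tables.

    Assumption (not from the paper): cells are addressed by a string of
    '0'/'1' decisions, and that string is the hash-table key.
    """
    cuts = {}    # path prefix -> (dimension, threshold)
    leaves = {}  # full cell path -> Counter of training labels in the cell

    def split(indices, path):
        depth = len(path)
        if depth == max_depth or len(indices) <= 1:
            leaves[path] = Counter(labels[i] for i in indices)
            return
        dim = depth % len(samples[0])            # cycle through dimensions
        values = sorted(samples[i][dim] for i in indices)
        threshold = values[len(values) // 2]     # median cut
        left = [i for i in indices if samples[i][dim] < threshold]
        right = [i for i in indices if samples[i][dim] >= threshold]
        if not left or not right:                # cut separates nothing
            leaves[path] = Counter(labels[i] for i in indices)
            return
        cuts[path] = (dim, threshold)
        split(left, path + "0")
        split(right, path + "1")

    split(list(range(len(samples))), "")
    return cuts, leaves


def classify(cuts, leaves, x, default=None):
    """Follow hashed cuts to a cell, then return the cell's majority label."""
    path = ""
    while path in cuts:                          # one dict lookup per level
        dim, threshold = cuts[path]
        path += "0" if x[dim] < threshold else "1"
    counts = leaves.get(path)
    if not counts:                               # empty or missing cell
        return default
    return counts.most_common(1)[0][0]
```

In this sketch, classifying a sample costs at most `max_depth` dictionary lookups regardless of training-set size, which is one plausible way to pursue the speed goal the abstract describes; the labels could be content classes such as handwriting, machine-print text, or photograph.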

Paper Details

Date Published: 16 January 2006
PDF: 7 pages
Proc. SPIE 6067, Document Recognition and Retrieval XIII, 60670R (16 January 2006); doi: 10.1117/12.650359
Author Affiliations
Henry S. Baird, Lehigh Univ. (United States)
Michael A. Moll, Lehigh Univ. (United States)
Jean Nonnemaker, Lehigh Univ. (United States)
Matthew R. Casey, Lehigh Univ. (United States)
Don L. Delorenzo, Lehigh Univ. (United States)

Published in SPIE Proceedings Vol. 6067:
Document Recognition and Retrieval XIII
Kazem Taghva; Xiaofan Lin, Editor(s)
