Proceedings Paper

High recall document content extraction
Author(s): Chang An; Henry S. Baird

Paper Abstract

We report methodologies for computing high-recall masks for document image content extraction, that is, the location and segmentation of regions containing handwriting, machine-printed text, photographs, blank space, etc. The resulting segmentation is pixel-accurate and therefore accommodates arbitrary zone shapes (not merely rectangles). We describe experiments showing that iterated classifiers can increase the recall of all content types with little loss of precision. We also introduce two methodological enhancements: (1) a multi-stage voting rule; and (2) a scoring policy that treats blank pixels as a "don't care" class relative to the other content classes. These enhancements improve both recall and precision, achieving at least 89% recall and at least 87% precision among three content types: machine-print, handwriting, and photo.
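
As an illustration of per-pixel scoring with a "don't care" class, the following is a minimal sketch and not the authors' implementation. It assumes one plausible reading of the policy: pixels whose ground-truth label is blank are skipped when counting, so a mask that spills into blank space is not penalized. The function name `score_masks` and the label codes `MP`, `HW`, `PH`, and `BL` are hypothetical and chosen only for this example.

```python
def score_masks(truth, pred, classes, dont_care="BL"):
    """Return {class: (recall, precision)} over two flat lists of pixel labels.

    Assumption (not from the paper): pixels whose ground-truth label equals
    `dont_care` are excluded from every count.
    """
    true_pos = {c: 0 for c in classes}
    truth_count = {c: 0 for c in classes}
    pred_count = {c: 0 for c in classes}

    for t, p in zip(truth, pred):
        if t == dont_care:
            continue  # blank ground truth: neither rewarded nor penalized
        if t in truth_count:
            truth_count[t] += 1
        if p in pred_count:
            pred_count[p] += 1
        if t == p and t in true_pos:
            true_pos[t] += 1

    scores = {}
    for c in classes:
        # None marks an undefined ratio (no pixels of that class counted).
        recall = true_pos[c] / truth_count[c] if truth_count[c] else None
        precision = true_pos[c] / pred_count[c] if pred_count[c] else None
        scores[c] = (recall, precision)
    return scores


# Toy data: six pixels, two of them blank in the ground truth.
truth = ["MP", "MP", "HW", "BL", "BL", "PH"]
pred = ["MP", "HW", "HW", "MP", "BL", "PH"]
scores = score_masks(truth, pred, classes=["MP", "HW", "PH"])
```

In this toy example the fourth pixel is predicted as machine-print but is blank in the ground truth; under the assumed policy it does not lower machine-print precision.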

Paper Details

Date Published: 24 January 2011
PDF: 8 pages
Proc. SPIE 7874, Document Recognition and Retrieval XVIII, 787405 (24 January 2011); doi: 10.1117/12.876706
Chang An, Lehigh Univ. (United States)
Henry S. Baird, Lehigh Univ. (United States)

Published in SPIE Proceedings Vol. 7874:
Document Recognition and Retrieval XVIII
Editor(s): Gady Agam; Christian Viard-Gaudin