Proceedings Paper

Spotting phrases in lines of imaged text
Author(s): Francine R. Chen; Dan S. Bloomberg; Lynn D. Wilcox

Paper Abstract

A system that searches for user-specified phrases in imaged text is described. The search "phrases" can be word fragments, words, or groups of words. The imaged text can be composed of a number of different fonts and can contain graphics. A combination of morphology, simple statistical methods, and hidden Markov modeling is used to detect and locate the phrases. The image is deskewed, and then bounding boxes are found for text-lines in the image using multiresolution morphology. Baselines, toplines, and the x-height in a text-line are identified using simple statistical methods. The distance between baseline and x-height is used to normalize each hypothesized text-line bounding box, and the columns of pixel values in a normalized bounding box serve as the feature vector for that box. Hidden Markov models are created for each user-specified search string, and also to represent all text and graphics other than the search strings. Phrases are identified using Viterbi decoding on a spotting network created from the models. The operating point of the system can be varied to trade off the percentage of words correctly spotted against the percentage of false alarms. Results are given using a subset of the UW English Document Image Database I.
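
The decoding step can be illustrated with a toy example. What follows is a minimal sketch under stated assumptions, not the authors' implementation: each pixel column is assumed to have been quantized to one of three hypothetical symbols (`"a"`, `"b"`, `"x"`), a single `filler` state stands in for the model of all non-keyword text and graphics, the keyword is a two-state left-to-right model (`k1`, `k2`), and every probability is invented for illustration. The names `viterbi` and `keyword_spans` are likewise hypothetical.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for obs (log-domain Viterbi)."""
    # score[s]: best log-probability of any path that ends in state s
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            score[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
            ptr[s] = best
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def keyword_spans(path, keyword_states):
    """Return half-open (start, end) column ranges decoded as keyword."""
    spans, start = [], None
    for i, s in enumerate(path):
        if s in keyword_states and start is None:
            start = i
        elif s not in keyword_states and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(path)))
    return spans


# Toy spotting network (all numbers are illustrative assumptions).
STATES = ["filler", "k1", "k2"]
START = {"filler": 0.8, "k1": 0.2, "k2": 0.0}
TRANS = {
    "filler": {"filler": 0.8, "k1": 0.2, "k2": 0.0},
    "k1": {"filler": 0.0, "k1": 0.5, "k2": 0.5},
    "k2": {"filler": 0.5, "k1": 0.0, "k2": 0.5},
}
EMIT = {
    "filler": {"x": 0.8, "a": 0.1, "b": 0.1},
    "k1": {"x": 0.05, "a": 0.9, "b": 0.05},
    "k2": {"x": 0.05, "a": 0.05, "b": 0.9},
}

LOG_START = {s: safe_log(p) for s, p in START.items()}
LOG_TRANS = {r: {s: safe_log(p) for s, p in row.items()} for r, row in TRANS.items()}
LOG_EMIT = {s: {o: safe_log(p) for o, p in row.items()} for s, row in EMIT.items()}

columns = ["x", "x", "a", "a", "b", "b", "x", "x"]
path = viterbi(columns, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
spans = keyword_spans(path, {"k1", "k2"})
```

With these toy numbers, `spans` is `[(2, 6)]`: columns 2 through 5 are decoded as the keyword and the rest as filler. In a sketch like this, one conceivable way to move the operating point is to change the `filler` to `k1` transition probability, which makes keyword hypotheses cheaper or more expensive; whether the described system does it this way is not stated in the abstract.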

Paper Details

Date Published: 30 March 1995
PDF: 14 pages
Proc. SPIE 2422, Document Recognition II, (30 March 1995); doi: 10.1117/12.205828
Francine R. Chen, Xerox Palo Alto Research Ctr. (United States)
Dan S. Bloomberg, Xerox Palo Alto Research Ctr. (United States)
Lynn D. Wilcox, Xerox Palo Alto Research Ctr. (United States)

Published in SPIE Proceedings Vol. 2422:
Document Recognition II
Luc M. Vincent; Henry S. Baird, Editor(s)
