Journal of Electronic Imaging

Extraction of text words in document images based on a statistical characterization
Author(s): Su S. Chen; Robert M. Haralick; Ihsin T. Phillips

Paper Abstract

Text structures in document images are usually laid out in a structured manner, with preferred spatial relations. These spatial relations are rarely deterministic; however, they can be modeled by probabilities. Therefore, any realistic document layout analysis algorithm should use this type of probabilistic knowledge to optimize its performance. We first describe a method for automatically generating a large amount of nearly perfect layout ground truth data from LaTeX device-independent (DVI) files, where the bounding boxes for the characters, words, text lines, and text blocks are represented in hierarchies. These ground truth data enable us to construct statistical models that characterize the various layout structures in document images. We demonstrate this concept through the development of a word segmentation algorithm, which employs the recursive morphological closing transform to model word shapes in document images. We also conducted systematic experiments to evaluate the performance of our algorithm using synthetic images generated from the LaTeX DVI files and real images from the UW-I and UW-II English document image databases. The results indicate that the correct word detection rate is about 95% on the synthetic images and more than 90% on most of the tested real images.
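To make the idea of closing-based word segmentation concrete, here is a minimal sketch and not the authors' algorithm. It swaps the recursive morphological closing transform and the learned statistical model for a plain horizontal closing with one fixed gap threshold, followed by 4-connected component labeling. The names `close_rows`, `word_boxes`, and `max_gap`, as well as the toy image, are hypothetical and chosen only for illustration.

```python
from collections import deque


def close_rows(image, max_gap):
    """Fill horizontal background gaps of at most max_gap pixels that are
    bounded by foreground on both sides (a 1D closing with a horizontal
    line of max_gap + 1 pixels). Assumption: fixed threshold, not learned."""
    closed = [row[:] for row in image]
    for row in closed:
        last_fg = None  # column of the most recent foreground pixel
        for x, value in enumerate(row):
            if value:
                if last_fg is not None and 0 < x - last_fg - 1 <= max_gap:
                    for gx in range(last_fg + 1, x):
                        row[gx] = 1
                last_fg = x
    return closed


def word_boxes(image, max_gap):
    """Return bounding boxes (x0, y0, x1, y1), inclusive, of the 4-connected
    components that remain after the horizontal closing."""
    closed = close_rows(image, max_gap)
    height = len(closed)
    width = len(closed[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not closed[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over one component.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and closed[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


# Toy text line: two "words" separated by a 4-pixel space, with
# 1-pixel gaps between the "characters" inside each word.
rows = [
    "##.##....###.#",
    "##.##....###.#",
]
image = [[1 if c == "#" else 0 for c in r] for r in rows]

print(word_boxes(image, max_gap=1))  # [(0, 0, 4, 1), (9, 0, 13, 1)]
```

With `max_gap=1` the inter-character gaps are closed while the inter-word space survives, so each word becomes one component. A threshold that is too large merges neighboring words, and one that is too small splits words into characters. That sensitivity is the kind of choice the abstract proposes to drive with statistics estimated from ground truth data rather than a hand-picked constant.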

Paper Details

Date Published: 1 January 1996
PDF: 12 pages
J. Electron. Imaging. 5(1) doi: 10.1117/12.227706
Published in: Journal of Electronic Imaging Volume 5, Issue 1
Author Affiliations
Su S. Chen, Univ. of Washington (United States)
Robert M. Haralick, Univ. of Washington (United States)
Ihsin T. Phillips, Seattle Univ. (United States)
