Proceedings Paper

Extraction of text layout structures on document images based on statistical characterization
Author(s): Su S. Chen; Robert M. Haralick; Ihsin T. Phillips

Paper Abstract

Textual structures on a document image, such as characters, words, text lines, and paragraphs, are usually laid out in a highly structured manner, with preferred spatial relations among them. These spatial relations are rarely deterministic; instead, they describe correlations and likelihoods. Therefore, any realistic document layout analysis algorithm should use this type of knowledge to optimize its performance. In this paper, we first describe a method for automatically generating a large amount of almost 100% correct ground-truth data for document layout analysis. The bounding boxes for the characters, words, text lines, and paragraphs are expressed in a hierarchy. Then, based on this layout ground truth, we build statistical models of the layout structures for the words, text lines, and paragraphs on document images. Finally, we describe an algorithm that uses these statistical models to extract the text words on document images. The performance of the algorithm is evaluated and reported.
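
As an illustration of the hierarchical bounding-box representation mentioned in the abstract, the following is a minimal Python sketch and not the authors' algorithm. The `Box` class, the `enclose` and `group_by_gap` helpers, and the fixed `max_gap` threshold are all hypothetical names and choices introduced here; the paper derives its grouping decisions from statistical models of the layout, for which the fixed threshold is only a stand-in.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Axis-aligned bounding box; children hold the next level down
    (e.g. a word box holds character boxes). Hypothetical structure."""
    left: float
    top: float
    right: float
    bottom: float
    children: List["Box"] = field(default_factory=list)


def enclose(children: List[Box]) -> Box:
    """Return the smallest box containing all children, keeping them as sub-boxes."""
    return Box(
        left=min(c.left for c in children),
        top=min(c.top for c in children),
        right=max(c.right for c in children),
        bottom=max(c.bottom for c in children),
        children=list(children),
    )


def group_by_gap(boxes: List[Box], max_gap: float) -> List[Box]:
    """Group boxes on one line, left to right, starting a new group whenever
    the horizontal gap exceeds max_gap (an assumed fixed threshold standing
    in for a learned statistical model)."""
    groups: List[Box] = []
    current: List[Box] = []
    for box in sorted(boxes, key=lambda b: b.left):
        if current and box.left - max(c.right for c in current) > max_gap:
            groups.append(enclose(current))
            current = []
        current.append(box)
    if current:
        groups.append(enclose(current))
    return groups


# Four character boxes: two close pairs separated by a wide gap.
chars = [Box(0, 0, 4, 10), Box(5, 0, 9, 10), Box(20, 0, 24, 10), Box(25, 0, 29, 10)]
words = group_by_gap(chars, max_gap=3.0)  # character level -> word level
line = enclose(words)                     # word level -> text-line level
```

In this sketch each level of the hierarchy is built by enclosing the boxes of the level below, so a text-line box contains word boxes, which in turn contain character boxes.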

Paper Details

Date Published: 30 March 1995
PDF: 12 pages
Proc. SPIE 2422, Document Recognition II, (30 March 1995); doi: 10.1117/12.205815
Author Affiliations:
Su S. Chen, Univ. of Washington (United States)
Robert M. Haralick, Univ. of Washington (United States)
Ihsin T. Phillips, Seattle Univ. (United States)

Published in SPIE Proceedings Vol. 2422:
Document Recognition II
Luc M. Vincent; Henry S. Baird, Editor(s)
