Proceedings Paper

Extraction of text lines and text blocks on document images based on statistical modeling
Author(s): Su S. Chen; Robert M. Haralick; Ihsin T. Phillips

Paper Abstract

In this paper, we develop statistical models that characterize the text line and text block structures on document images using text word bounding boxes. We pose the extraction problem as finding the text lines and text blocks that maximize the Bayesian probability of those lines and blocks given the observed text word bounding boxes. We derive the probabilistic linear displacement model (PLDM) to model text line structures from text word bounding boxes, and we develop an augmented PLDM to characterize text block structures from text line bounding boxes. By systematically gathering statistics from a large population of document images, we validate our models experimentally and determine the proper model parameters. We design and implement an iterative algorithm that uses these probabilistic models to extract the text lines and text blocks. We report the quantitative performance of the algorithm in terms of the rates of missed, false, correct, split, merged, and spurious detections of text lines and text blocks.
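
The maximization described in the abstract can be sketched as a maximum a posteriori estimate. The notation here is an assumption made for illustration and is not taken from the paper: W stands for the observed set of text word bounding boxes and L for a candidate grouping of those boxes into text lines. The abstract does not give the actual form of the PLDM likelihood, so the terms are left abstract.

```latex
\hat{L}
  = \arg\max_{L} P(L \mid W)
  = \arg\max_{L} \frac{P(W \mid L)\, P(L)}{P(W)}
  = \arg\max_{L} P(W \mid L)\, P(L)
```

The last step holds because P(W) does not depend on L. Under the same assumed notation, the text block stage would presumably take the analogous form, with the extracted text line bounding boxes as the observation and a candidate grouping of lines into blocks as the quantity being maximized over.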

Paper Details

Date Published: 7 March 1996
PDF: 12 pages
Proc. SPIE 2660, Document Recognition III (7 March 1996); doi: 10.1117/12.234699
Author Affiliations
Su S. Chen, Caere Corp. (United States)
Robert M. Haralick, Univ. of Washington (United States)
Ihsin T. Phillips, Seattle Univ. (United States)

Published in SPIE Proceedings Vol. 2660:
Document Recognition III
Editor(s): Luc M. Vincent; Jonathan J. Hull