
Proceedings Paper

Text characterization by connected component transformations
Author(s): Larry Spitz

Paper Abstract

Many different scripts and languages are in common use worldwide. Finding text lines and, where present, character and word boundaries is a necessary primitive operation for most document processing applications. We have developed a method of handling text lines from several different languages that is robust in the presence of common printing and scanning artifacts. We describe a technique by which the characteristics of a text line can be determined from a list of the connected pixel components that make up the image. The technique applies across many languages and scripts that are laid out horizontally. For text in Roman type, the location and dimensions of each text line are augmented with the positions of the baseline and x-height. Where appropriate, the coordinates of space-delimited words and individual character cells are determined. The technique incorporates a computationally inexpensive method for straightening curved lines and segmenting kerned characters, and a novel method, based on font weight and stress, for locating the boundaries of individual characters even if their images touch.
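
As a rough illustration of the general idea of deriving text-line properties from a list of connected components, the following Python sketch groups component bounding boxes into horizontal lines and then estimates a baseline and x-height for each line. It is not the method of the paper, whose details are not given in the abstract. The function names, the box layout (left, top, right, bottom, with y growing downward), the vertical-overlap grouping rule, and the "most frequent edge" heuristic are all assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple

# Assumed box layout: (left, top, right, bottom), with y growing downward.
Box = Tuple[int, int, int, int]


def group_into_lines(boxes: List[Box]) -> List[List[Box]]:
    """Group boxes whose vertical extents overlap into horizontal text lines."""
    lines = []  # each entry is [line_top, line_bottom, member_boxes]
    for box in sorted(boxes, key=lambda b: b[1]):
        _, top, _, bottom = box
        for line in lines:
            # The box joins the first line whose vertical extent it overlaps.
            if top <= line[1] and bottom >= line[0]:
                line[0] = min(line[0], top)
                line[1] = max(line[1], bottom)
                line[2].append(box)
                break
        else:
            lines.append([top, bottom, [box]])
    # Return the members of each line in left-to-right reading order.
    return [sorted(members, key=lambda b: b[0]) for _, _, members in lines]


def baseline_and_xheight(line: List[Box]) -> Tuple[int, int]:
    """Estimate the baseline as the most frequent bottom edge, and the
    x-height as its distance from the most frequent top edge."""
    baseline = Counter(b[3] for b in line).most_common(1)[0][0]
    x_top = Counter(b[1] for b in line).most_common(1)[0][0]
    return baseline, baseline - x_top


# Hypothetical boxes: one line containing an ascender and a descender,
# followed by a second, shorter line.
example_boxes = [
    (0, 10, 5, 20),    # x-height letter
    (7, 4, 12, 20),    # letter with an ascender
    (14, 10, 19, 26),  # letter with a descender
    (21, 10, 26, 20),  # x-height letter
    (0, 40, 5, 50),
    (7, 40, 12, 50),
]
text_lines = group_into_lines(example_boxes)
first_baseline, first_xheight = baseline_and_xheight(text_lines[0])
```

The frequency heuristic relies on most Roman lowercase letters sitting on the baseline and reaching the x-height. It would likely need smoothing or a tolerance band on real scans, where skew, curvature, and noise shift the edges of individual components.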

Paper Details

Date Published: 23 March 1994
PDF: 9 pages
Proc. SPIE 2181, Document Recognition, (23 March 1994); doi: 10.1117/12.171097
Larry Spitz, Fuji Xerox Palo Alto Lab. (United States)


Published in SPIE Proceedings Vol. 2181:
Document Recognition
Luc M. Vincent; Theo Pavlidis, Editor(s)
