Proceedings PaperText characterization by connected component transformations
|Format||Member Price||Non-Member Price|
Worldwide there are many different scripts and languages in common use. Finding text lines and character and word boundaries, where present, are necessary primitive operations for most document processing applications. We have developed a method of handling text lines from several different languages that is robust in the presence of common printing and scanning artifacts. A technique is described by which information about the characteristics of a text line can be determined from a list of the connected pixel components that comprise the image. This technique applies across many languages and scripts that are laid out horizontally. For text comprising Roman type, the location and dimensions of each text line are augmented with positions of the baseline and x-height. Where appropriate, coordinates of space-delimited words and individual character cells are determined. This technique incorporates a computationally inexpensive method for straightening curved lines and segmenting kerned characters and a novel method based on font weight and stress for locating the boundaries of individual characters, even if their images touch.