Proceedings Paper

Word mining in a sparsely labeled handwritten collection
Author(s): L. R. B. Schomaker

Paper Abstract

Word-spotting techniques are usually based on detailed modeling of target words, followed by a search for the locations of such a target word in images of handwriting. In this study, the focus is on deciding whether target words are present in lines of text, regardless of their horizontal position. Line strips are modeled with a Bag-of-Glyphs approach based on a self-organized map. This approach uses the presence of fragmented connected-component shapes (glyphs) in a line strip to characterize the text passage, similar to the Bag-of-Words approach for 'ASCII'-encoded documents in regular Information Retrieval. Subsequently, a support-vector machine is trained to detect the presence of a word or word category, in an iterative setup that involves an active group of users. Results are promising for a large proportion of words and depend on both the number of labeled lines and the shape uniqueness of the word. Particularly useful is the ability to train on abstract content classes such as proper names, municipalities, or word-bigram presence in the line-strip images.
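
The Bag-of-Glyphs idea can be illustrated with a minimal sketch. It is not the paper's implementation: the codebook below is a hand-picked list of 2-D prototypes standing in for trained self-organized-map nodes, the glyph feature vectors are made up, and the normalization of the histogram is an assumption. Each glyph is assigned to its nearest prototype, and the line strip is summarized by the resulting histogram, which does not depend on where the glyphs occur in the line.

```python
from collections import Counter
from math import dist

def nearest_node(glyph, codebook):
    """Index of the codebook prototype closest to the glyph feature vector."""
    return min(range(len(codebook)), key=lambda i: dist(glyph, codebook[i]))

def bag_of_glyphs(line_glyphs, codebook):
    """Normalized histogram of codebook-node hits for one line strip."""
    counts = Counter(nearest_node(g, codebook) for g in line_glyphs)
    total = len(line_glyphs)
    return [counts[i] / total if total else 0.0 for i in range(len(codebook))]

# Hypothetical codebook: three prototypes standing in for trained SOM nodes.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical glyph features extracted from one line strip, left to right.
line = [(0.1, 0.1), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]

histogram = bag_of_glyphs(line, codebook)
print(histogram)  # [0.25, 0.5, 0.25]
```

In a setup like the one described, such a histogram would serve as the feature vector for a line strip, and a classifier such as a support-vector machine would be trained on labeled lines to predict whether a given word or word category is present.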

Paper Details

Date Published: 28 January 2008
PDF: 11 pages
Proc. SPIE 6815, Document Recognition and Retrieval XV, 68150N (28 January 2008); doi: 10.1117/12.766329
L. R. B. Schomaker, Univ. of Groningen (Netherlands)

Published in SPIE Proceedings Vol. 6815:
Document Recognition and Retrieval XV
Berrin A. Yanikoglu; Kathrin Berkner, Editor(s)
