
Proceedings Paper

Clustering of Farsi sub-word images for whole-book recognition
Author(s): Mohammad Reza Soheili; Ehsanollah Kabir; Didier Stricker

Paper Abstract

The redundancy of word and sub-word occurrences in large documents can be exploited in an OCR system to improve recognition results. Most OCR systems employ language modeling techniques as a post-processing step; however, these techniques do not use the important pictorial information that exists in the text image. In the case of large-scale recognition of degraded documents, this information is even more valuable. In our previous work, we proposed a sub-word image clustering method for applications dealing with large printed documents. In our clustering method, the ideal case is when all equivalent sub-word images lie in one cluster. To overcome the issues of low print quality, the clustering method uses an image matching algorithm to measure the distance between two sub-word images. The measured distance, together with a set of simple shape features, was used to cluster all sub-word images. In this paper, we analyze the effects of adding more shape features on processing time, purity of clustering, and the final recognition rate. Previously published experiments have shown the efficiency of our method on one book. Here we present extended experimental results and evaluate our method on another book with a completely different typeface. We also show that the number of newly created clusters in a page can be used as a criterion for assessing print quality and evaluating preprocessing phases.
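
The pipeline outlined in the abstract (a cheap shape-feature comparison, then an image-matching distance, with a new cluster opened when nothing matches) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the features (height, width, ink count), the pixel-difference distance, the thresholds `feat_tol` and `dist_thresh`, and all function names are hypothetical stand-ins, since the abstract does not specify them. The sketch also shows the standard purity measure and the per-page count of newly created clusters mentioned in the abstract.

```python
from collections import Counter

def shape_features(img):
    """Hypothetical simple shape features of a non-empty binary image
    (list of rows of 0/1): height, width, ink-pixel count."""
    return (len(img), len(img[0]), sum(sum(row) for row in img))

def match_distance(a, b):
    """Stand-in image-matching distance: fraction of differing pixels
    after zero-padding both images to a common size."""
    h, w = max(len(a), len(b)), max(len(a[0]), len(b[0]))
    diff = 0
    for y in range(h):
        for x in range(w):
            pa = a[y][x] if y < len(a) and x < len(a[0]) else 0
            pb = b[y][x] if y < len(b) and x < len(b[0]) else 0
            diff += pa != pb
    return diff / (h * w)

def cluster_page(images, clusters, feat_tol=1, dist_thresh=0.2):
    """Assign each sub-word image of one page to the first compatible
    cluster, opening a new cluster when none matches. `clusters` is a
    list of (representative_image, features) shared across pages.
    Returns (labels, number_of_new_clusters_on_this_page)."""
    labels, new = [], 0
    for img in images:
        f = shape_features(img)
        for idx, (rep, rep_f) in enumerate(clusters):
            # Cheap filter on height and width before the costly match.
            if abs(f[0] - rep_f[0]) > feat_tol or abs(f[1] - rep_f[1]) > feat_tol:
                continue
            if match_distance(img, rep) <= dist_thresh:
                labels.append(idx)
                break
        else:
            clusters.append((img, f))
            labels.append(len(clusters) - 1)
            new += 1
    return labels, new

def purity(labels, truth):
    """Cluster purity: share of items that carry the majority
    ground-truth label of their cluster."""
    by_cluster = {}
    for label, t in zip(labels, truth):
        by_cluster.setdefault(label, Counter())[t] += 1
    return sum(c.most_common(1)[0][1] for c in by_cluster.values()) / len(labels)
```

In this sketch, adding more shape features would mean extending the filter inside `cluster_page`: a stricter filter skips more of the expensive `match_distance` calls, which is the kind of trade-off between processing time and purity that the paper analyzes. A page that yields an unusually high count of new clusters would, under the abstract's argument, point to poor print quality or a weak preprocessing step.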

Paper Details

Date Published: 8 February 2015
PDF: 12 pages
Proc. SPIE 9402, Document Recognition and Retrieval XXII, 94020C (8 February 2015); doi: 10.1117/12.2075931
Mohammad Reza Soheili, Tarbiat Modares Univ. (Iran, Islamic Republic of)
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (Germany)
Ehsanollah Kabir, Tarbiat Modares Univ. (Iran, Islamic Republic of)
Didier Stricker, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (Germany)

Published in SPIE Proceedings Vol. 9402:
Document Recognition and Retrieval XXII
Eric K. Ringger; Bart Lamiroy, Editor(s)
