Proceedings Paper

Sub-word image clustering in Farsi printed books
Author(s): Mohammad Reza Soheili; Ehsanollah Kabir; Didier Stricker

Paper Abstract

Most OCR systems are designed to recognize a single page at a time. With unfamiliar font faces, low-quality paper, and degraded print, their performance drops sharply. However, an OCR system can exploit the redundancy of word occurrences in large documents to improve recognition results. In this paper, we propose a sub-word image clustering method for applications that deal with large printed documents. We assume that the whole document is printed in a single unknown font with low print quality. The proposed method finds clusters of equivalent sub-word images using an incremental algorithm. To cope with the low print quality, we propose an image matching algorithm that measures the distance between two sub-word images based on the Hamming distance and the ratio of the area to the perimeter of the connected components. To evaluate the method, we built a ground-truth dataset of more than 111,000 sub-word images, all extracted from an old Farsi book. We cluster all of these sub-words, including isolated letters and even punctuation marks, and then label the centers of the resulting clusters manually. We show that all sub-words of the book can be recognized with more than 99.7% accuracy by assigning the label of each cluster center to all of its members.
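
The pipeline the abstract describes (match, cluster incrementally, label the centers, propagate the labels) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the Hamming distance and the area-to-perimeter ratio are combined, how cluster centers are chosen, or which thresholds are used. The sketch assumes binary images already normalized to the same size, treats the first member of a cluster as its center, computes the ratio over the whole image rather than per connected component, and uses hypothetical thresholds `max_hamming` and `max_ratio_gap`. All function names are illustrative.

```python
def hamming(a, b):
    """Fraction of pixels that differ between two equal-size binary images."""
    total = len(a) * len(a[0])
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def area_perimeter_ratio(img):
    """Foreground area divided by the number of foreground boundary pixels.

    Simplification (assumption): computed over the whole image, not per
    connected component as in the abstract.
    """
    h, w = len(img), len(img[0])
    area = perimeter = 0
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            area += 1
            neighbours = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            # A foreground pixel is on the boundary if any 4-neighbour is
            # background or lies outside the image.
            if any(not (0 <= ny < h and 0 <= nx < w) or not img[ny][nx]
                   for ny, nx in neighbours):
                perimeter += 1
    return area / perimeter if perimeter else 0.0


def matches(a, b, max_hamming=0.15, max_ratio_gap=0.5):
    """Hypothetical matching rule: both measures must be within threshold."""
    if hamming(a, b) > max_hamming:
        return False
    gap = abs(area_perimeter_ratio(a) - area_perimeter_ratio(b))
    return gap <= max_ratio_gap


def cluster_incrementally(images, max_hamming=0.15, max_ratio_gap=0.5):
    """Assign each image to the first matching cluster, else open a new one."""
    clusters = []  # each cluster: {"center": image, "members": [indices]}
    for i, img in enumerate(images):
        for c in clusters:
            if matches(c["center"], img, max_hamming, max_ratio_gap):
                c["members"].append(i)
                break
        else:
            # No existing center matched, so this image starts a new cluster.
            clusters.append({"center": img, "members": [i]})
    return clusters


def propagate_labels(clusters, center_labels, n_images):
    """Copy each manually assigned center label to all cluster members."""
    labels = [None] * n_images
    for c, label in zip(clusters, center_labels):
        for i in c["members"]:
            labels[i] = label
    return labels
```

In this sketch only one label per cluster has to be entered by hand, which is where the saving comes from in a long document: `propagate_labels` then recognizes every member of the cluster at once. Taking the first image as the center keeps the example short; a real system would likely choose or update the center more carefully.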

Paper Details

Date Published: 14 February 2015
PDF: 5 pages
Proc. SPIE 9445, Seventh International Conference on Machine Vision (ICMV 2014), 94450Z (14 February 2015); doi: 10.1117/12.2181404
Mohammad Reza Soheili, Tarbiat Modares Univ. (Iran, Islamic Republic of)
Ehsanollah Kabir, Tarbiat Modares Univ. (Iran, Islamic Republic of)
Didier Stricker, German Research Ctr. for Artificial Intelligence (Germany)

Published in SPIE Proceedings Vol. 9445:
Seventh International Conference on Machine Vision (ICMV 2014)
Antanas Verikas; Branislav Vuksanovic; Petia Radeva; Jianhong Zhou, Editor(s)

© SPIE