Proceedings Paper

The BBN Byblos Hindi OCR system
Author(s): Premkumar S. Natarajan; Ehry MacRostie; Michael Decerbo

Paper Abstract

The BBN Byblos OCR system implements a script-independent methodology for OCR using Hidden Markov Models (HMMs). We have successfully ported the system to Arabic, English, Chinese, Pashto, and Japanese. In this paper, we report on our recent effort to train the system to recognize Hindi (Devanagari) documents. The initial experiments reported here used a corpus of synthetic (computer-generated) document images, together with slightly degraded versions of the same images produced by scanning printed copies and by scanning faxes of the printed copies. On a fair test set consisting of synthetic images alone, we measured a character error rate of 1.0%. The character error rate was 1.40% on a fair test set of scanned images (scans of printed versions of the synthetic images) and 8.7% on a fair test set of fax images (scans of printed and faxed versions of the synthetic images).
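
The abstract reports character error rates without stating how they are computed. A common convention, assumed here and not confirmed by the abstract, is the Levenshtein (edit) distance between the recognized text and the reference transcription, divided by the reference length. The sketch below illustrates that convention; the function names `edit_distance` and `character_error_rate` are hypothetical, and counting Unicode code points as "characters" is also an assumption, since the abstract does not say which unit the authors used for Devanagari.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Python iterates strings by code point, so Devanagari vowel signs
    # and the anusvara each count as one character in this sketch.
    reference = "हिंदी"
    recognized = "हिदी"  # anusvara dropped by a hypothetical recognizer
    print(character_error_rate(reference, recognized))
```

Under this definition, a 1.0% character error rate corresponds to roughly one edit per hundred reference characters. A system that scores by grapheme cluster rather than by code point would produce different figures for the same output.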

Paper Details

Date Published: 17 January 2005
PDF: 7 pages
Proc. SPIE 5676, Document Recognition and Retrieval XII, (17 January 2005); doi: 10.1117/12.588810
Author Affiliations
Premkumar S. Natarajan, BBN Technologies (United States)
Ehry MacRostie, BBN Technologies (United States)
Michael Decerbo, BBN Technologies (United States)


Published in SPIE Proceedings Vol. 5676:
Document Recognition and Retrieval XII
Elisa H. Barney Smith; Kazem Taghva, Editors

© SPIE