Proceedings Paper

A multi-evidence, multi-engine OCR system
Author(s): Ilya Zavorin; Eugene Borovikov; Anna Borovikov; Luis Hernandez; Kristen Summers; Mark Turner

Paper Abstract

Although modern OCR technology can handle a wide variety of document images, no single OCR engine performs equally well on all documents in a given language script. Each OCR engine has its strengths and weaknesses, so different engines tend to differ in their accuracy across documents and in the errors they make on the same document image. While the idea of using multiple OCR engines to boost output accuracy is not new, most existing systems do not go beyond variations on majority voting. This approach may work well in many cases, but it has limitations, especially when the OCR technology used to process a given script has not yet fully matured. Our goal is to develop a system called MEMOE (for "Multi-Evidence Multi-OCR-Engine") that combines, in an optimal or near-optimal way, the output streams of one or more OCR engines with various types of evidence extracted from these streams and from the original document images, to produce output of higher quality than that of the individual OCR engines or of majority voting applied to multiple OCR output streams. Furthermore, we aim to improve OCR accuracy on images whose accuracy would otherwise be low enough to significantly impair downstream processing. The MEMOE system functions as an OCR engine, taking document images and some configuration parameters as input and producing a single output text stream. In this paper, we describe the design of the system, the various evidence types, and how they are incorporated into MEMOE in the form of filters. Results of initial tests on two corpora of Arabic documents show that, even in its initial configuration, the system is superior to a voting algorithm, and that further improvement may be achieved by incorporating additional evidence types.
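To make the contrast between plain majority voting and evidence-based combination concrete, here is a minimal Python sketch. It is not the MEMOE implementation, whose details the abstract does not give. It assumes the engine outputs have already been aligned token by token, that each evidence filter is simply a function returning a numeric bonus for a candidate token, and that votes and bonuses are summed. The names `majority_vote`, `combine_with_evidence`, `lexicon_filter`, and `LEXICON` are hypothetical, and the English tokens are used only for readability.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical filter type: maps a candidate token to an evidence bonus.
EvidenceFilter = Callable[[str], float]

# Hypothetical toy lexicon used as one source of evidence.
LEXICON = {"clear", "text", "image"}


def lexicon_filter(token: str) -> float:
    """Assumed evidence: reward candidates that appear in the lexicon."""
    return 2.0 if token in LEXICON else 0.0


def majority_vote(candidates: Sequence[str]) -> str:
    """Baseline: most frequent candidate; ties go to the earliest engine."""
    counts = Counter(candidates)
    best = max(counts.values())
    return next(c for c in candidates if counts[c] == best)


def combine_with_evidence(
    candidates: Sequence[str], filters: Sequence[EvidenceFilter]
) -> str:
    """Score each distinct candidate by its vote count plus filter bonuses."""
    counts = Counter(candidates)

    def score(token: str) -> float:
        return counts[token] + sum(f(token) for f in filters)

    # dict.fromkeys keeps engine order, so ties go to the earliest engine.
    return max(dict.fromkeys(candidates), key=score)


# Three engines read the same word; two of them make the same error.
aligned = ["c1ear", "c1ear", "clear"]
voted = majority_vote(aligned)
combined = combine_with_evidence(aligned, [lexicon_filter])
```

In this toy case the two engines that share an error outvote the correct reading, while the lexicon bonus lets the correct candidate win. How MEMOE actually weights and combines its evidence types is described in the paper itself, not here.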

Paper Details

Date Published: 29 January 2007
PDF: 10 pages
Proc. SPIE 6500, Document Recognition and Retrieval XIV, 650005 (29 January 2007); doi: 10.1117/12.703106
Ilya Zavorin, CACI International Inc. (United States)
Eugene Borovikov, CACI International Inc. (United States)
Anna Borovikov, CACI International Inc. (United States)
Luis Hernandez, Army Research Lab. (United States)
Kristen Summers, CACI International Inc. (United States)
Mark Turner, CACI International Inc. (United States)

Published in SPIE Proceedings Vol. 6500:
Document Recognition and Retrieval XIV
Xiaofan Lin; Berrin A. Yanikoglu, Editor(s)
