Share Email Print

Proceedings Paper

Counting OCR errors in typeset text
Author(s): Jonathan S. Sandberg
Format Member Price Non-Member Price
PDF $17.00 $21.00

Paper Abstract

Frequently object recognition accuracy is a key component in the performance analysis of pattern matching systems. In the past three years, the results of numerous excellent and rigorous studies of OCR system typeset-character accuracy (henceforth OCR accuracy) have been published, encouraging performance comparisons between a variety of OCR products and technologies. These published figures are important; OCR vendor advertisements in the popular trade magazines lead readers to believe that published OCR accuracy figures effect market share in the lucrative OCR market. Curiously, a detailed review of many of these OCR error occurrence counting results reveals that they are not reproducible as published and they are not strictly comparable due to larger variances in the counts than would be expected by the sampling variance. Naturally, since OCR accuracy is based on a ratio of the number of OCR errors over the size of the text searched for errors, imprecise OCR error accounting leads to similar imprecision in OCR accuracy. Some published papers use informal, non-automatic, or intuitively correct OCR error accounting. Still other published results present OCR error accounting methods based on string matching algorithms such as dynamic programming using Levenshtein (edit) distance but omit critical implementation details (such as the existence of suspect markers in the OCR generated output or the weights used in the dynamic programming minimization procedure). The problem with not specifically revealing the accounting method is that the number of errors found by different methods are significantly different. This paper identifies the basic accounting methods used to measure OCR errors in typeset text and offers an evaluation and comparison of the various accounting methods.

Paper Details

Date Published: 30 March 1995
PDF: 12 pages
Proc. SPIE 2422, Document Recognition II, (30 March 1995); doi: 10.1117/12.205821
Show Author Affiliations
Jonathan S. Sandberg, Panasonic Technologies, Inc. (United States)

Published in SPIE Proceedings Vol. 2422:
Document Recognition II
Luc M. Vincent; Henry S. Baird, Editor(s)

© SPIE. Terms of Use
Back to Top
Sign in to read the full article
Create a free SPIE account to get access to
premium articles and original research
Forgot your username?