Proceedings Paper

Triage of OCR results using confidence scores
Author(s): Prateek Sarkar; Henry S. Baird; John Henderson

Paper Abstract

We describe a technique for modeling the character recognition accuracy of an OCR system -- treated as a black box -- on a particular page of printed text based on an examination only of the output top-choice character classifications and, for each, a confidence score such as is supplied by many commercial OCR systems. Latent conditional independence (LCI) models perform better on this task, in our experience, than naive uniform thresholding methods. Given a sufficiently large and representative dataset of OCR (errorful) output and manually proofed (correct) text, we can automatically infer LCI models that exhibit a useful degree of reliability. A collaboration between a PARC research group and a Xerox legacy conversion service bureau has demonstrated that such models can significantly improve the productivity of human proofing staff by triaging -- that is, selecting to bypass manual inspection -- pages whose estimated OCR accuracy exceeds a threshold chosen to ensure that a customer-specified per-page accuracy target will be met with sufficient confidence. We report experimental results on over 1400 pages. Our triage software tools are running in production and will be applied to more than 5 million pages of multi-lingual text.
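
As a rough illustration of the triage decision described in the abstract, the sketch below implements only the naive uniform-thresholding baseline, not the LCI models, whose details the abstract does not give. It assumes each page is a list of per-character confidence scores in [0, 1]; the function names, the sample scores, and both threshold values are hypothetical and chosen only for the example.

```python
from typing import List, Sequence


def estimate_page_accuracy(confidences: Sequence[float], char_threshold: float) -> float:
    """Naive estimate: the fraction of characters whose confidence score
    meets one uniform threshold (assumed baseline, not the LCI model)."""
    if not confidences:
        raise ValueError("page has no recognized characters")
    accepted = sum(1 for c in confidences if c >= char_threshold)
    return accepted / len(confidences)


def triage_pages(
    pages: Sequence[Sequence[float]],
    char_threshold: float,
    page_threshold: float,
) -> List[int]:
    """Return the indices of pages whose estimated accuracy exceeds the
    page-level threshold, i.e. pages selected to bypass manual inspection."""
    return [
        i
        for i, confs in enumerate(pages)
        if estimate_page_accuracy(confs, char_threshold) > page_threshold
    ]


# Hypothetical confidence scores for two short pages.
pages = [
    [0.99, 0.98, 0.97, 0.99],  # every character is confidently recognized
    [0.99, 0.40, 0.95, 0.30],  # half the characters look doubtful
]
bypass = triage_pages(pages, char_threshold=0.9, page_threshold=0.95)
```

In a real deployment both thresholds would presumably be calibrated against manually proofed text so that the customer's per-page accuracy target is met with the required confidence; the fixed values here do not reflect that calibration.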

Paper Details

Date Published: 18 December 2001
PDF: 7 pages
Proc. SPIE 4670, Document Recognition and Retrieval IX, (18 December 2001); doi: 10.1117/12.450730
Prateek Sarkar, Xerox Palo Alto Research Ctr. (United States)
Henry S. Baird, Xerox Palo Alto Research Ctr. (United States)
John Henderson, Xerox Corp. (United States)

Published in SPIE Proceedings Vol. 4670:
Document Recognition and Retrieval IX
Paul B. Kantor; Tapas Kanungo; Jiangying Zhou, Editor(s)
