
Proceedings Paper

Measuring the impact of character recognition errors on downstream text analysis
Author(s): Daniel Lopresti

Paper Abstract

Noise presents a serious challenge in optical character recognition, as well as in the downstream applications that consume its output. In this paper, we describe a paradigm for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Using a hierarchical methodology based on approximate string matching to classify errors, we isolate and analyze their cascading effects as they travel through the pipeline. We present experimental results based on injecting single errors into a large corpus of test documents to study how their impact varies with the nature of the error and the character(s) involved. While most such errors are found to be localized, in the worst case some have an amplifying effect that extends well beyond the site of the original error, thereby degrading the performance of the end-to-end system.
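
The error-injection idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the regex-based sentence splitter and tokenizer are deliberately naive stand-ins, Python's `difflib.SequenceMatcher` is used in place of the paper's approximate string matching, part-of-speech tagging is omitted, and all function names (`inject_substitution`, `diff_ops`, `measure_impact`) are hypothetical.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    # Naive stand-in: break after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    # Naive stand-in: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def inject_substitution(text, index, new_char):
    # Simulate one OCR substitution error at a known position.
    return text[:index] + new_char + text[index + 1:]


def diff_ops(clean, noisy):
    # Align two sequences and keep only the regions where they differ.
    # SequenceMatcher is an assumption here, not the paper's matcher.
    matcher = SequenceMatcher(a=clean, b=noisy, autojunk=False)
    return [
        (tag, clean[i1:i2], noisy[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def measure_impact(text, index, new_char):
    # Run both pipeline stages on clean and noisy text, then compare
    # the outputs stage by stage.
    noisy = inject_substitution(text, index, new_char)
    return {
        "sentences": diff_ops(split_sentences(text), split_sentences(noisy)),
        "tokens": diff_ops(tokenize(text), tokenize(noisy)),
    }


text = "It rained. We stayed home."
localized = measure_impact(text, 5, "1")   # 'i' -> '1' inside "rained"
amplified = measure_impact(text, 9, ",")   # '.' -> ',' at the sentence break
```

In this toy setup, the substitution inside a word changes one token and one sentence, while replacing the period with a comma still changes only one token but merges both sentences into one. That contrast is a small instance of the localized versus amplifying behavior the abstract describes.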

Paper Details

Date Published: 28 January 2008
PDF: 11 pages
Proc. SPIE 6815, Document Recognition and Retrieval XV, 68150G (28 January 2008); doi: 10.1117/12.767131
Daniel Lopresti, Lehigh Univ. (United States)

Published in SPIE Proceedings Vol. 6815:
Document Recognition and Retrieval XV
Berrin A. Yanikoglu; Kathrin Berkner, Editor(s)

© SPIE