Proceedings Paper

Utilizing web data in identification and correction of OCR errors
Author(s): Kazem Taghva; Shivam Agarwal

Paper Abstract

In this paper, we report on our experiments in detecting and correcting OCR errors with web data. More specifically, we use Google search to draw on the large body of text available on the web and identify possible correction candidates. We then use a combination of the longest common subsequence (LCS) and Bayesian estimates to automatically pick the proper candidate. Our experimental results on a small set of historical newspaper data show a recall of 51% and a precision of 100%. The paper further provides a detailed classification and analysis of all errors. In particular, we point out where our approach falls short in suggesting proper candidates for the remaining errors.

Paper Details

Date Published: 24 March 2014
PDF: 6 pages
Proc. SPIE 9021, Document Recognition and Retrieval XXI, 902109 (24 March 2014); doi: 10.1117/12.2042403
Kazem Taghva, Univ. of Nevada, Las Vegas (United States)
Shivam Agarwal, Univ. of Nevada, Las Vegas (United States)

Published in SPIE Proceedings Vol. 9021:
Document Recognition and Retrieval XXI
Bertrand Coüasnon; Eric K. Ringger, Editor(s)

© SPIE.