Proceedings Paper

How well does multiple OCR error correction generalize?
Author(s): William B. Lund; Eric K. Ringger; Daniel D. Walker

Paper Abstract

As the digitization of historical documents, such as newspapers, becomes more common, archive patrons increasingly need accurate digital text from those documents. Building on our earlier work, the contributions of this paper are: (1) demonstrating the applicability of novel methods for correcting optical character recognition (OCR) on disparate data sets, including a new synthetic training set; (2) enhancing the correction algorithm with novel features; and (3) assessing the data requirements of the correction learning method. First, we correct errors using conditional random fields (CRFs) trained on synthetic training data sets in order to demonstrate the applicability of the methodology to unrelated test sets. Second, we show the strength of lexical features from the training sets on two unrelated test sets, yielding a relative reduction in word error rate (WER) on the test sets of 6.52%. New features capture the recurrence of hypothesis tokens and yield an additional relative reduction in WER of 2.30%. Further, we show that only 2.0% of the full training corpus of over 500,000 feature cases is needed to achieve correction results comparable to those using the entire training corpus, effectively reducing both the complexity of the training process and the size of the learned correction model.
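The abstract describes extracting feature cases from multiple aligned OCR hypotheses, including lexical features (is the token a dictionary word?) and recurrence features (how often a hypothesis token reappears). As a rough illustration of that idea, not the paper's actual feature set or CRF pipeline, the sketch below builds per-candidate feature dictionaries from aligned hypothesis columns; the function name, feature names, and input format are all hypothetical.

```python
from collections import Counter

def feature_cases(columns, lexicon):
    """Hypothetical sketch: given aligned columns of hypothesis tokens
    (one column per word position, one token per OCR engine), emit one
    feature dict per distinct candidate token. These dicts stand in for
    the "feature cases" a sequence labeler such as a CRF would consume."""
    # Document-level recurrence of each hypothesis token.
    doc_counts = Counter(tok for col in columns for tok in col)
    cases = []
    for col in columns:
        votes = Counter(col)  # agreement among the OCR engines at this position
        for tok in sorted(set(col)):
            cases.append({
                "token": tok,
                "in_lexicon": tok.lower() in lexicon,   # lexical feature
                "votes": votes[tok],                    # column-level agreement
                "doc_recurrence": doc_counts[tok],      # recurrence feature
            })
    return cases
```

For example, with three engines reading "the cat" where one engine misreads "the" as "tne", the misrecognized candidate receives a low vote count, no lexicon support, and low document recurrence, which is exactly the kind of signal a learned correction model can exploit.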

Paper Details

Date Published: 24 March 2014
PDF: 13 pages
Proc. SPIE 9021, Document Recognition and Retrieval XXI, 90210A (24 March 2014); doi: 10.1117/12.2042502
Author Affiliations:
William B. Lund, Brigham Young Univ. (United States)
Eric K. Ringger, Brigham Young Univ. (United States)
Daniel D. Walker, Microsoft Corp. (United States)

Published in SPIE Proceedings Vol. 9021:
Document Recognition and Retrieval XXI
Bertrand Coüasnon; Eric K. Ringger, Editor(s)

© SPIE.