Proceedings Paper

Approximate string matching algorithms for limited-vocabulary OCR output correction
Author(s): Thomas A. Lasko; Susan E. Hauser

Paper Abstract

Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The five methods (an adaptation of the cross-correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary) were implemented and their accuracy rates compared. Of the five, the Bayesian algorithm produced the most correct matches (87%) and had the advantage of producing scores with a useful and practical interpretation.
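
As an illustration of the second and third methods named in the abstract, the following is a minimal sketch of dictionary matching by edit distance. It is not the authors' implementation: the function names `edit_distance` and `best_match`, the unit insertion and deletion costs, and the sample words are all assumptions made for this example. The optional `sub_cost` parameter only hints at where a probabilistic substitution matrix could be plugged in; the paper's actual matrix is not reproduced here.

```python
from typing import Callable, Iterable, Tuple


def unit_sub_cost(a: str, b: str) -> float:
    """Generic substitution cost: 0 for a match, 1 for any mismatch."""
    return 0.0 if a == b else 1.0


def edit_distance(source: str, target: str,
                  sub_cost: Callable[[str, str], float] = unit_sub_cost) -> float:
    """Edit distance with unit insert/delete costs and a pluggable substitution cost.

    Passing a cost function derived from OCR confusion statistics is one
    (hypothetical) way to approximate a probabilistic substitution matrix.
    """
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,                           # delete s_char
                current[j - 1] + 1.0,                        # insert t_char
                previous[j - 1] + sub_cost(s_char, t_char),  # substitute or match
            ))
        previous = current
    return previous[-1]


def best_match(ocr_word: str, dictionary: Iterable[str]) -> Tuple[str, float]:
    """Return the dictionary word closest to ocr_word, with its distance.

    Ties are resolved in favor of the word that appears first.
    """
    return min(((word, edit_distance(ocr_word, word)) for word in dictionary),
               key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical OCR error: "m" misread as "rn".
    print(best_match("rnedicine", ["medicine", "medical", "library"]))
```

A plain edit distance treats every substitution as equally costly, whereas a cost function weighted by observed OCR confusions lets common misreadings count as cheaper edits.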

Paper Details

Date Published: 21 December 2000
PDF: 9 pages
Proc. SPIE 4307, Document Recognition and Retrieval VIII, (21 December 2000); doi: 10.1117/12.410841
Thomas A. Lasko, Gundersen Lutheran Medical Ctr. (United States)
Susan E. Hauser, National Library of Medicine (United States)

Published in SPIE Proceedings Vol. 4307:
Document Recognition and Retrieval VIII
Paul B. Kantor; Daniel P. Lopresti; Jiangying Zhou, Editor(s)
