Share Email Print

Proceedings Paper

Pattern matching techniques for correcting low-confidence OCR words in a known context
Author(s): Glenn Ford; Susan E. Hauser; Daniel X. Le; George R. Thoma
Format Member Price Non-Member Price
PDF $17.00 $21.00

Paper Abstract

A commercial OCR system is a key component of a system developed at the National Library of Medicine for the automated extraction of bibliographic fields from biomedical journals. This 5-engine OCR system, while exhibiting high performance overall, does not reliably convert very small characters, especially those that are in italics. As a result, the 'affiliations' field that typically contains such characters in most journals, is not captured accurately, and requires a disproportionately high manual input. To correct this problem, dictionaries have been created from words occurring in this field (e.g., university, department, street addresses, names of cities, etc.) from 230,000 articles already processed. The OCR output corresponding to the affiliation field is then matched against these dictionary entries by approximate string-matching techniques, and the ranked matches are presented to operators for verification. This paper outlines the techniques employed and the results of a comparative evaluation.

Paper Details

Date Published: 21 December 2000
PDF: 9 pages
Proc. SPIE 4307, Document Recognition and Retrieval VIII, (21 December 2000); doi: 10.1117/12.410842
Show Author Affiliations
Glenn Ford, National Library of Medicine (United States)
Susan E. Hauser, National Library of Medicine (United States)
Daniel X. Le, National Library of Medicine (United States)
George R. Thoma, National Library of Medicine (United States)

Published in SPIE Proceedings Vol. 4307:
Document Recognition and Retrieval VIII
Paul B. Kantor; Daniel P. Lopresti; Jiangying Zhou, Editor(s)

© SPIE. Terms of Use
Back to Top