Proceedings Paper

Correcting OCR text by association with historical datasets
Author(s): Susan E. Hauser; Jonathan Schlaifer; Tehseen F. Sabir; Dina Demner-Fushman; Scott Straughan; George R. Thoma

Paper Abstract

The Medical Article Records System (MARS) developed by the Lister Hill National Center for Biomedical Communications uses scanning, OCR and automated recognition and reformatting algorithms to generate electronic bibliographic citation data from paper biomedical journal articles. The OCR server incorporated in MARS performs well in general, but fares less well with text printed in small or italic fonts. Affiliations are often printed in small italic fonts in the journals processed by MARS. Consequently, although the automatic processes generate much of the citation data correctly, the affiliation field frequently contains incorrect data, which must be manually corrected by verification operators. In contrast, author names are usually printed in large, normal fonts that are correctly converted to text by the OCR server. The National Library of Medicine’s MEDLINE database contains 11 million indexed citations for biomedical journal articles. This paper documents our effort to use the historical author/affiliation relationships from this large dataset to find potential correct affiliations for MARS articles based on the author and the affiliation in the OCR output. Preliminary tests using a table of about 400,000 author/affiliation pairs extracted from the corrected data from MARS indicated that about 44% of the author/affiliation pairs were repeats and that about 47% of newly converted author names would be found in this set. A text-matching algorithm was developed to determine the likelihood that an affiliation found in the table corresponding to the OCR text of the first author was the current, correct affiliation. The algorithm uses the OCR text of the first author to look up candidate affiliations in the author/affiliation table, compares each candidate to the OCR output affiliation, and calculates a score indicating how similar the two are.
Using a ground truth set of 519 OCR author/OCR affiliation/correct affiliation triples, the matching algorithm is able to select a correct affiliation for the author 43% of the time with a false positive rate of 6%, a true negative rate of 44% and a false negative rate of 7%. MEDLINE citations with United States affiliations typically include the zip code. In addition to using author names as clues to correct affiliations, we are investigating the value of the OCR text of zip codes as clues to correct USA affiliations. Current work includes generation of an author/affiliation/zipcode table from the entire MEDLINE database and development of a daemon module to implement affiliation selection and matching for the MARS system using both author names and zip codes. Preliminary results from the initial version of the daemon module and the partially filled author/affiliation/zipcode table are encouraging.
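
The lookup-and-score step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual scoring function, the table layout, or the acceptance threshold, so everything here is an assumption: `difflib.SequenceMatcher` stands in for the paper's text-matching algorithm, the names `AUTHOR_TABLE`, `similarity` and `select_affiliation` are hypothetical, the threshold of 0.6 is arbitrary, and the table entries are made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical author/affiliation table: normalized author name -> known
# affiliations. The entries are illustrative only, not data from the paper.
AUTHOR_TABLE = {
    "hauser se": [
        "Lister Hill National Center for Biomedical Communications, "
        "National Library of Medicine, Bethesda, MD 20894, USA",
        "Department of Chemistry, Example University, Springfield",
    ],
}


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def similarity(candidate, ocr_affiliation):
    """Return a score in [0, 1]; a stand-in for the paper's matching score."""
    matcher = SequenceMatcher(
        None, normalize(candidate), normalize(ocr_affiliation), autojunk=False
    )
    return matcher.ratio()


def select_affiliation(ocr_author, ocr_affiliation, table, threshold=0.6):
    """Pick the table affiliation most similar to the OCR affiliation.

    Returns (affiliation, score), or (None, best_score) when the author is
    not in the table or no candidate reaches the assumed threshold.
    """
    best, best_score = None, 0.0
    for candidate in table.get(normalize(ocr_author), []):
        score = similarity(candidate, ocr_affiliation)
        if score > best_score:
            best, best_score = candidate, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# OCR output with typical small-italic-font errors ("l" -> "1", "m" -> "rn").
noisy = (
    "Lister Hi11 Nationa1 Center for Biornedical Cornmunications, "
    "National Library of Medicine, Bethesda, MD 20894, USA"
)
chosen, score = select_affiliation("Hauser SE", noisy, AUTHOR_TABLE)
print(chosen, round(score, 2))
```

In this sketch a returned affiliation would be offered to the verification operator as a suggested correction, while a `None` result leaves the OCR text for manual correction. A real system would tune the threshold against a ground truth set such as the one described above.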

Paper Details

Date Published: 13 January 2003
PDF: 10 pages
Proc. SPIE 5010, Document Recognition and Retrieval X, (13 January 2003); doi: 10.1117/12.476046
Author Affiliations
Susan E. Hauser, National Library of Medicine (United States)
Jonathan Schlaifer, National Library of Medicine (United States)
Tehseen F. Sabir, National Library of Medicine (United States)
Dina Demner-Fushman, National Library of Medicine (United States)
Scott Straughan, National Library of Medicine (United States)
George R. Thoma, National Library of Medicine (United States)


Published in SPIE Proceedings Vol. 5010:
Document Recognition and Retrieval X
Tapas Kanungo; Elisa H. Barney Smith; Jianying Hu; Paul B. Kantor, Editor(s)
