Proceedings Paper

The Bible, truth, and multilingual OCR evaluation
Author(s): Tapas Kanungo; Philip Resnik

Paper Abstract

In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with groundtruth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora with similar properties, such as the Koran and the Bhagavad Gita. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.

Paper Details

Date Published: 7 January 1999
PDF: 11 pages
Proc. SPIE 3651, Document Recognition and Retrieval VI, (7 January 1999); doi: 10.1117/12.335806
Author Affiliations
Tapas Kanungo, Univ. of Maryland/College Park (United States)
Philip Resnik, Univ. of Maryland/College Park (United States)

Published in SPIE Proceedings Vol. 3651:
Document Recognition and Retrieval VI
Daniel P. Lopresti; Jiangying Zhou, Editor(s)
