
Proceedings Paper

A synthetic document image dataset for developing and evaluating historical document processing methods
Author(s): Daniel Walker; William Lund; Eric Ringger

Paper Abstract

Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive. As a result, existing real-world document image datasets with such accompanying resources are rare and often relatively small. We introduce synthetic document image datasets of varying levels of noise that have been created from standard (English) text corpora using an existing document degradation model applied in a novel way. Included in the datasets is the OCR output from real OCR engines including the commercial ABBYY FineReader and the open-source Tesseract engines. These synthetic datasets are designed to exhibit some of the characteristics of an example real-world document image dataset, the Eisenhower Communiqués. The new datasets also benefit from additional metadata that exist due to the nature of their collection and prior labeling efforts. We demonstrate the usefulness of the synthetic datasets by training an existing multi-engine OCR correction method on the synthetic data and then applying the model to reduce word error rates on the historical document dataset. The synthetic datasets will be made available for use by other researchers.

Paper Details

Date Published: 23 January 2012
PDF: 8 pages
Proc. SPIE 8297, Document Recognition and Retrieval XIX, 829710 (23 January 2012); doi: 10.1117/12.912203
Author Affiliations
Daniel Walker, Brigham Young Univ. (United States)
William Lund, Brigham Young Univ. (United States)
Eric Ringger, Brigham Young Univ. (United States)

Published in SPIE Proceedings Vol. 8297:
Document Recognition and Retrieval XIX
Christian Viard-Gaudin; Richard Zanibbi, Editor(s)

© SPIE.