Proceedings Paper

Evaluating supervised topic models in the presence of OCR errors
Author(s): Daniel Walker; Eric Ringger; Kevin Seppi

Paper Abstract

Supervised topic models are promising tools for text analytics that simultaneously model topical patterns in document collections and the relationships between those topics and document metadata, such as timestamps. We empirically examine the effect of OCR noise on the ability of supervised topic models to produce high-quality output. In a series of experiments, we evaluate three supervised topic models and a naive baseline on synthetic OCR data with various levels of degradation and on real OCR data from two different decades. The evaluation includes experiments with and without feature selection. Our results suggest that supervised topic models are no more robust to OCR errors than unsupervised topic models, or at least not much more robust, and that feature selection has the mixed result of improving topic quality while harming metadata prediction quality. For users of topic modeling methods on OCR data, supervised topic models do not yet solve the problem of finding better topics than the original unsupervised topic models.

Paper Details

Date Published: 4 February 2013
PDF: 12 pages
Proc. SPIE 8658, Document Recognition and Retrieval XX, 865812 (4 February 2013); doi: 10.1117/12.2008345
Daniel Walker, Brigham Young Univ. (United States)
Eric Ringger, Brigham Young Univ. (United States)
Kevin Seppi, Brigham Young Univ. (United States)

Published in SPIE Proceedings Vol. 8658:
Document Recognition and Retrieval XX
Richard Zanibbi; Bertrand Coüasnon, Editor(s)
