
Proceedings Paper

Cross-validation comparison of NIST OCR databases
Author(s): Patrick J. Grother

Paper Abstract

The quality of reference databases for optical character recognition is vital to the meaningful assessment of classification algorithms. NIST has produced two databases of segmented handprinted characters obtained from socially distinct writer populations. Two approaches to the comparison of the databases are described. The first uses the eigenvalue spectrum of the covariance matrix as an a priori measure of the variance intrinsic to the data. The second cross-validates the datasets, using classification error to quantify the difficulty of OCR. The eigenvalue spectra from the training partitions of the datasets are generated during the production of the Karhunen-Loève (KL) transforms, the leading components of which are used as prototype features for a classifier. The eigenspectra are used to quantify the diversity of the character sets, and the Bhattacharyya distance is used to measure class separability. The digits, uppers and lowers from the two populations of 500 writers are partitioned into N disjoint sets. The KL transforms of each such set in turn are used for testing, while the remaining N-1 sets form the training prototypes for a PNN nearest neighbor classifier. Recognition error rates and their variances are calculated over the N partitions for both databases independently. This quantifies intra-database diversity. The inter-database results, or "cross" terms, obtained by training and testing on different databases, indicate the generality of the training set. The results for digits suggest that the second NIST database (used nominally for testing) is significantly harder than the first (training) set; the testing images show 11% more variance. The NIST training data classifies partitions of itself with 1.7% error, and the test set with 6.8% error. Conversely, the test set generalizes to both itself and the training data with 3.5% error. This effect has also been reported using non-NIST classifiers.
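
The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration: the synthetic Gaussian "databases", the function names (kl_basis, nn_error, cross_validate, cross_term), the random partitioning by sample rather than by writer, and the plain 1-nearest-neighbor rule standing in for the PNN classifier. It shows the eigenvalue spectrum being obtained while building the KL basis, N-fold cross-validation within one database, and a "cross" term computed by training on one database and testing on the other.

```python
import numpy as np

def kl_basis(x, k):
    """Return the mean, the k leading covariance eigenvectors, and the
    full eigenvalue spectrum sorted in descending order."""
    mean = x.mean(axis=0)
    cov = np.cov(x - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # eigh returns ascending order
    order = np.argsort(vals)[::-1]
    return mean, vecs[:, order[:k]], vals[order]

def nn_error(train_f, train_y, test_f, test_y):
    """Error rate of a 1-nearest-neighbor rule (a stand-in for PNN)."""
    d = ((test_f[:, None, :] - train_f[None, :, :]) ** 2).sum(axis=2)
    pred = train_y[d.argmin(axis=1)]
    return float((pred != test_y).mean())

def cross_validate(x, y, n_parts, k, seed=0):
    """Mean and variance of the error rate over n_parts disjoint partitions.
    Assumption: samples are partitioned at random, not by writer."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_parts)
    errors = []
    for i in range(n_parts):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        mean, basis, _ = kl_basis(x[train_idx], k)   # basis from training data only
        train_f = (x[train_idx] - mean) @ basis
        test_f = (x[test_idx] - mean) @ basis
        errors.append(nn_error(train_f, y[train_idx], test_f, y[test_idx]))
    return float(np.mean(errors)), float(np.var(errors))

def cross_term(x_train, y_train, x_test, y_test, k):
    """Inter-database error: train on one database, test on the other."""
    mean, basis, _ = kl_basis(x_train, k)
    return nn_error((x_train - mean) @ basis, y_train,
                    (x_test - mean) @ basis, y_test)

def synthetic_db(n_per_class, spread, seed, n_classes=3, dim=16):
    """Hypothetical stand-in for a character database: Gaussian classes
    around shared centers, with a database-specific spread."""
    centers = np.random.default_rng(0).normal(size=(n_classes, dim)) * 4.0
    rng = np.random.default_rng(seed)
    x = np.concatenate([c + spread * rng.normal(size=(n_per_class, dim))
                        for c in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return x, y

# Two synthetic databases; the second is deliberately more variable.
xa, ya = synthetic_db(60, spread=1.0, seed=1)
xb, yb = synthetic_db(60, spread=2.0, seed=2)

_, _, spec_a = kl_basis(xa, k=8)
_, _, spec_b = kl_basis(xb, k=8)
variance_ratio = spec_b.sum() / spec_a.sum()     # a priori variance comparison

err_a, var_a = cross_validate(xa, ya, n_parts=5, k=8)
err_b, var_b = cross_validate(xb, yb, n_parts=5, k=8)
err_a_on_b = cross_term(xa, ya, xb, yb, k=8)     # train on A, test on B
err_b_on_a = cross_term(xb, yb, xa, ya, k=8)     # train on B, test on A
```

In this sketch the summed eigenvalue spectrum is only one possible summary of intrinsic variance; the abstract does not state how its 11% figure was derived, so variance_ratio should not be read as reproducing it.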

Paper Details

Date Published: 14 April 1993
PDF: 12 pages
Proc. SPIE 1906, Character Recognition Technologies, (14 April 1993); doi: 10.1117/12.143632
Patrick J. Grother, National Institute of Standards and Technology (United States)

Published in SPIE Proceedings Vol. 1906:
Character Recognition Technologies
Donald P. D'Amato, Editor(s)

© SPIE.