The scale of document image collections, already vast, continues to rapidly expand. Information extraction represents an important R&D field at the interface between document image analysis and information retrieval. The goals of massive book scanning and digital library projects extend beyond simply making available scanned images of physical documents. Ideally, collections should be indexable, searchable, and reusable. Relying on human input to transcribe documents and encode metadata is expensive, and automation of postprocessing tends to focus on a particular niche that cannot be adapted to other areas.
Given a large, diverse collection of document images, what is the first useful action we can perform? We initially emphasize versatility over accuracy. While successful programs for reading bank checks and postal codes achieve extremely high accuracy, these methods are not readily adaptable to other domains. The problem, as we view it, is to classify the different regions of content in a document image, such as machine print (MP), handwriting (HW), photographic (PH), and blank (BL). We emphasize various means of classification that include multiple languages (currently English, Chinese, and Arabic) together with full color, bitonal, and greyscale images. We test our methods on a vast variety of sources bearing many different qualities.
However, the results we achieve at this stage do not extend the ability of an end user to search or retrieve instances from a huge collection of documents. A second part of our research employs analysis using the classification results to make collections searchable and retrievable based on content inventories of its images.
Given the abundant information readily and freely accessible today and our goal of maximum versatility, we want to use as much data as possible in training for retrieval. As a consequence, we investigate the design of classifiers on potentially billions of samples. Novel aspects of our research include the selection of a training sample and classifying document images at the pixel level as compared to most approaches that classify content in rectilinear zones. This avoids the arbitrary nature of using shapes to zone documents that may not always correspond to page layout. It also allows us to easily achieve training samples, as intended, on the order of billions.
We have investigated a wide range of automatically trainable classification technologies, including brute-force k-Nearest Neighbors (kNN), k-dimensional (k-d) trees, classification and regression trees, and locality-sensitive hashing.1–4 We would consider kNN nearly ideal for this problem were it not for the high cost in both time and space of naive implementations, and the difficulty of crafting algorithms that approach required performance in high dimensions. A popular, practical method for approximating kNN in higher dimensions is Bentley's k-d trees.5 Our effort to obtain near-kNN performance in near constant expected time is by hash table lookup with non-adaptive k-d trees. This is made possible by our assumption that, in high-dimensional feature space, the distribution of data from selected document images is highly non-uniform and only populates small, dense ‘pockets’ which have been experimentally verified.4
Our experiments are designed to increase limits for the amount of data the classifier can handle while discriminating between our four content types (MP, HW, PH and BL). Currently we use a training set consisting of 80 images from over 50 sources, totalling 416 million training samples (pixels), and it was tested on 150 images. Figures 1 and 2 represent an example. Because each individual pixel is classified, the classifier is able to capture the non-rectangular layout of the document avoiding the need for preprocessing to correct skew.
Figure 1. A sample test image. This full color magazine article contains three content classes: machine print, photographic, and blank. Scanning the document introduced skew.
Figure 2. The classified result. Since we consider each sample to be a single pixel, we can assign a color to each content class and view the output as itself an image. Dark blue pixels are those classified as machine print, aqua as photographic, and white as blank. The per-pixel classification accuracy for this image was 61.28%. However, subjectively, the classifier appears to have done a much better job than the per-pixel score would suggest. Despite skew in the original image and the non-rectangular layout, the classifier captures the actual layout of text. Because we are classifying individual pixels, some areas of the page are affected by misclassified `noise’ from other content classes. Correcting these mistakes is an area of ongoing research.
For this type of classifier, a traditional performance evaluation metric would be to count the number of pixels that are correctly classified compared to the actual image. While such ‘per-pixel accuracy’ can be a helpful diagnostic tool for making engineering decisions, it is a single score of little use to an end user who may want to search or retrieve images. We propose a new measure, both for evaluation and as a document retrieval tool, that considers the inventory of content. Employ, for example, an information retrieval query of the form, ‘Find all images in a collection that includes 20-50% MP and 10-20% PH.’ Experiments2,6 show that even modest per-pixel classification accuracies (e.g., 60–70%) support usefully high recall and precision rates (80–90%) for queries on collections of documents. We believe this measure provides an end user with a useful means of indexing and searching large collections. We have also begun to develop a theory relating per-pixel inventories and content inventory scores.
The use of hashed k-d trees in approximating kNN in high dimensions appears to be promising when applied to domain document image content extraction. Experiments show that they provide us with a classifier potentially capable of identifying content in vast heterogenous document image collections. Areas for future work include scaling up the training sets by orders of magnitude, increasing the number of content classes, and exploring ways to reduce the original classification errors.
I would like to acknowledge the continually helpful advice and assistance of Henry Baird, Matthew Casey, Chang An, Pingping Xiu, and Jon Bentley (of Avaya Labs Research).
Department of Computer Science and Engineering
Michael Moll is a Ph.D. candidate in the Department of Computer Science and Engineering at Lehigh University.
1. H. S. Baird, M. A. Moll, J. Nonnemaker, M. R. Casey, D. L. Delorenzo, Versatile Document Image Content Extraction, Proc., SPIE/IS&T Document Recognition & Retrieval XIII Conf., San Jose, CA, 2006.