Page-layout-classification methodologies aim to extract text and non-textual regions such as graphics, photos, or logos. These techniques have applications in digital document storage and retrieval, where efficient memory consumption and fast access are required.1 Such classification algorithms can also be used in the printing industry for selective or enhanced scanning and object-oriented rendering (printing different parts of a document at different resolutions depending on the content).2 Additionally, these techniques can serve as an initial step for various applications, including optical-character recognition (the electronic translation of handwritten or printed text into machine-encoded text) and graphic interpretation (classifying documents—into military, educational, and others—according to the image content).3
In the past two decades, several techniques have focused on identifying text regions in scanned documents.4, 5 In addition, comprehensive algorithms that aim to identify both text and graphic regions have been developed.6, 7 However, these systems are limited to specific documents, such as newsletters or articles, where the background is assumed to be white.8, 9 This assumption not only excludes complex backgrounds and colored documents (such as book covers, advertisements, and flyers),10 but also limits their practicality when applied to non-ideal (complex) documents.
Figure 1. Line detection results for two different documents. (a) Original image, (b) enhanced L* channel of the CIE L*a*b* space, and (c) final segmentation map where strong-edge or strong-line and text regions are colored in yellow and green, respectively.
Figure 2. Segmentation for different types of scanned documents: (a) newsletter, (b) fax cover, (c) article, (d) correspondence, (e) advertisement, (f) address list, (g) check, and (h) color segmentation. Images are represented with blue boxes, text zones are shown in green, and cyan marks a common zone where photo and text regions overlap. The second and third rows show the classification maps generated by our algorithm for colored and gray-scale scanned documents, respectively, while the final row shows the correct (100% accurate) ground-truth classification.
We propose a page-layout-segmentation technique to extract text, image, and strong-edge or strong-line regions (actual lines in the document or transition pixels between a picture and text or a picture and background).11 The algorithm consists of four modules: pre-processing, text detection, photo detection, and strong-edge or strong-line detection. We start by applying a pre-processing module that includes image scaling and enhancement, as well as color-space conversion (RGB—red, green, and blue—to CIE L*a*b*, a color space where L* represents lightness and a* and b* are two color coordinates). Then, we employ a text-detection module based on a mathematical tool known as wavelet analysis and on so-called run-length encoding. In the third module, photo detection, we apply a technique proposed by Won,10 designated block-wise segmentation, to generate an initial photo map. This map is then improved by optimization and image-enhancement methods. The final module uses various techniques (specifically, Hough transform, edge detection, and edge-linkage procedures) to identify lines and strong edges.11 Our system is fast and robust, particularly when dealing with several types of scanned documents at different scanning resolutions. In addition, the proposed document classifier performs consistently, independent of the scanning process.
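The run-length step of the text-detection module can be illustrated with a short sketch. The snippet below (in Python for brevity; our implementation is in MATLAB) applies horizontal run-length smoothing to a binary foreground mask, bridging short background gaps so that nearby characters merge into candidate text lines. The function name and gap threshold are illustrative, not part of the published algorithm.

```python
import numpy as np

def rlsa_horizontal(mask, max_gap):
    """Run-length smoothing: within each row, fill runs of background (0)
    no longer than max_gap that lie between two foreground (1) pixels."""
    out = mask.astype(np.uint8).copy()
    for row in out:  # rows are views, so writes update `out` in place
        last_fg = -1  # column index of the previous foreground pixel
        for col, value in enumerate(row):
            if value:
                gap = col - last_fg - 1
                if last_fg >= 0 and 0 < gap <= max_gap:
                    row[last_fg + 1:col] = 1  # bridge the short gap
                last_fg = col
    return out

# Characters separated by small gaps merge into one run; the wide gap survives.
row = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 1]], dtype=np.uint8)
print(rlsa_horizontal(row, max_gap=2))  # [[1 1 1 1 0 0 0 0 1]]
```

Thresholding the wavelet-based text features before this smoothing step yields connected blobs that can then be grouped into text zones.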
Following development, we evaluated our technique using the MediaTeam document database.12 As demonstrated in Figure 1, document enhancement by the pre-processing module enables better detection accuracy. The documents shown have frames (box-lines) that outline the pages, and these are detected accurately in both images. Note that text written in a large font size is detected as a strong edge: see the second row of Figure 1(a). Additionally, the pictorial structure shown in the second document is also well segmented.
To demonstrate that our algorithm can handle complex documents, we used typical images for eight different types of scanned documents from the database (see Figure 2). In the newsletter case, an example of a simple plain document, the photo and text regions are extracted accurately, with the exception of the page number and the figure caption: see Figure 2(a). In the fax cover and correspondence—scanned documents with a white background—the main body of the text is well detected, with a few small regions missing: see Figure 2(b) and (d). As examples of scanned documents with complex colors and backgrounds, an advertisement and an article are also illustrated: see Figure 2(c) and (e). Although there is a significant reflection in the background of the article, both photo and text regions are mostly classified correctly. Small photo regions, like the Intel logo in the bottom-left corner of Figure 2(c), are misclassified as text. Another complex document, an address list, is shown in Figure 2(f). The text regions include names and numbers in addition to photo regions (section headings), all of which are well extracted. However, the title of the page at the upper-left corner of Figure 2(f) is misclassified as text. In Figure 2(g), a check document that comprises a check number, bank name, and an amount written in numbers and words is classified correctly. However, the gray-level version of the algorithm fails to detect the text region on the left side of the check image. In the example of a document that includes only a photo region, the image is detected fairly well, except for two tiny false text-zone detections: see Figure 2(h). Finally, it is worth noting that both the color and gray-level segmentation maps for the given scanned documents are very close in accuracy to the ground truth (maps that are manually generated and are, therefore, 100% accurate).
Table 1. Average segmentation accuracy for text, photo, and background regions. The data represent some 400 images, including 10 different types of scanned documents, provided by the MediaTeam document database from Oulu University.12
We present objective evaluations of photo and text regions against the ground-truth maps in Table 1. The algorithm achieves 78% classification accuracy in text regions, while the remaining 22% are misclassified. Correct classification rates for photo and background zones are 85% and 92%, respectively. The average execution time for scanned documents of about 3000×2000 pixels is 14 seconds, running in a MATLAB 7.11.0 implementation on a 2.4GHz dual-core PC.
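Per-class rates like those in Table 1 can be computed from a predicted classification map and its ground truth by counting, for each class, the fraction of ground-truth pixels that received the correct label. The following Python sketch shows the idea; the label encoding (0 = background, 1 = text, 2 = photo) is assumed here for illustration and is not prescribed by our method.

```python
import numpy as np

def per_class_accuracy(predicted, ground_truth, labels=(0, 1, 2)):
    """For each class label, return the fraction of that class's
    ground-truth pixels that the classifier labeled correctly."""
    acc = {}
    for lab in labels:
        gt_pixels = (ground_truth == lab)  # boolean mask of this class
        n = gt_pixels.sum()
        acc[lab] = float((predicted[gt_pixels] == lab).sum() / n) if n else float("nan")
    return acc

# Tiny 2x2 example: one background pixel is mislabeled as text.
pred = np.array([[1, 1], [2, 0]])
gt = np.array([[1, 0], [2, 0]])
print(per_class_accuracy(pred, gt))  # {0: 0.5, 1: 1.0, 2: 1.0}
```

Averaging these per-class rates over the evaluation set yields the figures reported in Table 1.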
In summary, we proposed a technique for page-layout classification where text, photo, and strong-edge or strong-line regions are identified. We used a variety of simple, complex, colored, and gray-scale documents to evaluate the proposed technique, and our experimental results indicate that it achieves 85% accuracy on average. More importantly, it provides consistent results for different types of documents. Future work includes assigning semantic relations to the classified text, line, photo, and background regions, as well as improving the overall performance and reducing the computation complexity.
The authors are grateful for the support of the Hewlett–Packard Corporation and the Electrical and Microelectronic Engineering Department at the Rochester Institute of Technology.
Sezer Erkilinc, Mustafa Jaber, Eli Saber
Rochester Institute of Technology
Sezer Erkilinc is an MSc candidate in the Department of Electrical and Microelectronic Engineering. He received his BSc in Electrical Engineering from Koc University, Turkey in 2009. His main research interests are in the areas of digital image understanding and content-based document analysis.
Peter Bauer, Dejan Depalov
1. L. O'Gorman, R. Kasturi, Document Image Analysis 2, pp. 8-39, IEEE Computer Society Press, 1995.
2. O. Guleryuz, Low-complexity comprehensive labeling and enhancement algorithm for compound documents, J. Electron. Imag. 13, p. 832, 2004.
3. R. Kasturi, L. O'Gorman, V. Govindaraju, Document image analysis: a primer, Sadhana 27, no. 1, pp. 3-22, 2002. doi:10.1007/BF02703309
4. J. Fisher, S. Hinds, D. D'Amato, A rule-based system for document image segmentation, Proc. 10th Int'l Conf. Pattern Recognit. 1, pp. 567-572, 1990. doi:10.1109/ICPR.1990.118166
5. S. Grover, K. Arora, S. Mitra, Text extraction from document images using edge information, IEEE India Conf., pp. 1-4, 2009. doi:10.1109/INDCON.2009.5409409
6. Z. Shi, V. Govindaraju, Multi-scale techniques for document page segmentation, Proc. 8th Int'l Conf. Doc. Analysis Recognit. 2, pp. 1020-1024, 2005. doi:10.1109/ICDAR.2005.165
7. M. Lin, J. Tapamo, B. Ndovie, A texture-based method for document segmentation and classification, J. S. African Comp. 36, pp. 49-56, 2006.
8. S. Chaudhury, M. Jindal, S. Dutta Roy, Model-guided segmentation and layout labelling of document images using a hierarchical conditional random field, Pattern Recognit. Machine Intell. 5909, pp. 375-380, 2009. doi:10.1007/978-3-642-11164-8_61
9. L. Caponetti, C. Castiello, P. Górecki, Document page segmentation using neuro-fuzzy approach, Appl. Soft Computing 8, no. 1, pp. 118-126, 2008. doi:10.1109/34.566817
10. C. Won, Image extraction in digital documents, J. Electron. Imag. 17, no. 7, 2008.
11. M. Erkilinc, M. Jaber, E. Saber, P. Bauer, D. Depalov, Page layout analysis and classification for complex scanned documents, Proc. SPIE 8135, 2011. Paper number 8135-5 at the Appl. Digital Image Processing XXXIV Conf., 22–24 August 2011.
12. J. Sauvola, H. Kauniskangas, Media Team Document Database II, 1999. http://www.mediateam.oulu.fi/downloads/MTDB/