Journal of Electronic Imaging

Image extraction in digital documents
Author(s): Chee Sun Won

Paper Abstract

Images included in documents often convey information that is not readily expressible in words. For example, academic articles with similar pictures may be of interest to researchers. We address the problem of extracting images from digital documents. Given a digital document, the optimal block size is first determined by finding the best fit of the horizontally projected gray-level pattern to a set of orthogonal basis vectors. Because a block of the optimal size is expected to contain sufficient information to identify text regions, the proposed method is independent of the font size used in the text lines. The blocks obtained with the optimal block size are each classified as image, text, or background. This block classification result, in turn, is used as the initial configuration for blockwise document segmentation. The blockwise segmentation method is based on the maximum a posteriori (MAP) framework with a deterministic relaxation algorithm. After the blockwise segmentation, each boundary block in the image region is further divided into four subblocks, and the class labels of these subblocks are updated. This subdivision and label updating is executed recursively until a pixel-level segmentation is obtained. Experimental results show that the proposed image extraction method yields an error rate of 2.9% on 232 documents in the Oulu database.
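
The recursive refinement step described in the abstract (split each boundary block into four subblocks, update their labels, repeat down to pixel level) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the label update rule, so the sketch takes a caller-supplied `classify` function, and `toy_classify` is a purely hypothetical stand-in based on a mean gray-level threshold. The names `is_boundary` and `refine_labels` are likewise illustrative, a "boundary block" is assumed to be any block whose label differs from that of a 4-connected neighbor, and the block size is assumed to be a power of two.

```python
IMAGE, BACKGROUND = "image", "background"


def toy_classify(pixels):
    """Hypothetical stand-in for the label update rule (not from the paper):
    call a subblock IMAGE when its mean gray level is at least 128."""
    return IMAGE if sum(pixels) / len(pixels) >= 128 else BACKGROUND


def is_boundary(labels, r, c):
    """Assumed definition: a block is a boundary block if any 4-connected
    neighbor carries a different class label."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(labels) and 0 <= nc < len(labels[0]):
            if labels[nr][nc] != labels[r][c]:
                return True
    return False


def refine_labels(image, labels, block, classify):
    """Coarse-to-fine refinement of a blockwise label map.

    image    : 2-D list of gray levels, size (rows * block) x (cols * block)
    labels   : 2-D list of class labels, one per block
    block    : current block side length in pixels (assumed a power of two)
    classify : function mapping a flat list of pixels to a class label
    """
    while block > 1:
        half = block // 2
        rows, cols = len(labels), len(labels[0])
        # Every subblock starts out with its parent's label.
        fine = [[labels[r // 2][c // 2] for c in range(2 * cols)]
                for r in range(2 * rows)]
        for r in range(rows):
            for c in range(cols):
                if not is_boundary(labels, r, c):
                    continue
                # Relabel only the four subblocks of a boundary block.
                for fr in (2 * r, 2 * r + 1):
                    for fc in (2 * c, 2 * c + 1):
                        pixels = [image[y][x]
                                  for y in range(fr * half, (fr + 1) * half)
                                  for x in range(fc * half, (fc + 1) * half)]
                        fine[fr][fc] = classify(pixels)
        labels, block = fine, half
    return labels
```

Interior blocks keep their inherited labels at every level, so only blocks along region borders are re-examined as the resolution doubles. In the paper the initial `labels` would come from the MAP-based blockwise segmentation; here they are simply passed in.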

Paper Details

Date Published: 1 July 2008
PDF: 7 pages
J. Electron. Imaging. 17(3) 033016 doi: 10.1117/1.2970151
Published in: Journal of Electronic Imaging Volume 17, Issue 3
Chee Sun Won, Dongguk Univ. (Korea, Republic of)
