Journal of Electronic Imaging
Image extraction in digital documents
Images included in documents usually provide information that may not be readily expressible in words. For example, academic articles with similar pictures may be of interest to researchers. We deal with the problem of extracting images from digital documents. Given a digital document, the optimal block size is first determined by finding the best fit of the horizontally projected gray-level pattern to a set of orthogonal basis vectors. Because a block of the optimal size is supposed to contain sufficient information to identify text regions, the proposed method is independent of font size. The document is divided into blocks of the optimal size, and each block is classified as an image, text, or background block. This block classification result, in turn, serves as the initial configuration for blockwise document segmentation. The blockwise segmentation method is based on the maximum a posteriori (MAP) framework with a deterministic relaxation algorithm. After the blockwise segmentation, each boundary block in the image region is further divided into four subblocks, and the class labels of these subblocks are updated. These subdivision and class-updating steps are executed recursively until a pixel-level segmentation is obtained. Experimental results show that the proposed image extraction method yields an error rate of 2.9% for 232 documents in the Oulu database.
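
The coarse-to-fine refinement described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: the block size is a power of two, a "boundary block" is an image block with at least one 4-connected neighbor of another class, and the class update is delegated to a caller-supplied callback (the hypothetical `relabel`) rather than the MAP-based update used in the paper. The function names are likewise hypothetical, and the recursion is written as a loop.

```python
# Labels are plain strings; TEXT is listed for completeness only.
IMAGE, TEXT, BACKGROUND = "image", "text", "background"


def is_boundary(labels, r, c):
    """Assumed definition: an IMAGE block with a non-IMAGE 4-neighbor."""
    if labels[r][c] != IMAGE:
        return False
    rows, cols = len(labels), len(labels[0])
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc] != IMAGE:
            return True
    return False


def refine_once(labels, size, relabel):
    """Split every block into 2x2 subblocks of side size // 2.

    Subblocks inherit the parent's label; only the four subblocks of a
    boundary block are re-examined through relabel(sub_row, sub_col, side).
    """
    half = size // 2
    fine = []
    for row in labels:
        doubled = [lab for lab in row for _ in range(2)]
        fine.append(doubled)
        fine.append(list(doubled))
    for r, row in enumerate(labels):
        for c in range(len(row)):
            if is_boundary(labels, r, c):
                for sr in (2 * r, 2 * r + 1):
                    for sc in (2 * c, 2 * c + 1):
                        fine[sr][sc] = relabel(sr, sc, half)
    return fine, half


def refine_to_pixels(labels, size, relabel):
    """Repeat the subdivision until each block is a single pixel."""
    if size < 1 or size & (size - 1):
        raise ValueError("this sketch assumes a power-of-two block size")
    while size > 1:
        labels, size = refine_once(labels, size, relabel)
    return labels
```

Here `labels` is the block-level label grid produced by the blockwise segmentation, `size` is the block side length in pixels, and `relabel` receives a subblock's row and column indices (in units of the new, halved block size) together with that size. Because interior image blocks and non-image blocks are never revisited, the work at each level is concentrated along the current image boundary.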