Proceedings Paper

Extraction of text-related features for condensing image documents
Author(s): Dan S. Bloomberg; Francine R. Chen

Paper Abstract

A system has been built that selects excerpts from a scanned document for presentation as a summary, without using character recognition. The method relies on the idea that the most significant sentences in a document contain words that are both specific to the document and have a relatively high frequency of occurrence within it. Accordingly, and entirely within the image domain, each page image is deskewed and the text regions are found and extracted as a set of textblocks. Blocks with font size near the median for the document are selected and then placed in reading order. The textlines and words are segmented, and the words are placed into equivalence classes of similar shape. The sentences are identified by finding baselines for each line of text and analyzing the size and location of the connected components relative to the baseline. Scores can then be given to each word, depending on its shape and frequency of occurrence, and to each sentence, depending on the scores for the words in the sentence. Other salient features, such as textblocks that have a large font or are likely to contain an abstract, can also be used to select image parts that are likely to be thematically relevant. The method has been applied to a variety of documents, including articles scanned from magazines and technical journals.
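
The scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the word images have already been grouped into shape equivalence classes (represented here by hypothetical integer ids), uses a hypothetical per-class pixel width as the only shape feature, and assumes that narrow word images are mostly function words that should not contribute to a sentence's score. The names score_sentences, select_excerpts, class_width and min_width are all illustrative.

```python
from collections import Counter


def score_sentences(sentences, class_width, min_width=40):
    """Score each sentence from the scores of its word classes.

    sentences   -- list of sentences in reading order; each sentence is a
                   list of word equivalence-class ids (no character
                   recognition is involved, only shape classes)
    class_width -- dict mapping class id -> representative width in pixels
                   (hypothetical stand-in for the word's shape)
    min_width   -- assumed cutoff below which a word is treated as a
                   short function word and scored zero
    """
    # Frequency of each shape class across the whole document.
    counts = Counter(cid for sent in sentences for cid in sent)

    def word_score(cid):
        # Assumption: narrow word images carry little thematic content.
        if class_width[cid] < min_width:
            return 0.0
        return float(counts[cid])

    scores = []
    for sent in sentences:
        if not sent:
            scores.append(0.0)
        else:
            # Mean word score, so long sentences are not favored by length alone.
            scores.append(sum(word_score(c) for c in sent) / len(sent))
    return scores


def select_excerpts(sentences, class_width, k=2, min_width=40):
    """Return the indices of the k best-scoring sentences, in reading order."""
    scores = score_sentences(sentences, class_width, min_width)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)


if __name__ == "__main__":
    # Toy document: four sentences over five word-shape classes.
    class_width = {0: 20, 1: 90, 2: 75, 3: 25, 4: 60}
    sentences = [[0, 1, 3, 2], [0, 4, 3], [1, 2, 0, 1], [3, 0, 4]]
    print(select_excerpts(sentences, class_width, k=2))
```

In this toy input, classes 0 and 3 fall below the assumed width cutoff and score zero, so the first and third sentences, which contain the frequent wide classes 1 and 2, are the ones selected. A real system would return the corresponding sentence image regions rather than indices.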

Paper Details

Date Published: 7 March 1996
PDF: 17 pages
Proc. SPIE 2660, Document Recognition III, (7 March 1996); doi: 10.1117/12.234726
Author Affiliations
Dan S. Bloomberg, Xerox Palo Alto Research Ctr. (United States)
Francine R. Chen, Xerox Palo Alto Research Ctr. (United States)

Published in SPIE Proceedings Vol. 2660:
Document Recognition III
Luc M. Vincent; Jonathan J. Hull, Editor(s)

© SPIE