
Proceedings Paper

Content-based document image retrieval in complex document collections
Author(s): G. Agam; S. Argamon; O. Frieder; D. Grossman; D. Lewis

Paper Abstract

We address the problem of content-based image retrieval in the context of complex document images. Complex documents typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting) and may include diagrams, graphics, tables, and other non-textual elements. Large collections of such complex documents are commonly found in legal and security investigations. The indexing and analysis of large document collections are currently limited to textual features based on OCR data and ignore the structural context of the document as well as important non-textual elements such as signatures, logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored because of the inherent complexity of offline handwriting recognition. We address important research issues concerning content-based document image retrieval and describe a prototype that we are developing for integrated retrieval and aggregation of the diverse information contained in scanned paper documents. Such complex document information processing combines several forms of image processing with textual and linguistic processing to enable effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are developing a test collection containing millions of document images.
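
As a rough illustration of what integrating image-derived metadata with text search could look like, the following minimal sketch filters documents on both OCR text and a set of detected non-textual elements. It is not the authors' prototype: the `DocumentRecord` type, the `search` function, the element labels, and the sample records are all hypothetical and introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentRecord:
    """Hypothetical record for one scanned document image."""
    doc_id: str
    ocr_text: str  # text recovered by OCR
    # Labels of detected non-textual elements, e.g. {"signature", "table"}.
    elements: set[str] = field(default_factory=set)


def search(records, terms, required_elements=()):
    """Return ids of documents whose OCR text contains every term and
    whose metadata lists every required non-textual element."""
    wanted_terms = {t.lower() for t in terms}
    wanted_elements = set(required_elements)
    hits = []
    for rec in records:
        tokens = set(rec.ocr_text.lower().split())
        # Both conditions are subset tests: text terms AND metadata labels.
        if wanted_terms <= tokens and wanted_elements <= rec.elements:
            hits.append(rec.doc_id)
    return hits


# Invented sample data, for illustration only.
records = [
    DocumentRecord("d1", "quarterly budget memo", {"signature", "logo"}),
    DocumentRecord("d2", "quarterly budget report", {"table"}),
    DocumentRecord("d3", "handwritten note", {"signature"}),
]

matches = search(records, ["budget"], required_elements=["signature"])
```

Here a query for the term "budget" restricted to documents carrying a signature combines a text condition with a metadata condition in a single pass. A real system would presumably use an inverted index and ranked retrieval rather than a linear scan.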

Paper Details

Date Published: 29 January 2007
PDF: 12 pages
Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000S (29 January 2007); doi: 10.1117/12.703163
Author Affiliations
G. Agam, Illinois Institute of Technology (United States)
S. Argamon, Illinois Institute of Technology (United States)
O. Frieder, Illinois Institute of Technology (United States)
D. Grossman, Illinois Institute of Technology (United States)
D. Lewis, David D. Lewis Consulting (United States)

Published in SPIE Proceedings Vol. 6500:
Document Recognition and Retrieval XIV
Xiaofan Lin; Berrin A. Yanikoglu, Editor(s)
