
Proceedings Paper

Complex document information processing: prototype, test collection, and evaluation
Author(s): G. Agam; S. Argamon; O. Frieder; D. Grossman; D. Lewis

Paper Abstract

Analysis of large collections of complex documents is an increasingly important need for numerous applications. Complex documents are documents that typically start out on paper and are then electronically scanned. These documents have rich internal structure and might be available only in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting), and may include diagrams, graphics, tables, and other non-textual elements. The state of the art today for a large document collection is essentially text search of OCR'd documents, with no meaningful use of the data found in images, signatures, logos, etc. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To evaluate the effectiveness of our prototype thoroughly, we are also developing a complex document test collection of roughly 42,000,000 pages. The collection will include relevance judgments for queries at a variety of levels of detail, covering a variety of content and structural characteristics of documents, as well as "known item" queries that look for particular documents.
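As a rough illustration of what integrating image-derived metadata with text search can mean, the sketch below filters documents on both OCR text and metadata fields. It is not the authors' prototype: the `Document` record, the `search` function, and the metadata keys `has_signature` and `logo` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record: OCR text plus metadata derived from the page image."""
    doc_id: str
    ocr_text: str
    metadata: dict = field(default_factory=dict)


def search(docs, terms, **required_metadata):
    """Return ids of documents whose OCR text contains every term and whose
    metadata equals every required key/value pair (assumed exact matching)."""
    wanted = [t.lower() for t in terms]
    hits = []
    for doc in docs:
        words = set(doc.ocr_text.lower().split())
        text_ok = all(t in words for t in wanted)
        meta_ok = all(doc.metadata.get(k) == v for k, v in required_metadata.items())
        if text_ok and meta_ok:
            hits.append(doc.doc_id)
    return hits


# Toy collection; the metadata values are invented for illustration only.
docs = [
    Document("d1", "quarterly sales report", {"has_signature": True, "logo": "acme"}),
    Document("d2", "quarterly sales memo", {"has_signature": False, "logo": "acme"}),
    Document("d3", "lab notebook entry", {"has_signature": True, "logo": None}),
]

# Text query narrowed by a metadata constraint that plain OCR search cannot express.
signed_sales = search(docs, ["quarterly", "sales"], has_signature=True)
```

A text-only query for "quarterly sales" would match both `d1` and `d2` here; adding the signature constraint narrows the result to the signed document.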

Paper Details

Date Published: 16 January 2006
PDF: 11 pages
Proc. SPIE 6067, Document Recognition and Retrieval XIII, 60670N (16 January 2006); doi: 10.1117/12.662918
Author Affiliations
G. Agam, Illinois Institute of Technology (United States)
S. Argamon, Illinois Institute of Technology (United States)
O. Frieder, Illinois Institute of Technology (United States)
D. Grossman, Illinois Institute of Technology (United States)
D. Lewis, David D. Lewis Consulting (United States)


Published in SPIE Proceedings Vol. 6067:
Document Recognition and Retrieval XIII
Kazem Taghva; Xiaofan Lin, Editor(s)
