Proceedings Paper

Scalable ranked retrieval using document images
Author(s): Rajiv Jain; Douglas W. Oard; David Doermann

Paper Abstract

Despite the explosion of text on the Internet, hard copy documents that have been scanned as images still play a significant role for some tasks. The best method to perform ranked retrieval on a large corpus of document images, however, remains an open research question. The most common approach has been to perform text retrieval using terms generated by optical character recognition. This paper, by contrast, examines whether a scalable segmentation-free image retrieval algorithm, which matches sub-images containing text or graphical objects, can provide additional benefit in satisfying a user’s information needs on a large, real-world dataset. Results on 7 million scanned pages from the CDIP v1.0 test collection show that content-based image retrieval finds a substantial number of documents that text retrieval misses, and that, when used as a basis for relevance feedback, it can yield improvements in retrieval effectiveness.

Paper Details

Date Published: 24 March 2014
PDF: 15 pages
Proc. SPIE 9021, Document Recognition and Retrieval XXI, 90210K (24 March 2014); doi: 10.1117/12.2038656
Author Affiliations
Rajiv Jain, Univ. of Maryland, College Park (United States)
Douglas W. Oard, Univ. of Maryland, College Park (United States)
David Doermann, Univ. of Maryland, College Park (United States)

Published in SPIE Proceedings Vol. 9021:
Document Recognition and Retrieval XXI
Bertrand Coüasnon; Eric K. Ringger, Editor(s)

© SPIE