
Proceedings Paper

Document similarity measures and document browsing
Author(s): Ildus Ahmadullin; Jian Fan; Niranjan Damera-Venkata; Suk Hwan Lim; Qian Lin; Jerry Liu; Sam Liu; Eamonn O'Brien-Strain; Jan Allebach

Paper Abstract

Managing large document databases is an important task today. Being able to automatically compare document layouts, and to classify and search documents by their visual appearance, is desirable in many applications. We measure the similarity of single-page documents using distance functions between three document components: background, text, and saliency. Each document component is represented as a Gaussian mixture distribution, and distances between different documents' components are calculated as probabilistic similarities between the corresponding distributions. The similarity measure between documents is a weighted sum of the components' distances. Using this document similarity measure, we propose a browsing mechanism operating on a document dataset. For this purpose, we use a hierarchical browsing environment which we call the document similarity pyramid. It allows the user to browse a large document dataset and to search for documents in the dataset that are similar to the query. The user can browse the dataset on different levels of the pyramid and zoom into the documents that are of interest.
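
The weighted-sum measure described in the abstract can be illustrated with a minimal sketch. The abstract does not say which probabilistic distance the paper uses, so the code below substitutes the Cauchy-Schwarz divergence, which has a closed form for Gaussian mixtures. The isotropic-variance mixture format, the function names, and the weight values are all assumptions made for illustration, not the authors' implementation.

```python
import math

# Hypothetical representation: a mixture is a list of
# (weight, mean, variance) tuples. `mean` is a tuple of coordinates and
# `variance` is one isotropic variance shared by all dimensions.

def gaussian_overlap(m1, v1, m2, v2):
    """Integral of the product of two isotropic Gaussian densities."""
    d = len(m1)
    v = v1 + v2
    sq_dist = sum((a - b) ** 2 for a, b in zip(m1, m2))
    return (2 * math.pi * v) ** (-d / 2) * math.exp(-sq_dist / (2 * v))

def mixture_inner(p, q):
    """Inner product <p, q> of two mixtures, summed over component pairs."""
    return sum(wp * wq * gaussian_overlap(mp, vp, mq, vq)
               for wp, mp, vp in p
               for wq, mq, vq in q)

def cs_divergence(p, q):
    """Cauchy-Schwarz divergence (a stand-in distance, assumed here).

    It is symmetric, zero when p equals q, and positive otherwise.
    The clamp removes tiny negative values caused by rounding.
    """
    ratio = mixture_inner(p, q) / math.sqrt(
        mixture_inner(p, p) * mixture_inner(q, q))
    return max(0.0, -math.log(ratio))

COMPONENTS = ("background", "text", "saliency")

def document_distance(doc_a, doc_b, weights):
    """Weighted sum of the per-component distances between two documents."""
    return sum(weights[c] * cs_divergence(doc_a[c], doc_b[c])
               for c in COMPONENTS)

# Example documents with 2-D features; all values are made up.
doc_a = {
    "background": [(1.0, (0.5, 0.5), 0.10)],
    "text": [(0.5, (0.2, 0.3), 0.01), (0.5, (0.8, 0.3), 0.01)],
    "saliency": [(1.0, (0.5, 0.2), 0.02)],
}
doc_b = dict(doc_a, saliency=[(1.0, (0.5, 0.3), 0.02)])  # small shift
doc_c = dict(doc_a, saliency=[(1.0, (0.9, 0.9), 0.02)])  # large shift

weights = {"background": 0.2, "text": 0.5, "saliency": 0.3}  # assumed

print(document_distance(doc_a, doc_b, weights))
print(document_distance(doc_a, doc_c, weights))
```

In this sketch a document whose saliency region moves only slightly stays closer to the query than one whose saliency region moves far away, which is the kind of ranking a similarity-based browser would rely on. Mixtures with almost no overlap can drive the inner product to zero in floating point, so a production version would need to guard the logarithm.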

Paper Details

Date Published: 7 February 2011
PDF: 8 pages
Proc. SPIE 7879, Imaging and Printing in a Web 2.0 World II, 787909 (7 February 2011); doi: 10.1117/12.877268
Author Affiliations
Ildus Ahmadullin, Purdue Univ. (United States)
Jian Fan, Hewlett-Packard Labs. (United States)
Niranjan Damera-Venkata, Hewlett-Packard Labs. (United States)
Suk Hwan Lim, Hewlett-Packard Labs. (United States)
Qian Lin, Hewlett-Packard Labs. (United States)
Jerry Liu, Hewlett-Packard Labs. (United States)
Sam Liu, Hewlett-Packard Labs. (United States)
Eamonn O'Brien-Strain, Hewlett-Packard Labs. (United States)
Jan Allebach, Purdue Univ. (United States)

Published in SPIE Proceedings Vol. 7879:
Imaging and Printing in a Web 2.0 World II
Qian Lin; Jan P. Allebach; Zhigang Fan, Editor(s)
