Proceedings Paper

Comparison of text-based methods for detecting duplication in document image databases
Author(s): Daniel P. Lopresti

Paper Abstract

This paper presents an experimental evaluation of several text-based methods for detecting duplication in document image databases using uncorrected OCR output. The task is challenging both because printed documents can suffer a wide range of degradations and because interpretations of what it means to be a 'duplicate' conflict. We report results for five sets of experiments exploring various aspects of the problem space. While the techniques studied are generally robust in the face of most types of OCR errors, there are nonetheless important differences, which we identify and discuss in detail.

Paper Details

Date Published: 22 December 1999
PDF: 12 pages
Proc. SPIE 3967, Document Recognition and Retrieval VII, (22 December 1999); doi: 10.1117/12.373496
Daniel P. Lopresti, Lucent Technologies/Bell Labs. (United States)

Published in SPIE Proceedings Vol. 3967:
Document Recognition and Retrieval VII
Daniel P. Lopresti; Jiangying Zhou, Editor(s)
