
Proceedings Paper

Learning to identify hundreds of flex-form documents
Author(s): Janusz Wnek

Paper Abstract

This paper presents an inductive document classifier (IDC) and its application to document identification. The system's most important features are its learning capability, its handling of large volumes of highly variant documents, and its high performance. IDC learns new document types (variants) from examples: it automatically extracts discriminatory features from images of various document types, generates generalized descriptions, and stores them in the knowledge base. An unknown document is classified by matching its description against all general rules in the knowledge base and selecting the best-matching document types as the final classifications. Both the learning and identification processes are fast and accurate. The speed comes from optimal image processing and feature construction procedures. Identification accuracy is very high even though the discriminatory features are generated solely from page layout information. IDC operates in two separate components of an electronic document management system (EDMS): the Knowledge Base Maintainer (KBM) and the Production Identifier (PI). KBM builds the knowledge base and maintains its integrity; PI uses the learned knowledge during identification.
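The learn-then-match flow described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the rule representation or the matching score, so the sketch assumes that each generalized description is a per-feature numeric interval and that the match score is the fraction of features falling inside those intervals. The class `KnowledgeBase`, the methods `learn` and `identify`, and the feature names `logo_x` and `table_top` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical layout features: feature name -> numeric value
# (for example, a normalized position on the page).
Features = dict[str, float]


@dataclass
class KnowledgeBase:
    # Document type -> feature name -> (low, high) interval.
    # The interval form of a "general rule" is an assumption of this sketch.
    rules: dict[str, dict[str, tuple[float, float]]] = field(default_factory=dict)

    def learn(self, doc_type: str, examples: list[Features]) -> None:
        """Generalize examples of one document type into one interval per feature.

        Assumes every example carries the same feature names.
        """
        if not examples:
            raise ValueError("at least one example is required")
        rule = {}
        for name in examples[0]:
            values = [example[name] for example in examples]
            rule[name] = (min(values), max(values))
        self.rules[doc_type] = rule

    def identify(self, doc: Features, top_k: int = 1) -> list[tuple[str, float]]:
        """Score the document against every rule and return the best matches."""
        scored = []
        for doc_type, rule in self.rules.items():
            hits = sum(
                1
                for name, (low, high) in rule.items()
                if name in doc and low <= doc[name] <= high
            )
            scored.append((doc_type, hits / len(rule)))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


kb = KnowledgeBase()
kb.learn("invoice", [{"logo_x": 0.10, "table_top": 0.40},
                     {"logo_x": 0.12, "table_top": 0.45}])
kb.learn("claim_form", [{"logo_x": 0.80, "table_top": 0.20},
                        {"logo_x": 0.85, "table_top": 0.25}])

best = kb.identify({"logo_x": 0.11, "table_top": 0.42})
print(best)
```

In this sketch `learn` loosely plays the role the abstract assigns to KBM (building the knowledge base) and `identify` the role assigned to PI (applying it); a real system would derive the features from page images rather than take them as hand-written dictionaries.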

Paper Details

Date Published: 7 January 1999
PDF: 10 pages
Proc. SPIE 3651, Document Recognition and Retrieval VI, (7 January 1999); doi: 10.1117/12.335815
Janusz Wnek, Science Applications International Corp. (United States)


Published in SPIE Proceedings Vol. 3651:
Document Recognition and Retrieval VI
Daniel P. Lopresti; Jiangying Zhou, Editor(s)

© SPIE.