Proceedings Paper

Model-based document categorization employing semantic pattern analysis and local structure clustering
Author(s): Kosei Fume; Yasuto Ishitani

Paper Abstract

We propose a document categorization method based on a document model that can be defined externally for each task. The method categorizes Web content or business documents into a target category according to their similarity to the model. Its main feature is two forms of semantic extraction from an input document: the semantics of terms are extracted by semantic pattern analysis, and the implicit meanings of document substructures are identified by a bottom-up text clustering technique that focuses on the similarity of text line attributes. We have constructed a trial system based on the proposed method. Experimental results show that the system achieves more than 80% classification accuracy in categorizing Web content and business documents into 15 or 70 categories.
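
The abstract does not specify how the bottom-up clustering of text lines is carried out, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes each text line carries three hypothetical attributes (indent, font size, bold flag), uses a made-up distance function, and repeatedly merges the most similar pair of adjacent groups until no pair is closer than an assumed threshold.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    # All attributes here are hypothetical; the paper's attribute set is not given.
    text: str
    indent: int        # left indent, in arbitrary layout units
    font_size: float   # font size in points
    bold: bool


def attribute_distance(a: TextLine, b: TextLine) -> float:
    """Assumed dissimilarity between two lines (weights chosen for illustration)."""
    return (abs(a.indent - b.indent) / 10.0
            + abs(a.font_size - b.font_size)
            + (0.0 if a.bold == b.bold else 1.0))


def cluster_lines(lines, threshold=1.0):
    """Bottom-up grouping: start with one cluster per line, then keep merging
    the closest pair of adjacent clusters while their distance is below threshold."""
    clusters = [[line] for line in lines]
    while len(clusters) > 1:
        # Distance between adjacent clusters is measured at their shared boundary.
        dists = [attribute_distance(clusters[i][-1], clusters[i + 1][0])
                 for i in range(len(clusters) - 1)]
        best = min(range(len(dists)), key=dists.__getitem__)
        if dists[best] >= threshold:
            break
        clusters[best:best + 2] = [clusters[best] + clusters[best + 1]]
    return clusters


demo_lines = [
    TextLine("Quarterly Report", indent=0, font_size=18.0, bold=True),
    TextLine("Sales rose in the first quarter.", indent=0, font_size=10.0, bold=False),
    TextLine("Costs were stable.", indent=0, font_size=10.0, bold=False),
    TextLine("- Region A", indent=20, font_size=10.0, bold=False),
    TextLine("- Region B", indent=20, font_size=10.0, bold=False),
]

demo_clusters = cluster_lines(demo_lines)
for group in demo_clusters:
    print([line.text for line in group])
```

With these sample lines, the heading, the two body lines, and the two indented items fall into separate groups. Such groups could then be compared against an externally defined document model, though how the paper performs that comparison is not described in the abstract.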

Paper Details

Date Published: 28 January 2008
PDF: 8 pages
Proc. SPIE 6815, Document Recognition and Retrieval XV, 68150R (28 January 2008); doi: 10.1117/12.765422
Author Affiliations
Kosei Fume, Toshiba Corp. (Japan)
Yasuto Ishitani, Toshiba Solutions Corp. (Japan)

Published in SPIE Proceedings Vol. 6815:
Document Recognition and Retrieval XV
Berrin A. Yanikoglu; Kathrin Berkner, Editor(s)
