
Proceedings Paper

Improved CHAID algorithm for document structure modelling
Author(s): A. Belaïd; T. Moinel; Y. Rangoni

Paper Abstract

This paper proposes a technique for the logical labelling of document images. It uses a decision-tree-based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR engine supplies the physical features the system needs: each block of text is extracted during layout analysis, and its raw physical features are collected and stored in the ALTO format. The data-mining method employed is "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules, extracted from logical layout knowledge, to support the decision tree. Two setups were tested: the first uses one tree per logical element; the second uses a single tree for all the logical elements to be recognised. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents from the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.
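
As a rough illustration of the chi-squared idea behind CHAID-style trees (not the authors' I-CHAID implementation, and not SIPINA's), the sketch below scores each candidate feature by the Pearson chi-squared statistic of its contingency table against the logical labels and picks the highest-scoring one as the split. The feature names (`font_size`, `position`), the toy blocks, and the function names are hypothetical. Real CHAID also merges similar categories and compares adjusted p-values rather than raw statistics; the raw comparison is valid here only because both toy features have the same degrees of freedom.

```python
from collections import Counter


def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic of the feature/label contingency table."""
    n = len(labels)
    row_totals = Counter(feature_values)            # counts per feature category
    col_totals = Counter(labels)                    # counts per logical label
    observed = Counter(zip(feature_values, labels)) # joint counts
    stat = 0.0
    for category, row in row_totals.items():
        for label, col in col_totals.items():
            expected = row * col / n                # count expected under independence
            stat += (observed[(category, label)] - expected) ** 2 / expected
    return stat


def best_split_feature(blocks, labels):
    """Return the feature with the largest chi-squared statistic, plus all scores."""
    scores = {
        name: chi_squared([block[name] for block in blocks], labels)
        for name in blocks[0]
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: six text blocks with two categorical physical features.
blocks = [
    {"font_size": "large", "position": "top"},
    {"font_size": "large", "position": "top"},
    {"font_size": "small", "position": "top"},
    {"font_size": "small", "position": "top"},
    {"font_size": "small", "position": "bottom"},
    {"font_size": "small", "position": "bottom"},
]
labels = ["title", "title", "author", "paragraph", "paragraph", "paragraph"]

best, scores = best_split_feature(blocks, labels)
print(best, scores)
```

In this toy set, `font_size` separates the title blocks from everything else, so it scores higher than `position` and would be chosen for the first split; a full tree would then repeat the same test within each resulting subset.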

Paper Details

Date Published: 18 January 2010
PDF: 7 pages
Proc. SPIE 7534, Document Recognition and Retrieval XVII, 75340X (18 January 2010); doi: 10.1117/12.839794
A. Belaïd, LORIA, Univ. Nancy 2 (France)
T. Moinel, LORIA, Univ. Nancy 2 (France)
Y. Rangoni, LORIA, Univ. Nancy 2 (France)

Published in SPIE Proceedings Vol. 7534:
Document Recognition and Retrieval XVII
Laurence Likforman-Sulem; Gady Agam, Editor(s)

© SPIE