Proceedings Paper

Automated conversion of structured documents into SGML
Author(s): Janusz Wnek; Robert J. Price

Paper Abstract

Intelligent document understanding (IDU) systems convert scanned document pages into an electronic format that preserves layout and logical document structure in addition to document content. Most experimental IDU systems, however, lack the capability to fully exploit their recognition results. In this paper we present an integrated IDU system that processes documents all the way from recognition to full utilization using Standard Generalized Markup Language (SGML). The standardization and widespread use of SGML-based tools provide the means for filling the gap between document recognition and seamless document reuse. The conversion process involves OCR of a multipage document, document structure analysis, processing of tabular data and mathematical expressions, and generation of the final SGML description. Document structure analysis is reduced here to parsing OCR results and recreating document structure by performing fuzzy searches for standard phrases and format analysis. Tabular data processing uses OCR results with positional data, horizontal lines, and heuristic rules to determine cell boundaries and contents. Recognition of mathematical expressions involves OCR on an extended symbol set, and equation structure recognition via transformations on a tree representation. The transformations are ordered and involve connecting separated symbols, context-sensitive OCR correction, extraction of horizontally aligned subexpressions, subscript and superscript processing, and general processing of symbols detected above or below the target symbol.
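
The fuzzy search for standard phrases mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the phrase list, the 0.8 similarity threshold, the `<title>` and `<p>` element names, and the function names `best_phrase` and `tag_lines` are all assumptions chosen for illustration, and the matching uses the Python standard library's `difflib.SequenceMatcher` rather than whatever method the paper employs.

```python
from difflib import SequenceMatcher
from html import escape

# Hypothetical standard phrases; the abstract does not enumerate them.
STANDARD_PHRASES = ["abstract", "introduction", "conclusions", "references"]


def best_phrase(line, phrases=STANDARD_PHRASES, threshold=0.8):
    """Return (phrase, score) for the closest standard phrase, or None.

    The threshold is an assumed value, not one taken from the paper.
    """
    text = line.strip().lower()
    best = None
    for phrase in phrases:
        # ratio() is 2 * matches / total length, in the range [0, 1].
        score = SequenceMatcher(None, text, phrase).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (phrase, score)
    return best


def tag_lines(ocr_lines):
    """Emit one SGML-like element per OCR line.

    Lines that fuzzily match a standard phrase become <title> elements
    carrying the corrected phrase; all other lines become <p> elements.
    """
    tagged = []
    for line in ocr_lines:
        match = best_phrase(line)
        if match:
            tagged.append(f"<title>{match[0].title()}</title>")
        else:
            # Escape markup characters so OCR text cannot break the output.
            tagged.append(f"<p>{escape(line.strip(), quote=False)}</p>")
    return tagged
```

For example, `tag_lines(["Abstroct", "Some text"])` treats the misrecognized heading as the standard phrase "abstract" and leaves the second line as an ordinary paragraph. A real system would combine such matches with the format analysis the abstract describes, since phrase similarity alone cannot distinguish a heading from the same word appearing in running text.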

Paper Details

Date Published: 1 April 1998
PDF: 10 pages
Proc. SPIE 3305, Document Recognition V, (1 April 1998); doi: 10.1117/12.304626
Janusz Wnek, Science Applications International Corp. (United States)
Robert J. Price, Science Applications International Corp. (United States)

Published in SPIE Proceedings Vol. 3305:
Document Recognition V
Daniel P. Lopresti; Jiangying Zhou, Editor(s)

© SPIE.