
Proceedings Paper

Online medical journal article layout analysis
Author(s): Jie Zou; Daniel Le; George R. Thoma

Paper Abstract

We describe a physical and logical layout analysis algorithm that segments and labels online medical journal articles (regular HTML and PDF-converted HTML files). For these articles, the geometric layout of the Web page is the most important cue for physical layout analysis. The key to physical layout analysis is therefore to render the HTML file in a Web browser, so that the visual information of zones (each composed of one or more HTML DOM nodes), especially their relative positions, can be used. The recursive X-Y cut algorithm is adopted to construct a hierarchical zone tree. Logical layout analysis uses both geometric and linguistic features: the HTML documents are modeled by a Hidden Markov Model with 16 states, and the Viterbi algorithm then finds the optimal label sequence, which completes the logical layout analysis.
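
The two stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no code, features, or model parameters. The sketch assumes each rendered zone is already available as a bounding box (left, top, right, bottom), uses a hypothetical `min_gap` threshold for the X-Y cut, and replaces the paper's 16-state model with a toy 3-state HMM whose states, observation symbols, and probabilities are invented for illustration only.

```python
import math
from typing import Dict, List, Tuple, Union

Box = Tuple[float, float, float, float]  # (left, top, right, bottom); assumed input format
Tree = Union[Box, Dict[str, object]]


def split_by_gaps(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by empty gaps >= min_gap along axis (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # far edge covered by the current group
    for box in ordered[1:]:
        if box[axis] - reach >= min_gap:
            groups.append([box])
            reach = box[axis + 2]
        else:
            groups[-1].append(box)
            reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 5.0, axis: int = 1) -> Tree:
    """Recursive X-Y cut: a leaf is a box, an internal node is a dict of children."""
    if len(boxes) == 1:
        return boxes[0]
    for try_axis in (axis, 1 - axis):  # preferred direction first, then the other
        groups = split_by_gaps(boxes, try_axis, min_gap)
        if len(groups) > 1:
            kind = "horizontal" if try_axis == 1 else "vertical"
            children = [xy_cut(g, min_gap, 1 - try_axis) for g in groups]
            return {"cut": kind, "children": children}
    return {"cut": None, "children": sorted(boxes)}  # no clean cut exists


def viterbi(obs, states, log_start, log_trans, log_emit) -> List[str]:
    """Most likely state sequence for obs, with all model terms as log-probabilities."""
    score = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            cur[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):  # follow back-pointers from the last zone to the first
        path.append(ptr[path[-1]])
    return path[::-1]


def logs(table: Dict[str, float]) -> Dict[str, float]:
    """Convert a probability table to log-probabilities (no zero entries assumed)."""
    return {k: math.log(v) for k, v in table.items()}


# Toy page: a full-width title above two columns (coordinates are made up).
zones = [(0, 0, 100, 10), (0, 20, 45, 60), (55, 20, 100, 60)]
tree = xy_cut(zones)

# Toy 3-state HMM; the paper's model has 16 states and is not reproduced here.
states = ["title", "author", "body"]
log_start = logs({"title": 0.8, "author": 0.1, "body": 0.1})
log_trans = {
    "title": logs({"title": 0.1, "author": 0.8, "body": 0.1}),
    "author": logs({"title": 0.05, "author": 0.25, "body": 0.7}),
    "body": logs({"title": 0.05, "author": 0.05, "body": 0.9}),
}
log_emit = {
    "title": logs({"large_font": 0.8, "short_line": 0.1, "long_text": 0.1}),
    "author": logs({"large_font": 0.1, "short_line": 0.8, "long_text": 0.1}),
    "body": logs({"large_font": 0.05, "short_line": 0.15, "long_text": 0.8}),
}
observations = ["large_font", "short_line", "long_text", "long_text"]
labels = viterbi(observations, states, log_start, log_trans, log_emit)
```

In this sketch the X-Y cut alternates between horizontal and vertical cuts and stops when no gap of at least `min_gap` remains. The Viterbi step works in log space to avoid numerical underflow on long zone sequences. How the rendered zone geometry and linguistic cues are turned into observation symbols is left open here, since the abstract does not specify it.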

Paper Details

Date Published: 29 January 2007
PDF: 12 pages
Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000V (29 January 2007); doi: 10.1117/12.704434
Author Affiliations
Jie Zou, National Library of Medicine (United States)
Daniel Le, National Library of Medicine (United States)
George R. Thoma, National Library of Medicine (United States)

Published in SPIE Proceedings Vol. 6500:
Document Recognition and Retrieval XIV
Editor(s): Xiaofan Lin; Berrin A. Yanikoglu

© SPIE