
Proceedings Paper

Extracting a sparsely located named entity from online HTML medical articles using support vector machine
Author(s): Jie Zou; Daniel Le; George R. Thoma

Paper Abstract

We describe a statistical machine learning method for extracting databank accession numbers (DANs) from online medical journal articles. Because DANs are sparsely located in the articles, we take a hierarchical approach. The HTML journal articles are first segmented into zones according to text and geometric features. The zones are then classified as DAN zones or other zones by an SVM classifier. A set of heuristic rules is then applied to the candidate DAN zones to extract DANs according to their edit distances to the DAN formats. An evaluation shows that the proposed method achieves a very high recall (above 99%) and significantly better precision than extraction by brute-force regular expression matching.
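
The abstract does not specify how the edit distance to a DAN format is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each token is reduced to a character "shape" (letters become `A`, digits become `9`) and compared with format templates by Levenshtein distance. The `shape` and `closest_format` helpers and the two templates in `FORMATS` are hypothetical and chosen purely for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting i characters turns a[:i] into the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def shape(token: str) -> str:
    """Map letters to 'A' and digits to '9'; keep other characters unchanged."""
    return "".join(
        "A" if c.isalpha() else "9" if c.isdigit() else c for c in token
    )


# Hypothetical format templates (assumption): one letter plus five digits,
# and two letters plus six digits.
FORMATS = ["A99999", "AA999999"]


def closest_format(token: str, formats=FORMATS):
    """Return (distance, template) for the template nearest to the token shape."""
    return min((edit_distance(shape(token), f), f) for f in formats)
```

Under these assumptions, a heuristic rule could accept a token from a candidate DAN zone when its distance to the nearest template falls below a small threshold, which tolerates minor typographical variation that a strict regular expression would reject.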

Paper Details

Date Published: 28 January 2008
PDF: 10 pages
Proc. SPIE 6815, Document Recognition and Retrieval XV, 68150P (28 January 2008); doi: 10.1117/12.765907
Author Affiliations
Jie Zou, National Library of Medicine (United States)
Daniel Le, National Library of Medicine (United States)
George R. Thoma, National Library of Medicine (United States)

Published in SPIE Proceedings Vol. 6815:
Document Recognition and Retrieval XV
Berrin A. Yanikoglu; Kathrin Berkner, Editor(s)
