
Proceedings Paper

Locating and parsing bibliographical references in HTML medical articles
Author(s): Jie Zou; Daniel Le; George R. Thoma

Paper Abstract

Bibliographical references that appear in journal articles can provide valuable hints for subsequent information extraction. We describe our statistical machine learning algorithms for locating and parsing such references in HTML medical journal articles. Reference locating identifies the reference sections and then decomposes them into individual references. We formulate reference locating as a two-class classification problem based on text and geometric features. An evaluation conducted on 500 articles from 100 journals achieves near-perfect precision and recall rates for locating references. Reference parsing identifies the components of each individual reference, e.g., author, article title, and journal title. We implement and compare two reference parsing algorithms. One relies on sequence statistics and trains a Conditional Random Field. The other focuses on local feature statistics: it trains a Support Vector Machine to classify each individual word, and a search algorithm then systematically corrects low-confidence labels if the label sequence violates a set of predefined rules. The overall performance of these two reference parsing algorithms is about the same: above 99% accuracy at the word level, and over 97% accuracy at the chunk level.
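
The abstract reports accuracy at two granularities without defining them. The following Python sketch illustrates one plausible reading and why chunk-level accuracy tends to be lower than word-level accuracy. It assumes that a "chunk" is a maximal run of consecutive words sharing one field label, and that a chunk counts as correct only if the predicted labeling reproduces it with the same label and the same boundaries. The function names and the toy reference are hypothetical and are not taken from the paper.

```python
from itertools import groupby


def chunks(labels):
    """Split a label sequence into (label, start, end) runs of identical labels."""
    spans = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        spans.append((label, pos, pos + length))
        pos += length
    return spans


def word_accuracy(gold, pred):
    """Fraction of words whose predicted label matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def chunk_accuracy(gold, pred):
    """Fraction of gold chunks reproduced exactly (assumed definition)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    gold_chunks = chunks(gold)
    pred_chunks = set(chunks(pred))
    return sum(c in pred_chunks for c in gold_chunks) / len(gold_chunks)


# Toy reference with 10 words: 3 author words, 5 title words, 2 journal words.
gold = ["author"] * 3 + ["title"] * 5 + ["journal"] * 2
# The prediction mislabels only the last title word as "journal".
pred = ["author"] * 3 + ["title"] * 4 + ["journal"] * 3

print(word_accuracy(gold, pred))
print(chunk_accuracy(gold, pred))
```

In this toy case a single mislabeled word at a field boundary leaves nine of ten words correct but breaks two of the three chunks, which is consistent with the chunk-level figure in the abstract being lower than the word-level figure.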

Paper Details

Date Published: 19 January 2009
PDF: 12 pages
Proc. SPIE 7247, Document Recognition and Retrieval XVI, 724708 (19 January 2009); doi: 10.1117/12.805946
Author Affiliations
Jie Zou, National Library of Medicine (United States)
Daniel Le, National Library of Medicine (United States)
George R. Thoma, National Library of Medicine (United States)


Published in SPIE Proceedings Vol. 7247:
Document Recognition and Retrieval XVI
Editor(s): Kathrin Berkner; Laurence Likforman-Sulem

© SPIE.