Proceedings Paper

A case study on rule-based and CRF-based author extraction methods
Author(s): Shengwen Yang; Yuhong Xiong

Paper Abstract

Information extraction (IE) is the task of automatically extracting structured information from unstructured documents. A typical application of IE is to process a set of documents written in a natural language and populate a database with the extracted information. This paper presents a case study on author extraction from unstructured documents. A rule-based method and a CRF-based (Conditional Random Field) method are implemented for this task. The rule-based method involves defining a set of heuristic rules and leveraging prior knowledge of author names and affiliations to identify metadata. The CRF-based method involves preparing a labeled training dataset, defining a set of feature functions, learning a CRF model, and applying the model to label new documents. We evaluate and compare the performance of the two methods through experiments, and offer guidance for application developers on choosing between heuristics and formal methods when addressing real-world information extraction problems.
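
To make the contrast between the two approaches concrete, the following is a minimal sketch in Python and not the authors' implementation. The keyword list, the name pattern, and the function names (looks_like_author_line, extract_authors, token_features) are all assumptions made for illustration. The first part shows what a heuristic rule backed by prior knowledge of affiliations might look like; the second shows the kind of per-token feature function a CRF-based labeler could be given.

```python
import re

# Hypothetical prior knowledge: a tiny list of affiliation keywords (assumption).
AFFILIATION_KEYWORDS = {"university", "institute", "labs", "laboratory",
                        "department", "college"}

# A name token is a capitalized word ("Yang") or an initial ("Y.") (assumption).
NAME_TOKEN = re.compile(r"^[A-Z][a-z]+$|^[A-Z]\.$")

# Author lists are assumed to be separated by ',', ';' or the word 'and'.
NAME_SEPARATOR = re.compile(r",|;|\band\b")


def split_names(line):
    """Split a line into candidate names on the assumed separators."""
    return [part.strip() for part in NAME_SEPARATOR.split(line) if part.strip()]


def looks_like_author_line(line):
    """Heuristic rule: every candidate is 2 to 4 name-like tokens,
    and the line contains no affiliation keyword."""
    lowered = line.lower()
    if any(keyword in lowered for keyword in AFFILIATION_KEYWORDS):
        return False
    names = split_names(line)
    if not names:
        return False
    for name in names:
        tokens = name.split()
        if not 2 <= len(tokens) <= 4:
            return False
        if not all(NAME_TOKEN.match(token) for token in tokens):
            return False
    return True


def extract_authors(lines):
    """Return the names on the first line that satisfies the rule, else []."""
    for line in lines:
        if looks_like_author_line(line):
            return split_names(line)
    return []


def token_features(tokens, i):
    """Illustrative feature function for token i, of the kind a CRF
    sequence labeler could consume (the feature names are assumptions)."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "is_affiliation_keyword": token.lower().strip(".,;") in AFFILIATION_KEYWORDS,
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


header = [
    "A case study on rule-based and CRF-based author extraction methods",
    "Shengwen Yang; Yuhong Xiong",
    "Hewlett-Packard Labs. China",
]
print(extract_authors(header))
print(token_features(header[2].split(), 1))
```

In the rule-based half, the behavior is fixed entirely by hand-written patterns and the keyword list. In the CRF-based half, the feature dictionaries would only be inputs: a CRF toolkit would learn weights for them from a labeled training set, which is the step the sketch leaves out.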

Paper Details

Date Published: 10 February 2010
PDF: 10 pages
Proc. SPIE 7540, Imaging and Printing in a Web 2.0 World; and Multimedia Content Access: Algorithms and Systems IV, 754005 (10 February 2010); doi: 10.1117/12.838781
Author Affiliations:
Shengwen Yang, Hewlett-Packard Labs. China (China)
Yuhong Xiong, Hewlett-Packard Labs. China (China)

Published in SPIE Proceedings Vol. 7540:
Imaging and Printing in a Web 2.0 World; and Multimedia Content Access: Algorithms and Systems IV
Theo Gevers; Qian Lin; Raimondo Schettini; Zhigang Fan; Cees Snoek, Editor(s)

© SPIE.