Proceedings Paper

Identification of embedded mathematical formulas in PDF documents using SVM
Author(s): Xiaoyan Lin; Liangcai Gao; Zhi Tang; Xuan Hu; Xiaofan Lin

Paper Abstract

With the tremendous popularity of the PDF format, recognizing mathematical formulas in PDF documents has become a new and important problem in the field of document analysis. In this paper, we present a method for identifying embedded mathematical formulas in PDF documents, based on a Support Vector Machine (SVM). The method first segments text lines into words and then classifies each word into one of two classes: formula or ordinary text. Various features of embedded formulas, including geometric layout, character content, and context, are used to build a robust and adaptable SVM classifier. Embedded formulas are then extracted by merging the words labeled as formulas. Experimental results show good performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale e-Book production.
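
The final merging step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the features or the exact merging rule, so this example assumes that consecutive formula-labeled words within one text line form a single embedded formula. The names `merge_formula_words` and `toy_classifier` are hypothetical, and the toy rule merely stands in for the trained SVM.

```python
from typing import Callable, List, Sequence


def merge_formula_words(
    words: Sequence[str],
    is_formula: Callable[[str], bool],
) -> List[str]:
    """Join runs of consecutive formula-labeled words of one text line."""
    formulas: List[str] = []
    current: List[str] = []
    for word in words:
        if is_formula(word):
            # Extend the formula currently being collected.
            current.append(word)
        elif current:
            # An ordinary-text word closes the current formula.
            formulas.append(" ".join(current))
            current = []
    if current:
        # The line ended while a formula was still open.
        formulas.append(" ".join(current))
    return formulas


def toy_classifier(word: str) -> bool:
    """Assumed stand-in for the SVM: single characters or math-like symbols."""
    return len(word) == 1 or any(ch in "=+^_\\" for ch in word)


line = ["where", "x", "=", "y^2", "holds", "for", "all", "n"]
formula_texts = merge_formula_words(line, toy_classifier)
```

In a real system the `is_formula` callable would wrap a trained classifier operating on layout, character, and context features rather than on the word string alone.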

Paper Details

Date Published: 23 January 2012
PDF: 8 pages
Proc. SPIE 8297, Document Recognition and Retrieval XIX, 82970D (23 January 2012); doi: 10.1117/12.912445
Author Affiliations
Xiaoyan Lin, Peking Univ. (China)
Liangcai Gao, Peking Univ. (China)
Zhi Tang, Peking Univ. (China)
State Key Lab. of Digital Publishing Technology (China)
Xuan Hu, BeiHang Univ. (China)
Xiaofan Lin, Vobile, Inc. (United States)

Published in SPIE Proceedings Vol. 8297:
Document Recognition and Retrieval XIX
Christian Viard-Gaudin; Richard Zanibbi, Editor(s)
