
Proceedings Paper

Graph-based layout analysis for PDF documents
Author(s): Canhui Xu; Zhi Tang; Xin Tao; Yun Li; Cao Shi

Paper Abstract

To increase the flexibility and enrich the reading experience of e-books on small portable screens, a graph-based method is proposed for layout analysis of Portable Document Format (PDF) documents. Born-digital documents have inherent advantages, such as representing text and fractional images in explicit form, which can be straightforwardly exploited. To integrate traditional image-based document analysis with the inherent metadata provided by a PDF parser, the page primitives, including text, image and path elements, are processed to produce text and non-text layers for separate analysis. The graph-based method is developed at the superpixel representation level, and page text elements corresponding to vertices are used to construct an undirected graph. The Euclidean distance between adjacent vertices is applied in a top-down manner to cut the graph tree formed by Kruskal’s algorithm. Edge orientation is then used in a bottom-up manner to extract text lines from each subtree. Non-textual objects, in turn, are segmented by connected component analysis. For each segmented text and non-text composite, a 13-dimensional feature vector is extracted for labelling purposes. Experimental results on selected pages from PDF books are presented.
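
The top-down step described in the abstract (build a tree with Kruskal's algorithm, then cut it by Euclidean distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `kruskal_mst` and `cut_tree`, the use of element centre points as vertices, the complete graph standing in for the paper's vertex adjacency, and the single fixed threshold `max_dist` are all assumptions made for illustration.

```python
import math
from itertools import combinations


def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def kruskal_mst(points):
    """Return minimum spanning tree edges as (distance, i, j) tuples.

    Assumption: every pair of points is a candidate edge (complete graph);
    the paper's own adjacency rule is not given in the abstract.
    """
    parent = list(range(len(points)))
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    mst = []
    for d, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:  # edge joins two different components, so keep it
            parent[ri] = rj
            mst.append((d, i, j))
    return mst


def cut_tree(points, max_dist):
    """Remove tree edges longer than max_dist and return the resulting
    subtrees as sorted lists of point indices.

    Assumption: a single fixed threshold decides which edges are cut.
    """
    parent = list(range(len(points)))
    for d, i, j in kruskal_mst(points):
        if d <= max_dist:  # short edge: both ends stay in the same subtree
            parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(_find(parent, idx), []).append(idx)
    return sorted(groups.values())


# Hypothetical centre points of five text elements on two well-separated rows.
centres = [(0, 0), (10, 0), (20, 0), (0, 100), (10, 100)]
blocks = cut_tree(centres, max_dist=15)
```

With these made-up coordinates, the only tree edge longer than 15 units is the one linking the two rows, so cutting it leaves one subtree per row. The bottom-up step from the abstract (grouping by edge orientation within each subtree) is not shown here.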

Paper Details

Date Published: 21 March 2013
PDF: 8 pages
Proc. SPIE 8664, Imaging and Printing in a Web 2.0 World IV, 866407 (21 March 2013); doi: 10.1117/12.2005608
Author Affiliations
Canhui Xu, Peking Univ. (China)
Peking Univ. Founder Group Corp. (China)
Zhongguancun Haidian Science Park (China)
Zhi Tang, Peking Univ. (China)
Peking Univ. Founder Group Corp. (China)
Xin Tao, Peking Univ. (China)
Yun Li, Peking Univ. (China)
Peking Univ. Founder Group Corp. (China)
Cao Shi, Peking Univ. (China)


Published in SPIE Proceedings Vol. 8664:
Imaging and Printing in a Web 2.0 World IV
Qian Lin; Jan P. Allebach; Zhigang Fan, Editor(s)
