
Proceedings Paper

Title identification of web article pages using HTML and visual features
Author(s): Jian Fan; Ping Luo; Parag Joshi

Paper Abstract

Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not easy, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground-truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
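
The abstract names four features but does not say how they are combined. As a rough illustration only, the Python sketch below scores each candidate DOM node with a weighted sum of those features and compares the result with a font-size-only baseline. The weights, the tag scores, the `Candidate` fields, and all function names are hypothetical assumptions made for this example; they are not taken from the paper and do not reproduce its method.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """A rendered DOM text node (hypothetical structure for this sketch)."""
    text: str
    tag: str          # HTML tag name, e.g. "h1"
    font_size: float  # rendered font size in pixels
    left: float       # x coordinate of the left edge
    width: float      # rendered width of the node


# Assumed tag preferences; the paper does not publish these values.
TAG_SCORES = {"h1": 1.0, "h2": 0.7, "h3": 0.4}


def title_similarity(text: str, html_title: str) -> float:
    """Similarity in [0, 1] between node text and the HTML <title> field."""
    return SequenceMatcher(None, text.lower(), html_title.lower()).ratio()


def alignment_score(c: Candidate, page_width: float) -> float:
    """1.0 when the node is horizontally centered, falling toward 0.0 at the edges."""
    center = c.left + c.width / 2
    offset = abs(center - page_width / 2) / (page_width / 2)
    return max(0.0, 1.0 - offset)


def score(c: Candidate, html_title: str, max_font: float, page_width: float) -> float:
    """Weighted sum of the four features; the weights are illustrative guesses."""
    return (
        0.4 * title_similarity(c.text, html_title)
        + 0.2 * TAG_SCORES.get(c.tag.lower(), 0.0)
        + 0.3 * (c.font_size / max_font)
        + 0.1 * alignment_score(c, page_width)
    )


def identify_title(cands: list[Candidate], html_title: str, page_width: float) -> Candidate:
    """Return the candidate with the highest combined score."""
    max_font = max(c.font_size for c in cands)
    return max(cands, key=lambda c: score(c, html_title, max_font, page_width))


def font_size_baseline(cands: list[Candidate]) -> Candidate:
    """Baseline that picks the node with the largest font size."""
    return max(cands, key=lambda c: c.font_size)
```

On a page whose site banner is set in a larger font than the headline, the baseline selects the banner, while the combined score can favor the headline because it also matches the HTML title field, uses a heading tag, and is centered.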

Paper Details

Date Published: 7 February 2011
PDF: 5 pages
Proc. SPIE 7879, Imaging and Printing in a Web 2.0 World II, 78790K (7 February 2011); doi: 10.1117/12.876708
Jian Fan, Hewlett-Packard Labs. (United States)
Ping Luo, Hewlett-Packard Labs. China (China)
Parag Joshi, Hewlett-Packard Labs. (United States)

Published in SPIE Proceedings Vol. 7879:
Imaging and Printing in a Web 2.0 World II
Editor(s): Qian Lin; Jan P. Allebach; Zhigang Fan

© SPIE.