
Proceedings Paper

DOM-based print-link detection for web article extraction
Author(s): Sam Liu; Suk-Hwan Lim; Jerry Liu

Paper Abstract

Web article pages usually have hyperlinks (or links) that lead to print-friendly web pages containing mainly the article content. Content extraction from these print-friendly pages is generally easier and more reliable, but the many variations in how print-links are represented in HTML make robust print-link detection more difficult than it first appears. First, the link can be text-based, image-based, or both. For example, there is a lexicon of phrases used to indicate print-friendly pages, such as "print", "print article", and "print-friendly version". In addition, some links use printer-resembling image icons, with or without a print phrase present. To complicate matters further, not all links contain a valid URL: some print-friendly pages are generated dynamically, either by client-side JavaScript or by the server, so no URL is available for extraction. We estimate that more than 90% of Web article pages have print-links, and that about 35% of those links have valid print-friendly URLs. Our solution to the print-link extraction problem has two stages: (1) detection of the print-link, and (2) retrieval of the print-friendly page URL from the link attributes, including a test of its validity. Experimental results on roughly 2000 web article pages suggest that our solution achieves over 99% precision and 97% recall.
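
The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the phrase pattern, the use of image `alt`/`src` text as a stand-in for icon recognition, the validity rules, and all names (`PrintLinkFinder`, `resolve_print_url`, `find_print_url`) are assumptions made for illustration, using only the Python standard library.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed lexicon: matches "print", "printer", "print article",
# "print-friendly version", but not "printing" or "blueprint".
PRINT_CUE = re.compile(r"\bprint(er)?\b")


class PrintLinkFinder(HTMLParser):
    """Stage 1 (sketch): collect the href of every anchor with a print cue."""

    def __init__(self):
        super().__init__()
        self.candidates = []  # raw href values; may be None or unusable
        self._in_anchor = False
        self._href = None
        self._cues = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._in_anchor = True
            self._href = attrs.get("href")
            self._cues = [attrs.get("title") or ""]
        elif tag == "img" and self._in_anchor:
            # Image-based links: use alt text and file name as textual cues.
            self._cues.append(attrs.get("alt") or "")
            self._cues.append(attrs.get("src") or "")

    def handle_data(self, data):
        if self._in_anchor:
            self._cues.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            # Normalize: lowercase, turn every non-letter run into a space.
            text = re.sub(r"[^a-z]+", " ", " ".join(self._cues).lower())
            if PRINT_CUE.search(text):
                self.candidates.append(self._href)
            self._in_anchor = False


def resolve_print_url(href, base_url):
    """Stage 2 (sketch): return an absolute http(s) URL, or None if invalid."""
    if not href:
        return None
    href = href.strip()
    # Script-driven or fragment-only links carry no extractable URL.
    if href.startswith("#") or href.lower().startswith("javascript:"):
        return None
    url = urljoin(base_url, href)
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return url


def find_print_url(html, base_url):
    """Run both stages and return the first valid print-friendly URL."""
    finder = PrintLinkFinder()
    finder.feed(html)
    for href in finder.candidates:
        url = resolve_print_url(href, base_url)
        if url is not None:
            return url
    return None
```

In this sketch, a page whose only print-link is `javascript:window.print()` yields a detected link but no URL, which corresponds to the case the abstract describes where the print-friendly page is generated dynamically. A production detector would need a richer lexicon and real icon analysis rather than file-name matching.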

Paper Details

Date Published: 7 February 2011
PDF: 7 pages
Proc. SPIE 7879, Imaging and Printing in a Web 2.0 World II, 787904 (7 February 2011); doi: 10.1117/12.872573
Sam Liu, Hewlett-Packard Co. (United States)
Suk-Hwan Lim, Hewlett-Packard Co. (United States)
Jerry Liu, Hewlett-Packard Co. (United States)


Published in SPIE Proceedings Vol. 7879:
Imaging and Printing in a Web 2.0 World II
Qian Lin; Jan P. Allebach; Zhigang Fan, Editor(s)
