
Proceedings Paper

Extraction and labeling high-resolution images from PDF documents
Author(s): Suchet K. Chachra; Zhiyun Xue; Sameer Antani; Dina Demner-Fushman; George R. Thoma

Paper Abstract

The accuracy of content-based image retrieval is affected by image resolution, among other factors. Higher-resolution images enable extraction of image features that more accurately represent the image content. To improve the relevance of search results for our biomedical image search engine, Open-I, we have developed techniques to extract and label high-resolution versions of figures from biomedical articles supplied in PDF format. Open-I uses the open-access subset of biomedical articles from the PubMed Central repository hosted by the National Library of Medicine. Articles are available in XML and in publisher-supplied PDF formats. Because these PDF documents contain little or no metadata to identify the embedded images, the task includes labeling the images according to their figure number in the article after they have been successfully extracted. For this purpose we use the labeled small-size images provided with the XML web version of the article. This paper describes the image extraction process and two alternative approaches to image labeling: one measures the similarity between two images based on the projection of image intensity onto the coordinate axes, and the other measures it based on the normalized cross-correlation between the intensities of the two images. Using image identification based on image intensity projection, we achieved a precision of 92.84% and a recall of 82.18% in labeling the extracted images.
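
The two similarity measures named in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' actual formulation, so everything below is an assumption made for illustration: images are plain grayscale lists of rows, projections are mean intensities per row and per column resampled to a common length, profiles are compared with Pearson correlation, and the cross-correlation is evaluated at zero shift after a nearest-neighbour resize. All function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    denom = sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0

def resample(profile, length):
    """Linearly interpolate a 1-D profile to `length` samples (length >= 2)."""
    if len(profile) == 1:
        return [profile[0]] * length
    out = []
    for i in range(length):
        pos = i * (len(profile) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(profile) - 1)
        frac = pos - lo
        out.append(profile[lo] * (1 - frac) + profile[hi] * frac)
    return out

def projections(img, length=32):
    """Mean intensity per row and per column, resampled to a common length."""
    rows = [sum(r) / len(r) for r in img]
    cols = [sum(c) / len(c) for c in zip(*img)]
    return resample(rows, length), resample(cols, length)

def projection_similarity(a, b, length=32):
    """Average correlation of the row and column intensity projections (assumed scoring)."""
    rows_a, cols_a = projections(a, length)
    rows_b, cols_b = projections(b, length)
    return (pearson(rows_a, rows_b) + pearson(cols_a, cols_b)) / 2

def resize_nearest(img, height, width):
    """Nearest-neighbour resize so two images can be compared pixel by pixel."""
    h, w = len(img), len(img[0])
    return [[img[y * h // height][x * w // width] for x in range(width)]
            for y in range(height)]

def ncc_similarity(a, b):
    """Zero-shift normalized cross-correlation after resizing b to a's dimensions."""
    b = resize_nearest(b, len(a), len(a[0]))
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return pearson(flat_a, flat_b)
```

Under these assumptions, each extracted high-resolution image would be scored against every labeled small-size image from the XML version, and the figure label of the best-scoring match would be assigned, presumably only when the score clears some threshold. The projection measure compares two short 1-D profiles instead of every pixel, which makes it cheaper and less sensitive to small size differences between the two versions of a figure.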

Paper Details

Date Published: 24 March 2014
PDF: 9 pages
Proc. SPIE 9021, Document Recognition and Retrieval XXI, 90210Q (24 March 2014); doi: 10.1117/12.2042336
Suchet K. Chachra, U.S. National Library of Medicine (United States)
Zhiyun Xue, U.S. National Library of Medicine (United States)
Sameer Antani, U.S. National Library of Medicine (United States)
Dina Demner-Fushman, U.S. National Library of Medicine (United States)
George R. Thoma, U.S. National Library of Medicine (United States)


Published in SPIE Proceedings Vol. 9021:
Document Recognition and Retrieval XXI
Editor(s): Bertrand Coüasnon; Eric K. Ringger
