Proceedings Paper

Title extraction and generation from OCR'd documents
Author(s): Kazem Taghva; Allen Condit; Steve Lumos; Julie Borsack; Thomas Nartker

Paper Abstract

Extraction of metadata from documents is a tedious and expensive process. In general, documents are manually reviewed for structured data such as title, author, date, and organization. The purpose of extraction is to build metadata for documents that can be used when formulating structured queries. In many large document repositories, such as the National Library of Medicine (NLM)1 or university libraries, the extraction task is a daily process that spans decades. Although some automation is used during the extraction process, metadata extraction is generally a manual task. Aside from the cost and labor time, manual processing is error-prone and requires many levels of quality control. Recent advances in extraction technology, as reported at the Message Understanding Conference (MUC),2 give results comparable with extraction performed by humans. In addition, many organizations use historical data for lookup to improve the quality of extraction. For the large government document repository we are working with, the task involves extraction of several fields from millions of OCR'd and electronic documents. Since this project is time-sensitive, automatic extraction turns out to be the only viable solution. There are more than a dozen fields associated with each document that require extraction. In this paper, we report on the extraction and generation of the title field.

Paper Details

Date Published: 29 January 2007
PDF: 6 pages
Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000R (29 January 2007); doi: 10.1117/12.712264
Kazem Taghva, Univ. of Nevada, Las Vegas (United States)
Allen Condit, Univ. of Nevada, Las Vegas (United States)
Steve Lumos, Univ. of Nevada, Las Vegas (United States)
Julie Borsack, Univ. of Nevada, Las Vegas (United States)
Thomas Nartker, Univ. of Nevada, Las Vegas (United States)


Published in SPIE Proceedings Vol. 6500:
Document Recognition and Retrieval XIV
Xiaofan Lin; Berrin A. Yanikoglu, Editor(s)
