Proceedings Paper

Dealing with extreme data diversity: extraction and fusion from the growing types of document formats
Author(s): Peter David; Nichole Hansen; James J. Nolan; Pedro Alcocer

Paper Abstract

The growth in text data available online is accompanied by a growth in the diversity of available documents. Corpora with extreme heterogeneity in file formats, document organization, page layout, text style, and content are common. Because online and open-source data lack meaningful metadata describing their structure, text extraction results contain no information about document structure and are cluttered with page headers and footers, web navigation controls, advertisements, and other items that are typically considered noise. We describe an approach to document structure and metadata recovery that uses visual analysis of documents to infer the communicative intent of the author. Our algorithm identifies the components of documents, such as titles, headings, and body content, based on their appearance. Because it operates on an image of a document, our technique can be applied to any type of document, including scanned images. Our approach to document structure recovery considers a finer-grained set of component types than prior approaches. In this initial work, we show that a machine learning approach to document structure recovery using a feature set based on the geometry and appearance of images of documents achieves a 60% greater F1 score than a baseline random classifier.
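
As a rough illustration of what "a feature set based on the geometry ... of images of documents" could look like, the sketch below describes a rectangular page region by its size and position relative to the page, and computes an F1 score for one component type. It is not the authors' implementation: the `Region` type, the feature names, and both functions are hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A rectangular block on a page image, in pixels (hypothetical type)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def geometry_features(region: Region, page_width: float, page_height: float) -> dict:
    """Describe a region by its size and position relative to the page.

    The feature names are illustrative assumptions, not taken from the paper.
    """
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    center_x = (region.x + region.width / 2) / page_width
    return {
        "rel_top": region.y / page_height,          # how far down the page
        "rel_height": region.height / page_height,  # proxy for text size
        "rel_width": region.width / page_width,
        "center_offset": abs(center_x - 0.5),       # 0.0 means horizontally centered
        "aspect_ratio": region.width / region.height if region.height else 0.0,
    }


def f1_score(true_labels, predicted_labels, positive) -> float:
    """F1 for one component type: harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a wide, short, horizontally centered region near the top of a page would yield a small `rel_top` and a `center_offset` near zero, the kind of signal a classifier might associate with a title. Comparing a trained classifier against a random baseline would then come down to computing `f1_score` for each on the same labeled regions.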

Paper Details

Date Published: 15 May 2015
PDF: 7 pages
Proc. SPIE 9499, Next-Generation Analyst III, 94990Q (15 May 2015); doi: 10.1117/12.2184171
Peter David, Decisive Analytics Corp. (United States)
Nichole Hansen, Decisive Analytics Corp. (United States)
James J. Nolan, Decisive Analytics Corp. (United States)
Pedro Alcocer, Decisive Analytics Corp. (United States)


Published in SPIE Proceedings Vol. 9499:
Next-Generation Analyst III
Barbara D. Broome; Timothy P. Hanratty; David L. Hall; James Llinas, Editor(s)