Proceedings Paper

Automatic content extraction of filled-form images based on clustering component block projection vectors
Author(s): Hanchuan Peng; Xiaofeng He; Fuhui Long

Paper Abstract

Automatic understanding of document images is a hard problem. Here we consider a sub-problem: automatically extracting content from filled-form images. Rather than relying on pre-selected templates or sophisticated structural/semantic analysis, we propose a novel approach based on clustering component-block projection vectors. By combining spectral clustering with minimal spanning tree clustering, we generate highly accurate clusters, from which adaptive templates are constructed to extract the filled-in content. Our experiments show that this approach is effective on a set of 1040 US IRS tax form images belonging to 208 types.
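
The abstract does not define the projection vectors or the clustering criteria, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each component block is a small binary image of a fixed size, takes its projection vector to be the row sums followed by the column sums, and groups the vectors by building a minimum spanning tree and cutting edges longer than a threshold. The names `projection_vector`, `mst_clusters`, and `cut_threshold` are hypothetical, and the spectral clustering stage mentioned in the abstract is not shown.

```python
import math


def projection_vector(block):
    """Row sums followed by column sums of a binary block (assumed feature)."""
    row_sums = [sum(row) for row in block]
    col_sums = [sum(col) for col in zip(*block)]
    return row_sums + col_sums


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mst_edges(vectors):
    """Prim's algorithm on the complete distance graph; yields (dist, i, j)."""
    n = len(vectors)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known link from the tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = euclidean(vectors[u], vectors[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def mst_clusters(vectors, cut_threshold):
    """Drop MST edges longer than cut_threshold (an assumed criterion) and
    label the remaining connected components in order of first appearance."""
    n = len(vectors)
    root = list(range(n))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]  # path halving
            i = root[i]
        return i

    for dist, i, j in mst_edges(vectors):
        if dist <= cut_threshold:
            root[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Four toy 2x3 blocks: the first two and the last two have similar ink layouts.
blocks = [
    [[1, 1, 0], [0, 0, 0]],
    [[1, 1, 0], [0, 0, 1]],
    [[0, 0, 0], [1, 1, 1]],
    [[0, 0, 1], [1, 1, 1]],
]
vectors = [projection_vector(b) for b in blocks]
cluster_labels = mst_clusters(vectors, cut_threshold=2.0)
```

In this toy setup, blocks with similar projection profiles end up in the same cluster, and each cluster could then serve as the basis for one template. A real system would need to handle blocks of differing sizes and choose the cut criterion from the data.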

Paper Details

Date Published: 15 December 2003
PDF: 9 pages
Proc. SPIE 5296, Document Recognition and Retrieval XI, (15 December 2003); doi: 10.1117/12.527345
Author Affiliations
Hanchuan Peng, Oak Ridge National Lab. (United States); Lawrence Berkeley National Lab. (United States)
Xiaofeng He, Lawrence Berkeley National Lab. (United States)
Fuhui Long, Duke Univ. (United States)

Published in SPIE Proceedings Vol. 5296:
Document Recognition and Retrieval XI
Elisa H. Barney Smith; Jianying Hu; James Allan, Editor(s)
