
Proceedings Paper
Clustering header categories extracted from web tables
Paper Abstract
Revealing related content among heterogeneous web tables is part of our long-term objective of formulating queries over
multiple sources of information. Two hundred HTML tables from institutional web sites are segmented and each table
cell is classified according to the fundamental indexing property of row and column headers. The categories that
correspond to the multi-dimensional data cube view of a table are extracted by factoring the (often multi-row/column)
headers. To reveal commonalities between tables from diverse sources, the Jaccard distances between pairs of category
headers (and also table titles) are computed. We show how about one third of our heterogeneous collection can be
clustered into a dozen groups that exhibit table-title and header similarities that can be exploited for queries.
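
As a rough illustration of the distance computation named in the abstract, the sketch below computes Jaccard distances between word sets drawn from category headers and groups tables whose distance falls under a cutoff. The tokenization, the single-link grouping, the 0.5 threshold, the function names, and the sample headers are all assumptions made for this example; the abstract does not specify the clustering procedure the authors used.

```python
import re
from itertools import combinations


def tokens(text):
    """Lower-cased word set of a header or title (tokenization is an assumption)."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def group_tables(headers, threshold=0.5):
    """Single-link grouping (an assumed method): link tables closer than threshold."""
    parent = {name: name for name in headers}

    def find(x):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sets = {name: tokens(text) for name, text in headers.items()}
    for a, b in combinations(headers, 2):
        if jaccard_distance(sets[a], sets[b]) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in headers:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical category headers, invented for illustration only.
example = {
    "t1": "Year Region Population",
    "t2": "Region Year Population Density",
    "t3": "Course Instructor Credits",
}
groups = group_tables(example)
```

With these made-up headers, t1 and t2 share three of four distinct words, so they end up in one group, while t3 shares none and stays on its own.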
Paper Details
Date Published: 8 February 2015
PDF: 12 pages
Proc. SPIE 9402, Document Recognition and Retrieval XXII, 94020M (8 February 2015); doi: 10.1117/12.2076209
Published in SPIE Proceedings Vol. 9402:
Document Recognition and Retrieval XXII
Eric K. Ringger; Bart Lamiroy, Editor(s)
George Nagy, Rensselaer Polytechnic Institute (United States)
David W. Embley, Brigham Young Univ. (United States)
Mukkai Krishnamoorthy, Rensselaer Polytechnic Institute (United States)
Sharad Seth, Univ. of Nebraska-Lincoln (United States)
© SPIE.
