
Proceedings Paper

Text documents as social networks
Author(s): Helen Balinsky; Alexander Balinsky; Steven J. Simske

Paper Abstract

The extraction of keywords and features is a fundamental problem in text data mining. Document processing applications depend directly on the quality and speed with which salient terms and phrases are identified. Applications as disparate as automatic document classification, information visualization, filtering, and security policy enforcement all rely on the quality of automatically extracted keywords. Recently, a novel approach to rapid change detection in data streams and documents has been developed. It is based on ideas from image processing, and in particular on the Helmholtz principle from the Gestalt theory of human perception. By modeling a document as a one-parameter family of graphs, with its sentences or paragraphs defining the vertex set and with edges defined by the Helmholtz principle, we demonstrated that for some range of the parameter the resulting graph becomes a small-world network. In this article we investigate the natural orientation of edges in such small-world networks. For two connected sentences, we can say which one comes first and which one comes second, according to their position in the document. This orientation makes such a graph look like a small WWW-type network, and PageRank-type algorithms produce an interesting ranking of the nodes in such a document.
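
The idea of orienting edges by sentence position and then ranking the nodes can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not give the Helmholtz-based edge criterion, so the sketch substitutes a simple shared-word test as a placeholder. The edge direction (earlier sentence pointing to later sentence), the function names `build_edges` and `pagerank`, and the damping value 0.85 are all assumptions made for illustration.

```python
def build_edges(sentences, min_shared=1):
    """Return directed edges (i, j) with i < j between connected sentences.

    ASSUMPTION: two sentences are "connected" when they share at least
    `min_shared` words. This is a stand-in for the Helmholtz-principle
    criterion, which the abstract does not specify.
    """
    words = [set(s.lower().split()) for s in sentences]
    edges = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if len(words[i] & words[j]) >= min_shared:
                # ASSUMPTION: orient the edge from the earlier sentence
                # to the later one; the abstract only says an order exists.
                edges.append((i, j))
    return edges


def pagerank(n, edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank on a directed graph with n nodes."""
    out_links = [[] for _ in range(n)]
    for src, dst in edges:
        out_links[src].append(dst)

    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[i] for i in range(n) if not out_links[i])
        new_rank = [(1.0 - damping) / n + damping * dangling / n] * n
        for src in range(n):
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    doc = ["the cat sat", "the dog ran", "a bird flew"]
    edges = build_edges(doc)
    for index, score in enumerate(pagerank(len(doc), edges)):
        print(index, round(score, 4))
```

With this orientation, later sentences that are connected to many earlier ones tend to accumulate rank. Reversing the edge direction would instead favor early sentences, so the choice of direction is a modeling decision that the sketch leaves open.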

Paper Details

Date Published: 21 February 2012
PDF: 12 pages
Proc. SPIE 8302, Imaging and Printing in a Web 2.0 World III, 830207 (21 February 2012); doi: 10.1117/12.909110
Helen Balinsky, Hewlett-Packard Labs. (United Kingdom)
Alexander Balinsky, Cardiff Univ. (United Kingdom)
Steven J. Simske, Hewlett-Packard Labs. (United States)

Published in SPIE Proceedings Vol. 8302:
Imaging and Printing in a Web 2.0 World III
Qian Lin; Jan P. Allebach; Zhigang Fan, Editor(s)
