Proceedings Paper

Clustering method via independent components for semi-structured documents
Author(s): Tong Wang; Da-Xin Liu; Xuanzuo Lin; Wei Sun

Paper Abstract

This paper presents a novel clustering method for XML documents. Much research effort in document clustering is currently devoted to supporting the storage and retrieval of large collections of XML documents. However, traditional text clustering approaches cannot capture the structural information of semi-structured documents. Our technique first extracts relative path features to represent each document. We then map the documents into the Vector Space Model (VSM) and propose a similarity computation. Before clustering, we apply Independent Component Analysis (ICA) to reduce the dimensionality of the VSM. To the best of the authors' knowledge, ICA has not been used for XML clustering before. The standard C-means partition algorithm is also improved: when a solution can no longer be improved, the algorithm applies an appropriate disturbance to the local minimum solution and then continues with the next iteration. The algorithm can thus escape local minima while still covering the whole search space. Experimental results on two real datasets and one synthetic dataset show that the proposed approach is efficient and outperforms a naive clustering method that does not apply ICA.
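The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the relative path features, the similarity computation, or the disturbance step, so everything below is an assumption. Here a "relative path" is taken to be a chain of up to `max_len` consecutive tag names, cosine similarity stands in for the proposed similarity, and the disturbance is Gaussian noise added to the converged centers. The function names (`relative_paths`, `to_vectors`, `cosine`, `lloyd`, `perturbed_cmeans`) are hypothetical. The ICA dimension-reduction step is omitted to keep the sketch free of third-party dependencies; it would sit between `to_vectors` and the clustering call.

```python
import math
import random
import xml.etree.ElementTree as ET
from collections import Counter


def relative_paths(xml_text, max_len=2):
    """Count tag chains of 1..max_len consecutive ancestor/child steps (assumed feature)."""
    counts = Counter()

    def walk(node, ancestors):
        chain = ancestors + [node.tag]
        for n in range(1, min(max_len, len(chain)) + 1):
            counts["/".join(chain[-n:])] += 1
        for child in node:
            walk(child, chain)

    walk(ET.fromstring(xml_text), [])
    return counts


def to_vectors(docs, max_len=2):
    """Map documents to dense count vectors over a shared path vocabulary (VSM)."""
    bags = [relative_paths(d, max_len) for d in docs]
    vocab = sorted(set().union(*bags))
    return vocab, [[float(b[p]) for p in vocab] for b in bags]


def cosine(u, v):
    """Cosine similarity, used here as a stand-in for the paper's similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def lloyd(points, centers, max_iter=100):
    """Standard C-means iterations until the assignment stops changing."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
               for p in points]
        if new == labels:
            break
        labels = new
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, cost


def perturbed_cmeans(points, k, restarts=20, noise=0.5, seed=0):
    """Re-run C-means from a disturbed copy of the best centers; keep any improvement."""
    rng = random.Random(seed)
    best = lloyd(points, rng.sample(points, k))
    for _ in range(restarts):
        shaken = [[x + rng.gauss(0.0, noise) for x in c] for c in best[1]]
        candidate = lloyd(points, shaken)
        if candidate[2] < best[2]:
            best = candidate
    return best


docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><author/></book>",
    "<order><item/><price/></order>",
    "<order><item/><item/><price/></order>",
]
vocab, vectors = to_vectors(docs)
labels, centers, cost = perturbed_cmeans(vectors, k=2)
```

In this toy input the two `book` documents share every path feature and the two `order` documents share none with them, so structure alone separates the groups. The `noise` and `restarts` parameters are illustrative knobs, not values taken from the paper.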

Paper Details

Date Published: 18 April 2006
PDF: 8 pages
Proc. SPIE 6241, Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security 2006, 62410V (18 April 2006); doi: 10.1117/12.665427
Author Affiliations
Tong Wang, Harbin Engineering Univ. (China)
Da-Xin Liu, Harbin Engineering Univ. (China)
Xuanzuo Lin, Northeast Agricultural Univ. (China)
Wei Sun, Harbin Engineering Univ. (China)

Published in SPIE Proceedings Vol. 6241:
Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security 2006
Belur V. Dasarathy, Editor
