
Proceedings Paper

Combining text clustering and retrieval for corpus adaptation
Author(s): Feng He; Xiaoqing Ding

Paper Abstract

Application-relevant text data are valuable in many natural language applications: they yield significantly better performance in vocabulary selection and language modeling, which are widely employed in automatic speech recognition, intelligent input methods, and related tasks. In some situations, however, relevant data are hard to collect, and this scarcity of application-relevant training text hampers such natural language processing tasks. In this paper, using only a small set of application-specific text and combining unsupervised text clustering with text retrieval techniques, the proposed approach finds relevant text in a large, unorganized corpus and thereby adapts the training corpus toward the application area of interest. To evaluate the relevance of the acquired text, and thus validate the effectiveness of the corpus adaptation approach, we measure the performance of an n-gram statistical language model trained on the retrieved text and tested on the application-specific text. The language models trained on the ranked text bundles exhibit well-discriminated perplexities on the application-specific text. Preliminary experiments on short-message text and a large unorganized corpus demonstrate the performance of the proposed methods.
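The relevance criterion described in the abstract can be illustrated with a minimal sketch: train a language model on each retrieved text bundle and rank bundles by the perplexity they assign to the held-out application-specific text (lower perplexity suggests higher relevance). This toy example uses an add-one-smoothed bigram model and invented sample texts; the paper's actual model, smoothing, and data are not specified here, so all names and data below are illustrative assumptions.

```python
import math
from collections import Counter

def train_bigram(corpus_tokens):
    """Collect unigram and bigram counts for an add-one-smoothed bigram model."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = set(corpus_tokens)
    return unigrams, bigrams, vocab

def perplexity(model, test_tokens):
    """Perplexity of test_tokens under the add-one-smoothed bigram model."""
    unigrams, bigrams, vocab = model
    V = len(vocab)
    log_prob, n = 0.0, 0
    for w1, w2 in zip(test_tokens, test_tokens[1:]):
        # Add-one smoothing keeps unseen bigrams at nonzero probability.
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Hypothetical application-specific text and two retrieved bundles.
app_text = "send short message to phone".split()
bundle_on_topic = "send short message send message to phone to phone".split()
bundle_off_topic = "stock market prices rose sharply today again".split()

ppl_on = perplexity(train_bigram(bundle_on_topic), app_text)
ppl_off = perplexity(train_bigram(bundle_off_topic), app_text)
# The on-topic bundle yields lower perplexity, so it ranks as more relevant.
assert ppl_on < ppl_off
```

In the paper's setting, this ranking step would be applied to many bundles produced by clustering and retrieval, and the best-ranked bundles would be merged into the adapted training corpus.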

Paper Details

Date Published: 29 January 2007
PDF: 7 pages
Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000P (29 January 2007); doi: 10.1117/12.703646
Author Affiliations
Feng He, Tsinghua Univ. (China)
Xiaoqing Ding, Tsinghua Univ. (China)

Published in SPIE Proceedings Vol. 6500:
Document Recognition and Retrieval XIV
Xiaofan Lin; Berrin A. Yanikoglu, Editor(s)

© SPIE.