
Proceedings Paper

Finding keywords amongst noise: automatic text classification without parsing
Author(s): Andrew G. Allison; Charles E. M. Pearce; Derek Abbott

Paper Abstract

The amount of text stored on the Internet and in our libraries continues to expand at an exponential rate, and there is a great practical need to locate relevant content. This requires quick, automated methods for classifying textual information by subject. We propose a quick statistical approach that can distinguish 'keywords' from 'noisewords', such as 'the' and 'a', without the need to parse the text into its parts of speech. Our classification is based on an F-statistic that compares the observed Word Recurrence Interval (WRI) with a simple null hypothesis. We also propose a model to account for the observed distribution of WRI statistics, and we subject this model to a number of tests.
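
The abstract does not state the null hypothesis or the exact form of the F-statistic, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a null hypothesis in which a word occurs independently at each position with a fixed probability, so that recurrence intervals are geometrically distributed, and it uses a plain variance ratio (observed interval variance over the variance predicted by that null) as a stand-in for the F-statistic. The function names, the tokenizer, and the `min_count` parameter are all hypothetical.

```python
import re
from collections import defaultdict


def recurrence_intervals(tokens):
    """Map each word to the gaps between its successive occurrences."""
    last_seen = {}
    intervals = defaultdict(list)
    for position, word in enumerate(tokens):
        if word in last_seen:
            intervals[word].append(position - last_seen[word])
        last_seen[word] = position
    return intervals


def variance_ratio(gaps):
    """Observed gap variance divided by the variance under a geometric null.

    Assumption (not from the paper): under the null, a word appears at each
    position with probability p = 1 / mean_gap, so the gaps are geometric
    with variance (1 - p) / p**2 = mean_gap * (mean_gap - 1).
    Returns None when the ratio is undefined.
    """
    n = len(gaps)
    if n < 2:
        return None
    mean = sum(gaps) / n
    expected = mean * (mean - 1)
    if expected <= 0:
        return None
    observed = sum((g - mean) ** 2 for g in gaps) / (n - 1)
    return observed / expected


def score_words(text, min_count=5):
    """Return (word, ratio) pairs sorted from most to least clustered."""
    # Crude tokenizer: lowercase alphabetic runs, no part-of-speech parsing.
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    for word, gaps in recurrence_intervals(tokens).items():
        if len(gaps) + 1 < min_count:
            continue  # too few occurrences to estimate a variance
        ratio = variance_ratio(gaps)
        if ratio is not None:
            scores.append((word, ratio))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, a ratio well above 1 means a word's occurrences are more clustered than random placement would predict, while a ratio near 1 is consistent with the null. Whether clustered words correspond to the paper's 'keywords', and what threshold would separate them from 'noisewords', is not something the abstract specifies.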

Paper Details

Date Published: 15 June 2007
PDF: 12 pages
Proc. SPIE 6601, Noise and Stochastics in Complex Systems and Finance, 660113 (15 June 2007); doi: 10.1117/12.724655
Author Affiliations
Andrew G. Allison, The Univ. of Adelaide (Australia)
Charles E. M. Pearce, The Univ. of Adelaide (Australia)
Derek Abbott, The Univ. of Adelaide (Australia)

Published in SPIE Proceedings Vol. 6601:
Noise and Stochastics in Complex Systems and Finance
János Kertész; Stefan Bornholdt; Rosario N. Mantegna, Editor(s)

© SPIE