Proceedings Paper

Anomaly detection in discussion forum posts using global vectors
Author(s): Paweł Cichosz

Paper Abstract

Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses a potentially promising but relatively uncommon application of anomaly detection to text data. A Polish Internet discussion forum devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana, serves as a text source that is both realistic and potentially interesting in its own right, due to possible associations with drug-related crime. Forum posts are preprocessed by stopword removal, spelling correction, stemming, and frequency-based term filtering. The Global Vectors (GloVe) text representation, an example of the increasingly popular word embedding approach, is combined with two unsupervised anomaly detection algorithms: one based on one-class SVM classification and one based on dissimilarity to k-medoids clusters. The cluster dissimilarity approach combined with the GloVe representation outperforms one-class SVM with respect to detection quality and appears to be the more promising approach to anomaly detection in text data.
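
The cluster dissimilarity idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. The two-dimensional word vectors are hypothetical stand-ins for trained GloVe embeddings, averaging word vectors into a post vector is an assumed pooling step, and scoring a post by its Euclidean distance to the nearest medoid is one assumed reading of "dissimilarity to k-medoids clusters". All function and variable names are illustrative.

```python
import math

# Hypothetical 2-d word vectors standing in for trained GloVe embeddings.
TOY_VECTORS = {
    "plant": (1.0, 0.1), "seed": (0.9, 0.2), "soil": (0.8, 0.0),
    "light": (0.1, 1.0), "lamp": (0.2, 0.9), "timer": (0.0, 0.8),
    "invoice": (-1.0, -1.0), "shipment": (-0.9, -1.1),
}


def doc_vector(tokens, vectors):
    """Represent a post as the mean of its known word vectors (assumed pooling)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("post contains no known tokens")
    dim = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dim))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids: assign points to the nearest medoid, then
    re-pick each cluster's medoid as the member with the smallest total
    distance to the other members. Initialised with the first k points."""
    medoids = list(range(k))
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: euclidean(p, points[m]))
            clusters[nearest].append(i)
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep a medoid whose cluster ended up empty
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(euclidean(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [points[m] for m in medoids]


def anomaly_score(point, medoid_points):
    """Dissimilarity to the clusters: distance to the nearest medoid."""
    return min(euclidean(point, m) for m in medoid_points)


# "Historical" posts (already tokenised) used to build the cluster model.
historical = [
    ["plant", "seed"], ["seed", "soil"], ["plant", "soil"],
    ["light", "lamp"], ["lamp", "timer"], ["light", "timer"],
]
train_points = [doc_vector(doc, TOY_VECTORS) for doc in historical]
medoid_points = k_medoids(train_points, k=2)

# New posts: one resembling the historical data, one unlike it.
score_typical = anomaly_score(
    doc_vector(["plant", "seed", "soil"], TOY_VECTORS), medoid_points
)
score_unusual = anomaly_score(
    doc_vector(["invoice", "shipment"], TOY_VECTORS), medoid_points
)
print(score_typical, score_unusual)
```

In this toy setting the post built from unfamiliar vocabulary lies far from both medoids and so receives a much higher score than the post that resembles the historical data. A threshold on the score would then turn the ranking into a detection decision.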

Paper Details

Date Published: 1 October 2018
PDF: 12 pages
Proc. SPIE 10808, Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2018, 108081R (1 October 2018); doi: 10.1117/12.2501345
Paweł Cichosz, Warsaw Univ. of Technology (Poland)


Published in SPIE Proceedings Vol. 10808:
Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2018
Ryszard S. Romaniuk; Maciej Linczuk, Editor(s)
