Feature Selection for Comment Spam Filtering on YouTube


Alper Kürşat Uysal

Abstract

Spam filtering is one of the most popular domains for text classification. While there are many studies on the classification of spam e-mails and short text messages, comment spam filtering on YouTube is a relatively new topic, as only a limited number of annotated datasets are available. As in all text classification problems, the high dimensionality of the feature space is one of the biggest problems for spam filtering because of its effect on accuracy. The contribution of this study is an analysis of the performance of five state-of-the-art text feature selection methods for spam filtering on YouTube using two widely known classifiers, namely naïve Bayes (NB) and decision tree (DT). Five datasets containing spam comments on different subjects were used in the experiments. These datasets are named Psy, KatyPerry, LMFAO, Eminem, and Shakira. The Macro-F1 measure was used for evaluation, and 3-fold cross-validation was applied for a fair performance assessment. The experiments indicated that the distinguishing feature selector (DFS) and Gini Index (GI) methods are superior to the other three feature selection methods for spam filtering on YouTube. In addition, the DT classifier performed better than the NB classifier in most cases.
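To make the two central quantities of the abstract concrete, the following is a minimal Python sketch of DFS term scoring and Macro-F1 evaluation. It is not the study's implementation: it assumes the commonly cited form of DFS, DFS(t) = Σ_c P(c|t) / (P(t̄|c) + P(t|c̄) + 1), with probabilities estimated from document counts, and the toy comments, labels, and the function names `dfs_scores` and `macro_f1` are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def dfs_scores(docs, labels):
    """Score terms with DFS, assuming the commonly cited form:
    DFS(t) = sum over classes c of P(c|t) / (P(not t|c) + P(t|not c) + 1)."""
    n_docs = len(docs)
    class_sizes = Counter(labels)
    # doc_freq[term][cls] = number of documents of class cls containing term
    doc_freq = defaultdict(Counter)
    for tokens, cls in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term][cls] += 1

    scores = {}
    for term, per_class in doc_freq.items():
        total_with_term = sum(per_class.values())
        score = 0.0
        for cls, size in class_sizes.items():
            with_term = per_class[cls]
            p_class_given_term = with_term / total_with_term
            p_absent_given_class = 1 - with_term / size
            others = n_docs - size
            p_term_given_others = (
                (total_with_term - with_term) / others if others else 0.0
            )
            score += p_class_given_term / (
                p_absent_given_class + p_term_given_others + 1
            )
        scores[term] = score
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_values = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


# Hypothetical toy comments (already tokenized); not taken from the datasets.
docs = [
    ["check", "this", "channel"],
    ["subscribe", "my", "channel"],
    ["love", "this", "song"],
    ["great", "song"],
]
labels = ["spam", "spam", "ham", "ham"]

scores = dfs_scores(docs, labels)
# Keep the three highest-scoring terms as the reduced feature set.
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]

# Hypothetical predictions, used only to exercise the metric.
example_f1 = macro_f1(["spam", "spam", "ham", "ham"],
                      ["spam", "ham", "ham", "ham"])
```

In this sketch, a term that occurs in every document of one class and in no other class receives the maximum score of 1, while a term spread evenly over both classes receives 0.5, so ranking by score and keeping the top terms reduces the dimensionality of the feature space before a classifier such as NB or DT is trained.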

Article Details

How to Cite
[1] A. Uysal, “Feature Selection for Comment Spam Filtering on YouTube”, DataSCI, vol. 1, no. 1, pp. 4-8, Dec. 2018.
Section
Research Articles