Feature Selection for Comment Spam Filtering on YouTube
Spam filtering is one of the most popular domains for text classification. While there are many studies on the classification of spam e-mails and short text messages, comment spam filtering on YouTube is a relatively new topic, as only a limited number of annotated datasets are available. As in all text classification problems, the high dimensionality of the feature space is one of the biggest problems for spam filtering because of its effect on accuracy. The contribution of this study is an analysis of the performance of five state-of-the-art text feature selection methods for spam filtering on YouTube using two widely known classifiers, namely naïve Bayes (NB) and decision tree (DT). Five datasets containing spam comments on different subjects were used in the experiments: Psy, KatyPerry, LMFAO, Eminem, and Shakira. The Macro-F1 score was used as the success measure, and 3-fold cross-validation was applied for a fair performance evaluation. The experiments indicated that the distinguishing feature selector (DFS) and Gini Index (GI) methods are superior to the other three feature selection methods for spam filtering on YouTube. In addition, the DT classifier performed better than the NB classifier in most cases.
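To make two of the ingredients named above more concrete, the following is a minimal Python sketch of DFS term scoring and the Macro-F1 measure. It is not the authors' implementation. The DFS formula used here, DFS(t) = Σ_c P(c|t) / (P(t̄|c) + P(t|c̄) + 1), is the form commonly given in the literature and should be treated as an assumption, since the abstract does not state it. The function names `dfs_scores` and `macro_f1` and the toy comments are hypothetical and are not taken from the five datasets.

```python
from collections import Counter


def dfs_scores(docs, labels):
    """Score every term with the (assumed) DFS formula:
    DFS(t) = sum over classes c of P(c|t) / (P(not t|c) + P(t|not c) + 1).
    Probabilities are estimated from document frequencies."""
    classes = sorted(set(labels))
    docs_in_class = Counter(labels)
    total_docs = len(docs)

    # Per-class document frequency of each term (a term counts once per document).
    df = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        df[label].update(set(tokens))

    vocab = set()
    for c in classes:
        vocab.update(df[c])

    scores = {}
    for term in vocab:
        df_term = sum(df[c][term] for c in classes)
        score = 0.0
        for c in classes:
            in_class = df[c][term]
            other_docs = total_docs - docs_in_class[c]
            p_c_given_t = in_class / df_term
            p_not_t_given_c = 1 - in_class / docs_in_class[c]
            p_t_given_not_c = (df_term - in_class) / other_docs if other_docs else 0.0
            score += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1)
        scores[term] = score
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_values = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


# Hypothetical toy comments, already tokenized; not drawn from the real datasets.
toy_docs = [
    ["subscribe", "my", "channel"],
    ["check", "my", "channel"],
    ["great", "song"],
    ["love", "this", "song"],
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_scores = dfs_scores(toy_docs, toy_labels)
# Rank terms from most to least discriminative; ties are broken alphabetically.
ranked_terms = sorted(toy_scores, key=lambda t: (-toy_scores[t], t))
```

Under this formulation a term that occurs in every document of one class and in no other class receives the maximum score of 1.0, so keeping the top-ranked terms is one way to reduce the dimensionality of the feature space before training NB or DT. Macro-F1 weights each class equally regardless of its size, which matters when spam and legitimate comments are not balanced.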