Data Science and Applications http://jdatasci.com/index.php/jdatasci <p><em><strong>Data Science and Applications (DataSCI)</strong></em> is an international peer-reviewed (refereed) journal that publishes original, high-quality research articles in the field of Data Science and its applications. <em><strong>DataSCI</strong></em> is published online twice a year. The aim of the journal is to publish original scientific research based on data analysis in both the life and social sciences. <em><strong>DataSCI</strong></em> also provides a data-sharing platform that brings together international researchers, professionals and academics. The <em><strong>DataSCI</strong></em> journal accepts articles written in English.</p> <p>The journal covers all studies based on data analysis in both the life and social sciences. Data-based work in areas not listed below may also be accepted.</p> <ul> <li class="show"><strong># scientific data mining, machine learning, and Big Data analytics</strong></li> <li class="show"><strong># scientific data management, network analysis, and knowledge discovery</strong></li> <li class="show"><strong># scholarly communication and (semantic) publishing</strong></li> <li class="show"><strong># research data publication, indexing, quality, and discovery</strong></li> <li class="show"><strong># data wrangling, integration, and provenance of scientific data</strong></li> <li class="show"><strong># trend analysis, prediction, and visualization of research topics</strong></li> <li class="show"><strong># scalable computing, analysis, and learning for Data Science</strong></li> <li class="show"><strong># scientific web services and executable workflows</strong></li> <li class="show"><strong># scientific analytics, intelligence, and real-time decision making</strong></li> <li class="show"><strong># socio-technical systems</strong></li> <li 
class="show"><strong># social impacts of Data Science</strong></li> </ul> DataSCI Team en-US Data Science and Applications Feature Selection for Comment Spam Filtering on YouTube http://jdatasci.com/index.php/jdatasci/article/view/9 <p>Spam filtering is one of the most popular domains for text classification. While there are many studies on the classification of spam e-mails and short text messages, comment spam filtering on YouTube is a relatively new topic, as there is only a limited number of annotated datasets. As with all text classification problems, the high dimensionality of the feature space is one of the biggest problems for spam filtering due to accuracy considerations. The contribution of this study is an analysis of the performance of five state-of-the-art text feature selection methods for spam filtering on YouTube using two widely known classifiers, namely naïve Bayes (NB) and decision tree (DT). Five datasets containing spam comments on different subjects were used in the experiments. These datasets are named Psy, KatyPerry, LMFAO, Eminem, and Shakira. For evaluation, the Macro-F1 success measure was used, and 3-fold cross-validation was chosen for a fair performance evaluation. Experiments indicated that the distinguishing feature selector (DFS) and Gini Index (GI) methods are superior to the other three feature selection methods for spam filtering on YouTube. In addition, the DT classifier performs better than the NB classifier in most cases for spam filtering on YouTube.</p> Alper Kürşat Uysal 2018-12-26 2018-12-26 1 1 4 8 Automatic Text Summarization Methods Used on Twitter http://jdatasci.com/index.php/jdatasci/article/view/10 <p>Automatic text summarization is an area of Natural Language Processing that has become very popular, especially in recent years. 
In general, automatic text summarization is the process by which a computer takes a document as input and produces a summary of it as output. The documents used for summarization are usually selected from news texts, newspaper columns or research texts. In addition, efforts are being made to achieve the same success on microblog documents, which are relatively short and can appear to lack meaning. In this study, automatic text summarization methods applied to data obtained from Twitter, one of the most widely used microblog sites today, are examined. Summarization performance is evaluated in light of the findings, the methods used are examined, and the difficulties encountered and their solutions are presented.</p> Nazan Kemaloglu Ecir Uğur Küçüksille 2018-12-26 2018-12-26 1 1 9 15 Performance Evaluation of Feature Subset Selection Approaches on Rule-Based Learning Algorithms http://jdatasci.com/index.php/jdatasci/article/view/6 <p>There are two main approaches to feature subset selection: wrapper-based and filter-based. In the wrapper-based approach, which is a supervised method, the feature subset selection algorithm acts as a wrapper around an induction algorithm. The induction algorithm is effectively a black box for the feature subset selection algorithm and is usually the classifier itself. The filter approach is an unsupervised method and attempts to assess the merits of features from the data while ignoring the performance of the induction algorithm. In this study, the effects of the feature subset selection approaches on the classification performance of rule-based learning algorithms, i.e., C4.5, RIPPER, PART, and BFTree, were investigated. These algorithms are fast when used in the wrapper-based approach. For various datasets, significant accuracy improvements were achieved with the wrapper-based feature subset selection method. 
Other algorithms, such as Multilayer Perceptron (MLP) and Random Forests (RF), were also applied to the same datasets for accuracy comparison. These two algorithms were very time-inefficient when used in the wrapper approach.</p> Ali Ozturk 2018-12-26 2018-12-26 1 1 16 20 Comparing the performance of basketball players with decision trees and TOPSIS http://jdatasci.com/index.php/jdatasci/article/view/3 <p>In this study, individual game statistics for basketball players from the Euroleague 2017-2018 season are analysed with the decision tree and Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) methods. The aim of this study is to create an alternative ranking system to find the best and the worst performing players in each position, e.g., guards, forwards and centers. Decision trees are a supervised learning method used for classification and regression. The aim of a decision tree is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. TOPSIS, on the other hand, is a multi-criteria decision-making method that can also be used to construct a ranking system. All the individual statistics, such as points, rebounds, assists, steals, blocks, turnovers, free throw percentage and fouls, are used to construct the rankings of players. Both the decision tree and TOPSIS results are compared with the players' Performance Index Rating (PIR), a single number expressing a player's performance. Comparing these three measures revealed the over- and underperformers in the Euroleague for the 2017-2018 season. 
The individual players' performance results are visualized with suitable methods such as Chernoff faces.</p> Erhan Çene Coşkun Parim Batuhan Özkan 2018-12-26 2018-12-26 1 1 21 28 Estimating poverty using aerial images: South African application http://jdatasci.com/index.php/jdatasci/article/view/19 <p>Policy makers and the government rely heavily on survey data when making policy-related decisions. Collecting survey data is labour-intensive, costly and time-consuming; hence it cannot be done frequently or extensively. The main aim of this research is to demonstrate how deep learning in computer vision, coupled with statistical regression modelling, can be used to estimate poverty from aerial images supplemented with national household survey data. This is executed in two phases: an aerial classification and detection phase and a poverty modelling phase. The aerial classification and detection phase uses convolutional neural networks (CNN) to perform settlement typology classification of the aerial images into three broad geo-type classes, namely urban, rural and farm. This is followed by object detection to detect three broad dwelling-type classes in the aerial images, namely brick house, traditional house, and informal settlement. A Mask Region-based CNN (Mask R-CNN) model with a ResNet-101 backbone is used to perform this task. The second phase, the poverty modelling phase, involves using National Income Dynamics Survey (NIDS) data to compute the Sen-Shorrocks-Thon (SST) poverty index. This is followed by using ridge regression to model the poverty measure using aggregated results from the aerial classification and detection phase. The study area for this research is the eThekwini district in KwaZulu-Natal, South Africa. However, this approach can be extended to other districts in South Africa.</p> Vongani Hlavutelo Maluleke Sebnem Er Quentin R. 
Williams 2018-12-26 2018-12-26 1 1 29 36 Combining Artificial Algae Algorithm to Artificial Neural Network for Optimization of Weights http://jdatasci.com/index.php/jdatasci/article/view/7 <p>The Artificial Neural Network (ANN) is one of the most important artificial intelligence algorithms used for classification problems. The structure of an ANN depends on the learning algorithm used for adjusting the weights between the neurons of the layers according to the calculated error between the model value and the real value. Recently, the weights between layers in ANNs have been optimized using metaheuristic optimization algorithms. One of the recent high-performance nonlinear optimization algorithms is the Artificial Algae Algorithm (AAA), a bio-inspired, successful, competitive and robust optimization algorithm. In this study, AAA was used as a tool for optimizing the weights of the ANN. ANN and AAA were combined such that the training step of the ANN is performed by AAA. After training, the ANN proceeds to testing with the optimized weights. The resulting combined model (AAANN) was tested on three benchmark datasets (Iris, Thyroid and Dermatology) from the UCI Machine Learning Repository to demonstrate the performance of this hybrid structure. The results were compared with the MLP algorithm in terms of Mean Absolute Error (MAE). Accordingly, up to 96% reduction in mean MSE levels could be achieved by AAANN for all models.</p> Gülay Tezel Sait Ali Uymaz Esra Yel 2018-12-26 2018-12-26 1 1 37 44 ANN Modelling for Predicting the Water Absorption of Composites with Waste Plastic Pyrolysis Char Fillers http://jdatasci.com/index.php/jdatasci/article/view/8 <p><span lang="EN-GB" style="margin: 0px; font-family: 'Times New Roman','serif'; font-size: 10pt;">Waste material was fragmented into gas, liquid and solid fractions by pyrolysis. 
Recently, the solid fraction (char) has been used as a filler in epoxy composites. The type and properties of the filler affect the water absorption of epoxy composites. A recent water absorption database (1512 data points) was obtained experimentally. Accordingly, the type of pyrolysed plastic, waste pre-washing, pyrolysis temperature, additive dosage and water exposure time were the input parameters of the estimation model, developed with a multilayer perceptron artificial neural network (MLP ANN), to predict the absorbed water quantity as output. Four datasets were derived through data preprocessing. Among all the configurations tested, the highest R² values, 0.991 for training and 0.986 for testing, were attained with 2e4 iterations, a learning rate (lr) of 0.04, a momentum constant (mc) of 0.9, a first hidden layer of 22 nodes, and a second hidden layer of 15 nodes. The R² value attained in the optimum configuration and the average R² attained via 5-fold cross-validation are close to each other for both training and testing. The established model will help users predict the quantity of water absorbed upon exposure, which gives an idea of the suitability of the composite for particular purposes.</span></p> Esra Yel Gülay Tezel Sait Ali Uymaz 2018-12-26 2018-12-26 1 1 45 51