Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and K-Nearest Neighbors Algorithms

Abstract

This study evaluates the impact of three different data types (text only, numeric only, and text + numeric) on the performance of three classifiers: Random Forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB). The classification problems are examined in terms of mean accuracy and the effect of varying algorithm parameters across the different types of datasets. Eight datasets taken from the UCI repository were used to train models for all three algorithms. The results clearly show that RF and kNN outperform NB. Furthermore, kNN and RF achieve roughly the same mean accuracy, although kNN takes less time to train a model. Changing the number of attributes in a dataset has no effect on Random Forest; the mean accuracy of Naïve Bayes fluctuates and ends lower, whereas the mean accuracy of kNN increases and ends higher. Additionally, changing the number of trees has no significant effect on the mean accuracy of Random Forest, although it greatly increases the time needed to train the model. Random Forest and k-Nearest Neighbor prove to be the best classifiers for any type of dataset; nevertheless, Naïve Bayes can outperform the other two algorithms when the feature variables in the problem space are independent. Moreover, Random Forest requires the highest computational time and Naïve Bayes the lowest. k-Nearest Neighbor requires finding an optimal value of k for improved performance, at the cost of computation time. Similarly, changing the number of attributes affects the performance of Naïve Bayes and k-Nearest Neighbor, but not that of Random Forest. This study can be extended by researchers who use parametric methods to analyze the results.
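
As a rough illustration of the comparison described above, the following sketch (not the authors' code) uses scikit-learn to report mean 10-fold cross-validation accuracy and wall-clock time for the three classifiers on a numeric-only, UCI-derived dataset. The dataset choice, the fold count, and the parameter values (100 trees, k = 5, Gaussian NB) are illustrative assumptions rather than the paper's exact setup.

# Minimal sketch: compare mean 10-fold accuracy and timing of
# Random Forest, kNN, and Naive Bayes on one numeric-only dataset.
# Dataset, fold count, and parameter values are illustrative assumptions.
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for one of the eight UCI datasets used in the study.
X, y = load_breast_cancer(return_X_y=True)

classifiers = {
    "Random Forest (100 trees)": RandomForestClassifier(n_estimators=100, random_state=0),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes (Gaussian)": GaussianNB(),
}

for name, clf in classifiers.items():
    start = time.perf_counter()
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    elapsed = time.perf_counter() - start
    print(f"{name}: mean accuracy = {scores.mean():.3f}, time = {elapsed:.2f}s")

Repeating the same loop over each dataset, and sweeping n_estimators or n_neighbors, would mirror the parameter-variation part of the study.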

Authors and Affiliations

Asmita Singh, Malka N. Halgamuge, Rajasekaran Lakshmiganthan



DOI: 10.14569/IJACSA.2017.081201

How To Cite

Asmita Singh, Malka N. Halgamuge, Rajasekaran Lakshmiganthan (2017). Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and K-Nearest Neighbors Algorithms. International Journal of Advanced Computer Science & Applications, 8(12), 1-10. https://europub.co.uk/articles/-A-251150