Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and K-Nearest Neighbors Algorithms

Abstract

This study evaluates the impact of three data types (text only, numeric only, and text + numeric) on the performance of three classifiers: Random Forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB). Classification performance is explored in terms of mean accuracy and the effects of varying algorithm parameters across different types of datasets. The analysis uses eight datasets taken from the UCI Machine Learning Repository to train models for all three algorithms. The results show that RF and kNN outperform NB. Furthermore, kNN and RF achieve roughly the same mean accuracy, but kNN takes less time to train a model. Changing the number of attributes in a dataset has no effect on RF, whereas NB mean accuracy fluctuates and ends lower, and kNN mean accuracy increases and ends higher. Additionally, changing the number of trees has no significant effect on the mean accuracy of RF, but the time to train the model increases greatly. RF and kNN are found to be the best classifiers for any type of dataset. However, NB can outperform the other two algorithms if the feature variables in the problem space are independent. In terms of computational time, RF takes the highest and NB the lowest. kNN requires finding an optimal value of k for improved performance, at the cost of computation time. This study can be extended by researchers who use the parametric method to analyze results.
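
The abstract does not describe the authors' tooling, so the following is only a minimal standard-library sketch of the kind of measurement it reports: mean accuracy and evaluation time for kNN as k varies. The synthetic two-cluster data, the 5-fold split, and the function names (`knn_predict`, `mean_accuracy`, `make_toy_data`, `sweep_k`) are assumptions made for illustration, not the study's datasets or procedure. Random Forest and Naïve Bayes are omitted for brevity.

```python
import math
import random
import time
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def mean_accuracy(X, y, k, n_folds=5):
    """Average kNN accuracy over n_folds cross-validation folds."""
    indices = list(range(len(X)))
    fold_scores = []
    for fold in range(n_folds):
        # Every n_folds-th point (starting at `fold`) is held out for testing.
        test_idx = indices[fold::n_folds]
        held_out = set(test_idx)
        train_X = [X[i] for i in indices if i not in held_out]
        train_y = [y[i] for i in indices if i not in held_out]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx
        )
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / n_folds


def make_toy_data(n_per_class=60, seed=0):
    """Two well-separated Gaussian clusters in 2-D (synthetic, not a UCI dataset)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in enumerate([(0.0, 0.0), (6.0, 6.0)]):
        for _ in range(n_per_class):
            X.append((rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)))
            y.append(label)
    # Shuffle so that each fold contains a mix of both classes.
    order = list(range(len(X)))
    rng.shuffle(order)
    return [X[i] for i in order], [y[i] for i in order]


def sweep_k(X, y, k_values=(1, 3, 5, 7)):
    """Return {k: (mean accuracy, seconds spent evaluating)} for each candidate k."""
    results = {}
    for k in k_values:
        start = time.perf_counter()
        acc = mean_accuracy(X, y, k)
        results[k] = (acc, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    X, y = make_toy_data()
    for k, (acc, seconds) in sweep_k(X, y).items():
        print(f"k={k}: mean accuracy={acc:.3f}, time={seconds:.4f}s")
```

Because kNN has no separate fitting step in this sketch, the timing covers the whole cross-validated evaluation rather than model training alone. On real data, the same loop could be repeated per dataset and per classifier to compare mean accuracy against cost.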

Authors and Affiliations

Asmita Singh, Malka N. Halgamuge, Rajasekaran Lakshmiganthan

  • EP ID EP251150
  • DOI 10.14569/IJACSA.2017.081201

How To Cite

Asmita Singh, Malka N. Halgamuge, Rajasekaran Lakshmiganthan (2017). Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and K-Nearest Neighbors Algorithms. International Journal of Advanced Computer Science & Applications, 8(12), 1-10. https://europub.co.uk/articles/-A-251150