Two Stage Comparison of Classifier Performances for Highly Imbalanced Datasets

Journal Title: Journal of Information and Organizational Sciences - Year 2015, Vol 39, Issue 2

Abstract

Imbalanced learning data often emerges during knowledge discovery in data and presents a significant challenge for data mining methods. In this paper, we investigate the influence of class-imbalanced data on the classification results of artificial intelligence methods, i.e. neural networks and support vector machines (SVM), and on the classification results of classical classification methods represented by RIPPER and the Naïve Bayes classifier. All experiments are conducted on 30 different imbalanced datasets obtained from the KEEL (Knowledge Extraction based on Evolutionary Learning) repository. Classification quality is measured by accuracy and by the area under the ROC curve (AUC). The results indicate that the neural network and the support vector machine improve on the AUC measure when applied to balanced data, but at the same time their classification accuracy deteriorates. RIPPER behaves similarly, but the changes are of a smaller magnitude, while the Naïve Bayes classifier shows overall deterioration on balanced distributions. The number of instances in the presented highly imbalanced datasets has a significant additional impact on the classification performance of the SVM classifier. The results show the potential of the SVM classifier for ensemble creation on imbalanced datasets.
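The two measures named in the abstract can disagree sharply on imbalanced data, which is why both are reported. The following is a minimal sketch of that effect, not the authors' evaluation procedure: the 95/5 class split, the `accuracy` and `auc` helper names, and the trivial "always predict the majority class" classifier are all assumptions made for illustration. AUC is computed here as the fraction of positive/negative pairs ranked correctly, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the paper): 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A trivial classifier that always predicts the majority class.
majority_pred = [0] * len(y_true)
majority_scores = [0.0] * len(y_true)

acc = accuracy(y_true, majority_pred)
area = auc(y_true, majority_scores)
print(f"accuracy = {acc:.2f}, AUC = {area:.2f}")
```

On this toy split the majority-class predictor reaches high accuracy while its AUC stays at chance level, so a gain in AUC accompanied by a drop in accuracy, as the abstract reports after balancing, is not contradictory.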

Authors and Affiliations

Goran Oreški, Stjepan Oreški

How To Cite

Goran Oreški, Stjepan Oreški (2015). Two Stage Comparison of Classifier Performances for Highly Imbalanced Datasets. Journal of Information and Organizational Sciences, 39(2), 209-222. https://europub.co.uk/articles/-A-485182