Two Stage Comparison of Classifier Performances for Highly Imbalanced Datasets

Journal Title: Journal of Information and Organizational Sciences - Year 2015, Vol 39, Issue 2

Abstract

Imbalanced learning data often arises during knowledge discovery in data and presents a significant challenge for data mining methods. In this paper, we investigate the influence of class-imbalanced data on the classification results of artificial intelligence methods, namely neural networks and support vector machines (SVM), and on the results of classical classification methods, represented by RIPPER and the Naïve Bayes classifier. All experiments are conducted on 30 different imbalanced datasets obtained from the KEEL (Knowledge Extraction based on Evolutionary Learning) repository. Classification quality is measured by accuracy and by the area under the ROC curve (AUC). The results indicate that the neural network and the support vector machine improve on the AUC measure when applied to balanced data, while their classification accuracy deteriorates. RIPPER behaves similarly, but the changes are of a smaller magnitude, whereas the Naïve Bayes classifier shows an overall deterioration of results on balanced distributions. The number of instances in the presented highly imbalanced datasets has a significant additional impact on the classification performance of the SVM classifier. The results show the potential of the SVM classifier for ensemble creation on imbalanced datasets.
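The abstract reports that AUC and accuracy can move in opposite directions on imbalanced data. The following minimal Python sketch illustrates how that can happen. The toy labels, scores, and the two hypothetical classifiers A and B are assumptions made up for illustration; they are not data or results from the paper, and the helper functions `accuracy` and `auc` are written here rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative (ties count as half), computed over all positive/negative pairs."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy dataset: 95 majority (0) and 5 minority (1) instances.
y_true = [0] * 95 + [1] * 5

# Classifier A (assumed): always predicts the majority class.
scores_a = [0.0] * 100
pred_a = [0] * 100

# Classifier B (assumed): scores every minority instance high, but also
# raises 10 false alarms on majority instances.
scores_b = [0.9] * 10 + [0.1] * 85 + [0.8] * 5
pred_b = [1 if s >= 0.5 else 0 for s in scores_b]

acc_a, auc_a = accuracy(y_true, pred_a), auc(y_true, scores_a)
acc_b, auc_b = accuracy(y_true, pred_b), auc(y_true, scores_b)

print(f"A: accuracy={acc_a:.3f}  AUC={auc_a:.3f}")
print(f"B: accuracy={acc_b:.3f}  AUC={auc_b:.3f}")
```

In this toy setting, classifier A has the higher accuracy even though it never detects the minority class, while classifier B has the higher AUC. This is one reason both measures are worth reporting on imbalanced datasets.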

Authors and Affiliations

Goran Oreški, Stjepan Oreški

How To Cite

Goran Oreški, Stjepan Oreški (2015). Two Stage Comparison of Classifier Performances for Highly Imbalanced Datasets. Journal of Information and Organizational Sciences, 39(2), 209-222. https://europub.co.uk/articles/-A-485182