A High-Performing Similarity Measure for Categorical Dataset with SF-Tree Clustering Algorithm
Journal Title: International Journal of Advanced Computer Science & Applications - Year 2018, Vol 9, Issue 5
Abstract
Tasks such as clustering and classification assume the existence of a similarity measure to assess the similarity (or dissimilarity) of a pair of observations or clusters. The key difference between most clustering methods lies in their similarity measures. This article proposes a new similarity measure called PWO (“Probability of the Weights between Overlapped items”) for clustering categorical datasets; proves that PWO is a metric; presents a framework implementation to detect the best similarity value for different datasets; and improves the F-tree clustering algorithm with a semi-supervised method to refine the results. The experimental evaluation on real categorical datasets (Mushrooms, KrVskp, Congressional Voting, Soybean-Large, Soybean-Small, Hepatitis, Zoo, Lenses, and Adult-Stretch) shows that PWO measures the similarity between categorical data more effectively than state-of-the-art algorithms; that clustering based on PWO with a pre-defined number of clusters yields a good separation of classes with high purity, averaging 80% coverage of the real classes; and that the overlap estimator perfectly estimates the value of the overlap threshold using a small sample of around 5% of the dataset.
Authors and Affiliations
Mahmoud A. Mahdi, Samir E. Abdelrahman, Reem Bahgat