A High-Performing Similarity Measure for Categorical Dataset with SF-Tree Clustering Algorithm
Journal Title: International Journal of Advanced Computer Science & Applications - Year 2018, Vol 9, Issue 5
Abstract
Tasks such as clustering and classification assume the existence of a similarity measure to assess the similarity (or dissimilarity) of a pair of observations or clusters. The key difference between most clustering methods lies in their similarity measures. This article proposes a new similarity measure function called PWO ("Probability of the Weights between Overlapped items"), which can be used in clustering categorical datasets; proves that PWO is a metric; presents a framework implementation to detect the best similarity value for different datasets; and improves the F-tree clustering algorithm with a semi-supervised method to refine the results. The experimental evaluation on real categorical datasets (Mushrooms, KrVskp, Congressional Voting, Soybean-Large, Soybean-Small, Hepatitis, Zoo, Lenses, and Adult-Stretch) shows that PWO is more effective in measuring the similarity between categorical data than state-of-the-art algorithms; that clustering based on PWO with a pre-defined number of clusters results in a good separation of classes with high purity, averaging 80% coverage of the real classes; and that the overlap estimator perfectly estimates the value of the overlap threshold using a small sample of around 5% of the dataset.
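The abstract reports clustering quality in terms of purity. As a rough illustration only, the sketch below computes the standard purity score (the fraction of items that belong to the majority class of their cluster). It is not the authors' evaluation code and does not implement PWO or the SF-Tree algorithm; the function name `purity` and the toy labels are assumptions made for this example.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, class_labels):
    """Standard cluster purity: share of items in their cluster's majority class."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each true class occurs inside each cluster.
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster][label] += 1

    # Sum the size of the majority class of every cluster.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in members.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical toy data: three clusters over ten items with two classes.
clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
labels = ["e", "e", "p", "p", "p", "p", "e", "e", "e", "e"]
print(purity(clusters, labels))
```

On this toy input, eight of the ten items fall in the majority class of their cluster, so the score is 0.8.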
Authors and Affiliations
Mahmoud A. Mahdi, Samir E. Abdelrahman, Reem Bahgat
Towards Development of Real-Time Handwritten Urdu Character to Speech Conversion System for Visually Impaired
Text to Speech (TTS) Conversion Systems have been an area of research for decades and have been developed for both handwritten and typed text in various languages. Existing research shows that it has been a challenging t...
Security Provisions in Stream Ciphers Through Self Shrinking and Alternating Step Generator
In cryptography, stream ciphers are used to encrypt plaintext data bits one by one. The security of stream ciphers depends upon the randomness of the key stream, good linear span and low probability of finding the initial states of p...
The Relationships of Soft Systems Methodology (SSM), Business Process Modeling and e-Government
e-Government has emerged in several countries. Because many aspects must be considered, and because there are some soft components in e-Government, the Soft Systems Methodology (SSM) can be c...
New Deep Kernel Learning based Models for Image Classification
Deep learning systems are used to solve many problems in different domains, but they carry a risk of over-fitting as richer representations are added. In this paper, three different models with different deep multiple...
Opinion Mining: An Approach to Feature Engineering
Sentiment Analysis or opinion mining refers to a process of identifying and categorizing the subjective information in source materials using natural language processing (NLP), text analytics and statistical linguistics....