A Novel Subset Selection Clustering-Based Algorithm for High Dimensional Data

Abstract

Feature selection, also known as variable selection, attribute selection, or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are to be distinguished from feature extraction: feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the original features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (data points). The goal is to identify a subset of the most useful features that produces results comparable to those obtained with the full feature set. A feature selection algorithm may be evaluated in terms of both efficiency and effectiveness. Efficiency concerns the time required to find a subset of features, while effectiveness concerns the quality of that subset. Based on these criteria, a fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature, one that is strongly related to the target classes, is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum spanning tree clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study.
Extensive experiments are carried out to compare FAST with several representative feature selection algorithms, namely FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
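
The two-step procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which correlation measure or cluster-partitioning rule FAST uses, so the sketch assumes absolute Pearson correlation as the similarity measure and a fixed edge-weight threshold (the hypothetical `cut` parameter) to split the minimum spanning tree into clusters. The names `pearson`, `DisjointSet`, and `select_features` are likewise illustrative.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


class DisjointSet:
    """Union-find structure, used for Kruskal's algorithm and for cluster labels."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def select_features(columns, target, cut=0.5):
    """Return sorted indices of one representative feature per cluster.

    columns: list of feature columns (each a list of numbers)
    target:  numeric class labels, one per sample
    cut:     ASSUMED threshold; MST edges heavier than this are removed
    """
    m = len(columns)
    # Relevance of each feature to the target (assumed measure: |Pearson r|).
    relevance = [abs(pearson(col, target)) for col in columns]

    # Complete graph over features; strongly correlated pairs get small weights.
    edges = sorted(
        (1.0 - abs(pearson(columns[i], columns[j])), i, j)
        for i, j in combinations(range(m), 2)
    )

    # Step 1a: minimum spanning tree via Kruskal's algorithm.
    forest = DisjointSet(m)
    mst = []
    for w, i, j in edges:
        if forest.union(i, j):
            mst.append((w, i, j))

    # Step 1b: drop heavy MST edges; the remaining connected pieces are clusters.
    clusters = DisjointSet(m)
    for w, i, j in mst:
        if w <= cut:
            clusters.union(i, j)

    # Step 2: keep the feature most relevant to the target in each cluster.
    best = {}
    for i in range(m):
        root = clusters.find(i)
        if root not in best or relevance[i] > relevance[best[root]]:
            best[root] = i
    return sorted(best.values())
```

Calling `select_features(columns, target)` on a list of feature columns returns the indices of the retained features. Building the complete feature graph takes a number of correlation computations quadratic in the number of features, so this sketch illustrates the idea only and makes no claim about the efficiency reported for FAST.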

Authors and Affiliations

Balineni Bala Krishna | M.Tech (CSE), NRI Institute of Technology (NRIIT), A.P., India
Kolavasi Chandra Mouli | Assistant Professor, Dept. of Master of Computer Application, NRI Institute of Technology (NRIIT), A.P., India

How To Cite

Balineni Bala Krishna, Kolavasi Chandra Mouli (2015). A Novel Subset Selection Clustering-Based Algorithm for High Dimensional Data. International Journal of Science Engineering and Advance Technology, 3(7), 246-251. https://europub.co.uk/articles/-A-16509