Examining the Impact of Feature Selection Methods on Text Classification
Journal Title: International Journal of Advanced Computer Science & Applications - Year 2017, Vol 8, Issue 12
Abstract
Feature selection, which aims to identify the distinctive terms that best represent a document, is one of the most important steps of classification. Feature selection reduces the dimension of the document vectors and consequently shortens processing time. In this study, feature selection methods were examined in terms of dimension reduction rates, classification success rates, and the relation between dimension reduction and classification success. kNN (k-Nearest Neighbors) and SVM (Support Vector Machines) were used as classifiers. Eight feature selection methods were tested: five standard methods (Odds Ratio (OR), Mutual Information (MI), Information Gain (IG), Chi-Square (CHI) and Document Frequency (DF)), two combined methods (Union of Feature Selections (UFS) and Correlation of Union of Feature Selections (CUFS)), and one new method (Sum of Term Frequency (STF)). The experiments selected 100 to 1000 terms from each class, in increments of 100 terms. kNN produced much better results than SVM. Considering the average values on both datasets, STF was the most successful feature selection method. CUFS, a combined model, reduced the dimension the most; accordingly, it classified documents more successfully, with fewer terms and in less time, than many of the standard methods.
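The per-class selection step described in the abstract can be illustrated with a minimal sketch. It uses Document Frequency (DF) only, since the abstract does not define the other methods. The function names, the toy corpus, the tie-breaking rule, and the formula for the dimension reduction rate are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def top_terms_per_class(docs, labels, n_terms):
    """Rank terms by document frequency within each class; keep the top n_terms.

    Hypothetical helper: the ranking criterion here is plain DF.
    """
    df_by_class = {}
    for tokens, label in zip(docs, labels):
        # set() so that each term counts once per document
        df_by_class.setdefault(label, Counter()).update(set(tokens))

    selected = {}
    for label, df in df_by_class.items():
        # Descending frequency, then alphabetical order to break ties (assumed rule)
        ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
        selected[label] = [term for term, _ in ranked[:n_terms]]
    return selected


def build_vocabulary(selected):
    """Union of the per-class selections; a term chosen by several classes appears once."""
    return sorted(set().union(*selected.values()))


def dimension_reduction_rate(original_size, reduced_size):
    """Assumed definition: fraction of the original terms that were discarded."""
    return 1.0 - reduced_size / original_size


# Toy corpus (invented for this sketch): tokenized documents and class labels
docs = [
    ["goal", "match", "team", "goal"],
    ["team", "coach", "match"],
    ["market", "stock", "price"],
    ["stock", "bank", "price", "market"],
]
labels = ["sport", "sport", "economy", "economy"]

selected = top_terms_per_class(docs, labels, n_terms=2)
vocabulary = build_vocabulary(selected)
original_size = len({term for tokens in docs for term in tokens})
rate = dimension_reduction_rate(original_size, len(vocabulary))
```

In the study's setting, `n_terms` would range from 100 to 1000 in steps of 100, and the resulting vocabulary would define the document vectors passed to the kNN and SVM classifiers.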
Authors and Affiliations
Mehmet Fatih KARACA, Safak BAYIR