A Strategy for Training Set Selection in Text Classification Problems

Abstract

One issue in text classification is the choice of good samples on which to train the classifier. Training sets that properly represent the characteristics of each class have a better chance of yielding a successful predictor. Moreover, data are sometimes redundant, or the learning process takes large amounts of computing time. To overcome these issues, data selection techniques have been proposed, including instance selection. Some of these techniques are based on nearest neighbors, ordered removals, random sampling, particle swarms or evolutionary methods. Their weaknesses usually involve a lack of accuracy, a lack of robustness as the amount of data increases, overfitting and high complexity. This work proposes a new immune-inspired suppressive mechanism that involves selection. As a result, data that are not relevant to a classifier’s final model are eliminated from the training process. Experiments show the effectiveness of this method, and a comparison with other techniques shows that the proposed method has the advantage of being accurate and robust for large data sets, with less complexity in the algorithm.
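The abstract does not spell out the suppressive mechanism, so the following is only a rough illustration of the general idea of suppression-style instance selection, not the authors' algorithm. It assumes a bag-of-words representation, cosine similarity, and a greedy rule that drops a training sample when a sample of the same class that is at least `threshold` similar has already been kept. The function names and the default threshold are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens mapped to counts (an assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suppress_redundant(samples, threshold=0.8):
    """Greedy suppression sketch (hypothetical, not the paper's method).

    samples: list of (text, label) pairs.
    A sample is suppressed when an already kept sample with the same label
    has cosine similarity >= threshold; otherwise it is kept.
    """
    kept = []  # entries are (text, label, vector)
    for text, label in samples:
        vec = bag_of_words(text)
        redundant = any(
            kept_label == label and cosine(vec, kept_vec) >= threshold
            for _, kept_label, kept_vec in kept
        )
        if not redundant:
            kept.append((text, label, vec))
    return [(text, label) for text, label, _ in kept]


if __name__ == "__main__":
    training = [
        ("cheap pills buy now", "spam"),
        ("buy cheap pills now", "spam"),        # same words as the first spam sample
        ("meeting agenda for monday", "ham"),
    ]
    for text, label in suppress_redundant(training):
        print(label, "|", text)
```

The reduced set would then be passed to any classifier in place of the full training set. The outcome of this greedy variant depends on the order of the samples and on the chosen threshold, both of which are illustrative choices here.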

Authors and Affiliations

Maria Passini, Katiusca Estébanez, Grazziela Figueredo, Nelson Ebecken

  • EP ID EP120335
  • DOI 10.14569/IJACSA.2013.040608

How To Cite

Maria Passini, Katiusca Estébanez, Grazziela Figueredo, Nelson Ebecken (2013). A Strategy for Training Set Selection in Text Classification Problems. International Journal of Advanced Computer Science & Applications, 4(6), 54-60. https://europub.co.uk/articles/-A-120335