An Efficient Document Categorization Approach for Turkish Based Texts

Abstract

Because the rapid and uncontrolled growth of textual data makes it infeasible to classify all documents by human effort, automatic methods have been developed to organize the data. In this study, a support vector machine (SVM) classifier is used for text categorization. In text categorization applications, the text representation step can require considerable computation time because a very large number of terms must be weighted. The solution adopted so far in the literature has been to use lexicons that contain fewer terms. However, it has been observed that such solutions reduce classification accuracy. In this paper, the term-document matrix is constructed in a user-dependent way, according to the purpose of the classification. Since the number of terms is still relatively large, a hash table is used to search for terms efficiently. In this way, an efficient and rapid TF-IDF method is introduced to construct a weight matrix that represents the term-document relations, and a study on classifying Turkish news documents and Turkish columnists is conducted. With the proposed approach, the computation time required for the term-weighting process is reduced substantially; in addition, 99% accuracy is achieved in determining news categories and 98% accuracy in identifying columnists.
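The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: TF-IDF weighting in which each term lookup goes through a hash table (here a Python dict) rather than a linear search over the term list. The function names `build_vocabulary` and `tfidf_matrix`, the pre-tokenized input, and the specific weighting formula (relative term frequency times log(N / document frequency)) are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Assign each distinct term a column index.

    The dict acts as the hash table: checking whether a term is already
    known, and fetching its column, takes constant time on average.
    """
    vocab = {}
    for tokens in docs:
        for term in tokens:
            if term not in vocab:
                vocab[term] = len(vocab)
    return vocab


def tfidf_matrix(docs, vocab):
    """Return one sparse row per document as {column_index: weight}.

    Assumed weighting (one common TF-IDF variant, not necessarily the
    paper's): weight = (count / document length) * log(N / df).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        row = {}
        for term, count in counts.items():
            col = vocab.get(term)  # hash-table lookup
            if col is None:
                continue  # term is outside the chosen vocabulary
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            row[col] = tf * idf
        rows.append(row)
    return rows


# Hypothetical toy corpus of already-tokenized Turkish news snippets.
docs = [
    ["ekonomi", "borsa", "ekonomi"],
    ["spor", "maç"],
    ["borsa", "spor"],
]
vocab = build_vocabulary(docs)
weights = tfidf_matrix(docs, vocab)
```

Each row of `weights` could then be turned into a feature vector for a classifier such as an SVM. Storing rows as sparse dicts is a design choice of this sketch: most terms do not occur in any given document, so most entries of the full matrix would be zero.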

Authors and Affiliations

Sevinç İlhan Omurca* | Kocaeli University, Faculty of Engineering, Computer Engineering Department, Umuttepe Campus, Kocaeli – 41380, Turkey
Semih Baş | Tubitak Marmara Research Center Technology Free Zone, IBTECH, Kocaeli – 41470, Turkey
Ekin Ekinci | Kocaeli University, Faculty of Engineering, Computer Engineering Department, Umuttepe Campus, Kocaeli – 41380, Turkey

How To Cite

Sevinç İlhan Omurca*, Semih Baş, Ekin Ekinci (2015). An Efficient Document Categorization Approach for Turkish Based Texts. International Journal of Intelligent Systems and Applications in Engineering, 3(1), 7-13. https://europub.co.uk/articles/-A-761