NADA: New Arabic Dataset for Text Classification

Abstract

In recent years, Arabic Natural Language Processing, including text summarization, text simplification, text categorization, and other natural-language-related disciplines, has attracted a growing number of researchers. Appropriate resources for Arabic Text Categorization have become a necessity for the development of this research. The few existing corpora are not ready for use: they require preprocessing and filtering operations. In addition, most of them are not organized according to standard classification methods, which produces unbalanced classes and thus reduces classification accuracy. This paper proposes a New Arabic Dataset (NADA) for Text Categorization. The corpus is composed of two existing corpora, OSAC and DAA. It is preprocessed and filtered using recent state-of-the-art methods. It is also organized based on the Dewey Decimal Classification scheme and the Synthetic Minority Over-Sampling Technique (SMOTE). The experimental results show that NADA is an efficient dataset ready for use in Arabic Text Categorization.
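
The abstract does not describe how the oversampling step was applied, so the following is only a minimal sketch of the general Synthetic Minority Over-Sampling Technique, not the authors' pipeline. It assumes documents have already been converted to numeric feature vectors; the function name `smote`, its parameters, and the toy data are hypothetical and chosen for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority-class feature vectors.

    Illustrative sketch only: each synthetic point is an interpolation
    between a minority sample and one of its k nearest minority neighbors.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Interpolate at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy 2-D vectors standing in for document features (hypothetical data).
minority_docs = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_docs = smote(minority_docs, n_new=6)
print(len(new_docs))
```

Because every synthetic vector lies on a line segment between two real minority samples, the new points stay inside the region the minority class already occupies, which is why the technique is commonly used to even out class sizes before training a classifier.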

Authors and Affiliations

Nada Alalyani, Souad Larabi Marie-Sainte

  • EP ID EP393876
  • DOI 10.14569/IJACSA.2018.090928

How To Cite

Nada Alalyani, Souad Larabi Marie-Sainte (2018). NADA: New Arabic Dataset for Text Classification. International Journal of Advanced Computer Science & Applications, 9(9), 206-212. https://europub.co.uk/articles/-A-393876