NADA: New Arabic Dataset for Text Classification

Abstract

In recent years, Arabic Natural Language Processing, including text summarization, text simplification, text categorization, and other natural-language disciplines, has been attracting more researchers. Appropriate resources for Arabic text categorization have become essential to the development of this research. The few existing corpora are not ready for use: they require preprocessing and filtering. In addition, most of them are not organized according to a standard classification scheme, which leads to unbalanced classes and thus reduces classification accuracy. This paper proposes a New Arabic Dataset (NADA) for text categorization. The corpus is composed of two existing corpora, OSAC and DAA. It is preprocessed and filtered using recent state-of-the-art methods, and it is organized using the Dewey Decimal Classification scheme and the Synthetic Minority Over-Sampling Technique (SMOTE). The experimental results show that NADA is an efficient dataset ready for use in Arabic text categorization.
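The abstract names the Synthetic Minority Over-Sampling Technique as the means of addressing unbalanced classes but does not describe the procedure. As a rough illustration only, and not the authors' pipeline, the core idea of that technique can be sketched as follows: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote_oversample`, its parameters, and the toy feature vectors are assumptions made for this sketch; real text data would first be converted to numeric feature vectors.

```python
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples from a list of minority-class vectors.

    Hypothetical helper for illustration: each new sample interpolates
    between a randomly chosen minority sample and one of its k nearest
    minority neighbours (squared Euclidean distance).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by distance to the base sample.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority-class feature vectors (made up for this sketch).
minority = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [1.0, 1.0, 3.0]]
new_points = smote_oversample(minority, n_new=4)
balanced_minority = minority + new_points
```

Because every synthetic vector is an interpolation between two existing minority vectors, it stays inside the region already occupied by that class, which is what distinguishes this approach from simply duplicating samples.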

Authors and Affiliations

Nada Alalyani, Souad Larabi Marie-Sainte

  • DOI 10.14569/IJACSA.2018.090928

How To Cite

Nada Alalyani, Souad Larabi Marie-Sainte (2018). NADA: New Arabic Dataset for Text Classification. International Journal of Advanced Computer Science & Applications, 9(9), 206-212. https://europub.co.uk/articles/-A-393876