NADA: New Arabic Dataset for Text Classification

Abstract

In recent years, Arabic Natural Language Processing, including text summarization, text simplification, text categorization and other natural language-related disciplines, has been attracting more researchers. Appropriate resources for Arabic text categorization are becoming a necessity for the development of this research. The few existing corpora are not ready for use: they require preprocessing and filtering operations. In addition, most of them are not organized according to standard classification methods, which leads to unbalanced classes and thus reduces classification accuracy. This paper proposes a New Arabic Dataset (NADA) for text categorization. The corpus is composed of two existing corpora, OSAC and DAA. The new corpus is preprocessed and filtered using recent state-of-the-art methods. It is also organized using the Dewey decimal classification scheme and the Synthetic Minority Over-Sampling Technique (SMOTE). The experimental results show that NADA is an efficient dataset ready for use in Arabic text categorization.
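The class-balancing idea behind the Synthetic Minority Over-Sampling Technique can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote`, its parameters, and the toy 2-D vectors are all hypothetical, and real document features (for example term-weight vectors) would be far higher-dimensional. Each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority-class neighbours of the base sample (itself excluded)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors standing in for an under-represented class (made-up data).
minority_class = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority_class, n_new=6, k=3)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by that class, which is why the technique adds variety without simply duplicating documents.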

Authors and Affiliations

Nada Alalyani, Souad Larabi Marie-Sainte

  • EP ID EP393876
  • DOI 10.14569/IJACSA.2018.090928

How To Cite

Nada Alalyani, Souad Larabi Marie-Sainte (2018). NADA: New Arabic Dataset for Text Classification. International Journal of Advanced Computer Science & Applications, 9(9), 206-212. https://europub.co.uk/articles/-A-393876