Towards Corpus-Based Stemming for Arabic Texts

Abstract

Stemming, the process of reducing words to their stems, is an essential processing step in many natural language processing (NLP) applications, such as information extraction, text analysis and machine translation. This paper presents a corpus-based light stemmer for Arabic. The stemmer first groups morphological variants of words in an Arabic corpus on the basis of shared characters, then strips their affixes (prefixes and suffixes) to produce a common stem. Experimental results show that 86% of the words in the test set were correctly grouped under a common reduced form (i.e. the possible stem). In some cases, however, the reduced form is not the legitimate stem: the evaluation shows that 72.2% of the words in the test set were reduced to their legitimate stem. The stemmer was developed with the future aim of investigating how effective word stems are for extracting bilingual equivalents from an Arabic-English parallel corpus.
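The general idea of light stemming described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the affix lists, the minimum stem length, and the function names `light_stem` and `group_by_stem` are assumptions made for illustration only. The sketch also simplifies the order of operations by stripping affixes first and grouping words on the resulting form, whereas the paper groups variants by shared characters before stripping.

```python
from collections import defaultdict

# Hypothetical affix lists for illustration; not taken from the paper.
# Longer affixes come first so that they are tried before shorter ones.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ة", "ه", "ي"]
MIN_STEM_LEN = 3  # assumed lower bound on the length of a reduced form


def light_stem(word):
    """Strip at most one prefix and one suffix from an Arabic word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def group_by_stem(words):
    """Group words that share the same reduced form."""
    groups = defaultdict(list)
    for word in words:
        groups[light_stem(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    sample = ["الكتاب", "والكتاب", "كتابان", "كتابها"]
    for stem, variants in group_by_stem(sample).items():
        print(stem, variants)
```

In this sketch the four sample words, all variants of the word for "book", fall into a single group. A purely rule-based stripper like this can over- or under-stem, which is consistent with the gap the abstract reports between correctly grouped words and words reduced to their legitimate stem.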

Authors and Affiliations

Yasser Muhammad Naguib Sabtan

How To Cite

Yasser Muhammad Naguib Sabtan (2018). Towards Corpus-Based Stemming for Arabic Texts. International Journal of Linguistics, Literature and Translation, 1(4), 117-128. https://europub.co.uk/articles/-A-476847