Towards Corpus-Based Stemming for Arabic Texts

Abstract

Stemming, the process of reducing words to their stems, is an essential processing step in many natural language processing (NLP) applications, such as information extraction, text analysis and machine translation. This paper presents a light stemmer for Arabic that takes a corpus-based approach. The stemmer first groups morphological variants of words in an Arabic corpus according to their shared characters, then strips off their affixes (prefixes and suffixes) to produce a common stem. Experimental results show that 86% of the words in the test set were correctly grouped under a shared reduced form (i.e. the possible stem). The reduced form is not always the legitimate stem, however: the evaluation shows that 72.2% of the words in the test set were reduced to their legitimate stem. The stemmer was developed with the future aim of investigating how effective word stems are for extracting bilingual equivalents from an Arabic-English parallel corpus.
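The abstract does not give the stemmer's affix lists or its criteria for grouping by shared characters, so the following Python sketch is only a generic illustration of light stemming, not the paper's method. It strips at most one prefix and one suffix, taken from short lists assumed here for illustration, and then groups corpus words under the reduced form they share (the paper groups first and strips afterwards). The names `PREFIXES`, `SUFFIXES`, `MIN_STEM_LEN`, `light_stem` and `group_by_stem` are hypothetical.

```python
from collections import defaultdict

# Assumed, deliberately short affix lists; the paper's actual lists are not
# given in the abstract.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ات", "ان", "ون", "ين", "ها", "ة"]

# Assumed lower bound on the length of the reduced form, so that short words
# are not stripped down to one or two letters.
MIN_STEM_LEN = 3


def light_stem(word):
    """Strip at most one prefix and one suffix, trying longer affixes first."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def group_by_stem(words):
    """Map each reduced form to the corpus words that share it."""
    groups = defaultdict(list)
    for word in words:
        groups[light_stem(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    corpus = ["كتاب", "الكتاب", "والكتاب", "كتابان", "المدرسة"]
    for stem, variants in group_by_stem(corpus).items():
        print(stem, variants)
```

A reduced form produced this way is only a possible stem. In this sketch, for instance, المدرسة is reduced to مدرس, which illustrates why the abstract reports correct grouping and reduction to the legitimate stem as separate figures.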

Authors and Affiliations

Yasser Muhammad Naguib Sabtan

How To Cite

Yasser Muhammad Naguib Sabtan (2018). Towards Corpus-Based Stemming for Arabic Texts. International Journal of Linguistics, Literature and Translation, 1(4), 117-128. https://europub.co.uk/articles/-A-476847