Urdu Word Segmentation using Machine Learning Approaches

Abstract

Word segmentation is a basic NLP task that plays a significant role in many NLP areas. The main areas that benefit from word segmentation include information retrieval (IR), part-of-speech (POS) tagging, named entity recognition (NER), and sentiment analysis. Urdu word segmentation is a challenging task. There are a number of reasons for this, but the space insertion problem and the space omission problem are the major ones. Compared to Urdu, the tools and resources developed for word segmentation of English and other English-like Western languages have reached very high performance. Some languages, such as English, give a clear indication of word boundaries through spaces or capitalization of the first character of a word. Many other languages, such as Thai, Lao, and Urdu, do not have reliable delimiters between words. The objective of this research is to present a machine learning based approach for Urdu word segmentation. We use conditional random fields (CRF) for this task. Other challenges in Urdu text are compound words and reduplicated words. In this paper, we attempt to overcome these challenges using a machine learning methodology.
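
The abstract does not describe the labeling scheme or features used with the CRF. As a rough illustration only, word segmentation is commonly cast as character-level sequence labeling, where each character is tagged as beginning a word (B) or continuing one (I). The sketch below assumes that framing; the function names, the B/I tag set, and the context-window features are hypothetical choices for illustration, not the authors' method.

```python
from typing import Dict, List, Tuple


def words_to_labels(words: List[str]) -> Tuple[List[str], List[str]]:
    """Flatten a gold segmentation into characters and B/I labels."""
    chars: List[str] = []
    labels: List[str] = []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            # First character of each word starts a new segment.
            labels.append("B" if i == 0 else "I")
    return chars, labels


def labels_to_words(chars: List[str], labels: List[str]) -> List[str]:
    """Rebuild words from characters and predicted B/I labels."""
    words: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


def char_features(chars: List[str], i: int, window: int = 2) -> Dict[str, object]:
    """Context-window features for position i (an assumed feature set)."""
    feats: Dict[str, object] = {"bias": 1.0, "char": chars[i]}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        # Positions outside the sequence are padded.
        feats[f"char[{offset:+d}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


if __name__ == "__main__":
    # Illustrative three-word Urdu sentence, already segmented.
    gold = ["یہ", "کتاب", "ہے"]
    chars, labels = words_to_labels(gold)
    features = [char_features(chars, i) for i in range(len(chars))]
    # A CRF toolkit would be trained on (features, labels) pairs;
    # here we only check that the encoding is reversible.
    print(labels_to_words(chars, labels) == gold)
```

Under this framing, a space omission error corresponds to a missed B label and a space insertion error to a spurious one, so both problems reduce to predicting the correct tag for each character.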

Authors and Affiliations

Sadiq Nawaz Khan, Khairullah Khan, Wahab Khan, Asfandyar Khan, Fazali Subhan, Aman Ullah Khan, Burhan Ullah

  • EP ID EP321866
  • DOI 10.14569/IJACSA.2018.090628

How To Cite

Sadiq Nawaz Khan, Khairullah Khan, Wahab Khan, Asfandyar Khan, Fazali Subhan, Aman Ullah Khan, Burhan Ullah (2018). Urdu Word Segmentation using Machine Learning Approaches. International Journal of Advanced Computer Science & Applications, 9(6), 193-200. https://europub.co.uk/articles/-A-321866