Urdu Word Segmentation using Machine Learning Approaches

Abstract

Word segmentation is a basic NLP task that plays a significant role in many NLP areas. Areas that benefit from it include information retrieval (IR), part-of-speech (POS) tagging, named entity recognition (NER), and sentiment analysis. Urdu word segmentation is challenging for a number of reasons, of which the space insertion problem and the space omission problem are the major ones. Compared with Urdu, the tools and resources developed for word segmentation of English and similar Western languages perform very well. Some languages, such as English, mark word boundaries clearly with spaces or with capitalization of the first character of a word. Many other languages, e.g. Thai, Lao, and Urdu, have no reliable delimiter between words. The objective of this research is to present a machine learning based approach to Urdu word segmentation, for which we adopt conditional random fields (CRF). Compound words and reduplicated words pose further challenges in Urdu text. In this paper, we try to overcome these challenges with a machine learning methodology.
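As an illustration only, word segmentation is commonly framed for a CRF as character-level sequence labeling: each character is tagged B (begins a word) or I (inside a word), and the tagger predicts these tags from local context. The abstract does not state the labeling scheme or the features used, so the B/I tag set, the function names, and the feature template below are all assumptions. The sketch covers only the data preparation that would feed a CRF, not the CRF itself.

```python
from typing import Dict, List, Tuple


def words_to_labels(words: List[str]) -> Tuple[str, List[str]]:
    """Turn gold-segmented words into an unspaced string plus one B/I tag per character."""
    chars: List[str] = []
    labels: List[str] = []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            # Assumed tag set: "B" marks the first character of a word, "I" the rest.
            labels.append("B" if i == 0 else "I")
    return "".join(chars), labels


def labels_to_words(text: str, labels: List[str]) -> List[str]:
    """Rebuild words from an unspaced string and its predicted B/I tags."""
    words: List[str] = []
    for ch, tag in zip(text, labels):
        if tag == "B" or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # extend the current word
    return words


def char_features(text: str, i: int) -> Dict[str, str]:
    """Hypothetical feature template: the character and its immediate neighbours."""
    return {
        "char": text[i],
        "prev": text[i - 1] if i > 0 else "<s>",
        "next": text[i + 1] if i < len(text) - 1 else "</s>",
    }


# Example with two Urdu words ("Urdu", "language") as gold segmentation.
gold = ["اردو", "زبان"]
unspaced, tags = words_to_labels(gold)
features = [char_features(unspaced, i) for i in range(len(unspaced))]
restored = labels_to_words(unspaced, tags)
```

The per-character feature dictionaries paired with the tag sequence form one training example for a sequence labeler. At prediction time, `labels_to_words` converts the predicted tags back into words.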

Authors and Affiliations

Sadiq Nawaz Khan, Khairullah Khan, Wahab Khan, Asfandyar Khan, Fazali Subhan, Aman Ullah Khan, Burhan Ullah

  • EP ID EP321866
  • DOI 10.14569/IJACSA.2018.090628

How To Cite

Sadiq Nawaz Khan, Khairullah Khan, Wahab Khan, Asfandyar Khan, Fazali Subhan, Aman Ullah Khan, Burhan Ullah (2018). Urdu Word Segmentation using Machine Learning Approaches. International Journal of Advanced Computer Science & Applications, 9(6), 193-200. https://europub.co.uk/articles/-A-321866