A Printed PAW Image Database of Arabic Language for Document Analysis and Recognition

Journal Title: Journal of ICT Research and Applications - Year 2017, Vol 11, Issue 2

Abstract

Document image analysis and recognition are important topics in the field of artificial intelligence. Machine-learning methods in this field require a database with good script samples. Many suitable databases exist for Latin and Asian languages, but there is a shortage of databases with Arabic samples. This work introduces a new database of printed Arabic text. It adopts the new concept of collecting sub-words (PAWs) instead of words or individual character samples. These PAWs constitute all words in the Arabic language. The collected database consists of 83,056 PAWs extracted from approximately 550,000 different words. Each PAW is presented in the database in five font types: Thuluth, Naskh, Andalusi, Typing Machine, and Kufi. In total, the database therefore consists of 415,280 images (83,056 PAWs × 5 fonts). Moreover, ground truth information is included with each PAW image, describing the number and frequency of its occurrences and the positions and shapes of its characters. This paper also presents a statistical analysis of the frequency of each PAW in the Arabic language.
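The sub-word idea and the per-PAW ground truth described in the abstract can be illustrated with a minimal Python sketch. It is not the author's extraction pipeline: the splitting rule assumes the standard Arabic joining behaviour (a PAW ends after a letter that does not connect to the next one), the input is assumed to be undiacritized, frequency is assumed to mean a PAW's share of all PAW occurrences, and the names `PawRecord`, `split_into_paws`, `character_shapes`, and `build_records` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Figures stated in the abstract.
FONTS = ("Thuluth", "Naskh", "Andalusi", "Typing Machine", "Kufi")
NUM_PAWS = 83_056

# Letters that never connect to the letter after them, so a PAW ends there.
# Assumption: standard Arabic joining rules; the paper's rules may differ.
RIGHT_JOINING = set("اأإآدذرزوؤ")
HAMZA = "ء"  # stands alone: connects to neither neighbour


@dataclass
class PawRecord:
    """Hypothetical ground-truth entry; field names are illustrative."""
    text: str
    occurrence_count: int
    occurrence_frequency: float  # assumed: share of all PAW occurrences
    shapes: list[str]            # one shape per character, in position order


def split_into_paws(word: str) -> list[str]:
    """Split an undiacritized Arabic word into its connected sub-words."""
    paws, current = [], ""
    for ch in word:
        if ch == HAMZA:
            # A standalone hamza closes the running PAW and forms its own.
            if current:
                paws.append(current)
                current = ""
            paws.append(ch)
            continue
        current += ch
        if ch in RIGHT_JOINING:
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws


def character_shapes(paw: str) -> list[str]:
    """Contextual shape of each character inside a single PAW."""
    if len(paw) == 1:
        return ["isolated"]
    return ["initial"] + ["medial"] * (len(paw) - 2) + ["final"]


def build_records(words: list[str]) -> dict[str, PawRecord]:
    """Count every PAW over a word list and attach the ground truth."""
    counts = Counter(paw for word in words for paw in split_into_paws(word))
    total = sum(counts.values())
    return {
        paw: PawRecord(paw, n, n / total, character_shapes(paw))
        for paw, n in counts.items()
    }


if __name__ == "__main__":
    print(split_into_paws("كتاب"))   # ['كتا', 'ب']
    print(NUM_PAWS * len(FONTS))     # 415280
```

Calling `build_records` on a word list yields one record per distinct PAW, which is the kind of occurrence statistic the abstract refers to. Rendering each PAW in the five fonts is a separate step that is not sketched here.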

Authors and Affiliations

Bilal Bataineh

  • EP ID EP324697
  • DOI 10.5614/itbj.ict.res.appl.2017.11.2.6

How To Cite

Bilal Bataineh (2017). A Printed PAW Image Database of Arabic Language for Document Analysis and Recognition. Journal of ICT Research and Applications, 11(2), 200-212. https://europub.co.uk/articles/-A-324697