A Printed PAW Image Database of Arabic Language for Document Analysis and Recognition
Journal Title: Journal of ICT Research and Applications - Year 2017, Vol 11, Issue 2
Abstract
Document image analysis and recognition are important topics in the field of artificial intelligence. In this context, a database of good script samples is an important requirement for machine-learning processes. Many suitable databases exist for Latin and Asian languages, but there is a shortage of databases with Arabic samples. In this work, a new database of printed Arabic text is introduced. It adopts the new concept of collecting sub-words, or pieces of Arabic words (PAWs), instead of whole words or individual character samples. Together, these PAWs make up all words in the Arabic language. The collected database consists of 83,056 PAW images extracted from approximately 550,000 different words. Each sample is presented in the database in five font types: Thuluth, Naskh, Andalusi, Typing Machine, and Kufi. In total, the database consists of 415,280 images. Moreover, ground truth information is included with each PAW image to describe its occurrence number, its occurrence frequency, and the positions and shapes of its characters. This paper also presents a statistical analysis of the frequency of each PAW in the Arabic language.
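As a rough illustration of the PAW idea, the following Python sketch splits undiacritized words into sub-words and counts how often each one occurs. It is not the authors' extraction procedure: the set of non-connecting letters, the treatment of a standalone hamza as its own PAW, and the function names `split_into_paws` and `paw_frequencies` are all assumptions made for this example.

```python
from collections import Counter

# Assumption: letters that never join to the letter that follows them.
NON_CONNECTING = set("اأإآدذرزوؤةى")
# Assumption: the standalone hamza joins to neither neighbour.
ISOLATED = "ء"


def split_into_paws(word):
    """Split an undiacritized Arabic word into its connected pieces (PAWs)."""
    paws, current = [], ""
    for ch in word:
        if ch == ISOLATED:
            # Close any open piece, then emit the hamza on its own.
            if current:
                paws.append(current)
                current = ""
            paws.append(ch)
            continue
        current += ch
        if ch in NON_CONNECTING:
            # The piece ends after a letter that cannot join forward.
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws


def paw_frequencies(words):
    """Map each PAW to (occurrence count, relative frequency) over a word list."""
    counts = Counter(p for w in words for p in split_into_paws(w))
    total = sum(counts.values())
    return {p: (n, n / total) for p, n in counts.items()}


if __name__ == "__main__":
    sample = ["كتاب", "مدرسة", "باب"]
    for paw, (count, freq) in paw_frequencies(sample).items():
        print(paw, count, round(freq, 3))
```

A record of this kind (PAW, occurrence count, relative frequency) loosely mirrors the occurrence number and occurrence frequency fields the abstract mentions. The five-font rendering and the character position and shape annotations are outside the scope of this sketch.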
Authors and Affiliations
Bilal Bataineh