QTID: Quran Text Image Dataset

Abstract

Improving the accuracy of Arabic text recognition in images requires a large, modern dataset, since data is the fuel for many modern machine learning models. This paper proposes a new dataset, the Quran Text Image Dataset (QTID), the first Arabic dataset that includes Arabic marks. It consists of 309,720 distinct annotated Arabic word images, each 192x64 pixels, containing 2,494,428 characters in total, all taken from the Holy Quran. These finely annotated images were randomly divided into training, validation, and test sets of 90%, 5%, and 5%, respectively. To analyze QTID, several dataset statistics are presented. Experimental evaluation shows that leading Arabic text recognition engines such as Tesseract and ABBYY FineReader perform poorly on word images from the proposed dataset.
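The random 90%/5%/5% division described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `split_indices`, the fixed seed, and the use of image indices in place of actual image files are assumptions made for illustration only. Integer arithmetic is used so that the split sizes are exact.

```python
import random

TOTAL_IMAGES = 309_720  # number of word images reported in the abstract


def split_indices(n_items, train_pct=90, val_pct=5, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/val/test lists.

    The seed and the percentage parameters are illustrative assumptions;
    the abstract only states that the division was random and 90/5/5.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # reproducible shuffle

    n_train = n_items * train_pct // 100
    n_val = n_items * val_pct // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(TOTAL_IMAGES)
print(len(train), len(val), len(test))
```

With the reported total of 309,720 images, this yields 278,748 training indices and 15,486 each for validation and testing. The same figures imply an average of roughly eight characters per word image (2,494,428 / 309,720).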

Authors and Affiliations

Mahmoud Badry, Hesham H M Hassan, Hanaa Bayomi, Hussien Oakasha

  • EP ID EP278332
  • DOI 10.14569/IJACSA.2018.090351

How To Cite

Mahmoud Badry, Hesham H M Hassan, Hanaa Bayomi, Hussien Oakasha (2018). QTID: Quran Text Image Dataset. International Journal of Advanced Computer Science & Applications, 9(3), 385-391. https://europub.co.uk/articles/-A-278332