Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment

Abstract

Documents in every field are now being digitized to reduce paper usage. The availability of modern technology in the form of scanners and cameras supports the growth of multimedia data, especially documents stored as image files. Searching for a particular text in a large collection of scanned documents is difficult when the documents are stored as images from which the text has not been extracted. This research proposes a method for extracting text from large-scale scanned document images using Google Vision OCR on the Hadoop architecture. The objects of the research are student thesis documents, comprising the cover page, the approval page, and the abstract, all of which are stored in the university's digital library. The extraction process begins by preparing an input folder containing the image documents (in JPEG format) in the Hadoop Distributed File System (HDFS) and then reading each image document. Each image is passed to Google Vision OCR to obtain a text document (in TXT format), and the result is saved to an output folder in HDFS. The same process is repeated for every document in the folder. Test results show that the proposed method extracted all test documents successfully. The recognition process achieved 100% accuracy, and extraction was twice as fast as manual extraction. Google Vision OCR also showed better extraction performance than other OCR tools. The proposed automated extraction system can recognize text in large-scale image documents accurately and can be operated in a real-time environment.
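
The folder-level loop described in the abstract (read each JPEG, run OCR, write a TXT file) might look roughly like the sketch below. It is not the authors' implementation. Local directories stand in for the HDFS input and output folders, and a real deployment would replace the `pathlib` calls with an HDFS client. The function names `google_vision_ocr` and `extract_folder` are hypothetical, and the OCR call assumes the `google-cloud-vision` Python package with credentials already configured.

```python
from pathlib import Path
from typing import Callable, List


def google_vision_ocr(image_bytes: bytes) -> str:
    """OCR one image with the Google Cloud Vision client library.

    Assumption: google-cloud-vision is installed and credentials are set up.
    The import is lazy so the rest of the sketch runs without the package.
    """
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(
        image=vision.Image(content=image_bytes)
    )
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text


def extract_folder(
    input_dir: Path,
    output_dir: Path,
    ocr: Callable[[bytes], str] = google_vision_ocr,
) -> List[Path]:
    """Run OCR on every JPEG in input_dir and write one TXT file per image.

    Local folders are a stand-in for the HDFS folders named in the abstract.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(input_dir.iterdir()):
        # Skip anything that is not a JPEG image.
        if image_path.suffix.lower() not in {".jpg", ".jpeg"}:
            continue
        text = ocr(image_path.read_bytes())
        out_path = output_dir / (image_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Passing the OCR step in as a parameter keeps the loop independent of any one OCR service, so the same sketch could be used to compare Google Vision OCR against other tools.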

Authors and Affiliations

Rifiana Arief, Achmad Benny Mutiara, Tubagus Maulana Kusuma, Hustinawaty Hustinawaty

DOI: 10.14569/IJACSA.2018.091117

How To Cite

Rifiana Arief, Achmad Benny Mutiara, Tubagus Maulana Kusuma, Hustinawaty Hustinawaty (2018). Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment. International Journal of Advanced Computer Science & Applications, 9(11), 112-116. https://europub.co.uk/articles/-A-417614