Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment

Abstract

Documents are now being digitized in all fields to reduce paper usage. The availability of modern technology in the form of scanners and cameras supports the growth of multimedia data, especially documents stored as image files. Searching for a particular text in large-scale scanned document images is difficult when the text has not been extracted from the images. In this research, a text extraction method for large-scale scanned document images using Google Vision OCR on the Hadoop architecture is proposed. The objects of the research are student thesis documents, comprising the cover page, the approval page, and the abstract. All documents are stored in the university's digital library. The extraction process begins by preparing an input folder containing the image documents (in JPEG format) in the Hadoop Distributed File System (HDFS) of Apache Hadoop, followed by reading each image document. The image document is then processed with Google Vision OCR to obtain a text document (in TXT format), and the result is saved to an output folder in HDFS. The same process is repeated for all documents in the folder. Test results show that the proposed method was able to extract all test documents successfully. The recognition process achieved 100% accuracy, and extraction was twice as fast as manual extraction. Google Vision OCR also showed better extraction performance than other OCR tools. The proposed automated extraction system can recognize text in large-scale image documents accurately and can be operated in a real-time environment.
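The folder-level loop described in the abstract (read each JPEG from an input folder, run OCR, write one TXT file per image to an output folder) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `extract_folder` and `vision_ocr` are hypothetical, a local directory stands in for the HDFS input and output folders, and the OCR backend is passed in as a function so the loop can be tried without cloud credentials. The `vision_ocr` helper assumes the `google-cloud-vision` Python client is installed and authenticated.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_folder(input_dir: Path, output_dir: Path,
                   ocr: Callable[[bytes], str]) -> Dict[str, int]:
    """Run `ocr` on every JPEG in input_dir and save one TXT file per image.

    Local directories are used here as a stand-in for HDFS folders.
    Returns a mapping of output file name to extracted text length.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    written: Dict[str, int] = {}
    for image_path in sorted(input_dir.iterdir()):
        # Only JPEG images are processed; other files are skipped.
        if image_path.suffix.lower() not in {".jpg", ".jpeg"}:
            continue
        text = ocr(image_path.read_bytes())
        out_path = output_dir / (image_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written[out_path.name] = len(text)
    return written


def vision_ocr(image_bytes: bytes) -> str:
    """Hypothetical OCR backend built on the google-cloud-vision client.

    Assumption: the library is installed and credentials are configured.
    The import is deferred so the rest of the sketch runs without it.
    """
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(
        image=vision.Image(content=image_bytes)
    )
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text
```

With credentials in place, a call such as `extract_folder(Path("input"), Path("output"), vision_ocr)` would process a whole folder. Keeping the OCR step as a parameter is a design choice for this sketch only; it allows the loop to be exercised with a stub function and leaves the choice of storage layer (for example, an HDFS client in place of local paths) open.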

Authors and Affiliations

Rifiana Arief, Achmad Benny Mutiara, Tubagus Maulana Kusuma, Hustinawaty Hustinawaty

  • DOI 10.14569/IJACSA.2018.091117

How To Cite

Rifiana Arief, Achmad Benny Mutiara, Tubagus Maulana Kusuma, Hustinawaty Hustinawaty (2018). Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment. International Journal of Advanced Computer Science & Applications, 9(11), 112-116. https://europub.co.uk/articles/-A-417614