Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment

Abstract

Documents are now being digitized in all fields to reduce paper usage. Modern technology in the form of scanners and cameras supports the growth of multimedia data, especially documents stored as image files. Searching for a particular text in large-scale scanned document images is difficult when the text has not been extracted from the images. In this research, a method for extracting text from large-scale scanned document images using Google Vision OCR on the Hadoop architecture is proposed. The objects of research are student thesis documents, which include the cover page, the approval page, and the abstract. All documents are stored in the university's digital library. The extraction process begins with preparing an input folder that contains the image documents (in JPEG format) in the Hadoop Distributed File System (HDFS), followed by reading each image document. The image document is then extracted using Google Vision OCR to obtain a text document (in TXT format), and the result is saved to an output folder in HDFS. The same process is repeated for all documents in the folder. Test results show that the proposed method was able to extract all test documents successfully. The recognition process achieved 100% accuracy, and extraction was twice as fast as manual extraction. Google Vision OCR also showed better extraction performance than other OCR tools. The proposed automated extraction system can recognize text in large-scale image documents accurately and can be operated in a real-time environment.
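
The extraction loop described in the abstract (read each JPEG from an input folder, run OCR, write one TXT file per image to an output folder) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `extract_folder` and `vision_ocr` are hypothetical, the folders are ordinary local paths standing in for HDFS directories, and the OCR step is passed in as a function so the loop can be exercised without network access. The `vision_ocr` helper assumes the `google-cloud-vision` Python client (version 2 or later) and valid credentials.

```python
from pathlib import Path
from typing import Callable, List


def vision_ocr(image_bytes: bytes) -> str:
    """OCR one image with Google Cloud Vision.

    Assumption: the google-cloud-vision package is installed and
    credentials are configured. Imported lazily so the rest of the
    sketch runs without the library.
    """
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(
        image=vision.Image(content=image_bytes)
    )
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text


def extract_folder(
    input_dir: str,
    output_dir: str,
    ocr: Callable[[bytes], str] = vision_ocr,
) -> List[str]:
    """Run OCR on every JPEG in input_dir; write one TXT file per image.

    Local paths stand in for HDFS folders here (an assumption); a real
    deployment would read and write through an HDFS client instead.
    Returns the names of the text files written.
    """
    src, dst = Path(input_dir), Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(src.iterdir()):
        # Skip anything that is not a JPEG image.
        if image_path.suffix.lower() not in {".jpg", ".jpeg"}:
            continue
        text = ocr(image_path.read_bytes())
        out_path = dst / (image_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path.name)
    return written
```

Calling `extract_folder("input", "output")` would send every JPEG in `input` to the Vision API; passing a different `ocr` function makes it possible to compare other OCR tools with the same loop.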

Authors and Affiliations

Rifiana Arief, Achmad Benny Mutiara, Tubagus Maulana Kusuma, Hustinawaty Hustinawaty

DOI: 10.14569/IJACSA.2018.091117

How To Cite

Rifiana Arief, Achmad Benny Mutiara, Tubagus Maulana Kusuma, Hustinawaty Hustinawaty (2018). Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment. International Journal of Advanced Computer Science & Applications, 9(11), 112-116. https://europub.co.uk/articles/-A-417614