Improving Small File Management in Hadoop

Journal Title: Transactions on Machine Learning and Artificial Intelligence - Year 2017, Vol 5, Issue 4

Abstract

Hadoop, now considered the de facto platform for managing big data, has revolutionized the way customers manage their data. As an open-source implementation of MapReduce, it was designed to offer high scalability and availability across clusters of thousands of machines. Through its two principal components, HDFS for distributed storage and MapReduce for distributed processing, companies and researchers benefit greatly from its capabilities. However, Hadoop was designed to handle large files, so its performance can degrade heavily when it has to deal with a large number of small files. The small file problem has been well defined by researchers and the Hadoop community, but most of the proposed approaches only address the pressure placed on the NameNode memory. Grouping small files into one of several formats, most of which are supported by current Hadoop distributions, certainly reduces the number of metadata entries and relieves the memory limitation, but that is only part of the equation. The real problem that organizations need to solve when dealing with many small files is cluster performance when those files are processed in Hadoop clusters. In this paper, we propose a new strategy for making more efficient use of one of the common solutions, which groups files in the MapFile format. The core idea is to organize small files into MapFile output files based on specific attributes, and to use prefetching and caching mechanisms during read access. This leads to fewer metadata requests to the NameNode and better I/O performance during MapReduce jobs. The experimental results show that this approach can yield better access times when the cluster contains a massive number of small files.
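
The core idea described in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation and does not use any Hadoop API: `ToyMapFile` is a hypothetical in-memory stand-in for a MapFile (sorted keys with binary-search lookup), the grouping attribute (here, the file extension) is an assumption chosen for illustration, and `PrefetchingReader` is a hypothetical LRU cache that, on a miss, also loads the next few entries in key order.

```python
from bisect import bisect_left
from collections import OrderedDict, defaultdict


def group_by_attribute(files, attribute):
    """Group (name, data) pairs by attribute(name); each group is sorted by name."""
    groups = defaultdict(list)
    for name, data in files:
        groups[attribute(name)].append((name, data))
    return {key: sorted(entries) for key, entries in groups.items()}


class ToyMapFile:
    """Hypothetical in-memory stand-in for one merged container of small files."""

    def __init__(self, entries):
        # Entries must already be sorted by name, as in a sorted key-value file.
        self.keys = [name for name, _ in entries]
        self.values = [data for _, data in entries]
        self.reads = 0  # number of simulated container reads

    def read_from(self, name, count):
        """Return up to `count` consecutive entries starting at `name`."""
        i = bisect_left(self.keys, name)
        if i == len(self.keys) or self.keys[i] != name:
            raise KeyError(name)
        self.reads += 1
        return list(zip(self.keys[i:i + count], self.values[i:i + count]))


class PrefetchingReader:
    """LRU cache in front of a ToyMapFile; a miss also loads the next neighbours."""

    def __init__(self, container, capacity=4, prefetch=2):
        self.container = container
        self.capacity = capacity
        self.prefetch = prefetch
        self.cache = OrderedDict()

    def get(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)  # mark as most recently used
            return self.cache[name]
        batch = self.container.read_from(name, 1 + self.prefetch)
        wanted = batch[0][1]
        # Insert in reverse so the requested entry ends up most recently used.
        for key, data in reversed(batch):
            self.cache[key] = data
            self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return wanted


# Illustrative data: file names and contents are made up for the example.
files = [("a.log", b"1"), ("b.log", b"2"), ("c.log", b"3"), ("x.csv", b"9")]
groups = group_by_attribute(files, lambda name: name.rsplit(".", 1)[-1])
container = ToyMapFile(groups["log"])
reader = PrefetchingReader(container)
for name in ("a.log", "b.log", "c.log"):
    reader.get(name)
print(container.reads)
```

In this toy model, files that share an attribute sit next to each other in the same container, so a sequential scan over related files is served mostly from the cache after the first read. How the actual system chooses attributes, cache sizes, and prefetch depth is not specified in the abstract.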

Authors and Affiliations

O. Achandair, M. Elmahouti, Samira Khoulji, M. L. Kerkeb

  • EP ID EP310225
  • DOI 10.14738/tmlai.54.3333

How To Cite

O. Achandair, M. Elmahouti, Samira Khoulji, M. L. Kerkeb (2017). Improving Small File Management in Hadoop. Transactions on Machine Learning and Artificial Intelligence, 5(4), 689-702. https://europub.co.uk/articles/-A-310225