Improving Small File Management in Hadoop

Journal Title: Transactions on Machine Learning and Artificial Intelligence - Year 2017, Vol 5, Issue 4

Abstract

Hadoop, now considered the de facto platform for managing big data, has revolutionized the way customers manage their data. As an open-source implementation of MapReduce, it was designed to offer high scalability and availability across clusters of thousands of machines. Through its two principal components, HDFS for distributed storage and MapReduce as the distributed processing engine, companies and researchers benefit greatly from its capabilities. However, Hadoop was designed to handle large files, so when it has to deal with a large number of small files, performance can degrade heavily. The small file problem has been well defined by researchers and the Hadoop community, but most of the proposed approaches deal only with the pressure placed on NameNode memory. Grouping small files into one of several possible formats, most of which are supported by current Hadoop distributions, certainly reduces the number of metadata entries and eases the memory limitation, but that is only part of the equation. The real impact that organizations need to address when dealing with many small files is cluster performance when those files are processed in Hadoop clusters. In this paper, we propose a new strategy for using one of the common solutions more efficiently, namely grouping files in the MapFile format. The core idea is to organize small files into MapFile output files based on specific attributes, and to use prefetching and caching mechanisms during read access. This leads to fewer metadata requests to the NameNode and better I/O performance during MapReduce jobs. The experimental results show that this approach can yield better access times when the cluster contains a massive number of small files.
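The core idea can be illustrated with a small in-memory sketch. It does not use Hadoop or the MapFile API, and it is not the authors' implementation. It only mimics the three ingredients named in the abstract: grouping files by an attribute, keeping each group in sorted key order as a MapFile does, and serving reads through a cache that prefetches neighbouring entries. The names `group_by_attribute` and `CachedGroupReader`, the use of the file extension as the grouping attribute, and the `cache_size` and `prefetch` parameters are all assumptions made for illustration.

```python
import bisect
from collections import OrderedDict, defaultdict


def group_by_attribute(files, attribute):
    """Group a {name: data} mapping by attribute(name), sorting each group by name."""
    groups = defaultdict(dict)
    for name, data in files.items():
        groups[attribute(name)][name] = data
    # A MapFile keeps its entries in sorted key order; each group imitates that.
    return {attr: sorted(members.items()) for attr, members in groups.items()}


class CachedGroupReader:
    """Reads one sorted group through an LRU cache with sequential prefetching."""

    def __init__(self, entries, cache_size=4, prefetch=2):
        self.keys = [key for key, _ in entries]
        self.values = [value for _, value in entries]
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.prefetch = prefetch
        # Counts cache misses, a stand-in for costly index or metadata lookups.
        self.lookups = 0

    def _put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        self.lookups += 1
        i = bisect.bisect_left(self.keys, key)  # binary search in the sorted keys
        if i == len(self.keys) or self.keys[i] != key:
            raise KeyError(key)
        # Cache the requested entry plus the next `prefetch` entries in key order.
        for j in range(i, min(i + 1 + self.prefetch, len(self.keys))):
            self._put(self.keys[j], self.values[j])
        return self.values[i]


if __name__ == "__main__":
    files = {"b.log": b"2", "a.log": b"1", "x.csv": b"4", "c.log": b"3"}
    groups = group_by_attribute(files, lambda name: name.rsplit(".", 1)[-1])
    reader = CachedGroupReader(groups["log"])
    for name in ("a.log", "b.log", "c.log"):
        reader.get(name)
    print(reader.lookups)
```

In this toy run, reading the three `.log` files in key order costs a single lookup, because the first miss also loads the next two entries into the cache. Whether a similar saving appears on a real cluster depends on the access pattern and on how the files are grouped.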

Authors and Affiliations

O. Achandair, M. Elmahouti, Samira Khoulji, M. L. Kerkeb

  • EP ID EP310225
  • DOI 10.14738/tmlai.54.3333


How To Cite

O. Achandair, M. Elmahouti, Samira Khoulji, M. L. Kerkeb (2017). Improving Small File Management in Hadoop. Transactions on Machine Learning and Artificial Intelligence, 5(4), 689-702. https://europub.co.uk/articles/-A-310225