Improving Small File Management in Hadoop

Journal Title: Transactions on Machine Learning and Artificial Intelligence - Year 2017, Vol 5, Issue 4

Abstract

Hadoop, now widely considered the de facto platform for managing big data, has revolutionized the way customers manage their data. As an open-source implementation of MapReduce, it was designed to offer high scalability and availability across clusters of thousands of machines. Through its two principal components, HDFS for distributed storage and MapReduce as the distributed processing engine, companies and researchers benefit greatly from its capabilities. However, Hadoop was designed to handle large files, so performance can degrade heavily when it has to deal with a large number of small files. The small file problem has been well defined by researchers and the Hadoop community, but most of the proposed approaches deal only with the pressure placed on NameNode memory. Grouping small files into one of several container formats, most of which are supported by current Hadoop distributions, certainly reduces the number of metadata entries and relieves the memory limitation, but that remains only part of the equation. The real impact that organizations need to address when dealing with many small files is cluster performance when those files are processed in Hadoop clusters. In this paper, we propose a new strategy for using more efficiently one of the common solutions, which groups files in the MapFile format. The core idea is to organize small files into MapFile output files based on specific attributes, and to use prefetching and caching mechanisms during read access. This would lead to fewer metadata requests to the NameNode and better I/O performance during MapReduce jobs. The experimental results show that this approach can help obtain better access times when the cluster contains a massive number of small files.
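The core idea can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation and does not use Hadoop: `ToyMapFile` is a hypothetical in-memory stand-in for a MapFile (sorted keys with lookup), and `group_by_attribute`, `PrefetchingReader`, the directory-based grouping attribute, and the `prefetch` and `capacity` parameters are all assumptions made for illustration. The sketch groups small files by an attribute, then serves reads through an LRU cache that prefetches neighbouring entries on a miss, so that sequential reads touch the container less often.

```python
from bisect import bisect_left
from collections import OrderedDict, defaultdict


class ToyMapFile:
    """Hypothetical in-memory stand-in for a MapFile: records sorted by key."""

    def __init__(self, entries):
        self.keys = sorted(entries)
        self.values = [entries[k] for k in self.keys]

    def read_from(self, key, count):
        """Return up to `count` (key, value) pairs starting at `key`."""
        i = bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            raise KeyError(key)
        return list(zip(self.keys[i:i + count], self.values[i:i + count]))


def group_by_attribute(small_files, attribute):
    """Bucket small files into one ToyMapFile per attribute value."""
    buckets = defaultdict(dict)
    for name, data in small_files.items():
        buckets[attribute(name)][name] = data
    return {attr: ToyMapFile(entries) for attr, entries in buckets.items()}


class PrefetchingReader:
    """Serves reads from an LRU cache, prefetching neighbours on a miss."""

    def __init__(self, map_files, attribute, prefetch=4, capacity=64):
        self.map_files = map_files
        self.attribute = attribute
        self.prefetch = prefetch
        self.capacity = capacity
        self.cache = OrderedDict()
        self.container_reads = 0  # number of reads that reached a container

    def read(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)  # mark as most recently used
            return self.cache[name]
        self.container_reads += 1
        batch = self.map_files[self.attribute(name)].read_from(name, self.prefetch)
        for key, value in batch:
            self.cache[key] = value
            self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used entry
        return batch[0][1]  # the requested file is the first entry of the batch


# Eight small log files and three images, grouped by top-level directory.
files = {f"logs/2017-01-{d:02d}.txt": f"log {d}".encode() for d in range(1, 9)}
files.update({f"img/{i}.png": b"png" for i in range(3)})


def by_dir(name):
    return name.split("/")[0]


reader = PrefetchingReader(group_by_attribute(files, by_dir), by_dir, prefetch=4)
for d in range(1, 9):
    reader.read(f"logs/2017-01-{d:02d}.txt")
print(reader.container_reads)  # 2
```

In this toy run, eight sequential reads reach the container only twice because each miss pulls in the next three entries as well. Whether a similar saving appears on a real cluster depends on the access pattern and on how well the grouping attribute matches it.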

Authors and Affiliations

O. Achandair, M. Elmahouti, Samira Khoulji, M. L. Kerkeb

Keywords

EP ID EP310225
DOI 10.14738/tmlai.54.3333