Optimizing Hadoop for Small File Management

Journal Title: Transactions on Machine Learning and Artificial Intelligence - Year 2017, Vol 5, Issue 4

Abstract

HDFS is one of the most widely used distributed file systems, offering high availability and scalability on low-cost hardware. It is delivered as the storage component of the Hadoop framework. Coupled with MapReduce, the processing component, it has become the de facto platform for managing big data. However, HDFS was designed specifically to handle large files; when it has to store a large number of small files, Hadoop deployments can become inefficient. In this paper, we propose a new strategy for managing small files. Our approach consists of two principal phases. The first phase consolidates the small-file input of more than one client and stores it consecutively in the first allocated block, in SequenceFile format, continuing into the next blocks as each one fills. This avoids allocating separate blocks for different streams, which reduces both the number of requests for available blocks and the metadata memory used on the NameNode, because a group of small files packaged in a SequenceFile on the same block requires one entry instead of one entry per small file. The second phase analyzes the attributes of the stored small files and distributes them so that the most frequently accessed files are referenced by an additional index, in MapFile format, to reduce the read overhead during random access.
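The two phases can be illustrated with a small, self-contained sketch. It is plain Python rather than the Hadoop API, and it is not the authors' implementation: the function names `pack`, `build_hot_index`, and `read`, the 64-byte block size, and the access-count threshold are all assumptions made for illustration. Phase one appends small files from several clients back to back into shared blocks, so the number of block-level entries grows with the number of blocks rather than the number of files. Phase two builds a sorted index covering only the frequently read files, loosely mirroring the role a MapFile index plays.

```python
import bisect

BLOCK_SIZE = 64  # bytes; toy value chosen for the example, not an HDFS default


def pack(files, block_size=BLOCK_SIZE):
    """Phase 1 (sketch): append (name, data) records into shared blocks.

    A new block is opened only when the current one cannot hold the next
    record, so files from different clients end up in the same block.
    """
    blocks, used = [[]], 0
    for name, data in files:
        if blocks[-1] and used + len(data) > block_size:
            blocks.append([])
            used = 0
        blocks[-1].append((name, data))
        used += len(data)
    return blocks


def build_hot_index(blocks, access_counts, threshold):
    """Phase 2 (sketch): sorted index over frequently accessed files only."""
    entries = sorted(
        (name, (b, s))
        for b, block in enumerate(blocks)
        for s, (name, _) in enumerate(block)
        if access_counts.get(name, 0) >= threshold
    )
    keys = [name for name, _ in entries]
    positions = [pos for _, pos in entries]
    return keys, positions


def read(blocks, index, name):
    """Binary-search the hot index first; fall back to a sequential scan."""
    keys, positions = index
    i = bisect.bisect_left(keys, name)
    if i < len(keys) and keys[i] == name:
        b, s = positions[i]
        return blocks[b][s][1]
    for block in blocks:
        for key, data in block:
            if key == name:
                return data
    raise KeyError(name)


# Interleaved input from two hypothetical clients, "a" and "b".
files = [
    ("a/1.txt", b"x" * 20),
    ("b/1.txt", b"y" * 30),
    ("a/2.txt", b"z" * 25),
    ("b/2.txt", b"w" * 10),
]
blocks = pack(files)
index = build_hot_index(blocks, {"a/2.txt": 9, "b/1.txt": 1}, threshold=5)
print(len(files), "files packed into", len(blocks), "blocks")
```

With this input, four small files share two blocks, and only `a/2.txt` is placed in the hot index; the remaining files stay readable through the slower sequential scan.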

Authors and Affiliations

O. Achandair, M. Elmahouti, S. Khoulji, M. L. Kerkeb

  • EP ID EP309385
  • DOI 10.14738/tmlai.54.3214

How To Cite

O. Achandair, M. Elmahouti, S. Khoulji, M. L. Kerkeb (2017). Optimizing Hadoop for Small File Management. Transactions on Machine Learning and Artificial Intelligence, 5(4), 426-437. https://europub.co.uk/articles/-A-309385