Improving Small File Management in Hadoop
Journal Title: Transactions on Machine Learning and Artificial Intelligence - Year 2017, Vol 5, Issue 4
Abstract
Hadoop, now considered the de facto platform for managing big data, has revolutionized the way customers manage their data. As an open-source implementation of MapReduce, it was designed to offer high scalability and availability across clusters of thousands of machines. Through its two principal components, HDFS for distributed storage and MapReduce as the distributed processing engine, companies and researchers benefit greatly from its capabilities. However, Hadoop was designed to handle large files, so when it has to deal with a large number of small files, performance can degrade heavily. The small file problem has been well defined by researchers and the Hadoop community, but most of the proposed approaches only deal with the pressure placed on the NameNode memory. Grouping small files in one of several possible formats, most of which are supported by the current Hadoop distribution, certainly reduces the number of metadata entries and addresses the memory limitation, but that is only part of the equation. The real impact that organizations need to address when dealing with many small files is the cluster performance when those files are processed in Hadoop clusters. In this paper, we propose a new strategy for making more efficient use of one of the common solutions, which groups files in the MapFile format. The core idea is to organize small files into MapFile output files based on specific attributes, and to use prefetching and caching mechanisms during read access. This leads to fewer metadata requests to the NameNode and better I/O performance during MapReduce jobs. The experimental results show that this approach can help achieve better access times when the cluster contains a massive number of small files.
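The core idea of the abstract can be illustrated with a small in-memory model. The following Python sketch is an illustration under stated assumptions, not the authors' implementation and not the Hadoop API: the names `group_by_attribute` and `PrefetchingReader`, the choice of the file extension as the grouping attribute, and the LRU eviction policy are all hypothetical. A sorted list of key/value pairs stands in for a MapFile, and a counter stands in for metadata requests to the NameNode.

```python
import bisect
from collections import OrderedDict, defaultdict


def group_by_attribute(small_files, attribute_of):
    """Group {name: bytes} into one sorted container per attribute value.

    Each container is a list of (name, data) pairs sorted by name, a toy
    stand-in for a MapFile (sorted keys plus an index). Hypothetical helper.
    """
    groups = defaultdict(dict)
    for name, data in small_files.items():
        groups[attribute_of(name)][name] = data
    return {attr: sorted(files.items()) for attr, files in groups.items()}


class PrefetchingReader:
    """Toy reader with an LRU cache and sequential prefetching (assumed policy)."""

    def __init__(self, containers, attribute_of, cache_size=4, prefetch=2):
        self.containers = containers
        self.attribute_of = attribute_of
        self.cache_size = cache_size
        self.prefetch = prefetch
        self.cache = OrderedDict()
        # Counts container accesses; stands in for NameNode metadata requests.
        self.container_lookups = 0

    def _put(self, name, data):
        self.cache[name] = data
        self.cache.move_to_end(name)
        while len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry

    def read(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)
            return self.cache[name]
        # Cache miss: open the container that holds this attribute group.
        self.container_lookups += 1
        entries = self.containers[self.attribute_of(name)]
        keys = [key for key, _ in entries]
        pos = bisect.bisect_left(keys, name)
        if pos == len(keys) or keys[pos] != name:
            raise KeyError(name)
        # Load the requested entry plus the next `prefetch` neighbours.
        window = entries[pos:pos + 1 + self.prefetch]
        for key, data in reversed(window):
            self._put(key, data)  # requested entry ends up most recently used
        return window[0][1]


# Usage: six small log files read in order need two container accesses.
files = {f"log_{i:02d}.txt": f"log {i}".encode() for i in range(6)}
files.update({f"img_{i:02d}.png": f"img {i}".encode() for i in range(6)})


def by_extension(name):
    return name.rsplit(".", 1)[1]


reader = PrefetchingReader(group_by_attribute(files, by_extension), by_extension)
contents = [reader.read(f"log_{i:02d}.txt") for i in range(6)]
print(reader.container_lookups)
```

In this toy model, files that share an attribute sit next to each other in one container, so a sequential read of the group is served mostly from prefetched entries. How the attributes are chosen and how the cache is sized in a real cluster are not specified in the abstract.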
Authors and Affiliations
O. Achandair, M. Elmahouti, Samira Khoulji, M. L. Kerkeb