Reduction of Data at Namenode in HDFS using harballing Technique
Journal Title: International Journal of Advanced Research in Computer Engineering & Technology (IJARCET) - Year 2012, Vol 1, Issue 4
Abstract
HDFS stands for the Hadoop Distributed File System. It is designed to handle large files (megabytes, gigabytes, or terabytes in size). Scientific applications have adopted HDFS/MapReduce for large-scale data analytics [1]. A major problem, however, is that small files are common in these applications. HDFS manages all of these small files through a single NameNode server [1]-[4]. Storing and processing small files in HDFS adds overhead to MapReduce programs and also degrades the performance of the NameNode [1]-[3]. In this paper we study the Hadoop archiving technique, which reduces the storage overhead of data on the NameNode and also helps improve performance by reducing the number of map operations in a MapReduce program. Hadoop provides the “harballing” archiving technique, which collects a large number of small files into a single large file. Hadoop Archive (HAR) is an effective solution to the problem of many small files. HAR packs a number of small files into large files so that the original files can still be accessed in parallel, transparently (without expanding the files) and efficiently. Hadoop creates the archive file with the “.har” extension. HAR increases the scalability of the system by reducing namespace usage and decreasing the operation load on the NameNode. This improvement is orthogonal to memory optimization in the NameNode and to distributing namespace management across multiple NameNodes [3].
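The packing idea described in the abstract can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Hadoop's actual HAR format or API: `ToyArchive`, `pack`, the file names, and the count of two namespace entries after archiving are all hypothetical and chosen for illustration. It shows many small files being concatenated into one blob, with an index that lets any original file be read without expanding the archive.

```python
from dataclasses import dataclass, field


@dataclass
class ToyArchive:
    """Toy stand-in for a HAR-style archive: one data blob plus an index."""
    data: bytearray = field(default_factory=bytearray)
    index: dict = field(default_factory=dict)  # name -> (offset, length)

    def add(self, name: str, content: bytes) -> None:
        # Append the file's bytes and record where they start and how long they are.
        self.index[name] = (len(self.data), len(content))
        self.data.extend(content)

    def read(self, name: str) -> bytes:
        # Look up the byte range directly; the archive is never unpacked.
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])


def pack(files: dict) -> ToyArchive:
    """Pack a mapping of file name -> bytes into a single ToyArchive."""
    archive = ToyArchive()
    for name, content in files.items():
        archive.add(name, content)
    return archive


# Hypothetical workload: 1000 tiny files.
small_files = {f"part-{i:04d}.txt": f"record {i}\n".encode() for i in range(1000)}
archive = pack(small_files)

# Assumption for illustration: each unarchived small file costs one namespace
# entry, while the archive is modelled as one data file plus one index file.
entries_before = len(small_files)
entries_after = 2

print(entries_before, entries_after)
print(archive.read("part-0007.txt"))
```

In this model the number of objects the namespace has to track no longer grows with the number of small files, which is the effect the abstract attributes to HAR.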
Authors and Affiliations
Vaibhav G. Korat, Kumar Swamy Pamu