Development and Evaluation of a Parallel K-means Algorithm for Big Data Analysis in Google MapReduce Environment

Journal Title: International Journal of Knowledge and Innovation Studies - Year 2024, Vol 2, Issue 3

Abstract

This study addresses the challenge of executing iterative big data analysis algorithms in the Google Cloud MapReduce environment by developing a parallel K-means algorithm that leverages the platform's distributed computing power. The traditional K-means algorithm, which is inherently iterative, is adapted into a parallel version using MapReduce to improve computational efficiency. The parallel algorithm is organized into multiple super-steps: the work within each super-step runs in parallel across nodes, while the super-steps themselves run sequentially. Each super-step corresponds to one iteration of the serial K-means algorithm, with the nodes computing in parallel the mean of the points assigned to each cluster, which becomes the updated cluster center. Experimental evaluation shows that the parallel K-means algorithm is effective and accurate. Notably, for a dataset of 450 water samples, a parallel speedup factor of 20.8 was achieved, significantly reducing the time required for data analysis. Such a reduction is critical in time-sensitive applications, such as coal mine rescue operations, where rapid decision-making is essential. The results indicate that the proposed parallel K-means algorithm is a feasible and efficient approach to handling large-scale datasets in cloud environments, with benefits in both computational speed and practical applicability.
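The super-step structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use any MapReduce API: the map, shuffle, and reduce phases are simulated in a single process, and the function names (`map_assign`, `reduce_mean`, `super_step`, `parallel_kmeans`), the Euclidean distance measure, and the convergence tolerance are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_assign(point, centers):
    """Map phase (hypothetical): emit (index of nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce phase (hypothetical): average the points of one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def super_step(data, centers):
    """One super-step, corresponding to one iteration of serial K-means."""
    groups = defaultdict(list)
    # Each map call depends only on its own point and the current centers,
    # so on a real cluster these calls could be spread across nodes.
    for point in data:
        key, value = map_assign(point, centers)
        groups[key].append(value)  # simulated shuffle: group by cluster index
    # A cluster that received no points keeps its previous center.
    return [
        reduce_mean(groups[i]) if groups[i] else centers[i]
        for i in range(len(centers))
    ]


def parallel_kmeans(data, centers, max_steps=100, tol=1e-9):
    """Run super-steps one after another until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_steps):
        new_centers = super_step(data, centers)
        if all(dist(a, b) <= tol for a, b in zip(centers, new_centers)):
            return new_centers
        centers = new_centers
    return centers
```

For example, `parallel_kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` assigns the first two points to one cluster and the last two to the other. The sequential loop in `parallel_kmeans` mirrors the constraint that each super-step needs the centers produced by the previous one, while the work inside a super-step has no such dependency.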

Authors and Affiliations

Junwei Zhao, Xuexu Yuan, Qingtao Hou, Hanyu Gao, Chunyu Gao, Yuanyuan Zhang

  • EP ID EP755051
  • DOI 10.56578/ijkis020303

How To Cite

Junwei Zhao, Xuexu Yuan, Qingtao Hou, Hanyu Gao, Chunyu Gao, Yuanyuan Zhang (2024). Development and Evaluation of a Parallel K-means Algorithm for Big Data Analysis in Google MapReduce Environment. International Journal of Knowledge and Innovation Studies, 2(3), -. https://europub.co.uk/articles/-A-755051