A Two-Level Fault-Tolerance Technique for High Performance Computing Applications

Abstract

Reliability is the biggest concern facing future extreme-scale high performance computing (HPC) systems. Projections based on the current generation of HPC systems suggest that errors will occur at very high rates in future systems. It is therefore essential to detect errors that can cause important applications, such as scientific ones, to fail. In this paper, we present a two-level fault-tolerance approach for detecting and classifying errors on Compute Unified Device Architecture (CUDA)-based Graphics Processing Units (GPUs). In the first level, the approach detects the existence of errors by using software redundancy that applies design diversity. In the second level, it examines the problematic software version and re-executes it on a different hardware component to classify the error as either a permanent hardware error or a software error. We implemented the approach on GPUs and conducted proof-of-concept experiments by running three versions of matrix multiplication under different error scenarios. The results show the feasibility of the proposed approach.
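The two levels described in the abstract can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the authors' CUDA implementation: the abstract does not say how the three results are compared, so majority voting is assumed here, and the names `detect`, `classify`, `healthy_device`, and `faulty_device` are hypothetical. Each "device" is modelled as a plain Python callable that runs a version and may corrupt its output. Integer matrices are used so results can be compared exactly; floating-point results would need a tolerance.

```python
from collections import Counter

# Three diverse implementations of the same computation (design diversity).
def matmul_ijk(a, b):
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_ikj(a, b):
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c

def matmul_zip(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def matmul_buggy(a, b):
    # Hypothetical software fault: the last output element is overwritten.
    c = matmul_ijk(a, b)
    c[-1][-1] = 0
    return c

# Simulated hardware components (assumption: a device is just a callable).
def healthy_device(fn, a, b):
    return fn(a, b)

def faulty_device(fn, a, b):
    c = fn(a, b)
    c[0][0] += 1  # simulated permanent fault corrupting one output element
    return c

def as_key(matrix):
    # Hashable form of a matrix so results can be counted and compared.
    return tuple(tuple(row) for row in matrix)

def detect(assignment, a, b):
    """Level 1: run each version on its device and compare the results."""
    results = {name: as_key(device(fn, a, b))
               for name, (fn, device) in assignment.items()}
    majority, votes = Counter(results.values()).most_common(1)[0]
    if votes <= len(assignment) // 2:
        raise RuntimeError("no majority result; cannot locate the faulty version")
    suspects = [name for name, res in results.items() if res != majority]
    return majority, suspects

def classify(fn, spare_device, a, b, majority):
    """Level 2: re-run the suspect version on a different device."""
    rerun = as_key(spare_device(fn, a, b))
    # Correct on other hardware: blame the original device; else the code.
    return "hardware" if rerun == majority else "software"

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
```

For example, calling `detect` with `matmul_ikj` assigned to `faulty_device` flags that version as a suspect, and `classify(matmul_ikj, healthy_device, A, B, majority)` then attributes the error to hardware, whereas a suspect such as `matmul_buggy` reproduces its wrong result on the spare device and is attributed to software.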

Authors and Affiliations

Aishah M. Aseeri, Mai A. Fadel

  • EP ID EP429106
  • DOI 10.14569/IJACSA.2018.091207

How To Cite

Aishah M. Aseeri, Mai A. Fadel (2018). A Two-Level Fault-Tolerance Technique for High Performance Computing Applications. International Journal of Advanced Computer Science & Applications, 9(12), 46-54. https://europub.co.uk/articles/-A-429106