A Two-Level Fault-Tolerance Technique for High Performance Computing Applications

Abstract

Reliability is the biggest concern facing future extreme-scale high performance computing (HPC) systems. Projections based on the current generation of HPC systems suggest that errors will occur at very high rates in future systems. It is therefore essential to detect errors that can cause important applications, such as scientific ones, to fail. In this paper, we present a two-level fault-tolerance approach for detecting and classifying errors on Compute Unified Device Architecture (CUDA)-based Graphics Processing Units (GPUs). The first level detects the existence of errors by using software redundancy that applies design diversity. The second level investigates the problematic software version and re-executes it on a different hardware component to classify the error as either a permanent hardware error or a software error. We implemented our approach to run on GPUs and conducted proof-of-concept experiments by running three versions of matrix multiplication under different error scenarios. The results show the feasibility of the proposed approach.
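
The two levels described in the abstract can be illustrated with a minimal CPU-only sketch. This is not the authors' CUDA implementation. It assumes that each version runs on its own hardware component, that a "device" can be modelled as a plain Python function, and that an error which disappears on a spare device is a permanent hardware error while one that persists is a software error. All names (run_two_level, healthy, faulty_device, buggy_matmul) are hypothetical.

```python
from typing import Callable, List, Optional

Matrix = List[List[float]]
Version = Callable[[Matrix, Matrix], Matrix]
# Assumption: a "device" is a stand-in for a GPU that executes a version.
Device = Callable[[Version, Matrix, Matrix], Matrix]


def matmul_ijk(a: Matrix, b: Matrix) -> Matrix:
    """Version 1: classic i-j-k loop order."""
    c = [[0.0] * len(b[0]) for _ in a]
    for i in range(len(a)):
        for j in range(len(b[0])):
            for k in range(len(b)):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_ikj(a: Matrix, b: Matrix) -> Matrix:
    """Version 2: i-k-j loop order (same result, different design)."""
    c = [[0.0] * len(b[0]) for _ in a]
    for i in range(len(a)):
        for k in range(len(b)):
            for j in range(len(b[0])):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_transposed(a: Matrix, b: Matrix) -> Matrix:
    """Version 3: dot products of rows of a with columns of b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def close(x: Matrix, y: Matrix, tol: float = 1e-9) -> bool:
    return all(abs(p - q) <= tol for rx, ry in zip(x, y) for p, q in zip(rx, ry))


def detect(results: List[Matrix]) -> Optional[int]:
    """Level 1: index of the single result that disagrees with the rest.

    Returns None when all results agree. The case where no majority
    exists is not handled in this sketch and also returns None.
    """
    for i in range(len(results)):
        others = [r for j, r in enumerate(results) if j != i]
        agree = all(close(others[0], o) for o in others[1:])
        if agree and not close(results[i], others[0]):
            return i
    return None


def run_two_level(versions: List[Version], devices: List[Device],
                  spare: Device, a: Matrix, b: Matrix) -> str:
    results = [dev(v, a, b) for v, dev in zip(versions, devices)]
    suspect = detect(results)
    if suspect is None:
        return "no error detected"
    reference = results[(suspect + 1) % len(results)]
    # Level 2: re-execute the suspect version on different hardware.
    retry = spare(versions[suspect], a, b)
    if close(retry, reference):
        return "permanent hardware error"
    return "software error"


def healthy(v: Version, a: Matrix, b: Matrix) -> Matrix:
    return v(a, b)


def faulty_device(v: Version, a: Matrix, b: Matrix) -> Matrix:
    """Simulated permanent hardware fault: always corrupts one element."""
    c = v(a, b)
    c[0][0] += 1.0
    return c


def buggy_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Simulated software error: wrong on every device."""
    c = matmul_ijk(a, b)
    c[0][0] += 1.0
    return c
```

For example, calling run_two_level with the three correct versions and devices [healthy, faulty_device, healthy] classifies the fault as a hardware error, while replacing the second version with buggy_matmul on healthy devices classifies it as a software error.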

Authors and Affiliations

Aishah M. Aseeri, Mai A. Fadel

  • EP ID EP429106
  • DOI 10.14569/IJACSA.2018.091207

How To Cite

Aishah M. Aseeri, Mai A. Fadel (2018). A Two-Level Fault-Tolerance Technique for High Performance Computing Applications. International Journal of Advanced Computer Science & Applications, 9(12), 46-54. https://europub.co.uk/articles/-A-429106