A Two-Level Fault-Tolerance Technique for High Performance Computing Applications

Abstract

Reliability is the biggest concern facing future extreme-scale high performance computing (HPC) systems. Projections based on the current generation of HPC systems suggest that errors will occur at very high rates in future systems. It is therefore essential to detect errors that can cause important applications, such as scientific ones, to fail. In this paper, we present a two-level fault-tolerance approach for detecting and classifying errors on Compute Unified Device Architecture (CUDA)-based Graphics Processing Units (GPUs). The first level detects the existence of errors by using software redundancy that applies design diversity. The second level investigates the problematic software version and re-executes it on a different hardware component to classify the error as either a permanent hardware error or a software error. We implemented our approach to run on GPUs and conducted proof-of-concept experiments by running three versions of matrix multiplication under different error scenarios. The results show the feasibility of the proposed approach.
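The two levels described in the abstract can be illustrated with a small host-side sketch. This is not the authors' CUDA implementation: it assumes three hypothetical matrix multiplication versions (`matmul_ijk`, `matmul_ikj`, `matmul_columns`), models each hardware component as a plain Python callable (`healthy_device`, `faulty_device`), and assumes a majority comparison at the first level. Integer matrices are used so results can be compared exactly; floating-point results would need a tolerance.

```python
from typing import Callable, Dict, List, Optional, Tuple

Matrix = List[List[int]]
Version = Callable[[Matrix, Matrix], Matrix]
Device = Callable[[Version, Matrix, Matrix], Matrix]


def matmul_ijk(a: Matrix, b: Matrix) -> Matrix:
    """Version 1 (hypothetical): classic i-j-k loop order."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_ikj(a: Matrix, b: Matrix) -> Matrix:
    """Version 2 (hypothetical): i-k-j loop order."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_columns(a: Matrix, b: Matrix) -> Matrix:
    """Version 3 (hypothetical): row-by-column dot products."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def healthy_device(version: Version, a: Matrix, b: Matrix) -> Matrix:
    """Stand-in for a fault-free hardware component."""
    return version(a, b)


def faulty_device(version: Version, a: Matrix, b: Matrix) -> Matrix:
    """Stand-in for a component with a permanent fault (injected for illustration)."""
    c = version(a, b)
    c[0][0] += 1  # corrupt one element of every result
    return c


def classify(
    versions: Dict[str, Version],
    placement: Dict[str, Device],
    spare: Device,
    a: Matrix,
    b: Matrix,
) -> Tuple[str, Optional[str]]:
    # Level 1: run every diverse version and look for one that disagrees with all others.
    results = {name: placement[name](fn, a, b) for name, fn in versions.items()}
    names = list(results)
    suspects = [
        n for n in names
        if not any(results[n] == results[o] for o in names if o != n)
    ]
    if not suspects:
        return ("none", None)
    if len(suspects) != 1:
        return ("unclassified", None)  # no majority to compare against

    # Level 2: re-execute the suspect version on a different component.
    suspect = suspects[0]
    reference = next(results[n] for n in names if n != suspect)
    retry = spare(versions[suspect], a, b)
    if retry == reference:
        return ("hardware", suspect)  # error went away on other hardware
    return ("software", suspect)      # error follows the software version


if __name__ == "__main__":
    versions = {"ijk": matmul_ijk, "ikj": matmul_ikj, "columns": matmul_columns}
    placement = {"ijk": healthy_device, "ikj": faulty_device, "columns": healthy_device}
    a, b = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
    print(classify(versions, placement, healthy_device, a, b))
```

In this sketch, an error that disappears when the suspect version runs on the spare component is attributed to the original hardware, while an error that persists is attributed to the software version itself. A real deployment would also need to handle the case where the spare component is faulty, which this sketch does not attempt.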

Authors and Affiliations

Aishah M. Aseeri, Mai A. Fadel



  • EP ID EP429106
  • DOI 10.14569/IJACSA.2018.091207

How To Cite

Aishah M. Aseeri, Mai A. Fadel (2018). A Two-Level Fault-Tolerance Technique for High Performance Computing Applications. International Journal of Advanced Computer Science & Applications, 9(12), 46-54. https://europub.co.uk/articles/-A-429106