A Two-Level Fault-Tolerance Technique for High Performance Computing Applications
Journal Title: International Journal of Advanced Computer Science & Applications - Year 2018, Vol 9, Issue 12
Abstract
Reliability is the biggest concern facing future extreme-scale, high performance computing (HPC) systems. Projections based on the current generation of HPC systems suggest that errors will occur at very high rates in future systems. It is therefore essential to detect errors that can cause important applications, such as scientific ones, to fail. In this paper, we present a two-level fault-tolerance approach for detecting and classifying errors on Compute Unified Device Architecture (CUDA)-based Graphics Processing Units (GPUs). The first level detects the existence of errors by using software redundancy that applies design diversity. The second level investigates the problematic software version and re-executes it on a different hardware component to classify the error as either a permanent hardware error or a software error. We implemented our approach to run on GPUs and conducted proof-of-concept experiments by running three versions of matrix multiplication under different error scenarios. The results show the feasibility of the proposed approach.
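The two levels described in the abstract can be illustrated with a small host-side sketch. This is not the authors' implementation. It assumes that level 1 compares the outputs of three diverse versions and treats the one that disagrees as the suspect. It also assumes that level 2 reads a disagreement that persists on different hardware as a software error, and one that disappears as a permanent hardware error. GPUs are simulated here by plain Python callables, and every name (`two_level_check`, `healthy_device`, `faulty_device`, `matmul_buggy`) is hypothetical. Integer matrices are used so results can be compared exactly; floating-point results would need a tolerance.

```python
from typing import Callable, List, Optional, Tuple

Matrix = List[List[int]]
Version = Callable[[Matrix, Matrix], Matrix]
Device = Callable[[Version, Matrix, Matrix], Matrix]


def matmul_ijk(a: Matrix, b: Matrix) -> Matrix:
    """Version 1: textbook row-by-column product."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_ikj(a: Matrix, b: Matrix) -> Matrix:
    """Version 2: same product with a different loop order."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_transposed(a: Matrix, b: Matrix) -> Matrix:
    """Version 3: transpose b first, then take row-by-row dot products."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def matmul_buggy(a: Matrix, b: Matrix) -> Matrix:
    """Hypothetical defective version, standing in for a software error."""
    c = matmul_ijk(a, b)
    c[0][0] += 1
    return c


def healthy_device(version: Version, a: Matrix, b: Matrix) -> Matrix:
    """Simulated fault-free hardware component."""
    return version(a, b)


def faulty_device(version: Version, a: Matrix, b: Matrix) -> Matrix:
    """Simulated permanent hardware fault: always flips one output bit."""
    c = version(a, b)
    c[-1][-1] ^= 1
    return c


def two_level_check(versions: List[Version], devices: List[Device],
                    spare: Device, a: Matrix,
                    b: Matrix) -> Tuple[str, Optional[int]]:
    # Level 1: run each diverse version on its own device and compare outputs.
    results = [dev(ver, a, b) for ver, dev in zip(versions, devices)]
    suspects = [i for i, r in enumerate(results)
                if sum(r == other for other in results) == 1]
    if not suspects:
        return "no error", None
    if len(suspects) != 1:
        return "undetermined", None  # no two versions agree with each other

    # Level 2: re-execute the suspect version on a different component.
    suspect = suspects[0]
    reference = results[(suspect + 1) % len(results)]
    rerun = spare(versions[suspect], a, b)
    if rerun == reference:
        return "hardware error", suspect  # disagreement followed the hardware
    return "software error", suspect      # disagreement followed the version
```

For example, calling `two_level_check` with the three correct versions and `faulty_device` in the second slot reports a hardware error for index 1, while swapping `matmul_buggy` in as the second version on healthy devices reports a software error for the same index. On real GPUs, the device callables would be replaced by kernel launches bound to specific devices.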
Authors and Affiliations
Aishah M. Aseeri, Mai A. Fadel