Understanding The Propagation Of Transient Errors In HPC Applications
Keywords
application vulnerability; distributed applications; fault injection; fault propagation; resiliency; soft errors
Abstract
Resiliency of exascale systems has quickly become an important concern for the scientific community. Despite its importance, much remains to be determined about how faults disseminate and at what rate they impact HPC applications. Understanding where and how fast faults propagate could lead to more efficient implementation of application-driven error detection and recovery. In this work, we propose a fault propagation framework to analyze how faults propagate in MPI applications and to understand their vulnerability to faults. We employ a combination of compiler-level code transformation and instrumentation, along with a runtime checker. Using the information provided by our framework, we employ machine learning techniques to derive application fault propagation models that can be used to estimate the number of corrupted memory locations at runtime.
Publication Date
11-15-2015
Publication Title
International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume
15-20-November-2015
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.1145/2807591.2807670
Copyright Status
Unknown
Scopus ID
84966600338 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/84966600338
STARS Citation
Ashraf, Rizwan A.; Gioiosa, Roberto; Kestor, Gokcen; Demara, Ronald F.; and Cher, Chen Yong, "Understanding The Propagation Of Transient Errors In HPC Applications" (2015). Scopus Export 2015-2019. 1491.
https://stars.library.ucf.edu/scopus2015/1491