Scopus Export 2015-2019

Understanding The Propagation Of Transient Errors In HPC Applications

Rizwan A. Ashraf, University of Central Florida
Roberto Gioiosa, Pacific Northwest National Laboratory
Gokcen Kestor, Pacific Northwest National Laboratory
Ronald F. Demara, University of Central Florida
Chen Yong Cher, IBM Thomas J. Watson Research Center

Keywords

application vulnerability; distributed applications; fault injection; fault propagation; resiliency; soft errors

Abstract

Resiliency of exascale systems has quickly become an important concern for the scientific community. Despite its importance, much remains to be determined about how faults disseminate and at what rate they impact HPC applications. Understanding where and how fast faults propagate could lead to more efficient implementations of application-driven error detection and recovery. In this work, we propose a fault propagation framework to analyze how faults propagate in MPI applications and to understand their vulnerability to faults. We employ a combination of compiler-level code transformation and instrumentation, along with a runtime checker. Using the information provided by our framework, we employ machine learning techniques to derive application fault propagation models that can be used to estimate the number of corrupted memory locations at runtime.

Publication Date

11-15-2015

Publication Title

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Volume

15-20-November-2015

Document Type

Article; Proceedings Paper

Personal Identifier

scopus

DOI Link

https://doi.org/10.1145/2807591.2807670

Copyright Status

Unknown

Scopus ID

84966600338 (Scopus)

Source API URL

https://api.elsevier.com/content/abstract/scopus_id/84966600338

STARS Citation

Ashraf, Rizwan A.; Gioiosa, Roberto; Kestor, Gokcen; Demara, Ronald F.; and Cher, Chen Yong, "Understanding The Propagation Of Transient Errors In HPC Applications" (2015). Scopus Export 2015-2019. 1491.
https://stars.library.ucf.edu/scopus2015/1491

