Title
Understanding Software Approaches for GPGPU Reliability
Keywords
Performance; Reliability; GPGPU
Abstract
Even though graphics processors (GPUs) are becoming increasingly popular for general-purpose computing, current (and likely near-future) generations of GPUs do not provide hardware support for detecting soft/hard errors in computation logic or memory storage cells, since graphics applications are inherently fault tolerant. As a result, if an error occurs in a GPU during program execution, the results could be silently corrupted, which is not acceptable for general-purpose computations. To improve the fidelity of general-purpose computation on GPUs (GPGPU), we investigate software approaches to perform redundant execution. In particular, we propose and study three different application-level techniques. The first technique simply executes the GPU kernel program twice, and thus achieves roughly half of the throughput of a non-redundant execution. The next two techniques interleave redundant execution with the original code in different ways to take advantage of the parallelism between the original code and its redundant copy. Furthermore, we evaluate the benefits of providing hardware support, including ECC/parity protection for on-chip and off-chip memories, for each of the software techniques. Interestingly, our findings, based on six commonly used applications, indicate that the benefits of complex software approaches are both application and architecture dependent. The simple approach, which executes the kernel twice, is often sufficient and may even outperform the complex ones. Moreover, we argue that the cost of protecting memories with ECC/parity bits is not justified. ©2009 ACM.
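The first technique in the abstract (executing the kernel twice) can be illustrated with a minimal CPU-side sketch. This is not the authors' implementation: `run_redundant` and `scale_add_kernel` are hypothetical names, a plain Python function stands in for a GPU kernel, and the element-wise comparison of the two results is an assumed detection step that the abstract does not spell out.

```python
from typing import Callable, List, Sequence


def run_redundant(kernel: Callable[[Sequence[float]], List[float]],
                  data: Sequence[float]) -> List[float]:
    """Run `kernel` twice on the same input and compare the two results.

    Assumption: a mismatch between the two runs is treated as a detected
    error. The doubled work is why throughput drops to roughly half.
    """
    first = kernel(data)
    second = kernel(data)  # redundant copy of the same computation
    if first != second:
        raise RuntimeError("redundant executions disagree: possible error")
    return first


def scale_add_kernel(data: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a GPU kernel: computes 2*x + 1 per element."""
    return [2.0 * x + 1.0 for x in data]


if __name__ == "__main__":
    print(run_redundant(scale_add_kernel, [1.0, 2.0, 3.0]))
```

The sketch only detects a disagreement between the two runs; it does not identify which run was corrupted, and it would not catch a fault that affects both runs identically.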
Publication Date
7-23-2009
Publication Title
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-2
Number of Pages
94-
Document Type
Article; Proceedings Paper
Personal Identifier
scopus
DOI Link
https://doi.org/10.1145/1513895.1513907
Copyright Status
Unknown
Scopus ID
67650671605 (Scopus)
Source API URL
https://api.elsevier.com/content/abstract/scopus_id/67650671605
STARS Citation
Dimitrov, Martin; Mantor, Mike; and Zhou, Huiyang, "Understanding Software Approaches for GPGPU Reliability" (2009). Scopus Export 2000s. 12125.
https://stars.library.ucf.edu/scopus2000/12125