Keywords
Computer architecture, Computers -- Reliability, Debugging in computer science
Abstract
It is a great challenge to build reliable computer systems with unreliable hardware and buggy software. On one hand, software bugs account for as much as 40% of system failures and incur high cost, an estimate of $59.5B a year, on the US economy. On the other hand, under the current trends of technology scaling, transient faults (also known as soft errors) in the underlying hardware are predicted to grow at least in proportion to the number of devices being integrated, which further exacerbates the problem of system reliability. We propose several methods to improve system reliability both in terms of detecting and correcting soft-errors as well as facilitating software debugging. In our first approach, we detect instruction-level anomalies during program execution. The anomalies can be used to detect and repair soft-errors, or can be reported to the programmer to aid software debugging. In our second approach, we improve anomaly detection for software debugging by detecting different types of anomalies as well as by removing false-positives. While the anomalies reported by our first two methods are helpful in debugging single-threaded programs, they do not address concurrency bugs in multi-threaded programs. In our third approach, we propose a new debugging primitive which exposes the non-deterministic behavior of parallel programs and facilitates the debugging process. Our idea is to generate a time-ordered trace of events such as function calls/returns and memory accesses in different threads. In our experience, exposing the time-ordered event information to the programmer is highly beneficial for reasoning about the root causes of concurrency bugs.
Notes
If this is your thesis or dissertation, and want to learn how to access it or for more information about readership statistics, contact us at STARS@ucf.edu
Graduation Date
2010
Semester
Spring
Advisor
Zhou, Huiyang
Degree
Doctor of Philosophy (Ph.D.)
College
College of Engineering and Computer Science
Department
Electrical Engineering and Computer Science
Format
application/pdf
Identifier
CFE0002975
URL
http://purl.fcla.edu/fcla/etd/CFE0002975
Language
English
Release Date
May 2010
Length of Campus-only Access
None
Access Status
Doctoral Dissertation (Open Access)
Subjects
Dissertations, Academic -- Engineering and Computer Science, Engineering and Computer Science -- Dissertations, Academic
STARS Citation
Dimitrov, Martin, "Architectural Support For Improving System Hardware/software Reliability" (2010). Electronic Theses and Dissertations. 1524.
https://stars.library.ucf.edu/etd/1524