System and application failures are all too common. In this dissertation we argue that operating systems should provide the fundamental abstraction we call failure transparency---the illusion that systems and applications do not fail. Systems that provide failure transparency attempt to completely mask failures from users, and failure handling from programmers. We construct a theory of consistent recovery that provides the fundamental rules for recovering transparently after a failure. In addition to aiding our quest for failure transparency, the theory unifies all existing recovery protocols: they are all simply variations on the theme of the theory's central invariant. Using the theory as a launching point, we construct a series of system...
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...
As machines increase in scale, it is predicted that failure rates of supercomputers will correspondi...
Production run software failures cause endless grief to end-users, and endless challenges to program...
User applications and data in volatile memory are usually lost when an operating system crashes beca...
We present an in-depth analysis of the crash-recovery problem and propose a novel approach to recove...
Recovery is an essential part of databases and most computer systems, because it enables a system to...
In building systems that can survive random software failures, system designers make assumptions abo...
Gracefully recovering from software and hardware faults is important to ensuring highly reliable an...
Software applications run on a variety of platforms (filesystems, virtual slices, mobile ha...
Much research has gone into making operating systems more amenable to recovery and more resilient to...
As the scale of computing platforms becomes increasingly extreme, the requirements for app...
Traditional reliability-related models for fault-tolerant systems are used to predict system reliabi...
We present a new approach for developing robust software applications that breaks depende...
Operating systems enable collecting and extracting rich information on application execution charact...
Despite many decades of research, the management of errors in a live operating system remains a chal...