A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we separate process recovery from data recovery to enable microrebooting, a fine-grained technique for surgically recovering faulty application components without disturbing the rest of the application. We evaluate microrebooting in an Internet auction system running on an application server. Microreboots recover most of the same failures as full reboots, but do so an order of magnitude faster and result in an order-of-magnitude savings in lost work. ...
Full system reliability is a problem that spans multiple levels of the software/hardware stack. The...
The default behavior of all commodity operating systems today is to restart the system when a critic...
We present an in-depth analysis of the crash-recovery problem and propose a novel approach to recove...
Microreboot is an attractive technique for recovering an application after a non-malicious failure ...
As software complexity increases so does the difficulty in solving all software defects before the p...
Building enterprise applications that can self-adapt to eliminate component failures is h...
User applications and data in volatile memory are usually lost when an operating system crashes beca...
In this paper we show how to reduce downtime of J2EE applications by rapidly and automatically recov...
Crash-only programs crash safely and recover quickly. There is only one way to stop such software—by...
We present a method to recover from failures caused by software bugs. Our method relies on two key ...
In this paper we show how to reduce downtime of J2EE applications by rapidly and automatically recov...
Operating systems and hypervisors enable the collection and extraction of rich information on applic...
Performance anomalies represent one common type of failure in Internet servers. Overcoming these f...
In large distributed application-based systems, an ensemble of co-existing micro-services perform in...
Gracefully recovering from software and hardware faults is important to ensuring highly reliable an...