Operating system lockup errors can render a computer unusable by preventing the execution of other programs. Watchdog timers can be used to recover from a lockup by resetting the processor and rebooting the system when a lockup is detected. This results in a loss of unsaved data in running programs. Based on the observation that volatile memory is not affected when a processor reset occurs, we present an approach to recover from a watchdog reset with minimal or zero loss of application state. We study the resolution of lockup conditions using thread termination and using exception dispatch. Thread termination can still result in a usable system and is already used as a recovery strategy for other errors in Linux. Using exceptions allows de...
As technology feature size continues to shrink, we see two challenging problems in the design of com...
Software failures in server applications are a significant problem for preserving system availabilit...
Unpredictable hardware faults and software bugs lead to application crashes, incorrect computations,...
When an operating system crashes and hangs, it leaves the machine in an unusable state. A...
User applications and data in volatile memory are usually lost when an operating system crashes beca...
Gracefully recovering from software and hardware faults is important to ensuring highly reliable an...
We present a method to recover from failures caused by software bugs. Our method relies on two key ...
We present a new technique that enables software recovery in legacy applications by retrofitting exc...
Crash-only programs crash safely and recover quickly. There is only one way to stop such software—by...
Operating systems often manage critical infrastructures where failures can have serious consequences...
Software applications run on a variety of platforms (filesystems, virtual slices, mobile ha...
Deadlocked threads cannot make further progress, and frequently tie up resources requested by still ...
Despite many decades of research, the management of errors in a live operating system remains a chal...
This study focuses on how to confine error recovery to the immediate environment of a failed computa...
Interrupt handlers implemented by some PCI device drivers can misbehave if the device they are suppo...