Operating systems and hypervisors enable the collection and extraction of rich information on application and system execution characteristics. This thesis describes a Reliability MicroKernel (RMK) architecture, an infrastructure for designing and deploying software modules that provide application-aware error detection and recovery. The RMK offers an automatic approach to low-latency crash/hang detection and rapid checkpoint-based recovery. We first demonstrate how the RMK works in a native system and then extend it to work in virtual machines (VMs). In a native system, the RMK is installed as a device driver, while in a virtualized system, the RMK is both installed as a device driver in VMs and deplo...
Large machines with tens or even hundreds of thousands of processors are currently in use. As the nu...
Abstract. Large-scale computing platforms provide tremendous capabilities for scientific discovery. ...
System virtualization allows for the consolidation of many physical servers on a single physical host ...
Operating systems enable collecting and extracting rich information on application execution charact...
Critical applications require dependability mechanisms to prevent them from failures due to faults. D...
Software complexity in embedded systems is continuously increasing while embedded computing platform...
Memory system design is important for providing high reliability and availability. This dissertation...
Full system reliability is a problem that spans multiple levels of the software/hardware stack. The...
User applications and data in volatile memory are usually lost when an operating system crashes beca...
While one always works to prevent attacks and failures, they are inevitable and situational awarenes...
Abstract- In this work, we present the design of the Checkpointing-Enabled Virtual Machine (CEVM) ar...
This dissertation describes monitoring methods to achieve both security and reliability in virtualiz...
A significant fraction of software failures in large-scale Internet systems are cured by rebooting, ...