Critical applications require dependability mechanisms to protect them from failures due to faults. Dependable systems for mainstream deployment are typically built upon commodity hardware, with mechanisms that enhance resilience implemented in software. Such systems aim to provide commercially viable, best-effort dependability cost-effectively. This thesis proposes several practical, low-overhead dependability mechanisms for critical components in the system: hypervisors, containers, and parallel applications. For hypervisors, the latency to reboot a new instance to recover from transient faults is unacceptably high. NiLiHype recovers the hypervisor by resetting it to a quiescent state that is highly likely to be valid. Compared to a pr...
Protocols to implement a fault-tolerant computing system are described. These protocols augment the ...
Large-scale computing platforms provide tremendous capabilities for scientific discovery. ...
Supporting uninterrupted services for distributed soft real-time applications is hard in resource-co...
Operating systems and hypervisors enable the collection and extraction of rich information on applic...
In this work, we present the design of the Checkpointing-Enabled Virtual Machine (CEVM) ar...
Many organizations are moving their systems to the cloud, where providers consolidate multiple clie...
Software complexity in embedded systems is continuously increasing while embedded computing platform...
Crash and omission failures are common in service providers: a disk can break down or a link can fai...
Hypervisor-based fault tolerance (HBFT), which synchronizes the state between the primary VM and the...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
System virtualization allows for the consolidation of many physical servers on a single physical host ...
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Civil and Environmental Enginee...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables ...
Memory system design is important for providing high reliability and availability. This dissertation...