My research focuses on both policy and mechanism for managing datacenter-scale installations (thousands of computers) of interactive-response, Internet-resident services. “Management” includes recovery from failures, on-demand scalability, dynamic resource allocation, and improved power efficiency. The mechanism for carrying out these tasks is to construct software building blocks in which common operations, such as failure recovery, scaling up/down, or reprovisioning, can be achieved by rebooting a machine (or its dual, adding a new machine and killing the faulty one). The policy is based on the use of statistical machine learning (SML) techniques to automatically identify and react to problems that would take too long for a human operato...
My research focuses on improving the security of distributed systems with multiple administrative do...
Traditional reliability-related models for fault-tolerant systems are used to predict system reliabi...
Computer systems have evolved significantly along two lines: (1) cloud-scale computing, moving into ...
Crash-only programs crash safely and recover quickly. There is only one way to stop such software—by...
My research interests are in the design and evaluation of systems infrastructure for safety-critical...
Building systems to recover fast may be more productive than aiming for systems that never fail. Bec...
Software reliability is one of the cornerstones of any successful user experience. Software needs to...
Transient hardware faults have become one of the major concerns affecting the reliability of modern ...
As machines increase in scale, it is predicted that failure rates of supercomputers will correspondi...
The availability of the Information Technologies for everything, from everywhere, at all times is a ...
As society becomes more dependent upon computer systems to perform increasingly critical tasks, ensu...
Scientific applications have long embraced the MPI as the environment of choice to execute on large ...
Despite the improvements of the software development and maintenance processes in the last decades, ...
Networked computer systems are prevalent in most aspects of modern society, and we have become depen...
Distributed software systems have become the backbone of Internet services. Failures in production ...