Operating systems make it possible to collect and extract rich information on application execution characteristics, including program counter traces, memory access patterns, and operating-system-generated signals. This information can be exploited to design highly efficient, application-aware reliability mechanisms that are transparent to applications. This paper describes the Reliability MicroKernel framework (RMK), a loadable kernel module that provides application-aware reliability and allows the reliability mechanisms installed in it to be configured dynamically. The RMK prototype is implemented in Linux and supports detection of application/OS failures and transparent application checkpointing. Experimental results show that the OS hang detection and applica...
To diagnose performance problems in production systems, many OS kernel-level monitoring and...
Nowadays, clusters are widely used to execute scientific applications. These applications are often ...
Checkpointing Rollback Recovery protocol is often used to provide fault tolerance for real-time appl...
Operating systems enable collecting and extracting rich information on application execution charact...
Operating systems and hypervisors enable the collection and extraction of rich information on applic...
In this paper, we present r-kernel, an operating system kernel enhancement specifically des...
On-line failure detection is an essential means to control and assess the dependability of complex a...
227 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2006. Our results show a decrease i...
Despite many decades of research, the management of errors in a live operating system remains a chal...
As processor manufacturers keep pushing the limits of the transistor, the reliability of computer sy...
We propose a fault injection framework to assess hang detection facilities within the Linux...
According to Moore’s law, technology scaling is continuously providing smaller and faster devices. T...
Many critical services are nowadays provided by large and complex software systems. However, the inc...
Many critical services are nowadays provided by large and complex software systems. However the incr...