Checkpoints are widely used to improve the performance of computer systems and programs in the presence of failures, and they significantly reduce the overall cost of running a program when the program or the underlying system is subject to failures. Application-level checkpointing has therefore been proposed for programs that may execute on failure-prone platforms, and also to reduce the execution time of programs that are prone to internal failures. This paper develops a mathematical model to estimate the average execution time of a program in the presence of failures, with and without application-level checkpointing, and uses it to predict the optimum interval, measured in number of instructions, which should be executed between the plac...
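The trade-off that such a model captures can be illustrated with a small sketch. This is not the paper's model, whose details are not given in the abstract; it is a textbook-style simplification under assumptions introduced here: failures arrive as a Poisson process with rate `fail_rate`, a failure forces the current segment (work plus its checkpoint) to be redone from the last checkpoint, restart itself costs nothing, and the work is split into equal segments. The function names and parameters are hypothetical, and time units stand in for the paper's instruction counts.

```python
import math


def expected_time(work, n_segments, ckpt_cost, fail_rate):
    """Expected completion time for `work` split into equal segments.

    Assumptions (illustrative, not taken from the paper): exponential
    failures with rate `fail_rate`, a checkpoint of cost `ckpt_cost`
    after every segment, and zero-cost restart from the last checkpoint.
    """
    segment = work / n_segments + ckpt_cost
    # Expected time to get through one failure-free stretch of length s
    # when every failure restarts the stretch: (exp(rate * s) - 1) / rate.
    return n_segments * math.expm1(fail_rate * segment) / fail_rate


def best_segment_count(work, ckpt_cost, fail_rate, max_segments=10_000):
    """Brute-force search for the segment count with the lowest expected time."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: expected_time(work, n, ckpt_cost, fail_rate),
    )


if __name__ == "__main__":
    work, ckpt_cost, fail_rate = 1000.0, 1.0, 0.01
    n = best_segment_count(work, ckpt_cost, fail_rate)
    print("segments:", n, "interval:", work / n)
    print("with checkpoints:", expected_time(work, n, ckpt_cost, fail_rate))
    # No checkpointing: a single segment with no checkpoint cost.
    print("without checkpoints:", expected_time(work, 1, 0.0, fail_rate))
```

Under these assumptions, checkpointing too rarely makes each failure expensive, while checkpointing too often wastes time on the checkpoints themselves; the search picks the interval that balances the two.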
This work provides an optimal checkpointing strategy to protect iterative appl...
Checkpointing is an effective fault-tolerant technique for improving system availability and reliabi...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...
Checkpoints are widely used to improve the performance of computer systems and programs in the prese...
Checkpointing is commonly adopted for enhancing the performance of software applications that operat...
We study programs which operate in the presence of possible failures and which must be restarted fro...
Long-running software may operate on hardware platforms with limited energy resources such as batter...
The large scale of current and next-generation massively parallel processing (MPP) systems presents ...
Checkpointing is a common technique for reducing the time to recover from faults in computer systems...
The massive scale of current and next-generation massively parallel processing (MPP) systems present...
Over the last decade, computing systems have turned to large-scale parallel platforms composed of thousand...
Large-scale applications running on new computing platforms with thousands o...
Abstract: Finding the failure rate of a system is a crucial step in high-performance computing syst...