compiler for Portable Checkpointing), a checkpointing tool designed for heterogeneous clusters and Grid infrastructures through the use of portable protocols, portable checkpoint files and portable code. It is user-directed and works at the variable level, which yields small checkpoint files. It allows parallel processes to checkpoint independently, without runtime coordination or message logging. Consistency is achieved at restart time by negotiating the restart point. A directive-based checkpointing precompiler has also been implemented to reduce the user's effort. CPPC was designed to work with parallel MPI programs, though it can be used with sequential ones and easily extended to parallel programs written using different message-passing...
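As a rough illustration of restart-time negotiation, the sketch below picks the most recent checkpoint that every process still holds. It is not CPPC's actual protocol or API: the function name `negotiate_restart_point`, the integer checkpoint indices, and the rule "newest index common to all processes" are assumptions made for this example, and the per-rank data is hypothetical.

```python
from typing import Dict, Optional, Set


def negotiate_restart_point(available: Dict[int, Set[int]]) -> Optional[int]:
    """Return the newest checkpoint index held by every process, or None.

    `available` maps a process rank to the set of checkpoint indices that
    process can restore from (an assumed representation, not CPPC's format).
    """
    if not available:
        return None
    # Only checkpoints present on all processes form a usable restart point.
    common = set.intersection(*available.values())
    return max(common) if common else None


# Hypothetical state after a failure: rank -> checkpoint indices found on disk.
checkpoints_on_disk = {
    0: {1, 2, 3},
    1: {1, 2},  # rank 1 failed before writing checkpoint 3
    2: {1, 2, 3},
}
restart_point = negotiate_restart_point(checkpoints_on_disk)
```

In a real MPI program each process would know only its own checkpoint set, so the same decision would need an exchange between processes; the dictionary here stands in for that exchange.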
This paper focuses on the performance evaluation of Compiler for Portable Checkpointing (CPPC), a to...
Jobs in Grid workflows are exposed to different types of failure. It is important to develop fault t...
Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. ...
With the evolution of high-performance computing towards heterogeneous, massively parallel systems,...
Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing f...
The Grid community has made an important effort in developing middleware to provide differ...
This is the peer reviewed version of the following article: Rodríguez, G., Martín, M. J., González,...
Checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
With the maturity of the Grid, the community has made an important effort in developing mi...
The ever increasing number of processors used in parallel computers is making fault tolerance suppor...
Checkpoint is defined as a designated place in a program at which normal processing is in...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
As parallel machines increase their number of processors, so does the failure rate of the gl...
A long-term trend in high-performance computing is the increasing number of no...
Checkpoint is defined as a designated place in a program at which normal processing is interrupted s...