The increasing failure rate in High Performance Computing encourages the investigation of fault tolerance mechanisms that guarantee the execution of an application in spite of node faults. This paper presents an automatic and scalable fault-tolerant model designed to be transparent to applications and to message passing libraries. The model detects failures in the communication socket that are caused by a faulty node. In those cases, the affected processes are recovered on a healthy node and the connections are reestablished without losing data. The Redundant Array of Distributed Independent Controllers architecture proposes a decentralized model for all the tasks required in a fault tolerance system: protection, detection, recovery an...
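The socket-level recovery described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the classes `FlakyChannel` and `ReliableSender`, the `reconnect` callback, and the use of a sender-side message log combined with a checkpoint are all assumptions made for illustration, and an in-memory object stands in for a real socket. The sketch only shows the general idea of detecting a send failure, reconnecting to a process restored on a healthy node, and replaying the messages that the restored state does not yet contain.

```python
class FlakyChannel:
    """Hypothetical stand-in for a socket to a peer process."""

    def __init__(self, delivered=None, fail_after=None):
        self.delivered = list(delivered or [])  # (seq, payload) pairs received
        self.fail_after = fail_after            # simulate a node fault

    def send(self, seq, payload):
        if self.fail_after is not None and len(self.delivered) >= self.fail_after:
            raise ConnectionError("peer node unreachable")
        self.delivered.append((seq, payload))


class ReliableSender:
    """Logs outgoing messages and replays them after a detected failure."""

    def __init__(self, channel, reconnect):
        self.channel = channel
        # Assumed callback: returns (new_channel, last_seq_in_checkpoint).
        self.reconnect = reconnect
        self.log = []        # every message sent so far
        self.next_seq = 0

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.log.append((seq, payload))
        try:
            self.channel.send(seq, payload)
        except ConnectionError:
            # Failure detected on the communication channel.
            self._recover()

    def _recover(self):
        # Connect to the process restored on a healthy node, then replay
        # only the messages newer than its checkpointed state.
        self.channel, restored_seq = self.reconnect()
        for seq, payload in self.log:
            if seq > restored_seq:
                self.channel.send(seq, payload)


def run_demo():
    # The original peer fails after receiving three messages.
    faulty = FlakyChannel(fail_after=3)

    def reconnect():
        # Assumption: the checkpoint captured messages 0 and 1.
        restored = FlakyChannel(delivered=[(0, "m0"), (1, "m1")])
        return restored, 1

    sender = ReliableSender(faulty, reconnect)
    for i in range(5):
        sender.send(f"m{i}")
    return [payload for _, payload in sender.channel.delivered]


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the recovered peer ends up with every message exactly once and in order, because the sequence numbers let the sender skip what the checkpoint already covers. A real system would also need to bound the log, for example by truncating it whenever the receiver takes a new checkpoint.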
This thesis deals with principles and techniques of fault tolerance for distributed embedded systems...
We present a new software architecture in which all concepts necessary to achieve fault tolerance ca...
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...
Abstract: The increasing failure rate in High Performance Computing encourages the investigation of fa...
Abstract: We present an analysis design of how to incorporate a transparent fault tolerance system a...
In High Performance Computing (HPC) the demand for more performance is satisfied by increasing the n...
The demand for computational power has been leading the improvement of the High Performance Computin...
Fault tolerance has become an important issue for parallel applications in the last few years. The p...
We present a unified fault-tolerance framework for task-parallel message-passing applications to mit...
Clusters of message-passing computing nodes provide high-performance platforms for distributed appli...
With the increase of the number of nodes in clusters, the probability of failures increases. In this...
This thesis addresses issues in building fault-tolerant distributed real-time systems. Such systems ...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
This paper presents the influence of the fault tolerance configuration on different applications usi...
Resource description: 23 February 2010. Is a fast but not very robust system adequate? Is...