Fault tolerance has become an important issue for parallel applications in the last few years. Users of parallel systems want them to be reliable along two main dimensions: availability and data consistency. Availability can be provided by solutions such as RADIC, a fault-tolerant architecture with different protection levels that offers high availability with transparency, decentralization, flexibility and scalability for message-passing systems. Transient faults may cause an application running in a computer system to be removed from execution; however, the biggest risk of transient faults is undetected data corruption that changes the final result of the application without anyone knowing. To evaluate the effects of tra...
The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can...
In High Performance Computing (HPC) the demand for more performance is satisfied by increasing the n...
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection...
The increasing failure rate in High Performance Computing encourages the investigation of fault tole...
In parallel systems, a number of measures of performance are not accurate or representative of their...
Reliability and availability have become increasingly important in today’s computer-depend...
The demand for computational power has been leading the improvement of the High Performance Computin...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
Transient faults are becoming a critical concern among current trends of design of general-purpose m...
In massively parallel systems (MPS), fault tolerance is indispensable to obtain proper completion o...
In the research reported in this paper, transient faults were injected in the nodes and in the commu...
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and sile...
For massively parallel systems, the probability of a system failure due to a random hardware fa...