High-performance systems pose a number of challenges to traditional fault tolerance approaches. The exponential increase in core counts in large-scale distributed systems exposes them to a growing number of permanent, intermittent, and transient faults. The redundancy schemes in use increase the share of system resources dedicated to recovery, while the extensive use of silent-failure modes inhibits a system's ability to detect faults that hinder application progress. As parallel computation strives to survive high failure rates, software shifts its focus towards supporting resilience. This thesis proposes a resilience-support mechanism for Chapel, the high-performance language developed by Cray. We investigate the potential for embedded ...
The increasing failure rate in High Performance Computing encourages the investigation of fa...
Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes can fail....
This paper describes an approach to extend process modeling for engineering de...
The consistent trends of increasing core counts and decreasing mean-time-to-failure in supercomputer...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
2015-08-04: Future exascale high-performance computing (HPC) systems will be constructed using VLSI de...
Exascale studies project reliability challenges for future high-performance computing (HPC) ...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
Maintaining performance in a faulty distributed computing environment is a major challenge in the de...
Two areas are currently the focus of active research, namely cloud computing a...
As people are becoming increasingly dependent on computerized systems, the need for these systems to...
In High Performance Computing (HPC) the demand for more performance is satisfied by increasing the n...
Supercomputers have played an essential role in the progress of science and engineering research. As...
Because e-Science applications are data intensive and require long execution r...
The increasing failure rate in High Performance Computing encourages the investigation of fault tole...