[Abstract] Heterogeneous systems have grown in popularity in recent years because devices such as GPUs or Xeon Phi accelerators provide high performance with reduced energy consumption. This paper proposes a checkpoint-based fault tolerance solution for heterogeneous applications that allows them to survive fail-stop failures in the host CPU or in any of the accelerators used. In addition, applications can be restarted on a different host CPU and/or accelerator device architecture, adapting the computation to the number of devices available during recovery. The proposed solution is built by combining CPPC (ComPiler for Portable Checkpointing), an application-level checkpointing tool, and HPL (Heterogeneous Progr...
In High Performance Computing (HPC) the demand for more performance is satisfied by increasing the n...
Checkpointing in a homogeneous environment, where both checkpointing and recovery are performed on t...
Computer systems are a permanent part of our daily lives across a wide range of applications. In syst...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
[Abstract] Despite the increasing popularity of shared-memory systems, there is a lack of tools for ...
The use of FPGAs in computational workloads is becoming increasingly popular due to the flexibility ...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
[Abstract] As parallel machines increase their number of processors, so does the failure rate of the gl...
Please refer to pdf. James Watt Scholarship. Engineering and Physical Sciences Research Council (EPSRC)...
The efficient utilization of current supercomputing systems with deep storage hierarchies demands sc...
One of the major challenges in using extreme scale systems efficiently is to mitigate the impact of ...
Clusters of message-passing computing nodes provide high-performance platforms for distributed appli...
The increasing failure rate in High Performance Computing encourages the investigation of fault tole...