The use of FPGAs for computational workloads is becoming increasingly popular because of their flexibility compared to ASICs and their low power consumption compared to GPUs and CPUs. However, scientific applications run for long periods of time, and the hardware is always subject to failures caused by soft or hard errors. It is therefore important to protect these long-running jobs with fault tolerance mechanisms. Checkpoint-Restart is a popular technique in high-performance computing that allows large-scale applications to cope with frequent failures. In this work we approach the fault tolerance of CPU-FPGA heterogeneous applications from a high level by using the OmpSs@FPGA environment and a multi-level checkpointing library. ...
Checkpointing is a typical approach to tolerate failures in today’s supercomputing cluste...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
The utilization of new generation computing platforms like computational grids or desktop grids intr...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
Heterogeneous systems have grown in popularity in recent years due to the high per...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
By leveraging enormous computational capabilities, scientists today are able to ...
Please refer to pdf. James Watt Scholarship; Engineering and Physical Sciences Research Council (EPSRC)...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
As the feature size shrinks to the nanometer scale, SRAM-based FPGAs are increasingly vulne...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
Technology scaling and a continual increase in operating frequency have been the main drivers of proc...
Next-generation exascale systems, those capable of performing a quintillion operations per second, ...
Next-generation exascale systems, those capable of performing a quintillion (10^18) operations ...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...