Technology scaling and a continual increase in operating frequency have been the main driver of processor performance for several decades. A recent slowdown in this evolution is compensated by multi-core architectures, which challenge application developers and also increase the disparity between the processor and memory performance. The increasing core count and growing scale of computing systems furthermore turn attention to communication as a significant contributor on application run-times. Larger systems also comprise many more components which are subject to failures. In order to mitigate the effects of these failures, fault tolerance techniques such as Checkpoint/Restart are used. These techniques often rely on message-based comm...
As the number of processors in today’s parallel systems continues to grow, the mean-time-to-failure ...
The next generation of capability-class, massively parallel processing (MPP) systems is expected to ...
An important set of challenges emerge as the High Performance Computing (HPC) community aims to rea...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
The use of FPGAs in computational workloads is becoming increasingly popular due to the flexibility ...
By leveraging the enormous amount of computational capabilities, scientists today are being able to ...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
Today’s supercomputers are built from the state-of-the-art components to extract as much performance...
ArticuloThe predicted failure rates of future supercomputers loom the groundbreaking research large...
In High Performance Computing (HPC) the demand for more performance is satisfied by increasing the n...
Improving the performance of future computing systems will be based upon the ability of increasing t...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
The ever increasing number of processors used in parallel computers is making fault tolerance suppor...
As the number of processors in today’s parallel systems continues to grow, the mean-time-to-failure ...
The next generation of capability-class, massively parallel processing (MPP) systems is expected to ...
An important set of challenges emerge as the High Performance Computing (HPC) community aims to rea...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
The use of FPGAs in computational workloads is becoming increasingly popular due to the flexibility ...
By leveraging the enormous amount of computational capabilities, scientists today are being able to ...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
Today’s supercomputers are built from the state-of-the-art components to extract as much performance...
ArticuloThe predicted failure rates of future supercomputers loom the groundbreaking research large...
In High Performance Computing (HPC) the demand for more performance is satisfied by increasing the n...
Improving the performance of future computing systems will be based upon the ability of increasing t...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
The ever increasing number of processors used in parallel computers is making fault tolerance suppor...
As the number of processors in today’s parallel systems continues to grow, the mean-time-to-failure ...
The next generation of capability-class, massively parallel processing (MPP) systems is expected to ...
An important set of challenges emerge as the High Performance Computing (HPC) community aims to rea...