Today’s high performance computing systems are made possible by multiple increases in hardware parallelity. This results in the decrease of mean time to failures of the systems with each newer generation, which is an alarming trend. Therefore, it is not surprising that a lot of research is going on in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover with minimal cost. We have used Global Address Space Programming Interface (GASPI), which is a relatively new communication library based on the PGAS model. It fulfills the basic requirement of a fault tolerant communication library, i.e. the failure of a process does not cause the remaining processes to fail. This work is focus...
The coming exascale era is a great opportunity for high performance computing (HPC) applications. Ho...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
Resilience and fault tolerance are challenging tasks in the field of high performance computing (HPC...
The Partitioned Global Address Space (PGAS) has emerged recently for parallel programming at large s...
With increasing numbers of processors on todays machines, the probability for node or link failures ...
With increasing numbers of processors on todays ma-chines, the probability for node or link failures...
Scientific applications have long embraced the MPI as the environment of choice to execute on large ...
Abstract. Most predictions of Exascale machines picture billion way parallelism, encompassing not on...
With increasing numbers of processors on current ma-chines, the probability for node or link failure...
The consistent trends of increasing core counts and decreasing mean-time-to-failure in supercomputer...
Abstract. As the scale of computing platforms becomes increasingly extreme, the requirements for app...
Abstract—As recent research has demonstrated, it is be-coming a necessity for large scale applicatio...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
Petaflops systems will have tens to hundreds of thousands of compute nodes which increases the likel...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
The coming exascale era is a great opportunity for high performance computing (HPC) applications. Ho...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
Resilience and fault tolerance are challenging tasks in the field of high performance computing (HPC...
The Partitioned Global Address Space (PGAS) has emerged recently for parallel programming at large s...
With increasing numbers of processors on todays machines, the probability for node or link failures ...
With increasing numbers of processors on todays ma-chines, the probability for node or link failures...
Scientific applications have long embraced the MPI as the environment of choice to execute on large ...
Abstract. Most predictions of Exascale machines picture billion way parallelism, encompassing not on...
With increasing numbers of processors on current ma-chines, the probability for node or link failure...
The consistent trends of increasing core counts and decreasing mean-time-to-failure in supercomputer...
Abstract. As the scale of computing platforms becomes increasingly extreme, the requirements for app...
Abstract—As recent research has demonstrated, it is be-coming a necessity for large scale applicatio...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
Petaflops systems will have tens to hundreds of thousands of compute nodes which increases the likel...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
The coming exascale era is a great opportunity for high performance computing (HPC) applications. Ho...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
Resilience and fault tolerance are challenging tasks in the field of high performance computing (HPC...