Abstract. As computing platforms grow to increasingly extreme scales, the demands on application fault tolerance grow with them. Techniques that address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support they are bound to fail. This paper discusses the failure-free overhead and the recovery impact of the User-Level Failure Mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when failures occur.
The running times of many computational science applications are much longer than the mean-time-to-f...
This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputin...
One of the topics of paramount importance in the development of Cluster and Grid middleware is the i...
Scientific applications have long embraced the MPI as the environment of choice to execute on large ...
ing has matured, so too have the tools, libraries, and languages that result from it. The Message Pa...
Abstract—As recent research has demonstrated, it is becoming a necessity for large scale applicatio...
As machine sizes have increased and application runtimes have lengthened, research into fault tolera...
Abstract. This paper examines the topic of writing fault-tolerant MPI applications. We discuss the m...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
As supercomputers are entering an era of massive parallelism where the frequency of faults is increa...
Application developers are investigating Algorithm Based Fault Tolerance (ABFT) techniques to improv...
Abstract. Most predictions of Exascale machines picture billion way parallelism, encompassing not on...
Abstract—Exascale targeted scientific applications must be prepared for a highly concurrent computin...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...