Faults have become the norm rather than the exception for high-end computing on clusters with tens to hundreds of thousands of cores. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that corrupt memory while applications continue to operate and report incorrect results. This paper studies the potential for redundancy to both detect and correct soft errors in MPI message-passing applications. Our study investigates the challenges inherent to detecting soft errors within MPI applications while providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI message data between replicas, we study the best suited prot...
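The replica-comparison model in the abstract above can be illustrated with a minimal sketch. It is not the paper's protocol: it assumes each replica's outgoing message is available as a byte string, and the helper names `digest` and `vote` are hypothetical. It only shows how comparing message data across replicas can detect a mismatch (two replicas) and, given a strict majority, pick a payload to forward (three or more replicas).

```python
import hashlib
from collections import Counter
from typing import Optional, Sequence, Tuple


def digest(payload: bytes) -> str:
    """Hash a message payload so replicas can compare digests, not full buffers."""
    return hashlib.sha256(payload).hexdigest()


def vote(payloads: Sequence[bytes]) -> Tuple[Optional[bytes], bool]:
    """Compare the same logical message as produced by each replica.

    Returns (agreed_payload, mismatch_detected). agreed_payload is None
    when no strict majority exists, i.e. the error is detected but
    cannot be corrected by voting alone.
    """
    if not payloads:
        raise ValueError("at least one replica payload is required")

    counts = Counter(digest(p) for p in payloads)
    top_digest, top_count = counts.most_common(1)[0]
    mismatch = len(counts) > 1

    # A strict majority is needed before a payload is trusted.
    if top_count * 2 > len(payloads):
        winner = next(p for p in payloads if digest(p) == top_digest)
        return winner, mismatch
    return None, mismatch


if __name__ == "__main__":
    # Triple redundancy: one corrupted replica is outvoted.
    print(vote([b"\x01\x02", b"\x01\xff", b"\x01\x02"]))
    # Dual redundancy: the mismatch is detected but not correctable.
    print(vote([b"\x01\x02", b"\x01\xff"]))
```

With two replicas a disagreement can only be flagged; with three, a single corrupted replica is outvoted, which is why the degree of redundancy determines whether errors are merely detected or also corrected.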
The Message Passing Interface (MPI) is the de-facto standard for distributed memory computing in hig...
The coming exascale era is a great opportunity for high performance computing (HPC) applications. Ho...
Resilience and fault tolerance are challenging tasks in the field of high performance computing (HPC...
Abstract—Faults have become the norm rather than the exception for high-end computing on clusters wi...
The Message-Passing Interface (MPI) is large and complex. Therefore, programming MPI is error prone....
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
MPI is the de-facto standard message-passing based parallel programming model. However, the bug dete...
Increasing computational demand of simulations motivates the use of parallel computing systems. At t...
Resilient algorithms in high-performance computing are subject to rigorous non-functional constrain...
Resiliency of exascale systems has quickly become an important concern for the scientific community....
Soft error caused by single event upset has been a severe challenge to aerospace-based computing. Si...
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection...
Hardware errors are on the rise with reducing chip sizes, and power constraints have necessitated th...
An MPI profiling library is a standard mechanism for intercepting MPI calls by applications. Profili...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...