The Message-Passing Interface (MPI) is large and complex, which makes MPI programming error-prone. Several MPI runtime correctness tools address classes of usage errors, such as deadlocks or nonportable constructs. To our knowledge, none of these tools scales to more than about 100 processes. However, some current HPC systems use more than 100,000 cores, and future systems are expected to use far more. Since errors often depend on the task count used, we need correctness tools that scale to the full system size. We present a novel framework for scalable MPI correctness tools to address this need. Our fine-grained, module-based approach supports rapid prototyping and allows correctness tools built upon it to adapt to different archite...
The MPI standard lacks semantics and interfaces for sustained application execution in th...
High-performance computing codes often combine the Message-Passing Interface (MPI) with a shared-mem...
Due to the increasing size of HPC machines, dealing with faults is becoming mandatory due to their h...
Increasing computational demand of simulations motivates the use of parallel computing systems. At t...
The Message Passing Interface (MPI) is the de-facto standard for distributed memory computing in hig...
MPI is the de-facto standard message-passing based parallel programming model. However, the bug dete...
Writing correct and portable MPI programs is hard. Out of bound parameters, inconsistent u...
Faults have become the norm rather than the exception for high-end computing on clusters wi...
The trend towards many-core multi-processor systems and clusters will make systems with tens and hun...
An MPI profiling library is a standard mechanism for intercepting MPI calls by applications. Profil...
Faults have become the norm rather than the exception for high-end computing on clusters with 10s/10...
Deadlock detection is one of the main issues of software testing in High Performance Computing (HPC)...