Evaluating the strong scalability of OpenMP applications is a costly and time-consuming process. It traditionally requires executing the whole application multiple times with different numbers of threads. We propose the Parallel Codelet Extractor and REplayer (PCERE), a tool to reduce the cost of scalability evaluation. PCERE decomposes applications into small pieces called codelets: each codelet maps to an OpenMP parallel region and can be replayed as a standalone program. To accelerate scalability prediction, PCERE replays codelets while varying the number of threads. Prediction speedup comes from two key ideas. First, the number of invocations during replay can be significantly reduced. Invocations that have the same...
HPC application developers encounter significant challenges getting their codes to run correctly on ...
The performance of many parallel applications relies not on instruction-level parallelism but on loo...
Traditional parallel applications have exploited regular parallelism, based on parallel loops. Only ...
This article presents Codelet Extractor and REplayer (CERE), an open-source fr...
OpenMP, a directive-based API, supports multithreaded programming on shared-memory systems. Since O...
OpenMP is a popular application programming interface (API) used to write shared-memory parallel pro...
The Petascale Computing Enabling Technologies (PCET) project addressed challenges arising from curre...
Exascale systems will exhibit much higher degrees of parallelism both in terms of the number of node...
Current architecture complexity requires fine tuning of compiler and runtime p...
The efficient mapping of program parallelism to multi-core processors is highly dependent on the und...
We present a new technique for identifying scalability bottlenecks in executions of single-program,...
In general, a parallel computer is a computer that has multiple processors connected b...
As computers with tens of thousands of processors successfully deliver high performance power for so...
The most widely used node type in high-performance computing nowadays is a 2-socket server node. The...
In this paper, we present two new approaches while rendering necessary extensions to Periscope to pe...