Both NUMA thread/data placement and hardware prefetcher configuration have significant impacts on HPC performance. Optimizing both together leads to a large and complex design space that has previously been impractical to explore at runtime. In this work we deliver the performance benefits of optimizing both NUMA thread/data placement and prefetcher configuration at runtime through careful modeling and online profiling. To address the large design space, we propose a prediction model that reduces the amount of input information needed and the complexity of the prediction required. We do so by selecting a subset of performance counters and application configurations that provide the richest profile information as inputs, and by limiting the ...
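As a concrete illustration of the two tuning knobs the abstract refers to (this sketch is not part of the paper itself): on Linux, NUMA thread and data placement can be controlled with `numactl`, and on many Intel processors the four core hardware prefetchers can be toggled through MSR 0x1A4 via `msr-tools`. The bit layout below matches Intel's published prefetcher-control disclosure, but exact behavior varies by microarchitecture; `./app` and the chosen values are placeholders.

```shell
# Bind an application's threads and memory to NUMA node 0
# (interleaving across nodes is another common placement policy).
numactl --cpunodebind=0 --membind=0 ./app

# On many Intel cores, MSR 0x1A4 controls the hardware prefetchers:
#   bit 0: L2 hardware prefetcher       bit 2: DCU (L1) streamer prefetcher
#   bit 1: L2 adjacent-cache-line       bit 3: DCU IP prefetcher
# Setting a bit DISABLES that prefetcher. Requires root and the msr kernel module.
sudo wrmsr -a 0x1A4 0x5   # disable both streamer prefetchers, keep the others
sudo rdmsr -a 0x1A4       # read back the configuration on every core
sudo wrmsr -a 0x1A4 0x0   # re-enable all four prefetchers
```

The cross product of placement policies and the 16 prefetcher settings is exactly the configuration space whose runtime exploration the paper's prediction model is meant to make tractable.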
Part 5: Performance Modeling, Prediction, and Tuning. Some typical memory access...
Performance bottlenecks across distributed nodes, such as in high performance computing grids or clo...
The benefits of prefetching have been largely overshadowed by the overhead required to produce high...
There is a large space of NUMA and hardware prefetcher configurations that can...
HPC systems expose configuration options that help users optimize their applications' execution. Que...
Non Uniform Memory Access (NUMA) architectures are nowadays common for running...
Modern processors are equipped with multiple hardware prefetchers, each of which targets a ...
Nowadays, NUMA architectures are common in compute-intensive systems. Achievin...
NUMA architectures are ubiquitous in HPC systems. NUMA, along with other factors including socket layout, data pla...
An important technique for alleviating the memory bottleneck is data prefetching. Data prefetching ...
Processors with multiple sockets or chiplets are becoming increasingly common. These kinds of processo...
As the digitisation of the world progresses at an accelerating pace, an overwhelming quantity of dat...
Multicore multiprocessors use Non Uniform Memory Architecture (NUMA) to improve their scalability. ...
As the number of cores increases, Non-Uniform Memory Access (NUMA) is becoming increasingly prevalent...