Efficiently programming shared-memory machines is a difficult challenge because mapping application threads onto the memory hierarchy has a strong impact on performance. However, optimizing such thread placement is difficult: architectures become increasingly complex, and application behavior changes with implementations and input parameters, e.g., problem size and number of threads. In this work, we propose a fully automatic, abstracted and portable affinity module. It produces and implements an optimized affinity strategy that combines knowledge about application characteristics and the platform topology. Implemented in the back-end of our runtime system (ORWL), our approach was used to enhance the performance and t...
This paper compares data distribution methodologies for scaling the performance of OpenMP on NUMA ar...
Funding: This work was generously supported by UK EPSRC Energise, grant number EP/V006290/1. This pap...
Parallel computing platforms are increasingly complex, with multiple cores, shared caches, and NUMA ...
The complexity of efficient thread management steadily rises with the number of processor cores a...
The ordered read-write lock model (ORWL) is a modern framework that proposes h...
F. Wolf, B. Mohr, and D. an Ney (Eds.), pp. 53-64. Thread affinity has...
The performance and energy efficiency of modern architectures depend on memory locality, which can b...
Exploiting the full computational power of current hierarchical multiprocessor...
The parallelism in shared-memory systems has increased significantly with the ...
Current and future architectures rely on thread-level parallelism to sustain p...
The now commonplace multi-core chips have introduced, by design, a deep hierar...
Process placement, also called topology mapping, is a well-known strategy to i...
This paper presents COMPROF and COMPLACE, a novel profiling tool and thread placement technique for ...
Current multi-socket systems have complex memory hierarchies with significant Non-Uniform Memory Acc...