Current high-performance computing architectures are composed of large shared-memory NUMA nodes, among other components. These nodes are becoming increasingly complex: they contain several NUMA domains, and access latency differs depending on the core where the access is issued. In this work, we propose techniques to efficiently mitigate the negative impact of NUMA effects on the performance of parallel applications. We leverage runtime system metadata expressed as a task dependency graph, in which nodes are sequential pieces of code and edges are control or data dependencies between them, and apply graph partitioning techniques to reduce data transfers. With our proposals, we are able to improve the execution time of OpenMP parall...
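To make the idea of partitioning a task dependency graph across NUMA domains concrete, here is a minimal sketch in Python. It is not the partitioner used in this work: it swaps in a simple greedy heuristic, and the names `partition_tasks` and `cut_weight`, the edge weights (bytes shared between two tasks), and the task-count balance limit are all assumptions made for illustration.

```python
from collections import defaultdict


def partition_tasks(tasks, edges, num_domains):
    """Greedily map each task to a NUMA domain (illustrative heuristic only).

    tasks: task ids in creation order.
    edges: dict mapping (producer, consumer) -> bytes shared (assumed weight).
    """
    # Balance limit: at most ceil(len(tasks) / num_domains) tasks per domain.
    capacity = -(-len(tasks) // num_domains)

    # Undirected adjacency list: locality does not depend on edge direction.
    neighbours = defaultdict(list)
    for (src, dst), weight in edges.items():
        neighbours[src].append((dst, weight))
        neighbours[dst].append((src, weight))

    assignment = {}
    load = [0] * num_domains
    for task in tasks:
        # Bytes this task shares with tasks already placed in each domain.
        affinity = [0] * num_domains
        for other, weight in neighbours[task]:
            if other in assignment:
                affinity[assignment[other]] += weight
        # Among domains with spare capacity, prefer the highest affinity,
        # then the lightest load, then the lowest domain index.
        candidates = [d for d in range(num_domains) if load[d] < capacity]
        best = max(candidates, key=lambda d: (affinity[d], -load[d], -d))
        assignment[task] = best
        load[best] += 1
    return assignment


def cut_weight(edges, assignment):
    """Total bytes carried by edges whose endpoints sit in different domains."""
    return sum(w for (s, d), w in edges.items() if assignment[s] != assignment[d])


if __name__ == "__main__":
    # Two independent dependency chains, interleaved in creation order.
    tasks = ["a0", "b0", "a1", "b1", "a2", "b2"]
    edges = {("a0", "a1"): 64, ("a1", "a2"): 64,
             ("b0", "b1"): 64, ("b1", "b2"): 64}
    mapping = partition_tasks(tasks, edges, num_domains=2)
    print(mapping, cut_weight(edges, mapping))
```

In this toy input each chain ends up in its own domain, so no dependency edge crosses a domain boundary. A production runtime would more plausibly hand the graph to a dedicated multilevel partitioner and weight nodes by task cost rather than by count.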
Performance degradation due to nonuniform data access latencies has worsened on NUMA systems and can...
NUMA-aware parallel algorithms in runtime systems attempt to improve locality by allocatin...
We present a joint scheduling and memory allocation algorithm for efficient ex...
Shared memory systems are becoming increasingly complex as they typically integrate several storage ...
The complexity of shared memory systems is becoming more relevant as the number of memory domains in...
The importance of high-performance graph processing to solve big data problems targeting high-impact...
Processors with multiple sockets or chiplets are becoming more conventional. These kinds of processo...
Parallel task-based programming models like OpenMP support the declaration of task data dependences....
The architecture of supercomputers is evolving to expose massive parallelism. ...
OpenMP is a parallel programming model widely used on shared-memory systems. Over the years, the Ope...
Dynamic task-parallel programming models are popular on shared-memory systems,...
Pattern matching on large graphs is the foundation for a variety of application domains. The continu...
Graph-structured analytics has been widely adopted in a number of big data applications such as soci...