NUMA multi-core systems partition system resources into several nodes. When the load across cores becomes imbalanced, the kernel scheduler’s load-balancing mechanism migrates threads between cores, possibly across NUMA nodes. After a cross-node migration, a thread must use remote memory accesses to reach data that remains on its previous node, which degrades performance. Because thread selection runs in the critical path of the kernel scheduler, the threads to migrate must be chosen both effectively and efficiently. This study focuses on improving inter-node load balancing for multithreaded applications. We propose a thread-aware selection policy that considers how the threads of each thread group are distributed across nodes when selecting one thread to migrate for inter-node load balancing....
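As a rough illustration of what a thread-aware selection could look like, the following Python sketch picks one thread to move from a source node to a destination node by looking at how each thread group is spread over the two nodes. The scoring rule (prefer a thread whose group already has more members on the destination and fewer on the source) is an assumption made for this example, not the policy defined in the paper, and the names `Thread` and `pick_thread_to_migrate` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Thread:
    tid: int    # thread id
    group: int  # thread group id (e.g. the owning process)
    node: int   # NUMA node the thread currently runs on


def pick_thread_to_migrate(threads, src, dst):
    """Return the thread on node `src` to move to node `dst`, or None.

    Assumed heuristic: maximise (group members on dst) - (group members
    on src), so the migrated thread joins siblings instead of leaving them.
    Ties are broken in favour of the lowest tid.
    """
    # Count threads per (group, node) pair in a single pass.
    per_group_node = Counter((t.group, t.node) for t in threads)

    candidates = [t for t in threads if t.node == src]
    if not candidates:
        return None

    def score(t):
        on_dst = per_group_node[(t.group, dst)]
        on_src = per_group_node[(t.group, src)]
        return (on_dst - on_src, -t.tid)

    return max(candidates, key=score)


threads = [
    Thread(tid=1, group=1, node=0),
    Thread(tid=2, group=1, node=0),
    Thread(tid=3, group=1, node=0),
    Thread(tid=4, group=2, node=0),
    Thread(tid=5, group=2, node=1),
    Thread(tid=6, group=2, node=1),
]
chosen = pick_thread_to_migrate(threads, src=0, dst=1)
```

In this toy input, thread 4 is the only member of group 2 left on node 0 while its two siblings run on node 1, so the sketch selects it rather than splitting group 1. A real scheduler would need a much cheaper bookkeeping scheme, since this selection sits in the scheduler's critical path.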
The performance of thread mechanism is dominated primarily by two kinds of thread-switching overhead...
This paper presents some techniques for efficient thread forking and joining in parallel execution e...
This paper introduces a learning-based framework for dynamic placement of threads of parallel applic...
Multicore multiprocessors use Non Uniform Memory Architecture (NUMA) to improve their scalability. ...
Multicore multiprocessors use a Non Uniform Memory Architecture (NUMA) to improve their scalability....
In modern Non-Uniform Memory Access (NUMA) systems, there are multiple memory nodes, each with its o...
A common approach to improve memory access in NUMA machines exploits operating system (OS) page prot...
Modern hardware is trending towards increasingly parallel and heterogeneous architectures. Contempor...
Multi-core nodes with Non-Uniform Memory Access (NUMA) are now a common architecture for h...
It is well known that the placement of threads and memory plays a crucial role for performance on NU...
Current multi-socket systems have complex memory hierarchies with significant Non-Uniform Memory Acc...
The problem of placement of threads, or virtual cores, on physical cores in a multicore system has b...
In this paper we describe the way thread migration can be carried out in Distributed Shared Memory (...