In this work we study the effect of data locality on the performance of Gaussian 03 code running on a multicore Non-Uniform Memory Access (NUMA) system. We consider a userspace protocol that affects runtime data locality through dynamic page migration and interleaving techniques. Using this protocol results in a significant performance improvement. Results for parallel Gaussian 03 using up to 16 threads are presented. The overhead of page migration and the effect of dual-core contention are also examined.
Processors with multiple sockets or chiplets are becoming more conventional. These kinds of processo...
This paper compares data distribution methodologies for scaling the performance of OpenMP on NUMA ar...
This paper presents user-level dynamic page migration, a runtime technique which transparently enabl...
Current high-performance multicore processors provide users with a non-uniform memory access model (...
This paper presents algorithms for improving the performance of parallel programs on multiprogrammed...
In this work, we extend and evaluate a simple performance model to account for NUMA and bandwidth ef...
This paper makes two important contributions. First, the paper investigates the performance implicat...
Shared memory systems are becoming increasingly complex as they typically integrate several storage ...
In this paper, we compare and contrast two techniques to improve capacity/conflict miss traffic in C...
This paper introduces two novel algorithms for thread migrations, named CIMAR (Core-aware Interchang...
This paper makes two important contributions. First, the paper investigates the performance implica...
The latency of memory access times is hence non-uniform, because it depends on where the request ori...
Nonuniform memory access time (referred to as NUMA) is an important feature in the design of large s...
A common approach to improve memory access in NUMA machines exploits operating system (OS) page prot...