With the ubiquity of multi-core processors, software must make effective use of multiple cores to obtain good performance on modern hardware. One of the biggest roadblocks to this is load imbalance, or the uneven distribution of work across cores. We propose LIME, a framework for analyzing parallel programs and reporting the cause of load imbalance in application source code. This framework uses statistical techniques to pinpoint load imbalance problems stemming from both control flow issues (e.g., unequal iteration counts) and interactions between the application and hardware (e.g., unequal cache miss counts). We evaluate LIME on applications from widely used parallel benchmark suites, and show that LIME accurately reports the causes of ...
Multi-core computers are infamous for being hard to use in time-critical systems due to execution-ti...
Parallelism is ubiquitous in modern computer architectures. Heterogeneity of CPU cores and deep memo...
This paper proposes a methodology for analyzing parallel performance by building cycle stacks. A cyc...
The amount of parallelism in modern supercomputers currently grows from generation to generation. Fu...
The shift towards multicore processing has led to a much wider population of developers being faced ...
The amount of parallelism in modern supercomputers currently grows from generation to generation, an...
Applications must scale well to make efficient use of today’s class of petascale computers,...
In parallel iterative applications, computational efficiency is essential for addressing large probl...
Load balance is critical for performance in large parallel applications. An imbalance on today’s fa...
Understanding why the performance of a multithreaded program does not improve linearly with the numb...
The increasing pervasiveness of multicore processors in today's computing systems will increase the ...
In [8], we demonstrated that contrary to sequential applications, parallel Ope...
The multicore era has initiated a move to ubiquitous parallelization of software. In the process, co...
Performance analysis of parallel programs continues to be challenging for programmers. Programmers h...
In parallel computing, obtaining maximal performance is often mandatory to solve large and complex p...