We have developed a hierarchical performance bounding methodology that attempts to explain the performance of loop-dominated scientific applications on particular systems. The Kendall Square Research KSR1 is used as a running example. We model the throughput of key hardware units that are common bottlenecks in concurrent machines. The four units currently used are: memory port, floating-point, instruction issue, and a loop-carried dependence pseudo-unit. We propose a workload characterization, and derive upper bounds on the performance of specific machine-workload pairs. Comparing delivered performance with bounds focuses attention on areas for improvement and indicates how much improvement might be attainable. We delineate a comprehensive ...
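The core idea of the abstract — a performance upper bound set by the most heavily used hardware unit — can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' actual model: the per-iteration operation counts and per-cycle issue widths below are invented for a hypothetical loop, and the four unit names simply mirror those listed in the abstract.

```python
# Sketch of a hardware-unit bounding model (illustrative, not the paper's
# exact formulation): each unit imposes a lower bound on cycles per loop
# iteration, and the slowest (bottleneck) unit bounds overall performance.

def unit_bound_cycles(op_counts, issue_widths):
    """Return per-unit cycle bounds and the overall bound (their maximum).

    op_counts:    operations per iteration demanded of each unit
    issue_widths: operations each unit can service per cycle
    """
    per_unit = {u: op_counts.get(u, 0) / issue_widths[u]
                for u in issue_widths}
    return per_unit, max(per_unit.values())

# Hypothetical counts for a DAXPY-like loop body and assumed issue widths
# for the four units named in the abstract.
ops = {"memory": 3, "fp": 2, "issue": 5, "dependence": 0}
widths = {"memory": 1, "fp": 1, "issue": 2, "dependence": 1}

per_unit, bound = unit_bound_cycles(ops, widths)
# Here the memory port needs 3 cycles/iteration, so it is the bottleneck;
# delivered performance below this bound points at other inefficiencies.
```

Comparing measured cycles per iteration against `bound` then indicates both where to tune (the bottleneck unit) and how much headroom remains, which is the comparison the methodology prescribes.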
As the number of compute cores per chip continues to rise faster than the total amount of available ...
In this paper, the authors characterize application performance with a memory-centric view. Using a ...
The multicore era has initiated a move to ubiquitous parallelization of software. In the process, co...
An effective methodology of performance evaluation and improvement enables application developers to...
We have developed a performance bounding methodology that explains the performance of loop-dominated...
While parallel computing offers an attractive perspective for the future, developing efficient paral...
Hierarchical memory is a cornerstone of modern hardware design because it provides high memory perfo...
Scientific programs are typically characterized as floating-point intensive loop-dominated tasks wit...
Tuning the performance of applications requires understanding the interactions between code and targ...
Performance and scalability of high performance scientific applications on large scale parallel mach...
Systems for high performance computing are getting increasingly complex. On the one hand, the number...
A method is presented for modeling application performance on parallel computers in terms of the per...
Performance tuning, as carried out by compiler designers and application programmers to close the pe...