Parallelism is ubiquitous in modern computer architectures. Heterogeneity of CPU cores and deep memory hierarchies make modern architectures difficult to program efficiently. Achieving top performance on supercomputers is difficult due to complex hardware, software, and their interactions. Production software systems fail to achieve top performance on modern architectures broadly due to three main causes: resource idleness, parallel overhead, and data movement overhead. This dissertation presents novel and effective performance analysis tools, adaptive runtime systems, and architecture-aware algorithms to understand and address these problems. Many future high performance systems will employ traditional multicore CPUs augmented with accel...
Heterogeneous parallel architectures like those composed of CPUs and GPUs are a tantalizing compute...
Tuning the performance of applications requires understanding the interactions between code and targ...
In recent years the power wall has prevented the continued scaling of single core performance. This ...
The multicore era has initiated a move to ubiquitous parallelization of software. In the process, co...
With processor clock speeds having stagnated, parallel computing architectures have achieved a break...
High compute-density with massive thread-level parallelism of Graphics Processing Units (GPUs) is be...
To help shrink the programmability-performance efficiency gap, we discuss that adaptive runtime syst...
The continued growth of the computational capability of throughput processors has made throughput...
With the massive multithreading execution feature, graphics processing units (GPUs) have b...
Enhancing the match between software executions and hardware features is key to computing efficiency...
GPUs have become popular due to their high computational power. Data scientists rely on GPUs to proc...
Systems for high performance computing are getting increasingly complex. On the one hand, the number...
The end of Dennard scaling also brought an end to frequency scaling as a means to improve performanc...
[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT REQUEST OF AUTHOR.] As computers began to reach ...
Applications may have unintended performance problems in spite of compiler optimizations, because of...