Performance tuning of non-blocking threads is based on graph partitioning algorithms that create serial code blocks from dependence graphs. Existing algorithms are directed toward deadlock avoidance and maximization of run length. The latter criterion often incurs high synchronization overhead. This paper presents a partitioning algorithm for dependence graphs that uses a heuristic to determine a cost-efficient solution based on an architecture-dependent cost function. We present empirical results based on benchmark programs that were compiled with MIT's Id compiler, extended by our architecture-dependent partitioning algorithm. The results demonstrate a reduction in software overhead with our architecture-dependent parti...
An algorithm can be modeled as an index set and a set of dependence vectors. Each index vector in th...
In this paper we present an algorithm for system level hardware/software partitioning of heterogeneo...
In this paper we present substantially improved thread partitioning algorithms for modern implicitly...
In this paper, we present an efficient framework for intraprocedural performance based program parti...
Existing partitioning algorithms provide limited support for load balancing simulations tha...
This paper describes a method of analysis for detecting and minimizing memory latency using a direct...
The ordering of operations in a data flow program is not specified by the programmer, but is implied...
Three related problems, among others, are faced when trying to execute an algorithm on a parallel ma...
Current high performance computing architectures are composed of large shared memory NUMA nodes, amo...
The topic of intermediate languages for optimizing and parallelizing compilers has received much at...
The data dependence graph is very useful to parallel algorithm design. In this paper, ap...
In order to improve system performance efficiently, a number of systems choose to equip mul...