For loop accelerators such as coarse-grained reconfigurable architectures (CGRAs) and GPGPUs, nested loops represent an important source of parallelism. Existing solutions for mapping nested loops onto CGRAs, however, are either limited to perfectly nested loops or are expensive and inflexible. Efficient CGRA mapping of imperfectly nested loops with arbitrary nesting depth remains a challenge. In this paper we propose a cooperative compiler-hardware approach that is flexible yet able to generate efficient mappings for imperfectly nested loops. It is based on loop flattening; to mitigate the negative impact of flattening, we combine it with loop fission and a lightweight architecture extension designed to accelerate common operation pa...
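To make the flattening idea concrete, here is a minimal Python sketch on a hypothetical row-sum kernel. The kernel and the function names are illustrative assumptions, not taken from the paper, and the sketch shows the generic transformations rather than the paper's mapping algorithm. The `s[i] = 0` statement between the two loop headers is what makes the nest imperfect. Flattening turns the two loops into one loop of n*m iterations and guards the outer-level statement with `j == 0`; applying loop fission first moves that statement into its own loop, so the remaining perfect nest flattens without a per-iteration guard. The sketch assumes a rectangular nest with a constant inner trip count m >= 1.

```python
# Hypothetical row-sum kernel, used only to illustrate flattening and fission.

def row_sums_nested(a):
    """Reference version: an imperfect nest (s[i] = 0 sits between the loop headers)."""
    n, m = len(a), len(a[0])
    s = [None] * n
    for i in range(n):
        s[i] = 0                      # outer-level statement -> nest is imperfect
        for j in range(m):
            s[i] += a[i][j]
    return s


def row_sums_flattened(a):
    """Flattened: one loop of n*m iterations; the outer-level statement is guarded."""
    n, m = len(a), len(a[0])          # assumes a rectangular input with m >= 1
    s = [None] * n
    for k in range(n * m):
        i, j = divmod(k, m)           # recover the original induction variables
        if j == 0:                    # guard runs the former outer-loop body once per i
            s[i] = 0
        s[i] += a[i][j]
    return s


def row_sums_fission_flattened(a):
    """Fission first, then flatten: the remaining nest is perfect, so no guard is needed."""
    n, m = len(a), len(a[0])
    s = [None] * n
    for i in range(n):                # loop produced by fission: initialisation only
        s[i] = 0
    for k in range(n * m):            # flattened perfect nest
        i, j = divmod(k, m)
        s[i] += a[i][j]
    return s


a = [[1, 2, 3], [4, 5, 6]]
print(row_sums_nested(a), row_sums_flattened(a), row_sums_fission_flattened(a))
# [6, 15] [6, 15] [6, 15]
```

The `j == 0` guard is exactly the kind of per-iteration overhead that flattening introduces; fission is legal here because every `s[i] = 0` still executes before any accumulation into `s[i]`.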
Due to their flexibility and high performance, Coarse-Grained Reconfigurable Arrays (CGRAs) are a topi...
Research interest and industry investment in edge computing solutions have inc...
Data locality and synchronization overhead are two important factors that affect the performance of ...
Nested loops represent a significant portion of application runtime in multimedia and DSP applicatio...
Pipelining algorithms are typically concerned with improving only the steady-state performance, or t...
The effective parallelization of applications exhibiting irregular nested parallelism is still an op...
Loops are an important source of optimization. In this paper, we propose a new technique for optimiz...
Coarse-Grained Reconfigurable Array (CGRA) processors accelerate inner loops of applications by expl...
Coarse-Grained Reconfigurable Architectures (CGRAs) provide an excellent balance between performance...
Many loop nests in scientific codes contain a parallelizable outer loop but have an inner loop for w...
Control divergence poses many problems in parallelizing loops. While predicated execution is commonl...
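As a reminder of what predicated execution means in this setting, the sketch below if-converts a small loop with a data-dependent branch. The kernel and function names are hypothetical, and this is generic if-conversion, not the technique proposed in the truncated abstract. Both paths are computed on every iteration and a 0/1 predicate selects the result, which removes the divergent branch at the cost of executing work that is then discarded.

```python
# Hypothetical loop body with a data-dependent branch, used only for illustration.

def transform_branchy(xs):
    """Original loop: which branch runs depends on the data, so control diverges."""
    out = []
    for x in xs:
        if x % 2 == 0:
            out.append(x * 2)
        else:
            out.append(x + 1)
    return out


def transform_predicated(xs):
    """If-converted loop: both paths execute every iteration and a select picks one."""
    out = []
    for x in xs:
        p = int(x % 2 == 0)                            # predicate as 0 or 1
        then_val = x * 2                               # then-path, always computed
        else_val = x + 1                               # else-path, always computed
        out.append(p * then_val + (1 - p) * else_val)  # arithmetic select, no branch
    return out


print(transform_branchy([1, 2, 3, 4]), transform_predicated([1, 2, 3, 4]))
# [2, 4, 4, 8] [2, 4, 4, 8]
```

The arithmetic select assumes integer-valued paths; with floating point, an infinite or NaN value on the unselected path would corrupt the result, which is one reason hardware predication uses a true select operation instead.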
In this paper, an approach to the problem of exploiting parallelism within nested loops is ...
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefi...
Tiling is a well-known loop transformation to improve temporal locality of nested loops. Current com...
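A minimal sketch of tiling, applied to a hypothetical matrix-multiply kernel (not the specific tiling technique or compiler the truncated abstract refers to). The tile loops step over blocks of the iteration space and the point loops stay inside one block, so the elements of A, B, and C touched by that block are reused while they are still close in the memory hierarchy. The tile size of 2 is an arbitrary assumption.

```python
# Hypothetical matrix-multiply kernel; tile size 2 is an arbitrary choice for illustration.

def matmul_naive(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Same computation with the i, j, k iteration space split into tile-sized blocks."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):              # tile loops: walk over blocks
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                # point loops: stay inside one block so its data can be reused
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        for k in range(kk, min(kk + tile, m)):
                            C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_tiled(A, B))  # [[19, 22], [43, 50]]
```

Python lists give no real cache benefit here; the sketch only shows the restructured loop nest that a compiler for a cache-based or CGRA target would exploit. The `min(...)` bounds handle matrix sizes that are not multiples of the tile size.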
Over the past 20 years, increases in processor speed have dramatically outstripped performance incre...