Single-Instruction-Multiple-Data (SIMD) architectures are widely used to accelerate applications with Data-Level Parallelism (DLP); the on-chip memory system handles communication between the Processing Elements (PEs) and the on-chip vector memory. Inefficiency in this on-chip memory system is often a performance bottleneck. In this paper, we describe the design and implementation of an efficient vector data memory system. The proposed memory system consists of two novel parts: an access-pattern-aware memory controller and an automatic loading mechanism. The memory controller reduces data reorganization overheads. The automatic loading mechanism loads data automatically according to the access patterns without l...
Vector processors often use a cache to exploit temporal locality and reduce memory bandwidth demands...
In the low-end mobile processor market, power, energy, and area budgets are significantly lower than...
Memory system efficiency is crucial for any processor to achieve high performance, especially in the...
This thesis explores a new approach to building data-parallel accelerators that is based on simplify...
This paper presents mathematical foundations for the design of a memory controller subcomponent that...
Vector processors have good performance, cost and adaptability when targeting multimedia application...
We present a taxonomy and modular implementation approach for data-parallel accelerators, including ...
In this work, we propose a Programmable Vector Memory Controller (PVMC), which boosts noncontiguous ...
To manage power and memory wall effects, the HPC industry supports FPGA reconfigurable accelerators ...
The concept of Parallel Vector (scratch pad) Memories (PVM) was introduced as one solution for Paral...
Processor clock frequencies and the related performance improvements recently stagnated due to sever...
We are attacking the memory bottleneck by building a “smart” memory controller that improves effect...