This paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns. The Parallel Vector Access (PVA) unit exploits the regularity of vectors or streams to access them efficiently in parallel on a multi-bank SDRAM memory system. The PVA unit performs scatter/gather operations so that only the elements accessed by the application are transmitted across the system bus. Vector operations are broadcast in parallel to all memory banks, each of which implements an efficient algorithm to determine which vector elements it holds. Earlier performance evaluations have demonstrated that our PVA implementation loads eleme...
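The idea of each bank deciding which vector elements it holds can be illustrated with a minimal sketch. It assumes simple word interleaving, where word address `a` lives in bank `a % num_banks`, and it tests every element by brute force. This is not the efficient per-bank algorithm the abstract refers to, and the function and parameter names (`elements_in_bank`, `gather_plan`, `base`, `stride`, `length`, `num_banks`) are hypothetical.

```python
def elements_in_bank(base, stride, length, num_banks, bank):
    """Return the indices i of vector elements that fall in `bank`.

    Assumption: word-interleaved memory, so the element at word address
    base + i * stride is stored in bank (base + i * stride) % num_banks.
    Brute-force check of every element, for illustration only.
    """
    return [i for i in range(length)
            if (base + i * stride) % num_banks == bank]


def gather_plan(base, stride, length, num_banks):
    """Map each bank to the element indices it would serve for one vector.

    Models the broadcast: every bank receives the same (base, stride, length)
    description and independently selects its own elements.
    """
    return {bank: elements_in_bank(base, stride, length, num_banks, bank)
            for bank in range(num_banks)}


if __name__ == "__main__":
    # Stride 3 over 4 banks spreads the 8 elements evenly across banks.
    print(gather_plan(base=0, stride=3, length=8, num_banks=4))
    # Stride 4 over 4 banks sends every element to the same bank.
    print(gather_plan(base=0, stride=4, length=8, num_banks=4))
```

Under this interleaving assumption, a stride that shares no common factor with the bank count spreads elements over all banks, while a stride that is a multiple of the bank count concentrates them in a single bank.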
The purpose of this paper is to show that multi-threading techniques can be applied to a vector proc...
The Structured Memory Access (SMS) architecture implementation presented in this thesis is formulate...
We present a taxonomy and modular implementation approach for data-parallel accelerators, including ...
We are attacking the memory bottleneck by building a “smart” memory controller that improves effect...
Memory system efficiency is crucial for any processor to achieve high performance, especially in the...
In this work, we propose a Programmable Vector Memory Controller (PVMC), which boosts noncontiguous ...
Vector supercomputers, which can process large amounts of vector data efficiently, are among the fas...
This paper introduces an innovative cache design for vector computers, called prime-mapped cache. By...
Single-Instruction-Multiple-Data (SIMD) architectures are widely used to accelerate applications inv...
On many commercial supercomputers, several vector register processors share a global highly interlea...
To manage power and memory wall effects, the HPC industry supports FPGA reconfigurable accelerators ...
The concept of Parallel Vector (scratch pad) Memories (PVM) was introduced as one solution for Paral...
This paper presents an experimental study on cache memory designs for vector computers. We use an ex...