Many applications use vector operations, applying a single instruction to multiple data elements that map to different locations in conventional memory. Transferring data from memory is limited by access latency and bandwidth, which caps the performance gain of vector processing. We present a memory system that makes all of its content available to processors on a schedule, so that processors need not access the memory: each location is made available to all processors at a specific time. Data move in different orbits, becoming available to processors in higher orbits at different times. We use this memory to apply parallel vector operations to data streams at the first orbit level. Data processed in the first level move to the upper orbit one dat...
This paper introduces an innovative cache design for vector computers, called prime-mapped cache. By...
The goal of this paper is to show that instruction level parallelism (ILP) and data-level parallelis...
We have designed and implemented an asynchronous data-parallel scheduler for the SML/NJ ML compiler....
The purpose of this paper is to show that multi-threading techniques can be applied to a vector proc...
The Vector Processor is a Single-Instruction Multiple-Data (SIMD) parallel processing system based o...
On many commercial supercomputers, several vector register processors share a global highly interlea...
Memory system efficiency is crucial for any processor to achieve high performance, especially in the...
This paper presents mathematical foundations for the design of a memory controller subcomponent that...
Simultaneous multithreaded vector architectures combine the best of data-level and instruction-level...
The poor bandwidth obtained from memory when conflicts arise in the modules or in the interconnectio...
We discuss the architecture and microarchitecture of a scalable, parametric vector accelerator for t...
To manage power and memory wall effects, the HPC industry supports FPGA reconfigurable accelerators ...
This work shows that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a s...
Today’s computer systems are evolving toward lower energy consumption while maintaining high performance. The...
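The common theme across the abstracts above is applying a single instruction to many data elements at once (SIMD / data-level parallelism). As a minimal software illustration of that idea, not drawn from any of the cited papers, a vectorized operation (here using NumPy, an assumed dependency) applies one operation across a whole array rather than looping over individual memory locations:

```python
import numpy as np

# One "instruction" (elementwise add) applied to many data elements at once,
# in place of a scalar loop over individual locations.
a = np.arange(8, dtype=np.int64)        # [0, 1, ..., 7]
b = np.full(8, 10, dtype=np.int64)      # [10, 10, ..., 10]
c = a + b                               # one operation, eight lanes
print(c.tolist())  # → [10, 11, 12, 13, 14, 15, 16, 17]
```

The scalar equivalent would issue one add per element; the vectorized form expresses the same computation as a single operation over the array, which is the programming model the vector processors and memory systems surveyed here are designed to accelerate.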