The purpose of this paper is to show that decoupling techniques can greatly improve the performance of vector programs on a vector processor. Using a trace-driven approach, we simulate a selection of the Perfect Club programs and compare their execution time on a conventional vector architecture and on a decoupled vector architecture. Decoupling provides a performance advantage of more than a factor of two for realistic memory latencies, and even with an ideal memory system with no latency, there is still a speedup of as much as 50%. We also introduce a bypassing technique between the load/store queues and show that it can give an extra speedup of up to 22% while also reducing total memory traffic by an average of 20%. An important...
This paper introduces an innovative cache design for vector computers, called prime-mapped cache. By...
Register renaming and out-of-order instruction issue are now commonly used in superscalar processors...
Despite their superior performance for multimedia applications, vector processors have three limita...
The purpose of this paper is to show that using decoupling techniques in a vector processor, the per...
This paper presents a study of the impact of reducing the vector register size in a decoupled vector...
The paper presents a study of the impact of reducing the vector register size in a decoupled vector ...
This paper presents data confirming the fact that traditional vector architectures cannot reduce th...
The purpose of this paper is to show that multi-threading techniques can be applied to a vector proc...
Decoupling is an architectural organization that may tolerate long memory latencies by executing mem...
The purpose of this paper is to show that multi-threading techniques can be applied to a vector proc...
Vector processors often use a cache to exploit temporal locality and reduce memory bandwidth demands...
Vector processors often use a cache to exploit temporal locality and reduce memory bandwidth demands...
This paper presents an experimental study on cache memory designs for vector computers. We use an ex...
An architecture for high-performance scalar computation is proposed and discussed. The main feature ...