In this paper we show that partitioning the data cache into separate array and scalar caches can improve the uniformity of cache accesses without remapping data, while maintaining the constant access time of a direct-mapped cache and improving the performance of L-1 cache memories. Using four central moments (mean, standard deviation, skewness, and kurtosis) of the per-set access-frequency distribution, we report on the frequency of accesses to cache sets and show that split data caches significantly mitigate the problem of non-uniform accesses to cache sets for several embedded benchmarks (from MiBench) and some SPEC benchmarks.
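As a minimal sketch of the metric described above, the following C snippet computes the four moments over a histogram of per-set access counts. The set count (NUM_SETS) and the access counts in main are hypothetical values chosen only to show how a skewed set-access distribution surfaces in the statistics; a real L1 cache would have far more sets, with counts gathered from a cache simulator.

```c
#include <math.h>
#include <stdio.h>

#define NUM_SETS 8  /* illustrative; real L1 caches have many more sets */

/* Compute mean, standard deviation, skewness, and kurtosis of the
   per-set access-frequency distribution. A uniform distribution has
   skewness near 0; large skewness or kurtosis indicates that a few
   hot sets absorb most of the accesses. */
static void set_access_moments(const double counts[], int n,
                               double *mean, double *stddev,
                               double *skew, double *kurt)
{
    double m = 0.0;
    for (int i = 0; i < n; i++)
        m += counts[i];
    m /= n;

    double m2 = 0.0, m3 = 0.0, m4 = 0.0;  /* central moments */
    for (int i = 0; i < n; i++) {
        double d = counts[i] - m;
        m2 += d * d;
        m3 += d * d * d;
        m4 += d * d * d * d;
    }
    m2 /= n; m3 /= n; m4 /= n;

    *mean   = m;
    *stddev = sqrt(m2);
    *skew   = (m2 > 0.0) ? m3 / pow(m2, 1.5) : 0.0;
    *kurt   = (m2 > 0.0) ? m4 / (m2 * m2)    : 0.0;
}

int main(void)
{
    /* Hypothetical per-set access counts for a direct-mapped cache:
       two hot sets dominate, which the moments make visible. */
    double counts[NUM_SETS] = { 900, 850, 40, 30, 25, 20, 70, 65 };

    double mean, sd, sk, ku;
    set_access_moments(counts, NUM_SETS, &mean, &sd, &sk, &ku);
    printf("mean=%.1f stddev=%.1f skewness=%.2f kurtosis=%.2f\n",
           mean, sd, sk, ku);
    return 0;
}
```

In a split configuration, the same moments would be recomputed separately over the array-cache and scalar-cache partitions; a drop in skewness and kurtosis relative to the unified cache is what indicates that the split has made set accesses more uniform.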