Abstract—Bulk memory copying and initialization are among the most ubiquitous operations performed in current computer systems by both user applications and operating systems. While many current systems rely on a loop of loads and stores, there are proposals to introduce a single instruction to perform bulk memory copying. While such an instruction can improve performance by generating fewer TLB and cache accesses and requiring fewer pipeline resources, in this paper we show that the key to significantly improving performance is removing the pipeline and cache bottlenecks in the code that follows the instruction. We show that the bottlenecks arise due to (1) the pipeline clogged by the copying instruction, (2) lengthened critical path...
Abstract—As the performance gap between processors and main memory continues to widen, increasingly ...
Block memory operations are frequently performed by the operating system and consume an increasing f...
Current microprocessors incorporate techniques to exploit instruction-level parallelism...
Many programs initialize or copy large amounts of memory data. Initialization and copying are for...
This dissertation presents a hardware accelerator that is able to accelerate large (including non-pa...
Memory copies for bulk data transport incur large overheads due to CPU stalling, small register-size...
In this paper, we present a new architecture of the cache-based memory copy hardware accelerator in ...
An ideal high performance computer includes a fast processor and a multi-million byte memory of comp...
The growing complexity of modern computer architectures increasingly compli...
This paper presents a Least Popularly Used buffer cache algorithm to exploit both temporal locality ...
The memory system remains a major performance bottleneck in modern and future architectures. In this...
Memory (cache, DRAM, and disk) is in charge of providing data and instructions to a computer's pr...
Numerical applications frequently contain nested loop structures that process large arrays of data. ...
The gap between CPU and main memory speeds has long been a performance bottleneck. As we move toward...
Execution efficiency of memory instructions remains critically important. To this end, a plethora of...