2015 International Conference on Parallel Processing

Wang, F
Jiang, H
Zhuo, K
Xue, J
Yang, C

Publication date

September 2015

DOI

10.1109/ICPP.2015.29

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Abstract

This paper presents the design and implementation of a highly efficient Double-precision General Matrix Multiplication (DGEMM) based on OpenBLAS for 64-bit ARMv8 eight-core processors. We adopt a theory-guided approach: we first develop a performance model for this architecture and then use it to guide our exploration. The key enabler for a highly efficient DGEMM is a highly optimized inner kernel, GEBP, developed in assembly language. We have obtained GEBP by (1) maximizing its compute-to-memory access ratios across all levels of the memory hierarchy in the ARMv8 architecture, with its performance-critical block sizes determined analytically, and (2) optimizing its computations through exploiting loop unrolling, instruction scheduli...
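
For readers unfamiliar with the blocking structure the abstract refers to, the following is a minimal Python sketch of a cache-blocked matrix multiplication built around a block-times-panel (GEBP-style) inner routine. It is an illustration only, not the authors' implementation: the paper's kernel is written in ARMv8 assembly and its block sizes are derived analytically, whereas the function names (`gebp`, `blocked_dgemm`) and the block sizes `mc` and `kc` used here are hypothetical placeholders chosen for readability.

```python
def gebp(A, B, C, i0, i1, p0, p1):
    """Block-times-panel update: C[i0:i1, :] += A[i0:i1, p0:p1] * B[p0:p1, :].

    In an optimized library, the A block would be sized to stay in cache
    while the B panel streams through it; here the loops are plain Python.
    """
    n = len(B[0])
    for i in range(i0, i1):
        a_row = A[i]
        c_row = C[i]
        for p in range(p0, p1):
            a_ip = a_row[p]          # reuse one element of A across a row of B
            b_row = B[p]
            for j in range(n):
                c_row[j] += a_ip * b_row[j]


def blocked_dgemm(A, B, mc=2, kc=2):
    """Compute C = A * B by tiling A into mc-by-kc blocks.

    mc and kc are illustrative placeholders (assumptions), not the
    analytically determined block sizes described in the paper.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p0 in range(0, k, kc):           # partition the shared dimension
        p1 = min(p0 + kc, k)
        for i0 in range(0, m, mc):       # partition the rows of A and C
            i1 = min(i0 + mc, m)
            gebp(A, B, C, i0, i1, p0, p1)
    return C


def naive_dgemm(A, B):
    """Reference triple-loop product used to check the blocked version."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]
```

Calling `blocked_dgemm(A, B)` on small list-of-lists matrices should agree with `naive_dgemm(A, B)`. The point of the tiling is that each A block is reused against a whole panel of B before being replaced, which raises the ratio of arithmetic to memory traffic; the sketch shows only that loop structure and makes no claim about performance.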