PARA'04 State-of-the-Art
in Scientific Computing
June 20-23, 2004 (Home page)

Updated: 3 May 2004

Matrix Multiplication Kernels: Synergy Between Theory and Practice Leads to Superior Performance

J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. Van De Geijn
IBM and others, USA

Abstract:

During the last half-decade, a number of projects have developed software for generating automatically tuned matrix-matrix multiplication algorithms, including the PHiPAC project and the ATLAS project.

We develop a simple model of hierarchical memories and we use it to determine an optimal strategy for blocking matrices. This model predicts the form of current, state-of-the-art L1 kernels and five other, related L1 kernels. Additionally, it shows that current L1 kernels can continue to produce their high performance for operand matrices that considerably overflow the L1 cache. For a hierarchical memory with L memory levels (main memory and L-1 caches), our model reduces the number of potential matrix multiply algorithms from $6^L$ to 3. Even so, many current _GEMM implementations use only one or two algorithms. We use the shape of the matrix input operands, M, N, and K, to select one of our 3 algorithms.

The theoretical results show that depending on the shape of the matrices involved, different strategies are locally optimal. Rather than determining a blocking strategy at library generation time, the theoretical results also point us to a heuristic that allows the blocking strategy to be determined dynamically at run-time as a function of the shapes of the matrices being multiplied.  Preliminary experimental results show that the approach yields performance that is superior to that of ATLAS on Intel Pentium(TM) III and IBM POWER3 processors.

Home page


2004-05-03