R. Clint Whaley
rwhaley@cs.utk.edu
http://www.cs.utk.edu/~rwhaley
Jack J. Dongarra
dongarra@cs.utk.edu
http://www.netlib.org/utk/people/JackDongarra/
- Abstract:
-
This paper describes an approach for the automatic generation
and optimization of numerical software
for processors with deep memory hierarchies
and pipelined functional units. The production of
such software for machines ranging from desktop
workstations to embedded processors can be a tedious and time consuming process.
The work described here can help in automating much of this process.
We will concentrate our efforts on the widely used
linear algebra kernels called
the Basic Linear Algebra Subroutines (BLAS).
In particular, the work presented here is for general matrix multiply, DGEMM.
However much of the
technology and approach
developed here can
be applied to the other Level 3 BLAS
and the general strategy can have
an impact on basic linear algebra operations in general
and may be extended to other important kernel operations.
- Keywords:
- BLAS, high performance, tuning, optimization, linear algebra, code generation
Contents
-
Contents
-
List of Tables
-
List of Figures
-
Motivation
-
ATLAS
-
Results
-
Comparison to Other Work
-
Downloading ATLAS
-
Future Work
-
Conclusions
-
References
-
BLAS and compiler details
List of Tables
- System Summary
- Cache flushing with large matrices
- Cache flushing with small matrices
- Theoretical and observed peak MFLOPS
- System and ATLAS DGEMM comparison across platforms (MFLOPS)
- 500x500 System and GEMM-based BLAS comparison across platforms (MFLOPS)
- Double precision LU timings on various platforms (MFLOPS)
- BLAS library and version
- Compiler and version
- Compiler flags
List of Figures
- ATLAS/vendor performance preview, 500x500 DGEMM (MFLOPS)
- One step of matrix-matrix multiply
- General matrix multiplication with A as innermost matrix
- General matrix multiplication with B as innermost matrix
Motivation
Today's microprocessors have peak
execution rates exceeding
1 Gflop/s.
However, straightforward implementation in Fortran or C
of computations
based on simple loops rarely results in such high performance.
To realize such peak rates of execution
for even
the simplest of operations
has required tedious, hand coded, programming
efforts.
Since their inception, the use of de facto standards like
the BLAS [5,4]
has been a means of achieving portability
and efficiency for a wide range of kernel scientific
computations.
While these BLAS are used heavily in linear algebra computations,
such as solving dense systems of equations, they have also found their way into the
basic computing infrastructure of many applications.
The BLAS (Basic Linear Algebra Subprograms) are high quality
``building block'' routines for performing basic
vector and matrix operations. Level 1 BLAS do vector-vector operations,
Level 2 BLAS do matrix-vector
operations, and Level 3 BLAS do matrix-matrix operations.
Because the BLAS are efficient, portable, and
widely available, they are commonly used in the development
of high quality linear algebra software, such as LAPACK [1]
and ScaLAPACK [2], for example.
The BLAS themselves are just a standard or specification of the
semantics and syntax for the operations.
There is a set of reference implementations written in Fortran,
but no attempt was made
with these reference implementations to promote efficiency.
Many vendors provide an ``optimized'' implementation of the BLAS for
a specific machine architecture.
These optimized BLAS libraries are provided by the computer vendor
or by an independent software vendor (ISV).
In general,
the existing BLAS have proven to be very effective in
assisting portable, efficient software for sequential,
vector and shared memory high-performance computers.
However, hand-optimized BLAS are expensive and tedious to
produce for any particular architecture, and in general will only be
created when there is a large enough market, which is not true
for all platforms.
The process of generating an optimized set of BLAS for a new architecture or
a slightly different machine version can be a time consuming
process. The programmer must understand the architecture,
how the memory hierarchy can be used to provide data in an optimum fashion,
how the functional units and registers can be manipulated to generate
the correct operands at the correct time, and how best to use
the compiler optimization.
Care must be taken to optimize the operations to
account for many parameters such as
blocking factors, loop unrolling depths,
software pipelining strategies,
loop ordering, register allocations,
and instruction scheduling.
Many computer vendors have invested considerable resources
in producing optimized BLAS for their
architectures.
In many cases near optimum
performance can be achieved for some operations.
However the coverage and the level of performance
achieved has not been uniform across all platforms.
An example is that up until this point
we have not had an efficient
version of matrix multiply for the Pentium/Linux
architecture.
Our goal is to develop a methodology for the
automatic generation of highly efficient
basic linear algebra routines for
today's microprocessors. The process will work on processors that
have an on-chip cache and a reasonable C compiler.
Our approach,
called Automatically Tuned Linear Algebra Software (ATLAS),
has been able to match or exceed the performance of the
vendor supplied version of matrix multiply in almost every case.
More complete timings will be given in Section 3,
where we report on the timings of various problem sizes across
multiple architectures. As a preview of this more complete coverage,
figure 1 shows the performance of ATLAS versus the
vendor-supplied DGEMM (where available) for a 500x500 matrix multiply.
See section 3 for further details on these results.
ATLAS
We have developed a general methodology
for the generation of the Level 3 BLAS and describe here
how this approach is carried out and some of the preliminary results
we have achieved.
At the moment,
the operation we are supporting is matrix multiply. We can describe
matrix multiply as
, where op(X) = X or XT.
C is an
matrix, and A and B are matrices of size
and
,respectively.
In general, the arrays A, B, and C will be too large to fit into
cache. Using a block-partitioned algorithm for matrix multiply it is still
possible to arrange for the operations to be performed with data for the
most part in cache by dividing the matrix into blocks. For additional
details see [6].
The ATLAS Approach
In our approach, we have isolated the machine-specific features of
the operation to several routines, all of which deal with performing
an optimized on-chip, cache contained, (i.e., in Level 1 (L1) cache) matrix multiply. This
section of code is automatically created by a code generator which
uses timings to determine the correct blocking and loop unrolling factors
to perform an optimized on-chip multiply. The user may directly supply
the code generator with as much detail as desired (i.e., the user
may explicitly indicate the L1 cache size, the blocking factor(s) to
try, etc); if such details are not provided, the generator will
determine appropriate settings via timings.
The rest of the code does not change across architectures, and handles
the looping, blocking, and so on necessary to build the complete
matrix-matrix multiply from the on-chip multiply.
Building the General Matrix Multiply From the On-Chip Multiply
In this section we describe the code which remains the same across
all platforms: the routines necessary to build a general matrix-matrix
multiply using a fixed-size on-chip multiply.
The following section (section 2.3) describes the
on-chip multiply and its code generator
in detail. For this section, it is enough to know that we have
efficient on-chip matrix-matrix multiplies of the forms
,
,
,
and
,
This multiply is of fixed size, with all
dimensions set to a system-specific value, NB (i.e, M = N = K = NB).
Also available are several ``cleanup'' codes, which handle the cases
caused by dimensions which are not multiples of the blocking factor.
When the user calls our _GEMM, the first decision is whether the problem
is large enough to benefit from our special techniques. Our algorithm
requires copying of the operand matrices; if the
problem is small enough, this O(N2) cost, along with miscellaneous
overheads such as function calls and multiple layers of looping, can
actually make the ``optimized'' GEMM slower than the traditional
3 do loops. The size required for the O(N3) costs to dominate
these lower order terms varies across machines, and so
this switch point is automatically determined at installation time.
For these very small problems, a standard 3-loop multiply with some simple
loop unrolling is called. This code will also be called if the
algorithm is unable to allocate enough space to do the blocking
(see below for further details).
Assuming the matrix is large enough, there are presently two
algorithms for performing the general, off-chip multiply. The two
algorithms correspond to different orderings of the loops; i.e., is the outer
loop over M (over the rows of A), and thus the second loop
is over N (over the columns of B), or is this order reversed.
The dimension common to A and B (i.e., the K loop) is currently
always the innermost loop.
Let us define the input matrix looped over by the outer loop as the outer or
outermost matrix; the other input matrix will therefore be the inner or
innermost matrix. Both algorithms have the option of writing the result
of the on-chip multiply directly to the matrix, or to an output temporary
. The advantages to writing to
rather than C are:
- 1.
- address alignment may be controlled (i.e., we can ensure during the malloc
that we begin on a cache-line boundary)
- 2.
- Data is contiguous, eliminating possibility of unnecessary cache-thrashing due
to ill-chosen leading dimension (assuming we have a non-write-through cache)
The disadvantage of using
is that an additional write to C is
required after the on-chip operations have completed. This cost is minimal if we
make many calls to the on-chip multiply (each of which writes to either C or
), but can add significantly to the overhead when this is not the case.
In particular, an important application of matrix multiply is the rank-K update,
where the write to the output matrix C can be a significant portion of the cost
of the algorithm. Writing to
essentially doubles the write cost, which
is unacceptable. The routines therefore employ a heuristic to determine if the
number of times the on-chip multiply will be called in the K loop is large enough
to justify using
, otherwise the answer is written directly to C.
Regardless of which matrix is outermost, the algorithms try to allocate enough
space to store
output temporary,
(if needed),
1 panel of the outermost matrix, and the entire inner matrix. If
this fails, the algorithms attempt to allocate enough
space to hold
, and 1 panel from both A and B.
The minimum workspace required by these routines is therefore
2 K NB, if writing directly to C, and NB2 + 2 K NB if not.
If this amount of workspace
cannot be allocated, the previously mentioned small case code is
called instead.
If there is enough space to copy the entire innermost matrix, we see several
benefits to doing so:
- Each matrix is copied only one time
- If all of the workspaces fit into L2 cache, we get complete L2 reuse on
the innermost matrix
- Data copying is limited to the outermost loop, protecting the inner
loops from unneeded cache thrashing
Of course, even if the allocation succeeds, using too much memory might result
in unneeded swapping. Therefore, the user can set a maximal amount of workspace
that ATLAS is allowed to have, and ATLAS will not try to copy the innermost matrix
if this maximum workspace requirement is exceeded.
If enough space for a copy of the entire innermost matrix is not allocated,
the innermost matrix will be entirely copied for each panel of the outermost
matrix (i.e., if A is our outermost matrix, we will copy B
times).
Further, our usable L2 cache is reduced (the copy of a panel of
the innermost matrix will take up twice the panel's size in L2 cache;
the same is true of the outermost panel copy, but that will only be seen
the first time through the secondary loop).
Regardless of which looping structure or allocation procedure used, the
inner loop is always along K. Therefore, the operation done in the
inner loop by both routines is the same, and it is shown in
figure 2.
If we are writing to
, the following actions are performed in order to
calculate the
block Ci,j, where i and j are in
the range
,
:
- 1.
- Call on-chip multiply of the form
to multiply block of the
row panel i of A with block of the column panel j of B.
- 2.
- Call on-chip multiply of form
to multiply block k of the row panel i of
A with block k of the column panel j of B,
. The on-chip multiply is
performing the operation
, so as expected
this results in multiplying the row panel of A with the column panel
of B.
- 3.
now holds the product of the row panel of A with the column
panel of B, so we now perform the block operation
.
If we are writing directly to C, this action becomes:
- 1.
- Call on-chip multiply of the correct form based on user-defined
(eg. if
, use
) to multiply block of the
row panel i of A with block of the column panel j of B.
- 2.
- Call on-chip multiply of form
to multiply block k of the row panel i of
A with block k of the column panel j of B,
.
Building on this inner loop, we have the loop orderings giving us our two
algorithms for off-chip matrix multiplication. Figures 3 and
4
give the pseudo-code for these two algorithms, assuming we are writing directly
to C. We simplify this code by not
showing the cleanup code necessary for cases where dimensions do not evenly
divide NB. The matrix copies are shown as if coming from the notranspose,
notranspose case. If they do not, only the array access on the copy changes.
Choosing the Correct Looping Structure
When the call to the matrix multiply is made, the routine must decide which
loop structure to call (i.e., which matrix to put as outermost). If the
matrices are of different size, L2 cache reuse can be encouraged by deciding
the looping structure based on the following criteria:
- If either matrix will fit completely into L2 cache, put it as
the innermost matrix (we get L2 cache reuse on the entire inner matrix)
- If neither matrix fits completely into L2 cache, put the one with
the largest panel that will fit into L2 cache as the outermost
matrix (we get L2 cache reuse on the panel of the outer matrix)
By default, present code does no explicit L2
blocking (the size of the L2 cache is not known
anywhere in the code), and so these criteria are not presently used for this
selection. Rather, if one matrix must be accessed by row-panels during
the copy (for instance, the matrix A when TRANSA='N'), that matrix
will be put where it can be copied most efficiently.
This means that if we have enough
workspace to copy it up front, it will be accessed column-wise by putting
it as the innermost loop and copying the entire matrix; otherwise it will
be placed as the outermost loop, where the cost of copying the row-panel
is a lower order term. If both matrices have the same access patterns,
B will be made the outermost matrix, so that C is accessed by columns.
Blocking for Higher Levels of Cache
Note that we define the Level 1 (L1) cache as the ``lowest'' level of cache:
the one closest to the processor. Subsequent levels are ``higher'': further
from the processor and thus usually larger and slower. Typically,
L1 caches are relatively small (eg., 8-32KB), employ least recently used replacement
policies, and are often non-associative and
write-through. Higher levels of cache or more often non-write-through, with
varying degrees of associativity and differing replacement polices.
Because of the wide variance in high level cache behaviors,
our present cache detection algorithm is not sophisticated enough to reliably
detect multiple levels of cache. Therefore, by default the only cache size
that is known to ATLAS is that of the L1 cache. This means that we cannot
automatically determine when it is appropriate to perform blocking for these
higher levels of cache, as we do for L1.
At installation time, the installer may supply ATLAS the cache size of one
of the higher level caches to ATLAS (this quantity is called CacheEdge),
and ATLAS will then perform explicit cache blocking, when appropriate.
Explicit cache blocking is for a the selected level of cache is only
required when the cache size is insufficient to
hold the two input panels and the
piece of C. This means
that users will have optimal results for many problem sizes without setting
CacheEdge. This is expressed formally below;
Notice that conditions 1 and 2 below do not require explicit cache blocking,
so the user gets this result even if CacheEdge is not set.
The explicit cache blocking strategy discussed in 4 below
assumes that we have a case where the panels of A and B overflow
a particular level of cache. In this case, we can easily partition the K
dimension of
the input matrices so that the panels of the partitioned matrices Ap and Bp
will fit into the cache. This means that we
get cache reuse on the input matrices, at the cost of writing C additional
times.
More concretely, ATLAS allows an installer to supply the value
CacheEdge. This value is set to the size of the level of cache
the user wants explicit blocking support for. It is easily shown that
the footprint of the algorithm computing a
section of C
in cache is roughly 2 K NB + NB2, where 2 K NB stores the panels
from A and B, and the section of C is of size NB2. If we set
the above expression equal to the size of the cache, and solve for K,
we have the maximal K (call this quantity Km)
which will, assuming we copy all of the inner matrix up front, allow us to
reuse the outer matrix panel N / NB times. We now translate our original
matrix multiply into
rank-Km updates.
Assuming that matrix A is the innermost matrix,
and we are discussing cache level L, of size SL, and that
main memory is classified as a level of ``cache'' greater than L,
there are four possible states (depending on cache and problem size, and whether
CacheEdge is set) which ATLAS may be in. These states and their
associated memory access costs are (we ignore the reads of C for simplicity):
- 1.
- If the entire inner matrix, a panel of the outer matrix, and the
section of C fits into the cache
(eg.
)
- K (M + N) reads from higher level(s) cache
-
writes to first level of non-write-through cache;
higher levels of cache receive only the final M N writes
- 2.
- If the cache cannot satisfy the memory requirements of 1, it may still
be large enough to accommodate the two active input panels, along with
the relevant section of C
(eg., (
AND we copy entire inner matrix)
OR (
AND we copy a panel of the inner matrix
in the inner loop, thus doubling the inner panel's footprint in the cache))
-
reads from higher level(s) of cache
-
writes to first level of non-write-through cache;
higher levels of cache receive only the final M N writes
- 3.
- If the cache is too small for either of the previous cases to hold true,
(eg., 2 K NB + NB2 > SL) and we do not set CacheEdge,
and thus do no explicit level L blocking, the memory access becomes:
-
reads from higher level(s) of cache
-
writes to first level of non-write-through cache;
higher levels of cache receive only the final M N writes
- 4.
- Finally, if the first two cases do not apply
(eg., 2 K NB + NB2 > SL), but CacheEdge is set to
SL, ATLAS can perform cache blocking to change the memory access
from that given in 3 to:
-
reads from higher level(s) of cache
-
writes to first level of non-write-through cache;
higher levels of cache receive at most
writes
As mentioned above, case 4 is only used if CacheEdge has been set, and
cases 1 and 2 do not apply. It is used as an alternative to case 3. Note that,
assuming we have a non-write-through cache, we reduce one of the reads
from O(n3) to O(n2) at the cost of raising the number writes from O(n2)
to O(n3).
At first glance this may appear to be a poor bargain indeed, since writes are
generally
more expensive than reads. There are several mitigating factors that make this
blocking nonetheless worthwhile.
If the cache is write-through, 4 does not increase writes over 3, so it is a
clear win.
Second, ATLAS also does not allow Km < NB, so we know that the number of writes
is at most the same as the number of reads we saved. For many problems
, so the number of writes is much less. Finally, the writes are not
flushed immediately; This fact has two important consequences:
- 1.
- The cache can schedule the write-back during times when the algorithm
is not using the bus.
- 2.
- Writes may be written in large bursts, which significantly reduces
bus traffic; this fact alone can make writing faster than reading on
some systems.
In practice, 4 has been shown to be at least as good as 3 on all platforms.
The amount of actual speedup varies widely depending on problem size and
architecture. On some systems the speedup is negligible; on others it can
be significant: for instance, it can make up to 20% difference on DEC Alpha-
based systems (which have three layers of cache). Note that this 20%
improvement is merely the difference between cases 3 and 4, not between
ATLAS and some naive implementation, for instance.
The analysis given above may be applied to any cache level greater than 1;
it is not for level 2 caches only.
However, this analysis is accurate only for the algorithm
used by ATLAS in a particular section of code, so we cannot recur in order
to perform explicit cache blocking for arbitrary levels of cache. To put this
another way, ATLAS explicitly blocks for L1, and only one other higher level
cache. If you have 3 levels of cache, ATLAS can explicitly block for
L1 and L2, or L1 and L3, but not all three.
If ATLAS performs explicit cache blocking for level L, that does not
mean that level L+1 would be useless; depending on cache size and
replacement policy, level L+1 may still save extra read and writes
to main memory.
It should be possible to block for arbitrary levels of cache. We have not
yet experimentally shown this, however. The idea is that we now have two
specific layers of cache blocking, L1, which blocks for the on-chip multiply,
and the level for which CacheEdge is set, which essentially blocks for
the inner product of the two input panels. Levels above this must block
for the entire multiply as a whole, and therefore need to be able to to contain
all of the working portions of
A, B, and C. They may do this by partitioning N or M appropriately,
so that the operation is reduced to one which will fit into their cache size.
If the matrices must be partitioned to fit, it corresponds to a multilevel
version of the blocking strategy outlined below.
Another interesting idea for performing explicit cache blocking
is where multiple matrix panels will fit
into level L cache, but the entire innermost matrix will not. In this case
the looping mechanism must become slightly more complex. This algorithm
copies X panels of the outermost matrix into level L cache, and then
reuses them by applying them to the panels of the innermost matrix. If X
becomes as small as 1 or as large as the number of panels in the outermost
matrix, we are in a previously discussed case (case 2 and 1, respectively).
Within this restriction, we reduce the number of times we must copy A by
X.
This idea was promising enough that it was implemented. However, the
range between X of 1 and the entire matrix was typically not large
enough to make the speedup worthwhile, and so, for the sake of simplicity,
we have chosen not to support this kind of blocking. As previously mentioned,
it may be of more use when considering multiple levels of cache.
Generation of the On-Chip Multiply
As previously mentioned, the on-chip matrix-matrix multiply is the only
code which must change depending on the platform. Since we copy
the input matrices into blocked form, only one transpose case is required,
which we have chosen as
(we support differing values
of
as well, but this is merely a convenience to avoid scaling).
This case was chosen
(as opposed to, for instance
), because it
generates the largest (flops)/(cache misses) ratio possible when the loops are
written with no unrolling. Machines with hardware allowing a smaller ratio
can be addressed using loop unrolling on the M and N loops (this could
also be addressed by permuting the order of the K loop, but we do not
at present use this technique).
In a multiply designed for L1 cache reuse, one of the input matrices
is brought completely into the L1 cache, and is then reused in looping
over the rows or columns of the other input matrix. The present code brings
in the matrix A, and loops over the columns of B; this was an arbitrary
choice, and there is no theoretical reason it would be superior to bringing
in B and looping over the rows of A.
There is a common misconception that cache reuse is optimized when both input
matrices, or all three matrices, fit into L1 cache.
In fact, the only win in fitting all three matrices into L1 cache is that
it is possible, assuming the cache is not write-through, to save the cost of
pushing previously used sections of C back to higher levels of memory.
Often, however, the L1 cache is write-through, while higher levels
are not. If this is the case, there is no way to minimize the write cost,
so keeping all three matrices in L1 does not result in greater cache reuse.
Therefore, ignoring the write cost, maximal cache reuse for our case is achieved
when all of A fits into cache, with room for at least two
columns of B and 1 cache line of C. Only one column of B is actually accessed
at a time in this scenario; having enough storage for two columns assures that
the old column will be the least recently used data when the cache overflows,
thus making certain that all of A is kept in place (this obviously assumes
the cache replacement policy is least recently used).
While cache reuse can account for a great amount of the overall performance
win, it is obviously not the only factor. For the on-chip matrix
multiplication, other relevant factors are:
- Instruction cache overflow
- Floating point instruction ordering
- Loop overhead
- Exposure of possible parallelism
- The number of outstanding cache misses the hardware can handle before
execution is blocked
Instruction Cache Reuse
Instructions are cached, and it is therefore important to fit our
on-chip multiply's instructions into the L1 cache.
This means that we will not
be able to completely unroll all three loops, for instance.
Floating Point Instruction Ordering
When we discuss floating point instruction ordering in this paper, it will usually
be in reference to latency hiding.
Most modern architectures possess pipelined floating point units. This means
that the results of an operation will not be available for use until X cycles
later, where X is the number of stages in the floating point pipe
(typically 3 or 5). Remember that our on-chip matrix multiply is of the
form
; individual statements would then naturally be
some variant of C[X] += A[Y] * B[Z]. If the architecture does not possess
a fused multiply/add unit, this can cause an unnecessary execution stall. The
operation register = A[Y] * B[Z] is issued to the floating point unit,
and the add cannot be started until the result of this computation is available,
X cycles later. Since the add operation is not started until the multiply finishes,
the floating point pipe is not utilized.
The solution is to remove this dependence by separating the multiply and add,
and issuing unrelated instructions between them.
This reordering of operations
can be done in hardware (out-of-order execution) or by the compiler, but this
will sometimes generate code that is not quite as efficient as doing it
explicitly. More importantly, not all platforms have this capability (for
example, gcc on a Pentium), and in this case the performance win can be large.
Reducing Loop Overhead
The primary method of reducing loop overhead is through loop unrolling. If
it is desirable to reduce loop overhead without changing the order of
instructions, one must unroll the loop over the dimension common to A and B
(i.e., unroll the K loop). Unrolling along the other dimensions
(the M and N loops) changes the order of instructions, and thus the resulting
memory access patterns.
Exposing Parallelism
Many modern architectures have multiple floating point units. There are two
barriers to achieving perfect parallel speedup with floating point computations
in such a case. The first is a hardware limitation, and therefore out of our
hands: All of the floating point units will need to access memory, and thus, for
perfect parallel speedup, the memory fetch will usually also need to operate in
parallel.
The second prerequisite is that the compiler recognize opportunities for
parallelization, and this is amenable to software control. The fix for
this is the classical one employed in such cases, namely unrolling
the M and/or N loops, and choosing the correct register allocation
so that parallel operations are not constrained by false dependencies.
Finding the Correct Number of Cache Misses
Any operand that is not already in a register must be fetched from memory.
If that operand is not in the L1 cache, it must be fetched from further
down the memory hierarchy, possibly resulting in large delays in execution.
The number of cache misses which can be issued simultaneously without blocking
execution varies between architectures. To minimize memory costs, the
maximal number of cache misses should be issued each cycle, until all memory
is in cache or used. In theory, one can permute the matrix multiply to
ensure that this is true. In practice, this fine a level of control would
be difficult to ensure (there would be problems with overflowing the
instruction cache, and the generation of such precision instruction sequence,
for instance). So the method we use to control the cache-hit ratio is
the more classical one of M and N loop unrolling.
Putting It All Together
It is obvious that with this many interacting effects, it would be difficult,
if not impossible to predict a priori the best blocking factor, loop
unrolling etc. Our approach is to provide a code generator coupled with
a timer routine which takes in some initial information, and then tries
different strategies for loop unrolling and latency hiding and chooses
the case which demonstrated the best performance.
The timers are structured so that operations have a large granularity, leading
to fairly repeatable results even on non-dedicated machines. The user may
enter the size of the L1 cache, or have the program attempt to calculate it.
This in turn allows the routine to choose a range of blocking factors to
examine. The user
may specify the maximum number of registers to use (or use the default),
and thus dictate the maximum amount of M and/or N loop unrolling to perform.
The first step of the timing figures the size of the L1 cache, if the user has
not supplied it. This is done by performing a fixed number of memory references,
while successively reducing the amount memory addressed. The most significant
gap between timings for successive memory sizes is declared to mark the L1 cache
boundary. For speed, we check only powers of 2. This means that a 48K cache
would probably be detected as a 32K cache, for instance. We have not found this
problem severe enough problem to justify the additional installation time it
would take to remedy it.
Next, we attempt to determine something about the floating point units of the
platform. We need to understand whether we have a combined muladd unit, or
whether we need to allow for independent multiply and add pipes. To do this,
we generate simple register-to-register code which performs the required
multiply-add using a combined muladd and separate multiply and add pipes. We
try both variants using code which implies various pipeline lengths. Therefore,
once we finish timing this simple register-register code, we have a good
empirical idea of the kinds of instructions to issue (muladd or separate multiply
and add), the pipeline depth, and even some idea if we have multiple floating point
units or not. With this data in hand, we are ready to begin actual on-chip multiply
timings.
In general, K loop unrollings of 1 or K have tended to produce the best results.
Thus we time only these two K loop unrolling during our initial search. This
is done to reduce the length of install time. At the end of the install process
we try to ensure we have not left out an opportunity to gain greater performance by
trying a wide range of K loop unrolling factors with the best case code generated
for the unrollings factors of 1 or K.
At the beginning of the search a number of possible blocking factors are tried
using set amount of M and N loop unrolling (at the present, 2x2),
from which an initial
blocking factor is chosen.
With this initial blocking factor, which instructions set to use (muladd or
separate multiply and add), and a guess as to pipeline length,
the search routine loops over all M and N
loop unrollings possible with the given number of registers.
Once an optimal unrolling has been found, we again try all blocking factors,
and various latency and K-loop unrolling factors, and choose the best.
All results are stored in files, so that subsequent searches will not repeat
the same experiments, allowing searches to build on previously obtained data.
This also means that if a search is interrupted (for instance due to a machine
failure), previously run cases will not need to be re-timed.
A typical install takes from 1 to 2 hours for each precision.
Cleanup Code
After all of the above operations are done, we have a square on-chip multiply
of fixed dimension NB. Since the input matrices may not be a multiple of
NB, there is an obvious need for a way to handle the remainder.
It is possible to write the cleanup code in one routine, with 3 loops of arbitrary
dimension. Practice shows that on some platforms, this results in
unacceptably large performance drops for matrices with dimensions which
are not multiples of NB. Generating the code for all possible cleanup cases
is not difficult, but is not a usable solution in practice. This would
result in NB3 routines, which would take an unacceptable amount of
compilation time, and make the user's executable too large.
The key is to note that a majority of the time spent in cleanup code will be the
case where
only 1 dimension is not equal to NB. Therefore we generate roughly 4 NB
routines for cleanup: 3 NB routines for the cases where a given dimension
is less than NB. The remaining routines accept arbitrary M and N, but
K is known so that we can unroll the inner loop (critical for reducing
loop overhead). Thus the NB routines generated for the general case
correspond to the differing values K is allowed. These routines where more
than one dimension is less than NB will still not be as efficient as the
other routines, but the time spent in them should be negligible.
Because of the number of routines required in this cleanup strategy, we generate
only the
case (eg., the operation done by the cleanup is always
). Other
choices are accommodated by scaling.
Why Can't the Compiler Do This?
It would be ideal if the compiler where capable of
performing the optimization needed automatically.
However, compiler technology
is far from mature enough to perform these optimizations automatically.
This is true even for the BLAS on widely marketed machines which can
justify the great expense of compiler development. Adequate compilers
for less widely marketed machines are almost certain not to be developed.
Requirements for Good Performance
The approach we have taken has two requirements for achieving
good performance:
- 1.
- There is a cache from which the floating point unit can fetch
operands cheaply.
- 2.
- The platform possesses an adequate C compiler.
It might seem that the compiler would play little role in achieving performance,
when the code generator does so much of the work usually reserved for compilers
(eg., loop unrolling, latency hiding, register blocking, etc.). Some compilers
ignore the register keyword, and/or reorder the loops even though they are
already optimal. Also, if the compiler has limits on its use of the hardware
(for instance, it only emits code for a subset of the actual physical registers),
there is nothing ATLAS can do about it.
Also, for the code other than the on-chip multiply, the
compiler must do the brunt of optimization (this is a low-order cost, however).
There are two systems where we attempted an ATLAS installation, and had
unacceptable performance. The platforms, and the reason for the
performance loss is summarized below:
- Intel i860: inadequate C compiler
- Cray T3E: inadequate C compiler/lack of L3 cache
On Intel i860 we were unable to achieve a DGEMM speed exceeding 12MFLOPS. The
system supplied DGEMM executed at rates above 40Mflop. We were only able to
approach this speed by rewriting the on-chip multiply in terms of fortran77.
Since changing languages appears to have such a salutary effect, it is easy to
believe that the compiler is at fault.
Further, the case that got the best performance involved
no M or N loop unrolling, something that happens on no other platform.
This anomalous result may be due to the compiler's inability to handle the
increased complexity inherit in outer loop unrolling.
There is a gcc installation for the i860, and it would be interesting
to see if results are better with this alternative compiler.
The nodes of the SGI/CRAY T3E we had access to are
DEC Alpha 21164 RISC processors, running at 450MHz. Our top MFLOPS on this
system was on the order of 400MFLOPS, far below both the theoretical peak and
that enjoyed by the vendor-supplied BLAS. ATLAS performs well on other machines
built around this chip, but since SGI/CRAY removed the Level 3 cache supplied
by DEC, we are not able to determine if the problem is the differing compilers
or hardware.
As hinted at above, one option when faced with a poor C compiler is to try
another language. In
the future we hope to provide the option to generate the on-chip multiply
in F77. For some of the legacy platforms, this might offer a speed improvement
over coding in C.
Results
In this section we present double precision (64-bit floating point arithmetic)
timings across various
platforms. ATLAS also supplies a single precision version; The single
precision results are not markedly different than the double, and so
we omit them in the interest of brevity.
The timings presented here are different than many BLAS timings in that we
flush cache before each call, and set the leading dimensions of the arrays
to greater than the number of rows of the matrix (all timings in this section
set the leading dimension to the maximal size timed, 1000).
This means our performance numbers, even when timing the same routine
(for instance the vendor-supplied DGEMM) are lower than
those reported in other papers. However, these numbers are in general a much
better estimate of the performance a user will see in his application. We
devote a brief section to this topic.
Next, we show timings for square matrix multiply on all systems.
To demonstrate that the performance shown in these timings translates to
actual applications, we then give LU timings for various systems.
Table 1 shows the configurations of the various platforms which
we have installed and timed the package on. It is worth noting that the
machine labeled DCG533 has the fastest DGEMM of any reported system, the
second fastest LU, and cost less than $2500, complete.
Appendix A has several tables providing further details.
Table 8 shows
the system BLAS that were used for the timings. We should note that we did
not have access to HP's most optimal BLAS for the HP9K/735, and so had to
compare against their vector library (which describes itself as optimized for
the 9000 series) instead.
Similarly, Intel does not officially supply a BLAS for their chips running
Linux. However, it is possible to get a unofficial version of the library
which works under Linux. This is what we compare against in our timings below.
Tables 9 and
10 show the compiler version and flags used in compiling the
on-chip matrix multiply.
Results with Varying Timing Methods
There are numerous ways to perform timings. Perhaps the most common
method is to generate the matrices A,
B and C, and then call the appropriate matmul routine. Depending on the
matrix and cache sizes, this can make a large difference in the timings.
For medium-sized matrices, a significant portion of the matrices will remain
in cache from the matrix generation, and thus the memory costs of main memory
will not be as prevalent in the timings. For very small matrices, a significant
portion of the matrices may remain in L1 cache, and thus the timings will be
truly misleading.
Some timers will perform the same operation X times in a row, and report
the best timing obtained. This will result in even more optimistic numbers.
Obviously, if all matrices fit into some level of the cache, the timings will enjoy
cache reuse just as above. However, if only one matrix will fit into cache,
there may still be significant cache reuse. For instance,
if the off-chip multiply has A in the inner loop, and A fits entirely into
some level of cache, the performance reported will not reflect the cost of bringing
A into that level of cache.
Finally, many timers set LDA = M; in other words, they make all of their
matrices contiguous memory. This rules out problems where an ill-chosen
leading dimension causes only part of the cache to be used, for instance.
It also insures maximal cache reuse. Unfortunately, in actual applications,
it is rarely the case that DGEMM is called with the leading dimension equal to
the size of matrix (usually, DGEMM is called on submatrices of some larger array).
In all of the timings presented in this paper, a section of
memory corresponding to the size of the highest cache level is written to and read from
after the matrix generation, so that the matrices must be fetched from main
memory by the matmul. We set the leading dimension to the maximal size being
timed.
It is readily observed that the method we are using gives a lower bound
on performance, while the more commonly used method gives an upper bound.
Why then do we not also just report the upper bound? The reason is that this
upper bound will be achieved only in very particular applications (ones that
repeatedly use the same memory space, without corrupting the cache between
invocations), where the problem size is small or the cache is very large.
In short, most users will never see it, and these timings are therefore not
indicative of true performance.
Use of appropriate timings is much more important when one is basing
software decisions upon it, as our package does. In this case, timing
the matmul where things are in cache may cause less optimal code to be produced.
To give the reader a feeling for the kinds of differences the method of timing
can cause, we provide a few examples below. In these tables, method 1 is
with LDA=1000, and cache flushing before and after the call. Method 2 is
sets LDA=M, runs the problem 5 times, and chooses the best result.
Note that we use the system BLAS for these timings, so that it is clear this
is not specific to our implementation.
First, for the machines with large high level caches, table 2 shows
the standard sizes we time in the rest of the paper. As one would expect,
as the matrices get larger, caching effects play less and less of a role.
Table 3 shows the same thing for smaller sizes, where
the problem is more severe. The Pentium II timings use ATLAS, since we did
not have access to the vender BLAS under Linux.
Square Matrix Multiply
Table 4 shows the theoretical and observed peaks for matrix
multiplication. By observed peak, we mean the best repeatable timing
produced on the platform, for any problem size. Where the observed peak
differs from the best timings reported in table 5,
the difference is usually due to using a multiple of the blocking factor.
Table 5 shows the times for the vendor-supplied
dgemm and ATLAS dgemm across all platforms, with problems sizes
ranging from 100 to 1000. In these tables, the LIB column indicates which
library the timings are for:
- SYS: system or vendor-supplied GEMM
- ATL: ATLAS GEMM
Matrix Multiply Results
Preliminary Results with GEMM-Based Level 3 BLAS
We present here some preliminary results for the Superscalar GEMM-based
level 3 BLAS where ATLAS supplies the optimized GEMM. This package was
developed by Bo Kagstrom and Per Ling. We have not had the time to do
a exhaustive set of timings for these codes as yet. During the installation
of ATLAS on the several platforms detailed in this paper, we did some
preliminary timings of this package to determine whether it would supply
an adequate level 3 BLAS. We reproduce these preliminary results here.
We used the BLAS timer from netlib, and report only the 500x500
result for each Level 3 BLAS call, always taking the 'Notranspose',
'Notranspose', 'Left', 'Upper' variant of each routine. These results
are shown in table 6. For some of the systems, we timed the
reference Fortran77 BLAS available from netlib as well.
LU Timings
In order to demonstrate that these routines provide good performance in practice,
we timed LAPACK's LU factorization. For each platform, we show three LU results:
- 1.
- LU factorization time linking to ATLAS DGEMM, Superscalar Level 3
GEMM-based BLAS, and reference Fortran77 Level 1 and 2 BLAS (all free software)
- 2.
- LU factorization time linking to vendor supplied BLAS
- 3.
- LU factorization time linking only to reference Fortran77 BLAS
The blocking factor for the factorization to use was determined by timing all
blocking
factors between 1 and 64, and choosing the one that performed best for
the LU factorization of a matrix of order 500. As with previous timings,
caches were flushed before the start of the algorithm.
Table 7 shows the LU performance on several platforms. The LIB
column is overloaded to convey both the BLAS used (A for ATLAS/Superscaler,
S for system, F for Fortran77),
and the blocking factor chosen (for instance, A(40) in this column indicates
a run using ATLAS's DGEMM, using a blocking factor of 40).
Comparison to Other Work
There are other efforts to produce optimal codes through code generation.
The closest parallel to ATLAS is seen in the PHiPAC [3] effort.
PHiPAC also deals with using a code generator for BLAS work. Since PhiPAC
predates ATLAS by several years, it is natural to ask what the differences
between the packages are, and perhaps why the ATLAS project was begun.
ATLAS was started because we needed an optimized DGEMM for Pentiums running
Linux. The authors of PhiPAC reported disappointing performance for PHiPAC
on the Linux/Intel platform (this is no longer the case).
When we examined the issue of creating an efficient DGEMM for this platform,
it was readily apparent that it would require only a little more effort to
make the work portable.
If this answers the question of why ATLAS was begun in the first place, it
does not tell how it is different from PHiPAC. The main difference is in
the complexity of the approach. ATLAS puts all system-specific code in
one square on-chip multiply. It then uses the off-chip code to coerce all
problems to this format. ATLAS further counts on a level 1 cache being
accessible by the floating point unit, in order to be able to make the
simplifying step of writing the on-chip multiply. This means we need
generate/time only one routine for each new platform. This has resulted
in a code generator that finishes in a relatively short time (generally,
1-2 hours), even though
the operations being timed are artificially inflated in order to ensure
repeatability.
PHiPAC, on the other hand, chose the more comprehensive approach of
directly optimizing each individual operation. This means different
code will be generated for each transpose combination, for instance.
This results in a lengthy installation process (usually, a matter of days),
as multiple cases for every routine must be generated and timed.
Neither of these approaches are ``better'' than the other. The approach
used by PHiPAC will probably yield better performance for very small
problems (since they may avoid any unnecessary data copies), or on machines
with no L1 cache. The same methods of code
generation used in the level 3 BLAS should work pretty much unchanged for
level 1 and 2. However, the cost of this increased generality is seen
in the longer installation time, and in performance which may be more sensitive
to various factors such as poorly chosen leading dimensions (ATLAS is somewhat
shielded from such factors by its data copy), etc.
The best way to determine which of these packages a user should use is to
experiment with them in a specific application. If the user wishes to compare
raw performance as reported in the publications, it should be
mentioned that the PHiPAC timing method is not the same as used in this
paper. Current PHiPAC timings as reported in [3] use timing
method 2 discussed in section 3.1. This means that their
performance numbers do not in general include the costs of bringing operands
into cache. Section 3.1 can give the reader an idea of
the effects of this.
Downloading ATLAS
The alpha release of ATLAS can be found at
www.netlib.org/atlas. Installation instructions
are provided in the supplied README file.
Future Work
The ATLAS package presently available on netlib contains the real
version of matrix-matrix multiplication. Matrix multiplication is
the building block of all of the level 3 BLAS, however. Initial
research using the publicly available level 3 gemm-based BLAS
[7] suggests that this provides a perfectly
acceptable level 3 BLAS.
As time allows, we can avoid some of the O(N2) costs associated with using
the gemm-based BLAS by supporting the level 3 BLAS directly in ATLAS. We also
plan on providing the software for complex data types.
We have preliminary results for the most important level 2 BLAS routine
(matrix-vector multiply) as well. This is of particular importance, because
matrix vector operations, which have O(N2) operations and O(N2) data,
demand
a significantly different code generation approach than that required for
matrix-matrix operations, where the data is O(N2), but the operation count is
O(N3). Initial results suggest that ATLAS will achieve comparable success with
optimizing the level 2 BLAS as has been achieved for level 3 (this
means that our timings compared to the vendor will be comparable; obviously,
unless
your architecture supports many pipes to memory, a level 2 operation will not
be as efficient as the corresponding level 3 operation).
Another avenue of research involves sparse algorithms. The fundamental
building block of
iterative methods is the sparse matrix times dense vector multiply. This work
should leverage the present research (in particular, make use of the dense
matrix-vector multiply). The present work uses compile-time adaptation of
software. Since matrix vector multiply may be called literally thousands
of times during the course of an iterative method, we plan to investigate
run-time adaptation as well. These run-time adaptations could include
matrix dependent transformations [8], as well as specific code
generation.
Conclusions
We have demonstrated the ability to
produce highly optimized matrix multiply
for a wide range of architectures based on
a code generator
that probes and searches
the system for an optimal set of
parameters.
This avoids the tedious task of generating
by hand routines optimized for a specific
architecture.
We believe these ideas can be expanded to cover
not only the Level 3 BLAS, but Level 2 BLAS
as well. In addition there is scope for
additional operations beyond the BLAS,
such as sparse matrix vector multiplication,
and FFTs.
References
- 1
-
E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz,
A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen.
LAPACK Users' Guide (second edition).
SIAM, Philadelphia, 1995.
324 pages.
- 2
-
L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon,
J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and
R. C. Whaley.
ScaLAPACK Users' Guide.
SIAM, Philadelphia, 1997.
- 3
-
J. Bilmes, K. Asanovic, C.W. Chin, and J. Demmel.
Optimizing matrix multiply using phipac: a portable,
high-performance, ansi c coding methodology.
In Proceedings of the International Conference on
Supercomputing, Vienna, Austria, July 1997. ACM SIGARC.
- 4
-
J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling.
A set of Level 3 Basic Linear Algebra Subprograms.
ACM Trans. Math. Soft., 16(1):1-17, March 1990.
- 5
-
J. Dongarra, J. Du Croz, S. Hammarling, and Richard J. Hanson.
An Extended Set of FORTRAN Basic Linear Algebra
Subroutines.
ACM Trans. Math. Soft., 14(1):1-17, March 1988.
- 6
-
J. Dongarra, P. Mayes, and G. Radicati di Brozolo.
The IBM RISC System 6000 and linear algebra operations.
Supercomputer, 8(4):15-30, 1991.
- 7
-
B. Kågström, P. Ling, and C. Van Loan.
Portable High Performance GEMM-based Level 3 BLAS.
In R. F. Sincovec et al., editor, Parallel Processing for
Scientific Computing, pages 339-346, Philadelphia, 1993. SIAM.
- 8
-
S. Toledo.
Improving instruction-level paral