CS 594 - Applications of Parallel Computing
Assignment 2
Due February 16th, 2000
Part 1:
Implement, in Fortran or C, the six different ways to perform matrix
multiplication by interchanging the loops. (Use 64-bit arithmetic.) Make each
implementation a subroutine, like:
subroutine ijk ( a, m, n, lda, b, k, ldb, c, ldc )
subroutine ikj ( a, m, n, lda, b, k, ldb, c, ldc )
...
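As an illustration of what two of the six orderings might look like in C, here is a minimal sketch. It rests on assumptions not stated in the assignment: column-major (Fortran-style) storage with leading dimensions, A of size m x n, B of size n x k, C of size m x k, and accumulation into C (C := C + A*B). The names mm_ijk, mm_ikj and IDX are hypothetical, and the inner summation index is called p because k is already used as a dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element access with leading dimension ld (Fortran layout). */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* "ijk" ordering: rows of C, then columns of C, then the dot product.
 * Assumed shapes: A is m x n, B is n x k, C is m x k; C := C + A*B. */
void mm_ijk(const double *a, int m, int n, int lda,
            const double *b, int k, int ldb,
            double *c, int ldc)
{
    assert(lda >= m && ldb >= n && ldc >= m);
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < k; j++) {
            double sum = c[IDX(i, j, ldc)];
            for (int p = 0; p < n; p++)      /* innermost: dot product */
                sum += a[IDX(i, p, lda)] * b[IDX(p, j, ldb)];
            c[IDX(i, j, ldc)] = sum;
        }
    }
}

/* "ikj" ordering: same arithmetic, but the innermost loop now sweeps
 * across row i of C, scaled by the single element A(i,p). */
void mm_ikj(const double *a, int m, int n, int lda,
            const double *b, int k, int ldb,
            double *c, int ldc)
{
    assert(lda >= m && ldb >= n && ldc >= m);
    for (int i = 0; i < m; i++) {
        for (int p = 0; p < n; p++) {
            double aip = a[IDX(i, p, lda)];
            for (int j = 0; j < k; j++)      /* innermost: row update */
                c[IDX(i, j, ldc)] += aip * b[IDX(p, j, ldb)];
        }
    }
}
```

Both routines do the same arithmetic; only the memory access pattern of the innermost loop differs, which is what the timings are meant to expose. The remaining four orderings follow by permuting the three loops in the same way.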
Construct a driver program that generates random matrices and calls each matrix
multiply routine with square matrices of orders 50, 100, 150, 200, …, 500,
timing the calls and computing the Mflop/s rate.
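The helpers below sketch the pieces such a driver might need, assuming the usual operation count of 2n^3 floating point operations for an n x n matrix multiply (n^3 multiplies and n^3 adds) and using the standard C clock() as the timer. The function names are hypothetical and not part of the assignment.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Fill an n x n matrix (stored contiguously) with values in [0, 1]. */
void fill_random(double *x, int n)
{
    for (long i = 0; i < (long)n * n; i++)
        x[i] = (double)rand() / (double)RAND_MAX;
}

/* CPU time used so far by this process, in seconds (standard C clock()). */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Mflop/s for one n x n multiply that took `seconds`:
 * 2*n^3 operations, divided by the time, scaled to millions. */
double mflops_rate(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1.0e6;
}
```

A driver would loop over n = 50, 100, ..., 500, fill the matrices, read cpu_seconds() before and after each routine, and pass the difference to mflops_rate. For the smaller orders a single call may finish faster than the timer can resolve, so repeating the call several times and dividing the elapsed time by the repeat count is one way to get a usable measurement.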
Run your program on at least one RISC-based architecture. A few you can try are:
| power3.cs.utk.edu | IBM RS/6000, Fortran: xlf -O3 -lessl |
| torc1.cs.utk.edu | Intel Pentium II 550 MHz: |
| nala.cs.utk.edu | SUN Ultra 2, 200 MHz, max rate: 400 Mflop/s, cache size: |
Use the highest level of optimization. Include in your timing routine a call to
the following system-supplied routine:
call dgemm('No', 'No', n, n, n, 1.0d0, a, lda, b, ldb, 1.0d0, c, ldc )
(This is a routine provided in the ESSL library on the IBM for computing matrix
multiply. Use ATLAS; see http://www.netlib.org/atlas/.)
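In the standard BLAS interface, dgemm with both transpose arguments set to "no transpose" computes C := alpha*A*B + beta*C, so the call above, with alpha and beta both 1.0d0, adds the product to whatever is already in c. The following plain C reference is a sketch of those semantics for square, column-major matrices. It can be used to check the hand-written routines against each other. It does not call ESSL or ATLAS, and the names ref_gemm_nn and max_abs_diff are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference for the non-transposed case on n x n column-major matrices:
 * C := alpha*A*B + beta*C. Written for clarity, not speed. */
void ref_gemm_nn(int n, double alpha, const double *a, int lda,
                 const double *b, int ldb, double beta,
                 double *c, int ldc)
{
    assert(lda >= n && ldb >= n && ldc >= n);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int p = 0; p < n; p++)
                sum += a[(size_t)p * lda + i] * b[(size_t)j * ldb + p];
            c[(size_t)j * ldc + i] = alpha * sum + beta * c[(size_t)j * ldc + i];
        }
    }
}

/* Largest absolute element-wise difference between two n x n matrices
 * that share the same leading dimension ld. */
double max_abs_diff(const double *x, const double *y, int n, int ld)
{
    double worst = 0.0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            double d = x[(size_t)j * ld + i] - y[(size_t)j * ld + i];
            if (d < 0.0)
                d = -d;
            if (d > worst)
                worst = d;
        }
    }
    return worst;
}
```

Because beta is 1.0 in the call above, c should be reset to a known state (for example, all zeros) before each timed call if the results of different routines are to be compared.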
Write up a description of the timings and explain why the routines perform as
they do.
Part 2:
The goal is to optimize matrix multiplication on these machines.
You can call the routine papi_timer(t1,t2) to collect the execution time (t1 is
the process time in seconds). There is also a routine papi_flops(f1,f2), where
f2 is the total number of floating point operations. All of these quantities
are running counters, so you need to subtract the values returned by successive
calls to get the time (or operation count) between calls.
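The subtraction pattern can be sketched as follows. This example does not call papi_timer or papi_flops; it assumes only that each snapshot holds a cumulative process time in seconds and a cumulative floating point operation count, as described above. The struct and function names are hypothetical.

```c
#include <assert.h>

/* One reading of the running counters (hypothetical container):
 * cumulative process time and cumulative floating point operations. */
typedef struct {
    double seconds;
    double flops;
} counters;

/* Work done between two readings: later snapshot minus earlier one. */
counters counters_delta(counters before, counters after)
{
    counters d;
    d.seconds = after.seconds - before.seconds;
    d.flops = after.flops - before.flops;
    return d;
}

/* Mflop/s achieved between the two readings. */
double delta_mflops(counters before, counters after)
{
    counters d = counters_delta(before, after);
    assert(d.seconds > 0.0);
    return d.flops / d.seconds / 1.0e6;
}
```

In use, one snapshot would be filled from the counter routines immediately before the matrix multiply call and a second immediately after, so that only the work of the multiply itself lands in the difference.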
Point your build to ~garner/pub/i386-linux/libperfometer.a -lpapi -DPERFOMETER
and also include a call to perfometer_nogui(1) to initialize things.