CS 594 - Applications of Parallel Computing

Assignment 2

Due February 16th, 2000

 

Part 1:

Implement, in Fortran or C, the six different ways to perform matrix multiplication by interchanging the loops. (Use 64-bit arithmetic.) Make each implementation a subroutine, like:

 

subroutine ijk ( a, m, n, lda, b, k, ldb, c, ldc )

subroutine ikj ( a, m, n, lda, b, k, ldb, c, ldc )

          ...
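As an illustration of what interchanging the loops means, here is a minimal C sketch of two of the six orderings. It assumes column-major storage (as in Fortran) and reads the argument list as: a is m x n, b is n x k, c is m x k. The names mm_ijk, mm_ikj, the IDX macro, and the summation index p are choices made for this sketch, not part of the assignment.

```c
/* Column-major indexing, as in Fortran: element (i,j) of a matrix with
   leading dimension ld is stored at x[i + j*ld] (0-based in C). */
#define IDX(i, j, ld) ((i) + (j) * (ld))

/* ijk ordering: the innermost loop forms the dot product of row i of a
   with column j of b.  Assumed shapes: a is m x n, b is n x k, c is m x k.
   The routine accumulates into c, so c must be initialized by the caller. */
void mm_ijk(const double *a, int m, int n, int lda,
            const double *b, int k, int ldb,
            double *c, int ldc)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < k; j++)
            for (int p = 0; p < n; p++)
                c[IDX(i, j, ldc)] += a[IDX(i, p, lda)] * b[IDX(p, j, ldb)];
}

/* ikj ordering: same arithmetic, but the innermost loop now runs along
   row i of c and row p of b, so the memory access pattern is different. */
void mm_ikj(const double *a, int m, int n, int lda,
            const double *b, int k, int ldb,
            double *c, int ldc)
{
    for (int i = 0; i < m; i++)
        for (int p = 0; p < n; p++) {
            double aip = a[IDX(i, p, lda)];   /* invariant in the inner loop */
            for (int j = 0; j < k; j++)
                c[IDX(i, j, ldc)] += aip * b[IDX(p, j, ldb)];
        }
}
```

Both routines compute the same result; only the order in which memory is touched changes, which is what the timings are meant to expose. The remaining four orderings follow by permuting the three for statements in the same way.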

 

Construct a driver program that generates random matrices and calls each matrix multiply routine with square matrices of orders 50, 100, 150, 200, …, 500, timing each call and computing the Mflop/s rate.
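The pieces of such a driver might look like the following C sketch. It assumes the usual operation count of 2n^3 for an n x n multiply (n^3 multiplications and n^3 additions) and uses the standard C clock() function for CPU time. The helper names mflops_rate, fill_random, time_one, and the mm_fn typedef are hypothetical and exist only for this example.

```c
#include <stdlib.h>
#include <time.h>

/* Mflop/s for one n x n multiply that took the given number of seconds,
   counting n^3 multiplications plus n^3 additions. */
double mflops_rate(int n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / (seconds * 1.0e6);
}

/* Fill an n x n column-major matrix with pseudo-random values in [0, 1]. */
void fill_random(double *x, int n, int ld)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            x[i + j * ld] = (double)rand() / (double)RAND_MAX;
}

/* Signature shared by the multiply routines (typedef assumed for this sketch). */
typedef void (*mm_fn)(const double *a, int m, int n, int lda,
                      const double *b, int k, int ldb,
                      double *c, int ldc);

/* Time a single call of mm on n x n matrices; returns CPU seconds. */
double time_one(mm_fn mm, int n, const double *a, const double *b, double *c)
{
    clock_t t0 = clock();
    mm(a, n, n, n, b, n, n, c, n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

A driver would loop over n = 50, 100, …, 500, allocate and fill a and b, zero c, and pass each routine to time_one. For the smaller orders a single call may finish faster than the clock resolution, so it is probably worth repeating the call several times and dividing the total time by the repeat count.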

 

Run your program on at least one RISC based architecture. A few you can try are:

 

power3.cs.utk.edu

IBM RS/6000, Fortran: xlf -O3 -lessl

torc1.cs.utk.edu

Intel Pentium II 550 MHz:

nala.cs.utk.edu

SUN Ultra 2, 200 MHz, max rate: 400 Mflop/s, cache size:

 

Use the highest level of optimization. Include in your timing routine a call to the following system-supplied routine:

     

call dgemm('No', 'No', n, n, n, 1.0d0, a, lda, b, ldb, 1.0d0, c, ldc)

 

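(This routine is provided in the ESSL library on the IBM for computing matrix multiplication. On machines without ESSL, use ATLAS; see http://www.netlib.org/atlas/.) Because the second scalar argument (beta) is 1.0d0, dgemm adds the product a*b to the existing contents of c, so initialize c before the call.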

Write up a description of the timings and explain why the routines perform as they do.

 

Part 2:

The goal is to optimize matrix multiplication on these machines.

 

 

You can call the routine papi_timer(t1,t2) to collect the execution time (t1 is the process time in seconds).

There is also a routine papi_flops(f1,f2), where f2 is the total number of floating point operations. All of these quantities are running counters, so you need to subtract the values returned by successive calls to get the time (or operation count) between calls.
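The subtraction of successive readings can be sketched in C as follows. This example does not call the PAPI routines themselves, since their exact interface is not given here. The counters struct and both helper functions are hypothetical, and the snapshots are assumed to hold the t1 value from papi_timer and the f2 value from papi_flops.

```c
/* Snapshot of the running counters (hypothetical struct for this sketch,
   not part of the library). */
typedef struct {
    double seconds;  /* running process time, e.g. t1 from papi_timer  */
    double flops;    /* running operation total, e.g. f2 from papi_flops */
} counters;

/* Difference between two snapshots: the cost of the code executed
   between the "before" reading and the "after" reading. */
counters counters_diff(counters before, counters after)
{
    counters d;
    d.seconds = after.seconds - before.seconds;
    d.flops   = after.flops   - before.flops;
    return d;
}

/* Mflop/s over the interval; returns 0 if the elapsed time is not positive. */
double interval_mflops(counters d)
{
    return d.seconds > 0.0 ? d.flops / (d.seconds * 1.0e6) : 0.0;
}
```

In use, one snapshot would be taken immediately before the matrix multiply call and one immediately after, and only the difference between them would be reported.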

Point to ~garner/pub/i386-linux/libperfometer.a -lpapi -DPERFOMETER

Also include the statement call perfometer_nogui(1) to initialize things.