ParkBench and EuroBen Benchmarks

On the AlphaServerSC


 

Jay Patel

Computer Science
University of Tennessee
Knoxville, TN


 

June 2000

Keywords: AlphaServerSC, EuroBen, Benchmark, ParkBench, performance


ABSTRACT

In this paper we present the results of benchmark experiments carried out on an AlphaServerSC. We used two benchmarks to assess the AlphaServerSC: EuroBen and ParkBench. EuroBen primarily gives results for serial benchmark execution, but a distributed version of EuroBen gives parallel results for certain EuroBen tests. ParkBench gives results primarily for parallel computations. Relevant results are compared to results obtained on a Silicon Graphics Origin2000, a Cray T3E, and/or the IBM SP3.


1. INTRODUCTION

This project gives insight into the workings of the Colt cluster at the Oak Ridge National Laboratory. Colt is an AlphaServerSC system. Its behavior can be seen through the results of running the EuroBen and ParkBench benchmarks. The benchmarks were run between February and May 2000.

The Colt cluster is a sixteen-node AlphaServerSC system. Each node comprises four 667 MHz EV67 processors and 2 GB of memory. The EV67 processor has an 8 MB cache. On this machine, the Compaq Fortran compiler was used to compile the Fortran portions of the EuroBen and ParkBench benchmarks, while the DEC C Version 6.1 compiler was used to compile the C portions. For optimization, the Fortran benchmark code was compiled with the options -O4 -f -u, and the C benchmark code was compiled with the -O4 option as well. The parallel portions of the benchmarks use MPI to pass data across the distributed system.

This paper has the following structure: Section 2 contains a description of the hardware used for communications across the nodes. Section 3 talks briefly about the topology of the distributed network that was implemented to decrease latency. Section 4 discusses the EuroBen benchmark. Section 5 discusses the ParkBench benchmark. Section 6 discusses the results arrived at from the execution of the two benchmarks. 


2. ELAN Communications Processor

The ELAN processor is a separate processor designed to handle all communication between the nodes. It has a shared memory interface and two data links. The switch latency is 0.035 microseconds, while the MPI latency is less than 5.5 microseconds. The data links are connected by Meiko-designed 8x8 cross-point switches. The data speed is 206 MHz, which gives a bandwidth of 50 MByte/s in each direction.

 

 

Figure 1: Elan Communications Processor

 

ELAN supports remote read, write, and synchronization operations. These operations are specified by virtual processor number and virtual address. Efficiency hinges mainly on the interface between the communications processor and the rest of the network. The following features are provided:

Using a dedicated communications processor reduces start-up latency, because communication code that would normally execute on the main processor is moved to the specialized processor. The fat tree topology of the nodes provides constant, high bandwidth between nodes. Moving the lightweight, interrupt-intensive operations used in inter-process communication from the main microprocessor to the communications processor brings further significant benefits. Start-up latency is reduced in another way as well: the communication instructions sent from the main processor to the communications processor are unchecked, and the security checks are implemented on the communications processor instead.

 The start-up process consists of four parts:

 The communications-only processor checks two types of parameters: memory addresses and process addresses. The main processor initially sends unchecked virtual memory addresses to the communications processor. A memory management unit (MMU) was built into ELAN to support multiple simultaneous contexts, which allows suspended processes to continue performing I/O. The communications processor checks the process address with a simple table look-up and exception mechanism.

 The communications processor also uses table lookups in the translation component of the start-up process. The table lookup for memory address translation operates in the same way as translation on the main processor. Process translation yields two components: the processor to which the instruction is sent, and the context of the destination process.

 A technique common to most distributed systems is to copy data into a physically mapped output buffer. ELAN avoids this by using network-wide virtual addressing. Cache coherency problems are avoided because the main processor and the communications processor share a common memory bus, the SPARC MBus. Avoiding the copy greatly reduces start-up latency and makes bandwidth usage more efficient.

 A command port located in the user address space is used to control the communications processor. The command port occupies a fixed portion of the memory addresses. The command for the communications processor is taken from five bits of the address that is used, and the data for that command is taken from the 32 bits of data written to that memory address. Any exceptions generated by the communications processor are handled by the communications processor without intervention from the main processor.

 The ELAN communications processor contains a RISC processor that can execute user-level code without the assistance of the main processor. It also creates network transactions. The microcode and hardware implemented in the processor support an extremely lightweight scheduling mechanism.

 In order to transfer data to a remote processor, the main processor has to perform the following actions:

User program creates a DMA structure that identifies certain characteristics of the transfer (source and dest. addresses, data size, etc.)

User program issues a DMA command with RmW to the command port. The address of the DMA structure created in the previous step is written to the appropriate address in the command port.

The user program checks whether the command was accepted. A return value from the command port greater than or equal to 0 indicates that the command was accepted.

The actions that are taken by the communications processor are:

The command processor reads the 32-bit data from the command port and uses it to locate the DMA descriptor. This descriptor is then placed into the communications DMA queue.

 

3. Distributed Network Topology

The AlphaServer SC system uses a fat tree topology. In a fat tree, packets do not always have to travel to the top of the tree. If a problem's communication is mostly between nearby nodes, the bandwidth consumed at the higher levels of the fat tree is reduced. To take proper advantage of this, process placement becomes more important.

 

Figure 2: Fat Tree Diagram

4. The EuroBen Benchmark

EuroBen is a standard benchmark proposed by the European Benchmark Group. It is a predominantly single-user benchmark suite intended to give results on the performance of single- or multiple-CPU machines, on the impact that traffic between the CPU and memory has on performance, and on I/O throughput. The suite is composed of three modules, each containing six to eight programs. These programs range from small test codes to large application codes. Module one is composed of six programs that test the basic functions of the machine. Module two tests the machine through the execution of simple but important numerical algorithms. Module three contains simple application codes that are composed of more than one algorithm. Each of the individual programs is discussed below in a little more detail.

 

4.1 Module 1

Mod1ac

Mod1ac measures n1/2 and rinf for various basic mathematical operations, together with some triadic operations such as axpy and the inner product, all executed on vectors.

Mod1d

This program tests for memory bank conflicts and reports any that occur.

Mod1e

This program tests several intrinsic functions: power, square root, exponential, logarithm, sine, cosine, tangent, cotangent, arcsine, arccosine, arctangent, sinh, cosh, and tanh.

Mod1f

Mod1f tests the speed of the intrinsic functions tested in Mod1e. For each function, estimates of n1/2 and rinf are given.

Mod1g

Mod1g calculates the cost of memory references relative to arithmetic operations. This is accomplished by determining f1/2 through the evaluation of polynomials of degree 1 through 10.

Mod1h

Mod1h calculates the bandwidth for point-to-point communication using the ping-pong method with MPI as the message passing protocol.  

Mod1i

Mod1i is a distributed benchmark that calculates three variants of the dot product. The first variant uses plain send and receive calls to distribute the global sum to all processors. An unorganized structure is used to gather the partial sums. The second variant also uses send and receive calls, but the structure used to gather the partial sums is a tree-based structure. The third variant uses MPI_Reduce and MPI_Bcast to send and receive data as opposed to the plain send and receives that were used in the previous two variants.
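The tree-based gathering used by the second variant can be illustrated without MPI by simulating the ranks inside one process. The Python sketch below is not EuroBen code. The function names (local_dot, tree_sum, distributed_dot) and the pairing rule (a rank receives from the partner one "step" away, with the step doubling each round) are assumptions chosen for illustration.

```python
def local_dot(x, y):
    """Partial dot product computed by one simulated process."""
    return sum(a * b for a, b in zip(x, y))


def tree_sum(partials):
    """Combine partial sums pairwise along a binary tree.

    Returns (total, rounds). In each round, rank r with r % (2*step) == 0
    "receives" from rank r + step, so p ranks need ceil(log2(p)) rounds.
    """
    vals = list(partials)
    p = len(vals)
    step, rounds = 1, 0
    while step < p:
        for r in range(0, p, 2 * step):
            if r + step < p:
                vals[r] += vals[r + step]  # simulated receive followed by add
        step *= 2
        rounds += 1
    return vals[0], rounds


def distributed_dot(x, y, p):
    """Split x and y into p contiguous blocks and tree-reduce the partial sums."""
    n = len(x)
    bounds = [n * i // p for i in range(p + 1)]
    partials = [
        local_dot(x[bounds[i]:bounds[i + 1]], y[bounds[i]:bounds[i + 1]])
        for i in range(p)
    ]
    return tree_sum(partials)
```

With 16 simulated processes the reduction finishes in 4 rounds, whereas gathering every partial sum on one rank in turn would take 15 receive steps. This is the kind of difference the first two variants are meant to expose.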


4.2 Module 2

Mod2a(1)

Mod2a tests the speed of matrix-vector multiplies of various orders. These orders are defined in an input file named mod2a.in. This test is a uni-processor test.

Mod2a(2)

Mod2a tests the speed of matrix-vector multiplies of various orders, like the previous test. This test, though, does the computations in parallel using MPI, with the number of processes ranging from 2 through 16.

Mod2b(1)

The distributed memory version of Mod2b performs a general matrix update, calculating C = C + AB. The orders of these systems are defined in the input file mod2b.in.

Mod2b(2)

Mod2b tests the speed of solving a set of dense linear systems. The orders of these systems are defined in the input file mod2b.in.

Mod2c

Mod2c tests the speed of solving a set of sparse linear systems, whose orders range from 103 through 1003.

Mod2d

Mod2d will test the speed of computing eigenvalues of dense linear systems. The orders tested are defined in Mod2b.in.

Mod2e

Mod2e tests the speed of computing eigenvalues of sparse linear systems, whose orders range from 10 to 10000.

Mod2f(1)

Mod2f tests the speed of computing the 1-D complex-to-complex Fast Fourier Transform for orders ranging from 2^8 to 2^20. These orders are defined in the input file mod2f.in. This is the single-processor version.

Mod2f(2)

Mod2f tests the speed of computing the 1-D complex-to-complex Fast Fourier Transform for orders ranging from 2^8 to 2^20. This version is run in parallel using MPI, with the number of processes ranging from 2 through 16. The orders are defined in the input file mod2f.in.

Mod2g(1)

Mod2g tests the speed of computing the 2-D Haar Wavelet transform on matrices ranging in size from 16x16 to 512x256. MPI calls are inserted to measure the speedup obtained by distributing the data across processors. These matrices are defined in the input file mod2g.in.

Mod2g(2)

The distributed version of Mod2g also calculates the 2-D Haar Wavelet transform. These matrices are defined in the input file mod2g.in.

Mod2h

Mod2h measures the speed of generating uniformly distributed random numbers.



4.3 Module 3

 Mod3a

Mod3a measures the speed of calculating matrix-vector multiplications whose sizes range from 20000x25000 to 100000 x 250000. The data for these calculations reside out-of-core, so I/O bandwidth is also measured.

Mod3b

Mod3b calculates a 2-D Fast Fourier Transform. To get this test to run, the size of the matrix had to be reduced from 8192x8192 to 512x512.

Mod3c

Mod3c solves a Poisson equation on a 257 x 257 grid.

Mod3d

Mod3d solves a 500 x 20 linear least squares problem. The problem is solved twice, once using a QR decomposition and once using Singular Value Decomposition.

Mod3e

Mod3e finds the solution to a non-linear least squares problem with 7 fitting parameters and 1000 observations. Standard linear algebra routines are used. For this benchmarking, the vendor-optimized versions in CXML from Compaq were used.

Mod3f

Mod3f solves a one-dimensional diffusion problem. Solving this problem also required using CXML in order to get optimal performance.

Mod3g

Mod3g solves Poisson and Helmholtz equations by cyclic reduction and Fast Fourier techniques. The Laplace equation on the unit square is solved with boundary conditions U = 1.0 on the entire boundary. This discretisation is on a 257 x 257 grid.

Mod3h

Mod3h solves a Poisson equation by a block-relaxation method. The problem sizes range from 17 x 17 to 513 x 513.




5 ParkBench Benchmark

PARallel Kernels and BENCHmarks (PARKBENCH) is a collection of comprehensive parallel benchmarks designed to give both the vendor and the user the exact optimum computational speeds of the various distributed memory architectures. The original objectives of ParkBench were:

A good benchmark rests on good time measurements. ParkBench provides two low-level benchmarks that test the precision and accuracy of the CPU clock to be used in subsequent benchmarking:

Tick1 - Measures the time interval between ticks of the CPU clock, giving a measure of the precision.

Tick2 - Compares the same time interval as measured by an external wall clock and by the computer clock, giving a measure of the accuracy of the computer clock. The scale factor used to convert computer clock ticks to seconds is tested. If the CPU clock is not being used in the correct manner, this is detected immediately.
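The idea behind Tick1 can be sketched in a few lines of Python. This is an illustration only, not the ParkBench Fortran code: the function name clock_tick, the default sample count, and the choice of time.perf_counter as the timer are all assumptions.

```python
import time


def clock_tick(samples=100_000, timer=time.perf_counter):
    """Estimate timer resolution from back-to-back timer calls.

    Returns (tick, zeros): the smallest non-zero difference between
    successive readings (None if every difference was zero), and the
    number of differences that were exactly zero.
    """
    readings = [timer() for _ in range(samples)]
    diffs = [b - a for a, b in zip(readings, readings[1:])]
    nonzero = [d for d in diffs if d > 0.0]
    tick = min(nonzero) if nonzero else None
    zeros = sum(1 for d in diffs if d == 0.0)
    return tick, zeros
```

A large count of zero differences suggests that the timer ticks more slowly than the loop calls it, in which case short benchmark loops must be repeated many times before their duration can be measured reliably.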

Several performance metrics need to be defined for a benchmark, given the execution time T(N;p) and the flop-count F(N), where N is the size of the problem and p is the number of processors. The metrics defined are Temporal Performance, Benchmark Performance, and Hardware Performance.


5.1 Temporal Performance

 Temporal Performance, $R_T$, is used when different algorithms for solving the same problem need to be compared. It is defined as the inverse of the execution time:

$R_T(N;p) = T^{-1}(N;p)$

The units used in this metric are solutions per second (sol/s), or timesteps per second (tstep/s).



5.2 Benchmark Performance

Since different benchmarks solve different problems, each problem requires a different amount of work to arrive at the solution. In performance evaluation, work is measured in flops (floating point operations). The flop-count of benchmark B is written $F_B(N)$. Benchmark Performance is defined as:

$R_B(N;p) = F_B(N) / T(N;p)$

Mflop/s (benchmark name) is the unit used to measure Benchmark Performance. The benchmark name is used in the unit name since the performance of the computer on this benchmark may be deeply interconnected to the problem being solved on the benchmark.

 

5.3 Hardware Performance

 To obtain the actual performance of the hardware being benchmarked, the number of floating-point operations actually executed must be counted. This count is defined as $F_H(N;p)$. Hardware Performance is then defined as:

$R_H(N;p) = F_H(N;p) / T(N;p)$

This metric uses Mflop/s (the same as Benchmark Performance). Since it measures the floating-point operations actually executed by the hardware, it can never surpass the theoretical peak performance. For a computer with several CPUs, each with multiple arithmetic pipelines that deliver at most one flop per clock period, the theoretical peak is defined as:

$r^{*} = \dfrac{\text{floating-point pipes per CPU}}{\text{clock period}} \times \text{number of CPUs}$
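The Temporal, Benchmark, and Hardware Performance definitions and the theoretical peak are simple ratios, so a short Python sketch may help in applying them. The function and parameter names are hypothetical, and the conversion to Mflop/s (division by 10^6) assumes times in seconds and raw flop counts.

```python
def temporal_performance(t_seconds):
    """Temporal Performance: solutions (or timesteps) per second."""
    return 1.0 / t_seconds


def benchmark_performance(benchmark_flops, t_seconds):
    """Benchmark Performance in Mflop/s from the nominal flop-count."""
    return benchmark_flops / t_seconds / 1.0e6


def hardware_performance(executed_flops, t_seconds):
    """Hardware Performance in Mflop/s from the flops actually executed."""
    return executed_flops / t_seconds / 1.0e6


def theoretical_peak(pipes_per_cpu, clock_period_s, num_cpus):
    """Theoretical peak in Mflop/s, one flop per pipe per clock period."""
    return pipes_per_cpu / clock_period_s * num_cpus / 1.0e6


# Example: one node with four 667 MHz processors.
# ASSUMPTION: two floating-point pipes per CPU; the text above does not
# state the pipe count, so this value is for illustration only.
node_peak = theoretical_peak(pipes_per_cpu=2,
                             clock_period_s=1.0 / 667.0e6,
                             num_cpus=4)
```

Under that assumed pipe count, node_peak comes out at roughly 5336 Mflop/s. Any measured Hardware Performance for a four-processor run should fall below this bound.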

 

5.4 Low-Level Single Processor Benchmarks

When evaluating the performance of a parallel machine, the first thing to measure is how a single logical processor of the multi-processor system performs. PARKBENCH provides five benchmarks that test various aspects of a single processor: TICK1, TICK2, RINF1, POLY1, and POLY2.

5.4.1 TICK1

 TICK1 is a simple benchmark that measures the time interval between ticks of the CPU clock. This is achieved by calling the timer function repeatedly inside a loop that is executed many times. The elapsed time between successive calls to the timer function is then examined. Zeros will be recorded if

5.4.2 TICK2

TICK2 performs its test by comparing a time interval measured by the computer's internal clock with the same interval measured on the benchmarker's wristwatch. The absolute values of both time intervals are calculated to see whether the computer's internal clock interval is approximately equal to the wall-clock interval, and any discrepancy is reported. This check matters because CPU time cannot be used in performance calculations: the CPU timer does not include the time that elapses while the job is out of the CPU. TICK2 also checks that the correct multiplier is being used to convert CPU ticks to seconds.

5.4.3 RINF1

RINF1 contains a set of Fortran DO-loops whose execution time is studied using two parameters (rinf, n1/2). rinf is the asymptotic performance rate in Mflop/s, which the actual rate approaches as the length of the loop increases. n1/2 expresses how quickly R (the actual performance rate) approaches rinf. It is the half-performance length: the loop length required to achieve a rate of one half of rinf. The time, T, required to execute a DO-loop of length n with q vector operations (q floating point operations per element per iteration) is given as:

T = q * (n + n1/2) / rinf

Then the performance rate can be calculated using:

R = (q * n)/T = rinf / (1 + n1/2 / n)
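Because the timing model is linear in the loop length n, the two parameters can be recovered from measured (n, T) pairs with an ordinary least-squares line fit: the slope equals q/rinf and the intercept equals q*n1/2/rinf. The Python sketch below illustrates this under that model. It is not the RINF1 source, and the names fit_rinf_nhalf and rate are hypothetical.

```python
def fit_rinf_nhalf(lengths, times, q=1):
    """Fit T = q*(n + n_half)/r_inf to measured loop lengths and times.

    Returns (r_inf, n_half). With times in seconds, r_inf is in flop/s.
    """
    m = len(lengths)
    mean_n = sum(lengths) / m
    mean_t = sum(times) / m
    sxx = sum((n - mean_n) ** 2 for n in lengths)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in zip(lengths, times))
    slope = sxy / sxx                   # equals q / r_inf
    intercept = mean_t - slope * mean_n  # equals q * n_half / r_inf
    return q / slope, intercept / slope


def rate(n, r_inf, n_half):
    """Performance rate R predicted by the model for loop length n."""
    return r_inf / (1.0 + n_half / n)
```

By construction, rate(n_half, r_inf, n_half) returns exactly half of r_inf, which matches the definition of the half-performance length.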

5.4.4 POLY1 and POLY2

 Since there are always delays in getting data from the cache or main memory to the CPU, the ideal peak rate is never reached. POLY1 and POLY2 quantify the degradation of performance caused by memory access bottlenecks. The computational intensity, f, is defined as "the number of floating-point operations performed per memory reference to an element of a vector variable". The effect of this degradation on the overall hardware performance (represented by rinf) is described by two parameters (r1inf, f1/2). r1inf is the optimal hardware performance, and f1/2 is the computational intensity required to reach half of r1inf. The performance is given by:

rinf = r1inf / (1 + f1/2 / f)

To make the loop as efficient as possible, the evaluation of a polynomial by Horner's Rule was chosen. The computational intensity of the loops was observed to be the order of the polynomial being evaluated. It was also seen that the multiply and add pipelines of the hardware could be executed in parallel. In these benchmarks, Horner's Rule was applied to polynomials of orders one through ten.

 The POLY1 benchmark evaluates each polynomial of order one through ten 1000 times for vectors of length up to 10,000. Vectors of this length fit into most caches. POLY1 therefore tests the bottleneck between the registers and the cache.

 The POLY2 benchmark, prior to moving up one order, flushes the cache. After flushing the cache, it will perform only one evaluation on vectors ranging in incremental size from 10,000 to 100,000, thereby always needing accesses to main memory. POLY2 is therefore an out-of-cache test of the memory bottleneck between main memory and the registers.
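Horner's Rule and the intensity model used by POLY1 and POLY2 can be sketched as follows. This Python version is for illustration only and is not the benchmark's Fortran source. The function names are hypothetical, and the arguments r_hat and f_half stand for the parameters written r1inf and f1/2 in the text.

```python
def horner(coeffs, x):
    """Evaluate c[0] + c[1]*x + ... + c[d]*x**d by Horner's Rule."""
    result = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        result = result * x + c  # one multiply and one add per degree
    return result


def poly_on_vector(coeffs, xs):
    """Apply the same polynomial to every element of a vector."""
    return [horner(coeffs, x) for x in xs]


def model_rate(f, r_hat, f_half):
    """Predicted rate at computational intensity f: r_hat / (1 + f_half/f)."""
    return r_hat / (1.0 + f_half / f)
```

A polynomial of order d costs d multiplies and d adds per element against one load and one store, which is consistent with the observation that the computational intensity equals the order of the polynomial. Measuring the rate for orders one through ten therefore samples the model at f = 1 through 10.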

 

5.5 Low-Level Multi-Processor Benchmarks

5.5.1        COMMS1

This test is known as the ping-pong test. The benchmark uses two nodes, one as the master and the other as the slave. The master node sends a message of varying length n to the slave node. The slave node receives the message and puts it into a Fortran array, then immediately sends it back to the master node. The time T is recorded as half the time it takes for the message to be sent to the slave node and received back.
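The timing scheme of the ping-pong test can be demonstrated on a single machine. The Python sketch below replaces MPI and the two nodes with a local socket pair and an echo thread, so the numbers it produces say nothing about the AlphaServer SC interconnect; only the measurement method (time a round trip, then halve it) follows the description above. All function names are hypothetical.

```python
import socket
import threading
import time


def _recv_exact(sock, nbytes):
    """Receive exactly nbytes from sock."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def _echo(sock, nbytes, repeats):
    """Slave side: receive each message and send it straight back."""
    for _ in range(repeats):
        sock.sendall(_recv_exact(sock, nbytes))


def ping_pong(nbytes, repeats=100):
    """Return the one-way time in seconds for a message of nbytes bytes."""
    master, slave = socket.socketpair()
    worker = threading.Thread(target=_echo, args=(slave, nbytes, repeats))
    worker.start()
    message = b"x" * nbytes
    try:
        start = time.perf_counter()
        for _ in range(repeats):
            master.sendall(message)
            if _recv_exact(master, nbytes) != message:
                raise ValueError("echoed message differs from the original")
        elapsed = time.perf_counter() - start
    finally:
        master.close()
        worker.join()
        slave.close()
    # Half of the average round-trip time is the one-way time T.
    return elapsed / (2 * repeats)
```

Dividing the message length by the returned time gives a bandwidth estimate for that length, and repeating the measurement over a range of lengths gives the kind of curve a ping-pong benchmark reports.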

5.5.2        COMMS2

This benchmark test is similar to the test executed in COMMS1. Instead of having one node send to the slave node and waiting for it to receive the message back from the slave node, COMMS2 has both nodes send messages of varied length n and then receive the messages from each other. The use of bidirectional links can make it possible to achieve higher bandwidths than COMMS1.


5.5.3        COMMS3

This is a generalized version of the ping-pong test that determines the total saturation bandwidth of the system. In an X-processor system, each processor sends a message of size N to the other X - 1 processors in the distributed network. Each processor then waits until it has received the messages from all the other X - 1 processors. The timed period ends when all X processors have received all the messages from the other X - 1 processors.

5.5.4       POLY3

This benchmark test is roughly the same as POLY1. The difference between POLY1 and POLY3 is that the results for the polynomial equation are kept on an adjacent processor instead of in the cache of the processor executing the benchmark test.

 

5.5.5        SYNCH1

This benchmark measures the time it takes for all the processors participating in the benchmark to execute a barrier synchronization statement. This is an important test, since on distributed networks with hundreds or thousands of processors the time to execute this statement must not grow too quickly as the number of processors increases. The results are given both as a barrier time and as the number of barriers executed per second.

 

 

5.6 Kernel Benchmarks

The low-level benchmarks only test the basic architectural performance of the machines benchmarked. A user would also want to know how the machine performs on complete problems, and entire application codes could be used for this. That, however, is undesirable, since full application codes can be extremely complex and parallel implementations may not be available. Instead, the application code is analyzed, its computationally intensive portions are identified, and those kernels are parallelized and turned into benchmark codes. In ParkBench, seven kernel codes were analyzed: LU_solver, MATMUL, QR, TRANS, TRD, MG, and FT. LU_solver, MATMUL, TRANS, QR, and TRD perform various matrix operations. MG and FT were taken from the NAS Parallel Benchmark Suite. MG is a multigrid kernel, and FT is a 3-D FFT PDE benchmark.

 

5.6.1     LU_solver

This kernel code computes an LU factorization with partial pivoting. A reduce operation is used within a column of the processor mesh, and point-to-point communication is used to pass pivot rows.

5.6.2     MATMUL

The objective of this code is to calculate the product of two dense matrices. The communication involved is a broadcast along the rows of the mesh, along with a shift in the direction of each column.

5.6.3     TRANS

The transposition of a matrix is calculated in this kernel code. This is an important test case since the communication required in the transposition is the transfer of data between two processors simultaneously. The network communications capacity is tested using this kernel.

5.6.4     QR

The purpose of this code is to execute a QR factorization using column pivoting. A logical grid is created and the rows of this grid are distributed using a method of block interleaving.

5.6.5     TRD

This kernel code computes the eigenvalues of symmetric matrices using matrix tridiagonalization.

5.6.6     MG

This multigrid kernel tests communication performance in both directions over both short and long distances. A V-cycle multigrid algorithm is used to obtain an approximate solution to a Poisson problem on a 256x256 grid with periodic boundaries.

5.6.7     FT

This kernel tests long-distance communication performance by solving a partial differential equation using Fast Fourier Transforms. The kernel exercises many of the features of spectral codes.

5.7 NASA Ames Parallel Application Benchmark Codes

Three compact application codes from the NASPB2.1 benchmark suite were also run. These three codes are LU, SP, and BT.

5.7.1     LU

This compact application code performs an LU decomposition. Because of the way the code partitions the processors onto its grid, only a square number of processors can be used.

5.7.2     SP & BT

These codes solve systems of equations. Both SP and BT use the same structure to solve three sets of uncoupled systems. The difference is that SP involves scalar pentadiagonal systems of equations, while BT involves block tridiagonal systems with 5x5 blocks.

6 Results

In this section, the results obtained from the execution of the two benchmarks, EuroBen and ParkBench, are discussed. The Fortran code was compiled with the -O4 -f -u options for optimization, and the C portions of the code were compiled with the -O4 option. The Compaq Fortran compiler was used for the Fortran code, and the DEC C Version 6.1 compiler was used for the C portions. In both cases, the options -elan -elan3 were added to the compile line in order to make use of the Elan communications processor. The single node benchmark tests from EuroBen and ParkBench are discussed first, followed by the multinode benchmark tests.

One advantage of using standard benchmarks like ParkBench and EuroBen is that benchmark results are available for other architectures. We compare our Alpha SC results to those of the Cray T3E and Origin 2000 as reported in the 1997 EuroBen Experiences with the SGI Origin 2000 and the Cray T3E. We also compare our results with some recent results from the IBM SP3 at Oak Ridge National Laboratory, courtesy of Dr. Dunigan. The IBM, like the Alpha, has four processors per node; the IBM processors run at 375 MHz. The nodes are interconnected by crossbar switches. Like the Alpha, the IBM has an 8 MB L2 cache.

6.1 Single Node Benchmark Tests

6.1.1 Low Level Benchmark Tests

6.1.1.1 Intrinsics Benchmark Tests

Figure 3 illustrates the results of the benchmark tests that evaluate the speed of various intrinsic math functions. These tests were Mod1e and Mod1f from EuroBen. The intervals that executed the fastest were placed in the table. The precision of these functions was also tested and found to be within acceptable parameters.


Figure 3: Intrinsics Benchmark Test

6.1.1.2 Basic Operations Benchmark Test

Figure 4 shows the results of the benchmark that tested the basic mathematical operations. Only the first 14 kernels are included in the table, because the results for kernels 15-31 are nearly identical to those for kernels 1-14; they differ only in using odd and even strides and index arrays. The results are for operands loaded from the primary cache. Results for operands taken from main memory were not obtained, since the secondary cache of the AlphaServer SC is so large (8 MB). Extensive revisions to the code would be needed to reach problem sizes big enough to be forced out of cache.


Figure 4: Basic Operations Benchmark Test

6.1.2 Single Node Kernel Benchmark Tests

6.1.2.1 Single Node Fourier Transform Benchmark Tests

The speeds observed in the two Fourier Transform benchmark tests are shown in Figure 5 and Figure 6. Both tests calculate the Fast Fourier Transform for input vectors of various sizes. The 1-D Fourier Transform test is from the EuroBen Benchmark Suite, while the 3-D Fourier Transform test is from the NAS Benchmark Suite. The results for the AlphaServer SC are only listed up to 262144, since results for larger sizes were not obtained for the IBM SP.


Figure 5: 1-D Fourier Transform Test


Figure 6: 3-D Fourier Transform Test


Figure 7: Mod1d Benchmark Test

6.2 Multinode Benchmark Tests

ParkBench and EuroBen provide some parallel benchmarks. Explicit message passing using MPI is the programming model we tested. We present results on basic MPI performance, and some of the results from the parallel kernels.

6.2.1 Multinode Low-Level Benchmark Tests

Our results show that between nodes, the Alpha can reach 200 MBytes/second compared with 138 MBytes/second for the IBM SP3. The one-way short message latency is only 5.4 microseconds for the Alpha compared with 16.3 microseconds for the IBM.

6.2.1.1 Ping Pong Benchmark Tests

The results in Fig. 8 and Fig. 9 are for the Ping Pong tests that were part of the EuroBen and ParkBench suites. The tests were executed twice: once with two processors on the same node, and once with processors on different nodes. A noticeable drop in performance occurs when the processors have to communicate across different nodes. The Mod1h test is part of EuroBen, while Comms 1 is part of ParkBench. Both tests accomplish the same task, which is borne out by the fact that their results are nearly the same. The EuroBen results show an anomalous value for the message size of 2800000 bytes. This was not a transient error, because the test was run several times and showed the same anomaly each time. It is probably a quirk in the code, since the anomaly does not appear in the ParkBench results.


Figure 8: Comms 1 Test





Figure 9: Mod1h Test





Figure 10: Comms 2 Test





Figure 11: Comms 3 Test



6.2.1.2 Barrier Synchronization Test

This benchmark tests the execution of the MPI_BARRIER MPI routine. As the number of processors increases, the time it takes to execute an MPI_BARRIER increases logarithmically. The numbers of processors tested were 2, 4, 8, and 16.


Figure 12: Synch 1 Test



6.2.2 Multinode Kernel Benchmark Tests

The Alpha SC showed near-linear speedups for LU, MATMUL, QR, MG, FT, and SP. Similar benchmarks in ParkBench and EuroBen (e.g., LU) show similar results.

6.2.2.1 Single Node Fourier Transform Benchmark Tests


Figure 13: 1-D Fast Fourier Transform Test

6.2.2.2 LU_solver


Figure 14: LU_solver Test

6.2.2.3 MATMUL


Figure 15: MATMUL Test

6.2.2.4 TRANS


Figure 16: TRANS Test

6.2.2.5 QR


Figure 17: QR Test

6.2.2.6 TRD


Figure 18: TRD Test

6.2.2.7 MG


Figure 19: MG Test

6.2.2.8 FT


Figure 20: FT Test

6.2.3 NASA Ames Parallel Application Benchmark Tests

6.2.3.1 LU


Figure 21: LU Test

6.2.3.2 SP


Figure 22: SP Test

6.2.3.3 BT


Figure 23: BT Test

6.2.3.4 Distributed Dot Product Benchmark Test


Figure 24: Distributed Dot Product Benchmark Test

7 Conclusion

Several things were learned from this project concerning the AlphaServer SC and the benchmarking of machines. First of all, benchmark codes are often not ready to run "out of the box". Codes need to be tuned to the specific architecture of the machine being benchmarked, which often means that minor to extensive modifications must be made to the benchmark code to get it to execute at an optimal level. Also, some benchmark tests that were effective for older machines are no longer effective, because the processors may be too fast or the caches and memory too large.

As for the AlphaServer SC, the analysis of the benchmark results shows that this machine is competitive with comparable architectures. It was also found that, in the execution of parallel (MPI) benchmarks, this architecture scales well as more processors are added.

8 Acknowledgements

I would like to thank Prof. Dunigan for providing a great deal of assistance throughout this project. I would also like to thank the people at the Oak Ridge National Laboratory for their assistance with various technical aspects of the AlphaServer SC machine.

9 References