Intel Corporation
Workstation Technology Lab
Hillsboro, OR 97124
bruce_s_greer@ccm.jf.intel.com
Greg Henry
Intel Corporation
Enterprise Server Group
Hillsboro, OR 97124
henry@co.intel.comhttp://www.cs.utk.edu/~ghenry/prof.html
Abstract:
This paper gives a technical discussion of the Intel Pentium® Pro processor and optimization strategies used to achieve high performance on scientific applications. We demonstrate these optimizations by characterizing matrix multiplication (DGEMM). We give insight and a model into our efforts on obtaining the world's first TeraFLOP MP LINPACK run (on the Intel ASCI Option Red Supercomputer), based on Pentium Pro processor technology. The importance of this paper is carried by the increasing trend of commodity parts in the supercomputing arena.
Keywords:
BLAS, DGEMM, Optimization, MP LINPACK, TeraFLOP, ASCI Red.
Introduction
In this paper we discuss the broad range of methods and techniques we used to capture as much performance as possible on Intel Pentium Pro processors across the spectrum from a single scientific workstation all the way to the world's first Teraflop supercomputer. Our motivation for this work sprang both from the need to provide full performance factorization software for the ASCI program, and to supply a high performance library for technical computing as part of the Intel Performance Library Suite. As a proof of concept, we demonstrate how we used the precise code described in these techniques to achieve the world's first Teraflop MP LINPACK run on the new Intel ASCI Option Red Supercomputer. We describe the details of that historic run, provide a thorough model of performance, the techniques necessary to get at that performance and show how high performance is becoming more available on off the shelf components. We start by providing details of memory hierarchy found in the Pentium Pro. We then describe the matrix multiply routine DGEMM of the BLAS (Basic Linear Algebra Subroutines) and how we exploited the levels of hierarchy. We then model MP LINPACK and show how we obtained the first sustained Teraflop on this code.
The idea behind the BLAS (Basic Linear Algebra Subroutines) is the encapsulation of important linear algebra kernels. With these kernels, the programmer can develop modular code, because new machines such as Intel processors, are released with optimized BLAS (the Intel Performance Library Suite). The BLAS philosophy therefore allows for efficiency and portability.
The concept of level[1] is key to final BLAS performance. Level-1 BLAS, such as a dot product, have order N data and order N floating point operations (FLOPS), where N is implicitly understood as the size of the problem. Level-2 BLAS, such as matrix vector multiplication, have order N2 data and order N2 FLOPS. Level-3 BLAS, such as matrix multiplication like DGEMM, have order N2 data and order N3 FLOPS. This level concept is critical because data movement is slow compared to computationl[2]. With a double precision dot product run on a 200 MHz Pentium Pro processor run with a typical 66 MHz bus, for example, the processor can do a multiply and an add every other clock, thus peaking at 200 MFLOPS (millions of floating point operations per second). However, if the data comes from main memory, each data item requires a single bus clock, or three CPU clocks, and the peak obtainable performance must be reduced to at most a third (66 MFLOPS). One solution is to use a faster memory subsystem. Because this is so expensive, most high performance architectures such as Intel's use a memory hierarchy. If data brought into the low end (the registers and L1 cache) is reused to a significant degree, then the data traffic between the layers of the hierarchy is decreased. Here is where the concept of level assists us. If an algorithm is blocked so that there are more FLOPS than data touches, such as the DGEMM example above, then data can stay in the fast end of memory and one can get significantly closer to the 200 MFLOPS described above. Achieving this is not easy, and we describe some of the greatest obstacles in achieving fast floating point performance on one of the quickest growing processors in the scientific workstation market.
We have shown why the notion of level is important to a memory hierarchical system. Before showing how we exploited this concept with DGEMM, we describe the details of the off-the-shelf components used in the Pentium Pro-based Intel ASCI Option Red supercomputer.
Most computer hardware companies provide optimized math library kernels. At a minimum these libraries include the BLAS, FFTs, and sometimes, higher level functions. These include the Digital Extended Math Library (DXML), SGI Scientific Math Library and Sun Technical Library, among others. Intel has developed a family of technical libraries[3] to address the computational needs of various users. The intention for each of these libraries is that they meet the needs of a large class of user, that they exploit the latest architectural features of the processors and that they offer significant performance advantages to the user. The libraries, aimed at consumer, commercial, and technical software, are:
Only the last library was designed expressly for technical computing. MKL is currently optimized only for the Pentium® Pro processor, while the others have optimized versions for Pentium processors, including some with optimizations for MMX™ technology. MKL was the first of this set to be multithreaded, recognizing both that technical users are likely to have access to machines with more than one processor, that they will likely be using Windows* NT, which supports multiprocessing, and that their applications require all the compute power available.
The Pentium Pro processor has the same user visible architecture as the Pentium processor and previous Intel processor. Collectively this is known as Intel Architecture, or IA. Since the genesis of this architecture is from much earlier times when CPU requirements for PCs were much more modest, the number of registers is small. There is a set of 8 floating point stack locations[4] and 8 integer registers. Since this is a small set of integer registers, most of them are saved and restored upon entering and leaving subroutines. The truth is that it has 40 general purpose 80-bit (conforming to double extended IEEE format Std 854) internal registers, but these are hidden from the user at the micro-architecture level. The 80486 instruction set allows only for the use of a floating point stack consisting of 8 values. Note that previous x86 instruction sets did not have a floating point stack.
Due to register renaming, hidden from the user, the Pentium Pro processor uses its floating point registers to complete the floating point stack instructions using out-of-order execution, but in-order retirement. The out-of-order execution is combined with sophisticated branch prediction and register renaming to provide what Intel calls dynamic execution. The CISC x86 instructions are broken down into simpler RISC instructions called micro-ops which can be decoded at a rate of 3 per CPU cycle and retired at a rate of 3 per CPU cycle (only one retirement can belong to the floating point unit.) The core CPU cycle we are considering is 200 MHz. The core can execute a burst rate of 6 micro-ops per cycle on 5 separate functional units (store data, store address, load address, integer ALU, and floating point or integer unit.) There are 3 separate decoders, but only decoder 0 can handle the more complicated instructions, so it is wisest to disperse complicated instructions to the first instruction in a 16-byte segment, and then every third instruction thereafter (instruction 4, 7, etc.).
The Pentium Pro processor is often referred to as 3-way superscalar because of its ability to decode, dispatch, and retire three instructions per clock. It accomplishes this via a decoupled 12-stage superpipeline and out-of-order execution just mentioned. The pipeline is divided into four processing units (fetch/decode, dispatch/execute, retire, and the instruction pool.) The instruction pool (or reorder buffer) can contain as many as 40 micro-ops before the resource becomes saturated.
The processor has a primary on-chip non-blocking L1 data and instruction cache. Each one is 8K bytes and they are located on the die. (Note: the Pentium II L1 cache is 16K bytes each for data and instruction). The data cache is dual ported, two way set associative (four-way on the Pentium II processor), and the instruction cache is four way set associative using a LRU policy. The cache lines are 32 bytes long. The data cache is write-back, write allocate. This means that writes done to L1 require first that the respective cache line be read into L1. It also means that the write data remains cached in L1 until another cache line evicts it. At that time, the cache line gets written out. For most of our experiments, except where we state otherwise, we assume the L1 data cache is clean. We will sometimes abbreviate the L1 data cache by just referring to the L1 cache, and explicitly talk about the L1 instruction cache when we need to refer to it. The L1 data cache has a 3 CPU cycle latency to the floating point registers, but it can sustain a bandwidth of one data access per cycle. Actually, it can support an 8 byte (quadword) data read, an 8 byte data write, and a 16 byte instruction cache request per CPU cycle. In most cases, however, reads dominate over writes and also codes are usually not instruction-bound. If one wishes to do an 80-bit (extended) read or write, this will require an extra CPU cycle. Reading data from L1 can be done at 8 bytes per CPU clock, or 1600 million bytes per second. This is 1526 Mbytes/sec (1526*1048576 bytes or 1526*131072 doubles per second).
The processor has a non-blocking unified off-die L2 cache that is packaged with the CPU in a single dual-cavity 387-pin PGA package. The L2 cache is closely coupled with the CPU via a dedicated full CPU clock speed (64 bit) cache bus. The L2 cache supports both data and instructions. It is 256 Kbytes (static RAM). (Note: Pentium Pro systems have L2 cache sizes of 256 KBytes, 512 KBytes or 1 MBytes; Pentium II processors has 512 KBytes L2 cache). The L2 cache is 4 way set associative and each cache line is 32 bytes. The cache is implemented with a pseudo-LRU replacement policy. The L2 cache is write-back and write-allocate just as the L1 cache. We usually assume the L2 cache is clean except where otherwise stated. The L2 data cache has a 3 CPU cycle latency to L1, and a 6 CPU cycle latency to the registers. The L2 data cache can theoretically sustain a bandwidth of one double per CPU clock. Cache coherency is maintained via the MESI protocol.
The external 66 MHz bus is 64 bits wide and can support a data transfer every bus cycle. That is, it can do one 8 byte (64 bit) transaction every bus clock. This amounts to a bandwidth of an 8 byte read or write to L2 every 3 CPU clocks. Actual latency to L2 is minimally 8 CPU clocks for the first 8 bytes in a cache line, and 1 clock for each 8 bytes thereafter for around 11 clocks total. This is the minimal read latency, and actual latency depends on data locality. If there is a page miss without a precharge, an extra 3 cycles is lost, and if a page misses requires a precharge, three more cycles are lost. Each node on a kestrel board (the boards used in the ASCI Option Red Supercomputer) has two Pentium Pro(R) processors, and each kestrel board has two nodes each with 128 Mbytes of main memory. The 8 bytes per 3 CPU clocks amounts to 1600/3 or 533 million bytes/sec (or 508 Mbytes/sec..) The memory controller used and CPU can support up to 4 pending transactions. The external P6DX bus can support up to 8 pending transactions (4 for each CPU.)
All parts of the system and processor must be considered when building an optimal kernel such as DGEMM. The registers, both integer for memory reference and loop management and floating point registers, in the form of the floating point stack, the instruction decoder, each level of cache, the memory subsystem, and multiple processors on the bus were all considered in building DGEMM.
DGEMM may well be the most used kernel in computational science. It is certainly the workhorse of the level 3 BLAS. Figure 1 illustrates the blocking of the three matrices as performed in our implementations both in MKL and for the MP LINPACK software for the Teraflops initiative.
Ultimately DGEMM is a set of blocks of the matrix A multiplied by blocks of the matrix B, with careful attention given to cache management. Cache management essentially involves ensuring that a subsequent value copied to cache does not evict a value moved to cache. We seek to minimize the likelihood that anext%sc = aprevious%sc, where anext is the address of the next memory reference, sc is the cache set size, and aprevious is the address of a datum in cache which we do not want evicted.
Blocking simply assures that the total amount of data we reference fits in some cache, say L2. Figure 2 shows the effects of a simple blocking of matrix multiply on a 300 MHz Pentium II processor, with DGEMM also shown for comparison, illustrating the inadequacy of blocking alone. More complex blocking, including copying, represents much of the effort spent on this code.

Figure 2. Comparing simple blocking with fully optimized DGEMM on 300 MHz Pentium II processor
Many of the optimizations applied in DGEMM for the Pentium Pro processor are applicable to the various RISC processors available. However, because of the small number of integer registers and the use of a stack rather than a set of individually addressable registers for floating point usage, the optimization here differs from what might be necessary on other processors.
When a compiler generates code spilling registers to the stack has small consequence since the compiler tends to generate suboptimal code so the added inefficiency has small effect. However, in the development of kernels we try to get all the performance possible; we cannot afford to spill and restore registers in the inner loop. Table 1 shows the mapping of the operations onto the set of registers:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 1. DGEMM parameters mapped onto IA registers
Within the inner loop, only the first three registers are needed. The cost to save and restore registers for the other uses would not have a substantial effect on performance. The same cannot be said for the floating point stack. The number of stack locations required is determined in part because of the pipelined arithmetic unit. The latency for the floating point unit for performing addition is 3 clocks with a result available every clock[5]. When performing multiplication, the latency is 5 clocks, with a result available every clock provided the last operation was not multiply. Thus a sequence of multiply-adds will can be performed with a result every clock. The pipeline depth for the adder, however, limits the performance that can be achieved on a single reduction operation like the dot product. To deal with this issue, four dot products are maintained simultaneously. The limited number of floating point stack locations presents a challenge for managing this approach.
There are just eight floating point stack locations: st(0):st(7). Individual stack locations may be referenced for reading, but all writes are to st(0), which is also known as st. Since st is overwritten, much of the difficulty arises from getting the data into the correct position on the stack. Two operations of the stack make management efficient. The order of operands to the floating point operation affects where the result will be when the operation completes. In the one case the result will be in st and the other operands will be in st(1) and st(n+1), as the stack will be pushed as a result of the operation. The result and st(n+1) can be swapped, however, as a result of the operation enabling stack management. Finally, the stack can also be popped at the same time. For reductions, the old sum is added to the new product, the new sum swaps with the old sum, and the old sum is popped off the stack as a single clock operation.
The mapping onto the floating point stack of the quantities used in DGEMM, at the beginning of a phase of computing the multiple dot products, is shown in table 2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 2. Mapping of values for DGEMM onto floating point stack
The Pentium Pro processor can be thought of as having a hardware translator for decoding IA instructions and issuing micro operations. The out-of-order execution capabilities of the processor provide a means of more effectively using memory. However, since instructions are retired in order, care must be taken when issuing instructions that they are paired properly to assure optimal retirement, which is the final measure of how well the processor is executing. The processor can perform two integer operations and one floating point operation per clock.
Anticipating some of the behavior of the instruction decoder is straightforward and is covered in the optimization manual for Intel processors. For example, issuing multiple simple instructions (fetch, multiply) instead of complex instructions (fetch and multiply) decodes more efficiently.
Other optimizations required experimentation and reverse engineering of how operations are scheduled, queued, and retired. Positions of values on the floating point stack can be exchanged with the top of the stack through an instruction fxch. This is an integer operation so it can be paired with an additional integer operation and a floating point operation in a single clock.
Adding fxch to swap two values which are to be added, even though unnecessary due to the commutative nature of addition, can improve performance in excess of 5%. These fxch operations take up slots in the instruction pool, so we use fewer of them when making cite1 to main memory, which induces the largest latencies, filling up the instruction pool.
The floating point unit performs both addition and multiplication. It is capable of starting an addition operation every clock or an add-multiply-add-multiply, and so on, sequence at one operation per clock. However, successive multiplies can begin only every other clock. The dot product allows us to alternate multiplication and addition, except in the start up portion of our code where we perform several multiplies successively without intervening additions.
Both addition and multiplication are pipelined operations; the adder has a 3 clock latency while the multiplier has a 5 clock latency. While the out-of-order execution can hide the occasional dependency, it cannot hide the dependency when every add is dependent on the previous add. To remove the dependency on the previous addition, four dot products are performed simultaneously, the sums of which use half the floating point stack locations.
Typically two integer operations are performed with each floating point operation - a memory fetch and a stack swap to keep data in the correct place. Data positions on the stack are carefully managed to assure that this ratio is not exceeded except when fetching the next value from the B matrix. The out-of-order execution permits prefetching of data, discussed in more detail in the next section. Memory may be referenced without stalling the processor until the dependency on the data prohibits further execution or the resources of the reservation station are exhausted.
Finally, each branch, even when correctly predicted, costs 1 clock. As a consequence we fully unroll the inner loop to remove that one clock per iteration cost.
On Intel processors, all data must pass through the caches. Much of the effort expended on blocking sought to minimize the opportunity for eviction of data from a cache until it had been fully used. Since there is such a disparity between main memory performance and the requirements of the processor, effective use of the cache is mandatory for kernels like DGEMM.
Our blocking is based on the size of a single set of the respective caches. For L1 on the Pentium Pro processor, we are limited to blocks no larger than 512 double precision words. We use a block size of 16x32 for the A matrix, where 16 is the blocking size for rows of C and 32 is the inner product size. To assure that there none of these 512 values has the same address in cache, we copy from A to a temporary array of dimension 512. If transposition is required we perform that when we copy the data.
The demands for the B matrix are somewhat less stringent than those for A. We do not copy B in the current version of DGEMM. Rather, we prefetch the data to minimize cache misses[6]. In any iteration n the value from the B matrix was fetched in the previous iteration; B(n+1) is being fetched for the next cycle. Prefetching helps even if the B resides in L2 since it hides the latency of moving data from L2 to L1.
The goal is to bring the C block into cache only once. This goal affects the choices for blocking parameters for both A and B. In the example above, the 16x32 block size for the A coupled the 32x256 block size for B allows C to remain in L2 cache. For optimal performance, other block combinations that still keep C in cache are also used; an 8x64 blocking on A coupled with a 64x128 blocking on B still keeps C in L2 cache.
We prefetch C by touching it at the beginning of the loop, simply by touching it, and loading it at the end of the loop, anticipating that the value will now be in cache.
The disparity between memory speed and CPU speed continues to grow[7]. The Pentium Pro system we developed DGEMM for has a 3:1 ratio between the CPU clock and the system clock. We have already discussed our use of cache to ameliorate this disparity in clock speed. However, the disparity is worse than it seems. The operating system breaks the memory into pages. The processor supports both 4 KB and 4 MB pages. For Windows NT, the user space is limited to the 4 KB pages. The formula for the time required to get a cache line from main memory to cache is x,n,n,n, where x is the number of system clocks required to get the first 8-byte word to cache and n is the number of clocks required for each successive word. n depends on the degree to which the memory is interleaved. For 4-way interleaved memories, n = 1.
The value of x is much harder to predict, as it depends on the previous state of the page and the state of the table lookaside buffers (TLBs)[8]. When large matrices are used, the data represents far more data than the TLBs can reference. The cost of fetching data on such a page for which the TLB is not in cache includes searching for the TLB and loading the value into the TLB cache, warming the page (required due to power management) and then actually moving the data from memory to cache.
Memory management is integrated into the use of cache, but loading cache keeps in mind the characteristics of the memory design. By copying a block of the A matrix to the cache, cache lines are moved and transposed. Furthermore, once the data is in cache few TLBs are required to reference all of the data. Figure 3 illustrates some of these effects, running DGEMM on two 300 MHz Pentium II processors.

Figure 3. New DGEMM running on two 300 MHz Pentium II processors: Effective memory management including reduced TLB misses.
All Pentium Pro and Pentium II processors have built-in support for multiprocessing. The Pentium Pro has support for up to 4 processors without additional support. Both the ASCI Red Teraflops program and MKL exploit the availability of symmetric multiprocessing to get more performance on DGEMM, and in general, the level 3 BLAS. There is a strong synergy between making a routine run well on a single cache-based system and making it scale well when run on multiple processors. The steps that we have outlined above for optimal use of the cache/memory system reduce the total number of memory access that are required. The effect of that is to increase the availability of the memory system for other processors. The chart below, Figure 4, shows scaling of DGEMM on from 1 to 4 Pentium Pro processors running at 200 MHz.

Figure 4. Plots of DGEMM performance showing scaling on 1-, 2- and 4-200 MHz Pentium Pro processors (Note: this performance measured with DBLAT3, which distorts performance on small sized problems since they are run more than once.)
The MP LINPACK Teraflop run (on 12/4/96) was made with OSF running in the service partition, and Cougar running in the compute partition. It was done on 3632 nodes, or 7264 Pentium Pro (TM) 200 MHz processors, with a theoretical peak of 1.45 TFLOPS (trillions of floating point operations-per-second.) The 57 cabinets of nodes were equipped with 128 Mbytes of memory, making the complete system have over 450 Gigabytes of memory. The problem size solved was 215000, which requires over 350 Gigabytes of memory just to store the matrix in core. RMAX (performance achieved) was 1.068 TeraFLOPS , NMAX or N was 215000, and N1/2 was 53400. N1/2 is the minimum problem size (to the nearest 100) such that half the RMAX performance was achieved. That is, 53400 achieved over half a Teraflop on this machine. The RMAX was found on 12/4/96, and N1/2 was found on 12/6/96. The number of floating point operations done is roughly (2 N3)/3 for a problem of size N. The MP LINPACK 1.3 Teraflop run (on 6/9/97) was run on 9152 Pentium Pro 200 MHz processors. RMAX was 1.338 TeraFLOPS . NMAX or N was 235000. N1/2 was 63000. Both runs used MPI.
The MP LINPACK benchmark allows you to write your own code as long as it meets certain requirements. The MP LINPACK benchmark measures the time it takes to solve a real double precision (64 bits) linear system with a single right hand side. It is a well known comparison between high performance supercomputers. The benchmark results are maintained by in the LINPACK Performance Report, "Performance of Various Computers Using Standard Linear Equations Software" by Dr. Jack Dongarra at the University of Tennessee[9]. He has accepted both our Teraflop entries into the report, which is available on the web, e-mail, and ftp.
Intel's MP LINPACK code uses a two dimensional block scattered data decomposition[10] with block size 64. Row pivoting is done in accordance with the block LU found in LAPACK[11]. The timings are for real floating point operations and not "macho" FLOPS obtained by using Strassen[12] (or Winograd[13]) multiplication. The code does not rely on the Strassen technique. The code explicitly computes all the relevant norms and does several rigorous residual checks to guarantee accuracy. The matrix generation is identical to ScaLAPACK[14] version 1.00 Beta, which is a standard MPP package for Linear Algebra. The code is similar to the one used that at one point held the MP LINPACK world record of 143.4 GFLOPS (billions of floating point operations per second) on Sandia's XP/S Model 140 GP Paragon (1840 node) Supercomputer, and later got another world record at 281 GFLOPS. The initial implementation was based on work that Robert van de Geijn did to capture the world record in 1991 on the Intel Touchstone Delta[15].
The previous MP LINPACK record was 368.2 GFLOPS, set in September 1996. Prior to that, however, the record was the above mentioned run of 281 GFLOPS in December 1994.
Modifications in the Delta code for the i860® versions in previous world records used on the Intel MP Paragon (TM) supercomputer had hand-tuned assembly kernels done by Bob Norin and Brent Leback. Other modifications were done by Satya Gupta and Greg Henry. This recent Teraflop run was on the x86 version of the code and had assembly routines written by Satya Gupta, Stuart Hawkinson, and Greg Henry.
In our ISUG 95 paper[16]
, we described a model for MP LINPACK when we had set a 281 GFLOPS world record for the earlier code. It was later pointed out to us that the model had a few typos and was incomplete. Here, we present a new updated model.Let N be the problem size. Let V be the number of vertical nodes (rows) in a logical two dimensional mapping of the data. Let H be the number of horizontal nodes (columns). The NxN matrix is then mapped on a VxH logical grid. Let K be the block size used in a cyclic block wrap mapping of the data and let it also represent the block size of the algorithm (one could alternatively use any mapping of the data regardless of the block size used for the algorithm, but we keep this for simplicity.) Let G be the operations per second rate for the GEMM calculation, T the operation per second rate for doing a triangular solve (a necessary step in a block LU decomposition.) Let U be the rate of doing K columns of a local LU. Let B be the communication bandwidth in bytes per second, and S be the communication latency in terms of seconds. Let C be the number of doubles per second to copy long strided arrays.
Fortunately, the primary piece of time goes in DGEMM. If K is much less than N, then roughly (2N3)/(3VHG) seconds goes here. For the Horizontal Broadcast time, each row must send and receive its portions of the multipliers, which average (NK/2V) doubles once for each (N/K) blocks. This is (2N/K) * ( (8NK)/(2VB) + S ) seconds. For the vertical broadcast time, each column must participate in a tree synchronization (log(V) messages) of a block averaging (NK/2H) doubles once for each (N/K) blocks. This is (log(V)N/K) * ( (8NK)/(2HB) + S ) seconds. For the triangular solve time, each triangular solve is done by only one row of nodes. Since all the nodes in the column of nodes must wait, in our model we assume the work is done redundantly. The amount of work is (K2N/(2H)) on average, done once for each (N/K) blocks. This is (N2K)/(2HT). The K-column Local LU work time is done by a single column of nodes. On average there is (K2N/(2V)) FLOPS and (N/(KH)) local LUs performed by each column. The total time is then (N2K)/(2VHU). In addition, there is communication associated with the local LU. To find local pivot requirements, there is a tree fan in followed by a fan out, followed by the actual pivot just within the block column. This requires time (3N/H)log(V)(8/B + S). For pivoting, the worst case is when every pivot is off the node that owns the current block. Then one sends and receives all the rows (2N) from the current column to the end of the matrix. This size averages N/2H. We need to also consider the cost of copying the 2N rows which average N/2H size when all the elements are spread by at least N/V doubles. The total time is (2N){ (8N)/(2HB) + S + (N/(2CH)) }. The final element is load imbalance delays due to some nodes having more columns/rows than others within DGEMM. There are (N/K) DGEMMs total. If one node has K extra columns, there are (2 K2 (N/2H)) extra FLOPS on average associated with doing the DGEMM. If one node has K extra rows, this translates into (2 K2 (N/2V)) FLOPS on average. We multiply this by (N/K) for the total number of times, and divide by two since this problem only occurs half the time. This is (N2K)/(2VG) + (N2K)/(2HG) seconds. We ignore the final order N triangular solve at the end because it represents so little of the total time when implemented correctly.
The total time is then:
(2N3)/(3VHG) + 8N2 / VB + 2NS/K + (4log(V) N2)/(BH) + (log(V)NS/K) +
(N2K)/(2HT) + (N2K)/(2VHU) + 24Nlog(V)/(BH) + (3N/H)log(V)S +
8N2/(BH) + 2NS + N2/(CH) + (N2K)/(2VG) + (N2K)/(2HG) seconds.
If we plug in the appropriate values for the Intel ASCI Option Red TeraFLOP machine, given in Table 3, we get the following number of seconds: 6247 as opposed to the real value of 6465. Using 2N3/3 FLOPS, this implies a MFLOP rate of 1.38 TeraFLOPS, implying an accuracy of 96% in our model.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 3. Parameters and values for MP LINPACK model
To achieve optimal performance on MP LINPACK we have used several techniques. First and foremost, we have made the effort to make the code DGEMM-based, and then spent a considerable effort optimizing DGEMM. We have striven to make the code scalable by using a two dimensional decomposition. We have incorporated lookahead techniques to mask the time spent doing the local LU. We have used asynchronous message passing and hand shaking to ensure that message passing was optimal and additional buffering or message copying was necessary. We have used double level blocking on the local LU, using a level-3 DGEMM-based method there as well to ensure larger blocks did not slow down the code.
In this paper we have discussed the Intel Pentium Pro and optimizing strategies. We have used for a vehicle matrix multiplication (DGEMM), which has a wide range of use in scientific applications. We have used matrix multiplication to discuss standard optimization techniques and we have discussed the methods we used to achieve the world's first TeraFLOP MP LINPACK run. We have given an accurate model for our MP LINPACK application. We have exposed some key insights in the Pentium Pro processor and how to gain the best possible performance. Finally, we have demonstrated that commodity parts are becoming increasingly important in the supercomputing of tomorrow.
References:
[1]Golub, G.H., and Van Loan, C.F., Matrix Computations, 3rd Edition, The Johns Hopkins University Press, Baltimore, 1996
[2]Hennessy, J.L. and Patterson, D.A., Computer Organization and Design, 2nd ed., Morgan Kaufmann Publishers, San Francisco, 1997
[3]developer.intel.com/design/perftool/perflibst/
[4]developer.intel.com/design/pro/manuals/242690.htm
[5]Intel Architecture Optimizations Manual: developer.intel.com/design/pro/manuals/242816.htm
[6]VanderWiel, S.P. and Lilja, D.J., "When Caches Aren't Enough: Data Prefetching Techniques," Computer, July 1997, pp 23-30.
[7]Patterson, D., et al., "A Case for Intelligent RAM", IEEE Micro, March/April 1997, pp 34-44.
[8]Hwang, K., Advanced Computer Architecture", McGraw-Hill, Inc, New York, 1993
[9]Dongarra, J.J., "Performance of various computers using standard linear equations software in a Fortran environment", Computer Science Technical Report CS-89-85, University of Tennessee, 1989, http://www.netlib.org/benchmark/performance.ps
[10]Hendrickson, B.A., Womble, D.E., "The torus-wrap mapping for dense matrix calculations on massively parallel computers", SIAM J. Sci. Stat. Comput., 1994, http://www.cs.sandia.gov/~bahendr/torus.ps.Z
[11]Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorenson, D., LAPACK Users' Guide, SIAM Publications, Philadelphia, PA, 1992
[12]Strassen, V., "Gaussian Elimination is not Optimal", Numer. Math. Vol. 13, 1969, pp. 354--356
[13]Winograd, S., "A new algorithm for inner product", IEEE Trans. Comp., Vol. C-37, 1968, pp. 693--694
[14]Blackford, S., Choi, J., Cleary, A., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C., "ScaLAPACK: A Portable Linear Algebra Library for Distributed MemoryComputers - Design Issues and Performance", Technical Paper in Supercomputing 1996, Proceedings of Supercomputing '96, Pittsburgh, Pennsylvania, http://www.supercomp.org/sc96/proceedings.
[15]van de Geijn, R.A., "Massively Parallel LINPACK Benchmark on the Intel Touchstone DELTA and iPSC(R)/860 Systems", 1991 Annual Users' Conference Proceedings. Intel Supercomputer Users' Group, Dallas, TX, 10/91
[16]Jerry Bolen, Arlin Davis, Bill Dazey, Satya Gupta, Greg Henry, David Robboy, Guy Schiffler, David Scott, Mack Stallcup, Amir Taraghi, Stephen Wheat from Intel SSD, LeeAnn Fisk, Gabi Istrail, Chu Jong, Rolf Riesen, Lance Shuler, from Sandia National Laboratories, "Massively Parallel Distributed Computing: World's First 281 Gigaflop Supercomputer", Proceedings of the Intel Supercomputer Users Group 1995, http://www.cs.utk.edu/~ghenry/isug.ps.
Bruce Greer works in the Workstation Technology Lab and is responsible for the development of kernel libraries for Intel processors. He has been involved with the optimization of both large scale programs and kernels over a number of years. When at Floating Point Systems he optimized SPICE by replacing the solver with an optimized solver. He also restructured the code and creating loops in the modeling code to better use the pipelining compiler.
At Intel he developed a parallelizing compiler for iWarp, ported a low level vision library to that architecture and developed a communication library to support that library. With the Supercomputer Systems Division he ported codes to the Paragon system. It was there that he first began optimizing for cache utilization, both as it applies to uniprocessing and to SMP multithreaded applications.
Bruce earned the BS in physics from Walla Walla College and the MS in Physics from Georgia Tech. His hobbies and activities include running, golf, photography and woodworking.
Greg Henry received his Ph.D. from Cornell University in the field of Applied Mathematics in January, 1994. He starting working at Intel Corporation in Beaverton, Oregon, in the Supercomputing Division, in August 1993. His current position at Intel is a Computational Scientist in the supercomputing group working on the new Intel ASCI Option Red Supercomputer. His area of research has been large dense eigenvalue problems. He was a Gordon Bell winner in 1994 and has attended several Supercomputing conferences as both exhibitor and technical paper contributor. Relevant to this particular paper, he has been responsible for tuning MP LINPACK and the BLAS used there-in for the Intel Supercomputers. He has also participated in the ScaLAPACK project.

Greg has three children, a dog, two birds, some frogs, lizards, two hamsters, a wonderful wife, and perhaps other animals on the way; he plays roller hockey on the Intel team, enjoys Aikido and writing.
Greg can he reached via e-mail , and the WWW . His phone number and pager can be found on the WWW site.