Here are some generic facts about the world's first Teraflop supercomputer runs for MP LINPACK:
The MP LINPACK Teraflop run (on 12/4/96) was made with OSF running in the service partition and Cougar running in the compute partition. It was done on 3632 nodes, or 7264 Pentium Pro (TM) 200 MHz processors, with a theoretical peak of 1.45 Tflops (trillions of floating point operations per second). The nodes in the 57 cabinets were each equipped with 128 Mbytes of memory, giving the complete system over 450 Gigabytes of memory. The problem size solved was 215000, which requires over 350 Gigabytes of memory just to store the matrix in core. The Teraflop supercomputer is being called the "Intel ASCI Option Red" [4].
For the MP LINPACK benchmark, you are allowed to write your own code as long as it meets certain requirements. The benchmark measures the time it takes to solve a real double precision (64 bits) linear system with a single right hand side. It is a well known comparison between high performance supercomputers. The benchmark results are maintained in the LINPACK Performance Report, "Performance of Various Computers Using Standard Linear Equations Software", by Dr. Jack Dongarra at the University of Tennessee [4]. He has accepted our Teraflop entry into his 12/16/96 report, which is available on the web [4], by e-mail, and by ftp.
RMAX was 1.068 Teraflops, NMAX or N was 215000, and N1/2 was 53400. N1/2 is the minimum problem size (to the nearest 100) at which half the RMAX performance was achieved. That is, a problem of size 53400 achieved over half a Teraflop on this machine. The RMAX was found on 12/4/96, and N1/2 was found on 12/6/96. The number of floating point operations done is roughly (2N^3)/3 for a problem of size N.
The MP LINPACK 1.3 Teraflop run (on 6/9/97) was run on 9152 Pentium Pro (TM) 200 MHz processors. RMAX was 1.338 Teraflops, NMAX or N was 235000, and N1/2 was 63000. Both runs used MPI.
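As a rough cross-check of the figures quoted above for the 12/4/96 run, the following Python sketch recomputes the theoretical peak, the aggregate memory, the in-core matrix size, and the run time implied by the (2N^3)/3 operation count. It assumes one floating point operation per processor per clock cycle (inferred from 7264 x 200 MHz = 1.45 Tflops), 128 Mbytes on each of the 3632 nodes, 8 bytes per matrix element, and decimal units. The function names are illustrative only and are not part of any benchmark code.

```python
# Cross-check of the figures quoted for the 12/4/96 MP LINPACK run.
# Assumptions (not stated outright in the text): one floating point
# operation per processor per clock cycle, 128 Mbytes per node, 8 bytes
# per matrix element, and decimal units (1 Mbyte = 1e6 bytes,
# 1 Gigabyte = 1e9 bytes).


def peak_tflops(processors, mhz, flops_per_cycle=1):
    """Theoretical peak in Tflops."""
    return processors * mhz * 1e6 * flops_per_cycle / 1e12


def total_memory_gbytes(nodes, mbytes_per_node):
    """Aggregate memory in Gigabytes."""
    return nodes * mbytes_per_node * 1e6 / 1e9


def matrix_gbytes(n, bytes_per_element=8):
    """Storage for a dense n x n double precision matrix, in Gigabytes."""
    return n * n * bytes_per_element / 1e9


def linpack_flops(n):
    """Approximate operation count, (2N^3)/3, for a problem of size n."""
    return 2 * n**3 / 3


def implied_seconds(n, rmax_tflops):
    """Solve time implied by the operation count and the reported RMAX."""
    return linpack_flops(n) / (rmax_tflops * 1e12)


if __name__ == "__main__":
    n, rmax = 215000, 1.068
    peak = peak_tflops(7264, 200)
    print(f"theoretical peak : {peak:.2f} Tflops")
    print(f"total memory     : {total_memory_gbytes(3632, 128):.0f} Gigabytes")
    print(f"matrix storage   : {matrix_gbytes(n):.0f} Gigabytes")
    print(f"implied run time : {implied_seconds(n, rmax):.0f} seconds")
    print(f"fraction of peak : {rmax / peak:.1%}")
```

Under these assumptions the sketch gives a peak of about 1.45 Tflops, about 465 Gigabytes of total memory, about 370 Gigabytes for the matrix, and an implied solve time of roughly 6200 seconds at about 73% of peak. The run time is only an estimate derived from RMAX and the approximate operation count; it is not a measured figure from the report.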
Intel's MP LINPACK code uses a two dimensional block scattered data decomposition with block size 64 [5]. Row pivoting is done in accordance with LAPACK [1]. The timings are for real floating point operations and not "macho" flops obtained by using Strassen [6] (or Winograd [8]) multiplication; the code does not rely on the Strassen technique. The code explicitly computes all the relevant norms and does several rigorous residual checks to guarantee accuracy. The matrix generation is identical to ScaLAPACK version 1.00 Beta, which is a standard MPP package for Linear Algebra [2].
The code is similar to the one that at one point held the MP LINPACK world record of 143.4 Gflops on Sandia's XP/S Model 140 GP Paragon (1840 node) Supercomputer, and later set another world record at 281 Gflops [3]. The initial implementation was based on work that Robert van de Geijn did to capture the world record in 1991 on the Intel Touchstone Delta [7]. The previous record was 368.2 Gigaflops (billions of floating point operations per second), set in September 1996. The record before that, however, was the above Intel run of 281 Gflops.
For the previous world records on the Intel MP Paragon (TM) supercomputer, the Delta code was modified for the i860 (R), with hand-tuned assembly kernels written by Bob Norin and Brent Leback. Other modifications were done by Satya Gupta and Greg Henry. This recent Teraflop run used the x86 version of the code, with assembly routines written by Satya Gupta, Stuart Hawkinson, and Greg Henry. The technique and algorithm are described thoroughly in a paper by Bolen et al. [3] in the ISUG 1995 proceedings, which is also available on the Web [3].
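To illustrate the two dimensional block scattered decomposition mentioned above, here is a minimal Python sketch of how a global matrix index could be mapped to an owning process and a local index. It assumes the layout behaves like the block-cyclic distribution used by ScaLAPACK, with the first block on process (0, 0). The block size of 64 comes from the text, while the 4 x 8 process grid and the function names are hypothetical and chosen only for the example; the grid actually used in the Teraflop runs is not given here.

```python
# Hypothetical sketch of a two dimensional block scattered (block-cyclic)
# layout. NB = 64 is the block size quoted in the text. The process grid
# shape and all names below are assumptions made for illustration; this is
# not Intel's MP LINPACK code.

NB = 64


def owner_and_local(g, nprocs, nb=NB):
    """Map a 0-based global index along one dimension to
    (process coordinate, local index on that process)."""
    block = g // nb                 # which block the index falls in
    proc = block % nprocs           # blocks are dealt out cyclically
    local_block = block // nprocs   # blocks this process holds before it
    return proc, local_block * nb + g % nb


def owner_2d(i, j, prow, pcol, nb=NB):
    """Map global entry (i, j) to ((process row, process column),
    (local row, local column)) on a prow x pcol process grid."""
    pr, li = owner_and_local(i, prow, nb)
    pc, lj = owner_and_local(j, pcol, nb)
    return (pr, pc), (li, lj)


if __name__ == "__main__":
    # Example on an assumed 4 x 8 process grid.
    for i, j in [(0, 0), (64, 130), (256, 512)]:
        print((i, j), "->", owner_2d(i, j, prow=4, pcol=8))
```

Because consecutive 64 x 64 blocks are dealt out cyclically in both dimensions, every process keeps a share of the trailing submatrix as the factorization proceeds, which is the usual motivation for this kind of layout.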
[1] Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D., LAPACK Users' Guide, SIAM Publications, Philadelphia, PA, 1992
[2] Blackford, S., Choi, J., Cleary, A., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C., "ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance", Technical Paper, Proceedings of Supercomputing '96, Pittsburgh, Pennsylvania, 1996, http://www.supercomp.org/sc96/proceedings
[3] Jerry Bolen, Arlin Davis, Bill Dazey, Satya Gupta, Greg Henry, David Robboy, Guy Schiffler, David Scott, Mack Stallcup, Amir Taraghi, Stephen Wheat (Intel SSD), LeeAnn Fisk, Gabi Istrail, Chu Jong, Rolf Riesen, Lance Shuler (Sandia National Laboratories), "Massively Parallel Distributed Computing: World's First 281 Gigaflop Supercomputer", Proceedings of the Intel Supercomputer Users Group 1995, http://www.cs.utk.edu/~ghenry/isug.ps
[4] Dongarra, J.J., "Performance of various computers using standard linear equations software in a Fortran environment", Computer Science Technical Report CS-89-85, University of Tennessee, 1989, http://www.netlib.org/benchmark/performance.ps
[5] Hendrickson, B.A., Womble, D.E., "The torus-wrap mapping for dense matrix calculations on massively parallel computers", SIAM J. Sci. Stat. Comput., 1994, http://www.cs.sandia.gov/~bahendr/torus.ps.Z
[6] Strassen, V., "Gaussian Elimination is not Optimal", Numer. Math., Vol. 13, 1969, pp. 354-356
[7] van de Geijn, R.A., "Massively Parallel LINPACK Benchmark on the Intel Touchstone DELTA and iPSC(R)/860 Systems", 1991 Annual Users' Conference Proceedings, Intel Supercomputer Users' Group, Dallas, TX, 10/91
[8] Winograd, S., "A new algorithm for inner product", IEEE Trans. Comp., Vol. C-17, 1968, pp. 693-694