Installing LAPACK 3 on CRAY machines

Ed Anderson
Lockheed Martin Technical Services
Anderson.Edward@epa.gov
December 28, 1999
Updated December 4, 2000

This report contains my porting notes, errata, and comments on the LAPACK 3 distribution, dated June 30, 1999, and the first update dated October 31, 1999, with particular recommendations for porting to a CRAY PVP or CRAY T3E system. Opinions, recommendations, and software modifications presented here are not necessarily endorsed by the other LAPACK authors or contributors. See the official LAPACK release notes (http://www.netlib.org/lapack/release_notes) for the current errata list from the LAPACK group. Also, please see the README.patch file from my patch file for a disclaimer if you are unclear on the concept of public domain software.

LAPACK 3 is the third major release of software from the LAPACK project, which began in 1987 to develop a library of Fortran 77 subroutines for solving the most common problems in numerical linear algebra. The building blocks for LAPACK are the Basic Linear Algebra Subprograms (BLAS) for vector, matrix-vector, and matrix-matrix operations. The Cray Scientific Library (libsci) already contains the BLAS and LAPACK 2 software, with some performance improvements that are documented in LAPACK Working Note 126 (LAWN 126), Performance Improvements to LAPACK for the Cray Scientific Library.

My objective was to create a library of new software from the LAPACK 3 distribution to augment the existing library. I did not replace LAPACK routines already in libsci unless they failed tests in the latest LAPACK or LAPACK95 test suites. Consequently, some of the new features of LAPACK 3 are not available via the supplemental library. For example, all LAPACK 3 routines that require workspace now allow a "workspace query", indicated by specifying a size of -1 for the LWORK or LIWORK argument. If such a value is input, the LAPACK 3 routines determine the optimal amount of work space and return it in WORK(1) or IWORK(1). This feature is not exercised by the LAPACK test software, so it was not necessary to replace the LAPACK 2 versions that lack it in order to pass the tests.

The outline of this report is as follows:

Porting notes
reports bugs found in testing LAPACK 3, and proposes workarounds
Renaming LAPACK routines in libsci
addresses two of the most vexing issues with using libsci -- missing LAPACK auxiliary routines and the Cray library's use of single precision names when it is really 64-bit.
Enhancements to LAPACK
describes features already in libsci and includes new prototypes to aid implementation on other platforms.
Test suite improvements
includes modified versions of many of the LAPACK test suite routines to test additional subroutines and improve the presentation of results.
Installing the libsci LAPACK 3 supplement
includes links to my patch file and instructions for compiling and testing an LAPACK 3 supplement to libsci.
Outstanding test failures
documents the current state of LAPACK 3 testing.

Test results are from the CRAY systems at the National Environmental Supercomputing Center (NESC), operated by Lockheed Martin for the U. S. Environmental Protection Agency. The CRAY C90 is a 4-processor system running Unicos 10.0 and Programming Environment 3.2. The CRAY T3E is a 72-processor CRAY T3E-1200 system running Unicos/mk 2.0.4 and Programming Environment 3.2. All results reported here are for a single processor.

1.0 Porting notes

Several subroutines in LAPACK 3 are documented as working only on computer systems that support IEEE-754 arithmetic and can propagate NaN's and inf's. These include
SSTEGR
CSTEGR
SLASQ1, SLASQ2, SLASQ3, SLASQ4, SLASQ5, SLASQ6

SLASQ1 through SLASQ4 had implementations in LAPACK 2 that worked on all architectures and were used by the LAPACK routine SBDSQR, but in LAPACK 3, they have a different calling sequence and work only on certain IEEE machines. Both the CRAY C90 and the CRAY T3E are among the machines for which the IEEE-specific algorithms are invalid. These subroutines were improved in an LAPACK 3 update to check an internally specified value for IEEE compliance, but the calling sequence incompatibilities with LAPACK 2 remain. The workaround is to continue to use the libsci version of SBDSQR, which calls an internal libsci routine SLASQ1@, and also to rename SLASQx to SLADQx to avoid further confusion.

A large class of bugs surfaced when testing LAPACK95, which uses the previously untested workspace queries to determine how much workspace to allocate dynamically. In some cases, the workspace calculations were skipped unless LWORK were positive (but LWORK is -1 on a workspace query), while in other cases the argument test for a valid LWORK value followed all the argument tests and was done even if an invalid value of some earlier argument had already been detected. The code for calculating the workspace requirements and the test for a valid LWORK value were all moved inside a block like "IF( INFO.EQ.0 ) THEN" to make the handling of these cases consistent. The affected routines were

CGEES, CGEEV, CGEEVX, CGELSD, CGELSS, CGESDD, CGESVD, CGGES, CGGEV, CGGEVX,
SGEES, SGEEV, SGEEVX, SGELSD, SGELSS, SGESDD, SGESVD, SGGES, SGGEV, SGGEVX
For several routines that did not support workspace queries because the workspace requirements could not be determined exactly, I added the workspace calculations using an upper bound. These routines were
CGEESX, CGGESX, SGEESX, SGGESX
Other bugs in LAPACK 3 include longstanding performance problems with SLASSQ/CLASSQ, excessive work space in the main timing program, and a path name problem that prevented some new timing code from executing.

Bugs in LAPACK 3:

SRC/ilaenv.f: divides by zero (also appears in several other places) Set parameters 10 and 11 to zero.
SRC/cgees.f: test of LWORK is out of order
SRC/cgeesx.f: no workspace query, test of LWORK is out of order
SRC/cgeev.f: test of LWORK is out of order
SRC/cgeevx.f: test of LWORK is out of order
SRC/cgelsd.f: no workspace query
SRC/cgelss.f: test of LWORK is out of order
SRC/cgesdd.f: test of LWORK is out of order
SRC/cgesvd.f: test of LWORK is out of order
SRC/cgges.f: test of LWORK is out of order
SRC/cggesx.f: no workspace query
SRC/cggev.f: test of LWORK is out of order
SRC/cggevx.f: test of LWORK is out of order
SRC/chgeqz.f: one rotation is incorrectly applied
SRC/classq.f: fails to scale small x when SCL = 1.0, SUMSQ = 0.0; pathologically poor performance
SRC/sgees.f: test of LWORK is out of order
SRC/sgeesx.f: no workspace query, test of LWORK is out of order
SRC/sgeev.f: no workspace query, test of LWORK is out of order
SRC/sgeevx.f: test of LWORK is out of order
SRC/sgelsd.f: no workspace query
SRC/sgelss.f: test of LWORK is out of order
SRC/sgesdd.f: no workspace query, test of LWORK is out of order
SRC/sgesvd.f: test of LWORK is out of order
SRC/sgges.f: test of LWORK is out of order
SRC/sggesx.f: no workspace query
SRC/sggev.f: test of LWORK is out of order
SRC/sggevx.f: test of LWORK is out of order
SRC/shgeqz.f: one rotation is incorrectly applied; LDA should be LDB in one place
SRC/slagv2.f: WI must be initialized
SRC/slasq1.f: call to SLASQ2 is incompatible with LAPACK 2 (not used)
SRC/slasq2.f: calling sequence is incompatible with LAPACK 2, renamed sladq2
SRC/slasq3.f: calling sequence is incompatible with LAPACK 2, renamed sladq3
SRC/slasq4.f: calling sequence is incompatible with LAPACK 2, renamed sladq4
SRC/slasq5.f: renamed sladq5
SRC/slasq6.f: renamed sladq6
SRC/slassq.f: fails to scale small x when SCL = 1.0, SUMSQ = 0.0; pathologically poor performance
SRC/slarre.f: call to SLASQ2 changed to SLADQ2
SRC/sstebz.f: FUDGE factor should be at least 2.1 for T3E
SRC/stgex2.f: LDA should be LDB in 3 places
TESTING/EIG/serred.f: summary line is incorrect or incomplete
TESTING/EIG/derred.f: summary line is incorrect or incomplete
TESTING/EIG/cerred.f: summary line is incorrect or incomplete
TESTING/EIG/zerred.f: summary line is incorrect or incomplete
TESTING/EIG/cerrbd.f: summary line is incomplete
TESTING/EIG/zerrbd.f: summary line is incomplete
TIMING/LIN/stimaa.f: can't execute, too much work space for LS tests
TIMING/LIN/dtimaa.f: can't execute, too much work space for LS tests
TIMING/LIN/ctimaa.f: can't execute, too much work space for LS tests
TIMING/LIN/ztimaa.f: can't execute, too much work space for LS tests
TIMING/LIN/stimq3.f: never executed because PATH doesn't match
TIMING/LIN/dtimq3.f: never executed because PATH doesn't match
TIMING/LIN/ctimq3.f: never executed because PATH doesn't match
TIMING/LIN/ztimq3.f: never executed because PATH doesn't match
Some bugs were also found among the libsci LAPACK routines by the new testing software. Fixes are provided in the patch file for the following bugs in libsci:
SRC/sgebal.f: fails LAPACK tests
SRC/slassq@.f: can overflow if SUMSQ is large and X is small
SRC/classq@.f: can overflow if SUMSQ is large and X is small
SRC/sstein.f: LAWN 126 optimizations affect the orthogonality of one test case. There is a big performance trade off for this fix.
SRC/cstein.f: LAWN 126 optimizations affect the orthogonality of one test case. There is a big performance trade off for this fix.
BLAS/SRC/snrm2.f: (PVP only) fails to scale the sum of squares
BLAS/SRC/scnrm2.f: (PVP only) fails to scale the sum of squares

For futher details about SLASSQ, see Section 3.

2.0 Renaming LAPACK routines in libsci

In the Cray Scientific Library version of LAPACK, auxiliary routines that were not thought to be of general interest were renamed to <lapack_name>@ in order to make their entry points internal to the library. These internal entry points were then optimized by removing argument checking and occasionally making algorithmic improvments ([LAWN126]). However, some users have complained about the inconvenience of not being able to call auxiliary routines by their LAPACK names. A related problem is that Cray Scientific Library routines that operate on 64-bit arrays have names that begin with S or C, while other systems require double precision to use 64-bit arithmetic and so may call subroutines with names beginning with D or Z.

These problems can be solved by using loader options to map the calls in a user's application from one name to another, and by using a compiler option to treat all double precision declarations as if they were single precision. For example, a program that declares arrays to be DOUBLE PRECISION and calls DGEMM could be compiled with double precision disabled (f90 -dp) and instructed to replace calls to DGEMM with calls to SGEMM, all without modifying the code. The form of the loader mappings is

PVP (segldr):  f90 -dp -Wl"-Dequiv=SGEMM(DGEMM)"
T3E (cld):     f90 -dp -Wl"-Dequiv(DGEMM)=SGEMM"
Because there are so many such mappings, it is convenient to put them in a file and pass the name of the file to the loader via a f90 command line option, for example, f90 -Wl"blasdp2sp.cld". The following directives files provide some of the most commonly used mappings:
PVP:
blasdp2sp.segldr -- map double precision BLAS names to single
lapackaux.segldr -- map LAPACK@ names to regular LAPACK names
lapackdp2sp.segldr -- map double precision LAPACK names to single (includes BLAS and LAPACK auxiliary routine mappings)
T3E:
blasdp2sp.cld -- map double precision BLAS names to single
lapackaux.cld -- map LAPACK@ names to regular LAPACK names
lapackdp2sp.cld -- map double precision LAPACK names to single (includes BLAS and LAPACK auxiliary routine mappings)

An alternative to using loader mappings for type matching is to specify generic interfaces with Fortran 90 modules. Several projects are considering Fortran 90 interfaces to BLAS and LAPACK subroutines (see, for example, [LAWN134]), but these assume a different interface for the generic routine. For Cray users who do not want to modify their codes, but would like the capability of compile-time argument checking, the compiler flag "-A <modulefilename>" can be used to implicitly invoke a suitably constructed module file. The following Fortran 90 module file contains the Fortran 77 interfaces for the BLAS (no genericity is supplied in this version):

library_blas.f90

3.0 Enhancements to LAPACK

CPTSV/CPTSVX

One oversight in the LAPACK design is the lack of an UPLO argument in the driver routines for solving complex positive definite tridiagonal systems. This was corrected in the Cray Scientific Library version of LAPACK by overloading the interfaces to CPTSV and CPTSVX (using non-portable language extensions) to allow an optional UPLO argument. Now it is possible to do the same thing using Fortran 90. We have also modified the CPT test routine (LAPACK/TESTING/LIN/cdrvpt.f) to exercise the new options. The following versions of CPTSV and CPTSVX are not provided in the patch file because equivalents are already available in the Cray Scientific Library.
TESTING/LIN/cdrvpt2.f
SRC/cptsv.f90
SRC/cptsvx.f90

SLAMCH

Because LAPACK does not use features of Fortran 90, some subroutines are unnecessarily complicated. The most obvious example is SLAMCH, the function to compute machine parameters, which could simply access the Fortran 90 numeric inquiry functions for almost all its values. The only value that does not have a corresponding intrinsic is the rounding mode; for this one can access a C routine to read the ANSI standard identifier FLT_ROUNDS from float.h. This is also implemented in the Cray Scientific Library and is only included here for reference.
SRC/slamch.f90
SRC/rounding_mode.c

SLARTG/CLARTG

The problem of generating Givens rotations accurately and efficiently is a subject in itself. Most of the existing software to compute these simple transformations, including the versions in the BLAS, EISPACK, LAPACK, and libsci, use different formulas depending on the relative sizes of the inputs. These differences create lines of discontinuity in the function domain, which makes the algorithm sensitive to small perturbations in the input data or small errors made in the course of the computation.

The discontinuities can be removed by choosing the signs of the computed cosine and sine to always make the output value r positive. In the real case, the new version of SLARTG can be described by

   [  c  s ][ f ] = [ r ],   c**2 + s**2 = 1, r >= 0
   [ -s  c ][ g ]   [ 0 ]
while in the complex case, where c is real but s is complex, the new version of CLARTG can be described by
   [  c         s ][ f ] = [ r ],  c**2 + s*conjg(s) = 1, Re(r) >= 0
   [ -conjg(s)  c ][ g ]   [ 0 ]
The following versions of the routines to generate Givens rotations maintain the desired continuity and pass all the LAPACK tests.
SRC/slartg.f
SRC/clartg.f
SRC/slargv.f
SRC/clargv.f

SLASSQ

A common mistake in numerical library development is implementing the 2-norm as the square root of a sum of squares without scaling the sum of squares. This reduces the usable range of floating point numbers to values greater than approximately sqrt(underflow) and less than approximately sqrt(overflow). LAPACK includes a scaled sum of squares routine (SLASSQ), which returns a pair of values SCALE and SUMSQ such that
   ( SCALE**2 )*SUMSQ = x( 1 )**2 +...+ x( n )**2 + ( scl**2 )*smsq
where scl and smsq are the input values of SCALE and SUMSQ. However, the implementation of SLASSQ in LAPACK (and duplicated in a recent revision to the netlib BLAS version of SNRM2) has pathologically poor performance ([LAWN126]). Furthermore, LAPACK's SLASSQ has a bug in that it fails to scale the sum of squares when SCL = 1.0, SMSQ = 0.0, and X is small in magnitude. An alternative presented in [LAWN126], which was used in the Cray Scientific Library under the name SLASSQ@, also has a bug which causes it to overflow if SCL is large on input and X is small in magnitude. SLASSQ@ is usually called in libsci with SCL = 1.0, so this bug is unlikely to occur, but we have replaced both SLASSQ and SLASSQ@ with new versions in the patch file.

The main use of SLASSQ and its complex equivalent CLASSQ is for computing the 2-norm of a vector, either explicitly or through the BLAS interfaces SNRM2 and SCNRM2. In these situations, LAPACK routines call SLASSQ with SCL = 0.0 and SMSQ = 1.0, although we think SCL = 1.0 and SMSQ = 0.0 is more intuitive. Our modified version of SLASSQ works correctly for either combination of SCL and SMSQ, but is more efficient when SCL = 1.0 and SMSQ = 0.0. Several calling sites in LAPACK have been modified to reflect this preferred usage. In addition, the subroutine SLATDF has been modified to call SLASSQ instead of the unscaled function SDOT that it called before.

SRC/slassq.f
SRC/classq.f
BLAS/SRC/snrm2.f (required on PVP only)
BLAS/SRC/scnrm2.f (required on PVP only)

General cleanup

The newer LAPACK routines use SCOPY from the Level 1 BLAS to initialize a vector to a scalar by specifying a zero increment for the first vector argument. For example, SLASY2 contains the lines
      BTMP( 1 ) = ZERO
      CALL SCOPY( 16, BTMP, 0, T16, 1 )
where T16 is a 4-by-4 REAL array. Most implementations of Level 1 BLAS accept a zero increment, but the Cray man pages say results for zero increments may be unpredictable. Since there is no reason to avoid Fortran 90 constructs on Cray platforms, calls such as these have been changed to direct assignments:
      T16 = ZERO
The subroutines CTGSYL and CTGSY2 call CSCAL to scale a complex array by the value CMPLX( SCALOC, ZERO ). This operation really should be performed by CSSCAL.

SRC files changed:

   slasy2.f
   clatdf.f   slatdf.f
   ctgex2.f   stgex2.f
   ctgsen.f   stgsen.f
   ctgsy2.f   stgsy2.f
   ctgsyl.f   stgsyl.f
The subroutines SHGEQZ and CHGEQZ had several loops which could be transformed to calls to SROT and CROT. Also, calls in STGEX2 to the undocumented Level 2 BLAS versions of several LAPACK routines were changed to use the standard interface.

4.0 Test suite improvements

In early 1996 I provided updated versions of the TESTING/LIN software to implement the following new features: Only the first of these was adopted, but the patch file reimplements the remaining features. In addition, the latest patch file changes calls in the test code to the undocumented Level 2 BLAS versions of LAPACK routines, such as SORM2R and CGEBD2, to calls to the standard Level 3 BLAS versions such as SORMQR and CGEBRD. This change required adding an LWORK argument to the calling list of xCHKTZ, with corresponding changes at the calling site in xCHKAA.

Modified TESTING/LIN routines:

   alahd.f

   sbrdt1.f   schkaa.f   schkgb.f   schkge.f   schkgt.f
   schklq.f   schkop.f   schkpo.f   schkpp.f   schkpt.f
   schkql.f   schkqp.f   schkqr.f   schkrq.f   schksp.f
   schksy.f   schktb.f   schktp.f   schktr.f   schktz.f
   sckbrd.f   sckhrd.f   scktrd.f   serrqr.f   serrrd.f
   shrdt1.f   sorgt1.f   sqrt11.f   sqrt12.f   sqrt14.f
   strdt1.f

   cbrdt1.f   cchkaa.f   cchkgb.f   cchkge.f   cchkgt.f
   cchkhe.f   cchkhp.f   cchklq.f   cchkpb.f   cchkpo.f
   cchkpp.f   cchkpt.f   cchkql.f   cchkqp.f   cchkqr.f
   cchkrq.f   cchksp.f   cchksy.f   cchktb.f   cchktp.f
   cchktr.f   cchktz.f   cchkup.f   cckbrd.f   cckhrd.f
   ccktrd.f   cerrqr.f   cerrrd.f   cerrtr.f   chrdt1.f
   cqrt11.f   cqrt12.f   cqrt14.f   ctrdt1.f   cungt1.f

   + d and z versions
Several test routines in the TESTING/EIG directory still call SLASUM and DLASUM, old versions of the subroutine that prints a summary line in the output. These have been changed to call ALASUM for consistency.

TESTING/EIG routines:

   schkbb.f   sdrves.f   cchkbb.f   cdrves.f 
   schkgg.f   sdrvev.f   cchkgg.f   cdrvev.f
   schkhs.f   sdrvsg.f   cchkhb.f   cdrvsg.f
   schksb.f   sdrvsx.f   cchkhs.f   cdrvsx.f
   schkst.f   sdrvvx.f   cchkst.f   cdrvvx.f

   + d and z versions
Also, the test of the error exits was always on in the tests of the EC routines, as can be seen from the following lines from xCHKEE:
         READ( NIN, FMT = * )THRESH
         TSTERR = .TRUE.
         CALL SCHKEC( THRESH, TSTERR, NIN, NOUT )
This was changed to read the value of TSTERR from the input file, as follows:
         READ( NIN, FMT = * )THRESH, TSTERR
         CALL SCHKEC( THRESH, TSTERR, NIN, NOUT )
There is no change in behavior for the input files provided, which have a line like this for the threshold:
20.0            Threshold value for test ratios
The T of the comment is interpreted as a value of "true" for the logical variable TSTERR. But now we can turn off the test of the error exits (for example, in the dec.in file where we would get many spurious error messages due to the loader renaming) as follows:
20.0   F        Threshold value for test ratios
Modified input files dec.in and zec.in are provided in the TESTING directory.

New scripts called sgo, dgo, cgo, and zgo have been added to the TESTING directory for interactively running the eigenvalue tests. These are not used by the main Makefiles.

Many LAPACK routines and test routines make calls to the undocumented Level 2 BLAS versions of LAPACK routines instead of their standard LAPACK interfaces. Wherever possible, these calls were changed to use the standard interfaces. Some of the routines changed in this way include

TESTING/LIN directory:

   cchkaa.f   schkaa.f
   cchktz.f   schktz.f
   cqrt11.f   sqrt11.f
   cqrt12.f   sqrt12.f
   cqrt14.f   sqrt14.f

TESTING/EIG directory:

   cchkgg.f   schkgg.f
   cchkst.f   schkst.f
   cdrges.f   sdrges.f
   cdrgev.f   sdrgev.f
   cdrvgg.f   sdrvgg.f
   cdrvst.f   sdrvst.f
   chet21.f   ssyt21.f

In the timing half of the package, the LAPACK 3 distribution sets the minimum time in the "small" timing tests to 0.05 seconds, which is a very long time on a modern machine. New input files are provided in the patch file to set it back to 0.0 seconds.

5.0 Installing the libsci LAPACK 3 supplement

  1. Copy the lapack.tgz file from netlib and the patch file indicated here:
    LAPACK3patch_120400.tar.gz
  2. Unzip the tar files and execute the following commands:
       tar xvf lapack.tar
       tar xvf LAPACK3patch_120400.tar
    
  3. Edit the file LAPACK/make.t3e or LAPACK/make.pvp to set the environment variable LAPACKPATH to the location of your LAPACK directory.
  4. Copy make.t3e or make.pvp to make.inc (make.inc is the same as make.t3e initially).
  5. If you are compiling on a T3E, do this:
       cd LAPACK; make
    
    or if you are compiling on a PVP platform, do this:
       cd LAPACK; make -f Makefile.pvp
    

The Makefiles create a library

   LAPACK/lib/liblapack3.a
which can be linked before the usual libsci.a to resolve references to those subroutines new to LAPACK 3. A set of man pages for the LAPACK 3 supplement (cleaned up from the automatically-generated man pages on netlib) plus pages for 32 auxiliary routines that could be of interest to users are found in the directory
   LAPACK/man
created by the patch file.

The easiest way to add new libraries and man pages to one's search paths on a Cray machine is by creating and loading a module. Here is a sample module file for the LAPACK 3 supplement:

lapack3
With this module loaded, the compile line to link the LAPACK 3 supplement would be
   f90 -llapack3 ...
One caveat is that the supplementary library includes a few subroutines that were not designed to run on Cray platforms and have not been fully tested. They are included for completeness and should be regarded as research codes for experimentation and further study.

6.0 Outstanding test failures

The October 31 update modified the input files sgd.in, cgd.in, etc. so that only matrices of order 2 or smaller were tested. I found this too restrictive and changed it back. The test failures for these cases are documented below. All other tests with the standard input files passed.

CRAY T3E test failures:

sgd.out
dgd.out
cgd.out
zgd.out
CRAY C90 test failures:
sgd.out
dgd.out
cgd.out
zgd.out

References

[LAPACK]
E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, Third Edition, SIAM, Philadelphia, 1999.

[LAWN126]
Edward Anderson and Mark Fahey, Performance Improvements to LAPACK for the Cray Scientific Library, LAPACK Working Note 126, University of Tennessee, CS-97-359, April 1997.

[LAWN134]
Jerzy Wasniewski and Jack Dongarra, High Performance Linear Algebra Package - LAPACK90, LAPACK Working Note 134, University of Tennessee, CS-98-383, March 1998.

Maintained by Ed Anderson. Please send comments to eanderso@cs.utk.edu.
Created: December 28, 1999
Last Modified: December 4, 2000