This report contains my porting notes, errata, and comments on the LAPACK 3 distribution, dated June 30, 1999, and the first update dated October 31, 1999, with particular recommendations for porting to a CRAY PVP or CRAY T3E system. Opinions, recommendations, and software modifications presented here are not necessarily endorsed by the other LAPACK authors or contributors. See the official LAPACK release notes (http://www.netlib.org/lapack/release_notes) for the current errata list from the LAPACK group. Also, please see the README.patch file from my patch file for a disclaimer if you are unclear on the concept of public domain software.
LAPACK 3 is the third major release of software from the LAPACK project, which began in 1987 to develop a library of Fortran 77 subroutines for solving the most common problems in numerical linear algebra. The building blocks for LAPACK are the Basic Linear Algebra Subprograms (BLAS) for vector, matrix-vector, and matrix-matrix operations. The Cray Scientific Library (libsci) already contains the BLAS and LAPACK 2 software, with some performance improvements that are documented in LAPACK Working Note 126 (LAWN 126), Performance Improvements to LAPACK for the Cray Scientific Library.
My objective was to create a library of new software from the LAPACK 3 distribution to augment the existing library. I did not replace LAPACK routines already in libsci unless they failed tests in the latest LAPACK or LAPACK95 test suites. Consequently, some of the new features of LAPACK 3 are not available via the supplemental library. For example, all LAPACK 3 routines that require workspace now allow a "workspace query", indicated by specifying a size of -1 for the LWORK or LIWORK argument. If such a value is input, the LAPACK 3 routines determine the optimal amount of work space and return it in WORK(1) or IWORK(1). This feature is not exercised by the LAPACK test software, so it was not necessary to replace the LAPACK 2 versions that lack it in order to pass the tests.
The outline of this report is as follows:
Test results are from the CRAY systems at the National Environmental Supercomputing Center (NESC), operated by Lockheed Martin for the U. S. Environmental Protection Agency. The CRAY C90 is a 4-processor system running Unicos 10.0 and Programming Environment 3.2. The CRAY T3E is a 72-processor CRAY T3E-1200 system running Unicos/mk 2.0.4 and Programming Environment 3.2. All results reported here are for a single processor.
SLASQ1 through SLASQ4 had implementations in LAPACK 2 that worked on all architectures and were used by the LAPACK routine SBDSQR, but in LAPACK 3, they have a different calling sequence and work only on certain IEEE machines. Both the CRAY C90 and the CRAY T3E are among the machines for which the IEEE-specific algorithms are invalid. These subroutines were improved in an LAPACK 3 update to check an internally specified value for IEEE compliance, but the calling sequence incompatibilities with LAPACK 2 remain. The workaround is to continue to use the libsci version of SBDSQR, which calls an internal libsci routine SLASQ1@, and also to rename SLASQx to SLADQx to avoid further confusion.
A large class of bugs surfaced when testing LAPACK95, which uses the previously untested workspace queries to determine how much workspace to allocate dynamically. In some cases, the workspace calculations were skipped unless LWORK were positive (but LWORK is -1 on a workspace query), while in other cases the argument test for a valid LWORK value followed all the argument tests and was done even if an invalid value of some earlier argument had already been detected. The code for calculating the workspace requirements and the test for a valid LWORK value were all moved inside a block like "IF( INFO.EQ.0 ) THEN" to make the handling of these cases consistent. The affected routines were
Bugs in LAPACK 3:
SRC/ilaenv.f: divides by zero (also appears in several other places)
Set parameters 10 and 11 to zero.
SRC/cgees.f: test of LWORK is out of order
SRC/cgeesx.f: no workspace query, test of LWORK is out of order
SRC/cgeev.f: test of LWORK is out of order
SRC/cgeevx.f: test of LWORK is out of order
SRC/cgelsd.f: no workspace query
SRC/cgelss.f: test of LWORK is out of order
SRC/cgesdd.f: test of LWORK is out of order
SRC/cgesvd.f: test of LWORK is out of order
SRC/cgges.f: test of LWORK is out of order
SRC/cggesx.f: no workspace query
SRC/cggev.f: test of LWORK is out of order
SRC/cggevx.f: test of LWORK is out of order
SRC/chgeqz.f: one rotation is incorrectly applied
SRC/classq.f: fails to scale small x when SCL = 1.0, SUMSQ = 0.0;
pathologically poor performance
SRC/sgees.f: test of LWORK is out of order
SRC/sgeesx.f: no workspace query, test of LWORK is out of order
SRC/sgeev.f: no workspace query, test of LWORK is out of order
SRC/sgeevx.f: test of LWORK is out of order
SRC/sgelsd.f: no workspace query
SRC/sgelss.f: test of LWORK is out of order
SRC/sgesdd.f: no workspace query, test of LWORK is out of order
SRC/sgesvd.f: test of LWORK is out of order
SRC/sgges.f: test of LWORK is out of order
SRC/sggesx.f: no workspace query
SRC/sggev.f: test of LWORK is out of order
SRC/sggevx.f: test of LWORK is out of order
SRC/shgeqz.f: one rotation is incorrectly applied; LDA should be LDB in one place
SRC/slagv2.f: WI must be initialized
SRC/slasq1.f: call to SLASQ2 is incompatible with LAPACK 2 (not used)
SRC/slasq2.f: calling sequence is incompatible with LAPACK 2, renamed
sladq2
SRC/slasq3.f: calling sequence is incompatible with LAPACK 2, renamed
sladq3
SRC/slasq4.f: calling sequence is incompatible with LAPACK 2, renamed
sladq4
SRC/slasq5.f: renamed sladq5
SRC/slasq6.f: renamed sladq6
SRC/slassq.f: fails to scale small x when SCL = 1.0, SUMSQ = 0.0;
pathologically poor performance
SRC/slarre.f: call to SLASQ2 changed to SLADQ2
SRC/sstebz.f: FUDGE factor should be at least 2.1 for T3E
SRC/stgex2.f: LDA should be LDB in 3 places
TESTING/EIG/serred.f: summary line is incorrect or incomplete
TESTING/EIG/derred.f: summary line is incorrect or incomplete
TESTING/EIG/cerred.f: summary line is incorrect or incomplete
TESTING/EIG/zerred.f: summary line is incorrect or incomplete
TESTING/EIG/cerrbd.f: summary line is incomplete
TESTING/EIG/zerrbd.f: summary line is incomplete
TIMING/LIN/stimaa.f: can't execute, too much work space for LS tests
TIMING/LIN/dtimaa.f: can't execute, too much work space for LS tests
TIMING/LIN/ctimaa.f: can't execute, too much work space for LS tests
TIMING/LIN/ztimaa.f: can't execute, too much work space for LS tests
TIMING/LIN/stimq3.f: never executed because PATH doesn't match
TIMING/LIN/dtimq3.f: never executed because PATH doesn't match
TIMING/LIN/ctimq3.f: never executed because PATH doesn't match
TIMING/LIN/ztimq3.f: never executed because PATH doesn't match
SRC/sgebal.f: fails LAPACK tests
SRC/slassq@.f: can overflow if SUMSQ is large and X is small
SRC/classq@.f: can overflow if SUMSQ is large and X is small
SRC/sstein.f: LAWN 126 optimizations affect the orthogonality of one
test case. There is a big performance trade off for this fix.
SRC/cstein.f: LAWN 126 optimizations affect the orthogonality of one
test case. There is a big performance trade off for this fix.
BLAS/SRC/snrm2.f: (PVP only) fails to scale the sum of squares
BLAS/SRC/scnrm2.f: (PVP only) fails to scale the sum of squares
For futher details about SLASSQ, see Section 3.
These problems can be solved by using loader options to map the calls
in a user's application from one name to another, and by using a
compiler option to treat all double precision declarations as if they
were single precision. For example, a program that declares arrays to
be DOUBLE PRECISION and calls DGEMM could be compiled with double
precision disabled (f90 -dp) and instructed to replace
calls to DGEMM with calls to SGEMM, all without modifying the code.
The form of the loader mappings is
PVP (segldr): f90 -dp -Wl"-Dequiv=SGEMM(DGEMM)" T3E (cld): f90 -dp -Wl"-Dequiv(DGEMM)=SGEMM"Because there are so many such mappings, it is convenient to put them in a file and pass the name of the file to the loader via a f90 command line option, for example,
f90 -Wl"blasdp2sp.cld".
The following directives files provide some of the most commonly used
mappings:
blasdp2sp.segldr --
map double precision BLAS names to single
lapackaux.segldr --
map LAPACK@ names to regular LAPACK names
lapackdp2sp.segldr --
map double precision LAPACK names to single
(includes BLAS and LAPACK auxiliary routine mappings)
blasdp2sp.cld --
map double precision BLAS names to single
lapackaux.cld --
map LAPACK@ names to regular LAPACK names
lapackdp2sp.cld --
map double precision LAPACK names to single
(includes BLAS and LAPACK auxiliary routine mappings)
An alternative to using loader mappings for type matching is to specify generic interfaces with Fortran 90 modules. Several projects are considering Fortran 90 interfaces to BLAS and LAPACK subroutines (see, for example, [LAWN134]), but these assume a different interface for the generic routine. For Cray users who do not want to modify their codes, but would like the capability of compile-time argument checking, the compiler flag "-A <modulefilename>" can be used to implicitly invoke a suitably constructed module file. The following Fortran 90 module file contains the Fortran 77 interfaces for the BLAS (no genericity is supplied in this version):
library_blas.f90
TESTING/LIN/cdrvpt2.f
SRC/cptsv.f90
SRC/cptsvx.f90
SRC/slamch.f90
SRC/rounding_mode.c
The discontinuities can be removed by choosing the signs of the computed cosine and sine to always make the output value r positive. In the real case, the new version of SLARTG can be described by
[ c s ][ f ] = [ r ], c**2 + s**2 = 1, r >= 0 [ -s c ][ g ] [ 0 ]while in the complex case, where c is real but s is complex, the new version of CLARTG can be described by
[ c s ][ f ] = [ r ], c**2 + s*conjg(s) = 1, Re(r) >= 0 [ -conjg(s) c ][ g ] [ 0 ]The following versions of the routines to generate Givens rotations maintain the desired continuity and pass all the LAPACK tests.
SRC/slartg.f
SRC/clartg.f
SRC/slargv.f
SRC/clargv.f
( SCALE**2 )*SUMSQ = x( 1 )**2 +...+ x( n )**2 + ( scl**2 )*smsqwhere scl and smsq are the input values of SCALE and SUMSQ. However, the implementation of SLASSQ in LAPACK (and duplicated in a recent revision to the netlib BLAS version of SNRM2) has pathologically poor performance ([LAWN126]). Furthermore, LAPACK's SLASSQ has a bug in that it fails to scale the sum of squares when SCL = 1.0, SMSQ = 0.0, and X is small in magnitude. An alternative presented in [LAWN126], which was used in the Cray Scientific Library under the name SLASSQ@, also has a bug which causes it to overflow if SCL is large on input and X is small in magnitude. SLASSQ@ is usually called in libsci with SCL = 1.0, so this bug is unlikely to occur, but we have replaced both SLASSQ and SLASSQ@ with new versions in the patch file.
The main use of SLASSQ and its complex equivalent CLASSQ is for computing the 2-norm of a vector, either explicitly or through the BLAS interfaces SNRM2 and SCNRM2. In these situations, LAPACK routines call SLASSQ with SCL = 0.0 and SMSQ = 1.0, although we think SCL = 1.0 and SMSQ = 0.0 is more intuitive. Our modified version of SLASSQ works correctly for either combination of SCL and SMSQ, but is more efficient when SCL = 1.0 and SMSQ = 0.0. Several calling sites in LAPACK have been modified to reflect this preferred usage. In addition, the subroutine SLATDF has been modified to call SLASSQ instead of the unscaled function SDOT that it called before.
SRC/slassq.f
SRC/classq.f
BLAS/SRC/snrm2.f (required on PVP only)
BLAS/SRC/scnrm2.f (required on PVP only)
BTMP( 1 ) = ZERO
CALL SCOPY( 16, BTMP, 0, T16, 1 )
where T16 is a 4-by-4 REAL array.
Most implementations of Level 1 BLAS accept a zero increment, but the
Cray man pages say results for zero increments may be unpredictable.
Since there is no reason to avoid
Fortran 90 constructs on Cray platforms, calls such as these have been
changed to direct assignments:
T16 = ZERO
The subroutines CTGSYL and CTGSY2 call CSCAL to scale a complex array by
the value CMPLX( SCALOC, ZERO ). This operation really should be
performed by CSSCAL.
SRC files changed:
slasy2.f clatdf.f slatdf.f ctgex2.f stgex2.f ctgsen.f stgsen.f ctgsy2.f stgsy2.f ctgsyl.f stgsyl.fThe subroutines SHGEQZ and CHGEQZ had several loops which could be transformed to calls to SROT and CROT. Also, calls in STGEX2 to the undocumented Level 2 BLAS versions of several LAPACK routines were changed to use the standard interface.
Modified TESTING/LIN routines:
alahd.f sbrdt1.f schkaa.f schkgb.f schkge.f schkgt.f schklq.f schkop.f schkpo.f schkpp.f schkpt.f schkql.f schkqp.f schkqr.f schkrq.f schksp.f schksy.f schktb.f schktp.f schktr.f schktz.f sckbrd.f sckhrd.f scktrd.f serrqr.f serrrd.f shrdt1.f sorgt1.f sqrt11.f sqrt12.f sqrt14.f strdt1.f cbrdt1.f cchkaa.f cchkgb.f cchkge.f cchkgt.f cchkhe.f cchkhp.f cchklq.f cchkpb.f cchkpo.f cchkpp.f cchkpt.f cchkql.f cchkqp.f cchkqr.f cchkrq.f cchksp.f cchksy.f cchktb.f cchktp.f cchktr.f cchktz.f cchkup.f cckbrd.f cckhrd.f ccktrd.f cerrqr.f cerrrd.f cerrtr.f chrdt1.f cqrt11.f cqrt12.f cqrt14.f ctrdt1.f cungt1.f + d and z versionsSeveral test routines in the TESTING/EIG directory still call SLASUM and DLASUM, old versions of the subroutine that prints a summary line in the output. These have been changed to call ALASUM for consistency.
TESTING/EIG routines:
schkbb.f sdrves.f cchkbb.f cdrves.f schkgg.f sdrvev.f cchkgg.f cdrvev.f schkhs.f sdrvsg.f cchkhb.f cdrvsg.f schksb.f sdrvsx.f cchkhs.f cdrvsx.f schkst.f sdrvvx.f cchkst.f cdrvvx.f + d and z versionsAlso, the test of the error exits was always on in the tests of the EC routines, as can be seen from the following lines from xCHKEE:
READ( NIN, FMT = * )THRESH
TSTERR = .TRUE.
CALL SCHKEC( THRESH, TSTERR, NIN, NOUT )
This was changed to read the value of TSTERR from the input file,
as follows:
READ( NIN, FMT = * )THRESH, TSTERR
CALL SCHKEC( THRESH, TSTERR, NIN, NOUT )
There is no change in behavior for the input files provided, which
have a line like this for the threshold:
20.0 Threshold value for test ratiosThe T of the comment is interpreted as a value of "true" for the logical variable TSTERR. But now we can turn off the test of the error exits (for example, in the dec.in file where we would get many spurious error messages due to the loader renaming) as follows:
20.0 F Threshold value for test ratiosModified input files dec.in and zec.in are provided in the TESTING directory.
New scripts called sgo, dgo, cgo, and zgo have been added to the TESTING directory for interactively running the eigenvalue tests. These are not used by the main Makefiles.
Many LAPACK routines and test routines make calls to the undocumented Level 2 BLAS versions of LAPACK routines instead of their standard LAPACK interfaces. Wherever possible, these calls were changed to use the standard interfaces. Some of the routines changed in this way include
TESTING/LIN directory:
cchkaa.f schkaa.f cchktz.f schktz.f cqrt11.f sqrt11.f cqrt12.f sqrt12.f cqrt14.f sqrt14.f
TESTING/EIG directory:
cchkgg.f schkgg.f cchkst.f schkst.f cdrges.f sdrges.f cdrgev.f sdrgev.f cdrvgg.f sdrvgg.f cdrvst.f sdrvst.f chet21.f ssyt21.f
In the timing half of the package, the LAPACK 3 distribution sets the minimum time in the "small" timing tests to 0.05 seconds, which is a very long time on a modern machine. New input files are provided in the patch file to set it back to 0.0 seconds.
tar xvf lapack.tar tar xvf LAPACK3patch_120400.tar
LAPACK/make.t3e or
LAPACK/make.pvp to set the environment variable
LAPACKPATH to the location of your LAPACK directory.
make.t3e or make.pvp to
make.inc (make.inc is the same as
make.t3e initially).
cd LAPACK; makeor if you are compiling on a PVP platform, do this:
cd LAPACK; make -f Makefile.pvp
The Makefiles create a library
LAPACK/lib/liblapack3.awhich can be linked before the usual libsci.a to resolve references to those subroutines new to LAPACK 3. A set of man pages for the LAPACK 3 supplement (cleaned up from the automatically-generated man pages on netlib) plus pages for 32 auxiliary routines that could be of interest to users are found in the directory
LAPACK/mancreated by the patch file.
The easiest way to add new libraries and man pages to one's search paths on a Cray machine is by creating and loading a module. Here is a sample module file for the LAPACK 3 supplement:
lapack3
f90 -llapack3 ...One caveat is that the supplementary library includes a few subroutines that were not designed to run on Cray platforms and have not been fully tested. They are included for completeness and should be regarded as research codes for experimentation and further study.
CRAY T3E test failures:
sgd.out
dgd.out
cgd.out
zgd.out
sgd.out
dgd.out
cgd.out
zgd.out