These are release notes: http://www.cs.utk.edu/~ghenry/distrib/blasnews

3/20/98: The Single CPU Pentium Pro Linux BLAS 1.1b became available.
4/15/98: The Single CPU Pentium Pro Linux BLAS 1.1c became available.
6/01/98: The Single CPU Pentium Pro Linux BLAS 1.1d became available.
6/01/98: The Single CPU Pentium Linux BLAS 1.1d became available in beta.
6/09/98: The Single CPU Pentium Pro Linux BLAS 1.1e became available.
6/10/98: Testing and instructions were built for running this on Solaris 2.6. See the latest download file.
8/20/98: Version 1.1H of the BLAS was released for single and dual processor Pentium Pros.
10/13/98: Version 1.1J of the BLAS was released for single and dual processor Pentium Pros.
10/16/98: Version 1.1K of the BLAS was released for single and dual processor Pentium Pros.
11/06/98: Version 1.1L of the BLAS was released for single and dual processor Pentium Pros.
11/16/98: Version 1.1M of the BLAS was released for single and dual processor Pentium Pros.
01/12/99: Version 1.1N of the BLAS was released for single and dual processor Pentium Pros.
08/27/99: Version 1.1O of the BLAS was released for single and dual processor Pentium Pros.
09/08/99: http://www.cs.utk.edu/~ghenry/distrib/linux_cop was added.
10/20/99: Version 1.2A of the BLAS was released for single and dual processor Pentium IIs.
11/15/99: Version 1.2B of the BLAS was released for single and dual processor Pentium IIs.
11/30/99: Version 1.2C of the BLAS was built for single and dual processor Pentium IIs.
12/10/99: Version 1.2D of the BLAS was built for single and dual processor Pentium IIs.

Plans:
1.) Dual processor stuff needs more testing and work.
2.) Pentium III BLAS has been built! I'm just testing now!

Changes in 1.2e and 1.2f (03/24/00):
1.) SGEMM made faster.
2.) I accidentally broke reentrancy in DGEMM/ZGEMM in 1.2b; this has been fixed in this version.
3.) Shared libraries are now being distributed.

Changes in 1.2c and 1.2d (12/10/99):
1.) My DTRSM stuff (which for some reason is *slower* on Pentium IIs but faster on Pentium II Xeons) had an unresolved external reference. I'm still investigating the performance implications.
2.) If the B matrix in DGEMM/ZGEMM was not on an 8-byte boundary, things now go faster.
3.) Large DGEMMs were made to go 1% faster.
4.) Small N was made to go faster in DGEMM.
5.) I'm working on faster cases for M=3 and 5.

Changes in 1.2b (11/15/99):
1.) Try the DTRSM LLNU case for M >= 32: it's between 1.5x and 3.0x as fast! Unfortunately, the level-3 BLAS timers from netlib time the "non-unit" case, so you won't see this case, which is critically important for LU.
2.) Some bugfixes were done for shared libraries (still in beta).

Changes in 1.2a (10/20/99):
1.) datablas.a is created instead of what version 1.1o did. If you want data stored in the data section (single processor only), link in this library first.
2.) There were unresolved externals that have been removed.
3.) DGEMM tunings were done for varying outer product updates.

Changes in 1.1o (08/27/99):
1.) DGEMM and ZGEMM return data to the data section instead of the stack in the single processor versions. If this is a problem, please let me know. Too many people complained that they didn't know how to align their stack in Linux, and I'm not sure I know either.

Changes in 1.1n (01/12/99):
1.) If your input matrix was filled with NaNs, but you chose to "overwrite" it by multiplying by beta=0 in the level-3 BLAS, versions 1.1d through 1.1m had a bug: they actually did the multiply, resulting in more NaNs, rather than having a special beta=0 case. This version fixes this and is now faster for beta=0.

Changes in 1.1m (11/16/98):
1.) Solaris x86 linking bug fixed for the xCOPY family.

Changes in 1.1l (11/06/98):
1.) Bug in zrotg was fixed.

Changes in 1.1k (10/16/98):
1.) Bug in crotg was fixed.

Changes in 1.1j (10/13/98):
1.) Enhanced DZASUM and the border cases of DGEMM.
2.) Improved all single precision (real and complex) level-3 libraries.
3.) Compiled all the BLAS reentrant, and placed all temporary variables on the stack for better threaded use.
4.) Vastly improved dual processor single precision real codes.

Changes in 1.1h (8/20/98):
1.) Fixed a bug in ZGEMM.
2.) Made performance enhancements for DGEMM.
3.) Did further dual processor testing.
4.) Enhanced DCOPY, SCOPY, CCOPY, and ZCOPY.

Changes in 1.1e (6/10/98):
1.) Added zrotg & crotg to the distribution.
2.) Added notes about running on Solaris.

Changes in 1.1d (6/1/98):
1.) More ZGEMM enhancements.
2.) DGEMV bug fix: was storing data past the end of Y in the transpose case when n modulo 4 was 3.
3.) Some ZTRSM enhancements.
4.) Added the CBLAS.

Optimizations in 1.1c (4/15/98):
1.) DGEMM & ZGEMM improvements for small matrices when BETA != 1.
2.) ZGEMM inner loop enhancements.
3.) ZGEMM bug fix: was referencing past the end of the B matrix on a prefetch.
4.) DGEMM blocking improvements.

Notes about 1.1n:
1.) The notes for 1.1h below apply, plus this is the first stack-based version of this library.
2.) ssyr2k, zher2k, zsyr2k, cher2k, csyr2k, ctpsv, and ztpsv were not compiled reentrant.
3.) Stack alignment might still be an issue.
4.) Some have reported dual processor issues. I need more test time on dual processor blocks.

Notes about 1.1h:
1.) Codes perform better when the majority of the data written out is initially cache-line aligned. In most linear algebra codes, you can cache-line align to 32 bytes and, with an even leading dimension and blocking factor, usually maintain this.
2.) Outer product updates for DGEMM are optimized for K=32, 64, and sometimes 128.
3.) Outer product updates for ZGEMM are optimized for K=32, 64.
4.) There is a known problem in IZAMAX. IZAMAX treats (.1,.4) as having a different magnitude (due to roundoff!) than (.4,.1). IZAMAX is supposed to return the first index of maximum value, and it may instead return a later index because rounding errors suggest the later index is larger. The only solution to this problem is to force roundings.
5.) There is a known unresolved external in crotg.o. I've temporarily lost my Linux machine, and I'm awaiting its return to retest the fix.

Future optimizations/fixes:
1.) Better handling of BETA != 1 for DGEMM, ZGEMM.
2.) Making DGEMM performance more uniform over the various cases.
3.) Making outer product updates perform better over all cases.
4.) Decrease the code's sensitivity to cache-line alignment.
5.) Fix the problem in IZAMAX.
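
Illustration of the beta=0 fix noted under "Changes in 1.1n": the following is only a minimal sketch in plain Python, not the library's code. The function name gemm and the flag special_case_beta_zero are hypothetical, introduced here to contrast the two behaviors. It computes alpha*A*B + beta*C on small row-major lists and shows why a C matrix full of NaNs must be overwritten, not multiplied, when beta is 0.

```python
import math

def gemm(alpha, a, b, beta, c, special_case_beta_zero=True):
    """Return alpha*A*B + beta*C for row-major lists of lists.

    Hypothetical helper for illustration only; not part of any BLAS release.
    """
    m, k, n = len(a), len(b), len(b[0])
    result = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            dot = sum(a[i][p] * b[p][j] for p in range(k))
            if special_case_beta_zero and beta == 0.0:
                # C is overwritten and never read, so NaNs already
                # stored in C cannot leak into the result.
                result[i][j] = alpha * dot
            else:
                # Always multiplying means 0.0 * NaN, which is still NaN.
                result[i][j] = alpha * dot + beta * c[i][j]
    return result

nan = float("nan")
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C_nan = [[nan, nan], [nan, nan]]

fixed = gemm(1.0, A, B, 0.0, C_nan)
buggy = gemm(1.0, A, B, 0.0, C_nan, special_case_beta_zero=False)

print(fixed)  # [[19.0, 22.0], [43.0, 50.0]]
print(all(math.isnan(x) for row in buggy for x in row))  # True
```

Skipping the read of C when beta is 0 also saves a load and a multiply per element, which is consistent with the note that the fixed version is faster for beta=0.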