PARA'04 State-of-the-Art
in Scientific Computing
June 20-23, 2004 (Home page)

Updated: 9 February 2004

An Implementation of Parallel 3-D FFT Using Short Vector SIMD Instructions on Clusters of PCs

Daisuke Takahashi, Taisuke Boku and Mitsuhisa Sato
Institute of Information Sciences and Electronics
University of Tsukuba
1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan
E-mails: {daisuke,taisuke,msato}@is.tsukuba.ac.jp

In this paper, we propose an implementation of parallel three-dimensional fast Fourier transform (FFT) using short vector SIMD instructions on clusters of PCs.

The FFT is one of the most widely used algorithms in the field of scientific computing. Parallel three-dimensional FFT algorithms on distributed-memory parallel computers have been well studied.

Today, many processors have short vector SIMD instructions, e.g., Intel's SSE/SSE2, AMD's 3DNow!, and Motorola's Altivec. These instructions provide substantial speedup for digital signal processing applications.

Many FFT algorithms work well for data sets that fit into a cache. When a problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically. One goal for large FFTs is to minimize the number of cache misses.

Our implemented parallel three-dimensional FFT is based on the multicolumn FFT algorithm. Conventional three-dimensional FFT algorithm requires three multicolumn FFTs and three data transpositions. The three transpose steps typically are the chief bottlenecks in cache-based processors.

Some previously presented three-dimensional FFT algorithms separate the multicolumn FFTs from the transpositions. Taking the opposite approach, we combine the multicolumn FFTs and transpositions to reduce the number of cache misses, and we modify the conventional three-dimensional FFT algorithm to reuse data in the cache memory. Also, we vectorized radix-2, 3, 4, 5 and 8 FFT kernels by using SSE2 intrinsics.

To evaluate the proposed parallel three-dimensional FFT, we compared its performance against that of the FFT library of FFTW (version 2.1.5). Although the latest FFTW (version 3.0.1) supports SSE/SSE2/3DNow!/Altivec (new in version 3.0), MPI parallel transforms are still only available in 2.1.5.

A 16-node dual Xeon PC SMP cluster (Northwood 2.8 GHz, 12 KB L1 instruction cache, 8 KB L1 data cache, 512 KB L2 cache, 2 GB DDR-SDRAM main memory per node, Linux 2.4.20smp) was used. The nodes on the PC SMP cluster are interconnected through a Myrinet-2000 switch.

With the 16-node PC SMP cluster, for 512 x 512 x 1024-point FFT, the parallel three-dimensional FFT runs about 1.8 times faster than the FFTW. The performance of the parallel three-dimensional FFT remains at a high level even for the larger problem size, owing to cache blocking. Moreover, our implementation of the parallel three-dimensional FFT exploits the SSE2 instructions. This is the reason why the parallel three-dimensional FFT is most advantageous with the dual Xeon PC SMP cluster.

We succeeded in obtaining performance of over 4.7 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster. These performance results demonstrate that the parallel three-dimensional FFT utilizes Intel's SSE2 instructions and cache memory effectively.

Home page


2004-02-09