Parallel Versions of FFTW
Starting with FFTW 1.1, we are including parallel Fourier transform
subroutines in the FFTW distribution. Parallel computation is a very
important issue for many users, but few (no?) parallel FFT codes are
publicly available. We hope that we can begin to fill this gap.
We are currently experimenting with several parallel versions of
FFTW. These have not yet been optimized nearly as fully as FFTW
itself, nor have they been benchmarked against other parallel FFT
software (mainly because parallel FFT codes are few and far
between). They are, however, a start in the right direction.
Parallel versions of FFTW exist for:
- Cilk: Cilk
is a superset of C developed at MIT. It allows easy parallelization
of C code and automates load balancing between processors. It is
currently available only on shared-memory SMP architectures. The Cilk
FFTW routines parallelize both one- and multi-dimensional transforms,
and are included in FFTW.
- Threads: Threads are a low-level mechanism for managing
concurrent computations that share a single address space. They are
available in some form on most SMP systems. FFTW currently supports
Solaris, POSIX, and BeOS threads, and
should be easy to port to other systems. (Some experimental code is
in place for MacOS MP and Win32 threads.) The threads FFTW routines
parallelize both one- and multi-dimensional transforms, and are
included in FFTW.
- MPI:
MPI is a standard message-passing library available on many
platforms. Currently, we have in-place, multi-dimensional parallel
transforms for FFTW that use MPI. These routines differ from the
threads and Cilk versions of FFTW in that they work on distributed
memory machines in addition to shared-memory architectures. The MPI
FFTW routines are included in FFTW.
Benchmarks of the parallel FFTs
We have done a small amount of benchmarking of the parallel versions
of FFTW. Unfortunately, there isn't much to benchmark them against
except each other, since we have had some difficulty finding other
publicly available parallel FFT software. On the Cray T3D, however,
we were able to compare Cray's PCCFFT3D routine to FFTW.
Benchmark results are available for several machines.
How the parallel FFTW transforms work
The following is a very brief overview of the methods with which we currently parallelize FFTW.
One-dimensional Parallel FFT
The Cooley-Tukey algorithm that FFTW uses to perform one-dimensional
transforms works by breaking a problem of size N = N1*N2 into N2
problems of size N1 and N1 problems of size N2. The
way in which we parallelize this transform, then, is simply to divide
these sub-problems equally among different threads. In Cilk, this
amounts to little more than adding spawn
keywords in
front of subroutine calls (thanks to FFTW's explicitly recursive
structure). Currently, we keep the data in the same order as we do for
the uniprocessor code; in the future, we may reorder the data at each
phase of the computation to achieve better segregation of the data
between processors.
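As a concrete illustration, here is a rough sketch of how the first
phase (the N2 sub-transforms of size N1) could be divided equally among
POSIX threads. This is not FFTW's actual code: the serial routine
fft_serial and the strided layout are assumptions made for the example,
and the Cilk version achieves the same division simply by spawning the
recursive calls.

    /* Sketch only, not FFTW's internals: divide the n2 sub-transforms of
     * size n1 from the first Cooley-Tukey phase equally among threads.
     * fft_serial is a hypothetical serial FFT of length n with the given
     * stride between consecutive elements. */
    #include <pthread.h>
    #include <complex.h>

    void fft_serial(double complex *x, int n, int stride);

    typedef struct {
        double complex *data;   /* the full array of n1*n2 points */
        int n1, n2;             /* Cooley-Tukey factors */
        int first, count;       /* which sub-transforms this thread handles */
    } chunk_t;

    static void *worker(void *arg)
    {
        chunk_t *c = arg;
        /* sub-transform j operates on data[j], data[j+n2], data[j+2*n2], ... */
        for (int j = c->first; j < c->first + c->count; ++j)
            fft_serial(c->data + j, c->n1, c->n2);
        return NULL;
    }

    void parallel_phase1(double complex *data, int n1, int n2, int nthreads)
    {
        pthread_t tid[nthreads];
        chunk_t work[nthreads];
        int per = (n2 + nthreads - 1) / nthreads;  /* sub-transforms per thread */

        for (int t = 0; t < nthreads; ++t) {
            int first = t * per;
            int count = first >= n2 ? 0 : (first + per > n2 ? n2 - first : per);
            work[t] = (chunk_t){ data, n1, n2, first, count };
            pthread_create(&tid[t], NULL, worker, &work[t]);
        }
        for (int t = 0; t < nthreads; ++t)
            pthread_join(tid[t], NULL);
        /* the twiddle-factor multiplication and the n1 transforms of size
           n2 would follow, divided among the threads in the same way */
    }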
Multi-dimensional Parallel FFT (shared memory)
Multi-dimensional transforms consist of many one-dimensional
transforms along the rows, columns, etcetera of a matrix. For the Cilk
and threads versions of FFTW, we simply divide these one-dimensional
transforms equally among the processors. Again, we leave the data in
the same order as it is in the uniprocessor case. Unfortunately, this
means that the arrays being transformed by different processors are
interleaved in memory, resulting in more memory contention than is
desirable. We are investigating ways to alleviate this problem.
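A minimal sketch of this idea for the two-dimensional case appears
below, using OpenMP to split the loops over rows and columns between
threads. It is illustrative only, not FFTW's implementation; fft_serial
is again a hypothetical serial one-dimensional FFT.

    /* Sketch only: dividing the 1-D transforms of an n1 x n2 array among
     * threads.  fft_serial is a hypothetical serial FFT of length n with
     * the given stride. */
    #include <complex.h>

    void fft_serial(double complex *x, int n, int stride);

    void transform_2d_shared(double complex *a, int n1, int n2)
    {
        /* row pass: each row is contiguous and is handled by one thread */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n1; ++i)
            fft_serial(a + i * n2, n2, 1);

        /* column pass: column j is the strided sequence a[j], a[j+n2], ...
           Columns assigned to different threads are interleaved in memory,
           which is the source of the contention mentioned above. */
        #pragma omp parallel for schedule(static)
        for (int j = 0; j < n2; ++j)
            fft_serial(a + j, n1, n2);
    }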
Multi-dimensional Parallel FFT (distributed memory)
The MPI code works differently, since it is designed for
distributed memory systems--here, each processor has its own memory or
address space in which it holds a portion of the data. The data on a
processor is inaccessible to other processors except by explicit
message passing. In the MPI version of FFTW, we assume that a
multi-dimensional array is distributed across the rows (the first
dimension) of the data. To perform the FFT of this data, each
processor first transforms all the dimensions of the data that are
completely local to it (e.g. the rows). Then, the processors have to
perform a transpose of the data in order to get the remaining
dimension (the columns) local to a processor. This dimension is then
Fourier transformed, and the data is (optionally) transposed back to
its original order.
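The overall structure for a two-dimensional transform is sketched
below. The sketch makes simplifying assumptions (an n1 x n2 array with
both dimensions divisible by the number of processes, and an
out-of-place transpose through scratch arrays); it is not FFTW's MPI
code, and fft_serial is a hypothetical serial one-dimensional FFT.

    /* Sketch only, not FFTW's MPI code: an n1 x n2 complex array is
     * distributed by rows over P processes, with n1 and n2 divisible
     * by P.  fft_serial is a hypothetical serial 1-D FFT. */
    #include <mpi.h>
    #include <complex.h>
    #include <stdlib.h>
    #include <string.h>

    void fft_serial(double complex *x, int n, int stride);

    void fft2d_mpi(double complex *local, int n1, int n2, MPI_Comm comm)
    {
        int P, p;
        MPI_Comm_size(comm, &P);
        MPI_Comm_rank(comm, &p);

        int rows = n1 / P, cols = n2 / P;   /* local block sizes */
        int blk  = rows * cols;             /* elements exchanged per peer */

        /* 1) transform the dimension that is completely local: each of the
              `rows` local rows is a contiguous length-n2 transform */
        for (int r = 0; r < rows; ++r)
            fft_serial(local + (size_t)r * n2, n2, 1);

        /* 2) global transpose: pack one block per destination process,
              exchange everything with everybody (a complete exchange),
              then unpack so that columns become local rows */
        double complex *send = malloc((size_t)rows * n2 * sizeof *send);
        double complex *recv = malloc((size_t)rows * n2 * sizeof *recv);
        double complex *t    = malloc((size_t)cols * n1 * sizeof *t);

        for (int dest = 0; dest < P; ++dest)
            for (int r = 0; r < rows; ++r)
                memcpy(send + (size_t)dest * blk + (size_t)r * cols,
                       local + (size_t)r * n2 + (size_t)dest * cols,
                       cols * sizeof *send);

        MPI_Alltoall(send, blk, MPI_C_DOUBLE_COMPLEX,
                     recv, blk, MPI_C_DOUBLE_COMPLEX, comm);

        for (int src = 0; src < P; ++src)
            for (int r = 0; r < rows; ++r)
                for (int c = 0; c < cols; ++c)
                    t[(size_t)c * n1 + (size_t)src * rows + r] =
                        recv[(size_t)src * blk + (size_t)r * cols + c];

        /* 3) transform the remaining dimension: each of the `cols` local
              rows of the transposed array is a contiguous length-n1 transform */
        for (int c = 0; c < cols; ++c)
            fft_serial(t + (size_t)c * n1, n1, 1);

        /* 4) copy the result back over the input (still in transposed order) */
        memcpy(local, t, (size_t)cols * n1 * sizeof *local);

        free(send); free(recv); free(t);
    }

In this sketch the result is simply left in transposed order;
transposing back would repeat the same exchange in the opposite
direction.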
The hardest part of the multi-dimensional, distributed FFT is the
transpose. A transpose is one of those operations that sounds simple,
but is very tricky to implement in practice. The real subtlety arises
when you want to do a transpose in place and with minimal
communication. Essentially, the way we do this is to do an in-place
transpose of the data local to each processor (itself a non-trivial
problem), exchange blocks of data with every other processor, and then
complete the transpose by a further (precomputed) permutation on the
data. All of this involves an appalling amount of bookkeeping,
especially when the array dimensions aren't divisible by the number of
processors. The communications phase is an example of a complete
exchange: an operation in which every processor sends a different
message to every other processor. (This is probably the worst thing
that you can do to a memory architecture.)
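The complete exchange itself maps onto a single MPI collective call.
The toy program below (an illustration, not FFTW's bookkeeping)
performs such an exchange in place: block p of each process's buffer
initially holds the data destined for process p, and after the call it
holds the data that process p sent back.

    /* Sketch of the complete-exchange phase alone, done in place: every
     * process swaps a distinct block of data with every other process
     * inside a single buffer. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int P, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int blk = 4;   /* elements destined for each peer */
        double *buf = malloc((size_t)P * blk * sizeof *buf);

        /* block p of the buffer is the data that must end up on process p */
        for (int p = 0; p < P; ++p)
            for (int i = 0; i < blk; ++i)
                buf[p * blk + i] = 1000.0 * rank + p;   /* dummy payload */

        /* the complete exchange: P*(P-1) distinct messages in one call;
           with MPI_IN_PLACE, block p is taken as the data to send to
           process p and is overwritten by what process p sent to us */
        MPI_Alltoall(MPI_IN_PLACE, blk, MPI_DOUBLE,
                     buf, blk, MPI_DOUBLE, MPI_COMM_WORLD);

        free(buf);
        MPI_Finalize();
        return 0;
    }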
The whole process would be simpler, and possibly faster, if the
transpose weren't required to be in place. We felt, however, that if
a problem is large enough to require a distributed-memory
multiprocessor, it is probably also too large to afford a scratch copy
of the array, so the ability to transform it in place is essential.