Parallel Versions of FFTW
Starting with FFTW 1.1, we are including parallel Fourier transform
subroutines in the FFTW distribution. Parallel computation is a very
important issue for many users, but few (no?) parallel FFT codes are
publicly available. We hope that we can begin to fill this gap.
We are currently experimenting with several parallel versions of
FFTW. These have not yet been optimized nearly as fully as FFTW
itself, nor have they been benchmarked against other parallel FFT
software (mainly because parallel FFT codes are few and far
between). They are, however, a start in the right direction.
Parallel versions of FFTW exist for:
- Cilk: Cilk
is a superset of C developed at MIT. It allows easy parallelization
of C code and automates load balancing between processors. It is
currently available only on shared-memory SMP architectures. The Cilk
FFTW routines parallelize both one- and multi-dimensional transforms,
and are included in FFTW.
- Threads: Threads are a low-level mechanism for managing
concurrent computations that share a single address space. They are
available in some form on most SMP systems. FFTW currently supports
Solaris, POSIX, and BeOS threads, and
should be easy to port to other systems. (Some experimental code is
in place for MacOS MP and Win32 threads.) The threads FFTW routines
parallelize both one- and multi-dimensional transforms, and are
included in FFTW.
- MPI:
MPI is a standard message-passing library available on many
platforms. Currently, we have in-place, multi-dimensional parallel
transforms for FFTW that use MPI. These routines differ from the
threads and Cilk versions of FFTW in that they work on distributed
memory machines in addition to shared-memory architectures. The MPI
FFTW routines are included in FFTW.
Benchmarks of the parallel FFTs
We have done a small amount of benchmarking of the parallel versions
of FFTW. Unfortunately, there isn't much to benchmark them against
except each other, since we have had some difficulty finding other
publicly available parallel FFT software. On the Cray T3D, however,
we were able to compare Cray's PCCFFT3D routine to FFTW.
Benchmark results are available for several machines.
How the parallel FFTW transforms work
The following is a very brief overview of the methods with which we currently parallelize FFTW.
One-dimensional Parallel FFT
The Cooley-Tukey algorithm that FFTW uses to perform one-dimensional
transforms works by breaking a problem of size N = N1*N2 into N2
problems of size N1 and N1 problems of size N2. The
way in which we parallelize this transform, then, is simply to divide
these sub-problems equally among different threads. In Cilk, this
amounts to little more than adding spawn
keywords in
front of subroutine calls (thanks to FFTW's explicitly recursive
structure). Currently, we keep the data in the same order as we do for
the uniprocessor code; in the future, we may reorder the data at each
phase of the computation to achieve better segregation of the data
between processors.
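As a concrete illustration, here is a rough sketch of how the first
phase (the N2 sub-transforms of size N1) could be divided equally among
POSIX threads. This is not FFTW's actual code: the serial routine
fft_serial and the strided layout are assumptions made for the example,
and the Cilk version achieves the same division simply by spawning the
recursive calls.

    /* Sketch only, not FFTW's internals: divide the n2 sub-transforms of
     * size n1 from the first Cooley-Tukey phase equally among threads.
     * fft_serial is a hypothetical serial FFT of length n with the given
     * stride between consecutive elements. */
    #include <pthread.h>
    #include <complex.h>

    void fft_serial(double complex *x, int n, int stride);

    typedef struct {
        double complex *data;   /* the full array of n1*n2 points */
        int n1, n2;             /* Cooley-Tukey factors */
        int first, count;       /* which sub-transforms this thread handles */
    } chunk_t;

    static void *worker(void *arg)
    {
        chunk_t *c = arg;
        /* sub-transform j operates on data[j], data[j+n2], data[j+2*n2], ... */
        for (int j = c->first; j < c->first + c->count; ++j)
            fft_serial(c->data + j, c->n1, c->n2);
        return NULL;
    }

    void parallel_phase1(double complex *data, int n1, int n2, int nthreads)
    {
        pthread_t tid[nthreads];
        chunk_t work[nthreads];
        int per = (n2 + nthreads - 1) / nthreads;  /* sub-transforms per thread */

        for (int t = 0; t < nthreads; ++t) {
            int first = t * per;
            int count = first >= n2 ? 0 : (first + per > n2 ? n2 - first : per);
            work[t] = (chunk_t){ data, n1, n2, first, count };
            pthread_create(&tid[t], NULL, worker, &work[t]);
        }
        for (int t = 0; t < nthreads; ++t)
            pthread_join(tid[t], NULL);
        /* the twiddle-factor multiplication and the n1 transforms of size
           n2 would follow, divided among the threads in the same way */
    }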
Multi-dimensional Parallel FFT (shared memory)
Multi-dimensional transforms consist of many one-dimensional
transforms along the rows, columns, etcetera of a matrix. For the Cilk
and threads versions of FFTW, we simply divide these one-dimensional
transforms equally among the processors. Again, we leave the data in
the same order as it is in the uniprocessor case. Unfortunately, this
means that the arrays being transformed by different processors are
interleaved in memory, resulting in more memory contention than is
desirable. We are investigating ways to alleviate this problem.
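A minimal sketch of this idea for the two-dimensional case appears
below, using OpenMP to split the loops over rows and columns between
threads. It is illustrative only, not FFTW's implementation; fft_serial
is again a hypothetical serial one-dimensional FFT.

    /* Sketch only: dividing the 1-D transforms of an n1 x n2 array among
     * threads.  fft_serial is a hypothetical serial FFT of length n with
     * the given stride. */
    #include <complex.h>

    void fft_serial(double complex *x, int n, int stride);

    void transform_2d_shared(double complex *a, int n1, int n2)
    {
        /* row pass: each row is contiguous and is handled by one thread */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n1; ++i)
            fft_serial(a + i * n2, n2, 1);

        /* column pass: column j is the strided sequence a[j], a[j+n2], ...
           Columns assigned to different threads are interleaved in memory,
           which is the source of the contention mentioned above. */
        #pragma omp parallel for schedule(static)
        for (int j = 0; j < n2; ++j)
            fft_serial(a + j, n1, n2);
    }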
Multi-dimensional Parallel FFT (distributed memory)
The MPI code works differently, since it is designed for
distributed memory systems--here, each processor has its own memory or
address space in which it holds a portion of the data. The data on a
processor is inaccessible to other processors except by explicit
message passing. In the MPI version of FFTW, we assume that a
multi-dimensional array is distributed across the rows (the first
dimension) of the data. To perform the FFT of this data, each
processor first transforms all the dimensions of the data that are
completely local to it (e.g. the rows). Then, the processors have to
perform a transpose of the data in order to get the remaining
dimension (the columns) local to a processor. This dimension is then
Fourier transformed, and the data is (optionally) transposed back to
its original order.
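The overall structure for a two-dimensional transform is sketched
below. The sketch makes simplifying assumptions (an n1 x n2 array with
both dimensions divisible by the number of processes, and an
out-of-place transpose through scratch arrays); it is not FFTW's MPI
code, and fft_serial is a hypothetical serial one-dimensional FFT.

    /* Sketch only, not FFTW's MPI code: an n1 x n2 complex array is
     * distributed by rows over P processes, with n1 and n2 divisible
     * by P.  fft_serial is a hypothetical serial 1-D FFT. */
    #include <mpi.h>
    #include <complex.h>
    #include <stdlib.h>
    #include <string.h>

    void fft_serial(double complex *x, int n, int stride);

    void fft2d_mpi(double complex *local, int n1, int n2, MPI_Comm comm)
    {
        int P, p;
        MPI_Comm_size(comm, &P);
        MPI_Comm_rank(comm, &p);

        int rows = n1 / P, cols = n2 / P;   /* local block sizes */
        int blk  = rows * cols;             /* elements exchanged per peer */

        /* 1) transform the dimension that is completely local: each of the
              `rows` local rows is a contiguous length-n2 transform */
        for (int r = 0; r < rows; ++r)
            fft_serial(local + (size_t)r * n2, n2, 1);

        /* 2) global transpose: pack one block per destination process,
              exchange everything with everybody (a complete exchange),
              then unpack so that columns become local rows */
        double complex *send = malloc((size_t)rows * n2 * sizeof *send);
        double complex *recv = malloc((size_t)rows * n2 * sizeof *recv);
        double complex *t    = malloc((size_t)cols * n1 * sizeof *t);

        for (int dest = 0; dest < P; ++dest)
            for (int r = 0; r < rows; ++r)
                memcpy(send + (size_t)dest * blk + (size_t)r * cols,
                       local + (size_t)r * n2 + (size_t)dest * cols,
                       cols * sizeof *send);

        MPI_Alltoall(send, blk, MPI_C_DOUBLE_COMPLEX,
                     recv, blk, MPI_C_DOUBLE_COMPLEX, comm);

        for (int src = 0; src < P; ++src)
            for (int r = 0; r < rows; ++r)
                for (int c = 0; c < cols; ++c)
                    t[(size_t)c * n1 + (size_t)src * rows + r] =
                        recv[(size_t)src * blk + (size_t)r * cols + c];

        /* 3) transform the remaining dimension: each of the `cols` local
              rows of the transposed array is a contiguous length-n1 transform */
        for (int c = 0; c < cols; ++c)
            fft_serial(t + (size_t)c * n1, n1, 1);

        /* 4) copy the result back over the input (still in transposed order) */
        memcpy(local, t, (size_t)cols * n1 * sizeof *local);

        free(send); free(recv); free(t);
    }

In this sketch the result is simply left in transposed order;
transposing back would repeat the same exchange in the opposite
direction.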
The hardest part of the multi-dimensional, distributed FFT is the
transpose. A transpose is one of those operations that sounds simple,
but is very tricky to implement in practice. The real subtlety arises
when you want to do a transpose in place and with minimal
communication. Essentially, the way we do this is to do an in-place
transpose of the data local to each processor (itself a non-trivial
problem), exchange blocks of data with every other processor, and then
complete the transpose by a further (precomputed) permutation on the
data. All of this involves an appalling amount of bookkeeping,
especially when the array dimensions aren't divisible by the number of
processors. The communications phase is an example of a complete
exchange: an operation in which every processor sends a different
message to every other processor. (This is probably the worst thing
that you can do to a memory architecture.)
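The complete exchange itself maps onto a single MPI collective call.
The toy program below (an illustration, not FFTW's bookkeeping)
performs such an exchange in place: block p of each process's buffer
initially holds the data destined for process p, and after the call it
holds the data that process p sent back.

    /* Sketch of the complete-exchange phase alone, done in place: every
     * process swaps a distinct block of data with every other process
     * inside a single buffer. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int P, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int blk = 4;   /* elements destined for each peer */
        double *buf = malloc((size_t)P * blk * sizeof *buf);

        /* block p of the buffer is the data that must end up on process p */
        for (int p = 0; p < P; ++p)
            for (int i = 0; i < blk; ++i)
                buf[p * blk + i] = 1000.0 * rank + p;   /* dummy payload */

        /* the complete exchange: P*(P-1) distinct messages in one call;
           with MPI_IN_PLACE, block p is taken as the data to send to
           process p and is overwritten by what process p sent to us */
        MPI_Alltoall(MPI_IN_PLACE, blk, MPI_DOUBLE,
                     buf, blk, MPI_DOUBLE, MPI_COMM_WORLD);

        free(buf);
        MPI_Finalize();
        return 0;
    }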
The whole process would be simpler, and possibly faster, if the
transpose weren't required to be in place. We felt, however, that if
a problem is large enough to require a distributed-memory
multiprocessor, it is probably also too large to afford a scratch copy
of the array, so the ability to transform it in place is essential.