Go to the first, previous, next, last section, table of contents.

Parallel FFTW

In this chapter we discuss the use of FFTW in a parallel environment, documenting the different parallel libraries that we have provided. (Users calling FFTW from a multi-threaded program should also consult Section Thread safety.) The FFTW package currently contains three parallel transform implementations that leverage the uniprocessor FFTW code:

Multi-threaded FFTW

In this section we document the parallel FFTW routines for shared-memory threads on SMP hardware. These routines, which support parallel one- and multi-dimensional transforms of both real and complex data, are the easiest way to take advantage of multiple processors with FFTW. They work just like the corresponding uniprocessor transform routines, except that they take the number of parallel threads to use as an extra parameter. Any program that uses the uniprocessor FFTW can be trivially modified to use the multi-threaded FFTW.

Installation and Supported Hardware/Software

All of the FFTW threads code is located in the threads subdirectory of the FFTW package. On Unix systems, the FFTW threads libraries and header files can be automatically configured, compiled, and installed along with the uniprocessor FFTW libraries simply by including --enable-threads in the flags to the configure script (see Section Installation on Unix). (Note also that the threads routines, when enabled, are automatically tested by the `make check' self-tests.)

The threads routines require your operating system to have some sort of shared-memory threads support. Specifically, the FFTW threads package works with POSIX threads (available on most Unix variants, including Linux), Solaris threads, BeOS threads (tested on BeOS DR8.2), Mach C threads (reported to work by users), and Win32 threads (reported to work by users). (There is also untested code to use MacOS MP threads.) We also support using OpenMP or SGI MP compiler directives to launch threads, enabled by using --with-openmp or --with-sgimp in addition to --enable-threads. This is especially useful if you are employing that sort of directive in your own code, in order to minimize conflicts. If you have a shared-memory machine that uses a different threads API, it should be a simple matter of programming to include support for it; see the file fftw_threads-int.h for more detail.

SMP hardware is not required, although of course you need multiple processors to get any benefit from the multithreaded transforms.

Usage of Multi-threaded FFTW

Here, it is assumed that the reader is already familiar with the usage of the uniprocessor FFTW routines, described elsewhere in this manual. We only describe what one has to change in order to use the multi-threaded routines.

First, instead of including <fftw.h> or <rfftw.h>, you should include the files <fftw_threads.h> or <rfftw_threads.h>, respectively.

Second, before calling any FFTW routines, you should call the function:

int fftw_threads_init(void);

This function, which should only be called once (probably in your main() function), performs any one-time initialization required to use threads on your system. It returns zero if successful, and a non-zero value if there was an error (in which case, something is seriously wrong and you should probably exit the program).

Third, when you want to actually compute the transform, you should use one of the following transform routines instead of the ordinary FFTW functions:

fftw_threads(nthreads, plan, howmany, in, istride, 
             idist, out, ostride, odist);

fftw_threads_one(nthreads, plan, in, out);

fftwnd_threads(nthreads, plan, howmany, in, istride,
               idist, out, ostride, odist);

fftwnd_threads_one(nthreads, plan, in, out);

rfftw_threads(nthreads, plan, howmany, in, istride, 
              idist, out, ostride, odist);

rfftw_threads_one(nthreads, plan, in, out);

rfftwnd_threads_real_to_complex(nthreads, plan, howmany, in, 
                                istride, idist, out, ostride, odist);

rfftwnd_threads_one_real_to_complex(nthreads, plan, in, out);

rfftwnd_threads_complex_to_real(nthreads, plan, howmany, in,
                                istride, idist, out, ostride, odist);

rfftwnd_threads_one_real_to_complex(nthreads, plan, in, out);

rfftwnd_threads_one_complex_to_real(nthreads, plan, in, out);

All of these routines take exactly the same arguments and have exactly the same effects as their uniprocessor counterparts (i.e. without the `_threads') except that they take one extra parameter, nthreads (of type int), before the normal parameters.(5) The nthreads parameter specifies the number of threads of execution to use when performing the transform (actually, the maximum number of threads).

For example, to parallelize a single one-dimensional transform of complex data, instead of calling the uniprocessor fftw_one(plan, in, out), you would call fftw_threads_one(nthreads, plan, in, out). Passing an nthreads of 1 means to use only one thread (the main thread), and is equivalent to calling the uniprocessor routine. Passing an nthreads of 2 means that the transform is potentially parallelized over two threads (and two processors, if you have them), and so on.

These are the only changes you need to make to your source code. Calls to all other FFTW routines (plan creation, destruction, wisdom, etcetera) are not parallelized and remain the same. (The same plans and wisdom are used by both uniprocessor and multi-threaded transforms.) Your arrays are allocated and formatted in the same way, and so on.

Programs using the parallel complex transforms should be linked with -lfftw_threads -lfftw -lm on Unix. Programs using the parallel real transforms should be linked with -lrfftw_threads -lfftw_threads -lrfftw -lfftw -lm. You will also need to link with whatever library is responsible for threads on your system (e.g. -lpthread on Linux).

How Many Threads to Use?

There is a fair amount of overhead involved in spawning and synchronizing threads, so the optimal number of threads to use depends upon the size of the transform as well as on the number of processors you have.

As a general rule, you don't want to use more threads than you have processors. (Using more threads will work, but there will be extra overhead with no benefit.) In fact, if the problem size is too small, you may want to use fewer threads than you have processors.

You will have to experiment with your system to see what level of parallelization is best for your problem size. Useful tools to help you do this are the test programs that are automatically compiled along with the threads libraries, fftw_threads_test and rfftw_threads_test (in the threads subdirectory). These take the same arguments as the other FFTW test programs (see tests/README), except that they also take the number of threads to use as a first argument, and report the parallel speedup in speed tests. For example,

fftw_threads_test 2 -s 128x128

will benchmark complex 128x128 transforms using two threads and report the speedup relative to the uniprocessor transform.

For instance, on a 4-processor 200MHz Pentium Pro system running Linux 2.2.0, we found that the "crossover" point at which 2 threads became beneficial for complex transforms was about 4k points, while 4 threads became beneficial at 8k points.

Using Multi-threaded FFTW in a Multi-threaded Program

It is perfectly possible to use the multi-threaded FFTW routines from a multi-threaded program (e.g. have multiple threads computing multi-threaded transforms simultaneously). If you have the processors, more power to you! However, the same restrictions apply as for the uniprocessor FFTW routines (see Section Thread safety). In particular, you should recall that you may not create or destroy plans in parallel.

Tips for Optimal Threading

Not all transforms are equally well-parallelized by the multi-threaded FFTW routines. (This is merely a consequence of laziness on the part of the implementors, and is not inherent to the algorithms employed.) Mainly, the limitations are in the parallel one-dimensional transforms. The things to avoid if you want optimal parallelization are as follows:

Parallelization deficiencies in one-dimensional transforms


This section describes the MPI FFTW routines for distributed-memory (and shared-memory) machines supporting MPI (Message Passing Interface). The MPI routines are significantly different from the ordinary FFTW because the transform data here are distributed over multiple processes, so that each process gets only a portion of the array. Currently, multi-dimensional transforms of both real and complex data, as well as one-dimensional transforms of complex data, are supported.

MPI FFTW Installation

The FFTW MPI library code is all located in the mpi subdirectoy of the FFTW package (along with source code for test programs). On Unix systems, the FFTW MPI libraries and header files can be automatically configured, compiled, and installed along with the uniprocessor FFTW libraries simply by including --enable-mpi in the flags to the configure script (see Section Installation on Unix).

The only requirement of the FFTW MPI code is that you have the standard MPI 1.1 (or later) libraries and header files installed on your system. A free implementation of MPI is available from the MPICH home page.

Previous versions of the FFTW MPI routines have had an unfortunate tendency to expose bugs in MPI implementations. The current version has been largely rewritten, and hopefully avoids some of the problems. If you run into difficulties, try passing the optional workspace to (r)fftwnd_mpi (see below), as this allows us to use the standard (and hopefully well-tested) MPI_Alltoall primitive for communications. Please let us know (fftw@fftw.org) how things work out.

Several test programs are included in the mpi directory. The ones most useful to you are probably the fftw_mpi_test and rfftw_mpi_test programs, which are run just like an ordinary MPI program and accept the same parameters as the other FFTW test programs (c.f. tests/README). For example, mpirun ...params... fftw_mpi_test -r 0 will run non-terminating complex-transform correctness tests of random dimensions. They can also do performance benchmarks.

Usage of MPI FFTW for Complex Multi-dimensional Transforms

Usage of the MPI FFTW routines is similar to that of the uniprocessor FFTW. We assume that the reader already understands the usage of the uniprocessor FFTW routines, described elsewhere in this manual. Some familiarity with MPI is also helpful.

A typical program performing a complex two-dimensional MPI transform might look something like:

#include <fftw_mpi.h>

int main(int argc, char **argv)
      const int NX = ..., NY = ...;
      fftwnd_mpi_plan plan;
      fftw_complex *data;


      plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD,
                                    NX, NY,
                                    FFTW_FORWARD, FFTW_ESTIMATE);

      ...allocate and initialize data...

      fftwnd_mpi(p, 1, data, NULL, FFTW_NORMAL_ORDER);



The calls to MPI_Init and MPI_Finalize are required in all MPI programs; see the MPI home page for more information. Note that all of your processes run the program in parallel, as a group; there is no explicit launching of threads/processes in an MPI program.

As in the ordinary FFTW, the first thing we do is to create a plan (of type fftwnd_mpi_plan), using:

fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm,
                                       int nx, int ny,
                                       fftw_direction dir, int flags);

Except for the first argument, the parameters are identical to those of fftw2d_create_plan. (There are also analogous fftwnd_mpi_create_plan and fftw3d_mpi_create_plan functions. Transforms of any rank greater than one are supported.) The first argument is an MPI communicator, which specifies the group of processes that are to be involved in the transform; the standard constant MPI_COMM_WORLD indicates all available processes.

Next, one has to allocate and initialize the data. This is somewhat tricky, because the transform data is distributed across the processes involved in the transform. It is discussed in detail by the next section (see Section MPI Data Layout).

The actual computation of the transform is performed by the function fftwnd_mpi, which differs somewhat from its uniprocessor equivalent and is described by:

void fftwnd_mpi(fftwnd_mpi_plan p,
                int n_fields,
                fftw_complex *local_data, fftw_complex *work,
                fftwnd_mpi_output_order output_order);

There are several things to notice here:

The output of fftwnd_mpi is identical to that of the corresponding uniprocessor transform. In particular, you should recall our conventions for normalization and the sign of the transform exponent.

The same plan can be used to compute many transforms of the same size. After you are done with it, you should deallocate it by calling fftwnd_mpi_destroy_plan.

Important: The FFTW MPI routines must be called in the same order by all processes involved in the transform. You should assume that they all are blocking, as if each contained a call to MPI_Barrier.

Programs using the FFTW MPI routines should be linked with -lfftw_mpi -lfftw -lm on Unix, in addition to whatever libraries are required for MPI.

MPI Data Layout

The transform data used by the MPI FFTW routines is distributed: a distinct portion of it resides with each process involved in the transform. This allows the transform to be parallelized, for example, over a cluster of workstations, each with its own separate memory, so that you can take advantage of the total memory of all the processors you are parallelizing over.

In particular, the array is divided according to the rows (first dimension) of the data: each process gets a subset of the rows of the data. (This is sometimes called a "slab decomposition.") One consequence of this is that you can't take advantage of more processors than you have rows (e.g. 64x64x64 matrix can at most use 64 processors). This isn't usually much of a limitation, however, as each processor needs a fair amount of data in order for the parallel-computation benefits to outweight the communications costs.

Below, the first dimension of the data will be referred to as `x' and the second dimension as `y'.

FFTW supplies a routine to tell you exactly how much data resides on the current process:

void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p,
                            int *local_nx,
                            int *local_x_start,
                            int *local_ny_after_transpose,
                            int *local_y_start_after_transpose,
                            int *total_local_size);

Given a plan p, the other parameters of this routine are set to values describing the required data layout, described below.

total_local_size is the number of fftw_complex elements that you must allocate for your local data (and workspace, if you choose). (This value should, of course, be multiplied by n_fields if that parameter to fftwnd_mpi is not 1.)

The data on the current process has local_nx rows, starting at row local_x_start. If fftwnd_mpi is called with FFTW_TRANSPOSED_ORDER output, then y will be the first dimension of the output, and the local y extent will be given by local_ny_after_transpose and local_y_start_after_transpose. Otherwise, the output has the same dimensions and layout as the input.

For instance, suppose you want to transform three-dimensional data of size nx x ny x nz. Then, the current process will store a subset of this data, of size local_nx x ny x nz, where the x indices correspond to the range local_x_start to local_x_start+local_nx-1 in the "real" (i.e. logical) array. If fftwnd_mpi is called with FFTW_TRANSPOSED_ORDER output, then the result will be a ny x nx x nz array, of which a local_ny_after_transpose x nx x nz subset is stored on the current process (corresponding to y values starting at local_y_start_after_transpose).

The following is an example of allocating such a three-dimensional array array (local_data) before the transform and initializing it to some function f(x,y,z):

        fftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,

        local_data = (fftw_complex*) malloc(sizeof(fftw_complex) *

        for (x = 0; x < local_nx; ++x)
                for (y = 0; y < ny; ++y)
                        for (z = 0; z < nz; ++z)
                                local_data[(x*ny + y)*nz + z]
                                        = f(x + local_x_start, y, z);

Some important things to remember:

Usage of MPI FFTW for Real Multi-dimensional Transforms

MPI transforms specialized for real data are also available, similiar to the uniprocessor rfftwnd transforms. Just as in the uniprocessor case, the real-data MPI functions gain roughly a factor of two in speed (and save a factor of two in space) at the expense of more complicated data formats in the calling program. Before reading this section, you should definitely understand how to call the uniprocessor rfftwnd functions and also the complex MPI FFTW functions.

The following is an example of a program using rfftwnd_mpi. It computes the size nx x ny x nz transform of a real function f(x,y,z), multiplies the imaginary part by 2 for fun, then computes the inverse transform. (We'll also use FFTW_TRANSPOSED_ORDER output for the transform, and additionally supply the optional workspace parameter to rfftwnd_mpi, just to add a little spice.)

#include <rfftw_mpi.h>

int main(int argc, char **argv)
     const int nx = ..., ny = ..., nz = ...;
     int local_nx, local_x_start, local_ny_after_transpose,
         local_y_start_after_transpose, total_local_size;
     int x, y, z;
     rfftwnd_mpi_plan plan, iplan;
     fftw_real *data, *work;
     fftw_complex *cdata;


     /* create the forward and backward plans: */
     plan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
                                    nx, ny, nz,
     iplan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
      /* dim.'s of REAL data --> */  nx, ny, nz,

     rfftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,

     data = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);

     /* workspace is the same size as the data: */
     work = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);

     /* initialize data to f(x,y,z): */
     for (x = 0; x < local_nx; ++x)
             for (y = 0; y < ny; ++y)
                     for (z = 0; z < nz; ++z)
                             data[(x*ny + y) * (2*(nz/2+1)) + z]
                                     = f(x + local_x_start, y, z);

     /* Now, compute the forward transform: */
     rfftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER);

     /* the data is now complex, so typecast a pointer: */
     cdata = (fftw_complex*) data;
     /* multiply imaginary part by 2, for fun:
        (note that the data is transposed) */
     for (y = 0; y < local_ny_after_transpose; ++y)
             for (x = 0; x < nx; ++x)
                     for (z = 0; z < (nz/2+1); ++z)
                             cdata[(y*nx + x) * (nz/2+1) + z].im
                                     *= 2.0;

     /* Finally, compute the inverse transform; the result
        is transposed back to the original data layout: */
     rfftwnd_mpi(iplan, 1, data, work, FFTW_TRANSPOSED_ORDER);


There's a lot of stuff in this example, but it's all just what you would have guessed, right? We replaced all the fftwnd_mpi* functions by rfftwnd_mpi*, but otherwise the parameters were pretty much the same. The data layout distributed among the processes just like for the complex transforms (see Section MPI Data Layout), but in addition the final dimension is padded just like it is for the uniprocessor in-place real transforms (see Section Array Dimensions for Real Multi-dimensional Transforms). In particular, the z dimension of the real input data is padded to a size 2*(nz/2+1), and after the transform it contains nz/2+1 complex values.

Some other important things to know about the real MPI transforms:

Programs using the MPI FFTW real transforms should link with -lrfftw_mpi -lfftw_mpi -lrfftw -lfftw -lm on Unix.

Usage of MPI FFTW for Complex One-dimensional Transforms

The MPI FFTW also includes routines for parallel one-dimensional transforms of complex data (only). Although the speedup is generally worse than it is for the multi-dimensional routines,(6) these distributed-memory one-dimensional transforms are especially useful for performing one-dimensional transforms that don't fit into the memory of a single machine.

The usage of these routines is straightforward, and is similar to that of the multi-dimensional MPI transform functions. You first include the header <fftw_mpi.h> and then create a plan by calling:

fftw_mpi_plan fftw_mpi_create_plan(MPI_Comm comm, int n, 
                                   fftw_direction dir, int flags);

The last three arguments are the same as for fftw_create_plan (except that all MPI transforms are automatically FFTW_IN_PLACE). The first argument specifies the group of processes you are using, and is usually MPI_COMM_WORLD (all processes). A plan can be used for many transforms of the same size, and is destroyed when you are done with it by calling fftw_mpi_destroy_plan(plan).

If you don't care about the ordering of the input or output data of the transform, you can include FFTW_SCRAMBLED_INPUT and/or FFTW_SCRAMBLED_OUTPUT in the flags. These save some communications at the expense of having the input and/or output reordered in an undocumented way. For example, if you are performing an FFT-based convolution, you might use FFTW_SCRAMBLED_OUTPUT for the forward transform and FFTW_SCRAMBLED_INPUT for the inverse transform.

The transform itself is computed by:

void fftw_mpi(fftw_mpi_plan p, int n_fields,
              fftw_complex *local_data, fftw_complex *work);

n_fields, as in fftwnd_mpi, is equivalent to howmany=n_fields, stride=n_fields, and dist=1, and should be 1 when you are computing the transform of a single array. local_data contains the portion of the array local to the current process, described below. work is either NULL or an array exactly the same size as local_data; in the latter case, FFTW can use the MPI_Alltoall communications primitive which is (usually) faster at the expense of extra storage. Upon return, local_data contains the portion of the output local to the current process (see below).

To find out what portion of the array is stored local to the current process, you call the following routine:

void fftw_mpi_local_sizes(fftw_mpi_plan p,
                          int *local_n, int *local_start,
                          int *local_n_after_transform,
                          int *local_start_after_transform,
                          int *total_local_size);

total_local_size is the number of fftw_complex elements you should actually allocate for local_data (and work). local_n and local_start indicate that the current process stores local_n elements corresponding to the indices local_start to local_start+local_n-1 in the "real" array. After the transform, the process may store a different portion of the array. The portion of the data stored on the process after the transform is given by local_n_after_transform and local_start_after_transform. This data is exactly the same as a contiguous segment of the corresponding uniprocessor transform output (i.e. an in-order sequence of sequential frequency bins).

Note that, if you compute both a forward and a backward transform of the same size, the local sizes are guaranteed to be consistent. That is, the local size after the forward transform will be the same as the local size before the backward transform, and vice versa.

Programs using the FFTW MPI routines should be linked with -lfftw_mpi -lfftw -lm on Unix, in addition to whatever libraries are required for MPI.

MPI Tips

There are several things you should consider in order to get the best performance out of the MPI FFTW routines.

First, if possible, the first and second dimensions of your data should be divisible by the number of processes you are using. (If only one can be divisible, then you should choose the first dimension.) This allows the computational load to be spread evenly among the processes, and also reduces the communications complexity and overhead. In the one-dimensional transform case, the size of the transform should ideally be divisible by the square of the number of processors.

Second, you should consider using the FFTW_TRANSPOSED_ORDER output format if it is not too burdensome. The speed gains from communications savings are usually substantial.

Third, you should consider allocating a workspace for (r)fftw(nd)_mpi, as this can often (but not always) improve performance (at the cost of extra storage).

Fourth, you should experiment with the best number of processors to use for your problem. (There comes a point of diminishing returns, when the communications costs outweigh the computational benefits.(7)) The fftw_mpi_test program can output helpful performance benchmarks. It accepts the same parameters as the uniprocessor test programs (c.f. tests/README) and is run like an ordinary MPI program. For example, mpirun -np 4 fftw_mpi_test -s 128x128x128 will benchmark a 128x128x128 transform on four processors, reporting timings and parallel speedups for all variants of fftwnd_mpi (transposed, with workspace, etcetera). (Note also that there is the rfftw_mpi_test program for the real transforms.)

Go to the first, previous, next, last section, table of contents.