Go back to the FFTW home page.

FFTW on the Cell Processor

Version 3.2.2 of FFTW contains specific support for the Cell Broadband Engine ("Cell") processor, which was added to FFTW in 2007. This page summarizes that support, which is also described in the FFTW manual and in the README.Cell file included in FFTW. We also provide some benchmarks from an IBM Cell Blade and a PlayStation 3.

Cell support was removed in FFTW version 3.3 in 2011, primarily because we lack a machine to test on and because of a perceived lack of user interest over the last few years. Users who wish to employ FFTW on the Cell can continue to use version 3.2.2.

Acknowledgments

The Cell code in FFTW was written and graciously donated to the FFTW project by the IBM Austin Research Laboratory. We are grateful to Pat Bohrer and Lorraine Herger of IBM for this generous contribution.

Scope

Cell consists of one PowerPC core ("PPE") and a number of Synergistic Processing Elements ("SPEs") to which the PPE can delegate computation. The IBM QS20 Cell blade offers 8 SPEs per Cell chip. The Sony PlayStation 3 contains 6 usable SPEs.

This version of FFTW fully utilizes the SPEs for one- and multi-dimensional complex FFTs of sizes that can be factored into small primes, in both single and double precision. Transforms of real data use the SPEs only partially at this time. If FFTW cannot use the SPEs, it falls back to a slower computation on the PPE.
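
As a rough illustration of what "factored into small primes" means, the following C sketch tests whether a transform size has no prime factor larger than a chosen bound. The function name has_only_small_factors and the bound of 7 used in the example afterwards are assumptions made purely for illustration; this page does not state which prime factors the SPE code actually handles, and the helper is not part of FFTW.

```c
#include <stddef.h>

/* Hypothetical helper, not part of FFTW: returns 1 if every prime
   factor of n is <= max_prime, and 0 otherwise (including n == 0). */
int has_only_small_factors(size_t n, size_t max_prime)
{
    size_t p;

    if (n == 0)
        return 0;

    /* Divide out each candidate in increasing order. A composite p
       can never divide n here, because its prime factors have
       already been removed by earlier iterations. */
    for (p = 2; p <= max_prime && n > 1; ++p)
        while (n % p == 0)
            n /= p;

    /* Anything left over is a product of primes above the bound. */
    return n == 1;
}
```

Under the assumed bound of 7, has_only_small_factors(256, 7) returns 1, whereas has_only_small_factors(22, 7) returns 0 because 22 = 2 × 11.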

This library is meant to use the SPEs transparently without user intervention. However, certain caveats apply, which are discussed later in this document.

Installation

To enable support for Cell in double precision:

   configure --enable-cell
   make
   make install

In single precision:

   configure --enable-cell --enable-single
   make
   make install

In addition, the PPE supports the Altivec (or VMX) instruction set in single precision. (Altivec is the Apple/Freescale name and VMX is the IBM name for the same thing.) You can enable support for Altivec by also passing the "--enable-altivec" flag to configure; it has an effect in single precision only.

The software compiles with the Cell SDK 2.0, and probably with earlier ones as well.

Caveats

The benchmark program allocates memory using malloc() or equivalent library calls, reflecting the common usage of the FFTW library. However, you can sometimes improve performance significantly by allocating memory in system-specific large TLB pages. For example, we have seen 39 GFLOP/s for a 256×256×256 problem using large pages, whereas the speed is about 25 GFLOP/s with normal pages. Your mileage may vary.

FFTW hoards all available SPEs for itself. You can optionally choose a different number of SPEs by calling the undocumented function fftw_cell_set_nspe(n), where "n" is the number of desired SPEs. Expect this interface to go away once we figure out how to make FFTW play nicely with other Cell software. In particular, if you try to link both the single- and double-precision versions of FFTW in the same program (which you can do), they will both try to grab all SPEs and the second one will hang.

The SPEs demand that data be stored in contiguous arrays aligned at 16-byte boundaries. If you instruct FFTW to operate on noncontiguous or nonaligned data, the SPEs will not be used, resulting in slow execution.

The FFTW_ESTIMATE mode may produce seriously suboptimal plans, and it becomes particularly confused if you enable both the SPEs and Altivec. If you care about performance, please use FFTW_MEASURE or FFTW_PATIENT until we figure out a more reliable performance model.
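
The 16-byte alignment requirement can be checked before handing a buffer to FFTW. The sketch below is a minimal illustration in standard C11; the helper names is_aligned_16 and alloc_aligned_16 are hypothetical and are not part of FFTW. Memory obtained from FFTW's documented fftw_malloc() is intended to satisfy FFTW's alignment requirements already, so helpers like these matter mainly for buffers that come from elsewhere.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of FFTW: nonzero if p lies on a
   16-byte boundary. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper, not part of FFTW: allocate at least nbytes
   on a 16-byte boundary, or return NULL on failure. C11
   aligned_alloc expects the size to be a multiple of the
   alignment, so the request is rounded up. Release with free(). */
void *alloc_aligned_16(size_t nbytes)
{
    size_t rounded;

    if (nbytes == 0)
        nbytes = 1;                 /* avoid a zero-size request */
    if (nbytes > SIZE_MAX - 15u)
        return NULL;                /* rounding up would overflow */

    rounded = (nbytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, rounded);
}
```

A pointer offset by 8 bytes from such an allocation, for example (char *)buf + 8, fails the check; passing data like that to FFTW is the nonaligned case in which the SPEs are not used.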

Accuracy

The SPEs are fully IEEE-754 compliant in double precision. In single precision, they implement only round-towards-zero, as opposed to the standard round-to-even mode. (The PPE is fully IEEE-754 compliant, like all other PowerPC implementations.) Because of the rounding mode, single-precision FFTW is less accurate when running on the SPEs than on the PPE. The accuracy loss is hard to quantify in general, but as a rough guideline, the L2 norm of the relative roundoff error for random inputs is 4 to 8 times larger than that of the corresponding calculation in round-to-even arithmetic. In other words, expect to lose 2 to 3 bits of accuracy.
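
The "2 to 3 bits" figure is simply the base-2 logarithm of the error ratio: an error 4 times larger costs log2(4) = 2 bits, and an error 8 times larger costs log2(8) = 3 bits. A minimal sketch of that arithmetic follows; the helper name bits_lost is hypothetical and not part of FFTW.

```c
/* Hypothetical helper, not part of FFTW: whole bits of accuracy
   lost when the roundoff error grows by a factor of error_ratio,
   i.e. floor(log2(error_ratio)). Ratios below 2 (and NaN) give 0. */
int bits_lost(double error_ratio)
{
    int bits = 0;

    /* Halving a value >= 2 is exact in binary floating point. The
       cap on bits keeps the loop finite for an infinite input. */
    while (error_ratio >= 2.0 && bits < 1100) {
        error_ratio /= 2.0;
        ++bits;
    }
    return bits;
}
```

Here bits_lost(4.0) returns 2 and bits_lost(8.0) returns 3, matching the range quoted above.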

FFTW currently does not use any algorithm that degrades accuracy to gain performance on the SPE. One implication of this choice is that large 1D transforms run slower than they would if we were willing to sacrifice another bit or so of accuracy.

Benchmarks

These benchmarks show the results of running benchFFT on an IBM Cell Blade and a PlayStation 3. Note that, of the programs benchmarked, only FFTW uses the Cell SPEs.
