Sometimes, people want to compute a discrete Fourier transform (DFT) where only a subset of the outputs are needed, and/or conversely where only a subset of the inputs are non-zero. Fast algorithms to compute this are, in general, known as "pruned" FFTs.
Unfortunately, the performance gains in general from pruned FFTs are quite modest. If you have a transform of size N where only K outputs are desired, the complexity is in general O(N log K), vs. O(N log N) for the full FFT, thus reducing only the log factor, from log N to log K. (Similarly for K nonzero inputs.) This is discussed in more detail by the references at the end. In principle, the arithmetic count can be reduced (however slightly) whenever K < N, simply by discarding (pruning) unused operations from an ordinary FFT. Turning these arithmetic gains into actual performance improvements is much more difficult, however.
The problem is that optimizing an FFT requires a lot of effort, and is not just a matter of minimizing arithmetic cost on modern computers. By implementing a pruned FFT, you are gaining a small factor in the log, but you are sacrificing a significant part of the effort that went into optimizing full FFTs (even if you use an optimized FFT as a subroutine), and the latter can easily outweigh the former. Because of this, I would not recommend bothering to consider a pruned 1d FFT unless you want 1% of the outputs or fewer (and/or if 1% or fewer of your inputs are nonzero).
The most common case where people seem to want a pruned FFT is for zero-padded convolutions, where roughly 50% of your inputs are zero (to get a linear convolution from an FFT-based cyclic convolution). Here, a pruned FFT is hardly worth thinking about, at least in one dimension. In higher dimensions, matters change (e.g. for a 3d zero-padded array about 1/8 of your inputs are non-zero, and one can fairly easily save a factor of two or so simply by skipping 1d sub-transforms that are zero).
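To make the higher-dimensional case concrete, here is a minimal sketch (not code from FFTW itself) of how one might skip the all-zero 1d sub-transforms of an in-place 3d complex DFT of a row-major N0 x N1 x N2 array in which only the first N0/2 x N1/2 x N2/2 corner is nonzero. The function name padded_fft3 and the loop structure are purely illustrative; only standard FFTW calls (fftw_plan_many_dft, fftw_execute_dft) are used.

#include <complex.h>
#include <fftw3.h>

/* Hypothetical sketch: in-place forward DFT of a row-major N0 x N1 x N2
   complex array a[], where only the first N0/2 x N1/2 x N2/2 corner is
   nonzero (zero padding for a linear convolution); 1d sub-transforms of
   all-zero data are simply skipped.  Assumes N0, N1, N2 are even and that
   a[] comes from fftw_malloc (the offset planes passed to fftw_execute_dft
   must have the same alignment as the arrays used when planning). */
static void padded_fft3(fftw_complex *a, int N0, int N1, int N2)
{
     int i;

     /* dim-2 (contiguous) 1d FFTs: only the N1/2 nonzero rows of a plane */
     fftw_plan p2 = fftw_plan_many_dft(1, &N2, N1/2,
                                       a, NULL, 1, N2,
                                       a, NULL, 1, N2,
                                       FFTW_FORWARD, FFTW_ESTIMATE);
     /* dim-1 1d FFTs: all N2 columns of a plane */
     fftw_plan p1 = fftw_plan_many_dft(1, &N1, N2,
                                       a, NULL, N2, 1,
                                       a, NULL, N2, 1,
                                       FFTW_FORWARD, FFTW_ESTIMATE);
     /* dim-0 1d FFTs: all N1*N2 of them (no zeros remain at this stage) */
     fftw_plan p0 = fftw_plan_many_dft(1, &N0, N1*N2,
                                       a, NULL, N1*N2, 1,
                                       a, NULL, N1*N2, 1,
                                       FFTW_FORWARD, FFTW_ESTIMATE);
     /* (FFTW_ESTIMATE so that planning does not overwrite the data in a[]) */

     for (i = 0; i < N0/2; ++i)   /* skip the all-zero planes i >= N0/2 */
          fftw_execute_dft(p2, a + (size_t)i*N1*N2, a + (size_t)i*N1*N2);
     for (i = 0; i < N0/2; ++i)
          fftw_execute_dft(p1, a + (size_t)i*N1*N2, a + (size_t)i*N1*N2);
     fftw_execute(p0);

     fftw_destroy_plan(p2);
     fftw_destroy_plan(p1);
     fftw_destroy_plan(p0);
}

Only the first and second stages are pruned here (saving roughly 4x and 2x of their 1d transforms, respectively); the final stage must still transform everything, which is why the overall saving is only about a factor of two.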
Another possibility, of course, is to simply use the naive DFT formula or the Goertzel algorithm, both of which have O(N K) complexity. These methods have the added advantage of only requiring O(K) storage if the N inputs are computed or read on the fly (a huge advantage in some applications). However, they are also generally less accurate than FFT-based methods when implemented in finite-precision floating-point arithmetic: the naive formula and Goertzel's algorithm have error bounds that grow as O(N^(3/2)) and O(N^(5/2)), respectively, compared to O(log N) for the FFT. The minimum K at which a pruned FFT becomes faster will depend upon the context, but we have observed benefits from the pruned FFT below (compared to Goertzel) for K as small as 10 with N of 10^5 (where Goertzel is orders of magnitude less accurate).
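For reference, here is a minimal sketch of the Goertzel recurrence for a single output X[k] of a length-N real input, written with the same sign convention (FFTW_FORWARD, i.e. exp(-2*pi*I*n*k/N)) as the code later on this page; the function name goertzel1 is just for illustration. Applying it to K outputs costs O(N K) operations, and the input can be streamed since only two state variables are kept per output.

#include <math.h>
#include <complex.h>

/* Goertzel: X[k] = sum_n x[n] exp(-2*pi*I*n*k/N), computed via a real
   second-order recurrence; only s1 and s2 are stored, so x[] can be
   generated or read on the fly. */
static double complex goertzel1(const double *x, int N, int k)
{
     const double TWOPI = 6.2831853071795864769252867665590057683943388;
     double w = TWOPI * k / N;
     double coeff = 2 * cos(w);
     double s0, s1 = 0.0, s2 = 0.0;
     int n;
     for (n = 0; n < N; ++n) {
          s0 = x[n] + coeff * s1 - s2;
          s2 = s1;
          s1 = s0;
     }
     s0 = coeff * s1 - s2;            /* one extra step with zero input */
     return s0 - cexp(-I * w) * s1;   /* X[k] = s[N] - exp(-I*w) * s[N-1] */
}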
FFTW does not currently implement any general pruned FFT algorithm. However, in principle one can easily implement a pruned FFT algorithm on top of FFTW, and we describe the simplest such case below.
Another case that we experimented with once in the past is 3d FFTs with sparse inputs/outputs; you can see some sample code that we wrote for that at http://fftw.org/pub/fftw/sparsefft3-0.5.1.tar.gz.
An important special case where one can easily achieve O(N log K) complexity is where one wishes to compute only the first K outputs of a DFT, where K divides N.
In this case, one simply computes N/K FFTs of size K, with stride N/K in the data, and then combines the output by summing with appropriate phase ("twiddle") factors. Essentially, you are performing "by hand" a single radix-K decimation-in-frequency Cooley-Tukey step, where you only compute one size-K block of the output.
(The dual problem is where only the first K inputs are non-zero, in which case you essentially just reverse these steps.)
More generally, if you want an arbitrary subset of K outputs, you would compute N/K FFTs of size K as below, and then each output is one output of a DFT of size N/K, premultiplied by twiddle factors. This general case is not implemented below, however. In principle, to gain maximum possible arithmetic savings, you should notice when you are computing more than one output of a given size-N/K DFT, and use a pruned FFT recursively, but this is much more complicated to implement and the practical gains are doubtful in general. More simply, you would just compute each output using the naive DFT formula with O(N/K) operations for each of the K outputs in the last step. (In the special case, below, where you want the first K outputs of the DFT, you are computing only the 0th output of each size-N/K DFT, and no additional pruning is possible.)
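As a rough illustration of this last, simpler strategy (again, not implemented in the code below), the final combination step for an arbitrary list of wanted output indices might look something like the following fragment. It assumes out[] holds the N/K contiguous size-K sub-transforms exactly as produced by the code below, with TWOPI and the loop variables declared as there; wanted[], num_wanted, wanted_out[], and the index m are hypothetical names introduced only for this sketch.

     /* Hypothetical fragment: naive O(N/K)-per-output combination for an
        arbitrary subset of outputs.  out[j*K + p] is output p of the j-th
        size-K sub-FFT (j = 0..N/K-1), as in the code below. */
     for (m = 0; m < num_wanted; ++m) {
          int k = wanted[m];          /* desired output index, 0 <= k < N */
          fftw_complex sum = 0.0;
          for (j = 0; j < N/K; ++j)   /* sum_j out_j[k mod K] * exp(-2*pi*I*j*k/N) */
               sum += out[j*K + (k % K)]
                    * cexp((I * FFTW_FORWARD * TWOPI/N) * ((j * (long long) k) % N));
          wanted_out[m] = sum;
     }

(The reduction of j*k modulo N is done in integer arithmetic to avoid overflow and to keep the phase argument small.)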
Note: the code below will often only yield substantial gains over the normal FFTW if K is much smaller than N (ideally by a factor of 100 or more, depending on your machine/compiler).
Below is an outline of code to do this with FFTW3's complex FFT, operating out of place. Here, for simplicity, we assume that C99's complex.h header is available so that we can use complex arithmetic in C.
#include <math.h>
#include <complex.h>
#include <fftw3.h>

{
     int i, j;
     const double TWOPI = 6.2831853071795864769252867665590057683943388;
     fftw_complex in[N], out[N], twids[(K-1)*(N/K-1)];
     fftw_plan plan;

     /* plan N/K FFTs of size K */
     plan = fftw_plan_many_dft(1, &K, N/K,
                               in, NULL, N/K, 1,
                               out, NULL, 1, K,
                               FFTW_FORWARD, FFTW_ESTIMATE);

     /* precompute twiddle factors (since we usually want more than one FFT) */
     for (j = 1; j < N/K; ++j)
          for (i = 1; i < K; ++i)
               twids[(j-1)*(K-1) + (i-1)] = cexp((I * FFTW_FORWARD * TWOPI/N) * (i*j));

     ...initialize in[N] data....

     fftw_execute(plan);

     /* set *only* first K outputs, in out[], to values for full size-N DFT: */
     for (j = 1; j < N/K; ++j) {
          out[0] += out[j*K];
          for (i = 1; i < K; ++i)
               out[i] += out[i + j*K] * twids[(j-1)*(K-1) + (i-1)];
     }

     fftw_destroy_plan(plan);
}
Notice that, while the inputs are transformed with stride N/K, the outputs have stride 1, so that each size-K sub-transform is stored as a contiguous length-K chunk in out[]. This is convenient because it allows us to overwrite the first size-K chunk, in-place, with the desired output (and is essentially what FFTW normally does internally for radix-K DIF). Other data arrangements could be devised, of course.
Of course, further optimizations are possible. You may want to plan with FFTW_MEASURE instead of FFTW_ESTIMATE if you are computing many FFTs. You probably also want to allocate the input/output arrays with fftw_malloc to ensure correct alignment and to avoid limitations on the stack size. Many games can be played with how the twiddle factors are computed (e.g. they can be stored in the order that they are used to improve locality).
We should also comment on the floating-point accuracy. For K = 1, the above algorithm is exactly the naive DFT, with the same floating-point accuracy. In general, then, for K much smaller than N, the accuracy of this code is worse than that of the full FFT, and approaches (from below) the error of the naive DFT formula. The accuracy could be improved by a number of techniques, quite possibly with negligible computational cost (e.g. by cascade summation), albeit with more code.
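For example, one way to apply cascade (pairwise) summation, sketched under the assumption that complex.h and fftw3.h are included as in the code above: instead of the running "+=" accumulation, collect the N/K twiddle-multiplied terms for a given output into a temporary array and sum them recursively in halves. The helper name psum below is hypothetical.

     /* Pairwise (cascade) summation of n complex terms: the roundoff error
        grows roughly as O(log n) instead of O(n) for a left-to-right sum. */
     static fftw_complex psum(const fftw_complex *t, int n)
     {
          fftw_complex s;
          int i;
          if (n <= 4) {          /* small base case: ordinary summation */
               s = 0.0;
               for (i = 0; i < n; ++i)
                    s += t[i];
               return s;
          }
          return psum(t, n/2) + psum(t + n/2, n - n/2);
     }

One would then, for each of the K outputs, fill a length-N/K scratch array with scratch[0] = out[i] and scratch[j] = out[i + j*K] * twids[(j-1)*(K-1) + (i-1)] for j = 1..N/K-1, and assign out[i] = psum(scratch, N/K).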
Now, if you really have complex data, then you probably want the K lowest positive and negative frequency amplitudes, where the negative frequencies are located at N-k (for k=1..K-1) because of aliasing. This can be accomplished by replacing the final loop above with:
     /* set *only* first K outputs and the last K-1 outputs, in out[],
        to values for full size-N DFT: */
     out[0] += out[N - K];
     for (i = 1; i < K; ++i) {
          fftw_complex o0 = out[i], oN = out[(N - K) + i];
          out[i] = o0 + oN * twids[(N/K-2)*(K-1) + (i-1)];
          out[(N - K) + i] = o0 + oN * conj(twids[(N/K-2)*(K-1) + (K-i-1)]);
     }
     for (j = 1; j < N/K - 1; ++j) {
          out[0] += out[j*K];
          for (i = 1; i < K; ++i)
               out[i] += out[i + j*K] * twids[(j-1)*(K-1) + (i-1)];
          for (i = 1; i < K; ++i)
               out[(N-K) + i] += out[i + j*K] * conj(twids[(j-1)*(K-1) + (K-i-1)]);
     }
This code is a little more complicated, but the basic idea is the same. The only difference from before is that we generate two size-K blocks (minus one point) of the Cooley-Tukey output instead of one. (Here, the first loop is separated so that we can continue to operate in-place on out[].)
An even more common case is where you have real inputs, and you want only the first K amplitudes of the DFT output. (Here, because of the conjugate symmetry, there is no need to distinguish positive and negative frequencies.)
#include <math.h>
#include <complex.h>
#include <fftw3.h>

{
     int i, j;
     const double TWOPI = 6.2831853071795864769252867665590057683943388;
     double in[N], out1[N];
     fftw_complex out[K], twids[(K-1)*(N/K-1)];
     fftw_r2r_kind kind = FFTW_R2HC;
     fftw_plan plan;

     /* plan N/K FFTs of size K */
     plan = fftw_plan_many_r2r(1, &K, N/K,
                               in, NULL, N/K, 1,
                               out1, NULL, 1, K,
                               &kind, FFTW_ESTIMATE);

     /* precompute twiddle factors (since we usually want more than one FFT) */
     for (j = 1; j < N/K; ++j)
          for (i = 1; i < K; ++i)
               twids[(j-1)*(K-1) + (i-1)] = cexp((I * FFTW_FORWARD * TWOPI/N) * (i*j));

     ...initialize in[N] data....

     fftw_execute(plan);

     /* set K elements of out[] to first K outputs of full size-N DFT: */
     out[0] = out1[0];
     for (i = 1; i+i < K; ++i) {
          double o1 = out1[i], o2 = out1[K-i];
          out[i] = o1 + I*o2;
          out[K-i] = o1 - I*o2;
     }
     if (i+i == K) /* Nyquist element */
          out[i] = out1[i];
     for (j = 1; j < N/K; ++j) {
          out[0] += out1[j*K];
          for (i = 1; i+i < K; ++i) {
               double o1 = out1[i + j*K], o2 = out1[K-i + j*K];
               out[i] += (o1 + I*o2) * twids[(j-1)*(K-1) + (i-1)];
               out[K-i] += (o1 - I*o2) * twids[(j-1)*(K-1) + (K-i-1)];
          }
          if (i+i == K) /* Nyquist element */
               out[i] += out1[i + j*K] * twids[(j-1)*(K-1) + (i-1)];
     }

     fftw_destroy_plan(plan);
}
This is essentially the same as the very first case above, except that it takes into account that the sub-FFTs were of real data and so have conjugate-symmetry ("halfcomplex") output.
In this case, it is more difficult to compute the final output in the same array as the sub-FFT outputs, so we instead store the result in a separate complex array out[K], while using an auxiliary real array out1[N] for the outputs of the sub-FFTs.
Again, in realistic code you would probably allocate the arrays with fftw_malloc instead of on the stack. Various optimizations and other variations are possible, as before. The above code does not exploit SIMD instructions, even for the FFTW sub-transforms, because it uses FFTW's r2r interface. To exploit SIMD for the sub-transforms, you should change it to the r2c interface (which involves a different output format), as sketched below.
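To give a rough idea of that change, here is a sketch (not tested code) of the r2c variant. With fftw_plan_many_dft_r2c, each size-K sub-transform produces K/2+1 complex outputs directly (C integer division, matching FFTW's r2c output length), stored here in a hypothetical array outc of (N/K)*(K/2+1) elements; the remaining sub-transform frequencies follow from conjugate symmetry, Y_j[K-i] = conj(Y_j[i]). The declarations of in[], out[], twids[], plan, i, and j are assumed to be as in the r2r code above.

     fftw_complex outc[(N/K)*(K/2+1)];  /* sub-FFT outputs in r2c format */

     plan = fftw_plan_many_dft_r2c(1, &K, N/K,
                                   in, NULL, N/K, 1,
                                   outc, NULL, 1, K/2+1,
                                   FFTW_ESTIMATE);

     /* ...initialize in[N] data and precompute twids[] exactly as above... */

     fftw_execute(plan);

     /* set K elements of out[] to first K outputs of full size-N DFT: */
     for (i = 0; i <= K/2; ++i)
          out[i] = outc[i];
     for (i = K/2+1; i < K; ++i)
          out[i] = conj(outc[K-i]);            /* conjugate symmetry */
     for (j = 1; j < N/K; ++j) {
          fftw_complex *Yj = outc + j*(K/2+1); /* j-th sub-FFT output block */
          out[0] += Yj[0];
          for (i = 1; i <= K/2; ++i)
               out[i] += Yj[i] * twids[(j-1)*(K-1) + (i-1)];
          for (i = K/2+1; i < K; ++i)
               out[i] += conj(Yj[K-i]) * twids[(j-1)*(K-1) + (i-1)];
     }

Because the r2c output is already complex, no separate unpacking of the halfcomplex format is needed; the price is the extra storage for outc and the different output layout noted above.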