
Methodology

In this section we describe and motivate the methodology of our benchmark. We hope to convince you that our program provides a fair and meaningful measure of the performance and accuracy of FFT software.

Performance Measurement

The FFT performance benchmark is designed to model a situation in which you repeatedly perform transforms of a particular size, with some computation on the data in between transforms. (Any one-time initialization cost is therefore not included in the timing measurements.) This seems to be the most common kind of use for FFT code, especially in cases where performance is important.

Essentially, a given array is repeatedly FFTed and the elapsed time is measured. There are a couple of tricky details, however. The basic problem is that repeatedly transforming the same array is a diverging process--before long, you will be multiplying NaNs or Infs and all timing measurements become meaningless. There are (at least) two solutions to this problem, of which we use the second:

First, you could repeatedly perform the FFT followed by the inverse FFT of the data set. The problem with this is that many FFT implementations compute an unnormalized transform, in which the FFT followed by the inverse yields the original array multiplied by N (the size of the array). Again, you have a diverging process. Now, you could simply loop through the array after the transforms and scale the data by 1/N. Naively, you might think that the time for this calculation should be included in the performance measurements. In most real situations, however, the transform is followed or preceded by some computation on the data, into which the scaling by 1/N can be absorbed. (Most importantly for large transforms, the cost of the extra loads of the data can be completely eliminated in this way.) So, in order to model the needs of real applications, it doesn't make sense to include the cost of the scaling in the performance of an FFT. One solution might be to measure the cost of the scaling separately, and subtract it from the elapsed time; as long as you are going to do this, however, you might as well use the second solution, below. (Also, some public-domain FFTs do not include inverse transforms.)
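
To illustrate the absorption of the 1/N scaling mentioned above: a typical use of the FFT is convolution, where the transformed data is multiplied pointwise by a filter, and the 1/N can be folded into that multiplication at no extra cost. The following sketch is ours, with illustrative names, just to make the idea concrete:

#include <stddef.h>

/* Sketch: absorbing the 1/N normalization of an unnormalized FFT
   into a pointwise multiplication that the application performs
   anyway (names are illustrative, not from any FFT library). */
void apply_filter_normalized(double *re, double *im,
                             const double *filter, size_t n)
{
     double scale = 1.0 / (double) n;
     for (size_t i = 0; i < n; ++i) {
          double c = filter[i] * scale;  /* fold 1/N into the filter */
          re[i] *= c;
          im[i] *= c;
     }
}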

The second possibility, which is used in our benchmark, is to repeatedly FFT the same array, but to reinitialize the array before each transform. The time for these reinitializations is measured separately and is subtracted from the elapsed time for the FFTs. (It turns out that the initialization time is almost always negligible, but it doesn't hurt to be thorough.)

Note that we only time one of the FFTs, either the forward or the backward transform. We make the reasonable assumption that the forward and backward transforms take the same time to compute, and so it is not necessary to measure the performance of both (for most programs, the two cases use the same code).

Timing

It is necessary to time for a long enough period that the resolution of the clock is not an issue, but not for so long that the benchmark takes forever to run. Our solution is to repeatedly double the number of iterations used (starting with 1 iteration) until the elapsed time is at least 1 second. This is repeated for every FFT code and every transform size. So, in pseudo-code, the benchmark process (for one FFT and one transform size) is:
perform one-time initializations
initialize data
fft data
num_iters = 1

do
     get start_time
     for iteration = 1 to num_iters do
          initialize data
          fft data
     get end_time
     t = end_time - start_time
     if (t < 1.0) then
          num_iters = num_iters * 2
while t < 1.0

get start_time
for iteration = 1 to num_iters do
     initialize data
get end_time
t = t - (end_time - start_time)
You might wonder why we initialize the array and compute the FFT once before the timing starts. There are two reasons. First, some codes perform one-time initializations the first time you call them. Second, we don't want to measure the time taken to load instructions and data into the cache on the first call (although this is probably insignificant anyway).
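
For concreteness, here is a rough C rendering of the loop above. The functions initialize_data and fft_data are hypothetical stand-ins for the code under test, and we use the portable clock() timer for simplicity (an actual benchmark might prefer a finer-resolution clock):

#include <time.h>

extern void initialize_data(void);  /* stand-in: refill the array */
extern void fft_data(void);         /* stand-in: transform the array */

double seconds_per_fft(void)
{
     int num_iters = 1;
     double t;

     initialize_data();  /* warm-up: one-time setup, cache loading */
     fft_data();

     /* Double num_iters until the timed loop runs for at least 1 s. */
     do {
          clock_t start = clock();
          for (int i = 0; i < num_iters; ++i) {
               initialize_data();
               fft_data();
          }
          t = (double) (clock() - start) / CLOCKS_PER_SEC;
          if (t < 1.0)
               num_iters *= 2;
     } while (t < 1.0);

     /* Measure the reinitializations alone and subtract their time. */
     {
          clock_t start = clock();
          for (int i = 0; i < num_iters; ++i)
               initialize_data();
          t -= (double) (clock() - start) / CLOCKS_PER_SEC;
     }

     return t / num_iters;  /* seconds for one FFT */
}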

Performance Numbers Reported

Our benchmark reports the "mflops" of each FFT for every transform size. This is defined to be:

"mflops" = 5 N log2N / (time for one FFT in µs)

Here, N is the size of the transform (total number of points in multi-dimensional FFTs). "mflops" is in quotes because it is not really the MFLOPS of the FFT, and is likely to be a source of confusion for some readers. Regardless, we believe that it is the best way to report performance, as we shall explain below.

The first number that one might think to report is the elapsed time "t" from above. Since each FFT might be running for a different number of iterations, however, you need to at least divide t by the number of iterations, yielding the time for one FFT. This is still unsatisfactory, because the time for one FFT inherently increases with transform size, making it impossible to compare results for different transform sizes, or even view them together in a single graph.

Since the number of instructions executed in an FFT is O(N log2(N)), it makes sense to divide the time for one FFT by N log2(N); call this quantity t'. Now, t' is comparable even between different transform sizes and would seem a suitable number to report. There is a shortcoming, however, if you try to plot t' on a single graph for all FFTs and transform sizes. Since t' is smaller for fast FFTs and larger for slow FFTs, most of the graph is occupied by the slow FFTs, while the fast FFTs are huddled near the bottom where they are difficult to compare. This is unacceptable--it is the fast FFTs, after all, that you are most interested in.

Instead of t', one can report 1/t'. This will yield graphs where the fast FFTs are emphasized and are easy to compare, while the slow FFTs will be clustered near the bottom of the plot. However, by this point you have lost all intuition about the meaning of the magnitudes (as opposed to the relative values) of the numbers that you are reporting. 1/t' is also inconvenient to compare with numbers quoted in the literature.

Instead, we report 5/t', or 5 N log2(N) / (time for one FFT in µs). The reason for this is that the number of floating-point operations in a radix-2 Cooley-Tukey FFT of size N is 5 N log2(N). If we assume that this is also an approximation for the number of operations in any FFT, then 5/t' is roughly equal to the MFLOPS (millions of floating-point operations per second) of the transforms. That is why we call it the "mflops," and it has the advantage that its absolute magnitude has some meaning. (It is also a standard way of reporting FFT performance in the literature.) Note that the relative values of the "mflops" are still the most important quantities, however.
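
In code, the conversion from measured time to "mflops" is a one-liner; assuming a function like seconds_per_fft from the earlier sketch, it might read:

#include <math.h>

/* "mflops" = 5 N log2(N) / (time for one FFT in microseconds),
   where n is the total number of points and t is seconds per FFT. */
double mflops(int n, double t)
{
     return 5.0 * n * log2((double) n) / (t * 1e6);
}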

For example, the 167MHz UltraSPARC can perform 2 fp operations at a time, and is thus capable in "theory" (as opposed to reality) of 334 MFLOPS. From the benchmark results, we see that we can achieve about 2/3 of that for small transforms, where floating point computations dominate, and much less for larger transforms that are dominated by memory access.

Some people might propose that we report the actual MFLOPS of each program. Aside from the fact that this is difficult to compute exactly (you have to count how many floating-point operations are performed), it is also useless. The basic problem is that you cannot make meaningful comparisons of the actual MFLOPS of two programs. One could have a higher MFLOPS than the other simply by performing lots of superfluous operations in a tight loop! What you are interested in is how long a program takes to compute an answer, not how many multiplications and additions it takes to get there.

Accuracy Measurement

Measuring the accuracy of an FFT is much simpler than measuring its performance. Essentially, all we do is perform the FFT and inverse FFT of some data, scale it if necessary, and compare the result to the original data. The difference is reported as a "mean fractional error." If x is the original data and new_x is the data after the transforms, then we define the mean fractional error as:

average over i of 2 |x_i - new_x_i| / (|x_i| + |new_x_i| + epsilon)

Here, epsilon is a small number to prevent us from dividing by zero.

The original x_i consist of pseudo-random numbers (generated by the rand() function).
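
A sketch of this computation in C, for a complex array of length n (the value of epsilon here is our own choice; any constant that is tiny compared to the data will do):

#include <complex.h>

double mean_fractional_error(const double complex *x,
                             const double complex *new_x, int n)
{
     const double epsilon = 1e-16;  /* assumed value; avoids 0/0 */
     double sum = 0.0;
     for (int i = 0; i < n; ++i)
          sum += 2.0 * cabs(x[i] - new_x[i])
                     / (cabs(x[i]) + cabs(new_x[i]) + epsilon);
     return sum / n;
}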

If an FFT implementation does not provide an inverse FFT, then we construct one using the identity that ifft(x) = fft(x*)*, where "*" denotes complex conjugation. Since complex conjugation is an exact operation, this procedure does not introduce additional error and we are measuring the accuracy of the FFT subroutine exclusively.
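
Applying this identity in code is straightforward. Given only a (hypothetical) forward transform forward_fft, the constructed inverse is just two conjugation passes around it:

#include <complex.h>

extern void forward_fft(double complex *a, int n);  /* hypothetical */

/* ifft(x) = fft(x*)*: conjugate, forward-transform, conjugate again.
   If forward_fft is unnormalized, this inverse is likewise off by a
   factor of n from the normalized inverse. */
void inverse_fft(double complex *a, int n)
{
     for (int i = 0; i < n; ++i)
          a[i] = conj(a[i]);
     forward_fft(a, n);
     for (int i = 0; i < n; ++i)
          a[i] = conj(a[i]);
}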

Default Transform Sizes

Unless a particular transform size is specified, the benchmark is run for a hard-coded selection of "representative" sizes. These sizes fall into two groups.

First, there are all the powers of two up to the memory limit of your machine. These are by far the most common transform sizes in the real world as they are typically the most efficient to compute. They are also the easiest to code, and many FFT implementations only support transforms of sizes that are powers of two.

Second, there are (more-or-less) randomly selected numbers whose only prime factors are 2, 3, 5, and 7. These were chosen because the FFT is usually fastest for numbers with small prime factors, and so real applications usually try to limit themselves to such sizes even if they don't restrict themselves to powers of two.
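
A size belongs to this second group exactly when repeatedly dividing out the factors 2, 3, 5, and 7 leaves 1; a small test in C (our own sketch, not taken from the benchmark source):

/* Returns 1 if n has no prime factors other than 2, 3, 5, and 7. */
int is_smooth(int n)
{
     static const int factors[] = { 2, 3, 5, 7 };
     if (n < 1)
          return 0;
     for (int i = 0; i < 4; ++i)
          while (n % factors[i] == 0)
               n /= factors[i];
     return n == 1;
}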

A Few Words on Bias

As we are the authors of FFTW, you might be justifiably concerned that we could have tilted the benchmark in our favor. In fact, the basic methodology of the benchmark pre-dates FFTW, and we believe its fairness and neutrality should be evident from the discussion above. We have strictly avoided any "tweaking" of the measurements in such a way as to favor a particular FFT code (e.g. fudging data alignments, default transform sizes, etcetera). Feel free to look at the source code if you are worried, and don't hesitate to email us if you have any questions or concerns regarding our methods.