The FFT performance benchmark is designed to model a situation in which you repeatedly perform transforms of a particular size, with some computation on the data in between transforms. (Any one-time initialization cost is therefore not included in the timing measurements.) This seems to be the most common kind of use for FFT code, especially in cases where performance is important.
Essentially, a given array is repeatedly FFTed and the elapsed time is measured. There are a couple of tricky details, however. The basic problem is that repeatedly transforming the same array is a diverging process: before long, you will be multiplying NaNs or Infs and all timing measurements become meaningless. There are (at least) two solutions to this problem, of which we use the second:
First, you could repeatedly perform the FFT followed by the inverse FFT of the data set. The problem with this is that many FFT implementations compute an unnormalized transform, in which the FFT followed by the inverse yields the original array multiplied by N (the size of the array). Again, you have a diverging process. Now, you could simply loop through the array after the transforms and scale the data by 1/N. Naively, you might think that the time for this calculation should be included in the performance measurements. In most real situations, however, the transform is followed or preceded by some computation on the data, into which the scaling by 1/N can be absorbed. (Most importantly for large transforms, the cost of the extra loads of the data can be completely eliminated in this way.) So, in order to model the needs of real applications, it doesn't make sense to include the cost of the scaling in the performance of an FFT. One solution might be to measure the cost of the scaling separately, and subtract it from the elapsed time; as long as you are going to do this, however, you might as well use the second solution, below. (Also, some public-domain FFTs do not include inverse transforms.)
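As an illustration of absorbing the 1/N scale into adjacent work, here is a minimal Python sketch. The window multiply stands in for whatever per-element computation the application already performs; the function name and that computation are hypothetical, not part of the benchmark:

```python
def scale_into_existing_pass(data, window):
    # Hypothetical post-transform step: apply a window to each sample.
    # The 1/N normalization of the unnormalized inverse transform is
    # folded into the same multiply, so no extra pass over the data
    # (and no extra loads) is needed.
    inv_n = 1.0 / len(data)
    return [x * (w * inv_n) for x, w in zip(data, window)]
```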
The second possibility, which is used in our benchmark, is to repeatedly FFT the same array, but to reinitialize the array before each transform. The time for these reinitializations is measured separately and is subtracted from the elapsed time for the FFTs. (It turns out that the initialization time is almost always negligible, but it doesn't hurt to be thorough.)
Note that we only time one of the FFTs, either the forward or the backward transform. We make the reasonable assumption that the forward and backward transforms take the same time to compute, and so it is not necessary to measure the performance of both (for most programs, the two cases use the same code).
    perform one-time initializations
    initialize data
    fft data
    num_iters = 1
    do
        get start_time
        for iteration = 1 to num_iters do
            initialize data
            fft data
        get end_time
        t = end_time - start_time
        if (t < 1.0) then num_iters = num_iters * 2
    while t < 1.0
    get start_time
    for iteration = 1 to num_iters do
        initialize data
    get end_time
    t = t - (end_time - start_time)

You might wonder why we initialize the array and compute the FFT once before the timing starts. There are two reasons. First, some codes perform one-time initializations the first time you call them. Second, we don't want to measure the time taken to load instructions and data into the cache on the first call (although this is probably insignificant anyway).
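The timing loop above can be sketched in runnable Python. The function names and the min_time parameter are mine, and time.perf_counter stands in for whatever high-resolution timer the platform provides:

```python
import time

def benchmark_fft(fft, initialize, n, min_time=1.0):
    """Time one FFT of size n, following the pseudocode above."""
    data = initialize(n)
    fft(data)                       # warm-up call, excluded from timing
    num_iters = 1
    while True:
        start = time.perf_counter()
        for _ in range(num_iters):
            data = initialize(n)
            fft(data)
        t = time.perf_counter() - start
        if t >= min_time:
            break
        num_iters *= 2              # run too short to time reliably; double
    # measure the reinitializations separately and subtract their cost
    start = time.perf_counter()
    for _ in range(num_iters):
        data = initialize(n)
    t -= time.perf_counter() - start
    return t / num_iters            # seconds per FFT
```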
"mflops" = 5 N log2N / (time for one FFT in µs)
Here, N is the size of the transform (total number of points in multi-dimensional FFTs). "mflops" is in quotes because it is not really the MFLOPS of the FFT, and is likely to be a source of confusion for some readers. Regardless, we believe that it is the best way to report performance, as we shall explain below.
The first number that one might think to report is the elapsed time "t" from above. Since each FFT might be running for a different number of iterations, however, you need to at least divide t by the number of iterations, yielding the time for one FFT. This is still unsatisfactory, because the time for one FFT inherently increases with transform size, making it impossible to compare results for different transform sizes, or even view them together in a single graph.
Since the number of instructions executed in an FFT is O(N log2N), it makes sense to divide the time for one FFT by N log2N; call this quantity t'. Now, t' is comparable even between different transform sizes and would seem a suitable number to report. There is a shortcoming, however, if you try to plot t' on a single graph for all FFTs and transform sizes. Since t' is smaller for fast FFTs and larger for slow FFTs, most of the graph is occupied by the slow FFTs, while the fast FFTs are huddled near the bottom where they are difficult to compare. This is unacceptable: it is the fast FFTs, after all, that you are most interested in.
Instead of t', one can report 1/t'. This yields graphs where the fast FFTs are emphasized and easy to compare, while the slow FFTs are clustered near the bottom of the plot. By this point, however, you have lost all intuition about the meaning of the magnitudes (as opposed to the relative values) of the numbers that you are reporting. 1/t' is also inconvenient to compare with numbers quoted in the literature.
Instead, we report 5/t', or 5 N log2N / (time for one FFT in µs). The reason for this is that the number of floating point operations in a radix-2 Cooley-Tukey FFT of size N is 5 N log2N. If we assume that this is also an approximation for the number of operations in any FFT, then 5/t' is roughly equal to the MFLOPS (millions of floating-point operations per second) of the transforms. That is why we call it the "mflops," and it has the advantage that its absolute magnitude has some meaning. (It is also a standard way of reporting FFT performance in the literature.) Note that the relative values of the "mflops" are still the most important quantities, however.
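As a sketch, the conversion from an elapsed time to the reported "mflops" is straightforward (the function name is mine):

```python
import math

def mflops(n, seconds_per_fft):
    # "mflops" = 5 N log2(N) / (time for one FFT in microseconds)
    return 5.0 * n * math.log2(n) / (seconds_per_fft * 1e6)
```

For example, a size-1024 transform that takes 1 ms would report 5 * 1024 * 10 / 1000 = 51.2 "mflops".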
For example, the 167 MHz UltraSPARC can perform 2 floating-point operations at a time, and is thus capable in "theory" (as opposed to reality) of 334 MFLOPS. From the benchmark results, we see that we can achieve about 2/3 of that for small transforms, where floating-point computations dominate, and much less for larger transforms that are dominated by memory access.
Some people might propose that we report the actual MFLOPS of each program. Aside from the fact that this is difficult to compute exactly (you have to count how many floating-point operations are performed), it is also useless. The basic problem is that you cannot make meaningful comparisons of the actual MFLOPS of two programs. One could have a higher MFLOPS than the other simply by performing lots of superfluous operations in a tight loop! What you are interested in is how long a program takes to compute an answer, not how many multiplications and additions it takes to get there.
average of |xi - new_xi| * 2 / (|xi| + |new_xi| + epsilon)
Here, epsilon is a small number to prevent us from dividing by zero.
The original xi consist of pseudo-random numbers (generated by the rand() function).
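A direct Python transcription of this error measure (the function name and the particular epsilon value are mine):

```python
def relative_error(x, new_x, eps=1e-30):
    # average of |x_i - new_x_i| * 2 / (|x_i| + |new_x_i| + eps);
    # eps prevents division by zero when both values are zero
    terms = [2 * abs(a - b) / (abs(a) + abs(b) + eps)
             for a, b in zip(x, new_x)]
    return sum(terms) / len(terms)
```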
If an FFT implementation does not provide an inverse FFT, then we construct one using the identity ifft(x) = fft(x*)*, where "*" denotes complex conjugation. Since complex conjugation is an exact operation, this procedure does not introduce additional error and we are measuring the accuracy of the FFT subroutine exclusively.
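The identity can be checked with a minimal Python sketch using a naive O(N^2) DFT (the function names are mine; the 1/n factor below simply normalizes the unnormalized forward transform so the round trip recovers the input):

```python
import cmath

def dft(x):
    # naive O(N^2) unnormalized forward DFT
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]

def idft_via_conjugation(x):
    # inverse built from the forward transform: ifft(x) = fft(x*)*
    # (divided by n here to undo the unnormalized convention)
    n = len(x)
    return [v.conjugate() / n
            for v in dft([v.conjugate() for v in x])]
```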
First, there are all the powers of two up to the memory limit of your machine. These are by far the most common transform sizes in the real world as they are typically the most efficient to compute. They are also the easiest to code, and many FFT implementations only support transforms of sizes that are powers of two.
Second, there are (more-or-less) randomly selected numbers whose factors are powers of 2, 3, 5, and 7. These were chosen because the FFT is usually fastest for numbers with small prime factors, and so real applications usually try to limit themselves to such sizes even if they don't restrict themselves to powers of two.
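The corresponding size check is a short sketch (the function name is mine): a size qualifies if it factors entirely into powers of 2, 3, 5, and 7.

```python
def is_7_smooth(n):
    # divide out the allowed prime factors; the size is "fast"
    # if nothing else remains
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1
```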