In the early '90s, when I was still working on my undergraduate computer engineering degree, I took a class in signal processing that I really enjoyed. One of the major projects we were tasked with was implementing a two-dimensional fast Fourier transform (FFT).
The class spent several lectures proving that a simple implementation of convolution in the spatial domain required O(n^2) math operations. On the other hand, if you switched to the frequency domain, you needed O(n log(n)) operations to perform the FFT, O(n) to perform a single multiplication per pixel, and then another O(n log(n)) to convert the result back to the spatial domain.
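To make that gap concrete, here is a quick back-of-the-envelope sketch for the 1024x1024 case, following the class's way of counting (n is the number of pixels, and the kernel is assumed to be comparable in size to the image; the exact constants are mine, not the lecture's):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* n = number of pixels; counts follow the class's big-O argument:
     * direct convolution ~ n^2 multiplies (kernel comparable to the image),
     * FFT route ~ n log2(n) forward + n per-pixel multiplies + n log2(n) back. */
    double n = 1024.0 * 1024.0;
    double direct    = n * n;
    double fft_route = n * log2(n) + n + n * log2(n);
    printf("direct convolution : %.2e operations\n", direct);    /* ~1.1e+12 */
    printf("FFT-based filtering: %.2e operations\n", fft_route); /* ~4.3e+07 */
    return 0;
}

On paper, the frequency-domain route wins by a factor of roughly 25,000.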
We proved it was better!

And then we implemented it, and it was orders of magnitude slower. Theory and practice didn't match.

We showed it was slow!
The issue was a boundary we had abstracted away in our performance proofs - memory access was not a uniform cost. The math operations we had been so carefully counting were now completely irrelevant given the limitations of the computer system we were running on: a diskless, Motorola 68000-based Sun workstation.
Because we were supposed to be showing how much better the FFT code was as the image sizes increased, we had test images at 16x16, 64x64, 256x256, 512x512, and 1024x1024 pixels. Each of these images was then converted into an array of C double variables, so the largest image was 8 MB - twice the RAM of the computer, before even considering the operating system. Additionally, like students everywhere, we were all trying to do our projects at the same time, so the machine's memory was badly oversubscribed.
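Spelling out the arithmetic (a minimal sketch assuming one 8-byte C double per pixel and ignoring any extra working buffers the FFT itself needs):

#include <stdio.h>

int main(void)
{
    /* One C double (8 bytes) per pixel, ignoring the FFT's extra buffers. */
    int sizes[] = { 16, 64, 256, 512, 1024 };
    for (int i = 0; i < 5; i++) {
        long pixels = (long)sizes[i] * sizes[i];
        long bytes  = pixels * (long)sizeof(double);
        printf("%4d x %-4d : %9ld bytes (%.2f MB)\n",
               sizes[i], sizes[i], bytes, bytes / (1024.0 * 1024.0));
    }
    return 0;   /* the 1024x1024 image is 8,388,608 bytes: 8 MB */
}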
The final nail in the performance coffin was that the memory access pattern the FFT algorithm uses is not localized (at least for simple implementations; a lot of work has been done to improve this, but for the class we had to write it from scratch).
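Here is a minimal sketch of an iterative, in-place radix-2 FFT in C, written for this post rather than taken from our class code, just to show the access pattern. The bit-reversal pass swaps elements that sit far apart in the array, and the butterfly stride doubles every stage until paired elements are half the array apart; for the 2D transform, the column pass additionally jumps a full row width between consecutive elements.

#include <complex.h>
#include <stddef.h>

/* In-place iterative radix-2 FFT; n must be a power of two.
 * Written only to illustrate the memory access pattern. */
void fft(double complex *x, size_t n)
{
    const double PI = 3.14159265358979323846;

    /* Bit-reversal permutation: the swapped pairs sit far apart in memory. */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            double complex t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }

    /* Butterfly stages: the distance between the two elements touched by a
     * butterfly (len/2) doubles each stage, so the last stage pairs x[k]
     * with x[k + n/2], the opposite ends of the working set. */
    for (size_t len = 2; len <= n; len <<= 1) {
        double complex wlen = cexp(-2.0 * PI * I / (double)len);
        for (size_t start = 0; start < n; start += len) {
            double complex w = 1.0;
            for (size_t k = 0; k < len / 2; k++) {
                double complex u = x[start + k];
                double complex v = w * x[start + k + len / 2];
                x[start + k]           = u + v;
                x[start + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}

On a machine paging to a remote server, those long strides meant that the later butterfly stages could hit an evicted page on nearly every access.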
That scattered access pattern caused multiple page faults. On a computer with local storage this would still have been slow, because a memory access now had to wait for an IO operation. But it was made worse because the workstation was diskless - each IO operation went over a network:
Each IO operation now required a request/reply round trip to a file server over a shared, non-switched 10 Mb/s Ethernet segment. If any two machines on the network transmitted at the same time, their frames collided and every sender fell back to the randomized exponential backoff algorithm.
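For reference, classic CSMA/CD Ethernet uses truncated binary exponential backoff: after the k-th consecutive collision, a station waits a random number of slot times drawn uniformly from 0 to 2^min(k,10) - 1, where a slot is 51.2 microseconds at 10 Mb/s. A minimal, illustrative sketch (not anything from the actual driver):

#include <stdlib.h>

/* Truncated binary exponential backoff as used by classic CSMA/CD Ethernet:
 * after the k-th consecutive collision, wait a random number of slot times
 * drawn uniformly from 0 .. 2^min(k,10) - 1.  A slot is 51.2 us at 10 Mb/s. */
double backoff_delay_us(int collisions)
{
    const double slot_us = 51.2;
    int k = collisions < 10 ? collisions : 10;      /* exponent capped at 10 */
    long max_slots = 1L << k;                       /* 2^k possible slots    */
    long wait_slots = rand() % max_slots;           /* uniform 0 .. 2^k - 1  */
    return (double)wait_slots * slot_us;
}

With every student's workstation paging over the same wire, each collision added its random delay on top of the page fault itself.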
On the other hand, the standard convolution algorithm moved predictably through the image, one pixel at a time, generating a small number of page faults and making it run much faster.
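A minimal sketch of that direct convolution, assuming a row-major image of doubles and a small, odd-sized kernel (the layout and the function signature are my own illustration, not the class's starter code):

/* Direct spatial-domain convolution over a width x height row-major image.
 * ksize is the (odd) kernel width; borders are skipped for brevity. */
void convolve(const double *in, double *out,
              int width, int height,
              const double *kernel, int ksize)
{
    int r = ksize / 2;
    for (int y = r; y < height - r; y++) {
        for (int x = r; x < width - r; x++) {
            double sum = 0.0;
            /* Only a ksize x ksize neighborhood of 'in' is touched, and it
             * slides one pixel at a time as x advances through the row. */
            for (int ky = -r; ky <= r; ky++)
                for (int kx = -r; kx <= r; kx++)
                    sum += in[(y + ky) * width + (x + kx)]
                         * kernel[(ky + r) * ksize + (kx + r)];
            out[y * width + x] = sum;
        }
    }
}

Each output pixel touches only a small neighborhood of the input, and that neighborhood moves forward in memory order, so consecutive iterations mostly land on pages that are already resident.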
When memory size is the limitation, there is an entire field of research dedicated to cache-oblivious algorithms. This was one of the first times I had run into a case that so clearly illustrated that the assumptions you bring with you are boundaries you can cross at your peril, sometimes without even knowing that you have done so.