In the early '90s, while I was still getting my undergraduate computer engineering degree, I took a class in signal processing that I really enjoyed. One of the major projects we were tasked with was implementing a two-dimensional fast Fourier transform (FFT). The class spent several lectures proving that a simple implementation of convolution in the spatial domain required O(n^2) math operations. On the other hand, if you switched to the frequency domain, you needed O(n log(n)) operations to perform the FFT, O(n) to perform a single multiplication per pixel, and then another O(n log(n)) to convert back to the spatial domain. We proved it was better! And then we implemented it, and it was orders of magnitude slower. Theory and practice didn't match. We showed it was slow! The issue was a boundary we had abstracted away in our performance proofs: memory access was not a uniform cost. The math operations we were optimizing ...
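
A minimal sketch of the trade-off described above, in NumPy rather than whatever the class actually used; the function names, array sizes, and kernel size are illustrative, not from the original assignment. Both routes compute the same circular convolution, one directly in the spatial domain and one via the frequency domain:

```python
# Illustrative sketch, not the original coursework. Compares a direct
# spatial-domain convolution against the FFT round-trip argued above.
import numpy as np

def conv2d_spatial(image, kernel):
    """Direct circular convolution: one k-element multiply-add per pixel,
    O(n * k) overall, which is O(n^2) when the kernel is comparable in
    size to the image."""
    kh, kw = kernel.shape
    # Wrap-pad so the result is circular, matching the FFT path exactly.
    padded = np.pad(image, ((kh - 1, 0), (kw - 1, 0)), mode="wrap")
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.empty_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

def conv2d_fft(image, kernel):
    """Frequency-domain route: two O(n log n) transforms bracketing an
    O(n) pointwise multiply, exactly the operation count argued above."""
    spectrum = np.fft.fft2(image) * np.fft.fft2(kernel, s=image.shape)
    return np.real(np.fft.ifft2(spectrum))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((128, 128))
    ker = rng.random((15, 15))
    direct = conv2d_spatial(img, ker)
    via_fft = conv2d_fft(img, ker)
    # The two routes agree up to floating-point error.
    print("max abs difference:", np.abs(direct - via_fft).max())
```

The asymptotic counts favor the FFT route, which is the whole point of the anecdote: the proof counts arithmetic operations, while the measured runtime is dominated by how each route touches memory.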