This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cache-line length L where Z=\Omega(L^2) the number of cache misses for an m\times n matrix transpose is \Theta(1+mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is \Theta(1+(n/L)(1+\logZ n)). We also give an \Theta(mnp)-work algorithm to multiply an m\times n matrix by an n\times p matrix that incurs \Theta(1 + (mn+np+mp)/L + mnp/L\sqrt{Z}) cache faults.
We introduce an ``ideal-cache'' model to analyze our algorithms, and we prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels. We also prove that any optimal cache-oblivious algorithm is also optimal in the previously studied HMM and SUMH models. Algorithms developed for these earlier models are perforce cache-aware: their behavior varies as a function of hardware-dependent parameters which must be tuned to attain optimality. Our cache-oblivious algorithms achieve the same asymptotic optimality, but without any tuning.