Name me another sorting algorithm that does no reads that cannot be predicted several hundred cycles in advance, and O(1) writes to locations that are read from within several hundred cycles. I don't know of any.
Remember: a cache miss costs easily several hundred cycles.
And, to ask a question: if radix sort is faster, and not due to better cache performance, what is it due to?