If you look at the benchmarks for x32 (ILP32) you see an improvement of up to 10...

firethief · on March 7, 2019

That kind of benchmark shows the data-path cost of big pointers but misses much of the instruction-path cost of inefficient encoding, because that cost doesn't usually manifest as a bottleneck at the frontend. The cost is having a frontend that can keep up. On Intel cores this involves having a huge decoder that can deal with arbitrary alignment and long sequences of prefixes (anything up to the 15-byte instruction limit), and having caches both for instructions and for decoded uops (multiple types of the latter). Plus there are fundamental limitations on what the backend can do: there would be no point adding a 4th vector execution port, because the frontend is miles away from being able to keep up with that many instruction bytes. All of that, and still the programmer/compiler walks a knife's edge avoiding frontend bottlenecks, trying to keep code tight and aligned so the important parts fit in the loop buffer, or at least the uop cache, and don't have to squeeze through the decoders. Use it or don't, REX is paid for.
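
For anyone who hasn't looked at the encoding itself, here is a rough sketch of what a REX prefix costs in bytes for a plain register-to-register `add`. The helper name `encode_add` is made up for illustration, and it only handles the one form (opcode `0x01`, `ADD r/m, r`, register-direct); it is not a general assembler. The point is just that touching 64-bit operand size or r8..r15 adds a whole byte to the instruction.

```python
def encode_add(dst: int, src: int, wide: bool) -> bytes:
    """Encode `add dst, src` for register numbers 0..15 (hypothetical helper).

    Uses opcode 0x01 (ADD r/m, r) with a register-direct ModRM byte.
    `wide` selects 64-bit operand size (REX.W); otherwise 32-bit.
    """
    if not (0 <= dst <= 15 and 0 <= src <= 15):
        raise ValueError("register numbers must be in 0..15")

    # REX layout is 0100WRXB: W = 64-bit operand size,
    # R extends the ModRM reg field, B extends the ModRM rm field.
    rex = 0x40 | (int(wide) << 3) | ((src >> 3) << 2) | (dst >> 3)

    # ModRM: mod=11 (register direct), reg=source, rm=destination.
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)

    # For this 32/64-bit form, a REX byte with no bits set can be omitted.
    prefix = bytes([rex]) if rex != 0x40 else b""
    return prefix + bytes([0x01, modrm])


if __name__ == "__main__":
    cases = [
        ("add eax, ebx", 0, 3, False),
        ("add rax, rbx", 0, 3, True),
        ("add r8d, r9d", 8, 9, False),
        ("add r8, r9", 8, 9, True),
    ]
    for text, dst, src, wide in cases:
        code = encode_add(dst, src, wide)
        print(f"{text:14s} {code.hex(' ')}  ({len(code)} bytes)")
```

The legacy 32-bit form is 2 bytes; every variant that needs REX is 3. That byte count is the visible part. The decoder complexity needed to handle the prefix is there whether a given program uses it or not.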