I wonder, does Intel's TSX/HLE help with these workloads? If it's read-only then I'd expect that it'd be able to elide a lot of the locking (assuming the Intel-designed heuristics do the job).
I bought one of the first Haswell notebooks to play around with TSX. Before I had time to do so, Intel found the TSX bug...
I hope to have time to play around with it once I have new hardware (I refuse to do performance development on virtualized hardware).
But honestly, most of the remaining performance/scalability problems in pg are algorithmic in nature. So micro-optimizations, and that's what I'd call TSX/HLE, aren't likely to address the biggest bottlenecks.