The fact that different CPUs have different features was one of the original reasons you never ran a generic kernel: the generic kernel was just that, generic, so it made no assumptions about the hardware. Modern Windows, Linux, and FreeBSD kernels have loadable modules that can change behavior at runtime, but sometimes that still doesn't give you maximum performance for integer code. Fortunately, it isn't much of a burden.
That said, I'm sure HPC folks compile with all the bells and whistles for the exact model, options, and probably CPU stepping enabled.
HPC guy here: nope, we don't bother recompiling the kernel for every piece of hardware we have. HPC code tends to spend 99.99% of its time in user space, so kernel performance doesn't matter that much.
End-user applications are another matter, and there using all the latest vector instructions etc. can make a difference. Usually less of one than you might hope, though. The really big deal tends to be using optimized libraries such as OpenBLAS, FFTW, or MKL instead of doing numerical linear algebra yourself in a naive fashion, or using the reference netlib BLAS.
Another very common problem we see is poor application I/O patterns. Yes, every HPC site loves to brag how many GB/s their Lustre system does, but if you divide that by the number of CPU cores in a cluster, that ratio is quite low. Additionally, like other clustered file systems, Lustre metadata performance is relatively poor, so applications banging on lots of small files can easily tank the performance of the entire Lustre system.
It solves some issues and creates others. With FMV you essentially have to build the compilation unit with the highest level of microarch support, so that the preprocessor interprets the intrinsics headers and doesn't eliminate intrinsics you might need.
(ICC solves this in a differently annoying way - where all the intrinsics are available, even if they are incompatible with the platform you are on.)
For some light entertainment, think about what happens with static initializers and compiling for different microarch flavours. If your C++ static init function happens to generate an AVX insn, and you've only got SSE2, welcome to SIGILL before main().
Can't you get the same behavior as icc in gcc/clang by just using target specific optimization options at the function level?
See for example stage 1 here: https://gcc.gnu.org/wiki/FunctionSpecificOpt (that document appears dated, but do things still work that way?) Afaik, clang/llvm have similar functionality.
That stage1 example is pretty ugly - using __builtin_ia32<x> would work, but they are the only things harder to read than the intrinsics themselves!
Plus there are some intrinsics that are just macros, (sets, masks, etc), and you don't get them from the preprocessor just by setting the function target.
As an aside, that page really is dated (it's just early proposals, after all): SSE5 never saw the light of day in that form, and VPCMOV ended up in AMD's XOP set instead.
ICC also has both automatic and manual dispatch options.
I believe this is the area where Intel had their knuckles rapped for only working on "GenuineIntel" processors, and why there are big disclaimers on everything now. I've not tried using these myself as they aren't portable solutions.
I wonder how long we have until we run out of x86 opcodes. Even if it's a variable-length instruction encoding, there's going to come a point where the size of the instruction outweighs any performance benefits, not to mention the effort that goes into designing new instruction extensions in a chip.
Then again, perhaps that's all that's left for Intel to do now. Evolution of marketing ploys: transistors -> clock speed -> #cores -> instruction extensions.
> there's going to come a point where the size of the instruction outweighs any performance benefits
Except for the fact that many of these new instructions perform some huge nontrivial operation in hardware that would've required hundreds or more regular instructions previously --- AES is a good example. It seems like a general principle that instruction sets tend to become more CISC-y over time, as dedicated hardware and instructions designed to operate on such replace slower software implementations.
x86_64 already reused a couple of short opcodes to make a huge number of long ones. From the modest size of today's implementations of decoders, it doesn't seem like instructions are going to kill x86 any time soon... except on the lowest end, where x86 already doesn't have much marketshare.
Pretty nice. I like the code in the article; having separate functions for different arches is much cleaner to look at. This will come in handy during code maintenance.