The fact that different CPUs have different features was one of the original reasons you never ran a generic kernel: the generic kernel was just that, generic, so it made no assumptions about the hardware. Modern Windows, Linux, and FreeBSD kernels have loadable modules that can change behavior at runtime, but sometimes that still doesn't give you maximum performance for integer code. Fortunately, it isn't much of a burden.
That said, I'm sure HPC folks compile with all the bells and whistles for the exact model, options, and probably CPU stepping enabled.
HPC guy here: nope, we don't bother recompiling the kernel for every piece of hardware we have. HPC code tends to spend 99.99% of its time in user space, so kernel performance doesn't matter that much.
End-user applications are another matter, and there using all the latest vector instructions etc. can make a difference. Usually less of one than you might hope, though. The really big deal tends to be using optimized libraries such as OpenBLAS, FFTW, or MKL instead of doing numerical linear algebra yourself in a naive fashion, or using the reference netlib BLAS.
Another very common problem we see is poor application I/O patterns. Yes, every HPC site loves to brag how many GB/s their Lustre system does, but if you divide that by the number of CPU cores in a cluster, that ratio is quite low. Additionally, like other clustered file systems, Lustre metadata performance is relatively poor, so applications banging on lots of small files can easily tank the performance of the entire Lustre system.
It solves some issues and creates others. With FMV you essentially have to build the compilation unit with the highest level of microarch support, so that the preprocessor interprets the intrinsics headers and doesn't eliminate intrinsics you might need.
(ICC solves this in a differently annoying way - where all the intrinsics are available, even if they are incompatible with the platform you are on.)
For some light entertainment, think about what happens with static initializers and compiling for different microarch flavours. If your C++ static init function happens to generate an AVX insn, and you've only got SSE2, welcome to SIGILL before main().
Can't you get the same behavior as icc in gcc/clang by just using target specific optimization options at the function level?
See for example stage 1 here: https://gcc.gnu.org/wiki/FunctionSpecificOpt (that document appears dated, but do things still work that way?) Afaik, clang/llvm have similar functionality.
That stage1 example is pretty ugly - using __builtin_ia32<x> would work, but they are the only things harder to read than the intrinsics themselves!
Plus there are some intrinsics that are just macros, (sets, masks, etc), and you don't get them from the preprocessor just by setting the function target.
As an aside, that page really is dated (it's just early proposals, after all): SSE5 never saw the light of day in that form, and VPCMOV ended up in AMD's XOP set instead.
ICC also has both automatic and manual dispatch options.
I believe this is the area where Intel had their knuckles rapped for only working on "GenuineIntel" processors, and why there are big disclaimers on everything now. I've not tried using these myself as they aren't portable solutions.
I wonder how long we have until we run out of x86 opcodes. Even if it's a variable-length instruction encoding, there's going to come a point where the size of the instruction outweighs any performance benefits, not to mention the effort that goes into designing new instruction extensions in a chip.
Then again, perhaps that's all that's left for Intel to do now. Evolution of marketing ploys: transistors -> clock speed -> #cores -> instruction extensions.
> there's going to come a point where the size of the instruction outweighs any performance benefits
Except for the fact that many of these new instructions perform some huge nontrivial operation in hardware that would've required hundreds or more regular instructions previously --- AES is a good example. It seems like a general principle that instruction sets tend to become more CISC-y over time, as dedicated hardware and instructions designed to operate on such replace slower software implementations.
x86_64 already reused a couple of short opcodes to make a huge number of long ones. From the modest size of today's implementations of decoders, it doesn't seem like instructions are going to kill x86 any time soon... except on the lowest end, where x86 already doesn't have much marketshare.
Pretty nice. I like the code in the article; having separate functions for different arches is much cleaner to look at. This will come in handy during code maintenance.