Differing CPU feature sets were one of the original reasons you never ran a generic kernel: a generic kernel is just that, generic, so it makes no assumptions about the hardware. Windows, and later Linux and FreeBSD, gained kernel modules that can adapt behavior at runtime, but even that doesn't always extract maximum performance from integer code. Fortunately, the cost of the generic path isn't much of a burden.
That said, I'm sure HPC folks compile with all the bells and whistles enabled for the exact CPU model, options, and probably even stepping.
HPC guy here: nope, we don't bother recompiling the kernel for every piece of hardware we have. HPC code tends to spend 99.99% of its time in user space, so kernel performance doesn't matter that much.
End-user applications are another matter, and there using all the latest vector instructions etc. can make a difference, though usually less of one than you might hope. The really big win tends to come from using optimized libraries such as OpenBLAS, FFTW, or MKL instead of writing naive numerical linear algebra yourself or linking the reference Netlib BLAS.
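To make the library point concrete, here's a toy sketch (my own illustration, not from the thread) comparing a textbook triple-loop matrix multiply against NumPy's `@`, which dispatches to whatever optimized BLAS NumPy was built against (OpenBLAS, MKL, etc.):

```python
import time
import numpy as np

def naive_matmul(A, B):
    """Textbook triple loop: no blocking, no vectorization, no cache awareness."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i, p] * B[p, j]
            C[i, j] = s
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
B = rng.standard_normal((200, 200))

t0 = time.perf_counter()
C_naive = naive_matmul(A, B)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
C_blas = A @ B  # dispatches to the optimized BLAS NumPy links against
t_blas = time.perf_counter() - t0

assert np.allclose(C_naive, C_blas)
print(f"naive: {t_naive:.3f}s  BLAS: {t_blas:.5f}s")
```

Even at this small size the BLAS call is typically orders of magnitude faster, and the gap widens with matrix size; compiler flags tuned for your CPU stepping buy nothing close to that.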
Another very common problem we see is poor application I/O patterns. Every HPC site loves to brag about how many GB/s its Lustre system delivers, but divide that by the number of CPU cores in the cluster and the per-core figure is quite low. Additionally, like other clustered file systems, Lustre has relatively poor metadata performance, so applications banging on lots of small files can easily tank the performance of the entire Lustre system.
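A minimal illustration of the small-file anti-pattern (my own sketch, run against a local filesystem; on Lustre the gap is far larger, since every file create is a round trip to the metadata server):

```python
import os
import tempfile
import time

payload = b"x" * 512  # one tiny record, the kind apps often write per file
n = 2000

with tempfile.TemporaryDirectory() as d:
    # Anti-pattern: one file per record. Each iteration is a create,
    # an open, and a close -- n metadata operations in total.
    t0 = time.perf_counter()
    for i in range(n):
        with open(os.path.join(d, f"rec{i:05d}.dat"), "wb") as f:
            f.write(payload)
    t_many = time.perf_counter() - t0

    # Better: append all records to a single file. One metadata
    # operation, then sequential writes the file system can stream.
    t0 = time.perf_counter()
    with open(os.path.join(d, "records.dat"), "wb") as f:
        for i in range(n):
            f.write(payload)
    t_one = time.perf_counter() - t0

print(f"{n} tiny files: {t_many:.3f}s  single file: {t_one:.3f}s")
```

Same bytes on disk, very different load on the metadata server; container formats (HDF5, tar, object stores) are the production-grade version of the second pattern.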