I'm compiling a relatively simple maths routine that takes 8-bit unsigned image data from a byte buffer, performs some simple linear floating-point operations on that data, and writes the modified 8-bit data back to the same buffer. There are nested column and row loops to allow for a custom row stride. The routine is written as a template function with two type parameters so I can test the effect of using different types for the interim calculation (short/int for an integer calculation, float/double for a floating-point calculation).
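For reference, a stripped-down sketch of the kind of routine I'm describing looks like this (the function name, the clamping and the exact linear maths are illustrative rather than my production code):

    #include <cstddef>
    #include <cstdint>

    // TCalc  : interim type under test (short/int or float/double)
    // TCoeff : floating-point type used for the linear coefficients
    template <typename TCalc, typename TCoeff>
    void applyLinear(std::uint8_t* buffer, int width, int height,
                     std::ptrdiff_t rowStride, TCoeff gain, TCoeff offset)
    {
        for (int y = 0; y < height; ++y)            // row loop (custom stride)
        {
            std::uint8_t* row = buffer + y * rowStride;
            for (int x = 0; x < width; ++x)         // column loop
            {
                // promote, apply the linear transform, clamp, write back
                TCalc v = static_cast<TCalc>(row[x] * gain + offset);
                if (v < TCalc(0))   v = TCalc(0);
                if (v > TCalc(255)) v = TCalc(255);
                row[x] = static_cast<std::uint8_t>(v);
            }
        }
    }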
I'm using the Intel Parallel Studio 2015 C++ compiler within Visual Studio 2010, testing on Windows 7 x64 on an i7-4980HQ 2.8 GHz processor, compiling as 32-bit optimized release code.
The code is clearly vectorizable, and in my performance tests I see a large (~30%) performance boost when I compile for SSE 4.1 rather than SSE2. Compiling for SSE 4.2/AVX/AVX2 gives anywhere from the same to about 10% worse performance than compiling for SSE 4.1, so targeting SSE 4.1 seems to be the sweet spot. Great!
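For reference, the instruction-set switches I compared were along these lines (Intel Windows option spellings), with the timings summarised from the tests above:

    /QxSSE2          baseline of the comparison
    /QxSSE4.1        roughly 30% faster than the SSE2 build
    /QxSSE4.2        between equal and ~10% slower than /QxSSE4.1
    /QxAVX           between equal and ~10% slower than /QxSSE4.1
    /QxCORE-AVX2     between equal and ~10% slower than /QxSSE4.1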
However, if I use /Qax to add a processor-specific branch on top of a default baseline, rather than compiling with /Qx, the performance is exactly the same as without /Qax, which implies the compiler has decided that the SSE 4.1 processor branch is not worthwhile, even though my tests show that it clearly is. Adding #pragma loop_count with a large number of iterations makes no difference.
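This is roughly how the auto-dispatch variant was set up, and where I placed the pragma; the iteration count shown is just a representative large value rather than my real image dimensions:

    // Auto-dispatch build (Intel Windows syntax), e.g.:
    //   icl /O2 /QaxSSE4.1 maths_routine.cpp
    // i.e. a default code path plus an SSE 4.1 specialised path.

    #pragma loop_count(1000000)    // large hint; made no difference to dispatch
    for (int y = 0; y < height; ++y)
    {
        // ... column loop exactly as in the sketch above ...
    }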
I can see why it's extremely difficult for the compiler to guarantee that its decisions for /Qax always give the best performance on all CPUs, but it seems like I've got no flexibility when I disagree with the compiler's decision. I can't find any obvious #pragmas or other techniques to make /Qax do what I'd prefer. Any suggestions? Ideally I'd like to be able to specify or override the /Qax decision with a #pragma for more fine-grained control. Can I improve the results with /Qax by avoiding inline or template functions, or by tweaking my C++ coding style?