Hi, I'm fighting with ICC while trying to optimize some code that looks something like this:
float Function(float x) {
    // a few lines of code, some short floating point math
}

...

for (int i=0; i<cnt; i++) {
    // a few lines of code
    dst[i] = Function(x);
}
I also added a "force vectorize" pragma to the loop. If I leave the code like this, the test program takes 10 seconds. If I paste the body of "Function" directly into the loop, however, it takes 6 seconds, because ICC then correctly uses AVX and generates a fairly long vectorized sequence for it. That is roughly a 40% reduction in run time! I even tried _Pragma("vector always") before the "Function" call, but it made no difference; it is still slow.
Any ideas?