I have been compiling some benchmark with ICC. I am seeing results where optimized version i.e executable generated with -O3 takes more time than the executable generated with -O0. Although when I generate the vectorization report by using flag -vec-report5 I see that compiler chooses to vectorize because
scalar loop cost : 28
vector loop cost : 7.680
estimated potential speedup: 3.630
But when I run the executables then vectorized version takes more time than the nonvectorized executable, even difference is about 10 secs. I just wanted to know that is it really possible as mentioned in the above case, or am I not able to visualize something.