Problem Statement
When building applications with the Intel(R) Compiler, compiling with the -xAVX option sometimes yields no better performance than compiling with the default options, even though loops are vectorized.
Compiler Version
Intel(R) C++ Compiler for Linux* 14.0
Intel(R) Fortran Compiler for Linux* 14.0
Root Cause
Most of these cases share a similar scenario: a loop containing non-unit-stride accesses.
For example, the loop below is taken from an open-source benchmark:
for (r = 0; r < M; r++) {
    double sum = 0.0;
    int rowR = row[r];
    int rowRp1 = row[r+1];
    for (i = rowR; i < rowRp1; i++)
        sum += x[col[i]] * val[i];
    y[r] = sum;
}
The body of the inner loop contains an indirect access to array x: x[col[i]], where the index into x comes from another array, col. This forces the compiler to generate scalar instructions to load x's elements, plus additional instructions to pack those scalar values into a vector register. Because the accesses are scalar, AVX offers no performance benefit over SSE.
In addition, AVX may leave a larger remainder loop than SSE because of its longer vector length. The remainder loop is the scalar loop left over after the vectorized iterations are removed from the complete loop. Its iteration count is:
n % (vector length * unroll times)
where n is the total loop count and the unroll times is the unroll factor of the vectorized loop.
Since AVX has a longer vector length, it is more likely to produce a larger remainder loop, especially when the total loop count is unknown at compile time.
For all of these reasons, code compiled for AVX can run slower than code compiled for SSE in some cases.
The same scenario also applies to Fortran:
do r = 1, M
    sum = 0.d0
    do i = row(r), row(r+1)-1
        sum = sum + x(col(i))*val(i)
    enddo
    y(r) = sum
enddo
Solution
First, check whether any hotspot loops in your code contain non-unit-stride accesses.
If so, address the issue in one of two ways, according to the corresponding root cause:
1. Reduce non-unit-stride accesses where possible.
2. Make the remainder loop a similar size by adjusting the unroll factor:
In the above example, specify the unroll factor manually so that the AVX remainder loop matches the SSE version's:
#pragma unroll 4
#pragma simd reduction(+:sum) vectorlength(2)
for (r = 0; r < M; r++) {
    double sum = 0.0;
    int rowR = row[r];
    int rowRp1 = row[r+1];
    for (i = rowR; i < rowRp1; i++)
        sum += x[col[i]] * val[i];
    y[r] = sum;
}
Or
#pragma unroll 2
#pragma simd reduction(+:sum) vectorlength(4)
for (r = 0; r < M; r++) {
    double sum = 0.0;
    int rowR = row[r];
    int rowRp1 = row[r+1];
    for (i = rowR; i < rowRp1; i++)
        sum += x[col[i]] * val[i];
    y[r] = sum;
}
Similarly in Fortran, we have:
!DIR$ UNROLL(4)
!DIR$ SIMD reduction(+:sum) vectorlength(2)
do r = 1, M
    sum = 0.d0
    do i = row(r), row(r+1)-1
        sum = sum + x(col(i))*val(i)
    enddo
    y(r) = sum
enddo
or
!DIR$ UNROLL(2)
!DIR$ SIMD reduction(+:sum) vectorlength(4)
do r = 1, M
    sum = 0.d0
    do i = row(r), row(r+1)-1
        sum = sum + x(col(i))*val(i)
    enddo
    y(r) = sum
enddo
With these changes, code compiled for AVX achieves performance close to the SSE version.
References
To learn more about vectorization and how to compile for Intel(R) AVX, refer to the articles below:
Intel vectorization tools: https://software.intel.com/en-us/intel-vectorization-tools/
How to compile for Intel® AVX: https://software.intel.com/en-us/articles/how-to-compile-for-intel-avx