This question is related to an existing question on Stack Overflow. The differences are that here AVX is the target ISA and the function to be vectorized is more complex. When I use the __attribute__((vector(...))) declaration in the function definition:
```c
__attribute__((vector(linear(a), linear(b))))
inline void foo(float* restrict a, float* restrict b)
{
    ...
    for (j = 0; j < n; j++) {
        // do something with a[j*STRIDE] and b[j*STRIDE]
    }
    for (j = n - 1; j >= 0; j--) {
        // do something with a[j*STRIDE] and b[j*STRIDE]
    }
}
```
the compiler reports the following for the function foo():
```
foo.hpp(56): (col. 101) remark: FUNCTION WAS VECTORIZED
foo.hpp(56): (col. 101) remark: FUNCTION WAS VECTORIZED
```
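To make the linear(a), linear(b) clauses concrete: a vector variant of an elemental function processes VL consecutive calls at once, and a parameter declared linear with the default step of 1 receives a + l in lane l. The portable C sketch below emulates a 4-lane variant by hand. The body of foo_scalar is a hypothetical stand-in for the elided body (a forward and a backward pass over a[j*STRIDE] and b[j*STRIDE]), and the values of STRIDE, N and VL are assumptions, not taken from the real code.

```c
#include <assert.h>

#define STRIDE 256 /* assumed distance between elements touched by one call */
#define N 4        /* assumed trip count of the loops inside foo */
#define VL 4       /* assumed number of SIMD lanes in the vector variant */

/* Lanes never touch the same element as long as STRIDE >= VL. */
static_assert(STRIDE >= VL, "lanes must not touch the same element");

/* Hypothetical stand-in for the elided body of foo(): a forward pass,
   then a backward pass over a[j*STRIDE] and b[j*STRIDE]. */
static void foo_scalar(float *restrict a, float *restrict b)
{
    for (int j = 0; j < N; j++)
        a[j * STRIDE] += b[j * STRIDE];
    for (int j = N - 1; j >= 0; j--)
        b[j * STRIDE] = 0.5f * a[j * STRIDE];
}

/* Emulated VL-lane variant for linear(a), linear(b) with step 1:
   lane l operates on a + l and b + l, and all lanes move through
   each loop in lockstep, one j at a time. */
static void foo_vl(float *restrict a, float *restrict b)
{
    for (int j = 0; j < N; j++)
        for (int l = 0; l < VL; l++)
            a[l + j * STRIDE] += b[l + j * STRIDE];
    for (int j = N - 1; j >= 0; j--)
        for (int l = 0; l < VL; l++)
            b[l + j * STRIDE] = 0.5f * a[l + j * STRIDE];
}
```

Calling foo_scalar(&x[i], &y[i]) for i = 0..VL-1 and calling foo_vl(x, y) once leave the arrays in the same state. That equivalence is what lets a vector variant replace VL scalar calls; if the lanes could touch the same elements, the lockstep order could give a different result.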
When I call the function using array notation or a plain for loop:
```c
int main() {
    ...
    #pragma omp parallel for
    for (k = 0; k < n; k++) {
        int base = k*256*256;
        FP* __restrict a = &h_a[base];
        FP* __restrict b = &h_b[base];
        __assume_aligned(a, 32);
        __assume_aligned(b, 32);

        foo(&a[0:256], &b[0:256]); // line 337

        // OR

        for (int i = 0; i < n; i++) { // i declared here so each thread has its own copy
            foo(&a[i], &b[i]);
        }
    }
    ...
}
```
the compiler refuses to vectorize the call:
```
main.c(337): (col. 3) remark: loop was not vectorized: existence of vector dependence
main.c(337): (col. 3) remark: loop was not vectorized: existence of vector dependence
main.c(337): (col. 3) remark: loop was not vectorized: not inner loop
```
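The "existence of vector dependence" remark concerns whether different iterations of the call loop can touch the same memory. Assuming the elided body of foo() accesses only a[j*STRIDE] and b[j*STRIDE] for 0 <= j < n, call i touches the offsets i + j*STRIDE relative to the start of the block, and two calls can conflict only if those offset sets intersect. The sketch below checks this by brute force and in closed form; calls_overlap and calls_overlap_brute are hypothetical helpers, and calls, stride and n are assumed parameters.

```c
#include <assert.h>
#include <stdbool.h>

/* Brute force: do two different calls foo(&a[i1]) and foo(&a[i2]),
   0 <= i1 < i2 < calls, touch a common offset, assuming call i touches
   exactly the offsets i + j*stride for 0 <= j < n? */
static bool calls_overlap_brute(int calls, int stride, int n)
{
    assert(calls > 0 && stride > 0 && n > 0);
    for (int i1 = 0; i1 < calls; i1++)
        for (int i2 = i1 + 1; i2 < calls; i2++)
            for (int j1 = 0; j1 < n; j1++)
                for (int j2 = 0; j2 < n; j2++)
                    if (i1 + j1 * stride == i2 + j2 * stride)
                        return true;
    return false;
}

/* Closed form of the same test:
   i1 + j1*stride == i2 + j2*stride  <=>  i2 - i1 == (j1 - j2) * stride.
   With 0 < i2 - i1 < calls this needs j1 - j2 >= 1 (so n >= 2), and the
   smallest possible distance is stride itself (j1 - j2 == 1). */
static bool calls_overlap(int calls, int stride, int n)
{
    assert(calls > 0 && stride > 0 && n > 0);
    return n >= 2 && stride < calls;
}
```

For example, calls_overlap(256, 256, 256) is false, while calls_overlap(256, 128, 4) is true. Whether icc can carry out this kind of proof at main.c(337) is exactly what is being asked; the sketch only shows what would have to be proven.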
The Intel compiler flags used are:

```
icc -O3 -xAVX -ip -restrict -parallel -fopenmp -vec-report2 -openmp-report2
```
The question: if the compiler can vectorize the function foo(), why can it not use the vectorized version at the call site (main.c:337)? The "remark" messages suggest that the compiler analyses the function again at the call site instead of simply using the already compiled vector code.
Note: I also tried a for loop instead of array notation, with #pragma ivdep and with #pragma simd, but neither helped. The actual code is much larger than would conveniently fit in this post.
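For comparison, the for-loop variant with #pragma simd can also be written with the standard OpenMP 4.0 directives, the portable counterpart of Intel's elemental-function syntax: #pragma omp declare simd linear(a, b) on the function and #pragma omp simd on the call loop. This is a swapped-in spelling, not the code that was actually tried; foo_omp, call_block, the function body, STRIDE, N and BLOCK are all hypothetical.

```c
#include <assert.h>

#define STRIDE 256 /* hypothetical value of the original STRIDE macro */
#define N 4        /* hypothetical trip count of the loops inside foo */
#define BLOCK 256  /* calls per block, matching a[0:256] in the question */

/* Distinct calls never touch the same element when STRIDE >= BLOCK,
   which is what makes the omp simd assertion below valid. */
static_assert(STRIDE >= BLOCK, "calls in one block must not overlap");

/* linear(a, b): with the default step of 1, SIMD lane l receives
   a + l and b + l. Hypothetical body, standing in for the elided one. */
#pragma omp declare simd linear(a, b)
static void foo_omp(float *restrict a, float *restrict b)
{
    for (int j = 0; j < N; j++)
        a[j * STRIDE] += b[j * STRIDE];
    for (int j = N - 1; j >= 0; j--)
        b[j * STRIDE] = 0.5f * a[j * STRIDE];
}

/* For-loop form of the call site; omp simd tells the compiler that
   consecutive iterations may run in SIMD lanes. */
static void call_block(float *restrict a, float *restrict b)
{
#pragma omp simd
    for (int i = 0; i < BLOCK; i++)
        foo_omp(&a[i], &b[i]);
}
```

Compiled without -fopenmp or -fopenmp-simd, the pragmas are ignored and the code runs as ordinary scalar C with the same result; whether a given compiler actually emits a SIMD call here is something only its vectorization report can show.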