Hi,
We have a question on the behavior of "#pragma ivdep", multi-versioning and assumed vector dependence. We have a workload (LU decomposition) that contains an assumed vector dependence. I did not want to post the whole code here, so I created a short reproducer that has the same behavior.
const int n = 128;
float* data = (float*) malloc(sizeof(float)*n*n);
data[0:n*n] = 1.0f;
for(int i = 0 ; i < n; i++) {
  for(int j = 0 ; j < n; j++) {
    //#pragma ivdep
    //#pragma vector always
    //#pragma simd
    for(int k = 0 ; k < n; k++) {
      data[i*n+k] += data[j*n+k];
    }
  }
}
There is an assumed vector dependence here because 'n' could be smaller than the vector length. The optimization report recognizes this and the compiler implements multi-versioning. However, neither of the two versions it creates is vectorized. The following is the snippet from the optimization report generated with -qopt-report=5:
LOOP BEGIN at reproducer.cc(15,7)
<Multiversioned v1>
   remark #25228: Loop multiversioned for Data Dependence
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between data line 16 and data line 16
   remark #15346: vector dependence: assumed ANTI dependence between data line 16 and data line 16
   remark #25438: unrolled without remainder by 2
LOOP END

LOOP BEGIN at reproducer.cc(15,7)
<Multiversioned v2>
   remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
   remark #25438: unrolled without remainder by 2
LOOP END
Multiversioned v1 reports that this is an "assumed" dependence, which is what we expected. Adding "#pragma ivdep" does remove the multi-versioning, but the loop is still not vectorized:
LOOP BEGIN at reproducer.cc(15,7)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between data line 16 and data line 16
   remark #15346: vector dependence: assumed ANTI dependence between data line 16 and data line 16
   remark #25438: unrolled without remainder by 2
LOOP END
Finally, we were able to vectorize this workload by forcing vectorization with "#pragma simd" (and we indeed got the correct result, along with significant speedup). For some reason "#pragma vector always" refused to vectorize this loop.
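For anyone who wants to repeat the correctness check, here is a minimal sketch of how the forced-vectorized result could be compared against a run without the pragma. The function names (run_reference, run_forced_simd, check_match) are hypothetical and are not taken from our workload, std::vector is used instead of malloc and array notation to keep it portable, and the `__INTEL_COMPILER` guard is an assumption so that other compilers simply skip the pragma.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the reproducer's loop nest with no vectorization
// pragma. The compiler may or may not vectorize it on its own.
std::vector<float> run_reference(int n) {
    std::vector<float> data(static_cast<std::size_t>(n) * n, 1.0f);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                data[i * n + k] += data[j * n + k];
            }
        }
    }
    return data;
}

// Hypothetical helper: same loop nest, with "#pragma simd" on the inner loop.
// The guard is an assumption: compilers that do not define __INTEL_COMPILER
// skip the pragma, so both functions are then identical.
std::vector<float> run_forced_simd(int n) {
    std::vector<float> data(static_cast<std::size_t>(n) * n, 1.0f);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
#if defined(__INTEL_COMPILER)
#pragma simd
#endif
            for (int k = 0; k < n; k++) {
                data[i * n + k] += data[j * n + k];
            }
        }
    }
    return data;
}

// Element-wise comparison of the two runs. Each element sees the same
// sequence of additions in both versions, so an exact match is expected.
void check_match(int n) {
    assert(run_forced_simd(n) == run_reference(n));
}
```

Calling check_match with a small n (for example 8) keeps every value an exactly representable integer. With n = 128 the values overflow to infinity in single precision, so a small size is a more meaningful check.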
We are currently using C++ compiler v16.0.1 for Linux, and the problem also occurs with v16.0.0. With earlier compilers, however, multi-versioning of the same code produced a vectorized and a non-vectorized version, and "#pragma ivdep" removed the assumed vector dependence.
Is this a change in behavior with the 16 compiler? If so, is the proper remedy to replace "#pragma ivdep" with "#pragma simd", or is there a different pragma for ignoring this type of assumed vector dependence?
Thanks in advance!
Ryo