Hi,
I am using a simple ikj triple loop to compute a matrix multiplication, compiled with the Intel compiler icpc (ICC) 14.0.2 20140120.
In both of the following cases the number of threads is 1 (no parallel for is used yet).
1- If I use #pragma omp parallel, the compiled code seems to be vectorized, at least according to -vec-report6, but the running time is the same as in the non-vectorized case:
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has aligned access
MATMUL.cc(71): (col. 4) remark: vectorization support: unroll factor set to 4
MATMUL.cc(71): (col. 4) remark: LOOP WAS VECTORIZED
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has unaligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: unaligned access used inside loop body
MATMUL.cc(71): (col. 4) remark: REMAINDER LOOP WAS VECTORIZED
2- On the other hand, if I simply remove the #pragma omp parallel, -vec-report6 prints the following:
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has aligned access
MATMUL.cc(71): (col. 4) remark: vectorization support: unroll factor set to 4
MATMUL.cc(71): (col. 4) remark: LOOP WAS VECTORIZED
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference C has aligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: reference B has unaligned access
MATMUL.cc(73): (col. 12) remark: vectorization support: unaligned access used inside loop body
MATMUL.cc(71): (col. 4) remark: REMAINDER LOOP WAS VECTORIZED
MATMUL.cc(71): (col. 4) remark: loop skipped: multiversioned
Although it says "loop skipped: multiversioned" (I am not sure exactly what that means), the running time is roughly 6X better, which suggests that the loop really is vectorized in this case. Using #pragma omp simd does not change the results.
void MatMul_Par(float* A, float* B, float* C)
{
//#pragma omp parallel shared(A,B,C)
{
    for (int i=0;i<N;i++)
    {
        for(int k=0;k<N;k++)
        {
            float temp = A[i*N+k];
            //#pragma omp simd
            for(int j=0;j<N;j++)
            {
                C[i*N+j] += temp * B[k*N+j];
            }
        }
    }
} //parallel
}
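To make the timing comparison reproducible, here is a minimal sketch of how the kernel could be timed. It is only an illustration, not the exact benchmark I ran: the names matmul_ikj and time_matmul are hypothetical, the matrix size is passed as a parameter instead of using the global N, and std::chrono::steady_clock is an assumed choice of timer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not in the original code): the same ikj kernel as
// MatMul_Par, with the matrix size n passed in instead of the global N.
void matmul_ikj(const float* A, const float* B, float* C, int n)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            float temp = A[i * n + k];
            // Innermost loop: the one the vectorization report refers to.
            for (int j = 0; j < n; j++) {
                C[i * n + j] += temp * B[k * n + j];
            }
        }
    }
}

// Hypothetical helper: runs the kernel once on n x n matrices and returns
// the elapsed wall-clock time in seconds.
double time_matmul(int n)
{
    assert(n > 0);
    const std::size_t count = static_cast<std::size_t>(n) * n;
    std::vector<float> A(count, 1.0f), B(count, 2.0f), C(count, 0.0f);

    auto start = std::chrono::steady_clock::now();
    matmul_ikj(A.data(), B.data(), C.data(), n);
    auto stop = std::chrono::steady_clock::now();

    // Read one result through a volatile so the compiler cannot discard the work.
    volatile float sink = C[0];
    (void)sink;

    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_matmul(1024) from main in two builds, one with the #pragma omp parallel block wrapped around the loops in the kernel and one without, should be enough to compare the two cases on a single thread.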
PS: The problem does not exist when using Intel Cilk Plus, etc. It seems to be related to the parallel pragma in OpenMP.