Hi,
I had a for loop containing some branches, which prevented it from being a candidate for vectorization; I confirmed this from the optimization report. I removed these branches using masking, and the same loop now satisfies the necessary requirements for vectorization, such as:
1. No branches or jumps.
2. No dependencies, etc.
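To illustrate the kind of change meant by "removing branches using masking", here is a minimal sketch. It is not the code from sample.c; the function names, array names, and the operation itself are hypothetical and only show the general pattern of turning a condition into an all-ones or all-zeros mask.

```c
/* Branching version: the if/else inside the body is what can
 * prevent the loop from being vectorized. */
void clamp_diff_branch(const int *a, const int *b, int *out, int n)
{
    for (int i = 0; i < n; i++) {
        if (a[i] > b[i])
            out[i] = a[i] - b[i];
        else
            out[i] = 0;
    }
}

/* Masked version: the comparison yields 0 or 1; negating it gives
 * 0 (all bits clear) or -1 (all bits set), which selects the result
 * with a bitwise AND instead of a branch. */
void clamp_diff_masked(const int *a, const int *b, int *out, int n)
{
    for (int i = 0; i < n; i++) {
        int mask = -(a[i] > b[i]);
        out[i] = (a[i] - b[i]) & mask;
    }
}
```

Both functions produce the same output for the same inputs; only the second has a straight-line loop body.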
Below is the output from the optimization report with option -vec-report5. As the report shows, the loop was split (distributed) into two parts: part 1 (chunk1) has an estimated potential speedup of 2.140 and part 2 (chunk2) has an estimated potential speedup of 1.480. However, when I actually run the code I do not see any speedup at all; the run time is almost equal to the original. I also checked the assembly for the same loop, and the compiler does generate instructions that use xmm registers.
LOOP BEGIN at ../../../sample.c(2635,5)
<Distributed chunk1>
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2637,9) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2638,9) ]
remark #15301: PARTIAL LOOP WAS VECTORIZED
remark #15449: unmasked aligned unit stride stores: 2
remark #15460: masked strided loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 12
remark #15477: vector loop cost: 5.500
remark #15478: estimated potential speedup: 2.140
remark #15479: lightweight vector operations: 7
remark #15481: heavy-overhead vector operations: 1
remark #15487: type converts: 2
remark #15488: --- end vector loop cost summary ---
LOOP END
LOOP BEGIN at ../../../sample.c(2635,5)
<Remainder, Distributed chunk1>
LOOP END
LOOP BEGIN at ../../../sample.c(2635,5)
<Distributed chunk2>
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2641,9) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2641,9) ]
remark #15388: vectorization support: reference mask5 has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15399: vectorization support: unroll factor set to 2
remark #15301: PARTIAL LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 10
remark #15449: unmasked aligned unit stride stores: 1
remark #15458: masked indexed (or gather) loads: 10
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 72
remark #15477: vector loop cost: 47.370
remark #15478: estimated potential speedup: 1.480
remark #15479: lightweight vector operations: 67
remark #15480: medium-overhead vector operations: 1
remark #15488: --- end vector loop cost summary ---
LOOP END
LOOP BEGIN at ../../../sample.c(2635,5)
<Remainder, Distributed chunk2>
LOOP END
My question: is it possible that, although the optimization report shows an estimated speedup and the machine code uses the 128-bit xmm registers rather than scalar code, the program still shows no speedup when actually run? Or is it because of the heavy-overhead vector operations shown in the optimization report (though I would expect the compiler to take them into account in its estimate)? If I am wrong or missing something, any help in the right direction would be appreciated.