Speedup shown in the optimization report, but no speedup observed at run time

Hi,

I had a for loop containing branches, which prevented it from being vectorized; I confirmed this from the optimization report. I removed these branches using masking, and the same loop now satisfies the usual requirements for vectorization, such as:

1. No branches or jumps.

2. No dependencies, etc.
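To illustrate what is meant by removing branches with masking, here is a minimal sketch in C. It is not the actual loop from sample.c; the function names, the data, and the operation (doubling positive elements) are hypothetical and chosen only to show the general pattern.

```c
#include <stddef.h>

/* Hypothetical branchy loop: the if/else in the body is the kind of
 * control flow that can keep a loop from being vectorized. */
void scale_positive_branchy(const float *a, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0.0f)
            out[i] = a[i] * 2.0f;
        else
            out[i] = 0.0f;
    }
}

/* Hypothetical masked version: the condition becomes a 0/1 multiplier,
 * so every iteration executes the same straight-line code.
 * Caveat: for NaN or infinite inputs the result can differ from the
 * branchy version, because 0 * NaN and 0 * inf are NaN, not 0. */
void scale_positive_masked(const float *a, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float mask = (float)(a[i] > 0.0f);   /* 1.0f if true, else 0.0f */
        out[i] = mask * (a[i] * 2.0f);
    }
}
```

Note that the masked form always computes the full expression, even for elements whose result is discarded, so it does more arithmetic per element than the branchy form.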

Below is the output from the optimization report, generated with the option -vec-report5. As the report shows, the loop was split (distributed) into two parts: part 1 (chunk1) has an estimated speedup of 2.140 and part 2 (chunk2) has an estimated speedup of 1.480. However, when I actually run the code I see no speedup at all; the run time is almost equal to that of the original. I also checked the assembly for the same loop, and the compiler does appear to generate instructions that use xmm registers.

LOOP BEGIN at ../../../sample.c(2635,5)
<Distributed chunk1>
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2637,9) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2638,9) ]
remark #15301: PARTIAL LOOP WAS VECTORIZED
remark #15449: unmasked aligned unit stride stores: 2
remark #15460: masked strided loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 12
remark #15477: vector loop cost: 5.500
remark #15478: estimated potential speedup: 2.140
remark #15479: lightweight vector operations: 7
remark #15481: heavy-overhead vector operations: 1
remark #15487: type converts: 2
remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at ../../../sample.c(2635,5)
<Remainder, Distributed chunk1>
LOOP END

LOOP BEGIN at ../../../sample.c(2635,5)
<Distributed chunk2>
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2641,9) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2641,9) ]
remark #15388: vectorization support: reference mask5 has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15399: vectorization support: unroll factor set to 2
remark #15301: PARTIAL LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 10
remark #15449: unmasked aligned unit stride stores: 1
remark #15458: masked indexed (or gather) loads: 10
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 72
remark #15477: vector loop cost: 47.370
remark #15478: estimated potential speedup: 1.480
remark #15479: lightweight vector operations: 67
remark #15480: medium-overhead vector operations: 1
remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at ../../../sample.c(2635,5)
<Remainder, Distributed chunk2>
LOOP END

My question: is it possible that, although the optimization report shows a potential speedup and the machine code uses the 128-bit xmm registers rather than scalar code, the program still shows no speedup at run time? Or is it because of the heavy-overhead vector operations listed in the optimization report (though I assume the compiler already accounts for these when it estimates the speedup)? If I am wrong or missing something, any pointers in the right direction would be appreciated.
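For reference, this is roughly how the run time of a single loop can be measured in isolation. It is only a sketch using the standard C clock() function; kernel_example and best_time_seconds are hypothetical names, and the kernel is a stand-in rather than the real loop at sample.c(2635,5).

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel standing in for the loop under test. */
void kernel_example(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2.0f + 1.0f;
}

/* Runs the kernel `repeats` times and returns the smallest CPU time
 * in seconds, which reduces noise from other system activity.
 * Returns -1.0 if repeats is zero or negative. */
double best_time_seconds(void (*kernel)(const float *, float *, size_t),
                         const float *in, float *out, size_t n, int repeats)
{
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        clock_t start = clock();
        kernel(in, out, n);
        clock_t stop = clock();
        double elapsed = (double)(stop - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Building the same source once with vectorization enabled and once with it disabled (for example with the Intel compiler's -no-vec option), and calling best_time_seconds on a large enough array in each build, gives two numbers that can be compared directly against the estimated speedup in the report.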
