Assembly instruction reordering is not optimal when providing hand-vectorized code

I'm testing the assembly code generation of the main loop of an application with icc 14.0.3 and icc 15.0 for the Intel Xeon Phi coprocessor.

I'm generating a large number of prefetch instructions by hand for this main loop using the _mm512_prefetch intrinsic and compiling with -O3 -no-opt-prefetch -mmic.

The first version of this application contains only these _mm512_prefetch intrinsics and a #pragma omp simd on the main loop.

The second version has been vectorized by hand using KNC intrinsics, in addition to the same _mm512_prefetch intrinsics as the first version.

When I look carefully at the assembly code generated for both versions, I see that the assembly corresponding to the loop body is nearly equivalent. Both versions seem to have the same optimizations applied, the SAME aligned/unaligned loads/stores, etc. BUT the order of the instructions is not the same.

Whereas in the auto-vectorized version ALL prefetch instructions have been interleaved with other instructions throughout the WHOLE loop body (this is just a short snippet):

        vprefetch0 2720(%r14,%r13,4)                            #300.15 c93
        vaddps    %zmm25, %zmm24, %zmm8                         #362.48 c97
        vprefetch0 1440(%r14,%r13,4)                            #301.15 c101
        vaddps    %zmm27, %zmm26, %zmm7                         #363.48 c105
        vprefetch0 160(%r14,%r13,4)                             #302.15 c109
        vaddps    %zmm29, %zmm28, %zmm6                         #364.48 c113
        vprefetch0 6560(%r14,%r13,4)

in the hand-vectorized version only a few prefetch instructions have been interleaved, and the vast majority of them are issued one after another at the beginning of the loop body:

        vmovaps   -3824(%rdx,%rsi,4), %zmm0                     #478.2861 c1
        vprefetch1 528(%rcx)                                    #459.21 c5
        vmovaps   -1264(%rdx,%rsi,4), %zmm30                    #478.2437 c9
        vprefetch0 144(%rcx)                                    #460.21 c13
        vmovaps   -2544(%rdx,%rsi,4), %zmm31                    #478.2649 c17
        vprefetch1 -752(%rcx)                                   #461.21 c21
        vaddps    3856(%rdx,%rsi,4), %zmm0, %zmm1               #478.2831 c25
        vprefetch1 -2032(%rcx)                                  #462.21 c29
        vaddps    1296(%rdx,%rsi,4), %zmm30, %zmm3              #478.2407 c33
        vprefetch1 -3312(%rcx)                                  #463.21 c37
        vaddps    2576(%rdx,%rsi,4), %zmm31, %zmm2              #478.2619 c41
        vprefetch1 -4592(%rcx)                                  #464.21 c45
        vmulps    %zmm11, %zmm1, %zmm1                          #478.34 c49
        vprefetch1 1808(%rcx)                                   #465.21 c53
        vfmadd213ps %zmm1, %zmm13, %zmm3                        #478.34 c57
        vprefetch1 3088(%rcx)                                   #466.21 c61
        vprefetch1 4368(%rcx)                                   #467.21 c65
        vprefetch1 5648(%rcx)                                   #468.21 c69
        vprefetch0 -1136(%rcx)                                  #469.21 c73
        vprefetch0 -2416(%rcx)                                  #470.21 c77
        vprefetch0 -3696(%rcx)                                  #471.21 c81
        vprefetch0 -4976(%rcx)                                  #472.21 c85
        vprefetch0 1424(%rcx)                                   #473.21 c89
        vprefetch0 2704(%rcx)                                   #474.21 c93

As a result, the auto-vectorized version runs significantly faster than the hand-coded version. When I interleaved the prefetch instructions by hand in the assembly of the hand-coded version, it reached the same performance as the auto-vectorized version.
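For illustration only, here is a minimal portable sketch of the interleaved layout at the source level. It is not the real application code: the function name, the PF_DIST distance, and the use of __builtin_prefetch as a stand-in for _mm512_prefetch are all assumptions made so that the example compiles anywhere. The compiler's scheduler may of course still move the prefetches, which is exactly the issue described here.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in floats; not taken from the real code. */
#define PF_DIST 64

/* Portable stand-in for _mm512_prefetch (assumption: a compiler that
   provides __builtin_prefetch, such as GCC, Clang or icc). */
#define PREFETCH_L1(p) __builtin_prefetch((p), 0, 3)

/* out[i] = (a[i] + b[i]) * scale, processed in blocks of 16 floats
   (the width of one 512-bit vector). One prefetch is placed before each
   half of the arithmetic instead of grouping all prefetches at the top
   of the loop body. */
void add_scale_interleaved(const float *a, const float *b, float *out,
                           size_t n, float scale)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        /* Only prefetch addresses that are still inside the arrays. */
        int in_range = (i + PF_DIST < n);

        if (in_range)
            PREFETCH_L1(a + i + PF_DIST);
        for (size_t j = 0; j < 8; ++j)
            out[i + j] = (a[i + j] + b[i + j]) * scale;

        if (in_range)
            PREFETCH_L1(b + i + PF_DIST);
        for (size_t j = 8; j < 16; ++j)
            out[i + j] = (a[i + j] + b[i + j]) * scale;
    }
    /* Scalar tail for the remaining elements. */
    for (; i < n; ++i)
        out[i] = (a[i] + b[i]) * scale;
}
```

The prefetches do not change the result, only (potentially) the timing, so the function can be checked against a plain scalar loop.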

Could you help me understand why this is happening? What could I do to get a similar instruction order generated automatically in the hand-vectorized version?

Thank you in advance!
