Dear Intel developers,
I'm using the Intel compiler 15 on an E5-2670 processor. While analyzing my code with VTune, I found a hotspot on one particular line, where I unpack an __m128 value in order to sum its elements into a single float (a horizontal sum), like this:
_mm_store_ps(denom_arr_tmp, denom_tmp);
semblance[m_local] += denom_arr_tmp[0]+denom_arr_tmp[1]+denom_arr_tmp[2]+denom_arr_tmp[3];
The assembly generated is:
vunpckhps %xmm2, %xmm2, %xmm3
movq -0x80(%rbp), %rax
vaddssl -0x9c(%rbp), %xmm2, %xmm4
vaddss %xmm3, %xmm4, %xmm5
vaddssl -0x94(%rbp), %xmm5, %xmm6
vaddssl (%rax,%r14,4), %xmm6, %xmm7
vmovssl %xmm7, (%rax,%r14,4)
My questions are: what is the VADDSSL instruction, and how does it differ from VADDSS? How can I optimize this piece of code? It is currently a bottleneck.
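For reference, a horizontal sum that stays in registers instead of going through a temporary array might look like the sketch below. It is only an illustration under assumptions: `hsum_ps` is a hypothetical helper name, it uses SSE1 intrinsics only, and I have not measured whether it is actually faster here. It also adds the elements in a different order than the scalar expression above, so the result may differ in the last bits.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper: returns v[0] + v[1] + v[2] + v[3]
   without storing the vector to memory first. */
static inline float hsum_ps(__m128 v)
{
    /* hi = [v2, v3, v2, v3]: upper half copied into the lower half */
    __m128 hi = _mm_movehl_ps(v, v);
    /* lane 0 = v0 + v2, lane 1 = v1 + v3 */
    __m128 sum2 = _mm_add_ps(v, hi);
    /* broadcast lane 1 (v1 + v3) to every lane */
    __m128 odd = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    /* lane 0 = (v0 + v2) + (v1 + v3) */
    __m128 total = _mm_add_ss(sum2, odd);
    /* extract lane 0 as a scalar float */
    return _mm_cvtss_f32(total);
}
```

With such a helper, the line in question would become semblance[m_local] += hsum_ps(denom_tmp);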
Thanks.