Hi there,
After reading a lot of material, I still can't figure out how to broadcast 4 float variables into the 4 lanes of a vector register on MIC.
e.g. float array[4]={a,b,c,d};
How can I load this into a vector register as {aaaa,bbbb,cccc,dddd} using one intrinsic?
If I use _mm512_mask_blend_ps, it takes several intrinsics (4 broadcasts plus 3 blends):
__forceinline __m512 gather16float_4float(const float a, const float b, const float c, const float d)
{
    /* start with a in all 16 elements */
    __m512 v = _mm512_set1_ps(a);
    /* overwrite elements 4-7 with b, 8-11 with c, 12-15 with d */
    v = _mm512_mask_blend_ps(0x00f0, v, _mm512_set1_ps(b));
    v = _mm512_mask_blend_ps(0x0f00, v, _mm512_set1_ps(c));
    v = _mm512_mask_blend_ps(0xf000, v, _mm512_set1_ps(d));
    return v;
}
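To be explicit about the layout I'm after, here is a plain scalar C sketch of the same result. The name broadcast4_reference is just something I made up for illustration (it is not an intrinsic), and it assumes element 0 is the lowest element of the register, matching the masks above.

```c
#include <stddef.h>

/* Scalar reference for the desired layout (hypothetical helper, not an intrinsic):
   out[0..3] = in[0], out[4..7] = in[1], out[8..11] = in[2], out[12..15] = in[3] */
static void broadcast4_reference(const float in[4], float out[16])
{
    for (size_t i = 0; i < 16; ++i) {
        out[i] = in[i / 4];   /* each group of 4 elements gets one input value */
    }
}
```

So what I'm looking for is a way to get this result with as few instructions as possible.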
Is there a faster method?
All of the broadcast intrinsics I found replicate a whole 128-bit block (4 floats) across the lanes. Is there an intrinsic that broadcasts a different element into each of the 4 lanes?
Could you please help me with this?
Thanks.