ICL's vectorizer seems to be very good, which makes me wonder whether it still makes sense to use IPP (Intel Performance Primitives) for simple tasks such as:
for (int i=0; i<cnt; i++) dst[i] = src1[i] * src2[i];
I plan to target SSE2 as the baseline architecture, with AVX as an additional dispatched code path.
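
To make the comparison concrete, here is a minimal sketch of the loop as a standalone function written so the auto-vectorizer has the best chance with it. The function name `mul_arrays`, the `float` element type, and the no-overlap guarantee expressed by `restrict` are my assumptions, not something fixed by the question. With ICL this would typically be built with something like `/QxSSE2 /QaxAVX` to get an SSE2 baseline plus an AVX path.

```c
#include <assert.h>

/* Element-wise product of two arrays (hypothetical helper for illustration).
 * `restrict` promises the compiler that dst, src1 and src2 do not overlap,
 * which removes the aliasing checks that can otherwise block or slow down
 * vectorization of this loop. */
void mul_arrays(float *restrict dst,
                const float *restrict src1,
                const float *restrict src2,
                int cnt)
{
    assert(cnt >= 0);
    for (int i = 0; i < cnt; i++)
        dst[i] = src1[i] * src2[i];
}

/* For 32-bit floats, the IPP counterpart would be, as far as I know:
 *     ippsMul_32f(src1, src2, dst, cnt);
 * which picks its code path at run time via IPP's own CPU dispatching. */
```

Timing both variants on realistic array sizes and alignments would be the only reliable way to decide between them.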