Does anyone know if ICC 14 can transform (x >> 12) & 0x3 into _bextr_u32(x, 12, 2)?
I tried compiling it with icc -mcore-avx2, but the compiler did not make the transformation. How profitable would it be? The shift-and-mask form is 2 instructions with 2 cycles of latency, versus 1 instruction with 2 cycles of latency for BEXTR.
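To make the comparison concrete, here is a minimal sketch of the two forms. The function names are made up for illustration, and bextr_u32_portable is only a plain-C stand-in that follows the documented behaviour of _bextr_u32 (start and length taken as 8-bit fields, zero result when the start is past bit 31). It is not the intrinsic itself and says nothing about what ICC actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* The shift-and-mask form from the question: 2 bits starting at bit 12. */
static uint32_t extract_shift_mask(uint32_t x) {
    return (x >> 12) & 0x3u;
}

/* Hypothetical portable stand-in for _bextr_u32(x, start, len).
   Assumption: mirrors BEXTR semantics, where start and len are
   8-bit fields and out-of-range requests are clipped, not undefined. */
static uint32_t bextr_u32_portable(uint32_t x, uint32_t start, uint32_t len) {
    start &= 0xffu;
    len &= 0xffu;
    if (start >= 32u || len == 0u) {
        return 0u;                      /* nothing to extract */
    }
    uint32_t shifted = x >> start;
    if (len >= 32u) {
        return shifted;                 /* field runs to the top bit */
    }
    return shifted & ((1u << len) - 1u);
}

/* Checks that both forms agree for the (12, 2) field on a few inputs. */
static int forms_agree(void) {
    const uint32_t samples[] = { 0u, 0x3000u, 0xDEADBEEFu, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        if (extract_shift_mask(samples[i]) !=
            bextr_u32_portable(samples[i], 12u, 2u)) {
            return 0;
        }
    }
    return 1;
}
```

With BMI1 enabled, the call to bextr_u32_portable(x, 12, 2) would be replaced by _bextr_u32(x, 12, 2) from immintrin.h; comparing the generated assembly of the two functions is one way to see whether the compiler treats them as equivalent.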
Also, is there an analogue of _bextr_u32 for inserting contiguous bits into another word (e.g. a | ((b & 0xff) << 8))?
It seems that such an instruction would need 4 operands, which isn't implemented, but what about just filling all the upper bits (e.g. a | (b << 8))?
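To pin down what I mean by the two variants, here is a rough sketch in plain C. All three function names are hypothetical. insert_bits_u32 is the general 4-operand form (destination, source, start, length), written under the assumption that the destination field should be cleared before the new bits go in; the other two are the expressions from the question, which only OR bits into a.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical general insert: put the low `len` bits of src into dst
   at bit `start`, clearing the destination field first (an assumption;
   the expressions in the question do not clear it). */
static uint32_t insert_bits_u32(uint32_t dst, uint32_t src,
                                unsigned start, unsigned len) {
    assert(start < 32u && len >= 1u && len <= 32u - start);
    uint32_t field = (len == 32u) ? 0xFFFFFFFFu : ((1u << len) - 1u);
    uint32_t mask = field << start;
    return (dst & ~mask) | ((src << start) & mask);
}

/* First form from the question: only the low 8 bits of b are inserted. */
static uint32_t or_masked(uint32_t a, uint32_t b) {
    return a | ((b & 0xffu) << 8);
}

/* Second form from the question: every bit of b above the shift
   lands in the upper part of the result. */
static uint32_t or_unmasked(uint32_t a, uint32_t b) {
    return a | (b << 8);
}
```

For example, with a = 0x44 and b = 0x1AB, or_masked gives 0xAB44 while or_unmasked gives 0x1AB44, since bit 8 of b is not masked off in the second form.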