Hi,
1) I spent more than a day playing with AVX intrinsics, only to find that although I got the intrinsics version almost as fast as my assembler code (with ICC actually slightly faster), ICC itself produced even better code from the plain source! So it seems I'm going with ICC after all, but:
- I need the software to be working on everything from SSE2 upwards, hence /arch:SSE2
- I want automatic dispatch to AVX, since I found that the AVX code is faster on Sandy Bridge and much faster on Haswell
So I used /QaxCORE-AVX, but it made no difference, and in the debugger I verified that no AVX code was created (or run); it was using just SSE2. With /arch:AVX the compiler did create great AVX code, but that wouldn't run on older CPUs, so it is not usable. How can I enable the dispatching?
2) My software is full of vectorizable loops such as
for (int i=0; i<cnt; i++) dst[i] = (a[i] + b[i]) * c[i];
In these cases I know that I want this particular part of the code dispatched to multiple architectures (perhaps even FMA and newer in some cases). Should I mark these parts of the code somehow? How does this work? Is there a guide on writing code so that it is easier for the compiler to vectorize?
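To make the question concrete, the loop above could be written as a standalone function roughly like this. This is only a sketch: the name `add_mul`, the `float` element type, and the `restrict` qualifiers (which assume the four arrays never overlap) are assumptions for illustration, not taken from the real project.

```c
#include <stddef.h>

/* Hypothetical helper: computes dst[i] = (a[i] + b[i]) * c[i] for i in [0, cnt).
 * Assumption: dst, a, b and c never overlap, which is what `restrict` promises
 * to the compiler so it does not have to guard against aliasing. */
void add_mul(float *restrict dst,
             const float *restrict a,
             const float *restrict b,
             const float *restrict c,
             size_t cnt)
{
    for (size_t i = 0; i < cnt; i++)
        dst[i] = (a[i] + b[i]) * c[i];
}
```

A simple counted loop with unit-stride accesses and no aliasing between the arrays is the shape I would expect a vectorizer to handle best, so this is the pattern I would like to see dispatched.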
3) Does vectorization (and the other optimizations) work the same way on OS X as on Windows? I'll need both, and I'm a little worried, as things are usually much more problematic on OS X.
4) I compiled a big project with ICC and compared the real-time performance. Sadly, the difference compared to MSVC is not very noticeable, yet the code ICC produced is about 30% bigger. That makes me wonder whether, even though ICC produces better vectorized code, the code is so big that instruction-cache misses become frequent enough to degrade performance back to the original level.
5) Can I use profile-guided optimization with just ICC, without buying VTune? Is it worth the trouble at all?
Thanks in advance!