Hi,
I think I have a good background on how a CPU and memory work. I know the usual facts about CPUs, especially Intel CPUs: a cache line is usually 64 bytes, each CPU core has dedicated SSE and/or AVX registers, and so on. I'm also fairly familiar with the common practices used to improve performance, improve memory usage, avoid cache spills, and generally write good software for concurrency and parallelism in its various forms (mainly multicore and SIMD).
But when it comes to intrinsics, there is little to no documentation about design, patterns and good practices.
My question is about a series of scenarios that I have encountered and that I don't know how to evaluate correctly, or how they should be tackled to get the most out of the CPU:
* all the examples are in pseudocode to simplify the writing
- Is it a good idea to use a register that is an operand of the function as the target for storing the result, or is it better to use a third m128 register? Ex. r1 = _mm_add ( r1, r2 ) or r3 = _mm_add ( r1, r2 )
- Is there a substantial difference, as far as the hardware is concerned, between doing some type punning and using store instructions when returning one or more values from a function? Ex. return ((int32*)(&r1))[0] + ((int32*)(&r1))[1] + ((int32*)(&r1))[2] + ((int32*)(&r1))[3] ( the function is some int foo(...){} ) or _mm_store ( buffer, r1 ) ( at which point the function is void foo(...){} )
- I assume there is no difference between writing a for loop that iterates over X registers, performing the same operation X times, and writing the operation out X times in the body of the function, once per register. Correct?
- Is there a way to know, programmatically, how many registers, and which ones, are available on a given CPU/architecture? For example, how many xmm registers a given machine has in each CPU core, plus all the other registers.
- What is supposed to happen if I use more registers than my machine has? For example, what if I use 33 xmm registers on a CPU that only has 32 SSE registers?
- There is a cheap trick to avoid writing a load instruction, which consists of casting ( C-style cast ) the pointer to your data to ( m128 * ) directly. Is this well-defined behaviour? What are the pros and cons? For example __m128i* r1 = (__m128i *) ( &ptr[0] )
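To make the first and last bullets less abstract, this is roughly what I mean in real code. It is only a sketch under my own assumptions: SSE2 with 32-bit integer lanes (_mm_add_epi32 standing in for the pseudocode _mm_add), function names made up for illustration, and a pointer that is 16-byte aligned in the cast variant.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_add_epi32, loads, compares */
#include <stdint.h>

/* First bullet, variant A: reuse an operand as the destination. */
__m128i add_in_place(__m128i r1, __m128i r2) {
    r1 = _mm_add_epi32(r1, r2);
    return r1;
}

/* First bullet, variant B: write the result to a third variable. */
__m128i add_into_third(__m128i r1, __m128i r2) {
    __m128i r3 = _mm_add_epi32(r1, r2);
    return r3;
}

/* Last bullet, variant A: explicit unaligned load intrinsic. */
__m128i load_explicit(const int32_t *ptr) {
    return _mm_loadu_si128((const __m128i *)ptr);
}

/* Last bullet, variant B: cast the pointer and dereference it.
   Assumption: ptr is 16-byte aligned. */
__m128i load_by_cast(const int32_t *ptr) {
    return *(const __m128i *)ptr;
}

/* Hypothetical helper: returns 1 if all four 32-bit lanes are equal. */
int same_lanes(__m128i a, __m128i b) {
    return _mm_movemask_epi8(_mm_cmpeq_epi32(a, b)) == 0xFFFF;
}
```

As far as I can tell, each pair of variants returns the same values. What I am asking is whether they differ in the generated code or in how well defined they are.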
I'd appreciate a clarification on how to approach each point in the list. I find that intrinsics (and registers too) are not that well documented in terms of behaviour and patterns. The only real suggestion I have found about any possible pattern is the load-compute-store pattern, which suggests that you load all the registers first, do your computation, and only at the end perform your stores, but this is too generic and doesn't answer my doubts.
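For reference, this is how I understand the load-compute-store pattern, as a minimal sketch in C with SSE2 intrinsics. The function name add_arrays, the choice of 32-bit integer lanes, and the assumption that n is a multiple of 8 are mine, just to keep the example short.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Adds two int32 arrays element-wise, two registers (8 elements) per
   iteration. Assumption: n is a multiple of 8. Unaligned loads and
   stores are used so the arrays need no particular alignment. */
void add_arrays(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        /* Load: bring all the inputs into registers first. */
        __m128i a0 = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i a1 = _mm_loadu_si128((const __m128i *)(a + i + 4));
        __m128i b0 = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i b1 = _mm_loadu_si128((const __m128i *)(b + i + 4));

        /* Compute: operate only on registers. */
        __m128i s0 = _mm_add_epi32(a0, b0);
        __m128i s1 = _mm_add_epi32(a1, b1);

        /* Store: write the results back only at the end. */
        _mm_storeu_si128((__m128i *)(out + i), s0);
        _mm_storeu_si128((__m128i *)(out + i + 4), s1);
    }
}
```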
Thank you for your time