Good practices and design choices for intrinsics

Hi,

I think I have a good background on how a CPU and memory work. I know the usual facts about CPUs, especially Intel CPUs: a cache line is usually 64 bytes, each CPU core has its own dedicated SSE and/or AVX registers, and so on. I'm also fairly familiar with the common practices used to increase performance, improve memory usage, avoid cache spills, and handle the modern scenarios that make good software for concurrency and parallelism in its various forms (mainly multicore and SIMD).

But when it comes to intrinsics, there is little to no documentation about design, patterns, and good practices.

My questions concern a series of scenarios that I have encountered and that I don't know how to evaluate correctly, or how they should be tackled to get the most out of the CPU:

* All the examples are in pseudocode to simplify the writing.

Is it a good idea to use a register that is an operand of the function as the target for storing the result, or is it better to use a third __m128 register? Ex. r1 = _mm_add ( r1, r2 ) or r3 = _mm_add ( r1, r2 )
Is there a substantial difference, as far as the hardware is concerned, between type punning and using store instructions when returning one or more values from a function? Ex. return ((int32*)(&r1))[0] + ((int32*)(&r1))[1] + ((int32*)(&r1))[2] + ((int32*)(&r1))[3] ( the function is some int foo(...){} ) or _mm_store ( buffer, r1 ) ( in which case the function is void foo(...){} )
I assume there is no difference between writing a for loop that iterates over X registers, performing the same operation X times, and writing the operation out X times in the body of the function, once per register. Correct?
Is there a way to know, programmatically, how many and which registers are available on a given CPU/architecture? For example, how many xmm registers a given machine has in each CPU core, plus all the other registers.
What is supposed to happen if I use more registers than my machine has? For example, what if I use 33 xmm registers on a CPU that only has 32 SSE registers?
There is a cheap trick to avoid writing a load instruction, which consists of casting ( C-style cast ) the pointer to your data to ( __m128 * ) directly. Is this well-defined behaviour? What are the pros and cons? For example __m128* r1 = (__m128 *) ( &ptr[0] )
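
To make the second and last scenarios concrete, here is a minimal C sketch of the variants I am comparing, assuming SSE2 and integer data. The function names (sum_by_punning, sum_by_store, load_explicit, load_by_cast) are hypothetical and only for illustration; whether the cast-based versions are well defined is exactly what I am unsure about, although they appear to work on the compilers I have tried.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Type punning: read the four 32-bit lanes through a cast pointer.
   Whether this is well defined is part of the question. */
static int32_t sum_by_punning(__m128i r1) {
    const int32_t *lanes = (const int32_t *)&r1;
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Explicit store: write the register to a buffer, then read the buffer. */
static int32_t sum_by_store(__m128i r1) {
    int32_t buffer[4];
    _mm_storeu_si128((__m128i *)buffer, r1); /* unaligned store is always safe */
    return buffer[0] + buffer[1] + buffer[2] + buffer[3];
}

/* Explicit load intrinsic; works for any alignment. */
static __m128i load_explicit(const int32_t *ptr) {
    return _mm_loadu_si128((const __m128i *)ptr);
}

/* The "cheap trick": cast the data pointer and dereference it.
   Assumption: ptr must be 16-byte aligned, otherwise this may fault. */
static __m128i load_by_cast(const int32_t *ptr) {
    assert(((uintptr_t)ptr % 16) == 0);
    return *(const __m128i *)ptr;
}
```

With a 16-byte aligned array holding 1, 2, 3, 4, both load variants followed by either sum variant should give the same total; what I cannot tell is whether they are equivalent as far as the hardware and the language rules are concerned.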

I would appreciate clarification on how to approach each point in the list. I find that intrinsics (and registers too) are not well documented in terms of behaviour and patterns. The only real suggestion about a pattern that I have found is load-compute-store, which says that you load all the registers first, do your computation, and only at the end perform your stores. But this is too generic, and it doesn't answer my doubts.
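
For reference, this is roughly what I understand by load-compute-store, as a minimal C sketch assuming SSE and float data. The helper name add_arrays is hypothetical; it also shows the "third register" variant from my first question.

```c
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Adds two float arrays element by element, four lanes at a time. */
static void add_arrays(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* load */
        __m128 r1 = _mm_loadu_ps(a + i);
        __m128 r2 = _mm_loadu_ps(b + i);
        /* compute: result goes to a third variable instead of reusing r1 */
        __m128 r3 = _mm_add_ps(r1, r2);
        /* store */
        _mm_storeu_ps(out + i, r3);
    }
    /* scalar tail for lengths that are not a multiple of 4 */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The unaligned load and store intrinsics are used here only so that the sketch makes no assumption about how the arrays are allocated.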

Thank you for your time
