I looked around and couldn't find a better forum for this, so I'll ask here. If there is a better one, please let me know and I'll move my question.
I'm quite new to developing for the GPU in general, so some topics are still a bit new/confusing for me. One of them is the notion of divergence when it comes to branch instructions: how the SIMD EU can become "stalled" when different kernel instances take different paths (please check my terminology here too). It's my understanding that, with NVIDIA, there's the notion of a "warp" (AMD calls it a "wavefront") across the EU, whereby the individual threads are fine as long as they're each executing the same instruction; once that's no longer the case, some threads are masked off and effectively stalled. This means the SIMD lanes aren't maximally used, and you can run into situations where performance suffers.
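To make sure I'm describing the right phenomenon, here's a toy OpenCL C illustration of what I mean by a divergent branch. This is just an illustration, not my actual code: lanes in the same SIMD group can take different sides of the data-dependent branch, so (as I understand it) the hardware executes both paths with some lanes masked off.

/* Toy example: data-dependent branch that can diverge across SIMD lanes. */
__kernel void divergent_example(__global const float *in, __global float *out)
{
    const int i = get_global_id(0);
    if (in[i] > 0.0f)            /* some lanes take this path...            */
        out[i] = sqrt(in[i]);
    else                         /* ...others take this one, so the two     */
        out[i] = 0.0f;           /* paths end up serialized                 */
}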
My high-level question is this: when programming for Intel's GPGPU, is this still the case? Or does it become more an issue of the instruction cache not being used optimally? Or both? Or neither? This article seems to indicate that divergence is an "issue" on Intel GPUs as well: https://software.intel.com/en-us/node/540425. And I understand it's not really an issue, but just the way GPUs work, possibly.
I ask because I am trying to put a decision tree (a machine learning technique) on the GPU: not the training, but the evaluation of the trained tree. I'm using it for computer vision, so I run each pixel in an image down the decision tree. The tree is binary, and at each node a question is asked of the pixel; the answer causes slightly different logic to be executed, and then we branch left or right until we reach a leaf node. I figured I would gain massive speed increases by putting this on the GPU. I was wrong; I see little speed improvement. The problem, as far as I can tell, is that this is precisely a worst-case scenario for the GPU, because different threads execute slightly different logic depending on the particular pixel they're working on, causing massive stalling as divergence and re-convergence occur. But, again, the worlds of CPU and GPU seem to be blurring more every day, so perhaps I'm wrong and this isn't the reason my code is running much slower than expected.
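For concreteness, here is a minimal OpenCL C sketch of the kind of per-pixel traversal I mean. It is not my actual kernel; the tree layout (flattened into node_kind, node_threshold, node_left, node_right, node_label arrays) and all the names are placeholders just to show the shape of the branching:

/* Sketch only: each work-item walks one pixel down a flattened binary tree.
 * A node index of -1 in node_left marks a leaf.  node_kind selects which
 * "question" is asked of the pixel, which is where neighbouring work-items
 * can end up running different logic and diverging. */
__kernel void classify_pixels(__global const uchar *image,
                              __global const int   *node_kind,      /* 0 or 1: which test to run */
                              __global const float *node_threshold, /* per-node threshold        */
                              __global const int   *node_left,      /* left child index, -1=leaf */
                              __global const int   *node_right,     /* right child index         */
                              __global const int   *node_label,     /* class label at leaves     */
                              __global       int   *labels,
                              const int width, const int height)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    if (x >= width || y >= height)
        return;

    int node = 0;                              /* start at the root          */
    while (node_left[node] >= 0) {             /* until we hit a leaf        */
        float response;
        /* Data-dependent branch: different work-items may run different
         * tests at the same tree depth, so SIMD lanes diverge here.    */
        if (node_kind[node] == 0)
            response = (float)image[y * width + x];
        else
            response = (float)image[y * width + x]
                     - (float)image[y * width + min(x + 1, width - 1)];

        node = (response < node_threshold[node]) ? node_left[node]
                                                 : node_right[node];
    }
    labels[y * width + x] = node_label[node];
}

My real kernel is along these lines, just with more node types and deeper trees, which is why I suspect divergence is what's hurting me.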
Thoughts? Again, I apologize if this is the wrong forum.