We're operating on data structures that are larger than the 4KB per-thread caches afforded by the GPU. In particular, I'm trying to implement something similar to this paper:
http://www.msr-waypoint.com/pubs/71445/ForestFire.pdf

What I've got is slow. I did some profiling to determine where the bottleneck was, and if that guidance is to be trusted, the following sentence tells me I need to do some memory optimization work:
"If the EU Array Stalled metric value is non-zero and correlates with the GPU L3 Misses, and if the algorithm is not memory bandwidth-bound, you should try to optimize memory accesses and layout."
Here is the output from the profiler (I was having trouble uploading the image to this forum post, so I uploaded it to my server): http://www.ben-rush.net/output.png.
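To make sure I understand what "optimize memory accesses and layout" would even mean here, this is the kind of change I assume the advice is pointing at, sketched in plain C++ (the struct and field names are placeholders, not my actual code): going from an array-of-structs to a struct-of-arrays so each pass only streams through the fields it actually reads.

#include <cstddef>
#include <vector>

// Array-of-structs: every access drags in fields the inner loop never touches,
// so each node pulls a mostly-unused cache line.
struct NodeAoS {
    float threshold;
    int   featureIndex;
    int   leftChild;
    int   rightChild;
};

// Struct-of-arrays: the traversal only walks the arrays it needs, which keeps
// the working set smaller and the accesses contiguous.
struct NodesSoA {
    std::vector<float> threshold;
    std::vector<int>   featureIndex;
    std::vector<int>   leftChild;
    std::vector<int>   rightChild;

    explicit NodesSoA(std::size_t n)
        : threshold(n), featureIndex(n), leftChild(n), rightChild(n) {}
};

Is that roughly the kind of layout work the profiler is nudging me toward?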
I'm hoping there's salvation in the GfxImage2D function(s), i.e. something that would let me work more intelligently with the GPGPU's memory when dealing with large data structures such as images.
Or am I barking up the wrong tree?
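For reference, here's a rough sketch of the access pattern I have in mind, written as plain host-side C++ rather than the actual GfxImage2D API (whose exact signatures I'm not sure of), with the tile size just a guess based on the 4KB figure: break the large image into blocks small enough to fit in a per-thread cache and process one block at a time.

#include <cstddef>
#include <vector>

// Process a large row-major image in TILE x TILE blocks so that each block's
// working set (TILE * TILE floats) stays within a ~4KB per-thread cache.
constexpr std::size_t TILE = 32;  // 32 * 32 * sizeof(float) = 4096 bytes

void processTiled(std::vector<float>& image, std::size_t width, std::size_t height)
{
    for (std::size_t ty = 0; ty < height; ty += TILE) {
        for (std::size_t tx = 0; tx < width; tx += TILE) {
            // Each (tx, ty) block would map to one GPU thread/work-item;
            // here it's just a serial loop to show the access pattern.
            const std::size_t yEnd = (ty + TILE < height) ? ty + TILE : height;
            const std::size_t xEnd = (tx + TILE < width)  ? tx + TILE : width;
            for (std::size_t y = ty; y < yEnd; ++y) {
                for (std::size_t x = tx; x < xEnd; ++x) {
                    image[y * width + x] *= 2.0f;  // placeholder per-pixel work
                }
            }
        }
    }
}

If GfxImage2D (or something else in the toolkit) already handles this kind of tiling for me, that's exactly what I'm hoping to hear.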