I'm developing real-time audio processing software. There may be many processors active at any moment (even 100, for example), arranged in several parallel chains. I cannot make the processors cooperate and must assume any possible processing order. Each of them receives a block of data, usually 256-1024 values, and needs to process it as quickly as possible so that the results can be passed to the next item in the chain. If the data is not delivered in time, bad things happen... But in many cases only a few processors are in use, and then the goal is to keep overall CPU usage minimal. The algorithms in each processor vary a lot, so it is hard to predict anything.
The "host" for all these processors is unknown and usually implements some kind of parallelization as well. However, in a huge test project the host was reporting "near trouble" CPU usage while the system task manager showed only about 14% CPU usage on my 8-core Xeon E5, so evidently there is a lot of spare processing power.
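To make the setup concrete, here is a minimal sketch of what one chain looks like. The `Processor` interface, the `Gain` processor, and `runChain` are hypothetical names used only for illustration; the real processors are far more varied.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical interface: every processor transforms one block in place.
struct Processor {
    virtual ~Processor() = default;
    virtual void process(float* block, std::size_t n) = 0;
};

// Trivial example processor: multiplies every sample by a fixed gain.
struct Gain : Processor {
    explicit Gain(float g) : gain(g) {}
    void process(float* block, std::size_t n) override {
        for (std::size_t i = 0; i < n; ++i) block[i] *= gain;
    }
    float gain;
};

// One chain: each processor must finish before the next one can start,
// so parallelism is only available inside a processor or across chains.
inline void runChain(const std::vector<std::unique_ptr<Processor>>& chain,
                     float* block, std::size_t n) {
    assert(block != nullptr && n > 0);
    for (const auto& p : chain) p->process(block, n);
}
```

Several such chains run side by side, and the host decides when each one is called.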
As far as I know, these are the choices:
1) TBB - this one looks harder to use.
2) Cilk
3) OpenMP - I actually tested this one with MSVC, and sadly its worker threads seem to wait actively (spinning), which means the CPU was at 100% despite a pretty small improvement in performance.
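For reference, the OpenMP test was roughly of the following shape. This is only a sketch: `processBlockOmp` is a hypothetical name and the `std::tanh` call is a placeholder for the real per-sample work. It assumes the code is built with `/openmp` on MSVC; without that flag the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for one processor's inner loop.
inline void processBlockOmp(std::vector<float>& block) {
    assert(!block.empty());
    // MSVC's OpenMP 2.0 implementation requires a signed loop index.
    const int n = static_cast<int>(block.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        block[i] = std::tanh(block[i]); // placeholder for the real work
    }
}
```

With blocks of only 256-1024 samples, each parallel region is very short, so as far as I can tell the worker threads spend most of their time waiting between blocks.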
I'd prefer a solution that can be linked statically. All of the processor implementations will live in a single DLL (Windows) / dylib (OS X).
Any recommendations?