This article demonstrates how to write vectorization-friendly code inside a TBB parallel_for block. Consider the following code snippet:
$ cat test1.cc
#include <iostream>
#include <tbb/tbb.h>
#include <tbb/parallel_for.h>
#include <cstdlib>
using namespace std;
using namespace tbb;

long len = 0;
float *a;
float *b;
float *c;

class Test {
public:
    void operator()( const blocked_range<size_t>& x ) const {
        for (long i=x.begin(); i!=x.end(); ++i ) {
            c[i] = (a[i] * b[i]) + b[i];
        }
    }
};

int main(int argc, char* argv[]) {
    cout << atol(argv[1]) << endl;
    len = atol(argv[1]);
    a = new float[len];
    b = new float[len];
    c = new float[len];
    parallel_for(blocked_range<size_t>(0,len, 100), Test() );
    return 0;
}
The code above contains a parallel_for call that invokes the Test functor. When this program is compiled, the vectorization report states that the loop was not vectorized, as shown below:
$ icpc -S -O3 -vec-report2 test1.cc -o test1_O3_icc.s
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
partitioner.h(158): (col. 9) remark: loop was not vectorized: existence of vector dependence
Looking at the loop closely, the compiler cannot determine whether it is a countable loop, because the loop bounds are function calls (x.begin() and x.end()). Copying the bounds into local variables before the loop, as shown below, removes this ambiguity for the compiler:
From:
class Test {
public:
    void operator()( const blocked_range<size_t>& x ) const {
        for (long i=x.begin(); i!=x.end(); ++i ) {
            c[i] = (a[i] * b[i]) + b[i];
        }
    }
};
To:
class Test {
public:
    void operator()( const blocked_range<size_t>& x ) const {
        long j = x.begin();
        long k = x.end();
        for (long i=j; i!=k; ++i ) {
            c[i] = (a[i] * b[i]) + b[i];
        }
    }
};
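The same pattern can be tried outside TBB. The following is a minimal sketch, not the article's own code: SimpleRange is a hypothetical stand-in for blocked_range<size_t> that models only begin() and end(), and multiply_add and run_multiply_add are illustrative names. The arrays are passed as parameters rather than globals, and the index uses std::size_t to match the range type; both are assumptions made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tbb::blocked_range<size_t>, so the sketch
// compiles without TBB. Only begin() and end() are modeled.
struct SimpleRange {
    std::size_t first;
    std::size_t last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Same loop body as Test::operator(), with the bounds read once into
// locals so the trip count is fixed before the loop starts.
void multiply_add(const SimpleRange& x, const float* a, const float* b, float* c) {
    const std::size_t lo = x.begin();
    const std::size_t hi = x.end();
    assert(lo <= hi);
    for (std::size_t i = lo; i != hi; ++i) {
        c[i] = (a[i] * b[i]) + b[i];
    }
}

// Illustrative helper: runs the kernel over [0, n) with a[i] = i and b[i] = 2.
std::vector<float> run_multiply_add(std::size_t n) {
    std::vector<float> a(n), b(n, 2.0f), c(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
    }
    multiply_add(SimpleRange{0, n}, a.data(), b.data(), c.data());
    return c;
}
```

Hoisting the bounds does not change the result of the loop; it only makes the iteration count visible to the compiler up front.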
The vectorization report for the above change is:
$ icpc -S -O3 -vec-report2 test1.cc -o test1_O3_icc.s
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: existence of vector dependence
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: existence of vector dependence
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
partitioner.h(158): (col. 9) remark: loop was not vectorized: existence of vector dependence
The loop is still not vectorized, but this time because the compiler assumes that there is a vector dependence. The compiler has no way of knowing whether the arrays “a”, “b” and “c” are aliased (that is, whether they point to overlapping memory locations). Since the arrays in this case are disjoint in memory, declaring them as restrict pointers helps. The __restrict__ keyword explicitly informs the compiler that there is no aliasing. The code change is:
From:
float *a; float *b; float *c;
To:
float * __restrict__ a; float * __restrict__ b; float * __restrict__ c;
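To see why the compiler is conservative when it lacks this promise, it may help to look at what happens if the output array really does overlap an input. The sketch below is illustrative only: multiply_add_serial, multiply_add_batched and run_overlapping are hypothetical names, and the "batched" version merely imitates a vector-style evaluation order (read all inputs, then write all outputs) in plain C++.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar order: element i is computed and stored before element i+1 is read.
void multiply_add_serial(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = (a[i] * b[i]) + b[i];
    }
}

// Vector-like order: every input is read before any output is written,
// roughly what a single very wide vector operation would do.
void multiply_add_batched(const float* a, const float* b, float* c, std::size_t n) {
    std::vector<float> tmp(n);
    for (std::size_t i = 0; i < n; ++i) {
        tmp[i] = (a[i] * b[i]) + b[i];
    }
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = tmp[i];
    }
}

// Runs one of the kernels with c deliberately overlapping a (c == a + 1),
// so iteration i reads the value that iteration i-1 wrote.
std::vector<float> run_overlapping(bool batched) {
    std::vector<float> buf = {1.0f, 0.0f, 0.0f, 0.0f, 0.0f};
    const std::vector<float> b(4, 1.0f);
    assert(buf.size() == b.size() + 1);  // c = buf + 1 needs one extra slot
    float* a = buf.data();
    float* c = buf.data() + 1;
    if (batched) {
        multiply_add_batched(a, b.data(), c, b.size());
    } else {
        multiply_add_serial(a, b.data(), c, b.size());
    }
    return buf;
}
```

With the overlapping layout, the scalar order leaves the buffer as 1 2 3 4 5, while the batched order leaves 1 2 1 1 1. Because the two orders can disagree when arrays overlap, the compiler must keep the scalar order unless it is told that overlap cannot occur.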
With the restrict-qualified declarations in place, the compiler vectorizes the loop, as the report below shows:
$ icpc -S -O3 -vec-report2 test1.cc -o test1_O3_icc.s
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: LOOP WAS VECTORIZED
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: LOOP WAS VECTORIZED
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
partitioner.h(158): (col. 9) remark: loop was not vectorized: existence of vector dependence
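Both changes can also be expressed on a free function instead of on global pointers. The following is a sketch under stated assumptions, not a replacement for the code above: multiply_add_restrict and run_chunk are hypothetical names, and __restrict__ is a compiler extension (accepted by GCC, Clang and the Intel compiler) rather than standard C++. Whether a given compiler version vectorizes this loop should be confirmed with its own vectorization report.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: the caller promises, via __restrict__, that a, b
// and c do not overlap. The bounds arrive as plain values, so the trip
// count is known before the loop starts.
void multiply_add_restrict(const float* __restrict__ a,
                           const float* __restrict__ b,
                           float* __restrict__ c,
                           std::size_t begin, std::size_t end) {
    assert(begin <= end);
    for (std::size_t i = begin; i != end; ++i) {
        c[i] = (a[i] * b[i]) + b[i];
    }
}

// Illustrative helper: processes only the chunk [begin, end) of three
// disjoint arrays of length n, the way one parallel_for task would
// process its sub-range. Untouched elements of c keep the value -1.
std::vector<float> run_chunk(std::size_t n, std::size_t begin, std::size_t end) {
    std::vector<float> a(n), b(n, 0.5f), c(n, -1.0f);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
    }
    multiply_add_restrict(a.data(), b.data(), c.data(), begin, end);
    return c;
}
```

A TBB body could forward x.begin() and x.end() to such a function from its operator(). Passing overlapping arrays to it would break the no-aliasing promise and lead to undefined behavior.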