Compiler Methodology for Intel® MIC Architecture

OpenMP Related Tips

OpenMP* Loop Collapse Clause

Use the OpenMP collapse clause to increase the total number of iterations that are partitioned across the available OpenMP threads, reducing the granularity of work done by each thread. If the amount of work per thread is still non-trivial after collapsing, this can improve the parallel scalability of the application.

You can improve performance by avoiding use of the collapsed loop indices inside the collapsed loop nest where possible. The compiler has to recreate them from the single collapsed loop index using divide/modulo operations, and if their uses are complicated enough, those operations are not removed by dead-code elimination during compiler optimization (a sketch of this idea follows the example below):

Example showing use of collapse clause:

#pragma omp parallel for collapse(2)
  for (i = 0; i < imax; i++) {
    for (j = 0; j < jmax; j++) a[j + jmax*i] = 1.;
  }
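
If the loop body needs nothing beyond the flattened index, one way to apply the tip above is to iterate over a single fused index so the compiler never has to reconstruct i and j at all. The following is a minimal sketch of that idea (not from the original example), assuming a is a contiguous array of imax*jmax elements:

#pragma omp parallel for
  for (int n = 0; n < imax*jmax; n++) a[n] = 1.;   /* n plays the role of j + jmax*i */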

Common mistakes with OpenMP clauses

Make sure to add the correct OpenMP data-sharing clauses (private, reduction, and so on). If they are missing where required, the program likely has data races. Because data races behave non-deterministically, the program may appear to work “correctly” in some configurations, but with very different performance characteristics from the correctly modified version.

One example is given below:

Incorrect code with data races:

#pragma omp parallel for
for (i = 0; i < threads; i++)
{
    offset = i*array_size;                /* data race: offset, j, and k are shared across threads */
    for (j = 0; j < (iterations*(vec_ratio)); j++)
    {
        for (k = 0; k < array_size; k++)
        {
            sum1 += a[k+offset] * s1;     /* data race: unsynchronized updates to shared sum1/sum2 */
            sum2 += a[k+offset] * s2;
        }
    }
}

Modified version with data races removed:

#pragma omp parallel for num_threads(threads) reduction(+:sum1,sum2)
for (i = 0; i < threads; i++)
{
    float sum1_local = 0.0f;              /* per-thread partial sums */
    float sum2_local = 0.0f;
    int offset = i*array_size;            /* declared inside the loop, hence private */

    for (int j = 0; j < (iterations*(vec_ratio)); j++)
    {
        for (int k = 0; k < array_size; k++)
        {
            sum1_local += a[k+offset] * s1;
            sum2_local += a[k+offset] * s2;
        }
    }
    sum1 += sum1_local;                   /* combined safely by the reduction clause */
    sum2 += sum2_local;
}
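
For reference, below is a minimal self-contained version of the corrected reduction pattern that can be compiled (for example with the Intel compiler's OpenMP option, -qopenmp) and run; the array contents, sizes, and scale factors are illustrative values, not from the original benchmark:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int threads = 4, array_size = 1000;
    const float s1 = 2.0f, s2 = 3.0f;
    float *a = malloc(sizeof(float) * threads * array_size);
    float sum1 = 0.0f, sum2 = 0.0f;

    for (int n = 0; n < threads * array_size; n++) a[n] = 1.0f;

    /* reduction(+:sum1,sum2) gives each thread private copies that are
       combined at the end of the loop, so no data race remains */
    #pragma omp parallel for num_threads(threads) reduction(+:sum1,sum2)
    for (int i = 0; i < threads; i++) {
        int offset = i * array_size;      /* declared inside the loop, hence private */
        for (int k = 0; k < array_size; k++) {
            sum1 += a[k + offset] * s1;
            sum2 += a[k + offset] * s2;
        }
    }

    printf("sum1=%f sum2=%f\n", sum1, sum2);   /* expect 8000.0 and 12000.0 */
    free(a);
    return 0;
}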

Reduce barrier synchronization overheads

When executing on a large number of threads, barrier synchronization can add significant overhead, depending on the OpenMP constructs used. In some cases you can use the “nowait” clause to reduce this overhead. In the example below, the clause is used with static scheduling on an omp for loop inside a parallel region; it tells the threads not to synchronize after completing their individual pieces of work. Using “nowait” with static scheduling is safe here because each thread executes the same iteration numbers of the inner loop for every image_id value. (Changing the scheduling type to dynamic would make the “nowait” clause incorrect.)

void *task(void *tid_, float **Raw, float *Vol)
{
    long int tid = (long int) tid_;
    if (tid == 0) printf("bptask_vgather\n");

    #pragma omp parallel
    for (int image_id = 0; image_id < global_number_of_projection_images; image_id++) {
        float *local_Raw = Raw[image_id];
        int chunksize = NBLOCKS/nthreads;   // nthreads==1, tid==0
        int beg_ = tid*chunksize;
        int end_ = (tid+1)*chunksize;

        /* nowait: no barrier after this loop; safe with static scheduling (see text) */
        #pragma omp for schedule(static) nowait
        for (int b = beg_; b < end_; b++) {
            int K = ((b%NBLOCKSDIM_K) << BLOCKDIMBITS_K);
            int t = b/NBLOCKSDIM_K;
            int J = ((t%NBLOCKSDIM_J) << BLOCKDIMBITS_J);
            t = t/NBLOCKSDIM_J;
            int I = ((t%NBLOCKSDIM_I) << BLOCKDIMBITS_I);
            float *localVol = Vol + b*BLOCKDIM_I*BLOCKDIM_J*BLOCKDIM_K;

            for (int i = I; i < I+BLOCKDIM_I; i++)
            {
                for (int j = J; j < J+BLOCKDIM_J; j++)
                {
                    float tmp0 = c00*i + c01*j + c03;
                    float tmp1 = c10*i + c11*j + c13;
                    float tmp3 = c30*i + c31*j + c33;

                    #pragma simd
                    for (int k = K; k < K+BLOCKDIM_K; k++)
                    {
                        float w = 1/(tmp3 + c32*k);
                        float lreal = w*(tmp1 + c12*k); // y
                        float mreal = w*(tmp0 + c02*k); // x
                        int l = (int)(lreal); float wl = lreal - l;
                        int m = (int)(mreal); float wm = mreal - m;

                        /* bilinear interpolation of the projection image */
                        (*localVol) += w*w*((1-wl)*((1-wm)*local_Raw[l*WinY + m]
                                                   + wm  *local_Raw[l*WinY + m+1])
                                           + wl  *((1-wm)*local_Raw[l*WinY+WinY + m]
                                                   + wm  *local_Raw[l*WinY+WinY + m+1]));
                        localVol++;
                    }
                }
            }
        }
    }
    return NULL;
}
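
The rule that makes “nowait” safe above can be seen more directly in the following minimal sketch (the arrays a, b, c, the functions f and g, and the bound N are illustrative, not taken from the code above): with static scheduling and identical loop bounds, each thread is assigned the same iterations of both loops, so no thread can read an element of b that another thread has not yet written.

#pragma omp parallel
{
    #pragma omp for schedule(static) nowait
    for (int n = 0; n < N; n++)
        b[n] = f(a[n]);

    #pragma omp for schedule(static)
    for (int n = 0; n < N; n++)
        c[n] = g(b[n]);     /* safe: the same thread wrote b[n] in the first loop */
}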

How to Ensure Efficient Vectorization for Loops Inside OpenMP Parallel Regions

See the "Vector Alignment and Parallelization" section in the following article to learn more about tips for alignment of data in vector loops inside parallel regions: Data Alignment to Assist Vectorization

NEXT STEPS

It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on the Intel® Xeon Phi™ coprocessor. The paths provided in this guide reflect the steps necessary to get the best possible application performance.

BACK to the chapter Efficient Parallelization
