Asynchronous Offload - C++ Code Examples

This document describes asynchronous data transfer, asynchronous computation, and memory management without data transfer, with code examples of common usage scenarios. All examples are in C/C++.

Introduction

Two C++ pragmas handle data transfer and waiting for completion.
The pragma for data transfer only, with an optional signal clause that makes the transfer asynchronous, is:

#pragma offload_transfer <clauses> [ signal(<tag>) ]

The pragma to wait for completion of asynchronous activity is:

#pragma offload_wait <clauses> wait(<tag>)

The offload pragma also takes optional signal and wait clauses:

#pragma offload <clauses> [ signal(<tag>) ] [ wait(<tag>) ] <statement>

The offload_transfer and offload_wait pragmas are stand-alone and do not apply to the subsequent code block.

Data Transfer

The offload_transfer pragma is a stand-alone pragma, meaning that it does not apply to a following statement. It contains a target clause and either all in clauses or all out clauses. Without a signal clause, offload_transfer initiates and completes a synchronous data transfer. With a signal clause, it only initiates the data transfer; a later pragma with a wait clause is used to wait for the transfer to complete. The offload_transfer pragma can itself take a wait clause. Expressions in signal and wait clauses are address-sized values that serve as tags on the asynchronous operation.

// Example 1:
// Synchronous data transfer CPU -> MIC
// Next statement executed after data transfer is completed
#pragma offload_transfer target(mic:0) in(a,b,c)

// Example 2:
// Initiate asynchronous data transfer CPU -> MIC
#pragma offload_transfer target(mic:0) in(a,b,c) signal(&a)

The offload_wait pragma is also a stand-alone pragma that does not require a following statement. It contains a target clause and a wait clause, which causes execution to continue only after the asynchronous activity associated with the tag has completed.

// Example 3:
// Wait for the activity signaled with &p to complete. The address of p is the tag.
#pragma offload_wait target(mic:0) wait(&p)

Memory Management

The offload_transfer pragma can allocate and free memory on the coprocessor without transferring data by using the nocopy clause. This is typically done outside of a loop to amortize the cost of allocation.

// Example 4:
#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)
// Allocate memory on the coprocessor (without also transferring data)
#pragma offload_transfer target(mic:0) nocopy(p,q : length(l) ALLOC)
…
for (…)
{
  // Use of allocated memory on the coprocessor for offloads
  #pragma offload target(mic:0) in(p:length(l) REUSE) out(q:length(l) REUSE)
  {
    // computation using p and q
    ...
  }
}
…
// Free memory on the coprocessor (without also transferring data)
#pragma offload_transfer target(mic:0) nocopy(p,q : length(l) FREE)
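
As a rough, self-contained sketch of the allocate-once pattern in Example 4, the function below doubles an array repeatedly while both buffers stay allocated on the coprocessor. The name scale_repeatedly and its parameters are hypothetical and do not come from the original example. The sketch assumes a compiler with offload support; a compiler without it ignores the pragmas (possibly with a warning) and runs everything on the host.

```cpp
#define ALLOC alloc_if(1) free_if(0)
#define FREE  alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)

// Hypothetical helper (not from the original article): doubles p into q
// `rounds` times, feeding each result back in as the next input.
void scale_repeatedly(float* p, float* q, int len, int rounds) {
  // Allocate both buffers on the coprocessor once; no data is copied.
  #pragma offload_transfer target(mic:0) nocopy(p, q : length(len) ALLOC)

  for (int r = 0; r < rounds; ++r) {
    // Copy p in and q out each round, reusing the existing allocations.
    #pragma offload target(mic:0) in(p : length(len) REUSE) out(q : length(len) REUSE)
    {
      for (int i = 0; i < len; ++i) q[i] = 2.0f * p[i];
    }
    // Host side: feed the result back in for the next round.
    for (int i = 0; i < len; ++i) p[i] = q[i];
  }

  // Free both buffers on the coprocessor; no data is copied.
  #pragma offload_transfer target(mic:0) nocopy(p, q : length(len) FREE)
}
```

For instance, with p = {1, 2}, len = 2 and rounds = 3, q ends up as {8, 16}.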

Send Input Data Asynchronously

The most typical usage initiates the data transfer, performs some CPU activity, and then starts the offload computation that uses the transferred data. The data is placed in the same variables listed in the transfer initiation. Those variables must be accessible by the time the offload pragma begins execution.

// Example 5:
// Initiate asynchronous data transfer CPU -> MIC
#pragma offload_transfer target(mic:0) in(p,q,r) signal(&p)
…
…
// Do the offload only after data has arrived
#pragma offload target(mic:0) wait(&p)
{
  // offload computation
  … = p;
}
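
A minimal sketch of this pattern with a pointer buffer might look like the following. The function mean_after_async_send is a hypothetical name, and the explicit alloc_if/free_if modifiers are an assumption made here so that the buffer stays allocated on the coprocessor between the transfer and the offload. On a compiler without offload support the pragmas are ignored and the code runs on the host.

```cpp
// Hypothetical helper (not from the original article): returns the mean of
// data[0..n-1], summing on the coprocessor after an asynchronous send.
float mean_after_async_send(float* data, int n) {
  // Start copying data to the coprocessor and keep it allocated there.
  // The pointer value itself serves as the tag.
  #pragma offload_transfer target(mic:0) \
    in(data : length(n) alloc_if(1) free_if(0)) signal(data)

  // Host work that overlaps with the transfer.
  float scale = 1.0f / static_cast<float>(n);

  float sum = 0.0f;
  // Start the offload only after the data has arrived; the buffer is already
  // on the coprocessor, so it is not copied again, and it is freed afterwards.
  #pragma offload target(mic:0) wait(data) \
    nocopy(data : length(n) alloc_if(0) free_if(1)) inout(sum)
  {
    for (int i = 0; i < n; ++i) sum += data[i];
  }
  return sum * scale;
}
```

Calling it on the four values 1, 2, 3, 4 gives 2.5.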

Receive Output Asynchronously

Here, an offload computation produces results that are transferred back to the host at a later time. The offload pragma finishes the work but does not immediately copy the data back. Instead, an asynchronous offload_transfer initiates the copy. Later, when the results are needed, an offload_wait pragma waits for the transfer to complete.

// Example 6a:
// Perform the offload computation but don't copy back results immediately
#pragma offload target(mic:0) nocopy(p)
{
  p = …;
}
// Initiate asynchronous data transfer MIC -> CPU
#pragma offload_transfer target(mic:0) out(p) signal(&p)
…
…
// Wait for data to arrive
#pragma offload_wait target(mic:0) wait(&p)
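
The same sequence can be sketched as a complete function. The name squares_async_out is hypothetical, and the alloc_if/free_if modifiers are an assumption added so that the result buffer persists on the coprocessor until the transfer has been issued. Without offload support the pragmas are ignored and the loop runs on the host.

```cpp
// Hypothetical helper (not from the original article): fills
// result[i] = i * i on the coprocessor and copies it back asynchronously.
void squares_async_out(int* result, int n) {
  // Compute on the coprocessor; allocate the buffer there but copy nothing.
  #pragma offload target(mic:0) nocopy(result : length(n) alloc_if(1) free_if(0))
  {
    for (int i = 0; i < n; ++i) result[i] = i * i;
  }

  // Start copying the results back and free the coprocessor buffer
  // once the transfer is done.
  #pragma offload_transfer target(mic:0) \
    out(result : length(n) alloc_if(0) free_if(1)) signal(result)

  // Unrelated host work can go here; result must not be read yet.

  // Block until the results have arrived on the host.
  #pragma offload_wait target(mic:0) wait(result)
}
```

After the call returns, result[3] is 9 for any n greater than 3.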

Asynchronous Computation

The host initiates an offload to be performed asynchronously and proceeds to the next statement as soon as the computation has started. Later in the code, an offload_wait pragma is used to wait for completion of the offload activity.

// Example 6b:
char signal_var;
int *p;
do {
  // Initiate asynchronous computation
  #pragma offload … in( p:length(1000) ) signal(&signal_var)
  {
    mic_compute();
  }
  concurrent_cpu_activity();
  // Wait for the offload to complete (same target clause as the offload above)
  #pragma offload_wait … wait(&signal_var)
} while (1);
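
As an illustration of overlapping host and coprocessor work, the sketch below splits a sum between the two sides. The function overlapped_sum is a hypothetical name, and the sketch assumes that the output of an asynchronous offload is only read on the host after the matching offload_wait. On a compiler without offload support the pragmas are ignored and both halves run on the host.

```cpp
// Hypothetical helper (not from the original article): sums 1..n, with the
// lower half computed on the coprocessor and the upper half on the host.
int overlapped_sum(int n) {
  char tag;          // only its address is used, as the signal tag
  int mic_part = 0;
  int half = n / 2;

  // Start the coprocessor computation and return immediately.
  #pragma offload target(mic:0) in(half) inout(mic_part) signal(&tag)
  {
    for (int i = 1; i <= half; ++i) mic_part += i;
  }

  // Host work that runs while the offload is in flight.
  int cpu_part = 0;
  for (int i = half + 1; i <= n; ++i) cpu_part += i;

  // Wait for the offload before reading mic_part.
  #pragma offload_wait target(mic:0) wait(&tag)
  return mic_part + cpu_part;
}
```

For n = 100 the function returns 5050.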

Testing Signals

Some scenarios require testing to determine whether the computation signaled with a given tag is finished. Use the _Offload_signaled function (non-blocking mechanism) to check if an offload has completed.

// Example 7:
// Initiate asynchronous computation
int c;
#pragma offload target(mic:mic_no) signal(&c) ...
{
   S3;
}
...
// Test whether the computation signaled with tag &c has completed
if (_Offload_signaled(mic_no, &c)) …
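
One way to use this test is a polling loop that keeps the host busy until the offload reports completion. The sketch below is an assumption-laden illustration: offload_finished and poll_until_done are hypothetical names, the _Offload_signaled declaration is assumed to come from offload.h, and the final offload_wait before reading the result is a conservative choice rather than a documented requirement. The __INTEL_OFFLOAD guard lets the same source build on a host-only compiler, where the offload block simply runs synchronously.

```cpp
#ifdef __INTEL_OFFLOAD
#include <offload.h>   // assumed to declare _Offload_signaled
#endif

// Hypothetical wrapper (not from the original article) around the
// non-blocking completion test.
static bool offload_finished(int mic_no, void* tag) {
#ifdef __INTEL_OFFLOAD
  return _Offload_signaled(mic_no, tag) != 0;
#else
  (void)mic_no;
  (void)tag;
  return true;  // host-only build: the block below has already run
#endif
}

// Hypothetical helper: sums 1..100 on the coprocessor, stores the sum in
// *offload_result, and returns how many host work units ran while polling.
int poll_until_done(int mic_no, int* offload_result) {
  char tag;        // only its address is used, as the signal tag
  int result = 0;

  #pragma offload target(mic:mic_no) inout(result) signal(&tag)
  {
    for (int i = 1; i <= 100; ++i) result += i;
  }

  int host_units = 0;
  while (!offload_finished(mic_no, &tag)) {
    ++host_units;  // stand-in for one unit of useful host work
  }

  // Conservative assumption: wait on the tag before reading the result.
  #pragma offload_wait target(mic:mic_no) wait(&tag)

  *offload_result = result;
  return host_units;
}
```

Polling suits cases where the host has a queue of small tasks and should only block once that queue is empty.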

Double-buffering

Use the offload, offload_transfer and offload_wait pragmas to implement a double-buffering algorithm. The examples below show memory allocated on the target device being reused, asynchronous data transfers, and the use of signal and wait clauses to control asynchronous offloads.

// Example 8: Double-buffering Input
void do_async_in()
{
  int i;
  #pragma offload_transfer target(mic:0) in(in1 : length(count) REUSE) signal(in1)
  for (i=0; i<iter; i++)
  {
    if (i%2 == 0)
    {
      #pragma offload_transfer target(mic:0) if(i!=iter-1) \
        in(in2 : length(count) REUSE) signal(in2)

      #pragma offload target(mic:0) nocopy(in1) wait(in1) \
        out(out1 : length(count) REUSE)
         compute(in1, out1);
    } else {
      #pragma offload_transfer target(mic:0) if(i!=iter-1) \
        in(in1 : length(count) REUSE ) signal(in1)

      #pragma offload target(mic:0) nocopy(in2) wait(in2) \
        out(out2 : length(count) REUSE)
          compute(in2, out2);
    }
  }
}
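
To make the order of operations in do_async_in easier to follow, the host-only sketch below lists what happens in each iteration. It performs no offload at all; input_schedule and the step strings are hypothetical and exist only to trace the control flow of the example above.

```cpp
#include <string>
#include <vector>

// Hypothetical tracing helper (not from the original article): returns one
// entry for the priming transfer plus one entry per loop iteration.
std::vector<std::string> input_schedule(int iter) {
  std::vector<std::string> steps;
  steps.push_back("send in1");  // transfer issued before the loop
  for (int i = 0; i < iter; ++i) {
    // Even iterations compute on in1 and prefetch in2; odd ones swap roles.
    const std::string cur  = (i % 2 == 0) ? "in1" : "in2";
    const std::string next = (i % 2 == 0) ? "in2" : "in1";
    std::string step;
    if (i != iter - 1) {
      step += "send " + next + ", ";  // skipped on the last iteration
    }
    step += "wait+compute " + cur;
    steps.push_back(step);
  }
  return steps;
}
```

For iter = 3 the trace is "send in1", "send in2, wait+compute in1", "send in1, wait+compute in2", "wait+compute in1", which shows each transfer being issued one iteration ahead of the computation that consumes it.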

// Example 9: Double-buffering Output
void do_async_out()
{
  int i;
  for (i=0; i<iter+1; i++)
  {
    if (i%2 == 0) {
      if (i<iter) {
        #pragma offload target(mic:0) in(in1 : length(count) REUSE) nocopy(out1)
          compute(in1, out1);
        #pragma offload_transfer target(mic:0) out(out1:length(count) REUSE) signal(out1)
      }
      if (i>0) {
        #pragma offload_wait target(mic:0) wait(out2)
          use_result(out2);
      }
    } else {
      if (i<iter) {
        #pragma offload target(mic:0) in(in2 : length(count) REUSE) nocopy(out2)
          compute(in2, out2);
        #pragma offload_transfer target(mic:0) out(out2:length(count) REUSE) signal(out2)
      }
      if (i>0) {
        #pragma offload_wait target(mic:0) wait(out1)
          use_result(out1);
      }
    }
  }
}

Summary

Asynchronous offload allows data transfer and computation to overlap. This method does not require the use of additional threads on the host and is useful for pipelined operations. Refer to the following sample code installed with the Intel® C++ Compiler for more details (default installation directory):

Linux*: /opt/intel/composer_xe_2015/Samples/en_US/C++/mic_samples/intro_sampleC
Windows*: C:\Program Files (x86)\Intel\Composer XE 2015\Samples\en_US\C++
