Channel: Intel® C++ Compiler

semantics and use of __intel_simd_lane w/ SIMD functions?


I'm seeking clarification on how (and when) to properly use __intel_simd_lane().  I'm trying to understand how best to write SIMD functions that are portable across Intel architectures with different hardware SIMD lengths.  Here is my toy example (retyped, so there might be small errors):

#include <omp.h>
#define M 16

#pragma omp declare simd uniform(x,j), linear(lane)
unsigned int
add2a(unsigned int *x, unsigned int lane, unsigned int j)
{
  return x[lane] += j;
}

#pragma omp declare simd uniform(x,j)
unsigned int
add2b(unsigned int *x, unsigned int j)
{
  return x[__intel_simd_lane()] += j;
}

#pragma omp declare simd linear(x) uniform(j)
unsigned int
add2c(unsigned int *x, unsigned int j)
{
  return *x += j;
}

#include <stdio.h>
#include <string.h>
int
main(int argc, char *argv[])
{
  unsigned int x[M] = {0};
  unsigned int y[M];

  memcpy(y,x,M*sizeof(y[0]));
#pragma omp simd
  for (int j=0; j<M; j++) add2a(y,j,1);
  for (int j=0; j<M; j++) printf("%u ", y[j]);
  printf("\n");

  memcpy(y,x,M*sizeof(y[0]));
#pragma omp simd
  for (int j=0; j<M; j++) add2b(y,1);
  for (int j=0; j<M; j++) printf("%u ", y[j]);
  printf("\n");

}

add2a is taken from Example 10 in https://software.intel.com/en-us/intel-parallel-universe-magazine (issue 22), but I don't like the idea of having to modify the argument list to make a SIMD function. add2b is based on the C compiler 16.0 documentation for __intel_simd_lane().  (add2c is the way I probably would have written it.) The issue is that these do not all behave the same way.  Compile with icc -std=c99 -O3 -xHost simdlane.c -qopenmp and you get the following (on an E5-2690 Sandy Bridge):

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1   <- add2a
4 4 4 4 0 0 0 0 0 0 0 0 0 0 0 0   <- add2b

If you add simdlen(M) to the add2b loop pragma, things start working right for M=2,4,8,16,32,64.  For any other value (not a power of 2, or M>64), only the first 4 elements are set, and they are incorrect.  I guess I understand what's going on, but it seems there's a lot of subtle behavior that I would need to take into account to use it in portable code, and that would make my code more complicated, not less.

So, have I misunderstood __intel_simd_lane()?  Is it useful for portable code?  When is it the right thing to use and how should it be used?

Thanks.

