Selective Use of gatherhint/scatterhint Instructions

Compiler Methodology for Intel® MIC Architecture

Overview

The -opt-gather-scatter-unroll=<N> compiler option makes the compiler generate the gatherhint/scatterhint instructions supported by the coprocessor and unroll its gather/scatter loops. This is useful if your code performs non-unit-stride memory accesses or uses indirect addressing through pointers or index arrays.
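As an illustration of the access patterns meant here, the following is a minimal sketch of three loops that a vectorizing compiler would typically implement with gather or scatter sequences. The function names are hypothetical and do not come from this guide, and the build line in the comment is an assumption apart from the -opt-gather-scatter-unroll option itself.

```c
#include <stddef.h>

/* Hypothetical build line (only the unroll option is taken from this article):
 *   icc -mmic -O3 -opt-gather-scatter-unroll=4 kernels.c
 */

/* Indirect read through an index array: a gather pattern. */
void gather_indexed(float *dst, const float *src, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[idx[i]];
}

/* Indirect write through an index array: a scatter pattern. */
void scatter_indexed(float *dst, const float *src, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[idx[i]] = src[i];
}

/* Non-unit-stride read: consecutive iterations touch src[0], src[stride], ... */
void gather_strided(float *dst, const float *src, size_t stride, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i * stride];
}
```

Whether a given loop is actually vectorized with gather/scatter depends on the compiler and its options, so checking the vectorization report or the generated assembly is advisable before tuning N.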

Topics

The following describes the compiler behavior related to gatherhint/scatterhint generation and unrolling of gather/scatter loops.

There are no “one-shot” gather/scatter instructions on KNC (Knights Corner, the first-generation Intel® Xeon Phi™ coprocessor), so the compiler generates a loop to perform a complete gather or scatter. By default the loop looks as follows:
L1:
gather
jkz L2
gather
jknz L1
L2:

The default sequence works well for most applications, but some applications run faster when this loop is unrolled, and the best unroll factor differs from one application to another. When the loop is unrolled, placing gather/scatter hint instructions before it gives an additional benefit. With the option described here, the compiler generates an alternate gather/scatter code sequence that has these properties.

For example, if the -opt-gather-scatter-unroll=3 option is specified, the compiler replaces the default sequence with the following unrolled version, preceded by two gather/scatter hint instructions:
gather hint
gather hint
nop
L1:
gather
jkz L2
gather
gather
gather
jknz L1
L2:

The value of N that gives the best performance is data-dependent. When the gather/scatter accesses data in a small number of cache lines (say 1 or 2), the default sequence (using a small value of N) works best. When each individual data item falls in a different cache line, a large value of N may be better.
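To get a rough feel for which case applies to your data, the following sketch emulates a mask-driven gather loop in plain C and counts how many passes one vector of indices needs. It rests on several assumptions not stated in this article: 16 lanes per vector, 64-byte cache lines, a cache-line-aligned base address, non-negative indices, and a model in which each pass services every pending lane that shares a cache line with the first pending lane. The name gather_passes is hypothetical.

```c
#include <stdint.h>

#define LANES      16   /* assumed: 16 32-bit lanes per 512-bit vector */
#define LINE_BYTES 64   /* assumed: 64-byte cache lines */

/* Returns the number of passes the modeled gather loop makes for one
 * vector of indices, which equals the number of distinct cache lines
 * touched. Assumes a line-aligned base and non-negative indices. */
int gather_passes(const int32_t idx[LANES], uint32_t elem_bytes)
{
    uint32_t mask = 0xFFFFu;   /* one pending bit per lane */
    int passes = 0;

    while (mask != 0) {
        /* Find the first lane that is still pending. */
        int lead = 0;
        while (!((mask >> lead) & 1u))
            ++lead;
        int64_t line = ((int64_t)idx[lead] * elem_bytes) / LINE_BYTES;

        /* Service every pending lane that falls in the same cache line. */
        for (int lane = lead; lane < LANES; ++lane) {
            if (((mask >> lane) & 1u) &&
                ((int64_t)idx[lane] * elem_bytes) / LINE_BYTES == line)
                mask &= ~(1u << lane);
        }
        ++passes;
    }
    return passes;
}
```

Under this model, with 4-byte elements, a unit-stride index vector needs 1 pass and a stride of 16 elements needs 16. Counts of 1 or 2 correspond to the few-cache-lines case described above, while counts near 16 correspond to the case where a larger N may help. Treat the result only as a starting point and confirm it by timing the application with different values of N.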

Takeaways

The gatherhint/scatterhint instructions and unrolling of gather/scatter loops are useful for codes with non-unit stride memory accesses, and codes using indirect addressing through pointers or index arrays. Use the compiler option above to tune your application.

Next Steps

It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ coprocessors. The paths provided in this guide reflect the steps necessary to get the best possible application performance.

