Gather-scatter instructions are not always the optimal choice when you are trying to achieve the best possible performance on the Intel® Xeon Phi™ coprocessor. However, if your code uses indirect addressing or performs non-unit strided memory accesses, gather-scatter instructions may be the best option. This document describes the operation of gather-scatter instructions and introduces -opt-gather-scatter-unroll, a new compiler switch added to the Intel® C++ Compiler 14.0 and the Intel® Fortran Compiler 14.0 that may improve the performance of certain codes that use gather-scatter instructions.
Understanding Gather-Scatter instructions
Before we dive into the details of how and when to use the compiler switch and why it even works, let us review the operation of gather-scatter instructions.
Consider the VGATHERDPS instruction which gathers a Float32 vector using signed Dword indices. The inputs and outputs of the VGATHERDPS instruction are shown below:
Inputs                           | Outputs
---------------------------------|-----------------------------
BASE_ADDR: Base Address          | ZMM: Float32 output vector
VINDEX: Doubleword Index Vector  | K: Write Mask
SCALE: Scaling factor            |
K: Write Mask                    |
Operation:
The write mask register, also known as the source mask, plays an important role in the operation of the gather. The mask register has one bit corresponding to each element that can be held by the output vector register. The vector elements whose corresponding mask bit is set are known as the active elements. In a gather instruction, data is loaded from memory into the vector register only for the active elements of the vector.
On a given invocation of a gather instruction, the following operations are performed (a pseudocode sketch of these steps follows the list):
- At least one active element, starting from the one corresponding to the least significant set bit in the source mask, is selected.
- The 64-byte block of memory (one cache line) containing the selected element is accessed; every invocation of the instruction accesses exactly 64 bytes of memory.
- If multiple data elements required by the gather sequence are present in the 64 bytes then all the corresponding elements of the output register are updated with the values from the accessed memory.
- The mask bits in the write mask register corresponding to the updated elements in the output register are reset to 0.
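Taken together, these steps can be expressed as the C-style sketch of a single gather invocation shown below. This is only an illustration of the behavior described above, not the actual hardware algorithm, and the names (gather_step, out, index, mask, base) are placeholders introduced here for clarity.

#include <cstdint>

// Illustrative sketch of one VGATHERDPS invocation (not the real hardware
// implementation). 'out' and 'index' model the ZMM registers, 'mask' models
// the 16-bit write mask k, and 'base' is the base address.
static void gather_step(float out[16], const int32_t index[16],
                        uint16_t &mask, const float *base)
{
    if (mask == 0) return;                        // gather already complete

    // Select the least significant active element and identify the 64-byte
    // block of memory (cache line) that holds its data.
    int first = 0;
    while (!(mask & (1u << first))) ++first;
    uintptr_t line = ((uintptr_t)(base + index[first])) & ~(uintptr_t)63;

    // Load every active element whose data lies in that same 64-byte block,
    // and clear its bit in the write mask.
    for (int i = 0; i < 16; ++i) {
        if (!(mask & (1u << i))) continue;
        uintptr_t addr = (uintptr_t)(base + index[i]);
        if ((addr & ~(uintptr_t)63) == line) {
            out[i] = base[index[i]];
            mask &= ~(1u << i);
        }
    }
}

A complete gather sequence simply repeats this step until the mask becomes zero, which is exactly the re-trigger loop discussed next.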
Usage in code:
Note that during a single invocation of the gather instruction, only 64 bytes of memory are accessed and the output register is updated with only those elements that are present in these 64 bytes. Since gather instructions are generally used to fetch data elements that are dispersed in memory, it is unlikely that all the required data elements will be present in the accessed 64 bytes of memory. Hence, it is generally necessary to execute the gather instruction in a loop until all the required data elements have been loaded into the output register. As each data element is gathered, the corresponding bit in the write mask register is cleared. The completion of the gather sequence is therefore signaled by the write mask register becoming zero, i.e. when all of its bits have been cleared.
Due to the special behavior of the mask register, it can be used to allow conditional looping of the gather instruction until all the elements from a given write mask have been successfully loaded. A typical usage of a gather instruction along with the re-trigger loop is shown below:
#zmm0 - Index register
#r8   - Base Address
#k1   - Mask register
#zmm6 - Output register
#4    - Scale
..L10:
        vgatherdps (%r8,%zmm0,4), %zmm6{%k1}
        jkzd       ..L9, %k1
        vgatherdps (%r8,%zmm0,4), %zmm6{%k1}
        jknzd      ..L10, %k1
..L9:
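In practice this assembly is emitted by the compiler, either from vectorized C/C++ loops or from gather intrinsics. As a rough sketch, assuming the _mm512_mask_i32gather_ps intrinsic form found in the Intel compiler headers (the exact prototype may differ between the Xeon Phi and later AVX-512 targets), a single masked gather can be requested as follows, with the compiler generating the re-trigger loop automatically:

#include <immintrin.h>

// Sketch: gather 16 floats from 'base' using 32-bit indices, under mask 'k'.
// Lanes whose mask bit is clear keep their value from 'old'.
static __m512 gather16(const float *base, __m512i indices,
                       __mmask16 k, __m512 old)
{
    return _mm512_mask_i32gather_ps(old, k, indices, base, 4 /* scale */);
}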
Extending the concept to scatter instructions:
The operation of the scatter instructions is very similar to that of the corresponding gather instructions; the only difference is that the instruction stores data elements to memory instead of loading them into an output vector register. As with the gather instructions, only those elements for which the corresponding mask bits are set are stored to memory.
As with the gather instruction, memory is accessed at a granularity of 64 bytes, so all the active elements whose destination addresses fall within a given 64-byte block are “scattered” in a single invocation of the scatter instruction. Again, a scatter instruction will generally be re-executed (i.e. re-triggered) until all the required data has been stored to memory.
A typical usage of a scatter instruction along with the re-trigger loop is shown below:
#zmm6 - Index register
#r12  - Base Address
#k3   - Mask register
#zmm7 - Input register
#4    - Scale
..L14:
        vscatterdps %zmm7, (%r12,%zmm6,4){%k3}
        jkzd       ..L13, %k3
        vscatterdps %zmm7, (%r12,%zmm6,4){%k3}
        jknzd      ..L14, %k3
..L13:
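Once the re-trigger loop above has run to completion, the net effect of the scatter sequence is equivalent to the following scalar store loop. This is an illustrative sketch; the function and variable names are placeholders introduced here.

#include <cstdint>

// Scalar equivalent of a completed VSCATTERDPS sequence: every element of
// 'src' whose mask bit was set has been stored to base[index[i]].
static void scatter_equivalent(const float src[16], const int32_t index[16],
                               uint16_t mask, float *base)
{
    for (int i = 0; i < 16; ++i)
        if (mask & (1u << i))
            base[index[i]] = src[i];
}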
“-opt-gather-scatter-unroll” compiler switch:
As shown in the above two examples, gather-scatter instructions are generally followed by a conditional branching instruction which is used to test whether the gather-scatter is complete and re-trigger the instruction if it is not. However, this conditional branching is an added overhead to the execution of the program.
In certain cases, a programmer may know that in their application most gather-scatter sequences require N invocations to complete. In such cases, the programmer can use the -opt-gather-scatter-unroll=N compiler switch to hint the compiler to unroll the re-trigger loop N times, thereby alleviating the overhead caused by the branching instructions.
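For example, a native Xeon Phi build of a source file could be compiled as shown below. The file name is a placeholder, -mmic requests native coprocessor code generation, and the unroll factor of 4 is an arbitrary value chosen for illustration:

icpc -mmic -O3 -opt-gather-scatter-unroll=4 indirect.cpp -o indirect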
Consider the following code which uses indirect addressing:
int *arr1;
float *arr2;
…
…
#pragma ivdep
for(int i=0;i<SIZE;i++)
{
    arr2[arr1[i]]++;
}
In this example, since the compiler has no prior information about the order in which the data elements of arr2 will be accessed, it generates gather-scatter instructions to load the different data elements into a vector register and to store those data elements back to memory once they have been incremented. Without the switch, the code generated by the compiler will look similar to the following:
..L32:                                          #29.3
        vgatherdps (%r12,%zmm5,4), %zmm4{%k1}   #29.3
        jkzd       ..L31, %k1                   #29.3
        vgatherdps (%r12,%zmm5,4), %zmm4{%k1}   #29.3
        jknzd      ..L32, %k1                   #29.3
..L31:                                          #
        vpaddd     %zmm3, %zmm4, %zmm6          #29.3 c17
        nop                                     #29.3 c21
..L34:                                          #29.3
        vscatterdps %zmm6, (%r12,%zmm5,4){%k2}  #29.3
        jkzd       ..L33, %k2                   #29.3
        vscatterdps %zmm6, (%r12,%zmm5,4){%k2}  #29.3
        jknzd      ..L34, %k2                   #29.3
Now suppose we know, based on prior information, that the data elements are widely scattered in memory and will, in general, need 16 invocations of the gather-scatter instructions. In this case, we can add the “-opt-gather-scatter-unroll=8” switch to the compiler invocation; we use an unroll factor of 8 because this is the largest unroll factor allowed by the compiler. The compiler will unroll the re-trigger loop, alleviating the overhead incurred due to the branching instructions. The unrolled code generated by the compiler with the “-opt-gather-scatter-unroll=8” switch is shown below:
..L28:                                            #
        vgatherdps (%r12,%zmm11,4), %zmm10{%k7}   #29.3
        jkzd       ..L27, %k7                     #29.3
        vgatherdps (%r12,%zmm11,4), %zmm10{%k7}   #29.3
        vgatherdps (%r12,%zmm11,4), %zmm10{%k7}   #29.3
        vgatherdps (%r12,%zmm11,4), %zmm10{%k7}   #29.3
        vgatherdps (%r12,%zmm11,4), %zmm10{%k7}   #29.3
        vgatherdps (%r12,%zmm11,4), %zmm10{%k7}   #29.3
        vgatherdps (%r12,%zmm11,4), %zmm10{%k7}   #29.3
        vgatherdps (%r12,%zmm11,4), %zmm10{%k7}   #29.3
        vgatherdps (%r12,%zmm11,4), %zmm10{%k7}   #29.3
        jknzd      ..L28, %k7                     #29.3
..L27:                                            #
        vpaddd     %zmm3, %zmm10, %zmm12          #29.3 c201
        nop                                       #29.3 c205
        vscatterpf0hintdps (%r12,%zmm11,4){%k1}   #29.3 c209
        vscatterpf0hintdps (%r12,%zmm11,4){%k1}   #29.3
        nop                                       #29.3
..L30:                                            #
        vscatterdps %zmm12, (%r12,%zmm11,4){%k1}  #29.3
        jkzd       ..L29, %k1                     #29.3
        vscatterdps %zmm12, (%r12,%zmm11,4){%k1}  #29.3
        vscatterdps %zmm12, (%r12,%zmm11,4){%k1}  #29.3
        vscatterdps %zmm12, (%r12,%zmm11,4){%k1}  #29.3
        vscatterdps %zmm12, (%r12,%zmm11,4){%k1}  #29.3
        vscatterdps %zmm12, (%r12,%zmm11,4){%k1}  #29.3
        vscatterdps %zmm12, (%r12,%zmm11,4){%k1}  #29.3
        vscatterdps %zmm12, (%r12,%zmm11,4){%k1}  #29.3
        vscatterdps %zmm12, (%r12,%zmm11,4){%k1}  #29.3
        jknzd      ..L30, %k1                     #29.3
..L29:
Note that the -opt-gather-scatter-unroll compiler switch may not provide a performance improvement for all codes. It primarily targets codes that use indirect addressing or perform non-unit strided memory accesses.
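For reference, the sketch below shows the other common pattern, a non-unit strided read, that typically causes the compiler to emit gather instructions in the first place. The function name and stride are placeholders, and whether gathers are actually generated depends on the stride, data type, and compiler version:

// Consecutive loop iterations touch elements that are 'stride' apart in
// memory, so the vectorizer may use a gather instead of a contiguous load.
static float sum_strided(const float *a, int n, int stride)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i * stride];
    return sum;
}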