
The Chronicles of Phi - part 3 Hyper-Thread Phalanx – tiled_HT1 continued


The prior part (2) of this blog provided a header and a set of functions that can be used to determine the logical core and the logical Hyper-Thread number within the core. This determination is used in an optimization strategy called the Hyper-Thread Phalanx.

The term phalanx is derived from a military formation used by the ancient Greeks and Romans. The formation generally involved soldiers lining up shoulder to shoulder, shield to shield, multiple rows deep. The formation would advance in unison, becoming “an irresistible force”. I use the term Hyper-Thread Phalanx to refer to the Hyper-Thread siblings of a core being aligned shoulder to shoulder and advancing forward.

Note: the Hyper-Thread Phalanx code provided in part 2 of this blog allows you to experiment with different thread-teaming scenarios. We intend to run with 180 threads (3 threads per core) and 240 threads (4 threads per core).

Additionally, the code also works on the host processor(s) with 2 threads per core (as well as 1 thread per core, should you disable HT).
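If you prefer not to pull in the CPUID-based functions from part 2, a minimal alternative sketch of the same thread-local bookkeeping can be built with OpenMP alone, assuming KMP_AFFINITY=compact so that consecutive OpenMP thread numbers are Hyper-Thread siblings on the same core (the names nHTs, nCores, myCore, and myHT match the variables used by the code below; the function name init_team is hypothetical):

    #include <omp.h>

    static int nHTs;             // threads per core (width of the phalanx)
    static int nCores;           // number of cores in the team
    static __thread int myCore;  // this thread's logical core number
    static __thread int myHT;    // this thread's HT number within its core

    // Call once from inside the parallel region, before the compute loops.
    // Assumes KMP_AFFINITY=compact: threads 0..nHTs-1 share core 0, etc.
    // (nHTs and nCores are written redundantly by every thread, but all
    // threads write the same values.)
    static void init_team(int threadsPerCore)
    {
        nHTs   = threadsPerCore;
        nCores = omp_get_num_threads() / threadsPerCore;
        myCore = omp_get_thread_num() / threadsPerCore;
        myHT   = omp_get_thread_num() % threadsPerCore;
    }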

The Hyper-Thread Phalanx can be attained by a relatively simple loop hand partitioning technique:

Code for main computation function in tiled_HT1:

void diffusion_tiled(REAL *restrict f1, REAL *restrict f2, int nx, int ny, int nz,
              REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
              REAL cb, REAL cc, REAL dt, int count) {
  #pragma omp parallel
  {
    REAL *f1_t = f1;
    REAL *f2_t = f2;

    // number of Squads (singles/doublets/triplets/quadruples) across z dimension
    int nSquadsZ = (nz + nHTs - 1) / nHTs;
    // number of full (and partial) singles/doublets/triads/quads on z-y face
    int nSquadsZY = nSquadsZ * ny;
    int nSquadsZYPerCore = (nSquadsZY + nCores - 1) / nCores;
    // Determine this thread's triads/quads (TLS init setup myCore and myHT)
    int SquadBegin = nSquadsZYPerCore * myCore;
    int SquadEnd = SquadBegin + nSquadsZYPerCore; // 1 after last Squad for core
    if(SquadEnd > nSquadsZY)
      SquadEnd = nSquadsZY; // truncate if necessary
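    // Worked example with hypothetical sizes nz = ny = 256, nHTs = 4,
    // nCores = 60:
    //   nSquadsZ         = (256 + 3) / 4     = 64
    //   nSquadsZY        = 64 * 256          = 16384
    //   nSquadsZYPerCore = (16384 + 59) / 60 = 274
    // so core 0 works Squads [0, 274), and core 59 gets [16166, 16440),
    // truncated above to [16166, 16384).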

    // benchmark timing loop
    for (int i = 0; i < count; ++i) {
      // restrict current thread to its subset of Squads on the Z/Y face.
      for(int iSquad = SquadBegin; iSquad < SquadEnd; ++iSquad) {
        // home z for 0'th team member for next Squad
        int z0 = (iSquad / ny) * nHTs;
        int z = z0 + myHT;  // z for this team member
        int y = iSquad % ny;
        // last double/triad/quad along z may be partially filled
        // assure we are within z
        if(z < nz)
        {
            // determine the center cells and cells about the center
            int x = 0;
            int c, n, s, b, t;
            c =  x + y * nx + z * nx * ny;
            n = (y == 0)    ? c : c - nx;
            s = (y == ny-1) ? c : c + nx;
            b = (z == 0)    ? c : c - nx * ny;
            t = (z == nz-1) ? c : c + nx * ny;
            // c runs through x, n and s through y, b and t through z
            // x=0 special (no f1_t[c-1])
            f2_t[c] = cc * f1_t[c] + cw * f1_t[c] + ce * f1_t[c+1]
                + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
            // interior x's faster
#pragma noprefetch
#pragma simd  
            for (x = 1; x < nx-1; x++) {
              ++c;
              ++n;
              ++s;
              ++b;
              ++t;
              f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c+1]
                  + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
            } // for (x = 1; x < nx-1; x++)
            // final x special (no f1_t[c+1])
            ++c;
            ++n;
            ++s;
            ++b;
            ++t;
            f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c]
                + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
        } // if(z < nz)
      } // for(int iSquad = SquadBegin; iSquad < SquadEnd; ++iSquad)
// barrier required because we removed implicit barrier of #pragma omp for collapse(2)
          #pragma omp barrier
#if defined(VERIFY)
          #pragma omp master
          diffusion_baseline_verify(f1_t, f2_t, nx, ny, nz,
                   ce, cw, cn, cs, ct,
                   cb, cc);
          #pragma omp barrier
#endif

      REAL *t = f1_t;
      f1_t = f2_t;
      f2_t = t;
    } // count
  } // parallel
  return;
}

Effectively, we removed 5 lines of code relating to the OpenMP loop control and added 13 lines for hand partitioning (a net difference of 8 lines of code). This tiled_HT1 code also changed how blocking was performed: given the data flow, explicit blocking was removed in favor of streaming the data and relying on hardware prefetching.
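For reference, here is a minimal sketch of the loop control that was removed (reconstructed from the barrier comment in the code above, not the verbatim original): the tiled version let OpenMP partition the z/y iteration space with a collapse(2) work-sharing construct, which also supplied the implicit barrier at the end of each sweep.

    #pragma omp for collapse(2)
    for (int z = 0; z < nz; z++) {
      for (int y = 0; y < ny; y++) {
        // ... process row (z, y) with the same x loops as above ...
      }
    }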

Making three runs of each problem size with the original tiled code and the newer tiled_HT1 code, we find:

export KMP_AFFINITY=scatter
export OMP_NUM_THREADS=180

./diffusion_tiled_xphi
118771.945  123131.672  122726.906   average 121543.508

./diffusion_tiled_Large_xphi
114972.258  114524.977  116626.805   average 115374.680

export KMP_AFFINITY=compact
unset OMP_NUM_THREADS

./diffusion_tiled_HT1_xphi
134904.891  131310.906  133888.688   average 133368.162

./diffusion_tiled_HT1_Large_xphi
118476.734  118078.930  118157.188   average 118237.617

The number on the right is the average of the three runs (all figures are in MFlops). The ganging strategy showed some improvement in the small model, but not a similar improvement in the large model. Furthermore, the small model improvement was not as large as anticipated. The chart including the tiled_HT1 code:

[chart: performance of tiled vs. tiled_HT1 for the small and large models]

The improvement to the small model looks good, but something isn’t right with the large model. Let’s discover what it is.

Sidebar:

The earlier draft of this article progressed making optimizations from here. However, discoveries made later caused me to revisit this code. At this point it is important for me to take you on a slight divergence so that you can learn from my experience.

My system is dual boot for the host: I have both CentOS Linux and Windows 7 installed, on different hard drives. I was curious to see whether the choice of host operating system had any impact on natively run coprocessor code. The expectation was that there should be no noticeable difference.

I configured the environment variables to run the 3-wide phalanx and ran tests with both the CentOS host and the Windows 7 host (the chart above is for the 4-wide phalanx). To my surprise, running the same code (the same file, in fact) hosted on Windows and comparing against the results hosted on Linux, the relative performance figures were reversed! What was faster with the Linux host was slower with the Windows host, and what was slower became faster. This didn’t make sense.

One of the thoughts that came to mind was that there might be a memory alignment issue between the allocations of the arrays; this has been my experience on Intel64 and IA32 platforms. So I added a printf to display the addresses of the buffers. The two programs, tiled and tiled_HT, both had 16-byte alignment and approximately the same offset within the page, so data alignment differences could not be the cause. Curiously, adding the printf of the two buffer addresses flipped the performance figures again.
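The diagnostic was nothing more than a print of the two pointers, along the lines of this sketch (the exact format string is an assumption; the output below shows its effect):

    printf("f1 = %p f2 = %p\n", (void *)f1, (void *)f2);

The results of the runs were: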

[Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
f1 = 0x7fbbf3945010 f2 = 0x7fbbef944010  printf
FLOPS        : 122157.898 (MFlops) With printf
[Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
f1 = 0x7ffaf7c0b010 f2 = 0x7ffaf3c0a010  printf
FLOPS        : 123543.602 (MFlops) With printf
[Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
f1 = 0x7f3afa480010 f2 = 0x7f3af647f010  printf
FLOPS        : 123908.375 (MFlops) With printf
Average with printf: 123203.292 MFlops

[Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
FLOPS        : 114380.531 (MFlops) Without printf
[Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
FLOPS        : 121105.062 (MFlops) Without printf
[Jim@Thor-mic0 tmp]$ ./diffusion_tiled_xphi
FLOPS        : 116298.797 (MFlops) Without printf
Average without printf: 117261.463 MFlops
With printf: +5941.829 MFlops (+5.1% over without printf)

Clearly there is a shift of about 5% in performance due purely to code shift (a shift in the load position of the code).

What this means is that the relative performance difference measured between the original tiled code and the newer tiled_HT code (earlier versions) may be completely obscured by the fortuitous placement, or lack thereof, of the code. One version might be +5% and the other -5%, yielding a comparative uncertainty of up to 10%. This difference is specific to this particular sample of code; do not assume that all code exhibits this much performance variation due to code placement. However, this experience suggests that you be vigilant in your tuning process and look for code alignment issues.
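One inexpensive way to keep an eye on this while tuning is to log the load address of the hot function along with each timing run, so that a shift in code placement is visible in the data (a minimal sketch; casting a function pointer to void * is a common, if not strictly portable, idiom):

    printf("diffusion_tiled loaded at %p\n", (void *)&diffusion_tiled);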

End sidebar

Added to the above concerns, the expected performance improvement was not observed: we achieved only a 9.7% improvement for the small model and a 2.5% improvement for the large model.

Applying Occam’s razor: if we did not observe an increase in the L1 hit ratio, it didn’t happen.

The estimated improvement in the L1 hit ratio was based on the amount of data we calculated should be in the L1 cache for the specified algorithm (each figure below has the form rows x nx elements x 4 bytes per REAL, with nx = 256 for the small data set):

Three-wide, small data: 8 x 256 x 4 = 8KB low side, 11 x 256 x 4 = 11KB high side
Four-wide, small data: 10 x 256 x 4 = 10KB low side, 14 x 256 x 4 = 14KB high side

Both calculations indicate plenty of room in the 32KB L1 cache. Something else must be going on; it will need some investigation (deferred to another time).

Regardless of this, performance is in the range of 133 GFlops, far short of the capability of the Intel® Xeon Phi™ coprocessor.

Now I must ask myself:

What is non-optimal about this strategy?
And: What can be improved?

You can think about it while you wait for part 4.

Jim Dempsey
Consultant
QuickThread Programming, LLC

- Part 1 - The Chronicles of Phi - part 1 The Hyper-Thread Phalanx

- Part 2 - The Chronicles of Phi - part 2 Hyper-Thread Phalanx – tiled_HT1


  • Jim Dempsey
  • Intel Xeon Phi Coprocessor
  • Intel® Many Integrated Core Architecture
  • Intel® C++ Compiler
  • Intel® C++ Composer XE
  • Intel® Parallel Studio XE
  • Intel® Advanced Vector Extensions
  • OpenMP*
  • C/C++
  • Server
  • Desktop
  • UX
  • Developers
  • Professors
  • Students
  • Linux*
  • Windows*
  • Microsoft Windows* (XP, Vista, 7)
  • Microsoft Windows* 8
