Best Known Method: Avoid heterogeneous precision in control flow calculations

Best Known Method

An MPI program running in symmetric mode on an Intel® Xeon® host and an Intel® Xeon Phi™ coprocessor may deadlock in specific cases due to heterogeneous precision in replicated control-flow calculations. The advice is to determine the control flow on only one master MPI process.

The Issue

Intel MPI applications can be executed on multiple combinations of Intel Xeon processors and Intel Xeon Phi coprocessors. Four different execution models are supported:

  • Host-only: All MPI processes are running only on Intel Xeon hosts.
  • Coprocessor-only: All MPI processes are running only on Intel Xeon Phi coprocessors. The Intel Xeon hosts to which the coprocessors are attached are not used.
  • Symmetric: Some MPI processes are executing on Intel Xeon hosts and some on Intel Xeon Phi coprocessors.
  • MPI+Offload: All MPI processes are running only on Intel Xeon hosts, but the processes are offloading parts of the execution to the coprocessor.

The host-only and coprocessor-only modes both execute on homogeneous architectures, i.e. either all host nodes or all coprocessors. However, running an MPI program in symmetric mode on an Intel Xeon host and an Intel Xeon Phi coprocessor (or on a cluster of such nodes) constitutes a heterogeneous system because the two architectures differ in computation and communication speed. An obvious concern in such a setup is newly introduced load-balancing issues that do not occur when an already load-balanced code runs on a homogeneous cluster.

But we should also be aware of potential issues due to the heterogeneous precision of floating-point computations when combining Intel Xeon processors and Intel Xeon Phi coprocessors. The precision differences in general are discussed at http://software.intel.com/en-us/articles/differences-in-floating-point-arithmetic-between-intel-xeon-processors-and-the-intel-xeon. Reasons for them include different vector lengths, the use of fused multiply-add (FMA) instructions, and other instructions that are not available on both architectures. Apart from the well-known impact on the precision of calculated results, another problem can show up for symmetric runs under specific circumstances: a deadlock of the MPI program.
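As a small illustration of how an instruction-level difference such as FMA can change a result (a minimal C sketch added here for illustration, not part of the original article; the constants are assumptions chosen only so that the product is not exactly representable): a fused multiply-add evaluates a*b + c with a single rounding, whereas the unfused sequence rounds twice.

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Constants chosen so that the product a*b is not exactly
       representable in double precision. */
    double a = 1.0 + pow(2.0, -29);
    double b = 1.0 + pow(2.0, -30);
    double c = -(1.0 + pow(2.0, -29) + pow(2.0, -30));

    double p = a * b;             /* first rounding */
    double unfused = p + c;       /* second rounding: yields 0.0 */
    double fused = fma(a, b, c);  /* single rounding: yields 2^-59 */

    /* A control-flow check such as (x > 0.0) would branch differently
       on the two results. */
    printf("unfused = %.17e\nfused   = %.17e\n", unfused, fused);
    return 0;
}

If one architecture evaluates the expression with FMA and the other without, the two sides will disagree on a check like (x > 0.0).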

Below is a snippet of pseudo MPI code whose principal structure is often found in iterative algorithms. The MPI program performs some parallel computations and calculates a value that governs the control flow of the program, i.e. that decides when the iterative process will stop. An obvious question is whether the control flow of such an MPI program will always be identical on all processes, independent of the precision of the underlying architecture.

MPI_Init
while ( residual > threshold ) {
     do_MPI_parallel_calculation(…)
     residual = MPI_Allreduce(…)
}
MPI_Finalize

The residual value is identical on all MPI processes because the MPI_Allreduce call broadcasts the final value to all processes. The control flow is governed by the check (residual > threshold), which, for reasons of simplicity, is calculated in a replicated fashion on all MPI processes on the assumption of a homogeneous environment, so that either all processes execute the next while-iteration or all leave the while-loop. For symmetric MPI runs on Intel Xeon processors and Intel Xeon Phi coprocessors, a simple check like (residual > threshold) will give identical results on all MPI processes because most likely there is no precision problem.
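For reference, a compilable C sketch of this replicated pattern might look as follows (a sketch only: the halving of the local value merely stands in for do_MPI_parallel_calculation, and the threshold is an arbitrary placeholder):

#include <mpi.h>

int main(int argc, char **argv) {
    const double threshold = 1.0e-6;  /* placeholder convergence criterion */
    double local = 1.0;               /* this rank's residual contribution */
    double residual = 1.0;            /* global residual, identical on all ranks */

    MPI_Init(&argc, &argv);

    /* The loop condition is evaluated in a replicated way on every rank. */
    while (residual > threshold) {
        local *= 0.5;  /* stands in for do_MPI_parallel_calculation(...) */

        /* MPI_Allreduce gives every rank the same global residual. */
        MPI_Allreduce(&local, &residual, 1, MPI_DOUBLE, MPI_MAX,
                      MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}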

But what will happen if the simple check is replaced by something like (log(residual) > threshold)? Or even by a more complicated calculation from an external library function?

while ( libraryfunction(residual) > threshold ) {
     do_MPI_parallel_calculation(…)
     residual = MPI_Allreduce(…)
}

If the MPI processes on the Intel Xeon host and on the Intel Xeon Phi coprocessor evaluate the result of the library function call differently, a deadlock may occur because some MPI processes will continue the while-loop while the others leave it.

For mathematical functions like log() it is possible to circumvent the problem with a local precision pragma. However, for an external libraryfunction() it might be necessary to build the whole library with identical precision settings for the Intel Xeon processor and the Intel Xeon Phi coprocessor, with a potentially large impact on the overall performance.
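As an illustration only (the exact mechanism is an assumption added here, not taken from the article): one way to localize the precision control is to isolate the check in a small function and request consistent floating-point treatment for it, for example with the float_control pragma of the Intel compiler, or by compiling just that translation unit with math-library consistency options such as -fimf-precision=high -fimf-arch-consistency=true.

#include <math.h>

/* Sketch only: whether float_control, the -fimf-* options, or another
   mechanism removes a given discrepancy depends on its origin and on
   the compiler version. */
#pragma float_control(precise, on, push)
int keep_iterating(double residual, double threshold)
{
    /* Precision-sensitive control-flow check, evaluated with
       precise floating-point semantics. */
    return log(residual) > threshold;
}
#pragma float_control(pop)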

The Solution

A solution for this heterogeneous precision issue is to govern the control flow on only one master MPI process (usually the process with MPI rank 0). The original code can be modified so that the master signals all other processes when to leave the iteration:

goon = TRUE
while (goon) {
     do_MPI_parallel_calculation(…)
     residual = MPI_Reduce(…,ROOT=0,…)
     if (rank == 0) {
          goon = ( libraryfunction(residual) > threshold )
     }
     MPI_Bcast(goon,…,ROOT=0,…)
}

Only the master process with MPI rank 0 decides whether the computation will continue, and it broadcasts this decision to the other processes. In particular, it is now sufficient to compute the residual with MPI_Reduce. The implicit broadcast performed by the original MPI_Allreduce is replaced by the explicit MPI_Bcast of the goon signal. Therefore we can assume that the performance impact of the additional MPI call is low.
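A compilable C sketch of this master-governed pattern might look as follows (again a sketch only: libraryfunction, the halving step, and the threshold are placeholders standing in for the real application code):

#include <mpi.h>

/* Placeholder for the precision-sensitive check; the real code would
   call an external library function here. */
static double libraryfunction(double x) { return x; }

int main(int argc, char **argv) {
    const double threshold = 1.0e-6;  /* placeholder convergence criterion */
    double local = 1.0;               /* this rank's residual contribution */
    double residual = 0.0;            /* global residual, meaningful on rank 0 */
    int rank, goon = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    while (goon) {
        local *= 0.5;  /* stands in for do_MPI_parallel_calculation(...) */

        /* Only the master needs the global residual, so MPI_Reduce is
           sufficient (no implicit broadcast as with MPI_Allreduce). */
        MPI_Reduce(&local, &residual, 1, MPI_DOUBLE, MPI_MAX, 0,
                   MPI_COMM_WORLD);

        if (rank == 0) {
            /* The master alone evaluates the precision-sensitive check. */
            goon = (libraryfunction(residual) > threshold);
        }

        /* Every rank receives the same decision and takes the same branch. */
        MPI_Bcast(&goon, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}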

Deadlock Detection

If an MPI program runs well only on Intel Xeon hosts or only on Intel Xeon Phi coprocessors, but does not finish when run in symmetric mode, the likelihood is high that it ran into a deadlock. One approach to identify this would be to attach debuggers to the processes. However, there is a much simpler method available. The Correctness Checking library from the Intel Trace Analyzer and Collector can be used to identify deadlocks. It is available for Intel Xeon processors and Intel Xeon Phi coprocessors, and can be used in symmetric runs as well. If the collector's environment has been set up in addition to the Intel MPI environment, it is sufficient to simply add the flag "-check" to the mpirun command:

mpirun -check ...

An alternative way to enable the correctness check is to pre-load the libVTmc.so library from the Intel Trace Analyzer and Collector. This approach may be beneficial if the actual mpirun command is hidden in a complicated shell script hierarchy:

export LD_PRELOAD=libVTmc.so
run_script.sh         (or mpirun ... )

In both cases the Correctness Checking library will identify many MPI errors, as described in the documentation. A potential deadlock is reported if all MPI processes are stuck in MPI calls without progress for a certain amount of time. The default timeout is 60 seconds, but it can be configured to the user's needs (a configuration sketch follows the report below). For a deadlock caused by diverging control flows, the output will most likely show that the processes have entered different, non-matching MPI functions. The traceback provides a starting point for finding the point where the control flows diverged. If the code was compiled with the debug flag, the runtime system can generate a report such as this:

[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON

[0] ERROR: GLOBAL:DEADLOCK:HARD: fatal error
[0] ERROR:    Application aborted because no progress was observed for over 0:19 minutes,
[0] ERROR:    check for real deadlock (cycle of processes waiting for data) or
[0] ERROR:    potential deadlock (processes sending data to each other and getting blocked
[0] ERROR:    because the MPI might wait for the corresponding receive).
[0] ERROR:    [0] no progress observed for over 0:19 minutes, process is currently in MPI call:
[0] ERROR:       MPI_Finalize()
[0] ERROR:       main (/home/kdoertel/Documents/GOAT/deadlock/poisson.c:239)
[0] ERROR:       (/lib64/libc-2.12.so)
[0] ERROR:       (/home/kdoertel/Documents/GOAT/deadlock/poisson)

[0] ERROR:    [32] no progress observed for over 0:19 minutes, process is currently in MPI call:
[0] ERROR:       MPI_Send(*buf=0x7f9762073620, count=4802, datatype=MPI_DOUBLE, dest=30, tag=100, comm=MPI_COMM_WORLD)
[0] ERROR:       exchange (/home/kdoertel/Documents/GOAT/deadlock/comm.c:268)
[0] ERROR:       main (/home/kdoertel/Documents/GOAT/deadlock/poisson.c:182)
[0] ERROR:       __libc_start_main (/lib64/libc-2.14.90.so)
[0] ERROR:       (/home/kdoertel/Documents/GOAT/deadlock/poisson.MIC)

[0] INFO: GLOBAL:DEADLOCK:HARD: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.
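As mentioned above, the no-progress timeout defaults to 60 seconds and can be adjusted. Assuming the collector's DEADLOCK-TIMEOUT configuration option is exposed through the usual VT_ environment-variable convention (an assumption; please check the Intel Trace Analyzer and Collector reference for the exact name), raising it to five minutes might look like this:

# assumed variable name, mapped from the DEADLOCK-TIMEOUT config option
export VT_DEADLOCK_TIMEOUT=5m
mpirun -check ...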

Beyond Deadlocks

The deadlock is not the worst case one can think of for symmetric MPI runs, because a deadlock will always be identified. The worst case is when differences in the control flow do not lead to a deadlock, but to wrong results instead.

residual = MPI_Allreduce(…)
if ( libraryfunction(residual) > threshold ) {
     do_A_calculation(…)
} else {
     do_B_calculation(…)
}

The MPI processes on Intel Xeon processors and Intel Xeon Phi coprocessors may execute different calculations depending on the outcome of the check. Once more, the replicated computation of the control flow is the source of the issue. And again, the solution is to govern the control flow on only one master process.

The issue of heterogeneous precision in control-flow calculations was presented above in the context of symmetric MPI runs. However, the problem may also occur without MPI in pure offload codes, in which parts of the calculations are done on the Intel Xeon host and parts are offloaded to the Intel Xeon Phi coprocessor. If the control flow is replicated on Intel Xeon processor threads as well as on Intel Xeon Phi coprocessor threads, the heterogeneous precision issue may show up in some form for such offload codes too.

