Recipe: Building and Optimizing the Hogbom Clean Benchmark for Intel® Xeon Phi™ Coprocessors

Overview

This article provides a recipe for compiling and running the Hogbom Clean benchmark for the Intel® Xeon Phi™ coprocessor and discusses the various optimizations applied to the code.

Introduction

Hogbom Clean is a part of the ASKAP benchmark package. The ASKAP benchmark package is used to benchmark a variety of platforms for the Australian SKA Pathfinder (ASKAP) Science Data Processor. The Hogbom Clean (tHogbomClean) benchmark implements the kernel of the Hogbom Clean deconvolution algorithm.

Preliminaries

1)   This recipe assumes that you are using a system equipped with an Intel Xeon Phi coprocessor. If they are not already present, install the Intel® Manycore Platform Software Stack (Intel® MPSS) and the Intel® C++ compiler 13.1 or higher on your host system.
2)   Download the ASKAP benchmarks from: https://github.com/ATNF/askap-benchmarks
3)   Running the benchmark requires a point spread function (PSF) image and a dirty image (the image to be cleaned) in the work directory. These can be downloaded from:
•   http://www.atnf.csiro.au/people/Ben.Humphreys/dirty.img
•   http://www.atnf.csiro.au/people/Ben.Humphreys/psf.img

Compiling and running on the Intel Xeon Phi Coprocessor

1)   Set up the compiler environment:
$ source /opt/intel/composer_xe_2013_sp1/bin/compilervars.sh intel64
2)   Unpack the source code and build the executables for the Intel Xeon Phi coprocessor
$ unzip askap-benchmarks-master.zip
$ cd askap-benchmarks-master/tHogbomCleanMIC/
$ make clean
$ make
3)   Because the benchmark offloads work to the coprocessor, execution begins on the host. Ensure the PSF image and the dirty image are present in the work directory, then run the benchmark from the host:
$ ./tHogbomCleanMIC

Modifications and Optimizations

Several changes were made to the OpenMP version of the benchmark so that it runs on the Intel® Many Integrated Core (Intel® MIC) Architecture and performs well there. The OpenMP version and the Intel MIC architecture version of the code can be found in the tHogbomCleanOMP and tHogbomCleanMIC directories, respectively, in the GitHub repository. The benchmark uses the Intel® Xeon Phi™ coprocessor in offload mode, in which the host offloads a portion of the work to the coprocessor. To enable offloading, various functions in the benchmark were decorated with __declspec(target(mic)) to inform the compiler that these functions are intended for use on the coprocessor. In addition, STL vectors were replaced with simple arrays.
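The decoration described above can be sketched as follows. This is a minimal illustration, not the benchmark's code: the names TARGET_MIC, sumAbsSketch and runOnCoprocessorSketch are hypothetical, and the sketch assumes that the Intel compiler predefines __INTEL_OFFLOAD when offload compilation is enabled. With any other compiler the macro expands to nothing and the function simply runs on the host.

```cpp
#include <cassert>
#include <cmath>

// Assumption: __INTEL_OFFLOAD is predefined by the Intel compiler when
// offload is enabled. Elsewhere the decoration is compiled away.
#ifdef __INTEL_OFFLOAD
#define TARGET_MIC __declspec(target(mic))
#else
#define TARGET_MIC
#endif

// Hypothetical kernel marked for use on the coprocessor.
// It works on a plain array rather than an STL vector.
TARGET_MIC float sumAbsSketch(const float* data, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += std::fabs(data[i]);
    }
    return sum;
}

// Hypothetical host-side wrapper: copies the array to the coprocessor,
// runs the kernel there, and copies the scalar result back.
float runOnCoprocessorSketch(const float* data, int n)
{
    assert(data != nullptr && n > 0);
    float result = 0.0f;
#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic) in(data : length(n)) out(result)
#endif
    {
        result = sumAbsSketch(data, n);
    }
    return result;
}
```

Calling runOnCoprocessorSketch on a small array returns the sum of absolute values whether or not a coprocessor is present, which makes it easy to compare host and offload builds.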

Three primary optimizations were applied to the subtractPSF and findPeak functions. The first two optimizations help the compiler vectorize the code, whereas the third focuses on eliminating critical sections. The details of the three optimizations are discussed in the following sections. We encourage you to compare the OpenMP and Intel MIC architecture versions to better understand the optimizations.

subtractPSF – Simplifying the loop index

Most of the computation in this function is concentrated in its two for loops. Simply expanding the macros and simplifying the loop index is enough for the compiler to vectorize the loop.
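A rough sketch of the idea follows. It is not the benchmark's subtractPSF: the function name, the parameters, and the assumption that both images are square with the same width are all hypothetical. The point it illustrates is that computing the row base pointers once per row leaves a plain unit-stride index in the inner loop, which is a form compilers can usually vectorize.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: subtract a scaled PSF from a residual image so that
// the PSF peak lines up with the residual peak. Both images are assumed to
// be square, row-major, and width x width in size.
void subtractPsfSketch(const float* psf, float* residual, int width,
                       int peakX, int peakY, int psfPeakX, int psfPeakY,
                       float scale)
{
    assert(psf != nullptr && residual != nullptr && width > 0);

    // Shift of the PSF relative to the residual.
    const int dx = peakX - psfPeakX;
    const int dy = peakY - psfPeakY;

    // Overlapping window, expressed in residual coordinates.
    const int x0 = std::max(0, dx);
    const int x1 = std::min(width, width + dx);
    const int y0 = std::max(0, dy);
    const int y1 = std::min(width, width + dy);

    for (int y = y0; y < y1; ++y) {
        // Row base pointers are computed once per row...
        float* resRow = residual + static_cast<std::size_t>(y) * width;
        const float* psfRow = psf + static_cast<std::size_t>(y - dy) * width;
        // ...so the inner loop is a simple unit-stride update.
        for (int x = x0; x < x1; ++x) {
            resRow[x] -= scale * psfRow[x - dx];
        }
    }
}
```

Compared with recomputing a two-dimensional position-to-index mapping on every iteration, this form exposes the contiguous memory access pattern directly to the compiler.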

findPeak – Vectorizing the loop before the critical section

With the code in its current form, the compiler can reuse the result of the fabsf function call (reducing the actual number of function calls) and can also recognize the indexed max idiom, which it is then able to vectorize.
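The indexed max idiom can be sketched as below. The struct and function names are hypothetical and this is not the benchmark's findPeak; it only shows the loop shape, where the absolute value is computed once per element and both the running maximum and its index are updated under a single comparison.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical result type: the signed peak value and its position.
struct PeakSketch {
    float value;
    std::size_t index;
};

// Hypothetical sketch of an indexed-max loop over a flat image array.
PeakSketch findPeakSketch(const float* image, std::size_t size)
{
    assert(image != nullptr && size > 0);

    float bestAbs = 0.0f;
    std::size_t bestIdx = 0;
    for (std::size_t i = 0; i < size; ++i) {
        // The absolute value is computed once and reused for both
        // the comparison and the update.
        const float a = std::fabs(image[i]);
        if (a > bestAbs) {
            bestAbs = a;
            bestIdx = i;
        }
    }
    return PeakSketch{image[bestIdx], bestIdx};
}
```

For the array {1, -5, 3, 4.5}, the sketch reports index 1 with the signed value -5, since the comparison is on absolute values.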

findPeak – Eliminating the critical section

Critical sections in parallel code are detrimental to the performance of the code. This effect is further amplified for the Intel Xeon Phi coprocessor due to the large number of threads and makes it imperative to reduce the number of critical sections or eliminate them completely, if possible. In this case, it is possible to completely eliminate the critical section and replace it with a serial loop as demonstrated by the code.
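One way to replace a critical section with a serial loop is sketched below, under stated assumptions: the names are hypothetical, the code is not taken from the benchmark, and the OpenMP calls are guarded by _OPENMP so that the sketch also builds without OpenMP, in which case it runs on a single thread. Each thread records its own peak in a private slot of a small array, and the host combines the slots serially after the parallel region.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical per-thread result: largest absolute value and its position.
struct ThreadPeak {
    float absValue;
    std::size_t index;
};

ThreadPeak findPeakNoCritical(const float* image, std::size_t size)
{
    assert(image != nullptr && size > 0);

#ifdef _OPENMP
    const int maxThreads = omp_get_max_threads();
#else
    const int maxThreads = 1;
#endif
    // One slot per thread, so no two threads ever write the same element.
    std::vector<ThreadPeak> partial(maxThreads, ThreadPeak{0.0f, 0});

    #pragma omp parallel
    {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();
#else
        const int tid = 0;
#endif
        ThreadPeak local = {0.0f, 0};
        #pragma omp for schedule(static) nowait
        for (long i = 0; i < static_cast<long>(size); ++i) {
            const float a = std::fabs(image[i]);
            if (a > local.absValue) {
                local.absValue = a;
                local.index = static_cast<std::size_t>(i);
            }
        }
        // No critical section: each thread stores into its own slot.
        partial[tid] = local;
    }

    // Serial loop over the per-thread results replaces the critical section.
    ThreadPeak best = partial[0];
    for (int t = 1; t < maxThreads; ++t) {
        if (partial[t].absValue > best.absValue) {
            best = partial[t];
        }
    }
    return best;
}
```

The serial loop runs over one entry per thread, so its cost is negligible next to the scan of the image, while the threads never contend for a lock.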

Summary

The Hogbom Clean benchmark was ported to run on the Intel Xeon Phi coprocessor in offload mode. Three key optimizations were applied to the OpenMP version of the benchmark to improve its performance on the coprocessor.
