For Intel® System Studio 2015, see the corresponding article.
< Overview >
In this article, we enable and use Intel® Integrated Performance Primitives (IPP), Intel® Threading Building Blocks (TBB) and Intel® C++ Compiler (ICC) on Linux (Ubuntu 14.04 LTS 64-bit). We build and run one of the examples that ships with IPP, then apply TBB and ICC to it to observe the performance improvement from these Intel® System Studio features.
The Intel® System Studio (ISS) release used for this article is Intel® System Studio 2016 Beta Ultimate Edition for Linux Host. The following components of the tool suite are used:
- Intel® Integrated Performance Primitives 9.0 for Linux
- Intel® Threading Building Blocks 4.4
- Intel® C++ Compiler 16.0
This example was tested on a dual-core i5 platform.
< Building the IPP example with TBB libraries and ICC >
STEP 1. Set up the environment variables for IPP, TBB and ICC
IPP, TBB and ICC each need environment variables to work properly. Run the following three commands in the command line to set them. Each command needs the right target architecture: 'ia32' for an IA-32 target or 'intel64' for an Intel® 64 target. For ICC, you also need to give a platform type: 'linux' for a Linux target, 'android' for an Android target or 'mac' for a Mac target. Finally, do not forget the dot and the space ('. ') at the beginning of each command; they make the script run in the current shell, so the variables it sets persist.
- . /opt/intel/compilers_and_libraries_2016.x.xxx/linux/ipp/bin/ippvars.sh <arch type>
- . /opt/intel/compilers_and_libraries_2016.x.xxx/linux/tbb/bin/tbbvars.sh <arch type>
- . /opt/intel/compilers_and_libraries_2016.x.xxx/linux/bin/iccvars.sh -arch <arch type> -platform <platform type>
To verify that the commands ran correctly, type 'printenv' and check that 'IPPROOT' and 'TBBROOT' are listed and point to the IPP and TBB install directories, and that 'PATH' includes '/opt/intel/compilers_and_libraries_2016.x.xxx/linux/bin/<arch type>'. For future use, it is recommended to write a bash script that enables multiple ISS features at once.
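The recommended helper script could look roughly like the sketch below. It is not part of ISS: the function name 'iss_setup' and its argument order are assumptions made for illustration, and the install root is passed in by the caller because the exact '2016.x.xxx' directory name depends on your installation. The file has to be sourced with '. ' (not executed) so that the function, and later the variables, live in your current shell.

```shell
# Sketch: enable IPP, TBB and ICC with one call.
# iss_setup is a hypothetical helper name, not an ISS command.
#   $1 = install root, i.e. your .../compilers_and_libraries_2016.x.xxx/linux
#   $2 = arch type: ia32 or intel64 (default: intel64)
#   $3 = platform type: linux, android or mac (default: linux)
iss_setup() {
    iss_root=$1
    iss_arch=${2:-intel64}
    iss_platform=${3:-linux}

    # Stop early if any of the three setup scripts is missing.
    for iss_script in ipp/bin/ippvars.sh tbb/bin/tbbvars.sh bin/iccvars.sh; do
        if [ ! -r "$iss_root/$iss_script" ]; then
            echo "iss_setup: cannot read $iss_root/$iss_script" >&2
            return 1
        fi
    done

    # Same three commands as in STEP 1, sourced into the current shell.
    . "$iss_root/ipp/bin/ippvars.sh" "$iss_arch"
    . "$iss_root/tbb/bin/tbbvars.sh" "$iss_arch"
    . "$iss_root/bin/iccvars.sh" -arch "$iss_arch" -platform "$iss_platform"

    # Programmatic version of the 'printenv' check.
    if [ -z "$IPPROOT" ] || [ -z "$TBBROOT" ]; then
        echo "iss_setup: IPPROOT or TBBROOT is not set" >&2
        return 1
    fi
}
```

After sourcing the file, a call such as 'iss_setup /opt/intel/compilers_and_libraries_2016.x.xxx/linux intel64 linux' would then replace the three separate commands.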
STEP 2. Find the example
First, we locate the IPP example and prepare to build it with the additional ISS features, TBB and ICC, applied.
When you install ISS 2016 with the default settings, the IPP example archive is located at
/opt/intel/compilers_and_libraries_2016.x.xxx/linux/ipp/examples
There you will find 'ipp-examples_lin.tgz'. Extract the examples wherever you like, but avoid a directory that requires elevated permissions: pick a location where you can work without typing 'sudo', otherwise building the example gets complicated. Then find the 'ipp_resize_mt' example folder, which is the example we use here. Additional documentation is available at '<Extracted Examples>/documentation/ipp-examples.html' once the examples are extracted.
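As a minimal sketch of the extraction step, the function below unpacks the archive into a directory you own and refuses to continue if that directory is not writable. The name 'extract_ipp_examples' and the idea of passing both paths as arguments are assumptions for illustration; only standard 'mkdir' and 'tar' options are used.

```shell
# Sketch: extract the IPP examples into a user-writable directory.
# extract_ipp_examples is a hypothetical helper name.
#   $1 = path to ipp-examples_lin.tgz
#   $2 = destination directory (somewhere that does not need sudo)
extract_ipp_examples() {
    ex_archive=$1
    ex_dest=$2

    mkdir -p "$ex_dest" || return 1

    # Building later needs write access, so check it up front.
    if [ ! -w "$ex_dest" ]; then
        echo "extract_ipp_examples: $ex_dest is not writable" >&2
        return 1
    fi

    # -x extract, -z gunzip, -f archive file, -C change to destination first
    tar -xzf "$ex_archive" -C "$ex_dest"
}
```

A destination under your home directory is a natural choice, since it normally needs no elevated permissions.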
STEP 3. Build the example
To build the example without TBB and ICC, just run 'make' in '<Extracted Examples>/ipp_resize_mt' and save the binary for later comparison. Since the IPP environment has already been set up, the example should build without any problem.
Now we add TBB and ICC to build a faster version of the original example. Comments in the example's 'Makefile' explain how to enable TBB and ICC for the build.
Type 'export CC=icc && export CXX=icpc && export CXXFLAGS=-DUSE_TBB' (all three variables must be exported so that 'make' can see them). Then run 'make' in the 'ipp_resize_mt' folder to build the example.
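The two builds can be kept apart with a small wrapper like the sketch below. It sets the compiler variables inside a subshell, so the TBB/ICC settings do not leak into your interactive shell and the plain build still uses the default compiler. 'build_resize' and 'MAKE_CMD' are hypothetical names introduced here; 'MAKE_CMD' only exists so that the build command can be swapped out, for example for a dry run.

```shell
# Sketch: build the example either as-is or with ICC + TBB.
# build_resize and MAKE_CMD are hypothetical names for this illustration.
#   $1 = path to the ipp_resize_mt folder
#   $2 = "baseline" (default toolchain) or "tbb" (ICC + TBB)
build_resize() {
    br_dir=$1
    br_variant=${2:-baseline}
    (
        # The subshell keeps cd and the exports local to this build.
        cd "$br_dir" || exit 1
        if [ "$br_variant" = tbb ]; then
            # Same settings as the export command in STEP 3.
            export CC=icc CXX=icpc CXXFLAGS=-DUSE_TBB
        fi
        # Intentionally unquoted so MAKE_CMD may carry arguments.
        ${MAKE_CMD:-make}
    )
}
```

For example, 'build_resize <Extracted Examples>/ipp_resize_mt baseline' followed by copying the binary aside, then 'build_resize <Extracted Examples>/ipp_resize_mt tbb'. If objects from the first build are still present, 'make' may consider them up to date, so you may need to remove them before the second build; how to do that depends on the example's Makefile and is not covered here.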
< Simple Performance Comparison >
The IPP example reports its own performance as the average time it spends resizing one image.
The following are the options and arguments that can be used to run the resize sample.
Without TBB, the resize function uses a single thread, so the multiple cores are not fully exploited. The following is the result of running the resize example with the command './ipp_resize_mt -i ../../lena.bmp -r 960x540 -p 1 -T AVX2 -l 5000'. This command means 'resize ../../lena.bmp to 960x540 using the linear interpolation method and AVX2, 5000 times'.
As the result shows, resizing a single image takes about 2.275 ms on average. Given this result, we test the same example with TBB using 2 cores. If TBB has been enabled successfully, the thread option ('-t') appears in the help page.
With TBB, the resize function runs on 2 threads simultaneously. The following is the result of running the resize example with the command './ipp_resize_mt -i ../../lena.bmp -r 960x540 -p 1 -T AVX2 -t 2 -l 5000'.
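To make sure both runs use identical arguments apart from the thread count, the two command lines can be wrapped in a helper such as the sketch below. 'run_resize' and 'RESIZE_BIN' are hypothetical names introduced here; the options themselves are exactly the ones used in this article, and '-t' is only added when more than one thread is requested, because the build without TBB does not offer that option.

```shell
# Sketch: run the resize benchmark with the article's arguments.
# RESIZE_BIN and run_resize are hypothetical names for this illustration.
RESIZE_BIN=${RESIZE_BIN:-./ipp_resize_mt}

#   $1 = number of threads (default 1; values above 1 need the TBB build)
run_resize() {
    rr_threads=${1:-1}

    # Common arguments: input image, target size, linear interpolation, AVX2.
    set -- -i ../../lena.bmp -r 960x540 -p 1 -T AVX2

    if [ "$rr_threads" -gt 1 ]; then
        set -- "$@" -t "$rr_threads"
    fi

    # 5000 iterations, as in the article's measurements.
    "$RESIZE_BIN" "$@" -l 5000
}
```

With the saved baseline binary you would call 'run_resize 1', and with the TBB build 'run_resize 2'; set 'RESIZE_BIN' to whichever binary you want to measure.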
Using 2 threads at the same time exploited both cores, and performance increased by about 70%.
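The percentage can be derived from the two average times reported by the example. The sketch below shows the arithmetic with 'awk'; the function names are hypothetical, and the 1.338 ms used in the usage note is not a measured value from this article but a figure back-computed from the 2.275 ms baseline and the roughly 70% gain, shown only to illustrate the calculation.

```shell
# Sketch: turn two average times (in ms) into a performance gain.
# speedup_percent and parallel_efficiency are hypothetical helper names.

#   $1 = baseline time per image, $2 = new time per image
speedup_percent() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (base / new - 1) * 100 }'
}

#   $1 = baseline time, $2 = new time, $3 = number of threads used
# Efficiency is the speedup factor divided by the thread count, in percent.
parallel_efficiency() {
    awk -v base="$1" -v new="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", (base / new) / n * 100 }'
}
```

With the illustrative numbers, 'speedup_percent 2.275 1.338' prints 70.0 and 'parallel_efficiency 2.275 1.338 2' prints 85.0, meaning the two threads together deliver about 85% of an ideal doubling.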
To verify that the example really uses two cores simultaneously, we can investigate with VTune. The following picture shows the number of CPUs utilized during each execution (blue = resize example without TBB, yellow = resize example with TBB).
The yellow bar at 2.00 tells us that 2 CPUs were running simultaneously for about 4.4 s.
The VTune results also show how the threads worked on specific tasks. Extracted results for the functions used for resizing are listed below.
We can see that only a single thread handles the resize function, and it carries a heavy load. In this kind of situation we should consider parallelizing the work across multiple threads. The following are the results for the version with TBB.
As expected, 2 threads were running simultaneously for about 4.4 s during the task, and that increased the performance.
< Conclusion >
We saw how easily an IPP example can be built and tested with other ISS features. It is recommended to take a close look at the IPP example to learn how to program with IPP and TBB. Here TBB parallelizes the work across the dual-core processor and increases the performance.
As for ICC, simply changing the compiler from GCC to ICC did not bring a big benefit in this case, since the IPP resize function is already optimized with SIMD instructions and the loops were parallelized by TBB. That leaves few other tasks for ICC to optimize in this example. If there were additional functions and loops that could be vectorized or parallelized, so that SIMD instructions, OpenMP or Cilk could be used with ICC, there would be further opportunities to optimize the application.