Explicit Vector Programming – Best Known Methods

Why do we care about vectorizing applications? The simple answer: Vectorizing improves performance, and achieving high performance can save power. The faster an application can compute CPU-intensive regions, the faster the CPU can be set to a lower power state.

How does vectorizing compare to scalar operations with regard to performance and power? Vectorizing consumes less power than equivalent scalar operations because it performs better: scalar operations process several times less data per cycle and require more instructions and more cycles to complete.

The introduction of wider vector registers in x86 platforms and the increasing number of cores that support single instruction multiple data (SIMD) and threading parallelism now make vectorization an optimization consideration for developers. This is because vector performance gains are applied per core, so multiplicative application performance gains become possible for more applications. In the past, many developers relied heavily on the compiler to auto-vectorize some loops, but serial constraints of programming languages have hindered the compiler’s ability to vectorize many different kinds of loops. The need arose for explicit vector programming methods that extend vectorization capability to support reductions and to vectorize:

Outer loops
Loops with user defined functions
Loops that the compiler assumes to have data dependencies, but that the developer understands to be benign.

In summary: achieving high performance can also save power.

(An excellent web reference is “Programming and Compiling for Intel® Many Integrated Core Architecture”. While the focus is on Intel® Xeon Phi™ coprocessor optimization, much of the content is also applicable to Intel® Xeon® and Intel® Core™ processors.)

This document describes high-level best known methods (BKMs) for using explicit vector programming to improve the performance of CPU-bound applications on modern processors with vector processing units. In many cases, it is advisable to consider structural changes that accommodate both thread-level parallelism and SIMD-level parallelism as you pursue your optimization strategy.

Note: To determine whether your application is CPU-bound or memory-bound, see About Performance Analysis with VTune™ Amplifier and Detecting Memory Bandwidth Saturation in Threaded Applications. Using hotspot analysis, find the parts of your application that are CPU-bound.

The following seven steps are applicable for CPU-bound applications:

Measure baseline application performance.
Run hotspots and general exploration report analysis with the Intel® VTune™ Amplifier XE.
Determine whether hot loop/function candidates are qualified for SIMD parallelism.
Implement SIMD parallelism using explicit vector programming techniques.
Measure SIMD performance.
[Optional for advanced developers] Generate assembly code and inspect.
Repeat!

Step 1: Measure Baseline Application Performance

You first need a baseline for your application’s existing performance level to determine whether your vectorization changes are effective. In addition, you need a baseline to measure your progress and final application performance relative to your starting point. Understanding this provides some guidance about when to stop optimizing.

Use a release build of your application for the initial baseline instead of a debug build. A release build contains all the optimizations in your final application. This is important because you need to understand which loops or “hotspots” in your application consume significant time.

A release baseline provides symbol information, and has all optimizations turned on except simd (explicit vectorization) and vec (auto-vectorization). To explicitly turn off simd and auto-vectorization, use the compiler switches -no-simd -no-vec. (See Intel® C++ Compiler User Reference Guide 14.0.)

Compare the baseline’s performance against the vectorized version to get a sense of how well your vectorization tuning approaches theoretical maximum speedup.

It is best to compare the performance of specific loops in the baseline and vectorized version using tools such as the Intel® VTune™ Amplifier XE or embedded print statements.
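As a minimal sketch of the "embedded print statements" approach, the following C code times one loop and prints the best of several runs. The saxpy loop, the function names, and the choice of the C11 timespec_get clock are assumptions made for illustration; they do not come from any Intel tool. Run the same harness against the baseline build and the vectorized build and compare the printed times.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical hot loop used only for illustration: y[i] += a * x[i]. */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Wall-clock seconds using the C11 timespec_get call. */
double seconds_now(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Runs the loop `repeats` times, prints and returns the best time seen.
   Taking the minimum reduces noise from other activity on the system. */
double time_saxpy(int n, int repeats, const float *x, float *y)
{
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        double t0 = seconds_now();
        saxpy(n, 2.0f, x, y);
        double dt = seconds_now() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    printf("saxpy: n=%d, best of %d runs = %.6f s\n", n, repeats, best);
    return best;
}
```

For very short loops the timer resolution dominates, so use a problem size large enough that one run takes well over a microsecond.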

Step 2: Run hotspots and general exploration report analysis with Intel® VTune™ Amplifier XE

You can use the Intel® VTune™ Amplifier XE to find the most time-consuming functions in your application. The “Hotspots” analysis type is recommended, although “Lightweight Hotspots” (which profiles the whole system, as opposed to just your application) works as well.

Identifying which areas of your application are taking the most time allows you to focus your optimization efforts in those areas where performance improvements will have the most effect. Generally, you want to focus only on the top few hotspots or functions taking at least 10% of your application’s total runtime. Make note of the hotspots you want to focus on for the next step. (Tutorial: Finding Hotspots.)

The general exploration report can provide information about:

TLB misses (consider compiler profile guided optimization),
L1 Data cache misses (consider cache locality and using streaming stores),
Split loads and split stores (consider data alignment for targeted architecture),
Memory bandwidth,
Memory latency (consider streaming stores and prefetching) demanded by the application.

This higher level analysis can help you determine whether it is profitable to pursue vectorization tuning.

Step 3: Determine Whether Hot Loop/Function Candidates Are Qualified for SIMD Parallelism

One key suitability ingredient for choosing loops to vectorize is whether the memory references in the loop are independent of each other. (See Memory Disambiguation inside vector-loops and Requirements for Vectorizable Loops.)
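To make the independence requirement concrete, here is a small C sketch with two versions of the same loop. The function names are hypothetical. In the first version the compiler must allow for dst and src overlapping (for example dst == src + 1, which would make each iteration depend on the previous one). In the second, the C99 restrict qualifier is the developer's promise that the two arrays do not overlap, which removes that assumed dependency.

```c
#include <assert.h>
#include <stddef.h>

/* The compiler cannot prove that dst and src are distinct arrays, so it
   may add run-time overlap checks or decline to vectorize. */
void scale_may_alias(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}

/* `restrict` asserts that dst and src never overlap. Calling this with
   overlapping arrays is undefined behavior, so the promise must be true. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float k, size_t n)
{
    assert(dst != src || n == 0);  /* cheap sanity check, not a full overlap test */
    for (size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}
```

Whether a particular compiler vectorizes either version depends on its heuristics; the vectorization report is the way to confirm.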

The Intel® Compiler vectorization report (or -vec-report) can tell you whether each loop in your code was vectorized. Ensure that you are using the compiler optimization level 2 or 3 (-O2 or -O3) to enable the auto-vectorizer. Run the vectorization report and look at the output for the hotspots determined from Step 2. If there are loops in these hotspots that did not vectorize, check whether they have math, data processing, or string calculations on data in parallel (for instance in an array). If they do, they might benefit from vectorization. Move to Step 4 if any vectorization candidates are found.

Data alignment

Data alignment is another key ingredient for getting the most out of your vectorization efforts. If the Intel® VTune™ Amplifier reports split loads and stores, then the application is using unaligned data. Data alignment forces the compiler to create data objects in memory on specific byte boundaries. There are two aspects of data alignment that you must be aware of:

Create arrays with certain byte alignment properties.
Insert alignment pragmas/directives and clauses in performance critical regions.

Alignment increases the efficiency of data loads and stores to and from the processor. When targeting Intel® Streaming SIMD Extensions 2 (Intel® SSE2) platforms, use 16-byte alignment, which facilitates the use of SSE-aligned load instructions. When targeting the Intel® Advanced Vector Extensions (Intel® AVX) instruction set, try to align data on a 32-byte boundary. (See Improving Performance by Aligning Data.) For Intel® Xeon Phi™ coprocessors, memory movement is optimal on 64-byte boundaries. (See Data Alignment to Assist Vectorization.)
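The two aspects of alignment listed above can be sketched in portable C as follows. This is an illustration under stated assumptions, not the Intel-specific mechanism: it uses the C11 aligned_alloc function for allocation and the OpenMP 4.0 aligned clause to tell the compiler about the alignment inside the loop. VEC_ALIGN and both function names are hypothetical; 64 bytes is chosen because it also satisfies the 16-byte and 32-byte cases.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 64  /* assumption: 64 bytes covers SSE2 (16), AVX (32), Xeon Phi (64) */

/* Aspect 1: create the array on an aligned boundary.
   aligned_alloc requires the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the result with free(). */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t padded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (padded == 0)
        padded = VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, padded);
}

/* Aspect 2: state the alignment where the data is used, because the
   compiler usually cannot see how a pointer argument was allocated. */
void add_aligned(float *a, const float *b, int n)
{
    assert((uintptr_t)a % VEC_ALIGN == 0 && (uintptr_t)b % VEC_ALIGN == 0);
    #pragma omp simd aligned(a, b : VEC_ALIGN)
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}
```

The aligned clause is a promise, not a request: passing a pointer that is not actually aligned as declared can fault at run time, which is why the sketch asserts it first.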

Unit stride

Consider using unit stride memory (also known as address sequential memory) access and structure of arrays (SoA) rather than arrays of structures (AoS) or other algorithmic optimizations to assist vectorization. (See Memory Layout Transformations.)

As a general rule, it is best to access data in a unit stride fashion when referencing memory, because this is often good for vectorization and other parallel programming techniques. (See Improving Discrete Cosine Transform performance using Intel(R) Cilk(TM) Plus.)
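The difference between the two layouts is easiest to see in code. In the sketch below (all type and function names are hypothetical), summing the x field of an array of structures touches every third float, while the structure-of-arrays version reads consecutive floats, which is the unit stride pattern that vector loads handle best.

```c
#include <assert.h>

#define N_POINTS 1024  /* assumption: fixed capacity keeps the sketch simple */

/* AoS: x, y, z are interleaved, so consecutive x values are 3 floats apart. */
struct point_aos { float x, y, z; };

/* SoA: each field is its own contiguous array, so x values are adjacent. */
struct points_soa {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};

/* Stride-3 access: a vectorized version needs gathers or shuffles. */
float sum_x_aos(const struct point_aos *p, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += p[i].x;
    return sum;
}

/* Unit stride access: maps directly onto contiguous vector loads. */
float sum_x_soa(const struct points_soa *p, int n)
{
    assert(n <= N_POINTS);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += p->x[i];
    return sum;
}

/* One-time layout conversion, done outside the hot loops. */
void aos_to_soa(const struct point_aos *in, struct points_soa *out, int n)
{
    assert(n <= N_POINTS);
    for (int i = 0; i < n; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

The conversion has its own cost, so it pays off mainly when the SoA data is reused across many passes.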

Successful vectorization may hinge on the application of other loop optimizations, such as loop interchange (see information on cache locality), and loop unroll.

It may be worth experimenting to see if inlining a function using -ip or -ipo allows vectorization to proceed for loops with embedded, user-defined functions. This is one alternative approach to using SIMD-enabled functions; there may be tradeoffs between using one or the other.

Note:

If the algorithm is computationally bound when performing hotspot analysis, continue pursuing the strategy described in this paper. If the algorithm is memory-latency bound or memory-bandwidth bound, then vectorization will not help. In such cases, consider strategies like cache optimizations or other memory-related optimizations, or even rethink the algorithm entirely. High-level loop optimizations, such as those enabled by -O3, can look for loop interchange optimizations that might help cache locality issues. Cache blocking can also help improve cache locality when applicable. (See Cache Blocking Techniques, which is specific to the Intel® Many Integrated Core Architecture (Intel® MIC Architecture), but the technique applies to the Intel® Xeon® processor as well.)
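As an illustration of cache blocking, the following C sketch transposes a square matrix one tile at a time, so that the rows being read and the columns being written both stay in cache while a tile is processed. The function name and the TILE value are assumptions; the best tile size depends on the cache sizes of the target processor and should be found by measurement.

```c
#include <assert.h>

#define TILE 32  /* assumption: tune for the target cache */

/* Transposes the n x n row-major matrix `in` into `out`, tile by tile.
   A plain transpose walks `out` with stride n; blocking limits that
   strided walk to TILE rows at a time so cache lines are reused. */
void transpose_blocked(const float *in, float *out, int n)
{
    assert(n >= 0);
    for (int ii = 0; ii < n; ii += TILE) {
        for (int jj = 0; jj < n; jj += TILE) {
            /* Clamp the tile at the matrix edge when n is not a multiple of TILE. */
            int i_end = (ii + TILE < n) ? ii + TILE : n;
            int j_end = (jj + TILE < n) ? jj + TILE : n;
            for (int i = ii; i < i_end; ++i)
                for (int j = jj; j < j_end; ++j)
                    out[j * n + i] = in[i * n + j];
        }
    }
}
```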

Step 4: Implement SIMD Parallelism Using Explicit Vector Programming Techniques

Explicit vector programming includes features such as the Intel® Cilk™ Plus or OpenMP* 4.0 vectorization directives. These optimizations provide a very powerful and portable way to express vectorization potential in C/C++ applications. OpenMP* 4.0 vectorization directives are also applicable to Fortran applications. These explicit vector programming techniques give you the means to specify which targeted loops to vectorize. Candidate loops for vectorization directives include loops that have too many memory references for the compiler to put dependency checks in place, loops with reductions, loops with user-defined functions, and outer loops, among others.

(See Best practices for using Intel® Cilk™ Plus for recommendations for using the Intel® Cilk™ Plus methodology and Enabling SIMD in program using OpenMP4.0 for how to enable SIMD features in an application using the OpenMP* 4.0 methodology.)

See also the webinar Introducing Intel® Cilk™ Plus and two video training series detailing vectorization essentials with explicit vector programming using Intel® Cilk™ Plus and OpenMP* 4.0 vectorization techniques.

Here are some common components of explicit vector programming.

SIMD-enabled Functions (Intel® Cilk™ Plus and OpenMP* 4.0 Methodologies)

User creation of SIMD-enabled functions is a capability provided in both the Intel® Cilk™ Plus and OpenMP* 4.0 methodologies. SIMD-enabled functions explicitly describe the SIMD behavior of user-defined functions, including how SIMD behavior is altered due to call site dependence. (See Call site dependence for SIMD-enabled functions in C++, which explains why the compiler sometimes uses a vector version of a function in some call sites, but not others. It also describes what you can do to extend the types of call sites for which the compiler can provide vector versions. Learn more about SIMD-enabled functions in Usage of linear and uniform clause in Elemental function (SIMD-enabled function).)
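A SIMD-enabled function in the OpenMP* 4.0 style might look like the sketch below. The function names are hypothetical. The uniform clause says that base and scale have the same value for every vector lane, and linear(i : 1) says that i increases by one from lane to lane, so the compiler can use a contiguous vector load of base[i] instead of a gather. Compilers that do not enable OpenMP SIMD support ignore the pragmas and compile ordinary scalar code.

```c
#include <assert.h>

/* Asks the compiler to also generate a vector version of this function.
   uniform(base, scale): identical across lanes.
   linear(i : 1): lane k receives i + k, so base[i] is a unit stride access. */
#pragma omp declare simd uniform(base, scale) linear(i : 1)
float scaled_elem(const float *base, int i, float scale)
{
    return base[i] * scale;
}

/* Call site: inside a SIMD loop, with arguments that match the clauses,
   the compiler may substitute the vector version of scaled_elem. */
void scale_all(float *out, const float *in, float scale, int n)
{
    assert(n >= 0);
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        out[i] = scaled_elem(in, i, scale);
}
```

If a call site passes arguments that do not match the declared clauses (for example a non-unit stride index), the compiler may fall back to the scalar version at that call site.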

SIMD Loops (Intel® Cilk™ Plus and OpenMP* 4.0 Methodologies)

Both the Intel® Cilk™ Plus and OpenMP* 4.0 methodologies provide SIMD loops. The principle with SIMD loops is to explicitly describe the SIMD behavior of a loop, including descriptions of variable usage and any idioms such as reductions. (See Requirements for Vectorizing Loops with #pragma SIMD. For a quick introduction to #pragma simd, see the corresponding topic for Intel® Cilk™ Plus and OpenMP* 4.0.)
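A reduction is a typical idiom that has to be described explicitly. In the following sketch (the function name is hypothetical), the reduction clause of the OpenMP* 4.0 simd directive tells the compiler that sum is an accumulator, so each vector lane can keep a private partial sum that is combined after the loop.

```c
#include <assert.h>

/* Dot product as a SIMD loop with an explicit sum reduction. */
float dot(const float *a, const float *b, int n)
{
    assert(n >= 0);
    float sum = 0.0f;
    #pragma omp simd reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Because the partial sums are added in a different order than in the scalar loop, floating-point results can differ in the last bits between the scalar and vectorized builds.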

Traditionally, only inner loops have been targeted for vectorization. One unique application of the Cilk Plus #pragma simd or OpenMP* 4.0 #pragma omp simd is that it can be applied to an outer loop.

(See Outer Loop Vectorization and Outer Loop Vectorization via Intel® Cilk™ Plus Array Notations, which describe using #pragma simd in outer loops.)
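Outer loop vectorization is most useful when the inner loop is short and the outer loop is long. The sketch below is an assumed example, not taken from the referenced articles: it evaluates a four-coefficient polynomial at many points. The inner Horner loop has only three iterations, so the simd directive is placed on the outer loop, and each vector lane then evaluates the whole polynomial for a different x[i].

```c
#include <assert.h>

#define NCOEF 4  /* assumption: short, fixed inner trip count */

/* y[i] = coef[0] + coef[1]*x[i] + coef[2]*x[i]^2 + coef[3]*x[i]^3 */
void poly_eval(const float *x, float *y, const float *coef, int n)
{
    assert(n >= 0);
    #pragma omp simd
    for (int i = 0; i < n; ++i) {          /* vectorized across i */
        float acc = coef[NCOEF - 1];
        for (int k = NCOEF - 2; k >= 0; --k)  /* runs serially inside each lane */
            acc = acc * x[i] + coef[k];
        y[i] = acc;
    }
}
```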

Intel® Cilk™ Plus Array Notation (Intel® Cilk™ Plus Methodology)

Array Notation is an Intel-specific language extension that is a part of the Intel® Cilk™ Plus methodology supported by the Intel® C++ Compiler. Array Notation provides a way to express a data parallel operation on ordinary declared C/C++ arrays. Array Notation is also compatible with OpenMP* 4.0 and Intel® Cilk™ Plus SIMD-enabled functions. It provides a concise way of replacing loops operating on arrays with a clean array notation syntax that the Intel® Compiler identifies as being vectorizable.

Step 5: Measure SIMD performance

Measure the runtime performance of your application’s vectorized build configuration. If you are satisfied, you are done! Otherwise, inspect the -vec-report6 output to get a SIMD vectorization summary report (to check alignment, unit stride, use of SoA versus AoS, interaction with other loop optimizations, and so on).

(For a deeper exploration on measuring performance, see How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures.)

Another approach is to use a family of compiler switches with the template -profile-xxxx. (These switches are described in Profile Function or Loop Execution Time.) Using the instrumentation method to profile function or loop execution time makes it easy to view where cycles are being spent in your application. The Intel® Compiler inserts instrumentation code into your application to collect the time spent in various locations. This data helps identify hotspots that may be candidates for optimization tuning or targeted parallelization.

Another method to measure performance is to re-run the Intel® VTune™ Amplifier XE hotspot analysis after the optimizations are made and compare results.

Optional Step 6: For Advanced Developers - Generate Assembly Code and Inspect

For those who want to see the assembly code that the compiler generates, and inspect that code to gain insight into how well applications were vectorized, use the compiler switch -S to compile to assembly (.s) without invoking a link step.

Step 7: Repeat!
Repeat as needed until you achieve the desired performance or no good candidates remain.

Other considerations are applicable for applications that are memory latency-bound or memory bandwidth-bound:

Other considerations: Prefetching and Streaming Stores

Prefetching

Data prefetching is a method for a compiler or a developer to request that data be pulled into a cache line from main memory prior to it being used. Prefetching is more applicable for Intel® MIC Architecture. Explicit control of prefetching can be an important performance factor to investigate. (See Prefetching on Intel® MIC Architecture.)
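As a rough sketch of developer-controlled prefetching, the following C code uses the __builtin_prefetch builtin provided by GCC and Clang, which is an assumption made for portability and is not the Intel® MIC-specific mechanism covered in the referenced article. The function name and PREFETCH_DISTANCE are hypothetical. The loop reads data through an index array, an access pattern that hardware prefetchers generally cannot predict, and requests the element needed a few iterations ahead.

```c
#include <assert.h>

#define PREFETCH_DISTANCE 16  /* assumption: iterations to look ahead; tune by measurement */

/* Sums data[index[i]] while prefetching the element needed later. */
double sum_indirect(const double *data, const int *index, int n)
{
    assert(n >= 0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        /* Arguments: address, 0 = prefetch for read, 1 = low temporal locality.
           A prefetch is only a hint and never changes program results. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[index[i + PREFETCH_DISTANCE]], 0, 1);
#endif
        sum += data[index[i]];
    }
    return sum;
}
```

A distance that is too short hides little latency, and one that is too long can evict the data before it is used, so measure before and after.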

Streaming stores

Streaming stores are a method of writing data explicitly to main memory, bypassing all intermediate caches, in instances where you are sure that the data being written will not be needed from cache any time soon. Strictly speaking, bypassing all caches is only applicable on Intel® Xeon® processors. For Intel® Xeon Phi™ coprocessors, streaming-store evict instructions are provided to evict data only from a specific cache. (See Intel® Xeon Phi™ coprocessor specific support of streaming stores or Compiler-based Data Prefetching and Streaming Non-temporal Store Generation for Intel Xeon Phi Coprocessor (May 2013). Vectorization support describes the use of the VECTOR NONTEMPORAL compiler directive for addressing streaming stores.)

Other considerations: Scatter, gather, and compress structures:

Many applications benefit from explicit vector programming efforts. In many cases performance increases over scalar performance can be commensurate with the number of available vector lanes on a given platform. However, some types of coding patterns or idioms limit vectorization performance to a large degree.

Gather and Scatter codes

A[i] = B[Index[i]]; // Gather
A[Index[i]] = B[i]; // Scatter

While gather/scatter vectorization is available on Intel® MIC Architecture and recent Intel® Xeon® platforms, the performance gains from vectorization that relies on gather/scatter are often much lower than those from unit-stride loads and stores inside vector loops. If there are not enough other profitably vectorized operations (such as multiply, divide, or math calls) inside such vector loops, performance may even be lower than serial performance in some cases. The only possible workaround for such issues is to look at alternative algorithms altogether to avoid using gathers and scatters.
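The two idioms above can be written as complete C loops as follows. The function names are hypothetical. The loads from b in gather and the stores to a in scatter go to addresses chosen by index, so neither is a unit stride access. The scatter loop is additionally only safe to vectorize when index contains no duplicate values, since two lanes would otherwise write the same element.

```c
#include <assert.h>

/* Gather: unit stride stores to a, indexed (non-contiguous) loads from b. */
void gather(float *a, const float *b, const int *index, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        a[i] = b[index[i]];
}

/* Scatter: unit stride loads from b, indexed (non-contiguous) stores to a.
   With duplicate indices the final value depends on iteration order. */
void scatter(float *a, const float *b, const int *index, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        a[index[i]] = b[i];
}
```

One common way to restructure an algorithm, when the same index array is reused, is to gather once into a packed buffer, run the compute-heavy loops on that buffer with unit stride, and scatter the results back once at the end.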

Compress and Expand structures

Compress and expand structures are generally problematic. On Intel® Xeon Phi™ coprocessors, the Intel® Compiler can automatically vectorize loops that contain simple forms of compress/expand idioms. An example of a compress idiom is as follows:

do I = 1,N
   if (B(I) > 0) then
       x = x+1
       A(x) = B(I)
   endif
enddo

In this example, the variable x is updated under a condition. Note that it is incorrect to use #pragma simd for such compress structures, but using #pragma ivdep is acceptable.
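For C/C++ readers, the same compress idiom can be sketched as below; the function name is hypothetical and the indices are zero-based rather than one-based. The output position count advances only when the condition holds, so where element i is written depends on the outcome of every earlier iteration. That conditional, loop-carried update is what distinguishes a compress from an ordinary independent loop.

```c
#include <assert.h>

/* Copies the positive elements of b to the front of a, preserving order.
   Returns how many elements were written. */
int compress_positive(const float *b, float *a, int n)
{
    assert(n >= 0);
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (b[i] > 0.0f) {
            a[count] = b[i];  /* store position depends on earlier iterations */
            ++count;
        }
    }
    return count;
}
```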

Improve performance of such vectorized loops on Intel® MIC Architecture using the -opt-assume-safe-padding compiler option. (See Common Vectorization Tips.)

Currently vectorization of compress structures is only for future platforms that support compress structures.

Reference Materials:

Compiler diagnostic messages

Intel® Fortran Vectorization Diagnostics– Diagnostic messages from the vectorization report produced by the Intel® Fortran Compiler. To obtain a vectorization report in Intel® Fortran, use the option -vec-report[n] (Linux* and OS X* platforms) or /Qvec-report[:n] (Windows* platform).
Vectorization Diagnostics for Intel® C++ Compiler– Diagnostic messages from the vectorization report produced by the Intel® C++ Compiler. To obtain a vectorization report with the Intel® C++ Compiler, use option -vec-report[n] (Linux* and OS X* platforms) or /Qvec-report[:n] (Windows* platform).

Intel® C++ Compiler Videos

Vectorization Essentials– Ten videos covering the motivation for explicit vector programming and SIMD concepts with the Intel® Cilk™ Plus methodology.
Performance Essentials with OpenMP 4.0 Vectorization– Seven videos covering the motivation for explicit vector programming and SIMD concepts with the OpenMP* 4.0 methodology.

Webinars

Introduction to Vectorization using Intel® Cilk™ Plus Extensions

Articles

Compiler Switches for Intel® Parallel Amplifier– The Intel® Parallel Amplifier can analyze many native binaries.
Vectorization and Optimization Reports– Using compiler option -vec-report to determine what is (or is not) vectorizing and why (or why not) in your application.
Data Alignment to Assist Vectorization– New features in the Intel® Compiler 14.0 that support the data alignment critical for vectorization.
Pointer Aliasing and Vectorization– How to tell the Intel® Compiler that pointers are not aliasing the same data.
Requirements for Vectorizable Loops– Requirements for loop vectorization, code snippets, examples, and an advice section.
Vectorization for C or C++ Users with Intel® Cilk™ Plus Array Notations and Elemental Functions– Elemental vectorization functions with the Intel® Cilk™ Plus array notation.
Memory Layout Transformations – Moving from data organized in an AoS to an organization of SoA.
Usage of linear and uniform clause in Elemental function (SIMD-enabled function)– Linear and uniform clauses specifically for the Intel® Cilk™ Plus methodology but with almost direct application for OpenMP* 4.0 #pragma omp declare simd linear and uniform clauses.
Call site dependence for SIMD-enabled functions in C++– Relationship of function call site to multiple SIMD-enabled functions in a header file.
Best practices for using Intel® Cilk™ Plus– Step-by-step approach to enable an application with task and data parallelism using the Intel® Cilk™ Plus methodology.
Data Alignment to Assist Vectorization– Data alignment is a method to force the compiler to create data objects in memory on specific byte boundaries. In addition to creating the data on aligned boundaries, the compiler is able to make optimizations when the data is known to be aligned by 64-bytes.
Array Notation Tradeoffs– Rewriting array notation syntax (one way to express parallelism that helps the compiler with vectorization) with shorter vectors to avoid cache overflow.
Improving Discrete Cosine Transform performance using Intel® Cilk™ Plus– Improving the performance of Discrete Cosine Transforms using the Intel® Cilk™ Plus methodology and Array Notation.
Improving Averaging Filter performance using Intel® Cilk™ Plus – Improving the performance of an Averaging Filter in image processing using the Intel® Cilk™ Plus methodology, and using task parallelism, Array Notation, and SIMD-enabled functions to express data parallelism.
Outer Loop Vectorization– Moving the vectorization from an inner level to an outer level using a combination of elemental functions and pragma/directive SIMD.
Outer Loop Vectorization via Intel® Cilk™ Plus Array Notations – Using C++ Array Notation with the Intel® Cilk™ Plus methodology.
Tradeoffs between array-notation long-vector and short-vector coding – Using C++ Array Notation with the Intel® Cilk™ Plus methodology.
The Importance of Vectorization for Intel® Many Integrated Core Architecture (Intel® MIC Architecture) (Fortran Example)– Using the Intel® Fortran Compiler vectorizer to get good performance through effective use of the SIMD hardware and the benefits of threading over many cores.
Vectorizing Intel® Threading Building Blocks (Intel® TBB) parallel_for block– Writing vector-friendly code inside an Intel® Threading Building Blocks (Intel® TBB) parallel_for block.
Intel® Xeon Phi™ coprocessor specific support of streaming stores– Speeding up performance of vector-aligned unmasked stores in streaming kernels.
Molecular Dynamics Optimization on Intel® Many Integrated Core Architecture (Intel® MIC)– Optimizing a molecular dynamics program using Intel® MIC Architecture.
Cache Blocking Techniques– Cache Blocking is a technique to rearrange data access to pull subsets (blocks) of data into cache and to operate on each block to avoid repeatedly fetching data from memory.
Memory Layout Transformations– Reorganizing data into an SoA organization for real world applications.
Large Page Considerations– Using mmap to allocate data directly in 2MB pages, using libraries such as libhugetlbfs to allocate all malloc-ed data and static data.
Common Vectorization Tips– User-defined function-calls inside vector-loops, unit-stride accesses inside elemental functions, and memory disambiguation inside vector loops.
Program Optimization through Loop Vectorization– The Intel® Compilers provide many ways to generate well-optimized vector instructions. Good vectorization is of fundamental importance to take full advantage of SIMD-based data parallelism. High-level coding can take advantage of auto-vectorization.
Intel Guide for Developing Multithreaded Applications– More specific tuning-related information applicable to thread synchronization and memory management.
Large Page Considerations– Compiler Methodology for Intel® MIC Architecture.
Element-wise Alignment Requirements for Data Accesses– Compiler Methodology for Intel® MIC Architecture.