Experimenting with vectorization, I've come across some unexpected behavior. For example, the following demo gets 2x slower when #pragma simd is used.
// Vectorization simd slowdown demo
// #pragma simd makes this 2x slower on AVX2 (Haswell) CPU
// Build with: icl /nologo /Qstd=c++11 /Qcxx-features /Wall /QxHost /DNOMINMAX /DWIN32_LEAN_AND_MEAN -D__builtin_huge_val()=HUGE_VAL -D__builtin_huge_valf()=HUGE_VALF -D__builtin_nan=nan -D__builtin_nanf=nanf -D__builtin_nans=nan -D__builtin_nansf=nanf /DNDEBUG /Qansi-alias /O3 /fp:fast=2 /Qprec-div- /Qip /Qopt-report /Qopt-report-phase:vec simd_slowdown.cc

#include <ctime>
#include <iostream>

using namespace std;

// Length Squared
int
length_squared( int * a, int N )
{
    int length_sq( 0 );
#pragma simd // 2x slower with this!
#pragma vector aligned
    for ( int i = 0; i < N; ++i ) {
        length_sq += a[ i ] * a[ i ];
    }
    return length_sq;
}

int
main()
{
    int const N( 4096 ), R( 32*1024*1024 );
    alignas( 32 ) int a[ N ];
#pragma novector
    for ( int i = 0; i < N; ++i ) {
        a[ i ] = 1;
    }
    int s( 0 );
    double const time1 = (double)clock() / CLOCKS_PER_SEC;
#pragma novector
    for ( int r = 1; r <= R; ++r ) {
        s += length_squared( a, N );
    }
    double const time2 = (double)clock() / CLOCKS_PER_SEC;
    cout << time2 - time1 << " s " << s << endl;
}
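As an aside, since #pragma simd asserts the loop is free of loop-carried dependencies, I believe the strictly correct form is to name length_sq in a reduction clause. A sketch of that variant (the name length_squared_reduction is just for illustration):

// Same kernel with an explicit reduction clause on the accumulator
int
length_squared_reduction( int * a, int N )
{
    int length_sq( 0 );
#pragma simd reduction(+:length_sq) // declare length_sq as a sum reduction
#pragma vector aligned
    for ( int i = 0; i < N; ++i ) {
        length_sq += a[ i ] * a[ i ];
    }
    return length_sq;
}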
This occurs with Intel C++ 2016 on a Haswell system using an AVX2 build. The vectorization reports are similar both ways. For another twist, if you change the array type to float, then #pragma simd makes it run 40% faster. Is this just exposing weaknesses in the vectorization engine, or is there a rational explanation for this?
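For concreteness, the float variant I mean is just the same kernel with float in place of int (assuming the same build flags, and alignas( 32 ) float a[ N ] in main):

// Float variant: here #pragma simd makes it ~40% faster instead of 2x slower
float
length_squared( float * a, int N )
{
    float length_sq( 0.0f );
#pragma simd
#pragma vector aligned
    for ( int i = 0; i < N; ++i ) {
        length_sq += a[ i ] * a[ i ];
    }
    return length_sq;
}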
Thanks!