Channel: Intel® C++ Compiler

Surprising simd behavior

Experimenting with vectorization, I've come across some unexpected behavior. For example, the following demo gets 2x slower when #pragma simd is used.

// Vectorization simd slowdown demo
// #pragma simd makes this 2x slower on AVX2 (Haswell) CPU
// Build with: icl /nologo /Qstd=c++11 /Qcxx-features /Wall /QxHost /DNOMINMAX /DWIN32_LEAN_AND_MEAN -D__builtin_huge_val()=HUGE_VAL -D__builtin_huge_valf()=HUGE_VALF -D__builtin_nan=nan -D__builtin_nanf=nanf -D__builtin_nans=nan -D__builtin_nansf=nanf /DNDEBUG /Qansi-alias /O3 /fp:fast=2 /Qprec-div- /Qip /Qopt-report /Qopt-report-phase:vec simd_slowdown.cc

#include <ctime>
#include <iostream>
using namespace std;

// Length Squared
int
length_squared( int * a, int N )
{
	int length_sq( 0 );
#pragma simd // 2x slower with this!
#pragma vector aligned
	for ( int i = 0; i < N; ++i ) {
		length_sq += a[ i ] * a[ i ];
	}
	return length_sq;
}

int
main()
{
	int const N( 4096 ), R( 32*1024*1024 );
	alignas( 32 ) int a[ N ];
#pragma novector
	for ( int i = 0; i < N; ++i ) {
		a[ i ] = 1;
	}
	long long s( 0 ); // long long: the sum overflows int across R repeats
	double const time1 = (double)clock()/CLOCKS_PER_SEC;
#pragma novector
	for ( int r = 1; r <= R; ++r ) {
		s += length_squared( a, N );
	}
	double const time2 = (double)clock()/CLOCKS_PER_SEC;
	cout << time2 - time1 << " s  "<< s << endl;
}

This occurs with Intel C++ 2016 on a Haswell system using an AVX2 build, and the vectorization reports are similar either way. For another twist, if you change the array type to float, then #pragma simd makes it run 40% faster instead. Is this just exposing weaknesses in the vectorization engine, or is there a rational explanation for this?

Thanks!

