Hi,
I am trying to get the "#pragma omp declare simd" construct to work, but I am running into problems.
I wrote a program that consists of two compilation units: main.cpp and f.cpp. The file f.cpp contains two functions. The function f_not_vectorized has no vectorized version, while f_openmp is declared with OpenMP 4 so that the compiler generates a vectorized version of it. Both compute the cosine of a float.
I use main.cpp to time three tests: one with f_not_vectorized, one with f_openmp, and one with a direct call to std::cos. Unfortunately, I get the following result:
Not vectorized: 8.370e-01 s
OpenMP: 8.370e-01 s
Inlined: 1.563e-01 s
which is not what I expected. I was expecting the OpenMP version to perform very close to the inlined version, since it should be vectorized. Here is the full code, compiled with icpc version 16.0.2 using the following commands:
icpc -c -std=c++11 -O3 -xHost -ansi-alias -qopenmp main.cpp -o main.o
icpc -c -std=c++11 -O3 -xHost -ansi-alias -qopenmp f.cpp -o f.o
icpc -std=c++11 -O3 -xHost -ansi-alias -qopenmp main.o f.o -o main
The file f.cpp
#include <cmath>

float f_not_vectorized(float x) {
  return std::cos(x);
}

#pragma omp declare simd notinbranch simdlen(8)
float f_openmp(float x) {
  return std::cos(x);
}
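To make concrete what I expect the pragma to do: as far as I understand it, "declare simd" with simdlen(8) asks the compiler to emit, next to the scalar f_openmp, a variant that handles 8 floats per call. The sketch below is only a hand-written illustration of that idea. The Float8 struct and the name f_openmp_simd8_sketch are hypothetical; the real variant is generated by the compiler, has its own mangled name, and works on vector registers rather than on a struct.

```cpp
#include <cmath>

// Hypothetical stand-in for an 8-lane vector register. The variant that the
// compiler generates would use real vector registers instead.
struct Float8 {
  float lane[8];
};

// Hand-written illustration of an 8-wide variant of f_openmp: one call
// computes the cosine of all 8 lanes. The name is made up for this sketch.
Float8 f_openmp_simd8_sketch(const Float8& x) {
  Float8 result;
  for (int i = 0; i < 8; ++i) {
    result.lane[i] = std::cos(x.lane[i]);
  }
  return result;
}
```

With such a variant available, a loop over 128 elements would need 16 calls instead of 128, which is the speedup I was hoping to see in the OpenMP test.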
And the file main.cpp
#include <cstdio>
#include <cmath>
#include <chrono>

float f_not_vectorized(float x);

#pragma omp declare simd notinbranch simdlen(8)
float f_openmp(float x);

int main() {
  const int nb_times{1000000};
  const int array_length{128};
  float v[array_length];

  // Test 1: plain call to the function without a vectorized version.
  auto time_begin = std::chrono::high_resolution_clock::now();
  for (int k{0}; k < nb_times; ++k) {
    for (int i{0}; i < array_length; ++i) {
      v[i] = f_not_vectorized(v[i]);
    }
  }
  auto time_end = std::chrono::high_resolution_clock::now();
  double time_not_vectorized{
      1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                   time_end - time_begin)
                   .count()};
  std::printf("Not vectorized: %7.3e s\n", time_not_vectorized);

  // Test 2: call to the function declared with "omp declare simd".
  time_begin = std::chrono::high_resolution_clock::now();
  for (int k{0}; k < nb_times; ++k) {
#pragma omp simd
    for (int i{0}; i < array_length; ++i) {
      v[i] = f_openmp(v[i]);
    }
  }
  time_end = std::chrono::high_resolution_clock::now();
  double time_openmp{1.0e-9 *
                     std::chrono::duration_cast<std::chrono::nanoseconds>(
                         time_end - time_begin)
                         .count()};
  // NOTE: this prints time_not_vectorized rather than time_openmp, which is
  // why the "OpenMP" line in the output above repeats the first timing.
  std::printf(" OpenMP: %7.3e s\n", time_not_vectorized);

  // Test 3: direct call to std::cos.
  time_begin = std::chrono::high_resolution_clock::now();
  for (int k{0}; k < nb_times; ++k) {
    for (int i{0}; i < array_length; ++i) {
      v[i] = std::cos(v[i]);
    }
  }
  time_end = std::chrono::high_resolution_clock::now();
  double time_inlined{1.0e-9 *
                      std::chrono::duration_cast<std::chrono::nanoseconds>(
                          time_end - time_begin)
                          .count()};
  std::printf(" Inlined: %7.3e s\n", time_inlined);

  // Sum the array so the compiler cannot discard the computations.
  float check{0.0f};
  for (int i{0}; i < array_length; ++i) {
    check += v[i];
  }
  std::printf("Check: %7.3e\n", check);

  return 0;
}
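Note that in main.cpp the "OpenMP" line is printed from time_not_vectorized, so the two identical timings in the output may simply be the same measurement printed twice. One way to rule out this kind of copy-and-paste slip is to let a single helper do both the timing and the printing. The following is a minimal sketch under that assumption. The names time_seconds and report are hypothetical and not part of the program above, and it uses std::chrono::steady_clock instead of high_resolution_clock.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical helper: calls `body` nb_times and returns the elapsed
// wall-clock time in seconds.
template <typename Body>
double time_seconds(int nb_times, Body body) {
  auto time_begin = std::chrono::steady_clock::now();
  for (int k = 0; k < nb_times; ++k) {
    body();
  }
  auto time_end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(time_end - time_begin).count();
}

// Times `body` and prints the result under `label`, so that a label cannot
// end up next to the measurement of another test.
template <typename Body>
double report(const char* label, int nb_times, Body body) {
  const double seconds = time_seconds(nb_times, body);
  std::printf("%s: %7.3e s\n", label, seconds);
  return seconds;
}
```

Each test would then become a single call such as report("OpenMP", nb_times, ...) with the inner loop over the array placed in a lambda. Wrapping the loop in a lambda might itself change how the compiler inlines or vectorizes it, so the numbers should be compared against the hand-written version before trusting them.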
Thanks for your help.