While preparing two benchmarks on a 4-socket/60-core Xeon server (E7-4890v2, HyperThreading and TurboBoost enabled; RHEL 7, Transparent Huge Pages enabled; cf. topic 515900), I observed that adding '-xHost' to the optimization options '-O2 -ansi-alias' (ICC 14.0.2) increases the execution time of the OpenMP-parallelized "Benchmark B". Each configuration was run three times:
== 'icpc -ansi-alias -O2 -xHost' ... ====================================
---- 60 threads --------------------------------------------------------
real 13m34.796s user 805m26.873s sys 1m33.328s
real 13m35.425s user 805m51.402s sys 1m33.418s
real 13m35.406s user 805m39.981s sys 1m35.493s
---- 120 threads --------------------------------------------------------
real 13m8.188s user 1553m54.815s sys 2m52.075s
real 13m3.077s user 1544m28.512s sys 2m45.633s
real 13m2.473s user 1542m21.130s sys 2m52.306s
== 'icpc -ansi-alias -O2' ... ===========================================
---- 60 threads --------------------------------------------------------
real 13m23.987s user 793m46.546s sys 1m29.245s
real 13m24.985s user 795m35.378s sys 1m28.225s
real 13m27.984s user 798m38.090s sys 1m27.477s
---- 120 threads --------------------------------------------------------
real 12m46.201s user 1509m1.876s sys 2m25.665s
real 12m46.240s user 1510m39.814s sys 2m31.060s
real 12m45.393s user 1508m45.201s sys 2m25.807s
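To put a number on the difference, the "real" times above can be averaged per configuration and compared. The following is only a minimal sketch: the helper names (`to_seconds`, `mean`, `slowdown_percent`) and the variable names are made up for illustration, and the values are simply the wall-clock times listed above converted to seconds.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Convert a `time`-style reading (minutes, seconds) to seconds.
double to_seconds(int minutes, double seconds) {
    return 60.0 * minutes + seconds;
}

// Arithmetic mean of a non-empty set of run times (in seconds).
double mean(const std::vector<double>& runs) {
    assert(!runs.empty());
    return std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
}

// Relative slowdown, in percent, of the runs built with the extra flag
// compared to the runs built without it.
double slowdown_percent(const std::vector<double>& with_flag,
                        const std::vector<double>& without_flag) {
    return (mean(with_flag) / mean(without_flag) - 1.0) * 100.0;
}

// "real" times from the listing above (hypothetical variable names).
const std::vector<double> xhost_60 = {
    to_seconds(13, 34.796), to_seconds(13, 35.425), to_seconds(13, 35.406)};
const std::vector<double> plain_60 = {
    to_seconds(13, 23.987), to_seconds(13, 24.985), to_seconds(13, 27.984)};
const std::vector<double> xhost_120 = {
    to_seconds(13, 8.188), to_seconds(13, 3.077), to_seconds(13, 2.473)};
const std::vector<double> plain_120 = {
    to_seconds(12, 46.201), to_seconds(12, 46.240), to_seconds(12, 45.393)};
```

Calling `slowdown_percent(xhost_60, plain_60)` and `slowdown_percent(xhost_120, plain_120)` from a small `main` gives roughly 1.2% for 60 threads and 2.4% for 120 threads. With only three runs per configuration this is an estimate, but the run-to-run spread within each configuration is smaller than the gap between them.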
Using IPO or PGO gives the same result: contrary to my expectations, the machine code gets slower. Is this effect already known for a particular kind of code? I would be very grateful for any hint! Please ask if more information is needed.
Thank you for reading.