The STREAM benchmark (http://www.cs.virginia.edu/stream/) is a synthetic benchmark program, written in standard Fortran 77 (with a corresponding version in C). It measures the performance of four long-vector operations. These operations are:
--------------------------------------------------------------------------
name     kernel                  bytes per iteration   FLOPS per iteration
--------------------------------------------------------------------------
COPY:    a(i) = b(i)                     16                      0
SCALE:   a(i) = q*b(i)                   16                      1
SUM:     a(i) = b(i) + c(i)              24                      1
TRIAD:   a(i) = b(i) + q*c(i)            24                      2
--------------------------------------------------------------------------
These operations are representative of the "building blocks" of long vector operations. The array sizes are defined so that each array is larger than the cache of the machine to be tested, and the code is structured so that data re-use is not possible.
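For readers who want to see the kernels concretely, here is a minimal C sketch in the spirit of the loops in stream.c. The array names a, b, c and STREAM_ARRAY_SIZE follow the benchmark's own conventions; the real benchmark wraps each loop with timing, validation, and optional OpenMP pragmas, and its assignment directions differ slightly from the Fortran-style pseudocode in the table above, although the traffic per iteration is the same.

/* Sketch of the four STREAM kernels, following stream.c's conventions. */
#define STREAM_ARRAY_SIZE 10000000
static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

void stream_kernels(double q)
{
    long j;
    for (j = 0; j < STREAM_ARRAY_SIZE; j++)   /* COPY  */
        c[j] = a[j];
    for (j = 0; j < STREAM_ARRAY_SIZE; j++)   /* SCALE */
        b[j] = q * c[j];
    for (j = 0; j < STREAM_ARRAY_SIZE; j++)   /* SUM (reported as "Add") */
        c[j] = a[j] + b[j];
    for (j = 0; j < STREAM_ARRAY_SIZE; j++)   /* TRIAD */
        a[j] = b[j] + q * c[j];
}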
People often choose the STREAM benchmark to measure the sustainable memory bandwidth of their own desktops or servers. In this article series, I will work at the assembly level and the micro-architecture level to help you interpret and analyse two common situations:
1. Why can the same compiled binary show such large performance differences on different hosts, even when those hosts are of the same micro-architecture? (Please wait for my next article.)
2. Why can the compiler's full optimization ('-O3' for the Intel® C++ Compiler) deliver sub-optimal performance compared to a less aggressive optimization option on the same machine?
In this article, I will use the Intel® C++ Compiler 2013 SP1 to compile the STREAM source code, run the single-threaded binary on an E5-2680 server, and then use Intel® VTune™ Amplifier XE to analyse the root cause of the performance degradation seen under the compiler's full optimization ('-O3' for the Intel® C++ Compiler).
You can compile either a single-threaded or a multi-threaded STREAM binary to measure the sustainable memory bandwidth for your specific use.
For simplicity, here I compile the code with '-O3 -xAVX -g -vec-report2' to generate a single-threaded STREAM binary, and with '-O3 -xAVX -g -opt-streaming-stores never -vec-report2' to generate a second binary for comparison. You can also add '-openmp' when you intend to measure the maximum sustainable memory bandwidth of a multi-core desktop or server.
To demonstrate the performance gap clearly, I paste the two outputs here, both from single-threaded runs on the E5-2680 server.
// icc -O3 -xAVX -g stream.c -vec-report2 -o FullStream.out
Function    Best Rate MB/s   Avg time     Min time     Max time
Copy:            7029.2      0.035565     0.022762     0.040708
Scale:           6901.0      0.037871     0.023185     0.042517
Add:             8871.4      0.044005     0.027053     0.049305
Triad:           8850.8      0.044265     0.027116     0.049458
// icc -O3 -xAVX -g -opt-streaming-stores never stream.c -vec-report2 -o PartStream.out
Function    Best Rate MB/s   Avg time     Min time     Max time
Copy:            7588.7      0.022461     0.021084     0.022827
Scale:          12364.8      0.013024     0.012940     0.013072
Add:            13210.8      0.018239     0.018167     0.018469
Triad:          13502.8      0.017995     0.017774     0.018093
Now you can see the big performance difference between running with and without the streaming-stores optimization.
To find the root cause and locate the region of interest, I will use VTune's event-based sampling, its assembly pane, and related features.
// Screenshot for the '-O3 -xAVX -g' case
// Screenshot for the '-O3 -xAVX -g -opt-streaming-stores never' case
Actually, the only notable difference you can see from the 'Summary' pane of VTune is that the first binary has nearly 32% more execution time (CPU_CLK_UNHALTED.THREAD: 2,930,000,000) than the second binary (CPU_CLK_UNHALTED.THREAD: 2,220,000,000).
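That is, (2,930,000,000 - 2,220,000,000) / 2,220,000,000 ≈ 0.32, which is where the roughly 32% figure comes from.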
So, where have my 32% extra cycles gone? As you can see from the screenshots above (for brevity, I only paste the event reports for source line 325), in the first scenario the compiler vectorizes the loop with a SIMD multiplication of four packed double-precision floating-point elements, follows the 'vmulpdy' instruction with a 'vmovntpdy' (streaming store) instruction, and, after the whole array has been streamed to memory, adds an 'mfence' at the end to preserve the program's memory ordering.
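The following is a hand-written C-intrinsics sketch of that code pattern, not the compiler's actual output; the function name and the simplifying assumptions (n a multiple of 4, 32-byte-aligned arrays) are mine.

#include <immintrin.h>

/* Rough illustration of the streaming-store code path for a Scale-style
   loop: a 256-bit multiply, then a non-temporal store that bypasses the
   cache, then a fence after the loop to preserve memory ordering. */
void scale_streaming(double *restrict b, const double *restrict c,
                     double q, long n)
{
    __m256d vq = _mm256_set1_pd(q);
    for (long j = 0; j < n; j += 4) {
        __m256d vc = _mm256_load_pd(&c[j]);   /* aligned 256-bit load          */
        __m256d vb = _mm256_mul_pd(vq, vc);   /* -> vmulpd                     */
        _mm256_stream_pd(&b[j], vb);          /* -> vmovntpd, bypasses cache   */
    }
    _mm_mfence();                             /* fence after the streamed loop */
}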
Now the root cause becomes clearer. Between the no-streaming-stores case and the streaming-stores case the compiler also chooses a different unroll factor, but that is not the point. The point is the very high CPU_CLK_UNHALTED.THREAD count (566,000,000) attributed to the assembly line 'vmovntpdy %ymm0, 0x5255060(,%rdx,8)' in the streaming-stores case. Apart from that, there are no big differences between the two cases in the events corresponding to source line 325 (MEM_LOAD_UOPS_RETIRED.LLC_MISS, MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS, MEM_LOAD_UOPS_RETIRED.L1_HIT_PS, etc.).
To conclude for the moment:
1: You can see from the vec-report2 output that the Intel® C++ Compiler 2013 SP1 vectorizes the loops over these three aligned arrays well. The only apparent differences in the assembly code are the unroll factor and the non-temporal stores, and this matches the large gap on the Triad kernel between 13502.8 MB/s in the no-streaming-stores case and 8850.8 MB/s in the streaming-stores case. So the streaming stores must interact strongly with the micro-architectural implementation details of the E5-2680 server, and that is where the root cause lies.
2: Streaming-store instructions write non-temporal data directly to memory without updating the cache. This minimizes cache pollution and unnecessary bus traffic between the cache and the YMM registers, because a streaming store does not write-allocate on a write miss, so when an instruction encounters a store miss there are no further cache interventions during that store's journey. However, the latency of servicing a store miss can be very high, especially on Westmere and Sandy Bridge, where single-thread performance is limited by the occupancy of the core's 10 Line Fill Buffers (LFBs). With streaming stores, an LFB stays occupied until its data can be handed off to the memory controller; with ordinary cacheable stores, the hardware prefetcher can bring in the target cache line ahead of the store. So every streaming-store instruction in the 4-way-unrolled 'for (j=0; j<STREAM_ARRAY_SIZE; j++)' loop has to wait for a free LFB, and these occasional long waits, together with the resulting execution starvation, account for most of the wasted unhalted clocks. (A per-loop way to control this behaviour is sketched below.)
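If you want to make this trade-off per loop rather than globally with '-opt-streaming-stores never', the Intel compiler's vectorization pragmas can request temporal or non-temporal stores for an individual loop. This is a sketch assuming the icc 2013 SP1 pragma syntax ('#pragma vector temporal' / '#pragma vector nontemporal'); check your compiler's documentation, and note that the function names are illustrative only.

/* Per-loop control of the store type (Intel compiler specific pragmas). */
void triad_temporal(double *restrict a, const double *restrict b,
                    const double *restrict c, double q, long n)
{
#pragma vector temporal        /* keep ordinary cacheable stores for this loop */
    for (long j = 0; j < n; j++)
        a[j] = b[j] + q * c[j];
}

void triad_nontemporal(double *restrict a, const double *restrict b,
                       const double *restrict c, double q, long n)
{
#pragma vector nontemporal     /* request streaming (non-temporal) stores */
    for (long j = 0; j < n; j++)
        a[j] = b[j] + q * c[j];
}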
That's all for this analysis. Thanks to the STREAM benchmark's author and to many other colleagues for their comments on this.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804