
My ultimate opinions on the benchmarking games


    Many benchmark applications exist today to guide users toward the highest-scoring microprocessor or SoC (or to measure a specific sub-system shipped in it, such as memory) across a wide range of state-of-the-art mobile phones, tablets, desktops, and servers. These benchmarks are usually designed and written by enthusiasts, like STREAM (which measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels), or released by performance-evaluation companies, like SPEC CPU2006 and Geekbench. Their intent is usually to measure a chip's maximum arithmetic capability, maximum read/write speed, or a GPU's maximum rendering capability. In this article I will discuss the shortcomings of current benchmarks on specific devices and give my opinions on what we should really care about when choosing a mobile phone or another hardware solution. The main focus in this article's context is IA (Intel Architecture) devices with no discrete GPU involved.

    1. Is the benchmark's result truly trustworthy?

    Recently, a benchmark-cheating incident was exposed at an Android phone manufacturer: on its platform, a benchmark reached a score roughly 20% higher in some tests than another Android phone's, even though both used the same CPU and the same benchmark application. It later turned out that the manufacturer used tricks to automatically detect benchmarking applications on a whitelist; when one was detected at start-up, a script would raise the CPU clock to its highest frequency and drive all the cores shipped in the package to the maximum strength the phone could deliver.

    So, as you can see, there are many ways and tricks to help the software or hardware produce a beautiful score:

   1). On the software side, variation in the performance result may come from the source code's implementation of the algorithm, the optimality of the compilation, the language's runtime framework (like Objective-C's), and the runtime virtual machine's interpretation, execution, or other translation infrastructure.

    The benchmark's source code must be compiled to machine code before it can run on a processor with an ARM or IA (Intel Architecture) ISA, so care must be taken about how the compiler affects the code's semantic correctness and performance. For example, with a C++ benchmark suite from SPEC CPU2006 that also uses new C++11 features, some compilers generate less optimal machine code because of incomplete support for the C++11 features or suboptimal internal code analysis and code generation for complicated C++ constructs. Beyond that, combinations of compilation options also have mixed performance effects on the final linked executable; for example, simply inlining the common single/double-precision floating-point transcendental functions shipped with the Intel Compiler in libimf.so can sometimes yield a roughly 10% higher Geekbench score on IA devices.
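    As a rough illustration, here is a minimal sketch of a transcendental-heavy kernel and two Intel Compiler invocations one might compare. The kernel is my own toy example, not Geekbench code, and the actual effect of each option depends on the compiler version and the hardware:

    /* transcend.c -- toy kernel dominated by transcendental math calls.
     * Compile lines to compare (Intel C/C++ Compiler):
     *   icc -O2 transcend.c -o baseline
     *   icc -O3 -xHost -fp-model fast=2 transcend.c -o tuned
     * The second permits vectorized/inlined math sequences instead of
     * one-at-a-time calls into libimf.so. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;
        for (int i = 0; i < 10000000; i++)
            sum += exp(1e-8 * i) * sin(1e-8 * i);
        printf("%f\n", sum);  /* keep the result live so the loop survives */
        return 0;
    }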

    2). On the hardware side, variation in the performance result mostly concerns the internal implementation of the instruction set, i.e., the micro-architecture. For example, many benchmarks target specific sub-systems of the processor: reading or writing sequentially through adjacent memory areas to make the kernel memory-bound and measure the maximum sustainable memory bandwidth the processor or SoC can deliver, or reading and writing randomly, as in a pointer-chasing case, within a small time window to measure the processor's ability to fetch an on-demand cache line for in-flight instructions in the worst scenario, which is very common in real-world applications. In this post-Moore era, memory-bound scenarios in real-world applications have become more common than CPU-bound ones, which demands more internal cache-memory bandwidth. Just imagine the upcoming AVX-512 in future processors: to sustain filling two ZMM registers in parallel in one cycle, today's 64-byte cache-line transfer granularity and the simultaneous in/out bandwidth to the L2 cache would definitely become the bottleneck again.
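    To make the pointer-chasing idea concrete, here is a minimal sketch (my own illustration, not taken from any named benchmark). Each load depends on the previous one, so the time per step approximates memory latency rather than bandwidth:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16M nodes -- far larger than the caches */

    int main(void)
    {
        size_t *next = malloc(N * sizeof *next);
        /* build one random cycle through the array (Sattolo's algorithm) */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        clock_t t0 = clock();
        size_t p = 0;
        for (size_t k = 0; k < N; k++)
            p = next[p];   /* serialized, mostly cache-missing loads */
        clock_t t1 = clock();
        printf("%.1f ns per load (p=%zu)\n",
               1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / N, p);
        free(next);
        return 0;
    }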

    Although the processor's internal design fundamentally affects how different benchmarks behave at run time (as said above, on the software side the first factor is the code's design, the second is the compiler and linked libraries), the real runtime behavior can also be gamed, as the phone manufacturer I mentioned above did: a manufacturer can play with the processor's settings however it wants, limiting the frequency when running normal applications while scaling the frequency up, and even offloading suitable arithmetic to the GPU, when a running benchmark is detected.

    2. Can the benchmark's score really reflect the maximum capability of the CPU/GPU or a multicore package?

    Now you can see there are many factors to consider when comparing benchmark results across similar phone or desktop platforms, not to mention platform variations such as the OS, the runtime libraries' JIT implementations, or even different ISAs. So how should we use a benchmark's functionality to get the most accurate picture of a processor's capability?

    First, we need to know that a CPU's arithmetic capability is usually advertised as its maximum theoretical FLOPS (floating-point operations per second) when it is released to market. For example, Intel's latest desktop/tablet-level processor, Haswell, has a theoretical 128 Gflops in single precision with one core at 4 GHz, double the flops of its prior tock, Sandy Bridge. Users generally don't have time to dig through the CPU's data sheet to find its Gflops or maximum memory bandwidth; they want a graphical way to see how the new CPU performs against its prior generation and its peers, and which CPU scores highest. They just want a chart listing all the state-of-the-art CPUs from highest score to lowest, as simple (and, please, as true) as possible. So benchmark applications have their market share!
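    For reference, that theoretical number can be derived with a back-of-the-envelope calculation (assuming one Haswell core at 4 GHz):

    2 FMA ports x 8 single-precision lanes per 256-bit AVX register
      x 2 flops per fused multiply-add = 32 flops per cycle
    32 flops/cycle x 4 GHz = 128 Gflops per core

    Sandy Bridge has no FMA units (one 8-wide add port plus one 8-wide multiply port, i.e., 16 flops per cycle), which is why Haswell doubles the figure.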

    But nowadays many benchmarks are misused by some manufacturers for unrighteous marketing purposes. It has become more and more necessary to confront misleading or unfair benchmarking on the market with this fact: the source code of a benchmark application, which is written to stand for typical real-world workloads, should truly reflect the raw, maximum performance of a CPU. For example, if a benchmark aims to measure a CPU's flops, its loop kernel should be optimized thoroughly enough to approach Haswell's theoretical 128 Gflops (in single precision with one core at 4 GHz). By that I mean this customized kernel should have

1). a high ratio of floating-point calculations to loads and stores;

2). the loop fully unrolled, with adds and muls overlapped in parallel via software pipelining, which also depends on the compiler understanding the code well enough to generate optimal machine code.
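Here is a minimal sketch of such a kernel (my own illustration): several independent accumulator chains keep the multiply and add (or FMA) pipes busy every cycle, and everything stays in registers, so the flop-to-load ratio is high:

    #include <stdio.h>

    int main(void)
    {
        float a0 = 1.0f, a1 = 1.1f, a2 = 1.2f, a3 = 1.3f;
        float a4 = 1.4f, a5 = 1.5f, a6 = 1.6f, a7 = 1.7f;
        const float m = 1.000001f, c = 1e-7f;

        for (long i = 0; i < 100000000L; i++) {
            /* eight independent multiply-add chains: enough parallelism
             * to hide the FMA latency, 16 flops per iteration */
            a0 = a0 * m + c;  a1 = a1 * m + c;
            a2 = a2 * m + c;  a3 = a3 * m + c;
            a4 = a4 * m + c;  a5 = a5 * m + c;
            a6 = a6 * m + c;  a7 = a7 * m + c;
        }
        printf("%f\n", a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7);
        return 0;
    }

Compiled with vectorization enabled (e.g., icc -O3 -xHost), each chain widens to a full AVX register and the loop can approach the peak-flops figure above.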

For another example, when a benchmark aims to measure the memory bandwidth of a CPU (or its uncore), the code should truly reflect the maximum memory bandwidth the CPU can deliver, and the result should be close to, or within a reasonable 10% below, the claimed theoretical memory bandwidth. Luckily, the code-design requirements for this kind of benchmark are less demanding, and there are many good benchmarks in this field. For example, to measure how much memory bandwidth my application can be fed on my desktop with a Core™ i7-2600K, the four-threaded STREAM benchmark gives me about 19 GB/s on this multicore CPU, which has a theoretical 21 GB/s of memory bandwidth.
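A minimal triad-style sketch of such a measurement (simplified from the real STREAM benchmark; the array size and the four-thread count are illustrative):

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 20000000L   /* three 160 MB arrays -- well beyond the caches */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for num_threads(4)
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];   /* triad: 2 loads + 1 store */
        double t1 = omp_get_wtime();

        /* STREAM convention: count 3 x 8 bytes moved per element */
        printf("%.1f GB/s\n", 3.0 * 8 * N / (t1 - t0) / 1e9);
        free(a); free(b); free(c);
        return 0;
    }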

      To sum up this post:

1. Don't trust benchmark scores too easily unless you know the theoretical capability of your processor. What you see from the software's source code may not be as optimal as what the hardware can deliver.

2. Many layers, from hardware (the state-of-the-art instruction set, the sub-systems in the microarchitecture design, the runtime CPU configuration such as frequency scaling) to software (the code's design pattern, the compiler's effectiveness, the runtime libraries' responsiveness), contribute to the final benchmark results, so don't compare scores without taking the platform into consideration either.

3. Benchmark applications should truly reflect the capability a CPU can deliver; both the code design and the emitted machine code matter, especially in complicated kernels.

My upcoming posts will discuss performance per watt and per core in the mobile-phone field, and how we should interpret benchmark scores between Bay Trail and the 64-bit A7.

Thanks for reading.

 
