Hi,
It is well known that the performance of the scalar product is limited by the memory bandwidth of the machine:
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    auto n = std::size_t{100000000};
    auto v1 = std::vector<double>(n);
    auto v2 = std::vector<double>(n);
    auto sum = double{0.0};
    // One multiply-add per 16 bytes read from memory, so the loop is memory bound.
    for (auto k = std::size_t{0}; k < n; ++k) {
        sum += v1[k] * v2[k];
    }
    std::printf("%f\n", sum); // keep the optimizer from removing the loop
}
What is surprising is that this loop still benefits from parallelization: if I throw a #pragma omp parallel for just before the loop, I get a 3x speedup on my Core i7-4850HQ. That is unexpected for an algorithm known to be memory bound.
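Concretely, this is the parallel version, reusing the variables from the snippet above (I am showing it with a reduction(+:sum) clause, since a bare parallel for would have a data race on sum; compiled with -fopenmp):

// Each thread accumulates a private partial sum; OpenMP combines them at the end.
#pragma omp parallel for reduction(+:sum)
for (auto k = std::size_t{0}; k < n; ++k) {
    sum += v1[k] * v2[k];
}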
Here is my guess at the explanation:
- The memory prefetcher is there to hide the latency of memory accesses, but it cannot prefetch across a page boundary (4 KiB). With 4 threads we effectively have 4 prefetch streams running at the same time, which helps because a single thread was not saturating the bandwidth of the CPU <-> RAM connection (see the sketch after this list for a way to test this).
- A similar story might apply to the Translation Lookaside Buffer (TLB): with 4 KiB pages, each thread touches a new page every 512 doubles, and several threads can overlap the latency of the resulting TLB misses.
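One way I thought of testing the saturation hypothesis is to sweep the thread count and compute the achieved bandwidth from the traffic (16 bytes read per iteration). A rough sketch, assuming OpenMP (omp_set_num_threads and omp_get_wtime are standard OpenMP calls) and compilation with -fopenmp:

#include <cstddef>
#include <cstdio>
#include <omp.h>
#include <vector>

int main() {
    auto n = std::size_t{100000000};
    auto v1 = std::vector<double>(n, 1.0);
    auto v2 = std::vector<double>(n, 2.0);
    for (int threads = 1; threads <= 8; ++threads) {
        omp_set_num_threads(threads);
        auto sum = double{0.0};
        double t0 = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)
        for (auto k = std::size_t{0}; k < n; ++k) {
            sum += v1[k] * v2[k];
        }
        double t1 = omp_get_wtime();
        // 16 bytes read per iteration; GB/s shows when the bus saturates.
        double gbs = 16.0 * n / (t1 - t0) / 1e9;
        std::printf("%d threads: %.2f GB/s (sum = %f)\n", threads, gbs, sum);
    }
}

If the guess is right, the GB/s figure should climb for the first few threads and then plateau near the machine's memory bandwidth.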
Is any of this explanation correct?