Channel: Intel® C++ Compiler

Memory bandwidth and scalar product


Hi,

It is well known that the performance of the scalar product (dot product) is limited by the memory bandwidth of the machine.

#include <cstddef>
#include <vector>

auto n = std::size_t{100000000};
auto v1 = std::vector<double>(n);
auto v2 = std::vector<double>(n);

// One multiply-add per 16 bytes loaded from memory: very low arithmetic intensity.
auto sum = double{0.0};
for (auto k = std::size_t{0}; k < n; ++k) {
  sum += v1[k] * v2[k];
}

What is surprising is that this loop still benefits from parallelization. If I throw a #pragma omp parallel for just before the loop, I get a 3x speedup on my Core i7-4850HQ. That is quite surprising for an algorithm known to be memory bound.

Here is my guess at the explanation:

- The hardware prefetcher is there to hide the latency of memory access, but it does not prefetch across a page boundary (about 4 kB). With 4 threads there are effectively 4 prefetch streams running at the same time, which makes it faster because with one thread the bandwidth of the CPU <-> RAM connection was not saturated.

- A similar story around the Translation Lookaside Buffer (TLB).

Is either of these explanations correct?

