I recently noticed that programs compiled with Intel icpc are noticeably slower when memory operations are intensive, so I ran some benchmarks on memcpy. It appears that __intel_memcpy is slower than the platform memcpy. Details and results of the benchmark are below. The operating system is Mac OS X Yosemite (10.10), and all tools are up to date. The CPU is a Haswell, and detailed CPU information is also pasted below. The test program uses some utilities from my own library, mainly two classes that count cycles and measure time intervals. For each chunk size, the memcpy call is repeated so that 20GB is copied in total, which makes the cycle counts and time measurements accurate enough.
I also ran the test program through a profiler, and it appears that the icpc-compiled test program calls __intel_memcpy rather than __intel_fast_memcpy or __intel_new_memcpy. I imagine __intel_fast_memcpy would be faster. When is it called? Is there a compiler option I need in order to enable it?
In addition, if I am not mistaken, the theoretical peak on this CPU should be about 0.03125 cpB (cycles per byte), that is, 32 bytes per cycle, when the data fits in cache. With clang, performance for 2KB - 16KB buffers is not far from this value. In contrast, the Intel compiler never gets close to it. Instead it appears to peak at about 0.06 cpB, which is roughly 16 bytes per cycle.
I would very much appreciate it if anyone could explain why Intel's supposedly optimized memcpy is slower. Many thanks in advance.
$ cpuid_info # a small program I wrote to query basic flags in cpuid
==========================================================================================
Vendor ID                     GenuineIntel
Processor brand               Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
==========================================================================================
Deterministic cache parameters
------------------------------------------------------------------------------------------
Cache level                   1              1              2              3
Cache type                    Data           Instruction    Unified        Unified
Cache size (byte)             32K            32K            256K           6M
Maximum Proc sharing          2              2              2              16
Maximum Proc physical         8              8              8              8
Coherency line size (byte)    64             64             64             64
Physical line partitions      1              1              1              1
Ways of associative           8              8              8              12
Number of sets                64             64             512            8192
Self initializing             Yes            Yes            Yes            Yes
Fully associative             No             No             No             No
Write-back invalidate         No             No             No             No
Cache inclusiveness           No             No             No             Yes
Complex cache indexing        No             No             No             Yes
==========================================================================================
Processor info and features
------------------------------------------------------------------------------------------
ACPI AES APIC AVX CLFSH CMOV CX16 CX8 DE DS DS_CPL DTES64 EST F16C FMA FPU FXSR HTT
MCA MCE MMX MONITOR MOVBE MSR MTRR OSXSAVE PAE PAT PBE PCID PCLMULQDQ PDCM PGE POPCNT
PSE PSE_36 RDRAND SEP SMX SS SSE SSE2 SSE3 SSE4_1 SSE4_2 SSSE3 TM TM2 TSC TSC_DEADLINE
VME VMX X2APIC XSAVE XTPR
==========================================================================================
Extended features
------------------------------------------------------------------------------------------
AVX2 BMI1 BMI2 ERMS FSGSBASE HLE INVPCID RTM SMEP
==========================================================================================
Extended processor info and features
------------------------------------------------------------------------------------------
ABM GBPAGES LAHF_LM LM NX RDTSCP SYSCALL
==========================================================================================
$ icpc --version
icpc (ICC) 15.0.0 20140716
Copyright (C) 1985-2014 Intel Corporation. All rights reserved.
$ clang --version
Apple LLVM version 6.0 (clang-600.0.54) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.0.0
Thread model: posix
$ cat test.cpp
#include <cstring>
#include <fstream>
#include <iostream>
#include <random>
#include <vector>
// plus the headers from my own library that provide the aligned allocator,
// vsmc::RDTSCPCounter and vsmc::StopWatch

int main (int argc, char **argv)
{
    std::mt19937_64 eng;
    std::uniform_real_distribution<double> runif(0, 1);
    const std::size_t NMax = 1024U * 1024U * 256U;

    // vectors with memory aligned to 32 bytes
    std::vector<double /*, 32-byte aligned allocator from my library */> x(NMax);
    std::vector<double /*, 32-byte aligned allocator from my library */> y(NMax);
    for (std::size_t i = 0; i != NMax; ++i)
        x[i] = runif(eng);

    std::vector<std::size_t> bytes;
    std::vector<double> cpB;
    std::vector<double> GBs;
    std::size_t N = NMax;
    while (N > 0) {
        // R: number of repeats
        // The total size of memory copied (R * N * sizeof(double)) will be
        // about 20GB
        std::size_t R = NMax / N * 10;
        std::size_t B = N * sizeof(double); // bytes per memcpy call

        vsmc::RDTSCPCounter counter; // a class that counts cycles using RDTSCP
        vsmc::StopWatch watch;       // a class that measures time

        watch.start();
        counter.start();
        for (std::size_t r = 0; r != R; ++r) {
            memcpy(y.data(), x.data(), B);
            // modify the source so the compiler cannot conclude that the
            // loop does nothing
            x.front() += 1.0;
        }
        counter.stop();
        watch.stop();

        double dbytes = static_cast<double>(B * R);
        bytes.push_back(B);
        cpB.push_back(counter.cycles() / dbytes);   // cycles per byte
        GBs.push_back(dbytes / watch.nanoseconds()); // bytes per ns = GB/s
        N /= 2;
    }

    for (std::size_t i = 0; i != bytes.size(); ++i) {
        if (bytes[i] >= 1024 * 1024 * 1024)
            std::cout << bytes[i] / (1024.0 * 1024 * 1024) << "GB\t";
        else if (bytes[i] >= 1024 * 1024)
            std::cout << bytes[i] / (1024.0 * 1024) << "MB\t";
        else if (bytes[i] >= 1024)
            std::cout << bytes[i] / 1024.0 << "KB\t";
        else
            std::cout << bytes[i] << "B: ";
        std::cout << cpB[i] << "cpB" << '\t';
        std::cout << GBs[i] << "GB/s" << '\n';
    }

    // Output y.front(), again so that the compiler cannot conclude that all
    // those memcpy calls are for nothing
    std::ofstream dummy("dummy_file");
    dummy << y.front();
    dummy.close();

    return 0;
}
$ clang++ -std=c++11 -march=native -mavx2 -O3 -DNDEBUG -o test test.cpp; nm test | grep memcpy; ./test
                 U _memcpy
2GB     0.416656cpB     7.66246GB/s
1GB     0.385954cpB     8.27198GB/s
512MB   0.387439cpB     8.24028GB/s
256MB   0.382988cpB     8.33605GB/s
128MB   0.383861cpB     8.3171GB/s
64MB    0.382132cpB     8.35473GB/s
32MB    0.385583cpB     8.27994GB/s
16MB    0.381789cpB     8.36224GB/s
8MB     0.333294cpB     9.57894GB/s
4MB     0.21869cpB      14.5988GB/s
2MB     0.157544cpB     20.2649GB/s
1MB     0.151573cpB     21.0631GB/s
512KB   0.150837cpB     21.1659GB/s
256KB   0.135304cpB     23.5957GB/s
128KB   0.113332cpB     28.1704GB/s
64KB    0.11268cpB      28.3335GB/s
32KB    0.113212cpB     28.2001GB/s
16KB    0.0344147cpB    92.7684GB/s
8KB     0.031262cpB     102.124GB/s
4KB     0.0334106cpB    95.5566GB/s
2KB     0.0388868cpB    82.0998GB/s
1KB     0.0440008cpB    72.5579GB/s
512B:   0.0601694cpB    53.0602GB/s
256B:   0.0923376cpB    34.5754GB/s
128B:   0.156117cpB     20.4501GB/s
64B:    0.328138cpB     9.72948GB/s
32B:    0.513387cpB     6.21871GB/s
16B:    0.617065cpB     5.17386GB/s
8B:     1.32244cpB      2.41418GB/s
$ icpc -std=c++11 -xHost -O3 -DNDEBUG -o test test.cpp; nm test | grep memcpy; ./test
00000001000087c0 T ___intel_memcpy
00000001000087c0 T ___intel_new_memcpy
0000000100005480 T __intel_fast_memcpy
2GB     0.618551cpB     5.16144GB/s
1GB     0.562556cpB     5.67518GB/s
512MB   0.552512cpB     5.77835GB/s
256MB   0.531465cpB     6.00719GB/s
128MB   0.537667cpB     5.93789GB/s
64MB    0.537064cpB     5.94456GB/s
32MB    0.534799cpB     5.96974GB/s
16MB    0.530338cpB     6.01995GB/s
8MB     0.593578cpB     5.37859GB/s
4MB     0.364334cpB     8.76286GB/s
2MB     0.206092cpB     15.4912GB/s
1MB     0.187371cpB     17.039GB/s
512KB   0.183633cpB     17.3858GB/s
256KB   0.184216cpB     17.3308GB/s
128KB   0.114155cpB     27.9673GB/s
64KB    0.113653cpB     28.0908GB/s
32KB    0.114386cpB     27.9109GB/s
16KB    0.0781414cpB    40.8568GB/s
8KB     0.060229cpB     53.0078GB/s
4KB     0.0622433cpB    51.2924GB/s
2KB     0.0675664cpB    47.2514GB/s
1KB     0.078558cpB     40.6401GB/s
512B:   0.0994097cpB    32.1157GB/s
256B:   0.142734cpB     22.3676GB/s
128B:   0.179139cpB     17.822GB/s
64B:    0.239995cpB     13.3028GB/s
32B:    0.443505cpB     7.19858GB/s
16B:    0.878139cpB     3.63566GB/s
8B:     1.54955cpB      2.06035GB/s