Intel ___intel_memcpy slower than platform memcpy

I recently noticed that programs compiled with Intel icpc are noticeably slower when memory operations are intensive, so I benchmarked memcpy. It appears that __intel_memcpy is slower than the platform memcpy. Below are the details and results of the benchmark. The operating system is Mac OS X Yosemite (10.10), and all tools are up to date. The CPU is a Haswell; detailed CPU information is pasted below. The test program uses some utilities from my own library, mainly two classes that count cycles and measure time intervals. For each buffer size, the memcpy call is repeated so that about 20GB is copied in total, which should make the cycle counts and time measurements accurate enough.

I have also run the test program through a profiler, and it appears that the icpc-compiled test program calls __intel_memcpy rather than __intel_fast_memcpy or __intel_new_memcpy. I imagine __intel_fast_memcpy would be faster. When is it called? Is there a compiler option I need in order to enable it?

In addition, if I am not mistaken, the theoretical peak on this CPU should be about 0.03125 cpB (32 bytes per cycle) when the data fits into cache. With clang, for 2KB - 16KB buffers, the performance is not far off this value. In contrast, the Intel compiler never gets close to it. Instead it appears to peak at about 0.06 cpB (about 16 bytes per cycle).

I would very much appreciate it if anyone could explain why Intel's supposedly optimized memcpy is slower. Many thanks in advance.

$ cpuid_info # a small program I wrote to query basic flags in cpuid

==========================================================================================
Vendor ID                  GenuineIntel
Processor brand            Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
==========================================================================================
Deterministic cache parameters
------------------------------------------------------------------------------------------
Cache level                           1           1           2           3
Cache type                         Data Instruction     Unified     Unified
Cache size (byte)                   32K         32K        256K          6M
Maximum Proc sharing                  2           2           2          16
Maximum Proc physical                 8           8           8           8
Coherency line size (byte)           64          64          64          64
Physical line partitions              1           1           1           1
Ways of associative                   8           8           8          12
Number of sets                       64          64         512        8192
Self initializing                   Yes         Yes         Yes         Yes
Fully associative                    No          No          No          No
Write-back invalidate                No          No          No          No
Cache inclusiveness                  No          No          No         Yes
Complex cache indexing               No          No          No         Yes
==========================================================================================
Processor info and features
------------------------------------------------------------------------------------------
ACPI           AES            APIC           AVX            CLFSH          CMOV
CX16           CX8            DE             DS             DS_CPL         DTES64
EST            F16C           FMA            FPU            FXSR           HTT
MCA            MCE            MMX            MONITOR        MOVBE          MSR
MTRR           OSXSAVE        PAE            PAT            PBE            PCID
PCLMULQDQ      PDCM           PGE            POPCNT         PSE            PSE_36
RDRAND         SEP            SMX            SS             SSE            SSE2
SSE3           SSE4_1         SSE4_2         SSSE3          TM             TM2
TSC            TSC_DEADLINE   VME            VMX            X2APIC         XSAVE
XTPR
==========================================================================================
Extended features
------------------------------------------------------------------------------------------
AVX2           BMI1           BMI2           ERMS           FSGSBASE       HLE
INVPCID        RTM            SMEP
==========================================================================================
Extended processor info and features
------------------------------------------------------------------------------------------
ABM            GBPAGES        LAHF_LM        LM             NX             RDTSCP
SYSCALL
==========================================================================================

$ icpc --version

icpc (ICC) 15.0.0 20140716
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

$ clang --version

Apple LLVM version 6.0 (clang-600.0.54) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.0.0
Thread model: posix

$ cat test.cpp

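#include <cstddef>
#include <cstring>
#include <fstream>
#include <iostream>
#include <random>
#include <vector>
// Also required: the headers of the library that provides
// vsmc::RDTSCPCounter and vsmc::StopWatch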

int main (int argc, char **argv)
{
    std::mt19937_64 eng;
    std::uniform_real_distribution<double> runif(0, 1);

    const std::size_t NMax = 1024U * 1024U * 256U;

    // vectors with memory aligned as 32 bytes
    std::vector> x(NMax);
    std::vector> y(NMax);
    for (std::size_t i = 0; i != NMax; ++i)
        x[i] = runif(eng);

    std::vector<std::size_t> bytes;
    std::vector<double> cpB;
    std::vector<double> GBs;

    std::size_t N = NMax;
    while (N > 0) {
        // R: Number of repeats
        // The total size of memory copied (R * N * sizeof(double)) will be
        // about 20GB
        std::size_t R = NMax / N * 10;
        std::size_t B = N * sizeof(double); // bytes
        vsmc::RDTSCPCounter counter; // A class to count cycles using RDTSCP
        vsmc::StopWatch watch;       // A class to measure time

        watch.start();
        counter.start();
        for (std::size_t r = 0; r != R; ++r) {
            memcpy(y.data(), x.data(), B);
            x.front() += 1.0;
            // keep the compiler from being too smart and seeing that the
            // loop does nothing
        }
        counter.stop();
        watch.stop();

        double dbytes = static_cast<double>(B * R);
        bytes.push_back(B);
        cpB.push_back(counter.cycles() / dbytes);
        GBs.push_back(dbytes / watch.nanoseconds());

        N /= 2;
    }

    for (std::size_t i = 0; i != bytes.size(); ++i) {
        if (bytes[i] >= 1024 * 1024 * 1024)
            std::cout << bytes[i] / (1024.0 * 1024 * 1024) << "GB\t";
        else if (bytes[i] >= 1024 * 1024)
            std::cout << bytes[i] / (1024.0 * 1024) << "MB\t";
        else if (bytes[i] >= 1024)
            std::cout << bytes[i] / 1024.0 << "KB\t";
        else
            std::cout << bytes[i] << "B: ";
        std::cout << cpB[i] << "cpB"<< '\t';
        std::cout << GBs[i] << "GB/s"<< '\n';
    }

    // Output y.front(), again to keep an overly clever compiler from seeing
    // that all those memcpy calls are for nothing
    std::ofstream dummy("dummy_file");
    dummy << y.front();
    dummy.close();

    return 0;
}

$ clang++ -std=c++11 -march=native -mavx2 -O3 -DNDEBUG -o test test.cpp; nm test | grep memcpy; ./test

                 U _memcpy
2GB    0.416656cpB    7.66246GB/s
1GB    0.385954cpB    8.27198GB/s
512MB    0.387439cpB    8.24028GB/s
256MB    0.382988cpB    8.33605GB/s
128MB    0.383861cpB    8.3171GB/s
64MB    0.382132cpB    8.35473GB/s
32MB    0.385583cpB    8.27994GB/s
16MB    0.381789cpB    8.36224GB/s
8MB    0.333294cpB    9.57894GB/s
4MB    0.21869cpB    14.5988GB/s
2MB    0.157544cpB    20.2649GB/s
1MB    0.151573cpB    21.0631GB/s
512KB    0.150837cpB    21.1659GB/s
256KB    0.135304cpB    23.5957GB/s
128KB    0.113332cpB    28.1704GB/s
64KB    0.11268cpB    28.3335GB/s
32KB    0.113212cpB    28.2001GB/s
16KB    0.0344147cpB    92.7684GB/s
8KB    0.031262cpB    102.124GB/s
4KB    0.0334106cpB    95.5566GB/s
2KB    0.0388868cpB    82.0998GB/s
1KB    0.0440008cpB    72.5579GB/s
512B: 0.0601694cpB    53.0602GB/s
256B: 0.0923376cpB    34.5754GB/s
128B: 0.156117cpB    20.4501GB/s
64B: 0.328138cpB    9.72948GB/s
32B: 0.513387cpB    6.21871GB/s
16B: 0.617065cpB    5.17386GB/s
8B: 1.32244cpB    2.41418GB/s

$ icpc -std=c++11 -xHost -O3 -DNDEBUG -o test test.cpp; nm test | grep memcpy; ./test

00000001000087c0 T ___intel_memcpy
00000001000087c0 T ___intel_new_memcpy
0000000100005480 T __intel_fast_memcpy
2GB    0.618551cpB    5.16144GB/s
1GB    0.562556cpB    5.67518GB/s
512MB    0.552512cpB    5.77835GB/s
256MB    0.531465cpB    6.00719GB/s
128MB    0.537667cpB    5.93789GB/s
64MB    0.537064cpB    5.94456GB/s
32MB    0.534799cpB    5.96974GB/s
16MB    0.530338cpB    6.01995GB/s
8MB    0.593578cpB    5.37859GB/s
4MB    0.364334cpB    8.76286GB/s
2MB    0.206092cpB    15.4912GB/s
1MB    0.187371cpB    17.039GB/s
512KB    0.183633cpB    17.3858GB/s
256KB    0.184216cpB    17.3308GB/s
128KB    0.114155cpB    27.9673GB/s
64KB    0.113653cpB    28.0908GB/s
32KB    0.114386cpB    27.9109GB/s
16KB    0.0781414cpB    40.8568GB/s
8KB    0.060229cpB    53.0078GB/s
4KB    0.0622433cpB    51.2924GB/s
2KB    0.0675664cpB    47.2514GB/s
1KB    0.078558cpB    40.6401GB/s
512B: 0.0994097cpB    32.1157GB/s
256B: 0.142734cpB    22.3676GB/s
128B: 0.179139cpB    17.822GB/s
64B: 0.239995cpB    13.3028GB/s
32B: 0.443505cpB    7.19858GB/s
16B: 0.878139cpB    3.63566GB/s
8B: 1.54955cpB    2.06035GB/s
