Using Intel® C++ Composer XE for Multiple Simple Random Sampling without Replacement

Introduction

Random sampling is often used when pre- or post-processing of all records of the entire data set is expensive, as in the following examples. When the file of records or database is too large, retrieval cost for one record is too high. In further physical examination of the real-world entity described by a record, fiscal audit of financial records, or medical examinations of sampled patients for epidemiological studies, post-processing of one data record is too time-consuming. Random sampling is typically used to support statistical analysis of an entire data set and some aggregate statistic estimation (such as average), to estimate parameters of interest, or to perform hypothesis testing. Typical applications of random sampling are financial audit, fissile materials audit, epidemiology, exploratory data analysis and graphics, statistical quality control, polling and marketing research, official surveys and censuses, statistical database security and privacy, etc.

Problem statement

Definitions:

The population to be sampled is assumed to be a set of records (tuples) of a known size N.
A fixed-size random sample is a random sample for which the sample size is a specified constant M.
A simple random sample without replacement (SRSWOR) is a subset of the elements of a population where each element is equally likely to be included in the sample and no duplicates are allowed.

We need to generate multiple fixed size simple random samples without replacement. Each sample is unbiased, i.e., item (record) in each sample was chosen from the whole population with equal probability 1/N, independently of others. All samples are independent.

Note: We consider a special case of problems where all records are numbered using natural numbers from 1 to N, so we do not need access to population items themselves (or we have array of indexes of population items).

In other words, we need to conduct a series of experiments, each generating a sequence of M unique random natural numbers from 1 to N (1≤M≤N).

The attached program uses M=6 and N=49, conducts 119 696 640 experiments, generates a large number of result samples (sequences of length M) in the single array RESULTS_ARRAY, and uses all available parallel threads. In the program, we call each experiment a “lottery M of N”.

Considered approaches to simulate one experiment

Algorithm 1

A straightforward algorithm to simulate one experiment is as follows:

           A1.1: let RESULTS_ARRAY be empty

            A1.2: for i from 1 to M do:

                A1.3: generate random natural number X from {1,...,N}

                A1.4: if X is already present in RESULTS_ARRAY (loop), then go to A1.3

                A1.5: put X at the end of RESULTS_ARRAY

            End.

In more detail, step A1.4 is the “for” loop of length i-1:

            A1.4.1: for k from 1 to i-1:

            A1.4.2: if RESULTS_ARRAY[i]==X, then go to A1.3

Algorithm 2

This algorithm uses the partial “Fisher-Yates shuffle” algorithm. Each experiment is treated as a partial length-M random shuffle of the whole population of N elements. It needs M random numbers. The algorithm is as follows:

            A2.1: (Initialization step) let PERMUT_BUF contain natural numbers 1, 2, ..., N

            A2.2: for i from 1 to M do:

                A2.3: generate random integer X uniform on {i,...,N}

                A2.4: interchange PERMUT_BUF[i] and PERMUT_BUF[X]

            A2.5: (Copy step) for i from 1 to M do: RESULTS_ARRAY[i]=PERMUT_BUF[i]

            End.

Explanation: each iteration of the loop A2.2 works as a real lottery step. Namely, in each step, we extract random item X from remaining items in the bin PERMUT_BUF[i], ..., PERMUT_BUF[N] and put it at the end of the results row PERMUT_BUF[1],...,PERMUT_BUF[i]. The algorithm is partial because we do not generate full permutation of length N, but only a part of length M.

At the cost of more memory and extra Initialization and Copy steps (loops), Algorithm 2 needs fewer random numbers than Algorithm 1, and does not have the second nested loop A1.4 with “if” branching. Therefore, we chose to use Algorithm 2.

In the case of simulating many experiments, Initialization step is needed only once because at the beginning of each experiment, the order of natural numbers 1...N in the PERMUT_BUF array does not matter (like in real lottery).

Note that in our C program (attached), zero-based arrays are used.

Optimization

We use Intel® C++ Compiler, with its OpenMP* implementation, and Intel® MKL shipped with Intel® Composer XE 2013 SP1.

Parallelization

We exploit all CPUs with all available processor cores by using OpenMP* (see “#pragma parallel for” in the code, and see [4] for more details about OpenMP usage).

We use Intel® MKL MT2203 BRNG since it easily supports a parallel independent stream in each thread (see [3] for details).

     #pragma omp parallel for num_threads(THREADS_NUM)

     for( thr=0; thr<THREADS_NUM; thr++ ) { // thr is thread index

         VSLStreamStatePtr stream;

         // RNG initialization

        vslNewStream( &stream, VSL_BRNG_MT2203+thr, seed );

         ... // Generation of experiment samples (in thread number thr)

         vslDeleteStream( &stream );

     }

Generation of experiment samples

In each thread, we generate EXPERIM_NUM/THREADS_NUM experiment results. For each experiment we call Fisher_Yates_shuffle function that implements steps A2.2, A2.3, and A2.4 of the core algorithm to generate the next results sample. After that we copy the generated sample to RESULTS_ARRAY (step A2.5) as shown below:

     //  A3.1: (Initialization step) let PERMUT_BUF contain natural numbers 1, 2, ..., N

     for(i=0; i<N; i++) PERMUT_BUF[i]=i+1; // we will use the set {1,...,N}

     for(sample_num=0;sample_num<EXPERIM_NUM/THREADS_NUM;sample_num++) {

         Fisher_Yates_shuffle(...);
     

         for(i=0; i<M; i++)

             RESULTS_ARRAY[thr*ONE_THR_PORTION_SIZE + sample_num*M + i] = PERMUT_BUF[i];

     }

Fisher_Yates_shuffle function

The function implements steps A2.2, A2.3, and A2.4 of the core algorithm (chooses a random item from the remaining part of PERMUT_BUF and places this item at the end of the output row, namely, to PERMUT_BUF[i]):

            for(i=0; i<M; i++) {

                j = Next_Uniform_Int(...);            

                tmp = PERMUT_BUF[i];

                PERMUT_BUF[i] = PERMUT_BUF[j];

                PERMUT_BUF[j] = tmp;

            }

Next_Uniform_Int function

In step A2.3 of the core algorithm, our program calls the Next_Uniform_Int function to generate the next random integer X, uniform on {i,...,N-1}.

To exploit the full power of vectorized RNGs from Intel MKL, but to hide vectorization overheads, the generator must be called to generate a sufficiently large vector D_UNIFORM01_BUF of size RNGBUFSIZE that fits the L1 cache. Each thread uses its own buffer D_UNIFORM01_BUF and the index D_UNIFORM01_IDX pointing to after the random number from that buffer used last. In the first call to Next_Uniform_Int function (or in the case all random numbers from the buffer have been used), we regenerate the full buffer of random numbers again by calling to vdRngUniform function with the length RNGBUFSIZE and set the index D_UNIFORM01_IDX to zero (in fact, the index was already set to zero a while before):

 vdRngUniform( ... RNGBUFSIZE, D_UNIFORM01_BUF ... );

Because Intel MKL provides only generators of random values with same distribution, but in step A2.3 we need random integers on different intervals, we fill our buffer with double-precision random numbers uniformly distributed on [0;1) and then, in the “Integer scaling step”, we convert these double-precision values to the needed integer intervals. Fortunately, we know that our algorithm in step A2.3 will need this sequence of numbers, distributed as follows:

            number 0   distributed on {0,...,N-1}   = 0   + {0,...,N-1}
            number 1   distributed on {1,...,N-1}   = 1   + {0,...,N-2}

            ...

            number M-1 distributed on {M-1,...,N-1} = M-1 + {0,...,N-M}

            (then repeat previous M steps)

            number M     distributed on: see (0)
            number M+1   distributed on: see (1)

            ...

            number 2*M-1 distributed on: see (M-1)
            (then again repeat previous M steps)
            ...

            etc.

Hence, “Integer scaling step” looks like this:

            // Integer scaling step

            for(i=0;i<RNGBUFSIZE/M;i++)

                for(k=0;k<M;k++)

                    I_RNG_BUF[i*M+k] =

                        k + (unsigned int)(D_UNIFORM01_BUF[i*M+k] * (double)(N-1-k));

Notes:

RNGBUFSIZE must be a multiple of M;
This double-nested loop is not suitable for good vectorization because M=6 is not a multiple of 8 (8 is the number of integers in the Intel® Advanced Vector Extensions (Intel® AVX) vector register);
Even if we interchange loops “for i” and “for k” and choose RNGBUFSIZE/M to be multiple of 8, this double-nested loop is not suitable for good vectorization, because we will store the results not contiguously in memory;
We put scaled integers I_RNG_BUF[i*M+k] into the same buffer where we put double-precision random values D_UNIFORM01_BUF[i*M+k]. Although depending on the CPU type, it may be preferable to have a separate buffer for integers, so that both buffers together fit L1 cache. Separate buffers allow to avoid store-after-load forwarding penalty stalls that might occur because the size of loaded double-precision values is not equal to the size of stored integers.

Conclusions

The attached, Intel C++ Composer XE based, implementation of the algorithm presented in this article for the case of 119 696 640 experiments of “lottery 6 of 49” runs ~24*13 times faster than the sequential algorithm based on the sequential scalar version using GNU* Scientific Library (GSL)+GNU Compiler Collection (GCC).

Measured work time is:

0.216 sec (algorithm presented in this article);
69.321 sec (sequential scalar algorithm, based on GSL+GCC, i.e., using gsl_ran_choose function, sequential RNG gsl_rng_mt19937 from GSL, gcc 4.4.6 20110731 with options -O2 -mavx -I$GSL_ROOT/include -L$GSL_ROOT/lib -lgsl -lgslcblas).

The measurements were done on the following platform:

CPU: 2 x 3d-Generation Intel® Core™ i7 processor 2.5GHz, 2*12 cores, 30MB L3 cache size, hyper-threading off;
OS: Red Hat Enterprise Linux* Server release 6.2, x86_64;
Software: Intel® C++ Composer XE 2013 SP1 (with Intel C++ Compiler 13.1.1 and Intel MKL 11.0.3).

Program code attached (see lottery6of49.c file).

References

[1] D. Knuth. The Art of Computer Programming. Volume 2. Section 3.4.2 Random Sampling and Shuffling. Algorithm S, Algorithm P;

[2] Intel® Math Kernel Library Reference Manual, available at https://software.intel.com/en-us/intel-software-technical-documentation?..., section “Statistical Functions”, subsection “Random Number Generators”;

[3] Intel® MKL Vector Statistical Library Notes, available at https://software.intel.com/en-us/intel-software-technical-documentation?..., section “Independent Streams. Block-Splitting and Leapfrogging” about usage of several independent streams of VSL_BRNG_MT2203;

[4] User and Reference Guide for the Intel® C++ Compiler, available at https://software.intel.com/en-us/intel-software-technical-documentation?..., section “Key Features”, subsection “OpenMP support”;

[5] GNU Scientific Library (GSL), available at http://www.gnu.org/software/gsl, documentation section “18 Random Number Generation” about gsl_rng_alloc() and gsl_rng_mt19937 and subsection “20.38 Shuffling and Sampling” about gsl_ran_choose() function.

Simple Random Sampling without Replacement

Unique Random Numbers Generation