1 Introduction

Parallelization is an effective option for reducing the running time of computationally intensive algorithms. However, to effectively parallelize stochastic algorithms, such as Monte Carlo methods, genetic algorithms, or simulations of stochastic processes, we need to be able to generate random numbers in parallel. Consequently, we need a parallel implementation of a random number generator (RNG).

Most programming languages already provide an efficient and sufficiently random RNG in their standard libraries. However, these implementations are sequential. Libraries with parallel implementations exist, but only a few can be run on a graphics processing unit (GPU) [2, 10, 12, 18, 20, 24] [https://developer.nvidia.com/curand], and each library implements at most a few RNGs. In most cases, a user who requires a parallel RNG must therefore either implement it or include an entire RNG library that provides it.

It is also not clear which parallel RNG is the most efficient or which RNGs are too flawed for practical use. The quality and performance of sequential RNGs have been extensively evaluated [8]; however, the only evaluation of GPU implementations we have found compares a small number of RNGs on a single GPU [12], and most of those RNGs have known flaws [8]. We are not aware of any comparison of RNGs across different GPUs and CPUs.

The main goal of our research was to prepare a library based on a general approach to parallelizing RNGs and a set of parallel implementations of different RNGs, which users can easily include in their algorithms. Additionally, we performed several experiments that provide insights into the effectiveness of the RNGs and the practical usefulness of the sequences of numbers that they generate.

1.1 GPUs and OpenCL

GPUs are powerful parallel processing units. If an algorithm can be effectively parallelized, it will usually run significantly faster on a GPU compared to a CPU, especially if the algorithm is computationally complex.

Our implementation is written in OpenCL, which allows the same application to run on a multi-core CPU as well as on a many-core GPU [25, 26]. In OpenCL, functions can be run on hundreds or thousands of threads in parallel on a GPU. These functions are termed kernels. The host (CPU) runs the host program, which can be written in C, C++, Python, etc. The host program initializes the compute device, copies data to its memory (if needed), and sets the parameters of execution. The most important parameters are the number of created threads and their organization into work groups.

2 RandomCL library

We implemented a library that contains 22 RNGs. The library is header-only and can be used on any operating system that supports OpenCL. The RNGs from the library can be executed on any OpenCL-enabled CPU or GPU, regardless of device vendor. The library is available at https://github.com/bstatcomp/RandomCL under the BSD-3 license.

All the RNGs can generate random numbers in the following formats: unsigned 32-bit integers, unsigned 64-bit integers, 32-bit floating-point numbers, or 64-bit double precision floating-point numbers. Integers are generated between 0 and a generator-dependent upper bound. Floating-point RNGs generate numbers between 0 and 1.
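For illustration, the sketch below draws one number in each format from the tyche_i generator inside a kernel. The per-format function names follow what we assume is RandomCL's naming convention (generator name plus output type) and should be checked against the library's documentation.

    #include "tyche_i.cl"

    // Illustrative sketch only; the per-format function names are an
    // assumption based on the library's naming convention.
    kernel void formats_example(global ulong* seed, global double* out) {
        uint gid = get_global_id(0);
        tyche_i_state state;
        tyche_i_seed(&state, seed[gid]);
        uint   u32 = tyche_i_uint(state);    // unsigned 32-bit integer
        ulong  u64 = tyche_i_ulong(state);   // unsigned 64-bit integer
        float  f32 = tyche_i_float(state);   // float between 0 and 1
        double f64 = tyche_i_double(state);  // double between 0 and 1
        // Combine the values so the compiler does not discard them.
        out[gid] = f64 + f32 + (double)(u32 & 1) + (double)(u64 & 1);
    }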

The typical use of the library consists of the following steps. First, random seeds are generated for each thread using a sequential generator from a standard library. Seeds are then copied to the computational device’s global memory. Next, the OpenCL kernel is run on a device. It first initializes one generator for each thread using the previously generated seeds. Finally, the stochastic application’s kernel calls the RNG function when random values are needed. This way random numbers are generated during the execution of the stochastic algorithm/application.
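As a sketch of the host-side part of these steps (in C++, one of the host languages mentioned above), the following function generates one seed per thread with the standard library's sequential Mersenne Twister and copies the seeds to the device's global memory. The function name and buffer flags are illustrative, not part of the RandomCL API, and error handling is omitted.

    #include <CL/cl.h>
    #include <random>
    #include <vector>

    // One seed per thread, generated sequentially on the host, then
    // copied to the device's global memory. `context` and `queue` come
    // from the usual OpenCL setup calls.
    cl_mem create_seed_buffer(cl_context context, cl_command_queue queue,
                              size_t num_threads) {
        std::vector<cl_ulong> seeds(num_threads);
        std::mt19937_64 seeder(std::random_device{}());
        for (cl_ulong& s : seeds) s = seeder();   // step 1: sequential seeding

        // step 2: copy the seeds to device memory
        cl_mem buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                    num_threads * sizeof(cl_ulong),
                                    NULL, NULL);
        clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                             num_threads * sizeof(cl_ulong),
                             seeds.data(), 0, NULL, NULL);
        return buf;  // passed to the kernel as its seed argument
    }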

The library also supports generating random numbers in batches beforehand. In this case, a simple kernel is first run to generate random numbers and save them to the device's global memory. The algorithm that requires the random numbers is run afterward in a separate kernel and can access the pre-generated random numbers in global memory. However, if the algorithm requires many random numbers, we expect this option to be slower, since generating random numbers can be significantly faster than loading them from the slow global memory on most devices.

2.1 Implemented random number generators

A pseudo-RNG is an algorithm that outputs a sequence of numbers that appears random. While deterministic, good RNGs generate a sequence of seemingly unpredictable numbers that can be used to simulate a random process.

In general, a random number generator consists of a state x, a state transition function f, and an output function g. To generate a random number y, the state of the generator is first advanced, \(x_n=f(x_{n-1})\), and the output is then computed as \(y_n=g(x_n)\).

In practice, the output function is usually simple, sometimes even the identity. For generators with a large state it often returns just a part of the generator state.
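As a concrete illustration of this structure, the following C sketch uses a 64-bit LCG (described further below) as the transition function f and truncation to the upper 32 bits as the output function g. The constants are Knuth's well-known MMIX parameters, chosen purely for illustration; they are not necessarily those of any generator in the library.

    #include <stdint.h>

    typedef struct { uint64_t x; } rng_state;

    /* Transition function f: here a 64-bit LCG; the modulus m = 2^64
       is implicit in the overflow of unsigned 64-bit arithmetic. */
    static void f(rng_state *s) {
        s->x = s->x * 6364136223846793005ULL + 1442695040888963407ULL;
    }

    /* Output function g: returns a part of the state, namely its
       upper 32 bits. */
    static uint32_t g(const rng_state *s) {
        return (uint32_t)(s->x >> 32);
    }

    /* Generating a number: advance the state, then apply the output
       function, i.e., y_n = g(f(x_{n-1})). */
    uint32_t next_random(rng_state *s) {
        f(s);
        return g(s);
    }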

When choosing which RNGs to implement, we opted for some well-known RNGs, such as the linear congruential generator and Mersenne Twister. Other RNGs were chosen because they pass the tests from the TestU01 library [8] and are relatively efficient. We implemented the following RNGs:

  • ISAAC (Indirection, Shift, Accumulate, Add, and Count) [5] is an RNG intended for cryptographic purposes. We implemented isaac, but it does not work on graphics cards because it requires unaligned memory access.

  • KISS (Keep It Simple, Stupid) [13, 15] is a common name for three compound RNGs by the same author. We implemented the second, kiss99, proposed in 1999, and the third, kiss09, proposed in 2009. Their components are LCG, xorshift, and MWC generators. kiss99 is a 32-bit RNG, while kiss09 is a 64-bit RNG. Both pass BigCrush, even though none of their components do.

  • Lagged Fibonacci Generator [16], defined by lags r and p and a binary operation \(*\), generates numbers according to the equation \(x_n=x_{n-r}*x_{n-p}\). If \(*\) is addition, subtraction, or exclusive-or, the resulting generators are known to have poor quality [8]. We implemented a lagged Fibonacci generator lfib using multiplication.

  • Linear Congruential Generator (LCG) [7] generates random numbers according to the equation \(x_{n}=(x_{n-1}*a+b) \, \mathrm{mod} \, m\), where a, b, and m are parameters. If m is a power of 2, the implementation is very simple and fast. LCGs are known to be poor generators, especially if m is a power of 2, but they can still pass the BigCrush battery (a test suite, described in Sect. 3.1) if only a part of the state is returned [21]. We implemented a 128-bit LCG that returns the upper 64 bits (lcg12864) and a 64-bit LCG that returns the upper 32 bits (lcg6432). The latter does not pass BigCrush [21].

  • Mersenne Twister [17] is one of the most popular RNGs. It is based on a large linear feedback shift register (LFSR) and a linear output function. However, it does not pass BigCrush. We implemented the Mersenne Twister mt19937.

  • Middle Square Weyl Sequence [29] generates the next number by squaring the previous one and swapping the lower and upper halves of its bits; a number generated by a Weyl sequence [14] is then added. A Weyl sequence produces the next number by adding a constant to the previous one. We implemented the 64-bit middle square Weyl sequence msws.

  • Multiple Recursive Generator (MRG) [6] of order k generates random numbers according to the equation \(x_n=(a_1 x_{n-1} + \cdots +a_k x_{n-k}) \, \mathrm{mod} \, m\), where \(a_i\) and m are parameters. It is one of the most commonly used parallel generators. We implemented three versions: mrg31k3p [9], mrg32k3a [6], and mrg63k3a [6].

  • Multiply With Carry (MWC) [4] with state (x, c) and parameters a and m generates random numbers according to the equation \(x_{n}=(x_{n-1}*a+c_{n-1}) \, \mathrm{mod} \, m\). In each step, it also updates c according to the equation \(c_n=\left\lfloor {(x_{n-1}*a+c_{n-1}) / m}\right\rfloor \). A modification of MWC exists that returns \(x+c\) instead of x, in an attempt to improve the quality of the generator [27]. We implemented the modified MWC generator mwc64x.

  • Permuted Congruential Generator (PCG) [21] combines an LCG and a non-trivial output function. Multiple versions with different output functions exist. We implemented a 64-bit generator that returns 32-bit numbers, pcg6432. To generate a random number, the LCG is advanced, and the state is shifted and xor-ed with the unshifted state. The uppermost four bits of the result then determine which 32 bits are returned.

  • Philox (Product HIgh LOw Xorshift) [24] is a counter-based RNG. This means that its state transition function is just an increment, while the output function is more complex. It can even be used without storing a state, simply by applying its output function to some other variable in the algorithm, such as a loop counter. It is based on the ideas of cryptographic block ciphers: multiple rounds of a bit-scrambling operation. We implemented a 10-round Philox RNG that works on two 32-bit numbers, philox2x32-10.

  • Ran2 [23] combines two 32-bit LCGs. A table is used to save some results of the first LCG. The result of the second LCG is combined with a randomly chosen previous result of the first LCG. We implemented ran2.

  • Tiny Mersenne Twister [18] is a smaller version of the Mersenne Twister, intended for situations where little memory is available for storing the generator state, for example, on GPUs. The original implementation is already compatible with OpenCL; we only modified the interface and initialization to match the other generators in RandomCL. There is a 32-bit version, tinymt32, and a 64-bit version, tinymt64.

  • Tyche [19] is a random number generator based on the quarter-round function of the ChaCha cipher. Tyche-i uses a state transition function that is the inverse of Tyche's, which allows it to exploit the instruction-level parallelism of modern processors and makes it slightly faster. We implemented both tyche and tyche_i.

  • WELL (Well-Equidistributed Long-period Linear) [22] generators were created as an improvement on the Mersenne Twister. While WELL has some nice theoretical properties, it still fails some tests in BigCrush. We implemented the smallest version of the generator, well512, which has a 512-bit state.

  • Xorshift [14] generates a random number from the previous number by shifting it and xor-ing it with the unshifted version three times, using a different shift each time. Xorshift has been shown to be mathematically equivalent to a linear feedback shift register (LFSR) generator [1]. The 64-bit xorshift does not pass BigCrush on its own [8]. We implemented a 1024-bit xorshift generator xorshift1024 [12]. Its state is advanced jointly by 32 threads.

  • Xorshift* [28] is an xorshift generator with a non-trivial output function: a multiplication by a constant. We implemented a 64-bit generator that returns 32 bits of its state, xorshift_star; returning only 32 bits is what makes it pass BigCrush [21]. The sketch after this list illustrates both the xorshift step and the xorshift* output function.
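To make the distinction between the transition and the output function of the last two items concrete, the following C sketch shows one 64-bit xorshift step and an xorshift*-style output function. The shift triple (13, 7, 17) and the multiplier are well-known constants from the literature and are not necessarily the parameters used in our implementations.

    #include <stdint.h>

    /* One xorshift step: shift and xor with the unshifted value three
       times, using a different shift each time. */
    uint64_t xorshift64_step(uint64_t x) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return x;
    }

    /* xorshift*-style generator: the same transition followed by a
       non-trivial output function, a multiplication by a constant,
       of which the upper 32 bits are returned. */
    uint32_t xorshift64_star(uint64_t* state) {
        *state = xorshift64_step(*state);
        return (uint32_t)((*state * 2685821657736338717ULL) >> 32);
    }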

2.2 Parallelization

There are several different approaches to generating random numbers in a parallel algorithm. We use random initialization: each thread has its own instance of a generator, initialized to a random state. While this is efficient and applicable to any generator, it is possible that the generated streams overlap. However, for generators with sufficiently long periods, the probability of overlap is negligible [11]. All generators we implemented have a period of at least \(2^{64}\), so the probability of overlap is small. For LCGs, it has been shown that random initialization yields better-quality numbers than equally spaced substreams [12].
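As a rough, commonly used back-of-the-envelope approximation, the probability that T randomly seeded streams, each of length L, taken from a generator with period \(\rho \) overlap is

$$\begin{aligned} p_{\text {overlap}} \approx \frac{T^2 L}{\rho }. \end{aligned}$$

For example, with \(T=1024\) threads, each consuming \(L=10^9\) numbers from a generator with period \(\rho =2^{64}\approx 1.8\times 10^{19}\), this gives \(p_{\text {overlap}} \approx 6\times 10^{-5}\).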

There are several possible alternatives to our approach. A trivial alternative would be to generate a sequence of random numbers sequentially—possibly in advance. However, this approach is slow as it does not scale with the number of threads.

Next, we could use a different generator for each thread; the same algorithm with a different parameter set for each thread would suffice. However, parameter sets that produce streams of good quality exist only for a few RNGs. Even if the quality of all streams is good, the streams must also be tested for independence [11].

If we have T threads, each with a single instance of a generator, we can initialize the generators with T consecutive states. Before the output function is used to generate a number, the generator is advanced not by 1 but by T states. However, for most generators, jumping ahead by multiple states is significantly slower than advancing the state by one.

We could also split the stream of numbers into T substreams of (almost) equal length and initialize each generator to the first state of a different substream. This is realized by initializing all generators to the same state and then advancing each of them by the appropriate number of steps. However, efficient jumping ahead by many steps is only possible for a few RNGs, and advancing one step at a time would be too time-consuming to be practically feasible.
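For generators whose transition is an affine map, such as LCGs, a jump over an arbitrary number of steps can nevertheless be computed in logarithmic time with binary exponentiation. The following C sketch illustrates this standard technique (with the modulus \(2^{64}\) implicit in unsigned overflow); it is not code from the library.

    #include <stdint.h>

    /* Jump an LCG with transition x -> a*x + b (mod 2^64) ahead by
       `steps` states in O(log steps) time. Applying (a1, b1) and then
       (a2, b2) yields the affine map (a1*a2, a2*b1 + b2), so the map
       for 2^k steps is obtained by repeated squaring. */
    uint64_t lcg_jump(uint64_t x, uint64_t a, uint64_t b, uint64_t steps) {
        uint64_t acc_a = 1, acc_b = 0;   /* identity map */
        while (steps > 0) {
            if (steps & 1) {             /* fold the current power in */
                acc_a *= a;
                acc_b = acc_b * a + b;
            }
            b *= a + 1;                  /* square: (a, b) -> (a*a, b*(a+1)) */
            a *= a;
            steps >>= 1;
        }
        return acc_a * x + acc_b;        /* apply the combined map */
    }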

2.3 An example of how to use an RNG

Listing 1 shows how to fill an array in an OpenCL kernel with random numbers using an RNG from the RandomCL library. It uses the tyche_i RNG to generate 32-bit unsigned integers.

Line 1 includes the header file with the implementation of the RNG.

Lines 3–5 contain the kernel function header. This is the function that can be called from the host and executed in parallel on the device. It accepts three arguments. The first argument, num, sets the number of random values to generate. The second argument, seed, is a pointer to an array in global memory that contains the seeds for initializing the generators. Since this example uses one generator per thread, the seed array must contain at least as many seeds as there are threads. The third and final argument, res, is a pointer to an array in global memory where the generated numbers will be stored.

Lines 6 and 7 determine the execution parameters: the total number of threads, gsize, and the index of the thread, gid. Line 8 declares the variable state that stores the state of the RNG. Line 9 initializes the RNG of each thread with one of the seeds. Lines 10–12 generate random numbers and save them in the res array.

Listing 1 An OpenCL kernel that fills an array with random numbers using the tyche_i RNG
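A sketch of the kernel, matching the line-by-line description above and assuming RandomCL's naming convention for the tyche_i functions (the state type tyche_i_state, the seeding function tyche_i_seed, and the generation function tyche_i_uint); the line numbers correspond to the references in the text:

     1  #include "tyche_i.cl"
     2
     3  kernel void fill_random(uint num,
     4                          global ulong* seed,
     5                          global uint* res) {
     6      uint gsize = get_global_size(0);
     7      uint gid = get_global_id(0);
     8      tyche_i_state state;
     9      tyche_i_seed(&state, seed[gid]);
    10      for (uint i = gid; i < num; i += gsize) {
    11          res[i] = tyche_i_uint(state);
    12      }
    13  }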

3 Empirical evaluation

3.1 Testing quality: TestU01

TestU01 [8] is the most commonly used library for empirically testing the quality of RNGs and supersedes other popular libraries, such as Diehard, Dieharder and the NIST statistical test suite.

While statistical testing cannot prove a generator is good, it can be used to search for particular deficiencies. TestU01 defines three test batteries that determine the tests and their parameters. From fastest to most discriminative, they are SmallCrush, Crush, and BigCrush. Sequential implementations of most generators we implemented are known to pass BigCrush (exceptions are lcg6432, mt19937, tinymt32, tinymt64, and well512).

Depending on how they are used, random numbers generated in parallel might or might not be consumed in the same order as they are generated. If they are, the quality of the RNG is exactly the same as that of the sequential implementation of the same RNG. If they are not, this is effectively the same as permuting the order in which the numbers are generated. If each thread works on an independent part of the problem, numbers are consumed in the same order as they are generated, resulting in no permutation. However, if threads work jointly on the same part of the problem, one number from each thread is consumed before the next number from the first thread is consumed.

For example, consider the simple case of generating random numbers and saving them in an array in memory. If this task is done by a sequential program, there is only one obvious ordering: the i-th generated number is saved to the i-th place in the array. In parallel, however, there are two reasonable options. If we have T threads, each generating N numbers (for a total of NT numbers), the i-th number generated by thread t can be saved at index \(N t+i\) or \(T i+t\). The first option is similar to sequential generation: each thread stores the numbers it generates, in sequence, in a contiguous part of the array. In the second option, consecutive numbers in the resulting array are generated by different threads.
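A C sketch of the two layouts; T, N, t, and the function rng_next are hypothetical placeholders introduced purely for this illustration:

    #include <stdint.h>

    uint32_t rng_next(uint64_t* state);  /* hypothetical per-thread generator */

    /* Thread t of T threads generates N numbers. The i-th number it
       generates can be stored in a contiguous block (index N*t + i) or
       interleaved with the other threads' output (index T*i + t). */
    void store_both_orders(uint32_t* blocked, uint32_t* interleaved,
                           uint64_t* state, int t, int T, int N) {
        for (int i = 0; i < N; ++i) {
            uint32_t y = rng_next(&state[t]);
            blocked[N * t + i] = y;      /* option 1: like sequential output */
            interleaved[T * i + t] = y;  /* option 2: consecutive entries come
                                            from different threads */
        }
    }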

These permutations could affect the quality of the generated stream of numbers, which is why we also tested the quality of parallel implementations that return permuted sequences. It is impossible to test the permutations for all possible numbers of threads; we selected 1024 as a representative number of threads and executed the tests with that many.

TestU01 can only test 32-bit numbers, so we tested 64-bit generators three times: the lower 32 bits of each number, the upper 32 bits, and both the lower and the upper 32 bits as two consecutive 32-bit numbers.

3.2 Testing quality: PractRand

PractRand (Practically Random) is a C++ RNG library [3]. It includes a battery of statistical tests in the tradition of Diehard and TestU01, some of which detect statistical flaws that are not covered by TestU01. Two other important advantages of PractRand are its support for multi-core computation and the ease with which tests can be performed on relatively long sequences.

PractRand runs all tests on all the generated data. This is in contrast with the TestU01 test batteries SmallCrush, Crush, and BigCrush, where each test is performed on an independently generated data set and the data size varies from test to test. To reduce computation times, PractRand tests start with a small data set (256 kB), which is repeatedly doubled until the maximum data size (in our case 2 TB) is reached or a test fails.

Note that some RNGs generate only 31 or 63 random bits per number (lfib, mrg31k3p, mrg63k3a, ran2). This does not pose a problem for TestU01, because it ignores the lowest bit of every 32-bit number. We modified the PractRand testing for these four generators so that the missing bit was ignored.

3.3 Testing speed

We tested the speed of the implemented RNGs on several different devices: AMD Radeon R7 260X (2013 mid-range gaming GPU), AMD Ryzen Threadripper 1950X (2017 high-end CPU), Intel Core i5-4690 (2014 mid-range CPU), Intel HD Graphics 4000 (2012 low-end integrated GPU) and NVIDIA GeForce GTX 1070 (2016 high-end gaming GPU).

The performance of a particular generator on a particular device can vary greatly with the number of threads used and how they are divided into work groups. We made no attempt at finding optimal configurations. Instead, we used a simple heuristic that worked relatively well for all generators and devices: we set the number of threads per work group to 256 and the number of work groups to 4 times the number of compute units on the device. In practice, RNGs are usually part of a larger program, and it makes no sense to expect the number of threads to be optimized for the performance of the RNG.
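In terms of the OpenCL host API, the heuristic amounts to the following C sketch (error handling omitted):

    #include <CL/cl.h>

    /* Heuristic work sizes: 256 threads per work group and 4 work
       groups per compute unit on the device. */
    void enqueue_with_heuristic(cl_command_queue queue, cl_kernel kernel,
                                cl_device_id device) {
        cl_uint compute_units = 0;
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);
        size_t local = 256;                         /* threads per work group */
        size_t global = local * 4 * compute_units;  /* total number of threads */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                               0, NULL, NULL);
    }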

Some RNGs generate 32-bit numbers and some generate 64-bit numbers. To avoid the overhead of converting all numbers to either 64 or 32 bits, we tested 32- and 64-bit generators separately and report measured speed in gigabytes per second.

For comparison, we also include generators from other libraries that use OpenCL. Note that we did not include generators that are known to fail tests from the TestU01 library:

  • The Random123 library [24] implements various counter-based RNGs. The Philox family of generators is designed for use on GPUs. We tested the generator philox2x32_10(random123), which is also implemented in our library. Other Philox generators generate more than 64 random bits at once, which is impractical for most use cases.

  • The clRNG library [10] implements four generators, of which we tested three: mrg32k3a(clrng), mrg31k3p(clrng), and philox4x32_10(clrng). Both MRG generators are implemented in the library itself, while Philox is just a wrapper around the implementation from Random123. The library also implements lfsr113, which we did not test, as it is known to fail some tests from the TestU01 library [8].

  • The RANLUXCL library [20] implements the ranlux(ranluxcl) generator, which we included in our tests.

  • The PRNGCL library [2] implements 7 generators. We tested the two that pass testing with the TestU01 library: ranlux(prngcl) and mrg32k3a(prngcl). All other implemented generators are known to fail some of the tests. Unlike all the other mentioned libraries (ours included), the generators from PRNGCL are not intended to be included in the user's kernel program. Instead, a separate kernel is run before the random numbers are required; it generates the random numbers and saves them in memory. This is how we used it when testing efficiency.

  • The original MWC64X generator [27] is also implemented in OpenCL. Except for different initialization, its implementation is practically the same as ours, so it produces numbers at the same speed. That is why we do not list it separately in the table of results.

4 Results

From Table 1, we can see that the parallel implementations of the generators isaac and mt19937 fail at least one of the tests from TestU01. That makes them unsuitable as general-purpose parallel RNGs. kiss09, msws, and lfib also fail some tests but could still be used: the lower 32 bits of kiss09 and msws and the upper 32 bits of lfib still pass all tests, so these generators could be modified to return only half of their output. However, that would effectively halve the speed at which they generate numbers.

Table 2 shows the results of testing the generators with the PractRand library. Most of the generators that fail testing with the TestU01 library also fail with PractRand. PractRand also identifies problems with some generators that pass the testing with the TestU01 library. Only the generators isaac, kiss09, xorshift_star, tyche, tyche_i, and all three MRGs pass the testing.

We also tested the original implementation of MWC64X [27], which uses stream splitting. It failed 4 tests on both Crush and BigCrush while passing the testing with the PractRand library. Therefore, our approach to initialization improves some statistical properties of this generator and degrades others.

Table 1 TestU01 quality tests results
Table 2 PractRand quality tests results
Table 3 Performance tests results

To compare the speeds of generators across devices, we define \(rs_{gd}\), the relative speed of a particular generator g on a particular device d, as the quotient of the speed \(s_{gd}\) of the generator on that device and the speed of the fastest generator implementation on the same device:

$$\begin{aligned} rs_{gd}=\frac{s_{gd}}{\max \nolimits _{i} s_{id}}. \end{aligned}$$

Time measurement results are shown in Table 3. For each generator, we also report its average relative speed and its worst relative speed across all devices. These two numbers summarize the generator's expected average and worst-case performance relative to the best generator. The worst relative speed represents the generator's robustness: a generator can achieve a high average relative speed by being particularly fast on one or a few devices and slow on the others, but to achieve a high worst relative speed, it must perform reasonably well on all tested devices.

5 Discussion and conclusion

For a parallel RNG to be as general as possible, it should pass statistical tests both when run in a single thread and in parallel—as explained in Sect. 3.1. Generators tyche, tyche_i, xorshift_star, and all three versions of mrg pass all the tests, while all the remaining generators fail the testing either when run sequentially or in parallel.

Different generators produce numbers at very different speeds. If we disregard the generators that do not pass all the tests, tyche_i is on average the best generator. It is also the best on the Intel and NVIDIA GPUs and the AMD CPU, and it is practically as good as the best generator on the AMD GPU. Its worst relative speed is low only because of its poor performance on the Intel CPU. Therefore, as a general-purpose generator that runs very fast on almost any device, tyche_i is strongly recommended. If we target a specific device, we can instead select the generator that performs best on the most similar tested device.

Usually, however, an RNG is part of a larger algorithm, and its speed depends on many factors that are affected by both the RNG and the rest of the algorithm in a non-trivial way. To achieve the best possible performance, the speed of the whole algorithm should be measured with each of the fastest generators to find which one works best in the particular case.

In some cases, we might also be able to use generators that fail some of the tests, as the algorithm itself might not be sensitive to the deficiencies of a particular generator. However, in that case, the algorithm should be tested for correctness with each of the candidate generators. Such testing may be beneficial in general, as the algorithm may be sensitive to a deficiency that TestU01 does not test for. The best worst relative speed is achieved by msws. While it does not pass PractRand and only its lower 32 bits pass BigCrush, those lower 32 bits could still be used, resulting in a generator that is very robust in terms of speed across different devices.

The RandomCL library and the work presented in this paper provide users with a library of RNGs that they can easily include in their algorithms. All the RNGs have been thoroughly tested, providing the user with information and guidance on both the statistical properties and the efficiency of each individual RNG. Furthermore, the best generators from our library generate random numbers 4 times faster on average than the best generators from existing libraries.

As part of future work, we will implement and test other RNGs. Two particularly interesting avenues of research would be (a) implementations that can be split into substreams of equal length or that use a different parameter set for each thread and (b) modifying the Middle Square Weyl Sequence RNG so that it passes all tests.