1 Introduction

The fields of probability and statistics are built over the abstract concepts of probability space and random variable. This has given rise to elegant and powerful mathematical theory, but exact implementation of these concepts on conventional computers seems impossible. In practice, random variables and other random objects are simulated by deterministic algorithms. The purpose of these algorithms is to produce sequences of numbers or objects whose behavior is very hard to distinguish from that of their “truly random” counterparts, at least for the application of interest. Key requirements may differ depending on the context. For Monte Carlo methods, the main goal is to reproduce the statistical properties on which these methods are based, so that the Monte Carlo estimators behave as expected, whereas for gambling machines and cryptology, observing the sequence of output values for some time should provide no practical advantage for predicting the forthcoming numbers better than by just guessing at random.

In computational statistics, random variate generation is usually made in two steps: (1) generating imitations of independent and identically distributed (i.i.d.) random variables having the uniform distribution over the interval (0, 1) and (2) applying transformations to these i.i.d. U(0, 1) random variates to generate (or imitate) random variates and random vectors from arbitrary distributions. These two steps are essentially independent and the world’s best experts on them are two different groups of scientists, with little overlap. The expression (pseudo)random number generator (RNG) usually refers to an algorithm used for step (1).

In principle, the simplest way of generating a random variate X with distribution function F from a U(0, 1) random variate U is to apply the inverse of F to U:

$$X = {F}^{-1}(U)\stackrel{\mathrm{def}}{=}\min \{x\mid F(x) \geq U\}.$$
(3.1)

This is the inversion method. It is easily seen that X has the desired distribution: \(P[X \leq x] = P[{F}^{-1}(U) \leq x] = P[U \leq F(x)] = F(x)\). Other methods are sometimes preferable when \({F}^{-1}\) is too difficult or expensive to compute, as will be seen later.
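As a small illustration, here is a sketch of the two-step procedure in Python for the exponential distribution with rate λ, where \(F(x) = 1 - {e}^{-\lambda x}\) gives the closed form \({F}^{-1}(u) = -\ln(1 - u)/\lambda\); the generator behind random.random() stands in for step (1):

```python
import math
import random

def exponential_inverse(u, rate=1.0):
    # F(x) = 1 - exp(-rate*x), so F^{-1}(u) = -ln(1 - u) / rate.
    return -math.log(1.0 - u) / rate

u = random.random()          # step (1): a U(0,1) variate
x = exponential_inverse(u)   # step (2): inversion
```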

The remainder of this chapter is organized as follows. In the next section, we give a definition and the main requirements of a uniform RNG. Generators based on linear recurrences modulo a large integer m, their lattice structure and quality criteria, and their implementation, are covered in Sect. 3.3. In Sect. 3.4, we have a similar discussion for RNGs based on linear recurrences modulo 2. Nonlinear RNGs are briefly presented in Sect. 3.5. In Sect. 3.6, we discuss empirical statistical testing of RNGs and give some examples. Section 3.7 contains a few pointers to recommended RNGs and software. In Sect. 3.8, we cover non-uniform random variate generators. We first discuss inversion and its implementation in various settings. We then explain the rejection, ratio-of-uniform, composition and convolution methods, provide pointers to other methods that apply in special cases, and discuss multivariate distributions.

Important basic references that we recommend are Knuth (1998), L’Ecuyer (1994, 1998), Niederreiter (1992), and Tezuka (1995) for uniform RNGs, and Devroye (1986, 2006), Gentle (2003), and Hörmann et al. (2004) for non-uniform RNGs.

2 Uniform Random Number Generators

2.1 Physical Devices

Random numbers can be generated via physical mechanisms such as the timing between successive events in atomic decay, thermal noise in semiconductors, photon counting and photon trajectory detectors, and the like. A key issue when constructing a RNG based on a physical device is that a “random” or “chaotic” output does not suffice; the numbers produced must be, at least to a good approximation, realizations of independent and uniformly distributed random variables. If the device generates a stream of bits, which is typical, then each bit should be 0 or 1 with equal probability, and be independent of all the other bits. In general, this cannot be proved, so one must rely on the results of empirical statistical testing to be convinced that the output values have the desired statistical behavior, at least approximately. Not all these devices are reliable, but some are and, as far as we know, they pass all statistical tests that can be run in reasonable time.

For computational statistics, physical devices have several disadvantages compared to a good algorithmic RNG that fits in a few lines of code. For example, (a) they are much more cumbersome to install and run; (b) they are more costly; (c) they are slower; (d) they cannot reproduce exactly the same sequence twice. Item (d) is important in several contexts, including program verification and debugging as well as comparison of similar systems by simulation with common random numbers to reduce the variance (Bratley et al. 1987; Fishman 1996; Law and Kelton 2000). Nevertheless, these physical RNGs can be useful for selecting the seed of an algorithmic RNG, more particularly for applications in cryptology and for gaming machines, where frequent reseeding of the RNG with an external source of entropy (or randomness) is important. A good algorithmic RNG whose seed is selected at random can be viewed as an extensor of randomness, stretching a short random seed into a long sequence of pseudorandom numbers.

2.2 Generators Based on a Deterministic Recurrence

RNGs used for simulation and other statistical applications are almost always based on deterministic algorithms that fit the following framework, taken from L’Ecuyer (1994): a RNG is a structure \((\mathcal{S},\mu,f,\mathcal{U},g)\) where \(\mathcal{S}\) is a finite set of states (the state space), μ is a probability distribution on \(\mathcal{S}\) used to select the initial state (or seed) \({s}_{0}\), \(f : \mathcal{S}\rightarrow \mathcal{S}\) is the transition function, \(\mathcal{U}\) is the output space, and \(g : \mathcal{S}\rightarrow \mathcal{U}\) is the output function. Usually, \(\mathcal{U} = (0,1)\), and we shall assume henceforth that this is the case. The state of the RNG evolves according to the recurrence \({s}_{i} = f({s}_{i-1})\), for i ≥ 1, and the output at step i is \({u}_{i} = g({s}_{i}) \in \mathcal{U}\). The output values \({u}_{0},{u}_{1},{u}_{2},\ldots \) are the so-called random numbers produced by the RNG.
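A minimal sketch of this framework in Python, with a toy linear congruential transition function; the parameters m = 101 and a = 12 are illustrative only, not a usable generator:

```python
import random

M, A = 101, 12                  # toy modulus and multiplier (illustrative only)

def mu():                       # seed distribution: uniform over S \ {0}
    return random.randrange(1, M)

def f(s):                       # transition function f : S -> S
    return (A * s) % M

def g(s):                       # output function g : S -> (0, 1)
    return s / M

s = mu()                        # s_0
for i in range(5):
    s = f(s)                    # s_i = f(s_{i-1})
    print(g(s))                 # u_i = g(s_i)
```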

Because \(\mathcal{S}\) is finite, there must be some finite l ≥ 0 and j > 0 such that \({s}_{l+j} = {s}_{l}\). Then, for all i ≥ l, one has \({s}_{i+j} = {s}_{i}\) and \({u}_{i+j} = {u}_{i}\), because both f and g are deterministic. That is, the state and output sequences are eventually periodic. The smallest positive j for which this happens is called the period of the RNG, and is denoted by ρ. When l = 0, the sequence is said to be purely periodic. Obviously, \(\rho\leq\vert \mathcal{S}\vert \), the cardinality of \(\mathcal{S}\). If the state has a k-bit representation on the computer, then \(\rho\leq {2}^{k}\). Good RNGs are designed so that their period ρ is not far from that upper bound. In general, the value of ρ may depend on the seed \({s}_{0}\), but good RNGs are normally designed so that the period is the same for all admissible seeds.

In practical implementations, it is important that the output be strictly between 0 and 1, because \({F}^{-1}(U)\) is often infinite when U is 0 or 1. All good implementations take care of that. However, for the mathematical analysis of RNGs, we often assume that the output space is [0, 1) (i.e., 0 is admissible), because this simplifies the analysis considerably without making much difference in the mathematical structure of the generator.

2.3 Quality Criteria

What important quality criteria should we consider when designing RNGs? An extremely long period is obviously essential, to make sure that no wrap-around over the cycle can occur in practice. The length of the period must be guaranteed by a mathematical proof. The RNG must also be efficient (run fast and use only a small amount of memory), repeatable (able to reproduce exactly the same sequence as many times as we want), and portable (work the same way in different software/hardware environments). The availability of efficient jump-ahead methods that can quickly compute \({s}_{i+\nu }\) given \({s}_{i}\), for any large ν and any i, is also very useful, because it permits one to partition the RNG sequence into long disjoint streams and substreams of random numbers, to create an arbitrary number of virtual generators from a single RNG (Law and Kelton 2000; L’Ecuyer et al. 2002; L’Ecuyer 2008). These virtual generators can be used on parallel processors or to support different sources of randomness in a large simulation model, for example.

To show that these properties are not sufficient, consider a RNG with state space \(\mathcal{S} =\{ 0,\ldots,{2}^{1000} - 1\}\), transition function \({s}_{i+1} = f({s}_{i}) = ({s}_{i} + 1)\quad{\rm mod}\,\,{2}^{1000}\), and \({u}_{i} = g({s}_{i}) = {s}_{i}/{2}^{1000}\). This RNG has period \({2}^{1000}\) and enjoys all the nice properties described in the preceding paragraph, but it is far from imitating “randomness.”

A sequence of real-valued random variables \({u}_{0},{u}_{1},{u}_{2},\ldots \) is i.i.d. U(0, 1) if and only if for all integers i ≥ 0 and t > 0, the vector \({\mathbf{u}}_{i,t} = ({u}_{i},\ldots,{u}_{i+t-1})\) is uniformly distributed over the t-dimensional unit hypercube \((0,1{)}^{t}\). Of course, this cannot hold for algorithmic RNGs because any vector of t successive values produced by the generator must belong to the finite set

$${\Psi }_{t} =\{ ({u}_{0},\ldots,{u}_{t-1}) : {s}_{0} \in \mathcal{S}\},$$

which is the set of all vectors of t successive output values, from all possible initial states. Here we interpret \({\Psi }_{t}\) as a multiset, which means that the vectors are counted as many times as they appear, and the cardinality of \({\Psi }_{t}\) is exactly equal to that of \(\mathcal{S}\).

Suppose we select the seed \({s}_{0}\) at random, uniformly over \(\mathcal{S}\). This can be approximated by using some physical device, for example. Then, the vector \({\mathbf{u}}_{0,t}\) has the uniform distribution over the finite set \({\Psi }_{t}\). And if the sequence is purely periodic for all \({s}_{0}\), \({\mathbf{u}}_{i,t} = ({u}_{i},\ldots,{u}_{i+t-1})\) is also uniformly distributed over \({\Psi }_{t}\) for all i ≥ 0. Since the goal is to approximate the uniform distribution over \((0,1{)}^{t}\), it immediately becomes apparent that \({\Psi }_{t}\) should be evenly spread over this unit hypercube. In other words, \({\Psi }_{t}\) acts as the sample space from which the vectors of successive output values are drawn at random, so it must be a good approximation of \((0,1{)}^{t}\) in some sense. The design of good-quality RNGs must therefore involve practical ways of measuring the uniformity of the corresponding sets \({\Psi }_{t}\) even when they have huge cardinalities. In fact, a large state space \(\mathcal{S}\) is necessary to obtain a long period, but an even more important reason for having a huge number of states is to make sure that \({\Psi }_{t}\) can be large enough to provide a good uniform coverage of the unit hypercube, at least for moderate values of t.

More generally, we may also want to measure the uniformity of sets of the form

$${\Psi }_{I} =\{ ({u}_{{i}_{1}},\ldots,{u}_{{i}_{t}})\mid {s}_{0} \in \mathcal{S}\},$$

where \(I =\{ {i}_{1},\ldots,{i}_{t}\}\) is a fixed set of non-negative integers such that \(0 \leq {i}_{1} < \cdots < {i}_{t}\). As a special case, we recover \({\Psi }_{t} = {\Psi }_{I}\) when \(I =\{ 0,\ldots,t - 1\}\). Of course, there are so many such sets \({\Psi }_{I}\) that we cannot examine the uniformity over all of them, but we can do it over a selected class \(\mathcal{J}\) of such sets deemed more important.

The uniformity of a set \({\Psi }_{I}\) is typically assessed by measuring the discrepancy between the empirical distribution of its points and the uniform distribution over \((0,1{)}^{t}\) (Niederreiter 1992; L’Ecuyer and Lemieux 2002; L’Ecuyer 2009). Discrepancy measures are equivalent to goodness-of-fit test statistics for the multivariate uniform distribution. They can be defined in many different ways. The choice of a specific definition typically depends on the mathematical structure of the RNG to be studied, and the reason for this is very pragmatic: we must be able to compute these measures quickly even when \(\mathcal{S}\) has very large cardinality, for instance \({2}^{200}\) or more. This obviously excludes any method that requires explicit generation of the sequence over its entire period. The selected discrepancy measure is usually computed for each set I in a predefined class \(\mathcal{J}\), these values are weighted or normalized by factors that depend on I, and the worst-case (or average) over \(\mathcal{J}\) is adopted as a figure of merit used to rank RNGs. The choice of \(\mathcal{J}\) and of the weights is arbitrary. Typically, \(\mathcal{J}\) would contain sets I such that t and \({i}_{t} - {i}_{1}\) are rather small. Examples of such figures of merit will be given when we discuss specific classes of RNGs.

2.4 Statistical Testing

Good RNGs are designed based on mathematical analysis of their properties, then implemented and submitted to batteries of empirical statistical tests. These tests try to detect empirical evidence against the null hypothesis \({\mathcal{H}}_{0}\): “the \({u}_{i}\) are realizations of i.i.d. U(0, 1) random variables.” A test can be defined by any function T that maps a sequence \({u}_{0},{u}_{1},\ldots \) in (0, 1) to a real number X, and for which a good approximation is available for the distribution of the random variable X under \({\mathcal{H}}_{0}\). For the test to be implementable, X must depend on only a finite (but perhaps random) number of \({u}_{i}\)’s. Passing many tests may improve one’s confidence in the RNG, but never guarantees that the RNG is foolproof for all kinds of simulations.

Building a RNG that passes all statistical tests is an impossible dream. Consider, for example, the class of all tests that examine the first (most significant) b bits of n successive output values, \({u}_{0},\ldots,{u}_{n-1}\), and return a binary value X ∈ { 0, 1}. Select α ∈ (0, 1) so that \(\alpha {2}^{nb}\) is an integer and let \({\mathcal{T}}_{n,b,\alpha }\) be the set of tests in this class that return X = 1 for exactly \(\alpha {2}^{nb}\) of the \({2}^{nb}\) possible output sequences. We say that the sequence fails the test when X = 1. This \({\mathcal{T}}_{n,b,\alpha }\) is the set of all statistical tests of (exact) level α. Its cardinality is equal to the number of ways of choosing \(\alpha {2}^{nb}\) distinct objects among \({2}^{nb}\). The chosen objects are the sequences that fail the test. For any given output sequence, the number of tests in \({\mathcal{T}}_{n,b,\alpha }\) that return 1 for this sequence is equal to the number of ways of choosing the other \(\alpha {2}^{nb} - 1\) sequences that also fail the test. This is the number of ways of choosing \(\alpha {2}^{nb} - 1\) distinct objects among \({2}^{nb} - 1\). In other words, as pointed out by Leeb (1995), every output sequence fails exactly the same number of tests! Viewed from a different angle, this is a restatement of the well-known fact that under \({\mathcal{H}}_{0}\), each of the \({2}^{nb}\) possible sequences has the same probability of occurring, so one may argue that none should be considered more random than any other (Knuth 1998).
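To make the counting explicit (an elementary check, not part of the original argument): the fraction of tests in \({\mathcal{T}}_{n,b,\alpha }\) failed by any fixed sequence is

$$\binom{{2}^{nb} - 1}{\alpha {2}^{nb} - 1}\bigg/\binom{{2}^{nb}}{\alpha {2}^{nb}}\ =\ \frac{\alpha {2}^{nb}}{{2}^{nb}}\ =\ \alpha,$$

the same value α for every sequence.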

This viewpoint seems to lead into a dead end. For statistical testing to be meaningful, not all tests should be considered on an equal footing. So which ones are more important? Any answer is tainted with arbitrariness. However, for large values of n, the number of tests is huge and all but a tiny fraction are too complicated even to be implemented. So we may say that bad RNGs are those that fail simple tests, whereas good RNGs fail only complicated tests that are hard to find and run. This common-sense compromise has been generally adopted in one way or another.

Experience shows that RNGs with very long periods, good structure of their set Ψ t , and based on recurrences that are not too simplistic, pass most reasonable tests, whereas RNGs with short periods or bad structures are usually easy to crack by standard statistical tests. For sensitive applications, it is a good idea, when this is possible, to apply additional statistical tests designed in close relation with the random variable of interest (e.g., based on a simplification of the stochastic model being simulated, and for which the theoretical distribution can be computed).

Our discussion of statistical tests continues in Sect. 3.6. A key reference is L’Ecuyer and Simard (2007).

2.5 Cryptographically Strong Generators

One way of defining an ideal RNG would be that no statistical test can distinguish its output sequence from an i.i.d. U(0, 1) sequence. If unlimited computing time is available, no finite-state RNG can satisfy this requirement, because by running it long enough one can eventually figure out its periodicity. But what if we impose a limit on the computing time? This can be analyzed formally in the framework of asymptotic computational complexity theory, under the familiar “rough-cut” assumption that polynomial-time algorithms are practical and others are not.

Consider a family of RNGs \(\{{\mathcal{G}}_{k} = ({\mathcal{S}}_{k},{\mu }_{k},{f}_{k},{\mathcal{U}}_{k},{g}_{k}),\,k = 1,2,\ldots \,\}\) where \({\mathcal{S}}_{k}\) has cardinality \({2}^{k}\) (i.e., \({\mathcal{G}}_{k}\) has a k-bit state). Suppose that the transition and output functions f and g can be computed in time bounded by a polynomial in k. Let \(\mathcal{T}\) be the class of statistical tests that run in time bounded by a polynomial in k and try to differentiate between the output sequence of the RNG and an i.i.d. U(0, 1) sequence. The RNG family is called polynomial-time perfect if there is a constant ε > 0 such that for all k, no test in \(\mathcal{T}\) can differentiate correctly with probability larger than \(1/2 + {e}^{-k\epsilon }\). This is equivalent to asking that no polynomial-time algorithm can predict any given bit of \({u}_{i}\) with probability of success larger than \(1/2 + {e}^{-k\epsilon }\), after observing \({u}_{0},\ldots,{u}_{i-1}\). This links unpredictability with statistical uniformity and independence. For the proofs and additional details, see, e.g., Blum et al. (1986), L’Ecuyer and Proulx (1989), Lagarias (1993), and Luby (1996). This theoretical framework has been used to define a notion of reliable RNG in the context of cryptography. But the guarantee is only asymptotic; it does not necessarily tell what value of k is large enough for the RNG to be secure in practice. Moreover, specific RNG families have been proved to be polynomial-time perfect only under yet unproven conjectures; so far, no one has even been able to prove that such families exist. Most RNGs discussed in the remainder of this chapter are known not to be polynomial-time perfect. However, they are fast, convenient, and have good enough statistical properties when their parameters are chosen carefully.

3 Linear Recurrences Modulo m

3.1 The Multiple Recursive Generator

The most widely used RNGs are based on the linear recurrence

$$\begin{array}{rcl}{ x}_{i}& =& ({a}_{1}{x}_{i-1} + \cdots+ {a}_{k}{x}_{i-k})\quad{\rm mod}\,\,m,\end{array}$$
(3.2)

where m and k are positive integers called the modulus and the order, and the coefficients \({a}_{1},\ldots,{a}_{k}\) are in \({\mathbb{Z}}_{m}\), interpreted as the set \(\{0,\ldots,m - 1\}\) on which all operations are performed with reduction modulo m. The state at step i is \({s}_{i} ={ \mathbf{x}}_{i} = {({x}_{i-k+1},\ldots,{x}_{i})}^{\mathsf{T}}\). When m is a prime number, the finite ring \({\mathbb{Z}}_{m}\) is a finite field and it is possible to choose the coefficients \({a}_{j}\) so that the period reaches \(\rho= {m}^{k} - 1\) (the largest possible value) (Knuth 1998). This maximal period is achieved if and only if the characteristic polynomial of the recurrence, \(P(z) = {z}^{k} - {a}_{1}{z}^{k-1} -\cdots- {a}_{k}\), is a primitive polynomial over \({\mathbb{Z}}_{m}\), i.e., if and only if the smallest positive integer ν such that \(({z}^{\nu }\ {\rm mod}\ P(z))\ {\rm mod}\ m = 1\) is \(\nu= {m}^{k} - 1\). Knuth (1998) explains how to verify this for a given P(z). For k > 1, for P(z) to be a primitive polynomial, it is necessary that \({a}_{k}\) and at least one other coefficient \({a}_{j}\) be nonzero. Finding primitive polynomials of this form is generally easy and they yield the simplified recurrence:

$${x}_{n} = ({a}_{r}{x}_{n-r} + {a}_{k}{x}_{n-k})\quad{\rm mod}\,\,m.$$
(3.3)

A multiple recursive generator (MRG) uses (3.2) with a large value of m and defines the output as \({u}_{i} = {x}_{i}/m\). For k = 1, this is the classical linear congruential generator (LCG). In practice, the output function is modified slightly to make sure that \({u}_{i}\) never takes the value 0 or 1 (e.g., one may define \({u}_{i} = ({x}_{i} + 1)/(m + 1)\), or \({u}_{i} = {x}_{i}/(m + 1)\) if \({x}_{i} > 0\) and \({u}_{i} = m/(m + 1)\) otherwise) but to simplify the theoretical analysis, we will follow the common convention of assuming that \({u}_{i} = {x}_{i}/m\) (in which case \({u}_{i}\) does take the value 0 occasionally).
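The following Python sketch implements one step of an MRG of order k = 3; the modulus is the common prime \({2}^{31} - 1\), but the coefficients are arbitrary placeholders, not a recommended parameter set:

```python
m = 2**31 - 1                     # a common prime modulus
a1, a2, a3 = 177786, 0, -183326   # placeholder coefficients (k = 3)
x = [12345, 23456, 34567]         # seed (x_0, x_1, x_2), not all zero

def next_u():
    # x_i = (a_1*x_{i-1} + a_2*x_{i-2} + a_3*x_{i-3}) mod m,  u_i = x_i / m
    xi = (a1 * x[-1] + a2 * x[-2] + a3 * x[-3]) % m
    x.append(xi)
    x.pop(0)
    return xi / m
```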

3.2 The Lattice Structure

Let \({\mathbf{e}}_{i}\) denote the ith unit vector in k dimensions, with a 1 in position i and 0’s elsewhere. Denote by \({x}_{i,0},{x}_{i,1},{x}_{i,2},\ldots \) the values of \({x}_{0},{x}_{1},{x}_{2},\ldots \) produced by the recurrence (3.2) when the initial state \({\mathbf{x}}_{0}\) is \({\mathbf{e}}_{i}\). An arbitrary initial state \({\mathbf{x}}_{0} = {({z}_{1},\ldots,{z}_{k})}^{\mathsf{T}}\) can be written as the linear combination \({z}_{1}{\mathbf{e}}_{1} + \cdots+ {z}_{k}{\mathbf{e}}_{k}\) and the corresponding sequence is a linear combination of the sequences \(({x}_{i,0},{x}_{i,1},\ldots \,)\), with reduction of the coordinates modulo m. Conversely, any such linear combination reduced modulo m is a sequence that can be obtained from some initial state \({\mathbf{x}}_{0} \in \mathcal{S} = {\mathbb{Z}}_{m}^{k}\). If we divide everything by m we find that for the MRG, for each t ≥ 1, \({\Psi }_{t} = {L}_{t} \cap [0,1{)}^{t}\) where

$${L}_{t} = \left \{\mathbf{v} =\sum\limits_{i=1}^{t}{z}_{ i}{\mathbf{v}}_{i}\;\mid \;{z}_{i} \in\mathbb{Z}\right \},$$

is a t-dimensional lattice in \({\mathbb{R}}^{t}\), with basis

$$\begin{array}{rcl}{ \mathbf{v}}_{1}& =& {(1,0,\ldots,0,{x}_{1,k},\ldots,{x}_{1,t-1})}^{\mathsf{T}}/m \\ \vdots& & \vdots \\ { \mathbf{v}}_{k}& =& {(0,0,\ldots,1,{x}_{k,k},\ldots,{x}_{k,t-1})}^{\mathsf{T}}/m \\ {\mathbf{v}}_{k+1}& =& {(0,0,\ldots,0,1,\ldots,0)}^{\mathsf{T}} \\ \vdots& & \vdots \\ { \mathbf{v}}_{t}& =& {(0,0,\ldots,0,0,\ldots,1)}^{\mathsf{T}}\end{array}.$$

For t ≤ k, \({L}_{t}\) contains all vectors whose coordinates are multiples of \(1/m\). For t > k, it contains a fraction \({m}^{k-t}\) of those vectors.

This lattice structure implies that the points of \({\Psi }_{t}\) are distributed according to a very regular pattern, in equidistant parallel hyperplanes. Graphical illustrations of this, usually for LCGs, can be found in a myriad of papers and books; e.g., Gentle (2003), Law and Kelton (2000), and L’Ecuyer (1998). Define the dual lattice to \({L}_{t}\) as

$${{L}_{t}}^{{_\ast}} =\{ \mathbf{h} \in{\mathbb{R}}^{t} :{ \mathbf{h}}^{\mathsf{T}}\mathbf{v} \in\mathbb{Z}\mbox{ for all }\mathbf{v} \in{L}_{ t}\}.$$

Each \(\mathbf{h} \in {L}_{t}^{{_\ast}}\) is a normal vector that defines a family of equidistant parallel hyperplanes, at distance \(1/\Vert {\mathbf{h}\Vert }_{2}\) apart, and these hyperplanes cover all the points of \({L}_{t}\) unless h is an integer multiple of some other vector \(\mathbf{h}' \in {L}_{t}^{{_\ast}}\). Therefore, if \({\mathcal{l}}_{t}\) is the Euclidean length of a shortest non-zero vector h in \({L}_{t}^{{_\ast}}\), then there is a family of hyperplanes at distance \(1/{\mathcal{l}}_{t}\) apart that cover all the points of \({L}_{t}\). A small \({\mathcal{l}}_{t}\) means there are thick slices of empty space between the hyperplanes and we want to avoid that. A large \({\mathcal{l}}_{t}\) means a better (more uniform) coverage of the unit hypercube by the point set \({\Psi }_{t}\). Computing the value of \(1/{\mathcal{l}}_{t}\) is often called the spectral test (Knuth 1998; Fishman 1996).
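For a small LCG (k = 1), \({\mathcal{l}}_{2}\) can be found by brute force: the dual lattice in two dimensions consists of the integer vectors \(({h}_{1},{h}_{2})\) with \({h}_{1} + a{h}_{2} \equiv 0\ ({\rm mod}\ m)\). The sketch below uses toy parameters (m = 101, a = 12, chosen for illustration only) and enumerates the candidates:

```python
import math

m, a = 101, 12                  # toy LCG: x_i = a*x_{i-1} mod m

best = float(m)                 # (m, 0) is always in the dual lattice
for h2 in range(1, m + 1):
    r = (-a * h2) % m           # h1 must be congruent to -a*h2 mod m
    for h1 in (r, r - m):       # the two candidates smallest in magnitude
        best = min(best, math.hypot(h1, h2))

print("l_2 =", best, " hyperplane distance = 1/l_2 =", 1 / best)
```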

The lattice property holds as well for the point sets \({\Psi }_{I}\) formed by values at arbitrary lags defined by a fixed set of indices \(I =\{ {i}_{1},\ldots,{i}_{t}\}\). One has \({\Psi }_{I} = {L}_{I} \cap [0,1{)}^{t}\) for some lattice \({L}_{I}\), and the largest distance between successive hyperplanes for a family of hyperplanes that cover all the points of \({L}_{I}\) is \(1/{\mathcal{l}}_{I}\), where \({\mathcal{l}}_{I}\) is the Euclidean length of a shortest nonzero vector in \({L}_{I}^{{_\ast}}\), the dual lattice to \({L}_{I}\).

The lattice L I and its dual can be constructed as explained in Couture and L’Ecuyer (1996) and L’Ecuyer and Couture (1997). Finding the shortest nonzero vector in a lattice with basis \({\mathbf{v}}_{1},\ldots,{\mathbf{v}}_{t}\) can be formulated as an integer programming problem with a quadratic objective function:

$$\mbox{ Minimize }\Vert {\mathbf{v}\Vert }^{2} =\sum\limits_{i=1}^{t}\sum\limits_{j=1}^{t}{z}_{ i}{\mathbf{v}}_{i}^{\mathsf{T}}{\mathbf{v}}_{ j}{z}_{j}$$

subject to \({z}_{1},\ldots,{z}_{t}\) integers and not all zero. This problem can be solved by a branch-and-bound algorithm (Fincke and Pohst 1985; L’Ecuyer and Couture 1997; Tezuka 1995).

For any given dimension t and \({m}^{k}\) points per unit of volume, there is an absolute upper bound on the best possible value of \({\mathcal{l}}_{I}\) (Conway and Sloane 1999; Knuth 1998; L’Ecuyer 1999b). Let \({\mathcal{l}}_{t}^{{_\ast}}({m}^{k})\) denote such an upper bound. To define a figure of merit that takes into account several sets I, in different numbers of dimensions, it is common practice to divide \({\mathcal{l}}_{I}\) by this upper bound, to obtain a standardized value between 0 and 1, and then take the worst case over a given class \(\mathcal{J}\) of sets I. This gives a figure of merit of the form

$${M}_{\mathcal{J}} =\min_{I\in \mathcal{J}}\;{\mathcal{l}}_{I}/{\mathcal{l}}_{\vert I\vert }^{{_\ast}}({m}^{k}).$$

A value of \({M}_{\mathcal{J}}\) too close to zero means that \({L}_{I}\) has a bad lattice structure for at least one of the selected sets I. We want a value as close to 1 as possible. Computer searches for good MRGs with respect to this criterion have been reported by L’Ecuyer et al. (1993), L’Ecuyer and Andres (1997), and L’Ecuyer (1999a), for example. In most cases, \(\mathcal{J}\) was simply the class of sets of the form \(I =\{ 1,\ldots,t\}\) for \(t \leq {t}_{1}\), where \({t}_{1}\) was an arbitrary integer ranging from 8 to 45. L’Ecuyer and Lemieux (2000) also consider the small-dimensional sets I with indices not too far apart. They suggest taking \(\mathcal{J} =\{\{ 0,1,\ldots,i\} : i < {t}_{1}\} \cup \{\{{i}_{1},{i}_{2}\} : 0 = {i}_{1} < {i}_{2} < {t}_{2}\} \cup \cdots \cup \{\{{i}_{1},\ldots,{i}_{d}\} : 0 = {i}_{1} < \ldots< {i}_{d} < {t}_{d}\}\) for some positive integers \(d,{t}_{1},\ldots,{t}_{d}\). We could also take a weighted average instead of the minimum in the definition of \({M}_{\mathcal{J}}\).

An important observation is that for t > k, the t-dimensional vector \(\mathbf{h} = {(-1,{a}_{1},\ldots,{a}_{k},0,\ldots,0)}^{\mathsf{T}}\) always belongs to \({L}_{t}^{{_\ast}}\), because for any vector \(\mathbf{v} \in {L}_{t}\), the first k + 1 coordinates of mv must satisfy the recurrence (3.2), which implies that \((-1,{a}_{1},\ldots,{a}_{k},0,\ldots,0)\cdot \mathbf{v}\) must be an integer. Therefore, one always has \({\mathcal{l}}_{t}^{2} \leq 1 + {a}_{1}^{2} + \cdots+ {a}_{k}^{2}\). Likewise, if I contains 0 and all indices j such that \({a}_{k-j}\not =0\), then \({\mathcal{l}}_{I}^{2} \leq 1 + {a}_{1}^{2} + \cdots+ {a}_{k}^{2}\) (L’Ecuyer 1997). This means that the sum of squares of the coefficients \({a}_{j}\) must be large if we want the lattice structure to have any chance of being good.

Constructing MRGs with only two nonzero coefficients and taking these coefficients small has been a very popular idea, because this makes the implementation easier and faster (Deng and Lin 2000; Knuth 1998). However, the MRGs thus obtained have a bad structure. As a worst-case illustration, consider the widely-available additive or subtractive lagged-Fibonacci generator, based on the recurrence (3.2) where the two coefficients \({a}_{r}\) and \({a}_{k}\) are both equal to ± 1. In this case, whenever I contains \(\{0,k - r,k\}\), one has \({\mathcal{l}}_{I}^{2} \leq 3\), so the distance between the hyperplanes is at least \(1/\sqrt{3}\). In particular, for \(I =\{ 0,k - r,k\}\), all the points of \({\Psi }_{I}\) (aside from the zero vector) are contained in only two planes! This type of structure can have a dramatic effect on certain simulation problems and is a good reason for staying away from these lagged-Fibonacci generators, regardless of their parameters. They fail several simple empirical statistical tests (L’Ecuyer and Simard 2007).

A similar problem occurs for the “fast MRG” proposed by Deng and Lin (2000), based on the recurrence

$${x}_{i} = (-{x}_{i-1} + a{x}_{i-k})\quad{\rm mod}\,\,m = ((m - 1){x}_{i-1} + a{x}_{i-k})\quad{\rm mod}\,\,m,$$

with \({a}^{2} < m\). If a is small, the bound \({\mathcal{l}}_{I}^{2} \leq 1 + {a}^{2}\) implies a bad lattice structure for \(I =\{ 0,k - 1,k\}\). A more detailed analysis by L’Ecuyer and Touzin (2004) shows that this type of generator cannot have a good lattice structure even if the condition \({a}^{2} < m\) is removed. Another special case proposed by Deng and Xu (2003) has the form

$${x}_{i} = a({x}_{i-{j}_{2}} + \cdots+ {x}_{i-{j}_{t}})\quad{\rm mod}\,\,m.$$
(3.4)

In this case, for \(I =\{ 0,k - {j}_{t-1},\ldots,k - {j}_{2},k\}\), the vectors \((1,a,\ldots,a)\) and \(({a}^{{_\ast}},1,\ldots,1)\) both belong to the dual lattice \({L}_{I}^{{_\ast}}\), where \({a}^{{_\ast}}\) is the multiplicative inverse of a modulo m. So neither a nor \({a}^{{_\ast}}\) should be small.

To get around this structural problem when I contains certain sets of indices, Lüscher (1994) and Knuth (1998) recommend skipping some of the output values to break up the bad vectors. For the lagged-Fibonacci generator, for example, one can output k successive values produced by the recurrence, then skip the next d values, output the next k, skip the next d, and so on. A large value of d (e.g., d = 5k or more) may get rid of the bad structure, but slows down the generator. See Wegenkittl and Matsumoto (1999) for further discussion.

3.3 MRG Implementation Techniques

The modulus m is often taken as a large prime number close to the largest integer directly representable on the computer (e.g., equal to or near \({2}^{31} - 1\) for 32-bit computers). Since each \({x}_{i-j}\) can be as large as m − 1, one must be careful in computing the right side of (3.2) because the product \({a}_{j}{x}_{i-j}\) is typically not representable as an ordinary integer. Various techniques for computing this product modulo m are discussed and compared by Fishman (1996), L’Ecuyer and Tezuka (1991), L’Ecuyer (1999a), and L’Ecuyer and Simard (1999). Note that if \({a}_{j} = m - {a'}_{j} > 0\), using \({a}_{j}\) is equivalent to using the negative coefficient \(-{a'}_{j}\), which is sometimes more convenient from the implementation viewpoint. In what follows, we assume that \({a}_{j}\) can be either positive or negative.

One approach is to perform the arithmetic modulo m in 64-bit (double precision) floating-point arithmetic (L’Ecuyer 1999a). Under this representation, assuming that the usual IEEE floating-point standard is respected, all positive integers up to \({2}^{53}\) are represented exactly. Then, if each coefficient \({a}_{j}\) is selected to satisfy \(\vert {a}_{j}\vert (m - 1) \leq {2}^{53}\), the product \(\vert {a}_{j}\vert {x}_{i-j}\) will always be represented exactly and \({z}_{j} = \vert {a}_{j}\vert {x}_{i-j}\ {\rm mod}\ m\) can be computed by the instructions

$$y = \vert {a}_{j}\vert {x}_{i-j};\qquad {z}_{j} = y - m\lfloor y/m\rfloor.$$

Similarly, if \((\vert {a}_{1}\vert+ \cdots+ \vert {a}_{k}\vert )(m - 1) \leq {2}^{53}\), \({a}_{1}{x}_{i-1} + \cdots+ {a}_{k}{x}_{i-k}\) will always be represented exactly.
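A sketch of this floating-point technique in Python (whose floats are IEEE doubles); the values below satisfy \(\vert a\vert (m - 1) \leq {2}^{53}\), so the product is exact:

```python
import math

m = 2147483647.0      # m = 2^31 - 1, stored as a double
a = 16807.0           # a*(m-1) ~ 3.6e13 < 2^53, so a*x is exact

def mulmod_float(x):
    y = a * x                          # exact product
    return y - m * math.floor(y / m)   # y mod m
```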

A second technique, called approximate factoring (L’Ecuyer and Côté 1991), uses only the integer representation and works under the condition that \(\vert {a}_{j}\vert= i\) or \(\vert {a}_{j}\vert= \lfloor m/i\rfloor \) for some integer \(i < \sqrt{m}\). One precomputes \({q}_{j} = \lfloor m/\vert {a}_{j}\vert \rfloor \) and \({r}_{j} = m\ {\rm mod}\ \vert {a}_{j}\vert \). Then, \({z}_{j} = \vert {a}_{j}\vert {x}_{i-j}\ {\rm mod}\ m\) can be computed by

$$\begin{array}{rcl} & & y = \lfloor {x}_{i-j}/{q}_{j}\rfloor ;\qquad z = \vert {a}_{j}\vert ({x}_{i-j} - y{q}_{j}) - y{r}_{j}; \\ & & \mbox{ if }z < 0\mbox{ then }{z}_{j} = z + m\mbox{ else }{z}_{j} = z\end{array}.$$

All quantities involved in these computations are integers between − m and m, so no overflow can occur if m can be represented as an ordinary integer (e.g., \(m < {2}^{31}\) on a 32-bit computer).
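The classic pair \(m = {2}^{31} - 1\), a = 16807 satisfies the condition (here \(\vert a\vert = i < \sqrt{m}\)), giving the following sketch of approximate factoring:

```python
m, a = 2**31 - 1, 16807
q, r = m // a, m % a        # q = 127773 and r = 2836, precomputed once

def mulmod_int(x):
    # Computes (a*x) mod m; no intermediate result leaves (-m, m).
    y = x // q
    z = a * (x - y * q) - y * r
    return z + m if z < 0 else z
```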

The powers-of-two decomposition approach selects coefficients \({a}_{j}\) that can be written as a sum or difference of a small number of powers of 2 (Wu 1997; L’Ecuyer and Simard 1999; L’Ecuyer and Touzin 2000). For example, one may take \({a}_{j} = \pm {2}^{q} \pm {2}^{r}\) and \(m = {2}^{e} - h\) for some positive integers q, r, e, and h. To compute \(y = {2}^{q}x\ {\rm mod}\ m\), decompose \(x = {z}_{0} + {2}^{e-q}{z}_{1}\) (where \({z}_{0} = x\ {\rm mod}\ {2}^{e-q}\)) and observe that

$$\begin{array}{rcl} y& =& {2}^{q}({z}_{ 0} + {2}^{e-q}{z}_{ 1})\quad{\rm mod}\,\,({2}^{e} - h)\ =\ ({2}^{q}{z}_{ 0} + h{z}_{1})\quad{\rm mod}\,\,({2}^{e} - h)\end{array}$$

Suppose now that

$$h < {2}^{q}\quad \mbox{ and }\quad h({2}^{q} - (h + 1){2}^{-e+q}) < m.$$
(3.5)

Then, \({2}^{q}{z}_{0} < m\) and \(h{z}_{1} < m\), so y can be computed by shifts, masks, additions, subtractions, and a single multiplication by h. Intermediate results never exceed 2m − 1. Things simplify further if q = 0 or q = 1 or h = 1. For h = 1, y is obtained simply by swapping the blocks of bits \({z}_{0}\) and \({z}_{1}\) (Wu 1997). L’Ecuyer and Simard (1999) pointed out that LCGs with parameters of the form \(m = {2}^{e} - 1\) and \(a = \pm {2}^{q} \pm {2}^{r}\) have bad statistical properties because the recurrence does not “mix the bits” well enough. However, good and fast (combined) MRGs can be obtained via the power-of-two decomposition method, as explained in L’Ecuyer and Touzin (2000).
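A sketch of the decomposition for \(y = {2}^{q}x\ {\rm mod}\ ({2}^{e} - h)\), with toy parameters that satisfy (3.5); only shifts, a mask, one multiplication by h, and a conditional subtraction are needed:

```python
e, q, h = 31, 7, 1
m = (1 << e) - h                     # m = 2^31 - 1

def mul_pow2_mod(x):                 # assumes 0 <= x < m
    z0 = x & ((1 << (e - q)) - 1)    # z0 = x mod 2^(e-q)
    z1 = x >> (e - q)                # x = z0 + 2^(e-q) * z1
    y = (z0 << q) + h * z1           # = 2^q * x (mod m), always below 2m
    return y - m if y >= m else y
```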

Another idea to improve efficiency is to take all nonzero coefficients \({a}_{j}\) equal to the same a, as in (3.4) (Marsaglia 1996; Deng and Xu 2003). Then, computing the right side of (3.2) requires a single multiplication. Deng and Xu (2003) and Deng (2005) provide specific parameter sets and concrete implementations for MRGs of this type, for prime m near \({2}^{31}\), and for k ranging from 102 to 1597.

One may be tempted to take m equal to a power of two, say \(m = {2}^{e}\), because then the “mod m” operation is much easier: it suffices to keep the e least significant bits and mask-out all others. However, taking a power-of-two modulus is not recommended because it has several strong disadvantages in terms of the quality of the RNG (L’Ecuyer 1990, 1998). In particular, the least significant bits have very short periodicity and the period of the recurrence (3.2) cannot exceed \(({2}^{k} - 1){2}^{e-1}\) if k > 1, and \({2}^{e-2}\) if k = 1 and e ≥ 4. The maximal period achievable with k = 7 and \(m = {2}^{31}\), for example, is more than \({2}^{180}\) times smaller than the maximal period achievable with k = 7 and \(m = {2}^{31} - 1\) (a prime number).

3.4 Combined MRGs and LCGs

The conditions that make MRG implementations run faster (e.g., only two nonzero coefficients both close to zero) conflict with those required for having a good lattice structure and statistical robustness. Combined MRGs are one solution to this problem. Consider J distinct MRGs evolving in parallel, based on the recurrences

$$\begin{array}{rcl}{ x}_{j,i}& =& ({a}_{j,1}{x}_{j,i-1} + \cdots+ {a}_{j,k}{x}_{j,i-k})\quad{\rm mod}\,\,{m}_{j}\end{array}$$
(3.6)

where \({a}_{j,k}\not=0\), for \(j = 1,\ldots,J\). Let \({\delta }_{1},\ldots,{\delta }_{J}\) be arbitrary integers,

$${z}_{i} = ({\delta }_{1}{x}_{1,i} + \cdots+ {\delta }_{J}{x}_{J,i})\quad{\rm mod}\,\,{m}_{1},\qquad {u}_{i} = {z}_{i}/{m}_{1},$$
(3.7)

and

$${w}_{i} = ({\delta }_{1}{x}_{1,i}/{m}_{1} + \cdots+ {\delta }_{J}{x}_{J,i}/{m}_{J})\quad{\rm mod}\,\,1.$$
(3.8)

This defines two RNGs, with output sequences {u i , i ≥ 0} and {w i , i ≥ 0}.

Suppose that the \({m}_{j}\) are pairwise relatively prime, that \({\delta }_{j}\) and \({m}_{j}\) have no common factor for each j, and that each recurrence (3.6) is purely periodic with period \({\rho }_{j}\). Let \(m = {m}_{1}\cdots {m}_{J}\) and let ρ be the least common multiple of \({\rho }_{1},\ldots,{\rho }_{J}\). Under these conditions, L’Ecuyer and Tezuka (1991) and L’Ecuyer (1996a) proved the following: (a) the sequence (3.8) is exactly equivalent to the output sequence of a MRG with (composite) modulus m and coefficients \({a}_{j}\) that can be computed explicitly as explained by L’Ecuyer (1996a); (b) the two sequences in (3.7) and (3.8) have period ρ; and (c) if both sequences have the same initial state, then \({u}_{i} = {w}_{i} + {\epsilon }_{i}\) where \({\max }_{i\geq 0}\vert {\epsilon }_{i}\vert \) can be bounded explicitly by a constant ε which is very small when the \({m}_{j}\) are close to each other.

Thus, these combined MRGs can be viewed as practical ways of implementing an MRG with a large m and several large nonzero coefficients. The idea is to cleverly select the components so that: (1) each one is easy to implement efficiently (e.g., has only two small nonzero coefficients) and (2) the MRG that corresponds to the combination has a good lattice structure. If each \({m}_{j}\) is prime and if each component j has maximal period \({\rho }_{j} = {m}_{j}^{k} - 1\), then each \({\rho }_{j}\) is even and ρ cannot exceed \({\rho }_{1}\cdots {\rho }_{J}/{2}^{J-1}\). Tables of good parameters for combined MRGs of different sizes that reach this upper bound are given in L’Ecuyer (1999a) and L’Ecuyer and Touzin (2000), together with C implementations.
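As an illustration, here is a sketch of a combination with two LCG components (J = 2, k = 1) and \({\delta }_{1} = -{\delta }_{2} = 1\) in (3.7); the moduli and multipliers resemble those of an early well-known combined LCG, but treat them as placeholders and use the tabulated combined MRGs cited above for real work:

```python
m1, a1 = 2147483563, 40014    # component 1 (placeholder parameters)
m2, a2 = 2147483399, 40692    # component 2
x1, x2 = 12345, 67890         # seeds, 0 < x_j < m_j

def next_u():
    global x1, x2
    x1 = (a1 * x1) % m1       # step each component
    x2 = (a2 * x2) % m2
    z = (x1 - x2) % m1        # combination (3.7)
    return z / m1
```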

3.5 Jumping Ahead

The recurrence (3.2) can be written in matrix form as

$${\mathbf{x}}_{i} = \mathbf{A}{\mathbf{x}}_{i-1}\quad{\rm mod}\,\,m = \left (\begin{array}{*{10}c} 0 & 1 &\cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 &\cdots& 1 \\ {a}_{k}&{a}_{k-1} & \cdots &{a}_{1} \end{array} \right ){\mathbf{x}}_{i-1}\quad{\rm mod}\,\,m.$$

To jump ahead directly from x i to x i + ν, for an arbitrary integer ν, it suffices to exploit the relationship

$${\mathbf{x}}_{i+\nu } ={ \mathbf{A}}^{\nu }{\mathbf{x}}_{ i}\quad{\rm mod}\,\,m = ({\mathbf{A}}^{\nu }\quad{\rm mod}\,\,m){\mathbf{x}}_{ i}\quad{\rm mod}\,\,m.$$

If this is to be done several times for the same ν, the matrix \({\mathbf{A}}^{\nu }\ {\rm mod}\ m\) can be precomputed once and for all. For a large ν, this can be done in \(O({\log }_{2}\nu )\) matrix multiplications via a standard divide-and-conquer algorithm (Knuth 1998):

$${\mathbf{A}}^{\nu }\quad{\rm mod}\,\,m = \left \{\begin{array}{@{}l@{\quad }l@{}} ({\mathbf{A}}^{\nu /2}\quad{\rm mod}\,\,m)({\mathbf{A}}^{\nu /2}\quad{\rm mod}\,\,m)\quad{\rm mod}\,\,m\quad &\mbox{ if $\nu $ is even;} \\ \mathbf{A}({\mathbf{A}}^{\nu -1}\quad{\rm mod}\,\,m)\quad{\rm mod}\,\,m \quad &\mbox{ if $\nu $ is odd.} \end{array} \right.$$
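In Python, with its unbounded integers, this divide-and-conquer computation of \({\mathbf{A}}^{\nu }\ {\rm mod}\ m\) is a direct transcription (a sketch; production code would use an optimized linear-algebra routine):

```python
def mat_mul(A, B, m):
    """(A*B) mod m for square matrices stored as lists of rows."""
    k = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(k)) % m
             for j in range(k)] for i in range(k)]

def mat_pow(A, nu, m):
    """A^nu mod m by square-and-multiply, O(log2(nu)) multiplications."""
    k = len(A)
    R = [[int(i == j) for j in range(k)] for i in range(k)]  # identity
    while nu > 0:
        if nu & 1:
            R = mat_mul(R, A, m)
        A = mat_mul(A, A, m)
        nu >>= 1
    return R
```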

3.6 Linear Recurrences with Carry

These types of recurrences were introduced by Marsaglia and Zaman (1991) to obtain a large period even when m is a power of two (in which case the implementation may be faster). They were studied and generalized by Tezuka et al. (1994), Couture and L’Ecuyer (1994, 1997), and Goresky and Klapper (2003). The basic idea is to add a carry to the linear recurrence (3.2). The general form of this RNG, called multiply-with-carry (MWC), can be written as

$$\begin{array}{rcl}{ x}_{i}& =& ({a}_{1}{x}_{i-1} + \cdots+ {a}_{k}{x}_{i-k} + {c}_{i-1})d\quad{\rm mod}\,\,b,\end{array}$$
(3.9)
$$\begin{array}{rcl}{ c}_{i}& =& \lfloor ({a}_{0}{x}_{i} + {a}_{1}{x}_{i-1} + \cdots+ {a}_{k}{x}_{i-k} + {c}_{i-1})/b\rfloor,\end{array}$$
(3.10)
$$\begin{array}{rcl}{ u}_{i}& =& \sum\limits_{\mathcal{l}=1}^{\infty }{x}_{ i-\mathcal{l}+1}{b}^{-\mathcal{l}},\end{array}$$
(3.11)

where b is a positive integer (e.g., a power of two), \({a}_{0},\ldots,{a}_{k}\) are arbitrary integers such that \({a}_{0}\) is relatively prime to b, and d is the multiplicative inverse of \(-{a}_{0}\) modulo b. The state at step i is \({s}_{i} = {({x}_{i-k+1},\ldots,{x}_{i},{c}_{i})}^{\mathsf{T}}\). In practice, the sum in (3.11) is truncated to a few terms (it could be a single term if b is large), but the theoretical analysis is much easier for the infinite sum.

Define \(m =\sum\limits_{\mathcal{l}=0}^{k}{a}_{\mathcal{l}}{b}^{\mathcal{l}}\) and let a be the inverse of b in arithmetic modulo m, assuming for now that m > 0. A major result proved in Tezuka et al. (1994), Couture and L’Ecuyer (1997), and Goresky and Klapper (2003) is that if the initial states agree, the output sequence \(\{{u}_{i},\,i\geq 0\}\) is exactly the same as that produced by the LCG with modulus m and multiplier a. Therefore, the MWC can be seen as a clever way of implementing a LCG with very large modulus. Couture and L’Ecuyer (1997) have shown that the value of \({\mathcal{l}}_{t}\) for this LCG satisfies \({\mathcal{l}}_{t}^{2} \leq {a}_{0}^{2} + \cdots+ {a}_{k}^{2}\) for t ≥ k, which means that the lattice structure will be bad unless the sum of squares of the coefficients \({a}_{j}\) is large.

In the original proposals of Marsaglia and Zaman (1991), called add-with-carry and subtract-with-borrow, one has \(-{a}_{0} = \pm {a}_{r} = \pm {a}_{k} = 1\) for some r < k and the other coefficients \({a}_{j}\) are zero, so \({\mathcal{l}}_{t}^{2} \leq 3\) for t ≥ k and the generator has essentially the same structural defect as the additive lagged-Fibonacci generator. In the version studied by Couture and L’Ecuyer (1997), it was assumed that \(-{a}_{0} = d = 1\). Then, the period cannot exceed \((m - 1)/2\) if b is a power of two. A concrete implementation was given in that paper. Goresky and Klapper (2003) pointed out that the maximal period of \(\rho= m - 1\) can be achieved by allowing a more general \({a}_{0}\). They provided specific parameters that give the maximal period for b ranging from \({2}^{21}\) to \({2}^{35}\) and ρ up to approximately \({2}^{2521}\).
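A sketch of an MWC step with k = 1 and \({a}_{0} = -1\) (so d = 1), keeping a single term of (3.11); the multiplier is a placeholder, and the equivalent LCG has modulus \(m = {a}_{1}b - 1\):

```python
b = 2**32                  # the base
a1 = 4294957665            # placeholder multiplier
x, c = 123456789, 1        # state (x_0, c_0)

def next_u():
    global x, c
    t = a1 * x + c         # recurrences (3.9)-(3.10) with k = 1, a_0 = -1
    x, c = t % b, t // b   # new value and new carry
    return x / b           # output (3.11) truncated to one term
```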

4 Generators Based on Recurrences Modulo 2

4.1 A General Framework

It seems natural to exploit the fact that computers work in binary arithmetic and to design RNGs defined directly in terms of bit strings and sequences. We do this under the following framework, taken from L’Ecuyer and Panneton (2002) and L’Ecuyer and Panneton (2009). Let \({\mathbb{F}}_{2}\) denote the finite field with two elements, 0 and 1, in which the operations are equivalent to addition and multiplication modulo 2. Consider the RNG defined by a matrix linear recurrence over \({\mathbb{F}}_{2}\), as follows:

$$\begin{array}{rcl}{ \mathbf{x}}_{i}& =& \mathbf{A}{\mathbf{x}}_{i-1},\end{array}$$
(3.12)
$$\begin{array}{rcl}{ \mathbf{y}}_{i}& =& \mathbf{B}{\mathbf{x}}_{i},\end{array}$$
(3.13)
$$\begin{array}{rcl}{ u}_{i}& =& \sum\limits_{\mathcal{l}=1}^{w}{y}_{i,\mathcal{l}-1}{2}^{-\mathcal{l}}\ =\ 0.\,{y}_{i,0}\,{y}_{i,1}\,{y}_{i,2}\cdots,\end{array}$$
(3.14)

where \({\mathbf{x}}_{i} = {({x}_{i,0},\ldots,{x}_{i,k-1})}^{\mathsf{T}} \in{\mathbb{F}}_{2}^{k}\) is the k-bit state vector at step i, \({\mathbf{y}}_{i} = {({y}_{i,0},\ldots,{y}_{i,w-1})}^{\mathsf{T}} \in{\mathbb{F}}_{2}^{w}\) is the w-bit output vector at step i, k and w are positive integers, A is a k × k transition matrix with elements in \({\mathbb{F}}_{2}\), B is a w × k output transformation matrix with elements in \({\mathbb{F}}_{2}\), and \({u}_{i} \in [0,1)\) is the output at step i. All operations in (3.12) and (3.13) are performed in \({\mathbb{F}}_{2}\).

It is well-known (Niederreiter 1992; L’Ecuyer 1994) that when the x i ’s obey (3.12), for each j, the sequence {x i, j , i ≥ 0} follows the linear recurrence

$${x}_{i,j} = ({\alpha }_{1}{x}_{i-1,j} + \cdots+ {\alpha }_{k}{x}_{i-k,j})\quad{\rm mod}\,\,2,$$
(3.15)

whose characteristic polynomial P(z) is the characteristic polynomial of A, i.e.,

$$P(z) = \mathrm{det}(\mathbf{A} - z\mathbf{I}) = {z}^{k} - {\alpha }_{ 1}{z}^{k-1} -\cdots- {\alpha }_{ k-1}z - {\alpha }_{k},$$

where I is the identity matrix and each \({\alpha }_{j}\) is in \({\mathbb{F}}_{2}\). The sequences \(\{{y}_{i,j},\,i\geq 0\}\), for 0 ≤ j < w, also obey the same recurrence (although some of them may follow recurrences of shorter order as well, depending on B). We assume that \({\alpha }_{k} = 1\), so that the recurrence (3.15) has order k and is purely periodic. Its period is \({2}^{k} - 1\) (i.e., maximal) if and only if P(z) is a primitive polynomial over \({\mathbb{F}}_{2}\) (Niederreiter 1992; Knuth 1998).

To jump ahead directly from \({\mathbf{x}}_{i}\) to \({\mathbf{x}}_{i+\nu }\) with this type of generator, it suffices to precompute the matrix \({\mathbf{A}}^{\nu }\) (in \({\mathbb{F}}_{2}\)) and then multiply \({\mathbf{x}}_{i}\) by this matrix.

Several popular classes of RNGs fit this framework as special cases, by appropriate choices of the matrices A and B. This includes the Tausworthe or LFSR, polynomial LCG, GFSR, twisted GFSR, Mersenne twister, WELL, xorshift, multiple recursive matrix generators, and combinations of these (L’Ecuyer and Panneton 2009; Matsumoto and Nishimura 1998; Niederreiter 1995; Panneton and L’Ecuyer 2005; Panneton et al. 2006; Tezuka 1995). We detail some of them after discussing measures of uniformity.
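For instance, a 32-bit xorshift generator fits the framework: each shift-and-xor step multiplies the state by a binary matrix, so one full step is \({\mathbf{x}}_{i} = \mathbf{A}{\mathbf{x}}_{i-1}\) over \({\mathbb{F}}_{2}\). A sketch with a commonly used full-period shift triple (13, 17, 5):

```python
MASK = 0xFFFFFFFF          # keep everything in 32-bit words

def xorshift32(y):
    # Each line multiplies the state by (I + L^a) or (I + R^a) over F2.
    y ^= (y << 13) & MASK
    y ^= y >> 17
    y ^= (y << 5) & MASK
    return y

y = 2463534242             # any nonzero seed
y = xorshift32(y)
u = y / 2**32              # u_i in [0, 1)
```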

4.2 Measures of Uniformity

The uniformity of point sets \({\Psi }_{I}\) produced by RNGs based on linear recurrences over \({\mathbb{F}}_{2}\) is usually assessed by measures of equidistribution defined as follows (L’Ecuyer 1996b; L’Ecuyer and Panneton 2002; L’Ecuyer 2004; L’Ecuyer and Panneton 2009; Tezuka 1995). For an arbitrary vector \(\mathbf{q} = ({q}_{1},\ldots,{q}_{t})\) of non-negative integers, partition the unit hypercube \([0,1{)}^{t}\) into \({2}^{{q}_{j}}\) intervals of the same length along axis j, for each j. This determines a partition of \([0,1{)}^{t}\) into \({2}^{{q}_{1}+\cdots +{q}_{t}}\) rectangular boxes of the same size and shape. We call this partition the q-equidissection of the unit hypercube.

For some index set \(I =\{ {i}_{1},\ldots,{i}_{t}\}\), if \({\Psi }_{I}\) has \({2}^{k}\) points, we say that \({\Psi }_{I}\) is q-equidistributed in base 2 if there are exactly \({2}^{q}\) points in each box of the q-equidissection, where \(k - q = {q}_{1} + \cdots+ {q}_{t}\). This means that among the \({2}^{k}\) points \(({x}_{{j}_{1}},\ldots,{x}_{{j}_{t}})\) of \({\Psi }_{I}\), if we consider the first \({q}_{1}\) bits of \({x}_{{j}_{1}}\), the first \({q}_{2}\) bits of \({x}_{{j}_{2}}\), …, and the first \({q}_{t}\) bits of \({x}_{{j}_{t}}\), each of the \({2}^{k-q}\) possibilities occurs exactly the same number of times. This is possible only if q ≤ k.

The q-equidistribution of \({\Psi }_{I}\) depends only on the first \({q}_{j}\) bits of \({x}_{{i}_{j}}\) for 1 ≤ j ≤ t, for the points \(({x}_{{i}_{1}},\ldots,{x}_{{i}_{t}})\) that belong to \({\Psi }_{I}\). The vector of these \({q}_{1} + \cdots+ {q}_{t} = k - q\) bits can always be expressed as a linear function of the k bits of the initial state \({\mathbf{x}}_{0}\), i.e., as \({\mathbf{M}}_{q}{\mathbf{x}}_{0}\) for some \((k - q) \times k\) binary matrix \({\mathbf{M}}_{q}\), and it is easily seen that \({\Psi }_{I}\) is q-equidistributed if and only if \({\mathbf{M}}_{q}\) has full rank k − q. This provides an easy way of checking equidistribution (L’Ecuyer 1996b; L’Ecuyer and Panneton 2009; Tezuka 1995).
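The following sketch carries out this rank test for a toy generator: a k = 7 bit recurrence \({x}_{i} = {x}_{i-6} + {x}_{i-7}\ {\rm mod}\ 2\) (a commonly listed primitive trinomial), output bits taken as the leading state bits, index set I = {0, 1}, and q = (2, 2). By linearity, the columns of \({\mathbf{M}}_{q}\) are obtained by running the generator from the unit seeds:

```python
k, t, q = 7, 2, (2, 2)

def step(s):                        # one transition of the bit recurrence
    return s[1:] + [(s[-6] + s[-7]) % 2]

def observed_bits(seed):            # first q_j output bits at steps 0..t-1
    s, bits = list(seed), []
    for j in range(t):
        bits += s[:q[j]]            # output = leading bits of the state
        s = step(s)
    return bits

def rank_gf2(rows):                 # Gaussian elimination over F2
    vecs = [int("".join(map(str, r)), 2) for r in rows]
    rank = 0
    while vecs:
        piv = max(vecs)
        vecs.remove(piv)
        if piv == 0:
            break
        rank += 1
        top = piv.bit_length() - 1  # eliminate piv's leading bit elsewhere
        vecs = [v ^ piv if (v >> top) & 1 else v for v in vecs]
    return rank

cols = [observed_bits([int(i == j) for i in range(k)]) for j in range(k)]
M_q = [list(row) for row in zip(*cols)]      # the (k-q) x k binary matrix
print("q-equidistributed:", rank_gf2(M_q) == sum(q))
```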

If \({\Psi }_{I}\) is \((\mathcal{l},\ldots,\mathcal{l})\)-equidistributed for some \(\mathcal{l}\geq 1\), it is called t-distributed with \(\mathcal{l}\) bits of accuracy, or \((t,\mathcal{l})\)-equidistributed (L’Ecuyer 1996b). The largest value of \(\mathcal{l}\) for which this holds is called the resolution of the set \({\Psi }_{I}\) and is denoted by \({\mathcal{l}}_{I}\). This value has the upper bound \({\mathcal{l}}_{t}^{{_\ast}} =\min (\lfloor k/t\rfloor,\,w)\). The resolution gap of \({\Psi }_{I}\) is defined as \({\delta }_{I} = {\mathcal{l}}_{t}^{{_\ast}}- {\mathcal{l}}_{I}\). In the same vein as for MRGs, a worst-case figure of merit can be defined here by

$${\Delta }_{\mathcal{J}} {=\max }_{I\in \mathcal{J}}{\delta }_{I},$$

where \(\mathcal{J}\) is a preselected class of index sets I.

The point set \({\Psi }_{I}\) is a (q,k,t)-net in base 2 (often called a (t, m, s)-net in the context of quasi-Monte Carlo methods, where a different notation is used; see Niederreiter 1992), if it is \(({q}_{1},\ldots,{q}_{t})\)-equidistributed in base 2 for all non-negative integers \({q}_{1},\ldots,{q}_{t}\) summing to k − q. We call the smallest such q the q-value of \({\Psi }_{I}\). The smaller it is, the better. One candidate for a figure of merit could be the q-value of \({\Psi }_{t}\) for some large t. Although widely used to construct and evaluate low-discrepancy point sets for quasi-Monte Carlo methods, a major drawback of this measure is that it is too costly to compute for good long-period generators (for which k − q is large), because there are too many vectors q for which equidistribution needs to be checked. In practice, one must settle for figures of merit that involve a smaller number of equidissections.

When \({\delta }_{I} = 0\) for all sets I of the form \(I =\{ 0,\ldots,t - 1\}\), for 1 ≤ t ≤ k, the RNG is said to be maximally equidistributed or asymptotically random for the word size w (L’Ecuyer 1996b; Tezuka 1995; Tootill et al. 1973). This property ensures perfect equidistribution of all sets \({\Psi }_{t}\), for any partition of the unit hypercube into subcubes of equal sizes, as long as \(\mathcal{l}\leq w\) and the number of subcubes does not exceed the number of points in \({\Psi }_{t}\). Large-period maximally equidistributed generators, together with their implementations, can be found in L’Ecuyer (1999c), L’Ecuyer and Panneton (2002), Panneton and L’Ecuyer (2004), and Panneton et al. (2006), for example.

4.3 Lattice Structure in Spaces of Polynomials and Formal Series

The RNGs defined via (3.12)–(3.14) do not have a lattice structure in the real space like MRGs, but they do have a lattice structure in a space of formal series, as explained in Couture and L’Ecuyer (2000), L’Ecuyer (2004), L’Ecuyer and Panneton (2009), Lemieux and L’Ecuyer (2003), and Tezuka (1995). The real space \(\mathbb{R}\) is replaced by the space \({\mathbb{L}}_{2}\) of formal power series with coefficients in \({\mathbb{F}}_{2}\), of the form \(\sum\limits_{\mathcal{l}=\omega }^{\infty }{x}_{\mathcal{l}}{z}^{-\mathcal{l}}\) for some integer ω. In that setting, the lattices have the form

$${\mathcal{L}}_{t} = \left \{\mathbf{v}(z) =\sum\limits_{j=1}^{t}{h}_{ j}(z){\mathbf{v}}_{j}(z)\mbox{ such that each }{h}_{j}(z) \in{\mathbb{F}}_{2}[z]\right \},$$

where \({\mathbb{F}}_{2}[z]\) is the ring of polynomials with coefficients in \({\mathbb{F}}_{2}\), and the basis vectors \({\mathbf{v}}_{j}(z)\) are in \({\mathbb{L}}_{2}^{t}\). The elements of the dual lattice \({\mathcal{L}}_{t}^{{_\ast}}\) are the vectors h(z) in \({\mathbb{L}}_{2}^{t}\) whose scalar product with any vector of \({\mathcal{L}}_{t}\) belongs to \({\mathbb{F}}_{2}[z]\). We define the mapping \(\varphi: {\mathbb{L}}_{2} \rightarrow\mathbb{R}\) by

$$\varphi \left (\sum\limits_{\mathcal{l}=\omega }^{\infty }{x}_{ \mathcal{l}}{z}^{-\mathcal{l}}\right ) =\sum\limits_{\mathcal{l}=\omega }^{\infty }{x}_{ \mathcal{l}}{2}^{-\mathcal{l}}.$$

Then, it turns out that the point set \({\Psi }_{t}\) produced by the generator is equal to \(\varphi ({\mathcal{L}}_{t}) \cap[0,1{)}^{t}\). Moreover, the equidistribution properties examined in Sect. 3.4.2 can be expressed in terms of lengths of shortest vectors in the dual lattice, with appropriate definitions of the length (or norm). Much of the theory and algorithms developed for lattices in the real space can be adapted to these new types of lattices (Couture and L’Ecuyer 2000; L’Ecuyer et al. 2009; Tezuka 1995).

4.4 The LFSR Generator

The Tausworthe or linear feedback shift register (LFSR) generator (Tausworthe 1965; L’Ecuyer 1996b; Tezuka 1995) is a special case of (3.12)–(3.14) with \(\mathbf{A} ={ \mathbf{A}}_{0}^{s}\) (in \({\mathbb{F}}_{2}\)) for some positive integer s, where

$${ \mathbf{A}}_{0} = \left (\begin{array}{cccc} & 1 & & \\ & & \ddots &\\ & & & 1 \\ {a}_{k}&{a}_{k-1} & \ldots &{a}_{1} \end{array} \right ),$$
(3.16)

\({a}_{1},\ldots,{a}_{k}\) are in \({\mathbb{F}}_{2}\), \({a}_{k} = 1\), and all blank entries in the matrix are zeros. If w ≤ k, the matrix B contains the first w lines of the k × k identity matrix, otherwise B is constructed as explained in L’Ecuyer and Panneton (2009). The RNG thus obtained can be defined equivalently by

$$\begin{array}{rcl}{ x}_{i}& =& ({a}_{1}{x}_{i-1} + \cdots+ {a}_{k}{x}_{i-k})\quad{\rm mod}\,\,2,\end{array}$$
(3.17)
$$\begin{array}{rcl}{ u}_{i}& =& \sum\limits_{\mathcal{l}=1}^{w}{x}_{is+\mathcal{l}-1}{2}^{-\mathcal{l}}.\end{array}$$
(3.18)

Here, P(z) is the characteristic polynomial of the matrix \({\mathbf{A}}_{0}^{s}\), not the characteristic polynomial of the recurrence (3.17), and the choice of s is important for determining the quality of the generator. A frequently encountered case is when a single \({a}_{j}\) is nonzero in addition to \({a}_{k}\); then, P(z) is a trinomial and we have a trinomial-based LFSR generator. These generators are known to have important statistical deficiencies (Matsumoto and Kurita 1996; Tezuka 1995) but they can be used as components of combined RNGs (Sect. 3.4.6).
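A direct (bit-by-bit, slow) transcription of (3.17) and (3.18) for a toy trinomial-based LFSR, with k = 7, s = 3, w = 7; the recurrence \({x}_{i} = {x}_{i-6} + {x}_{i-7}\ {\rm mod}\ 2\) has characteristic polynomial \({z}^{7} + z + 1\), a commonly listed primitive trinomial, so the bit sequence has period \({2}^{7} - 1 = 127\). Fast implementations operate on whole words instead, as in the references above:

```python
k, s, w = 7, 3, 7
bits = [1, 0, 0, 1, 0, 1, 1]     # x_0, ..., x_6: any nonzero seed

def next_bit():
    bits.append((bits[-6] + bits[-7]) % 2)   # x_i = x_{i-6} + x_{i-7} mod 2

def u(i):
    """u_i per (3.18): w bits starting at x_{i*s}."""
    while len(bits) < i * s + w:
        next_bit()
    return sum(bits[i * s + l - 1] * 2.0**(-l) for l in range(1, w + 1))

print([u(i) for i in range(4)])
```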

LFSR generators can be expressed as LCGs in a space of polynomials (Tezuka and L’Ecuyer 1991; Tezuka 1995; L’Ecuyer 1994). With this representation, their lattice structure as discussed in Sect. 3.4.3 follows immediately.

4.5 The GFSR and Twisted GFSR

Here we take A as the pq × pq matrix

$$\mathbf{A} = \left (\begin{array}{ccccc} & &{\mathbf{I}}_{p}& &\mathbf{S} \\ {\mathbf{I}}_{p}& & & & \\ &{\mathbf{I}}_{p}& & & \\ & & \ddots & & \\ & & &{\mathbf{I}}_{p}& \\ \end{array} \right )$$

for some positive integers p and q, where \({\mathbf{I}}_{p}\) is the p × p identity matrix, S is a p × p matrix, and the matrix \({\mathbf{I}}_{p}\) on the first line is in columns \((r - 1)p + 1\) to rp for some positive integer r. Often, w = p and B contains the first w lines of the pq × pq identity matrix. If S is also the identity matrix, the generator thus obtained is the trinomial-based generalized feedback shift register (GFSR), for which \({\mathbf{x}}_{i}\) is obtained by a bitwise exclusive-or of \({\mathbf{x}}_{i-r}\) and \({\mathbf{x}}_{i-q}\). This gives a very fast RNG, but its period cannot exceed \({2}^{q} - 1\), because each bit of \({\mathbf{x}}_{i}\) follows the same binary recurrence of order k = q, with characteristic polynomial \(P(z) = {z}^{q} - {z}^{q-r} - 1\). It also fails several simple empirical tests (L’Ecuyer and Simard 2007).

More generally, we can define \({\mathbf{x}}_{i}\) as the bitwise exclusive-or of \({\mathbf{x}}_{i-{r}_{1}},{\mathbf{x}}_{i-{r}_{2}},\ldots,{\mathbf{x}}_{i-{r}_{d}}\) where \({r}_{d} = q\), so that each bit of \({\mathbf{x}}_{i}\) follows a recurrence in \({\mathbb{F}}_{2}\) whose characteristic polynomial P(z) has d + 1 nonzero terms. However, the period is still bounded by \({2}^{q} - 1\), whereas considering the pq-bit state, we should rather expect a period close to \({2}^{pq}\). This was the main motivation for the twisted GFSR (TGFSR) generator. In the original version introduced by Matsumoto and Kurita (1992), w = p and the matrix S is defined as the transpose of \({\mathbf{A}}_{0}\) in (3.16), with k replaced by p. The characteristic polynomial of A is then \(P(z) = {P}_{S}({z}^{q} + {z}^{m})\), where \({P}_{S}(z) = {z}^{p} - {a}_{p}{z}^{p-1} -\cdots- {a}_{1}\) is the characteristic polynomial of S, and its degree is k = pq. If the parameters are selected so that P(z) is primitive over \({\mathbb{F}}_{2}\), then the TGFSR has period \({2}^{k} - 1\). Matsumoto and Kurita (1994) pointed out important weaknesses of the original TGFSR and proposed an improved version that uses a well-chosen matrix B whose lines differ from those of the identity. The operations implemented by this matrix are called tempering and their purpose is to improve the uniformity of the points produced by the RNG.

The Mersenne twister (Matsumoto and Nishimura 1998; Nishimura 2000) is a variant of the TGFSR where k is slightly less than pq and can be a prime number. A specific instance named MT19937, proposed by Matsumoto and Nishimura (1998), has become quite popular; it runs very fast and has the huge period of \({2}^{19937} - 1\). However, its state \({\mathbf{x}}_{i}\) occupies a large amount of memory (19,937 bits) and changes very slowly as a function of i. Panneton et al. (2006) showed that as a consequence of this slow change, if the generator starts in a state with very few bits equal to 1, then the average output value over the next few thousand steps is likely to be much less than 1/2. In particular, if the initial state has a single bit at 1, say randomly selected, then we need about 3/4 million steps before the average output value gets close to 1/2. Likewise, if two initial states differ by a single bit, it takes the same number of steps before the corresponding outputs differ by about half of their bits. This problem is related to the fact that the characteristic polynomial P(z) has too few nonzero coefficients, namely 135 out of 19,938.

Panneton et al. (2006) went on to develop a class of \({\mathbb{F}}_{2}\)-linear generators called well-equidistributed long-period linear (WELL), which run almost as fast as MT19937, but whose state changes faster and whose polynomial P(z) contains nearly 50% nonzero coefficients. They propose specific instances with periods ranging from \({2}^{512} - 1\) to \({2}^{44497} - 1\), which are all almost (or exactly) maximally equidistributed.

In the multiple recursive matrix method of Niederreiter (1995), the first row of p × p blocks in A contains arbitrary matrices. However, a fast implementation is possible only when these matrices are sparse and have a special structure.

4.6 Combined Linear Generators Over \({\mathbb{F}}_{2}\)

Many of the best generators based on linear recurrences over \({\mathbb{F}}_{2}\) are constructed by combining the outputs of two or more RNGs having a simple structure. The idea is the same as for MRGs: select simple components that can run fast, but such that their combination has a more complicated structure and highly uniform sets \({\Psi }_{I}\) for the sets I considered important.

Consider J distinct recurrences of the form (3.12)–(3.13), where the jth recurrence has parameters \((k,w,\mathbf{A},\mathbf{B}) = ({k}_{j},w,{\mathbf{A}}_{j},{\mathbf{B}}_{j})\) and state \({\mathbf{x}}_{j,i}\) at step i, for \(j = 1,\ldots,J\). The output of the combined generator at step i is defined by

$$\begin{array}{rcl}{ \mathbf{y}}_{i}& =&{ \mathbf{B}}_{1}{\mathbf{x}}_{1,i} \oplus \cdots\oplus {\mathbf{B}}_{J}{\mathbf{x}}_{J,i}, \\ {u}_{i}& =& \sum\limits_{\ell=1}^{w}{y}_{i,\ell-1}\,{2}^{-\ell}, \\ \end{array}$$

where ⊕ denotes the bitwise exclusive-or operation. One can show (Tezuka 1995) that the period ρ of this combined generator is the least common multiple of the periods \({\rho }_{j}\) of its components. Moreover, this combined generator is equivalent to the generator (3.12)–(3.14) with \(k = {k}_{1} + \cdots+ {k}_{J}\), \(\mathbf{A} =\) diag\(({\mathbf{A}}_{1},\ldots,{\mathbf{A}}_{J})\), and \(\mathbf{B} = ({\mathbf{B}}_{1},\ldots,{\mathbf{B}}_{J})\).

With this method, by selecting the parameters carefully, the combination of LFSR generators with characteristic polynomials \({P}_{1}(z),\ldots,{P}_{J}(z)\) gives yet another LFSR with characteristic polynomial \(P(z) = {P}_{1}(z)\cdots {P}_{J}(z)\) and period equal to the product of the periods of the components (Tezuka and L’Ecuyer 1991; Wang and Compagner 1993; L’Ecuyer 1996b; Tezuka 1995). Tables and fast implementations of maximally equidistributed combined LFSR generators are given in L’Ecuyer (1999c).
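To make the combination mechanism concrete, here is a small illustrative sketch in Python of two Tausworthe (LFSR) components combined by a bitwise exclusive-or, using the usual three-shift implementation of each component step. The shift and mask parameters below are arbitrary placeholders chosen for illustration only, not vetted values, and real implementations also constrain the seeds; for maximally equidistributed parameter sets and correct seeding rules, see L’Ecuyer (1999c).

def taus_step(z, q, s, t, mask):
    # One step of a 32-bit Tausworthe/LFSR component, in the usual three-shift form.
    b = (((z << q) & 0xFFFFFFFF) ^ z) >> s
    return (((z & mask) << t) & 0xFFFFFFFF) ^ b

class ToyCombinedLFSR:
    def __init__(self, seed1=987654321, seed2=123456789):
        self.z1, self.z2 = seed1, seed2

    def next_u01(self):
        self.z1 = taus_step(self.z1, 13, 19, 12, 0xFFFFFFFE)  # placeholder parameters
        self.z2 = taus_step(self.z2, 2, 25, 4, 0xFFFFFFF8)    # placeholder parameters
        y = self.z1 ^ self.z2          # y_i = B_1 x_{1,i} XOR B_2 x_{2,i}
        return y * 2.0**-32            # u_i from the w = 32 output bits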

The TGFSR and Mersenne twister generators cannot be maximally equidistributed. However, concrete examples of maximally equidistributed combined TGFSR generators with periods near \({2}^{466}\) and \({2}^{1250}\) can be found in L’Ecuyer and Panneton (2002). These generators have the additional property that the resolution gaps \({\delta }_{I}\) are zero for a class of small sets I with indices not too far apart.

5 Nonlinear RNGs

All RNGs discussed so far are based on linear recurrences and their structure may be deemed too regular. For example, we saw earlier that the output binary sequence \(\{{y}_{i,j},\,i \geq 0\}\) of any \({\mathbb{F}}_{2}\)-linear generator obeys the linear recurrence (3.15). This can be detected easily by applying statistical tests that measure the linear complexity of this output sequence, or that construct “random” binary matrices from this sequence and compute their ranks (L’Ecuyer and Simard 2007). Because of the linear dependences between the bits, the linear complexity and the matrix ranks will be smaller than they should be on average. For the great majority of Monte Carlo applications, this linearity is not a problem, because the random numbers are transformed nonlinearly by the simulation algorithm. But for the rare situations where it may matter, we need alternatives.

There are several ways of getting rid of the regular linear structure, including: (1) use a nonlinear transition function f; (2) keep the transition function linear but use a nonlinear output function g; (3) combine two linear RNGs of different types, such as an MRG with an \({\mathbb{F}}_{2}\)-linear generator; (4) shuffle (randomly permute) the output values using another generator. Several types of genuinely nonlinear RNGs have been proposed over the years; see for example Blum et al. (1986), Eichenauer-Herrmann (1995), Eichenauer-Herrmann et al. (1998), Hellekalek and Wegenkittl (2003), Knuth (1998), L’Ecuyer and Proulx (1989), L’Ecuyer (1994), L’Ecuyer and Simard (2007), Niederreiter and Shparlinski (2002), and Tezuka (1995). Their nonlinear mappings are defined in various ways by multiplicative inversion in a finite field, quadratic and cubic functions in the finite ring of integers modulo m, and other more complicated transformations. Many of them have output sequences that tend to behave much like i.i.d. U(0, 1) sequences even over their entire period length, in contrast with “good” linear RNGs, whose point sets \({\Psi }_{t}\) are much more regular than typical random points (Eichenauer-Herrmann et al. 1998; L’Ecuyer and Hellekalek 1998; L’Ecuyer and Granger-Piché 2003; Niederreiter and Shparlinski 2002). On the other hand, their statistical properties have been analyzed only empirically or via asymptotic theoretical results. For specific nonlinear RNGs, the uniformity of the point sets \({\Psi }_{t}\) is very difficult to measure theoretically. Moreover, the nonlinear RNGs are generally significantly slower than the linear ones. The RNGs recommended for cryptology are all nonlinear.

An interesting idea for adding nonlinearity without incurring an excessive speed penalty is to combine a small nonlinear generator with a fast long-period linear one (Aiello et al. 1998). L’Ecuyer and Granger-Piché (2003) show how to do this while ensuring theoretically the good uniformity properties of \({\Psi }_{t}\) for the combined generator. A fast implementation can be achieved by using precomputed tables for the nonlinear component. Empirical studies suggest that mixed linear-nonlinear combined generators are more robust than the linear ones with respect to statistical tests, because of their less regular structure.

Several authors have proposed various ways of combining RNGs to produce streams of random numbers with less regularity and better “randomness” properties; see, e.g., Collings (1987), Knuth (1998), Gentle (2003), Law and Kelton (2000), L’Ecuyer (1994), Fishman (1996), Marsaglia (1985), and other references given there. This includes shuffling the output sequence of one generator using another one (or the same one), alternating between several streams, or just adding them in different ways. Most of these techniques are heuristics. They usually improve the uniformity (empirically), but they can also make it worse. For random variables in the mathematical sense, certain types of combinations (e.g., addition modulo 1) can provably improve the uniformity, and some authors have used this fact to argue that combined RNGs are provably better than their components alone (Brown and Solomon 1979; Deng and George 1990; Marsaglia 1985; Gentle 2003), but this argument is faulty because the output sequences of RNGs are deterministic, not sequences of independent random variables. To assess the quality of a combined generator, one must analyze the mathematical structure of the combined generator itself rather than the structure of its components (L’Ecuyer 1996a,b, 1998; L’Ecuyer and Granger-Piché 2003; Tezuka 1995).

6 Empirical Statistical Tests

As mentioned earlier, a statistical test for RNGs is defined by a random variable X whose distribution under \({\mathcal{H}}_{0}\) can be well approximated. When X takes the value x, we define the right and left p-values of the test by

$${p}_{\mathrm{R}} = P[X \geq x\mid {\mathcal{H}}_{0}]\quad \mbox{ and}\quad {p}_{\mathrm{L}} = P[X \leq x\mid {\mathcal{H}}_{0}].$$

When testing RNGs, there is no need to prespecify the level of the test. If either the right or the left p-value is extremely close to zero, e.g., less than \(1{0}^{-15}\), then it is clear that \({\mathcal{H}}_{0}\) (and the RNG) must be rejected. When a suspicious p-value is obtained, e.g., near \(1{0}^{-2}\) or \(1{0}^{-3}\), one can just repeat this particular test a few more times, perhaps with a larger sample size. Almost always, things will then clarify.

Most tests are defined by partitioning the possible realizations of \(({u}_{0},\ldots,{u}_{\tau -1})\) into a finite number of subsets (where the integer τ can be random or deterministic), computing the probability \({p}_{j}\) of each subset j under \({\mathcal{H}}_{0}\), and measuring the discrepancy between these probabilities and empirical frequencies from realizations simulated by the RNG.

A special case that immediately comes to mind is to take τ = t (a constant) and cut the interval [0, 1) into d equal segments for some positive integer d, in order to partition the hypercube \([0,1{)}^{t}\) into \(k = {d}^{t}\) subcubes of volume 1 ∕ k. We then generate n points \({\mathbf{u}}_{i} = ({u}_{ti},\ldots,{u}_{ti+t-1}) \in[0,1{)}^{t}\), for \(i = 0,\ldots,n - 1\), and count the number \({N}_{j}\) of points falling in subcube j, for \(j = 0,\ldots,k - 1\). Any measure of distance (or divergence) between the numbers \({N}_{j}\) and their expectations n ∕ k can define a test statistic X. The tests thus defined are generally called serial tests of uniformity (Knuth 1998; L’Ecuyer et al. 2002). They can be sparse (if n ∕ k ≪ 1), or dense (if n ∕ k ≫ 1), or somewhere in between. There are also overlapping versions, where the points are defined by \({\mathbf{u}}_{i} = ({u}_{i},\ldots,{u}_{i+t-1})\) for \(i = 0,\ldots,n - 1\) (they have overlapping coordinates).

Special instances for which the distribution under \({\mathcal{H}}_{0}\) is well-known are the chi-square, the (negative) empirical entropy, and the number of collisions (L’Ecuyer and Hellekalek 1998; L’Ecuyer et al. 2002; Read and Cressie 1988). For the latter, the test statistic X is the number of times a point falls in a subcube that already had a point in it. Its distribution under \({\mathcal{H}}_{0}\) is approximately Poisson with mean \({\lambda }_{1} = {n}^{2}/(2k)\), if n is large and \({\lambda }_{1}\) not too large.

A variant is the birthday spacings test, defined as follows (Marsaglia 1985; Knuth 1998; L’Ecuyer and Simard 2001). Let \({I}_{(1)} \leq \cdots\leq {I}_{(n)}\) be the numbers of the subcubes that contain the points, sorted in increasing order. Define the spacings \({S}_{j} = {I}_{(j+1)} - {I}_{(j)}\), for \(j = 1,\ldots,n - 1\), and let X be the number of collisions between these spacings. Under \({\mathcal{H}}_{0}\), X is approximately Poisson with mean \({\lambda }_{2} = {n}^{3}/(4k)\), if n is large and \({\lambda }_{2}\) not too large.
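For illustration, both test statistics can be computed by a few lines of Python, sketched below; gen is any function returning a number in [0, 1) that is supposed to imitate a U(0, 1) variate, and the returned counts can be compared with the Poisson approximations of means \({\lambda }_{1}\) and \({\lambda }_{2}\) given above.

def collision_and_birthday(gen, n, t, d):
    # Partition [0,1)^t into k = d^t subcubes and map n points to cell numbers.
    cells = []
    for _ in range(n):
        idx = 0
        for _ in range(t):
            idx = idx * d + int(d * gen())
        cells.append(idx)
    collisions = n - len(set(cells))          # collision statistic, approx. Poisson(n^2/(2k))
    cells.sort()
    spacings = [cells[j + 1] - cells[j] for j in range(n - 1)]
    birthday = (n - 1) - len(set(spacings))   # birthday spacings statistic, approx. Poisson(n^3/(4k))
    return collisions, birthday

In practice, n and k are chosen so that the relevant mean \({\lambda }_{1}\) or \({\lambda }_{2}\) is moderate, which generally requires different pairs (n, k) for the two tests.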

Consider now an MRG, for which \({\Psi }_{t}\) has a regular lattice structure. Because of this regularity, the points of \({\Psi }_{t}\) will tend to be more evenly distributed among the subcubes than random points. For a well-chosen k and large enough n, we expect the collision test to detect this: it is likely that there will be too few collisions. In fact, the same applies to any RNG whose set \({\Psi }_{t}\) is very evenly distributed. When a birthday spacings test with a very large k is applied to an MRG, the numbers of the subcubes that contain one point of \({\Psi }_{t}\) tend to be too evenly spaced and the test detects this by finding too many collisions.

These specific interactions between the test and the structure of the RNG lead to systematic patterns in the p-values of the tests. To illustrate this, suppose that we take k slightly larger than the cardinality of \({\Psi }_{t}\) (so k ≈ ρ) and that, due to the excessive regularity, no collision is observed in the collision test. The left p-value will then be \({p}_{\mathrm{L}} \approx P[X \leq 0\mid X \sim \mbox{Poisson}({\lambda }_{1})] =\exp [-{n}^{2}/(2k)]\). For this p-value to be smaller than a given ε, we need a sample size n proportional to the square root of the period ρ. And after that, \({p}_{\mathrm{L}}\) decreases exponentially fast in \({n}^{2}\).

Extensive experiments with LCGs, MRGs, and LFSR generators confirm that this is actually what happens with these RNGs (L’Ecuyer and Hellekalek 1998; L’Ecuyer 2001; L’Ecuyer et al. 2002). For example, if we take \(\epsilon= 1{0}^{-15}\) and define \({n}_{0}\) as the minimal sample size n for which \({p}_{\mathrm{L}} < \epsilon \), we find that \({n}_{0} \approx 16{\rho }^{1/2}\) (plus some noise) for LCGs that behave well in the spectral test, as well as for LFSR generators. For the birthday spacings test, the rule for LCGs is \({n}_{0} \approx 16{\rho }^{1/3}\) instead (L’Ecuyer and Simard 2001). So to be safe with respect to these tests, the period ρ must be so large that generating more than \({\rho }^{1/3}\) numbers is practically unfeasible. This certainly disqualifies all LCGs with modulus smaller than \({2}^{100}\) or so.

Other types of tests for RNGs include tests based on the closest pairs of points among n points generated in the hypercube, tests based on random walks on the real line or over the integers, tests based on the linear complexity of a binary sequence, tests based on the simulation of dice or poker hands, and many others (Gentle 2003; Knuth 1998; L’Ecuyer and Simard 2007; Marsaglia 1996; Rukhin et al. 2001; Vattulainen et al. 1995).

When testing RNGs, there is no specific alternative hypothesis to \({\mathcal{H}}_{0}\). Different tests are needed to detect different types of departures from \({\mathcal{H}}_{0}\). The TestU01 library of L’Ecuyer and Simard (2007) implements a large collection of tests in the C language, and also provides specific test suites with preselected parameters and sample sizes. Some of these suites are designed for i.i.d. U(0, 1) output sequences and others for strings of bits. Other (smaller) test suites for RNGs are DIEHARD (Marsaglia 1996) and the NIST suite (Rukhin et al. 2001).

7 Available Software and Recommendations

Applying standard statistical test suites to RNGs found in popular software (statistical and simulation software, spreadsheets, system libraries, etc.) reveals that many of them are surprisingly poor and fail the tests spectacularly (L’Ecuyer 2001; L’Ecuyer and Simard 2007). There is no good reason to use these poor RNGs, because several good ones are available that are fast, portable, and pass these test suites with flying colors.

The RNG I use most of the time is the combined MRG MRG32k3a from L’Ecuyer (1999a). A convenient object-oriented software package with multiple streams and substreams of random numbers, based on this generator, is described in L’Ecuyer et al. (2002) and is available in Java, C, and C++, at http://www.iro.umontreal.ca/~lecuyer. This tool has recently been included in several software products, including MATLAB, SAS, R, Arena, Automod, ns3, and many more. MRG32k3a is not the fastest RNG available, but it is very robust and reliable. A faster alternative is MRG31k3p from L’Ecuyer and Touzin (2000). Other good combined MRGs, some for 64-bit computers, are available in L’Ecuyer (1999a). Even faster ones are the combined LFSRs, Mersenne twisters, and WELL generators proposed in L’Ecuyer (1999c), L’Ecuyer and Panneton (2002), Matsumoto and Nishimura (1998), Nishimura (2000), and Panneton et al. (2006). When speed is a concern, I personally use LFSR113 or LFSR258 from L’Ecuyer (1999c). Software tools that provide multiple streams and substreams with most of these generators (except the ones with very large state) are available in the SSJ library (L’Ecuyer 2008).

8 Non-Uniform Random Variate Generation

As in the uniform case, non-uniform variate generation often involves approximations and compromises. The first requirement is, of course, correctness. This does not mean that the generated random variate X must always have exactly the required distribution, because this would sometimes be much too costly or even impossible. But we must have a good approximation and, preferably, some understanding of the quality of that approximation. Robustness is also important: when the accuracy depends on the parameters of the distribution, it must be good uniformly over the entire range of parameter values that we are interested in.

The method must also be efficient both in terms of speed and memory usage. Often, it is possible to increase the speed by using more memory (e.g., for larger precomputed tables) or by relaxing the accuracy requirements. Some methods need a one-time setup to compute constants and construct tables. The setup time can be significant but may be well worth spending if it is amortized by a large number of subsequent calls to the generator. For example, a more extensive setup makes sense if we plan to make a million calls to a given generator, but not if we expect to make only a few calls, assuming that this investment can improve the speed of the generator sufficiently.

In general, compromises must be made between simplicity of the algorithm, quality of the approximation, robustness with respect to the distribution parameters, and efficiency (generation speed, memory requirements, and setup time).

In many situations, compatibility with variance reduction techniques is another important issue (Asmussen and Glynn 2007; Bratley et al. 1987; Law and Kelton 2000). We may be willing to sacrifice the speed of the generator to preserve inversion, because the gain in efficiency obtained via the variance reduction methods may more than compensate (sometimes by orders of magnitude) for the slightly slower generator.

8.1 Inversion

The inversion method, defined in the introduction, should be the method of choice for generating non-uniform random variates in a majority of situations. The fact that \(X = {F}^{-1}(U)\) is a monotone (non-decreasing) function of U makes this method compatible with important variance reduction techniques such as common random numbers, antithetic variates, Latin hypercube sampling, and randomized quasi-Monte Carlo methods (Bratley et al. 1987; Law and Kelton 2000; L’Ecuyer and Lemieux 2000; L’Ecuyer et al. 2009).

For some distributions, an analytic expression can be obtained for the inverse distribution function \({F}^{-1}\) and inversion can be easily implemented. As an example, consider the Weibull distribution function with parameters α > 0 and β > 0, defined by \(F(x) = 1 -\exp [-{(x/\beta )}^{\alpha }]\) for x > 0. It is easy to see that \({F}^{-1}(U) = \beta {[-\ln (1 - U)]}^{1/\alpha }\). For α = 1, we have the special case of the exponential distribution with mean β.
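In code, this is a one-liner; a minimal Python sketch (for illustration only):

import math, random

def weibull_inv(alpha, beta, u=None):
    # X = F^{-1}(U) = beta * (-ln(1 - U))^(1/alpha)
    if u is None:
        u = random.random()
    return beta * (-math.log(1.0 - u)) ** (1.0 / alpha)

For α = 1, weibull_inv(1.0, beta) returns an exponential variate with mean β.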

For an example of a simple discrete distribution, suppose that \(P[X = i] = {p}_{i}\), where \({p}_{0} = 0.6\), \({p}_{1} = 0.3\), \({p}_{2} = 0.1\), and \({p}_{i} = 0\) elsewhere. The inversion method in this case returns 0 if U < 0.6, 1 if 0.6 ≤ U < 0.9, and 2 if U ≥ 0.9. For the discrete uniform distribution over \(\{0,\ldots,k - 1\}\), return X = ⌊kU⌋. As another example, let X have the geometric distribution with parameter p, so \(P[X = x] = p{(1 - p)}^{x}\) for \(x = 0,1,2,\ldots \), where 0 < p < 1. Then, \(F(x) = 1 - {(1 - p)}^{\lfloor x+1\rfloor }\) for x ≥ 0 and one can show that \(X = {F}^{-1}(U) = \lceil \ln (1 - U)/\ln (1 - p)\rceil - 1\).
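These discrete inversions are equally direct in code; a sketch under the same conventions:

import math, random

def geometric_inv(p, u=None):
    # X = F^{-1}(U) = ceil(ln(1 - U) / ln(1 - p)) - 1
    if u is None:
        u = random.random()
    return math.ceil(math.log(1.0 - u) / math.log(1.0 - p)) - 1

def discrete_uniform_inv(k, u=None):
    # X = floor(kU) over {0, ..., k-1}
    if u is None:
        u = random.random()
    return int(k * u)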

For other distributions (e.g., the normal, Student, chi-square) there is no closed-form expression for \({F}^{-1}\) but good numerical approximations are available (Bratley et al. 1987; Gentle 2003; Hörmann et al. 2004; Marsaglia et al. 1994). When the distribution has only scale and location parameters, we need to approximate \({F}^{-1}\) only for a standardized version of the distribution. For the normal distribution, for example, it suffices to have an efficient method for evaluating the inverse distribution function of a N(0, 1) random variable Z, since a normal with mean μ and variance \({\sigma }^{2}\) can be generated by \(X = \sigma Z + \mu \).

When shape parameters are involved (e.g., the gamma and beta distributions), things are more complicated because \({F}^{-1}\) then depends on the parameters in a more fundamental manner.

Hörmann and Leydold (2003) propose a general adaptive and automatic method that constructs a highly accurate Hermite interpolation of \({F}^{-1}\). In a one-time setup, their method produces tables of interpolation points and coefficients. Random variate generation using these tables is then quite fast.

A less efficient but simpler way of implementing inversion when a method is available for computing F is via binary search (Cheng 1998). If the density is also available and if it is unimodal with known mode, a Newton-Raphson iteration method can advantageously replace the binary search (Cheng 1998; Devroye 1986).

To implement inversion for general discrete distributions, sequential search and binary search with look-up tables are the standard methods (Bratley et al. 1987; Cheng 1998). For a discrete distribution over the values \({x}_{1} < \cdots< {x}_{k}\), one first tabulates the pairs \(({x}_{i},F({x}_{i}))\), where \(F({x}_{i}) = P[X \leq {x}_{i}]\) for \(i = 1,\ldots,k\). To generate X, it then suffices to generate U ∼ U(0, 1), find \(I =\min \{ i\mid F({x}_{i}) \geq U\}\), and return \(X = {x}_{I}\). The following algorithms do that.

Sequential search (needs O(k) iterations in the worst case);

generate U ∼ U(0, 1); let i = 1;

while \(F({x}_{i}) < U\) do \(i = i + 1\);

return \({x}_{i}\).

Binary search (needs O(log k) iterations in the worst case);

generate U ∼ U(0, 1); let L = 0 and R = k;

while L < R − 1 do

\(m = \lfloor (L + R)/2\rfloor \);

if \(F({x}_{m}) < U\) then L = m else R = m;

/* Invariant: at this stage, the index I is in \(\{L + 1,\ldots,R\}\). */

return \({x}_{R}\).
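In Python, for instance, the binary search above is provided by the standard bisect module; a sketch with 0-based indexing, where values and cdf are lists and cdf[i] holds the cumulative probability of values[i]:

import bisect, random

def discrete_inverse(values, cdf, u=None):
    # Return x_I with I = min{i : F(x_i) >= U}; bisect_left performs the
    # binary search of the listing above, with 0-based indexing.
    if u is None:
        u = random.random()
    i = bisect.bisect_left(cdf, u)
    return values[min(i, len(values) - 1)]  # clamp guards against rounding in the last cdf entry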

These algorithms can be modified in many different ways. For example, if k = ∞, in the binary search, one can start with an arbitrary value of R, double it until \(F({x}_{R}) \geq U\), and start the algorithm with this R and \(L = R/2\). Of course, only a finite portion of the table (a portion that contains most of the probability mass) would be precomputed in this case; the other values can be computed only when needed. This can also be done if k is finite but large.

Another class of techniques uses indexing or buckets to speed up the search (Bratley et al. 1987; Chen and Asau 1974; Devroye 1986). For example, one can partition the interval (0, 1) into c subintervals of equal sizes and use (pre-tabulated) initial values of (L, R) that depend on the subinterval in which U falls. For the subinterval \([j/c,\,(j + 1)/c)\), the values of L and R would be \({L}_{j} = {F}^{-1}(j/c)\) and \({R}_{j} = {F}^{-1}((j + 1)/c)\), for \(j = 0,\ldots,c - 1\). The subinterval number that corresponds to a given U is simply J = ⌊cU⌋. Once we know that subinterval, we can search it by linear or binary search. With a larger value of c the search is faster (on average) but the setup is more costly and a larger amount of memory is needed. So a compromise must be made depending on the situation (e.g., the value of k, the number of variates we expect to generate, etc.). For c = 1, we recover the basic sequential and binary search algorithms given above. A well-implemented indexed search with a large enough c is competitive with the alias method (described in Sect. 8.2). A combined indexed/binary search algorithm is given below. An easy adaptation gives the combined indexed/sequential search, which is generally preferable when k ∕ c is small, because it has smaller overhead.

Indexed search (combined with binary search);

generate U ∼ U(0, 1); let J = ⌊cU⌋, \(L = {L}_{J}\), and \(R = {R}_{J}\);

while L < R − 1 do

\(m = \lfloor (L + R)/2\rfloor \);

if \(F({x}_{m}) < U\) then L = m else R = m;

return \({x}_{R}\).
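A sketch of this indexed variant in the same style, where lo[j] and hi[j] store the precomputed bounds \({L}_{j}\) and \({R}_{j}\) as 0-based indices into the table:

import bisect

def build_index(cdf, c):
    # lo[j] and hi[j] bracket the answer for any U in [j/c, (j+1)/c).
    k = len(cdf)
    lo = [bisect.bisect_left(cdf, j / c) for j in range(c)]
    hi = [min(bisect.bisect_left(cdf, (j + 1) / c), k - 1) for j in range(c)]
    return lo, hi

def indexed_inverse(values, cdf, lo, hi, u):
    c = len(lo)
    j = min(int(c * u), c - 1)                         # subinterval J = floor(cU)
    i = bisect.bisect_left(cdf, u, lo[j], hi[j] + 1)   # binary search within [L_J, R_J]
    return values[min(i, len(values) - 1)]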

These search methods are also useful for piecewise-linear (or piecewise-polynomial) distribution functions. Essentially, it suffices to add an interpolation step at the end of the algorithm, after the appropriate linear (or polynomial) piece has been determined (Bratley et al. 1987).

Finally, the stochastic model itself can sometimes be selected in a way that makes inversion easier. For example, one can fit a parametric, highly-flexible, and easily computable inverse distribution function F  − 1 to the data, directly or indirectly (Nelson and Yamnitsky 1998).

There are situations where speed is important and where non-inversion methods are appropriate. In forthcoming subsections, we outline the main non-inversion methods.

8.2 The Alias Method

Sequential and binary search require O(k) and O(log k) time, respectively, in the worst case, to generate a random variate X by inversion over the finite set \(\{{x}_{1},\ldots,{x}_{k}\}\). The alias method (Walker 1977) can generate such an X in O(1) time per variate, after a table setup that takes O(k) time and space. On the other hand, it does not implement inversion, i.e., the transformation from U to X is not monotone.

To explain the idea, consider a bar diagram of the distribution, where each index i has a bar of height \({p}_{i} = P[X = {x}_{i}]\). The idea is to “equalize” the bars so that they all have height 1 ∕ k, by cutting off bar pieces and transferring them to other bars. This is done in such a way that in the new diagram, each bar i contains one piece of size \({q}_{i}\) (say) from the original bar i and one piece of size \(1/k - {q}_{i}\) from another bar whose index j, denoted A(i), is called the alias value of i. The setup procedure initializes two tables, A and R, where A(i) is the alias value of i and \(R(i) = (i - 1)/k + {q}_{i}\). See Devroye (1986) and Law and Kelton (2000) for the details. To generate X, we generate U ∼ U[0, 1], define I = ⌈kU⌉, and return \(X = {x}_{I}\) if U < R(I) and \(X = {x}_{A(I)}\) otherwise.
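The setup can be implemented in several ways; the following Python sketch uses a standard small/large worklist construction (often attributed to Vose), with 0-based indexing. Here prob[i] plays the role of \(k{q}_{i}\), so the test U < R(I) becomes: the fractional part of kU is less than prob[i].

import random

def alias_setup(p):
    # Equalize the bars: slot i keeps mass prob[i]/k from value i and
    # mass (1 - prob[i])/k from its alias.
    k = len(p)
    prob, alias = [0.0] * k, [0] * k
    scaled = [pi * k for pi in p]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        i, j = small.pop(), large.pop()
        prob[i], alias[i] = scaled[i], j
        scaled[j] -= 1.0 - scaled[i]      # transfer a piece of bar j into slot i
        (small if scaled[j] < 1.0 else large).append(j)
    for i in small + large:               # leftover bars are full (up to rounding)
        prob[i] = 1.0
    return prob, alias

def alias_draw(x, prob, alias, u=None):
    if u is None:
        u = random.random()
    k = len(prob)
    i = min(int(k * u), k - 1)            # slot number (0-based analogue of ceil(kU))
    f = k * u - i                         # position of U within the slot
    return x[i] if f < prob[i] else x[alias[i]]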

There is a version of the alias method for continuous distributions, called the acceptance-complement method (Kronmal and Peterson 1984; Devroye 1986; Gentle 2003). The idea is to decompose the density f of the target distribution as the convex combination of two densities \({f}_{1}\) and \({f}_{2}\), \(f = w{f}_{1} + (1 - w){f}_{2}\) for some real number w ∈ (0, 1), in a way that \(w{f}_{1} \leq g\) for some other density g and so that it is easy to generate from g and \({f}_{2}\). The algorithm works as follows: Generate X from density g and U ∼ U(0, 1); if \(Ug(X) \leq w{f}_{1}(X)\), return X; otherwise generate a new X from density \({f}_{2}\) and return it.

8.3 Kernel Density Estimation and Generation

Instead of selecting a parametric distribution that is hard to invert and estimating the parameters, one can estimate the density via a kernel density estimation method for which random variate generation is very easy (Devroye 1986; Hörmann et al. 2004). In the case of a Gaussian kernel, for example, one can generate variates simply by selecting one observation at random from the data and adding random noise generated from a normal distribution with mean zero. However, this method is not equivalent to inversion. Because of the added noise, selecting a larger observation does not necessarily guarantee a larger value for the generated variate.

8.4 The Rejection Method

Suppose we want to generate X from a complicated density f. In fact, f may be known only up to a multiplicative constant κ > 0, i.e., we know only κf. If we know f, we may just take κ = 1. We select another density r such that \(\kappa f(x) \leq t(x)\stackrel{\mathrm{def}}{=}a\,r(x)\) for all x, for some constant a, and such that generating variates Y from the density r is easy. The function t is called a hat function or majorizing function. By integrating this inequality with respect to x on both sides, we find that κ ≤ a. The following rejection method generates X with density f (von Neumann 1951; Devroye 1986; Evans and Swartz 2000):

Rejection method;

repeat

generate Y from the density r and U ∼ U(0, 1), independent;

until Ut(Y ) ≤ κf(Y );

return X = Y.

The number R of turns into the “repeat” loop is one plus a geometric random variable with parameter κ ∕ a, so \(E[R] = a/\kappa \). Thus, we want a ∕ κ ≥ 1 to be as small as possible, i.e., we want to minimize the area between κf and t. There is generally a compromise between bringing a ∕ κ close to 1 and keeping r simple.
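As a classical illustration, the sketch below generates a standard normal by rejection: the target is the half-normal density \(f(x) = \sqrt{2/\pi }\,{e}^{-{x}^{2}/2}\) for x ≥ 0, the proposal density is the exponential \(r(x) = {e}^{-x}\), and the hat constant is \(a = \sqrt{2e/\pi } \approx 1.32\), so the acceptance condition reduces to \(U \leq \exp [-{(Y - 1)}^{2}/2]\); a random sign then yields the full normal.

import math, random

def normal_by_rejection():
    while True:
        y = random.expovariate(1.0)                    # Y from the proposal density e^{-y}
        if random.random() <= math.exp(-0.5 * (y - 1.0) ** 2):
            return y if random.random() < 0.5 else -y  # attach a random sign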

When κf is expensive to compute, we can also use squeeze functions \({q}_{1}\) and \({q}_{2}\) that are faster to evaluate and such that \({q}_{1}(x) \leq \kappa f(x) \leq {q}_{2}(x) \leq t(x)\) for all x. To verify the condition \(Ut(Y ) \leq \kappa f(Y )\), we first check if \(Ut(Y ) \leq {q}_{1}(Y )\), in which case we accept Y immediately; otherwise we check if \(Ut(Y ) \geq {q}_{2}(Y )\), in which case we reject Y immediately. The value of κf(Y ) must be computed only when Ut(Y ) falls between the two squeezes. Sequences of embedded squeezes can also be used, where the primary ones are the least expensive to compute, the secondary ones are a little more expensive but closer to κf, etc.

It is common practice to transform the density f by a smooth increasing function T (e.g., T(x) = log x or \(T(x) = -{x}^{-1/2}\)) selected so that it is easier to construct good hat and squeeze functions (often piecewise linear) for the transformed density T(f( ⋅)). By transforming back to the original scale, we get hat and squeeze functions for f. This is the transformed density rejection method, which has several variants and extensions (Devroye 1986; Evans and Swartz 2000; Hörmann et al. 2004).

The rejection method works for discrete distributions as well; it suffices to replace densities by probability mass functions.

8.5 Thinning for Point Processes with Time-Varying Rates

Thinning is a cousin of acceptance-rejection, often used for generating events from a non-homogeneous Poisson process. Suppose the process has rate λ(t) at time t, with \(\lambda (t) \leq\bar{ \lambda }\) for all t, where \(\bar{\lambda }\) is a finite constant. One can generate Poisson pseudo-arrivals at constant rate \(\bar{\lambda }\) by generating interarrival times that are i.i.d. exponentials of mean \(1/\bar{\lambda }\). Then, a pseudo-arrival at time t is accepted (becomes an arrival) with probability \(\lambda (t)/\bar{\lambda }\) (i.e., if \(U \leq \lambda (t)/\bar{\lambda }\), where U is an independent U[0, 1]), and rejected with probability \(1 - \lambda (t)/\bar{\lambda }\). Non-homogeneous Poisson processes can also be generated by inversion (Bratley et al. 1987). The idea is to apply a nonlinear transformation to the time scale to make the process homogeneous with rate 1 in the new time scale. Arrival times are generated in this new time scale (which is easy), and then transformed back to the original time scale. The method can be adapted to other types of point processes with time-varying rates.
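A minimal sketch of thinning in Python, where lam is the rate function λ(t), lam_bar the constant bound \(\bar{\lambda }\), and horizon the length of the interval to simulate:

import random

def thinning(lam, lam_bar, horizon):
    arrivals, t = [], 0.0
    while True:
        t += random.expovariate(lam_bar)          # pseudo-arrival at constant rate lam_bar
        if t > horizon:
            return arrivals
        if random.random() <= lam(t) / lam_bar:   # accept with probability lam(t)/lam_bar
            arrivals.append(t)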

8.6 The Ratio-of-Uniforms Method

If f is a density over the real-line, κ an arbitrary positive constant, and the pair (U, V ) has the uniform distribution over the set

$$\mathcal{C} = \left \{(u,v) \in{\mathbb{R}}^{2}\mbox{ such that }0 \leq u \leq\sqrt{\kappa f(v/u)}\right \},$$

then V ∕ U has density f (Kinderman and Monahan 1977; Devroye 1986; Gentle 2003). This interesting property can be exploited to generate X with density f: generate (U, V ) uniformly over \(\mathcal{C}\) and return \(X = V/U\). This is the ratio-of-uniforms method. The key issue is how to generate a point uniformly over \(\mathcal{C}\). In the cases where this can be done efficiently, we immediately have an efficient way of generating X.

The most frequent approach for generating (U, V ) uniformly over \(\mathcal{C}\) is the rejection method: Define a region \({\mathcal{C}}_{2}\) that contains \(\mathcal{C}\) and in which it is easy to generate a point uniformly (for example, a rectangular box or a polygonal region). To generate X, repeat: generate (U, V ) uniformly over \({\mathcal{C}}_{2}\), until it belongs to \(\mathcal{C}\). Then return \(X = V/U\). If there is another region \({\mathcal{C}}_{1}\) contained in \(\mathcal{C}\) and for which it is very fast to check if a point (U, V ) is in \({\mathcal{C}}_{1}\), this \({\mathcal{C}}_{1}\) can also be used as a squeeze to accelerate the verification that the point belongs to \(\mathcal{C}\). Several special cases and refinements are described in Devroye (1986), Gentle (2003), Leydold (2000), and other references given there.
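For the standard normal, for instance, taking \(\kappa= \sqrt{2\pi }\) gives a region \(\mathcal{C}\) contained in the box \((0,1] \times [-\sqrt{2/e},\sqrt{2/e}]\), and the membership test reduces to \({v}^{2} \leq -4{u}^{2}\ln u\); this yields the classical generator of Kinderman and Monahan (1977), sketched below in Python.

import math, random

def normal_ratio_of_uniforms():
    b = math.sqrt(2.0 / math.e)                  # half-width of the enclosing box
    while True:
        u = 1.0 - random.random()                # uniform over (0, 1]
        v = (2.0 * random.random() - 1.0) * b    # uniform over [-b, b]
        if v * v <= -4.0 * u * u * math.log(u):  # is (u, v) in C?
            return v / u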

8.7 Composition and Convolution

Suppose F is a convex combination of several distributions, i.e., \(F(x) ={ \sum }_{j}{p}_{j}{F}_{j}(x)\), or more generally \(F(x) =\int {F}_{y}(x)\,\mathrm{d}H(y)\). To generate from F, one can generate J = j with probability \({p}_{j}\) (or Y from H), then generate X from \({F}_{J}\) (or \({F}_{Y}\)). This method, called the composition algorithm, is useful for generating from compound distributions such as the hyperexponential or from compound Poisson processes. It is also frequently used to design specialized algorithms for generating from complicated densities. The idea is to partition the area under the complicated density into pieces, where piece j has surface \({p}_{j}\). To generate X, first select a piece (choose piece j with probability \({p}_{j}\)), then draw a random point uniformly over that piece and project it to the horizontal axis. If the partition is defined so that it is fast and easy to generate from the large pieces, then X will be returned very quickly most of the time. The rejection method with a squeeze is often used to generate from some of the pieces.
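For example, a hyperexponential variate (a finite mixture of exponentials) is generated by composition as in the following sketch, where p holds the mixing probabilities and rates the component rates:

import random

def hyperexponential(p, rates):
    u, acc = random.random(), 0.0
    for pj, lj in zip(p, rates):            # select component j with probability p[j]
        acc += pj
        if u < acc:
            return random.expovariate(lj)
    return random.expovariate(rates[-1])    # guard against rounding in sum(p)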

A dual method to composition is the convolution method, which can be used when \(X = {Y }_{1} + {Y }_{2} + \cdots+ {Y }_{n}\), where the \({Y }_{i}\)’s are independent with specified distributions. With this method, one just generates the \({Y }_{i}\)’s and sums them up. This requires at least n uniforms. Examples of random variables that can be expressed as sums like this include the hypoexponential, Erlang, and binomial distributions.
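For instance, an Erlang variate with n phases and rate λ, by convolution:

import random

def erlang(n, lam):
    # Sum of n i.i.d. exponentials of rate lam; consumes n uniforms.
    return sum(random.expovariate(lam) for _ in range(n))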

8.8 Other Special Techniques

Specialized and sometimes very elegant techniques have been designed for commonly used distributions such as the Poisson with small mean and the normal (e.g., the Box-Muller and the polar methods), for generating points uniformly on a k-dimensional sphere, for generating random permutations, and so on. Details can be found, e.g., in Bratley et al. (1987), Cheng (1998), Devroye (1986), Fishman (1996), and Gentle (2003). Many of those methods are based on a clever multivariate change of variables, defined so that the random variates or random vectors in the new coordinates are much easier to generate. In the Box-Muller and polar methods, for example, a pair of independent standard normals is generated in polar coordinates, and then transformed back into rectangular coordinates.

Recently, there has been an effort to develop automatic or black box algorithms for generating variates from an arbitrary (known) density, and reliable software that implements these methods (Hörmann and Leydold 2000; Hörmann et al. 2004; Leydold 2009).

8.9 Multivariate Distributions

Inversion does not directly apply for generating a d-dimensional random vector \(\mathbf{X} = {({X}_{1},\ldots,{X}_{d})}^{\mathsf{T}}\), because the inverse of its distribution function is not well defined. In some cases, one can generate the first coordinate \({X}_{1}\) by inversion from its marginal distribution, then generate \({X}_{2}\) by inversion from its marginal distribution conditional on \({X}_{1}\), then generate \({X}_{3}\) by inversion from its marginal distribution conditional on \(({X}_{1},{X}_{2})\), and so on. But this is not always possible or convenient.

Simple and elegant methods are available for certain classes of distributions. For example, if X has a multinormal distribution with mean vector μ and covariance matrix Σ, then one can decompose \(\Sigma = \mathbf{A}{\mathbf{A}}^{\mathsf{T}}\) for some matrix A, generate a vector Z of d independent standard normal random variables (with mean 0 and variance 1), usually by inversion, and return \(\mathbf{X} = \mu+ \mathbf{A}\mathbf{Z}\). One way to decompose Σ is via the Cholesky decomposition, for which A is lower triangular, but there are many other possibilities, including the eigendecomposition as in principal component analysis. The choice of decomposition can have a large impact on the variance reduction in the context of randomized quasi-Monte Carlo integration, by concentrating more of the variance on the first few uniform random numbers that are generated (L’Ecuyer 2009).
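A sketch of this recipe in pure Python, with a textbook Cholesky factorization (a numerical library would normally be used instead):

import math, random

def cholesky(sigma):
    # Lower-triangular A with A A^T = sigma (sigma symmetric positive definite).
    d = len(sigma)
    a = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sigma[i][j] - sum(a[i][l] * a[j][l] for l in range(j))
            a[i][j] = math.sqrt(s) if i == j else s / a[j][j]
    return a

def multinormal(mu, sigma):
    d = len(mu)
    a = cholesky(sigma)
    z = [random.gauss(0.0, 1.0) for _ in range(d)]  # ideally generated by inversion
    return [mu[i] + sum(a[i][j] * z[j] for j in range(i + 1)) for i in range(d)]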

Multivariate normals are often used to generate vectors from other distributions. For example, to generate a random point on a d-dimensional sphere of radius r centered at zero, one can generate a vector Z of independent standard normals (this amounts to generating a random direction), then normalize its length to r. More generally, by putting \(\mathbf{X} = R\,\mathbf{Z}/\Vert \mathbf{Z}\Vert\) where R has an arbitrary distribution over (0, ∞), one generates a vector with a radially symmetric distribution. As a special case, if R has the Student distribution, X is multivariate Student. As a further generalization, let \(\mathbf{X} = \mu+ R\,\mathbf{A}\mathbf{Z}/\Vert \mathbf{Z}\Vert\) where Z is multinormal in k dimensions and A is a d × k matrix. This X has an elliptic multivariate distribution.

A richer class of multivariate distributions can be defined via copula methods (Asmussen and Glynn 2007; Hörmann et al. 2004; Nelsen 1999). Start with an arbitrary d-dimensional distribution function G with continuous marginals \({G}_{j}\), generate \(\mathbf{Y} = {({Y }_{1},\ldots,{Y }_{d})}^{\mathsf{T}}\) from G, and let \(\mathbf{U} = ({U}_{1},\ldots,{U}_{d}) = {({G}_{1}({Y }_{1}),\ldots,{G}_{d}({Y }_{d}))}^{\mathsf{T}}\). These \({U}_{j}\) have the uniform distribution over (0, 1), but they are not independent in general. The distribution function of U is the copula associated with G and it specifies the dependence structure of the vector U. In fact, any multivariate distribution function over \((0,1{)}^{d}\) with uniform marginals is a copula. To generate \(\mathbf{X} = {({X}_{1},\ldots,{X}_{d})}^{\mathsf{T}}\) with arbitrary marginal distribution functions \({F}_{j}\) and dependence structure specified by this copula, put \({X}_{j} = {F}_{j}^{-1}({U}_{j})\) for each j. A popular choice for G is the multinormal distribution with standard normal marginals; then Y and U are easy to generate, and one can select the correlation matrix of Y to approximate a target correlation (or rank correlation) matrix for X. It can usually match the correlations pretty well. But to approximate the whole dependence structure in general, a much richer variety of copulas is required (Asmussen and Glynn 2007; Hörmann et al. 2004; Nelsen 1999).
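A sketch of the normal-copula recipe in the bivariate case, with exponential marginals of mean 1 as an arbitrary illustrative choice of the \({F}_{j}\); here rho is the correlation of the underlying normals, not of the resulting \({X}_{j}\):

import math, random

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bivariate_from_normal_copula(rho):
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    y1 = z1
    y2 = rho * z1 + math.sqrt(1.0 - rho * rho) * z2   # correlated standard normals
    u1, u2 = normal_cdf(y1), normal_cdf(y2)           # uniforms with the copula dependence
    inv_f = lambda u: -math.log(1.0 - u)              # exponential(1) marginals, by inversion
    return inv_f(u1), inv_f(u2)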

The rejection method extends easily to the multivariate case. For a target d-dimensional density f known up to the multiplicative constant κ, pick a d-dimensional density r such that \(\kappa f(\mathbf{x}) \leq a\,r(\mathbf{x})\) for all x and some constant a, and such that sampling random vectors Y from r is easy. To generate X with density f, repeat: generate Y from r and U uniform over (0, 1) independent of Y, until \(U\,a\,r(\mathbf{Y}) \leq \kappa f(\mathbf{Y})\); then return X = Y.

There are many situations where one wishes to generate random vectors X from quite complicated distributions and no efficient method is available to do it exactly. One very important approach that often permits one to generate X approximately from the target distribution is the Markov chain Monte Carlo (MCMC) method. In a nutshell, it constructs an artificial Markov chain whose steady-state distribution is the target distribution of X, and runs the Markov chain for a large number of steps, until it is deemed sufficiently close to steady-state. Then the state of the chain has a distribution close to the target one. MCMC is covered elsewhere in this handbook.