
1 Introduction

Gaussian sampling over the integers is a central building block of lattice-based cryptography, in theory as well as in practice. It is also notoriously difficult to perform efficiently and securely, as illustrated by numerous side-channel attacks exploiting BLISS’ Gaussian sampler [9, 21, 49, 56]. For this reason, some schemes limit or proscribe the use of Gaussians [6, 36]. However, in some situations, Gaussians are unavoidable. The most prominent example is trapdoor sampling [26, 40, 48]: performing it with other distributions is an open question, except in limited cases [37] which entail a growth of the output from \(O(\sqrt{n})\) to \(O(n)\), resulting in dwindling security levels. Given the countless applications of trapdoor sampling (full-domain hash signatures [26, 53], identity-based encryption (or IBE) [18, 26], hierarchical IBE [1, 11], etc.), it is important to come up with Gaussian samplers over the integers which are not only efficient, but also provably secure, resistant to timing attacks, and in general easy to deploy.

Our first contribution is to propose a Gaussian sampler over the integers with all the properties which are expected of a sampler for widespread deployment. It is simple and modular, making analysis and subsequent improvements easy. It is efficient and portable, making it amenable to a variety of scenarios. Finally, we formally prove its security and resistance against timing attacks. We detail below different aspects of our sampler:

  • Simplicity and Modularity. At a high level, our framework only requires two ingredients (a base sampler and a rejection sampler) and combines them in a simple and black-box way. Not only does this make the description of our sampler modular (one can replace either ingredient), but this simplicity also carries over to all aspects of its analysis.

  • Genericity. Our sampler is fully generic as it works with arbitrary center \(\mu \) and standard deviation \(\sigma \). In addition, it does not incur hidden precomputation costs: given a fixed base sampler of parameter \(\sigma _\mathrm{max}\), our framework allows sampling from \(D_{\mathbb {Z},\sigma ,\mu }\) for any \(\eta _{\epsilon }(\mathbb {Z}^n) \le \sigma \le \sigma _\mathrm{max}\). In comparison, [42] implicitly requires a different base sampler for each value of \(\sigma \); this limits its applicability for use cases such as Falcon [53], which has up to 2048 different \(\sigma \)’s, all computed at key generation.

  • Efficiency and Portability. Our sampler is instantiated with competitive parameters which make it very efficient in time and memory usage. For \(\sigma _\mathrm{max}= 1.8205\) and SHAKE256 used as PRNG, our sampler uses only 512 bytes of memory and achieves 1,848,428 samples per second on an Intel i7-6500U clocked at 2.5 GHz. Moreover, our sampler can be instantiated in a way that uses only integer operations, making it highly portable.

  • Provable Security. A security analysis based on the statistical distance would either provide very weak security guarantees or require increasing the running time by an order of magnitude. We instead rely on the Rényi divergence, a tool which in recent years has enabled dramatic efficiency gains for lattice-based schemes [3, 52]. We carefully selected our parameters so as to make them amenable to a Rényi divergence-based analysis.

  • Isochrony. We formally show that our sampler is isochronous: its running time is independent of the inputs \(\sigma , \mu \) and of the output z. Isochrony is weaker than being constant-time, but it nevertheless suffices to argue security against timing attacks. Interestingly, our proof of isochrony relies on techniques and notions that are common in lattice-based cryptography: the smoothing parameter, the Rényi divergence, etc. In particular, the isochrony of our sampler is implied by parameters dictated by the current state of the art for black-box security of lattice-based schemes.

Our second contribution stems from a simple observation: implementations of otherwise perfectly secure schemes have failed in spectacular ways through implementation weaknesses, a common one being randomness failure. This is epitomized by nonce reuse in ECDSA, which led to the jailbreaking of Sony PS3 consoles and the exposure of Bitcoin wallets [8]. The post-quantum community is aware of this point of failure but does not seem to have converged on a systematic way to mitigate it [46]. Randomness failures have been manually discovered and fixed in implementations of Dilithium [45], Falcon [47, 51] and other schemes; the case of Falcon is particularly relevant to us, because the sampler implemented was the one described in this document!

Concretely, we take a first step toward systematically detecting such failures: we propose a statistical test suite called SAGA for validating discrete Gaussians. This test suite can check univariate samples; we therefore use it to validate our own implementation of our sampler. It can also check multivariate Gaussians, which enables validation at a higher level: if the base sampler over the integers is validated but the output of the high-level scheme does not behave like a multivariate Gaussian even though the theory predicts it should, this indicates an implementation mistake elsewhere in the scheme (or, in the worst case, that the theory itself is deficient). We illustrate this with a simple example of a purposely deficient implementation of Falcon [53]; the suite can, however, be used for any other scheme sampling multivariate discrete Gaussians, including but not limited to [5, 12, 18, 25, 40]. The test suite is publicly available at: https://github.com/PQShield/SAGA.

2 Related Works

In recent years, there has been a surge of works related to Gaussian sampling over the integers. Building on convolution techniques from [50], [42] proposed an arbitrary-center Gaussian sampler, as well as a statistical tool (the max-log distance) to analyse it. [3, 39, 52] revisited classical techniques with the Rényi divergence. Polynomial-based methods were further studied by [4, 52, 60]. The use of rounded Gaussians was proposed in [31]. Knuth-Yao’s DDG trees have been considered in [20, 32]. Lazy floating-point precision was studied in [16, 19]. We note that techniques dating back to von Neumann [57] allow generating (continuous) Gaussians elegantly using finite automata [2, 24, 33]. While these have been considered in the context of lattice-based cryptography [15, 17], they are also notoriously hard to make isochronous. Finally, [58] studied the previously cited techniques with the goal of minimizing their relative error.

3 Preliminaries

3.1 Gaussians

For \(\sigma , \mu \in \mathbb {R}\) with \(\sigma >0\), we call Gaussian function of parameters \(\sigma ,\mu \), and denote by \(\rho _{\sigma ,\mu }\), the function defined over \(\mathbb {R}\) as \(\rho _{\sigma ,\mu }(x) = \exp \left( -\frac{(x-\mu )^2}{2\sigma ^2}\right) \). Note that when \(\mu = 0\) we omit it in index notation, e.g. \(\rho _{\sigma }(x) = \rho _{\sigma ,0}(x)\). The parameter \(\sigma \) (resp. \(\mu \)) is often called the standard deviation (resp. center) of the Gaussian. In addition, for any countable set \(S \subsetneq \mathbb {R}\) we abusively denote by \(\rho _{\sigma ,\mu }(S)\) the sum \(\sum _{z \in S} \rho _{\sigma ,\mu }(z)\). When this sum is finite, we call Gaussian distribution of parameters \(\sigma ,\mu \), and denote by \(D_{S,\sigma ,\mu }\), the distribution over S defined by \(D_{S,\sigma ,\mu }(z) = \rho _{\sigma ,\mu }(z)/\rho _{\sigma ,\mu }(S)\). Here too, when \(\mu = 0\) we omit it in index notation, e.g. \(D_{S,\sigma ,0}(z) = D_{S,\sigma }(z)\). We use the notation \(\mathcal {B}_p\) to denote the Bernoulli distribution of parameter p.
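To make the definitions concrete, the following short Python sketch (our own illustration, not part of any reference implementation) computes \(D_{S,\sigma ,\mu }\) over a finite support exactly as defined above.

    import math

    def rho(x, sigma, mu=0.0):
        """Gaussian function rho_{sigma,mu}(x) = exp(-(x - mu)^2 / (2 sigma^2))."""
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

    def discrete_gaussian(S, sigma, mu=0.0):
        """Return D_{S,sigma,mu} as a dict mapping z to rho_{sigma,mu}(z) / rho_{sigma,mu}(S)."""
        total = sum(rho(z, sigma, mu) for z in S)        # rho_{sigma,mu}(S)
        return {z: rho(z, sigma, mu) / total for z in S}

    # Example: D_{Z,sigma,mu} truncated to a wide interval around the center.
    pmf = discrete_gaussian(range(-50, 51), sigma=1.5, mu=0.3)
    assert abs(sum(pmf.values()) - 1.0) < 1e-12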

3.2 Rényi Divergence

We recall the definition of the Rényi divergence, which we use extensively in our security proofs.

Definition 1 (Rényi Divergence)

Let \(\mathcal {P}\), \(\mathcal {Q}\) be two distributions such that \(Supp(\mathcal {P})\subseteq Supp(\mathcal {Q})\). For \(a \in (1,+\infty )\), we define the Rényi divergence of order a by

$$\begin{aligned} R_a(\mathcal {P},\mathcal {Q})=\left( \sum _{x\in Supp(\mathcal {P})} \frac{\mathcal {P}(x)^a}{\mathcal {Q}(x)^{a-1}} \right) ^{\frac{1}{a-1}}. \end{aligned}$$

In addition, we define the Rényi divergence of order \(+\infty \) by

$$\begin{aligned} R_{\infty }(\mathcal {P},\mathcal {Q})= \max _{x\in Supp(\mathcal {P})} \frac{\mathcal {P}(x)}{\mathcal {Q}(x)}. \end{aligned}$$

The Rényi divergence is not a distance; for example, it is neither symmetric nor does it verify the triangle inequality, which makes it less convenient to use than the statistical distance. On the other hand, it does verify cryptographically useful properties, a few of which are listed below.

Lemma 1

([3]). For two distributions \(\mathcal {P}, \mathcal {Q}\) and two families of distributions \((\mathcal {P}_i)_i, (\mathcal {Q}_i)_i\), the Rényi divergence verifies these properties:

  • Data processing inequality. For any function f, \(R_a(f(\mathcal {P}),f(\mathcal {Q})) \le R_a(\mathcal {P},\mathcal {Q})\).

  • Multiplicativity. \(R_a(\prod _i \mathcal {P}_i, \prod _i \mathcal {Q}_i) = \prod _i R_a(\mathcal {P}_i,\mathcal {Q}_i)\).

  • Probability preservation. For any event \(E \subseteq \text {Supp} (\mathcal {Q})\) and \(a \in (1,+\infty )\),

    $$\begin{aligned} \mathcal {Q}(E)&\ge \mathcal {P}(E)^{\frac{a}{a-1}}/R_a(\mathcal {P}, \mathcal {Q}),\end{aligned}$$
    (1)
    $$\begin{aligned} \mathcal {Q}(E)&\ge \mathcal {P}(E)/R_\infty (\mathcal {P}, \mathcal {Q}). \end{aligned}$$
    (2)

The following lemma shows that a bound of \(\delta \) on the relative error between two distributions implies a bound \(O(a \delta ^2)\) on the log of the Rényi divergence (as opposed to a bound \(O(\delta )\) on the statistical distance).

Lemma 2

(Lemma 3 of [52]). Let \(\mathcal {P},\mathcal {Q}\) be two distributions with the same support \(\varOmega \). Suppose that the relative error between \(\mathcal {P}\) and \(\mathcal {Q}\) is bounded: \(\exists \delta > 0\) such that \(\left| \frac{\mathcal {P}}{\mathcal {Q}} - 1 \right| \le \delta \) over \(\varOmega \). Then, for \(a \in (1,+\infty )\):

$$\begin{aligned} R_a(\mathcal {P}, \mathcal {Q}) \le \left( 1 + \frac{a(a-1)\delta ^2}{2(1-\delta )^{a+1}} \right) ^{\frac{1}{a-1}} \underset{\delta \rightarrow 0}{\sim } 1 + \frac{a \delta ^2}{2} \end{aligned}$$
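As a quick numeric illustration (ours, on a toy distribution), one can check Lemma 2 empirically: perturbing a distribution by a relative error \(\delta \) yields a Rényi divergence barely above 1, in line with the \(1 + a\delta ^2/2\) asymptotics.

    def renyi(P, Q, a):
        """R_a(P, Q) for pmfs given as dicts, computed stably as sum_x Q(x) * (P(x)/Q(x))^a."""
        s = sum(Q[x] * (P[x] / Q[x]) ** a for x in P)
        return s ** (1.0 / (a - 1))

    delta = 2.0 ** -20
    Q = {0: 0.5, 1: 0.25, 2: 0.25}
    P = {0: 0.5 * (1 + delta), 1: 0.25 * (1 - delta), 2: 0.25}
    norm = sum(P.values())
    P = {x: p / norm for x, p in P.items()}            # renormalize

    a = 513                                            # e.g. a = 2 * 256 + 1
    d = max(abs(P[x] / Q[x] - 1) for x in P)           # actual relative error
    bound = (1 + a * (a - 1) * d ** 2 / (2 * (1 - d) ** (a + 1))) ** (1 / (a - 1))
    assert renyi(P, Q, a) <= bound <= 1 + 1.01 * a * d ** 2 / 2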

3.3 Smoothing Parameter

For \(\epsilon > 0\), the smoothing parameter \(\eta _\epsilon (\varLambda )\) of a lattice \(\varLambda \) is the smallest value \(\sigma > 0\) such that \(\rho _{\frac{1}{\sigma \sqrt{2\pi }}}(\varLambda ^\star \backslash \{\mathbf {0}\}) \le \epsilon \), where \(\varLambda ^\star \) denotes the dual of \(\varLambda \). In the literature, some definitions of the smoothing parameter scale our definition by a factor \(\sqrt{2\pi }\). It is shown in [41] that \(\eta _{\epsilon }(\mathbb {Z}^n) \le \eta _{\epsilon }^{+}(\mathbb {Z}^n)\), where:

$$\begin{aligned} \eta _{\epsilon }^{+}(\mathbb {Z}^n) = \frac{1}{\pi } \sqrt{ \frac{1}{2} \log \left( 2n \left( 1 + \frac{1}{\epsilon }\right) \right) }. \end{aligned}$$
(3)
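The bound (3) is straightforward to evaluate numerically; the helper below (ours) does so, with the value of \(\epsilon \) chosen purely for illustration.

    import math

    def eta_plus(n, eps):
        """Upper bound (3) on the smoothing parameter of Z^n (natural logarithm)."""
        return math.sqrt(0.5 * math.log(2 * n * (1 + 1 / eps))) / math.pi

    print(eta_plus(512, 2.0 ** -36))   # ~1.27 for these illustrative parameters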

3.4 Isochronous Algorithms

We now give a semi-formal definition of isochronous algorithms.

Definition 2

Let \(\mathcal {A}\) be a (probabilistic or deterministic) algorithm with set of input variables \(\mathcal {I}\), set of output variables \(\mathcal {O}\), and let \(\mathcal {S}\subseteq \mathcal {I}\cup \mathcal {O}\) be the set of sensitive variables. We say that \(\mathcal {A}\) is perfectly isochronous with respect to \(\mathcal {S}\) if its running time is independent of any variable in \(\mathcal {S}\).

In addition, we say that \(\mathcal {A}\) is statistically isochronous with respect to \(\mathcal {S}\) if there exists a distribution \(\mathcal {D}\) independent of all the variables in \(\mathcal {S}\) such that the running time of \(\mathcal {A}\) is statistically close (for a clearly identified divergence) to \(\mathcal {D}\).

We note that one can also define a notion of computationally isochronous algorithm, for which it is computationally hard to recover the sensitive variables even given the distribution of the running time of the algorithm. We can even come up with a contrived example of such an algorithm: let \(\mathcal {A}()\) select, in an isochronous manner, an x uniformly in a space of min-entropy \(\ge \lambda \), compute \(y = H(x)\) and wait a time y before outputting x. One can show that recovering x given the running time of \(\mathcal {A}\) is hard if H is a one-way function.

4 The Sampler

In this section, we describe our new sampler with arbitrary standard deviation and center. The main assumption of our setting is that the standard deviation is bounded and that the center lies in [0, 1]. In other words, denoting the upper and lower bounds on the standard deviation by \(\sigma _\mathrm{max}> \sigma _\mathrm{min}> 0\), we present an algorithm that samples from the distribution \(D_{\mathbb {Z},\sigma ,\mu }\) for any \(\mu \in [0,1]\) and \(\sigma _\mathrm{min}\le \sigma \le \sigma _\mathrm{max}\).

Our sampling algorithm is called \(\textsf {SamplerZ} \) and is described in Algorithm 1. We denote by \(\textsf {BaseSampler} \) an algorithm that samples an element from the fixed half-Gaussian distribution \(D_{\mathbb {Z}^+,\sigma _\mathrm{max}}\). The first step consists in using \(\textsf {BaseSampler} \). The obtained sample \(z_0\) is then transformed into \(z = b + (2b - 1) \cdot z_0\), where b is a bit drawn uniformly in \(\{0,1\}\). Let us denote by \(BG_{\sigma _\mathrm{max}}\) the distribution of z: it is a discrete bimodal half-Gaussian with centers 0 and 1. More formally,

$$\begin{aligned} BG_{\sigma _\mathrm{max}}(z) = \frac{1}{2}\left\{ \begin{array}{ll} D_{\mathbb {Z}^+,\sigma _\mathrm{max}}(-z) &{} \hbox {if } z \le 0\\ D_{\mathbb {Z}^+,\sigma _\mathrm{max}}(z-1) &{} \hbox {if } z \ge 1. \end{array} \right. \end{aligned}$$
(4)
[Algorithm 1: \(\textsf {SamplerZ} \)]

Then, to recover the desired distribution \(D_{\mathbb {Z},\sigma ,\mu }\) for the inputs \((\sigma , \mu )\), one may apply the rejection sampling technique classically used in lattice-based schemes [35] and accept z with probability

$$\begin{aligned} \frac{D_{\mathbb {Z},\sigma ,\mu }(z)}{BG_{\sigma _\mathrm{max}}(z)}&= \left\{ \begin{array}{ll} \exp \left( \frac{z^2}{2\sigma _\mathrm{max}^2} - \frac{(z-\mu )^2}{2\sigma ^2} \right) &{} \hbox {if } z \le 0\\ \exp \left( \frac{(z-1)^2}{2\sigma _\mathrm{max}^2} - \frac{(z-\mu )^2}{2\sigma ^2} \right) &{} \hbox {if } z \ge 1 \end{array}\right. \\&= \exp \left( \frac{z_0^2}{2\sigma _\mathrm{max}^2} - \frac{(z-\mu )^2}{2\sigma ^2} \right) . \end{aligned}$$

The element inside the \(\exp \) is computed in step 5. Next, we introduce an algorithm denoted \(\textsf {AcceptSample} \), which performs the rejection sampling (Algorithm 2): using \(\textsf {ApproxExp} \), an algorithm that computes \(\exp (\cdot )\), it returns a Bernoulli sample with the corresponding probability. In fact, for isochrony reasons detailed in Sect. 6, this acceptance probability is rescaled by a factor \(\frac{\sigma _\mathrm{min}}{\sigma }\). As z follows the \(BG_{\sigma _\mathrm{max}}\) distribution, the final distribution of \(\textsf {SamplerZ} (\sigma , \mu )\) after rejection sampling is proportional to \(\frac{\sigma _\mathrm{min}}{\sigma } \cdot D_{\mathbb {Z},\sigma , \mu }\), which after normalization is exactly \(D_{\mathbb {Z},\sigma , \mu }\). Thus, with this construction, one can derive the following proposition.
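The following Python sketch summarizes Algorithms 1 and 2 end-to-end. It is for exposition only: it uses math.exp and a variable-time table search where a deployed implementation would use the isochronous \(\textsf {ApproxExp} \) and \(\textsf {BaseSampler} \) of Sects. 5.1 and 5.2.

    import math
    import random

    SIGMA_MAX = 1.8205
    W = 19  # tailcut of the base sampler, cf. Sect. 5.2

    # CDF of D_{Z^+, sigma_max}, truncated to {0, ..., W-1}.
    _rho = [math.exp(-z * z / (2 * SIGMA_MAX ** 2)) for z in range(W)]
    _cdf = [sum(_rho[: i + 1]) / sum(_rho) for i in range(W)]

    def base_sampler():
        """z0 ~ D_{Z^+, sigma_max} (tail-cut); variable-time search, for clarity only."""
        u = random.random()
        z0 = 0
        while _cdf[z0] < u:
            z0 += 1
        return z0

    def sampler_z(sigma, mu, sigma_min):
        """z ~ D_{Z, sigma, mu}, for mu in [0, 1] and sigma_min <= sigma <= SIGMA_MAX."""
        while True:
            z0 = base_sampler()                  # half Gaussian
            b = random.getrandbits(1)            # uniform bit
            z = b + (2 * b - 1) * z0             # z ~ BG_{sigma_max}
            x = z0 * z0 / (2 * SIGMA_MAX ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)
            # AcceptSample: Bernoulli of parameter (sigma_min / sigma) * exp(x); x <= 0.
            if random.random() < (sigma_min / sigma) * math.exp(x):
                return z

Rejecting with the rescaled probability leaves the output distribution untouched (Proposition 1 below) while making the acceptance rate, and hence the running time, independent of \(\sigma \).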

Proposition 1 (Correctness)

Assume that all the uniform distributions are perfect, that \(\textsf {BaseSampler} = D_{\mathbb {Z}^+,\sigma _\mathrm{max}}\) and that \(\textsf {ApproxExp} = \exp \). Then the construction of \(\textsf {SamplerZ} \) (in Algorithms 1 and 2) satisfies \(\textsf {SamplerZ} (\sigma , \mu )=D_{\mathbb {Z},\sigma , \mu }\).

In practical implementations, one cannot achieve perfect distributions; only \(\textsf {BaseSampler} \approx D_{\mathbb {Z}^+,\sigma _\mathrm{max}}\) and \(\textsf {ApproxExp} \approx \exp \) are achievable. Section 6 proves that, under certain conditions on \(\textsf {BaseSampler} \) and \(\textsf {ApproxExp} \) and on the number of sampling queries, the final distribution remains indistinguishable from \(D_{\mathbb {Z},\sigma , \mu }\).

Table 1. Number of calls to SamplerZ, BaseSampler and \(\textsf {ApproxExp} \)

5 Proof of Security

Table 1 gives the notations for the number of calls to SamplerZ, BaseSampler and \(\textsf {ApproxExp} \), along with the values considered when the sampler is instantiated for Falcon. Due to the rejection sampling in step 6, the while loop runs for an a priori unbounded number of iterations. We will show in Lemma 3 that the number of iterations follows a geometric law of parameter \(\approx \frac{\sigma _\mathrm{min}\cdot \sqrt{2 \pi } }{2 \cdot \rho _{\sigma _\mathrm{max}}(\mathbb {Z}^{+})}\). We denote by \(\mathcal {N}_\mathrm{iter} \) a heuristic maximum number of iterations. By a central limit argument, \(\mathcal {N}_\mathrm{iter} \) will only be marginally higher than the expected number of iterations. To instantiate the values \(Q_\mathrm{\exp } = Q_\mathrm{bs} = \mathcal {N}_\mathrm{iter} \cdot Q_\mathrm{samplZ} \) for the example of Falcon, we take \(\mathcal {N}_\mathrm{iter} = 2\); indeed, the expected number of iterations \(\frac{2 \cdot \rho _{\sigma _\mathrm{max}}(\mathbb {Z}^{+})}{\sigma _\mathrm{min}\cdot \sqrt{2 \pi }}\) is at most 2 for Falcon’s parameters.

The following theorem estimates the security of SamplerZ; it is independent of the values chosen for the number of calls.

Theorem 1

(Security of SamplerZ). Let \(\lambda _{\small \textsc {Ideal}}\) (resp. \(\lambda _{\small \textsc {Real}}\)) be the security parameter of an implementation using the perfect distribution \(D_{\mathbb {Z},\sigma , \mu }\) (resp. the real distribution SamplerZ). If both of the following conditions are satisfied, at most two bits of security are lost; in other words, \(\lambda _{\small \textsc {Real}} \ge \lambda _{\small \textsc {Ideal}} - 2\).

$$\begin{aligned} \forall x <0,\ \left| \frac{\textsf {ApproxExp} (x)-\exp (x)}{\exp (x)}\right| \le \sqrt{\frac{2 \cdot \lambda _{\small \textsc {Real}}}{2\cdot \left( 2 \cdot \lambda _{\small \textsc {Real}}+1\right) ^2 \cdot Q_{\exp }}} \end{aligned}$$
(Cond. (1))
$$\begin{aligned} R_{2 \cdot \lambda _{\small \textsc {Real}}+1}\left( \textsf {BaseSampler}, D_{\mathbb {Z}^+,\sigma _\mathrm{max}} \right) \le 1+\frac{1}{4\cdot Q_\mathrm{bs}} \end{aligned}$$
(Cond. (2))

The proof of this Theorem is given in the full version of our paper [30].

To get concrete numerical values, we assume that 256 bits of security are claimed for the original scheme; thus 254 bits of security are claimed for the real implementation. For an implementation of Falcon, the numerical values are then

$$\begin{aligned} \sqrt{\frac{2 \cdot \lambda _{\small \textsc {Real}}}{2\cdot \left( 2 \cdot \lambda _{\small \textsc {Real}}+1\right) ^2 \cdot Q_{\exp }}} \approx 2^{-43} \quad \text { and }\quad \frac{1}{4\cdot Q_\mathrm{bs}} \approx 2^{-78}. \end{aligned}$$

5.1 Instantiating the \(\textsf {ApproxExp} \)

To achieve condition (1) with \(\textsf {ApproxExp} \), we use a polynomial approximation of the exponential on \([-\ln (2),0]\). Indeed, one can reduce the argument x modulo \(\ln (2)\), writing \(x = - r - s \ln (2)\) with \(r \in [0, \ln 2)\); computing the exponential then amounts to computing \(\exp (x) = 2^{-s}\exp (-r)\). Since \(s \ge 64\) happens very rarely, s can be saturated at 63 to avoid overflow, without any practical loss of precision.
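A compact sketch of this range reduction (ours; floating-point for readability, whereas an actual implementation works in fixed precision):

    import math

    LN2 = math.log(2)

    def approx_exp(x, poly):
        """exp(x) for x <= 0: write x = -r - s*ln(2) with r in [0, ln 2) and s >= 0."""
        s = int(-x // LN2)
        r = -x - s * LN2
        s = min(s, 63)                 # saturate s at 63: 2^-64 is already negligible
        return poly(-r) * 2.0 ** -s    # poly approximates exp on [-ln(2), 0]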

We use the polynomial approximation tool provided in GALACTICS [4]. This tool generates polynomial approximations that can be computed in fixed precision, with a chosen coefficient size and degree. As an example, for 32-bit coefficients and degree 10, we obtain a polynomial \(P_\mathrm{gal}(x) = \sum _{i=0}^{10} a_i x^i\), with:

[Figure: the 32-bit fixed-point coefficients \(a_0, \dots , a_{10}\) of \(P_\mathrm{gal}\), as generated by GALACTICS]

For any \(x\in [-\ln (2), 0]\), \(P_\mathrm{gal}\) verifies \(\left| \frac{P_\mathrm{gal}(x) - \exp (x)}{\exp (x)}\right| \le 2^{-47}\), which is sufficient to satisfy condition (1) for the Falcon implementation.

Flexibility on the Implementation of the Polynomial. Depending on the platform and the requirement for the signature, one can adapt the polynomial to fit their constraints. For example, if one wants to minimize the number of multiplications, implementing the polynomial with Horner’s form is the best option. The polynomial is written in the following form:

$$\begin{aligned} P_\mathrm{gal}(x) = a_0 + x (a_1 + x (a_2 + x (a_3 + x (a_4 + x (a_5 + x (a_6 + x (a_7 + x (a_8 + x (a_9 + x a_{10}))))))))). \end{aligned}$$

Evaluating \(P_\mathrm{gal}\) is then done serially as follows:
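(The original listing did not survive extraction; the Python stand-in below, ours, shows the intended serial evaluation, with the fixed-point coefficients \(a_0, \dots , a_{10}\) left abstract since they are generated by GALACTICS.)

    def p_gal_horner(a, x):
        """Serial (Horner) evaluation of P_gal: 10 multiplications, depth 10."""
        acc = a[10]
        for i in range(9, -1, -1):
            acc = acc * x + a[i]
        return acc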

Some architectures with small register sizes may be faster when the size of the polynomial coefficients is minimized; the GALACTICS tool can then be used to generate a polynomial with smaller coefficients. For example, we propose an alternative polynomial approximation on \([0, \frac{\ln 2}{64}]\) with 25-bit coefficients.

$$\begin{aligned} P= 1+ x + 2^{-1}x^{2} + 699051 \cdot 2^{-22} \cdot x^{3} + 699299 \cdot 2^{-24} \cdot x^{4} + 605552 \cdot 2^{-26} \cdot x^{5} \end{aligned}$$

To recover the polynomial approximation on \([0, \ln 2]\), we compute \(P(\frac{x}{64})^{64}\).
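Recovering the exponential on \([0, \ln 2]\) from P then costs six successive squarings, as in this sketch (ours):

    def exp_from_small_range(P, x):
        """exp(x) on [0, ln 2], given P approximating exp on [0, ln(2)/64]."""
        y = P(x / 64.0)
        for _ in range(6):             # y <- y^2 six times, i.e. y^64
            y = y * y
        return y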

Some architectures enjoy some level of parallelism, in which case it is desirable to minimise the depth of the circuit computing the polynomial. Writing \(P_\mathrm{gal}\) in Estrin’s form [22] is helpful in this regard.
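For instance, a depth-reduced evaluation of the same degree-10 polynomial in Estrin’s form might look as follows (our sketch; the products within each level are independent and can run in parallel, bringing the depth from 10 down to about \(\log _2 10 \approx 4\) levels):

    def p_gal_estrin(a, x):
        x2 = x * x
        x4 = x2 * x2
        # level 1: independent degree-1 blocks
        b0 = a[0] + a[1] * x
        b2 = a[2] + a[3] * x
        b4 = a[4] + a[5] * x
        b6 = a[6] + a[7] * x
        b8 = a[8] + a[9] * x
        # level 2: degree-3 blocks
        c0 = b0 + b2 * x2
        c4 = b4 + b6 * x2
        # levels 3-4: final combination
        return c0 + (c4 + (b8 + a[10] * x2) * x4) * x4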

5.2 Instantiating the BaseSampler

To achieve condition (2) with \(\textsf {BaseSampler} \), we rely on a cumulative distribution table (CDT). We precompute a table of the cumulative distribution function of \(D_{\mathbb {Z}^+,\sigma _\mathrm{max}}\) with a certain precision; then, to produce a sample, we generate a random value in [0, 1] with the same precision and return the index of the last entry in the table that is greater than that value. In variable time, the sampling can be done rather efficiently with a binary search, but a constant-time implementation has essentially no choice but to read the entire table each time and carry out each comparison. This process is summed up in Algorithm 3; a sketch is given after the parameter-derivation steps below. The parameters w and \(\theta \) are respectively the number of elements of the CDT and the precision of its coefficients. Let \(a = 2 \cdot \lambda _{\small \textsc {Real}}+1\). To derive the parameters w and \(\theta \), we use a simple script that, given \(\sigma _\mathrm{max}\) and \(\theta \) as inputs:

  1. Computes the smallest tailcut w such that the Rényi divergence \(R_a\) between the ideal distribution \(D_{\mathbb {Z}^+, \sigma _\mathrm{max}}\) and its restriction to \(\{0, \dots , w\}\) (noted \(D_{[w], \sigma _\mathrm{max}}\)) verifies \(R_a( D_{[w], \sigma _\mathrm{max}}, D_{\mathbb {Z}^+, \sigma _\mathrm{max}}) \le 1 +{1}/({4Q_\mathrm{bs}})\);

  2. Rounds the probability density table (PDT) of \(D_{[w], \sigma _\mathrm{max}}\) with \(\theta \) bits of absolute precision. This rounding is done “cleverly” by truncating all the PDT values except the largest:

    • for \(z \ge 1\), the value \(D_{[w], \sigma _\mathrm{max}}(z)\) is truncated: \(PDT(z) = 2^{-\theta } \left\lfloor 2^{\theta } D_{[w], \sigma _\mathrm{max}}(z) \right\rfloor \);

    • in order to have a probability distribution, \(PDT(0) = 1 - \sum _{z \ge 1} PDT(z)\);

  3. Derives the CDT from the PDT and computes the final \(R_a(\textsf {SampleCDT} _{w,\theta }, D_{\mathbb {Z}^+, \sigma _\mathrm{max}})\).

Taking \(\sigma _\mathrm{max}= 1.8205\) and \(\theta = 72\) as inputs, we found \(w=19\).

[Algorithm 2: \(\textsf {AcceptSample} \)]
[Algorithm 3: \(\textsf {SampleCDT} \)]
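A Python sketch (ours) of the full-table scan of Algorithm 3, using an ascending CDT storing \(\lfloor 2^{\theta } \cdot CDF(i) \rfloor \) (an equivalent convention to the one described above). Python integers are not actually constant-time, so this only illustrates the data-independent access pattern that a real fixed-width implementation enforces.

    import secrets

    THETA = 72  # bits of precision of the CDT entries

    def sample_cdt(cdt):
        """cdt[i] = floor(2^theta * CDF(i)); returns a sample of the tabulated distribution."""
        u = secrets.randbits(THETA)
        z0 = 0
        for c in cdt:                          # always scans all w entries
            z0 += ((c - u - 1) >> THETA) & 1   # adds 1 iff c <= u
        return z0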

Our experiment showed that for any \(a \ge 509\), \(R_a(\textsf {SampleCDT} _{w=19,\theta = 72}, D_{\mathbb {Z}^+, \sigma _\mathrm{max}}) \le 1 + 2^{-80} \le 1 + \frac{1}{4 Q_\mathrm{bs}}\), which validates condition (2) for the Falcon implementation.

6 Analysis of Resistance Against Timing Attacks

In this section, we show that Algorithm 1 is impervious to timing attacks. We formally prove that it is isochronous with respect to \(\sigma ,\mu \) and the output z (in the sense of Definition 2). We first prove a technical lemma showing that the number of iterations of the while loop of Algorithm 1 is (almost) independent of \(\sigma , \mu , z\).

Lemma 3

Let \(\epsilon \in (0, 1)\), \(\mu \in [0, 1]\) and let \(\sigma _{\min }, \sigma , \sigma _\mathrm{max}\) be standard deviations such that \(\eta _{\epsilon }^{+}(\mathbb {Z}^n) = \sigma _{\min } \le \sigma \le \sigma _\mathrm{max}\). Let \(p = \frac{\sigma _\mathrm{min}\cdot \sqrt{2 \pi } }{2 \cdot \rho _{\sigma _\mathrm{max}}(\mathbb {Z}^{+})}\). The number of iterations of the \(\mathbf {while}\) loop in \(\textsf {SamplerZ} (\sigma , \mu )\) follows a geometric law of parameter

$$\begin{aligned} P_\mathbf {true}(\sigma , \mu ) \in p \cdot \left[ 1, 1 + \frac{(1 + 2^{-80}) \epsilon }{n} \right] . \end{aligned}$$

The proof of Lemma 3 can be found in the full version of our paper [30].

Next, we show that Algorithm 1 is perfectly isochronous with respect to z and statistically isochronous (for the Rényi divergence) with respect to \(\sigma , \mu \).

Theorem 2

Let \(\epsilon \in (0, 1)\), \(\mu \in \mathbb {R}\), let \(\sigma _{\min }, \sigma , \sigma _\mathrm{max}\) be standard deviations such that \(\eta _{\epsilon }^{+}(\mathbb {Z}^n) = \sigma _{\min } \le \sigma \le \sigma _\mathrm{max}\), and let \(p = \frac{\sigma _\mathrm{min}\cdot \sqrt{2 \pi } }{2 \cdot \rho _{\sigma _\mathrm{max}}(\mathbb {Z}^{+})} \) be a constant in (0, 1). Suppose that the elementary operations \(\{+, -, \times , /\}\) over integer and floating-point numbers are isochronous. The running time of Algorithm 1 follows a distribution \(T_{\sigma , \mu }\) such that:

$$\begin{aligned} R_a(T_{\sigma , \mu } \Vert T)&\lesssim 1 + \frac{a\epsilon ^2 \max (1, \frac{1 - p}{p})^2}{n^2 (1 - p)} = 1 + O\left( \frac{a\epsilon ^2}{n^2} \right) \end{aligned}$$

for some distribution T independent of the inputs \(\sigma , \mu \) and of the output z.

Finally, we leverage Theorem 2 to prove that the running time of \(\textsf {SamplerZ} (\sigma , \mu )\) does not help an adversary break a cryptographic scheme. We consider that the adversary has access to some function \(g(\textsf {SamplerZ} (\sigma , \mu ))\) as well as the running time of \(\textsf {SamplerZ} (\sigma , \mu )\): this is intended to capture the fact that, in practice, the output of \(\textsf {SamplerZ} (\sigma , \mu )\) is not given directly to the adversary, but first processed by some function. For example, in the signature scheme Falcon, samples are processed by algorithms that depend on the signer’s private key. On the other hand, we grant the adversary powerful timing-attack capabilities by letting it learn the exact running time of each call to \(\textsf {SamplerZ} (\sigma , \mu )\).

Corollary 1

Consider an adversary \(\mathcal {A}\) making \(Q_s\) queries to \(g(\textsf {SamplerZ} (\sigma , \mu ))\) for some randomized function g, and solving a search problem with success probability \(2^{-\lambda }\) for some \(\lambda \ge 1\). With the notations of Theorem 2, suppose that \(\max (1, \frac{1 - p}{p})^2 \le n (1 - p)\) and \(\epsilon \le \frac{1}{\sqrt{\lambda Q_s}}\). Learning the running time of each call to \(\textsf {SamplerZ} (\sigma , \mu )\) does not increase the success probability of \(\mathcal {A}\) by more than a constant factor.

The proof of Corollary 1 can be found in the full version of our paper [30]. A nice feature of Corollary 1 is that the conditions required to make it effective are already met in practice, since they are also required for the black-box security of cryptographic schemes. For example, it is standard to set \(\sigma \ge \eta _{\epsilon }^{+}(\mathbb {Z}^n)\).

Impact of the Scaling Factor. The scaling factor \(\frac{\sigma _\mathrm{min}}{\sigma } \le \frac{\sigma _\mathrm{min}}{\sigma _\mathrm{max}}\) is crucial in making our sampler isochronous, as it decorrelates the running time \(T_{\sigma , \mu }\) from \(\sigma \). However, it also impacts \(T_{\sigma , \mu }\): one can easily show that the expected running time is inversely proportional to the scaling factor, so it is desirable to keep this factor as close to 1 as possible. Its maximal value actually depends on the cryptographic scheme in which our sampler is used. In the full version of our paper [30], we show that for the signature scheme Falcon, \(\frac{\sigma _\mathrm{min}}{\sigma _\mathrm{max}} \ge 1.17^{-2} \approx 0.73\), so the impact of the scaling factor is limited. Moreover, one can easily show that for Peikert’s sampler [48], the scaling factor is equal to 1 and has no impact.

7 “Err on the Side of Gaussian”

This section focuses on ensuring correct and verified implementations of our proposed isochronous Gaussian sampler. The motivation is to minimize implementation bugs, such as the implementation issues in Falcon [47, 51], or famous vulnerabilities like Heartbleed (CVE-2014-0160) and ROCA [44] (CVE-2017-15361). We propose a test suite named SAGA (Statistically Acceptable GAussians) to verify that univariate or multivariate Gaussian samples are correctly distributed. At the very least, SAGA can act as a “sanity check” for implementers and practitioners. Furthermore, SAGA is designed to run in a generic fashion, agnostic to the technique used, requiring as input only a list of univariate (i.e., outputs of \(\textsf {SamplerZ} \)) or multivariate (i.e., a set of signatures) Gaussian samples. Although we evaluate SAGA by applying it to Falcon, it is applicable to any lattice-based cryptographic scheme requiring Gaussian sampling, such as other GPV-based signatures [5, 12], FrodoKEM [43], identity-based encryption [10, 18], and fully homomorphic encryption [54].

7.1 Univariate Tests

The statistical tests we implement here are inspired by a previous test suite proposal called GLITCH [29]. We use standard statistical tools to validate that a Gaussian sampler operates with the correct mean, standard deviation, skewness, and kurtosis, and finally we check whether it passes a chi-square normality test. Skewness and kurtosis are descriptors of a normal distribution that respectively measure its symmetry and peakedness. To run the full statistical analysis of these tests, we created a Python class, UnivariateSamples, which takes as initialization arguments the expected mean (mu), the expected standard deviation (sigma), and the list of observed univariate Gaussian samples (data). An example of how this works, as well as its output, is shown in the full version of our paper [30].
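A hedged usage sketch (ours) follows; the class name and arguments are as described above, but the module layout and exact report format of the released suite may differ.

    # Validate 100,000 outputs of a sampler against the expected mu and sigma.
    # `my_sampler` is a stand-in for whatever integer Gaussian sampler is under test.
    from saga import UnivariateSamples   # assumed module layout

    mu, sigma = 0.5, 1.7
    data = [my_sampler(sigma, mu) for _ in range(100000)]
    res = UnivariateSamples(mu, sigma, data)
    print(res)   # reports mean, stddev, skewness, kurtosis and a chi-square verdict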

7.2 Multivariate Tests

This section details multivariate normality tests. The motivation for these tests is to detect situations where the base Gaussian sampler over the integers is correctly implemented, yet the high-level scheme (e.g. a signature scheme) uses it incorrectly and ends up with a defective multivariate Gaussian.

Multivariate Normality. There are a number of statistical tests which evaluate the normality of multivariate distributions. We found that the multivariate normality tests predominantly used in other fields [13, 28, 38] suffer from size and scaling issues. That is, the large sample sizes we expect to use, combined with the poor power properties of these tests, make a type II error highly likely. In fact, we implemented the Mardia [38] and Henze-Zirkler [28] tests and found that, although they worked for small sample sizes, they produced false negatives for sample sizes \(\ge 50\) even in small dimensions \(n=64\).

However, the Doornik-Hansen test [14] minimises these issues by using transformed versions of the skewness and kurtosis of the multivariate data, increasing the test’s power. We also note that it is much faster (essentially linear in the sample size) than [28, 38] (essentially quadratic in the sample size). As with the univariate tests, we created a Python class, denoted MultivariateSamples, which outputs four results: two based on the covariance matrix, and two based on the data’s normality. An example of how this works, as well as its output, is shown in the full version of our paper [30].

A Glitch in the (Covariance) Matrix. Our second multivariate test asks the following question: how could someone correctly implement the base sampler, yet subsequently fail to use it properly? There is no universal answer; one usually has to rely on context, experience and common sense to establish the most likely way this could happen.

For example, in Falcon, univariate samples are linearly combined according to node values of a balanced binary tree computed at key generation (the Falcon tree). If there is an implementation mistake in the procedure computing the tree (during key generation) or when combining the samples (during signing), this effectively results in nodes of the Falcon tree being incorrect or omitted. Such mistakes have a very recognizable effect on the empiric covariance matrix of Falcon signatures: they make it look like a block Toeplitz matrix (Fig. 1b) instead of a (scaled) identity matrix, as in the nominal case (Fig. 1a).

We devised a test which discriminates block Toeplitz covariance matrices from the ones expected of spherical Gaussians. The key idea is rather simple: when adding the O(n) coefficients of a (block-)subdiagonal of the empiric covariance matrix, the absolute value of the sum grows in \(O(\sqrt{n})\) if the empiric covariance matrix converges to a scaled identity matrix, and in O(n) if it is block Toeplitz. We use this difference in growth to detect defective Gaussians. While we do not provide a formal proof of our test, in practice it detects Gaussians induced by defective Falcon trees reasonably well. We leave proving our test and providing analogues for other GPV-based schemes as interesting open questions.
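Our reading of this test, as a short sketch (assumptions: the signatures are given as the rows of a NumPy array, and the decision threshold is left to the caller):

    import numpy as np

    def subdiag_scores(samples):
        """samples: (N, n) array of multivariate Gaussian outputs (e.g. signatures)."""
        cov = np.cov(samples, rowvar=False)
        n = cov.shape[0]
        scores = []
        for k in range(1, n):
            d = np.diagonal(cov, offset=k)     # k-th subdiagonal, n - k entries
            scores.append(abs(d.sum()) / np.sqrt(n - k))
        return scores   # roughly flat for spherical Gaussians; spikes suggest Toeplitz structure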

Fig. 1. Empiric covariance matrices of Falcon signatures. Figure 1a corresponds to a correct implementation of Falcon; Figure 1b corresponds to an implementation where there is a mistake when constructing the Falcon tree.

Supplementary Tests. In the case where normality has been rejected, SAGA also provides a number of extra tests to aid in finding the issues. More details for this can be found in the full version of our paper [30].

8 Application and Limitations

Our sampler has been implemented by Pornin as part of the new isochronous implementation of Falcon [51]. This implementation can use floating-point hardware or AVX2 instructions when available, but also includes floating-point emulation code relying only on usual integer operations. On ARM Cortex M4 CPUs, which only support single-precision floating-point instructions, it provides assembly implementations of the core double-precision floating-point operations that are more than twice as fast as the generic emulation. As a result, our sampler can be implemented efficiently on embedded platforms as limited as Cortex M4 CPUs, whereas some other samplers (e.g. [32], due to its huge code size) are not compact enough to fit on such platforms.

We benchmarked this sampler implementation on a single Intel Core i7-6500U CPU core clocked at 2.5 GHz. In Table 2 we present the running times of our isochronous sampler. To compare with [59], we scale their numbers to the same 2.5 GHz clock. Note that for our sampler the number of samples per second is an average over \(1.2915 < \sigma \le 1.8502\), whereas [59] uses a fixed \(\sigma = 2\).

Table 2. Number of samples per second at 2.5 GHz for our sampler and [59].

In Table 3 we present the running times of the Falcon isochronous implementation [51] that contains our sampler, and compare it with a second, non-isochronous implementation that is nearly identical except for the base sampler (a faster, lazy CDT sampler) and the rejection sampling (which is not scaled by a constant). Compared to the non-isochronous implementation, the isochronous one is about 22% slower, but remains very competitive speed-wise.

Table 3. Falcon signature generation time at 2.5 GHz.

Cache-Timing Protection. Following this implementation of the proposed sampler also ensures cache-timing protection [23], as the design should bypass conditional branches by using a consistent access pattern (linear searching of the table) and have an isochronous runtime. This has been shown to be sufficient in implementations of Gaussian samplers in Frodo [7, 43].

Adapting to Other Schemes. A natural question is how our algorithms could be adapted to schemes other than Falcon, for example [5, 12, 18, 25, 40]. An obvious bottleneck seems to be the size of the CDT used in SampleCDT, which is linear in the standard deviation. For larger standard deviations, where linear searching becomes impractical, convolutions can be used to reduce \(\sigma \), and thus the runtime of the search algorithm [34, 50]. It would also be interesting to see if the DDG tree-based method of [32] scales better than our CDT-based method, in which case we would recommend it for larger standard deviations. On the other hand, once the base sampler is implemented, we do not see any obvious obstacle to implementing our whole framework. For example, [12], or Falcon using Peikert’s sampler [48], entail only a small constant number of standard deviations; the rejection step would therefore be very efficient once a base sampler for each standard deviation is implemented.

Advantages and Limitations. Our sampler has an acceptance rate \(\approx \frac{\sigma _{\min }}{\sigma _\mathrm{max}+ 0.4}\), making it especially suitable when \(\sigma _{\min }\) and \(\sigma _\mathrm{max}\) are close. In particular, our sampler is, so far, the fastest isochronous sampler for the parameters of Falcon. However, the larger the gap between \(\sigma _{\min }\) and \(\sigma _\mathrm{max}\), the lower the acceptance rate. In addition, our sampler uses a cumulative distribution table (CDT), which is accessed in an isochronous way. This table grows with \(\sigma _\mathrm{max}\), increasing both running time and memory usage. When \(\sigma _\mathrm{max}\) is large or far from \(\sigma _{\min }\), there exist faster isochronous samplers based on convolution [42] and rejection sampling [59] techniques.