1 Introduction

Recent years have seen an increased focus on lattice-based and other “quantum-resistant” public key cryptography for classical computers, fueled in part by official governmental warnings and endorsements [4, 5, 7, 27].

However, standardization efforts for these new algorithms are only starting [6], and post-quantum algorithms have clearly not yet reached the level of maturity expected from traditional cryptosystems (RSA or Elliptic Curve). Lattice-based public key algorithms have been criticized for their key and signature/ciphertext sizes, and for their lack of resistance against side-channel attacks. This paper offers a number of novel implementation techniques that address these issues.

The Bimodal Lattice Signature Scheme (BLISS) was proposed in CRYPTO 2013 by Ducas et al. [9] and offers some of the most compact and efficient lattice-based signatures currently available. A recent survey of practical lattice-based signature schemes [17] found BLISS to offer state-of-the-art performance and recommended it for practical use. The scheme has been implemented in FPGA hardware [30], on an 8-bit AVR target [24], and has been distributed as a part of the strongSwan IPsec VPN suite. Relevant hardware implementation techniques are also considered in [33]. We use BLISS as basis for our new lattice-based signature implementation, dubbed BLZZRD.

After describing the notation (Sect. 2.1) and the original BLISS scheme (Sect. 2.2), we move to our new discoveries:

  1. Grover’s Attack on Random Oracle can be mounted with a Quantum Computer against many signature algorithms. The attack can be applied against the random oracle component of BLISS (Sect. 3). We show how to modify the algorithm and its security parameters to counter this attack with small overhead (Sect. 3.1).

  2. Arithmetic Coding is ideally suited for compression of stationary distributions (Sect. 4.1) arising in Lattice cryptography, achieving near-optimal efficiency. We show how a Binary Arithmetic Coder (BAC) can be implemented with limited precision (Sect. 5), and used to compress lattice signatures (Sect. 5.2).

  3. Gaussian sampling can be implemented with a BAC. The same arithmetic coding tables, code, and circuitry can also be used to efficiently implement Gaussian sampling, another major operation required in Lattice cryptosystems (Sect. 5.4). The resulting sampler is optimal in the sense that it requires a minimum number of true random bits.

  4. Polynomial Blinding is an efficient ring-arithmetic side-channel countermeasure. It hides the details of time or energy consumption of individual arithmetic operations via randomization. Polynomial blinding can be implemented with a minimal impact on overall performance or footprint (Sect. 6.1).

  5. Split Shuffled Sampling is a general side-channel countermeasure for Gaussian samplers. Here two or more sample vectors are randomized and combined with each other, masking the properties of individual samples during the random sampling process (Sect. 6.2). This is an effective countermeasure against attacks such as the cache attack presented against BLISS in CHES 2016 [2].

Some efficient implementation tricks for tasks such as computation of the discrete Gaussian distribution are also described. We further take advantage of recent advances in security proof techniques, which indicate a smaller precision requirement and thus lead to smaller, faster implementations.

2 BLISS and BLZZRD

BLISS is a lattice-based public key signature algorithm proposed by Ducas et al. [9]. We use the more recent BLISS-B variant [8] as the basis of our work. We refer the reader to these works for security analysis and other design rationale of the original proposals. Since BLZZRD improves on BLISS-B with quantum resistance, compressed signatures, and side-channel countermeasures, we concentrate on these specific implementation techniques in the present work.

Table 1 Original parameter sets and security targets for Bliss-B from [8, 9, 30]

2.1 Conventions and notation

Arithmetic is performed in a cyclotomic ring \( \mathbf {R} = {\mathbb {Z}}_q[x]/(x^n+1)\) where q is a small prime. Such a ring is anti-circulant as \(x^n \equiv -1 ~ (\bmod ~ x^n+1)\). Arithmetic in this ring can be performed efficiently via Number Theoretic Transforms (NTT, finite field FFT) when \(2n\) divides \(q-1\) and n is a power of 2.
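
As a concrete sanity check (our illustration, not part of the implementation), the root-of-unity condition can be verified numerically; the sketch below searches for a primitive 2n-th root of unity for the BLISS-style parameters \(q = 12289, n = 512\):

```python
# Verify that q = 12289 supports an anti-circulant NTT of dimension
# n = 512 by finding a primitive 2n-th root of unity psi in Z_q
# (this requires 2n | q - 1).
q, n = 12289, 512

def find_psi(q, n):
    """Search for a primitive 2n-th root of unity modulo prime q."""
    for g in range(2, q):
        psi = pow(g, (q - 1) // (2 * n), q)
        # As 2n is a power of two, psi has order exactly 2n iff psi^n = -1.
        if pow(psi, n, q) == q - 1:
            return psi
    return None

assert (q - 1) % (2 * n) == 0          # 12288 = 24 * 512
psi = find_psi(q, n)
assert pow(psi, n, q) == q - 1         # psi^n = -1: the anti-cyclic twist
assert pow(psi, 2 * n, q) == 1
```

The existence of such a \(\psi \) is exactly what allows the \(\omega ^i\) “twisting” multiplications mentioned in Sect. 6.1.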

Non-boldface \(f_i\) denotes individual coefficients of polynomial \( \mathbf {f} = \sum _{i=0}^{n-1} f_i x^i\). Hence \( \mathbf {f} \in \mathbf {R}\) is a ring element, while \(f_i \in {\mathbb {Z}}_q\). We use \( \mathbf {a}* \mathbf {b}=\sum _{i=0}^{n-1} \sum _{j=0}^{n-1} a_i b_j x^{i+j}\) to denote the product of polynomials \( \mathbf {a}\) and \( \mathbf {b}\). For anti-circulant rings of degree n, we always have \(f_{i+n} = -f_i\). We may therefore reduce \(f_i=(-1)^{\lfloor i/n \rfloor } f_{i \bmod n}\) for all \(i \in {\mathbb {Z}}\).

We interchangeably use polynomials as zero-indexed vectors; the dot product is defined as \( \mathbf {a} \cdot \mathbf {b} = \sum _{i=0}^{n-1} a_i b_i\), the Euclidean norms are \(\Vert \mathbf {a}\Vert ^2 = \mathbf {a} \cdot \mathbf {a}\) and \(\Vert \mathbf {a}\Vert _2 = \sqrt{ \mathbf {a} \cdot \mathbf {a}}\), and the sup norm \(\Vert \mathbf {a}\Vert _\infty \) is the largest absolute value \(\max \{|a_0|, |a_1|, \ldots , |a_{n-1}| \}\). We write \(\lfloor x \rceil = \lfloor x + \frac{1}{2} \rfloor \) for the closest integer to x.

A discrete Gaussian distribution with mean \(\mu \), deviation \(\sigma \) and variance \(\sigma ^2\) is denoted \({\mathcal {N}}_{\mathbb {Z}}(\mu , \sigma ^2)\). We use exclusively zero-centered (\(\mu = 0\)) distributions. See Sect. 4.1 for definitions and further discussion.

In pseudocode, we use \(\wedge , \vee , \lnot \) symbols to denote bitwise Boolean manipulation of two’s complement integers. A binary modular reduction operator \(\bmod ~n\) returns an integer in range \([0, n-1]\).

2.2 Description of BLISS

Our description differs somewhat from the original description (which is even more dense). We used the original reference implementation\(^{1}\) and the strongSwan “production grade” implementation\(^{2}\) to verify the correctness of our interpretation.

Various parameters, symbols, and security claims used in the descriptions are given in Table 1. These are the parameters used by both publications and can also be found in the implementations. They are unmodified for BLZZRD apart from \(\kappa \); see Sect. 3.1 for further information.

2.3 Key generation

The BLISS-B key generation is perhaps the simplest part of the scheme. Algorithm 1 describes the process. The BLISS-B key generation procedure differs from original BLISS in that keys are not rejected based on an additional norm.

2.4 Creating signatures

Algorithm 2 describes the signature generation method. Here we note that the purpose of the random oracle H is to take the input \(( \mathbf {w}, \mathbf {M})\) and hash it into a vector of length n that has exactly \(\kappa \) ones, with other entries being zero (the vector can also be interpreted as a polynomial in the ring). The reference implementation is an ad hoc construction based on SHA-512 [14]; the strongSwan implementation now uses the MGF1 construction [18]. We use a two-stage random oracle described in Sect. 3.1. The BLISS-B signature method differs from BLISS in steps 8–15 of Algorithm 2; this is the Greedy Sign Choices algorithm for minimizing the norm \(\Vert \mathbf {c} \cdot \mathbf {t}\Vert + \Vert \mathbf {c} \cdot \mathbf {u}\Vert \). Otherwise the algorithms are equivalent and BLISS can even be used to verify BLISS-B signatures without modification.

2.5 Verifying signatures

The BLISS signature verification process is described in Algorithm 3. We note that verification is very fast; only two quick checks for norm bounds and a single ring multiplication is required in addition to a call to the random oracle.

[Algorithms 1–3: key generation, signature creation, and signature verification]

3 Security of the random oracle

Is there a trivial way to forge a signature \(( \mathbf {t}, \mathbf {z}, \mathbf {c})\) that will pass signature verification (Algorithm 3) for some public key \( \mathbf {a}\) and message \( \mathbf {M}\)?

We can easily choose arbitrary \( \mathbf {t}\) and \( \mathbf {z}\) that pass steps 1 and 2 (norm bounds). What is left is the matching of the oracle outputs \( \mathbf {c} \mathop {=}\limits ^{?} \mathbf {c}^\star \). There are \(\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) \) possible ways for \(H( \mathbf {w}, \mathbf {M})\) to construct a vector with \(\kappa \) entries set to 1 and the remaining \(n - \kappa \) entries zero.

Examining the BLISS reference implementations, we find that \( \mathbf {c}\) may not be transmitted or used as a vector (or a sorted list) but as an unsorted output of the oracle; in this case there are \(\frac{n!}{(n-\kappa )!}\) possibilities. To get from list (ordered set) matching to vector (unordered set) matching complexity, one can simply rearrange the input \( \mathbf {c}\) values in the signature to match the output from the oracle for the forged message \( \mathbf {M}\) (note that this is not as easy with our revised Oracle described in Sect. 3.1). These entropies (given in Table 2) appear to be consistent with security claims.

This is reasonable in classical computing, where an \(O(2^H)\) pre-image search is required (H being the entropy of the target range); however, the signer herself can create a signature that matches two different messages with relative ease. BLISS signatures are therefore not collision resistant.

We note that pre-image search is one of the things that quantum computers do well. This leads us to the following simple theorem:

Theorem 1

(Exhaustive Quantum Forgery) A quantum adversary can forge an arbitrary BLISS signature with \(O\left( \sqrt{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) }\right) \) complexity.

Proof

With fixed \( \mathbf {t}\) and \( \mathbf {c}\) in Algorithm 3 one can mount an exhaustive pre-image search to find some message \( \mathbf {M}\) or small-norm vector \({\varDelta } \mathbf {z} = \mathbf {z} - \mathbf {z'}\) that satisfies \( \mathbf {c} = H( \mathbf {w} + {\varDelta } \mathbf {z}, \mathbf {M})\). This search with Grover’s Algorithm [15, 16] will require \(\frac{\pi }{4} \sqrt{\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) }\) lookups. This is optimal for a quantum search [39]. \(\square \)

We observe that with the suggested parameters and security levels in Tables 1 and 2, the security level against a quantum adversary falls short of the target; for example, BLISS-I and BLISS-II with claimed 128-bit security would fall with complexity of roughly \(2^{66}\).

3.1 New random oracle

With \(n=512\), a trivial encoding of \( \mathbf {c}\) requires \(9\kappa \) bits for transmitting the indexes. In BLZZRD the \( \mathbf {c}\) vector itself (or the nonzero index set) is not transmitted. An intermediate function \(H_i( \mathbf {w}, \mathbf {M})\) is first used to compute a \(\theta \)-bit random hash \(c_\theta \). The intermediate hash can then be converted to the \( \mathbf {c}\) vector with ones at exactly \(\kappa \) positions via the oracle function \(H_o\):

$$\begin{aligned} \mathbf {c} = H( \mathbf {w}, \mathbf {M}) = H_o\left( H_i( \mathbf {w}, \mathbf {M})\right) . \end{aligned}$$
(1)

A related random oracle construction was also proposed in [37], where the hash function output was extended using a PRNG. Fortunately we now have a hash standard that directly supports eXtendable-Output Functions (XOFs).

Table 2 Complexity of exhaustive forgery attack on the hash-based random oracle H, based on entropy

BLZZRD uses the SHA3 hash function and the SHAKE XOF [13, 14]. \(H_i\) is implemented with SHA3-256 or SHA3-384, depending on the corresponding \(\theta \) value:

$$\begin{aligned} c_\theta = H_i( \mathbf {w}, \mathbf {M}) = \mathsf {SHA3}-\theta ( \mathbf {w} ~|~ \mathbf {M}). \end{aligned}$$
(2)

Contents of \( \mathbf {w} = (w_0 ~|~ w_1 ~| \cdots |~ w_{n-1})\) are encoded as network byte order 16-bit integers. To produce the \( \mathbf {c}\) indexes, SHA3’s SHAKE256 extendable-output function (XOF) is used. The \(c_\theta \) intermediate hash is used to seed the XOF, from which an arbitrary number of 16-bit big-endian values can be extracted. These are masked to the index range \([0, n-1]\) and rejected if already contained in \( \mathbf {c}\). This process is repeated until \(\kappa \) ones have been placed in the \( \mathbf {c}\) vector.
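
The two-stage oracle of Eq. (1) can be sketched as follows, assuming Python’s hashlib for SHA3-256 and SHAKE256 and the BLZZRD-I-style values \(n = 512\), \(\kappa = 58\), \(\theta = 256\); details such as returning a sorted index list are illustrative rather than a normative encoding:

```python
import hashlib

def H_i(w_bytes, M):
    # Intermediate hash c_theta = SHA3-256(w | M), i.e. theta = 256;
    # w_bytes is w already encoded as 16-bit network byte order integers.
    return hashlib.sha3_256(w_bytes + M).digest()

def H_o(c_theta, n=512, kappa=58):
    # Seed SHAKE256 with c_theta, extract 16-bit big-endian values, mask
    # them to [0, n-1], and reject repeats until kappa distinct indexes
    # remain. SHAKE output is prefix-consistent, so re-reading a longer
    # stream on an (unlikely) shortfall keeps the result deterministic.
    out_len = 16 * kappa
    while True:
        stream = hashlib.shake_256(c_theta).digest(out_len)
        idx = []
        for pos in range(0, out_len, 2):
            i = int.from_bytes(stream[pos:pos + 2], "big") & (n - 1)
            if i not in idx:
                idx.append(i)
                if len(idx) == kappa:
                    return sorted(idx)
        out_len *= 2
```

Here \(H_o(H_i(\cdot ))\) yields the \(\kappa \) nonzero positions of the \( \mathbf {c}\) vector; the masking step relies on n being a power of two.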

Note that SHAKE256 accepts an arbitrary-sized input “key” and has an internal chaining capacity of 512 bits. Its collision resistance as a hash function is at the 256-bit security level, as is its security level against a quantum pre-image attack; more than adequate for the 128-, 160-, and 192-bit security goals.

When transmitting or comparing signatures (Algorithms 2 and 3) we may use \(c_\theta \) instead of the full \( \mathbf {c}\) vector or index set. This leads to more compact signatures. See Table 2 for numerical values of \(\theta \); it has been chosen to be double the security parameter to counter the Quantum Forgery Attack of Theorem 1. The new \(\kappa \) parameter is chosen so that \(\left( {\begin{array}{c}n\\ \kappa \end{array}}\right) > 2^\theta \).

The increased \(\kappa \) has an impact on both signature creation and verification speed, and also on the rejection rate (Step 18 in Algorithm 2). Experimentally the slowdown is less than \(30\%\) in all cases. This estimate is somewhat rough due to the fashion in which the constant M is determined in [9].

Resistance against general quantum attacks Even though our modification of \(\kappa \) helps to secure the scheme against this particular quantum attack, it does not secure the scheme against others. We encourage the authors of other schemes to propose similarly revised parameters.

4 Discrete Gaussians

Obtaining random numbers from a Gaussian (normal) or other non-uniform distributions is called sampling. Cryptographically secure sampling is required by many Lattice-based cryptographic algorithms; see [10] for an overview.

4.1 Gaussian sampling

A random sampler from a zero-centered discrete Gaussian distribution \({\mathcal {N}}_{\mathbb {Z}}(0, \sigma ^2)\) returns integer \(x \in {\mathbb {Z}}\) with probability given by the density function \(\rho _\sigma (x)\). The probability mass of the discrete Gaussian distribution at x is proportional to \(\exp (-\frac{x^2}{2 \sigma ^2})\), where \(\sigma \) is a deviation parameter (see Fig. 1). For \(\sigma \gtrapprox 2\) we can approximate the density to high precision with

$$\begin{aligned} \rho _\sigma (x) \approx \frac{1}{\sigma \sqrt{2\pi }} e^{-\frac{x^2}{2 \sigma ^2}}. \end{aligned}$$
(3)

4.2 Tailcut

As |x| grows, \(\rho _\sigma (x)\) eventually diminishes into statistical insignificance. We may choose a cryptographic threshold parameter such as \(\epsilon = 2^{-128}\) and simply ignore all \(|x| > \tau \) when \(2 \sum _{x > \tau } \rho _\sigma (x) \le \epsilon \). Therefore the probability that a random x from the distribution satisfies \(-\tau< x < \tau \) is greater than \(1 - \epsilon \). We call \(\tau \) the “tailcut parameter.” Its typical numerical value for the \(\epsilon = 2^{-128}\) bound is related to the deviation by roughly \(\tau = 13.2 \sigma \).
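
The tailcut for a given \(\epsilon \) can be found numerically. The sketch below (our illustration, using double-precision floats) searches for the smallest \(\tau \) satisfying \(2 \sum _{x > \tau } \rho _\sigma (x) \le \epsilon \):

```python
import math

def tailcut(sigma, eps):
    """Smallest tau with 2 * sum_{x > tau} rho_sigma(x) <= eps."""
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    tail = 0.0
    # Accumulate the tail downwards from 20*sigma; the mass beyond that
    # point is negligible even compared to eps = 2^-128.
    for x in range(int(20 * sigma), 0, -1):
        tail += c * math.exp(-x * x / (2.0 * sigma * sigma))
        if 2.0 * tail > eps:
            return x
    return 0
```

For a BLISS-I-like \(\sigma = 215\) and \(\epsilon = 2^{-128}\) this yields a ratio \(\tau /\sigma \) close to the \(13.2\) rule of thumb quoted above.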

4.3 Efficient computation of density tables

Essentially all sampling algorithms require distribution density tables. Evaluation of transcendental functions (such as \(\exp \)) is slow if generic algorithms are used.

We derive two parameters \(b = \exp \big (-\frac{1}{2\sigma ^2}\big )\) and \(c = {1}/{\sigma \sqrt{2 \pi }}\) from deviation \(\sigma \). Equation (3) can now be written as \(\rho _\sigma (x) = c b^{(x^2)}\). Since \(x^2 \in {\mathbb {Z}}\), square-and-multiply exponentiation algorithms (analogous to those used for modular exponentiation) may be used to efficiently compute the power \(b^{(x^2)}\) at arbitrary \(x \in {\mathbb {Z}}\). Real values can be kept at fixed point range [0, 1] if floating point arithmetic is not available.

Fig. 1

The green bars illustrate the probability mass for integers \(x \in {\mathbb {Z}}\) with a \(\sigma = 2.5\) discrete Gaussian distribution (Eq. 3). Blue line is the corresponding continuous probability density function (color figure online)

When tabulating the density, due to symmetry \(\rho _\sigma (x) = \rho _\sigma (-x)\) we consider the sequence \(t_0, t_1, t_2, \ldots \) with \(t_i = \rho _\sigma (i)\). By observing that the ratio of consecutive values satisfies \(\frac{t_{i+1}}{t_i} = u_i = b^{2i+1}\) we arrive at the following recurrence:

$$\begin{aligned} \begin{array}{r l p{1mm} r l p{1mm} l} t_0 =&{} c &{}&{} u_0 =&{} b &{}&{} \text {Initialize.}\\ t_i =&{} t_{i-1} u_{i-1} &{}&{} u_i =&{} b^2 u_{i-1} &{}&{} \text {For}~i \ge 1. \end{array} \end{aligned}$$
(4)

The algorithm of Eq. (4) computes consecutive discrete Gaussian density function values with only two multiplications per point (\(b^2\) stays constant). This new technique makes the table initialization process much faster, a great advantage for limited-resource implementations.
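
As a sketch (our illustration, in double-precision floats), the recurrence can be cross-checked against direct evaluation of Eq. (3):

```python
import math

def density_table(sigma, m):
    """t_i = rho_sigma(i) for i = 0..m-1 via the recurrence of Eq. (4):
    t_i = t_{i-1} u_{i-1}, u_i = b^2 u_{i-1}; two multiplications per
    point, with no exp() calls after initialization."""
    b = math.exp(-1.0 / (2.0 * sigma * sigma))
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    t, u, b2 = [c], b, b * b
    for _ in range(1, m):
        t.append(t[-1] * u)
        u *= b2
    return t

# Cross-check: the closed form is t_i = c * b^(i^2)
sigma = 2.5
t = density_table(sigma, 34)
c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
for i, ti in enumerate(t):
    direct = c * math.exp(-i * i / (2.0 * sigma * sigma))
    assert abs(ti - direct) <= 1e-12 * direct
```

The 34 entries cover the whole \(\tau \approx 13.2\sigma \) range for \(\sigma = 2.5\) used in the toy examples below.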

Table 3 The internal variables used by main BAC routines

4.4 Gaussian sampling algorithms

We first analyze the “inversion sampler” which is one way of using uniform random bits to select an element from the target distribution by inverting its cumulative distribution. Since \({\mathcal {N}}_{\mathbb {Z}}(0, \sigma ^2)\) is symmetric we can define a cumulative sequence \(s_i = \sum _{x=-i}^{i} \rho _\sigma (x)\). It can be computed as an extension of sequences of Eq. (4) via \(s_0 = t_0\) and \(s_i = s_{i-1} + 2 t_i\). Clearly the sum of all probabilities converges to \(s_\infty = 1\).

For cryptographic applications we can assume a source of unbiased random bits \(z_i \in \{0,1\}\). A sequence of n random bits can be viewed as a real value \(z \in [0,1]\) via the binary fraction \(z = 0.z_1z_2z_3\ldots z_n\). When n is finite, \(z_1, z_2, \ldots , z_n\) only define a range of size \(2^{-n}\). For a uniformly (\(n=\infty \)) random \(z \in [0,1]\) the corresponding sampled integer x can be derived, with an additional random sign bit, via

$$\begin{aligned} x = \left\{ \begin{array}{cl} 0 &{} \quad \text {if}~ z< s_0, \\ \pm i &{}\quad \text {if}~ s_{i-1} \le z < s_i ~\text {for}~ i \ge 1. \end{array} \right. \end{aligned}$$
(5)

This corresponds to “inversion sampling,” and can be implemented via a binary search into monotonically increasing table \(s_i\) by first randomizing a large number of bits to create a high-precision real number z.

A wide variety of sampling methods such as Inversion Sampling [28], Knuth-Yao Sampling [20], The Ziggurat Method [3, 11, 25, 26], Kahn-Karney Sampling [19], and “Bernoulli” sampling [9] have been proposed for lattice cryptography.

5 Arithmetic coding

Arithmetic Coding is a classical data compression technique [31], notable for its optimality under certain conditions. Information theory tells us that the average entropy (in bits/symbol) of a stationary and memoryless discrete source such as \({\varOmega }= {\mathcal {N}}_{\mathbb {Z}}(0, \sigma ^2)\) is

$$\begin{aligned} H({\varOmega }) = -\sum _{x=-\infty }^\infty \rho _\sigma (x) \log _2 \rho _\sigma (x). \end{aligned}$$
(6)

Equation (6) gives us a lower bound for bits required to represent any discrete Gaussian sequence; this is also the expected number of random entropy bits required in sampling.

Theorem 2

Arithmetic coding is optimal for a static and memoryless discrete source \({\varOmega }\) in the sense that it is able to encode n symbols from such a source into \(n H({\varOmega }) + O(1)\) data bits.

Proof

See Section 1.5 of [35]. \(\square \)

5.1 Implementing a Binary Arithmetic Coder

Arithmetic Coding and Decoding can be implemented in many ways. Binary Arithmetic Code (BAC) [23] is one such approach. For the BLZZRD reference code, we implemented a BAC encoder and decoder for a static (non-dynamic) distribution. Range scaling details of the new codec were inspired by [35], even though that implementation is not a BAC. See [38] for a classic implementation of arithmetic coding.

Our implementation is in C language, uses 64-bit fixed point precision, and is highly portable. Importantly, we pose no specific limits on buffer carry bit back-propagation, such as “bit stuffing” used in Q-Coder hardware architecture [29]. Full source is distributed together with the reference implementation (see Sect. 7). The implementation is only about 250 lines.

Table 3 gives the variable conventions used in the implementation. Constants used to define the BLZZRD BAC are as follows:

P: Bit precision used by the implementation. We used \(P=64\).

D: Alphabet size in bits. For discrete Gaussians, we need \(2^D > 2\tau \).

With a BAC, an alphabet of size \(2^D\) can be viewed as a binary tree of depth D. For each one of the D binary decisions in encoding or decoding, a single comparison is required.

The input frequencies can be arbitrarily scaled “counts.” The probability of occurrence of x is

$$\begin{aligned} \mathsf {Pr}(x) = \frac{ \mathsf {freq}[x]}{\sum _{i=0}^{2^D-1} \mathsf {freq}[i]}. \end{aligned}$$
(7)

Given a table of frequencies \( \mathsf {freq}[2^D]\), function \( \mathsf {BuildDist}\) (Algorithm 4) creates a corresponding scaled BAC distribution table “tree” \( \mathsf {dist}[2^D]\) that is used by both encoding and decoding functions.

[Algorithm 4: \( \mathsf {BuildDist}\)]

The encoding routine \( \mathsf {AriEncode}\) (Algorithm 6) requires a helper function \( \mathsf {StoreWithCarry}\) (Algorithm 5) to propagate carries. Arithmetic coding routines generally require an output buffer as these carries cannot be reliably predicted. The decoding routine \( \mathsf {AriDecode}\) (Algorithm 8) utilizes a helper function \( \mathsf {ShiftGetBit}\) (Algorithm 7) to get new bits into the working variable x.

[Algorithms 5–8: \( \mathsf {StoreWithCarry}\), \( \mathsf {AriEncode}\), \( \mathsf {ShiftGetBit}\), \( \mathsf {AriDecode}\)]
Fig. 2

In this toy example we are encoding a discrete Gaussian with \(\sigma = 2.5\) and tail cutting applied at the \(\approx \)0.001 level; the range is \(-7 \ldots 7\) and the integers fit in 4 bits. For convenience the distribution is centered at 8 (high bit flip). The left side shows the operation of binary sample decoding; this corresponds to a binary search within the cumulative distribution. On each step either the min or the max bound is set to the divisor value. Only one scaling multiplication and comparison is required. Four bits yield the output \(0+4+0+1 = 5\). The right side illustrates decoding of multiple samples. Output (5, 10, 10) corresponds to \((-3, 2, 2)\) when adjusted to zero center. After decoding 3 samples the bounds \( \mathsf {min} = 0.14237.._{10} = 0.0010010001110.._2\) and \( \mathsf {max} = 0.14285.._{10} = 0.0010010010010.._2\) already match in 8 binary fraction bits. Both bounds and the input fraction can be shifted left, discarding the integer bits

The coding and decoding procedure is illustrated by Fig. 2. Exhaustive testing with various distributions was performed to verify the correct operation of the codec.
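
To illustrate the encode/decode round trip without reproducing the fixed-point codec of Algorithms 4–8, the following simplified coder (our illustration) works with exact, unbounded-precision arithmetic; a production codec would instead use the \(P = 64\) fixed-point range scaling described above:

```python
from fractions import Fraction

def cum_table(freq):
    cum = [0]
    for f in freq:
        cum.append(cum[-1] + f)
    return cum, sum(freq)

def encode(seq, freq):
    """Exact arithmetic encoding: narrow [low, low + rng) per symbol,
    then emit the shortest k-bit fraction code/2^k inside the interval."""
    cum, total = cum_table(freq)
    low, rng = Fraction(0), Fraction(1)
    for s in seq:
        low += rng * Fraction(cum[s], total)
        rng *= Fraction(freq[s], total)
    k = 1
    while Fraction(1, 2 ** k) > rng / 2:   # ensure a grid point fits
        k += 1
    code = -((-low.numerator * 2 ** k) // low.denominator)  # ceil(low*2^k)
    assert low <= Fraction(code, 2 ** k) < low + rng
    return k, code

def decode(k, code, m, freq):
    """Invert encode(): locate z in the nested subintervals symbol by
    symbol, reproducing the encoder's interval updates."""
    cum, total = cum_table(freq)
    z = Fraction(code, 2 ** k)
    low, rng = Fraction(0), Fraction(1)
    out = []
    for _ in range(m):
        t = (z - low) / rng
        s = max(i for i in range(len(freq)) if Fraction(cum[i], total) <= t)
        out.append(s)
        low += rng * Fraction(cum[s], total)
        rng *= Fraction(freq[s], total)
    return out
```

Decoding always recovers the input exactly, and the emitted length k is within two bits of \(\sum _s -\log _2 \mathsf {Pr}(s)\), as Theorem 2 promises.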

5.2 Compressing signatures

We have implemented arithmetic coding compression and decompression for the \( \mathbf {t}\) component of signatures generated with new BLZZRD parameters. The \( \mathbf {z}\) vector was also modeled as discrete Gaussian with a small \(\sigma \). See Table 4 for our parameters and experimental results. Rather than transmitting the \( \mathbf {c}\) vector indexes, an intermediate hash of \(\theta \) bits is transmitted. The last line gives the experimental average signature sizes, including overhead required to encode parameter and array sizes.

5.3 Comparison with a Huffman code

The extended version of [30] describes an advanced “block” Huffman encoding technique for the BLISS-1 parameter set. A similar technique is also implemented in the strongSwan project.

The codec achieves good performance (for a Huffman code) by assigning codes to quadruplets \(\Big (\lfloor \frac{{{\mathrm{abs}}}(t_{2i})}{2^8} \rfloor , \lfloor \frac{{{\mathrm{abs}}}(t_{2i+1})}{2^8} \rfloor , z_{2i}, z_{2i+1} \Big )\) rather than to individual \(t_i\) and \(z_i\) values. The lower 8 bits of the \(t_i\) values are stored as they are, and up to four sign bits are stored for nonzero entries.

We note that the 64-entry Huffman tree given in Appendix B of [30] for BLISS-I can only encode values in the range \(-1023 \le t_i \le 1023\). Since \(t_i\) comes from a Gaussian distribution with \(\sigma = 215\), there is a probability of \(P = 1 - \sum _{x=-1023}^{1023} \rho _\sigma (x) \approx 2^{-18.98}\) that an individual \(t_i\) value will overflow, and a probability of \(1-(1-P)^n \approx 2^{-9.98}\) (roughly one in a thousand) that an individual \( \mathbf {t}\) signature vector cannot be compressed using the proposed code.

Ignoring these overflow conditions, our exhaustive testing has shown that the block Huffman codec compresses \(( \mathbf {t}, \mathbf {z})\) to an average of 5650.1 bits/signature. Our arithmetic coder achieves a slightly better result, 5565.2 bits/signature. A non-customized Huffman coding technique would yield significantly inferior results.

Furthermore, [30] states that “\( \mathbf {c}\) does not contain any easily removable redundancy,” and therefore \(\kappa \log _2(n) = 207\) bits are used. This is not strictly accurate, as our new oracle (Sect. 3.1) requires a smaller number of bits. In the case of the original BLISS-I security parameters only \(\theta =128\) bits would be required, a \(38 \%\) drop. For BLZZRD-I, we have \(\kappa =58\) and \(\theta =256\), indicating a reduction to less than half of the corresponding plain BLISS encoding size.

Arithmetic Coding is somewhat slower than Huffman coding, but as can be observed in Table 7, compression is not a bottleneck in the signing or verification operation. As indicated by Theorem 2, Arithmetic Coding is the most efficient way to compress signatures whose elements come from a non-uniform but static distribution.

5.4 Sampling with an arithmetic decoder

An interesting consequence of the information theoretic optimality of an arithmetic coder is that when the decoder is fed uniform random bits, it will output random samples in the desired target distribution. Therefore the same code, circuitry, and tables that are used to compress signatures can double as a random sampler. We have utilized Algorithm 8 as a Gaussian sampler in this fashion. With precision \(P=64\) we may reasonably expect to reach a security level close to the 128-bit level, based on the “Valiant-Valiant” sampling Theorem and conjectures [34].
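
The sampler duality can be demonstrated with a simplified exact-arithmetic decoder (again our illustration, not the \(P=64\) codec): feeding it uniform random bits as the code fraction yields samples from the table’s distribution. The 16-entry \(\sigma = 2.5\) table centered at 8 mirrors the toy example of Fig. 2:

```python
from fractions import Fraction
import math, random

def decode_bits(zbits, m, freq):
    """Arithmetic decoder driven by a bit string interpreted as the
    binary fraction z = 0.z1 z2 ...; when the bits are uniformly random,
    the m decoded symbols follow the freq distribution."""
    cum = [0]
    for f in freq:
        cum.append(cum[-1] + f)
    total = sum(freq)
    z = Fraction(int(zbits, 2), 2 ** len(zbits))
    low, rng = Fraction(0), Fraction(1)
    out = []
    for _ in range(m):
        t = (z - low) / rng
        s = max(i for i in range(len(freq)) if Fraction(cum[i], total) <= t)
        out.append(s)
        low += rng * Fraction(cum[s], total)
        rng *= Fraction(freq[s], total)
    return out

# 16-entry sigma = 2.5 Gaussian frequency table centered at 8
sigma = 2.5
freq = [max(1, round(10 ** 6 * math.exp(-(x - 8) ** 2 / (2 * sigma ** 2))))
        for x in range(16)]
rng0 = random.Random(1)                 # stand-in for a true random source
bits = "".join(str(rng0.getrandbits(1)) for _ in range(2048))
v = [x - 8 for x in decode_bits(bits, 64, freq)]
```

Because successive samples share the fractional randomness, the bit consumption per vector approaches the entropy bound of Eq. (6) rather than a whole number of bits per sample.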

Table 4 BLZZRD signature compression using Binary Arithmetic Coding

Table 5 gives performance characteristics of our initial implementation. The binary search CDF sampler used tables of size \(2^{12}\) and an additional sign bit, whereas the BAC sampler used tables of size \(2^{13}\) without special coding for sign bits. The random usage was calculated from the average number of random bytes used to sample vectors of length \(n=512\). The table also includes numbers for a blinded sampler (Sect. 6.2). Precision of \(P=64\) was used in all cases and the speeds are comparable. We observe that the main advantage of the new BAC sampler is that it uses a very small number of true random bits; in our experiments the actual random usage is typically within \(1.3 \%\) of the theoretical minimum for the vector sizes used in lattice cryptography.

By comparison, the Knuth-Yao sampler can also be expected to use a small number of bits, shown to average \(H+2\) bits per sample [20]. However, Knuth-Yao always uses a whole number of bits per sample, whereas a BAC sampler can be used to create vectors of samples where bits of entropy are “shared” between individual samples. This elementary property directly follows from Theorem 2. In many ways the Knuth-Yao sampler compares to a BAC sampler as Huffman codes compare to arithmetic coding.

For BLZZRD parameters the actual randomness usage of Knuth-Yao is about \(16 \%\) higher than that of the BAC sampler for vectors of length \(n=512\). The Knuth-Yao sampler of [32] is suitable primarily for the small deviations used in Ring-LWE encryption (\(\sigma \approx 3\)), not for the larger \(\sigma \) required by BLISS/BLZZRD.

Table 5 Sampling with an Arithmetic Coder vs a simple inverse CDF sampler on a Core i7-5600U @ 2.6 GHz

6 Randomized side-channel countermeasures for ideal lattices

Blinding is a standard countermeasure against both the timing attack [21] and emissions-based attacks such as Differential Power Analysis [22] for traditional public key cryptosystems. Blinding countermeasures add randomness to private key operations, making determination of secrets from observations more difficult for the attacker. In the case of RSA, there are two kinds of blinding: base blinding and exponent blinding. In the case of ECC, scalar blinding can be used.

Examining private key operations in lattice cryptosystems, we note that there are two main components that can introduce side-channel vulnerabilities to an implementation: ring polynomial arithmetic and Gaussian sampling. We have implemented analogous blinding countermeasures against both. For BLISS/BLZZRD (Algorithm 2), special care should be taken to implement the GreedySC component using side-channel resistant techniques as well. An advanced cache attack against a BLISS implementation, targeting the samplers of a reference implementation, was described in CHES 2016 [2].

6.1 Blinded polynomial multiplication

Basic arithmetic in \({\mathbb {Z}}_q\) and especially the modular reduction by small prime q may easily introduce emissions to an ideal lattice cryptography implementation. For example, the arithmetic operations described for a low-resource Ring-LWE implementation in [24] contain a number of data-conditional execution cases.

The simplest form of polynomial blinding is to multiply the polynomials by random nonzero constants \(a, b \in {\mathbb {Z}}_q\). This can be done in the regular or the NTT domain; the results are equivalent. One can set \(a=b^{-1}\) or multiply the result by \(c = (ab)^{-1}\):

$$\begin{aligned} \mathbf {h}= & {} a \mathbf {f} * b \mathbf {g} \\ \mathbf {f}* \mathbf {g}= & {} (ab)^{-1} \mathbf {h}. \end{aligned}$$

Note that anti-cyclic NTT multiplication requires each polynomial to be “pre-processed” by multiplying entries with tabulated roots of unity \(\omega ^i\) anyway. Therefore by choosing \(a=\omega ^i, b=\omega ^j\), this type of blinding can be done with virtually no extra cost. The normalization constant becomes \(c = \omega ^{-i-j}\).

[Algorithm 9: \( \mathsf {PolyBlind}\)]

Another type of blinding of anti-cyclic polynomial multiplication can be achieved by circularly “shifting” the polynomials. As noted in Sect. 2.1, we may write a polynomial as \( \mathbf {f}=\sum _{i=0}^{n-1} f_i x^i\). Shifting by j positions is equivalent to computing

$$\begin{aligned} x^j \mathbf {f} = \sum _{i=0}^{n-1} f_i x^{i + j} = \sum _{i=0}^{n-1} f_{i - j} x^i. \end{aligned}$$
(8)

Here the coefficients are therefore simply rotated in an anti-cyclic fashion. Both constant multiplication (by c) and shifting (by s) are captured by the function \( \mathsf {PolyBlind}\) (Algorithm 9).

Algorithm 9, \( \mathsf {PolyBlind}\) is very fast. The inverse operation \( \mathsf {PolyBlind}'( \mathbf {v}, -s, c^{-1})\), distinguished by a negative shift value s, is equally easy to construct. With that function, we have

$$\begin{aligned} \mathsf {PolyBlind}'( \mathsf {PolyBlind}( \mathbf {v}, s, c), -s, c^{-1}) = \mathbf {v} \end{aligned}$$
(9)

Due to isometries of the anti-circulant ring, we can use a total of four blinding parameters, a, b (constants) and r, s (shift values), in the blinded scheme to compute the polynomial product \( \mathbf {f} * \mathbf {g}\):

$$\begin{aligned} \mathbf {f}' =&~ \mathsf {PolyBlind}( \mathbf {f}, r, a) ~~ 0 \le r< n, ~ 0< a< q \\ \mathbf {g}' =&~ \mathsf {PolyBlind}( \mathbf {g}, s, b) ~~ 0 \le s< n, ~ 0< b < q \\ \mathbf {h}' =&~ \mathbf {f}' * \mathbf {g}' \\ \mathbf {f} * \mathbf {g} =&~ \mathsf {PolyBlind}'\left( \mathbf {h}', -(r+s), (ab)^{-1}\right) . \end{aligned}$$

One may choose a and b from the tabulated roots of unity, \(a=\omega ^i, b=\omega ^j\), and avoid computing the inverse, since \((ab)^{-1} = \omega ^{-(i+j)}\). This type of blinding has a relatively small performance penalty. If roots of unity are used as constants, the total amount of “noise” entropy introduced is \(4\log _2(n) = 36\) bits for \(n = 512\).
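An end-to-end toy check of the four-parameter blinded product can be written as follows. Schoolbook multiplication stands in for the NTT, and all names and parameters are illustrative, not those of the reference implementation.

```python
Q, N = 12289, 8   # toy parameters for illustration; BLISS uses n = 512

def poly_mul(f, g):
    """Schoolbook product in Z_Q[x]/(x^N + 1): anti-cyclic convolution."""
    h = [0] * N
    for i in range(N):
        for j in range(N):
            k, t = i + j, f[i] * g[j]
            if k >= N:
                k, t = k - N, -t     # wrap-around picks up a sign, x^N = -1
            h[k] = (h[k] + t) % Q
    return h

def poly_blind(v, s, c):
    """Multiply v by c * x^s in the ring; s may be negative."""
    s %= 2 * N
    neg = s >= N
    if neg:
        s -= N
    return [((-c if neg else c) * (v[i - s] if i >= s else -v[i - s + N])) % Q
            for i in range(N)]

# Blind both operands, multiply, then unblind with -(r+s) and (ab)^(-1):
f = [3, 0, 11, 7, 12000, 5, 1, 2]
g = [9, 4, 0, 1, 8, 12288, 6, 10]
r, s, a, b = 2, 5, 321, 4567
h_blind = poly_mul(poly_blind(f, r, a), poly_blind(g, s, b))
assert poly_blind(h_blind, -(r + s), pow(a * b, -1, Q)) == poly_mul(f, g)
```

The final assertion mirrors the unblinding equation above: the blinded product, shifted back by \(-(r+s)\) and scaled by \((ab)^{-1}\), equals the unblinded product.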

Table 6 Parameters for \(Z \approx X + kY\) split samplers, where the target distribution is \({\mathcal {N}}_{\mathbb {Z}}(0, \sigma )\), k is a small integer constant, and X and Y have distribution \({\mathcal {N}}_{\mathbb {Z}}(0, \sigma ')\) with \(\sigma '=\frac{\sigma }{\sqrt{1+k^2}}\)

The basic blinding technique is generally applicable to schemes based on ideal lattices. For example, the optimized implementations of ring arithmetic for the “New Hope” Ring-LWE key exchange [1] can be blinded in a straightforward fashion and with only a negligible performance penalty.

6.2 Blinded Gaussian sampling

We note that the BLISS/BLZZRD algorithm always samples vectors of n variables at once. We define a function \( \mathsf {VectorSample}(n, \sigma ) = {\mathcal {N}}^n_{\mathbb {Z}}(0, \sigma ^2)\) that produces a vector of n samples from a discrete Gaussian with deviation parameter \(\sigma \). Step 1 of the signing process (Algorithm 2) can be written as

$$\begin{aligned} ( \mathbf {t}, \mathbf {u}) \leftarrow (&\mathsf {VectorSample}(n, \sigma ), \\&\mathsf {VectorSample}(n, \sigma )). \end{aligned}$$

If \( \mathsf {VectorSample}\) is implemented naively as a loop, emissions from the implementation may reveal information about the random numbers being sampled, and an attacker may determine which elements of the random vector have specific features, as is done in the cache attack presented at CHES 2016 [2]. This may be alleviated to some degree by using \( \mathsf {VectorShuffle}( \mathsf {VectorSample}(n, \sigma ))\), which simply applies a random shuffle to the Gaussian sample vector.
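The shuffle itself is a standard Fisher–Yates pass; a minimal sketch (the function name follows the text, and the CSPRNG-backed index draw is an illustrative choice, not the reference code) is:

```python
import secrets

def vector_shuffle(v):
    """VectorShuffle: uniform random permutation via Fisher-Yates.

    secrets.randbelow gives unbiased draws from a cryptographic RNG;
    a fully hardened implementation needs further care beyond this sketch."""
    v = list(v)
    for i in range(len(v) - 1, 0, -1):
        j = secrets.randbelow(i + 1)   # uniform index in [0, i]
        v[i], v[j] = v[j], v[i]
    return v

# The multiset of samples is unchanged; only their order is randomized:
assert sorted(vector_shuffle(range(100))) == list(range(100))
```

Because only the order changes, the output distribution of the sampler is untouched; the permutation merely decouples observable per-element behavior from vector positions.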

From probability theory we know that the sum of two independent Gaussian random variables is again Gaussian; for discrete Gaussians this holds to a very close approximation with appropriate parameters. More precisely, variances (the squares of the deviations) are additive. Let X and Y have distributions \({\mathcal {N}}_{\mathbb {Z}}(\mu _X, \sigma ^2_X)\) and \({\mathcal {N}}_{\mathbb {Z}}(\mu _Y, \sigma ^2_Y)\), respectively. Then their sum \(X + Y\) has distribution \({\mathcal {N}}_{\mathbb {Z}}(\mu _X + \mu _Y, \sigma ^2_X + \sigma ^2_Y)\). With zero-centered distributions the mean does not change, but the resulting deviation will be \(\sigma _{X+Y} = \sqrt{\sigma ^2_X + \sigma ^2_Y}\). By induction, this can be generalized to more variables. We use this property to create a more secure Gaussian vector sampler. Algorithm 10 constructs the target distribution from a sum of m random samples with deviation \(\sigma ' = \frac{1}{\sqrt{m}}\sigma \).
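Variance additivity is exact for independent variables (the Gaussian shape of the sum is the approximate part), and it can be verified numerically by convolving a truncated discrete Gaussian pmf with itself. The parameters below are illustrative:

```python
import math

def dgauss_pmf(sigma, tail):
    """Normalized discrete Gaussian pmf on [-tail, tail]."""
    w = [math.exp(-x * x / (2.0 * sigma * sigma)) for x in range(-tail, tail + 1)]
    t = sum(w)
    return [p / t for p in w]

def convolve(p, q):
    """Distribution of X + Y for independent X ~ p, Y ~ q."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

def variance(p, tail):
    """Variance of a zero-mean pmf indexed from -tail to tail."""
    return sum(x * x * p[x + tail] for x in range(-tail, tail + 1))

p = dgauss_pmf(3.0, 40)      # sigma' = 3; tail cut far beyond 13 sigma'
p_sum = convolve(p, p)       # X + Y, both with deviation sigma'
assert abs(variance(p_sum, 80) - 2.0 * variance(p, 40)) < 1e-9
```

With m summands the same computation gives variance \(m\sigma '^2\), which is why \(\sigma ' = \sigma /\sqrt{m}\) reproduces the target deviation \(\sigma \).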


As pointed out in [30], one can also choose \(\sigma '=\frac{\sigma }{\sqrt{1+k^2}}\) and construct the final distribution as \(Z = X + kY\). The convolution parameter k must be carefully chosen so that the statistical distance to the actual target distribution is not too great. Table 6 gives some appropriate parameters to use. In the resulting construction all elements of the second vector are simply multiplied by the integer constant k:

$$\begin{aligned} \mathbf {v}&= \mathsf {VectorShuffle}( \mathsf {VectorSample}(n, \sigma '))\\&\quad +k * \mathsf {VectorShuffle}( \mathsf {VectorSample}(n, \sigma ')). \end{aligned}$$

The algorithm slows sampling down by a factor of at most m. In return, faster but “leakier” base sampling algorithms may be used, and, more significantly, the required tables are much smaller. For example, a CDF table of only 256 entries gives a secure tailcut at \(\tau = 14.34 \sigma \) for BLZZRD-I with the parameters of Table 6.
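A small CDF table can be used with inversion sampling, sketched below. The deviation value and the simplified handling of the sign and zero term are illustrative assumptions, not the BLZZRD-I constants.

```python
import math
from bisect import bisect_right

def build_cdf(sigma, entries=256):
    """CDF table for the magnitude of a discrete Gaussian sample.

    Simplified sketch: the sign is applied separately, and a production
    sampler must halve the weight of zero to avoid double-counting it."""
    w = [math.exp(-x * x / (2.0 * sigma * sigma)) for x in range(entries)]
    t = sum(w)
    cdf, acc = [], 0.0
    for p in w:
        acc += p / t
        cdf.append(min(acc, 1.0))   # clamp against rounding drift
    cdf[-1] = 1.0
    return cdf

def sample_mag(cdf, u):
    """Inversion: smallest x with u < cdf[x], for uniform u in [0, 1)."""
    return bisect_right(cdf, u)

cdf = build_cdf(17.8)                # illustrative deviation, 256 entries
assert sample_mag(cdf, 0.0) == 0
assert all(cdf[i] <= cdf[i + 1] for i in range(255))
```

A binary search over 256 entries costs at most eight comparisons per sample, and the table footprint is a small fraction of what a full-range CDF for the target \(\sigma \) would need.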

The amount of noise entropy introduced by the permutations is technically in the thousands of bits. Even though this does not directly translate to attack complexity, the countermeasure makes emissions- and cache-based attacks on the sampler significantly harder, defeating the BLISS attack of CHES 2016 [2]. It may be possible to adapt the attack to account for some of the blinding parameters, but probably not all of them. The countermeasure parameter m can be increased without significantly modifying the implementation or breaking compatibility.

Table 7 Breakdown of time consumption by different BLZZRD elementary operations in the plain C reference implementation

7 Implementation

An experimental reference implementation containing most of the features described in this work is available as a self-contained public-domain distribution at https://github.com/mjosaarinen/blzzrd. Table 7 summarizes its performance on a 2015 laptop computer (Intel Core i7-4870HQ @ 2.50 GHz).

On the same system, the optimized implementation of ECDSA in OpenSSL (version 1.0.2g) with the NIST P-256 curve [12] required 0.039 ms for creating a signature and 0.097 ms for verification. All variants of the reference BLZZRD implementation offer at least 40% faster signature verification than this NIST P-256 ECDSA implementation.

NIST P-256 is the fastest curve implemented by OpenSSL; BLZZRD outperforms all ECDSA curves in signature verification by a wide margin. For larger curves (409 bits or more), BLZZRD signature generation was also faster than ECDSA's.

We note that (unlike the OpenSSL ECDSA benchmark implementation) our BLZZRD reference implementation has no assembly-language optimizations and was generally written for readability and maintainability rather than raw speed. It is also the first implementation with side-channel countermeasures, so direct performance comparisons to other lattice-based schemes with countermeasures are difficult.

8 Conclusions

We have offered a number of techniques that can be used to improve the security and performance of lattice public key algorithms. Using these techniques, we constructed BLZZRD, a practical, evolutionary variant of the BLISS-B signature scheme.

We have described a direct Grover's attack on the random oracle component of the BLISS signature scheme. In order to counter this attack, a new hash-based random oracle is proposed, with an adjusted \(\kappa \) parameter and an “intermediate” hash which reduces signature size. It is currently not easy to estimate the security of all components of BLISS/BLZZRD against quantum attacks, but the new random oracle parameters are consistent with suggested quantum-resistant key sizes for symmetric ciphers [7].

Many algorithms in lattice cryptography utilize discrete Gaussian variables in signatures and ciphertexts. We show that arithmetic coding can be used to compress these quantities in an essentially optimal fashion. We describe an efficient Binary Arithmetic Coder (BAC) that produces smaller signatures than previous compressors. A somewhat surprising finding is that arithmetic coders can be adapted as non-uniform (Gaussian) samplers with performance characteristics (millions of samples per second) comparable to other sampling algorithms, while requiring only an (optimally) small number of true random bits.

Standard “blinding” side-channel countermeasures for public key algorithms introduce randomness (noise) into private key operations, thus making determination of secrets more difficult. In ideal lattice cryptography, the main components used by private key operations typically are ring arithmetic and non-uniform (Gaussian) random sampling.

For ring arithmetic, we introduce “blinded polynomial multiplication,” a simple randomization technique based on constant multiplication and rotation of polynomials. This technique is cheap to implement and utilizes the specific isometries of the anti-circulant rings often used in lattice cryptography.

For Gaussian sampling, we note that lattice cryptography algorithms typically require vectors rather than individual samples. We show that sampling processes can be blinded by shuffling these vectors and combining multiple Gaussian distributions with each other. This also results in a smaller implementation footprint since the size of required CDF tables becomes smaller. The general countermeasure is effective against side-channel attacks such as the cache attack recently described against BLISS.