1 Introduction

The security of public-key cryptography (PKC) depends on the computational intractability of some underlying mathematical problems. The most widely used public-key algorithms today, RSA [44] and elliptic curve cryptography (ECC) [37], are based on the hardness of the integer factorization problem and the elliptic curve discrete logarithm problem, respectively. Unfortunately, both of these hard problems can be solved in polynomial time with large-scale quantum computers by using Shor's [46] and Proos-Zalka's [41] algorithms. Post-quantum cryptography (PQC) is a branch of PKC that focuses on designing cryptographic algorithms whose underlying mathematical problems remain hard even in the presence of large quantum computers. Considering the fast evolution of quantum computers and their impending threat to our current public-key infrastructure, the National Institute of Standards and Technology (NIST) started a procedure to standardize post-quantum public-key cryptographic primitives such as digital signatures, public-key encryption, and key-encapsulation mechanisms in 2016 [39].

In 2020, NIST announced four finalists and five alternate candidates in the post-quantum key-encapsulation mechanism (KEM) category that advanced to the third round [2]. Three of the four finalist KEMs, Saber [19], Kyber [10], and NTRU [26], are lattice-based. NTRU is an NTRU-based KEM, whereas Kyber and Saber are based on variants of the learning with errors (LWE) problem. The security of Kyber reduces to the module learning with errors (MLWE) problem, and the security of Saber is based on the module learning with rounding (MLWR) problem. The hardness of both the LWE and LWR problems relies on the difficulty of solving a set of noisy linear equations. This noise is explicitly added in an LWE instance, but is implicitly generated in an LWR instance by rounding off a few least significant bits.

Initially, the main focus of the NIST post-quantum standardization procedure was the mathematical security of the schemes, together with the performance and memory footprint of their implementations on embedded devices. As the standardization process advanced, the focus broadened to also take the implementation security of the schemes into account. Side-channel attacks (SCA) [34] are a well-known class of physical attacks against implementations of cryptographic algorithms. These attacks exploit information that leaks from the physical device running the algorithm, such as timing, power consumption, or electromagnetic radiation, to extract the secret key.

Silverman et al. [47] first showed a timing attack on the quantum-secure lattice-based cryptographic protocol NTRUEncrypt [28] by exploiting its non-constant-time implementation. To prevent timing attacks, most cryptographic schemes, including Saber and Kyber, use constant-time implementations. In recent years, many works [3, 24, 29, 42, 50] demonstrated SCA on lattice-based schemes using power consumption and electromagnetic leakage. A provably secure countermeasure against these kinds of SCA is masking [13].

Masking can also provide security against higher-order attacks, in which the adversary combines power consumption information from multiple points. However, the performance cost of a masked scheme increases with the masking order. Reparaz et al. [43] were the first to introduce a first-order SCA-resistant masked implementation of a chosen-plaintext attack (CPA) secure ring-LWE based decryption. Nevertheless, real-world applications use chosen-ciphertext attack (CCA) secure cryptosystems. Lattice-based quantum-secure KEMs such as Saber and Kyber achieve CCA security by applying a variant of the Fujisaki-Okamoto transformation [30] to their CPA-secure designs. Oder et al. [40] proposed a first-order CCA-secure masked ring-LWE key decapsulation and reported a performance overhead factor of 5.2x over an unmasked implementation on an ARM Cortex-M4.

Van Beirendonck et al. [6] proposed a first-order SCA-secure implementation of Saber with an overhead factor of 2.5x. This performance is achievable thanks to Saber's power-of-two moduli and the efficient use of masking techniques specifically aimed at first-order security [48]. Heinz et al. [25] presented an optimized first-order masked implementation of Kyber with an overhead factor of 3.4x compared to the unmasked implementation of Kyber. Fritzmann et al. [21] proposed first-order masked implementations of Kyber and Saber with instruction set extensions, and Bos et al. [11] proposed higher-order masked implementations of Kyber.

First-order masked implementations of schemes are typically vulnerable to higher-order side-channel attacks [36, 49], i.e., attacks that exploit the side-channel leakage of multiple intermediate values. Ngo et al. [38] proposed an attack on first-order masked Saber using a deep neural network constructed in the profiling stage. This attack does not violate the security assumption of first-order masked Saber but exploits higher-order side-channel leakage. Higher-order masking makes the attack complexity grow exponentially with the masking order in the presence of noise, and thus mitigates attacks that exploit higher-order side-channel leakage.

In the third-round NIST submission, the Saber team introduced uSaber as a variant of Saber. In uSaber, the secrets are sampled from a uniform distribution instead of the centered binomial distribution used in Saber. The authors claim that the advantage of this modification is twofold. First, it makes the scheme simpler, since sampling from a uniform distribution is more straightforward than sampling from a centered binomial distribution, and it also reduces the modulus by a factor of two. Second, this change allows very efficient masking of the secret values. However, this claim had yet to be verified, as no masked implementation of uSaber existed to corroborate it.

Contribution. In this work, we provide arbitrary-order masked implementations of Saber and uSaber, and we compare their performances with the state-of-the-art masked implementations of Saber and Kyber. We are the first to propose a higher-order masked implementation of uSaber. For this, we present a masked centered uniform sampler, which uSaber uses in place of Saber's centered binomial sampler. We take advantage of Saber's power-of-two moduli to mask both Saber's and uSaber's decapsulation algorithms, and we compare different recently proposed algorithms for ciphertext comparison in higher-order masked settings.

We implement and benchmark our higher-order masked Saber and uSaber on an ARM Cortex-M4 microcontroller using the PQM4 framework. The first-, second-, and third-order masked decapsulation algorithms of Saber have overhead factors of 2.7x, 5x, and 7.7x over the unmasked implementation, respectively. In uSaber, the overhead factor is 2.3x for first order, 4.2x for second order, and 6.5x for third order compared to the unmasked version. We report the performance results and the number of random bytes required during masking for each masked primitive of first-, second-, and third-order masked Saber and uSaber. Our implementations are available at https://github.com/KULeuven-COSIC/Higher-order-masked-Saber.

Finally, we compare the performances of our higher-order masked implementations of Saber and uSaber with the higher-order masked implementations of Kyber and Saber presented in [11, 12]. We demonstrate that the masked Saber implementations outperform the masked Kyber implementations. Further, we show that masked uSaber performs better and requires fewer random bytes than masked Saber and Kyber at every order.

2 Preliminaries

2.1 Notation

We denote the ring of integers modulo q by \(\mathbb {Z}_q\) and the quotient ring \(\mathbb {Z}_q[X]/(X^{256}+1)\) by \({R}_q\). We use \({R}_q^l\) to represent vectors with l elements of \({R}_q\), and \({R}_q^{l\times l}\) to denote \(l\times l\) matrices over \({R}_q\). We use lower case letters to denote single polynomials, bold lower case letters to denote vectors, and bold upper case letters to denote matrices. The j-th coefficient of the polynomial c is represented as c[j], where \(j \in \{ 0, 1, \ldots , 255 \}\). The j-th coefficient of the i-th polynomial of the vector \(\textbf{b}\) is represented as \(\textbf{b}[i][j]\), where \(j \in \{ 0, 1, \ldots , 255 \}\) and \(i \in \{ 0, 1, \ldots , l-1 \}\). Sometimes the set of \((n+1)\) elements \(\{ x_0, x_1, \ldots , x_n \}\) from the same ring R is denoted by \(\{ x_i \}_{0\le i \le n}\).

The rounding operation is denoted by \(\lfloor \cdot \rceil \), and it returns the closest integer with ties rounded upwards. The operations \(x \ll b\) and \(x \gg b\) denote the logical shifting of x by b positions left and right, respectively. These operations are extended to polynomials by applying them coefficient-wise.

We denote \(x \leftarrow \chi (S)\) when x is sampled from the set S according to the distribution \(\chi \). We use the notation \(x \leftarrow \chi (S, seed_x)\) to represent that x belongs to the set S and is generated by the pseudorandom number generator \(\chi \) with the help of the seed \(seed_x\). To represent the uniform distribution we use \(\mathcal {U}\). The centered binomial distribution is denoted by \(\beta _\mu \) and has standard deviation \(\sqrt{\mu /4}\). The centered uniform distribution \(\mathcal {U}_u\) samples uniformly from \([-2^{(u-1)},2^{(u-1)}-1]\). We use \(\texttt{HW}(x)\) to represent the Hamming weight of x.

2.2 Saber

In this section, we introduce the Saber encryption scheme. The parameter set of Saber includes three power-of-two moduli q, p and t, which define the rings \(R_q\), \(R_p\) and \(R_t\) used in the algorithm. From these moduli, one can calculate the number of bits of one coefficient as \(\epsilon _q = \log _2(q)\), \(\epsilon _p = \log _2(p)\) and \(\epsilon _t = \log _2(t)\). The parameter set also includes a vector length l, which increases with the security level, and an integer \(\mu \) that parameterizes the centered binomial secret distribution \(\beta _\mu \). Given a set of parameters, the key generation, encryption, and decryption of Saber are shown in Fig. 1. For an in-depth review of the Saber encryption scheme, we refer to the original papers [19, 20].

2.3 uSaber

uSaber, or uniform Saber, was proposed in the third-round NIST submission [20] as a variant of Saber. The principal difference between uSaber and Saber is that uSaber uses a centered uniform distribution \(\mathcal {U}_u\) for sampling secret vectors instead of the centered binomial distribution \(\beta _\mu \). The coefficients of the polynomials in the secret vector lie in \([-2^{(u-1)},2^{(u-1)}-1]\) rather than \([-\mu /2,\mu /2]\). With this modification, uSaber achieves approximately the same security level as Saber with a slightly smaller parameter set, as shown in Table 1.

Fig. 1. \(\texttt{Saber}{.}\texttt{PKE}\)

Table 1. Parameters of Saber and uSaber with security and failure probability

2.4 Fujisaki-Okamoto Transformation

The encryption scheme outlined in the previous section only provides security against passive attackers (IND-CPA security). One can obtain security against active attackers (IND-CCA security) by using a generic transformation such as a post-quantum version of the Fujisaki-Okamoto transformation [22, 27]. The idea is that the encapsulation encrypts a random input and also uses this input as a seed for all randomness. The decapsulation can then decrypt the seed from the ciphertext and recompute the ciphertext. This recomputed ciphertext can be used to check whether the input ciphertext was generated correctly. The Fujisaki-Okamoto transformation turns the encryption scheme into a key encapsulation mechanism (KEM). Given hash functions \(\mathcal {F}\), \(\mathcal {G}\) and \(\mathcal {H}\), the Saber KEM is given in Fig. 2. Again, we refer to the original Saber papers [19, 20] for a more detailed description.

Fig. 2. \(\texttt{Saber}{.}\texttt{KEM} \)

2.5 Higher-Order Masking

Masking is a widely used countermeasure against side-channel attacks. An nth-order masked scheme can provide security against at most nth-order differential power attacks. The general idea of nth-order masking is to split a sensitive variable x into \(n+1\) shares and then perform all operations of the algorithm on the shares. The shares of the sensitive variable look uniformly random, and the sensitive information can only be retrieved by combining all \(n+1\) shares. Consequently, an adversary who obtains side-channel information from at most n points learns nothing about the sensitive variable. In an nth-order masked implementation, linear operations are typically duplicated \((n+1)\) times, while non-linear operations require more complex and costlier methods. As a consequence, the performance cost of an nth-order masked implementation increases by a factor of at least \((n+1)\).

There are several methods for masking. We primarily use two kinds of masking techniques: arithmetic masking and Boolean masking. For both masking techniques, to obtain nth-order security, the sensitive variable \(x\in \mathbb {Z}_q\) is split into \(n+1\) independent shares \(x_0,x_1,\ldots ,x_{n}\in \mathbb {Z}_q \). In arithmetic masking, the relation between the sensitive variable x and its \(n+1\) shares is \(x=x_0+x_1+\cdots +x_{n} \bmod {q}\), whereas in Boolean masking the sensitive variable x and its \(n+1\) shares are related as \(x=x_0 \oplus x_1 \oplus \cdots \oplus x_{n}\).

Arithmetic masking is advantageous for protecting arithmetic operations such as addition, subtraction, and multiplication. For example, to protect the modular addition \(z=x+y \bmod {q}\) against nth-order attacks when only x contains sensitive data, we split x into \(n+1\) shares \(\{x_i\}_{0\le i \le n}\) such that \(\sum _{i=0}^{n} x_i \bmod {q}=x\); the shares of \(z=\sum _{i=0}^{n} z_i \bmod {q}\) are then:

$$z_i = \begin{cases} x_i+y \bmod {q}, & \text{if } i=0, \\ x_i, & \text{if } 1\le i \le n. \end{cases}$$

If x and y both contain sensitive data, we split y, like x, into \(n+1\) shares \(\{y_i\}_{0\le i \le n}\) such that \(\sum _{i=0}^{n} y_i \bmod {q}=y\); the shares of \(z=\sum _{i=0}^{n} z_i \bmod {q}\) are then:

$$z_i = x_i+y_i \bmod {q}, \text { } {0\le i \le n} .$$

To securely compute the multiplication \(z=x\cdot y \bmod {q}\) when only x contains sensitive data, we create \(n+1\) shares \(\{x_i\}_{0\le i \le n}\) for x such that \(\sum _{i=0}^{n} x_i \bmod {q}=x\); the shares of \(z=\sum _{i=0}^{n} z_i \bmod {q}\) are then:

$$z_i = x_i\cdot y \bmod {q}, \text { } {0\le i \le n}.$$

We prefer Boolean masking for variables that undergo bitwise operations. For example, to perform the logical shift operation \(z=x \gg l\) securely, we split x into \(n+1\) shares \(\{x_i\}_{0\le i \le n}\) such that \(\oplus _{i=0}^{n} x_i=x\) and calculate \(z_i = x_i \gg l\) for all i to obtain the shares of \(z=\oplus _{i=0}^{n} z_i \).
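As a concrete illustration of these share-wise rules, the following first-order C sketch (illustrative only, with fixed randomness and no side-channel hardening; all names are ours) checks that arithmetic shares handle modular addition and multiplication with an unmasked operand, while Boolean shares handle a logical shift.

    #include <stdint.h>
    #include <stdio.h>

    #define Q (1 << 13)   /* Saber's modulus q = 2^13 */

    int main(void) {
        uint16_t x = 1234, y = 777;   /* x is sensitive, y is public       */
        uint16_t r = 4321;            /* masking randomness (fixed here;
                                         a TRNG must be used in practice)  */

        /* Arithmetic masking: x = x0 + x1 mod q.                          */
        uint16_t x0 = (uint16_t)(x - r) % Q, x1 = r;

        /* z = x + y mod q: only share 0 absorbs the unmasked operand.     */
        uint16_t z0 = (x0 + y) % Q, z1 = x1;

        /* w = x * y mod q: the unmasked operand multiplies every share.   */
        uint16_t w0 = (uint32_t)x0 * y % Q, w1 = (uint32_t)x1 * y % Q;

        /* Boolean masking: x = xb0 ^ xb1; a shift is applied share-wise.  */
        uint16_t xb0 = x ^ r, xb1 = r;
        uint16_t s0 = xb0 >> 3, s1 = xb1 >> 3;

        printf("add  : %d == %d\n", (x + y) % Q, (z0 + z1) % Q);
        printf("mul  : %d == %d\n", (int)((uint32_t)x * y % Q), (w0 + w1) % Q);
        printf("shift: %d == %d\n", x >> 3, s0 ^ s1);
        return 0;
    }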

3 Masking Saber

In a key encapsulation mechanism (KEM), the secret key remains fixed for a significant amount of time. Specifically, the decapsulation algorithm uses the non-ephemeral secret key \(\textbf{s}\), and therefore it is the operation most susceptible to side-channel attacks. In this paper, we focus on protecting the non-ephemeral secret key of Saber during decapsulation. We introduce a masked decapsulation algorithm for Saber that can resist higher-order side-channel attacks. The decapsulation procedure of Saber can be partitioned into three segments: decryption, re-encryption, and ciphertext comparison. For visualization, we present the flow of Saber's decapsulation algorithm in Fig. 3. All modules that process sensitive data derived from the secret are marked grey; these modules are vulnerable to side-channel attacks and need to be masked. In this section, we describe all the masked primitives used in the higher-order masked decapsulation procedure of Saber. We also present a new algorithm to perform the ciphertext comparison in the masked setting. We go through each part of the decapsulation algorithm of Saber in order and explain the methods we use to mask it.

3.1 Arithmetic Operations

The decapsulation algorithm of Saber relies heavily on polynomial arithmetic, such as polynomial addition/subtraction and polynomial multiplication. We use arithmetic masking to protect these operations. As shown in Fig. 3, the decapsulation algorithm requires the following operations: addition of a masked and an unmasked polynomial, addition of two masked polynomials, and multiplication of a masked and an unmasked polynomial. For masking these operations, we follow the methods described in Sect. 2.5.

Fig. 3. Decapsulation of Saber. In grey: the operations that are influenced by the long-term secret \(\pmb {s}\) and are thus vulnerable to side-channel attacks [6].

The original unmasked Saber implementation performs polynomial multiplication with a hybrid algorithm that combines Toom-Cook-4, two levels of Karatsuba, and schoolbook multiplication [19, 33, 35]. We use the same multiplication technique in our masked implementation. Chung et al. [14] have recently introduced an efficient method to perform polynomial multiplication using the number-theoretic transform. The same method could be used in the implementation of masked Saber to provide a significant performance improvement [12]. However, this is not the goal of our work and we leave it as future work.
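For reference, the ring arithmetic that any of these multiplication algorithms must realize can be sketched as a plain schoolbook multiplication in \(R_q\) (illustrative C only; this is not the optimized Toom-Cook/Karatsuba routine of the implementation). Reduction modulo \(X^{256}+1\) means a wrap-around over the degree bound picks up a minus sign.

    #include <stdint.h>
    #include <string.h>

    #define N 256
    #define Q (1 << 13)   /* Saber's modulus q = 2^13 */

    /* Reference schoolbook multiplication in R_q = Z_q[X]/(X^256 + 1). */
    void poly_mul_ref(uint16_t res[N], const uint16_t a[N], const uint16_t b[N])
    {
        int32_t acc[N];
        memset(acc, 0, sizeof acc);
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                int32_t prod = (int32_t)a[i] * b[j] % Q;
                if (i + j < N)
                    acc[i + j] += prod;        /* regular coefficient */
                else
                    acc[i + j - N] -= prod;    /* X^256 = -1 in R_q   */
            }
        }
        for (int k = 0; k < N; k++)
            res[k] = (uint16_t)(((acc[k] % Q) + Q) % Q);
    }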

3.2 Compression

In the last step of Saber.PKE.Dec, m is computed by calculating the most significant bit (MSB) for each coefficient. It compresses each coefficient of the polynomial \((v-2^{\epsilon _p-\epsilon _t}c_m + h_2) \bmod {p}\) to produce a polynomial m where each coefficient is one bit long.

The logical shift operation is easy to compute on Boolean shares: it is simply applied to each share separately. Unfortunately, computing a logical shift on arithmetic shares is not trivial. This is discussed in detail in [6] for the case of first-order masking, and the same issue arises for higher-order masking.

We compute the MSB on arithmetic shares in the following steps: first, convert the arithmetic shares to Boolean shares (A2B conversion); second, perform the logical shift on the Boolean shares; and finally, return to the arithmetic domain with a Boolean-to-arithmetic (B2A) conversion. As \(m\in R_2\), the resulting polynomial after compression has 1-bit coefficients, so the Boolean shares of m already act as arithmetic shares of m and the B2A conversion step can be skipped.
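The need for the A2B step can be seen from a small first-order example (our own illustration, without any side-channel hardening): shifting Boolean shares share-wise yields correct shares of the MSB, whereas the same share-wise shift on arithmetic shares is in general wrong because the carry between the shares is lost.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const uint16_t p = 1 << 10;     /* Saber's modulus p = 2^10        */
        uint16_t x = 679;               /* sensitive coefficient, MSB = 1  */
        uint16_t r = 412;               /* masking randomness              */

        /* Boolean masking: x = xb0 ^ xb1; the share-wise shift is exact.  */
        uint16_t xb0 = x ^ r, xb1 = r;
        uint16_t msb_bool = ((xb0 >> 9) ^ (xb1 >> 9)) & 1;

        /* Arithmetic masking: x = xa0 + xa1 mod p; the share-wise shift
         * drops the carry between the shares and is generally wrong.      */
        uint16_t xa0 = (uint16_t)(x - r) % p, xa1 = r;
        uint16_t msb_arith_naive = ((xa0 >> 9) + (xa1 >> 9)) & 1;

        printf("true MSB            : %d\n", (x >> 9) & 1);
        printf("from Boolean shares : %d\n", msb_bool);
        printf("naive on arithmetic : %d\n", msb_arith_naive);
        return 0;
    }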

Bitslicing is a technique that improves the performance of bitwise operations. For the A2B conversion in our implementations, we use the algorithm proposed in [15] in the bitsliced implementation of [17].

3.3 Masked Hashing

In Saber, the hash function \(\mathcal {G}\) and the pseudo-random number generator XOF are realized with SHA3-512 and SHAKE-128, respectively. Both are instances of the sponge construction based on the Keccak-f[1600] permutation [7]. It has been shown that this construction is easy to protect with Boolean masking [23].

The Keccak-f[1600] permutation consists of five steps: \(\theta , \rho , \pi , \chi \) and \(\iota \). Of these, \(\theta , \rho \) and \(\pi \) are linear diffusion steps and \(\iota \) is a simple constant addition. As all four are linear over Boolean shares, we simply apply them to each share. The step \(\chi \) is a non-linear mapping of degree two and therefore requires extra care when masking. Gross et al. [23] developed a technique to implement \(\chi \) in the higher-order masked setting, which we adopt in our implementations.
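As an intuition for the non-linear part, the sketch below shows a classic first-order ISW-style masked AND on 64-bit Keccak lanes; it is our own illustration, not the higher-order domain-oriented scheme of Gross et al. [23] that the implementation uses. A fresh random lane r lets the shared AND be computed without ever recombining the input shares; in \(\chi \), the complemented operand is obtained by complementing a single share.

    #include <stdint.h>

    /* First-order ISW-style masked AND of Boolean-shared 64-bit lanes:
     * x = x[0] ^ x[1], y = y[0] ^ y[1], r is fresh randomness.
     * The output satisfies z[0] ^ z[1] = x & y.                        */
    void sec_and64(uint64_t z[2],
                   const uint64_t x[2], const uint64_t y[2], uint64_t r)
    {
        z[0] = (x[0] & y[0]) ^ r;
        z[1] = (x[1] & y[1]) ^ r ^ (x[0] & y[1]) ^ (x[1] & y[0]);
    }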

3.4 Masked Centered Binomial Sampler

Saber.PKE.Enc uses the centered binomial sampler to sample the vector \(\mathbf {s'}\). This sampler outputs the result of HW(x)−HW(y), where x and y are pseudorandom numbers of bit length four. These pseudorandom numbers are generated using SHAKE-128. As mentioned in Sect. 3.3, SHAKE-128 is protected with Boolean masking. After the generation of \(\mathbf {s'}\), polynomial multiplications with \(\mathbf {s'}\) (e.g. \(\mathbf {As'}\) and \(\textbf{b}^T\mathbf {s'}\)) take place. SHAKE-128 produces Boolean shares, but polynomial multiplication is an arithmetic operation that is cheaper on arithmetic shares. To bridge this gap, the masked centered binomial sampler must include a conversion algorithm that converts Boolean shares into arithmetic shares (B2A conversion).
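For reference, the unprotected computation that the masked sampler has to realize on shares is tiny: with \(\mu = 8\), each coefficient is derived from one byte of SHAKE-128 output. The nibble split below is a hypothetical packing; the reference code's exact bit layout may differ.

    #include <stdint.h>

    /* Unmasked reference of Saber's centered binomial sampler beta_8:
     * one coefficient in [-4, 4] per byte, computed as HW(x) - HW(y)
     * with x and y the two 4-bit halves.
     * __builtin_popcount is a GCC/Clang intrinsic.                     */
    static int8_t cbd_mu8(uint8_t byte)
    {
        int hw_x = __builtin_popcount(byte & 0x0F);  /* HW of low nibble  */
        int hw_y = __builtin_popcount(byte >> 4);    /* HW of high nibble */
        return (int8_t)(hw_x - hw_y);
    }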

Schneider et al. [45] propose two efficient higher-order masked centered binomial samplers: \(sampler_1\) computes masked shares bitwise, whereas \(sampler_2\) uses bitslicing techniques to improve throughput. We adopt the implementation of \(sampler_2\) together with the modification made by Van Beirendonck et al. [6] specifically for Saber.

To convert shares from Boolean to arithmetic, we use the B2A algorithm proposed in [8]. The details are given in Algorithm 1. In this algorithm, SecBitAdd computes shares of \(\texttt{HW} (x)\), and SecBitSub takes shares of \(\texttt{HW} (x)\) and shares of y as inputs and outputs shares of \(z=\texttt{HW} (x) -\texttt{HW} (y)\). The function SecConstAdd adds \(\mu /2=4\) to the shares of z to avoid the negative values that can occur after SecBitSub. In the next step, the B2A function converts the Boolean shares of z into the arithmetic shares of A, and the last step maps the shares of A from \(\{0,1, \ldots , 8\}\) back to \(\{-4,-3,\ldots ,3,4\}\).

Algorithm 1 (listing omitted)

3.5 Masked Comparison

The masked ciphertext comparison component checks the equality between the masked ciphertext generated by re-encryption and the public ciphertext. This step performs the equality check \(c\overset{?}{=}c^*\) of the Saber.KEM.Decaps algorithm in the masked domain.

A simple yet efficient method for first-order masked comparison was introduced by Oder et al. [40]. Unfortunately, this hash-based method is limited to first-order masking and cannot be generalized to check ciphertext equality in higher-order masked settings.

Different approaches for higher-order masked comparison were recently analyzed thoroughly by D'Anvers et al. [17]. In general, there are four approaches. The simple method, originally due to Barthe et al. [5], groups individual bits into a large SecOR operation. It requires a pre-processing step to handle ciphertext compression that is straightforward to mask for Saber, but more complex for Kyber [21]. The arithmetic method was developed in a series of works [4, 9, 18] and aims to reduce the total number of comparisons by grouping coefficients into a random sum. The decompression method [11], developed for Kyber, avoids masking the compression of the re-encrypted ciphertext by decompressing the input ciphertext instead. Finally, the hybrid method [16] introduced the idea of using different previously discussed methods for the different components of the ciphertext. All of these approaches rely on A2B conversions, which can be heavily optimized using bitslicing [12, 17].

In this section, we will discuss two of these different approaches to higher-order masked comparison. The first one is the Saber-adapted decompression method, which was not considered in [17]. The second one is the simple method, which was found to be the most efficient method for Saber in that same work. For both methods, we consolidate concurrent A2B optimization techniques proposed in [12, 17].

3.5.1 Decompressed Masked Comparison Algorithm

Algorithms 2 and 3 (listings omitted)

Bos et al. [11] introduced a new method based on A2B conversion for the masked comparison algorithm of Kyber, in order to reduce the cost of the Boolean equality check circuit. This method does not perform the compression operation on the recomputed ciphertext, but performs a decompression operation on the public ciphertext instead. The comparison is then performed in the uncompressed domain. The decompression operation is cheap to apply to the public ciphertext because the latter is public, so this operation can be performed unmasked.

Let the public ciphertext be \(c=(\textbf{b},c_m)\), where \(\textbf{b}\) is the key-related part and \(c_m\) is the message-related part of the ciphertext c. In Saber, a compression operation is applied during encryption to generate the ciphertext, and this operation is many-to-one. In this process, each coefficient of \({b}\) loses three bits and each coefficient of \(c_m\) loses six bits. To compensate, the masked comparison uses a decompression operation, which outputs an interval of integers for each coefficient of the public ciphertext. Let c[j] be a coefficient of the public ciphertext c, and let \((s_c[j], e_c[j])\) be the corresponding output of the decompression operation. Every element in the interval \((s_c[j], e_c[j])\) maps to c[j] under the compression operation.

Next, we verify that each coefficient of the shared uncompressed ciphertext \(c^*\), which is generated by the re-encryption, lies in the corresponding decompressed interval. Let \(c^*[j]\) be the arithmetically masked uncompressed ciphertext coefficient corresponding to the public ciphertext coefficient c[j]. If \(c^*[j]\in (s_c[j], e_c[j])\) for all coefficients j, then the comparison succeeds and outputs the shared valid key; otherwise it returns a random invalid key. The test \(c^*[j]\overset{?}{\in }(s_c[j], e_c[j])\) is realized by checking whether \(c^*[j]-s_c[j]\) is a positive number and whether \(c^*[j]-e_c[j]\) is a negative number. We adopt this method to perform the higher-order masked ciphertext comparison in Saber, as shown in Algorithm 2.

In Algorithm 2, lines 2–3 and 11–12 compute the start point and the end point of the interval for each coefficient of the key part \(\textbf{b}\) and the message part \(c_m\) of the public ciphertext c, respectively. The MSB of a number acts as a sign bit, i.e., if the MSB is 1 then the number is negative, otherwise it is positive. In the valid case, \(c^*[j]-s_c[j]>0\) and \(c^*[j]-e_c[j]<0\), so MSB\((c^*[j]-s_c[j])\) should be 0 and MSB\((c^*[j]-e_c[j])\) should be 1. To avoid two different kinds of checks for \(c^*[j]-s_c[j]\) and \(c^*[j]-e_c[j]\), we add a constant l to \(c^*[j]-s_c[j]\) such that its MSB becomes 1. The value of l is \(2^{(\epsilon _q-1)}\) and \(2^{(\epsilon _p-1)}\) for the \(\textbf{b}\) and \(c_m\) parts of c, respectively. We compute the MSB of an arithmetically masked variable as follows: we convert the arithmetic shares to Boolean shares using an A2B conversion, and then use a shift operation to extract the masked shares of the MSB. Finally, Algorithm 2 uses the BooleanAllBitsOneTest function of Algorithm 3 to combine the output bits of all coefficients and return a single bit indicating success or failure.
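Stripped of the masking, the per-coefficient test of Algorithm 2 amounts to the following check (a hedged illustration with our own helper name; in the implementation the subtractions act on arithmetic shares and the MSBs are extracted through the A2B conversion described above).

    #include <stdint.h>

    /* Unmasked illustration of the interval test: cstar is an uncompressed
     * re-encrypted coefficient, (s, e) the decompression interval of the
     * public coefficient, and eps the coefficient width (eps_q for b,
     * eps_p for c_m). Adding l = 2^(eps-1) to cstar - s turns "positive"
     * into "MSB = 1", so both conditions reduce to the same MSB check.    */
    static int in_interval(uint32_t cstar, uint32_t s, uint32_t e, unsigned eps)
    {
        uint32_t mask = (1u << eps) - 1;
        uint32_t l = 1u << (eps - 1);
        uint32_t t0 = (cstar - s + l) & mask;  /* MSB 1 for a small positive
                                                  difference cstar - s      */
        uint32_t t1 = (cstar - e) & mask;      /* MSB 1 for a negative
                                                  difference cstar - e      */
        return ((t0 & l) != 0) && ((t1 & l) != 0);
    }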

3.5.2 Simple Masked Comparison Algorithm

Next, we describe the simple method as given in [17]. Note that the re-encrypted ciphertext \(c^*\) is arithmetically masked and uncompressed, whereas the public ciphertext c is unmasked and compressed. As mentioned earlier, our task is to verify whether c equals \(c^*\) after compression. The method proceeds in three steps: first, we transform the arithmetic shares of \(c^*\) into Boolean shares using an A2B conversion; second, we compress the Boolean-masked re-encrypted ciphertext \(c^*\) with a coefficient-wise logical right shift; third, we subtract the public ciphertext c from the compressed and masked re-encrypted ciphertext \(c^*\). This method is shown in Algorithm 4.

Algorithm 4 (listing omitted)
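A single-coefficient, first-order illustration of this flow for the \(\textbf{b}\) part of the ciphertext (\(\epsilon _q = 13\), \(\epsilon _p = 10\)) is sketched below; it assumes the A2B conversion has already produced Boolean shares, realizes the subtraction of the public value as an XOR in the Boolean domain, omits all hardening, and leaves the aggregation over coefficients to BooleanAllBitsOneTest as in the real implementation.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint16_t cstar = 6743;               /* uncompressed re-encrypted coeff. */
        uint16_t r = 3388;                   /* Boolean mask                     */
        uint16_t cb0 = cstar ^ r, cb1 = r;   /* Boolean shares after A2B         */

        uint16_t c_pub = cstar >> 3;         /* honest public coefficient        */

        /* Compress each Boolean share with a share-wise right shift by
         * eps_q - eps_p = 3, then fold the public value into one share.  */
        uint16_t d0 = (uint16_t)((cb0 >> 3) ^ c_pub);
        uint16_t d1 = cb1 >> 3;

        /* The shares of the difference combine to zero iff the public
         * coefficient equals the recompressed re-encrypted coefficient.  */
        printf("difference = %d (0 means equal)\n", d0 ^ d1);
        return 0;
    }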

3.5.3 Bitsliced A2B

Both the decompression method and the simple method rely heavily on A2B conversions. Throughout the implementations, we use the bitsliced A2B conversion [17] for further speed-up. Moreover, A2B conversions use the SecAdd sub-function to perform masked addition. Bronchain et al. [12] proposed a SecAdd that uses \(k-1\) SecAnd operations for k-bit inputs, as opposed to the \(2k-3\) SecAnd operations required in [17]. We integrated this technique into the implementation of [17] to obtain better performance.

4 Masking uSaber

In uSaber, the coefficients of the secret vector are sampled according to the centered uniform distribution \(\mathcal {U}_u\) instead of the centered binomial distribution \(\beta _\mu \). Here, the Hamming weight computation of the centered binomial sampler is replaced by the sign extension of u bits to \(\epsilon _q\) bits, generating a sample in \([-2^{(u-1)},2^{(u-1)}-1]\) from u uniformly random bits. This secret vector sampler is the only component that differs between Saber and uSaber. A main advantage of uSaber is that the centered uniform sampler requires fewer operations than the centered binomial sampler and is therefore easier to mask.
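Concretely, for uSaber's \(u = 2\), the unprotected sampler maps every pair of pseudorandom bits to a coefficient in \([-2, 1]\) by interpreting them as a signed 2-bit integer (our own sketch; the exact bit packing of the reference code may differ).

    #include <stdint.h>

    /* Unmasked reference of uSaber's centered uniform sampler U_2:
     * sign-extend u = 2 pseudorandom bits, i.e.
     * 00 -> 0, 01 -> 1, 10 -> -2, 11 -> -1.                           */
    static int8_t uniform_u2(uint8_t two_bits)
    {
        uint8_t v = two_bits & 0x3;
        return (v & 0x2) ? (int8_t)(v - 4) : (int8_t)v;
    }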

Similar to the centered binomial sampler, the centered uniform sampler takes as input pseudorandom Boolean-masked bits produced by the masked SHAKE-128 function. Our simple higher-order masked centered uniform sampler is shown in Algorithm 5. We base it on the SecConstAdd step of the masked centered binomial sampler mentioned in Sect. 3.4. First, we use an xor to map the negative values to non-negative ones. Second, we apply a B2A conversion algorithm to convert the Boolean shares into arithmetic shares. Third, we subtract \(2^{u-1}\) from the arithmetic shares to map them from \([0,2^u-1]\) back to \([-2^{(u-1)},2^{(u-1)}-1]\). This masked sampler does not require the SecBitAdd and SecBitSub functions used in the masked centered binomial sampler. The centered uniform sampler is thus simpler and requires fewer masked operations than the centered binomial sampler.

Algorithm 5 (listing omitted)

5 Performance Evaluation

To demonstrate the performance of all proposed methods, we implement them on a 32-bit ARM Cortex-M4 microcontroller (STM32F407-DISCOVERY development board). We adopt the widely used PQM4 [32] post-quantum cryptographic library and benchmarking framework for performance evaluation. In this framework, the system timer (SysTick) is used to measure cycle counts. The framework uses a 24 MHz main system clock and a 48 MHz TRNG clock. We take advantage of the on-chip TRNG to sample the masking randomness instead of generating random bits in advance and storing them in a table, as done for Kyber in [11]. This TRNG generates 32 random bits per 40 TRNG clock cycles, which corresponds to 20 main system clock cycles. We include the cost of randomness sampling in the benchmarks. We use arm-none-eabi-gcc version 9.2.1 to compile our implementations for the measurements.

5.1 Performance Analysis of Comparison Algorithms for Saber

We present the cycle counts of our implementations of the arbitrary-order masked comparison algorithms of Saber. In Saber, the parameters are \(\epsilon _q=13\), \(\epsilon _p=10\), \(\epsilon _t=4\) and \(l=3\). We break the cycle counts down into three parts: cycles spent in all A2B conversions, cycles spent computing the function BooleanAllBitsOneTest with the corresponding parameters, and cycles spent in all other operations. Table 2 contains the performance details of the masked ciphertext comparison algorithms presented in Sects. 3.5.1 and 3.5.2.

In Table 2, we include two versions of the decompressed comparison algorithm and the simple comparison algorithm. For the first version, we use the bitsliced A2B conversion technique of the masked simple comparator proposed in [17]; for the second, we improve this bitsliced A2B converter by employing the technique introduced in [12]. The table shows that the performance of the decompressed comparison algorithm improves by 9%, 16%, and 17% for first-, second-, and third-order masking, respectively, after applying [12]. In addition, the improved decompressed comparison algorithm requires 21% fewer random bytes for any masking order. The performance of the simple comparison algorithm improves by 8%, 15%, and 16% for first-, second-, and third-order masking, respectively, and the improved simple comparison algorithm requires approximately 19% fewer random bytes for any masking order.

As the table shows, the cycle count of all A2B conversions in the decompressed comparison algorithm is almost double that of the simple comparison algorithm for all orders. However, for all orders, the cycle count of BooleanAllBitsOneTest with the corresponding parameters is approximately four times smaller in the decompressed comparison algorithm than in the simple comparison algorithm. Overall, Table 2 shows that the improved simple comparison algorithm is approximately \(43\%\) faster and uses roughly 42% fewer random bytes than the improved decompressed comparison algorithm for any masking order of Saber. Similar results hold for higher-order masked uSaber. Therefore, we use the improved simple comparison algorithm in our higher-order masked Saber and uSaber decapsulation algorithms.

5.2 Performance Analysis for Masked Saber Decapsulation

We present the performance cost of higher-order masked Saber in Table 3. This table also provides a breakdown of the performance cost of higher-order masking over all modules of the masked Saber decapsulation algorithm. As mentioned earlier, our masked Saber implementations use the hybrid polynomial multiplication, a combination of Toom-Cook-4, Karatsuba, and schoolbook multiplication. Therefore, to obtain the overhead factor of nth-order masked Saber, we compare against the unmasked Saber implementation that uses the same hybrid polynomial multiplication. For simplicity, most of the implementation is written in C; only the hybrid multiplication is in assembly, generated using the optimized implementation proposed in [31].

Table 2. Performance breakdown of the implementation of masked comparison algorithms in the Cortex-M4 platform.
Table 3. Performance cost of all the modules of the higher-order masked decapsulation procedure of Saber.
Table 4. Randomness cost of all the modules of the higher-order masked decapsulation algorithm of Saber.

Table 3 shows that the performance overhead factor of the masked Saber decapsulation implementation is 2.69x for first order, 4.96x for second order, and 7.71x for third order. The overhead factor of the arithmetic operations is approximately \((n+1)\) for nth-order masking, since each operation is repeated \(n+1\) times. In contrast, the non-linear operations, such as the hash functions, the binomial sampler, the compression operation, and the ciphertext comparison, have much larger overhead factors in the masked setting. To maintain the security assumptions, some masking algorithms (e.g., SecAnd, SecAdd, SecRefresh) consume random bytes. Table 4 shows the random-byte requirements of all components of the higher-order masked Saber decapsulation algorithm. As can be seen from Table 4, the randomness requirement increases with the order: the first-order implementation requires 12k random bytes, while the second- and third-order implementations require 43k (3.43x) and 93k (7.34x) random bytes, respectively.

5.3 Performance Analysis for Masked uSaber Decapsulation

The performance cost of higher-order masking, broken down over all modules of the masked uSaber decapsulation algorithm, is presented in Table 5. As mentioned before, the main advantage of uSaber over Saber is that the coefficients of the secret vector are sampled from \(\mathcal {U}_2\) instead of \(\beta _8\). Thanks to this choice of secret distribution, uSaber needs fewer pseudorandom bits than Saber, which reduces the cycle cost of the XOF by almost \(60\%\) in the unmasked version of uSaber compared to Saber. Another advantage is that the Hamming weight computation over \(\mu \) bits in the centered binomial sampler \(\beta _\mu \) is replaced by the sign extension of u bits in the centered uniform sampler \(\mathcal {U}_u\), which reduces the cost of the secret sampler in unmasked uSaber by \(50\%\) compared to Saber. Altogether, secret generation is almost \(59\%\) faster in the unmasked decapsulation algorithm of uSaber than in Saber. The cost of secret generation also remains lower in uSaber after integrating masking: secret generation in masked uSaber is \(55\%\), \(52\%\), and \(45\%\) faster than in masked Saber for first-, second-, and third-order masking, respectively. Additionally, the value of q is \(2^{12}\) for uSaber, whereas it is \(2^{13}\) for Saber. This saves one bit in the A2B conversion during the masked polynomial comparison, making the comparison \(5\%\), \(5\%\), and \(2\%\) faster in uSaber than in Saber for first-, second-, and third-order masking, respectively.

Table 5. Performance cost of all the modules of higher-order masked decapsulation procedure of uSaber.

As Table 5 shows, the approximate performance overhead factor of the masked uSaber decapsulation implementation is 2.32x for first order, 4.19x for second order, and 6.54x for third order. Table 6 presents the random-byte requirements of all components of the higher-order masked uSaber decapsulation. Here, too, the randomness requirement grows with the masking order: the first-order implementation needs 10k random bytes, while the second- and third-order implementations use 36k (3.49x) and 79k (7.57x) random bytes, respectively.

Table 6. Randomness cost of all the modules of higher-order masked decapsulation algorithm of uSaber.

5.4 Comparison with State-of-the-Art

In this section, we compare our masked Saber and uSaber implementations with the state-of-the-art masked implementations of Saber and Kyber. We benchmark our masked implementations on the Cortex-M4 platform and present the results in Table 7. Bronchain et al. [12] introduced faster implementations of higher-order masked A2B and B2A conversions using bitslicing techniques and applied these conversions to propose higher-order masked implementations of Saber and Kyber. The performances of Bronchain et al.'s masked Saber and Kyber implementations on the Cortex-M4 platform are also listed in Table 7. As mentioned before, integrating NTT multiplication into masked Saber can provide a significant performance boost. In [12], the authors use NTT multiplication for Saber to obtain better performance, relying on a multi-moduli approach that extends the modulus [1]. Even so, our first-, second-, and third-order masked implementations of Saber achieve 39%, 23%, and 13% improvements over their masked implementation of Saber, respectively.

In [11], Bos et al. proposed higher-order masked implementations of Kyber. The masked Kyber implementation of [11] is faster and uses fewer random bytes than the masked Kyber presented in [12] only for first order, because [11] uses an implementation optimized for first order while Bronchain et al. use a generalized one. The second- and third-order masked Kyber implementations of [12] achieve 73% and 85% improvements over the masked Kyber of [11], respectively. However, our masked Saber implementation is faster than the masked Kyber of [12] by 60% for first order, 53% for second order, and 48% for third order. Also, our first-order masked Saber is 3% faster than the optimized first-order masked Kyber of [11]. In terms of randomness requirements, our masked Saber achieves factor 20.61x and 25.98x improvements over the masked Kyber of [11] for the second- and third-order masked implementations, respectively.

As discussed in Sect. 4, masked uSaber uses fewer operations and less randomness than masked Saber due to uSaber's choice of secret distribution and parameters. Table 7 shows the performance of the higher-order masked implementations of uSaber. It also contains the performance of first-order masked Saber [6] and first-order masked Kyber [25], which are specifically optimized against first-order differential power attacks. We observe from Table 7 that our generalized implementation of first-order masked uSaber is 12% faster than the optimized implementation of masked Saber and 16% faster than the optimized implementation of masked Kyber. Masked uSaber is also faster than the fastest higher-order masked Saber implementation by 20% for second order and 19% for third order, and it needs 15% and 14% fewer random bytes than masked Saber for second and third order, respectively. In conclusion, the results of Table 7 show that higher-order masked uSaber achieves better performance and needs fewer random bytes than masked Saber and Kyber at every order.

Table 7. The comparison between Saber and Kyber regarding the performance and the random bytes requirement.

6 Conclusions

Saber is often described as well suited to masking because of two of its design choices: the power-of-two moduli and the MLWR problem. Van Beirendonck et al. [6] showed that first-order masked Saber achieves better performance and needs fewer random bytes than first-order masked Kyber. In this work, we substantiate this claim for arbitrary higher-order masking and show that, thanks to these design decisions, higher-order masked Saber also achieves better performance and requires fewer random bytes.

The third-round submission document of Saber claims that the design decisions behind uSaber are even more beneficial for masking than those of Saber. This work is the first to concretely justify those design decisions.

Furthermore, our masking methods do not depend on the underlying polynomial multiplication, which is one of the computationally expensive components. Our masked implementations can therefore be adapted to Saber or uSaber implementations that use NTT multiplication instead of the hybrid multiplication.