Introduction

Through the NIST Post-Quantum Cryptography (PQC) Standardization Process [34], the cryptographic community is evaluating candidate KEM, PKE, and signature schemes potentially secure against both quantum and classical attacks. One requirement for these schemes to be deployed securely is a constant time implementation, which avoids leaking secret information through timing attacks. Furthermore, it is important to understand how efficient these constant time implementations are when running on real devices.

From an implementation perspective, while the cryptographic community has mostly focused its efforts on improving some categories of cryptosystems, such as those based on lattices and on codes in the Hamming metric, the same cannot be claimed for rank-based cryptosystems. Compared to lattice- and code-based cryptography, rank-based cryptography is a relatively new and less explored field. Although the first rank-based cryptosystem, the Gabidulin-Paramonov-Tretjakov (GPT) public key encryption scheme [23], was introduced in 1991, and many analyses were presented in the subsequent years (such as [22, 25, 36, 37]), new schemes have been proposed only recently, such as [5, 6, 11, 12, 20, 33], some of which have also been submitted to the NIST PQC standardization process. Up until the algebraic attack recently presented in [10], these schemes seemed to provide appealing performance levels and key and ciphertext sizes, which enabled ROLLO [2] (a merge of LAKE, LOCKER, and Rank-Ouroboros) and RQC [3] to pass the first round of the NIST PQC standardization process. However, in its recent status report on the second round [7], NIST did not select ROLLO and RQC to advance, with the motivation that their security analysis needs more time to mature. On the other hand, NIST encouraged the cryptographic community to continue studying rank-based cryptosystems, as they offer a nice alternative to traditional Hamming metric codes with comparable bandwidth.

There were still some open questions of practicality pertaining to the most recent NIST submission package for ROLLO (dated 2020/04/21 and available at pqc-rollo.org). In June 2020, a note was released [19] pointing out potential issues arising from parts of the submitted ROLLO and RQC code not being constant time.

In this work, we present the first constant time implementation of the 128-bit secure rank-based KEM called ROLLO-I-128. This work complements and improves the preliminary results posted on the NIST website [4] on the implementation of an earlier version of ROLLO-I-256. Other, non-constant time implementations of ROLLO on other platforms can be found, for example, in [1, 13], or in [32], where a software implementation of the encapsulation routine on a Cortex M0 and a hardware implementation on a microcontroller with a crypto coprocessor are presented, respectively.

Our Contribution

In this work, we propose different techniques that can be used to implement ROLLO and part of the RQC family of algorithms in a standalone, efficient, and constant time library. Recall that ROLLO-I-128, ROLLO-I-192, and ROLLO-I-256 have decryption failure probabilities of \(2^{-28}\), \(2^{-34}\), and \(2^{-33}\), respectively, which, cryptographically, are not considered small. For each of the proposed techniques, we present explicit code (with intrinsics when required) or pseudo-code, together with performance measures to show their impact.

As a theoretical contribution, we describe a new constant time variant of Gaussian elimination that reduces any matrix to its (not necessarily reduced) row echelon form. The only previous constant time variant we are aware of [14] worked only for full rank matrices, returning a systematic form of such matrices and terminating the algorithm if this was not possible. Furthermore, after analyzing current non-constant time algorithms that generate a list of vectors with a given rank, we describe a novel constant time probabilistic version of one of these algorithms, and we present a procedure for reducing its failure probability to a desired value. We also present a variation of this method which returns the entire support of the vector list. This potentially allows trade-offs between the public key size and the performance of the encapsulation step.

From an implementation perspective, we describe in detail the process of implementing the underlying finite field arithmetic with constant time operations, with and without the use of vectorization techniques. We provide an explicit description of the application of the Zassenhaus algorithm in the Rank Support Recovery algorithm described in the NIST submission of ROLLO [2]. We show how efficient polynomial arithmetic can be conducted by applying multiplication with lazy reduction and inversion of polynomials in a composite Galois field defined by a pentanomial. All of these are implemented using reasonably optimized constant time algorithms. Finally, we carry out a performance analysis to show the impact of these improvements on our implementation of ROLLO-I-128, compared with its reference and optimized implementations. We expect this work to shed light on the attainable performance of constant time implementations of ROLLO and to help practitioners make educated choices when implementing it or other constant time rank-based cryptographic algorithms.

Structure of the Paper

In “Preliminaries”, we introduce the basic concepts needed to understand the scheme and the subsequent algorithms. In “Description of the Scheme”, we describe the ROLLO-I key encapsulation mechanism. In “Proposed Algorithms”, we provide all the details regarding the binary field, vector space, and composite Galois field arithmetic, as well as the description of the Rank Support Recovery algorithm used in the decapsulation phase. In “Performance”, we compare the performance of our implementation of ROLLO-I-128 with that of various KEM submissions to the NIST PQC standardization process. In “Conclusion”, we present the conclusions drawn from this study.

Preliminaries

In this section, we first present the rings, fields and vector spaces we will work with as well as an associated metric, namely the rank metric, and then we will define error-correcting codes associated with this metric.

Structures and Representations

In the following, we let q be a prime power and m, n two positive integers. We will work with the finite fields of order q, \(q^m\), and \(q^{mn}\): \({\mathbb {F}}_{q}, {\mathbb {F}}_{q^m}, {\mathbb {F}}_{q^{mn}}\). Of course, there are multiple isomorphic fields of a given order, with multiple representations leading to different algorithms.

\(\mathbb {F}_{{\varvec{q}}}\). In this paper, as in the ROLLO specification, q will always be 2 and therefore elements and computations in \({\mathbb {F}}_{q}\) are associated to elements and computations in the modular ring \({\mathbb {Z}}/2{\mathbb {Z}}\).

\(\mathbb {F}_{{\varvec{q}}^{{\varvec{m}}}}\). As usual, elements in extensions of the base field \({\mathbb {F}}_{q}\) will be represented using quotients over the polynomial ring \({\mathbb {F}}_{q}[X]\). Thus, elements and computations in \({\mathbb {F}}_{q^m}\) are associated to polynomial representations and computations over \({\mathbb {F}}_{q}[X]/\left\langle P_0 \right\rangle\) for an irreducible polynomial \(P_0\) of degree m.

\(\mathbb {F}_{{\varvec{q}}^{{\varvec{mn}}}}\). Elements and computations in \({\mathbb {F}}_{q^{mn}}\) are similarly associated to polynomial representations and computations over \({\mathbb {F}}_{q^m}[X]/\left\langle P \right\rangle\) for an irreducible polynomial \(P \in {\mathbb {F}}_{q}[X]\) of degree n. Note that these polynomials have coefficients in \({\mathbb {F}}_{q^m}\), so elements in \({\mathbb {F}}_{q^{mn}}\) are seen as polynomials (that live in \({\mathbb {F}}_{q^m}[X]/\left\langle P \right\rangle\)) with polynomial coefficients (that live in \({\mathbb {F}}_{q}[X]/\left\langle P_0 \right\rangle\)).

It is also quite practical to use vectors and matrices to represent, and operate on, polynomials. For a field F, \({\mathcal {M}}_{n,m}(F)\) represents the set of matrices with n rows and m columns of elements in F. When n equals m, this set, together with the classical matrix sum and product, forms a ring that we denote \({\mathcal {M}}_n(F)\). Of course, we can map polynomials to vectors (of coefficients) and conversely, so we often consider an element of \({\mathbb {F}}_{q^m}\) as an element of the vector space \({\mathbb {F}}_{q}^m\), and an element of \({\mathbb {F}}_{q^{mn}}\) as an element of the vector space \({\mathbb {F}}_{q^m}^n\). For a vector \(\mathbf {v}\), we denote the associated polynomial by \(\mathbf {v}(X)\), and for a polynomial p, we denote the associated vector by \(\mathrm {vec}(p)\). When using a polynomial in a setting in which it is clear we have to use the vector representation (e.g., a matrix row, or a matrix/vector multiplication), we will not make the \(\mathrm {vec}\) transformation explicit.

Vector additions are naturally defined in \({\mathbb {F}}_{q}^m\) or \({\mathbb {F}}_{q^m}^n\) and correspond to polynomial additions over \({\mathbb {F}}_{q^m}\) and \({\mathbb {F}}_{q^{mn}}\). We define the product of two vectors \(\mathbf {u},\mathbf {v}\) by \(\mathbf {u}\mathbf {v} = \mathrm {vec}(\mathbf {u}(X) \mathbf {v}(X))\), and the inverse as \(\mathbf {u}^{-1} = \mathrm {vec}(\mathbf {u}^{-1}(X))\).

It is also possible to define vector multiplication directly over vector/matrices. To do this, we will first define ideal matrices. As we will only describe explicitly multiplications in \({\mathbb {F}}_{q^{mn}}\), we will focus our definition on this specific setting.

Definition 1

(Ideal Matrices). Let \(P\in {\mathbb {F}}_{q}[X]\) be a polynomial of degree n and \(\mathbf {v}\in {\mathbb {F}}_{q^m}^n\) the vector representation of an element of \({\mathbb {F}}_{q^{mn}}\). The ideal matrix generated by \(\mathbf {v}\) modulo P is the matrix denoted \(\mathcal {IM}_P(\mathbf {v})\in {\mathcal {M}}_n({\mathbb {F}}_{q^m})\) with n rows of the form \(X^{i} \mathbf {v}(X) \bmod P\), with \(i = 0, \ldots , n-1\).

The multiplication of two vectors \(\mathbf {u},\mathbf {v}\in {\mathbb {F}}_{q^{mn}}\) can then be computed as \(\mathbf {u}\mathbf {v} = \mathbf {u}\mathcal {IM}_P(\mathbf {v}) = (\mathcal {IM}_P(\mathbf {u})^T \mathbf {v}^T)^T = \mathbf {v}\mathbf {u}\). Note that this definition is compatible with the previous one, as we have \(\mathbf {u}\mathcal {IM}_P(\mathbf {v}) = \mathrm {vec}(\mathbf {u}(X)\mathbf {v}(X))\).
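To make the correspondence concrete, the following toy example checks that multiplying by the ideal matrix agrees with polynomial multiplication modulo P. The parameters (\(q=2\), \(m=1\), \(n=3\), \(P = X^3 + X + 1\)) are chosen for illustration only and are not ROLLO parameters:

```c
#include <assert.h>
#include <stdint.h>

/* Polynomials of degree < 3 over F_2 are stored as bitmasks: bit i is the
 * coefficient of X^i.  P(X) = X^3 + X + 1. */
#define N 3
#define P_LOW 0x3u                    /* X + 1: the part of P below X^N */

/* Multiply by X modulo P: shift, then fold the overflow bit back in
 * with an all-zeros/all-ones mask (branch-free). */
static uint8_t mulx_mod_p(uint8_t a) {
    uint8_t hi = (uint8_t)((a >> (N - 1)) & 1);  /* coefficient of X^(N-1) */
    uint8_t mask = (uint8_t)(0u - hi);           /* 0x00 or 0xFF */
    return (uint8_t)(((a << 1) & 0x7u) ^ (mask & P_LOW));
}

/* Rows of the ideal matrix IM_P(v): row i = X^i * v(X) mod P. */
static void ideal_matrix(uint8_t v, uint8_t rows[N]) {
    rows[0] = v;
    for (int i = 1; i < N; i++) rows[i] = mulx_mod_p(rows[i - 1]);
}

/* u * IM_P(v): XOR together the rows selected by the coefficients of u. */
static uint8_t mul_by_ideal_matrix(uint8_t u, uint8_t v) {
    uint8_t rows[N], r = 0;
    ideal_matrix(v, rows);
    for (int i = 0; i < N; i++)
        if ((u >> i) & 1) r ^= rows[i];
    return r;
}

/* Direct product u(X)v(X) mod P, for comparison (Horner-style). */
static uint8_t polymul_mod_p(uint8_t u, uint8_t v) {
    uint8_t r = 0;
    for (int i = N - 1; i >= 0; i--) {
        r = mulx_mod_p(r);
        if ((u >> i) & 1) r ^= v;
    }
    return r;
}
```

The same structure carries over to ROLLO's setting, where the matrix entries live in \({\mathbb {F}}_{q^m}\) instead of \({\mathbb {F}}_2\).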

Metric and Support

Let \({\mathbf {e}} = (e_1, \ldots , e_n)\) be an element of \({\mathbb {F}}_{q^m}^n\), and denote by \(e_{i,j}\) the j-th component of \(e_i\), where \(e_i\) is seen as an element of \({\mathbb {F}}_{q}^m\). The rank weight of \({\mathbf {e}}\), denoted \(\mathsf {w_R}({\mathbf {e}})\), is defined as \(\mathsf {w_R}({\mathbf {e}}) = \textsf {rank} \left( [ e_{i,j} ]_{i=1,\ldots ,n,\, j=1,\ldots ,m} \right)\). The rank distance between two vectors \(\mathbf {e}, \mathbf {f} \in {\mathbb {F}}_{q^m}^n\) is defined as \(||\mathbf {e-f}|| = \mathsf {w_R}(\mathbf {e - f})\).

For \({\mathbf {x}}=(x_1, \ldots , x_n)\in {\mathbb {F}}_{q^m}^n\), the support E of \({\mathbf {x}}\), denoted \(\mathsf {supp}({\mathbf {x}})\), is the \({\mathbb {F}}_q\)-subspace of \({\mathbb {F}}_{q^m}\) generated by the coordinates of \({\mathbf {x}}\): \(E=\langle x_1, \ldots , x_n\rangle _{{\mathbb {F}}_q}.\) Note that \(\mathsf {dim}(E) = \mathsf {w_R}({\mathbf {x}})\) and that any \({\mathbf {e}}\in E\) can be written as \({\mathbf {e}}=\sum _{i=1}^n \lambda _i x_i\) where \(\lambda _i\in {\mathbb {F}}_{q}\).

Codes

We define an \([n,k]_{q^m}\) code C over \({\mathbb {F}}_{q^m}\) as a vector subspace of \({\mathbb {F}}_{q^m}^n\) of dimension k, where n is called the length and k the dimension of the code. An element of a code C is called a codeword. A generator matrix for an \([n, k]_{q^m}\) code C is thus any \(k \times n\) matrix G whose rows form a basis for C. Note that the generator matrix of a code is not unique.

As a linear code is a vector subspace, it is the kernel of some linear transformation. In particular, there is an \((n - k) \times n\) matrix H, called a parity check matrix for the \([n,k]_{q^m}\) code C, that verifies \(C = \{x \in {\mathbb {F}}_{q^m}^n | Hx^T=0\}\). As for the generator matrix, the parity check matrix of a code C is not unique.

We now present the definition of the ideal Low Rank Parity Check (ideal LRPC) codes, the codes on which all the variants of ROLLO are based. Moreover, we introduce the underlying problems on which the security of the schemes relies. We first recall the definitions of ideal codes and LRPC codes.

Definition 2

(Ideal Codes). Let \(P \in {\mathbb {F}}_{q}[X]\) be a polynomial of degree n and \(\mathbf {h_1}, \mathbf {h_2} \in {\mathbb {F}}_{q^m}^n\). The \([2n, n]_{q^m}\) ideal code C defined by \((\mathbf {h_1},\mathbf {h_2})\) modulo P is the code with parity check matrix \(\left( \begin{array}{c|c} \mathcal {IM}_P(\mathbf {h_1})^T&\mathcal {IM}_P(\mathbf {h_2})^T \end{array} \right) .\)

If \(\mathbf {h_1}(X) = 1\) (and thus \(\mathcal {IM}_P(\mathbf {h_1}) = I_n\)), we say C is defined by \(\mathbf {h_2}\) modulo P. If \(\mathbf {h_1}(X)\) is invertible in \({\mathbb {F}}_{q^{mn}}\), the code C defined by \((\mathbf {h_1},\mathbf {h_2})\) modulo P is the same as the code defined by \(\mathbf {h_1}^{-1} \mathbf {h_2}\) modulo P.

Definition 3

(LRPC codes). Let \(H \in {\mathcal {M}}_{n-k,n}({\mathbb {F}}_{q^m})\) be a full rank matrix whose coefficients generate an \({\mathbb {F}}_q\)-subspace \(F = \langle h_{i,j}\rangle _{{\mathbb {F}}_q}\) of small dimension d. The \([n, k]_{q^m}\) code C with parity check matrix H is called an LRPC code of weight d.

A \([2n,n]_{q^m}\) Ideal Code defined by \((\mathbf {h_1},\mathbf {h_2})\) modulo a polynomial P can also be an LRPC code. Indeed, if \(\mathbf {h_1},\mathbf {h_2}\) are vectors in an \({\mathbb {F}}_{q}\)-subspace of small dimension and P has its coefficients in \({\mathbb {F}}_{q}\), this will be the case. Such a code is called an Ideal LRPC code.

Definition 4

(Ideal LRPC codes). Let F be an \({\mathbb {F}}_q\)-subspace of dimension d of \({\mathbb {F}}_{q^m}\), \(\mathbf {h_1}, \mathbf {h_2}\) two vectors of \({\mathbb {F}}_{q^m}^n\) with support in F, and \(P \in {\mathbb {F}}_q[X]\) a polynomial of degree n. The code C with parity check matrix \((\mathcal {IM}_P(\mathbf {h_1})^T | \mathcal {IM}_P(\mathbf {h_2})^T)\) is called a \([2n, n]_{q^m}\) ideal LRPC code.

The variant of ROLLO we focus on in this paper, ROLLO-I, has a security proof based on two problems. The first is a support recovery problem, which is proven in [2] to be equivalent to the rank-metric version of the Syndrome Decoding problem (RSD). The second is an indistinguishability problem.

Problem 1

(r-Ideal Rank Support Recovery). Given a polynomial \(P\in {\mathbb {F}}_{q}[X]\) of degree n, vectors \({\mathbf {h}}_1, \ldots , {\mathbf {h}}_{r-1} \in {\mathbb {F}}_{q^m}^n\), a syndrome \({\mathbf {s}}\), and a weight w, it is hard to find a support \(E=\langle {\mathbf {e}}_0,\ldots , {\mathbf {e}}_{r-1}\rangle\) of dimension lower than w such that \({\mathbf {e}}_0 +{\mathbf {e}}_1{\mathbf {h}}_1+\ldots +{\mathbf {e}}_{r-1}{\mathbf {h}}_{r-1}={\mathbf {s}} \mod P.\)

Problem 2

(Ideal LRPC codes indistinguishability). Given a polynomial \(P\in {\mathbb {F}}_{q}[X]\) of degree n and a vector \({\mathbf {h}}\in {\mathbb {F}}_{q^m}^n\), it is hard to distinguish whether the ideal code C with parity-check matrix generated by \({\mathbf {h}}\) and P is a random ideal code or if it is an ideal LRPC code of weight d.

In other words, it is hard to distinguish if \({\mathbf {h}}\) was sampled uniformly at random or as \({\mathbf {x}}^{-1}{\mathbf {y}}\mod P\) where the vectors \({\mathbf {x}}\) and \({\mathbf {y}}\) have the same support of small dimension d.

Description of the Scheme

As stated in the submission documentation, all ROLLO variants follow the approach inaugurated by the NTRU public key encryption scheme in 1998 [28]. As pointed out in the previous section, ROLLO is a variation of the LRPC rank metric approach, and its security is proven assuming that the Ideal LRPC indistinguishability and the 2-Ideal Rank Support Recovery [2, Theorem 4.2] problems are hard.

We now describe ROLLO-I in detail. The ROLLO-I Key-Encapsulation Mechanism (KEM) is a triple of probabilistic algorithms \((\mathsf {KeyGen}, \mathsf {Encaps}, \mathsf {Decaps})\).

\(\mathsf {KeyGen}\): randomly sample \(({\mathbf {x}},{\mathbf {y}})\) from a vector subspace F of \({\mathbb {F}}_{q^m}\) of dimension d, such that \(\mathsf {w_R}({\mathbf {x}}) = \mathsf {w_R}({\mathbf {y}}) = d\). Set \({\mathsf {pk}}= {\mathbf {h}}= {\mathbf {x}}^{-1}{\mathbf {y}}\mod P\) and \({\mathsf {sk}}= ({\mathbf {x}},{\mathbf {y}})\).

\(\mathsf {Encaps}\): randomly sample \(({\mathbf {e}}_1,{\mathbf {e}}_2)\) from a vector subspace E of \({\mathbb {F}}_{q^m}\) of dimension r, such that \(\mathsf {w_R}({\mathbf {e}}_1) = \mathsf {w_R}({\mathbf {e}}_2) = r\). Compute \({\mathbf {c}}= {\mathbf {e}}_1 + {\mathbf {e}}_2{\mathbf {h}}\mod P\) and \(K = G(E)\), where G is a hash function. Output \(({\mathbf {c}},K)\).

\(\mathsf {Decaps}\): compute \({\mathbf {s}}= {\mathbf {x}}{\mathbf {c}}= {\mathbf {x}}{\mathbf {e}}_1 + {\mathbf {y}}{\mathbf {e}}_2 \mod P\), and use the Rank Support Recovery (RSR) algorithm (algorithm 13) to recover E. The RSR algorithm takes as input \(F = \mathsf {Supp}({\mathbf {x}},{\mathbf {y}})\) and \({\mathbf {s}}\) (see “Rank Syndrome Recovery Algorithm and Decapsulation” for more detail). If the RSR algorithm succeeds, return \(K = G(E)\); otherwise return \(\perp\).

We refer to Table 1 for the actual set of ROLLO-I parameters. Note that the private key can be obtained from a seed; in the official NIST submission, the seed expander was initialized with 40-byte-long seeds.

Table 1 ROLLO-I parameters

As the last column of the table shows, the decapsulation algorithm has a non-zero failure probability. This probability is however well understood and made low enough to fit the NIST call for proposals (for more detail see Section 1.4.2 of [2]).

Proposed Algorithms

We redefined ROLLO starting from the following building blocks: the binary field arithmetic, corresponding to operations in \({\mathbb {F}}_{q^m}\); the vector space arithmetic, including the Gaussian reduction algorithm for binary matrices, the Zassenhaus algorithm for binary matrices, and the generation of elements of \({\mathbb {F}}_{q^m}[X] / P(X)\) of a given rank; the arithmetic in the composite Galois field \({\mathbb {F}}_{q^m}[X] / P(X)\), where P(X) is the irreducible polynomial given in the parameters; and the Rank Support Recovery (RSR) algorithm used in the decapsulation phase. The key generation, encapsulation, and decapsulation (or encryption and decryption) of all the variants of ROLLO are based only on the above blocks. Hence, we focused on optimizing every operation included in those layers, as well as ensuring that they are constant time.

Target We target processors with 64-bit carryless multiplications (2010 and onward for Intel) and provide a faster alternative if they also have AVX2 instructions (2013 and onward for Intel). The code examples assume GCC’s __uint128_t type is available and use GCC x86 intrinsics.

Notation Given two binary vectors \({\mathbf {x}},{\mathbf {y}}\), in what follows, we denote by \({\mathbf {x}}\oplus {\mathbf {y}}\) the bit-wise XOR of \({\mathbf {x}}\) and \({\mathbf {y}}\), and by \({\mathbf {x}}\otimes {\mathbf {y}}\) the bit-wise AND of \({\mathbf {x}}\) and \({\mathbf {y}}\). With \({\mathbf {x}}\ll h\) and \({\mathbf {x}}\gg h\) we indicate, respectively, the left and right shift of \({\mathbf {x}}\) by h positions.

Binary Field Arithmetic

In this section, we present the constant time vectorized operations we propose for \({\mathbb {F}}_{q^m}\). As shown in Table 1, all variants of ROLLO-I have \(q=2\) and different values for m. Our algorithms work for all the values of m submitted to the NIST competition, but have to be slightly adapted for each value. To avoid repetitions, we will focus on the field used by ROLLO-I-128, and note what changes need to be done to adapt the algorithms for other values of m.

We implemented finite field arithmetic for the binary field \({\mathbb {F}}_{2^{m}}\), with \(m=67\), representing elements as binary polynomials of degree \(m-1\) modulo an irreducible polynomial of degree m. We used the irreducible pentanomial \(P_0(X)=X^{67} + X^5 + X^2 + X + 1\) provided by the Allan Steel database incorporated in the Magma software [18] and also suggested by the authors of ROLLO. This pentanomial also has the lowest possible intermediate degree, allowing the shortest shifts during the reduction operations. No irreducible trinomial exists for \(m=67\).

To represent an element of the field, we use a 128-bit unsigned integer of type __uint128_t, sometimes cast to __m128i, with the unused bits set to zero. Addition and subtraction of two elements are a simple bit-wise XOR operation. The multiplication of two field elements is performed in two steps: a carryless multiplication of the two elements seen as polynomials (or a carryless squaring of a single element) and a polynomial reduction, all described below in this section. Inversion is performed using an addition chain, also described below. As noted before, all operations in the binary field layer are executed in constant time, assuming the intrinsics (and in particular the carryless multiplications) are constant time.

Carryless multiplication: plain C implementation The carryless multiplication has been implemented using recursive Karatsuba multiplication [31]. More specifically, we borrowed from NTL an implementation of a constant time carryless Karatsuba multiplication of two 64-bit registers (which we call ntlclmul64 in algorithm 14) using only bit manipulation, and then added an extra level of the Karatsuba method over this function. The full carryless multiplication \(\textsf {clmul}_K(a,b)\) is described in “Appendix A”, algorithm 14.
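For illustration, a minimal branch-free 64-bit carryless multiplication can be written with only shifts, masks, and XORs. Note that this is a simple sketch of the primitive, not the (much faster) bit-manipulation routine borrowed from NTL as ntlclmul64:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal constant time 64 x 64 -> 128 bit carryless multiplication.
 * Each bit of a selects, through an all-zeros/all-ones mask, whether a
 * shifted copy of b is XORed into the accumulator; there is no
 * data-dependent branch or memory access.  This sketches the primitive
 * only -- NTL's ntlclmul64 achieves the same result far more efficiently. */
static __uint128_t clmul64(uint64_t a, uint64_t b) {
    __uint128_t r = 0;
    __uint128_t bb = b;
    for (int i = 0; i < 64; i++) {
        /* mask is all-zeros or all-ones depending on bit i of a */
        __uint128_t mask = (__uint128_t)0 - ((a >> i) & 1);
        r ^= (bb << i) & mask;
    }
    return r;
}
```

An outer Karatsuba level, as described above, then combines three such 64-bit products to multiply full m-bit elements.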

In Table 2, we compare this implementation with ROLLO’s polynomial multiplications. The initial NTL-based ROLLO (submission date 2019/04/10) used NTL’s generic carryless multiplication function. Being generic, this function goes through a set of tests and function calls before reaching exactly the same code we used for ntlclmul64. The overhead (5 function calls, 6 if statements with two boolean tests for most of them, and a switch/case) is significant w.r.t. the final code of ntlclmul64 (78 instructions). As a result, by specializing the code, removing calls and conditional branches, and extracting only the instructions needed for ROLLO, we get a \(15\%\) speedup on polynomial multiplication with respect to NTL-ROLLO, which called the generic function. The Karatsuba function implemented in the NTL-free version of ROLLO (submission date 2019/08/24), called NoNTL-ROLLO in the table, is \(30\%\) slower than NTL’s generic function. It thus seems that, in general, implementations of Karatsuba using NTL may obtain a nice performance upgrade just by importing and adapting the specialized NTL code for this operation, as we did. We also note that the latest ROLLO implementation, dated 2020/04/21, is no longer NTL-dependent.

Table 2 Cycles per plain C carryless multiplication of polynomials of degree \(m=67\) (averaged over 4 s of execution on a MacBook Pro 2017 with a 2.9 GHz Quad-Core Intel Core i7, I7-7820HQ)

Carryless multiplication: AVX2 optimization When possible, the carryless multiplication step has been performed using Intel Advanced Vector Extensions 2 (AVX2) instructions [29]. In particular, the core of this function uses the _mm_clmulepi64_si128 instruction (see also [27]) to perform 64-bit by 64-bit binary polynomial multiplications.

The multiplication of two m-bit binary polynomials is performed in a schoolbook fashion, by dividing each input into two 64-bit registers (one containing only \(m-64\) bits) and then applying the function _mm_clmulepi64_si128, which acts on 64-bit registers, four times. The result is stored in a __m256i type (4 registers), but only the bits up to degree \(2m-2\) are used, while the remaining ones are set to zero. We refer to this algorithm as the \(\textsf {clmul}_S(a,b)\) algorithm, and we present our C implementation in “Appendix A”, algorithm 15. When irrelevant in the context, we will indicate with \(\textsf {clmul}(a,b)\) (with no subscript) the algorithm performing carryless multiplication, either with the Karatsuba method in plain C or with the schoolbook method and AVX instructions.

Let us remark that using Karatsuba multiplication [31] in this case would not give any advantage, as the costs of multiplication and addition with AVX2 instructions are very close. In practice, we show that it even performs worse, due to alignment problems.

In Table 3, we show that, when comparing the figures for NTL-ROLLO with the others, specializing the code for ROLLO’s setting has, unsurprisingly, an even greater impact on performance when using AVX2. The table also shows that alignment issues in Karatsuba have a very noticeable impact on performance, and highlights the fact that the ROLLO developers made the right choice in opting for schoolbook multiplication in the NTL-free version of ROLLO. Our implementation has a slight performance advantage.

Table 3 Average cycles per AVX2 carryless multiplication of polynomials of degree \(m=66\) (averaged over 4 s of execution on a MacBook Pro 2017 with a 2.9 GHz Quad-Core Intel Core i7, I7-7820HQ)

This difference is explained by the fact that the permutation done in our algorithm with _mm256_permute4x64_epi64 allows us to avoid the cost of the load and store instructions, which are present at the beginning and end of each recursive call in the NIST submitted code.

Carryless squaring For squaring, which will be used in the inversion algorithm, we can use the fact that this operation actually consists of interleaving zeros into the current representation of the polynomial. Indeed, for \(a \in {\mathbb {F}}_{2^m}\), \(a^2 = \left( \sum _{i = 0}^{m-1} a_i x^i \right) ^2 = \sum _{i = 0}^{m-1} a_i x^{2i}\). For example, if the current representation of a was 11100101, then \(\textsf {clsqr}(a)\) will be 0101010000010001. To perform this operation, we decided to use a small modification of the method Interleave bits with 64-bit multiply given by Sean Eron Anderson on his web page Bit Twiddling Hacks [21]. The pseudocode is given in “Appendix A”, algorithm 16.
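To illustrate the interleaving step, here is a multiplication-free variant (the shift-and-mask "binary magic numbers" technique from the same page, rather than the 64-bit-multiply method actually used): bit i of the input lands at bit 2i of the output, which over \({\mathbb {F}}_2[X]\) is exactly squaring before reduction:

```c
#include <assert.h>
#include <stdint.h>

/* Spread the 64 bits of v to the even positions of a 128-bit word
 * (bit i -> bit 2i) by doubling shift-and-mask steps.  Each step moves
 * half-blocks apart and masks away the duplicated copies. */
static __uint128_t interleave_zeros(uint64_t v) {
    const __uint128_t M32 = ((__uint128_t)0x00000000FFFFFFFFULL << 64) | 0x00000000FFFFFFFFULL;
    const __uint128_t M16 = ((__uint128_t)0x0000FFFF0000FFFFULL << 64) | 0x0000FFFF0000FFFFULL;
    const __uint128_t M8  = ((__uint128_t)0x00FF00FF00FF00FFULL << 64) | 0x00FF00FF00FF00FFULL;
    const __uint128_t M4  = ((__uint128_t)0x0F0F0F0F0F0F0F0FULL << 64) | 0x0F0F0F0F0F0F0F0FULL;
    const __uint128_t M2  = ((__uint128_t)0x3333333333333333ULL << 64) | 0x3333333333333333ULL;
    const __uint128_t M1  = ((__uint128_t)0x5555555555555555ULL << 64) | 0x5555555555555555ULL;
    __uint128_t x = v;
    x = (x | (x << 32)) & M32;
    x = (x | (x << 16)) & M16;
    x = (x | (x << 8))  & M8;
    x = (x | (x << 4))  & M4;
    x = (x | (x << 2))  & M2;
    x = (x | (x << 1))  & M1;
    return x;
}
```

The running example from the text, 11100101, maps to 0101010000010001 under this function.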

The squaring method is straightforward from there, and its pseudocode is given in “Appendix A”, algorithm 17. For the AVX2 version, a look-up table based on the instruction _mm_shuffle_epi8 is implemented in both the submission and our work. The AVX2 performances are reported in Table 4.

We would like to remark that, although simple and perhaps even trivial in retrospect, the mentioned approaches for squaring have been proposed before in the literature: [9] and [17] for the shuffle-based squaring, and [35] for the CLMUL-based squaring.

Table 4 Average cycles per carryless squaring of polynomials of degree \(m=66\) (averaged over 4 s of execution on a MacBook Pro 2017 with a 2.9 GHz Quad-Core Intel Core i7, I7-7820HQ)

Reduction The result of the carryless multiplication, of degree up to \(2m-2\), is reduced modulo \(P_0\) back to an m-bit field element, using standard techniques. The pseudocode of the reduction algorithm is presented in “Appendix A”, algorithm 18. The AVX2 performances of the reduction are reported in Table 7.
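A sketch of this reduction for \(P_0(X)=X^{67}+X^5+X^2+X+1\) follows. For brevity it takes a single __uint128_t, i.e., a polynomial of degree at most 127; the full implementation must also handle the top coefficients of a degree-\(2m-2\) product, which requires an extra limb:

```c
#include <assert.h>
#include <stdint.h>

#define M67 67

/* Reduce a binary polynomial of degree <= 127 modulo
 * P_0(X) = X^67 + X^5 + X^2 + X + 1, using the identity
 * X^67 = X^5 + X^2 + X + 1 (mod P_0) to fold the high part down. */
static __uint128_t gf67_reduce(__uint128_t v) {
    const __uint128_t MASK = ((__uint128_t)1 << M67) - 1;
    __uint128_t t = v >> M67;                       /* coefficients of X^67.. */
    v = (v & MASK) ^ t ^ (t << 1) ^ (t << 2) ^ (t << 5);
    /* a second fold is needed when the first one overflows bit 66
     * (e.g., for the wider inputs of a full product) */
    t = v >> M67;
    v = (v & MASK) ^ t ^ (t << 1) ^ (t << 2) ^ (t << 5);
    return v;
}
```

Both folds are unconditional, so the function runs in constant time regardless of the input value.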

Inversion The inversion of an element \(x \in {\mathbb {F}}_{2^{m}}\), described in “Appendix A”, algorithm 19, has been derived using Fermat’s little theorem, which gives \(x^{2^{m}-2} = x^{-1}\) for \(x \ne 0\). The fixed exponentiation is achieved by the strategy presented in [38, Section 6.2], using the following addition chain of length 9:

$$\begin{aligned} 1 \rightarrow 2 \rightarrow 4 \rightarrow 8 \rightarrow 16 \rightarrow 32 \rightarrow 33 \rightarrow 66 \rightarrow 67 \,. \end{aligned}$$

The AVX2 performances of the binary field inversion are reported in Table 7.
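One way to realize this chain is in Itoh–Tsujii style: maintain \(t_k = x^{2^k-1}\), combine \(t_{a+b} = (t_a)^{2^b}\, t_b\), and finish the last step \(66 \rightarrow 67\) with a single squaring, which yields \(x^{2^{67}-2} = x^{-1}\). Whether algorithm 19 schedules the steps exactly this way is our assumption; the sketch below checks only the exponent arithmetic of the chain:

```c
#include <assert.h>

typedef unsigned __int128 u128;

/* Exponent bookkeeping for the chain 1->2->4->8->16->32->33->66->67.
 * If t_a = x^ea and t_b = x^eb, then (t_a)^(2^b) * t_b = x^(ea * 2^b + eb). */
static u128 chain_step(u128 ea, unsigned b, u128 eb) { return (ea << b) + eb; }

/* Follow the chain symbolically and return the exponent it computes. */
static u128 inversion_exponent(void) {
    u128 e1  = 1;                        /* t_1  = x^(2^1 - 1) = x      */
    u128 e2  = chain_step(e1, 1, e1);    /* 1 -> 2:  t_2 = x^3          */
    u128 e4  = chain_step(e2, 2, e2);    /* 2 -> 4:  t_4 = x^15         */
    u128 e8  = chain_step(e4, 4, e4);
    u128 e16 = chain_step(e8, 8, e8);
    u128 e32 = chain_step(e16, 16, e16);
    u128 e33 = chain_step(e32, 1, e1);   /* 32 -> 33                    */
    u128 e66 = chain_step(e33, 33, e33);
    return e66 << 1;                     /* 66 -> 67: one final squaring */
}
```

Each chain step costs a fixed number of squarings plus one multiplication, so the resulting inversion is constant time by construction.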

Binary Vector Space Arithmetic

In this section, we describe the main algorithms used to manipulate vector spaces, i.e., Gaussian reduction, Zassenhaus algorithm, and the generation of vectors of given rank.

In our implementation, a binary matrix M of size \(l \times m\), usually indicated with uppercase letters, is an array of __uint128_t of length l, where each element of the array is a matrix row \(m_i\) of m bits. Similarly, a vector space, or the support of a set of vectors, is represented with uppercase letters and stored in arrays of __uint128_t.

Gaussian Elimination Algorithm

We introduce an original algorithm that performs constant time Gaussian elimination to convert any binary matrix to a (not necessarily reduced) row echelon form, together with an extension that converts it to reduced row echelon form. This algorithm is, in some sense, a generalization of the one presented in [14], where Gaussian elimination was used to convert a binary matrix to systematic form. In [14], if the matrix is not systematic, the algorithm breaks. Otherwise, for each column, the algorithm first sets to 1 the bit on the diagonal, by scanning the rows of the matrix from the current pivot down to the bottom of the matrix, then sets to 0 the bits in the current column, except the diagonal one, by scanning the full set of rows again. This is done in a constant time manner, due to the fact that, the matrix being systematic, the number of rows under the pivot is always the same for each column step. However, [14] does not define how one could force the algorithm to continue when it is not possible to fix a 1 on the diagonal, i.e., when the matrix is not systematic. We solve the problem by always scanning all rows for each column, and by keeping track of the current pivot position, which is not necessarily on the diagonal. Let \(\tilde{r}\) be the current pivot row position, i the currently scanned row, and j the currently scanned column. Then, we perform

$$\begin{aligned} m_{\tilde{r}}&= m_{\tilde{r}} \oplus \mathsf {mask1} \cdot \mathsf {mask2} \cdot \mathsf {mask3} \cdot m_{i} \\ m_{i}&= \mathsf {mask1} \cdot \mathsf {mask3} \cdot m_{\tilde{r}} \oplus m_{i} \end{aligned}$$

where \(\mathsf {mask3}\) is set to 1 if the current row is below the pivot row (\(i>\tilde{r}\)), \(\mathsf {mask2}\) is set to 1 if the pivot bit \(m_{\tilde{r},j}\) is 0, and \(\mathsf {mask1}\) is set to 1 if the bit \(m_{i,j}\), at the intersection of the currently scanned row and column, is 1. The steps above have the effect of leaving the rows unchanged either when the current row is not below the pivot row \(m_{\tilde{r}}\) or, otherwise, when the bit \(m_{i,j}\) is 0. On the other hand, when \(m_{i,j}\) is 1, if the pivot bit \(m_{\tilde{r},j}\) is 0, then the current row is swapped with the pivot row (the two XORs combined act as a swap), and if the pivot bit \(m_{\tilde{r},j}\) is 1, then the 1 in position (i, j) is flipped. Notice that, at the end of the algorithm, the pivot position is also the rank of the matrix. Compared to [14], for each scan of the full set of rows, we perform fewer XOR operations, but we need to compute more masks. We also have to scan all columns, while for the method from [14] it is sufficient to scan the minimum between the number of rows and the number of columns. This makes the method of [14] much faster for matrices with a small number of rows. We stress again that the method of [14] only computes the systematic form of a matrix, and for this reason it is, in general, faster.
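The column scan above can be sketched in C as follows (a hypothetical row_echelon_rank helper of our own, with uint64_t rows for compactness instead of the __uint128_t rows used in our implementation). Note that indexing the pivot row directly still leaks \(\tilde{r}\) on machines with caches, so a fully constant time version must mask those row accesses as well:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchless row-echelon reduction of an r x c binary matrix; one matrix
 * row is one uint64_t (bit j = column j).  All row updates are gated by
 * all-zeros/all-ones masks rather than branches.  Returns the rank. */
static size_t row_echelon_rank(uint64_t *m, size_t r, size_t c) {
    size_t piv = 0;                                  /* current pivot row position */
    for (size_t j = 0; j < c && piv < r; j++) {
        for (size_t i = 0; i < r; i++) {
            uint64_t m3 = (uint64_t)0 - (uint64_t)(i > piv);     /* row below pivot row */
            uint64_t m1 = (uint64_t)0 - ((m[i] >> j) & 1);       /* current row bit is 1 */
            uint64_t m2 = ~((uint64_t)0 - ((m[piv] >> j) & 1));  /* pivot bit is still 0 */
            m[piv] ^= (m1 & m2 & m3) & m[i];   /* first half of the conditional swap  */
            m[i]   ^= (m1 & m3) & m[piv];      /* swap completion, or bit elimination */
        }
        piv += (m[piv] >> j) & 1;  /* advance only if column j produced a pivot */
    }
    return piv;                    /* final pivot position = rank of the matrix */
}
```

When the pivot bit is 0 and the scanned bit is 1, the two gated XORs together swap the two rows; when the pivot bit is already 1, only the second XOR fires and clears the scanned bit.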

Our method can be easily extended to compute the reduced row echelon form, by storing the pivot positions and then scanning all the rows \(\mathsf {r}\) times, where \(\mathsf {r}\) is the number of rows, to remove the 1’s above the pivots.

The differences between our method and the one in [14] are summarized in Tables 5, 6.

Table 5 Comparison of our proposed Gaussian elimination algorithm and the one from Bernstein et al. [14], for a matrix with \(\mathsf {r}\) rows and \(\mathsf {c}\) columns
Table 6 Clock cycle comparison of our proposed Gaussian elimination algorithm and the one from Bernstein et al. [14], for a matrix with \(\mathsf {r}=10,20,30,100\) rows and \(\mathsf {c}=67\) columns

The pseudocode of the three algorithms can be found in algorithm 1 ([14]), algorithm 2, and algorithm 3, where M represents a binary matrix with \(\mathsf {r}\) rows and \(\mathsf {c}\) columns, \(m_i\) is the binary vector representing the i-th row of the matrix M, and \(m_{i,j}\) is the bit entry of the matrix M at position (i, j).

In our C implementation, we store one row \(\texttt {m[i]}\) of the binary matrix in a variable of type __uint128_t. We can perform Steps 3–4 of algorithm 1 in a constant number of operations as follows:
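For example, assuming Steps 3–4 perform the conditional row addition "if \(m_{j,j} = 0\) then \(m_j = m_j \oplus m_i\)", the secret-dependent branch can be replaced by a mask; the following sketch (with our naming) illustrates the idiom on __uint128_t rows:

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t;

/* Hypothetical rendering of a conditional row addition
 *   if m[j][j] == 0 then m[j] = m[j] XOR m[i]
 * The secret diagonal bit is expanded into an all-ones/all-zeros
 * mask, so no branch depends on matrix data. */
static inline void cond_row_add(uint128_t *mj, uint128_t mi, unsigned j)
{
    uint128_t bit  = (*mj >> j) & 1;  /* diagonal bit m_{j,j}          */
    uint128_t mask = bit - 1;         /* 0 -> all ones, 1 -> all zeros */
    *mj ^= mask & mi;                 /* row addition only if bit == 0 */
}
```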

Similarly, the other if statements of both algorithms can easily be executed in constant time.

Finally, note that algorithm 2 and algorithm 3 access \(m_{\tilde{r}}\). Using memory indices depending on \(\tilde{r}\) can leak information on \(\tilde{r}\) through timing attacks on machines with caches. To avoid this type of attack, one has to scan all the rows of the matrix and select the desired row using another mask.
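Such a masked access can be sketched as follows (our illustration, with uint64_t rows): every row is read, and the wanted one is kept via a comparison mask, so the sequence of memory accesses is independent of the secret index. In production code, the equality test itself would also be computed branch-free.

```c
#include <stdint.h>

/* Load row `secret_idx` without using it as a memory index:
 * touch every row and keep only the wanted one through a mask. */
static uint64_t ct_load_row(const uint64_t *m, int rows, int secret_idx)
{
    uint64_t out = 0;
    for (int i = 0; i < rows; i++) {
        /* all-ones when i == secret_idx, all-zeros otherwise */
        uint64_t eq = (uint64_t)0 - (uint64_t)(i == secret_idx);
        out |= eq & m[i];
    }
    return out;
}
```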

figure a
figure b
figure c

Zassenhaus Algorithm

The Zassenhaus algorithm is a method to compute a basis for the intersection and the sum of two vector subspaces U, V of a vector space W of dimension m. Let us consider two sets of generators of U and V, i.e., \(U=\langle u_0,\ldots , u_{l_1} \rangle\) and \(V=\langle v_0,\ldots , v_{l_2} \rangle\). The algorithm creates the block matrix (1) of size \((l_1+l_2)\times 2m\):

$$\begin{aligned}&\begin{bmatrix} u_{0,0} & \ldots & u_{0, m-1} & u_{0,0} & \ldots & u_{0, m-1} \\ \vdots & & \vdots & \vdots & & \vdots \\ u_{l_1, 0} & \ldots & u_{l_1, m-1} & u_{l_1, 0} & \ldots & u_{l_1, m-1} \\ v_{0,0} & \ldots & v_{0, m-1} & 0 & \ldots & 0 \\ \vdots & & \vdots & \vdots & & \vdots \\ v_{l_2, 0} & \ldots & v_{l_2, m-1} & 0 & \ldots & 0 \\ \end{bmatrix} \end{aligned}$$
(1)
$$\begin{aligned}&\begin{bmatrix} a_{0,0} & \ldots & a_{0, m-1} & \star & \ldots & \star \\ \vdots & & \vdots & \vdots & & \vdots \\ a_{l_3, 0} & \ldots & a_{l_3, m-1} & \star & \ldots & \star \\ 0 & \ldots & 0 & b_{0,0} & \ldots & b_{0, m-1} \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \ldots & 0 & b_{l_4,0} & \ldots & b_{l_4, m-1} \\ 0 & \ldots & 0 & 0 & \ldots & 0 \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \ldots & 0 & 0 & \ldots & 0 \\ \end{bmatrix} \end{aligned}$$
(2)

After applying Gaussian elimination, the matrix has the row echelon form (2). In (2), \(\star\) stands for arbitrary entries, \((a_0,\ldots , a_{l_3})\) is a basis of \(V+U\), and \((b_0,\ldots , b_{l_4})\) is a basis of \(V\cap U\). The pseudocode can be found in algorithm 4.
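To make the block-matrix mechanics concrete, the following C illustration (ours, deliberately not constant time) runs the procedure over \(\mathbb{F}_2\) with each generator stored as an m-bit mask, \(m \le 32\), and at most 64 generators in total:

```c
#include <stdint.h>

/* Zassenhaus over F_2, for illustration only (not constant time).
 * u[0..nu-1], v[0..nv-1]: generators as m-bit masks, m <= 32, nu+nv <= 64.
 * On return, sum[] holds a basis of U+V (dimension in *dsum) and
 * inter[] a basis of the intersection (dimension in *dinter). */
static void zassenhaus2(const uint32_t *u, int nu, const uint32_t *v, int nv,
                        int m, uint32_t *sum, int *dsum,
                        uint32_t *inter, int *dinter)
{
    uint64_t rows[64];
    int n = nu + nv, rank = 0;
    for (int i = 0; i < nu; i++)            /* block [u | u] */
        rows[i] = (uint64_t)u[i] | ((uint64_t)u[i] << m);
    for (int i = 0; i < nv; i++)            /* block [v | 0] */
        rows[nu + i] = (uint64_t)v[i];
    /* plain Gaussian elimination over columns 0 .. 2m-1 */
    for (int j = 0; j < 2 * m && rank < n; j++) {
        int piv = -1;
        for (int i = rank; i < n; i++)
            if ((rows[i] >> j) & 1) { piv = i; break; }
        if (piv < 0) continue;
        uint64_t t = rows[rank]; rows[rank] = rows[piv]; rows[piv] = t;
        for (int i = 0; i < n; i++)
            if (i != rank && ((rows[i] >> j) & 1)) rows[i] ^= rows[rank];
        rank++;
    }
    /* rows with a nonzero left block give U+V; the rest give U int V */
    *dsum = *dinter = 0;
    uint64_t low = ((uint64_t)1 << m) - 1;
    for (int i = 0; i < rank; i++) {
        if (rows[i] & low) sum[(*dsum)++] = (uint32_t)(rows[i] & low);
        else               inter[(*dinter)++] = (uint32_t)(rows[i] >> m);
    }
}
```

The constant time version used in our implementation replaces this elimination with the mask-based one from the previous section and always processes inputs of fixed size.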

figure d

Generation of Vectors of Given Rank

The generation of a vector \({\mathbf {e}}\in {\mathbb {F}}_{q^m}^n\) of a given rank, say r, is probably the most delicate part of the key generation and encapsulation routines. We are not aware of any constant time algorithm performing this task. In this section, we analyze the two non-constant time strategies adopted in the current NIST submission (dated 2020/04/21) and in [5, Sect. 5.2]. Then we derive a constant time version of the latter, with a probability of failure that can be made as small as desired, at the cost of increasing the complexity of the algorithm. Lastly, we also propose an alternative method that, while generating a vector of a given rank, also constructs the full support of the vector. This last method could turn out to be useful when a user can store a larger public key in memory, so as to avoid reconstructing the support from its basis during the encapsulation phase.

The strategies from [5] and from the NIST submission are based on the same idea: generating r random elements of \({\mathbb {F}}_{q^m}\) until they are linearly independent, and then generating random linear combinations of those elements. In the NIST submission, the r elements are randomly inserted among the error components, thus guaranteeing that the error will have rank r. The remaining \(n-r\) positions are filled with random linear combinations of the basis elements. This algorithm is detailed in algorithm 5. On the other hand, in [5], the components of the error are all filled with random linear combinations of the basis elements, until the error has rank r. This algorithm is detailed in algorithm 6. It is clear that both strategies are not constant time. Notice also that, in [19], the authors describe how the NIST submission implementation leaks the memory access pattern.

Both approaches can be made constant time by removing the repeat and while loops, and iterating the algorithm enough times that the probability of generating a vector of the wrong rank becomes negligible. Our proposed constant time solution is based on this idea. Precisely, we first sample r elements of \({\mathbb {F}}_{q^m}\) uniformly at random. In Proposition 1, we derive the probability for those elements to be linearly independent over \({\mathbb {F}}_2\). Second, we generate the components of the vector using masked linear combinations of the basis. Note that this algorithm also has the advantage of hiding the memory access pattern. We show that it is sufficient to run the full procedure once to reach a probability of failure of \(2^{-60}\), which is already far smaller than the ROLLO Decryption Failure Rate. If this is still a concern (for example when adapting this work to ROLLO-II), repeating the procedure twice leads to a probability of failure of \(2^{-120}\), and so on. The full algorithm is described in algorithm 7. Note that this is the algorithm used in our implementation.
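The core of the masked linear combination can be sketched as follows (our simplification: basis elements fit in uint64_t and coefficients are single bits, whereas the implementation works on 128-bit words). Every coordinate touches every basis element, keeping or dropping it with an all-ones/all-zeros mask, so the access pattern does not depend on the random draw:

```c
#include <stdint.h>

/* One coordinate e_i = sum_j rnd_j * f_j, computed branch-free:
 * bit j of rnd selects whether basis element f[j] contributes. */
static uint64_t masked_combination(const uint64_t *f, int r, uint64_t rnd)
{
    uint64_t e = 0;
    for (int j = 0; j < r; j++) {
        uint64_t mask = (uint64_t)0 - ((rnd >> j) & 1); /* 0 or all ones */
        e ^= mask & f[j];
    }
    return e;
}
```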

Proposition 1

The probability that r randomly sampled elements of \({\mathbb {F}}_{q^m}\) are linearly independent over \({\mathbb {F}}_q\) (i.e., have rank r) is \(p = (1-q^{-m})\cdot (1-q^{1-m})\cdots (1-q^{r-1-m}).\)

Proof

The first element \(e_1 \in {\mathbb {F}}_{q^m}\) is linearly independent if and only if it is different from zero. Since \({\mathbb {F}}_{q^m}\) has \(q^m\) elements, we have \(\Pr (e_1=0)=1/q^m\). Then \(e_1,e_2\) are linearly dependent if and only if \(e_2=k e_1\) for some \(k\in {\mathbb {F}}_{q}\), so \(\Pr (e_2=k e_1)=q/q^m\). We can continue this way until the last vector, for which \(e_r\) is a linear combination of the previous ones if and only if \(e_r=\sum _{i=1}^{r-1} k_i e_i\) with \(k_i\in {\mathbb {F}}_{q}\), so \(\Pr (e_r=\sum _{i=1}^{r-1} k_i e_i)=q^{r-1}/q^m\). \(\square\)

For the ROLLO-I-128 parameters, where \(r=7\) and \(m=67\), the probability that the r random vectors are linearly dependent is approximately \(2^{-60}\). Computing the probability that a random support has the required dimension is only the first step of the evaluation of the failure probability of our algorithm. Assuming that a random support F of dimension r is available, we now have to compute the probability for a vector \({\mathbf {e}}\in F^n\) to be of rank strictly less than r. Let \(f_1, \ldots , f_r\) be a basis of F. The components \(e_1, \ldots , e_n\) can be written in coordinates with respect to \(f_1, \ldots , f_r\): \(e_i = \sum _{j=1}^{r} (e_i)_j f_j\), where \((e_i)_j \in {\mathbb {F}}_q\). Let M be the \(r \times n\) matrix over \({\mathbb {F}}_q\) such that \(M_{j, i} = (e_i)_j\). Then \(\mathsf {w_R}({\mathbf {e}}) < r\) is equivalent to the matrix M having rank \(< r\). Since the entries of M are sampled randomly, this probability can be approximated by \(q^{-(1+n-r)}\).

For the ROLLO-I-128 parameters, where \(r=7\) and \(n=83\), the probability to obtain a vector of rank less than r is \(2^{-77}\); hence, the probability that this process generates an error of rank less than r is \(2^{-60} + 2^{-77}\), which can be approximated by \(2^{-60}\).
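As a numeric sanity check (ours) of the first figure, the dependence probability is well approximated by the union bound \(\sum _{i=0}^{r-1} 2^{i-m} = (2^r - 1)\,2^{-m}\), which the following snippet evaluates without any math-library calls:

```c
/* First-order approximation of the probability that r random elements
 * of F_{2^m} are linearly dependent: sum_{i=0}^{r-1} 2^{i-m}.
 * For r = 7, m = 67 this is 127 * 2^-67, i.e., about 2^-60. */
static double dep_prob_approx(int r, int m)
{
    double t = 1.0;
    for (int i = 0; i < m; i++) t *= 0.5;      /* t = 2^{-m} */
    double p = 0.0;
    for (int i = 0; i < r; i++) { p += t; t *= 2.0; }
    return p;
}
```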

Now, one might generate multiple samples: if the cycle is repeated h times, the probability of failure becomes \(2^{-120}\) for \(h=2\), \(2^{-180}\) for \(h=3\), and so on. To make this approach constant time, one can repeat the sampling as many times as needed to reach the desired probability, each time computing the rank of the vector with the constant time Gaussian elimination algorithm proposed in “Gaussian elimination algorithm”, and store the sampled vector space when it has the desired rank.

figure e
figure f
figure g

Now, we describe how to generate the entire support of the vector rather than just a basis. This approach takes advantage of the fact that r is usually small (at most 9 for ROLLO-I). We start by initializing a list with the zero vector and a random vector. We then generate a second random vector and check whether it is already in the list. If so, we discard it and generate another one; otherwise, we add to the list its sum with every vector already in the list. We end up generating a vector subspace F of \({\mathbb {F}}_{q^m}\) of dimension r. One can then draw the coordinates of \({\mathbf {e}}\) randomly from this list. The only caveat of this method is that the vector \({\mathbf {e}}\) can be of rank less than r, as its coordinates could lie in a proper subspace of F. We therefore have to check the rank of \({\mathbf {e}}\) before outputting the result, or run the algorithm twice to reach a probability of failure of \(2^{-120}\) (as proved above). We also notice that an implementation of such a method needs to take care of hiding the memory access pattern when randomly drawing the elements from the vector space. The method is detailed in its non-constant time version in algorithm 8, and in its constant time version in algorithm 9. Note that the mask operation in line 9 of algorithm 9 should be done using an AND mask rather than a multiplication.
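The list-growing step can be illustrated as follows (a non-constant-time sketch, ours; candidates are drawn from a caller-supplied stream instead of a random number generator so that the example is deterministic). Since the list is always a subspace, membership in the list coincides with membership in the span:

```c
#include <stdint.h>
#include <stddef.h>

static int in_list(const uint64_t *list, size_t len, uint64_t x)
{
    for (size_t i = 0; i < len; i++)
        if (list[i] == x) return 1;
    return 0;
}

/* Grow the full support of dimension r: each new independent element
 * doubles the list, so `list` needs room for 2^r entries.  Returns the
 * final length (2^r on success, less if the stream runs out). */
static size_t gen_support(uint64_t *list, int r,
                          const uint64_t *stream, size_t nstream)
{
    size_t len = 1, pos = 0;
    list[0] = 0;                                 /* the zero vector */
    while (len < ((size_t)1 << r) && pos < nstream) {
        uint64_t cand = stream[pos++];
        if (in_list(list, len, cand)) continue;  /* dependent: discard */
        for (size_t i = 0; i < len; i++)         /* close under addition */
            list[len + i] = list[i] ^ cand;
        len *= 2;
    }
    return len;
}
```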

figure h

Composite Galois Field Arithmetic

An element in the composite Galois field \({\mathbb {F}}_{({2^m})^n}\) can be represented as a polynomial \({\mathbf {a}}(x)=a_0 + a_1 x + \ldots + a_{n-1} x^{n-1}\) in \({\mathbb {F}}_{2^{m}}[x]/ P(x)\), with \(P(x) \in {\mathbb {F}}_2[x]\) irreducible of degree n, or, equivalently, as an array \({\mathbf {a}}=(a_0, a_1, \ldots , a_{n-1})\) of length n of elements in \({\mathbb {F}}_{2^{m}}\). In our implementation, an element of \({\mathbb {F}}_{{(2^m)}^n}\) is an array of __uint128_t of length n, and we usually refer to it in the pseudocode with bold lowercase letters.

Matrix Multiplication with Lazy Reduction

The multiplication \({\mathbf {a}}\times {\mathbf {b}}\) in \({\mathbb {F}}_{({2^m})^n}\), algorithm 10, is performed as the following vector-by-matrix multiplication

$$\begin{aligned} (a_0, a_1, \ldots , a_{n-1}) \times \begin{bmatrix} {{\hat{b}}}_{0,0} & \cdots & {{\hat{b}}}_{0,n-1} \\ {{\hat{b}}}_{1,0} & \cdots & {{\hat{b}}}_{1,n-1} \\ \vdots & \ddots & \vdots \\ {{\hat{b}}}_{n-1,0} & \cdots & {{\hat{b}}}_{n-1,n-1} \end{bmatrix}, \end{aligned}$$

where \(( {{\hat{b}}}_{i,0}, \cdots , {{\hat{b}}}_{i,n-1})\) are the coefficients of \({\mathbf {b}}(x)\cdot x^i \mod P(x)\).

In ROLLO-I-128, we have \(n=83\), so \((b_{i,0} + b_{i,1} x + \ldots + b_{i,82} x^{82}) \cdot x \mod P(x) = b_{i,82} + b_{i,0} x + (b_{i,1} + b_{i,82}) x^2 + b_{i,2} x^3 + (b_{i,3} + b_{i,82}) x^4 + b_{i,4} x^5 + b_{i,5} x^6 + (b_{i,6}+ b_{i,82}) x^7 + \ldots + b_{i,81} x^{82}\), since \(x^{83} = x^7 + x^4 + x^2 + 1\).
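In other words, the recurrence from one row \(({{\hat{b}}}_{i,0}, \ldots , {{\hat{b}}}_{i,n-1})\) to the next amounts to a shift of the coefficient array plus three corrections. A sketch (ours, with coefficients held in toy 64-bit words instead of 128-bit lanes):

```c
#include <stdint.h>

#define N 83  /* ROLLO-I-128, with x^83 = x^7 + x^4 + x^2 + 1 mod P(x) */

/* In-place step b(x)*x^i -> b(x)*x^{i+1} mod P(x): shift the
 * coefficient array up by one and fold the top coefficient back
 * into positions 0, 2, 4, and 7. */
static void mulx_mod_P(uint64_t b[N])
{
    uint64_t top = b[N - 1];
    for (int k = N - 1; k > 0; k--)
        b[k] = b[k - 1];          /* multiply by x */
    b[0] = top;                   /* x^83 -> 1     */
    b[2] ^= top;                  /* x^83 -> x^2   */
    b[4] ^= top;                  /* x^83 -> x^4   */
    b[7] ^= top;                  /* x^83 -> x^7   */
}
```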

This allows us to reduce the number of reductions in \({\mathbb {F}}_{2^{m}}\): when we compute the field element \((a_0, a_1, \ldots , a_{n-1}) \times ( {{\hat{b}}}_{i,0}, \cdots , {{\hat{b}}}_{i,n-1}) = \sum _{j=0}^{n-1} a_j {{\hat{b}}}_{i,j}\), each product \(a_j {{\hat{b}}}_{i,j}\) can be computed using the carryless multiplication algorithm \(\textsf {clmul}\), and the reduction \(\textsf {red}_{{\mathbb {F}}_{2^{67}}}\) is applied only once, at the end of the summation. The pseudo-code of the algorithm is presented in algorithm 10.
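The pattern can be sketched as follows (ours: a portable shift-and-xor stand-in for the clmul instruction, elements restricted to fewer than 64 bits, and a generic modulus passed as the low part \(f(x) - x^m\); the actual implementation uses the hardware carryless multiplier and \(m = 67\) on 128-bit words):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Portable stand-in for clmul: carryless product of two 64-bit
 * binary polynomials, 128-bit unreduced result. */
static u128 clmul64(uint64_t a, uint64_t b)
{
    u128 r = 0;
    for (int i = 0; i < 64; i++)
        r ^= ((u128)(a & ((uint64_t)0 - ((b >> i) & 1)))) << i;
    return r;
}

/* Lazy-reduction inner product sum_j a[j]*b[j] over F_2[x]/f(x),
 * deg f = m < 64, f_low = f(x) - x^m.  Products are accumulated
 * unreduced; the reduction runs once, after the loop.  (The final
 * fold is written with a branch for readability; real code would
 * use masks.) */
static uint64_t lazy_inner_product(const uint64_t *a, const uint64_t *b,
                                   int n, int m, uint64_t f_low)
{
    u128 acc = 0;
    for (int j = 0; j < n; j++)
        acc ^= clmul64(a[j], b[j]);        /* no reduction inside the loop */
    for (int k = 126; k >= m; k--)         /* fold x^k = f_low(x) * x^{k-m} */
        if ((acc >> k) & 1)
            acc ^= ((u128)1 << k) ^ ((u128)f_low << (k - m));
    return (uint64_t)acc;
}
```

Deferring the reduction is sound because XOR-accumulating the 128-bit products commutes with the final modular fold.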

figure i

The AVX2 performances of the polynomial multiplication are reported in Table 7.

Polynomial Inversion

For the inversion in the composite Galois field \({\mathbb {F}}_{(2^{m})^{n}} \cong {\mathbb {F}}_{2^{m}}[x]/ P(x)\), we use the technique presented in [26] in 1998, which improves the Itoh-Tsujii algorithm with pre-computed powers [30]. The idea is to compute \({\mathbf {a}}^{-1} = ({\mathbf {a}}^r)^{-1}{\mathbf {a}}^{r-1}, {\mathbf {a}}\in {\mathbb {F}}_{(2^m)^n}, {\mathbf {a}}\ne 0\), where \(r = (2^{mn} - 1)/(2^m-1)\). It is easy to prove that \({\mathbf {a}}^r \in {\mathbb {F}}_{2^m}\) as \(({\mathbf {a}}^r)^{2^m} = ({\mathbf {a}}^{1 + 2^m + 2^{2m} + \ldots + 2^{(n-1)m}})^{2^m} = {\mathbf {a}}^{1 + 2^m + 2^{2m} + \ldots + 2^{(n-1)m}} = {\mathbf {a}}^r\). This reduces inversion in the Galois field \({\mathbb {F}}_{({2^m})^n}\) to one inversion in the ground field \({\mathbb {F}}_{2^m}\), the computation of \({\mathbf {a}}^{r - 1}\) and n multiplications in \({\mathbb {F}}_{2^{m}}\).

To compute \({\mathbf {a}}^{2^m}\), one can notice that \({\mathbf {a}}^{2^m} = \left( \sum _{i = 0}^{n-1} a_i x^i \right) ^{2^m} \mod P = \sum _{i = 0}^{n-1} a_i x^{i 2^m} \mod P\), since \(a_i^{2^m} = a_i\) for all \(a_i \in {\mathbb {F}}_{2^m}\). It is then sufficient to pre-compute the values \(s_i = x^{i 2^m} \mod P\), for \(i = 0, \ldots , n - 1\). The computation of \({\mathbf {a}}^{2^m}\) can therefore be seen as a matrix multiplication as follows:

$$\begin{aligned} S \cdot {\mathbf {a}}^T = \left( \begin{matrix} 1 & s_{1,0} & s_{2,0} & \ldots & s_{n-1,0} \\ 0 & s_{1,1} & s_{2,1} & \ldots & s_{n-1,1} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & s_{1,n-1} & s_{2,n-1} & \ldots & s_{n-1,n-1} \\ \end{matrix} \right) \times \left( \begin{matrix} a_0 \\ a_1\\ \vdots \\ a_{n-1}\\ \end{matrix} \right) \end{aligned}$$

In addition, if P has only binary coefficients (which is the case for all variants of ROLLO), the pre-computed values also have binary coefficients, meaning that the previous matrix multiplication can be performed using only XORs. The last step is to remark that \({\mathbf {a}}^{2^{km}} = S^k \cdot {\mathbf {a}}^T\); we end up with an algorithm performing n polynomial multiplications and binary matrix multiplications, plus one inversion in \({\mathbb {F}}_{2^m}\) followed by n multiplications in \({\mathbb {F}}_{2^m}\).

In algorithm 12 we summarize how the inversion is performed. It uses algorithm 11 to compute \({\mathbf {a}}^{2^{km}}\). The matrix S in algorithm 11 is a pre-computed matrix depending only on P and n.

Notice that both algorithm 11 and algorithm 12 can be coded so that they execute a constant number of operations. In particular, Steps 4–5 of algorithm 11 can be performed in a constant time fashion by using a mask, as follows: compute \(\texttt {mask} = 0 - S_{i,j}\), so that \(\texttt {mask}\) is 0 if \(S_{i,j} = 0\) and a binary vector of all 1’s otherwise; then compute \(t = a_j \otimes \texttt {mask}\) and finally \(b_i = b_i \oplus t\).
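In code, the whole binary-matrix-times-vector loop then becomes a masked accumulation. A sketch (ours, with the binary matrix stored row-major as bytes and field elements in 128-bit words):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* b = S * a over F_{2^m}: S is an n x n binary matrix (entries 0/1,
 * row-major), a holds field elements in 128-bit words.  Each entry
 * S[i][j] is expanded into an all-ones/all-zeros mask, so the same
 * sequence of operations runs whatever the matrix contains. */
static void binmat_vec_mul(u128 *b, const uint8_t *S, const u128 *a, int n)
{
    for (int i = 0; i < n; i++) {
        u128 acc = 0;
        for (int j = 0; j < n; j++) {
            u128 mask = (u128)0 - (u128)S[i * n + j];  /* 0 or all ones */
            acc ^= mask & a[j];
        }
        b[i] = acc;
    }
}
```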

figure j
figure k

It is also possible to pre-compute all the matrices \(S, S^2, S^3, \ldots , S^{n-1}\) to avoid the steps 2 and 3 of algorithm 11. This, for example, results in 70.6 KB of pre-computed matrices for ROLLO-I-128, and a speed improvement of about \(17\%\).

As an alternative method to compute the inverse of a polynomial, one might consider a constant time variant of Euclid’s algorithm, such as the one proposed in [16]. However, this type of algorithm is usually more attractive for generic moduli, where the modular reductions in Fermat’s method are considerably more expensive. After a comparison between a Sagemath [40] implementation of the method described above and the script recipx provided in [16], which came out in favor of the former, we decided to discard this option.

The AVX2 performances of the inversion are reported in Table 7.

Table 7 Average cycles per operation for the main algorithms presented in this work

Rank Syndrome Recovery Algorithm and Decapsulation

In this section, we describe the core of the decapsulation phase: the Rank Support Recovery (RSR) algorithm, which was introduced in [24] and made constant time in [8].

Let E, F be two \({\mathbb {F}}_{q}\)-subspaces of \({\mathbb {F}}_{q^m}\), and let \((e_1, \ldots , e_r)\) be a basis of E and \((f_1, \ldots , f_d)\) be a basis of F, so that \(\mathsf {dim}(E)=r\) and \(\mathsf {dim}(F)=d\). We denote by EF the subspace generated by the products of the elements of E and F, i.e., \(EF = \langle \{e f \, | \, e\in E \text{ and } f\in F\} \rangle .\) Note that \((e_if_j)_{1\le i\le r, 1\le j\le d}\) is a generating family of EF. Thus, \(\mathsf {dim}(EF)\le rd\), and equality holds with overwhelming probability [2]. For that reason, we assume that \(\mathsf {dim}(EF) = rd\).

Let C be an LRPC code with parity check matrix \(H \in {\mathbb {F}}_{q^m}^{n \times 2n}\), and let \({\mathbf {s}}=(s_1, \ldots , s_n)\) be the syndrome of the error vector \({\mathbf {e}}=(e_1,\ldots , e_{2n})\), that is, \(H{\mathbf {e}}^T={\mathbf {s}}^T\). Let E be the support of \({\mathbf {e}}\) and S be the support of \({\mathbf {s}}\). Since S is a subspace of EF, its dimension is at most rd. Finally, we denote \(B_i= f_i^{-1} S\).

The RSR algorithm (algorithm 13) takes as input a basis of the vector space F, the syndrome \({\mathbf {s}}\), and the dimension r of E; its output is (with high probability) E, i.e., the support of the error \({\mathbf {e}}\) (see [8] for more details).

Let us explain how the algorithm recovers the support E of the error vector \({\mathbf {e}}\). Since the coordinates of the syndrome can be seen as elements of EF, the idea is to compute the support of the error as \(E = B_1 \cap B_2\cap \ldots \cap B_d, \text{ where } B_i= f_i^{-1} S.\) In fact, \(B_i = \langle f_i^{-1}f_1e_1, f_i^{-1}f_2e_1, \ldots , f_i^{-1}f_d e_r\rangle =\langle e_1,\ldots , e_r, \{f_i^{-1}f_je_t\}_{1\le j\le d, j\ne i, 1\le t\le r }\rangle \,.\) Note that this method fails to recover E when the syndrome space S is different from EF, or when the intersection contains other elements besides the \(e_j\)’s [2].

In algorithm 13, we use capital letters both for the outputs of the Zassenhaus algorithm (“Zassenhaus Algorithm” section) and for matrices with elements in \({\mathbb {F}}_{q^m}\). In the latter case, we denote by \(J^{\{i\}}\) the i-th row of the matrix J. We also indicate by \(T, \_= \textsf {zassenhaus}(B_i, B_j)\) the first element of the Zassenhaus algorithm output, i.e., \(B_i + B_j\), and by \(\_ \, , T = \textsf {zassenhaus}(B_i, B_j)\) the second element of the output, that is, \(B_i \cap B_j\). With T we indicate a temporary value; the i-th element of T is denoted by \(t_i\).

There are three conditions that need to be fulfilled for this algorithm to run in constant time: (1) the size of the inputs to the Zassenhaus algorithm has to be constant; here we always input a basis of length rd for both vector spaces; (2) for inputs of the same size, the Zassenhaus algorithm needs to run in constant time, which was taken care of in “Zassenhaus Algorithm”; (3) operations involving elements of \({\mathbb {F}}_{q^m}\) (addition, multiplication, etc.) need to run in constant time, which was taken care of in “Composite Galois Field Arithmetic”.

Notice that Step 3 of the algorithm would also work if, instead of the reduced row echelon form of the basis, one computed the entire vector space E and then sorted it with respect to some fixed order. For this particular choice of parameters, this second option is slower. It could become more efficient for a much larger m and a smaller basis.

figure l

Performance

We benchmark our implementation of ROLLO-I-128 on a 2017 MacBook Pro equipped with a 2.9 GHz Intel Core i7 (I7-7820HQ). To measure the performance of the single operations presented in this work, we use our own testing platform; the results are reported in Table 7.

We use SUPERCOP version 20200618 [15], with Intel Hyper-Threading and Turbo Boost disabled, to compare our implementation with other existing KEMs. In the key generation and encapsulation functions, we use the random number generator randombytes() provided by SUPERCOP. Note that our implementation uses a stand-alone implementation of SHA256, but, for a fair comparison, we switched to OpenSSL’s SHA256 implementation, which is also used in the implementation of ROLLO-I. All primitives are compiled using clang with the parameters -march=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -Wl,-no_pie. For the non-vectorized implementations, we disable the flag -march=native.

According to our profiler, about 85% of the key generation is taken by the polynomial inversion. In the encapsulation, 5% of the time is occupied by the polynomial multiplication, while 91% is spent generating a basis and two polynomials whose lists of coefficients have given rank r. About 70% of this last step (63% of the full encapsulation time) is taken by computing the rank of the list, to make sure it has the proper rank, while about 15% is taken by the randombytes() calls.

About 75% of the decapsulation is taken by the Gaussian elimination step in the Zassenhaus algorithm. In the official ROLLO specification [2], the following numbers (in thousands) of clock cycles are reported for, respectively, key generation, encapsulation, and decapsulation: 3537, 395, 1754. Our loss in the key generation is explained by the fact that ROLLO’s team used a non-constant-time GCD algorithm for the polynomial inversion. Our loss in the encapsulation is explained by the fact that ROLLO’s team used a non-constant-time generation of vectors with given rank; in particular, they did not have to check the rank of \({\mathbf {e}}_1\) and \({\mathbf {e}}_2\) twice. The non-constant-time implementation of Gaussian elimination also explains the difference in the decapsulation step.

In Table 8, we report the performance results of our implementation of ROLLO-I-128 with one iteration of the generation of vectors of given rank (CT_rollo_fast) and with two iterations (CT_rollo_secure). We also report the performances of the other Category 1 KEMs available in SUPERCOP.

Table 8 The number of cycles to perform key generation, encapsulation, and decapsulation of other KEMs available in SUPERCOP with 128-bit security

Conclusion

In this work, we have presented several algorithms which shed some light on the potential performance of a fully optimized constant time implementation of ROLLO-I-128. They highlight that this proposal can be quite interesting from a computational point of view, both with and without AVX2. Future work will consist of porting these algorithms to the other variants of ROLLO, as well as to the parts of RQC that might benefit from these improvements.