Introduction

Through the NIST Post-Quantum Cryptography (PQC) Standardization Process [34], the cryptographic community is evaluating candidate KEM, PKE, and signature schemes potentially secure against both quantum and classical attacks. One requirement for these schemes to be deployed securely is a constant time implementation, which avoids leaking secret information through timing attacks. Furthermore, it is important to understand how efficient these constant time implementations are when running on real devices.

From an implementation perspective, while the cryptographic community has mostly focused its efforts on improving some categories of cryptosystems, such as those based on lattices and on codes in the Hamming metric, the same cannot be claimed for rank-based cryptosystems. Compared to lattice- and code-based cryptography, rank-based cryptography is a relatively new and less explored field. Although the first rank-based cryptosystem, the Gabidulin-Paramonov-Tretjakov (GPT) public key encryption scheme [23], was introduced in 1991, and many analyses were presented in the subsequent years (such as [22, 25, 36, 37]), new schemes have been proposed only recently, such as [5, 6, 11, 12, 20, 33], some of which have also been submitted to the NIST PQC standardization process. Up until the algebraic attack recently presented in [10], these schemes seemed to provide appealing performance levels and key and ciphertext sizes, which enabled ROLLO [2] (a merge of LAKE, LOCKER, and Rank-Ouroboros) and RQC [3] to pass the first round of the NIST PQC standardization process. However, in its recent status report on the second round [7], NIST did not select ROLLO and RQC to advance, with the motivation that their security analysis needs more time to mature. On the other hand, NIST encouraged the cryptographic community to continue studying rank-based cryptosystems, as they offer a nice alternative to traditional Hamming metric codes with comparable bandwidth.

There were still some open questions of practicality pertaining to the most recent NIST submission package for ROLLO (dated 2020/04/21 and available at pqc-rollo.org). In June 2020, a note was released [19] pointing out potential issues arising from parts of the submitted ROLLO and RQC code not being constant time.

In this work, we present the first constant time implementation of the 128-bit secure rank-based KEM called ROLLO-I-128. This work complements and improves the preliminary results posted on the NIST website [4] on the implementation of an earlier version of ROLLO-I-256. Other, non-constant time implementations of ROLLO on other platforms can be found, for example, in [1, 13], or in [32], where a software implementation of the encapsulation routine on a Cortex M0 and a hardware implementation on a microcontroller with a crypto coprocessor are presented, respectively.

Our Contribution

In this work, we propose different techniques that can be used to implement ROLLO and part of the RQC family of algorithms in a standalone, efficient, and constant time library. Recall that ROLLO-I-128, ROLLO-I-192, and ROLLO-I-256 have decryption failure probabilities of \(2^{-28}\), \(2^{-34}\), and \(2^{-33}\), respectively, which, cryptographically, are not considered small. For each of the proposed techniques, we present explicit code (with intrinsics when required) or pseudo-code, together with performance measures to show their impact.

As a theoretical contribution, we describe a new constant time variant of Gaussian elimination that reduces any matrix to its (not necessarily reduced) row echelon form. The only previous constant time variant we are aware of [14] worked only for full rank matrices, returning a systematic form of such matrices and terminating the algorithm if this was not possible. Furthermore, after analyzing current non-constant time algorithms that generate a list of vectors with a given rank, we describe a novel constant time probabilistic version of one of these algorithms, and we present a procedure for reducing its failure probability to a desired value. We also present a variation of this method which returns the entire support of the vector list. This potentially allows trade-offs between the public key size and the performance of the encapsulation step.

From an implementation perspective, we describe in detail the process of implementing the underlying finite field arithmetic with constant time operations, with and without the use of vectorization techniques. We provide an explicit description of the application of the Zassenhaus algorithm in the Rank Support Recovery algorithm described in the NIST submission of ROLLO [2]. We show how efficient polynomial arithmetic can be conducted by applying multiplication with lazy reduction and inversion of polynomials in a composite Galois field defined by a pentanomial. All of these are implemented using reasonably optimized constant time algorithms. Finally, we carry out a performance analysis to show the impact of these improvements on our implementation of ROLLO-I-128, compared with its reference and optimized implementations. We expect this work to shed light on the attainable performance of constant time implementations of ROLLO and to help practitioners make educated choices when implementing it or other constant time rank-based cryptographic algorithms.

Structure of the Paper

In “Preliminaries”, we introduce the basic concepts needed to understand the scheme and the subsequent algorithms. In “Description of the Scheme”, we describe the ROLLO-I key encapsulation mechanism. In “Proposed Algorithms”, we provide all the details regarding the binary field, vector space, and composite Galois field arithmetic, as well as the description of the Rank Support Recovery algorithm used in the decapsulation phase. In “Performance”, we compare the performance of our implementation of ROLLO-I-128 with that of various KEM submissions to the NIST PQC standardization process. In “Conclusion”, we present the conclusions drawn from this study.

Preliminaries

In this section, we first present the rings, fields and vector spaces we will work with as well as an associated metric, namely the rank metric, and then we will define error-correcting codes associated with this metric.

Structures and Representations

In the following, we let q be a prime power and m, n two positive integers. We will work with the finite fields of order q, \(q^m\), and \(q^{mn}\): \({\mathbb {F}}_{q}, {\mathbb {F}}_{q^m}, {\mathbb {F}}_{q^{mn}}\). Of course, there are multiple isomorphic fields of a given order, with multiple representations leading to different algorithms.

\(\mathbb {F}_{{\varvec{q}}}\). In this paper, as in the ROLLO specification, q will always be 2 and therefore elements and computations in \({\mathbb {F}}_{q}\) are associated to elements and computations in the modular ring \({\mathbb {Z}}/2{\mathbb {Z}}\).

\(\mathbb {F}_{{\varvec{q}}^{{\varvec{m}}}}\). As usual, elements in extensions of the base field \({\mathbb {F}}_{q}\) will be represented using quotients over the polynomial ring \({\mathbb {F}}_{q}[X]\). Thus, elements and computations in \({\mathbb {F}}_{q^m}\) are associated to polynomial representations and computations over \({\mathbb {F}}_{q}[X]/\left\langle P_0 \right\rangle\) for an irreducible polynomial \(P_0\) of degree m.

\(\mathbb {F}_{{\varvec{q}}^{{\varvec{mn}}}}\). Elements and computations in \({\mathbb {F}}_{q^{mn}}\) are similarly associated to polynomial representations and computations over \({\mathbb {F}}_{q^m}[X]/\left\langle P \right\rangle\) for an irreducible polynomial \(P \in {\mathbb {F}}_{q}[X]\) of degree n. Note that these polynomials have coefficients in \({\mathbb {F}}_{q^m}\), so elements in \({\mathbb {F}}_{q^{mn}}\) are seen as polynomials (that live in \({\mathbb {F}}_{q^m}[X]/\left\langle P \right\rangle\)) with polynomial coefficients (that live in \({\mathbb {F}}_{q}[X]/\left\langle P_0 \right\rangle\)).

It is also quite practical to use vectors and matrices to represent, and operate on, polynomials. For a field F, \({\mathcal {M}}_{n,m}(F)\) represents the set of matrices with n rows and m columns of elements in F. When n equals m, this set, together with the classical matrix sum and product, forms a ring that we denote \({\mathcal {M}}_n(F)\). Of course, we can map polynomials to vectors (of coefficients) and conversely, so we often consider an element of \({\mathbb {F}}_{q^m}\) as an element of the vector space \({\mathbb {F}}_{q}^m\), and an element of \({\mathbb {F}}_{q^{mn}}\) as an element of the vector space \({\mathbb {F}}_{q^m}^n\). For a vector \(\mathbf {v}\), we denote the associated polynomial by \(\mathbf {v}(X)\), and for a polynomial p, we denote the associated vector by \(\mathrm {vec}(p)\). When using a polynomial in a setting in which it is clear we have to use the vector representation (e.g., a matrix row, or a matrix/vector multiplication), we will not make the \(\mathrm {vec}\) transformation explicit.

Vector additions are naturally defined in \({\mathbb {F}}_{q}^m\) or \({\mathbb {F}}_{q^m}^n\) and correspond to polynomial additions over \({\mathbb {F}}_{q^m}\) and \({\mathbb {F}}_{q^{mn}}\). We define the product of two vectors \(\mathbf {u},\mathbf {v}\) by \(\mathbf {u}\mathbf {v} = \mathrm {vec}(\mathbf {u}(X) \mathbf {v}(X))\), and the inverse as \(\mathbf {u}^{-1} = \mathrm {vec}(\mathbf {u}^{-1}(X))\).

It is also possible to define vector multiplication directly over vector/matrices. To do this, we will first define ideal matrices. As we will only describe explicitly multiplications in \({\mathbb {F}}_{q^{mn}}\), we will focus our definition on this specific setting.

Definition 1

(Ideal Matrices). Let \(P\in {\mathbb {F}}_{q}[X]\) be a polynomial of degree n and \(\mathbf {v}\in {\mathbb {F}}_{q^m}^n\) the vector representation of an element of \({\mathbb {F}}_{q^{mn}}\). The ideal matrix generated by \(\mathbf {v}\) modulo P is the matrix denoted \(\mathcal {IM}_P(\mathbf {v})\in {\mathcal {M}}_n({\mathbb {F}}_{q^m})\) with n rows of the form \(X^{i} \mathbf {v}(X) \bmod P\), with \(i = 0, \ldots , n-1\).

The multiplication of two vectors \(\mathbf {u},\mathbf {v}\in {\mathbb {F}}_{q^{mn}}\) can then be computed as \(\mathbf {u}\mathbf {v} = \mathbf {u}\mathcal {IM}_P(\mathbf {v}) = (\mathcal {IM}_P(\mathbf {u})^T \mathbf {v}^T)^T = \mathbf {v}\mathbf {u}\). Note that this definition is compatible with the previous one, as we have \(\mathbf {u}\mathcal {IM}_P(\mathbf {v}) = \mathrm {vec}(\mathbf {u}(X)\mathbf {v}(X))\).
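To make the correspondence concrete, the following toy example checks that multiplying by the ideal matrix agrees with polynomial multiplication modulo P. The parameters (\(q=2\), \(m=1\), \(n=3\), \(P = X^3 + X + 1\)) are chosen for illustration only and are not ROLLO parameters:

```c
#include <assert.h>
#include <stdint.h>

/* Polynomials of degree < 3 over F_2 are stored as bitmasks: bit i is the
 * coefficient of X^i.  P(X) = X^3 + X + 1. */
#define N 3
#define P_LOW 0x3u                    /* X + 1: the part of P below X^N */

/* Multiply by X modulo P: shift, then fold the overflow bit back in
 * with an all-zeros/all-ones mask (branch-free). */
static uint8_t mulx_mod_p(uint8_t a) {
    uint8_t hi = (uint8_t)((a >> (N - 1)) & 1);  /* coefficient of X^(N-1) */
    uint8_t mask = (uint8_t)(0u - hi);           /* 0x00 or 0xFF */
    return (uint8_t)(((a << 1) & 0x7u) ^ (mask & P_LOW));
}

/* Rows of the ideal matrix IM_P(v): row i = X^i * v(X) mod P. */
static void ideal_matrix(uint8_t v, uint8_t rows[N]) {
    rows[0] = v;
    for (int i = 1; i < N; i++) rows[i] = mulx_mod_p(rows[i - 1]);
}

/* u * IM_P(v): XOR together the rows selected by the coefficients of u. */
static uint8_t mul_by_ideal_matrix(uint8_t u, uint8_t v) {
    uint8_t rows[N], r = 0;
    ideal_matrix(v, rows);
    for (int i = 0; i < N; i++)
        if ((u >> i) & 1) r ^= rows[i];
    return r;
}

/* Direct product u(X)v(X) mod P, for comparison (Horner-style). */
static uint8_t polymul_mod_p(uint8_t u, uint8_t v) {
    uint8_t r = 0;
    for (int i = N - 1; i >= 0; i--) {
        r = mulx_mod_p(r);
        if ((u >> i) & 1) r ^= v;
    }
    return r;
}
```

The same structure carries over to ROLLO's setting, where the matrix entries live in \({\mathbb {F}}_{q^m}\) instead of \({\mathbb {F}}_2\).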

Metric and Support

Let \({\mathbf {e}} = (e_1, \ldots , e_n)\) be an element of \({\mathbb {F}}_{q^m}^n\), and denote by \(e_{i,j}\) the j-th component of \(e_i\), where \(e_i\) is seen as an element of \({\mathbb {F}}_{q}^m\). The rank weight of \({\mathbf {e}}\), denoted \(\mathsf {w_R}({\mathbf {e}})\), is defined as \(\mathsf {w_R}({\mathbf {e}}) = \textsf {rank} \left( [ e_{i,j} ]_{i=1,\ldots ,n,\, j=1,\ldots ,m} \right)\). The rank distance between two vectors \(\mathbf {e}, \mathbf {f} \in {\mathbb {F}}_{q^m}^n\) is defined as \(||\mathbf {e-f}|| = \mathsf {w_R}(\mathbf {e - f})\).

For \({\mathbf {x}}=(x_1, \ldots , x_n)\in {\mathbb {F}}_{q^m}^n\), the support E of \({\mathbf {x}}\), denoted \(\mathsf {supp}({\mathbf {x}})\), is the \({\mathbb {F}}_q\)-subspace of \({\mathbb {F}}_{q^m}\) generated by the coordinates of \({\mathbf {x}}\): \(E=\langle x_1, \ldots , x_n\rangle _{{\mathbb {F}}_q}.\) Note that \(\mathsf {dim}(E) = \mathsf {w_R}({\mathbf {x}})\) and that any \({\mathbf {e}}\in E\) can be written as \({\mathbf {e}}=\sum _{i=1}^n \lambda _i x_i\) where \(\lambda _i\in {\mathbb {F}}_{q}\).

Codes

We define an \([n,k]_{q^m}\) code C over \({\mathbb {F}}_{q^m}\) as a vector subspace of \({\mathbb {F}}_{q^m}^n\) of dimension k, where n is called the length and k the dimension of the code. An element of a code C is called a codeword. A generator matrix for an \([n, k]_{q^m}\) code C is thus any \(k \times n\) matrix G whose rows form a basis for C. Note that the generator matrix of a code is not unique.

As a linear code is a vector subspace, it is the kernel of some linear transformation. In particular, there is an \((n - k) \times n\) matrix H, called a parity check matrix for the \([n,k]_{q^m}\) code C, that verifies \(C = \{x \in {\mathbb {F}}_{q^m}^n | Hx^T=0\}\). As for the generator matrix, the parity check matrix of a code C is not unique.

We now present the definition of the ideal Low Rank Parity Check (ideal LRPC) codes, the codes on which all the variants of ROLLO are based. Moreover, we introduce the underlying problems on which the security of the schemes relies. We first recall the definitions of ideal codes and LRPC codes.

Definition 2

(Ideal Codes). Let \(P \in {\mathbb {F}}_{q}[X]\) be a polynomial of degree n and \(\mathbf {h_1}, \mathbf {h_2} \in {\mathbb {F}}_{q^m}^n\). The \([2n, n]_{q^m}\) ideal code C defined by \((\mathbf {h_1},\mathbf {h_2})\) modulo P is the code with parity check matrix \(\left( \begin{array}{c|c} \mathcal {IM}_P(\mathbf {h_1})^T&\mathcal {IM}_P(\mathbf {h_2})^T \end{array} \right) .\)

If \(\mathbf {h_1}(X) = 1\) (and thus \(\mathcal {IM}_P(\mathbf {h_1}) = I_n\)), we say C is defined by \(\mathbf {h_2}\) modulo P. If \(\mathbf {h_1}(X)\) is invertible in \({\mathbb {F}}_{q^{mn}}\), the code C defined by \((\mathbf {h_1},\mathbf {h_2})\) modulo P is the same as the code defined by \(\mathbf {h_1}^{-1} \mathbf {h_2}\) modulo P.

Definition 3

(LRPC codes). Let \(H \in {\mathcal {M}}_{n-k,n}({\mathbb {F}}_{q^m})\) be a full rank matrix whose coefficients generate an \({\mathbb {F}}_q\)-subspace \(F = \langle h_{i,j}\rangle _{{\mathbb {F}}_q}\) of small dimension d. The \([n, k]_{q^m}\) code C with parity check matrix H is called an LRPC code of weight d.

A \([2n,n]_{q^m}\) Ideal Code defined by \((\mathbf {h_1},\mathbf {h_2})\) modulo a polynomial P can also be an LRPC code. Indeed, if \(\mathbf {h_1},\mathbf {h_2}\) are vectors in an \({\mathbb {F}}_{q}\)-subspace of small dimension and P has its coefficients in \({\mathbb {F}}_{q}\), this will be the case. Such a code is called an Ideal LRPC code.

Definition 4

(Ideal LRPC codes). Let F be an \({\mathbb {F}}_q\)-subspace of dimension d of \({\mathbb {F}}_{q^m}\), \(\mathbf {h_1}, \mathbf {h_2}\) two vectors of \({\mathbb {F}}_{q^m}^n\) with support in F, and \(P \in {\mathbb {F}}_q[X]\) a polynomial of degree n. The code C with parity check matrix \((\mathcal {IM}_P(\mathbf {h_1})^T | \mathcal {IM}_P(\mathbf {h_2})^T)\) is called a \([2n, n]_{q^m}\) ideal LRPC code.

The variant of ROLLO we focus on in this paper, ROLLO-I, has a security proof based on two problems. The first is a support recovery problem, which is proven in [2] to be equivalent to the rank-metric version of the Syndrome Decoding problem (RSD). The second is an indistinguishability problem.

Problem 1

(r-Ideal Rank Support Recovery). Given a polynomial \(P\in {\mathbb {F}}_{q}[X]\) of degree n, vectors \({\mathbf {h}}_1, \ldots , {\mathbf {h}}_{r-1} \in {\mathbb {F}}_{q^m}^n\), a syndrome \({\mathbf {s}}\), and a weight w, it is hard to find a support \(E=\langle {\mathbf {e}}_0,\ldots , {\mathbf {e}}_{r-1}\rangle\) of dimension lower than w such that \({\mathbf {e}}_0 +{\mathbf {e}}_1{\mathbf {h}}_1+\ldots +{\mathbf {e}}_{r-1}{\mathbf {h}}_{r-1}={\mathbf {s}} \mod P.\)

Problem 2

(Ideal LRPC codes indistinguishability). Given a polynomial \(P\in {\mathbb {F}}_{q}[X]\) of degree n and a vector \({\mathbf {h}}\in {\mathbb {F}}_{q^m}^n\), it is hard to distinguish whether the ideal code C with parity-check matrix generated by \({\mathbf {h}}\) and P is a random ideal code or if it is an ideal LRPC code of weight d.

In other words, it is hard to distinguish if \({\mathbf {h}}\) was sampled uniformly at random or as \({\mathbf {x}}^{-1}{\mathbf {y}}\mod P\) where the vectors \({\mathbf {x}}\) and \({\mathbf {y}}\) have the same support of small dimension d.

Description of the Scheme

As stated in the submission documentation, all ROLLO variants follow the approach inaugurated by the NTRU public key encryption scheme in 1998 [28]. As pointed out in the previous section, ROLLO is a variation of the LRPC rank metric approach, and its security is proven assuming that the Ideal LRPC indistinguishability and the 2-Ideal Rank Support Recovery [2, Theorem 4.2] problems are hard.

We now describe ROLLO-I in detail. The ROLLO-I Key-Encapsulation Mechanism (KEM) is a triple of probabilistic algorithms \((\mathsf {KeyGen}, \mathsf {Encaps}, \mathsf {Decaps})\).

\(\mathsf {KeyGen}\): randomly sample \(({\mathbf {x}},{\mathbf {y}})\) from a vector subspace F of \({\mathbb {F}}_{q^m}\) of dimension d, such that \(\mathsf {w_R}({\mathbf {x}}) = \mathsf {w_R}({\mathbf {y}}) = d\). Set \({\mathsf {pk}}= {\mathbf {h}}= {\mathbf {x}}^{-1}{\mathbf {y}}\mod P\) and \({\mathsf {sk}}= ({\mathbf {x}},{\mathbf {y}})\).

\(\mathsf {Encaps}\): randomly sample \(({\mathbf {e}}_1,{\mathbf {e}}_2)\) from a vector subspace E of \({\mathbb {F}}_{q^m}\) of dimension r, such that \(\mathsf {w_R}({\mathbf {e}}_1) = \mathsf {w_R}({\mathbf {e}}_2) = r\). Compute \({\mathbf {c}}= {\mathbf {e}}_1 + {\mathbf {e}}_2{\mathbf {h}}\mod P\) and \(K = G(E)\), where G is a hash function. Output \(({\mathbf {c}},K)\).

\(\mathsf {Decaps}\): compute \({\mathbf {s}}= {\mathbf {x}}{\mathbf {c}}= {\mathbf {x}}{\mathbf {e}}_1 + {\mathbf {y}}{\mathbf {e}}_2 \mod P\), and use the Rank Support Recovery (RSR) algorithm (algorithm 13) to recover E. The RSR algorithm takes as input \(F = \mathsf {Supp}({\mathbf {x}},{\mathbf {y}})\) and \({\mathbf {s}}\) (see “Rank Syndrome Recovery Algorithm and Decapsulation” for more detail). If the RSR algorithm succeeds, return \(K = G(E)\); otherwise return \(\perp\).

We refer to Table 1 for the actual set of ROLLO-I parameters. Note that the private key can be obtained from a seed; in the official NIST submission, the seed expander was initialized with 40-byte-long seeds.

Table 1 ROLLO-I parameters

As the last column of the table shows, the decapsulation algorithm has a non-zero failure probability. This probability is however well understood and made low enough to fit the NIST call for proposals (for more detail see Section 1.4.2 of [2]).

Proposed Algorithms

We redefined ROLLO starting from the following building blocks: the binary field arithmetic, corresponding to operations in \({\mathbb {F}}_{q^m}\); the vector space arithmetic, including the Gaussian reduction algorithm for binary matrices, the Zassenhaus algorithm for binary matrices, and the generation of elements of \({\mathbb {F}}_{q^m}[X] / P(X)\) of a given rank; the arithmetic in the composite Galois field \({\mathbb {F}}_{q^m}[X] / P(X)\), where P(X) is the irreducible polynomial given in the parameters; and the Rank Support Recovery (RSR) algorithm used in the decapsulation phase. The key generation, encapsulation, and decapsulation (or encryption and decryption) of all the variants of ROLLO are based only on the above blocks. Hence, we focused on optimizing every operation included in those layers, as well as ensuring that they are constant time.

Target We target processors with 64-bit carryless multiplications (2010 and onward for Intel) and provide a faster alternative if they also have AVX2 instructions (2013 and onward for Intel). The code examples assume GCC’s __uint128_t type is available and use GCC x86 intrinsics.

Notation Given two binary vectors \({\mathbf {x}},{\mathbf {y}}\), in what follows, we denote by \({\mathbf {x}}\oplus {\mathbf {y}}\) the bit-wise XOR of \({\mathbf {x}}\) and \({\mathbf {y}}\), and by \({\mathbf {x}}\otimes {\mathbf {y}}\) the bit-wise AND of \({\mathbf {x}}\) and \({\mathbf {y}}\). With \({\mathbf {x}}\ll h\) and \({\mathbf {x}}\gg h\) we indicate, respectively, the left and right shift of \({\mathbf {x}}\) by h positions.

Binary Field Arithmetic

In this section, we present the constant time vectorized operations we propose for \({\mathbb {F}}_{q^m}\). As shown in Table 1, all variants of ROLLO-I have \(q=2\) and different values for m. Our algorithms work for all the values of m submitted to the NIST competition, but have to be slightly adapted for each value. To avoid repetitions, we will focus on the field used by ROLLO-I-128, and note what changes need to be done to adapt the algorithms for other values of m.

We implemented finite field arithmetic for the binary field \({\mathbb {F}}_{2^{m}}\), with \(m=67\), representing elements as binary polynomials of degree \(m-1\) modulo an irreducible polynomial of degree m. We used the irreducible pentanomial \(P_0(X)=X^{67} + X^5 + X^2 + X + 1\) provided by the Allan Steel database incorporated in the Magma software [18] and also suggested by the authors of ROLLO. This pentanomial also has the lowest possible intermediate degree, allowing the shortest shifts during the reduction operations. No irreducible trinomial exists for \(m=67\).

To represent an element of the field, we use a 128-bit unsigned integer of type __uint128_t, sometimes cast to __m128i, with the unused bits set to zero. Addition and subtraction of two elements are a simple bit-wise XOR operation. The multiplication of two field elements is performed in two steps: a carryless multiplication of the two elements seen as polynomials (or a carryless squaring of a single element) and a polynomial reduction, all described below in this section. Inversion is performed using an addition chain, also described below. As noted before, all operations in the binary field layer are executed in constant time, assuming the intrinsics (and in particular the carryless multiplications) are constant time.

Carryless multiplication: plain C implementation The carryless multiplication has been implemented using recursive Karatsuba multiplication [31]. More specifically, we borrowed from NTL an implementation of a constant time carryless Karatsuba multiplication of two 64-bit registers (which we call ntlclmul64 in algorithm 14) using only bit manipulation, and then added an extra level of the Karatsuba method over this function. The full carryless multiplication \(\textsf {clmul}_K(a,b)\) is described in “Appendix A”, algorithm 14.
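For illustration, a minimal branch-free 64-bit carryless multiplication can be written with only shifts, masks, and XORs. Note that this is a simple sketch of the primitive, not the (much faster) bit-manipulation routine borrowed from NTL as ntlclmul64:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal constant time 64 x 64 -> 128 bit carryless multiplication.
 * Each bit of a selects, through an all-zeros/all-ones mask, whether a
 * shifted copy of b is XORed into the accumulator; there is no
 * data-dependent branch or memory access.  This sketches the primitive
 * only -- NTL's ntlclmul64 achieves the same result far more efficiently. */
static __uint128_t clmul64(uint64_t a, uint64_t b) {
    __uint128_t r = 0;
    __uint128_t bb = b;
    for (int i = 0; i < 64; i++) {
        /* mask is all-zeros or all-ones depending on bit i of a */
        __uint128_t mask = (__uint128_t)0 - ((a >> i) & 1);
        r ^= (bb << i) & mask;
    }
    return r;
}
```

An outer Karatsuba level, as described above, then combines three such 64-bit products to multiply full m-bit elements.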

In Table 2, we compare this implementation with ROLLO’s polynomial multiplications. The initial NTL-based ROLLO (submission date 2019/04/10) used NTL’s generic carryless multiplication function. Being generic, this function goes through a set of tests and function calls before reaching exactly the same code we used for ntlclmul64. The overhead (5 function calls, 6 if statements with two boolean tests for most of them, and a switch/case) is significant w.r.t. the final code of ntlclmul64 (78 instructions). As a result, by specializing the code, removing calls and conditional branches, and extracting only the instructions needed for ROLLO, we get a \(15\%\) speedup on polynomial multiplication with respect to NTL-ROLLO, which called the generic function. The Karatsuba function implemented in the NTL-free version of ROLLO (submission date 2019/08/24), called NoNTL-ROLLO in the table, is \(30\%\) slower than NTL’s generic function. It thus seems that, in general, implementations of Karatsuba using NTL may obtain a nice performance upgrade just by importing and adapting the specialized NTL code for this operation, as we did. We also note that the latest ROLLO implementation, dated 2020/04/21, is no longer NTL-dependent.

Table 2 Cycles per plain C carryless multiplication of polynomials of degree \(m=67\) (averaged over 4 s of execution on a MacBook Pro 2017 with a 2.9 GHz Quad-Core Intel Core i7, I7-7820HQ)

Carryless multiplication: AVX2 optimization When possible, the carryless multiplication step has been performed using Intel Advanced Vector Extensions 2 (AVX2) instructions [29]. In particular, the core of this function uses the _mm_clmulepi64_si128 instruction (see also [27]) to perform 64-bit by 64-bit binary polynomial multiplications.

The multiplication of two m-bit binary polynomials is performed in a schoolbook fashion, by dividing each input into two 64-bit registers (one containing only \(m-64\) bits) and then applying the function _mm_clmulepi64_si128, which acts on 64-bit registers, four times. The result is stored in a __m256i type (4 registers), but only the bits up to degree \(2m-2\) are used, while the remaining ones are set to zero. We refer to this algorithm as the \(\textsf {clmul}_S(a,b)\) algorithm, and we present our C implementation in “Appendix A”, algorithm 15. When irrelevant in the context, we will indicate with \(\textsf {clmul}(a,b)\) (with no subscript) the algorithm performing carryless multiplication, either with the Karatsuba method in plain C or with the schoolbook method and AVX instructions.

Let us remark that using Karatsuba multiplication [31] in this case would not give any advantage, as the costs of multiplication and addition with AVX2 instructions are very close. In practice, we show that it even performs worse, due to alignment problems.

In Table 3, we show that, when comparing the figures for NTL-ROLLO with the others, specializing the code for ROLLO’s setting has, unsurprisingly, an even greater impact on performance when using AVX2. The table also shows that alignment issues in Karatsuba have a very noticeable impact on performance, and highlights the fact that the ROLLO developers made the right choice in opting for schoolbook multiplication in the NTL-free version of ROLLO. Our implementation has a slight performance advantage.

Table 3 Average cycles per AVX2 carryless multiplication of polynomials of degree \(m=66\) (averaged over 4 s of execution on a MacBook Pro 2017 with a 2.9 GHz Quad-Core Intel Core i7, I7-7820HQ)

This difference is explained by the fact that the permutation done in our algorithm with _mm256_permute4x64_epi64 allows us to avoid the cost of the load and store instructions, which are present at the beginning and end of each recursive call in the NIST submitted code.

Carryless squaring For squaring, which will be used in the inversion algorithm, we can use the fact that this operation actually consists of interleaving zeros into the current representation of the polynomial. Indeed, for \(a \in {\mathbb {F}}_{2^m}\), \(a^2 = \left( \sum _{i = 0}^{m-1} a_i x^i \right) ^2 = \sum _{i = 0}^{m-1} a_i x^{2i}\). For example, if the current representation of a was 11100101, then \(\textsf {clsqr}(a)\) will be 0101010000010001. To perform this operation, we decided to use a small modification of the method Interleave bits with 64-bit multiply given by Sean Eron Anderson on his web page Bit Twiddling Hacks [21]. The pseudocode is given in “Appendix A”, algorithm 16.
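To illustrate the interleaving step, here is a multiplication-free variant (the shift-and-mask "binary magic numbers" technique from the same page, rather than the 64-bit-multiply method actually used): bit i of the input lands at bit 2i of the output, which over \({\mathbb {F}}_2[X]\) is exactly squaring before reduction:

```c
#include <assert.h>
#include <stdint.h>

/* Spread the 64 bits of v to the even positions of a 128-bit word
 * (bit i -> bit 2i) by doubling shift-and-mask steps.  Each step moves
 * half-blocks apart and masks away the duplicated copies. */
static __uint128_t interleave_zeros(uint64_t v) {
    const __uint128_t M32 = ((__uint128_t)0x00000000FFFFFFFFULL << 64) | 0x00000000FFFFFFFFULL;
    const __uint128_t M16 = ((__uint128_t)0x0000FFFF0000FFFFULL << 64) | 0x0000FFFF0000FFFFULL;
    const __uint128_t M8  = ((__uint128_t)0x00FF00FF00FF00FFULL << 64) | 0x00FF00FF00FF00FFULL;
    const __uint128_t M4  = ((__uint128_t)0x0F0F0F0F0F0F0F0FULL << 64) | 0x0F0F0F0F0F0F0F0FULL;
    const __uint128_t M2  = ((__uint128_t)0x3333333333333333ULL << 64) | 0x3333333333333333ULL;
    const __uint128_t M1  = ((__uint128_t)0x5555555555555555ULL << 64) | 0x5555555555555555ULL;
    __uint128_t x = v;
    x = (x | (x << 32)) & M32;
    x = (x | (x << 16)) & M16;
    x = (x | (x << 8))  & M8;
    x = (x | (x << 4))  & M4;
    x = (x | (x << 2))  & M2;
    x = (x | (x << 1))  & M1;
    return x;
}
```

The running example from the text, 11100101, maps to 0101010000010001 under this function.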

The squaring method is straightforward from there, and its pseudocode is given in “Appendix A”, algorithm 17. For the AVX2 version, a look-up table based on the instruction _mm_shuffle_epi8 is implemented in both the submission and our work. The AVX2 performances are reported in Table 4.

We would like to remark that, although simple and perhaps even trivial in retrospect, the mentioned approaches for squaring have been proposed before in the literature: [9] and [17] for the shuffle-based squaring, and [35] for the CLMUL-based squaring.

Table 4 Average cycles per carryless squaring of polynomials of degree \(m=66\) (averaged over 4 s of execution on a MacBook Pro 2017 with a 2.9 GHz Quad-Core Intel Core i7, I7-7820HQ)

Reduction The result of the carryless multiplication, of degree up to \(2m-2\), is reduced modulo \(P_0\) back to an m-bit field element, using standard techniques. The pseudocode of the reduction algorithm is presented in “Appendix A”, algorithm 18. The AVX2 performances of the reduction are reported in Table 7.
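A sketch of this reduction for \(P_0(X)=X^{67}+X^5+X^2+X+1\) follows. For brevity it takes a single __uint128_t, i.e., a polynomial of degree at most 127; the full implementation must also handle the top coefficients of a degree-\(2m-2\) product, which requires an extra limb:

```c
#include <assert.h>
#include <stdint.h>

#define M67 67

/* Reduce a binary polynomial of degree <= 127 modulo
 * P_0(X) = X^67 + X^5 + X^2 + X + 1, using the identity
 * X^67 = X^5 + X^2 + X + 1 (mod P_0) to fold the high part down. */
static __uint128_t gf67_reduce(__uint128_t v) {
    const __uint128_t MASK = ((__uint128_t)1 << M67) - 1;
    __uint128_t t = v >> M67;                       /* coefficients of X^67.. */
    v = (v & MASK) ^ t ^ (t << 1) ^ (t << 2) ^ (t << 5);
    /* a second fold is needed when the first one overflows bit 66
     * (e.g., for the wider inputs of a full product) */
    t = v >> M67;
    v = (v & MASK) ^ t ^ (t << 1) ^ (t << 2) ^ (t << 5);
    return v;
}
```

Both folds are unconditional, so the function runs in constant time regardless of the input value.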

Inversion The inversion of an element \(x \in {\mathbb {F}}_{2^{m}}\), described in “Appendix A”, algorithm 19, has been derived using Fermat’s little theorem, which gives \(x^{2^{m}-2} = x^{-1}\) for \(x \ne 0\). The fixed exponentiation is achieved by the strategy presented in [38, Section 6.2], using the following addition chain of length 9:

$$\begin{aligned} 1 \rightarrow 2 \rightarrow 4 \rightarrow 8 \rightarrow 16 \rightarrow 32 \rightarrow 33 \rightarrow 66 \rightarrow 67 \,. \end{aligned}$$

The AVX2 performances of the binary field inversion are reported in Table 7.
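One way to realize this chain is in Itoh–Tsujii style: maintain \(t_k = x^{2^k-1}\), combine \(t_{a+b} = (t_a)^{2^b}\, t_b\), and finish the last step \(66 \rightarrow 67\) with a single squaring, which yields \(x^{2^{67}-2} = x^{-1}\). Whether algorithm 19 schedules the steps exactly this way is our assumption; the sketch below checks only the exponent arithmetic of the chain:

```c
#include <assert.h>

typedef unsigned __int128 u128;

/* Exponent bookkeeping for the chain 1->2->4->8->16->32->33->66->67.
 * If t_a = x^ea and t_b = x^eb, then (t_a)^(2^b) * t_b = x^(ea * 2^b + eb). */
static u128 chain_step(u128 ea, unsigned b, u128 eb) { return (ea << b) + eb; }

/* Follow the chain symbolically and return the exponent it computes. */
static u128 inversion_exponent(void) {
    u128 e1  = 1;                        /* t_1  = x^(2^1 - 1) = x      */
    u128 e2  = chain_step(e1, 1, e1);    /* 1 -> 2:  t_2 = x^3          */
    u128 e4  = chain_step(e2, 2, e2);    /* 2 -> 4:  t_4 = x^15         */
    u128 e8  = chain_step(e4, 4, e4);
    u128 e16 = chain_step(e8, 8, e8);
    u128 e32 = chain_step(e16, 16, e16);
    u128 e33 = chain_step(e32, 1, e1);   /* 32 -> 33                    */
    u128 e66 = chain_step(e33, 33, e33);
    return e66 << 1;                     /* 66 -> 67: one final squaring */
}
```

Each chain step costs a fixed number of squarings plus one multiplication, so the resulting inversion is constant time by construction.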

Binary Vector Space Arithmetic

In this section, we describe the main algorithms used to manipulate vector spaces, i.e., Gaussian reduction, Zassenhaus algorithm, and the generation of vectors of given rank.

In our implementation, a binary matrix M of size \(l \times m\), usually indicated with uppercase letters, is an array of __uint128_t of length l, where each element of the array is a matrix row \(m_i\) of m bits. Similarly, a vector space, or the support of a set of vectors, is represented with uppercase letters and stored in arrays of __uint128_t.

Gaussian Elimination Algorithm

We introduce an original algorithm that performs constant time Gaussian elimination to convert any binary matrix to a (not necessarily reduced) row echelon form, together with an extension that converts it to reduced row echelon form. This algorithm is, in some sense, a generalization of the one presented in [14], where Gaussian elimination was used to convert a binary matrix to systematic form. In [14], if the matrix is not systematic, the algorithm breaks. Otherwise, for each column, the algorithm first sets to 1 the bit on the diagonal, by scanning the rows of the matrix from the current pivot down to the bottom of the matrix, then sets to 0 the bits in the current column, except the diagonal one, by scanning the full set of rows again. This is done in a constant time manner, due to the fact that, the matrix being systematic, the number of rows under the pivot is always the same for each column step. However, [14] does not define how one could force the algorithm to continue when it is not possible to fix a 1 on the diagonal, i.e., when the matrix is not systematic. We solve the problem by always scanning all rows for each column, and by keeping track of the current pivot position, which is not necessarily on the diagonal. Let \(\tilde{r}\) be the current pivot row position, i the currently scanned row, and j the currently scanned column. Then, we perform

$$\begin{aligned} m_{\tilde{r}}&= m_{\tilde{r}} \oplus \mathsf {mask1} \cdot \mathsf {mask2} \cdot \mathsf {mask3} \cdot m_{i} \\ m_{i}&= \mathsf {mask1} \cdot \mathsf {mask3} \cdot m_{\tilde{r}} \oplus m_{i} \end{aligned}$$

where \(\mathsf {mask3}\) is set to 1 if the current row is below the pivot row (\(i>\tilde{r}\)), \(\mathsf {mask2}\) is set to 1 if the pivot bit \(m_{\tilde{r},j}\) is 0, and \(\mathsf {mask1}\) is set to 1 if the bit \(m_{i,j}\), at the intersection of the currently scanned row and column, is 1. The steps above have the effect of leaving the rows unchanged either when the current row is not below the pivot row \(m_{\tilde{r}}\) or, otherwise, when the bit \(m_{i,j}\) is 0. On the other hand, when \(m_{i,j}\) is 1, if the pivot bit \(m_{\tilde{r},j}\) is 0, then the current row is swapped with the pivot row (the two XORs combined act as a swap), and if the pivot bit \(m_{\tilde{r},j}\) is 1, then the 1 in position (i, j) is flipped. Notice that, at the end of the algorithm, the pivot position is also the rank of the matrix. Compared to [14], for each scan of the full set of rows, we perform fewer XOR operations, but we need to compute more masks. We also have to scan all columns, while for the method from [14] it is sufficient to scan the minimum between the number of rows and the number of columns. This makes the method of [14] much faster for matrices with a small number of rows. We stress again that the method of [14] only computes the systematic form of a matrix, and for this reason it is, in general, faster.
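The column scan above can be sketched in C as follows (a hypothetical row_echelon_rank helper of our own, with uint64_t rows for compactness instead of the __uint128_t rows used in our implementation). Note that indexing the pivot row directly still leaks \(\tilde{r}\) on machines with caches, so a fully constant time version must mask those row accesses as well:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchless row-echelon reduction of an r x c binary matrix; one matrix
 * row is one uint64_t (bit j = column j).  All row updates are gated by
 * all-zeros/all-ones masks rather than branches.  Returns the rank. */
static size_t row_echelon_rank(uint64_t *m, size_t r, size_t c) {
    size_t piv = 0;                                  /* current pivot row position */
    for (size_t j = 0; j < c && piv < r; j++) {
        for (size_t i = 0; i < r; i++) {
            uint64_t m3 = (uint64_t)0 - (uint64_t)(i > piv);     /* row below pivot row */
            uint64_t m1 = (uint64_t)0 - ((m[i] >> j) & 1);       /* current row bit is 1 */
            uint64_t m2 = ~((uint64_t)0 - ((m[piv] >> j) & 1));  /* pivot bit is still 0 */
            m[piv] ^= (m1 & m2 & m3) & m[i];   /* first half of the conditional swap  */
            m[i]   ^= (m1 & m3) & m[piv];      /* swap completion, or bit elimination */
        }
        piv += (m[piv] >> j) & 1;  /* advance only if column j produced a pivot */
    }
    return piv;                    /* final pivot position = rank of the matrix */
}
```

When the pivot bit is 0 and the scanned bit is 1, the two gated XORs together swap the two rows; when the pivot bit is already 1, only the second XOR fires and clears the scanned bit.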

Our method can be easily extended to compute the reduced row echelon form, by storing the pivot positions and then scanning all the rows \(\mathsf {r}\) times, where \(\mathsf {r}\) is the number of rows, to remove the 1’s above the pivots.

The differences between our method and the one in [14] are summarized in Tables 5, 6.

Table 5 Comparison of our proposed Gaussian elimination algorithm and the one from Bernstein et al. [14], for a matrix with \(\mathsf {r}\) rows and \(\mathsf {c}\) columns
Table 6 Clock cycle comparison of our proposed Gaussian elimination algorithm and the one from Bernstein et al. [14], for a matrix with \(\mathsf {r}=10,20,30,100\) rows and \(\mathsf {c}=67\) columns

The pseudocode of the three algorithms can be found in algorithm 1 ([14]), algorithm 2, and algorithm 3, where M represents a binary matrix with \(\mathsf {r}\) rows and \(\mathsf {c}\) columns, \(m_i\) is the binary vector representing the i-th row of the matrix M, and \(m_{i,j}\) is the bit entry of the matrix M at position (i, j).

In our C implementation, we store one row \(\texttt {m[i]}\) of the binary matrix in a variable of type __uint128_t. We can perform Steps 3–4 of algorithm 1 in a constant number of operations as follows:
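For example, assuming Steps 3–4 perform the conditional row addition "if \(m_{j,j} = 0\) then \(m_j = m_j \oplus m_i\)", the secret-dependent branch can be replaced by a mask; the following sketch (with our naming) illustrates the idiom on __uint128_t rows:

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t;

/* Hypothetical rendering of a conditional row addition
 *   if m[j][j] == 0 then m[j] = m[j] XOR m[i]
 * The secret diagonal bit is expanded into an all-ones/all-zeros
 * mask, so no branch depends on matrix data. */
static inline void cond_row_add(uint128_t *mj, uint128_t mi, unsigned j)
{
    uint128_t bit  = (*mj >> j) & 1;  /* diagonal bit m_{j,j}          */
    uint128_t mask = bit - 1;         /* 0 -> all ones, 1 -> all zeros */
    *mj ^= mask & mi;                 /* row addition only if bit == 0 */
}
```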

Similarly, the other if statements of both algorithms can easily be executed in constant time.

Finally, note that algorithm 2 and algorithm 3 access \(m_{\tilde{r}}\). Using memory indices depending on \(\tilde{r}\) can leak information on \(\tilde{r}\) through timing attacks on machines with caches. To avoid this type of attack, one has to scan all the rows of the matrix and select the desired row using another mask.
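Such a masked access can be sketched as follows (our illustration, with uint64_t rows): every row is read, and the wanted one is kept via a comparison mask, so the sequence of memory accesses is independent of the secret index. In production code, the equality test itself would also be computed branch-free.

```c
#include <stdint.h>

/* Load row `secret_idx` without using it as a memory index:
 * touch every row and keep only the wanted one through a mask. */
static uint64_t ct_load_row(const uint64_t *m, int rows, int secret_idx)
{
    uint64_t out = 0;
    for (int i = 0; i < rows; i++) {
        /* all-ones when i == secret_idx, all-zeros otherwise */
        uint64_t eq = (uint64_t)0 - (uint64_t)(i == secret_idx);
        out |= eq & m[i];
    }
    return out;
}
```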

figure a
figure b
figure c

Zassenhaus Algorithm

The Zassenhaus algorithm is a method to compute a basis for the intersection and the sum of two vector subspaces U, V of a vector space W of dimension m. Let us consider two sets of generators of U and V, i.e., \(U=\langle u_0,\ldots , u_{l_1} \rangle\) and \(V=\langle v_0,\ldots , v_{l_2} \rangle\). The algorithm creates the block matrix (1) of size \((l_1+l_2)\times 2m\):

$$\begin{aligned}&\begin{bmatrix} u_{0,0} & \ldots & u_{0, m-1} & u_{0,0} & \ldots & u_{0, m-1} \\ \vdots & & \vdots & \vdots & & \vdots \\ u_{l_1, 0} & \ldots & u_{l_1, m-1} & u_{l_1, 0} & \ldots & u_{l_1, m-1} \\ v_{0,0} & \ldots & v_{0, m-1} & 0 & \ldots & 0 \\ \vdots & & \vdots & \vdots & & \vdots \\ v_{l_2, 0} & \ldots & v_{l_2, m-1} & 0 & \ldots & 0 \\ \end{bmatrix} \end{aligned}$$
(1)
$$\begin{aligned}&\begin{bmatrix} a_{0,0} & \ldots & a_{0, m-1} & \star & \ldots & \star \\ \vdots & & \vdots & \vdots & & \vdots \\ a_{l_3, 0} & \ldots & a_{l_3, m-1} & \star & \ldots & \star \\ 0 & \ldots & 0 & b_{0,0} & \ldots & b_{0, m-1} \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \ldots & 0 & b_{l_4,0} & \ldots & b_{l_4, m-1} \\ 0 & \ldots & 0 & 0 & \ldots & 0 \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \ldots & 0 & 0 & \ldots & 0 \\ \end{bmatrix} \end{aligned}$$
(2)

After applying Gaussian elimination, the matrix has the row echelon form (2). In (2), \(\star\) stands for arbitrary entries, \((a_0,\ldots , a_{l_3})\) is a basis of \(V+U\), and \((b_0,\ldots , b_{l_4})\) is a basis of \(V\cap U\). The pseudocode can be found in algorithm 4.
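To make the block-matrix mechanics concrete, the following C illustration (ours, deliberately not constant time) runs the procedure over \(\mathbb{F}_2\) with each generator stored as an m-bit mask, \(m \le 32\), and at most 64 generators in total:

```c
#include <stdint.h>

/* Zassenhaus over F_2, for illustration only (not constant time).
 * u[0..nu-1], v[0..nv-1]: generators as m-bit masks, m <= 32, nu+nv <= 64.
 * On return, sum[] holds a basis of U+V (dimension in *dsum) and
 * inter[] a basis of the intersection (dimension in *dinter). */
static void zassenhaus2(const uint32_t *u, int nu, const uint32_t *v, int nv,
                        int m, uint32_t *sum, int *dsum,
                        uint32_t *inter, int *dinter)
{
    uint64_t rows[64];
    int n = nu + nv, rank = 0;
    for (int i = 0; i < nu; i++)            /* block [u | u] */
        rows[i] = (uint64_t)u[i] | ((uint64_t)u[i] << m);
    for (int i = 0; i < nv; i++)            /* block [v | 0] */
        rows[nu + i] = (uint64_t)v[i];
    /* plain Gaussian elimination over columns 0 .. 2m-1 */
    for (int j = 0; j < 2 * m && rank < n; j++) {
        int piv = -1;
        for (int i = rank; i < n; i++)
            if ((rows[i] >> j) & 1) { piv = i; break; }
        if (piv < 0) continue;
        uint64_t t = rows[rank]; rows[rank] = rows[piv]; rows[piv] = t;
        for (int i = 0; i < n; i++)
            if (i != rank && ((rows[i] >> j) & 1)) rows[i] ^= rows[rank];
        rank++;
    }
    /* rows with a nonzero left block give U+V; the rest give U int V */
    *dsum = *dinter = 0;
    uint64_t low = ((uint64_t)1 << m) - 1;
    for (int i = 0; i < rank; i++) {
        if (rows[i] & low) sum[(*dsum)++] = (uint32_t)(rows[i] & low);
        else               inter[(*dinter)++] = (uint32_t)(rows[i] >> m);
    }
}
```

The constant time version used in our implementation replaces this elimination with the mask-based one from the previous section and always processes inputs of fixed size.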

figure d

Generation of Vectors of Given Rank

The generation of a vector \({\mathbf {e}}\in {\mathbb {F}}_{q^m}^n\) of a given rank, say r, is probably the most delicate part of the key generation and encapsulation routines. We are not aware of any constant time algorithm performing this task. In this section, we analyze the two non-constant time strategies adopted in the current NIST submission (dated 2020/04/21) and in [5, Sect. 5.2]. Then we derive a constant time version of the latter, with a probability of failure that can be made as small as desired, at the cost of increasing the complexity of the algorithm. Lastly, we also propose an alternative method that, while generating a vector of a given rank, also constructs the full support of the vector. This last method could turn out to be useful when a user can store a larger public key in memory, so as to avoid reconstructing the support from its basis during the encapsulation phase.

The strategies from [5] and from the NIST submission are based on the same idea: generating r random elements of \({\mathbb {F}}_{q^m}\) until they are linearly independent, and then generating random linear combinations of those elements. In the NIST submission, the r elements are randomly inserted among the error components, thus guaranteeing that the error will have rank r. The remaining \(n-r\) positions are filled with random linear combinations of the basis elements. This algorithm is detailed in algorithm 5. On the other hand, in [5], the components of the error are all filled with random linear combinations of the basis elements, until the error has rank r. This algorithm is detailed in algorithm 6. It is clear that both strategies are not constant time. Notice also that, in [19], the authors describe how the NIST submission implementation leaks the memory access pattern.

Both approaches can be made constant time by removing the repeat and while loops, and iterating the algorithm enough times that the probability of generating a vector of the wrong rank becomes negligible. Our proposed constant time solution is based on this idea. Precisely, we first sample r elements of \({\mathbb {F}}_{q^m}\) uniformly at random. In Proposition 1, we derive the probability for those elements to be linearly independent over \({\mathbb {F}}_2\). Second, we generate the components of the vector using masked linear combinations of the basis. Note that this algorithm also has the advantage of hiding the memory access pattern. We show that it is sufficient to run the full procedure once to reach a probability of failure of \(2^{-60}\), which is already far smaller than the ROLLO Decryption Failure Rate. If this is still a concern (for example when adapting this work to ROLLO-II), repeating the procedure twice leads to a probability of failure of \(2^{-120}\), and so on. The full algorithm is described in algorithm 7. Note that this is the algorithm used in our implementation.
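The core of the masked linear combination can be sketched as follows (our simplification: basis elements fit in uint64_t and coefficients are single bits, whereas the implementation works on 128-bit words). Every coordinate touches every basis element, keeping or dropping it with an all-ones/all-zeros mask, so the access pattern does not depend on the random draw:

```c
#include <stdint.h>

/* One coordinate e_i = sum_j rnd_j * f_j, computed branch-free:
 * bit j of rnd selects whether basis element f[j] contributes. */
static uint64_t masked_combination(const uint64_t *f, int r, uint64_t rnd)
{
    uint64_t e = 0;
    for (int j = 0; j < r; j++) {
        uint64_t mask = (uint64_t)0 - ((rnd >> j) & 1); /* 0 or all ones */
        e ^= mask & f[j];
    }
    return e;
}
```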

Proposition 1

The probability that r randomly sampled elements of \({\mathbb {F}}_{q^m}\) are linearly independent over \({\mathbb {F}}_q\) (i.e., have rank r) is \(p = (1-q^{-m})\cdot (1-q^{1-m})\cdots (1-q^{r-1-m}).\)

Proof

The first element \(e_1 \in {\mathbb {F}}_{q^m}\) is linearly independent if and only if it is different from zero. Since \({\mathbb {F}}_{q^m}\) has \(q^m\) elements, we have \(\Pr (e_1=0)=1/q^m\). Then \(e_1,e_2\) are linearly dependent if and only if \(e_2=k e_1\) for some \(k\in {\mathbb {F}}_{q}\), so \(\Pr (e_2=k e_1)=q/q^m\). We can continue this way until the last vector, for which \(e_r\) is a linear combination of the previous ones if and only if \(e_r=\sum _{i=1}^{r-1} k_i e_i\) with \(k_i\in {\mathbb {F}}_{q}\), so \(\Pr (e_r=\sum _{i=1}^{r-1} k_i e_i)=q^{r-1}/q^m\). \(\square\)

For the ROLLO-I-128 parameters, where \(r=7\) and \(m=67\), the probability that the r random vectors are linearly dependent is approximately \(2^{-60}\). Computing the probability that a random support has the required dimension is only the first step of the evaluation of the failure probability of our algorithm. Assuming that a random support F of dimension r is available, we now have to compute the probability for a vector \({\mathbf {e}}\in F^n\) to be of rank strictly less than r. Let \(f_1, \ldots , f_r\) be a basis of F. The components \(e_1, \ldots , e_n\) can be written in coordinates with respect to \(f_1, \ldots , f_r\): \(e_i = \sum _{j=1}^{r} (e_i)_j f_j\), where \((e_i)_j \in {\mathbb {F}}_q\). Let M be the \(r \times n\) matrix over \({\mathbb {F}}_q\) such that \(M_{j, i} = (e_i)_j\). Then \(\mathsf {w_R}({\mathbf {e}}) < r\) is equivalent to the matrix M having rank \(< r\). Since the entries of M are sampled randomly, this probability can be approximated by \(q^{-(1+n-r)}\).

For the ROLLO-I-128 parameters, where \(r=7\) and \(n=83\), the probability to obtain a vector of rank less than r is \(2^{-77}\); hence, the probability that this process generates an error of rank less than r is \(2^{-60} + 2^{-77}\), which can be approximated by \(2^{-60}\).
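As a numeric sanity check (ours) of the first figure, the dependence probability is well approximated by the union bound \(\sum _{i=0}^{r-1} 2^{i-m} = (2^r - 1)\,2^{-m}\), which the following snippet evaluates without any math-library calls:

```c
/* First-order approximation of the probability that r random elements
 * of F_{2^m} are linearly dependent: sum_{i=0}^{r-1} 2^{i-m}.
 * For r = 7, m = 67 this is 127 * 2^-67, i.e., about 2^-60. */
static double dep_prob_approx(int r, int m)
{
    double t = 1.0;
    for (int i = 0; i < m; i++) t *= 0.5;      /* t = 2^{-m} */
    double p = 0.0;
    for (int i = 0; i < r; i++) { p += t; t *= 2.0; }
    return p;
}
```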

Now, one might generate multiple samples: if the cycle is repeated h times, the probability of failure becomes \(2^{-120}\) for \(h=2\), \(2^{-180}\) for \(h=3\), and so on. To make this approach constant time, one can repeat the sampling as many times as needed to reach the desired probability, each time computing the rank of the vector with the constant time Gaussian elimination algorithm proposed in “Gaussian elimination algorithm”, and store the sampled vector space when it has the desired rank.

figure e
figure f
figure g

Now, we describe how to generate the entire support of the vector rather than just a basis. This approach takes advantage of the fact that r is usually small (at most 9 for ROLLO-I). We start by initializing a list with the zero vector and a random vector. We then generate a second random vector and check whether it is already in the list. If so, we discard it and generate another one; otherwise, we add to the list its sum with every vector already in the list. We end up generating a vector subspace F of \({\mathbb {F}}_{q^m}\) of dimension r. One can then draw the coordinates of \({\mathbf {e}}\) randomly from this list. The only caveat of this method is that the vector \({\mathbf {e}}\) can be of rank less than r, as its coordinates could lie in a proper subspace of F. We therefore have to check the rank of \({\mathbf {e}}\) before outputting the result, or run the algorithm twice to reach a probability of failure of \(2^{-120}\) (as proved above). We also notice that an implementation of such a method needs to take care of hiding the memory access pattern when randomly drawing the elements from the vector space. The method is detailed in its non-constant time version in algorithm 8, and in its constant time version in algorithm 9. Note that the mask operation in line 9 of algorithm 9 should be done using an AND mask rather than a multiplication.
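The list-growing step can be illustrated as follows (a non-constant-time sketch, ours; candidates are drawn from a caller-supplied stream instead of a random number generator so that the example is deterministic). Since the list is always a subspace, membership in the list coincides with membership in the span:

```c
#include <stdint.h>
#include <stddef.h>

static int in_list(const uint64_t *list, size_t len, uint64_t x)
{
    for (size_t i = 0; i < len; i++)
        if (list[i] == x) return 1;
    return 0;
}

/* Grow the full support of dimension r: each new independent element
 * doubles the list, so `list` needs room for 2^r entries.  Returns the
 * final length (2^r on success, less if the stream runs out). */
static size_t gen_support(uint64_t *list, int r,
                          const uint64_t *stream, size_t nstream)
{
    size_t len = 1, pos = 0;
    list[0] = 0;                                 /* the zero vector */
    while (len < ((size_t)1 << r) && pos < nstream) {
        uint64_t cand = stream[pos++];
        if (in_list(list, len, cand)) continue;  /* dependent: discard */
        for (size_t i = 0; i < len; i++)         /* close under addition */
            list[len + i] = list[i] ^ cand;
        len *= 2;
    }
    return len;
}
```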

figure h

Composite Galois Field Arithmetic

An element in the composite Galois field \({\mathbb {F}}_{({2^m})^n}\) can be represented as a polynomial \({\mathbf {a}}(x)=a_0 + a_1 x + \ldots + a_{n-1} x^{n-1}\) in \({\mathbb {F}}_{2^{m}}[x]/ P(x)\), with \(P(x) \in {\mathbb {F}}_2[x]\) irreducible of degree n, or, equivalently, as an array \({\mathbf {a}}=(a_0, a_1, \ldots , a_{n-1})\) of length n of elements in \({\mathbb {F}}_{2^{m}}\). In our implementation, an element of \({\mathbb {F}}_{{(2^m)}^n}\) is an array of __uint128_t of length n, and we usually refer to it in the pseudocode with bold lowercase letters.

Matrix Multiplication with Lazy Reduction

The multiplication \({\mathbf {a}}\times {\mathbf {b}}\) in \({\mathbb {F}}_{({2^m})^n}\), algorithm 10, is performed as the following vector-by-matrix multiplication

$$\begin{aligned} (a_0, a_1, \ldots , a_{n-1}) \times \begin{bmatrix} {{\hat{b}}}_{0,0} & \cdots & {{\hat{b}}}_{0,n-1} \\ {{\hat{b}}}_{1,0} & \cdots & {{\hat{b}}}_{1,n-1} \\ \vdots & \ddots & \vdots \\ {{\hat{b}}}_{n-1,0} & \cdots & {{\hat{b}}}_{n-1,n-1} \end{bmatrix}, \end{aligned}$$

where \(( {{\hat{b}}}_{i,0}, \cdots , {{\hat{b}}}_{i,n-1})\) are the coefficients of \({\mathbf {b}}(x)\cdot x^i \mod P(x)\).

In ROLLO-I-128, we have \(n=83\), so \((b_{i,0} + b_{i,1} x + \ldots + b_{i,82} x^{82}) \cdot x \mod P(x) = b_{i,82} + b_{i,0} x + (b_{i,1} + b_{i,82}) x^2 + b_{i,2} x^3 + (b_{i,3} + b_{i,82}) x^4 + b_{i,4} x^5 + b_{i,5} x^6 + (b_{i,6}+ b_{i,82}) x^7 + \ldots + b_{i,81} x^{82}\), since \(x^{83} = x^7 + x^4 + x^2 + 1\).
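In other words, the recurrence from one row \(({{\hat{b}}}_{i,0}, \ldots , {{\hat{b}}}_{i,n-1})\) to the next amounts to a shift of the coefficient array plus three corrections. A sketch (ours, with coefficients held in toy 64-bit words instead of 128-bit lanes):

```c
#include <stdint.h>

#define N 83  /* ROLLO-I-128, with x^83 = x^7 + x^4 + x^2 + 1 mod P(x) */

/* In-place step b(x)*x^i -> b(x)*x^{i+1} mod P(x): shift the
 * coefficient array up by one and fold the top coefficient back
 * into positions 0, 2, 4, and 7. */
static void mulx_mod_P(uint64_t b[N])
{
    uint64_t top = b[N - 1];
    for (int k = N - 1; k > 0; k--)
        b[k] = b[k - 1];          /* multiply by x */
    b[0] = top;                   /* x^83 -> 1     */
    b[2] ^= top;                  /* x^83 -> x^2   */
    b[4] ^= top;                  /* x^83 -> x^4   */
    b[7] ^= top;                  /* x^83 -> x^7   */
}
```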

This allows us to reduce the number of reductions in \({\mathbb {F}}_{2^{m}}\): when we compute the field element \((a_0, a_1, \ldots , a_{n-1}) \times ( {{\hat{b}}}_{i,0}, \cdots , {{\hat{b}}}_{i,n-1}) = \sum _{j=0}^{n-1} a_j {{\hat{b}}}_{i,j}\), each product \(a_j {{\hat{b}}}_{i,j}\) can be computed using the carryless multiplication algorithm \(\textsf {clmul}\), and the reduction \(\textsf {red}_{{\mathbb {F}}_{2^{67}}}\) is applied only once, at the end of the summation. The pseudo-code of the algorithm is presented in algorithm 10.
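The pattern can be sketched as follows (ours: a portable shift-and-xor stand-in for the clmul instruction, elements restricted to fewer than 64 bits, and a generic modulus passed as the low part \(f(x) - x^m\); the actual implementation uses the hardware carryless multiplier and \(m = 67\) on 128-bit words):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Portable stand-in for clmul: carryless product of two 64-bit
 * binary polynomials, 128-bit unreduced result. */
static u128 clmul64(uint64_t a, uint64_t b)
{
    u128 r = 0;
    for (int i = 0; i < 64; i++)
        r ^= ((u128)(a & ((uint64_t)0 - ((b >> i) & 1)))) << i;
    return r;
}

/* Lazy-reduction inner product sum_j a[j]*b[j] over F_2[x]/f(x),
 * deg f = m < 64, f_low = f(x) - x^m.  Products are accumulated
 * unreduced; the reduction runs once, after the loop.  (The final
 * fold is written with a branch for readability; real code would
 * use masks.) */
static uint64_t lazy_inner_product(const uint64_t *a, const uint64_t *b,
                                   int n, int m, uint64_t f_low)
{
    u128 acc = 0;
    for (int j = 0; j < n; j++)
        acc ^= clmul64(a[j], b[j]);        /* no reduction inside the loop */
    for (int k = 126; k >= m; k--)         /* fold x^k = f_low(x) * x^{k-m} */
        if ((acc >> k) & 1)
            acc ^= ((u128)1 << k) ^ ((u128)f_low << (k - m));
    return (uint64_t)acc;
}
```

Deferring the reduction is sound because XOR-accumulating the 128-bit products commutes with the final modular fold.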

figure i

The AVX2 performances of the polynomial multiplication are reported in Table 7.

Polynomial Inversion

For the inversion in the composite Galois field \({\mathbb {F}}_{(2^{m})^{n}} \cong {\mathbb {F}}_{2^{m}}[x]/ P(x)\), we use the technique presented in [26] in 1998, which improves the Itoh-Tsujii algorithm with pre-computed powers [30]. The idea is to compute \({\mathbf {a}}^{-1} = ({\mathbf {a}}^r)^{-1}{\mathbf {a}}^{r-1}, {\mathbf {a}}\in {\mathbb {F}}_{(2^m)^n}, {\mathbf {a}}\ne 0\), where \(r = (2^{mn} - 1)/(2^m-1)\). It is easy to prove that \({\mathbf {a}}^r \in {\mathbb {F}}_{2^m}\) as \(({\mathbf {a}}^r)^{2^m} = ({\mathbf {a}}^{1 + 2^m + 2^{2m} + \ldots + 2^{(n-1)m}})^{2^m} = {\mathbf {a}}^{1 + 2^m + 2^{2m} + \ldots + 2^{(n-1)m}} = {\mathbf {a}}^r\). This reduces inversion in the Galois field \({\mathbb {F}}_{({2^m})^n}\) to one inversion in the ground field \({\mathbb {F}}_{2^m}\), the computation of \({\mathbf {a}}^{r - 1}\) and n multiplications in \({\mathbb {F}}_{2^{m}}\).

To compute \({\mathbf {a}}^{2^m}\), one can notice that \({\mathbf {a}}^{2^m} = \left( \sum _{i = 0}^{n-1} a_i x^i \right) ^{2^m} \mod P = \sum _{i = 0}^{n-1} a_i x^{i 2^m} \mod P\), since \(a_i^{2^m} = a_i\) for all \(a_i \in {\mathbb {F}}_{2^m}\). It is then sufficient to pre-compute the values \(s_i = x^{i 2^m} \mod P\), for \(i = 0, \ldots , n - 1\). The computation of \({\mathbf {a}}^{2^m}\) can therefore be seen as a matrix multiplication as follows:

$$\begin{aligned} S \cdot {\mathbf {a}}^T = \left( \begin{matrix} 1 & s_{1,0} & s_{2,0} & \ldots & s_{n-1,0} \\ 0 & s_{1,1} & s_{2,1} & \ldots & s_{n-1,1} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & s_{1,n-1} & s_{2,n-1} & \ldots & s_{n-1,n-1} \\ \end{matrix} \right) \times \left( \begin{matrix} a_0 \\ a_1\\ \vdots \\ a_{n-1}\\ \end{matrix} \right) \end{aligned}$$

In addition, if P has only binary coefficients (which is the case for all variants of ROLLO), the pre-computed values also have binary coefficients, meaning that the previous matrix multiplication can be performed using only XORs. The last step is to remark that \({\mathbf {a}}^{2^{km}} = S^k \cdot {\mathbf {a}}^T\); we end up with an algorithm performing n polynomial multiplications and binary matrix multiplications, plus one inversion in \({\mathbb {F}}_{2^m}\) followed by n multiplications in \({\mathbb {F}}_{2^m}\).

In algorithm 12 we summarize how the inversion is performed. It uses algorithm 11 to compute \({\mathbf {a}}^{2^{km}}\). The matrix S in algorithm 11 is a pre-computed matrix depending only on P and n.

Notice that both algorithm 11 and algorithm 12 can be coded so that they execute a constant number of operations. In particular, Steps 4–5 of algorithm 11 can be performed in a constant time fashion by using a mask, as follows: compute \(\texttt {mask} = 0 - S_{i,j}\), so that \(\texttt {mask}\) is 0 if \(S_{i,j} = 0\) and a binary vector of all 1’s otherwise; then compute \(t = a_j \otimes \texttt {mask}\) and finally \(b_i = b_i \oplus t\).
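In code, the whole binary-matrix-times-vector loop then becomes a masked accumulation. A sketch (ours, with the binary matrix stored row-major as bytes and field elements in 128-bit words):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* b = S * a over F_{2^m}: S is an n x n binary matrix (entries 0/1,
 * row-major), a holds field elements in 128-bit words.  Each entry
 * S[i][j] is expanded into an all-ones/all-zeros mask, so the same
 * sequence of operations runs whatever the matrix contains. */
static void binmat_vec_mul(u128 *b, const uint8_t *S, const u128 *a, int n)
{
    for (int i = 0; i < n; i++) {
        u128 acc = 0;
        for (int j = 0; j < n; j++) {
            u128 mask = (u128)0 - (u128)S[i * n + j];  /* 0 or all ones */
            acc ^= mask & a[j];
        }
        b[i] = acc;
    }
}
```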

figure j
figure k

It is also possible to pre-compute all the matrices \(S, S^2, S^3, \ldots , S^{n-1}\) to avoid the steps 2 and 3 of algorithm 11. This, for example, results in 70.6 KB of pre-computed matrices for ROLLO-I-128, and a speed improvement of about \(17\%\).

As an alternative method to compute the inverse of a polynomial, one might consider a constant time variant of Euclid’s algorithm, such as the one proposed in [16]. However, this type of algorithm is usually more attractive for generic moduli, where the modular reductions in Fermat’s method are considerably more expensive. After a comparison between a Sagemath [40] implementation of the method described above and the script recipx provided in [16], which came out in favor of the former, we decided to discard this option.

The AVX2 performances of the inversion are reported in Table 7.

Table 7 Average cycles per operation for the main algorithms presented in this work

Rank Syndrome Recovery Algorithm and Decapsulation

In this section, we describe the core of the decapsulation phase: the Rank Support Recovery (RSR) algorithm, which was introduced in [24] and made constant time in [8].

Let E, F be two \({\mathbb {F}}_{q}\)-subspaces of \({\mathbb {F}}_{q^m}\), and let \((e_1, \ldots , e_r)\) be a basis of E and \((f_1, \ldots , f_d)\) be a basis of F, so that \(\mathsf {dim}(E)=r\) and \(\mathsf {dim}(F)=d\). We denote by EF the subspace generated by the products of the elements of E and F, i.e., \(EF = \langle \{e f \, | \, e\in E \text{ and } f\in F\} \rangle .\) Note that \((e_if_j)_{1\le i\le r, 1\le j\le d}\) is a generating family of EF. Thus, \(\mathsf {dim}(EF)\le rd\), and equality holds with overwhelming probability [2]. For that reason, we assume that \(\mathsf {dim}(EF) = rd\).

Let C be an LRPC code with parity check matrix \(H \in {\mathbb {F}}_{q^m}^{n \times 2n}\), and let \({\mathbf {s}}=(s_1, \ldots , s_n)\) be the syndrome of the error vector \({\mathbf {e}}=(e_1,\ldots , e_{2n})\), that is, \(H{\mathbf {e}}^T={\mathbf {s}}^T\). Let E be the support of \({\mathbf {e}}\) and S be the support of \({\mathbf {s}}\). Since S is a subspace of EF, its dimension is at most rd. Finally, we denote \(B_i= f_i^{-1} S\).

The RSR algorithm (algorithm 13) takes as input a basis of the vector space F, the syndrome \({\mathbf {s}}\), and the dimension r of E; its output is (with high probability) E, i.e., the support of the error \({\mathbf {e}}\) (see [8] for more details).

Let us explain how the algorithm recovers the support E of the error vector \({\mathbf {e}}\). Since the coordinates of the syndrome can be seen as elements of EF, the idea is to compute the support of the error as \(E = B_1 \cap B_2\cap \ldots \cap B_d, \text{ where } B_i= f_i^{-1} S.\) In fact, \(B_i = \langle f_i^{-1}f_1e_1, f_i^{-1}f_2e_1, \ldots , f_i^{-1}f_d e_r\rangle =\langle e_1,\ldots , e_r, \{f_i^{-1}f_je_t\}_{1\le j\le d, j\ne i, 1\le t\le r }\rangle \,.\) Note that this method fails to recover E when the syndrome space S is different from EF, or when the intersection contains other elements besides the \(e_j\)’s [2].

In algorithm 13, we use capital letters both for the outputs of the Zassenhaus algorithm (“Zassenhaus Algorithm” section) and for matrices with elements in \({\mathbb {F}}_{q^m}\). In the latter case, we denote by \(J^{\{i\}}\) the i-th row of the matrix J. We also indicate by \(T, \_= \textsf {zassenhaus}(B_i, B_j)\) the first element of the Zassenhaus algorithm output, i.e., \(B_i + B_j\), and by \(\_ \, , T = \textsf {zassenhaus}(B_i, B_j)\) the second element of the output, that is, \(B_i \cap B_j\). With T we indicate a temporary value; the i-th element of T is denoted by \(t_i\).

There are three conditions that need to be fulfilled for this algorithm to run in constant time: (1) the size of the inputs to the Zassenhaus algorithm has to be constant; here we always input a basis of length rd for both vector spaces; (2) for inputs of the same size, the Zassenhaus algorithm needs to run in constant time, which was taken care of in “Zassenhaus Algorithm”; (3) operations involving elements of \({\mathbb {F}}_{q^m}\) (addition, multiplication, etc.) need to run in constant time, which was taken care of in “Composite Galois Field Arithmetic”.

Notice that Step 3 of the algorithm would also work if, instead of the reduced row echelon form of the basis, one computed the entire vector space E and then sorted it with respect to some fixed order. For this particular choice of parameters, this second option is slower. It could become more efficient for a much larger m and a smaller basis.

figure l

Performance

We benchmark our implementation of ROLLO-I-128 on a 2017 MacBook Pro equipped with a 2.9 GHz Intel Core i7 (I7-7820HQ). To measure the performance of the single operations presented in this work, we use our own testing platform; the results are reported in Table 7.

We use SUPERCOP version 20200618 [15], with Intel Hyper-Threading and Turbo Boost disabled, to compare our implementation with other existing KEMs. In the key generation and encapsulation functions, we use the random number generator randombytes() provided by SUPERCOP. Note that our implementation uses a stand-alone implementation of SHA256, but, for a fair comparison, we switched to OpenSSL’s SHA256 implementation, which is also used in the implementation of ROLLO-I. All primitives are compiled using clang with the parameters -march=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -Wl,-no_pie. For the non-vectorized implementations, we disable the flag -march=native.

According to our profiler, about 85% of the key generation is taken by the polynomial inversion. In the encapsulation, 5% of the time is occupied by the polynomial multiplication, while 91% is spent generating a basis and two polynomials whose lists of coefficients have given rank r. About 70% of this last step (63% of the full encapsulation time) is taken by computing the rank of the list, to make sure it has the proper rank, while about 15% is taken by the randombytes() calls.

About 75% of the decapsulation is taken by the Gaussian elimination step in the Zassenhaus algorithm. In the official ROLLO specification [2], the following numbers (in thousands) of clock cycles are reported for, respectively, key generation, encapsulation, and decapsulation: 3537, 395, 1754. Our loss in the key generation is explained by the fact that ROLLO’s team used a non-constant-time GCD algorithm for the polynomial inversion. Our loss in the encapsulation is explained by the fact that ROLLO’s team used a non-constant-time generation of vectors with given rank; in particular, they did not have to check the rank of \({\mathbf {e}}_1\) and \({\mathbf {e}}_2\) twice. The non-constant-time implementation of Gaussian elimination also explains the difference in the decapsulation step.

In Table 8, we report the performance results of our implementation of ROLLO-I-128 with one iteration of the generation of vectors of given rank (CT_rollo_fast) and with two iterations (CT_rollo_secure). We also report the performances of the other Category 1 KEMs available in SUPERCOP.

Table 8 The number of cycles to perform key generation, encapsulation, and decapsulation of other KEMs available in SUPERCOP with 128-bit security

Conclusion

In this work, we have presented several algorithms which shed some light on the potential performance of a fully optimized constant time implementation of ROLLO-I-128. They highlight that this proposal can be quite interesting from a computational point of view, both with and without AVX2. Future work will consist of porting these algorithms to the other variants of ROLLO, as well as to the parts of RQC that might benefit from these improvements.