
1 Introduction

The Key Encapsulation Mechanism (KEM) called Bit Flipping Key Encapsulation (BIKE) [2] is based on Quasi-Cyclic Moderate-Density Parity-Check (QC-MDPC) codes, and is one of the Round-2 candidates of the NIST PQC Standardization Project [15]. The submission includes several variants of the KEM and we focus here on BIKE-1-CCA Level-1 and Level-3.

The common QC-MDPC decoding algorithms are derived from the Bit-Flipping algorithm [12] and come in two main variants.

  • “Step-by-Step”: it recalculates the threshold every time that a bit is flipped. This is an enhancement of the “in-place” decoder described in [11].

  • “Simple-Parallel”: a parallel algorithm similar to that of [12]. It first calculates some thresholds for flipping bits and then flips the bits in all of the relevant positions, in parallel.

BIKE uses a decoder for the decapsulation phase. The specific decoding algorithm is a choice shaped by the target DFR, security, and performance. The IND-CCA version of BIKE Round-2 [2] is specified with the “BackFlip” decoder, which is derived from Simple-Parallel. The IND-CPA version is specified with the “One-Round” decoder, which combines the Simple-Parallel and the Step-By-Step decoders. In the “additional implementation” [7] we chose to use the “Black-Gray” decoder (BG) [5, 8], with the thresholds defined in [2]. This decoder (with different thresholds) appears in the BIKE pre-Round-1 submission “CAKE” and is due to N. Sendrier and R. Misoczki.

This paper explores a new family of decoders that combine the BG and the Bit-Flipping algorithms in different ways. Some combinations achieve the same or even better DFR compared to BG with the same block size, and at the same time also have better performance.

For better security we replace the mock-bits technique of the additional implementation [5] with a constant-time implementation that applies the rotation and bit-slice-adder techniques proposed in [3] (and vectorized in [13]), and enhance it with further optimizations. We also report the first measurements of BIKE-1 on the new Intel “Ice-Lake” micro-architecture, leveraging the new VPCLMULQDQ and vector-AES instructions [1] (see also [4, 10]).

The paper is organized as follows. Section 2 defines notation and offers some background. The Bit-Flipping and the BG algorithms are given in Sect. 3. In Sect. 4 we define new decoders (BGF, B and BGB) and report our DFR per block size studies in Sect. 5. We discuss our new constant-time QC-MDPC implementation in Sect. 6. Section 7 reports the resulting performance. Section 8 concludes the paper.

2 Preliminaries and Notation

Let \(\mathbb {F}_2\) be the finite field of characteristic 2. Let \(\mathcal {R}\) be the polynomial ring \(\mathbb {F}_2[X]/\left\langle X^r - 1 \right\rangle \). For every element \(v \in \mathcal {R}\), its Hamming weight is denoted by wt(v), its bit-length by |v|, and its support (i. e., the positions of its set bits) by supp(v). Polynomials in \(\mathcal {R}\) are viewed, interchangeably, as square circulant matrices in \(\mathbb {F}_2^{r \times r}\). For a matrix \(H \in \mathbb {F}_2^{r \times r}\), let \(H_i\) denote its i-th column written as a row vector. We denote a failure by the symbol \(\perp \). Uniform random sampling from a set W is denoted by \(w \xleftarrow {\$} W\). For an algorithm A, we denote its output by \(out = A()\) if A is deterministic, and by \(out \leftarrow A()\) otherwise. Hereafter, we use the notation \(x.y\mathrm {e}{-z}\) to denote the number \((x + \frac{y}{10}) \cdot 10^{-z}\) (e. g., \(1.2\mathrm {e}{-3} = 1.2 \cdot 10^{-3}\)).
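To make the notation concrete, the following toy C sketch (our illustration, not taken from the BIKE code base; one byte per coefficient, and not constant-time) implements multiplication and Hamming weight in \(\mathcal {R}\):

```c
#include <stdint.h>
#include <string.h>

/* Toy arithmetic in R = F2[X]/<X^r - 1>, one byte per coefficient.
   Real BIKE code packs bits into words and must be constant-time. */

/* c = a * b in R: cyclic convolution of the coefficient vectors mod 2 */
void poly_mul(uint8_t *c, const uint8_t *a, const uint8_t *b, int r) {
    memset(c, 0, (size_t)r);
    for (int i = 0; i < r; i++)
        if (a[i])
            for (int j = 0; j < r; j++)
                c[(i + j) % r] ^= b[j];   /* X^i * X^j = X^((i+j) mod r) */
}

/* wt(v): the Hamming weight, i.e., |supp(v)| */
int poly_wt(const uint8_t *v, int r) {
    int w = 0;
    for (int i = 0; i < r; i++) w += v[i];
    return w;
}
```

For instance, \((1+X)(1+X+X^2) = 1+X^3 \equiv 0\) in \(\mathbb {F}_2[X]/\left\langle X^3-1 \right\rangle \).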

BIKE-1 IND-CCA. BIKE-1 (IND-CCA) flows are shown in Table 1. The computations are executed over \(\mathcal {R}\), and the block size r is a parameter. The weights of the secret key \(h=(h_0,h_1,\sigma _0,\sigma _1)\) and the errors vector \(e=(e_0,e_1)\) are w and t, respectively. The public key, ciphertext, and shared secret are \(f=(f_0,f_1)\), \(c=(c_0,c_1)\), and k, respectively. \(\mathbf {H}\), \(\mathbf {K}\) denote hash functions (as in [2]). Currently, the parameters of BIKE-1 IND-CCA for NIST Level-1 are: \(r=11,779\), \(|f|=|c|=23,558\), \(|k|=256\), \(w=142\), \(d=w/2=71\) and \(t=134\).

Table 1. BIKE-1-CCA

3 The Bit-Flipping and the Black-Gray Decoders

Algorithm 1 describes the Bit-Flipping decoder [12]. The threshold-computation step derives the relevant threshold from the syndrome, the errors vector, or the Unsatisfied Parity-Check (UPC) values. The original definition of [12] takes the maximal UPC as its threshold.

[Algorithm 1: pseudo-code figure]
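For intuition, here is a toy C model of Algorithm 1 (our sketch, using a small generic parity-check matrix rather than a QC-MDPC one, and the maximal-UPC threshold rule of [12]; a real implementation works on packed bits, in constant time):

```c
#include <stdint.h>
#include <string.h>

#define RR 6   /* number of parity checks */
#define NN 8   /* code length             */

/* s = H*e over F2 */
static void syndrome(uint8_t s[RR], uint8_t H[RR][NN], uint8_t e[NN]) {
    for (int i = 0; i < RR; i++) {
        s[i] = 0;
        for (int j = 0; j < NN; j++) s[i] ^= (uint8_t)(H[i][j] & e[j]);
    }
}

/* Returns 0 on success (zero syndrome) and -1 on decoding failure. */
int bit_flip_decode(uint8_t e_out[NN], uint8_t H[RR][NN],
                    uint8_t s_in[RR], int max_iter) {
    uint8_t e[NN] = {0}, s[RR];
    memcpy(s, s_in, RR);
    for (int it = 0; it < max_iter; it++) {
        int upc[NN] = {0}, th = 0;
        for (int j = 0; j < NN; j++) {
            /* upc[j] = number of unsatisfied checks involving bit j */
            for (int i = 0; i < RR; i++) upc[j] += H[i][j] & s[i];
            if (upc[j] > th) th = upc[j];      /* original rule: th = max UPC */
        }
        for (int j = 0; j < NN; j++)           /* flip in parallel */
            if (th > 0 && upc[j] >= th) e[j] ^= 1;
        syndrome(s, H, e);                     /* remaining syndrome:      */
        for (int i = 0; i < RR; i++) s[i] ^= s_in[i];  /* s = H*e + s_in   */
        int sw = 0;
        for (int i = 0; i < RR; i++) sw += s[i];
        if (sw == 0) { memcpy(e_out, e, NN); return 0; }
    }
    return -1;
}
```

On this toy matrix a single error is corrected in one iteration, because only the error position attains the maximal UPC.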

Algorithm 2 describes the BG decoder that is implemented in the BIKE additional code package [7]. Every iteration of BG involves three main steps. Step I performs one Bit-Flipping iteration and sets the black and gray arrays. Steps II and III each execute another Bit-Flipping iteration, but here the errors vector e is updated only in the positions indicated by the black and gray masks, respectively.

In Step I the decoder uses some threshold (th) to decide whether or not a certain bit is an error bit. The probability that the bit is indeed an error bit increases as a function of the gap (upc[i] - th). The algorithm records bits with a small gap in the masks, so that the subsequent Steps II and III can use these masks to gain more confidence in the flipped bits. In this paper \(\delta =4\).
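The mask-recording logic of Step I can be sketched as follows (our illustration; the array names are ours):

```c
#include <stdint.h>

#define DELTA 4   /* the paper uses delta = 4 */

/* Step I of BG, mask-recording part (sketch): bits with upc[j] >= th are
   flipped and recorded as "black"; bits that barely missed the threshold,
   i.e., th - DELTA <= upc[j] < th, are recorded as "gray" so that
   Steps II and III can revisit them. */
void record_masks(uint8_t *black, uint8_t *gray, uint8_t *e,
                  const int *upc, int th, int n) {
    for (int j = 0; j < n; j++) {
        black[j] = (uint8_t)(upc[j] >= th);
        gray[j]  = (uint8_t)(upc[j] >= th - DELTA && upc[j] < th);
        e[j] ^= black[j];   /* flip the black bits */
    }
}
```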

[Algorithm 2: pseudo-code figure]

4 New Decoders with Different Shades of Gray

In cases where Algorithm 2 can safely run without a constant-time implementation, Step II and Step III are fast. The reason is that the UPC values are calculated only for the indices recorded in the black and gray masks, and the number of these indices is at most the number of bits that were flipped in Step I (certainly less than n). By contrast, if constant-time and constant memory-access are required, the implementation needs to access all of the n positions uniformly. In such a case the performance of Step II and Step III is similar to the performance of Step I. Thus, the overall decoding time of the BG decoder with \(X_{BG}\) iterations, where each iteration executes Steps I, II, and III, is proportional to \(3 \cdot X_{BG}\).

The decoders that are based on Bit-Flipping are not perfect: they can erroneously flip a bit that is not an error bit. The probability of erroneously flipping a “non-error” bit is an increasing function of wt(e)/n and also depends on the threshold (note that wt(e) changes during the execution). Step II and Step III of BG are designed to fix some erroneously flipped bits, and therefore decrease wt(e) compared to its value after one iteration of Simple-Parallel (without the masks). Evidently, when wt(e)/n becomes sufficiently small, the black/gray technique is no longer needed because erroneous flips have low probability. This observation leads us to propose several new variations of the BG decoder (see Appendix A for their pseudo-code).

  1. A Black decoder (B): every iteration consists of only Steps I and II (i. e., there is no gray mask).

  2. A Black-Gray-Flip decoder (BGF): it starts with one BG iteration and continues with several Bit-Flipping iterations.

  3. A Black-Gray-Black decoder (BGB): it starts with one BG iteration and continues with several B-iterations.

Example 1 (Counting the number of steps)

Consider BG with 3 iterations. Every iteration involves 3 steps (I, II, and III), so the total number of practically identical steps is 9. Now consider BGF with 3 iterations. The first iteration involves 3 steps (I, II, and III) and each remaining iteration involves only one step, so the total number of practically identical steps is \(3+1+1 = 5\).
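The step counting used throughout the paper can be summarized in code (one BG iteration costs 3 steps, a B-iteration 2, and a plain Bit-Flipping iteration 1; all steps cost the same in a constant-time implementation):

```c
/* Total number of (practically identical) steps for each decoder. */
int steps_bg(int iters)  { return 3 * iters; }              /* I+II+III each  */
int steps_bgf(int iters) { return 3 + (iters - 1); }        /* then BF steps  */
int steps_bgb(int iters) { return 3 + 2 * (iters - 1); }    /* then B steps   */
```

These counts match the trade-offs quoted in Sect. 5, e. g., BG with 4 iterations takes 12 steps while BGF with 6 iterations takes 8.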

5 DFR Evaluations for Different Decoders

In this section we evaluate and compare the B, BG, BGB, and BGF decoders under two criteria.

  1. The DFR for a given number of iterations and a given value of r.

  2. The value of r that is required to achieve a target DFR with a given number of iterations.

In order to approximate the DFR we use the extrapolation method [16], and apply two forms of extrapolation: “best linear fit” [8] and “two large r’s fit” (as in [8, Appendix C]). We point out that the extrapolation method relies on the assumption that the dependence of the DFR on the block size r is a concave function in the relevant range of r. Table 2 summarizes our results. It shows the r-value required for achieving a DFR of \(2^{-23} (\approx 10^{-7})\), \(2^{-64}\), and \(2^{-128}\). It also shows the approximated DFR for \(r=11,779\) (which is the value used for BIKE-1 Level-1 CCA). Appendix B provides the full information on the experiments and the extrapolation analysis.
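A minimal sketch of the “best linear fit” extrapolation (ours, with synthetic data for illustration): fit \(\log _2(\mathrm {DFR}) = a \cdot r + b\) by least squares over the measured points, then extrapolate the line to the target DFR.

```c
#include <math.h>

/* Least-squares fit of log2(DFR) = a*r + b over measured (r, log2(DFR))
   points, in the regime where the dependence is (assumed) linear. */
void fit_line(const double *r, const double *y, int n, double *a, double *b) {
    double sr = 0, sy = 0, srr = 0, sry = 0;
    for (int i = 0; i < n; i++) {
        sr += r[i]; sy += y[i];
        srr += r[i] * r[i]; sry += r[i] * y[i];
    }
    *a = (n * sry - sr * sy) / (n * srr - sr * sr);
    *b = (sy - (*a) * sr) / n;
}

/* block size at which the fitted line predicts the target log2(DFR) */
double r_for_target(double a, double b, double target) {
    return (target - b) / a;   /* a < 0 in this regime */
}
```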

Table 2. The DFR achieved by different decoders. Two extrapolation methods are shown: “best linear fit” (as in [8]) and “two large r’s fit” (as in [8, Appendix C]). The second column shows the number of iterations for each decoder. The third column shows the total number of (time-wise identical) executed steps.

Interpreting the Results of Table 2. The results in Table 2 indicate that it is possible to trade BG with 3 iterations for BGF with 6 iterations. This achieves a better DFR and also a \(\frac{9}{8} = 1.125 \times \) speedup. Moreover, if the required DFR is at most \(2^{-64}\), it suffices to use BGF with only 5 iterations (and get the same DFR as BG with 3 iterations). This achieves a \(\frac{9}{7} \approx 1.29 \times \) speedup. The situation is similar for BG with 4 iterations compared to BGB with 5 iterations: this achieves a \(\frac{12}{11} \approx 1.09 \times \) speedup. If a DFR of \(2^{-128}\) is required, it is possible to trade BG with 4 iterations for BGF with 7 iterations and achieve a \(\frac{12}{9} \approx 1.33 \times \) speedup. Another interesting trade-off is available if we are willing to slightly increase the value of r. Compare BG with 4 iterations (i. e., 12 steps) and BGF with 6 iterations (i. e., 8 steps). For a DFR of \(2^{-64}\) we have \(r_{BG}=11,003\) and \(r_{BGF}=11,027\). A very small relative increase in the block size, namely \((r_{BGF} - r_{BG})/ r_{BG} = 0.0022\), gives a \(\frac{12}{8} = 1.5 \times \) speedup.

Example 2 (BGF versus BG with 3 iterations)

Fig. 1 shows a qualitative comparison (the precise details are provided in Appendix B). The left panel indicates that BGF has a better DFR than BG for the same number of steps (9) when \(r>9,970\). Similarly, the right panel shows the same phenomenon even with a smaller number of BGF steps (7) when \(r>10,726\) (with the best linear fit method) and \(r>10,734\) (with the two large r’s method), which correspond to a DFR of \(2^{-43}\) and \(2^{-45}\), respectively. Both panels show that the crossover point appears for values of r below the range that is relevant for BIKE.

Fig. 1.

DFR comparison of BG with 3 iterations (9 steps) to BGF with: (Left panel) 7 iterations (9 steps); (Right panel) 5 iterations (7 steps). See the text for details.

6 Constant-Time Implementation of the Decoders

The mock-bits technique was introduced in [5] for side-channel protection, in order to obfuscate the (secret) \(supp(h_0),supp(h_1)\). Let \(M_i\) denote the mock-bits used for obfuscating \(supp(h_i)\) and let \(\overline{M_i} = M_i \sqcup supp(h_i)\). For example, the implementation of BIKE-1 Level-1 used \(|M_i| = 62\) mock-bits and thus \(|\overline{M_i}|=133\). The probability to correctly guess the secret 71 bits of \(h_i\) when the whole set \(\overline{M_i}\) is given is \(\binom{133}{71}^{-1} \approx 2^{-128}\). This technique was designed for ephemeral keys, but it may leak information on the private key if the key is used multiple times (i. e., if most of \(\overline{M_i}\) can be trapped). Knowing that \(supp(h_i) \subset \overline{M_i}\), an adversary can learn that all the other \((r-|\overline{M_i}|)\) bits of \(h_i\) are zero. Subsequently, it can generate the system of linear equations \((h_0,h_1)^T \cdot (f_0, f_1) = 0\), set the relevant variables to zero, and solve it. To avoid this, \(|\overline{M_i}|\) needs to be at least r/2 (probably more) so that the system is sufficiently underdetermined. However, using that many mock-bits makes the method impractical (it was used as an optimization to begin with).

Therefore, to allow multiple usages of the private key, we modify our implementation and use some of the optimizations suggested in [3] that were later vectorized in [13]. Specifically, we leverage the (array) rotation technique (which was also used in [14] for FPGAs). Here, the syndrome is rotated d times, once for every position in \(supp(h_i)\). The rotated syndrome is then accumulated in the UPC array, using a bit-slice technique that implements a Carry Save Adder (CSA).
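The accumulation can be sketched in plain C as follows (our illustration at a 64-bit word size; [3, 13] operate on wider vector registers). Each of 64 UPC counters is stored “bit-sliced”, one word per counter bit, and one word of the rotated syndrome increments all 64 counters branch-free:

```c
#include <stdint.h>

#define UPC_BITS 8   /* bits per counter; enough for d <= 255 */

/* Bit-slice adder sketch: plane[k] holds bit k of all 64 counters.
   Adding the 64 increment bits in `inc` is a ripple of half-adders over
   the planes: a fixed number of word operations, no branches on data. */
void upc_add(uint64_t plane[UPC_BITS], uint64_t inc) {
    uint64_t carry = inc;
    for (int k = 0; k < UPC_BITS; k++) {
        uint64_t t = plane[k] & carry;  /* half-adder carry */
        plane[k] ^= carry;              /* half-adder sum   */
        carry = t;
    }
}

/* Read back counter j (for testing; a decoder keeps the data sliced). */
unsigned upc_get(const uint64_t plane[UPC_BITS], int j) {
    unsigned v = 0;
    for (int k = 0; k < UPC_BITS; k++)
        v |= ((unsigned)((plane[k] >> j) & 1)) << k;
    return v;
}
```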

6.1 Optimizing the Rotation of an Array

Consider the rotation of the syndrome s (of r bits) by, e. g., 1,100 positions. It starts with “Barrel shifting” by the word size of the underlying architecture (e. g., 512 bits for AVX512), here twice (1,024 positions). It then continues with internal shifting, here by 76 positions. Reference [13] shows a code snippet (for the core functionality) for rotating by a number of positions that is less than the word size. Figure 2 presents our optimized and simplified snippet for the same functionality, using a single instruction in place of the two instructions used in [13].

Fig. 2.

Right rotate of 512-bit registers using AVX512 instructions.
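A portable C sketch of the same idea at a 64-bit word size (our illustration; the AVX512 code in Fig. 2 does this on 512-bit words, and a real implementation must also access the words obliviously, since the rotation amounts are secret):

```c
#include <stdint.h>

#define WORDS 4   /* toy size; a real syndrome spans many more words */

/* Right-rotate a (WORDS*64)-bit vector, bit 0 = LSB of out[0]:
   first a "Barrel shift" by whole words, then an intra-word funnel
   shift by the remaining bit count. */
void rotate_right(uint64_t out[WORDS], const uint64_t in[WORDS], unsigned k) {
    unsigned ws = (k / 64) % WORDS;   /* whole-word part of the shift */
    unsigned bs = k % 64;             /* remaining bit shift          */
    uint64_t tmp[WORDS];
    for (unsigned i = 0; i < WORDS; i++)        /* barrel shift by words */
        tmp[i] = in[(i + ws) % WORDS];
    for (unsigned i = 0; i < WORDS; i++) {      /* funnel shift in words */
        uint64_t lo = tmp[i], hi = tmp[(i + 1) % WORDS];
        out[i] = bs ? (lo >> bs) | (hi << (64 - bs)) : lo;
    }
}
```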

The latest Intel micro-architecture “Ice-Lake” introduces the new VPERMI2B instruction as part of the new AVX512-VBMI set. This instruction receives two 512-bit registers (a, b) together with another 512-bit index register (c) and outputs in dst the following results:

[Figure: the per-byte output semantics of the new instruction]

Figure 3 shows how this instruction can be used in order to replace the three instructions in lines 16–18 of Fig. 2.
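Our reading of the instruction’s per-byte semantics, written as a plain C model (an assumption for illustration; see [1] for the authoritative definition):

```c
#include <stdint.h>

/* C model of a VPERMI2B-style two-source byte permute: byte j of dst is
   selected from the 128-byte concatenation b|a by the low 7 bits of
   byte j of the index register c (bit 6 chooses the source register). */
void permute2_bytes(uint8_t dst[64], const uint8_t a[64],
                    const uint8_t b[64], const uint8_t c[64]) {
    for (int j = 0; j < 64; j++) {
        uint8_t id = c[j] & 0x3f;                /* position in a source */
        dst[j] = (c[j] & 0x40) ? b[id] : a[id];  /* bit 6 picks a or b   */
    }
}
```

With the index bytes set to \((j+\text {shift}) \bmod 128\), this realizes a byte-granularity rotation of the two concatenated registers, which is exactly what the rotation in Fig. 3 needs.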

Remark 1

Reference [13] remarks on using tables for some syndrome rotations, but mentions that this does not yield a significant speedup (and in some cases even shows a performance penalty). This is due to two bottlenecks in a constant-time implementation: (a) extensive memory access; (b) pressure on the execution port that the shift operations are using. In our case, the bottleneck is (a), so using tables to reduce the number of shifts is not a remedy. For completeness, we describe a new table method that can be implemented using Ice-Lake CPUs. The new instruction [1] allows permuting two 512-bit registers at a granularity of bytes, and therefore allows performing the rotation in lines 16–18 of Fig. 2 at a granularity of 8 bits (instead of 64). To use tables for caching: (a) initialize a table with \(i=0,\ldots ,7\) right-shifts of the syndrome (only 8 rows); (b) modify lines 14–15 accordingly; (c) load (in constant-time) the relevant row before calling the Barrel-shifter. As a result, lines 16–18 can be removed, avoiding all the shift operations. As explained above, this technique does not improve the performance of the rotation.

Fig. 3.

Right rotate of 512-bit registers using the new Ice-Lake instructions. The initialization in Fig. 2 (lines 1–10) is omitted.

6.2 Using Vector-PCLMULQDQ and Vector-AES

The Ice-Lake processors support the new vectorized VPCLMULQDQ and vector-AES instructions [1]. We used the multiplication code presented in [9, Figure 2], and the CTR DRBG code of [6, 10], in order to improve our BIKE implementation. We also used larger caching of random values (1,024 bytes instead of 16) to fully leverage the DRBG. The results are given in Sect. 7.

7 Performance Studies

We start by describing our experimentation platforms and measurement methodology. The experiments were carried out on two platforms (Intel® Turbo Boost Technology was turned off on both):

  • EC2 Server: An AWS EC2 instance with the \(6^{th}\) Intel®Core\(^{TM}\) Generation (Micro architecture Codename “Sky Lake” [SKL]) Xeon®Platinum 8175M CPU 2.50 GHz. This platform has 384 GB RAM, 32K L1d and L1i cache, 1MiB L2 cache, and 32MiB L3 cache.

  • Ice-Lake: Dell XPS 13 7390 2-in-1 with the \(10^{th}\) Intel®Core\(^{TM}\) Generation (Micro architecture Codename “Ice Lake” [ICL]) Intel®Core\(^{TM}\) i7-1065G7 CPU 1.30 GHz. This platform has 16 GB RAM, 48K L1d and 32K L1i cache, 512K L2 cache, and 8MiB L3 cache.

The Code. The code is written in C and x86-64 assembly. The implementations use the AVX2 and AVX512 instructions, and the (vectorized) VPCLMULQDQ and vector-AES instructions, when available. The code was compiled with gcc (version 8.3.0) in 64-bit mode, using the “-O3” optimization level with the “-funroll-all-loops” flag, and run on a Linux (Ubuntu 18.04.2 LTS) OS.

Measurements Methodology. The performance measurements reported hereafter are counted in processor cycles (per single core), where a lower count is better. All the results were obtained using the same measurement methodology, as follows. Each measured function was isolated, run 25 times (warm-up), followed by 100 iterations that were clocked (using the RDTSC instruction) and averaged. To minimize the effect of background tasks running on the system, every experiment was repeated 10 times, and the minimum result was recorded.
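The methodology can be sketched as a small harness (ours, x86-64/gcc specific; `spin` is a stand-in workload for illustration):

```c
#include <stdint.h>
#include <x86intrin.h>

typedef void (*measured_fn)(void);

/* 25 warm-up runs, 100 clocked runs averaged via RDTSC, best of 10. */
uint64_t measure_cycles(measured_fn f) {
    uint64_t best = UINT64_MAX;
    for (int rep = 0; rep < 10; rep++) {
        for (int i = 0; i < 25; i++) f();          /* warm-up           */
        uint64_t start = __rdtsc();
        for (int i = 0; i < 100; i++) f();
        uint64_t avg = (__rdtsc() - start) / 100;  /* cycles per call   */
        if (avg < best) best = avg;                /* min of 10 repeats */
    }
    return best;
}

/* example workload, only for demonstration */
static volatile uint64_t sink;
static void spin(void) { for (int i = 0; i < 1000; i++) sink += (uint64_t)i; }
```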

7.1 Decoding and Decapsulation: Performance Studies

Performance of BG. Table 3 shows the performance of our implementation, which uses the rotation and bit-slice-adder techniques of [3, 13], and compares the results to the additional implementation of BIKE [7]. The results show a speedup of \(3.75 \times \)–\(6.03 \times \) for the portable (C code) version of the decoder and a \(1.1 \times \) speedup for the AVX512 implementation, but a \(0.66 \times \) slowdown for the AVX2 implementation. The AVX512 implementation leverages the masked store and load operations that do not exist in the AVX2 architecture. Note that key generation is faster because generation of mock-bits is no longer needed.

Table 4 compares our implementations compiled for different instruction sets available on Ice-Lake. The results for BIKE-1 Level-1 show speedups of \(1.47 \times \), \(1.28 \times \), and \(1.26 \times \) for key generation, encapsulation, and decapsulation, respectively. Even better speedups are shown for BIKE-1 Level-3: \(1.58 \times \), \(1.39 \times \), and \(1.24 \times \), respectively.

Consider the 6th column and the BIKE-1 Level-1 results. The \(\sim {}94\text {K}\) (93,521) cycles of the key generation consist of 13K, 13K, 1K, 1K, 5.5K, 26K, 26K cycles for generating \(h_0, h_1, \sigma _0, \sigma _1, g, f_0, f_1\), respectively (and some additional overheads). Compare this to the 3rd column of the table (without the new Ice-Lake extensions): 13.6K, 13.6K, 2K, 2K, 6K, 46K, 46K, respectively. Indeed, as reported in [9], the use of VPCLMULQDQ contributes a \(2\times \) speedup to the polynomial multiplication. Note that the vector-AES does not contribute much, because the bottleneck in generating \(h_0, h_1\) is the constant-time rejection sampling check (whether a bit is set) and not the AES calculations.

Table 5 compares our right-rotation method to the snippet shown in [13]. To accurately measure these “short” functionalities, we ported them into separate compilation units and compiled them separately using the “-c” flag. In addition, the number of repetitions was increased to 10,000. Our method improves the rotation significantly (by \(2.3 \times \)) and contributes \(\sim {}2\%\) to the overall decoding performance.

Table 3. The EC2 server performance of BIKE-1 Level-1 when using the BG decoder with 3 iterations. The cycles (in columns 4, 5) are counted in millions.
Table 4. BIKE-1 Level-1 using the BG decoder with 3 iterations. Performance on Ice-Lake using various instruction sets.
Table 5. Rotation performance, comparison of our impl. and the snippet of [13].

8 Discussion

Our study points to an unexpected winner among the shades-of-gray combination decoders: BGF offers the most favorable DFR-efficiency trade-off. Indeed (see Table 2), it is possible to trade BG, which was our leading option so far, for another decoder and have the same or even better DFR for the same block size. The advantage is either in performance (e. g., BGF with 6 iterations is \(\frac{12}{8} = 1.5 \times \) faster than BG with 4 iterations) or in implementation simplicity (e. g., the B decoder that does not involve gray steps).

A Comment on the Backflip Decoder. In [8] we compared Backflip with BG and showed that it requires a few more steps to achieve the same DFR (in the relevant range of r). We note that a Backflip iteration is practically equivalent to Step I of BG plus the Time-To-Live (TTL) handling. It is possible to improve the constant-time TTL handling with the bit-slicing techniques and reduce this gap. However, this would not change the DFR-efficiency properties reported here.

Further Optimizations. The performance of BIKE’s constant-time implementation is dominated by three primitives: (a) polynomial multiplication (it remains a significant portion of the computation even after using the VPCLMULQDQ instructions); (b) polynomial rotation (which requires extensive memory access); (c) the rejection sampling (approximately \(25\%\) of the key generation). This paper showed how some of the new Ice-Lake features can already be used for performance improvements. Further optimizations are an interesting challenge.

Parameter Choice Recommendations for BIKE. BIKE-1 Level-1 (IND-CCA) [2] uses \(r=11,779\) with a target DFR of \(2^{-128}\), and uses the Backflip decoder. Our paper [8] shows some problems with this decoder and therefore recommends using BG instead. It also shows that even if \(\text {DFR}=2^{-128}\), there is still a gap to be addressed in order to claim IND-CCA security (roughly speaking, a bound on the number of weak keys). We set aside this gap for now and consider a non-weak key. If we limit the number of usages of this key to Q and choose r such that \(Q \cdot {DFR} < 2^{-\mu }\) (for some target margin \(\mu \)), then the probability that an adversary with at most Q queries sees a decoding failure is at most \(2^{-\mu }\). We suggest that KEMs should use ephemeral keys (i. e., \(Q=1\)) for forward secrecy, and this usage does not mandate IND-CCA security (IND-CPA suffices). Here, from the practical viewpoint, we only need to target a DFR sufficiently small that decapsulation failures would not be a significant operability impediment. However, an important property that is desired, even with ephemeral keys, is some guarantee that an inadvertent \(1 \le \alpha \) times key reuse (where \(\alpha \) is presumably not too large) would not crash the security. This suggests selecting r so that \(\alpha \cdot {DFR} < 2^{-\mu }\). For example, taking \(\mu =32\) and \(\alpha =2^{32}\) (an extremely large number of “inadvertent” reuses), we can target a DFR of \(2^{-64}\). Using BGF with 5 iterations, we can use \(r=11,171\), which is smaller than the 11,779 that is currently used for BIKE.
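The key-reuse budget check is simple arithmetic in log2 scale (a hypothetical helper of ours; we take the bound as inclusive so that the \(\alpha =2^{32}\), \(\mathrm {DFR}=2^{-64}\), \(\mu =32\) example lands exactly on the boundary):

```c
/* alpha * DFR <= 2^-mu  <=>  log2(alpha) + log2(DFR) <= -mu */
int reuse_budget_ok(double log2_alpha, double log2_dfr, double mu) {
    return log2_alpha + log2_dfr <= -mu;
}
```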