
1 Introduction

Homomorphic encryption (HE) enables computing over encrypted data without decrypting them first; thus, it is becoming increasingly popular as a solution for processing confidential data in untrusted environments. Since Gentry’s introduction of the first fully homomorphic encryption (FHE) scheme over ideal lattices [14], continuous efficiency improvements have brought these techniques closer to practical application domains. As a result, lattice-based FHE schemes are increasingly used in experimental systems [23, 26, 27], and some of them are now proposed as an industry standard [2].

Cheon et al. [11] introduced a leveled encryption scheme for approximate arithmetic (CKKS); the scheme is capable of homomorphically evaluating arbitrary polynomial functions over encrypted complex-number vectors. Although the family of leveled cryptosystems enables only a finite multiplicative depth, with each multiplication consuming one level, the CKKS scheme enables the homomorphic re-encryption of an exhausted ciphertext into an almost fresh one. This capability, commonly called bootstrapping, theoretically enables the evaluation of arbitrary-depth circuits. In practice, however, the bootstrapping procedure for CKKS is approximate, and its precision and performance determine the actual maximum depth of a circuit.

Since the initial CKKS bootstrapping procedure by Cheon et al. [10] and until the most recent work by Han and Ki [20], which operates on the full-RNS (residue number system) version of CKKS, the bootstrapping efficiency has improved by several orders of magnitude. However, this operation remains a bottleneck for its potential applications, and its performance is crucial for the adoption of the scheme. Bootstrapping performance can be improved in two ways: (i) by adapting the bootstrapping circuit representation with HE-friendly numerical methods, and (ii) by optimizing the scheme operations themselves, which also improves the overall scheme performance.

All current CKKS bootstrapping approaches [6, 10, 20] rely on sparse secret-keys to reduce the depth of their circuit representation, and none of them has proposed parameters with an equivalent security of at least 128 bits under the recent attacks on sparse R-LWE secrets [9, 29]. The lack of stability in the security of sparse R-LWE secrets has led the standardization initiatives to exclude sparse keys, hence also the bootstrapping operation, from the currently proposed standards [2]. This raises the question of the practicality of a bootstrapping procedure that would not require the use of sparse secret-keys.

1.1 Our Results

We propose an efficient bootstrapping procedure for the full-RNS CKKS scheme; it does not necessarily require the use of sparse secret-keys and provides a greater throughput than the current state of the art (Definition 1 in Sect. 7.1). To achieve this, we make the following contributions:

Homomorphic Polynomial Evaluation (Sect. 3). The full-RNS variant of the CKKS scheme restricts the re-scale operation only to the division by the factors \(q_{i}\) of the ciphertext modulus Q. As the choice of these factors is constrained to those enabling a number theoretic transform (NTT), the rescale cannot be done by a power of two (as in the original CKKS scheme) and it introduces a small scale deviation in the process. For complex circuits, such as polynomial evaluations, additions between ciphertexts of slightly different scales will eventually occur and will introduce errors.

We observe that this problem is trivially solved for linear circuits, by scaling the plaintext constants by the modulus \(q_{i}\) by which the ciphertext will be divided during the next rescale. By doing so, the rescale is exact and the ciphertext scale is unchanged after the operation. As a polynomial can be computed by recursive evaluations of a linear function, the linear case can be generalized. In this work, we propose a generic algorithm that consumes an optimal number of \(\lceil \log (d+1)\rceil \) levels to homomorphically evaluate degree-d polynomial functions. Starting from a user-defined output scale, the intermediate scales can be back-propagated in the recursion, thus ensuring that each and every homomorphic addition occurs between ciphertexts of the same scale (hence is errorless). Our algorithm is, to the best of our knowledge, the first general solution for the problem of the approximate rescale arising from the full-RNS variant of the CKKS scheme.
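The linear case can be illustrated with exact rational arithmetic. In the sketch below, the moduli are made-up stand-ins for NTT-friendly primes close to \(2^{55}\); only the scale bookkeeping is modeled, not the ciphertext contents:

```python
from fractions import Fraction

# Illustrative moduli: NTT-friendly primes cannot be exact powers of two,
# so each q_i only lies close to 2**55 (these particular values are made up).
q_chain = [2**55 - 3, 2**55 + 9, 2**55 - 27]
delta = Fraction(2**55)  # target ciphertext scale

# Naive approach: encode each plaintext constant at the ciphertext scale;
# every rescale then divides by q_i instead of delta, and the scale
# slowly drifts away from its target.
scale_naive = delta
for q in q_chain:
    scale_naive = (scale_naive * delta) / q  # constant mult., then rescale

# Exact approach: encode the constant at scale q_i, the modulus that the
# next rescale will divide by; the rescale then cancels exactly.
scale_exact = delta
for q in q_chain:
    scale_exact = (scale_exact * Fraction(q)) / q  # delta * q / q = delta

assert scale_exact == delta   # scale-preserving
assert scale_naive != delta   # drift: additions would now introduce errors
```

The drift of the naive approach is tiny at each step, but any addition between ciphertexts whose scales have drifted differently introduces an error.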

Faster Matrix \(\boldsymbol{\times }\) Ciphertext Products (Sect. 4). The most expensive CKKS homomorphic operation is the key-switch. This operation is an integral building block of the homomorphic multiplication, slot rotations, and conjugation. The CKKS bootstrapping requires two linear transformations that involve a large number of rotations (hence key-switch operations), so minimizing the number of key-switches and/or their complexity has a significant effect on its performance.

Given an \(n\times n\) plaintext matrix M and an encrypted vector \(\mathbf {v}\), all previous works on the CKKS bootstrapping [6, 10, 20] use a baby-step giant-step (BSGS) algorithm, proposed by Halevi and Shoup [18], to compute the encrypted product \(M\mathbf {v}\) in \(\mathcal {O}(\sqrt{n})\) rotations. These works treat the key-switch procedure as a black-box and try to reduce the number of times it is executed. Therefore, they do not exploit the hoisting proposed by Halevi and Shoup [19].

We improve this BSGS algorithm by proposing a new format for rotation keys and a modified key-switch procedure that extends the hoisting technique to a second layer. This strategy is generic and it reduces the theoretical minimum complexity (in terms of modular products) of any linear transformation over ciphertexts. In our bootstrapping it reduces the cost of the linear transformations by roughly a factor of two compared to the previous hoisting approach.

Improved Bootstrapping Procedure (Sect. 5). We integrate our proposed improvements in the bootstrapping circuit proposed by Cheon et al. [10], Chen et al. [6], Cheon et al. [7] and Han and Ki [20]. We propose a new high-precision and faster bootstrapping circuit with updated parameters that are 128-bit secure, even when considering the most recent attacks on sparse keys [9, 29].

Parameterization and Evaluation (Sect. 6). We discuss the parameterization of the CKKS scheme and its bootstrapping circuit, and we propose a procedure to choose and fine-tune the parameters for a given use-case.

We implemented our contributions, as well as our bootstrapping, in the open source library Lattigo: https://github.com/ldsec/lattigo. To the best of our knowledge, this is the first public and open-source implementation of the bootstrapping for the full-RNS variant of the CKKS scheme.

2 Background and Related Work

We now recall the full-RNS variant of the CKKS encryption scheme and review its previously proposed bootstrapping procedures.

2.1 The Full-RNS CKKS Scheme

We consider the CKKS encryption scheme [11] in its full-RNS variant [8]: the polynomial coefficients are always represented in the RNS and NTT domains.

Notation. For a fixed power-of-two N and \(L+1\) distinct primes \(q_{0},\dots ,q_{L}\), we define \(Q_{L} = \prod _{i=0}^{L}q_{i}\) and \(R_{Q_{L}} = \mathbb {Z}_{Q_{L}}[X]/(X^{N}+1)\), the cyclotomic polynomial ring over the integers modulo \(Q_{L}\). Unless otherwise stated, we consider elements of \(R_{Q_{L}}\) as their unique representative in the RNS domain: \(R_{q_0} \times R_{q_1} \times ... \times R_{q_L} \cong R_{Q_L}\): a polynomial in \(R_{Q_{L}}\) is represented by a \((L+1) \times N\) matrix of coefficients. We denote single elements (polynomials or numbers) in italics, e.g., a, and vectors of such elements in bold, e.g., \(\mathbf {a}\), with \(\mathbf {a}||\mathbf {b}\) the concatenation of two vectors. We denote \(a^{(i)}\) the element at position i of the vector \(\mathbf {a}\) or the degree-i coefficient of the polynomial a. We denote ||a|| the infinity norm of the polynomial (or vector) a in the power basis and \(\textsf {hw}(a)\) the Hamming weight of the polynomial (or vector) a. We denote \(\langle \mathbf {a},\mathbf {b}\rangle \) the inner product between the vectors \(\mathbf {a}\) and \(\mathbf {b}\). Given two vectors \(\mathbf {a}\) and \(\mathbf {b}\), each of n values, we denote \(\log (\epsilon ^{-1})\) the negative log of the L1 norm of their difference: \(\epsilon = \tfrac{1}{n}\sum _{i=0}^{n-1}|a^{(i)}-b^{(i)}|\). \([x]_{Q}\) denotes reduction of x modulo Q and \(\lfloor x\rfloor \), \(\lceil x \rceil \), \(\lfloor x\rceil \) the rounding of x to the previous, the next, and the closest integer, respectively (if x is a polynomial, the operation is applied coefficient-wise). Unless otherwise stated, logarithms are in base 2.
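For reference, the precision metric \(\log (\epsilon ^{-1})\) defined above can be computed as follows (a small Python helper; the function name is ours):

```python
import math

def precision_bits(a, b):
    # Mean absolute difference between the two vectors, as defined above.
    eps = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return math.log2(1 / eps) if eps > 0 else math.inf

# Two vectors agreeing up to an offset of 2**-20 in every entry:
a = [0.5, -1.25, 3.0]
b = [x + 2**-20 for x in a]
print(precision_bits(a, b))  # → 20.0
```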

Plaintext and Ciphertext Space. A plaintext is a polynomial \(\textsf {pt} = m(Y)\in \mathbb {R}[Y]/(Y^{2n}+1)\) with \(Y = X^{N/2n}\) and n a power-of-two smaller than N. We define the following plaintext encodings: (i) The coefficient encoding, for which the message \(\mathbf {m} \in \mathbb {R}^{2n}\) is directly encoded as the coefficients of a polynomial in Y. (ii) The slots encoding, for which the message \(\mathbf {m}\in \mathbb {C}^{n}\) is subjected to the canonical embedding \(\mathbb {C}^{n}\rightarrow \mathbb {R}[Y]/(Y^{2n}+1)\), for which the negacyclic convolution in \(\mathbb {R}[Y]/(Y^{2n}+1)\) results in a Hadamard product in \(\mathbb {C}^{n}\).

We represent plaintexts and ciphertexts, respectively, by the tuples \(\{\textsf {pt}, Q_{\ell }, \varDelta \}\) and \(\{\textsf {ct}, Q_{\ell }, \varDelta \}\), where, for a secret \(s \in R_{Q_{L}}\), \(\textsf {pt}\) is a degree-zero polynomial in s, i.e. of \(R_{Q_{\ell }}\), and \(\textsf {ct}\) is a degree-one polynomial in s, i.e. of \(R^{2}_{Q_{\ell }}\). We define \(Q_{\ell } = \prod _{i=0}^{\ell }q_{i}\) as the modulus at level \(\ell \) and \(\varDelta \) as a scaling factor. We denote L as the maximum level and use \(0 \le \ell \le L\) to represent a level between the smallest level 0 and the highest level L. We refer to the depth of a circuit as the number of levels required for the evaluation of the circuit.

Scheme RNS-CKKS – Basic Operations

  • Setup\((N, h, b, \sigma )\): For a power-of-two ring degree N, a secret-distribution Hamming weight h, a standard deviation \(\sigma \), and a modulus bit-size b: Select the moduli chains \(\{q_{0},\dots ,q_{L}\}\) and \(\{p_{0}, \dots , p_{\alpha -1}\}\) composed of pairwise different NTT-friendly primes (i.e. \(q_{i} \equiv 1 \mod 2N\)) close to powers of two such that \(\log (\prod _{i=0}^{L} q_{i} \times \prod _{j=0}^{\alpha -1}p_{j}) \le b\). Set \(Q_{L} = \prod _{i=0}^{L} q_{i}\), \(P= \prod _{j=0}^{\alpha -1}p_{j}\).

    Define the following distributions over R: \(\chi _{\textsf {key}}\) with coefficients uniformly distributed over \(\{-1, 0, 1\}\) and exactly h non-zero coefficients. \(\chi _{\textsf {pkenc}}\) with coefficients distributed over \(\{-1, 0, 1\}\) with respective probabilities \(\{1/4, 1/2, 1/4\}\). \(\chi _{\textsf {err}}\) with coefficients distributed according to a discrete Gaussian distribution with standard deviation \(\sigma \) and truncated to \([-\lfloor 6\sigma \rfloor , \lfloor 6\sigma \rfloor ]\).

  • Encode (\(\mathbf {m}\), \(\varDelta \), n, \(\ell \)) (coefficients\(\rightarrow \)slots): For a message \(\mathbf {m} \in \mathbb {C}^{n}\) with \(1 \le n < N\), where n divides N, apply the canonical map \(\mathbb {C}^{n} \rightarrow \mathbb {R}[Y]/(Y^{2n}+1) \rightarrow R_{Q_{\ell }}\) with \(Y = X^{N/2n}\). Compute \(\mathbf {m}' = \text {FFT}^{-1}_{n}(\mathbf {m})\) and set \(\mathbf {m}'_{0}||\mathbf {m}'_{1} \in \mathbb {R}^{2n}\), with \(\mathbf {m}'_{0} = \tfrac{1}{2}(\mathbf {m}'+\overline{\mathbf {m}'})\) and \(\mathbf {m}'_{1} = \tfrac{-i}{2}(\mathbf {m}'-\overline{\mathbf {m}'})\), as a polynomial in Y. Finally, scale the coefficients by \(\varDelta \) and round them to the nearest integer, apply the change of variable \(Y \rightarrow X\) and return \(\{\textsf {pt}, Q_{\ell }, \varDelta \}\).

  • Decode(\(\{\textsf {pt}, Q_{\ell }, \varDelta \}, n)\) (slots\(\rightarrow \)coefficients): For \(1 \le n < N\), where n divides N, apply the inverse of the canonical map \(R_{Q_{\ell }} \rightarrow \mathbb {R}[Y]/(Y^{2n}+1) \rightarrow \mathbb {C}^{n}\), with \(Y = X^{N/2n}\). Map \(\textsf {pt}\) to the vector \(\mathbf {m}'_{0}||\mathbf {m}'_{1}\in \mathbb {R}^{2n}\) and return \(\mathbf {m} = \text {FFT}_{n}(\varDelta ^{-1}\cdot (\mathbf {m}'_{0} + i\cdot \mathbf {m}'_{1}))\).

  • SecKeyGen(\(\cdot \)): Sample \(s \leftarrow \chi _{\textsf {key}}\) and return the secret key s.

  • SwitchKeyGen(\(s, s', \mathbf{w} \)): For \(\mathbf{w} \) an integer decomposition basis of \(\beta \) elements, sample \(a_{i} \in R_{PQ_{L}}\) and \(e_{i} \leftarrow \chi _{\textsf {err}}\) and return the key-switch key: \(\textsf {swk}_{(s\rightarrow s')} = (\textsf {swk}^{(0)}_{(s\rightarrow s')},\) \(\dots , \textsf {swk}^{(\beta -1)}_{(s\rightarrow s')})\), where \(\textsf {swk}^{(i)}_{(s\rightarrow s')} = (-a_{i}s' + sw^{(i)}P + e_{i}, a_{i})\).

  • PubKeyGen(s): Set the public encryption key \(\textsf {pk} \leftarrow \textsf {SwitchKeyGen}(0, s, (1))\), the relinearization key \(\textsf {rlk} \leftarrow \textsf {SwitchKeyGen}(s^2, s, \mathbf{w} )\), the rotation keys \(\textsf {rot}_{k} \leftarrow \textsf {SwitchKeyGen}(s^{5^k}, s, \mathbf{w} )\) (a different key has to be generated for each different k), and the conjugation key \(\textsf {conj} \leftarrow \textsf {SwitchKeyGen}(s^{-1}, s, \mathbf{w} )\) and return: \((\textsf {pk}, \textsf {rlk}, \{\textsf {rot}_k\}_k, \textsf {conj})\).

  • \(\textsf {Enc}(\{\textsf {pt}, Q_{\ell }, \varDelta \},s)\): Sample \(a \in _{u} R_{Q_{\ell }}\) and \(e\leftarrow \chi _{\textsf {err}}\), set \(\textsf {ct} = (-as + e, a) + (\textsf {pt}, 0)\) and return \(\{\textsf {ct}, Q_{\ell }, \varDelta \}\).

  • \(\textsf {PubEnc}(\{\textsf {pt}, Q_{\ell }, \varDelta \}, \textsf {pk})\): Sample \(u \leftarrow \chi _{\textsf {pkenc}}\) and \(e_{0}, e_{1} \leftarrow \chi _{\textsf {err}}\), set:

    \(\textsf {ct} = \textsf {SwitchKey}(u, \textsf {pk}) + (\textsf {pt} + e_0, e_1)\) and return \(\{\textsf {ct}, Q_{\ell }, \varDelta \}\).

  • \(\textsf {SwitchKey}(d, \textsf {swk}_{s\rightarrow s'})\): For \(d \in R_{Q_{\ell }}\) a polynomial, decompose d base \(\mathbf {w}\) such that \(d = \langle \mathbf {d}, \mathbf {w}\rangle \) and return \((d_{0}, d_{1}) = \lfloor P^{-1} \cdot \langle \mathbf {d}, \textsf {swk}_{s\rightarrow s'}\rangle \rceil \) mod \(Q_{\ell }\) for \(P^{-1} \in \mathbb {R}\).

  • \(\textsf {Dec}(\{\textsf {ct}, Q_{\ell }, \varDelta \}, s)\): For \(\textsf {ct} = (c_{0}, c_{1})\), return \(\{\textsf {pt} = c_{0} + c_{1}s, Q_{\ell }, \varDelta \}\).

The homomorphic operations of CKKS are detailed in the extended version of the paper [4].
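As an illustration of the Encode/Decode maps, the canonical embedding can be sketched in plain Python, ignoring the RNS/NTT representation and the modular reduction; the helper names and the toy parameters are ours:

```python
import cmath
import random

def canonical_embedding_matrix(n):
    # Evaluation points: the primitive 4n-th roots of unity zeta^(5^j),
    # following the rotation-group ordering used by CKKS.
    zeta = cmath.exp(1j * cmath.pi / (2 * n))   # primitive 4n-th root
    powers = [pow(5, j, 4 * n) for j in range(n)]
    return [[zeta ** (p * k) for k in range(2 * n)] for p in powers]

def encode(m, delta, n):
    """Map m in C^n to 2n integer coefficients (scaled by delta, rounded)."""
    V = canonical_embedding_matrix(n)
    coeffs = []
    for k in range(2 * n):
        # (1/n) * Re(V^H m): the real pre-image under the embedding.
        c = sum(V[j][k].conjugate() * m[j] for j in range(n)).real / n
        coeffs.append(round(c * delta))
    return coeffs

def decode(coeffs, delta, n):
    """Evaluate the coefficients at the embedding points, undoing the scale."""
    V = canonical_embedding_matrix(n)
    return [sum(V[j][k] * coeffs[k] for k in range(2 * n)) / delta
            for j in range(n)]

random.seed(1)
n, delta = 4, 2 ** 40
m = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
m2 = decode(encode(m, delta, n), delta, n)
# The roundtrip is approximate: the only loss is the rounding by 1/delta.
assert max(abs(x - y) for x, y in zip(m, m2)) < 1e-9
```

The roundtrip error is bounded by the rounding step, which is why the scaling factor \(\varDelta \) directly controls the plaintext precision.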

2.2 CKKS Bootstrapping

Let \(\textsf {ct} = (c_{0}, c_{1})\) be a ciphertext at level \(\ell = 0\), and s a secret key of Hamming weight h, such that \(\textsf {Decrypt}(\textsf {ct}, s)=[c_{0} + c_{1}s]_{Q_0} = \textsf {pt}\). The goal of the bootstrapping operation is to compute a ciphertext \(\textsf {ct}'\) at level \(L-k > 0\) (where k is the depth of the bootstrapping circuit) such that \(Q_{L-k} \gg Q_{0}\) and \([c'_{0} + c'_{1}s]_{Q_{L-k}} \approx \textsf {pt}\). Since \([c_{0} + c_{1}s]_{Q_L} = \textsf {pt} + Q_{0}\cdot I\), where I is an integer polynomial [10], bootstrapping is equivalent to an extension of the CRT basis, followed by a homomorphic reduction modulo \(Q_0\).
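The identity \([c_{0} + c_{1}s]_{Q_{L}} = \textsf {pt} + Q_{0}\cdot I\) can be checked with a toy scalar model, in which s is reduced to its h non-zero \(\pm 1\) coefficients (all values below are illustrative, not actual scheme parameters):

```python
import random

random.seed(7)
Q0, h = 2**55 - 273, 16          # illustrative modulus and Hamming weight
pt = 12345

# Scalar stand-in for one decryption: s contributes h signed terms.
s = [random.choice((-1, 1)) for _ in range(h)]
c1 = [random.randrange(Q0) for _ in range(h)]
inner = sum(c * si for c, si in zip(c1, s))
c0 = (pt - inner) % Q0
assert (c0 + inner) % Q0 == pt   # decryption at level 0

# Over a much larger modulus QL (the modulus raise), the same integers
# decrypt to pt + Q0 * I for an integer I whose magnitude grows with h.
raised = c0 + inner              # |raised| << QL, so no reduction occurs
I = (raised - pt) // Q0
assert raised == pt + Q0 * I
assert abs(I) <= h
```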

Cheon et al. proposed the first procedure [10] to compute this modular reduction, by (i) homomorphically applying the encoding algorithm, to enable the parallel (slot-wise) evaluation, (ii) computing a modular reduction approximated by a scaled sine function on each slot, and (iii) applying the decoding algorithm to retrieve a close approximation of \(\textsf {pt}\) without the polynomial I:

$$\begin{aligned} \underbrace{\textsf {Encode}(\textsf {pt} + Q_{0} \cdot I) = \textsf {pt}'}_{\text {(i) }\textsf {CoeffsToSlots}(\textsf {pt} + Q_{0} \cdot I)} \ \Rightarrow \ \underbrace{\frac{Q_{0}}{2\pi }\sin \left( \frac{2\pi \textsf {pt}'}{Q_{0}}\right) = \textsf {pt}''}_{\text {(ii) }\textsf {EvalSine}(\textsf {pt}')} \ \Rightarrow \ \underbrace{\textsf {Decode}(\textsf {pt}'') \approx \textsf {pt}}_{\text {(iii) }\textsf {SlotsToCoeffs}(\textsf {pt}'')}. \end{aligned}$$

The complexity of the resulting bootstrapping circuit is influenced by two parameters: The first one is the secret-key Hamming weight h, which directly impacts the depth of the bootstrapping circuit. Indeed, Cheon et al. show that \(||I||\le \mathcal {O}(\sqrt{h})\) with very high probability. A denser key will therefore require evaluating a larger-degree polynomial, with a larger depth. The second parameter is the number of plaintext slots n that has a direct impact on the complexity of the circuit (but not on its depth). By scaling down the values to compress them closer to the origin, Cheon et al. are able to evaluate the sine function by using a low-degree Taylor series of the complex exponential and then use repeated squaring (the double angle formula) to obtain the correct result. In their approach, the sine evaluation dominates the circuit’s depth, whereas the homomorphic evaluation of the encoding and decoding algorithms, which they express as an \(n\times n\) matrix-vector product, dominates its width.
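Both ingredients, the scaled sine as an approximate modular reduction and the double-angle formula used for the repeated squaring, can be checked numerically (plain floats, illustrative parameters):

```python
import math

Q0 = 2**55
# The scaled sine closely approximates the centered reduction modulo Q0
# when the message is small relative to Q0.
pt, I = 987654321, 5          # message and integer overflow, pt << Q0
x = pt + Q0 * I
approx = (Q0 / (2 * math.pi)) * math.sin(2 * math.pi * x / Q0)
assert abs(approx - pt) / pt < 1e-6

# Double-angle trick: evaluate cos on x / 2**r (where a low-degree
# polynomial is accurate), then apply cos(2x) = 2cos(x)^2 - 1 r times.
r, theta = 8, 1.3
c = math.cos(theta / 2**r)    # stand-in for a low-degree interpolant
for _ in range(r):
    c = 2 * c * c - 1
assert abs(c - math.cos(theta)) < 1e-9
```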

In a subsequent work, Chen et al. [6] propose to compute the encoding by homomorphically evaluating the Cooley-Tukey algorithm. This approach needs \(\log (n)\) depth (the number of iterations of the algorithm); to reduce this depth, Chen et al. merge several iterations together, at the cost of an increased complexity. In a concurrent work, Cheon et al. [7] explored techniques to efficiently evaluate DFTs on ciphertexts. They show how to factorize the encoding matrices into a series of \(\log _{r}(n)\) sparse matrices, where r is a power-of-two radix. The contributions in [6, 7] enabled the acceleration of the homomorphic evaluation of the encoding functions by two orders of magnitude. Chen et al. [6] also improved the approximation of the scaled sine function by using a Chebyshev interpolant.

More recently, Han and Ki port the bootstrapping procedure to the full-RNS variant of CKKS, with several improvements to the bootstrapping circuit and to the CKKS scheme [20]. They propose a generalization of its key-switch procedure by using an intermediate RNS decomposition that enables a trade-off between the complexity of the key-switch and the homomorphic capacity of a fresh ciphertext. They also give an alternative way to approximate the scaled sine function, which accounts for the magnitude of the underlying plaintext and uses the cosine function and the double angle formula. Combined, these changes yield an acceleration factor of 2.5 to 3, compared to the work of Chen et al. [6].

Both works [6, 7] were implemented with HEAAN [21], but only the implementation of the former was published. The work of [20] was implemented with SEAL [28], but its implementation has not been published.

2.3 Security of Sparse Keys

One commonality between all the aforementioned works is the use of sparse secret-keys with a Hamming weight \(h=64\). A key with a small Hamming weight enables a low-depth bootstrapping circuit, which is essential for its practicality. However, recent advances in the cryptanalysis of the R-LWE problem show that hybrid attacks specifically targeting such sparse keys can severely affect their security [9, 29]. In light of the most recent attacks, Curtis and Player [12] estimate that, for a sparse key with \(h=64\) and a ring degree \(N = 2^{16}\), the modulus can be at most 990 bits to achieve a security of 128 bits. In their initial bootstrapping proposal, Cheon et al. [10] use the parameters {\(N=2^{16}\), \(\log (Q)=2480\), \(h=64\), \(\sigma =3.2\)} and estimate their security at 80 bits. In their work, Han and Ki [20] propose new parameter sets, one of which they claim provides 128 bits of security: {\(N=2^{16}\), \(\log (Q)=1450\), \(h=64\), \(\sigma =3.2\)}. However, these estimates are based on results obtained with Albrecht’s estimator [1] that, at the time, did not take into account the most recent attacks on sparse keys. The security of the parameter set {\(N=2^{16}\), \(\log (Q)=1250\), \(h=64\), \(\sigma =3.2\)} is estimated at 113 bits in the more recent work by Son and Cheon [29]. This sets a loose upper bound on the security of the parameters (which have a 1450-bit modulus) proposed by Han and Ki [20]. Therefore, the bootstrapping parameters must be updated to comply with the most recent security recommendations, as none of the parameters proposed in the current works achieves a security of 128 bits.

3 Homomorphic Polynomial Evaluation

The main disadvantage of the full-RNS variant of CKKS stems from its rescale operation, which does not divide the scale by a power-of-two, as in the original scheme, but by one of the moduli. These moduli are chosen, for efficiency purposes, as distinct NTT-friendly primes [8]; under this constraint, the power-of-two rescale of the original CKKS scheme can only be approximated. As a result, ciphertexts at the same level can have slightly different scales (depending on the previous homomorphic operations), and additions between such ciphertexts will introduce an error proportional to the difference between their scales. Addressing this issue in a generic and practical way is crucial for the adoption of CKKS.

As a significant step toward this goal, we introduce a homomorphic polynomial-evaluation algorithm that is depth-optimal and ensures that additions are always made between ciphertexts with the exact same scale.

3.1 The Baby-Step Giant-Step (BSGS) Algorithm

In order to minimize the number of ciphertext-ciphertext multiplications in their bootstrapping circuit, Han and Ki [20] adapt a generic baby-step giant-step (BSGS) polynomial-evaluation algorithm for polynomials expressed in a Chebyshev basis. Algorithm 1 gives a high-level description of the procedure.

For a polynomial p(t) of degree d, with \(\textsf {m} = \lceil \log (d+1) \rceil \) and \(\textsf {l} = \lfloor \textsf {m}/2 \rfloor \), the algorithm first decomposes p(t) into \(\lceil (d+1)/2^{\textsf {l}}\rceil \) polynomials \(u_{i, 2^{\textsf {l}}}(t) = \sum _{j=0}^{2^{\textsf {l}}-1}c_{i, j} \cdot T_{j}(t)\), each of degree smaller than \(2^{\textsf {l}}\), with \(c_{i, j}\in \mathbb {C}\) and \(T_{0\le j < 2^{\textsf {l}}}\) a pre-computed power basis. We denote \(u_{\lfloor d/2^{\textsf {l}} \rfloor , 2^{\textsf {l}}}(t)\) as \(u_{\textsf {max}}\). The BSGS algorithm then recursively combines these polynomials as \(u_{i, 2^{j+1}}(t) = u_{2i+1, 2^{j}}(t)\cdot T_{2^{j}}(t) + u_{2i, 2^{j}}(t)\) in a tree-like manner, by using a second pre-computed power basis \(T_{2^{\textsf {l}\le i<\textsf {m}}}(t)\), to minimize the number of non-scalar multiplications. The algorithm requires \(2^{\textsf {m}-\textsf {l}} + 2^{\textsf {l}} + \textsf {m} - \textsf {l} - 3 + \lceil (d+1)/2^{\textsf {l}}\rceil \) non-scalar products and has, in the best case, depth \(\textsf {m}\).

[Algorithm 1]
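A minimal sketch of such a BSGS evaluation, written in the power basis for clarity (the actual algorithm operates in the Chebyshev basis, which is better conditioned over encrypted data); the helper below is ours, assumes \(d \ge 2\), and counts the non-scalar products:

```python
import math

def bsgs_eval(coeffs, t):
    """Evaluate p(t) = sum(coeffs[j] * t**j) with O(sqrt(d)) non-scalar
    products. Returns (value, number_of_non_scalar_products)."""
    d = len(coeffs) - 1
    m = math.ceil(math.log2(d + 1))
    l = m // 2
    muls = 0

    # Baby steps: t^0, ..., t^(2^l - 1).
    baby = [1.0, t]
    for _ in range(2 ** l - 2):
        baby.append(baby[-1] * t)
        muls += 1

    # Giant steps: t^(2^l), ..., t^(2^(m-1)), obtained by repeated squaring.
    giant, power = {}, baby[2 ** (l - 1)]
    for j in range(l, m):
        power = power * power
        muls += 1
        giant[2 ** j] = power

    def rec(c):
        nonlocal muls
        if len(c) <= 2 ** l:                    # leaf: linear combination
            return sum(ci * bi for ci, bi in zip(c, baby))
        split = 2 ** (math.ceil(math.log2(len(c))) - 1)
        hi, lo = rec(c[split:]), rec(c[:split])
        muls += 1
        return hi * giant[split] + lo           # one non-scalar product

    return rec(coeffs), muls

coeffs = [(-1.0) ** j / (j + 1) for j in range(32)]   # a degree-31 polynomial
value, muls = bsgs_eval(coeffs, 0.7)
assert abs(value - sum(c * 0.7 ** j for j, c in enumerate(coeffs))) < 1e-12
assert muls < 31   # far fewer than one product per coefficient
```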

3.2 Errorless Polynomial Evaluation

We address the errors introduced by the approximate rescale for the evaluation of a polynomial p(t). We scale each of the leaf polynomials \(u_{i, 2^{\textsf {l}}}(t)\) by some scale \(\varDelta \) such that all evaluations of the subsequent polynomials \(u_{i, 2^{j+1}}(t) = u_{2i+1, 2^{j}}(t)\cdot T_{2^{j}}(t) + u_{2i, 2^{j}}(t)\) involve additions between ciphertexts of the same scale. More formally, let \(\varDelta _{u_{i, 2^{j+1}}(t)}\) be the scale of \(u_{i, 2^{j+1}}(t)\) (the result of the evaluation), \(\varDelta _{T_{2^{j}}(t)}\) the scale of the power-basis element \(T_{2^{j}}(t)\), and \(q_{T_{2^{j}}(t)}\) the modulus by which the product \(u_{2i+1, 2^{j}}(t) \cdot T_{2^{j}}(t)\) is rescaled. We set \(\varDelta _{u_{2i+1, 2^{j}}(t)} = \varDelta _{u_{i, 2^{j+1}}(t)} \cdot q_{T_{2^{j}}(t)} / \varDelta _{T_{2^{j}}(t)}\) and \(\varDelta _{u_{2i, 2^{j}}(t)} = \varDelta _{u_{i, 2^{j+1}}(t)}\). Starting from a target scale \(\varDelta _{p(t)}\) and \(p(t) = u_{0, 2^{\textsf {m}}}(t) = u_{1, 2^{\textsf {m}-1}}(t)\cdot T_{2^{\textsf {m}-1}}(t) + u_{0, 2^{\textsf {m}-1}}(t)\), we recursively compute and propagate down the tree the scale that each \(u_{i, 2^{j}}(t)\) should have. The recursion ends at the leaves \(u_{i, 2^{\textsf {l}}}\), whose required scales are then known. Since \(u_{i, 2^{\textsf {l}}}(t) = \sum _{j=0}^{2^{\textsf {l}}-1}c_{i, j} T_{j}(t)\), we can use the same technique to derive the value by which each of the coefficients \(c_{i, j}\) must be scaled, so that the evaluation of \(u_{i, 2^{\textsf {l}}}(t)\) is also done with exact additions and ends up with the desired scale.
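The scale bookkeeping alone can be simulated with exact rationals; the per-level scales and moduli below are made up, and only the scales (not the ciphertext values) are tracked:

```python
from fractions import Fraction

# Illustrative per-level data: the scale of each power-basis element
# T_(2^j) and the prime by which the product u_hi * T_(2^j) is rescaled
# (values are made up; NTT-friendly primes are only close to 2**55).
levels = [
    {"delta_T": Fraction(2**55 - 11), "q": Fraction(2**55 + 9)},
    {"delta_T": Fraction(2**55 + 21), "q": Fraction(2**55 - 27)},
    {"delta_T": Fraction(2**55 - 5),  "q": Fraction(2**55 + 33)},
]

def backpropagate(target, levels):
    """Return the scales the leaf polynomials must have so that every
    addition in the recombination tree is between equal scales."""
    scales = [target]
    for lvl in reversed(levels):           # from the root down to the leaves
        nxt = []
        for d in scales:
            nxt.append(d * lvl["q"] / lvl["delta_T"])  # high half (multiplied)
            nxt.append(d)                              # low half (added as-is)
        scales = nxt
    return scales

def recombine_scales(leaf_scales, levels):
    """Forward-simulate only the scale bookkeeping of the recombination."""
    scales = leaf_scales
    for lvl in levels:
        nxt = []
        for hi, lo in zip(scales[0::2], scales[1::2]):
            rescaled = hi * lvl["delta_T"] / lvl["q"]  # product, then rescale
            assert rescaled == lo                      # errorless addition
            nxt.append(rescaled)
        scales = nxt
    return scales[0]

target = Fraction(2**55)
leaves = backpropagate(target, levels)
assert recombine_scales(leaves, levels) == target      # scale-preserving
```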

[Algorithm 2]
Table 1. Comparison of the homomorphic evaluation of a Chebyshev interpolant of degree d of \(\cos (2\pi (x-0.25)/2^{r})\) in the interval \((-K/2^{r}, K/2^{r})\) followed by r evaluations of \(\cos (2x) = 2\cos ^{2}(x)-1\). The scheme parameters are \(N=2^{16}\), \(n=2^{15}\), \(h=196\) and \(q_{i} \approx 2^{55}\). \(\varDelta _{\epsilon } = |\varDelta _{\text {in}} - \varDelta _{\text {out}}|\cdot \varDelta _{\text {in}}^{-1}\).

Algorithm 2 is our proposed solution: it integrates our scale-propagation technique into the recursive decomposition of p(t) into q(t) and r(t). We compare Algorithms 1 and 2 in Table 1 by evaluating a Chebyshev interpolant of the homomorphic modular reduction done during the bootstrapping circuit. This function plays a central role in the bootstrapping, hence it is an ideal candidate for evaluating the effect of the proposed approaches (see Sect. 5.4). To verify that our algorithm correctly avoids additions between ciphertexts of different scales, we forced both algorithms to always rescale a ciphertext before an addition (in practice, it is better to check the levels of the ciphertexts before an addition and to dynamically assess whether a level difference can be used to scale one ciphertext to the scale of the other). We observe that our algorithm yields two advantages: it enables (i) a scale-preserving polynomial evaluation (the output scale is identical to the input scale), and (ii) a much better precision, by successfully avoiding errors due to additions between ciphertexts of different scales.

3.3 Depth-Optimal Polynomial Evaluation

In practice, Algorithm 1 consumes more than the optimal \(\textsf {m}\) levels for a specific class of degrees d, due to the way the rescale and level management work in the full-RNS variant of the CKKS scheme. This discrepancy arises from the following interactions (recall that Algorithm 1 evaluates each \(u_{i}(t)\) as a linear combination of a pre-computed power basis \(\{T_{0}(t), T_{1}(t), \dots , T_{2^{\textsf {l}}-1}(t)\}\)):

  1. If \(\textsf {l} > 1\), then the depth to evaluate \(T_{2^{\textsf {l}}-1}(t)\) is \(\textsf {l}\), and evaluating the \(u_{i}(t)\) will necessarily cost \(\textsf {l} + 1\) levels due to the constant multiplications.

  2. If \(\textsf {l} = 1\), then the depth to evaluate \(T_{1}(t)\) is zero, hence the depth to evaluate the \(u_{i}(t)\) remains \(\textsf {l}\).

  3. If \(d > 8\), then Algorithm 1 sets \(\textsf {l}>1\).

  4. If \(2^{\textsf {m}} - 2^{\textsf {l}-1} \le d < 2^{\textsf {m}}\), then all the elements of the power basis \(\{T_{2^{\textsf {l}}}, T_{2^{\textsf {l}+1}}, \dots , T_{2^{\textsf {m}-1}}\}\) need to be used during the recombination step of Algorithm 1.

Hence, if \(\textsf {l} > 1\) and \(d > 2^{\textsf {m}} - 2^{\textsf {l}-1}\), the total depth to execute Algorithm 1 is necessarily \(\textsf {m}+1\). This could be avoided by always setting \(\textsf {l} = 1\) regardless of d, but it would lead to a very costly evaluation, as the number of non-scalar multiplications would grow proportionally to d. To mitigate this additional cost, we enforce \(\textsf {l} = 1\) only for the coefficients of p(t) whose degree is \(\ge 2^{\textsf {m}} - 2^{\textsf {l}-1}\). Hence, Algorithm 2 first splits p(t) into \(p(t) = a(t) + b(t)\cdot T_{{2^{\textsf {m}}-2^{\textsf {l}-1}}}(t)\). It then evaluates a(t) with the optimal \(\textsf {l}\) and recurses on b(t) until \(\textsf {l}=1\). The number of additional recursions is bounded by \(\log (\textsf {m})\), because each recursion sets the new degree to half of the square root of the previous one. In practice, these additional recursions add only \(\lceil \log (d + 1 - (2^{\textsf {m}} - 2^{\textsf {l}-1}))\rceil \) non-scalar multiplications, but they enable the systematic evaluation of any polynomial by using exactly \(\textsf {m}\) levels.

3.4 Conclusions

For an extra cost of \(\lceil \log (d + 1 - (2^{\textsf {m}} - 2^{\textsf {l}-1}))\rceil \) ciphertext-ciphertext products, our proposed algorithm guarantees an optimal depth, hence an optimal level consumption. This extra cost is negligible compared to the base cost of Algorithm 1, i.e., \(2^{\textsf {m}-\textsf {l}} + 2^{\textsf {l}} + \textsf {m} - \textsf {l} - 3 + \lceil (d+1)/2^{\textsf {l}}\rceil \). Our algorithm also guarantees exact additions throughout the entire polynomial evaluation, hence preventing the precision loss related to additions between ciphertexts of different scales and making the procedure easier to use. Finally, it enables choosing the output scale, which can be set to the same value as the input scale, making the polynomial evaluation scale-preserving. As linear transformations and constant multiplications can already be made scale-preserving, our polynomial evaluation is the remaining building block for enabling scale-preserving circuits of arbitrary depth.

4 Key-Switch and Improved Matrix-Vector Product

The key-switch procedure is the generic public-key operation of the CKKS scheme. By generating specific public key-switch keys derived from secret keys \(s'\) and s, it is possible to enable the public re-encryption of ciphertexts from key \(s'\) to s. Beyond the public encryption procedure (switching from \(s'=0\) to s), a key-switch is required by most homomorphic operations to cancel the effect of encrypted arithmetic on the decryption circuit, thus ensuring the compactness of the scheme. In particular, homomorphic multiplications require the re-encryption from key \(s^2\) back to s, whereas slot-rotations require the re-encryption from the equivalent rotation of s back to s. The cost of the key-switch dominates the cost of these operations by one to two orders of magnitude because it requires many NTTs and CRT reconstructions. Hence, optimizations of the key-switch algorithm have a strong effect on the overall efficiency of the scheme.

We propose an optimized key-switch key format and key-switch algorithm (Sect. 4.1). We then apply them to rotation-keys and further improve the hoisted-rotation technique (Sect. 4.2) introduced by Halevi and Shoup [19]. Finally, we propose a modified procedure for matrix-vector multiplications over packed ciphertexts (Sect. 4.3) which features a novel double-hoisting optimization.

4.1 Improved Key-Switch Keys

Given a ciphertext modulus \(Q_{L} = \prod _{j=0}^{L}q_{j}\), we use a basis w composed of products among the \(q_{j}\), as described by Han and Ki [20]. We also include the entire basis w in the keys, as done by Bajard et al. and Halevi et al. [3, 16]; this saves one constant multiplication during the key-switch and enables a simpler generation of the key-switch keys. A more detailed overview of these works can be found in the extended version of the paper [4].

We propose a simpler and more efficient hybrid approach. Specifically, we use the basis \(w^{(i)} = \tfrac{Q_{L}}{q_{\alpha _{i}}}[(\tfrac{Q_{L}}{q_{\alpha _{i}}})^{-1}]_{q_{\alpha _{i}}}\) with \(q_{\alpha _{i}} = \prod _{j = \alpha i}^{\text {min}(\alpha (i+1)-1, L)} q_{j}\) for \(0\le i < \beta \), \(\beta = \lceil (L+1)/\alpha \rceil \) and \(\alpha \) a positive integer. In other words, \(Q_{L}\) is factorized into \(\beta \) equally sized composite numbers \(q_{\alpha _{i}}\), each composed of up to \(\alpha \) different primes. Thus, our key-switch keys have the following format:

$$\begin{aligned} \left( \textsf {swk}^{0}_{q_{\alpha _{i}}}, \textsf {swk}^{1}_{q_{\alpha _{i}}}\right) = \left( [-a_{i}s + s' \cdot P\cdot \tfrac{Q_{L}}{q_{\alpha _{i}}} \cdot [(\tfrac{Q_{L}}{q_{\alpha _{i}}})^{-1}]_{q_{\alpha _{i}}} + e_{i}]_{PQ_{L}}, [a_{i}]_{PQ_{L}}\right) . \end{aligned}$$

We set \(P = \prod ^{\alpha -1}_{j=0}p_{j}\), choosing the bit-size of P such that \(q_{\alpha _{i}} \le P\) for all i. As shown by Gentry et al. [15], this leads to a negligible error introduced by the key-switch operation. Algorithm 3 describes the associated key-switch procedure, which corresponds to the standard one adapted to our keys.
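As an illustration, the following toy sketch (plain integers, hypothetical small primes, no polynomials and no noise) checks that the basis \(w^{(i)}\) is exactly a CRT reconstruction basis: decomposing a value into its residues modulo the composite factors and recombining against \(w^{(i)}\) recovers the value modulo \(Q_{L}\):

```python
from math import prod

# Toy RNS gadget (assumed small primes; real parameters use NTT-friendly
# word-size primes). Q_L is split into beta equally sized composite factors.
primes = [97, 101, 103, 107, 109, 113]   # the q_j
alpha = 2                                # primes per composite factor
QL = prod(primes)
q = [prod(primes[i:i + alpha]) for i in range(0, len(primes), alpha)]

# w^(i) = (Q_L/q_i) * [(Q_L/q_i)^{-1}]_{q_i}: the CRT reconstruction basis
w = [(QL // qi) * pow(QL // qi, -1, qi) for qi in q]

a = 123456789 % QL
# Decompose a into its residues and recombine against w: recovers a mod Q_L
assert sum((a % qi) * wi for qi, wi in zip(q, w)) % QL == a
```

In the actual key-switch, the same decomposition is applied coefficient-wise to a polynomial, and the recombination factor is folded into the key-switch keys shown above.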

[Algorithm 3]

4.2 Improved Hoisted-Rotations

The slot-rotation operation in CKKS is defined by the automorphism \(\phi _k: X \rightarrow X^{5^{k}} (\text {mod}\ X^{N} + 1 )\). It rotates the message slots by k positions to the left. After a rotation, the secret under which the ciphertext is encrypted is changed from s to \(\phi _{k}(s)\), and a key-switch \(\phi _{k}(s) \rightarrow s\) is applied to return to the original key.

Halevi and Shoup [19] show that as \(\phi _{k}\) is an automorphism, it distributes over addition and multiplication, and commutes with the power-of-two base decomposition. As \(\phi _k\) acts individually on the coefficients by permuting them without changing their norm (the modular reduction by \(X^{N}+1\) at most induces a sign change), it also commutes with the special RNS decomposition (see Supplementary material in the extended version of the paper [4]): \([\phi _{k}(a)]_{q_{\alpha _{i}}} = \phi _{k}([a]_{q_{\alpha _{i}}})\).

Hence, when several rotations have to be applied on the same ciphertext, \([a]_{q_{\alpha _{i}}}\) can be pre-computed and re-used for each subsequent rotation: \(\sum \phi _{k}([a]_{q_{\alpha _{i}}}) \cdot \textsf {rot}_{k, q_{\alpha _{i}}}\). This technique, which Halevi and Shoup call hoisting, significantly reduces the number of NTTs and CRT reconstructions, at the negligible cost of having to compute the automorphism for each of the \([a]_{q_{\alpha _{i}}}\).

We further exploit the properties of the automorphism to reduce its execution cost, by observing that \(\phi _k^{-1}\) can be directly pre-applied on the rotation keys:

$$\begin{aligned} \left( \widetilde{\textsf {rot}}^{0}_{k, q_{\alpha _{i}}}, \widetilde{\textsf {rot}}^{1}_{k, q_{\alpha _{i}}}\right) = \left( [-a_{i}\phi _{k}^{-1}(s) + s \cdot P\cdot \tfrac{Q_{L}}{q_{\alpha _{i}}} \cdot [(\tfrac{Q_{L}}{q_{\alpha _{i}}})^{-1}]_{q_{\alpha _{i}}} + e_{i}]_{PQ_{L}}, [a_{i}]_{PQ_{L}}\right) \end{aligned}$$

Compared to \(\textsf {rot}_{k, q_{\alpha _{i}}}\), the traditional rotation-key defined in Sect. 2.1, this reduces the number of automorphisms per rotation to only one:

$$\begin{aligned} \langle \phi _{k}(\mathbf {a}), \textsf {rot}_{k} \rangle = \phi _{k}\left( \langle \mathbf {a}, \widetilde{\textsf {rot}}_{k} \rangle \right) . \end{aligned}$$

Our improved algorithm for hoisted rotations is detailed in Algorithm 4.
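The identity above can be checked on a toy negacyclic ring. The sketch below (plain integer coefficients, hypothetical toy degree, no encryption) verifies that pre-applying \(\phi_k^{-1}\) to the key and applying a single \(\phi_k\) to the result matches applying \(\phi_k\) to \(a\) first:

```python
N = 8                       # toy ring degree; ring Z[X]/(X^N + 1)

def poly_mul(a, b):
    # negacyclic (schoolbook) product in Z[X]/(X^N + 1)
    c = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            s = -1 if i + j >= N else 1   # X^N = -1
            c[(i + j) % N] += s * ai * bj
    return c

def phi(a, k):
    # automorphism X -> X^{5^k} mod X^N + 1: a signed permutation
    t = pow(5, k, 2 * N)
    out = [0] * N
    for j, aj in enumerate(a):
        e = (j * t) % (2 * N)
        out[e % N] += aj if e < N else -aj
    return out

a   = [1, 2, 3, 4, 5, 6, 7, 8]
key = [3, 1, 4, 1, 5, 9, 2, 6]

# phi distributes over the product, so the automorphism can be pre-applied
# to the key: <phi_k(a), key> = phi_k(<a, phi_k^{-1}(key)>)
lhs = poly_mul(phi(a, 1), key)
rhs = phi(poly_mul(a, phi(key, -1)), 1)
assert lhs == rhs
```

Here `key` stands in for one component of the rotation key; in the scheme, \(\phi_k^{-1}\) is applied once at key-generation time, so the online cost is the single final \(\phi_k\).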

4.3 Faster Matrix-Vector Operations

[Algorithm 4]

We now discuss the application of homomorphic slot-rotations to the computation of matrix-vector products on packed ciphertexts. The ability to efficiently apply generic linear transformations to encrypted vectors is pivotal for a wide variety of applications of homomorphic encryption. In particular, the homomorphic evaluation of the CKKS encoding and decoding procedures, which are linear transformations, dominates the cost of the original bootstrapping procedure.

Halevi and Shoup propose to express an \(n \times n\) matrix M in diagonal form and to use a baby-step giant-step (BSGS) algorithm (Algorithm 5) to evaluate the matrix product in \(\mathcal {O}(\sqrt{n})\) rotations [17, 18]. At the time of this writing, all the existing bootstrapping procedures for the CKKS scheme are based on this approach and, unlike for BGV [18, 19], none is reported to use hoisting. We now break down the cost of this BSGS algorithm, analyze its components, and, based on our observations, present our improvements to this approach.
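To make the structure of the BSGS diagonal method concrete, here is a minimal plaintext sketch (no encryption; rotations are plain list rotations, and the names `rot` and `diag` are ours):

```python
# Plaintext sketch of the diagonal/BSGS method: for an n x n matrix M,
# M.v = sum_d diag_d(M) ⊙ rot_d(v), grouped so that only n1 inner
# ("baby-step") and n2 outer ("giant-step") rotations of v are needed.
n, n1, n2 = 8, 4, 2              # n = n1 * n2

def rot(v, k):                   # left-rotation by k slots
    return v[k % len(v):] + v[:k % len(v)]

def diag(M, d):                  # d-th generalized diagonal of M
    return [M[i][(i + d) % len(M)] for i in range(len(M))]

M = [[(3 * i + 7 * j) % 11 for j in range(n)] for i in range(n)]
v = list(range(1, n + 1))

baby = [rot(v, i) for i in range(n1)]        # hoisted: computed once
out = [0] * n
for j in range(n2):                          # giant steps
    acc = [0] * n
    for i in range(n1):                      # baby steps
        dr = rot(diag(M, n1 * j + i), -n1 * j)   # pre-rotated diagonal
        acc = [x + y * z for x, y, z in zip(acc, dr, baby[i])]
    out = [x + y for x, y in zip(out, rot(acc, n1 * j))]

assert out == [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
```

Homomorphically, each `rot` of a ciphertext is a rotation plus a key-switch, which is what makes the rotation count the dominant cost below.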

[Algorithm 5]

Dominant Complexity of Rotations. The dominant cost factor of Algorithm 5 is the number of rotations, as each rotation requires a key-switch operation. These rotations comprise four steps (see Algorithm 4):

  1. Decompose: Decompose a polynomial of \(R_{Q_{\ell }}\) in base \(\mathbf{w} \) and return the result in \(R_{PQ_{\ell }}\). This operation requires NTTs and CRT basis extensions.

  2. MultSum: Compute a sum of products of polynomials in \(R_{PQ_{\ell }}\). This operation only requires coefficient-wise additions and multiplications.

  3. ModDown: Divide a polynomial of \(R_{PQ_{\ell }}\) by P and return the result in \(R_{Q_{\ell }}\). This operation requires NTTs and CRT basis extensions.

  4. Permute: Apply the automorphism \(\phi _{k}\) on a polynomial of \(R_{Q_{\ell }}\). It is a permutation of the coefficients and, in theory, has no impact on the complexity.

Fig. 1. Normalized complexity of each step (op.) of a rotation. The complexity for each operation was computed with \(N=2^{16}\), \(0\le \ell \le 23\) and \(\alpha = 4\). The complexity derivation can be found in the extended version of the paper [4].

Let n be the number of non-zero diagonals of M, and \(n_{1}, n_{2}\) be two integers such that \(n = n_{1}n_{2}\); the complexity of the original BSGS algorithm (Algorithm 5) is \(n_1+n_2\) rotations and it is minimized when \(n_{1} \approx n_{2}\):

$$\begin{aligned} (n_{2}+n_{1}) \cdot (\textit{Decompose} + \textit{MultSum} + \textit{ModDown} + \textit{Permute}), \end{aligned}$$

to which \(2n_{2}n_{1}\) multiplications in \(R_{Q_{\ell }}\) should also be added (line 8 of Algorithm 5). We refer to the lines that depend on \(n_{1}\) and \(n_{2}\) as the inner loop and the outer loop, respectively. Figure 1 shows the weight of each of the four steps in the total complexity. The complexity of the steps MultSum and Permute is negligible compared to that of Decompose and ModDown, as products and additions are very inexpensive compared to NTTs and CRT basis extensions. We base our optimization on this observation.

Improved BSGS Algorithm. We propose a new optimization that we refer to as double-hoisting. This optimization improves the hoisting technique proposed by Halevi and Shoup [19] and further reduces the complexity related to the inner-loop rotations by adding a second layer of hoisting.

The first level, proposed by Halevi and Shoup [19], applies to the inner-loop rotations (line 8 of Algorithm 5). This renders the computation devoted to Decompose independent of the value \(n_{1}\), so the complexity is reduced to

$$\begin{aligned}&\ n_{2} \cdot (\textit{Decompose} + \textit{MultSum} + \textit{ModDown} + \textit{Permute})\\ +&\ n_{1} \cdot (\textit{MultSum} + \textit{ModDown} + \textit{Permute}) + \textit{Decompose}. \end{aligned}$$

The second level, which we propose, introduces an additional hoisting for the inner-loop rotations: the ModDown step is a linear operation, so, similarly to the Decompose step, it commutes with the Permute step and with the ciphertext-plaintext multiplications (line 8 of Algorithm 5). Therefore, we need to apply it only once after the entire inner loop of \(n_{1}\) rotations. Applying the same reasoning to the ModDown step of the outer-loop rotations, we can reduce the number of ModDown operations from \(n_1+n_2\) to \(n_2 + 1\):

$$\begin{aligned}&\ n_{2} \cdot (\textit{Decompose} + \textit{MultSum} + \textit{ModDown} + \textit{Permute})\\ +&\ n_{1} \cdot (\textit{MultSum} + \textit{Permute}) + \textit{Decompose} + \textit{ModDown}. \end{aligned}$$
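A deliberately crude way to compare the three variants is to count only the NTT/CRT-heavy Decompose and ModDown steps from the complexity formulas in this section, treating their costs as equal and ignoring MultSum and Permute:

```python
# Crude step count per BSGS variant: only the NTT/CRT-heavy Decompose and
# ModDown steps are counted, each with unit cost.
def heavy_steps(n1, n2, variant):
    if variant == "none":      # no hoisting: every rotation pays both steps
        dec, mdn = n1 + n2, n1 + n2
    elif variant == "single":  # Decompose hoisted out of the inner loop
        dec, mdn = n2 + 1, n1 + n2
    else:                      # double hoisting: ModDown deferred as well
        dec, mdn = n2 + 1, n2 + 1
    return dec + mdn

for n1, n2 in [(16, 8), (32, 4), (64, 2)]:   # n = n1*n2 = 128 diagonals
    counts = [heavy_steps(n1, n2, v) for v in ("none", "single", "double")]
    print(n1, n2, counts)   # e.g. (16, 8) -> [48, 33, 18]
```

Even this rough count shows why the optimum shifts toward large \(n_{1}/n_{2}\): with double hoisting the expensive steps depend almost only on \(n_{2}\).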

Algorithm 6 describes our double-hoisting BSGS for matrix-vector products.

[Algorithm 6]

Discussion. In addition to benefiting from our improved key-switch (Sect. 4.1) and rotation (Sect. 4.2) procedures, Algorithm 6 introduces a trade-off: the ModDown step of the inner loop now depends on the value \(n_{2}\), and the ModDown step of the outer loop is performed only once. However, the \(2n_{1}n_{2}\) multiplications and additions are performed in \(R_{PQ_{\ell }}\) instead of \(R_{Q_{\ell }}\). Hence, the complexity dependency on \(n_{1}\) is significantly reduced, at the cost of a slight increase in the dependency on \(n_{1}n_{2}\). Applying the ModDown step at the end of each loop has the additional benefit of introducing the rounding error only once.

Table 2 compares the complexity of a non-hoisted, a single-hoisted (Algorithm 5) and a double-hoisted (Algorithm 6) BSGS, each with its optimal ratio \(n_1/n_2\). Our approach minimizes the complexity when \(2^{3} \le n_{1}/n_{2} \le 2^{4}\). This shows that the strategy of the previously proposed bootstrapping procedures [6, 10, 20], which minimize the number of rotations by setting \(n_{1} \approx n_{2}\), is no longer optimal. The maximum gain occurs when n (the number of non-zero diagonals) is around 128. This can be exploited by factorizing the linear transforms used during the bootstrapping into several sparse matrices (see Sect. 5.3).

Table 2. Complexity of Algorithm 5 [17], 1-hoisted Algorithm 5 [19] and our 2-hoisted Algorithm 6. M is a \(2^{15}\times 2^{15}\) matrix with \(n=n_{1}n_{2}\) non-zero diagonals. The parameters are \(N=2^{16}\), \(n=2^{15}\), \(\ell = 18\), \(\alpha = 4\). The speed-up factor is the ratio between the \(\#\textsf {Mul}_{\mathbb {Z}_{p}}\), taking the 1-hoisted approach as the baseline.

Increasing the ratio from \(n_{1}/n_{2} \approx 1\) to \(n_{1}/n_{2} \approx 16\) in our bootstrapping parameters (Sect. 6) increases the number of keys by a factor of around 1.6 and reduces the computation time by 20%. Hence, Algorithm 6 reduces the overall complexity of matrix-vector products by introducing a time-memory trade-off.

We also observe that these improvements are not restricted to plaintext matrices or to the CKKS scheme: they can be applied to other R-LWE schemes, such as BGV [5] or BFV [13], as long as the scheme (or its implementation) allows for the factorization of an expensive operation. For example, in the BFV scheme, the quantization (division by Q/t), as well as the re-linearization if the matrix is encrypted, can be deferred to the outer loop.

5 Bootstrapping for the Full-RNS CKKS Scheme

We present our improved bootstrapping procedure for the full-RNS variant of the CKKS scheme. We follow the high-level procedure of Cheon et al. [10] and adapt each step by relying on the techniques proposed in Sects. 3 and 4.

The purpose of the CKKS bootstrapping [10] is, in contrast with BFV’s [13], not to reduce the error. Instead, and similarly to BGV [5] bootstrapping, it is meant to reset the ciphertext modulus to a higher level in order to enable further homomorphic multiplications. The approximate nature of CKKS, due to the plaintext and ciphertext error being mixed together, implies that each homomorphic operation decreases the output precision. As a result, all the currently proposed bootstrapping circuits only approximate the ideal bootstrapping operation, and their output precision also determines their practical utility.

5.1 Circuit Overview

Let \(\{\mathsf {ct}=(c_0, c_1), Q_{0}, \varDelta \}\) be a ciphertext that encrypts an n-slot message under a secret-key s with Hamming weight h, such that \(\mathsf {Decrypt}(\mathsf {ct}, s) = c_{0} + sc_{1} = \lfloor \varDelta \cdot m(Y) \rceil + e \in \mathbb {Z}[Y]/(Y^{2n} +1)\), where \(Y = X^{N/2n}\). The bootstrapping operation outputs a ciphertext \(\{\textsf {ct}'=(c_{0}', c_{1}'), Q_{L-k}, \varDelta \}\) such that \(c_{0}' + sc_{1}' = \lfloor \varDelta \cdot m(Y) \rceil + e' \in \mathbb {Z}[Y]/(Y^{2n} +1)\), where \(k < L\) is the number of levels consumed by the bootstrapping and \(||e'|| \ge ||e||\) is the error that results from the combination of the initial error e and the error induced by the bootstrapping circuit.

The bootstrapping circuit is divided into the five steps detailed below. For the sake of conciseness, we describe the plaintext circuit and omit the error terms.

  1. ModRaise: \(\textsf {ct}\) is raised to the modulus \(Q_{L}\) by applying the CRT map \(R_{q_{0}} \rightarrow R_{q_0} \times R_{q_1} \times \dots \times R_{q_L}\). This yields a ciphertext \(\{\mathsf {ct}, Q_{L}, \varDelta \}\) for which

     $$ [c_{0} + sc_{1}]_{Q_{L}} = Q_{0} \cdot I(X) + \lfloor \varDelta \cdot m(Y) \rceil = m', $$

     where \(Q_{0}\cdot I(X) = \big [-[sc_{1}]_{Q_{0}} + sc_{1}\big ]_{Q_{L}}\) is an integer polynomial for which ||I(X)|| is \(\mathcal {O}(\sqrt{h})\) [10]. The next four steps remove this unwanted \(Q_{0} \cdot I(X)\) polynomial by homomorphically evaluating an approximate modular reduction by \(Q_{0}\).

  2. SubSum: If \(2n \ne N\), then \(Y \ne X\) and I(X) is not a polynomial in Y. SubSum maps \(Q_{0}\cdot I(X) + \lfloor \varDelta \cdot m(Y) \rceil \) to \((N/2n) \cdot (Q_{0}\cdot \tilde{I}(Y) + \lfloor \varDelta \cdot m(Y) \rceil )\), a polynomial in Y [10].

  3. CoeffsToSlots: The message \(m' = Q_{0} \cdot \tilde{I}(Y) + \lfloor \varDelta \cdot m(Y) \rceil \) is in the coefficient domain, which prevents the slot-wise evaluation of the modular reduction. This step homomorphically evaluates the inverse discrete Fourier transform (DFT) and produces a ciphertext encrypting \(\textsf {Encode}(m')\), which enables the slot-wise evaluation of the approximated modular reduction.

     Remark: This step returns two ciphertexts, each encrypting 2n real values. If \(4n \le N\), these ciphertexts can be repacked into one. Otherwise, the next step is applied separately on both ciphertexts.

  4. EvalSine: The modular reduction \(f(x) = x\bmod 1\) is homomorphically evaluated on the ciphertext(s) encrypting \(\textsf {Encode}(m')\). This function is approximated by \(\dfrac{Q_{0}}{2\pi \varDelta } \cdot \sin \left( \dfrac{2 \pi \varDelta x}{Q_{0}}\right) \), which is tight when \(Q_{0}/\varDelta \gg ||m(Y)||\). As the range of x is determined by \(||\tilde{I}(Y)||\), the approximation needs to account for the secret-key density.

  5. SlotsToCoeffs: This step homomorphically evaluates the DFT on the ciphertext encrypting \(f(\textsf {Encode}(m'))\). It returns a ciphertext at level \(L - k\) that encrypts \(\textsf {Decode}(f(\textsf {Encode}(m'))) \approx f(m') \approx \lfloor \varDelta \cdot m(Y) \rceil \), a close approximation of the original message.

We now detail our approach for each step.

5.2 ModRaise and SubSum

We base the ModRaise and SubSum operations directly on the initial bootstrapping of Cheon et al. [10]. The SubSum step multiplies the encrypted message by a factor N/2n that must subsequently be cancelled. We take advantage of the following CoeffsToSlots step, a linear transformation, to scale the corresponding matrices by 2n/N. As we also use this trick for grouping other constants, we elaborate on the matrix scaling in Sect. 5.5.
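The ModRaise identity of Sect. 5.1 (step 1) can be illustrated with a toy integer analogue; the moduli below are hypothetical, and each polynomial coefficient behaves this way (the error term is omitted):

```python
q0, QL = 17, 17 * 29 * 31       # toy moduli with q0 | QL
s, c1, m = 3, 1234, 5
c0 = (m - s * c1) % q0          # level-0 "ciphertext": c0 + s*c1 = m (mod q0)

# ModRaise: reinterpret (c0, c1) modulo the larger QL
val = (c0 + s * c1) % QL
I = (val - m) // q0             # the unwanted multiple of q0
assert val == m + q0 * I        # decryption at QL yields m + q0*I
assert (c0 + s * c1) % q0 == m  # while decryption at q0 still yields m
```

The remaining steps of the circuit exist precisely to remove this \(q_{0}\cdot I\) term homomorphically.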

5.3 CoeffsToSlots and SlotsToCoeffs

Let n be a power-of-two integer such that \(1 \le n < N\); the following holds for any two vectors \(\mathbf{m} ,\mathbf{m} ' \in \mathbb {C}^{n}\) due to the convolution property of the complex DFT

$$\begin{aligned} \textsf {Decode}_{n}(\textsf {Encode}_{n}(\mathbf{m} ) \otimes \textsf {Encode}_{n}(\mathbf{m} ')) \approx \mathbf{m} \odot \mathbf{m} ', \end{aligned}$$

where \(\otimes \) and \(\odot \) respectively denote the nega-cyclic convolution and the Hadamard multiplication. That is, the encoding and decoding algorithms define an isomorphism between \(\mathbb {R}[Y]/(Y^{2n}+1)\) and \(\mathbb {C}^{n}\) [11]. The goal of the CoeffsToSlots and SlotsToCoeffs steps is to homomorphically evaluate this isomorphism on a ciphertext.

Let \(\psi = e^{i\pi /2n}\) be a 4n-th primitive root of unity. As 5 and \(-1\) generate \(\mathbb {Z}^{*}_{4n}\), \(\{\psi ^{5^{k}}, \overline{\psi ^{5^{k}}}, 0\le k < n\}\) is the set of all 4n-th primitive roots of unity, i.e., the set of roots of \(Y^{2n}+1\). Given a polynomial \(m(Y) \in \mathbb {R}[Y]/(Y^{2n} + 1)\) with \(Y = X^{N/2n}\), the decoding algorithm is defined as the evaluation of this polynomial at the n roots \(\psi ^{5^{k}}\): \(\textsf {Decode}_{n}(m(Y)) = (m(\psi ), m(\psi ^{5}), \dots , m(\psi ^{5^{n-1}}))\). The decoding isomorphism is fully defined by the \(n \times n\) special Fourier transform matrix \(\text {SF}_{n, (j, k)} = \psi ^{j5^{k}}\), with inverse (the encoding matrix) \(\text {SF}^{-1}_{n} = \frac{1}{n}\overline{\text {SF}}^{T}_{n}\) [7]. Its homomorphic evaluation can be expressed in terms of plaintext matrix-vector products:

  1. CoeffsToSlots\((\mathbf{m} )\): \(t_0 = \frac{1}{2}\big (\text {SF}^{-1}_{n} \times \mathbf{m} + \overline{\text {SF}^{-1}_{n} \times \mathbf{m} }\big )\), \(t_1 = -\frac{1}{2}i\big (\text {SF}^{-1}_{n} \times \mathbf{m} - \overline{\text {SF}^{-1}_{n} \times \mathbf{m} }\big )\)

  2. SlotsToCoeffs\((t_0, t_1)\): \(\mathbf{m} = \text {SF}_{n} \times (t_0 + i\cdot t_1)\).
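The convolution property can be checked numerically. The sketch below (toy size \(n=4\), plain Python complex numbers, our own helper names) builds the evaluation points \(\psi^{5^k}\), encodes two real slot vectors, multiplies the resulting polynomials modulo \(Y^{2n}+1\), and decodes back to the Hadamard product:

```python
import cmath

n = 4                         # toy number of complex slots
N2 = 2 * n                    # polynomial degree: ring R[Y]/(Y^{2n}+1)
psi = cmath.exp(1j * cmath.pi / (2 * n))   # primitive 4n-th root of unity

# Evaluation points psi^{5^k}; together with their conjugates these are
# exactly the roots of Y^{2n} + 1
pts = [psi ** pow(5, k, 4 * n) for k in range(n)]

def decode(m):
    """Evaluate a real polynomial (2n coefficients) at the n points."""
    return [sum(c * p ** j for j, c in enumerate(m)) for p in pts]

def encode(z):
    """Real polynomial whose evaluations at pts are z (inverse map)."""
    roots = pts + [p.conjugate() for p in pts]
    vals = list(z) + [complex(v).conjugate() for v in z]
    return [sum(v * r ** (-j) for v, r in zip(vals, roots)).real / N2
            for j in range(N2)]

def negacyclic(a, b):
    """Product of two coefficient vectors modulo Y^{2n} + 1."""
    c = [0.0] * N2
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < N2: c[i + j] += ai * bj
            else:          c[i + j - N2] -= ai * bj   # Y^{2n} = -1
    return c

a, b = [1, 2, 0.5, -1], [3, -2, 1, 0.25]
out = decode(negacyclic(encode(a), encode(b)))
# out is (numerically) the Hadamard product a ⊙ b
```

The homomorphic CoeffsToSlots and SlotsToCoeffs steps evaluate these encode and decode maps, expressed as the matrix products above.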

DFT Evaluation. In their initial bootstrapping proposal, Cheon et al. [10] homomorphically compute the DFT as a single matrix-vector product in \(\mathcal {O}(\sqrt{n})\) rotations and depth 1, by using the baby-step giant-step (BSGS) approach of Halevi and Shoup [18] (Algorithm 5 in Sect. 4.3). To further reduce the complexity, two recent works from Cheon et al. [7] and Chen et al. [6] exploit the structure of the equivalent FFT algorithm by recursively merging its iterations, reducing the complexity to \(\mathcal {O}(\sqrt{r}\log _{r}(n))\) rotations at the cost of increasing the depth to \(\mathcal {O}(\log _{r}(n))\), for r a power-of-two radix between 2 and n.

We base our approach on the work of [7] and [6], and we use our double-hoisting BSGS to evaluate the matrix-vector products (see Sect. 4.3 and Algorithm 6). This step is parameterized by \(\rho = \lceil \log _{r}(n)\rceil \), the depth of the linear transformation (i.e., the number of matrices that we need to evaluate).

Figure 2 shows the effect of our algorithm on the CoeffsToSlots step, compared with the original BSGS algorithm for \(\rho _{\text {SF}_{n}^{-1}}=\{2,3,4\}\). The complexity is computed as the number of products in \(\mathbb {Z}_{p}\), with parameters \(N=2^{16}\), a target \(\ell = 17\) (the level after CoeffsToSlots) and \(n=2^{15}\) slots.

Each level of hoisting reduces the total complexity by a noticeable amount. Regular hoisting, as proposed by Halevi and Shoup [19], achieves its minimum complexity when \(n_{1}\approx 2^{2}n_{2}\) instead of \(n_{1} \approx n_{2}\). Using our double hoisting, the minimum complexity is further shifted to \(n_{1}\approx 2^{4}n_{2}\). On average, our method reduces the complexity of the linear transformations in the bootstrapping by a factor of two compared to the single-hoisting technique of Halevi and Shoup.

Fig. 2. Theoretical complexity of CoeffsToSlots for different \(\rho _{\text {SF}_{n}^{-1}}\) using Algorithm 5 with no hoisting, single hoisting, and double hoisting (Algorithm 6).

Efficient Repacking of Sparse Plaintexts. The first part of CoeffsToSlots is a DFT that outputs a vector of \(\mathbb {C}^{n}\) values; the second part of CoeffsToSlots applies the map \(\mathbb {C}^{n} \rightarrow \mathbb {R}^{2n}\) to this vector. During the decoding, the inverse mapping \(\mathbb {R}^{2n} \rightarrow \mathbb {C}^{n}\) is used. This map can be computed with simple operations, e.g., conjugation, multiplication by \(-i\), and additions. If the original ciphertext is not fully packed (\(0< n < N/2\) slots), the two resulting ciphertexts can be merged into one, requiring one evaluation of EvalSine instead of two.

We observe that decoding a plaintext \(\mathbf {m} \in \mathbb {C}^{n}\) by using the decoding algorithm for a plaintext of \(\mathbb {C}^{2^{k}n}\) slots (assuming that \(2^{k}n < N\)) outputs a vector comprising \(2^{k}\) concatenated replicas of \(\mathbf {m}\). Therefore, a ciphertext that encrypts \(\mathbf {m} \in \mathbb {C}^{n}\) can also be seen as a ciphertext encrypting \(\mathbf {m}' \in \mathbb {C}^{2n}\) for \(\mathbf {m}' = \mathbf {m} || \mathbf {m}\). This property can be used to save two levels when repacking and unpacking ciphertexts before and after the EvalSine:

  • Repacking before the EvalSine (\(\mathbb {C}^{n} \rightarrow \mathbb {R}^{2n}\)): Repacking into one ciphertext is done by extending the domain of the plaintext vectors of the last matrix of the CoeffsToSlots step from \(\mathbb {C}^{n}\) to \(\mathbb {C}^{n}||0^{n}\). Thus, the last n slots are set to zero and can be used to store the imaginary part of the first n slots. This repacking involves one additional rotation and it does not consume any additional levels.

  • Unpacking after the EvalSine (\(\mathbb {R}^{2n} \rightarrow \mathbb {C}^{n}\)): For this operation, we evaluate the following \(2n\times 2n\) matrix on the ciphertext before the DFT

    $$ \begin{bmatrix} I_{n} & i \cdot I_{n}\\ I_{n} & i \cdot I_{n} \end{bmatrix}, $$

    where \(I_{n}\) is the \(n\times n\) identity matrix. Its effect is to homomorphically apply the map \(\mathbb {R}^{2n} \rightarrow \mathbb {C}^{n}||\mathbb {C}^{n}\), which is a valid encoding of \(\mathbb {C}^{n}\), due to the properties of the encoding algorithm. This additional matrix (transformation) is combined with the first group of the SlotsToCoeffs matrices, thus slightly increasing its density.

5.4 EvalSine

EvalSine implements the homomorphic modular reduction of the message \(m' = Q_{0} \cdot \tilde{I}(Y) + \varDelta \cdot m(Y)\) modulo \(Q_0\). The modular reduction is approximated by

$$\begin{aligned} f(x) = \dfrac{Q_{0}}{\varDelta }\dfrac{1}{2\pi }\sin \left( 2\pi x \dfrac{\varDelta }{Q_{0}}\right) \approx \dfrac{Q_{0}}{\varDelta }\cdot \left( \dfrac{\varDelta }{Q_{0}}x \bmod 1\right) , \end{aligned}$$

which scales the message \(m'\) down to \(\tilde{I}(Y) + (\varDelta /Q_{0}) \cdot m(Y)\), removes the \(\tilde{I}(Y)\) polynomial by reducing the message modulo 1, and scales the message back to \(\varDelta \cdot m(Y)\). As \(\tilde{I}(Y)\) determines the range and degree of the approximation, the EvalSine step has to account for the secret-key density h. In particular, the range of the approximation \((-K, K)\) is chosen such that \(\Pr [||\tilde{I}(Y)|| > K] \le \kappa \) for a user-defined \(\kappa \). We elaborate on how we parameterize K in Sect. 6.2.

Previous Work. Chen et al. [6] directly approximate the function \(\tfrac{1}{2\pi }\cdot \sin (2\pi x)\) by using a standard Chebyshev interpolant of degree \(d=119\) over the interval \((-K, K)\) for \(K = 12\) (using a sparse key with \(h=64\)). Han and Ki [20] approximate \(\cos (2\pi \tfrac{1}{2^r}(x-0.25))\), followed by r iterations of the double-angle formula \(\cos (2x) = 2\cos (x)^{2} -1\), to obtain \(\sin (2\pi x)\). The factor \(1/2^{r}\) reduces the range of the approximation to \((-K/2^{r}, K/2^{r})\), enabling the use of a smaller-degree interpolant. They combine it with a specialized Chebyshev interpolation that places the nodes around the expected intervals of the input. This reduces the degree of the approximation and the cost of its evaluation. In their work, they use an interpolant of degree 30 with a scaling factor \(r=2\) (they also use a sparse key with \(h=64\)).

In a recent work, Lee et al. [25] propose to compose the sine/cosine function with a low degree arcsine. This additional step corrects the error introduced by the sine, especially if \(Q_{0}/\varDelta \) is small (when the values are not close to the origin). This improves the overall precision of the bootstrapping and enables bootstrapping messages with larger values. However, this comes at the cost of increasing the depth of the EvalSine step, as a second polynomial must be evaluated.

Our Work. Both the methods of Chen et al. and Han and Ki have \(d = \mathcal {O}(K)\), therefore doubling K requires at most doubling d, and the evaluation will require at most one additional level, as the Chebyshev interpolant can be evaluated in \(\mathcal {O}(\log (K))\) levels. Hence, precision put aside, the level consumption should not be a fundamental problem when evaluating the large degree interpolant (as required by dense keys). However, the effects of the approximate rescale procedure, if not properly managed, can significantly reduce the output precision. Our EvalSine makes use of our novel polynomial evaluation technique (Sect. 3).

We propose a more compact expression of the modular reduction function \(f(x) = \tfrac{1}{2\pi }\sin (2\pi x)\), which is approximated by \(g_{r}(x)\), a modified scaled cosine followed by r iterations of the double-angle formula:

$$\begin{aligned} g_0(x) = \dfrac{1}{\root 2^{r} \of {2\pi }}\cos \big (2\pi \tfrac{1}{2^{r}}(x-0.25)\big )\ \text {and}\ g_{i+1} = 2g_{i}^{2} - \Bigg (\dfrac{1}{\root 2^{r} \of {2\pi }}\Bigg )^{2^{i+1}}. \end{aligned}$$

We include the \(1/2\pi \) factor directly in the function we approximate, even when using the double angle formula, without consuming an additional level, impacting the precision, or fundamentally changing its evaluation. We observed that even though the approximation technique of Han and Ki is well suited for small K, the standard Chebyshev interpolation technique, as used by Chen et al., remains more efficient when K is large. The reason is that Han and Ki’s interpolant has a minimum degree of \(2K-1\), so it grows in degree with respect to K much faster than the standard Chebyshev interpolation. Hence, we use the approximation method of Han and Ki when K is small (for sparse keys) and the standard Chebyshev approximation, as done by Chen et al., for dense keys.
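The double-angle recursion can be checked numerically; the sketch below verifies the exact trigonometric identity (not the Chebyshev interpolant that replaces the base cosine in the homomorphic evaluation):

```python
import math

def g(x, r):
    # scaled cosine followed by r double-angle iterations; the constant
    # c = (1/2pi)^(1/2^r) is squared along with g, so after r steps the
    # overall factor is exactly 1/(2*pi)
    c = (1.0 / (2 * math.pi)) ** (1.0 / 2 ** r)
    gi = c * math.cos(2 * math.pi * (x - 0.25) / 2 ** r)
    for i in range(r):
        gi = 2 * gi * gi - c ** (2 ** (i + 1))
    return gi

# g_r(x) coincides with sin(2*pi*x)/(2*pi), i.e. the scaled modular reduction
for x in [0.01, 1.02, -2.97, 5.03]:
    assert abs(g(x, 3) - math.sin(2 * math.pi * x) / (2 * math.pi)) < 1e-12
```

Note that the subtracted constant at iteration i must be \(((2\pi)^{-1/2^{r}})^{2^{i+1}}\) for the recursion to close on \(\tfrac{1}{2\pi}\sin(2\pi x)\).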

As suggested by Lee et al. [25], we can further improve this step by composing it with \(\arcsin (x)\), i.e., \(\frac{1}{2\pi }\arcsin (\sin (2\pi x))\), which corrects the error \(e_{g_{r}(x)} = |g_{r}(x) -x\bmod 1|\). Unlike Lee et al., we do not interpolate the arcsine; instead, we use a low-degree Taylor polynomial and show in our results (see Sect. 7.2) that it is sufficient to achieve similar results.

Algorithm 7 details our implementation of the \(\textsf {EvalSine}\) procedure. The ciphertext must be multiplied by several constants, before and after the polynomial evaluation. For efficiency, we merge these constants with the linear transformations. See Sect. 5.5 for further details.

[Algorithm 7]

5.5 Matrix Scaling

Several steps of the bootstrapping circuit require the ciphertexts to be multiplied by constant plaintext values. This is most efficiently done by merging them and pre-multiplying the resulting constants to the SF\(_{n}^{-1}\) and SF\(_{n}\) matrices.

Before EvalSine, the ciphertext has to be multiplied (i) by 1/N to cancel the N/2n and 2n factors introduced by the SubSum and CoeffsToSlots steps, (ii) by \(1/(2^{r}K)\) for the scaling by \(1/2^{r}\) and change of variable for the polynomial evaluation in Chebyshev basis, and (iii) by \(Q_{0}/2^{\lceil \log (Q_{0})\rceil }\) to compensate for the error introduced by the approximate division by \(\lfloor Q_{0}/\varDelta \rceil \). Therefore, the matrices resulting from the factorization of SF\(_{n}^{-1}\) are scaled by

$$\begin{aligned} \mu _{\textsf {CtS}} = \Bigg (\dfrac{1}{2^{r}KN}\cdot \dfrac{Q_{0}}{2^{\lceil \log (Q_{0}) \rceil }}\Bigg )^{\dfrac{1}{\rho _{\text {SF}_{n}^{-1}}}}, \end{aligned}$$

where \(\rho _{\text {SF}_{n}^{-1}}\) is the degree of factorization of SF\(_{n}^{-1}\). Evenly spreading the scaling factors across all matrices ensures that they are scaled by a value as close as possible to 1.

After EvalSine, the ciphertext has to be multiplied (i) by \(2^{\lceil \log (Q_{0})\rceil }/Q_{0}\) to compensate for the error introduced by the approximate multiplication by \(\lfloor Q_{0}/\varDelta \rceil \), and (ii) by \(\varDelta /\delta \), where \(\varDelta \) is the scale of the ciphertext after the EvalSine step and \(\delta \) is the desired ciphertext output scale. Therefore, the matrices resulting from the factorization of SF\(_{n}\) are scaled by

$$\begin{aligned} \mu _{\textsf {StC}} = \Bigg (\dfrac{\varDelta }{\delta }\cdot \dfrac{2^{\lceil \log (Q_{0}) \rceil }}{Q_{0}}\Bigg )^{\dfrac{1}{\rho _{\text {SF}_{n}}}}, \end{aligned}$$

where \(\rho _{\text {SF}_{n}}\) is the degree of factorization of SF\(_{n}\).

6 Parameter Selection

A proper parameterization is paramount to the security and correctness of the bootstrapping procedure. Whereas security is based on traditional hardness assumptions, setting the correctness-related parameters is accomplished mostly through experimental processes that find appropriate trade-offs between the performance and the probability of decryption errors. In this section, we discuss various constraints and inter-dependencies in the parameter selection. Then, we propose a generic procedure for finding appropriate parameter sets.

6.1 Security

For each parameter set, we select a modulus size with an estimated security of 128 bits. These values are shown in Table 3 for several choices of the secret-key Hamming weight h, and are based on the work of Curtis and Player [12]. According to the authors, these parameters result from conservative estimations and account for hypothetical future improvements to the most recent attacks of Cheon et al. [9] and Son et al. [29]. Therefore, their actual security is, if anything, underestimated.

Table 3. Modulus size \(\log (QP)\) for different secret-key densities h (\(\lambda \ge 128\)).

6.2 Choosing K for EvalSine

Each coefficient of the polynomial \(\tilde{I}(Y) \in \mathbb {R}[Y]/(Y^{2n}+1)\) is the result of the sum of \(h+1\) uniformly distributed variables in \(\mathbb {Z}_{Q_{0}}\) [10], hence it follows an Irwin–Hall distribution [25]. By centering and normalizing the coefficients of \(\tilde{I}(Y)\), we get instead the sum of \(h+1\) uniformly distributed variables in \((-0.5, 0.5)\). The probability \(\text {Pr}[||\tilde{I}(Y)|| > K]\) can be computed by adapting the cumulative probability function of the Irwin-Hall distribution:

$$\begin{aligned} 1-\left( \left( \dfrac{2}{(h+1)!}\sum _{i=0}^{\lfloor K+0.5(h+1)\rfloor }(-1)^{i}{h+1 \atopwithdelims ()i}(K+0.5(h+1)-i)^{h+1}\right) -1\right) ^{2n}. \end{aligned}$$
(1)

The previous works [6, 7, 10, 20] use a sparse key with \(h=64\) and \(K=12\), which, regardless of the security, gives a failure probability of \(2^{-14.7}\) and \(2^{-6.7}\) for \(n = 2^{7}\) and \(n=2^{15}\), respectively, according to Eq. (1). Clearly, these parameters were not chosen for large n and are most likely an artifact of the first proposal for a bootstrapping for CKKS [10], for which only a small number of slots was practical. In our work, we increase h to ensure an appropriate security and use a much larger number of slots (e.g., \(n=2^{15}\)); hence, we need to adapt K. Table 4 shows that if we target a failure probability \(\le 2^{-15.0}\) for \(n=2^{15}\) slots and take h as a parameter, then \(K \approx 1.81\sqrt{h}\).
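Eq. (1) can be evaluated directly. The sketch below uses exact rational arithmetic for the inner Irwin-Hall sum, whose large alternating terms would otherwise cancel catastrophically in floating point, and should reproduce the failure probabilities quoted above:

```python
from fractions import Fraction
from math import comb, factorial, log, log2, expm1

def fail_prob(h, K, n):
    # Pr[||I~(Y)|| > K] per Eq. (1): each of the 2n coefficients is a
    # centered sum of h+1 uniform variables in (-0.5, 0.5); the inner sum
    # is the Irwin-Hall CDF, evaluated exactly with rationals
    x = Fraction(2 * K + h + 1, 2)                    # K + 0.5*(h+1)
    F = sum((-1) ** i * comb(h + 1, i) * (x - i) ** (h + 1)
            for i in range(int(x) + 1)) / factorial(h + 1)
    p_ok = 2 * F - 1                                  # Pr[|coeff| <= K]
    return -expm1(2 * n * log(float(p_ok)))           # 1 - p_ok^(2n)

# log2 failure probabilities for h = 64, K = 12 (the quoted values are
# approximately -14.7 and -6.7)
print(log2(fail_prob(64, 12, 2 ** 7)), log2(fail_prob(64, 12, 2 ** 15)))
```

The function name and the float/rational split are ours; the formula itself is Eq. (1) verbatim.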

Table 4. \(\text {Pr}[||\tilde{I}(Y)|| > K] \approx 2^{-16}\) for \(n=2^{15}\) and variable h.

6.3 Finding Parameters

We describe a general heuristic procedure for selecting and fine-tuning bootstrapping parameters. Each operation of the bootstrapping requires a different scaling and a different precision, therefore different moduli. Choosing each modulus optimally for each operation not only leads to a better performance and a better final precision but also optimizes the bit consumption of each operation and increases the remaining homomorphic capacity after the bootstrapping.

We describe our procedure to find suitable parameters for the bootstrapping in Algorithm 8 and propose five reference parameter sets that result from this algorithm. The parameter sets were selected for their performance and similarity with those in previous works, thus enabling a comparison. For each set, Table 5 shows the parameters related to CKKS and to the bootstrapping circuit.

7 Evaluation

We implemented the improved algorithms of Sects. 3 and 4, along with the bootstrapping procedure of Sect. 5, in the Lattigo library [24], and evaluated them by using the parameters of Sect. 6.3. Lattigo is an open-source library that implements the RNS variants of the BFV [3, 13, 16] and CKKS [8] schemes in Golang [30]. All experiments were conducted single-threaded on an Intel i5-6600K at 3.5 GHz with 32 GB of RAM running Windows 10 (Go version 1.15.6, GOARCH = amd64, GOOS = windows).

Table 5. The sets of parameters of the full-RNS CKKS scheme used to evaluate the performance of our bootstrapping code. \(+\) denotes concatenation in the moduli chain, and \(a \cdot b\) denotes the consecutive concatenation of a distinct moduli, each of size b bits. Moduli with a fractional a are only partially used by the step they are allocated to.

7.1 The Bootstrapping Metrics

Although CPU cost is an important aspect when evaluating a bootstrapping procedure, it has to be considered together with other performance-related metrics, such as the size of the output plaintext space, the failure probability, the precision, and the remaining multiplicative depth. To compare our bootstrapping procedure with the existing ones, we use the same concept of a bootstrapping utility metric as introduced by Han and Ki [20].

Definition 1

(Bootstrapping Throughput). For n the number of plaintext slots, \(\log (\epsilon ^{-1})\) the output precision, \(\log (Q_{L-k})\) the size of the output coefficient modulus after the bootstrapping (the remaining homomorphic capacity), and complexity a measure of the computational cost (in CPU time), the bootstrapping throughput is defined as:

$$ \text {throughput} = \dfrac{n\times \log (\epsilon ^{-1})\times \log (Q_{L-k})}{complexity}. $$

Note that we express the remaining homomorphic capacity in terms of the modulus size, instead of the number of levels, because \(Q_{L-k}\) can be re-allocated differently at each bootstrapping call, e.g., as a small number of moduli with a large plaintext scale or as a large number of moduli with a small plaintext scale.
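As a concrete illustration of Definition 1, the following sketch computes the throughput for a hypothetical bootstrapping run; the numeric values (\(n = 2^{15}\) slots, 19 bits of output precision, 420 remaining modulus bits, 20 s of CPU time) are made-up placeholders, not measurements from Table 7.

```go
package main

import "fmt"

// throughput implements the metric of Definition 1:
// (slots x output precision in bits x remaining modulus bits) / CPU time.
func throughput(slots int, precisionBits, remainingModulusBits, seconds float64) float64 {
	return float64(slots) * precisionBits * remainingModulusBits / seconds
}

func main() {
	// Hypothetical values, NOT taken from Table 7.
	t := throughput(1<<15, 19.0, 420.0, 20.0)
	fmt.Printf("throughput = %.3e\n", t) // 32768*19*420/20 = 13,074,432
}
```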

As \(\kappa \), the bootstrapping failure probability, is a probability and not a metric, we chose not to include it directly in Definition 1. However, we believe that it should still be taken into account as an opportunity-cost variable. Indeed, the event of a bootstrapping failure will likely result in the need to re-run the entire circuit. Hence, the probability of failure should be weighed against the cost of having to re-run a circuit, to determine whether \(\kappa \) is in an acceptable range.

7.2 Results


We ran our benchmarks and report the bootstrapping performance for each parameter set of Table 5, and we compare them with the previous works of Chen et al. [6] and Han and Ki [20], as well as with the recent and concurrent work of Lee et al. [25]. Unfortunately, the implementations of these works have not been publicly released, so we were not able to reproduce their results on our own hardware for a fair comparison. The parameters and results are summarized in Tables 6 and 7, respectively. Reports on experiments that demonstrate the numerical stability of our bootstrapping can be found in the extended version of this paper [4].

Fig. 3.

Bootstrapping throughput comparison. We plot the results for our best performing parameter set against the state of the art. Nodes are labeled with n, the number of plaintext slots.

Table 6. Parameter comparison of [6, 20, 25] and our work. “–” means that the value was not reported. Lee et al.’s [25] parameters are based on our Set III.
Table 7. Comparison of the bootstrapping performance of [6, 20, 25] and our proposed bootstrapping for the full-RNS variant of CKKS with parameter sets I, II, III, IV and V. MU, SS, CtS, and StC designate ModUp, SubSum, CoeffsToSlots, and SlotsToCoeffs, respectively. “–” indicates that the prior work did not report the value. All timings are single-threaded. The real and imaginary parts of the plaintext are uniformly distributed in the interval \((-1, 1)\).

Focusing only on the overall performance, our best-performing set (Set III) achieves a throughput \(14.1\times \) and \(28.4\times \) larger than the best results reported by Han and Ki [20] and Lee et al. [25], respectively. Our Set IV uses dense keys and achieves a throughput \(4.6\times \) and \(9\times \) larger than the works of Han and Ki and Lee et al., respectively. Both of these works use SEAL [28] and are evaluated on similar hardware. Our Sets III and IV achieve a throughput \(54.2\times \) and \(17.4\times \) larger than the best result reported by Chen et al. [6], which was implemented using the HEAAN library [21]. HEAAN does not implement the full-RNS variant of CKKS; hence, the latter comparison shows the significant performance gains that can be achieved by combining optimized algorithms with a full-RNS implementation.

The implementation of Lee et al. makes use of the recent work of Kim et al. [22], which proposes new techniques to minimize the error during computation, notably a delayed rescaling that consists of rescaling the ciphertext before a multiplication rather than after it, so that the error is as small as possible when the multiplication is carried out. This enables Lee et al. to achieve a slightly higher precision than ours (our implementation does not use the work of Kim et al.). Lee et al.’s results are also the ones with the most residual homomorphic capacity. The primary reason is the implementation of the CKKS scheme in SEAL, which can use only one special prime (\(\alpha = 1\), see Sect. 4) during the key-switching. This increases the ciphertext homomorphic capacity, but at the cost of an increased key-switching complexity. The second reason is that they allocate fewer levels to the linear transformations (in total, three fewer than our parameters). This enables them to reduce the depth of the bootstrapping, at the cost of increasing its complexity, which shows in their timings.

We observe that there is a correlation between the value \(Q_{0}/\varDelta \) and the precision. A better precision is achieved when using a smaller ratio, even when the arcsine is not composed with the scaled sine. Previous works usually assume that \(||m|| \approx ||\text {FFT}^{-1}(m)||\) to set \(Q_{0}/\varDelta \) and to derive the expected precision of the scaled sine. In practice, since each coefficient of \(\text {FFT}^{-1}(m)\) is a dot product between the vector m and a complex vector of roots of unity (zero-mean and with small variance), if the mean of m is close to zero, then \(||\text {FFT}^{-1}(m)|| \ll ||m||\) with overwhelming probability. For example, given m uniform in \((-1, 1)\) and \(n=2^{15}\) slots, \(||m||/||\text {FFT}^{-1}(m)|| \approx 100\). Hence, the message is much closer to the origin than expected, which reduces the inherent error of the scaled sine and amplifies the effectiveness of the arcsine. We note that, even if the distribution of m is not known, it is possible to enforce this behavior with a single plaintext multiplication, by homomorphically negating half of its coefficients before the bootstrapping. One could even homomorphically split m in half and create two symmetric vectors to enforce a zero mean. A more detailed analysis of this behavior, and of how to efficiently exploit it or integrate it into the linear transformations of the bootstrapping, could be an interesting line of future research.

All our sets have a failure probability that is two to three orders of magnitude smaller than those of previous works, except for the results of Lee et al., which use our suggested parameters. For example, following Eq. (1), if successive bootstrappings are carried out with \(n=2^{15}\) slots, then [6] and [20] would reach a 1/2 failure probability after 52 bootstrappings, whereas ours would reach the same probability only after 24,656 bootstrappings.

Figure 3 plots the best performing instances of Table 7.

8 Conclusion

In this work, we have introduced a secure, reliable, precise, and efficient bootstrapping procedure for the full-RNS CKKS scheme that does not require the use of sparse secret keys. To the best of our knowledge, this is the first reported instance of a practical bootstrapping parameterized for at least 128-bit security.

To achieve this, we have proposed a generic algorithm for the homomorphic evaluation of polynomials that has a reduced error and is optimal in level consumption. In addition to the increase in precision and efficiency, our algorithm also improves the usability of the full-RNS variant of CKKS (for which managing a changing scale in large circuits is known to be a difficult task).

We have also proposed an improved key-switch format that we apply to the homomorphic matrix-vector multiplication. Our novel double-hoisting algorithm reduces the complexity of the CoeffsToSlots and SlotsToCoeffs transformations by roughly a factor of 2 compared to previous works. The performance gain for these procedures enables their use outside of the bootstrapping, for applications where the conversion between the coefficient and slot domains would enable much more efficient homomorphic circuits (e.g., the training of convolutional neural networks or R-LWE-to-LWE ciphertext conversion).

We have also proposed a systematic approach to parameterize the bootstrapping, including a way to precisely assess its failure probability. We have evaluated our bootstrapping procedure and have shown that its throughput with “dense” secret keys (\(h=N/2\)) is up to \(4.6\times \) larger than the best state-of-the-art results with sparse keys (\(h=64\)). When the sparse-key-adjusted parameters of Curtis and Player [12] for \(h=192\) and 128 bits of security are considered, our procedure has a \(14.1\times \) larger throughput than the previous work that uses a sparse key with \(h=64\) and insecure parameters. Additionally, all our parameter sets lead to a more reliable instance than those of previous works, with a failure probability that is orders of magnitude lower.

We have implemented our contributions in the Lattigo library [24]. This is, to the best of our knowledge, the first open-source implementation of a bootstrapping procedure for the full-RNS variant of the CKKS scheme.