1 Introduction

Homomorphic encryption has been an area of active research since the first design of a Fully Homomorphic Encryption (FHE) scheme by Gentry [9]. FHE allows performing arbitrary secure computations over encrypted sensitive data without ever decrypting them. One of the potential applications is to outsource computations to a public cloud without compromising data privacy.

A salient property of contemporary FHE schemes is that ciphertexts are “noisy”, where the noise increases with every homomorphic operation, and decryption starts failing once the noise becomes too large. This is addressed by setting the parameters large enough to accommodate some level of noise, and using Gentry’s “bootstrapping” technique to reduce the noise once it gets too close to the decryption-error level. However, the large parameters make homomorphic computations quite slow, and so significant effort was devoted to constructing more efficient schemes. Two of the most promising schemes in terms of practical performance have been the BGV scheme of Brakerski, Gentry and Vaikuntanathan [6], and the Fan-Vercauteren variant of Brakerski’s scale-invariant scheme [5, 8], which we call here the BFV scheme. Both of these schemes rely on the hardness of the Ring Learning With Errors (RLWE) problem.

Both schemes manipulate elements in large cyclotomic rings, modulo integers with many hundreds of bits. Implementing the necessary multi-precision modular arithmetic is expensive, and one way of making it faster is to use a “Residue Number System” (RNS) to represent the big integers. Namely, the big modulus q is chosen as a smooth integer, \(q=\prod _i q_i\), where the factors \(q_i\) are same-size, pairwise coprime, single-precision integers (typically of size 30–60 bits). Using the Chinese Remainder Theorem (CRT), an integer \(x\in \mathbb {Z}_q\) can be represented by its CRT components \(\{x_i= x\bmod q_i\in \mathbb {Z}_{q_i}\}_i\), and operations on x in \(\mathbb {Z}_q\) can be implemented by applying the same operations to each CRT component \(x_i\) in its own ring \(\mathbb {Z}_{q_i}\).

Unfortunately, both BGV and BFV feature some scaling operations that cannot be directly implemented on the CRT components. In both schemes there is sometimes a need to interpret \(x\in \mathbb {Z}_q\) as a rational number (say in the interval \([-q/2,q/2)\)) and then either lift x to a larger ring \(\mathbb {Z}_Q\) for \(Q>q\), or to scale it down and round to get \(y=\left\lceil \delta x \right\rfloor \in \mathbb {Z}_t\) (for some \(\delta \ll 1\) and accordingly \(t\ll q\)). These operations seem to require that x be translated from its CRT representation back to standard “positional” representation, but computing these translations back and forth will negate the gains from using RNS to begin with.

While implementations of the BGV scheme using CRT representation are known (e.g., [10, 13]), implementing BFV in this manner seems harder. One difference is that BFV features more of these scaling operations than BGV. Another is that in BGV numbers are typically scaled by just single-precision factors, while in BFV these factors are often big, of order similar to the multi-precision modulus q. An implementation of the BFV scheme using CRT representation was recently reported by Bajard et al. [3], featuring significant speedup as compared to earlier implementations such as in [16]. This implementation, however, uses somewhat complex procedures, and moreover these procedures incur an increase in the ciphertext noise.

In the current work we propose simpler and more efficient procedures for the CRT-based scaling and lifting as compared to the procedures in [3]. The same techniques are also applicable to other scale-invariant homomorphic encryption schemes, such as YASHE and YASHE’ [4], and many other lattice-based cryptographic primitives that require CRT-based scaling.

We implemented our procedures in the PALISADE library [19]. We evaluate the runtime performance of decryption and homomorphic multiplication for multiplicative depths ranging from 1 to 100. For example, the runtimes for depth-20 decryption and homomorphic multiplication are 3.1 and 62 ms, respectively, which can already support outsourced-computing applications with latencies of up to a few seconds, even without bootstrapping.

Our Contributions. We propose new procedures for CRT basis extension and scaling in RNS using floating-point arithmetic for some intermediate computations. Our procedures have a low probability of introducing small approximation errors, but in the context of homomorphic operations these errors are inconsequential. As we explain in Sect. 4.5, they increase the ciphertext noise after homomorphic multiplications by at most 2 bits for any depth of the multiplication circuit (typically significantly less than 1 bit), and those contributions were not observable in our experiments. We apply these techniques to develop:

  • A BFV decryption procedure supporting CRT moduli up to 59 bits, using extended precision floating-point arithmetic natively available in x86 architectures.

  • A BFV homomorphic multiplication procedure that has practically the same noise requirements as the textbook BFV.

  • A multi-threaded CPU implementation of our BFV variant in PALISADE.

Comparison with the RNS Variant by Bajard et al. [3]. Al Badawi et al. [2] compare the complexity and performance of decryption and homomorphic multiplication in our variant and the one proposed in [3]. For the experimental comparison, they implemented both variants in PALISADE (CPU) and DSI_BFV (GPU). Their analysis shows that the practical (experimentally observed) noise growth of our variant is much lower. For instance, our variant supports a multiplicative depth of 35 at \(n=2^{15}\) and \(\log _2 q=600\), while the Bajard et al. variant supports only a depth of 26. They also demonstrate that the computational complexity and actual runtimes of our variant on both GPU and CPU are lower (even when the difference in noise growth is ignored) in all cases, except for decryption when the size of the CRT moduli is 60 bits. The combined effect of smaller noise growth and lower computational complexity is a speed-up of 2x or more in homomorphic multiplication for the same depth.

2 Notations and Basic Procedures

For an integer \(n\ge 2\), we identify below the ring \(\mathbb {Z}_n\) with its representation in the symmetric interval \(\mathbb {Z}\cap [-n/2, n/2)\). For an arbitrary real number x, we denote by \([x]_n\) the reduction of x into that interval (namely the real number \(x'\in [-n/2,n/2)\) such that \(x'-x\) is an integer divisible by n). We also denote by \(\left\lfloor x \right\rfloor \), \(\left\lceil x \right\rceil \), and \(\left\lceil x \right\rfloor \) the rounding of x to an integer down, up, and to the nearest integer, respectively. We denote vectors by boldface letters, and extend the notations \(\left\lfloor x \right\rfloor \), \(\left\lceil x \right\rceil \), \(\left\lceil x \right\rfloor \) to vectors element-wise.
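For concreteness, a minimal C++ helper for this reduction (our own illustrative sketch, not taken from any library) could look as follows:

```cpp
#include <cstdint>

// Symmetric modular reduction [x]_n: the representative of x modulo n
// that lies in the interval [-n/2, n/2).
std::int64_t symMod(std::int64_t x, std::int64_t n) {
  std::int64_t r = ((x % n) + n) % n;   // first reduce into [0, n)
  if (r >= (n + 1) / 2) r -= n;         // then shift the upper half down by n
  return r;
}
```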

Throughout this paper we fix a set of k co-prime moduli \(q_1, \ldots , q_k\) (all integers larger than 1), and let their product be \(q=\prod _{i=1}^k q_i\). For all \(i\in \{1,\ldots ,k\}\), we also denote

$$\begin{aligned} {q_i^*}= q/q_i \in \mathbb {Z}\quad \text {and}\quad {\tilde{q}_i}= {q_i^*}^{-1} \pmod {q_i} \in \mathbb {Z}_{q_i}, \end{aligned}$$
(1)

namely, \({\tilde{q}_i}\in \left[ -\frac{q_i}{2}, \frac{q_i}{2}\right) \) and \({q_i^*}\cdot {\tilde{q}_i}=1\pmod {q_i}\).

Complexity Measures. In our setting we always assume that the moduli \(q_i\) are single-precision integers (i.e. \(|q_i|<2^{63}\)), and that operations modulo \(q_i\) are inexpensive. We assign unit cost to mod-\(q_i\) multiplication and ignore additions, and analyze the complexity of our routines just by counting the number of multiplications. Our procedures also include floating-point operations, and here too we assign unit cost to floating-point multiplications and divisions (typically in “double float” format as per IEEE 754) and ignore additions.

2.1 CRT Representation

We denote the CRT representation of an integer \(x\in \mathbb {Z}_q\) relative to the CRT basis \(\{q_1,\ldots ,q_k\}\) by \(x\sim (x_1,\ldots ,x_k)\) with \(x_i = [x]_{q_i}\in \mathbb {Z}_{q_i}\). The formula expressing x in terms of the \(x_i\)’s is \(x = \sum _{i=1}^k x_i \cdot {\tilde{q}_i}\cdot {q_i^*}\pmod {q}\). This formula can be used in more than one way to “reconstruct” the value \(x\in \mathbb {Z}_q\) from the \(x_i\)’s. In this work we use in particular the following two facts:

$$\begin{aligned} x= & {} \big (\sum _{i=1}^k \underbrace{[x_i \cdot {\tilde{q}_i}]_{q_i} \cdot {q_i^*}}_{\in \mathbb {Z}_q}\big ) -\upsilon \cdot q~\text {for some}~\upsilon \in \mathbb {Z}, \end{aligned}$$
(2)
$$\begin{aligned} \text {and}\quad x= & {} \big (\sum _{i=1}^k \underbrace{x_i \cdot {\tilde{q}_i}\cdot {q_i^*}}_{\in [-\frac{q_iq}{4},\frac{q_iq}{4})}\big ) -\upsilon '\cdot q\quad \text {for some}~\upsilon '\in \mathbb {Z}. \end{aligned}$$
(3)

2.2 CRT Basis Extension

Let \(x\in \mathbb {Z}_q\) be given in CRT representation \((x_1,\ldots ,x_k)\), and suppose we want to extend the CRT basis by computing \([x]_p\in \mathbb {Z}_p\) for some other modulus \(p>1\). Using Eq. 2, we would like to compute \([x]_p = \left[ \big (\sum _{i=1}^k [x_i \cdot {\tilde{q}_i}]_{q_i} \cdot {q_i^*}\big ) - \upsilon \cdot q\right] _p.\) The main challenge here is to compute \(\upsilon \) (which is an integer in \(\mathbb {Z}_k\)). The formula for \(\upsilon \) is:

$$ \upsilon = \left\lceil \big (\sum _{i=1}^k [x_i \cdot {\tilde{q}_i}]_{q_i} \cdot {q_i^*}\big )/q \right\rfloor = \left\lceil \sum _{i=1}^k [x_i \cdot {\tilde{q}_i}]_{q_i} \cdot \frac{{q_i^*}}{q} \right\rfloor = \left\lceil \sum _{i=1}^k \frac{[x_i \cdot {\tilde{q}_i}]_{q_i}}{q_i} \right\rfloor . $$

To get \(\upsilon \), we compute for every \(i\in \{1,\ldots ,k\}\) the element \(y_i:=[x_i\cdot {\tilde{q}_i}]_{q_i}\) (using single-precision integer arithmetic), and next the rational number \(z_i := y_i/q_i\) (in floating-point). Then we sum up all the \(z_i\)’s and round them to get \(\upsilon \). Once we have the value of \(\upsilon \), as well as all the \(y_i\)’s, we can directly compute Eq. 2 modulo p to get \( [x]_p = \left[ \big (\sum _{i=1}^k y_i \cdot [{q_i^*}]_p\big ) - \upsilon \cdot [q]_p\right] _p. \)
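As a toy illustration of Eq. 2 and of the formula for \(\upsilon \) (small enough to check by hand), take \(q_1,q_2,q_3=3,5,7\) (so \(q=105\)) and \(x=2\). Then \(x\sim (-1,2,2)\), \({q_i^*}=(35,21,15)\), \({\tilde{q}_i}=(-1,1,1)\), and \(y_i=[x_i{\tilde{q}_i}]_{q_i}=(1,2,2)\). The sum \(\sum _i y_i{q_i^*}=35+42+30=107\), while \(\sum _i y_i/q_i=1/3+2/5+2/7\approx 1.019\), so \(\upsilon =1\) and indeed \(107-1\cdot 105=2=x\).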

In our setting p and the \(q_i\)’s are parameters that can be pre-processed. In particular we pre-compute all the values \([{q_i^*}]_p\)’s and \([q]_p\), so the last equation becomes just an inner-product of two \((k+1)\)-vectors in \(\mathbb {Z}_p\).
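For concreteness, the following single-coefficient C++ sketch follows the procedure just described. It is our own illustration rather than the PALISADE code; it assumes a compiler providing the __uint128_t extension (e.g., GCC or Clang), and it keeps residues in \([0,q_i)\) rather than in the symmetric interval, which does not change the answer: the rounding of the z-sum makes the output agree with the centered representative \(x\in [-q/2,q/2)\).

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// [a*b]_m for single-precision moduli, via a 128-bit intermediate product
// (__uint128_t is a GCC/Clang extension).
static u64 mulMod(u64 a, u64 b, u64 m) {
  return static_cast<u64>((__uint128_t)a * b % m);
}

// CRT basis extension (Sect. 2.2) for one coefficient.
//   xi[i]         = [x]_{q_i}                   (input residues, kept in [0, q_i))
//   qi[i]         = q_i
//   qiTilde[i]    = [ (q/q_i)^{-1} ]_{q_i}      (pre-computed)
//   qiStarModP[i] = [ q/q_i ]_p                 (pre-computed)
//   qModP         = [ q ]_p                     (pre-computed)
// Returns [x]_p for the centered representative x in [-q/2, q/2),
// up to the rare approximation error discussed in the text.
u64 crtBasisExtend(const std::vector<u64>& xi, const std::vector<u64>& qi,
                   const std::vector<u64>& qiTilde,
                   const std::vector<u64>& qiStarModP, u64 qModP, u64 p) {
  const std::size_t k = qi.size();
  std::vector<u64> yi(k);
  long double zSum = 0.0L;                           // the floating-point part
  for (std::size_t i = 0; i < k; ++i) {
    yi[i] = mulMod(xi[i], qiTilde[i], qi[i]);        // y_i = [x_i * qTilde_i]_{q_i}
    zSum += (long double)yi[i] / (long double)qi[i]; // z_i = y_i / q_i
  }
  const u64 v = (u64)std::llround(zSum);             // v = round(sum_i z_i)
  u64 acc = 0;
  for (std::size_t i = 0; i < k; ++i)                // inner product with the [q_i^*]_p table
    acc = (acc + mulMod(yi[i], qiStarModP[i], p)) % p;
  return (acc + p - mulMod(v, qModP, p)) % p;        // subtract v*[q]_p, result in [0, p)
}
```

In the actual implementation the modular multiplications use pre-computed NTL MulModPrecon constants (Sect. 5.2); the sketch uses a plain 128-bit product for clarity.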

Complexity Analysis. The computation of \(\upsilon \) requires k single-precision integer multiplications to compute the \(y_i\)’s, then k floating-point division operations to compute the \(z_i\)’s, and then some additions and one rounding operation. In total it takes k integer and \(k+1\) floating-point operations. When p is a single-precision integer, the last inner product takes \(k+1\) integer multiplications, so the entire procedure takes \(2k+1\) integer and \(k+1\) floating-point operations.

For larger p we may need to do \(k+1\) multi-precision multiplications, but we may be able to use CRT representation again. When \(p=\prod _{j=1}^{k'} p_j\) for single-precision co-prime \(p_j\)’s, we can compute \(\upsilon \) only once and then compute the last inner product for each \(p_j\) (provided that we pre-computed \([{q_i^*}]_{p_j}\)’s and \([q]_{p_j}\) for all i and j). The overall complexity in this case will be \(kk'+k+k'\) integer operations and \(k+1\) floating-point operations.

Correctness. The only source of errors in this procedure is the floating-point operations when computing \(\upsilon \): Instead of the exact values \(z_i=y_i/q_i\), we compute their floating-point approximations \(z^*_i\) (with error \(\epsilon _i\)), and so we obtain \(\upsilon ^*=\left\lceil \sum _i(z_i+\epsilon _i) \right\rfloor \) which may be different from \(\upsilon =\left\lceil \sum _i z_i \right\rfloor \).

Since the \(z_i\)’s are all in \([-\frac{1}{2},\frac{1}{2})\), then using IEEE 754 double floats we have that the \(\epsilon _i\)’s are bounded in magnitude by \(2^{-53}\), and therefore the overall magnitude of the error term \(\epsilon :=\sum \epsilon _i\) is bounded, \(|\epsilon |<k \cdot 2^{-53}\). If we assume \(k\le 32\), this gives us \(|\epsilon |<2^{-48}\). (Similarly, if we use single floats we get \(|\epsilon |<2^{-19}\).)

When applying the procedure above, we should generally check that the resulting \(\upsilon ^*\) that we get is outside the possible-error region \(\mathbb {Z}+\frac{1}{2}\pm \epsilon \). If \(\upsilon ^*\) falls in the error region, we can re-run this procedure using higher precision (and hence smaller \(\epsilon \)) until the result is outside the error region.

It turns out that for our use cases, we do not need to check for these error conditions, and can often get by with a rather low precision for this computation. One reason for this is that for our uses, even if we do incur a floating-point approximation error, it only results in a small contribution to ciphertext noise, which has no practical significance.

Moreover, we almost never see these approximation errors, because the value \(\sum _i z_i\) that we want to approximate equals x / q modulo 1. When we use that procedure in our implementation, we sometimes have (pseudo)random values of \(x\in \mathbb {Z}_q\), in which case the probability that the result falls in the error region is bounded by \(2|\epsilon |\). In other cases, we even have a guarantee that \(|x|\ll q\) (say \(|x|<q/4\)), so we know a-priori that the value will always fall outside of the error region. For more details, see Sects. 2.4 and 4.5.

Comparison to Other Approaches for Computing \(\upsilon \). Two exact approaches for computing \(\upsilon \) are presented in [22] and [15]. The first approach introduces an auxiliary modulus and performs the CRT computations both for p and the extra modulus, thus significantly increasing the number of integer operations and also increasing the implementation complexity [22]. The second approach computes successive fixed-point approximations until the computed value of \(\upsilon \) is outside the error region (in one setting) or computes the exact value (in another setting with higher complexity) [15]. Both of these techniques incur higher computational costs than our method.

2.3 Simple Scaling in CRT Representation

Let \(x\in \mathbb {Z}_q\) be given in CRT representation \((x_1,\ldots ,x_k)\), and let \(t\in \mathbb {Z}\) be an integer modulus \(t\ge 2\). We want to “scale down” x by a t / q factor, namely to compute the integer \(y=\left\lceil t/q \cdot x \right\rfloor \in \mathbb {Z}_t\). We do it using Eq. 3, as follows:

$$\begin{aligned} y:=\left\lceil \frac{t}{q} \cdot x \right\rfloor = \left\lceil \big (\sum _{i=1}^k x_i \cdot ({\tilde{q}_i}\cdot \frac{t}{q_i})\big ) \right\rfloor -\upsilon '\cdot t =\left[ \left\lceil \big (\sum _{i=1}^k x_i \cdot ({\tilde{q}_i}\cdot \frac{t}{q_i})\big ) \right\rfloor \right] _t. \end{aligned}$$
(4)

The last equation follows since the two sides are congruent modulo t and are both in the interval \([-t/2,t/2)\), hence they must be equal.

In our context, t and the \(q_i\)’s are parameters that we can pre-process (while the \(x_i\)’s are computed on-line). We pre-compute the rational numbers \(t{\tilde{q}_i}/ q_i \in [-t/2,t/2)\), separated into their integer and fractional parts:

$$ t{\tilde{q}_i}/q_i ~=~ \omega _i + \theta _i,\quad \text {with}\,\omega _i\in \mathbb {Z}_t ~\text {and}~\theta _i \in [-\frac{1}{2}, \frac{1}{2}). $$

With the \(\omega _i\)’s and \(\theta _i\) pre-computed, we take as input the \(x_i\)’s, compute the two sums \(w := \left[ \sum _i x_i \omega _i\right] _t\) and \(v := \left\lceil \sum _i x_i \theta _i \right\rfloor ,\) (using integer arithmetic for w and floating-point arithmetic for v), then output \([w+v]_t\).
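A single-coefficient C++ sketch of this procedure (again our own illustration, reusing the u64 and mulMod helpers from the sketch in Sect. 2.2; residues and the \(\omega _i\)’s are kept as representatives in \([0,q_i)\) and \([0,t)\), which does not affect the result modulo t):

```cpp
// Simple scaling (Sect. 2.3) for one coefficient: returns [ round(t/q * x) ]_t.
// (Uses the u64 / mulMod helpers from the sketch in Sect. 2.2.)
//   xi[i]    = [x]_{q_i}
//   omega[i] = [ round(t*qTilde_i/q_i) ]_t                 (pre-computed)
//   theta[i] = t*qTilde_i/q_i - round(t*qTilde_i/q_i)      (pre-computed, long double)
u64 simpleScale(const std::vector<u64>& xi, const std::vector<u64>& omega,
                const std::vector<long double>& theta, u64 t) {
  const std::size_t k = xi.size();
  u64 w = 0;
  long double vSum = 0.0L;
  for (std::size_t i = 0; i < k; ++i) {
    w = (w + mulMod(xi[i], omega[i], t)) % t;        // integer parts, accumulated mod t
    vSum += (long double)xi[i] * theta[i];           // fractional parts, in floating point
  }
  // Reduce vSum mod t before rounding (std::fmod is exact); this keeps the value
  // small and does not change round(vSum) modulo t.
  const long long v = std::llround(std::fmod(vSum, (long double)t));
  long long y = ((long long)w + v) % (long long)t;   // [w + v]_t
  if (y < 0) y += (long long)t;
  return (u64)y;
}
```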

Complexity Analysis. The procedure above takes k floating-point multiplications, some additions, and one rounding to compute v, and then an inner product mod t between two \((k+1)\)-vectors: the single-precision vector \((x_1,\ldots ,x_k,1)\) and the mod-t vector \((\omega _1,\ldots ,\omega _k,v)\). When the modulus t is a single-precision integer, the \(\omega _i\)’s are also single-precision integers, and hence the inner product takes k integer multiplications. The total complexity is therefore \(k+1\) floating-point operations and k integer modular multiplications.

For a larger t we may need to do O(k) multi-precision operations to compute the inner product. But in some cases we can also use CRT representation here: For \(t=\prod _{j=1}^{k'} t_j\) (with the \(t_j\)’s co-prime), we can represent each \(\omega _i\in \mathbb {Z}_t\) in the CRT basis \(\omega _{i,j}=[\omega _i]_{t_j}\). We can then compute the result y in the same CRT basis, \(y_j=[y]_{t_j}\) by setting \(w_j=\left[ \sum _i x_i \omega _{i,j}\right] _{t_j}\) for all j, and then \(y_j=[v+w_j]_{t_j}\). This will still take only \(k+1\) floating-point operations, but \(kk'\) modular multiplications.

Correctness. The only source of errors in this routine is the computation of \(v:=\left\lceil \sum _i x_i \theta _i \right\rfloor \): Since we only keep the \(\theta _i\)’s with limited precision, we need to worry about the error exceeding the precision. Let \(\tilde{\theta }_i\) be the floating-point values that we keep, while \(\theta _i\) are the exact values (\(\theta _i=t{\tilde{q}_i}/q_i-\omega _i\)) and \(\epsilon _i\) are the errors, \(\epsilon _i=\tilde{\theta }_i-\theta _i\). Since \(|\tilde{\theta }_i|\le \frac{1}{2}\), then for IEEE 754 double floats we have \(|\epsilon _i|<2^{-53}\). The value that our procedure computes for v is therefore \(\tilde{v} := \left\lceil \sum _i x_i(\theta _i+\epsilon _i) \right\rfloor \), which may be different from \(v:=\left\lceil \sum _i x_i\theta _i \right\rfloor \).

We can easily control the magnitude of the error term \(\sum x_i\epsilon _i\) by limiting the size of the \(q_i\)’s: Since \(|x_i|<q_i/2\) for all i, then \(\left| \sum _i x_i \epsilon _i\right| < 2^{-54}\cdot \sum _i q_i\). For example, if \(k< 32\), as long as all our moduli satisfy \(q_i\le 2^{47}< 2^{54}/4k\), we are ensured that \(|\sum x_i\epsilon _i|<1/4\).

If we use the extended double floating-point precision (“long double” in C/C++) natively supported by x86 architectures, which stores 64 bits in the significand as compared to 52 bits in the IEEE 754 double float, we can increase the upper bound for the moduli up to \(q_i\le 2^{59}\).
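As a quick environment check (our own snippet, not part of the scheme), the significand sizes of the two formats can be queried directly; on x86 Linux with g++, long double maps to the 80-bit x87 extended format:

```cpp
#include <iostream>
#include <limits>

int main() {
  // std::numeric_limits<>::digits counts significand bits including the leading bit:
  // 53 for IEEE 754 double (52 stored bits) and 64 for the x87 extended format.
  std::cout << "double:      " << std::numeric_limits<double>::digits      << " bits\n";
  std::cout << "long double: " << std::numeric_limits<long double>::digits << " bits\n";
  return 0;
}
```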

When using the scaling procedure for decryption, we can keep the value \(y'=t/q \cdot x\) (before rounding) close to an integer by controlling the ciphertext noise. For example, we can ensure that \(y'\) (and therefore also v) is within 1/4 of an integer, and thus if we also restrict the size of the \(q_i\)’s as above, then we always get the correct result. Using the scaling procedure in other settings may require more care; see the next section for a discussion.

2.4 Complex Scaling in CRT Representation

The scaling procedure above was made simpler by the fact that we scale by a t/q factor, where the original integer is in \(\mathbb {Z}_q\) and the result is computed modulo t. During homomorphic multiplication, however, we have a more complicated setting: There we have three parameters \(t, p, q\), where \(q=\prod _{i=1}^k q_i\) as before, we similarly have \(p=\prod _{j=1}^{k'} p_j\), and we know that p is co-prime with q and \(p\gg t\).

The input is \(x\in \mathbb {Z}\cap [-qp/2t, qp/2t)\subset \mathbb {Z}_{qp}\), represented in the CRT basis \(\{q_1,\ldots ,q_k,p_1,\ldots ,p_{k'}\}\). We need to scale it by a t / q factor and round, and we want the result modulo q in the CRT basis \(\{q_1,\ldots ,q_k\}\). Namely, we want to compute \(y:=\big [\left\lceil t/q\cdot x \right\rfloor \big ]_q\). This complex scaling is accomplished in two steps:

  1. First we essentially apply the CRT scaling procedure from Sect. 2.3 using \(q'=qp\) and \(t'=tp\), computing \(y':=[\left\lceil tp/qp \cdot x \right\rfloor ]_p\) (which we can think of as computing \(y'\) modulo tp and then discarding the mod-t CRT component). Note that since \(x\in [-qp/2t,qp/2t)\) then \(\left\lceil tp/qp \cdot x \right\rfloor \in [-p/2,p/2)\). Hence even though we computed \(y'\) modulo p, we know that \(y'=\left\lceil t/q \cdot x \right\rfloor \) without modular reduction.

  2. Having a representation of \(y'\) relative to CRT basis \(\{p_1,\ldots ,p_{k'}\}\), we extend this basis using the procedure from Sect. 2.2, adding \([y']_{q_i}\) for all the \(q_i\)’s. Then we just discard the mod-\(p_j\) CRT components, thus getting a representation of \(y=[y']_q\).

The second step is a straightforward application of the procedure from Sect. 2.2, but the first step needs some explanation. The input consists of the CRT components \(x_i=[x]_{q_i}\) and \(x'_j=[x]_{p_j}\), and we denote \(Q:=qp\), \(Q^*_i:=Q/q_i=q^*_ip\), \({Q'_j}^*:=Q/p_j=qp^*_j\), and also \(\tilde{Q}_i=[(Q^*_i)^{-1}]_{q_i}\) and \(\tilde{Q'}_j=[({Q'_j}^*)^{-1}]_{p_j}\). Then by Eq. 3 we have

$$ \frac{t}{q}\cdot x = \frac{t}{q}\big ( \sum _{i=1}^k x_i \tilde{Q}_i Q^*_i + \sum _{j=1}^{k'} x'_j\tilde{Q'}_j {Q'_j}^* - \upsilon 'Q\big ) = \sum _{i=1}^k x_i \cdot \frac{t\tilde{Q}_i p}{q_i} + \sum _{j=1}^{k'} x'_j \cdot t\tilde{Q'}_j p^*_j - t\upsilon 'p. $$

Reducing the above expression modulo any one of the \(p_j\)’s, all but one of the terms in the second sum drop out (as well as the term \(t\upsilon 'p\)), and we get:

$$\textstyle [\left\lceil t/q \cdot x \right\rfloor ]_{p_j} = \left[ \left\lceil \sum _{i=1}^k x_i \cdot \frac{t\tilde{Q}_i p}{q_i} \right\rfloor + x'_j \cdot [t\tilde{Q'}_j p^*_j]_{p_j}\right] _{p_j}. $$

As in Sect. 2.3, we pre-compute all the values \(\frac{t\tilde{Q}_i p}{q_i}\), breaking them into their integral and fractional parts, \(\frac{t\tilde{Q}_i p}{q_i}=\omega '_i+\theta '_i\) with \(\omega '_i\in \mathbb {Z}_p\) and \(\theta '_i\in [-\frac{1}{2},\frac{1}{2})\). We store all the \(\theta '_i\)’s as double (or extended double) floats, for every ij we store the single-precision integer \(\omega '_{i,j}=[\omega '_i]_{p_j}\), and for every j we also store \(\lambda _j:=[t\tilde{Q'}_j p^*_j]_{p_j}\). Then given the integer x, represented as \(x\sim (x_1,\ldots ,x_k,x'_1,\ldots ,x'_{k'})\), we compute

$$\textstyle v:=\left\lceil \sum _i \theta '_i x_i \right\rfloor ,~\text {and for all}~j~ w_j:=\big [\lambda _j x'_j + \sum _i \omega '_{i,j} x_i\big ]_{p_j}~ \text {and}~ y'_j:= \big [v+w_j]_{p_j}. $$

Then we have \(y'_j=[\left\lceil t/q \cdot x \right\rfloor ]_{p_j}\), and we return \(y' \sim \{y'_1,\ldots ,y'_{k'}\}\in \mathbb {Z}_p\).
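A single-coefficient C++ sketch of this first step (our own illustration, reusing the helpers from the Sect. 2.2 sketch; the table names mirror the pre-computed quantities above, and residues are kept as non-negative representatives, which does not change the result modulo the \(p_j\)’s):

```cpp
// First step of complex scaling (Sect. 2.4): given x ~ (x_1..x_k, x'_1..x'_{k'}),
// return y' ~ (y'_1,...,y'_{k'}) with y'_j = [ round(t/q * x) ]_{p_j}.
// (Uses the u64 / mulMod helpers from the sketch in Sect. 2.2.)
//   thetaP[i]    = fractional part of t*QTilde_i*p/q_i              (pre-computed)
//   omegaP[i][j] = [ integer part of t*QTilde_i*p/q_i ]_{p_j}       (pre-computed)
//   lambda[j]    = [ t*QTilde'_j*pStar_j ]_{p_j}                    (pre-computed)
std::vector<u64> complexScaleStep1(const std::vector<u64>& xi,   // [x]_{q_i}
                                   const std::vector<u64>& xpj,  // [x]_{p_j}
                                   const std::vector<long double>& thetaP,
                                   const std::vector<std::vector<u64>>& omegaP,
                                   const std::vector<u64>& lambda,
                                   const std::vector<u64>& pj) {
  const std::size_t k = xi.size(), kp = pj.size();
  long double vSum = 0.0L;
  for (std::size_t i = 0; i < k; ++i)
    vSum += (long double)xi[i] * thetaP[i];                      // v = round(sum_i theta'_i x_i)
  std::vector<u64> yp(kp);
  for (std::size_t j = 0; j < kp; ++j) {
    u64 w = mulMod(lambda[j], xpj[j], pj[j]);                    // lambda_j * x'_j
    for (std::size_t i = 0; i < k; ++i)
      w = (w + mulMod(omegaP[i][j], xi[i], pj[j])) % pj[j];      // + sum_i omega'_{i,j} x_i
    const long long v = std::llround(std::fmod(vSum, (long double)pj[j]));
    long long y = ((long long)w + v) % (long long)pj[j];         // y'_j = [v + w_j]_{p_j}
    if (y < 0) y += (long long)pj[j];
    yp[j] = (u64)y;
  }
  return yp;
}
```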

Correctness. When computing the value \(v=\left\lceil \sum _i \theta '_i x_i \right\rfloor \), we can bound the floating-point inaccuracy before rounding below 1/4, just as in the simple scaling procedure from Sect. 2.3. However, when we use complex scaling during homomorphic multiplication, we do not have the guarantee that the exact value before rounding is close to an integer, and so we may encounter rounding errors where instead of rounding to the nearest integer, we will round to the second nearest. Contrary to the case of decryption, here such “rounding errors” are perfectly acceptable, as the rounding error is only added to the ciphertext noise.

We remark also that in the second CRT basis extension (from \(\mathbb {Z}_{p}\) to \(\mathbb {Z}_{pq}\), before discarding the mod-p components), we regain the guarantee that the exact value before rounding is close to an integer: This is because the value that we seek before rounding is \(v=x/p \pmod {1}\), we have the guarantee that \(|x|\le q/2\), and our parameter choices imply that \(p>q\) (by a substantial margin). Since \(|\frac{x}{p}|\le \frac{q}{2p}\ll \frac{1}{2}\), we are ensured to land outside of the error region of \(\mathbb {Z}+\frac{1}{2}\pm \epsilon \). See Sect. 4.5 for more details of our parameter choices.

Complexity Analysis. The complexity of the first step above where we compute \(y'=[\left\lceil t/q \cdot x \right\rfloor ]_{p}\), is similar to the simple scaling procedure from Sect. 2.3. Namely we have \(k+1\) floating-point operations when computing v, and then for each modulus \(p_j\) we have \(k+1\) single-precision modular multiplications to compute \(w_j\). Hence the total complexity of this step is \(k+1\) floating-point operations and \(k'(k+1)\) modular multiplications.

The complexity of the CRT basis extension in the second step follows Sect. 2.2 with the roles of the \(q_i\)’s and \(p_j\)’s swapped: \(k'+1\) floating-point operations and \(k(k'+1)+k'\) single-precision modular multiplications. Hence the total complexity of complex scaling is \(k+k'+2\) floating-point operations and \(2k'(k+1)+k\) modular multiplications.

3 Background: Scale-Invariant Homomorphic Encryption

For self-containment we briefly sketch Brakerski’s “scale-invariant” homomorphic encryption scheme from [5]. Then we discuss the Fan-Vercauteren variant of the scheme and some optimizations due to Bajard et al. [3].

3.1 Brakerski’s Scheme

The starting point for Brakerski’s scheme is Regev’s encryption scheme [21], with plaintext space \(\mathbb {Z}_t\) for some modulus \(t>1\), where secret keys and ciphertexts are dimension-n vectors in \(\mathbb {Z}_q^n\) for some other modulus \(q\gg t\). (Throughout this section we assume for simplicity of notation that q is divisible by t. It is well known that this condition is superfluous, however, and replacing q/t by \(\left\lceil q/t \right\rfloor \) everywhere works just as well.)

The decryption invariant of this scheme is that a ciphertext \(\mathbf{ct }\), encrypting a message \(m\in \mathbb {Z}_t\) relative to secret key \(\mathbf{sk }\), satisfies

$$ \left[ \left\langle {\mathbf{sk },\mathbf{ct }}\right\rangle \right] _q=m\cdot q/t + e, ~\text {for a small noise term}~|e|\ll q/t, $$

where \(\left\langle {\cdot ,\cdot }\right\rangle \) denotes inner product. Decryption is therefore implemented by setting \(m := \big [\left\lceil \frac{t}{q}\cdot [\left\langle {\mathbf{sk },\mathbf{ct }}\right\rangle ]_q \right\rfloor \big ]_t\). Homomorphic addition of two ciphertext vectors \(\mathbf{ct }_1,\mathbf{ct }_2\) consists of just adding the two vectors over \(\mathbb {Z}_q\), and has the effect of adding the plaintexts and also adding the two noise terms. Homomorphic multiplication is more involved, consisting of the following parts:

Key Generation. In Brakerski’s scheme, the secret key \(\mathbf{sk }\) must also be small, namely \(\Vert \mathbf{sk }\Vert \ll q/t\). Moreover, the public key includes a “relinearization gadget”, consisting of \(\log q\) matrices \(W_i\in \mathbb {Z}_q^{n\times n^2}\). Denoting the tensor product of \(\mathbf{sk }\) with itself (over \(\mathbb {Z}\)) by \(\mathbf{sk }^*=\mathbf{sk }\otimes \mathbf{sk }\in \mathbb {Z}^{n^2}\), the relinearization matrices satisfy

$$ \left[ \mathbf{sk }\times W_i\right] _q= 2^i\mathbf{sk }^* + \mathbf {e}_i^*, ~\text {for a small noise term}~\Vert \mathbf {e}_i^*\Vert \ll q/t. $$

Homomorphic Multiplication. Let \(\mathbf{ct }_1,\mathbf{ct }_2\) be two ciphertexts, satisfying the decryption invariant \(\left[ \left\langle {\mathbf{sk },\mathbf{ct }_i}\right\rangle \right] _q=m_i\cdot q/t + e_i\). The multiplication consists of:

  1. Tensoring. Taking the tensor product \(\mathbf{ct }_1\otimes \mathbf{ct }_2\) without modular reduction, then scaling down by t/q, hence getting \(\mathbf{ct }^*:=\big [\left\lceil t/q \cdot \mathbf{ct }_1\otimes \mathbf{ct }_2 \right\rfloor \big ]_q\).

  2. Relinearization. Decomposing \(\mathbf{ct }^*\) into bits \(\mathbf{ct }^*_i\in \{0,1\}^{n^2}\) (where \(\mathbf{ct }^*=\sum _i2^i\mathbf{ct }^*_i\)), then setting \(\mathbf{ct }^{\times }:=[\sum _i W_i\times \mathbf{ct }^*_i]_q\).

To see that \(\mathbf{ct }^{\times }\) is indeed an encryption of the product \(m_1m_2\) relative to \(\mathbf{sk }\), denote the rational vector before rounding by \(\mathbf{ct }'= t/q \cdot \mathbf{ct }_1\otimes \mathbf{ct }_2\), and the rounding error by \(\mathbf {\epsilon }\) (so \(\mathbf{ct }^*=\mathbf {\epsilon }+\mathbf{ct }'+q\cdot \text {something}\)), and we have

$$\begin{aligned} \left\langle {\mathbf{sk }^*,\mathbf{ct }'}\right\rangle= & {} \left\langle {\mathbf{sk }\otimes \mathbf{sk }, \frac{t}{q} \mathbf{ct }_1\otimes \mathbf{ct }_2}\right\rangle = t/q \cdot (\left\langle {\mathbf{sk },\mathbf{ct }_1}\right\rangle \cdot \left\langle {\mathbf{sk },\mathbf{ct }_2}\right\rangle )\\= & {} t/q \cdot (m_1\cdot q/t + e_1 + k_1q)(m_2\cdot q/t + e_2 + k_2q)\\= & {} m_1m_2 \cdot q/t + \underbrace{e_1m_2+m_1e_2+e_1e_2t/q+t(k_1 e_2+k_2 e_1)}_{e'\ll q/t} + q\cdot \text {something}. \end{aligned}$$

Including the rounding error, and since \(\mathbf{sk }\) is small (and hence so is \(\mathbf{sk }^*\)), we get

$$\begin{aligned} \left\langle {\mathbf{sk }^*,\mathbf{ct }^*}\right\rangle = \left\langle {\mathbf{sk }^*,\mathbf {\epsilon }+\mathbf{ct }'+\mathbf {k^*}q}\right\rangle = m_1m_2 \cdot q/t + \underbrace{e'+\left\langle {\mathbf{sk }^*,\mathbf {\epsilon }}\right\rangle }_{e''\ll q/t} +q\cdot \text {something}, \end{aligned}$$
(5)

so \(\mathbf{ct }^*\) encrypts \(m_1m_2\) relative to \(\mathbf{sk }^*\). After relinearization, we have

$$\begin{aligned}&\left\langle {\mathbf{sk },\mathbf{ct }^{\times }}\right\rangle = \mathbf{sk }\times \sum _i W_i\times \mathbf{ct }^*_i = \sum _i\left\langle {(2^i\mathbf{sk }^* + \mathbf {e}_i^*), \mathbf{ct }^*_i}\right\rangle \\&= \big \langle {\mathbf{sk }^*, \sum _i 2^i\mathbf{ct }^*_i}\big \rangle + \sum _i \left\langle {\mathbf {e}_i^*, \mathbf{ct }^*_i}\right\rangle = m_1m_2 \cdot q/t + \underbrace{e'' + \sum _i \left\langle {\mathbf {e}_i^*, \mathbf{ct }^*_i}\right\rangle }_{\tilde{e}} \pmod {q}. \end{aligned}$$

Since the \(\mathbf{ct }^*_i\)’s are small then so is the noise term \(\tilde{e}\), as needed.

3.2 The Fan-Vercauteren Variant

In [8], Fan and Vercauteren ported Brakerski’s scheme to the ring-LWE setting, working over polynomial rings rather than over the integers. Below we let \(R=\mathbb {Z}[X]/\langle {f\left( X\right) }\rangle \) be a fixed ring, where \(f\in \mathbb {Z}[X]\) is a monic irreducible polynomial of degree n (typically an m-th cyclotomic polynomial \(\varPhi _m \left( x\right) \) of degree \(n = \phi \left( m\right) \)). We use some convenient basis to represent R over \(\mathbb {Z}\) (most often just the power basis, i.e., the coefficient representation of the polynomials). Also, let \(R_t=R/tR\) denote the quotient ring for an integer modulus \(t\in \mathbb {Z}\) in the same basis.

The plaintext space of this variant is \(R_t\) for some \(t>1\) (i.e., a polynomial of degree at most \(n-1\) with coefficients in \(\mathbb {Z}_t\)), the secret key is a 2-vector \(\mathbf{sk }=(1,s)\in R^2\) with \(\Vert s\Vert \ll q/t\), ciphertexts are 2-vectors \(\mathbf{ct }=(c_0,c_1)\in R_q^2\) for another modulus \(q\gg t\), and the decryption invariant is the same as in Brakerski’s scheme, namely \([\left\langle {\mathbf{sk },\mathbf{ct }}\right\rangle ]_q = [c_0+c_1s]_q = m \cdot \frac{q}{t} + e\) for a small noise term \(e\in R, \Vert e\Vert \ll q/t\), so that decryption computes \(m = [\left\lceil \frac{t}{q}[c_0+c_1s]_q \right\rfloor ]_t\).

For encryption, the public key includes a low-noise encryption of zero, \(\mathbf{ct }^0=(\mathbf{ct }^0_0,\mathbf{ct }^0_1)\), and to encrypt \(m\in R_t\) they choose low-norm elements \(u,e_0,e_1\in R\) and set \(Enc_{\mathbf{ct }^0}(m) := [u \cdot \mathbf{ct }^0 + (e_0,e_1) + (\varDelta m,0)]_q\), where \(\varDelta = \lfloor q/t \rfloor \). Homomorphic addition just adds the ciphertext vectors in \(R_q^2\), and homomorphic multiplication is the same as in Brakerski’s scheme, except (a) the special form of \(\mathbf{sk }\) lets them optimize the relinearization “matrices” and use vectors instead, and (b) they use base-w decomposition (for a suitable word-size w) instead of base-2 decomposition. In a little more detail:

  (a) For the secret-key vector \(\mathbf{sk }=(1,s)\), the tensor product \(\mathbf{sk }\otimes \mathbf{sk }\) can be represented by the 3-vector \(\mathbf{sk }^*=(1,s,s^2)\). Similarly, for the two ciphertexts \(\mathbf{ct }^i=(c^i_0,c^i_1)\) (\(i=1,2\)), it is sufficient to represent the tensor \(\mathbf{ct }_1\otimes \mathbf{ct }_2\) by the 3-vector \(\mathbf{ct }^*=(c^*_0,c^*_1,c^*_2)=[c^1_0c^2_0,\ (c^1_0c^2_1+c^1_1c^2_0),\ c^1_1c^2_1]_q\).

  (b) For the relinearization gadget, all they need is to “encrypt” the single element \(s^2\) using \(\mathbf{sk }\). When using a base-w decomposition, they have vectors (rather than matrices) \(W_i=(\beta _i,\alpha _i)\), with uniform \(\alpha _i\)’s and \(\beta _i=[w^i s^2 -\alpha _i s + e_i]_q\) (for low-norm noise terms \(e_i\)).

    After computing the three-vector \(\mathbf{ct }^*=(c^*_0,c^*_1,c^*_2)\) as above during homomorphic multiplication, they decompose \(c^*_2\) into its base-w digits, \(c^*_2=\sum _i w^ic^*_{2,i}\). Then computing \(\mathbf{ct }^{\times }=\sum _i W_i\times \mathbf{ct }^*_i\) only requires that they set

    $$\tilde{c}_0:=[\sum _{i=1}^k \beta _i c^*_{2,i}]_q,~ \tilde{c}_1:=[\sum _{i=1}^k \alpha _i c^*_{2,i}]_q,\quad \text {and then}~ \mathbf{ct }^{\times }:=[(c^*_0+\tilde{c}_0,c^*_1+\tilde{c}_1)]_q. $$

3.3 CRT Representation and Optimized Relinearization

Bajard et al. described in [3] several optimizations of the Fan-Vercauteren variant, centered around the use of CRT representation of the large integers involved. (They called it a Residue Number System, or RNS, but in this writeup we prefer the term CRT representation.) Specifically, the modulus q is chosen as a product of same-size, pairwise coprime, single-precision moduli, \(q=\prod _{i=1}^k q_i\), and each element \(x\in \mathbb Z_q\) is represented by the vector \((x_i=[x]_{q_i})_{i=1}^k\).

One significant optimization from [3] relates to the relinearization step in homomorphic multiplication. Recall that in that step we decompose the ciphertext \(\mathbf{ct }^*\) into low-norm components \(\mathbf{ct }^*_i\), such that reconstructing \(\mathbf{ct }^*\) from the \(\mathbf{ct }^*_i\)’s is a linear operation, namely \(\mathbf{ct }^*=\sum _i \tau _i\mathbf{ct }^*_i\) for some known coefficients \(\tau _i\). Instead of decomposing \(\mathbf{ct }^*\) into bits or digits, Bajard et al. suggested using its CRT components \(\mathbf{ct }^*_i = [\mathbf{ct }^* \tilde{q_i}]_{q_i}\) and secret key components \(s^2_i = [s^2 \, q_i^*]_{q}\) when computing the relinearization key, and relying on the reconstruction from Eq. 3 (which is linear).

We remark that it is more efficient to use the CRT components \(\mathbf{ct }^*_i = [\mathbf{ct }^*]_{q_i}\) and secret key components \(s^2_i = [s^2 \tilde{q_i} q_i^*]_{q}\). The latter corresponds to \([s^2]_{q_i}\) for the i-th modulus and 0’s for all other moduli. This optimization removes one scalar multiplication in each \(\mathbf{ct }^*_i\) term (as compared to [3]) and eliminates the need for any precomputed parameters in the relinearization procedure.

As in [3], we also apply digit decomposition to the residues, thus allowing a more granular control of noise growth at small multiplicative depths. A detailed discussion of this technique is provided in Appendix B.1 of [3].

4 Our Optimizations

4.1 The Scheme that We Implemented

The scheme that we implemented is the Fan-Vercauteren variant of Brakerski’s scheme (we refer to this variant as the “BFV scheme”), with a modified CRT-based relinearization step of Bajard et al. We begin with a concrete stand-alone description of the functions that we implemented, then describe our simpler/faster CRT-based implementation of these functions.

Parameters. Let \(t,m,q\in \mathbb {Z}\) be parameters (where the single-precision t determines the plaintext space, and m, |q| depend on t and the security parameter), such that \(q=\prod _{i=1}^k q_i\) for same-size, pairwise coprime, single-precision moduli \(q_i\).

Let \(n=\phi (m)\), and let \(R=\mathbb {Z}[X]/\varPhi _m(X)\) be the m-th cyclotomic ring, and denote by \(R_q=R/qR\) and \(R_t=R/tR\) the quotient rings. In our implementation we represent elements in \(R,R_q,R_t\) in the power basis (i.e., polynomial coefficients), but note that other “small bases” are possible (such as the decoding basis from [18]), and for non-power-of-two cyclotomics they could sometimes result in better parameters. We let \(\chi _e, \chi _k\) be distributions over low-norm elements in R in the power basis, specifically we use discrete Gaussians for \(\chi _e\) and the uniform distribution over \(\{-1,0,1\}^n\) for \(\chi _k\).

Key Generation. For the secret key, choose a low-norm secret key \(s\leftarrow \chi _k\) and set \(\mathbf{sk }:=(1,s)\in R^2\). For the public encryption key, choose a uniform random \(a\in R_q\) and \(e\leftarrow \chi _e\), set \(b :=\left[ -(as+e)\right] _q \in R_q\), and compute \(\mathbf{pk }:=(b,a)\).

Recall that we denote \({q_i^*}=\frac{q}{q_i}\) and \({\tilde{q}_i}=\left[ {q_i^*}^{-1}\right] _{q_i}\). For relinearization, choose a uniform \(\alpha _i\in R_q\) and \(e_i\leftarrow \chi _e\), and set \(\beta _i=[{\tilde{q}_i}{q_i^*}s^2 - \alpha _i s + e_i]_{q}\) for each \(i=1,\ldots ,k\). The public key consists of \(\mathbf{pk }\) and all the vectors \(W_i:=(\beta _i,\alpha _i)\).

Encryption. To encrypt \(m\in R_t\), choose \(u\leftarrow \chi _k\) and \(e'_0,e'_1\leftarrow \chi _e\) and output the ciphertext \(\mathbf{ct }:=[u\cdot \mathbf{pk }+(e'_0,e'_1)+(\varDelta m,0)]_q\), where \(\varDelta =\lfloor q/t \rfloor \).

Decryption. For a ciphertext \(\mathbf{ct }=(c_0,c_1)\), compute \(x:=[\left\langle {\mathbf{sk },\mathbf{ct }}\right\rangle ]_q=[c_0+c_1 s]_q\) and output \(m:=[\left\lceil x\cdot t/q \right\rfloor ]_t\).

Homomorphic Addition. On input \(\mathbf{ct }^1\), \(\mathbf{ct }^2\), output \([\mathbf{ct }^1+\mathbf{ct }^2]_q\).

Homomorphic Multiplication. Given \(\mathbf{ct }^i=(c^i_0,c^i_1)_{i=1,2}\), do the following:

  1. Tensoring: Compute \(c'_0:=c^1_0 c^2_0\), \(c'_1:=c^1_0 c^2_1 + c^1_1 c^2_0\), \(c'_2:=c^1_1 c^2_1\in R\) without modular reduction, then set \(c^*_i=[\left\lceil t/q\cdot c'_i \right\rfloor ]_q\) for \(i=0,1,2\).

  2. Relinearization: Decompose \(c^*_2\) into its CRT components \(c^*_{2,i}=[c^*_2]_{q_i}\), set \(\tilde{c}_0:=[\sum _{i=1}^k \beta _ic^*_{2,i}]_q\), \(\tilde{c}_1:=[\sum _{i=1}^k \alpha _ic^*_{2,i}]_q\), output \(\mathbf{ct }^{\times }:=[(c^*_0+\tilde{c}_0,c^*_1+\tilde{c}_1)]_q\).

4.2 Pre-computed Values

When setting the parameters, we pre-compute some tables to help speed things up later. Specifically:

  • We pre-compute and store all the values that are needed for the simple CRT scaling procedure in Sect. 2.3: For each \(i=1,\ldots ,k\), we compute the rational number \(t{\tilde{q}_i}/q_i\), split into integral and fractional parts. Namely, \(\omega _i:=\left\lceil t\cdot \frac{{\tilde{q}_i}}{q_i} \right\rfloor \in \mathbb {Z}_t~\text {and}~\theta _i:=\frac{t\cdot {\tilde{q}_i}}{q_i} - \omega _i\in [-\frac{1}{2},\frac{1}{2}). \) We store \(\omega _i\) as a single-precision integer and \(\theta _i\) as a double (or long double) float.

  • We also choose a second set of single-precision coprime numbers \(\{p_j\}_{j=1}^{k'}\) (coprime to all the \(q_i\)’s), such that \(p:=\prod _j p_j\) is bigger than q by a large enough margin. Specifically we will need to ensure that for \(c^1_0,c^1_1,c^2_0,c^2_1\in R\) with coefficients in \([-q/2,q/2)\), the element \(c^*:=c^1_0c^2_1+c^1_1c^2_0\in R\) (without modular reduction) has coefficients in the range \([-qp/2t,qp/2t)\). For our setting of parameters, where all the \(q_i\)’s and \(p_j\)’s are 55-bit primes and t is up to 32 bits, it is sufficient to take \(k'=k+1\). For smaller CRT primes or larger values of t, a higher value of \(k'\) may be needed. (A back-of-the-envelope check of this requirement is sketched right after this list.) Below we denote for all j, \({p_j^*}:=p/p_j\) and \({\tilde{p}_j}:=[({p_j^*})^{-1}]_{p_j}\). We also denote \(Q:=qp\), and for every i, j we have \(Q^*_i:=Q/q_i=q^*_ip\), \({Q'_j}^*:=Q/p_j=qp^*_j\), and also \(\tilde{Q}_i=[(Q^*_i)^{-1}]_{q_i}\) and \(\tilde{Q'}_j=[({Q'_j}^*)^{-1}]_{p_j}\).

  • We pre-compute and store all the values that are needed in the procedure from Sect. 2.2 to extend the CRT basis \(\{q_1,\ldots ,q_k\}\) by each of the \(p_j\)’s, as well the values that are needed to extend the CRT basis \(\{p_1,\ldots ,p_{k'}\}\) by each of the \(q_i\)’s. Namely for all ij we store the single-precision integers \(\mu _{i,j}=[{q_i^*}]_{p_j}\) and \(\nu _{i,j}=[{p_j^*}]_{q_i}\), as well as \(\phi _j=[q]_{p_j}\) and \(\psi _i=[p]_{q_i}\).

  • We also pre-compute and store all the values that are needed for the complex CRT scaling procedure in Sect. 2.4. Namely, we pre-compute all the values \(\frac{t\tilde{Q}_i p}{q_i}\), breaking them into their integral and fractional parts, \(\frac{t\tilde{Q}_i p}{q_i}=\omega '_i+\theta '_i\) with \(\omega '_i\in \mathbb {Z}_p\) and \(\theta '_i\in [-\frac{1}{2},\frac{1}{2})\). We store all the \(\theta '_i\)’s as double (or long double) floats, for every ij we store the single-precision integer \(\omega '_{i,j}=[\omega '_i]_{p_j}\), and for every j we also store \(\lambda _j:=[t\tilde{Q'}_j p^*_j]_{p_j}\).
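As a back-of-the-envelope check of the requirement on p from the second bullet above (our own estimate, using the worst-case expansion factor \(\delta \le n\) discussed in Sect. 5.1): each coefficient of a product of two ring elements with coefficients in \([-q/2,q/2)\) is bounded in magnitude by \(n(q/2)^2\), so

$$ \Vert c^1_0c^2_1+c^1_1c^2_0\Vert _\infty \le 2\cdot n\cdot (q/2)^2 = nq^2/2, $$

and the condition \(nq^2/2 < qp/2t\) amounts to \(p > tnq\). The lower bound \(p\ge 2tnq\) used in Sect. 5.1 satisfies this with a factor-2 margin. With 55-bit \(q_i\)’s and \(p_j\)’s and \(k'=k+1\) we get \(p/q\approx 2^{55}\), which exceeds \(2tn\) for any t of up to 32 bits and any \(n\le 2^{17}\).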

4.3 Key-Generation and Encryption

The key-generation and encryption procedures are implemented in a straightforward manner. Small integers such as noise and key coefficients are drawn from \(\chi _e\) or \(\chi _k\) and stored as single-precision integers, while uniform elements of \(\mathbb {Z}_q\) (the coefficients of \(a\leftarrow R_q\)) are chosen directly in the CRT basis by drawing uniform values \(a_i\in \mathbb {Z}_{q_i}\) for all i.

Operations in \(R_q\) are implemented directly in CRT representation, often requiring the computation of the number-theoretic-transform (NTT) modulo the separate \(q_i\)’s. The only operations that require computations outside of \(R_q\) are decryption and homomorphic multiplications, as described next.

4.4 Decryption

Given the ciphertext \(\mathbf{ct }=(c_0,c_1)\) and secret key \(\mathbf{sk }=(1,s)\), we first compute the inner product in \(R_q\), setting \(x:=[c_0+c_1 s]_q\). We obtain the result in coefficient representation relative to the CRT basis \(q_1,\ldots ,q_k\). Namely for each coefficient of x (call it \(x_\ell \in \mathbb {Z}_q\)) we have the CRT components \(x_{\ell ,i}=[x_\ell ]_{q_i}\), \(i=1,\ldots ,k,\ell =0,\ldots ,n-1\).

We then apply to each coefficient \(x_\ell \) the simple scaling procedure from Sect. 2.3. This yields the scaled coefficients \(m_\ell =[\left\lceil t/q \cdot x_\ell \right\rfloor ]_t\), representing the element \(m=[\left\lceil t/q \cdot x \right\rfloor ]_t\in R_t\), as needed.
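Putting the pieces together, the scaling part of decryption is one call to the simple-scaling procedure per coefficient. The following C++ sketch (our own illustration, reusing the simpleScale sketch from Sect. 2.3) assumes that xCRT[i][l] already holds \([x_\ell ]_{q_i}\) as produced by the NTT-based computation of \([c_0+c_1s]_q\):

```cpp
// Decryption scaling (Sect. 4.4): apply the simple scaling of Sect. 2.3 to every
// coefficient of x = [c_0 + c_1*s]_q, given in CRT form.
// (Uses the simpleScale sketch from Sect. 2.3; xCRT[i][l] = [x_l]_{q_i}.)
std::vector<u64> decryptScale(const std::vector<std::vector<u64>>& xCRT,
                              const std::vector<u64>& omega,
                              const std::vector<long double>& theta, u64 t) {
  const std::size_t k = xCRT.size(), n = xCRT[0].size();
  std::vector<u64> m(n), xl(k);
  for (std::size_t l = 0; l < n; ++l) {              // this loop is parallelized over
    for (std::size_t i = 0; i < k; ++i)              // coefficients in Sect. 5.2
      xl[i] = xCRT[i][l];                            // gather the CRT components of x_l
    m[l] = simpleScale(xl, omega, theta, t);         // m_l = [ round(t/q * x_l) ]_t
  }
  return m;
}
```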

As we explained in Sect. 2.3, in the context of decryption we can ensure correctness by controlling the noise to guarantee that each \(t/q \cdot x_\ell \) is within 1/4 of an integer, and limit the size of the \(q_i\)’s to 59 bits to ensure that the error is bounded below 1/4.

Decryption Complexity. The dominant factor in decryption is NTTs modulo the individual \(q_i\)’s, that are used to compute the inner product \(x:=[c_0+c_1 s]_q\in R_q\). Specifically we need 2k of them, k in the forward direction (one for each \([c_1]_{q_i}\)) and k inverse NTTs (one for each \([c_1s]_{q_i}\)). These operations require \(O(kn\log n)\) single-precision modular multiplications, where \(n=\phi (m)\) is the degree of the polynomials and k is the number of moduli \(q_i\). Once this computation is done, the simple CRT scaling procedure takes \((k+1)n\) floating-point operations and kn integer multiplications modulo t.

4.5 Homomorphic Multiplication

The input to homomorphic multiplication is two ciphertexts \(\mathbf{ct }^1=(c^1_0,c^1_1),\mathbf{ct }^2=(c^2_0,c^2_1)\), where each \(c^a_b\in R_q\) is represented in the power basis with each coefficient represented in the CRT basis \(\{q_i\}_{i=1}^k\). The procedure consists of three steps, where we first compute the “double-precision” elements \(c'_0,c'_1,c'_2\in R\), then scale them down to get \(c^*_i:=[\left\lceil t/q\cdot c'_i \right\rfloor ]_q\), and finally apply relinearization.

Multiplication with Double Precision. We begin by extending the CRT basis using the procedure from Sect. 2.2. For each coefficient x in any of the \(c^a_b\)’s, we are given the CRT representation \((x_1,\ldots ,x_k)\) with \(x_i=[x]_{q_i}\) and compute also the CRT components \((x'_1,\ldots ,x'_{k'})\) with \(x'_j=[x]_{p_j}\). This gives us a representation of the same integer x, in the larger ring \(\mathbb {Z}_{qp}\), which in turn yields a representation of the \(c^a_b\)’s in the larger ring \(R_{qp}\).

Next we compute the three elements \(c'_0:=[c^1_0 c^2_0]_{pq}\), \(c'_1:=[c^1_0 c^2_1 + c^1_1 c^2_0]_{pq}\) and \(c'_2:=[c^1_1 c^2_1]_{pq}\), where all the operations are in the ring \(R_{qp}\). By our choice of parameters (with p sufficiently larger than q), we know that there is no modular reduction in these expressions, so in fact we obtain \(c'_0,c'_1,c'_2\in R\). These elements are represented in the power basis, with each coefficient \(x\in \mathbb {Z}_{qp}\) represented by \((x_1,\ldots ,x_k,x'_1,\ldots ,x'_{k'})\) with \(x_i=[x]_{q_i}\) and \(x'_j=[x]_{p_j}\).

Scaling Back Down to \(R_q\). By our choice of parameters, we know that all the coefficients of the \(c'_\ell \)’s are integers in the range \([-qp/2t,qp/2t)\), as needed for the complex CRT scaling procedure from Sect. 2.4. We therefore apply that procedure to each coefficient \(x\in \mathbb {Z}_{qp}\), computing \(x^*=[\left\lceil t/q \cdot x \right\rfloor ]_q\). This gives us the power-basis representation of the elements \(c^*_\ell =[\left\lceil t/q \cdot c'_\ell \right\rfloor ]_q\in R_q\) for \(\ell =0,1,2\).

Relinearization. For relinearization, we use a modification of the technique by Bajard et al. [3] discussed in Sect. 3.3. Namely, at this point we have the elements \(c^*_0, c^*_1, c^*_2\in R_q\) in CRT representation, \(c^*_{\ell ,i}=[c^*_\ell ]_{q_i}\) (for \(\ell =0,1,2\) and \(i=1,\ldots ,k\)). To relinearize, we use the relinearization gadget vectors \((\beta _i,\alpha _i)\) that were computed during key generation. For each \(q_i\), we first compute \(\tilde{c}_{0,i}:=\big [\sum _{j=1}^k [\beta _j]_{q_i}\cdot c^*_{2,j}\big ]_{q_i}\)and \(\tilde{c}_{1,i}:=\big [\sum _{j=1}^k [\alpha _j]_{q_i}\cdot c^*_{2,j}\big ]_{q_i}\), and then \(c^{\times }_{0,i}:=[c^*_{0,i}+\tilde{c}_{0,i}]_{q_i}\) and \(c^{\times }_{1,i}:=[c^*_{1,i}+\tilde{c}_{1,i}]_{q_i}\).

This gives the relinearized ciphertext \(\mathbf{ct }^{\times }=(c^{\times }_0,c^{\times }_1)\in R_q^2\), which is the output of the homomorphic multiplication procedure.
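A C++ sketch of this relinearization step (our own illustration, reusing the u64 and mulMod helpers from the Sect. 2.2 sketch). All ring elements are assumed to already be in evaluation (NTT) form, so that a product in \(R_{q_i}\) is coefficient-wise; c2eval[j][i] holds \(c^*_{2,j}\) transformed to evaluation form modulo \(q_i\), and betaEval[j][i], alphaEval[j][i] hold \([\beta _j]_{q_i}\), \([\alpha _j]_{q_i}\) in the same form:

```cpp
using Poly = std::vector<u64>;   // one length-n CRT component in evaluation form

// Relinearization (Sect. 4.5) in the CRT/evaluation domain.
//   c2eval[j][i]   : c*_{2,j} = [c*_2]_{q_j}, in evaluation form modulo q_i
//   betaEval[j][i] : [beta_j]_{q_i} in evaluation form (alphaEval analogously)
//   c0eval[i], c1eval[i] : c*_0, c*_1 modulo q_i; updated in place to c^x_0, c^x_1
void relinearize(const std::vector<std::vector<Poly>>& c2eval,
                 const std::vector<std::vector<Poly>>& betaEval,
                 const std::vector<std::vector<Poly>>& alphaEval,
                 std::vector<Poly>& c0eval, std::vector<Poly>& c1eval,
                 const std::vector<u64>& qi) {
  const std::size_t k = qi.size(), n = c0eval[0].size();
  for (std::size_t i = 0; i < k; ++i) {            // for each CRT modulus q_i
    for (std::size_t l = 0; l < n; ++l) {          // for each evaluation slot
      u64 acc0 = 0, acc1 = 0;
      for (std::size_t j = 0; j < k; ++j) {        // sum over the k CRT "digits" of c*_2
        acc0 = (acc0 + mulMod(betaEval[j][i][l],  c2eval[j][i][l], qi[i])) % qi[i];
        acc1 = (acc1 + mulMod(alphaEval[j][i][l], c2eval[j][i][l], qi[i])) % qi[i];
      }
      c0eval[i][l] = (c0eval[i][l] + acc0) % qi[i];   // c^x_0 = c*_0 + tilde{c}_0
      c1eval[i][l] = (c1eval[i][l] + acc1) % qi[i];   // c^x_1 = c*_1 + tilde{c}_1
    }
  }
}
```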

Correctness. Correctness of the CRT basis-extension and complex scaling procedures was discussed in Sects. 2.2 and 2.4, respectively. Though both CRT basis extension and scaling procedures may introduce some approximation errors due to the use of floating-point arithmetic, these errors only increase the ciphertext noise by a small (practically negligible) amount.

To illustrate the small contribution of approximation errors, consider the noise estimate for the original Brakerski’s scheme described in Sect. 3.1. (Similar arguments apply to any other scale-invariant scheme, including BFV and YASHE.) The approximation error in the CRT basis extension before the tensor product can change the value of \(\upsilon \) at most by one, with probability \(\approx 2^{-48}\). This means that the value of \(k_1\) or \(k_2\) may grow by one with the same probability, thus increasing the noise term \(t(k_1 e_2+k_2 e_1)\) in Eq. 5 to \(t((k_1+\epsilon _1) e_2+(k_2+\epsilon _2) e_1)\), where \(\epsilon _i \in \{0,1\}\) and \(\Pr [\epsilon _i\ne 0]\approx 2^{-47+\log n}\). Recall that \(k_i\approx \left\lceil \left\langle {\mathbf{sk },\mathbf{ct }_i}\right\rangle /q \right\rfloor \), so \(\Vert k_i\Vert _{\infty }\)’s are at least \(\sqrt{n}\). As n in all practical cases is typically above 1024 (and often much higher), the difference between \(k_1 e_2\,+\,k_2 e_1\) and \((k_1\,+\,\epsilon _1) e_2\,+\,(k_2\,+\,\epsilon _2) e_1\) is less than 3% (and even this only occurring with probability \(2^{-47+\log n}\)). In our experiments we never noticed this effect.

To study the effect of the approximation error introduced by scaling, we replace the term \(\mathbf{ct }^*=\mathbf {\epsilon }+\mathbf{ct }'+q\cdot \mathsf {something}\) for Brakerski’s scheme (Sect. 3.1) with \(\mathbf{ct }^*=\mathbf {\epsilon }\,+\, \mathbf {\epsilon _s} \,+\, \mathbf{ct }'\,+\,q\cdot \mathsf {something}\), where \(\mathbf {\epsilon _s}\) is the scaling error. To ensure that the noise growth is not impacted, it suffices to ensure that the added noise term \(|\mathbf{sk }^2\cdot \epsilon _s|\) (corresponding to the term \(\left\langle {\mathbf{sk }^*,\epsilon _s}\right\rangle \) in the description from Sect. 3.1) is smaller than the previous noise term of \(t(k_1 e_2+k_2 e_1)\). This is always the case if we have \(\left\| \mathbf {\epsilon _s} \right\| _\infty <1/4\) (as we do for decryption), but in some cases we can also handle larger values of \(\epsilon _s\) (e.g., later in the computation where the terms \(e_1,e_2\) are already larger, or when working with a large plaintext-space modulus t).

Finally, we note that the floating-point arithmetic in the second CRT-basis extension (inside complex scaling) does not produce any errors. This is because we use \(p\gg q\) (to ensure that all the coefficients before scaling fit in the range \([-pq/2t,+pq/2t]\)). The analysis from Sect. 2.2 then tells us that when computing the CRT basis extension from mod-p to mod-pq we never end up in the error region.

Multiplication Complexity. As for decryption, here too the dominant factor is the NTTs that we must compute when performing multiplication operations in \(R_q\) and \(R_{qp}\). Specifically we need to transform the four elements \(c^a_b\in R_{qp}\) after the CRT extension in order to compute the three \(c'_{\ell }\in R_{qp}\), then transform back the \(c'_\ell \)’s before scaling them back to \(R_q\) to get the \(c^*_\ell \)’s. For relinearization we need to transform all the elements \(c^*_{2,i}\in R_q\) before multiplying them by the \(\alpha _i\)’s and \(\beta _i\)’s, and also transform \(c^*_0,c^*_1\) before we can add them. Each transform in \(R_q\) takes k single-precision NTTs, and each transform in \(R_{qp}\) takes \(k+k'\) NTTs, so the total number of single-precision NTTs is \(k^2+9k+7k'\). Each transform takes \(O(n\log n)\) multiplications, so the NTTs take \(O(k^2n\log n)\) modular multiplications overall. In our experiments, these NTTs account for 58–77% of the homomorphic multiplication running time.

In addition to these NTTs, we spend \(4(k+k')n\) modular multiplications computing the \(c'_\ell \)’s in the transformed domain and \(2k^2n\) modular multiplications computing the products \(c^*_{2,i}\beta _i\) and \(c^*_{2,i}\alpha _i\) in the transformed domain. We also spend \(4n(kk'+k+k')\) modular multiplications and \(4(k+1)n\) floating-point operations in the CRT-extension procedure from Sect. 2.2, and additional \(3n(2k'(k+1)+k)\) modular multiplications and \(3(k'+k+2)n\) floating-point operations in the complex scaling from Sect. 2.4. Hence other than the NTTs, we have a total of \((7k+3k'+10)n\) floating-point operations and \((2k^2+10kk'+11k+14k')n\) modular multiplications.

4.6 Noise Growth

We show that our decryption and homomorphic multiplication procedures introduce almost no extra noise (up to 2 bits) as compared to the textbook BFV.

Textbook BFV. The worst-case noise bound for correct decryption using textbook BFV is written as [16]:

$$\begin{aligned} \left\| v \right\| _\infty < \left( \varDelta - r_t(q)\right) /2, \end{aligned}$$
(6)

where \(r_t(q) = t \left( q/t-\varDelta \right) \).

The initial noise in \([c_0 + c_1 s]_q\) is bounded by \(B_e \left( 1+2 \delta \left\| s \right\| _\infty \right) \), where \(B_e\) is the effective (low-probability) upper bound for Gaussian errors, and \(\delta \) is the polynomial multiplication expansion factor \(\sup \left\{ \left\| ab \right\| _\infty / \left( \left\| a \right\| _\infty \left\| b \right\| _\infty \right) : a,b \in R \right\} \). The initial noise is the same in all three BFV variants, since the first RNS procedure is only introduced at the scaling step of decryption.

The noise bound for binary tree multiplication of depth L is given by [16]

$$\begin{aligned} \left\| v_{ mult } \right\| _\infty < C_1^L V + L C_1^{L-1} C_2, \end{aligned}$$
(7)

where \(\left\| v_1 \right\| _\infty , \left\| v_2 \right\| _\infty < V\) and

$$\begin{aligned} C_1 = \left( 1+\epsilon _2\right) \delta ^2 t \left\| s \right\| _\infty , \epsilon _2 = 4 \left( \delta \left\| s \right\| _\infty \right) ^{-1}, \end{aligned}$$
(8)
$$\begin{aligned} C_2 = \delta ^2 \left\| s \right\| _\infty \left( \left\| s \right\| _\infty + t^2\right) + \delta \ell _{w, q} w B_e. \end{aligned}$$
(9)

Here \(\ell _{w, q}\) is the number of base-w digits in q.

Our RNS Variant. Our RNS variant has the following requirement for correct decryption:

$$\begin{aligned} \left\| v' \right\| _\infty < \left( \varDelta - r_t(q)\right) /4. \end{aligned}$$
(10)

Here the denominator is 4 (rather than 2 in the textbook BFV) to guarantee that the simple scaling procedure does not approach the possible-error region \(\mathbb {Z}+\frac{1}{2}\pm \epsilon \). This adds at most 1 bit of noise to the textbook BFV bound.

The low-probability (around \(2^{-48}\) in our implementation) approximation error in CRT basis extension before computing the tensor product without modular reduction simply changes the value of \(\epsilon _2\) to \(5 \left( \delta \left\| s \right\| _\infty \right) ^{-1}\), which can be easily shown using the same procedure as in Appendix I of [4] for the YASHE’ scheme and the same logic as described for Brakerski’s scheme in Sect. 4.5. Note that the value of \(\epsilon _2 \ll 1\), which implies that the change of the factor from 4 to 5 should have no practical effect, especially considering the low probability of this approximation error. We did not observe any practical noise increase due to this error in our experiments.

The effect of the scaling approximation error can be factored into the existing term \(\delta ^2 \left\| s \right\| _\infty ^2\) in \(C_2\), which corresponds to the error in rounding \(t/q \cdot \mathbf{ct }_1\otimes \mathbf{ct }_2\). In our case, we need to multiply this term by \(\left( 1+2\left\| \mathbf {\epsilon _s} \right\| _\infty \right) \), as explained in Sect. 4.5. As \(\left\| \mathbf {\epsilon _s} \right\| _\infty <1/4\) when we use the same floating-point precision as in decryption, this term is smaller than \(C'_1 V\) in all practical settings, including the case of fresh ciphertexts at \(t=2\) (see Sect. 4.5 for a more detailed discussion). We add 1 more bit to the textbook BFV noise to account for the potential extra noise during first-level multiplications, especially if larger values of \(\left\| \mathbf {\epsilon _s} \right\| _\infty \) are selected to use a lower precision for floating-point arithmetic. For homomorphic multiplications at higher levels, we will always have \(\left\| \mathbf {\epsilon _s} \right\| _\infty \ll C'_1 V\).

The relinearization term \(\delta \ell _{w, q} w B_e\) in the textbook BFV expression gets replaced with \(\delta \ell _{w, 2^{\nu }} w k B_e\), where \(\nu \) is the CRT moduli bit size, which is the same as for the Bajard et al. variant and the same as for the textbook BFV if \(w \le \nu \).

In summary, the binary tree multiplication noise constraint for our RNS variant is given by

$$\begin{aligned} \left\| v'_{\text {mult}} \right\| _\infty < {C'}_1^L V + L {C'}_1^{L-1} {C'}_2, \end{aligned}$$
(11)
$$\begin{aligned} C'_1 = \left( 1+\epsilon '_2\right) \delta ^2 t \left\| s \right\| _\infty , \epsilon '_2 = 5 \left( \delta \left\| s \right\| _\infty \right) ^{-1}, \end{aligned}$$
(12)
$$\begin{aligned} C'_2 = \delta ^2 \left\| s \right\| _\infty \left( \left\{ 1+2\left\| \mathbf {\epsilon _s} \right\| _\infty \right\} \left\| s \right\| _\infty + t^2\right) + \delta \ell _{w, 2^{\nu }} w k B_e. \end{aligned}$$
(13)

5 Implementation Details and Performance Results

5.1 Parameter Selection

Tighter Heuristic (Average-Case) Noise Bounds. The polynomial multiplication expansion factor \(\delta \) in Eqs. 8 and 9 is typically selected as \(\delta = n\) for the worst-case scenario [3, 16]. However, our experiments for the textbook BFV, our BFV variant, and products of discrete Gaussian and ternary generated polynomials showed that we can select \(\delta = C \sqrt{n}\) for practical experiments, where C is a constant close to one (for the case of power-of-two cyclotomics). This follows from the Central Limit Theorem (or rather subgaussian analysis), since all dominant polynomial multiplication terms result from the multiplication of polynomials with zero-centered random coefficients.

The highest experimental value of C for which we observed decryption failures was 0.9. We also ran numerous experiments at n varying from \(2^{10}\) to \(2^{17}\) for the cases of (1) multiplying a discrete Gaussian polynomial by a ternary uniform polynomial and (2) multiplying a discrete uniform polynomial by a ternary uniform polynomial, which cover the dominant terms in the noise constraints for BFV. The highest experimental value of C, observed for the product of a discrete Gaussian polynomial and a ternary uniform polynomial at \(n=1024\), was 1.75. Therefore, we selected \(C=2\) for our experiments, i.e., we set \(\delta = 2 \sqrt{n}\).
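
To illustrate this heuristic, the following stand-alone C++ sketch (not part of PALISADE; the ring dimension, trial count, and seed are arbitrary choices of ours) estimates C empirically by multiplying a discrete Gaussian polynomial by a ternary uniform polynomial using a naive negacyclic convolution:

// Illustrative stand-alone experiment (not PALISADE code): estimate the constant C
// in delta = C*sqrt(n) by negacyclically multiplying a discrete Gaussian polynomial
// by a ternary uniform polynomial and measuring the coefficient growth.
// The value of sigma follows Sect. 5.1; everything else is an arbitrary choice.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const std::size_t n = 1024;                                   // ring dimension (power of two)
    const double sigma = 8.0 / std::sqrt(2.0 * std::acos(-1.0));  // 8/sqrt(2*pi) ~ 3.19
    std::mt19937_64 rng(1);
    std::normal_distribution<double> gauss(0.0, sigma);
    std::uniform_int_distribution<int> ternary(-1, 1);

    double maxC = 0.0;
    for (int trial = 0; trial < 20; ++trial) {
        std::vector<int64_t> e(n), s(n), prod(n, 0);
        int64_t maxE = 1;                                         // ||e||_inf
        for (std::size_t i = 0; i < n; ++i) {
            e[i] = std::llround(gauss(rng));                      // discrete Gaussian coefficient
            s[i] = ternary(rng);                                  // ternary uniform coefficient
            const int64_t a = e[i] < 0 ? -e[i] : e[i];
            if (a > maxE) maxE = a;
        }
        // Naive negacyclic convolution in Z[x]/(x^n + 1): the wrap-around picks up a minus sign.
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                const int64_t term = e[i] * s[j];
                if (i + j < n) prod[i + j] += term;
                else           prod[i + j - n] -= term;
            }
        int64_t maxP = 0;                                         // ||e*s||_inf
        for (const int64_t c : prod) {
            const int64_t a = c < 0 ? -c : c;
            if (a > maxP) maxP = a;
        }
        // ||e*s||_inf <= delta * ||e||_inf * ||s||_inf with delta = C*sqrt(n) and
        // ||s||_inf = 1, so C is estimated as ||e*s||_inf / (||e||_inf * sqrt(n)).
        maxC = std::max(maxC, double(maxP) / (double(maxE) * std::sqrt(double(n))));
    }
    std::cout << "estimated C ~ " << maxC << std::endl;
    return 0;
}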

Security. To choose the ring dimension n, we ran the LWE security estimator (commit f59326c) [1] to find the lowest security levels for the uSVP, decoding, and dual attacks, following the standard homomorphic encryption security recommendations [7]. We took the minimum number of security bits \(\lambda \) over all three attacks on classical computers, based on the estimates for the BKZ sieve reduction cost model.

The secret-key polynomials were generated using discrete ternary uniform distribution over \(\left\{ -1,0,1 \right\} ^n\). In all of our experiments, we selected the minimum ciphertext modulus bitwidth that satisfied the correctness constraint for the lowest ring dimension n corresponding to the security level \(\lambda \ge 128\).

Other Parameters. We set the Gaussian distribution parameter \(\sigma \) to \(8/\sqrt{2 \pi }\) [7], the error bound \(B_e\) to \(6 \sigma \), and the lower bound for p to 2tnq. For the digit decomposition of residues in the relinearization procedure, we used a base w of 30 bits for multiplicative depths from 1 to 10. For larger multiplicative depths, we used only the CRT decomposition.
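
As a concrete instantiation (using the typical residue size \(\log _2 q_i \approx 55\) from Table 1, and taking \(\ell _{w,Q}\) to denote the number of base-w digits of Q), these choices give

$$\begin{aligned} \sigma = 8/\sqrt{2 \pi } \approx 3.19, \qquad B_e = 6 \sigma \approx 19.2, \qquad \ell _{w, 2^{\nu }} = \left\lfloor \nu / \log _2 w \right\rfloor + 1 = \left\lfloor 55/30 \right\rfloor + 1 = 2. \end{aligned}$$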

5.2 Implementation Details

Software Implementation. The BFV scheme based on the decryption and homomorphic multiplication algorithms described in this paper was implemented in PALISADE [19], a modular C++11 lattice cryptography library that supports several SHE and proxy re-encryption schemes based on cyclotomic rings [20]. The results presented in this work were obtained for a power-of-two cyclotomic ring \(\mathbb {Z}[x]/\langle x^n +1 \rangle \), which supports efficient polynomial multiplication using negacyclic convolution [17]. For efficient modular multiplication in the NTT, scaling, and CRT basis extension operations, we used the Number Theory Library (NTL) function MulModPrecon, which is described in Lines 5–7 of Algorithm 2 in [14]. All single-precision integer computations were done in unsigned 64-bit integers. Floating-point computations were done in IEEE 754 double-precision and extended double-precision floating-point formats. Our implementation of the BFV scheme is publicly accessible (included in PALISADE starting with v1.1).
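
For illustration, the following is a minimal sketch of the precomputed (“Shoup-style”) word-size modular multiplication idea that MulModPrecon implements; this is our own simplified rendering (relying on the compiler’s 128-bit integer type), not NTL’s or PALISADE’s actual code:

// Minimal sketch of word-size modular multiplication with a precomputed quotient
// (the idea behind NTL's MulModPrecon); a simplified rendering, not NTL/PALISADE code.
// Requires q < 2^63 and a, w < q; uses the GCC/Clang __uint128_t extension.
#include <cstdint>

// Precompute floor(w * 2^64 / q) for a fixed multiplicand w.
inline uint64_t prep_mulmod(uint64_t w, uint64_t q) {
    return (uint64_t)((((__uint128_t)w) << 64) / q);
}

// Compute (a * w) mod q using the precomputed quotient wPrecon.
inline uint64_t mulmod_precon(uint64_t a, uint64_t w, uint64_t q, uint64_t wPrecon) {
    // hi underestimates floor(a*w/q) by at most one, so the remainder r lies in
    // [0, 2q) and a single conditional subtraction suffices.
    const uint64_t hi = (uint64_t)(((__uint128_t)a * wPrecon) >> 64);
    const uint64_t r = a * w - hi * q;   // computed modulo 2^64
    return (r >= q) ? r - q : r;
}

The quotient is precomputed once per fixed multiplicand (e.g., per NTT twiddle factor or per scaling constant) and then reused across all n coefficients, which is what makes this approach attractive in the NTT, scaling, and CRT basis extension loops.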

Loop Parallelization. Multi-threading in our implementation is achieved via OpenMP. The loop parallelization in the scaling and CRT basis extension operations is applied at the level of single-precision polynomial coefficients (w.r.t. n). The loop parallelization for NTT and component-wise vector multiplications (polynomial multiplication in the evaluation representation) is applied at the level of CRT moduli (w.r.t. k).
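
A schematic view of these two granularities is sketched below (simplified C++ with OpenMP; the kernels are hypothetical placeholders, not PALISADE functions):

// Schematic illustration of the two loop-parallelization granularities described above;
// the kernels are hypothetical placeholders, not PALISADE API calls.
#include <cstddef>
#include <cstdint>
#include <vector>

using Poly = std::vector<uint64_t>;   // one RNS residue polynomial with n coefficients

void ProcessCoefficient(std::vector<Poly>& x, std::size_t j) {
    (void)x; (void)j;                 // stand-in for the per-coefficient scaling/extension work
}

void ForwardNTT(Poly& p) {
    (void)p;                          // stand-in for a full-length NTT over one residue
}

// CRT basis extension and scaling: parallelized over the n coefficients, since each
// coefficient is processed independently across all k residues.
void ScaleAndRoundAll(std::vector<Poly>& x, std::size_t n) {
    #pragma omp parallel for
    for (long long j = 0; j < (long long)n; ++j)
        ProcessCoefficient(x, (std::size_t)j);
}

// NTTs and component-wise vector products: parallelized over the k CRT moduli.
void NTTAll(std::vector<Poly>& x) {
    #pragma omp parallel for
    for (long long i = 0; i < (long long)x.size(); ++i)
        ForwardNTT(x[(std::size_t)i]);
}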

Experimental Setup. We ran the experiments in PALISADE version 1.1, which includes NTL version 10.5.0 and GMP version 6.1.2. The evaluation environment for the single-threaded experiments was a commodity desktop system with a 4-core Intel Core i7-3770 CPU rated at 3.40 GHz and 16 GB of memory, running Linux CentOS 7; the compiler was g++ (GCC) 5.3.1. The evaluation environment for the multi-threaded experiments was a dual-socket server with two 16-core Intel Xeon E5-2698 v3 (Haswell) CPUs at 2.30 GHz and 250 GB of RAM; the compiler was g++ (GCC) 4.8.5.

5.3 Results

Single-Threaded Mode. Table 1 presents the timing results for the range of multiplicative depths L from 1 to 100 for the single-threaded mode of operation. It also demonstrates the contributions of CRT basis extension, scaling, and NTT to the homomorphic multiplication time (excluding the relinearization).

Table 1 suggests that the relative contribution of CRT basis extension and scaling operations to the homomorphic multiplication runtime (without relinearization) first declines from 42% at \(L=1\) to 37% at \(L=10\), and then grows to 50% at \(L=100\). The remaining execution time is dominated by NTT operations. Our complexity and profiling analysis indicated that the initial decline is caused by a decreasing contribution (relative to the modular multiplications in the NTTs) of the terms linear in k and \(k'\) to the computational complexity of homomorphic multiplication as k increases from 1 to 4. The subsequent increase in relative execution time is due to the \(O(k^2 n)\) modular multiplications needed for CRT basis extension and scaling, which start to outweigh the \(O(k n \log n)\) modular multiplications in the NTT-based polynomial multiplications as k increases further.
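
Ignoring constants (and the somewhat larger extended CRT basis), this crossover occurs roughly where

$$\begin{aligned} k^2 n \gtrsim k n \log _2 n \iff k \gtrsim \log _2 n, \end{aligned}$$

i.e., the quadratic CRT-conversion cost is expected to dominate once the number of moduli k exceeds roughly \(\log _2 n\), which is consistent with the trend observed at the larger depths in Table 1.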

Table 1. Timing results for decryption, homomorphic multiplication, and relinearization in the single-threaded mode; \(t=2\), \(\log _2 q_i \approx 55\), \(\lambda \ge 128\)
Table 2. Timing results with multiple threads for decryption, multiplication, and relinearization, for the case of \(L=20, n=2^{14}, k=8\) from Table 1

Our profiling analysis showed that the contributions of floating-point operations to CRT basis extension and scaling were always under 5% and 10% (under 5% for \(k>5\)), respectively. This corresponded to at most 2.5% of the total homomorphic multiplication time (typically closer to 1%). This justifies the practical use of our much simpler algorithms compared to [3], considering that our approach also has lower computational complexity.

Table 1 also shows that the contribution of the relinearization procedure to the total homomorphic multiplication time grows from 11% (\(L=1\)) to 57% (\(L=100\)) due to the quadratic dependence of the number of NTTs in the relinearization procedure on the number of coprime moduli k.

The profiling of the decryption operation showed that only 8% (\(L=100\)) to 18% (\(L=1\)) of the runtime was spent on CRT scaling, while at least 60% was consumed by NTT operations and up to 10% by component-wise vector products. This supports our analysis that decryption is dominated by NTT and that the effect of the scaling operation is insignificant.

Multi-threaded Mode. Table 2 illustrates the runtimes for \(L=20\) on a 32-core server system when the number of threads is varied from 1 to 32. The highest runtime improvement factors for decryption and homomorphic multiplication (with relinearization) are 3.1 and 4.4, respectively.

The decryption runtime is dominated by NTT, and the NTTs are parallelized at the level of CRT moduli (parameter k, which is 8 in this case). Table 2 shows that the maximum improvement is indeed achieved at 8 threads; any further increase in the number of threads adds multi-threading overhead without improving speed. The theoretical maximum improvement factor of 8 is not reached, most likely because the load is distributed across the cores of the two sockets in the server. More careful tuning of the OpenMP thread affinity settings would be needed to achieve a higher improvement factor, which is beyond the scope of this work.

The runtime of homomorphic multiplication (without relinearization) improves more significantly as the number of threads increases: it keeps improving up to 32 threads and reaches a speedup of 6.1 over the single-threaded execution time. This effect is due to the CRT basis extension and scaling operations, which are parallelized at the level of polynomial coefficients (parameter \(n=2^{14}\)). However, as the contribution of NTT operations is high (nearly 70% in the single-threaded mode, as illustrated in Table 1), the benefits of parallelizing CRT basis extension and scaling are limited (their relative contribution shrinks as the number of threads increases).

The relinearization procedure is NTT-bound and therefore shows approximately the same relative improvement as decryption, i.e., a factor of 2.9, reached at 8 threads.

In summary, our analysis suggests that the proposed CRT basis extension and scaling operations parallelize well (w.r.t. ring dimension n) but the overall parallelization improvements of homomorphic multiplication and decryption largely depend on the parallelization of NTT operations. In our implementation, no intra-NTT parallelization was applied and thus the overall benefits of parallelization were limited.

6 Conclusion

In this work we described simpler alternatives to the CRT basis extension and scaling procedures of Bajard et al. [3], and implemented them in the PALISADE library [19]. Our procedures rely on floating-point arithmetic for certain intermediate computations; they are not only simpler but also have lower computational complexity and noise growth than the procedures proposed in [3].

Our single- and multi-threaded experiments suggest that the main bottleneck of the implementation of our BFV variant is NTT operations. In other words, the cost of the CRT maintenance procedures, i.e., CRT basis extension and scaling, is relatively small. Therefore, further improvements in the BFV runtimes can be achieved by optimizing the NTT operations, focusing on their parallelization.

We have shown that our procedures can be applied to any scale-invariant homomorphic encryption scheme based on Brakerski’s scheme, including YASHE. The CRT basis extension and scaling procedures may also be utilized in other lattice-based cryptographic constructions; for instance, scaling is a common technique in many lattice schemes based on the dual Regev cryptosystem [11, 21].