Keywords

Note: NFLlib is available under an open source license at https://github.com/quarkslab/NFLlib

1 Introduction

Lattice cryptography is often praised for its simplicity, its versatility and its possible resistance to quantum attacks. However, its large memory requirements makes its practical use strenuous. The introduction of ideal lattice cryptography completely reshaped this belief [24, 27]. In ideal lattice cryptography, primitives rely on the hardness of problems involving polynomial rings in which lattices can be represented by a few polynomials. In recent years, several hardware and software implementations of lattice signatures and encryption have been developed. These implementations show performances competitive with (or even surpassing) those of currently used primitives such as RSA or elliptic curves (see e.g.  [9, 26]). Due to its efficiency and security arguments, ideal lattice cryptography starts to be deployed in productsFootnote 1 and is promised a bright future.

Besides signature and encryption, lattice cryptography has shown to be amazingly versatile. In particular, most of the homomorphic encryption (HE) schemes rely on lattices. The latter research area is really active, and recent years have seen loads of HE implementations using polynomial rings. Lattices are also used to instantiate schemes with advanced properties, such as identity-based encryption (IBE), functional encryption or multilinear maps.

To work efficiently over polynomials rings in software, we are aware of three main approaches:

  1. (1)

    Use the generic number theory library NTL  [30]. This is the approach used in lots of HE implementations (and in particular HElib  [16, 17]), and in the IBE implementation of [10].

  2. (2)

    Use the generic number theory library flint  [18]. This is the approach used in [22] to implement two HE schemes, and in [1] for multilinear maps.

  3. (3)

    Use home-made libraries that implement operations in the polynomial ring \(\mathbb Z_{p}[x]/(x^n+1)\).

    This is the approach used in the open-source VPN implementation [31], Microsoft homomorphic encryption implementation [4], SIMD-optimized implementations [12, 15], GPU implementations [7, 21] and also [9, 26].

Note that all the aforementioned implementations consider uniquely (or may be instantiated with) the polynomial ring

$$\begin{aligned} R_p \mathop = \limits ^{{{\text {def}}}} \mathbb Z_{p}[x]/(x^n+1) \end{aligned}$$

for a modulus \(p\equiv 1\pmod {2n}\) and n some power of 2. This setting is wide-spread in ideal lattice cryptography because of its simplicity of exposition and of implementation. Among other advantages, in that setting, polynomials can be multiplied in quasi-linear time using the Number Theoretic Transform (NTT), a Fast Fourier Transform on finite rings [28]. Now, home-made implementations (i.e. item (3)) of polynomial operations in the latter setting have shown to achieve better performances than using the generic libraries NTL or flint (see e.g.  [22, Table 4]).Footnote 2 This leads us to the following question:

How fast can a specialized polynomial library dedicated to lattice cryptography over \(R_p\) be?

1.1 Our Contribution: NFLlib

In this work, we present NFLlib, an efficient and scalable C++ library specialized for cryptography over \(R_p=\mathbb Z_{p}[x]/(x^n+1)\). NFLlib includes optimized subroutines to perform arithmetic operations over polynomials and allows to easily implement ideal lattice cryptography.Footnote 3 The library contains algorithmic optimizations (Double-CRT representation, Number-Theoretic Transform, lazy modular reduction), and programming optimizations (Streaming SIMD Extensions, Advanced Vector Extensions, C++ expression templates).

We benchmarked the library’s arithmetic operations over \(R_p\) against the generic libraries NTL, flint, and against the HE library HElib. Our results show that focusing on a setting widely used in ideal lattice cryptography allowed to gain several orders of magnitude of efficiency compared to generic libraries.

NFLlib will be open-source and available under the GNU General Public License v3.0. It is designed for ideal lattice cryptography, provides a complete set of operations, and minimizes the amount of new code needed to add new features. In short, one of our hopes is that making this library open-source (and thus seeking for contributions) spurs on the development of ideal lattice cryptography in prototypes in the very near future.

1.2 Related Work

Libraries. The NFLlib library is specialized for a particular polynomial ring and therefore differs completely from the generic libraries NTL  [30] and flint  [18]. These latter libraries allow to perform powerful number theory, while NFLlib focus on a particular polynomial ring. This specialization allowed us to optimize the underlying operations while being designed to be used for ideal lattice cryptography. Another library that implements lattice cryptography is HElib  [16, 17], which uses NTL. HElib has become a reference to benchmark HE because it implements a full-fledged HE scheme [5] and features efficient packing techniques and other optimizations. Note that NFLlib does not compare to HElib in term of functionality, but NFLlib could replace NTL in HElib when working over \(R_p\), and would yield a much more efficient implementation.

Double-CRT Representation. Using moduli that split completely to store the polynomial coefficients in CRT form (first layer of CRT), and using the NTT representation of the polynomials (second layer of CRT) is a technique that has been used previously in lattice cryptography. In particular, it is used in the HElib library [13, 16, 17] and in the GPU implementation [7]. However, NFLlib features specific primes in the moduli decomposition, chosen to optimize the NTT and allow lazy modular multiplication.

1.3 Outline

In Sect. 2 we describe the basic cryptographic and mathematical notions needed in our paper. In Sect. 3 we present our library, its main components and how these allowed us to get our performance results. In Sect. 4 we compare the performance of our library with other libraries on different algorithms (NTT, multiplications, additions, etc.). Finally, in Sect. 5, we describe some implementations of lattice cryptographic algorithms and highlight the performance results obtained.

2 Preliminaries

Throughout the paper, we let n be a power of 2 and \(p>0\) be a modulus (not necessarily prime as we will see later). Define R to be the ring \(R_p\mathop = \limits ^{{{\text {def}}}}\mathbb Z_{p}[x]/(x^n+1)\), i.e. the ring of polynomials having integer coefficients, modulo \(x^n+1\). For any integer p, define \(R_p\mathop = \limits ^{{{\text {def}}}}\mathbb Z_{p}[x]/(x^n+1)\) the ring R modulo p, i.e. polynomials modulo \(x^n+1\) with coefficients modulo p. We denote by \(a \mod b\) the remainder of the euclidean division of a by b, and \(c\equiv a\pmod {b}\) represents any number such that \(c\mod b=a\). We use the classical Landau notation.

Ideal Lattice Cryptography. In most of existing implementations, the structured lattices used in ideal lattice cryptography have an interpretation in terms of arithmetic in the ring \(\mathbb Z_p[x]/(x^n+1)\), for n a power of 2. Jumping ahead, note that we will chose particular values for p in order to optimize the polynomial multiplications when using the Number Theoretic Transform (see also [9, 15, 29]) in combination with the Chinese Remainder Theorem.

The Chinese Remainder Theorem (CRT). Throughout the paper, the modulus p will be composite and square-free, and its factorization is denoted \(p=p_1\cdots p_\ell \). The CRT yields an isomorphism \(\mathbb Z_p\simeq \mathbb Z_{p_1}\times \cdots \times \mathbb Z_{p_\ell }\), which extends directly to polynomials rings. In particular, we have that \(R_{p}\simeq R_{p_1}\times \cdots \times R_{p_\ell }\). Jumping ahead of Sect. 3.1, the latter equivalence shows that, instead of working with a polynomial \(a(x)\in R_p\), we will choose \(p=p_1\cdots p_\ell \) with particular \(p_i\)’s and work with \(\ell \) polynomials \(a_i(x)\in R_{p_i}\).

The Number Theoretic Transform (NTT). To multiply polynomials efficiently, we use the quasi-linear polynomial multiplication algorithm called the NTT [28]. The advantages of using NTT for ideal lattice cryptography were recently demonstrated in hardware and software implementations [9, 14, 15, 29].

3 NFLlib: A Library for Ideal-Lattice Cryptography

In this section, we introduce NFLlib, a C++ library for ideal-lattice cryptography, i.e. for manipulating polynomials of \(R_p=\mathbb Z_{p}[x]/(x^n+1)\). The entry point to our library is a templated class \({\texttt {poly\,< class T,\,size\_t degree,}}\) \({\texttt {size\_t sizeModulus >}}\).

To obtain a usable class, one must define three parameters before compilation: the word type (uint16_t, uint32_t or uint64_t), the degree n of the polynomial \(x^n+1\) that defines \(R_p\) (which must be a power of two), and the bit-size of the modulus p, which will be internally constructed as a product of \(\ell \) fixed-size primes: \( p = p_1\times \cdots \times p_\ell \,.\)

The poly class features: overloaded operators for modular arithmetic and data manipulation (and the associated static functions); C++ template expressions to minimize the inherent performance degradation of overloaded operator; functions to sample polynomials in \(R_p\) with different distributions for the coefficients (uniform distribution modulo p, uniformly bounded distribution, discrete Gaussian distribution); transformation-related functions (NTT, CRT, export, import); SSE and AVX2 optimizations for compatible architectures.

The word type T is the most critical parameter. It defines which (and how many) primes \(p_i\)’s are available, what is the maximal polynomial degree n possible, and which underlying code is used. Indeed, the code is specialized and might feature SIMD optimizations (especially when using 32-bit and 16-bit words).

All the functions provided by the poly class have been developed from scratch, and are based on the native (scalar or vectorial) instructions of a modern CPU. Only exceptions, the Salsa20-based pseudo-random number generator, and the CRT inversion function which uses GMP if the modulus used is too large for native instructions.

NFLlib is a specialized polynomial library dedicated to ideal-lattice cryptography. It is well known that in this setting representing polynomials by their values instead of their coefficients (i.e.  representing them in NTT form) and using the CRT to represent values is very beneficial for performance. We therefore use such a representation.

NFLlib ’s performance results are mainly due to the fact that most of the functions have been developed directly based on native operations, and to four major choices that have proven to be very efficient. These choices are:

  • the fixed-size CRT representation — see Sect. 3.1;

  • the modular multiplication for scalars — see Sect. 3.2;

  • the NTT algorithm — see Sect. 3.3.

  • the SSE and AVX2 optimizations — see Sect. 4.

In the aforementioned sections, we describe the particular choices we made in NFLlib, and discuss their respective impacts.

3.1 Fixed-Size CRT Representation

For efficiency reasons, we selected the moduli p used in our ideal lattice setting as a product of \(\ell \) fixed-size primes \(p_i\)’s fitting on one word.Footnote 4 Thus, one can work with the CRT representation \((a_1(x), \ldots , a_\ell (x))\in R_{p_1}\times \cdots \times R_{p_\ell }\) of a polynomial \(a(x)\in R_p\), where \(a_i(x) = a(x)\mod p_i\). All the \(a_i(x)\) can then be processed independently.

The primes forming the moduli are chosen with the following constraints:

  • Constraint 1. Their size must be at most the word size minus two bits, so that we can do lazy modular reductions in the NTT algorithm (which gives roughly a \(30\,\%\) speedup);

  • Constraint 2. They must satisfy Eq. (1) for a given parameter \(s_0\), in order to use the modular multiplication algorithm of Sect. 3.2;

  • Constraint 3. For any possible value of n — the degree of the quotient polynomial in \(R_p\) — they must be congruent to \(1 \pmod {2n}\), so that we can find n-th roots of \(-1\) to use the NTT algorithm of Sect. 3.3 and do polynomial multiplications modulo \(x^n+1\).

Constraint 2 will ensure that Constraint 1 is satisfied when \(s_0\ge 2\). By default, NFLlib sets \(s_0=2\). In order to satisfy Constraint 3, we had to arbitrarily select a maximal polynomial degree \(n_{\max }\) in NFLlib. (Note that the constraint is then satisfied for any degree \(n\leqslant n_{\max }\)). The higher \(n_{\max }\) is, the less primes verify Constraint 3. When the word size is 16 bits, these constraints are stronger than for larger words. For example for \(n_{\max }=2048\), only one 14-bit prime verifies Constraint 3 (supposing \(s_0=2\)). For 64-bit words on the other hand, it is possible to find thousands of primes verifying the constraints even for very large polynomial degrees such as \(n_{\max }=2^{20}\). Algorithm 1 returns the primes satisfying Contraints 1–3.

figure a

Defining these primes statically is beneficial for performance, and therefore they have been included in a parameter file params.hpp with \(n_{\max }=512\) when \(s=16\) (2 primes), \(n_{\max }=2^{15}\) when \(s=32\) (291 primes), and \(n_{\max }=2^{20}\) when \(s=64\) (primes limited voluntarily to one thousand). All of these have been chosen with \(s_0=2\) as explained before. Of course other values of \(s_0\) and \(n_{\max }\) may be defined by the user of NFLlib.Footnote 5

3.2 Optimizing the Modular Multiplication

As explained in Sect. 3.1, NFLlib includes invariant primes of 14, 30 and 62 bits, and computations are performed independently over these primes. However — as already emphasized in [25] — computing modular reductions with an invariant integer using a well-tuned Newton reciprocal followed by multiplications and adjustments wins over the hardware division instructions.

During the library construction, we observed that the gcc compiler automatically optimized the modular multiplications when working with 16-bit or 32-bit words (i.e.  for 14- and 30-bit primes), but not with 64-bit words. In this section, we consider the problem of dividing a two-word integer by a single word integer. This problem was extensively studied in [25] which proposed a new algorithm (Algorithm 4 in the latter paper) giving a speedup of roughly \(30\,\%\) over the Newton reciprocal algorithm [25, Algorithm 1]. The former algorithm was included in the version 4.3 of the gmp library.

However, in NFLlib, the primes are feature so that their (two) most significant bits equal to 0, and the algorithms in [25] are optimized for numbers with their most significant bit equal to 1. In the rest of the section, we describe a new algorithm which significantly improves over [25] for numbers p smaller than the word base \(\beta =2^s\), as illustrated in Table 1.

Table 1. Time per componentwise multiplication of polynomials of degree 1024 modulo a 62-bit prime (average over 100,000 polynomial multiplications on an Intel Xeon CPU E5-2666 v3 at 2.90GHz). We implemented Algorithms 1 and 4 of [25] (with 4p instead of p and two conditional subtractions at the end), but they perform one order of magnitude slower than our improved algorithm.

Assume that one wants to compute a modular reduction with a modulus p such that

$$\begin{aligned} (1+1/2^{3s_0}) \cdot \beta /(2^{s_0}+1) ~ < ~ p ~ < ~\beta /2^{s_0}\,,\end{aligned}$$
(1)

for an integer \(1\leqslant s_0\leqslant s-1\) (note that all our 62-bit primes verify Eq. (1)). For any number \(u\in [0,\beta ^2)\), denote \(\langle u_1,u_0\rangle \) its decomposition in words smaller than \(\beta \), so that \(u=u_1\cdot \beta +u_0\). We describe our new modular reduction in Algorithm 2.

figure b

We have the following theorem. For space constraints, we defer its proof to the final version of the paper.

Theorem 1

Assume \(1\leqslant s_0\leqslant s-1\) and p verifies Eq. (1) and \(u=\langle u_1, u_0\rangle \in [0, p^2)\). Let \(v=\langle v_1, v_0\rangle = \lfloor \beta ^2/p\rfloor \). Then Algorithm 2 with input \((u,p,v_0,s_0)\) outputs \((u\mod p)\).

3.3 A Lazy NTT Algorithm

We use Harvey’s NTT algorithm [14]. This algorithm uses two techniques to reduce its computational costs: pre-computed quotients to accelerate modular multiplications, and lazy reductions (i.e.  inputs and outputs can be up to twice the modulus). Quotient pre-computations in the NTT was already performed by NTL  [30] but Harvey proves elegantly that the NTT butterflies can be modified so that the output is in [0, 2p) when the input is in [0, p), using only one conditional subtraction (instead of three in the initial algorithm). This gives a very nice performance boost of about \(30\,\%\), as shown in [14]. Note that this justifies to select primes ad in Sect. 3.2.

As usual, before applying the NTT we multiply the i-th coordinate of the polynomial we are going to transform by \(\psi ^i\), where \(\psi \) is an n-th root of \(-1\) which allows us to have negatively wrapped convolutions when we multiply two elements (i.e. reductions modulo \(x^n+1\)). After the NTT, we reduce the coefficients to [0, p) but we do not apply the bit-reverse permutation by default. The reason for this is that, in lattice based cryptography, we often want to offload work from the NTT to the inverse NTT. For example in an LWE encryption scheme, at encryption time one needs to: (1) generate multiple noise polynomials, (2) convert each of them with an NTT, and (3) multiply/add them. In the decryption phase, on the other hand, there is no noise polynomials to generate and there is just one multiplication, one addition and a single inverse NTT. If in a given case, such as the one described in Sect. 5.1 we want to balance both transformations, such a change can be activated with a compilation option.

Our library has no particular contribution concerning the NTT, we just show in this paper that it is a lot more efficient than the Bluestein FFT used in HElib (see Sect. 4). Our implementation does not use assembly language, but it is quite efficient, scalable and general.

4 Performances Evaluation and Comparison with NTL, FLINT and HElib

In this section we analyze the performance of our library and report comparative benchmarks with the NTL  [30], flint  [18] and HElib  [16] libraries.

Recall that NTL and flint are generic libraries that allow to work with polynomials in any modular rings, and HElib is a software library (based on NTL) that implement an optimized version of the Brakerski-Gentry- Vaikuntanathan [5] (BGV) homomorphic encryption scheme. We chose to compare to these libraries because they are widely used in the literature on lattice cryptography implementations. We restricted them to the same settings as NFLlib, i.e.  to work over \(R_p=\mathbb Z_{p}[x]/(x^n+1)\) with moduli p as in Sect. 3.1.

Setting. We benchmarked NFLlib against NTL, flint and HElib on random polynomial generation, NTT and inverse NTT, modular addition and multiplication in NTT representation. All the benchmarks were made using the following fixed parameter sets:

  1. (1)

    \(n=256\) with a modulus p of 14 bits,

  2. (2)

    \(n=512\) with a modulus p of 30 bits,

  3. (3)

    \(n=1024\) with a modulus p of 62 bits,

  4. (4)

    \(n=1024\) with a modulus p of about 6200 bits (product of 100 62-bit moduli).

As expected, NFLlib has been instantiated with 16-bit words and 32-bit words respectively for the parameters sets 1 and 2. For NTL, we used the zz_pX objects for the parameters sets 1 and 2, and ZZ_pX otherwise. For flint, we used the type fmpz_mod_poly_t. Finally, HElib includes a DoubleCRT class with the same representation as NFLlib.Footnote 6

We performed all our benchmarks on a c4.2xlarge instance of Amazon Web Services with an Intel Xeon CPU E5-2666 v3 (Haswell) at 2900 Mhz and 15 GB of RAM with gcc 4.9, GMP 6.0, NTL 8.1, flint 2.5.Footnote 7

Remark 1

To demonstrate the performance of our library on different architectures, we also benchmarked the NTT transformation on a MacBook Air (called macbookair) with an Intel Core i7-4650U Processor at 1700 Mhz and 8 GB of RAM, using the native clang++ (Apple LLVM version 6.0), GMP 6.0, NTL 8.1, flint 2.5. (We restricted ourselves to the benchmark of the NTT transform to be concise.)

Random Polynomial Generation. To benchmark random generation, we used the ntl::random function of NTL, the fmpz_mod_poly_randtest function of flint and the default random generator of NFLlib (described only in the full version due to space constraints). We present our results in Table 2. Note that the flint library implements the Mersenne Twister algorithm that is unsuitable for a cryptographic use.

Table 2. Timings to generate random polynomials in \(\mathbb Z_{p}[x]/(x^n+1)\) using the built-in functions of different libraries on c4.2xlarge.

NTT and iNTT. Working with the NTT representation of polynomials (after the negative wrapped convolution) is a cornerstone of our optimization, since additions and multiplications become essentially linear in the number of coefficients. We report in Table 3 the benchmarks of the NTT (including the negative wrapped convolution). Note that NTL provides an NTT functions thanks to TofftRep and to toFFTRep (resp. for zz_pX and ZZ_pX); no such functions seem to be available in the flint library.Footnote 8 In HElib, the DoubleCRT class has two functions to convert from (via negative wrapped convolution and NTT) and to (via inverse NTT and inverse of the convolution) a polynomial ZZX. (For space constraints, the timings of the inverse NTT are provided in the full version of the paper).

Table 3. Timings to compute the Number Theoretic Transform of a polynomial in \(\mathbb Z_{p}[x]/(x^n+1)\) using (when possible) the built-in functions of different libraries.

SEE and AVX2 Optimizations. Because of the highly parallel nature of operations over polynomials (the same operations are to be performed on multiple data objects), using Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) instructions might greatly increase performance. This has been shown in [12, 15] respectively for lattice signature and encryption.

NFLlib includes SSE and AVX2 specializations of the NTT algorithm and of the modular operations for 16-bit and 32-bit words. We compared NFLlib ’s NTT to Güneysu et al. AVX-optimized NTT [15] (where once again the NTT includes the negative wrapped convolution \(\varPsi \)) on c4.2xlarge.

The GOPS implementation works with the double type for a 23-bit modulus p (lazy-reduction) and takes 5030 cycles. NFLlib can be instantiated with 14-bit primes or 30-bit primes and takes respectively 3324 and 7334 cycles when using SSE4 instructions, and 2767 and 5956 cycles when using AVX2 instructions. As a comparison, the 62-bit version (i.e. non-SIMD) of the NTT takes 10020 cycles.

5 Implementing Ideal Lattice Cryptography with NFLlib

5.1 High Performance Key Exchange

In this section, we consider an equivalent of the key transport protocol RSASVE of NIST SP 800 56B, using [23] encryption scheme, to illustrate the performances of our library in a concrete setting. The client chooses a random message and encrypts it with the server public key then, the server decrypts this random value that is used to derivate (with a hashing function) a common secret.

Server-Side Focus. As a server usually has to handle many clients, the main issue is how costly is the server-side computation. Thus, we focus on the server cost.

Server Authentication and Forward Secrecy. The public key sent by the server may be a certificate signed by any algorithm (e.g. DSA) so that the client is able to be convinced of the server’s identity. Since this has no cost for the server we do not focus on which signature scheme is used. We note that as suggested in [20], the server can send two keys: one signed to prove his identity, and one ephemeral key generated to ensure forward secrecy. Then the client sends two secrets and the common secret is derived from both initial secrets with a key derivation function (e.g. a hash function). Due to the signature of one of the public keys, the client knows that only the server can get the common secret and if the ephemeral key is destroyed at the end of the key exchange, forward secrecy is ensured. This means that from the server side multiplying the communication and computational costs just by two, allows to have a forward secrecy property.

The algorithm we implemented is the RLWE encryption scheme of [23].Footnote 9 The code for the encryption and decryption functions (see [23]) is presented in Algorithms 3 and 4. This code highlights how simple is to implement algorithms with NFLlib: the encryption function and decryption functions are very readable, and have respectively 9 and 4 lines of code.

figure c
figure d
Table 4. Number of key exchanges per second on a server with an i7-4770 processor using only one core. When the four cores are used, performance are multiplied by a factor four. There is no standard implementation of RSA15360 and our library does not work with 80 bits of security for this application (hence the input N/A). RSA and ECDH (p curves) results have been obtained with the speed test of openssl 1.0.1f. The results noted NFLlib correspond to the amount of decryptions per second with our implementation of the RLWE scheme of [23].

Table 4 shows the performances of the protocols for 80, 128 and 256 bits of security. In RSA and NFLlib, the server needs to do a decryption, while in ECDH it performs a modular exponentiation. NFLlib allows to deal with more clients or to use less CPU time for the same amount of clients. The gap is around a factor 200, so it is possible to process 10 times more clients with 10 times less CPU time and to increase by a factor two the security with respect to ECDH (or maintain the security level and add forward secrecy).

5.2 Using NFLlib for Homomorphic Encryption

A trending application of ideal lattice cryptography is homomorphic encryption; a fully homomorphic encryption (FHE) scheme enables one to process any function on encrypted data. The first implementations of FHE were quite inefficient, but in six years the landscape has considerably changed and recent implementations run in reasonable time [11, 17]. However, the bootstrapping procedure — necessary to achieve fully homomorphic encryption — remains a bottleneck.

To overcome thereof, the cryptographic community focused on somewhat homomorphic encryption (SHE) schemes, i.e. schemes only able to handle a bounded number of homomorphic operations (and especially of homomorphic multiplications). However, even for this simplified setting, to homomorphically evaluate non trivial functions the parameter sizes remain very large (see e.g.  [8, 22]); to handle around 40 levels, one usually works with parameters such that \(2^{10}\leqslant n,\log q\leqslant 2^{20}\).

These large parameters explain why the static parameters in NFLlib were selected to handle polynomials up to degree \(2^{20}\) and modulus up to 62, 000 bits. Now, from the results of Sect. 4, we estimate that implementations using NTL or flint with \(R_p\) should immediately gain a factor 15 to 50 in performances by using NFLlib. As an example, we modified the open-source implementation of the somewhat homomorphic encryption scheme FV of [22] and directly replaced flint by NFLlib— we obtained the improvements described in Table 5.

Table 5. Using NFLlib in the FV implementation of [22], instead of flint. The polynomial degree is \(n=4096\) and the modulus p has 124 bits. The relatively small gain on the homomorphic multiplication can be explained by the fact that the scale-invariant procedure is essentially constituted of operations independent of NFLlib, such as divisions and rounding.

6 Conclusion

This work introduces NFLlib, an optimized open-source C++ library designed to handle polynomials over \(\mathbb Z_{p}[x]/(x^n+1)\), a widespread setting in ideal lattice cryptography. Because of its algorithmic and programming optimizations, NFLlib is much faster than NTL and flint, and as fast as AVX-optimized implementations of the literature. We hope the library will help building efficient prototypes using lattice cryptography in the very near future.