
1 Introduction

At PQCrypto 2016, the National Institute of Standards and Technology (NIST) announced the Post-Quantum Cryptography Standardization Process for replacing existing standards for public-key cryptography with quantum-resistant cryptosystems. For lattice-based cryptosystems, polynomial multiplications have been the most time-consuming operations. In response, the recently standardized Dilithium, Kyber, and Falcon [AAC+22] wrote number–theoretic transforms (NTTs) into their specifications.

OpenSSH 9.0 defaults to NTRU Prime. However, the polynomial ring in NTRU Prime does not naturally allow NTT-based multiplications. State-of-the-art vectorized implementations introduced various techniques extending the coefficient rings, or computed the results over \(\mathbb {{Z}}\). In each of these approaches, small-degree polynomial multiplications are, empirically, an important bottleneck. We study the compatibility of vectorization and various algorithmic techniques in the literature and choose the ARM Cortex-A72, implementing the Armv8-A architecture, for this work. We are interested in vectorized polynomial multiplications for NTRU Prime. [BBCT22] showed that a vectorized generic polynomial multiplication takes \(\sim 1.5\times \) the time of a “generic by small (ternary coefficients)” one with AVX2. [BBCT22] applied Schönhage and Nussbaumer to ease vectorization. Schönhage and Nussbaumer double the sizes of the coefficient rings and lead to a larger number of small-degree polynomial multiplications. We explain how to avoid the doubling with Good–Thomas, Rader’s, and Bruun’s FFTs.

We implement our ideas on Cortex-A72 implementing Armv8.0-A with the vector instruction set Neon. However, we emphasize that our approaches are built around the notion of vectorization and not a specific architecture.

1.1 Contributions

We summarize our contributions as follows.

  • We formalize the needs of vectorization commonly involved in vectorized implementations.

  • We propose vectorized polynomial multipliers essentially quartering and halving the number of small-dimensional polynomial multiplications after FFTs.

  • We propose novel accumulative (subtractive) variants of Barrett multiplication absorbing the follow-up addition (subtraction).

  • We implement the ideas with the SIMD technology Neon in Armv8.0-A on a Cortex-A72. Our fastest polynomial multiplier outperforms the state-of-the-art optimized implementation by a factor of \(6.1 \times \).

  • In addition to the polynomial multiplication, we vectorize the sorting network, polynomial inversions, encoding, and decoding subroutines used in ntrulpr761 and sntrup761. For ntrulpr761, our key generation, encapsulation, and decapsulation are \(2.98 \times \), \(2.79 \times \), and \(3.07 \times \) faster than the state-of-the-art optimized implementation. For sntrup761, we outperform the reference implementation significantly.

1.2 Code

Our source code can be found at https://github.com/vector-polymul-ntru-ntrup/NTRU_Prime under the CC0 license.

1.3 Structure of This Paper

Section 2 goes through the preliminaries. Section 3 surveys FFTs. Section 4 describes our implementations. We show the performance numbers in Sect. 5.

2 Preliminaries

Section 2.1 describes the polynomial rings in NTRU Prime, Sect. 2.2 describes our target platform Cortex-A72, and Sect. 2.3 describes the modular arithmetic.

2.1 Polynomials in NTRU Prime

The NTRU Prime submission comprises two families: Streamlined NTRU Prime and NTRU LPRime. Both operate on the polynomial ring \({\mathbb {{Z}}_q[x]}/{\langle {x^p - x - 1}\rangle }\) where q and p are primes such that the ring is a finite field. We target the polynomial multiplications for the parameter sets sntrup761 and ntrulpr761 where \(q = 4591\) and \(p = 761\). One should note that sntrup761, which is used by OpenSSH, uses a (Quotient) NTRU structure, and requires inversions in \({\mathbb {{Z}}_3[x]}\big /{\left\langle {x^{761} - x - 1} \right\rangle }\) and \({\mathbb {{Z}}_{4591}[x]}\big /{\left\langle {x^{761} - x - 1} \right\rangle }\). We refer the readers to the specification [BBC+20] for more details. With no other assumptions on the inputs, we call a polynomial multiplication “big by big”. If one of the inputs is guaranteed to be ternary, we call it “big by small”. We optimize both, although the former is required only if we apply the fast constant-time GCD [BY19] to the inversions in the key generation of sntrup761. The fast constant-time GCD is left as future work.

2.2 Cortex-A72

Our target platform is the ARM Cortex-A72, implementing the 64-bit Armv8.0-A instruction set architecture. It is a superscalar Central Processing Unit (CPU) with an in-order frontend and an out-of-order backend. Instructions are first decoded into \(\mu \)ops in the frontend and dispatched to the backend, which contains these eight pipelines: L for loads, S for stores, B for branches, I0/I1 for integer instructions, M for multi-cycle integer instructions, and F0/F1 for Single-Instruction-Multiple-Data (SIMD) instructions. The frontend can only dispatch at most three \(\mu \)ops per cycle. Furthermore, in a single cycle, the frontend dispatches at most one \(\mu \)op using B, at most two \(\mu \)ops using I0/I1, at most two \(\mu \)ops using M, at most one \(\mu \)op using F0, at most one \(\mu \)op using F1, and at most two \(\mu \)ops using L/S [ARM15, Sect. 4.1].

We mainly focus on the pipelines F0, F1, L, and S for performance. F0/F1 are both capable of various additions, subtractions, permutations, comparisons, minimums/maximums, and table lookups. However, multiplications can only be dispatched to F0, and shifts only to F1. The most heavily loaded pipeline is clearly the critical path. If there are more multiplications than shifts, we much prefer instructions that can use either pipeline to go to F1 since the time spent in F0 will dominate our runtime. Conversely, with more shifts than multiplications, we want to dispatch most non-shifts to F0. In practice, we interleave instructions dispatched to the pipeline with the most workload with other pipelines (or even L/S)—and pray. Our experiments show that this approach generally works well. In the case of chacha20 implementing randombytes for benchmarking [BHK+22], we even consider a compiler-aided mixing of I0/I1, F0/F1, and L/S. The idea also proved valuable for Keccak on some other Cortex-A cores [BK22, Table 1].

SIMD Registers. The 64-bit Armv8-A has 32 architectural 128-bit SIMD registers, each viewable as packed 8-, 16-, 32-, or 64-bit elements ([ARM21, Fig. A1-1]), denoted by the suffixes .16B, .8H, .4S, and .2D on the register name, respectively.

Armv8-A Vector Instructions

Multiplications. A plain mul multiplies corresponding vector elements and returns same-sized results. There are many variants of multiplications: mla/mls computes the same product vector and accumulates to or subtracts from the destination. There are the high-half products sqdmulh and sqrdmulh: the former computes the double-size products, doubles the results, and returns the upper halves; the latter additionally rounds before returning the upper halves. There are the long multiplications s{mul,mla,mls}l{,2}. smull multiplies the corresponding signed elements from the lower 64-bit halves of the source registers and places the resulting double-width vector elements in the destination register. It is usually paired with an smull2 using the upper 64-bit halves instead. Their accumulating and subtracting variants are s{mla,mls}l{,2}. We will not use the unsigned counterparts u{mul,mla,mls}l{,2}.
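Since the high-half products are central to the modular multiplications below, a scalar Python model of their 16-bit-lane semantics may help (a sketch: it ignores saturation, which only triggers for \(a = b = -2^{15}\)):

```python
def sqdmulh16(a, b):
    # signed doubling multiply returning the high half: (2*a*b) >> 16
    return (2 * a * b) >> 16

def sqrdmulh16(a, b):
    # rounding variant: add 2^15 before taking the high half
    return (2 * a * b + (1 << 15)) >> 16
```

For example, sqdmulh16(-1, 1) truncates to \(-1\) while sqrdmulh16(-1, 1) rounds to 0; this rounding behavior is what Barrett multiplication below relies on.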

Shifts. shl shifts left; sshr arithmetically shifts right; srshr rounds the results after shifting. We won’t use the unsigned ushr and urshr.

Additions/Subtractions. For basic arithmetic, the usual add/sub adds/subtracts the corresponding elements. Long variants s{add,sub}l{,2} add or subtract the corresponding elements from the lower or upper 64-bit halves and signed-extend into double-width results.

Permutations. Then we have permutations—uzp{1,2} extracts the even and odd positions respectively from a pair of vectors and concatenates the results into a vector. ext extracts the lowest elements (there is an immediate operand specifying the number of bytes) of the second source vector (as the high part) and concatenates to the highest elements of the first source vector. zip{1,2} takes the bottom and top halves of a pair of vectors and riffle-shuffles them into the destination.

2.3 Modular Arithmetic


Let q be an odd modulus, and \(\texttt {R}\) be the size of the arithmetic. We describe the modular reductions and multiplications for computing in \(\mathbb {{Z}}_q\). Barrett reduction [Bar86] reduces a value a by approximating \(a \bmod ^\pm q\) with \(a - \left\lfloor {\frac{a \cdot \left\lfloor {\frac{2^e \texttt {R}}{q}} \right\rceil }{2^e \texttt {R}}} \right\rceil \) (cf. Algorithm 1). For multiplying an unknown a with a fixed value b, we compute \(ab - \left\lfloor {\frac{a \left\lfloor {\frac{b \texttt {R}}{q}} \right\rceil _2}{\texttt {R}}} \right\rceil q \equiv a b \bmod ^\pm q\) (Barrett multiplication [BHK+22]) where \(\left\lfloor {} \right\rceil _2\) is the function mapping a real number r to \(2 \left\lfloor {\frac{r}{2}} \right\rceil \) (cf. Algorithm 2). We give novel multiply-add/sub variants of Barrett multiplication in Algorithms 3 and 4. Algorithm 3 (resp. Algorithm 4) computes a representation of \(a + bc\) (resp. \(a-bc\)) by merging a mul with an add (resp. a sub) into an mla (resp. mls), saving 1 instruction.
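As an illustration, the following Python sketch models these computations with \(q = 4591\) and \(\texttt {R} = 2^{16}\); the exponent \(e = 10\) is an illustrative choice of ours, and the in-register versions realize the shifts and products with sqrdmulh, mul, and mla/mls as described in Sect. 2.2:

```python
Q, R = 4591, 1 << 16   # q = 4591; R is the size of the 16-bit arithmetic

def rshift_round(x, k):
    # round(x / 2^k), halves toward +infinity (like srshr/sqrdmulh)
    return (x + (1 << (k - 1))) >> k

def barrett_reduce(a, e=10):
    # Barrett reduction sketch: a - round(a * round(2^e*R/q) / (2^e*R)) * q
    v = round(2**e * R / Q)            # precomputed; e = 10 is our choice here
    return a - rshift_round(a * v, e + 16) * Q

def barrett_mul(a, b):
    # Barrett multiplication sketch: b fixed, bq = round-to-even of b*R/q
    bq = 2 * ((b * R + Q) // (2 * Q))  # the |_ _|_2 rounding
    return a * b - rshift_round(a * bq, 16) * Q

def barrett_mla(acc, a, b):
    # accumulative variant sketch: acc + a*b folds into a single mla on Neon
    bq = 2 * ((b * R + Q) // (2 * Q))
    return acc + a * b - rshift_round(a * bq, 16) * Q
```

All results are representatives congruent to the exact values modulo q with absolute value below q; only the congruence and range, not a canonical representative, are guaranteed.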


3 Fast Fourier Transforms

We go through the mathematics behind various fast Fourier transforms (FFTs) and emphasize their defining conditions. This section is structured as follows. Section 3.1 reviews the Chinese remainder theorem for polynomial rings and discrete Fourier transform (DFT). We then survey various FFTs, including Cooley–Tukey in Sect. 3.2, Bruun and its finite field counterpart in Sect. 3.3, Good–Thomas in Sect. 3.4, Rader in Sect. 3.5, and Schönhage and Nussbaumer in Sect. 3.6. We use number–theoretic transform (NTT) as a synonym of FFT.

3.1 The Chinese Remainder Theorem (CRT) for Polynomial Rings

Let \(n = \prod _l n_l\), and \(\boldsymbol{g}_{i_0, \dots , i_{h - 1}} \in R[x]\) be coprime polynomials for all indices \((i_l)_{l=0\cdots h-1}\) where \(0\le i_l<n_l\). The CRT gives us a chain of isomorphisms

$$ \begin{aligned} \frac{R[x]}{\left\langle {\prod _{i_0, \dots , i_{h - 1}} \boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle } & \cong {} & {} \prod _{i_0} \frac{R[x]}{\left\langle {\prod _{i_1, \dots , i_{h - 1}} \boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle } \\ \cong \cdots & \cong {} & {} \prod _{i_0, \dots , i_{h - 1}} \frac{R[x]}{\left\langle {\boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle }. \end{aligned} $$

Multiplying in \(\prod _{i_0, \dots , i_{h - 1}} {R[x]}\bigg /{\left\langle {\boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle }\) is cheap if the polynomial modulus is small. If the isomorphism chain is also cheap, we improve the polynomial multiplications in \({R[x]}\bigg /{\left\langle {\prod _{i_0, \dots , i_{h - 1}} \boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle }\). For small \(n_l\)’s, it is usually cheap to decompose a polynomial ring into a product of \(n_l\) polynomial rings.

Transformations will be described with the words “radix”, “split”, and “layer”. We demonstrate this below for \(h = 2\). Suppose we have isomorphisms

$$ {R[x]}\bigg /{\left\langle {\prod _{i_0, i_1} \boldsymbol{g}_{i_0, i_1}} \right\rangle } \overset{\eta _0}{\cong }\ \prod _{i_0} {R[x]}\bigg /{\left\langle {\prod _{i_1} \boldsymbol{g}_{i_0, i_1}} \right\rangle } \overset{\eta _1}{\cong }\ \prod _{i_0, i_1} {R[x]}\big /{\left\langle {\boldsymbol{g}_{i_0, i_1}} \right\rangle } $$

where \(i_0\in \{0, \dots , n_0 - 1\}\) and \(i_1\in \{0, \dots , n_1 - 1\}\). We call \(\eta _0\) a radix-\(n_0\) split and an implementation of \(\eta _0\) a radix-\(n_0\) computation, and similarly for \(\eta _1\). Usually, we implement several isomorphisms together to minimize memory operations. The resulting computation is called a multi-layer computation. If we implement \(\eta _0\) and \(\eta _1\) with a single pair of loads and stores, and \(\eta _0\) and \(\eta _1\) both rely on X, a shape of computation, then the resulting multi-layer computation is called a 2-layer X. If additionally \(n_0 = n_1\), the computation is a 2-layer radix-\(n_0\) X, and similarly for more layers.

3.2 Cooley–Tukey FFT

In a Cooley–Tukey FFT [CT65], we have \(\zeta \in R\), \(\omega _n\in R\) a principal nth root of unity, n coprime to \(\text {char}(R)\), and \(\boldsymbol{g}_{i_0, \dots , i_{h - 1}} = x - \zeta \omega _n^{\sum _l i_l \prod _{j < l} n_j} \in R[x]\). Since \(\prod _{i_0, \dots , i_{h - 1}} \boldsymbol{g}_{i_0, \dots , i_{h - 1}} = x^{n} - \zeta ^n\), the efficiency of multiplying polynomials in \({R[x]}/{\left\langle {x^{n} - \zeta ^n} \right\rangle }\) boils down to the efficiency of the isomorphisms indexed by \(i_l\)’s. Furthermore, it is a cyclic NTT if \(\zeta ^n = 1\).
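A radix-2 Cooley–Tukey split and its inverse can be sketched in Python. Since \(\mathbb {{Z}}_{4591}\) has no power-of-two roots of unity, we use the toy modulus 257 (where 2 has order 16) to illustrate the isomorphism \({R[x]}/{\left\langle {x^{2m} - \zeta ^2} \right\rangle } \cong {R[x]}/{\left\langle {x^m - \zeta } \right\rangle } \times {R[x]}/{\left\langle {x^m + \zeta } \right\rangle }\):

```python
M = 257  # toy modulus: 2 has order 16 mod 257, so power-of-two roots abound

def ct_split(f, zeta):
    # R[x]/<x^{2m} - zeta^2>  ->  R[x]/<x^m - zeta> x R[x]/<x^m + zeta>
    m = len(f) // 2
    return ([(f[i] + zeta * f[i + m]) % M for i in range(m)],
            [(f[i] - zeta * f[i + m]) % M for i in range(m)])

def ct_merge(lo, hi, zeta):
    # the inverse butterfly (CRT direction), dividing by 2 and 2*zeta
    m, inv2, inv2z = len(lo), pow(2, -1, M), pow(2 * zeta, -1, M)
    return ([(lo[i] + hi[i]) * inv2 % M for i in range(m)] +
            [(lo[i] - hi[i]) * inv2z % M for i in range(m)])

def mul_mod(f, g, zeta2, m):
    # schoolbook product in R[x]/<x^m - zeta2>
    c = [0] * (2 * m)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            c[i + j] += fi * gj
    return [(c[i] + zeta2 * c[i + m]) % M for i in range(m)]

# multiply in R[x]/<x^4 + 1> directly and through one split with zeta = 16
f, g = [1, 2, 3, 4], [5, 6, 7, 8]
direct = mul_mod(f, g, 256, 4)               # zeta^2 = 256 = -1 mod 257
lf, hf = ct_split(f, 16)
lg, hg = ct_split(g, 16)
res = ct_merge(mul_mod(lf, lg, 16, 2), mul_mod(hf, hg, 241, 2), 16)
assert res == direct
```

Multiplying in the two half-size rings (with \(x^2 \equiv \pm 16\)) and merging agrees with the direct product; this is the CRT isomorphism at work.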

3.3 Bruun-Like FFTs

[Bru78] first introduced the idea of factoring into trinomials \(\boldsymbol{g}_{i_0, \dots , i_{h - 1}}\) when n is a power of two—to reduce the number of multiplications in R while operating over \(\mathbb {C}\). [Mur96] generalized this to arbitrary even n. For our implementations, we need the results on factoring \(x^{2^k} + 1 \in \mathbb {F}_q[x]\) when \(q \equiv 3 \pmod {4}\) [BGM93] and composed multiplications of polynomials in \(\mathbb {F}_q[x]\) [BC87]. Factoring \(x^n - 1\) over \(\mathbb {F}_q\) is actively researched [BGM93, Mey96, TW13, MVdO14, WYF18, WY21].

Review: The Original Bruun’s FFT (\(\boldsymbol{R = \mathbb {C}}\)). We choose \( \boldsymbol{g}_{i_0, \dots , i_{h - 1}} = x^{2} - \left( \zeta \omega _n^{\sum _l i_l \prod _{j < l} n_j} + \zeta ^{-1} \omega _n^{-\sum _l i_l \prod _{j < l} n_j} \right) x + 1 \) so \(x^{2n} - \left( \zeta ^n + \zeta ^{-n} \right) x^n + 1 = \prod _{i_0, \dots , i_{h - 1}} \boldsymbol{g}_{i_0, \dots , i_{h - 1}} \). This provides us with an alternative factorization for \(x^{4n} - 1 = (x^{2n} - 1) (x^{2n} + 1)\) by choosing \(\zeta ^n = \omega _4\). For a complex number with norm 1, since the sum of its inverse and itself is real, we only need arithmetic in \(\mathbb {R}\) to reach \(\prod _{i_0, \dots , i_{h - 1}} {\mathbb {C}[x]}\bigg /{\left\langle {\boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle }\).

\(\boldsymbol{R = \mathbb {F}_q}\) where \(\boldsymbol{q \equiv 3 \pmod {4}}\). We need Theorem 1 for our implementations.

Theorem 1

([BGM93, Theorem 1]). Let \(q \equiv 3 \pmod {4}\) and \(2^w\) be the highest power of two in \(q + 1\). If \(k < w\), then \(x^{2^k} + 1\) factors into irreducible trinomials \(x^2 + \gamma x + 1\) in \(\mathbb {F}_q[x]\). Else (i.e., \(k \ge w\)) \(x^{2^k} + 1\) factors into irreducible trinomials \(x^{2^{k - w + 1}} + \gamma x^{2^{k - w}} - 1\) in \(\mathbb {F}_q[x]\).
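For \(q = 4591\) we have \(q + 1 = 2^4 \cdot 287\), so \(w = 4\). The first split promised by Theorem 1 can be checked in Python; \(\gamma = \sqrt{2}\) exists in \(\mathbb {F}_q\) since \(q \equiv 7 \pmod 8\):

```python
Q = 4591  # q + 1 = 2^4 * 287, so w = 4 in Theorem 1

def polymul(f, g):
    # schoolbook product over Z_Q, coefficients in increasing degree
    c = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            c[i + j] = (c[i + j] + fi * gj) % Q
    return c

# since q = 3 (mod 4), a square root of a quadratic residue r is r^((q+1)/4)
gamma = pow(2, (Q + 1) // 4, Q)
assert gamma * gamma % Q == 2
# (x^2 + gamma x + 1)(x^2 - gamma x + 1) = x^4 + (2 - gamma^2) x^2 + 1 = x^4 + 1
assert polymul([1, gamma, 1], [1, Q - gamma, 1]) == [1, 0, 0, 0, 1]
```

This is exactly the split \({\textbf {Bruun}}_{\sqrt{2}, 1}\) of \(x^4 + 1\) used later in Sect. 3.3.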

Given \(\boldsymbol{f}_0, \boldsymbol{f}_1 \in \mathbb {F}_q[x]\), we define their “composed multiplication” as \(\left( \boldsymbol{f}_0 \odot \boldsymbol{f}_1 \right) {:}{=}\prod _{\boldsymbol{f}_0(\alpha )=0} \prod _{\boldsymbol{f}_1(\beta )=0} \left( x - \alpha \beta \right) \) where \(\alpha , \beta \) run over all the roots of \(\boldsymbol{f}_0, \boldsymbol{f}_1\) in an extension field of \(\mathbb {F}_q\). We need the following from [BC87]:

Lemma 1

([BC87, Eq. 8]).\( \prod _{i_0} \boldsymbol{f}_{0, i_0} \odot \prod _{i_1} \boldsymbol{f}_{1, i_1} = \prod _{i_0, i_1} \left( \boldsymbol{f}_{0, i_0} \odot \boldsymbol{f}_{1, i_1} \right) \) holds for any sequences of polynomials \(\boldsymbol{f}_{0, i_0}, \boldsymbol{f}_{1, i_1} \in \mathbb {F}_q[x]\).

Lemma 2

([BC87, Eq. 5]). If \(\boldsymbol{f}_0 = \prod _\alpha (x - \alpha ) \in \mathbb {F}_q[x]\), then for any \( \boldsymbol{f}_1 \in \mathbb {F}_q[x]\), we have \(\boldsymbol{f}_0 \odot \boldsymbol{f}_1 = \prod _\alpha \alpha ^{\text {deg}(\boldsymbol{f}_1)} \boldsymbol{f}_1(\alpha ^{-1} x) \in \mathbb {F}_q[x]\).

Lemma 3

Let r be odd, \(x^r - 1 = \prod _{i_0} (x - \omega _r^{i_0}) \in \mathbb {F}_q[x]\), and \(x^{2^k} - 1 = \prod _{i_1} \boldsymbol{f}_{i_1} \in \mathbb {F}_q[x]\). We have \( x^{2^k r} - 1 = \prod _{i_0} \left( x^{2^k} - \omega _r^{2^k i_0} \right) = \prod _{i_0, i_1} \omega _r^{i_0 \text {deg} (\boldsymbol{f}_{i_1})} \boldsymbol{f}_{i_1}( \omega _r^{-i_0} x). \)

Proof

First observe \(x^{2^k r} - 1 = \left( x^r - 1 \right) \odot \left( x^{2^k} - 1 \right) \). By Lemma 1, this equals

\( \prod _{i_0} \left( (x - \omega _r^{i_0}) \odot \left( x^{2^k} - 1 \right) \right) = \prod _{i_0, i_1} \left( (x - \omega _r^{i_0}) \odot \boldsymbol{f}_{i_1} \right) . \) According to Lemma 2, \((x - \omega _r^{i_0}) \odot \left( x^{2^k} - 1 \right) = x^{2^k} - \omega _r^{2^k i_0}\) and \((x - \omega _r^{i_0}) \odot \boldsymbol{f}_{i_1} = \omega _r^{i_0 \text {deg} (\boldsymbol{f}_{i_1})} \boldsymbol{f}_{i_1} (\omega _r^{-i_0} x)\) as desired.

In summary, by Lemma 3 we have the following isomorphisms:

$$ \frac{\mathbb {F}_q[x]}{\left\langle {x^{2^k r} - 1} \right\rangle } \cong \frac{\mathbb {F}_q[x]}{\left\langle {\prod _{i_0} \left( x^{2^k} - \omega _r^{2^k i_0} \right) } \right\rangle } \cong \frac{\mathbb {F}_q[x]}{\left\langle {\prod _{i_0, i_1} \omega _r^{i_0 \text {deg} (\boldsymbol{f}_{i_1})} \boldsymbol{f}_{i_1} (\omega _r^{-i_0} x)} \right\rangle }. $$

Radix-2 Bruun’s Butterflies and Inverses. Define \({\textbf {Bruun}}_{\alpha , \beta }\) as follows:

$$ {\textbf {Bruun}}_{\alpha , \beta }: {\left\{ \begin{array}{ll} \frac{R[x]}{\left\langle {x^4 + (2 \beta - \alpha ^2) x^2 + \beta ^2} \right\rangle } &{} \rightarrow \frac{R[x]}{\left\langle {x^2 + \alpha x + \beta } \right\rangle } \times \frac{R[x]}{\left\langle {x^2 - \alpha x + \beta } \right\rangle } \\ a_0 + a_1 x + a_2 x^2 + a_3 x^3 &{} \mapsto \left( (\hat{a}_0 + \hat{a}_1 x), (\hat{a}_2 + \hat{a}_3 x) \right) \end{array}\right. } $$

where

$$ \left\{ \begin{aligned} (\hat{a}_0, \hat{a}_1) & = {} & {} \left( a_0 - \beta a_2 + \alpha \beta a_3, a_1 + (\alpha ^2 - \beta ) a_3 - \alpha a_2 \right) , \\ (\hat{a}_2, \hat{a}_3) & = {} & {} \left( a_0 - \beta a_2 - \alpha \beta a_3, a_1 + (\alpha ^2 - \beta ) a_3 + \alpha a_2 \right) . \end{aligned} \right. $$

We compute \((a_0 - \beta a_2, a_1 + (\alpha ^2 - \beta ) a_3,\alpha a_2, \alpha \beta a_3)\), swap the last two values implicitly, and do an addition-subtraction (cf. Fig. 1). Notice that we can use Barrett_mla and Barrett_mls whenever a product is followed by only one accumulation (\(a_1 + \left( \alpha ^2 - \beta \right) a_3\)) or subtraction (\(a_0 - \beta a_2\)).

Fig. 1. Bruun’s butterfly. \((\hat{a}_0, \hat{a}_1, \hat{a}_2, \hat{a}_3) = {\textbf {Bruun}}_{\alpha , \beta }(a_0, a_1, a_2, a_3)\).

$$ 2 {\textbf {Bruun}}_{\alpha , \beta }^{-1}: {\left\{ \begin{array}{ll} \frac{R[x]}{\left\langle {x^2 + \alpha x + \beta } \right\rangle } \times \frac{R[x]}{\left\langle {x^2 - \alpha x + \beta } \right\rangle } &{} \rightarrow \frac{R[x]}{\left\langle {x^4 + (2 \beta - \alpha ^2) x^2 + \beta ^2} \right\rangle } \\ \left( (\hat{a}_0 + \hat{a}_1 x), (\hat{a}_2 + \hat{a}_3 x) \right) &{} \mapsto 2 a_0 + 2 a_1 x + 2 a_2 x^2 + 2 a_3 x^3 \end{array}\right. } $$

correspondingly defines the inverse, where

$$ \left\{ \begin{aligned} 2 (a_0, a_1) & = {} & {} (\hat{a}_0 + \hat{a}_2 + \left( \hat{a}_3 - \hat{a}_1 \right) \alpha ^{-1} \beta , \hat{a}_1 + \hat{a}_3 - \left( \hat{a}_0 - \hat{a}_2 \right) \alpha ^{-1} \beta ^{-1} \left( \alpha ^2 - \beta \right) ), \\ 2 (a_2, a_3) & = {} & {} (\left( \hat{a}_3 - \hat{a}_1 \right) \alpha ^{-1}, \left( \hat{a}_0 - \hat{a}_2 \right) \alpha ^{-1} \beta ^{-1}). \\ \end{aligned} \right. $$

We compute \( \left( \hat{a}_0 + \hat{a}_2, \hat{a}_1 + \hat{a}_3, \hat{a}_0 - \hat{a}_2, \hat{a}_3 - \hat{a}_1 \right) \), swap the last two values implicitly, multiply the constants \(\alpha ^{-1}, \beta , \alpha ^{-1} \beta ^{-1}\), and \(\left( \alpha ^2 - \beta \right) \), and add-sub (cf. Fig. 2). Both \({\textbf {Bruun}}_{\alpha , \beta }\) and \(2 {\textbf {Bruun}}_{\alpha , \beta }^{-1}\) take 4 multiplications.
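Both butterflies can be sketched and cross-checked in Python over \(\mathbb {{Z}}_{4591}\); the constants \(\alpha = 123\), \(\beta = 456\) below are arbitrary illustrative choices, and the remainder computation independently verifies the forward formulas:

```python
Q = 4591

def bruun(a, alpha, beta):
    # forward butterfly: 4 multiplications, then an add-sub
    a0, a1, a2, a3 = a
    t0 = (a0 - beta * a2) % Q                     # Barrett_mls shape
    t1 = (a1 + (alpha * alpha - beta) * a3) % Q   # Barrett_mla shape
    s2, s3 = alpha * a2 % Q, alpha * beta % Q * a3 % Q
    return [(t0 + s3) % Q, (t1 - s2) % Q, (t0 - s3) % Q, (t1 + s2) % Q]

def bruun_inv2(h, alpha, beta):
    # inverse butterfly, returning 2*(a0, a1, a2, a3)
    h0, h1, h2, h3 = h
    ia = pow(alpha, -1, Q)
    iab = ia * pow(beta, -1, Q) % Q
    s, d = (h0 + h2) % Q, (h0 - h2) % Q
    t, e = (h1 + h3) % Q, (h3 - h1) % Q
    return [(s + e * ia % Q * beta) % Q,
            (t - d * iab % Q * (alpha * alpha - beta)) % Q,
            e * ia % Q,
            d * iab % Q]

def poly_mod(a, m):
    # remainder of a(x) modulo the monic m(x) = m[0] + m[1] x + ... + x^deg
    a, deg = list(a), len(m) - 1
    for i in range(len(a) - 1, deg - 1, -1):
        for j in range(deg):
            a[i - deg + j] = (a[i - deg + j] - a[i] * m[j]) % Q
    return [x % Q for x in a[:deg]]

a, al, be = [1, 2, 3, 4], 123, 456
h = bruun(a, al, be)
assert h[:2] == poly_mod(a, [be, al, 1])       # residue mod x^2 + alpha x + beta
assert h[2:] == poly_mod(a, [be, Q - al, 1])   # residue mod x^2 - alpha x + beta
assert bruun_inv2(h, al, be) == [2 * x % Q for x in a]
```

The round trip returns \(2(a_0, a_1, a_2, a_3)\), matching the factor 2 carried by \(2 {\textbf {Bruun}}_{\alpha , \beta }^{-1}\).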

Fig. 2. Bruun’s inverse butterfly. \((2 a_0, 2 a_1, 2 a_2, 2 a_3) = 2 {\textbf {Bruun}}_{\alpha , \beta }^{-1}(\hat{a}_0, \hat{a}_1, \hat{a}_2, \hat{a}_3)\).

We will use three special cases of Bruun’s butterflies.

  • \({\textbf {Bruun}}_{\sqrt{2}, 1}\): The initial split of \(x^{2^k} + 1\) is \({\textbf {Bruun}}_{\sqrt{2}, 1}\). Since \(\beta = \alpha ^2 - \beta = 1\), we only need two multiplications by \(\sqrt{2}\).

  • \({\textbf {Bruun}}_{\alpha , \pm 1}\): We avoid multiplying with \(\beta =\pm 1\) in \({\textbf {Bruun}}_{\alpha , \pm 1}\) and \(2 {\textbf {Bruun}}_{\alpha , \pm 1}^{-1}\).

  • \({\textbf {Bruun}}_{\alpha , \frac{\alpha ^2}{2}}\): We save no multiplications, but only use 2 constants \(\alpha \) and \(\frac{\alpha ^2}{2}\) instead of 4. It is used in the split of \(x^{2^k} + \omega _r^{2^k i}\) for an odd r.

3.4 Good–Thomas FFTs

A Good–Thomas FFT [Goo58] converts cyclic FFTs and convolutions into multi-dimensional ones for coprime \(n_l\)’s. For the polynomial ring \({R[x]}/{\left\langle {x^{n} - 1} \right\rangle }\), we implement \({R[x]}/{\left\langle {x^n - 1} \right\rangle } \cong \prod _{i_0, \dots , i_{h - 1}} {R[x]}/{\left\langle {x - \prod _l \omega _{n_l}^{i_l}} \right\rangle }\) with a multi-dimensional FFT induced by the equivalences \(x \sim \prod _l u_l\) and \(\forall l, u_l^{n_l} \sim 1\). Formally, we have

$$ \begin{aligned} &\, &\, & \frac{R[x]}{\left\langle {x^n - 1} \right\rangle } \cong \frac{R[x, u_0, \dots , u_{h - 1}]}{\left\langle {x - \prod _l u_l, u_0^{n_0} - 1, \dots , u_{h - 1}^{n_{h - 1}} - 1} \right\rangle } \\ & \cong &\, & \prod _{i_0, \dots , i_{h - 1}} \frac{R[x, u_0, \dots , u_{h - 1}]}{\left\langle {x - \prod _l u_l, u_0 - \omega _{n_0}^{i_0}, \dots , u_{h - 1} - \omega _{n_{h - 1}}^{i_{h - 1}}} \right\rangle } \cong \prod _{i_0, \dots , i_{h - 1}} \frac{R[x]}{\left\langle {x - \prod _l \omega _{n_l}^{i_l}} \right\rangle }. \end{aligned} $$

We illustrate the idea for \(h = 2, n_0 = 2\), and \(n_1 = 3\). Let \(P_{(14)}\) be the permutation matrix exchanging the 1st and the 4th rows. We write the size-6 FFT matrix as follows:

$$ P_{(14)} \begin{pmatrix} 1 &{} 1 &{} 1 &{} 1 &{} 1 &{} 1 \\ 1 &{} \omega _6 &{} \omega _6^2 &{} \omega _6^3 &{} \omega _6^4 &{} \omega _6^5 \\ 1 &{} \omega _6^2 &{} \omega _6^4 &{} 1 &{} \omega _6^2 &{} \omega _6^4 \\ 1 &{} \omega _6^3 &{} 1 &{} \omega _6^3 &{} 1 &{} \omega _6^3 \\ 1 &{} \omega _6^4 &{} \omega _6^2 &{} 1 &{} \omega _6^4 &{} \omega _6^2 \\ 1 &{} \omega _6^5 &{} \omega _6^4 &{} \omega _6^3 &{} \omega _6^2 &{} \omega _6 \\ \end{pmatrix} P_{(14)} = \begin{pmatrix} 1 &{} 1 &{} 1 &{} 1 &{} 1 &{} 1 \\ 1 &{} \omega _6^4 &{} \omega _6^2 &{} 1 &{} \omega _6^4 &{} \omega _6^2 \\ 1 &{} \omega _6^2 &{} \omega _6^4 &{} 1 &{} \omega _6^2 &{} \omega _6^4 \\ 1 &{} 1 &{} 1 &{} \omega _6^3 &{} \omega _6^3 &{} \omega _6^3 \\ 1 &{} \omega _6^4 &{} \omega _6^2 &{} \omega _6^3 &{} \omega _6 &{} \omega _6^5 \\ 1 &{} \omega _6^2 &{} \omega _6^4 &{} \omega _6^3 &{} \omega _6^5 &{} \omega _6 \\ \end{pmatrix} =\begin{pmatrix} 1 &{} 1 \\ 1 &{} -1 \end{pmatrix} \otimes \begin{pmatrix} 1 &{} 1 &{} 1 \\ 1 &{} \omega _6^4 &{} \omega _6^2 \\ 1 &{} \omega _6^2 &{} \omega _6^4 \end{pmatrix}. $$
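The permutation can be checked numerically. A Python sketch with the toy modulus 7 (where 3 has order 6) verifies that, after the CRT reindexing \(i \mapsto (i \bmod 2, i \bmod 3)\), the size-6 DFT is exactly a \(2 \times 3\) two-dimensional DFT:

```python
P = 7                                    # toy modulus; 3 has order 6 mod 7
w6 = 3
w2, w3 = pow(w6, 3, P), pow(w6, 4, P)    # roots of order 2 and 3

def dft(a, w, n):
    return [sum(a[i] * pow(w, i * j, P) for i in range(n)) % P
            for j in range(n)]

a = [5, 1, 4, 2, 6, 3]
direct = dft(a, w6, 6)

# place a_i at (i mod 2, i mod 3), run 3-point then 2-point DFTs,
# and read off hat{a}_j at (j mod 2, j mod 3)
A = [[a[[i for i in range(6) if i % 2 == i0 and i % 3 == i1][0]]
     for i1 in range(3)] for i0 in range(2)]
rows = [dft(r, w3, 3) for r in A]
good = [None] * 6
for j1 in range(3):
    col = dft([rows[0][j1], rows[1][j1]], w2, 2)
    for j0 in range(2):
        good[[j for j in range(6) if j % 2 == j0 and j % 3 == j1][0]] = col[j0]
assert good == direct
```

Note that the 3-point root is \(\omega _6^4\) (not \(\omega _6^2\)), matching the \(3 \times 3\) factor of the displayed Kronecker product: with \(i \equiv 3 i_0 + 4 i_1\) and \(j \equiv 3 j_0 + 4 j_1 \pmod 6\), we get \(ij \equiv 3 i_0 j_0 + 4 i_1 j_1 \pmod 6\).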

3.5 Rader’s FFT for Odd Prime p

Suppose \(\omega _p \in R\) for an odd prime p. [Rad68] introduced how to map a polynomial \(\sum _{i} a_i x^i\in {R[x]}/{\left\langle {x^p - 1} \right\rangle }\) to the tuple \(\left( \hat{a}_j\right) {:}{=}\left( \sum _{i} a_i \omega _p^{ij}\right) \in \prod _i {R[x]}\big /{\left\langle {x - \omega _p^i} \right\rangle }\) with a size-\((p - 1)\) cyclic convolution. Let g be a generator of \(\mathbb {{Z}}_p^*\) and write \(j = g^{k}\) and \(i = g^{-\ell }\). Then \(\hat{a}_{g^{k}} - a_0 = \hat{a}_j - a_0 = \sum _{i = 1}^{p - 1}a_i \omega _p^{ij} = \sum _{\ell = 0}^{p - 2} a_{g^{-\ell }} \omega _p^{g^{k - \ell }}\) for \(k = 0, \dots , p - 2\).

The sequence \(\left( \sum _{\ell = 0}^{p - 2} a_{g^{-\ell }} \omega _p^{g^{k - \ell }} \right) _{k = 0, \dots , p - 2}\) is the size-\((p - 1)\) cyclic convolution of the sequences \(\left( a_{g^{-i}} \right) _{i = 0, \dots , p - 2}\) and \(\left( \omega _p^{g^i} \right) _{i = 0, \dots , p - 2}\). For example, let \(p = 5\). We have \((1, 2, 3, 4) = (2^4, 2, 2^3, 2^2)\) and

$$ \begin{pmatrix} \hat{a}_2 - a_0 \\ \hat{a}_4 - a_0 \\ \hat{a}_3 - a_0 \\ \hat{a}_1 - a_0 \\ \end{pmatrix} = \begin{pmatrix} \omega _5 &{} \omega _5^3 &{} \omega _5^4 &{} \omega _5^2 \\ \omega _5^2 &{} \omega _5 &{} \omega _5^3 &{} \omega _5^4 \\ \omega _5^4 &{} \omega _5^2 &{} \omega _5 &{} \omega _5^3 \\ \omega _5^3 &{} \omega _5^4 &{} \omega _5^2 &{} \omega _5 \\ \end{pmatrix} \begin{pmatrix} a_3 \\ a_4 \\ a_2 \\ a_1 \end{pmatrix}. $$
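The identity can also be verified numerically. A Python sketch with toy parameters (\(p = 5\), the 5th root of unity 3 modulo 11, and generator \(g = 2\)):

```python
MOD, p, w, g = 11, 5, 3, 2   # 3 has order 5 mod 11; 2 generates Z_5^*

a = [7, 3, 9, 2, 5]
direct = [sum(a[i] * pow(w, i * j % p, MOD) for i in range(p)) % MOD
          for j in range(p)]

# size-(p-1) cyclic convolution of (a_{g^{-l}}) with (w^{g^m})
ginv = pow(g, -1, p)
x = [a[pow(ginv, l, p)] for l in range(p - 1)]
y = [pow(w, pow(g, m, p), MOD) for m in range(p - 1)]
conv = [sum(x[l] * y[(k - l) % (p - 1)] for l in range(p - 1)) % MOD
        for k in range(p - 1)]
assert all((a[0] + conv[k]) % MOD == direct[pow(g, k, p)]
           for k in range(p - 1))
```

The size-4 cyclic convolution (plus the separate handling of \(a_0\) and \(\hat{a}_0\)) thus reproduces every \(\hat{a}_j\) of the size-5 DFT.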

3.6 Schönhage’s and Nussbaumer’s FFTs

Instead of isomorphisms based on CRT, we sometimes compute chains of monomorphisms and determine the unique inverse image from the product of two images. Given polynomials \(\boldsymbol{a}, \boldsymbol{b}\in {R[x]}/{\left\langle {\boldsymbol{g}} \right\rangle }\) where \(\boldsymbol{g}\) is a degree-\(n_0 n_1\) polynomial, we introduce \(y = x^{n_1}\), and write \(\boldsymbol{a}\) and \(\boldsymbol{b}\) as polynomials in \({R[x, y]}/{\left\langle {x^{n_1} - y, \boldsymbol{g}_0} \right\rangle }\) where \(\boldsymbol{g}_0|_{y=x^{n_1}} = \boldsymbol{g}(x)\). In other words, \(\boldsymbol{a}(y) := \sum _{i_0 = 0}^{n_0 - 1} \left( \sum _{i = 0}^{n_1 - 1} a_{i + i_0 n_1} x^i \right) y^{i_0} \in {R[x, y]}/{\left\langle {x^{n_1} - y, \boldsymbol{g}_0} \right\rangle }\). We recap transforms when \({R[x,y]}/{\left\langle {x^{n_1}-y,\boldsymbol{g}_0} \right\rangle }\) does not naturally split.

We want an injection \({R[x]}/{\left\langle {x^{n_1} - y} \right\rangle }\hookrightarrow \bar{R}\) such that \({R[x,y]}/{\left\langle {x^{n_1}-y,\boldsymbol{g}_0} \right\rangle } \hookrightarrow {\bar{R}[y]}\big /{\left\langle {\boldsymbol{g}_0} \right\rangle }\) is a monomorphism with \({\bar{R}[y]}\big /{\left\langle {\boldsymbol{g}_0} \right\rangle } \cong \prod _j {\bar{R}[y]}\big /{\left\langle {\boldsymbol{g}_{0, j}} \right\rangle }\). A Schönhage FFT [Sch77] is when \(\boldsymbol{g}_0 | (y^{n_0} - 1)\), and \(\bar{R} = {R[x]}/{\left\langle {\boldsymbol{h}} \right\rangle }\) with \(\boldsymbol{h}|\mathrm {\Phi }_{n_0}(x)\) (the \(n_0\)-th cyclotomic polynomial). E.g., “cyclic Schönhage” for powers of two \(n_0\), \(n_1 = \frac{n_0}{4}\), \(\boldsymbol{g}_0 = y^{n_0} - 1\), and \(\boldsymbol{h}= x^{2 n_1} + 1\) is:

$$ \begin{aligned} \frac{R[x]}{\left\langle {x^{n_0 n_1} - 1} \right\rangle } \cong \frac{\frac{R[x]}{\left\langle {x^{n_1} - y} \right\rangle }[y]}{\left\langle {y^{n_0} - 1} \right\rangle } \hookrightarrow \frac{\frac{R[x]}{\left\langle {x^{2n_1} + 1} \right\rangle }[y]}{\left\langle {y^{n_0}-1} \right\rangle } \triangleq \frac{\bar{R}[y]}{\left\langle {y^{n_0}-1} \right\rangle } \cong \prod _{i} \frac{\bar{R}[y]}{\left\langle {y - x^i} \right\rangle }. \end{aligned} $$
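A toy-size Python sketch of this chain for \(n_0 = 4\), \(n_1 = 2\) (so the ring is \({R[x]}/{\left\langle {x^8 - 1} \right\rangle }\), \(\bar{R} = {R[x]}/{\left\langle {x^4 + 1} \right\rangle }\), and \(x^2\) serves as the principal 4th root of unity for the size-4 FFT over y) illustrates that no root of unity in \(\mathbb {{Z}}_{4591}\) is needed:

```python
Q = 4591   # no root of unity needed in Z_Q: powers of x play that role in Rbar

def nega_mul(f, g):
    # schoolbook product in Rbar = Z_Q[x]/<x^4 + 1>
    c = [0] * 8
    for i in range(4):
        for j in range(4):
            c[i + j] += f[i] * g[j]
    return [(c[i] - c[i + 4]) % Q for i in range(4)]

def x_pow(f, k):
    # multiply f in Rbar by x^k: a negacyclic rotation (x^4 = -1, x^8 = 1)
    k %= 8
    out = [0] * 4
    for i in range(4):
        j = i + k
        out[j % 4] = (out[j % 4] + (-f[i] if (j // 4) % 2 else f[i])) % Q
    return out

def fft4(F, sgn):
    # size-4 FFT over y with root x^2 in Rbar (sgn = -1: unscaled inverse)
    return [[sum(x_pow(F[i], sgn * 2 * i * j)[d] for i in range(4)) % Q
             for d in range(4)] for j in range(4)]

def schoenhage_cyclic8(a, b):
    A = [[a[2 * i], a[2 * i + 1], 0, 0] for i in range(4)]   # chunks: y = x^2
    B = [[b[2 * i], b[2 * i + 1], 0, 0] for i in range(4)]
    Ch = [nega_mul(u, v) for u, v in zip(fft4(A, 1), fft4(B, 1))]
    C, inv4, c = fft4(Ch, -1), pow(4, -1, Q), [0] * 8
    for i in range(4):
        for d in range(4):                                   # substitute y = x^2
            c[(2 * i + d) % 8] = (c[(2 * i + d) % 8] + inv4 * C[i][d]) % Q
    return c

def cyclic8(a, b):
    # reference: direct cyclic convolution in Z_Q[x]/<x^8 - 1>
    c = [0] * 8
    for i in range(8):
        for j in range(8):
            c[(i + j) % 8] = (c[(i + j) % 8] + a[i] * b[j]) % Q
    return c

a, b = [1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]
assert schoenhage_cyclic8(a, b) == cyclic8(a, b)
```

The map is only a monomorphism, but chunk products have degree at most 2 in x, below the \(2 n_1\) bound, so the inverse image is uniquely determined.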

We can also exchange the roles of x and y and get Nussbaumer’s FFT [Nus80]. We map \({R[x,y]}/{\left\langle {x^{n_1}-y,\boldsymbol{g}_0} \right\rangle } \hookrightarrow {R[x,y]}/{\left\langle {\boldsymbol{h},\boldsymbol{g}_0} \right\rangle }\) for \(\boldsymbol{g}_0| \mathrm {\Phi }_{2 n_1}(y)\) and \(\boldsymbol{h}| (x^{2 n_1} - 1)\). This can be illustrated for powers of two \(n_0 = n_1\), \(\boldsymbol{h}= x^{2 n_1} - 1\), and \(\boldsymbol{g}_0 = y^{n_0} + 1\):

$$ \begin{aligned} \frac{R[x]}{\left\langle {x^{n_0 n_1} + 1} \right\rangle } \cong \frac{R[x, y]}{\left\langle {x^{n_1} - y, y^{n_0} + 1} \right\rangle } \hookrightarrow \frac{\frac{R[y]}{\left\langle {y^{n_0} + 1} \right\rangle }[x]}{\left\langle {x^{2n_1}-1} \right\rangle } \triangleq \frac{\tilde{R}[x]}{\left\langle {x^{2 n_1} - 1} \right\rangle } \cong \prod _i \frac{\tilde{R}[x]}{\left\langle {x - y^i} \right\rangle }. \end{aligned} $$

Our presentation is motivated by [Ber01, Sect. 9, Paragraph “High–radix variants”] and [vdH04, Sect. 3].

4 Implementations

In this section, we discuss our ideas for multiplying polynomials over \(\mathbb {{Z}}_{4591}\). For brevity, we assume \(R = \mathbb {{Z}}_{4591}\) in this section. The state-of-the-art vectorized “big by big” polynomial multiplication in NTRU Prime [BBCT22] computed the product in \({R[x]}\big /{\left\langle {(x^{1024} + 1) (x^{512} - 1)} \right\rangle }\) with Schönhage and Nussbaumer. This leads to 768 size-8 base multiplications, all of which are negacyclic convolutions. [BBCT22] justified the choice as follows:

\(\dots \) since \(4591 - 1 = 2 \cdot 3^3 \cdot 5 \cdot 17\), no simple root of unity is available for recursive radix-2 FFT tricks. \(\dots \) They ([ACC+21]) performed radix-3, radix-5, and radix-17 NTT stages in their NTT (defined in \({R[x]}\big /{\left\langle {x^{1530} - 1} \right\rangle }\)). We instead use a radix-2 algorithm that efficiently utilizes the full ymm registers (for vectorization) in the Haswell architecture.

We propose transformations (essentially) quartering and halving the number of coefficients involved in base multiplications for vectorization. Our first transformation computes the result in \({R[x]}\big /{\left\langle {x^{1536} - 1} \right\rangle }\). We apply Good–Thomas with \(\omega _3 \in R\) for a more rapid decrease of the sizes of polynomial rings, Schönhage for radix-2 butterflies, and Bruun over \({R[x]}\big /{\left\langle {x^{32} + 1} \right\rangle }\). This leads to 384 size-8 base multiplications defined over trinomial moduli. Our second transformation computes the result in \({R[x]}\big /{\left\langle {x^{1632} - 1} \right\rangle }\). We show how to incorporate Rader for radix-17 butterflies and Good–Thomas for the coprime factorization \(17 \cdot 3 \cdot 2\). For computing the size-16 weighted convolutions, we split with Cooley–Tukey and Bruun for \({R[x]}\big /{\left\langle {x^{16} \pm \omega _{102}^i} \right\rangle }\). Since no coefficient ring extensions are involved, this leads to 96 size-8 base multiplications with binomial moduli, 96 size-8 base multiplications with trinomial moduli, and six size-16 base multiplications with binomial moduli.

Section 4.1 formalizes the needs of vectorization, and Sect. 4.2 goes through our implementation Good–Thomas for big-by-small polynomial multiplications. We then go through big-by-big polynomial multiplications. Section 4.3 goes through our implementation Good–Schönhage–Bruun, and Sect. 4.4 goes through our implementation Good–Rader–Bruun.

4.1 The Needs of Vectorization

We formalize “the needs of vectorization” to justify how we choose among transformations. In the literature, power-of-two-sized FFTs are often described as easily vectorizable. In this paper, we explicitly state these needs and relate them to the design of vectorization-friendly polynomial multiplications. Our definition is based on our programming experience.

We assume that a reasonable vector instruction set should provide the following features accessible to programmers:

  • Several vector registers each holding a large number of bits of data. Commonly, each register holds \(2^k\) bits.

  • Several vector arithmetic instructions computing \(2^k\)-bit data from \(2^k\)-bit data while regarding each \(2^k\)-bit data as packed elements.

    • If input and output are regarded as packed \(2^{k'}\)-bit data, we call the instruction a single-width instruction.

    • If input is regarded as packed \(2^{k' - 1}\)-bit data and output is regarded as packed \(2^{k'}\)-bit data, we call the instruction a widening instruction.

    • If input is regarded as packed \(2^{k'}\)-bit data and output is regarded as packed \(2^{k' - 1}\)-bit data, we call the instruction a narrowing instruction.

The terminologies “widening” and “narrowing” come from [ARM21]. For a \(k' \le k\), we are interested in the number of elements \(v = 2^{k - k'}\) contained in a vector register. Intuitively, we want to compute with a minimal amount of data shuffling while maintaining the vectorization feature: if we want to add up several pairs \((a_i, b_i)\) of elements, we assign \((a_i)\) to one vector register and \((b_i)\) to another and issue a vector addition; similarly for subtractions, multiplications, and bitwise operations. We formalize this intuition for algebra homomorphisms.

Let \(\pi \) be a platform-dependent set of module homomorphisms. We’ll specify \(\pi = \pi (\texttt {neon})\) in the case of Neon shortly. Let f be an algebra homomorphism. We call f “vectorization-friendly” if f is a composition of homomorphisms of the form \(g \otimes \text {id}_v \otimes d\) for g an algebra homomorphism and d a composition of elements from \(\pi \). Since \(g \otimes \text {id}_v\) operates over several chunks of v-sets, we need no permutations for this part. For simplicity, we define the set \(\pi \) with the matrix view: \(\pi \) is the set of module homomorphisms representable as a \(v' \times v'\) diagonal matrix or a size-\(v'\) cyclic/negacyclic shift for \(v'\) a multiple of v.

In this paper, we start with \(v'\) a multiple of v and transform accordingly.

4.2 Good–Thomas FFT in “Big\(\times \)Small” Polynomial Multiplications

We recall below the design principle of vectorization-friendly Good–Thomas from [AHY22] and describe our implementation Good–Thomas for “big by small” polynomial multiplications. For a cyclic convolution \({R[x]}/{\left\langle {x^{v n_0 n_1} - 1} \right\rangle }\) where \(n_0\) and \(n_1\) are coprime and v is a multiple of the number of coefficients in a vector, one introduces the equivalences \(x^v \sim u w\), \(u^{n_1} \sim w^{n_0} \sim 1\). Usually, one picks \(n_0\) and \(n_1\) carefully for fast computations; in the simplest form, \(n_0\) is a power of 2 and \(n_1 = 3\). Our Good–Thomas computes the polynomial multiplication in \({\mathbb {{Z}}[x]}\big /{\left\langle {x^{1536} - 1} \right\rangle }\) with \((v, n_0, n_1) = (4, 128, 3)\), where \(v = 4\) comes from the fact that each Neon SIMD register holds four 32-bit values. After reaching \({\mathbb {{Z}}[x, u, w]}\big /{\left\langle {x^4 - u w, u^3 - 1, w^{128} - 1} \right\rangle }\), we want to compute a size-3 NTT over \(u^3 - 1\) and a size-128 NTT over \(w^{128} - 1\). It suffices to choose a large modulus \(q'\) with a principal 384-th root of unity. We choose \(q'\) as a 32-bit modulus bounding the maximum value of the product in \({\mathbb {{Z}}[x]}\big /{\left\langle {x^{1536} - 1} \right\rangle }\). Our Good–Thomas therefore supports any “big by small” polynomial multiplication of size at most 1536.
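The CRT re-indexing behind Good–Thomas can be illustrated at toy sizes (a sketch for intuition in plain Python, not our Neon code): for coprime \(n_0\) and \(n_1\), the permutation \(k \mapsto (k \bmod n_0, k \bmod n_1)\) turns a size-\(n_0 n_1\) cyclic convolution into an \(n_0 \times n_1\) two-dimensional cyclic convolution, with no twiddle factors between the two dimensions.

```python
# Sketch: Good-Thomas turns a length-(n0*n1) cyclic convolution
# (gcd(n0, n1) = 1) into an n0 x n1 2-D cyclic convolution using only
# the CRT permutation k -> (k mod n0, k mod n1). Toy sizes for clarity.
def cyclic_mul(a, b):
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] += a[i] * b[j]
    return c

def crt_index(i0, i1, n0, n1):
    # Unique k in [0, n0*n1) with k = i0 (mod n0) and k = i1 (mod n1).
    return next(k for k in range(n0 * n1)
                if k % n0 == i0 and k % n1 == i1)

def good_thomas_mul(a, b, n0, n1):
    # Map both inputs to the 2-D index space.
    A = [[a[crt_index(i0, i1, n0, n1)] for i1 in range(n1)] for i0 in range(n0)]
    B = [[b[crt_index(i0, i1, n0, n1)] for i1 in range(n1)] for i0 in range(n0)]
    # 2-D cyclic convolution: index addition acts componentwise under CRT.
    C = [[0] * n1 for _ in range(n0)]
    for i0 in range(n0):
        for i1 in range(n1):
            for j0 in range(n0):
                for j1 in range(n1):
                    C[(i0 + j0) % n0][(i1 + j1) % n1] += A[i0][i1] * B[j0][j1]
    # Map the result back through the same CRT permutation.
    c = [0] * (n0 * n1)
    for i0 in range(n0):
        for i1 in range(n1):
            c[crt_index(i0, i1, n0, n1)] = C[i0][i1]
    return c
```

In an actual implementation, each of the two dimensions is then handled by an NTT of the matching size instead of the schoolbook loops above.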

4.3 Good–Thomas, Schönhage’s, and Bruun’s FFT

This section describes our Good–Schönhage–Bruun. We briefly recall the AVX2-optimized “big by big” polynomial multiplication by [BBCT22]. They computed the product in \({R[x]}\big /{\left\langle {(x^{512} - 1) (x^{1024} + 1)} \right\rangle }\). They first applied Schönhage as follows.

$$ \begin{aligned} {} & {} & \frac{R[x]}{\left\langle {(x^{512} - 1) (x^{1024} + 1)} \right\rangle } \cong \frac{ \frac{R[x]}{\left\langle {x^{32} - y} \right\rangle } [y]}{\left\langle {(y^{16} - 1) (y^{32} + 1)} \right\rangle } \\ & \hookrightarrow {} & {} \frac{ \frac{R[x]}{\left\langle {x^{64} + 1} \right\rangle } [y]}{\left\langle {(y^{16} - 1) (y^{32} + 1)} \right\rangle } \cong \prod _{i = 0, 1, 3, j = 0, \dots , 15} \frac{ \frac{R[x]}{\left\langle {x^{64} + 1} \right\rangle } [y]}{\left\langle {y - x^{2 i + 8 j}} \right\rangle }. \end{aligned} $$

They then applied Nussbaumer for multiplying in \(\frac{R[x]}{\left\langle {x^{64} + 1} \right\rangle }\) as follows.

$$ \frac{R[x]}{\left\langle {x^{64} + 1} \right\rangle } \cong \frac{\frac{R[x]}{\left\langle {x^8 - z} \right\rangle }[z]}{\left\langle {z^8 + 1} \right\rangle } \hookrightarrow \frac{\frac{R[x]}{\left\langle {x^{16} - 1} \right\rangle }[z]}{\left\langle {z^8 + 1} \right\rangle } \cong \frac{\frac{R[z]}{\left\langle {z^8 + 1} \right\rangle }[x]}{\left\langle {x^{16} - 1} \right\rangle } \cong \prod _{k = 0, \dots , 15} \frac{\frac{R[z]}{\left\langle {z^8 + 1} \right\rangle }[x]}{\left\langle {x - z^k} \right\rangle }. $$

The vectorization-friendliness of Schönhage is obvious. In principle, Nussbaumer is vectorization-friendly since it shares the same computation as Schönhage after transposing.

Truncated Schönhage vs Good–Thomas and Schönhage. We first discuss an optimization of Schönhage when there is a principal root of unity with order coprime to the one defining Schönhage.

How it Works, Mathematically. In \(R = \mathbb {{Z}}_{4591}\), we know that there is a principal 3rd root of unity \(\omega _3 \in R\). Instead of computing in \({R[x]}\big /{\left\langle {(x^{512} - 1) (x^{1024} + 1)} \right\rangle }\), we apply Schönhage and Good–Thomas FFTs to \({R[x]}\big /{\left\langle {x^{1536} - 1} \right\rangle }\). By definition, if \(\boldsymbol{\omega }\) is a principal \(2^k\)-th root of unity, then \(\omega _3 \boldsymbol{\omega }\) is a principal \(3 \cdot 2^k\)-th root of unity. Let’s define \(\bar{R} = {R[x]}\big /{\left\langle {x^{32} + 1} \right\rangle }\). We introduce a principal 32nd root of unity \(\boldsymbol{\omega }_{32} = x^2\) as follows:

$$ \frac{R[x]}{\left\langle {x^{1536} - 1} \right\rangle } \cong \frac{\frac{R[x]}{\left\langle {x^{16} - y} \right\rangle }[y]}{\left\langle {y^{96} - 1} \right\rangle } \hookrightarrow \frac{\bar{R}[y]}{\left\langle {y^{96} - 1} \right\rangle }. $$

Then \(\omega _3 \boldsymbol{\omega }_{32}\) is a principal 96-th root of unity implementing \( {\bar{R}[y]}\big /{\left\langle {y^{96} - 1} \right\rangle } \cong \prod _{i = 0, 1, 2, j = 0, \dots , 31} {\bar{R}[y]}\bigg /{\left\langle {y - \omega _3^i \boldsymbol{\omega }_{32}^{j}} \right\rangle } \). However, one should not implement this isomorphism with Cooley–Tukey FFT. Observe that multiplication by \(\boldsymbol{\omega }_{32} = x^2\) requires only negating and permuting, whereas multiplication by \(\omega _3\) requires an actual modular multiplication. Cooley–Tukey FFT requires one to multiply by \(\omega _3^i \boldsymbol{\omega }_{32}^j\), which is unreasonably complicated to optimize for \(i, j \ne 0\). We instead apply Good–Thomas FFT implementing \( {\bar{R}[y]}\big /{\left\langle {y^{96} - 1} \right\rangle } \cong {\bar{R}[y, u, w]}\big /{\left\langle {y - u w, u^3 - 1, w^{32} - 1} \right\rangle } \). Obviously, we only need multiplications by powers of \(\omega _3\) and \(\boldsymbol{\omega }_{32}\) and never by \(\omega _3 \boldsymbol{\omega }_{32}\). See Table 1 for an overview of available approaches.
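For concreteness, the existence of \(\omega _3\) is easy to verify directly (a sketch; the element found by brute force below need not be the constant used in an actual implementation):

```python
# Z_4591 contains a principal 3rd root of unity since 3 divides 4591 - 1 = 4590.
q = 4591
w3 = next(g for g in range(2, q) if pow(g, 3, q) == 1)  # nontrivial cube root of 1
assert w3 != 1 and pow(w3, 3, q) == 1
# Principality: 1 + w3 + w3^2 = 0 (mod q), so size-3 NTTs invert exactly.
assert (1 + w3 + w3 * w3) % q == 0
```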

Table 1. Approaches for computing the size-1536 product of two polynomials drawn from \({R[x]}\big /{\left\langle {x^{761} - x - 1} \right\rangle }\).

How it Works, Concretely. We detail the implementation as follows.

  • We transform the input array in[761] into a temporary array out[3][32][32], where out[i][j][0-31] is the size-32 polynomial in \(\frac{R[x]}{\left\langle {x^{32} + 1, u - \omega _3^{i}, w - x^{2j}} \right\rangle }\). Concretely, we combine the permutations of Good–Thomas and Schönhage as out[i][j][k] = in[\(16 \left( ( 64 i + 33 j ) \bmod 96 \right) + k\)] if \(16 \left( ( 64 i + 33 j ) \bmod 96 \right) + k < 761\) and zero otherwise. This step is the foundation of the implicit permutations [ACC+21].

  • For input small, we start with the 8-bit form of the polynomial. Since coefficients are in \(\left\{ {\pm 1, 0} \right\} \), we first perform five layers of radix-2 butterflies without any modular reductions. The initial three layers of radix-2 butterflies are combined with the implicit permutations. For the last two layers of radix-2 butterflies, we use ext if the root is not a power of \(x^{16}\). For the last layer of radix-2 butterflies, we merge the sign-extension and add-sub pairs into the sequence saddl, saddl2, ssubl, ssubl2. We then apply one layer of radix-3 butterflies based on the improvement of [DV78, Equation 8]. We compute the radix-3 NTT \((\hat{v}_0, \hat{v}_1, \hat{v}_2)\) of size-32 polynomials \((v_0, v_1, v_2)\) as:

    $$ \left\{ \begin{aligned} \hat{v}_0 & = v_0 + v_1 + v_2, \\ \hat{v}_1 & = (v_0 - v_2) + \omega _3 (v_1 - v_2), \\ \hat{v}_2 & = (v_0 - v_1) - \omega _3 (v_1 - v_2). \\ \end{aligned} \right. $$
    Algorithm 1. Radix-2 butterfly with symbolic root \(x^2\).

  • For the input big, we use the 16-bit form and perform one layer of radix-3 butterflies followed by five layers of radix-2 butterflies. This implies only 1536 coefficients are involved in radix-3 butterflies instead of 3072 as for the input small. We first apply one layer of radix-3 butterflies and two layers of radix-2 butterflies followed by one layer of Barrett reductions while permuting implicitly for Good–Thomas and Schönhage. Then, we perform three layers of radix-2 butterflies and another layer of Barrett reductions.
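The combined permutation in the first step above can be sanity-checked in a few lines (a sketch under our reading of the index map; 64 and 33 are the CRT idempotents of the coprime pair \((3, 32)\)):

```python
# Sanity check of the implicit Good-Thomas/Schoenhage permutation:
# the chunk index m = (64*i + 33*j) mod 96 is the unique index with
# m = i (mod 3) and m = j (mod 32), matching u^3 - 1 and w^32 - 1;
# coefficient k of chunk m then comes from in[16*m + k].
seen = set()
for i in range(3):
    for j in range(32):
        m = (64 * i + 33 * j) % 96
        assert m % 3 == i and m % 32 == j  # 64 = 1 (mod 3), 33 = 1 (mod 32)
        seen.add(m)
assert seen == set(range(96))  # the map is a bijection onto Z_96
```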

Nussbaumer vs Bruun. Next, we discuss efficient polynomial multiplications in \({R[x]}\big /{\left\langle {x^{32} + 1} \right\rangle }\). [BBCT22] applied Nussbaumer to \({R[x]}\big /{\left\langle {x^{64} + 1} \right\rangle }\). We state without proof that applying Nussbaumer to \({R[x]}\big /{\left\langle {x^{32} + 1} \right\rangle }\) results in 8 polynomial multiplications in \({R[z]}\big /{\left\langle {z^8 + 1} \right\rangle }\). We instead apply Bruun’s FFT, resulting in multiplications in the rings \({R[x]}\big /{\left\langle {x^8 + \alpha x^4 + 1} \right\rangle }\) for 4 different values of \(\alpha \). Since

$$ \begin{aligned} x^{32}+1 &= (x^{16}+1229x^8+1)(x^{16} - 1229 x^8+1) \\ &= (x^8+58x^4+1)(x^8 - 58 x^4+1)(x^8+2116x^4+1)(x^8 - 2116 x^4+1), \end{aligned} $$

we apply \({\textbf {Bruun}}_{1229, 1}\) followed by \({\textbf {Bruun}}_{58, 1}\) and \({\textbf {Bruun}}_{2116, 1}\). We have slower FFT and base multiplications, but we do only half as many as in [BBCT22]. See Table 2 for comparisons.
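Bruun's factorization of \(x^{32} + 1\) over \(\mathbb {Z}_{4591}\) is easy to verify with schoolbook arithmetic (a standalone sketch, not our vectorized code; the key fact is \(1229^2 \equiv 2 \pmod {4591}\)):

```python
# Verify one level of Bruun's factorization of x^32 + 1 over Z_4591:
# (x^16 + 1229 x^8 + 1)(x^16 - 1229 x^8 + 1) = x^32 + 1 since 1229^2 = 2,
# and x^16 + 1229 x^8 + 1 splits further via 58.
q = 4591

def polymul_mod_q(a, b):
    # Schoolbook product of coefficient lists (index = degree), mod q.
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % q
    return c

def poly(pairs, deg):
    # Build a coefficient list of degree `deg` from (degree, coefficient) pairs.
    p = [0] * (deg + 1)
    for d, coef in pairs:
        p[d] = coef % q
    return p

assert pow(1229, 2, q) == 2
lhs = polymul_mod_q(poly([(16, 1), (8, 1229), (0, 1)], 16),
                    poly([(16, 1), (8, -1229), (0, 1)], 16))
assert lhs == poly([(32, 1), (0, 1)], 32)
# One more level: 58^2 = 3364, and 2 - 3364 = 1229 (mod 4591).
f1 = polymul_mod_q(poly([(8, 1), (4, 58), (0, 1)], 8),
                   poly([(8, 1), (4, -58), (0, 1)], 8))
assert f1 == poly([(16, 1), (8, 1229), (0, 1)], 16)
```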

Table 2. Approaches for multiplying in \({R[x]}\big /{\left\langle {x^{64} + 1} \right\rangle }\) and \({R[x]}\big /{\left\langle {x^{32} + 1} \right\rangle }\).

Then, we perform \(96 \cdot 4 = 384\) size-8 base multiplications and compute the inverses of Bruun’s, Schönhage’s, and Good–Thomas FFT.

4.4 Good–Thomas, Rader’s, and Bruun’s FFT

In the previous section, we replaced Nussbaumer with Bruun. This section shows how to additionally replace Schönhage with Rader while computing in \({R[x]}\big /{\left\langle {x^{1632} - 1} \right\rangle }\). We name the resulting computation Good–Rader–Bruun.

Schönhage vs Rader-17. We first observe that the Schönhage in [BBCT22] reduced a size-1536 problem to several size-64 problems. We are looking for a multiple of 17 close to \(\frac{1536}{64} = 48\). We choose 51 since one can define a size-51 cyclic NTT nicely over \(\mathbb {{Z}}_q\) and optimize further by extending the size-51 cyclic NTT to size-102. For the size-102 cyclic NTT, we apply the 3-dimensional Good–Thomas FFT by identifying \((\omega _{17}, \omega _3, \omega _{2}) = (\omega _{102}^{e_0}, \omega _{102}^{e_1}, \omega _{102}^{e_2})\) as the principal roots of unity where \((e_0, e_1, e_2)\) is the unique tuple satisfying \(\forall a \in \mathbb {{Z}}_{102}, a \equiv e_0 (a \bmod 17) + e_1 (a \bmod 3) + e_2 (a \bmod 2) \pmod {102}\). Algorithm 6 is an illustration. Radix-2 and radix-3 computations are straightforward. For the radix-17 cyclic FFT, we apply Rader’s FFT. Algorithm 7 illustrates the multi-dimensional cyclic FFT. Obviously, the above computation is vectorization-friendly.

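The tuple \((e_0, e_1, e_2)\) can be found by brute force (a sketch; the values are the CRT idempotents of \(102 = 17 \cdot 3 \cdot 2\)):

```python
# Good-Thomas for 102 = 17 * 3 * 2: find the unique (e0, e1, e2) with
# a = e0*(a mod 17) + e1*(a mod 3) + e2*(a mod 2) (mod 102) for all a.
# Each e_t is 1 modulo its own factor and 0 modulo the other two.
e0 = next(e for e in range(102) if e % 17 == 1 and e % 6 == 0)
e1 = next(e for e in range(102) if e % 3 == 1 and e % 34 == 0)
e2 = next(e for e in range(102) if e % 2 == 1 and e % 51 == 0)
for a in range(102):
    assert (e0 * (a % 17) + e1 * (a % 3) + e2 * (a % 2)) % 102 == a
```

Consequently \(\omega _{102}^{e_0}\) has order \(102/\gcd (102, e_0) = 17\), and similarly for the other two factors.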

Generalize Bruun Over \(\boldsymbol{x^{2^k} + c}\) for \(\boldsymbol{c \ne \pm 1}\). The composed multiplication over a finite field shows that the remaining factorization follows the same pattern as factorizing \({R[x]}\big /{\left\langle {x^{16} \pm 1} \right\rangle }\). The isomorphism \({R[x]}\big /{\left\langle {x^{16} - \omega _{102}^{2i}} \right\rangle } \cong \prod {R[x]}\big /{\left\langle {x^8 \pm \omega _{102}^i} \right\rangle }\) is obvious. Since we also have \(\prod _i {R[x]}\big /{\left\langle {x^{16} - \omega _{102}^{2i + 1}} \right\rangle } \cong \prod _i {R[x]}\big /{\left\langle {x^{16} + \omega _{102}^{2i}} \right\rangle }\) by permuting, it suffices to understand the isomorphisms defined on \({R[x]}\big /{\left\langle {x^{16} + \omega _{102}^{2i}} \right\rangle }\). Applying Lemma 3, we have \({R[x]}\big /{\left\langle {x^{16} + \omega _{102}^{2i}} \right\rangle } \cong \prod {R[x]}\big /{\left\langle {x^8 \pm \sqrt{2} \omega _{102}^{128i} x^4 + \omega _{102}^{256i}} \right\rangle }\).

Finally, the remaining task is multiplication in \({R[x]}\big /{\left\langle {x^8 + \alpha x^4 + \beta } \right\rangle }\) for some \(\alpha , \beta \in R\). We extend the idea of [CHK+21, Algorithm 17] by alternating between multiplying in R[x] and reducing modulo \(x^8 + \alpha x^4 + \beta \).
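A minimal sketch of the multiply-then-reduce idea (our own illustration in plain Python with toy parameters, not the interleaved formulation of [CHK+21, Algorithm 17]):

```python
# Multiplication in R[x]/(x^8 + alpha*x^4 + beta) for R = Z_4591, sketched
# as a schoolbook product in R[x] followed by reduction using the relation
# x^8 = -alpha*x^4 - beta. An optimized version interleaves the two steps.
q = 4591

def mul_mod_trinomial(a, b, alpha, beta):
    # a, b: coefficient lists of length 8 (index = degree).
    c = [0] * 15  # the product of two degree-7 polynomials has degree <= 14
    for i in range(8):
        for j in range(8):
            c[i + j] = (c[i + j] + a[i] * b[j]) % q
    # Reduce from the top down: each x^d with d >= 8 feeds x^(d-4), x^(d-8).
    for d in range(14, 7, -1):
        t = c[d]
        c[d] = 0
        c[d - 4] = (c[d - 4] - alpha * t) % q
        c[d - 8] = (c[d - 8] - beta * t) % q
    return c[:8]
```

For instance, with \((\alpha , \beta ) = (58, 1)\), multiplying \(x\) by \(x^7\) yields \(x^8 \equiv -58 x^4 - 1\), as expected.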

5 Results

We present the performance numbers in this section. We focus on polynomial multiplications, leaving the fast constant-time GCD [BY19] as future work.

5.1 Benchmark Environment

We use a Raspberry Pi 4 Model B featuring the quad-core Broadcom BCM2711 chipset. It comes with a 32 kB L1 data cache, a 48 kB L1 instruction cache, and a 1 MB L2 cache, and runs at 1.5 GHz. For hashing, we use the aes, sha2, and fips202 implementations from PQClean [KSSW] without any optimizations, due to the lack of corresponding cryptographic units. For randombytes, [BHK+22] used the randombytes from SUPERCOP, which in turn used chacha20. We extract the conversion from chacha20 into randombytes from SUPERCOP and replace chacha20 with our optimized implementation using the pipelines I0/I1 and F0/F1. We use the cycle counter of the PMU for benchmarking. Our programs compile with GCC 10.3.0, GCC 11.2.0, Clang 13.1.6, and Clang 14.0.0. We report numbers for the binaries compiled with GCC 11.2.0.

Table 3. Overview of polynomial multiplications in ntrulpr761/sntrup761.

5.2 Performance of Vectorized Polynomial Multiplications

Table 3 summarizes the performance of vectorized polynomial multiplications.

Table 4. Detailed Good–Schönhage–Bruun cycle counts including reducing to \(\frac{\mathbb {{Z}}_{4591}[x]}{\left\langle {x^{761} - x - 1} \right\rangle }\).

For NTRU Prime, our Good–Rader–Bruun performs the best, followed by Good–Thomas and Good–Schönhage–Bruun. Notice that Good–Rader–Bruun requires no extensions or changes of coefficient rings. The closest instances in the literature regarding vectorization are the Good–Thomas and Schönhage–Nussbaumer by [BBCT22], and the Good–Thomas by [Haa21]. [BBCT22]’s, [Haa21]’s, and our Good–Thomas compute “big by small” polynomial multiplications. We outperform [Haa21]’s Good–Thomas by a factor of 6.1 since they implemented the base multiplications with scalar code using the C \(\%\) operator. On the other hand, [BBCT22]’s Schönhage–Nussbaumer and our Good–Schönhage–Bruun compute “big by big” polynomial multiplications. Regarding the impact of switching from “big by small” to “big by big”, [BBCT22]’s Schönhage–Nussbaumer takes \(\frac{25113}{16992} \approx 147.79\%\) of the cycles of their own Good–Thomas [BBCT22, Sect. 3.4.2], while our Good–Schönhage–Bruun takes only \(\frac{50398}{47696} \approx 105.67 \%\) of the cycles of our own Good–Thomas. Essentially, this demonstrates the benefit of vectorization-friendly Good–Thomas and Bruun over truncated [vdH04] Schönhage and Nussbaumer.

Table 5. Detailed cycle counts of Good–Rader–Bruun, excluding reductions to \({\mathbb {{Z}}_{4591}[x]}\big /{\left\langle {x^{761} - x - 1} \right\rangle }\).
Table 6. Performance of inversions, encoding, and decoding in NTRU Prime.

We also provide the detailed cycle counts of the polynomial multiplications. For the “big by big” polynomial multiplications in sntrup761/ntrulpr761, Table 5 details the numbers of Good–Rader–Bruun and Table 4 details the numbers of Good–Schönhage–Bruun.

5.3 Performance of Schemes

Before comparing the overall performance, we first illustrate the performance of some other critical subroutines. Our implementations of these subroutines are not seriously optimized, except for the parts involving polynomial multiplications; we simply translate existing techniques and AVX2-optimized implementations into Neon. Table 6 summarizes the performance of inversions, encoding, and decoding.

Inversions, Sorting Network, Encoding, and Decoding. For sntrup761, we need one inversion over \(\mathbb {{Z}}_{4591}\) and one inversion over \(\mathbb {{Z}}_3\). We bitslice the inversion over \(\mathbb {{Z}}_3\), and identify and vectorize the hottest loop in the inversion over \(\mathbb {{Z}}_{4591}\). Additionally, we translate the AVX2-optimized sorting network, encoding, and decoding into Neon. Notice that the inversions over \(\mathbb {{Z}}_2\), \(\mathbb {{Z}}_3\), and \(\mathbb {{Z}}_{4591}\), the sorting networks, and the encoding and decoding are implemented generically. With fairly little effort, they can be reused for other parameter sets.

Performance of sntrup761/ntrulpr761. Table 7 summarizes the overall performance. For ntrulpr761, our key generation, encapsulation, and decapsulation are \(2.98 \times \), \(2.79 \times \), and \(3.07 \times \) faster than [Haa21]. For sntrup761, we outperform the reference implementation significantly. Finally, Table 8 details the performance.

Table 7. Overall cycles of sntrup761/ntrulpr761.

Constant-Time Concerns. There are no input-dependent branches in our code. Our program is constant-time only if one trusts the documentation [ARM15]. The source code from [Haa21] and the Armv8-A works [NG21, BHK+22] indicate the same assumption is required. On the most relevant documented Neon implementations, our code is constant-time, but this is never strictly guaranteedFootnote 7, even with Data-Independent Timing (DIT). If ARM extends the domain of DIT to the relevant multiplication instructions used in this paper, our code is guaranteed to be constant-time once the DIT flag is set. Furthermore, essentially all lattice-based post-quantum cryptosystems would benefit from this, since the constant-time concerns arise from the basic building blocks implementing modular multiplications.

Table 8. Detailed performance numbers of sntrup761 and ntrulpr761 with Good–Rader–Bruun. Only performance-critical subroutines are shown.