
1 Introduction

At PQCrypto 2016, the National Institute of Standards and Technology (NIST) announced the Post-Quantum Cryptography Standardization Process for replacing existing standards for public-key cryptography with quantum-resistant cryptosystems. For lattice-based cryptosystems, polynomial multiplications have been the most time-consuming operations. In response, the recently standardized Dilithium, Kyber, and Falcon [AAC+22] wrote number–theoretic transforms (NTTs) into their specifications.

OpenSSH 9.0 defaults to NTRU Prime. However, the polynomial ring in NTRU Prime does not naturally allow NTT-based multiplications. State-of-the-art vectorized implementations introduced various techniques extending the coefficient rings, or computed the results over \(\mathbb {{Z}}\). In each of these approaches, small-degree polynomial multiplications are, empirically, an important bottleneck. We study the compatibility of vectorization and various algorithmic techniques in the literature and choose the ARM Cortex-A72, implementing the Armv8-A architecture, for this work. We are interested in vectorized polynomial multiplications for NTRU Prime. [BBCT22] showed that a vectorized generic polynomial multiplication takes \(\sim 1.5\times \) the time of a “generic by small (ternary coefficients)” one with AVX2. [BBCT22] applied Schönhage and Nussbaumer to ease vectorization. Schönhage and Nussbaumer double the sizes of the coefficient rings and lead to a larger number of small-degree polynomial multiplications. We explain how to avoid the doubling with Good–Thomas, Rader’s, and Bruun’s FFTs.

We implement our ideas on Cortex-A72 implementing Armv8.0-A with the vector instruction set Neon. However, we emphasize that our approaches are built around the notion of vectorization and not a specific architecture.

1.1 Contributions

We summarize our contributions as follows.

  • We formalize the needs of vectorization commonly involved in vectorized implementations.

  • We propose vectorized polynomial multipliers essentially quartering and halving the number of small-dimensional polynomial multiplications after FFTs.

  • We propose novel accumulative (subtractive) variants of Barrett multiplication absorbing the follow-up addition (subtraction).

  • We implement the ideas with the SIMD technology Neon in Armv8.0-A on a Cortex-A72. Our fastest polynomial multiplier outperforms the state-of-the-art optimized implementation by a factor of \(6.1 \times \).

  • In addition to the polynomial multiplication, we vectorize the sorting network, polynomial inversions, encoding, and decoding subroutines used in ntrulpr761 and sntrup761. For ntrulpr761, our key generation, encapsulation, and decapsulation are \(2.98 \times \), \(2.79 \times \), and \(3.07 \times \) faster than the state-of-the-art optimized implementation. For sntrup761, we outperform the reference implementation significantly.

1.2 Code

Our source code can be found at https://github.com/vector-polymul-ntru-ntrup/NTRU_Prime under the CC0 license.

1.3 Structure of This Paper

Section 2 goes through the preliminaries. Section 3 surveys FFTs. Section 4 describes our implementations. We show the performance numbers in Sect. 5.

2 Preliminaries

Section 2.1 describes the polynomial rings in NTRU Prime, Sect. 2.2 describes our target platform Cortex-A72, and Sect. 2.3 describes the modular arithmetic.

2.1 Polynomials in NTRU Prime

The NTRU Prime submission comprises two families: Streamlined NTRU Prime and NTRU LPRime. Both operate on the polynomial ring \({\mathbb {{Z}}_q[x]}/{\langle {x^p - x - 1}\rangle }\) where q and p are primes such that the ring is a finite field. We target the polynomial multiplications for the parameter sets sntrup761 and ntrulpr761 where \(q = 4591\) and \(p = 761\). One should note that sntrup761, which is used by OpenSSH, uses a (Quotient) NTRU structure, and requires inversions in \({\mathbb {{Z}}_3[x]}\big /{\left\langle {x^{761} - x - 1} \right\rangle }\) and \({\mathbb {{Z}}_{4591}[x]}\big /{\left\langle {x^{761} - x - 1} \right\rangle }\). We refer the readers to the specification [BBC+20] for more details. With no other assumptions on the inputs, we call a polynomial multiplication “big by big”. If one of the inputs is guaranteed to be ternary, we call it “big by small”. We optimize both, although the former is required only if we apply the fast constant-time GCD [BY19] to the inversions in the key generation of sntrup761. The fast constant-time GCD is left as future work.

2.2 Cortex-A72

Our target platform is the ARM Cortex-A72, implementing the 64-bit Armv8.0-A instruction set architecture. It is a superscalar Central Processing Unit (CPU) with an in-order frontend and an out-of-order backend. Instructions are first decoded into \(\mu \)ops in the frontend and dispatched to the backend, which contains these eight pipelines: L for loads, S for stores, B for branches, I0/I1 for integer instructions, M for multi-cycle integer instructions, and F0/F1 for Single-Instruction-Multiple-Data (SIMD) instructions. The frontend can only dispatch at most three \(\mu \)ops per cycle. Furthermore, in a single cycle, the frontend dispatches at most one \(\mu \)op using B, at most two \(\mu \)ops using I0/I1, at most two \(\mu \)ops using M, at most one \(\mu \)op using F0, at most one \(\mu \)op using F1, and at most two \(\mu \)ops using L/S [ARM15, Sect. 4.1].

We mainly focus on the pipelines F0, F1, L, and S for performance. F0/F1 are both capable of various additions, subtractions, permutations, comparisons, minimums/maximums, and table lookups. However, multiplications can only be dispatched to F0, and shifts only to F1. The most heavily loaded pipeline is clearly the critical path. If there are more multiplications than shifts, we much prefer instructions that can use either pipeline to go to F1 since the time spent in F0 will dominate our runtime. Conversely, with more shifts than multiplications, we want to dispatch most non-shifts to F0. In practice, we interleave instructions dispatched to the pipeline with the most workload with other pipelines (or even L/S)—and pray. Our experiments show that this approach generally works well. In the case of chacha20 implementing randombytes for benchmarking [BHK+22], we even consider a compiler-aided mixing of I0/I1, F0/F1, and L/S. The idea also proved valuable for Keccak on some other Cortex-A cores [BK22, Table 1].

SIMD Registers. The 64-bit Armv8-A has 32 architectural 128-bit SIMD registers, each viewable as packed 8-, 16-, 32-, or 64-bit elements ([ARM21, Fig. A1-1]), denoted by the suffixes .16B, .8H, .4S, and .2D on the register name, respectively.

Armv8-A Vector Instructions

Multiplications. A plain mul multiplies corresponding vector elements and returns same-sized results. There are many variants of multiplications: mla/mls computes the same product vector and accumulates to or subtracts from the destination. There are the high-half products sqdmulh and sqrdmulh: the former computes the double-size products, doubles the results, and returns the upper halves; the latter additionally rounds before returning the upper halves. There are the long multiplications s{mul,mla,mls}l{,2}. smull multiplies the corresponding signed elements from the lower 64-bit halves of the source registers and places the resulting double-width vector elements in the destination register. It is usually paired with an smull2 using the upper 64-bit halves instead. Their accumulating and subtracting variants are s{mla,mls}l{,2}. We will not use the unsigned counterparts u{mul,mla,mls}l{,2}.
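Since the high-half products are central to the modular multiplications below, a scalar Python model of their 16-bit-lane semantics may help (a sketch: it ignores saturation, which only triggers for \(a = b = -2^{15}\)):

```python
def sqdmulh16(a, b):
    # signed doubling multiply returning the high half: (2*a*b) >> 16
    return (2 * a * b) >> 16

def sqrdmulh16(a, b):
    # rounding variant: add 2^15 before taking the high half
    return (2 * a * b + (1 << 15)) >> 16
```

For example, sqdmulh16(-1, 1) truncates to \(-1\) while sqrdmulh16(-1, 1) rounds to 0; this rounding behavior is what Barrett multiplication below relies on.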

Shifts. shl shifts left; sshr arithmetically shifts right; srshr rounds the results after shifting. We won’t use the unsigned ushr and urshr.

Additions/Subtractions. For basic arithmetic, the usual add/sub adds/subtracts the corresponding elements. Long variants s{add,sub}l{,2} add or subtract the corresponding elements from the lower or upper 64-bit halves and signed-extend into double-width results.

Permutations. Then we have permutations—uzp{1,2} extracts the even and odd positions respectively from a pair of vectors and concatenates the results into a vector. ext extracts the lowest elements (there is an immediate operand specifying the number of bytes) of the second source vector (as the high part) and concatenates to the highest elements of the first source vector. zip{1,2} takes the bottom and top halves of a pair of vectors and riffle-shuffles them into the destination.

2.3 Modular Arithmetic


Let q be an odd modulus, and \(\texttt {R}\) be the size of the arithmetic. We describe the modular reductions and multiplications for computing in \(\mathbb {{Z}}_q\). Barrett reduction [Bar86] reduces a value a by approximating \(a \bmod ^\pm q\) with \(a - \left\lfloor {\frac{a \cdot \left\lfloor {\frac{2^e \texttt {R}}{q}} \right\rceil }{2^e \texttt {R}}} \right\rceil \) (cf. Algorithm 1). For multiplying an unknown a with a fixed value b, we compute \(ab - \left\lfloor {\frac{a \left\lfloor {\frac{b \texttt {R}}{q}} \right\rceil _2}{\texttt {R}}} \right\rceil q \equiv a b \bmod ^\pm q\) (Barrett multiplication [BHK+22]) where \(\left\lfloor {} \right\rceil _2\) is the function mapping a real number r to \(2 \left\lfloor {\frac{r}{2}} \right\rceil \) (cf. Algorithm 2). We give novel multiply-add/sub variants of Barrett multiplication in Algorithms 3 and 4. Algorithm 3 (resp. Algorithm 4) computes a representation of \(a + bc\) (resp. \(a-bc\)) by merging a mul with an add (resp. a sub) into an mla (resp. mls), saving 1 instruction.
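As an illustration, the following Python sketch models these computations with \(q = 4591\) and \(\texttt {R} = 2^{16}\); the exponent \(e = 10\) is an illustrative choice of ours, and the in-register versions realize the shifts and products with sqrdmulh, mul, and mla/mls as described in Sect. 2.2:

```python
Q, R = 4591, 1 << 16   # q = 4591; R is the size of the 16-bit arithmetic

def rshift_round(x, k):
    # round(x / 2^k), halves toward +infinity (like srshr/sqrdmulh)
    return (x + (1 << (k - 1))) >> k

def barrett_reduce(a, e=10):
    # Barrett reduction sketch: a - round(a * round(2^e*R/q) / (2^e*R)) * q
    v = round(2**e * R / Q)            # precomputed; e = 10 is our choice here
    return a - rshift_round(a * v, e + 16) * Q

def barrett_mul(a, b):
    # Barrett multiplication sketch: b fixed, bq = round-to-even of b*R/q
    bq = 2 * ((b * R + Q) // (2 * Q))  # the |_ _|_2 rounding
    return a * b - rshift_round(a * bq, 16) * Q

def barrett_mla(acc, a, b):
    # accumulative variant sketch: acc + a*b folds into a single mla on Neon
    bq = 2 * ((b * R + Q) // (2 * Q))
    return acc + a * b - rshift_round(a * bq, 16) * Q
```

All results are representatives congruent to the exact values modulo q with absolute value below q; only the congruence and range, not a canonical representative, are guaranteed.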


3 Fast Fourier Transforms

We go through the mathematics behind various fast Fourier transforms (FFTs) and emphasize their defining conditions. This section is structured as follows. Section 3.1 reviews the Chinese remainder theorem for polynomial rings and discrete Fourier transform (DFT). We then survey various FFTs, including Cooley–Tukey in Sect. 3.2, Bruun and its finite field counterpart in Sect. 3.3, Good–Thomas in Sect. 3.4, Rader in Sect. 3.5, and Schönhage and Nussbaumer in Sect. 3.6. We use number–theoretic transform (NTT) as a synonym of FFT.

3.1 The Chinese Remainder Theorem (CRT) for Polynomial Rings

Let \(n = \prod _l n_l\), and \(\boldsymbol{g}_{i_0, \dots , i_{h - 1}} \in R[x]\) be coprime polynomials for all indices \((i_l)_{l=0\cdots h-1}\) where \(0\le i_l<n_l\). The CRT gives us a chain of isomorphisms

$$ \begin{aligned} \frac{R[x]}{\left\langle {\prod _{i_0, \dots , i_{h - 1}} \boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle } & \cong {} & {} \prod _{i_0} \frac{R[x]}{\left\langle {\prod _{i_1, \dots , i_{h - 1}} \boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle } \\ \cong \cdots & \cong {} & {} \prod _{i_0, \dots , i_{h - 1}} \frac{R[x]}{\left\langle {\boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle }. \end{aligned} $$

Multiplying in \(\prod _{i_0, \dots , i_{h - 1}} {R[x]}\bigg /{\left\langle {\boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle }\) is cheap if the polynomial modulus is small. If the isomorphism chain is also cheap, we improve the polynomial multiplications in \({R[x]}\bigg /{\left\langle {\prod _{i_0, \dots , i_{h - 1}} \boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle }\). For small \(n_l\)’s, it is usually cheap to decompose a polynomial ring into a product of \(n_l\) polynomial rings.

Transformations will be described with the words “radix”, “split”, and “layer”. We demonstrate this below for \(h = 2\). Suppose we have isomorphisms

$$ {R[x]}\bigg /{\left\langle {\prod _{i_0, i_1} \boldsymbol{g}_{i_0, i_1}} \right\rangle } \overset{\eta _0}{\cong }\ \prod _{i_0} {R[x]}\bigg /{\left\langle {\prod _{i_1} \boldsymbol{g}_{i_0, i_1}} \right\rangle } \overset{\eta _1}{\cong }\ \prod _{i_0, i_1} {R[x]}\big /{\left\langle {\boldsymbol{g}_{i_0, i_1}} \right\rangle } $$

where \(i_0\in \{0, \dots , n_0 - 1\}\) and \(i_1\in \{0, \dots , n_1 - 1\}\). We call \(\eta _0\) a radix-\(n_0\) split and an implementation of \(\eta _0\) a radix-\(n_0\) computation, and similarly for \(\eta _1\). Usually, we implement several isomorphisms together to minimize memory operations. The resulting computation is called a multi-layer computation. If we implement \(\eta _0\) and \(\eta _1\) with a single pair of loads and stores, and \(\eta _0\) and \(\eta _1\) both rely on X, a shape of computation, then the resulting multi-layer computation is called a 2-layer X. If additionally \(n_0 = n_1\), the computation is a 2-layer radix-\(n_0\) X, and similarly for more layers.

3.2 Cooley–Tukey FFT

In a Cooley–Tukey FFT [CT65], we have \(\zeta \in R\), \(\omega _n\in R\) a principal nth root of unity, n coprime to \(\text {char}(R)\), and \(\boldsymbol{g}_{i_0, \dots , i_{h - 1}} = x - \zeta \omega _n^{\sum _l i_l \prod _{j < l} n_j} \in R[x]\). Since \(\prod _{i_0, \dots , i_{h - 1}} \boldsymbol{g}_{i_0, \dots , i_{h - 1}} = x^{n} - \zeta ^n\), the efficiency of multiplying polynomials in \({R[x]}/{\left\langle {x^{n} - \zeta ^n} \right\rangle }\) boils down to the efficiency of the isomorphisms indexed by \(i_l\)’s. Furthermore, it is a cyclic NTT if \(\zeta ^n = 1\).
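A radix-2 Cooley–Tukey split and its inverse can be sketched in Python. Since \(\mathbb {{Z}}_{4591}\) has no power-of-two roots of unity, we use the toy modulus 257 (where 2 has order 16) to illustrate the isomorphism \({R[x]}/{\left\langle {x^{2m} - \zeta ^2} \right\rangle } \cong {R[x]}/{\left\langle {x^m - \zeta } \right\rangle } \times {R[x]}/{\left\langle {x^m + \zeta } \right\rangle }\):

```python
M = 257  # toy modulus: 2 has order 16 mod 257, so power-of-two roots abound

def ct_split(f, zeta):
    # R[x]/<x^{2m} - zeta^2>  ->  R[x]/<x^m - zeta> x R[x]/<x^m + zeta>
    m = len(f) // 2
    return ([(f[i] + zeta * f[i + m]) % M for i in range(m)],
            [(f[i] - zeta * f[i + m]) % M for i in range(m)])

def ct_merge(lo, hi, zeta):
    # the inverse butterfly (CRT direction), dividing by 2 and 2*zeta
    m, inv2, inv2z = len(lo), pow(2, -1, M), pow(2 * zeta, -1, M)
    return ([(lo[i] + hi[i]) * inv2 % M for i in range(m)] +
            [(lo[i] - hi[i]) * inv2z % M for i in range(m)])

def mul_mod(f, g, zeta2, m):
    # schoolbook product in R[x]/<x^m - zeta2>
    c = [0] * (2 * m)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            c[i + j] += fi * gj
    return [(c[i] + zeta2 * c[i + m]) % M for i in range(m)]

# multiply in R[x]/<x^4 + 1> directly and through one split with zeta = 16
f, g = [1, 2, 3, 4], [5, 6, 7, 8]
direct = mul_mod(f, g, 256, 4)               # zeta^2 = 256 = -1 mod 257
lf, hf = ct_split(f, 16)
lg, hg = ct_split(g, 16)
res = ct_merge(mul_mod(lf, lg, 16, 2), mul_mod(hf, hg, 241, 2), 16)
assert res == direct
```

Multiplying in the two half-size rings (with \(x^2 \equiv \pm 16\)) and merging agrees with the direct product; this is the CRT isomorphism at work.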

3.3 Bruun-Like FFTs

[Bru78] first introduced the idea of factoring into trinomials \(\boldsymbol{g}_{i_0, \dots , i_{h - 1}}\) when n is a power of two—to reduce the number of multiplications in R while operating over \(\mathbb {C}\). [Mur96] generalized this to arbitrary even n. For our implementations, we need the results on factoring \(x^{2^k} + 1 \in \mathbb {F}_q[x]\) when \(q \equiv 3 \pmod {4}\) [BGM93] and composed multiplications of polynomials in \(\mathbb {F}_q[x]\) [BC87]. Factoring \(x^n - 1\) over \(\mathbb {F}_q\) is actively researched [BGM93, Mey96, TW13, MVdO14, WYF18, WY21].

Review: The Original Bruun’s FFT (\(\boldsymbol{R = \mathbb {C}}\)). We choose \( \boldsymbol{g}_{i_0, \dots , i_{h - 1}} = x^{2} - \left( \zeta \omega _n^{\sum _l i_l \prod _{j < l} n_j} + \zeta ^{-1} \omega _n^{-\sum _l i_l \prod _{j < l} n_j} \right) x + 1 \) so \(x^{2n} - \left( \zeta ^n + \zeta ^{-n} \right) x^n + 1 = \prod _{i_0, \dots , i_{h - 1}} \boldsymbol{g}_{i_0, \dots , i_{h - 1}} \). This provides us with an alternative factorization for \(x^{4n} - 1 = (x^{2n} - 1) (x^{2n} + 1)\) by choosing \(\zeta ^n = \omega _4\). For a complex number with norm 1, since the sum of its inverse and itself is real, we only need arithmetic in \(\mathbb {R}\) to reach \(\prod _{i_0, \dots , i_{h - 1}} {\mathbb {C}[x]}\bigg /{\left\langle {\boldsymbol{g}_{i_0, \dots , i_{h - 1}}} \right\rangle }\).

\(\boldsymbol{R = \mathbb {F}_q}\) where \(\boldsymbol{q \equiv 3 \pmod {4}}\). We need Theorem 1 for our implementations.

Theorem 1

([BGM93, Theorem 1]). Let \(q \equiv 3 \pmod {4}\) and \(2^w\) be the highest power of two in \(q + 1\). If \(k < w\), then \(x^{2^k} + 1\) factors into irreducible trinomials \(x^2 + \gamma x + 1\) in \(\mathbb {F}_q[x]\). Else (i.e., \(k \ge w\)) \(x^{2^k} + 1\) factors into irreducible trinomials \(x^{2^{k - w + 1}} + \gamma x^{2^{k - w}} - 1\) in \(\mathbb {F}_q[x]\).
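For \(q = 4591\) we have \(q + 1 = 2^4 \cdot 287\), so \(w = 4\). The first split promised by Theorem 1 can be checked in Python; \(\gamma = \sqrt{2}\) exists in \(\mathbb {F}_q\) since \(q \equiv 7 \pmod 8\):

```python
Q = 4591  # q + 1 = 2^4 * 287, so w = 4 in Theorem 1

def polymul(f, g):
    # schoolbook product over Z_Q, coefficients in increasing degree
    c = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            c[i + j] = (c[i + j] + fi * gj) % Q
    return c

# since q = 3 (mod 4), a square root of a quadratic residue r is r^((q+1)/4)
gamma = pow(2, (Q + 1) // 4, Q)
assert gamma * gamma % Q == 2
# (x^2 + gamma x + 1)(x^2 - gamma x + 1) = x^4 + (2 - gamma^2) x^2 + 1 = x^4 + 1
assert polymul([1, gamma, 1], [1, Q - gamma, 1]) == [1, 0, 0, 0, 1]
```

This is exactly the split \({\textbf {Bruun}}_{\sqrt{2}, 1}\) of \(x^4 + 1\) used later in Sect. 3.3.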

Given \(\boldsymbol{f}_0, \boldsymbol{f}_1 \in \mathbb {F}_q[x]\), we define their “composed multiplication” as \(\left( \boldsymbol{f}_0 \odot \boldsymbol{f}_1 \right) {:}{=}\prod _{\boldsymbol{f}_0(\alpha )=0} \prod _{\boldsymbol{f}_1(\beta )=0} \left( x - \alpha \beta \right) \) where \(\alpha , \beta \) run over all the roots of \(\boldsymbol{f}_0, \boldsymbol{f}_1\) in an extension field of \(\mathbb {F}_q\). We need the following from [BC87]:

Lemma 1

([BC87, Eq. 8]).\( \prod _{i_0} \boldsymbol{f}_{0, i_0} \odot \prod _{i_1} \boldsymbol{f}_{1, i_1} = \prod _{i_0, i_1} \left( \boldsymbol{f}_{0, i_0} \odot \boldsymbol{f}_{1, i_1} \right) \) holds for any sequences of polynomials \(\boldsymbol{f}_{0, i_0}, \boldsymbol{f}_{1, i_1} \in \mathbb {F}_q[x]\).

Lemma 2

([BC87, Eq. 5]). If \(\boldsymbol{f}_0 = \prod _\alpha (x - \alpha ) \in \mathbb {F}_q[x]\), then for any \( \boldsymbol{f}_1 \in \mathbb {F}_q[x]\), we have \(\boldsymbol{f}_0 \odot \boldsymbol{f}_1 = \prod _\alpha \alpha ^{\text {deg}(\boldsymbol{f}_1)} \boldsymbol{f}_1(\alpha ^{-1} x) \in \mathbb {F}_q[x]\).

Lemma 3

Let r be odd, \(x^r - 1 = \prod _{i_0} (x - \omega _r^{i_0}) \in \mathbb {F}_q[x]\), and \(x^{2^k} - 1 = \prod _{i_1} \boldsymbol{f}_{i_1} \in \mathbb {F}_q[x]\). We have \( x^{2^k r} - 1 = \prod _{i_0} \left( x^{2^k} - \omega _r^{2^k i_0} \right) = \prod _{i_0, i_1} \omega _r^{i_0 \text {deg} (\boldsymbol{f}_{i_1})} \boldsymbol{f}_{i_1}( \omega _r^{-i_0} x). \)

Proof

First observe \(x^{2^k r} - 1 = \left( x^r - 1 \right) \odot \left( x^{2^k} - 1 \right) \). By Lemma 1, this equals

\( \prod _{i_0} \left( (x - \omega _r^{i_0}) \odot \left( x^{2^k} - 1 \right) \right) = \prod _{i_0, i_1} \left( (x - \omega _r^{i_0}) \odot \boldsymbol{f}_{i_1} \right) . \) According to Lemma 2, \((x - \omega _r^{i_0}) \odot \left( x^{2^k} - 1 \right) = x^{2^k} - \omega _r^{2^k i_0}\) and \((x - \omega _r^{i_0}) \odot \boldsymbol{f}_{i_1} = \omega _r^{i_0 \text {deg} (\boldsymbol{f}_{i_1})} \boldsymbol{f}_{i_1} (\omega _r^{-i_0} x)\) as desired.

In summary, by Lemma 3 we have the following isomorphisms:

$$ \frac{\mathbb {F}_q[x]}{\left\langle {x^{2^k r} - 1} \right\rangle } \cong \frac{\mathbb {F}_q[x]}{\left\langle {\prod _{i_0} \left( x^{2^k} - \omega _r^{2^k i_0} \right) } \right\rangle } \cong \frac{\mathbb {F}_q[x]}{\left\langle {\prod _{i_0, i_1} \omega _r^{i_0 \text {deg} (\boldsymbol{f}_{i_1})} \boldsymbol{f}_{i_1} (\omega _r^{-i_0} x)} \right\rangle }. $$

Radix-2 Bruun’s Butterflies and Inverses. Define \({\textbf {Bruun}}_{\alpha , \beta }\) as follows:

$$ {\textbf {Bruun}}_{\alpha , \beta }: {\left\{ \begin{array}{ll} \frac{R[x]}{\left\langle {x^4 + (2 \beta - \alpha ^2) x^2 + \beta ^2} \right\rangle } &{} \rightarrow \frac{R[x]}{\left\langle {x^2 + \alpha x + \beta } \right\rangle } \times \frac{R[x]}{\left\langle {x^2 - \alpha x + \beta } \right\rangle } \\ a_0 + a_1 x + a_2 x^2 + a_3 x^3 &{} \mapsto \left( (\hat{a}_0 + \hat{a}_1 x), (\hat{a}_2 + \hat{a}_3 x) \right) \end{array}\right. } $$

where

$$ \left\{ \begin{aligned} (\hat{a}_0, \hat{a}_1) & = {} & {} \left( a_0 - \beta a_2 + \alpha \beta a_3, a_1 + (\alpha ^2 - \beta ) a_3 - \alpha a_2 \right) , \\ (\hat{a}_2, \hat{a}_3) & = {} & {} \left( a_0 - \beta a_2 - \alpha \beta a_3, a_1 + (\alpha ^2 - \beta ) a_3 + \alpha a_2 \right) . \end{aligned} \right. $$

We compute \((a_0 - \beta a_2, a_1 + (\alpha ^2 - \beta ) a_3,\alpha a_2, \alpha \beta a_3)\), swap the last two values implicitly, and do an addition-subtraction (cf. Fig. 1). Notice that we can use Barrett_mla and Barrett_mls whenever a product is followed by only one accumulation (\(a_1 + \left( \alpha ^2 - \beta \right) a_3\)) or subtraction (\(a_0 - \beta a_2\)).

Fig. 1. Bruun’s butterfly. \((\hat{a}_0, \hat{a}_1, \hat{a}_2, \hat{a}_3) = {\textbf {Bruun}}_{\alpha , \beta }(a_0, a_1, a_2, a_3)\).

$$ 2 {\textbf {Bruun}}_{\alpha , \beta }^{-1}: {\left\{ \begin{array}{ll} \frac{R[x]}{\left\langle {x^2 + \alpha x + \beta } \right\rangle } \times \frac{R[x]}{\left\langle {x^2 - \alpha x + \beta } \right\rangle } &{} \rightarrow \frac{R[x]}{\left\langle {x^4 + (2 \beta - \alpha ^2) x^2 + \beta ^2} \right\rangle } \\ \left( (\hat{a}_0 + \hat{a}_1 x), (\hat{a}_2 + \hat{a}_3 x) \right) &{} \mapsto 2 a_0 + 2 a_1 x + 2 a_2 x^2 + 2 a_3 x^3 \end{array}\right. } $$

correspondingly defines the inverse, where

$$ \left\{ \begin{aligned} 2 (a_0, a_1) & = {} & {} (\hat{a}_0 + \hat{a}_2 + \left( \hat{a}_3 - \hat{a}_1 \right) \alpha ^{-1} \beta , \hat{a}_1 + \hat{a}_3 - \left( \hat{a}_0 - \hat{a}_2 \right) \alpha ^{-1} \beta ^{-1} \left( \alpha ^2 - \beta \right) ), \\ 2 (a_2, a_3) & = {} & {} (\left( \hat{a}_3 - \hat{a}_1 \right) \alpha ^{-1}, \left( \hat{a}_0 - \hat{a}_2 \right) \alpha ^{-1} \beta ^{-1}). \\ \end{aligned} \right. $$

We compute \( \left( \hat{a}_0 + \hat{a}_2, \hat{a}_1 + \hat{a}_3, \hat{a}_0 - \hat{a}_2, \hat{a}_3 - \hat{a}_1 \right) \), swap the last two values implicitly, multiply the constants \(\alpha ^{-1}, \beta , \alpha ^{-1} \beta ^{-1}\), and \(\left( \alpha ^2 - \beta \right) \), and add-sub (cf. Fig. 2). Both \({\textbf {Bruun}}_{\alpha , \beta }\) and \(2 {\textbf {Bruun}}_{\alpha , \beta }^{-1}\) take 4 multiplications.
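Both butterflies can be sketched and cross-checked in Python over \(\mathbb {{Z}}_{4591}\); the constants \(\alpha = 123\), \(\beta = 456\) below are arbitrary illustrative choices, and the remainder computation independently verifies the forward formulas:

```python
Q = 4591

def bruun(a, alpha, beta):
    # forward butterfly: 4 multiplications, then an add-sub
    a0, a1, a2, a3 = a
    t0 = (a0 - beta * a2) % Q                     # Barrett_mls shape
    t1 = (a1 + (alpha * alpha - beta) * a3) % Q   # Barrett_mla shape
    s2, s3 = alpha * a2 % Q, alpha * beta % Q * a3 % Q
    return [(t0 + s3) % Q, (t1 - s2) % Q, (t0 - s3) % Q, (t1 + s2) % Q]

def bruun_inv2(h, alpha, beta):
    # inverse butterfly, returning 2*(a0, a1, a2, a3)
    h0, h1, h2, h3 = h
    ia = pow(alpha, -1, Q)
    iab = ia * pow(beta, -1, Q) % Q
    s, d = (h0 + h2) % Q, (h0 - h2) % Q
    t, e = (h1 + h3) % Q, (h3 - h1) % Q
    return [(s + e * ia % Q * beta) % Q,
            (t - d * iab % Q * (alpha * alpha - beta)) % Q,
            e * ia % Q,
            d * iab % Q]

def poly_mod(a, m):
    # remainder of a(x) modulo the monic m(x) = m[0] + m[1] x + ... + x^deg
    a, deg = list(a), len(m) - 1
    for i in range(len(a) - 1, deg - 1, -1):
        for j in range(deg):
            a[i - deg + j] = (a[i - deg + j] - a[i] * m[j]) % Q
    return [x % Q for x in a[:deg]]

a, al, be = [1, 2, 3, 4], 123, 456
h = bruun(a, al, be)
assert h[:2] == poly_mod(a, [be, al, 1])       # residue mod x^2 + alpha x + beta
assert h[2:] == poly_mod(a, [be, Q - al, 1])   # residue mod x^2 - alpha x + beta
assert bruun_inv2(h, al, be) == [2 * x % Q for x in a]
```

The round trip returns \(2(a_0, a_1, a_2, a_3)\), matching the factor 2 carried by \(2 {\textbf {Bruun}}_{\alpha , \beta }^{-1}\).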

Fig. 2. Bruun’s inverse butterfly. \((2 a_0, 2 a_1, 2 a_2, 2 a_3) = 2 {\textbf {Bruun}}_{\alpha , \beta }^{-1}(\hat{a}_0, \hat{a}_1, \hat{a}_2, \hat{a}_3)\).

We will use three special cases of Bruun’s butterflies.

  • \({\textbf {Bruun}}_{\sqrt{2}, 1}\): The initial split of \(x^{2^k} + 1\) is \({\textbf {Bruun}}_{\sqrt{2}, 1}\). Since \(\beta = \alpha ^2 - \beta = 1\), we only need two multiplications by \(\sqrt{2}\).

  • \({\textbf {Bruun}}_{\alpha , \pm 1}\): We avoid multiplying with \(\beta =\pm 1\) in \({\textbf {Bruun}}_{\alpha , \pm 1}\) and \(2 {\textbf {Bruun}}_{\alpha , \pm 1}^{-1}\).

  • \({\textbf {Bruun}}_{\alpha , \frac{\alpha ^2}{2}}\): We save no multiplications, but only use 2 constants \(\alpha \) and \(\frac{\alpha ^2}{2}\) instead of 4. It is used in the split of \(x^{2^k} + \omega _r^{2^k i}\) for an odd r.

3.4 Good–Thomas FFTs

A Good–Thomas FFT [Goo58] converts cyclic FFTs and convolutions into multi-dimensional ones for coprime \(n_l\)’s. For the polynomial ring \({R[x]}/{\left\langle {x^{n} - 1} \right\rangle }\), we implement \({R[x]}/{\left\langle {x^n - 1} \right\rangle } \cong \prod _{i_0, \dots , i_{h - 1}} {R[x]}/{\left\langle {x - \prod _l \omega _{n_l}^{i_l}} \right\rangle }\) with a multi-dimensional FFT induced by the equivalences \(x \sim \prod _l u_l\) and \(\forall l, u_l^{n_l} \sim 1\). Formally, we have

$$ \begin{aligned} &\, &\, & \frac{R[x]}{\left\langle {x^n - 1} \right\rangle } \cong \frac{R[x, u_0, \dots , u_{h - 1}]}{\left\langle {x - \prod _l u_l, u_0^{n_0} - 1, \dots , u_{h - 1}^{n_{h - 1}} - 1} \right\rangle } \\ & \cong &\, & \prod _{i_0, \dots , i_{h - 1}} \frac{R[x, u_0, \dots , u_{h - 1}]}{\left\langle {x - \prod _l u_l, u_0 - \omega _{n_0}^{i_0}, \dots , u_{h - 1} - \omega _{n_{h - 1}}^{i_{h - 1}}} \right\rangle } \cong \prod _{i_0, \dots , i_{h - 1}} \frac{R[x]}{\left\langle {x - \prod _l \omega _{n_l}^{i_l}} \right\rangle }. \end{aligned} $$

We illustrate the idea for \(h = 2, n_0 = 2\), and \(n_1 = 3\). Let \(P_{(14)}\) be the permutation matrix exchanging the 1st and the 4th rows. We write the size-6 FFT matrix as follows:

$$ P_{(14)} \begin{pmatrix} 1 &{} 1 &{} 1 &{} 1 &{} 1 &{} 1 \\ 1 &{} \omega _6 &{} \omega _6^2 &{} \omega _6^3 &{} \omega _6^4 &{} \omega _6^5 \\ 1 &{} \omega _6^2 &{} \omega _6^4 &{} 1 &{} \omega _6^2 &{} \omega _6^4 \\ 1 &{} \omega _6^3 &{} 1 &{} \omega _6^3 &{} 1 &{} \omega _6^3 \\ 1 &{} \omega _6^4 &{} \omega _6^2 &{} 1 &{} \omega _6^4 &{} \omega _6^2 \\ 1 &{} \omega _6^5 &{} \omega _6^4 &{} \omega _6^3 &{} \omega _6^2 &{} \omega _6 \\ \end{pmatrix} P_{(14)} = \begin{pmatrix} 1 &{} 1 &{} 1 &{} 1 &{} 1 &{} 1 \\ 1 &{} \omega _6^4 &{} \omega _6^2 &{} 1 &{} \omega _6^4 &{} \omega _6^2 \\ 1 &{} \omega _6^2 &{} \omega _6^4 &{} 1 &{} \omega _6^2 &{} \omega _6^4 \\ 1 &{} 1 &{} 1 &{} \omega _6^3 &{} \omega _6^3 &{} \omega _6^3 \\ 1 &{} \omega _6^4 &{} \omega _6^2 &{} \omega _6^3 &{} \omega _6 &{} \omega _6^5 \\ 1 &{} \omega _6^2 &{} \omega _6^4 &{} \omega _6^3 &{} \omega _6^5 &{} \omega _6 \\ \end{pmatrix} =\begin{pmatrix} 1 &{} 1 \\ 1 &{} -1 \end{pmatrix} \otimes \begin{pmatrix} 1 &{} 1 &{} 1 \\ 1 &{} \omega _6^4 &{} \omega _6^2 \\ 1 &{} \omega _6^2 &{} \omega _6^4 \end{pmatrix}. $$
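The permutation can be checked numerically. A Python sketch with the toy modulus 7 (where 3 has order 6) verifies that, after the CRT reindexing \(i \mapsto (i \bmod 2, i \bmod 3)\), the size-6 DFT is exactly a \(2 \times 3\) two-dimensional DFT:

```python
P = 7                                    # toy modulus; 3 has order 6 mod 7
w6 = 3
w2, w3 = pow(w6, 3, P), pow(w6, 4, P)    # roots of order 2 and 3

def dft(a, w, n):
    return [sum(a[i] * pow(w, i * j, P) for i in range(n)) % P
            for j in range(n)]

a = [5, 1, 4, 2, 6, 3]
direct = dft(a, w6, 6)

# place a_i at (i mod 2, i mod 3), run 3-point then 2-point DFTs,
# and read off hat{a}_j at (j mod 2, j mod 3)
A = [[a[[i for i in range(6) if i % 2 == i0 and i % 3 == i1][0]]
     for i1 in range(3)] for i0 in range(2)]
rows = [dft(r, w3, 3) for r in A]
good = [None] * 6
for j1 in range(3):
    col = dft([rows[0][j1], rows[1][j1]], w2, 2)
    for j0 in range(2):
        good[[j for j in range(6) if j % 2 == j0 and j % 3 == j1][0]] = col[j0]
assert good == direct
```

Note that the 3-point root is \(\omega _6^4\) (not \(\omega _6^2\)), matching the \(3 \times 3\) factor of the displayed Kronecker product: with \(i \equiv 3 i_0 + 4 i_1\) and \(j \equiv 3 j_0 + 4 j_1 \pmod 6\), we get \(ij \equiv 3 i_0 j_0 + 4 i_1 j_1 \pmod 6\).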

3.5 Rader’s FFT for Odd Prime p

Suppose \(\omega _p \in R\) for an odd prime p. [Rad68] introduced how to map a polynomial \(\sum _{i} a_i x^i\in {R[x]}/{\left\langle {x^p - 1} \right\rangle }\) to the tuple \(\left( \hat{a}_j\right) {:}{=}\left( \sum _{i} a_i \omega _p^{ij}\right) \in \prod _i {R[x]}\big /{\left\langle {x - \omega _p^i} \right\rangle }\) with a size-\((p - 1)\) cyclic convolution. Let g be a generator of \(\mathbb {{Z}}_p^*\) and write \(j = g^{k}\) and \(i = g^{-\ell }\). Then \(\hat{a}_{g^{k}} - a_0 = \hat{a}_j - a_0 = \sum _{i = 1}^{p - 1}a_i \omega _p^{ij} = \sum _{\ell = 0}^{p - 2} a_{g^{-\ell }} \omega _p^{g^{k - \ell }}\) for \(k = 0, \dots , p - 2\).

The sequence \(\left( \sum _{\ell = 0}^{p - 2} a_{g^{-\ell }} \omega _p^{g^{k - \ell }} \right) _{k = 0, \dots , p - 2}\) is the size-\((p - 1)\) cyclic convolution of the sequences \(\left( a_{g^{-i}} \right) _{i = 0, \dots , p - 2}\) and \(\left( \omega _p^{g^i} \right) _{i = 0, \dots , p - 2}\). For example, let \(p = 5\). We have \((1, 2, 3, 4) = (2^4, 2, 2^3, 2^2)\) and

$$ \begin{pmatrix} \hat{a}_2 - a_0 \\ \hat{a}_4 - a_0 \\ \hat{a}_3 - a_0 \\ \hat{a}_1 - a_0 \\ \end{pmatrix} = \begin{pmatrix} \omega _5 &{} \omega _5^3 &{} \omega _5^4 &{} \omega _5^2 \\ \omega _5^2 &{} \omega _5 &{} \omega _5^3 &{} \omega _5^4 \\ \omega _5^4 &{} \omega _5^2 &{} \omega _5 &{} \omega _5^3 \\ \omega _5^3 &{} \omega _5^4 &{} \omega _5^2 &{} \omega _5 \\ \end{pmatrix} \begin{pmatrix} a_3 \\ a_4 \\ a_2 \\ a_1 \end{pmatrix}. $$
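The identity can also be verified numerically. A Python sketch with toy parameters (\(p = 5\), the 5th root of unity 3 modulo 11, and generator \(g = 2\)):

```python
MOD, p, w, g = 11, 5, 3, 2   # 3 has order 5 mod 11; 2 generates Z_5^*

a = [7, 3, 9, 2, 5]
direct = [sum(a[i] * pow(w, i * j % p, MOD) for i in range(p)) % MOD
          for j in range(p)]

# size-(p-1) cyclic convolution of (a_{g^{-l}}) with (w^{g^m})
ginv = pow(g, -1, p)
x = [a[pow(ginv, l, p)] for l in range(p - 1)]
y = [pow(w, pow(g, m, p), MOD) for m in range(p - 1)]
conv = [sum(x[l] * y[(k - l) % (p - 1)] for l in range(p - 1)) % MOD
        for k in range(p - 1)]
assert all((a[0] + conv[k]) % MOD == direct[pow(g, k, p)]
           for k in range(p - 1))
```

The size-4 cyclic convolution (plus the separate handling of \(a_0\) and \(\hat{a}_0\)) thus reproduces every \(\hat{a}_j\) of the size-5 DFT.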

3.6 Schönhage’s and Nussbaumer’s FFTs

Instead of isomorphisms based on CRT, we sometimes compute chains of monomorphisms and determine the unique inverse image from the product of two images. Given polynomials \(\boldsymbol{a}, \boldsymbol{b}\in {R[x]}/{\left\langle {\boldsymbol{g}} \right\rangle }\) where \(\boldsymbol{g}\) is a degree-\(n_0 n_1\) polynomial, we introduce \(y = x^{n_1}\), and write \(\boldsymbol{a}\) and \(\boldsymbol{b}\) as polynomials in \({R[x, y]}/{\left\langle {x^{n_1} - y, \boldsymbol{g}_0} \right\rangle }\) where \(\boldsymbol{g}_0|_{y=x^{n_1}} = \boldsymbol{g}(x)\). In other words, \(\boldsymbol{a}(y) := \sum _{i_0 = 0}^{n_0 - 1} \left( \sum _{i = 0}^{n_1 - 1} a_{i + i_0 n_1} x^i \right) y^{i_0} \in {R[x, y]}/{\left\langle {x^{n_1} - y, \boldsymbol{g}_0} \right\rangle }\). We recap transforms when \({R[x,y]}/{\left\langle {x^{n_1}-y,\boldsymbol{g}_0} \right\rangle }\) does not naturally split.

We want an injection \({R[x]}/{\left\langle {x^{n_1} - y} \right\rangle }\hookrightarrow \bar{R}\) such that \({R[x,y]}/{\left\langle {x^{n_1}-y,\boldsymbol{g}_0} \right\rangle } \hookrightarrow {\bar{R}[y]}\big /{\left\langle {\boldsymbol{g}_0} \right\rangle }\) is a monomorphism with \({\bar{R}[y]}\big /{\left\langle {\boldsymbol{g}_0} \right\rangle } \cong \prod _j {\bar{R}[y]}\big /{\left\langle {\boldsymbol{g}_{0, j}} \right\rangle }\). A Schönhage FFT [Sch77] is when \(\boldsymbol{g}_0 | (y^{n_0} - 1)\), and \(\bar{R} = {R[x]}/{\left\langle {\boldsymbol{h}} \right\rangle }\) with \(\boldsymbol{h}|\mathrm {\Phi }_{n_0}(x)\) (the \(n_0\)-th cyclotomic polynomial). E.g., “cyclic Schönhage” for powers of two \(n_0\), \(n_1 = \frac{n_0}{4}\), \(\boldsymbol{g}_0 = y^{n_0} - 1\), and \(\boldsymbol{h}= x^{2 n_1} + 1\) is:

$$ \begin{aligned} \frac{R[x]}{\left\langle {x^{n_0 n_1} - 1} \right\rangle } \cong \frac{\frac{R[x]}{\left\langle {x^{n_1} - y} \right\rangle }[y]}{\left\langle {y^{n_0} - 1} \right\rangle } \hookrightarrow \frac{\frac{R[x]}{\left\langle {x^{2n_1} + 1} \right\rangle }[y]}{\left\langle {y^{n_0}-1} \right\rangle } \triangleq \frac{\bar{R}[y]}{\left\langle {y^{n_0}-1} \right\rangle } \cong \prod _{i} \frac{\bar{R}[y]}{\left\langle {y - x^i} \right\rangle }. \end{aligned} $$
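A toy-size Python sketch of this chain for \(n_0 = 4\), \(n_1 = 2\) (so the ring is \({R[x]}/{\left\langle {x^8 - 1} \right\rangle }\), \(\bar{R} = {R[x]}/{\left\langle {x^4 + 1} \right\rangle }\), and \(x^2\) serves as the principal 4th root of unity for the size-4 FFT over y) illustrates that no root of unity in \(\mathbb {{Z}}_{4591}\) is needed:

```python
Q = 4591   # no root of unity needed in Z_Q: powers of x play that role in Rbar

def nega_mul(f, g):
    # schoolbook product in Rbar = Z_Q[x]/<x^4 + 1>
    c = [0] * 8
    for i in range(4):
        for j in range(4):
            c[i + j] += f[i] * g[j]
    return [(c[i] - c[i + 4]) % Q for i in range(4)]

def x_pow(f, k):
    # multiply f in Rbar by x^k: a negacyclic rotation (x^4 = -1, x^8 = 1)
    k %= 8
    out = [0] * 4
    for i in range(4):
        j = i + k
        out[j % 4] = (out[j % 4] + (-f[i] if (j // 4) % 2 else f[i])) % Q
    return out

def fft4(F, sgn):
    # size-4 FFT over y with root x^2 in Rbar (sgn = -1: unscaled inverse)
    return [[sum(x_pow(F[i], sgn * 2 * i * j)[d] for i in range(4)) % Q
             for d in range(4)] for j in range(4)]

def schoenhage_cyclic8(a, b):
    A = [[a[2 * i], a[2 * i + 1], 0, 0] for i in range(4)]   # chunks: y = x^2
    B = [[b[2 * i], b[2 * i + 1], 0, 0] for i in range(4)]
    Ch = [nega_mul(u, v) for u, v in zip(fft4(A, 1), fft4(B, 1))]
    C, inv4, c = fft4(Ch, -1), pow(4, -1, Q), [0] * 8
    for i in range(4):
        for d in range(4):                                   # substitute y = x^2
            c[(2 * i + d) % 8] = (c[(2 * i + d) % 8] + inv4 * C[i][d]) % Q
    return c

def cyclic8(a, b):
    # reference: direct cyclic convolution in Z_Q[x]/<x^8 - 1>
    c = [0] * 8
    for i in range(8):
        for j in range(8):
            c[(i + j) % 8] = (c[(i + j) % 8] + a[i] * b[j]) % Q
    return c

a, b = [1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]
assert schoenhage_cyclic8(a, b) == cyclic8(a, b)
```

The map is only a monomorphism, but chunk products have degree at most 2 in x, below the \(2 n_1\) bound, so the inverse image is uniquely determined.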

We can also exchange the roles of x and y and get Nussbaumer’s FFT [Nus80]. We map \({R[x,y]}/{\left\langle {x^{n_1}-y,\boldsymbol{g}_0} \right\rangle } \hookrightarrow {R[x,y]}/{\left\langle {\boldsymbol{h},\boldsymbol{g}_0} \right\rangle }\) for \(\boldsymbol{g}_0| \mathrm {\Phi }_{2 n_1}(y)\) and \(\boldsymbol{h}| (x^{2 n_1} - 1)\). This can be illustrated for powers of two \(n_0 = n_1\), \(\boldsymbol{h}= x^{2 n_1} - 1\), and \(\boldsymbol{g}_0 = y^{n_0} + 1\):

$$ \begin{aligned} \frac{R[x]}{\left\langle {x^{n_0 n_1} + 1} \right\rangle } \cong \frac{R[x, y]}{\left\langle {x^{n_1} - y, y^{n_0} + 1} \right\rangle } \hookrightarrow \frac{\frac{R[y]}{\left\langle {y^{n_0} + 1} \right\rangle }[x]}{\left\langle {x^{2n_1}-1} \right\rangle } \triangleq \frac{\tilde{R}[x]}{\left\langle {x^{2 n_1} - 1} \right\rangle } \cong \prod _i \frac{\tilde{R}[x]}{\left\langle {x - y^i} \right\rangle }. \end{aligned} $$

Our presentation is motivated by [Ber01, Sect. 9, Paragraph “High–radix variants”] and [vdH04, Sect. 3].

4 Implementations

In this section, we discuss our ideas for multiplying polynomials over \(\mathbb {{Z}}_{4591}\). For brevity, we assume \(R = \mathbb {{Z}}_{4591}\) in this section. The state-of-the-art vectorized “big by big” polynomial multiplication in NTRU Prime [BBCT22] computed the product in \({R[x]}\big /{\left\langle {(x^{1024} + 1) (x^{512} - 1)} \right\rangle }\) with Schönhage and Nussbaumer. This leads to 768 size-8 base multiplications, all of which are negacyclic convolutions. [BBCT22] justified the choice as follows:

\(\dots \) since \(4591 - 1 = 2 \cdot 3^3 \cdot 5 \cdot 17\), no simple root of unity is available for recursive radix-2 FFT tricks. \(\dots \) They ([ACC+21]) performed radix-3, radix-5, and radix-17 NTT stages in their NTT (defined in \({R[x]}\big /{\left\langle {x^{1530} - 1} \right\rangle }\)). We instead use a radix-2 algorithm that efficiently utilizes the full ymm registers (for vectorization) in the Haswell architecture.

We propose transformations (essentially) quartering and halving the number of coefficients involved in base multiplications for vectorization. Our first transformation computes the result in \({R[x]}\big /{\left\langle {x^{1536} - 1} \right\rangle }\). We apply Good–Thomas with \(\omega _3 \in R\) for a more rapid decrease of the sizes of polynomial rings, Schönhage for radix-2 butterflies, and Bruun over \({R[x]}\big /{\left\langle {x^{32} + 1} \right\rangle }\). This leads to 384 size-8 base multiplications defined over trinomial moduli. Our second transformation computes the result in \({R[x]}\big /{\left\langle {x^{1632} - 1} \right\rangle }\). We show how to incorporate Rader for radix-17 butterflies and Good–Thomas for the coprime factorization \(17 \cdot 3 \cdot 2\). For computing the size-16 weighted convolutions, we split with Cooley–Tukey and Bruun for \({R[x]}\big /{\left\langle {x^{16} \pm \omega _{102}^i} \right\rangle }\). Since no coefficient ring extensions are involved, this leads to 96 size-8 base multiplications with binomial moduli, 96 size-8 base multiplications with trinomial moduli, and six size-16 base multiplications with binomial moduli.

Section 4.1 formalizes the needs of vectorization, and Sect. 4.2 goes through our implementation Good–Thomas for big-by-small polynomial multiplications. We then go through big-by-big polynomial multiplications. Section 4.3 goes through our implementation Good–Schönhage–Bruun, and Sect. 4.4 goes through our implementation Good–Rader–Bruun.

4.1 The Needs of Vectorization

We formalize “the needs of vectorization” to justify how we choose among transformations. In the literature, power-of-two-sized FFTs are often described as easily vectorizable. In this paper, we explicitly state these needs and relate them to the design of vectorization-friendly polynomial multiplications. Our definition is based on our programming experience.

We assume that a reasonable vector instruction set should provide the following features accessible to programmers:

  • Several vector registers each holding a large number of bits of data. Commonly, each register holds \(2^k\) bits.

  • Several vector arithmetic instructions computing \(2^k\)-bit data from \(2^k\)-bit data while regarding each \(2^k\)-bit data as packed elements.

    • If input and output are regarded as packed \(2^{k'}\)-bit data, we call the instruction a single-width instruction.

    • If input is regarded as packed \(2^{k' - 1}\)-bit data and output is regarded as packed \(2^{k'}\)-bit data, we call the instruction a widening instruction.

    • If input is regarded as packed \(2^{k'}\)-bit data and output is regarded as packed \(2^{k' - 1}\)-bit data, we call the instruction a narrowing instruction.

The terminologies “widening” and “narrowing” come from [ARM21]. For a \(k' \le k\), we are interested in the number of elements \(v = 2^{k - k'}\) contained in a vector register. Intuitively, we want to compute with a minimal amount of data shuffling while maintaining the vectorization feature: if we want to add up several pairs \((a_i, b_i)\) of elements, we assign \((a_i)\) to one vector register and \((b_i)\) to another and issue a vector addition; similarly for subtractions, multiplications, and bitwise operations. We formalize this intuition for algebra homomorphisms.

Let \(\pi \) be a platform-dependent set of module homomorphisms. We’ll specify \(\pi = \pi (\texttt {neon})\) in the case of Neon shortly. Let f be an algebra homomorphism. We call f “vectorization-friendly” if f is a composition of homomorphisms of the form \(g \otimes \text {id}_v \otimes d\) for g an algebra homomorphism and d a composition of elements from \(\pi \). Since \(g \otimes \text {id}_v\) operates over several chunks of v-sets, we need no permutations for this part. For simplicity, we define the set \(\pi \) with the matrix view: \(\pi \) is the set of module homomorphisms representable as a \(v' \times v'\) diagonal matrix or a size-\(v'\) cyclic/negacyclic shift for \(v'\) a multiple of v.

In this paper, we start with \(v'\) a multiple of v and transform accordingly.

4.2 Good–Thomas FFT in “Big\(\times \)Small” Polynomial Multiplications

We recall below the design principle of vectorization-friendly Good–Thomas from [AHY22] and describe our implementation Good–Thomas for “big by small” polynomial multiplications. For a cyclic convolution \({R[x]}/{\left\langle {x^{v n_0 n_1} - 1} \right\rangle }\) where \(n_0\) and \(n_1\) are coprime and v is a multiple of the number of coefficients in a vector, one introduces the equivalences \(x^v \sim u w\), \(u^{n_1} \sim w^{n_0} \sim 1\). Usually, one picks \(n_0\) and \(n_1\) carefully for fast computations; in the simplest form, \(n_0\) is a power of 2 and \(n_1 = 3\). Our Good–Thomas computes the polynomial multiplication in \({\mathbb {{Z}}[x]}\big /{\left\langle {x^{1536} - 1} \right\rangle }\) with \((v, n_0, n_1) = (4, 128, 3)\), where \(v = 4\) comes from the fact that each Neon SIMD register holds four 32-bit values. After reaching \({\mathbb {{Z}}[x, u, w]}\big /{\left\langle {x^4 - u w, u^3 - 1, w^{128} - 1} \right\rangle }\), we want to compute a size-3 NTT over \(u^3 - 1\) and a size-128 NTT over \(w^{128} - 1\). It suffices to choose a large modulus \(q'\) with a principal 384-th root of unity. We choose \(q'\) as a 32-bit modulus bounding the maximum value of the product in \({\mathbb {{Z}}[x]}\big /{\left\langle {x^{1536} - 1} \right\rangle }\). Our Good–Thomas therefore supports any “big by small” polynomial multiplication of size at most 1536.
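The CRT re-indexing behind Good–Thomas can be illustrated at toy sizes (a sketch for intuition in plain Python, not our Neon code): for coprime \(n_0\) and \(n_1\), the permutation \(k \mapsto (k \bmod n_0, k \bmod n_1)\) turns a size-\(n_0 n_1\) cyclic convolution into an \(n_0 \times n_1\) two-dimensional cyclic convolution, with no twiddle factors between the two dimensions.

```python
# Sketch: Good-Thomas turns a length-(n0*n1) cyclic convolution
# (gcd(n0, n1) = 1) into an n0 x n1 2-D cyclic convolution using only
# the CRT permutation k -> (k mod n0, k mod n1). Toy sizes for clarity.
def cyclic_mul(a, b):
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] += a[i] * b[j]
    return c

def crt_index(i0, i1, n0, n1):
    # Unique k in [0, n0*n1) with k = i0 (mod n0) and k = i1 (mod n1).
    return next(k for k in range(n0 * n1)
                if k % n0 == i0 and k % n1 == i1)

def good_thomas_mul(a, b, n0, n1):
    # Map both inputs to the 2-D index space.
    A = [[a[crt_index(i0, i1, n0, n1)] for i1 in range(n1)] for i0 in range(n0)]
    B = [[b[crt_index(i0, i1, n0, n1)] for i1 in range(n1)] for i0 in range(n0)]
    # 2-D cyclic convolution: index addition acts componentwise under CRT.
    C = [[0] * n1 for _ in range(n0)]
    for i0 in range(n0):
        for i1 in range(n1):
            for j0 in range(n0):
                for j1 in range(n1):
                    C[(i0 + j0) % n0][(i1 + j1) % n1] += A[i0][i1] * B[j0][j1]
    # Map the result back through the same CRT permutation.
    c = [0] * (n0 * n1)
    for i0 in range(n0):
        for i1 in range(n1):
            c[crt_index(i0, i1, n0, n1)] = C[i0][i1]
    return c
```

In an actual implementation, each of the two dimensions is then handled by an NTT of the matching size instead of the schoolbook loops above.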

4.3 Good–Thomas, Schönhage’s, and Bruun’s FFT

This section describes our Good–Schönhage–Bruun. We briefly recall the AVX2-optimized “big by big” polynomial multiplication by [BBCT22]. They computed the product in \({R[x]}\big /{\left\langle {(x^{512} - 1) (x^{1024} + 1)} \right\rangle }\). They first applied Schönhage as follows.

$$ \begin{aligned} {} & {} & \frac{R[x]}{\left\langle {(x^{512} - 1) (x^{1024} + 1)} \right\rangle } \cong \frac{ \frac{R[x]}{\left\langle {x^{32} - y} \right\rangle } [y]}{\left\langle {(y^{16} - 1) (y^{32} + 1)} \right\rangle } \\ & \hookrightarrow {} & {} \frac{ \frac{R[x]}{\left\langle {x^{64} + 1} \right\rangle } [y]}{\left\langle {(y^{16} - 1) (y^{32} + 1)} \right\rangle } \cong \prod _{i = 0, 1, 3, j = 0, \dots , 15} \frac{ \frac{R[x]}{\left\langle {x^{64} + 1} \right\rangle } [y]}{\left\langle {y - x^{2 i + 8 j}} \right\rangle }. \end{aligned} $$

They then applied Nussbaumer for multiplying in \(\frac{R[x]}{\left\langle {x^{64} + 1} \right\rangle }\) as follows.

$$ \frac{R[x]}{\left\langle {x^{64} + 1} \right\rangle } \cong \frac{\frac{R[x]}{\left\langle {x^8 - z} \right\rangle }[z]}{\left\langle {z^8 + 1} \right\rangle } \hookrightarrow \frac{\frac{R[x]}{\left\langle {x^{16} - 1} \right\rangle }[z]}{\left\langle {z^8 + 1} \right\rangle } \cong \frac{\frac{R[z]}{\left\langle {z^8 + 1} \right\rangle }[x]}{\left\langle {x^{16} - 1} \right\rangle } \cong \prod _{k = 0, \dots , 15} \frac{\frac{R[z]}{\left\langle {z^8 + 1} \right\rangle }[x]}{\left\langle {x - z^k} \right\rangle }. $$

The vectorization-friendliness of Schönhage is obvious. In principle, Nussbaumer is vectorization-friendly since it shares the same computation as Schönhage after transposing.

Truncated Schönhage vs Good–Thomas and Schönhage. We first discuss an optimization of Schönhage when there is a principal root of unity with order coprime to the one defining Schönhage.

How it Works, Mathematically. In \(R = \mathbb {{Z}}_{4591}\), we know that there is a principal 3rd root of unity \(\omega _3 \in R\). Instead of computing in \({R[x]}\big /{\left\langle {(x^{512} - 1) (x^{1024} + 1)} \right\rangle }\), we apply Schönhage and Good–Thomas FFTs to \({R[x]}\big /{\left\langle {x^{1536} - 1} \right\rangle }\). By definition, if \(\boldsymbol{\omega }\) is a principal \(2^k\)-th root of unity, then \(\omega _3 \boldsymbol{\omega }\) is a principal \(3 \cdot 2^k\)-th root of unity. Let’s define \(\bar{R} = {R[x]}\big /{\left\langle {x^{32} + 1} \right\rangle }\). We introduce a principal 32nd root of unity \(\boldsymbol{\omega }_{32} = x^2\) as follows:

$$ \frac{R[x]}{\left\langle {x^{1536} - 1} \right\rangle } \cong \frac{\frac{R[x]}{\left\langle {x^{16} - y} \right\rangle }[y]}{\left\langle {y^{96} - 1} \right\rangle } \hookrightarrow \frac{\bar{R}[y]}{\left\langle {y^{96} - 1} \right\rangle }. $$

Then \(\omega _3 \boldsymbol{\omega }_{32}\) is a principal 96-th root of unity implementing \( {\bar{R}[y]}\big /{\left\langle {y^{96} - 1} \right\rangle } \cong \prod _{i = 0, 1, 2, j = 0, \dots , 31} {\bar{R}[y]}\bigg /{\left\langle {y - \omega _3^i \boldsymbol{\omega }_{32}^{j}} \right\rangle } \). However, one should not implement this isomorphism with Cooley–Tukey FFT. Observe that multiplication by \(\boldsymbol{\omega }_{32} = x^2\) requires only negating and permuting, whereas multiplication by \(\omega _3\) requires an actual modular multiplication. Cooley–Tukey FFT requires one to multiply by \(\omega _3^i \boldsymbol{\omega }_{32}^j\), which is unreasonably complicated to optimize for \(i, j \ne 0\). We instead apply Good–Thomas FFT implementing \( {\bar{R}[y]}\big /{\left\langle {y^{96} - 1} \right\rangle } \cong {\bar{R}[y, u, w]}\big /{\left\langle {y - u w, u^3 - 1, w^{32} - 1} \right\rangle } \). Obviously, we only need multiplications by powers of \(\omega _3\) and \(\boldsymbol{\omega }_{32}\) and never by \(\omega _3 \boldsymbol{\omega }_{32}\). See Table 1 for an overview of available approaches.
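For concreteness, the existence of \(\omega _3\) is easy to verify directly (a sketch; the element found by brute force below need not be the constant used in an actual implementation):

```python
# Z_4591 contains a principal 3rd root of unity since 3 divides 4591 - 1 = 4590.
q = 4591
w3 = next(g for g in range(2, q) if pow(g, 3, q) == 1)  # nontrivial cube root of 1
assert w3 != 1 and pow(w3, 3, q) == 1
# Principality: 1 + w3 + w3^2 = 0 (mod q), so size-3 NTTs invert exactly.
assert (1 + w3 + w3 * w3) % q == 0
```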

Table 1. Approaches for computing the size-1536 product of two polynomials drawn from \({R[x]}\big /{\left\langle {x^{761} - x - 1} \right\rangle }\).

How it Works, Concretely. We detail the implementation as follows.

  • We transform the input array in[761] into a temporary array out[3][32][32], where out[i][j][0-31] is the size-32 polynomial in \(\frac{R[x]}{\left\langle {x^{32} + 1, u - \omega _3^{i}, w - x^{2j}} \right\rangle }\). Concretely, we combine the permutations of Good–Thomas and Schönhage as out[i][j][k] = in[\(16 \left( ( 64 i + 33 j ) \bmod 96 \right) + k\)] if \(16 \left( ( 64 i + 33 j ) \bmod 96 \right) + k < 761\) and zero otherwise. This step is the foundation of the implicit permutations [ACC+21].

  • For input small, we start with the 8-bit form of the polynomial. Since coefficients are in \(\left\{ {\pm 1, 0} \right\} \), we first perform five layers of radix-2 butterflies without any modular reductions. The initial three layers of radix-2 butterflies are combined with the implicit permutations. For the last two layers of radix-2 butterflies, we use ext if the root is not a power of \(x^{16}\). For the last layer of radix-2 butterflies, we merge the sign-extension and add-sub pairs into the sequence saddl, saddl2, ssubl, ssubl2. We then apply one layer of radix-3 butterflies based on the improvement of [DV78, Equation 8]. We compute the radix-3 NTT \((\hat{v}_0, \hat{v}_1, \hat{v}_2)\) of size-32 polynomials \((v_0, v_1, v_2)\) as:

    $$ \left\{ \begin{aligned} \hat{v}_0 & = v_0 + v_1 + v_2, \\ \hat{v}_1 & = (v_0 - v_2) + \omega _3 (v_1 - v_2), \\ \hat{v}_2 & = (v_0 - v_1) - \omega _3 (v_1 - v_2). \\ \end{aligned} \right. $$
    Algorithm 1. Radix-2 butterfly with symbolic root \(x^2\).

  • For the input big, we use the 16-bit form and perform one layer of radix-3 butterflies followed by five layers of radix-2 butterflies. This implies only 1536 coefficients are involved in radix-3 butterflies instead of 3072 as for the input small. We first apply one layer of radix-3 butterflies and two layers of radix-2 butterflies followed by one layer of Barrett reductions while permuting implicitly for Good–Thomas and Schönhage. Then, we perform three layers of radix-2 butterflies and another layer of Barrett reductions.
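The combined permutation in the first step above can be sanity-checked in a few lines (a sketch under our reading of the index map; 64 and 33 are the CRT idempotents of the coprime pair \((3, 32)\)):

```python
# Sanity check of the implicit Good-Thomas/Schoenhage permutation:
# the chunk index m = (64*i + 33*j) mod 96 is the unique index with
# m = i (mod 3) and m = j (mod 32), matching u^3 - 1 and w^32 - 1;
# coefficient k of chunk m then comes from in[16*m + k].
seen = set()
for i in range(3):
    for j in range(32):
        m = (64 * i + 33 * j) % 96
        assert m % 3 == i and m % 32 == j  # 64 = 1 (mod 3), 33 = 1 (mod 32)
        seen.add(m)
assert seen == set(range(96))  # the map is a bijection onto Z_96
```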

Nussbaumer vs Bruun. Next, we discuss efficient polynomial multiplications in \({R[x]}\big /{\left\langle {x^{32} + 1} \right\rangle }\). [BBCT22] applied Nussbaumer to \({R[x]}\big /{\left\langle {x^{64} + 1} \right\rangle }\). We state without proof that applying Nussbaumer to \({R[x]}\big /{\left\langle {x^{32} + 1} \right\rangle }\) results in 8 polynomial multiplications in \({R[z]}\big /{\left\langle {z^8 + 1} \right\rangle }\). We instead apply Bruun’s FFT, resulting in multiplications in the rings \({R[x]}\big /{\left\langle {x^8 + \alpha x^4 + 1} \right\rangle }\) for 4 different values of \(\alpha \). Since

$$ \begin{aligned} x^{32}+1 &= (x^{16}+1229x^8+1)(x^{16} - 1229 x^8+1) \\ &= (x^8+58x^4+1)(x^8 - 58 x^4+1)(x^8+2116x^4+1)(x^8 - 2116 x^4+1), \end{aligned} $$

we apply \({\textbf {Bruun}}_{1229, 1}\) followed by \({\textbf {Bruun}}_{58, 1}\) and \({\textbf {Bruun}}_{2116, 1}\). We have slower FFT and base multiplications, but we do only half as many as in [BBCT22]. See Table 2 for comparisons.
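Bruun's factorization of \(x^{32} + 1\) over \(\mathbb {Z}_{4591}\) is easy to verify with schoolbook arithmetic (a standalone sketch, not our vectorized code; the key fact is \(1229^2 \equiv 2 \pmod {4591}\)):

```python
# Verify one level of Bruun's factorization of x^32 + 1 over Z_4591:
# (x^16 + 1229 x^8 + 1)(x^16 - 1229 x^8 + 1) = x^32 + 1 since 1229^2 = 2,
# and x^16 + 1229 x^8 + 1 splits further via 58.
q = 4591

def polymul_mod_q(a, b):
    # Schoolbook product of coefficient lists (index = degree), mod q.
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % q
    return c

def poly(pairs, deg):
    # Build a coefficient list of degree `deg` from (degree, coefficient) pairs.
    p = [0] * (deg + 1)
    for d, coef in pairs:
        p[d] = coef % q
    return p

assert pow(1229, 2, q) == 2
lhs = polymul_mod_q(poly([(16, 1), (8, 1229), (0, 1)], 16),
                    poly([(16, 1), (8, -1229), (0, 1)], 16))
assert lhs == poly([(32, 1), (0, 1)], 32)
# One more level: 58^2 = 3364, and 2 - 3364 = 1229 (mod 4591).
f1 = polymul_mod_q(poly([(8, 1), (4, 58), (0, 1)], 8),
                   poly([(8, 1), (4, -58), (0, 1)], 8))
assert f1 == poly([(16, 1), (8, 1229), (0, 1)], 16)
```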

Table 2. Approaches for multiplying in \({R[x]}\big /{\left\langle {x^{64} + 1} \right\rangle }\) and \({R[x]}\big /{\left\langle {x^{32} + 1} \right\rangle }\).

Then, we perform \(96 \cdot 4 = 384\) size-8 base multiplications and compute the inverses of Bruun’s, Schönhage’s, and Good–Thomas FFT.

4.4 Good–Thomas, Rader’s, and Bruun’s FFT

In the previous section, we replaced Nussbaumer with Bruun. This section shows how to additionally replace Schönhage with Rader while computing in \({R[x]}\big /{\left\langle {x^{1632} - 1} \right\rangle }\). We name the resulting computation Good–Rader–Bruun.

Schönhage vs Rader-17. We first observe that the Schönhage in [BBCT22] reduced a size-1536 problem to several size-64 problems. We are looking for a multiple of 17 close to \(\frac{1536}{64} = 48\). We choose 51 since one can define a size-51 cyclic NTT nicely over \(\mathbb {{Z}}_q\) and optimize further by extending the size-51 cyclic NTT to size-102. For the size-102 cyclic NTT, we apply the 3-dimensional Good–Thomas FFT by identifying \((\omega _{17}, \omega _3, \omega _{2}) = (\omega _{102}^{e_0}, \omega _{102}^{e_1}, \omega _{102}^{e_2})\) as the principal roots of unity where \((e_0, e_1, e_2)\) is the unique tuple satisfying \(\forall a \in \mathbb {{Z}}_{102}, a \equiv e_0 (a \bmod 17) + e_1 (a \bmod 3) + e_2 (a \bmod 2) \pmod {102}\). Algorithm 6 is an illustration. Radix-2 and radix-3 computations are straightforward. For the radix-17 cyclic FFT, we apply Rader’s FFT. Algorithm 7 illustrates the multi-dimensional cyclic FFT. Obviously, the above computation is vectorization-friendly.

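The tuple \((e_0, e_1, e_2)\) can be found by brute force (a sketch; the values are the CRT idempotents of \(102 = 17 \cdot 3 \cdot 2\)):

```python
# Good-Thomas for 102 = 17 * 3 * 2: find the unique (e0, e1, e2) with
# a = e0*(a mod 17) + e1*(a mod 3) + e2*(a mod 2) (mod 102) for all a.
# Each e_t is 1 modulo its own factor and 0 modulo the other two.
e0 = next(e for e in range(102) if e % 17 == 1 and e % 6 == 0)
e1 = next(e for e in range(102) if e % 3 == 1 and e % 34 == 0)
e2 = next(e for e in range(102) if e % 2 == 1 and e % 51 == 0)
for a in range(102):
    assert (e0 * (a % 17) + e1 * (a % 3) + e2 * (a % 2)) % 102 == a
```

Consequently \(\omega _{102}^{e_0}\) has order \(102/\gcd (102, e_0) = 17\), and similarly for the other two factors.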

Generalize Bruun Over \(\boldsymbol{x^{2^k} + c}\) for \(\boldsymbol{c \ne \pm 1}\). The composed multiplication over a finite field shows that the remaining factorization follows the same pattern as factorizing \({R[x]}\big /{\left\langle {x^{16} \pm 1} \right\rangle }\). The isomorphism \({R[x]}\big /{\left\langle {x^{16} - \omega _{102}^{2i}} \right\rangle } \cong \prod {R[x]}\big /{\left\langle {x^8 \pm \omega _{102}^i} \right\rangle }\) is obvious. Since we also have \(\prod _i {R[x]}\big /{\left\langle {x^{16} - \omega _{102}^{2i + 1}} \right\rangle } \cong \prod _i {R[x]}\big /{\left\langle {x^{16} + \omega _{102}^{2i}} \right\rangle }\) by permuting, it suffices to understand the isomorphisms defined on \({R[x]}\big /{\left\langle {x^{16} + \omega _{102}^{2i}} \right\rangle }\). Applying Lemma 3, we have \({R[x]}\big /{\left\langle {x^{16} + \omega _{102}^{2i}} \right\rangle } \cong \prod {R[x]}\big /{\left\langle {x^8 \pm \sqrt{2} \omega _{102}^{128i} x^4 + \omega _{102}^{256i}} \right\rangle }\).

Finally, the remaining task is multiplication in \({R[x]}\big /{\left\langle {x^8 + \alpha x^4 + \beta } \right\rangle }\) for some \(\alpha , \beta \in R\). We extend the idea of [CHK+21, Algorithm 17] by alternating between multiplying in R[x] and reducing modulo \(x^8 + \alpha x^4 + \beta \).
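A minimal sketch of the multiply-then-reduce idea (our own illustration in plain Python with toy parameters, not the interleaved formulation of [CHK+21, Algorithm 17]):

```python
# Multiplication in R[x]/(x^8 + alpha*x^4 + beta) for R = Z_4591, sketched
# as a schoolbook product in R[x] followed by reduction using the relation
# x^8 = -alpha*x^4 - beta. An optimized version interleaves the two steps.
q = 4591

def mul_mod_trinomial(a, b, alpha, beta):
    # a, b: coefficient lists of length 8 (index = degree).
    c = [0] * 15  # the product of two degree-7 polynomials has degree <= 14
    for i in range(8):
        for j in range(8):
            c[i + j] = (c[i + j] + a[i] * b[j]) % q
    # Reduce from the top down: each x^d with d >= 8 feeds x^(d-4), x^(d-8).
    for d in range(14, 7, -1):
        t = c[d]
        c[d] = 0
        c[d - 4] = (c[d - 4] - alpha * t) % q
        c[d - 8] = (c[d - 8] - beta * t) % q
    return c[:8]
```

For instance, with \((\alpha , \beta ) = (58, 1)\), multiplying \(x\) by \(x^7\) yields \(x^8 \equiv -58 x^4 - 1\), as expected.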

5 Results

We present the performance numbers in this section. We focus on polynomial multiplications, leaving the fast constant-time GCD [BY19] as future work.

5.1 Benchmark Environment

We use a Raspberry Pi 4 Model B featuring the quad-core Broadcom BCM2711 chipset. It comes with a 32 kB L1 data cache, a 48 kB L1 instruction cache, and a 1 MB L2 cache, and runs at 1.5 GHz. For hashing, we use the aes, sha2, and fips202 implementations from PQClean [KSSW] without any optimizations, due to the lack of corresponding cryptographic units. For randombytes, [BHK+22] used the randombytes from SUPERCOP, which in turn used chacha20. We extract the conversion from chacha20 into randombytes from SUPERCOP and replace chacha20 with our optimized implementation using the pipelines I0/I1 and F0/F1. We use the cycle counter of the PMU for benchmarking. Our programs compile with GCC 10.3.0, GCC 11.2.0, Clang 13.1.6, and Clang 14.0.0. We report numbers for the binaries compiled with GCC 11.2.0.

Table 3. Overview of polynomial multiplications in ntrulpr761/sntrup761.

5.2 Performance of Vectorized Polynomial Multiplications

Table 3 summarizes the performance of vectorized polynomial multiplications.

Table 4. Detailed Good–Schönhage–Bruun cycle counts including reducing to \(\frac{\mathbb {{Z}}_{4591}[x]}{\left\langle {x^{761} - x - 1} \right\rangle }\).

For NTRU Prime, our Good–Rader–Bruun performs the best, followed by Good–Thomas and Good–Schönhage–Bruun. Notice that Good–Rader–Bruun requires no extensions or changes of coefficient rings. The closest instances in the literature regarding vectorization are the Good–Thomas and Schönhage–Nussbaumer by [BBCT22], and the Good–Thomas by [Haa21]. [BBCT22]’s, [Haa21]’s, and our Good–Thomas compute “big by small” polynomial multiplications. We outperform [Haa21]’s Good–Thomas by a factor of 6.1 since they implemented the base multiplications with scalar code using the C \(\%\) operator. On the other hand, [BBCT22]’s Schönhage–Nussbaumer and our Good–Schönhage–Bruun compute “big by big” polynomial multiplications. Regarding the impact of switching from “big by small” to “big by big”, [BBCT22]’s Schönhage–Nussbaumer takes \(\frac{25113}{16992} \approx 147.79\%\) of the cycles of their own Good–Thomas [BBCT22, Sect. 3.4.2], while our Good–Schönhage–Bruun takes only \(\frac{50398}{47696} \approx 105.67 \%\) of the cycles of our own Good–Thomas. Essentially, this demonstrates the benefit of vectorization-friendly Good–Thomas and Bruun over truncated [vdH04] Schönhage and Nussbaumer.

Table 5. Detailed cycle counts of Good–Rader–Bruun, excluding reductions to \({\mathbb {{Z}}_{4591}[x]}\big /{\left\langle {x^{761} - x - 1} \right\rangle }\).
Table 6. Performance of inversions, encoding, and decoding in NTRU Prime.

We also provide the detailed cycle counts of the polynomial multiplications. For the “big by big” polynomial multiplications in sntrup761/ntrulpr761, Table 5 details the numbers of Good–Rader–Bruun and Table 4 details the numbers of Good–Schönhage–Bruun.

5.3 Performance of Schemes

Before comparing the overall performance, we first illustrate the performance of some other critical subroutines. Our implementations of these subroutines are not seriously optimized, except for the parts involving polynomial multiplications; we simply translate existing techniques and AVX2-optimized implementations into Neon. Table 6 summarizes the performance of inversions, encoding, and decoding.

Inversions, Sorting Network, Encoding, and Decoding. For sntrup761, we need one inversion over \(\mathbb {{Z}}_{4591}\) and one inversion over \(\mathbb {{Z}}_3\). We bitslice the inversion over \(\mathbb {{Z}}_3\), and identify and vectorize the hottest loop in the inversion over \(\mathbb {{Z}}_{4591}\). Additionally, we translate the AVX2-optimized sorting network, encoding, and decoding into Neon. Notice that the inversions over \(\mathbb {{Z}}_2\), \(\mathbb {{Z}}_3\), and \(\mathbb {{Z}}_{4591}\), the sorting networks, and the encoding and decoding are implemented generically. With fairly little effort, they can be reused for other parameter sets.

Performance of sntrup761/ntrulpr761. Table 7 summarizes the overall performance. For ntrulpr761, our key generation, encapsulation, and decapsulation are \(2.98 \times \), \(2.79 \times \), and \(3.07 \times \) faster than [Haa21]. For sntrup761, we outperform the reference implementation significantly. Finally, Table 8 details the performance.

Table 7. Overall cycles of sntrup761/ntrulpr761.

Constant-Time Concerns. There are no input-dependent branches in our code. Our program is constant-time only if one trusts the documentation [ARM15]. The source code from [Haa21] and the Armv8-A works [NG21, BHK+22] indicate the same assumption is required. On the most relevant documented Neon implementations, our code is constant-time, but this is never strictly guaranteedFootnote 7, even with Data-Independent Timing (DIT). If ARM extends the domain of DIT to the relevant multiplication instructions used in this paper, our code is guaranteed to be constant-time once the DIT flag is set. Furthermore, essentially all lattice-based post-quantum cryptosystems would benefit from this, since the constant-time concerns arise from the basic building blocks implementing modular multiplications.

Table 8. Detailed performance numbers of sntrup761 and ntrulpr761 with Good–Rader–Bruun. Only performance-critical subroutines are shown.