1 Introduction

The design of algorithms for binary polynomial multiplication has long been of great interest to many researchers. Because of applications in a variety of areas, such as cryptography and coding theory, new techniques for improving polynomial multiplication have been presented in numerous papers, e.g., [1, 4, 5, 7, 8, 1318, 20, 2325, 27, 28]. For cryptographic applications, arithmetic in the binary extension field \(\mathbb {F}_{2^n}\) is often used and, of the basic operations in \(\mathbb {F}_{2^n}\), multiplication contributes most to the total number of bit operations. For example, Bernstein [3] showed that a 251-bit scalar multiplication on a binary Edward curves entails 44,679,665 bit operations and that about 96.3 % of this computational cost is due to field multiplications. Multiplications in \(\mathbb {F}_{2^n}\) can be performed in two steps: polynomial multiplication and polynomial reduction. The cost of reduction is \(O(n)\) arithmetic operations, whereas the cost of multiplication is \(O(n^\omega )\), where \(1<\omega \le 2\). The cost of reduction is, therefore, negligible with respect to polynomial multiplication for a large value of \(n\).

Let \(O(n^\omega )\) be the arithmetic complexity, i.e., the number of bit operations for computing the product of two degree \((n-1)\) polynomials over the binary field. The classical or the school-book method of binary polynomial multiplication requires \(n^2\) and \((n-1)^2\) bit level multiplications and additions, respectively. Using Karatsuba’s algorithm [19], multiplication of two binary polynomials can be performed with three multiplications and four additions of half-size polynomials. Recursive use of the Karatsuba algorithm gives \(\omega \le 1.58\). More precisely, the Karatsuba algorithm requires \(7n^{1.58}+O(n)\) operations.

The Karatsuba algorithm is based on the 2-way split, where the polynomials being multiplied are divided into two parts and the Karatsuba algorithm is then applied recursively. As an extension, the 3-way split version of the Karatsuba algorithm requires six multiplications of one-third size polynomials. In [26], the use of the Chinese remainder theorem resulted in sub-quadratic complexity for polynomial multiplication algorithms with six multiplications. In [24] and [25], methods have been presented for 3-way splits with \(6.33n^{1.63}+O(n)\) operations. More recently, this complexity has been improved to \(6.27n^{1.63}+O(n)\) as reported in [11] and then to \(5.8n^{1.63}+O(n)\) as described in [9].

At the CRYPTO 2009 conference, Bernstein proposed several algorithms, including 2-, 3- and 4-way split methods for polynomial multiplication over binary fields [3]. Bernstein’s 2-way split algorithm improves the complexity of the Karatsuba algorithm to \(6.5n^{1.58}+O(n)\). It should be noted that in [27], Zhou and Michalic also reported similar results for a 2-way split algorithm using a different approach. Bernstein’s 2-way and 4-way split algorithms improve the additive complexity, while his 3-way split algorithm improves both the multiplicative and the additive complexity; specifically, the latter was reduced to \(25.5n^{1.46}+O(n)\).

The approach used in [3] for reducing z complexity is to use the best possible algorithms in each recursion rather than the same algorithm in all recursions. For example, the product of degree five binary polynomials, (that is \(n=6\)), requires 61 operations using the school-book method, but Bernstein reduced it to 57 operations by first using his 2-way split algorithm and then applying the school-book algorithm. The improved upper bounds are presented in [2]. This approach was also used in [25] and [13]. The best known results for almost all input sizes up to 1000 are listed in [2] using the 3-way and 4-way algorithms introduced in [3]. On the other hand, for values of \(n=11,12,15,16,18,19\) and \(20\), the results reported in [6] are superior to those in [2].

1.1 Notation and model of computation

\(\mathbb {F}_{q^n}\) is used for the finite field with \(q^n\) elements (where \(q\) is a prime power), and \(\mathbb {F}_q[X]\) is employed for the ring of polynomials over \(\mathbb {F}_q\). \(M_q(n)\) represents the minimum number of bit operations required for the computation of the product of two polynomials of degree less than \(n\) over \(\mathbb {F}_q\). \(D_q(n)\) is used for the delay complexity of polynomial multiplication over \(\mathbb {F}_q\), and \(D_A\) and \(D_X\) denote the delay of bit level multiplication and addition, respectively. Throughout this paper, the cost metric related to polynomial multiplication is taken as the number of bit operations (bit addition and bit multiplication) required for multiplying polynomials over \(\mathbb {F}_2\) or \(\mathbb {F}_4\), and since the computations are over characteristic two fields, addition and subtraction are equal.

1.2 Our contributions

The work presented in this paper represents the following contributions:

  • A modification of Bernstein’s 3-way algorithm offering improvements, albeit small but covering a wider range of polynomial degrees.

  • An improved version of the 5-way algorithm introduced in [12] through an optimization of the number of additions.

  • A new 3-way algorithm with a lower complexity than the ones described in [3, 10, 11]: it entails the asymptotic arithmetic complexity of \(15.125n^{1.46}+O(n)\) and delay complexity \(10\log _3(n)D_X+D_A\).

  • New optimizations of algorithms for polynomial multiplication over \(\mathbb {F}_4\).

  • A new minimum number of bit operations for binary polynomial multiplication presented in [2] and [6].

  • New results on the minimum number of bit operations for binary polynomial multiplication with logarithmic delay complexity.

1.3 Organization of paper

The remainder of the paper is organized as follows: Known algorithms related to our work are presented in the next section along with a description of the slight improvements that have been developed. The proposed improved algorithms over \(\mathbb {F}_2\) are introduced in Sect. 3, and the reduced complexity of multiplication over \(\mathbb {F}_4\) is explained in Sect. 4. Section 5 details how our improvements can enhance cryptographic applications, followed by a summary of our conclusions in Sect.  6.

2 Some known algorithms and their slight improvements

This section provides a brief review of a number of known efficient polynomial multiplication algorithms over \(\mathbb {F}_2\) and presents methods of obtaining slight improvements in some of these algorithms. To save space, the details of the known algorithms are not included; only their complexities are discussed with appropriate references.

2.1 School-book algorithm

Let \(A=\sum _{i=0}^{n-1}a_iX^i,\,B=\sum _{i=0}^{n-1}b_iX^i\) and \(C=AB=\sum _{i=0}^{2n-2}c_iX^i\). The school-book algorithm computes the coefficients of the product of \(A\) and \(B\) as \(C_i= \sum _{j+k=i}^{2n-2}a_jb_kX^i\) where \(0 \le j,k <n.\) The number of multiplications and additions required are \(n^2\) and \((n-1)^2\), respectively. Moreover, one can easily derive the following:

$$\begin{aligned} \left\{ \begin{array}{l} M_2(n+1)\le M_2(n)+4n,\\ D_2(n+1) \le D_2(n)+D_X. \end{array} \right. \end{aligned}$$
(1)

2.2 Karatsuba algorithm (with Bernstein’s improvement)

Now, let \(A\) and \(B\) be degree \((2n-1)\) polynomials over \(\mathbb {F}_2\) and \(C\) be their product. The improved Karatsuba algorithm splits \(A\) and \(B\) into two parts as \( A(x)=A_0+X^{n}A_1\) and \(B(x)=B_0+X^{n}B_1\) where \(A_0=\sum _{i=0}^{n-1} a_{i}X^i,\,A_1=\sum _{i=0}^{n-1} a_{i+n}X^i,\,B_0=\sum _{i=0}^{n-1} b_{i}X^i\), and \(B_1=\sum _{i=0}^{n-1} b_{i+n}X^i\). Bernstein proposed the following algorithm:

$$\begin{aligned}&(A_0+X^{n}A_1)(B_0+X^{n}B_1)\\&\quad =(1+X^n)(A_0B_0+X^nA_1B_1) +X^n(A_0+A_1)(B_0+B_1). \end{aligned}$$

The arithmetic complexity of the algorithm is as follows [3]:

$$\begin{aligned} \left\{ \begin{array}{l} M_2(n+k)\le 2M_2(n)+M_2(k)+3n+4k-3,\\ \qquad n/2 \le k \le n,\\ D_2(2n) \le D_2(n)+3D_X,\\ M_2(n) \le 6.5n^{1.58}-7n+1.5,\\ D_2(n) \le 3\log _2(n)D_X+D_A. \end{array} \right. \end{aligned}$$
(2)

Remark 1

Assume that \(k=n-\ell \) in (2) where \(\ell = \{1,2,3\}\). In this case, it should be noted that the last \(\ell \) terms of \(A_0B_0\) and \((A_0+A_1)(B_0+B_1)\) are identical. Therefore, once \(A_0B_0\) is computed, the cost of computing \((A_0+A_1)(B_0+B_1)\) is less than \(M_2(n)\). The computation of the last \(\ell \) terms is done using the school-book method, which yields the minimum values, and it is \(\ell ^2\) for \(\ell \in \{1,2,3\}\). Hence we have the following recursion:

$$\begin{aligned}&M_2(2n-\ell )\le 2M_2(n)\nonumber \\&\quad +M_2(n-\ell )+7n-4\ell -3-\ell ^2, \quad 1 \le \ell \le 3. \end{aligned}$$
(3)

It should be noted that Bernstein obtained bounds by computing explicit algorithms and thus because of the detection of common operations, the bounds in [2] are less than the values obtained directly through the recursion. For \(\ell > 3\), the number of common expressions might change depending on the value of \(n\).

2.3 Bernstein’s 3-way split algorithm

Let \(A\) and \(B\) be degree \((3n-1)\) polynomials over \(\mathbb {F}_2\) and \(C\) be their product. This method splits \(A\) and \(B\) in three parts as follows: \(A=A_0+A_1X^{n}+A_2X^{2n},\,B=B_0+B_1X^{n}+B_2 X^{2n}\) where \(A_j=\sum _{i=0}^{n-1} a_{i+nj}X^i\) and \(B_j=\sum _{i=0}^{n-1} b_{i+nj}X^i\) for \(j=0,1,2.\) Bernstein’s 3-way split algorithm is the following [3]:

$$\begin{aligned} \left\{ \begin{array}{l} P_0 = A_0B_0, \,P_1 = (A_0+A_1+A_2)(B_0+B_1+B_2), \\ P_2 = (A_0+A_1X+A_2X^2)(B_0+B_1X+B_2X^2), \\ P_3 = \left( (A_0+A_1+A_2)+(A_1X+A_2X^2)\right) \left( (B_0+B_1+B_2)\right. \\ \qquad \quad \left. +(B_1X+B_2X^2)\right) , \\ P_4 = A_2B_2,\,U=P_0+(P_0+P_1)X^n, \,\\ \qquad V=P_2+(P_2+P_3)(X^{n}+X), \\ C=U+P_4(X^{4n}+X^{n})+\displaystyle \frac{(U+V+P_4(X^4+X))(X^{2n}+X^{n})}{X^2+X}. \end{array} \right. \end{aligned}$$
(4)

The arithmetic complexity of the algorithm is as follows [3, 10, 11]:

$$\begin{aligned} \left\{ \begin{array}{l} M_2(3n) \le 3M_2(n)+2M_2(n+2)+35n-12,\quad n \ge 2,\\ M_2(2n+k)\le 2M_2(n)+M_2(k)+2M_2(n+1)+25n\\ \quad +10k-12,\quad 1 \le k \le n-1,\\ D_2(3n) \le D_2(n)+(3n+8)D_X,\\ M_2(n) \le 25.5n^{1.46}-25.5n+1,\\ D_2(n) \le (1.5n+8\log _3(n)-1.5)D_X+D_A. \end{array} \right. \end{aligned}$$
(5)

The reason for the linear delay complexity is the division by \((X^2+X)\) in the Eq. (4). This division requires \((n-2)\) bit additions and a delay of \((n-2)D_X\). A detailed explanation is in Section 2.3.2 of [11]. We also note that one can obtain a logarithmic delay for this type of exact division. However, in this case, the number of additions increases significantly.

Remark 2

It should be noted that in (4), the first term of each of \(P_0\) and \(P_2\) is \(a_0b_0\), and the first term of each of \(P_1\) and \(P_3\) is \((a_0+a_n+a_{2n})(b_0+b_n+_{2n})\). Two multiplications are thus saved here. As well, the last term of \(P_2 \) and that of \(P_4\) are identical, which also saves a multiplication. Finally, the last two terms of \(P_2 \) and \(P_3\) are likewise the same, which brings the savings up to five operations. It should also be noted that the first term of \(P_0+P_1\) and that of \(P_2+P_3\) are also the same. The result of all of the above observations is a total of nine common expressions for computing \(M(3n)\). On the other hand, for \(M_2(2n+k),\, 1 \le k \le n-1\), one can observe three common multiplications in the first term of \(P_2\) and \(P_0\), the first term of \(P_3\) and \(P_1\), and the last term of \(P_2 \) and \(P_3\). Furthermore, the first term of \(P_0+P_1\) and that of \(P_2+P_3\) are the same. Therefore, (5) can be rewritten as

$$\begin{aligned} \left\{ \begin{array}{l} M_2(3n) \le 3M_2(n)+2M_2(n+2)+35n-12-9,\quad n \ge 2,\\ M_2(2n+k)\le 2M_2(n)+M_2(k)+2M_2(n+1)+25n\\ \quad +10k-12-4,\quad 1 \le k \le n-1.\\ \end{array} \right. \end{aligned}$$
(6)

One can also note that the number of common operations is actually greater than that indicated above. These observations were also reported in [3] and explicit algorithms are obtained by eliminating the common operations in [2]. The results in [2] are, therefore, better than the theoretical results detailed in [3].

2.4 Karatsuba-like improved 3-way split algorithm

Let \(A,B,C,A_0,A_1,A_2,B_0,B_1\) and \(B_2\) be as in Bernstein’s 3-way algorithm presented above. This algorithm was obtained in [9] using a technique similar to that employed in [27]. The algorithm is as follows:

$$\begin{aligned} \left\{ \begin{array}{l} P_0 = A_0B_0= P_{0L}+P_{0H}X^{n}, \, P_1 = A_1B_1= P_{1L}+P_{0H}X^{n},\\ P_2 = A_2B_2= P_{2L}+P_{2H}X^{n}, \,\\ P_3 = (A_1+A_2)(B_1+B_2) = P_{3L}+P_{3H}X^{n},\\ P_4 = (A_0+A_1)(B_0+B_1)= P_{4L}+P_{4H}X^{n},\,\\ P_5 = (A_0+A_2)(B_0+B_2)= P_{5L}+P_{5H}X^{n}, \\ R_0 = P_{0H}+P_{1L}, \, R_1 = R_0+P_{0L},\, R_2 = R_1+P_{4L},\,\\ R_3=P_{1H}+P_{2L},\ R_4=R_1+R_3, R_5=P_{4H}+P_{5L},\,\\ R_6=R_4+R_5,\, R_7=R_3+P_{2H},\, R_8=R_7+R_0,\,\\ R_9=R_8+P_{3L}, R_{10}=R_9+P_{5H},\, \, R_{11}=R_7+P_{3H},\\ C=P_{0L}+R_2X^{n}+R_6X^{2n}+R_{10}X^{3n}+R_{11}X^{4n}+P_{2H}X^{5n}. \end{array} \right. \end{aligned}$$

Assume that \(A\) and \(B\) are degree \(2n+k-1\) polynomials, where \(1 \le k \le n\). \(A_0,\,A_1,\,B_0\) and \(B_1\) are then degree \((n-1)\) polynomials, and \(A_2\) and \(B_2\) are degree \((k-1)\) polynomials. Therefore, \(P_{0L},\,P_{1L}\), and \(P_{2L}\) are degree \((n-1)\) polynomials, and \(P_{0H}\) and \(P_{1H}\) are \((n-2)\) polynomials. On the other hand, \(P_{2L}\) is a degree \((n-1)\) polynomial, \(P_{2H}\) is a degree \((2k-n-1)\) polynomial for \(n/2<k\le n,\,P_{2L}\) is a degree \((2k-2)\) polynomial, and \(P_{2H}=0\) for \(k\le n/2 \). Note that \((A_0+A_1)\) and \((B_0+ B_1)\) each require \(n\) additions, \((A_0+A_2),\,(A_1+A_2),\,(B_0+B_2)\), and \((B_1+b_2)\) each require \(k\) additions; \(R_0,\,R_3,\,R_5,\,R_{10}\), and \(R_{11}\) each require \((n-1)\) additions; \(R_1,\,R_2,\,R_4,\,R_6,\,R_8\), and \(R_9\) each require \(n\) additions and \(R_7\) requires \((2k-n-1)\) additions for \(n/2<k\le n\). For \(k\le n/2,\,R_7\) requires no additions. Therefore, we obtain the following recursions [9]:

$$\begin{aligned} \left\{ \begin{array}{l} M_2(3n) \le 6M_2(n)+18n-6,\\ M_2(2n+k) \le 5M_2(n)+M_2(k)+12n+6k-6,\\ \qquad n/2<k\le n,\\ M_2(2n+k) \le 5M_2(n)+M_2(k)+13n+4k-5,\quad k\le n/2,\\ D_2(3n) \le D_2(n)+4D_X,\\ M_2(n) \le 5.8n^{1.63}-6n+1.2,\\ D_2(n) \le 4\log _3(n)D_X+D_A. \end{array} \right. \end{aligned}$$
(7)

Remark 3

Assume that \(k=n-\ell \) for \(1\le \ell \le 2\). The last \(\ell \) terms of the products \(A_0B_0\) and \((A_0+A_2)(B_0+B_2)\) are then the same, and the last \(\ell \) terms of the products \(A_1B_1\) and \((A_1+A_2)(B_1+B_2)\) are also the same. Therefore, we can obtain the following bound using the school-book method:

$$\begin{aligned}&M_2(3n-\ell ) \le 5M_2(n)\nonumber \\&\quad +\,M_2(n-\ell )+18n-6\ell -6-2\ell ^2, \quad 1\le \ell \le 2. \end{aligned}$$
(8)

2.5 Bernstein’s 4-way split algorithm

Let \(A\) and \(B\) be two degree \((4n-1)\) polynomials over \(\mathbb {F}_2\) and \(C\) be their product. This method splits \(A\) and \(B\) into four parts as \(A=A_0+A_1X^{n}+A_2X^{2n}+A_3X^{3n},\,B=B_0+B_1X+B_2 X^{2n}+B_3X^{3n}\) where \(A_j=\sum _{i=0}^{n-1} a_{i+nj}X^i\) and \(B_j=\sum _{i=0}^{n-1} b_{i+nj}X^i\) for \(j=0,1,2,3.\) Bernstein’s 4-way algorithm is the following:

$$\begin{aligned} \left\{ \begin{array}{l} AB=(1+X^{2n})((1+X^n)(A_0B_0+X^nA_1B_1+X^{2n}A_2B_2+X^{3n}A_3B_3)\\ \quad +X^n(A_0+A_1)(B_0+B_1)+X^{3n}(A_2+A_3)(B_2+B_3))\\ \quad +X^{2n}(A_0+A_2+(A_1+A_3)X^n)(B_0+B_2+(B_1+B_3)X^n). \end{array} \right. \end{aligned}$$

The arithmetic complexity of the algorithm is as follows [3, 9]:

$$\begin{aligned} \left\{ \begin{array}{l} M_2(4n) \le M_2(2n)+6M_2(n)+27n-8,\\ M_2(3n+k) \le M_2(2n)+5M_2(n)+M_2(k)+19n+8k-8,\\ \qquad n/2 \le k \le n,\\ D_2(4n) \le D_2(n)+5D_X,\\ M_2(n) \le 6.425n^{1.58}-6.8n+1.375,\\ D_2(n) \le 5\log _4(n)D_X+D_A. \end{array} \right. \end{aligned}$$
(9)

Remark 4

It should be noted that if \(k=n-\ell \) in (9) for \(1 \le \ell \le 3\), then \(A_2B_2\) and \((A_2+A_3)(B_2+B_3)\) have the same last \(\ell \) terms. Similarly, \((A_0+A_2+(A_1+A_3)X^n)(B_0+B_2+(B_1+B_3)X^n)\) and \(A_1B_1\) have the same last \(\ell \) terms. Therefore, once \(A_2B_2\) and \(A_1B_1\) are computed using the school-book method, the cost of computing \((A_2+A_3)(B_2+B_3)\) and \((A_0+A_2+(A_1+A_3)X^n)(B_0+B_2+(B_1+B_3)X^n)\) is less than or equal to \(M_2(n)-\ell ^2\) and \(M_2(2n)-\ell ^2\), respectively. Thus, we get the following recursion:

$$\begin{aligned}&M_2(4n-\ell )\le M_2(2n)+5M_2(n)\nonumber \\&\quad +M_2(n-\ell )+27n-8\ell -8-2\ell ^2, \quad 1 \le \ell \le 3. \end{aligned}$$
(10)

2.6 CNH 3-way split algorithm

Let \(A,B,C,A_0,A_1,A_2,B_0,B_1\), and \(B_2\) be defined as in Bernstein’s 3-way algorithm. In [10, 11], Cenk, Negre, and Hasan proposed the following algorithm for computing \(C=AB\), where \(\alpha \) is the generator of \(\mathbb {F}_4\):

$$\begin{aligned} \left\{ \begin{array}{l} P_0 =A_0 B_0 ,\, P_1 = (A_0 + A_1+A_2)(B_0+B_1+B_2),\\ P_2 = (A_0 + A_2+ \alpha (A_1 + A_2)) (B_0 + B_2+ \alpha (B_1 + B_2)),\\ P_3 = (A_0 + A_1+ \alpha (A_1 + A_2)) (B_0 + B_1+ \alpha (B_1 + B_2)),\,\\ P_4 = A_2 B_2, \\ C=(P_0 + X^{n}P_4)(1+X^{3n}) +(P_1+(1+\alpha )( P_2+P_3))\\ \quad (X^{n}+X^{2n}+X^{3n}) +\alpha ( P_2 +P_3) X^{3n} + P_2 X^{2n} + P_3 X^{n}\\ \end{array} \right. \end{aligned}$$
(11)

The complexities of the algorithm are computed in [10, 11] as follows:

$$\begin{aligned} \left\{ \begin{array}{l} M_{2}(3n) \le 2M_{4}(n) +3M_2(n) + 29n -12,\\ M_4(3n)\le 5M_4(n) + 58n -21,\\ D_2(n) \le D_4(n/3)+8D_X,\\ D_4(n) \le D_4(n/3)+10D_X. \end{array} \right. \end{aligned}$$
(12)

Remark 5

We can improve this algorithm by observing the common additions in \((P_1+(1+\alpha )( P_2+P_3)) (X^{n}+X^{2n}+X^{3n})\). Assume that the inputs are from \(\mathbb {F}_4[X]\). For simplicity let \(R=(P_1+(1+\alpha )( P_2+P_3))\). Since \(R\) is a degree \((2n-2)\) polynomial, we can write \(R=R_0+R_1X^n\) where \(R_0\) is a degree \((n-1)\) polynomial and \(R_1\) is a degree \((n-2)\) polynomial. We have then

$$\begin{aligned}&R(X^{n}+X^{2n}+X^{3n})\\&\quad =X^nR_0+X^{2n}(R_0+R_1)+X^{3n}(R_0+R_1)+X^{4n}R_1, \end{aligned}$$

requiring \(2(n-1)\) \(\mathbb {F}_4\) additions for \(R_0+R_1\) which improves the original computation cost \(2(2n-2)\). It should be noted that this technique does not change the delay complexity. The complexity for degree \((2n+k)\) polynomials can be easily be obtained for \(1 \le k \le n\) since, in this case, \((A_1+A_2),\,(B_1+B_2),\,((A_0+A_1)+A_2)\), and \(((B_0+B_1)+B_2)\) each require \(8k\) additions. As well, \((P_0 + X^{n}P_4)\) needs \((n-1)\) additions if \(k > n/2\) and \((2k-1)\) additions if \(k<n/2.\) The following are thus the new complexities for polynomial multiplication over \(\mathbb {F}_4\):

$$\begin{aligned} \left\{ \begin{array}{l} M_4(3n)\le 5M_4(n) + 56n -19, \quad M_4(1)=7,\\ M_4(2n+k)\le 4M_4(n) + M_4(k)+48n+8k -19,\\ \qquad n/2 \le k \le n,\\ M_4(2n+k)\le 4M_4(n) + M_4(k)+46n+12k -19,\\ \qquad 1 \le k < n/2,\\ D_4(n) \le D_4(n/3)+10D_X, D_4(1)=2D_X+D_A\\ M_4(n) \le 30.25n^{1.46}-28n+4.75,\\ D_4(n) \le (10\log _3(n)+2)D_X+D_A. \end{array} \right. \end{aligned}$$
(13)

Similarly, the complexities over \(\mathbb {F}_2\) are obtained as follows:

$$\begin{aligned} \left\{ \begin{array}{l} M_2(n) \le 2M_{4}(n/3) +3M_2(n/3) + 29n -12, \quad M_2(1)=1,\\ D_2(n) \le D_4(n/3)+8D_X,\quad D_2(1)=D_A,\\ M_2(n) \le 30.25n^{1.46}-9.27n\log _3(n)-27.5n+0.75,\\ D_2(n) \le 10\log _3(n)D_X+D_A. \end{array} \right. \end{aligned}$$
(14)

3 New improved algorithms over \(\mathbb {F}_2\)

This section presents a method that yields better complexities than the Bernstein 3-way algorithm. Moreover, a new 5-way split algorithm for binary polynomial multiplication resulting from improvements to the one described in [12] is introduced, and a new 3-way split algorithm with improved complexity is also proposed. The complexity comparisons of the methods introduced in this section are included in Table 1.

Table 1 Cost of polynomial multiplication over \(\mathbb {F}_{2}\)

3.1 A new split method for Bernstein’s 3-way split algorithm

Let \( A(X)=\sum _{i=0}^{3n-1}a_iX^i\) and \( B(X)=\sum _{i=0}^{3n-1}b_iX^i\) be two polynomials of degree \(3n-1\). In this method, we compute \((XA(X))(XB(X))\) instead of \(A(X)B(X)\) using Bernstein’s 3-way split algorithm. Note that \(XA(X)=\sum _{i=0}^{3n-1}a_iX^{i+1}\) and \(XB(X)=\sum _{i=0}^{3n-1}b_iX^{i+1}\) are degree \(3n\) polynomials with first terms zero. We now apply Bernstein’s 3-way split algorithm by assuming that \(XA(X)\) and \(XB(X)\) are degree \(3n+2\) polynomials. Here, we take the coefficients of \(X^{3n+1}\) and \(X^{3n+2}\) of both \(XA(X)\) and \(XB(X)\) as zero, and thus we have:

$$\begin{aligned} \textit{XA}(X)= & {} A_0+A_1X^{n+1}+A_2X^{2n+2}, \\ \textit{XB}(X)= & {} B_0+B_1X^{n+1}+B_2X^{2n+2}, \end{aligned}$$

where each of \(A_i\) and \(B_i\) for \(0\le i \le 2\) are degree \(n\) polynomials. However, it should be noted that the first term of \(A_0\) and \(B_0\) is zero and that the last two terms of \(A_2\) and \(B_2\) are zero. Therefore, we can say that this method splits \(3n\)-term polynomials as \((n,n+1,n-1)\) rather than \((n,n,n)\) where the \(i\)-th value in the triples for \(i=1,2,3\) shows the number of terms of \(A_i\) and \(B_i\). The computational cost of Bernstein’s 3-way algorithm for this splitting approach is as follows:

  • \(4n-2\): Computing \(A_0+A_1+A_2\) and \(B_0+B_1+B_2\). These are degree \(n\) polynomials.

  • \(2n-2\): Computing \(A_1X+A_2X^2\) and \(B_1X+B_2X^2\). These are degree \((n+1)\) polynomials with the constant term being zero.

  • \(2n\): Computing \(A_0+(A_1X+A_2X^2)\) and \(B_0+(B_1X+B_2X^2)\). These are degree \((n+1)\) polynomials with the constant term being zero.

  • \(2n\): Computing \(A_0+A_1+A_2+(A_1X+A_2X^2)\) and \(B_0+B_1+B_2+(B_1X+B_2X^2)\). These are degree \((n+1)\) polynomials.

  • \(M_2(n)\): Computing \(P_0=A_0B_0\) where\(P_0\) is a degree \(2n\) polynomial with the constant term and the coefficient of \(X\) as zero.

  • \(M_2(n+1)\): Computing \(P_1=(A_0+A_1+A_2)(B_0+B_1+B_2)\) where \(P_1\) is a degree \(2n\) polynomial.

  • \(M_2(n+1)\): Computing \(P_2=(A_0+A_1X+A_2X^2)(B_0+B_1X+B_2X^2)\) where \(P_2\) is a degree \(2n+2\) polynomial with the constant term and the coefficient of \(X\) being zero.

  • \(M_2(n+2)-1\): Computing \(P_3=(A_0+A_1+A_2+A_1X+A_2X^2)(B_0+B_1+B_2+B_1X+B_2X^2)\) where \(P_3\) is a degree \(2n+2\) polynomial and the last term is the same as that of \(P_2\).

  • \(M_2(n-1)\): Computing \(P_4=A_2B_2\) where \(P_4\) is a degree \(2n-4\) polynomial.

  • \(2n\): Computing \(S=P_2+P_3\) where \(S\) is a degree \((2n+1)\) polynomial because the last terms of \(P_2\) and \(P_3\) are equal.

  • \(3n-1\): Computing \(U=P_0+(P_0+P_1)X^{n+1}\) where \(U\) is a degree \(3n+1\) polynomial and the first two terms are zero.

  • \(3n+3\): Computing \(V=P_2+S(X^{n+1}+X)\) where \(V\) is a degree \(3n+2\) term with the first term being zero.

  • \(7n-6\): Computing \(W=U+V+P_4(X^4+X)\) where \(W\) is a degree \(3n+2\) polynomial with the first term as zero.

  • \(3n\): Computing \(W'=W/(X(X+1))\) where \(W'\) is a degree \(3n\) polynomial.

  • \(2n\): Computing \(W''=W'(X^{2n+2}+X^{n+1})\) where \(W''\) is a degree \(5n+2\) polynomial with first \(n\) terms being zero.

  • \(5n-3\): Computing \(C=U+P_4(X^{4n+4}+X^{n+1})+W''\). This is the product polynomial \(X^2A(X)B(X)\).

It should also be noted that the original algorithm is better for \((3n-1)\) terms polynomials. However, for \((2n+k)\) term polynomials with \(1 \le k \le n-2\), the proposed splitting approach yields better results than the original recursion. For example, the method introduced above splits \((3n-2)\) term polynomials as \((n-1,n,n-1)\) instead of \((n,n,n-2)\). The recursions for the above computations for a \(3n\)-term and a similar computations for \((3n-2)\) term polynomials can be summed up as follows:

$$\begin{aligned} \left\{ \begin{array}{l} M_2(3n) \le M_2(n)+2M_2(n+1)+M(n+2)\\ \qquad +M(n-1)+35n-12,\\ M_2(3n-2) \le 2M_2(n)+M_2(n+1)+2M(n-1)\\ \qquad +35n-13.\\ \end{array} \right. \end{aligned}$$
(15)

3.2 Improved 5-way split algorithm

This section presents a new improvement to the 5-way split algorithm described in [12]. Let \(A=\sum _{i=0}^{5n-1} a_i X^i\) and \(B=\sum _{i=0}^{5n-1} b_i X^i\) two degree \((5n-1)\) polynomials over \(\mathbb {F}_2\) and \(C=\sum _{i=0}^{10n-2} c_i X^i\) be their product. This method splits \(A\) and \(B\) in five parts as \(A=A_0+A_1X^{n}+A_2X^{2n}+A_3X^{3n}+A_4X^{4n},\,B=B_0+B_1X^n+B_2 X^{2n}+B_3X^{3n}+B_4X^{4n}\), where \(A_j=\sum _{i=0}^{n-1} a_{i+nj}X^i\) and \(B_j=\sum _{i=0}^{n-1} b_{i+nj}X^i\) for \(j=0,1,2,3,4.\) Then we can write \(C=\sum _{i=0}^{8} C_i X^{in}\). Cenk and Özbudak proposed the following algorithm in [12]:

$$\begin{aligned} \left\{ \begin{array}{l} m_1 = A_0B_0,\, m_2 = A_1B_1,\, m_3 = A_2B_2,\, m_4 = A_3B_3,\, \\ m_5 = A_4B_4, m_6 = (A_0+A_1)(B_0+B_1),\,\\ m_7 = (A_0+A_2)(B_0+B_2),\, m_8 = (A_2+A_4)(B_2+B_4),\\ m_9 = (A_3+A_4)(B_3+B_4),\,\\ m_{10} = (A_0+A_2+A_3)(B_0+B_2+B_3),\\ m_{11} = (A_1+A_2+A_4)(B_1+B_2+B_4),\\ m_{12} = (A_0+A_3+A_1+A_4)(B_0+B_3+B_1+B_4),\\ m_{13} = (A_0+A_1+A_2+A_3+A_4)(B_0+B_1+B_2+B_3+B_4),\\ C_0 = m_1,\, C_1 = m_6+m_1+m_2,\, C_2 = m_7+m_1+m_3+m_2,\\ C_3 = m_1+m_{13}+m_{12}+m_{10}+m_8+m_3+m_5+m_4,\\ C_4 = m_6+m_1+m_2+m_{13}+m_{10}+m_{11}+m_9+m_5+m_4,\\ C_5 = m_7+m_1+m_3+m_2+m_{13}+m_{11}+m_{12}+m_5,\\ C_6 = m_8+m_3+m_5+m_4,\, C_7 = m_9+m_4+m_5,\, C_8 = m_5. \end{array} \right. \end{aligned}$$
(16)

The improvement to this algorithm is based on the use of the method described in [27]. To this end, we divide each \(m_i\) for \(1 \le i \le 13\) into two parts as \(m_i=p_{2i-1}+p_{2i}X^n\), where \(p_{2i-1}\) is a degree \((n-1)\) polynomial, \(p_{2i}\) is a degree \((n-2)\) polynomial, and \(n \ge 2\). We substitute the new decompositions of the \(m_i\)’s into \(C_i\)’s and let the new representation of \(C\) be \(C=\sum _{i=1}^{10} U_i X^{(i-1)n}\). The explicit new algorithm is as follows:

$$\begin{aligned} \left\{ \begin{array}{l} t_1=p_1+p_2, \, t_2=t_1+p_3,\, t_3=t_2+p_{11},\\ t_4=p_4+p_5,\, t_5=p_{12}+p_{13}, t_6=t_4+t_5,\, t_7=t_2\!+\!t_6,\,\\ t_8=t_1+t_4,\, t_{9}=p_6+p_7,\, t_{10}=t_8+t_9,\\ t_{11}=t_{10}+p_9,\, t_{12}=p_{14}+p_{15},\, t_{13}=t_{11}+t_{12},\,\\ t_{14}=p_{19}+p_{23},\, t_{15}=t_{14}+p_{25}, t_{16}=t_{13}+t_{15},\,\\ t_{17}=p_8+p_9,\, t_{18}=t_{17}+p_{10},\, t_{19}=t_{18}+p_{18},\,\\ t_{20}=t_{18}+t_{9}, t_{21}=p_{16}+p_{17},\,t_{22}=t_{20}+t_{21},\,\\ t_{23}=t_{22}+t_{3},\, t_{24}=p_{20}+p_{21}, t_{25}=p_{24}+p_{25},\,\\ t_{26}=p_{19}+p_{24},\, t_{27}=t_{24}+t_{25},\, t_{28}=t_{27}+t_{26},\,\\ t_{29}=t_{28}+t_{23}, t_{30}=t_{7}+t_{19},\,t_{31}=t_{27}+t_{30},\,\\ t_{32}=p_{22}+p_{23},\, t_{33}=t_{31}+t_{32},\,t_{34}=t_{11}+p_1,\\ t_{35}=t_{34}+p_{10},\, t_{36}=t_{35}+t_{12},\,t_{37}=t_{36}+p_{22},\,\\ t_{38}=t_{37}+p_{24},\, t_{39}=t_{38}+p_{26},\\ U_1=p_1,\, U_2=t_3,\, U_3=t_7,\, U_4=t_{16},\, U_5=t_{29},\,\\ U_6=t_{33},\, U_7=t_{39}, U_8=t_{22},\,U_9=t_{19},\,U_{10}=p_{10},\, \end{array} \right. \end{aligned}$$
(17)

The cost of (17) is \((39n-17)\) additions. The cost of linear combinations of \(A_i\)’s and the linear combinations of \(B_i\)’s can be computed with a total of \(16n\) additions. The following recursion is thus obtained:

$$\begin{aligned} M_2(5n)\le 13M_2(n)+55n-17. \end{aligned}$$
(18)

When the input sizes are \((4n+k)\) for \(1 \le k \le n\), the sizes of \(A_4\) and \(B_4\) are then \(k\) bits and the cost of \((A_2+A_4),\,(A_3+A_4),\,(B_2+B_4)\), and \((B_3+B_4)\) is \(4k\) rather than \(4n\). On the other hand, the size of \(m_5=A_4B_4=p_{9}+p_{10}X^{n}\) is a \(2k-1\). It should be noted that \(p_9\) is an \(n\)-bit polynomial, \(p_{10}\) is a \((2k-n-1)\)-bit polynomial for \(n/2 \le k \le n,\,p_9\) is a \((2k-1)\)-bit polynomial, and \(p_{10}\) is the \(0\) polynomial for \(1 \le k < n/2\). When the cost of \(t_{11},\,t_{17},\,t_{18}\), and \(t_{35}\) in (17) is re-computed, the following recursion is obtained:

$$\begin{aligned} \begin{array}{l} M_2(4n+k) \le 12M_2(n)+M_2(k)+47n+8k-17.\\ \end{array} \end{aligned}$$
(19)

An additional remark can be made regarding the case of \(k=n-\ell \) for \(1 \le \ell \le 3\). Here, the last \(\ell \) terms of \(m_4\) and \(m_9\) are identical, and similarly the last \(\ell \) terms of \(m_3\) and \(m_8\) are identical. We can, therefore, write

$$\begin{aligned}&M_2(5n\,{-}\,\ell ) \,{\le }\, 12M_2(n) \,{+}\,M_2(n\,{-}\,\ell )\,{+}\,55n\,{-}\,8\ell \,{-}\,17\,{-}\,\ell ^2.\nonumber \\ \end{aligned}$$
(20)

The delay complexity can be computed as

$$\begin{aligned} D_2(5n) \le D_2(n)+13D_X. \end{aligned}$$
(21)

The complexities are summarized as follows:

$$\begin{aligned} \left\{ \begin{array}{l} M_2(5n)\le 13M_2(n) + 55n -17,\\ M_2(4n+k) \le 12M_2(n)+M_2(k)+47n+8k-17, \,\\ \qquad 1 \le k \le n,\\ D_2(5n) \le D_2(n)+13D_X. \end{array} \right. \end{aligned}$$
(22)

Asymptotic complexities of this algorithm are the following:

$$\begin{aligned} \left\{ \begin{array}{l} M_2(n)\le 13M_2(n/5) + 55n/5 -17, \, M_2(1)=1,\\ M_2(n) \le 6.46n^{1.58}-6.87n+1.42,\\ D_2(n) \le D_2(n/5)+13D_X,\, D_2(1)=D_A,\\ D_2(n) \le 13\log _5(n)D_X+D_A.\\ \end{array} \right. \end{aligned}$$
(23)

3.3 New improved 3-way algorithm

This section presents a process for improving the algorithm discussed in Sect. 2.6 by about \(50\%\). The enhancement is obtained by analyzing the products \(P_2\) and \(P_3\) in (11). Let \(A,\,B,\,C,\,A_0,\,A_1,\,A_2,\,B_0,\,B_1\), and \(B_2\) \(\in \mathbb {F}_2[X]\) be defined as in the explanation of the CNH algorithm in Sect. 2. It should be noted that if

$$\begin{aligned} P_2= & {} (A_0 + A_2+ \alpha (A_1 + A_2)) (B_0 + B_2+ \alpha (B_1 + B_2))\\= & {} P_{2,0}+\alpha P_{2,1}, \end{aligned}$$

then one can compute

$$\begin{aligned} P_3= & {} (A_0 + A_1+ \alpha (A_1 + A_2)) (B_0 + B_1+ \alpha (B_1 + B_2))\\= & {} (P_{2,0}+P_{2,1})+\alpha P_{2,1}. \end{aligned}$$

This calculation shows that \(P_3\) can be obtained from \(P_2\). Note that this method works because \(A_i,B_i \in \mathbb {F}_2[X]\) for \(0\le i \le 2\). By using \(P_3 = (P_{2,0}+P_{2,1})+\alpha P_{2,1},\) we propose the following algorithm:

$$\begin{aligned} \left\{ \begin{array}{l} P_0 =A_0 B_0 ,\, P_1 = (A_0 + A_1+A_2)(B_0+B_1+B_2),\,\\ P_4 = A_2 B_2,\\ P_2 = (A_0 + A_2+ \alpha (A_1 + A_2)) (B_0 + B_2+ \alpha (B_1 + B_2))\\ \qquad \! =P_{2,0}+\alpha P_{2,1},\\ C=P_4X^{4n}+(P_0+P_1+P_{2,1})X^{3n}+(P_{2,0}\\ \qquad +\,P_1+P_{2,1})X^{2n}+(P_4+P_1+P_{2,0})X^{n}+P_0\\ \end{array} \right. \end{aligned}$$
(24)

Now we can compute the complexity of this algorithm where \(A_0,B_0,A_1\), and \(B_1\) are degree \((n-1)\) polynomials and \(A_2\) and \(B_2\) are degree \((k-1)\) polynomials. Assume that \(1 \le k \le n\). Each of \((A_1+A_2)\) and \((A_0+A_2)\) then requires \(k\) additions, and \((A_0+(A_1+A_2))\) requires \(n\) additions. Since the polynomials are over \(\mathbb {F}_2,\,(A_0 + A_2+ \alpha (A_1 + A_2))\) does not require any additions. Similarly, the right-hand side, i.e., \(B_i\)’s, require \((n+2k)\) additions. On the other hand, each of \((P_1+P_{2,1}),\,(P_0+(P_1+P_{2,1})),\,(P_{2,0}+(P_1+P_{2,1}))\) and \((P_1+P_{2,0})\) requires \((2n-1)\) additions, and \((P_{4}+(P_1+P_{2,0}))\) requires \((2k-1)\) additions. Finally, the overlaps of the coefficients of \(X^0,\,X^n,\,X^{2n}\), and \(X^{3n}\) require \((3n-3)\) additions, and the cost of the overlapping of the coefficient of \(X^{4n}\) with the other terms is \((n-1)\) if \(n/2 \le k \le n\), and \((2k-1)\) if \(1 \le k < n/2\). On the other hand, the delay complexity can be computed as described in [11] and we obtain the complexities as follows:

$$\begin{aligned} \left\{ \begin{array}{l} M_2(3n)\le 3M_2(n)+M_4(n)+20n-5,\\ M_2(2n+k)\le 2M_2(n)+M_2(k)+M_4(n)+14n+6k-5,\,\\ \quad n/2 \le k \le n,\\ M_2(2n+k)\le 2M_2(n)+M_2(k)+M_4(n)+13n+8k-11,\,\\ \quad 1 \le k < n/2.\\ D_2(3n) \le D_4(n)+7D_X, \end{array} \right. \end{aligned}$$
(25)

In order to compute \(M_2(n)\), we need \(M_4(n)\). By using the results in (13), one can obtain asymptotic complexities of this algorithm as follows:

$$\begin{aligned} \left\{ \begin{array}{l} M_2(n)\le 3M_2(n/3)+M_4(n/3)+20n/3-5, \, M_2(1)=1,\\ M_2(n) \le 15.125n^{1.46}-14.25n-2.4274\log _3(n)+0.125,\\ D_2(n) \le D_4(n/3)+8D_X,\, D_2(1)=D_A,\\ D_2(n) \le 10\log _3(n)D_X+D_A.\\ \end{array} \right. \end{aligned}$$
(26)

3.4 Comparison of complexities

To enable an easy comparison, the complexity results are presented in Table 1. As it can be seen, the 2-way algorithm is the Karatsuba algorithm with Bernstein’s improvement. On the other hand, the proposed 3-way algorithm is far superior to the 3-way split algorithms. Bernstein’s 4-way split and the proposed 5-way split algorithms that yield improvements are also included in the table. It should also be noted that Negre has reported [21, 22] about improvements in the 3-way splits algorithm of [9] with a complexity \(4.68n^{1.63}+O(n)\) and in the 4-way split algorithm of [3] with a complexity \(5.25n^{1.58}+O(n)\).

4 Minimum number of bit operations for \(M_4(n)\)

The algorithm presented in Sect. 3.3 entails the multiplication of polynomials over \(\mathbb {F}_4\). Efficient algorithms for multiplication over \(\mathbb {F}_4\) are, therefore, needed to obtain better complexity results over \(\mathbb {F}_2\). We can use the multiplication algorithms over \(\mathbb {F}_2\) presented in the previous sections for multiplications over \(\mathbb {F}_4\). However, it should be noted that the addition of \(\mathbb {F}_4\) elements requires two-bit additions and that the multiplication of \(\mathbb {F}_4\) elements requires seven-bit operations, i.e., four multiplications and three additions (using the school-book algorithm). The determination of the cost of multiplications over \(\mathbb {F}_4\), therefore, requires the following modifications to the recursions presented in the previous sections: \(M_2(n)\) is converted to \(M_4(n)\), and the number of additions over \(\mathbb {F}_2\) is multiplied by two. If the algorithm includes bit multiplications (as in the case of the school-book algorithm), then the number of bit multiplications is multiplied by seven, which is the cost of multiplication in \(\mathbb {F}_4\). As an illustration, the school-book algorithm for the multiplication of polynomials over \(\mathbb {F}_4\) can be modified as follows: Let \(A\) and \(B\) be degree \(n\) polynomials over \(\mathbb {F}_4\). We can write \(A=A_0+X^na_n\) and \(B=B_0+X^nb_n\), where \(A_0\) and \(B_0\) are degree \((n-1)\) polynomials over \(\mathbb {F}_4\), and \(a_n\) and \(b_n\) are in \(\mathbb {F}_4\). Then

$$\begin{aligned} A\cdot B=A_0B_0+X^n(A_0b_n+a_nB_0)+X^{2n}a_nb_n. \end{aligned}$$

The costs of \(A_0B_0,\,(A_0b_n+a_nB_0)\) and \(a_nb_n\) are \(M_4(n),\,2nM_4(1)+2n\), and \(M_4(1)\), respectively. The final overlap needs \(2(n-1)\) additions. Using \(M_4(1)\le 7\), we obtain the following:

$$\begin{aligned} \left\{ \begin{array}{l} M_4(n+1)\le M_4(n)+18n+5,\\ D_4(n+1) \le D_4(n)+D_X. \end{array} \right. \end{aligned}$$
(27)

Similarly, the improved Karatsuba algorithm presented in Sect.  2 has the following recursion for \(\mathbb {F}_4\) multiplications:

$$\begin{aligned} \left\{ \begin{array}{l} M_4(n+k)\le 2M_4(n)+M_4(k)+6n+8k-6,\,\\ \quad n/2\le k \le n,\\ D_4(2n) \le D_4(n)+3D_X. \end{array} \right. \end{aligned}$$
(28)

On the other hand, the 3-way algorithm discussed in Sect.  2 has the following recursion for multiplications over

$$\begin{aligned} \left\{ \begin{array}{l} M_4(2n+k) \le 5M_4(n)+M_4(k)+24n+12k-12,\,\\ \quad n/2<k\le n,\\ D_4(3n) \le D_4(n)+4D_X. \end{array} \right. \end{aligned}$$
(29)

Bernstein’s 4-way split algorithm presented in Sect. 2 can be used for multiplication over \(\mathbb {F}_4\) using the following recursion:

$$\begin{aligned}&M_4(3n+k)\le M_4(2n)\nonumber \\&\quad +\,5M_4(n)+M_4(k)+38n+16k-16,\quad n/2\le k \le n.\nonumber \\ \end{aligned}$$
(30)

The recursive equation for the new 5-way split algorithm introduced in Sect. 3.2 can be used for multiplications over \(\mathbb {F}_4\) by applying the following recursion:

$$\begin{aligned} \left\{ \begin{array}{l} M_4(4n+k)\le 12M_4(n)+M_4(k)+96n+16k-36,\,\\ \quad 1\le k \le n,\\ D_4(4n) \le D_4(n)+5D_X. \end{array} \right. \end{aligned}$$
(31)

The next step is to describe a general method for multiplying polynomials over \(\mathbb {F}_4\). Let \(\alpha \) be the generator of \(\mathbb {F}_4,\,A=\sum _{i=0}^{n-1}a_iX^i,\,B=\sum _{i=0}^{n-1}B_iX^i\) and \(C=AB=\sum _{i=0}^{2n-2}C_iX^i\) be polynomials over \(\mathbb {F}_4\). We can write, \(A=A_0+\alpha A_1\) and \(B=B_0+\alpha B_1\) where \(A_0,\,A_1,\,B_0\), and \(B_1\) are degree \(n-1\) polynomials over \(\mathbb {F}_2\). We then have

$$\begin{aligned} AB= & {} (A_0+\alpha A_1)(B_0+\alpha B_1)\nonumber \\= & {} A_0B_0+A_1B_1+((A_0+A_1)(B_0+B_1)+A_0B_0)\alpha .\nonumber \\ \end{aligned}$$
(32)

The complexity of this formula can be computed as

$$\begin{aligned} \left\{ \begin{array}{l} M_4(n)\le 3M_2(n)+6n-2.\\ D_4(n) \le D_2(n)+2D_X. \end{array} \right. \end{aligned}$$
(33)

As a final step, we can then use the CNH 3-way algorithm discussed in Sect. 2. The recursion of this algorithm is the following:

$$\begin{aligned} \left\{ \begin{array}{l} M_4(3n)\le 5M_4(n) + 56n -19,\\ M_4(2n+k)\le 4M_4(n) + M_4(k)+48n+8k -19,\,\\ \quad n/2 \le k \le n,\\ D_4(n) \le D_4(n/3)+10D_X. \end{array} \right. \end{aligned}$$
(34)

5 Improved upper bounds over \(\mathbb {F}_2\)

This section presents the new upper bounds on the minimum number of operations for binary polynomial multiplications with the use of the algorithms discussed in the previous sections.

The first improvement is for \(n=9\). The improved 3-way algorithm presented in Sect. 2 yields \(M_2(9)\le 126\) whereas this bound is reported as 132 in [2]. On the other hand, the new 5-way algorithm results in \(M_2(15)\le 317\), which is better than the 326 arrived at [6]. Explicit algorithms for \(n=9\) and \(n=15\) are presented in the appendix. Similarly, we obtain \(M_2(18) \le 438\), which is better than that reported in [6]. For \(n=11,12\), we were unable to obtain improvements on the upper bounds compared to the results described in [6]. However, for almost all values of \(n\) greater than 20, we have obtained improved bounds and tabulated new bounds for some specific values of \(n\), which are used in cryptographic applications. Details are included in the appendix.

We also note that although improvements in the number of bit operations can be obtained primarily through modifications to Bernstein’s 3-way algorithm, the corresponding level of delay complexities is significantly higher because Bernstein’s 3-way algorithm entails a linear delay complexity in input size. For this reason, we have also searched the minimum number of bit operations with a logarithmic delay. In this respect, the new 3-way algorithm introduced in Sect. 3.3 produces the best results. It should be noted that although the numbers of operations increase slightly, delay complexities decrease significantly since the new 3-way split algorithm is associated with a logarithmic delay. The results are summarized in Table 2 that includes four different complexities. Column A shows the known best bounds reported in [2] and [6] before the current work. The improved minimum numbers of bit operations over \(\mathbb {F}_2\) and \(\mathbb {F}_4\) are listed in columns B and C, respectively, and the best possible minimum number of bit operations with logarithmic delay complexities are indicated in column D. In addition to \(M_2(n)\) and \(M_4(n)\), the table also provides the name of the algorithm along with the new size of the polynomial after splitting.

Table 2 New upper bounds on \(M_2(n),\,D_2(n),\,M_4(n)\) and \(D_4(n)\) where A, B, and C present minimum number of bit operations; and D presents minimum number of bit operations with logarithmic delay

The numbers in the column entitled Alg. of Table 2 represent the following algorithms: 1 is the school-book, 2 is the Karatsuba with Bernstein’s improvement, 2.1 is the Karatsuba with Bernstein’s improvement with input size \(2n-1\), 2.2 is the Karatsuba with Bernstein’s improvement with input size \(2n-2\), 2.3 is the Karatsuba with Bernstein’s improvement with input size \(2n-3\), 3 is Karatsuba-like 3-way split, 5 is Bernstein’s 3-way split, 5.1 is modified Bernstein’s 3-way split algorithm with input size \(3n\), 5.2 is modified Bernstein’s 3-way split algorithm with input size \(3n-2\), 6 is Bernstein’s 4-way split with input size \(4n\), 6.1 is Bernstein’s 4-way split with input size \(4n-1\), 6.2 is for Bernstein’s 4-way split with input size \(4n-2\), 7 is for the improved 5-way split for input size \(5n\), 7.1 is improved 5-way split for input size \(5n-1\), 8 is for the method referring in [6], 9 is the general method described in Sect. 4, 10 is the Karatsuba algorithm with Bernstein’s improvements for \(\mathbb {F}_4\), 14 is the improved CNH 3-way split algorithm over \(\mathbb {F}_4\) in Sect. 2, 15 is Bernstein’s 4-way for polynomials over \(\mathbb {F}_4\), and finally 16 is the improved 5-way split for polynomials over \(\mathbb {F}_4\).

For example, for \(n=15\) in column B, it can be seen that the new 5-way algorithm is used, and the new size of the polynomials becomes five. To verify the complexity, one should then use the \(M_2(5)\). It must also be noted that special care should be given in those cases in which the size of the polynomials after splitting may be different, as in the case of \(M_2(17)\), which contains a multiplication of size nine and a multiplication of size eight. An additional remark is related to the modified Bernstein’s algorithm. If the size is a multiple of three, say \(3n\), then the sizes of the polynomials after splitting are \(n,\,n+1\), and \(n-1\); if the size is \(3n-2\), then the new sizes are \(n\) and \(n-1\). For example, for \(3n-2=67\), the size of the new polynomial is 23 given in Table 2 and the other sizes are then both 22.

6 Conclusion

This paper has presented improvements in the bounds reported in [3, 6] for binary polynomial multiplication through two new proposed algorithms along with the optimization and modification of previous algorithms. The use of the new 3-way and 5-way split algorithms together with the modification of Bernstein’s 3-way split algorithm produces improved results. These results for values of \(n\) that are of interest for cryptographic applications are presented in the appendix. The latter also presents the algorithms for \(n=9\) and \(n=15\). Finally, it should be noted that the results in this paper can be further improved by eliminating common operations that appeared in the algorithms.