1 Introduction

Cryptography based on Galois field (GF) (or finite field) arithmetic has been widely utilized for secure communications, authentication, and digital signatures in many systems. For modern ciphers, the substitution function is one of the most integral parts to be resistant against major cryptanalytic techniques such as differential and linear cryptanalyses [1, 2]. Inversion functions over \(\textit{GF}(2^m)\) are known as a useful component for m-bit substitution functions [3]. Many ISO/IEC standard ciphers (e.g., AES and Camellia) employ an inversion function over \(\textit{GF}(2^8)\) in substitution functions [4, 5]. For example, SubBytes of AES consists of an inversion over \(\textit{GF}(2^8)\) (i.e., S-Box) and an affine transformation over \(\textit{GF}(2)\). The hardware performance of such ciphers heavily depends on the inversion circuits used. As a result of the explosive increase in resource-constrained devices in the context of Internet of things (IoT) applications, there is currently substantial demand for lightweight implementation of inversion functions [3, 6].

So far, many approaches to reducing the hardware cost of \(\textit{GF}(2^8)\) inversion circuits have been proposed. While it has been shown that direct mapping-based approaches (e.g., table lookup, PPRM, and BDD [7,8,9]) are useful for low-latency implementation, the tower field approach, which calculates \(a^{-1}\)\((= a^{254})\)\((a \in GF(2^8))\) using the equivalent tower field, is a promising approach for achieving the compact and efficient implementation. This technique converts the original field \(\textit{GF}(2^8)\) into an isomorphic tower field such as \(\textit{GF}(((2^2)^2)^2)\) and \(\textit{GF}((2^4)^2)\) in the middle of the inversion. Researchers have previously shown that the tower field approach is efficient because the subfields \(\textit{GF}((2^2)^2)\) and \(\textit{GF}(2^4)\) operations are designed more compactly than the original field operations. Satoh et al. [10] were the first to present a compact implementation of the AES S-Box by the tower field \(\textit{GF}(((2^2)^2)^2)\) represented by polynomial bases (PB). Canright [11] further reduced the gate count of the AES S-Box by using normal bases (NB) and optimizing the change-of-basis. Canright’s implementation was the smallest for a long time. Nogami et al. [12] recently mixed polynomial and normal bases (MB) to achieve the most efficient implementation. They showed that the product of gate count and critical delay for the inversion circuit could be reduced by the MB. Some implementations using \(\textit{GF}((2^4)^2)\) have also been proposed by researchers such as Jeon et al. [13], who presented PB-based \(\textit{GF}((2^4)^2)\) inversion circuit design. These results suggest that such field representations have a significant impact on hardware performance.

The above bases (i.e., PB, NB, and MB) represent each element of \(\textit{GF}(2^m)\) using m bits in a non-redundant manner. However, there are two redundant representations, namely polynomial ring representation (PRR) and redundantly represented basis (RRB), which use n\((> m)\) bits to represent each element of \(\textit{GF}(2^m)\). The defining polynomial of these redundant representations is given by a reducible polynomial of degree n, whereas that of non-redundant representations is given by an irreducible polynomial of degree m. This means that redundant representations provide a wider variety of polynomials that can be selected as a defining polynomial than non-redundant representations. Drolet [14] showed that the use of PRR makes it possible to select a binomial \(x^n+1\) as a defining polynomial, which can lead to the design of small-complexity arithmetic circuits. Wu et al. [15] and Nekado et al. [16] showed that RRB-based designs were useful for designing efficient inversion circuits.

This paper presents a technique in which non-redundant and redundant GF arithmetic are combined to achieve a compact and efficient \(\textit{GF}(2^8)\) inversion circuit design. The key idea underlying the proposed circuit is calculation of the inversion of the tower field \(\textit{GF}((2^4)^2)\) by the NB, PRR, and RRB combination. The former part for the 16th and 17th powers of the input is calculated by an NB with a symmetric property. This is followed by calculation of the latter parts for \(\textit{GF}(2^4)\) inversion and \(\textit{GF}(2^4)\) multiplication by PRR and RRB, respectively. The mapping from NB to PRR/RRB is efficiently implemented by the symmetric property of the NB. The efficacy of the proposed circuit is evaluated by means of gate counts and logic synthesis results using a TSMC 65-nm CMOS standard cell library. The proposed circuit has approximately 25% higher efficiency (i.e., area–time product) excluding the change-of-basis than any other conventional circuits, including those with the tower field \( GF(((2^2)^2)^2)\). In addition, the flexibility of redundant representations in the proposed circuit enables it to have the best efficiency even including the change-of-basis from/to \(\textit{GF}(2^8)\). To the best of our knowledge, the proposed circuit is the most efficient tower field arithmetic-based implementation for the AES S-Box.

While a preliminary version [17] presented and evaluated the above circuit based on the combination of non-redundant and redundant GF arithmetic, this paper further enhances our circuit by two new techniques. First, we present a variation of the new conceptual circuit with an additional unification of inner operations, which achieves the shortest critical path among tower field inversion circuits including that in [17]. The unification also results in a more area–time efficient inversion/S-Box circuit than ever before. We then present a more efficient unified S-Box design which supports both encryption and decryption by a new optimization technique for change-of-basis. The new designs are evaluated by the same manner as the preliminary version [17]. As a result, we confirm that our new designs have the highest efficiency in terms of area–time product even where both encryption and decryption are supported.

The remainder of this paper is organized as follows. Section 2 introduces preliminary and related work associated with the design of \(\textit{GF}(2^8)\) inversion circuits. The redundant GF representations introduced in the proposed circuit are also described. Section 3 presents the proposed \(\textit{GF}(2^8)\) inversion circuit. In addition, this section evaluates the proposed circuit by the results of gate count and logic synthesis. Section 4 presents the AES S-Box design that incorporates the proposed inversion circuit and its change-of-basis, and shows their evaluation results in the same manner as Sect. 3. Finally, Sect. 5 presents concluding remarks.

2 Preliminaries and related works

2.1 Inversion circuits by tower fields

This section briefly describes previous work on the design of \(\textit{GF}(2^8)\) inversion circuits based on tower field arithmetic. The inverse element of \(a \in GF(2^8)\) is given by

$$\begin{aligned} a^{254} = \left\{ \begin{array}{lll} a^{-1} &{} \mathrm{for}\; a \ne 0 \\ 0 &{} \mathrm{if} \; a = 0 \end{array},\right. \end{aligned}$$

because any element of \(\textit{GF}(2^8)\) satisfies \(a = a^{256}\). The basic idea underlying the tower field approach is reduction in hardware cost by exploiting smaller arithmetic operations over subfield \(\textit{GF}((2^2)^2)\) or \(\textit{GF}(2^4)\) instead of \(\textit{GF}(2^8)\). There is a one-to-one mapping (i.e., change-of-basis) between the elements of \(\textit{GF}(2^8)\) and those of the tower field. This GF inversion over a tower field is efficiently implemented in the Itoh–Tsujii Algorithm (ITA) [18].

Fig. 1
figure 1

Inversion circuit over \(\textit{GF}(((2^2)^2)^2)\) in [11]

Figure 1 illustrates a \(\textit{GF}(2^8)\) inversion circuit presented in [11], where the data path is divided into upper and lower 4 bits and each component denotes an arithmetic circuit over subfield \(\textit{GF}((2^2)^2)\). Let \(a \in GF(((2^2)^2)^2)\) be the input given by \(h \alpha ^{16} + l \alpha \) in an NB \(\{\alpha ^{16}, \alpha \}\), where h and l\((\in GF((2^2)^2))\) are, respectively, the upper and lower 4 bits of a, and \(\alpha \) is a root of an irreducible polynomial of degree 2 over \(\textit{GF}((2^2)^2)\) (i.e., a defining polynomial for extending \(\textit{GF}((2^2)^2)\) to \(\textit{GF}(((2^2)^2)^2)\)). The inversion of a is calculated in the following three stages: (1) calculation of the 16th and 17th powers, (2) subfield inversion, and (3) final multiplication. Note that the above \(\textit{GF}((2^2)^2)\) operators are replaced with the \(\textit{GF}(2^4)\) operators in the case of the tower field \(\textit{GF}((2^4)^2)\).

The performance of this inversion circuit depends on the tower field and its basis representation. Three of the best known circuit structures are based on the tower field of \(\textit{GF}(((2^2)^2)^2)\). Satoh et al. first designed this kind of \(\textit{GF}(((2^2)^2)^2)\) inversion circuit using PB [10]. Canright then designed a more compact circuit based on NB [11]. The hardware cost of inversion and exponentiation operations can be reduced by NB because the squaring operation is performed solely by wiring. Nogami et al. presented the possibility of MB, which employs both polynomial and normal bases for the input and output data, respectively [12]. Their method exhibited improved performance in the product of gate count and critical delay for the \(\textit{GF}(((2^2)^2)^2)\) inversion circuit and the AES S-Box, including change-of-basis. In addition to \(\textit{GF}(((2^2)^2)^2)\), it is possible to design efficient inversion circuits using another tower field of \(\textit{GF}((2^4)^2)\). Jeon et al. [13] designed \(\textit{GF}((2^4)^2)\) inversion circuit based on PB with smaller critical delay than those of \(\textit{GF}(((2^2)^2)^2)\) inversion circuits.

2.2 Redundant representations for Galois fields

Polynomial ring representation (PRR) is a redundant representation of GF [14]. An extension field \(\textit{GF}(2^m)\) based on a PB has a set of elements (i.e., polynomials) whose degrees are at most \(m-1\) (i.e., m bits). Elements of an NB-based \(\textit{GF}(2^m)\) are also represented by m bits. On the other hand, an extension field \(\textit{GF}(2^m)\) based on PRR has a set of polynomials whose degrees are up to \(n-1\) (i.e., n bits), where \(n > m\) [14]. In other words, whereas a PB- or an NB-based \(\textit{GF}(2^m)\) is defined as an m-dimensional linear space over \(\textit{GF}(2)\), a PRR-based \(\textit{GF}(2^m)\) is defined as an m-dimensional subspace of an n-dimensional linear space. PRR is also equivalent to cyclic redundancy code (CRC), a kind of error-correction code.

Let x and H(x) be an indeterminate element and an irreducible polynomial of degree m over \(\textit{GF}(2)\), respectively. Let G(x) be a polynomial of degree \(n - m\), which is relatively prime to H(x), and is satisfied with \(G(0) \not = 0\). Let P(x) be a polynomial (of degree n) given by the product of G(x) and H(x). A set of polynomials of degrees less than or equal to \(n - 1\), where each polynomial is divisible by G(x), together with modulo P(x) arithmetic is isomorphic to \(\textit{GF}(2^m)\). Note here that \(n = m + \deg G(x)\). The representation of \(\textit{GF}(2^m)\) using such a residue ring is called PRR. A PRR can be constructed from any PB-based \(\textit{GF}(2^m)\).

As an example of PRR, we present the construction of a PRR-based \(\textit{GF}(2^4)\), where the defining polynomial P(x) is given by \(x^5 + 1\) and its factors G(x) and H(x) are given by

$$\begin{aligned} G(x)= & {} x+1, \\ H(x)= & {} x^4+x^3+x^2+x+1. \end{aligned}$$

We first compute the multiplicative unit element E(x), which is given by

$$\begin{aligned} E(x) = U(x)G(x) = 1 - V(x)H(x), \end{aligned}$$

where U(x) and V(x) are polynomials satisfying \(U(x)G(x) + V(x)H(x) = 1\). Hence, \(E(x) = (x^3+x)(x+1) = x^4+x^3+x^2+x\). Let \(C_i(x) = (x^i \bmod H(x)) \times E(x) \bmod P(x)\) (\(0 \le i \le 2^4-2\)). Here, \(C_i(x)\) corresponds to \(\beta ^i\), where \(\beta \) is a root of H(x). In addition, zero is mapped to zero. Thus, we can derive a correspondence between a PB-based \(\textit{GF}(2^4)\) with the irreducible polynomial H(x) and the PRR-based \(\textit{GF}(2^4)\). Table 1 shows the correspondence between the PB- and PRR-based \(\textit{GF}(2^4)\), which shows an example of one-to-one mapping between PB- and PRR-based elements. Here, the PB-based \(\textit{GF}(2^4)\) represents elements by polynomials of degrees at most 3 (i.e., 4 bits), whereas the PRR-based \(\textit{GF}(2^4)\) represents elements by polynomials of degrees up to 4 (i.e., 5 bits). In addition, the PRR-based \(\textit{GF}(2^4)\) consists of only polynomials dividable by \(G(x) = x+1\), which means that the PRR-based \(\textit{GF}(2^4)\) is equivalent to a cyclic code with the generator polynomial G(x). See [20] for details about the construction method.

Table 1 Example of correspondence between PB- and PRR-based \(\textit{GF}(2^4)\)

It is known that the performance of the GF circuit generally improves as the number of terms in the modular polynomial decreases [19]. Here, a binomial \(x^n + 1\) is available for the modular polynomial of PRR-based \(\textit{GF}(2^m)\), whereas it is unavailable for GFs based on non-redundant representations. This is because the modular polynomial P(x) is given by a reducible polynomial (i.e., \(G(x) \times H(x)\)). Thus, the performance of PRR-based GF arithmetic circuits can be better than those of PB- and NB-based arithmetic circuits. For example, we can use \(x^{m+1} + 1\) for P(x) if the mth degree all one polynomial (AOP) is irreducible according to the following formula over \(\textit{GF}(2)\):

$$\begin{aligned} x^{m+1} + 1 = (x+1)(x^m+x^{m-1}+ \dots +1), \end{aligned}$$

where the polynomial \(x^m+x^{m-1}+ \dots +1\) is called the AOP of degree m.

The major advantages of using the binomial are as follows: (i) Parallel multiplication can be given as the discrete time Wiener–Hopf equation and (ii) squaring and a part of constant multiplication are performed only by bit-wise permutation (i.e., wiring). In (i), reduction by P(x) in multiplication can be performed only by bit-wise permutation while multiplication based on non-redundant representation requires some addition over \(\textit{GF}(2)\) for the reduction. In (ii), squaring and a part of constant multiplication can be performed without any logic gate since they are the special case of multiplication. Accordingly, the PRR-based design can be more efficient than conventional designs.

Redundantly represented basis (RRB) is another redundant representation of GF [16]. Each element is represented by a primitive nth root where n is the minimal integer such that \(\textit{GF}(2^m)\) can be embedded into \(\textit{GF}(2^n)\).

RRB is available when the AOP of degree m is irreducible. Let \(\beta \) be a root of the AOP. The m elements (i.e., bases) \(\beta ^{m-1}, \beta ^{m-2}, \dots ,\) and \(\beta ^0\) are linearly independent and then compose a PB. In contrast, RRB employs a binomial \(\beta ^{m+1}-1\) as the defining polynomial, which is satisfied with the following equation:

$$\begin{aligned} \beta ^{m+1} + 1 = (\beta + 1) (\beta ^{m} + \beta ^{m-1} + \dots + 1) = 0. \end{aligned}$$

The set \(\{\beta ^m, \beta ^{m-1}, \dots , \beta ^0\}\) is called RRB. Because the degree of the binomial is \(m+1\), each element is represented by a linear combination of \(\beta ^m, \beta ^{m-1}, \dots ,\) and \(\beta ^0\). Note that the elements of such an RRB-based \(GF(2^m)\) are represented in a non-unique manner because \(\beta ^m, \beta ^{m-1}, \dots ,\) and \(\beta ^0\) are linearly dependent in contrast to PRRFootnote 1.

RRB-based \(\textit{GF}(2^m)\) squaring can be performed by bit-wise permutation, as is the case with NB. This is because RRB is equivalent to an extended (Type I) optimal normal basis (ONB). We can derive RRB by adding a base \(\{\beta ^0\}\) to ONB. This means that an efficient multiplication method for ONB, called the cyclic vector multiplication algorithm (CVMA) [21], is also available for RRB. Thus, we can design more compact and efficient multipliers by combining RRB and CVMA. As an example, let us consider a \(\textit{GF}(2^4)\) multiplier based on RRB. Let s and t\((\in GF(2^4))\) be the inputs and u\((\in GF(2^4))\) be the output. Let \(\beta \) be a root of the AOP of degree 4. In RRB, s is given by \(s_4\beta ^4+s_3\beta ^3+ \dots + s_0\), where \(s_0, s_1, \dots ,\) and \(s_4 \in GF(2)\). Also, t and u are given in the same manner. The multiplication is represented by

$$\begin{aligned} u = s \times t = u_4\beta ^4 + u_3 \beta ^3 + u_2\beta ^2 + u_1 \beta + u_0, \end{aligned}$$

where

$$\begin{aligned} u_0= & {} (s_1 + s_4)(t_1+t_4) + (s_2+s_3)(t_2+t_3), \end{aligned}$$
(1)
$$\begin{aligned} u_1= & {} (s_0 + s_1)(t_0+t_1) + (s_2+s_4)(t_2+t_4), \end{aligned}$$
(2)
$$\begin{aligned} u_2= & {} (s_0 + s_2)(t_0+t_2) + (s_3+s_4)(t_3+t_4), \end{aligned}$$
(3)
$$\begin{aligned} u_3= & {} (s_0 + s_3)(t_0+t_3) + (s_1+s_2)(t_1+t_2),\end{aligned}$$
(4)
$$\begin{aligned} u_4= & {} (s_0 + s_4)(t_0+t_4) + (s_1+s_3)(t_1+t_3). \end{aligned}$$
(5)

The critical delay of the RRB-based \(\textit{GF}(2^4)\) multiplier is \(T_A + 2T_X\), while those of multipliers based on non-redundant representations are \(T_A + 3T_X\) [16]. The gate count of the RRB-based multiplier requires only 10 AND and 25 XOR gates [16], whereas that of a PRR-based multiplier requires 25 AND and 20 XOR gates [14].

Nekado et al. [16] designed a more efficient \(\textit{GF}((2^4)^2)\) inversion circuit based on RRB by utilizing the above advantage. Figure 2 shows the block diagram of \(\textit{GF}((2^4)^2)\) inversion circuit in [16], where \(\textit{GF}(2^4)\) is given by RRB and \(\textit{GF}((2^4)^2)\) (i.e., the quadratic extension of \(\textit{GF}(2^4)\)) is given by NB. In the circuit, RRB-based multiplications are divided into two parts denoted by Mul0 and Mul1 in order to reduce circuit area. Mul0 performs bit-wise XOR operations, and Mul1 performs bit-wise AND operations followed by XOR operations. This circuit achieved lower latency than the previous ones thanks to the efficient multiplications based on RRB.

Fig. 2
figure 2

Inversion circuit over RRB-based \(\textit{GF}((2^4)^2)\) in [16]

3 Proposed \(\textit{GF}(2^8)\) inversion circuit

3.1 Circuit description

This section presents our proposed \(\textit{GF}(2^8)\) inversion circuit that takes full advantage of the above redundant GF arithmetic. The important ideas are to employ the tower field \(\textit{GF}((2^4)^2)\) inside the circuit and perform the subfield (i.e., \(\textit{GF}(2^4)\)) operations using redundant GF arithmetic. We introduce PRR for the \(\textit{GF}(2^4)\) inversion because we can exploit a defining polynomial, \(P(x) = x^5 + 1\), thanks to the irreducible AOP of degree 4. We also introduce RRB for the \(\textit{GF}(2^4)\) multiplication. In addition, we employ an NB for the input in order to exploit the Frobenius mapping feature, which performs the 16th power of input solely by wiring.

In accordance with ITA, our inversion circuit consists of three stages, as shown in Fig. 1. Here, we represent the inputs of Stages 1, 2, and 3 by NB, PRR, and RRB, respectively. In particular, we employ an NB that has a symmetric property, which makes it possible to convert the elements from NB to PRR without increasing the circuit delay.

Figure 3 shows a block diagram of our proposed circuit, where components HL,  and F, respectively, calculate \(H_{i, j}, L_{i, j},\) and \(F_{i', j'}\) described in the following. When input a is represented by \(h\alpha ^{16} + l\alpha \), components NB2RRB convert h and l from NB to RRB solely by wiring. Note that H and L are shared with Stages 1 and 3. The stages in the proposed circuit are designed as follows:

Fig. 3
figure 3

Proposed inversion circuit

3.1.1 Calculation of the 16th and 17th powers

Stage 1 performs the 16th and 17th powers of input, where input a is given by NB, and outputs \(a^{16}\) and \(a^{17}\) are given by RRB and PRR, respectively. Let \(\alpha \) be a root of an irreducible polynomial of degree 2 over \(\textit{GF}(2^4)\). The irreducible polynomial is given by \(\alpha ^2+\mu \alpha +\nu \), where \(\mu \) and \(\nu \) are constants of \(\textit{GF}(2^4)\). When input a is represented by \(a = h\alpha ^{16} + l \alpha \) in an NB \(\{\alpha ^{16}, \alpha \}\), \(a^{16}\) and \(a^{17}\) are, respectively, given by

$$\begin{aligned} a^{16}= & {} l \alpha ^{16} + h \alpha , \end{aligned}$$
(6)
$$\begin{aligned} a^{17}= & {} hl \mu ^2 + (h+l)^2 \nu . \end{aligned}$$
(7)

Equation (6) indicates that \(a^{16}\) is performed by twisting wires.

The change-of-basis from NB to RRB does not require any additional gates because the NB (e.g., \(\{\beta ^4, \beta ^3, \beta ^2, \beta ^1\}\)) can be considered as a reduced version of RRB (e.g., \(\{\beta ^4, \beta ^3, \beta ^2, \beta ^1, \beta ^0\}\)) with the same root of the AOP of degree 4. Conversely, the change-of-basis from NB to PRR requires some gates. However, the symmetric property of the NB used in our circuit provides a mapping that does not increase the circuit delay.

Let us now look at the change-of-basis from NB to PRR. Since the change-of-basis is given by an isomorphim represented by \(z' = \Gamma (z)\), where an element z in one GF representation is converted into an element \(z'\) in another GF representation. In the binary vector form, the output \(\varvec{z}'\) is obtained from the product of a conversion matrix \(\gamma \) and the transposed input (i.e., \(\varvec{z}' = \gamma \varvec{z}^T\)) when the conversion matrix \(\gamma \) represents the isomorphism \(\Gamma \). The PRR-based \(\textit{GF}(2^4)\) is given with the defining polynomial \(P(x) = x^5 + 1\) (\(G(x) = x + 1\) and \(H(x) = x^4 + x^3 + x^2 + x + 1\)) and the conversion matrix from NB to PRR is as follows:

$$\begin{aligned} \phi = \left( \begin{matrix} 1 &{} 1 &{} 1 &{} 1 \\ 0 &{} 1 &{} 1 &{} 1 \\ 1 &{} 0 &{} 1 &{} 1 \\ 1 &{} 1 &{} 0 &{} 1 \\ 1 &{} 1 &{} 1 &{} 0 \end{matrix}\right) , \end{aligned}$$

where the least significant bits are in the upper left corner. (See [20] for an explanation of how to obtain the matrix.) Let \(d = d_4x^4 +d_3x^3 + \dots +d_0\) be the output of Stage 1 (i.e., the 17th power of input in PRR), where \(d_0, d_1, \dots ,\) and \(d_4\) are elements of \(\textit{GF}(2)\). The output is provided by applying the change-of-basis \(\varPhi \) from NB to PRR to \(a^{17}\) (i.e., the product of the conversion matrix \(\phi \) and the transposed vector form of \(a^{17}\)). However, the multiplication of \(\phi \) and the output of Eq. (7) requires an additional circuit with \(2T_X\) delay if the multiplication is performed explicitly. To avoid such additional circuit, we derive another output equation from Eq. (7) as follows:

$$\begin{aligned} d= & {} \varPhi (hl \mu ^2 + (h+l)^2 \nu ) \nonumber \\= & {} \varPhi (\mu ^2(hl)) + \varPhi (\nu ((h+l)^2)) \nonumber \\= & {} \varPhi '(hl) + \varPhi ''((h+l)^2), \end{aligned}$$
(8)

where \(\varPhi '\) and \(\varPhi ''\) are the linear functions obtained by merging \(\varPhi \) with the constant multiplications of \(\mu ^2\) and \(\nu \), respectively. Note that constant multiplications over GF can also be given as linear functions represented by conversion matrices. When \(\mu = \beta ^4+\beta \) and \(\nu = \beta \), the resulting matrices \(\phi '\) and \(\phi ''\) representing, respectively, \(\varPhi '\) and \(\varPhi ''\) are given as

$$\begin{aligned} \phi ' = \left( \begin{matrix} 0 &{} 1 &{} 1 &{} 0 \\ 0 &{} 0 &{} 1 &{} 1 \\ 0 &{} 0 &{} 0 &{} 1 \\ 1 &{} 0 &{} 0 &{} 0 \\ 1 &{} 1 &{} 0 &{} 0 \end{matrix}\right) , \ \phi '' = \left( \begin{matrix} 1 &{} 1 &{} 1 &{} 0 \\ 1 &{} 1 &{} 1 &{} 1 \\ 0 &{} 1 &{} 1 &{} 1 \\ 1 &{} 0 &{} 1 &{} 1 \\ 1 &{} 1 &{} 0 &{} 1 \end{matrix}\right) , \end{aligned}$$

where the least significant bits are in the upper left corners. The resulting elements of PRR are shown in Table 1.

To design the circuit defined by Eq. (8), we exploit the ONB with the symmetric property that h and l are given by \(h = h_4\beta ^4+h_3\beta ^3+h_2\beta ^2+h_1\beta \) and \(l = l_4\beta ^4+l_3\beta ^3+l_2\beta ^2+l_1\beta \) with a common NB \(\{\beta ^4, \beta ^3, \beta ^2, \beta ^1\}\), where \(h_1, \dots , h_4\) and \(l_1, \dots , l_4\) are the elements of \(\textit{GF}(2)\). Here, “symmetric” indicates that h and l is represented by the same NB while the circuit in [16] uses different (i.e., asymmetric) bases for h and lFootnote 2. The usage of ONB makes it more efficient to compute \(a^{17}\) and also makes it possible to perform the change-of-basis between the ONB, PRR, and RRB by solely bit-wise permutation because all of the used representations are defined by the AOP of degree 4. As a result, the outputs \(d_0, d_1, \dots ,\) and \(d_4\) are given by

$$\begin{aligned} d_0= & {} (h_1l_2 + h_2l_1 + h_3l_4 + h_4l_3 + h_1l_1+h_4l_4) \nonumber \\&+ (h_1+l_1+h_3+l_3+h_4+l_4), \end{aligned}$$
(9)
$$\begin{aligned} d_1= & {} (h_1l_2 + h_2l_1 + h_1l_3 + h_3l_1 +h_2l_2+h_4l_4) \nonumber \\&+ (h_1+l_1+h_2+l_2+h_3+l_3+h_4+l_4), \end{aligned}$$
(10)
$$\begin{aligned} d_2= & {} (h_1l_3 + h_3l_1+h_1l_4 + h_4l_1 +h_2l_3+h_3l_2 +h_2l_2) \nonumber \\&\quad + (h_1+l_1+h_2+l_2+h_4+l_4), \end{aligned}$$
(11)
$$\begin{aligned} d_3= & {} (h_1l_4 + h_4l_1 + h_2l_3+h_3l_2+h_2l_4 + h_4l_2 + h_3l_3) \nonumber \\&+ (h_2+l_2+h_3+l_3+h_4+l_4), \end{aligned}$$
(12)
$$\begin{aligned} d_4= & {} (h_2l_4 + h_4l_2 +h_3l_4 + h_4l_3 + h_1l_1+h_3l_3) \nonumber \\&+ (h_1+l_1+h_2+l_2+h_3+l_3), \end{aligned}$$
(13)

respectively. Here, the symmetric property enables us to factor Eqs. (9101112)–(13) as follows:

$$\begin{aligned} d_0= & {} H_{1, 2}\vee L_{1, 2} + H_{3, 4}\vee L_{3, 4} + h_2 \vee l_2+h_3l_3, \end{aligned}$$
(14)
$$\begin{aligned} d_1= & {} H_{1, 2} \vee L_{1, 2} + H_{1, 3}L_{1, 3}+ h_3 \vee l_3+h_4 \vee l_4, \end{aligned}$$
(15)
$$\begin{aligned} d_2= & {} H_{1, 3} \vee L_{1, 3} +H_{1, 4}L_{1, 4} + H_{2, 3}\vee L_{2, 3} +h_4 \vee l_4, \nonumber \\\end{aligned}$$
(16)
$$\begin{aligned} d_3= & {} H_{1, 4} \vee L_{1, 4} + H_{2, 3} \vee L_{2, 3} + H_{2, 4}L_{2, 4} + h_1 \vee l_1, \nonumber \\\end{aligned}$$
(17)
$$\begin{aligned} d_4= & {} H_{2, 4} \vee L_{2, 4} + H_{3, 4} \vee L_{3, 4} + h_1 \vee l_1+h_2l_2, \end{aligned}$$
(18)

where \(H_{i,j} = h_i+h_j\), \(L_{i,j} = l_i+l_j\)\((1 \le i < j \le 4)\), and \(\vee \) denotes the OR operator (i.e., \(a \vee b = a+b+ab\)). The component denoted by Stage 1 in Fig. 3 performs the computations corresponding to Eqs. (14151617)–(18). Therefore, the proposed Stage 1 is performed with only \(T_A + 3T_X\) (or \(T_O + 3T_X\)) delay, whereas conventional \(\textit{GF}(((2^2)^2)^2)\) inversion implementations are performed with at least \(6T_X\) delay, where \(T_A, T_O,\) and \(T_X\) denote the delays of the AND, OR, and XOR gates, respectively. In addition, the proposed Stage 1 is implemented with 5 AND, 10 OR, and 27 XOR gates while the conventional design with \(T_A + 3T_X\) delay requires 10 AND and 35 XOR gates [16].

3.1.2 Subfield inversion

Stage 2 performs the inversion over the subfield \(\textit{GF}(2^4)\), where the input and output are given by PRR and RRB, respectively. We first describe the architecture of the PRR-based \(\textit{GF}(2^4)\) inversion and then show the change-of-basis from PRR to RRB below.

The inversion over \(\textit{GF}(2^4)\) is performed by the 14th power of the input. The input (i.e., the output of Stage 1) \(d = d_4x^4 + d_3x^3 + \dots + d_0\) is given as an element of the PRR-based \(\textit{GF}(2^4)\). The input is satisfied with the condition (called the linear recurrence relation) \(d_0 + d_1 + d_2 + d_3 + d_4 = 0\)Footnote 3 because it is equivalent to the codeword of a CRC generated by \(G(x)= x + 1\), which makes it possible to perform the exponentiation by bit-wise operations over the PRR-based \(\textit{GF}(2^4)\) in an efficient manner.

Let \(e = e_4x^4 + e_3x^3 + \dots + e_0\) be the inverse element of d in PRR, where \(e_0, e_1, \dots ,\) and \(e_4\) are elements of \(\textit{GF}(2)\). Using the linear recurrence relation, we can derive \(e_0, e_1, \dots ,\) and \(e_4\) as follows:

$$\begin{aligned} e_0= & {} (d_1 \vee d_4)(d_2 \vee d_3), \end{aligned}$$
(19)
$$\begin{aligned} e_1= & {} ((d_4+1)(d_1+d_2)) \vee (d_0d_4(d_2 \vee d_3)), \end{aligned}$$
(20)
$$\begin{aligned} e_2= & {} ((d_3+1)(d_2+d_4)) \vee (d_0d_3(d_1 \vee d_4)), \end{aligned}$$
(21)
$$\begin{aligned} e_3= & {} ((d_2+1)(d_1+d_3)) \vee (d_0d_2(d_1 \vee d_4)), \end{aligned}$$
(22)
$$\begin{aligned} e_4= & {} ((d_1+1)(d_3+d_4)) \vee (d_0d_1(d_2 \vee d_3)). \end{aligned}$$
(23)

According to Eqs. (19202122)–(23), the proposed Stage 2 requires \(T_A + T_O + T_X\) delay, whereas the conventional structures [10,11,12, 16] require at least \(T_A+3T_X\). Also, the proposed Stage 2 is implemented with 13 AND, 6 OR, and 4 XOR, and 4 NOT gates, whereas the conventional fastest Stage 2 [16] requires 12 AND, 11 XOR, and 2 XNOR gates. Note that when the multiplicative unit element E(x) \((= x^4 + x^3 + x^2 + x)\) is given as the input, the output becomes not E(x) but 1. However, the output is acceptable in Stage 3 (i.e., \(\textit{GF}(2^4)\) multiplication) because both E(x) and 1 are the idempotent elements in the residue ring modulo P(x).

Let us now look at the PRR-to-RRB mapping. To provide it uniquely, we focus on the definition of PRR in [14], in which the mapping \(\Psi \) from PRR defined by x to another representation defined by \(\beta \) is isomorphism. According to [20], the change-of-basis from PRR can be performed by substituting a root of H(x) (i.e., \(\beta \)) to elements. Let \(f = f_4\beta ^4+f_3\beta ^3+ \dots + f_0\) be the output of Stage 2 in RRB, where \(f_0, f_1, \dots ,\) and \(f_4\) are elements of \(\textit{GF}(2)\). The output can be given by

$$\begin{aligned} f = \Psi (e) = e_4\beta ^4+e_3\beta ^3+e_2\beta ^2+e_1\beta +e_0. \end{aligned}$$

This is because the RRB is defined using H(x). This means that the PRR-to-RRB mapping is performed without any additional circuit, assuming that \(f_0 = e_0, \dots ,\) and \(f_4=e_4\). As a result, the PRR-based design provides inversion and change-of-basis with fewer logic gates.

Table 2 Critical delay and gate count of inversion circuits over tower fields

3.1.3 Final multiplication

Stage 3 generates the final output using two \(\textit{GF}(2^4)\) multiplication operations, where both the inputs and output are given by RRB. Since Stage 3 solely consists of \(\textit{GF}(2^4)\) multiplication, Stage 3 can be efficiently implemented with the efficient RRB-based \(\textit{GF}(2^4)\) multipliers described in Sect. 2.2.

Let \(h' = h'_4\beta ^4+h'_3\beta ^3+\dots +h'_0\) be the upper 5 bits of the final output \(a^{-1}\) in RRB, where \(h'_0, h'_1, \dots ,\) and \(h'_4\) are elements of \(\textit{GF}(2)\). Multiplying f and l, we can calculate elements \(h'_0, h'_1, \dots ,\) and \(h'_4\) as follows:

$$\begin{aligned} h'_0= & {} L_{1, 4}F_{1, 4} + L_{2, 3}F_{2, 3}, \end{aligned}$$
(24)
$$\begin{aligned} h'_1= & {} l_1F_{0, 1} + L_{2, 4}F_{2, 4}, \end{aligned}$$
(25)
$$\begin{aligned} h'_2= & {} l_2F_{0, 2} + L_{3, 4}F_{3, 4}, \end{aligned}$$
(26)
$$\begin{aligned} h'_3= & {} l_3F_{0, 3} + L_{1, 2}F_{1, 2}, \end{aligned}$$
(27)
$$\begin{aligned} h'_4= & {} l_4F_{0, 4} + L_{1, 3}F_{1, 3}, \end{aligned}$$
(28)

where \(F_{i',j'}\) denotes \(f_{i'} + f_{j'}\)\((0 \le i' < j' \le 4)\). The lower five bits of \(a^{-1}\) (denoted by \(l'\)) are also obtained in the same manner as in Eqs. (24252627)–(28). The component denoted by Stage 3 in Fig. 3 performs the computations corresponding to Eqs. (24252627)–(28). Note here that the calculations for \(F_{i', j'}\) can be shared within Stage 3. As a result, the number of circuit components for the two multipliers in Stage 3 is reduced.

The above inversion circuit achieves the shortest critical delay among tower field inversion circuits as evaluated in Sect. 4. In the following, we describe a variation of the proposed circuit with a smaller critical delay at the cost of a slight area overhead. We focus on Stage 2 and \(F_{i',j'}\) computation prior to Stage 3. Since Stage 2 is given by one-to-one mapping and \(F_{i',j'}\) is computed by XORing the output of Stage 2, we can unify Stage 2 and \(F_{i',j'}\) computation by deriving \(F_{i',j'}\) directly from \(d_0\), \(d_1\), \(\dots \), and \(d_4\) as follows:

$$\begin{aligned} F_{0,1}= & {} d_2(d_1d_3+1) + d_0d_1 + d_3d_4, \end{aligned}$$
(29)
$$\begin{aligned} F_{0,2}= & {} d_4(d_1d_2+1) + d_0d_2 + d_1d_3, \end{aligned}$$
(30)
$$\begin{aligned} F_{0,3}= & {} d_1(d_3d_4+1) + d_0d_3 + d_2d_4, \end{aligned}$$
(31)
$$\begin{aligned} F_{0,4}= & {} d_3(d_2d_4+1) + d_0d_4 + d_1d_2, \end{aligned}$$
(32)
$$\begin{aligned} F_{1,2}= & {} d_1(d_0d_2+1) + d_0d_4 + d_2d_3, \end{aligned}$$
(33)
$$\begin{aligned} F_{1,3}= & {} d_3(d_0d_1+1) + d_0d_2 + d_1d_4, \end{aligned}$$
(34)
$$\begin{aligned} F_{1,4}= & {} d_0(d_2d_3+1) + d_1d_3 + d_2d_4, \end{aligned}$$
(35)
$$\begin{aligned} F_{2,3}= & {} d_0(d_1d_4+1) + d_1d_2 + d_3d_4, \end{aligned}$$
(36)
$$\begin{aligned} F_{2,4}= & {} d_2(d_0d_4+1) + d_0d_3 + d_1d_4, \end{aligned}$$
(37)
$$\begin{aligned} F_{3,4}= & {} d_4(d_0d_3+1) + d_0d_1 + d_2d_3. \end{aligned}$$
(38)

The above computation for each \(F_{i',j'}\) requires \(T_A + 2T_X\) delay while the critical delay of the original circuit in Fig. 3 requires \(T_A + T_O + 2T_X\). On the other hand, we require 20 AND, 20 XOR, and 10 NOT gates to compute \(F_{i',j'}\) based on Eqs. (293031323334353637)–(38), whereas the original circuit requires 13 AND, 6 OR, and 14 XOR, and 4 NOT gates (i.e., 13 AND, 6 OR, and 4 XOR, and 4 NOT gates for computing Stage 2, and 10 XOR gates for computing \(F_{i',j'}\)). Thus, we can further reduce the critical delay of the inversion circuit at the expense of a few additional gates.

3.2 Implementation results

Table 2 shows the circuit delay and gate count of the proposed inversion circuit, where \((g_0, g_1, g_2, g_3, g_4, g_5, g_6)\) in the gate count column, respectively, indicate the number of AND, OR, XOR, XNOR, NOT, NAND, and NOR gates, and representation indicates the GF representation(s) used in the circuit. In the representation column, “Tower field of \(\textit{GF}(2^8)\)” denotes the representations for \(\textit{GF}(((2^2)^2)^2)\) or \(\textit{GF}((2^4)^2)\), and “Intermediate field” denotes \(\textit{GF}((2^2)^2)\) or \(\textit{GF}(2^4)\). “This work (compact)” denotes the inversion circuit in Fig. 3, and “This work (high-speed)” denotes the circuit where Stage 2 and \(F_{i',j'}\) computation are unified as described in the last paragraph of Sect. 3. For comparison, Table 2 also shows those of the conventional inversion circuits. The critical delay of all the conventional ones were given by reference to [16]. On the other hand, the gate counts of the conventional ones were individually given because there were no single reference data covering all of them. The gate count of [11] was given from the original paper, that of [10] was given from a public source code by the authors [22], and those of [13, 16] were given by reference to [16]. The gate count of [12] was given from a straightforward structure designed by us according to [12] since there were neither public data nor source code.

The critical paths of Stages 1, 2, and 3 in the proposed circuit require \(T_A +3T_X\), \(T_A +T_O +T_X\), and \(T_A +2T_X\) delay, respectively. In our high-speed version, that of Stages 2 and 3 is at most \(3T_A + 2T_X\). As a result, the total delay of our inversion circuit is \(3T_A+T_O+6T_X\) (or \(3T_A+6T_X\)), which is the fastest compared with the other inversion circuits. The gate count in “This work (high-speed)” is smaller or comparable to the conventional ones because the number of additional XOR and XNOR gates due to unification is trivial. In total, the high-speed version newly designed in this paper is more efficient than any other circuits including our compact version.

Table 3 Performance evaluation of inversion circuits over tower fields

To conduct a detailed evaluation, the above tower field \(\textit{GF}(2^8)\) inversion circuits were synthesized using Synopsys Design Compiler with a TSMC 65-nm CMOS standard cell library. Table 3 shows the synthesis results, where area indicates the circuit area estimated based on a two-way NAND equivalent gate size (i.e., gate equivalents (GE))Footnote 4, time indicates the circuit delay under the worst-case conditions, and area–time product indicates the product of area and time. For the best performance comparison, an area optimization option (which maximizes the effort of minimizing the number of gates without flattening the description) was applied. Note that the results were consistent even when the following speed optimization (which searches for the minimum timing without increasing the area obtained from the prior area optimization) options was applied. The conventional inversion circuits were also synthesized and evaluated using the same tools and options. The source codes of [10] and [11] were obtained from authors’ Web sites [22, 23], respectively. (Like them, we also applied gate-reduction techniques to our inversion circuit.) The source codes of [12, 13], and [16] were described by the authors according to the structures given in the papers.

Fig. 4
figure 4

Overview of AES a S-Box and b inverse S-Box based on tower field arithmetic

Our compact inversion circuit achieved the smallest area although the total gate count of the proposed circuit was roughly the same as the conventional ones [11, 16]. The less XOR gates in our circuit would lead to the smaller circuit area because an XOR (or XNOR) gate requires larger area than a NAND (or NOR) gate in standard cell libraries. Consequently, we confirmed that the compact version of the proposed circuit achieved the smallest area of 174.75 GE, the smaller circuit delay of 1.81 ns compared with conventional other circuits. In addition, we performed a dynamic gate-level timing simulation to estimate power consumption with Verilog-XL Simulator at the operation frequency of 100 MHz. As a result, we confirmed that the power consumption of our circuit was 38% lower than the conventional best in our environment. Moreover, the high-speed version of the proposed circuit achieved the smallest critical delay of 1.55 ns, and the area–time product was 24.8% smaller than that of the conventional best circuit, respectively.

4 Application to AES S-Box

4.1 Description of AES S-Box

The proposed inversion circuit was efficiently applied to the AES S-Box design. The AES S-Box consists of a \(\textit{GF}(2^8)\) inversion and an affine transformation over \(\textit{GF}(2)\). Here, the \(\textit{GF}(2^8)\) is represented in a PB with an irreducible polynomial \(x^8 + x^4 + x^3 + x + 1\). Therefore, a change-of-basis between \(\textit{GF}(2^8)\) and \(\textit{GF}((2^4)^2)\) is required if the inversion over \(\textit{GF}((2^4)^2)\) is applied. Figure 4 shows an overview of the AES S-Box with tower field arithmetic. In an S-Box (Fig. 4a), the input (in the PB-based \(\textit{GF}(2^8)\)) is initially mapped to the tower field by applying an change-of-basis \(\varDelta _f\) which is given by an isomorphim. After the inversion operation over the tower field, the inverse mapping and affine transformation are finally performed in series. Here, we can merge the inverse mapping into the affine transformation because both of them are represented in the form of constant matrices over \(\textit{GF}(2)\). The merged mapping is denoted by \(\varDelta _l\). This merging reduces the delay and gate counts. On the other hand, in an inverse S-Box (Fig. 4b), the inverse affine transformation is performed prior to change-of-basis and tower field inversion.Hence, the inverse affine transformation and change-of-basis are unified as \(\varLambda _f\), and the inverse change-of-basis \(\varLambda _l\) is solely performed after the tower field inversion.

Fig. 5
figure 5

Typical architecture of unified S-Box

Fig. 6
figure 6

AES a S-Box and b inverse S-Box with proposed technique for optimizing linear mappings

The matrices for the change-of-basis \(\varDelta _f\) and \(\varDelta _l\) (\(\varLambda _f\) and \(\varLambda _l\)) have an impact on the performance of S-Box. When tower field \(GF((2^4)^2)\) is used, the matrices are defined by the bases of \(GF(2^4)\) and the defining polynomials for the extension of \(GF(2^4)\) to \(GF((2^4)^2)\). The efficiency of the two matrices for change-of-basis \(\varDelta _f\) and \(\varDelta _l\) (\(\varLambda _f\) and \(\varLambda _l\)) is determined by the largest Hamming weight in the columns. For example, if the largest Hamming weight in the columns is four, the critical path becomes \(2T_X\) delay. If it is five, the critical path becomes \(3T_X\) delay. Therefore, the matrices should be selected with a view to minimizing the largest Hamming weight in the columns.

Some techniques for optimizing change-of-basis between AES field and \(GF((2^4)^2)\) have been reported in [16, 24, 25]. In [24, 25], an algorithm which searches all construction of isomorphic mappings from/to PB-based \(GF((2^4)^2)\) was used. In addition, a technique in [16] used More Miscellaneously Mixed Bases (MMMB) to expand search space of the isomorphim for change-of-basis by utilizing asymmetric property of input of inversion circuit given by RRB. However, these technique cannot be applied to our inversion circuits because the irreducible polynomial of \(GF(2^4)\) should be given by the AOP of degree 4 and the input of inversion circuit should be given by symmetric NB.

In this paper, we design a unified S-Box that supports both encryption and decryption shown in Fig. 5 in addition to the S-Box. There are efficient conversion matrices for either encryption or decryption in the proposed inversion circuit. However, we found that there is no efficient conversion matrix for both encryption and decryption in the structure of Fig. 5 due to the deficiency of search space of isomorphim for change-of-basis. Therefore, we introduce a new technique for expanding the search space. Figure 6 illustrates the AES S-Box structures with the proposed technique, where the component “constant multiplication” perform a multiplication over PB-based \(GF(2^8)\) with a fixed value. The S-Box based on tower field arithmetic (denoted by S) is represented by \(S(a) = A(\varTheta '((\varTheta (a))^{-1})) + \mathrm{0x63}\), where \(\varTheta \) is a change-of-basis from the PB-based \(GF(2^8)\) to a tower field, \(\varTheta '\) is its inverse change-of-basis, and A denotes the linear mapping of the affine transformation. We can rewrite the equation using a nonzero fixed value c (in the PB-based \(GF(2^8)\)) as follows:

$$\begin{aligned} S(a) = A(c(\varTheta '((\varTheta (c(a)))^{-1}))) + \mathrm{0x63}. \end{aligned}$$

Because the multiplication with c is a linear mapping, we can unify c and \(\varTheta \) as \(\varDelta _f\), and unify \(\varTheta '\), c, and A as \(\varDelta _l\). The fixed value c can take one of 255 elements over \(GF(2^8)\). Thus, we can increase the variety of conversion matrices by 255 times. The proposed technique can be also applied to the inverse S-Box \(S'\) as follows:

$$\begin{aligned} S'(a) = c'(\varTheta '(((\varTheta (c'(A'(a)))))^{-1})) + \varTheta (c'(\mathrm{0x05})), \end{aligned}$$

where \(A'\) is the linear mapping of the inverse affine transformation and \(c'\) is a nonzero fixed value for decryption. Note that c and \(c'\) do not need to be the same in general.

Table 4 Performance comparison of S-Boxes based on tower field arithmetic
Table 5 Performance comparison of unified S-Boxes based on tower field arithmetic

With the proposed technique, we successfully found efficient conversion matrices \(\delta _f\), \(\delta _l\), \(\lambda _f\), and \(\lambda _l\), respectively, for \(\varDelta _f\), \(\varDelta _l\), \(\varLambda _f\), and \(\varLambda _l\) when the \(GF(2^4)\) elements of Stage 1 are represented in an NB \(\{\beta ^4, \beta ^3, \beta ^2, \beta ^1\}\) and the defining polynomial for the extension is given by \(\alpha ^2+(\beta ^4+\beta )\alpha +\beta \). As a concrete example, the matrices \(\delta _f\), \(\delta _l\), \(\lambda _f\), and \(\lambda _l\) are given by

$$\begin{aligned} \delta _f= & {} \left( \begin{matrix} 1 &{} 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 1 &{} 0 \\ 1 &{} 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 1 &{} 0 &{} 0 &{} 1 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 &{} 1 &{} 0 &{} 0 &{} 1 \\ 0 &{} 1 &{} 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 \\ 1 &{} 1 &{} 1 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 1 &{} 1 &{} 0 \\ 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 \end{matrix} \right) , \delta _l = \left( \begin{matrix} 1 &{} 0 &{} 1 &{} 1 &{} 1 &{} 0 &{} 1 &{} 1 &{} 1 &{} 1 \\ 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 1 \\ 0 &{} 0 &{} 1 &{} 1 &{} 0 &{} 1 &{} 0 &{} 1 &{} 1 &{} 1 \\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 1 &{} 1 &{} 1 &{} 0 \\ 0 &{} 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 &{} 1 &{} 0 \\ 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 1 &{} 0 &{} 0 \\ 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 &{} 1 &{} 1 &{} 1 \\ 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 &{} 0 \end{matrix} \right) , \\ \lambda _f= & {} \left( \begin{matrix} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 1 &{} 0 &{} 0 \\ 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 \\ 1 &{} 1 &{} 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 0 &{} 1 &{} 1 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 &{} 1 \\ 0 &{} 0 &{} 1 &{} 1 &{} 1 &{} 1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 1 &{} 1 &{} 0 &{} 0 &{} 0 \\ 1 &{} 1 &{} 0 &{} 0 &{} 0 &{} 1 &{} 1 &{} 0 \end{matrix} \right) , \lambda _l = \left( \begin{matrix} 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 \\ 0 &{} 0 &{} 1 &{} 0 &{} 1 &{} 0 &{} 1 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 1 &{} 1 &{} 1 &{} 1 &{} 1 &{} 0 &{} 0 &{} 0 \\ 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 &{} 1 &{} 0 &{} 0 &{} 0 \\ 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 1 &{} 1 &{} 0 &{} 0 \\ 1 &{} 1 &{} 1 &{} 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 1 \\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 1 \end{matrix} \right) , \end{aligned}$$

where the least significant bits are in the upper left corners. The vector of the inverse affine transformation is given by \(\varTheta (c'(\mathrm{0x05})) = \mathrm{0x29}\). Here, the largest Hamming weight in each column of \(\delta _f\) and \(\lambda _f\) is at most four while that of \(\delta _l\) and \(\lambda _l\) is at most six. This means that the former and latter mappings are implemented with at most \(2T_X\) and \(3T_X\) delays, respectively.

4.2 Implementation results

Tables 4 and 5 show the critical delay and the number of XOR gates required for the change-of-basis of the proposed AES S-Box and unified S-Box compared with the conventional implementations, respectively. Here, the numbers of XOR gates for Canright’s and Jeon’s S-Boxes, which were achieved by factorizing XOR gates, were given according to their papers. We applied the similar factorization technique to our S-Boxes. On the other hand, those for other S-Boxes were derived directly from conversion matrices without any factorization. Our circuits achieve \(3T_A + T_O + 11T_X\) or \(3T_A + 11T_X\) delay, which is smaller than the conventional S-Boxes with tower field arithmeticFootnote 5. Tables 4 and 5 also show the synthesis results (area and delay time) obtained from the same tool, synthesis options, and method as the above. The source codes were given from the same methods as Table 3. Note here that Canright’s design in [11] supports both encryption and decryption, and we have slightly changed it to support only encryption to allow a fair comparison to our design (for Table 4). On the other hand, since Satoh’s design supports either encryption or decryption, we have also changed it to support both encryption and decryption as described in Fig. 5 (for Table 5). As a result, the area–time products of our AES S-Boxes unified S-Boxes were, respectively, 28.1 and 31.5% better than Canright’s ones, which had been the smallest for a long time, and were also, respectively, 23.2 and 22.3% better than Nekado’s latest ones. The power consumption of our S-Boxes and unified S-Boxes, which were estimated in the same manner as Sect. 4, were, respectively, 40.8 and 42.1% better than Canright’s ones, and were also, respectively, 33.2 and 31.2% better than Nekado’s ones in our environment. Note that the results of inverse S-Boxes would be consistent with Table 4.

A further discussion point when applying the proposed method to cryptographic cores is the well-known side channel issue. In particular, the resource sharing of Stages 1 and 3 would cause glitches during the computation. To apply our method to masking-based countermeasures with pipelining such as threshold implementation (TI) [28, 29] and generalized masking scheme (GMS) [30, 31] to defeat sophisticated (higher-order) attacks utilizing glitches, we need to decompose shared resources, which results in the increase of 12 XOR gates for the compact version or 17 XOR gates for high-speed version in total. Note, however, that such increase would also happen in other works (e.g., [10, 11, 16]) using the similar optimization. In contrast, our method is more suitable for multiplexing-based countermeasures, such as WDDL [32], due to the high efficiency. A further and comprehensive study on the side channel security is definitely one of the important future topics for our method.

5 Conclusion

This paper proposed a new \(GF(2^8)\) inversion circuit that utilizes a combination of non-redundant and redundant GF arithmetic. The proposed circuit, which is based on tower field arithmetic, was designed by utilizing PRR and RRB for the subfield inversion and multiplication, respectively. The flexibility of such redundant representations can provide efficient change-of-basis from/to \(GF(2^8)\). The efficiency of our proposed inversion circuit and its AES S-Box was evaluated by gate count and logic synthesis results with a 65-nm CMOS standard cell library. Consequently, we confirmed that the newly proposed inversion circuit was approximately 47% faster than the conventional best \(GF(((2^2)^2)^2)\) circuit with 5% area overhead. In addition, we proposed a new optimization technique for change-of-basis between AES and tower field. The proposed S-Boxes and unified S-Boxes achieved the best efficiency in comparison with any other existing S-Boxes based on tower field arithmetic.

Redundant GF representations, such as PRR and RRB, provide high flexibility for GF arithmetic circuit design. It is definitely possible to obtain efficient circuit structures using them because the search space of isomorphism for change-of-basis increases as a result of their flexibility. In addition, a combination of non-redundant and redundant GF representations has the potential to further improve GF circuits, as shown in this paper. Further research is being conducted to expand the application of the design methodology based on hybrid GF arithmetic.