Keywords

1 Introduction

With the advancements in quantum computing, it has become crucial to investigate the security of cryptographic primitives against quantum attacks. For symmetric key ciphers such as AES, Grover’s search algorithm [9] is the main threat, which provides a quadratic speedup, significantly reducing the time required to perform an exhaustive search for the key. The speedup reflects the asymptotic behavior of Grover’s algorithm, providing a rough estimate of the security vulnerabilities introduced by quantum computers when applied to symmetric primitives [14]; moreover, estimating the detailed cost of implementing Grover’s algorithm is challenging, hence most authors only use a rough approximation.

In 2016, the National Institute of Standards and Technology (NIST, US) initiated a competitionFootnote 1 aimed at identifying candidate post-quantum algorithms for standardization. As part of this competition, NIST established various security categories for classifying the algorithms (refer to Table 1). These categories were defined based on the complexity of quantum attacks, which can be quantified in terms of circuit size. Specifically, security categories 1, 3, and 5 correspond to key recovery attacks against AES-128, -192, and -256, respectively.

Table 1. The three security levels for NIST’s post-quantum competition defined based on the security of three AES variants.

In addition to the gate count, another important parameter known as \(\texttt{MAXDEPTH}\) has been introduced. NIST has proposed an approach that limits quantum attacks to a fixed running time or circuit depth. This limitation is motivated by the challenges associated with executing extremely lengthy serial computations. As a result, there is growing interest in improving the estimation of these levels, particularly in terms of optimizing the cost of quantum implementations for AES. This problem is of significant independent interest and has garnered considerable attention.

The introduction of security levels in NIST’s post-quantum competition has spurred the search for optimal quantum circuits. With the potential advantages offered by quantum technology, global efforts are currently directed towards the design of practical large-scale quantum computer architectures, aiming to achieve a tangible quantum advantage [7]. At the highest level of abstraction, the first stage generates a quantum circuit from a reduced set of universal quantum gates. The second stage makes the computation robust to errors and thus requires the inclusion of quantum error correction methods. The industrial community has already developed tools for the synthesis and optimization of quantum circuits, including ProjectQ [11, 27] and Q# [25], enabling significant advancements in this field.

1.1 Related Work

A quantum circuit can be constructed with the Clifford \(+T\) gate set. Similar to [8, 16, 18, 20, 21], at ASIACRYPT 2020, Zou et al. [29] followed the definitions of the complexity of quantum circuits.

  • Time Complexity. The time complexity refers to the time required to execute non-parallelizable logical T gates, also known as T-depth.

  • Space Complexity. The space complexity corresponds to the number of logical qubits needed for the entire quantum computation, often referred to as width.

  • Circuit Complexity. The circuit complexity is determined by the product of the time and space complexity, i.e., DW-cost. This metric has been selected as the trade-off metric by Jaques et al. at EUROCRYPT 2020 [14] and Huang et al. at ASIACRYPT 2022 [12].

Extensive research has been conducted on the design of quantum circuits for AES. The initial circuit proposal by Grassl et al. in [8] introduced a zig-zag structure to minimize the number of required qubits. This structure has since been adopted in various works [1, 8, 18, 29]. Langenberg et al. [18] presented a new circuit for the AES S-box and key expansion, reducing the qubit count. Building upon this work, Zou et al. [29] further optimized the zig-zag structure and introduced new AES S-box circuits.

Early research primarily aimed to reduce qubit counts in quantum circuits. Equally important is minimizing circuit depth, given the challenges of lengthy serial computations required by Grover’s algorithm. Additionally, ensuring fault tolerance in quantum circuits is crucial, making the T-depth a significant factor to consider, as emphasized in [12, 13].

Nevertheless, the number of qubits can still be a bottleneck. A significant challenge in realizing large-scale quantum computers is the redundancy required for error-correction codes, which demands a large number of physical qubits [10]. Recently, the research community has taken into account both space complexity and time complexity. At EUROCRYPT 2020, Jaques et al. [14] proposed several methods to reduce quantum depth and the number of qubits. Building upon this work, Huang et al. [12] presented S-boxes of depth 3 and 4, effectively reducing the T-depth and DW-cost. Subsequently, a pipelined architecture was introduced in [13] to further decrease the circuit depth.

1.2 Our Contributions

This paper aligns with the ongoing research direction and prioritizes the depth and the number of qubits as the primary goal. By emphasizing a low-depth metric, we strike a favorable balance between time and space. To further optimize quantum circuits for AES, it becomes crucial to enhance the architecture and building blocks. The contributions put forth by this paper are outlined below.

Improved Structure of S-Box. In a recent study by Huang et al. [12] a structure for the AES S-box was proposed utilizing 120 ancilla qubits with a T-depth of 4. In this paper, we introduce the mixing-XOR technique, an algorithm designed to identify idle qubits within linear transformations and repurpose them to store intermediate values. This technique significantly reduces the depth, the number of qubits, and gates required for various quantum circuits. Using this technique, we notably reduce the width of the AES S-box to just 83 ancilla qubits with a T-depth of 4.

Combination of S-Box and S-Box\(^\dag \) (Reverse of S-Box). In a study conducted by Jang et al. [13], the shallowed pipeline architecture was introduced to evaluate the S-box and its reverse, S-box\(^\dag \), simultaneously for two bytes. The S-box\(^\dag \) operation is used to clean upFootnote 2 the ancilla qubits used in S-box. During the execution of the S-box, 120 ancilla qubits are utilized to store intermediate values. These ancilla qubits need to be cleaned up by S-box\(^\dag \) so that they can be reused in the subsequent round. The approach proposed in [13] requires a total of \(120 \times 2 = 240\) ancilla qubits to execute these two operations simultaneously. We propose the combined pipeline architecture with the share technique to combine S-box and S-box\(^\dag \) concurrently. This technique significantly reduces the required number of ancilla qubits to just 98, resulting in a 59% reduction.

Improved Quantum Circuit for Reduced Circuit Complexity. We release various quantum circuits for AES-128/192/256 under the combined pipeline architecture, following a similar approach as in [12,13,14]. Compared with the results in [13], the qubit count and DW-cost are decreased by 42.4%, 41.2%, 36.5% for AES-128, -192, -256, respectively. Notably, the circuits with the AND gate and out-of-place linear layer have a lower depth. For AES-128, -192, and -256, we only require the depth of 730, 876, and 1,018, respectively. When applied to Grover’s algorithm, these circuits provide more precise and efficient resource estimates than those proposed in [13, 14].

Introducing AND Gates into the Zig-Zag Architecture. We propose modifications to the S-box circuit introduced by Huang et al. [12] at ASIACRYPT 2022. Our improved circuit achieves a remarkable reduction in qubit count with the increase of the depth (cf. Sect. 7). By applying the round-in-place zig-zag architecture, the DW-cost for AES-128 is further reduced to 132,800, while the previous best result is 204,800 in [12]. Figure 1 illustrates a comparison of the trade-off between T-depth and width, demonstrating that our circuits exhibit reduced width and DW-cost compared to [12].

Fig. 1.
figure 1

Width and T-depth for implementing a quantum circuit for AES-128.

1.3 Organization

In Sect. 2, we present our notation and define our metrics. In Sect. 3, we propose an improved structure of the S-box. The combination of S-box and S-box\(^\dag \) is described in Sect. 4. The quantum circuits for the AES components are presented in Sect. 5. The complete circuits for AES and the resource estimations based on the pipeline and zig-zag architectures are shown in Sect. 6 and Sect. 7. Finally, we conclude and propose future research directions in Sect. 8.

2 Preliminaries

2.1 Notation

Let \(\mathbb {F}_2\) be the finite field with two elements 0 and 1 and let \(\mathbb {F}_{2^k}\) be the finite field with \(2^k\) elements. Consider \(a, b \in \mathbb {F}_2\). Then \(\overline{a} = a \oplus 1\) represents the inversion of a, and \(a \oplus b\), \(a \cdot b\) and a|b denote the XOR, AND and OR operations of a and b. We denote with M a binary \(n\times n\) matrix over \(\mathbb {F}_2\); \(\boldsymbol{x}\) and \(\boldsymbol{y}\) are n-bit vectors over \(\mathbb {F}_2\) with \(\boldsymbol{x}=[x_0 x_1 \ldots x_{n-1}]\) and \(\boldsymbol{y}=[y_0 y_1 \ldots y_{n-1}]\). A linear map defined by M is defined as \(\boldsymbol{y}^t = M \cdot \boldsymbol{x}^t\). For a linear map M, \(M^\dag \) is defined as the adjoint of M, where \(MM^\dag =I\) and I is the identity map. The linear map is computed by a circuit as the sequential composition of the individual gates.

2.2 Quantum Circuits

A quantum state is typically written as and is used to emphasize that the state has n qubits. Quantum gates exploit quantum entanglement and the superposition states of qubits [4, 6]. In general, a quantum circuit is constructed by using the widely adopted universal fault-tolerant gate set Clifford phase shift or T gate:

$$ H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 &{} 1 \\ 1 &{} -1 \end{pmatrix},\,\,\, S = \begin{pmatrix} 1 &{} 0 \\ 0 &{} i\end{pmatrix},\,\,\, \textrm{CNOT} = \begin{pmatrix} 1 &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 \\ 0 &{} 0 &{} 1 &{} 0 \end{pmatrix},\,\,\, T=\begin{pmatrix} 1 &{} 0 \\ 0 &{} e^{i\pi /4}\end{pmatrix}. $$

We also employ the Pauli-X gate \(X= HS^2H = \begin{pmatrix} 0 &{} 1 \\ 1 &{} 0 \end{pmatrix}\). The Pauli-X gate transforms a single qubit from into , and the \(\textrm{CNOT}\) gate converts into . The Toffoli gate is a universal gate defined as the CCNOT gate and maps to :

$$ \begin{pmatrix} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 \\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 \end{pmatrix}. $$

Optimization Goals. In the evaluation of quantum circuits, considering both time and memory costs is crucial for assessing their efficiency. The metrics of width, T-depth (#TD), and DW-cost are the primary goals of our focus and attention. A forward-looking perspective suggests that each gate has a depth equal to one, with the T gate incurring a cost similar to other gates. This perspective defines the full depth (#FD) as the time cost metric.

Implementing the AND Operation. As is shown in [24], the Clifford gates are much cheaper than the T gate. Thus, the T gate has a higher impact on the running time. In our quantum circuit, the AND operation \(a \cdot b\) is the only source of T-depth. Currently, there exist multiple approaches for implementing the AND operations:

  • the Toffoli gate with T-depth 1, or 4 (cf. [2, 26]), achieving .

  • the quantum \(\textrm{AND}\) gate using one auxiliary qubit with T-depth 1 (cf. [14]), achieving .

The main distinction lies in the fact that the target qubit of the AND gate must be in the state , whereas the Toffoli gate does not have this requirement on its target qubit. In our paper, we consider the target qubit of the AND operation to be in the state , allowing us to convert a Toffoli gate into an AND gate. Consequently, implementing the AND operation becomes a unified process, and whether it is represented as a Toffoli gate or an AND gate depends on specific scenarios.

2.3 Description of AES Family

The Advanced Encryption Standard (AES) [5] is a block cipher with 128-bit state (16 bytes) standardized by NIST. The AES family has three members denoted as AES-128 (10 rounds), AES-192 (12 rounds), and AES-256 (14 rounds) with 128-bit, 192-bit, 256-bit keys, respectively.

Fig. 2.
figure 2

Structure of an AES round.

The AES round function consists of four operations: AddRoundKey \(\circ \) MixColumns \(\circ \) ShiftRows \(\circ \) SubBytes (see Fig. 2). The 128-bit state is represented in a \(4 \times 4\) matrix of bytes. Next, we explain each operation.

  • AddRoundKey.    The operation exclusive-ors each round key to the state.

  • ShiftRows.    It cyclically rotates the cells of the i-th row to the left over i positions, for \(i = 0, 1, 2, 3\).

  • SubBytes.    The operation applies an 8-bit S-box to each byte of the state in parallel. The 128-bit state is transformed with 16 S-box lookups.

  • MixColumns.    The MixColumns operation applies a linear transformation on each column of the state with the matrix

    $$ M = \begin{pmatrix} \texttt{0x02} &{} \texttt{0x03} &{} \texttt{0x01} &{} \texttt{0x01} \\ \texttt{0x01} &{} \texttt{0x02} &{} \texttt{0x03} &{} \texttt{0x01} \\ \texttt{0x01} &{} \texttt{0x01} &{} \texttt{0x02} &{} \texttt{0x03} \\ \texttt{0x03} &{} \texttt{0x01} &{} \texttt{0x01} &{} \texttt{0x02} \end{pmatrix}. $$

For each column in the state, we use MixColumn to represent the operation. Hence, one MixColumns operation consists of four MixColumn operations on the four columns of the state. In SubBytes, 16 S-box operations substitute the individual bytes of the state.

Denote the master key by 32-bit words \(W_0, W_1, \ldots , W_{s-1}\), where \(s=4\) for AES-128, \(s=6\) for AES-192, and \(s=8\) for AES-256, respectively. Three operations are used in the algorithm: RotWord cyclically rotates the four bytes to the left by one position; Rcon exclusive-ors a constant to each byte of the word; SubWord applies four S-boxes in parallel to the four bytes of the word.

3 Improved Quantum Circuit Implementations of AES S-Box

In [14], the S-box circuit constructed by Jaques et al. for AES requires 120 ancilla qubits with T-depth 6. Subsequently, Huang et al. [12] reduced the T-depth to 4 while keeping the number of qubits at 120. In this section, our aim is to construct a new circuit from the classical circuit with a reduced number of qubits and gates, as well as less quantum depth. We build upon these papers and introduce the mixing-XOR technique (m-XOR) to accomplish this construction. The technique enables us to identify reusable qubits within the S-box, resulting in a circuit with T-depth 4, requiring only \(74+9\) ancilla qubits, where 9 qubits with the value are used to satisfy the parallelism of AND gates. If the S-box circuit used for improvements is updated, our approach can still be employed to reduce these three metrics.

3.1 M-XOR Technique

Two operations are used in quantum linear circuits:

  • The updating operation is in-place and can be implemented by a \(\textrm{CNOT}\) gate , defined as \(\textrm{CNOT}(a, b)\).

  • The creating operation is out-of-place, requiring two CNOT gates and , defined as \(\textrm{CNOT}2(a,b,c)\).

which correspond to s-XOR and g-XOR in classical circuits (cf. [15, 28]). We observe that it is beneficial to transform a creating operation into an updating operation. Based on Observation 1, we propose the m-XOR technique to construct the quantum circuit from the classical circuit with a reduced number of qubits and gates, as well as less quantum depth.

Observation 1

Given a quantum circuit with creating operations, some qubits can be reused by transforming creating operations into updating operations.

Definition 1

(m-XOR technique). The m-XOR technique mixes updating operations and creating operations to implement a linear transformation with the lowest number of qubits.

Example 1

Given that we assign two qubits to \(q_a\) and \(q_b\), the subsequent operation is a creating operation: \(q_c = q_c \oplus (q_a \oplus q_b)\). If \(q_a\) is not further utilized in the subsequent circuit, this operation can be converted to an updating operation: \(q_a = q_a \oplus q_b\). When utilizing \(q_c\), we simply select the qubit \(q_a\).

The detection of idle qubits forms the essence of this technique. Proposition 1 outlines the qubits that can be eliminated from the circuit.

Proposition 1

In a sequentially written quantum circuit, the conversion from a creating operation \(t_{c} = t_c \oplus (t_{a} \oplus t_b)\) to an updating operation \(t_a = t_a \oplus t_b\) requires the fulfillment of the following conditions:

  • \(t_a\) should not be utilized in the subsequent circuit.

  • \(t_c\) does not appear in the previous circuit.

To successfully perform the conversion, both conditions must be satisfied. Failing to meet either condition can compromise the correctness of the circuit.

Proof

For the operation \(t_{c} = t_c \oplus (t_{a} \oplus t_b)\), the question is how to save the value of \(t_c\). If \(t_a\) is involved in subsequent computations, it cannot be safely updated. If \(t_a\) is not used in the subsequent circuit, we can use \(t_a = t_a \oplus t_b\) to replace \(t_c\), which requires that \(t_c\) does not appear in the previous circuit. Otherwise, \(t_c\) is not initialized to and the transformation ignores the original value of \(t_c\).    \(\square \)

Proposition 1 implies Algorithm 1 to optimize a quantum circuit using the m-XOR technique.

figure s

3.2 Applying the m-XOR Technique to AES S-Box

We apply Algorithm 1 to optimize the AES S-box from [12]. The circuit contains 34 AND operations, 120 creating operations, and 4 X gates. Apart from the 8-bit inputs \(u_0, u_1, \ldots , u_7\) and the 8-bit outputs \(s_0, s_1, \ldots , s_7\), the circuit of the S-box requires 120 ancilla qubits: \(t_0,\ldots ,t_{26}\), \(m_0, \ldots , m_{62}\), and \(l_0, \ldots , l_{29}\).

Table 2. New structure of S-box of AES with 74 ancilla qubits.

We find that 46 qubits are unnecessary. Our optimized S-box circuit requires only 74 ancilla qubits, denoted as \(q_0, q_1, \ldots , q_{73}\). A detailed list of all allocated qubits is provided in Table 2. To facilitate the description, we introduce the AND operation and the X gate, which are defined as follows:

$$\begin{aligned} \begin{aligned} &\textrm{AND}(a,b,c) \rightarrow c = a\cdot b\, ,\\ &\textrm{X}(a) \rightarrow a = a \oplus 1\, . \end{aligned} \end{aligned}$$

We take an example to show the process. In No. 2 of the previous circuit, the operation \(\textrm{CNOT}2(u_7, u_1, t_2)\) statisties Proposition 1. Therefore, we can use \(u_7\) to save the value of \(t_2\), reducing one ancilla qubit and one \(\textrm{CNOT}\) gate. The gate substitution is

$$\begin{aligned} \textrm{CNOT}2(u_7, u_1, t_2) \rightarrow \textrm{CNOT}(u_1, u_7)\,. \end{aligned}$$

3.3 Comparisons of Resource Estimations of the AES S-Box

There are various implementations of S-box quantum circuits. Some circuits use Toffoli gates, while others use AND gates. We show the comparison of different Toffoli-based circuits in Table 3.

Table 3. Comparison of implementations of S-box using Toffoli gates. “6 + 16” represents 6 ancilla qubits and 16 input and output qubits of the AES S-box.

Next, we compare the S-box circuits utilizing AND gates with [12, 14]. The authors estimated both the S-box and S-box\(^{\dagger }\) using Q#. To ensure a fair comparison, we adopt the same approach (cf. Table 4). As mentioned in [12], for the 4-th AND layer, the execution of 18 parallel AND gates is required, with each gate necessitating an ancilla qubit, resulting in a total of 18 qubits. We find that \(q_{73}, s_0, s_1,\ldots , s_7\) remain after the 4-th AND layer. Thus, the remaining 9 qubits need to be allocated. Thus, the width of the S-box circuit is \(74+16+9=99\). We find that the optimized quantum circuit has a reduced number of qubits and gates, as well as less quantum depth.

Table 4. Comparison of several implementations of S-box and S-box\(^{\dagger }\) based on the AND gates. Here #M counts the number of Measurements.

4 Improved Combination of S-Box and S-Box\(^\dag \)

In [14], Jaques et al. used the pipeline architecture for AES. Subsequently, Jang et al. [13] proposed the shallowed pipeline architecture, which necessitates \(120+120=240\) ancilla qubits for executing S-box and S-box\(^\dag \). In this section, we present the combined pipeline architecture with the share technique to combine the S-box and S-box\(^\dag \). The combined architecture demonstrates a significant efficiency improvement by utilizing only \(74 + 24 = 98\) ancilla qubits. Note that we do not use the inverse S-box in the pipeline architecture.

4.1 Pipeline Architecture for AES

A pipeline architecture is employed to reduce the reuse of qubits and achieve lower T-depth in circuit execution. As is shown in [13], the resource estimation based on AND gates and the pipeline architecture in [14] may underestimate the depth/width cost of the circuits, partly due to the bugs in Q#Footnote 3. Then, Jang et al. [13] proposed a shallowed pipeline architecture as depicted in Fig. 3. In the following, we provide a detailed description of each component.

Fig. 3.
figure 3

The shallowed pipeline architecture.

The SubBytes operation comprises 16 distinct S-boxes applied to 16 bytes, denoted as SB in Fig. 3. For the i-th round function \(R_i\), we represent \(R_i\) as the sum of SB and \(r_i\). Each S-box takes an 8-qubit input, utilizes a set of ancilla qubits \(Q^1_i\) for each byte \(B_i\) (\(0\le i \le 15\)), and produces an 8-qubit output on the subsequent line. In [13], \(|Q^1_i| = 120\). If we do not clean up the ancilla qubits in each \(Q^1_i\), we would need to allocate an ancilla qubit set for the subsequent S-box operations, resulting in a total of \(\text {10 rounds }\times \text { 16 bytes }\times \text { 120 qubits }= 19,200\text { qubits}.\)

Hence, the use of SB\(^\dag \), which represents the adjoint of SB, becomes necessary to clear the ancilla qubits. During the execution of SB, 16 sets of ancilla qubits \(Q^1_i\) (\(0\le i \le 15\)) are utilized. Subsequently, SB\(^\dag \) is employed to clean up the qubits in each \(Q^1_i\). To ensure that the T-depth does not increase, the execution of SB\(^\dag \) in \(R_j\) and SB in \(R_{j+1}\) is performed simultaneously. To achieve this, 16 sets of ancilla qubits \(Q^2_{k}\) (\(0\le k \le 15\)) are required. We illustrate this relationship using an example.

Example 2

We begin with an initial set of 16 \(Q^1_i\) (\(0\le i \le 15\)) and 16 \(Q^2_k\) (\(0\le k \le 15\)).

  • In \(R_1\), SB uses 16 \(Q^1_i\) (\(0\le i \le 15\)).

  • In \(R_2\), SB uses 16 \(Q^2_k\) (\(0\le k \le 15\)) and SB\(^\dag \) cleans up the qubits in 16 \(Q^1_i\) (\(0\le i \le 15\)).

  • Then, in round \(R_3\), SB uses the 16 \(Q^1_i\) (\(0\le i \le 15\)) sets, and SB\(^\dag \) clears the qubits in the 16 \(Q^2_k\) (\(0\le k \le 15\)) sets.

  • These two sets of ancilla qubits are alternated in the remaining rounds.

  • The total count of ancilla qubits is \(2 \times 16 \times 120=3840\).

Fig. 4.
figure 4

Independent structure to execute S-box and S-box\(^\dag \) simultaneously.

4.2 Combined Pipeline Architecture

For each byte \(B_i\) of the state, there are two ancilla qubit sets \(Q^1_i\) and \(Q^2_i\) used in S-box and S-box\(^\dag \) (see Fig. 4). With the S-box circuit from [12], the independent structure in [13] requires \(|Q^1_i| + |Q^2_i| = 120+120=240\) ancilla qubits. For 16 bytes of AES, the total ancilla qubit count is \(240 \times 16=3840\).

We point out that there are unnecessary qubits for this operation, which is based on Observation 2.

Observation 2

In the independent structure of S-box and S-box\(^\dag \), during the execution of S-box\(^\dag \), the qubits are consistently cleaned up, and these qubits are not utilized in the S-box operation. Conversely, S-box employs a fresh qubit set to select the available qubits.

Thus, it should be explored to reuse the qubits cleaned up by S-box\(^\dag \) immediately. We propose the share technique that combines the qubit sets of S-box and S-box\(^\dag \) (see Fig. 5). The combination only uses one set, the share set \(SQ_i\). After the analysis in Sect. 4.3, we have \(|SQ_i| = 74+24=98\), which is much smaller than \(|Q^1_i| + |Q^2_i| = 240\). Using the combined S-box and S-box\(^\dag \), we can propose the combined pipeline architecture for AES (cf. Fig. 6).

Fig. 5.
figure 5

Combined structure to execute S-box and S-box\(^\dag \) simultaneously.

Fig. 6.
figure 6

The combined pipeline architecture, where C is the combined S-box and S-box\(^\dag \).

Discussion on Different Pipeline Architectures. We employ a simple structure to discuss this distinction. Assume that an r-round cipher (\(r \ge 1\)) consists of only two components, SB and MixColumns, with corresponding depths of \(d_s\) (\(d_s \ge 1\)) and \(d_m\) (\(d_m \ge 1\)). Additionally, SB requires \(q_s\) (\(q_s \ge 0\)) ancilla qubits, MixColumns requires \(q_m\) (\(q_m \ge 0\)) ancilla qubits, and each round requires \(q_r\) (\(q_r \ge 1\)) qubits. Table 5 shows the comparison.

Table 5. Comparison of different pipeline architectures.

In the original pipeline architecture, we first execute SB, and then run SB\(^{\dag }\) and MixColumns simultaneously. Then, we use \(((r+1)\cdot q_r + q_s + q_m)\) qubits with depth \((d_s + \max (d_s, d_m))\). For the shallowed pipeline architecture, we use the independent structure to execute SB and SB\(^{\dag }\) simultaneously and then execute MixColumns. The circuit requires \(((r+1)\cdot q_r + \max (2q_s, q_m))\) qubits with depth \((d_s + d_m)\). In the combined pipeline architecture, we use the combined structure to execute SB and SB\(^{\dag }\) simultaneously and then execute MixColumns. For the share set, we use \((1+\epsilon )q_s\) qubits, where \(\epsilon \) depends on the size of the share set and we have \(0 \le \epsilon \le 1\). Then, we can obtain the following observation.

Observation 3

If \(d_s > d_m\), the shallowed and combined pipeline architectures have the lowest circuit depth. If \(q_m > \epsilon \cdot q_s\), the combined pipeline architecture has the lowest width.

4.3 Share Technique

The share technique is based on Observation 4, where we determine which qubits can be reused and preassign them accordingly.

Observation 4

Suppose that S-box and S-box\(^\dag \) are executed simultaneously. When S-box\(^\dag \) clears up a qubit, the qubit can be immediately reused by S-box.

Next, we introduce the general share technique in Definition 2. This technique utilizes five sets that satisfy Property 1, defined as follows.

Definition 2

The share technique utilizes a set \(SQ_i\) to execute S-box and S-box\(^\dag \) simultaneously for a single byte \(B_i\). Four sets are employed to split \(SQ_i\), where the set with subscript public stores the qubits in the state , and the set with subscript private stores the qubits that are being used.

  • \(q^{old}_{private}\) is the set of qubits that will be cleaned up by S-box\(^\dag \).

  • \(q^{old}_{public}\) is the set of unallocated qubits for S-box\(^\dag \).

  • \(q^{new}_{private}\) is the set of qubits used by S-box.

  • \(q^{new}_{public}\) is the set of qubits that are not used by S-box.

Property 1

The five qubit sets in the public qubit technique satisfy the following equations:

$$\begin{aligned} \begin{aligned} &SQ_i = q^{old}_{private} \cup q^{old}_{public} = q^{new}_{private} \cup q^{new}_{public},\\ &q^{old}_{private} \cap q^{old}_{public} = \phi \,,\,\,q^{new}_{private} \cap q^{new}_{public} = \phi . \end{aligned} \end{aligned}$$
(1)

These equations can be explained as follows: S-box and S-box\(^\dag \) utilize the share set \(SQ_i\) in our optimization, where we set \(|SQ_i| = 74+24=98\). For each S-box operation, an ancilla qubit set stores 74 qubits, resulting in \(|q^{old}_{private}| = 74\), and \(|q^{old}_{public}| = |SQ_i/q^{old}_{private}| = 24\). When S-box\(^\dag \) cleans up one qubit q, we have

$$\begin{aligned} q^{old}_{private} = q^{old}_{private}/\{q \}\, , \qquad q^{old}_{public} = q^{old}_{public} \cup \{q \}\, . \end{aligned}$$
(2)

At the same time, S-box requires qubits to store intermediate values. S-box chooses these qubits from \(q^{old}_{public}\). Assuming that q is chosen by S-box, we have

$$\begin{aligned} q^{old}_{public} = q^{old}_{public} / \{q \}\, , \qquad q^{new}_{private} = q^{new}_{private} \cup \{q \}\, . \end{aligned}$$
(3)

Finally, after completing the combination, we assign the qubits not used in S-box to \(q^{new}_{public}\), i.e., \(q^{new}_{public} = SQ/q^{new}_{private}\). Since this S-box also uses 74 ancilla qubits, we have

$$\begin{aligned} \begin{aligned} |q^{new}_{private}| = |q^{old}_{private}| = 74\,, \quad |q^{new}_{public}| = |q^{old}_{public}| = 24\,. \end{aligned} \end{aligned}$$
(4)

Proposition 2 demonstrates that the sizes of the five qubit sets remain unchanged after executing the combination in the share technique.

Proposition 2

In the share technique, after completing the combination of S-box and S-box\(^\dag \), the sizes of the five qubit sets remain constant.

Proof

During the combination process, we set \(|SQ_i| = a\) and \(|q^{old}_{private}| = u\). By Property 1, \(|q^{old}_{public}|\) is \(z=a-u\).

Next, we examine the combination process in detail. S-box\(^\dag \) cleans up all the qubits in \(q^{old}_{private}\) and adds them to \(q^{old}_{public}\). This results in a total of \(u + z = a\) qubits in \(q^{old}_{public}\). When executing S-box, u qubits are selected from \(q^{old}_{public}\). Therefore, \(|q^{old}_{public}|\) remains \(a-u=z\). Additionally, the unselected qubits from \(|q^{old}_{public}|\), which amount to z, are stored in \(|q^{new}_{public}|\).

In conclusion, as long as \(q^{old}_{public}\) contains a sufficient number of qubits, the sizes of the qubit sets remain unchanged. This completes the proof.    \(\square \)

Example 3

Suppose we visualize the unallocated qubits in \(q^{old}_{public}\) as a pool of water, as shown in Fig. 7. The water in the pool represents the available qubits for allocation. The process in which S-box\(^\dag \) cleans up the qubits can be likened to adding water to the pool, replenishing the available qubits. When S-box requires an ancilla qubit, it can be seen as drawing water (qubits) from the pool.

Fig. 7.
figure 7

The water pool in the public qubit technique.

We provide Algorithm 2 to decide the size of \(SQ_i\). In the aforementioned process, the ability to execute the S-box depends on the availability of qubits in the pool. If the pool is empty, the S-box cannot be executed. Specifically, if we find \(|q^{old}_{public}| = 0\), it means that no qubits can be selected from \(q^{old}_{public}\), indicating that \(|SQ_i|\) is too small to accommodate the S-box operation. Here, \(|SQ_i|\) represents the maximum capacity of the “pool” to hold qubits. Therefore, we gradually increase the size until Algorithm 2 no longer returns “Error”.

As a result, \(SQ_i\) is composed of two parts: the initially set qubits, which are assigned the value and stored in \(q^{old}_{public}\), and the qubits that require cleaning by S-box\(^\dag \), which are stored in \(q^{new}_{private}\).

figure w

4.4 Applying the Share Technique to the AES S-Box

In this section, we present the complete combination of AES S-box and S-box\(^\dag \) using the share technique. We illustrate the implementation of the S-box based on the structure provided in Table 2. To facilitate the combination, we propose a method to split the S-box circuit based on its T-depth. The S-box circuit is divided into several layers, taking into account the T-depth of each gate. The layering scheme is as follows:

  • The first layer consists of the gates that precede the execution of T gates (No. 0–26).

  • The second layer includes gates with T-depth 1 (No. 27–35).

  • The third layer encompasses gates between T-depth 1 and 2 (No. 36–52).

  • The fourth layer comprises gates with T-depth 2 (No. 53–55).

  • The remaining layers are split according to the proposed method: No. 56–61, No. 62–65, No. 66–84, No. 85–102, No. 103–157.

The circuit is executed in order, with a total of nine layers. We denote each layer as \(L_i\). For S-box\(^\dag \), the order of execution is reversed. The ancilla qubits involved in each layer are specified in Table 6.

Table 6. The layers and the corresponding ancilla qubits in S-box and S-box\(^\dag \).

In the case of S-box, in each layer \(L_i\), every qubit q represents the qubit that is initially used by the S-box. Prior to \(L_i\), the qubit q is initialized to . Conversely, for S-box\(^\dag \), in each layer \(L_i\), each qubit q indicates the qubit that is cleaned up by S-box\(^\dag \). After the completion of \(L_i\), the qubit q is reset to .

Table 7. Qubit allocation for the combination of S-box and S-box\(^\dag \). “Previous Pool” is the qubit count of the previous pool in \(L_i\). “Need” is the qubit count that S-box requires in \(L_i\). “Preset Qubits” is the number of qubits that we preset. “Cleaning” is the qubit count that S-box\(^\dag \) cleans up. “New Pool” is the qubit count of the new pool.

To ensure that the T-depth remains unchanged, we align each layer in S-box and S-box\(^\dag \). In each layer \(L_i\), when S-box requires a qubit in the state, we check if the water pool is empty. If the pool is empty, we preset a qubit in the pool, thereby increasing the size of \(SQ_i\). After the completion of \(L_i\), the qubits that are cleaned up by S-box\(^\dag \) are placed back into the pool for the next layer \(L_{i+1}\). The size of \(SQ_i\) is the sum of \(q^{old}_{private}\) and the number of preset qubits. The complete qubit allocation is illustrated in Table 7, and we provide a detailed explanation below.

  • Prior to \(L_1\), there are no qubits in the pool.

  • During the execution of \(L_1\), we preset 16 qubits in S-box. Then, one qubit is cleaned up by S-box\(^\dag \) and added to the pool.

  • In \(L_2\), we need to preset an additional 8 qubits in the pool. A total of 18 qubits are cleaned up by S-box\(^\dag \), resulting in a pool size of 18 qubits.

  • The subsequent steps follow a similar pattern.

Table 8. Cost comparison of the cost of the combination of S-box and S-box\(^\dag \). Here 32 qubits represent 16-bit inputs and 16-bit outputs of two bytes.

In total, we preset 24 qubits in the pool. Consequently, \(|SQ_i| = 74 + 24 = 98\). Prior to executing the combination, 74 ancilla qubits are utilized, and then we allocate 24 qubits with an initial state of in \(q^{old}_{public}\). After the combination, there are still 24 qubits remaining in the pool for subsequent combinations (see Proposition 2). Therefore, we have \(|SQ| = 98\), \(|q^{old}_{public}| =24\), \(|q^{old}_{private}| = 74\), \(|q^{new}_{public}| = 24\), and \(|q^{new}_{private}| = 74\). We give the complete comparison in Table 8. The number of ancilla qubits is reduced from 240 to 98.

5 The Components of Quantum Circuits for AES

This section describes the quantum circuits for the AES components MixColumns, Key Schedule, AddRoundKey, and ShiftRows. We mainly introduce the improvement of the depth of MixColumns. Other components are similar to the previous work.

Table 9. New circuit of MixColumns with quantum depth 16. Here every number represents a qubit. \(\texttt {Update}(b,a)\) represents the CNOT operation . are represented by 24, 1, 10, 11, 12, 13, 30, 15, 8, 25, 2, 3, 4, 5, 14, 7, 0, 17, 26, 19, 20, 29, 22, 31, 16, 9, 18, 27, 28, 21, 6, 23, respectively.

5.1 Implementation of MixColumns

The implementation of MixColumns has been widely studied [12, 13, 29]. Usually, we can use optimized classical circuits to reduce the cost (see for example [17, 19, 22, 23, 28]).

In quantum key search, the full depth of the circuit influences the time cost for Grover’s search. However, the depth in classical circuits and quantum circuits is different. The main reason is that one qubit cannot be used in two gates simultaneously. The previous work merely translated the optimal classical circuits into quantum circuits. However, we believe that relaxing the gate count constraint could lead to better depth performance. Therefore, we relaxed the gate count constraint in Xiang et al.’s approach in [28] and generated a series of candidate circuits. Then, we can calculate their quantum depth quickly. As a result, we obtained an in-place circuit with a depth of 16 (cf. Table 9) for comparison (cf. Table 10).

Table 10. Comparison of MixColumns implementations for each column.

5.2 Implementation of the Key Schedule

In the key schedule of AES, we use several 32-bit words to save the key. For AES-128, we use 128 qubits to represent the 128-bit master key (called \(W_0, W_1, W_2, W_3\)). After executing XOR operations for these qubits and the 128-bit state, we update 128 qubits for the next AddRoundKey operation.

The schedule is similar to [12,13,14]. Firstly, we use four ancilla qubit set \(SQ_i\) (\(0 \le I \le 3\)) to run SubWord, which requires \(4\times 98 = 392\) ancilla qubits. All the S-boxes in the key schedule and round function are designed to operate in parallel. SubWord\(^\dag \) is executed in the next round to clean up the ancilla qubits, which is introduced in Sect. 4. The 32-bit output values of SubWord are XORed to . The Rcon operation is implemented with X gates for the corresponding qubits to generate \(W_{4i+4}\). Finally, \(W_{4i+5}, W_{4i+6}, W_{4i+7}\) are updated by CNOT gates. Rcon is executed by adding X gates. The schedule in our pipeline architecture is similar to [12,13,14].

5.3 Implementation of AddRoundKey and ShiftRows

In AddRoundKey, 128 \(\textrm{CNOT}\) gates are required and no ancilla qubits are set. For the ShiftRows operation, the swap for qubits is a logical operation that only changes the index of qubits. Therefore, the operation does not require any gates.

6 Quantum Circuit of AES Based on the Pipelined Architecture

As mentioned in Sect. 4, the pipelined architecture is suitable for implementing low-depth AES circuits. In this section, we have made improvements with different quantum circuits based on the combined pipeline architecture using ProjectQ. The code is available at https://github.com/QunLiu-sdu/Improved-Quantum-Circuits-for-AES. Then, we provide a more precise analysis of the Grover algorithm’s complexity.

6.1 Resource Estimations Based on the Toffoli Gate and and Gate

In [12, 13, 18, 20, 21, 29], the authors present the quantum resources needed for the circuits without decomposing the Toffoli gates. We adhere to the same circuit metrics to facilitate a fair comparison.

Grover’s algorithm requires us to estimate the complexity resulting from decomposing Toffoli gates. According to [26], the Toffoli gate can be decomposed:

  • the circuit with T-depth 1 and 4 ancilla qubits;

  • the circuit with T-depth 4 and 0 ancilla qubits.

On the other hand, in [13, 14], the authors recommend using AND gates to achieve a lower complexity for Grover’s algorithm. We also utilize AND-based decomposition to construct AES quantum circuits. In fact, the circuits constructed using this method have the lowest circuit depth. We applied these circuits to Grover’s algorithm, obtaining more precise resource estimates.

The structure of each AES round is shown in Fig. 6. For each round, 20 S-boxes are executed in parallel (16 S-boxes for SubBytes and 4 S-boxes for SubWord). In \(R_1\), we execute 20 S-boxes with 20 shared qubit sets because no S-boxes\(^\dag \) are required. In other rounds, we execute 20 combinations of S-box and S-box\(^\dag \). After S-box operations, the output is saved in 128 new qubits for the ShiftRows, MixColumns, and AddRoundKey operations.

Moreover, for each AES quantum circuit, we considered two implementations for the linear layer:

  • in-place circuit, which utilizes the circuit found by us with depth 16.

  • out-of-place circuit, which is from [19] with depth 11.

In Table 11, the results use the Toffoli gate and do not decompose it. Our circuit achieves the optimal trade-off between Toffoli depth and width. In conclusion, compared with the previous lowest results in [21], the product of our implementations achieved a reduction of 35%, 38%, 36% for AES-128, -192, and -256, respectively. Compared with the shallowed architecture in [13], the number of qubits achieves a reduction of 42%, 41%, and 36%, respectively.

In Table 12, we show the results of decomposing Toffoli gates into circuits with T-depth 4. Compared with the shallowed architecture in [13], the number of qubits, full depth, and DW-cost achieve a reduction. The minimum depth of circuits is 770, 924, and 1074 for AES-128, -192, and -256 respectively. Table 13 also shows the results of decomposing Toffoli gates into circuits with T-depth 1. To reduce the T-depth, we employed additional qubits to implement the quantum circuit.

The circuits using the AND gates are shown in Table 14, which achieves the lower circuit depth. For the in-place version, the circuit depth achieves a reduction of \(13.8\%\), \(13.6\%\), and \(14.3\%\), respectively. It is worth noting that the circuits with the AND gate and out-of-place linear layer have a lower depth. For AES-128, -192, and -256, we only require the depth of 730, 876, and 1,018, respectively.

In conclusion, for the different types of AND operations, we find that the circuits decomposed using AND gates have the lowest complexity for Grover’s algorithm, which we show in Sect. 6.2. We also used the AES S-box with T-depth 3 proposed in [12]. Since the number of gates was almost doubled, we were not able to obtain high-quality Grover algorithm attack complexity.

Table 11. Comparisons of quantum resources of AES without decomposing the Toffoli gates.
Table 12. Comparisons of quantum resources of AES decomposing Toffoli gates with T-depth 4.
Table 13. Comparisons of quantum resources of AES decomposing Toffoli gates with T-depth 1.
Table 14. Comparisons of quantum resources with the AND gates.

6.2 Performance of Grover’s Algorithm

In this part, based on the circuits implemented using AND gates and the out-of-place linear layer, we applied these AES circuits to Grover’s algorithm, obtaining more precise and efficient resource estimates (cf. Table 15). “\(r=\lceil k/n \rceil \)” (plaintext, ciphertext) is the number of pairs that are required to recover a unique key. For AES-128, we just choose \(r = 1\). For AES-192/-256, we choose \(r = 2\). It implies that we need to use two plaintext-ciphertext pairs to determine the key. In terms of resources corresponding to Grover’s algorithm, one key expansion algorithm corresponds to two round functions. We primarily focus on four metrics, \(FD\times G\), \(FD\times W\), \(FD^{2}\times G\), and \(FD^{2}\times W\).

Table 15. Quantum resources required for Grover’s search on AES.

7 Quantum Circuit of AES Based on the Zig-Zag Architecture

In this section, we propose AES circuits based on the zig-zag architecture utilizing AND gates. The zig-zag architecture is typically employed for low-width implementations.

At ASIACRYPT 2022, Huang et al. [12] aimed to reduce the DW-cost in the zig-zag architecture by introducing low-depth S-boxes based on AND gates. They proposed an improved zig-zag architecture based on the round-in-place technique. The S-box requires 120 ancilla qubits with T-depth 4, resulting in a minimum DW-cost of 204,800 (width \(\times \) T-depth = \(2,560 \times 80\)) for AES-128. We notice that several papers have adopted similar circuits based on the round-in-place technique to optimize the quantum circuit for AES (cf. [21]).

Our approach begins by introducing the zig-zag architecture and round-in-place technique. We then highlight the advantages of utilizing the round-in-place zig-zag architecture iteratively, offering an improved trade-off between width and T-depth. Subsequently, we propose a new circuit for the AES S-box that significantly reduces the required ancilla qubits to just \(60+10\). By employing this optimized S-box in the circuits in [12], we achieve a substantial reduction in DW-cost, resulting in a final value of 132,800 (width \(\times \) T-depth = \(1,660 \times 80\)).

7.1 Zig-Zag Architecture and Round-in-Place Technique in [12]

The zig-zag architecture is proposed in [8], which reduces the number of qubits by performing reverse operations in each round (cf. Fig. 8). \(R_1\), \(R_2\), \(R_3\), and \(R_4\) are performed in order. Then, \(R_3^\dag \), \(R_2^\dag \), and \(R_1^\dag \) are utilized to clean up the 4-th, 3-rd, and 2-nd lines, which can be reused to store the outputs of \(R_7\), \(R_6\), and \(R_5\), respectively. Other rounds proceed similarly. This method requires a larger T-depth. Subsequently, at ASIACRYPT 2020, Zou et al. [29] improved the zig-zag architecture and implemented AES-128 with 512 qubits and Toffoli depth 2016.

Fig. 8.
figure 8

Zig-zag architecture.

Huang et al. [12] improved the zig-zag architecture based on a round-in-place technique. We simply introduce the technique. Suppose \(U_f\) is a circuit that map to , where denotes the ancilla qubits, then, \(U_f\) is not in-place. If we also have the inverse circuit \(U_{f^{-1}}\) that maps to , we can achieve the in-place circuit by swapping and . Figure 9 shows a round-in-place S-box circuit, which maps to .

Fig. 9.
figure 9

Round-in-place S-box circuit.

Usually, \(U_{f}\) is easy to construct. Huang et al. [12] provide a method to convert \(U_f\) into \(U_{f^{-1}}\). Suppose \(x, y\in \mathbb {F}^8_2\) are the input and output of \(U_f\). we have \(y = LS_0(x) + c\), where L is a linear function and \(S_0(x)\) is the inverse of x in \(\mathbb {F}^8_2\). Then, \(x=S_0^{-1}L^{-1}(y+c) = S_0L^{-1}(y+c) = L^{-1}(LS_0)L^{-1}(y+c)\). Let the last 4 X gates of AES S-box be \(U_c\). \(U_f = U_0 + U_c\), where \(U_0\) implements . Then, the circuit in Fig. 10 is \(U_{f^{-1}}\), where \(U_L\) and \(U_{L^{-1}}\) are the circuits of L and \(L^{-1}\), respectively. Huang et al. provide an SAT-based method and implement \(U_L\) or \(U_{L^{-1}}\) by 14 CNOT gates. Finally, adding \(14\times 3=42\) CNOT gates and 4 X gates, \(U_{f}\) can be converted into \(U_{f^{-1}}\).

Fig. 10.
figure 10

The circuit of \(U_{f^{-1}}\) based on \(U_f\).

Based on the round-in-place S-box circuit, one can construct the round-in-place round function \(R_i\) easily. Figure 11 shows the round function. SubByte\(^1\) uses S-box circuits. SubByte\(^{-1}\) uses S-box\(^{-1}\) circuits. SB, MC, and ARK represent the in-place ShiftRows, MixColumns, and AddRoundKey, respectively. The circuit maps to .

Fig. 11.
figure 11

Round-in-place round function of AES.

7.2 Executing the Round-in-Place Round Function Iteratively

In [12, 21], the authors provide a method for executing the round-in-place round function iteratively. We simply introduce the technique. For AES, there are 16 bytes for the state. Each byte requires an ancilla qubit set to run round-in-place S-box circuits. If only m S-boxes of SubBytes are executed in parallel, then 16/m ancilla qubit sets are needed. We define the number of iterations as \(i=16/m\) (\(i=1,2,4,8,16\)).

For \(i=1,2\), the key schedule can be split. In the key schedule, four S-boxes are needed in each round. However, they do not require a round-in-place S-box because contains the output values of SubWord. Thus, the schedule only uses the \(U_f\) mapping to . The S-box used in the schedule is single-depth, while the round-in-place S-box in the round function is double-depth. We can split the SubWord in the key schedule into two parts (cf. Fig. 12). SubWord\(_{\frac{1}{2}}\) indicates that only half of SubWord is used in the operations. The first part and SubByte\(^1\) are executed in parallel. The second part and SubByte\(^{-1}\) are executed in parallel. For more detailed information, refer to [12, 21].

Fig. 12.
figure 12

Two parts in round function for \(i=1\).

For \(i=4,8,16\), we execute two operations in parallel (cf. Fig. 13). We take \(i=4\) as an example. SB\(_{\frac{1}{4}}\) represents a quarter of SubBytes and SW\(_{\frac{1}{4}}\) represents a quarter of SubWord. It requires five ancilla qubit sets and the T-depth is four double-depth S-boxes.

Fig. 13.
figure 13

Round-in-place round function executing SubBytes and SubWord in parallel for \(i=4\).

7.3 Constructing a Low-Width S-Box Circuit

To improve the circuit, we constructed a new AES S-box circuit with only \(60+10\) ancilla qubits. The circuit is suitable for the round-in-place zig-zag architecture with AND gates.

Next, we present this construction method, which always allows the use of the minimum number of qubits without increasing the T-depth. Our approach first satisfies the maximum parallel count of AND gates in different layers.

  • For the structure of AES S-box with T-depth 4 in [12], 8 input qubits \(u_0, \ldots , u_7\), and 8 output qubits \(s_0, \ldots , s_7\).

  • We consider that the target qubit of each AND gate is allocated as . \(q_0, q_1, \ldots , q_{33}\) are allocated for each target qubit. There are 9, 3, 4, and 18 target qubits in T-depth 1, 2, 3, and 4, respectively.

  • Because the layer in T-depth 4 contains the most AND gates, we must satisfy its parallelism first. Thus, \(q_{34}, q_{35}, \ldots , q_{50}\) are allocated as the inputs of these AND gates.

  • However, the AND gates in T-depth 4 still cannot be implemented in parallel. The main reason is that nine qubits are included in two AND gates at the same time.

  • We have to allocate nine qubits \(q_{51}, q_{52}, \ldots , q_{59}\) to restore these qubits. Thus, there are 60 ancilla qubits \(q_0, q_1, \ldots , q_{59}\) in T-depth 4. Note that \(q_{51}, q_{52}, \ldots , q_{59}\) can be reset as after these AND gates and be used in other AND gates.

  • In other layers, AND gates do not require any more ancilla qubits. The lower bound on the number of ancilla qubits is 60.

  • For the 4-th AND layer, 18 ancilla qubits are required for the AND gates. Apart from 8 qubits from \(s_0, s_1, \ldots , s_7\), we need to allocate 10 ancilla qubits.

  • The final number of ancilla qubits of the S-box is 70.

After executing the S-box, we need to execute S-box\(^\dagger \) to clean up the ancilla qubits, which corresponds to \(U_f\) in Fig. 9. We estimate the resource of S-box and S-box\(^\dagger \) (cf. Table 16). Compared with the S-box circuit with 83 ancilla qubits, the new circuit requires more gates and depth. Therefore, this S-box circuit does not optimize the complexity of Grover’s algorithm.

Table 16. Implementation of \(U_f\) (S-box and S-box\(^{\dagger }\)) based on the AND gates. Here #M counts the number of Measurements.

Figure 10 shows how to transform \(U_f\) into \(U_{f^{-1}}\), which requires 42 additional CNOT gates and 4 X gates based on \(U_f\). Thus, \(U_{f^{-1}}\) requires \(688+42=730\) CNOT gates and \(220+4=224\) 1qClifford gates.

In the previous work [12, 29], there are two types of S-boxes. The first type is used in SubBytes, called the \(\mathcal {C}^0\)-circuit, which maps to . The second type is used in SubWord, called the \(\mathcal {C}^{*}\)-circuit, which maps to . We follow a unified principle to design the circuit of the AES S-box. If the output qubits \(s_0, s_1, \ldots , s_7\) are only used to save the output, the S-box is suitable for both \(\mathcal {C}^0\)- and \(\mathcal {C}^{*}\)-circuits. Note that for the \(\mathcal {C}^{*}\)-circuit, 18 ancilla qubits are also required for the AND gates. As 8 qubits from \(s_0, s_1, \ldots , s_7\) can not be used. we need to allocate 18 ancilla qubits.

7.4 Applying New S-Box Circuit into the Architecture in [12]

With reference to Fig. 12, we can calculate the resources for each round of AES. We take AES-128 as an example. Note that there is no MixColumns operation in the last round of AES. In SubBytes\(^1\), there are 16 \(U_f\). In SubBytes\(^{-1}\), there are 16 \(U_{f^{-1}}\). In MixColumns, there are \(92 \times 4 = 368\) CNOT gates. In AddRoundKey, there are 128 CNOT gates. In the Key Schedule, there are 4 \(U_f\), and at most 4 X gates for Rcon. Therefore, one round of AES requires 20 \(U_f\), 16 \(U_{f^{-1}}\), \(368+128=496\) CNOT gates, and at most 4 X gates for Rcon. AES-128 requires \(10\times 20=200\) \(U_f\), \(10\times 16=160\) \(U_{f^{-1}}\), \(10\times 496 - 368 =4592\) CNOT gates, and \(8\times 1+2\times 4=16\) X gates.

Next, based on the number of iterations i (\(i=1,2,4,8,16\)), different trade-offs of width/T-depth are provided in Table 17. We describe how to calculate the number of ancilla qubits and T-depth.

Table 17. Different trade-offs of width/T-depth for quantum circuit of AES-128.

For the case of \(i=1, 2\), we execute the two-part round function (cf. Fig. 12), which allows the key schedule to require only \(\frac{2}{i} \times (60+18)\) ancilla qubits. In round function, SubBytes\(^1\) and SubBytes\(^{-1}\) require \(\frac{16}{i} \times (60+10)\) ancilla qubits. The number of ancilla qubits is \(\frac{2}{i} \times 78 + \frac{16}{i} \times 70\). Then, 256 input qubits and \(\frac{128}{i}\) output qubits are needed. For each round, the T-depth is \(4\times 2\times i = 8i\). Therefore, T-depth of AES-128 is \(10 \times 8i = 80i\).

For the case of \(i=4,8, 16\), we execute the round function like Fig. 13. The key schedule requires 78 ancilla qubits. In round function, SubBytes\(^1\) and SubBytes\(^{-1}\) require \(\frac{16}{i} \times 70\) ancilla qubits. The number of ancilla qubits is \(78 + \frac{16}{i} \times 70\). Then, 256 input qubits and \(\frac{128}{i}\) output qubits are needed. The T-depth is \(10 \times i \times 2 \times 4 = 80i\).

Furthermore, we can also make an interleaved execution of the S-boxes between key schedule and round functions, reducing the number of additional qubits by increasing the T-depth (cf. Fig. 14). However, this method does not affect the lowest DW-cost. Therefore, we compared this approach with the circuits used in [12] in Fig. 1.

Fig. 14.
figure 14

Round-in-place round function executing SubBytes and SubWord serially for \(i=4\).

Finally, AES-128 can be implemented by the round-in-place round function with the lower DW-cost 132,800, while the previous best result is 204,800 in [12]. For AES-192, we achieve a circuit with DW-cost (width \(\times \) T-depth) \(1,724 \times 96 = 165,504\). For AES-256, we achieve a circuit with DW-cost (width \(\times \) T-depth) \(1,788 \times 112 = 200,256\).

8 Conclusion

In this paper, we investigated the optimization of quantum circuits for AES variants (-128, -192, -256). We provided an improved structure of the S-box based on the m-XOR technique and combined the S-box and S-box\(^\dag \) based on the share technique. Then, we introduce the implementations of the AES components. Next, we estimated the required resources based on the pipelined and zig-zag architectures with Toffoli gates and AND gates. The combined pipeline architecture reduces the depth and the number of qubits required for various quantum circuits. Although our circuits perform well in the quantum case, we believe that further improvements are possible by exploiting the structure of the S-box. If a superior S-box circuit is proposed, our method can be immediately applied to reduce the complexity of the AES quantum circuit.