1 Introduction

Machine learning techniques with artificial neural networks are ubiquitous in modern society as the ability to make reliable predictions from vast amounts of data is essential in various domains of science and technology. A convolutional neural network (CNN) is one such example, especially for data with a large number of features. It effectively captures spatial correlation within data and learns important features (Lecun et al. 1998), which has proven useful for many pattern recognition problems, such as image classification, signal processing, and natural language processing (LeCun et al. 2015). It has also opened the path to Generative Adversarial Networks (GANs) (Goodfellow et al. 2014). CNNs are also emerging as a useful tool for scientific research, such as in high-energy physics (Aurisano et al. 2016; Acciarri et al. 2017), gravitational wave detection (George and Huerta 2018) and statistical physics (Tanaka and Tomiya 2017). Naturally, the computational power required for the success of machine learning algorithms increases with the volume of data, which is growing at an overwhelming rate. With the potential of quantum computers to outperform any foreseeable classical computers for solving certain computational tasks, quantum machine learning (QML) has emerged as a potential solution to address the challenge of handling an ever-increasing amount of data. For example, several innovative quantum machine learning algorithms have been proposed to offer speedups over their classical counterparts (Lloyd et al. 2013; 2014; Wiebe et al. 2015; Rebentrost et al. 2014; Kerenidis et al. 2019; Blank et al. 2020). Motivated by the benefits of CNN and the potential power of QML, quantum convolutional neural network (QCNN) algorithms have been developed (Cong et al. 2019; Kerenidis et al. 2019; Liu et al. 2021; Henderson et al. 2020; Chen et al. 2020; Li et al. 2020; MacCormack et al. 2020; Wei et al. 2021; Mangini et al. 2021) (see Appendix ?? for a brief summary and comparison of other approaches to QCNN). Previous constructions of QCNN have reported success either in developing efficient quantum arithmetic operations that exactly implement the basic functionalities of classical CNN or in developing parameterized quantum circuits inspired by key characteristics of CNN. While the former likely requires fault-tolerant quantum devices, the latter has focused on quantum data classification. In particular, Cong et al. proposed a fully parameterized quantum circuit (PQC) architecture inspired by CNN and demonstrated its success for some quantum many-body problems (Cong et al. 2019). However, a study of fully parameterized QCNN for performing pattern recognition, such as classification, on classical data has been missing.

In this work, we present a fully parameterized quantum circuit model for QCNN that solves supervised classification problems on classical data. In a similar vein to Cong et al. (2019), our model only uses two-qubit interactions throughout the entire algorithm in a systematic way. The PQC models—also known as variational quantum circuits (Cerezo et al. 2021)—are attractive since they are expected to be suitable for Noisy Intermediate-Scale Quantum (NISQ) hardware (Preskill 2018; Bharti et al. 2021). Another advantage of QCNN models for NISQ computing is their intrinsically shallow circuit depth. Furthermore, the QCNN models studied in this work exploit entanglement, which is a global property, and hence have the potential to transcend classical CNN, which can only capture local correlations. We benchmark the performance of the parameterized QCNN with respect to several variables, such as quantum data encoding methods, structures of parameterized quantum circuits, cost functions, and optimizers, using two standard datasets, namely MNIST and Fashion MNIST, on Pennylane (Bergholm et al. 2020). The quantum encoding benchmark also examines classical dimensionality reduction methods, which are essential for early quantum computers with a limited number of logical qubits. The various QCNN models tested in this work employ a small number of free parameters, ranging from 12 to 51. Nevertheless, all QCNN models produced high classification accuracy, with the best case being about 99% for MNIST and about 94% for Fashion MNIST. Moreover, we discuss a QCNN model that only requires nearest neighbor qubit interactions, which is a desirable feature for NISQ computing. Comparing classification performances of QCNN and CNN models shows that QCNN is more favorable than CNN under similar training conditions for both benchmarking datasets.

The remainder of the paper is organized as follows. Section 2 sets the theoretical framework of this work by describing the classification problem, the QCNN algorithm, and various methods for encoding classical data as a quantum state. Section 3 describes variables of the QCNN model, such as parameterized quantum circuits, cost functions, and classical data pre-processing methods, that are subject to our benchmarking study. Section 4 compares and presents the performance of various designs of QCNN for binary classification of MNIST and Fashion MNIST datasets. Conclusions are drawn and directions for future work are suggested in Section 5.

2 Theoretical framework

2.1 Classification

Classification is an example of pattern recognition, which is a fundamental problem in data science that can be effectively addressed via machine learning. The goal of L-class classification is to infer the class label of an unseen data point \(\tilde{x}\), given a labelled dataset \(D = \{(x_{1},y_{1}),\ldots,(x_{M},y_{M})\}\), where \(x_{i}\) is a data point and \(y_{i}\in\{0,\ldots,L-1\}\) is its class label.

The classification problem can be solved by training a parameterized quantum circuit. Hereinafter, we refer to fully parameterized quantum circuits trained for machine learning tasks as quantum neural networks (QNNs). For this supervised classification task, a QNN is trained by optimizing the parameters of quantum gates so as to minimize the cost function

$$ C(\boldsymbol{\theta}) = {\sum}_{i=1}^{M} \alpha_{i} c(y_{i},f(x_{i},\boldsymbol{\theta}))$$

where f(xi,𝜃) is the machine learning model defined by the set of parameters 𝜃 that predicts the label of xi, c(a,b) quantifies the dissimilarity between a and b, and αi is a weight that satisfies \({\sum }_{i=1}^{M}\alpha _{i}=1\). After the training is finished, the class label for the unseen data point \(\tilde {x}\) is determined as \( \tilde {y} = f(\tilde {x},\boldsymbol {\theta }^{*}),\) where \(\boldsymbol {\theta }^{*} = \arg \min \limits _{\boldsymbol {\theta }} C(\boldsymbol {\theta })\). If the problem is restricted to binary classification (i.e., L = 2), the class label can be inferred from a single-qubit von Neumann measurement. For example, the sign of an expectation value of the σz observable can represent the binary label (Park et al. 2021). Hereinafter, we focus on binary classification, although potential future work towards multi-class classification is discussed in Section 5.
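As a minimal illustration of the weighted cost above and of label inference from a single-qubit measurement, the following Python sketch (not taken from the authors' implementation) assumes a generic model function f, a dissimilarity measure c, and weights alphas:

```python
import numpy as np

def weighted_cost(theta, xs, ys, f, alphas, c):
    """C(theta) = sum_i alpha_i * c(y_i, f(x_i, theta)), with the alphas summing to one."""
    return sum(a * c(y, f(x, theta)) for a, x, y in zip(alphas, xs, ys))

def predict_binary(expval_z):
    """Infer the binary label from the sign of the single-qubit <sigma_z> expectation value."""
    return 0 if expval_z >= 0 else 1
```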

2.2 Quantum convolutional neural network

An interesting family of quantum neural networks utilizes tree-like (or hierarchical) structures (Grant et al. 2018) in which the number of qubits from a preceding layer is reduced by a factor of two in the subsequent layer. Such architectures consist of \(O(\log (n))\) layers for n input qubits, thereby permitting shallow circuit depth. Moreover, they can avoid one of the most critical problems in PQC-based algorithms, known as the “barren plateau” problem, thereby guaranteeing trainability (Pesah et al. 2021). These structures also make a natural connection to tensor networks, which serve as a useful ground for exploring many-body physics, neural networks, and the interplay between them.

The progressive reduction of the number of qubits is analogous to the pooling operation in CNN. A distinct feature of the QCNN architecture is translational invariance, which forces the blocks of parameterized quantum gates to be identical within a layer. The quantum state resulting from the i th layer of QCNN can be expressed as

$$ \left|{\psi_{i}(\boldsymbol{\theta}_{i})}\rangle\langle{\psi_{i}(\boldsymbol{\theta}_{i})}\right| = \text{Tr}_{B_{i}}(U_{i}(\boldsymbol{\theta}_{i})\left|{\psi_{i-1}}\rangle\langle{\psi_{i-1}}\right|U_{i}(\boldsymbol{\theta}_{i})^{\dagger}), $$
(1)

where \(\text {Tr}_{B_{i}}(\cdot )\) is the partial trace operation over subsystem \(B_{i}\in \mathbb {C}^{2^{n/2^{i}}}\), Ui is the parameterized unitary gate operation that includes the quantum convolution and the gate part of pooling, and \(|{\psi_{0}}\rangle = |{0}\rangle^{\otimes n}\). Following the existing nomenclature, we refer to the structure (or template) of the parameterized quantum circuit as ansatz. In our QCNN architecture, Ui always consists of two-qubit quantum circuit blocks, and the convolution and pooling parts each use identical quantum circuit blocks within a given layer. Since a two-qubit gate requires at most 15 parameters (Vatan and Williams 2004), in the i th layer, which consists of li > 0 independent convolutional filters and one pooling operation, the maximum number of parameters subject to optimization is 15(li + 1). Then, the total number of parameters is at most \(15{\sum }_{i=1}^{\log _{2}(n)}(l_{i}+1)\) if the convolution and pooling operations are iterated until only one qubit remains. One can also consider an interesting hybrid architecture in which the QCNN layers are stacked until m qubits remain and then a classical neural network takes over from the m-qubit measurement outcomes. In this case, the number of quantum circuit parameters is less than the maximum number given above. Usually, li is set to be a constant. Therefore, the number of parameters subject to optimization grows as \(O(\log (n))\), which is an exponential reduction compared to the general hierarchical structure discussed in Ref. Grant et al. (2018). This also implies that the number of parameters can be suppressed doubly-exponentially with the size of classical data if the exponentially large state space is fully utilized for encoding the classical data. An example quantum circuit for a QCNN algorithm with eight qubits for binary classification is depicted in Fig. 1. Generalizing Fig. 1 to larger systems can be done simply by connecting all neighboring qubits with the two-qubit parameterized gates in a translationally invariant way.
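To make the layer structure concrete, the following PennyLane sketch builds an 8-qubit QCNN of the kind described above. The two-qubit conv_block and pool_block ansatze are placeholders rather than the specific circuits benchmarked later, and the parameter shape is an illustrative assumption (two convolution angles and one pooling angle per layer):

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 8
dev = qml.device("default.qubit", wires=n_qubits)

def conv_block(p, wires):
    # placeholder two-qubit convolution ansatz (2 parameters)
    qml.RY(p[0], wires=wires[0])
    qml.RY(p[1], wires=wires[1])
    qml.CNOT(wires=wires)

def pool_block(p, wires):
    # placeholder pooling ansatz: a controlled rotation; the control qubit is then discarded
    qml.CRZ(p[0], wires=wires)

@qml.qnode(dev)
def qcnn(x, params):
    qml.AmplitudeEmbedding(x, wires=range(n_qubits), normalize=True)  # 2**8 = 256 features
    active = list(range(n_qubits))
    for layer in params:
        # convolution: the same block on all neighbouring pairs (periodic boundary, translational invariance)
        for i in range(len(active)):
            conv_block(layer[:2], wires=[active[i], active[(i + 1) % len(active)]])
        # pooling: the same block on each pair; only the second qubit of each pair is kept
        kept = []
        for i in range(0, len(active), 2):
            pool_block(layer[2:], wires=[active[i], active[i + 1]])
            kept.append(active[i + 1])
        active = kept
    return qml.expval(qml.PauliZ(active[0]))  # single-qubit readout for binary classification

params = np.array(np.random.uniform(0, 2 * np.pi, (3, 3)), requires_grad=True)
expval = qcnn(np.random.rand(2 ** n_qubits), params)  # example call with random features
```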

Fig. 1
figure 1

A schematic of the QCNN algorithm for an example of 8 input qubits. The quantum circuit consists of three parts: quantum data encoding (green rectangle), convolutional filters (blue rounded rectangle), and pooling (red circle). The quantum data encoding is fixed in a given structure of QCNN, while the convolutional filter and pooling use parameterized quantum gates. There are three layers in this example, and in each layer, multiple convolutional filters can be applied. The number of filters for i th layer is denoted by li. In each layer, the convolutional filter applies the same two-qubit ansatz to nearest neighbor qubits in a translationally invariant way. Similarly, pooling operations within the layer use the same ansatz. In this example, the pooling operation is represented as a controlled unitary transformation, which is activated when the control qubit is 1. However, general controlled operations can also be considered. The measurement outcome of the quantum circuit is used to calculate the user-defined cost function. The classical computer is used to compute the new set of parameters based on the gradient, and the quantum circuit parameters are updated accordingly for the subsequent round

The optimization of the gate parameters can be carried out by iteratively updating the parameters based on the gradient of the cost function until some termination condition is reached. The cost function gradient can be calculated classically or on a quantum computer via the parameter-shift rule (Li et al. 2017; Mitarai et al. 2018; Schuld et al. 2019). When the parameter-shift rule is used, QCNN requires an exponentially smaller number of quantum circuit executions compared to the general hierarchical structures inspired by tensor networks (e.g., the tree tensor network) in Ref. Grant et al. (2018): while the latter uses O(n) runs, the former only uses \(O(\log (n))\) runs.
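A minimal sketch of the parameter-shift rule for a single rotation angle is given below; circuit stands for any QNode such as the hypothetical qcnn above, and index selects one entry of the parameter array:

```python
import numpy as np

def parameter_shift(circuit, x, params, index, shift=np.pi / 2):
    """Gradient of circuit(x, params) w.r.t. params[index] for gates of the form exp(-i*theta/2*P)."""
    plus, minus = params.copy(), params.copy()
    plus[index] += shift
    minus[index] -= shift
    return (circuit(x, plus) - circuit(x, minus)) / (2 * np.sin(shift))
```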

2.3 Quantum data encoding

Many machine learning techniques transform input data \(\mathcal {X}\) into a different space to make it easier to work with. This transformation \(\phi : \mathcal {X} \rightarrow \mathcal {X}^{\prime }\) is often called a feature map. In quantum computing, the same analogy can be applied to perform a quantum feature map, which acts as \(\phi : \mathcal {X} \rightarrow {\mathscr{H}}\) where the vector space \({\mathscr{H}}\) is a Hilbert space (Schuld and Killoran 2019). In fact, such feature mapping is mandatory when one uses quantum machine learning on classical data, since classical data must be encoded as a quantum state (Rebentrost et al. 2014; Lloyd et al. 2014; Giovannetti et al. 2008; Park et al. 2019; Veras et al. 2020; Araujo et al. 2021). The quantum feature map \(x \in \mathcal {X} \rightarrow |{\phi (x)}\rangle \in {\mathscr{H}}\) is equivalent to applying a unitary transformation Uϕ(x) to the initial state \(|{0}\rangle^{\otimes n}\) to produce \(U_{\phi}(x)|{0}\rangle^{\otimes n} = |{\phi(x)}\rangle\), where n is the number of qubits. This refers to the green rectangle in Fig. 1.

There exist numerous structures of Uϕ(x) to encode the classical input data x into a quantum state. In this work, we benchmark the performance of the QCNN algorithm with several different quantum data encoding techniques. These techniques are explained in detail in this section.

2.3.1 Amplitude encoding

One of the most general approaches to encoding classical data as a quantum state is to associate normalized input data with the probability amplitudes of a quantum state. This encoding scheme is known as the amplitude encoding (AE). The amplitude encoding represents input data x = (x1,...,xN)T of dimension \(N = 2^{n}\) as amplitudes of an n-qubit quantum state |ϕ(x)〉 as

$$ U_{\phi}(x) : x \in \mathbb{R}^{N} \rightarrow |{\phi(x)}\rangle = \frac{1}{\|x\|} {\sum}_{i=1}^{N} x_{i} |{i}\rangle, $$
(2)

where |i〉 is the i th computational basis state. Clearly, with amplitude encoding, a quantum computer can represent exponentially many classical data. This can be of great advantage in QCNN algorithms. Since the number of parameters subject to optimization scales as \(O(\log (n))\) (see Section 2.2), the amplitude encoding reduces the number of parameters doubly-exponentially with the size (i.e., dimension) of the classical data. However, the quantum circuit depth for amplitude encoding usually grows as O(poly(N)), although there exists a method to reduce it to \(O(\log (N))\) at the cost of increasing the number of qubits to O(N) (Araujo et al. 2021).
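A minimal PennyLane sketch of amplitude encoding for an 8-qubit register is shown below; the random input is only a stand-in for pre-processed image features:

```python
import pennylane as qml
import numpy as np

n_qubits = 8
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def amplitude_encode(x):
    # x must contain 2**n_qubits = 256 features; normalization is handled by the template
    qml.AmplitudeEmbedding(x, wires=range(n_qubits), normalize=True)
    return qml.state()

state = amplitude_encode(np.random.rand(2 ** n_qubits))
```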

2.3.2 Qubit encoding

The computational overhead of amplitude encoding motivates qubit encoding, which uses a constant quantum circuit depth at the cost of O(N) qubits. The qubit encoding embeds one classical feature xi, rescaled to lie between 0 and π, into a single qubit as \(|{\phi(x_{i})}\rangle = \cos(\frac{x_{i}}{2})|{0}\rangle + \sin(\frac{x_{i}}{2})|{1}\rangle\) for i = 1,...,N. Hence, the qubit encoding maps input data x = (x1,…,xN)T to N qubits as

$$ \begin{array}{@{}rcl@{}} &&U_{\phi}(x) : x \in \mathbb{R}^{N} \rightarrow |{\phi(x)}\rangle \\&&= \bigotimes_{i=1}^{N} (\cos(\frac{x_{i}}{2})|{0}\rangle + \sin(\frac{x_{i}}{2})|{1}\rangle), \end{array} $$
(3)

where xi ∈ [0,π) for all i. The encoding circuit can be expressed with a unitary operator \(U_{\phi }(x) = \bigotimes _{j=1}^{N} U_{x_{j}}\) where

$$ U_{x_{j}} = e^{-i \frac{x_{j}}{2} \sigma_{y}} := \begin{bmatrix} \cos(\frac{x_{j}}{2}) & -\sin(\frac{x_{j}}{2}) \\ \sin(\frac{x_{j}}{2}) & \cos(\frac{x_{j}}{2}) \end{bmatrix}. $$
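In PennyLane, this encoding amounts to a single R_y rotation per qubit, as in the following sketch (the feature values are assumed to be rescaled to [0, π) beforehand):

```python
import pennylane as qml
import numpy as np

n_features = 8
dev = qml.device("default.qubit", wires=n_features)

@qml.qnode(dev)
def qubit_encode(x):
    # R_y(x_i) on qubit i prepares cos(x_i/2)|0> + sin(x_i/2)|1>
    qml.AngleEmbedding(x, wires=range(n_features), rotation="Y")
    return qml.state()

state = qubit_encode(np.random.uniform(0, np.pi, n_features))
```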

2.3.3 Dense qubit encoding

In principle, since a quantum state of one qubit can be described with two real-valued parameters, two classical data points can be encoded in one qubit. Thus, the qubit encoding described above can be generalized to encode two classical features per qubit by using rotations around two orthogonal axes (LaRose and Coyle 2020). By choosing them to be the x and y axes of the Bloch sphere, this method, which we refer to as dense qubit encoding, encodes \(x_{j} = (x_{j_{1}}, x_{j_{2}})\) into a single qubit as

$$|{\phi(x_{j})}\rangle = e^{-i\frac{x_{j_{2}}}{2}\sigma_{y}}e^{-i\frac{x_{j_{1}}}{2}\sigma_{x}} |{0}\rangle.$$

Hence, the dense qubit encoding maps an N-dimensional input data x = (x1,…,xN)T to N/2 qubits as

$$ \begin{array}{@{}rcl@{}} &&U_{\phi}(x) : x \in \mathbb{R}^{N} \rightarrow |{\phi(x)}\rangle \\&&= \bigotimes_{j=1}^{N/2} \left( e^{-i\frac{x_{N/2+j}}{2}\sigma_{y}}e^{-i\frac{x_{j}}{2}\sigma_{x}} |{0}\rangle \right). \end{array} $$
(4)

Note that there is freedom to choose which pair of classical data to be encoded in one qubit. In this work, we chose the pairing as shown in Eq. (4), but one may choose to encode x2j− 1 and x2j in j th qubit.
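A sketch of dense qubit encoding following the pairing of Eq. (4), i.e., feature x_j and feature x_{N/2+j} on qubit j, is:

```python
import pennylane as qml
import numpy as np

n_features = 16
n_qubits = n_features // 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def dense_encode(x):
    for j in range(n_qubits):
        qml.RX(x[j], wires=j)              # first feature of the pair
        qml.RY(x[n_qubits + j], wires=j)   # second feature of the pair
    return qml.state()

state = dense_encode(np.random.uniform(0, np.pi, n_features))
```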

2.3.4 Hybrid encoding

As shown in the previous sections, the amplitude encoding is advantageous when the quantum circuit width (i.e., the number of qubits) is considered, while the qubit encoding is advantageous when the quantum circuit depth is considered. These two encoding schemes represent the extreme ends of the quantum circuit complexities for loading classical data into a quantum system. In this section, we introduce simple hybrid encoding methods that compromise the quantum circuit complexity between these two extreme ends. In essence, the hybrid encoding applies the amplitude encoding to a number of independent blocks of qubits in parallel. Let us denote the number of qubits in each independent block that amplitude-encodes classical data by m, so that each block can encode \(O(2^{m})\) classical data, and let b denote the number of such blocks. Then, the quantum system of b blocks contains \(b2^{m}\) classical data. The first hybrid encoding, which we refer to as hybrid direct encoding (HDE), can be expressed as

$$ U_{\phi}(x) : x \!\in\! \mathbb{R}^{N} \rightarrow |{\phi(x)}\rangle = \bigotimes_{j=1}^{b} \left( \frac{1}{\|x\|_{j}} {\sum}_{i=1}^{2^{m}} x_{ij} |{i}\rangle_{j} \right). $$
(5)

Note that each block can have a different normalization constant, and hence, the amplitudes may not be a faithful representation of the data unless the normalization constants have similar values. To circumvent this problem, we also introduce hybrid angle encoding (HAE), which can be expressed as

$$ |{\phi(x)}\rangle = \bigotimes_{k=1}^{b}\! \left( \!{\sum}_{i=1}^{2^{m}} {\prod}_{j=0}^{m-1} \cos^{1-\mathtt{i}_{j}}\!\left( x_{g(j),k}\right)\sin^{\mathtt{i}_{j}}\!\left( x_{g(j),k}\right) |{i}\rangle_{k}\! \right)\!, $$
(6)

where \(\mathtt{i} = \mathtt{i}_{m-1}{\cdots}\mathtt{i}_{0} \in \{0,1\}^{m}\) is the binary representation of i, with \(\mathtt{i}_{j}\) being the (j + 1)-th bit of the bit string, xj,k represents the j th element of the data assigned to the k th block of qubits, and \(g(j)=2^{j}+{\sum }_{l=0}^{j-1}\mathtt {i}_{l}2^{l}\). In this case, having b blocks of m qubits allows \(b(2^{m} - 1)\) classical data to be encoded. The performance of these hybrid encoding methods will be compared in Section 4.

Since the hybrid methods are parallelized, the quantum circuit depth is reduced to \(O(2^{m})\) where m < N, while the number of qubits is \(O(mN/2^{m})\). Therefore, the hybrid encoding algorithms use fewer qubits than the qubit encoding and shallower quantum circuits than the amplitude encoding. Finding the best trade-off between the quantum circuit width and depth (i.e., the choice of m) depends on the specific details of the given quantum hardware.
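The following sketch implements hybrid direct encoding for the b = 2, m = 4 configuration used in Section 4 (32 features on 8 qubits). MottonenStatePreparation is used here as one way to amplitude-encode each block, with per-block normalization as in Eq. (5); it is an illustrative choice rather than the authors' implementation:

```python
import pennylane as qml
import numpy as np

b, m = 2, 4                               # two blocks of four qubits: 2 * 2**4 = 32 features
dev = qml.device("default.qubit", wires=b * m)

@qml.qnode(dev)
def hybrid_direct_encode(x):
    for k in range(b):
        block = np.asarray(x[k * 2 ** m:(k + 1) * 2 ** m], dtype=float)
        block = block / np.linalg.norm(block)          # each block is normalized independently
        qml.MottonenStatePreparation(block, wires=range(k * m, (k + 1) * m))
    return qml.state()

state = hybrid_direct_encode(np.random.rand(b * 2 ** m))
```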

3 Benchmark variables

3.1 Ansatz

An important step in the construction of a QCNN model is the choice of ansatz. In general, the QCNN structure is flexible enough to use an arbitrary two-qubit unitary operation for each convolutional filter and each pooling step. However, we constrain our design such that all convolutional filters use the same ansatz, and likewise all pooling operations use the same ansatz (which may differ from the convolutional one). We later show that the QCNN with fixed ansatze provides excellent results for the benchmarking datasets. While using a different ansatz for every filter could be an interesting attempt at further improvement, it would increase the number of parameters to be optimized.

In the following, we introduce the set of convolutional and pooling ansatze (i.e., parameterized quantum circuit templates) used in our QCNN models.

3.1.1 Convolution filter

Parameterized quantum circuits for convolutional layers in QCNN are composed of different configurations of single-qubit and two-qubit gate operations. Most circuit diagrams in Fig. 2 are inspired by past studies. For instance, circuit 1 is used as the parameterized quantum circuit for training a tree tensor network (TTN) (Grant et al. 2018). Circuits 2, 3, 4, 5, 7, and 8 are taken from the work by Sim et al. (2019), which analyzes the expressibility and entangling capability of four-qubit parameterized quantum circuits. We modified these quantum circuits to two-qubit forms to utilize them as building blocks of the convolutional layer, which always consists of two qubits. Circuits 7 and 8 are reduced versions of circuits that recorded the best expressibility in that study. Circuit 2 is a two-qubit version of the quantum circuit that exhibited the best entangling capability. Circuits 3, 4 and 5 are drawn from circuits that balance expressibility and entangling capability. Circuit 6 was developed as a candidate two-body Variational Quantum Eigensolver (VQE) entangler in Ref. Parrish et al. (2019). This circuit is also known to implement an arbitrary SO(4) gate (Wei and Di 2012). In fact, a full VQE entangler can be constructed by linearly arranging the SO(4) gates across the input qubits. Since this structure is similar to the structure of convolutional layers in QCNN, the SO(4) gate is a natural candidate for the convolutional layer. Circuit 9 represents the parameterization of an arbitrary SU(4) gate (Vatan and Williams 2004; MacCormack et al. 2020).
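As an illustration, a 15-parameter two-qubit block in the spirit of convolutional circuit 9 can be written in PennyLane as follows. It follows the usual layout of the Vatan–Williams construction (single-qubit gates around a three-CNOT core) and is a sketch rather than a reproduction of the exact circuit in Fig. 2:

```python
import pennylane as qml

def conv_circuit_su4(p, wires):
    """A 15-parameter two-qubit ansatz covering SU(4) (sketch of convolutional circuit 9)."""
    qml.U3(p[0], p[1], p[2], wires=wires[0])
    qml.U3(p[3], p[4], p[5], wires=wires[1])
    qml.CNOT(wires=[wires[0], wires[1]])
    qml.RY(p[6], wires=wires[0])
    qml.RZ(p[7], wires=wires[1])
    qml.CNOT(wires=[wires[1], wires[0]])
    qml.RY(p[8], wires=wires[0])
    qml.CNOT(wires=[wires[0], wires[1]])
    qml.U3(p[9], p[10], p[11], wires=wires[0])
    qml.U3(p[12], p[13], p[14], wires=wires[1])
```

A block like this could be dropped in as the conv_block placeholder in the earlier QCNN sketch.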

Fig. 2
figure 2

Parameterized quantum circuits used in the convolutional layer. Ri(𝜃) is a rotation around the i axis of the Bloch sphere by an angle of 𝜃, and H is the Hadamard gate. U3(𝜃,ϕ,λ) is an arbitrary single-qubit gate that can be expressed as U3(𝜃,ϕ,λ) = Rz(ϕ)Rx(−π/2) Rz(𝜃)Rx(π/2)Rz(λ). (a) Convolutional circuit 1. (b) Convolutional circuit 2. (c) Convolutional circuit 3. (d) Convolutional circuit 4. (e) Convolutional circuit 5. (f) Convolutional circuit 6. (g) Convolutional circuit 7. (h) Convolutional circuit 8. (i) Convolutional circuit 9

3.1.2 Pooling

The pooling layer applies parameterized quantum gates to two qubits and traces out one of the qubits to reduce the two-qubit state to a one-qubit state. Similar to the choice of ansatz for the convolutional filter, there exists a variety of choices of two-qubit circuits for the pooling layer. In this work, we choose a simple two-qubit circuit with two free parameters for the pooling layer. The circuit is shown in Fig. 3.

Fig. 3
figure 3

Parameterized quantum circuit used in the pooling layer. The pooling layer applies two controlled rotations, Rz(𝜃1) activated when the control qubit is 1 (filled circle) and Rx(𝜃2) activated when the control qubit is 0 (open circle). The control (first) qubit is traced out after the gate operations in order to reduce the dimension
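A sketch of this pooling block in PennyLane, using an X gate to flip the control so that the second rotation acts on the 0-branch, is:

```python
import pennylane as qml

def pooling_ansatz(p, wires):
    """Pooling block of Fig. 3 (sketch); wires = [control, target], the control is later discarded."""
    qml.CRZ(p[0], wires=wires)       # R_z(theta_1) on the target when the control is |1>
    qml.PauliX(wires=wires[0])
    qml.CRX(p[1], wires=wires)       # R_x(theta_2) on the target when the control is |0>
    qml.PauliX(wires=wires[0])
```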

Application of the parameterized gates in the pooling step in conjunction with the convolutional circuit 9 might be redundant since it is already an arbitrary SU(4) gate. Thus, for the convolutional circuit 9, we test two QCNN constructions, with and without the parameterized two-qubit circuit in the pooling layer. In the latter, the pooling layer only consists of tracing out one qubit.

3.2 Cost function

The variational parameters of the ansatz are updated to minimize the cost function calculated on the training dataset. In this benchmark study, we test the performance of QCNN models with two different cost functions, namely the mean squared error and the cross-entropy loss.

3.2.1 Mean squared error

Before training QCNN, we map original class labels of {0,1} to {1,− 1} respectively to associate them with the eigenvalues of the qubit observables. Then, the mean squared error (MSE) between predictions and class labels becomes

$$ C(\boldsymbol{\theta}) = \frac{1}{N} {\sum}_{i=1}^{N} (\hat{M_{z}}(\psi_{i}(\boldsymbol{\theta})) - \tilde{y}_{i})^{2}, $$
(7)

where \(\hat {M_{z}}(\psi _{i})=\langle \psi _{i}|\sigma _{z}|\psi _{i}\rangle \) is the Pauli-Z expectation value of the one-qubit state extracted from QCNN for the i th training data, and \(\tilde {y}_{i}\) ∈{1,− 1} is the label of the corresponding training data (i.e., \(\tilde {y}_{i} = 1-2y_{i}\)). Since QCNN performs a single-qubit measurement in the Z basis, the final state can be thought of as a mixed state ai|0〉〈0| + bi|1〉〈1|. Then, minimizing the cost function above with respect to 𝜃 corresponds to forcing ai to be as large as possible relative to bi if the i th training data is labelled 0, and vice versa if it is labelled 1.
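A sketch of this cost in code, assuming a hypothetical QNode qcnn_expval(x, params) that returns the σz expectation value of the readout qubit:

```python
from pennylane import numpy as np

def mse_cost(params, xs, ys, qcnn_expval):
    """Eq. (7): labels {0, 1} are remapped to {+1, -1} and compared with <sigma_z>."""
    preds = np.array([qcnn_expval(x, params) for x in xs])
    targets = 1 - 2 * np.array(ys)            # y=0 -> +1, y=1 -> -1
    return np.mean((preds - targets) ** 2)
```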

3.2.2 Cross-entropy loss

Cross-entropy loss is widely used in training classical neural networks. It measures the performance of a classification model whose output is a probability between 0 and 1. Due to the probabilistic nature of quantum mechanics, one can construct a cross-entropy loss from the probabilities of measuring the computational basis states in the single-qubit measurement of QCNN. The cross-entropy loss over the training data can be expressed as

$$ \begin{array}{@{}rcl@{}} C(\boldsymbol{\theta}) &=& -{\sum}_{i=1}^{N}\left[y_{i}\log\left( \Pr[\psi_{i}(\boldsymbol{\theta}) = 1]\right)\right. \\&&\left.+\, (1-y_{i})\log\left( \Pr[\psi_{i}(\boldsymbol{\theta}) = 0]\right)\right], \end{array} $$
(8)

where yi ∈{0,1} is the class label and \(\Pr [\psi _{i}(\boldsymbol {\theta }) = y_{i}]\) is the probability of measuring the computational basis state |yi〉 from the QCNN circuit.
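A corresponding sketch, assuming a hypothetical QNode qcnn_probs(x, params) that returns qml.probs on the readout qubit (so probs[0] and probs[1] are the probabilities of measuring |0〉 and |1〉):

```python
from pennylane import numpy as np

def cross_entropy_cost(params, xs, ys, qcnn_probs, eps=1e-12):
    """Eq. (8): binary cross-entropy over the measured probabilities of the readout qubit."""
    loss = 0.0
    for x, y in zip(xs, ys):
        probs = qcnn_probs(x, params)
        loss -= y * np.log(probs[1] + eps) + (1 - y) * np.log(probs[0] + eps)
    return loss
```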

3.3 Classical data pre-processing

The size of quantum circuits that can be reliably executed on NISQ devices is limited due to noise and the technical challenges of building quantum hardware. Thus, encoding schemes for high-dimensional data usually require a number of qubits beyond the current capabilities of quantum devices. Therefore, classical dimensionality reduction techniques will be useful in near-term applications of quantum machine learning. In this work, we pre-process data with three classical dimensionality reduction techniques, namely bilinear interpolation, principal component analysis (PCA) (Jolliffe 2002) and autoencoding (AutoEnc) (Goodfellow et al. 2016). For the simulations presented in the following section, amplitude encoding is used only with bilinear interpolation, while all other encoding schemes are tested with PCA and autoencoding. Bilinear interpolation and PCA are carried out by utilizing tf.image.resize from TensorFlow and sklearn.decomposition.PCA from scikit-learn, respectively. Autoencoders are capable of modelling complex non-linear functions, while PCA is a simple linear transformation with cheaper and faster computation. Since the pre-processing step should not produce too much computational resource overhead or result in overfitting, we train a simple autoencoder with one hidden layer. The data in the latent space (i.e., hidden layer) is then fed to the quantum circuits.
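The three pre-processing routes can be sketched as follows with the libraries named above; the target sizes (16×16 for bilinear interpolation, 8 or 16 latent features otherwise) are assumptions chosen to match the 8-qubit encodings discussed earlier, and the autoencoder hyperparameters are illustrative:

```python
import tensorflow as tf
from sklearn.decomposition import PCA

def resize_features(images):
    """Bilinear interpolation: 28x28 images -> 16x16 = 256 features (for amplitude encoding)."""
    resized = tf.image.resize(images, [16, 16], method="bilinear")   # images: (batch, 28, 28, 1)
    return resized.numpy().reshape(len(images), -1)

def pca_features(X, n_components=8):
    """PCA: keep the leading components (8 for qubit encoding, 16 for dense encoding)."""
    return PCA(n_components=n_components).fit_transform(X)           # X: (batch, 784)

def autoencoder_features(X, latent_dim=8, epochs=20):
    """A single-hidden-layer autoencoder; the latent activations are fed to the quantum circuit."""
    inputs = tf.keras.Input(shape=(784,))
    latent = tf.keras.layers.Dense(latent_dim, activation="sigmoid")(inputs)
    outputs = tf.keras.layers.Dense(784, activation="sigmoid")(latent)
    autoencoder = tf.keras.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X, X, epochs=epochs, batch_size=64, verbose=0)
    return tf.keras.Model(inputs, latent).predict(X)
```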

4 Simulation

4.1 QCNN results overview

This section reports the classical simulation results of the QCNN algorithm for binary classification carried out with Pennylane (Bergholm et al. 2020). The test is performed with two standard datasets, namely MNIST and Fashion MNIST, under various conditions as described in the previous section. Note that the MNIST and Fashion MNIST datasets are 28×28 image data, each with ten classes. Our benchmark focuses on binary classification, and hence, we select classes 0 and 1 for both datasets.

The variational parameters in the QCNN ansatze are optimized by minimizing the cost function with an optimizer provided in Pennylane (Bergholm et al. 2020). In particular, we tested the Adam (Kingma and Ba 2017) and Nesterov momentum (Nesterov 1983) optimization algorithms. At each iteration, we create a small batch by randomly selecting data from the training set. Compared to training on the full dataset, training on the mini-batch not only reduces simulation time but also helps the optimizer escape from local minima. For both the Adam and Nesterov momentum optimizers, the batch size was 25 and the learning rate was 0.01. We also fixed the number of iterations to 200 to speed up the training process. Note that the training can alternatively be stopped under different conditions, such as when the validation set accuracy does not increase for a predetermined number of consecutive runs (Grant et al. 2018). The numbers of training (test) data are 12665 (2115) and 12000 (2000) for the MNIST and Fashion MNIST datasets, respectively.
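A sketch of this training loop, reusing the hypothetical QNode and cost functions from the earlier sketches (batch size 25, learning rate 0.01, 200 iterations, Nesterov momentum):

```python
import pennylane as qml
from pennylane import numpy as np

def train_qcnn(X_train, y_train, init_params, cost_fn, n_iters=200, batch_size=25):
    opt = qml.NesterovMomentumOptimizer(stepsize=0.01)
    params = np.array(init_params, requires_grad=True)
    for _ in range(n_iters):
        idx = np.random.randint(0, len(X_train), (batch_size,))     # random mini-batch
        X_batch, y_batch = X_train[idx], y_train[idx]
        params, cost = opt.step_and_cost(lambda p: cost_fn(p, X_batch, y_batch), params)
    return params
```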

Tables 1 and 2 show the mean classification accuracy and one standard deviation obtained from five instances with random initialization of parameters. The number of random initializations is chosen to be the same as that of Ref. Grant et al. (2018). The results are obtained for various QCNN models with different convolutional and pooling circuits and data encoding strategies. When benchmarking the hybrid encoding schemes (i.e., HDE and HAE), we used two blocks of four qubits, which results in 32 and 30 features encoded in 8 qubits, respectively. For all results presented here, training is done with the cross-entropy loss. Similar results are obtained when MSE is used for training, and we present the MSE results in Appendix ??. Here we only report the classification results obtained with the Nesterov optimizer, since it consistently provided better convergence. The ansatze in the tables are listed in the same order as the convolutional circuits shown in Fig. 2. The last row of each table (i.e., Ansatz 9b) contains the results when the QCNN circuit consists of the convolutional circuit 9 only, without any unitary gates in the pooling step.

Table 1 Mean accuracy and one standard deviation of the classification for 0 and 1 in the MNIST dataset when the model is trained with the cross-entropy loss
Table 2 Mean accuracy and one standard deviation of the classification for 0 (t-shirt/top) and 1 (trouser) in the Fashion MNIST dataset when the model is trained with the cross-entropy loss

The simulation results show that all ansatze perform reasonably well, while those with more free parameters tend to produce higher scores. Since all ansatze perform reasonably well, one may choose an ansatz with fewer free parameters to save training time. For example, by choosing ansatz 4 and amplitude encoding, one can achieve 97.8% classification accuracy while using only 24 free parameters in total, instead of achieving 98.4% at the cost of increasing the number of free parameters to 51. It is also important to note that most of the results obtained with hybrid direct encoding and PCA are considerably worse than the others. This is due to the normalization problem discussed in Section 2.3.4, which motivated the development of the hybrid angle encoding. We observed that the normalization problem is negligible in the case with autoencoding. The simulation results clearly demonstrate that the normalization problem is resolved by the hybrid angle encoding as expected, and reasonably good results can be obtained with this method. For MNIST, HAE with PCA provides the best solution among the hybrid encoding schemes on average. On the other hand, for Fashion MNIST, HDE with autoencoding provides the best solution among the hybrid encoding schemes on average.

In the following, we also compare the two classical dimensionality reduction methods by presenting the overall mean accuracy and standard deviation values obtained by averaging over all ansatze and the random initializations. The average values are presented in Table 3. As discussed before, the HDE with PCA does not perform well for both datasets due to the data normalization issue. Besides this case, interestingly, PCA works better than autoencoding for MNIST data, and vice versa for Fashion MNIST data, thereby suggesting that the choice of the classical pre-processing method should be data-dependent.

Table 3 Comparison of the classical dimensionality reduction methods for angle encoding, dense encoding, and hybrid encoding

Finally, we also examine how the classification performance changes as the number of convolutional filters in each layer increases. For simplicity, we set the number of convolutional filters in each layer to be the same, i.e., l1 = l2 = l3 = L (see Fig. 1 for the definition of li). Without loss of generality, we pick two ansatze and five encodings. For the ansatze, we choose the one with the smallest number of free parameters and another with arbitrary SU(4) operations. These are circuit 2 and circuit 9b, and they use 12 and 45 parameters in total, respectively. For data encoding, we tested amplitude, qubit, and dense encoding. The qubit and dense encoding are further grouped under two different classical dimensionality reduction techniques, PCA and autoencoding. Since the qubit and dense encoding load 8 and 16 features, respectively, we label them as PCA8, AutoEnc8, PCA16, and AutoEnc16 based on the number of features and the dimensionality reduction technique. The classification accuracies for L = {1,2,3} are plotted in Fig. 4. The simulation results show that in some cases, the classification accuracy can be improved by increasing the number of convolutional filters. For example, the classification accuracy for MNIST data can be improved from about 86% to 96% when circuit 2 and dense encoding with autoencoding are used. For Fashion MNIST, the classification accuracy is improved from about 88% to 90% when circuit 2 and amplitude encoding are used, and from about 86% to 90% when circuit 2 and qubit encoding with PCA are used. However, we do not observe a general trend with respect to the number of convolutional filters. In particular, the relationship between the classification accuracy and L is less obvious for circuit 9b. We speculate that this is attributable to the fact that circuit 9b implements an arbitrary SU(4), i.e., an arbitrary two-qubit gate, and hence repeated application of an arbitrary SU(4) is redundant.

Fig. 4
figure 4

Classification accuracy vs the number of convolutional filters L for MNIST and Fashion MNIST datasets. The number of filters in each layer is set to be equal, i.e., l1 = l2 = l3 = L (see Fig. 1 for the definition of li). The simulation is carried out with two ansatze, circuit 2 and circuit 9b, and five encoding schemes, amplitude encoding (AE), qubit encoding with PCA (PCA8) and with autoencoding (AutoEnc8), and dense encoding with PCA (PCA16) and with autoencoding (AutoEnc16). (a) Convolutional circuit 2 for MNIST. (b) Convolutional circuit 2 for Fashion MNIST. (c) Convolutional circuit 9b for MNIST. (d) Convolutional circuit 9b for Fashion MNIST

Fig. 5
figure 5

A QCNN circuit with the open-boundary condition and no gate operations for pooling. In this case, the QCNN circuit can be constructed with nearest neighbor qubit interactions only

4.2 Boundary conditions of the QCNN circuit

The general structure of QCNN shown in Fig. 1 uses two-qubit gates between the first (top) and last (bottom) qubits, which can be thought of as imposing a periodic boundary condition. One may notice that all-to-all connectivity can be established even without connecting the boundaries. Thus, we tested the classification performance of a QCNN architecture without the two-qubit gates that close the loop. We refer to this case as the open-boundary QCNN. Without loss of generality, we tested QCNNs with two different ansatze: the convolutional circuit 2 (Ansatz 2 in Tables 1 and 2), which uses the smallest number of free parameters, and the convolutional circuit 9, which implements an arbitrary SU(4). In the case of the latter, pooling was done without parameterized gates, and hence, the ansatz is equivalent to ansatz 9b in Tables 1 and 2. By imposing the open-boundary condition in conjunction with ansatz 9b, one can modify the qubit arrangement of the QCNN circuit so as to use nearest neighbor qubit interactions only. For an example of an 8-qubit QCNN circuit, the modified structure is depicted in Fig. 5. Such a design is particularly advantageous for NISQ devices that have limited physical qubit connectivity. For example, if one employs the qubit or the dense encoding method, the QCNN algorithm can be implemented with a 1-dimensional chain of physical qubits.

The simulation results are presented in Table 4 for the MNIST and Fashion MNIST datasets. These results are attained with one convolutional filter per layer, i.e., l1 = l2 = l3 = 1. The simulation results demonstrate that, for the two ansatze tested, the classification performance of the open- and periodic-boundary QCNN circuits is similar. Although the number of free parameters is the same under these conditions, depending on the specifications of the quantum hardware, such as the qubit connectivity, the open-boundary QCNN circuit can have a shallower depth. The open-boundary circuit with ansatz 9b is even more attractive for NISQ devices since the convolutional operations can be done with only nearest neighbor qubit interactions, as mentioned above.

Table 4 Mean classification accuracy and one standard deviation of the classification for 0 and 1 in the benchmarking datasets when the QCNN circuit is constructed under the open-boundary condition

4.3 Comparison to CNN

We now compare the classification results of QCNN to those of classical CNN. Our goal is to compare the classification accuracy of the two given a similar number of parameters subject to optimization. To make a fair comparison, we fix all hyperparameters of the two methods to be the same, except that we used the Adam optimizer for CNN since it performed significantly better than the Nesterov momentum optimizer. A detailed description of the classical CNN architecture is provided in Appendix ??.

It is important to note that CNN can be trained effectively with such a small number of parameters only when the number of nodes in the input layer is small. Therefore, the CNN results are only comparable to those of the qubit and dense encoding cases, which require 8 and 16 classical input nodes, respectively. We designed four different CNN models with 26, 34, 44 and 56 free parameters to make them comparable to the QCNN models. In these cases, a dimensionality reduction technique must be applied first. For hybrid and amplitude encoding, which require relatively simpler data pre-processing, the number of nodes in the CNN input layer is too large to be trained with a small number of parameters as in QCNN.

Comparing the values in Table 5 with the QCNN results, one can see that the QCNN models perform better than their corresponding CNN models for the MNIST dataset. The same conclusion also holds for the Fashion MNIST dataset, except for the CNN models with 44 and 56 parameters, which achieve performance similar to their corresponding QCNN models. Another noticeable result is that the QCNN models have considerably smaller standard deviations than the CNN models on average. This implies that the QCNN models not only achieve higher classification accuracy than the CNN models under similar training conditions but also are less sensitive to the random initialization of the free parameters.

Table 5 Mean classification accuracy and one standard deviation obtained with classical CNN for classifying 0 and 1 in the MNIST and Fashion MNIST datasets

In Fig. 6, we present two representative examples of the cross-entropy loss as a function of the number of training iterations. For simplicity, we show such data for two cases of MNIST data classification: circuit 9b and qubit encoding with autoencoding, and circuit 9b and dense encoding with PCA. Considering the number of free parameters, these cases are comparable to the CNN models with 8 inputs with autoencoding and 16 inputs with PCA, respectively. Recall that the mean classification accuracy and one standard deviation in QCNN (CNN) is 96.6 ± 2.2 (90.4 ± 13.4) for the first case, and 98.3 ± 0.5 (93.0 ± 13.4) for the second case. Figure 6 shows that in both cases, the QCNN models are trained faster than the CNN models, while the advantage manifests more clearly in the first case. Furthermore, the standard deviations of the QCNN models are significantly smaller than those of the CNN models.

Fig. 6
figure 6

Cross-entropy loss as a function of the number of training iterations. The QCNN models use circuit 9b as the ansatz. (a) The QCNN model with qubit encoding and autoencoding is compared to the CNN model with 8-inputs. (b) The QCNN model with dense encoding and PCA is compared to the CNN model with 16-inputs. (a) 8-input CNN with AutoEnc vs QCNN. (b) 16-input CNN with PCA vs QCNN

5 Conclusion

Fully parameterized quantum convolutional neural networks open promising avenues for near-term applications of quantum machine learning and data science. This work presented an extensive benchmark of QCNN for solving classification problems on classical data, a fundamental task in pattern recognition. The QCNN algorithm can be tailored with many variables, such as the structure of parameterized quantum circuits (i.e., ansatz) for convolutional filters and pooling operators, quantum data encoding methods, classical data pre-processing methods, cost functions and optimizers. To improve the utility of QCNN for classical data, we also introduced new data encoding schemes, namely hybrid direct encoding and hybrid angle encoding, with which the trade-off between quantum circuit depth and width for state preparation can be configured. With diverse combinations of the aforementioned variables, we tested 8-qubit QCNN models for binary classification of the MNIST and Fashion MNIST datasets by simulation with Pennylane. The QCNN models tested in this work operated with a small number of free parameters, ranging from 12 to 51. Despite the small number of free parameters, QCNN produced high classification accuracy for all instances, with the best case being close to 99% for MNIST and 94% for Fashion MNIST. We also compared QCNN results to CNN and observed that QCNN performed noticeably better than CNN under similar training conditions for both benchmarking datasets. The comparison between QCNN and CNN is only valid for the qubit and dense encoding cases, in which the number of input qubits grows linearly with the dimension of the input data. With amplitude or hybrid encoding, the number of input qubits is substantially smaller than the dimension of the data, and hence, there is no classical analogue. We speculate that the advantage of QCNN lies in its ability to exploit entanglement, which is a global effect, while CNN is only capable of capturing local correlations.

The QCNN architecture proposed in this work can be generalized for L-class classification through one-vs-one or one-vs-all strategies. It also remains an interesting direction for future work to examine the construction of a multi-class classifier by leaving \(\lceil \log _{2}(L)\rceil \) qubits for measurement in the output layer. Another interesting direction is to optimize the data encoding via the training methods provided in Ref. Lloyd et al. (2020). However, since QCNN itself can be viewed as a feature reduction technique, it is not clear whether introducing another layer of variational quantum circuits for data encoding would help until a thorough investigation is carried out. Understanding the underlying principle of the quantum advantage demonstrated in this work also remains to be done. One way to study this is by testing QCNN models with a set of data that does not exhibit local correlations but contains some global feature, while analyzing the amount of entanglement created in the QCNN circuit. Since the circuit depth grows only logarithmically with the number of input qubits and the gate parameters are learned, the QCNN model is expected to be suitable for NISQ devices. However, verification through real-world experiments and noisy simulations remains to be done. Furthermore, testing the classification performance as the QCNN models grow bigger remains interesting future work. Finally, the application of the proposed QCNN algorithms to other real-world datasets, such as those relevant to high-energy physics and medical diagnosis, is of significant importance.