1 Introduction

Supervised learning is the most commonly applied form of machine learning. It works in two stages. During the training stage, the algorithm extracts patterns from the training dataset that contains pairs of samples and labels and converts these patterns into a mathematical representation called a model. During the inference stage, this model is used to make predictions about unseen samples. Machine learning algorithms, in general, are data hungry; their performance depends heavily on the size of the datasets used for training. However, this training process is computationally expensive, and working with large datasets requires huge amounts of computational horsepower. As the number and volume of datasets available for research and commercial purposes continues to grow exponentially, new technologies such as NVIDIA CUDA (Nickolls et al. 2008) and Google TPUs (Jouppi et al. 2017) have emerged to enable faster processing of data. Computational speeds have been increasing rapidly over the last several decades in accordance with the observation that the number of transistors in a dense integrated circuit (IC) doubles about every 2 years—this is informally known as Moore’s Law (Friedman 2015). The semiconductor manufacturing process has shrunk from 10 μm in 1970 to about 5 nm in 2020. It is believed, however, that we are nearing the limits of the processing power that classical computers can offer us (Leiserson et al. 2020). At scales smaller than this, quantum mechanical effects come into play (Sperling 2018), and they impose physical limitations on how small electronic components can get.

Quantum computing, on the other hand, proposes to leverage these quantum mechanical effects to carry out computation. In contrast to classical computers that operate on bits that can exist in only one out of two states at a time, quantum computers exploit the fact that quantum bits (qubits) can exist in any one of the infinite possible linear superpositions of these two states. This allows quantum computers to execute multiple paths of computation simultaneously. Quantum computers can efficiently solve computational problems which are believed to be intractable for classical computers (Nielsen and Chuang 2002). Owing to its roots in theoretical physics, most research articles on the topic are written for physicists; this makes them difficult to access for researchers from other fields. The purpose of this paper is to give a broad overview of the synergies between quantum computing and machine learning. We briefly outline the history of quantum physics, describe the preliminaries of quantum computation, and then review the latest research on applying the principles of quantum computation to supervised machine learning. A large-scale general-purpose quantum computer does not yet exist, but restricted quantum machines capable of solving optimization problems are already being sold commercially (Castelvecchi 2017; Johnston 2013). By eschewing results from physics that have little bearing on quantum computation and by providing additional background that may benefit the unfamiliar reader, we hope to make this introduction accessible to data scientists, machine learning practitioners, and researchers from other fields.

Quantum mechanics arose through a number of discoveries in the early twentieth century. Einstein (1905) explained the photoelectric effect in 1905 by postulating that light and all electromagnetic radiation is made up of discrete particles that later came to be called photons. Previously in 1865, Maxwell (1865) had demonstrated that electric and magnetic fields travel through space as waves. de Broglie (1924) proposed in 1923 that particles can exhibit wave-like properties and waves can behave like particles. Building on his approach, Heisenberg (Edwards 1979) developed matrix mechanics, and Schrödinger (1926) developed wave mechanics, both of which were later found to be equivalent. These developments laid the foundation of quantum mechanics. The equations of quantum mechanics have since been extensively tested in innumerable experiments, but even after almost a century of debate, physicists strongly disagree over how those equations should be interpreted and mapped to reality (Schlosshauer et al. 2013).

Benioff (1980) in 1980 and Feynman (1999) in 1982 observed that simulating the evolution of certain quantum systems may be an intractable problem that cannot be solved efficiently by computers, and yet, these quantum systems solved the problem by merely evolving thus suggesting that the evolution of quantum systems could be used as a method of computation. In 1985, Deutsch (1985) designed a universal quantum computer as a quantum-counterpart to the Universal Turing Machine. Deutsch and Jozsa (1992) proposed the Deutsch-Jozsa problem for which the deterministic quantum algorithm is exponentially faster than any deterministic classical solution. Shor’s algorithm (1999) for factoring large integers was the first quantum algorithm that could solve a problem of practical significance faster than any classical algorithm. Grover’s algorithm (1996) showed quadratic improvement in unordered search problems. These results laid the foundation of quantum computation. Since then, quantum algorithms have been proposed for numerous areas including cryptography, search and optimisation, simulation of quantum systems, solving large systems of linear equations, and machine learning (Montanaro 2016).

The volume of training data available to us as well as the sizes of machine learning models in terms of trainable parameters are both growing at a rapid pace. On the other hand, the computational power that classical computing can afford us is fast approaching its limits. In this scenario, quantum computing is increasingly being looked upon as a front-running candidate to cater to the computational demands of machine learning and artificial intelligence in the future. Machine learning algorithms often involve repeated execution of linear algebra routines on large matrices. Quantum solutions can offer exponential speed-up for these routines (Harrow et al. 2009; Wiebe et al. 2012). The benefits quantum computing can bring to machine learning, however, go beyond speed-up in execution. Many tasks in machine learning such as maximum likelihood estimation using hidden variables, principal component analysis, and training of neural networks. require optimization of a non-convex objective function. Optimizing non-convex functions is an NP-hard problem. Classical optimization methods such as gradient descent can get stuck at local minimum or saddle points and may never find the global minimum. Adiabatic quantum computers use quantum annealing to solve non-convex optimization problems by finding low-energy configurations of an appropriate energy function by exploiting tunneling effects to escape local minima (Santoro and Tosatti 2006). Methods based on Grover’s search can find the global minimum in a discrete unordered search space. Under certain conditions, there is a provable separation between quantum learnability and classical learnability, and there are classes of functions that can be learned in polynomial time using quantum methods but not using classical methods (Ciliberto et al. 2018; Servedio and Gortler 2004). In other words, quantum computers may not only find solutions faster than classical computers, but they may even find solutions that are better than those found classically.

The term quantum machine learning is generally used to denote analysis of classical data on quantum computers. This is known as quantum-enhanced machine learning. There are however other ways in which the fields of quantum computing and machine learning overlap. Classical machine learning can be applied to data emanating from quantum systems to solve problems in physics. Another stream of research deals with generalizing classical machine learning to work with quantum data where the input and output are quantum states. Recently, Tang (2019) developed a classical algorithm for recommendation systems that was inspired by quantum computing creating a new category referred to as quantum-inspired algorithms. These are classical algorithms that can be run on conventional computers which borrow ideas from quantum computing to achieve significant theoretical speed-ups over the best prevailing classical algorithms. This paper limits itself to quantum-enhanced machine learning and presents a selection of quantum approaches for implementing supervised machine learning algorithms on quantum computers. Far from being a comprehensive review of the field, it aims to offer the reader a background on the multitude of approaches proposed over the years with enough detail to set the stage for a more detailed exploration. For additional information, we recommend the following excellent surveys and reviews: (Schuld et al. 2015b; Wittek 2014; Biamonte et al. 2017; Adcock et al. 2015; Arunachalam and de Wolf 2017; Kopczyk 2018; Dunjko and Briegel 2018; Dunjko and Wittek 2020; Montanaro 2016).

2 Background of quantum computation

Quantum mechanics is based on four fundamental postulates (Dunjko and Briegel 2018; Nielsen and Chuang 2002): (1) the pure state of a quantum system is given by a unit vector |ψ〉 in a complex Hilbert space; (2) the evolution of a pure state in a closed system is governed by a Hamiltonian H as specified by Schrödinger’s equation \(H |{\psi }\rangle = i \hbar \frac {\partial }{\partial t} |{\psi }\rangle \); (3) the quantum state of a composite system is given by the tensor product of the individual systems; (4) projective measurements (observables) are specified by Hermitian operators, and the process of measurement changes the observed system from |ψ〉 to an eigenstate |ϕ〉 with probability given by the Born rule p(ϕ) = |〈ψ|ϕ〉|2. In this section, we briefly set up the background of quantum computation based on the above postulates.

2.1 Single qubit

A classical bit can exist in one of two states denoted as 0 and 1. A quantum bit or qubit can exist not only in these two discrete states but in all possible linear superpositions of them. Mathematically, the quantum state of a qubit is represented as a state vector in a two-dimensional Hilbert space. In the Dirac notation, the state vector of a qubit ψ is called a ket and is written as:

$$ \begin{array}{@{}rcl@{}} |{\psi}\rangle = \alpha |0\rangle + \beta |1\rangle \end{array} $$
(1)

where α and β are complex numbers and |α|2 + |β|2 = 1. The Born’s rule tells us that if this qubit is measured, we will get |0〉 with probability |α|2 and |1〉 with probability |β|2. Quantum measurements are non-deterministic, and the act of measurement changes the quantum state irreversibly. Before measurement, the qubit exists in a quantum superposition of the states |0〉 and |1〉. The outcome of the measurement, however, is not quantum but classical, i.e. you get either a |0〉 or a |1〉 but not a superposition of the two. During the measurement, the quantum state collapsesFootnote 1 to the classical state it gets observed in, and all subsequent measurements deterministically result in this same outcome with a probability equal to 1.

The choice of basis vectors |0〉 and |1〉 is arbitrary. We can represent the system using a different set of orthogonal basis vectors such as |+〉 and |−〉 (called the Hadamard or sign basis). Once the computational basis is decided, kets can be represented as column vectors:

$$ \begin{array}{@{}rcl@{}} |0\rangle = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, |1\rangle = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \end{array} $$
(2)
$$ \begin{array}{@{}rcl@{}} |{\psi}\rangle = \alpha \begin{bmatrix} 1 \\ 0 \end{bmatrix} + \beta \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha \\ \beta \end{bmatrix} \end{array} $$
(3)

The two representations for |ψ〉 given in Eqs. 1 and 3 are equivalent.

2.2 Multiple qubits

The quantum state of a system consisting of more than one unentangled qubits can be represented as the tensor product of the quantum states of the individual qubits. The state of a two-qubit system comprising of qubits represented by |ψ1〉 and |ψ2〉 can be written as |Ψ〉 = |ψ1〉⊗|ψ2〉. In general, the state of n qubits \(\{|{\psi _{1}}\rangle , |{\psi _{2}}\rangle , \dots , |{\psi _{n}}\rangle \}\) can be represented as:

$$ \begin{array}{@{}rcl@{}} |{{\varPsi}}\rangle=|{\psi_{1}}\rangle \otimes |{\psi_{2}}\rangle \otimes {\dots} \otimes |\psi_{k}\rangle=|{\psi_{1}\psi_{2}\dots\psi_{k}}\rangle . \end{array} $$
(4)

However, not all multi-qubit states can be represented as a tensor product of individual states. Consider the state below, one of the Bell states:

$$ \begin{array}{@{}rcl@{}} |{{\varPhi}^{+}}\rangle = \frac{1}{\sqrt{2}}(|00\rangle + |11\rangle) \end{array} $$
(5)

Suppose it could be decomposed into the tensor product of two states as below:

$$ \begin{array}{@{}rcl@{}} |{{\varPhi}^{+}}\rangle &=& (a_{1}|0\rangle + b_{1}|1\rangle) \otimes (a_{2}|0\rangle + b_{2}|1\rangle) \\ &=& a_{1}a_{2}|00\rangle + a_{1}b_{2}|01\rangle + a_{2}b_{1}|10\rangle + b_{1}b_{2}|11\rangle \end{array} $$
(6)

From Eqs. 5 and 6, we know that a1b2 = a2b1 = 0. Therefore, either a1a2 = 0 or b1b2 = 0. But, from Eq. 5, both a1a2≠ 0 and b1b2≠ 0. This proves that the Bell state |Φ+〉 cannot be decomposed into the tensor product of two single-qubit states. In such cases, we say that the two qubits are entangled. Given an entangled pair of qubits, measurement on one qubit instantaneously affects the other qubit. Entanglement plays a central role in many quantum algorithms especially in the field of quantum cryptography. There is no counterpart to quantum entanglement in classical physics.

2.3 Quantum gates

Classical computers manipulate information stored in bits using logic gates such as AND, OR, NOT, NAND, and XOR. Likewise, quantum computers manipulate qubits using quantum gates. Transformations on quantum states are represented as rotation of the Hilbert space. Rotation is linear and reversible. Consequently, all transformations on quantum states must be linear and reversible. Quantum gates essentially transform the system from one state to another state. These transformations can be represented as matrices.

The simplest quantum gate is the NOT gate. The NOT gate transforms |ψ1〉 = α|0〉 + β|1〉 to |ψ2〉 = α|1〉 + β|0〉 and can be represented as:

$$ \begin{array}{@{}rcl@{}} \textit{NOT}=\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \ \end{array} $$
(7)

The Hadamard gate acts on a single qubit. It is often used to map a qubit from one of its basis states into an equal superposition of all basis states. It transforms |0〉 to \(\frac {1}{\sqrt {2}}(|0\rangle +|1\rangle )\) and |1〉 to \(\frac {1}{\sqrt {2}}(|0\rangle -|1\rangle )\) and is given by:

$$ \begin{array}{@{}rcl@{}} H = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \end{array} $$
(8)

In general, an n-qubit Hadamard gate is used to initialize an n-qubit system into an equal superposition of all basis states:

(9)

where x ∈{0,1}n denotes all strings of length n consisting of 0 and 1.

The CNOT (controlled-NOT) gate acts on two qubits where the first qubit acts as a control signal that decides whether the NOT operation should be performed on the second qubit. If the control qubit is |1〉, the NOT operation is applied; if it is |0〉, it is not applied. The CNOT gate leaves the states |00〉 and |01〉 unchanged, while it maps |10〉 to |11〉 and |11〉 to |10〉. It is represented as:

$$ \begin{array}{@{}rcl@{}} CNOT = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix} \end{array} $$
(10)

The SWAP gate swaps the states of two qubits transforming |ψ,ϕ〉 to |ϕ,ψ〉. The CSWAP (controlled-SWAP) gate acts on three qubits and swaps the state of the second and third qubit if the first qubit is |1〉. The Toffoli gate (CCNOT) acts on three qubits and performs the computation:

$$ \begin{array}{@{}rcl@{}} |a, b, c\rangle \rightarrow |a, b, c \oplus ab\rangle \end{array} $$
(11)

2.4 Quantum parallelism

While classical computers can execute only one computational path at a time, quantum computers can leverage the ability of quantum states to exist in superpositions to simultaneously execute multiple computational paths. For example, consider the classical function \(f(x): \{0,1\}^{2} \rightarrow \{0,1\}\). The function takes two bits as input and outputs a single bit. To evaluate f on all two-bit permutations using classical computation, we need to call f four times: f(0,0), f(0,1), f(1,0), and f(1,1). Quantum superposition allows us to evaluate all four inputs in a single call to f.

Since quantum transformations must be reversible and f is not reversible, we define a reversible quantum function:

$$ \begin{array}{@{}rcl@{}} U_{f}|x\rangle|y\rangle\rightarrow|x\rangle|y \oplus f(x)\rangle \end{array} $$
(12)

The input |ϕ〉 is set up in a superposition of states by initializing two qubits to |0〉 and applying the Hadamard transform:

$$ \begin{array}{@{}rcl@{}} |\phi\rangle = (H \otimes H)|00\rangle &=& \frac{|0\rangle+|1\rangle}{\sqrt{2}} \otimes \frac{|0\rangle+|1\rangle}{\sqrt{2}} \\&=& \frac{1}{2}(|00\rangle+|01\rangle+|10\rangle+|11\rangle) \end{array} $$
(13)

Setting |y〉 = |0〉, we apply Uf as follows:

$$ \begin{array}{@{}rcl@{}} U_{f}|\phi\rangle|0\rangle&=&\frac{1}{2}U_{f}(|00\rangle+|01\rangle+|10\rangle+|11\rangle)\otimes|0\rangle \\ &=& \frac{1}{2}U_{f}(|00,0\rangle+|01,0\rangle+|10,0\rangle+|11,0\rangle) \\ &=& \frac{1}{2}(|00,f(00)\rangle+|01,f(01)\rangle\\ &&+|10,f(10)\rangle+|11,f(11)\rangle) \end{array} $$
(14)

Thus, with a single application of f, we simultaneously evaluate four inputs. Using the Hadamard transform, the set-up in the input in an equal superposition of all basis vectors is a useful starting point for many quantum algorithms.

2.5 No cloning theorem

An important result that has profound implications is the no cloning theorem (Wootters and Zurek 1982) which states that it is not possible to create a copy of an unknown quantum state. Since measurement irreversibly changes the quantum state, given a single copy of the state |ψ〉 = α|0〉 + β|1〉, the values of the amplitudes α and β cannot be exactly determined. Although quantum parallelism can be leveraged to simultaneously execute multiple computational paths, the no cloning theorem places restrictions on the amount of information one can extract from the final quantum stateFootnote 2.

2.6 Adiabatic quantum computation

Numerous models have been proposed for quantum computing such as the quantum Turing machine, quantum circuit model, adiabatic quantum computing, measurement-based quantum computing, blind quantum computing, and topological quantum computing (Dunjko and Briegel 2018). All these models are computationally equivalent, but they are implemented very differently. An approach that has shown promise in solving optimization problems is adiabaticFootnote 3 quantum computing (Farhi et al. 2000) and is of particular interest because building restricted quantum computers to perform quantum annealing (Section 3.2) based on adiabatic quantum computing is simpler than building universal quantum computers. In adiabatic quantum computing, the optimization problem to be solved is encoded as a boolean satisfiability problem such that the ground state of its HamiltonianFootnote 4 represents the desired solution. The quantum system is initially set up with a simple Hamiltonian that is easy to construct. The system is then evolved from the initial state to a final state. The adiabatic theorem states that if the system is evolved slowly enough, it will remain in the ground state of the instantaneous Hamiltonian throughout the evolution. The final system configuration then represents the solution to the optimization problem.

3 Quantum machine learning

Quantum computing methods for machine learning can be divided into two broad classes: (1) methods designed to run on a universal quantum computer that involve preparation, storage, and processing of quantum states and the retrieval of the classical solution from these states; (2) methods designed to run on quantum annealers that solve optimization problems through the physical evolution of quantum systems according to the principles of adiabatic quantum computing. In Section 3.1, we describe common subroutines designed for circuit quantum computers that can be applied to machine learning problems. In Section 3.2, we present quantum annealing that can solve quadratic unconstrained binary optimization (QUBO) problems. Finally, in Section 3.3, we present quantum versions of selected classical machine learning algorithms.

3.1 Important subroutines of quantum algorithms

A straightforward approach to achieving speed-ups over classical machine learning algorithms is to identify their computationally expensive and frequently executed subroutines and develop quantum alternatives for them. In this section, we describe some common subroutines that form a part of many quantum learning algorithms.

3.1.1 Quantum encoding

Qubits are a scarce and expensive resource. The restricted physical implementations of quantum computers available today have very few qubitsFootnote 5. Furthermore, the procedure to encode classical data into quantum states is far from trivial and, in many cases, contributes significantly to the overall complexity of the quantum algorithm. An important question therefore that has implications for performance and feasibility of quantum algorithms is how to represent classical data in quantum states.

Basis encoding

Suppose we have a dataset D having M instances: D = {x(1),...,x(m),...,x(M)}. Suppose for each instance x(m)D, x(m) consists of N binary features so that \(x^{(m)}=\{b^{(m)}_{1},...,b^{(m)}_{n},...,b^{(m)}_{N}\}\). In basis encoding, each instance x(m) is encoded as the basis state |x(m)〉. The dataset D can thus be represented as a superposition of all computational basis states:

$$ \begin{array}{@{}rcl@{}} |{\psi}\rangle = \frac{1}{\sqrt{M}}\sum\limits_{i=1}^{M}|{x^{(i)}}\rangle \end{array} $$
(15)

For example, if D = {010,011}, it can be represented using basis encoding as:

$$ \begin{array}{@{}rcl@{}} |D\rangle = \frac{1}{\sqrt{2}}|010\rangle + \frac{1}{\sqrt{2}}|011\rangle \end{array} $$
(16)

In general, preparing a quantum state representing a dataset D of M instances consisting of N binary features each has a runtime of \(\mathcal {O}(MN)\) and requires N qubits as described by Ventura and Martinez (2000). Since M ≪ 2N in most cases, basis encoding tends to be sparse, i.e. most basis states have zero amplitude.

Amplitude encoding

In amplitude encoding, data is encoded in the amplitudes of quantum states. A single data instance x with N normalized real-valued features can be encoded as:

$$ \begin{array}{@{}rcl@{}} |{\psi_{x}}\rangle = \sum\limits_{i=1}^{N}x_{i}|i\rangle \end{array} $$
(17)

For example, x = {2,3} can be encoded in the amplitudes of the two basis states |0〉 and |1〉 as:

$$ \begin{array}{@{}rcl@{}} |x\rangle = \frac{1}{\sqrt{2^{2}+3^{2}}}(2|0\rangle + 3|1\rangle) \end{array} $$
(18)

Consider a dataset D of M such instances D = {x(1),...,x(m),...,x(M)}. All individual instances x(m) can be concatenated into a single MN-dimensional vector \(D^{\prime } = \{x^{(1)}_{1},...,x^{(1)}_{N},...,x^{(M)}_{1},...,x^{(M)}_{N}\}\) and then further encoded into the amplitudes of a quantum state as described in Eq. 17. This encoding requires log(MN) qubits and assumes an efficient procedure to prepare a quantum state with arbitrary amplitudes. Although some quantum states can be prepared efficiently, there exist states in the Hilbert space that cannot be prepared efficiently (Kliesch et al. 2011). Schuld and Petruccione (2018b) present a list of methods to prepare amplitude encoded states.

Besides basis and amplitude encoding, many other methods of encoding exist such as Qsample encoding, dynamic encoding, squeezing embedding, displacement embedding, and Hamiltonian encoding (Schuld and Killoran 2019; Lloyd et al. 2020).

3.1.2 Quantum random access memory

Classical computers store the input data required for the computation in random access memory (RAM). A RAM consists of an address register, an output register, and memory cells. It takes an n bit address as input in the address register and returns the data stored in one of N = 2n memory cells in the output register. This operation incurs a runtime of \(\mathcal {O}(N)\) in the typical implementations used today. The quantum analog of RAM, referred to as Quantum Random Access Memory (QRAM), performs the same operation using quantum registers. It takes as input a superposition of addresses \({\sum }_{a}c_{a}|a\rangle \) and returns a superposition of the data at those addresses \({\sum }_{a}c_{a}|a\rangle |{D_{a}}\rangle \) in \(\mathcal {O}(\log N)\) time:

(19)

Many quantum algorithms proceed by assuming the existence of a QRAM that can encode classical data in a superposition and later retrieve it efficiently. Giovannetti et al. (2008) show that conventional RAM architectures are not suitable for such a purpose and propose a bucket-brigade architecture for the QRAM where the N data points are stored in the leaves of a tree-like structure. Such a QRAM can encode Nd −dimensional vectors in \(\mathcal {O}(\log Nd)\) using \(\log (Nd)\) qubits.

Implementing such a QRAM system in hardware is not trivial, and it has so far not been experimentally realized. It also remains to be seen whether the theoretical speedup that QRAM offers to quantum algorithms will be nullified by practical physical considerations. Although the runtime of the QRAM scales logarithmically, the physical resources it requires scale as \(\mathcal {O}(Nd)\). In the presence of noise, active error correction will be necessary to compensate for the errors it causes, and this may nullify the speedup QRAM offers (Arunachalam et al. 2015; Adcock et al. 2015). Loading data into the QRAM is possible in logarithmic time only when the data exhibits a certain level of uniformity. In the absence of this uniformity, the QRAM loses its efficiency and needs \(\mathcal {O}(\sqrt {N})\) time. In addition to these caveats, the above analyses do not consider the speed of communication and memory latency. These factors become prominent only when the amount of data is very large. However, since quantum computing’s appeal for machine learning is based on the premise that it will enable us to train models on data volumes larger than what classical computers can support, the limitations imposed by latency need to be studied further. Hardware considerations will ultimately play an important role in determining the feasibility of the QRAM.

Due to its general nature, QRAM is used by many quantum algorithms as a black-box oracle, and it plays a central role in achieving speedup over classical alternatives. However, it is worth noting that not all quantum algorithms depend on the QRAM, and there exist alternative methods to encode classical data into quantum states and retrieve it efficiently (Grover and Rudolph 2002; Aaronson 2015).

3.1.3 Grover’s algorithm and amplitude amplification

Grover’s algorithm (1996) is a quantum search algorithm that offers a quadratic speed-up over classical algorithms when performing a search over an unstructured search space. Suppose we are given a set of N elements X = {x1,...,xi,...,xN} where \(x_{i} \in \{0,1\}^{m}\) and a boolean function \(f:\{0,1\}^{m} \rightarrow \{0,1\}\) such that:

$$ \begin{array}{@{}rcl@{}} f(x)= \begin{cases} 1 & x = x^{*}\\ 0 & x \neq x^{*} \end{cases} \end{array} $$
(20)

Any classical algorithm that performs a search for x in X is \(\mathcal {O}(N)\) in time. Grover’s algorithm can perform such a search in \(\mathcal {O}(\sqrt {N})\). The algorithm has three steps to it.

In the first step, a quantum state is set up in an equal superposition of basis states using the Hadamard transform. As an example, consider N = 8. We set up the state using 3 qubits as:

$$ \begin{array}{@{}rcl@{}} |{\psi_{1}}\rangle&=&(H \otimes H \otimes H)|000\rangle \\ &=&\frac{1}{2\sqrt{2}} (|000\rangle+|001\rangle+|010\rangle+|011\rangle\\&&+|100\rangle+|101\rangle+|110\rangle+|111\rangle) \\ &=&\begin{bmatrix} \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}}\end{bmatrix}^{T} \end{array} $$
(21)

The second step referred to as phase inversion deals with flipping the amplitude of each |x〉 if f(x) = 1 and leaving it unchanged if f(x) = 0. To do this, we define a unitary quantum oracle O|x〉 = (− 1)f(x)|x〉. Suppose, in our example, x is present at the fourth position. Applying gate O on |ψ1〉 gives us:

$$ \begin{array}{@{}rcl@{}} |{\psi_{2}}\rangle=\begin{bmatrix} \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{-1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}} & \frac{1}{2\sqrt{2}}\end{bmatrix}^{T} \end{array} $$
(22)

The third step referred to as inversion around the mean involves flipping all amplitudes around their collective mean \(\mu =\frac {1}{N}{\sum }_{x}{\alpha _{x}}\). This is performed by the Grover diffusion operator G:

(23)

Applying G to |ψ2〉 gives us:

$$ \begin{array}{@{}rcl@{}} |{\psi_{3}}\rangle=\begin{bmatrix} \frac{1}{4\sqrt{2}} & \frac{1}{4\sqrt{2}} & \frac{1}{4\sqrt{2}} & \frac{5}{4\sqrt{2}} & \frac{1}{4\sqrt{2}} & \frac{1}{4\sqrt{2}} & \frac{1}{4\sqrt{2}} & \frac{1}{4\sqrt{2}}\end{bmatrix}^{T} \end{array} $$
(24)

These two operations of phase inversion and inversion around the mean constitute one round of the algorithm. Each successive round increases the amplitude of the target element x and reduces that of all other elements. If we were to measure the system at the end of the first round, we would get the target element as outcome with a probability of \((\frac {5}{4\sqrt {2}})^{2}=78\%\).

After application of the second round (i.e. performing phase inversion and inversion around the mean once again), we get the below state which will find the target element with a probability of 95%:

$$ \begin{array}{@{}rcl@{}} |{\psi_{4}}\rangle=\begin{bmatrix} \frac{-1}{8\sqrt{2}} & \frac{-1}{8\sqrt{2}} & \frac{-1}{8\sqrt{2}} & \frac{11}{8\sqrt{2}} & \frac{-1}{8\sqrt{2}} & \frac{-1}{8\sqrt{2}} & \frac{-1}{8\sqrt{2}} & \frac{-1}{8\sqrt{2}}\end{bmatrix}^{T} \end{array} $$
(25)

For a list of N inputs, \(\sqrt {N}\) such rounds are usually performed. The same algorithm can also find k matching entries instead of a single entry. Several modifications have been proposed that extend this work. Durr and Hoyer (1996) propose a quantum algorithm to find the index of the minimum value from a list of N values in time \(\mathcal {O}(c\sqrt {N})\) with a probability of at least \((1-\frac {1}{2^{c}})\). These methods generalize Grover’s search and are collectively referred to as amplitude amplification techniques (Brassard and Hoyer 1997).

3.1.4 Calculating inner products using swap test

The swap test (Buhrman et al. 2001) is a simple subroutine used to compute the overlap between two quantum states |ϕ〉 and |ψ〉. Quantum procedures can be easily described using circuit diagrams. The circuit diagram of the swap test is shown in Fig. 1.

Fig. 1
figure 1

Circuit diagram describing swap test. Computation proceeds from left to right

The system is initially prepared in the state |0,ϕ,ψ〉. The Hadamard gate applied on the first ancilla qubitFootnote 6 transforms the state to \(\frac {1}{\sqrt {2}}(|0,\phi ,\psi \rangle +|1,\phi ,\psi \rangle )\). The CSWAP further transforms it to \(\frac {1}{\sqrt {2}}(|0,\phi ,\psi \rangle +|1,\psi ,\phi \rangle )\). After the application of the second Hadamard gate to the first qubit, the state can be written as \(\frac {1}{2}|0\rangle (|{\phi ,\psi }\rangle +|{\psi ,\phi }\rangle )+\frac {1}{2}|1\rangle (|{\phi ,\psi }\rangle -|{\psi ,\phi }\rangle )\). The probability of measuring the first qubit as |0〉 is given by \(p=\frac {1}{2}+\frac {1}{2}{|\langle {\psi |\phi }\rangle |}^{2}\). If |ϕ〉 and |ψ〉 are equal, |〈ψ|ϕ〉|2 = 1, and the observed value of p is 1. If |ϕ〉 and |ψ〉 are orthogonal, |〈ψ|ϕ〉|2 = 0, and the observed value of p is \(\frac {1}{2}\). The degree of overlap given by the inner product of the two states can be estimated with this method to precision 𝜖 using \(\mathcal {O}(1/\epsilon )\) copies (Dunjko and Briegel 2018).

3.1.5 Solving systems of linear equations (HHL)

Solving a system of linear equations is an important problem ubiquitous throughout science and engineering. A seminal result in quantum computation is the HHL algorithm (Harrow et al. 2009) that solves the following problem: given a Hermitian matrix \(A \in \mathbb {R}^{N \times N}\) and a unit vector \(\overrightarrow {b} \in \mathbb {R}^{N}\), find a solution vector \(\overrightarrow {x} \in \mathbb {R}^{N}\) that satisfies the equation \(A\overrightarrow {x}=\overrightarrow {b}\).

We present here a condensed outline of the algorithm. The solution we are interested in is |x〉 = A− 1|b〉. Let {v1,...,vN} be the eigenvectors of A with corresponding eigenvalues {λ1,...,λN}. The vector \(\overrightarrow {b}\) is encoded using amplitude encoding (Section 3.1.1) as \(|b\rangle ={\sum }_{i=1}^{N}{\beta _{i}|{v_{i}}\rangle }\). Hamiltonian simulation is used to transform the matrix A into a unitary operator, and quantum phase estimationFootnote 7 is used to carry out eigendecomposition to get the state \({\sum }_{i=1}^{N}{\beta _{i}|{v_{i}}\rangle |{\lambda _{i}}\rangle }\). An ancilla qubit is added, rotation conditioned on |λi〉 is carried out, and the eigenvalue register |λi〉 is uncomputed to yield a state proportional to \({\sum }_{i=1}^{N}{\beta _{i} \lambda _{i}^{-1}|{v_{i}}\rangle } = A^{-1}|b\rangle = |x\rangle \).

It is important to note that while a classical algorithm finds all coefficients xi for x, the HHL algorithm finds the quantum state \(|x\rangle ={\sum }_{i=1}^{N}{x_{i}|i\rangle }\). Obtaining values for all xi takes \(\mathcal {O}(N)\) repetitions; this observation nullifies the speed-up the quantum algorithm has over the classical counterparts. Hence, the HHL algorithm is most useful when used as a subroutine carrying out an intermediate step in a larger process where the quantum state |x〉 is consumed by the next subroutine in the process.

3.2 Quantum annealing

Quantum annealing is a metaheuristic optimization algorithm that leverages quantum effects to solve quadratic unconstrained binary optimization (QUBO) problems that deal with optimizing functions of the form:

$$ \begin{array}{@{}rcl@{}} C(x_{1},x_{2},...,x_{n}) = \sum\limits_{i}a_{i}x_{i} + \sum\limits_{i,j}b_{i,j}x_{i}x_{j} \end{array} $$
(26)

where \(a_{i} \in \mathbb {R}\), \(b_{i,j} \in \mathbb {R}\), xi ∈{0,1}. A wide range of problems can be mapped to QUBO and then solved by quantum annealers which are special-purpose quantum computers specifically built to perform quantum annealing.

The Ising model is used in physics to represent a large variety of systems. It was originally proposed to model magnetic materials where every molecule has a spin that can align or anti-align with an applied magnetic field (Bian et al. 2010). The Hamiltonian of the system representing its energy is given by:

$$ \begin{array}{@{}rcl@{}} H = \sum\limits_{i}h_{i}s_{i} + \sum\limits_{i,j}J_{i,j}s_{i}s_{j} \end{array} $$
(27)

where si ∈{− 1,+ 1} is the spin of the i th molecule, hi is the strength of the magnetic field at the i th molecule, and Ji,j is the strength of the interaction between neighboring spins i and j. From Eqs. 26 and 27, it can be seen that the QUBO problem convenient maps onto the Ising Hamiltonian with the mapping si = 2xi − 1.

Quantum annealing works as follows. An initial Hamiltonian H0 that is easy to construct is chosen. The system is evolved under a time-dependent Hamiltonian given by:

$$ \begin{array}{@{}rcl@{}} H(t) = (1-t)H_{0} + tH_{f} \end{array} $$
(28)

where t is gradually changed from 0 to 1 and the final Hamiltonian Hf is the same as in Eq. 27. At t = 0, the system starts in the ground state of H0. According to the quantum adiabatic theorem, the system remains in the ground state of the instantaneous Hamiltonian H(t) throughout its evolution provided it is changed sufficiently slowly (Ambainis and Regev 2004). At t = 1, the final Hamiltonian of the system will encode the solution to the problem.

Quantum annealing should not be conflated with the more general adiabatic quantum computing. Quantum annealing specifically solves optimization problems; adiabatic quantum computing is a model of quantum computing that is equivalent to a universal quantum computer. The Hamiltonians used in quantum annealing are classical Hamiltonians, while adiabatic quantum computing uses quantum Hamiltonians that have no classical counterparts (Biswas et al. 2017). For a more comprehensive treatment of adiabatic quantum computing, we refer the reader to (Albash and Lidar 2018).

3.3 Quantum algorithms for machine learning

In this section, we explore how the background and subroutines described in the previous sections can be applied to solve machine learning problems. The common supervised machine learning setting is as follows. The training set consists of M instances {x1,...,xM} where \(x_{i} \in \mathbb {R}^{N}\) with their corresponding labels {y1,...,yM}. Each label yi can be either a real value (for regression problems) or a discrete class label (for classification problems). In the training phase, the algorithm extracts patterns from the dataset and learns a model. In the inference phase, this model is used to process unseen instances and predict their corresponding labels.

3.3.1 k-Nearest neighbors

The k-nearest neighbors (KNN) is one of the simplest supervised learning algorithms. To predict the label of a new unseen instance xtest, the algorithm looks at k instances in the training set that are closest to xtest and chooses the class that appears most often in the labels of these k-nearest neighbors as the predicted label (for regression, the algorithm assigns the mean value of the k-nearest neighbors as the label). An advantage of KNN is that, unlike many other supervised algorithms, it is non-parametric and makes no assumptions about the data distribution. However, since all computation is deferred, inference can become prohibitively expensive for large training sets. During inference, the distance of the test instance from all other training instances is calculated; this is the most computationally intensive step in the process. Hence, quantum versions of KNN focus on faster evaluation of the distance between two instances.

Aïmeur et al. (2006) propose using the overlap |〈a|b〉| as computed by the swap test (Section 3.1.3) as a measure of similarity between |a〉 and |b〉. Llyod et al. (Lloyd et al. 2013) develop a technique based on the swap test to compute the distance between \(\overrightarrow {a}\) and \(\overrightarrow {b}\). A state \(|{\psi }\rangle =\frac {1}{\sqrt {2}}(|0,a\rangle +|1,b\rangle )\) is constructed by setting up an ancilla. A second state given by \(|\phi \rangle =\frac {1}{\sqrt {Z}}(|\overrightarrow {a}||0\rangle -|\overrightarrow {b}||1\rangle )\) is constructed where \(Z=|\overrightarrow {a}|^{2}+|\overrightarrow {b}|^{2}\). Using \(\langle x|x \rangle =|\overrightarrow {x}|^{-1}|\overrightarrow {x}|\), the authors make the observation that \(|\overrightarrow {a}-\overrightarrow {b}|^{2}=Z|\langle \phi |\psi \rangle |^{2}\). With this, the distance between \(\overrightarrow {a}\) and \(\overrightarrow {b}\) can be retrieved using a swap test (Schuld et al. 2015b). The authors use the above technique to implement the nearest-centroid classification algorithm in which the centroids of all training instances belonging to each class are precomputed; during inference, a class label is assigned to the test instance by calculating its distance from all centroids. Given a training set of M instances, this procedure solves the problem of classifying an N-dimensional test vector into one of several classes in \(\mathcal {O}(log(MN))\) compared to \(\mathcal {O}(MN)\) required by classical algorithms.

Wiebe et al. (2014a, 2014b) argue that the nearest-centroid classification presented above can perform poorly in practice since training instances are often embedded in complicated manifolds, and the centroids may lie outside these manifolds. They propose two fast methods for computing the distance between vectors based on an alternative representation of classical information in quantum states, amplitude amplification, and Durr-Hoyer minimum finding (Durr and Hoyer 1996).

3.3.2 Support vector machines

Support vector machines (SVM) (Cortes and Vapnik 1995) is a popular classification algorithm that determines the optimal hyperplane that separates instances of two classes in the training data and classifies test instances based on which side of the separating hyperplane they lie on. Given a set of training instances with their labels {(x1,y1),...,(xi,yi),...,(xM,yM)} where \(x_{i} \in \mathbb {R}^{N}\) and yi ∈{− 1,+ 1}, the algorithm learns an N-dimensional hyperplane given by w = [β1,β2,...,βN]T that separates the instances of the two classes with maximum margin. Classification of a test instance xt is performed as:

$$ \begin{array}{@{}rcl@{}} y_{t} = sign(w^{T}x_{t} + b) \end{array} $$
(29)

Since the instances may not be strictly separable, slack variables are introduced that provide a soft margin by allowing some data points to violate the margin criterion. This formulation is known as the maximum margin classifier or support vector classifier; this however still requires the data points to be linearly separable. SVMs overcome this limitation of linear separability by what is known as the kernel trick which transforms the feature space into a new, higher-dimensional feature space. Instances that were not linearly separable in the original feature space may be linearly separable in the new feature space. Mathematically, the kernel trick generalizes the dot product between two feature vectors 〈xi,xj〉 by a kernel function k(xi,xj). Different kernels can be chosen depending on the data distribution. Training an SVM involves solving the quadratic programming problem (Press et al. 2007):

$$ \begin{array}{@{}rcl@{}} minimize && E=\frac{1}{2}\sum\limits_{nm}\alpha_{n}\alpha_{m}t_{n}t_{m}k(x_{n},x_{m})-\sum\limits_{n}\alpha_{n}, \\ subject to && 0 \leq \alpha_{n} \leq C, \\ and{\kern25.5pt} && \sum\limits_{n}\alpha_{n}t_{n}=0 \end{array} $$
(30)

where \(\alpha _{n} \in \mathbb {R}\), C is the regularization parameter, and k(⋅,⋅) is the kernel function.

Anguita et al. (2003) observe that training support vector machines may be a hard problem and propose a quantum variant that uses Durr and Hoyer’s minimum finding (Durr and Hoyer 1996) based on Grover’s algorithm to solve the optimization problem. Rebentrost et al. (2012) suggest computing inner products using a quantum method based on an approach similar to the one discussed in Section 3.3.1 which leads to an exponential speed-up with respect to the dimension of the feature vector N. They also describe a least-squares reformulation of the SVM algorithm with slack variables ej that converts the quadratic optimization problem into a problem of solving a system of linear equations which leads to an additional exponential speed-up in terms of the number of training instances M:

$$ \begin{array}{@{}rcl@{}} y_{j}(\overrightarrow{w} \cdot \overrightarrow{x_{j}}+b) \geq 1 \rightarrow (\overrightarrow{w} \cdot \overrightarrow{x_{j}}+b)=y_{j}-y_{j}e_{j} \end{array} $$
(31)

As shown in Section 3.2, quantum annealing is particularly well-suited for solving optimization problems. Willsch et al. (2020) demonstrate a practical implementation of training SVMs on the commercially available D-Wave DW2000Q quantum annealer by formulating it as a QUBO problem. Mengoni and Di Pierro (2019) review approaches for designing quantum computing techniques for kernel methods in machine learning.

3.3.3 Neural networks

Artificial neural networks or simply neural networks were originally inspired from biological neural networks that model the activity of neurons in human brains. The basic building block of a neural network is the neuron, also called node or perceptron, that maps the input \(x \in \mathbb {R}^{N}\) to the output \(y \in \mathbb {R}\) as followsFootnote 8:

$$ \begin{array}{@{}rcl@{}} y = g\left( \sum\limits_{i=1}^{N}{w_{i}x_{i}} + b\right) \end{array} $$
(32)

where \(w_{i} \in \mathbb {R}\), \(b \in \mathbb {R}\), and \(g: \mathbb {R} \rightarrow \mathbb {R}\) is the activation function. The outputs of some neurons can be fed as inputs to other neurons thus creating layers within the network.

Even though neural networks were amongst the first machine learning algorithms to be proposed, their research stagnated for several decades from 1940s to 1980s due to the inherent difficulty and large computational power required to train them. They returned to popularity after the introduction of backpropagation (Rumelhart et al. 1986) which eased these problems by offering a faster method for training. In the last decade, with GPUs and cloud computing providing cheaper access to massive computational power, neural networks have dwarfed other learning algorithmsFootnote 9 to become one of the biggest success stories of modern computers finding applications in various industries including healthcare, manufacturing, finance, and analytics to solve problems in image processing, computer vision, natural language processing, predictive modeling, and many other areas. However, even today, significant resources are required to train neural networks, and training times for research and industrial problems can run into weeks or even months.

An obvious difficulty arises in considering quantum computation as a means for implementing neural networks. Quantum computation (and indeed quantum mechanics itself) is a theory fundamentally based on linear transformations, while an important practical advantage neural networks enjoy over many other learning algorithms is that they can model non-linear data distributions. Bringing non-linearity into quantum algorithms is a non-trivial task (Cao et al. 2017). However, classical neural networks do make heavy use of linear algebra, and the inherent randomness of quantum mechanical effects can be leveraged to automatically introduce noise in the training process to improve model robustness—something that needs to be done purposefully in classical training (Allcock et al. 2018). Numerous neural network architectures have emerged in recent times to tackle problems belonging to a wide range of supervised, unsupervised, and reinforcement learning tasks (Goodfellow et al. 2016). Research in this field has been scattered with different proposals addressing narrow problems in piecemeal style. We present here a select subset of these proposals.

Most early work on quantizingFootnote 10 neural networks focussed on Hopfield networks (1982) which differ from the neural networks presently used in practice. In a Hopfield network, all neurons have undirected connections with all other neurons as opposed to feed-forward networks that are organized as layers; also, each neuron outputs a binary 0 or 1 instead of a real number. Hopfield networks are used to model associative memories which allow retrieval of data based on content rather than addresses. Kak (1995) introduces the idea of quantum neural computation and Peruš (2000) describes a quantum associative network by drawing analogies between Hopfield networks and quantum information processing. Behrman et al. (2000) present a quantum realization of neural networks by showing that a single quantum dotFootnote 11 molecule can act as a recurrent quantum neural network. More recently, Rebentrost et al. (2018) present a technique based on quantum Hebbian learning and fast quantum matrix inversion to train Hopfield networks.

Boltzmann machines (Ackley et al. 1985), closely related to Hopfield networks, are stochastic generative networks that can learn a probability distribution over a set of inputs. They are trained by adjusting the interconnection weights between the neurons so that the thermal statistics of the system as described by the Boltzmann-Gibbs distribution reproduces the statistics of the data (Biamonte et al. 2017). Boltzmann machines can be conveniently represented by an Ising model whose spins encode features and interactions encode statistical dependencies between the features. In a restricted Boltzmann machine, connections exist only between neurons belonging to different layers; this makes them easier to train than fully-connected Boltzmann machines. Restricted Boltzmann machines can be stacked together to form deep belief networks that can be used to learn internal representations or can be trained under supervision to perform classification. Training Boltzmann machines is exponentially hard and is performed using approximation techniques like contrastive divergence (Hinton 2002; Salakhutdinov and Hinton 2009) that rely on Gibbs sampling. Wiebe et al. (2014a, 2014b) propose quantum methods to efficiently train full Boltzmann machines by preparing a coherent analog of the Gibbs state from which samples can be drawn. Adachi et al. (2015) investigate an alternative approach of performing the sampling on a D-Wave quantum annealer instead of classical Gibbs sampling.

A different line of research involves developing quantum analogs for classical perceptrons. Schuld et al. (2015a) introduce a quantum perceptron model with a step activation function that can be used to develop superposition-based learning schemes in which a superposition of training vectors can be processed in parallel. Kapoor et al. (2016) develop two quantum techniques for modeling perceptrons; the first provides quadratic speed-up with respect to the number of training instances, and the second provides a quadratic reduction in the scaling of the training time with the margin between the two classes.

Feedforward networks are one of the simplest neural network architectures in which the connections between neurons do not form any loops or cycles. They are usually trained using backpropagation, and the optimization is performed using some variant of gradient descent. Most machine learning architectures used in practice today are based on feedforward networks or their derivatives such as convolutional neural networks or recurrent neural networks (Goodfellow et al. 2016). Allcock et al. (2018) define an efficient quantum subroutine for robust inner product estimation using QRAM (Giovannetti et al. 2008) and use it to demonstrate quadratic speed-ups in the size of the network over classical counterparts; they additionally claim that the proposed quantum method naturally imitates regularization techniques like drop-out leading to more robust networks. Farhi et al. (2000) present a general framework for binary classification specifically designed for near-term quantum processors in which the input strings are mapped to computational basis states, and the predicted label is given by the outcome of a Pauli operator measured on a readout qubit. This framework extends to a full quantum neural network that can classify both classical and quantum data. Convolutional neural networks (CNNs) (Lecun et al. 1998) have achieved great success (Krizhevsky et al. 2012) in image classification tasks in recent times. They, however, suffer from the fact that the operation of convolution is computationally expensive. Kerenidis et al. (2019) design a quantum CNN based on quantum computation of the convolution product between two tensors. They also propose a quantum tomographyFootnote 12 sampling approach to recover classical information from the network output and a quantum backpropagation algorithm for efficient training of the quantum CNN.

3.3.4 Variational quantum classifiers

As discussed in the preceding sections, a trove of quantum techniques have been proposed to solve problems in machine learning. Most of them, however, require computational resources that are far beyond what quantum computers provide today. A new research direction has emerged in recent times that focuses on developing techniques that can be practically implemented on the noisy, intermediate-scale quantum (NISQ) computers that we expect to have in the next few years. These techniques need fewer than 100 qubits, are robust against errors, and are generally aimed at solving problems that have immediate useful applications.

A quantum circuit is composed of a series of unitary transformations that act on the given input state. Such a circuit can be trained to solve a classification problem by tuning its parameters based on how a cost function responds to the training data. Low-depth quantum circuits are suitable for error-correction (Li and Benjamin 2017; Temme et al. 2017), and this makes them a natural candidate for quantum machine learning on NISQ processors. A quantum classifier consists of a circuit U(𝜃) parameterized by the set of parameters 𝜃 that is initialized in a fixed initial state (usually the zero state). An observable \(\hat {B}\) is defined to represent the output of the circuit. The cost function of such a classifier can be defined as \(f(\theta )=\langle {0}|U^{\dag }(\theta )\hat {B}U(\theta )|0\rangle \). The training process involves finding optimum candidates for the parameters 𝜃 to minimize the cost function f(𝜃). This optimization is carried out using classical methods, most often some variation of gradient descent.

A general theory of variational algorithms and quantum circuit learning is presented in McClean et al. (2016) and Mitarai et al. (2018). Havlíček et al. (2019) propose using a variational quantum circuit as a classifier by representing the feature space as quantum states. Schuld et al. (2020) present a circuit-based supervised classification algorithm for amplitude-encoded data that can be trained using a hybrid quantum-classical scheme of gradient descent. Farhi and Neven (2018) introduce a quantum neural network based on variational circuits for near-term quantum computers that learn boolean functions by tuning the circuit parameters.

Ensemble methods have achieved state-of-art performance on many machine learning challenges. In an ensemble, multiple models are trained on the same task, and their predictions are weighed or combined together to arrive at the final decision. Schuld and Petruccione (2018a) introduce the quantum ensemble of quantum classifiers where the parameters of the classifier are held in superposition allowing a quantum computer to train an exponentially large number of classifiers in parallel. Abbas et al. (2020) show that this quantum ensemble framework contains a classically intractable computation that can be solved efficiently on a quantum computer by reducing it to the Deutsch Jozsa algorithm. Chen et al. (2021) extend circuit learning to quantum data and propose using a shallow parameterized quantum circuit trained using classical methods to classify quantum states.

4 Discussion

Quantum computation has made great strides in the last two decades in both theory and practice. A significant corpus of research has emerged in applying the principles of quantum computation to problems across many fields of science and engineering. At the same time, several approaches for physical realization of quantum computers based on superconducting quantum bits, trapped ions, optical lattices, photonic computing, nuclear magnetic resonance, etc. have shown promise. However, fundamental challenges remain unresolved on both fronts. While developing quantum algorithms, we must consider the input problem and the output problem (Biamonte et al. 2017). The input problem refers to the fact that the cost of reading classical data and encoding it in quantum states can sometimes dominate performance and render the further downstream speed-up irrelevant. The output problem refers to the reverse process of decoding the full classical solution from quantum states. Some important hardware challenges to constructing, operating, and maintaining large-scale quantum computers include achieving longer coherence, greater circuit depth, higher qubit quality, and higher control over qubits. Quantum error correction plays an important role, and it will likely span across hardware and software in the future.

It is important to distinguish between quantum-enhanced machine learning that focuses on techniques to implement learning on classical data using quantum computers from quantum-generalisation of machine learning algorithms that deals with developing fully quantum algorithms that work with quantum data. This is especially true for neural networks for which active research is underway on both fronts. We restrict this paper to quantum-enhanced techniques and refer the reader to Cao et al. (2017), Beer et al. (2020), Amin et al. (2018), Wan et al. (2017), and Cong et al. (2019) for work on quantum generalisations. The principles of quantum computing have inspired development of new classical randomized algorithms that show exponential speed-ups over conventional algorithms (Tang 2019). With these quantum-inspired algorithms, the gap between certain classical and quantum algorithms is no longer exponential but polynomial. Arrazola et al. (2019) provide a study of these algorithms and observe that they work well only under stringent conditions which occur rarely in practice. We do not cover these in this paper.

Owing to its roots in quantum physics, research in quantum computing has so far been confined within the purview of the physics community. Although realization of quantum computers in the form of hardware will remain a problem for physicists, we believe this need not be the case when it comes to applying quantum computing to solve machine learning problems. Classical computing and machine learning, like physics and many other fields, serve as prime examples of disciplines where theoretical results were obtained far before technological progress made possible their experimental realizations. Small-scale quantum computers with less than 100 qubits and quantum annealers with around 2000 qubits have been developed and are already being sold commercially (Castelvecchi 2017; Johnston 2013). We hope this article serves its purpose as introductory material for interested machine learning researchers and practitioners from various disciplines.