2.1 Overview

I begin, in Sect. 2.2, by presenting the postulates of quantum mechanics as a natural generalization of probability theory. In Sect. 2.3 I describe the basic entanglement phenomena of “super dense coding”, “teleportation”, and Bell’s confirmation of the “paradox” proposed by Einstein-Podolsky-Rosen. In Sect. 2.4 I outline aspects of the basic quantum algorithms, emphasizing the geometry involved. Section 2.5 is a detour into classical information theory, which is the basis of its quantum cousin briefly discussed in Sect. 2.7. Before that, in Sect. 2.6, I reformulate quantum theory in terms of density operators, which facilitates the discussion of quantum information theory. Critical to quantum information theory is the von Neumann entropy, and in Sect. 2.8 I elaborate on some of its properties. A generalization of “teleportation” is discussed in Sect. 2.9. Regarding practical computation, the exponential growth in size of \(({\mathbb C}^2)^{{\mathord { \otimes } } n}\) with n that appears in quantum information theory leads to the notion of “feasible” states discussed in Sect. 2.10, which has interesting algebraic geometry associated to it. I conclude with a discussion of representation-theoretic aspects of quantum information theory, including a discussion of the quantum marginal problem in Sect. 2.11. I do not discuss topological quantum computing, which utilizes the representation theory of the braid group. For those interested in more details from this perspective, see [18].

2.2 Quantum Computation as Generalized Probabilistic Computation

In this section I take the point of view, advocated in [1] and other places, that quantum computing should be viewed as a natural generalization of probabilistic computing, and more generally that the laws of quantum mechanics should be viewed as generalizations of the laws of probability.

2.2.1 Classical and Probabilistic Computing via Linear Algebra

This section is inspired by Arora and Barak [2, Exercise 10.4].

Classical computation deals with bits, elements of {0, 1}, which it will be convenient to think of as elements of \(\mathbb {F}_2\), the field with two elements. Let \(f_n: \mathbb {F}_2^n\rightarrow \mathbb {F}_2\) be a sequence of functions. Give \({\mathbb R}^2\) the basis {|0〉, |1〉} (such notation is standard in quantum mechanics) and give \(({\mathbb R}^2)^{{\mathord { \otimes } } m}={\mathbb R}^{2^m}\) the basis \(\{|{I}\rangle \mid I \in \{0, 1\}^m\}\). In this way, we may identify \(\mathbb {F}_2^m\) with the set of basis vectors of \({\mathbb R}^{2^m}\). A computation of \(f_n\) (via an arithmetic or Boolean circuit) may be phrased as a sequence of linear maps on a vector space containing \({\mathbb R}^{2^n}\), where each linear map comes from a set fixed in advance. In anticipation of what will come in quantum computation, this pre-fixed set of maps (called gates in the literature) will be taken from maps having the following properties:

1.

    Each linear map must take probability distributions to probability distributions. This implies the matrices are stochastic: the entries are non-negative and each column sums to 1.

2.

    Each linear map only alters a small number of entries. For simplicity assume it alters at most three entries, i.e., it acts on at most \({\mathbb R}^{2^3}\) and is the identity on all other factors in the tensor product.

In quantum computation, the first property will be replaced by requiring the linear maps to be completely positive and trace preserving (see Sect. 2.7). The second is the same and justified because “universal” quantum computing is possible with such maps, even requiring the three factors to be adjacent, which is essentially due to the classical Cartan-Dieudonné theorem.

To facilitate comparison with quantum computation, first restrict to reversible classical computation. The complexity class of a sequence of functions in classical reversible computation is the same as in arbitrary classical computation.

For example, if we want to effect (x, y)↦x ∗ y, consider the map

$$\displaystyle \begin{aligned} |{ x,y,z}\rangle\mapsto |{x,y,z\oplus (x*y)}\rangle=|{ x,y,z\oplus (x\wedge y)}\rangle \end{aligned} $$
(2.1)

(where the second expression is for those preferring Boolean notation) and act as the identity on all other basis vectors (sometimes called registers). Here z will represent “workspace bits”: x, y will come from the input and z will always be set to 0 in the input. In the basis |000〉, |001〉, |010〉, |100〉, |011〉, |101〉, |110〉, |111〉 of \({\mathbb R}^8\), the matrix is

$$\displaystyle \begin{aligned} \begin{pmatrix} 1 &0 &0 &0 &0 &0 &0&0\\ 0 &1 &0 &0 &0 &0 &0&0\\ 0 &0 &1 &0 &0 &0 &0&0\\ 0 &0 &0 &1 &0 &0 &0&0\\ 0 &0 &0 &0 &1 &0 &0&0\\ 0 &0 &0 &0 &0 &1 &0&0\\ 0 &0 &0 &0 &0 &0 &0&1\\ 0 &0 &0 &0 &0 &0 &1&0 \end{pmatrix}. \end{aligned} $$
(2.2)

This gate is sometimes called the Toffoli gate and the matrix the Toffoli matrix.
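As a sanity check, here is a minimal numpy sketch that builds the Toffoli matrix from the rule (2.1) (using the lexicographic basis order |000〉, |001〉, |010〉, |011〉, …, rather than the ordering of (2.2)) and verifies the two required properties:

```python
import itertools
import numpy as np

# Build the Toffoli matrix from |x,y,z> -> |x,y,z xor (x and y)>,
# with basis index 4x + 2y + z (lexicographic ordering).
T = np.zeros((8, 8))
for x, y, z in itertools.product([0, 1], repeat=3):
    T[4 * x + 2 * y + (z ^ (x & y)), 4 * x + 2 * y + z] = 1

assert (T.sum(axis=0) == 1).all()      # columns sum to 1: stochastic
assert np.allclose(T @ T, np.eye(8))   # its own inverse: reversible
```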

The swap (negation) gate ¬ is realized by the matrix

$$\displaystyle \begin{aligned} \sigma_x=\begin{pmatrix} 0 & 1\\ 1&0\end{pmatrix}. \end{aligned} $$
(2.3)

The swap and Toffoli matrices can perform any computation that is accomplished via a sequence of matrices drawn from some finite set of Boolean operations, each acting on a bounded number of basis vectors, with at worst a polynomial (in n) increase in the number of matrices needed. For those familiar with Boolean circuits: any sequence of Boolean circuits (one for each n) may be replaced by a sequence with just Toffoli and negation gates with at worst a polynomial (in n) blow-up in size.

A probability distribution on {0, 1}m may be encoded as a vector in \({\mathbb R}^{2^m}\): If the probability distribution assigns probability p I to I ∈{0, 1}m, assign to the distribution the vector \(v=\sum _I p_I |{I}\rangle \in {\mathbb R}^{2^m}\).

The matrices (2.2), (2.3) realize classical computation. To add randomness to enable probabilistic computation, introduce the matrix

$$\displaystyle \begin{aligned}\begin{pmatrix} \frac 12 & \frac 12\\ \frac 12 &\frac 12\end{pmatrix} \end{aligned}$$

which acts on a single \({\mathbb R}^2\) corresponding to a fair coin flip. Note that the coin flip matrix is not invertible, which will be one motivation for quantum computation in Sect. 2.2.2. Work in \({\mathbb R}^{2^{n+s+r}}\) where r is the number of times one needs to access a random choice and s is the number of matrices (arithmetic operations) in addition to the coin tosses needed to compute f.
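A two-line numpy check of this non-invertibility (a sketch):

```python
import numpy as np

F = np.array([[0.5, 0.5],
              [0.5, 0.5]])       # fair coin flip: columns sum to 1

print(F @ np.array([0.3, 0.7]))  # [0.5, 0.5]: any pbit becomes the fair coin
print(np.linalg.det(F))          # 0.0: the coin flip cannot be undone
```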

A probabilistic computation, viewed this way, starts with \(|{x0^{r+s}}\rangle \), where \(x\in \mathbb {F}_2^n\) is the input. One then applies a sequence of admissible stochastic linear maps to it, and ends with a vector that encodes a probability distribution on \(\{0,1\}^{n+s+r}\). One then restricts this to \(\{0,1\}^{p(n)}\) (p(n) denoting the number of output bits), that is, one takes the vector and throws away all but the first p(n) entries. This vector encodes a probability sub-distribution, i.e., all coefficients are non-negative and they sum to a number between zero and one. One then renormalizes (dividing each entry by the sum of the entries) to obtain a vector encoding a probability distribution on \(\{0,1\}^{p(n)}\) and then outputs the answer according to this distribution. Note that even if our calculation is feasible (i.e., polynomial in size), writing out the original output vector that one truncates would be exponential in cost. A stronger variant of this phenomenon will occur with quantum computing, where the result will be obtained with a polynomial size calculation, but one does not have access to the vector created, even using an exponential amount of computation.

To further prepare for the analogy with quantum computation, define a probabilistic bit (a pbit) to be an element of

$$\displaystyle \begin{aligned}\{ p_0 |{0}\rangle +p_1 |{1}\rangle \mid p_j\in [0,1]\ \mathrm{and}\ p_0+p_1=1\} \subset {\mathbb R}^2. \end{aligned}$$

Note that the set of pbits (possible states) is a convex set, and the basis vectors are the extremal points of this convex set.

2.2.2 A Wish List

Here is a wish list for how one might want to improve upon the above set-up:

1.

    Allow more general kinds of linear maps to get more computing power, while keeping the maps easy to compute.

2.

    Have reversible computation: we saw that classical computation can be made reversible, but the coin flip was not. This property is motivated by physics, where many physical theories require time reversibility.

3.

    Again motivated by physics, one would like to have a continuous evolution of the probability vector, more precisely, one would like the probability vector to depend on a continuous parameter t such that if \(|{\psi _{t_1}}\rangle =X|{\psi _{t_0}}\rangle \), then there exist admissible matrices Y, Z such that \(|{\psi _{t_0+\frac 12t_1}}\rangle =Y|{\psi _{t_0}}\rangle \) and \(|{\psi _{t_1}}\rangle =Z|{\psi _{t_0+\frac 12 t_1}}\rangle \) and X = ZY . In particular, one wants operators to have square roots. (Physicists sometimes state this as “time evolution being described by a semi-group”.)

One way to make the coin flip reversible is, instead of having the probability distribution determined by the sum of the coefficients, to have it determined by the sum of their squares. If one does this, there is no harm in allowing the entries of the output vectors to become negative, and one could use

$$\displaystyle \begin{aligned} H:=\frac 1{\sqrt{2}}\begin{pmatrix} 1&1\\ 1& -1\end{pmatrix} \end{aligned} $$
(2.4)

for the coin flip. The matrix H is called the Hadamard matrix or Hadamard gate in the quantum computing literature. If we make this change, we obtain our second wish, and moreover have many operations be “continuous”, because the set of matrices preserving the norm-squared of a real-valued vector is the orthogonal group \(O(n)=\{ A\in Mat_{n\times n} \mid AA^T=\operatorname {Id}\}\). So for example, any rotation has a square root.

However our third property will not be completely satisfied, as the matrix

$$\displaystyle \begin{aligned}\begin{pmatrix} 1&0\\ 0&-1\end{pmatrix} \end{aligned}$$

which represents a reflection, does not have a square root in O(2).

To have the third wish satisfied, allow vectors with complex entries. From now on let \(i=\sqrt {-1}\). For a complex number z = x + iy let \(\overline {z}=x-iy\) denote its complex conjugate and \(|z|{ }^2=z\overline {z}\) the square of its norm.

So we go from pbits, {p|0〉 + q|1〉∣ p, q ≥ 0 and p + q = 1} to qubits, the set of which is

$$\displaystyle \begin{aligned} \{\alpha |{0}\rangle +\beta |{1}\rangle\mid \alpha,\beta\in {\mathbb C}\ \mathrm{and}\ |\alpha|{}^2+|\beta|{}^2=1 \}. \end{aligned} $$
(2.5)

The set of qubits, considered in terms of real parameters, looks at first like the 3-sphere \(S^3\) in \({\mathbb R}^4\simeq {\mathbb C}^2\). However, the probability distributions induced by |ψ〉 and \(e^{i\theta }|{\psi }\rangle \) are the same, so the state space is really \(S^3/S^1\) (the Hopf fibration), i.e., the two-sphere \(S^2\). In the quantum literature this is referred to as the Bloch sphere. Geometrically, it would be more natural (especially since we have already seen the need to re-normalize in probabilistic computation) to work with projective space \({\mathbb C}\mathbb P^1\simeq S^2\) as our space of qubits, instead of a subset of \({\mathbb C}^2\). So the set of qubits is better seen as (2.5) modulo the equivalence \(|{\psi }\rangle \sim e^{i\theta }|{\psi }\rangle \).

For \(v=(v_1{, \hdots ,} v_n)\in {\mathbb C}^n\), write \(|v|^2 = |v_1|^2 + \cdots + |v_n|^2\). The set of stochastic matrices is now replaced by the unitary group

$$\displaystyle \begin{aligned}{\mathbf{U}}(n):=\{ A\in Mat_{n\times n}({\mathbb C}) \mid \ |Av|=|v|\ \forall |{v}\rangle\in {\mathbb C}^n\}. \end{aligned}$$

The unitary group satisfies the third wish on the list: for all A ∈U(n), there exists a matrix B ∈U(n) satisfying \(B^2 = A\).

Consider wish 1: it is an open question! However at least our generalized probabilistic computation includes our old probabilistic computation because the matrices (2.2), (2.3), (2.4) are unitary.

An indication that generalized probability may be related to quantum mechanics is that the interference patterns observed in the famous two-slit experiments are manifested in generalized probability: one obtains a “random bit” by applying H to |0〉: \(H|{0}\rangle =\frac 1{\sqrt {2}}(|{0}\rangle + |{1}\rangle )\). However, if one applies a second quantum coin flip, one loses the randomness: since \(H^2=\operatorname {Id}\), one has \(H^2|{0}\rangle = |{0}\rangle \), which, as pointed out in [1], could be interpreted as a manifestation of interference.
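This is easy to check numerically; a minimal numpy sketch:

```python
import numpy as np

H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)
ket0 = np.array([1, 0])

print(H @ ket0)       # [0.707, 0.707]: squares give probability 1/2 each
print(H @ H @ ket0)   # [1., 0.]: the randomness interferes away, H^2 = Id
```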

2.2.3 Postulates of Quantum Mechanics and Relevant Linear Algebra

Here are the standard postulates of quantum mechanics and relevant definitions from linear algebra.

P1

Associated to any isolated physical system is a Hilbert space \({\mathcal H}\), called the state space. The system is completely described at a given moment by a unit vector \(|{\psi }\rangle \in {\mathcal H}\), called its state vector, which is well defined up to a phase \(e^{i\theta }\) with \(\theta \in {\mathbb R}\). Alternatively one may work in projective space \({\mathbb P} {\mathcal H}\).

Explanations

A Hilbert space \({\mathcal H}\) is a (complete) complex vector space endowed with a non-degenerate Hermitian inner-product, \(h: {\mathcal H}\times {\mathcal H}\rightarrow {\mathbb C}\), where by definition h is linear in the first factor and conjugate linear in the second, \(h(|{v}\rangle ,|{w}\rangle )=\overline {h(|{w}\rangle ,|{v}\rangle )}\) for all v, w, and h(|v〉, |v〉) > 0 for all |v〉≠ 0.

The Hermitian inner-product h allows an identification of \({\mathcal H}\) with \({\mathcal H}^*\) by |w〉↦〈w| := h(⋅, |w〉). This identification will be used repeatedly. Write h(|v〉, |w〉) = 〈w|v〉 and \(|v|=\sqrt {\langle {v}|{v}\rangle }\) for the length of |v〉.

If \({\mathcal H}= {\mathbb C}^n\) with its standard basis, where \(|{v}\rangle = (v_1, \hdots , v_n)\), the standard Hermitian inner-product on \({\mathbb C}^n\) is \(\langle {w}|{v}\rangle =\sum _{j=1}^n \overline {w}_j v_j\). I will always assume \({\mathbb C}^n\) is equipped with its standard Hermitian inner-product.

Remark 2.2.1

When studying quantum mechanics in general, one needs to allow infinite dimensional Hilbert spaces, but in the case of quantum computing, one restricts to finite dimensional Hilbert spaces, usually \(({\mathbb C}^2)^{{\mathord { \otimes } } N}\).

P2

The state of an isolated system evolves with time according to the Schrödinger equation

$$\displaystyle \begin{aligned}i\hbar \frac{d|{\psi}\rangle}{dt} = X|{\psi}\rangle \end{aligned}$$

where ħ is a constant (Planck’s constant) and X is a fixed Hermitian operator, called the Hamiltonian of the system. (Physicists, enamored of the letter H, often also use it to denote the Hamiltonian.) Here, recall that the adjoint of an operator \(X\in \operatorname {End}({\mathcal H})\) is the operator \(X^\dag \in \operatorname {End}({\mathcal H})\) such that \(\langle X^\dag v|w\rangle = \langle v|Xw\rangle \) for all \(v,w\in {\mathcal H}\), and X is Hermitian if \(X = X^\dag \). For a general Hilbert space, the unitary group is \({\mathbf {U}}({\mathcal H}):=\{ U\in \operatorname {End}({\mathcal H})\mid |Uv|=|v|\ \forall |{v}\rangle \in {\mathcal H}\}\).

How is generalized probability related to Schrödinger’s equation? Let \(U(t)\subset {\mathbf {U}}({\mathcal H})\) be a smooth curve with \(U(0)=\operatorname {Id}\). Write \(U'(0)=\frac d{dt}|{ }_{t=0} U(t)\). Consider

$$\displaystyle \begin{aligned} 0&= \frac d{dt}|{}_{t=0}\langle {v}|{w}\rangle\\ &= \frac d{dt}|{}_{t=0}\langle {U(t)v}|{U(t)w}\rangle\\ &= \langle {U'(0)v}|{w}\rangle + \langle {v}|{U'(0)w}\rangle. \end{aligned} $$

Thus iU′(0) is Hermitian. We are almost at Schrödinger’s equation. Let \({\mathfrak u}({\mathcal H})=T_{\operatorname {Id}}{\mathbf {U}}({\mathcal H})\) denote the Lie algebra of \({\mathbf {U}}({\mathcal H})\), so \(i {\mathfrak u}({\mathcal H})\) is the space of Hermitian endomorphisms. For \(X\in \operatorname {End}({\mathcal H})\), write \(X^k\in \operatorname {End}({\mathcal H})\) for X⋯X applied k times. Write \(e^X:=\sum _{k=0}^\infty \frac 1{k!} X^k\). If X is Hermitian, then \(e^{iX}\in {\mathbf {U}}({\mathcal H})\). Postulate 2 implies the system will evolve unitarily, by (assuming one starts at t = 0) \(|{\psi _t}\rangle = U(t)|{\psi _0}\rangle \), where

$$\displaystyle \begin{aligned}U(t)=e^{\frac{-itX }{\hbar}}. \end{aligned}$$
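As a numerical illustration (a sketch with an arbitrary Hermitian X, in units where ħ = 1), \(e^{-itX}\) is unitary for every t:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
X = (A + A.conj().T) / 2            # an arbitrary Hermitian "Hamiltonian"

for t in [0.1, 1.0, 2.5]:
    U = expm(-1j * t * X)           # U(t) = e^{-itX}, taking hbar = 1
    assert np.allclose(U.conj().T @ U, np.eye(4))   # unitary for all t
```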

Measurements

Our first two postulates dealt with isolated systems. In reality, no system is isolated and the whole universe is modeled by one enormous Hilbert space. In practice, parts of the system are sufficiently isolated that they can be treated as isolated systems. However, they are occasionally acted upon by the outside world, and one needs a way to describe this outside interference. For our purposes, the isolated systems will be the Hilbert space attached to the input in a quantum algorithm and the outside interference will be the measurement at the end. That is, after a sequence of unitary operations one obtains a vector \(|{\psi }\rangle =\sum z_j |{j}\rangle \) (here implicitly assuming the Hilbert space is of countable dimension), and as in generalized probability:

P3

The probability of obtaining outcome j under a measurement is \(|z_j|^2\).

In Sect. 2.6, motivated again by probability, P1, P3 will be generalized to new postulates that give rise to the same theory, but are more convenient to work with in information theory.

A typical situation in quantum mechanics and quantum computing is that there are two or more isolated systems, say \({\mathcal H}_A,{\mathcal H}_B\) that are brought together (i.e., allowed to interact with each other) to form a larger isolated system \({\mathcal H}_{AB}\). The larger system is called the composite system. In classical probability, the composite space is \(\{ 0,1\}^{N_A}\times \{ 0,1\}^{N_B}\). In our generalized probability, the composite space is \(({\mathbb C}^2)^{{\mathord { \otimes } } N_A}{\mathord { \otimes } } ({\mathbb C}^2)^{{\mathord { \otimes } } N_B}=({\mathbb C}^2)^{{\mathord { \otimes } } (N_A+N_B)}\):

P4

The state of a composite system \({\mathcal H}_{AB}\) is the tensor product of the state spaces of the component physical systems \({\mathcal H}_A,{\mathcal H}_B\): \({\mathcal H}_{AB}={\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\).

When dealing with composite systems, we will need to allow partial measurements whose outcomes are of the form \(|{I}\rangle {\mathord { \otimes } } |{\phi }\rangle \) with |ϕ〉 arbitrary.

This tensor product structure gives rise to the notion of entanglement, which accounts for phenomena outside of our classical intuition, as discussed in the next section.

Definition 2.2.2

A state \(|{\psi }\rangle \in {\mathcal H}_1{\mathord {\otimes \cdots \otimes } } {\mathcal H}_n\) is called separable if it corresponds to a rank one tensor, i.e., \(|{\psi }\rangle = |{v_1}\rangle {\mathord {\otimes \cdots \otimes } } |{v_n}\rangle \) with each \(|{v_j}\rangle \in {\mathcal H}_j\). Otherwise it is entangled.

2.3 Entanglement Phenomena

2.3.1 Super-Dense Coding

Physicists describe their experiments in terms of two characters, Alice and Bob. I generally follow this convention. Let \({\mathcal H}={\mathbb C}^2{\mathord { \otimes } } {\mathbb C}^2={\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\), and let \(|{epr}\rangle = \frac {|{00}\rangle +|{11}\rangle }{\sqrt {2}}\) (called the EPR state in the physics literature after Einstein-Podolsky-Rosen). Assume this state has been created, both Alice and Bob are aware of it, Alice is in possession of the first qubit, and Bob the second. In particular Alice can act on the first qubit by unitary matrices and Bob can act on the second. This all happens before the experiment begins.

Now say Alice wants to transmit a two classical bit message to Bob, i.e., one of the four states |00〉, |01〉, |10〉, |11〉 by transmitting qubits. We will see that she can do so transmitting just one qubit. If she manipulates her qubit by acting on the first \({\mathbb C}^2\) by a unitary transformation, |epr〉 will be manipulated. She uses the following matrices depending on the message she wants to transmit:

$$\displaystyle \begin{aligned} \begin{array}{ccc} \text{to transmit} & \text{act on the first qubit by} & \text{to obtain}\\ 00 & \operatorname{Id} & \frac 1{\sqrt{2}}(|{00}\rangle + |{11}\rangle)\\ 01 & \sigma_x=\begin{pmatrix} 0&1\\ 1&0\end{pmatrix} & \frac 1{\sqrt{2}}(|{10}\rangle + |{01}\rangle)\\ 10 & \sigma_z=\begin{pmatrix} 1&0\\ 0&-1\end{pmatrix} & \frac 1{\sqrt{2}}(|{00}\rangle - |{11}\rangle)\\ 11 & \sigma_y=\begin{pmatrix} 0&-i\\ i&0\end{pmatrix} & \frac i{\sqrt{2}}(|{10}\rangle - |{01}\rangle) \end{array} \end{aligned}$$

where the names \(\sigma _x\), \(\sigma _y\), \(\sigma _z\) are traditional in the physics literature (the Pauli matrices), and the four states in the “to obtain” column form an orthonormal basis (the Bell basis). If Alice sends Bob her qubit, so he is now in possession of the modified |epr〉 (although he does not see it), he can determine which of the four messages she sent him by measuring the state in his possession. More precisely, first Bob acts on \({\mathbb C}^2{\mathord { \otimes } } {\mathbb C}^2\) by a unitary transformation that takes the orthonormal basis in the “to obtain” column to the standard orthonormal basis (the composition of the map |xy〉↦|x(y ⊕ x)〉 with a Hadamard matrix on the first factor), to obtain a state vector whose probability is concentrated at one of the four classical states. He then measures, and obtains the correct classical state with probability one.

In summary, with preparation of an EPR state in advance, plus transmission of a single qubit, one can transmit two classical bits of information.
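The protocol is short enough to simulate outright. A numpy sketch (the rows of `bell` below are the four states of the “to obtain” column; Bob's change of basis followed by a measurement is the same as measuring in this basis):

```python
import numpy as np

s2 = np.sqrt(2)
I2 = np.eye(2)
sx = np.array([[0, 1], [1, 0]])
sy = np.array([[0, -1j], [1j, 0]])
sz = np.array([[1, 0], [0, -1]])

epr = np.array([1, 0, 0, 1]) / s2          # (|00> + |11>)/sqrt(2)

bell = np.array([[1, 0, 0, 1],             # (|00>+|11>)/sqrt(2)
                 [0, 1, 1, 0],             # (|01>+|10>)/sqrt(2)
                 [1, 0, 0, -1],            # (|00>-|11>)/sqrt(2)
                 [0, 1, -1, 0]]) / s2      # (|01>-|10>)/sqrt(2)

for msg, U in zip(['00', '01', '10', '11'], [I2, sx, sz, sy]):
    state = np.kron(U, I2) @ epr               # Alice acts on her qubit only
    probs = np.abs(bell.conj() @ state) ** 2   # Bob's Bell-basis measurement
    print(msg, np.round(probs, 3))             # all probability on one outcome
```

(The σ_y case produces the fourth Bell state only up to a global phase, which the probabilities do not see.)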

2.3.2 Quantum Teleportation

Here again, Alice and Bob share half of an EPR state, Alice is in possession of a qubit |ψ〉 = α|0〉 + β|1〉, and wants to “send” |ψ〉 to Bob. However Alice is only allowed to transmit classical information to Bob. We will see that she can accomplish her goal by transmitting two classical bits. Write the state of the system as

$$\displaystyle \begin{aligned}\frac 1{\sqrt{2}}\left[ \alpha|{0}\rangle{\mathord{ \otimes } } (|{00}\rangle+|{11}\rangle)+\beta|{1}\rangle{\mathord{ \otimes } } (|{00}\rangle+|{11}\rangle) \right] \end{aligned}$$

where Alice can operate on the first two qubits. If Alice applies to the first two qubits a CNOT, \(|{xy}\rangle \mapsto |{x(y\oplus x)}\rangle \), followed by \(H{\mathord { \otimes } } \operatorname {Id}= \frac 1{\sqrt {2}}\begin {pmatrix} 1&1\\ 1&-1\end {pmatrix}{\mathord { \otimes } } \begin {pmatrix} 1&0\\ 0&1\end {pmatrix}\), she obtains

$$\displaystyle \begin{aligned}\frac 12\left[ |{00}\rangle{\mathord{ \otimes } } (\alpha|{0}\rangle+\beta|{1}\rangle) +|{01}\rangle{\mathord{ \otimes } }(\alpha|{1}\rangle+\beta|{0}\rangle) +|{10}\rangle{\mathord{ \otimes } } (\alpha|{0}\rangle-\beta|{1}\rangle)+ |{11}\rangle{\mathord{ \otimes } } (\alpha|{1}\rangle-\beta|{0}\rangle) \right]. \end{aligned}$$

Notice that Bob’s coefficient of Alice’s |00〉 is the state |ψ〉 that is to be transmitted. Alice performs a measurement. If she has the good luck to obtain |00〉, then she knows Bob has |ψ〉 and she can tell him classically that he is in possession of |ψ〉. But say she obtains the state |01〉: the situation is still good, she knows Bob is in possession of a state such that, if he acts on it with \(\sigma _x=\begin {pmatrix}0&1\\ 1& 0\end {pmatrix}\), he will obtain the state |ψ〉, so she just needs to tell him classically to apply σ x. Since they had communicated the algorithm in the past, all Alice really needs to tell Bob in the first case is the classical message 00 and in the second case the message 01. The cases of 10 and 11 are similar.

In summary, a shared EPR pair plus sending two classical bits of information allows transmission of one qubit.
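Here is a numpy sketch of the four branches; the corrections for the outcomes 10 and 11, worked out as in the 01 case, are σ_z and σ_zσ_x:

```python
import numpy as np

s2 = np.sqrt(2)
I2 = np.eye(2)
sx = np.array([[0, 1], [1, 0]])
sz = np.array([[1, 0], [0, -1]])
H = np.array([[1, 1], [1, -1]]) / s2

alpha, beta = 0.6, 0.8                    # an arbitrary qubit to teleport
psi = np.array([alpha, beta])
epr = np.array([1, 0, 0, 1]) / s2
state = np.kron(psi, epr)                 # Alice holds qubits 1, 2

CNOT = np.array([[1, 0, 0, 0],            # |xy> -> |x (y xor x)>
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])
state = np.kron(np.kron(H, I2) @ CNOT, I2) @ state

corrections = {'00': I2, '01': sx, '10': sz, '11': sz @ sx}
for k, (ab, U) in enumerate(corrections.items()):
    branch = state[2 * k: 2 * k + 2]      # Bob's qubit, given Alice measured ab
    bob = U @ (branch / np.linalg.norm(branch))
    assert np.allclose(bob, psi)          # Bob recovers alpha|0> + beta|1>
```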

Remark 2.3.1

In the literature this phenomenon is named quantum teleportation. Since information is transmitted at a speed slower than the speed of light, the use of the word “teleportation”, which implies instantaneous transmission, is misleading.

2.3.3 Bell’s Game

The 1935 Einstein-Podolsky-Rosen paper [10] challenged quantum mechanics with the following thought experiment that they believed implied instantaneous communication across distances, in violation of principles of relativity: Alice and Bob prepare \(|{epr}\rangle =\frac 1{\sqrt {2}}(|{00}\rangle +|{11}\rangle )\), then travel far apart. Alice measures her bit. If she gets 0, then she can predict with certainty that Bob will get 0 in his measurement, even if his measurement is taken a second later and they are a light year apart.

Ironically, this thought experiment has been made into an actual experiment. One modern interpretation (see, e.g., [2]) is that there is no paradox because the system does not transmit information faster than the speed of light, but rather they are acting on information that has already been shared. What follows is a version from [7], adapted from the presentation in [2].

Charlie chooses x, y ∈{0, 1} at random and sends x to Alice and y to Bob. Based on this information, Alice and Bob, without communicating with each other, get to choose bits a, b and send them to Charlie. The game is such that Alice and Bob play on a team. They win if a ⊕ b = x ∧ y, i.e., either (x, y) ≠ (1, 1) and a = b or (x, y) = (1, 1) and a ≠ b.

2.3.3.1 Classical Version

Note that if Alice and Bob both always choose 0, they win with probability \(\frac 34\).

Theorem 2.3.2 ([3])

Regardless of the strategy Alice and Bob use, they never win with probability greater than \(\frac 34\).

See, e.g., [2, Thm. 10.3] for a proof.

2.3.3.2 Quantum Version

Although there is still no communication allowed between Alice and Bob, they will exploit a pre-shared |epr〉 to gain an advantage over the classical case. Alice and Bob prepare \(|{epr}\rangle = \frac {|{00}\rangle +|{11}\rangle }{\sqrt {2}}\) in advance, and Alice takes the first qubit and Bob the second. When Alice gets x from Charlie, if x = 1, she applies a rotation by \(\frac {\pi }8\) to her qubit, and if x = 0 she does nothing. When Bob gets y from Charlie, he applies a rotation by \(-\frac {\pi }8\) to his qubit if y = 1 and if y = 0 he does nothing. (The order in which these rotations are applied does not matter because the corresponding operators on \(({\mathbb C}^2)^{{\mathord { \otimes } } 2}\) commute.) Both of them measure their respective qubits and send the values obtained to Charlie.

Theorem 2.3.3

With this strategy, Alice and Bob win with probability at least \(\frac 45\).

The idea behind the strategy is that when (x, y) ≠ (1, 1), the states of the two qubits will have an angle at most \(\frac \pi 8\) between them, but when (x, y) = (1, 1), the angle will be \(\frac \pi 4\). That is, when (x, y) ≠ (1, 1), the manipulation makes it more likely that Alice and Bob’s measurements produce the same outcomes, and less likely to produce the same outcome when (x, y) = (1, 1). See [2, Thm. 10.4] for details.
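The winning probability of this strategy can be computed directly; a numpy sketch (both parties measure in the computational basis):

```python
import numpy as np

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

epr = np.array([1, 0, 0, 1]) / np.sqrt(2)
win = 0.0
for x in [0, 1]:
    for y in [0, 1]:
        U = np.kron(rot(x * np.pi / 8), rot(-y * np.pi / 8))
        probs = np.abs(U @ epr) ** 2        # outcomes |00>,|01>,|10>,|11>
        p_equal = probs[0] + probs[3]       # probability that a = b
        win += p_equal if x & y == 0 else 1 - p_equal
print(win / 4)                              # ~0.8018 > 3/4
```

The four cases (x, y) contribute 1, cos²(π∕8), cos²(π∕8) and 1∕2, whose average is about 0.80.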

2.4 Quantum Algorithms

Rather than giving a detailed description of the algorithms, I just present a few main ideas that illustrate the differences with classical and probabilistic algorithms.

2.4.1 Grover’s Search Algorithm

The problem: given \(F_n: \mathbb {F}^n_2\rightarrow \mathbb {F}_2\), computable by a poly(n)-size classical circuit, find a such that \(F_n(a) = 1\), if such an a exists.

Grover found a quantum circuit of size \(poly(n)2^{\frac n2}\) that solves this problem with high probability. Compare this with a brute force search, which requires a circuit of size \(poly(n)2^n\). No classical or probabilistic algorithm is known that does better than \(poly(n)2^n\). Note that Grover’s algorithm also gives a size \(poly(n)2^{\frac n2}\) probabilistic solution to the NP-complete problem SAT (it is stronger, as it not only determines existence of a solution, but finds it).

I present the algorithm for the following simplified version where one is promised there exists exactly one solution. All essential ideas of the general case are here.

Problem

Given \(F_n: \mathbb {F}^n_2\rightarrow \mathbb {F}_2\), computable by a poly(n)-size classical circuit, and the information that there is exactly one vector a with \(F_n(a) = 1\), find a.

The idea of the algorithm is to start with a vector equidistant from all possible solutions, and then to incrementally rotate it towards a. What is strange for our classical intuition is that one is able to rotate towards the solution without knowing what it is, and similarly, we won’t “see” the rotation matrix either.

Work in \(({\mathbb C}^2)^{{\mathord { \otimes } } n +s}\), where s = s(n) is the size of the classical circuit needed to compute \(F_n\). I suppress reference to the s “workspace bits” in what follows.

The following vector is the average of all the classical (observable) states:

$$\displaystyle \begin{aligned} |{av}\rangle:=\frac 1{2^{\frac n2}}\sum_{I\in \{ 0,1\}^n } |{I}\rangle. \end{aligned} $$
(2.6)

To prepare |av〉, note that \(H|{0}\rangle =\frac 1{\sqrt {2}}(|{0}\rangle +|{1}\rangle )\), so applying \(H^{{\mathord { \otimes } } n}\) to |0⋯0〉 transforms it to |av〉. The cost of this is n gates (matrices).

Since |av〉 is equidistant from all possible solution vectors, \(\langle {av}|{a}\rangle =\frac 1{2^{\frac n2}}\). We want to rotate |av〉 towards the unknown a. Recall that \({\cos }(\angle (|{v}\rangle ,|{w}\rangle ))=\frac {\langle {v}|{w}\rangle }{|v||w|}\). Write the angle between av and a as \(\frac \pi 2 - \theta \), so \({\sin }(\theta )=\frac 1{2^{\frac n2}}\).

A rotation is a product of two reflections. In order to perform the rotation Rot that moves |av〉 towards |a〉, first reflect in the hyperplane orthogonal to |a〉, and then in the hyperplane orthogonal to |av〉, as in Fig. 2.1, which is valid for rotating any vector |v〉 towards |a〉.

Fig. 2.1 Rotation of v to Rot(v) via two reflections

Consider the map

$$\displaystyle \begin{aligned}|{xy}\rangle\mapsto |{x(y\oplus F(x))}\rangle \end{aligned} $$
(2.7)

defined on basis vectors and extended linearly. To execute this, use the s workspace bits that are suppressed from the notation, to effect s reversible classical gates. Initially set y = 0 so that the image is |x0〉 for x ≠ a, and |x1〉 when x = a. Next apply the quantum gate \(\operatorname {Id}{\mathord { \otimes } } \begin {pmatrix} 1&0\\ 0&-1\end {pmatrix}\) which sends |x0〉↦|x0〉, and |x1〉↦ −|x1〉. Finally apply the map |xy〉↦|x(y ⊕ F(x))〉 again.

Thus |a0〉↦ −|a0〉 and all other vectors |b0〉 are mapped to themselves, as desired.

Next we need to reflect around |av〉. It is easy to reflect around a classical state, so first perform the map \((H^{{\mathord { \otimes } } n})^{-1} = H^{{\mathord { \otimes } } n}\) that sends |av〉 to |0⋯0〉, then reflect in the hyperplane perpendicular to |0⋯0〉 using the Boolean function \(g: \mathbb {F}_2^n\rightarrow \mathbb {F}_2\) that outputs 1 if and only if its input is (0, …, 0), in the role of F for our previous reflection, then apply \(H^{{\mathord { \otimes } } n}\) again so the resulting reflection is about |av〉. (Note that both these reflections have polynomial size cost.)

The composition of these two reflections is the desired rotation Rot. The vector Rot|av〉 is not useful as measuring it only slightly increases the probability of obtaining |a〉, but if one composes Rot with itself \(O(\frac 1{\theta })\) times, one obtains a vector much closer to |a〉. (Note that \(\theta \sim {\sin }(\theta )\), so \(\frac 1\theta \sim \sqrt {N}\) where \(N=2^n\).)

For more details, see, e.g., [2, Thm. 10.13] or [20, §6.1].
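A matrix-level numpy sketch of the algorithm (suppressing the workspace bits, and building the two reflections as explicit \(2^n\times 2^n\) matrices, which a quantum computer of course would not do):

```python
import numpy as np

n = 8                                   # number of bits, N = 2^n
N = 2 ** n
a = 37                                  # the marked item, hidden in the oracle

av = np.full(N, 1 / np.sqrt(N))         # |av>, i.e., H^{tensor n}|0...0>
oracle = np.eye(N)
oracle[a, a] = -1                       # reflection in hyperplane orthogonal to |a>
diffusion = 2 * np.outer(av, av) - np.eye(N)   # reflection about |av> (up to sign)

state = av.copy()
for _ in range(int(np.pi / 4 * np.sqrt(N))):   # O(1/theta) ~ sqrt(N) iterations
    state = diffusion @ (oracle @ state)
print(np.abs(state[a]) ** 2)            # ~0.9999: measurement yields |a>
```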

2.4.2 The Quantum Discrete Fourier Transform

Underlying the famous quantum algorithm of Shor for factoring integers and Simon’s algorithm that led up to it, are “quantum” versions of the discrete Fourier transform on finite abelian groups.

The DFT for \( \mathbb {Z}/M\mathbb {Z}\), in vector notation, for \(j\in \mathbb {Z}/M\mathbb {Z}\), is

$$\displaystyle \begin{aligned}|{j}\rangle\mapsto \frac 1{\sqrt{M}} \sum_{k=0}^{M-1} \omega^{jk}|{k}\rangle \end{aligned}$$

where \(\omega =e^{\frac {2\pi i}M}\). It is a unitary change of basis such that in the new basis, multiplication in \(\mathbb {Z}/M\mathbb {Z}\) is given by a diagonal matrix, and the classical FFT writes the DFT as a product of \(O( \operatorname {\mathrm {log}}(M))\) sparse matrices (each with \(M\ll M^2\) nonzero entries), for a total cost of \(O( \operatorname {\mathrm {log}}(M)M)<O(M^2)\) arithmetic operations.

Say \(M = 2^m\). The DFT can be written as a product of \(O(m^3)=O( \operatorname {\mathrm {log}}(M)^3)\) controlled local unitary operators. Hence one can approximately construct the output vector by a sequence of poly(m) unitary operators from our gate set, with the caveat that we won’t be able to “see” it.

Here is the quantum DFT: It will be convenient to express j in binary and view \({\mathbb C}^M=({\mathbb C}^2)^{{\mathord { \otimes } } m}\), i.e., write

$$\displaystyle \begin{aligned}|{j}\rangle=|{j_1}\rangle{\mathord{\otimes\cdots\otimes} } |{j_m}\rangle \end{aligned}$$

where \(j = j_1 2^{m-1} + j_2 2^{m-2} + \cdots + j_m 2^0\) and \(j_i \in \{0, 1\}\). Write the DFT as

$$\displaystyle \begin{aligned} &|{j_1}\rangle{\mathord{\otimes\cdots\otimes} } |{j_m}\rangle \\ & \mapsto \frac 1{\sqrt{M}} \sum_{k=0}^{M-1}\omega^{jk}|{k}\rangle\\ &=\frac 1{\sqrt{M}} \sum_{k_i\in \{0,1\} }\omega^{j(\sum_{l=1}^m k_l2^{m-l})}|{k_1}\rangle{\mathord{\otimes\cdots\otimes} } |{k_m}\rangle\\ &=\frac 1{\sqrt{M}} \sum_{k_i\in \{0,1\} } \bigotimes_{l=1}^m \left[ \omega^{j k_l2^{m-l}}|{k_l}\rangle\right] \\ &=\frac 1{\sqrt{M}} \sum_{k_i\in \{0,1\} } \bigotimes_{l=1}^m \left[ \omega^{(j_12^{2m-1-l}+\cdots + j_m2^{m-l}) k_l }|{k_l}\rangle\right] \\ {}&=\frac 1{2^{\frac m2}} (|{0}\rangle + e^{2\pi i ({j_m}2^{-1})}|{1}\rangle) {\mathord{ \otimes } } (|{0}\rangle + e^{2\pi i({j_{m-1}}2^{-1}+ {j_m}2^{-2})}|{1}\rangle) {\mathord{ \otimes } } (|{0}\rangle + e^{2\pi i({j_{m-2}}2^{-1} + {j_{m-1}}2^{-2} + {j_{m }}2^{-3})}|{1}\rangle)\\ &\ \ {\mathord{\otimes\cdots\otimes} } (|{0}\rangle + e^{2\pi i(j_1 2^{-1} + j_2 2^{-2} + \cdots + j_m 2^{-m}) } |{1}\rangle) \end{aligned} $$
(2.8)

where for the last line, a term \(j_sk_l\) contributes no phase when \(2m-s-l\geq m\), i.e., \(s+l\leq m\), because then its exponent is a multiple of \(2^m\) and \(\omega ^{2^m}=1\); the surviving exponents have been reduced mod \(2^m\) and rewritten via \(\omega ^{2^{m-t}}=e^{2\pi i 2^{-t}}\), so that only negative powers of 2 appear.

It will be notationally more convenient to write the quantum circuit for this vector with the order of factors reversed, so I describe a quantum circuit that produces

$$\displaystyle \begin{aligned} &\frac 1{\sqrt{2}}(|{0}\rangle + e^{2\pi i(j_1 2^{-1} + j_2 2^{-2} + \cdots + j_m 2^{-m}) } |{1}\rangle) {\mathord{\otimes\cdots\otimes} }\frac 1{\sqrt{2}}(|{0}\rangle + e^{2\pi i({j_{m-2}}2^{-1} + {j_{m-1}}2^{-2} + {j_{m }}2^{-3})}|{1}\rangle)\\ & {\mathord{ \otimes } } \frac 1{\sqrt{2}}(|{0}\rangle + e^{2\pi i({j_{m-1}}2^{-1}+ {j_m}2^{-2})}|{1}\rangle) {\mathord{ \otimes } } \frac 1{\sqrt{2}} (|{0}\rangle + e^{2\pi i({j_m}2^{-1})}|{1}\rangle). \end{aligned} $$
(2.9)

Set

$$\displaystyle \begin{aligned} R_k=\begin{pmatrix} 1&0\\ 0 & e^{2\pi i/2^k}\end{pmatrix}=\begin{pmatrix} 1&0\\ 0 & \omega^{2^{m-k}}\end{pmatrix}, \end{aligned} $$
(2.10)

then (2.9) is obtained as follows: first apply H to \(({\mathbb C}^2)_1\), then the linear map \(\Lambda _1 R_j\), defined by \(|{x}\rangle {\mathord { \otimes } } |{y}\rangle \mapsto |{x}\rangle {\mathord { \otimes } } R_j|{y}\rangle \) if |x〉≠ |0〉 and \(|{x}\rangle {\mathord { \otimes } } |{y}\rangle \mapsto |{x}\rangle {\mathord { \otimes } } |{y}\rangle \) if |x〉 = |0〉, to \(({\mathbb C}^2)_j{\mathord { \otimes } } ({\mathbb C}^2)_1\) for j = 2, …, m. Note that at this point only the \(({\mathbb C}^2)_1\)-term has been altered. From now on leave the \(({\mathbb C}^2)_1\)-slot alone. Next apply H to \(({\mathbb C}^2)_{2}\), then \(\Lambda _1 R_{j-1}\) to \(({\mathbb C}^2)_{j}{\mathord { \otimes } } ({\mathbb C}^2)_{2}\) for j = 3, …, m. Then apply H to \(({\mathbb C}^2)_{3}\), then \(\Lambda _1 R_{j-2}\) to \(({\mathbb C}^2)_j{\mathord { \otimes } } ({\mathbb C}^2)_3\) for j = 4, …, m. Continue, until finally one just applies H to \(({\mathbb C}^2)_m\). Finally, to obtain the DFT, reverse the order of the factors (a classical operation).

In practice, one has to fix a quantum gate set, i.e., a finite set of unitary operators that will be allowed in algorithms, in advance. Thus in general it will be necessary to approximate the transformations \(R_k\) by elements of our gate set, so one only obtains an approximation of the DFT.
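As a check on the algebra, here is a numpy sketch verifying the tensor-product form (2.8) against the definition of the DFT (this checks the formula, not the gate-level circuit):

```python
import numpy as np

m = 5
M = 2 ** m
w = np.exp(2j * np.pi / M)

def qft_product(j):
    """Right-hand side of (2.8): factor l carries the phase
    e^{2 pi i (j_{m-l+1} 2^{-1} + ... + j_m 2^{-l})}."""
    bits = [(j >> (m - l)) & 1 for l in range(1, m + 1)]   # j_1, ..., j_m
    out = np.array([1.0 + 0j])
    for l in range(1, m + 1):
        frac = sum(bits[m - l + t - 1] * 2.0 ** (-t) for t in range(1, l + 1))
        out = np.kron(out, np.array([1, np.exp(2j * np.pi * frac)]) / np.sqrt(2))
    return out

for j in range(M):
    dft_j = np.array([w ** (j * k) for k in range(M)]) / np.sqrt(M)
    assert np.allclose(qft_product(j), dft_j)   # (2.8) reproduces the DFT
```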

2.4.3 The Hidden Subgroup Problem

Given a discrete group G with a specific representation of its elements in binary, a function \(f: G\rightarrow \mathbb {F}_2^n\), and a device that computes f (for unit cost), and the knowledge that there exists a subgroup G′⊂ G such that f(x) = f(y) if and only if xy −1 ∈ G′, find G′.

For finitely generated abelian groups, it is sufficient to solve the problem for \(G=\mathbb {Z}^{\oplus k}\) as all finitely generated abelian groups are quotients of some \(\mathbb {Z}^{\oplus k}\).

Simon's algorithm is for the hidden subgroup problem with \(G=\mathbb {Z}_2^{\oplus m}\), see [15, §13.1]. The \(\mathrm{DFT}_2\) matrix is just the Hadamard matrix

$$\displaystyle \begin{aligned}H=\frac 1{\sqrt{2}} \begin{pmatrix} 1& 1\\ 1 & -1\end{pmatrix} \end{aligned}$$

of (2.4), and G′ is the subgroup generated by some \(a\in \mathbb {Z}_2^{\oplus m}\).

Shor’s algorithm for factoring an integer N (after classical preparation) amounts to the case \(G=\mathbb {Z}\), with f the function \(x\mapsto a^x\operatorname {mod} N\). It has generated intense interest in quantum computation because no classical or probabilistic polynomial time algorithm for factoring is known, and much “secure” electronic communication is based on the difficulty of factoring a number into its prime factors, so the real world impact of a quantum computer would be substantial. See, e.g., http://www.math.tamu.edu/~jml/CNSA-Suite-and-Quantum-Computing-FAQ.pdf. See, e.g., [2, §10.6] for an exposition of Shor’s algorithm.

2.5 Classical Information Theory

Quantum information theory is based on classical information theory, so I review the classical theory. The discovery/invention of the bit by Tukey and its development by Shannon [22] was one of the great scientific achievements of the twentieth century, as it changed the way one views information, giving it an abstract formalism that is discussed in this section. The link to quantum information is explained in Sect. 2.7.

The basic question is: given a physical channel, e.g., a telegraph wire, what is the maximum rate of transmission of messages allowing for a small amount of error? I begin with toy examples, leading up to Shannon’s two fundamental theorems.

2.5.1 Data Compression: Noiseless Channels

(Following [6]) A source emits symbols x from an alphabet \({\mathcal X}\) that we want to store efficiently, so we try to encode x in a small number of bits, as an element \(y\in {\mathcal Y}\), in a way that one can decode it later to recover x (Fig. 2.2).

Fig. 2.2 Message from source encoded into bits then decoded

The symbols from \({\mathcal X}\) do not necessarily occur with the same frequency. Let \(p=P_{{\mathcal X}}\) denote the associated probability distribution. What is the minimum possible size for \({\mathcal Y}\)? Since we are dealing in bits, it will be convenient to use the logarithms of cardinalities, so define the capacity as \( \operatorname {\mathrm {Cap}}(P_{{\mathcal X}}):=\operatorname {min} \operatorname {\mathrm {log}}|{\mathcal Y}|\).

Consider the case \({\mathcal X}=\{a,b,c,d\}\) where p(a) = 0.1, p(b) = 0, p(c) = 0.4 and p(d) = 0.5. One can clearly get away with \(|{\mathcal Y}|=3\), e.g., for the encoder, send a, b to 1, c to 2 and d to 3, then for the decoder, send 1 to a, 2 to c and 3 to d. In general, one can always throw away symbols with probability zero. On the other hand, one cannot map two distinct symbols that do occur to the same symbol, as there would be no way to distinguish them when decoding. Thus \( \operatorname {\mathrm {Cap}}(p)= \operatorname {\mathrm {log}} \operatorname {\mathrm {supp}}(p)\), where \( \operatorname {\mathrm {supp}}(p)=\# \{ x\in {\mathcal X} \mid p(x)>0\}\).

Now say we are willing to tolerate a small error. First rephrase what we did probabilistically: let \(p^{enc}(y|x)\) denote the conditional probability distribution of the encoder \({\mathcal E}\) and \(p^{dec}(x|y)\) that of the decoder \({\mathcal D}\). Our requirement was, for all x,

$$\displaystyle \begin{aligned}p[x={\mathcal D}\circ {\mathcal E}(x)]=\sum_{ y,x'} p^{enc}(y|x)p^{dec}(x'|y)\delta_{x,x'}=1. \end{aligned}$$

Now relax it to

$$\displaystyle \begin{aligned}\sum_{x,y,x'} p(x)p^{enc}(y|x)p^{dec}(x'|y)\delta_{x,x'}\geq 1-{\epsilon} \end{aligned}$$

for some error 𝜖 that we are willing to tolerate. In addition to throwing out the symbols that do not appear, we may also discard the largest set of symbols whose total probability is smaller than 𝜖. Call the corresponding quantity \( \operatorname {\mathrm {Cap}}^{{\epsilon }}(p)\). In this example, if one takes 𝜖 > 0.1, one can lower storage cost, taking \(|{\mathcal Y}|=2\).

Recall that a probability distribution \(p: {\mathcal X}\rightarrow [0,1]\) must satisfy \(\sum _{x\in {\mathcal X}}p(x)=1\). Relax this to non-normalized probability distributions, \(q: {\mathcal X}\rightarrow [0,1]\), where \(\sum _{x\in {\mathcal X}}q(x)\leq 1\). We obtain: \( \operatorname {\mathrm {Cap}}^{{\epsilon }}(p)= \operatorname {min} \operatorname {\mathrm {log}} \operatorname {\mathrm {supp}}(q)\), where the min is taken over all non-normalized probability distributions q satisfying q(x) ≤ p(x) and \(\sum _{x\in {\mathcal X}} q(x)\geq 1-{\epsilon }\).

Now say we get not a single symbol, but a string of n symbols, so we seek an encoder \({\mathcal E}: {\mathcal X}^n\rightarrow {\mathcal Y}(n)\), where \({\mathcal Y}(n)\) is a set that varies with n, and decoder \({\mathcal D}: {\mathcal Y}(n)\rightarrow {\mathcal X}^n\), and we want to minimize \(|{\mathcal Y}(n)|\), with a tolerance of error that goes to zero as n goes to infinity. In practice one wants to send information through a communication channel (e.g. telegraph wire). The channel can only send a limited number of bits per second, and we want to maximize the amount of information we can send per second: \(\lim _{{\epsilon }\rightarrow 0}\lim _{n\rightarrow \infty } \frac 1n \operatorname {\mathrm {Cap}}^{{\epsilon }}(p^n)\).

The string \(x_1\cdots x_n=:\overline x^n\) is independently and identically distributed (i.i.d.), that is, each \(x_j\) is drawn from the same probability distribution and the draw of \(x_j\) is independent of the draws of the other \(x_i\). Say \({\mathcal X}=\{1{, \hdots ,} d\}\) with \(p(j) = p_j\). The probability of any given string occurring depends only on the number of 1's, 2's, etc. in the string and not on their order. A string with \(c_j\) j's occurs with probability \(p_1^{c_1}\cdots p_d^{c_d}\). (Note that \(c_1 + \cdots + c_d = n\).) The number of strings with this probability is

$$\displaystyle \begin{aligned}\binom n {c_1{, \hdots ,} c_d}:=\frac{n!}{c_1!\cdots c_d!} \end{aligned}$$

and we need to estimate this quantity.

Stirling’s formula implies \( \ln (n!)=n\ln (n)-n+ O(\ln (n))\). In particular, for 0 < β < 1 such that \(\beta n\in \mathbb {Z}\),

$$\displaystyle \begin{aligned}\operatorname{\mathrm{log}}\binom n{\beta n} = n[-\beta \operatorname{\mathrm{log}}(\beta) -(1-\beta)\operatorname{\mathrm{log}}(1-\beta)]+O(\operatorname{\mathrm{log}}(n)). \end{aligned}$$

Let \(H(\beta )=-\beta \operatorname {\mathrm {log}}(\beta ) -(1-\beta ) \operatorname {\mathrm {log}}(1-\beta )\) and more generally, for \(\overline {p}=(p_1{, \hdots ,} p_d)\), let

$$\displaystyle \begin{aligned}H(\overline{p})=-\sum_{i=1}^d p_i\operatorname{\mathrm{log}}(p_i), \end{aligned}$$

the Shannon entropy of \(\overline {p}\). It plays a central role in information theory.

Define a map \(wt: {\mathcal X}^n\rightarrow {\mathbb R}^d\) by \(\overline x^n\mapsto (c_1{, \hdots ,} c_d)\), where c j is the number of j’s appearing in \(\overline x^n\). Then the expectation is \(E[wt(\overline x^n)]=(np_1{, \hdots ,} np_d)\). The weak law of large numbers states that for any 𝜖 > 0,

$$\displaystyle \begin{aligned}\lim_{n\rightarrow \infty} p[|| \frac 1n (wt(\overline{x}^n)-E[wt(\overline x^n)]) ||{}_1>{\epsilon}]=0 \end{aligned}$$

where for \(f: {\mathcal Z}\rightarrow {\mathbb R}^d\), define \(||f||{ }_1=\sum _{z\in {\mathcal Z}} |f(z)|\). In our case, \({\mathcal Z}={\mathcal X}^n\).

Now simply throw out all strings \(\overline {x}^n\) with \(||\frac {1}n (wt( \overline x^n)-E[wt(\overline x^n)]) ||{ }_1>{\epsilon }\), and take \({\mathcal Y}(n)\) of size

$$\displaystyle \begin{aligned} |{\mathcal Y}(n)|&=\#\{ \overline x^n\mid ||\frac{1}n (wt(\overline x^n)-E[wt(\overline x^n)])||{}_1<{\epsilon} \}\\ &= \sum_{{\genfrac{}{}{0pt}{}{\overline x^n\mid}{ ||\frac{1}n( wt(\overline x^n)-E[wt(\overline x^n)])||{}_1<{\epsilon} }}} \binom{n}{wt(\overline{x}^n) }. \end{aligned} $$

If 𝜖 is small, the multinomial coefficients appearing will all be very close to

$$\displaystyle \begin{aligned}\binom n{np_1{, \hdots ,} np_d} \end{aligned}$$

and for what follows, one can take the crude approximation

$$\displaystyle \begin{aligned} |{\mathcal Y}(n)|\leq poly(n) \binom n{np_1{, \hdots ,} np_d} \end{aligned} $$
(2.11)

(recall that d is fixed).

Taking logarithms, the right hand side of (2.11) becomes \(nH(\overline {p})+O( \operatorname {\mathrm {log}}(n)) \). Thus

$$\displaystyle \begin{aligned}\frac 1 n \operatorname{\mathrm{log}} |{\mathcal Y}(n)|\leq H( \overline{p})+o(1) \end{aligned}$$

and \(\lim _{{\epsilon }\rightarrow 0}\lim _{n\rightarrow \infty } \frac 1n \operatorname {\mathrm {Cap}}^{{\epsilon }}(p^n)\leq H(\overline {p})\).

Theorem 2.5.1 ([22])

\(\lim _{{\epsilon }\rightarrow 0}\lim _{n\rightarrow \infty } \frac 1n \operatorname {\mathrm {Cap}}^{{\epsilon }}(p^n)= H(\overline {p})\).

The full proof uses the law of large numbers.
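A numerical illustration of the upper bound (a sketch, evaluating the multinomial coefficient of a typical string via log-gamma, as in the crude approximation (2.11)):

```python
import numpy as np
from math import lgamma

def log2_multinomial(counts):
    """log_2 of n! / (c_1! ... c_d!)."""
    n = sum(counts)
    return (lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)) / np.log(2)

p = np.array([0.1, 0.4, 0.5])
H = -(p * np.log2(p)).sum()                 # Shannon entropy, ~1.361 bits

for n in [100, 10_000, 1_000_000]:
    counts = (n * p).astype(int)            # letter counts of a typical string
    print(n, log2_multinomial(counts) / n)  # approaches H(p) as n grows
```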

2.5.2 Transmission over Noisy Channels

Say symbols x are transmitted over a channel subject to noise, and symbols y are received, so one may or may not have y = x. Intuitively, if the noise is small, with some redundancy it should be possible to communicate accurate messages most of the time. In a noiseless channel the maximal rate of transmission is just \(H(p_{{\mathcal X}})\), but now we must subtract off something to account for the uncertainty, upon receiving y, about which signal was sent. This something will be the conditional entropy: recall the conditional probability of i occurring given knowledge that j occurs (assuming p(j) > 0): \(p_{{\mathcal X} | {\mathcal Y}}(i|j)=\frac {p_{{\mathcal X}, {\mathcal Y}}(i,j)}{p_{ {\mathcal Y}}(j)}\) (also recall \(p_{ {\mathcal Y}}(j)=\sum _ip_{{\mathcal X}, {\mathcal Y}}(i,j)\)). Define the conditional entropy

$$\displaystyle \begin{aligned}H(\overline p_{\mathcal Y}|\overline p_{\mathcal X} ):=-\sum_{i,j}p_{{\mathcal X}, {\mathcal Y}}(i,j)\operatorname{\mathrm{log}} p_{{\mathcal Y} | {\mathcal X}}(j|i). \end{aligned}$$

Note that

$$\displaystyle \begin{aligned} H(\overline p_{\mathcal Y}|\overline p_{\mathcal X} )=H(\overline p_{{\mathcal X},{\mathcal Y}} )-H(\overline p_{{\mathcal X} } ) \end{aligned} $$
(2.12)

or equivalently \(H(\overline p_{{\mathcal X},{\mathcal Y}} )=H(\overline p_{{\mathcal X} } )+H(\overline p_{\mathcal Y}|\overline p_{\mathcal X} )\): the uncertainty of \(p_{{\mathcal X},{\mathcal Y}}\) is the uncertainty of \(p_{{\mathcal X} }\) plus the uncertainty of \(p_{{\mathcal Y} }\) given \(p_{{\mathcal X} }\). In particular \( H(\overline p_{{\mathcal Y} } )\geq H(\overline p_{\mathcal Y}|\overline p_{\mathcal X} ) \), i.e., with extra knowledge, our uncertainty about \(p_{{\mathcal Y} } \) cannot increase; it decreases unless \(p_{{\mathcal X} } \) and \(p_{{\mathcal Y} } \) are independent.
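A small numpy check of the identity (2.12) on an arbitrary joint distribution:

```python
import numpy as np

pXY = np.array([[0.30, 0.20],     # joint distribution p_{X,Y}(i,j),
                [0.10, 0.40]])    # rows indexed by i, columns by j

def H(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

pX = pXY.sum(axis=1)                          # marginal distribution of X
chain = H(pXY.ravel()) - H(pX)                # H(Y|X) via (2.12)
direct = -sum(pXY[i, j] * np.log2(pXY[i, j] / pX[i])
              for i in range(2) for j in range(2))   # from the definition
assert np.isclose(chain, direct)
print(chain)
```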

2.5.2.1 Capacity of a Noisy Channel

Define the capacity of a noisy channel to be the maximum rate over all possible probability distributions on the source:

$$\displaystyle \begin{aligned}\operatorname{\mathrm{Cap}}:=\operatorname{max}_{q_{{\mathcal X}}}\left(H(q_{{\mathcal X}})-H(q_{{\mathcal X}}|p_{{\mathcal Y}})\right). \end{aligned}$$

Shannon [22] proves that \( \operatorname {\mathrm {Cap}}\) lives up to its name: if the entropy of a discrete source is below \( \operatorname {\mathrm {Cap}}\), then there exists an encoding \(\overline p\) of the source such that information can be transmitted over the channel with an arbitrarily small frequency of errors. The basic idea is the same as in the noiseless case; however, there is a novel feature that now occurs frequently in complexity theory arguments: instead of producing an algorithm to find the efficient encoding, Shannon showed that a random choice of encoding will work.

After presenting the proof, Shannon remarks: “An attempt to obtain a good approximation to ideal coding by following the method of the proof is generally impractical. …Probably this is no accident but is related to the difficulty of giving an explicit construction for a good approximation to a random sequence”. To my knowledge, this is the first time that the difficulty of “finding hay in a haystack” (phrase due to Howard Karloff) is mentioned in print. This problem is central to complexity: for example, Valiant’s algebraic version of P ≠ NP can be phrased as the problem of finding a sequence of explicit polynomials that are difficult to compute, while it is known that a random sequence is indeed difficult to compute. According to A. Wigderson, the difficulty of writing down random objects was also explicitly discussed by Erdös, in the context of random graphs, at least as early as 1947, in relation to his seminal paper [11]. This paper, along with [22], gave rise to the now ubiquitous probabilistic method in complexity theory.

2.6 Reformulation of Quantum Mechanics

I discuss two inconveniences about our formulation of the postulates of quantum mechanics, leading to a reformulation of the postulates in terms of density operators.

2.6.1 Partial Measurements

A measurement of a state \(|{\psi }\rangle =\sum z_I|{I}\rangle \) was defined as a procedure that gives us \(I = (i_1, \hdots , i_n) \in \{0, 1\}^n\) with probability \(|z_I|^2\). But in our algorithms, this is not what happened: we were working not in \(({\mathbb C}^2)^{{\mathord { \otimes } } n}\), but in \(({\mathbb C}^2)^{{\mathord { \otimes } } n+m}\), where there were m “workspace” qubits we were not interested in measuring. So our measurement was more like the projections onto the spaces \(|{I}\rangle {\mathord { \otimes } } ({\mathbb C}^2)^{{\mathord { \otimes } } m}\). I now define this generalized notion of measurement.

To make the transition, first observe that \(|z_I|^2 = \langle {\psi }| \Pi _I|{\psi }\rangle \), where \(\Pi _I : ({\mathbb C}^2)^{{\mathord { \otimes } } n}\rightarrow {\mathbb C}|{I}\rangle \) is the orthogonal projection onto the line spanned by |I〉.

Say we are only interested in the first n bits of a system of n + m bits, and want to know the probability a measurement gives rise to some I represented by a vector \(|{I}\rangle \in ({\mathbb C}^2)^{{\mathord { \otimes } } n}\), but we have \(|{\psi }\rangle \in ({\mathbb C}^2)^{{\mathord { \otimes } } n+m}\). Adopt the notation \(|{\phi }\rangle \langle {\psi }| := |{\phi }\rangle {\mathord { \otimes } } \langle {\psi }|\). Then the probability of obtaining |I〉 given |ψ〉 is

$$\displaystyle \begin{aligned} p( |{I}\rangle \mid |{\psi}\rangle)&=\sum_{J\in\{0,1\}^m} p(|{\psi}\rangle, |{IJ}\rangle)\\ &= \sum_J \langle {\psi}|{IJ}\rangle\langle {IJ}|{\psi}\rangle\\ &=\langle {\psi}| (|{I}\rangle\langle {I}| {\mathord{ \otimes } } \operatorname{Id}_{({\mathbb C}^2)^{{\mathord{ \otimes } } m}})|{\psi}\rangle\\ &=\langle {\psi}| \Pi_{{\mathcal M}}|{\psi}\rangle \end{aligned} $$

where \(\Pi _{{\mathcal M}}: ({\mathbb C}^2)^{{\mathord { \otimes } } n+m}\rightarrow |{I}\rangle {\mathord { \otimes } } ({\mathbb C}^2)^{{\mathord { \otimes } } m}=:{\mathcal M}\) is the orthogonal projection operator. With this definition, one can allow \({\mathcal M}\subset {\mathcal H}\) to be any linear subspace, which will simplify our measurements. (Earlier, if we wanted to measure the probability of a non-basis state, we had to change bases before measuring.) Write \(p_{\psi }({\mathcal M}):= \langle {\psi }| \Pi _{{\mathcal M}}|{\psi }\rangle \) for the probability of measuring |ψ〉 in state \({\mathcal M}\).

One may think of projection operators as representing outside interference of a quantum system, like adding a filter to beams being sent that destroys states not in \({\mathcal M}\). Recall that in classical probability, one has the identity:

$$\displaystyle \begin{aligned}p(M_1\cup M_2)= p(M_1)+p(M_2) - p(M_1\cap M_2) . \end{aligned} $$
(2.13)

The quantum analog is false in general: let \({\mathcal H}={\mathbb C}^2\), \({\mathcal M}_1={\mathbb C}|{0}\rangle \) and \({\mathcal M}_2={\mathbb C}(|{0}\rangle +|{1}\rangle )\). Let \(|{\psi }\rangle = \alpha |{0}\rangle + \beta |{1}\rangle \) with \(|\alpha |^2 + |\beta |^2 = 1\). Then (and in general) \(p_{\psi }( \operatorname {\mathrm {span}}\{ {\mathcal M}_1,{\mathcal M}_2 \})\neq p_{\psi }( {\mathcal M}_1)+ p_{\psi }( {\mathcal M}_2 ) -p_{\psi }( {\mathcal M}_1\cap {\mathcal M}_2 )\).

However, one can recover (2.13) if the projection operators commute:

Proposition 2.6.1

If \(\Pi _{{\mathcal M}_1}\Pi _{{\mathcal M}_2}=\Pi _{{\mathcal M}_2}\Pi _{{\mathcal M}_1}\) then for all ψ, \(p_{\psi }( \operatorname {\mathrm {span}}\{ {\mathcal M}_1, {\mathcal M}_2 \})= p_{\psi }( {\mathcal M}_1)+ p_{\psi }( {\mathcal M}_2 ) -p_{\psi }( {\mathcal M}_1\cap {\mathcal M}_2 )\).
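A numeric check of the counterexample above, with |ψ〉 = |0〉 (note the two projectors indeed fail to commute, consistent with the proposition):

```python
import numpy as np

psi = np.array([1, 0])                  # |psi> = |0>
P1 = np.array([[1, 0], [0, 0]])         # projection onto C|0>
u = np.array([1, 1]) / np.sqrt(2)
P2 = np.outer(u, u)                     # projection onto C(|0> + |1>)

lhs = psi @ np.eye(2) @ psi             # span{M1, M2} = C^2, projector = Id
rhs = psi @ P1 @ psi + psi @ P2 @ psi   # M1 cap M2 = 0 contributes nothing
print(lhs, rhs)                         # 1.0 vs 1.5: (2.13) fails
print(np.allclose(P1 @ P2, P2 @ P1))    # False: the projectors do not commute
```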

2.6.2 Mixing Classical and Quantum Probability

A typical situation in probability is as follows: you want a cookie, but can’t make up your mind which kind, so you decide to take one at random from the cookie jar to eat. However when you open the cupboard, you find there are two different cookie jars H and T, each with a different distribution of cookies, say P H and P T. You decide to flip a coin to decide which jar and say your coin is biased with probability p for heads (choice H). The resulting probability distribution is

$$\displaystyle \begin{aligned}pP_H +(1-p)P_T. \end{aligned}$$

Let’s encode this scenario with vectors. Classically, if the vectors corresponding to \(P_H\), \(P_T\) are respectively \(v_H\), \(v_T\), the new vector is \(pv_H + (1-p)v_T\). The probability of drawing a chocolate chip (CC) cookie is \(pP_H(CC) + (1-p)P_T(CC) = pv_{H,CC} + (1-p)v_{T,CC}\).

But what should one take in generalized probability (where one uses the \(\ell ^2\) norm instead of the \(\ell ^1\) norm)? Given \(|{\psi _A}\rangle =\sum z_I|{I}\rangle , |{\psi _B}\rangle =\sum w_J|{J}\rangle \), we want to make a measurement that gives us \(p|z_{CC}|^2 + (1-p)|w_{CC}|^2\). Unfortunately \(|pz_{CC} + (1-p)w_{CC}|^2 \neq p|z_{CC}|^2 + (1-p)|w_{CC}|^2\) in general. To fix this problem one enlarges the notion of state and further modifies the definition of measurement.

Our problem comes from having a mixture of \(\ell ^1\) and \(\ell ^2\) norms. The fix will be to rewrite |ψ〉 in a way that the \(\ell ^2\) norm becomes an \(\ell ^1\) norm. That is, construct an object that naturally contains the squares of the norms of the coefficients of |ψ A〉. Consider the endomorphism \( |{\psi _A}\rangle \langle {\psi _A}|=\sum _{I,J} z_I\overline {z}_J |{I}\rangle \langle {J}|\). It is rank one, and in the standard basis its diagonal entries are the quantities we want.

To measure them, let \(\Pi _J\) denote the projection onto the J-th coordinate. Then

$$\displaystyle \begin{aligned}\operatorname{trace}(\Pi_J |{\psi_A}\rangle\langle {\psi_A}|)=|z_{A,J}|{}^2 \end{aligned}$$

is the desired quantity.

Now back to our cookie jars, set

$$\displaystyle \begin{aligned}\rho=p |{\psi_A}\rangle\langle {\psi_A}|+(1-p) |{\psi_B}\rangle\langle {\psi_B}| \end{aligned}$$

and observe that

$$\displaystyle \begin{aligned}\operatorname{trace}(\Pi_J\rho)= p |z_{A,J}|{}^2+(1-p)|z_{B,J}|{}^2 \end{aligned}$$

as desired.

Given a finite set of states \(\{|{\psi _1}\rangle , \hdots , |{\psi _s}\rangle \}\), with \(p(|{\psi _i}\rangle ) = p_i\) and \(\sum _i p_i = 1\), set \(\rho =\sum _k p_k |{\psi _k}\rangle \langle {\psi _k}|\in \operatorname {End}({\mathcal H})\). Note that ρ has the properties

(1)

\(\rho = \rho ^\dag \), i.e., ρ is Hermitian,

(2)

    ∀|η〉, 〈η|ρ|η〉≥ 0, i.e., ρ is positive,

(3)

    \(\operatorname {trace}(\rho )=1\).

This motivates the following definition:

Definition 2.6.2

An operator \(\rho \in \operatorname {End}({\mathcal H})\) satisfying 1,2,3 above is called a density operator.

Note that a density operator that is diagonal in the standard basis of \({\mathbb C}^d\) corresponds to a probability distribution on {1, …, d}, so the definition includes classical probability as well as our old notion of state (which are the rank one density operators). The set of density operators is invariant under the induced action of \({\mathbf {U}}({\mathcal H})\) on \(\operatorname {End}({\mathcal H})\).

Different scenarios can lead to the same density operator. However, two states with the same density operator are physically indistinguishable.
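For example (a numpy sketch): the ensembles \(\{|{0}\rangle ,|{1}\rangle \}\) and \(\{\frac 1{\sqrt 2}(|{0}\rangle \pm |{1}\rangle )\}\), each taken with probabilities \(\frac 12,\frac 12\), yield the same density operator:

```python
import numpy as np

def rho(states, probs):
    """Density operator sum_k p_k |psi_k><psi_k| of an ensemble."""
    return sum(p * np.outer(s, s.conj()) for s, p in zip(states, probs))

ket0, ket1 = np.array([1., 0.]), np.array([0., 1.])
plus, minus = (ket0 + ket1) / np.sqrt(2), (ket0 - ket1) / np.sqrt(2)

rho1 = rho([ket0, ket1], [0.5, 0.5])    # fair mix of |0>, |1>
rho2 = rho([plus, minus], [0.5, 0.5])   # fair mix of |+>, |->
print(np.allclose(rho1, rho2))          # True: physically indistinguishable

assert np.allclose(rho1, rho1.conj().T)             # (1) Hermitian
assert (np.linalg.eigvalsh(rho1) >= -1e-12).all()   # (2) positive
assert np.isclose(np.trace(rho1), 1)                # (3) trace one
```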

2.6.3 Reformulation of the Postulates of Quantum Mechanics

Postulate 1

Associated to any isolated physical system is a Hilbert space \({\mathcal H}\), called the state space. The system is described by its density operator \(\rho \in \operatorname {End}({\mathcal H})\).

Postulate 2

The evolution of an isolated system is described by the action of unitary operators on ρ.

Postulate 3

Measurements correspond to a collection of projection operators \(\Pi _{{\mathcal M}_j}\) such that \(\sum _k \Pi _{{\mathcal M}_k}=\operatorname {Id}_{{\mathcal H}}\). The probability that ρ is measured in state \({\mathcal M}_j\) is \(\operatorname {trace}(\Pi _{{\mathcal M}_j}\rho )\). Such measurements are called projective measurements in the literature.

Sometimes it is convenient to allow more general measurements than projective ones:

Postulate 3′

Generalized measurements correspond to a collection of positive Hermitian operators \(X_j\in \operatorname {End}{\mathcal H}\) such that \(\sum _k X_k=\operatorname {Id}_{{\mathcal H}}\). The probability of outcome j when the system is in state ρ is \(\operatorname {trace}( X_j\rho )\). Such measurements are called “Positive Operator-Valued Measurements”, or POVM, in the literature.

Postulate 4 regarding composite systems is unchanged.

Remark 2.6.3

Note that for \(A\in \operatorname {End}{\mathcal H}={\mathcal H}^*{\mathord { \otimes } } {\mathcal H}\), \(\operatorname {trace}(A)\) is the image of A under the contraction map \({\mathcal H}^*{\mathord { \otimes } } {\mathcal H}\rightarrow {\mathbb C}\), \(\langle v|{\mathord { \otimes } } |w\rangle \mapsto \langle v|w\rangle \). For \(A\in \operatorname {End}({\mathcal H}_1{\mathord { \otimes } } {\mathcal H}_2)=({\mathcal H}_1^*{\mathord { \otimes } } {\mathcal H}_2^*){\mathord { \otimes } } ({\mathcal H}_1{\mathord { \otimes } } {\mathcal H}_2)\), define the partial trace \(\operatorname {trace}_{{\mathcal H}_1}(A)\) to be the image of A under the contraction \({\mathcal H}_1^*{\mathord { \otimes } } {\mathcal H}_2^*{\mathord { \otimes } } {\mathcal H}_1{\mathord { \otimes } } {\mathcal H}_2\rightarrow {\mathcal H}_2^*{\mathord { \otimes } } {\mathcal H}_2\) given by \(\langle \phi |{\mathord { \otimes } } \langle \psi |{\mathord { \otimes } } |v\rangle {\mathord { \otimes } } |w\rangle \mapsto \langle \phi |v\rangle \langle \psi |{\mathord { \otimes } } |w\rangle = \langle \phi |v\rangle |w\rangle \langle \psi |\).

2.6.4 Expectation and the Uncertainty Principle

Let \(A\in \operatorname {End}({\mathcal H})\) be a Hermitian operator with eigenvalues λ 1, …, λ k and eigenspaces \({\mathcal M}_j\). If our system is in state ρ, one can consider A as a random variable that takes the value λ j with probability \(\operatorname {trace}(\Pi _{{\mathcal M}_j} \rho )\).

The expectation of a random variable \(X: {\mathcal X}\rightarrow {\mathbb R}\) is \(E[X]:=\sum _{j\in {\mathcal X}} X(j)p(j)\).

If a system is in state ρ, the expectation of a Hermitian operator \(A\in \operatorname {End}({\mathcal H})\) is \(\operatorname {trace}(A\rho )\) because \( E[A] =\sum _{\lambda _j} \lambda _j \operatorname {trace}(\Pi _{{\mathcal M}_j} \rho ) = \operatorname {trace}((\sum _{\lambda _j}\lambda _j\Pi _{{\mathcal M}_j}) \rho ) =\operatorname {trace}(A\rho )\).

One way mathematicians describe the famous Heisenberg uncertainty principle is that it is impossible to localize both a function and its Fourier transform. Another interpretation comes from probability:

First note that given a random variable, or Hermitian operator X, one can replace it with an operator of mean zero \(\hat X:= X-\operatorname{trace}(X \rho )\operatorname{Id}\). For notational convenience, I state the uncertainty principle for such shifted operators.

The variance of a random variable is \(var(X) = E[(X-E[X])^2]\). The standard deviation \(\sigma (X)=\sqrt {var(X)}\) of X is a measure of the failure of the corresponding probability distribution to be concentrated at a point, i.e., the failure of the induced probability distribution to have a certain outcome.

Proposition 2.6.4

Let X, Y be Hermitian operators of mean zero, corresponding to observables on a system in state ρ. Then

$$\displaystyle \begin{aligned}\sigma( X )\sigma( Y )\geq \frac{|\operatorname{trace} ([ X, Y]\rho) |}2. \end{aligned}$$

The uncertainty principle says that the failure of two Hermitian operators to commute lower bounds the product of their uncertainties. In particular, if they do not commute, neither can give rise to a classical (certain) measurement. It is a consequence of the Cauchy-Schwarz inequality.
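For the reader’s convenience, here is the Cauchy-Schwarz argument in the case of a pure state ρ = |ψ〉〈ψ| (the mixed case is similar). Since X has mean zero, \(\sigma (X)^2=\operatorname {trace}(X^2\rho )=\| X|{\psi }\rangle \|{}^2\), and likewise for Y, so

$$\displaystyle \begin{aligned}\sigma(X)\sigma(Y)=\| X|{\psi}\rangle\|\, \| Y|{\psi}\rangle\| \geq |\langle {\psi}|XY|{\psi}\rangle| \geq |\mathrm{Im}\, \langle {\psi}|XY|{\psi}\rangle| =\frac{|\langle {\psi}|[X,Y]|{\psi}\rangle|}2, \end{aligned}$$

where the last equality holds because \(\langle {\psi }|YX|{\psi }\rangle =\overline {\langle {\psi }|XY|{\psi }\rangle }\) for Hermitian X, Y.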

2.6.5 Pure and Mixed States

Definition 2.6.5

Let \(\rho \in \operatorname {End}({\mathcal H})\) be a density operator. If rank(ρ) = 1, i.e. ρ = |ξ〉〈ξ|, ρ is called a pure state, and otherwise it is called a mixed state.

The partial trace of a pure state can be a mixed state. For example, if ρ = |ψ〉〈ψ| with \(\psi =\frac 1{\sqrt {2}} (|{00}\rangle +|{11}\rangle )\in {\mathcal H}_1{\mathord { \otimes } } {\mathcal H}_2\), then \(\operatorname {trace}_{{\mathcal H}_2}(\rho )=\frac 12( |{0}\rangle \langle {0}| + |{1}\rangle \langle {1}|)\).

The following proposition shows that one could avoid density operators altogether by working on a larger space:

Proposition 2.6.6

An arbitrary mixed state \(\rho \in \operatorname {End}({\mathcal H})\) can be represented as the partial trace \(\operatorname {trace}_{{\mathcal H}'} |{\psi }\rangle \langle {\psi }|\) of a pure state in \(\operatorname {End}({\mathcal H}{\mathord { \otimes } } {\mathcal H}')\) for some Hilbert space \({\mathcal H}'\) . In fact, one can always take \({\mathcal H}'={\mathcal H}^*\).

Given a density operator \(\rho \in \operatorname {End}({\mathcal H})\), there is a well defined operator \(\sqrt {\rho }\in \operatorname {End}({\mathcal H})\) whose eigenvectors are the same as for ρ, and whose eigenvalues are the non-negative square roots of the eigenvalues of ρ. To prove the proposition, given \(\rho \in {\mathcal H}{\mathord { \otimes } } {\mathcal H}^*\), consider \( |{\sqrt {\rho }}\rangle \langle {\sqrt {\rho }}|\in \operatorname {End}({\mathcal H}{\mathord { \otimes } } {\mathcal H}^*)\). Then \(\rho = \operatorname {trace}_{{\mathcal H}^*}( |{\sqrt {\rho }}\rangle \langle {\sqrt {\rho }}|)\). A pure state whose partial trace is ρ is called a purification of ρ.
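For example, if \(\rho =p |{0}\rangle \langle {0}|+(1-p) |{1}\rangle \langle {1}|\) on \({\mathcal H}={\mathbb C}^2\), then, viewing \(\sqrt {\rho }\) as a vector in \({\mathcal H}{\mathord { \otimes } } {\mathcal H}^*\),

$$\displaystyle \begin{aligned}|{\sqrt{\rho}}\rangle=\sqrt{p}\, |{0}\rangle{\mathord{ \otimes } } \langle {0}| +\sqrt{1-p}\, |{1}\rangle{\mathord{ \otimes } } \langle {1}|, \end{aligned}$$

and tracing \({\mathcal H}^*\) out of \(|{\sqrt {\rho }}\rangle \langle {\sqrt {\rho }}|\) recovers ρ, in agreement with the example following Definition 2.6.5.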

2.7 Communication Across a Quantum Channel

Now instead of having a source \({\mathcal X}^{\times n}\) our “source” is \({\mathcal H}^{{\mathord { \otimes } } n}\), where one can think of \({\mathcal H}^{{\mathord { \otimes } } n}={\mathcal H}_A^{{\mathord { \otimes } } n}\), and Alice will “transmit” a state to Bob, and instead of a probability distribution p one has a density operator ρ.

What is a quantum channel? It should be a linear map sending \(\rho \in \operatorname {End}({\mathcal H}_A)\) to some \(\Phi (\rho )\in \operatorname {End}({\mathcal H}_B)\).

First consider the special case \({\mathcal H}_A={\mathcal H}_B\). One should allow coupling with an auxiliary system, i.e.,

$$\displaystyle \begin{aligned} \rho\mapsto \rho{\mathord{ \otimes } } \sigma \in \operatorname{End}({\mathcal H}_A{\mathord{ \otimes } } {\mathcal H}_C). \end{aligned} $$
(2.14)

One should also allow the state ρ⊗σ to evolve in \(\operatorname {End}({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_C)\), i.e., be acted upon by an arbitrary \(U\in {\mathbf {U}}({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_C)\). Finally one should allow measurements, i.e., tracing out the \({\mathcal H}_C\) part. In summary, a quantum channel \({\mathcal H}_A\rightarrow {\mathcal H}_A\) is a map of the form \(\rho \mapsto \operatorname {trace}_{{\mathcal H}_C}(U(\rho {\mathord { \otimes } } \sigma )U{ }^{-1})\). More generally to go from \({\mathcal H}_A\) to \({\mathcal H}_B\), one needs to allow isometries as well. Such maps are the completely positive trace preserving maps (CPTP), where a map Λ is completely positive if \(\Lambda {\mathord { \otimes } } \operatorname {Id}_{{\mathcal H}_E}\) is positive for all \({\mathcal H}_E\).
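Unwinding the partial trace in an orthonormal basis {|k〉} of \({\mathcal H}_C\), and taking σ = |0〉〈0| pure, such a channel can be written in the operator-sum (Kraus) form (a standard reformulation, not needed in what follows):

$$\displaystyle \begin{aligned}\Phi(\rho)=\sum_k E_k\rho E_k^{\dagger}, \qquad \sum_k E_k^{\dagger}E_k=\operatorname{Id}_{{\mathcal H}_A}, \end{aligned}$$

where \(E_k=(\operatorname {Id}_{{\mathcal H}_A}{\mathord { \otimes } } \langle {k}|)\, U\, (\operatorname {Id}_{{\mathcal H}_A}{\mathord { \otimes } } |{0}\rangle )\).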

We seek an encoder \({\mathcal E}:{\mathcal H}^{{\mathord { \otimes } } n}\rightarrow {\mathcal H}_{0n}\) and a decoder \({\mathcal D}:{\mathcal H}_{0n}\rightarrow {\mathcal H}^{{\mathord { \otimes } } n}\), where the compression space \({\mathcal H}_{0n}\) has dimension \(2^{nR}\), with R as small as possible such that \( {\mathcal D}\circ {\mathcal E}(\rho ^{{\mathord { \otimes } } n})\) converges to \(\rho ^{{\mathord { \otimes } } n}\) as n →∞. To determine R, we need a quantum version of entropy.

Definition 2.7.1

The von Neumann entropy of a density operator ρ is \(H(\rho )=-\operatorname {trace}(\rho \operatorname {\mathrm {log}}(\rho ))\).

Here \( \operatorname {\mathrm {log}}(\rho )\) is defined as follows: write ρ in terms of its eigenvectors and eigenvalues, ρ =∑j λ j|ψ j〉〈ψ j|, then \( \operatorname {\mathrm {log}}(\rho )= \sum _j \operatorname {\mathrm {log}}(\lambda _j) |{\psi _j}\rangle \langle {\psi _j}|\).

If ρ =∑j λ j|ψ j〉〈ψ j|, then \(H(\rho )=-\sum _j \lambda _j \operatorname {\mathrm {log}}(\lambda _j)\) so if ρ is classical (i.e., diagonal), one obtains the Shannon entropy.
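As a quick numerical sanity check (a minimal sketch; the function name is my own):

```python
import numpy as np

def von_neumann_entropy(rho, base=2):
    """H(rho) = -trace(rho log rho), computed via the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)   # rho is Hermitian
    evals = evals[evals > 1e-12]      # use the convention 0 log 0 = 0
    return float(-np.sum(evals * np.log(evals)) / np.log(base))

# A pure state has entropy 0; the maximally mixed state on C^2 has entropy
# log(2) = 1 bit:
print(von_neumann_entropy(np.diag([1.0, 0.0])))  # 0.0
print(von_neumann_entropy(np.eye(2) / 2))        # 1.0
```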

Proposition 2.7.2

The von Neumann entropy has the following properties:

  1. H(ρ) ≥ 0 with equality if and only if ρ is pure.

  2. Let \(\operatorname {dim} {\mathcal H}=d\). Then \(H(\rho )\leq \operatorname {\mathrm {log}}(d)\) with equality if and only if \(\rho =\frac 1d\operatorname {Id}_{{\mathcal H}}\).

  3. If \(\rho = |{\psi }\rangle \langle {\psi }| \in \operatorname {End}({\mathcal H}_A{\mathord { \otimes } }{\mathcal H}_B)\), then H(ρ A) = H(ρ B), where \(\rho _A=\operatorname {trace}_{{\mathcal H}_B}(\rho )\in \operatorname {End}({\mathcal H}_A)\).

Notice in particular that among density operators on \({\mathbb C}^2\), the reduced density operator of |epr〉, namely \(\frac 12\operatorname {Id}\), has maximal von Neumann entropy. In Sects. 2.8 and 2.9 I discuss entanglement as a resource and von Neumann entropy as a measurement of that resource.

Theorem 2.7.3 ([21], The Quantum Noiseless Channel Theorem)

Let \(({\mathcal H},\rho )\) be an i.i.d. quantum source. If R > H(ρ), then there exists a reliable compression scheme of rate R. That is, there exists a compression space \({\mathcal H}_{0n}\) of dimension \(2^{nR}\), an encoder \({\mathcal E}:{\mathcal H}^{{\mathord { \otimes } } n}\rightarrow {\mathcal H}_{0n}\), and a decoder \({\mathcal D}:{\mathcal H}_{0n}\rightarrow {\mathcal H}^{{\mathord { \otimes } } n}\) such that \( {\mathcal D}\circ {\mathcal E}(\rho ^{{\mathord { \otimes } } n})\) converges to ρ ⊗n as n →∞. If R < H(ρ), then any compression scheme is unreliable.

2.8 More on von Neumann Entropy and Its Variants

First for the classical case, define the relative entropy \(H(\overline p|| \overline q):=-\sum p_i \operatorname {\mathrm {log}}\frac {q_i}{p_i}=-H(\overline p)-\sum _i p_i \operatorname {\mathrm {log}}(q_i)\). It is zero when \(\overline p=\overline q\) and is otherwise positive. Define the relative von Neumann entropy \(H(\rho || \sigma ):=\operatorname {trace}(\rho \operatorname {\mathrm {log}}(\rho ))-\operatorname {trace}(\rho \operatorname {\mathrm {log}}(\sigma ))\). It shares the positivity property of its classical cousin: [16] H(ρ||σ) ≥ 0 with equality if and only if ρ = σ.

Proposition 2.8.1 (von Neumann Entropy Is Non-decreasing Under Projective Measurements)

Let Π i be a complete set of orthogonal projectors, set ρ′ =∑i Π i ρ Π i . Then H(ρ′) ≥ H(ρ) with equality if and only if ρ′ = ρ.

If we think of the entropy of ρ as a measurement of entanglement, i.e., a measurement of ρ as a communication resource, we see this resource decreases after a measurement.

Proof

First note that \(0\leq H(\rho ||\rho ')=-H(\rho )-\operatorname {trace}(\rho \operatorname {\mathrm {log}}(\rho '))\). Now

$$\displaystyle \begin{aligned} \operatorname{trace}(\rho\operatorname{\mathrm{log}}(\rho'))&=\operatorname{trace}\left(\sum_i \Pi_i\rho\operatorname{\mathrm{log}} (\rho')\right)\\ &=\operatorname{trace}\left(\sum_i \Pi_i\rho\operatorname{\mathrm{log}} (\rho')\Pi_i\right) \end{aligned} $$

because \(\Pi _i^2=\Pi _i\) and \(\mathrm {trace}(AB)=\operatorname {trace}(BA)\). Now Πi commutes with ρ′ and \( \operatorname {\mathrm {log}}(\rho ')\) because Πi Πj = 0 if i ≠ j, so

$$\displaystyle \begin{aligned} \operatorname{trace}(\rho\operatorname{\mathrm{log}}(\rho')) &= \operatorname{trace}(\sum_i \Pi_i\rho \Pi_i\operatorname{\mathrm{log}} (\rho')) \\ &=\operatorname{trace}(\rho'\operatorname{\mathrm{log}}(\rho'))\\ &=-H(\rho') \end{aligned} $$

Putting it all together, we obtain the result. □

Here and in what follows ρ AB is a density operator on \({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\) and \(\rho _A=\operatorname {trace}_{{\mathcal H}_B}(\rho _{AB})\), \(\rho _B=\operatorname {trace}_{{\mathcal H}_A}(\rho _{AB})\) are respectively the induced density operators on \({\mathcal H}_A\), \({\mathcal H}_B\).

von Neumann entropy is sub-additive: H(ρ AB) ≤ H(ρ A) + H(ρ B) with equality if and only if ρ AB = ρ A ⊗ρ B. It also satisfies a triangle inequality: H(ρ AB) ≥|H(ρ A) − H(ρ B)|.

Recall the conditional Shannon entropy is defined to be \(H(\overline p_{{\mathcal X}}|\overline p_{{\mathcal Y}})=-\sum _{i,j}p_{{\mathcal X}\times {\mathcal Y}}(i,j) \operatorname {\mathrm {log}} p_{{\mathcal X}|{\mathcal Y}}(i|j)\), the entropy of \(p_{{\mathcal X}}\) conditioned on y = j, averaged over \({\mathcal Y}\). It is not clear how to “condition” one density matrix on another, so one needs a different definition. Recall that Shannon entropy satisfies \(H(\overline p_{{\mathcal X}}|\overline p_{{\mathcal Y}})=H(\overline p_{{\mathcal X}\times {\mathcal Y}})-H(\overline p_{{\mathcal Y}})\), and the right hand side of this expression does make sense for density operators, so define, for ρ AB a density operator on \({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\),

$$\displaystyle \begin{aligned} H(\rho_A|\rho_B):=H(\rho_{AB})-H(\rho_B). \end{aligned} $$
(2.15)

Note that H(ρ A|ρ B) is a function of ρ AB, as \(\rho _B=\operatorname {trace}_{{\mathcal H}_A}\rho _{AB}\).

WARNING: it is possible that the conditional von Neumann entropy is negative as it is possible that H(ρ B) > H(ρ AB). Consider the following example: Let \(|{\psi }\rangle =\frac 1{\sqrt {2}}(|{00}\rangle +|{11}\rangle )\in {\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\). Then \(\rho _A=\frac 12\operatorname {Id}_{{\mathcal H}_A}= \frac 12( |{0}\rangle \langle {0}| + |{1}\rangle \langle {1}|)\) so H(ρ A) = 1, but H(|ψ〉〈ψ|) = 0 because |ψ〉〈ψ| is pure.

However, vestiges of positivity are true in the quantum case:

Theorem 2.8.2 (Strong Sub-additivity)

Let ρ ABC be a density operator on \({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B{\mathord { \otimes } } {\mathcal H}_C\) . Then

$$\displaystyle \begin{aligned} H(\rho_C|\rho_A)+H(\rho_C|\rho_B)\geq 0 \end{aligned} $$
(2.16)

and

$$\displaystyle \begin{aligned} H(\rho_{ABC})-[H(\rho_{AB})+H(\rho_{BC})] + H(\rho_B)\geq 0. \end{aligned} $$
(2.17)

Strong sub-additivity has many consequences: entropy is non-increasing under operations such as conditioning, discarding a subsystem does not increase mutual information, and quantum operations (CPTP maps) do not increase mutual information, see, e.g. [20, §11.4.2] for a discussion.

2.9 Entanglement and LOCC

We have seen several ways that entanglement is a resource already for the space \({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B={\mathbb C}^2{\mathord { \otimes } } {\mathbb C}^2\): given a shared \(|{epr}\rangle =\frac 1{\sqrt {2}}(|{00}\rangle +|{11}\rangle )\), one can transport two bits of classical information using only one qubit (“super dense coding”) and one can also transmit one qubit of quantum information from Alice to Bob by sending two classical bits (“teleportation” ).

Given a quantum state \(\rho \in \operatorname {End}({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B)\), one would like to know how “entangled” it is, e.g., what quantum states could it be used to transport with the aid of classical communication (as in teleportation)? In this section I discuss measures of “quality of entanglement”.

2.9.1 LOCC

Assume several different laboratories can communicate classically, have prepared some shared states in advance, and can perform unitary and projection operations on their parts of the states, as was the situation for quantum teleportation. More precisely, make the following assumptions:

  • \({\mathcal H}={\mathcal H}_1{\mathord {\otimes \cdots \otimes } } {\mathcal H}_n\), and the \({\mathcal H}_j\) share an entangled state |ψ〉. Often one will just have \({\mathcal H}={\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\) and |ψ〉 = α|00〉 + β|11〉.

  • The laboratories can communicate classically.

  • Each laboratory is allowed to perform unitary and measurement operations on their own spaces.

The above assumptions are called LOCC for “local operations and classical communication”. They generalize the set-up for teleportation of Sect. 2.3.2.

Restrict to the case \({\mathcal H}={\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\), each of dimension two. I will use |epr〉 as a benchmark for measuring the quality of entanglement.

We will not be concerned with a single state |ψ〉, but the tensor product of many copies of it, \({|{\psi }\rangle }^{{\mathord { \otimes } } n}\in ({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B)^{{\mathord { \otimes } } n}\). “How much” entanglement does \(|{\psi }\rangle ^{{\mathord { \otimes } } n}\) have? An answer is given in Sect. 2.9.4.

To gain insight as to which states can be produced via LOCC from a given density operator, return to the classical case. For the classical cousin of LOCC, by considering diagonal density operators, we see we should allow alteration of a probability distribution by permuting the p j (permutation matrices are unitary), and more generally averaging our probability measure under some probability measure on elements of \({\mathfrak S}_d\) (the classical cousin of a projective measurement), i.e., we should allow

$$\displaystyle \begin{aligned} \overline{p}\mapsto \sum_{\sigma\in {\mathfrak S}_d}q_{\sigma}\mu(\sigma)\overline p \end{aligned} $$
(2.18)

where \(\mu : {\mathfrak S}_d\rightarrow GL_d\) is the map sending a permutation to the corresponding d × d permutation matrix, and q is a probability distribution on \({\mathfrak S}_d\).

This is because the unitary and projection local operators allowed amount to

$$\displaystyle \begin{aligned}\rho\mapsto \sum_{j=1}^k p_j U_j\rho U_j{}^{-1} \end{aligned}$$

where the U j are unitary and p is a probability distribution on {1, …, k} for some finite k.

2.9.2 A Partial Order on Probability Distributions Compatible with Entropy

Shannon entropy is non-increasing under an action of the form (2.18). The partial order on probability distributions determined by (2.18) is the dominance order:

Definition 2.9.1

Let \(x,y\in {\mathbb R}^d\), write \(x^{\downarrow }\) for x re-ordered such that \(x^{\downarrow }_1 \geq x^{\downarrow }_2\geq \cdots \geq x^{\downarrow }_d\). Write x ≺ y if for all k ≤ d, \(\sum _{j=1}^k x_j^{\downarrow } \leq \sum _{j=1}^k y_j^{\downarrow }\).

Note that if p is a probability distribution concentrated at a point, then \(\overline q\prec \overline p\) for all probability distributions q, and if p is such that \(p_j=\frac 1d\) for all j, then \(\overline p\prec \overline q\) for all q, and more generally the dominance order is compatible with the entropy in the sense that \(\overline p\prec \overline q\) implies \(H(\overline p)\geq H(\overline q)\).
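For example, on three outcomes one has the chain

$$\displaystyle \begin{aligned}(\tfrac 13,\tfrac 13,\tfrac 13)\prec (\tfrac 12,\tfrac 12,0)\prec (\tfrac 34,\tfrac 14,0)\prec (1,0,0), \end{aligned}$$

with entropies (in bits) \(\operatorname {\mathrm {log}} 3 > 1 > H(\tfrac 34,\tfrac 14)\approx 0.811 > 0\), illustrating this compatibility.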

Recall that a matrix D ∈ Mat d×d is doubly stochastic if D ij ≥ 0 and all column and row sums equal one. Let \({\mathcal D}{\mathcal S}_d\subset Mat_{d\times d}\) denote the set of doubly stochastic matrices. Birkhoff [5] showed \({\mathcal D}{\mathcal S}_d\) is the convex hull of \(\mu ({\mathfrak S}_d)\), and Hardy-Littlewood-Polya [13] showed \(\{ x\mid x\prec y\}= {\mathcal D}{\mathcal S}_d\cdot y\).

2.9.3 A Reduction Theorem

The study of LOCC is potentially unwieldy because there can be numerous rounds of local operations and classical communication, making it hard to model. The following result eliminates this problem:

Proposition 2.9.2

If \(|{\psi }\rangle \in {\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\) can be transformed into |ϕ〉 by LOCC, then it can be transformed to |ϕ〉 by the following sequence of operations:

  1. Alice performs a single measurement with operators \(\Pi _{M_j}\).

  2. She sends the result of her measurement (some j) to Bob classically.

  3. Bob performs a unitary operation on his system.

The key point is that for any vector spaces V, W, an element f ∈ V⊗W may be considered as a linear map W^∗→ V. In our case, \({\mathcal H}_B^*\simeq {\mathcal H}_B\), so |ψ〉 induces a linear map \({\mathcal H}_B\rightarrow {\mathcal H}_A\), which gives us the mechanism to transfer Bob’s measurements to Alice.

For \(X\in {\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\), let \( \operatorname {\mathrm {singvals}}(X)\) denote the set of its singular values. Now I can state the main theorem on LOCC:

Theorem 2.9.3 ([19])

For states \(|{\psi }\rangle ,|{\phi }\rangle \in {\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\), |ψ〉 may be transformed into |ϕ〉 by LOCC if and only if \( \operatorname {\mathrm {singvals}}(|{\psi }\rangle )\prec \operatorname {\mathrm {singvals}}(|{\phi }\rangle )\).
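A minimal numerical illustration of the theorem (names are my own; I compare the vectors of squared singular values, i.e., the spectra of the reduced density operators, the form in which Nielsen’s criterion is usually implemented):

```python
import numpy as np

def dominates(x, y):
    """Return True if x ≺ y in the dominance order of Definition 2.9.1."""
    xs, ys = np.sort(x)[::-1], np.sort(y)[::-1]
    return bool(np.all(np.cumsum(xs) <= np.cumsum(ys) + 1e-12))

def locc_convertible(psi, phi, dA, dB):
    """Can |psi> be transformed into |phi> by LOCC (Theorem 2.9.3)?"""
    s_psi = np.linalg.svd(psi.reshape(dA, dB), compute_uv=False) ** 2
    s_phi = np.linalg.svd(phi.reshape(dA, dB), compute_uv=False) ** 2
    return dominates(s_psi, s_phi)

epr = np.array([1, 0, 0, 1]) / np.sqrt(2)
prod = np.array([1.0, 0, 0, 0])            # |00>, a product state
print(locc_convertible(epr, prod, 2, 2))   # True: |epr> -> |00>
print(locc_convertible(prod, epr, 2, 2))   # False: LOCC cannot create entanglement
```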

2.9.4 Entanglement Distillation (Concentration) and Dilution

To compare the entanglement resources of two states |ϕ〉 and |ψ〉, consider \(|{\phi }\rangle ^{{\mathord { \otimes } } m}\) for large m with the goal of determining the largest n = n(m) such that \(|{\phi }\rangle ^{{\mathord { \otimes } } m}\) may be degenerated to \(|{\psi }\rangle ^{{\mathord { \otimes } } n}\) via LOCC. Due to the approximate and probabilistic nature of quantum computing, relax this to degenerating \(|{\phi }\rangle ^{{\mathord { \otimes } } m}\) to a state that is close to \(|{\psi }\rangle ^{{\mathord { \otimes } } n}\).

There is a subtlety for this question worth pointing out. Teleportation was defined in such a way that Alice did not need to know the state she was teleporting, but for distillation and dilution, she will need to know that its right singular vectors are standard basis vectors. More precisely, if she is in possession of \(|{\psi }\rangle = \sqrt {p_1}|{v_1}\rangle {\mathord { \otimes } } |{1}\rangle + \sqrt {p_2}|{v_2}\rangle {\mathord { \otimes } } |{2}\rangle \), she can teleport the second half of it to Bob if they share \(|{epr}\rangle \in {\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\). More generally, if she is in possession of \(|{\psi }\rangle =\sum _{j=1}^d \sqrt {p_j} |{v_j}\rangle {\mathord { \otimes } } |{j}\rangle \in {\mathcal H}_{A'}{\mathord { \otimes } } {\mathcal H}_{A''}\), she can teleport it to Bob if they share enough EPR states. In most textbooks, Alice is assumed to possess states whose singular vectors are |jj〉’s and I will follow that convention here. Similarly, if \(|{\psi }\rangle =\sum _{j=1}^d \sqrt {p_j} |{jj}\rangle \in {\mathcal H}_{A }{\mathord { \otimes } } {\mathcal H}_{B}\), I discuss how many shared EPR states they can construct from a shared \(|{\psi }\rangle ^{{\mathord { \otimes } } m}\).

Define the entanglement cost \(E_C(\psi )\) to be \(\mathrm {inf}_{m} \frac {m}{n(m)}\), where n(m) is the number of copies of ψ that can be constructed from \(|{epr}\rangle ^{{\mathord { \otimes } } m}\) by LOCC with error going to zero as m →∞. Similarly, define the entanglement value, or distillable entanglement, \(E_V(\psi )\) to be \(\operatorname {sup}_{m} \frac {n(m)}m\), where n(m) is the number of copies of |epr〉 that can be constructed with diminishing error from \(|{\psi }\rangle ^{{\mathord { \otimes } } m}\) by LOCC. One has \(E_V(\psi )=E_C(\psi )=H(\operatorname {trace}_{{\mathcal H}_B}(|{\psi }\rangle \langle {\psi }|))\), the von Neumann entropy of the reduced state.

Remark 2.9.4

In classical computation one can reproduce information, but this cannot be done with quantum information in general. This is because the map |ψ〉↦|ψ〉⊗|ψ〉, called the Veronese map in algebraic geometry, is not a linear map. This observation is called the no cloning theorem in the quantum literature. However, one can define a linear map, e.g., \({\mathbb C}^2\rightarrow {\mathbb C}^2{\mathord { \otimes } } {\mathbb C}^2\), that duplicates basis vectors, i.e., |0〉↦|0〉⊗|0〉 and |1〉↦|1〉⊗|1〉. But then of course \(\alpha |{0}\rangle +\beta |{1}\rangle \mapsto \alpha |{0}\rangle {\mathord { \otimes } } |{0}\rangle +\beta |{1}\rangle {\mathord { \otimes } } |{1}\rangle \neq (\alpha |{0}\rangle +\beta |{1}\rangle )^{{\mathord { \otimes } } 2}\).

For mixed states ρ on \({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\), one can still define E C(ρ) and E V(ρ), but there exist examples where they differ, so there is not a canonical measure of entanglement. A wish list of what one might want from an entanglement measure E:

  • Non-increasing under LOCC.

If ρ is a product state, i.e., \(\rho = |{\phi _A}\rangle \langle {\phi _A}|{\mathord { \otimes } } |{\psi _B}\rangle \langle {\psi _B}|\), then E(ρ) = 0.

The two conditions together imply any state constructible from a product state by LOCC should also have zero entanglement. Hence the following definition:

Definition 2.9.5

A density operator \(\rho \in \operatorname {End} ({\mathcal H}_1{\mathord {\otimes \cdots \otimes } } {\mathcal H}_n)\) is separable if ρ =∑i p i ρ i,1 ⊗⋯⊗ρ i,n, where \(\rho _{i,\alpha }\in \operatorname {End}({\mathcal H}_{\alpha })\) are density operators, p i ≥ 0, and ∑i p i = 1. If ρ is not separable, ρ is entangled.

Definition 2.9.6

An entanglement monotone E is a function on density operators on \({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\) that is non-increasing under LOCC.

An example of an entanglement monotone different from E V, E C useful for general density operators is the squashed entanglement [9]

$$\displaystyle \begin{aligned}E_{sq}(\rho_{AB}):=\mathrm{inf}_C\{ \frac 12 [H(\rho_A\mid \rho_C)+H(\rho_{B }\mid \rho_C) -H(\rho_{AB}\mid \rho_C)] \mid \rho_{AB}=\operatorname{trace}_{{\mathcal H}_C}(\rho_{ABC})\}. \end{aligned}$$

For bipartite pure states, all entanglement measures are compatible with the order of states from most to least entangled. This breaks down already for tripartite states.

Remark 2.9.7

An entanglement measure appealing to geometers is SLOCC (stochastic local operations and classical communication) defined originally in [4], which asks if \(|{\psi }\rangle \in {\mathcal H}_1{\mathord {\otimes \cdots \otimes } }{\mathcal H}_d \) is in the same \(SL({\mathcal H}_1)\times \cdots \times SL({\mathcal H}_d)\) orbit as \(|{\phi }\rangle \in {\mathcal H}_1{\mathord {\otimes \cdots \otimes } }{\mathcal H}_d \). If one relaxes this to orbit closure, then it amounts to being able to convert |ψ〉 to |ϕ〉 with positive probability. While appealing, and while there is literature on SLOCC, given the probabilistic nature of quantum computing, its use appears to be limited to very special cases, where the orbit structure is understood (e.g., d ≤ 4, \(\operatorname {dim} {\mathcal H}_j=2\)).

2.10 Tensor Network States

Assume interactions between particles are sufficiently short-range (which is satisfied in most physically relevant set-ups), and that we have an arrangement of electrons, say on a circle, as in Fig. 2.3.

Fig. 2.3 Electrons arranged on a circle

It is highly improbable that the electrons will share entanglement with any but their nearest neighbors. This is fortuitous, because if one is dealing with thousands of electrons and would like to describe their joint state, a priori one would have to work with a vector space of dimension \(2^n\), with n in the thousands, which is not feasible. The practical solution to this problem is to define a subset of \(({\mathbb C}^2)^{{\mathord { \otimes } } n}\) of reasonable dimension (e.g. O(n)) consisting of the probable states.

For another example, say the isolated system consists of electrons arranged along a line, and we only want to allow electrons to be entangled with their nearest neighbors. This leads to the notion of Matrix Product States (MPS): draw a graph reflecting this geometry, with a vertex for each electron. To each vertex, attach edges going from the electron’s vertex to those of its nearest neighbors, and add an additional edge not attached to anything else (these unattached edges will be called physical edges). If our space is \({\mathcal H}_1{\mathord {\otimes \cdots \otimes } } {\mathcal H}_n\), then, assuming vertex j has two neighbors, attach two auxiliary vector spaces, \(E_{j-1},E_{j}^*\), and a tensor \(T_j\in {\mathcal H}_j{\mathord { \otimes } } E_{j-1}{\mathord { \otimes } } E_{j}^*\). If we are on a line, to vertex one, we just attach \(T_1\in {\mathcal H}_1{\mathord { \otimes } } E_1^*\), and similarly, to vertex n we attach \(T_n\in {\mathcal H}_n{\mathord { \otimes } } E_{n-1}\). Now consider the tensor product of all the tensors

$$\displaystyle \begin{aligned}T_1{\mathord{\otimes\cdots\otimes} } T_n\in ({\mathcal H}_1{\mathord{ \otimes } } E_1^*){\mathord{ \otimes } } ({\mathcal H}_2{\mathord{ \otimes } } E_1{\mathord{ \otimes } } E_2^*){\mathord{\otimes\cdots\otimes} } ({\mathcal H}_{n-1}{\mathord{ \otimes } } E_{n-2}{\mathord{ \otimes } } E_{n-1}^*){\mathord{ \otimes } } ({\mathcal H}_n{\mathord{ \otimes } } E_{n-1}) \end{aligned}$$

Assume each E j has dimension k. We can contract these to obtain a tensor \(T\in {\mathcal H}_1{\mathord {\otimes \cdots \otimes } } {\mathcal H}_n\). If k = 1, we just obtain the product states. As we increase k, we obtain a steadily larger subset of \({\mathcal H}_1{\mathord {\otimes \cdots \otimes } }{\mathcal H}_n\), that fills the entire space for sufficiently large (exponential size) k. The claim is that the tensors obtainable in this fashion (for some k determined by the physical setup) are exactly those locally entangled states that we seek. (The first and last tensors in this set-up may be interpreted as boundary values, related to interaction with the outside world.)

For the circle, the only difference in the construction is to make the construction periodic, so \(T_1\in {\mathcal H}_1{\mathord { \otimes } } E_n{\mathord { \otimes } } E_1^*\) and \(T_n\in {\mathcal H}_n{\mathord { \otimes } } E_{n-1}{\mathord { \otimes } } E_n^*\). Such states are called Matrix product states or MPS in the physics literature.

Sometimes for applications (e.g. translation invariant systems on the circle) one requires the same tensor be placed at each vertex. If the tensor is \(\sum _{i,j,\alpha } T_{i,j,\alpha }\langle {i}|{\mathord { \otimes } } |{j}\rangle {\mathord { \otimes } } v_{\alpha }\), the resulting tensor is \(\sum T_{i_1,i_2,\alpha _1}T_{i_2,i_3,\alpha _2}\cdots T_{i_n,i_1,\alpha _n}\, v_{\alpha _1} {\mathord {\otimes \cdots \otimes } } v_{\alpha _n}\).
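Here is a minimal numpy sketch of this periodic contraction (the function name and the sizes are illustrative):

```python
import numpy as np

def periodic_mps(T, n):
    """Contract one tensor T[i, j, a] (i, j: bond indices of size k,
    a: physical index of size d) around a cycle of length n."""
    psi = T  # open bond indices first, physical indices after
    for _ in range(n - 1):
        # attach the next site: contract psi's right bond with T's left bond
        psi = np.einsum('ij...,jka->ik...a', psi, T)
    # close the cycle by identifying the two remaining bond indices
    return np.einsum('ii...->...', psi)

T = np.random.rand(3, 3, 2)   # bond dimension k = 3, physical dimension d = 2
state = periodic_mps(T, 5)    # a tensor in (C^2)^(x 5)
print(state.shape)            # (2, 2, 2, 2, 2)
```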

For a second example, consider electrons arranged in a rectangular array (or on a grid on a torus), where each vertex is allowed to interact with its four nearest neighbors. Such states are called projected entangled pair states or PEPS in the physics literature.

Assume we place the same tensor at each vertex. If our grid is n × n and periodic, we obtain a map \(({\mathbb C}^k)^{{\mathord { \otimes } } 4}{\mathord { \otimes } } {\mathbb C}^d\rightarrow ({\mathbb C}^d)^{{\mathord { \otimes } } n^2}\).

Definition 2.10.1

Let Γ be a directed graph with vertices v α and two kinds of edges: “physical” edges e i, that are attached to a single vertex, and “auxiliary” (or entanglement) edges e s between two vertices. Associate to each physical edge a vector space V i (in the quantum case, \(V_i={\mathcal H}_i\) is a Hilbert space), and to each auxiliary edge a vector space E s, of dimension e s. Let \(\overline {\mathbf e}=({\mathbf e}_1{, \hdots ,} {\mathbf e}_f)\) denote the vector of these dimensions. A tensor network state associated to \((\Gamma , \{ V_i\}, \overline {\mathbf e})\) is a tensor T ∈ V 1 ⊗⋯⊗V n obtained as follows: To each vertex v α, associate a tensor

$$\displaystyle \begin{aligned}T_{\alpha}\in {\mathord{ \otimes } }_{i\in \alpha}V_i{\mathord{ \otimes } }_{s\in in(\alpha)}E_s^* {\mathord{ \otimes } }_{t\in out(\alpha)}E_t. \end{aligned}$$

Here in(α) are the edges going into vertex α and out(α) are the edges going out of the vertex. The tensor network state associated to this configuration is T := contr(T 1 ⊗⋯⊗T g) ∈ V 1 ⊗⋯⊗V n. Let \(TNS(\Gamma , V_1{\mathord {\otimes \cdots \otimes } } V_n, \overline {\mathbf e})\subset V_1{\mathord {\otimes \cdots \otimes } } V_n\) denote the set of tensor network states.

Other graphs that occur are trees, which also appear in the numerical analysis literature, see [12].

Example 2.10.2

Let Γ be a triangle: three vertices, each with one physical edge, joined cyclically by auxiliary edges \(E_1, E_2, E_3\). Then

$$\displaystyle \begin{aligned} TNS(\Gamma,V_1{\mathord{ \otimes } } V_2{\mathord{ \otimes } } V_3,\overline{\mathbf e}) &= TNS(\Gamma,(E_1^*{\mathord{ \otimes } } E_2){\mathord{ \otimes } } (E_2^*{\mathord{ \otimes } } E_3){\mathord{ \otimes } } (E_3^*{\mathord{ \otimes } } E_1),\overline{\mathbf e})\\ &= \operatorname{End}(V_1)\times \operatorname{End}(V_2)\times \operatorname{End}(V_3)\cdot M_{\langle {\mathbf e}_1,{\mathbf e}_2,{\mathbf e}_3\rangle}. \end{aligned} $$

Here \(M_{\langle {\mathbf e}_1,{\mathbf e}_2,{\mathbf e}_3\rangle }\) is the matrix multiplication tensor, for \(A\in Mat_{{\mathbf e}_1\times {\mathbf e}_2}\), \(B\in Mat_{{\mathbf e}_2\times {\mathbf e}_3}\), \(C\in Mat_{{\mathbf e}_3\times {\mathbf e}_1}\), \((A,B,C)\mapsto \operatorname {trace}(ABC)\). Let \(e_1{, \hdots ,} e_{{\mathbf e}_1}\) be a basis of E 1, \(f_1{, \hdots ,} f_{{\mathbf e}_2}\) be a basis of E 2, and \(g_1{, \hdots ,} g_{{\mathbf e}_3}\) be a basis of E 3. Then, in these bases, \(M_{\langle {\mathbf e}_1,{\mathbf e}_2,{\mathbf e}_3\rangle }=\sum _{i,j,k}(e_i^*{\mathord { \otimes } } f_j){\mathord { \otimes } } (f_j^*{\mathord { \otimes } } g_k){\mathord { \otimes } } (g_k^*{\mathord { \otimes } } e_i)\).

There are many open questions about tensor network states: only in very few cases is there a satisfactory description of the states producible from a given graph and parameters. Regarding algebraic geometry, one can ask for a description of the ideal of the Zariski closure of the set of states producible from a given graph and parameters. Such information would be extremely useful for applications.

2.11 Representation Theory in Quantum Information Theory

Say we have a state \(\rho \in \operatorname {End}({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B)\), or in \(\operatorname {End}({\mathcal H}_{A_1}{\mathord {\otimes \cdots \otimes } } {\mathcal H}_{A_d})\), producible by a device or experiment, and we perform the experiment numerous times to get a state \(\rho ^{{\mathord { \otimes } } n}\in \operatorname {End}({\mathcal H}_A^{{\mathord { \otimes } } n}{\mathord { \otimes } } {\mathcal H}_B^{{\mathord { \otimes } } n})\). What is the “value” of such states for information theory? What are measurements of these states likely to produce? It turns out (partial) answers to these questions can be gained by exploiting representation theory, see Theorem 2.11.5. I review the relevant representation theory and then apply it to describe the solution to the quantum marginal problem, which asks which pairs of states on \({\mathcal H}_A,{\mathcal H}_B\) may arise as partial traces of some \(\rho _{AB}\in \operatorname {End}({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B)\).

2.11.1 Review of Relevant Representation Theory

(Isomorphism classes of) irreducible representations of the permutation group \({\mathfrak S}_d\) are indexed by partitions of d; write [π] for the \({\mathfrak S}_d\)-module corresponding to the partition π. The irreducible polynomial representations of GL(V ) are indexed by partitions \(\pi =(p_1{, \hdots ,} p_{\ell (\pi )})\) with \(\ell (\pi )\leq \operatorname {dim} V\). Write S π V  for the corresponding GL(V )-module.

Theorem 2.11.1 (Schur-Weyl Duality)

As a \(GL(V)\times {\mathfrak S}_d\) -module,

$$\displaystyle \begin{aligned}V^{{\mathord{ \otimes } } d}=\bigoplus_{|\pi|=d} S_{\pi }V{\mathord{ \otimes } } [\pi]. \end{aligned}$$

Let \(P_{\pi }: V^{{\mathord { \otimes } } d}\rightarrow S_{\pi }V{\mathord { \otimes } } [\pi ]\) denote the \(GL(V)\times {\mathfrak S}_d\)-module projection operator.
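For example, for d = 2 both partitions of 2 index one-dimensional \({\mathfrak S}_2\)-modules ([2] is the trivial representation, [1, 1] the sign), and Schur-Weyl duality recovers the decomposition of a 2-tensor into symmetric and skew-symmetric parts:

$$\displaystyle \begin{aligned}V^{{\mathord{ \otimes } } 2}=S^2V{\mathord{ \otimes } } [2]\ \oplus\ \Lambda^2V{\mathord{ \otimes } } [1,1]. \end{aligned}$$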

One is often interested in decompositions of a module under the action of a subgroup. For example S d(V⊗W) is an irreducible GL(V⊗W)-module, but as a GL(V ) × GL(W)-module it has the decomposition, called the Cauchy formula,

$$\displaystyle \begin{aligned} S^d(V{\mathord{ \otimes } } W)=\oplus_{|\pi|=d}S_{\pi}V{\mathord{ \otimes } } S_{\pi}W. \end{aligned} $$
(2.19)

We will be particularly interested in the decomposition of S d(U⊗V⊗W) as a GL(U) × GL(V ) × GL(W)-module. An explicit formula for this decomposition is not known. Write

$$\displaystyle \begin{aligned}S^d(U{\mathord{ \otimes } } V{\mathord{ \otimes } } W)=\bigoplus_{|\pi|,|\mu|,|\nu|=d}(S_{\pi}U{\mathord{ \otimes } } S_{\mu}V{\mathord{ \otimes } } S_{\nu}W)^{\oplus k_{\pi,\mu,\nu}}. \end{aligned}$$

The numbers k π,μ,ν that record the multiplicities are called Kronecker coefficients. They have several additional descriptions. For example, \( S_{\pi }(V{\mathord { \otimes } } W)=\bigoplus _{|\mu |,|\nu |=d} (S_{\mu }V{\mathord { \otimes } } S_{\nu }W)^{\oplus k_{\pi ,\mu ,\nu }}, \) and \(k_{\pi ,\mu ,\nu }=\operatorname {dim} ([\pi ]{\mathord { \otimes } } [\mu ]{\mathord { \otimes } } [\nu ])^{{\mathfrak S}_d} =\mathrm {mult}([d], [\pi ]{\mathord { \otimes } }[\mu ]{\mathord { \otimes } }[\nu ])=\mathrm {mult}([\pi ],[\mu ]{\mathord { \otimes } } [\nu ])\).
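For example, when d = 2 the only partitions are (2) and (1, 1); since [2] is the trivial and [1, 1] the sign representation of \({\mathfrak S}_2\), one has \(k_{\pi ,\mu ,\nu }=1\) exactly when an even number of the three partitions equal (1, 1), and zero otherwise, giving

$$\displaystyle \begin{aligned}S^2(U{\mathord{ \otimes } } V{\mathord{ \otimes } } W)&= S^2U{\mathord{ \otimes } } S^2V{\mathord{ \otimes } } S^2W \ \oplus\ S^2U{\mathord{ \otimes } } \Lambda^2V{\mathord{ \otimes } } \Lambda^2W\\ &\quad \oplus\ \Lambda^2U{\mathord{ \otimes } } S^2V{\mathord{ \otimes } } \Lambda^2W \ \oplus\ \Lambda^2U{\mathord{ \otimes } } \Lambda^2V{\mathord{ \otimes } } S^2W. \end{aligned} $$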

2.11.2 Quantum Marginals and Projections onto Isotypic Subspaces of \({\mathcal H}^{{\mathord { \otimes } } d}\)

In this section I address the question: what are compatibility conditions on density operators ρ on \({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\), ρ′ on \({\mathcal H}_A\) and ρ″ on \({\mathcal H}_B\) such that \(\rho '=\operatorname {trace}_{{\mathcal H}_B}(\rho )\), \(\rho ''=\operatorname {trace}_{{\mathcal H}_A}(\rho )\)? As you might expect by now, compatibility will depend only on the spectra of the operators.

Above I discussed representations of the general linear group GL(V ), where V  is a complex vector space. In quantum theory, one is interested in representations of the unitary group \({\mathbf {U}}({\mathcal H})\) of a Hilbert space \({\mathcal H}\). The unitary group is a real Lie group, not a complex Lie group, because complex conjugation is not a complex linear map. By a general fact, a maximal compact subgroup of a complex reductive Lie group has the same finite-dimensional representation theory as the group itself, so in particular the decomposition of \({\mathcal H}^{{\mathord { \otimes } } d}\) as a \({\mathbf {U}}({\mathcal H})\)-module coincides with its decomposition as a \(GL({\mathcal H})\)-module.

For a partition π = (p 1, …, p d) of d, introduce the notation \(\overline \pi =(\frac {p_1}d{, \hdots ,} \frac {p_d}d)\) which is a probability distribution on {1, …, d}.

Theorem 2.11.2 ([8])

Let ρ AB be a density operator on \({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\) . Then there exists a sequence (π j, μ j, ν j) of triples of partitions such that \(k_{\pi _j,\mu _j,\nu _j}\neq 0\) for all j and

$$\displaystyle \begin{aligned} \lim_{j\rightarrow \infty}\overline{\pi}_j &=\operatorname{\mathrm{spec}}(\rho_{AB})\\ \lim_{j\rightarrow \infty}\overline{\mu}_j &=\operatorname{\mathrm{spec}}(\rho_{A})\\ \lim_{j\rightarrow \infty}\overline{\nu}_j &=\operatorname{\mathrm{spec}}(\rho_{B}).\end{aligned} $$

Theorem 2.11.3 ([17])

Let ρ AB be a density operator on \({\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\) such that \( \operatorname {\mathrm {spec}}(\rho _{AB})\) , \( \operatorname {\mathrm {spec}}(\rho _A)\) and \( \operatorname {\mathrm {spec}}(\rho _B)\) are all rational vectors. Then there exists an integer M > 0 such that

$$\displaystyle \begin{aligned}k_{M\operatorname{\mathrm{spec}}(\rho_{AB}),M\operatorname{\mathrm{spec}}(\rho_A),M\operatorname{\mathrm{spec}}(\rho_B)}\neq 0. \end{aligned}$$

Theorem 2.11.4 ([17])

Let π, μ, ν be partitions of d with k π,μ,ν ≠ 0 and satisfying ℓ(π) ≤ mn, ℓ(μ) ≤ m, and ℓ(ν) ≤ n. Then there exists a density operator ρ AB on \({\mathbb C}^m{\mathord { \otimes } } {\mathbb C}^n={\mathcal H}_A{\mathord { \otimes } } {\mathcal H}_B\) with \( \operatorname {\mathrm {spec}}(\rho _{AB})=\overline \pi \), \( \operatorname {\mathrm {spec}}(\rho _{A })=\overline \mu \), and \( \operatorname {\mathrm {spec}}(\rho _{B})=\overline \nu \).

Klyachko’s proofs are via co-adjoint orbits and vector bundles on flag varieties, while the proof of Christandl-Mitchison is information-theoretic in flavor.

Recall the relative entropy \(H(\overline p||\overline q)=-\sum _i p_i \operatorname {\mathrm {log}}\frac {q_i}{p_i}\), which may be thought of as measuring how close p, q are because it is non-negative, and zero if and only if p = q. A key step in the Christandl-Mitchison proof is the following theorem:

Theorem 2.11.5 ([14])

Let \(\rho \in \operatorname {End}({\mathcal H})\) be a density operator, where \(\operatorname {dim} {\mathcal H}=n\) . Let |π| = d and let \(P_{\pi }: {\mathcal H}^{{\mathord { \otimes } } d}\rightarrow S_{\pi }{\mathcal H}{\mathord { \otimes } } [\pi ]\) be the projection operator. Then

$$\displaystyle \begin{aligned}\operatorname{trace}(P_{\pi}\rho^{{\mathord{ \otimes } } d})\leq (d+1)^{\binom n2}e^{-d H(\overline\pi ||\operatorname{\mathrm{spec}}(\rho))}. \end{aligned}$$

A key step of the proof is that the projection of \(e_I\) to \(S_{\pi }V{\mathord { \otimes } } [\pi ]\) is nonzero if and only if \(wt(e_I)\prec \pi \).

Let \(Spec_{m,n,mn}\) denote the set of admissible triples \(( \operatorname {\mathrm {spec}}(\rho _A), \operatorname {\mathrm {spec}}(\rho _B), \operatorname {\mathrm {spec}}(\rho _{AB}))\) and \(KRON_{m,n,mn}\) the triples \((\overline \mu ,\overline \nu ,\overline \pi )\) of normalized partitions (μ, ν, π) with ℓ(μ) ≤ m, ℓ(ν) ≤ n, ℓ(π) ≤ mn and k π,μ,ν ≠ 0.

The theorems above imply:

$$\displaystyle \begin{aligned}Spec_{m,n,mn}=\overline{KRON_{m,n,mn}}. \end{aligned}$$

In particular, Spec m,n,mn is a convex polytope.