1 Introduction

When studying classical stochastic processes, we often seek models and representations of the underlying system that allow us to simulate and predict future dynamics. If the process is memoryful, then models that generate it or predict its future behaviors must also have memory. Memory, however, comes at a resource cost, both in a practical sense—consider, for instance, the substantial resources required to generate predictions of weather and climate [1, 2]—and in a theoretical sense—seen in analyzing resource use in thermodynamic systems such as information engines [3]. It is therefore beneficial to seek out a process’ minimally resource-intensive implementation. Notably, this challenge remains an open problem for both classical and quantum processes.

The mathematical idealization of a system’s behaviors is its stochastic process, and the study of the resource costs for predicting and simulating processes is known as computational mechanics (CM) [4,5,6,7]. To date CM has largely focused on discrete-time, discrete-state stochastic processes. These are probability measures \(\mathbb {P}\left( \dots x_{-1} x_0 x_1 \dots \right) \) over sequences of symbols that take values in a finite alphabet \(\mathcal {X}\). The minimal information processing required to predict the sequence is represented by a type of hidden Markov model called the \(\epsilon {\textit{-machine}}\). The statistical complexity \(C_\mu \)—the memory rate for \(\epsilon \)-machines to simultaneously generate many copies of a process—is a key measure of a process’ memory resources. Where finite, \(C_\mu \) is known to be the minimal memory rate over all classical implementations.

When simulating classical processes, quantum implementations can be constructed that have smaller memory requirements than the \(\epsilon \text{-machine }\) [8, 9]. The study of such implementations is the task of quantum computational mechanics (QCM). Over a wide range of processes, a particular implementation of quantum simulation—the q-machine—has shown advantage in reduced memory rate; often the advantage over classical implementations is unbounded [10,11,12,13]. For quantum machines, the minimal memory rate \(C_q\) has been determined in cases such as the Ising model [11] and the Perturbed Coin Process [14], where the q-machine attains the minimum rate. Though a given q-machine’s memory can be readily calculated [15], in many cases the absolutely minimal \(C_q\) is not known.

Another structural formalism, developed parallel to CM, provides a calculus of quantum informational resources. This field, termed quantum resource theory (QRT), recently emerged in quantum information theory as a toolkit for addressing resource consumption in the contexts of entanglement, thermodynamics, and numerous other quantum and even classical resources [16]. Its fundamental challenge is to determine when one system (a QRT resource) can be converted to another using a predetermined set of free or allowed operations.

QRT is closely allied with two other areas of mathematics, namely majorization and lattice theory. Figure 1 depicts their relationships.

Fig. 1: Triumvirate of resource theory, majorization, and lattice theory

On the one hand, majorization is a preorder relation \(\succsim \) on positive vectors (typically probability distributions) computed by evaluating a set of inequalities [17]. If the majorization relations hold between two vectors, then one can be converted to the other using a certain class of operations. Majorization is used in several resource theories to numerically test for convertibility between two resources [18,19,20].

Lattice theory, on the other hand, concerns partially ordered sets and their suprema and infima, if they exist [21]. Functions that quantify the practical uses of a resource are monotonic with respect to the partial orders induced by convertibility and majorization. Optimization of monotones of memory is then related to the problem of finding the extrema of the lattice. Majorization and resource convertibility are both relations that generate lattice-like structures on the set of systems.

The following brings the tools of CM and QRT together for the first time. Section 3 starts with a review of majorization theory for the unfamiliar and introduces strong and weak optimization which, as we show, have eminently practical implications for process predictors and simulators. Section 4 briefly reviews CM and demonstrates how strong/weak optimizations shed new light on the fundamental role of the \(\epsilon \text{-machine }\) in the hierarchy of implementations for a given process. In particular, among classical predictive models the \(\epsilon \text{-machine }\) is strongly minimal in that it simultaneously minimizes all measures of memory. Sections 5 and 6 then take these notions into the quantum setting, demonstrating the universally advantageous nature of quantum modeling when it comes to memory resources, but showing that no analog of the (strongly minimal) \(\epsilon \)-machine exists in the hierarchy of quantum machines.

2 Processes, Probabilities, and Measures

The objects whose probabilities we study span both finite and infinite spaces, each of which entails its own notation.

Most of the objects of study in the following can be described with finite probability distributions. Finite here refers to random variables (e.g., X) that take values in a finite set (e.g., \(\mathcal {X}\)). Distribution refers to the probability of outcomes \(x\in \mathcal {X}\) given by a vector \(\mathbf {p}:=\left( p_x\right) \) with components indexed by \(\mathcal {X}\) that sum to unity: \(\sum _{x\in \mathcal {X}} p_x = 1\).

Probability vectors may be transformed into one another by stochastic matrices. Here, we write such matrices as \(\mathbf {T}:=(T_{y|x})\) to represent a stochastic mapping from \(\mathcal {X}\) to \(\mathcal {Y}\). The matrix components are indexed by elements \(x\in \mathcal {X}\) and \(y\in \mathcal {Y}\) and the stochasticity constraint is \(\sum _{y \in \mathcal {Y}} T_{y|x} = 1\) and \(T_{y|x} \ge 0\) for all \(x\in \mathcal {X}\) and \(y\in \mathcal {Y}\).

The following development works with one object that is not finite—a stochastic process. Starting with a finite set \(\mathcal {X}\) of symbols x (the “alphabet”), a length-\(\ell \) word \(w:=x_1 \dots x_{ \ell }\) is a concatenation of \(\ell \) symbols and the set of these is denoted \(\mathcal {X}^\ell \). A bi-infinite word \(\overleftrightarrow {x}:=\dots x_{-1} x_0 x_1 \dots \) is a concatenation of symbols that extends infinitely in both directions and the set of these is denoted \(\mathcal {X}^\infty \).

A stochastic process is a probability distribution over bi-infinite words. This implies a random variable \(\overleftrightarrow {X}\) taking values in the set \(\mathcal {X}^\infty \). However, this set is uncountably infinite, and the notation of measure theory is required to appropriately work with it [22]. In this case, probability values are taken over sets rather than distinct elements. We distinguish probabilities of sets from those of elements using the symbol \(\mathbb {P}\). Often, we ask for the probability of seeing a given length-\(\ell \) word w. This asks for the probability of the cylinder set \(c_w := \left\{ \overleftrightarrow {x}:\overleftrightarrow {x}=\dots x_t w x_{t+\ell +1} \dots \mathrm {\ for\ some}\ t\in \mathbb {Z}\right\} \) of bi-infinite words containing word w. The measure then induces a finite distribution \(\mathbf {p}:=(p_w)\) over \(\mathcal {X}^\ell \) describing a random variable W:

$$\begin{aligned} p_w := \mathbb {P}\left( c_w \right) . \end{aligned}$$

When discussing the process as a whole, we refer to it by its random variable \(\overleftrightarrow {X}\).

Following these conventions, lowercase boldface letters such as \(\mathbf {p}\) and \(\mathbf {q}\) denote probability vectors; uppercase boldface letters such as \(\mathbf {T}\) denote linear transformations on the probability vectors; and uppercase cursive letters such as \(\mathcal {X}\) denote finite sets (and almost always come with an associated random variable X). Lowercase italic letters generally refer to elements of a finite set, though p and q are reserved for components of probability vectors.

Notation for quantum systems follows standard practice. Cursive letters do double-duty, as \(\mathcal {H}\) is exclusively reserved for a Hilbert space, and quantum states are given by lowercase Greek letters. Linear operators are upper-case but not boldface.

3 Majorization and Optimization

First off, an overview of important relevant concepts from majorization and information theory is in order. Those familiar with these may skip to strong/weak optimization (Definition 3), though the intervening notational definitions might be useful.

The majorization of positive vectors provides a qualitative description of how concentrated a vector's total quantity is across its components. For ease of comparison, consider vectors \(\mathbf {p}=(p_i)\), \(i\in \{1,\dots ,n\}\), whose components all sum to a constant value, which we take to be unity:

$$\begin{aligned} \sum _{i=1}^n p_i = 1, \end{aligned}$$

and are nonnegative: \(p_i \ge 0\). For our purposes, we interpret these vectors as probability distributions, as just discussed in Sect. 2.

Our introduction to majorization here follows Ref. [17]. The historical definition of majorization is also the most intuitive, starting with the concept of a transfer operation.

Definition 1

(Transfer operation) A transfer operation \(\mathbf {T}\) on a vector \(\mathbf {p}=(p_i)\) selects two indices \(i,j\in \{1,\dots ,n\}\), such that \(p_i>p_j\), and transforms the components in the following way:

$$\begin{aligned} (Tp)_i&:= p_i-\epsilon \\ (Tp)_j&:= p_j+\epsilon , \end{aligned}$$

where \(0< \epsilon < p_i-p_j\), while leaving all other components equal; \((Tp)_k := p_k\) for \(k\ne i,j\).

Intuitively, these operations reduce concentration, since they act to equalize the disparity between two components, in such a way as to not create greater disparity in the opposite direction. This is the principle of transfers.

Suppose now that we have two vectors \(\mathbf {p}=(p_i)\) and \(\mathbf {q}=(q_i)\) and that there exists a sequence of transfer operations \(\mathbf {T}_1,\dots ,\mathbf {T}_m\) such that \(\mathbf {T}_m \circ \cdots \circ \mathbf {T}_1 \mathbf {p} = \mathbf {q}\). We will say that \(\mathbf {p}\) majorizes \(\mathbf {q}\), denoted \(\mathbf {p}\succsim \mathbf {q}\). The relation \(\succsim \) defines a preorder on the set of distributions, as it is reflexive and transitive but not necessarily antisymmetric.

There are, in fact, a number of equivalent criteria for majorization. We list three relevant to our development in the following composite theorem.

Theorem 1

(Majorization Criteria) Given two vectors \(\mathbf {p}:=(p_i)\) and \(\mathbf {q}:=(q_i)\) with the same total sum, let their orderings be given by the permuted vectors \(\mathbf {p}^{\downarrow } := (p^{\downarrow }_i)\) and \(\mathbf {q}^{\downarrow } := (q^{\downarrow }_i)\) such that \(p^{\downarrow }_1 \ge p^{\downarrow }_2 \ge \dots \ge p^{\downarrow }_n\) and the same for \(\mathbf {q}^{\downarrow }\). Then the following statements are equivalent:

  1. Hardy–Littlewood–Pólya: For every \(1\le k\le n\),

     $$\begin{aligned} \sum _{i=1}^k p_i^{\downarrow } \ge \sum _{i=1}^k q_i^{\downarrow }; \end{aligned}$$

  2. Principle of transfers: \(\mathbf {p}\) can be transformed to \(\mathbf {q}\) via a sequence of transfer operations;

  3. Schur-Horn: There exists a unitary matrix \(\mathbf {U}:=(U_{ij})\) such that \(\mathbf {q} = \mathbf {Dp}\), where \(\mathbf {D}:=\big (\left| U_{ij}\right| ^2\big )\), a uni-stochastic matrix.

The Hardy–Littlewood–Pólya criterion provides a visual representation of majorization in the form of the Lorenz curve. For a distribution \(\mathbf {p}:=(p_i)\), the Lorenz curve is simply the function \(\beta _\mathbf {p}(k) := \sum _{i=1}^k p_i^{\downarrow }\). See Fig. 2. We can see that \(\mathbf {p} \succsim \mathbf {q}\) so long as the area under \(\beta _\mathbf {q}\) is completely contained in the area under \(\beta _\mathbf {p}\).
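To make the partial-sum criterion concrete, the following minimal sketch (Python with NumPy; the helper names are ours, not part of the original development) computes Lorenz curves and tests majorization for the two distributions plotted in Fig. 2.

```python
import numpy as np

def lorenz(p):
    """Lorenz curve: cumulative sums of the components sorted in decreasing order."""
    return np.cumsum(np.sort(np.asarray(p, float))[::-1])

def majorizes(p, q, tol=1e-12):
    """True if p majorizes q by the Hardy-Littlewood-Polya partial-sum criterion.
    Vectors of unequal length are padded with zeros before comparison."""
    n = max(len(p), len(q))
    p = np.pad(np.asarray(p, float), (0, n - len(p)))
    q = np.pad(np.asarray(q, float), (0, n - len(q)))
    return bool(np.all(lorenz(p) >= lorenz(q) - tol))

# Distributions used in Fig. 2 (comparable case)
p = [3/4, 1/8, 1/8, 0, 0]
q = [2/5, 1/5, 1/5, 1/10, 1/10]
print(majorizes(p, q), majorizes(q, p))   # True False: p majorizes q
```

Running the same test on the pair used in Fig. 3 returns False in both directions, which is the incomparable case discussed below.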

Fig. 2: Lorenz curves when \(\mathbf {p}\) and \(\mathbf {q}\) are comparable and the first majorizes the second: \(\mathbf {p} \succsim \mathbf {q}\). Here, we chose \(\mathbf {p} = (\nicefrac {3}{4},\nicefrac {1}{8},\nicefrac {1}{8},0,0)\) and \(\mathbf {q} = (\nicefrac {2}{5},\nicefrac {1}{5},\nicefrac {1}{5},\nicefrac {1}{10},\nicefrac {1}{10})\). Tick marks indicate kinks in the Lorenz curve

The Lorenz curve can be understood via a social analogy, by examining rhetoric of the form “The top x% of the population owns y% of the wealth”. Let y be a function of x in this statement, and we have the Lorenz curve of a wealth distribution. (Majorization, in fact, has its origins in the study of income inequality.)

If neither \(\mathbf {p}\) nor \(\mathbf {q}\) majorizes the other, they are incomparable.Footnote 1 (See Fig. 3.)

As noted, majorization is a preorder, since there may exist distinct \(\mathbf {p}\) and \(\mathbf {q}\) such that \(\mathbf {p}\succsim \mathbf {q}\) and \(\mathbf {q} \succsim \mathbf {p}\). This defines an equivalence relation \(\sim \) between distributions. It can be checked that \(\mathbf {q} \sim \mathbf {p}\) if and only if the two vectors are related by a permutation matrix \(\mathbf {P}\). Every preorder can be converted into a partial order by considering equivalence classes \([\mathbf {p}]_\sim \).

Fig. 3: Lorenz curves when \(\mathbf {p}\) and \(\mathbf {q}\) are incomparable. Here, we chose \(\mathbf {p} = (\nicefrac {3}{5},\nicefrac {1}{10},\nicefrac {1}{10},\nicefrac {1}{10},\nicefrac {1}{10})\) and \(\mathbf {q} = (\nicefrac {1}{3},\nicefrac {1}{3},\nicefrac {1}{3},0,0)\)

If majorization, in fact, captures important physical properties of the distributions, we should expect that these properties may be quantified. The monotones that quantify the majorization preorder are the Schur-convex and Schur-concave functions.

Definition 2

(Schur-convex (-concave) functions) A function \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) is called Schur-convex (-concave) if \(\mathbf {p}\succsim \mathbf {q}\) implies \(f(\mathbf {p}) \ge f(\mathbf {q})\) (\(f(\mathbf {p}) \le f(\mathbf {q})\)). f is strictly Schur-convex (concave) if \(\mathbf {p}\succsim \mathbf {q}\) and \(f(\mathbf {p}) = f(\mathbf {q})\) implies \(\mathbf {p}\sim \mathbf {q}\).

An important class of Schur-concave functions consists of the Rényi entropies:

$$\begin{aligned} H_\alpha (\mathbf {p}) := \frac{1}{1-\alpha }\log _2\left( \sum _{i=1}^n p_i^\alpha \right) . \end{aligned}$$

In particular, the three limits:

$$\begin{aligned} H(\mathbf {p})&:= \lim _{\alpha \rightarrow 1}H_\alpha (\mathbf {p}) = -\sum _{i=1}^n p_i\log _2 p_i ,\\ H_0(\mathbf {p})&:= \lim _{\alpha \rightarrow 0}H_\alpha (\mathbf {p}) = \log _2 \left| \{1\le i \le n: p_i > 0\}\right| , ~\text {and} \\ H_\infty (\mathbf {p})&:= \lim _{\alpha \rightarrow \infty }H_\alpha (\mathbf {p}) = -\log _2 \max _{1\le i \le n}p_i ~ \end{aligned}$$

These limits are known as the Shannon entropy, the topological entropy, and the min-entropy, respectively. They describe important practical features of a distribution. In order, they describe (i) the asymptotic rate at which the outcomes can be accurately conveyed, (ii) the single-shot resource requirements for the same task, and (iii) the best probability of correctly guessing the outcome when no information is conveyed at all (or, alternatively, the single-shot rate at which randomness can be extracted from the distribution) [23, 24]. As such, they play a significant role in communication and memory storage.

We note that the Rényi entropies for \(0< \alpha < \infty \) are strictly Schur-concave.

The example of two incomparable distributions \(\mathbf {p}\) and \(\mathbf {q}\) can be analyzed in terms of the Rényi entropies if we plot \(H_\alpha \left( \mathbf {p}\right) \) and \(H_\alpha \left( \mathbf {q}\right) \) as a function of \(\alpha \), as in Fig. 4.
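As an illustrative sketch (Python/NumPy, assuming nothing beyond the definitions above), the three limiting entropies can be evaluated directly; applied to the incomparable pair of Fig. 3 they display the crossing behavior that Fig. 4 plots.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy H_alpha in bits, including the limits alpha = 0, 1, infinity."""
    p = np.asarray(p, float)
    p = p[p > 0]                                   # convention: 0 log 0 = 0
    if alpha == 0:
        return float(np.log2(len(p)))              # topological entropy
    if alpha == 1:
        return float(-np.sum(p * np.log2(p)))      # Shannon entropy
    if alpha == np.inf:
        return float(-np.log2(np.max(p)))          # min-entropy
    return float(np.log2(np.sum(p ** alpha)) / (1 - alpha))

# Incomparable distributions from Fig. 3
p = [3/5, 1/10, 1/10, 1/10, 1/10]
q = [1/3, 1/3, 1/3, 0, 0]
for a in (0, 1, np.inf):
    print(a, renyi_entropy(p, a), renyi_entropy(q, a))
# p has the larger H_0 (log2 5 vs log2 3) but the smaller H_infinity,
# so neither distribution majorizes the other.
```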

Fig. 4: Rényi entropies of the two incomparable distributions \(\mathbf {p}\) and \(\mathbf {q}\) from Fig. 3

The central idea explored in the following is how majorization may be used to determine when it is possible to simultaneously optimize all entropy monotones—or whether, instead, different monotones attain their extrema on different distributions. Obviously, this distinction is a highly practical one to make when possible. This leads to defining strong maxima and strong minima.

Definition 3

(Strong maximum (minimum)) Let S be a set of probability distributions. If a distribution \(\mathbf {p}\in S\) satisfies \(\mathbf {p}\precsim \mathbf {q}\) (\(\mathbf {p}\succsim \mathbf {q}\)), for all \(\mathbf {q}\in S\), then \(\mathbf {p}\) is a strong maximum (minimum) of the set S.

The extrema names derive from the fact that the strong maximum maximizes the Rényi entropies and the strong minimum minimizes them. One can extend the definitions to the case where \(\mathbf {p} \not \in S\): if \(\mathbf {p}\precsim \mathbf {q}\) for all \(\mathbf {q}\in S\), and any other \(\mathbf {p}'\) satisfying \(\mathbf {p}'\precsim \mathbf {q}\) for all \(\mathbf {q}\in S\) must obey \(\mathbf {p}'\precsim \mathbf {p}\), then \(\mathbf {p}\) is called a strong supremum (or, in the other direction, a strong infimum). These constructions may not be unique as \(\succsim \) is a preorder and not a partial order. However, if we sort by equivalence class, then the strongly maximal (minimal) class is unique if it exists.

In lattice-theoretic terms, the strong maximum is essentially the lattice-theoretic notion of a meet and the strong minimum is a join [21].

One example of strong minimization is found in quantum mechanics. Let \(\rho \) be a density matrix and X be a maximal diagonalizing measurement. For a given measurement Y, let \(\left. \rho \right| _Y\) be the corresponding probability distribution that comes from measuring \(\rho \) with Y. Then \(\left. \rho \right| _X \succsim \left. \rho \right| _Y\) for all maximal projective measurements Y. (This follows from the unitary matrices that transform from the basis of X to that of Y and the Schur–Horn lemma.)
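This can be checked numerically. The sketch below (illustrative only; the density matrix and the measurement basis are arbitrary choices of ours) verifies that the eigenvalue distribution of a density matrix majorizes the outcome distribution of a randomly chosen rank-one projective measurement.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary 3x3 density matrix: positive semidefinite, normalized to unit trace.
A = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
rho = A @ A.conj().T
rho /= np.trace(rho).real

eigvals = np.sort(np.linalg.eigvalsh(rho))[::-1]          # distribution rho|_X

# A random maximal (rank-one) projective measurement: columns of a unitary matrix.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3)))
outcomes = np.array([(Q[:, k].conj() @ rho @ Q[:, k]).real for k in range(3)])

# Partial-sum (Hardy-Littlewood-Polya) check: eigenvalues majorize the outcomes.
print(np.all(np.cumsum(eigvals) >= np.cumsum(np.sort(outcomes)[::-1]) - 1e-12))  # True
```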

Another, recent example is found in Ref. [25], where the set \(B_{\epsilon }\left( \mathbf {p}\right) \) of all distributions \(\epsilon \)-close to \(\mathbf {p}\) under the total variation distance \(\delta \) is considered:

$$\begin{aligned} B_{\epsilon }\left( \mathbf {p}\right) := \left\{ \mathbf {q}:\delta (\mathbf {p},\mathbf {q})\le \epsilon \right\} . \end{aligned}$$

This set has a strong minimum, called the steepest distribution \(\overline{\mathbf {p}}{}_\epsilon \), and a strong maximum, called the flattest distribution \(\underline{\mathbf {p}}{}_\epsilon \).

When a strong minimum or maximum does not exist, we refer to the individual extrema of the various monotones as weak extrema.

4 Strong Minimality of the \(\epsilon \)-Machine

We spoke in the introduction of simulating and predicting processes; this task is accomplished by hidden Markov models (HMMs) [26]. Here, we study a particular class of HMMs which we term finite predictive models (FPM).Footnote 2

Definition 4

(Finite predictive model) A finite predictive model is a triplet \(\mathfrak {M} :=(\mathcal {R},\mathcal {X},\{\mathbf {T}^{(x)}:x\in \mathcal {X}\})\) containing:

  1. A finite set of hidden states \(\mathcal {R}\),

  2. A finite alphabet \(\mathcal {X}\),

  3. Nonnegative transition matrices \(\mathbf {T}^{(x)} := \left( T^{(x)}_{r'|r} \right) \), labeled by symbols \(x\in \mathcal {X}\) with components indexed by \(r,r'\in \mathcal {R}\),

satisfying the properties:

  1. Irreducibility: \(\mathbf {T} := \sum _{x\in \mathcal {X}} \mathbf {T}^{(x)}\) is stochastic and irreducible.

  2. Unifilarity: \(T^{(x)}_{r'|r} = P_{x|r} \delta _{r',f(r,x)}\) for some stochastic matrix \(P_{x|r}\) and deterministic function f.

A finite predictive model is thought of as a dynamical object; the model transitions between states \(r,r'\in \mathcal {R}\) at each timestep while emitting a symbol \(x\in \mathcal {X}\) with probabilities determined by the transition matrices \(\mathbf {T}^{(x)}:=\big (T^{(x)}_{r'|r}\big )\). Unifilarity ensures that, given the model state \(r \in \mathcal {R}\) and symbol \(x \in \mathcal {X}\), the next state \(r' \in \mathcal {R}\) is unique.

What makes this model predictive? Here, it is the unifilarity property that grants predictivity: compared with nonunifilar alternatives, the hidden state of a unifilar model provides the most information possible about the process’ future behavior [6].

Given a FPM \(\mathfrak {M}\), the state transition matrix \(\mathbf {T}\) has a unique right eigenvector \(\varvec{\pi }\) with eigenvalue 1, by the Perron-Frobenius theorem, satisfying \(\mathbf {T}\varvec{\pi } = \varvec{\pi }\). We call this state distribution the stationary distribution. The finite set \(\mathcal {R}\) and distribution \(\varvec{\pi }\) form a random variable R describing the asymptotic distribution over hidden states.

A stationaryFootnote 3 stochastic process \(\overleftrightarrow {X}\) is entirely determined by specifying its probability vectors \(\mathbf {p}^{(\ell )}:=(p^{(\ell )}_w)\) over words \(w=x_1\dots x_\ell \) of length \(\ell \), for all \(\ell \in \mathbb {Z}^{+}\). Using the stationary distribution \(\varvec{\pi }\) we define the process \(\overleftrightarrow {X}_\mathfrak {M}\) generated by \(\mathfrak {M}\) using the word distributions \(p^{(\ell )}_w := \varvec{1}^\top \mathbf {T}^{(x_\ell )}\dots \mathbf {T}^{(x_1)} \varvec{\pi }\), where \(w:=x_1\dots x_\ell \) and \(\varvec{1}\) is the vector with all 1’s for its components. If we let \(\varvec{\delta }_r\) be a distribution on \(\mathcal {R}\) that assigns the state \(r\in \mathcal {R}\) probability 1, then the vector \(\mathbf {p}^{(\ell )}_r:=(p^{(\ell )}_{w|r})\) with components \(p^{(\ell )}_{w|r} := \varvec{1}^\top \mathbf {T}^{(x_\ell )}\dots \mathbf {T}^{(x_1)} \varvec{\delta }_{r}\) is the probability of seeing word w after starting in state r.Footnote 4
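The following sketch (Python/NumPy) illustrates these formulas on a small, hypothetical two-state unifilar model of our own choosing; it computes the stationary distribution \(\varvec{\pi }\) and a word probability \(p^{(\ell )}_w\).

```python
import numpy as np

# Hypothetical two-state unifilar model over alphabet {0, 1} (illustrative only):
# from state A, emit 1 w.p. 0.7 and stay, or emit 0 w.p. 0.3 and move to B;
# from state B, emit 0 w.p. 1 and move to A.
# Convention: T[x][r_next, r] is the probability of emitting x while moving r -> r_next.
T = {
    "0": np.array([[0.0, 1.0],
                   [0.3, 0.0]]),
    "1": np.array([[0.7, 0.0],
                   [0.0, 0.0]]),
}

T_full = sum(T.values())                 # column-stochastic state-transition matrix

# Stationary distribution: the right eigenvector of T_full with eigenvalue 1.
vals, vecs = np.linalg.eig(T_full)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()

def word_probability(word, pi):
    """p_w = 1^T T^(x_l) ... T^(x_1) pi  for the word w = x_1 ... x_l."""
    v = pi
    for x in word:
        v = T[x] @ v
    return float(v.sum())

print(pi)                                # approx [0.769, 0.231]
print(word_probability("110", pi))       # approx 0.113
```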

Given a model with stationary distribution \(\varvec{\pi }\), we define the model’s Rényi memory as \(H_\alpha \left( \mathfrak {M}\right) := H_\alpha \left( \varvec{\pi }\right) \). This includes the topological memory \(H_0\left( \mathfrak {M}\right) \), the statistical memory \(H\left( \mathfrak {M}\right) =H_1\left( \mathfrak {M}\right) \), and the min-memory \(H_\infty \left( \mathfrak {M}\right) \). Given a process \(\overleftrightarrow {X}\), we define the Rényi complexity as [4]

$$\begin{aligned} C_\mu ^{(\alpha )}\left( \overleftrightarrow {X}\right) := \min _{\mathfrak {M}:\overleftrightarrow {X}=\overleftrightarrow {X}_\mathfrak {M}}H_\alpha \left( \mathfrak {M}\right) \end{aligned}$$

These include the topological complexity \(C^{(0)}_\mu \), the statistical complexity \(C_\mu :=C_\mu ^{(1)}\), and the min-complexity \(C^{(\infty )}_\mu \).

The question, then, of strong or weak optimization with regards to memory in prediction and simulation is really the question of whether, for a given process \(\overleftrightarrow {X}\), a particular model achieves all \(C^{(\alpha )}_\mu \) (strong optimization), or whether a separate model is required for different values of \(\alpha \) (weak optimization). As each \(\alpha \) may have practical meaning in a particular scenario, this question is highly relevant for problems of optimal modeling.

Among the class of FPMs, a particularly distinguished member is the \(\epsilon {\textit{-machine}}\), first considered in Ref. [4]. We use the definition given in Ref. [31].

Definition 5

(Generator\(\epsilon \)-machine) A generator \(\epsilon \text{-machine }\) is a finite predictive model \(\mathfrak {M} :=(\mathcal {S},\mathcal {X},\{\mathbf {T}^{(x)}:x\in \mathcal {X}\})\) such that \(\mathbf {p}^{(\ell )}_s = \mathbf {p}^{(\ell )}_{s'}\) for all \(\ell \in \mathbb {Z}^{+}\) implies \(s=s'\) for \(s,s'\in \mathcal {S}\).

In other words, a generator \(\epsilon \text{-machine }\) must be irreducible, unifilar, and its states must be probabilistically distinct, so that no pair of distinct states predict the same future.

An important result of computational mechanics is that the generator \(\epsilon \text{-machine }\) is unique with respect to the process it generates [31].

Theorem 2

(Model-Process Uniqueness Theorem) Given an \(\epsilon \text{-machine }\)\(\mathfrak {M}\), there is no other \(\epsilon \text{-machine }\) that generates \(\overleftrightarrow {X}_{\mathfrak {M}}\).

This is a consequence of the equivalence of the generator definition with another, called the history \(\epsilon \text{-machine }\), which is itself provably unique (up to isomorphism) [6]. A further important result is that the \(\epsilon \text{-machine }\) minimizes both the statistical complexity \(C_\mu \) and the topological complexity \(C_\mu ^{(0)}\) [6].

To fix intuitions, and to begin introducing majorization concepts into CM, we will now consider several example processes and their models.

First, consider the Biased Coin Process, a memoryless process in which, at each time step, a coin is flipped with probability p of generating a 1 and probability \(1-p\) of generating a 0. Figure 5 displays three models for it. Model (a) is the process’ \(\epsilon \text{-machine }\), and models (b) and (c) are each 2-state alternative finite predictive models. Notice that in both models (b) and (c), the two states generate equivalent futures.

Fig. 5: The diagrammatic form of a finite predictive model is read as follows. The colored circles represent hidden states from the finite set \(\mathcal {R}\). The edges are labeled by a blue number, the symbol x, and a probability p. The edges with symbol x represent the transition matrix \(\mathbf {T}^{ ({x})}:=\big (T^{ ({x})}_{r'|r}\big )\), where the tail of the arrow is the starting state r, the head is the final state \(r'\), and \(p=T^{({x})}_{r'|r}\). (a) \(\epsilon \)-Machine for a coin flipped with bias p. (b) Alternate representation with bias p to be in state B and \(1-p\) to be in state C. (c) Alternate representation with biases p to stay in the current state and \(1-p\) to switch states

Continuing, Fig. 6 displays two alternative models of the even–odd process. This process is uniformly random save for the constraint that 1s appear only in blocks of even number and 0s only in blocks of odd number. We see in Fig. 6a the process’ \(\epsilon \text{-machine }\). In Fig. 6b, we see an alternative finite predictive model. Notice that its states E and F predict the same futures and so are not probabilistically distinct. They both play the role of state C in the \(\epsilon \text{-machine }\), in terms of the futures they predict.

Fig. 6: (a) \(\epsilon \)-Machine for the even–odd process. (b) Refinement of the even–odd process \(\epsilon \text{-machine }\), where the \(\epsilon \text{-machine }\)’s state C has been split into states E and F

Majorization and Lorenz curves, in particular, allow us to compare the various models for each of these processes—see Fig. 7. We notice that the \(\epsilon \text{-machine }\) state distribution always majorizes the state distribution of the alternative machines.
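This observation can be verified directly with the partial-sum test sketched in Sect. 3. For the Biased Coin Process the \(\epsilon \text{-machine }\) has a single state, while the alternative models of Fig. 5b, c have two; the check below (illustrative, with the bias chosen arbitrarily as \(p=0.6\)) pads the shorter distribution with zeros and confirms the majorization.

```python
import numpy as np

def majorizes(p, q, tol=1e-12):
    """True if p majorizes q (partial-sum criterion); shorter vectors are zero-padded."""
    n = max(len(p), len(q))
    P = np.cumsum(np.sort(np.pad(np.asarray(p, float), (0, n - len(p))))[::-1])
    Q = np.cumsum(np.sort(np.pad(np.asarray(q, float), (0, n - len(q))))[::-1])
    return bool(np.all(P >= Q - tol))

pi_eps   = [1.0]          # single-state epsilon-machine of the Biased Coin Process
pi_alt_b = [0.6, 0.4]     # Fig. 5b-style model with illustrative bias p = 0.6
pi_alt_c = [0.5, 0.5]     # Fig. 5c-style model: symmetric switching gives (1/2, 1/2)

print(majorizes(pi_eps, pi_alt_b), majorizes(pi_eps, pi_alt_c))   # True True
```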

Fig. 7: (a) Lorenz curves for Fig. 5a’s \(\epsilon \text{-machine }\) and Fig. 5b’s alternative predictor of the Biased Coin Process. (b) Same comparison for the even–odd process \(\epsilon \text{-machine }\) (Fig. 6a) and alternative predictor (Fig. 6b)

The key to formalizing this observation is the following corollary of the Model-Process Uniqueness Theorem:

Corollary 1

(State Merging) Let \(\mathfrak {M}:=(\mathcal {R},\mathcal {X},\{\mathbf {T}^{(x)}:x\in \mathcal {X}\})\) be a finite predictive model that is not an \(\epsilon \text{-machine }\). Then the machine created by merging its probabilistically equivalent states is the \(\epsilon \text{-machine }\) of the process \(\overleftrightarrow {X}_\mathfrak {M}\) generated by \(\mathfrak {M}\).

Proof

Let \(\sim \) be the equivalence relation where \(r\sim r'\) if \(\mathbf {p}^{(\ell )}_r =\mathbf {p}^{(\ell )}_{r'}\) for all \(\ell \in \mathbb {Z}^{+}\). Let \(\mathcal {S}\) consist of the set of equivalence classes \([r]_{\sim }\) generated by this relation. For a given class \(s\in \mathcal {S}\), consider the transition probabilities associated with each \(r\in s\). For each \(x\in \mathcal {X}\) such that \(P_{x|r} > 0\), there is an outcome state \(r_x := f(r,x)\). Comparing with another state in the same class \(r'\in s\), we have the outcome state \(r_x' := f(r',x)\).

For the future predictions of both states r and \(r'\) to be equivalent, they must also be equivalent after seeing the symbol x. That is, \({p}^{(\ell )}_{w|r} ={p}^{(\ell )}_{w|r'}\) for all w and \(\ell \) also implies \({p}^{(\ell +1)}_{xw|r} ={p}^{(\ell +1)}_{xw|r'}\) for all x, w and \(\ell \). But \({p}^{(\ell +1)}_{xw|r} = P_{x|r}\, {p}^{(\ell )}_{w|r_x}\) and \(P_{x|r} = P_{x|r'} > 0\), and so we have \(r_x\sim r_x'\) for every \(x\in \mathcal {X}\) with \(P_{x|r}>0\).

The upshot of these considerations is that we can define a consistent and unifilar transition dynamic \(\{\widetilde{\mathbf {T}}{}^{(x)}:x\in \mathcal {X}\}\) on \(\mathcal {S}\) given by the matrices \(\widetilde{T}{}^{(x)}_{s'|s} := \sum _{r'\in s'} T^{(x)}_{r'|r}\) for any \(r\in s\). It inherits unifilarity from the original model \(\mathfrak {M}\) as well as irreducibility. It has probabilistically distinct states since we already merged all of the probabilistically equivalent states. Therefore, the resulting machine \(\mathfrak {M}_{\mathcal {S}} := (\mathcal {S},\mathcal {X},\{\widetilde{\mathbf {T}}{}^{(x)}:x\in \mathcal {X}\})\) is the \(\epsilon \text{-machine }\) of the process \(\overleftrightarrow {X}_\mathfrak {M}\) generated by \(\mathfrak {M}\); its uniqueness follows from the Model-Process Uniqueness Theorem. \(\square \)

The state-merging procedure here is an adaptation of the Hopcroft algorithm for minimization of deterministic finite (nonstochastic) automata, which is itself an implementation of the Nerode equivalence relation [32]. The Hopcroft algorithm has been applied previously to analyze synchronization in \(\epsilon \)-machines [33].
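Algorithmically, the merging can be sketched as a partition refinement in the spirit of the Hopcroft algorithm just mentioned: start with all states in one class and repeatedly split classes whose members disagree on their emission probabilities or on the class of their successors. The example model below is hypothetical (it is not the even–odd process), chosen only so that two states merge.

```python
def merge_states(states, P, f):
    """Partition-refinement sketch of Corollary 1's state merging.

    states : list of hidden-state labels
    P      : dict (state, symbol) -> emission probability P_{x|r}
    f      : dict (state, symbol) -> unique successor state (unifilarity)
    Returns the partition of `states` into probabilistically equivalent classes.
    """
    symbols = sorted({x for (_, x) in P})
    block = {r: 0 for r in states}            # start with every state in one class
    while True:
        # A state's signature: for each symbol, its emission probability and the
        # current class of its successor (None if the symbol cannot be emitted).
        def signature(r):
            return tuple(
                (P.get((r, x), 0.0),
                 block[f[(r, x)]] if P.get((r, x), 0.0) > 0 else None)
                for x in symbols)
        keys, new_block = {}, {}
        for r in states:
            key = (block[r], signature(r))
            keys.setdefault(key, len(keys))
            new_block[r] = keys[key]
        if new_block == block:                # the refinement has stabilized
            classes = {}
            for r in states:
                classes.setdefault(block[r], []).append(r)
            return list(classes.values())
        block = new_block

# Hypothetical four-state unifilar model in which E and F predict identical futures.
states = ["A", "B", "E", "F"]
P = {("A", "0"): 0.25, ("A", "1"): 0.75, ("B", "0"): 0.5, ("B", "1"): 0.5,
     ("E", "0"): 0.5, ("E", "1"): 0.5, ("F", "0"): 0.5, ("F", "1"): 0.5}
f = {("A", "0"): "E", ("A", "1"): "B", ("B", "0"): "F", ("B", "1"): "E",
     ("E", "0"): "A", ("E", "1"): "B", ("F", "0"): "A", ("F", "1"): "B"}
print(merge_states(states, P, f))             # [['A'], ['B'], ['E', 'F']]
```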

Using Corollary 1, we prove this section’s main result.

Theorem 3

(Strong Minimality of \(\epsilon \)-Machine) Let \(\mathfrak {M}_{\mathcal {S}} := (\mathcal {S},\mathcal {X},\{\widetilde{\mathbf {T}}{}^{(x)}:x\in \mathcal {X}\})\) be the \(\epsilon \text{-machine }\) of process \(\overleftrightarrow {X}\) and \(\mathfrak {M}_{\mathcal {R}}:=(\mathcal {R},\mathcal {X},\{\mathbf {T}^{(x)}:x\in \mathcal {X}\})\) be any other finite predictive model generating the same process. Let the stationary distributions be \(\varvec{\pi }_\mathcal {S}:=\left( \pi _{s|\mathcal {S}}\right) \) and \(\varvec{\pi }_\mathcal {R}:=\left( \pi _{r|\mathcal {R}}\right) \), respectively. Then \(\varvec{\pi }_\mathcal {S}\succsim \varvec{\pi }_\mathcal {R}\), with equivalence \(\sim \) only when \(\mathfrak {M}_{\mathcal {S}}\) and \(\mathfrak {M}_{\mathcal {R}}\) are isomorphic.

Proof

By Corollary 1, the states of the \(\epsilon \text{-machine }\) \(\mathfrak {M}_{\mathcal {S}}\) are formed by merging equivalence classes \(s=[r]\) on the finite predictive model \(\mathfrak {M}_{\mathcal {R}}\). Since the machines are otherwise equivalent, the stationary probability \(\pi _{s|\mathcal {S}}\) is simply the sum of the stationary probabilities \(\pi _{r|\mathcal {R}}\) for each \(r\in s\). That is:

$$\begin{aligned} \pi _{s|\mathcal {S}} = \sum _{r\in s} \pi _{r|\mathcal {R}}. \end{aligned}$$

One can then construct \(\varvec{\pi }_\mathcal {R}\) from \(\varvec{\pi }_\mathcal {S}\) by a series of transfer operations in which probability is shifted out of each state s into the new states \(r\in s\). Since the two distributions are related by a series of transfer operations, \(\varvec{\pi }_\mathcal {S}\succsim \varvec{\pi }_\mathcal {R}\). \(\square \)

It immediately follows that the \(\epsilon \text{-machine }\) not only minimizes the statistical complexity \(C_\mu \) and the topological complexity \(C^{(0)}_\mu \), but also minimizes every other Rényi complexity \(C^{(\alpha )}_\mu \). That this was so for \(C_\mu \) and \(C^{(0)}_\mu \) had previously been proven; the extension to all \(\alpha \) is a new result here.

The uniqueness of the \(\epsilon \text{-machine }\) is extremely important in formulating this result. This property of \(\epsilon \)-machines follows from the understanding of predictive models as partitions of the past and of the \(\epsilon \)-machines as corresponding to the coarsest graining of these predictive partitions [6]. Other paradigms for modeling will not necessarily have this underlying structure and so may not have strongly minimal solutions. Indeed, in the following we will see that this result does not generalize to quantum machines.

5 Strong Quantum Advantage

A pure-state quantum machine generalizes the classical case by replacing the classical states s with quantum state vectors \(\left| \eta _ s\right\rangle \) and the symbol-labeled transition matrices \(\mathbf {T}^{(x)}\) with symbol-labeled Kraus operators \(K^{(x)}\).Footnote 5 The resulting generalization is called a pure-state quantum model (PSQM).

Definition 6

(Pure-state quantum model) A pure-state quantum model is a quintuplet \(\mathfrak {M} :=(\mathcal {H},\mathcal {X},\mathcal {S}, \{\left| \eta _s\right\rangle :s\in \mathcal {S}\},\{K^{(x)}:x\in \mathcal {X}\})\) consisting of:

  1. A finite-dimensional Hilbert space \(\mathcal {H}\),

  2. A finite alphabet \(\mathcal {X}\),

  3. Pure states \(\left| \eta _s\right\rangle \) indexed by elements \(s\in \mathcal {S}\) in a finite set \(\mathcal {S}\),

  4. Nonnegative Kraus operators \(K^{(x)}\) indexed by symbols \(x\in \mathcal {X}\),

satisfying the properties:

  1. Completeness: The Kraus operators satisfy \(\sum _x K^{(x)\dagger }K^{(x)}=I\).

  2. Unifilarity: \(K^{(x)}\left| \eta _s\right\rangle \propto \left| \eta _{f(s,x)}\right\rangle \) for some deterministic function f(s, x).

This is a particular kind of hidden quantum Markov model (HQMM) [35] in which we assume the dynamics can be described by the evolution of pure states. This is practically analogous to the assumption of unifilarity in the classical predictive setting.

It is not necessarily the case that the states \(\{\left| \eta _s\right\rangle \}\) form an orthonormal basis; rather, their nonorthogonality is the intended advantage [8, 9]. Overlap between the states allows for a smaller von Neumann entropy of the process’ stationary state distribution. We formalize this notion shortly.

It is assumed that the Kraus operators have a unique stationary density matrix \(\rho _\pi \) analogous to a classical model’s stationary state \(\varvec{\pi }\). One way to compute it is to note that the matrix \(P_{x|s} = \left\langle \eta _s\right| K^{(x)\dagger }K^{(x)}\left| \eta _s\right\rangle \) and the function \(s \mapsto f(s,x)\) together determine a finite predictive model as defined above. The model’s stationary state \(\varvec{\pi }:=(\pi _s)\) is related to the stationary density matrix of the quantum model via:

$$\begin{aligned} \rho _\pi = \sum _{s} \pi _s \left| \eta _s\right\rangle \left\langle \eta _s\right| . \end{aligned}$$

The process generated by a pure-state quantum model has the length-\(\ell \) word distribution, for words \(w=x_1 \dots x_\ell \):

$$\begin{aligned} p_w^{(\ell )} := \mathrm {Tr}\left[ K^{(x_\ell )}\ldots K^{(x_1)}\rho _\pi K^{(x_1)\dagger }\ldots K^{(x_\ell )\dagger }\right] . \end{aligned}$$

The eigenvalues \(\{\lambda _i\}\) of the stationary state \(\rho _\pi \) form a distribution \(\varvec{\lambda } = \left( \lambda _i\right) \). The Rényi entropies of this distribution are the von Neumann-Rényi entropies of the state:

$$\begin{aligned} S_\alpha \left( \rho _\pi \right) = H_\alpha \left( \varvec{\lambda }\right) . \end{aligned}$$

We noted previously that for a given density matrix, these entropies are strongly minimal over the entropies of all projective, maximal measurements on the state. Given a model \(\mathfrak {M}\) with stationary state \(\rho _\pi \), we may simply write \(S_\alpha \left( \mathfrak {M}\right) := S_\alpha \left( \rho _\pi \right) \) as the Rényi memory of the model. Important limits, as before, are the topological memory\(S_0\left( \mathfrak {M}\right) \), the statistical memory\(S\left( \mathfrak {M}\right) = S_1\left( \mathfrak {M}\right) \), and the min-memory\(S_\infty \left( \mathfrak {M}\right) \), which represent physical limitations on memory storage for the generator.
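Computationally, \(\rho _\pi \) and its von Neumann–Rényi entropies follow directly from the state vectors \(\left| \eta _s\right\rangle \) and the stationary distribution \(\varvec{\pi }\) of the classical equivalent model. A minimal sketch (Python/NumPy; the function names are ours):

```python
import numpy as np

def stationary_density_matrix(etas, pi):
    """rho_pi = sum_s pi_s |eta_s><eta_s| for state vectors `etas` and weights `pi`."""
    return sum(p * np.outer(v, np.conj(v)) for p, v in zip(pi, etas))

def von_neumann_renyi(rho, alpha):
    """S_alpha(rho): the Renyi entropy of rho's eigenvalue distribution, in bits."""
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]
    if alpha == 0:
        return float(np.log2(len(lam)))
    if alpha == 1:
        return float(-np.sum(lam * np.log2(lam)))
    if alpha == np.inf:
        return float(-np.log2(lam.max()))
    return float(np.log2(np.sum(lam ** alpha)) / (1 - alpha))
```

For orthogonal states this reproduces the classical Rényi memory \(H_\alpha \left( \varvec{\pi }\right) \); any overlap strictly lowers it for \(0<\alpha <\infty \), as formalized below.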

To properly compare PSQMs and FPMs, we define the classical equivalent model of a PSQM.

Definition 7

(Classical equivalent model) Let \(\mathfrak {M} :=(\mathcal {H},\mathcal {X},\mathcal {S}, \{\left| \eta _ s\right\rangle : s\in \mathcal {S}\},\{K^{(x)}:x\in \mathcal {X}\})\) be a pure-state quantum model, with probabilities \(P_{x|s} := \left\langle \eta _ s\right| K^{(x)\dagger }K^{(x)}\left| \eta _ s\right\rangle \) and deterministic function f(s, x) such that \(K^{(x)}\left| \eta _s\right\rangle \propto \left| \eta _{f(s,x)}\right\rangle \). Its classical equivalent \(\mathfrak {M}_\mathrm {cl}=(\mathcal {S},\mathcal {X},\{\mathbf {T}^{(x)}:x\in \mathcal {X}\})\) is the classical finite predictive model with state set \(\mathcal {S}\), alphabet \(\mathcal {X}\), and symbol-based transition matrices \(\mathbf {T}^{(x)}=\big (T^{(x)}_{s'|s}\big )\) given by \(T^{(x)}_{s'|s} = P_{x|s}\delta _{s',f(s,x)}\).

Each PSQM \(\mathfrak {M}\) generates a process \(\overleftrightarrow {X}_\mathfrak {M}\), which is the same process that is generated by the classical equivalent model: \(\overleftrightarrow {X}_{\mathfrak {M}_\mathrm {cl}}=\overleftrightarrow {X}_\mathfrak {M}\).

We now prove that a classical equivalent model for a PSQM is always strongly improved in memory by said PSQM.

Theorem 4

(Strong quantum advantage) Let \(\mathfrak {M} :=(\mathcal {H},\mathcal {X},\mathcal {S}, \{\left| \eta _ s\right\rangle : s\in \mathcal {S}\},\{K^{(x)}:x\in \mathcal {X}\})\) be a pure-state quantum model with stationary state \(\rho _\pi \), and let \(\mathfrak {M}_\mathrm {cl}\) be the classical equivalent model with stationary state \(\varvec{\pi }:=(\pi _ s)\) (with \( s=1,\dots ,N\)). Let \(D:=\dim \mathcal {H}\) and \(N:= \left| \mathcal {S}\right| \). (We have \(N\ge D\): if not, then we can take a smaller Hilbert space that spans the states.) Let \(\varvec{\lambda }=(\lambda _i)\) be an N-dimensional vector where the first D components are the eigenvalues of \(\rho _\pi \) and the remaining elements are 0. Then \(\varvec{\lambda }\succsim \varvec{\pi }\).

Proof

We know that:

$$\begin{aligned} \rho _\pi&= \sum _{ s\in \mathcal {S}} \pi _ s \left| \eta _ s \right\rangle \left\langle \eta _ s\right| \\&= \sum _{ s\in \mathcal {S}} \left| \phi _ s \right\rangle \left\langle \phi _ s\right| , \end{aligned}$$

where \(\left| \phi _ s\right\rangle := \sqrt{\pi _ s}\left| \eta _ s\right\rangle \). However, we can also write \(\rho _\pi \) in the eigenbasis:

$$\begin{aligned} \rho _\pi&= \sum _{i=1}^D \lambda _i \left| i \right\rangle \left\langle i\right| \\&= \sum _{i=1}^D \left| \psi _i \right\rangle \left\langle \psi _i\right| , \end{aligned}$$

where \(\left| \psi _i\right\rangle := \sqrt{\lambda _i}\left| i\right\rangle \). Then the two sets of vectors can be related via:

$$\begin{aligned} \left| \phi _ s\right\rangle = \sum _{i=1}^D U_{ s i} \left| \psi _i\right\rangle , \end{aligned}$$

where \(U_{ s i}\) is an \(N \times D\) matrix whose D columns are orthonormal N-dimensional vectors [36]. Now, we have:

$$\begin{aligned} \pi _ s&= \left\langle \phi _ s|\phi _ s\right\rangle \\&= \sum _{i=1}^D |U_{ s i}|^2 \lambda _i . \end{aligned}$$

Note that \(U_{ s i}\) is not square, but since we have taken \(\lambda _i = 0\) for \(i> D\), we can simply extend \(U_{ s i}\) into a square unitary matrix by appending \(N-D\) further orthonormal columns. This leaves the equation unchanged. We can then write:

$$\begin{aligned} \pi _ s = \sum _{i=1}^N |U_{ s i}|^2 \lambda _i . \end{aligned}$$

Then by Theorem 1, \(\varvec{\lambda }\succsim \varvec{\pi }\). \(\square \)

It helps to recall now that majorization is a preorder, which means we could have \(\varvec{\pi }\sim \varvec{\lambda }\), in which case there would be no advantage per se. This happens when \(|U_{si}|^2\) is a permutation matrix. However, one quickly sees that this is true if and only if \(\left\{ \left| \eta _ s\right\rangle \right\} \) are orthogonal. Thus, any nonorthogonality in the quantum states automatically induces advantage.

Corollary 2

\(S_\alpha (\mathfrak {M}) \le H_\alpha \left( \mathfrak {M}_{\mathrm {cl}}\right) \) for all \(\alpha \ge 0\), with equality for \(0<\alpha <\infty \) if and only if the states \(\left\{ \left| \eta _ s\right\rangle \right\} \) of \(\mathfrak {M}\) are orthogonal.

As in the classical case, it immediately follows from this that not only is the classical equivalent model improved upon by its corresponding PSQM in terms of \(S_0(\mathfrak {M})\) and \(S(\mathfrak {M})\) (as was previously known in certain cases), but it is improved in all Rényi memories \(S_\alpha (\mathfrak {M})\).

Many alternative pure-state quantum models may describe the same process. The “first mark”, so to speak, for quantum models is the q-machine, which directly embeds the dynamics of the \(\epsilon \text{-machine }\) into a quantum system while already leveraging the memory advantage due to quantum state overlap. The notion of the q-machine originates in Ref. [8], and its definition was further refined in Refs. [9, 15]. We use an equivalent definition, first introduced in Ref. [34], though there a unitary formalism is used instead of Kraus operators.

Definition 8

(q-Machine) Given an \(\epsilon \text{-machine }\) \(\mathfrak {M}:=\left( \mathcal {S},\mathcal {X},\{\mathbf {T}^{(x)}:x\in \mathcal {X}\}\right) \), where \({T}^{(x)}_{ s'| s}:={P}_{x| s} \delta _{ s',f( s,x)}\) for some deterministic function f(s, x), construct the corresponding q-machine in the following way:

  1. The states \(\left| \eta _ s\right\rangle \) are built to satisfy the recursive relation:

     $$\begin{aligned} \left\langle \eta _ s|\eta _{ s'}\right\rangle = \sum _{x\in \mathcal {X}}\sqrt{{P}_{x| s}{P}_{x| s'}}\left\langle \eta _{f( s,x)}|\eta _{f( s',x)}\right\rangle . \end{aligned}$$

  2. \(\mathcal {H}\) is the space spanned by the states \(\left| \eta _ s\right\rangle \).

  3. The Kraus operators \(K^{(x)}\) are determined by the relations:

     $$\begin{aligned} K^{(x)}\left| \eta _ s\right\rangle = \sqrt{{P}_{x| s}}\left| \eta _{f( s,x)}\right\rangle . \end{aligned}$$

Then Corollary 2 can be applied here. The q-machine is matched in memory by the \(\epsilon \text{-machine }\) when and only when the states \(\left| \eta _s\right\rangle \) are orthogonal, \({\langle {\eta _s|\eta _{s'}}\rangle }=\delta _{ss'}\). The recursive relation becomes:

$$\begin{aligned} \delta _{ss'} = \sum _{x\in \mathcal {X}}\sqrt{{P}_{x| s}{P}_{x| s'}}\delta _{f( s,x)f(s',x)} . \end{aligned}$$

This holds if and only if \(\delta _{f( s,x)f(s',x)}=\delta _{ss'}\) for all x satisfying \({P}_{x| s},{P}_{x| s'}>0\). This constrains the structure of the \(\epsilon \text{-machine }\): two distinct states s and \(s'\) cannot map to the same state on the same symbol. In other words, given a state and an incoming symbol, the previous state must be determined. Such a structure is called co-unifilar [37]. Examples of co-unifilar machines are shown in Fig. 5a and c.

To be clear, then, the q-machine offers strict advantage over any \(\epsilon \text{-machine }\) which is not co-unifilar and matches the \(\epsilon \text{-machine }\) when it is co-unifilar. That the q-machine offers statistical memory advantage with respect to the \(\epsilon \text{-machine }\) was previously shown in Ref. [9] and with respect to topological memory in Ref. [14]. Theorem 4 implies those results as well as advantage with respect to all Rényi measures of memory.

One can check that the q-machine satisfies the completeness relations and has the correct probability dynamics for the process generated by the \(\epsilon \text{-machine }\).
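One concrete way to carry out the construction numerically (a sketch under our own conventions, not the unitary formalism of Ref. [34]) is to iterate the recursion of Definition 8 on the Gram matrix \(\left\langle \eta _s|\eta _{s'}\right\rangle \) until it stabilizes and then extract explicit state vectors from a factorization of that Gram matrix. The example machine at the end is hypothetical.

```python
import numpy as np

def q_machine_states(P, f, n_iter=200):
    """Iterate Definition 8's recursion for G[s, t] = <eta_s | eta_t>, starting from
    the identity, and return explicit state vectors as the rows of a matrix V with
    V V^dagger = G.  P[s][x] are emission probabilities, f[s][x] unifilar successors."""
    states = sorted(P)
    idx = {s: k for k, s in enumerate(states)}
    G = np.eye(len(states))
    for _ in range(n_iter):
        G_new = np.eye(len(states))
        for s in states:
            for t in states:
                G_new[idx[s], idx[t]] = sum(
                    np.sqrt(P[s][x] * P[t][x]) * G[idx[f[s][x]], idx[f[t][x]]]
                    for x in P[s] if x in P[t])
        G = G_new
    w, U = np.linalg.eigh(G)                       # G is symmetric positive semidefinite
    V = U @ np.diag(np.sqrt(np.clip(w, 0, None)))  # rows of V realize the states
    return states, V

# Hypothetical two-state epsilon-machine (illustrative; not taken from the text).
P = {"A": {"0": 0.5, "1": 0.5}, "B": {"1": 1.0}}
f = {"A": {"0": "B", "1": "A"}, "B": {"1": "A"}}
states, V = q_machine_states(P, f)
print(np.round(V @ V.conj().T, 3))   # Gram matrix of the q-machine states
```

Iterating from the identity converges quickly in practice; the fixed cutoff n_iter is a crude choice for this sketch.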

6 Weak Quantum Minimality

An open problem is to determine the minimal quantum pure-state representation of a given classical process. This problem is solved in some specific instances such as the Ising model [11] and the Perturbed Coin Process [14]. In these cases it is known to be the q-machine. We denote the smallest value of the Rényi entropy of the stationary state as:

$$\begin{aligned} C_{q}^{(\alpha )}\left( \overleftrightarrow {X}\right) := \min _{\mathfrak {M}:\overleftrightarrow {X}_\mathfrak {M}=\overleftrightarrow {X}} S_\alpha \left( \mathfrak {M}\right) , \end{aligned}$$

called the quantum Rényi complexities, including the limits, the quantum topological complexity \(C_{q}^{(0)}\), the quantum min-complexity \(C_{q}^{(\infty )}\), and the quantum statistical complexity \(C_{q}:=C_{q}^{(1)}\).

If a strongly minimal quantum pure-state model exists, these complexities are all attained by the same pure-state model. Our primary result here is that there are processes for which this does not occur.

We start by examining two examples. The first, the MBW process introduced in Ref. [35], yields a q-machine that is not minimal in the von Neumann (statistical) memory. Consider the process generated by the 4-state MBW machine shown in Fig. 8.

Fig. 8: The 4-state MBW process as a Markov chain (which is the \(\epsilon \text{-machine }\))

This process’ HMM is simply a Markov chain, and its representation in Fig. 8 is its \(\epsilon \text{-machine }\). Denote this classical representation by \(\mathfrak {M}_4\). If we take \(\{ \left| A\right\rangle , \left| B\right\rangle , \left| C\right\rangle , \left| D\right\rangle \}\) as an orthonormal basis of a Hilbert space, we can construct the q-machine with the states:

$$\begin{aligned} \left| \eta _A\right\rangle&:= \frac{1}{\sqrt{2}}\left| A\right\rangle + \frac{1}{2}\left( \left| C\right\rangle +\left| D\right\rangle \right) , \\ \left| \eta _B\right\rangle&:= \frac{1}{\sqrt{2}}\left| B\right\rangle + \frac{1}{2}\left( \left| C\right\rangle +\left| D\right\rangle \right) , \\ \left| \eta _C\right\rangle&:= \frac{1}{\sqrt{2}}\left| C\right\rangle + \frac{1}{2}\left( \left| A\right\rangle +\left| B\right\rangle \right) , ~\text {and}\\ \left| \eta _D\right\rangle&:= \frac{1}{\sqrt{2}}\left| D\right\rangle + \frac{1}{2}\left( \left| A\right\rangle +\left| B\right\rangle \right) . \end{aligned}$$

Since it is a Markov chain, we can write the Kraus operators as \(K_x := \left| \eta _x \right\rangle \left\langle \epsilon _x\right| \), where \(\left\langle \epsilon _x|\eta _{x'}\right\rangle \propto \sqrt{P_{x|x'}}\). This is a special case of the construction used in Ref. [13]. For q-machines of Markov chains, then, the dual basis is just \(\left\langle \epsilon _x\right| = \left\langle x\right| \). We denote the q-machine model of the 4-state MBW process as \(\mathfrak {Q}_4\).

Let’s examine the majorization between \(\mathfrak {Q}_4\) and the Markov model via the Lorenz curves of \(\varvec{\lambda }\), the eigenvalues of \(\rho _\pi \), and the stationary state of the Markov chain. See Fig. 9.

Fig. 9: Lorenz curves for the 4-state MBW \(\epsilon \text{-machine }\) \(\mathfrak {M}_4\) and the associated q-machine \(\mathfrak {Q}_4\)

It turns out that there is a smaller quantum model embedded in two dimensions, with states:

$$\begin{aligned} \left| \eta _A'\right\rangle&:= \left| 0\right\rangle , \\ \left| \eta _B'\right\rangle&:= \left| 1\right\rangle , \\ \left| \eta _C'\right\rangle&:= \frac{1}{\sqrt{2}}\left( \left| 0\right\rangle +\left| 1\right\rangle \right) , ~\text {and}\\ \left| \eta _D'\right\rangle&:= \frac{1}{\sqrt{2}}\left( \left| 0\right\rangle -\left| 1\right\rangle \right) . \end{aligned}$$

In this case, the Kraus operators are built as before, now with dual vectors \(\left\langle \epsilon _x'\right| =\frac{1}{\sqrt{2}}\left\langle \eta _x'\right| \). This gives the proper transition probabilities for the 4-state MBW model. We denote this dimensionally-smaller model \(\mathfrak {D}_4\). Figure 10 compares the Lorenz curve of its stationary eigenvalues \(\varvec{\lambda }'\) to those of \(\mathfrak {Q}_4\). One sees that it does not majorize the q-machine, but it does have a lower statistical memory: \(S(\mathfrak {D}_4) = 1.0\) bit and \(S(\mathfrak {Q}_4) \approx 1.2\) bits. (On the other hand, the q-machine has the smaller min-memory, with \(S_\infty (\mathfrak {D}_4) = 1.0\) bit and \(S_\infty (\mathfrak {Q}_4) \approx 0.46\) bits.)
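These values can be reproduced in a few lines (Python/NumPy, illustrative). The stationary distribution of the chain in Fig. 8 is uniform by symmetry, so in both models \(\rho _\pi \) is the equal-weight mixture of the four states.

```python
import numpy as np

s2 = 1 / np.sqrt(2)
# q-machine states of the 4-state MBW process in the orthonormal basis {|A>,|B>,|C>,|D>}:
Q4 = [np.array([s2, 0, .5, .5]), np.array([0, s2, .5, .5]),
      np.array([.5, .5, s2, 0]), np.array([.5, .5, 0, s2])]
# Two-dimensional model D4:
D4 = [np.array([1., 0.]), np.array([0., 1.]),
      np.array([s2, s2]), np.array([s2, -s2])]

pi = np.full(4, 0.25)                              # uniform stationary distribution

def rho(etas, pi):
    return sum(p * np.outer(v, v) for p, v in zip(pi, etas))

def entropy(rho_, alpha):
    lam = np.linalg.eigvalsh(rho_)
    lam = lam[lam > 1e-12]
    if alpha == np.inf:
        return float(-np.log2(lam.max()))          # min-memory S_infinity
    return float(-np.sum(lam * np.log2(lam)))      # statistical memory S

for name, etas in (("Q4", Q4), ("D4", D4)):
    r = rho(etas, pi)
    print(name, round(entropy(r, 1), 2), round(entropy(r, np.inf), 2))
# Q4 1.2 0.46   and   D4 1.0 1.0, matching the values quoted above.
```

The same computation applied to the 3-state states given below reproduces \(S(\mathfrak {Q}_3) \approx 0.61\) and \(S(\mathfrak {D}_3) = 1.0\).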

Fig. 10: Lorenz curves for the 4-state MBW q-machine \(\mathfrak {Q}_4\) and a dimensionally-smaller model \(\mathfrak {D}_4\)

Now consider something in the opposite direction. Consider the 3-state MBW model, denoted \(\mathfrak {M}_3\) and displayed in Fig. 11. This is a generalization of the previous example to three states instead of four. We will compute the corresponding q-machine \(\mathfrak {Q}_3\) and show that there also exists a dimensionally-smaller representation \(\mathfrak {D}_3\). In this case, however, \(\mathfrak {D}_3\) is not smaller in its statistical memory.

The q-machine \(\mathfrak {Q}_3\) of this Markov chain is given by the states:

$$\begin{aligned} \left| \eta _A\right\rangle&:= \sqrt{\frac{2}{3}}\left| A\right\rangle + \frac{1}{\sqrt{6}}\left( \left| B\right\rangle +\left| C\right\rangle \right) , \\ \left| \eta _B\right\rangle&:= \sqrt{\frac{2}{3}}\left| B\right\rangle + \frac{1}{\sqrt{6}}\left( \left| A\right\rangle +\left| C\right\rangle \right) , ~\text {and}\\ \left| \eta _C\right\rangle&:= \sqrt{\frac{2}{3}}\left| C\right\rangle + \frac{1}{\sqrt{6}}\left( \left| A\right\rangle +\left| B\right\rangle \right) , \end{aligned}$$

and Kraus operators defined similarly to before. We can examine the majorization between the q-machine and the Markov model by plotting the Lorenz curves of \(\varvec{\lambda }\), the eigenvalues of \(\rho _\pi \), and the stationary state of the Markov chain, shown in Fig. 12.

Fig. 11: The 3-state MBW process as a Markov chain (which is the process’ \(\epsilon \text{-machine }\))

Fig. 12: Lorenz curves for the 3-state MBW \(\epsilon \text{-machine }\) \(\mathfrak {M}_3\) and the associated q-machine \(\mathfrak {Q}_3\)

The lower-dimensional model \(\mathfrak {D}_3\) is given by the states:

$$\begin{aligned} \left| \eta _A'\right\rangle&:= \left| 0\right\rangle , \\ \left| \eta _B'\right\rangle&:= \frac{1}{2}\left| 0\right\rangle +\frac{\sqrt{3}}{2}\left| 1\right\rangle , ~\text {and}\\ \left| \eta _C'\right\rangle&:= \frac{1}{2}\left| 0\right\rangle -\frac{\sqrt{3}}{2}\left| 1\right\rangle , \end{aligned}$$

with \(\left\langle \epsilon _x'\right| =\sqrt{\frac{2}{3}}\left\langle \eta _x'\right| \). This gives the proper transition probabilities for the 3-state MBW model. Figure 13 compares the Lorenz curve of its stationary eigenvalues \(\varvec{\lambda }'\) to that of \(\mathfrak {Q}_3\). We see that it does not majorize \(\mathfrak {Q}_3\). And, this time, this is directly manifested by the fact that the smaller-dimension model has a larger entropy: \(S(\mathfrak {D}_3) = 1.0\) and \(S(\mathfrak {Q}_3) \approx 0.61\) bit.

Fig. 13: Lorenz curves for the 3-state MBW q-machine \(\mathfrak {Q}_3\) and a dimensionally-smaller model \(\mathfrak {D}_3\)

After seeing the \(\epsilon \text{-machine }\) ’s strong minimality with respect to other classical models and its strong maximality with respect to quantum models, it is certainly tempting to conjecture that a strongly minimal quantum model exists. However, the examples we just explored cast serious doubt. None of the examples covered above are strong minima.

One way to prove that no strong minimum exists for, say, the 3-state MBW process requires showing that there does not exist any other quantum model in 2 dimensions that generates the process. Since any model that majorizes \(\mathfrak {D}_3\) must have a rank-2 stationary state, and hence be two-dimensional, this would imply that no other model can majorize \(\mathfrak {D}_3\). Since \(\mathfrak {D}_3\) itself is not strongly minimal, no strongly minimal solution can exist.

Appendix 1 proves exactly this—thus, demonstrating a counterexample to the strong minimality of quantum models.

Counterexample

(Weak Minimality of \(\mathfrak {D}_3\)) The quantum model \(\mathfrak {D}_3\) weakly minimizes the topological complexity among all quantum generators of the 3-state MBW process; consequently, the 3-state MBW process has no strongly minimal quantum model.

Fig. 14: Proposed majorization saddle structure of model-space: the \(\epsilon \text{-machine }\) (labeled \(\epsilon \)) is located at a saddle-point with respect to majorization, where classical deviations (state-splitting) move up the lattice and quantum deviations (utilizing state overlap) move down the lattice

7 Concluding Remarks

Majorization provides a means to compare a process’ alternative models in both the classical and quantum regimes. When it holds, majorization implies the simultaneous minimization of a large host of functions. As a result we showed that:

  1. The \(\epsilon \text{-machine }\) majorizes all classical predictive models of the same process and so simultaneously minimizes many different measures of memory cost.

  2. The q-machine, and indeed any quantum realization of the \(\epsilon \text{-machine }\), always majorizes the \(\epsilon \text{-machine }\), and so simultaneously improves on all the measures of memory cost.

  3. For at least one process, there does not exist any quantum pure-state model that majorizes all quantum pure-state models of that process. Thus, while an \(\epsilon \)-machine may be improved upon by different possible quantum models, there is no unique quantum model that is unambiguously the “best” choice.

Imagining the \(\epsilon \text{-machine }\) as an invariant “saddle-point” in the majorization structure of model-space, Fig. 14 depicts the implied geometry. That is, we see that despite its nonminimality among all models, the \(\epsilon \text{-machine }\) still occupies a topologically important position in model-space—one that is invariant to one’s choice of memory measure. However, no similar model plays the topologically minimal role for quantum pure-state models.

The quantum statistical complexity \(C_q\) has been offered up as an alternative quantum measure of structural complexity—a rival of the statistical complexity \(C_\mu \) [38]. One implication of our results here is that the nature of this quantum minimum \(C_q\) is fundamentally different than that of \(C_\mu \). This observation should help further explorations into techniques required to compute \(C_q\) and the physical circumstances in which it is most relevant.

That the physical meaning of \(C_q\) involves generating an asymptotically large number of realizations of a process may imply that it cannot be accurately computed by only considering machines that generate a single realization. This is in contrast to \(C_\mu \) which, being strongly minimized, must be attainable in the single-shot regime along with measures like \(C_\mu ^{(0)}\) and \(C_\mu ^{(\infty )}\).

In this way, the quantum realm again appears ambiguous. Ambiguity in structural complexity has been previously observed in the sense that there exist pairs of processes, \(\overleftrightarrow {X}\) and \(\overleftrightarrow {Y}\), such that \(C_\mu \big (\overleftrightarrow {X}\big )>C_\mu \big (\overleftrightarrow {Y}\big )\) but \(C_q\big (\overleftrightarrow {X}\big )<C_q\big (\overleftrightarrow {Y}\big )\) [39]. The classical and quantum paradigms for modeling can disagree on simplicity—there is no universal Ockham’s Razor. How this result relates to strong versus weak optimization deserves further investigation.

The methods and results here should also be extended to analyze classical generative models which, in many ways, bear resemblances in their functionality to the quantum models [40,41,42]. These drop the requirement of unifilarity, similar to how the quantum models relax the notion of orthogonality. Important questions to pursue in this vein are whether generative models are strongly maximized by the \(\epsilon \text{-machine }\) and whether they have their own strong minimum or, like the quantum models, only weak minima in different contexts.

We also only explored finite-state, discrete-time processes. Processes with infinite memory [43] and continuous generation [44, 45] are also common in nature. Applying our results to understand these requires further mathematical development.

We close by noting that we have committed a sleight of hand here, using the tools of resource theory to study CM and, particularly, memory in stochastic processes. This is still far from formulating a resource theory of memory. It is also far from applying the memoryful logic of CM to extend resource theory, which often studies memoryless collections of resources, in which there is no temporal structure. Both of these directions will be developed elsewhere and, in doing so, will likely shed significant light on the above questions.