
1 Introduction

Uniformity testing is the flagship problem of the modern distribution testing [1] research program. From n independent observations sampled from an unknown distribution \(\mu \) on a finite space \(\mathcal {X}\), the goal is to distinguish between the two cases where \(\mu \) is uniform and where \(\mu \) is \(\varepsilon \)-far from uniform with respect to some notion of distance. The complexity of this problem in total variation is known [12] to be of the order of \(\tilde{\varTheta }(\sqrt{\left| \mathcal {X} \right| }/\varepsilon ^2)\), which compares favorably with the linear dependency in \(\left| \mathcal {X} \right| \) required for estimating the distribution to precision \(\varepsilon \) [17]. Interestingly, the uniform distribution can be replaced by an arbitrary reference at the same statistical cost. In fact, Goldreich [7] proved that the latter problem formally reduces to the former. Inspired by his approach, we seek and obtain a reduction result in the much less understood and more challenging Markovian setting.

Informal Markovian Problem Statement — The scientist is given the full description of a reference transition matrix \(\overline{P}\) and a single Markov chain \(X_1^n\) sampled with respect to some unknown transition operator P and arbitrary initial distribution. For a fixed proximity parameter \(\varepsilon > 0\), the goal is to design an algorithm that distinguishes between the two cases \(P = \overline{P}\) and \(K(P,\overline{P}) > \varepsilon \), with high probability, where K is a contrast function between stochastic matrices.

Related Work — Under the contrast function (1) described in Sect. 2, and the hypothesis that P and \(\overline{P}\) are both irreducible and symmetric over a finite space \(\mathcal {X}\), a first tester with sample complexity \(\tilde{\mathcal {O}}(\left| \mathcal {X} \right| /\varepsilon + h)\), where h [4, Definition 3] is the hitting time of the reference chain, and a lower bound in \(\varOmega (\left| \mathcal {X} \right| /\varepsilon )\), were obtained in [4]. In [3], a graph partitioning algorithm delivers, under the same symmetry assumption, a testing procedure with sample complexity \(\mathcal {O}(\left| \mathcal {X} \right| /\varepsilon ^4)\), i.e. independent of hitting properties. More recently, [6] relaxed the symmetry requirement, replacing it with a more natural reversibility assumption. The algorithm therein has a sample complexity of \(\mathcal {O}(1/(\overline{\pi }_\star \varepsilon ^4))\), where \(\overline{\pi }_\star \) is the minimum stationary probability of the reference \(\overline{P}\), gracefully recovering [3] under symmetry. In parallel, [18] started and [2] complemented the research program of inspecting the problem under the infinity norm for matrices, and derived nearly minimax-optimal bounds.

Contribution — We show how to mostly recover [6] from [3] under additional assumptions (see Sect. 3), with a technique based on a geometry preserving embedding. We obtain a more economical proof than [6], which went through the process of re-deriving a graph partitioning algorithm for the reversible case. Furthermore, the impact of our approach, by its generality, stretches beyond the task at hand and is also applicable to related inference problems (see Remark 2).

2 Preliminaries

We let \(\mathcal {X}, \mathcal {Y}\) be finite sets, and denote by \(\mathcal {P}(\mathcal {X})\) the set of all probability distributions over \(\mathcal {X}\). All vectors are written as row vectors. For matrices A and B, \(\rho (A)\) is the spectral radius of A, \(A \circ B\) is the Hadamard (entrywise) product of A and B defined by \((A \circ B)(x,x') = A(x,x')B(x,x')\), and \(A^{\circ 1/2}\) is the entrywise square root, \(A^{\circ 1/2}(x,x') = \sqrt{A(x,x')}\). For \(n \in \mathbb {N}\), we use the compact notation \(x_1^n = (x_1, \dots , x_n)\). \(\mathcal {W}(\mathcal {X})\) is the set of all row-stochastic matrices over the state space \(\mathcal {X}\), and \(\pi \) is called a stationary distribution for \(P \in \mathcal {W}(\mathcal {X})\) when \(\pi P = \pi \).

Irreducibility and Reversibility — Let \((\mathcal {X}, \mathcal {D})\) be a digraph with vertex set \(\mathcal {X}\) and edge-set \(\mathcal {D}\subset \mathcal {X}^2\). When \((\mathcal {X}, \mathcal {D})\) is strongly connected, a Markov chain with connection graph \((\mathcal {X}, \mathcal {D})\) is said to be irreducible. We write \(\mathcal {W}(\mathcal {X}, \mathcal {D})\) for the set of irreducible stochastic matrices over \((\mathcal {X}, \mathcal {D})\). When \(P \in \mathcal {W}(\mathcal {X}, \mathcal {D})\), \(\pi \) is unique and we denote \(\pi _\star \doteq \min _{x \in \mathcal {X}} \pi (x) > 0\) the minimum stationary probability. When P satisfies the detailed-balance equation \(\pi (x)P(x,x') = \pi (x') P(x',x)\) for any \((x,x') \in \mathcal {D}\), we say that P is reversible.

Lumpability — In contradistinction with the distribution setting, merging symbols in a Markov chain may break the Markov property, resulting in a hidden Markov model. For \(P \in \mathcal {W}(\mathcal {Y}, \mathcal {E})\) and a surjective map \(\kappa :\mathcal {Y}\rightarrow \mathcal {X}\) merging elements of \(\mathcal {Y}\) together, we say that P is \(\kappa \)-lumpable [10] when the output process still defines a Markov chain. Introducing \(\mathcal {S}_x = \kappa ^{-1}(\left\{ x \right\} )\) for the collection of symbols that merge into \(x \in \mathcal {X}\), lumpability was characterized by [10, Theorem 6.3.2] as follows: P is \(\kappa \)-lumpable when, for any \(x,x' \in \mathcal {X}\) and \(y_1,y_2 \in \mathcal {S}_x\), it holds that

$$P(y_1, \mathcal {S}_{x'}) = P(y_2, \mathcal {S}_{x'}).$$

The lumped transition matrix \(\kappa _\star P \in \mathcal {W}(\mathcal {X}, \kappa _2(\mathcal {E}))\), with lumped edge set

$$\kappa _2(\mathcal {E}) \doteq \left\{ (x,x') \in \mathcal {X}^2 :\exists (y,y') \in \mathcal {E}, (\kappa (y), \kappa (y')) = (x,x') \right\} ,$$

is then given by

$$\kappa _\star P(x,x') = P(y,\mathcal {S}_{x'}), \text { for some } y \in \mathcal {S}_{x}.$$
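The lumpability criterion and the lumped matrix can be computed directly. The following Python sketch (our own illustration; the function `lump`, its argument names, and the numerical tolerance are not from the paper) checks the condition of [10, Theorem 6.3.2] and returns \(\kappa _\star P\):

```python
import numpy as np

def lump(P, kappa, n_lumped, atol=1e-10):
    """Check kappa-lumpability of a row-stochastic P and return kappa_* P.

    kappa[y] = x assigns each fine state y to its lumped state x.
    Raises ValueError when the lumpability criterion fails.
    """
    P, kappa = np.asarray(P, float), np.asarray(kappa)
    # R[y, x'] = P(y, S_{x'}): aggregate the columns of each block S_{x'}
    R = np.stack([P[:, kappa == xp].sum(axis=1) for xp in range(n_lumped)], axis=1)
    Q = np.zeros((n_lumped, n_lumped))
    for x in range(n_lumped):
        rows = R[kappa == x]          # the rows P(y, .) for y in S_x must agree
        if not np.allclose(rows, rows[0], atol=atol):
            raise ValueError(f"P is not kappa-lumpable on block {x}")
        Q[x] = rows[0]                # kappa_* P(x, x') = P(y, S_{x'}) for any y in S_x
    return Q
```

For instance, with \(\kappa = (0,0,1,1)\) on four states, any matrix whose within-block rows have equal block sums is lumpable to a \(2 \times 2\) matrix.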

Contrast Function — We consider the following notion of discrepancy between two stochastic matrices \(P, P' \in \mathcal {W}(\mathcal {X})\),

$$\begin{aligned} K(P,P') \doteq 1 - \rho \left( P^{\circ 1/2} \circ P'^{\circ 1/2} \right) . \end{aligned}$$
(1)

Although K made its first appearance in [4] in the context of Markov chain identity testing, its inception can be traced back to Kazakos [9]. K is directly related to the Rényi divergence of order 1/2, and asymptotically connected to the Bhattacharyya/Hellinger distance between trajectories (see e.g. the proof of Lemma 2). It is instructive to observe that K vanishes on chains that share an identical strongly connected component and does not satisfy the triangle inequality for reducible matrices; hence it is not a proper metric on \(\mathcal {W}(\mathcal {X})\) [4, p.10, footnote 13]. Some additional properties of K of possible interest are listed in [6, Section 7].
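As a quick illustration (ours, not from the paper), (1) can be evaluated numerically in a few lines; note that \(K(P,P) = 0\) for any stochastic P, since the Perron root of a stochastic matrix is 1:

```python
import numpy as np

def contrast(P, Pp):
    """K(P, P') = 1 - rho(P^{o1/2} o P'^{o1/2}), as in Eq. (1)."""
    A = np.sqrt(np.asarray(P, float) * np.asarray(Pp, float))  # entrywise product and root
    return 1.0 - np.max(np.abs(np.linalg.eigvals(A)))
```

The spectral radius is computed here by brute force over all eigenvalues; for the nonnegative matrices at hand, power iteration would also do.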

Reduction Approach for Identity Testing of Distributions — Problem reduction is ubiquitous in the property testing literature. Our work takes inspiration from [7], who introduced two so-called “stochastic filters” in order to show how in the distribution setting, identity testing was reducible to uniformity testing, thereby recovering the known complexity of \(\mathcal {O}(\sqrt{\left| \mathcal {X} \right| }/\varepsilon ^2)\) obtained more directly by [14]. Notable works also include [5], who reduced a collection of distribution testing problems to \(\ell _2\)-identity testing.

3 The Restricted Identity Testing Problem

Let \(\mathcal {V}_{\textsf{test}}\subset \mathcal {W}(\mathcal {X})\) be a class of stochastic matrices of interest, and let \(\overline{P} \in \mathcal {V}_{\textsf{test}}\) be a fixed reference. The identity testing problem consists in determining, with high probability, from a single stream of observations \(X_1^n = X_1, \dots , X_n\) drawn according to a transition matrix \(P \in \mathcal {V}_{\textsf{test}}\), whether

$$P \in \mathcal {H}_0 \doteq \left\{ \overline{P} \right\} , \text { or } P \in \mathcal {H}_1(\varepsilon ) \doteq \left\{ P \in \mathcal {V}_{\textsf{test}}:K(P, \overline{P}) > \varepsilon \right\} .$$

We note the presence of an exclusion region, and regard the problem as a Bayesian testing problem with a prior that is uniform over the two hypothesis classes \(\mathcal {H}_0\) and \(\mathcal {H}_1(\varepsilon )\) and vanishes on the exclusion region. Casting our problem in the minimax framework, the worst-case error probability \(e_n(\phi , \varepsilon )\) of a given test \(\phi :\mathcal {X}^n \rightarrow \left\{ 0, 1 \right\} \) is defined as

$$\begin{aligned} 2 e_n(\phi , \varepsilon ) \doteq \mathbb {P}_{X_1^n \sim \overline{\pi }, \overline{P}}\left( \phi (X_1^n) = 1 \right) + \sup _{P \in \mathcal {H}_1(\varepsilon )} \mathbb {P}_{X_1^n \sim \pi , P}\left( \phi (X_1^n) = 0 \right) . \end{aligned}$$

We subsequently define the minimax risk \(\mathcal {R}_n(\varepsilon )\) as,

$$\begin{aligned} \mathcal {R}_n(\varepsilon ) \doteq \min _{\phi :\mathcal {X}^n \rightarrow \left\{ 0, 1 \right\} } e_n(\phi , \varepsilon ), \end{aligned}$$

where the minimum is taken over all —possibly randomized— testing procedures. For a confidence parameter \(\delta \), the sample complexity is

$$n_\star (\varepsilon , \delta ) \doteq \min \left\{ n \in \mathbb {N}:\mathcal {R}_n(\varepsilon ) < \delta \right\} .$$

We briefly recall the assumptions made in [6]. For \(P \in \mathcal {V}_{\textsf{test}}\) and \(\overline{P} \in \mathcal {H}_0\):

(A.1):

P and \(\overline{P}\) are irreducible and reversible.

(A.2):

P and \(\overline{P}\) share the same stationary distribution \(\overline{\pi } = \pi \).

The following additional assumptions will make our approach readily applicable.

(B.1):

P and \(\overline{P}\) share the same connection graph: \(P, \overline{P} \in \mathcal {W}(\mathcal {X}, \mathcal {D})\).

(B.2):

The common stationary distribution is rational, \(\overline{\pi } \in \mathbb {Q}^\mathcal {X}\).

Remark 1

A sufficient condition for \(\overline{\pi } \in \mathbb {Q}^\mathcal {X}\) is \(\overline{P}(x,x') \in \mathbb {Q}\) for any \(x,x' \in \mathcal {X}\).

Without loss of generality, we express \(\overline{\pi } = \left( p_1, p_2, \dots , p_{\left| \mathcal {X} \right| } \right) / \varDelta \), for some \(\varDelta \in \mathbb {N}\) and \(p \in \mathbb {N}^{\left| \mathcal {X} \right| }\) with \(0< p_1 \le p_2 \le \dots \le p_{\left| \mathcal {X} \right| } < \varDelta \). We henceforth denote by \(\mathcal {V}_{\textsf{test}}\) the subset of stochastic matrices that verify assumptions (A.1), (A.2), (B.1) and (B.2) with respect to the fixed \(\overline{\pi } \in \mathcal {P}(\mathcal {X})\). Our below-stated theorem provides an upper bound on the sample complexity \(n_\star (\varepsilon , \delta )\) in \(\widetilde{\mathcal {O}}(1/(\overline{\pi }_\star \varepsilon ^4))\).
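The normal form \(\overline{\pi } = p/\varDelta \) can be computed with exact rational arithmetic. The sketch below (a hypothetical helper of ours, not from the paper) takes the entries of \(\overline{\pi }\) as fractions and returns the sorted integer weights p together with the common denominator \(\varDelta \):

```python
from fractions import Fraction
from math import lcm

def integer_representation(pi_bar):
    """Express a rational distribution as pi_bar = (p_1, ..., p_k) / Delta
    with 0 < p_1 <= ... <= p_k < Delta, as in Sect. 3."""
    fracs = sorted(Fraction(x) for x in pi_bar)      # sort ascending, exactly
    delta = lcm(*(f.denominator for f in fracs))     # smallest common denominator
    p = [int(f * delta) for f in fracs]
    assert sum(p) == delta, "input must sum to one"
    return p, delta
```

For example, `integer_representation(["1/2", "1/3", "1/6"])` returns `([1, 2, 3], 6)`. Passing the entries as strings or `Fraction`s (rather than floats) keeps the computation exact, in line with assumption (B.2).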

Theorem 1

Let \(\varepsilon , \delta \in (0,1)\) and let \(\overline{P} \in \mathcal {V}_{\textsf{test}}\subset \mathcal {W}(\mathcal {X}, \mathcal {D})\). There exists a randomized testing procedure \(\phi :\mathcal {X}^n \rightarrow \left\{ 0,1 \right\} \), with \(n = \tilde{\mathcal {O}}(1/(\overline{\pi }_\star \varepsilon ^4))\), such that the following holds. For any \(P \in \mathcal {V}_{\textsf{test}}\) and \(X_1^n\) sampled according to P, \(\phi \) distinguishes between the cases \(P = \overline{P}\) and \(K(P, \overline{P}) > \varepsilon \) with error probability less than \(\delta \).

Proof (sketch)

Our strategy can be broken down into two steps. First, we employ a transformation on Markov chains, termed Markov embedding [20], in order to symmetrize both the reference chain (algebraically, by computing the new transition matrix) and the unknown chain (operationally, by simulating an embedded trajectory). Crucially, our transformation preserves the contrast between two chains and their embedded version (Lemma 2). Second, we invoke the known tester [3] for symmetric chains as a black box and report its output. The proof is deferred to Sect. 6.

Remark 2

Our reduction approach has applicability beyond recovery of the sample complexity of [6], for instance in the tolerant testing setting, where the two competing hypotheses are

$$K(P, \overline{P}) < \varepsilon /2 \text { and } K(P, \overline{P}) > \varepsilon .$$

Even in the symmetric setting, this problem remains open. Our technique shows that future work can focus on solving the problem under a symmetry assumption, as we provide a natural extension to the reversible class.

4 Symmetrization of Reversible Markov Chains

Information geometry — Our construction and notation follow [11], who established a dually-flat structure

$$(\mathcal {W}(\mathcal {X}, \mathcal {D}), \mathfrak {g}, \nabla ^{(e)}, \nabla ^{(m)})$$

on the space of irreducible stochastic matrices, where \(\mathfrak {g}\) is a Riemannian metric, and \(\nabla ^{(e)}, \nabla ^{(m)}\) are dual affine (exponential and mixture) connections. Introducing a model \(\mathcal {V}= \left\{ P_\theta : \theta \in \varTheta \subset \mathbb {R}^d \right\} \subset \mathcal {W}(\mathcal {X}, \mathcal {D})\), we write \(P_\theta \in \mathcal {V}\) for the transition matrix at coordinates \(\theta = (\theta ^1, \dots , \theta ^d)\), where d is the manifold dimension of \(\mathcal {V}\). Using the shorthand \(\partial _i \cdot \doteq \partial \cdot /\partial \theta ^i\), the Fisher metric is expressed [11, (9)] in the chart induced basis \((\partial _i)_{i \in [d]}\) as

$$\begin{aligned} \mathfrak {g}_{ij}(\theta ) = \sum _{(x,x') \in \mathcal {D}} \pi _\theta (x) P_\theta (x,x') \partial _i \log P_\theta (x,x') \partial _j \log P_\theta (x,x'), \text { for } i,j \in [d]. \end{aligned}$$
(2)

Following this formalism, it is possible to define mixture families (m-families) and exponential families (e-families) of stochastic matrices [8, 11].

Example 1

The class \(\mathcal {W}_{\textsf{rev}}(\mathcal {X}, \mathcal {D})\) of reversible Markov chains irreducible over a connection graph \((\mathcal {X}, \mathcal {D})\) forms both an e-family and an m-family of dimension

$$\dim \mathcal {W}_{\textsf{rev}}(\mathcal {X}, \mathcal {D}) = \frac{\left| \mathcal {D} \right| + \left| \ell (\mathcal {D}) \right| }{2} - 1,$$

where \(\ell (\mathcal {D}) \subset \mathcal {D}\) is the set of loops present in the connection graph [19, Theorems 3, 5].
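The dimension formula of Example 1 is straightforward to evaluate; the helper below is our own illustration, assuming \(\mathcal {D}\) is given as a symmetric list of directed edges:

```python
def dim_reversible(D):
    """dim W_rev(X, D) = (|D| + |loops(D)|) / 2 - 1 for a symmetric edge set D."""
    loops = sum(1 for (x, xp) in D if x == xp)
    return (len(D) + loops) // 2 - 1

# Full support on 3 states: |D| = 9 edges, 3 of them loops, so dimension 5,
# matching the n(n+1)/2 - 1 free parameters of the symmetric edge measure
# Q(x, x') = pi(x) P(x, x') of a reversible chain.
```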

Embeddings — The operation converse to lumping is embedding into a larger space of symbols. In the distribution setting, Markov morphisms were introduced by Čencov [16] as the natural operations on distributions. In the Markovian setting, [20] proposed the following notion of an embedding for stochastic matrices.

Definition 1

(Markov embedding for Markov chains [20]). We call Markov embedding, a map \(\varLambda _\star :\mathcal {W}(\mathcal {X}, \mathcal {D}) \rightarrow \mathcal {W}(\mathcal {Y}, \mathcal {E}), P \mapsto \varLambda _\star P\), such that for any \((y,y') \in \mathcal {E}\),

$$\varLambda _\star P(y,y') = P(\kappa (y), \kappa (y'))\varLambda (y,y'),$$

and where \(\kappa \) and \(\varLambda \) satisfy the following requirements

(i):

\(\kappa :\mathcal {Y}\rightarrow \mathcal {X}\) is a lumping function for which \(\kappa _2(\mathcal {E}) = \mathcal {D}\).

(ii):

\(\varLambda \) is a positive function over the edge set, \(\varLambda :\mathcal {E}\rightarrow \mathbb {R}_+\).

(iii):

Writing \(\bigcup _{x \in \mathcal {X}} \mathcal {S}_x = \mathcal {Y}\) for the partition defined by \(\kappa \), \(\varLambda \) is such that for any \(y \in \mathcal {Y}\) and \(x' \in \mathcal {X}\),

$$(\kappa (y), x') \in \mathcal {D}\implies (\varLambda (y,y'))_{y' \in \mathcal {S}_{x'}} \in \mathcal {P}(\mathcal {S}_{x'}).$$

The above embeddings are characterized as the linear maps over the space of lumpable matrices that satisfy a set of monotonicity requirements and are congruent with respect to the lumping operation [20, Theorem 3.1]. When, for any \(y,y' \in \mathcal {Y}\), it additionally holds that \(\varLambda (y,y') = \varLambda (y') \delta \left[ (\kappa (y), \kappa (y')) \in \mathcal {D}\right] \), the embedding \(\varLambda _\star \) is called memoryless [20, Section 3.4.2] and is e/m-geodesic affine [20, Theorem 3.2, Lemma 3.6], preserving both e-families and m-families of stochastic matrices.

Given \(\overline{\pi }\) and \(\varDelta \) as defined in Sect. 3, from [20, Corollary 3.3], there exists a lumping function \(\kappa :[\varDelta ] \rightarrow \mathcal {X}\), and a memoryless embedding \(\sigma ^{\overline{\pi }}_\star :\mathcal {W}(\mathcal {X}, \mathcal {D}) \rightarrow \mathcal {W}([\varDelta ], \mathcal {E})\) with \(\mathcal {E}= \left\{ (y,y') \in [\varDelta ]^2 :(\kappa (y), \kappa (y')) \in \mathcal {D} \right\} \), such that \(\sigma ^{\overline{\pi }}_\star \overline{P}\) is symmetric. Furthermore, identifying \(\mathcal {X}\cong \left\{ 1,2, \dots , \left| \mathcal {X} \right| \right\} \), its existence is constructively given by

$$\kappa (j) = \mathop {\mathrm {arg\,min}}\limits _{1 \le i \le \left| \mathcal {X} \right| } \left\{ \sum _{k=1}^{i} p_k \ge j \right\} , \text { with } \sigma ^{\overline{\pi }}(j) = p^{-1}_{\kappa (j)}, \text { for any } 1 \le j \le \varDelta .$$

As a consequence, we obtain 1. and 2. below.

1. The expression of \(\sigma ^{\overline{\pi }}_\star \overline{P}\) following algebraic manipulations in Definition 1.

2. A randomized algorithm to memorylessly simulate trajectories from \(\sigma ^{\overline{\pi }}_\star P\) out of trajectories from P (see [20, Section 3.1]). Namely, there exists a stochastic mapping \(\varPsi ^{\overline{\pi }} :\mathcal {X}\rightarrow [\varDelta ]\) such that,

    $$X_1, \dots , X_n \sim P \implies \varPsi ^{\overline{\pi }}(X_1^n) = \varPsi ^{\overline{\pi }}(X_1), \dots , \varPsi ^{\overline{\pi }}(X_n) \sim \sigma ^{\overline{\pi }}_\star P.$$
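Both items can be made concrete in a few lines of Python (an illustrative sketch under our own conventions, not code from the paper: states are 0-indexed, `p[x]` is the size of block \(\mathcal {S}_x\), and `rng` drives the stochastic map \(\varPsi ^{\overline{\pi }}\)):

```python
import numpy as np

def kappa_blocks(p):
    """Constructive lumping kappa : [Delta] -> X; state x owns p[x] fine states."""
    return np.repeat(np.arange(len(p)), p)           # e.g. p = [1, 2] -> [0, 1, 1]

def embed(P, p):
    """Item 1: sigma_star P (y, y') = P(kappa(y), kappa(y')) / p_{kappa(y')}."""
    kappa = kappa_blocks(p)
    P = np.asarray(P, float)
    return P[np.ix_(kappa, kappa)] / np.asarray(p, float)[kappa][None, :]

def simulate_embedded(xs, p, rng):
    """Item 2: Psi maps a trajectory of P to one of sigma_star P by drawing
    Y_t uniformly inside the block S_{X_t}, independently at each step."""
    starts = np.concatenate(([0], np.cumsum(p)[:-1]))   # first fine state per block
    return [int(starts[x] + rng.integers(p[x])) for x in xs]
```

For \(\overline{\pi } = (1,2)/3\) and any \(\overline{P}\) reversible with respect to \(\overline{\pi }\), `embed` returns a symmetric stochastic matrix on \([\varDelta ] = [3]\), as promised by [20, Corollary 3.3].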

5 Contrast Preservation

It was established in [20, Lemma 3.1] that similar to their distribution counterparts, Markov embeddings in Definition 1 preserve the Fisher information metric \(\mathfrak {g}\) in (2), the affine connections \(\nabla ^{(e)}, \nabla ^{(m)}\) and the informational (Kullback-Leibler) divergence between points. In this section, we show that memoryless embeddings, such as the symmetrizer \(\sigma ^{\overline{\pi }}_\star \) introduced in Sect. 4, also preserve the contrast function K. Our proof will rely on first showing that the memoryless embeddings of [20, Section 3.4.2] induce natural Markov morphisms [15] from distributions over \(\mathcal {X}^n\) to \(\mathcal {Y}^n\).

Lemma 1

Let \(\kappa :\mathcal {Y}\rightarrow \mathcal {X}\) be a lumping function, and let

$$L_\star :\mathcal {W}(\mathcal {X}, \mathcal {D}) \rightarrow \mathcal {W}(\mathcal {Y}, \mathcal {E})$$

be a \(\kappa \)-congruent memoryless Markov embedding. For \(P \in \mathcal {W}(\mathcal {X}, \mathcal {D})\), let \(Q^n \in \mathcal {P}(\mathcal {X}^n)\) (resp. \(\widetilde{Q}^n \in \mathcal {P}(\mathcal {Y}^n)\)) be the unique distribution over stationary paths of length n induced from P (resp. \(L_\star P\)). Then there exists a Markov morphism \(M_\star :\mathcal {P}(\mathcal {X}^n) \rightarrow \mathcal {P}(\mathcal {Y}^n)\) such that \(M_\star Q^n = \widetilde{Q}^n\).

Proof

Let \(\kappa _n :\mathcal {Y}^n \rightarrow \mathcal {X}^n\) be the lumping function on blocks induced from \(\kappa \),

$$\begin{aligned} \forall y_1^n \in \mathcal {Y}^n, \kappa _n(y_1^n) = (\kappa (y_t))_{1 \le t \le n} \in \mathcal {X}^n, \end{aligned}$$

and introduce

$$\begin{aligned} \mathcal {Y}^n = \bigcup _{x_1^n \in \mathcal {X}^n} \mathcal {S}_{x_1^n}, \text { with } \mathcal {S}_{x_1^n} = \left\{ y_1^n \in \mathcal {Y}^n :\kappa _n(y_1^n) = x_1^n \right\} , \end{aligned}$$

the partition associated to \(\kappa _n\). For any realizable path \(x_1^n\), i.e. such that \(Q^n(x_1^n) > 0\), we define a distribution \(M^{x_1^n} \in \mathcal {P}(\mathcal {Y}^n)\) concentrated on \(\mathcal {S}_{x_1^n}\), and such that for any \(y_1^n \in \mathcal {S}_{x_1^n}\), \( M^{x_1^n}(y_1^n) = \prod _{t=1}^{n} L(y_t).\) Non-negativity of \(M^{x_1^n}\) is immediate, and

$$\begin{aligned} \begin{aligned} \sum _{y_1^n \in \mathcal {Y}^n} M^{x_1^n}(y_1^n)&= \sum _{y_1^n \in \mathcal {Y}^n :\kappa _n(y_1^n) = x_1^n} M^{x_1^n}(y_1^n) = \prod _{t=1}^{n} \left( \sum _{y_t \in \mathcal {S}_{x_t}} L(y_t) \right) = 1, \end{aligned} \end{aligned}$$

thus \(M^{x_1^n}\) is well-defined. Furthermore, for \(y_1^n \in \mathcal {Y}^n\), it holds that

$$\begin{aligned} \begin{aligned} \widetilde{Q}^n(y_1^n)&= L_\star \pi (y_1) \prod _{t=1}^{n-1} L_\star P(y_t, y_{t+1}) {\mathop {=}\limits ^{(\spadesuit )}} \pi (\kappa (y_1)) L(y_1) \prod _{t=1}^{n-1} P(\kappa (y_t), \kappa (y_{t+1})) L(y_{t+1}) \\&= Q^n(\kappa (y_1), \dots , \kappa (y_n)) \prod _{t=1}^{n} L(y_t) = Q^n(\kappa _n(y_1^n)) \prod _{t=1}^{n} L(y_t) \\&= \sum _{x_1^n \in \mathcal {X}^n} Q^n(x_1^n) M^{x_1^n}(y_1^n) = M_\star Q^n(y_1^n), \end{aligned} \end{aligned}$$

where \((\spadesuit )\) stems from [20, Lemma 3.5], whence our claim holds.

Lemma 1 essentially states that the following diagram commutes

(Commutative diagram omitted: \(L_\star \) acting on \(\mathcal {W}(\mathcal {X}, \mathcal {D})\) and \(M_\star \) acting on \(\mathcal {Q}^n_{\mathcal {W}(\mathcal {X}, \mathcal {D})}\) commute with the operation of taking the distribution over stationary paths of length n.)

for the Markov morphism \(M_\star \) induced by \(L_\star \), and where we denoted \(\mathcal {Q}^n_{\mathcal {W}(\mathcal {X}, \mathcal {D})} \subset \mathcal {P}(\mathcal {X}^n)\) for the set of all distributions over paths of length n induced from the family \(\mathcal {W}(\mathcal {X}, \mathcal {D})\). As a consequence, we can unambiguously write \(L_\star Q^n \in \mathcal {Q}^n_{L_\star \mathcal {W}(\mathcal {X}, \mathcal {D})}\) for the distribution over stationary paths of length n that pertains to \(L_\star P\).

Lemma 2

Let \(L_\star :\mathcal {W}(\mathcal {X}, \mathcal {D}) \rightarrow \mathcal {W}(\mathcal {Y}, \mathcal {E})\) be a memoryless embedding. Then for any \(P, \overline{P} \in \mathcal {W}(\mathcal {X}, \mathcal {D})\),

$$\begin{aligned} K(L_\star P, L_\star \overline{P}) = K( P, \overline{P}). \end{aligned}$$

Proof

We recall, for two distributions \(\mu , \nu \in \mathcal {P}(\mathcal {X})\), the definition of the Rényi divergence \(R_{1/2}\) of order 1/2,

$$\begin{aligned} \begin{aligned} R_{1/2}(\mu \Vert \nu ) \doteq -2 \log \left( \sum _{x \in \mathcal {X}} \sqrt{\mu (x) \nu (x)} \right) , \end{aligned} \end{aligned}$$

and note that \(R_{1/2}\) is closely related to the Hellinger distance between \(\mu \) and \(\nu \). This definition extends to the notion of a divergence rate between stochastic processes \((X_t)_{t \in \mathbb {N}}, (X'_t)_{t \in \mathbb {N}}\) on \(\mathcal {X}\) as follows

$$\begin{aligned} R_{1/2}\left( (X_t)_{t \in \mathbb {N}} \Vert (X'_t)_{t \in \mathbb {N}}\right) = \lim _{n \rightarrow \infty } \frac{1}{n} R_{1/2}\left( X_1^n \Vert X_1'^n \right) , \end{aligned}$$

and in the irreducible time-homogeneous Markovian setting where \((X_t)_{t \in \mathbb {N}},\) \((X'_t)_{t \in \mathbb {N}}\) evolve according to transition matrices P and \(P'\), the above reduces [13] to

$$\begin{aligned} R_{1/2}\left( (X_t)_{t \in \mathbb {N}} \Vert (X'_t)_{t \in \mathbb {N}}\right) = -2 \log \rho (P^{\circ 1/2} \circ P'^{\circ 1/2}) = -2 \log (1 - K(P, P')). \end{aligned}$$

Reorganizing terms and plugging in the embedded stochastic matrices,

$$\begin{aligned} \begin{aligned} K(L_\star P, L_\star \overline{P})&= 1 - \exp \left( -\frac{1}{2} \lim _{n \rightarrow \infty } \frac{1}{n} R_{1/2}\left( L_\star Q^n \Vert L_\star \overline{Q}^n \right) \right) , \\ \end{aligned} \end{aligned}$$

where \(L_\star \overline{Q}^n\) is the distribution over stationary paths of length n induced by the embedded \(L_\star \overline{P}\). For any \(n \in \mathbb {N}\), from Lemma 1 and information monotonicity of the Rényi divergence, \(R_{1/2}\left( L_\star Q^n \Vert L_\star \overline{Q}^n \right) = R_{1/2}\left( Q^n \Vert \overline{Q}^n \right) ,\) hence our claim.
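Lemma 2 can be checked numerically on a toy instance. In the sketch below (our own sanity check, not from the paper), `contrast` implements (1) and `embed` the memoryless symmetrizer of Sect. 4, here for \(\overline{\pi } = (1,2)/3\):

```python
import numpy as np

def contrast(P, Pp):
    """K(P, P') = 1 - rho(P^{o1/2} o P'^{o1/2}), Eq. (1)."""
    return 1.0 - np.max(np.abs(np.linalg.eigvals(np.sqrt(P * Pp))))

def embed(P, p):
    """Memoryless symmetrizer of Sect. 4 for pi_bar = p / sum(p)."""
    kappa = np.repeat(np.arange(len(p)), p)
    return P[np.ix_(kappa, kappa)] / np.asarray(p, float)[kappa][None, :]

p = [1, 2]                                   # pi_bar = (1/3, 2/3), Delta = 3
Pbar = np.array([[1/3, 2/3], [1/3, 2/3]])    # reversible w.r.t. pi_bar
P = np.array([[1/2, 1/2], [1/4, 3/4]])       # also reversible w.r.t. pi_bar
# Contrast preservation (Lemma 2): K(sigma_star P, sigma_star Pbar) = K(P, Pbar)
assert abs(contrast(embed(P, p), embed(Pbar, p)) - contrast(P, Pbar)) < 1e-10
```

The equality here is exact, not asymptotic: the embedded \(3 \times 3\) matrix \((\sigma _\star P)^{\circ 1/2} \circ (\sigma _\star \overline{P})^{\circ 1/2}\) has the same spectral radius as its \(2 \times 2\) counterpart.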

6 Proof of Theorem 1

We assume that P and \(\overline{P}\) are in \(\mathcal {V}_{\textsf{test}}\). We reduce the problem as follows. We construct \(\sigma ^{\overline{\pi }}_\star \), the symmetrizer defined in Sect. 4. We proceed to embed both the reference chain (using Definition 1) and the unknown trajectory (using the operational definition in [20, Section 3.1]). We invoke the tester of [3] as a black box, and report its answer.

Fig. 1. Reduction of the testing problem by isometric embedding.

Completeness case. It is immediate that \(P = \overline{P} \implies \sigma ^{\overline{\pi }}_\star P = \sigma ^{\overline{\pi }}_\star \overline{P}\).

Soundness case. From Lemma 2, \(K(P, \overline{P})> \varepsilon \implies K(\sigma ^{\overline{\pi }}_\star P, \sigma ^{\overline{\pi }}_\star \overline{P}) > \varepsilon \).

As a consequence of [3, Theorem 10], the sample complexity of testing is upper bounded by \(\mathcal {O}(\varDelta /\varepsilon ^4)\). Since \(\overline{\pi }_\star = p_1/\varDelta \), and treating \(p_1\) as a small constant, we recover the announced sample complexity of \(\tilde{\mathcal {O}}(1/(\overline{\pi }_\star \varepsilon ^4))\).