1 Introduction

In pattern recognition, the following basic situation is considered: A random variable \((X,Y)\) consists of an observed pattern \(X\in {\mathcal {X}} \), typically \( {\mathcal {X}} = {\mathbb {R}}^d\), from which we want to infer the unobservable class Y, which belongs to a given finite set M of classes. Consider the case that the distribution \(P^{(X,Y)}\) is known. Then the classification rule which chooses the class having maximum a posteriori probability given the observed pattern x has minimal risk of misclassification. This optimal rule is given by

$$\begin{aligned} x \mapsto \mathop {{{\mathrm{argmax}}}}\limits _{y} P\big (Y=y|X=x\big ) \end{aligned}$$

where \({{\mathrm{argmax}}}\) takes, in a measurable way, some value \(y^*\) with \(P(Y=y^*|X=x)= \max _{y}P(Y=y|X=x)\). The minimal risk of misclassification, often termed the Bayes risk, is given by

$$\begin{aligned} R= \int \min _{y} P\big (Y \ne y|X=x\big ) P^X(\mathrm{d}x). \end{aligned}$$

Even though in many problems of pattern recognition the distribution of \(P^{(X,Y)}\) will not be known, the Bayes risk is a quantity of major importance as it provides the benchmark behaviour against which any other procedure is judged.
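For concreteness, the optimal rule and the Bayes risk can be evaluated directly once the joint distribution is specified. The following Python sketch does this for an assumed toy joint distribution on a three-point pattern space with two classes; all numbers are illustrative and not taken from the paper.

```python
import numpy as np

# Assumed toy joint distribution P(X = x, Y = y) on patterns {0, 1, 2}
# and classes {0, 1}; purely illustrative.
joint = np.array([[0.30, 0.05],   # x = 0
                  [0.10, 0.15],   # x = 1
                  [0.05, 0.35]])  # x = 2
p_x = joint.sum(axis=1)                  # marginal P(X = x)
posterior = joint / p_x[:, None]         # P(Y = y | X = x)

bayes_rule = posterior.argmax(axis=1)    # x -> argmax_y P(Y = y | X = x)
bayes_risk = np.sum(p_x * (1.0 - posterior.max(axis=1)))
# R = sum_x P(X = x) * min_y P(Y != y | X = x)

print(bayes_rule)   # optimal class for x = 0, 1, 2
print(bayes_risk)   # minimal risk of misclassification
```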

Let us briefly recall the i.i.d. model of supervised learning which has provided a main direction of research, see, e.g., the monograph (Devroye et al. 1996). There, in addition to \((X,Y)\), we have a learning sequence \((X_1',Y_1'),(X_2',Y_2'),\ldots ,(X_n',Y_n')\) of independent copies of \((X,Y)\), i.e. having the same distribution. This sequence is sampled independently of \((X,Y)\) and is used for learning purposes, that is, in a statistical sense, for the estimation of the unknown distributions needed to construct the classification procedure.

In this paper we take a different approach, motivated by an on-line classification setting which we model in the following way: we are given a stochastic process \((X_{-n},Y_{-n})_{n}\), the present time point being denoted by 0, with observables in temporal order

$$\begin{aligned} \ldots ,X_{-n},X_{-n+1},\ldots , X_{-1}, X_0 \end{aligned}$$

from which the class \(Y_0\) is to be inferred. The time parameter n belongs to some set of the form \(\{0,1,\ldots ,m\}\) or, for mathematical purposes, to \({\mathbb {N}}_0= \{0,1,\ldots \}\). So in this classification setting, previous observations may be used to classify the present observation. If \((X_0,Y_0)\) is independent of the past \((X_{-n},Y_{-n})_{n \ge 1 }\) then clearly previous observations carry no information on \(Y_0\) and our optimal classification would be given by \({{\mathrm{argmax}}}_{y} P(Y_0=y|X_0=x_0)\).

But in a variety of classification problems we encounter dependence. Looking, e.g., at the on-line classification of handwritten characters, the dependence structure of natural language could be taken into account. In this situation, \(X_0\) would be the current handwritten character to be classified, \(Y_0\) the unknown true character, the preceding handwritten character would be \(X_{-1}\) with unknown true character \(Y_{-1}\), and in general \(X_{-n}\) would be the n-th character preceding \(X_0\), with unknown \(Y_{-n}\). So, there is a well-known dependence between the \(Y_{-n}\)’s, described by linguists using Markov models (see e.g., Institute for Defense Analyses 1980 for early discussions), and this dependence is of course inherited by the \(X_{-n}\)’s. A popular model for this situation is given by a hidden Markov model, which we shall also use in this paper.

Coming back to the general model, we prescribe the use of the present observable and, in addition, the l preceding ones. Then a classification rule with memory l takes the form

$$\begin{aligned} (x_0,x_{-1},\ldots ,x_{-l}) \mapsto \delta (x_0,x_{-1},\ldots ,x_{-l}) \end{aligned}$$

for some measurable \(\delta :{\mathcal {X}}^{l+1} \rightarrow M\). The optimal rule is given by

$$\begin{aligned} (x_0,x_{-1},\ldots ,x_{-l}) \mapsto \mathop {{{\mathrm{argmax}}}}\limits _{y}P(Y_0=y| X_0=x_0,X_{-1}=x_{-1},\ldots ,X_{-l}=x_{-l}) \end{aligned}$$

with Bayes risk

$$\begin{aligned} R_l = \int \min _y P(Y_0 \ne y| X_0,X_{-1},\ldots ,X_{-l}) \mathrm{d}P. \end{aligned}$$

Obviously

$$\begin{aligned} R_0 \ge R_1 \ge \cdots \ge R_l \ge \cdots \end{aligned}$$

Assume that we have such a process \((X_{-n},Y_{-n})_{n \in {\mathbb {N}}_0}\) with full past for our mathematical model. By martingale convergence it follows that for \(l \rightarrow \infty \)

$$\begin{aligned} R_l \rightarrow R^*=\int \min _y P(Y_0 \ne y| (X_{-n})_{ n \in {\mathbb {N}}_0}) \mathrm{d}P. \end{aligned}$$

Here it is important to point out that this paper centers around the behaviour of the optimal classification procedure as a function of l, the number of past observations used. This differs markedly from one of the main lines of research in the i.i.d. model of supervised learning, where the focus is on the behaviour of classification procedures as a function of n, the size of the training sequence.

To investigate the behaviour of procedures which incorporate preceding information into classification rules we will use the setting of hidden Markov models. This class of models has been of considerable interest in the theory and applications of pattern recognition, see the monographs by MacDonald and Zucchini (1997) and by Huang et al. (1990) from a more practical viewpoint. It provides a class which allows good modelling for various problems with dependence and still may be handled well from the analytical, the algorithmic, and the statistical point of view, see the monograph by Cappé et al. (2005). The applications range from biology to speech recognition to finance; the above monographs contain a wealth of such examples.

A theoretical contribution to pattern recognition for such models was given by Holst and Irle (2001) where the asymptotic risk of misclassification for nearest neighbor rules in dependent models including hidden Markov models was derived. Similar models were treated in Ryabko (2006) to obtain consistency for certain classes of procedures, i.e. convergence of the risk of misclassification to the Bayes risk. As consistency for classification follows from consistency in the corresponding regression problem, see e.g. (Devroye et al. 1996, 6.7), any result on regression consistency yields a result on classification consistency, and a wealth of such results is available, e.g. under mixing conditions. All these results invoke the convergence of the size n of a training sequence to infinity and do not cover the topic of this paper. Closer to our paper is the problem of predicting \(Y_0\) from \((X_0,Y_{-1},\ldots ,Y_{-l})\) for stationary and ergodic time series, see e.g. (Györfi et al. 2002, Chap. 27). Our treatment differs in that, in on-line pattern recognition, we do not know \((Y_{-1},\ldots ,Y_{-l})\) (we only have guesses of them), but only \((X_0,X_{-1},\ldots ,X_{-l})\).

The hidden Markov model as it will be used in this paper takes the following form. We assume that for each m we have, written in their temporal order, observables \(X_{-m},X_{-m+1},\ldots , X_{-1}, X_0\) and unobservables \(Y_{-m},Y_{-m+1},\ldots , Y_{-1}, Y_0\). The unobservables form a Markov chain. The observables are conditionally independent given the unobservables in the form of

$$\begin{aligned} P\big (X_0 \in B_0,\ldots , X_{-m} \in B_{-m}|Y_0=y_0,\ldots , Y_{-m}=y_{-m}\big ) =Q(B_0,y_0)\ldots Q(B_{-m},y_{-m}) \end{aligned}$$

for some stochastic kernel Q and are not Markovian in general. This stochastic kernel and the transition matrix of the chain are assumed to be the same for each m. But we allow for the flexibility that, for each m, a different initial distribution, i.e. a different distribution of \(Y_{-m}\), may occur. Note that \(-m\) stands for the time point in the past at which our model is started, and the distribution of \(Y_{-m}\) would not be known.

To be completely precise we would have to use the notation \(Y^{(m)}_{-m},\ldots , Y^{(m)}_0\) since, due to our flexibility in the initial distribution, the distributions of \(Y^{(m)}_{-m},\ldots , Y^{(m)}_0\) and \(Y^{(m+1)}_{-m},\ldots , Y^{(m+1)}_0\) need not coincide. Hence also \(R_l \ge R_{l+1}\) does not hold in general when \(R_l\) is computed in a model started at some time \(-m,\ m\ge l\), and \(R_{l+1}\) in a model with a possibly different initial distribution. But all our bounds will only involve the transition matrix and the stochastic kernel, which do not depend on the index m. So we shall omit this upper index in order not to overburden our notation.

We assume that the transition matrix of the chain is such that there exists a unique stationary probability distribution \(\pi \), characterized by the property that if \(Y_{-m}\) has the distribution \(\pi \) then all later \(Y_{-m+k},~k\ge 0,\) have the same distribution \(\pi \). For our chain with full past \((X_{-n},Y_{-n})_{n \in {\mathbb {N}}_0}\) we consider the stationary setting where each \(Y_{-n}\) has the same distribution \(\pi \). Denoting risks in the stationary case with an additional \(^*\), we then of course have \(R_l^* \ge R_{l+1}^*\) and \(\lim _{l}R_l^* = R^*\).

Without loss of generality we let the probability measures \(Q(\cdot ,y)\) be given by densities \(f_y\) with respect to some \(\sigma \)-finite measure \(\mu \) on \({\mathcal {X}}\). So we have for all n

$$\begin{aligned} P\big (X_{-n} \in B|Y_{-n}=y\big ) = \int _B f_y(x)\mu (\mathrm{d}x). \end{aligned}$$

This provides a unified treatment for the case of discrete \({\mathcal {X}}\) where \(\mu \) might be the counting measure, and for the case of Lebesgue densities where \({\mathcal {X}} = {\mathbb {R}}^d\) and \(\mu \) might be d-dimensional Lebesgue measure.
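To fix ideas, the following Python sketch simulates such a hidden Markov model; the two-state transition matrix, the Gaussian emission densities and the initial distribution are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative ingredients: |M| = 2 classes, Gaussian emissions on R.
P = np.array([[0.9, 0.1],          # transition matrix (p_ij)
              [0.2, 0.8]])
means = np.array([0.0, 3.0])       # emission density f_y = N(means[y], 1)
nu = np.array([0.5, 0.5])          # initial distribution of Y_{-m}

def sample_hmm(m, P=P, means=means, nu=nu):
    """Sample (Y_{-m}, ..., Y_0) and (X_{-m}, ..., X_0), returned in temporal order."""
    y = np.empty(m + 1, dtype=int)
    y[0] = rng.choice(len(nu), p=nu)               # Y_{-m} ~ initial distribution
    for t in range(1, m + 1):                      # Markov chain of unobservables
        y[t] = rng.choice(P.shape[1], p=P[y[t - 1]])
    x = rng.normal(means[y], 1.0)                  # conditionally independent observables
    return y, x

y, x = sample_hmm(m=10)
```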

In Sect. 2 we shall show under a suitable assumption that \(\lim _l R_l\) exists and is independent of the particular sequence of initial distributions, hence \(\lim _l R_l = R^*\). Furthermore this convergence is exponentially fast and we provide a bound for \(\sup _k |R_l - R_{l+k}|\) in this respect. Let us remark that, as we are looking backwards in time, the usual geometric ergodicity forward in time does not seem to yield an immediate proof. In Sect. 3 we introduce kernel classification rules with memory and discuss their theoretical and practical performance. Our findings indicate that it might be useful to include a small number l of preceding observations, starting with \(l=1\), to increase the performance of classification rules with an acceptable increase in computational complexity. Various technical proofs are given in Sect. 4.

2 Exponential convergence

We consider a hidden Markov model as described in the Introduction. For this model we make the following assumption:

(A) All entries \(p_{ij}, i,j \in M\), in the transition matrix are \( >0\). All densities are \( >0\) on \({\mathcal {X}}\).

(A) will be assumed to hold throughout Sects. 2 and 4. It implies the finiteness of the following quantities which will be used in our bounds.

Remark 2.1

Set

$$\begin{aligned} \alpha =\max _{\iota ,\kappa ,i,j\in M}\frac{p_{i\iota }p_{\iota j}}{p_{i\kappa }p_{\kappa j}}, \;\; \alpha (x)=\max _{ \iota ,\kappa ,i,j\in M}\frac{p_{i\iota }p_{\iota j}\; f_{\iota }(x)}{p_{i\kappa }p_{\kappa j}\; f_{\kappa }(x)} \quad \text{ for } x \in {\mathcal {X}}. \end{aligned}$$

Then \(\; 1 \le \alpha , \alpha (x) < \infty \).

Furthermore, with |M| denoting the number of classes, let

$$\begin{aligned} \eta =\big (1+(|M|-1)\alpha \big )^{-1}, \;\; \eta (x)=\big (1+(|M|-1)\alpha (x)\big )^{-1} \text{ for } x \in {\mathcal {X}}. \end{aligned}$$

Then \(\; 0 < \eta , \eta (x) \le 1/2\).
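For a concrete model these quantities are elementary to evaluate. The following sketch computes \(\alpha \), \(\eta \), \(\alpha (x)\) and \(\eta (x)\) for an assumed two-state chain with Gaussian densities; the model is illustrative only.

```python
import numpy as np
from scipy.stats import norm

# Assumed illustrative model: strictly positive transition matrix, Gaussian f_y.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
means = np.array([0.0, 3.0])
M = len(P)
f = lambda x: norm.pdf(x, loc=means, scale=1.0)   # vector (f_0(x), ..., f_{M-1}(x))

def alpha_const(P):
    M = len(P)
    return max(P[i, a] * P[a, j] / (P[i, b] * P[b, j])
               for i in range(M) for j in range(M)
               for a in range(M) for b in range(M))

def alpha_at(P, fx):
    # fx = vector of density values at x, all > 0 under assumption (A)
    M = len(P)
    return max(P[i, a] * P[a, j] * fx[a] / (P[i, b] * P[b, j] * fx[b])
               for i in range(M) for j in range(M)
               for a in range(M) for b in range(M))

eta = 1.0 / (1.0 + (M - 1) * alpha_const(P))
x = 1.5
eta_x = 1.0 / (1.0 + (M - 1) * alpha_at(P, f(x)))
print(alpha_const(P), eta)       # alpha and eta
print(alpha_at(P, f(x)), eta_x)  # alpha(x) and eta(x)
```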

The following result provides the main technical tool. Its proof will be given in Sect. 4. We use the notation \( x_{-n}^{0}\) for \( (x_{0},\ldots ,x_{-n}) \) and in the same manner we use \( X_{-n}^{0}\).

Theorem 2.2

Let \(l,n \in {\mathbb {N}}\). Consider a hidden Markov model which starts in some time point \(-m < \min \{-l, -n\}\). Let \( A\subseteq M \) and fix \( x_{0},\ldots ,x_{-n}\in {\mathcal {X}} \). Set

$$\begin{aligned} m_{l}^{+}=\max _{i\in M}P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l}=i), \quad l\in {\mathbb {N}}\end{aligned}$$

and

$$\begin{aligned} m_{l}^{-}=\min _{i\in M}P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l}=i), \quad l\in {\mathbb {N}}\, . \end{aligned}$$

Then for all \(l\in {\mathbb {N}}\)

$$\begin{aligned} m_{l}^{+}-m_{l}^{-}\le \prod _{j=-l+1}^{0}(1-2\hat{\eta }_{j}), \end{aligned}$$

where \( \hat{\eta }_{j} = \eta (x_{j}) \) for \( j\in \{-n,\ldots ,0\}\) and \( \hat{\eta }_{j} = \eta \) for \( j\notin \{-n,\ldots ,0\}\).

In the following corollary the probabilities \(P\big (Y_{0}\in A|X_{-l}^{0}=x_{-l}^{0}\big )\) and \(P\big (Y_{0}\in A|X_{-l-k}^{0}=x_{-l-k}^{0}\big )\) are treated. The first will pertain to a hidden Markov model which starts in some time point before \(-l-1\), the second to one which starts in some time point before \(-l-k\). Note that terms of the form \(P\big (Y_{0}\in A|X_{-l}^{0}=x_{-l}^{0},Y_{-l-1}=\iota \big )\) are identical in both models due to the identical transition matrix and the identical kernel Q.

Corollary 2.3

Let \( l,k\in {\mathbb {N}}\), \( A\subseteq M \) and \( x_{0},x_{-1},\ldots , x_{-l-k} \in {\mathcal {X}} \). Then

$$\begin{aligned} \big |P(Y_{0}\in A|X_{-l}^{0}=x_{-l}^{0})-P(Y_{0}\in A|X_{-l-k}^{0}=x_{-l-k}^{0})\big | \le \prod _{j=-l}^{0}\big (1-2\eta (x_{j})\big )\, . \end{aligned}$$

Proof

We obtain

$$\begin{aligned}&\big |P\big (Y_{0}\in A|X_{-l}^{0}=x_{-l}^{0}\big )-P\big (Y_{0}\in A|X_{-l-k}^{0}=x_{-l-k}^{0}\big )\big |\\&\quad = \Bigg |\sum _{\iota \in M}P\big (Y_{0}\in A|X_{-l}^{0}=x_{-l}^{0},Y_{-l-1}=\iota \big )P\big (Y_{-l-1}=\iota |X_{-l}^{0}=x_{-l}^{0}\big )\\&\qquad -\sum _{\kappa \in M}P\big (Y_{0}\in A|X_{-l-k}^{0}=x_{-l-k}^{0},Y_{-l-1}=\kappa \big )P\big (Y_{-l-1}=\kappa |X_{-l-k}^{0}=x_{-l-k}^{0}\big )\Bigg | \\&\quad = \Bigg |\sum _{\iota \in M}P\big (Y_{0}\in A|X_{-l}^{0}=x_{-l}^{0},Y_{-l-1}=\iota \big ) P\big (Y_{-l-1}=\iota |X_{-l}^{0}=x_{-l}^{0}\big )\\&\qquad -\sum _{\kappa \in M}P\big (Y_{0}\in A|X_{-l}^{0}=x_{-l}^{0},Y_{-l-1}=\kappa \big ) P\big (Y_{-l-1}=\kappa |X_{-l-k}^{0}=x_{-l-k}^{0}\big )\Bigg |\\&\quad \le \max _{i\in M}P\big (Y_{0}\in A|X_{-l}^{0}=x_{-l}^{0},Y_{-l-1}=i\big )\\&\qquad -\min _{i\in M}P\big (Y_{0}\in A|X_{-l}^{0}=x_{-l}^{0},Y_{-l-1}=i\big )\\&\quad = m_{l+1}^{+}-m_{l+1}^{-} \le \prod _{j=-l}^{0}\big (1-2\eta (x_{j})\big )\,. \end{aligned}$$

where the final inequality uses Theorem 2.2. \(\square \)
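For small models the bound of Corollary 2.3 can be checked numerically, since the conditional probabilities can be computed exactly by a forward filter. The sketch below does this with both probabilities computed in the same assumed two-state model with Gaussian emissions; all concrete numbers are illustrative.

```python
import numpy as np
from scipy.stats import norm

P = np.array([[0.9, 0.1], [0.2, 0.8]])     # assumed transition matrix
means = np.array([0.0, 3.0])               # assumed emission densities f_y = N(means[y], 1)
nu = np.array([0.5, 0.5])                  # assumed initial distribution at time -m
f = lambda x: norm.pdf(x, loc=means, scale=1.0)

def posterior_Y0(x_obs, m):
    """P(Y_0 = . | X_{-l}^0 = x_obs) for a model started at time -m,
    where x_obs = (x_{-l}, ..., x_0) in temporal order, l = len(x_obs) - 1."""
    l = len(x_obs) - 1
    w = nu @ np.linalg.matrix_power(P, m - l)   # law of Y_{-l}
    w = w * f(x_obs[0]); w /= w.sum()
    for x in x_obs[1:]:                         # forward filter over x_{-l+1}, ..., x_0
        w = (w @ P) * f(x); w /= w.sum()
    return w

def eta_at(x):
    M = len(P)
    ax = max(P[i, a] * P[a, j] * f(x)[a] / (P[i, b] * P[b, j] * f(x)[b])
             for i in range(M) for j in range(M)
             for a in range(M) for b in range(M))
    return 1.0 / (1.0 + (M - 1) * ax)

l, k, m = 3, 2, 10
x = np.array([0.5, 2.0, -1.0, 0.3, 2.5, 1.0])    # (x_{-l-k}, ..., x_0)
lhs = abs(posterior_Y0(x[k:], m)[0] - posterior_Y0(x, m)[0])   # A = {0}
rhs = np.prod([1 - 2 * eta_at(xj) for xj in x[k:]])            # product over j = -l, ..., 0
print(lhs, "<=", rhs)
```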

We now introduce the constants used for the exponential bound.

Remark 2.4

  1. (i)

    Set

    $$\begin{aligned} \beta =\min _{k \in M} \int \frac{1}{ 1 + \big (|M|-1\big )\alpha (x)} f_{k}(x)\mu (\mathrm{d}x) \quad \mathrm{and} \quad \gamma =1-2\beta \,. \end{aligned}$$

    Then

    $$\begin{aligned} 0 < \beta \le \frac{1}{2} \quad \mathrm{and} \quad 0 \le \gamma < 1. \end{aligned}$$

    For all \(k \in M\)

    $$\begin{aligned} \int \big (1 - 2\eta (x)\big ) f_{k}(x)\mu (\mathrm{d}x) \le \gamma . \end{aligned}$$
  2. (ii)

    For the following result we need additional constants which arise from basic Markov process theory. A transition matrix is called uniformly ergodic if there exists a unique stationary probability distribution \(\pi \) and there exist constants \(a>0,~0<b<1,\) such that for any Markov chain \((Y_n)_{n\ge t}\) with this transition matrix and any initial distribution at time t

    $$\begin{aligned} \Vert P^{Y_{t+k}}-\pi \Vert \le a\cdot b^k, \quad k\in {\mathbb {N}}, \end{aligned}$$

    in total variation norm \(\Vert \cdot \Vert \). With the same meaning, also the process \((Y_n)_{n\ge t}\) is called uniformly ergodic. Assumption (A) above implies uniform ergodicity, so that we have for each Markov chain constants \(a,b\) as above, see (Meyn and Tweedie 2012, Chap. 16) for a general treatment.
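As a small numerical illustration (not part of the argument), the stationary distribution of an assumed strictly positive transition matrix and the geometric decay of the total variation distance can be inspected directly:

```python
import numpy as np

P = np.array([[0.9, 0.1], [0.2, 0.8]])       # assumed transition matrix, all entries > 0

# stationary distribution: left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()

nu = np.array([1.0, 0.0])                    # some initial distribution
for k in (1, 5, 10, 20):
    dist = nu @ np.linalg.matrix_power(P, k)
    tv = np.abs(dist - pi).sum()             # total variation norm as in Meyn and Tweedie
    print(k, tv)                             # decays geometrically in k
```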

Theorem 2.5

There exist constants \(a>0,~0<b,\gamma <1\) such that for all \( l,k\in {\mathbb {N}}\)

$$\begin{aligned} | R_l -R_{l+k}| \le \gamma ^{l+1}\, , \end{aligned}$$

if \(R_l\) and \(R_{l+k}\) come from the same model started at some time point before \(-l-k\), and

$$\begin{aligned} | R_l -R_{l+k}| \le 2 ( \gamma ^{l/2} + ab^{l/2}) \, , \end{aligned}$$

in the general case that \(R_l\) and \(R_{l+k}\) come from possibly different models, the first started in some time point before \(-l\), the second in some time point before \(-l-k\).

Proof

The constants \(a,b,\gamma \) will be those introduced in Remark 2.4.

Let us firstly consider the case that \(R_l\) and \(R_{l+k}\) stem from the same model. Using the generic symbol f to denote densities in this model we write the joint density as \(f(x_{-l-k}^{0})\) and the joint conditional density as \(f_{y_{-l-k}^{0}}(x_{-l-k}^{0})\). With this notation we have

$$\begin{aligned} f(x_{-l-k}^{0})\!=\!\sum _{y_{-l-k}^{0}}P(Y_{-l-k}^{0}=y_{-l-k}^{0})\; f_{y_{-l-k}^{0}}(x_{-l-k}^{0}) \quad \text{ and } \quad f_{y_{-l-k}^{0}}(x_{-l-k}^{0})\!=\!\prod _{j=-l-k}^{0}f_{y_{j}}(x_{j}), \end{aligned}$$

furthermore

$$\begin{aligned} R_l= & {} \int _{{\mathcal {X}}^{l+k+1}}\min _{i\in M}P\big (Y_{0}\ne i|X_{-l}^{0}=x_{-l}^{0}\big )\; f(x_{-l-k}^{0})\mu ^{l+k+1}(\mathrm{d}x_{-l-k}^{0}),\\ R_{l+k}= & {} \int _{{\mathcal {X}} ^{l+k+1}}\min _{i\in M}P\big (Y_{0}\ne i|X_{-l-k}^{0}=x_{-l-k}^{0}\big ) \; f(x_{-l-k}^{0})\mu ^{l+k+1}(\mathrm{d}x_{-l-k}^{0}). \end{aligned}$$

We obtain from Corollary 2.3 and conditional independence

$$\begin{aligned}&| R_{l}-R_{l+k} |\\&\quad \le \int _{{\mathcal {X}} ^{l+k+1}}\max _{i\in M}\big |P(Y_{0} \ne i|X_{-l}^{0}=x_{-l}^{0})- P\big (Y_{0} \ne i|X_{-l-k}^{0}=x_{-l-k}^{0}\big )\big |\\&\qquad f(x_{-l-k}^{0})\mu ^{l+k+1}(dx_{-l-k}^{0})\\&\quad \le \int _{{\mathcal {X}} ^{l+1}}\prod _{j=-l}^{0}\big (1-2\eta (x_{j})\big )\; f(x_{-l}^{0})\mu ^{l+1}(\mathrm{d}x_{-l}^{0})\\&\quad = \sum _{y_{-l}^{0}}\left[ P\big (Y_{-l}^{0}=y_{-l}^{0}\big )\int _{{\mathcal {X}} ^{l+1}} \prod _{j=-l}^{0}\big (1-2\eta (x_{j})\big ) \; f_{y_{-l}^{0}}(x_{-l}^{0})\mu ^{l+1}(\mathrm{d}x_{-l}^{0}) \right] \\&\quad =\sum _{y_{-l}^{0}}\left[ P\big (Y_{-l}^{0}=y_{-l}^{0}\big )\prod _{j=-l}^{0}\int _{{\mathcal {X}} }\big (1-2\eta (x_{j})\big ) f_{y_{j}}(x_{j})\mu (\mathrm{d}x_{j})\right] \\&\quad \le \sum _{y_{-l}^{0}}P\big (Y_{-l}^{0}=y_{-l}^{0}\big ) \gamma ^{l+1} = \gamma ^{l+1}\,. \end{aligned}$$

Let us now look at the general case with models 1 and 2, \(R_l=R_l^1\) stemming from model 1, \(R_{l+k}= R_{l+k}^2\) from model 2 respectively. Then

$$\begin{aligned} \big |R_l^1 - R_{l+k}^2\big | \le \big |R_l^1 - R^1_{\lfloor l/2 \rfloor }\big | + \big | R^1_{\lfloor l/2 \rfloor } - R^2_{\lfloor l/2 \rfloor }\big | + \big | R^2_{\lfloor l/2 \rfloor } - R^2_{ l + k}\big |. \end{aligned}$$

From the first part of the assertion

$$\begin{aligned} \big |R_l^1 - R^1_{\lfloor l/2 \rfloor }\big | + \big | R^2_{\lfloor l/2 \rfloor } - R^2_{ l + k}\big | \le 2 \gamma ^{ l/2}. \end{aligned}$$

To treat \(R^1_{\lfloor l/2 \rfloor }\) and \( R^2_{\lfloor l/2 \rfloor }\) we note that the conditional Bayes risks for time lag \(\lfloor l/2 \rfloor \) given \(Y_{-\lfloor l/2 \rfloor }\) are the same in both models, hence the unconditional risks differ by at most the total variation distance between the two distributions of \(Y_{-\lfloor l/2 \rfloor }\) in the two models. Since both models have been running for at least \(l - \lfloor l/2 \rfloor \ge l/2\) time points before time \(-\lfloor l/2 \rfloor \), each of these distributions is within total variation distance \(ab^{l/2}\) of \(\pi \), so that this quantity is \(\le 2ab^{l/2}\), hence

$$\begin{aligned} \big | R^1_{\lfloor l/2 \rfloor } - R^2_{\lfloor l/2 \rfloor }\big | \le 2ab^{l/2}. \end{aligned}$$

\(\square \)

From this we easily obtain our main result.

Theorem 2.6

There exist constants \(a>0,~0<b,\gamma <1\) such that for all \(l \in {\mathbb {N}}\)

$$\begin{aligned} | R^{*} - R_l | \le 2 ( \gamma ^{l/2} + ab^{l/2}) \, , \end{aligned}$$

in particular for \(l \rightarrow \infty \)

$$\begin{aligned} R_l \rightarrow R^{*}\;. \end{aligned}$$

Proof

As in Theorem 2.5, the constants \(a,b,\gamma \) are those of Remark 2.4. Recall that \(R^{*}_l\) is the Bayes risk in the stationary case. As already stated earlier, \(\lim _{l} R^{*}_l = R^{*}\) by martingale convergence. Theorem 2.5 shows

$$\begin{aligned} | R_l - R^{*}_{l+k}| \le 2( \gamma ^{l/2} + ab^{l/2}) \end{aligned}$$

for all k; letting \(k\rightarrow \infty \) and using \(R^{*}_{l+k} \rightarrow R^{*}\) proves the assertion. \(\square \)
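The exponential convergence can also be observed empirically. The following sketch estimates \(R_l^*\) by Monte Carlo in the stationary setting of an assumed two-state model, using a forward filter to evaluate \(\min _y P(Y_0 \ne y|X_{-l}^0=x_{-l}^0)\); the model and the sample sizes are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1], [0.2, 0.8]])        # assumed transition matrix
means = np.array([0.0, 1.5])                  # assumed emission means, f_y = N(means[y], 1)
f = lambda x: norm.pdf(x, loc=means, scale=1.0)

w, v = np.linalg.eig(P.T)                     # stationary distribution pi
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))]); pi /= pi.sum()

def estimate_R_star(l, n_mc=10000):
    """Monte Carlo estimate of R_l^* in the stationary model."""
    risks = np.empty(n_mc)
    for r in range(n_mc):
        y = rng.choice(2, p=pi)               # Y_{-l} ~ pi
        x = [rng.normal(means[y])]
        for _ in range(l):                    # Y_{-l+1}, ..., Y_0 with observations
            y = rng.choice(2, p=P[y])
            x.append(rng.normal(means[y]))
        w_ = pi * f(x[0]); w_ /= w_.sum()     # forward filter for P(Y_0 = . | X_{-l}^0)
        for xt in x[1:]:
            w_ = (w_ @ P) * f(xt); w_ /= w_.sum()
        risks[r] = 1.0 - w_.max()             # min_y P(Y_0 != y | X_{-l}^0 = x_{-l}^0)
    return risks.mean()

for l in (0, 1, 2, 4):
    print(l, estimate_R_star(l))              # decreases in l and stabilizes quickly
```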

3 Kernel classification with memory

Optimal classification procedures provide benchmarks for the actual behaviour of data driven classification procedures which do not require knowledge of the underlying distribution. A general principle from statistical classification involves the availability of a training sequence \((x'_1,y'_1,\ldots ,x'_n,y'_n)\) where the \(x'_i\) have been recorded together with the \(y'_i\). This training sequence is used for the construction of a regression estimator

$$\begin{aligned} {\hat{p}}(y|x;x'_1,y'_1,\ldots ,x'_n,y'_n) \quad \text{ for } P(Y=y|X=x) \end{aligned}$$

which leads to the classification rule

$$\begin{aligned} x \mapsto \mathop {{{\mathrm{argmax}}}}\limits _{y} \,{\hat{p}}\big (y|x;x'_1,y'_1,\ldots ,x'_n,y'_n\big ). \end{aligned}$$

When we choose a kernel

$$\begin{aligned} K: {\mathcal {X}} \rightarrow [0,\infty ) \end{aligned}$$

and use the common kernel regression estimate we arrive at the kernel classification rule

$$\begin{aligned} x \mapsto \mathop {{{\mathrm{argmax}}}}\limits _{y} \sum _{i=1}^{n} K\left( \frac{x-x'_{i}}{h}\right) 1_{\{y'_{i}=y\}}. \end{aligned}$$

The asymptotic behaviour, as the size of the training sequence tends to infinity, has been thoroughly investigated for such classification rules and in particular for kernel classification rules. In the i.i.d. case or more generally under suitable mixing conditions, such procedures are risk consistent in the following sense: Kernel classification rules asymptotically achieve the minimal risk of misclassification for \({\mathcal {X}}= {\mathbb {R}}^d\) if the size n of the training sequence tends to \(\infty \) and \(h=h(n)\) satisfies \(h(n) \rightarrow 0\) and \(nh(n)^{d} \rightarrow \infty \). As remarked in the Introduction, this type of consistency follows from the consistency of the corresponding regression estimator, hence any result on regression consistency translates into a result on risk consistency.
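A minimal sketch of such a kernel classification rule without memory (here with a Gaussian kernel, as also used in the performance studies below; the training arrays are assumed to be given) reads:

```python
import numpy as np

def kernel_classify(x0, X_train, Y_train, h, classes):
    """Kernel classification rule: argmax_y sum_i K((x0 - X_i')/h) 1{Y_i' = y},
    with a Gaussian kernel K; X_train has shape (n, d), Y_train shape (n,)."""
    d2 = np.sum((X_train - x0) ** 2, axis=1)     # squared distances ||x0 - X_i'||^2
    weights = np.exp(-0.5 * d2 / h ** 2)         # K((x0 - X_i') / h)
    scores = [weights[Y_train == y].sum() for y in classes]
    return classes[int(np.argmax(scores))]

# usage with assumed, illustrative training data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
Y_train = rng.integers(0, 2, size=100)
print(kernel_classify(np.zeros(3), X_train, Y_train, h=1.0, classes=np.array([0, 1])))
```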

It is the aim of this section to discuss the applicability of kernel classification with memory in hidden Markov models. Assume that the training sequence \(X'_{1},Y'_{1},\ldots ,X'_{n},Y'_{n}\) is generated according to a hidden Markov model and that there is a sequence of observations

$$\begin{aligned} \ldots , X_{-l},X_{-l+1},\ldots , X_{0} \text{ to be classified } \end{aligned}$$

which stems from the stationary hidden Markov model with the same transition matrix and the same kernel and is stochastically independent of the training sequence. For the classification of \(X_{0} = x_{0}\) the usual kernel classification rule as described above would classify \(x_{0}\) as belonging to the class

$$\begin{aligned} \mathop {{{\mathrm{argmax}}}}\limits _{y} \sum _{i=1}^{n} K\left( \frac{x_{0}-X'_{i}}{h}\right) 1_{\{Y'_{i}=y\}}. \end{aligned}$$

This ignores the Markovian structure, so we want to use memory as in the optimal classification of the preceding Sect. 2. We propose the following procedure. Fix some memory \(l=1,2,\ldots \) prescribing the number of preceding observations used in the classification of the current one. Use a kernel

$$\begin{aligned} K : {\mathcal {X}}^{l+1} \rightarrow [0,\infty ) \end{aligned}$$

and, assuming a training sequence of size \(n+l\), classify observation \(x_{0}\) as originating from the class

$$\begin{aligned} \mathop {{{\mathrm{argmax}}}}\limits _{y} \sum _{i=1}^{n} K\left( \frac{1}{h} \big ((x_{-l},\ldots ,x_{0})-(X'_{i},\ldots ,X'_{i+l})\big )\right) 1_{\{Y'_{i+l}=y\}}. \end{aligned}$$

Compared to rules without memory the role of \(x_{0}\) is taken by \((x_{-l},\ldots ,x_{0})\) whereas the role of \((X'_{i},Y'_{i})\) is taken by \((X'_{i},Y'_{i},\ldots ,X'_{i+l},Y'_{i+l})\).
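A sketch of the rule with memory, obtained by stacking \(l+1\) consecutive observations as described above (the training sequence is assumed to be given as arrays X_train, Y_train of length \(n+l\)):

```python
import numpy as np

def kernel_classify_memory(x_window, X_train, Y_train, h, l, classes):
    """Kernel classification rule with memory l:
    x_window = (x_{-l}, ..., x_0); training blocks (X_i', ..., X_{i+l}')
    carry the label Y_{i+l}'. Gaussian kernel, X_train of shape (n + l, d)."""
    n = len(X_train) - l
    target = np.concatenate(x_window)                     # stacked block to classify
    blocks = np.stack([np.concatenate(X_train[i:i + l + 1]) for i in range(n)])
    labels = Y_train[l:l + n]                             # Y_{i+l}'
    d2 = np.sum((blocks - target) ** 2, axis=1)
    weights = np.exp(-0.5 * d2 / h ** 2)
    scores = [weights[labels == y].sum() for y in classes]
    return classes[int(np.argmax(scores))]
```

For \(l=0\) this reduces to the rule without memory sketched above.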

The approach we propose here leads to a risk consistent procedure for hidden Markov models, i.e. the risk converges to the corresponding Bayes risk when, for fixed l, the size of the training sample n tends to \(\infty \). The proof of this risk consistency adapts the methods of proof for the i.i.d. case to the Markov model we have here. We present the basic facts here and refer to Irle (1997) for a detailed treatment; see also (Györfi et al. 1989, Chap. 13).

The kernel K has to satisfy that for any y

$$\begin{aligned} \frac{E1_{\{Y=y\}}K\big (\frac{1}{k}[(x_{-l},\ldots ,x_{0})-(X_{-l},\ldots ,X_{0})]\big )}{EK\big (\frac{1}{k}[(x_{-l},\ldots ,x_{0})-(X_{-l},\ldots ,X_{0})]\big )}\rightarrow P\big (Y=y|X_{-l}=x_{-l},\ldots ,X_{0}=x_{0}\big ) \end{aligned}$$

for a.a. \((x_{-l},\ldots ,x_{0})\) as \(k\rightarrow 0\). Any kernel K such that \(K\ge 0\), K is bounded with bounded support, and there exist \(t_0,c>0\) such that \(K(z)\ge c\) for \(\Vert z\Vert \le t_0\), fulfills the above condition, see, e.g., (Devroye et al. 1996, 10.1). We call such a kernel regular.

Next note that we consider a uniformly ergodic transition matrix for our Markov chain. Looking at the hidden Markov model forward in time, the process \((X_n',Y_n')_{n\in {\mathbb {N}}}\) forms a Markov chain with state space \(\mathcal {X}\times M\) in discrete time. The stationary distribution for this process is given by

$$\begin{aligned} \pi '(A\times B)=\sum _{y\in B}Q(A,y)\pi \big (\{y\}\big ). \end{aligned}$$

It follows immediately that this process is again uniformly ergodic: with \(a>0,~0<b<1\) denoting the constants for the Y-process, it holds for all n that

$$\begin{aligned} \big \Vert P^{(X_n',Y_n')}-\pi '\big \Vert \le a\cdot b^n \end{aligned}$$

since

$$\begin{aligned} \big |P^{(X_n',Y_n')}(A\times B)-\pi '(A\times B)\big |&=\Big |\sum _{y\in B}Q(A,y)\big (P(Y_n'=y)-\pi (\{y\})\big )\Big |\\&\le \sum _{y\in B}\big |P(Y_n'=y)-\pi (\{y\})\big |\le \big \Vert P^{Y_n'}-\pi \big \Vert \le a\cdot b^n. \end{aligned}$$

In exactly the same manner, the process \((Z_n)_{n\in {\mathbb {N}}}=(X_n',Y_n',\ldots ,X_{n+l}',Y_{n+l}')\) is a Markov chain with state space \(\mathcal {Z}=(\mathcal {X}\times M)^{l+1}\) and stationary probability distribution \(\pi ^{(l)}\) given by

$$\begin{aligned} \pi ^{(l)}\Bigg (\prod _{i=1}^{l+1} (A_i\times B_i) \Bigg )&=\sum _{y_1\in B_1,\ldots ,y_{l+1}\in B_{l+1}}\left( \prod _{i=1}^{l+1}Q(A_i,y_i)\right) \pi (\{y_1\})p_{y_1,y_2} \ldots p_{y_l,y_{l+1}}, \end{aligned}$$

the p’s denoting the transition probabilities for the original chain. It is easily seen that this process is again uniformly ergodic where, with the same constants \(a,b\), we have for all n

$$\begin{aligned} \big \Vert P^{Z_n}-\pi ^{(l)}\big \Vert \le a\cdot b^n. \end{aligned}$$

Finally we note that any uniformly ergodic process is geometrically mixing in the sense that there exist \(\alpha >0,~0<\beta <1\) such that for all n

$$\begin{aligned} \big |Ef(Z_{i+n})g(Z_n)-Ef(Z_{i+n})Eg(Z_n)\big |\le \alpha \beta ^i \end{aligned}$$

for any measurable \(f,g:\mathcal {Z}\rightarrow {\mathbb {R}},~|f|,~|g|\le 1,\) and any initial distribution, see (Meyn and Tweedie 2012, Theorem 16.1.5).

Using the foregoing notations we can obtain the following result.

Theorem 3.1

Let K be regular and let \(h(n)>0, n=1,2,\ldots \) be such that \(h(n) \rightarrow 0\) and \(nh(n)^{d(l+1)} \rightarrow \infty \). Denote the risk of the kernel classification rule by \(L_{n}^{(l)}\). Then as \(n\rightarrow \infty \)

$$\begin{aligned} L_{n}^{(l)} \rightarrow R_l^*. \end{aligned}$$

Proof

The proof is based on the observation that classification is easier than regression function estimation, see (Devroye et al. 1996, 6.7). To adapt this to our setting fix \(y\in M\). Let for \((x_{-l},\ldots ,x_0)\in \mathcal X^{l+1}\)

$$\begin{aligned} \hat{p}_n(x_{-l},\ldots ,x_0)=\frac{\sum _{i=1}^nK\big (\frac{1}{h(n)}((x_{-l}, \ldots ,x_{0})-(X'_{i},\ldots ,X'_{i+l}))\big )1_{\{Y_{i+l}'=y\}}}{\sum _{i=1}^nK\big (\frac{1}{h(n)}((x_{-l},\ldots ,x_{0})-(X'_{i},\ldots ,X'_{i+l}))\big )}. \end{aligned}$$

This is the kernel regression function estimator of size n for

$$\begin{aligned} p(x_{-l},\ldots ,x_0)=P(Y_0=y|(X_{-l},\ldots ,X_0)=(x_{-l},\ldots ,x_0)), \end{aligned}$$

with corresponding kernel classification rule

$$\begin{aligned} \mathop {{{\mathrm{argmax}}}}\limits _y \sum _{i=1}^nK\left( \frac{1}{h(n)}\big ((x_{-l},\ldots ,x_{0})-(X'_{i}, \ldots ,X'_{i+l})\big )\right) 1_{\{Y_{i+l}'=y\}}. \end{aligned}$$

Now to show \(L_n^{(l)}\rightarrow R^*_l\) it is enough to show that as \(n\rightarrow \infty \)

$$\begin{aligned} \hat{p}_n(x_{-l},\ldots ,x_0)\rightarrow p(x_{-l},\ldots ,x_0) \end{aligned}$$
(1)

in probability for almost all \((x_{-l},\ldots ,x_0)\), see (Devroye et al. 1996, Theorem 6.5). For this we may apply (Irle 1997, Theorem 1) and the application to kernel regression estimators in ibid., Part 3, in particular the representation for \(\hat{p}_n\), p. 138. We then use uniform ergodicity, geometric mixing, regularity of the kernel, together with \(nh(n)^{d(l+1)} \rightarrow \infty \), to infer that the conditions to apply (Irle 1997, Theorem 1 (i)) are fulfilled. This then shows the assertion.

\(\square \)

Remark 3.2

  1. (i)

    The more complicated result of almost sure convergence \(\hat{p}_n(x_{-l},\ldots ,x_0)\rightarrow p(x_{-l},\ldots ,x_0)\) in (1) needs additional conditions; see (Irle 1997, Corollary 4), (Györfi et al. 1989, Chap. 13). But here, we only need convergence in probability.

  2. (ii)

    This method of using past information to construct a classification procedure seems generally applicable. We simply have to replace \(x_0\) by \((x_{-l},\ldots ,x_0)\) and the learning sequence \((X_n',Y_n')_n\) by \((X_n',Y_n',\ldots ,X_{n+l}',Y_{n+l}')_n\). E.g. for a nearest neighbor classification we would look for the nearest neighbor of \((x_{-l},\ldots ,x_0)\) among the blocks \((X_n',\ldots ,X_{n+l}')\) and use the corresponding \(Y_{n+l}'\) for classification; a sketch is given after this remark. To show consistency, we can proceed in the same way as for kernel classification using the nearest neighbor regression estimate, compare (Irle 1997, Part 4).
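A minimal sketch of the nearest neighbor rule with memory l mentioned in (ii), under the same assumptions on the training arrays as before:

```python
import numpy as np

def nn_classify_memory(x_window, X_train, Y_train, l):
    """1-nearest-neighbor rule with memory l: find the training block
    (X_i', ..., X_{i+l}') closest to (x_{-l}, ..., x_0) and return Y_{i+l}'."""
    n = len(X_train) - l
    target = np.concatenate(x_window)
    blocks = np.stack([np.concatenate(X_train[i:i + l + 1]) for i in range(n)])
    i_star = int(np.argmin(np.sum((blocks - target) ** 2, axis=1)))
    return Y_train[l + i_star]
```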

So asymptotically, as n tends to infinity, the kernel classification rule performs as well as the optimal rule of Sect. 2 and may be used as a typical nonparametric rule to test the usefulness of invoking preceding information.

From a practical point of view we now comment on the performance in simulations and in recognition problems for isolated handwritten letters, which points to a saving in misclassifications.

3.1 Performance studies

In the following we report on some typical results in our studies of the actual behaviour of the kernel classification rule as proposed in this paper. As a general experience we point out that memory \(l>1\) did not lead to significant improvement over \(l=1\) so that we only compare the cases \(l=0,l=1\).

(i) In simulations 1 and 2 we choose \(\ldots Y_{-1}, Y_{0},Y_{1},\ldots \) as a Markov chain with four states and transition probabilities

$$\begin{aligned} \left[ \begin{array}{cccc} 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 \\ 0.3 &{}\quad 0.7 &{}\quad 0 &{}\quad 0 \\ 0.7 &{}\quad 0.3 &{}\quad 0 &{}\quad 0 \end{array} \right] \end{aligned}$$

with stationary distribution (0.25, 0.25, 0.25, 0.25).
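Since this matrix is doubly stochastic, the uniform distribution is indeed stationary; a one-line check:

```python
import numpy as np

P = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.3, 0.7, 0.0, 0.0],
              [0.7, 0.3, 0.0, 0.0]])
pi = np.full(4, 0.25)
print(np.allclose(pi @ P, pi))   # True: (0.25, 0.25, 0.25, 0.25) is stationary
```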

In simulation 3 we choose \(Y_{1},Y_{2},\ldots \) i.i.d. following the stationary distribution. The \(X_{i}\)’s have a three-dimensional normal distribution with identical covariance matrix and mean vectors

$$\begin{aligned} \begin{array}{ll} (0,0,0) &\quad \text{ in simulations 1, 2, 3 for class 1},\\ (4,0,0) &\quad \text{ in simulations 1, 2, 3 for class 2},\\ (3.9,3.9,0) &\quad \text{ in simulation 1, } (4,4,0) \text{ in simulations 2, 3 for class 3},\\ (0,3.9,0) &\quad \text{ in simulation 1, } (3.8,3.8,0) \text{ in simulations 2, 3 for class 4}. \end{array} \end{aligned}$$

So there is good distinction between all classes in simulation 1 with easy classification, there is poor distinction between classes 3 and 4 in simulations 2 and 3. The following table gives the error rate for classification with size n of the training sequence in the first row. A normal kernel is used.

This shows that use of the Markov structure in simulation 2 through \(l=1\) leads to the possibility of distinguishing between classes 3 and 4. In the i.i.d. case of simulation 3 an appeal to memory of course does not help.

(ii) The classification of handwritten isolated capital letters was performed using kernel methods. Features were obtained by transforming handwritten letters into \(16 \times 16\) grey-value matrices. The learning sequence was obtained by merging samples from seven different persons.

The following typical error rates resulted from the classification of the word SAITE (German for ‘string’), where error rates are writer dependent. A normal kernel was used with \(h=1.0\) and \(h=0.25\).

Use of the Markov structure through \(l=1\) seems to lead to improved performance. Of course, the incorporation of memory can be applied to any procedure of pattern recognition. In particular we have also looked into nearest neighbor rules with memory l. Our findings have been similar to those for the kernel rule as discussed above and also advocate the use of memory \(l=1\).

4 Proofs for Section 2

Lemma 4.1

Let \(l \in {\mathbb {N}}_0\). Consider a hidden Markov model which starts in some time point \(-m < -l\) and let \(T \subseteq \{-m,-m+1,\ldots ,0\}\).

For any \( x_{t}\in {\mathcal {X}} ,\, t\in T, \) and \( i,\iota ,\kappa \in M \) we have

  1. (i)

    for \( -l\in T \)

    $$\begin{aligned} \frac{P(Y_{-l}=\iota |Y_{-l-1}=i, X_{t}=x_{t},t\in T)}{P(Y_{-l}=\kappa |Y_{-l-1}=i, X_{t}=x_{t},t\in T)}\le \alpha (x_{-l})\, , \end{aligned}$$
  2. (ii)

    for \( -l\notin T \)

    $$\begin{aligned} \frac{P(Y_{-l}=\iota |Y_{-l-1}=i,X_{t}=x_{t},t\in T)}{P(Y_{-l}=\kappa |Y_{-l-1}=i, X_{t}=x_{t},t\in T)}\le \alpha \, . \end{aligned}$$

Proof

We shall use the symbol f in a generic way to denote joint and conditional joint densities; also we use \( T_{\ge s}= \{t\in T:t\ge s\} \). From the properties of a hidden Markov model we obtain

$$\begin{aligned}&\frac{P(Y_{-l}=\iota |Y_{-l-1}=i,X_{t}=x_{t},t\in T)}{P(Y_{-l}=\kappa |Y_{-l-1}=i,X_{t}=x_{t},t\in T)}\\&\quad = \frac{P(Y_{-l}=\iota |Y_{-l-1}=i,X_{t}=x_{t},t\in T_{\ge -l})}{P(Y_{-l}=\kappa |Y_{-l-1}=i,X_{t}=x_{t},t\in T_{\ge -l})} \\&\quad = \frac{P(Y_{-l-1}=i,Y_{-l}=\iota |X_{t}=x_{t},t\in T_{\ge -l})}{P(Y_{-l-1}=i,Y_{-l}=\kappa |X_{t}=x_{t},t\in T_{\ge -l})}\\&\quad = \frac{\sum _{j}P(Y_{-l-1}=i,Y_{-l}=\iota ,Y_{-l+1}=j|X_{t}=x_{t},t\in T_{\ge -l})}{\sum _{j}P(Y_{-l-1}=i,Y_{-l}=\kappa ,Y_{-l+1}=j|X_{t}=x_{t},t\in T_{\ge -l})} \\&\quad = \frac{\sum _{j}P(Y_{-l-1}=i,Y_{-l}=\iota ,Y_{-l+1}=j)\; f(x_{t},t\in T_{\ge -l}|Y_{-l-1}=i,Y_{-l}=\iota ,Y_{-l+1}=j)}{\sum _{j}P(Y_{-l-1}=i,Y_{-l}=\kappa ,Y_{-l+1}=j)\; f(x_{t},t\in T_{\ge -l}| Y_{-l-1}=i,Y_{-l}=\kappa ,Y_{-l+1}=j)}\\&\quad = \frac{\sum _{j}P(Y_{-l-1}=i)\; p_{i\iota }\; p_{\iota j}\; f(x_{t},t\in T_{\ge -l}|Y_{-l-1}=i,Y_{-l}=\iota ,Y_{-l+1}=j)}{\sum _{j}P(Y_{-l-1}=i)\; p_{i\kappa }\; p_{\kappa j}\; f(x_{t},t\in T_{\ge -l}| Y_{-l-1}=i,Y_{-l}=\kappa ,Y_{-l+1}=j)}\\&\quad = \frac{p_{i\iota }\; \sum _{j}p_{\iota j}\; f(x_{t},t\in T_{\ge -l}|Y_{-l-1}=i,Y_{-l}=\iota ,Y_{-l+1}=j)}{p_{i\kappa }\; \sum _{j}p_{\kappa j}\; f(x_{t},t\in T_{\ge -l}| Y_{-l-1}=i,Y_{-l}=\kappa ,Y_{-l+1}=j)}\\&\quad = \frac{p_{i\iota }\; \sum _{j}p_{\iota j}\; f_{\iota }(x_{-l})\; f_{j}(x_{-l+1})\; f(x_{t},t\in T_{\ge -l+2}| Y_{-l+1}=j)}{p_{i\kappa }\; \sum _{j}p_{\kappa j}\; f_{\kappa }(x_{-l})\; f_{j}(x_{-l+1})\; f(x_{t},t\in T_{\ge -l+2}| Y_{-l+1}=j)}\, \, \, \, \, (*)\\&\quad \le \frac{p_{i\iota }\; f_{\iota }(x_{-l})}{p_{i\kappa }\; f_{\kappa }(x_{-l})}\; \max _{j}\frac{p_{\iota j}\; f_{j}(x_{-l+1})\; f(x_{t},t\in T_{\ge -l+2}| Y_{-l+1}=j)}{p_{\kappa j}\; f_{j}(x_{-l+1})\; f(x_{t},t\in T_{\ge -l+2}| Y_{-l+1}=j)}\\&\quad = \frac{p_{i\iota }\; f_{\iota }(x_{-l})}{p_{i\kappa }\; f_{\kappa }(x_{-l})}\; \max _{j}\frac{p_{\iota j}}{p_{\kappa j}}\\&\quad \le \max _{\iota ,\kappa ,i,j}\frac{p_{i\iota }\; p_{\iota j}\; f_{\iota }(x_{-l})}{p_{i\kappa }\; p_{\kappa j}\; f_{\kappa }(x_{-l})}\, .\\ \end{aligned}$$

Note that the term \( f_{j}(x_{-l+1}) \) in line \( (*) \) only appears for \( -l+1\in T \). In the second part of the assertion we have \( -l\notin T \) hence the terms \( f_{\iota }(x_{-l}) \) and \( f_{\kappa }(x_{-l}) \) do not appear in line \( (*) \). Thus the dependence on \( x_{-l} \) disappears. \(\square \)

As a consequence we obtain:

Corollary 4.2

Consider the situation of Lemma 4.1. Then

  1. (i)

    for \( -l\in T \)

    $$\begin{aligned} P\big (Y_{-l}=j|Y_{-l-1}=i,X_{t}=x_{t},t\in T \big )\ge \eta (x_{-l})>0\, , \end{aligned}$$
  2. (ii)

    for \( -l\notin T \)

    $$\begin{aligned} P\big (Y_{-l}=j|Y_{-l-1}=i,X_{t}=x_{t},t\in T \big )\ge \eta >0\, . \end{aligned}$$

Proof

Set \( \hat{\alpha }= \alpha (x_{-l})\) for \( -l\in T \) and \( \hat{\alpha }= \alpha \) for \( -l\notin T\). Lemma 4.1 implies

$$\begin{aligned}&\frac{1}{P\big (Y_{-l}=j|Y_{-l-1}=i,X_{t}=x_{t},t\in T \big )}\\&\quad = \sum _{k}\frac{P(Y_{-l}=k|Y_{-l-1}=i,X_{t}=x_{t},t\in T)}{P(Y_{-l}=j|Y_{-l-1}=i,X_{t}=x_{t},t\in T)}\\&\quad \le 1+(|M|-1)\hat{\alpha }\, , \end{aligned}$$

hence

$$\begin{aligned} P\big (Y_{-l}=j|Y_{-l-1}=i,X_{t}=x_{t},t\in T\big )\ge \big (1+(|M|-1)\hat{\alpha }\big )^{-1}\,. \end{aligned}$$

\(\square \)

We now give the proof of Theorem 2.2.

Proof

For \( A=M \) we have \( m_{l}^{+}-m_{l}^{-}=1-1=0 \). So assume \( A\ne M \). We use induction on l and shall apply Corollary 4.2.

Let \( l=1 \): Choose \( j\in A^{c} \).

$$\begin{aligned} m_{1}^{+}-m_{1}^{-}= & {} \max _{i\in M}P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-1}=i)-m_{1}^{-}\\= & {} 1-\min _{i\in M}P(Y_{0}\notin A|X_{-n}^{0}=x_{-n}^{0},Y_{-1}=i)-m_{1}^{-}\\\le & {} 1-\min _{i\in M}P(Y_{0}=j|X_{-n}^{0}=x_{-n}^{0},Y_{-1}=i)-m_{1}^{-}\\\le & {} 1-\hat{\eta }_{0}-\hat{\eta }_{0} = (1-2\hat{\eta }_{0})\, . \end{aligned}$$

The inductive step:

Assume that the assertion is true for \( l\in {\mathbb {N}}\). Let \(j^{+}\) and \(j^{-}\) be states at which the maximum and the minimum, respectively, are attained, i.e. \(m_{l}^{+}=P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l}=j^{+})\) and \(m_{l}^{-}=P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l}=j^{-})\). Then

$$\begin{aligned} m_{l+1}^{+}= & {} \max _{i\in M}P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\\= & {} \max _{i\in M}\Bigg \{\sum _{j\in M}P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l}=j,Y_{-l-1}=i)\\&\times P(Y_{-l}=j|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\Bigg \}\\= & {} \max _{i\in M}\Bigg \{\sum _{j\in M}P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l}=j)\\&\times \, P(Y_{-l}=j|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\Bigg \}\\ \end{aligned}$$
$$\begin{aligned}= & {} \max _{i\in M}\Bigg \{\sum _{j\ne j^{-}}P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l}=j)\\&\times P(Y_{-l}=j|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\\&+\,m_{l}^{-}\; P(Y_{-l}= j^{-} |X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\Bigg \} \\\le & {} \max _{i\in M}\Bigg \{\sum _{j\ne j^{-}} m_{l}^{+}\; P(Y_{-l}=j|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\\&+\, m_{l}^{-}\; P(Y_{-l}= j^{-} |X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\Bigg \} \\= & {} \max _{i\in M}\{m_{l}^{+}\; (1-P(Y_{-l}= j^{-}|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)) \\ \end{aligned}$$
$$\begin{aligned}&+\,m_{l}^{-}\; \underbrace{P(Y_{-l}= j^{-}|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)}_{\ge \hat{\eta }_{-l}}\}\\\le & {} \max _{i\in M}\{m_{l}^{+}(1-\hat{\eta }_{-l})+\,m_{l}^{-}\hat{\eta }_{-l}\}\\= & {} m_{l}^{+}(1-\hat{\eta }_{-l})+\,m_{l}^{-}\hat{\eta }_{-l}\, , \end{aligned}$$

Similarly

$$\begin{aligned} m_{l+1}^{-}= & {} \min _{i\in M}P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\\= & {} \min _{i\in M}\Bigg \{\sum _{j\in M}P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l}=j,Y_{-l-1}=i)\\&\times P(Y_{-l}=j|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\Bigg \} \\= & {} \min _{i\in M}\Bigg \{\sum _{j\in M}P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l}=j)\\&\times P(Y_{-l}=j|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\Bigg \} \\ \end{aligned}$$
$$\begin{aligned}= & {} \min _{i\in M}\Bigg \{\sum _{j\ne j^{+}}P(Y_{0}\in A|X_{-n}^{0}=x_{-n}^{0},Y_{-l}=j)\\&\times P(Y_{-l}=j|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\\&+\,m_{l}^{+}\; P(Y_{-l}=j^{+}|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\Bigg \}\\\ge & {} \min _{i\in M}\Bigg \{\sum _{j\ne j^{+}}m_{l}^{-} \; P(Y_{-l}=j|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\\&+\,m_{l}^{+}\; P(Y_{-l}= j^{+}|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)\Bigg \}\\ \end{aligned}$$
$$\begin{aligned}= & {} \min _{i\in M}\{m_{l}^{-}\; (1-P(Y_{-l}= j^{+}|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i))\\&+\,m_{l}^{+}\; \underbrace{P(Y_{-l}= j^{+}|X_{-n}^{0}=x_{-n}^{0},Y_{-l-1}=i)}_{\ge \hat{\eta }_{-l}}\} \\\ge & {} \min _{i\in M}\{m_{l}^{-}(1-\hat{\eta }_{-l})+\,m_{l}^{+}\hat{\eta }_{-l}\}\\= & {} m_{l}^{-}(1-\hat{\eta }_{-l})+\,m_{l}^{+}\hat{\eta }_{-l}\, . \end{aligned}$$

This implies

$$\begin{aligned} m_{l+1}^{+}-m_{l+1}^{-}\le & {} (1-2\hat{\eta }_{-l})m_{l}^{+}-(1-2\hat{\eta }_{-l})m_{l}^{-}\\\le & {} (1-2\hat{\eta }_{-l})\prod _{j=-l+1}^{0}(1-2\hat{\eta }_{j})\\= & {} \prod _{j=-l}^{0}(1-2\hat{\eta }_{j}). \end{aligned}$$

\(\square \)