
1 Introduction

We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes (MDPs), where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy. Off-policy learning is an important part of the reinforcement learning methodology [25] and has been studied in the areas of operations research and machine learning (see e.g., [3, 5, 6, 8,9,10,11, 17, 18, 20, 29]). Available algorithms, however, tend to have very high variances due to the use of importance sampling, an issue that limits their applicability in practice. The purpose of this paper is to introduce a new TD learning scheme that can help address this problem.

Our work is motivated by the recently proposed Retrace [15] and ABQ [12] algorithms, and by the earlier Tree-Backup algorithm [18]. These algorithms, as explained in [12], all try to use the \(\lambda \)-parameters of TD to curb the high variance issue in off-policy learning. In this paper we propose a new scheme of setting the \(\lambda \)-parameters of TD, based on generalized Bellman equations. Our scheme sets \(\lambda \) according to the eligibility trace iterates calculated in TD, thereby easily keeping those traces in a desired range. Compared to the previous works, this is a direct way to bound the traces in TD, and it is also more flexible, allowing much larger \(\lambda \) values for off-policy learning.

Generalized Bellman equations are a powerful tool. In classic MDP theory they have been used in some intricate optimality analyses. Their computational use, however, seems to have emerged primarily in the field of reinforcement learning (see [24], [1, Chap. 5.3] and [28] for related early and recent research). Like the earlier works [12, 15, 18, 28, 33], our work aims to employ this tool to make off-policy learning more efficient.

Our analyses of the new TD learning scheme focus on its theoretical side. Using Markov chain theory, we prove the ergodicity of the joint state and trace process under nonrestrictive conditions (see Theorem 2.1), and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both \(\lambda \) and the unique invariant probability measure of the state-trace process (see Theorem 3.1). These results not only lead immediately to a characterization of the convergence behavior of the least-squares based implementation of our scheme (see Corollary 2.1 and Remark 3.1), but also prepare the ground for further analysis of gradient-based implementations.

We note that due to the space limit, in this paper we can only give the ideas or outlines of our proofs. The full details will be given in the longer version of this paper, which will also include numerical examples that we do not cover here.

The rest of the paper is organized as follows. In Sect. 2, after a brief background introduction, we present our scheme of TD learning with bounded traces, and we establish the ergodicity of the joint state-trace process. In Sect. 3 we derive the generalized Bellman equation associated with our scheme.

2 Off-Policy TD Learning with Bounded Traces

2.1 Preliminaries

The off-policy learning problem we consider in this paper concerns two Markov chains on a finite state space \(\mathcal {S}= \{ 1, \ldots , N\}\). The first chain has transition matrix \(P\), and the second \(P^o\). Whatever physical mechanisms induce the two chains will be denoted by \(\pi \) and \(\pi ^o\), and referred to as the target policy and behavior policy, respectively. We can observe the second Markov chain; however, it is the performance of the system under the first Markov chain that we want to evaluate. Specifically, we consider a one-stage reward function \(r_\pi : \mathcal {S}\rightarrow \mathfrak {R}\) and an associated discounted total reward criterion with state-dependent discount factors \(\gamma (s) \in [0,1], s \in \mathcal {S}\). Let \(\varGamma \) denote the \(N \times N\) diagonal matrix with diagonal entries \(\gamma (s)\). We assume that \(P\) and \(P^o\) satisfy the following conditions:

Condition 2.1

(Conditions on the target and behavior policies).

(i) P is such that the inverse \((I - P\varGamma )^{-1}\) exists, and (ii) \(P^o\) is such that for all \(s,s' \in \mathcal {S}\), \(P^o_{ss'} = 0 \Rightarrow P_{ss'}=0\), and moreover, \(P^o\) is irreducible.

The performance of \(\pi \) is defined as the expected discounted total rewards for each initial state \(s \in \mathcal {S}\):

$$\begin{aligned} \textstyle {v_{\pi }(s) : = \mathbb {E}^\pi _s \left[ \, r_\pi (S_0) + \sum _{t=1}^{\infty } \gamma (S_{1})\, \gamma (S_{2}) \cdots \gamma (S_{t}) \cdot r_\pi (S_t) \right] ,} \end{aligned}$$
(2.1)

where the notation \(\mathbb {E}^\pi _s\) means that the expectation is taken with respect to (w.r.t.) the Markov chain \(\{S_t\}\) starting from \(S_0 =s\) and induced by \(\pi \) (i.e., with transition matrix P). The function \(v_\pi \) is well-defined under Condition 2.1(i). It is called the value function of \(\pi \), and by standard MDP theory (see e.g., [19]), we can write it in matrix/vector notation as

$$v_\pi = r_{\pi } + P\varGamma \, v_\pi , \qquad \text {i.e.}, \quad v_\pi = (I - P\varGamma )^{-1} r_\pi .$$

The first equation above is known as the Bellman equation (or dynamic programming equation) for a stationary policy.
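To make the matrix form concrete, the following minimal NumPy sketch solves \(v_\pi = (I - P\varGamma )^{-1} r_\pi \) and checks the Bellman equation on a small hypothetical three-state chain (the transition matrix, discount factors, and rewards below are illustrative choices, not taken from the paper):

```python
import numpy as np

P = np.array([[0.2, 0.5, 0.3],      # hypothetical target-policy transition matrix
              [0.4, 0.1, 0.5],
              [0.3, 0.3, 0.4]])
gamma = np.array([0.9, 0.8, 0.95])  # state-dependent discount factors gamma(s)
r_pi = np.array([1.0, 0.0, 2.0])    # one-stage rewards r_pi(s)

Gamma = np.diag(gamma)
# Condition 2.1(i): (I - P*Gamma) is assumed invertible.
v_pi = np.linalg.solve(np.eye(3) - P @ Gamma, r_pi)

# The Bellman equation v_pi = r_pi + P*Gamma*v_pi holds up to floating-point error.
assert np.allclose(v_pi, r_pi + P @ Gamma @ v_pi)
print(v_pi)
```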

We compute an approximation of \(v_\pi \) of the form \(v(s) = \phi (s)^\top \theta \), \(s \in \mathcal {S}\), where \(\theta \in \mathfrak {R}^n\) is a parameter vector and \(\phi (s)\) is an n-dimensional feature representation for each state s (\(\phi (s),\theta \) are column vectors and \(\text {}^\top \) stands for transpose). Data available for this computation are:

  (i) the Markov chain \(\{S_t\}\) with transition matrix \(P^o\) generated by \(\pi ^o\), and

  (ii) rewards \(R_t = r(S_t, S_{t+1})\) associated with state transitions, where the function r relates to \(r_\pi (s)\) as \(r_{\pi }(s) = \mathbb {E}^\pi _s [ r(s, S_1) ]\) for all \(s \in \mathcal {S}\).

To find a suitable parameter \(\theta \) for the approximation \(\phi (s)^\top \theta \), we use the off-policy TD learning scheme. Define \(\rho (s, s') = P_{ss'}/P^o_{ss'}\) (the importance sampling ratio); denote \(\rho _t = \rho (S_t, S_{t+1}), \gamma _t=\gamma (S_t)\). Given an initial \(e_0 \in \mathfrak {R}^n\), for each \(t \ge 1\), the eligibility trace vector \(e_t \in \mathfrak {R}^n\) and the scalar temporal-difference term \(\delta _t(v)\) for any approximate value function \(v: \mathcal {S}\rightarrow \mathfrak {R}\) are calculated according to

$$\begin{aligned} e_t&= \lambda _t \, \gamma _t \, \rho _{t-1} \, e_{t-1} + \phi (S_t), \end{aligned}$$
(2.2)
$$\begin{aligned} \delta _t(v)&= \rho _t \, \big ( R_{t} + \gamma _{t+1} v(S_{t+1}) - v(S_t) \big ). \end{aligned}$$
(2.3)

Here \(\lambda _t \in [0,1], t \ge 1\), are important parameters in TD learning, the choice of which we shall elaborate on shortly.

Using \(e_t\) and \(\delta _t\), a number of algorithms can be formed to generate a sequence of parameters \(\theta _t\) for approximate value functions. One such algorithm is LSTD [2, 29], which obtains \(\theta _t\) by solving the linear equation for \(\theta \in \mathfrak {R}^n\),

$$\begin{aligned} \textstyle { \tfrac{1}{t} \sum _{k=0}^{t-1} \, e_k \, \delta _k (v) = 0, \quad v = \varPhi \theta } \end{aligned}$$
(2.4)

(if it admits a solution), where \(\varPhi \) is a matrix with row vectors \(\phi (s)^\top , s \in \mathcal {S}\). LSTD updates the equation (2.4) iteratively by incorporating, one by one, the observations \((S_t, S_{t+1}, R_t)\) from each state transition. We will primarily discuss this algorithm in the paper, as its behavior can be characterized directly using our subsequent analyses of the joint state-trace process. As mentioned earlier, our analyses will also provide bases for analyzing other gradient-based TD algorithms [9, 10] using stochastic approximation theory. However, due to its complexity, that subject is better treated separately, not in the present paper.
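For concreteness, here is a minimal batch sketch of the LSTD step, assembling and solving the linear system implied by (2.4) from the observed transitions. It is illustrative only: the feature map phi, the ratio rho, the discount gamma, and the sequence of \(\lambda _t\) values are assumed to be supplied by the user (how we propose to choose \(\lambda _t\) is the subject of Sect. 2.2).

```python
import numpy as np

def lstd(transitions, lambdas, phi, rho, gamma, n):
    """Sketch of batch LSTD for the off-policy scheme (illustrative only).

    transitions: list of (S_t, S_{t+1}, R_t) observed under the behavior policy.
    lambdas:     lambdas[t] is the value of lambda_t used in (2.2), t = 1, 2, ...
    phi(s):      n-dimensional feature vector of state s.
    rho(s, s'):  importance sampling ratio P_{ss'} / P^o_{ss'}.
    gamma(s):    state-dependent discount factor.
    """
    A, b = np.zeros((n, n)), np.zeros(n)
    e = np.zeros(n)                  # eligibility trace; the first pass gives e_0 = phi(S_0)
    s_prev = None
    for t, (s, s_next, r) in enumerate(transitions):
        if s_prev is not None:
            e = lambdas[t] * gamma(s) * rho(s_prev, s) * e   # decay term of (2.2)
        e = e + phi(s)                                       # feature term of (2.2)
        # With v = Phi*theta, (2.3) gives
        #   e_t*delta_t(v) = rho_t*R_t*e_t - rho_t*e_t*(phi(S_t) - gamma(S_{t+1})*phi(S_{t+1}))^T theta,
        # so equation (2.4) is the linear system A*theta = b accumulated below.
        A += rho(s, s_next) * np.outer(e, phi(s) - gamma(s_next) * phi(s_next))
        b += rho(s, s_next) * r * e
        s_prev = s
    return np.linalg.solve(A, b)     # assumes A is nonsingular, i.e., (2.4) has a unique solution
```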

2.2 Our Choice of \(\lambda \)

We now come to the choices of \(\lambda _t\) in the trace iterates (2.2). For TD with function approximation, one often lets \(\lambda _t\) be a constant or a function of \(S_t\) [23, 25, 27]. If neither the behavior policy nor the \(\lambda _t\)’s are further constrained, \(\{e_t\}\) can have unbounded variances and is also unbounded in many natural situations (see e.g., [29, Sect. 3.1]), and this makes off-policy TD learning challenging. If we let the behavior policy be close enough to the target policy so that \(P^o \approx P\), then the variance can be reduced, but this is not a satisfactory solution, for it would seriously limit the applicability of off-policy learning.

Without restricting the behavior policy, the two recent works [12, 15] (as well as the closely related early work [18]) exploit state-dependent \(\lambda \)’s to control variance. Their choices of \(\lambda _t\) are such that \(\lambda _t \rho _{t-1} < 1\) for all t, so that the trace iterates \(e_t\) are made bounded, which can help reduce the variance of the iterates.

Our proposal, motivated by these prior works, is to set \(\lambda _t\) according to \(e_{t-1}\) directly, so that we can keep \(e_t\) in a desired range straightforwardly and, at the same time, allow a much larger range of values for the \(\lambda \)-parameters. As a simple example, if we use \(\lambda _t\) to scale the vector \(\gamma _t \rho _{t-1} e_{t-1}\) to be within a ball of some given radius, then \(e_t\) always remains bounded.

In the rest of this paper, we shall focus on analyzing the iteration (2.2) with a particular choice of \(\lambda _t\) of the kind just mentioned. We want to be more general than the preceding simple example, while retaining certain Markovian properties that are very useful for convergence analysis. This leads us to let \(\lambda _t\) be a function of the previous trace and a certain memory state that summarizes the states observed so far. The formulation is as follows.

Denote the memory state at time t by \(y_t\). For simplicity, we assume that \(y_t\) can only take values from a finite set \(\mathcal {M}\), and its evolution is Markovian: \(y_t = g(y_{t-1}, S_t)\) for some given function g. The joint process \(\{(S_t, y_t)\}\) is then a simple finite-state Markov chain. Each \(y_t\) is a function of \((S_0, \ldots , S_t)\) and \(y_0\). We further require, besides the irreducibility of \(\{S_t\}\) (cf. Condition 2.1(ii)), that

Condition 2.2

(Evolution of memory states). Under the behavior policy \(\pi ^o\), the Markov chain \(\{(S_t, y_t)\}\) on \(\mathcal {S}\times \mathcal {M}\) has a single recurrent class.

Thus we let \(y_t\) and \(\lambda _t\) evolve as

$$\begin{aligned} y_t = g(y_{t-1}, S_t), \qquad \lambda _t=\lambda (y_t, e_{t-1}) \end{aligned}$$
(2.5)

where \(\lambda : \mathcal {M}\times \mathfrak {R}^n\rightarrow [0,1]\). We require the function \(\lambda \) to satisfy two conditions.

Condition 2.3

(Conditions for \(\lambda \)). For some norm \(\Vert \cdot \Vert \) on \(\mathfrak {R}^n\), the following hold for each memory state \(y \in \mathcal {M}\):

  (i) For any \(e, e' \in \mathfrak {R}^n\), \(\Vert \lambda (y, e) \, e- \lambda (y, e') \, e' \Vert \le \Vert e- e'\Vert \).

  (ii) For some constant \(C_y\), \(\Vert \gamma (s') \rho (s, s') \cdot \lambda (y, e) \, e\Vert \le C_y\) for all possible state transitions \((s, s')\) that can lead to the memory state y.

In the above, the second condition is to restrict \(\{e_t\}\) in a desired range (as it makes \(\Vert e_t\Vert \le \max _{y \in \mathcal {M}} C_y + \max _{s \in \mathcal {S}} \Vert \phi (s)\Vert \)). The first condition is to ensure that the traces \(e_t\) jointly with \((S_t, y_t)\) form a Markov chain with nice properties (as will be seen in the next subsection).

Consider the simple scaling example mentioned earlier. In this case we can let \(y_t = (S_{t-1}, S_t)\), and for each \(y = (s,s')\), define \(\lambda (y, \cdot )\) to scale back the vector \(\gamma (s') \rho (s,s') \, e\) when it is outside the Euclidean ball with radius \(C_{ss'}\): \(\lambda \big (y, e\big ) = 1\) if \(\gamma (s') \rho (s, s') \Vert e\Vert _2 \le C_{ss'}\); and \(\lambda \big (y, e\big ) = \tfrac{C_{ss'}}{\gamma (s') \rho (s,s') \Vert e\Vert _2}\) otherwise.
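In code, this scaling rule can be sketched as follows (a minimal sketch; gamma, rho, and the radius function C, giving \(C_{ss'}\), are assumed to be supplied, and the returned function can be used as \(\lambda \) in (2.5)):

```python
import numpy as np

def make_scaling_lambda(gamma, rho, C):
    """Return the lambda function of the scaling example (a sketch)."""
    def lam(y, e):
        s, s_next = y                                # memory state y_t = (S_{t-1}, S_t)
        a = gamma(s_next) * rho(s, s_next) * np.linalg.norm(e)
        # By construction ||gamma(s') rho(s,s') lam(y,e) e||_2 <= C_{ss'}, so Condition 2.3(ii)
        # holds; Condition 2.3(i) also holds, since for each fixed y the map e -> lam(y,e) e
        # is the Euclidean projection of e onto a ball.
        return 1.0 if a <= C(s, s_next) else C(s, s_next) / a
    return lam
```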

2.3 Ergodicity Result

The properties of the joint state-trace process \(\{(S_t, y_t, e_t)\}\) are important for understanding and characterizing the behavior of the proposed TD learning scheme. We study them in this subsection; most importantly, we shall establish the ergodicity of the state-trace process. The result will be useful in convergence analysis of several associated TD algorithms, although in this paper we discuss only the LSTD algorithm. In the next section we will also use the ergodicity result when we relate the LSTD equation (2.4) to a generalized Bellman equation for the target policy, which will then make the meaning of the LSTD solutions clear.

As a side note, one can introduce nonnegative coefficients i(y) for memory states y to weight the state features (similarly to the use of “interest” weights in the ETD algorithm [26]) and update \(e_t\) according to

$$\begin{aligned} e_t = \lambda _t \, \gamma _t \, \rho _{t-1} \, e_{t-1} + i(y_t) \, \phi (S_t). \end{aligned}$$
(2.6)

The results given below apply to this update rule as well.

Let us start with two basic properties of \(\{(S_t, y_t, e_t)\}\) that follow directly from our choice of the \(\lambda \) function:

  (i) By Condition 2.3(i), for each y, \(\lambda (y, e) e\) is a continuous function of \(e\), and thus \(e_t\) depends continuously on \(e_{t-1}\). This, together with the finiteness of \(\mathcal {S}\times \mathcal {M}\), ensures that \(\{(S_t, y_t, e_t)\}\) is a weak Feller Markov chain.

  (ii) Then, by a property of weak Feller Markov chains [14, Theorem 12.1.2(ii)], the boundedness of \(\{e_t\}\) ensured by Condition 2.3(ii) implies that \(\{(S_t, y_t, e_t)\}\) has at least one invariant probability measure.

The third property, given in the lemma below, concerns the behavior of \(\{e_t\}\) for different initial \(e_0\). It is an important implication of Condition 2.3(i) (indeed, it is the reason we introduced Condition 2.3(i) in the first place). Due to the space limit, we omit the proof, which is similar to the proof of [29, Lemma 3.2].

Lemma 2.1

Let \(\{e_t\}\) and \(\{\hat{e}_t\}\) be generated by the iteration (2.2) and (2.5), using the same trajectory of states \(\{S_t\}\) and initial \(y_0\), but with different initial \(e_0\) and \(\hat{e}_0\), respectively. Then under Conditions 2.1(i) and 2.3(i), \(e_t - \hat{e}_t \overset{a.s.}{\rightarrow }0\).

We use the preceding lemma and ergodicity properties of weak Feller Markov chains [13] to prove the ergodicity theorem given below (for lack of space, we again omit the proof). Before stating this result, we note that for \(\{(S_t, y_t, e_t)\}\) starting from the initial condition \(x = (s, y, e)\), the occupation probability measures \(\{\mu _{x,t}\}\) are random probability measures on \(\mathcal {S}\times \mathcal {M}\times \mathfrak {R}^n\) given by

$$\textstyle {\mu _{x,t}(D): = \frac{1}{t} \sum _{k=0}^{t-1} \mathbb {1}\big ((S_k, y_k, e_k) \in D \big )}$$

for all Borel sets \(D \subset \mathcal {S}\times \mathcal {M}\times \mathfrak {R}^n\), where \(\mathbb {1}(\cdot )\) is the indicator function. We write \(\mathbf {P}_x\) for the probability distribution of \(\{(S_t, y_t, e_t)\}\) with initial condition x.

Theorem 2.1

Under Conditions 2.1–2.3, \(\{(S_t, y_t, e_t)\}\) is a weak Feller Markov chain and has a unique invariant probability measure \(\zeta \). For each initial condition \((S_0, y_0, e_0) = (s,y,e) =: x\), the occupation probability measures \(\{\mu _{x,t}\}\) converge weakly to \(\zeta \), \(\mathbf {P}_x\)-almost surely.

Let \(\mathbb {E}_\zeta \) denote expectation w.r.t. the stationary state-trace process \(\{\!(S_t, y_t, e_t)\!\}\) with initial distribution \(\zeta \). Since the traces and hence the entire process lie in a bounded set under Condition 2.3(ii), the weak convergence of \(\{\mu _{x,t}\}\) to \(\zeta \) implies that the sequence of equations, \( \tfrac{1}{t} \sum _{k=0}^{t-1} \, e_k \, \delta _k (v) = 0\), as given in (2.4) for LSTD, has an asymptotic limit that can be expressed in terms of the stationary state-trace process as follows.

Corollary 2.1

Let Conditions 2.1–2.3 hold. Then for each initial condition of \((S_0, y_0, e_0)\), almost surely, the first equation in (2.4), viewed as a linear equation in v, tends to the equation \(\mathbb {E}_\zeta [e_0 \delta _0(v)] = 0\) in the limit as \(t \rightarrow \infty \).

3 Generalized Bellman Equations

In this section we continue the analysis started in Sect. 2.3. Our goal is to relate the linear equation \(\mathbb {E}_\zeta [e_0 \delta _0(v)] = 0\), the asymptotic limit of the linear equation (2.4) for LSTD as just shown by Corollary 2.1, to a generalized Bellman equation for the target policy \(\pi \). Then, we can interpret solutions of (2.4) as solutions of approximate versions of that generalized Bellman equation.

To simplify notation in subsequent derivations, we shall use the following shorthand notation: For \(k \le m\), denote \(S_k^m = (S_k, S_{k+1}, \ldots S_m)\), and denote

$$\begin{aligned} \textstyle {\rho _k^m = \prod _{i=k}^m \rho _i, \qquad \lambda _k^m = \prod _{i=k}^m \lambda _i, \qquad \gamma _k^m = \prod _{i=k}^m \gamma _i,} \end{aligned}$$
(3.1)

where by convention we treat \(\rho _k^m = \lambda _k^m = \gamma _k^m = 1\) if \(k > m\).

3.1 Randomized Stopping Times

Consider the Markov chain \(\{S_t\}\) induced by the target policy \(\pi \). Let Condition 2.1(i) hold. Recall that for the value function \(v_\pi \), we have

$$v_{\pi }(s) = \textstyle {\mathbb {E}^\pi _s \big [ \sum _{t=0}^\infty \gamma _1^t\, r_\pi (S_t) \big ]} \ \ \text {(by definition),}\ \ \ \text {and} \ \ v_\pi (s) = r_\pi (s) + \mathbb {E}_s^\pi [ \gamma _1 v_\pi (S_1)] $$

for each state s. The second equation is the standard one-step Bellman equation.

To write generalized Bellman equations for \(\pi \), we need the notion of randomized stopping times for \(\{S_t\}\). They generalize stopping times for \(\{S_t\}\) in that whether to stop at time t depends not only on \(S_0^t\) but also on certain random outcomes. A simple example is to toss a coin at each time and stop as soon as the coin lands on heads, regardless of the history \(S_0^t\). (The corresponding Bellman equation is the one associated with TD(\(\lambda \)) for a constant \(\lambda \).) Of interest here is the general case where the stopping decision does depend on the entire history.

To define a randomized stopping time formally, first, the probability space of \(\{S_t\}\) is enlarged to take into account whatever randomization scheme is used to make the stopping decision. (The enlargement will be problem-dependent, as the next subsection will demonstrate.) Then, on the enlarged space, a randomized stopping time \(\tau \) for \(\{S_t\}\) is by definition a stopping time relative to some increasing sequence of sigma-algebras \(\mathcal {F}_0 \subset \mathcal {F}_1 \subset \cdots \), where the sequence \(\{\mathcal {F}_t\}\) is such that (i) for all \(t \ge 0\), \(\mathcal {F}_t \supset \sigma (S_0^t)\) (the sigma-algebra generated by \(S_0^t\)), and (ii) w.r.t. \(\{\mathcal {F}_t\}\), \(\{S_t\}\) remains a Markov chain with transition matrix \(P\), i.e., \(\text {Prob}(S_{t+1} =s \mid \mathcal {F}_t) = P_{S_t s}\). (See [16, Chap. 3.3].)

The advantage of this abstract definition is that it allows us to write Bellman equations in general forms without worrying about the details of the enlarged space which are not important at this point. For notational simplicity, we shall still use \(\mathbb {E}^\pi \) to denote expectation for the enlarged probability space and write \(\mathbf {P}^\pi \) for the probability measure on that space, when there is no confusion.

If \(\tau \) is a randomized stopping time for \(\{S_t\}\), the strong Markov property [16, Theorem 3.3] allows us to express \(v_\pi \) in terms of \(v_\pi (S_\tau )\) and the total discounted rewards \(R^\tau \) prior to stopping:

$$\begin{aligned} v_\pi (s)&= \textstyle { \mathbb {E}^\pi _s \left[ \sum _{t=0}^{\tau -1} \gamma _1^t \, r_\pi (S_t) + \sum _{t=\tau }^\infty \gamma _1^{\tau } \cdot \gamma _{\tau +1}^t \, r_\pi (S_t) \right] } \nonumber \\&= \mathbb {E}^\pi _s \big [ R^{\tau } + \gamma _1^{\tau } \, v_{\pi }(S_{\tau }) \big ], \end{aligned}$$
(3.2)

where \(R^\tau = \sum _{t=0}^{\tau -1} \gamma _1^t \, r_\pi (S_t)\) for \(\tau \in \{0, 1, 2, \ldots \} \cup \{+\infty \}\).

We can also write the Bellman equation (3.2) in terms of \(\{S_t\}\) only, by taking expectation over \(\tau \):

$$\begin{aligned}&\quad v_\pi (s) = \textstyle { \mathbb {E}^\pi _s \left[ \sum _{t=0}^\infty \Big ( q^+_{t}(S_0^t) \cdot \gamma _1^t \, r_\pi (S_t) + q_{t}(S_0^t) \cdot \gamma _1^t \, v_\pi (S_t) \Big ) \right] ,} \end{aligned}$$
(3.3)
$$\begin{aligned}&\text {where} \quad q^+_{t}(S_0^t) = \mathbf {P}^\pi ( \tau > t \mid S_0^t), \qquad q_{t}(S_0^t) = \mathbf {P}^\pi ( \tau = t \mid S_0^t). \end{aligned}$$
(3.4)

The r.h.s. of (3.2) or (3.3) defines an associated generalized Bellman operator \(T : \mathfrak {R}^{N} \rightarrow \mathfrak {R}^{N}\) that has several equivalent expressions; e.g., for all \(s \in \mathcal {S}\),

$$\begin{aligned} (Tv)(s) = \textstyle { \mathbb {E}^\pi _s \big [ R^{\tau } + \gamma _1^{\tau } \, v(S_{\tau }) \big ]} = \textstyle { \mathbb {E}^\pi _s \left[ \sum _{t=0}^\infty \Big ( q^+_{t}(S_0^t) \cdot \gamma _1^t \, r_\pi (S_t) + q_{t}(S_0^t) \cdot \gamma _1^t\, v(S_t) \Big ) \right] .} \end{aligned}$$

If \(\tau \ge 1\) a.s., then, as in the case of the one-step Bellman operator, the value function \(v_\pi \) is the unique fixed point of T, i.e., the unique solution of \(v = T v\).
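For instance, in the coin-toss example above with a constant \(\lambda \in [0,1)\) and tosses made from time 1 onward (so that \(\tau \ge 1\)), we have \(q^+_{t}(S_0^t) = \lambda ^t\) and \(q_{t}(S_0^t) = \lambda ^{t-1} (1-\lambda )\) for \(t \ge 1\), and T reduces to the familiar TD(\(\lambda \)) operator (here with state-dependent discount factors):

$$(Tv)(s) = \textstyle { \mathbb {E}^\pi _s \big [ \sum _{t=0}^\infty \lambda ^t \gamma _1^t \, r_\pi (S_t) + \sum _{t=1}^\infty \lambda ^{t-1} (1 - \lambda ) \, \gamma _1^{t} \, v(S_{t}) \big ] }.$$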

3.2 Bellman Equation for the Proposed TD Learning Scheme

With the terminology of randomized stopping times, we are now ready to write down the generalized Bellman equation associated with the TD-learning scheme proposed in Sect. 2.2. It corresponds to a particular randomized stopping time. We shall first describe this random time, from which a generalized Bellman equation follows as seen in the preceding subsection. That this is indeed the Bellman equation for our TD learning scheme will then be proved.

Consider the Markov chain \(\{S_t\}\) under the target policy \(\pi \). We define a randomized stopping time \(\tau \) for \(\{S_t\}\):

  • Let \(y_t, \lambda _t, e_t, t \ge 1,\) evolve according to (2.5) and (2.2).

  • Let the initial \((S_0, y_0, e_0)\) be distributed according to \(\zeta \), the unique invariant probability measure in Theorem 2.1.

  • At time \(t \ge 1\), we stop the system with probability \(1 - \lambda _t\) if it has not yet been stopped. Let \(\tau \) be the time when the system stops (\(\tau = \infty \) if the system never stops).

To make the dependence on the initial distribution \(\zeta \) explicit, we write \(\mathbf {P}^\pi _\zeta \) for the probability measure of this process.
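To illustrate the construction, the following minimal simulation sketch draws a sample of \(\tau \) (illustrative only, not part of our analysis): the step samplers for the two chains and the functions g, \(\lambda \), \(\phi \), \(\rho \), \(\gamma \) are assumed to be supplied, and drawing \((S_0, y_0, e_0)\) approximately from \(\zeta \) is done heuristically by a long burn-in run under the behavior policy, as suggested by the ergodicity in Theorem 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tau(behavior_step, target_step, g, lam, phi, rho, gamma,
               s0, y0, e0, burn_in=10_000, t_max=100_000):
    # Burn-in under the behavior policy to draw (S_0, y_0, e_0) approximately from zeta.
    s, y, e = s0, y0, e0
    for _ in range(burn_in):
        s_next = behavior_step(s)                    # one step of the behavior-policy chain
        y_next = g(y, s_next)                        # memory-state update (2.5)
        e = lam(y_next, e) * gamma(s_next) * rho(s, s_next) * e + phi(s_next)  # trace update (2.2)
        s, y = s_next, y_next
    # Run the target-policy chain; at each t >= 1 stop with probability 1 - lambda_t.
    for t in range(1, t_max + 1):
        s_next = target_step(s)                      # one step of the target-policy chain
        y_next = g(y, s_next)
        lam_t = lam(y_next, e)                       # lambda_t = lambda(y_t, e_{t-1})
        if rng.random() > lam_t:                     # stop with probability 1 - lambda_t
            return t
        e = lam_t * gamma(s_next) * rho(s, s_next) * e + phi(s_next)
        s, y = s_next, y_next
    return np.inf                                    # did not stop within the horizon
```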

Note that by definition \(\lambda _t\) and \(\lambda _1^t\) are functions of the initial \((y_0, e_0)\) and states \(S_0^t\). From how the random time \(\tau \) is defined, we have for all \(t \ge 1\),

$$\begin{aligned} \mathbf {P}^\pi _\zeta ( \tau > t \mid S_0^t, y_0, e_0)&= \lambda _1^t = : h^+_{t}(y_0, e_0, S_0^t), \end{aligned}$$
(3.5)
$$\begin{aligned} \mathbf {P}^\pi _\zeta ( \tau = t \mid S_0^t, y_0, e_0)&= \lambda _1^{t-1} (1 - \lambda _t) = : h_t(y_0, e_0, S_0^t), \end{aligned}$$
(3.6)

and hence

$$\begin{aligned} q^+_{t}(S_0^t)&: = \mathbf {P}^\pi _\zeta ( \tau > t \mid S_0^t) = \int h^+_{t}(y, e, S_0^t) \, \zeta \big (d(y, e) \mid S_0 \big ), \end{aligned}$$
(3.7)
$$\begin{aligned} q_{t}(S_0^t)&: = \mathbf {P}^\pi _\zeta ( \tau = t \mid S_0^t) = \int h_t(y, e, S_0^t) \, \zeta \big (d(y, e) \mid S_0 \big ), \end{aligned}$$
(3.8)

where \(\zeta (d(y,e) \mid s)\) is the conditional distribution of \((y_0, e_0)\) given \(S_0 = s\), w.r.t. the initial distribution \(\zeta \). As before, we can write the generalized Bellman operator T associated with \(\tau \) in several equivalent forms. Let \(\mathbb {E}^\pi _\zeta \) denote expectation under \(\mathbf {P}^\pi _\zeta \). Based on (3.2) and (3.5)–(3.6), it is easy to derive that for all \(v: \mathcal {S}\rightarrow \mathfrak {R}, s \in \mathcal {S}\),

$$\begin{aligned} (Tv)(s) = \textstyle { \mathbb {E}^\pi _\zeta \left[ \sum _{t=0}^\infty \lambda _1^t \gamma _1^t \, r_\pi (S_t) + \sum _{t=1}^\infty \lambda _1^{t-1} (1 - \lambda _t) \gamma _1^{t} \, v(S_{t}) \mid S_0 = s\right] }. \end{aligned}$$
(3.9)

Alternatively, by integrating over \((y_0, e_0)\) and using (3.7)–(3.8), we can write

$$\begin{aligned} (Tv)(s) = \textstyle { \mathbb {E}^\pi _\zeta \left[ \sum _{t=0}^\infty \Big ( q^+_{t}(S_0^t) \cdot \gamma _1^t \, r_\pi (S_t) + q_{t}(S_0^{t}) \cdot \gamma _1^{t} \, v(S_{t}) \Big ) \, \big | \, S_0 = s \right] }, \end{aligned}$$
(3.10)

for all \(v: \mathcal {S}\rightarrow \mathfrak {R}, s \in \mathcal {S}\), where in the case \(t=0\), \(q^+_0(\cdot ) \equiv 1 = \mathbf {P}^\pi _\zeta ( \tau > 0 \mid S_0)\) and \(q_0(\cdot ) \equiv 0 = \mathbf {P}^\pi _\zeta ( \tau = 0 \mid S_0)\) (since \(\tau > 0\) by construction).

Comparing the two expressions of T, we remark that the expression (3.9) reflects the role of the \(\lambda _t\)’s in determining the stopping time, whereas the expression (3.10), which has eliminated the auxiliary memory states \(y_t\), shows more clearly the dependence of the stopping time on the entire history \(S_0^t\). It can also be seen from the initial distribution \(\zeta \) that the behavior policy plays a significant role in determining the Bellman operator T for the target policy. This is in contrast with off-policy TD learning that uses a constant \(\lambda \), where the behavior policy affects only how one approximates the Bellman equation underlying TD, not the Bellman equation itself.

We now proceed to show how the Bellman equation \(v=T v\) given above relates to the off-policy TD learning scheme in Sect. 2.2. Some notation is needed. Denote by \(\zeta _\mathcal {S}\) the marginal of \(\zeta \) on \(\mathcal {S}\). Note that \(\zeta _\mathcal {S}\) coincides with the invariant probability measure of the Markov chain \(\{S_t\}\) induced by the behavior policy. For two functions \(v_1, v_2\) on \(\mathcal {S}\), we write \(v_1 \perp _{\zeta _\mathcal {S}} v_2\) if \(\sum _{s \in \mathcal {S}} \zeta _\mathcal {S}(s) \, v_1(s) \, v_2(s) = 0\). If \(\mathcal {L}\) is a linear subspace of functions on \(\mathcal {S}\) and \(v \perp _{\zeta _\mathcal {S}} v'\) for all \(v' \in \mathcal {L}\), we write \(v \perp _{\zeta _\mathcal {S}} \mathcal {L}\). Recall that \(\phi \) is a function that maps each state s to an n-dimensional feature vector. Denote by \(\mathcal {L}_\phi \) the subspace spanned by the n component functions of \(\phi \), which is the space of approximate value functions for our TD learning scheme. Recall also that \(\mathbb {E}_\zeta \) denotes expectation w.r.t. the stationary state-trace process \(\{(S_t, y_t, e_t)\}\) under the behavior policy (cf. Theorem 2.1).

Theorem 3.1

Let Conditions 2.1–2.3 hold. Then, as a linear equation in v, \(\mathbb {E}_\zeta \big [ e_0 \, \delta _0(v) \big ] = 0\) is equivalent to \(T v - v \perp _{\zeta _\mathcal {S}} \mathcal {L}_\phi ,\) where T is the generalized Bellman operator for \(\pi \) given in (3.9) or (3.10).

Remark 3.1

(On LSTD). Note that \(T v - v \perp _{\zeta _\mathcal {S}} \mathcal {L}_\phi , v \in \mathcal {L}_\phi \) is a projected version of the generalized Bellman equation \(T v - v =0\) (projecting the l.h.s. onto the approximation subspace \(\mathcal {L}_\phi \) w.r.t. the \(\zeta _\mathcal {S}\)-weighted Euclidean norm). Theorem 3.1 and Corollary 2.1 together show that this is what LSTD solves in the limit. If this projected Bellman equation admits a unique solution \(\bar{v}\), then the approximation error \(\bar{v} - v_\pi \) can be characterized as in [22, 32].
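In matrix notation, writing \(\varXi \) for the diagonal matrix with diagonal entries \(\zeta _\mathcal {S}(s), s \in \mathcal {S}\), and assuming \(\varPhi \) has full column rank, this projected Bellman equation can be written as

$$\varPhi \theta = \varPi \, T (\varPhi \theta ), \qquad \text {where} \quad \varPi := \varPhi \, (\varPhi ^\top \varXi \varPhi )^{-1} \varPhi ^\top \varXi $$

is the matrix of projection onto \(\mathcal {L}_\phi \) w.r.t. the \(\zeta _\mathcal {S}\)-weighted Euclidean norm.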

Proof

(outline). We divide the proof into three parts. The first part is more subtle than the other two, which are mostly calculations. Due to the space limit, we can only outline the proof here, leaving out the details of some arguments.

(i) We extend the stationary state-trace process to \(t = -1\), \(-2, \ldots \) and work with a double-ended stationary process \(\{(S_t, y_t, e_t)\}_{- \infty< t < \infty }\) (such a process exists by Kolmogorov’s theorem [4, Theorem 12.1.2]). We keep using the notation \(P_\zeta \) and \(\mathbb {E}_\zeta \) for this double-ended stationary Markov chain. Then, by unfolding the iteration (2.2) for \(e_t\) backwards in time, we show that

$$\begin{aligned} e_0 = \phi (S_0) + \textstyle { \sum _{t=1}^{\infty } \lambda _{1-t}^0 \gamma _{1-t}^0 \rho _{-t}^{-1} \, \phi (S_{-t})} \quad P_\zeta \mathrm{-a.s.,} \end{aligned}$$
(3.11)

or, with \(\lambda _1^0 =\rho _0^{-1}=1\) by convention, we can write \(e_0 = \sum _{t=0}^{\infty } \lambda _{1-t}^0 \gamma _{1-t}^0 \rho _{-t}^{-1} \, \phi (S_{-t})\) \(P_\zeta \)-a.s. The proof of (3.11) uses the stationarity of the process, Condition 2.1(i), and a theorem on integration [21, Theorem 1.38], among other results.

(ii) Using the expression (3.11) of \(e_0\), we calculate \(\mathbb {E}_\zeta \big [ e_0 \cdot \rho _0 f(S_0^1) \big ]\) for any bounded measurable function f on \(\mathcal {S}\times \mathcal {S}\). In particular, we first obtain

$$\begin{aligned} \mathbb {E}_\zeta \big [ e_0 \cdot \rho _0 f(S_0^1) \big ]&= \textstyle { \sum _{t=0}^\infty \mathbb {E}_\zeta \Big [ \phi (S_0) \cdot \mathbb {E}_\zeta \left[ \lambda _{1}^t \gamma _{1}^t \rho _{0}^{t} \, f(S_t^{t+1}) \mid S_0 \right] \Big ]} \end{aligned}$$
(3.12)

by using (3.11) and the stationarity of the state-trace process. Next we relate the expectations in the summation in (3.12) to expectations w.r.t. the process with probability measure \(\mathbf {P}^\pi _\zeta \), which we recall is induced by the target policy \(\pi \) and introduced at the beginning of this subsection. Let \(\tilde{\mathbb {E}}^\pi _\zeta \) denote expectation w.r.t. the marginal of \(\mathbf {P}^\pi _\zeta \) on the space of \(\{(S_t, y_t, e_t)\}_{t \ge 0}\). From the change of measure performed through \(\rho _{0}^{t}\), we have

$$\begin{aligned} \mathbb {E}_\zeta \left[ \lambda _{1}^t \gamma _{1}^t \rho _{0}^{t} \, f(S_t^{t+1}) \mid S_0, y_0, e_0 \right] = \tilde{\mathbb {E}}^\pi _\zeta \left[ \lambda _{1}^t \gamma _{1}^t \, f(S_t^{t+1}) \mid S_0, y_0, e_0 \right] , \quad t \ge 0. \end{aligned}$$
(3.13)

Combining this with (3.12) and using the fact that \(\zeta \) is the marginal distribution of \((S_0, y_0, e_0)\) in both processes, we obtain

$$\begin{aligned} \mathbb {E}_\zeta \big [ e_0 \cdot \rho _0 f(S_0^1) \big ] = \textstyle { \sum _{t=0}^\infty \tilde{\mathbb {E}}^\pi _\zeta \Big [ \phi (S_0) \cdot \tilde{\mathbb {E}}^\pi _\zeta \left[ \lambda _{1}^t \gamma _{1}^t \, f(S_t^{t+1}) \mid S_0 \right] \Big ]}. \end{aligned}$$
(3.14)

(iii) We now use (3.14) to calculate \(\mathbb {E}_\zeta \big [ e_0 \, \delta _0(v) \big ]\) for a given function v. Recall from (2.3) that \(\delta _0(v) = \rho _0 \cdot \big (r(S_0^1) + \gamma _1 v(S_1) - v(S_0) \big )\), so we let \(f(S_t^{t+1}) = r(S_t^{t+1}) + \gamma _{t+1} v(S_{t+1}) - v(S_t)\) in (3.14). Then a direct calculation shows that

$$\begin{aligned} \mathbb {E}_\zeta \big [ e_0 \, \delta _0(v) \mid S_0 \big ]&= \phi (S_0) \cdot \big \{ -v(S_0) + (T v)(S_0) \big \}. \end{aligned}$$
(3.15)

Therefore \(\mathbb {E}_\zeta \big [ e_0 \, \delta _0(v) \big ] = \sum _{s \in \mathcal {S}} \zeta _\mathcal {S}(s) \, \phi (s) \cdot (Tv-v)(s)\), and this shows that \(\mathbb {E}_\zeta \big [ e_0 \, \delta _0(v) \big ] = 0\) is equivalent to \(Tv - v \perp _{\zeta _\mathcal {S}} \mathcal {L}_\phi \).    \(\square \)

Concluding Remark. This completes our analysis of the LSTD algorithm for the proposed TD-learning scheme. To conclude the paper, we note that the preceding results also prepare the ground for analyzing gradient-based algorithms similar to [9, 10] in future work. Specifically, like LSTD, these algorithms would aim to solve the same projected generalized Bellman equation as characterized by Theorem 3.1 (cf. Remark 3.1). Their average dynamics, which are important for analyzing their convergence using the mean ODE approach from stochastic approximation theory [7], can be studied based on the ergodicity result of Theorem 2.1, in essentially the same way as we did in Sect. 2.3 for the LSTD algorithm.