FormalPara Synonyms

Approximate dynamic programming; Cost-to-go function approximation; Neuro-dynamic programming

Definition

Value Function Approximation is a collection of function approximation representations, techniques, and methods that aim to provide a scalable and effective approximation to an exact value function (a real-valued function indicating the long-term goodness of making decisions at any state within a sequential decision problem).

Motivation and Background

Markov Decision Processes

A Markov decision process (MDP) is a 6-tuple \((\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma,\mathcal{D})\), where \(\mathcal{S}\) is the state space of the process, \(\mathcal{A}\) is a finite set of actions, \(\mathcal{P}\) is a Markovian transition model (\(\mathcal{P}(s^{\prime}\vert s,a)\) denotes the probability of a transition to state s′ when taking action a in state s), \(\mathcal{R}\) is a reward function (\(\mathcal{R}(s,a)\) is the reward for taking action a in state s), γ ∈ (0, 1] is the discount factor for future rewards (a reward received after t steps is weighted by \(\gamma ^{t}\)), and \(\mathcal{D}\) is the initial state distribution (Puterman 1994). MDPs are discrete-time processes. The process begins at time t = 0 in some state \(s_{0} \in \mathcal{S}\) drawn from \(\mathcal{D}\). At each time step t, the decision maker observes the current state of the process \(s_{t} \in \mathcal{S}\) and chooses an action \(a_{t} \in \mathcal{A}\). The next state of the process \(s_{t+1}\) is drawn stochastically according to the transition model \(\mathcal{P}(s_{t+1}\vert s_{t},a_{t})\), and the reward \(r_{t}\) at that time step is determined by the reward function \(\mathcal{R}(s_{t},a_{t})\). The horizon h is the temporal extent of each run of the process and is typically infinite. A complete run of the process over its horizon is called an episode and consists of a long sequence of states, actions, and rewards:

$$\displaystyle{s_{0},a_{0},r_{0},\ s_{1},a_{1},r_{1},\ s_{2},a_{2},r_{2},\ \ldots,\ s_{h},a_{h},r_{h}.}$$

The quantity of interest is the expected total discounted reward from any state s:

$$\displaystyle\begin{array}{rcl} & & E\Big(r_{0} + \gamma r_{1} + \gamma ^{2}r_{ 2} + \gamma ^{3}r_{ 3} +\ldots +\gamma ^{h}r_{ h}\ \Big\vert \ s_{0} = s\Big) {}\\ & & \quad = E\left (\sum _{t=0}^{h}\gamma ^{t}r_{ t}\ \Big\vert \ s_{0} = s\right ), {}\\ \end{array}$$

where the expectation is taken with respect to all possible trajectories of the process in the state space under the decisions made and the transition model, assuming that the process is initialized in state s. The goal of the decision maker is to make decisions so that the expected total discounted reward, when s is drawn from \(\mathcal{D}\), is optimized. (The optimization objective could be maximization or minimization depending on the problem. Here, we adopt a reward maximization viewpoint, but there are analogous definitions for cost minimization. There are also other popular optimality measures, such as maximization/minimization of the average reward/cost per step.)
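
As a minimal illustration (in Python, with names that are illustrative rather than taken from the text), the total discounted reward of a single observed episode can be computed directly from its reward sequence:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Total discounted reward of one episode: sum over t of gamma^t * r_t."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(rewards * gamma ** np.arange(len(rewards))))

# A three-step episode with rewards 1, 0, 2 and gamma = 0.9:
print(discounted_return([1.0, 0.0, 2.0], 0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```

The expected total discounted reward is the average of such returns over all trajectories the process can generate from s.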

Policies

A policy dictates how the decision maker chooses actions in each state. A stationary, deterministic policy π is a mapping \(\pi : \mathcal{S}\mapsto \mathcal{A}\) from states to actions; π(s) denotes the action the agent takes in state s. In this case, there is a single action choice for each state, and this choice does not change over time. In contrast, a stationary, stochastic policy π is a mapping \(\pi : \mathcal{S}\mapsto \varOmega (\mathcal{A})\), where \(\varOmega (\mathcal{A})\) is the set of all probability distributions over \(\mathcal{A}\). Stochastic policies are also called soft, for they do not commit to a single action per state; π(a | s) stands for the probability of choosing action a in state s under policy π. Each policy π (deterministic or stochastic) is characterized by the expected total discounted reward it accumulates during an episode. An optimal policy π∗ for an MDP is a policy that maximizes the expected total discounted reward from any state \(s \in \mathcal{S}\). It is well known that for every MDP, there exists at least one, not necessarily unique, optimal policy, which is stationary and deterministic.

Value Functions

The quality of any policy π can be quantified formally through a value function, which measures the expected return of the policy under different process initializations. For any MDP and any policy π, the state value function V assigns a numeric value to each state. The value Vπ(s) of a state s under a policy π is the expected return, when the process starts in state s and the decision maker follows policy π (all decisions at all steps are made according to π):

$$\displaystyle{V ^{\pi }(s) = E_{a_{t}\sim \pi \;;\;s_{t}\sim \mathcal{P}\;;\;r_{t}\sim \mathcal{R}}\left (\sum _{t=0}^{\infty }\gamma ^{t}r_{ t}\ \Big\vert \ s_{0} = s\right ).}$$

Similarly, the state-action value function Q assigns a numeric value to each pair (s, a) of states and actions. The value Qπ(s, a) of taking action a in state s under a policy π is the expected return when the process starts in state s, and the decision maker takes action a for the first step and follows policy π thereafter:

$$\displaystyle\begin{array}{rcl} & & Q^{\pi }(s,a) {}\\ & & = E_{a_{t}\sim \pi \;;\;s_{t}\sim \mathcal{P}\;;\;r_{t}\sim \mathcal{R}}\left (\sum _{t=0}^{\infty }\gamma ^{t}r_{ t}\ \Big\vert \ s_{0} = s,\ a_{0} = a\right ). {}\\ \end{array}$$

The state and the state-action value functions for a deterministic policy π are related as follows:

$$\displaystyle{V ^{\pi }(s) = Q^{\pi }\big(s,\pi (s)\big).}$$

For a stochastic policy π this relationship needs to take into account the probability distribution over actions:

$$\displaystyle{V ^{\pi }(s) =\sum _{a\in \mathcal{A}}\pi (a\vert s)Q^{\pi }(s,a).}$$

The state-action value function of a policy π (either deterministic or stochastic) can also be expressed in terms of the state value function:

$$\displaystyle{Q^{\pi }(s,a) = \mathcal{R}(s,a) + \gamma \sum _{s^{\prime}\in \mathcal{S}}\mathcal{P}(s^{\prime}\vert s,a)V ^{\pi }(s^{\prime}).}$$

The optimal value functions \(V ^{{\ast}} = V ^{\pi ^{{\ast}} }\) and \(Q^{{\ast}} = Q^{\pi ^{{\ast}} }\) are the state and the state-action value functions of any optimal policy π∗. Even if there are several distinct optimal policies, they all share the same unique optimal value functions.

Bellman Equations

Given the full MDP model, the state or the state-action value function of any given policy can be computed by solving a linear system formed using the linear Bellman equations. In general, the linear Bellman equation relates the value of the function at any point to the values of the function at several – in fact, all – other points. This is achieved by separating the first step of an episode from the rest and using the definition of the value function recursively in the next step. In particular, for any deterministic policy π, the linear Bellman equation for the state value function is

$$\displaystyle{V ^{\pi }(s) = \mathcal{R}(s,\pi (s)) + \gamma \sum _{s^{\prime}\in \mathcal{S}}\mathcal{P}(s^{\prime}\vert s,\pi (s))V ^{\pi }(s^{\prime}),}$$

whereas for a stochastic policy π, it becomes

$$\displaystyle\begin{array}{rcl} V ^{\pi }(s)& =& \sum _{a\in \mathcal{A}}\pi (a\vert s)\bigg(\mathcal{R}(s,a) {}\\ & & +\gamma \sum _{s^{\prime}\in \mathcal{S}}\mathcal{P}(s^{\prime}\vert s,a)V ^{\pi }(s^{\prime})\bigg). {}\\ \end{array}$$

The exact Vπ values for all states can be found by solving the \((\vert \mathcal{S}\vert \times \vert \mathcal{S}\vert )\) linear system that results from writing down the linear Bellman equation for all states.
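
When the full model is available, this linear system can be solved directly. The sketch below assumes the model is stored as NumPy arrays P[s, a, s′] and R[s, a] and the deterministic policy as an array of action indices; these representations are illustrative, not prescribed by the text.

```python
import numpy as np

def policy_state_values(P, R, policy, gamma):
    """Exact policy evaluation: solve the (|S| x |S|) linear Bellman system for V^pi."""
    n = P.shape[0]
    P_pi = P[np.arange(n), policy]    # row s holds P(. | s, pi(s))
    R_pi = R[np.arange(n), policy]    # R(s, pi(s))
    # V^pi = R_pi + gamma * P_pi V^pi  <=>  (I - gamma * P_pi) V^pi = R_pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```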

Similarly, the linear Bellman equation for the state-action value function given any deterministic policy π is

$$\displaystyle\begin{array}{rcl} Q^{\pi }(s,a)& =& \mathcal{R}(s,a) {}\\ & & +\gamma \sum _{s^{\prime}\in \mathcal{S}}\mathcal{P}(s^{\prime}\vert s,a)Q^{\pi }\big(s^{\prime},\pi (s^{\prime})\big), {}\\ \end{array}$$

whereas for a stochastic policy π, it becomes

$$\displaystyle\begin{array}{rcl} Q^{\pi }(s,a)& =& \mathcal{R}(s,a) + \gamma \sum _{s^{\prime}\in \mathcal{S}}\mathcal{P}(s^{\prime}\vert s,a) {}\\ & & \times \sum _{a^{\prime}\in \mathcal{A}}\pi (a^{\prime}\vert s^{\prime})Q^{\pi }(s^{\prime},a^{\prime}). {}\\ \end{array}$$

The exact Qπ values for all state-action pairs can be found by solving the \((\vert \mathcal{S}\vert \vert \mathcal{A}\vert \times \vert \mathcal{S}\vert \vert \mathcal{A}\vert )\) linear system that results from writing down the linear Bellman equation for all state-action pairs.
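
Under the same illustrative array conventions, the \((\vert \mathcal{S}\vert \vert \mathcal{A}\vert \times \vert \mathcal{S}\vert \vert \mathcal{A}\vert )\) system for Qπ of a deterministic policy can be assembled and solved as follows.

```python
import numpy as np

def policy_action_values(P, R, policy, gamma):
    """Exact policy evaluation for Q^pi via the flattened (|S||A| x |S||A|) Bellman system."""
    n, m = R.shape                     # n states, m actions
    P_flat = P.reshape(n * m, n)       # row (s, a) holds P(. | s, a)
    # The successor pair (s', pi(s')) occupies column s'*m + policy[s'] of the
    # flattened Q vector, so P(s'|s, a) is placed in that column.
    T = np.zeros((n * m, n * m))
    T[:, np.arange(n) * m + policy] = P_flat
    q = np.linalg.solve(np.eye(n * m) - gamma * T, R.reshape(n * m))
    return q.reshape(n, m)
```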

The unique optimal state or state-action value function can be computed even for an unknown optimal policy π∗ using the nonlinear Bellman optimality equation, which relates values of the function at different points while exploiting the fact that there exists a deterministic optimal policy that achieves the maximum value at each point. In particular, the nonlinear Bellman optimality equation for the state value function is

$$\displaystyle\begin{array}{rcl} V ^{{\ast}}(s)& =& \max _{ a\in \mathcal{A}}\bigg\{\mathcal{R}(s,a) {}\\ & & +\gamma \sum _{s^{\prime}\in \mathcal{S}}\mathcal{P}(s^{\prime}\vert s,a)V ^{{\ast}}(s^{\prime})\bigg\}, {}\\ \end{array}$$

whereas for the state-action value function is

$$\displaystyle\begin{array}{rcl} Q^{{\ast}}(s,a)& =& \mathcal{R}(s,a) {}\\ & & +\gamma \sum _{s^{\prime}\in \mathcal{S}}\mathcal{P}(s^{\prime}\vert s,a)\max _{a^{\prime}\in \mathcal{A}}\big\{Q^{{\ast}}(s^{\prime},a^{\prime})\big\}. {}\\ \end{array}$$

The functions V∗ and Q∗ can be approximated arbitrarily closely by the iterative application of the operator formed by the right-hand side of the equations above (the Bellman optimality operator). This iteration is a contraction with rate γ, so starting with any arbitrary initialization, it eventually converges to V∗ or Q∗.
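
A compact value iteration sketch based on this contraction property, again using the illustrative P[s, a, s′] and R[s, a] arrays from the earlier sketches:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Apply the Bellman optimality operator repeatedly until the values stop changing."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V                    # Q[s, a] = R(s, a) + gamma * E[V(s')]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)       # approximate V* and a greedy policy
        V = V_new
```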

Significance of Value Functions

Value functions play a critical role in sequential decision making because they address two core problems: policy evaluation and policy improvement. Policy evaluation refers to the problem of quantifying the quality of any given policy π in a given MDP. Clearly, computing the value function Vπ or Qπ using the Bellman equations provides the solution to this problem. Policy improvement, on the other hand, refers to the problem of deriving an improved policy π′ given any base policy π, so that π′ is at least as good as π and possibly better. The knowledge of Vπ or Qπ allows for the derivation of an improved deterministic policy π′ through a simple one-step look-ahead maximization procedure:

$$\displaystyle\begin{array}{rcl} \pi ^{\prime}(s)& =& \mathop{\mathrm{arg\,max}}_{a\in \mathcal{A}}\bigg\{\mathcal{R}(s,a) {}\\ & & +\gamma \sum _{s^{\prime}\in \mathcal{S}}\mathcal{P}(s^{\prime}\vert s,a)V ^{\pi }(s^{\prime})\bigg\} {}\\ \pi ^{\prime}(s)& =& \mathop{\mathrm{arg\,max}}_{a\in \mathcal{A}}\big\{Q^{\pi }(s,a)\big\}. {}\\ \end{array}$$

Note that this maximization does not need the MDP model when using the state-action value function. Once policy evaluation and policy improvement have been addressed, the derivation of an optimal policy for any MDP is straightforward. One can alternate between policy evaluation and policy improvement, producing a sequence of improving policies until convergence to an optimal policy; this algorithm is known as policy iteration. Alternatively, one can iteratively compute an optimal value function V∗ or Q∗ and extract an optimal policy through a trivial step of policy improvement on top of V∗ or Q∗; this algorithm is known as value iteration. In either case, value functions provide the means to the end.
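
A policy iteration sketch, reusing the policy_state_values() helper introduced earlier (both the helper and the array conventions are illustrative):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy one-step improvement."""
    policy = np.zeros(P.shape[0], dtype=int)
    while True:
        V = policy_state_values(P, R, policy, gamma)    # policy evaluation
        Q = R + gamma * P @ V                           # one-step look-ahead
        new_policy = Q.argmax(axis=1)                   # policy improvement
        if np.array_equal(new_policy, policy):
            return policy, V                            # no change: an optimal policy
        policy = new_policy
```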

The problem of deriving an optimal policy using the full MDP model is known as planning. Nevertheless, in many real-world sequential decision domains, the model is unknown. The problem of optimal decision making in an unknown stochastic environment is known as reinforcement learning, because the decision maker relies on the feedback received through interaction with the environment to reinforce or discourage past decisions. More specifically, the learner interacts with an unknown MDP and typically observes the state of the process and the immediate reward at every step; however, \(\mathcal{P}\) and \(\mathcal{R}\) are not accessible. At each step of interaction, the learner observes the current state s, chooses an action a, and observes the resulting next state s′ and the reward received r, thus learning is based on (s, a, r, s′) samples. The core problems in reinforcement learning are known as prediction and control. Prediction refers to the problem of learning the value function of a given policy π in an unknown MDP through interaction. Well-known algorithms for the prediction problem are Monte Carlo estimation and temporal difference (TD) learning. Control, on the other hand, refers to the problem of gradually learning a good or even optimal policy in an unknown MDP through interaction. Well-known algorithms for the control problem are SARSA and Q-learning. Again, value functions play a critical role in reinforcement learning; they are absolutely necessary for the prediction problem, and the majority of control approaches are value-function based.
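
A tabular Q-learning sketch for the control problem; the environment interface (reset() returning a state index and step(a) returning (s′, r, done)) is assumed here purely for illustration.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, gamma,
                       alpha=0.1, epsilon=0.1, episodes=500, seed=0):
    """Learn Q* from (s, a, r, s') samples with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])       # Q-learning update
            s = s_next
    return Q
```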

Structure of Learning System

Value-Function Approximation

Most algorithms for planning or learning in MDPs rely on computing or learning a value function. However, if the state space of the process is fairly large, the exact (tabular) representation of a value function becomes problematic. Not only does memory space become insufficient very quickly, but also computing or learning accurately all the distinct entries of the function requires a tremendous amount of computation and data. This is known as the curse of dimensionality: the exponential growth of the state or action space as a function of the dimensionality of the state or action. The urgent need for solutions to large real-world sequential decision problems has drawn attention to approximate methods. These methods use function approximation techniques for approximating value functions; therefore, they sacrifice some representational accuracy in order to make the representation manageable in practice. Sacrificing accuracy in the representation of the value function is acceptable, since the ultimate goal is to find a good policy and not necessarily an accurate value function. As a result, value-function approximation methods cannot guarantee optimal solutions, but only good solutions. This is not to say that they are doomed to always finding suboptimal solutions; if an optimal solution lies within the space spanned by the value-function approximation scheme, it is possible that an optimal solution will be discovered.

Let \(\hat{V }^{\pi }(s;w)\) be an approximation to the state value function Vπ(s) represented by a parametric approximation architecture with free parameters w. The key idea of value-function approximation is that the parameters w can be adjusted appropriately so that the approximate values are “close enough” to the original values,

$$\displaystyle{\hat{V }^{\pi }(s;w) \approx V ^{\pi }(s),}$$

and, therefore, \(\hat{V }^{\pi }\) can be used in place of the exact value function Vπ. Similarly, let \(\hat{Q}^{\pi }(s,a;w)\) be an approximation to the state-action value function Qπ(s, a). Again, the goal is to adjust the parameters w so that

$$\displaystyle{\hat{Q}^{\pi }(s,a;w) \approx Q^{\pi }(s,a),}$$

and, therefore, \(\hat{Q}^{\pi }\) can be used in place of the exact value function Qπ. Approximations \(\hat{V }^{{\ast}}\) and \(\hat{Q}^{{\ast}}\) of the optimal value functions V∗ and Q∗ are defined similarly. The characterization “close enough” ( ≈ ) admits a variety of interpretations in this context, and it does not necessarily refer to the minimization of some norm. Value-function approximation should be regarded as a functional approximation rather than as a pure numerical approximation, where “functional” refers to the ability of the approximation to play closely the functional role of the original value function within a decision making algorithm.

The benefits of value-function approximation are obvious. The storage requirements are much smaller compared to the tabular case, since only the parameters w need to be stored along with a compact description of the functional form of the architecture. In general, for most approximation architectures, the storage needs are independent of the size of the state space and/or the size of the action space. Furthermore, for most approximation architectures there is no restriction on the state space to be a finite set; it could be an infinite, or even a continuous, space. This flexibility nevertheless reveals the need for good generalization abilities on behalf of the architecture, since the approximate value function will have to provide good values over the entire state/state-action space, using data only from a limited subset of the space.

The main difficulty associated with value-function approximation, beyond the loss in accuracy, is the choice of the projection method, which is the method of finding appropriate parameters that maximize the accuracy of the approximation according to certain criteria and with respect to the target function. Typically, for ordinary function approximation, this is accomplished using a training set of examples of the form \(\big\{s,V ^{\pi }(s)\big\}\), \(\big\{s,V ^{{\ast}}(s)\big\}\), \(\big\{(s,a),Q^{\pi }(s,a)\big\}\), or \(\big\{(s,a),Q^{{\ast}}(s,a)\big\}\) that provide the true value of the target function at certain sample points s or (s, a) (supervised learning). Unfortunately, in the context of sequential decision making, the target value function is completely unknown. Had it been possible to compute it easily, value-function approximation would have been unnecessary. In fact, it is not possible to analytically compute the true value of the target value function even at certain isolated sample points due to interdependencies between the values at all points. The implication of this difficulty is that evaluation and projection to the approximation architecture must be blended together. This is usually achieved by trying to find values for the free parameters so that the approximate function retains some properties of the original exact value function. Another implication of using approximation for value functions is that all convergence properties of exact planning or learning algorithms are compromised. Therefore, significant attention must be paid to the choice of the approximation architecture and the evaluation and projection method to minimize the chances for divergence or oscillations.

Approximation Architectures

There are a variety of architectures available for value-function approximation: perceptrons, neural networks, splines, polynomials, radial basis functions, support vector machines, decision trees, CMACs, wavelets, etc. These architectures have diverse representational power and generalization abilities, and the most appropriate choice will heavily depend on the properties of the decision making problem at hand. The projection methods associated with these approximation architectures are typically designed for a supervised learning setting. For successful use in the context of decision making, combined evaluation and projection methods are necessary.

A broad categorization of approximation architectures distinguishes between nonlinear and linear architectures. The characterization “nonlinear” or “linear” refers to the way the free parameters enter into the architecture and not to the approximation ability of the architecture. Nonlinear architectures are usually more expressive than the linear ones, due to the complex interactions among their free parameters; however, tuning their parameters is a much more elaborate task compared to tuning the parameters of their linear counterparts. Linear architectures are perhaps the most popular choice for value-function approximation; interestingly, most theoretical results on convergence properties in the context of value-function approximation are restricted to linear architectures.

A linear approximation architecture approximates a function Vπ(s) or Qπ(s, a) as a linear weighted combination of k basis functions (also called features):

$$\displaystyle\begin{array}{rcl} \hat{V }^{\pi }(s;w)& =& \sum _{j=1}^{k}\phi _{ j}(s)w_{j} =\phi (s)^{\top }w {}\\ \hat{Q}^{\pi }(s,a;w)& =& \sum _{j=1}^{k}\phi _{ j}(s,a)w_{j} =\phi (s,a)^{\top }w. {}\\ \end{array}$$

The free parameters of the architecture are the coefficients \(w_{j}\) of the combination (also called weights). The basis functions \(\phi _{j}\) are fixed, but arbitrary and, in general, nonlinear functions of s or (s, a). It is required that the basis functions \(\phi _{j}\) are linearly independent to ensure that there are no redundant parameters and that the matrices involved in the computations are full rank. In general, \(k \ll \vert \mathcal{S}\vert\) and \(k \ll \vert \mathcal{S}\vert \vert \mathcal{A}\vert\), and the basis functions \(\phi _{j}\) have small, compact descriptions. As a result, the storage requirements of a linear approximation architecture are much smaller than those of the tabular representation of a value function. There is a large variety of linear approximation architectures, and in fact, many common schemes for value-function approximation can be cast as linear architectures.

  • Lookup Table. This is the exact tabular representation, suitable for problems with discrete state variables (there is no approximation under this scheme; it is included only to illustrate that the exact representation belongs to the family of linear architectures). Each basis function is an indicator function whose value is 1 only for a specific discrete input point (state or state-action) and 0 otherwise. Each parameter stores one value/entry of the table.

  • Discretization. This is a common technique for turning a continuous space into a discrete one using a uniform- or variable-resolution grid. One indicator basis function is assigned to each cell of the discretization, and the corresponding parameter holds the value of that cell.

  • Tile Coding (CMAC). This scheme utilizes several overlapping discretizations (tilings) for better accuracy. It generates indicator basis functions for each cell of each tiling and concatenates the basis functions for all tilings. Each parameter corresponds to one cell in one tiling, but the value at each input point is computed additively from the values of all containing cells from all tilings.

  • State Aggregation. This is a family of schemes where “similar” (by some metric) states are grouped together and are treated as one state. The similarity metric is usually formed through dimensionality reduction techniques for identifying the most significant dimensions in a high-dimensional input space, unlike conventional proximity measures in the same space. There is one indicator basis function for each group and a single value for all states in the group.

  • Polynomials. This is a generic approximation scheme suitable for problems with several (continuous) state variables. Each basis function is a polynomial term composed of state variables up to a certain degree.

  • Radial Basis Functions (RBFs). This is another generic approximation scheme suitable for continuous state variables. Each basis function is a Gaussian with fixed mean and variance; the Gaussians are topologically arranged so that they cover the input space with some overlap (a small feature-construction sketch follows this list).

  • Kernel Methods. Kernels are symmetric functions between two points, and they are used to represent compactly dot products of feature vectors in high- or even infinite-dimensional spaces. The compactness of kernels allows for approximation schemes that essentially enjoy the flexibility provided by a huge or infinite number of basis functions. The basis functions, in this case, are implicitly defined through the choice of the kernel.

  • Partitioning. This technique is used for constructing complex approximators by partitioning the state space in several subsets and using a different approximator in each one of them. If linear architectures are used in all partitions, then a set of basis functions for the global architecture can be constructed by concatenating the basis functions of the smaller linear architectures multiplying each subset with an indicator function for the corresponding partition.
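
As an example of the radial basis function scheme above, the sketch below builds a small RBF feature vector over a two-dimensional continuous state and evaluates the linear approximation ϕ(s)⊤w; the grid of centers, the width σ, and the weight vector are all illustrative choices.

```python
import numpy as np

def rbf_features(s, centers, sigma):
    """Gaussian RBF basis: phi_j(s) = exp(-||s - c_j||^2 / (2 sigma^2)), plus a constant term."""
    dists = np.linalg.norm(centers - np.atleast_1d(s), axis=1)
    return np.concatenate(([1.0], np.exp(-dists ** 2 / (2.0 * sigma ** 2))))

# A 3 x 3 grid of centers covering [0, 1]^2 with overlapping Gaussians:
grid = np.linspace(0.0, 1.0, 3)
centers = np.array([[x, y] for x in grid for y in grid])

phi = rbf_features(np.array([0.4, 0.7]), centers, sigma=0.5)   # k = 10 features
w = np.zeros(len(phi))                                         # weights to be learned
print(float(phi @ w))                                          # V^(s; w) = phi(s)^T w
```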

Linear architectures offer several advantages: they are easy to implement and use, and their behavior is fairly transparent, both from an analysis standpoint and from a debugging and feature engineering standpoint. It is usually relatively easy to get some insight into the reasons for which a particular choice of features succeeds or fails. This is facilitated by the fact that the magnitude of each parameter is related to the importance of the corresponding feature in the approximation (assuming normalized features).

A nonlinear approximation architecture approximates a function Vπ(s) or Qπ(s, a) using arbitrary parametric functions of s and (s, a), possibly in conjunction with some features ϕ computed over s and (s, a). The best-known representatives of this category are multilayer feed-forward neural networks, which use one or more layers of linear combinations followed by a nonlinear sigmoidal transformation (thresholding). In their simplest form (one layer), neural networks approximate value functions as

$$\displaystyle\begin{array}{rcl} \hat{V }^{\pi }(s;w,\theta )& =& \sum _{i=1}^{m}\theta _{ i}\sigma \left (\sum _{j=1}^{k}\phi _{ j}(s)w_{ji}\right ) {}\\ & =& \sum _{i=1}^{m}\theta _{ i}\sigma \Big(\phi (s)^{\top }w_{ i}\Big) {}\\ \hat{Q}^{\pi }(s,a;w,\theta )& =& \sum _{i=1}^{m}\theta _{ i}\sigma \left (\sum _{j=1}^{k}\phi _{ j}(s,a)w_{ji}\right ) {}\\ & =& \sum _{i=1}^{m}\theta _{ i}\sigma \Big(\phi (s,a)^{\top }w_{ i}\Big). {}\\ \end{array}$$

Common choices for the differentiable, bounded, and monotonically increasing function σ are the hyperbolic tangent function \(\sigma (x) =\tanh (x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})\) and the logistic function \(\sigma (x) = 1/(1 + e^{-x})\). The free parameters of the architecture (also called weights) are the coefficients \(w_{ji}\) of the linear combinations of the inputs and the coefficients \(\theta _{i}\) of the linear combination of the sigmoidal transformations for the output. Notice that the parameters \(w_{ji}\) enter nonlinearly into the approximation.
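
A forward-pass sketch of this one-hidden-layer architecture (shapes and values are illustrative):

```python
import numpy as np

def v_hat_nn(phi_s, W, theta):
    """V^(s; W, theta) = sum_i theta_i * tanh(phi(s)^T W[:, i]), with W of shape (k, m)."""
    return float(np.tanh(phi_s @ W) @ theta)

rng = np.random.default_rng(0)
W, theta = rng.normal(size=(3, 4)), rng.normal(size=4)   # k = 3 features, m = 4 hidden units
print(v_hat_nn(np.array([1.0, 0.2, -0.5]), W, theta))
```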

A key question in all approximation architectures is how features are generated and selected. The feature generation and selection problem is an open question that spans most of machine learning research and admits no easy and general answer. Prior domain-specific knowledge and experience can be very helpful in choosing appropriate features. Several recent studies also describe methods for automatically generating features targeted for value-function approximation (Menache et al. 2005; Mahadevan and Maggioni 2007; Parr et al. 2007).

Learning

Learning (or training or parameter estimation) in value-function approximation refers to parameter tuning methods that take as input a policy π, an approximation architecture for Vπ/Qπ, and the full MDP model or samples of interaction with the process, and output a set of parameters wπ such that \(\hat{V }^{\pi }/\hat{Q}^{\pi }\) is a good approximation to Vπ/Qπ. Learning also covers methods for the harder problem of taking an approximation architecture for V∗/Q∗ and the model or samples and outputting a set of parameters w∗ such that \(\hat{V }^{{\ast}}/\hat{Q}^{{\ast}}\) is a good approximation to V∗/Q∗. The former problem is somewhat easier because the policy π, unlike an optimal policy π∗, is known, and therefore, in the presence of a simulator of the process, the value function can be estimated at any isolated point using Monte Carlo estimation techniques based on repeated policy rollouts from that point. Each rollout is an execution of an episode starting from a state s (or state-action pair (s, a)) using policy π to obtain an unbiased estimate of the return of the policy from s (or (s, a)). In this case, value-function approximation can be cast as a classic supervised learning problem; the true value of Vπ/Qπ is estimated at a subset of points to form a training set, which can subsequently be used to train the approximation architecture using supervised learning techniques. However, in the absence of a simulator or when seeking approximations to V∗/Q∗, evaluation and projection into the architecture have to be carried out simultaneously.
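
A rollout-based Monte Carlo sketch for estimating Qπ(s, a) at an isolated point; the simulator(s, a) → (s′, r) and policy(s) → a interfaces are assumptions made for illustration.

```python
import numpy as np

def rollout_q_estimate(simulator, policy, s, a, gamma, horizon=200, n_rollouts=50):
    """Average the discounted return of rollouts that take a first and follow pi afterwards."""
    returns = []
    for _ in range(n_rollouts):
        state, action, total, discount = s, a, 0.0, 1.0
        for _ in range(horizon):
            state, r = simulator(state, action)   # one noisy transition and its reward
            total += discount * r
            discount *= gamma
            action = policy(state)                # follow pi after the first step
        returns.append(total)
    return float(np.mean(returns))
```

Averaging such estimates over a set of sample points yields a training set to which standard supervised learning can be applied.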

The true value function has two key properties: it satisfies the Bellman equations, and it is the fixed point of the Bellman operator. Learning in value-function approximation strives to find values for the free parameters so that the approximate function retains at least one of these properties to the extent this is possible. Learning methods that focus on satisfying the Bellman equations attempt to find an approximate function that minimizes the Bellman residual, the difference between the left- and the right-hand sides of the system of Bellman equations. On the other hand, learning methods that focus on the fixed point property attempt to find an approximate function that exhibits a fixed point behavior under the combined application of the Bellman operator and projection onto the space spanned by the basis functions. Experimental evidence suggests that it is difficult to satisfy both properties under approximation, and therefore these two approaches differ significantly in their solutions. The majority of existing learning methods focus on fixed point approximation, which experimentally has been shown to exhibit more stable behavior and deliver better policies. There are also proposals for combining the benefits of both approaches into a hybrid method (Johns et al. 2009).

The most widely used learning approach is based on gradient descent and is applicable to any approximation architecture that is differentiable with respect to its parameters. Any stochastic approximation learning method for tabular representations of value functions can be extended to approximate representations. For example, given any sample (s, a, r, s′) of interaction with the process, the temporal difference (TD) learning update rule

$$\displaystyle{V ^{\pi }(s) \leftarrow V ^{\pi }(s) +\alpha \Big (r + \gamma V ^{\pi }(s^{\prime}) - V ^{\pi }(s)\Big)}$$

for tabular representations, where α ∈ (0, 1] is the learning rate, becomes

$$\displaystyle\begin{array}{rcl} & & w^{\pi } \leftarrow w^{\pi } +\alpha \Big (r + \gamma \hat{V }^{\pi }(s^{\prime};w^{\pi }) {}\\ & & \ \ \quad -\hat{ V }^{\pi }(s;w^{\pi })\Big)\nabla _{w^{\pi }}\hat{V }^{\pi }(s;w^{\pi }) {}\\ \end{array}$$

under an approximation scheme \(\hat{V }^{\pi }\). Similarly, the SARSA update rule

$$\displaystyle\begin{array}{rcl} & & Q^{\pi }(s,a) \leftarrow Q^{\pi }(s,a) {}\\ & & \quad +\alpha \Big (r + \gamma Q^{\pi }(s^{\prime},\pi (s^{\prime})) - Q^{\pi }(s,a)\Big) {}\\ \end{array}$$

for tabular representations becomes

$$\displaystyle\begin{array}{rcl} & & w^{\pi } \leftarrow w^{\pi } +\alpha \Big (r + \gamma \hat{Q}^{\pi }(s^{\prime},\pi (s^{\prime});w^{\pi }) {}\\ & & \ \ \quad -\hat{ Q}^{\pi }(s,a;w^{\pi })\Big)\nabla _{w^{\pi }}\hat{Q}^{\pi }(s,a;w^{\pi }) {}\\ \end{array}$$

under an approximation scheme \(\hat{Q}^{\pi }\). Finally, the Q-learning update rule

$$\displaystyle\begin{array}{rcl} & & Q^{{\ast}}(s,a) \leftarrow Q^{{\ast}}(s,a) {}\\ & & \quad +\alpha \Big (r + \gamma \max _{a^{\prime}\in \mathcal{A}}\big\{Q^{{\ast}}(s^{\prime},a^{\prime})\big\} - Q^{{\ast}}(s,a)\Big) {}\\ \end{array}$$

for tabular representations under an approximation scheme \(\hat{Q}^{{\ast}}\) becomes

$$\displaystyle\begin{array}{rcl} & & w^{{\ast}}\leftarrow w^{{\ast}} +\alpha \Big (r + \gamma \max _{ a^{\prime}\in \mathcal{A}}\big\{\hat{Q}^{{\ast}}(s^{\prime},a^{\prime};w^{{\ast}})\big\} {}\\ & & \qquad -\hat{ Q}^{{\ast}}(s,a;w^{{\ast}})\Big)\nabla _{ w^{{\ast}}}\hat{Q}^{{\ast}}(s,a;w^{{\ast}})\;. {}\\ \end{array}$$

These rules are applicable to any approximation architecture, linear or nonlinear. However, when using linear architectures they can be greatly simplified, because the gradient with respect to a parameter \(w_{j}\) is simply the corresponding basis function \(\phi _{j}\), for j = 1, 2, …, k:

$$\displaystyle\begin{array}{rcl} \text{TD: }w_{j}^{\pi }& \leftarrow & w_{ j}^{\pi } +\alpha \Big (r + \gamma \phi (s^{\prime})^{\top }w^{\pi } -\phi (s)^{\top }w^{\pi }\Big)\phi _{ j}(s) {}\\ \text{SARSA: }w_{j}^{\pi }& \leftarrow & w_{ j}^{\pi } +\alpha \Big (r + \gamma \phi (s^{\prime},\pi (s^{\prime}))^{\top }w^{\pi } -\phi (s,a)^{\top }w^{\pi }\Big)\phi _{ j}(s,a) {}\\ \text{Q-learning: }w_{j}^{{\ast}}& \leftarrow & w_{ j}^{{\ast}} +\alpha \Big (r + \gamma \max _{ a^{\prime}\in \mathcal{A}}\big\{\phi (s^{\prime},a^{\prime})^{\top }w^{{\ast}}\big\}-\phi (s,a)^{\top }w^{{\ast}}\Big)\phi _{ j}(s,a) {}\\ \end{array}$$
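
These simplified linear rules translate directly into code; a sketch for TD(0) and Q-learning follows (the SARSA rule is obtained by replacing the max over next-state features with ϕ(s′, π(s′))). Feature vectors are plain NumPy arrays here.

```python
import numpy as np

def linear_td_update(w, phi_s, phi_s_next, r, gamma, alpha):
    """TD(0): w <- w + alpha * (r + gamma * phi(s')^T w - phi(s)^T w) * phi(s)."""
    delta = r + gamma * phi_s_next @ w - phi_s @ w
    return w + alpha * delta * phi_s

def linear_q_learning_update(w, phi_sa, phi_s_next_all, r, gamma, alpha):
    """Q-learning; phi_s_next_all stacks phi(s', a') for every action a' as its rows."""
    delta = r + gamma * np.max(phi_s_next_all @ w) - phi_sa @ w
    return w + alpha * delta * phi_sa
```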

More sophisticated learning approaches rely on retaining the desired value-function property in a batch manner by processing several samples collectively. A variety of least-squares techniques have been proposed for linear architectures giving rise to several least-squares reinforcement learning methods, such as least-squares temporal difference (LSTD) learning (Bradtke and Barto 1996), least-squares policy evaluation (LSPE) (Nedić and Bertsekas 2003), hybrid least-squares approximation (Johns et al. 2009), and least-squares policy iteration (LSPI) (Lagoudakis and Parr 2003). The parameters of a linear architecture can also be estimated using Linear Programming (de Farias and Van Roy 2003). Specialized learning algorithms have been proposed when using a kernel-based approximation architecture, based either on Gaussian process TD (GPTD) (Engel et al. 2003), Gaussian process SARSA (GPSARSA) (Engel et al. 2005), kernelized LSTD (KLSTD) and LSPI (KLSPI) (Xu et al. 2007), support vector regression (Bethke et al. 2008), or Gaussian process regression (Rasmussen and Kuss 2004; Bethke and How 2009). A unified view of kernelized value-function approximation is offered by Taylor and Parr (2009). On the other hand, boot-strapping – the updating of a value estimate based on other value estimates – is the main learning approach behind batch methods for nonlinear architectures, such as fitted Q-iteration (FQI) (Ernst et al. 2005).
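
As one concrete example of the batch least-squares family, a minimal LSTD sketch for a linear state value function is given below; the feature map phi and the small ridge term are illustrative choices, and samples are (s, r, s′) triples collected while following π.

```python
import numpy as np

def lstd(samples, phi, gamma, reg=1e-6):
    """Least-squares temporal difference learning: accumulate A and b, then solve A w = b."""
    k = len(phi(samples[0][0]))
    A, b = reg * np.eye(k), np.zeros(k)
    for s, r, s_next in samples:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)   # A = sum_t phi(s) (phi(s) - gamma phi(s'))^T
        b += r * f                             # b = sum_t phi(s) r
    return np.linalg.solve(A, b)
```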

Examples

Very close approximations of the state value function of optimal policies in two well-known problems are presented to illustrate the difficulty of value-function approximation. To obtain these close approximations, a fine discretization of the two-dimensional state space into a uniform grid of 250 × 250 was used for representation. The state-action value function Q was initially computed using approximate policy iteration (a sparse-matrix version of LSPI) with a set of indicator basis functions over the state grid and all actions and 500 (s, a, r, s′) samples for each one of the 187,500 discrete cells (s, a), until convergence to a near-optimal policy; the state value function V was extracted from the Q values.

Inverted Pendulum

The inverted pendulum problem is to balance a pendulum of unknown length and mass at the upright position by applying forces to the cart it is attached to (Fig. 1, left). Three actions are allowed: left force LF (−50 Newtons), right force RF (+50 Newtons), or no force NF (0 Newtons). All three actions are noisy; Gaussian noise with μ = 0 and σ² = 10 is added to the chosen action. The state space of the problem is continuous and consists of the vertical angle θ and the angular velocity \(\dot{\theta }\) of the pendulum. The transitions are governed by the nonlinear dynamics of the system and depend on the current state and the current (noisy) control u:

$$\displaystyle{\ddot{\theta }= \frac{g\sin (\theta ) -\alpha ml(\dot{\theta })^{2}\sin (2\theta )/2 -\alpha \cos (\theta )u} {4l/3 -\alpha ml\cos ^{2}(\theta )} \;\;,}$$

where g is the gravity constant (g = 9.8 m/s²), m is the mass of the pendulum (default: m = 2.0 kg), M is the mass of the cart (default: M = 8.0 kg), l is the length of the pendulum (default: l = 0.5 m), and α = 1∕(m + M). The simulation step is 0.1 s; thus, the control input is given at a rate of 10 Hz, at the beginning of each time step, and is kept constant during any time step. A reward of 0 is given as long as the angle of the pendulum does not exceed π∕2 in absolute value (the pendulum is above the horizontal line). An angle greater than π∕2 in absolute value signals the end of the episode and a reward (penalty) of −1. The discount factor of the process is 0.95.
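
A simulation-step sketch of these dynamics (a plain Euler integration step is assumed here; the integration scheme used in the original experiments is not specified in the text):

```python
import numpy as np

def pendulum_step(theta, theta_dot, force, dt=0.1, m=2.0, M=8.0, l=0.5, g=9.8,
                  rng=np.random.default_rng()):
    """One 0.1 s step: force is -50, 0, or +50 N with Gaussian noise of variance 10."""
    u = force + rng.normal(0.0, np.sqrt(10.0))            # noisy control input
    a = 1.0 / (m + M)
    theta_acc = (g * np.sin(theta)
                 - a * m * l * theta_dot ** 2 * np.sin(2 * theta) / 2.0
                 - a * np.cos(theta) * u) / (4.0 * l / 3.0 - a * m * l * np.cos(theta) ** 2)
    theta_new = theta + dt * theta_dot                    # Euler step (assumed)
    theta_dot_new = theta_dot + dt * theta_acc
    done = abs(theta_new) > np.pi / 2                     # pendulum fell below horizontal
    reward = -1.0 if done else 0.0
    return (theta_new, theta_dot_new), reward, done
```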

Figure 1 shows a close approximation to the state value function V of an optimal policy for the inverted pendulum domain over the two-dimensional state space \((\theta,\dot{\theta })\). The value function indicates that states which potentially offer high return are clustered within a zone where θ and \(\dot{\theta }\) have different signs and therefore the gravity force can be counteracted. Notice the nonlinearity of the function and the difficult approximation problem it presents.

Value Function Approximation, Fig. 1 Inverted pendulum: state value function of an optimal policy (3D and 2D) (Courtesy of Ioannis Rexakis)

Mountain Car

The mountain car problem is to drive an underpowered car from the bottom of a valley between two mountains to the top of the mountain on the right (Fig. 2, left). The car is not powerful enough to climb any of the hills directly from the bottom of the valley even at full throttle; it must build some energy by climbing first to the left (moving away from the goal) and then to the right. Three actions are allowed: forward throttle FT (+1), reverse throttle RT (−1), or no throttle NT (0). All three actions are noisy; Gaussian noise with μ = 0 and σ² = 0.2 is added to the chosen action. The state space of the problem is continuous and consists of the position x and the velocity \(\dot{x}\) of the car along the horizontal axis. The transitions are governed by the nonlinear dynamics of the system and depend on the current state \((x(t),\dot{x}(t))\) and the current (noisy) control u(t):

$$\displaystyle\begin{array}{rcl} x(t + 1)& =& \text{BOUND}_{x}[x(t) +\dot{ x}(t + 1)] {}\\ \dot{x}(t + 1)& =& \text{BOUND}_{\dot{x}}[\dot{x}(t) {}\\ & & +0.001u(t) - 0.0025\cos (3x(t))]\;, {}\\ \end{array}$$

where \(\text{BOUND}_{x}\) is a function that keeps x within [−1.2, 0.5], while \(\text{BOUND}_{\dot{x}}\) keeps \(\dot{x}\) within [−0.07, 0.07]. If the car hits the bounds of the position x, the velocity \(\dot{x}\) is set to zero. A penalty of −1 is given at each step as long as the position of the car is below the right bound (0.5). As soon as the car hits the right position bound, it has reached the goal; the episode ends successfully and a reward of 0 is given. The discount factor of the process is 0.99.
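
A step-function sketch of these dynamics (structure and names are illustrative):

```python
import numpy as np

def mountain_car_step(x, x_dot, throttle, rng=np.random.default_rng()):
    """One step: throttle is -1, 0, or +1 with Gaussian noise of variance 0.2 added."""
    u = throttle + rng.normal(0.0, np.sqrt(0.2))
    x_dot_new = np.clip(x_dot + 0.001 * u - 0.0025 * np.cos(3 * x), -0.07, 0.07)
    x_new = np.clip(x + x_dot_new, -1.2, 0.5)
    if x_new <= -1.2 or x_new >= 0.5:                     # hit a position bound
        x_dot_new = 0.0
    done = x_new >= 0.5                                   # right bound reached: goal
    reward = 0.0 if done else -1.0
    return (x_new, x_dot_new), reward, done
```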

Value Function Approximation, Fig. 2 Mountain car: state value function of an optimal policy (3D and 2D) (Courtesy of Ioannis Rexakis)

Figure 2 shows a close approximation to the state value function V of an optimal policy for the mountain car domain over the two-dimensional state space \((x,\dot{x})\). The value function indicates that in order to gain high return, the car has to follow a spiral in the state space that goes through states with progressively higher value. In practice, this means that the car has to move back and forth between the two mountains until sufficient energy is built to escape from the valley. Again, notice the high nonlinearity of the function and the hard approximation problem it presents.

Notation

The table summarizes the differences in names and symbols between the common notation (adopted here) and the alternative notation used in the literature.

| Name (common notation) | Symbol (common notation) | Symbol (alternative notation) | Name (alternative notation) |
|---|---|---|---|
| State space | \(\mathcal{S}\) | S | States |
| State | s, s′ | i, j | State |
| Action space | \(\mathcal{A}\) | U | Controls |
| Action | a | u | Control |
| Transition model | \(\mathcal{P}(s^{\prime}\vert s,a)\) | \(p_{ij}(u)\) | Transition probabilities |
| Reward function | \(\mathcal{R}\) | g | Cost function |
| Discount factor | γ | α | Discount factor |
| Policy | π | μ | Policy |
| State value function | V | J | Cost-to-go function |
| State-action value function | Q | Q | Q-factors |
| Parameters/weights | w | r | Parameters |
| Learning rate | α | γ | Step size |

Cross-References