Motivation

The problem of optimal decision making under uncertainty is crucial to both animals and artificially intelligent agents. Reinforcement learning (RL) addresses this problem by proposing that agents should choose actions that maximize an expected long-term return provided by the environment (Sutton and Barto 1998). To achieve this goal, an agent has to explore its environment, while at the same time exploiting the knowledge it currently has in order to achieve good returns. In many existing algorithms, this trade-off is achieved mainly through simple randomization of the action choices. Practical implementations rely heavily on heuristics; only a few theoretically principled approaches exist (see Sect. 5 for a more detailed discussion). In this article, we look at the exploration–exploitation trade-off from a fresh perspective: we use information-theoretic methods to analyze an existing exploration method, and to propose a new one.

Recently, an information-theoretic framework for behavioral learning has been introduced by one of us (Still 2009), which we use here to tackle reward-driven behavioral learning. First, we propose an intuitive optimality criterion for exploration policies which includes both the reward received and the complexity of the policy. Having a simple policy is not usually a stated goal in reinforcement learning, but it is desirable for bounded-rationality agents, and it is especially useful in the context of developmental agents, which should evolve increasingly complex strategies as they get more experience, and as their knowledge of the environment becomes more sophisticated. We show in Sect. 2 that the general solution of the proposed optimization problem is a Boltzmann-style exploration algorithm. This approach is closely related to rate distortion theory (Shannon 1948), which is based on the fact that approximating a true signal using a compressed representation will cause a loss, computed as the expected value of a distortion function. The choice of the distortion function implicitly provides the distinction of relevant and irrelevant features of the data. Here, the trade-off is between the return, on the one hand, and the average bit cost of the policy, on the other hand. This trade-off is controlled by a “temperature”-like parameter. At high temperatures, simplicity is more important than return. As the temperature decreases, return becomes increasingly important, and the policy approaches the optimal-return policy in the zero-temperature limit.

Animals often explore their environment not only to gather rewards, but also just for the sake of learning about it. Continuous learning is also useful in the context of reinforcement learning, because the reward structure of the environment may change over time, and the agent may need to adapt to this change. Hence, it is advantageous to know more about the environment than is strictly necessary to maximize the long-term return under the current conditions. Similar arguments have been presented in Singh et al. (2005) and Still (2009), as well as in many papers on transfer of knowledge in reinforcement learning (see Taylor and Stone 2009, for a survey). In Sect. 3, we formalize this idea, building on previous work (Still 2009). We seek behavioral policies that maximize future return, while at the same time maximizing predictive power, which we measure by the information captured about the future. Our objective function also contains a term which ensures that the agent continues to prefer simple policies. This term penalizes behaviors for retaining more memory about the past than is necessary to predict the future and to maximize the expected reward. As a consequence, it ensures that undesirable repetitive behaviors are avoided. We show that the resulting optimal policy contains a trade-off between exploration and exploitation which emerges naturally from the optimization principle.

The article is structured as follows. In Sect. 2, we lay the information-theoretic foundation of exploration for a reinforcement learning agent, whose main goal is to optimize long-term returns. Next, we formulate the problem of curiosity-driven reinforcement learning and solve it using a similar principle, including the maximization of predictive power (Sect. 3). Finally, we discuss algorithmic implementation issues in Sect. 4, and close with a discussion of the relationship of our approach to classical and current work in RL in Sect. 5.

Information-theoretic approach to exploration in reinforcement learning

We consider the standard RL scenario (Sutton and Barto 1998) in which an agent is interacting with an environment on a discrete time scale. At each time step t, the agent observes the state of the environment, \(x_t\in {\mathbf{X}},\) and takes an action \(a_t\in {\mathbf{A}}.\) In response to its action, the agent receives an immediate (extrinsic) reward, \(r_{t+1},\) and the environment changes to a new state \(x_{t+1}.\) We assume that the environment is Markovian. The reward is expressed as \(r_{t+1} = R(x_t, a_t),\) where \(R: {\mathbf{X}} \times {\mathbf{A}} \to {\mathbb{R}}\) is the reward function. The next state \(x_{t+1}\) is drawn from the distribution \(p(X_{t+1}|x_t, a_t).\) Together, the reward function and the state transition distributions constitute the model of the environment. A way of behaving, called a policy, \(\pi : {\mathbf{X}}\times {\mathbf{A}} \to [ 0, 1 ],\) is a probability distribution over actions, given the state. Each policy has an action-value function associated with it:

$$ Q^{\pi}(x,a) = E_{\pi} \left[ r_{t+1}+ \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots |X_t=x,A_t=a \right] , $$
(1)

where \(\gamma\in(0,1)\) is a discount factor expressing the fact that later rewards should be emphasized less. The interpretation of this value function is that the agent starts in state x, chooses a as its first action, and thereafter chooses actions according to π. The goal of a reinforcement learning agent is to find a policy that maximizes the value function for all state-action pairs. In a finite Markov decision process (MDP), there is always at least one deterministic policy that achieves this goal, and many methods can be used to find such a policy, e.g. Q-learning (Watkins 1989). A comprehensive review of methods can be found in Sutton and Barto (1998). In some situations, e.g. when the state space \({\mathbf{X}}\) is too large and value functions cannot be represented exactly, policies are compared with respect to a starting state distribution, \(p_0(X)\). Then, the goal is to maximize the expected return:

$$ V^{\pi} = \sum_{x\in {\mathbf{X}}} \sum_{a\in {\mathbf{A}}} p_0(x) \pi(a|x) Q^{\pi}(x,a) $$
(2)

The advantage of using this criterion is that it allows a policy to be characterized by a single number and offers a clear ordering of policies. The optimal policy for the MDP is then the one that maximizes \(V^{\pi},\) for example under a uniform starting distribution.
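For concreteness, both quantities above can be computed exactly when the model is known. The following is a minimal sketch (Python/NumPy, using a hypothetical randomly generated MDP; all array names and sizes are illustrative, not from the original) that solves the linear Bellman system for \(Q^{\pi}\) of Eq. (1) and then evaluates \(V^{\pi}\) of Eq. (2).

```python
import numpy as np

# Hypothetical toy MDP (3 states, 2 actions); P, R, and pi are illustrative stand-ins.
n_x, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_x), size=(n_x, n_a))   # P[x, a, x'] = p(x'|x, a)
R = rng.standard_normal((n_x, n_a))                # R[x, a]  = immediate reward
pi = np.full((n_x, n_a), 1.0 / n_a)                # pi[x, a] = pi(a|x), here uniform
p0 = np.full(n_x, 1.0 / n_x)                       # starting distribution p_0(x)

# Eq. (1) in matrix form: Q(x,a) = R(x,a) + gamma * sum_{x',a'} p(x'|x,a) pi(a'|x') Q(x',a'),
# a linear system in the |X|*|A| unknowns Q(x, a).
M = np.einsum('xay,yb->xayb', P, pi).reshape(n_x * n_a, n_x * n_a)
Q = np.linalg.solve(np.eye(n_x * n_a) - gamma * M, R.reshape(-1)).reshape(n_x, n_a)

# Eq. (2): V^pi = sum_x sum_a p0(x) pi(a|x) Q^pi(x, a).
V = np.einsum('x,xa,xa->', p0, pi, Q)
print(Q)
print(V)
```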

Suppose that we had a set of policies that all produce the same expected return. Which policy should be preferred? If one is to implement the policy on a real system, e.g. a robot, then it is reasonable to prefer the simplest policy, i.e. the policy that can be specified with the smallest number of bits. To make this precise, let us reinterpret the meaning of a policy. The action can be viewed as a summary of the state of the environment. Therefore, mapping states onto actions can be viewed as lossy compression. If a large group of states share the same optimal action, then that action can be viewed as a compressed representation for this “class” of states, a representation which is sufficient from the point of view of attaining a desired level of return.

In order to formalize this intuition, we revisit rate distortion theory, introduced by Shannon (1948). Rate distortion theory measures the cost of approximating a signal Z by a signal Y, using the expected value of some distortion function, d(Z, Y). This distortion measure can, but need not, be a metric. Lossy compression is achieved by assigning Z to Y via the probabilistic map \(P(Y|z)\), such that the mutual information:

$$ I(Z,Y) = \sum_{z\in \Upomega_Z} \sum_{y\in \Upomega_Y} P(z,y) \log_2 \left[ \frac{P(z,y)}{P(z)P(y)}\right] = \sum_{z\in \Upomega_Z} \sum_{y\in \Upomega_Y} P(y|z) P(z) \log_2 \left[ \frac{P(y|z)}{P(y)} \right] $$
(3)

is minimized. The minimization is constrained by fixing the expected distortion \(\sum_{z\in \Upomega_Z} \sum_{y\in \Upomega_Y} P(z,y) d(z,y).\) In other words, recalling that mutual information measures a rate of bit flow, among representations of the same quality the most compact one is preferred.

Now, let us interpret return as a function that measures quality, rather than distortion. The action is interpreted as a lossy summary of the state; hence, among the policies with the same return, we want to find the most compact one. Considering a set of policies that achieve a fixed average return \(V^{\pi},\) we can express this idea through the following optimization problem:

$$\begin{aligned}\mathop {\min}\limits_{\pi}\quad &I^{\pi}(A,X)\\\hbox{subject to:}\quad & V^{\pi} =\hbox{const}.\\ & \sum_{a\in{\mathbf{A}}} \pi(a|x) = 1, \forall x \; \in{\mathbf{X}}\\ &\pi(a|x)\geq 0, \forall x\in{\mathbf{X}}, \forall a\in{\mathbf{A}}\end{aligned} $$
(4)

Here, π is the policy we seek, which can be viewed as a probabilistic assignment of states to actions. The second constraint ensures normalization, and the third ensures positivity of the probability function. The average return related to policy π, \(V^{\pi},\) is defined in Eq. (2). The term \(I^{\pi}(A, X)\) denotes the information that the action A carries about the state X under policy π, where the joint distribution is given by \(p(X, A) = \pi(A|X)\, p^{\pi}(X)\):

$$ I^{\pi}(A,X) = \sum_{x\in {\mathbf{X}}} \sum_{a\in {\mathbf{A}}} \pi(a|x) p^{\pi}(x) \log\left[\frac{\pi(a|x)}{p^{\pi}(a)}\right]. $$
(5)

Note that \(I^{\pi}(A, X)\) also depends on the stationary distribution of states under policy π, \(p^{\pi}(X)\) (which we assume exists, as is standard in RL), and on the average action probability, \(p^{\pi}(A)=\sum_{x\in {\mathbf{X}}} p^{\pi}(x) \pi(A|x).\)
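To make Eq. (5) concrete, here is a small sketch (with hypothetical arrays `pi[x, a]` for the policy and `p_x` for the stationary state distribution) that evaluates \(I^{\pi}(A, X)\) directly from the definition. Note that a policy which uses the same single action in every state carries zero information about the state.

```python
import numpy as np

def policy_information(pi, p_x):
    """I^pi(A, X) of Eq. (5): information (in nats, matching the natural log
    of Eq. (5)) that the action carries about the state under policy pi."""
    p_a = p_x @ pi                       # p^pi(a) = sum_x p^pi(x) pi(a|x)
    info = 0.0
    for x in range(pi.shape[0]):
        for a in range(pi.shape[1]):
            if p_x[x] > 0 and pi[x, a] > 0:
                info += p_x[x] * pi[x, a] * np.log(pi[x, a] / p_a[a])
    return info

# Example: a deterministic policy using a single action everywhere costs 0 bits.
pi = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
print(policy_information(pi, np.full(3, 1 / 3)))   # 0.0
```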

This optimization problem is complex, because of the dependence on the stationary distribution, which in general is unknown (though computable) and which changes as the policy π evolves during learning. A standard approach for changing the policy in reinforcement learning is to fix the policy π, compute its return, and then consider a small perturbation around it at a given time step t. Let \(V^{\pi}_t(q)\) be the expected return if the agent acts according to policy π on all time steps, except on time step t, when it chooses its action according to a different action distribution q:

$$ V^{\pi}_t(q)= \sum_{x_0 a_0 \ldots x_t,a_t} p_0(x_0) \left(\prod_{j=0}^{t-1} \pi(a_j|x_j) p(x_{j+1}|x_j,a_j) \right) q(a_t|x_t)\left[\sum_{i=0}^{t-1} \gamma^i R(x_i,a_i) + \gamma^t Q^{\pi}(x_t,a_t)\right] $$
(6)

where \(q(a_t|x_t)\) is the new probability, which we seek, of choosing action \(a_t\) in state \(x_t\). Let \(I^{\pi}_{q}(A_t,X_t)\) denote the information that the action \(A_t\) carries about the state \(X_t\):

$$ I^{\pi}_{q}(A_t,X_t) = \sum_{x\in {\mathbf{X}}} \sum_{a\in {\mathbf {A}}} q(a|x) p_t^{\pi}(x) \log\left[\frac{q(a|x)}{p^{\pi}_t(a)}\right], $$
(7)

where x and a range over the possible values of the random variables \(X_t\) and \(A_t\), and \(p^{\pi}_t(x)\) is the probability of arriving at state x on time step t if the agent starts with a state drawn from \(p_0\) and chooses actions according to π:

$$\begin{aligned} p^{\pi}_t(x) &= p^{\pi}(X_t=x)\\ &= \sum_{x_0 a_0\ldots x_{t-1}a_{t-1}} p_0(x_0)\, \pi(a_0|x_0)\, p(x_1|x_0,a_0) \cdots \pi(a_{t-1}|x_{t-1})\, p(X_t=x|x_{t-1},a_{t-1}), \end{aligned} $$
(8)

and \(p^{\pi}_t(a) = p^{\pi}(A_t=a) = \sum_{x \in {\mathbf{X}}}p^{\pi}_t(x) q(A_t=a|x).\)
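In practice, \(p^{\pi}_t(x)\) need not be expanded as the full sum in Eq. (8); it can be propagated forward one step at a time. A minimal sketch, with hypothetical arrays `P[x, a, x']` for the transition model and `pi[x, a]` for the policy:

```python
import numpy as np

def state_distribution(p0, pi, P, t):
    """p^pi_t(x) of Eq. (8): push p_0 forward t steps under policy pi and model P."""
    p = np.asarray(p0, dtype=float)
    for _ in range(t):
        # p(x') = sum_x sum_a p(x) pi(a|x) p(x'|x, a)
        p = np.einsum('x,xa,xay->y', p, pi, P)
    return p
```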

Now, the optimization problem can be written as:

$$\begin{aligned} \mathop {\min}\limits_{q}&\quad I^{\pi}_{q}(A_t,X_t)\\ {\text{subject to:}} & \quad V^{\pi}_t(q) = \, \hbox{const}.\\ &\sum_{a\in {\mathbf{A}}} q(a|x) = 1, \forall x\in {\mathbf{X}}\\ & q(a|x)\geq 0, \forall x\in {\mathbf{X}} ,\forall a\in{\mathbf{A}}.\end{aligned} $$
(9)

This optimization problem has a dual form, where we maximize the average return under the constraint that the “size” of the policy is kept constant. Note that this is mathematically equivalent, but constitutes a potentially useful way to think about agents with limited computational capacity (e.g., robots with limited on-board computation). In this case, one may just want to find the best policy which still fits on the available physical system. Similar capacity constraints may apply to animals. The dual form is the following:

$$ \begin{aligned} \mathop {\max}\limits_{q} \quad &V^{\pi}_t(q)\\ \hbox{subject to:}\quad & I^{\pi}_{q}(A_t,X_t) = \,\hbox{const}.\\ & \sum_{a\in {\mathbf{A}}} q(a|x) = 1, \forall x\in {\mathbf{X}},\\ & q(a|x)\geq 0, \forall x\in {\mathbf{X}},\forall a\in {\mathbf{A}}.\end{aligned} $$
(10)

A similar cost function was given by Bagnell and Schneider (2003), as well as Peters and Schaal (2008); they used a linearization to compute a better type of policy gradient update.

We can now rewrite the constrained optimization principle, using the Lagrange multipliers \(\lambda\) and \(\mu(x):\)

$$ \mathop {\max }\limits_{q} \, F [ q ] , $$
(11)
$$ with: F [ q ] = V^{\pi}_t(q) - \lambda I^{\pi}_{q}(A_t,X_t) +\sum_{x\in {\mathbf{X}}} \mu(x) \sum_{a\in {\mathbf{A}}} q(a|x),$$
(12)

where we have dropped irrelevant constants. The objective function, F[q], is a functional of the policy q. The solution is obtained by setting the variation of F to zero, which leads to the optimal policy

$$ q_{\rm opt}(A_t=a|X_t=x) = \frac{p^{\pi}_t(a)}{Z(x)} e^{\frac{1}{\lambda} Q^{\pi}(x,\, a)} = \frac{1}{Z(x) }e^{\frac{1}{\lambda}Q^{\pi}(x,\, a) + \log\left[p^{\pi}_t(a) \right]}, \;\forall x\in {\mathbf{X}}, a \in {\mathbf{A}} $$
(13)

which has to be solved self-consistently, together with:

$$ p^{\pi}_t(a) = \sum_{x\in {\mathbf{X}}} q_{\rm opt}(a|x) p^{\pi}_t(x), \; \forall a\in {\mathbf{A}} $$
(14)

The partition function \(Z(x) = \sum_{a \in {\mathbf{A}}} p^{\pi}_t(a) e^{\frac{1}{\lambda} Q^{\pi}(x,a)}\) ensures normalization. This solution, Eq. (13), is similar to Boltzmann exploration, also known as softmax action selection (Sutton and Barto 1998). The difference is that here we have an additional “complexity penalty,” \(\log[p^{\pi}_t(a)].\) We note that by a similar calculation, if one tries to optimize the return at a fixed level of (Shannon) entropy, then one recovers exactly Boltzmann exploration. This follows immediately from the results in Rose (1998) and the arguments presented in Jaynes (1957). In contrast, here we penalize explicitly for the complexity of the policy, measured by the coding cost. The result is that there is a penalty for using more actions than necessary. This is useful not only when the agent has limited computational capacity, but also when the action space is very large (for example, in combinatorial optimization or inventory control problems). In this case, Eq. (13) may force the agent to use only a subset of the entire action space, which makes the learning task easier. The policy update in Eq. (13) also appears to be related to the ones suggested by Azar and Kappen (2010) and Peters et al. (2010), despite their different roots.
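Equations (13) and (14) can be solved by alternating between them until the policy stops changing, in the same spirit as standard rate distortion algorithms. The following is a minimal sketch under assumed inputs: an action-value array `Q[x, a]`, a state distribution `p_x` standing in for \(p^{\pi}_t(x)\), and a temperature `lam` for \(\lambda\).

```python
import numpy as np

def compressed_policy(Q, p_x, lam, n_iter=500, tol=1e-10):
    """Self-consistent solution of Eqs. (13)-(14): a softmax over Q/lambda with a
    log p(a) "prior" term, where p(a) is itself induced by the resulting policy."""
    n_x, n_a = Q.shape
    q = np.full((n_x, n_a), 1.0 / n_a)                 # initial policy
    p_a = p_x @ q                                      # initial action marginal, Eq. (14)
    for _ in range(n_iter):
        logits = Q / lam + np.log(p_a)[None, :]        # exponent of Eq. (13)
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        q_new = np.exp(logits)
        q_new /= q_new.sum(axis=1, keepdims=True)      # Z(x)
        p_a = p_x @ q_new                              # Eq. (14)
        if np.max(np.abs(q_new - q)) < tol:
            return q_new, p_a
        q = q_new
    return q, p_a
```

For large `lam` the fixed point becomes nearly independent of the state, so that \(I^{\pi}(A,X)\) approaches zero; the \(\lambda \to 0\) limit is discussed below.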

This Boltzmann-style softening of the policy optimally trades the complexity of the policy for average return. The trade-off is governed by the temperature-like parameter \(\lambda,\) and exploration takes place due to fluctuations only at non-zero temperature, when emphasis is put on the compactness of the policy. As \(\lambda\) tends to zero, the information penalty in Eq. (12) becomes less relevant, and in the limit the optimal policy becomes deterministic if there are no degeneracies. The optimal action then becomes:

$$ a_{{\rm opt}}(x) = \hbox{arg}\max_{a} \left[ Q^{\pi}(x,a) \right] ,$$
(15)

i.e. the action is chosen to maximize return. In this framework, explorative behavior is driven by randomness, the level of which is controlled by λ.

Curiosity-driven reinforcement learning

Intuitively, exploration should be driven by the curiosity to visit unknown areas of the state space of the coupled world-agent system. However, the theory we have laid out thus far lacks any notion of curiosity. Apart from the coding-rate constraint, the agent simply maximizes the return, defined in terms of the external rewards received from the environment.

In this section, we present a formalization of curiosity based on information-theoretic principles, drawing on ideas from Still (2009), where one of us has postulated that the main goal of a curious agent is to create for itself a world that is interesting. This is quantified by means of the predictive power that the agent’s behavior carries, as defined by Still (2009). In the present context of a fully observable Markovian environment, the agent’s sensations directly correspond to the state of the world, and predictive power simplifies to the mutual information carried by the action and the current state of the environment about the future state of the environment, \(I[\{A_t, X_t\}; X_{t+1}]\).

Our goal then becomes to find the policy that maximizes predictive power together with expected return. We note that this is also important in extensions to partially observable MDPs (POMDPs), in which the exact state of the environment is unknown. Maximizing predictive power is highly desirable in this setting, because it means that the agent is able to predict its future sensations well, given the past data. An extension of the work in this section to the POMDP setting is possible, because the general framework (Still 2009) does not assume that the environment is fully observable.

Predictive information, defined as the mutual information shared between the past and the future of a time series, measures temporal structure and is related to other measures of complexity (Bialek et al. 2001; Crutchfield and Feldman 2001). This measure has appeared under other names in the literature, such as “stored information” in Shaw (1984); see also Crutchfield and Feldman (2003) and references therein. It provides a measure of how complex, surprising, or “interesting” a time series is. Intuitively, if a time series has high predictive information, there will be data available for learning about a variety of situations, and also about different ways of behaving. Formally, by taking predictive power into account, our objective becomes

$$ \mathop {\max }\limits_{q} \tilde{F} [ q ] , $$
(16)
$$ with: \tilde{F} [ q ] = I^{\pi}_q(\{X_t,A_t\}, X_{t+1}) + \; \alpha V^{\pi}_t(q) \; - \lambda I^{\pi}_q(A_t,X_t) + \sum_{x\in{\mathbf{X}}}\mu(x) \sum_{a\in {\mathbf{A}}} q(a|x). $$
(17)

where we have dropped irrelevant constants. Variation of \( I^{\pi}_q (\{X_t,A_t\}, X_{t+1})\) with respect to q results in the additional contribution

$$ {\cal D}^{\pi}(x, a):= D_{\rm KL}\left[p(X_{t+1}|x, a) \|p^{\pi}(X_{t+1})\right] $$
(18)

to the exponent in the solution. The Kullback-Leibler divergence is defined as

$$ D_{\rm KL} \left[ p_1(X) \,\|\, p_2(X) \right] = \sum_{x} p_1(x) \log\left[{p_1(x) \over p_2(x)}\right], $$
(19)

and

$$ p^{\pi}(X_{t+1}=x^{\prime}) = \sum_{a\in {\mathbf{A}}} \sum_{x\in {\mathbf{X}}} p(X_{t+1}=x^{\prime}|x,a) q(a| x) p_t^{\pi}(x), \forall x^{\prime}\in {\mathbf{X}} $$
(20)

With the extra contribution (18), the optimal solution is given by

$$ q_{\rm opt}(A_t=a|X_t=x) = \frac{p_t^{\pi}(a)}{Z(x)}e^{\frac{1}{\lambda} \left({\cal D}^{\pi}(x,\,a) + \; \alpha Q^{\pi}(x,\,a)\right)} ,{\quad}\forall x\in {\mathbf{X}}, \forall a\in {\mathbf{A}}. $$
(21)

The first term in the exponent of Eq. (21) drives the agent toward exploration: the optimal action maximizes the divergence between the distribution over the next state, given the current state x and the action a, and the average distribution over the next state. This means that the optimal policy favors actions that produce a next state whose conditional probability distribution is far from the average distribution. The second term is the value maximization, as before. The exponent in Eq. (21) thus represents a trade-off between exploration and exploitation.
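When the transition model is available, the exploration bonus of Eq. (18) and the averaged next-state distribution of Eq. (20) are straightforward to evaluate. A minimal sketch with assumed arrays `P[x, a, x']` (model), `q[x, a]` (current policy), and `p_x` (standing in for \(p_t^{\pi}\)):

```python
import numpy as np

def exploration_bonus(P, q, p_x):
    """D^pi(x, a) of Eq. (18): KL divergence between p(X_{t+1}|x, a) and the
    policy-averaged next-state distribution p^pi(X_{t+1}) of Eq. (20)."""
    p_next = np.einsum('xay,xa,x->y', P, q, p_x)       # Eq. (20)
    D = np.zeros(q.shape)
    for x in range(q.shape[0]):
        for a in range(q.shape[1]):
            mask = P[x, a] > 0                         # 0 log 0 terms contribute 0
            D[x, a] = np.sum(P[x, a, mask] * np.log(P[x, a, mask] / p_next[mask]))
    return D
```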

As \(\lambda \to 0,\) the chosen action becomes the one that maximizes the functional in the exponent of Eq. (21):

$$ a_{\rm opt} (x) = {\rm arg}\max_{a} \left[ {\cal D}^{\pi}(x,a)+ \alpha Q^{\pi}(x,a)\right] $$
(22)

Importantly, the optimal action policy includes a trade-off between exploration and exploitation, even when the policy is deterministic. Recall that this is not the case for Boltzmann exploration, as we can see from comparison with Eq. (15). There, the optimal action under a deterministic policy maximizes only the return, subject possibly to size constraints, and exploration arises only at nonzero temperature, due to randomization.

The parameter \(\alpha\) can be viewed as a measure of how interested the agent is in obtaining a reward. For example, if the reward is energy intake, then \(\alpha\) could be set by measuring the charge of a robot’s batteries, and would represent how “hungry” the agent is.

Illustrations

To build some intuition about what this approach to curiosity-driven RL does, we consider two simple examples of a world in which there are two states, \(x \in \{0,1\}, \) and assume that \(\alpha=0, \) so the optimal action becomes the one that maximizes only the predictive power. All calculations for this section can be found in the Appendix.

First, consider a continuous range of actions \(a \in [ 0,1 ]. \) Let the value of the action express how strongly the agent tries to stay in the same state or to leave it: a = 0 means that the agent wants to remain in the same state, a = 1/2 means that the agent is ambivalent about staying or leaving, and a = 1 means that the agent tries to switch states. Let the Markovian transitions of the environment be given by \(p({\bar x}|x,a) = a. \) Then, the optimal policy, Eq. (21), chooses only the two actions that result in the largest predictability, namely a = 0 and a = 1, and it chooses between these two actions with equal probability. This “clever random” policy is an example of the balance between control and exploration mentioned by Still (2009). Note the difference from a random policy, which would assign equal probability to all possible values of a.

As a second case, consider a two-state world in which there are only two actions, STAY or FLIP, \(a \in \{s,f\},\) and the transition probabilities are such that one state is completely reliable: p(0|0, s) = p(1|0, f) = 1, while the other state is completely unreliable: p(0|1, s) = p(0|1, f) = 1/2. This is a test of our information-theoretic objective: if we are doing the right thing, then we should find that in the absence of a reward (or of an interest in a reward, α = 0), the optimal curious policy still enables exploration of the combined state-action space; that is, the optimal policy should not stay in the more reliable state with probability one, i.e., we should not find π(A = s|0) = 1. Indeed, maximizing \(I[\{X_t,A_t\}; X_{t+1}]\) results asymptotically in the optimal policy π(A = s|0) = 3/4, which balances exploration against choosing a reliable state, i.e. control.
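This second example can also be checked numerically: a brute-force search over π(STAY|0), with the predictive information evaluated at the stationary state distribution induced by the policy, recovers the stated maximizer of 3/4. The sketch below uses illustrative variable names; the analytical derivation is in the Appendix.

```python
import numpy as np

# Two-state, two-action world: state 0 is reliable, state 1 is unreliable.
P = np.zeros((2, 2, 2))            # P[x, a, x'] = p(x'|x, a); a=0: STAY, a=1: FLIP
P[0, 0] = [1.0, 0.0]
P[0, 1] = [0.0, 1.0]
P[1, 0] = [0.5, 0.5]
P[1, 1] = [0.5, 0.5]

def predictive_info(u, v=0.5):
    """I[{X_t, A_t}; X_{t+1}] in bits, with pi(STAY|0)=u and pi(STAY|1)=v,
    evaluated at the stationary state distribution of the induced chain.
    (v is irrelevant here: both actions from state 1 give the same transitions.)"""
    pi = np.array([[u, 1 - u], [v, 1 - v]])
    T = np.einsum('xa,xay->xy', pi, P)          # state-to-state transition matrix
    p_x = np.full(2, 0.5)
    for _ in range(500):                        # power iteration to stationarity
        p_x = p_x @ T
    joint = np.einsum('x,xa,xay->xay', p_x, pi, P)
    p_next = joint.sum(axis=(0, 1))
    info = 0.0
    for x in range(2):
        for a in range(2):
            for y in range(2):
                if joint[x, a, y] > 0:
                    info += joint[x, a, y] * np.log2(P[x, a, y] / p_next[y])
    return info

us = np.linspace(0.0, 1.0, 101)
print(max(us, key=predictive_info))             # 0.75, i.e. pi(STAY|0) = 3/4
```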

Algorithmic issues

The optimal solution consists of Eq. (21), which has to be solved self-consistently, together with Eq. (20). Furthermore, the action-value function Q has to be estimated. In this section, we discuss how this can be implemented in practice.

We propose an implementation that is inspired by the usual Boltzmann exploration algorithm. The algorithm proceeds as follows:

  1. Initialize \(t \leftarrow 0\) and get the initial state \(x_0\). Initialize \(\pi(a|x), \forall x\in {\mathbf{X}}, \forall a \in{\mathbf{A}}\) (e.g., uniformly at random) and initialize the action-value function Q.

  2. Repeat at every time step t:

    (a) Update \(p_t(x), \forall x \in{\mathbf{X}}\) (the current estimate of the state visitation distribution).

    (b) Initialize \(q^{(0)}(a|x), \forall a \in{\mathbf{A}}, \forall x\in {\mathbf{X}}\).

    (c) Repeat the following updates until the difference between \(q^{(j)}\) and \(q^{(j+1)}\) is small:

      $$ p^{(j)}(a) \leftarrow \sum_{x} q^{(j)}(a|x) p_t(x), \forall a \in {\mathbf{A}} $$
      (23)
      $$p^{(j)}(x^{\prime}) \leftarrow \sum_{x} \sum_{a} p(x^{\prime}|a,x) q^{(j)}(a|x) p_t(x), \forall x^{\prime}\in {\mathbf{X}}$$
      (24)
      $$q^{(j+1)}(a|x) \leftarrow \frac{p^{(j)}(a)}{Z^{(j)}(x)} \exp \left[ \frac{1}{\lambda} \left({\cal D}^{\pi}(x,\, a) + \alpha Q(x,\,a) \right) \right], \forall x \in{\mathbf{X}}, \forall a\in{\mathbf{A}}$$
      (25)

      Update \(\pi\leftarrow q^{(j+1)}\)

    (d) Choose action \(a_t \sim \pi(\cdot|x_t)\) and obtain the reward \(r_{t+1}\) and the next state \(x_{t+1}\).

    (e) Update the action-value function estimate Q.

    (f) \(t \leftarrow t+1\).

In this algorithm, step 2(a) can be performed exactly by using the true model and all the previous policies; this update is similar to the one in Eq. (24), and is exactly the same type of update used in the forward algorithm of a hidden Markov model (HMM). However, this computation can be expensive if the number of states is large. In that case, we would instead use the state samples \(x_k, k \le t,\) to estimate \(p_t(x)\) approximately.

The initial value \(q^{(0)}(a|x)\) is important, as it will influence the point to which iteration 2(c) converges (convergence is guaranteed to a locally optimal solution). One can start with the result of the previous iteration, under the assumption that the policy will change fairly smoothly from one time step to the next.
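As a concrete sketch of iteration 2(c), the following combines Eqs. (23)–(25) in one function. The inputs are assumptions of this sketch: the model `P[x, a, x']`, value estimates `Q[x, a]`, the state-distribution estimate `p_t`, the temperature `lam`, the reward weight `alpha`, and the warm start `q0` discussed above (assumed strictly positive, and with every state reachable under it wherever the model assigns probability).

```python
import numpy as np

def inner_iteration(P, Q, p_t, lam, alpha, q0, n_iter=200, tol=1e-8):
    """Step 2(c): iterate Eqs. (23)-(25) until the policy stops changing."""
    q = q0.copy()                                        # assumes q0 > 0, so p_a stays positive
    for _ in range(n_iter):
        p_a = p_t @ q                                    # Eq. (23)
        p_next = np.einsum('xay,xa,x->y', P, q, p_t)     # Eq. (24)
        with np.errstate(divide='ignore', invalid='ignore'):
            # Exploration bonus D^pi(x, a) of Eq. (18); 0 log 0 terms are dropped.
            D = np.where(P > 0, P * np.log(P / p_next[None, None, :]), 0.0).sum(axis=2)
        logits = (D + alpha * Q) / lam + np.log(p_a)[None, :]   # exponent of Eq. (25)
        logits -= logits.max(axis=1, keepdims=True)             # numerical stability
        q_new = np.exp(logits)
        q_new /= q_new.sum(axis=1, keepdims=True)               # Z^{(j)}(x)
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new
    return q
```

The policy update that closes step 2(c) is then simply `pi = inner_iteration(P, Q, p_t, lam, alpha, pi)`.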

In principle, the action-value function Q should be recomputed exactly at every step, using the known model and the computed policy. This involves solving a system of linear equations with \(|{\mathbf{X}}| \times |{\mathbf{A}}|\) unknowns. While this may be feasible for small environments, it is computationally expensive for larger problems. In this case, the value \(Q(x_t, a_t)\) can instead be updated incrementally, using the standard temporal-difference learning approach (i.e. a learning rule like Sarsa or Q-learning; for details see Sutton and Barto 1998). Intuitively, this approach should work well if the policy changes slowly, because the action-value function will only change around the current state \(x_t\). Similarly, in order to save computation, the policy may be recomputed only at \(x_t\), rather than at all states \(x\in {\mathbf{X}},\) as indicated in Eq. (25).
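The incremental alternative is just a standard temporal-difference rule; a minimal sketch in the Expected-Sarsa style (the step size `eta` and the use of the current policy's expectation are choices of this sketch, with `Q` and `pi` NumPy arrays as before):

```python
def td_update(Q, pi, x, a, r, x_next, gamma=0.9, eta=0.1):
    """Move Q(x, a) toward r + gamma * E_pi[Q(x_next, .)] instead of re-solving
    the full linear system at every step."""
    target = r + gamma * float(pi[x_next] @ Q[x_next])
    Q[x, a] += eta * (target - Q[x, a])
    return Q
```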

If the agent has no knowledge of the environment, then it can use the samples received to fit an approximate model, \(\hat{p}(X_{t+1}|x_t, a_t),\) and then use this model in the computation above. The model, action values, and distributions of interest can all be updated incrementally from samples. If a batch of samples is gathered first and then we run the algorithm above, we obtain an approach fairly close to batch model-based reinforcement learning. If on every time step t we update the model estimate \(\hat{p}(X_{t+1}|X_t, A_t)\) and immediately use it in the policy computation, we obtain an algorithm very close to incremental, model-free reinforcement learning. In general, the optimal policy can be re-computed every T time steps, and the approximate model can be improved using the T samples from the intervening period.
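A simple way to maintain the approximate model \(\hat{p}(X_{t+1}|x_t, a_t)\) from samples is by smoothed transition counts (a sketch; the smoothing constant is an arbitrary choice):

```python
import numpy as np

class EmpiricalModel:
    """Counts n(x, a, x') and returns a smoothed estimate of p(x'|x, a)."""
    def __init__(self, n_x, n_a, prior=1.0):
        self.counts = np.full((n_x, n_a, n_x), prior)   # Dirichlet-style smoothing
    def update(self, x, a, x_next):
        self.counts[x, a, x_next] += 1.0
    def p_hat(self):
        return self.counts / self.counts.sum(axis=2, keepdims=True)
```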

The temperature-like parameter \(\lambda\) determines how deterministic the resulting policy is. There are different possibilities for choosing this parameter. In the simplest case, the parameter is fixed to a pre-specified value, for example dictated by the capacity/memory constraints of a robot. This selects a fixed trade-off between complexity and utility. More generally, a process known as deterministic annealing (Rose 1998) can be employed at every time step. It consists of starting with a large temperature, running the iterative algorithm until convergence, lowering the temperature by a factor (the annealing rate), and repeating this process until the policy is deterministic, always using the current result as the initial condition for the iterations at the next (lower) temperature. This method obtains, at each time step, the deterministic policy that is optimal according to the criterion. The procedure is computationally intensive, but it guarantees that actions are always chosen in a way that maximizes the optimization criterion, provided that the annealing rate is sufficiently slow. Alternatively, the temperature can be fixed during each time step, but lowered as a function of time, \(\lambda(t),\) until it approaches zero. This approach is preferable when the agent’s knowledge about the world increases with time. Methods such as the ones discussed by Still and Bialek (2004) can be used to find (a bound on) \(\lambda(t).\) Finally, if a complexity constraint is given by the design of the agent, then this scheme can be modified to include a \(\lambda_{{\rm min}} = \lim_{t \rightarrow \infty} \lambda(t).\)
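The deterministic-annealing schedule can be written as an outer loop around the inner iteration sketched above (reusing the hypothetical `inner_iteration`; the starting temperature, annealing rate, and stopping temperature below are illustrative choices):

```python
def anneal_policy(P, Q, p_t, alpha, q0, lam_start=10.0, rate=0.9, lam_min=1e-3):
    """Run the fixed-point iteration at decreasing temperatures, warm-starting
    each temperature from the solution found at the previous (higher) one."""
    q, lam = q0, lam_start
    while lam > lam_min:
        q = inner_iteration(P, Q, p_t, lam, alpha, q)   # defined in the sketch above
        lam *= rate
    return q
```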

If the algorithm is implemented using only exact computations (i.e., the most computationally expensive version, outlined above), it is guaranteed to converge to a locally optimal solution for the proposed optimization criterion. Convergence analysis for the case in which samples are used incrementally is quite tricky and we leave it for future work.

Related work

The textbook by Sutton and Barto (1998) summarizes several randomization-based exploration approaches used in reinforcement learning, such as Boltzmann exploration and \(\epsilon\)-greedy (in which there is simply a fixed, small probability of trying out actions which appear sub-optimal). Many heuristic variations have been proposed, in which bonuses are added to the value function to encourage more efficient exploration (e.g. Thrun and Moeller 1992; Ratitch and Precup 2003).

A different strategy, which yields interesting theoretical results, is that of optimism in the face of uncertainty: if a state has not been visited sufficiently for the agent to be familiar with it, it is automatically considered good, so the agent will be driven toward it. This ensures that an agent will explore new areas of the state space. The first sample-complexity results for reinforcement learning using this idea were provided by Kearns and Singh (1998). The authors assumed that a state is “known” if it has been visited a sufficiently large number of times. The RMAX algorithm proposed by Brafman and Tennenholtz (2002) is a practical implementation of this idea. An extensive theoretical analysis of this approach was given by Strehl et al. (2006), showing sample-complexity results both for reinforcement learning methods that learn a model and ones that learn directly a value function. Those PAC-style bounds are not directly related to our work.

Previous work on curiosity-driven reinforcement learning is centered around the idea that agents are motivated by an internal reward signal, and in the process of maximizing this reward, they learn a collection of skills. The work we presented in Sect. 3 could be interpreted as defining implicitly an intrinsic reward, based on the idea of maximizing how interesting the time series experienced by the agent is. In early work, Schmidhuber (1991) proposed different kinds of internal reward signals. More recently, a hierarchical learning approach was put forth by Singh et al. (2005). In this case, both an external and an internal reward signal are used to learn a behavior policy. At the same time, the extrinsic reward is used to learn multiple temporally extended behaviors. The particular form proposed for the intrinsic reward attempts to provide a novelty bonus. We note that the intrinsic reward is only used to generate behavior. The paper assumes that there are certain events in the world that are “salient” and which the agent will be motivated to seek. We furthermore note that the intrinsic rewards proposed by Singh et al. (2005) also involve the probability of the next state given the current state, under different extended behaviors; however, this is proposed as a heuristic. The relationship to our results remains to be explored. Oudeyer and colleagues implemented these ideas in robotics tasks (see e.g. Oudeyer et al. 2007). Recently, Schmidhuber (2009) proposed a novel approach to creativity and exploration which is related to information theory.

Information-theoretic approaches inspired by some form of rate distortion theory have been used in machine learning, for example for clustering and dimensionality reduction (Rose et al. 1990; Pereira et al. 1993; Rose 1998; Tishby et al. 1999; Chigirev and Bialek 2004; Still and Bialek 2004; Still et al. 2004; Chechik et al. 2005). Ay et al. (2008) explore maximization of \(I[X_t; X_{t+1}]\) for a specific class of models. In contrast, the work we have presented in Sect. 3 maximizes predictive power while keeping the coding rate, or model complexity, fixed, thereby penalizing policies with more memory than is needed for prediction. This is in line with Still (2009). Tishby and Polani (2010) proposed an MDP formulation in which rewards are traded off against information. The authors observe that information also obeys Bellman-like equations, and use this observation to set up dynamic programming algorithms for solving such MDPs. Our work is different in a few important aspects. First, the development in Tishby and Polani (2010) is with respect to a single state distribution, while we account for the state distributions induced by different policies. Second, in their formulation, the information value ends up mixed with the value function. In our case, information influences the exploration policy, but ultimately, one can still obtain a value function, and a policy, that reflect only reward optimization. Another important distinction is that in their formulation, deterministic policies are more “complex” than randomized ones, whereas in our case, a deterministic policy that is constant everywhere would still be considered “simple.” We anticipate that such a treatment will be important in the generalization of these ideas to continuous states and actions (where simple policies will share the same choices across large subsets of states). Interestingly, Little and Sommer (2011) considered several measures for estimating (or approximating) the information gain of an action in the context of past data. They found that learning efficiency is strongly dependent on temporal integration of information gain, but less dependent on the particular measure used to quantify information gain.

The recent work on differential dynamic programming (e.g. Todorov 2009; Azar and Kappen 2010) addresses the problem of finding closed-form solutions to reinforcement learning problems, by reformulating the optimization objective. More specifically, the system is considered to have “passive” dynamics (induced by a default policy). The optimization criterion then includes both the value function and a term that penalizes deviations from the “passive” dynamics (using the KL-divergence between the state distributions induced by the sought policy and the default policy). This line of work comes from the perspective of continuous control. The results obtained for the optimal policy bear some similarity with the updates we obtain, but the motivation behind the approach is very different. A similarly defined policy update is also obtained by Peters et al. (2010), coming from yet another angle. They formulate an optimization problem in which the goal is to utilize the existing samples as well as possible. They give a policy search algorithm in which new policies are penalized if they induce a state distribution that is different from the empirical distribution observed in the past data. The fact that very different points of view lead to syntactically similar policy updates is intriguing, and we plan to study it further in future work.

Conclusion and future work

In this article, we introduced a new information-theoretic perspective on the problem of optimal exploration in reinforcement learning. We focused, for simplicity, on Markovian environments, in which the state of the environment is observable and does not have to be learned.

First, we showed that a soft policy similar to Boltzmann exploration optimally trades return for the coding cost (or complexity) of the policy. Second, by postulating that an agent should, in addition to maximizing the expected return, also maximize its predictive power, at a fixed policy complexity, we derived a trade-off between exploration and exploitation that does not rely on randomness in the action policy. In this view, exploration is an emergent behavior that arises to maximize predictive power. This may be a more adequate way of modeling explorative behavior than previous schemes, such as Boltzmann exploration, where exploration hinges upon randomization of the action policy.

Our results can be extended easily to POMDPs, using the framework of Still (2009). In that case, the additional goal is to build a good, predictive internal representation of the environment. Our theoretical framework can also be extended to continuous states and actions; very little work has been done so far in this direction (Wingate and Singh 2007). A third important direction for future work is empirical: we are currently evaluating the proposed method in comparison to existing exploration techniques. Experience in large domains will be especially useful in the future.