1 Introduction

Controlling exploration and exploitation is one of the main challenges when developing autonomous learning agents. In general, exploratory actions increase knowledge about the long-term utility of actions (long-term optimization), but often incur negative reward due to randomly selected bad actions. However, exploiting knowledge (short-term optimization) may also lead to sub-optimal action selections if the utility of an optimal action is underestimated. As a consequence, the dilemma between exploration and exploitation arises [1].

Several approaches exist to tackle this problem in reinforcement learning. Using action counters to determine confidence intervals is a popular approach in the domain of machine learning [2–4]. In contrast, neurobiologically inspired models utilize the immediate reward [5] or the temporal-difference error [6] to adapt the amount of exploration, e.g. via the meta-parameter \(\tau \) of Softmax action selection. In a more recent approach, we proposed to use the value difference (the product of the temporal-difference error and the learning rate) as an indicator of the uncertainty of knowledge about the environment [7]. This indicator is used for controlling the action-selection policy between Greedy and Softmax, which aims at robustness with regard to stochastic rewards and even non-stationary environments [8].

In this paper we consider the problem of adapting the amount of exploration and exploitation in model-free reinforcement learning. We combine our recently proposed “REINFORCE exploration control” (REC) policies [9, 10] with replacing eligibility traces [11] in order to tackle the problem of delayed rewards. We investigate the proposed algorithm on a reward model of a crawling robot that learns to walk forward through sensorimotor interactions, and also on the mountain–car problem with two goals.

2 Methods

We investigate the problem of maximizing an agent’s cumulative reward over time, which in our experiments can be described as learning in a Markovian Decision Process (MDP) in discrete time \(t \in \left\{ 0, 1, 2, \dots \right\} \) [1]. In general, an MDP consists of a finite set of states \(\mathcal {S}\) and a finite set of possible actions in each state, \(\mathcal {A}(s) \subseteq \mathcal {A}, \forall s \in \mathcal {S}\). A transition function \(P(s,a,s')\) describes the (stochastic) behavior of the environment, i.e. the probability of reaching successor state \(s'\) after selecting action \(a \in \mathcal {A}(s)\) in state \(s\). After the selection of an action, a reward \(r \in \mathbb {R}\) is received from the environment and the agent finds itself in a successor state \(s' \in \mathcal {S}\). Since the choice of action \(a\) determines both the reward and the successor state, the general goal is to find an optimal action-selection policy, \(\pi ^* : \mathcal {S} \rightarrow \mathcal {A}\), that maximizes the cumulative reward.

2.1 Action-Value Functions

Action-selection policies can either be learned using model-based techniques (the models of rewards and dynamics are approximated separately) or model-free techniques (only the value function is learned) [1]. Both require learning a value function for the prediction of future reward. In the following, we are particularly interested in optimizing model-free techniques, because they are closely related to reinforcement learning in the brain [12, 13].

Action-selection policies can be derived from value functions representing the knowledge learned so far about the future reward [1]. An action-value function, \(Q(s,a)\), approximates the cumulative discounted reward for following policy \(\pi \) when starting in state \(s\) and taking action \(a\),

$$\begin{aligned} Q(s,a) = E_\pi \left\{ \sum _{k=0}^\infty \gamma ^k r_{t+k+1} \vert s_t = s, a_t=a \right\} \!, \end{aligned}$$
(1)

where \(0 < \gamma \le 1\) is a discounting factor used for weighting future rewards in \(Q(s,a)\). Since \(Q(s,a)\) depends on rewards received in the future, the cumulative reward is considered to be an expected value \(E_\pi \{\cdot \}\) which depends on the action-selection policy \(\pi \). The parameter \(\gamma \) is allowed to take on the value of \(1\) only in episodic learning problems, i.e. when an episode terminates after a maximum of \(T\) steps, which prevents \(Q(s,a)\) from growing to an infinite sum. In case a fixed horizon does not exist, \(\gamma \) must be chosen \(< 1\).
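As a simple numerical illustration of Eq. (1), the following sketch computes the discounted return for a short, purely hypothetical reward sequence (not taken from the experiments in this paper):

```python
# Discounted return of Eq. (1) for a hypothetical reward sequence r_{t+1}, r_{t+2}, ...
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.5]

discounted_return = sum(gamma ** k * r for k, r in enumerate(rewards))
print(discounted_return)  # 0.9**2 * 1.0 + 0.9**3 * 0.5 = 1.1745
```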

2.2 Q-Learning with Replacing Eligibility Traces

The action-value function is sampled from interactions with the environment. For this we use Watkins’ \(Q\)-learning algorithm [14], which adapts reward estimates according to:

$$\begin{aligned} b^*&\leftarrow \mathop {{\arg \max }}\limits _{b \in \mathcal {A}(s')} Q(s', b) \end{aligned}$$
(2)
$$\begin{aligned} Q(s,a)&\leftarrow Q(s,a) + \alpha \underbrace{ \left( r + \gamma Q(s', b^*) - Q(s, a) \right) }_{\text {Temporal-Difference Error} {\,\,} \varDelta } \end{aligned}$$
(3)

where \(0<\alpha \le 1\) denotes a step-size parameter [15]. For an optimal action-value function, \(Q^*(s,a)\), from which the optimal (greedy) policy \(\pi ^*\) can be derived, the temporal-difference error is zero for all observation tuples \((s,a,r,s')\).

As proposed by Singh and Sutton [1, 11], we also combine \(Q\)-learning with replacing eligibility traces, which is known as \(Q(\lambda )\)-learning and is shown in Algorithm 1. The advantage is that rewards are propagated faster to previously taken actions, which tackles the general problem of accounting for delayed rewards in reinforcement learning. Visited state-action pairs are memorized in an eligibility-trace list, where each entry is associated with an additional memory variable, the eligibility trace \(e(s_e,a_e)\), used for weighting the current temporal-difference error \(\varDelta \) in the action values \(Q(s_e,a_e)\) of previously taken actions as well. The trace for the last taken action is set to \(e(s,a) \leftarrow 1\), which means that the temporal-difference error is fully credited.

All traces are decayed by \(\gamma \lambda \) after taking an action. In this sense, \(0 \le \lambda \le 1\) denotes a time constant. For \(\lambda =0\), credit is only assigned to the last taken action, i.e. Eq. (3) is only computed for the most recent observation tuple \((s,a,r, s')\). On the other hand, \(\lambda =1\) corresponds to Monte-Carlo backups. Eligibility traces \(e(s_e, a_e)\) are removed from the list as soon as their value falls below a small threshold \(\Theta \), and thus the parameter \(\lambda \) implicitly controls over how many past actions the current temporal-difference error is propagated. The list of eligibility traces is also cleared in case an exploratory action is selected, because \(Q\)-learning learns the action-value function independently of the actual policy (off-policy learning), assuming that a greedy policy is followed in the limit \(t \rightarrow \infty \). Crediting the reward of exploratory actions to previously taken actions therefore has no necessary relationship to a greedy policy [1].
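Since Algorithm 1 is not reproduced here, the following Python sketch illustrates a single \(Q(\lambda )\) backup with replacing eligibility traces as described above. It assumes a tabular representation; the data structures and names are our own and not taken from the original listing.

```python
# A minimal sketch of one Q(lambda) backup with replacing eligibility traces
# (cf. Eqs. (2)-(3)); Q and e map (state, action) pairs to floats, e.g.
# Q = defaultdict(float), e = {}.
def q_lambda_update(Q, e, s, a, r, s_next, actions_of,
                    alpha=0.1, gamma=0.95, lam=0.8, theta=1e-3,
                    exploratory=False):
    # Greedy successor action b* (Eq. (2)) and temporal-difference error (Eq. (3))
    b_star = max(actions_of(s_next), key=lambda b: Q[(s_next, b)])
    delta = r + gamma * Q[(s_next, b_star)] - Q[(s, a)]

    e[(s, a)] = 1.0                               # replacing trace: fully credit last action
    for (s_e, a_e), trace in list(e.items()):
        Q[(s_e, a_e)] += alpha * delta * trace    # trace-weighted backup
        e[(s_e, a_e)] = gamma * lam * trace       # decay all traces by gamma*lambda
        if e[(s_e, a_e)] < theta:                 # drop traces below threshold Theta
            del e[(s_e, a_e)]

    if exploratory:                               # off-policy: cut traces after exploration
        e.clear()
    return delta
```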

3 Exploration and Exploitation

In the following we describe two basic strategies for deriving action-selection probabilities \(\pi (s,a)\) from the action-value function, and afterwards we show how to combine them for meta-parameter learning.

Algorithm 1. \(Q(\lambda )\)-learning with replacing eligibility traces (listing not reproduced).

The \(\varepsilon \)-Greedy policy selects a uniformly distributed random action with probability \(0 \le \varepsilon < 1\) [1]. With probability \(1-\varepsilon \), a greedy action from the set \(\mathcal {A}^*(s)\) of so-far estimated optimal actions in state \(s\) is selected:

$$\begin{aligned} \mathcal {A}^*(s)&= \mathop {{\arg \max }}\limits _{a} Q(s,a)\nonumber \\ \pi _\text {EG}(s,a)&= \left\{ \begin{array}{cc} \frac{1-\varepsilon }{\vert \mathcal {A}^*(s) \vert } + \frac{\varepsilon }{\vert \mathcal {A}(s)\vert } &{} {\mathrm {if~}}a \in \mathcal {A}^*(s) \\ \frac{\varepsilon }{\vert \mathcal {A}(s)\vert } &{} \;\;\;{\mathrm {otherwise}}. \\ \end{array} \right. \end{aligned}$$
(4)

Note that in any state \(s\) all selection probabilities \(\pi _\text {EG}(s,a)\) sum up to 1. A drawback of \(\varepsilon \)-Greedy is that random actions are selected uniformly, which may incur negative reward due to bad actions. Nevertheless, the \(\varepsilon \)-Greedy policy is reported to be hard to beat when the exploration parameter \(\varepsilon \) is properly configured [16].
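For illustration, a minimal Python sketch of Eq. (4) is given below; Q is assumed to be a mapping from (state, action) pairs to value estimates, and the function names are our own.

```python
import random

# A minimal sketch of epsilon-greedy action selection (Eq. (4)).
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(actions))        # uniform exploratory action
    q_values = {a: Q[(s, a)] for a in actions}
    best = max(q_values.values())
    greedy_actions = [a for a, q in q_values.items() if q == best]   # A*(s)
    return random.choice(greedy_actions)           # exploit, ties broken randomly
```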

The second policy we investigate is the Softmax policy, which selects an action according to its weighting in a Boltzmann distribution [1]:

$$\begin{aligned} \pi _\text {SM}(s,a) = \frac{\exp \left( \frac{Q(s,a)}{\tau }\right) }{\sum _b \exp \left( \frac{Q(s,b)}{\tau }\right) }. \end{aligned}$$
(5)

This policy utilizes a positive parameter \(\tau \), called temperature, which controls the trade-off between exploration and exploitation. High values of \(\tau \) lead to nearly uniformly distributed random actions, whereas low values lead to greedy actions. Again, in any state \(s\) all selection probabilities \(\pi _\text {SM}(s,a)\) sum up to \(1\).
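The following sketch illustrates Eq. (5); subtracting the maximum preference before exponentiation is a standard numerical-stability measure and not part of the original formulation.

```python
import math
import random

# A minimal sketch of Softmax (Boltzmann) action selection (Eq. (5)).
def softmax_action(Q, s, actions, tau=1.0):
    actions = list(actions)
    prefs = [Q[(s, a)] / tau for a in actions]
    m = max(prefs)                                   # for numerical stability only
    weights = [math.exp(p - m) for p in prefs]
    total = sum(weights)
    probabilities = [w / total for w in weights]
    return random.choices(actions, weights=probabilities, k=1)[0]
```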

In general, it is difficult to define global constants \(\tau >0\) and \(\varepsilon \in [0,1]\) that achieve reasonable performance. Therefore, these parameters are typically initialized with a high value at the beginning of an experiment and decreased over time. Especially for large state-action spaces, such fine-tuning is most often a time-consuming process.

3.1 Meta-Learning of Exploration/Exploitation Parameters

A drawback of \(\varepsilon \)-Greedy and Softmax is the optimism in the face of uncertainty. For this reason we recently proposed “Value-Difference Based Exploration with Softmax” (VDBE-Softmax) [7], which controls exploration and exploitation between Greedy and Softmax in a meta-learning fashion. A local exploration rate \(\varepsilon (s)\), initialized with \(1\), is assigned to each state in the state space and denotes the probability of selecting an exploratory action in state \(s\). The selection probabilities \(\pi (s,a)\) are adapted depending on fluctuations in the action-value function, which are considered to be a measure of the uncertainty in knowledge about the environment. Furthermore, and in contrast to the \(\varepsilon \)-Greedy policy, exploratory actions are not selected uniformly but weighted according to the Softmax rule, which prevents the selection of very bad actions due to their weighting in the Boltzmann distribution. The idea of combining both policies was introduced by Wiering [17] as Max Boltzmann Exploration (MBE), but without learning of the policy parameters; instead, \(\varepsilon \) and \(\tau \) are configured globally by hand.

The general idea of VDBE-Softmax is that high fluctuations of the action-value function should lead to a high degree of exploration, i.e. \(\varepsilon (s) \rightarrow 1\), because the observation is insufficiently approximated by the prediction. On the other hand, when the prediction of future reward is well approximated, the knowledge learned so far should be exploited, i.e. \(\varepsilon (s) \rightarrow 0\). In this sense, the corresponding exploration rate in state \(s\) is updated after each value-function backup for any action \(a\) in state \(s\) according to:

$$\begin{aligned} \varepsilon (s) \leftarrow \varepsilon (s) + \delta \left[ 1 - \exp \left( \frac{-\vert \alpha \varDelta \vert }{\phi }\right) - \varepsilon (s) \right] , \end{aligned}$$
(6)

where \(\phi \) is a positive parameter controlling the sensitivity with respect to the absolute value difference \(\vert \alpha \varDelta \vert \) [7]. In case an exploratory action is to be selected, all \(Q\)-values in state \(s\) are scaled into the interval \([-1,1]\), and the action is selected according to Eq. (5) using \(\tau =1\). As proposed in [7, 18], the learning rate \(\delta \) can be determined online as the inverse of the number of actions in the current state \(s_t\), i.e. \(\delta = \frac{1}{\vert \mathcal {A}(s_t) \vert }\).

Kobayashi and colleagues [6] proposed a similar approach (using the temporal-difference error \(\varDelta \)), but adapting the parameter \(\tau \) of Softmax in a global manner instead. In contrast, VDBE-Softmax is a state-based strategy, which has the advantage of selecting exploratory actions only in states where the observation is insufficiently approximated by the action-value function.

The extension of VDBE-Softmax to \(Q(\lambda )\)-learning is straightforward. We simply apply the learning rule right after each action-value update (line 19 of Algorithm 1), but additionally include the eligibility trace of state \(s_e\) and action \(a_e\). Eq. (6) is therefore slightly modified to:

$$\begin{aligned} \varepsilon (s_e) \leftarrow \varepsilon (s_e) + \delta \left[ 1 - \exp \left( \frac{-\vert \alpha \varDelta e(s_e,a_e) \vert }{\phi } \right) - \varepsilon (s_e) \right] . \end{aligned}$$
(7)
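A minimal sketch of the VDBE update of the state-based exploration rate, covering both Eq. (6) (with trace = 1) and Eq. (7), is given below; eps is assumed to be a dictionary mapping states to exploration rates, and the names are our own.

```python
import math

# A minimal sketch of the VDBE-Softmax exploration-rate update (Eqs. (6)-(7)).
def vdbe_update(eps, s, alpha, td_error, n_actions, phi=1.0, trace=1.0):
    delta = 1.0 / n_actions                                  # learning rate delta
    target = 1.0 - math.exp(-abs(alpha * td_error * trace) / phi)
    eps[s] += delta * (target - eps[s])
    return eps[s]
```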

3.2 REINFORCE Exploration Control

We recently proposed a further alternative for meta-learning of exploration parameters, called REINFORCE Exploration Control (REC) [9, 10]. The general idea is to control the exploration parameter of any of the above-mentioned policies by a gradient-following algorithm. The heart of REC is Williams’ “REINFORCE with multiparameter distributions” algorithm [19] for reinforcement learning in continuous action spaces.

In REC, the exploration parameter of a policy is considered to be a continuous action. In this sense, we proposed two variants: (1) the global (episodic) variant, which selects an exploration parameter for the duration of a learning episode, and (2) the local (stepwise) variant, which selects a state-based exploration parameter before actually selecting an action by one of the above policies. In both cases, the exploration parameter is drawn from a Gaussian distribution with parameters \(\mu \) (mean) and \(\sigma \) (standard deviation). For example, in the local variant the parameter of \(\varepsilon \)-Greedy is selected according to:

$$\begin{aligned} \varepsilon (s) \sim bound\left[ \mathcal {N}\left( \mu (s), \sigma (s)\right) \right] , \end{aligned}$$
(8)

where \(bound[\cdot ]\) ensures that \(\varepsilon (s)\) stays within the interval \([0,1]\). After taking action \(a\) according to the policy, and after observing the successor state \(s'\) and reward \(r\), the local distribution parameters \(\mu (s)\) and \(\sigma (s)\) are adapted according to a reinforcement comparison scheme:

$$\begin{aligned} \mu (s)&\leftarrow bound\left[ \mu (s) + \alpha _R (\rho -\bar{\rho }(s)) \frac{\varepsilon (s) -\mu (s)}{\sigma (s)^2} \right] \end{aligned}$$
(9)
$$\begin{aligned} \sigma (s)&\leftarrow bound\left[ \sigma (s) + \alpha _R (\rho -\bar{\rho }(s)) \frac{(\varepsilon (s)-\mu (s))^2 - \sigma (s)^2}{\sigma (s)^3} \right] \end{aligned}$$
(10)

using performance measure

$$\begin{aligned} \rho = r + \max _b Q(s',b) \end{aligned}$$
(11)

and its baseline

$$\begin{aligned} \bar{\rho }(s) \leftarrow \bar{\rho }(s) + \alpha (\rho - \bar{\rho }(s)) . \end{aligned}$$
(12)

The learning rate \(\alpha _R\) has to be chosen appropriately, e.g. as a small positive constant or as \(\alpha _R = \alpha \sigma ^2\), as proposed by Williams [19]. Furthermore, all REC parameters must be bounded, e.g. \(0 \le \varepsilon (s), \mu (s) \le 1\) and \(0.001 \le \sigma (s) \le 5\). The bounds for Softmax, MBE and VDBE-Softmax can be taken from [10]. In this paper, the performance measure \(\rho \) differs slightly from [10], which yielded improved results in the experiments (especially for Softmax).
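A minimal Python sketch of the local (stepwise) REC variant for \(\varepsilon \)-Greedy, following Eqs. (8)–(12), is shown below; the performance measure \(\rho = r + \max _b Q(s',b)\) of Eq. (11) is assumed to be computed by the caller, and all names are our own.

```python
import random

def bound(x, lo, hi):
    # Clip a value into the interval [lo, hi].
    return max(lo, min(hi, x))

# Draw the state-based exploration rate epsilon(s) (Eq. (8)).
def rec_draw_epsilon(mu, sigma, s):
    return bound(random.gauss(mu[s], sigma[s]), 0.0, 1.0)

# Update the local distribution parameters and baseline (Eqs. (9), (10), (12));
# rho is the performance measure of Eq. (11), computed by the caller.
def rec_update(mu, sigma, rho_bar, s, eps_s, rho, alpha=0.1, alpha_R=0.01):
    adv = rho - rho_bar[s]                       # reinforcement comparison
    mu_old, sigma_old = mu[s], sigma[s]
    mu[s] = bound(mu_old + alpha_R * adv * (eps_s - mu_old) / sigma_old ** 2,
                  0.0, 1.0)                      # Eq. (9)
    sigma[s] = bound(sigma_old + alpha_R * adv
                     * ((eps_s - mu_old) ** 2 - sigma_old ** 2) / sigma_old ** 3,
                     0.001, 5.0)                 # Eq. (10)
    rho_bar[s] += alpha * (rho - rho_bar[s])     # Eq. (12)
```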

In contrast, the global variant of REC draws the exploration parameter at the beginning of an episode, for which reason the distribution parameters and the baseline need only be approximated for starting states [9, 10]. When updating the distribution parameters at the end of episode \(i\), the sum of rewards is taken as the performance measure \(\rho _i\), i.e.

$$\begin{aligned} \rho _i = \sum _{t=1}^T r_t . \end{aligned}$$
(13)

In case a learning problem has only one starting state (such as in the investigated mountain–car problem with two goals), a stateless (global) approximation of \(\mu \), \(\sigma \) and \(\bar{\rho }\) can be used.
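For this single-start-state case, a minimal sketch of the global (episodic) variant with stateless \(\mu \), \(\sigma \) and \(\bar{\rho }\) might look as follows; run_episode is a hypothetical callback that executes one learning episode with the drawn exploration parameter and returns the sum of rewards of Eq. (13).

```python
import random

def bound(x, lo, hi):
    return max(lo, min(hi, x))

# A minimal sketch of the global (episodic) REC variant with stateless
# distribution parameters; run_episode is a hypothetical callback returning
# the episodic return rho_i = sum_t r_t (Eq. (13)).
def global_rec_episode(run_episode, mu, sigma, rho_bar, alpha=0.1, alpha_R=0.01):
    eps = bound(random.gauss(mu, sigma), 0.0, 1.0)   # drawn once per episode
    rho = run_episode(eps)                           # episodic sum of rewards
    adv = rho - rho_bar
    mu_new = bound(mu + alpha_R * adv * (eps - mu) / sigma ** 2, 0.0, 1.0)
    sigma_new = bound(sigma + alpha_R * adv
                      * ((eps - mu) ** 2 - sigma ** 2) / sigma ** 3, 0.001, 5.0)
    rho_bar_new = rho_bar + alpha * adv
    return mu_new, sigma_new, rho_bar_new
```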

4 Experiments

In this section, the two learning problems shown in Fig. 1 are investigated using \(Q(\lambda )\)-learning with meta-learning of exploration parameters. First, a small crawling robot is investigated, which is a non-episodic learning problem to which the local variant of REC is applied. Second, the mountain–car problem with two goals is investigated, which is an episodic learning problem and to which therefore the global variant of REC is applied.

Fig. 1. Investigated learning problems: (a) the crawling robot [20], and (b) the mountain–car problem with two goals [9].

4.1 The Robot

We investigate a small crawling robot whose architecture is inspired by [21]. Figure 1(a) shows the corresponding hardware robot. The general aim is learning to crawl forward through sensorimotor interactions, which is achieved by a cyclic policy representing movements of the two joints \(g_x\) and \(g_y\). The components of the corresponding MDP are defined as follows:

States:

At each time step, the state \(s \in \mathcal {S}\) consists of the arm’s current discrete joint positions, i.e. the state is fully described by \(s = \left\{ g_x, g_y\right\} \). Due to the small onboard memory of 2 kB, each joint is discretized into 5 equidistant positions, resulting in a total of 25 states as shown in [20].

Actions:

The set of possible actions in each state \(s\in \mathcal {S}\) consists of the four cardinal directions in the state-space model:

$$\begin{aligned} \mathcal {A}(s) = \left\{ \text {UP}, \text {DOWN}, \text {LEFT}, \text {RIGHT}\right\} . \end{aligned}$$
Rewards:

Performing an action \(a\) in the environment yields a reward, \(r\in \mathbb {R}\), which is measured as the number of accumulated wheel-encoder ticks while repositioning the arm. The encoder ticks trigger the external interrupt of the microcontroller, and an interrupt service routine accumulates positive and negative ticks into the reward signal \(r\) delivered to the learning algorithm.

For simplicity we do not model transition probabilities on the crawling robot, because actions (movements of the joints) always lead with probability 1 to the corresponding neighbor state.
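As an illustration of this state-action model, the following sketch encodes the \(5 \times 5\) joint grid and its deterministic neighbor transitions; the reward itself comes from the wheel encoders of the real robot (or the sampled reward model) and is therefore not modeled here. All names are our own.

```python
# A minimal sketch of the crawling robot's state-action space: each joint is
# discretized into 5 positions (25 states), and the four cardinal actions move
# deterministically to the neighboring grid cell.
N_POSITIONS = 5
ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

def available_actions(state):
    # Only actions that stay inside the 5x5 grid are available.
    gx, gy = state
    return [a for a, (dx, dy) in ACTIONS.items()
            if 0 <= gx + dx < N_POSITIONS and 0 <= gy + dy < N_POSITIONS]

def transition(state, action):
    # Deterministic neighbor transition (probability 1).
    dx, dy = ACTIONS[action]
    gx, gy = state
    return (gx + dx, gy + dy)
```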

Fig. 2. The crawling robot: results for low learning rates of \(\alpha =0.1\) (left) and high learning rates of \(\alpha =1.0\) (right). Results are smoothed and averaged over 1000 runs.

We performed simulation experiments with a reward model sampled from the robot as shown in [20]. To simulate sensor noise, the reward from the model is perturbed with Gaussian noise (mean 0 and variance 1). Figure 2 shows the results of our study, which are averaged over 1000 runs and use a discounting factor of \(\gamma =0.95\). In general, it is observable that local REC adaptation behaves very robustly, independently of the policy and of the parameter being adapted. Using low learning rates of \(\alpha =0.1\), eligibility traces significantly improve the results the higher \(\lambda \) is chosen, which improves on the standard \(Q\)-learning algorithm (\(\lambda =0\)) of Eq. (3). In contrast, high learning rates lead in general to better results, with little effect of eligibility traces. From the results with high learning rates we further observe that the VDBE-Softmax policy is in general slightly better, but at the cost of additionally memorizing \(\varepsilon (s)\).

4.2 The Mountain–Car Problem with Two Goals

We investigate \(Q(\lambda )\)-learning on the mountain–car problem with two goals, as proposed in [9, 10] and depicted in Fig. 1b. This learning problem is an extension of the originally proposed mountain–car problem [11], but with two goal states and an enlarged action set.

Our results, shown in Fig. 3, reveal an interesting effect. The higher \(\lambda \) is chosen, the more the policies tend to behave greedily, with the result that an episode more likely terminates at the left goal. For MBE, eligibility traces seem to be counterproductive. In contrast, the VDBE-Softmax results are improved in the range \(0.3 < \lambda < 0.5\), but performance also degrades the more \(\lambda \) approaches \(1\).

Fig. 3. The mountain–car problem with two goals: smoothed results for various \(\lambda \), each averaged over 200 runs.

5 Summary and Conclusion

We showed that replacing eligibility traces can improve the reward of an agent, which is consistent with the results of Singh and Sutton [11]. As shown in the robot example, learning improves the more \(\lambda \) approaches \(1\), but this is not the case for the mountain–car problem with two goals. The choice of the time constant \(\lambda \) is therefore definitely not straightforward. The reason for this effect is that oversampling of actually performed actions can also lead to underestimating actions not selected so far (which might be better than their current action values predict). As a result, it is more likely for \(\lambda \rightarrow 1\) that greedy behavior arises, as shown in the results of the mountain–car problem with two goals. Therefore, the optimal choice of \(\lambda \) depends on the learning rate \(\alpha \) used for sampling the action-value function, which was also shown in the results of Singh and Sutton [11]. In conclusion, meta-learning of \(\lambda \) and \(\alpha \) in combination remains an interesting direction for further research.

Finally, it seems reasonable to initiate a discussion of the idea that reinforcement learning should also be considered as part of the relatively new research paradigm of partially supervised learning, which so far has emphasized combinations of unsupervised and supervised learning. In our opinion, the general framework of developing reinforcement learning agents fits well into this paradigm for the following reasons: (1) reinforcement learning samples estimates of the future reward that are usually approximated by value functions [1], which (2) are most often learned using supervised learning algorithms, e.g. in a neural-network fashion [22, 23]. In previous research we successfully showed this relationship in the development of learning agents for board games [24, 25]. In the context of neurobiology, too, all three paradigms apparently interact with each other [26, 27].