
1 Introduction

In the real world, a task often involves multiple objectives, and the preferences over these objectives differ from situation to situation, so a variety of policies is needed to meet these different needs. Existing reinforcement learning methods such as DQN [1] and DDPG [2] can already handle single-objective problems. In many cases, however, the importance of each objective is difficult to determine in advance, or needs to be adjusted according to the actual situation. In single-objective reinforcement learning, this kind of problem is handled by separately adjusting the weight of each objective. The outcome of such adjustment is often unpredictable, because some objectives may not be independent of each other, or may even be in conflict. Moreover, the goal of a traditional reinforcement learning algorithm is to maximize an expected return composed of multiple parts, and a larger return is not necessarily achieved by increasing the reward; it can also be achieved by reducing the penalty. For example, when the reward for reaching the goal is small and the time-related penalty is large, a robot may end the episode early by colliding with an obstacle in order to reduce the accumulated penalty. Compared with traditional single-objective reinforcement learning, multi-objective reinforcement learning (MORL) provides a better way to deal with multi-objective tasks.

Existing MORL methods can be divided into two categories [3]: outer-loop and inner-loop methods [5]. The outer-loop method treats MORL as multiple single-objective reinforcement learning problems with different preferences. It approximately represents the Pareto front by maintaining a population of policies [6] distinguished by preference, which makes it difficult to extend to complex problems or problems with a large number of objectives. The inner-loop method learns a value function or policy network conditioned on the preference and uses a deep neural network to obtain an approximate Pareto optimal solution. During training, different preferences are used as inputs to the neural network, and the goal is to maximize the cumulative return under the linear weighting given by the preference. Compared with the outer-loop method, it avoids maintaining a large set of policies, but it suffers from high sample demand and catastrophic forgetting. Friedman et al. [7] combined MORL with Hindsight Experience Replay (HER) [8] and Deep Deterministic Policy Gradients to achieve higher sample utilization in continuous action spaces. Abels et al. [9] implemented a weight-conditioned network (CN) based on vector Q-functions and accelerated learning with Diverse Experience Replay (DER). Yang et al. [10] proposed an optimality operator for a generalized version of the Bellman equation and the Envelope Q-learning (EQL) method, and proved its theoretical convergence under linear preferences. EQL outperforms single-objective reinforcement learning on the complex Atari game SuperMario.

Multi-objective problems can be seen as a form of diversity problem. For multi-objective problems, the policies on the Pareto front differ from one another. Diversity problems focus on differences between policies in the same scenario, which can be induced by different reward functions. Shen et al. [6] combined a multi-objective genetic algorithm with single-objective reinforcement learning to obtain game AI with different styles. Optimizing the distance between trajectories obtained by different policies, or the KL divergence between the policies themselves, is another way to generate diversity. Wu et al. [11] used the Maximum Mean Discrepancy (MMD) distance between trajectories as a regularization term to generate different opponent strategies. Eysenbach et al. [12] generated diversity with unsupervised learning by maximizing the mutual information between skills and states on top of maximum-entropy reinforcement learning. These methods can generate more fine-grained diversity, but for many practical problems a controllable and interpretable form of diversity is more valuable.

The contribution of our work is threefold:

  1) We propose an algorithm that can cover the optimal solutions as much as possible while training only one model.

  2) For different preference inputs, it generates a variety of diverse policies that conform to the given preference.

  3) Compared with single-objective reinforcement learning, our algorithm reduces the possibility of falling into a local optimum and converges better to the optimal solution in sparse-reward environments.

The structure of this paper is organized as follows. In Sect. 2, we introduce the relevant background of multi-objective reinforcement learning and the notation used in this paper. In Sect. 3, we analyze some problems with existing methods and propose a multi-objective reinforcement learning algorithm with preference exploration (MoPE). In Sect. 4, we introduce two multi-objective environments and validate our method on them. In Sect. 5, we summarize the contributions of this paper and outline future research directions.

2 Background

Multi-objective Reinforcement Learning (MORL) is a class of methods relative to traditional single-objective reinforcement learning. Its reward function \(\boldsymbol{r}(s,a)=(r_1(s,a),\ldots,r_m(s,a))\) is given in vector form, where m is the number of objectives. MORL is derived from multi-objective optimization, whose goal is to solve \(\max f(x)=(f_1(x),\ldots,f_m(x))^\top \). In most multi-objective problems the objectives conflict or are incomparable, so a single global optimum cannot be obtained and improving one objective tends to weaken others. A solution is called Pareto optimal if no other solution is at least as good on all objectives and strictly better on at least one. The set of Pareto optimal solutions is called the Pareto optimal set, and the surface formed by this set in objective space is called the Pareto front.

The Markov Decision Process (MDP) is the basic formalism of reinforcement learning (RL). An MDP can be represented by the tuple \((S,A,P,\gamma ,r)\), with state space S, action space A, action \(a\in A\), transition probability \(P(s'\mid s,a)\), discount factor \(\gamma \), and reward function r(s, a). The goal of reinforcement learning is to learn a policy \(\pi :S\times A\rightarrow [0,1]\) that maximizes the cumulative return \( G_t =\sum _{k=0}^\infty \gamma ^k r(s_{t+k},a_{t+k})\); such a policy is called the optimal policy \(\pi ^*\). Methods for finding the optimal policy can be divided into two types: those based on the value functions \(V(s)=E_\pi [G_t\mid S_t=s]\) and \(Q(s,a)=E_\pi [G_t\mid S_t=s,A_t=a]\), and those based on the policy itself. DQN is a representative value-function-based approach, whose value function is updated as:

$$\begin{aligned} Q(s, a)_{i+1} \leftarrow Q(s, a)_{i}+\alpha \left[ r+\gamma \max _{a^{\prime } \in A} Q\left( s^{\prime }, a^{\prime }\right) _{i}-Q(s, a)_{i}\right] \end{aligned}$$
(1)

PPO is a policy-based algorithm, whose objective function is:

$$\begin{aligned} \begin{aligned}&J_{PPO}(\theta )=\sum _{t=1}^{T} \frac{\pi _{\theta }\left( a_{t} \mid s_{t}\right) }{\pi _{\mathrm {old}}\left( a_{t} \mid s_{t}\right) } \hat{A}_{t}-\lambda \mathrm {KL}\left[ \pi _{\mathrm {old}} \mid \pi _{\theta }\right] \\&\hat{A}_{t}=\hat{R}_{t}-V_{\phi }\left( s_{t}\right) \end{aligned} \end{aligned}$$
(2)

The Multi-objective Markov Decision Process (MOMDP) can be represented by the tuple \((S,A,P,\boldsymbol{r},\varOmega ,f_\varOmega )\) with vector reward function \(\boldsymbol{r}(s,a,s')\), preference space \(\varOmega \), preference function \(f_\omega =\boldsymbol{\omega }^\top \boldsymbol{r}(s,a)\), and preference \(\boldsymbol{\omega }\in \varOmega \). When \(\boldsymbol{\omega }\) is fixed, each \(\boldsymbol{\omega }\) corresponds to an MDP. All non-dominated returns form the Pareto coverage set (PCS) \(\mathcal F^*:=\{\hat{\boldsymbol{r}}\mid \nexists\, \hat{\boldsymbol{r}}'\ \text{s.t.}\ \hat{\boldsymbol{r}}'\ge \hat{\boldsymbol{r}},\ \hat{\boldsymbol{r}}'\ne \hat{\boldsymbol{r}}\}\), where the return is \(\hat{\boldsymbol{r}}=\sum _{t=0}^\infty \gamma ^t\boldsymbol{r}(s_t,a_t)\). The optimal policies corresponding to these returns constitute the Pareto front of the problem. The optimal solutions over all possible \(\boldsymbol{\omega }\in \varOmega \) constitute the convex coverage set (CCS) [10], which is a subset of the PCS:

$$\begin{aligned} \text {CCS}:=\left\{ \hat{\boldsymbol{r}} \in \mathcal {F}^{*} \mid \exists \boldsymbol{\omega } \in \varOmega \text{ s.t. } \boldsymbol{\omega }^{\top } \hat{\boldsymbol{r}} \ge \boldsymbol{\omega }^{\top } \hat{\boldsymbol{r}}^{\prime }, \forall \hat{\boldsymbol{r}}^{\prime } \in \mathcal {F}^{*}\right\} . \end{aligned}$$
(3)
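To make the CCS concrete, the following is a minimal sketch, under the assumption of a finite set of candidate return vectors and a discretized two-objective preference simplex (both hypothetical, not the authors' code), of how the CCS can be extracted by checking whether some linear preference makes each candidate the scalarized maximum.

```python
import numpy as np

def convex_coverage_set(returns, n_weights=101):
    """Approximate the CCS of a finite set of 2-objective return vectors.

    returns: array of shape (n, 2), each row a candidate return r_hat.
    A return is kept if some linear preference w makes it the best.
    """
    returns = np.asarray(returns, dtype=float)
    ccs = set()
    # Discretize the preference simplex w = (w0, 1 - w0).
    for w0 in np.linspace(0.0, 1.0, n_weights):
        w = np.array([w0, 1.0 - w0])
        ccs.add(int(np.argmax(returns @ w)))
    return returns[sorted(ccs)]

# Toy example: (0.4, 0.4) is Pareto optimal but lies below the convex hull,
# so no linear preference selects it and it is absent from the CCS.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [0.6, 0.6]]
print(convex_coverage_set(candidates))
```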

Our goal is to learn a general value function \(\mathcal Q(s,a,\boldsymbol{\omega };\theta )\) that generalizes to the entire preference space, such that the policies derived from this value function cover the CCS as much as possible.
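For a given preference, the corresponding policy is simply greedy with respect to the scalarized general Q-function. A minimal sketch follows; the interface q_net(s, w), returning per-action vector Q-values, is an assumption for illustration and is reused in the later sketches.

```python
import torch

def greedy_action(q_net, s, w):
    """Act greedily w.r.t. the scalarized general value function w^T Q(s, a, w)."""
    with torch.no_grad():
        q = q_net(s.unsqueeze(0), w.unsqueeze(0))            # (1, n_actions, m)
        scores = torch.einsum('bm,bam->ba', w.unsqueeze(0), q)
        return int(scores.argmax(dim=1).item())
```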

3 Multi-objective RL with Preference Exploration

The algorithm we propose is based on two ideas. The first is the exploration and utilization of existing samples. Inner-loop methods must sample trajectories under different preferences during training to learn a general value function, which reduces the number of samples available for each preference. To address this issue, we expand and explore the existing samples through HER and prioritized experience replay (PER) [13], combined with an information-theoretic term, to increase sample utilization. The second is the exploration of the preference space during sampling. Existing methods usually obtain the preference used in each episode directly through uniform sampling, whereas it is favorable to use already-trained preferences and their corresponding results as priors to guide the selection of new preferences.

Exploring and Utilizing Existing Experience is the key to improving sample efficiency. HER is a technique for the sparse-reward goal-conditioned setting, which improves sample utilization by relabeling the goals of existing trajectories with goals that the current policy can actually reach. In this paper the preference is also a goal: it affects only the agent's actions, not the dynamics of the environment, so experience can be relabeled to improve sample utilization. For each experience \((s,\boldsymbol{\omega }_b,a,\boldsymbol{r},s')\) in the batch we additionally sample \(N_\omega \) preferences \(\boldsymbol{\omega }=\{\omega _0,\omega _1,\ldots,\omega _{N_\omega }\mid \forall i,\Vert \omega _i-\omega _b \Vert _2\le \sigma _\omega \}\) as new goals. Unlike the EQL, we note that under linear preferences, policies with similar preferences are themselves more similar and thus provide more information to help learn the current policy, so we restrict the newly sampled preferences to a neighborhood \(\sigma _\omega \) of the actually sampled \(\omega _b\); we call this Similar Preference Exploration. After relabeling, the EQL loss is used to update the network. The loss function L consists of two parts:

$$\begin{aligned} \begin{aligned}&L^{\mathrm {A}}(\theta ) = \mathbb {E}_{s, a, \boldsymbol{\omega }}\left[ \Vert \boldsymbol{y}-\boldsymbol{Q}(s, a, \boldsymbol{\omega } ; \theta )\Vert _{2}^{2}\right] \\&L^{\mathrm {B}}(\theta )=\mathbb {E}_{s, a, \boldsymbol{\omega }}\left[ \left| \boldsymbol{\omega }^{\top } \boldsymbol{y}-\boldsymbol{\omega }^{\top } \boldsymbol{Q}(s, a, \boldsymbol{\omega } ; \theta )\right| \right] \\&L = L^{\mathrm {A}} + L^{\mathrm {B}} \end{aligned} \end{aligned}$$
(4)

where \(\boldsymbol{y}=\mathbb {E}_{s^{\prime }}\left[ \boldsymbol{r}+\gamma \arg _{Q} \max _{a, \boldsymbol{\omega }^{\prime }} \boldsymbol{\omega }^{\boldsymbol{\top }} \boldsymbol{Q}\left( s^{\prime }, a, \boldsymbol{\omega }^{\prime } ; \theta _{k}\right) \right] \). Compared with the original DQN target, the EQL takes the largest Q value over the sampled actions and preferences simultaneously. \(L^{\mathrm {A}}\) updates the vector Q-function, and \(L^{\mathrm {B}}\) is an auxiliary loss whose purpose is to reduce the influence of the discrete solutions on the frontier on the optimization: many different preferences can correspond to the same optimal solution, which increases the difficulty of learning the value function.
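The following is a minimal PyTorch-style sketch of similar-preference relabeling and the loss in Eq. (4). The function names, batch layout, and the simplification of taking the envelope maximum only over the single relabeled preference attached to each transition are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def relabel_preference(w_b, sigma_w=0.04):
    """Similar Preference Exploration: draw one new preference approximately
    within sigma_w of the behavior preference w_b (shape (B, m)) and
    renormalize it onto the preference simplex."""
    w_new = (w_b + torch.randn_like(w_b) * sigma_w).clamp(min=1e-6)
    return w_new / w_new.sum(dim=-1, keepdim=True)

def eql_loss(q_net, target_net, batch, w, gamma=0.99):
    """EQL loss L = L^A + L^B for one (relabeled) preference per transition.
    q_net(s, w) returns vector Q-values of shape (B, n_actions, m)."""
    s, a, r, s_next, done = batch                       # r: (B, m), done: (B,)
    B = r.shape[0]
    with torch.no_grad():
        q_next = target_net(s_next, w)                  # (B, n_actions, m)
        scal = torch.einsum('bm,bam->ba', w, q_next)    # scalarized values
        a_star = scal.argmax(dim=1)                     # envelope action
        q_star = q_next[torch.arange(B), a_star]        # (B, m)
        y = r + gamma * (1.0 - done).unsqueeze(-1) * q_star
    q = q_net(s, w)[torch.arange(B), a]                 # (B, m)
    loss_a = F.mse_loss(q, y)                                   # L^A
    loss_b = torch.einsum('bm,bm->b', w, y - q).abs().mean()    # L^B
    return loss_a + loss_b
```

In training, one would call relabel_preference \(N_\omega \) times per batch and accumulate the resulting losses, which approximates the max over \(\boldsymbol{\omega }'\) in the envelope target.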

HER focuses on the utilization of data in the current batch, while PER focuses on the utilization of historical data. PER usually prioritizes each experience by its TD error. Following the analysis above, in a MORL problem it is more beneficial to consider a preference interval rather than the TD error under a single preference:

$$\begin{aligned} \delta =\frac{1}{N_\omega }\sum _{i=0}^{N_\omega }{\boldsymbol{\omega _i}^{\top }\boldsymbol{y}-\boldsymbol{\omega _i}^{\top }\boldsymbol{Q}(s, a, \boldsymbol{\omega _i} ; \theta )} . \end{aligned}$$
(5)

The TD error accounts for the error of the value function and reflects how well the value function has been learned. However, discrete optimal solutions cause the policies under multiple preferences to correspond to the same optimal solution, which increases the risk of policies falling into local optima. At the same time, multi-objective reinforcement learning pays more attention to the outcomes of different policies, but differences in outcomes do not fully correspond to differences in policies. We therefore handle this with an information-theoretic approach, based on the idea that a large difference in preference \(\omega \) should correspond to a large difference in the states S reached by the agent: the agent's preference should be distinguishable from the states it visits, i.e., we maximize the mutual information \(I(S;\omega )\) between states and preferences.

$$\begin{aligned} \begin{aligned} I(S ; \omega )&=(\mathcal {H}[\omega ]-\mathcal {H}[\omega \mid S])\\&=\mathbb {E}_{\omega \sim p(\omega ), s \sim \pi (\omega )}[\log p(\omega \mid s)]-\mathbb {E}_{\omega \sim p(\omega )}[\log p(\omega )]\\&\ge \mathbb {E}_{\omega \sim p(\omega ), s \sim \pi (\omega )}\left[ \log q_{\phi }(\omega \mid s)-\log p(\omega )\right] \end{aligned} \end{aligned}$$
(6)

Here \(p(\omega \mid s)\) is difficult to compute directly, so a discriminator network \(q_{\phi }(\omega \mid s)\) is trained to approximate it; this approximation can be shown to give a variational lower bound on the mutual information. \(I(S;\omega )\) is optimized by adding this term to the PER priority; the priority sampling score is:

$$\begin{aligned} q(s_i)=\delta +\alpha (\log q_{\phi }(\omega \mid s)-\log p(\omega )). \end{aligned}$$
(7)

The above formula can be understood as paying more attention to states whose preference is easy to distinguish but whose value function has not yet been estimated accurately. Such states also correspond to the frontier of exploration, and \(\alpha \) balances the two parts. For adversarial tasks, the addition of the discriminator also facilitates opponent modeling.
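A minimal sketch of the priority computation in Eqs. (5) and (7), assuming a Gaussian-output discriminator for \(q_\phi (\omega \mid s)\) and reusing the q_net interface from the earlier sketch; the class and parameter names (Discriminator, alpha, sigma) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """q_phi(w | s): predicts the behavior preference from the visited state,
    modeled here as an isotropic Gaussian (log-density up to a constant)."""
    def __init__(self, state_dim, n_objectives, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_objectives))

    def log_prob(self, s, w, sigma=0.1):
        mu = self.net(s)
        return -((w - mu) ** 2).sum(-1) / (2 * sigma ** 2)

def per_priority(q_net, target_net, disc, batch, w_b, w_samples,
                 log_p_w=0.0, alpha=0.5, gamma=0.99):
    """Priority q(s_i) = delta + alpha * (log q_phi(w|s) - log p(w)) of Eq. (7),
    where delta averages the scalarized TD error over sampled preferences (Eq. 5)."""
    s, a, r, s_next, done = batch
    B = r.shape[0]
    with torch.no_grad():
        deltas = []
        for w in w_samples:                                  # each (B, m)
            q_next = target_net(s_next, w)                   # (B, n_actions, m)
            a_star = torch.einsum('bm,bam->ba', w, q_next).argmax(1)
            y = r + gamma * (1 - done).unsqueeze(-1) * q_next[torch.arange(B), a_star]
            q = q_net(s, w)[torch.arange(B), a]
            deltas.append(torch.einsum('bm,bm->b', w, y - q))
        delta = torch.stack(deltas).mean(0)                  # Eq. (5)
        info = disc.log_prob(s, w_b) - log_p_w               # information bonus
    return delta + alpha * info
```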

Exploring in the Preference Space is key to speeding up training and mitigating catastrophic forgetting. To guarantee coverage of the optimal solutions, the model must be trained continuously with different preferences, and after many updates the neural network may forget some of the policies it learned earlier. To reduce the impact of forgetting and take advantage of the policies already learned, we update the sampling probability \(p(\omega )\) during training so as to learn some preferences purposefully.

Specifically, we give higher sampling probabilities to preferences that currently have few visits or incomplete training. To handle visit frequency, we discretize the preference space into \(N_I\) intervals \(I_i\subset \varOmega ,i\in [0,N_I]\), and put each used preference into the corresponding interval to obtain a count \(N_{i}\). Following count-based exploration in reinforcement learning [15], we set \(p(\omega )\propto (N_{i}+0.01)^{-1/2},\omega \in I_i\).
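A minimal sketch of this count-based preference sampler for a two-objective problem with preferences of the form \((w, 1-w)\); the interval count matches the \(N_I=10\) used in the DST experiment below, while the class name and interface are illustrative.

```python
import numpy as np

class PreferenceSampler:
    """Count-based sampling over a discretized 2-objective preference space:
    p(w) is proportional to (N_i + 0.01)^(-1/2) for the interval containing w."""
    def __init__(self, n_intervals=10):
        self.counts = np.zeros(n_intervals)
        self.n = n_intervals

    def sample(self):
        probs = (self.counts + 0.01) ** -0.5
        probs /= probs.sum()
        i = np.random.choice(self.n, p=probs)           # pick an interval
        w0 = (i + np.random.rand()) / self.n            # uniform inside it
        self.counts[i] += 1
        return np.array([w0, 1.0 - w0])

sampler = PreferenceSampler()
w = sampler.sample()   # preference used for the next training episode
```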

4 Experiment

Deep-Sea Treasure (DST) [14] is a simple multi-objective \(10\times 11\) grid-world environment that is often used to validate MORL algorithms; its Pareto front is known and convex. The environment contains an agent and several treasures with scores, and treasures near the bottom of the map have higher scores. At the beginning of an episode the agent is in the upper left corner of the map, and the algorithm controls the agent to move in four directions until it reaches a treasure. The agent's goal is to obtain the highest possible score while taking the fewest steps. The reward has two parts: the reward \(r_{score}\) for reaching a treasure, and a penalty of \(r_{steps}=-1\) for each step (Fig. 1).

Fig. 1. The Deep-sea treasure (DST) environment. The agent starts from the upper left corner of the map, takes (x, y) as its state, and searches for treasures with different scores by moving in four directions. The red and the orange lines represent two paths that reach a specific treasure. The set of optimal-solution paths in the DST environment consists of the shortest paths to each treasure. (Color figure online)

Fig. 2. Deep-sea treasure evaluation results. (a) Coverage ratio metric over 20,000 timesteps. (b) The real CCS and the recovered solutions.

To evaluate how well the algorithm covers the optimal solutions in the CCS, we introduce the Coverage Ratio (CR) metric. The CR is computed from the overlap between the set of optimal solutions \(\mathcal P\) found by the algorithm and the \(\text{CCS}\) [10]:

$$\begin{aligned} CR=2 \times \frac{ \text{ precision } \times \text{ recall } }{ \text{ precision } + \text{ recall } }, \end{aligned}$$
(8)

where \( \text{precision} =\left| \mathcal {P} \cap \text{CCS}\right| /|\mathcal {P}|\) and \( \text{recall} =\left| \mathcal {P} \cap \text{CCS}\right| /|\text{CCS}|\) represent the proportion of coincident solutions in each of the two sets. In practice, we approximate the current optimal solution set \(\mathcal P\) by randomly sampling a certain number of preferences \(\omega \) in the preference space.
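A minimal sketch of the CR metric in Eq. (8), assuming both the recovered set and the CCS are represented as sets of return vectors; the toy values below are illustrative, not the actual DST treasure scores.

```python
def coverage_ratio(recovered, ccs):
    """CR = F1 score between the recovered solution set P and the true CCS.
    Both arguments are iterables of return vectors, e.g. (treasure, -steps)."""
    P = {tuple(r) for r in recovered}
    C = {tuple(r) for r in ccs}
    inter = P & C
    if not P or not C or not inter:
        return 0.0
    precision = len(inter) / len(P)
    recall = len(inter) / len(C)
    return 2 * precision * recall / (precision + recall)

# Toy example: three of five CCS points recovered, plus one spurious solution.
ccs = [(1, -1), (34, -3), (58, -5), (78, -7), (124, -19)]
print(coverage_ratio([(1, -1), (34, -3), (124, -19), (50, -5)], ccs))
```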

We use a conditioned multi-head Q network [9] with 4 fully connected hidden layers of sizes {64, 128, 128, 64} for training. The inputs are the state (x, y) and the preference, and the number of heads in the output layer equals the number of objectives. The original EQL and the algorithm with exploration added were trained for 20,000 timesteps; the CR was evaluated every 1,000 steps during training, using 2,000 uniformly sampled preferences. For the experience exploration (EE) approach, a discriminator network with 3 fully connected layers is used to output the probabilities used in PER, and we sample 64 preferences with \(\sigma _\omega =0.04\) to relabel the samples. For the preference exploration (PE) approach, we divide the preference space into 10 parts. MoPE integrates these two exploration methods. The final results are shown in Fig. 2: under the same number of training steps, the algorithms with exploration achieve a higher CR than the original algorithm.
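A minimal PyTorch sketch of such a preference-conditioned multi-head Q network; the hidden sizes follow the description above, while the interface (forward(s, w) returning per-action vector Q-values) is the same assumption used in the earlier sketches.

```python
import torch
import torch.nn as nn

class MultiHeadQ(nn.Module):
    """Q(s, a, w): vector Q-values conditioned on state and preference.
    Hidden sizes {64, 128, 128, 64}; one output head per objective."""
    def __init__(self, state_dim=2, n_actions=4, n_objectives=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + n_objectives, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU())
        # One head per objective, each predicting a value for every action.
        self.heads = nn.ModuleList(
            [nn.Linear(64, n_actions) for _ in range(n_objectives)])

    def forward(self, s, w):
        h = self.body(torch.cat([s, w], dim=-1))
        # Stack the heads -> (batch, n_actions, n_objectives)
        return torch.stack([head(h) for head in self.heads], dim=-1)
```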

Fig. 3. Deep-sea treasure results. (a) The relationship between preference and optimal solution in the DST task. The gray dotted line is the Pareto optimal solution; the thick green solid line is the optimal solution obtained by the algorithm. The preference space can be divided into multiple parts according to the different solutions. The red dotted lines indicate that the corresponding optimal solution is missing, and the red areas indicate that the solution obtained by the algorithm is not in the optimal solution set. (b) The discriminator's prediction of the preference. The color indicates the weight corresponding to \(r_{score}\). (Color figure online)

Figure 3a compares the distribution over the preference space of the optimal solutions obtained by the two algorithms. The solutions obtained by MoPE are more uniformly distributed over the preference space and all lie within the CCS, while the EQL produces some suboptimal and concentrated solutions. Figure 3b further plots the discriminator's prediction of the preference. In the DST environment the optimal solution for each preference is unique, but the optimal path is not, and the agent's preference is harder to discern when it is close to a treasure in the top half of the map. Under our method this region exhibits larger differences in predicted preferences, which means that the agent chooses more diverse routes for different preferences.

Robot Confrontation Game is a multi-objective problem with a continuous action space. There are two agents in the scene, a red agent \(A_r\) and a blue agent \(A_b\), where \(A_r\) is the agent to be controlled and \(A_b\) is a rule-based agent; the agents' states are updated according to (9).

$$\begin{aligned} \dot{x}=v\cos \theta ,\dot{y}=v\sin \theta ,\dot{\theta }=w,\dot{v}=a, \end{aligned}$$
(9)

where the robot's acceleration is \(a\in [-0.2,0.2]\) m/s\(^2\), angular velocity \(w\in [-0.8,0.8]\) rad/s, and heading \(\theta \in [-\pi ,\pi ]\). The goal is to hunt down the blue agent without going out of bounds. Winning is defined by the distance and relative angles between the two agents falling below certain thresholds, as in (10) (Fig. 4).

Fig. 4. Parameter definition of the environment.

$$\begin{aligned} q_{r}<30^{\circ }, q_{b}>30^{\circ }, \beta<40^{\circ }, d<0.3\,m, \end{aligned}$$
(10)

where \(q_r\) and \(q_b\) are the angles between each agent's velocity direction and the line connecting the two agents, \(\beta \) is the angle between the velocity directions, and d is the distance between the two robots. For the rule-based agent's strategy, we use the threat index T, widely used in air-combat problems, as the evaluation of the state, combine it with single-step forward prediction, and select the action with the highest threat to the opponent. The threat index T we use includes two parts, the angle threat \(T_a\) and the distance threat \(T_d\), defined as follows:

$$\begin{aligned} T_{a}=\frac{q_{b}-q_{r}}{\pi }, \quad T_{d}=\frac{d_{\max }+d_{\min }-2 d}{d_{\max }-d_{\min }}, \quad T=aT_a+bT_d, \end{aligned}$$

where \(d_{max}\) and \(d_{min}\) are the maximum and minimum distances at which the two agents may encounter each other, and a and b are the corresponding weight coefficients; in this paper, \(a=0.1, b=0.9\).
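A minimal sketch that puts Eqs. (9)-(11) together: a Euler integration step for the unicycle dynamics, the win condition, and the threat index used by the rule-based opponent. The timestep dt and the \(d_{max}\)/\(d_{min}\) values are illustrative assumptions.

```python
import math

def step_dynamics(x, y, theta, v, a, w, dt=0.1):
    """Euler update of the unicycle model in Eq. (9):
    x' = v cos(theta), y' = v sin(theta), theta' = w, v' = a."""
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta = (theta + w * dt + math.pi) % (2 * math.pi) - math.pi
    v += a * dt
    return x, y, theta, v

def threat_index(q_r, q_b, d, d_min=0.0, d_max=5.0, a=0.1, b=0.9):
    """Threat index of Eq. (11): weighted angle threat plus distance threat."""
    T_a = (q_b - q_r) / math.pi
    T_d = (d_max + d_min - 2 * d) / (d_max - d_min)
    return a * T_a + b * T_d

def red_wins(q_r, q_b, beta, d):
    """Win condition of Eq. (10) (angles in radians, distance in meters)."""
    return (q_r < math.radians(30) and q_b > math.radians(30)
            and beta < math.radians(40) and d < 0.3)
```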

For this environment we choose the pursuit reward \(r_{catch}\) and the moving-distance reward \(r_{move}\) as optimization objectives.

$$\begin{aligned} \begin{aligned}&r_{catch}=\left\{ \begin{aligned}&10&\text{ if } \text{ red } \text{ wins } \\&-10&\text{ if } \text{ blue } \text{ wins } \\&(T_{a}-1)/10+(T_{d}-1)/10&\text{ otherwise } \end{aligned} \right. \\&r_{move}=\left\{ \begin{aligned}&-10&\text{ if } \text{ out } \text{ of } \text{ range } \\&v\,\mathrm {d}t&\text{ otherwise } \end{aligned} \right. \end{aligned} \end{aligned}$$
(12)
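A minimal sketch of the two-objective reward in Eq. (12); the boolean flags and the timestep dt are illustrative assumptions about how the environment exposes these quantities.

```python
def vector_reward(red_won, blue_won, out_of_range, T_a, T_d, v, dt=0.1):
    """Two-objective reward (r_catch, r_move) from Eq. (12)."""
    if red_won:
        r_catch = 10.0
    elif blue_won:
        r_catch = -10.0
    else:
        # Shaping term: both threat components are at most 1, so this is <= 0.
        r_catch = (T_a - 1.0) / 10.0 + (T_d - 1.0) / 10.0
    r_move = -10.0 if out_of_range else v * dt
    return r_catch, r_move
```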

In this environment the two agents have the same capabilities, which makes it easy for the algorithm to fall into a suboptimal solution of continuously accelerating in circles: the agent obtains no reward, but it is also not punished by being caught. To alleviate this problem we use the threat-index penalty to guide the agent.

Fig. 5. Optimal solutions obtained by different algorithms. The dashed lines are fits to the data.

Figure 5 compares the optimal solutions obtained by the different algorithms. The PPO [16] results are obtained by training single-objective problems under different preferences, while the MoPE and EQL results are obtained by uniformly sampling 100 preferences after training. From the maximum reward each algorithm achieves, it can be seen that (1) the single-objective PPO does not learn a strategy that can catch the opponent (the reward for catching the opponent is 10), and (2) compared with the EQL, MoPE produces more diverse strategies. Figure 6 further shows representative MoPE games under different preferences.

Fig. 6. MoPE results with different preferences.

5 Conclusion

We propose a multi-objective reinforcement learning (MORL) algorithm that covers the optimal solutions under different preferences as much as possible while training only one model, and that generates sufficiently diverse policies for different preference inputs. We evaluate the algorithm in a simple grid-world environment, DST, and a more difficult robot confrontation environment; the experiments demonstrate that our algorithm has better generalization and diversity than the baseline algorithms. Future research will consider constrained multi-objective problems. The method in this paper is based on the linear-preference assumption, which means that all rewards must participate in the weighted sum and their corresponding weights must be sampled during training. However, for some penalty terms, such as collisions or timeouts, we do not need preferences over them; we only wish to constrain them within a certain range. One possible approach to this problem is Thresholded Lexicographic Ordering (TLO) [17].