
1 Introduction

Navigating in crowded spaces with high efficiency and safety is a challenging task. Traditional approaches often need their parameters adjusted manually for different scenes, which restricts the environments in which they can be applied [10, 18]. To perform adaptive and resilient behaviors across diverse scenes, a navigation algorithm needs to understand the semantics of different environments.

To address the above issues, some works attempt to learn navigation behaviors from data. According to the source of the data, these approaches can be divided into imitation learning and reinforcement learning (RL). Imitation learning [6, 12, 14, 17, 20] learns a navigation policy from human walking trajectories collected in the real world, so the agent can perform behaviors similar to humans. However, imitation learning is restricted to the scenes in which the data were collected.

In recent years, RL has reached human-level performance in Go and video games [15, 16], and more and more researchers are applying it to navigation. RL does not depend on real-world data sets; it can continuously obtain training data from a virtual environment. Moreover, the optimization goal of RL is the cumulative environmental feedback signal. Instead of simply imitating real data as imitation learning does, RL searches for the strategies that make the cumulative feedback higher. RL-based navigation methods have achieved good results in recent years [3,4,5, 13], which verifies the effectiveness of RL.

In this work, we propose to use hierarchical RL (HRL) to generate plausible and collision-free trajectories. The main contributions of our work include:

  1. We build an effective and novel HRL framework to guide the agent to reach its destination efficiently. By using the hierarchical RL framework, the navigation task is decoupled into target driving and collision avoidance, so that a stable and robust model can be trained which quickly adapts to new environments.

  2. Through comparisons with state-of-the-art RL-based methods, our model achieves superior performance, especially in various challenging resilience experiments.

2 Related Work

In this section, we review related work on navigation and on hierarchical RL, which our work builds on.

2.1 Conventional Methods for Navigation

Helbing et al. [9, 10] proposed the social force model to describe interactions among pedestrians. The model is based on a potential field in which attractive forces lead agents to their destinations and repulsive forces push them away from surrounding obstacles. Reactive models predict the collision time based on the current velocities. The representative work is ORCA [21], which is based on RVO [2]; ORCA seeks jointly collision-avoiding velocities under a reciprocity assumption. These models, elaborately designed by researchers, behave well in their specific application scenes, but they rely on hand-crafted functions and do not generalize well to various scenes.
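As a rough illustration of the attractive/repulsive idea (not the exact formulation of [9, 10]), the following sketch shows one social-force step; the force terms and constants are placeholders.

```python
import numpy as np

def social_force_step(pos, vel, goal, obstacles, dt=0.1,
                      desired_speed=1.3, tau=0.5, A=2.0, B=0.3):
    """One illustrative social-force update; the constants are placeholders."""
    # Attractive term: relax the velocity toward the desired velocity pointing at the goal.
    to_goal = goal - pos
    desired_vel = desired_speed * to_goal / (np.linalg.norm(to_goal) + 1e-8)
    f_attract = (desired_vel - vel) / tau

    # Repulsive terms: exponentially decaying push away from each obstacle/pedestrian.
    f_repulse = np.zeros(2)
    for ob in obstacles:
        diff = pos - ob
        dist = np.linalg.norm(diff) + 1e-8
        f_repulse += A * np.exp(-dist / B) * diff / dist

    vel = vel + (f_attract + f_repulse) * dt
    pos = pos + vel * dt
    return pos, vel
```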

2.2 Deep Reinforcement Learning Methods for Navigation

Earlier works [8, 23] were limited by computational capability, so researchers tried to simplify the problem when applying RL to navigation. The combination of RL and deep learning enables processing data with higher dimensions and larger state spaces. Recent work on navigation by deep reinforcement learning can be divided into two categories according to the algorithm.

The first is value-based reinforcement learning, which discretizes the action space into a velocity set V according to speed and direction. CADRL [5] first applied DRL to navigation, extending the two-agent case to the multi-agent case through a maximin operation to pick the best action. Chen et al. [4] proposed SARL, which rethinks human-robot pairwise interactions with a self-attention mechanism.

The other is policy-based reinforcement learning, whose action space is continuous. Long et al. [13] directly mapped raw sensor measurements to a collision avoidance policy and presented a multi-scene, multi-stage training framework for adapting to different scenes. Building on [13], the work in [7] learned safer and more resilient navigation behaviors by integrating uncertainty estimation.

2.3 Hierarchical Reinforcement Learning

Hierarchical reinforcement learning (HRL) is inspired by divide-and-conquer: it decouples a task to reduce training difficulty. Bacon et al. [1] proposed the Option-Critic framework, which decouples the problem into two levels. The high-level policy is responsible for choosing an option; the low level executes the corresponding low-level policy until the option's termination condition is met.

Vezhnevets et al. [22] proposed FeUdal Networks, where two levels of hierarchy within an agent communicate via explicit goals. Both the high level and the low level are deep learning models, and no gradients are propagated between the two levels; the Manager receives its learning signal from the environment alone. Our model takes inspiration from the design of FeUdal Networks.

3 Approach

3.1 Overview

In this work, we consider the problem of an agent navigating towards a goal on a ground plane where N dynamic obstacles exist. Both the agent and the N dynamic obstacles are modeled as discs with the same radius. The agent cannot communicate with the dynamic obstacles. Therefore, at each time t the agent's observation consists of the N obstacles' positions \(p_i^t=[p_x^t,p_y^t]\in {P},P^t=\left\{ p_1^t,p_2^t,\cdots ,p_N^t\right\} \) and velocities \(v_i^t=[v_x^t,v_y^t]\in {V},V^t=\left\{ v_1^t,v_2^t,\cdots ,v_N^t\right\} \).

Figure 1 shows an overview of our method. Our model takes inspiration from feudal reinforcement learning [22], where levels of hierarchy communicate via explicit goals. The high-level module aims to optimize long-term interest; its output is the sub-goal \(g_t=(p_x^{t+c},p_y^{t+c})\) for the future. The low-level module aims to navigate safely and efficiently to the sub-goal; its output is a primitive action, the 2-dimensional velocity \(a_{t+i}=(v_x^{t+i},v_y^{t+i})\). The low level is the module that actually interacts with the environment.

Both the high-level and the low-level modules are deep RL models, and each optimizes its policy on the reward received from the environment. As in FeUdal Networks [22], no gradients are propagated between the two levels. The information that the agent obtains from the environment, \(x_t\), contains the last six consecutive observations: \(x_t = \left\{ P^{t-5},V^{t-5},\cdots ,P^t,V^t\right\} \). As Fig. 2 shows, our model converts the change of the obstacles' positions into a 360-degree laser form \(o_z^t\), whose shape is 6 \(\times \) 360. The high-level and low-level modules do not share the same neural network, but they share the same network architecture, shown in Fig. 3. Based on the obstacles' current velocities, we predict their positions over the next 3 time steps by linear extrapolation and convert the future positions into the laser form \(o_p^t\). \(\theta _{t}\) is the agent's current orientation, and \(o_{z+p}^t\) is a slice of \(o_z^t\) and \(o_p^t\) chosen by \(\theta _{t}\): \(o_{z+p}^t=o_z^t[\theta _t]+o_p^t[\theta _t]\). \(p_g^t\) is the agent's final goal (for the high level) or sub-goal (for the low level), and \(v_t\) is the agent's current velocity. Both levels take the agent's position as the origin of a local coordinate frame at every time step; the high level's y-axis points toward the final goal and the low level's y-axis points toward the sub-goal.
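Since the exact discretization is not specified above, the following sketch shows one plausible way to build \(o_z^t\) and \(o_p^t\): each of the 360 bins stores the distance to the nearest obstacle in that direction, and the future frames come from the linear prediction described above. The maximum range and binning details are assumptions.

```python
import numpy as np

MAX_RANGE = 10.0  # assumed maximum sensing range (m); not specified in the paper

def to_laser(agent_pos, obstacle_pos, n_bins=360):
    """Rasterize obstacle positions into a 360-bin distance vector (one reading per degree)."""
    scan = np.full(n_bins, MAX_RANGE)
    for p in obstacle_pos:
        rel = p - agent_pos
        dist = np.linalg.norm(rel)
        bin_idx = int(np.degrees(np.arctan2(rel[1], rel[0])) % 360.0) % n_bins
        scan[bin_idx] = min(scan[bin_idx], dist)
    return scan

def build_observation(agent_pos, P_hist, V_last):
    """o_z: last 6 observed frames (6 x 360); o_p: 3 linearly predicted future frames (3 x 360)."""
    o_z = np.stack([to_laser(agent_pos, P) for P in P_hist[-6:]])
    o_p = np.stack([to_laser(agent_pos, P_hist[-1] + (k + 1) * V_last) for k in range(3)])
    return o_z, o_p
```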

Here we describe the transition between the high level and the low level. Every c time steps, the high-level module computes the sub-goal \(g_t\) for the next c time steps and conveys it to the low-level module; c is a hyper-parameter which we set to 20. The low level receives the sub-goal and starts a loop of computing primitive actions, the two-dimensional velocity \(v_t\). The low-level module does not stop until it meets a terminal condition. There are three terminal conditions: the agent arrives at the sub-goal, the agent collides with an obstacle, or the agent has performed the low-level action c times.
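The interaction between the two levels can be summarized by the following control-loop sketch; `env`, `high_level`, and `low_level` are placeholders for the simulator and the modules described above, and c is set to 20 as stated.

```python
C = 20  # high-level horizon c, as set in the paper

def run_episode(env, high_level, low_level):
    """One episode of the hierarchical control loop (env/high_level/low_level are placeholders)."""
    obs = env.reset()
    done = False
    while not done:
        sub_goal = high_level.act(obs)                  # g_t, kept fixed for up to c steps
        for _ in range(C):
            action = low_level.act(obs, sub_goal)       # primitive action: 2-D velocity v_t
            obs, done, info = env.step(action)
            # low-level terminal conditions: sub-goal reached, collision, or c actions executed
            if done or info["reached_sub_goal"] or info["collision"]:
                break
```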

Fig. 1.

The overview of our model. There are two levels. The high-level module aims to optimize long-term interest; its output is the sub-goal \(g_t=(p_x^{t+c},p_y^{t+c})\) for the future. The low-level module aims to navigate safely and efficiently to the sub-goal; its output is a primitive action, the 2-dimensional velocity \(a_{t+i}=(v_x^{t+i},v_y^{t+i})\).

Fig. 2.

In our model, the agent converts the environment observation into the laser form.

Fig. 3.

The neural network architecture shared by the high-level and low-level modules.

3.2 High-Level Module

As noted above, the low-level module is what actually interacts with the environment, while the high-level module interacts with the environment only by directing the low-level module. The responsibility of the high-level module is to give the low-level module a good sub-goal that balances reaching the final goal against safety. Therefore, we evaluate the high-level policy by the trajectory that the low-level module actually executes after receiving the sub-goal.

We first introduce the notation for the trajectory generated by the low-level module, and then the criteria used to evaluate that trajectory. After the low level receives the sub-goal, assume it executes \(c^L\) steps before meeting a terminal condition, \(c^L\le {c}\). The agent's trajectory is then \(\left\{ p^t,p^{t+1},\cdots ,p^{t+c^L}\right\} \), and the minimal distances between the agent and the obstacles are \(\left\{ d_t,d_{t+1},\cdots ,d_{t+c^L}\right\} \).

Let the displacement during the \(c^L\) steps be \(p^{t+c^L}-p^t\) and the travelled distance be \(\sum _{k=1}^{c^L}{\Vert p^{t+k}-p^{t+k-1}\Vert }\). Let the agent's final goal be g; then \(g-p^t\) is the vector from the agent's position at the beginning of the loop to the final goal.

We define \(d_{goal}\) as the distance by which the agent advances toward the final goal during the \(c^L\) steps, and UnitGoal as the distance advanced per time step.

$$\begin{aligned} d_{goal}&=\frac{(p^{t+c^L}-p^t)\cdot (g-p^t)}{\Vert {g-p^t}\Vert }\end{aligned}$$
(1)
$$\begin{aligned} UnitGoal&=\frac{d_{goal}}{c^L} \end{aligned}$$
(2)

If the agent’s minimal distance to the obstacles is less than 0.25 m, then the reward function will give a penalty signal for being too close to obstacles, which is −0.25+d. Formula 3 is the sum of the distance penalty CloseDist, \(1_{d<0.25}\)(d) is an indicator function.

$$\begin{aligned} CloseDist=\sum _{k=1}^{c^L}1_{d_{t+k}<0.25}(d_{t+k})\,(d_{t+k}-0.25) \end{aligned}$$
(3)

We define ExtraPath as the difference between the travelled distance and the displacement magnitude.

$$\begin{aligned} ExtraPath=\sum _{k=1}^{c^L}{\Vert p^{t+k}-p^{t+k-1}\Vert }-\Vert p^{t+c^L}-p^t\Vert \end{aligned}$$
(4)

Formula 5 is the high-level reward function, which takes UnitGoal, CloseDist, and ExtraPath into account. \(w_1,w_2,w_3\) are weighting parameters, initially set to 0.5, −0.5 and 0.5 for dimensional homogeneity, and then gradually tuned by repeated trials to \(w_1=0.8,\ w_2=-0.5,\ w_3=0.4\).

$$\begin{aligned} R_t=w_{1}UnitGoal+w_{2}ExtraPath+w_{3}CloseDist \end{aligned}$$
(5)
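For concreteness, the high-level reward can be computed from the executed trajectory as in the following sketch, which transcribes Formulas 1–5 with the final weights reported above; the array layout is an assumption.

```python
import numpy as np

W1, W2, W3 = 0.8, -0.5, 0.4  # final weights w_1, w_2, w_3 reported above

def high_level_reward(traj, dists, goal):
    """traj: (c_L + 1, 2) positions p^t..p^{t+c_L}; dists: (c_L + 1,) minimal obstacle distances."""
    c_L = len(traj) - 1
    displacement = traj[-1] - traj[0]
    to_goal = goal - traj[0]
    d_goal = displacement @ to_goal / (np.linalg.norm(to_goal) + 1e-8)       # Formula 1
    unit_goal = d_goal / c_L                                                 # Formula 2
    d = dists[1:]
    close_dist = np.sum((d < 0.25) * (d - 0.25))                             # Formula 3
    travelled = np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))
    extra_path = travelled - np.linalg.norm(displacement)                    # Formula 4
    return W1 * unit_goal + W2 * extra_path + W3 * close_dist                # Formula 5
```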

The high-level module is responsible for giving a good sub-goal, so we train it to maximize the one-step interest. We adapt the training design of DDPG [11], a policy-based RL algorithm, to optimize this one-step interest. There are three neural networks: a policy network Actor that outputs the sub-goal, and two value networks, Critic and \(Critic_{target}\), that estimate the state value. Critic and \(Critic_{target}\) have the same network structure. The task of \(Critic_{target}\) is to provide the policy's value without gradients, so it is not optimized during training; its parameters are periodically copied from Critic. The loss for Critic is shown in Formula 6, which minimizes the squared difference between the value estimated by Critic and the actual reward. The loss for Actor is shown in Formula 7, which maximizes the state value that the Actor's sub-goal can bring.

$$\begin{aligned} critic\ loss&=(Critic(s_t, g_t)-r_t)^2\end{aligned}$$
(6)
$$\begin{aligned} actor\ loss&= -{Critic_{target}(s_t,Actor(s_t))} \end{aligned}$$
(7)
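A minimal sketch of the two losses in Formulas 6 and 7, assuming the networks are PyTorch modules and that the periodic copy from Critic to \(Critic_{target}\) is handled elsewhere.

```python
def critic_loss(critic, s_t, g_t, r_t):
    # Formula 6: squared error between the predicted value of (state, sub-goal) and the reward r_t.
    return (critic(s_t, g_t) - r_t).pow(2).mean()

def actor_loss(critic_target, actor, s_t):
    # Formula 7: maximize Critic_target's value of the Actor's sub-goal; only the Actor's
    # parameters are updated with this loss (Critic_target is never optimized).
    return -critic_target(s_t, actor(s_t)).mean()
```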

3.3 Low-Level Module

The responsibility of the low-level module is to navigate to the sub-goal safely; this task is similar to the mono-layer RL works [4, 13]. The termination condition of the low level is reaching the sub-goal, colliding with an obstacle, or exceeding c executions. Formula 8 shows the low-level reward function, which rewards progress toward the sub-goal and penalizes collisions.

If the agent collides with another obstacle, we give the penalty \(r_{collision}=-3\). If the agent is too close to other obstacles (the distance to the nearest obstacle is less than 0.2 m), we give a penalty for the uncomfortable distance: \(-0.6+d_{min}/2\). Otherwise, we give a reward for moving toward the sub-goal: \(w_g(\Vert p^{t-1}-g\Vert -\Vert p^t-g\Vert )\), where \(w_g=2.5\). Unlike the mono-layer RL methods [4, 13], our reward function does not give a special reward for reaching the final goal, because a low-level episode can end by reaching the final goal, reaching the sub-goal, or timing out, so rewarding only the final goal would introduce inconsistency. The discount factor \(\gamma \) in [4, 13] is over 0.9; however, the influence of a collision does not need to propagate that far, so the discount factor \(\gamma \) in our model is 0.6.

$$\begin{aligned} r_t=\left\{ \begin{array}{lll} r_{collision} &{} &{} {\Vert p^t-p_j^t\Vert < 2R}\\ -0.6+d_{min}/2 &{} &{} {d_{min}<0.2}\\ w_g(\Vert p^{t-1}-g\Vert -\Vert p^t-g\Vert ) &{} &{} {otherwise}\ \end{array} \right. \end{aligned}$$
(8)
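A direct transcription of Formula 8; here \(d_{min}\) is taken as the surface-to-surface distance to the nearest obstacle, which is our reading of the notation rather than something stated explicitly.

```python
import numpy as np

R_COLLISION = -3.0
W_G = 2.5

def low_level_reward(p_prev, p_cur, sub_goal, obstacle_pos, radius):
    """Formula 8. obstacle_pos: (N, 2) obstacle centers; radius: the common disc radius R."""
    center_dists = np.linalg.norm(obstacle_pos - p_cur, axis=1)
    if np.min(center_dists) < 2 * radius:                     # collision
        return R_COLLISION
    d_min = np.min(center_dists) - 2 * radius                 # assumed surface-to-surface distance
    if d_min < 0.2:                                           # uncomfortably close
        return -0.6 + d_min / 2
    # reward for progress toward the sub-goal
    return W_G * (np.linalg.norm(p_prev - sub_goal) - np.linalg.norm(p_cur - sub_goal))
```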

The low-level module is trained using Deep Deterministic Policy Gradient (DDPG) [11], a policy-based method. Compared to the stochastic policy search of Proximal Policy Optimization (PPO) [19], the deterministic policy of DDPG accelerates convergence on the navigation problem. When navigating in a crowded environment, the low-level module frequently collides with obstacles, which can cause training to fall into a local minimum. To mitigate this problem, the low-level module uses two techniques. The first is controlling the ratio of successful and failed trajectories in the data set, which we set to 0.6:0.4 in this paper. The second is adding protection to the low-level policy, the 2-D velocity. The conventional method ORCA has a robust collision avoidance ability and takes a preferred velocity as input. After the low-level policy network outputs the velocity \(v_t\), we use ORCA to reduce the collision probability by taking \(v_t\) as the preferred velocity. In other words, we treat the low-level model's output as the preferred velocity for ORCA to increase safety.
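The two techniques can be sketched as follows; `orca_project` stands in for an ORCA solver that maps a preferred velocity to the closest collision-free velocity and is not a specific library call.

```python
import random

def safe_action(policy, state, sub_goal, neighbors, orca_project):
    """Use the learned velocity as ORCA's preferred velocity to filter out risky commands."""
    v_pref = policy.act(state, sub_goal)       # raw 2-D velocity from the low-level network
    return orca_project(v_pref, neighbors)     # ORCA picks the closest collision-free velocity

def sample_batch(success_buf, fail_buf, batch_size, success_ratio=0.6):
    """Keep the 0.6 : 0.4 mix of successful and failed experience in every training batch."""
    n_succ = int(batch_size * success_ratio)
    return (random.sample(success_buf, min(n_succ, len(success_buf))) +
            random.sample(fail_buf, min(batch_size - n_succ, len(fail_buf))))
```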

4 Experiment

4.1 Scene Design

The scenes should be able to verify the model's navigation ability, so we design them from two aspects. First, the agent should maintain its navigation ability when the scene's size changes. Second, the agent should maintain its navigation ability when the scene changes to a dissimilar one.

Therefore, this paper trains and tests on the following three scenes, whose sizes are easy to change. Figure 4 shows a diagram of the three scenes: the red disc represents the agent, the blue discs represent humans, and the stars with the corresponding colors represent the agent's or humans' goals. Figure 4(a) is the scene Squeeze, where an agent and a human are randomly positioned on a circle of radius r m and their goal positions are on the opposite side of the same circle. Figure 4(b) is the scene Circle; its design is similar to Squeeze, the only difference being that Circle has one agent and five humans. Figure 4(c) is the scene Square, where an agent and five humans are randomly positioned in a square whose side length is w m.
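For reference, one plausible way to instantiate these scenes is sketched below; the uniform sampling and antipodal goal placement are assumptions consistent with the description above.

```python
import numpy as np

def make_circle_scene(n_humans, r):
    """Squeeze (n_humans=1) and Circle (n_humans=5): start on a circle, goal on the opposite side."""
    angles = np.random.uniform(0.0, 2.0 * np.pi, size=n_humans + 1)   # agent + humans
    starts = r * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    goals = -starts                                                   # antipodal points
    return starts, goals

def make_square_scene(n_humans, w):
    """Square: agent and humans placed uniformly in a w x w square; goals sampled the same way."""
    starts = np.random.uniform(-w / 2, w / 2, size=(n_humans + 1, 2))
    goals = np.random.uniform(-w / 2, w / 2, size=(n_humans + 1, 2))
    return starts, goals
```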

Fig. 4.

The scenes that we use in this paper to verify the effectiveness of our model.

We design two kinds of experiments. Experiment 1 compares the models' navigation ability, where the training and test scenes are the same. Experiment 2 compares the models' resilience from two aspects: resilience to scene size and resilience to scene type.

4.2 Performance Metrics

To fairly compare our model with other models, every test scene is evaluated over 100 runs. We use three performance metrics. Success rate is the fraction of runs in which the agent arrives at the destination without collision. Collision rate is the fraction of runs in which the agent collides with an obstacle. Average navigation time is the average navigation time of the successful trajectories.
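The three metrics can be computed from the test episodes as in the following sketch; the episode record format (`success`, `collision`, `time`) is illustrative.

```python
def evaluate(episodes):
    """episodes: list of dicts with boolean 'success'/'collision' and float 'time' entries."""
    n = len(episodes)
    success_rate = sum(e["success"] for e in episodes) / n
    collision_rate = sum(e["collision"] for e in episodes) / n
    times = [e["time"] for e in episodes if e["success"]]
    avg_nav_time = sum(times) / len(times) if times else float("nan")
    return success_rate, collision_rate, avg_nav_time
```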

4.3 Navigation Ability Comparison

We compare our model with representative RL-based models. As expected, CADRL and SARL have low collision rates because their training algorithms are value-based, cautiously enumerating all discrete actions. Long et al. [13] use a policy-based algorithm whose action space is continuous, so its exploration space is far larger than that of the value-based methods. As the scene complexity increases, Long et al. fails to avoid the obstacles: its collision rates on Squeeze, Circle and Square are 0.24, 0.3 and 0.47, respectively. Our model's hierarchical structure decouples the navigation task into a target-driven task and a collision avoidance task, so each layer concentrates on its own responsibility; in addition, our low-level module is protected by the ORCA policy. As a result, our model has the highest success rate on all three scenes.

As for the average navigation time, the value-based methods take longer than the policy-based methods on Circle and Square, because their action space is discrete and their trajectories are therefore less smooth than the policy-based ones. We also compute statistics on the trajectories' kinetic energy: the minimum, mean and maximum over the 100 trajectories. As the fifth row in Table 1 shows, our model has the lowest values on Circle and Square.

4.4 Resilience Comparison

Table 1. Testing the learning ability of the model: train and test on the same scene.

Resilience on Scene Size. In this experiment, we train the model on Squeeze with diameter 8 m and test it on Squeeze with diameter 16 m, with the same configuration for Circle and Square. The test results are shown in Table 2 (3rd, 4th and 5th columns). The value-based models CADRL and SARL behave poorly on Squeeze and Circle, with success rates of 0.03 and 0, respectively. Because Square changes little as its side length increases, SARL still retains a success rate of 0.49. The reason is that the value-based method chooses its action by \(a_t\leftarrow {{\arg \max }_{a_t\in {A}}{R(s_t,a_t)+\gamma {V(\hat{s}_{t+1})}}}\), which depends heavily on an accurate estimate of the state value \(V(\hat{s}_{t+1})\). If the value network has never seen the state \(s_{t+1}\) before, it is hard for a value-based method to choose a reasonable action.
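The dependence on accurate value estimates is visible in the selection rule itself; the sketch below assumes a discrete velocity set and a one-step state predictor, both placeholders.

```python
def value_based_action(s_t, actions, reward_fn, value_net, predict_next, gamma):
    """a_t = argmax_a R(s_t, a) + gamma * V(s_hat_{t+1}); unreliable when V has not seen s_{t+1}."""
    best_a, best_q = None, float("-inf")
    for a in actions:                        # enumerate the discrete velocity set
        q = reward_fn(s_t, a) + gamma * value_net(predict_next(s_t, a))
        if q > best_q:
            best_a, best_q = a, q
    return best_a
```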

The policy-based methods use a policy network to learn the relation between environment states and actions. Therefore, even when facing an unfamiliar environment state, the policy network still knows the general direction of the action. As the second row in Table 2 shows, the success rates of Long et al. are 0.76, 0.7 and 0.53, respectively, which are clearly higher than those of the value-based methods. However, its collision rate grows as the scene complexity increases. Our model retains a high success rate when the scene size increases, with success rates of 0.96, 0.95 and 0.92, respectively. Our high-level module helps to avoid relatively crowded areas by choosing temporary destinations.

Resilience on Scene Type. To test resilience to scene type, we train the model on Squeeze with diameter 8 m, then test it on Circle with diameter 8 m and Square with side length 8 m, so the scene sizes of training and testing are the same. The test results are shown in Table 2 (6th and 7th columns). Faced with an unfamiliar scene of the same size, CADRL and SARL retain some navigation ability, with success rates above 0.65. SARL behaves better than CADRL because its value network can process a crowd, whereas CADRL can only process pairwise relationships. The success rates of our model on Circle and Square are 0.99 and 0.92, clearly better than the other models.

Table 2. Testing the resilience of the model: train and test on different scenes.
Table 3. Testing the navigation ability of the model: train and test in the Concentric scene.

4.5 Ablation Experiment

To verify the effectiveness of our hierarchical architecture, we compare our model with a mono-layer RL model. The reward function and training configuration of the mono-layer RL model are the same as our low-level module's, except that our low-level module computes its velocity policy based on the sub-goal while the mono-layer model uses the final goal. We compare the two models in the Concentric scene (Fig. 4(d)). The radii of the two circles are 8 m and 16 m. The test results are shown in Table 3. The success rate of our model is 0.99, which is much higher than the mono-layer model's 0.18. As Fig. 5 shows, the mono-layer model's policy (the left trajectory) is aggressive: the agent walks straight toward the final goal. Although the agent tried to avoid nearby obstacles, the high obstacle density eventually led to a collision. Our policy (the right trajectory) avoids the high-density region via sub-goals, which are shown as red stars in Fig. 5.

Fig. 5.

Train and test the models on Concentric scene. (Color figure online)

5 Conclusion

In this paper, we propose a hierarchical reinforcement learning based navigation algorithm, which decouples the navigation task into target driving and collision avoidance. We evaluated our approach by comparing the trajectories taken by the agent with those of previous methods. Our experimental results suggest that our approach produces motion that is more resilient across different scenes.