1 Introduction

Applying reinforcement learning (RL) methods to solve complex tasks is becoming increasingly popular, and ongoing research constantly delivers new state-of-the-art algorithms [15, 23, 30]. In some of these applications, RL-based solutions even exceed human performance [10]. However, the vast majority of the considered problems are set in virtual domains, usually taking the form of a computer game or a simulation, while progress on problems in the real-world domain is not as advanced. This immaturity is a concern for industry, where the application potential is significant.

One of the potential application areas of reinforcement learning methods is behavior and trajectory planning for automated vehicles. Due to the problem's complexity, rule-based deterministic algorithms may not be sufficient to achieve the desired system performance. Therefore, a need for data-driven methods emerges, and reinforcement learning seems to be a promising approach.

As a research group involved in the development of automated vehicle technology, we present our motivation for applying reinforcement learning in this area and the problems behind it. In this paper, we refer to the publication [13], in which the authors outlined nine major problems of applying reinforcement learning in a real-world setting. We describe the problems we encountered and investigated during the implementation of the behavior planning functionality in a close-to-production-ready system.

1.1 Our Work

In this section we present the architecture of our system, in which a reinforcement learning agent takes the place of the behavior planning module of an automated driving stack.

Motion Planning System. For the reasons described in more detail in Sect. 3.4, we decided to create a hybrid system consisting of both reinforcement learning and model-based algorithms. The motion planning system is fed by a perception stack, which provides a representation of the ego car's surroundings, such as other road users and the road itself. The system consists of four main entities: the behavior planning module, the safety framework, the trajectory planning module, and the control block. The architecture is presented in Fig. 1. In our case, the behavior planning module, which provides the car with a high-level description of an intention, like a maneuver to execute or a speed recommendation, takes the form of a reinforcement learning agent. The returned behavior is then realized by model-based methods responsible for trajectory planning and control, which operate under the supervision of the safety framework. The safety framework itself, heavily based on the ideas presented in [31], along with an additional set of deterministic rules derived from traffic law, assures the safety of both the AI-based behavior planning and the trajectory planning. The behavior planning module can additionally be supplied with signals from the mission planning module, which is responsible for producing the sub-goals necessary to follow the defined path.

Fig. 1. The motion planning architecture: the RL-based behavior planning module (in red) and the model-based trajectory planning and control modules (in blue). (Color figure online)

Reinforcement Learning Environment. To train a reinforcement learning agent responsible for behavior planning, we designed a custom environment based on a proprietary closed-loop simulation tool [6]. The environment consists of multiple blocks, which we list below.

  • Behavior decoding decodes an action (output by the policy neural network) into a semantic value, like a maneuver to execute or a velocity set point (see Sect. 3.3).

  • Trajectory planning establishes a specific realization of a behavior and returns a continuous trajectory to be followed by the agent.

  • Control produces low-level control signals (e.g. throttle, braking, steering wheel angle) to keep the car on its reference trajectory.

  • Simulation delivers a closed-loop simulation of vehicle dynamics and road users interactions.

  • Perception simulation disturbs the perfect ground truth obtained from the simulation using high-level sensor models (see Sect. 3.5).

  • Observation encoding encodes the state of the environment to a form easily interpretable by the policy network. The design of this transformation is crucial for the policy to be able to generalize well in various situations (see Sect. 3.3).

  • Reward calculation defines a numerical reward signal for the agent, designed to reward the agent for keeping a cruising velocity and reaching predefined lane-based goals while maximizing passenger comfort and safety (see Sect. 3.6).

By defining the environment in such a way, we are able to control the current challenge posed to the agent as well as experiment with different versions of each module.
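To make this structure concrete, the following sketch shows how the listed blocks could be composed into a single environment step. It is a minimal illustration only: all class names, method signatures, and state fields are assumptions for the sake of the example, not our actual implementation.

```python
# Illustrative composition of the environment blocks listed above.
# All names (simulation, trajectory_planner, ...) are hypothetical.
class BehaviorPlanningEnv:
    def __init__(self, simulation, trajectory_planner, controller,
                 perception_model, encoder, decoder, reward_fn):
        self.sim = simulation                      # closed-loop vehicle/traffic simulation
        self.trajectory_planner = trajectory_planner
        self.controller = controller
        self.perception_model = perception_model   # high-level sensor models (Sect. 3.5)
        self.encoder = encoder                     # observation encoding (Sect. 3.3)
        self.decoder = decoder                     # behavior decoding (Sect. 3.3)
        self.reward_fn = reward_fn                 # reward calculation (Sect. 3.6)

    def step(self, action):
        behavior = self.decoder.decode(action)                 # action -> maneuver / set points
        trajectory = self.trajectory_planner.plan(behavior, self.sim.state)
        controls = self.controller.track(trajectory, self.sim.state)
        ground_truth = self.sim.advance(controls)               # simulate one time step
        perceived = self.perception_model.disturb(ground_truth) # imperfect perception
        observation = self.encoder.encode(perceived)
        reward = self.reward_fn(ground_truth, behavior)
        done = ground_truth.collision or ground_truth.goal_reached
        return observation, reward, done, {}
```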

Training Setup. Having defined the environment, we utilize standard reinforcement learning algorithms, such as DQN [23], PPO [30], or SAC [15], to train the behavior planning agent. For better generalization and more efficient training, these algorithms were scaled up to support a distributed setup. Our best-performing setup is a combination of the PPO algorithm and multiple state-of-the-art methods available in the literature. In [28] we presented our previous results with the DQN algorithm, along with a more detailed description of the environment definition, including the observation and action spaces and the reward shaping mechanism (Fig. 2).

Fig. 2. The architecture of the behavior planning agent training setup.
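As a rough illustration of such a scaled-up setup, the sketch below trains a PPO policy on several parallel copies of the simulation environment. The choice of Stable-Baselines3 and the `BehaviorPlanningEnv` class are assumptions for the example (any distributed RL framework could play this role); the environment is assumed to expose the standard gym interface.

```python
# A minimal distributed-rollout sketch, assuming a gym-compatible wrapper of
# the environment sketched above; library choice and hyperparameters are
# illustrative, not our production setup.
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env():
    def _init():
        return BehaviorPlanningEnv()  # hypothetical gym-compatible environment
    return _init

if __name__ == "__main__":
    # Eight simulation instances collect rollouts in parallel subprocesses.
    vec_env = SubprocVecEnv([make_env() for _ in range(8)])
    model = PPO("MlpPolicy", vec_env, n_steps=2048, batch_size=4096, verbose=1)
    model.learn(total_timesteps=10_000_000)
```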

2 Promises

Using data-driven methods over other approaches in the behavior and trajectory planning modules brings several promises. The most obvious one is the transfer of responsibility for the design of driving policy rules and actions from a group of human experts to a computer program. This design can be based on a large amount of data collected with the help of a realistic virtual simulation. A good example is Waymo, a self-driving company, which states that its systems drive 20 million miles in simulation each day [4]. Still, the choice between supervised learning and reinforcement learning methods is an open question.

Supervised learning methods, such as behavior cloning [33], might be more attractive due to the higher stability of their training process, but they require human interaction in the data collection process, making this process significantly slower and more expensive [1]. Furthermore, when the trained policy is found to be unsatisfactory, the data has to be collected again. In the case of reinforcement learning, for some algorithms the data can be reused by merely recalculating the rewards.

The more important difference, however, lies in the trained policy itself. With supervised learning, we are limited to the single policy demonstrated by a driver. Moreover, it is unlikely that the driver's policy is optimal. This is because it is usually hard to decide which action is the best decision at a given moment and how this action will affect future states. This issue is known as the credit assignment problem, and many reinforcement learning algorithms tackle it by design. Yet another problem we observed is the difference between the perception system of a human and that of an automated vehicle. Drivers' decisions are affected by human factors such as lack of attention, cognitive distraction [21, 36], or aggressiveness level [22]. These and other discrepancies in perception cause a distributional shift between the learning and evaluation domains of the automated vehicle. Finally, exploration of corner cases (accidents, dangerous or unusual situations), even when carried out in virtual simulation, is more efficient with reinforcement learning methods, as humans are willing to risk less than computer programs.

3 Challenges

3.1 Training Off-Line from the Fixed Logs of an External Behavior Policy

Fixed logs from driving are valuable assets in the entire learning process. Such logs can be used in a supervised manner to train an initial policy with behavior cloning [33], as part of an imitation learning process [34, 35], or as additional exploration guidance for policy optimization [20].

A good example of successful log utilization is the work [11], in which a team from Waymo used 30 million samples of car trajectories to train a neural network policy responsible for generating a future trajectory for a vehicle. As the authors admitted, the model was good enough to drive a vehicle successfully; however, it was not fully competitive with deterministic motion planning despite the considerable amount of data used.

Besides a sufficient amount of data, the samples should also be of sufficient quality. Erroneous actions resulting from drivers' mistakes are misleading during training. Additionally, the data should include states in which a reasonable driver would never end up. Such states nevertheless provide valuable inputs to the learning process, as they alleviate the distributional shift between the learning and the operating domains.

3.2 Learning on the Real System from Limited Samples

The process of developing functionalities capable of operating in the real world based solely on simulation is burdened with the simulation-to-reality (sim-to-real) gap. Therefore, a logical step to close this gap is to fine-tune the policy on public roads. In this process, the most recent driving policy is used to interact with the target environment, explore it, and collect data for the next training iteration. The relatively slow rate of data collection, and thus the limited number of samples, together with the desire to reuse data, suggest the use of off-policy methods such as DQN [23] or SAC [15]; the efficiency of the latter in training physical robots was demonstrated in [16].
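The sketch below illustrates the sample-efficiency argument: each scarce real-world transition is stored once and reused for many gradient updates. The buffer, the update-to-data ratio, and the `agent` interface are assumptions for the example; `agent` stands for any off-policy learner such as DQN or SAC.

```python
# Illustrative off-policy fine-tuning loop on a real (or real-like) system;
# all names and parameters are hypothetical.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):                 # (obs, action, reward, next_obs, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def fine_tune_on_real_system(agent, real_env, steps=10_000, updates_per_step=20):
    buffer = ReplayBuffer()
    obs = real_env.reset()
    for _ in range(steps):
        action = agent.act(obs)                # most recent driving policy
        next_obs, reward, done, _ = real_env.step(action)
        buffer.add((obs, action, reward, next_obs, done))
        # Reuse each scarce real-world sample many times (high update-to-data ratio).
        if len(buffer.buffer) >= 256:
            for _ in range(updates_per_step):
                agent.update(buffer.sample(batch_size=256))
        obs = real_env.reset() if done else next_obs
```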

The complication that arises in such an approach is the risk resulting from executing a deficient policy in a vulnerable environment. The application of such a solution requires ensuring the safety of the environment influenced by the agent. In our work, this approach is viable because the agent's actions are secured by the deterministic trajectory planning and its safety module (discussed in Sect. 3.4).

If performance drops are recognized in certain situations during evaluation, the process allows counteracting them by involving more vehicles in those problematic scenarios; the collected data can later be used to improve the policy in those cases.

3.3 High-Dimensional Continuous State and Action Spaces

While designing state spaces for robotic applications, it is reasonable to convert raw sensor inputs into another, meaningful representation, e.g. camera images into a set of detected objects, or a lidar point cloud into a spatial map of the surroundings. This way, we decrease the number of black-box models in the motion planning module, which significantly increases its explainability. Moreover, such a representation of the input data contributes to better generalization of the neural network and speeds up its training. We use this approach in our work, where we represent the environment with high-level properties of the ego vehicle (e.g. orientation, velocity), the target vehicles, and the road model.
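A simplified sketch of such an observation encoding is shown below: high-level properties of the ego vehicle, the surrounding vehicles, and the road model are flattened into a fixed-size vector for the policy network. The field names, normalization constants, and the number of tracked targets are illustrative assumptions.

```python
# Illustrative high-level observation encoding; field names and scales are hypothetical.
import numpy as np

def encode_observation(ego, targets, road, max_targets=6):
    ego_features = [ego["velocity"] / 40.0,          # roughly normalize to [0, 1]
                    ego["heading"] / np.pi,
                    ego["lane_offset"] / 3.5]
    target_features = []
    for t in sorted(targets, key=lambda t: abs(t["rel_x"]))[:max_targets]:
        target_features += [t["rel_x"] / 100.0,      # relative longitudinal position
                            t["rel_y"] / 10.0,       # relative lateral position
                            t["rel_velocity"] / 40.0]
    # Pad with zeros when fewer than max_targets vehicles are visible.
    target_features += [0.0] * (3 * max_targets - len(target_features))
    road_features = [road["num_lanes"] / 4.0,
                     road["curvature"] * 100.0,
                     road["speed_limit"] / 40.0]
    return np.asarray(ego_features + target_features + road_features, dtype=np.float32)
```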

A reasonable action space for an automated vehicle motion planning module must consist of at least one control signal for each of the two spatial dimensions: longitudinal (velocity, acceleration, throttle, braking) and lateral (heading, steering angle). Another approach is to determine a set of waypoints, which describe consecutive features of a trajectory: velocity and acceleration values sampled along a reference path.

As mentioned in Sect. 1.1, we split the control decision component into two modules: behavior planning and trajectory planning. The behavior planning module is further split into strategic and tactical planners. The former outputs one maneuver (e.g. follow lane, change lane left, prepare to change lane left) from the list of currently available maneuvers. The contents of this list depend on the current situation on the road and are strictly controlled by the safety framework. The latter produces a combination of additional guiding signals for the trajectory planning module, such as a velocity set point or a lane bias. The real-valued signals are discretized with a resolution found empirically, keeping in mind the trade-off between model complexity and maneuver flexibility.
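The sketch below illustrates such a two-level action space: a discrete strategic maneuver combined with discretized tactical guiding signals. The maneuver list, the value grids, and the fallback behavior are assumptions chosen for the example, not our exact discretization.

```python
# Illustrative decoding of a flat discrete action into a maneuver plus guiding signals.
import itertools

MANEUVERS = ["follow_lane", "change_lane_left", "change_lane_right",
             "prepare_change_left", "prepare_change_right"]
VELOCITY_SET_POINTS = [0.0, 10.0, 20.0, 30.0]   # m/s, empirically chosen resolution
LANE_BIASES = [-0.5, 0.0, 0.5]                  # m, lateral offset within the lane

# The flat discrete action space is the Cartesian product of the three components.
ACTIONS = list(itertools.product(range(len(MANEUVERS)),
                                 range(len(VELOCITY_SET_POINTS)),
                                 range(len(LANE_BIASES))))

def decode_action(action_index, allowed_maneuvers):
    m_idx, v_idx, b_idx = ACTIONS[action_index]
    maneuver = MANEUVERS[m_idx]
    # The safety framework controls which maneuvers are currently available;
    # a disallowed choice falls back to lane following in this sketch.
    if maneuver not in allowed_maneuvers:
        maneuver = "follow_lane"
    return {"maneuver": maneuver,
            "velocity_set_point": VELOCITY_SET_POINTS[v_idx],
            "lane_bias": LANE_BIASES[b_idx]}
```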

3.4 Satisfying Safety Constraints

All components of a modern vehicle must meet strict safety requirements. Both hardware and software are subject to validation according to standards such as ISO 26262 (Functional Safety) or ISO/PAS 21448 (Safety of the Intended Functionality) and are therefore heavily tested before they reach the end customer. With the advent of advanced systems that take more and more responsibility for controlling a vehicle, safety considerations for highly automated vehicles have become a complex topic in themselves.

When looking for the required performance with regard to safety, it is reasonable to refer to human drivers. While car accidents cause over 1.3 million deaths annually [27], the fatal accident rate of a single vehicle is relatively low, reaching on average 1 fatality per 100 million miles in the US [8]. Since statistics on road-related accidents shape the expectations regarding the maximum failure rates of automated vehicles, the rarity of such events poses a great challenge to companies developing such systems. Showing with reasonable confidence, through end-to-end test drives, that a system has a significantly lower failure rate in terms of fatal accidents would require billions of miles of test drives [19].

The relative rarity and importance of severe accidents create a challenge not only for testing and validation but also for the development of such systems. Training an end-to-end RL-based planning system to drive with a reasonably low severe-collision rate requires modeling a reward function with a trade-off between efficiency and safety, or using constrained RL [7]. However, to free the RL part from a problem that can be formulated deterministically, we decided that the reinforcement learning system should not be formally responsible for the safety aspects of driving.

Another observation can be made with respect to traffic regulations. Training an agent to comply with them using a reward signal will always result in an approximation of those rules, which is unacceptable. Fortunately, it is straightforward to design a model-based system that obeys such rules by definition.

The points above suggest that end-to-end planning systems based on reinforcement learning present several serious issues. It seems that the desired system should rather be a hybrid approach, composed of model-based and machine learning modules. The interface between the reinforcement learning module and the deterministic part of the system has to satisfy two contradicting objectives. On the one hand, the interface has to be constrained enough not to place safety requirements on the reinforcement learning part. On the other, it has to be open enough to benefit from data-driven methods in planning and to assure a sufficient level of policy transparency, allowing efficient and explainable learning.

An interesting approach to providing safety guarantees in path planning systems with reinforcement learning elements was proposed in [31]. The authors formulate a set of safety rules that are used to derive deterministic constraints for a path planning algorithm. The rules describe a safety envelope, i.e. a constrained area of the state space, which guarantees the existence of collision-free trajectories in all reasonable situations. Reasonable situations are those in which road users follow traffic laws and formalized common-sense traffic rules. Early detection of safety envelope violations caused by the controlled agent allows predefined emergency responses to be executed in time to avoid collisions. Similar concepts were also proposed in [26] and [29].

The described approach can fulfill two major roles in an RL-based path planning system: limiting the available action space and triggering the execution of emergency trajectories in unsafe situations (which may be caused both by the ego's actions and by other agents' behaviors). The safety envelope provides transparent, testable safety constraints that protect the agent against performing unsafe maneuvers and enforce proper responses when the constraints are violated.
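An illustrative sketch of these two roles is given below: candidate actions that would violate the safety envelope are masked out, and a predefined emergency response is returned when no safe action remains. The predicate and method names are assumptions, not the formal rules of [31].

```python
# Illustrative action filtering by a safety envelope; all names are hypothetical.
def filter_actions(candidate_actions, state, safety_envelope):
    safe_actions = [a for a in candidate_actions
                    if safety_envelope.keeps_safe_distance(state, a)]
    if not safe_actions:
        # The envelope is (about to be) violated: execute the emergency response
        # (typically severe braking) instead of any learned action.
        return [safety_envelope.emergency_response(state)]
    return safe_actions
```

The agent then selects its action only from the returned safe subset, so the learned policy never has to carry formal safety responsibility.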

This approach alone provides a certain level of protection against collisions, but the overall system performance in terms of avoiding unsafe situations can still be improved in the training process. Since emergency actions typically consist of severe braking, which is sub-optimal in terms of passenger comfort and driving efficiency, a properly trained agent will proactively avoid potentially unsafe actions. An example of such behavior may be switching to a lane away from a merge-in lane to avoid potentially unsafe situations involving slow vehicles merging into highway traffic.

Proactive safety behaviors emerging in an RL-based system may be of particular significance in the context of a realistic sensor stack with limited performance. Simulating typical sensor error patterns and occlusions in the training setup may significantly decrease the impact of perception issues on overall system safety. An agent trained in such an environment may learn to avoid situations that are correlated with hazardous perception errors, as well as to perform human-like visibility-enhancement actions, such as biasing within its lane to unveil occluded areas of the environment.

3.5 Partial Observability and Non-stationarity

In the automated driving domain, both partial observability and non-stationarity are major issues that significantly affect the trained policy's performance. They result from different factors, from physical-world barriers to software and hardware limitations. In this section, we summarize the types of problems we observed during our research and the solutions we applied or propose.

Occluded Objects. The first problem is occluded objects: other vehicles, pedestrians, traffic signs and lights, lane markings, and other elements of the infrastructure. One possible general solution might be to involve V2V or V2X communication [18] to exchange information between objects, even when they are not visible to each other. Another option is to assume worst-case scenarios, as proposed in [31] and discussed in Sect. 3.4, and act appropriately, although this might lead to overly protective policies. We implemented the latter by injecting worst-case object hypotheses at occlusion edges into the perception results and letting the agent learn to avoid potentially unsafe regions, which minimizes the effect on passenger comfort and overall efficiency.
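The sketch below illustrates the idea of worst-case hypothesis injection: for every occlusion edge, a hypothetical object with a conservative assumed velocity is added to the perceived object list, so downstream modules treat occluded regions cautiously. The parameters and field names are assumptions made for the example.

```python
# Illustrative injection of worst-case object hypotheses at occlusion edges.
def inject_occlusion_hypotheses(perceived_objects, occlusion_edges,
                                worst_case_speed=15.0):
    hypotheses = []
    for edge in occlusion_edges:
        hypotheses.append({
            "position": edge["position"],        # boundary of the occluded area
            "velocity": worst_case_speed,        # assume a fast object may emerge
            "heading": edge["normal_heading"],   # moving towards the visible area
            "is_hypothesis": True,               # flag for downstream modules
        })
    return perceived_objects + hypotheses
```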

Perception Issues. The second problem is the imperfection of the sensing stack. A typical automated vehicle prototype is nowadays equipped with a combination of multiple cameras, radars, and lidars. Each type of sensor has its advantages and disadvantages, so the common approach is to fuse data from multiple sources to obtain a high-fidelity representation of the vehicle's surroundings. In practice, false positives (ghost objects), false negatives (missing objects), or significant measurement uncertainties are still present in the fused representation. To address these problems in simulation, sensor models can be introduced to imitate the behavior of the real-world sensing suite. A lot of work has already been done in the area of detailed sensor simulation [14, 24]. However, because of their level of detail, the proposed solutions are usually computationally expensive, which significantly extends the simulation time and thus slows down the learning process. To alleviate this issue, we propose the use of simple, high-level sensor models, which skip modeling the underlying physical principles of a given sensor (e.g. how a radar wave is reflected from an object and how it is later tracked) and instead utilize statistical models for each common problem (ghost objects, uncertainty, etc.; Fig. 3).

Fig. 3. The high-level sensor models: a) a measurement uncertainty (in red) is applied to target objects (in gray); b) a ghost object (in green) is inserted into the sensor's FOV; c) a missing object (in red) is removed from the sensor's FOV. (Color figure online)
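A minimal sketch of such a statistical, high-level sensor model is given below: measurement noise, ghost objects, and dropped objects are applied directly to the ground-truth object list. All probabilities, noise scales, and the `fov` helper are assumptions for illustration.

```python
# Illustrative high-level (statistical) sensor model applied to ground-truth objects.
import random

def apply_sensor_model(ground_truth_objects, fov,
                       pos_noise_std=0.3, p_ghost=0.01, p_drop=0.02):
    perceived = []
    for obj in ground_truth_objects:
        # fov is a hypothetical field-of-view helper with contains()/sample_point().
        if not fov.contains(obj["position"]) or random.random() < p_drop:
            continue                              # false negative: missing object
        noisy = dict(obj)                         # measurement uncertainty
        noisy["position"] = (obj["position"][0] + random.gauss(0.0, pos_noise_std),
                             obj["position"][1] + random.gauss(0.0, pos_noise_std))
        perceived.append(noisy)
    if random.random() < p_ghost:                 # false positive: ghost object
        perceived.append({"position": fov.sample_point(), "velocity": 0.0,
                          "is_ghost": True})
    return perceived
```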

Intentions. The third problem is missing information about other road users' intentions. For the same observation of a state, the consecutive states might differ between episodes, depending on the hidden state of the other drivers. One solution might be to add a pipeline responsible for the behavior or trajectory prediction of road users. This introduces extra computing overhead and is a challenging task in itself, but significantly increases the model's explainability. Another solution is to let the policy network spot the interesting connections on its own and store them in memory blocks.

In the previous paragraphs, we mentioned memory blocks as a solution for the partial observability and non-stationarity of the environment. The obvious choice is to use modules such as LSTM [17] cells, but they lack interpretability. Frame stacking or putting historical data into the observation seems more suitable for safety-critical systems; however, in order to achieve satisfying policies, such approaches might need to generate their own signals (e.g. information about a long-term decision taken in the past) and pass them between consecutive timesteps. Ideally, to achieve high interpretability, these signals would be identified and defined by a human in advance. This, however, limits the capabilities the policy is able to learn.
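The small sketch below illustrates the history-based alternative to LSTM memory: the last few encoded observations, each extended with an explicitly defined signal such as the last long-term decision, are stacked into one policy input. The class, the history length, and the single memory signal are assumptions for the example.

```python
# Illustrative frame-stacking wrapper with a human-defined memory signal.
from collections import deque
import numpy as np

class HistoryWrapper:
    def __init__(self, history_length=4):
        self.frames = deque(maxlen=history_length)

    def reset(self, first_observation):
        frame = np.concatenate([first_observation, [0.0]])   # no past decision yet
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return self.observation()

    def push(self, observation, last_long_term_decision):
        # Append an explicitly defined memory signal to keep the input interpretable.
        self.frames.append(np.concatenate([observation,
                                           [float(last_long_term_decision)]]))
        return self.observation()

    def observation(self):
        # Stacked history of the last k extended observations.
        return np.concatenate(list(self.frames))
```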

3.6 Unspecified and Multi-objective Reward Functions

Designing a reward function is another challenge that determines the success of the implemented solution. Drivers, while making their decisions, take into account many aspects, which should be identified and represented in the reward function design.

Therefore, the reward function should be a compromise between proactive safety assurance, performance in reaching destination targets, efficiency (time, fuel), comfort, and the general impact on traffic flow. However, reward hacking [9] is a known problem in reinforcement learning, which causes an agent not to behave as expected even though it achieves high rewards. We observed this in our work multiple times, e.g. when the agent changed its current maneuver too frequently. It is important to remember that the reward function designer must adequately prioritize and balance the individual parts of the reward signal, or use meta-learning techniques.
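A simplified sketch of such a multi-objective reward is shown below, combining goal progress, cruising-velocity keeping, comfort, and a penalty for too frequent maneuver switching (the reward-hacking case mentioned above). All weights, state fields, and scales are illustrative assumptions and in practice require careful balancing.

```python
# Illustrative multi-objective reward; weights and field names are hypothetical.
def compute_reward(state, behavior, previous_behavior,
                   w_goal=1.0, w_speed=0.3, w_comfort=0.1, w_switch=0.2):
    goal_term = 1.0 if state["reached_lane_goal"] else 0.0
    # Penalize deviation from the recommended cruising velocity.
    speed_term = -abs(state["velocity"] - state["cruise_velocity"]) / state["cruise_velocity"]
    # Penalize harsh longitudinal/lateral accelerations for passenger comfort.
    comfort_term = -(abs(state["long_accel"]) + abs(state["lat_accel"])) / 10.0
    # Discourage oscillating between maneuvers at consecutive time steps.
    switch_term = -1.0 if behavior["maneuver"] != previous_behavior["maneuver"] else 0.0
    return (w_goal * goal_term + w_speed * speed_term
            + w_comfort * comfort_term + w_switch * switch_term)
```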

3.7 Explainability

Explainability is a crucial requirement when working with safety-critical systems such as automated vehicles. In case of a system failure, whether in the form of an accident or a dangerous situation, a testing team must have a tool to understand why the situation happened, which decision was wrong, and which signals contributed most to this decision. Fortunately, the topic has been extensively researched recently; a good survey of available methods was presented in [25], and open-source tools [2, 5] have been released.

In this paper, we focus on neural network-based policies whose inputs are high-level signals output by other systems (perception, localization). This is in contrast to end-to-end methods (e.g. [12]), where the input is usually raw data and its interpretation is a rather difficult task. In our situation, the input signals are labeled, and we can use simple methods such as Integrated Gradients [32] to calculate each input neuron's attribution to the output signal and provide meaningful insight into the policy's motivation.
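As an example of how such attributions could be computed for a labeled, high-level observation, the sketch below uses Integrated Gradients via the Captum library; the policy network, the zero baseline, and the feature names are assumptions made for illustration, not our tool [3].

```python
# Illustrative Integrated Gradients attribution for a labeled observation vector.
import torch
from captum.attr import IntegratedGradients

def explain_decision(policy_network, observation, feature_names, chosen_action):
    ig = IntegratedGradients(policy_network)       # policy_network: torch.nn.Module
    obs = torch.as_tensor(observation, dtype=torch.float32).unsqueeze(0)
    baseline = torch.zeros_like(obs)               # all-zero reference observation
    attributions = ig.attribute(obs, baselines=baseline, target=chosen_action)
    # Because the inputs are labeled high-level signals, attributions can be
    # reported directly per feature.
    return dict(zip(feature_names, attributions.squeeze(0).tolist()))
```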

With this methodology, we can navigate through a complete episode and investigate what caused a problematic decision at any given time step, and how the decision would change if we changed the observation. We found this approach successful for some tasks and open-sourced an initial version of our tool [3]. We did, however, encounter problems with understanding the neurons' state in LSTM cells. As mentioned in Sect. 3.5, there is a need for more explainable memory blocks.

4 Summary

In this paper, we presented the promises we see behind the application of reinforcement learning methods to motion planning. We also listed the challenges we faced during our research on the behavior planning module and described our solutions to these issues. Based on our observations, we believe that the current state of research allows us to successfully deploy reinforcement learning in real-world automated driving. However, some work still has to be done, e.g. to ensure better explainability of an artificial agent and to provide an easier comparison between different versions and methods of the trained agent. As a consequence of the unpredictability of RL agents, we presented our method of satisfying safety constraints, but this approach still requires exhaustive testing in the operating domain. Additionally, the formulation of the reward function remains an open research topic. As an active research group, we will continue our research towards more challenging traffic scenarios, such as those occurring on urban roads. Our future research will also be devoted to the explainability problem of RL agents.