Dense reinforcement learning for safety validation of autonomous vehicles

Feng, Shuo; Sun, Haowei; Yan, Xintao; Zhu, Haojie; Zou, Zhengxia; Shen, Shengyin; Liu, Henry X.

doi:10.1038/s41586-023-05732-2


Article
Published: 22 March 2023

Volume 615, pages 620–627, (2023)

Abstract

One critical bottleneck that impedes the development and deployment of autonomous vehicles is the prohibitively high economic and time costs required to validate their safety in a naturalistic driving environment, owing to the rarity of safety-critical events¹. Here we report the development of an intelligent testing environment, where artificial-intelligence-based background agents are trained to validate the safety performances of autonomous vehicles in an accelerated mode, without loss of unbiasedness. From naturalistic driving data, the background agents learn what adversarial manoeuvre to execute through a dense deep-reinforcement-learning (D2RL) approach, in which Markov decision processes are edited by removing non-safety-critical states and reconnecting critical ones so that the information in the training data is densified. D2RL enables neural networks to learn from densified information with safety-critical events and achieves tasks that are intractable for traditional deep-reinforcement-learning approaches. We demonstrate the effectiveness of our approach by testing a highly automated vehicle in both highway and urban test tracks with an augmented-reality environment, combining simulated background vehicles with physical road infrastructure and a real autonomous test vehicle. Our results show that the D2RL-trained agents can accelerate the evaluation process by multiple orders of magnitude (10³ to 10⁵ times faster). In addition, D2RL will enable accelerated testing and training with other safety-critical autonomous systems.


Main

Owing to the rapid development of autonomous vehicle (AV) technologies, we are on the cusp of a revolution in transportation on a scale not seen since the introduction of automobiles a century ago. AV technologies have the potential to substantially improve transportation safety, mobility and sustainability, and thus have attracted worldwide attention from industries, government agencies, professional organizations and academic institutions. Over the past 20 years, substantial progress has been made on the development of AVs, particularly with the emergence of deep learning². By 2015, several companies had announced that they would be mass-producing AVs before 2020^{3,4,5}. So far, the reality has not lived up to these expectations, and no level 4 (ref. ⁶) AVs are commercially available. The reasons for this are numerous. But above all, the safety performance of AVs is still substantially below that of human drivers. For average drivers in the United States, the occurrence probability of a crash is around 1.9 × 10⁻⁶ per mile in the naturalistic driving environment (NDE)¹. In contrast, the disengagement rate for the state-of-the-art AV is around 2.0 × 10⁻⁵ per mile, according to the 2021 Disengagement Reports from California⁷. Although the disengagement rate is criticized for its potential biasedness, it has been widely used to track the trend of AV safety performance^{8,9}, as it is arguably the only statistic that is available to the public for the comparison of different AVs.

One critical bottleneck to improving AV safety performance is the severe inefficiency of safety validation. Prevailing approaches usually test AVs in the NDE through a combination of software simulation, closed test track and on-road testing. However, to validate the safety performance of AVs at the level of human drivers, it is well known that hundreds of millions, and sometimes hundreds of billions, of miles would need to be tested in the NDE¹. Owing to this severe inefficiency, AV developers must pay substantial economic and time costs to evaluate each development, which has hindered the progress of AV deployment. To improve the testing efficiency, many approaches test AVs in purposely generated scenarios that are more safety critical^{10,11}. Yet, existing scenario-based approaches^{12,13,14,15,16,17} can mainly be applied to short scenario segments with limited background road users (see Supplementary Information for more discussions).

Validating the safety performance of AVs in the NDE is in essence a rare-event estimation problem in a high-dimensional space. The main challenge is caused by the compounding effects of the ‘curse of rarity’ in addition to the ‘curse of dimensionality’ (Fig. 1a). By ‘curse of dimensionality’, we mean that driving environments could be spatiotemporally complex, and the variables needed to define such environments are high-dimensional. As the volume of the variable space grows exponentially with dimensionality, the computational complexity also grows exponentially¹⁸. By ‘curse of rarity’, we mean that safety-critical events occur with very low probability, that is, most points of the variable space are non-safety-critical and provide no or noisy information for training. Under this circumstance, it is hard for a deep-learning model to learn even given a large amount of data, as valuable information (for example, policy gradient) of safety-critical events could be buried under the large amount of non-safety-critical data. Recent decades have seen rapid progress in the ability of artificial intelligence (AI) systems to solve problems with the curse of dimensionality¹⁹; for example, the board game Go has a state space of 10³⁶⁰ (ref. ²⁰) and semiconductor chip design may have a state space on the order of 10^{2,500} (ref. ²¹). Before this work, however, solving the curse of dimensionality and the curse of rarity simultaneously had remained an open question, which has impeded the applicability of AI techniques in safety-critical systems, such as AVs, medical robots and aerospace systems²².

**Fig. 1: Validating safety-critical AI with the dense-learning approach.**

We address this challenge by developing a dense deep-reinforcement-learning (D2RL) approach. The basic idea is to identify and remove the non-safety-critical data and train neural networks utilizing the safety-critical data. As only a very small portion of data is safety critical, the information of the remaining data will be substantially densified. Essentially, the D2RL approach edits the Markov decision process by removing the uncritical states and reconnecting the critical states, and then trains neural networks for only the edited Markov process (Fig. 1b). Therefore, for any training episode, the reward from the end state is backpropagated along the edited Markov chain with critical states only (Fig. 1c). Compared with the DRL approach, the D2RL approach can reduce the variance of the policy gradient estimation by multiple orders of magnitude without loss of unbiasedness, as proved in Theorem 1 in Methods. Such substantial variance reduction can enable neural networks to learn and achieve tasks that are intractable for the DRL approach. For AV testing, we leverage the D2RL approach and train the background vehicles (BVs) through a neural network to learn when to execute what adversarial manoeuvre, which aims to improve the testing efficiency and ensure evaluation unbiasedness. This results in an AI-based adversarial testing environment that can reduce the required testing miles of AVs by multiple orders of magnitude while ensuring the testing unbiasedness. Our approach can be applied to complex driving environments, including multiple highways, intersections and roundabouts, which cannot be achieved by previous scenario-based approaches. The proposed approach empowers the testing agents in the environment with intelligence to create an intelligent testing environment, that is, using AI to validate AI. This is a paradigm shift and it opens the door for accelerated testing and training with other safety-critical systems.

To demonstrate the effectiveness of our AI-based testing approach, we trained the BVs with large-scale naturalistic driving datasets and conducted simulation experiments as well as field experiments in physical test tracks. Specifically, we tested a level 4 AV with an open-source automated driving system, Autoware²³, in the physical 4-km-long highway test track at the American Center for Mobility (ACM) and the urban test track at Mcity. To test the AV with the D2RL-trained testing environment safely and precisely, we developed an augmented-reality testing platform²⁴, which combines the physical test track and a microscopic traffic simulator, SUMO (Simulation of Urban Mobility)²⁵. As shown in Fig. 1d, by synchronizing the movements of the real AV and virtual BVs, the real AV in the physical test track can interact with the virtual BVs as though it is in a realistic traffic environment, where the BVs are directed to interact with the real AV. For both simulation and field experiments, we evaluated not only crash rates but also crash types and crash severities. Our simulation and field-testing results show that the D2RL approach can effectively learn the intelligent testing environment, which can substantially accelerate the evaluation process of AVs by multiple orders of magnitude (10³ to 10⁵ times faster) unbiasedly, compared with the results from testing AVs directly in the NDE.

Dense deep reinforcement learning

To leverage AI techniques, we formulate the AV testing problem as a sequential Markov decision process (MDP), where manoeuvres of BVs are decided based on the current state information. We aim to train a policy (a DRL agent) modelled by a neural network, which can control the manoeuvres of BVs to interact with the AV, to maximize the evaluation efficiency and ensure unbiasedness. However, as mentioned earlier, it is hard—or even empirically infeasible—to learn an effective policy if directly applying DRL approaches because of the curse of dimensionality and the curse of rarity.

We address this challenge by developing the D2RL approach. Owing to the rarity of safety-critical events, most states are uncritical and cannot provide information for safety-critical events, so the key concept of D2RL is to remove the data of these uncritical states and utilize only the informative data for training the neural network (Fig. 1b,c). For AV testing problems, many safety metrics²⁶ can be utilized to identify the critical states with different efficiency and effectiveness. In this study, we utilize the criticality measure^{12,13}, which is an outer approximation of the AV crash rate within a specific time horizon (for example, one second) from the current state. Theoretical analysis for more generic problems can be found in Methods and Supplementary Section 2a. We then edit the Markov process, discard the data of uncritical states, and use the remaining data for the policy gradient estimation and bootstrapping of the DRL training. We find that dense learning can reduce the variance of the policy-gradient estimation by multiple orders of magnitude without loss of estimation unbiasedness, as proved in Theorem 1 in Methods. Dense learning can also reduce the bootstrapping variance, as it can be regarded as a state-dependent temporal-difference learning²⁷, where only critical states are utilized and others are skipped.
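The data-filtering step described above can be illustrated with a minimal sketch. It assumes each recorded step already carries a criticality value; the `Step` record, the `densify` helpers and the zero threshold are hypothetical names and choices for illustration, and the criticality measure itself is not reproduced here.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    state: tuple         # placeholder for AV and BV positions and speeds
    action: int          # index of the BV manoeuvre taken at this step
    criticality: float   # hypothetical precomputed criticality of the state


def densify(episode: List[Step], threshold: float = 0.0) -> List[Step]:
    """Keep only the steps whose state is critical (criticality above threshold)."""
    return [step for step in episode if step.criticality > threshold]


def densify_batch(episodes: List[List[Step]], threshold: float = 0.0) -> List[List[Step]]:
    """Filter every episode and drop episodes that contain no critical step."""
    kept = [densify(episode, threshold) for episode in episodes]
    return [episode for episode in kept if episode]


# Toy data: the first episode never becomes critical, the second does at its last two steps.
episodes = [
    [Step((0,), 1, 0.0), Step((1,), 2, 0.0)],
    [Step((0,), 1, 0.0), Step((1,), 3, 0.4), Step((2,), 0, 0.9)],
]
dense = densify_batch(episodes)
```

In this toy batch, one complete episode and three of the five steps are discarded, so the policy update would see only the two critical steps of the second episode.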

To demonstrate the effectiveness of dense learning, we compared D2RL with the DRL approach for a corner-case-generation problem^{28,29}, which can be formulated as a well defined reinforcement-learning problem. A neural network was trained to maximize the AV’s crash rate by controlling the closest eight BVs’ actions (Fig. 2a). We used proximal policy optimization (PPO)³⁰ to update the parameters of the policy network, given the reward for each testing episode, that is, +20 for an AV crash and 0 for others. For a fair comparison, the only difference between DRL and D2RL is that DRL utilized all the data for training the neural network, whereas D2RL utilized only the data of critical states. As shown in Fig. 2b, compared with DRL, D2RL removed the data of 80.5% of complete episodes and 99.3% of steps, all of which came from uncritical states. According to Theorem 1, this indicates that D2RL can reduce around 99.3% of the policy-gradient-estimation variance, which enables the neural network to learn effectively. Specifically, D2RL can maximize the reward during the training process, whereas DRL was stuck from the beginning of the training process (Fig. 2c). The policy learned by D2RL can effectively increase the crash rate of the AV, whereas DRL failed to do so (Fig. 2d). Figure 2e–g illustrates three generated corner cases.

**Fig. 2: Comparison of D2RL with DRL using the corner-case-generation examples.**

Learning the intelligent testing environment

Learning the intelligent testing environment for unbiased and efficient AV evaluation is much more complex than corner-case generation. According to the importance sampling theory³¹, the goal is essentially to learn new sampling distributions, that is, the importance function, of BVs’ manoeuvres to replace their naturalistic ones, with the aim of minimizing the estimation variance of AV testing. Intuitively, the BVs are trained to learn when to execute what adversarial manoeuvre, in that all BVs follow naturalistic behaviours, only selected vehicles at selected moments execute specifically designed adversarial moves with a learned probability. To achieve this goal, without using any heuristics or handcrafted functions, we derive the reward function from the estimation variance as

$$r({\bf{x}})=-\,{{\mathbb{I}}}_{A}({\bf{x}})\times {W}_{{q}_{\pi }}({\bf{x}})\times {W}_{{q}_{{\pi }_{{\rm{b}}}}}({\bf{x}}),$$

(1)

where x denotes the variables of each testing episode, ${{\mathbb{I}}}_{A}({\bf{x}})$ is an indicator function of the AV crash event (A), and ${W}_{{q}_{\pi }}({\bf{x}})=P({\bf{x}})/{q}_{\pi }({\bf{x}})$ and ${W}_{{q}_{{\pi }_{{\rm{b}}}}}({\bf{x}})=P({\bf{x}})/{q}_{{\pi }_{{\rm{b}}}}({\bf{x}})$ are weights (or likelihoods) produced by importance sampling. Here P(x) denotes the naturalistic distribution, q_π(x) denotes the importance function with the target policy π, and ${q}_{{\pi }_{{\rm{b}}}}({\bf{x}})$ denotes the importance function with the behaviour policy π_b. As there is no heuristic or handcrafted immediate reward function, the reward function in equation (1) is highly consistent with the testing performance, that is, a higher reward indicates a more efficient testing environment. Such reward design is generic and applicable to other rare-event estimation problems with high-dimensional variables.
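As a hedged illustration of equation (1), the sketch below evaluates the episode reward from per-step manoeuvre probabilities, treating each weight as a product of per-step ratios between the naturalistic and the sampling probabilities. The function names and the toy probabilities are assumptions for illustration only; the rescaling and clipping described in Methods are omitted.

```python
import math
from typing import Sequence


def log_weight(p_nat: Sequence[float], q: Sequence[float]) -> float:
    """Log of W(x) = prod_k P(u(k)|s(k)) / q(u(k)|s(k)), summed in log space."""
    return sum(math.log(p) - math.log(qk) for p, qk in zip(p_nat, q))


def episode_reward(crashed: bool,
                   p_nat: Sequence[float],
                   q_target: Sequence[float],
                   q_behaviour: Sequence[float]) -> float:
    """r(x) = -I_A(x) * W_{q_pi}(x) * W_{q_pi_b}(x); zero when no crash occurs."""
    if not crashed:
        return 0.0
    w_target = math.exp(log_weight(p_nat, q_target))
    w_behaviour = math.exp(log_weight(p_nat, q_behaviour))
    return -w_target * w_behaviour


# Toy two-step episode: only the first step deviates from the naturalistic distribution.
reward = episode_reward(True, p_nat=[0.1, 0.5], q_target=[0.5, 0.5], q_behaviour=[0.2, 0.5])
```

Here the target weight is 0.1/0.5 = 0.2 and the behaviour weight is 0.1/0.2 = 0.5, so the reward is −0.1. Summing logarithms rather than multiplying raw ratios is a design choice to limit underflow over long episodes.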

To determine the learning mechanism, we further investigate the relationship between the behaviour policy π_b and target policy π. As proved in Theorem 2 in Methods, we find that the optimal behaviour policy ${\pi }_{{\rm{b}}}^{* }$ that collects data during the training process is nearly inversely proportional to the target policy. This indicates that, under on-policy learning mechanisms (${q}_{{\pi }_{{\rm{b}}}}={q}_{\pi }$), the behaviour policy would be far from optimal, which could mislead the training process and eventually cause the crash rate to be underestimated. To address this issue, we design an off-policy learning mechanism, where a generic behaviour policy is designed and kept unchanged during the training process. Although this generic behaviour policy is not the optimal one given by Theorem 2 (which is usually unavailable in practice), it can balance exploration and exploitation and is empirically effective for all experiment settings in this study. With the reward function and off-policy learning mechanism, we can learn the intelligent testing environment by the D2RL approach (see Methods for training details).

AV testing in simulation

We evaluated the effectiveness of the D2RL-based intelligent testing environment regarding accuracy, efficiency, scalability and generalizability by systematic simulation analysis. To measure the safety performance of AVs, crash rates of different crash types and severities in the NDE are utilized as the benchmark. As the NDE is generated completely based on naturalistic driving data, testing results in the NDE can represent the safety performance of AVs in the real world. For each test episode, we simulated AV driving in traffic for a fixed distance, and then the test results were recorded and analysed. To investigate the scalability and generalizability, we conducted simulation experiments with different road geometries, different driving distances and two different types of AV model (that is, the AV-I and AV-II models; see Supplementary Section 3d).

Figure 3 shows the results of the two-lane highway environment with the 400-m driving distance for the AV-I model, which is a basic experiment to validate our approach. As shown in Fig. 3a, during the training process, the estimation variance of the intelligent testing environment decreases as the reward increases, which demonstrates the effectiveness of the reward function in equation (1). To justify the off-policy mechanism, we investigated the performance of the on-policy mechanism, where the target policy was utilized as the behaviour policy. As shown in Fig. 3b, during the training process, the crash rate for the on-policy experiments substantially increases, whereas the crash rate for the off-policy experiments is unchanged because the behaviour policy is unchanged. However, as the on-policy mechanism breaks the consistency between the reward function and estimation variance, this increase of the crash rate would be misleading. As shown in Fig. 3c, the testing environment obtained by the on-policy mechanism underestimates the crash rate. In contrast, our off-policy approach can obtain the same crash rate as the NDE approach, but more efficiently (Fig. 3d,e). To measure the efficiency, we calculated the minimum number of tests for reaching a predetermined precision threshold (the relative half-width^{12,17} is 0.3). To reduce the randomness of the results for a fair comparison, we repeated the testing of our approach by bootstrap sampling and obtained the frequency and average of the required number of tests (Fig. 3f). Compared with the NDE approach, which required 1.9 × 10⁸ tests, our approach required an average of 9.1 × 10⁴ tests, which is 2.1 × 10³ times faster. To investigate the generalizability, we further tested the AV-II model using the same intelligent testing environment without any refinement, which can also obtain an accurate estimation about 10³ times faster (see Supplementary Section 4d).
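To make the precision threshold concrete, the following sketch computes how many independent tests a plain Monte Carlo (NDE) estimate of a crash rate needs before its relative half-width falls to a target value. It assumes a normal approximation for a Bernoulli crash indicator and a 90% confidence level; both are our assumptions, as the confidence level is not stated in this section, and `required_tests` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def required_tests(crash_rate: float,
                   rel_half_width: float = 0.3,
                   confidence: float = 0.9) -> int:
    """Smallest n such that z * sqrt((1 - p) / (n * p)) <= rel_half_width."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided normal quantile
    n = z ** 2 * (1.0 - crash_rate) / (crash_rate * rel_half_width ** 2)
    return math.ceil(n)


# Rarer events need proportionally more tests for the same relative precision.
n_common = required_tests(1e-3)
n_rare = required_tests(1e-6)

# Speed-up implied by the two test counts reported in the text.
speedup = 1.9e8 / 9.1e4
```

Under these assumptions, a crash rate of 1.58 × 10⁻⁷ (the NDE value quoted later in this section) gives roughly 1.9 × 10⁸ tests, and the ratio of the two reported test counts is about 2.1 × 10³.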

**Fig. 3: Performance evaluation of the D2RL-based intelligent testing environment.**

To validate the unbiasedness with respect to crash types, crash severities and near-miss events, we analysed the crash rates of different crash types, distribution of the speed difference at the crash moment, and distributions of the time to collision, bumper-to-bumper distance and post-encroachment time of near-miss events. Throughout the paper, our use of the term unbiasedness refers to the fact that estimations from our approach have the same mathematical expectations as those from the NDE. In our experiments, we collected about 2.34 × 10⁸ episodes of tests in the NDE and 3.15 × 10⁶ (about two orders of magnitude fewer) episodes of tests in the intelligent testing environment. As the intelligent testing environment is more adversarial than the NDE, the total crash rate in our approach is 3.21 × 10⁻³ (Fig. 3g), which is much higher than that (1.58 × 10⁻⁷) in the NDE. As required by the importance sampling theory, each crash event should be weighted by the likelihood ratio to keep the unbiasedness. Therefore, the weighted crash rates for all crash types are compared with the results in the NDE (Fig. 3h), which demonstrates the unbiasedness of our approach within the evaluation precision. Similarly, Fig. 3i–l demonstrates that our approach can also unbiasedly evaluate the AV’s safety performance regarding crash severities and near-miss events within the evaluation precision. As near-miss events are critical for the development of AVs, the generated near-miss events without loss of unbiasedness open the door for accelerating the AV training. We leave that for future study.

To further investigate the scalability and generalizability, we conducted the experiments with different numbers of lanes (two and three lanes) and driving distances (400 m, 2 km, 4 km and 25 km) for the AV-I model. Here we studied the 25-km case to demonstrate the effectiveness of our approach over full-length trips, because the average commuter travels approximately 25 km one way in the United States. As shown in Table 1, because of the skipped episodes and steps that substantially reduce the training variance, our approach can effectively learn the intelligent testing environment for all the experiments.

Table 1 Performance evaluation with different highway simulation environments


Furthermore, to demonstrate the advance of our approach in realistic urban scenarios, we extended our simulation experiments to a real-world four-armed roundabout³² in Germany with a high traffic volume and complex interactions. Compared with the NDE testing approach, which requires about 8.91 × 10⁶ tests to reach the 30% relative half-width, our approach requires only 3.76 × 10³ tests, which is 2.37 × 10³ times faster. See Supplementary Video 2 and Supplementary Section 4b for more details.

AV testing in test tracks

Finally, we tested a Lincoln MKZ hybrid equipped with the open-source automated driving system, Autoware²³ (Fig. 4a), driving continuously in the physical multi-lane 4-km highway test track at the ACM (Fig. 4b) and the physical urban test track at Mcity (Fig. 4c). We developed an augmented-reality testing platform²⁴, which combines the physical test track and a simulation environment, SUMO²⁵. As shown in Fig. 1d, by synchronizing the movements of the real AV and virtual BVs, the real AV in the physical test track can interact with the virtual BVs as though it is in a real traffic environment, where the BVs are controlled according to the intelligent testing environment. Figure 4d illustrates the real-time visualization of the testing process. We trained the intelligent testing environment in the digital twins of the ACM highway section and the Mcity urban section using similar training settings to the simulation studies (see Methods for details). As shown in Fig. 4e–h, the crash rate estimations in both the ACM and Mcity converge and reach the 30% relative half-width after about 156 tests at the ACM and 117 tests at Mcity, which are on the order of 10⁵ times faster than those (2.5 × 10⁷ at the ACM and 2.1 × 10⁷ at Mcity) of the NDE testing approach. We also evaluated the AV’s safety performance for different crash types and severities (Fig. 4i,j).

**Fig. 4: Testing experiments for a real-world AV at physical test tracks.**

Discussion

Our results present evidence of using D2RL techniques to validate the safety performance of AVs regarding their behavioural competency³³. D2RL can accelerate the testing process and can be used for both simulation testing and test-track methods. It can substantially enhance existing testing approaches (falsification methods, scenario-based methods and NDE methods) to overcome their limitations in real-world applications. D2RL also opens the door for leveraging AI techniques to validate machine intelligence of other safety-critical autonomous systems, such as medical robots and aerospace systems.

Ideally, the testing environment should consider all operating conditions of AVs and their associated rare events. For example, a six-layer model³⁴ has been developed to structure the parameters of scenarios, including road geometry, road furniture and rules, temporal modifications and events, moving objects, environmental conditions, and digital information. In this study, we mainly focus on two layers: moving objects and road geometry, that is, multiple surrounding vehicles undertaking manoeuvres on roads of varying geometry, which are critical for the testing environment. Our approach could be extended to include parameters from other layers, such as weather conditions, by collecting large-scale naturalistic data and utilizing domain knowledge of those fields.

We note that increasing attention has also been paid to formal methods to address the challenges raised by AI systems (see refs. ^{35,36} and references therein). Formal methods provide a mathematical framework for rigorous system specification, design and verification³⁷, which are critical for trustworthy AI. However, as discussed in ref. ³⁶, multiple major challenges need to be addressed to realize their full potential. D2RL can potentially be integrated with formal methods. For example, reachability-based methods³⁸ could be incorporated into the calculation of the criticality measure to identify the critical states, particularly for generic safety-critical autonomous systems. How to integrate D2RL with formal methods deserves further investigation.

Methods

Description of the AV safety validation problem

This section describes the problem formulation of AV safety performance evaluation. Denote the variables of the driving environment as x = [s(0), u(0), u(1), ⋯, u(T )], where s(k) denotes the states (position and speed) of the AV and BVs at the kth time step, u(k) denotes the manoeuvres of BVs at the kth time step and T denotes the total time steps of this testing episode. With Markovian assumptions of the BVs’ manoeuvres, the probability of each testing episode in the NDE can be calculated as $P({\bf{x}})=P({\bf{s}}(0))\times {\prod }_{k=0}^{T}P({\bf{u}}(k)| {\bf{s}}(k))$, and then the AV crash rate can be measured by the Monte Carlo method³¹ as

$$P(A)={{\mathbb{E}}}_{{\bf{x}} \sim P({\bf{x}})}[P(A| {\bf{x}})]\approx \frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}P(A| {{\bf{x}}}_{i}),{{\bf{x}}}_{i} \sim P({\bf{x}}),$$

(2)

where A denotes the crash event, n denotes the total number of testing episodes, i = 1, ..., n denotes the ith testing episode, and x_i ∼ P(x) indicates that the variables are distributed as P(x). Here a crash is defined as a contact that the subject vehicle (for example, AV) has with an object, either moving or fixed, at any speed resulting in fatality, injury or property damage³⁹. As A is a rare event, obtaining a statistically reliable estimation requires a large number of tests (n), which leads to the severe inefficiency issue of the NDE testing approach, as pointed out in ref. ¹.

To address this inefficiency issue, the key is to generate an intelligent driving environment, where BVs can be controlled purposely to test the AV unbiasedly and efficiently. In essence, testing an AV in the intelligent driving environment is to estimate P(A) in equation (2) by the importance sampling method³¹ as

$$P(A)={{\mathbb{E}}}_{{\bf{x}} \sim q({\bf{x}})}[P(A| {\bf{x}})\times {W}_{q}({\bf{x}})]\approx \frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}P(A| {{\bf{x}}}_{i})\times {W}_{q}({{\bf{x}}}_{i}),{{\bf{x}}}_{i} \sim q({\bf{x}}),$$

(3)

where q(x) denotes the underlying distribution of BVs’ manoeuvres in the intelligent testing environment, and W_q(x) is the likelihood of each testing episode as

$${W}_{q}({\bf{x}})=\frac{P({\bf{x}})}{q({\bf{x}})}=\mathop{\prod }\limits_{k=0}^{T}\left[\frac{P({\bf{u}}(k)| {\bf{s}}(k))}{q({\bf{u}}(k)| {\bf{s}}(k))}\right].$$

(4)

According to the importance sampling theory³¹, the unbiasedness of the estimation in equation (3) can be guaranteed if q(x) > 0 for any x that P(A|x)P(x) > 0. To optimize the estimation efficiency, the importance function q(x) needs to minimize the estimation variance

$${\sigma }_{q}^{2}={{\mathbb{E}}}_{q}\left({P}^{2}(A{\rm{| }}{\bf{x}})\times {W}_{q}^{2}({\bf{x}})\right)-{P}^{2}(A).$$

(5)

Therefore, the generation of the intelligent testing environment is formulated as a sequential MDP problem of the BVs’ manoeuvres (that is, determine q(u(k)|s(k)) to minimize the estimation variance ${\sigma }_{q}^{2}$ in equation (5)). However, how to solve such a sequential MDP problem associated with rare events and high-dimensional variables remains a highly challenging problem, and most existing importance sampling-based methods suffer from the curse of dimensionality⁴⁰, where the estimation variance would increase exponentially with the dimensionality. In our previous study¹⁴, we found that the curse of dimensionality issue could be addressed theoretically by sparse adversarial control to the naturalistic distribution. However, only a model-based method with handcrafted heuristics was utilized for conducting the sparse adversarial control, which suffers from substantial spatiotemporal limitations. How to leverage AI techniques to train the BVs for truly learning the testing intelligence remains unsolved, and this is the focus of this paper. More details of related work can be found in Supplementary Section 1.
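The estimator in equation (3) and the variance in equation (5) can be illustrated on a deliberately small toy model, sketched below. It is not the driving environment of this study: a single binary manoeuvre replaces the whole episode, a crash indicator is sampled in place of P(A|x), and all probabilities and names are hypothetical values chosen for illustration.

```python
import random

P_ADV = 0.01             # naturalistic probability of the adversarial manoeuvre (toy value)
Q_ADV = 0.5              # its probability under the importance function q (toy value)
P_CRASH_GIVEN_ADV = 0.2  # crash probability after that manoeuvre; no crash otherwise

TRUE_RATE = P_ADV * P_CRASH_GIVEN_ADV  # exact P(A) of the toy model


def importance_sampling_estimate(n: int, seed: int = 0) -> float:
    """Average of I_A(x) * W_q(x) over n episodes sampled from q."""
    rng = random.Random(seed)
    weight = P_ADV / Q_ADV  # likelihood ratio of an episode that contains the manoeuvre
    total = 0.0
    for _ in range(n):
        if rng.random() < Q_ADV and rng.random() < P_CRASH_GIVEN_ADV:
            total += weight
    return total / n


def per_sample_variance_nde() -> float:
    """Variance of the crash indicator when sampling from the naturalistic distribution."""
    return TRUE_RATE * (1.0 - TRUE_RATE)


def per_sample_variance_is() -> float:
    """E_q[(I_A * W_q)^2] - P(A)^2 for the sampled-indicator estimator."""
    second_moment = Q_ADV * P_CRASH_GIVEN_ADV * (P_ADV / Q_ADV) ** 2
    return second_moment - TRUE_RATE ** 2


estimate = importance_sampling_estimate(20000)
```

In this toy model the weighted estimate targets the same crash rate as naturalistic sampling, while the per-sample variance drops from about 2.0 × 10⁻³ to 3.6 × 10⁻⁵, which is the effect a well chosen q(x) is meant to achieve.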

Formulation as a deep-reinforcement-learning problem

This section describes how to generate the intelligent testing environment as a DRL problem. As mentioned above, the goal is to minimize the estimation variance in equation (5) by training a policy π modelled by a neural network θ that can control BVs’ manoeuvres with the underlying distribution q_π(u|s). To keep the notation simple, we leave it implicit in all cases that π is a function of θ. An MDP usually consists of four key elements: state, action, state transition and reward. In this study, states encode information (position and speed) about the AV and surrounding BVs, actions include 31 discrete longitudinal accelerations ([−4, 2] with 0.2 m s⁻² discrete resolution), left lane change and right lane change, and state transitions define the probability distribution over next states that are also dependent on the AV manoeuvre. Here we assumed that a lane-change manoeuvre of BVs would be initiated from its current position and completed in one second if a lane-change action was decided. Our framework is also applicable to more realistic and complex action settings.
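The discrete action set described above can be enumerated directly, as in the short sketch below. The labels for the two lane-change actions are hypothetical; only the acceleration range, its resolution and the action counts come from the text.

```python
# 31 longitudinal accelerations from -4.0 to 2.0 m/s^2 in steps of 0.2 m/s^2.
ACCELERATIONS = [round(-4.0 + 0.2 * i, 1) for i in range(31)]

# Two lateral manoeuvres; the string labels are placeholders for illustration.
LANE_CHANGES = ["left_lane_change", "right_lane_change"]

# Full per-vehicle action set: 31 accelerations plus 2 lane changes.
ACTIONS = ACCELERATIONS + LANE_CHANGES
```

Enumerating the set this way confirms that the 31 accelerations and two lane changes together give 33 actions per vehicle.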

For the corner-case-generation case study, we studied a three-lane highway driving environment, where eight critical BVs (that is, principal other vehicles or POVs) are controlled to interact with the AV for a certain distance (400 m) and each BV chooses among the 33 discrete actions every 0.1 s. For the intelligent-testing-environment generation case study, to keep the runtime of the DRL small, we simplified the output of the neural network as the adversarial manoeuvre probability (ε_π ∈ (0, 1)) of the most critical POV, whereas the POV’s other manoeuvres are normalized by 1 − ε_π according to the naturalistic distribution and other BVs’ manoeuvres keep following the naturalistic distribution. The adversarial manoeuvre and POV are determined by the criticality measure. We note that the generalization of this work to multiple POVs is straightforward.
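One way to read the simplified policy output is sketched below: the adversarial manoeuvre receives probability ε, and the remaining probability 1 − ε is spread over the other manoeuvres in proportion to their naturalistic probabilities. This proportional renormalization is our interpretation of "normalized by 1 − ε_π", and the manoeuvre names and probabilities are hypothetical.

```python
from typing import Dict


def mix_policy(p_nat: Dict[str, float], adversarial: str, eps: float) -> Dict[str, float]:
    """Sampling distribution q(u|s) for the most critical POV at one step."""
    rest = 1.0 - p_nat[adversarial]  # naturalistic mass of all other manoeuvres
    q = {}
    for manoeuvre, p in p_nat.items():
        if manoeuvre == adversarial:
            q[manoeuvre] = eps
        else:
            q[manoeuvre] = (1.0 - eps) * p / rest
    return q


def step_weight(p_nat: Dict[str, float], q: Dict[str, float], chosen: str) -> float:
    """Per-step likelihood ratio P(u|s) / q(u|s) for the manoeuvre actually sampled."""
    return p_nat[chosen] / q[chosen]


# Toy naturalistic distribution over three manoeuvres of the POV.
p_nat = {"keep_lane": 0.90, "brake": 0.09, "cut_in": 0.01}
q = mix_policy(p_nat, adversarial="cut_in", eps=0.5)
w_adv = step_weight(p_nat, q, "cut_in")
```

With these toy numbers the cut-in is sampled 50 times more often than in naturalistic driving and carries a per-step weight of 0.02, so the product of such weights over an episode keeps the estimate unbiased.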

The reward function design is critical for the DRL problem⁴¹. As the goal of the intelligent testing environment is to minimize the estimation variance in equation (5), we derived the objective function of the DRL problem as

$$\mathop{\min }\limits_{q}{\sigma }_{q}^{2}=\mathop{\max }\limits_{\pi }\left\{-{{\mathbb{E}}}_{{{q}_{\pi }}_{{\rm{b}}}}\left({{\mathbb{I}}}_{A}({\bf{x}})\times {W}_{{q}_{\pi }}({\bf{x}})\times {W}_{{{q}_{\pi }}_{{\rm{b}}}}({\bf{x}})\right)\right\},$$

(6)

where ${{\mathbb{I}}}_{A}$ is the indicator function of the crash event and π_b denotes the behaviour policy of the DRL. During the training process, the training data are collected by the behaviour policy, which is a Monte Carlo estimation of the expectation in equation (6), so we can obtain the reward function as

$$r({\bf{x}})=-\,{{\mathbb{I}}}_{A}({\bf{x}})\times {W}_{{q}_{\pi }}({\bf{x}})\times {W}_{{{q}_{\pi }}_{{\rm{b}}}}({\bf{x}}),$$

(7)

which is theoretically consistent with the objective function. As it is mainly based on the importance sampling theory, the reward function is also applicable to other rare-event estimation problems with high-dimensional variables. To limit the scale of the error derivatives⁴², we rescaled and clipped the function, resulting in the reward function that belongs to [−100, 100], where the scaling constants could be automatically determined during the learning process.
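
A minimal sketch of the reward in equation (7) is given below. It assumes, as in standard importance sampling, that each weight W_q(x) is the likelihood ratio between the naturalistic distribution and the sampling distribution q accumulated over the steps of an episode; the function names and the fixed `scale` argument are illustrative assumptions (in the study, the scaling constants are determined during learning).

```python
import numpy as np


def likelihood_ratio(p_nat_steps, q_steps):
    """Product over time steps of naturalistic / sampling probability."""
    ratios = np.asarray(p_nat_steps, dtype=float) / np.asarray(q_steps, dtype=float)
    return float(np.prod(ratios))


def reward(crashed, w_target, w_behaviour, scale=1.0, clip=100.0):
    """Equation (7), rescaled by `scale` and clipped to [-clip, clip]."""
    r = -float(crashed) * w_target * w_behaviour
    return float(np.clip(scale * r, -clip, clip))
```

Episodes without a crash receive a zero reward, which is why the learning signal is sparse before dense learning is applied.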

With the state, action, state transition and reward function, the intelligent-testing-environment generation problem becomes a DRL problem. However, applying learning-based techniques to safety-critical systems is highly challenging because of the curse of rarity: informative data are rare, so the gradient estimation of neural networks suffers from a large variance. It is hard, or even empirically infeasible, to learn an effective policy by directly applying DRL approaches.

Dense deep reinforcement learning

To address this challenge, we propose the D2RL approach in this paper. Specifically, according to the policy gradient theorem²⁷, the policy gradient of the objective function for DRL approaches can be estimated as

$$\nabla \hat{J}(\theta )={\hat{q}}_{\pi }({S}_{t},{A}_{t}\,)\frac{\nabla \pi ({A}_{t}\,|{S}_{t},\theta )}{\pi ({A}_{t}\,|{S}_{t},\theta )},$$

(8)

where θ denotes the parameters of the policy, q_π(S_t, A_t) denotes the state–action value, S_t and A_t are samples of the state and action under the policy at time t, and ${\hat{q}}_{\pi }({S}_{t},{A}_{t})$ is an unbiased estimation of q_π(S_t, A_t), that is, ${{\mathbb{E}}}_{\pi }\left[{\hat{q}}_{\pi }({S}_{t},{A}_{t})\right]={q}_{\pi }({S}_{t},{A}_{t})$. In contrast, for the D2RL approach, we propose to estimate the policy gradient as

$${{\rm{\nabla }}}_{{\rm{d}}{\rm{e}}{\rm{n}}{\rm{s}}{\rm{e}}}\,\hat{J}(\theta )={\hat{q}}_{\pi }({S}_{t},{A}_{t})\frac{{\rm{\nabla }}\pi ({A}_{t}\,|{S}_{t},\theta )}{\pi ({A}_{t}\,|{S}_{t},\theta )}{{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}},$$

(9)

where ${{\mathbb{S}}}_{{\rm{c}}}$ denotes the set of critical states and ${{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}}$ denotes the indicator function.

Here, a state is defined as an uncritical state if ${v}_{\pi }\left(s\right)={q}_{\pi }\left(s,a\right),\forall \,a$, where s denotes the state, a denotes the action and ${v}_{\pi }(s)\mathop{=}\limits^{{\rm{d}}{\rm{e}}{\rm{f}}}{{\mathbb{E}}}_{\pi }({q}_{\pi }(s,a))$ denotes the state value, so the set of critical states can be defined as ${{\mathbb{S}}}_{{\rm{c}}}\mathop{=}\limits^{{\rm{d}}{\rm{e}}{\rm{f}}}\{s|{v}_{\pi }(s)\ne {q}_{\pi }(s,a),{\rm{\exists }}\,a\}$. In other words, a state is uncritical if no action (for example, BVs’ manoeuvres) taken from the current state affects the expected value of the state (for example, AV’s crash probability within a specific time horizon from the current state). We note that this definition is primarily for the theoretical analysis to be clean and is not strictly required to run the algorithm in practice. For example, a state can be practically identified as uncritical if the current action will not substantially affect the expected value of the state. For specific applications, the critical states can be approximately identified based on domain-specific models or physics. For example, the criticality measure¹²,¹³, which is an outer approximation of the AV crash rate within a specific time horizon (for example, one second), is utilized in this study to demonstrate the approach for the AV testing problem. We note that many other safety metrics²⁶ could also be applicable, such as the model predictive instantaneous safety metric⁴³ developed by the National Highway Traffic Safety Administration in the United States and the criticality metric⁴⁴ developed by the PEGASUS project in Germany, as long as the identified set of states covers the critical states. More theoretical analysis for a more general sense can be found in Supplementary Section 2a.
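
The practical relaxation of the definition can be sketched as a tolerance test on the state–action values. This is only an illustration under the assumption that approximate values q_π(s, a) are available for one state; the study itself uses the model-based criticality measure instead, and the name `is_critical` and the tolerance `tol` are hypothetical.

```python
import numpy as np


def is_critical(q_values, action_probs, tol=1e-3):
    """Return True if some action value differs from the state value by more than tol."""
    q_values = np.asarray(q_values, dtype=float)
    # State value v_pi(s) as the expectation of q_pi(s, a) under the policy.
    v = float(np.dot(action_probs, q_values))
    return bool(np.max(np.abs(q_values - v)) > tol)
```

With `tol = 0` this reduces to the exact definition of the critical set; a positive tolerance treats states where actions barely matter as uncritical.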

Then, we have the following theorem, and the proof can be found in Supplementary Information.

Theorem 1

The policy gradient estimator of D2RL has the following properties:

(1) ${{\mathbb{E}}}_{\pi }[{\nabla }_{{\rm{dense}}}\,\hat{J}(\theta )]={{\mathbb{E}}}_{\pi }[\nabla \hat{J}(\theta )]$;
(2) ${{\rm{Var}}}_{\pi }[{\nabla }_{{\rm{dense}}}\,\hat{J}(\theta )]\le {{\rm{Var}}}_{\pi }[\nabla \hat{J}(\theta )]$; and
(3) ${{\rm{Var}}}_{\pi }[{\nabla }_{{\rm{dense}}}\,\hat{J}(\theta )]\le {\rho }_{\pi }{{\rm{Var}}}_{\pi }[\nabla \hat{J}(\theta )]$, with the assumption

$${{\mathbb{E}}}_{\pi }[{\sigma }_{\pi }^{2}({S}_{t},{A}_{t}\,){{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}}]={{\mathbb{E}}}_{\pi }[{\sigma }_{\pi }^{2}({S}_{t},{A}_{t}\,)]{{\mathbb{E}}}_{\pi }[{{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}}],$$

(10)

where ${\rho }_{\pi }\mathop{=}\limits^{{\rm{d}}{\rm{e}}{\rm{f}}}{{\mathbb{E}}}_{\pi }({{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}})\in [0,1]$ is the proportion of critical states in all states under the policy π (for example, 1 − ρ_π denotes the proportion of steps skipped in Fig. 2b and Table 1), and ${\sigma }_{\pi }^{2}({S}_{t},{A}_{t})={\left({\hat{q}}_{\pi }({S}_{t},{A}_{t})\frac{\nabla \pi ({A}_{t}|{S}_{t},\theta )}{\pi ({A}_{t}|{S}_{t},\theta )}\right)}^{2}$.

Theorem 1 shows that the D2RL approach provides an unbiased and more efficient estimation of the policy gradient than the DRL approach. To quantify the variance reduction of dense learning, we introduce the assumption in equation (10), which assumes that ${\sigma }_{\pi }^{2}({S}_{t},{A}_{t})$ is independent of the indicator function ${{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}}$. As both the policy and the state–action values are randomly initialized, the values of ${\sigma }_{\pi }^{2}({S}_{t},{A}_{t})$ are quite similar for all different states, so the assumption is valid at the early stage of the training process. Such variance reduction will enable the D2RL approach to optimize the neural network, whereas the DRL approach would be stuck at the beginning of the training process.
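
Properties (1) and (2) can be checked exactly on a toy problem. The sketch below assumes a tabular softmax policy over two states and three actions, uses the true state–action values in place of the estimate, and enumerates all state–action pairs rather than sampling; all names and numbers are illustrative and do not come from the study.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def gradient_moments(state_probs, logits, q, mask):
    """Exact mean and summed per-coordinate variance of
    mask[s] * q[s, a] * grad log pi(a|s) for a tabular softmax policy."""
    n_s, n_a = logits.shape
    samples, weights = [], []
    for s in range(n_s):
        pi = softmax(logits[s])
        for a in range(n_a):
            score = np.zeros((n_s, n_a))
            score[s] = -pi
            score[s, a] += 1.0  # d log pi(a|s) / d logits[s]
            samples.append(mask[s] * q[s, a] * score.ravel())
            weights.append(state_probs[s] * pi[a])
    samples, weights = np.array(samples), np.array(weights)
    mean = weights @ samples
    variance = weights @ samples**2 - mean**2
    return mean, variance.sum()


state_probs = np.array([0.9, 0.1])   # state 0 is visited most of the time
logits = np.array([[0.3, -0.2, 0.1], [0.5, 0.0, -0.5]])
q = np.array([[0.01, 0.01, 0.01],    # uncritical: q(s, a) = v(s) for all a
              [0.00, 0.20, 0.90]])   # critical: the action matters
mean_full, var_full = gradient_moments(state_probs, logits, q, np.ones(2))
mean_dense, var_dense = gradient_moments(state_probs, logits, q, np.array([0.0, 1.0]))
```

In this toy case the two means coincide because the score function has zero expectation in the uncritical state, while the masked estimator drops that state’s contribution to the second moment. Property (3) is not checked here because the toy values do not satisfy the assumption in equation (10).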

We then consider the influence of dense learning on estimating ${\hat{q}}_{\pi }({S}_{t},{A}_{t})$ with bootstrapping, which can guide the information propagation in the state–action space. For example, the fixed-length advantage estimator (${\hat{A}}_{t}$) is commonly used for the PPO algorithm³⁰ as

$${\hat{A}}_{t}={\delta }_{t}+(\gamma \lambda ){\delta }_{t+1}+\cdots +{(\gamma \lambda )}^{L-t-1}{\delta }_{L-1},$$

(11)

where δ_t = r_t + γV(s_{t+1}) − V(s_t), V(s_t) is the state–value function, γ denotes the discount rate, and L denotes the fixed length. For safety-critical applications, the immediate reward is usually zero (that is, r_t = 0), and most state–value functions are determined by initial random values without any valuable information because of the rarity of events. Bootstrapping with such noisy state–value functions will not be effective in the learning process. By editing the Markov chain, only the critical states will be considered. Then, the advantage estimator will be essentially modified as

$${\bar{A}}_{t}={\delta }_{z\left(t,0\right)}+(\gamma \lambda ){\delta }_{z\left(t,1\right)}+\cdots +{(\gamma \lambda )}^{L-1}{\delta }_{z\left(t,L-1\right)},$$

(12)

where ${\delta }_{z(t,j)}={r}_{z(t,j)}+\gamma V({s}_{z(t,j+1)})-V({s}_{z(t,j)})$, j is a natural number, and z is a function such that z(t, 0) = t and $z(t,j)=\min \{i\,|\,{s}_{i}\in {{\mathbb{S}}}_{{\rm{c}}},\,i > z(t,j-1)\}$ for j > 0, where i is a natural number; that is, z(t, j) is the time index of the jth critical state visited after time t. In essence, it is a state-dependent temporal-difference learning, where only the values of critical states are utilized for bootstrapping. As the critical states have much higher probabilities of leading to safety-critical events, the reward information can be propagated to these critical state values more easily. Utilizing the values of these critical states, the bootstrapping can guide the information from the safety-critical events to the state–action space more efficiently. This mechanism can help avoid the interference of the large number of noisy data and focus the policy on learning the sparse but valuable information. Because of the above-mentioned variance reductions regarding the policy gradient estimation and bootstrapping, the D2RL approach substantially improves the learning effectiveness compared with the DRL approach, enabling the neural network to learn from the safety-critical events.
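
The edited advantage estimator can be sketched by applying a standard truncated generalized advantage estimator to the sub-trajectory of critical steps, so that each temporal-difference term bootstraps from the value of the next critical state. The function names are hypothetical, and the sketch assumes that rewards at skipped steps are zero, as is typical in this setting.

```python
import numpy as np


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Truncated advantage estimator; `values` has one more entry than
    `rewards` (the bootstrap value after the last step)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv


def dense_gae(rewards, values, critical, gamma=1.0, lam=0.95):
    """Apply the estimator to critical steps only, in the spirit of equation (12)."""
    idx = np.flatnonzero(critical)
    values = np.asarray(values, dtype=float)
    dense_rewards = np.asarray(rewards, dtype=float)[idx]
    # Values of critical states, followed by the final bootstrap value.
    dense_values = np.append(values[idx], values[-1])
    return idx, gae(dense_rewards, dense_values, gamma, lam)


rewards = [0.0, 0.0, 0.0, 0.0, -1.0]
values = [0.5, 0.1, -0.2, 0.3, -0.8, 0.0]
idx, adv = dense_gae(rewards, values, critical=[1, 0, 0, 1, 1], gamma=1.0, lam=1.0)
```

Here steps 1 and 2 are skipped, so the advantage at step 0 bootstraps from the value at step 3 rather than from the uninformative values in between.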

Densifying the information is a natural way to overcome the challenges caused by the rarity of events. In the field of deep neural networks, connecting different layers of neural networks more densely has been demonstrated to produce better training efficiency and efficacy, that is, DenseNet⁴⁵. Instead of connecting layers of neural networks, our approach densifies the information by connecting states more densely with safety-critical states, besides the natural connections provided by the state transitions. As safety-critical states have more connections with rare events, they contain more valuable information with less variance. By densifying the connections between safety-critical states with other states, we can better propagate the valuable information to the entire state space, which can substantially facilitate the learning process. This study proposed and demonstrated one specific realization of the dense-learning approach by approximately identifying uncritical states and connecting the remaining states directly. This can be further improved by more flexible and dense connections among safety-critical states and uncritical states. The connections can even be added in the form of curriculum learning⁴⁶, which can guide the information propagation gradually. The measures for identifying critical states can also be further improved by involving more advanced modelling techniques.

Off-policy learning mechanism

We justify the off-policy learning mechanism in this section. The goal of the behaviour policy π_b is to collect training data for improving the target policy π that can maximize the objective function in equation (6). To achieve this goal, it is critical to estimate the objective function accurately using the reward function in equation (7), which determines the calculation of the policy gradient. However, only episodes with crashes have non-zero rewards, so the objective function estimation suffers from a large variance, because of the rarity of crashes. Without an accurate estimation of the objective function, the training could be misled. According to the importance sampling theory, we have the following theorem, and the proof can be found in Supplementary Information.

Theorem 2

The optimal behaviour policy ${\pi }_{{\rm{b}}}^{* }$ that can minimize the estimation variance of the objective function has the following property:

$${q}_{{\pi }_{{\rm{b}}}^{* }}({\bf{x}})\propto \frac{{q}_{{\pi }^{* }}^{2}({\bf{x}})}{{q}_{\pi }({\bf{x}})},$$

(13)

where ${q}_{{\pi }^{* }}({\bf{x}})$ denotes the optimal importance sampling function that is unchanged during the training process, and the symbol ∝ means ‘proportional to’.

Theorem 2 shows that the optimal behaviour policy is nearly inversely proportional to the target policy, particularly at the beginning of the training process when q_π is far from ${q}_{{\pi }^{* }}$. If on-policy learning mechanisms are used (${q}_{{\pi }_{{\rm{b}}}}={q}_{\pi }$), the behaviour policy would be far from optimal, which could mislead the training process and eventually cause underestimation issues. For example, if a target policy misses an action that could lead to a likely crash, an on-policy learning mechanism will never find this missing crash. More importantly, the on-policy mechanism could drive the policy towards hiding the crashes that are difficult to evaluate, leading to a severe underestimation in the safety performance evaluation.
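
Equation (13) can be illustrated on a finite set of outcomes. The sketch below assumes that x takes three values, that q_π is strictly positive, and that both distributions are known; the function name and the numbers are hypothetical.

```python
import numpy as np


def optimal_behaviour_distribution(q_star, q_pi):
    """Normalize q_star**2 / q_pi over a finite set of outcomes (equation (13))."""
    q_star = np.asarray(q_star, dtype=float)
    q_pi = np.asarray(q_pi, dtype=float)
    unnormalized = q_star**2 / q_pi
    return unnormalized / unnormalized.sum()


q_star = np.array([0.7, 0.2, 0.1])
# A target policy far from the optimum puts little mass where q_star is large.
q_pi = np.array([0.1, 0.2, 0.7])
q_b = optimal_behaviour_distribution(q_star, q_pi)
```

The resulting behaviour distribution concentrates on the outcome that the target policy under-samples, and it reduces to q_star itself once the target policy has converged to q_star.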

We design an off-policy learning mechanism to address this issue, where a generic behaviour policy is designed and kept unchanged during the training process. Specifically, we determined a constant probability of the adversarial manoeuvre of the POV (that is, ${\varepsilon }_{{\pi }_{{\rm{b}}}}=0.01$) and conducted other manoeuvres with the total probability of 0.99 that were normalized according to the naturalistic distribution. This policy explores the state–action space using the naturalistic distribution most of the time and exploits the information of the model-based criticality measure that helps identify the POV and adversarial manoeuvre. We note that although the optimal behaviour policy needs to be adaptively determined based on the target policy, as indicated in Theorem 2, an off-policy learning mechanism can provide a sufficiently good foundation for effective learning in this study. The behaviour policy is also not sensitive to the constant of ${\varepsilon }_{{\pi }_{b}}$, and generally, a small value (for example, 0.1, 0.05, 0.01 and so on) that balances the exploration and exploitation would be effective in this study. Further improvement can be investigated in the future.

Simulation settings

NDE simulator

To simulate the NDE, we developed a simulation platform based on the open-source traffic simulator SUMO. The scheme of the platform can be found in Supplementary Information. We utilized both the C++ and TraCI interfaces to refine the SUMO simulator so that high-fidelity driving environments can be integrated. Specifically, we rewrote and recompiled the C++ code of SUMO to integrate the high-fidelity driving environments, including car-following and lane-changing behaviour models. Then, we utilized the TraCI interface to implement the intelligent testing environment, where at selected moments, selected vehicles would execute specific adversarial manoeuvres with a learned probability, following the policy obtained by the D2RL approach. We also synchronized the modified SUMO with the physical test tracks with respect to the information of BVs, AVs, traffic signals, high-definition maps and so on, through the TraCI interface. To provide a training environment for intelligent testing environments, we constructed a multi-lane highway driving environment and an urban driving environment, where all vehicles were controlled at 100-ms intervals.

Driving behaviour models in the NDE simulator

The default driving behaviour models of SUMO, which are simple and deterministic, cannot be utilized for safety testing and training of AVs because they are designed to be crash-free models. To address this issue, in this study, we constructed NDE models⁴⁷ to provide naturalistic behaviours of BVs according to the large-scale naturalistic driving datasets (NDDs) from the Safety Pilot Model Deployment programme⁴⁸ and the Integrated Vehicle-Based Safety System programme⁴⁹ at the University of Michigan, Ann Arbor. At each step of simulation, the NDE models can provide distributions of each BV’s manoeuvres, which are consistent with the NDD. Then, by sampling manoeuvres from the distributions, a testing environment that can evaluate the real-world safety performance can be generated. For the field testing at ACM and Mcity, although the intelligent testing environment can accelerate the AV testing from about 10⁷ loops of testing to only around 10⁴ loops (Table 1), this still represents a substantial level of effort for an academic research group. To demonstrate our approach more efficiently, we therefore simplified the NDE models. Specifically, we modified the Intelligent Driving Model (IDM)⁵⁰ and the Minimizing Overall Braking Induced by Lane change (MOBIL) model⁵¹ as stochastic models to construct the simplified NDE models. More details of the NDE models can be found in Supplementary Information.
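
To make the idea of a stochastic car-following model concrete, the sketch below combines the standard IDM acceleration formula with a purely hypothetical randomization: a discretized Gaussian-shaped distribution over the 31 acceleration levels, centred on the IDM output. The parameter values and the randomization scheme are assumptions for illustration only; the models actually used are described in Supplementary Information.

```python
import numpy as np

ACCELERATIONS = np.round(np.linspace(-4.0, 2.0, 31), 1)


def idm_acceleration(v, v_lead, gap, v0=33.0, T=1.5, s0=2.0, a_max=1.5, b=2.0, delta=4):
    """Standard IDM acceleration (speeds in m/s, gap in m); parameters are illustrative."""
    dv = v - v_lead
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * np.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)


def stochastic_idm_distribution(v, v_lead, gap, sigma=0.5):
    """Hypothetical stochastic variant: probabilities over the discrete
    accelerations, peaked at the (clipped) IDM acceleration."""
    a = np.clip(idm_acceleration(v, v_lead, gap), ACCELERATIONS[0], ACCELERATIONS[-1])
    logits = -0.5 * ((ACCELERATIONS - a) / sigma) ** 2
    p = np.exp(logits - logits.max())
    return p / p.sum()


p = stochastic_idm_distribution(v=30.0, v_lead=30.0, gap=50.0)
```

Sampling a manoeuvre from such a distribution at each step, instead of always taking the deterministic IDM output, is what allows rare but realistic conflicts to occur in simulation.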

D2RL architecture, implementation and training

The D2RL algorithm can be easily plugged into existing DRL algorithms by defining a specific environment with the dense-learning approach. Specifically, for existing DRL algorithms, the environment receives the decision from the DRL agent, executes the decision, and then collects observations and rewards at each time step, whereas for the D2RL algorithm, the environment collects only the observations and rewards for the critical states, as illustrated in Supplementary Section 3e. In this way, we can quickly implement the D2RL algorithm utilizing existing DRL platforms. In this study, we utilized the PPO algorithm implemented in the RLlib 1.2.0 platform⁵², which was trained in parallel on a high-performance computation cluster with 500 central-processing-unit cores and 3,500 GB of memory at the University of Michigan, Ann Arbor. We designed a three-layer fully connected neural network with 256 neurons in each layer and chose a learning rate of 10⁻⁴ and a discount factor of 1.0, leaving the other parameters at their default values. Each central-processing-unit core collected 120 time steps of training data for all experiment settings in each training iteration, so a total of 60,000 time steps were collected in each training iteration. For the corner-case generation, the neural network’s output is the actions of the closest 8 BVs, where each BV has a discrete action space of 33 actions: left lane change, 31 discrete longitudinal accelerations ([−4, 2] m s⁻² with a resolution of 0.2 m s⁻²) and right lane change. For the intelligent-testing-environment generation, the neural network’s output is the adversarial manoeuvre probability (ε_π) of the POV, where the action space is ε_π ∈ [0.001, 0.999]. To further improve the data efficiency during the training process, we used the collected data with a resampling mechanism to train the neural network for multiple steps.
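
The environment-level idea can be sketched as a thin wrapper that advances the underlying simulation through uncritical states with a default (for example, naturalistic) action and returns control to the agent only at critical states. The classes `ToyChainEnv` and `DenseEnv`, the three-value `step` return and the choice to accumulate rewards over skipped steps are assumptions made for this illustration; registration with a specific DRL platform is omitted.

```python
class ToyChainEnv:
    """Hypothetical stand-in environment: a 1-D chain that ends at `length`."""

    def __init__(self, length=20):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += 1
        done = self.pos >= self.length
        # Only the final action matters: action 1 at the end yields a crash reward.
        reward = -1.0 if (done and action == 1) else 0.0
        return self.pos, reward, done


class DenseEnv:
    """Expose only critical states to the agent; skipped steps use a default action."""

    def __init__(self, env, is_critical, default_action):
        self.env = env
        self.is_critical = is_critical
        self.default_action = default_action

    def _skip(self, obs, reward, done):
        # Advance until the episode ends or a critical state is reached.
        while not done and not self.is_critical(obs):
            obs, r, done = self.env.step(self.default_action(obs))
            reward += r
        return obs, reward, done

    def reset(self):
        obs, _, _ = self._skip(self.env.reset(), 0.0, False)
        return obs

    def step(self, action):
        obs, reward, done = self.env.step(action)
        return self._skip(obs, reward, done)


env = DenseEnv(ToyChainEnv(length=20), is_critical=lambda pos: pos >= 17,
               default_action=lambda pos: 0)
first_obs = env.reset()
decisions, done = 0, False
while not done:
    obs, reward, done = env.step(1)
    decisions += 1
```

In this toy run the agent makes 3 decisions instead of 20, so every collected sample comes from a state in which its action can influence the outcome.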

Field test settings

Augmented-reality testing platform

We implemented the augmented-reality testing platform at the ACM, one of the world’s premier test tracks for AVs located in Ypsilanti, Michigan, and the Mcity test track, which is the world’s first purpose-built test track for AV testing. In this study, we utilized the ACM’s 4-km highway loop, featuring two and three lanes and both exit and entrance ramps to create various merging opportunities, as well as the Mcity urban driving environment, including various types of highway, roundabout, urban streets and so on, as shown in Supplementary Section 3f. We constructed digital twins of the ACM and Mcity based on the NDE simulator and available high-definition maps. To synchronize the information between the simulation and physical test track, we utilized the dedicated short-range communications (DSRC) roadside units that were installed in the test tracks. These DSRC-based devices can communicate with AVs via 802.11p and SAE J2735 protocols through the immediate-forward-messaging and forwarding functions. Specifically, we utilized the immediate-forward-messaging function to broadcast proxy basic safety messages (BSMs) containing virtual BVs’ identifier, latitude, longitude, altitude and so on, to the physical AV, and the forwarding function to forward incoming BSMs of the AV to the digital twins. After receiving the BSMs of the AV, we synchronized the AV states in the simulation world, where BVs were controlled by the intelligent testing environment. More details of the platform can be found in ref. ²⁴. We implemented the system with an average 33-ms communication delay, which is acceptable for AV testing and can be further improved with advanced wireless communication techniques.

Augmented image rendering

We use augmented-reality techniques to render and blend virtual objects (for example, vehicles) onto the camera view of the ego vehicle. Given a background three-dimensional model with its six-degrees-of-freedom pose and location in the world coordinate, we perform a two-stage transformation to project the model to the onboard camera image: (1) from the world coordinate to the ego-vehicle coordinate, and (2) from the ego-vehicle coordinate to the onboard camera coordinate. In the first transformation, the ego vehicle pose and location are obtained from the real-time signal of the onboard high-precision real-time kinematic positioning (RTK). In the second transformation, the projection is based on the pre-calibrated camera intrinsic and extrinsic parameters. We also perform relighting on the rendered layer to harmonize the visual quality of the blending result. The augmented view is generated by linear blending of the rendered foreground layer and the camera’s background layer, weighted by the rendered alpha matte. On top of the blending result, a weather-control layer is further added to simulate different weather conditions, for example, rain, snow and fog. We implemented the augmented rendering based on pyrender⁵³. An additional validation of the augmented image rendering can be found in Supplementary Section 4f.
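
The two-stage transformation and the linear blending can be sketched with homogeneous transforms and a pinhole camera model. This is an illustration under stated assumptions, not the rendering pipeline itself: poses are restricted to planar yaw for brevity, the ego frame is taken as x forward, y left, z up, the camera frame as x right, y down, z forward, and all function names and numbers are hypothetical.

```python
import numpy as np


def pose_to_matrix(yaw, translation):
    """4x4 homogeneous transform for a planar yaw rotation plus a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def project_point(p_world, T_world_ego, T_ego_cam, K):
    """World -> ego -> camera coordinates, then pinhole projection to pixels."""
    p = np.append(np.asarray(p_world, dtype=float), 1.0)
    p_ego = np.linalg.inv(T_world_ego) @ p   # stage 1: world to ego vehicle
    p_cam = np.linalg.inv(T_ego_cam) @ p_ego  # stage 2: ego vehicle to camera
    u, v, w = K @ p_cam[:3]
    return np.array([u / w, v / w])


def blend(foreground, background, alpha):
    """Linear blending of two H x W x 3 images with an H x W alpha matte."""
    alpha = alpha[..., None]
    return alpha * foreground + (1.0 - alpha) * background


# Camera mounted at the ego origin; columns are the camera axes in the ego frame.
T_ego_cam = np.eye(4)
T_ego_cam[:3, :3] = [[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])

# Ego vehicle at (5, 2, 0) heading along +y; a point 10 m straight ahead.
T_world_ego = pose_to_matrix(np.pi / 2, [5.0, 2.0, 0.0])
pixel = project_point([5.0, 12.0, 0.0], T_world_ego, T_ego_cam, K)
```

A point straight ahead of the camera projects to the principal point (640, 360) in this example, which is a quick sanity check on the frame conventions.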

AV under test

As the AV under test, we used a retrofitted Lincoln MKZ from the Mcity Test Facility at the University of Michigan, Ann Arbor. The vehicle was equipped with multiple sensors, computing resources (two Nexcom Lumina) and drive-by-wire capabilities provided by Dataspeed Inc. Specifically, the sensors include a PointGrey camera, a Velodyne 32-channel LiDAR, Delphi radars, an OxTS RT3003 RTK GPS, an Xsens MTi GPS/inertial measurement unit and so on. We implemented the vehicle with a Robot Operating System-based open-source software, Autoware.AI²³, which provides full-stack software for the highly automated driving functions, including localization, perception, planning, control and so on. We then integrated the AV with the augmented-reality testing platform to evaluate the AV’s safety performance. An illustration of the system framework can be found in Supplementary Information. Specifically, we modified the AV localization component to utilize the high-definition map and high-accuracy RTK for obtaining the current pose and velocity. The surrounding vehicles’ BSMs were directly obtained from the simulation through wireless communications. To generate the AV’s future trajectory, we applied OpenPlanner 1.13⁵⁴ as the decision module, an advanced planning algorithm including global and local path planning. We applied the pure pursuit algorithm to convert the planned trajectory into the velocity and yaw rate and then used a proportional–integral–derivative controller provided by Dataspeed Inc. to further convert them into the vehicle by-wire control commands, that is, steering angle, throttle and brake percentages.
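
The conversion from a planned trajectory to a yaw-rate command can be illustrated with the textbook pure pursuit relation, in which the vehicle follows the circular arc through a look-ahead point on the path. The sketch below is generic and is not the Autoware.AI implementation; the function name and the frame convention (x forward, y left) are assumptions.

```python
import math


def pure_pursuit_yaw_rate(target_x, target_y, speed):
    """Yaw rate (rad/s) that steers along the arc through a look-ahead point.

    (target_x, target_y) is the look-ahead point in the vehicle frame in
    metres (x forward, y left); speed is in m/s.
    """
    lookahead_sq = target_x**2 + target_y**2
    if lookahead_sq == 0.0:
        return 0.0
    curvature = 2.0 * target_y / lookahead_sq
    return speed * curvature


# A look-ahead point on a circle of radius 20 m gives a yaw rate of speed / 20.
radius, angle = 20.0, 0.5
x = radius * math.sin(angle)
y = radius * (1.0 - math.cos(angle))
yaw_rate = pure_pursuit_yaw_rate(x, y, speed=10.0)
```

The commanded speed and this yaw rate would then be passed to a lower-level controller that produces steering, throttle and brake commands.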

Data availability

The raw datasets that we used for modelling the naturalistic driving environment come from the Safety Pilot Model Deployment (SPMD) programme⁴⁸ and the Integrated Vehicle-Based Safety System (IVBSS) programme⁴⁹ at the University of Michigan, Ann Arbor. The ShapeNet Dataset that includes the three-dimensional model assets for the image augmented-reality module can be found at https://shapenet.org/. The police crash reports used in Supplementary Video 7 are available at https://www.michigantrafficcrashfacts.org/. The processed data for constructing NDE models and the intelligent testing environment and the experiment results that support the findings of this study are available at https://github.com/michigan-traffic-lab/Dense-Deep-Reinforcement-Learning. Source data are provided with this paper.

Code availability

The simulation software SUMO, the automated driving system Autoware and the RLlib platform with the implemented PPO algorithm are publicly available, as described in the text and the relevant references²³,²⁵,⁵². The source code for the naturalistic driving environment simulator, the driving behaviour models in the simulator, the D2RL-based intelligent testing environment and the simulation set-ups is available at https://github.com/michigan-traffic-lab/Dense-Deep-Reinforcement-Learning.

References

Kalra, N. & Paddock, S. M. Driving to safety: how many miles of driving would it take to demonstrate autonomous vehicle reliability? Transp. Res. A 94, 182–193 (2016).
Google Scholar
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Article ADS CAS PubMed Google Scholar
10 million self-driving cars will be on the road by 2020. Insider https://www.businessinsider.com/report-10-million-self-driving-cars-will-be-on-the-road-by-2020-2015-5-6 (2016).
Nissan promises self-driving cars by 2020. Wired https://www.wired.com/2013/08/nissan-autonomous-drive/ (2014).
Tesla’s self-driving vehicles are not far off. Insider https://www.businessinsider.com/elon-musk-on-teslas-autonomous-cars-2015-9 (2015).
Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles (Society of Automotive Engineers, 2021); https://www.sae.org/standards/content/j3016_202104/.
2021 Disengagement Reports (California Department of Motor Vehicles, 2022); https://www.dmv.ca.gov/portal/vehicle-industry-services/autonomous-vehicles/disengagement-reports/.
Paz, D., Lai, P. J., Chan, N., Jiang, Y. & Christensen, H. I. Autonomous vehicle benchmarking using unbiased metrics. In IEEE International Conference on Intelligent Robots and Systems 6223–6228 (IEEE, 2020).
Favarò, F., Eurich, S. & Nader, N. Autonomous vehicles’ disengagements: trends, triggers, and regulatory limitations. Accid. Anal. Prev. 110, 136–148 (2018).
Article PubMed Google Scholar
Riedmaier, S., Ponn, T., Ludwig, D., Schick, B. & Diermeyer, F. Survey on scenario-based safety assessment of automated vehicles. IEEE Access 8, 87456–87477 (2020).
Article Google Scholar
Nalic, D. et al. Scenario based testing of automated driving systems: a literature survey. In Proc. of the FISITA Web Congress 1–10 (Fisita, 2020).
Feng, S., Feng, Y., Yu, C., Zhang, Y. & Liu, H. X. Testing scenario library generation for connected and automated vehicles, part I: methodology. IEEE Trans. Intell. Transp. Syst. 22, 1573–1582 (2020).
Article Google Scholar
Feng, S. et al. Testing scenario library generation for connected and automated vehicles, part II: case studies. IEEE Trans. Intell. Transp. Syst. 22, 5635–5647 (2020).
Article Google Scholar
Feng, S., Yan, X., Sun, H., Feng, Y. & Liu, H. X. Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment. Nat. Commun. 12, 748 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Sinha, A., O’Kelly, M., Tedrake, R. & Duchi, J. C. Neural bridge sampling for evaluating safety-critical autonomous systems. Adv. Neural Inf. Process. Syst. 33, 6402–6416 (2020).
Google Scholar
Li, L. et al. Parallel testing of vehicle intelligence via virtual-real interaction. Sci. Robot. 4, eaaw4106 (2019).
Article PubMed Google Scholar
Zhao, D. et al. Accelerated evaluation of automated vehicles safety in lane-change scenarios based on importance sampling techniques. IEEE Trans. Intell. Transp. Syst. 18, 595–607 (2016).
Article PubMed PubMed Central Google Scholar
Donoho, D. L. High-dimensional data analysis: the curses and blessings of dimensionality. AMS Math Challenges Lecture 1, 32 (2000).
Google Scholar
Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
Article ADS MathSciNet CAS PubMed MATH Google Scholar
Silver, D. et al. Mastering the game of go without human knowledge. Nature 550, 354–359 (2017).
Article ADS CAS PubMed Google Scholar
Mirhoseini, A. et al. A graph placement methodology for fast chip design. Nature 594, 207–212 (2021).
Article ADS CAS PubMed Google Scholar
Cummings, M. L. Rethinking the maturity of artificial intelligence in safety-critical settings. AI Mag. 42, 6–15 (2021).
Google Scholar
Kato, S. et al. Autoware on board: enabling autonomous vehicles with embedded systems. In 2018 ACM/IEEE 9th International Conference on Cyber-Physical Systems 287–296 (IEEE, 2018).
Feng, S. et al. Safety assessment of highly automated driving systems in test tracks: a new framework. Accid. Anal. Prev. 144, 105664 (2020).
Article PubMed Google Scholar
Lopez, P. et al. Microscopic traffic simulation using SUMO. In International Conference on Intelligent Transportation Systems 2575–2582 (IEEE, 2018).
Arun, A., Haque, M. M., Bhaskar, A., Washington, S. & Sayed, T. A systematic mapping review of surrogate safety assessment using traffic conflict techniques. Accid. Anal. Prev. 153, 106016 (2021).
Article PubMed Google Scholar
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 2018).
Koren, M., Alsaif, S., Lee, R. & Kochenderfer, M. J. Adaptive stress testing for autonomous vehicles. In IEEE Intelligent Vehicles Symposium (IV) 1–7 (IEEE, 2018).
Sun, H., Feng, S., Yan, X. & Liu, H. X. Corner case generation and analysis for safety assessment of autonomous vehicles. Transport. Res. Rec. 2675, 587–600 (2021).
Article Google Scholar
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. Preprint at https://arxiv.org/abs/1707.06347 (2017).
Owen, A. B. Monte Carlo theory, methods and examples. Art Owen https://artowen.su.domains/mc/ (2013).
Krajewski, R., Moers, T., Bock, J., Vater, L. & Eckstein, L. September. The round dataset: a drone dataset of road user trajectories at roundabouts in Germany. In 2020 IEEE 23rd International Conference on Intelligent Transportation Systems 1–6 (IEEE, 2020).
Nowakowski, C., Shladover, S. E., Chan, C. Y. & Tan, H. S. Development of California regulations to govern testing and operation of automated driving systems. Transport. Res. Rec. 2489, 137–144 (2015).
Article Google Scholar
Sauerbier, J., Bock, J., Weber, H. & Eckstein, L. Definition of scenarios for safety validation of automated driving functions. ATZ Worldwide 121, 42–45 (2019).
Article Google Scholar
Pek, C., Manzinger, S., Koschi, M. & Althoff, M. Using online verification to prevent autonomous vehicles from causing accidents. Nat. Mach. Intell. 2, 518–528 (2020).
Article Google Scholar
Seshia, S. A., Sadigh, D. & Sastry, S. S. Toward verified artificial intelligence. Commun. ACM 65, 46–55 (2022).
Article Google Scholar
Wing, J. M. A specifier’s introduction to formal methods. IEEE Comput. 23, 8–24 (1990).
Article Google Scholar
Li, A., Sun, L., Zhan, W., Tomizuka, M. & Chen, M. Prediction-based reachability for collision avoidance in autonomous driving. In 2021 IEEE International Conference on Robotics and Automation 7908–7914 (IEEE, 2021).
Automated Vehicle Safety Consortium AVSC Best Practice for Metrics and Methods for Assessing Safety Performance of Automated Driving Systems (ADS) (SAE Industry Technologies Consortia, 2021).
Au, S. K. & Beck, J. L. Important sampling in high dimensions. Struct. Saf. 25, 139–163 (2003).
Article Google Scholar
Silver, D., Singh, S., Precup, D. & Sutton, R. S. Reward is enough. Artif. Intell. 299, 1–13 (2021).
Article MathSciNet MATH Google Scholar
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Article ADS CAS PubMed Google Scholar
Weng, B., Rao, S. J., Deosthale, E., Schnelle, S. & Barickman, F. Model predictive instantaneous safety metric for evaluation of automated driving systems. In IEEE Intelligent Vehicles Symposium (IV) 1899–1906 (IEEE, 2020).
Junietz, P., Bonakdar, F., Klamann, B. & Winner, H. Criticality metric for the safety validation of automated driving using model predictive trajectory optimization. In International Conference on Intelligent Transportation Systems 60–65 (IEEE, 2018).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition 4700–4708 (IEEE, 2017).
Bengio, Y., Louradour, J., Collobert, R. & Weston, J. Curriculum learning. In International Conference on Machine Learning 41–48 (ICML, 2009).
Yan, X., Feng, S., Sun, H. & Liu, H. X. Distributionally consistent simulation of naturalistic driving environment for autonomous vehicle testing. Preprint at https://arxiv.org/abs/2101.02828 (2021).
Bezzina, D. & Sayer, J. Safety Pilot Model Deployment: Test Conductor Team Report DOT HS 812 171 (National Highway Traffic Safety Administration, 2014).
Sayer, J. et al. Integrated Vehicle-based Safety Systems Field Operational Test: Final Program Report FHWA-JPO-11-150; UMTRI-2010-36 (Joint Program Office for Intelligent Transportation Systems, 2011).
Treiber, M., Hennecke, A. & Helbing, D. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 62, 1805 (2000).
Kesting, A., Treiber, M. & Helbing, D. General lane-changing model MOBIL for car-following models. Transp. Res. Rec. 1999, 86–94 (2007).
Liang, E. et al. RLlib: abstractions for distributed reinforcement learning. In International Conference on Machine Learning 3053–3062 (ICML, 2018).
Chang, A. X. et al. ShapeNet: an information-rich 3D model repository. Preprint at https://arxiv.org/abs/1512.03012 (2015).
Darweesh, H. et al. Open source integrated planner for autonomous navigation in highly dynamic environments. J. Robot. Mechatron. 29, 668–684 (2017).

Acknowledgements

This research was partially funded by the US Department of Transportation (USDOT) Region 5 University Transportation Center: Center for Connected and Automated Transportation (CCAT) of the University of Michigan (#69A3551747105) and the National Science Foundation (CMMI #2223517). We thank the American Center for Mobility (ACM) for providing access to their test track. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the official policy or position of the US government or the American Center for Mobility.

Author information

Shuo Feng
Present address: Department of Automation, Tsinghua University, Beijing, China
Zhengxia Zou
Present address: School of Astronautics, Beihang University, Beijing, China

Authors and Affiliations

Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI, USA
Shuo Feng, Haowei Sun, Xintao Yan, Haojie Zhu, Zhengxia Zou & Henry X. Liu
University of Michigan Transportation Research Institute, Ann Arbor, MI, USA
Shuo Feng, Shengyin Shen & Henry X. Liu
Mcity, University of Michigan, Ann Arbor, MI, USA
Henry X. Liu

Authors

Shuo Feng
Haowei Sun
Xintao Yan
Haojie Zhu
Zhengxia Zou
Shengyin Shen
Henry X. Liu

Contributions

S.F. and H.X.L. conceived and led the research programme, developed the AI against AI concepts, developed the dense-learning approach, and wrote the paper. S.F. and H.S. developed the algorithms for the intelligent-testing-environment generation and designed the experiments. H.S. and H.Z. developed the simulation platform, implemented the algorithms, performed the simulation tests and prepared the simulation results. X.Y., H.Z. and S.S. implemented the Autoware system in the autonomous vehicle, performed the field tests and prepared the testing results. Z.Z. developed and performed the augmented image rendering. All authors provided feedback during the manuscript revision and results discussions. H.X.L. approved the submission and accepted responsibility for the overall integrity of the paper.

Corresponding author

Correspondence to Henry X. Liu.

Ethics declarations

Competing interests

The University of Michigan is applying for a patent (application #63/338,424) covering the dense reinforcement learning, intelligent testing environment generation and augmented reality testing techniques, which lists H.X.L., S.F., H.S., X.Y., H.Z., Z.Z. and S.S. as inventors.

Peer review

Peer review information

Nature thanks Colin Paterson, Fredrik Warg and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

This file contains Supplementary Sections 1–4, including Supplementary text and equations, Figs. 1–19, Tables 1 and 2, and references; see Contents for details. It also includes links to Supplementary Videos 1–8 in Section 5, which are hosted externally via figshare.

Source data

Source Data Fig. 2

Source Data Fig. 3

Source Data Fig. 4

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Feng, S., Sun, H., Yan, X. et al. Dense reinforcement learning for safety validation of autonomous vehicles. Nature 615, 620–627 (2023). https://doi.org/10.1038/s41586-023-05732-2

Received: 01 March 2022
Accepted: 16 January 2023
Published: 22 March 2023
Issue Date: 23 March 2023
DOI: https://doi.org/10.1038/s41586-023-05732-2
Springer Nature Limited
