Main

Owing to the rapid development of autonomous vehicle (AV) technologies, we are on the cusp of a revolution in transportation on a scale not seen since the introduction of automobiles a century ago. AV technologies have the potential to substantially improve transportation safety, mobility and sustainability, and thus have attracted worldwide attention from industries, government agencies, professional organizations and academic institutions. Over the past 20 years, substantial progress has been made on the development of AVs, particularly with the emergence of deep learning2. By 2015, several companies had announced that they would be mass-producing AVs before 2020 (refs. 3,4,5). So far, the reality has not lived up to these expectations, and no level 4 (ref. 6) AVs are commercially available. The reasons for this are numerous, but above all, the safety performance of AVs is still substantially below that of human drivers. For average drivers in the United States, the occurrence probability of a crash is around 1.9 × 10−6 per mile in the naturalistic driving environment (NDE)1. In contrast, the disengagement rate for state-of-the-art AVs is around 2.0 × 10−5 per mile, according to the 2021 Disengagement Reports from California7. Although the disengagement rate has been criticized for its potential bias, it has been widely used to track the trend of AV safety performance8,9, as it is arguably the only statistic available to the public for the comparison of different AVs.

One critical bottleneck to improving AV safety performance is the severe inefficiency of safety validation. Prevailing approaches usually test AVs in the NDE through a combination of software simulation, closed test track and on-road testing. However, to validate the safety performance of AVs at the level of human drivers, it is well known that hundreds of millions of miles, and sometimes hundreds of billions of miles, would need to be tested in the NDE1. Owing to this severe inefficiency, AV developers incur substantial economic and time costs to evaluate each development iteration, which has hindered the progress of AV deployment. To improve the testing efficiency, many approaches test AVs in purposely generated scenarios that are more safety critical10,11. Yet, existing scenario-based approaches12,13,14,15,16,17 can mainly be applied to short scenario segments with limited background road users (see Supplementary Information for further discussion).

Validating the safety performance of AVs in the NDE is in essence a rare-event estimation problem in a high-dimensional space. The main challenge is caused by the compounding effects of the ‘curse of rarity’ on top of the ‘curse of dimensionality’ (Fig. 1a). By ‘curse of dimensionality’, we mean that driving environments can be spatiotemporally complex, and the variables needed to define such environments are high-dimensional. As the volume of the variable space grows exponentially with dimensionality, the computational complexity also grows exponentially18. By ‘curse of rarity’, we mean that safety-critical events occur only rarely, that is, most points of the variable space are non-safety-critical and provide no information, or only noisy information, for training. Under these circumstances, it is hard for a deep-learning model to learn even from a large amount of data, as the valuable information (for example, the policy gradient) of safety-critical events can be buried under the large amount of non-safety-critical data. Recent decades have seen rapid progress in the ability of artificial intelligence (AI) systems to solve problems with the curse of dimensionality19; for example, the board game Go has a state space of 10^360 (ref. 20) and semiconductor chip design may have a state space on the order of 10^2,500 (ref. 21). Before this work, however, solving the curse of dimensionality and the curse of rarity simultaneously had remained an open problem, which has impeded the applicability of AI techniques in safety-critical systems, such as AVs, medical robots and aerospace systems22.

Fig. 1: Validating safety-critical AI with the dense-learning approach.

a, The curse of rarity hinders the applicability of deep-learning techniques for safety-critical systems, as the gradient estimation of neural networks would suffer from the large variance due to the rareness of informative data. By training the neural networks with the informative data only, our dense-learning approach substantially reduces the gradient estimation variance, enabling deep-learning applications in safety-critical systems. f and E denote the objective function and mathematical expectation, respectively.  b, The D2RL approach edits the Markov process by removing the uncritical states and reconnecting the critical states, and then trains the neural networks (NN) for only the edited Markov process. c, For any D2RL training episode, the reward from the end state is backpropagated along the edited Markov chain with critical states only. Three examples are provided. In the left example, the episode is completely removed from training data as it does not contain any critical state. In the middle and right examples, the uncritical states are skipped and critical states are reconnected to densify the training data. The end state for the middle example is from a non-crash episode, whereas the right example is from a crash episode. d, The augmented-reality testing platform can augment the real world with virtual background traffic, resulting in a safer, more controllable and more efficient testing environment for AVs. Our approach learns to decide when to control which background vehicles to execute what adversarial manoeuvre with what probability.

We address this challenge by developing a dense deep-reinforcement-learning (D2RL) approach. The basic idea is to identify and remove the non-safety-critical data and train neural networks utilizing only the safety-critical data. As only a very small portion of the data is safety critical, the information in the remaining data will be substantially densified. Essentially, the D2RL approach edits the Markov decision process by removing the uncritical states and reconnecting the critical states, and then trains neural networks for only the edited Markov process (Fig. 1b). Therefore, for any training episode, the reward from the end state is backpropagated along the edited Markov chain with critical states only (Fig. 1c). Compared with the DRL approach, the D2RL approach can dramatically reduce the variance of the policy gradient estimation by multiple orders of magnitude without loss of unbiasedness, as proved in Theorem 1 in Methods. Such substantial variance reduction can enable neural networks to learn and achieve tasks that are intractable for the DRL approach. For AV testing, we leverage the D2RL approach and train the background vehicles (BVs) through a neural network to learn when to execute what adversarial manoeuvre, which aims to improve the testing efficiency and ensure evaluation unbiasedness. This results in an AI-based adversarial testing environment that can reduce the required testing miles of AVs by multiple orders of magnitude while ensuring the testing unbiasedness. Our approach can be applied to complex driving environments, including multiple highways, intersections and roundabouts, which cannot be achieved by previous scenario-based approaches. The proposed approach empowers the testing agents in the environment with intelligence to create an intelligent testing environment, that is, using AI to validate AI. This is a paradigm shift and it opens the door for accelerated testing and training of other safety-critical systems.
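The editing step in Fig. 1b,c can be sketched in a few lines of Python. The snippet below is only an illustration of the idea (the function name and the per-step tuple layout are assumptions, not the authors' implementation): uncritical steps are dropped and the surviving critical steps are reconnected into a shorter, denser episode before being passed to the learner.

```python
def edit_episode(transitions, is_critical):
    """Keep only transitions that start from critical states, reconnecting the
    surviving steps into a shorter, denser episode (cf. Fig. 1b,c).

    transitions : list of (state, action, reward, next_state) tuples
    is_critical : callable mapping a state to True/False
    """
    edited = [step for step in transitions if is_critical(step[0])]
    return edited  # an episode with no critical state yields [] and is discarded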

To demonstrate the effectiveness of our AI-based testing approach, we trained the BVs with large-scale naturalistic driving datasets and conducted simulation experiments as well as field experiments on physical test tracks. Specifically, we tested a level 4 AV with an open-source automated driving system, Autoware23, on the physical 4-km-long highway test track at the American Center for Mobility (ACM) and the urban test track at Mcity. To test the AV with the D2RL-trained testing environment safely and precisely, we developed an augmented-reality testing platform24, which combines the physical test track and a microscopic traffic simulator, SUMO (Simulation of Urban Mobility)25. As shown in Fig. 1d, by synchronizing the movements of the real AV and virtual BVs, the real AV on the physical test track can interact with the virtual BVs as though it were in a realistic traffic environment, where the BVs are directed to interact with the real AV. For both simulation and field experiments, we evaluated not only crash rates but also crash types and crash severities. Our simulation and field-testing results show that the D2RL approach can effectively learn the intelligent testing environment, which can unbiasedly accelerate the evaluation process of AVs by multiple orders of magnitude (10^3 to 10^5 times faster) compared with testing AVs directly in the NDE.

Dense deep reinforcement learning

To leverage AI techniques, we formulate the AV testing problem as a sequential Markov decision process (MDP), where manoeuvres of BVs are decided based on the current state information. We aim to train a policy (a DRL agent) modelled by a neural network, which can control the manoeuvres of BVs to interact with the AV, to maximize the evaluation efficiency and ensure unbiasedness. However, as mentioned earlier, it is hard—or even empirically infeasible—to learn an effective policy if directly applying DRL approaches because of the curse of dimensionality and the curse of rarity.

We address this challenge by developing the D2RL approach. Owing to the rarity of safety-critical events, most states are uncritical and cannot provide information for safety-critical events, so the key concept of D2RL is to remove the data of these uncritical states and utilize only the informative data for training the neural network (Fig. 1b,c). For AV testing problems, many safety metrics26 can be utilized to identify the critical states with different efficiency and effectiveness. In this study, we utilize the criticality measure12,13, which is an outer approximation of the AV crash rate within a specific time horizon (for example, one second) from the current state. Theoretical analysis for more generic problems can be found in Methods and Supplementary Section 2a. We then edit the Markov process, discard the data of uncritical states, and use the remaining data for the policy gradient estimation and bootstrapping of the DRL training. We find that dense learning can markedly reduce the variance of the policy-gradient estimation by multiple orders of magnitude without loss of estimation unbiasedness, as proved in Theorem 1 in Methods. Dense learning can also reduce the bootstrapping variance, as it can be regarded as a state-dependent temporal-difference learning27, where only critical states are utilized and others are skipped.

To demonstrate the effectiveness of dense learning, we compared D2RL with the DRL approach on a corner-case-generation problem28,29, which can be formulated as a well-defined reinforcement-learning problem. A neural network was trained to maximize the AV’s crash rate by controlling the closest eight BVs’ actions (Fig. 2a). We used proximal policy optimization (PPO)30 to update the parameters of the policy network, given the reward for each testing episode, that is, +20 for an AV crash and 0 otherwise. For a fair comparison, the only difference between DRL and D2RL is that DRL utilized all the data for training the neural network, whereas D2RL utilized only the data of critical states. As shown in Fig. 2b, D2RL removed 80.5% of complete episodes and 99.3% of steps from uncritical states, compared with DRL. According to Theorem 1, this indicates that D2RL can reduce around 99.3% of the policy-gradient-estimation variance, which enables the neural network to learn effectively. Specifically, D2RL maximized the reward during the training process, whereas DRL was stuck from the beginning of the training process (Fig. 2c). The policy learned by D2RL can effectively increase the crash rate of the AV, whereas DRL failed to do so (Fig. 2d). Figure 2e–g illustrates three generated corner cases.
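As a concrete illustration of this experimental setup, the following sketch enumerates the 33-action discrete space and the sparse episode reward (+20 for a crash, 0 otherwise) described above; the container names and action encoding are illustrative assumptions, not the authors' code.

```python
import numpy as np

# 31 longitudinal accelerations from -4 to 2 m/s^2 at 0.2 m/s^2 resolution,
# plus left and right lane changes: 33 discrete actions per BV every 0.1 s.
ACCELERATIONS = np.round(np.arange(-4.0, 2.0 + 1e-9, 0.2), 1)  # 31 values
ACTIONS = (["left_lane_change"]
           + [("accelerate", float(a)) for a in ACCELERATIONS]
           + ["right_lane_change"])
assert len(ACTIONS) == 33

def episode_reward(av_crashed: bool) -> float:
    """Sparse episode reward for corner-case generation: +20 for an AV crash,
    0 otherwise, as described in the text."""
    return 20.0 if av_crashed else 0.0
```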

Fig. 2: Comparison of D2RL with DRL using the corner-case-generation examples.

a, The neural network controls the closest eight vehicles’ manoeuvres within 120 m, where each BV has 33 discrete actions at every 0.1 s: left lane change, 31 discrete longitudinal accelerations ([−4, 2] with 0.2 m s−2 discrete resolution) and right lane change. b, Proportions of the data removed by D2RL regarding the episodes (left) and steps (right). c,d, Comparison of training rewards between DRL and D2RL (c) and comparison of crash rates between the policies learned by DRL and D2RL (d). The solid lines represent the moving averages of rewards (c) and crash rates (d), and the shaded areas represent the standard deviation. e, The AV (blue vehicle) made an evasive lane change to avoid a cut-in vehicle but collided with an adjacent vehicle. f, The right-front vehicle made a cut-in, the left-behind vehicle made a right lane change and the right-behind vehicle accelerated. These three vehicles cooperatively encircled the AV and caused a crash. g, The right-front vehicle made a cut-in to force the AV to brake, which created the opportunity for the right-behind vehicle to make a lane change after 2.8 s (that is, 28 uncritical steps), leading to a crash. Additional explanations are provided in Supplementary Video 1.


Learning the intelligent testing environment

Learning the intelligent testing environment for unbiased and efficient AV evaluation is much more complex than corner-case generation. According to the importance sampling theory31, the goal is essentially to learn new sampling distributions, that is, the importance function, of BVs’ manoeuvres to replace their naturalistic ones, with the aim of minimizing the estimation variance of AV testing. Intuitively, the BVs are trained to learn when to execute what adversarial manoeuvre: all BVs follow naturalistic behaviours, and only selected vehicles at selected moments execute specifically designed adversarial manoeuvres with a learned probability. To achieve this goal, without using any heuristics or handcrafted functions, we derive the reward function from the estimation variance as

$$r({\bf{x}})=-\,{{\mathbb{I}}}_{A}({\bf{x}})\times {W}_{{q}_{\pi }}({\bf{x}})\times {W}_{{q}_{{\pi }_{{\rm{b}}}}}({\bf{x}}),$$
(1)

where x denotes the variables of each testing episode, \({{\mathbb{I}}}_{A}({\bf{x}})\) is an indicator function of the AV crash event (A), and \({W}_{{q}_{\pi }}({\bf{x}})=P({\bf{x}})/{q}_{\pi }({\bf{x}})\) and \({W}_{{{q}_{\pi }}_{{\rm{b}}}}({\bf{x}})=P({\bf{x}})/{q}_{{\pi }_{{\rm{b}}}}({\bf{x}})\) are weights (or likelihoods) produced by importance sampling. Here P(x) denotes the naturalistic distribution, qπ(x) denotes the importance function with the target policy π, and \({q}_{{\pi }_{{\rm{b}}}}({\bf{x}})\) denotes the importance function with the behaviour policy πb. As there is no heuristic or handcrafted immediate reward function, the reward function in equation (1) is highly consistent with the testing performance, that is, a higher reward indicates a more efficient testing environment. Such reward design is generic and applicable to other rare-event estimation problems with high-dimensional variables.
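A minimal sketch of how the reward in equation (1) could be computed at the end of an episode is given below, assuming the per-step probabilities under the naturalistic distribution, the target policy and the behaviour policy have been logged; the function and argument names are illustrative assumptions.

```python
import numpy as np

def reward_eq1(crashed, p_nat, q_target, q_behaviour):
    """Episode reward of equation (1).

    crashed     : whether the episode ends with an AV crash (indicator I_A)
    p_nat       : per-step naturalistic probabilities P(u(k) | s(k))
    q_target    : per-step probabilities under the target policy q_pi
    q_behaviour : per-step probabilities under the behaviour policy q_pi_b
    """
    if not crashed:
        return 0.0                      # the indicator is zero for non-crash episodes
    w_target = np.prod(np.asarray(p_nat) / np.asarray(q_target))        # W_{q_pi}(x)
    w_behaviour = np.prod(np.asarray(p_nat) / np.asarray(q_behaviour))  # W_{q_pi_b}(x)
    return -w_target * w_behaviour      # negative, because the variance is minimized
```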

To determine the learning mechanism, we further investigate the relationship between the behaviour policy πb and the target policy π. As proved in Theorem 2 in Methods, we find that the optimal behaviour policy \({\pi }_{{\rm{b}}}^{* }\) that collects data during the training process is nearly inversely proportional to the target policy. This indicates that, if on-policy learning mechanisms (\({q}_{{\pi }_{{\rm{b}}}}={q}_{\pi }\)) were used, the behaviour policy would be far from optimal, which could mislead the training process and eventually cause underestimation issues. To address this issue, we design an off-policy learning mechanism, where a generic behaviour policy is designed and kept unchanged during the training process. Although this off-policy mechanism is not the optimal behaviour policy of Theorem 2 (which is usually unavailable in practice), it can balance exploration and exploitation and is empirically effective for all experiment settings in this study. With the reward function and the off-policy learning mechanism, we can learn the intelligent testing environment by the D2RL approach (see Methods for training details).

AV testing in simulation

We evaluated the effectiveness of the D2RL-based intelligent testing environment regarding accuracy, efficiency, scalability and generalizability by systematic simulation analysis. To measure the safety performance of AVs, crash rates of different crash types and severities in the NDE are utilized as the benchmark. As the NDE is generated completely based on naturalistic driving data, testing results in the NDE can represent the safety performance of AVs in the real world. For each test episode, we simulated AV driving in traffic for a fixed distance, and then the test results were recorded and analysed. To investigate the scalability and generalizability, we conducted simulation experiments with different road geometries, different driving distances and two different types of AV model (that is, the AV-I and AV-II models; see Supplementary Section 3d).

Figure 3 shows the results of the two-lane highway environment with the 400-m driving distance for the AV-I model, which is a basic experiment to validate our approach. As shown in Fig. 3a, during the training process, the estimation variance of the intelligent testing environment decreases as the reward function increases, which demonstrates the effectiveness of the reward function in equation (1). To justify the off-policy mechanism, we investigated the performance of the on-policy mechanism, where the target policy was utilized as the behaviour policy. As shown in Fig. 3b, during the training process, the crash rate for the on-policy experiments substantially increases, whereas the crash rate for the off-policy experiments is unchanged because the behaviour policy is unchanged. However, as the on-policy mechanism breaks the consistency between the reward function and the estimation variance, this increase of the crash rate would be misleading. As shown in Fig. 3c, the testing environment obtained by the on-policy mechanism underestimates the crash rate. In contrast, our off-policy approach can obtain the same crash rate as the NDE approach, but more efficiently (Fig. 3d,e). To measure the efficiency, we calculated the minimum number of tests for reaching a predetermined precision threshold (a relative half-width12,17 of 0.3). To reduce the randomness of the results for a fair comparison, we repeated the testing of our approach by bootstrap sampling and obtained the frequency and average of the required number of tests (Fig. 3f). Compared with the NDE approach, which required 1.9 × 10^8 tests, our approach required an average of 9.1 × 10^4 tests, which is 2.1 × 10^3 times faster. To investigate the generalizability, we further tested the AV-II model using the same intelligent testing environment without any refinement, which can also obtain an accurate estimation about 10^3 times faster (see Supplementary Section 4d).
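For reference, the stopping rule used above can be sketched as follows, assuming the relative half-width is the half-width of the confidence interval of the crash-rate estimate divided by the estimate itself; the 90% confidence level is an assumption here, and the precise definition follows refs. 12,17.

```python
import numpy as np

def relative_half_width(weighted_crash_indicators, z=1.645):
    """Relative half-width of the crash-rate estimate after n tests.

    weighted_crash_indicators : per-episode values I_A(x_i) * W_q(x_i)
                                (for NDE testing, W_q is identically 1)
    z                         : normal quantile (1.645 corresponds to a 90%
                                confidence level; an assumption in this sketch)
    """
    x = np.asarray(weighted_crash_indicators, dtype=float)
    mean = x.mean()
    if mean == 0.0:
        return float("inf")             # no crash observed yet
    half_width = z * x.std(ddof=1) / np.sqrt(len(x))
    return half_width / mean

# Testing stops once relative_half_width(...) drops below 0.3.
```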

Fig. 3: Performance evaluation of the D2RL-based intelligent testing environment.

a, Comparison of the reward between the DRL and D2RL approaches, along with the estimation variance (dashed line) of the D2RL approach that represents the testing efficiency. The solid lines represent the moving average and the shaded areas represent the standard deviation. b,c, Comparison of crash rates of the on-policy and off-policy D2RL approaches during the training process (b) and comparison of estimated crash rates of the on-policy and off-policy D2RL approaches during the testing process (c). The shaded area represents the 90% confidence level and the solid lines represent the averages. d,e, Crash rate estimations (d) and relative half-width (e) of the AV-I model by the NDE and the D2RL-based intelligent testing environment. The bottom x axis denotes the number of tests for the NDE and the top x axis denotes the number of tests for the intelligent testing environment. The shaded area represents the 90% confidence level and the solid lines represent the averages (d). The dashed line represents the 0.3 relative half-width and the numbers represent the required numbers of tests for reaching the 0.3 relative half-width (e). f, Frequency of the required number of tests for repeated testing experiments for the AV-I model. g,h, Unweighted crash rate (g) and weighted crash rate (h) of each crash type in the D2RL-trained testing environment. i–l, Weighted distributions of the speed difference at the crash moment (i), time to collision (j), bumper-to-bumper distance (k) and post-encroachment time (l) of the near-miss events.


To validate the unbiasedness regarding crash types, crash severities and near-miss events, we analysed the crash rates of different crash types, the distribution of the speed difference at the crash moment, and the distributions of the time to collision, bumper-to-bumper distance and post-encroachment time of near-miss events. Throughout the paper, our use of the term unbiasedness refers to the fact that estimations from our approach have the same mathematical expectations as those from the NDE. In our experiments, we collected about 2.34 × 10^8 episodes of tests in the NDE and 3.15 × 10^6 (about two orders of magnitude fewer) episodes of tests in the intelligent testing environment. As the intelligent testing environment is more adversarial than the NDE, the total crash rate in our approach is 3.21 × 10−3 (Fig. 3g), which is much higher than that (1.58 × 10−7) in the NDE. As required by the importance sampling theory, each crash event should be weighted by the likelihood ratio to keep the unbiasedness. Therefore, the weighted crash rates for all crash types are compared with the results in the NDE (Fig. 3h), which demonstrates the unbiasedness of our approach within the evaluation precision. Similarly, Fig. 3i–l demonstrates that our approach can also unbiasedly evaluate the AV’s safety performance regarding crash severities and near-miss events within the evaluation precision. As near-miss events are critical for the development of AVs, the generated near-miss events, obtained without loss of unbiasedness, open the door for accelerating AV training. We leave this for future study.

To further investigate the scalability and generalizability, we conducted the experiments with different numbers of lanes (two and three lanes) and driving distances (400 m, 2 km, 4 km and 25 km) for the AV-I model. Here we studied the 25-km case to demonstrate the effectiveness of our approach over full-length trips, because the average commuter travels approximately 25 km one way in the United States. As shown in Table 1, because of the skipped episodes and steps that substantially reduce the training variance, our approach can effectively learn the intelligent testing environment for all the experiments.

Table 1 Performance evaluation with different highway simulation environments

Furthermore, to demonstrate the advantage of our approach in realistic urban scenarios, we extended our simulation experiments to a real-world four-armed roundabout32 in Germany with a high traffic volume and complex interactions. Compared with the NDE testing approach, which requires about 8.91 × 10^6 tests to reach the 30% relative half-width, our approach requires only 3.76 × 10^3 tests, which is 2.37 × 10^3 times faster. See Supplementary Video 2 and Supplementary Section 4b for more details.

AV testing in test tracks

Finally, we tested a Lincoln MKZ hybrid equipped with the open-source automated driving system, Autoware23 (Fig. 4a), driving continuously on the physical multi-lane 4-km highway test track at the ACM (Fig. 4b) and the physical urban test track at Mcity (Fig. 4c). We developed an augmented-reality testing platform24, which combines the physical test track and a simulation environment, SUMO25. As shown in Fig. 1d, by synchronizing the movements of the real AV and virtual BVs, the real AV on the physical test track can interact with the virtual BVs as though it were in a real traffic environment, where the BVs are controlled according to the intelligent testing environment. Figure 4d illustrates the real-time visualization of the testing process. We trained the intelligent testing environment in the digital twins of the ACM highway section and the Mcity urban section using training settings similar to those of the simulation studies (see Methods for details). As shown in Fig. 4e–h, the crash rate estimations at both the ACM and Mcity converge and reach the 30% relative half-width after about 156 tests at the ACM and 117 tests at Mcity, which is on the order of 10^5 times faster than the NDE testing approach (2.5 × 10^7 tests at the ACM and 2.1 × 10^7 tests at Mcity). We also evaluated the AV’s safety performance for different crash types and severities (Fig. 4i,j).

Fig. 4: Testing experiments for a real-world AV at physical test tracks.

a, Illustration of the AV under test, equipped with Autoware. IMU, inertial measurement unit; OBU, onboard unit. b, Illustration of the ACM highway testing environment. The red line denotes the AV driving route. c, Illustration of the Mcity urban testing environment including highways, roundabouts, intersections and so on. The explosion icons denote the locations of crash events that happened during the tests. d, Illustration of the real-time visualization of the testing process. Left: the simulation view, where the virtual BVs (green vehicles) are generated and controlled by the intelligent testing environment to interact with the AV (red vehicle). Middle: the real-world AV view visualized by Autoware, where the black vehicle is the AV under test and blue vehicles are augmented BVs. Right: the original image view (top) and augmented image view (bottom) from the AV’s front camera. e–h, Crash rate estimation and the relative half-width of the real AV at the ACM test track (e,f) and Mcity test track (g,h) with the augmented-reality testing platform. The black dashed line (e,g) represents the final estimation of the crash rate, the grey dashed lines (e,g) represent the 30% relative errors of the crash rate, the grey dashed line (f,h) represents the 0.3 relative half-width threshold and the shaded areas (e,g) represent the 90% confidence level. i, Crash rates of different crash types of the AV at the Mcity test track. j, Distribution of the speed difference at the crash moment for crash severity analysis of the AV at the Mcity test track. Additional explanations regarding the field experiments are provided in Supplementary Videos 3–8.


Discussion

Our results present evidence of using D2RL techniques to validate the safety performance of AVs regarding their behavioural competency33. D2RL can accelerate the testing process and can be used for both simulation-based testing and test-track testing. It can substantially enhance existing testing approaches (falsification methods, scenario-based methods and NDE methods) to overcome their limitations in real-world applications. D2RL also opens the door for leveraging AI techniques to validate the machine intelligence of other safety-critical autonomous systems, such as medical robots and aerospace systems.

Ideally, the testing environment should consider all operating conditions of AVs and their associated rare events. For example, a six-layer model34 has been developed to structure the parameters of scenarios, including road geometry, road furniture and rules, temporal modifications and events, moving objects, environmental conditions, and digital information. In this study, we mainly focus on two layers: moving objects and road geometry, that is, multiple surrounding vehicles undertaking manoeuvres on roads of varying geometry, which are critical for the testing environment. Our approach could be extended to include parameters from other layers, such as weather conditions, by collecting large-scale naturalistic data and utilizing domain knowledge of those fields.

We note that increasing attention has also been paid to formal methods to address the challenges raised by AI systems (see refs. 35,36 and references therein). Formal methods provide a mathematical framework for rigorous system specification, design and verification37, which is critical for trustworthy AI. However, as discussed in ref. 36, multiple major challenges need to be addressed to realize their full potential. D2RL can potentially be integrated with formal methods. For example, reachability-based methods38 could be incorporated into the calculation of the criticality measure to identify the critical states, particularly for generic safety-critical autonomous systems. How to further integrate D2RL with formal methods deserves further investigation.

Methods

Description of the AV safety validation problem

This section describes the problem formulation of AV safety performance evaluation. Denote the variables of the driving environment as x = [s(0), u(0), u(1), ⋯, u(T)], where s(k) denotes the states (position and speed) of the AV and BVs at the kth time step, u(k) denotes the manoeuvres of BVs at the kth time step and T denotes the total number of time steps of this testing episode. With Markovian assumptions on the BVs’ manoeuvres, the probability of each testing episode in the NDE can be calculated as \(P({\bf{x}})=P({\bf{s}}(0))\times {\prod }_{k=0}^{T}P({\bf{u}}(k)| {\bf{s}}(k))\), and then the AV crash rate can be measured by the Monte Carlo method31 as

$$P(A)={{\mathbb{E}}}_{{\bf{x}} \sim P({\bf{x}})}[P(A| {\bf{x}})]\approx \frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}P(A| {{\bf{x}}}_{i}),{{\bf{x}}}_{i} \sim P({\bf{x}}),$$
(2)

where A denotes the crash event, n denotes the total number of testing episodes, i = 1, ..., n denotes the ith testing episode, and xi ∼ P(x) indicates that the variables are distributed as P(x). Here a crash is defined as a contact that the subject vehicle (for example, AV) has with an object, either moving or fixed, at any speed resulting in fatality, injury or property damage39. As A is a rare event, obtaining a statistically reliable estimation requires a large number of tests (n), which leads to the severe inefficiency issue of the NDE testing approach, as pointed out in ref. 1.
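A minimal sketch of the crude Monte Carlo estimator in equation (2) is given below, assuming a hypothetical run_nde_episode function that samples one episode in the NDE and reports whether the AV crashed (that is, P(A|x_i) is replaced by the crash indicator for simplicity).

```python
def estimate_crash_rate_nde(run_nde_episode, n):
    """Crude Monte Carlo estimate of P(A) in equation (2).

    run_nde_episode : callable that samples x_i ~ P(x) by simulating one
                      episode in the NDE and returns 1 if the AV crashed,
                      0 otherwise (hypothetical interface)
    n               : total number of testing episodes
    """
    crashes = sum(run_nde_episode() for _ in range(n))
    return crashes / n
```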

To address this inefficiency issue, the key is to generate an intelligent driving environment, where BVs can be controlled purposely to test the AV unbiasedly and efficiently. In essence, testing an AV in the intelligent driving environment is to estimate P(A) in equation (2) by the importance sampling method31 as

$$P(A)={{\mathbb{E}}}_{{\bf{x}} \sim q({\bf{x}})}[P(A| {\bf{x}})\times {W}_{q}({\bf{x}})]\approx \frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}P(A| {{\bf{x}}}_{i})\times {W}_{q}({{\bf{x}}}_{i}),{{\bf{x}}}_{i} \sim q({\bf{x}}),$$
(3)

where q(x) denotes the underlying distribution of BVs’ manoeuvres in the intelligent testing environment, and Wq(x) is the likelihood of each testing episode as

$${W}_{q}({\bf{x}})=\frac{P({\bf{x}})}{q({\bf{x}})}=\mathop{\prod }\limits_{k=0}^{T}\left[\frac{P({\bf{u}}(k)| {\bf{s}}(k))}{q({\bf{u}}(k)| {\bf{s}}(k))}\right].$$
(4)

According to the importance sampling theory31, the unbiasedness of the estimation in equation (3) can be guaranteed if q(x) > 0 for any x such that P(A|x)P(x) > 0. To optimize the estimation efficiency, the importance function q(x) needs to minimize the estimation variance

$${\sigma }_{q}^{2}={{\mathbb{E}}}_{q}\left({P}^{2}(A{\rm{| }}{\bf{x}})\times {W}_{q}^{2}({\bf{x}})\right)-{P}^{2}(A).$$
(5)

Therefore, the generation of the intelligent testing environment is formulated as a sequential MDP problem of the BVs’ manoeuvres, that is, determining q(u(k)|s(k)) to minimize the estimation variance \({\sigma }_{q}^{2}\) in equation (5). However, how to solve such a sequential MDP problem associated with rare events and high-dimensional variables remains a highly challenging problem, and most existing importance-sampling-based methods suffer from the curse of dimensionality40, where the estimation variance increases exponentially with the dimensionality. In our previous study14, we found that the curse of dimensionality could be addressed theoretically by applying sparse adversarial control to the naturalistic distribution. However, only a model-based method with handcrafted heuristics was utilized for conducting the sparse adversarial control, which suffers from substantial spatiotemporal limitations, and how to leverage AI techniques to train the BVs to truly learn the testing intelligence remained unsolved, which is the focus of this paper. More details of related work can be found in Supplementary Section 1.
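The importance-sampling estimator of equations (3) and (4) can be sketched analogously, again replacing P(A|x_i) with the crash indicator for simplicity and assuming a hypothetical per-episode log of the naturalistic and testing-environment probabilities.

```python
import numpy as np

def estimate_crash_rate_is(episodes):
    """Importance-sampling estimate of P(A) in equations (3) and (4).

    episodes : iterable of (crashed, p_steps, q_steps) tuples, where
               p_steps[k] = P(u(k) | s(k)) under the naturalistic distribution
               and q_steps[k] = q(u(k) | s(k)) under the testing environment
               (hypothetical logging interface)
    """
    estimates = []
    for crashed, p_steps, q_steps in episodes:
        w = np.prod(np.asarray(p_steps) / np.asarray(q_steps))  # W_q(x), eq. (4)
        estimates.append(float(crashed) * w)                    # P(A|x_i) * W_q(x_i)
    return float(np.mean(estimates))
```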

Formulation as a deep-reinforcement-learning problem

This section describes how to generate the intelligent testing environment as a DRL problem. As mentioned above, the goal is to minimize the estimation variance in equation (5) by training a policy π modelled by a neural network θ that can control BVs’ manoeuvres with the underlying distribution qπ(u|s). To keep the notation simple, we leave it implicit in all cases that π is a function of θ. An MDP usually consists of four key elements: state, action, state transition and reward. In this study, states encode information (position and speed) about the AV and surrounding BVs, actions include 31 discrete longitudinal accelerations ([−4, 2] with 0.2 m s−2 discrete resolution), left lane change and right lane change, and state transitions define the probability distribution over next states that are also dependent on the AV manoeuvre. Here we assumed that a lane-change manoeuvre of BVs would be initiated from its current position and completed in one second if a lane-change action was decided. Our framework is also applicable to more realistic and complex action settings.

For the corner-case-generation case study, we studied a three-lane highway driving environment, where eight critical BVs (that is, principal other vehicles or POVs) are controlled to interact with the AV for a certain distance (400 m) and each BV has 33 discrete actions at every 0.1 s. For the intelligent-testing-environment generation case study, to keep the runtime of the DRL small, we simplified the output of the neural network as the adversarial manoeuvre probability (επ ∈ (0, 1)) of the most critical POV, whereas the POV’s other manoeuvres are normalized by 1 − επ according to the naturalistic distribution and the other BVs’ manoeuvres keep following the naturalistic distribution. The adversarial manoeuvre and the POV are determined by the criticality measure. We note that the generalization of this work to multiple POVs is straightforward.

The reward function design is critical for the DRL problem41. As the goal of the intelligent testing environment is to minimize the estimation variance in equation (5), we derived the objective function of the DRL problem as

$$\mathop{\min }\limits_{q}{\sigma }_{q}^{2}=\mathop{\max }\limits_{\pi }\left\{-{{\mathbb{E}}}_{{{q}_{\pi }}_{{\rm{b}}}}\left({{\mathbb{I}}}_{A}({\bf{x}})\times {W}_{{q}_{\pi }}({\bf{x}})\times {W}_{{{q}_{\pi }}_{{\rm{b}}}}({\bf{x}})\right)\right\},$$
(6)

where \({{\mathbb{I}}}_{A}\) is the indicator function of the crash event and πb denotes the behaviour policy of the DRL. During the training process, the training data are collected by the behaviour policy, and the expectation in equation (6) is approximated by its Monte Carlo estimate, so we can obtain the reward function as

$$r({\bf{x}})=-\,{{\mathbb{I}}}_{A}({\bf{x}})\times {W}_{{q}_{\pi }}({\bf{x}})\times {W}_{{{q}_{\pi }}_{{\rm{b}}}}({\bf{x}}),$$
(7)

which is theoretically consistent with the objective function. As it is mainly based on the importance sampling theory, the reward function is also applicable to other rare-event estimation problems with high-dimensional variables. To limit the scale of the error derivatives42, we rescaled and clipped the function, resulting in a reward function bounded within [−100, 100], where the scaling constants could be automatically determined during the learning process.

With the state, action, state transition and reward function, the intelligent-testing-environment generation problem becomes a DRL problem. However, as the gradient estimation of neural networks would suffer from the large variance due to the rareness of informative data, applying learning-based techniques for safety-critical systems is highly challenging because of the curse of rarity. It is hard—or even empirically infeasible—to learn an effective policy if directly applying DRL approaches.

Dense deep reinforcement learning

To address this challenge, we propose the D2RL approach in this paper. Specifically, according to the policy gradient theorem27, the policy gradient of the objective function for DRL approaches can be estimated as

$$\nabla \hat{J}(\theta )={\hat{q}}_{\pi }({S}_{t},{A}_{t}\,)\frac{\nabla \pi ({A}_{t}\,|{S}_{t},\theta )}{\pi ({A}_{t}\,|{S}_{t},\theta )},$$
(8)

where θ denotes the parameters of the policy, qπ(St,At) denotes the state–action value, St and At are samples of the state and action under the policy at time t, and \({\hat{q}}_{\pi }({S}_{t},{A}_{t})\) is an unbiased estimation of qπ(St,At), that is, \({{\mathbb{E}}}_{\pi }[{\hat{q}}_{\pi }({S}_{t},{A}_{t})]={q}_{\pi }({S}_{t},{A}_{t})\). In contrast, for the D2RL approach, we propose to estimate the policy gradient as

$${{\rm{\nabla }}}_{{\rm{d}}{\rm{e}}{\rm{n}}{\rm{s}}{\rm{e}}}\,\hat{J}(\theta )={\hat{q}}_{\pi }({S}_{t},{A}_{t})\frac{{\rm{\nabla }}\pi ({A}_{t}\,|{S}_{t},\theta )}{\pi ({A}_{t}\,|{S}_{t},\theta )}{{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}},$$
(9)

where \({{\mathbb{S}}}_{{\rm{c}}}\) denotes the set of critical states and \({{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}}\) denotes the indicator function.

Here, a state is defined as an uncritical state if \({v}_{\pi }(s)={q}_{\pi }(s,a),\forall \,a\), where s denotes the state, a denotes the action and \({v}_{\pi }(s)\mathop{=}\limits^{{\rm{def}}}{{\mathbb{E}}}_{\pi }({q}_{\pi }(s,a))\) denotes the state value, so the set of critical states can be defined as \({{\mathbb{S}}}_{{\rm{c}}}\mathop{=}\limits^{{\rm{def}}}\{s|{v}_{\pi }(s)\ne {q}_{\pi }(s,a),\exists \,a\}\). This means that a state is uncritical if no action (for example, the BVs’ manoeuvres) taken from the current state affects the expected value of that state (for example, the AV’s crash probability within a specific time horizon from the current state). We note that this definition is primarily for the theoretical analysis to be clean and is not strictly required to run the algorithm in practice. For example, a state can be practically identified as uncritical if the current action will not substantially affect the expected value of the state. For specific applications, the critical states can be approximately identified based on domain-specific models or physics. For example, the criticality measure12,13, which is an outer approximation of the AV crash rate within a specific time horizon (for example, one second), is utilized in this study to demonstrate the approach for the AV testing problem. We note that many other safety metrics26 could also be applicable, such as the model predictive instantaneous safety metric43 developed by the National Highway Traffic Safety Administration in the United States and the criticality metric44 developed by the PEGASUS project in Germany, as long as the identified set of states covers the critical states. A more general theoretical analysis can be found in Supplementary Section 2a.
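For illustration, the masked policy-gradient term of equation (9) could be realized as below (a PyTorch sketch with illustrative names, not the authors' implementation): the step is simply skipped when the state is not in the critical set, and otherwise contributes the standard likelihood-ratio gradient scaled by the value estimate.

```python
import torch

def d2rl_policy_gradient_step(logits, action, q_hat, state_is_critical):
    """Single-step contribution to the dense policy gradient of equation (9).

    logits            : policy-network output for state S_t (requires grad)
    action            : index of the sampled action A_t
    q_hat             : unbiased estimate of q_pi(S_t, A_t), a float
    state_is_critical : True if S_t belongs to the critical set S_c
    """
    if not state_is_critical:
        return None                     # indicator is zero: the step contributes nothing
    log_prob = torch.log_softmax(logits, dim=-1)[action]
    # Surrogate loss whose gradient equals -q_hat * grad log pi(A_t | S_t, theta),
    # so gradient descent on it performs ascent on the objective.
    loss = -q_hat * log_prob
    loss.backward()                     # accumulates the dense policy gradient
    return loss.item()
```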

Then, we have the following theorem, and the proof can be found in Supplementary Information.

Theorem 1

The policy gradient estimator of D2RL has the following properties:

  (1)

    \({{\mathbb{E}}}_{\pi }[{\nabla }_{{\rm{dense}}}\,\hat{J}(\theta )]={{\mathbb{E}}}_{\pi }[\nabla \hat{J}(\theta )]\);

  (2)

    \({{\rm{Var}}}_{\pi }[{\nabla }_{{\rm{dense}}}\,\hat{J}(\theta )]\le {{\rm{Var}}}_{\pi }[\nabla \hat{J}(\theta )]\); and

  (3)

    \({{\rm{Var}}}_{\pi }[{\nabla }_{{\rm{dense}}}\,\hat{J}(\theta )]\le {\rho }_{\pi }{{\rm{Var}}}_{\pi }[\nabla \hat{J}(\theta )]\), with the assumption

$${{\mathbb{E}}}_{\pi }[{\sigma }_{\pi }^{2}({S}_{t},{A}_{t}\,){{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}}]={{\mathbb{E}}}_{\pi }[{\sigma }_{\pi }^{2}({S}_{t},{A}_{t}\,)]{{\mathbb{E}}}_{\pi }[{{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}}],$$
(10)

where \({\rho }_{\pi }\mathop{=}\limits^{{\rm{d}}{\rm{e}}{\rm{f}}}{{\mathbb{E}}}_{\pi }({{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}})\in [0,1]\) is the proportion of critical states in all states under the policy π (for example, 1 − ρπ denotes the proportion of steps skipped in Fig. 2b and Table 1), and \({\sigma }_{\pi }^{2}({S}_{t},{A}_{t})\,=\)\({\left({\hat{q}}_{\pi }({S}_{t},{A}_{t})\frac{\nabla \pi ({A}_{t}|{S}_{t},\theta )}{\pi ({A}_{t}|{S}_{t},\theta )}\right)}^{2}\).

Theorem 1 proves that the D2RL approach has an unbiased and efficient estimation of the policy gradient compared with the DRL approach. To quantify the variance reduction of dense learning, we introduce the assumption in equation (10), which assumes that \({\sigma }_{\pi }^{2}({S}_{t},{A}_{t})\) is independent of the indicator function \({{\mathbb{I}}}_{{S}_{t}\in {{\mathbb{S}}}_{{\rm{c}}}}\). As both the policy and the state–action values are randomly initialized, the values of \({\sigma }_{\pi }^{2}({S}_{t},{A}_{t})\) are quite similar for all different states, so the assumption is valid at the early stage of the training process. Such variance reduction enables the D2RL approach to optimize the neural network, whereas the DRL approach would be stuck at the beginning of the training process.

We then consider the influence of dense learning on estimating \({\hat{q}}_{\pi }({S}_{t},{A}_{t})\) with bootstrapping, which can guide the information propagation in the state–action space. For example, the fixed-length advantage estimator (\({\hat{A}}_{t}\)) is commonly used for the PPO algorithm30 as

$${\hat{A}}_{t}={\delta }_{t}+(\gamma \lambda ){\delta }_{t+1}+\cdots +{(\gamma \lambda )}^{L-t+1}{\delta }_{L-1},$$
(11)

where δt = rt + γV(st+1) − V(st), V(st) is the state–value function, γ denotes the discount rate, λ denotes the smoothing parameter of the advantage estimator and L denotes the fixed length. For safety-critical applications, the immediate reward is usually zero (that is, rt = 0), and most state–value functions are determined by initial random values without any valuable information because of the rarity of events. Bootstrapping with such noisy state–value functions will not be effective in the learning process. By editing the Markov chain, only the critical states will be considered. Then, the advantage estimator will be essentially modified as

$${\bar{A}}_{t}={\delta }_{z\left(t,0\right)}+(\gamma \lambda ){\delta }_{z\left(t,1\right)}+\cdots +{(\gamma \lambda )}^{L-t+1}{\delta }_{z\left(t,L-1\right)},$$
(12)

where \({\delta }_{z(t,j)}={r}_{z(t,j)}+\gamma V({s}_{z(t,j+1)})-V({s}_{z(t,j)})\), j is a natural number and z is a function such that z(t, 0) = t and \(z(t,j)=\mathop{\min }\limits_{i}\{{s}_{i}\in {{\mathbb{S}}}_{{\rm{c}}}|i > z(t,j-1)\},j > 0\), where i is a natural number. In essence, this is a state-dependent temporal-difference learning, where only the values of critical states are utilized for bootstrapping. As the critical states have much higher probabilities of leading to safety-critical events, the reward information can be propagated to these critical state values more easily. Utilizing the values of these critical states, the bootstrapping can propagate the information from the safety-critical events to the state–action space more efficiently. This mechanism helps avoid the interference from the large amount of noisy data and focuses the policy on learning the sparse but valuable information. Because of the abovementioned variance reductions regarding the policy-gradient estimation and bootstrapping, the D2RL approach substantially improves the learning effectiveness compared with the DRL approach, enabling the neural network to learn from the safety-critical events.
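A sketch of this edited advantage estimator is shown below, assuming the indices of the critical states of the episode are available; the discount and smoothing parameters are illustrative values, and the truncation convention is simplified relative to equation (12).

```python
def edited_advantage(rewards, values, critical_idx, gamma=0.99, lam=0.95):
    """Advantage estimate along critical states only (cf. equation (12)).

    rewards, values : per-step rewards r_t and value estimates V(s_t)
    critical_idx    : sorted indices of critical states, starting at the step t
                      for which the advantage is computed
    gamma, lam      : discount and smoothing parameters (illustrative values)
    """
    adv, coeff = 0.0, 1.0
    for j, t in enumerate(critical_idx[:-1]):
        t_next = critical_idx[j + 1]                       # next critical state z(t, j+1)
        delta = rewards[t] + gamma * values[t_next] - values[t]   # delta_{z(t, j)}
        adv += coeff * delta
        coeff *= gamma * lam
    return adv
```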

Densifying the information is a natural way to overcome the challenges caused by the rarity of events. In the field of deep neural networks, connecting different layers of neural networks more densely has been demonstrated to produce better training efficiency and efficacy, that is, DenseNet45. Instead of connecting layers of neural networks, our approach densifies the information by connecting states more densely with safety-critical states, besides the natural connections provided by the state transitions. As safety-critical states have more connections with rare events, they contain more valuable information with less variance. By densifying the connections between safety-critical states with other states, we can better propagate the valuable information to the entire state space, which can substantially facilitate the learning process. This study proposed and demonstrated one specific realization of the dense-learning approach by approximately identifying uncritical states and connecting the remaining states directly. This can be further improved by more flexible and dense connections among safety-critical states and uncritical states. The connections can even be added in the form of curriculum learning46, which can guide the information propagation gradually. The measures for identifying critical states can also be further improved by involving more advanced modelling techniques.

Off-policy learning mechanism

We justify the off-policy learning mechanism in this section. The goal of the behaviour policy πb is to collect training data for improving the target policy π that can maximize the objective function in equation (6). To achieve this goal, it is critical to estimate the objective function accurately using the reward function in equation (7), which determines the calculation of the policy gradient. However, only episodes with crashes have non-zero rewards, so the objective function estimation suffers from a large variance, because of the rarity of crashes. Without an accurate estimation of the objective function, the training could be misled. According to the importance sampling theory, we have the following theorem, and the proof can be found in Supplementary Information.

Theorem 2

The optimal behaviour policy \({\pi }_{{\rm{b}}}^{* }\) that can minimize the estimation variance of the objective function has the following property:

$${q}_{{\pi }_{{\rm{b}}}^{* }}({\bf{x}})\propto \frac{{q}_{{\pi }^{* }}^{2}({\bf{x}})}{{q}_{\pi }({\bf{x}})},$$
(13)

where \({q}_{{\pi }^{* }}({\bf{x}})\) denotes the optimal importance sampling function that is unchanged during the training process, and the symbol ∝ means ‘proportional to’.

Theorem 2 finds that the optimal behaviour policy is nearly inversely proportional to the target policy, particularly at the beginning of the training process when qπ is far from \({q}_{{\pi }^{* }}\). If using on-policy learning mechanisms (\({q}_{{\pi }_{{\rm{b}}}}={q}_{\pi }\)), the behaviour policy would be far from optimality, which could mislead the training process and eventually cause underestimation issues. For example, if a target policy misses an action that could lead to a likely crash, an on-policy learning mechanism will never find this missing crash. More importantly, the on-policy mechanism could mislead the policy into purposely hiding the crashes that are difficult to evaluate, leading to severe underestimation in the safety performance evaluation.

We design an off-policy learning mechanism to address this issue, where a generic behaviour policy is designed and kept unchanged during the training process. Specifically, we determined a constant probability for the adversarial manoeuvre of the POV (that is, \({\varepsilon }_{{\pi }_{{\rm{b}}}}=0.01\)) and conducted the other manoeuvres with a total probability of 0.99, normalized according to the naturalistic distribution. This policy explores the state–action space using the naturalistic distribution most of the time and exploits the information of the model-based criticality measure that helps identify the POV and the adversarial manoeuvre. We note that although the optimal behaviour policy needs to be adaptively determined based on the target policy, as indicated in Theorem 2, an off-policy learning mechanism can provide a sufficiently good foundation for effective learning in this study. The behaviour policy is also not sensitive to the constant \({\varepsilon }_{{\pi }_{{\rm{b}}}}\); generally, a small value (for example, 0.1, 0.05 or 0.01) that balances exploration and exploitation would be effective in this study. Further improvement can be investigated in the future.
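The fixed behaviour policy can be sketched as follows; the interpretation that the adversarial manoeuvre is excluded from the renormalized naturalistic part is our assumption for the sketch, and the argument names are illustrative.

```python
import numpy as np

def behaviour_policy_sample(adversarial_action, naturalistic_probs, eps_b=0.01):
    """Sample one manoeuvre for the POV under the fixed behaviour policy.

    adversarial_action : index of the adversarial manoeuvre identified by the
                         criticality measure
    naturalistic_probs : naturalistic probabilities over all manoeuvres
    eps_b              : constant adversarial probability (0.01 in this study)
    """
    probs = np.asarray(naturalistic_probs, dtype=float).copy()
    probs[adversarial_action] = 0.0
    probs = (1.0 - eps_b) * probs / probs.sum()   # other manoeuvres share 0.99
    probs[adversarial_action] = eps_b             # adversarial manoeuvre gets 0.01
    return int(np.random.choice(len(probs), p=probs))
```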

Simulation settings

NDE simulator

To simulate the NDE, we developed a simulation platform based on the open-source traffic simulator SUMO. The scheme of the platform can be found in Supplementary Information. We utilized both the C++ code base and the TraCI interface to refine the SUMO simulator so that high-fidelity driving environments can be integrated. Specifically, we rewrote and recompiled the C++ code of SUMO to integrate the high-fidelity driving environments, including car-following and lane-changing behaviour models. Then, we utilized the TraCI interface to implement the intelligent testing environment, where, at selected moments, selected vehicles execute specific adversarial manoeuvres with a learned probability, following the policy obtained by the D2RL approach. We also synchronized the modified SUMO with the physical test tracks, exchanging information on BVs, the AV, traffic signals, high-definition maps and so on through the TraCI interface. To provide a training environment for the intelligent testing environment, we constructed a multi-lane highway driving environment and an urban driving environment, where all vehicles were controlled at 100-ms intervals.
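A minimal sketch of how BVs could be driven through the TraCI Python interface at 100-ms steps is shown below; the configuration file name, episode length and the interface of the learned policy are assumptions, and a real deployment would also need collision checks and lane-boundary handling.

```python
import traci  # SUMO's Python TraCI bindings

def run_episode(policy, sumo_cfg="nde.sumocfg", horizon=4000):
    """Drive one testing episode at 100-ms steps (config name is hypothetical)."""
    traci.start(["sumo", "-c", sumo_cfg, "--step-length", "0.1"])
    try:
        for _ in range(horizon):
            # Collect a simple state snapshot of all vehicles in the simulation.
            state = {vid: (traci.vehicle.getSpeed(vid),
                           traci.vehicle.getLanePosition(vid))
                     for vid in traci.vehicle.getIDList()}
            # The learned testing policy decides which POV acts and how (assumed interface).
            pov, manoeuvre = policy(state)
            if pov is not None:
                if manoeuvre == "left_lane_change":
                    traci.vehicle.changeLane(pov, traci.vehicle.getLaneIndex(pov) + 1, 1.0)
                elif manoeuvre == "right_lane_change":
                    traci.vehicle.changeLane(pov, traci.vehicle.getLaneIndex(pov) - 1, 1.0)
                elif manoeuvre is not None:  # a longitudinal acceleration in m/s^2
                    traci.vehicle.setSpeed(pov, traci.vehicle.getSpeed(pov) + 0.1 * manoeuvre)
            traci.simulationStep()
    finally:
        traci.close()
```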

Driving behaviour models in the NDE simulator

The default driving behaviour models of SUMO, which are simple and deterministic, cannot be utilized for safety testing and training of AVs because they are designed to be crash-free models. To address this issue, in this study, we constructed NDE models47 to provide naturalistic behaviours of BVs according to the large-scale naturalistic driving datasets (NDDs) from the Safety Pilot Model Deployment programme48 and the Integrated Vehicle-Based Safety System programme49 at the University of Michigan, Ann Arbor. At each simulation step, the NDE models provide distributions of each BV’s manoeuvres that are consistent with the NDD. Then, by sampling manoeuvres from these distributions, a testing environment that can evaluate the real-world safety performance can be generated. For the field testing at the ACM and Mcity, although the intelligent testing environment can accelerate the AV testing from about 10^7 loops of testing to only around 10^4 loops (Table 1), this still represents a substantial level of effort for an academic research group. To demonstrate our approach more efficiently, we therefore simplified the NDE models. Specifically, we modified the Intelligent Driver Model (IDM)50 and the Minimizing Overall Braking Induced by Lane change (MOBIL) model51 into stochastic models to construct the simplified NDE models. More details of the NDE models can be found in Supplementary Information.
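For illustration, a stochastic variant of the IDM car-following law could be obtained by sampling around the deterministic IDM acceleration, as sketched below; the parameter values and the Gaussian noise model are assumptions and do not represent the calibrated NDE behaviour models.

```python
import numpy as np

def idm_acceleration(v, v_lead, gap, v0=33.3, T=1.5, a_max=2.0, b=2.0, s0=2.0):
    """Deterministic IDM acceleration (standard textbook form; parameter values
    here are illustrative assumptions)."""
    s_star = s0 + v * T + v * (v - v_lead) / (2.0 * np.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** 4 - (s_star / max(gap, 0.1)) ** 2)

def stochastic_idm_acceleration(v, v_lead, gap, sigma=0.3):
    """Stochastic variant: sample around the IDM value and clip to the
    [-4, 2] m/s^2 action range used in this study."""
    a = idm_acceleration(v, v_lead, gap) + np.random.normal(0.0, sigma)
    return float(np.clip(a, -4.0, 2.0))
```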

D2RL architecture, implementation and training

The D2RL algorithm can be easily plugged into existing DRL algorithms by defining a specific environment with the dense-learning approach. Specifically, for existing DRL algorithms, the environment receives the decision from the DRL agent, executes the decision, and then collects observations and rewards at each time step, whereas for the D2RL algorithm, the environment collects the observations and rewards only for the critical states, as illustrated in Supplementary Section 3e. In this way, we can quickly implement the D2RL algorithm utilizing existing DRL platforms. In this study, we utilized the PPO algorithm implemented on the RLLib 1.2.0 platform52, trained in parallel on a high-performance computing cluster with 500 central-processing-unit cores and 3,500 GB of memory at the University of Michigan, Ann Arbor. We designed a three-layer fully connected neural network with 256 neurons in each layer and chose a learning rate of 10−4 and a discount factor of 1.0, in addition to the default parameters. Each central-processing-unit core collected 120 time steps of training data for all experiment settings in each training iteration, so a total of 60,000 time steps were collected in each training iteration. For the corner-case generation, the neural network’s output is the actions of the closest eight BVs, where each BV has a 33-action discrete space: left lane change, 31 discrete longitudinal accelerations ([−4, 2] with 0.2 m s−2 discrete resolution) and right lane change. For the intelligent-testing-environment generation, the neural network’s output is the adversarial manoeuvre probability (επ) of the POV, where the action space is επ ∈ [0.001, 0.999]. To further improve the data efficiency during the training process, we used the collected data with a resampling mechanism to train the neural network for multiple steps.
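A sketch of such an environment wrapper is given below (a gym-style reset/step interface is assumed; this is not the authors' implementation): the wrapper advances the underlying simulation through uncritical states, so the learner only observes, acts on and receives rewards at critical states.

```python
class DenseLearningEnv:
    """Wrapper that exposes only critical states to the DRL agent.

    The underlying simulator is assumed to provide reset()/step(action) with a
    gym-style (state, reward, done, info) return, and to interpret action=None
    as 'no adversarial control; BVs follow their naturalistic behaviour'.
    """

    def __init__(self, sim, is_critical):
        self.sim = sim                  # underlying NDE simulator (assumed interface)
        self.is_critical = is_critical  # criticality test on raw states

    def reset(self):
        state = self.sim.reset()
        state, _, _, _ = self._skip_to_critical(state)
        return state

    def step(self, action):
        state, reward, done, info = self.sim.step(action)
        if done:
            return state, reward, done, info
        state, extra_reward, done, info = self._skip_to_critical(state)
        return state, reward + extra_reward, done, info

    def _skip_to_critical(self, state):
        reward, done, info = 0.0, False, {}
        while not done and not self.is_critical(state):
            # Uncritical states are skipped: advance with naturalistic behaviour only.
            state, r, done, info = self.sim.step(None)
            reward += r
        return state, reward, done, info
```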

Field test settings

Augmented-reality testing platform

We implemented the augmented-reality testing platform at the ACM, one of the world’s premier test tracks for AVs, located in Ypsilanti, Michigan, and at the Mcity test track, the world’s first purpose-built test track for AV testing. In this study, we utilized the 4-km highway loop featuring two and three lanes and both exit and entrance ramps to create various merging opportunities, as well as the Mcity urban driving environment, including various types of highway, roundabout, urban street and so on, as shown in Supplementary Section 3f. We constructed digital twins of the ACM and Mcity based on the NDE simulator and the available high-definition maps. To synchronize information between the simulation and the physical test track, we utilized the dedicated short-range communications (DSRC) roadside units installed in the test tracks. These DSRC-based devices can communicate with AVs via the 802.11p and SAE J2735 protocols through the immediate-forward-messaging and forwarding functions. Specifically, we utilized the immediate-forward-messaging function to broadcast proxy basic safety messages (BSMs) containing the virtual BVs’ identifier, latitude, longitude, altitude and so on to the physical AV, and the forwarding function to forward incoming BSMs of the AV to the digital twins. After receiving the BSMs of the AV, we synchronized the AV states in the simulation world, where the BVs were controlled by the intelligent testing environment. More details of the platform can be found in ref. 24. We implemented the system with an average 33-ms communication delay, which is acceptable for AV testing and can be further improved with advanced wireless communication techniques.

Augmented image rendering

We use augmented-reality techniques to render and blend virtual objects (for example, vehicles) onto the camera view of the ego vehicle. Given a background three-dimensional model with its six-degrees-of-freedom pose in the world coordinate system, we perform a two-stage transformation to project the model onto the onboard camera image: (1) from the world coordinate system to the ego-vehicle coordinate system, and (2) from the ego-vehicle coordinate system to the onboard camera coordinate system. In the first transformation, the ego-vehicle pose and location are obtained from the real-time signal of the onboard high-precision real-time kinematic (RTK) positioning. In the second transformation, the projection is based on the pre-calibrated camera intrinsic and extrinsic parameters. We also perform relighting on the rendered layer to harmonize the visual quality of the blending result. The augmented view is generated by linear blending of the rendered foreground layer, the camera’s background layer and the rendered alpha matte. On top of the blending result, a weather-control layer is further added to simulate different weather conditions, for example, rain, snow and fog. We implemented the augmented rendering based on pyrender53. An additional validation of the augmented image rendering can be found in Supplementary Section 4f.
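The two-stage projection can be sketched with homogeneous transforms and a pinhole intrinsic matrix, as below; the matrices are assumed to be given by the RTK pose and the calibration, and lens distortion and relighting are omitted from this sketch.

```python
import numpy as np

def project_to_image(p_world, T_world_to_ego, T_ego_to_cam, K):
    """Project a 3D point from world coordinates to pixel coordinates through
    the two-stage transformation described in the text.

    p_world        : (3,) point in world coordinates
    T_world_to_ego : (4, 4) homogeneous transform from world to ego-vehicle frame
                     (derived from the RTK pose)
    T_ego_to_cam   : (4, 4) homogeneous transform from ego frame to camera frame
                     (from the extrinsic calibration)
    K              : (3, 3) camera intrinsic matrix
    """
    p = np.append(np.asarray(p_world, dtype=float), 1.0)   # homogeneous coordinates
    p_cam = T_ego_to_cam @ (T_world_to_ego @ p)             # world -> ego -> camera
    u, v, w = K @ p_cam[:3]                                 # pinhole projection
    return np.array([u / w, v / w])                         # pixel coordinates
```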

AV under test

As the AV under test, we used a retrofitted Lincoln MKZ from the Mcity Test Facility at the University of Michigan, Ann Arbor. The vehicle was equipped with multiple sensors, computing resources (two Nexcom Lumina units) and drive-by-wire capabilities provided by Dataspeed Inc. Specifically, the sensors include a PointGrey camera, a Velodyne 32-channel LiDAR, Delphi radars, an OxTS RT3003 RTK GPS, an Xsens MTi GPS/inertial measurement unit and so on. We implemented the vehicle with the Robot Operating System-based open-source software Autoware.AI23, which provides full-stack software for highly automated driving functions, including localization, perception, planning, control and so on. We then integrated the AV with the augmented-reality testing platform to evaluate the AV’s safety performance. An illustration of the system framework can be found in Supplementary Information. Specifically, we modified the AV localization component to utilize the high-definition map and high-accuracy RTK for obtaining the current pose and velocity. The surrounding vehicles’ BSMs were directly obtained from the simulation through wireless communications. To generate the AV’s future trajectory, we applied OpenPlanner 1.1354 as the decision module, an advanced planning algorithm including global and local path planning. We applied the pure pursuit algorithm to convert the planned trajectory into velocity and yaw-rate commands and then used a proportional–integral–derivative controller provided by Dataspeed Inc. to further convert them into the vehicle’s by-wire control commands, that is, steering angle, throttle and brake percentages.
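A textbook pure-pursuit conversion from a look-ahead point on the planned trajectory to velocity and yaw-rate commands is sketched below for illustration; it is not the exact Autoware implementation, and the selection of the look-ahead point is assumed to be done upstream.

```python
import numpy as np

def pure_pursuit_command(lookahead_point, speed):
    """Convert a look-ahead point (in the vehicle frame, x forward, y left) into
    the velocity and yaw-rate commands tracked by the downstream PID controller.
    A textbook pure-pursuit law, not the exact Autoware implementation."""
    x, y = lookahead_point
    ld = np.hypot(x, y)                  # look-ahead distance
    curvature = 2.0 * y / (ld ** 2)      # pure pursuit: kappa = 2 * sin(alpha) / ld
    yaw_rate = speed * curvature         # commanded yaw rate at the given speed
    return speed, yaw_rate
```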