1 Introduction

In recent years, automated driving has become a hot topic in the automotive and transportation industries. Notable companies, such as Google's Waymo, Baidu, and Cruise Automation, have launched self-driving cars with SAE Level 4 automation, many of which already provide robotaxi ride-hailing services on open roads. Meanwhile, Advanced Driving Assistance Systems (ADAS) in production cars, such as Lane Keeping Assist (LKA) and NIO's Navigate on Pilot (NOP), have been widely recognized by customers for enhancing safety and comfort.

Research on earlier-introduced ADAS functions shows that personalization in driving assistance is crucial for improving both comfort and safety [1]. Until fully automated driving becomes feasible for mass production, human drivers will still need to sit behind the wheel to supervise, or even handle, daily driving tasks, even if parts of these tasks are assisted by automation. Therefore, a well-designed driving automation strategy should be compatible with, or even aligned to, human preferences in order to better support human drivers. Regardless of the level of driving automation, if the assistance provided in fulfilling driving tasks is congruent with a driver's personal style in manual driving, customer acceptance of the assistance functions is far more likely.

Lane change is a common but complicated driving maneuver, for which various assistance systems have been developed, such as Blind Spot Detection (warning only), Lane Change Assist with Turn Assist, and Auto Lane Change (ALC). Among them, ALC in particular needs to be designed in line with drivers' preferences or personal maneuvering styles. Even in hands-free driving mode, the driver may still experience dynamic and stressful highway lane-changing scenarios on the edge of incidents or even accidents. For lane change assistance, a fundamental motivation behind ADAS personalization is that different drivers have different standards of safety perception, e.g., the acceptable gap, relative distance, and approaching rate. A universal assist design for all drivers may be too conservative for aggressive drivers, or the other way around [2]. Such a mismatch may prevent the driver from activating the function at all. Moreover, if the driver cannot understand the decisions made by the ADAS function, serious accidents may happen, especially at high speeds. Therefore, it is crucial to design a safe, comfortable, and personalized algorithm for automated lane-change maneuvers.

Personalization is a feasible way to enhance drivers' trust in automation. A straightforward way to incorporate personalized styles in driving automation is to imitate, or even replicate, a specific driver's operations in the addressed scenarios, that is, to model the driver's behavior. However, a generalized driver model for solving various driving tasks, though attractive, is not yet available. Given the wide variation in driving conditions, driver modeling is usually done scenario by scenario, i.e., a holistic driver model is an integration of several sub-models of driver behaviors, e.g., car following, lane change, and steering [3].

For lane change decision modeling, there are two popular families of methods: mechanism-based and learning-based models [4,5,6]. Among mechanism-based models, Gipps' model [7] summarized the lane change decision-making process as a flowchart, in which factors that affect the decision can be added or replaced. Although it does not consider variation in driver behavior, it has had a profound influence on subsequent research. In [8], lane change behaviors are classified into three categories. Although several factors, such as motivation, advantage, and urgency, are considered, the final execution still depends on the availability of the gap between the preceding and following vehicles in the target lane. The models applied in FRESIM [9] and NETSIM [10] are similar, differing only in how the acceptable gap is calculated. Kesting [11] proposed a general lane change decision-making model, MOBIL, in which the IDM model [12] is used to compare the total deceleration of all surrounding vehicles before and after the lane change, and the decision is then made. MOBIL includes a politeness factor to reflect cooperation between drivers. Additionally, in merging and congested scenarios, game theory has been used to model the lane change decision while considering interaction with other vehicles [13, 14]. However, these approaches may not be easily applied to scenarios with several surrounding cars.

Recently, many learning-based approaches have been proposed for unmanned vehicles, e.g., road vehicles [15], surface vehicles [16], and aerial vehicles [17, 18]. In terms of lane change decision modeling for automated driving, Vallon et al. [19] propose a data-driven approach that captures the lane change decision behavior of a human driver with Support Vector Machine (SVM) classifiers; the results show that the personalized algorithm can reproduce the behaviors of different drivers without explicit initiation. Mirchevska et al. [20] design an RL agent using a Deep Q-Network that drives as closely as possible to the desired velocity. Hoel et al. [21] train a Deep Q-Network agent for a truck-trailer combination in highway driving, which completes overtaking maneuvers better than the commonly used reference model consisting of IDM and MOBIL. Learning-based approaches can include more influential factors in lane change decision modeling, but capturing driver preferences requires far more lane change data for each particular driver. These approaches are inherently data-hungry, and the data size, coverage, and level of detail determine the model scope and the potential application domains. There are already some public driving-study datasets, e.g., the Next Generation SIMulation (NGSIM) program launched by the Federal Highway Administration (FHWA) [22] and the highD dataset published by RWTH Aachen University [23]. Whether the data are collected by cameras mounted on hovering drones or cameras fixed on traffic sign poles, only a very limited range of driving trajectories is available for any specific vehicle-driver combination. Based on these data, most previous studies only consider the basic problem in lane change decision modeling, i.e., general behaviors for safety and efficiency. Due to the lack of driver-specific data, the personalized preferences of in-vehicle drivers have not been well studied.

To overcome these limits, this work focuses on personalized automated lane change decision-making in a two-lane highway scenario. Figure 1 shows how the personalized decision algorithm is built, from raw data collection and analysis, through reinforcement learning (RL) algorithm design, to validation. More specifically, the RL agents are designed to make lane change decisions based on environmental perception and a personalized reward function. The main contributions of this paper are two-fold.

  1. By analyzing the simulator driving data, three effective indicators of driver lane change decision preferences are determined, i.e., the time to collision with the front car in the current lane (tf), the time to collision with the front car in the target lane (tnf), and the relative speed with respect to the rear car in the target lane (dvnb).

  2. Based on these three indicators of driver preferences, a personalized decision-making algorithm is proposed using a Deep Q-Network. The comparative results show that the proposed algorithm performs better than a benchmark algorithm with a commonly used greedy policy.

Fig. 1

The proposed algorithm structure of personalized lane change decision

The rest of the paper is organized as follows. Section 2 formulates the lane change decision problem and introduces the basics of reinforcement learning. Section 3 presents the driver-in-the-loop experiments and driver preference analysis. The RL-based lane change decision design is detailed in Section 4, while the results are summarized in Section 5. Finally, the conclusion and some potential future work are given in Section 6.

2 RL formulation of decision problem

For ADAS or automated driving, making a lane change decision with multiple surrounding cars is a complex task, especially when driving personalization is considered. As shown in Fig. 1, the problem is tackled in three steps. Step 1 obtains the human driving data and personalization indicators through driver-in-the-loop experiments. Steps 2 and 3 design and validate the personalized RL agents. The various personalization indicators need to be considered simultaneously, since different trade-offs among them reflect different decision preferences. At the same time, the algorithm should adapt to the practical environment conditions and the available driver data. Therefore, an RL approach is adopted to model driver preferences in lane change decisions.

2.1 Lane change decision problem

In this RL formulation of the lane change decision task, a Deep Q-Network is used to learn the state-action value function Q(s,a) associated with the personalized reward. As a typical example, a two-lane highway scenario with three surrounding cars at random speeds is considered, as shown in Fig. 2. The cars are denoted as follows: the ego car (Cego), the front car in the current lane (Cf), the front car in the target lane (Cnf), and the rear car in the target lane (Cnb). Considering the influences of the surrounding cars, the ego car's task is to decide whether and when to change from the current lane to the target lane.

Fig. 2

Lane change decision scheme in a two-lane driving scenario

2.2 Reinforcement learning

In an RL problem, an agent selects an action a based on the current environment state s and the policy π; the environment then transitions to a new state s′ and returns a reward r. The goal is to find an optimal policy π that maximizes the cumulative reward Rt, defined as

$$ R_{t} = \sum\limits_{k=0}^{\infty}\gamma^{k}r_{t+k}, $$
(1)

where rt+k is the reward returned at step t + k, and γ is a discount factor, γ ∈ [0,1].

The state-action value function Q(s,a) evaluates the expected cumulative reward of the agent when selecting action a in state s, that is

$$ Q(s,a) = E[R_{t} \vert s_{t}=s,a_{t}=a]. $$
(2)

Q-learning is a classical algorithm for problems with finite state and action spaces, in which the state-action values are stored in a Q-table. The optimal state-action value function in Q-learning satisfies

$$ Q^{*}(s,a) = E[r+ \gamma \max\limits_{a'}Q^{*}(s',a') \vert s_{t}=s, a_{t}=a]. $$
(3)

When Q*(s,a) is known, the optimal policy is to select the action a that maximizes Q*(s,a).

However, if the state space is continuous, it is impractical to store all the state-action values in a table. To handle this, the Deep Q-Network (DQN) [24] is adopted, which approximates the optimal state-action value Q*(s,a) with a nonlinear estimator Q(s,a;𝜃). The network weights 𝜃 are updated during training to minimize the following loss function

$$ L(\theta) = E[(r+ \gamma \max\limits_{a'}Q(s',a';\theta)-Q(s,a;\theta))^{2}]. $$
(4)

To avoid the training instability caused by using the same network for both the prediction and the target, a separate target network with weights 𝜃⁻ is introduced; its weights are replaced by the prediction network weights 𝜃 every fixed number of steps. The final loss function is then defined as

$$ L(\theta) = E[(r+ \gamma \max\limits_{a'}Q(s',a';\theta^{-})-Q(s,a;\theta))^{2}]. $$
(5)
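For concreteness, the loss in (5) can be computed for a sampled mini-batch roughly as in the following minimal PyTorch sketch; the tensor shapes and network objects are assumptions, not the authors' implementation, and terminal-state masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def dqn_loss(pred_net, target_net, batch, gamma):
    """Loss of Eq. (5) for a sampled mini-batch (terminal masking omitted).

    `batch` is assumed to contain: states (B, 8) float, actions (B,) int64,
    rewards (B,) float, next_states (B, 8) float.
    """
    states, actions, rewards, next_states = batch

    # Q(s, a; theta) of the prediction network for the actions taken
    q_pred = pred_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # max_a' Q(s', a'; theta^-) from the frozen target network
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values

    # TD target: r + gamma * max_a' Q(s', a'; theta^-)
    td_target = rewards + gamma * q_next

    return F.mse_loss(q_pred, td_target)
```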

3 Driver preferences analysis

To model driver preferences, an ideal dataset would come from several drivers' naturalistic driving on public roads. However, lane change is a highly dynamic and time-critical process, and the exact timing of each lane change decision is difficult to record in naturalistic data. As an improved way of data collection, experimental driving on test roads better reproduces realistic driving conditions and lets drivers maneuver the vehicle almost as they usually do, but only if the surrounding cars (usually two or three additional vehicles) can be coordinated well by human or robot drivers. Even so, with limited time and financial budget, experimental driving cannot collect enough data on lane change decision strategies, and the safety risk is extremely high because multiple vehicles are involved at high speeds. Therefore, to analyze driver preferences in lane change decisions, a moving-base driving simulator is adopted for the original data collection.

3.1 Driver-in-the-loop experiments

A driver-in-the-loop (DIL) experiment environment is built on a 6-Degrees-of-Freedom (6-DoF) driving simulator, which provides a realistic driving experience, as shown in Fig. 3. The two-lane highway driving scenario is constructed in the simulation software TASS PreScan, while the surrounding cars with random constant speeds are controlled by a MATLAB/Simulink model. The ego car is controlled by the human driver with a real steering wheel and gas/brake pedals. A button on the steering wheel records the timestamp of every lane change initiation, while the other recorded data include the positions and speeds of all vehicles throughout the experiment.

Fig. 3

(a) The 6-DoF driving simulator. (b) Lane change scenario in Prescan

Ten drivers, aged between 20 and 26, are invited to participate in the DIL experiments. Each driver is asked to perform the lane change maneuver 50 times on four road sections with different speed limits, i.e., 60 kph, 70 kph, 80 kph, and 90 kph. Since not all drivers can exactly follow the speed limits, a speed error of up to 5 kph over the limit is still considered acceptable.

According to existing research on naturalistic driving data [25, 26], many factors affect drivers' lane change decisions, such as relative velocity, time to collision (TTC), and relative distance. For a safe lane change, a driver should avoid collisions with the surrounding cars. For the front cars in the current and target lanes, drivers mainly judge whether a collision will occur from the relative speed and distance, which can be summarized by the TTC values tf and tnf, respectively. For the rear car in the target lane, the approaching rate is adopted as the judgment indicator, expressed as the relative speed dvnb. Therefore, a driver's style in lane change decisions corresponds to a particular combination of these three personalization indicators, i.e., tf, tnf, and dvnb.

Based on the simulator driving data, Fig. 4 shows the statistical results of the three indicators in different speed ranges. All three indicators are positively correlated with the velocity of the ego car, ve, which is understandable given the driver's risk perception at different speeds. Further, correlation analysis is used to check the indicators' relationships with ve, and the detailed results for each driver are given in Table 1. The results reveal a linear correlation between ve and the three indicators (p < 0.05) for 80% of the drivers. Therefore, driver personalization in lane change decisions can be defined using these three indicators, and the personalization indicator set is

$$ I_{dp} = [t_{f}, t_{nf}, dv_{nb}]^{T}. $$
(6)
Fig. 4

The statistical results of lane change decision data in DIL experiment in different speed ranges. (blue lines represent median values of indicators in different speed ranges; red triangles represent mean values.)

Table 1 Correlation analysis for all drivers (tf / tnf / dvnb)

These decision points are fitted with linear regression as

$$ I_{dp} = Av_{e} + b , $$
(7)

which will be used as reference lines in the reward function design of Section 4.
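For illustration, the reference lines in (7) can be fitted per driver with ordinary least squares; the sketch below assumes the decision-point samples are available as NumPy arrays, and the variable and function names are hypothetical.

```python
import numpy as np

def fit_reference_lines(v_e, indicators):
    """Fit I_dp = A * v_e + b by ordinary least squares, Eq. (7).

    v_e:        (N,) ego speeds recorded at the lane change decisions
    indicators: (N, 3) columns [t_f, t_nf, dv_nb] at those decisions
    Returns slopes A (3,) and intercepts b (3,), one pair per indicator.
    """
    X = np.column_stack([v_e, np.ones_like(v_e)])  # design matrix [v_e, 1]
    coeffs, *_ = np.linalg.lstsq(X, indicators, rcond=None)
    return coeffs[0], coeffs[1]

def reference(v, A, b):
    """Reference indicator values at ego speed v (the reference lines)."""
    return A * v + b
```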

To justify the effectiveness and rationality of (7), the naturalistic driving dataset highD is adopted to illustrate the universality of the experimental samples. HighD (the Highway Drone dataset) is a large-scale naturalistic vehicle trajectory dataset recorded on German highways, covering over 110 thousand vehicles, 44.5 thousand driven kilometers, and 147 driven hours [23]. A total of 1,469 lane change maneuvers are extracted, of which 297 cases matching the scenario in Fig. 2 are finally selected. The scatter diagram and statistical summary of ego speed and the three personalization indicators are given in Fig. 5, together with the corresponding correlation analyses. Similar to Fig. 4, the linear correlation between the indicators and the ego speed, with all significance levels p < 0.05, confirms that (7) also holds in naturalistic driving.

Fig. 5

The statistical results of lane change data extracted from HighD dataset. (a) The scattered points and regression curves via ordinary least square (OLS). (b) The statistical results in different speed ranges. (blue lines represent median values of indicators in different speed ranges; red triangles represent mean values.)

3.2 Drivers with typical preferences

Based on the DIL experiment data, the drivers' lane change decisions are clustered into three groups, i.e., Defensive, Normal, and Aggressive, as shown in Fig. 6. When initiating a lane change maneuver, a more aggressive driver corresponds to smaller TTCs (tf, tnf) and a smaller relative speed (dvnb).
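The paper does not state which clustering method is used; as a hedged illustration, a K-means grouping of the decision points into three styles might look as follows, with all names as placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_driver_styles(decision_points, n_styles=3, seed=0):
    """Group lane change decision points into style clusters.

    decision_points: (N, 3) array of [t_f, t_nf, dv_nb] per decision.
    K-means is only one possible choice; the paper reports the three
    resulting groups (Defensive, Normal, Aggressive) without naming
    the algorithm.
    """
    # Standardize so TTCs (seconds) and relative speed are comparable
    mu = decision_points.mean(axis=0)
    sigma = decision_points.std(axis=0)
    z = (decision_points - mu) / sigma

    km = KMeans(n_clusters=n_styles, n_init=10, random_state=seed)
    labels = km.fit_predict(z)
    centers = km.cluster_centers_ * sigma + mu  # back to original units
    return labels, centers
```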

Fig. 6

The clustering result of three driver styles

According to the clustering results, three drivers with clearly different styles are selected as typical examples, as shown in Fig. 7. For each driver, the indicators tf, tnf, and dvnb increase linearly with the ego speed ve. The curve parameters A and b are then obtained via the ordinary least squares (OLS) method and presented in Table 2.

Fig. 7

The relations of three personalization indicators with ego speed at the lane change decision point. The red lines are the regression curves obtained by OLS. (a) Defensive, (b) Normal, (c) Aggressive

Table 2 Parameter matrices for different driver styles via OLS

4 RL-based lane change decision

Figure 8 shows the RL-based decision process. The action space, state space, reward function, and Deep Q-Network are defined first. The RL environment provides information about the ego car and the surrounding cars to the decision module, where this information is converted into the state vector as the input of the Deep Q-Network. The network then outputs the state-action value for each action, and the action with the highest value is selected as the final decision.

Fig. 8

The process of RL lane change decision making

4.1 Action space

Here, left and right lane change decisions are treated identically. There are two discrete actions in the lane change decision-making problem, i.e., a1: TO CHANGE to the target lane, and a2: NOT TO CHANGE lane (to stay in the current lane). The action space A is then defined as

$$ A = \{a_{1}, a_{2}\}. $$
(8)

4.2 State space

Based on the analysis in Section 3, the personalization indicators tf, tnf, and dvnb need to be computed from the environment state. To facilitate real-car applications, the variables in the state space should be directly measurable by onboard sensors. The state consists of two parts: the ego car information and the surrounding car information. Each car's information includes its longitudinal velocity v and longitudinal position x. For better training performance, v and x are normalized to (0,1]. Therefore, the state is described as a vector of eight normalized values,

$$ s = [v_{e}, x_{e}, v_{f}, x_{f}, v_{nf}, x_{nf}, v_{nb}, x_{nb}]. $$
(9)
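A minimal sketch of assembling the state vector in (9) follows; the normalization bounds v_max and x_max are assumptions, since the text only states that v and x are scaled to (0,1].

```python
import numpy as np

def build_state(ego, front, target_front, target_rear,
                v_max=120 / 3.6, x_max=200.0):
    """Assemble the normalized 8-dimensional state of Eq. (9).

    Each argument is a (velocity [m/s], longitudinal position [m]) pair for
    C_ego, C_f, C_nf, C_nb. v_max and x_max are illustrative bounds used
    only to scale the raw measurements into (0, 1].
    """
    state = []
    for v, x in (ego, front, target_front, target_rear):
        state.append(v / v_max)  # normalized longitudinal velocity
        state.append(x / x_max)  # normalized longitudinal position
    return np.asarray(state, dtype=np.float32)
```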

4.3 Reward functions

To train human-like RL agents, the personalization indicators are used as references in the reward function design. To help the agent trade off the benefits of the two decisions and learn to make the better choice, the sum of the two actions' rewards is kept constant at every decision step. The reward function for each indicator is therefore designed as follows. If the decision is TO CHANGE lane, i.e., action = a1,

$$ r = \left\{ \begin{array}{ll} 1, & e_{i} \in [0,m]\\ \dfrac{e_{i}-n}{m-n}, & e_{i} \in (m,n)\\ 0, & e_{i} \in [n,+\infty) \end{array} \right. $$
(10)

If the decision is NOT TO CHANGE lane, i.e. action = a2,

$$ r = \left\{ \begin{array}{ll} 0, & e_{i} \in [0,m]\\ \dfrac{e_{i}-m}{n-m}, & e_{i} \in (m,n)\\ 1, & e_{i} \in [n,+\infty) \end{array} \right. $$
(11)

For each indicator i ∈ Idp in (6), ei is the absolute error between the actual value iact and the reference value iref,

$$ e_{i} = \vert i_{act}-i_{ref} \vert. $$
(12)

A smaller ei means the current value of indicator i is closer to the personalized lane change reference. If the agent chooses to change lanes with a small ei, it receives a greater reward; conversely, if ei is large enough, staying in the current lane yields the greater reward.

Here, m and n are two preset parameters representing the maximum acceptable error and the maximum effective error, respectively. If ei ≤ m, the current value of this indicator exactly matches the driver, while ei ≥ n means this indicator does not match the driver at all. However, pursuing extremely tight tolerances on all indicators leads to non-convergent training and ultimately unsatisfactory results, so appropriate m and n must be chosen. Here, considering the range and precision of each indicator, we set m = 0.2 and n = 2 for tf and tnf, and m = 0.5 and n = 5 for dvnb.

Finally, the reward functions for all three indicators, rf, rnf and rnb, are obtained and the total reward function R is defined as

$$ R = r_{f} + r_{nf} + r_{nb}. $$
(13)
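Putting (10)–(13) together, the per-step reward could be computed as in the sketch below, which follows the reconstructed piecewise forms above; the helper names and structure are illustrative, not the authors' code.

```python
def indicator_reward(e, action, m, n):
    """Reward of one indicator for error e = |i_act - i_ref|, Eqs. (10)-(11).

    action: 'change' (a1) or 'keep' (a2). By construction the two
    rewards of an indicator always sum to 1.
    """
    if e <= m:
        r_change = 1.0
    elif e >= n:
        r_change = 0.0
    else:
        r_change = (e - n) / (m - n)  # linear ramp from 1 down to 0
    return r_change if action == 'change' else 1.0 - r_change

def total_reward(errors, action):
    """Total reward of Eq. (13); errors = (e_tf, e_tnf, e_dvnb)."""
    bounds = [(0.2, 2.0), (0.2, 2.0), (0.5, 5.0)]  # (m, n) per indicator
    return sum(indicator_reward(e, action, m, n)
               for e, (m, n) in zip(errors, bounds))
```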

4.4 Neural network design and training details

Convolutional neural networks (CNNs) are usually used when the input is an image matrix. Here, the network input is a vector of vehicle states, i.e., the state vector defined above. Therefore, a fully connected neural network (FCNN) architecture [27] is designed for the target network and the prediction network mentioned in Section 2. As shown in Fig. 9, there are three hidden layers with 128 neurons each, using rectified linear unit (ReLU) activations. The input is a state vector of size 8 × 1, and the output is a state-action value vector of size 2 × 1. At time t, the neural network takes the environment state st as input and outputs the estimated state-action value Q(st,ai) for each action ai in the action space A.

Fig. 9

The designed FCNN architecture
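A minimal PyTorch sketch consistent with the stated dimensions (8 inputs, three hidden layers of 128 ReLU units, 2 outputs) is given below; any detail beyond those figures is an assumption.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected Q-network: 8-d state in, 2 state-action values out."""

    def __init__(self, state_dim=8, n_actions=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)  # [Q(s, a1), Q(s, a2)]

# Greedy action for a single normalized state vector:
# q_net = QNetwork()
# action = q_net(torch.as_tensor(state).unsqueeze(0)).argmax(dim=1).item()
```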

The neural network is trained with learning rate η using the DQN algorithm. An ε-greedy policy is applied during training, with ε decreasing linearly from εs to εe as training proceeds. The factor γ discounts future rewards. A replay memory of size Mr is used; training begins once the memory holds at least Mi samples, and mini-batches of size Mm are sampled randomly. The target network weights 𝜃⁻ are replaced by the prediction network weights 𝜃 every Nu episodes, and both are initialized from a standard normal distribution N(0,1). Every episode starts from a random environment state and stops with a TO CHANGE lane decision or after the maximal episode step Ns. The maximal number of training episodes is Ne.

Starting from commonly used DQN settings, the hyperparameters are adjusted for our lane change environment. Specifically, the sample time of the RL environment is 0.05 s and the agent needs to finish the lane change maneuver within 10 s, so the maximal episode step Ns is 200. The maximal number of training episodes Ne is set to 10,000, which is large enough for training to converge fully. The initial minimal memory size Mi and the replay memory size Mr are set relatively small because of the small action space and the short episodes. The discount factor γ is determined from \(\gamma \approx 0.01^{\frac {1}{200}}\), so that the reward at the 200th step is discounted to about 1% of its original value. The remaining hyperparameters, i.e., the learning rate η, initial exploration εs, final exploration εe, and target network update frequency Nu, are determined by grid search. The hyperparameters used in training are summarized in Table 3.
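For example, the discount-factor rule and the linear ε schedule described above can be written as below; the start and end exploration values are placeholders, since the actual values are given only in Table 3.

```python
# Discount factor chosen so the 200th-step reward is discounted to ~1%
gamma = 0.01 ** (1 / 200)  # ~0.977

def epsilon(episode, n_episodes=10_000, eps_start=1.0, eps_end=0.05):
    """Linear epsilon-greedy schedule from eps_start down to eps_end."""
    frac = min(episode / n_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```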

Table 3 Hyperparameter settings

5 Results and discussion

In this paper, three personalized RL agents are designed and trained for lane change decision-making to reproduce the typical drivers' preferences. In the lane change scenario, the agents experience different states and learn to make decisions by repeatedly interacting with the environment. Eventually, a stable policy for any state is learned.

5.1 Training results

Figure 10 shows the training results of the three RL agents with different lane change decision preferences. The horizontal axis is the training episode; the training losses defined in (5) are in the left column, and the step rewards defined in (13) are in the right column. The loss curves drop quickly during the first 1,500 training episodes and then flatten out, indicating that the neural networks converge. The average step reward increases over the training episodes, implying that the RL agents learn to select actions with higher rewards across a series of lane change maneuvers. Both the loss and the step reward eventually stabilize. Because of the random training environment, the step reward curve is not very smooth, but its mean value stabilizes at around 1.4.

Fig. 10

Training losses (left column) and step rewards (right column) for personalized RL agents. The dark blue curves are obtained by smoothing the real values in light blue color. (a) Defensive, (b) Normal, (c) Aggressive

5.2 Benchmark algorithm

This paper focuses on personalized decisions, but the ability of the RL agents to make optimal decisions during lane changes also needs to be demonstrated, especially when considering the three indicators simultaneously. Therefore, a benchmark algorithm is set up to show the advantage of the proposed algorithm. Using the designed reward function and the selected typical drivers, the benchmark algorithm directly compares the rewards of the two actions at every step and selects the one with the higher reward. This is a commonly used greedy strategy, and it is deployed on three benchmark agents with Defensive, Normal, and Aggressive styles. These agents are tested in the same simulation environment together with the trained RL agents.
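A sketch of this greedy benchmark is shown below, reusing the illustrative total_reward helper from the Section 4.3 sketch; the indicator errors are assumed to be computed from the current state as in (12).

```python
def benchmark_decision(errors):
    """One-step greedy benchmark: compare the immediate personalized
    rewards of the two actions, Eq. (13), and pick the larger one.

    errors = (e_tf, e_tnf, e_dvnb) for the current state, Eq. (12).
    """
    r_change = total_reward(errors, 'change')
    r_keep = total_reward(errors, 'keep')
    return 'change' if r_change > r_keep else 'keep'
```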

5.3 Test and validation

The trained RL and benchmark agents are tested in a random simulation environment. They make lane change decisions based on the environment states, and the values of the personalization indicators at those decisions are recorded, yielding three sets of lane-change points. The lane-change points are compared with the reference lines, and the similarity is quantified by the Mean Absolute Error (MAE). To further illustrate the agents' personalized preferences, the statistical decision-making accuracies are presented by comparing the typical drivers, the trained RL agents, and the benchmark agents.

Figure 11 shows the test results of the three personalized RL agents and benchmark agents compared to the typical drivers with different driving styles, i.e., Defensive, Normal, and Aggressive. When the agents decide to make a lane change, the points in the sub-plots represent the values of the personalization indicators at different ego speeds, with the blue and orange points generated by the RL agents and benchmark agents, respectively.

Fig. 11

Test results of personalization indicators in lane change decision at different ego speeds. (a) Defensive agents. (b) Normal agents. (c) Aggressive agents

To describe how closely the agents and the corresponding drivers decide in lane change maneuvers, the similarity between the reference lines and the lane-change points is measured by the MAE, defined as

$$ MAE = \frac 1n \sum\limits_{i=1}^{n} \vert y_{a_{i}} - y_{r_{i}} \vert, $$
(14)

where \(y_{a_{i}}\) is the actual indicator value and \(y_{r_{i}}\) is the reference value obtained from the reference line at the same ego speed. As shown in Fig. 11, the lane-change points of all three personalized RL agents are close to the reference lines. For the benchmark agents, only the values of dvnb are close to the references, while the other two indicators, tf and tnf, are reproduced worse than by the RL agents. Specifically, at the initiation of the lane change decision, the RL agents' MAEs for tf and tnf are clearly smaller than those of the benchmark agents.
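Equation (14) amounts to the following computation per indicator (a sketch with hypothetical array names):

```python
import numpy as np

def mae(indicator_actual, v_e, A_k, b_k):
    """MAE of Eq. (14) between recorded values of one indicator and its
    reference line.

    indicator_actual: (N,) indicator values at the agent's lane changes
    v_e:              (N,) ego speeds at those decisions
    A_k, b_k:         reference-line parameters for this indicator, Eq. (7)
    """
    reference = A_k * v_e + b_k
    return float(np.mean(np.abs(indicator_actual - reference)))
```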

Furthermore, the decisions of the typical drivers, trained RL agents, and benchmark agents are compared over the same series of states, with the statistical results shown in Fig. 12. The blue circles indicate that the human driver and the agent make the same decision, while the red triangles indicate opposite decisions. The RL agents achieve 95.9% accuracy for the Defensive agent, 100% for the Normal agent, and 98.6% for the Aggressive agent. In contrast, the accuracies of the benchmark agents are only 83.7%, 87%, and 86.5%, respectively. For the Defensive and Aggressive RL agents, there are three opposite cases in total, marked in Fig. 11, compared with the original lane change data generated by the typical drivers. Specifically, in cases 1 and 2, the relative distance dnb between the ego car (Cego) and the rear car in the target lane (Cnb) is very large, 142.36 m and 126.11 m, respectively. In the training environment, if Cnb is far away from Cego (dnb > 100 m), Cnb can be considered to have no effect on the ego car's lane change decision. In case 3, there is no leading car (Cnf) in the target lane (the related indicator tnf is recorded as -1), which leads to an incorrect decision. None of these failure cases' environment states are covered by the training environment, so the agent cannot handle them.

Fig. 12

The comparisons of lane change decision making results between human drivers, trained RL agents and benchmark agents with three different personalized preferences. (a) RL agents. (b) Benchmark agents

To sum up, the similarity comparisons, decision accuracies, and failure case analysis show that the personalized RL agents make lane change decisions more like the human drivers of the three styles than the benchmark agents do.

6 Conclusion

Due to its stressful dynamics, lane change is a common but difficult decision task in dense traffic, especially on highways. For a better user experience, automated driving requires further consideration of drivers' personalized preferences.

This paper proposes a personalized decision algorithm for lane change based on RL. The RL agent can successfully reproduce the driver’s preferences for lane change decisions, which is promising for further applications in the human-centered design of automated driving.

This work is part of ongoing research on personalized driving automation considering user experience, and some limits remain to be overcome in future studies. For example, only three personalization indicators are selected, while other indicators may also affect lane change decisions, e.g., additional variables describing traffic and road conditions. For practical applications, the algorithm may be further extended by enriching the state space with the driver's high-level preferences according to the driver's current status, e.g., based on driver state monitoring systems.