
1 Introduction

Compared with other modes of transportation, urban rail transit offers a high average speed, large passenger capacity, and high utilization. However, because of its high operating density and heavy passenger flow, the total resource consumption of an urban rail system is also very large, and train traction accounts for more than half of that energy consumption. Studying energy-saving control of train operation can therefore promote the greener development of urban rail transit.

As shown in Fig. 1, this study takes minimizing train traction energy consumption as the optimization goal, treats operating safety, punctuality, and other conditions as constraints, and applies different methods for model transformation. Current research methods for this problem fall mainly into the following categories.

Zhu Jinling studied the maximum principle, introduced the speed-limit constraint into an improved model of the energy-saving train driving strategy, and identified the optimal control regimes from its solution [1]. Liang Zhicheng also used the maximum principle to further explore train handling under speed-limit constraints [2]. Wang Qingyuan applied the maximum principle to introduce regenerative braking among multiple trains into the train control model and determined the optimal set of train control regimes [3]. Although the maximum-principle approach can in theory obtain the optimal numerical solution, its strong theoretical character makes the computation complicated.

Wang Pengling used dynamic programming to transform the constraint conditions and, combined with the Gauss pseudospectral method, improved and optimized the train speed-distance curve [4].

Shi Hongguo studied the multi-objective problem of train operation; to improve convergence speed and the quality of the output, a genetic algorithm with variable-length chromosomes was introduced to analyze and solve the problem [5]. Liu Wei, building on earlier work, analyzed an improved multi-population genetic algorithm and introduced variable-length real-matrix coding, applying it to improve the driving strategy of urban rail trains [6].

Reinforcement learning, one of the most actively studied directions in artificial intelligence in recent years, learns to control an agent's behavior continuously from high-dimensional perceptual input, and offers a new approach to the complex decision and perception problems currently faced. Based on reinforcement learning and combined with the maximum principle, this paper builds a Markov decision model of the train's running process within a section, exploits the exploration and learning characteristics of reinforcement learning to design an energy-saving control algorithm for a single train, and uses it to improve the train's driving strategy over the section so as to obtain the optimal energy-saving strategy.

Fig. 1. Schematic scenario for this study

2 Analysis of Train Energy-Saving

When a train runs in a section, it is subject to external forces such as traction, braking force, and running resistance. Equation (1) gives the resultant external force.

$$ F_C = F_T - F_B - F_R $$
(1)

FC is the resultant force acting on the train (kN), FT is the train traction force (kN), FB is the train braking force (kN), and FR is the train running resistance (kN). According to Newton's laws of motion, the kinematic equations of the train are constructed as Eqs. (2) and (3).

$$ \frac{dt}{ds} = \frac{1}{v} $$
(2)
$$ \frac{dv}{ds} = \frac{\mu_t F_T - \eta_t F_B - F_R}{mv} $$
(3)

μt is the traction utilization coefficient (the ratio of the current traction force to the maximum traction force, in the range [0, 1]), ηt is the braking utilization coefficient (the ratio of the current braking force to the maximum braking force, in the range [0, 1]), and m is the train mass. The train consumes traction energy while running; this paper aims to minimize that energy, and Eq. (4) gives the energy consumption objective to be minimized.

$$ \min J = \int_{s_0}^{s_T} \left[ \mu_t F_T(v) - \eta_t F_B(v) \right] ds $$
(4)

While running, the train must comply with the corresponding operating rules, given as the constraints in Eq. (5).

$$ \begin{array}{l} v(x) \le v_{Lim}(x) \\ v(0) = 0 \\ v(s) = 0 \\ t(0) = 0 \\ t(s) - t(0) = T_s \end{array} $$
(5)

v(x) is the train speed at position x (km/h), and vLim(x) is the speed limit at position x (km/h). v(0) is the train speed at the start of the section and v(s) the speed at the end (km/h); t(0) and t(s) are the corresponding times at the start and end of the section (s), and Ts is the planned running time of the train over the section (s).
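
For illustration, the following minimal Python sketch integrates the dynamics of Eqs. (2) and (3) over distance with a simple Euler scheme; the train mass, maximum traction and braking forces, and resistance coefficients used here are assumed illustrative values rather than parameters from this paper, and no speed limit is enforced.

M = 300_000.0      # train mass m (kg), assumed value
F_T_MAX = 310.0    # maximum traction force (kN), assumed value
F_B_MAX = 260.0    # maximum braking force (kN), assumed value

def running_resistance(v):
    # Davis-type running resistance F_R (kN) at speed v (m/s); coefficients assumed.
    return 2.0 + 0.03 * v + 0.0005 * v * v

def step(v, t, mu_t, eta_t, ds=1.0):
    # Advance the train by ds metres with one Euler step of Eqs. (2)-(3).
    f_c = mu_t * F_T_MAX - eta_t * F_B_MAX - running_resistance(v)  # resultant force (kN), Eq. (1)
    dv_ds = f_c * 1000.0 / (M * max(v, 0.1))                        # Eq. (3), force converted to N
    dt_ds = 1.0 / max(v, 0.1)                                       # Eq. (2)
    return v + dv_ds * ds, t + dt_ds * ds

# Example: full traction over the first 500 m, starting slightly above standstill
# to avoid the 1/v singularity at v = 0.
v, t = 1.0, 0.0
for _ in range(500):
    v, t = step(v, t, mu_t=1.0, eta_t=0.0)
print(f"speed after 500 m: {v:.1f} m/s, elapsed time: {t:.1f} s")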

Based on the maximum principle, Eq. (6) gives the Hamiltonian of the train operation problem.

$$ H = -\mu_t F_T(v) + \eta_t F_B(v) + \lambda_1 \frac{\mu_t F_T - \eta_t F_B - F_R}{mv} + \lambda_2 \frac{1}{v} $$
(6)

λ1 and λ2 are the Lagrange multipliers. Introducing the complementary slackness factor M(x) for the speed-limit constraint, the costate equations are established as Eqs. (7) and (8).

$$ \frac{d\lambda_1}{ds} = -\frac{\partial H}{\partial t} $$
(7)
$$ \frac{d\lambda_2}{ds} = -\frac{\partial H}{\partial v} + \frac{dM(x)}{ds} $$
(8)

M(x) satisfies the complementary slackness conditions in Eqs. (9) and (10).

$$ \left[ v(x) - v_{Lim}(x) \right] \frac{dM(x)}{ds} = 0 $$
(9)
$$ \frac{dM(x)}{{ds}} \ge 0 $$
(10)

Because the Hamiltonian contains no explicit time variable t, λ1 is a constant. Introducing a coefficient β determined by λ1 and the train state, Eq. (6) can be rewritten as Eq. (11).

$$ H = (\beta - 1)\mu_t F_T(v) + (1 - \beta)\eta_t F_B(v) - F_R + \lambda_2 \frac{1}{v} $$
(11)

It can be seen from the above equation that, for the objective J to reach its minimum, the Hamiltonian H must take its maximum value. Since the costate variable can take different values, the optimal controls in Table 1 are obtained.

Table 1. Optimal control value of train operation

Depending on the value of the costate variable, the train's operating regime changes (see Table 2). When β > 1, the control corresponds to the maximum-traction regime; when β = 1, the control can correspond to any regime; when β < 1, the control corresponds to the maximum-braking regime. The energy-saving control strategy for train operation is therefore: accelerate with maximum traction, then cruise or coast, use the brake as little as possible during the non-braking phases, and apply maximum braking when the train must stop. Accordingly, in the algorithm designed below, traction and braking always use maximum traction and maximum braking, and the algorithm searches for the optimal coasting point so that the train runs as energy-efficiently as possible; an illustrative sketch of this control structure follows Table 2.

Table 2. The value of the control variable of the train working condition
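
For illustration, the following minimal sketch implements the three-phase control structure described above, returning the traction and braking utilization coefficients for a given position. The switching positions s_coast and s_brake are hypothetical placeholders for the coasting and braking points that, in this paper's approach, the reinforcement learning algorithm of Sect. 3 searches for.

def three_phase_control(s, s_coast, s_brake):
    # Energy-saving regime from Sect. 2: maximum traction, then coasting, then
    # maximum braking. Returns (mu_t, eta_t).
    # s_coast and s_brake are hypothetical switching positions (m) to be optimized.
    if s < s_coast:
        return 1.0, 0.0   # maximum traction (beta > 1)
    elif s < s_brake:
        return 0.0, 0.0   # coasting: neither traction nor braking applied
    else:
        return 0.0, 1.0   # maximum braking (beta < 1)

# Example with illustrative switching points for a 2660 m section.
print(three_phase_control(1500.0, s_coast=1200.0, s_brake=2400.0))  # -> (0.0, 0.0)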

3 Research and Design of Reinforcement Learning Algorithm

In this paper, combined with the train's control mode within a section, the Markov decision process is chosen as the environment model of the algorithm. Considering the delay in information acquisition and the computation and storage requirements of the algorithm, the discrete time step is set to 0.2 s. The train is regarded as the agent, the control output of the train in each small time step is regarded as the agent's action, and the speed and position of the train in each small time step form the current state vector; a minimal sketch of this state representation is given below.
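
For illustration, a minimal sketch of this state and time-step representation is shown here; the class and field names are illustrative choices, not identifiers from the paper.

from dataclasses import dataclass

DT = 0.2  # discrete time step (s), as set in this paper

@dataclass
class TrainState:
    # State vector of the agent (train) at one discrete time step.
    position: float  # distance from the start of the section (m)
    speed: float     # current train speed (m/s)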

Strategy:

A strategy (policy) is the way an agent behaves at a given time or stage, and directly determines the agent's actions or control decisions. By modeling the train's running process over the section as a Markov decision process, the output set of the train controller is taken as the action set of the Markov decision process, and the agent's action at stage i is expressed as Eq. (12).

$$ u_i = u_{i-1} + \Delta u_i $$
(12)

ui is the train's control output at stage i (kN), and Δui is the change in control output at stage i (kN). Normally the acceleration of an urban rail train varies within [−1, 1] m/s2, but for passenger comfort the maximum traction acceleration in this paper is 0.6 m/s2 and the maximum braking deceleration is −0.8 m/s2. The per-step change of acceleration Δui is therefore restricted to [−0.3, 0.3] m/s2. A further benefit of limiting the rate of change of acceleration is that it shrinks the range of the control variable, which reduces the algorithm's computation and data-buffer requirements and improves its convergence speed; a sketch of this action update is given below.
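
For illustration, the following minimal sketch applies the action update of Eq. (12) together with the limits quoted above; expressing the control output directly as an acceleration rather than a force is a simplifying assumption of this sketch.

A_MIN, A_MAX = -0.8, 0.6      # maximum braking deceleration / traction acceleration (m/s^2)
DU_MIN, DU_MAX = -0.3, 0.3    # allowed per-step change of acceleration (m/s^2)

def apply_action(u_prev, delta_u):
    # u_i = u_{i-1} + delta_u_i (Eq. 12), with comfort and change-rate limits enforced.
    delta_u = max(DU_MIN, min(DU_MAX, delta_u))
    return max(A_MIN, min(A_MAX, u_prev + delta_u))

# Example: from 0.5 m/s^2, a requested change of +0.4 is clipped to +0.3 and the
# result is capped at the 0.6 m/s^2 comfort limit.
print(apply_action(0.5, 0.4))  # -> 0.6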

Reward Function:

The reward function is the reward generated when the agent takes an action in the current state and moves to the next state. Since this paper addresses energy-saving operation control, the train's energy consumption in each small time step is used as the reward, calculated as in Eq. (13).

$$ U_i = \frac{1}{2}\left| u_0 + \sum_{k=1}^{i} \Delta u_k \right| (v_i + v_{i-1}) \Delta t $$
(13)

u0 is the control output at the start of the train's run (kN), vi is the speed at the end of stage i (m/s), and Δt is the length of each discrete time step (0.2 s). In general, the energy consumption of a train is the integral of force over distance. Because the operation control here is broken into small time steps, the change in force within one step can be neglected, and the distance travelled in a step can be approximated as that of straight-line motion with uniform acceleration, i.e. (vi + vi−1)Δt/2; a minimal sketch of this per-step energy calculation follows.
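
For illustration, the following minimal sketch computes the per-step reward of Eq. (13), treating the force as constant within the step and approximating the distance travelled as (vi + vi−1)Δt/2; the numerical values in the example are illustrative assumptions.

DT = 0.2  # discrete time step (s)

def step_energy(u_0, delta_us, v_prev, v_curr, dt=DT):
    # U_i from Eq. (13): |u_0 + sum of delta_u_k| is the force applied during the
    # step (kN), and (v_prev + v_curr) * dt / 2 approximates the distance travelled
    # under uniform acceleration within the step (m), so the result is in kJ.
    force = abs(u_0 + sum(delta_us))
    distance = 0.5 * (v_prev + v_curr) * dt
    return force * distance

# Example: 200 kN of net output while accelerating from 10 m/s to 10.5 m/s in one step.
print(step_energy(180.0, [10.0, 10.0], 10.0, 10.5))  # -> 410.0 (kJ)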

Value Function:

The value function is the expected, discounted sum of rewards, and represents the overall expected return of the strategy adopted by the agent from the current state over the whole control process. In this paper the value function is defined in Eq. (14).

$$ Q(X, \Delta u) = \frac{1}{k}\sum_{i=1}^{m-1} \gamma U_i $$
(14)

γ is the discount factor, which balances immediate and long-term returns; its range is (0, 1]. With the state, strategy, reward function, and value function defined above, the value-function update formula of reinforcement learning, Eq. (15), can be derived.

$$ Q^{\Pi}(X, \Pi'(X)) = \frac{\varepsilon}{|A(X)|}\sum_{\Delta u} Q^{\Pi}(X, \Delta u) + (1 - \varepsilon)\min_{\Delta u} Q^{\Pi}(X, \Delta u) $$
(15)

ε is the exploration rate of the ε-greedy algorithm, and |A(X)| is the number of available actions in state X; a sketch of this ε-greedy evaluation and action selection is given below.
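
For illustration, the following minimal sketch evaluates the ε-greedy policy of Eq. (15) and selects actions accordingly; because the reward is energy consumption, the greedy choice minimizes Q. The dictionary-based representation of Q and the example values are illustrative assumptions.

import random

def policy_value(q_values, epsilon):
    # Value of the epsilon-greedy policy in state X, following Eq. (15).
    # q_values maps each admissible delta_u to Q(X, delta_u); len(q_values) is |A(X)|.
    uniform_part = (epsilon / len(q_values)) * sum(q_values.values())
    greedy_part = (1.0 - epsilon) * min(q_values.values())  # min: reward is energy to minimize
    return uniform_part + greedy_part

def select_action(q_values, epsilon):
    # Epsilon-greedy action selection used while exploring.
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return min(q_values, key=q_values.get)

# Example with three candidate acceleration changes and assumed Q estimates.
q = {-0.3: 52.0, 0.0: 48.5, 0.3: 55.0}
print(policy_value(q, epsilon=0.1))  # 0.1/3 * 155.5 + 0.9 * 48.5 ≈ 48.83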

Table 3 shows how the reinforcement learning algorithm computes a single train speed-distance curve. Because the goal is energy-saving optimization of train operation control, this process is repeated to obtain multiple speed-distance curves, from which the most energy-efficient one for the section is selected.

Table 3. Reinforcement learning algorithm calculates energy saving strategy

4 Simulation and Verification

4.1 Case 1

Case 1 uses line data from Shanghai Metro Line 17, specifically the section from Jiasong Middle Road Station to Xujingbeicheng Station, to verify that the energy-saving algorithm proposed in this paper is effective and reliable in terms of safety, punctuality, and energy saving. The running distance between the two stations is 2660 m and the planned running time of the section is 180 s; the section information is shown in Table 4.

Table 4. Case 1 Line parameter information table

By importing the relevant line data into the reinforcement learning algorithm for calculation, the most energy-efficient train speed-distance curve in this section is obtained, as shown in Fig. 2.

Fig. 2. Optimal control curve diagram based on reinforcement learning algorithm

As shown in the figure, the optimal strategy derived by the algorithm is to apply maximum traction first, then alternate between coasting and traction, and finally apply maximum braking. It is also easy to see that the train's speed stays below the section speed limit throughout the run, so the algorithm meets the safety requirement.

The actual energy consumption and section running time of the existing driving strategy can be obtained from the existing signal system; together with the results reported in [7], they are compared with this paper's strategy in Table 5. In terms of running time, the existing driving strategy is the fastest, this paper's strategy is second, and the reference strategy is the slowest. In terms of energy consumption, the existing driving strategy consumes the most, this paper's strategy is second, and the reference strategy is the best. Compared with the existing strategy, the strategy in this paper reduces energy consumption by 17.5%; compared with the energy-saving strategy in the reference, it consumes slightly more energy but its running time is closer to the scheduled time. In summary, the algorithm proposed in this paper satisfies the safety, punctuality, and energy-saving requirements of train operation and achieves a good energy-saving effect.

Table 5. Comparison of three strategies

4.2 Case 2

To further verify the effectiveness of the algorithm, Case 2 uses the line data and related information from another reference [7] to compare the results of the different strategies, demonstrating that the model in this paper remains valid on a different section. In this case the section length is 1354 m and the planned running time is 110 s; the gradient, speed limit, and curve radius information are shown in Table 6.

Table 6. Case 2 Line parameter information table

By importing the relevant line data into the reinforcement learning algorithm, the most energy-efficient speed-distance curve for this section is obtained, as shown in Fig. 3. The optimal strategy derived by the algorithm is largely the same as in Case 1: maximum traction first, then alternating coasting and traction, and finally maximum braking. The subtle difference is that a downhill section begins 684 m into the section, so the coasting acceleration there is positive and the train coasts throughout that stretch.

Fig. 3. Optimal control curve diagram based on reinforcement learning algorithm

The results are then compared with the data reported in reference [9] (see Table 7). In terms of running time, the existing driving strategy is the slowest, the algorithm in this paper is second, and the reference strategy is the fastest. In terms of timing error, the error of this paper's strategy is only 0.2 s, which is 0.6 s lower than that of the existing strategy and well within the acceptable range. In terms of energy consumption, the existing driving strategy consumes the most, the reference strategy is second, and the strategy in this paper is the best: it reduces energy consumption by about 13% compared with the existing strategy and by a further 8% or so compared with the reference's energy-saving strategy in this case. In summary, the algorithm again satisfies the safety, punctuality, and energy-saving requirements, further verifying its effectiveness and reliability.

Table 7. Comparison of three strategies

5 Summary

This paper has discussed combining a reinforcement learning algorithm with train operation control for energy-saving train operation, and has proposed an energy-saving optimization model and algorithm for a single urban rail train based on reinforcement learning. Using line and vehicle data from the references, the model was built and solved with the proposed algorithm, and its validity was verified. The comparison of results shows that the algorithm performs well in terms of energy consumption as well as punctuality and safety.

Although the reinforcement-learning-based energy-saving train operation control designed in this paper achieves good results, some issues remain for future work, such as establishing a multi-particle train model.