1 Introduction

Urban rail transit, as an efficient, safe, comfortable, and fast mode of transportation, has undergone significant development in the past few decades. However, this massive transportation capacity inevitably brings an enormous demand for energy. Taking the Guangzhou Metro in China as an example, it transported 2.358 billion passengers in 2022, and its total operating energy consumption for the year was \(1.882 \times 10^{9}\) kWh, of which train traction accounted for 55.6%, or \(1.047 \times 10^{9}\) kWh [1]. Energy-efficient train operation is a very effective measure for achieving energy conservation, emission reduction, and green transportation [2].

The key to energy-efficient train operation is determining the speed profile that minimizes energy consumption while satisfying the timetable and constraints such as train characteristics, track gradients, curves, and speed limits; this approach is also known as train trajectory optimization (TTO). The optimal train trajectory describes the movement of the train along the track and can be used as a basis for guiding train drivers during operation or as an input to automatic train operation (ATO).

To solve the TTO problem, many scholars have conducted extensive research on train operation control strategies and control methods, which are mainly divided into five categories: Pontryagin’s maximum principle (PMP) [3,4,5,6,7], quadratic programming (QP) [8, 9], heuristic methods [10,11,12,13], dynamic programming (DP) [14, 15] and reinforcement learning (RL) [16, 17]. The PMP faces significant challenges in TTO problems: when dealing with hard constraints on non-flat tracks with multiple speed limits, it is difficult to find the optimal switching conditions. Kouzoupis et al. [18] used a multiple shooting method to transform the TTO into a nonlinear programming problem and solved it directly using CasADi (an open-source tool for numeric optimization implementing automatic differentiation in forward and reverse modes on sparse matrix-valued computational graphs) and IPOPT (a software package for solving large-scale nonlinear optimization problems), while Wang and Goverde [8] solved it using the pseudospectral method. The accuracy of the solution can be improved by refining the discretization of the TTO problem, but the QP method then requires a huge amount of computation and storage to handle the resulting large-scale problem. Heuristic methods often require considerable time to make control decisions and sometimes even produce violent fluctuations in the speed profile that violate the constraints. DP approaches [14, 15] regard the train operation process as a multistage decision process and use the Bellman optimality equation and backward recursion to obtain the optimal control strategy iteratively. To overcome the curse of dimensionality, approximate dynamic programming (ADP) has been adopted to solve the TTO problem, with several value function approximation methods designed to estimate the optimal value function, such as rolling algorithms, interpolation methods, and neural networks [19,20,21].

RL allows agents to learn how to complete tasks through interaction with the environment, but it also faces many challenges, such as sparse rewards, the balance between exploration and exploitation, and sampling efficiency. To address these problems, many classic RL algorithms have been proposed, such as Q-learning [22], the deep Q-network (DQN) [23], the advantage actor–critic (A2C) [24], the deep deterministic policy gradient (DDPG) [25], and proximal policy optimization (PPO) [26]. Liu et al. [16] proposed an intelligent control method based on the DQN to solve the TTO problem of heavy-haul trains. Liang et al. [27] used the asynchronous advantage actor–critic (A3C) to optimize the train speed profile and proposed a parameter update method based on a weighted average of advantage values to address the convergence oscillation and degradation problems of A3C. The train operation process can also be treated as a continuous control task and solved using DDPG [28, 29]. Pang et al. [30] addressed the problem of train trajectory reconstruction under interruption conditions, using a PPO model that considers train operation constraints and minimizes total train delay to produce a train trajectory reconstruction scheme.

During the learning process, traditional RL adopts a soft-constraint approach: a penalty function matched to the constraint is set to prevent agents from crossing boundaries and reaching unsafe states. However, in fields with complex transition dynamics and high-dimensional state–action spaces, this trial-and-error process may damage the learning system when selected actions are executed in certain states, reducing the efficiency of the algorithm's search.

To address this issue, safe reinforcement learning (SRL) has been proposed with the aim of satisfying given safety constraints while ensuring good system performance [31]. Several researchers have applied Gaussian processes to model safety constraints, which enables the algorithm to evaluate the safety of state–action pairs before visiting them and thus supports safe learning [32,33,34]. The concept of shielding was first proposed by Alshiekh et al. [35]: during learning, when an agent finds that the current action is unsafe, shielding is triggered and an alternative action overrides the current one to ensure safety. Jeddi et al. [36] proposed a memory-augmented Lyapunov-based SRL model that enables agents to always meet the safety constraints of the environment. Zhou et al. [37] adopted a simplified system model to establish an SRL framework and effectively learned low-dimensional representations of safe regions through data-driven methods to obtain more accurate safety estimates, which expanded the applicability of the SRL framework.

In addition, RL requires a reward function to provide feedback [38], and learning performance depends largely on the design of the reward function [39]. Generally, reward functions are divided into two types: sparse rewards and dense rewards. A sparse reward function provides feedback only when the task is completed; it therefore has strong anti-interference ability and can be made consistent with the task objectives [40,41,42]. When there are enough successful samples to provide reward feedback, the agent can learn the globally optimal strategy [43]. However, its efficiency is relatively low in the early stage: if the task is not completed, the agent only receives samples with the same penalty reward, which makes it difficult for the algorithm to learn good strategies from these poor data. A dense reward function usually provides specific, timely feedback for each state of the agent to distinguish different actions [44], which maintains the continuity of learning and quickly guides the agent toward high-value states. However, when designing a dense reward function, possible interference from noise must be fully considered, as noise signals may propagate and be amplified through the Bellman equation [45, 46].

In summary, the PMP has limitations when dealing with TTO problems with hard constraints such as track gradients and speed limits. RL obtains rewards through the constant interaction between the agent and the environment to guide the continuous improvement of the policy; it can therefore adapt well to complex environmental constraints and has good generalization ability. The PPO algorithm, a typical RL method, limits the amplitude of policy updates, which enables it to maintain high sample efficiency while effectively improving the stability and efficiency of training. Moreover, SRL can effectively improve on the soft-constraint approach, which punishes state–action pairs that exceed the constraints and thus suffers from low sampling efficiency and slow learning, and may even break the constraints and produce train trajectories beyond the safety limits.

Therefore, this paper proposes a PPO-based safe reinforcement learning framework (S-PPO) for train trajectory optimization, including a safe action rechoosing mechanism (SARM) and a relaxed dynamic reward mechanism (RDRM) that combines a relaxed sparse reward with a dynamic dense reward. The SARM guarantees the safety of both the learning process and the final result. The state transition of the agent is evaluated using environmental knowledge, and when the agent's behavior is found to exceed the safety constraints, a new action is reselected so that the next state reached by the agent always satisfies the safety constraints, effectively improving the sampling efficiency. Notably, the SARM may be triggered at the beginning or in the middle of a state transition. The RDRM is designed to offset the potential convergence stability issues that the SARM may introduce. The relaxed sparse reward is obtained by extending the planned trip time constraint, which makes it easier for the learning system to obtain samples that satisfy the constraint and greatly reduces the risk of the algorithm falling into local optima. The dynamic dense reward uses a dynamic balance coefficient based on the initial speed of the state to balance the contributions of running time and energy consumption to the reward in different states.

The remainder of this paper is organized as follows. In Sect. 2, the train operation model and the Markov decision process model of train operation are formulated. In Sect. 3, we propose the S-PPO with the SARM and the RDRM for the TTO. In Sect. 4, simulations based on train and track data between Jiugong Station and Yizhuangqiao Station on the Beijing Metro Yizhuang Line verify the effectiveness of the proposed SARM and RDRM and the energy efficiency of the proposed algorithm. In Sect. 5, conclusions are given.

2 Model construction

2.1 Basic train operation model construction

When studying the optimal operation of trains, a single-particle model [4, 47, 48] is commonly used to construct the kinematic system of the train. Assume that a train moves from the starting point \(x = 0\) to the endpoint \(x = X\). The running time \(t = t(x) \in [0,T]\) and speed \(v = v(x) \in [0,V]\) are used as the dependent variables of the model. The train operation model can then be expressed as follows:

$$\left\{ \begin{gathered} \frac{{{\text{d}}t}}{{{\text{d}}x}} = \frac{1}{v} \hfill \\ \frac{{{\text{d}}v}}{{{\text{d}}x}} = \frac{1}{M}\frac{{\alpha_{f} f(v) - \alpha_{b} b(v) - w_{0} (v) - w_{i} (x)}}{v} \hfill \\ \end{gathered} \right.$$
(1)

where \(M\) is the mass of the train; \(f(v) > 0\) and \(b(v) > 0\) represent the maximum traction force and maximum braking force, respectively; \(w_{0} (v) > 0\) is the resistance produced by friction; and \(w_{i} (x)\) is the resistance generated by gradients. \(\alpha_{f}\) and \(\alpha_{b}\) are the coefficients of traction and braking force utilization, respectively, and must satisfy the constraint:

$$\left\{ \begin{gathered} 0 \le \alpha_{f} \le 1 \hfill \\ 0 \le \alpha_{b} \le 1 \hfill \\ \end{gathered} \right.$$
(2)

It is worth noting that the single-particle model ignores the length of the train. When the train passes gradient transition points, the model cannot accurately express the forces involved, and there is some bias in the description of the train operation. The size of this bias is related to \(w_{i} (x)\) and the train length. Therefore, the single-particle model is generally established with the train center point as the reference point, which effectively reduces this bias and keeps its impact on the train within a tolerable range. Meanwhile, to transform the TTO into a Markov decision process (MDP), this paper also ignores the impact of this bias.

The energy consumption E is an important indicator for measuring train operation trajectory control and can be expressed as:

$$E = \int_{0}^{X} {\alpha_{f} f(v){\text{d}}x}$$
(3)
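For illustration, the sketch below integrates Eq. (1) over distance with a simple forward Euler scheme and accumulates the traction energy of Eq. (3). It is a minimal sketch, assuming the force and resistance curves \(f\), \(b\), \(w_{0}\), and \(w_{i}\) are supplied as Python callables fitted to the actual train and track data; the step size and the handling of near-zero speeds are illustrative choices, not part of the original model.

```python
def simulate(alpha_f, alpha_b, f, b, w0, wi, M, X, v0=0.1, dx=1.0):
    """Forward-Euler integration of the single-particle model, Eqs. (1) and (3).

    alpha_f(x, v), alpha_b(x, v): control coefficients in [0, 1]
    f(v), b(v): maximum traction / braking force (N)
    w0(v), wi(x): friction and gradient resistance (N)
    M: train mass (kg), X: track length (m), v0: small initial speed (m/s)
    Returns the trip time T (s) and the traction energy E (J).
    """
    x, v, t, E = 0.0, v0, 0.0, 0.0
    while x < X:
        af, ab = alpha_f(x, v), alpha_b(x, v)
        dvdx = (af * f(v) - ab * b(v) - w0(v) - wi(x)) / (M * v)  # dv/dx, Eq. (1)
        t += dx / v                     # dt/dx = 1/v
        E += af * f(v) * dx             # dE = alpha_f * f(v) dx, Eq. (3)
        v = max(v + dvdx * dx, 1e-3)    # keep speed positive to avoid division by zero
        x += dx
    return t, E
```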

The Hamiltonian function can be defined as:

$$H = - \alpha_{f} f(v) + \frac{{{{\varvec{\uplambda}}}_{1} }}{v} + \frac{{{{\varvec{\uplambda}}}_{2} [\alpha_{f} f(v) - \alpha_{b} b(v) - w_{0} (v) - w_{i} (x)]}}{v}$$
(4)

The Lagrangian function is represented as:

$${\text{La}} = H + \rho_{1} \alpha_{f} + \rho_{2} {(1 - }\alpha_{f} {) + }\rho_{3} \alpha_{b} + \rho_{4} {(1 - }\alpha_{b} {) + }\rho_{5} {(}V{ - }v{)}$$
(5)

where \(\rho_{1} \ge 0,\rho_{2} \ge 0,\rho_{3} \ge 0,\rho_{4} \ge 0,\rho_{5} \ge 0\) are all Lagrangian multipliers and the adjoint variables must satisfy the following:

$$\left\{ \begin{gathered} \frac{{{\text{d}}\lambda_{1} }}{{{\text{d}}x}} = - \frac{\partial La}{{\partial t}} = 0 \hfill \\ \frac{{{\text{d}}\lambda_{2} }}{{{\text{d}}x}} = - \frac{\partial La}{{\partial v}} \hfill \\ \end{gathered} \right.$$
(6)

According to the Karush–Kuhn–Tucker (KKT) condition, it can be concluded that

$$\frac{{\partial \text{La} }}{{\partial \alpha _{f} }} = \left( {\frac{{{\mathbf{\lambda }}_{2} }}{v} - 1} \right)f\left( v \right) + \rho _{1} - \rho _{2} = 0$$
(7)
$$\frac{{\partial {\text{La}} }}{{\partial \alpha_{b} }} = - \frac{{{{\varvec{\uplambda}}}_{2} }}{v}b{(}v{)} + \rho_{3} - \rho_{4} = 0$$
(8)
$$\frac{{\partial {\text{La}} }}{\partial v} = - \alpha_{f} f^{\prime} {(}v{)} - \frac{{{{\varvec{\uplambda}}}_{1} }}{{v^{2} }} + \frac{{{{\varvec{\uplambda}}}_{2} }}{v}{(}\alpha_{f} f^{\prime} {(}v{)} - \alpha_{b} b^{\prime} {(}v{)} - w_{0}^{\prime} {(}v{))} - \frac{{{{\varvec{\uplambda}}}_{2} }}{{v^{2} }}{(}\alpha_{f} f{(}v{)} - \alpha_{b} b{(}v{)} - w_{0} {(}v{)} - w_{i} {(}x{))} - \rho_{5} = 0$$
(9)

The complementary relaxation conditions are as follows:

$$\rho_{1} \alpha_{f} = \rho_{2} {(1} - \alpha_{f} {) = }\rho_{3} \alpha_{b} = \rho_{4} {(1} - \alpha_{b} {) = }\rho_{5} {(}V - v{) = 0}$$
(10)

It can be seen that \({{\varvec{\uplambda}}}_{2}\) has two critical values: \({{\varvec{\uplambda}}}_{2} = v\) and \({{\varvec{\uplambda}}}_{2} = 0\). According to Pontryagin's maximum principle, five different conditions need to be considered to determine the control laws \(\alpha_{f}\) and \(\alpha_{b}\) such that the Hamiltonian function reaches its maximum value in the feasible region.

Condition 1: If \({{\varvec{\uplambda}}}_{2} > v\), since \(f(v) > 0\), according to Eq. (7), \(\rho_{1} < \rho_{2}\). According to (10), due to \(\rho_{1} ,\rho_{2} \ge 0\), \(\rho_{1} = 0\) and \(\rho_{2} > 0\) can be obtained; then, \(\alpha_{f} = 1\). Similarly, according to (8), \(\rho_{3} - \rho_{4} > 0\) can be obtained; since \(\rho_{3} \alpha_{b} = \rho_{4} {(1} - \alpha_{b} {) = 0}\), \(\alpha_{b} = 0\). The train operates with maximum traction force (MT).

Condition 2: If \({{\varvec{\uplambda}}}_{2} = v\), according to Eq. (8), it is easy to find that \(\rho_{3} - \rho_{4} = b{(}v{)} \ge 0\); and according to Eq. (10), \(0 \le \alpha_{f} \le 1\) and \(\alpha_{b} = 0\).

When \({{\varvec{\uplambda}}}_{2} = v\), we can obtain equation \(\frac{{{\text{d}}{{\varvec{\uplambda}}}_{2} }}{{{\text{d}}x}} = \frac{{{\text{d}}v}}{{{\text{d}}x}}\) and substitute Eqs. (1), (6) and (9) into it and simplify them:

$${(}w_{0}^{\prime} {(}v{) + }\rho_{5} {)}v^{2} { + }{\varvec{\lambda}}_{1} = 0$$
(11)

In this condition, the speed is held at a certain value \(v = v_{c}\) [\(v_{c}\) is the positive solution of Eq. (11)]. This indicates that the train operates at a constant speed with only partial traction force, which is called traction cruising (CR-T).

Condition 3: If \(0 < {{\varvec{\uplambda}}}_{2} < v\), it can be inferred that \(\rho_{1} > \rho_{2}\) and \(\rho_{3} > \rho_{4}\) according to (7) and (8), respectively. Then, according to (10), we can obtain \(\alpha_{f} = \alpha_{b} = 0\). Under these conditions, both the traction and braking forces of the train are zero, which is called coasting (CO).

Condition 4: If \({{\varvec{\uplambda}}}_{2} = 0\), similar to condition 2, we can obtain \(\rho_{1} > 0\) and \(\rho_{3} = \rho_{4} = 0\), corresponding to \(\alpha_{f} = 0\) and \(0 \le \alpha_{b} \le 1\).

When \({{\varvec{\uplambda}}}_{2} = 0\), we can construct the equation \(\frac{{{{\varvec{\uplambda}}}_{2} }}{v} = 0\) and substitute it into Eq. (9):

$$\rho_{5} v^{2} { - }\lambda_{1} = 0$$
(12)

If \(\rho_{5} = 0\), then Eq. (12) does not hold. If \(\rho_{5} > 0\), then according to Eq. (10), \(v = V\) and Eq. (12) may hold. The speed is held at the value \(v = V = \sqrt {\lambda_{1} /\rho_{5} }\), and the train operates at a constant speed with only partial braking force; this process is called braking cruising (CR-B).

Condition 5: If \({{\varvec{\uplambda}}}_{2} < 0\), as above, we can obtain \(\rho_{1} > 0\) and \(\rho_{4} > 0\), so \(\alpha_{f} = 0\) and \(\alpha_{b} = 1\). The train operates with maximum braking force (MB).

Under both CR-T and CR-B operating conditions, trains can maintain a constant speed; these conditions are collectively referred to as cruising (CR).

In summary, the optimal train operation strategy consists of a sequence of four operation regimes: MT, CR, CO and MB. Therefore, the continuous train operation is simplified into these four actions, which greatly reduces the dimension of the action space and makes the TTO easier to solve.
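For implementation, each of the four regimes can be mapped to a pair \((\alpha_{f}, \alpha_{b})\). The sketch below is one possible mapping under the assumption that, in the CR regime, the partial traction (CR-T) or partial braking (CR-B) coefficient is chosen so that the net force in Eq. (1) vanishes on the current gradient; the function names are illustrative.

```python
def regime_to_controls(action, v, x, f, b, w0, wi):
    """Map a discrete regime to (alpha_f, alpha_b) control coefficients.

    action: one of "MT", "CR", "CO", "MB".
    For CR, the coefficient is set so that the net force in Eq. (1) is zero,
    i.e., the speed is held constant (CR-T on flat/uphill, CR-B on steep downhill).
    """
    if action == "MT":
        return 1.0, 0.0
    if action == "MB":
        return 0.0, 1.0
    if action == "CO":
        return 0.0, 0.0
    # CR: balance the resistance with partial traction or partial braking
    resistance = w0(v) + wi(x)
    if resistance >= 0.0:                       # uphill / flat: traction cruising (CR-T)
        return min(resistance / f(v), 1.0), 0.0
    return 0.0, min(-resistance / b(v), 1.0)    # steep downhill: braking cruising (CR-B)
```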

2.2 The Markov decision process of train operation

The MDP is a classical sequential decision process and can be defined as \(\left\langle {\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma } \right\rangle\), where \(\mathcal{S}\) is the state space, \(\mathcal{A}\) is the action space, \(\mathcal{P}\) represents the state transfer function, \(\mathcal{R}\) is the reward function, and \(\gamma\) is the discount factor.

The agent and environment interact continuously, and the action used in each interaction affects not only the immediate reward but also, through future rewards, the subsequent states. At each step \(i\), an agent in state \(s_{i} \in \mathcal{S}\) samples an action \(a_{i} \in \mathcal{A}\), transitions to a new state \(s_{i + 1}\), and receives a reward \(r_{i}\); this state transition is denoted as \(s_{i} \mathop{\longrightarrow}\limits_{{r_{i} }}^{{a_{i} }}s_{i + 1}\).

Here, the TTO problem is transformed into an MDP by discretization. Common discretization methods include time discretization [49, 50], distance and velocity discretization [48], and distance discretization [18, 51].

The process of track discretization is shown in Fig. 1. Based on the gradient and the inflection points of the speed limit of the track, the track is discretized into H large sections, each with length \(l_{h}\), \(h = 1,2, \cdots ,H\). To ensure accuracy, each large section is further discretized. With a maximum distance discretization step of \(\Delta l\), each large section is discretized into \(m_{h} = {\text{ceil}} (l_{h} /\Delta l)\) small segments. The dimension of the discrete TTO problem is \(N = \sum\nolimits_{h = 1}^{H} {m_{h} }\).

Fig. 1
figure 1

The principle of track discretization

The advantage of this discretization method is that it ensures fixed properties (speed limit, slope and, if considered, curvature) for each small segment, and it resolves the problems of varying forces and speed constraints acting on the train within a segment.
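A minimal sketch of the discretization in Fig. 1, assuming the track is supplied as a list of large sections with fixed length, speed limit, and gradient; each section is split into \(m_{h} = {\text{ceil}}(l_{h}/\Delta l)\) small segments that inherit its properties.

```python
import math

def discretize_track(sections, dl=30.0):
    """Split each large section into ceil(l_h / dl) small segments (Fig. 1).

    sections: list of dicts with keys 'length' (m), 'v_limit' (m/s), 'gradient' (permille)
    Returns the list of small segments, each inheriting the fixed properties of its section.
    """
    segments = []
    for sec in sections:
        m_h = math.ceil(sec['length'] / dl)
        seg_len = sec['length'] / m_h
        for _ in range(m_h):
            segments.append({'length': seg_len,
                             'v_limit': sec['v_limit'],
                             'gradient': sec['gradient']})
    return segments  # dimension of the discrete TTO problem: N = len(segments)
```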

To account for the adaptability of the algorithm to temporary speed limits, this paper also takes the adjacent temporary speed limit information ahead of the train as important state information when defining the state of the agent. Assume that the number of temporary speed limit intervals is \(Num_{sl}\), that the positions of the temporary speed limit intervals are \([xsl_{j}^{start} ,xsl_{j}^{end} ],j = 1,2, \cdots ,Num_{sl}\), and that the corresponding speed limit values are \(vsl_{j}\). The state is then defined as follows:

$$s_{i} = (x_{i} ,v_{i} ,\Delta xsl_{i}^{{}} ,\Delta vsl_{i} ,T_{i} ,E_{i} ),s_{i} \in \mathcal{S}$$
(13)

where the speed difference from the temporary speed limit is \(\Delta vsl_{i} = vsl_{j} - v_{i}\).

The distance from the current position to the start of the next adjacent temporary speed limit interval is represented as follows:

$$\Delta xsl_{i}^{start} = \left\{ \begin{gathered} xsl_{j}^{start} - x_{i} \quad x_{i} < xsl_{{Num_{sl} }}^{start} \hfill \\ 0\quad \quad \quad \quad \;\;x_{i} \ge xsl_{{Num_{sl} }}^{start} \hfill \\ \end{gathered} \right.$$
(14)

\(T_{i}\) and \(E_{i}\) represent the running time and energy consumption, respectively, which can be calculated as follows:

$$\left\{ \begin{gathered} T_{i} = \sum\limits_{k = 1}^{i} {\int_{{x_{k - 1} }}^{{x_{k} }} {\frac{1}{v(x)}} } {\kern 1pt} {\text{d}}x \hfill \\ E_{i} = \sum\limits_{k = 1}^{i} {\int_{{x_{k - 1} }}^{{x_{k} }} {\alpha_{f} f(v){\text{d}}x} } \hfill \\ \end{gathered} \right.\quad k = 1, \cdots ,i\,\,{\text{and}}\,\,i = 1,2, \cdots N$$
(15)

In the interval \([0,X]\), the total trip time of the train is \(T_{N} \in [T_{\min } ,T_{\max } ]\), and the total energy consumption is \(E_{N} \in [E_{\min } ,E_{\max } ]\).

Notably, in fixed speed limit sections the speed limit is lower than the maximum speed limit and has the same spatial distribution characteristics as a temporary speed limit during train operation; therefore, this method is also applicable to these sections.
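The state vector of Eq. (13) can be assembled as follows. This is a sketch under the assumption that the temporary speed limit intervals are sorted by position and that both speed-limit features are set to zero once the train has passed the start of the last interval, following Eq. (14); the variable names are illustrative.

```python
import numpy as np

def build_state(x_i, v_i, T_i, E_i, speed_limits):
    """Assemble the state vector of Eq. (13).

    speed_limits: list of (x_start, x_end, v_limit) temporary speed limit intervals,
                  sorted by position along the track.
    """
    # distance and speed difference to the next adjacent temporary speed limit, Eq. (14)
    ahead = [(xs, vl) for (xs, xe, vl) in speed_limits if x_i < xs]
    if ahead:
        xs_next, vsl_next = ahead[0]
        dx_sl = xs_next - x_i
        dv_sl = vsl_next - v_i
    else:                      # past the start of the last temporary limit
        dx_sl, dv_sl = 0.0, 0.0
    return np.array([x_i, v_i, dx_sl, dv_sl, T_i, E_i], dtype=np.float32)
```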

Based on Sect. 2.1, the action space can be defined as:

$$\mathcal{A} = \{ MT,CR,CO,MB\}$$
(16)

The trip-time deviation and the energy consumption are important indicators for evaluating the train operation trajectory. Therefore, the objective function is defined as follows:

$$\mathop {\arg \min }\limits_{{\pi^{*} }} \;J(\pi ) = \{ J_{\Delta T} (\pi ),J_{E} (\pi )\}$$
(17)

where \(\pi\) is the train operation strategy; \(J_{E} = E_{N}\), \(J_{\Delta T} = \left| {T_{N} - T_{P} } \right|\), and \(T_{P}\) is the planned trip time.

Ideally, the total trip time should equal the planned trip time, but in actual train control scenarios there may be some deviation \(\Delta t\) between them, and a small \(\Delta t\) is allowed. The time-deviation objective is therefore simplified into a constraint:

$$\left| {T_{N} (\pi ) - T_{P} } \right| \le \Delta t$$
(18)

The TTO problem can thus be transformed into a single-objective optimization problem with the objective function:

$$\mathop {\arg \min }\limits_{{\pi^{*} }} \;J(\pi ) = J_{E} (\pi )$$
(19)

3 The safety reinforcement learning framework

In this section, the PPO algorithm is introduced, and a safe action rechoosing mechanism and a relaxed dynamic reward function are proposed to improve the performance of the algorithm.

3.1 The PPO algorithm

The PPO algorithm is a classic RL algorithm consisting of two networks, an actor network and a critic network, parameterized by weights \({{\varvec{\uptheta}}}\) and \({{\varvec{\upomega}}}\), respectively. The actor network generates the probability distribution over possible actions, which is used to choose the best action. The critic network assesses the value of the current state and guides the actor network to make better decisions. PPO optimizes a clipped surrogate objective function using mini-batch stochastic gradient ascent, which is given by

$$L\left( \theta \right) = {\hat{\mathbb{E}}}\left[ {\min \left( {\rho_{i} \left( \theta \right)A^{{\pi_{{\theta_{{{\text{old}}}} }} }} \left( {s_{i} ,a_{i} } \right),{\text{clip}}\left( {\rho_{i} \left( \theta \right),1 - \varepsilon ,1 + \varepsilon } \right)A^{{\pi_{{\theta_{{{\text{old}}}} }} }} \left( {s_{i} ,a_{i} } \right)} \right)} \right]$$
(20)

where \(\rho_{i} (\theta ) = \frac{{\pi_{\theta } (a_{i} |s_{i} )}}{{\pi_{{\theta_{{{\text{old}}}} }} (a_{i} |s_{i} )}}\) denotes the probability ratio between the updated and previous policies, and \(\varepsilon\) is a hyperparameter that defines the clipping range. \(A^{{\pi_{{\theta_{{{\text{old}}}} }} }} {(}s_{i} {,}a_{i} {)}\,\) is the advantage function estimated by generalized advantage estimation (GAE):

$$A^{{\pi_{{\theta_{{{\text{old}}}} }} }} ({\varvec{s}},{\varvec{a}}) = \delta_{i} + (\gamma \lambda )\delta_{i + 1} + \cdots + (\gamma \lambda )^{U - i} \delta_{U}$$
(21)

where \(\delta_{i} = r_{i} + \gamma V_{\omega } (s_{i + 1} ) - V_{\omega } (s_{i} )\), \(V_{\omega } (s_{i} )\) is the value of state \(s_{i}\), and \(r_{i}\) is the reward at time step \(i\). \(U\) denotes the mini-batch size, and \(\gamma\) and \(\lambda\) are the discount factor and the GAE parameter, respectively. \({\text{clip}} ( \cdot )\) is the clip function, which prevents the disastrous performance losses caused by the high variance inherent in policy gradient methods by optimizing the policy conservatively. If \(A^{{\pi_{{\theta_{{{\text{old}}}} }} }} {(}s_{i} {,}a_{i} {)} > 0\), \(\rho_{i} {(}{{\varvec{\uptheta}}}{)}\) is clipped at \(1 + \varepsilon\); conversely, if \(A^{{\pi_{{\theta_{{{\text{old}}}} }} }} {(}s_{i} {,}a_{i} {)} < 0\), \(\rho_{i} {(}{{\varvec{\uptheta}}}{)}\) is clipped at \(1 - \varepsilon\). The critic network minimizes the following value loss:

$$L(\omega ) = {\hat{\mathbb{E}}}\left[ {\left( {V_{\omega } (s_{i} ) - \hat{V}_{i} } \right)^{2} } \right],\quad {\text{where}}\;\hat{V}_{i} = \sum\limits_{j = i}^{U} {\gamma^{j - i} r_{j} }$$
(22)

Finally, the critic network and the actor network are updated by:

$${{\varvec{\upomega}}}_{{{\text{new}}}} \leftarrow {{\varvec{\upomega}}}_{{{\text{now}}}} - \alpha_{c} \cdot \nabla_{\omega } L({\varvec{\omega}})$$
(23)
$${{\varvec{\uptheta}}}_{{{\text{new}}}} \leftarrow {{\varvec{\uptheta}}}_{{{\text{now}}}} + \alpha_{a} \cdot \nabla_{\theta } L({\varvec{\theta}})$$
(24)
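A compact PyTorch-style sketch of the update in Eqs. (20)–(24). It assumes separate actor and critic networks with their own optimizers, treats the last state of the mini-batch as terminal when bootstrapping the GAE of Eq. (21), and omits data collection and the epoch loop; it is not the authors' implementation.

```python
import torch

def ppo_update(actor, critic, opt_actor, opt_critic,
               states, actions, old_log_probs, rewards,
               gamma=0.99, lam=0.95, eps=0.2):
    """One mini-batch PPO update following Eqs. (20)-(24)."""
    values = critic(states).squeeze(-1)                      # V_omega(s_i)
    with torch.no_grad():
        # GAE, Eq. (21): delta_i = r_i + gamma * V(s_{i+1}) - V(s_i);
        # the value after the last state of the batch is taken as 0 (assumed terminal)
        next_values = torch.cat([values[1:], values.new_zeros(1)])
        deltas = rewards + gamma * next_values - values
        adv = torch.zeros_like(deltas)
        gae = 0.0
        for i in reversed(range(len(deltas))):
            gae = deltas[i] + gamma * lam * gae
            adv[i] = gae
        returns = adv + values                                # critic targets

    # clipped surrogate objective, Eq. (20)
    dist = torch.distributions.Categorical(logits=actor(states))
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)              # rho_i(theta)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv)
    actor_loss = -surrogate.mean()                            # ascend L(theta), Eq. (24)

    critic_loss = ((values - returns) ** 2).mean()            # value loss, Eqs. (22)-(23)

    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()
```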

3.2 Safe action rechoosing mechanism

In practical applications, the environment is often bounded. To ensure that each state transition satisfies the environmental boundary constraints, traditional RL punishes state–action pairs that exceed the environmental limits and keeps the state unchanged, i.e., \(s_{i} \mathop{\longrightarrow}\limits_{{r_{i} }}^{{a_{i} }}s_{i}\), to correct unsafe states. This approach is called the unsafe state maintenance mechanism. Collecting data through trial and error to obtain the optimal strategy in this way can incur significant costs and may even damage the learning system in application. Therefore, by analyzing the characteristics of the TTO problem, this paper presents an S-PPO algorithm with the SARM and the RDRM based on PPO. Unsafe actions discovered during interactions with the environment are rechosen to improve the learning efficiency and stability of the algorithm. To ensure the safety of train operation, the following safety law must be met:

$$v(x) < V_{\max } (x),x \in [x_{i} ,x_{i + 1} ]$$
(25)

where \(V_{\max } (x)\) is the actual speed limit curve generated by the automatic train protection (ATP) system.

However, in reality, when the train speed approaches the speed limit, an inappropriate action may violate the safety law. Two specific situations can be described:

Situation 1: In a constant speed limit interval, as shown in Fig. 2, when the train’s speed \(v(x_{i} )\) approaches the speed limit \(V_{\max } (x_{i} )\) in a state \(s_{i}\) and the action is \(a_{i} = \{ MT\}\), i.e., \(s_{i} \mathop{\longrightarrow}\limits^{MT}s_{i + 1}^{MT}\), then \(v_{{}}^{MT} (x_{i + 1} ) > V_{\max } (x_{i + 1} )\), which violates safety law (25). However, if the action is rechosen as \(a_{i} \in \{ CR,CO,MB\}\) at point A (\(V_{\max } (x_{A} ) - \Delta v \le v(x_{A} ) < V_{\max } (x_{A}^{{}} )\), where \(\Delta v\) is the allowable error for approaching the speed limit), then the corresponding state transitions become \(s_{i} \mathop{\longrightarrow}\limits^{CR,CO,MB}s_{i + 1}^{CR} ,s_{i + 1}^{CO} ,s_{i + 1}^{MB}\) and the speed satisfies condition (25), which ensures the safety of the train.

Fig. 2
figure 2

Rechoosing plan for a constant speed limit interval

Situation 2: In a braking speed limit interval, as shown in Fig. 3, when the train’s speed \(v(x_{i} )\) approaches the speed limit \(V_{\max } (x_{i} )\) in a state \(s_{i}\) and the action is \(a_{i} \in \{ MT,CR,CO\}\), the speed does not decrease within the interval, so the train transitions to the next state \(s_{i + 1} \in \{ s_{i + 1}^{MT} ,s_{i + 1}^{CR} ,s_{i + 1}^{CO} \}\) with a corresponding speed \(v^{*} > V_{\max } (x_{i + 1} ),v^{*} \in \{ v_{{}}^{MT} ,v_{{}}^{CR} ,v_{{}}^{CO} \}\), which does not comply with safety law (25). If the actions at points A, B, and C are rechosen, it can be ensured that \(s_{i + 1} = \{ s_{i + 1}^{MB} \}\) meets the requirements of (25).

Fig. 3
figure 3

Rechoosing plan for a braking speed limit interval

Specifically, in S-PPO, as shown in Fig. 4, the probability distribution \(Pdist_{\mathcal{A}}\) over the actions in action space \(\mathcal{A}\) is generated by the policy network \(\pi ( \cdot |s_{i} ;{{\varvec{\uptheta}}}_{now} )\). The action \(a_{i}\) sampled from \(Pdist_{\mathcal{A}}\) is used to interact with the environment. Environmental knowledge is used to evaluate the safety of state–action pairs via the safety judgment coefficient \(\xi_{sd}\):

$$\xi_{sd} = \left\{ \begin{gathered} 0\quad v(x) \le V_{\max } (x) \hfill \\ 1\quad \,v(x) > V_{\max } (x) \hfill \\ \end{gathered} \right.$$
(26)
Fig. 4
figure 4

The safe action rechoosing mechanism

If \(\xi_{sd} = 1\), continuing to take action \(a_{i}\) would cause the train to exceed the speed limit and violate the safety law, so the SARM is triggered. To prevent the new action \(a_{i}^{\prime}\) from coinciding with the original action \(a_{i}\), the unsafe action is removed and the action space is reconstructed as \(\mathcal{A}^{\prime} = \mathcal{A} - \{a_{i}\}\). The probability distribution over the remaining actions is then normalized to obtain the reconstructed distribution \(Pdist_{{\mathcal{A}^{\prime} }}\), from which the new action \(a_{i}^{\prime}\) is sampled. Thus, when the SARM is executed, new actions are sampled from \(Pdist_{{\mathcal{A}^{\prime} }}\), so actions with higher probabilities have more opportunities to be reselected. The pseudocode of the safe action rechoosing mechanism is shown in Algorithm 1.

Algorithm 1
figure a

SARM
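A sketch of the SARM in Algorithm 1: the safety coefficient \(\xi_{sd}\) of Eq. (26) is evaluated with a one-step environment prediction, the unsafe action is removed from the action space, and the remaining probabilities are renormalized before resampling. The hook `predict_next_speed` and the fallback to MB when every action is judged unsafe are assumptions for illustration, not part of the original algorithm.

```python
import numpy as np

ACTIONS = ["MT", "CR", "CO", "MB"]

def safe_action_rechoose(probs, state, predict_next_speed, v_limit_next):
    """Safe action rechoosing mechanism (SARM, Algorithm 1).

    probs: probability distribution over ACTIONS from the policy network
    predict_next_speed(state, action): environment knowledge, returns the peak
        speed reached on the next segment if the action is applied
    v_limit_next: ATP speed limit over the next segment
    """
    probs = np.asarray(probs, dtype=np.float64).copy()
    candidates = list(range(len(ACTIONS)))
    while candidates:
        # sample from the (re)normalized distribution over the remaining actions
        idx = np.random.choice(candidates, p=probs[candidates] / probs[candidates].sum())
        xi_sd = 1 if predict_next_speed(state, ACTIONS[idx]) > v_limit_next else 0  # Eq. (26)
        if xi_sd == 0:
            return ACTIONS[idx]        # safe action, no rechoosing needed
        candidates.remove(idx)         # remove the unsafe action, A' = A \ {a_i}
    return "MB"                        # assumed fallback: maximum braking
```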

3.3 Relaxed dynamic reward function construction

The learning performance of an RL agent depends largely on the design of the reward function. In [16], a dense reward strategy was designed that punishes unsafe operations and encourages the release of air brakes under specific conditions through a positive reward. In [30], a sparse reward is constructed in which the system gives the agent a large immediate reward when the train reaches the final position; otherwise, the instantaneous reward is always 0. Lin et al. [52] and Huang et al. [53] adopt a weighted-average approach to incorporate multiple objectives into the reward function.

This paper combines the advantages of sparse and dense rewards to design a reward function, called the RDRM, which includes a relaxed sparse reward and a dynamic dense reward. Because the planned trip time is restricted to a narrow range, it is difficult for the algorithm to collect samples that meet the constraint during learning. The relaxed sparse reward function increases the probability of the agent completing the task by relaxing the time constraint of the train operation plan, which allows the agent to obtain more successful samples and accelerates learning of the optimal strategy. Furthermore, the dynamic dense reward function is established based on the average planned speed of the train, which balances the contributions of the time reward and the energy consumption reward according to the state of the train and provides better feedback.

3.3.1 The relaxed sparse reward

In the iterative learning process, exploring successful strategies requires meeting the time constraint \(\left| {T_{N} (\pi ) - T_{P} } \right| \le \Delta t\). A small \(\Delta t\) makes it difficult for the algorithm to obtain successful experience with which to train the learning network, seriously affecting learning efficiency. Therefore, it is necessary to relax the time constraint to increase the probability of acquiring successful strategies during exploration.

To analyze the relationship between planned trip time and energy consumption, consider several typical optimal strategies \(\pi_{k}^{*} ,\;k = 1,2,3,4\), each satisfying \(T_{N} (\pi_{k}^{*} ) = T_{P}^{k}\) with \(T_{P}^{1} < T_{P}^{2} < T_{P}^{3} < T_{P}^{4}\). The trajectories are shown in Fig. 5 and Table 1. According to Eq. (15), the train’s energy consumption can be expressed as:

$$J_{E} = E_{N} = E_{MT} + E_{CR - T} = \sum\limits_{i = 1}^{m} {\int_{{x_{i - 1} }}^{{x_{i} }} {f(v){\text{d}}x} } + \sum\limits_{j = 1}^{n} {\int_{{x_{j - 1} }}^{{x_{j} }} {\alpha_{f} f(v){\text{d}}x} } \; = \sum\limits_{i = 1}^{m} {\int_{{t_{i - 1} }}^{{t_{i} }} {vf(v){\text{d}}t} } + \sum\limits_{j = 1}^{n} {\int_{{t_{j - 1} }}^{{t_{j} }} {\alpha_{f} vf(v){\text{d}}t} }$$
(27)

where \(E_{MT}\) and \(E_{CR - T}\) are the maximum traction energy consumption and the traction cruising energy consumption, respectively, and m and n are the numbers of sections using maximum traction and traction cruising, respectively. The trip times for which the trains maintain MT in the different strategies satisfy \(T_{MT}^{1} = T_{MT}^{2} = T_{MT}^{3} > T_{MT}^{4}\), and the corresponding energy consumptions satisfy \(E_{MT}^{1} = E_{MT}^{2} = E_{MT}^{3} > E_{MT}^{4}\). In the CR-T stage, \(T_{CR - T}^{1} > T_{CR - T}^{2} > T_{CR - T}^{3} = T_{CR - T}^{4}\) and \(E_{CR - T}^{1} > E_{CR - T}^{2} > E_{CR - T}^{3} = E_{CR - T}^{4}\). It is therefore easy to deduce that \(J_{E}^{{\pi_{1}^{*} }} > J_{E}^{{\pi_{2}^{*} }} > J_{E}^{{\pi_{3}^{*} }} > J_{E}^{{\pi_{4}^{*} }}\). Furthermore, \(T_{N} (\pi^{*} )\) is negatively correlated with the average speed \(\overline{v}_{{\pi_{{}}^{*} }}\) of the strategy, and \(w_{0}\) increases with \(v\): the higher the speed is, the more energy the train must consume to overcome resistance.

Fig. 5
figure 5

Schematic diagram of the optimal strategy trajectories

Table 1 Correspondence between the optimal strategy and trajectory

As a result, for an optimal train operation strategy \(\pi^{*}\) with \(T_{N} (\pi^{*} ) = T_{P}\), \(T_{P} \in [T_{\min } ,T_{\max } ]\), the planned trip time \(T_{P}\) is negatively correlated with the energy consumption \(E_{N} (\pi^{*} )\) of the optimal strategy for the TTO problem on a straight track. We can therefore write:

$$\mathop {\arg \min }\limits_{{\pi^{*} }} \;J(\pi ) = \mathop {\arg \min }\limits_{{\pi^{*} }} J_{E} (\pi ),\quad T_{N}^{{^{{}} }} (\pi ) \in [T_{\min } ,T_{P} ]$$
(28)

When \(T_{N}^{{\pi^{*} }} \in [T_{\min } ,T_{P} ]\), \(J_{E}^{{\pi_{{}}^{*} }} (T_{N}^{{\pi^{*} }} )\) decreases monotonically, i.e., \(\min (J_{E}^{{\pi_{{}}^{*} }} ) = J_{E}^{{\pi_{{}}^{*} }} (T_{P} )\); for \(T_{N}^{{\pi^{*} }} \in (T_{P} ,T_{\max } ]\), a Gaussian function is adopted. Therefore, by relaxing Eq. (18), the train trip-time reward function \(R_{T}\) is constructed as shown in Fig. 6:

$$R_{T} = \left\{ \begin{gathered} 500\quad \quad \quad \quad \;\quad T_{N} \in [T_{\min } ,T_{P} ] \hfill \\ 500e^{{ - (T_{N} - T_{P} )^{2} /20}} \;\quad T_{N} \in (T_{P} ,T_{\max } ] \hfill \\ \end{gathered} \right.$$
(29)
Fig. 6
figure 6

The train trip reward function curve

The energy consumption reward function \(R_{E}\) reflects the energy consumption level of the train.

$$R_{E} = 500e^{{ - (E_{N} - E_{\min } )/(E_{\max } - E_{\min } )}}$$
(30)

Therefore, the relaxed sparse reward function is represented as follows:

$$sparse\_r_{i} = \left\{ \begin{gathered} 0\;\quad \quad \quad \quad \;\quad \quad \quad \;i < N \hfill \\ \beta R_{T} + (1 - \beta )R_{E} \quad \quad i = N \hfill \\ \end{gathered} \right.$$
(31)

where \(\beta\) is the weighting factor that balances the trip-time reward and the energy consumption reward. An agent guided by sparse rewards can, in principle, be highly consistent with the task objective; thus, a large value of \(sparse\_r_{N}\) is needed to highlight the contribution of the sparse reward. In the TTO, the planned trip time constraint must be strictly followed, so more attention is paid to the time reward. Therefore, this paper takes \(\beta = 0.6\).
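The relaxed sparse reward of Eqs. (29)–(31) can be written directly as follows; \(T_{P}\), \(E_{\min }\), and \(E_{\max }\) are assumed to be known from the timetable and line data, and \(\beta = 0.6\) as chosen above.

```python
import math

def relaxed_sparse_reward(i, N, T_N, E_N, T_P, E_min, E_max, beta=0.6):
    """Relaxed sparse reward, Eqs. (29)-(31); nonzero only at the terminal step i = N."""
    if i < N:
        return 0.0
    # trip-time reward R_T, Eq. (29): flat inside the relaxed window, Gaussian beyond T_P
    if T_N <= T_P:
        R_T = 500.0
    else:
        R_T = 500.0 * math.exp(-((T_N - T_P) ** 2) / 20.0)
    # energy consumption reward R_E, Eq. (30)
    R_E = 500.0 * math.exp(-(E_N - E_min) / (E_max - E_min))
    return beta * R_T + (1.0 - beta) * R_E   # Eq. (31)
```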

3.3.2 The dynamic dense reward

A dense reward can enhance the exploration ability of the algorithm [39]. Based on the two objectives of trip time and energy consumption, we design an average dense reward function \(ave\_r\):

$$ave\_r_{i} = \left\{ \begin{gathered} \beta r_{t}^{i} + (1 - \beta )r_{e}^{i} \quad \quad \;\,i < N \hfill \\ \beta R_{T} + (1 - \beta )R_{E} \quad \;\;i = N \hfill \\ \end{gathered} \right.$$
(32)

Here, the dense time reward function \(r_{t}^{{}}\) should satisfy the planned travel time constraint.

$$r_{t}^{i} = e^{{ - (|T_{i} - T_{P} |)/T_{P} }}$$
(33)

The dense energy consumption reward function \(r_{e}^{{}}\) should guide the agent as close as possible to the energy-optimal strategy:

$$r_{e}^{i} = e^{{ - (|E_{i} - E_{{\overline{v}_{p} }} |)/E_{{\overline{v}_{p} }} }}$$
(34)

where the planned average speed is \(\overline{v}_{p} = X/T_{P}\). \(E_{{\overline{v}_{p} }}\) is the benchmark energy consumption, generated when the train accelerates to \(\overline{v}_{p}\) with MT and then completes the trip using CR, CO, and MB. Making the benchmark energy consumption appropriately smaller than \(E_{N} (\pi^{*} )\) is more beneficial for the learning process. Notably, the dense reward function must be able to guide the agent toward the optimal goal.

The larger the value of \(r_{t}^{i}\) is, the more it encourages the agent to accelerate and save time. Similarly, the larger \(r_{e}^{i}\) is, the more it encourages the agent to slow down and reduce energy consumption. In the initial stage, increasing the contribution of \(r_{t}^{i}\) helps the agent reach a higher speed; in contrast, in the final stage, when the speed \(v \to 0\), increasing the contribution of \(r_{e}^{i}\) is more advantageous. Moreover, considering the planned trip time constraint, a linear balance benchmark function \(Fbal(x)\) is constructed based on the planned average speed \(\overline{v}_{p}\):

$$Fbal(x) = - \frac{20}{X}x + \frac{X}{{T_{P} }} + 5$$
(35)

The dynamic balance coefficients \(\xi_{e}\) and \(\xi_{t}\) are used to balance the time reward and the energy consumption reward:

$$\left\{ \begin{gathered} \xi_{e} = \frac{{v_{i} - Fbal(x_{i} )}}{{V_{\max } (x_{i} )}} + 0.5 \hfill \\ \xi_{t} = - \frac{{v_{i} - Fbal(x_{i} )}}{{V_{\max } (x_{i} )}} + 0.5 \hfill \\ \end{gathered} \right.\quad \quad$$
(36)

The dynamic dense rewards are represented as follows:

$$dense\_r_{i} = \left[ {\begin{array}{*{20}c} {\xi_{e} } & {\xi_{t} } \\ \end{array} } \right]\left[ \begin{gathered} r_{e}^{i} \hfill \\ r_{t}^{i} \hfill \\ \end{gathered} \right] = \xi_{e} r_{e}^{i} + \xi_{t} r_{t}^{i}$$
(37)

Because the SARM rechooses actions that exceed the boundary, the agent always transitions from its current state to a new state at each iteration rather than remaining in the current state. During training, the agent completes an episode after N iterations. Therefore, when i = N, only the more accurate sparse reward is used to guide the learning process. The relaxed dynamic reward function is

$$Reward_{i} = \left\{ \begin{gathered} dense\_r_{i} \quad \;{\kern 1pt} \,i < N \hfill \\ sparse\_r_{i} \;\quad i = N \hfill \\ \end{gathered} \right.$$
(38)
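A sketch of the dense reward of Eqs. (33)–(38): the dense time and energy rewards are weighted by the dynamic balance coefficients of Eq. (36), which are driven by the linear benchmark of Eq. (35). Here `E_bench` stands for the benchmark energy consumption \(E_{\overline{v}_{p}}\) and `v_max_i` for \(V_{\max }(x_{i})\); both are assumed to be precomputed.

```python
import math

def dynamic_dense_reward(x_i, v_i, T_i, E_i, v_max_i, X, T_P, E_bench):
    """Dynamic dense reward dense_r_i, Eqs. (33)-(37), used for steps i < N."""
    r_t = math.exp(-abs(T_i - T_P) / T_P)            # dense time reward, Eq. (33)
    r_e = math.exp(-abs(E_i - E_bench) / E_bench)    # dense energy reward, Eq. (34)
    fbal = -20.0 / X * x_i + X / T_P + 5.0           # linear balance benchmark, Eq. (35)
    xi_e = (v_i - fbal) / v_max_i + 0.5              # dynamic balance coefficients, Eq. (36)
    xi_t = -(v_i - fbal) / v_max_i + 0.5
    return xi_e * r_e + xi_t * r_t                   # Eq. (37)

def relaxed_dynamic_reward(i, N, dense_r, sparse_r):
    """RDRM, Eq. (38): dense feedback during the trip, sparse feedback at the end."""
    return sparse_r if i == N else dense_r
```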

The pseudocode of S-PPO is shown in Algorithm 2.

Algorithm 2
figure b

The pseudocode of S-PPO

4 Experimental simulation and analysis

We implement the proposed S-PPO method with three fully connected hidden layers of 120 hidden units each; the output layers of the policy network and the value network have four and one units, respectively. The simulation is based on the line data from Jiugong Station to Yizhuangqiao Station on the Beijing Subway Yizhuang Line [28]. Table 2 shows the static speed limit data between Jiugong Station and Yizhuangqiao Station, and the gradients of the track are shown in Fig. 7. The proposed algorithm is implemented in Python on a computer with an Intel Core i7-10700 CPU @ 2.90 GHz and 32 GB RAM running Windows 10 x64.

Table 2 Static speed limit data of the track
Fig. 7
figure 7

The slope data of the track

The parameters used in the algorithm are shown in Table 3. A smaller discretization step \(\Delta l\) can improve the accuracy of the results but also makes the calculation more complex. Therefore, this paper takes \(\Delta l = 30\) m, with which the track is discretized into N = 73 subsegments, and sets \(\Delta t = 1\;{\text{s}}\).

Table 3 The values of the parameters

We follow the hyperparameters recommended for PPO [26] and set the clipping rate \(\varepsilon = 0.2\). We set the discount factor to a relatively high value, \(\gamma = 0.99\), which encourages the agent to focus more on long-term rewards. In addition, we set the mini-batch size to \(U = 40\) and the number of epochs to \(n\_epoch = 5\). As the PPO algorithm is not very sensitive to the learning rate, we keep the learning rates of the critic and actor networks equal and compare learning rates of 0.0001, 0.0003, and 0.0005 in the experiment. The experimental results are shown in Fig. 8. When the learning rate is 0.0001, the convergence speed is relatively slow. The larger learning rate of 0.0005 improves the convergence speed but causes oscillations in the convergence curve in the later stages of iteration. Therefore, we choose a moderate learning rate of 0.0003, which achieves a fast convergence speed while maintaining a stable convergence curve.

Fig. 8
figure 8

The convergence curves under different learning rates
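For reference, the hyperparameters discussed in this section and in Table 3 can be collected into a single configuration; the dictionary layout below is only an illustrative convention.

```python
# Hyperparameters used in the simulations (Sect. 4); a plain configuration sketch.
HYPERPARAMS = {
    "hidden_layers": 3,       # fully connected hidden layers
    "hidden_units": 120,
    "clip_epsilon": 0.2,      # PPO clipping rate
    "gamma": 0.99,            # discount factor
    "mini_batch_size": 40,    # U
    "n_epoch": 5,
    "lr_actor": 3e-4,         # same learning rate for actor and critic
    "lr_critic": 3e-4,
    "delta_l_m": 30,          # distance discretization step (m), N = 73 segments
    "delta_t_s": 1,           # allowable trip-time deviation (s)
    "max_episodes": 5000,     # Max_ep
}
```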

Effectiveness Test: We compare the optimization performance of S-PPO and the traditional PPO without safety protection measures on the TTO problem to verify the effectiveness of the SARM. A comparative experiment is conducted under three different reward functions, namely, the sparse reward function (31), the average dense reward function (32), and the relaxed dynamic reward function (38). The combination numbers for the algorithms and rewards are shown in Table 4. Furthermore, we conduct a statistical analysis of the unsafe action counts of S-PPO and PPO with the relaxed dynamic reward (Max_ep = 5000) to further demonstrate the effectiveness of the safe action rechoosing mechanism. In addition, we compare the performance of S-PPO with two other effective train operation methods (i.e., A3C [27] and DQN [16]), as the train operation processes in these methods are also modeled as MDPs.

Table 4 The combination table numbers for the algorithms and reward functions

Universality Test: Generally, the most important factor affecting the train trajectory is the maximum speed constraint, which determines the basic shape of the train operation trajectory. In practical applications, trains often need to operate on speed-limited tracks with different spatial characteristics. Therefore, to verify the performance of S-PPO on tracks with different speed limits, we randomly add speed-limited sections with different characteristics to the original track to test its adaptability.

4.1 Effectiveness test

Because it lacks a safety protection mechanism, the traditional PPO uses soft constraints to handle unsafe behavior such as exceeding environmental boundaries. Therefore, a penalty value needs to be set for such behavior; in this experiment, the penalty value \(r_{{{\text{penalty}}}} = - 1\) is chosen. The reward convergence curves and the train trajectories of the S-PPO and PPO algorithms under the three different rewards are shown in Figs. 9 and 10, respectively. Table 5 presents the energy consumption, total trip time, and time deviation with respect to the planned trip time for the S-PPO and PPO algorithms under the three different rewards.

1.

    Figure 9 shows that the three combinations ①, ②, and ③ of the S-PPO algorithm converge to better reward values than the corresponding three combinations ④, ⑤, and ⑥ of the PPO algorithm, and ① and ② converge faster and more smoothly than ④ and ⑤, respectively. Although ③ converges more slowly and less stably than ⑥, it can jump out of local optima in the later stages of iteration; the resulting oscillation requires more iterations (exceeding Max_ep) to reconverge to a stationary state, but it ultimately obtains a better train operation control model. The three combinations ④, ⑤ and ⑥ all show significant oscillations in the early stages of training. This is because PPO has no safe action protection mechanism: at the beginning of training, the learning system has not yet formed a relatively stable strategy model, so unsafe actions occur more easily near the environmental boundary and damage the learning system. The three S-PPO combinations with the SARM, ①, ②, and ③, effectively avoid the disturbance of unsafe behavior during training, which enables the algorithm to better balance exploration and exploitation and improves learning efficiency and the smoothness of the training process.

    In addition, the convergence curves of ① and ④ are the fastest and smoothest, followed by ② and ⑤, while the convergence of ③ and ⑥ is slower and fluctuates more. Therefore, compared with the average dense reward and the sparse reward, the relaxed dynamic reward can better adjust the scale of feedback according to the information of different state–action pairs, enabling the algorithm to obtain sample information faster and more effectively and promoting its learning process.

2.

    Figure 10 shows that the three trajectories of the S-PPO algorithm, ①, ②, and ③, involve fewer regime changes and are smoother than the three trajectories of the PPO algorithm, ④, ⑤, and ⑥. Trajectories ①, ②, and ③ quickly increase the train speed with MT in the early stage and then maintain the speed at a reasonable level through MT, CR, and CO adjustments. Afterward, coasting is adopted over a long stretch of track. Finally, when switching to MB, the speeds of ① and ② are both below 47 km/h; ③ starts braking at 53.17 km/h but undergoes several transitions between MB and CO until stopping. This indicates that for these three trajectories, less kinetic energy is dissipated by braking and more kinetic energy is used to overcome resistance. In contrast, the three trajectories ④, ⑤, and ⑥ switch regimes more frequently. Over a relatively long stretch of track, the train maintains a high speed through MT and CO, and the distance traveled after switching to continuous CO is shorter. The speeds of these three trajectories exceed 54 km/h when switching to MB, which means that more energy is dissipated by braking. Therefore, the energy consumption of trajectories ①, ②, and ③ is lower than that of ④, ⑤, and ⑥, respectively.

3.

    This is also supported by Table 5, in which three important indicators, the energy consumption \(E_{N}\) (kWh), the total trip time \(T_{N}\) (s), and the time deviation \(J_{\Delta t} = \left| {T_{N} - T_{P} } \right|\) (s), are used to evaluate the train operation trajectory. The energy consumption of the train trajectories obtained by the S-PPO algorithm with the three rewards is 21.13 kWh, 21.47 kWh and 22.06 kWh, and the corresponding trip-time deviations are very small, at 0.06 s, 0.07 s and 0.21 s. The PPO algorithm yields energy consumptions of 22.32 kWh, 22.41 kWh and 22.61 kWh, with trip-time deviations of 0.04 s, 0.57 s and 0.02 s, respectively. The time deviations of the control strategies obtained by S-PPO and PPO with the different rewards are all less than 1 s, which is within the allowable range. Compared with PPO, S-PPO with the relaxed dynamic reward, the average dense reward and the sparse reward reduces energy consumption by 5.63%, 4.37%, and 2.49%, respectively. Especially for the combination of S-PPO and the relaxed dynamic reward, this is a considerable improvement in energy efficiency on a line with a total length of 1975 m. This further demonstrates that algorithms with relaxed dynamic rewards have better exploration ability than those with sparse or average dense rewards.

4.

    The bar chart of the average unsafe action counts of S-PPO and PPO over ten experiments is shown in Fig. 11, where the parameter \(\chi\) is the number of unsafe actions in an episode. It can be seen that S-PPO has a low unsafe action count (\(\chi \le 5\)) in 2209 episodes, accounting for 44.18%, much higher than PPO's 622 episodes (12.44%). When \(5 < \chi \le 10\), S-PPO has slightly more episodes than PPO, with 2598 episodes (51.96%) versus 2248 episodes (44.96%). At high unsafe action counts (\(10 < \chi \le 15\), \(15 < \chi \le 20\) and \(\chi > 20\)), the number of episodes for S-PPO is much lower than that for PPO. As a result, S-PPO maintains a lower unsafe action count per episode throughout the entire iteration process, while PPO does the opposite. This is also supported by Fig. 12, which shows the moving average of the unsafe action counts over ten experiments for S-PPO and PPO. From Fig. 12, it can be seen that the unsafe action count of PPO keeps oscillating at a high level, whereas the unsafe action count of S-PPO undergoes severe oscillations before episode 1000 and then remains at a relatively low, steady level.

    The higher proportion of low unsafe action counts and the lower proportion of high unsafe action counts indicate that the SARM, which effectively limits unsafe actions to a low level and protects the learning process, is superior to the soft-constraint method based on a penalty mechanism.

5.

    The convergence curves and train operation trajectories of the S-PPO, A3C, and DQN algorithms are shown in Figs. 13 and 14, respectively. As shown in Fig. 13, compared with the A3C and DQN algorithms, the S-PPO algorithm exhibits quicker and smoother convergence and achieves superior reward values. In Fig. 14, the trajectory of the S-PPO algorithm minimizes the number of MT and MB applications and effectively reduces the braking duration through extensive coasting. The performance metrics for S-PPO, A3C, and DQN are presented in Table 6. The S-PPO algorithm achieves a trip time of 130.06 s on the track, within only 0.06 s of the planned trip time. Its energy consumption is 21.13 kWh, the lowest among the three algorithms. These results indicate that S-PPO exhibits superior performance on TTO problems.

Fig. 9
figure 9

The convergence curves of S-PPO and PPO based on the relaxed dynamic reward, the average dense reward and the sparse reward

Fig. 10
figure 10

The train operation trajectories of S-PPO and PPO based on the relaxed dynamic reward, the average dense reward and the sparse reward

Table 5 The results of the energy consumption \(E_{N}\), the total trip time \(T_{N}\), and the time deviation \(J_{\Delta t}\) with respect to the planned trip time for the S-PPO and PPO algorithms for three different rewards
Fig. 11
figure 11

The result of unsafe action counts of S-PPO and PPO

Fig. 12
figure 12

The comparison of average unsafe action counts of S-PPO and PPO

Fig. 13
figure 13

The convergence curve comparison among the three algorithms

Fig. 14
figure 14

The train operation trajectory comparison among the three algorithms

Table 6 The performance metrics of S-PPO, A3C, and DQN

In summary, the combination of the relaxed dynamic rewards and the SARM enhances the exploration and convergence capabilities of the S-PPO algorithm, which helps it obtain better train trajectories in TTO problems.

4.2 Universality test

Given that the locations of speed limits and the shapes of speed limit curves significantly affect train operation trajectories, we develop four different speed limit curves to assess the adaptability of S-PPO. The speed limit information is shown in Table 7. Curves ① and ② each impose a single speed limit, near the start and near the end of the track, respectively, while ③ and ④ adopt different combinations of speed limits. The train trajectories obtained by the S-PPO, A3C, and DQN algorithms under the four different speed limits are depicted in Fig. 15, and the corresponding performance metrics are compiled in Table 8.

Table 7 Speed limit information for the track
Fig. 15
figure 15

The train operation trajectories under various speed limits using S-PPO, A3C, and DQN

Table 8 Performance data under different speed limits

As shown in Fig. 15, the train operation trajectories of the S-PPO, A3C, and DQN algorithms are presented for the four tracks with varying speed limits, and the corresponding performance metrics are outlined in Table 8. The trajectories of S-PPO exhibit consistent characteristics: before reaching the start of a speed-limited section, the train increases its speed as much as possible and then effectively employs CR and CO to maintain a reasonable speed level, significantly reducing the energy losses due to MB. This is consistent with the characteristics of the optimal control sequence of the train. S-PPO consumes 25.96, 21.86, 25.57, and 23.81 kWh on ①, ②, ③, and ④, respectively, demonstrating better energy-saving efficiency than A3C and DQN, with a sufficiently small trip-time deviation (\(J_{\Delta t} \le 0.2\;{\text{s}}\)). The train operation trajectories of A3C on the four tracks involve fewer regime transitions, and its time deviation on ①, ③, and ④ is smaller than that of S-PPO, albeit by a narrow margin of 0.04 s; nevertheless, its energy-saving efficiency trails that of S-PPO significantly. DQN has the worst energy-saving performance and trip-time deviation.

Therefore, S-PPO can adapt well to different speed limits and obtain satisfactory train operation trajectories, with strong universality.

5 Conclusion

This paper presents an S-PPO algorithm to address the problem that the soft constraints in reinforcement learning cannot fully protect the learning process from the interference and damage caused by actions that exceed safety limits. The SARM is designed to ensure that the agent remains within safe boundaries during the learning process and is combined with the RDRM to enhance the exploration ability of the algorithm. The simulation results confirm that these mechanisms significantly improve algorithm stability, making the learning process more efficient while meeting environmental safety constraints. In addition, a series of experiments demonstrates that S-PPO performs well under various speed limit conditions and achieves effective train operation strategies, indicating its strong generalization ability and adaptability to energy-saving optimization challenges in diverse environments. It holds broad application prospects in practical scenarios.

In future research, we will continue to explore the adaptability of the S-PPO algorithm under dynamic speed limit conditions. Specifically, we will investigate how to optimize the algorithm’s performance to meet speed limit requirements in emergency situations as speed limits continuously change. Additionally, we will also explore the application of the S-PPO algorithm in other areas with security requirements, such as drone navigation and intelligent transportation. Through these studies, we aim to provide more effective and stable methods for the application of reinforcement learning in security-critical areas.