1 Introduction

As a type of new energy vehicle, fuel cell hybrid vehicles (FCHVs) have gained increasing attention worldwide. In China in particular, fuel cell hybrid buses (FCHBs) have developed rapidly in recent years. FCHBs use fuel cell systems (FCSs) and batteries as two power sources, which gives rise to an energy management problem. An energy management strategy (EMS) is therefore necessary for FCHBs; it determines the power distribution between the FCS and the battery and thereby influences the fuel economy of the FCHB as well as related factors such as the performance degradation of the FCS.

Two types of EMSs were developed previously for hybrid vehicles, i.e. rule-based EMSs [1,2,3,4] and optimization-based EMSs [5,6,7,8]. The former consist of if–then control rules based on expert knowledge. Their real-time control performance is good owing to their simplicity, but they leave room for improvement in control optimality. The latter are usually based on optimal control theories, which guarantee control optimality, but their real-time application is limited mainly by the dependence on knowledge of future driving cycles. Moreover, in both types of EMSs the relevant parameters must be adjusted appropriately to increase adaptability to different driving conditions. Along with the rapid development of artificial intelligence, a third type, learning-based EMSs, has been gradually investigated for hybrid vehicles in recent years; the learning algorithms adopted mainly include reinforcement learning (RL) and deep reinforcement learning (DRL). RL-based and DRL-based EMSs are fully data-driven: they reach optimal control results through interactions between the agent and the environment and through trial-and-error learning. In addition, they do not rely on predefined rules or optimal control theories and present good real-time performance and adaptability.

In earlier research [9], the RL algorithm was applied to the EMS of a hybrid electric tracked vehicle, and the results showed that the RL-based EMS provides strong adaptability, optimality, and learning ability, and effectively reduces computational time. In recent years, the RL algorithm has been widely applied to different hybrid vehicles, such as engine-motor hybrid vehicles [9,10,11,12,13,14,15,16,17], FCHVs [18, 19], engine-ultracapacitor hybrid vehicles [20], and electric vehicles with hybrid energy storage systems [21]. In research [22], a parametric study on several key parameters of RL-based EMSs for hybrid vehicles was conducted, covering the state types and number of states, the state and action discretization, the exploration–exploitation trade-off, and the selection of learning experience. The offline training and online application mode is commonly used for RL-based EMSs of hybrid vehicles. Besides control optimality, the convergence rate during offline training and the adaptability during online application are also important performance measures of RL-based EMSs, and several techniques have been developed to improve them. Introducing the transition probability matrix (TPM) of the vehicle's required power into the RL framework is a common way to expedite the convergence of offline training. In addition, the learning rate was adjusted during offline training in some research [12, 13] to improve the convergence rate. Furthermore, refining the RL algorithm with other strategies [19] and initializing the RL algorithm with properly selected penalty functions [16] have also been used to speed up the convergence of offline training. To improve the adaptability of RL-based EMSs, characteristic factors of the TPM, such as the Kullback–Leibler divergence rate [10, 13, 20], the induced matrix norm [12], and the cosine similarity [18], were introduced into the RL algorithm, and the control strategy was updated in real time according to these factors.

The difference between the RL and DRL algorithms lies in how the Q-value, the key quantity for decision-making in both algorithms, is represented: as a Q-table or as a Q-network. The RL algorithm relies on state discretization, which leads to a rapid increase in the Q-table size, and consequently to long computational times and poor convergence, when a higher-dimensional state space is considered. The DRL algorithm instead uses a deep neural network (DNN), i.e. the deep Q-network (DQN), to fit the Q-table, which makes it feasible to consider more state variables and also yields a more accurate representation of them, since continuous changes in the state variables are reflected in the DNN-based decision-making. In earlier research [23], a DQN-based EMS was proposed for a power-split hybrid electric bus; simulation results showed that its fuel economy is 5.6% better than that of an RL-based EMS on a trained driving cycle and reaches nearly 90% of the dynamic programming (DP) level on an untrained driving cycle. In some research [24, 25], an extra DNN called the target network was created to improve convergence; the target network was periodically updated by copying parameters from the original network. In research [26], a DQN-based EMS with a dueling network structure was proposed for hybrid vehicles to further speed up convergence, which is particularly useful in states where the actions do not affect the environment significantly. In addition to the above DQN-based EMSs, the deep deterministic policy gradient (DDPG), which belongs to the actor-critic DRL framework, has also been adopted in DRL-based EMSs of hybrid vehicles [27,28,29].

Although DRL-based EMSs have shown superiority over other types of EMSs, several problems still need to be solved to improve their performance. The learning ability, i.e. the convergence speed of the DRL algorithm, is the first key factor; the control effect and the adaptability of the EMS are also important, and all of these can be further improved with appropriate techniques. Currently, most research on RL-based and DRL-based EMSs focuses on traditional hybrid vehicles, i.e. engine-motor hybrid vehicles, and rarely on FCHVs. In addition, for FCHVs the fuel cell stack lifetime is a concern because of its high cost, so the fuel cell stack durability should be considered when designing the EMS.

In this research, a DQN-based EMS is proposed for FCHBs, in which Prioritized Experience Replay (PER) is adopted to expedite the convergence of the DRL algorithm. In addition, the action space of the DRL algorithm is limited according to the efficiency characteristic of the FCS in order to improve the fuel economy of the FCHB. Furthermore, the fuel cell stack durability is considered in the proposed EMS based on a fuel cell degradation model. Finally, to validate the effectiveness of the proposed EMS, its simulation results for an FCHB are compared with those of an RL-based EMS and a DP-based EMS.

The remainder of this paper is organized as follows: Sect. 2 introduces the target FCHB model, including the fuel cell degradation model; Sect. 3 presents the proposed DRL-based EMS after introducing the relevant algorithms; Sect. 4 validates the effectiveness of the proposed EMS in terms of fuel economy, fuel cell durability, convergence performance, and adaptability by comparison with an RL-based EMS and a DP-based EMS; finally, conclusions are drawn in Sect. 5.

2 The FCHB Model

The FCHB powertrain is mainly composed of the FCS, the DC/DC converter, the battery, the motor, and the final drive, as illustrated in Fig. 1. The target bus selected in this research is an FCHB included in the recommended model catalogue for new energy vehicle popularization and application issued by the Ministry of Industry and Information Technology of China [30]. The relevant data of the FCHB are provided in Table 1.

Fig. 1 Powertrain configuration of the FCHB

Table 1 Vehicle parameters of the FCHB

2.1 FCHB Power Demand Model

The vehicle motion is determined by the tractive force and the driving resistances acting on the vehicle. The power required during driving can be expressed as follows:

$$P_{req} = (fMg\cos \alpha + 0.5\rho_{a} AC_{D} v^{2} + Mg\sin \alpha + \delta Ma) \cdot v$$
(1)

where \(f\) is the rolling resistance coefficient, \(M\) is the mass of the vehicle, \(g\) is the acceleration of gravity, \(\alpha\) is the road slope which is set to 0 in this research, \(\rho_{a}\) is the air mass density, \(A\) is the vehicle frontal area, \(C_{D}\) is the aerodynamic drag coefficient, \(v\) is the vehicle velocity, \(\delta\) is the mass factor which is set to 1 in this research, and \(a\) is the vehicle acceleration. For the FCHB, the power \(P_{req}\) is provided by the FCS and the battery, and the power balance is expressed as follows:

$$P_{req} = \left( {P_{fcs} \cdot \eta_{{{\text{conv}}}} + P_{bat} } \right) \cdot \eta_{mot} \cdot \eta_{final}$$
(2)

where \(P_{fcs}\) and \(P_{bat}\) represent the FCS power and the battery power, respectively; \(\eta_{{{\text{conv}}}}\), \(\eta_{mot}\), and \(\eta_{final}\) represent the DC/DC converter efficiency, the motor efficiency, and the final drive efficiency, respectively. Detailed values for some of these parameters are listed in Table 2; the remaining parameters are explained in the following subsections.

Table 2 Parameter values of the FCHB
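As an illustration of Eqs. (1) and (2), a minimal Python sketch of the power demand and the power balance is given below; the default parameter values are placeholders for readability, not the entries of Table 2.

```python
import numpy as np

def required_power(v, a, M, f=0.01, g=9.81, rho_a=1.2, A=7.5, C_D=0.55,
                   alpha=0.0, delta=1.0):
    """Vehicle power demand of Eq. (1) in watts.

    v, a: vehicle velocity [m/s] and acceleration [m/s^2].
    The default parameter values are placeholders, not the Table 2 data.
    """
    F_roll = f * M * g * np.cos(alpha)          # rolling resistance
    F_aero = 0.5 * rho_a * A * C_D * v ** 2     # aerodynamic drag
    F_grade = M * g * np.sin(alpha)             # grade resistance (alpha = 0 here)
    F_inertia = delta * M * a                   # acceleration resistance
    return (F_roll + F_aero + F_grade + F_inertia) * v

def battery_power(P_req, P_fcs, eta_conv=0.95, eta_mot=0.9, eta_final=0.95):
    """Solve Eq. (2) for the battery power once the EMS has chosen P_fcs."""
    return P_req / (eta_mot * eta_final) - P_fcs * eta_conv
```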

2.2 FCS Model

An FCS consists of a fuel cell stack and auxiliary components such as the air compressor, the cooler, and the humidifier. Part of the power generated by the fuel cell stack is supplied to the auxiliary components to maintain regular operation of the FCS. The fuel cell stack is composed of a number of single cells, and three types of losses occur in every single cell, i.e. the activation loss, the ohmic loss, and the concentration loss. The voltage of a single cell \(v_{fc}\) can be expressed as follows:

$$v_{fc} = E - v_{act} - v_{ohm} - v_{conc}$$
(3)

where \(E\) is the open circuit voltage (OCV), \(v_{act}\), \(v_{ohm}\), and \(v_{conc}\) represent the activation loss, the ohmic loss, and the concentration loss respectively.

The hydrogen consumption rate of the stack \(\dot{m}_{h_{2}}\) is related to the stack current \(I_{stack}\) according to the following equation:

$$\dot{m}_{h_{2}} = \frac{N_{cell} \cdot M_{h_{2}}}{n \cdot F} \cdot I_{stack} \cdot \lambda$$
(4)

where \(N_{cell}\) represents the cell number of the stack, \(M_{{h_{2} }}\) represents the molar mass of the hydrogen, \(n\) represents the number of electrons acting in the reaction, \(F\) is the Faraday constant, and \(\lambda\) is the hydrogen excess ratio. For the FCS, the efficiency \(\eta_{fcs}\) is defined as follows:

$$\eta_{fcs} = \frac{P_{fcs}}{\dot{m}_{h_{2}} \cdot LHV}$$
(5)

where LHV is the lower heating value of the hydrogen. Further details on the FCS model can be found in our previous research [8, 31,32,33]. A 53 kW FCS is used in the FCHB; its hydrogen consumption rate and efficiency vary with the FCS power as shown in Fig. 2.

Fig. 2 Fuel consumption rate and efficiency of the FCS
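The following sketch evaluates Eqs. (4) and (5); the physical constants are standard values, whereas the hydrogen excess ratio and the stack current supplied by the caller are assumptions about how the model is driven.

```python
# Molar mass of H2 [kg/mol], Faraday constant [C/mol],
# lower heating value of hydrogen [J/kg].
M_H2, F_CONST, LHV = 2.016e-3, 96485.0, 1.2e8

def h2_rate(I_stack, N_cell, n=2, lam=1.0):
    """Stack hydrogen consumption rate of Eq. (4) in kg/s."""
    return N_cell * M_H2 / (n * F_CONST) * I_stack * lam

def fcs_efficiency(P_fcs, m_dot_h2):
    """FCS efficiency of Eq. (5); P_fcs in W, m_dot_h2 in kg/s."""
    return P_fcs / (m_dot_h2 * LHV)
```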

In this research, an empirical fuel cell degradation model [34] is adopted to evaluate the effect of the EMS on fuel cell durability; in this model, fuel cell degradation is mainly caused by load changing, startup and shutdown, idling, and high-power load operation [34, 35]. The fuel cell degradation model is expressed as follows:

$$\Delta \phi_{{{\text{degrad}}}} = Kp\left( {\left( {k_{1} t_{1} + k_{2} n_{1} + k_{3} t_{2} + k_{4} t_{3} } \right) + \beta } \right)$$
(6)

where \(\Delta \phi_{{{\text{degrad}}}}\) represents the voltage decline percentage; t1, n1, t2, and t3 are obtained from the driving condition of the FCHB and represent the idle time, the start-stop count, the duration of rapid load variations, and the duration of high-power load operation, respectively; k1, k2, k3, and k4 are the corresponding coefficients of these terms, whose detailed values can be found in research [34]; \(\beta\) is the natural decay rate; and \(Kp\) is a modifying coefficient for on-road systems that accounts for the durability difference between laboratory and on-road operation. The values of \(\beta\) and \(Kp\) are also taken from research [34].
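A direct transcription of Eq. (6) might look as follows; the coefficients k1–k4, beta, and Kp, as well as the units of the time quantities, must be taken from research [34] and are not reproduced here.

```python
def voltage_decline_percent(t1, n1, t2, t3, k1, k2, k3, k4, beta, Kp):
    """Voltage decline percentage of Eq. (6).

    t1: idle time, n1: start-stop count, t2: time under rapid load changes,
    t3: time at high-power load; coefficients are taken from Ref. [34].
    """
    return Kp * ((k1 * t1 + k2 * n1 + k3 * t2 + k4 * t3) + beta)
```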

2.3 Battery Model

The battery is modeled by an equivalent circuit, which is composed of a voltage source \(U_{oc}\) and a resistance \(R_{int}\) connected to the voltage source in series, as illustrated in Fig. 3. The voltage source \(U_{oc}\), which is also called the open circuit voltage (OCV), and the internal resistance \(R_{int}\) vary according to the battery state of charge (SOC), as shown in Fig. 4.

Fig. 3 Equivalent circuit diagram of the battery model

Fig. 4 OCV and internal resistance of the battery

The battery SOC is derived from the ampere-hour integral method as follows:

$$\mathop {SOC}\limits^{ \bullet } = - \frac{{I_{bat} }}{q}$$
(7)

where q represents the battery capacity. The following relationship can be obtained from Fig. 3.

$$I_{bat} = \frac{{U_{oc} \left( {SOC} \right) - \sqrt {U_{oc} \left( {SOC} \right)^{2} - 4R_{{\text{int}}} \left( {SOC} \right) \cdot P_{bat} } }}{{2R_{{\text{int}}} \left( {SOC} \right)}}$$
(8)
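Combining Eqs. (7) and (8), a one-step SOC update can be sketched as follows; the look-up functions stand in for the OCV and internal resistance curves of Fig. 4, and the sign convention (positive battery power when discharging) is an assumption.

```python
def soc_step(soc, P_bat, q, ocv_lut, rint_lut, dt=1.0):
    """One-step SOC update combining Eqs. (7) and (8).

    ocv_lut / rint_lut: callables mapping SOC to U_oc [V] and R_int [ohm],
    e.g. interpolations of the curves in Fig. 4.
    q: battery capacity in ampere-seconds so that dt is in seconds.
    """
    U_oc, R_int = ocv_lut(soc), rint_lut(soc)
    # Battery current from Eq. (8)
    I_bat = (U_oc - (U_oc ** 2 - 4.0 * R_int * P_bat) ** 0.5) / (2.0 * R_int)
    # Ampere-hour integration of Eq. (7)
    return soc - I_bat / q * dt
```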

2.4 Motor Model

The motor is modeled by an efficiency map, which indicates the relationship among the motor speed, torque, and efficiency, as illustrated in Fig. 5.

Fig. 5 Motor efficiency map
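A map-based motor model can be implemented as a two-dimensional interpolation, for example as sketched below; the grid and efficiency values are placeholders, not the data behind Fig. 5.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical grid; the real map is the one plotted in Fig. 5.
speed_grid = np.linspace(0, 3000, 31)      # motor speed [rpm]
torque_grid = np.linspace(0, 1200, 25)     # motor torque [Nm]
eff_table = 0.9 * np.ones((31, 25))        # placeholder efficiencies

_motor_eff = RegularGridInterpolator((speed_grid, torque_grid), eff_table,
                                     bounds_error=False, fill_value=0.85)

def eta_mot(speed_rpm, torque_nm):
    """Look up the motor efficiency from the (speed, torque) map."""
    return float(_motor_eff([[speed_rpm, torque_nm]])[0])
```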

3 The Proposed DRL-based EMS

In this section, relevant algorithms including the RL and DRL algorithms are introduced first, and then the proposed DRL-based EMS is explained.

3.1 RL and DRL Algorithms

The RL algorithm is a major branch of machine learning and involves several key elements: the agent, the environment, the state, the action, and the reward. Its central concern is how the agent takes actions in a given environment so as to maximize the cumulative reward, finally reaching the optimal control result through interactions between the agent and the environment.

Q-learning is the most commonly used RL algorithm; its Q-function, which satisfies Bellman's equation, is defined as follows:

$$Q\left( {s_{t} ,a_{t} } \right) = E\left[ {R_{t} + \gamma \mathop {\max }\limits_{{a_{t + 1} }} Q\left( {s_{t + 1} ,a_{t + 1} } \right)|s_{t} ,a_{t} } \right]$$
(9)

where \(Q\) is also called the value function; \(E\) represents the expectation of cumulative returns; \(s\), \(a\), and \(R\) represent the state, the action, and the reward, respectively; and \(\gamma\) is a discount factor for the future value function, which is beneficial for convergence during the learning process. The updating rule of Q-learning is as follows:

$$Q\left( {s_{t} ,a_{t} } \right) \leftarrow Q\left( {s_{t} ,a_{t} } \right) + \eta \left[ {R_{t} + \gamma \mathop {\max }\limits_{{a_{t + 1} }} Q\left( {s_{t + 1} ,a_{t + 1} } \right) - Q\left( {s_{t} ,a_{t} } \right)} \right]$$
(10)

where \(\eta\) is the learning rate, which influences the convergence performance: a larger value leads to faster convergence but can also cause learning oscillation and overfitting. The balance between exploration and exploitation during learning is decided by the \(\varepsilon\)-greedy algorithm, i.e. the agent chooses actions randomly with a small probability \({1} - \varepsilon\) and selects the action maximizing the Q-function with probability \(\varepsilon\). After the Q-function has converged through the algorithm iterations, the optimal control strategy \(\pi^{*}\) is obtained as follows:

$$\pi^{*} (s) = \mathop {\arg \max }\limits_{a} Q^{*} (s,a)$$
(11)
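A minimal tabular sketch of the update rule (10) and the action selection described above (exploitation with probability \(\varepsilon\), as in the text) could look as follows, assuming states discretized to integer indices.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, eta=0.1, gamma=0.95, eps=0.9):
    """One tabular Q-learning update (Eq. (10)) plus epsilon-greedy selection.

    Q: array of shape (n_states, n_actions); s, s_next: integer state indices.
    """
    # Update rule of Eq. (10)
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += eta * (td_target - Q[s, a])
    # Epsilon-greedy selection of the next action (exploit with probability eps)
    if np.random.rand() < eps:
        return int(np.argmax(Q[s_next]))        # exploitation
    return np.random.randint(Q.shape[1])        # exploration
```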

In the Q-learning algorithm, the Q-value of every state-action pair is stored in the Q-table. When a higher-dimensional state space is considered, the Q-table size increases rapidly, which makes convergence difficult. The DRL algorithm instead uses DNNs to fit the Q-function, which is effective for higher-dimensional systems, as follows:

$$Q\left( {s_{t} ,a_{t} ;\theta } \right) \approx Q\left( {s_{t} ,a_{t} } \right)$$
(12)

where \(\theta\) represents the network parameters. In order to break the dependency between the target Q-value and the parameters of the original DNN, and to speed up and stabilize convergence, an extra DNN named the target network, with parameters \(\theta^{ - }\), is usually created. The original DNN, called the evaluation network, is used to select actions, and the target network is periodically updated by copying parameters from the evaluation network. The evaluation network parameters \(\theta\) are updated through backpropagation and gradient descent based on the loss function, which is defined as the mean squared error between the target Q-value and the current Q-value derived from the evaluation network, as follows:

$$L(\theta ) = E\left[ {\left( {R_{t} + \gamma \mathop {\max }\limits_{{a_{t + 1} }} Q\left( {s_{t + 1} ,a_{t + 1} ;\theta^{ - } } \right) - Q\left( {s_{t} ,a_{t} ;\theta } \right)} \right)^{2} } \right]$$
(13)
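A possible PyTorch-style realization of the loss (13) with an evaluation network and a target network is sketched below; the network interfaces and batch format are assumptions, not code from this research.

```python
import torch
import torch.nn.functional as F

def dqn_loss(eval_net, target_net, batch, gamma=0.95):
    """Mean squared TD loss of Eq. (13).

    batch: tensors (s, a, r, s_next) sampled from the experience pool.
    """
    s, a, r, s_next = batch
    q_sa = eval_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                       # target uses the frozen parameters theta^-
        q_target = r + gamma * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, q_target)

# Periodic hard update of the target network:
# target_net.load_state_dict(eval_net.state_dict())
```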

In order to break the correlations among training data, experience replay is usually adopted during training: the experience \(e_{t} = \left( s_{t}, a_{t}, R_{t}, s_{t+1} \right)\) at each time step is stored in an experience pool \(D_{N} = \left\{ {e_{1} ,e_{2} , \ldots ,e_{N} } \right\}\), and mini-batches of data are sampled randomly from the pool for training. Random sampling effectively cuts off the relationship among training data; however, important samples can be missed, which influences the convergence [36]. In this research, the PER is adopted to replay important samples more frequently, where the absolute temporal difference (TD) error of each data sample is used to assess its importance, i.e. its priority. The sampling probability of each sample is proportional to its priority, as follows [36]:

$$\begin{gathered} TD\left( {s_{t} ,a_{t} } \right) = Q\left( {s_{t} ,a_{t} ;\theta } \right) - \left( {R_{t} + \gamma \mathop {\max }\limits_{{a_{t + 1} }} Q\left( {s_{t + 1} ,a_{t + 1} ;\theta^{ - } } \right)} \right) \hfill \\ p_{i} = \left| {TD_{i} } \right| + \epsilon \hfill \\ P_{i} = \frac{{p_{i} }}{{\sum\nolimits_{k} {p_{k} } }} \hfill \\ \end{gathered}$$
(14)

where \(p_{i}\) and \(P_{i}\) are the priority and the sampling probability of the \(i\)th sample, respectively, and \(\epsilon\) is a small positive constant that prevents a zero sampling probability.
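The proportional priorities and sampling probabilities of Eq. (14) can be computed as in the following sketch; the importance-sampling correction used in the original PER method [36] is not shown because it is not part of Eq. (14).

```python
import numpy as np

def per_probabilities(td_errors, eps_p=1e-6):
    """Eq. (14): p_i = |TD_i| + eps_p, P_i = p_i / sum_k(p_k)."""
    p = np.abs(td_errors) + eps_p
    return p / p.sum()

def sample_minibatch(pool, td_errors, batch_size=64):
    """Draw a prioritized mini-batch of experiences (s, a, R, s') from the pool."""
    probs = per_probabilities(td_errors)
    idx = np.random.choice(len(pool), size=batch_size, p=probs)
    return [pool[i] for i in idx], idx
```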

3.2 The Proposed DRL-Based EMS

For the DRL-based EMS of the FCHB, the agent is the EMS, while the environment includes the FCHB status and the driving condition, as shown in Fig. 6. The key elements of the DRL algorithm, including the state, the action, the reward function, and the DNNs, should first be designed according to the characteristics of the FCHB control problem. Owing to the powerful fitting ability of DNNs, discretization of the state variables is not necessary, and more state variables can be considered than in the RL algorithm.

Fig. 6 Learning framework of the DRL-based EMS for the FCHB

In this research, the FCHB velocity, the acceleration, and the battery SOC are selected as the state variables, as follows:

$$S = \{ v,a,SOC\}$$
(15)

The FCS power is set as the action variable and is limited to an effective range according to the efficiency characteristic shown in Fig. 2, as follows:

$$A = \left\{ {{2,}\;{3,}\;{4,}\;{5,} \ldots {,}\;{36,}\;{37,}\;{38,}\;{39,}\;{40}} \right\}\;\;{\text{kW}}$$
(16)

The reward function is crucial to the performance of the DRL-based EMS, as it directly influences both the control effect and the convergence rate. Considering the control objectives of improving fuel economy and fuel cell stack durability, together with the fuel cell degradation model introduced in Sect. 2.2, the reward function \(R\) is designed as follows:

$$R = - \left( \alpha \dot{m}_{h_{2}} + \mu (SOC_{ref} - SOC)^{2} + \varphi \left| \Delta P_{fcs} \right| + \xi f(t_{life}) \right)$$
(17)

where the first term relates to the fuel economy, the second term relates to battery SOC sustaining, and the remaining terms relate to the fuel cell durability. \(\alpha\), \(\mu\), \(\varphi\), and \(\xi\) are weighting factors for the respective terms, \(SOC_{ref}\) is the reference SOC value for charge sustaining, set to 0.7 in this research, and \(f\) is the sigmoid function, defined as follows:

$$f(x) = \frac{{1}}{{{1} + e^{ - x} }}$$
(18)

\(\Delta P_{fcs}\) and \(t_{life}\) are defined as follows:

$$\begin{gathered} \Delta P_{fcs} \left( t \right) = P_{fcs} \left( t \right) - P_{fcs} \left( {t - {1}} \right) \hfill \\ t_{life} = t_{1} + n_{1} + t_{3} \hfill \\ \end{gathered}$$
(19)

where \(t_{1}\), \(n_{1}\), and \(t_{3}\) correspond to the quantities in the fuel cell degradation model (6) and are related to the idling, startup-shutdown, and high-power load conditions, respectively. These three factors are combined because the calculation time-step of the algorithm is one second in this research. According to the reward function (17) and the mechanism of the DRL algorithm, in order to maximize the cumulative reward the agent will tend to minimize the fuel consumption, keep the battery SOC close to the reference, and avoid harmful operating conditions of the FCS.
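For illustration, the reward of Eqs. (17)–(19) can be evaluated as follows; the weighting factors shown are placeholders, not the tuned values used in this research.

```python
import math

def reward(m_dot_h2, soc, delta_p_fcs, t_life,
           alpha=1.0, mu=1.0, phi=1.0, xi=1.0, soc_ref=0.7):
    """Reward of Eq. (17) with the sigmoid of Eq. (18); weights are placeholders."""
    sigmoid = 1.0 / (1.0 + math.exp(-t_life))   # Eq. (18) applied to t_life
    return -(alpha * m_dot_h2
             + mu * (soc_ref - soc) ** 2
             + phi * abs(delta_p_fcs)
             + xi * sigmoid)
```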

The evaluation and target DNNs share the same structure, in which the input and output of the network correspond to the state variables and the action variable. Following the network structure presented in research [37], three hidden layers with 200, 100, and 50 neurons are used in addition to the input and output layers. The ReLU function [38] is used as the activation function for each hidden layer and is defined as follows:

$$f(x) = \max ({0},x)$$
(20)
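Based on the description above, the evaluation and target networks could be built as in the following sketch; mapping each of the 39 discrete FCS power levels in (16) to one output neuron is an assumption consistent with the DQN framework.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Three inputs (v, a, SOC), hidden layers of 200, 100, and 50 neurons
    with ReLU activations, and one output per discrete FCS power level."""
    def __init__(self, n_states=3, n_actions=39):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, 200), nn.ReLU(),
            nn.Linear(200, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, n_actions),
        )

    def forward(self, x):
        return self.net(x)
```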

The pseudocode of the DRL algorithm for the proposed EMS is presented in Table 3, where the first loop iterates over training episodes, i.e. driving cycles, and is processed once for every driving cycle; the second loop is processed once for every time-step within one driving cycle; and the third loop is processed n times at every time-step, where n is the size of the mini-batch sampled from the experience pool. The framework of the proposed DRL-based EMS is illustrated in Fig. 7, and a simplified sketch of this training procedure is given after Fig. 7.

Table 3 Pseudocode of the DRL algorithm for the proposed EMS
Fig. 7 Framework of the proposed DRL-based EMS
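The sketch below mirrors the three nested loops described above; it is not the pseudocode of Table 3, and `env`, `select_action`, and `to_tensors` are hypothetical helpers. It reuses the `dqn_loss` and `sample_minibatch` sketches given earlier.

```python
def train(env, eval_net, target_net, optimizer, cycles,
          n_episodes=500, batch_size=64, copy_every=200, gamma=0.95):
    """Simplified DRL training loop; helper objects are hypothetical."""
    pool, priorities, step = [], [], 0
    for episode in range(n_episodes):          # loop 1: one pass per driving cycle
        cycle = cycles[episode % len(cycles)]
        s, done = env.reset(cycle), False
        while not done:                        # loop 2: one pass per time-step
            a = select_action(eval_net, s)     # epsilon-greedy on the evaluation network
            s_next, r, done = env.step(a)
            pool.append((s, a, r, s_next))
            priorities.append(1.0)             # default priority for new samples (assumption)
            # Loop 3: prioritized mini-batch update over batch_size samples
            batch, idx = sample_minibatch(pool, priorities, batch_size)
            loss = dqn_loss(eval_net, target_net, to_tensors(batch), gamma)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step % copy_every == 0:         # periodic copy to the target network
                target_net.load_state_dict(eval_net.state_dict())
            s, step = s_next, step + 1
```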

4 Simulation Results

Four driving cycles are used in this research, as shown in Fig. 8: the West Virginia University city cycle (WVUCITY) [39], the West Virginia University suburban cycle (WVUSUB) [39], and the Manhattan Bus driving cycle are used for training, while the Japan 1015 driving cycle is used for validation of the proposed DRL-based EMS. The effectiveness of the proposed DRL-based EMS is validated in terms of fuel economy, fuel cell durability, adaptability, and convergence performance. Specific values of the algorithm parameters are listed in Table 4.

Fig. 8 Driving cycles used in this research

Table 4 Parameter values of the DRL algorithm

4.1 Fuel Economy

The hydrogen consumption of the proposed DRL-based EMS is compared with that of an RL-based EMS and a DP-based EMS for the FCHB. DP is a global optimization method whose result is usually regarded as the benchmark for evaluating other control methods; details can be found in our previous research [40]. In order to focus on the fuel economy, only the first two terms of the reward function (17) are considered here. Figures 9, 10, and 11 show the comparison of the FCS power and the battery power for the three EMSs on the three training driving cycles. Table 5 summarizes the hydrogen consumption comparison, where the differences in the final battery SOC are accounted for by the equivalent hydrogen consumption. The results indicate that the fuel economy of the proposed DRL-based EMS is improved by 2.93%, 4.25%, and 3.72% compared with the RL-based EMS on the WVUCITY, WVUSUB, and Manhattan Bus driving cycles respectively, while the difference from the DP-based EMS is within 5.53%, 5.67%, and 5.86% on the three driving cycles respectively.

Fig. 9 Comparison on the WVUCITY: a FCS power; b battery power

Fig. 10 Comparison on the WVUSUB: a FCS power; b battery power

Fig. 11 Comparison on the Manhattan Bus: a FCS power; b battery power

Table 5 Fuel consumption comparison results

4.2 Fuel Cell Durability

The voltage decline percentage \(\Delta \phi_{{{\text{degrad}}}}\) in (6) is used to evaluate the fuel cell degradation rate in this research. It is obtained for two cases: one in which only the first two terms of the reward function (17) are considered, and one in which the whole reward function is used. The former corresponds to a DRL-based EMS in which the fuel cell durability is not considered. The results, compared in Table 6, show that the fuel cell degradation rate is decreased by 56.96%, 69.47%, and 64.03% by the proposed DRL-based EMS compared with the one that does not consider the fuel cell durability on the WVUCITY, WVUSUB, and Manhattan Bus driving cycles respectively, while the fuel economy is hardly affected.

Table 6 Fuel cell durability comparison results

4.3 Convergence Performance

In this research, the PER is adopted to replay important samples from the experience pool more frequently and thus expedite convergence during training. To validate the effectiveness of the PER, the trend of the average reward during training is compared for the cases with and without the PER on the three driving cycles, as illustrated in Fig. 12. The DRL algorithm with the PER converges after around 375, 365, and 466 rounds, whereas the one without the PER starts to converge after around 420, 850, and 612 rounds on the three driving cycles respectively; i.e. the convergence performance of the proposed DRL-based EMS is improved by 10.71%, 57.06%, and 23.86% owing to the utilization of the PER on the three driving cycles respectively.

Fig. 12 Trend of the average reward during training: a on the WVUCITY driving cycle; b on the WVUSUB driving cycle; c on the Manhattan Bus driving cycle

4.4 Adaptability

In order to validate the adaptability of the proposed DRL-based EMS to different driving cycles, it is applied after training to a new driving cycle, the Japan 1015 driving cycle. The fuel consumption on the Japan 1015 driving cycle is presented in Table 7 and compared with that of the other two EMSs, which reveals that the fuel economy of the proposed DRL-based EMS is improved by 4.18% compared with the RL-based EMS, whereas the difference from the DP-based EMS is within 5.65%. Together with Table 5, this demonstrates that the proposed DRL-based EMS has good adaptability. Figure 13 illustrates the comparison of the FCS power and the battery power for the DP-based, RL-based, and DRL-based EMSs on the Japan 1015 driving cycle, where the driving cycle is repeated twice in order to observe the results more clearly.

Table 7 Fuel consumption result on the Japan 1015 driving cycle
Fig. 13 Simulation results on the Japan 1015 driving cycle: a FCS power; b battery power

5 Conclusion

Considering the current rapid development of FCHBs in China, a DRL-based EMS is proposed for FCHBs in this research, in which the fuel cell durability is considered based on a fuel cell degradation model. The PER is adopted to improve the convergence performance of the DRL algorithm, and the action space of the DRL algorithm is limited for a better control effect. The effectiveness of the proposed DRL-based EMS for an FCHB is validated in terms of fuel economy, fuel cell durability, convergence performance, and adaptability by comparing its results with those of an RL-based EMS and a DP-based EMS. The following conclusions can be drawn from this research:

(1) The fuel economy of the proposed DRL-based EMS is improved by 2.93%, 4.25%, and 3.72% compared with the RL-based EMS on the WVUCITY, WVUSUB, and Manhattan Bus driving cycles respectively, while the difference from the DP-based EMS is within 5.53%, 5.67%, and 5.86% on the three driving cycles respectively.

(2) The fuel cell degradation rate is decreased by 56.96%, 69.47%, and 64.03% by the proposed DRL-based EMS compared with the one that does not consider the fuel cell durability on the WVUCITY, WVUSUB, and Manhattan Bus driving cycles respectively.

(3) The convergence performance of the proposed DRL-based EMS is improved by 10.71%, 57.06%, and 23.86% owing to the utilization of the PER on the WVUCITY, WVUSUB, and Manhattan Bus driving cycles respectively.

(4) The adaptability of the proposed DRL-based EMS is validated on the Japan 1015 driving cycle, while the training of the DRL algorithm is completed on the WVUCITY, WVUSUB, and Manhattan Bus driving cycles; the result shows that the proposed DRL-based EMS presents good adaptability.