1 Introduction

With the development of information technologies, intelligent transportation systems (ITSs) and advanced traveler information systems (ATISs) have been widely deployed in urban areas. In the field of public transportation, the delivery of real-time traffic information is one of the most visible applications of ITS [1]. Traditionally, this type of information is delivered in the form of arrival and departure times on digital boards at bus stops. More recently, passengers can also obtain it via smartphone apps and in-vehicle screens. Under normal conditions, public transport serves passengers in an easy and comfortable way. However, owing to unpredictable factors, buses sometimes arrive early or late at a particular stop. Because of this lack of operational reliability, many travelers are inclined to choose private cars rather than public transport. Transit agencies realize that providing accurate bus arrival times is valuable for attracting more passengers and improving management and service levels. They can adjust bus schedules in advance by applying a higher or lower speed and ultimately reduce the waste of bus resources.

At the same time, the availability of accurate bus information can also help passengers to schedule their departure time efficiently, reduce their waiting time and make smart choices for their travel [2, 3]. Given the bus arrival time information, passengers can choose a suitable travel mode for their journeys. Figure 1 shows an example illustrating the effect of providing accurate bus information at a bus stop. In the example, a passenger attempts to travel from Stop A to Stop B and has three choices for this trip. If the passenger has enough time, he/she can choose either of the two bus routes (i.e., route no. 1 and no. 2). However, if time is short, the passenger can take a taxi to the destination. Furthermore, if the passenger knows the arrival times of the next buses of the two routes (e.g., 09:00 and 09:05, respectively), he/she will wait for the next bus of route no. 2 rather than that of route no. 1. As a result, the waiting time of the passenger will be reduced.

Fig. 1 An example of accurate bus information at a bus stop

As the objective of bus travel time prediction is to provide such information, this problem has been one of the most active topics in ITSs. It should be noted that predicting bus travel time precisely is relatively easy in areas with little external influence, such as rural areas. However, the problem becomes much more complex in urban areas. The motivation of this paper is to improve the prediction accuracy of current bus travel time prediction models.

In the past decades, by using historical data or online data (obtained by the global positioning system), various forecasting models and techniques have been proposed to predict bus travel time. These techniques include historical average model [4,5,6], statistical model [7,8,9,10,11], nonparametric regression model [2, 12,13,14], machine learning model [2, 3, 15,16,17,18,19,20,21,22,23,24] and hybrid model [20, 25, 26].

The historical average model predicts the current and future bus travel time within a given period by averaging historical travel times. Chung and Shalaby [6] presented a school bus arrival time prediction model, which combined a historical moving average model with an operational strategy. Their results indicated that the proposed model was effective in real-life application. However, a noteworthy feature of this type of model is that it requires a stable traffic pattern. The reliability of the model decreases greatly if the traffic pattern has large variations [27, 28]. It should be noted that in recent research, this type of model is presented only for comparison purposes.

Statistical models can be further divided into two main categories: time-series models and regression models. A time-series model assumes that the future state of a bus depends on the trend of its past several states. This type of model usually causes a short time lag between the predicted value and the real data. Commonly used time-series models include the autoregressive moving average (ARMA) model [29], the generalized autoregressive conditional heteroscedasticity model [7] and the seasonal ARIMA model [8, 30]. Williams and Hoel [4] compared the performances of several methods for traffic flow rate prediction, including seasonal ARIMA, random walk, historical average and deviation from historical average. Their results showed that seasonal ARIMA provided the best forecasts for all performance statistics. It should be noted that the prediction accuracy decreases significantly when the relationship between real-time and historical data is complicated [16]. The other kind of statistical model is the regression model, which is developed to examine the effect of different factors on bus travel time. In such a model, factors such as bus dwell time, traffic conditions and travel distance are treated as independent variables, while bus travel time is the dependent variable. Observing that current and future travel times have a linear relationship whose slope and intercept change with the time of day, Rice and Van Zwet [9] proposed a linear regression with time-varying coefficients to predict travel time on freeways. However, the prediction accuracy of a regression model depends heavily on the selection of independent variables [16].

Recently, because it requires no parameter estimation, the k-nearest neighbors (k-NN) method has been extensively employed in travel time prediction. It is a kind of nonparametric regression model. Based on k-NN, You and Kim [12] proposed a hybrid travel time forecasting model to predict link travel times in congested road networks. By using historical and real-time data, Chang et al. [14] developed a dynamic model to predict bus multi-interval path travel time. Their results showed that the k-NN method was effective in terms of both accuracy and computing time. Yu et al. [2] proposed four models, including SVM, ANN, k-NN and linear regression, to predict bus arrival time at a bus stop served by multiple routes. Their results also indicated that k-NN was an effective method for bus travel time prediction. However, the nonparametric regression model becomes costly in execution time when the sample size is large [2, 14].

As a subfield of artificial intelligence, machine learning methods are widely used in transportation. Of various machine learning methods, ANN and SVM are the most widely used models in bus travel time prediction. The ability of ANN for solving complex nonlinear problems has been proved in many applications [31, 32]. Chien et al. [16] developed two ANN-based models to predict bus arrival time, namely link-based ANN and stop-based ANN. To further improve the prediction accuracy of both models, an adaptive algorithm was developed. Results indicated that the adaptive algorithm could actually improve the performances. Mazloumi et al. [17] proposed an integrated framework that consists of two ANNs to predict both the average and variance of travel times. The proposed method was validated by using the data collected in Melbourne, Australia. However, the structure of ANN (i.e., input variables, hidden-layer size, learning rate, etc.) depends on the experience of researchers [33,34,35,36].

Characterized by the capacity control of the decision function, the use of the kernel functions and the sparse solution, SVM has also been widely used in the field of prediction. By using SVM-based model and the data of preceding buses, Yu et al. [18] predicted the bus arrival time for a bus route. Their results showed SVM was a powerful method for bus arrival time prediction. To further improve the performance of the SVM, Yu et al. [19] firstly applied the Grubbs’ test method to remove outliers from the input data. Then, an SVM-based model with forgetting factors was introduced. Results showed that the improved SVM model outperformed the standard SVM.

However, ANN and SVM are complicated because of the large number of parameters that need to be adjusted [3]. To cope with the adverse effect of parameter selection, attention has turned to an emerging type of machine learning method, random forests (RFs). As an ensemble learning algorithm, the RFs model can readily rank the importance of thousands of variables. Through the random selection of features and training samples, it has been shown to be efficient in avoiding overfitting. Some applications of RFs can be found in bus travel time prediction. Gal et al. [20] proposed a method combining queuing theory and RFs to predict bus travel time. In their paper, RFs were used to detect the outliers in historical data. Yu et al. [3] developed an RFs model to predict bus travel time. To avoid the influence of massive data, a preselection method for the training set, near neighborhoods, was proposed in their paper. Finally, their model was calibrated and validated by using the data of two bus routes in China.

To avoid the drawbacks of a single prediction model, several types of hybrid models have also been studied. These methods include the ANN model with a Kalman filter-based algorithm [25], the SVM model with a Kalman filter-based algorithm [26], queuing theory with the RFs algorithm [20] and so on. Results showed that hybrid models could obtain better performance than the single prediction models.

For all varieties of prediction models, the accuracy of bus travel time prediction depends heavily on historical and real-time traffic data. These traffic data are now growing rapidly with the widespread use of traffic sensor technologies. In addition, the factors that influence bus travel time commonly have complicated relationships. Finding a reliable method that can identify the relationship among these influencing factors is a challenging problem. However, the above-mentioned bus travel time prediction methods are mainly based on shallow learning architectures. With the rapid growth of information, it is difficult for them to identify the relationship among these influencing factors easily and precisely.

In recent years, deep learning models have drawn a lot of attention. By adopting multilayer or deep architectures, deep learning models can extract inherent features of data from the lowest level to the highest level even if the data set is large. Currently, there are some applications of deep learning models in traffic flow or travel time prediction [37,38,39,40,41,42,43,44,45,46]. The first traffic flow prediction model based on deep architecture was proposed by Lv et al. [37]. In their paper, a stacked autoencoder model was developed to learn traffic features from the data collected in California freeway systems. Experimental results showed that the prediction method based on deep architecture was superior to BPNN, SVM and RBFNN models [37]. Siripanpornchana et al. [38] developed an urban freeway travel time prediction model based on deep learning architectures. Unlike Lv et al. [37], who used an autoencoder model, they used a deep belief network (DBN), which comprises a stack of restricted Boltzmann machines (RBMs). However, to our knowledge, literature about bus travel time prediction using deep learning models is relatively scarce. In addition, the relationship among factors affecting bus travel time is usually more complex than that among factors affecting traffic flows. Given the strong feature learning ability of deep learning models, this paper proposes a new bus travel time prediction model based on the concept of DBN. In terms of model inputs, previous studies indicate that the traffic condition is the main factor influencing the accuracy of bus travel time prediction. Data that can reflect the traffic condition can be summarized as: (1) travel speed; (2) travel time; (3) distance; (4) emergency situation; and (5) dwell time. In this paper, due to data availability, both bus running time and bus dwell time (the number of passengers waiting to board) of the preceding vehicles (temporal and spatial data) are used to reflect the traffic condition and to predict the travel time of the next bus.

DBN is one of the most widely used deep learning methods. Its capacity to extract nonlinear characteristics has been demonstrated [42, 47]. However, the classical DBN is developed to learn features from binary data rather than continuous data. Because the data used in this paper are continuous, a variation of the RBM in DBN, called the Gaussian–Bernoulli RBM (GBRBM), is employed. The basic units of DBN are thus modified so that the model can extract and learn features of continuous data. To further improve the performance of DBN, a shallow learning architecture, the BP neural network model, is also adopted to predict bus travel time in a supervised fashion.

This paper seeks to make three contributions to previous literature. Firstly, this paper attempts to develop a bus travel time prediction model based on the concept of DBN. It is expected that an accurate prediction of bus travel time can be obtained. By using the prediction results, the anxiety and waiting time of passengers can be effectively reduced. Secondly, the basic units of DBN are improved to make the model have the ability of extracting and learning continuous data (traffic) features. Thirdly, to improve the performance of DBN, a shallow learning architecture, named BP neural network model, is also adopted to predict bus travel time in a supervised fashion. Experimental results show that the proposed method has superior performance in prediction accuracy.

The remainder of this paper is organized in the following way. Section 2 presents the bus travel time prediction model and the deep learning method based on the concept of DBN. Section 3 describes the details of our experiments and results. Finally, conclusions and future works are given in Sect. 4.

2 Methodologies

2.1 Bus travel time prediction model

The bus travel time prediction model predicts the bus travel time between two consecutive bus stops. The framework of the prediction is illustrated in Fig. 2. The bus travel time on the target segment i (the distance between bus stops B and C) needs to be predicted when the target bus k arrives at bus stop B. To obtain accurate prediction results, it is essential to find the appropriate factors that influence the travel time. As described in the literature, bus travel time between bus stops is related to the bus dwell time and traffic conditions [3]. Yu et al. [3, 18] pointed out that the traffic condition can be reflected by the running time of preceding buses and that the most recent data provide the most reliable information. However, the number of passengers waiting to board (bus dwell time) also affects the bus running time since buses need more time to speed up or slow down. There are two general approaches to forecasting bus travel time in previous studies. The first uses only the running time of preceding buses (dwell time is not explicitly expressed) [3, 18, 26], and the other uses both bus running time and bus dwell time [27]. In addition, in an urban road network, road links do not exist in isolation. Traffic speed on both upstream and downstream links can affect the traffic speed of the current road. Therefore, in this paper, the average bus travel times of preceding buses on the current segment, upstream segment and downstream segment are all considered. Similarly, to find the appropriate factors to estimate bus dwell time and avoid the errors brought by unpredictable factors, the dwell time, the average dwell time and the variance of dwell time of the target bus at the starting stop are also considered. It should be noted that the preceding buses in this paper refer only to the buses that have just finished the travel between two consecutive stops and left for the next stop.

Fig. 2 The framework of bus travel time prediction

Let k denote the target bus for prediction and i denote the segment of the target bus route. The input variables considered in this paper are as follows:

$$\hat{t}_{i,k} = f\left( {t_{i - 1,k} ,t_{i,k - 1} ,t_{i + 1,k - 1} ,dt_{i,k} ,\overline{dt}_{i} ,\sigma \left( {dt} \right)_{i}^{2} } \right)$$
(1)

where \(\hat{t}_{i,k}\) is the predicted travel time of bus k on segment i. \(t_{i - 1,k}\) is the running time of bus k on segment i − 1, which reflects the traffic conditions upstream of segment i. \(t_{i,k - 1}\) is the running time of bus k − 1 (i.e., the preceding bus) on segment i, which reflects the traffic conditions on the current segment. \(t_{i + 1,k - 1}\) is the running time of bus k − 1 on segment i + 1, which reflects the traffic conditions downstream of segment i. \(dt_{i,k}\) is the dwell time of bus k at the starting stop of segment i, and \(\overline{dt}_{i}\) and \(\sigma \left( {dt} \right)_{i}^{2}\) are the average and variance of dwell time at the starting stop of segment i, respectively. The prediction model proposed in this paper aims to find the relationship between these influencing factors and the travel time.
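As a minimal sketch of how the six inputs of Eq. (1) could be assembled, consider the following Python fragment. The data containers (`run`, `dwell`, `dwell_history`), the function name and the use of the population variance are assumptions made for illustration; they are not specified in this paper.

```python
from statistics import mean, pvariance

def build_features(run, dwell, dwell_history, i, k):
    """Assemble the six inputs of Eq. (1) for bus k on segment i.

    run[(segment, bus)]   -> observed running time in seconds (hypothetical layout)
    dwell[(segment, bus)] -> dwell time at the starting stop of the segment
    dwell_history[i]      -> past dwell times at the starting stop of segment i
    """
    return [
        run[(i - 1, k)],              # t_{i-1,k}: upstream segment, same bus
        run[(i, k - 1)],              # t_{i,k-1}: current segment, preceding bus
        run[(i + 1, k - 1)],          # t_{i+1,k-1}: downstream segment, preceding bus
        dwell[(i, k)],                # dt_{i,k}: dwell time of the target bus
        mean(dwell_history[i]),       # average dwell time at the starting stop
        pvariance(dwell_history[i]),  # variance of dwell time (population form assumed)
    ]

# Toy values, for illustration only.
run = {(1, 5): 95.0, (2, 4): 120.0, (3, 4): 80.0}
dwell = {(2, 5): 25.0}
dwell_history = {2: [20.0, 30.0, 25.0, 25.0]}
features = build_features(run, dwell, dwell_history, i=2, k=5)
```

The resulting list plays the role of the argument of f in Eq. (1); the order of the entries simply follows the order of the symbols in that equation.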

2.2 Deep belief network

DBN is a kind of artificial neural network inspired by imitating the process of cognition and inference of human brain [41]. It has become one of the most universal deep learning models and has been widely used in many fields. As an arbitrary undirected graphical model, DBN can be described as a generative energy-based model that has random multilayer units connecting the input layer and the output layer. It is constituted by a stack of RBMs, which provide DBN with the capacity of feature learning and feature extraction. There are two layers existing in the RBM. One layer is constituted by visible units, while the other is composed of hidden units [42]. Visible units are used to represent observable data, while hidden units are used to capture dependencies among observed variables. An illustration of an RBM is shown in Fig. 3.

Fig. 3 The structure of RBM

In this figure, \(v_{i} \left( {i = 1, \ldots ,3} \right)\) is a unit in the visible layer, \(h_{j} \left( {j = 1, \ldots ,4} \right)\) is a unit in the hidden layer, and \(W\) represents the weight matrix.

However, the classical RBM was developed to solve problems with binary data only, which limits its applications. Because traffic data are continuous and several variations of RBM have been presented in previous literature, a variation of RBM that can deal with continuous data, called GBRBM, is introduced in this paper. In this model, the binary visible units are replaced by linear units with Gaussian noise [40]. Note that GBRBM is an energy-based model, in which hidden-layer variables are used to describe the probabilistic distribution of visible-layer variables. Thus, given m visible units and k hidden units, the energy function of the GBRBM, which is derived from the energy function of a standard RBM, can be defined as follows [47]:

$$E\left( {v,h\left| \theta \right.} \right) = \sum\limits_{i = 1}^{m} {\frac{{\left( {v_{i} - a_{i} } \right)^{2} }}{{2\sigma_{i}^{2} }}} - \sum\limits_{j = 1}^{k} {b_{j} h_{j} } - \sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{k} {\frac{{v_{i} }}{{\sigma_{i} }}} } W_{ij} h_{j}$$
(2)

where \(\theta = \left\{ {W,a,b} \right\}\) denotes the structure parameters of the GBRBM, \(a_{i}\) and \(b_{j}\) are the biases of visible unit \(v_{i}\) and hidden unit \(h_{j}\), respectively, \(W_{ij}\) is the weight of the connection between visible unit \(v_{i}\) and hidden unit \(h_{j}\), and \(\sigma_{i}\) is the standard deviation of the Gaussian noise of visible unit \(v_{i}\).

Based on Eq. (2), the joint probability distribution over the visible and hidden vectors \(\left( {v,h} \right)\) can be described as:

$$P\left( {v,h\left| \theta \right.} \right) = \frac{{{\text{e}}^{{ - E\left( {v,h\left| \theta \right.} \right)}} }}{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h\left| \theta \right.} \right)}} } }}$$
(3)

Thus, the probability of the vector in visible layer can be calculated by summing all probabilities of the vector in hidden layer, and the process can be described by:

$$p\left( v \right) = \frac{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}$$
(4)

Since the units in one layer are connected only to the units in the other layer, the units within a layer are conditionally independent of each other given the other layer. The conditional probabilities can be calculated by:

$$P\left( {v_{i} \left| {h,\theta } \right.} \right) = N\left( {a_{i} + \sigma_{i} \sum\limits_{j = 1}^{k} {W_{ij} } h_{j} ,\sigma_{i}^{2} } \right)$$
(5)
$$P\left( {h_{j} = 1\left| {v,\theta } \right.} \right) = \sigma \left( {b_{j} + \sum\limits_{i = 1}^{m} {\frac{{v_{i} }}{{\sigma_{i} }}} W_{ij} } \right)$$
(6)

where \(\sigma \left( x \right) = \frac{1}{{1 + \exp \left( { - x} \right)}}\) is the sigmoid activation function and \(N\left( {\mu ,\sigma^{2} } \right)\) stands for the Gaussian distribution with mean \(\mu\) and variance \(\sigma^{2}\). In GBRBM, the standard deviation \(\sigma_{i}\) is also known as the noise level, and in practice, it is usually set to 1.
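To make Eqs. (5) and (6) concrete, the two conditional distributions can be sketched in plain Python as below. The function names and the list-of-lists layout of `W` (with `W[i][j]` linking visible unit i to hidden unit j) are assumptions for illustration only.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_prob(v, W, b, sigma):
    """Eq. (6): P(h_j = 1 | v) for every hidden unit j."""
    m, k = len(v), len(b)
    return [sigmoid(b[j] + sum(v[i] / sigma[i] * W[i][j] for i in range(m)))
            for j in range(k)]

def visible_mean(h, W, a, sigma):
    """Mean of the Gaussian in Eq. (5) for every visible unit i."""
    m, k = len(a), len(h)
    return [a[i] + sigma[i] * sum(W[i][j] * h[j] for j in range(k))
            for i in range(m)]

def sample_visible(h, W, a, sigma, rng=random):
    """Draw v_i from N(mean_i, sigma_i^2) as in Eq. (5)."""
    means = visible_mean(h, W, a, sigma)
    return [rng.gauss(mu, s) for mu, s in zip(means, sigma)]

# Toy parameters: 2 visible units, 2 hidden units, unit noise level.
W = [[0.5, -0.5], [0.0, 1.0]]
a, b, sigma = [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]
p_hidden = hidden_prob([1.0, 2.0], W, b, sigma)
mean_visible = visible_mean([1.0, 0.0], W, a, sigma)
```

Because the units of one layer are conditionally independent given the other layer, each probability and each mean is computed unit by unit, with no coupling inside a layer.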

For a given training data set \(X = \left\{ {x^{\left( t \right)} } \right\},\;t = 1,2 \ldots ,C\) (C is the sample size), the data should first be standardized because the variables have different units and scales. After that, the structure parameters of GBRBM can be estimated by minimizing the negative log-likelihood. In this paper, a stochastic approximation method, namely the contrastive divergence (CD) algorithm, is introduced to estimate the expected values. The structure parameters of GBRBM can be updated by the following equations:

$$\Gamma = \log \prod\nolimits_{t = 1}^{C} {p\left( {x^{\left( t \right)} } \right)}$$
(7)
$$\Delta W_{ij} = \varepsilon \frac{{\partial\Gamma }}{{\partial W_{ij} }} = \varepsilon \left( {\left\langle {\frac{{v_{i} h_{j} }}{{\sigma_{i} }}} \right\rangle_{d} - \left\langle {\frac{{v_{i} h_{j} }}{{\sigma_{i} }}} \right\rangle_{m} } \right)$$
(8)
$$\Delta a_{i} = \varepsilon \frac{{\partial\Gamma }}{{\partial a_{i} }} = \varepsilon \left( {\left\langle {\frac{{v_{i} }}{{\sigma_{i}^{2} }}} \right\rangle_{d} - \left\langle {\frac{{v_{i} }}{{\sigma_{i}^{2} }}} \right\rangle_{m} } \right)$$
(9)
$$\Delta b_{j} = \varepsilon \frac{{\partial\Gamma }}{{\partial b_{j} }} = \varepsilon \left( {\left\langle {h_{j} } \right\rangle_{d} - \left\langle {h_{j} } \right\rangle_{m} } \right)$$
(10)
(10)

where \(\varepsilon\) is the learning rate and \(\left\langle . \right\rangle_{d}\) and \(\left\langle . \right\rangle_{m}\) represent the expected values of the training data and model, respectively. Details of how GBRBM is derived from the classical RBM are given in “Appendix.”
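The update rules above can be illustrated by a single contrastive divergence step with one Gibbs sampling step (CD-1). The sketch below is a simplified illustration, not the implementation used in this paper: it assumes a unit noise level (\(\sigma_{i} = 1\)), processes one training vector at a time, and uses the Gaussian mean as the reconstruction. The function name `cd1_update` is hypothetical.

```python
import math
import random

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cd1_update(v0, W, a, b, lr=0.001, rng=random):
    """One CD-1 step for a GBRBM with unit noise level (sigma_i = 1).

    v0 is one standardized training vector; W, a and b are updated in place.
    """
    m, k = len(a), len(b)
    # Positive phase: hidden probabilities driven by the data.
    ph0 = [_sigmoid(b[j] + sum(v0[i] * W[i][j] for i in range(m))) for j in range(k)]
    h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]
    # Negative phase: one Gibbs step, using the Gaussian mean as the reconstruction.
    v1 = [a[i] + sum(W[i][j] * h0[j] for j in range(k)) for i in range(m)]
    ph1 = [_sigmoid(b[j] + sum(v1[i] * W[i][j] for i in range(m))) for j in range(k)]
    # Updates: learning rate times (data statistic - reconstruction statistic).
    for i in range(m):
        for j in range(k):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += lr * (v0[i] - v1[i])
    for j in range(k):
        b[j] += lr * (ph0[j] - ph1[j])
    return v1
```

In this sketch the reconstruction statistics stand in for the model expectation \(\left\langle . \right\rangle_{m}\), which is the approximation that makes CD cheaper than computing the exact gradient of the log-likelihood.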

In the DBN model, features of the observation data are extracted by the hidden units, and the obtained features then serve as the input of another GBRBM. After stacking GBRBMs in this way, a high-dimensional feature vector can be obtained in the final GBRBM. However, GBRBM is originally used for unsupervised learning. One way to introduce supervised learning is to combine it with an additional supervised learning algorithm. In this paper, the output of the final GBRBM (i.e., the high-dimensional feature vector) is used as the input of a BP neural network, and the output of the BP neural network is used as the final result (i.e., the predicted bus travel time). The multilayer structure of the proposed model enables it to process highly structured data and learn features automatically. The pseudo-code of the procedure for training the DBN model is reported hereafter, using the following symbols: \(T\) is the number of layers; \(m\) and \(k\) are, respectively, the numbers of visible units and hidden units; \(N\) is the number of iterations; \(\varepsilon\) is the learning rate; \(W\) is the set of weights of GBRBMs to be estimated; \(\theta = \left\{ {W,a,b} \right\}\) denotes the structure parameters of GBRBM; \(\Psi\) represents the inherent features of data; \({\text{ETF}}\left( {\text{data}} \right)\) extracts and transforms the features across the hidden layers; \({\text{UWE}}\left(\Psi \right)\) is the procedure that updates the weights of GBRBM; \(\psi\) denotes the prediction results of the DBN model; \(\psi^{*}\) denotes the best results of the DBN model; \({\text{BP}}\left( {\psi^{*} } \right)\) is the procedure of the BP neural network; and \(\Delta\) denotes the final solutions. In addition, the structure of the DBN model for bus travel time prediction is shown in Fig. 4.

Figure a Pseudo-code of the procedure of training the DBN model

Fig. 4 The structure of DBN for bus travel time prediction
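The stacking idea described above, in which the hidden activations of one GBRBM become the input of the next, can be sketched as a simple forward pass. This is only an illustration under stated assumptions: the layers are taken to be already trained, the noise level is 1, the BP neural network on top is not shown, and the name `propagate` and the `(W, b)` layout are hypothetical.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def propagate(x, layers):
    """Pass one input vector through stacked RBM layers.

    layers is a list of (W, b) pairs, with W[i][j] linking input i to hidden
    unit j. The hidden activation probabilities of one layer feed the next.
    """
    features = list(x)
    for W, b in layers:
        features = [
            _sigmoid(b[j] + sum(features[i] * W[i][j] for i in range(len(features))))
            for j in range(len(b))
        ]
    return features

# Toy stack: 2 inputs -> 3 hidden units -> 2 hidden units, all parameters zero.
layers = [
    ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0, 0.0]),
    ([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),
]
top_features = propagate([1.0, 2.0], layers)
```

The vector returned by the last layer corresponds to the feature vector that the text feeds into the BP neural network for the supervised prediction step.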

3 Case study

3.1 Data descriptions

In this paper, the proposed model is tested by real-world data collected in Shenyang, the capital city of Liaoning Province in China. Shenyang has a highly developed urban transit network that comprises more than 222 bus routes. Providing accurate bus information is vital for transit agencies to attract more passengers. In Shenyang, each bus stop is equipped with a video detector. It is originally used to identify abnormal conditions. However, by using video recognition technology, the data of bus travel time and dwell time can also be obtained.

The selected bus route for testing is route No. 232, which spans 10.7 km. The operation time of route No. 232 is from 4:50 am to 10:00 pm, and the departure interval is approximately 2.5 min. The reason for selecting this route is that there is a large passenger demand on the route every day. It connects Santaizi bus stop, located in the suburb of Shenyang, to Wanda plaza, the city center. It has 19 bus stops and the location of the selected route is illustrated in Fig. 5. The detailed information is also given in Table 1. Note that the two directions of the route are considered separately in this paper.

Fig. 5 The location of bus route No. 232

Table 1 Detailed information of the selected bus route

To obtain the data of bus running time and dwell time, vehicle recognition was carried out on the video data covering the full operating day on each of July 17–21, 2017 (Monday to Friday). By using vehicle recognition technology, the bus number, bus dwell time and bus running time can be acquired. However, several buses sometimes reach the bus stop at the same time, which makes vehicle recognition difficult. To reduce the influence of recognition errors, automatic vehicle location (AVL) data obtained by the GPS units of buses are also used. By matching the bus locations [48] with the vehicle video records and filtering out the outliers, 50,876 valid observations of bus route No. 232 are finally obtained. Then, based on the running directions, these valid observations are divided into two groups. For each group, the data are further divided into two subsets: 80 percent of the data are used as the training set, while the remaining 20 percent are used as the testing set. The descriptive statistics for each route segment are given in Table 2.

Table 2 Descriptive statistics of the valid observations

Note that the north direction is the direction in which buses move from the city center to the suburb, while the south direction is from the suburb to the city center. From Table 2, it can be observed that the bus travel time in the south direction varies from 30 to 392 s and the travel time in the north direction varies from 58 to 343 s. The average bus travel time for both directions is almost the same, while the standard deviation (SD) of the north direction is less than that of the south direction. This difference arises because passenger demand in the south direction is more volatile than that in the north direction.

3.2 Performance indexes

The performance of the prediction model is evaluated with three metrics: mean absolute error (MAE), mean absolute percent error (MAPE) and root mean squared error (RMSE). The three metrics could measure the differences between the predicted and actual observed value in different aspects. They are defined as follows:

$${\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {t_{i,k} - \hat{t}_{i,k} } \right|}$$
(11)
$${\text{MAPE}}\,\,(\% ) = \frac{1}{n}\sum\limits_{i = 1}^{n} {\frac{{\left| {t_{i,k} - \hat{t}_{i,k} } \right|}}{{t_{i,k} }}} \times 100\%$$
(12)
$${\text{RMSE}} = \sqrt {\frac{1}{n - 1}\sum\limits_{i = 1}^{n} {\left( {t_{i,k} - \hat{t}_{i,k} } \right)^{2} } }$$
(13)

where \(t_{i,k}\) is the observed travel time and n is the number of observations.
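For illustration, Eqs. (11)–(13) can be computed as in the following sketch. The function names and the toy values are assumptions; note that the RMSE follows Eq. (13) as written and therefore divides by n − 1 rather than n.

```python
import math

def mae(actual, predicted):
    """Eq. (11): mean absolute error."""
    return sum(abs(t - p) for t, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Eq. (12): mean absolute percent error, in percent."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Eq. (13): root mean squared error with the n - 1 divisor."""
    n = len(actual)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(actual, predicted)) / (n - 1))

# Toy travel times in seconds, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
scores = (mae(actual, predicted), mape(actual, predicted), rmse(actual, predicted))
```

MAE and RMSE are expressed in the unit of the travel time (seconds), whereas MAPE is scale free, which is why the three metrics can rank models differently.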

3.3 Model identification

In model identification, observations of the first four days, i.e., July 17–20, 2017, are set as the training set and the observations on July 21, 2017, are selected as the testing set. Firstly, sensitivity tests are conducted to select rational input variables. To reduce the effect of different units of measurement, data normalization is conducted and the normalization method is shown in Eq. (14).

$$x_{i}^{\text{normal}} = \frac{{x_{i} - x_{ \hbox{min} } }}{{x_{ \hbox{max} } - x_{ \hbox{min} } }}$$
(14)

where \(x_{i}^{\text{normal}}\) is the normalized travel time, \(x_{i}\) is the raw sample data and \(x_{\hbox{min} }\) and \(x_{\hbox{max} }\) are the minimum and maximum value of sample data, respectively.
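As a small worked example, the min-max normalization of Eq. (14) and the inverse mapping that recovers values in the original unit can be sketched as follows. The function names and the toy values are assumptions, and the sketch presumes that the sample contains at least two distinct values so that the denominator is nonzero.

```python
def min_max_normalize(values):
    """Eq. (14): scale values to [0, 1]; also return the min and max used."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values], lo, hi

def inverse_normalize(normalized, lo, hi):
    """Map normalized values back to the original scale."""
    return [z * (hi - lo) + lo for z in normalized]

# Toy travel times in seconds, for illustration only.
travel_times = [30.0, 120.0, 392.0]
scaled, lo, hi = min_max_normalize(travel_times)
restored = inverse_normalize(scaled, lo, hi)
```

Keeping the minimum and maximum of the training sample is what allows a normalized prediction to be converted back to seconds afterwards.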

In this paper, six models with different input variables are calibrated and listed in Table 3. Table 3 also provides the average MAEs of the prediction results for both directions. It can be seen that the performance of the sixth model is the best. This indicates that taking different aspects of traffic conditions into consideration improves the accuracy of bus running time prediction. Thus, the variables \(t_{i - 1,k} ,t_{i,k - 1} ,t_{i + 1,k - 1} ,dt_{i,k} ,\overline{dt}_{i} ,\sigma \left( {dt} \right)_{i}^{2}\) are all selected as the input variables of the DBN model in this paper.

Table 3 Prediction models with different input variables

Finding a DBN with good feature learning capacity is difficult since several numerical parameters, such as the number of layers (T), the number of units (\(m\) for visible units, \(k\) for hidden units), the number of iterations (N) and the learning rate (\(\varepsilon\)), need to be predetermined. In this paper, the ranges of these parameters are set as \(T \in [1,4]\), \(m \in [150,200]\), \(k \in [150,200]\), \(N \in [100,500]\) and \(\varepsilon \in [0.0005,0.005]\), respectively. Then, a tenfold cross-validation [3] and a grid search are used to identify the optimal parameter values. The resulting values of \(\left( {T,m,k,N,\varepsilon } \right)\) are \(\left( {3,160,180,340,0.001} \right)\). Finally, a three-layer BP neural network with six hidden units is used to predict bus travel time. Since the data have been normalized, the final predicted bus travel time is obtained according to Eq. (15).

$$x_{i}^{\text{inv}} = x_{i}^{\text{normal}} \times \left( {x_{\hbox{max} } - x_{\hbox{min} } } \right) + x_{\hbox{min} }$$
(15)
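The parameter selection described above, a grid search scored by tenfold cross-validation, can be outlined as in the sketch below. It is a generic illustration: the candidate values in `grid`, the toy `score` function and the helper names are all hypothetical, since this paper reports only the parameter ranges and the selected values.

```python
from itertools import product

def k_fold_indices(n, folds=10):
    """Split range(n) into `folds` contiguous validation blocks."""
    bounds = [n * f // folds for f in range(folds + 1)]
    return [list(range(bounds[f], bounds[f + 1])) for f in range(folds)]

def grid_search(grid, score):
    """Return the parameter combination with the lowest score.

    grid maps a parameter name to its candidate values; score is a
    caller-supplied function returning, e.g., the mean validation MAE.
    """
    names = list(grid)
    best, best_score = None, float("inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        s = score(params)
        if s < best_score:
            best, best_score = params, s
    return best, best_score

# Toy example: the score is a stand-in for a cross-validated error.
grid = {"T": [1, 2, 3, 4], "eps": [0.0005, 0.001, 0.005]}
best, best_score = grid_search(grid, lambda p: abs(p["T"] - 3) + abs(p["eps"] - 0.001))
fold_blocks = k_fold_indices(25, folds=10)
```

In practice the score function would train the model on nine folds and evaluate it on the held-out fold, averaging over the ten blocks returned by `k_fold_indices`; the cost grows with the product of the numbers of candidate values.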

3.4 Numerical results

In this section, to validate the proposed DBN model, four other prediction models, namely k-NN, ANN, SVM and RFs, are also employed. To ensure a fair comparison, the same training and testing sets are used for all prediction models, and the input variables of all four models are the same as those of the DBN model. Based on sensitivity tests, the value of the parameter k is set as 3 for the k-NN model. In the ANN model, a standard three-layer ANN is used and the number of hidden units is set as 5. The SVM model has two hyperparameters, C and \(\varepsilon\). By using a tenfold cross-validation [3] and a grid search method, the values of the two parameters are set as 25 and 0.2, respectively. In addition, the radial basis function (RBF) kernel is used in the SVM model. The RFs model also has two hyperparameters: \(m_{\text{try}}\) and \(n_{\text{tree}}\). By using the same parameter selection method as for the SVM model, the two parameters in RFs are set as 4 and 1000, respectively. Finally, the results are summarized in Fig. 6 and the details are given in Table 4.
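As an illustration of the simplest of the four baselines, a k-NN regression with k = 3 can be sketched as follows. The Euclidean distance and the unweighted average of the neighbors are assumptions; this paper does not state which similarity measure or weighting was used.

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Predict by averaging the targets of the k nearest training points.

    Euclidean distance is assumed. Every prediction scans the whole training
    set, so the cost grows with the sample size.
    """
    dists = sorted((math.dist(row, x), y) for row, y in zip(train_X, train_y))
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy one-dimensional example, for illustration only.
train_X = [[0.0], [1.0], [2.0], [10.0]]
train_y = [100.0, 110.0, 120.0, 300.0]
prediction = knn_predict(train_X, train_y, [1.0], k=3)
```

The full scan over the training set at prediction time is the similarity measurement step that makes k-NN expensive for large samples.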

Fig. 6 The performance of the five models

Table 4 The average results of the five different prediction models

Figure 6 shows the performance of all five prediction models for both directions of the selected bus route in terms of MAE, RMSE and MAPE. With respect to MAE, the DBN model outperforms the other four prediction models (k-NN, ANN, SVM and RFs, in that order) by 46.01%, 67.47%, 24.17% and 17.91% in the south direction and by 49.64%, 50.79%, 0% and 4.12% in the north direction, respectively. In terms of MAPE, the DBN model outperforms the other four models by 43.01%, 71.15%, 30.67% and 23.25% in the south direction and by 50.77%, 43.73%, 8.53% and 10.65% in the north direction, respectively. Regarding RMSE, the DBN model outperforms the other four models by 41.72%, 56.03%, 14.73% and 11.39% in the south direction. In the north direction, however, the SVM model has the lowest RMSE. In summary, the DBN model performs best on every measure except RMSE in the north direction. This suggests that the DBN model is more effective than shallow learning architectures in extracting inherent features from massive data. In terms of MAE and MAPE, although the SVM and RFs models perform worse than the DBN model, they still perform better than the k-NN and BP network models. This is mainly because the structural risk minimization principle of SVM and the feature reduction principle of RFs can effectively avoid overfitting, which also helps explain why these two types of model are widely used in forecasting. Moreover, although the k-NN model performs worse than the SVM, RFs and DBN models, it performs better than the BP network model, which is the worst of the five. Thus, considering its simple structure and ease of implementation, k-NN is still a satisfactory model for bus travel time prediction.
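For reference, the three error measures and the percentage comparison can be computed as in the sketch below. It assumes the standard definitions of MAE, RMSE and MAPE, and it assumes that "outperforms by x%" means the relative reduction of the error with respect to the compared model; that reading, the function names and the sample numbers are assumptions for illustration and are not taken from Table 4.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error (assumed reading)."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Made-up observed and predicted travel times, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
example_mae = mae(actual, predicted)
example_rmse = rmse(actual, predicted)
example_mape = mape(actual, predicted)
```

Under this reading, a value of 0% means that the two models have the same error on that measure.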

As to computation time, all models were coded in MATLAB R2012a and executed on a PC equipped with 4 GB of RAM and a dual-core 3.2 GHz processor. The results are listed in Table 5. Table 5 shows that RFs has the shortest computation time among all bus travel time prediction models, while the k-NN model has the longest. Specifically, the computation time of k-NN is almost 20 times that of RFs, which can be attributed to the similarity measurement process in the k-NN model. Compared with the k-NN model, the shallow learning methods (the BP network and SVM models) have relatively short running times. Because of its multilayer structure, training a DBN model takes more computation time. However, with the help of cloud and parallel computing, the DBN model can readily be applied to real-time bus travel time prediction. Considering the trade-off between accuracy and computation time, the DBN model is the best approach among the five bus travel time prediction models, and RFs is also a competitive method.

Table 5 Computation time of all prediction models (unit: s)

Since bus travel time is affected by both temporal and spatial variables, it differs significantly between peak and off-peak hours and between driving directions. The reliability of predicted bus travel time tends to decrease in peak hours because traffic conditions are more complex and more passengers are waiting to board. Thus, to further validate the proposed DBN model, bus travel time predictions are also made for both the south and north directions in peak hours (i.e., 7 am–9 am and 5 pm–7 pm) on July 21, 2017. The prediction results of the five models in peak hours in terms of MAPE are shown in Fig. 7.
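Selecting the peak-hour records described above could look like the following sketch. The two time windows come from the text; treating them as half-open intervals, the record layout and the sample values are assumptions made only for illustration.

```python
from datetime import datetime, time

# Peak windows as stated in the text: 7 am to 9 am and 5 pm to 7 pm.
PEAK_WINDOWS = [(time(7, 0), time(9, 0)), (time(17, 0), time(19, 0))]


def is_peak(timestamp):
    """Return True if the time of day falls inside a peak window.

    Assumption: each window is half-open, i.e. start <= t < end.
    """
    t = timestamp.time()
    return any(start <= t < end for start, end in PEAK_WINDOWS)


# Hypothetical records of (departure timestamp, travel time); values are made up.
records = [
    (datetime(2017, 7, 21, 6, 55), 71.0),
    (datetime(2017, 7, 21, 7, 30), 96.0),
    (datetime(2017, 7, 21, 12, 0), 68.0),
    (datetime(2017, 7, 21, 18, 59), 102.0),
    (datetime(2017, 7, 21, 19, 0), 80.0),
]
peak_records = [r for r in records if is_peak(r[0])]
```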

Fig. 7 Prediction results of the five models in peak hours

For all prediction models, it can be seen from Fig. 7 that the results for the north direction are better than those for the south direction. Traffic conditions are more complex and more passengers are waiting to board in the south direction, since it runs from the suburbs to the city center, while the north direction runs from the city center to the suburbs. This indicates that prediction accuracy decreases as traffic conditions become more complex. Nonetheless, compared with the other four prediction models, the DBN model remains the best in peak hours. In summary, the prediction results obtained by the DBN model are shown to be more reliable than those obtained by the other prediction models.

Finally, the detailed prediction results for both the south and north directions in peak hours on July 21, 2017, are illustrated in Fig. 8. It can be seen that, in both directions, the maximum error over all segments does not exceed 12.19. Figure 8 also shows that the prediction errors are largest for segments 9, 10 and 11 in the south direction and for segments 5 and 7 in the north direction. In Table 2, the standard deviations of segments 9, 10 and 11 in the south direction and segments 5 and 7 in the north direction are 32.74, 44.51, 31.81, 13.65 and 12.95, respectively, which are clearly larger than those of the other segments in each direction. These results further verify that prediction accuracy decreases as the uncertainty of traffic conditions increases.

Fig. 8 Prediction errors of the DBN model of different segments

4 Conclusion

This paper proposed a bus travel time prediction model based on the concept of the deep belief network. Since the input variables of the proposed model are continuous data, the basic units of the DBN were improved by introducing Gaussian–Bernoulli RBMs. In addition, a BP neural network algorithm was used to predict bus travel time in a supervised fashion. To validate the proposed model, real-world data from bus route No. 232 in Shenyang, China, were collected. Four other models, including k-NN, ANN, SVM and RFs, were also introduced. Results showed that the DBN model performed best among all five travel time prediction models. The proposed model also performed well in peak hours: the maximum errors in that case, in terms of MAE, were only 12.19 in the south direction and 10.22 in the north direction.

In this paper, only the bus data from a single route are considered as the input variables. To enhance the performance of the proposed bus travel time prediction model, further study should consider more factors that might affect prediction accuracy, such as weather conditions, the running time of other bus routes or vehicles on the same segment, and the environment of signalized intersections. In addition, with the development of computer technology, multiple types of deep learning models have been proposed. In the future, different deep learning models will be compared to further validate the proposed method.