Bus travel time prediction based on deep belief network with back-propagation

Chen, Chao; Wang, Hui; Yuan, Fang; Jia, Huizhong; Yao, Baozhen

doi:10.1007/s00521-019-04579-x

Original Article
Published: 02 November 2019

Volume 32, pages 10435–10449, (2020)
Neural Computing and Applications

Chao Chen^1,2,
Hui Wang¹,
Fang Yuan¹,
Huizhong Jia¹ &
…
Baozhen Yao ORCID: orcid.org/0000-0003-2867-2549¹

Abstract

In an intelligent transportation system, accurate bus information is vital for passengers to schedule their departure time and make reasonable route choices. In this paper, an improved deep belief network (DBN) is proposed to predict the bus travel time. By using Gaussian–Bernoulli restricted Boltzmann machines to construct a DBN, we update the classical DBN to model continuous data. In addition, a back-propagation (BP) neural network is further applied to improve the performance. Based on the real traffic data collected in Shenyang, China, several experiments are conducted to validate the technique. Comparison with typical forecasting methods such as k-nearest neighbor algorithm (k-NN), artificial neural network (ANN), support vector machine (SVM) and random forests (RFs) shows that the proposed method is applicable to the prediction of bus travel time and works better than traditional methods.

1 Introduction

With the development of technologies and information systems, intelligent transportation systems (ITSs) and advanced traveler information systems (ATISs) have been widely deployed in urban regions. In the field of public transportation, the delivery of real-time traffic information is one of the most visible applications of ITS [1]. Traditionally, this type of information is delivered in the form of arrival and departure times on digital boards at bus stops. More recently, passengers can obtain it via smartphone apps and in-vehicle screens. Public transport can usually serve passengers in an easy and comfortable way. However, due to unpredictable factors, buses sometimes arrive early or late at a particular stop. Because of this lack of operational reliability, many travelers are inclined to choose private cars rather than public transport. Transit agencies realize that providing accurate bus arrival times is valuable for attracting more passengers and improving management and service levels. They can adjust their bus schedules by applying a higher or lower speed in advance and ultimately reduce the waste of bus resources.

At the same time, the availability of accurate bus information can also help passengers to efficiently schedule their departure time, reduce their waiting time and make smart choices for their travel [2, 3]. Given the bus arrival time information, passengers can choose a suitable travel mode for their journeys. Figure 1 shows an example illustrating the effect of providing accurate bus information at a bus stop. In the example, a passenger attempts to travel from Stop A to Stop B. He/she actually has three choices on this trip. If the passenger has enough time, he/she can choose either of the two bus routes (i.e., route no. 1 and no. 2). However, if time is short, the passenger can take a taxi to the destination. Furthermore, if the passenger knows the bus arrival times of the next buses of the two bus routes (e.g., 09:00 and 09:05, respectively), he/she will wait for the next bus of route no. 2 rather than that of route no. 1. As a result, the waiting time of the passenger will be reduced.

As the objective of bus travel time prediction is to provide such information, this problem has been one of the hottest issues in ITSs. It should be noted that predicting bus travel time precisely in areas with little external influence, such as rural areas, is relatively easy. However, the problem becomes much more complex in urban areas. The motivation of this paper is to improve the prediction accuracy of current bus travel time prediction models.

In the past decades, by using historical data or online data (obtained by the global positioning system), various forecasting models and techniques have been proposed to predict bus travel time. These techniques include historical average model [4,5,6], statistical model [7,8,9,10,11], nonparametric regression model [2, 12,13,14], machine learning model [2, 3, 15,16,17,18,19,20,21,22,23,24] and hybrid model [20, 25, 26].

The historical average model predicts the current and future bus travel time within a given period by averaging the historical travel times. Chung and Shalaby [6] presented a school bus arrival time prediction model, which combined a historical moving average model with an operational strategy. Their results indicated that the proposed model was powerful in real-life application. However, a noteworthy feature of this type of model is that it requires a stable traffic pattern. The reliability of the model decreases greatly if the traffic pattern varies widely [27, 28]. It should be noted that in recent research, this type of model is presented only for comparison purposes.

Statistical models can be further divided into two main categories: time-series models and regression models. A time-series model assumes that the future state of a bus depends on the trend of several past states of the same bus. This model usually causes a short time lag between the predicted value and the real data. The commonly used time-series models include autoregressive moving average (ARMA) [29], generalized autoregressive conditional heteroscedasticity model [7] and seasonal ARIMA model [8, 30]. Williams and Hoel [4] compared the performances of several methods for traffic flow rate prediction. These methods include seasonal ARIMA, random walk, historical average and deviation from historical average. Their results showed that seasonal ARIMA could provide the best forecasts for all performance statistics. It should be noted that the prediction accuracy also decreases significantly when the relationship between real-time and historical data is complicated [16]. The other kind of statistical model is the regression model. This kind of model is developed to verify the effect of different factors on bus travel time. In the model, factors such as bus dwell time, traffic conditions and travel distance are treated as independent variables, while bus travel time is the dependent variable. Having observed that current and future travel times have a linear relationship whose slope and intercept change with the time of day, Rice and Van Zwet [9] proposed a linear regression with time-varying coefficients to predict travel time on freeways. However, the prediction accuracy of a regression model heavily depends on the selection of independent variables [16].

Recently, because it requires no parameter estimation, k-NN has been extensively employed in travel time prediction. It is a kind of nonparametric regression model. Based on k-NN, You and Kim [12] proposed a hybrid travel time forecasting model to predict link travel times in congested road networks. By using historical and real-time data, Chang et al. [14] developed a dynamic model to predict bus multi-interval path travel time. Results in their paper showed that the k-NN method was effective in terms of both accuracy and computing time. Yu et al. [2] proposed four models, including SVM, ANN, k-NN and linear regression, to predict bus arrival time at a bus stop served by multiple routes. Their results also indicated that k-NN was an effective method for bus travel time prediction. However, nonparametric regression models become costly in execution time when the sample size is large [2, 14].

As a subfield of artificial intelligence, machine learning methods are widely used in transportation. Of various machine learning methods, ANN and SVM are the most widely used models in bus travel time prediction. The ability of ANN for solving complex nonlinear problems has been proved in many applications [31, 32]. Chien et al. [16] developed two ANN-based models to predict bus arrival time, namely link-based ANN and stop-based ANN. To further improve the prediction accuracy of both models, an adaptive algorithm was developed. Results indicated that the adaptive algorithm could actually improve the performances. Mazloumi et al. [17] proposed an integrated framework that consists of two ANNs to predict both the average and variance of travel times. The proposed method was validated by using the data collected in Melbourne, Australia. However, the structure of ANN (i.e., input variables, hidden-layer size, learning rate, etc.) depends on the experience of researchers [33,34,35,36].

Characterized by the capacity control of the decision function, the use of the kernel functions and the sparse solution, SVM has also been widely used in the field of prediction. By using SVM-based model and the data of preceding buses, Yu et al. [18] predicted the bus arrival time for a bus route. Their results showed SVM was a powerful method for bus arrival time prediction. To further improve the performance of the SVM, Yu et al. [19] firstly applied the Grubbs’ test method to remove outliers from the input data. Then, an SVM-based model with forgetting factors was introduced. Results showed that the improved SVM model outperformed the standard SVM.

However, ANN and SVM are complicated because of the large number of parameters that need to be adjusted [3]. To mitigate the adverse effect of parameter selection, attention has turned to an emerging type of machine learning method, RFs. As an ensemble learning algorithm, the RFs model can readily assess the importance of thousands of variables. Through the random selection of features and training samples, it has been shown to be efficient in avoiding overfitting. Some applications of RFs can be found in bus travel time prediction. Gal et al. [20] proposed a combination method of queuing theory and RFs to predict bus travel time. In their paper, RFs were used to detect the outliers in historical data. Yu et al. [3] developed an RFs model to predict bus travel time. To avoid the influence of massive data, a preselection method for the training set, near neighborhoods, was proposed in their paper. Finally, their model was calibrated and validated by using the data of two bus routes in China.

To avoid the drawback of single prediction model, several types of hybrid models have also been studied. These methods include ANN model with Kalman filter-based algorithm [25], SVM model with Kalman filter-based algorithm [26], queuing theory with RFs algorithm [20] and so on. Results showed that hybrid models could certainly obtain better performances than those single prediction models.

Across all varieties of prediction models, the accuracy of bus travel time prediction heavily depends on historical and real-time traffic data. These traffic data are now growing rapidly with the widespread adoption of traffic sensor technologies. In addition, the factors that influence bus travel time commonly have complicated relationships. Finding a reliable method that can identify the relationship among these influencing factors is a challenging problem. However, the above-mentioned bus travel time prediction methods are mainly based on shallow learning architectures. It is difficult for them to identify the relationship among these influencing factors easily and precisely as the volume of information grows.

In recent years, deep learning models have drawn a lot of attention. By adopting multilayer or deep architectures, deep learning models can extract inherent features of data from the lowest level to the highest level even if the data set is large. Currently, there are some applications of deep learning models in traffic flow or travel time prediction [37,38,39,40,41,42,43,44,45,46]. The first traffic flow prediction model based on deep architecture was proposed by Lv et al. [37]. In their paper, a stacked autoencoder model was developed to learn traffic features from the data collected in California freeway systems. Experimental results showed that the prediction method based on deep architecture was superior to BPNN, SVM and RBFNN models [37]. Siripanpornchana et al. [38] developed an urban freeway travel time prediction model based on deep learning architectures. Unlike Lv et al. [37], who used an autoencoder model, they used the concept of DBN, which comprises a stack of restricted Boltzmann machines (RBMs). However, to our knowledge, literature about bus travel time prediction using deep learning models is relatively scarce. In addition, the relationship among factors affecting the bus travel time is usually more complex than that affecting traffic flows. Given the strong feature-learning ability of deep learning models, this paper attempts to propose a new bus travel time prediction model based on the concept of DBN. In the aspect of model inputs, previous studies indicate that the traffic condition is regarded as the main factor influencing the accuracy of bus travel time prediction. Data that can reflect the traffic condition can be summarized as: (1) travel speed; (2) travel time; (3) distance; (4) emergency situation; and (5) dwell time. In this paper, due to data availability, both bus running time and bus dwell time (the number of passengers waiting to board) of the preceding vehicles (temporal and spatial data) are used to reflect the traffic condition and to predict the travel time of the next bus.

DBN is one of the most widely used deep learning methods. It has been shown to be capable of extracting nonlinear characteristics [42, 47]. However, the classical DBN was developed to learn features from binary data rather than continuous data. Since the data used in this paper are continuous, a variant of RBM, called Gaussian–Bernoulli RBM (GBRBM), is employed in the DBN. The basic units of the DBN are thereby improved so that the model can extract and learn features of continuous data. To further improve the performance of DBN, a shallow learning architecture, the BP neural network, is also adopted to predict bus travel time in a supervised fashion.

This paper seeks to make three contributions to previous literature. Firstly, this paper attempts to develop a bus travel time prediction model based on the concept of DBN. It is expected that an accurate prediction of bus travel time can be obtained. By using the prediction results, the anxiety and waiting time of passengers can be effectively reduced. Secondly, the basic units of DBN are improved to make the model have the ability of extracting and learning continuous data (traffic) features. Thirdly, to improve the performance of DBN, a shallow learning architecture, named BP neural network model, is also adopted to predict bus travel time in a supervised fashion. Experimental results show that the proposed method has superior performance in prediction accuracy.

The remainder of this paper is organized in the following way. Section 2 presents the bus travel time prediction model and the deep learning method based on the concept of DBN. Section 3 describes the details of our experiments and results. Finally, conclusions and future works are given in Sect. 4.

2 Methodologies

2.1 Bus travel time prediction model

The bus travel time prediction model predicts the bus travel time between two consecutive bus stops. The framework of the prediction is illustrated in Fig. 2. The bus travel time on the target segment i (the distance between bus stops B and C) needs to be predicted when the target bus k arrives at bus stop B. To obtain accurate prediction results, it is essential to find the appropriate factors that influence the travel time. As described in the literature, bus travel time between bus stops is related to the bus dwell time and traffic conditions [3]. Yu et al. [3, 18] pointed out that the traffic condition can be reflected by the running time of preceding buses and that the most recent data provide the most reliable information. However, the number of passengers waiting to board (bus dwell time) also affects the bus running time since buses need more time to speed up or slow down. There are two general approaches to forecasting bus travel time in previous studies. The first uses only the running time of preceding buses (dwell time is not explicitly expressed) [3, 18, 26], and the other uses both bus running time and bus dwell time [27]. In addition, in an urban road network, road links do not exist in isolation. Traffic speed on both upstream and downstream links can affect the traffic speed of the current road. Therefore, in this paper, the average bus travel times of preceding buses on the current segment, upstream segment and downstream segment are all considered. Similarly, to find the appropriate factors to estimate bus dwell time and avoid the errors brought by unpredictable factors, the dwell time, the average dwell time and the variance of dwell time of the target bus at the starting stop are also considered. It should be noted that the preceding buses in this paper refer only to the buses that have just finished the travel between two consecutive stops and left for the next stop.

Let k denote the target bus and i denote the segment of the target bus route. The input variables considered in this paper enter the prediction model as follows:

$$\hat{t}_{i,k} = f\left( {t_{i - 1,k} ,t_{i,k - 1} ,t_{i + 1,k - 1} ,dt_{i,k} ,\overline{dt}_{i} ,\sigma \left( {dt} \right)_{i}^{2} } \right)$$

(1)

where $\hat{t}_{i,k}$ is the predicted travel time of bus k on segment i; $t_{i - 1,k}$ is the running time of bus k on segment i − 1, which reflects the traffic conditions upstream of segment i; $t_{i,k - 1}$ is the running time of bus k − 1 (i.e., the preceding bus) on segment i, which reflects the traffic conditions on the current segment; $t_{i + 1,k - 1}$ is the running time of bus k − 1 on segment i + 1, which reflects the traffic conditions downstream of segment i; $dt_{i,k}$ is the dwell time of bus k at the starting stop of segment i; and $\overline{dt}_{i}$ and $\sigma \left( {dt} \right)_{i}^{2}$ are the average and variance of dwell time at the starting stop of segment i, respectively. The prediction model proposed in this paper aims to find the relationship among these influencing factors.
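
As an illustration of Eq. (1), the six inputs can be assembled from tabulated running and dwell times. The following is a minimal sketch, not the authors' implementation. The function name `build_inputs` and the array layout (`run_time[s, b]`, `dwell_time[s, b]`) are hypothetical, and computing the dwell-time average and variance over the buses that preceded bus k is an assumption, since the text does not state the sample used.

```python
import numpy as np

def build_inputs(run_time, dwell_time, i, k):
    """Assemble the six inputs of Eq. (1) for bus k on segment i.

    run_time[s, b]   : running time of bus b on segment s (seconds)
    dwell_time[s, b] : dwell time of bus b at the starting stop of segment s
    Both arrays are hypothetical containers chosen for this sketch.
    """
    if i < 1 or i + 1 >= run_time.shape[0] or k < 1:
        raise ValueError("need an upstream segment, a downstream segment "
                         "and at least one preceding bus")
    # Assumption: dwell statistics use the buses that already served the stop.
    history = dwell_time[i, :k]
    return np.array([
        run_time[i - 1, k],      # t_{i-1,k}: target bus, upstream segment
        run_time[i, k - 1],      # t_{i,k-1}: preceding bus, current segment
        run_time[i + 1, k - 1],  # t_{i+1,k-1}: preceding bus, downstream segment
        dwell_time[i, k],        # dt_{i,k}: target bus dwell time
        history.mean(),          # average dwell time at the stop
        history.var(),           # variance of dwell time at the stop
    ])
```

Each returned vector corresponds to one training or prediction sample; the observed travel time of bus k on segment i would serve as its target.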

2.2 Deep belief network

DBN is a kind of artificial neural network inspired by imitating the process of cognition and inference of human brain [41]. It has become one of the most universal deep learning models and has been widely used in many fields. As an arbitrary undirected graphical model, DBN can be described as a generative energy-based model that has random multilayer units connecting the input layer and the output layer. It is constituted by a stack of RBMs, which provide DBN with the capacity of feature learning and feature extraction. There are two layers existing in the RBM. One layer is constituted by visible units, while the other is composed of hidden units [42]. Visible units are used to represent observable data, while hidden units are used to capture dependencies among observed variables. An illustration of an RBM is shown in Fig. 3.

In this figure, $v_{i} \left( {i = 1, \ldots ,3} \right)$ is the unit in the visible layer, while $h_{i} \left( {i = 1, \ldots ,4} \right)$ is the unit in the hidden layer, and $W$ represents the weight matrix.

However, the classical RBM was developed to solve problems only with binary data, which limits its applications. Since traffic data are continuous and several variants of RBM have been presented in previous literature, a variant of RBM that can deal with continuous data, called GBRBM, is introduced in this paper. In this model, the binary visible units are replaced by linear units with Gaussian noise [40]. Note that GBRBM is an energy-based model, in which hidden-layer variables are used to describe the probabilistic distribution of visible-layer variables. Thus, given m visible units and k hidden units, the energy function of the model can be defined as follows [47]. It is derived from the energy function of a standard RBM.

$$E\left( {v,h\left| \theta \right.} \right) = \sum\limits_{i = 1}^{m} {\frac{{\left( {v_{i} - a_{i} } \right)^{2} }}{{2\sigma_{i}^{2} }}} - \sum\limits_{j = 1}^{k} {b_{j} h_{j} } - \sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{k} {\frac{{v_{i} }}{{\sigma_{i} }}} } W_{ij} h_{j}$$

(2)

where $\theta = \left\{ {W,a,b} \right\}$ is the set of structure parameters of GBRBM, $a_{i}$ and $b_{j}$ are the biases of visible unit $v_{i}$ and hidden unit $h_{j}$, respectively, $W_{ij}$ is the weight of the connection between visible unit $v_{i}$ and hidden unit $h_{j}$, and $\sigma_{i}$ is the standard deviation of the Gaussian noise of visible unit $v_{i}$.

Based on Eq. (2), the probabilistic distribution for each pair of visible unit and hidden unit $\left( {v,h} \right)$ can be described as:

$$P\left( {v,h\left| \theta \right.} \right) = \frac{{{\text{e}}^{{ - E\left( {v,h\left| \theta \right.} \right)}} }}{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h\left| \theta \right.} \right)}} } }}$$

(3)

Thus, the probability of the vector in visible layer can be calculated by summing all probabilities of the vector in hidden layer, and the process can be described by:

$$p\left( v \right) = \frac{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}$$

(4)

Since units in one layer are connected only to units in the other layer, the units within a layer are conditionally independent of each other given the other layer. The conditional probabilities can be calculated by:

$$P\left( {v_{i} \left| {h,\theta } \right.} \right) = N\left( {a_{i} + \sigma_{i} \sum\limits_{j = 1}^{k} {W_{ij} } h_{j} ,\sigma_{i}^{2} } \right)$$

(5)

$$P\left( {h_{j} = 1\left| {v,\theta } \right.} \right) = \sigma \left( {b_{j} + \sum\limits_{i = 1}^{m} {\frac{{v_{i} }}{{\sigma_{i} }}} W_{ij} } \right)$$

(6)

where $\sigma \left( x \right) = \frac{1}{{1 + \exp \left( { - x} \right)}}$ is the sigmoid function, the most commonly used activation function, and $N\left( {\mu ,\sigma^{2} } \right)$ stands for the Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$. Note that $\sigma \left( \cdot \right)$ denotes the sigmoid function, whereas $\sigma_{i}$ denotes the standard deviation of visible unit $v_{i}$. In GBRBM, the standard deviation is also known as the noise level, and in practice, it is usually set to 1.
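
The two conditionals in Eqs. (5) and (6) translate directly into array operations. The sketch below is illustrative only; the function names and the batch layout (`v` of shape `(n, m)`, `h` of shape `(n, k)`, `W` of shape `(m, k)`) are assumptions made here, not part of the original method description.

```python
import numpy as np

def sigmoid(x):
    """Logistic function used in Eq. (6)."""
    return 1.0 / (1.0 + np.exp(-x))

def hidden_prob(v, W, b, sigma):
    """Eq. (6): P(h_j = 1 | v) for a batch of visible vectors v."""
    return sigmoid(b + (v / sigma) @ W)

def visible_mean(h, W, a, sigma):
    """Mean of the Gaussian in Eq. (5) for a batch of hidden vectors h."""
    return a + sigma * (h @ W.T)

def sample_visible(h, W, a, sigma, rng):
    """Draw v ~ N(mean, sigma^2) as in Eq. (5); rng is a numpy Generator."""
    return rng.normal(visible_mean(h, W, a, sigma), sigma)
```

With the noise level set to 1, as the text notes is usual in practice, `sigma` can simply be passed as `1.0`.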

For a given training set $X = \left\{ {x^{\left( t \right)} } \right\},\;t = 1,2 \ldots ,C$ (C is the number of samples), the data should first be standardized because the variables have different units and scales. After that, the structure parameters of GBRBM can be estimated by minimizing the negative log-likelihood. In this paper, a stochastic approximation method, namely the contrastive divergence (CD) algorithm, is introduced to estimate the expected values. The structure parameters of GBRBM can be updated by the following equations:

$$\Gamma = \log \prod\nolimits_{t = 1}^{C} {p\left( {x^{\left( t \right)} } \right)}$$

(7)

$$\Delta W_{ij} = \varepsilon \frac{{\partial\Gamma }}{{\partial W_{ij} }} = \varepsilon \left( {\left\langle {\frac{{v_{i} h_{j} }}{{\sigma_{i} }}} \right\rangle_{d} - \left\langle {\frac{{v_{i} h_{j} }}{{\sigma_{i} }}} \right\rangle_{m} } \right)$$

(8)

$$\Delta a_{i} = \varepsilon \frac{{\partial\Gamma }}{{\partial a_{i} }} = \varepsilon \left( {\left\langle {\frac{{v_{i} }}{{\sigma_{i}^{2} }}} \right\rangle_{d} - \left\langle {\frac{{v_{i} }}{{\sigma_{i}^{2} }}} \right\rangle_{m} } \right)$$

(9)

$$\Delta b_{j} = \varepsilon \frac{{\partial\Gamma }}{{\partial b_{j} }} = \varepsilon \left( {\left\langle {h_{j} } \right\rangle_{d} - \left\langle {h_{j} } \right\rangle_{m} } \right)$$

(10)

where $\varepsilon$ is the learning rate and $\left\langle . \right\rangle_{d}$ and $\left\langle . \right\rangle_{m}$ represent the expected values of the training data and model, respectively. Details of how GBRBM is derived from the classical RBM are given in “Appendix.”
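
To make the update rules concrete, a single contrastive divergence step (CD-1) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the noise level is fixed at 1 (so all $\sigma_{i}$ factors drop out), the model expectation is approximated by one Gibbs step, and the visible reconstruction uses the Gaussian mean instead of a sample. The name `cd1_step` is hypothetical.

```python
import numpy as np

def cd1_step(v0, W, a, b, lr, rng):
    """One CD-1 update of a GBRBM with unit noise level (sigma_i = 1).

    v0 : batch of standardized inputs, shape (n, m)
    W  : weights, shape (m, k); a : visible biases (m,); b : hidden biases (k,)
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Positive phase: hidden probabilities driven by the data, Eq. (6).
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # Negative phase: reconstruct the visible layer with the mean of Eq. (5)
    # (a common simplification), then recompute hidden probabilities.
    v1 = a + h0 @ W.T
    ph1 = sigmoid(b + v1 @ W)

    n = v0.shape[0]
    # Data expectation minus (approximate) model expectation, Eqs. (8)-(10).
    W_new = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a_new = a + lr * (v0 - v1).mean(axis=0)
    b_new = b + lr * (ph0 - ph1).mean(axis=0)
    return W_new, a_new, b_new
```

Repeating this step over mini-batches for N iterations would train one GBRBM layer; using probabilities rather than binary samples in the gradient statistics is a variance-reduction choice, not something the text prescribes.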

In the DBN model, features of the observation data are extracted from the hidden units, and the obtained features then serve as the input of another GBRBM. After stacking GBRBMs in this way, a high-dimensional feature vector can be obtained in the final GBRBM. However, GBRBM was originally designed for unsupervised learning. One way to incorporate supervised learning is to add a supervised learning algorithm on top. In this paper, the output of the final GBRBM (i.e., the high-dimensional feature vector) is used as the input of a BP neural network, and the output of the BP neural network is used as the final result (i.e., the predicted bus travel time). The multilayer structure of the proposed model enables it to process highly structured data and learn features automatically. We report hereafter the pseudo-code of the procedure of training the DBN model, using the following symbols: $T$ is the number of layers; $m$ and $k$ are, respectively, the numbers of visible units and hidden units; $N$ is the number of iterations; $\varepsilon$ is the learning rate; $W$ is the set of weights of GBRBMs to be estimated; $\theta = \left\{ {W,a,b} \right\}$ is the set of structure parameters of GBRBM; $\Psi$ represents the inherent features of data; ${\text{ETF}}\left( {\text{data}} \right)$ extracts and transforms the features across the hidden layers; ${\text{UWE}}\left(\Psi \right)$ is the procedure which updates the weights of GBRBM; $\psi$ is the prediction result of the DBN model; $\psi^{*}$ is the best result of the DBN model; ${\text{BP}}\left( {\psi^{*} } \right)$ is the procedure of the BP neural network; and $\Delta$ denotes the final solutions. In addition, the structure of the DBN model for bus travel time prediction is shown in Fig. 4.
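
The data flow described above (stacked layers feeding a BP network) can be sketched as a forward pass. This is not the pseudo-code referred to in the text, only a hedged illustration. It assumes already trained parameters, unit noise level, hidden activation probabilities passed between layers, and a BP network with one sigmoid hidden layer and a linear output; the names `dbn_features` and `bp_predict` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_features(x, layers):
    """Propagate inputs through stacked layers.

    layers is a list of (W, b) pairs, one per stacked GBRBM; the hidden
    probabilities of one layer are the visible input of the next.
    """
    h = x
    for W, b in layers:
        h = sigmoid(b + h @ W)  # Eq. (6) with sigma_i = 1
    return h

def bp_predict(features, W_hid, b_hid, w_out, b_out):
    """Three-layer BP network on top of the DBN features.

    Assumption: sigmoid hidden layer and linear output unit.
    """
    hidden = sigmoid(b_hid + features @ W_hid)
    return hidden @ w_out + b_out
```

In a full pipeline each `(W, b)` pair would come from unsupervised pretraining of one layer, after which the BP network is fitted to the observed travel times.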

3 Case study

3.1 Data descriptions

In this paper, the proposed model is tested by real-world data collected in Shenyang, the capital city of Liaoning Province in China. Shenyang has a highly developed urban transit network that comprises more than 222 bus routes. Providing accurate bus information is vital for transit agencies to attract more passengers. In Shenyang, each bus stop is equipped with a video detector. It is originally used to identify abnormal conditions. However, by using video recognition technology, the data of bus travel time and dwell time can also be obtained.

The selected bus route for testing is route No. 232, which spans 10.7 km. The operation time of route No. 232 is from 4:50 am to 10:00 pm, and the departure interval is approximately 2.5 min. The reason for selecting this route is that there is a large passenger demand on the route every day. It connects Santaizi bus stop, located in the suburb of Shenyang, to Wanda plaza, the city center. It has 19 bus stops and the location of the selected route is illustrated in Fig. 5. The detailed information is also given in Table 1. Note that the two directions of the route are considered separately in this paper.

Table 1 Detailed information of the selected bus route

To obtain the data of bus running time and dwell time, vehicle recognition was carried out on the video data for the whole of each day from July 17 to 21, 2017 (Monday to Friday). By using vehicle recognition technology, the bus no., bus dwell time and bus running time can be acquired. However, several buses sometimes reach the bus stop at the same time, making vehicle recognition difficult to implement. To reduce the influence of recognition errors, automatic vehicle location (AVL) data obtained by the GPS units of buses are also used. By mapping the bus location [48] with the vehicle video records and filtering out the outliers, 50,876 valid observations of bus route No. 232 are finally obtained. Then, based on the running directions, these valid observations are divided into two groups. For each group, the data are further divided into two subsets: 80 percent of the data are used as the training set, while the remaining 20 percent are used as the testing set. The descriptive statistics for each route segment are given in Table 2.

Table 2 Descriptive statistic for the valid observation

Note that the north direction is the direction in which buses move from the city center to the suburb, while the south direction is from the suburb to the city center. From Table 2, it can be observed that the bus travel time in the south direction varies from 30 to 392 s and the travel time in the north direction varies from 58 to 343 s. The average bus travel time for both directions is almost the same, while the standard deviation (SD) of the north direction is less than that of the south direction. This difference arises because passenger demand in the south direction is more volatile than that in the north direction.

3.2 Performance indexes

The performance of the prediction model is evaluated with three metrics: mean absolute error (MAE), mean absolute percent error (MAPE) and root mean squared error (RMSE). The three metrics could measure the differences between the predicted and actual observed value in different aspects. They are defined as follows:

$${\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {t_{i,k} - \hat{t}_{i,k} } \right|}$$

(11)

$${\text{MAPE}}\,\,(\% ) = \frac{1}{n}\sum\limits_{i = 1}^{n} {\frac{{\left| {t_{i,k} - \hat{t}_{i,k} } \right|}}{{t_{i,k} }}} \times 100\%$$

(12)

$${\text{RMSE}} = \sqrt {\frac{1}{n - 1}\sum\limits_{i = 1}^{n} {\left( {t_{i,k} - \hat{t}_{i,k} } \right)^{2} } }$$

(13)

where $t_{i,k}$ is the observed travel time and n is the number of observations.
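
The three indexes in Eqs. (11)–(13) can be computed as below. This is a small sketch with hypothetical function names; note that the RMSE follows Eq. (13) as written, with $n - 1$ in the denominator rather than the more common $n$.

```python
import numpy as np

def mae(t, t_hat):
    """Mean absolute error, Eq. (11)."""
    return np.mean(np.abs(t - t_hat))

def mape(t, t_hat):
    """Mean absolute percent error in percent, Eq. (12)."""
    return np.mean(np.abs(t - t_hat) / t) * 100.0

def rmse(t, t_hat):
    """Root mean squared error with the n - 1 denominator of Eq. (13)."""
    return np.sqrt(np.sum((t - t_hat) ** 2) / (len(t) - 1))
```

For example, with observed times of 100, 200 and 400 s and predictions of 110, 180 and 400 s, the MAE is 10 s and the MAPE is about 6.67%.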

3.3 Model identification

In model identification, observations of the first four days, i.e., July 17–20, 2017, are set as the training set and the observations on July 21, 2017, are selected as the testing set. Firstly, sensitivity tests are conducted to select rational input variables. To reduce the effect of different units of measurement, data normalization is conducted and the normalization method is shown in Eq. (14).

$$x_{i}^{\text{normal}} = \frac{{x_{i} - x_{ \hbox{min} } }}{{x_{ \hbox{max} } - x_{ \hbox{min} } }}$$

(14)

where $x_{i}^{\text{normal}}$ is the normalized travel time, $x_{i}$ is the raw sample data and $x_{\hbox{min} }$ and $x_{\hbox{max} }$ are the minimum and maximum value of sample data, respectively.

In this paper, six models with different input variables are calibrated and listed in Table 3. Table 3 also provides the average MAEs of the prediction results for both directions. It can be seen that the sixth model performs best. This indicates that taking different aspects of traffic conditions into consideration can improve the accuracy of bus running time prediction. Thus, the variables $t_{i - 1,k} ,t_{i,k - 1} ,t_{i + 1,k - 1} ,dt_{i,k} ,\overline{dt}_{i} ,\sigma \left( {dt} \right)_{i}^{2}$ are all selected as the input variables of the DBN model in this paper.

Table 3 Prediction models with different input variables

Finding a DBN with good feature-learning capacity is difficult because several numerical parameters must be predetermined: the number of layers (T), the number of units ($m$ for visible units, $k$ for hidden units), the number of iterations (N) and the learning rate ($\varepsilon$). In this paper, the ranges of these parameters are set as $T \in [1,4]$, $m \in [150,200]$, $k \in [150,200]$, $N \in [100,500]$ and $\varepsilon \in [0.0005,0.005]$, respectively. Then, a tenfold cross-validation [3] and a grid search are used to identify the optimal parameter values. The resulting values of $\left( {T,m,k,N,\varepsilon } \right)$ are $\left( {3,160,180,340,0.001} \right)$. Finally, a three-layer BP neural network with six hidden units is used to predict bus travel time. Since the data have been normalized with Eq. (14), the final predicted bus travel time is recovered by the inverse transformation in Eq. (15).

$$x_{i}^{\text{inv}} = x_{i}^{\text{normal}} \times \left( {x_{\hbox{max} } - x_{\hbox{min} } } \right) + x_{\hbox{min} }$$

(15)
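
The normalization of Eq. (14) and its inverse in Eq. (15) can be sketched as follows. This is an illustration under assumptions: the helper names and the sample values are hypothetical, and the minimum and maximum are taken from the sample passed to `fit_min_max`.

```python
def fit_min_max(values):
    """Return the minimum and maximum of the sample data."""
    return min(values), max(values)

def normalize(values, x_min, x_max):
    """Min-max normalization to [0, 1], Eq. (14)."""
    span = x_max - x_min
    if span == 0:
        raise ValueError("x_max must differ from x_min")
    return [(x - x_min) / span for x in values]

def denormalize(values, x_min, x_max):
    """Inverse transformation back to the original unit, Eq. (15)."""
    span = x_max - x_min
    return [x * span + x_min for x in values]

# Hypothetical raw travel times in seconds (illustrative only).
raw = [95.0, 140.0, 210.0, 120.0]
x_min, x_max = fit_min_max(raw)
scaled = normalize(raw, x_min, x_max)
restored = denormalize(scaled, x_min, x_max)

print(scaled)
print(restored)
```

Applying `denormalize` to a model output with the same `x_min` and `x_max` that were used for the target variable returns the prediction in the original unit.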

3.4 Numerical results

In this section, to validate the proposed DBN model, four other prediction models, namely k-NN, ANN, SVM and RFs, are also employed. For a fair comparison, the same training and testing sets are used for all prediction models, and the input variables of the four models are the same as those of the DBN model. Based on sensitivity tests, the parameter k of the k-NN model is set to 3. In the ANN model, a standard three-layer ANN is used and the number of hidden units is set to 5. The SVM model has two hyperparameters, C and $\varepsilon$. By using a tenfold cross-validation [3] and a grid search, the two parameters are set to 2⁵ and 0.2, respectively. In addition, the radial basis function (RBF) kernel is used in the SVM model. RFs also have two hyperparameters: $m_{\text{try}}$ and $n_{\text{tree}}$. By using the same parameter selection method as for the SVM model, the two parameters in RFs are set to 4 and 1000, respectively. Finally, the results are summarized in Fig. 6 and the details are given in Table 4.

Table 4 The average results of the five different prediction models
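
The parameter selection procedure used for the DBN, SVM and RFs models (a grid search scored by tenfold cross-validation) can be sketched as below. This is a hedged illustration rather than the authors' implementation: the helper names, the toy data and the stand-in model (a one-dimensional k-NN regressor scored by MAE) are all assumptions, and the grid step sizes used in the paper are not stated.

```python
import itertools
import random

def k_fold_splits(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        val_idx = folds[f]
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train_idx, val_idx

def grid_search(param_grid, fit_and_score, n_samples, n_folds=10):
    """Return the parameter dict with the lowest mean validation error."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for combo in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        scores = [fit_and_score(params, train_idx, val_idx)
                  for train_idx, val_idx in k_fold_splits(n_samples, n_folds)]
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy data and stand-in model (assumptions for illustration only).
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]

def knn_mae(params, train_idx, val_idx):
    """Validation MAE of a 1-D k-NN regressor fitted on train_idx."""
    k = params["k"]
    total = 0.0
    for i in val_idx:
        nearest = sorted(train_idx, key=lambda j: abs(xs[j] - xs[i]))[:k]
        prediction = sum(ys[j] for j in nearest) / k
        total += abs(ys[i] - prediction)
    return total / len(val_idx)

best_params, best_score = grid_search({"k": [1, 2, 3, 5]}, knn_mae, len(xs))
print(best_params, best_score)
```

For a model with several parameters, `param_grid` would simply hold one list of candidate values per parameter; the cost grows with the product of the list lengths times the number of folds.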

Figure 6 shows the performance of all five prediction models for both directions of the selected bus route in terms of MAE, RMSE and MAPE. With respect to MAE, the DBN model outperforms the other four prediction models (k-NN, ANN, SVM and RFs, in that order) by 46.01%, 67.47%, 24.17% and 17.91% in the south direction and by 49.64%, 50.79%, 0% and 4.12% in the north direction, respectively. In terms of MAPE, the DBN model outperforms the other four models by 43.01%, 71.15%, 30.67% and 23.25% in the south direction and by 50.77%, 43.73%, 8.53% and 10.65% in the north direction, respectively. Regarding the RMSE, the DBN model outperforms the other four models by 41.72%, 56.03%, 14.73% and 11.39% in the south direction. However, in the north direction, the SVM model performs best. In summary, the DBN model has the best performance except for the RMSE in the north direction. This indicates that the DBN model is more effective than shallow learning architectures in extracting inherent features from massive data. In terms of MAE and MAPE, although the SVM and RFs models perform worse than the DBN model, they still perform better than the k-NN and BP network models. This is mainly because the structural risk minimization principle of SVM and the feature reduction principle of RFs can effectively avoid overfitting, which also illustrates why these two types of prediction model are widely used in forecasting. Moreover, although the k-NN model performs worse than the SVM, RFs and DBN models, it performs better than the BP network model, which is the worst of the five. Thus, considering its simple structure and ease of implementation, k-NN is still a satisfactory model for bus travel time prediction.
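
The paper does not state how the improvement percentages are computed. A common convention, assumed here, is the relative reduction of the error with respect to the baseline model; under that convention, the 0% entry corresponds to two models with equal error. The function name and the numbers below are hypothetical and are not taken from Table 4.

```python
def relative_improvement(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error

# Hypothetical MAE values in seconds (not taken from Table 4).
print(relative_improvement(20.0, 15.0))  # 25.0
print(relative_improvement(15.0, 15.0))  # 0.0
```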

As to computation time, all models were coded in MATLAB R2012a and executed on a PC equipped with 4 GB of RAM and a dual-core 3.2 GHz processor. The results are listed in Table 5. RFs require the least computation time among all bus travel time prediction models, while the k-NN model has the longest running time. Specifically, the computation time of k-NN is almost 20 times that of RFs, which can be attributed to the similarity measurement step of the k-NN model. Compared with the k-NN model, the shallow learning methods (the BP network and SVM models) have relatively shorter running times. Moreover, because of its multilayer structure, the DBN model takes more computation time to train. However, with the help of cloud and parallel computing, the DBN model can be easily applied to real-time bus travel time prediction. Considering the trade-off between accuracy and computation time, the DBN model is the best approach among the five bus travel time prediction models, and RFs is also a competitive method.

Table 5 Computation time of all prediction models (unit: s)

Since bus travel time is affected by time and space variables, there are significant differences in the bus travel time between peak hours and off-peak hours and between different driving directions. The reliability of predicted bus travel time would decrease in peak hours because at this time, traffic conditions are more complex and the number of passengers waiting to board is larger. Thus, in this paper, to further validate the proposed DBN model, the bus travel time predictions for both south and north directions in peak hours (i.e., 7 am–9 am and 5 pm–7 pm) on July 21, 2017, are also conducted. The prediction results of the five models in peak hours in terms of MAPE are shown in Fig. 7.

For all prediction models, it can be seen from Fig. 7 that the results of the north direction are better than those of the south direction. Traffic conditions in the south direction are always more complex, and the number of passengers waiting to board is larger, than in the north direction, since the south direction runs from the suburb to the city center while the north direction runs from the city center to the suburb. This indicates that prediction accuracy decreases as the traffic conditions become more complex. Nonetheless, compared with the other four prediction models, the performance of the DBN model in peak hours remains the best. In summary, the prediction results obtained by the DBN model are shown to be more reliable than those obtained by the other prediction models.

Finally, the detailed prediction results for both south and north directions in peak hours on July 21, 2017, are illustrated in Fig. 8. It can be seen that, in both directions, the maximum error of all segments is less than 12.19. Figure 8 also shows that the prediction errors of segments 9, 10 and 11 are the largest in the south direction, while in the north direction the largest errors occur on segments 5 and 7. In Table 2, the standard deviations of segments 9, 10 and 11 in the south direction and segments 5 and 7 in the north direction are 32.74, 44.51, 31.81, 13.65 and 12.95, respectively, which are clearly larger than those of the other segments in each direction. These results further verify that prediction accuracy decreases as the uncertainty of traffic conditions increases.

4 Conclusion

This paper proposed a bus travel time prediction model based on the concept of the deep belief network. Since the input variables of the proposed model are continuous data, the basic units of the DBN were improved by introducing Gaussian–Bernoulli RBMs. In addition, a BP neural network algorithm was used to predict bus travel time in a supervised fashion. To validate the proposed model, real-world data from bus route No. 232 in Shenyang, China, were collected. Four other models, namely k-NN, ANN, SVM and RFs, were also introduced. Results showed that the performance of the DBN model was the best among all five travel time prediction models. The performance of the proposed model was also quite good in peak hours. The maximum errors in such a case, in terms of MAE, were only 12.19 in the south direction and 10.22 in the north direction, respectively.

In this paper, only the bus data from a single route are considered as the input variables. Further study should consider more factors that might affect prediction accuracy, such as weather conditions, the running time of other bus routes or vehicles on the same segment, and the environment of signalized intersections to enhance the performance of the proposed bus travel time prediction model. In addition, with the development of computer technology, multiple types of deep learning models have been proposed. In the future, the comparison among different deep learning models will be conducted to further prove the validity of the proposed method.

References

1. Petersen NC, Rodrigues F, Pereira FC (2019) Multi-output bus travel time prediction with convolutional LSTM neural network. Expert Syst Appl 120:426–435
2. Yu B, Lam WH, Tam ML (2011) Bus arrival time prediction at bus stop with multiple routes. Transp Res C-Emerg 19(6):1157–1170
3. Yu B, Wang HZ, Shan WX, Yao BZ (2018) Prediction of bus travel time using random forests based on near neighbors. Comput-Aided Civ Inf 33(4):333–350
4. Williams BM, Hoel LA (2003) Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: theoretical basis and empirical results. J Transp Eng 129(6):664–672
5. Jeong RH (2005) The prediction of bus arrival time using automatic vehicle location systems data. Ph.D. dissertation, Department of Civil Engineering, Texas A&M University, Texas, USA
6. Chung EH, Shalaby A (2007) Expected time of arrival model for school bus transit using real-time global positioning system-based automatic vehicle location data. Int J Intell Transp Syst Res 11(4):157–167
7. Yang M, Liu Y, You Z (2010) The reliability of travel time forecasting. IEEE Trans Intell Transp Syst 11(1):162–171
8. Thomas T, Weijermars W, Berkum EV (2010) Predictions of urban volumes in single time series. IEEE Trans Intell Transp Syst 11(1):71–80
9. Rice J, Van Zwet E (2004) A simple and effective method for predicting travel times on freeways. IEEE Trans Intell Transp Syst 5(3):200–207
10. Kwon J, Coifman B, Bickel P (2000) Day-to-day travel-time trends and travel-time prediction from loop-detector data. Transp Res Rec 1717:120–129
11. Kwon J, Petty K (2005) Travel time prediction algorithm scalable to freeway networks with many nodes with arbitrary travel routes. Transp Res Rec 1935:147–153
12. You J, Kim TJ (2000) Development and evaluation of a hybrid travel time forecasting model. Transp Res C-Emerg 8(1):231–256
13. Smith BL, Williams BM, Oswald RK (2002) Comparison of parametric and nonparametric models for traffic flow forecasting. Transp Res C-Emerg 10(4):303–321
14. Chang H, Park D, Lee S, Lee H, Baek S (2010) Dynamic multi-interval bus travel time prediction using bus transit data. Transportmetr A 6(1):19–38
15. Adeli H (2001) Neural networks in civil engineering: 1989–2000. Comput-Aided Civ Inf 16(2):126–142
16. Chien SIJ, Ding Y, Wei C (2002) Dynamic bus arrival time prediction with artificial neural networks. J Transp Eng 128(5):429–438
17. Mazloumi E, Currie G, Rose G (2010) Using traffic flow data to predict bus travel time variability through an enhanced artificial neural network. In: Presented at the 12th world congress on transport research. Lisbon, Portugal
18. Yao BZ, Chen C, Zhang L, Yu B, Wang YP (2019) Allocation method for transit lines considering the User Equilibrium for operators. Transp Res C-Emerg 105:666–682
19. Yu B, Ye T, Tian XM, Ning GB, Zhong SQ (2012) Bus travel-time prediction with a forgetting factor. J Comput Civ Eng 28(3):06014002
20. Gal A, Mandelbaum A, Schnitzler F, Senderovich A, Weidlich M (2017) Traveling time prediction in scheduled transportation with journey segments. Inf Syst 64:266–280
21. Yu B, Song XL, Guan F, Yang ZM, Yao BZ (2016) k-Nearest neighbor model for multiple-time-step prediction of short-term traffic condition. J Transp Eng-ASCE 142(6):04016018
22. Yao BZ, Chen C, Cao QD, Jin L, Zhang MH, Zhu HB, Yu B (2017) Short-term traffic speed prediction for an urban corridor. Comput-Aided Civ Inf 32(2):154–169
23. Reddy KK, Kumar BA, Vanajakshi L (2016) Bus travel time prediction under high variability conditions. Curr Sci 111(4):700
24. Wang WS, Liu JM, Yao BZ, Jiang YL, Wang YP, Yu B (2019) A data-driven hybrid control framework to improve transit performance. Transp Res C-Emerg 107:387–410
25. Chen M, Liu X, Xia J, Chien SI (2004) A dynamic bus arrival time prediction model based on APC data. Comput-Aided Civ Inf 19(5):364–376
26. Yu B, Yang ZZ, Chen K, Yu B (2010) Hybrid model for prediction of bus arrival times at next station. J Adv Transp 44(3):193–204
27. Shalaby A, Farhan A (2003) Bus travel time prediction model for dynamic operations control and passenger information systems. In: Presented at the 82nd TRB annual meeting, Washington D.C., USA
28. Vanajakshi L, Rilett LR (2007) Support vector machine technique for the short term prediction of travel time. In: Proceedings of intelligent vehicles symposium. Istanbul, Turkey, pp 600–605
29. Billings D, Yang JS (2006) Travel time prediction using a seasonal autoregressive integrated moving average time series model. In: International conference on systems, man, and cybernetics. Taipei, Taiwan, pp 2529–2534
30. Guin A (2006) Application of the ARIMA models to urban roadway travel time prediction-a case study. In: Intelligent transportation systems conference. Toronto, Ontario, Canada, pp 494–498
31. Kumar P, Sehgal V, Chauhan DS (2011) Performance evaluation of decision tree versus artificial neural network based classifiers in diversity of datasets. In: World congress on information and communication technologies (WICT). Mumbai, India, pp 798–803
32. Kumar P, Sehgal VK, Chauhan DS (2012) A benchmark to select data mining based classification algorithms for business intelligence and decision support systems. Int J Data Min Knowl Manag Process (IJDKP) 2(5):25–42
33. Weigend A (1993) On overfitting and the effective number of hidden units. Department of Computer Science, University of Colorado, Boulder, Colorado, USA, CU-CS-674-93
34. Li DQ, Fu BW, Wang YP, Lu GQ, Berezin Y, Stanley HE, Havlin S (2015) Percolation transition in dynamical traffic network with evolving critical bottlenecks. Proc Natl Acad Sci USA 112(3):669–672
35. Tang TQ, Shi YF, Wang YP, Yu GZ (2012) A bus-following model with an on-line bus station. Nonlinear Dyn 70(1):209–215
36. Sun D, Ni X, Zhang L (2016) A discriminated release strategy for parking variable message sign display problem using agent-based simulation. IEEE Trans Intell Transp Syst 17(1):38–47
37. Lv Y, Duan Y, Kang W, Li Z, Wang FY (2015) Traffic flow prediction with big data: a deep learning approach. IEEE Trans Intell Transp Syst 16(2):865–873
38. Siripanpornchana C, Panichpapiboon S, Chaovalit P (2016) Travel-time prediction with deep learning. In: Region 10 conference (TENCON). Singapore, pp 1859–1862
39. Li LC, Qu X, Zhang J, Wang YG, Ran B (2019) Traffic speed prediction for intelligent transportation system based on a deep feature fusion model. J Intell Transp Syst. https://doi.org/10.1080/15472450.2019.1583965
40. Ran X, Shan Z, Fang Y, Lin C (2019) An LSTM-based method with attention mechanism for travel time prediction. Sensors 19(4):861
41. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
42. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
43. Huang W, Song G, Hong H, Xie K (2014) Deep architecture for traffic flow prediction: deep belief networks with multitask learning. IEEE Trans Intell Transp Syst 15(5):2191–2201
44. Koesdwiady A, Soua R, Karray F (2016) Improving traffic flow prediction with weather information in connected cars: a deep learning approach. IEEE Trans Veh Technol 65(12):9508–9517
45. Soua R, Koesdwiady A, Karray F (2016) Big-data-generated traffic flow prediction using deep learning and Dempster-Shafer theory. In: International joint conference on neural networks. Vancouver, BC, Canada, pp 3195–3202
46. Hrasko R, Pacheco AG, Krohling RA (2015) Time series prediction using restricted Boltzmann machines and backpropagation. Proc Comput Sci 55:990–999
47. Hinton GE (2012) A practical guide to training restricted Boltzmann machines. In: Neural networks: tricks of the trade. Springer, Berlin, Heidelberg, pp 599–619
48. Kumar A, Johari S, Proch D, Kumar P, Chauhan DS (2018) A tree based approach for data pre-processing and pattern matching for accident mapping on road networks. Proc Natl Acad Sci India Sect A Phys Sci 89(3):453–466

Acknowledgements

This work was supported by the National Natural Science Foundation of China (U1811463 and 51578112) and the State Key Laboratory of Structural Analysis for Industrial Equipment (S18307). The authors also gratefully acknowledge financial support from the China Scholarship Council.

Author information

Authors and Affiliations

State Key Laboratory of Structural Analysis for Industrial Equipment, School of Automotive Engineering, Dalian University of Technology, Dalian, 116024, People’s Republic of China
Chao Chen, Hui Wang, Fang Yuan, Huizhong Jia & Baozhen Yao
Urban Planning Group, Department of the Built Environment, Eindhoven University of Technology, PO Box 513, Vertigo 8.16, 5600MD, Eindhoven, The Netherlands
Chao Chen

Corresponding author

Correspondence to Baozhen Yao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Classical RBM

An RBM is a generative energy-based model that can learn a probability distribution over a set of inputs. A classical RBM has binary-valued hidden and visible units. The energy of a joint configuration $\left( {v,h} \right)$ of the visible and hidden units is given by:

$$E\left( {v,h} \right) = - \sum\limits_{i = 1}^{m} {a_{i} v_{i} } - \sum\limits_{j = 1}^{k} {b_{j} h_{j} } - \sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{k} {v_{i} h_{j} w_{ij} } }$$

(16)

where $v_{i}$ and $h_{j}$ are the binary states of visible unit i and hidden unit j, $a_{i}$ and $b_{j}$ are their biases and $w_{ij}$ is the weight. Then, the probability that is assigned to every possible pair of a visible and a hidden vector is calculated via the energy function:

$$p\left( {v,h} \right) = \frac{{{\text{e}}^{{ - E\left( {v,h} \right)}} }}{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}$$

(17)

Then, the probability of a particular visible state configuration $v$ is derived by summing over all possible hidden vectors:

$$p\left( v \right) = \sum\limits_{h} {p\left( {v,h} \right) = \frac{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}}$$

(18)

Similarly, the formula of $p\left( h \right)$ is entirely analogous to that of $p\left( v \right)$:

$$p\left( h \right) = \sum\limits_{v} {p\left( {v,h} \right) = \frac{{\sum\nolimits_{v} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}}$$

(19)

Some other conditional expressions can also be derived as follows:

$$p\left( {v\left| h \right.} \right) = \frac{{p\left( {v,h} \right)}}{p\left( h \right)} = \frac{{{\text{e}}^{{ - E\left( {v,h} \right)}} }}{{\sum\nolimits_{v} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}$$

(20)

$$p\left( {h\left| v \right.} \right) = \frac{{p\left( {v,h} \right)}}{p\left( v \right)} = \frac{{{\text{e}}^{{ - E\left( {v,h} \right)}} }}{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}$$

(21)

Thus, the probability that a particular visible unit is on, given a hidden vector, is:

$$p\left( {v_{i} = 1\left| h \right.} \right) = \frac{{p\left( {v_{i} = 1,h} \right)}}{p\left( h \right)} = \frac{1}{{1 + {\text{e}}^{{ - \left( {a_{i} + \sum\nolimits_{j = 1}^{k} {h_{j} } w_{ij} } \right)}} }}$$

(22)

Similarly, for a randomly selected training input $v$, the binary state $h_{j}$ of each hidden unit j is set to 1 with probability:

$$p\left( {h_{j} = 1\left| v \right.} \right) = \frac{{p\left( {h_{j} = 1,v} \right)}}{p\left( v \right)} = \frac{1}{{1 + {\text{e}}^{{ - \left( {b_{j} + \sum\nolimits_{i = 1}^{m} {v_{i} } w_{ij} } \right)}} }}$$

(23)

Given $\sigma \left( x \right) = \frac{1}{{1 + {\text{e}}^{ - x} }}$, formulas (22) and (23) can be rewritten as follows:

$$p\left( {v_{i} = 1\left| h \right.} \right) = \sigma \left( {a_{i} + \sum\limits_{j = 1}^{k} {h_{j} } w_{ij} } \right)$$

(24)

$$p\left( {h_{j} = 1\left| v \right.} \right) = \sigma \left( {b_{j} + \sum\limits_{i = 1}^{m} {v_{i} } w_{ij} } \right)$$

(25)
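
As a small numerical check, the closed form of Eq. (25) can be compared with the conditional probability obtained directly from the energy function by enumerating all hidden vectors, as in Eq. (21). The parameter values and function names below are arbitrary assumptions chosen for illustration; enumeration is only feasible for very small k.

```python
import itertools
import math

def energy(v, h, a, b, w):
    """Energy of a joint binary configuration, Eq. (16)."""
    e = -sum(a[i] * v[i] for i in range(len(v)))
    e -= sum(b[j] * h[j] for j in range(len(h)))
    e -= sum(v[i] * h[j] * w[i][j] for i in range(len(v)) for j in range(len(h)))
    return e

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_hidden_on_by_enumeration(j, v, a, b, w):
    """p(h_j = 1 | v) from Eq. (21), summing e^{-E} over all hidden vectors."""
    k = len(b)
    numerator = denominator = 0.0
    for h in itertools.product([0, 1], repeat=k):
        weight = math.exp(-energy(v, h, a, b, w))
        denominator += weight
        if h[j] == 1:
            numerator += weight
    return numerator / denominator

# Arbitrary small parameters (m = 3 visible units, k = 2 hidden units).
a = [0.1, -0.2, 0.3]
b = [0.05, -0.4]
w = [[0.5, -0.3], [0.2, 0.7], [-0.6, 0.1]]
v = (1, 0, 1)

for j in range(len(b)):
    closed_form = sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
    brute_force = p_hidden_on_by_enumeration(j, v, a, b, w)
    print(j, closed_form, brute_force)
```

The two columns agree because the hidden units are conditionally independent given $v$, so the sum over the other hidden units cancels between numerator and denominator.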

Given a set of $C$ training cases $\left\{ {v^{c} \left| {c \in \left\{ {1, \ldots ,C} \right\}} \right.} \right\}$, the goal is to maximize the average log probability of the set under the model’s distribution:

$$\sum\limits_{c = 1}^{C} {\log p\left( {v^{c} } \right)} = \sum\limits_{c = 1}^{C} {\log \frac{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} } }}{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}}$$

(26)

Then, the gradient or the derivative of the log probability of the training vector with respect to a weight $w_{ij}$ has the following form:

$$\frac{\partial }{{\partial w_{ij} }}\sum\limits_{c = 1}^{C} {\log p\left( {v^{c} } \right)} = \frac{\partial }{{\partial w_{ij} }}\left( {\sum\limits_{c = 1}^{C} {\log \sum\limits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} } - \log \sum\limits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } } } \right)$$

(27)

Since $\partial E\left( {v,h} \right)/\partial w_{ij} = - v_{i} h_{j}$ by Eq. (16), the first term of formula (27) can be written as:

$$\frac{\partial }{{\partial w_{ij} }}\sum\limits_{c = 1}^{C} {\log \sum\limits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} } } = \sum\limits_{c = 1}^{C} {\frac{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} v_{i}^{c} h_{j} } }}{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} } }}}$$

(28)

Notice that the term $\frac{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} v_{i}^{c} h_{j} } }}{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} } }}$ is just the expected value of $v_{i}^{c} h_{j}$ given that $v$ is clamped to the data vector $v^{c}$. This is easy to compute since we know $v_{i}^{c}$ and we can compute the expected value of $h_{j}$ using formula (25).

The second term of formula (27) can be written in the same way:

$$\frac{\partial }{{\partial w_{ij} }}\sum\limits_{c = 1}^{C} {\log \sum\limits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } } = \sum\limits_{c = 1}^{C} {\frac{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} v_{i} h_{j} } }}{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}}$$

(29)

Here, the term $\frac{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} v_{i} h_{j} } }}{{\sum\nolimits_{v,h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }}$ is the expected value of $v_{i} h_{j}$ under the model’s distribution. This expectation can be approximated well in finite time by the contrastive divergence (CD) algorithm.

By using $\left\langle . \right\rangle_{d}$ and $\left\langle . \right\rangle_{m}$ to represent expected values under the training data and under the model, respectively, formula (27) can be rewritten for a single training vector as:

$$\frac{\partial }{{\partial w_{ij} }}\log p\left( v \right) = \left\langle {v_{i} h_{j} } \right\rangle_{d} - \left\langle {v_{i} h_{j} } \right\rangle_{m}$$

(30)

Thus, the update rule for weight $w_{ij}$ is shown as follows:

$$\Delta w_{ij} = \varepsilon \left( {\left\langle {v_{i} h_{j} } \right\rangle_{d} - \left\langle {v_{i} h_{j} } \right\rangle_{m} } \right)$$

(31)

where $\varepsilon$ is the learning rate.

The update rules for the biases are similarly derived to be:

$$\Delta a_{i} = \varepsilon \left( {\left\langle {v_{i} } \right\rangle_{d} - \left\langle {v_{i} } \right\rangle_{m} } \right)$$

(32)

$$\Delta b_{j} = \varepsilon \left( {\left\langle {h_{j} } \right\rangle_{d} - \left\langle {h_{j} } \right\rangle_{m} } \right)$$

(33)
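
The update rules of Eqs. (31)–(33) can be sketched with one step of contrastive divergence (CD-1), in which the model expectation is approximated by a single Gibbs step started at the data. This is a minimal illustration under assumptions, not the authors' MATLAB implementation: the function names, the toy binary data, the layer sizes and the learning rate are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, a, b, w, lr, rng):
    """One CD-1 step on a batch v0 of shape (batch, m); returns updated (a, b, w)."""
    # Positive phase: hidden probabilities and samples given the data, Eq. (25).
    ph0 = sigmoid(b + v0 @ w)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step, Eq. (24) followed by Eq. (25).
    pv1 = sigmoid(a + h0 @ w.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ w)
    batch = v0.shape[0]
    # <.>_d is estimated from the data, <.>_m from the one-step reconstruction.
    dw = (v0.T @ ph0 - v1.T @ ph1) / batch
    da = (v0 - v1).mean(axis=0)
    db = (ph0 - ph1).mean(axis=0)
    return a + lr * da, b + lr * db, w + lr * dw

rng = np.random.default_rng(0)
m, k = 6, 4  # toy layer sizes for illustration
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, size=200)]

a, b = np.zeros(m), np.zeros(k)
w = 0.01 * rng.standard_normal((m, k))
w_initial = w.copy()
for epoch in range(50):
    a, b, w = cd1_update(data, a, b, w, lr=0.1, rng=rng)

print(np.abs(w - w_initial).max())
```

Using the hidden probabilities rather than sampled states in the statistics `dw` and `db` is a common variance-reduction choice; sampling `h0` for the reconstruction step keeps the Gibbs chain stochastic.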

1.2 Gaussian–Bernoulli RBM

The classical RBM uses binary logistic units for both visible and hidden units. Because the traffic data in this paper are continuous, continuous-valued inputs are used as described in Refs. [42, 47]. To model continuous data, the binary visible units of the RBM are replaced by linear units with Gaussian noise, and the energy function of the Gaussian–Bernoulli RBM (GBRBM) becomes:

$$E\left( {v,h} \right) = \sum\limits_{i = 1}^{m} {\frac{{\left( {v_{i} - a_{i} } \right)^{2} }}{{2\sigma_{i}^{2} }}} - \sum\limits_{j = 1}^{k} {b_{j} h_{j} } - \sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{k} {\frac{{v_{i} }}{{\sigma_{i} }}} } W_{ij} h_{j}$$

(34)

where $\sigma_{i}$ is the standard deviation of the Gaussian noise for visible unit i.

Given the energy function (34), the distribution $p\left( {v\left| h \right.} \right)$ can be derived as follows:

$$\begin{aligned} p\left( {v\left| h \right.} \right) = \frac{{{\text{e}}^{{ - E\left( {v,h} \right)}} }}{{\int_{v} {{\text{e}}^{{ - E\left( {v,h} \right)}} {\text{d}}v} }} & = \frac{{{\text{e}}^{{ - \sum\nolimits_{i = 1}^{m} {\frac{{\left( {v_{i} - a_{i} } \right)^{2} }}{{2\sigma_{i}^{2} }}} + \sum\nolimits_{j = 1}^{k} {b_{j} h_{j} } + \sum\nolimits_{i = 1}^{m} {\sum\nolimits_{j = 1}^{k} {\frac{{v_{i} }}{{\sigma_{i} }}W_{ij} h_{j} } } }} }}{{\int_{v} {{\text{e}}^{{ - \sum\nolimits_{i = 1}^{m} {\frac{{\left( {v_{i} - a_{i} } \right)^{2} }}{{2\sigma_{i}^{2} }}} + \sum\nolimits_{j = 1}^{k} {b_{j} h_{j} } + \sum\nolimits_{i = 1}^{m} {\sum\nolimits_{j = 1}^{k} {\frac{{v_{i} }}{{\sigma_{i} }}W_{ij} h_{j} } } }} {\text{d}}v} }} \\ & = \prod\limits_{i = 1}^{m} {\frac{1}{{\sigma_{i} \sqrt {2\pi } }} \cdot {\text{e}}^{{ - \frac{1}{{2\sigma_{i}^{2} }}\left( {v_{i} - a_{i} - \sigma_{i} \sum\nolimits_{j = 1}^{k} {W_{ij} h_{j} } } \right)^{2} }} } \\ \end{aligned}$$

(35)

Thus, $p\left( {h_{k} = 1\left| v \right.} \right)$ for a single hidden unit (indexed here by $k$) is computed as follows:

$$\begin{aligned} p\left( {h_{k} = 1\left| v \right.} \right) & = \frac{{\sum\nolimits_{{h_{j \ne k} }} {p\left( {v,h_{k} = 1,h_{j \ne k} } \right)} }}{p\left( v \right)} \\ & = \frac{{\sum\nolimits_{{h_{j \ne k} }} {{\text{e}}^{{\left( {\sum\nolimits_{i = 1}^{m} {\frac{{v_{i} }}{{\sigma_{i} }}W_{ik} } + b_{k} } \right) + \left( {\sum\nolimits_{i = 1}^{m} {\sum\nolimits_{j \ne k} {\frac{{v_{i} }}{{\sigma_{i} }}W_{ij} h_{j} } } - \sum\nolimits_{i = 1}^{m} {\frac{{\left( {v_{i} - a_{i} } \right)^{2} }}{{2\sigma_{i}^{2} }}} + \sum\nolimits_{j \ne k} {h_{j} b_{j} } } \right)}} } }}{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } }} \\ & = \frac{1}{{1 + {\text{e}}^{{ - \left( {\sum\nolimits_{i = 1}^{m} {\frac{{v_{i} }}{{\sigma_{i} }}W_{ik} } + b_{k} } \right)}} }} \\ \end{aligned}$$

(36)

Note that Eq. (36) has the same form as in the classical RBM, except that $v_{i}$ is scaled by the reciprocal of its standard deviation $\sigma_{i}$.
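
The two conditional distributions of the GBRBM can be sketched as follows: the hidden units remain logistic with inputs scaled by $1/\sigma_{i}$, as in Eq. (36), while each visible unit is Gaussian with mean $a_{i} + \sigma_{i} \sum_{j} W_{ij} h_{j}$ and standard deviation $\sigma_{i}$, as in the last line of Eq. (35). The function names and parameter values are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_prob(v, b, W, sigma):
    """p(h_j = 1 | v) as in Eq. (36): visible values are scaled by 1/sigma_i."""
    return sigmoid(b + (v / sigma) @ W)

def sample_visible(h, a, W, sigma, rng):
    """Draw v ~ p(v | h) as in Eq. (35): Gaussian with mean a + sigma * (W h)."""
    mean = a + sigma * (h @ W.T)
    return mean + sigma * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
a = np.array([0.2, -0.1, 0.4])      # visible biases (m = 3)
b = np.array([0.0, 0.3])            # hidden biases (k = 2)
W = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]])
sigma = np.array([0.5, 1.0, 1.5])   # per-unit noise standard deviations

# Empirical mean of many samples for a fixed hidden vector vs. the analytic mean.
h = np.array([1.0, 0.0])
samples = sample_visible(np.tile(h, (50000, 1)), a, W, sigma, rng)
expected_mean = a + sigma * (W @ h)
print(samples.mean(axis=0), expected_mean)

v = np.array([0.3, 0.1, 0.9])
print(hidden_prob(v, b, W, sigma))
```

If the inputs are standardized so that every $\sigma_{i}$ can be fixed to 1, both conditionals reduce to the unscaled forms, which is a common simplification in practice.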

The training procedure for a GBRBM is identical to that of an RBM. As in that case, we take the derivative shown in formula (27). We find that

$$\begin{aligned} \frac{\partial }{{\partial w_{ij} }}\sum\limits_{c = 1}^{C} {\log \sum\limits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} } } & = - \sum\limits_{c = 1}^{C} {\frac{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} \frac{{\partial E\left( {v^{c} ,h} \right)}}{{\partial w_{ij} }}} }}{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} } }}} \\ & = \frac{1}{{\sigma_{i} }}\sum\limits_{c = 1}^{C} {\frac{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} v_{i}^{c} h_{j} } }}{{\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v^{c} ,h} \right)}} } }}} \\ \end{aligned}$$

(37)

Similarly,

$$\frac{\partial }{{\partial w_{ij} }}\sum\limits_{c = 1}^{C} {\log \sum\limits_{v} {\sum\limits_{h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } } } = \frac{1}{{\sigma_{i} }}\sum\limits_{c = 1}^{C} {\frac{{\sum\nolimits_{v} {\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v,h} \right)}} v_{i} h_{j} } } }}{{\sum\nolimits_{v} {\sum\nolimits_{h} {{\text{e}}^{{ - E\left( {v,h} \right)}} } } }}}$$

(38)

which we estimate, as before, using the CD algorithm.
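
The data-dependent gradient term can be checked numerically for a tiny GBRBM. The sketch below assumes the energy convention in which the quadratic visible term is positive, so that $p\left( {v\left| h \right.} \right)$ is a normalizable Gaussian; under that assumption, the derivative of $\log \sum_{h} {\text{e}}^{ - E\left( {v,h} \right)}$ with respect to the weight linking visible unit i and hidden unit j equals $\left( {v_{i} /\sigma_{i} } \right)p\left( {h_{j} = 1\left| v \right.} \right)$. All names and values are hypothetical.

```python
import itertools
import math

def gb_energy(v, h, a, b, W, sigma):
    """GBRBM energy with a positive quadratic visible term (assumed convention)."""
    m, k = len(v), len(h)
    quad = sum((v[i] - a[i]) ** 2 / (2.0 * sigma[i] ** 2) for i in range(m))
    hid = sum(b[j] * h[j] for j in range(k))
    inter = sum(v[i] / sigma[i] * W[i][j] * h[j] for i in range(m) for j in range(k))
    return quad - hid - inter

def log_unnormalized(v, a, b, W, sigma):
    """log sum_h exp(-E(v, h)), the data-dependent part of the log-likelihood."""
    k = len(b)
    return math.log(sum(math.exp(-gb_energy(v, h, a, b, W, sigma))
                        for h in itertools.product([0, 1], repeat=k)))

def analytic_gradient(i, j, v, b, W, sigma):
    """(v_i / sigma_i) * p(h_j = 1 | v): the expectation of v_i h_j / sigma_i given v."""
    act = b[j] + sum(v[r] / sigma[r] * W[r][j] for r in range(len(v)))
    return v[i] / sigma[i] / (1.0 + math.exp(-act))

# Arbitrary small parameters (m = 2 visible units, k = 3 hidden units).
a = [0.2, -0.1]
b = [0.1, -0.3, 0.25]
W = [[0.4, -0.5, 0.2], [0.3, 0.1, -0.6]]
sigma = [0.8, 1.3]
v = [0.7, -1.2]

def numeric_gradient(i, j, eps=1e-6):
    """Central finite difference of log_unnormalized with respect to W[i][j]."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    return (log_unnormalized(v, a, b, W_plus, sigma)
            - log_unnormalized(v, a, b, W_minus, sigma)) / (2.0 * eps)

for i in range(len(v)):
    for j in range(len(b)):
        print(i, j, analytic_gradient(i, j, v, b, W, sigma), numeric_gradient(i, j))
```

In a CD-based training loop, the same expression evaluated on the data gives the positive statistics, and its evaluation on the reconstruction approximates the model term.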

About this article

Cite this article

Chen, C., Wang, H., Yuan, F. et al. Bus travel time prediction based on deep belief network with back-propagation. Neural Comput & Applic 32, 10435–10449 (2020). https://doi.org/10.1007/s00521-019-04579-x

Received: 30 November 2018
Accepted: 19 October 2019
Published: 02 November 2019
Issue Date: July 2020
DOI: https://doi.org/10.1007/s00521-019-04579-x
