1 Introduction

Trends over the past 10 years demonstrate that urban traffic congestion is increasing. Public transport, especially the bus transport, has been recognized as a valid method to alleviate traffic congestion. Public transport also can reduce the use of private car and fuel consumption. In order to attract more people to shift from personal vehicles to public transport, service quality should be improved. Nowadays most public transportation systems provide their timetables on the web or on the board of bus stops. Furthermore, with the development of smart phones, more and more apps are used for bus schedules. However, the provided information is always limited and not timely updated. Waiting a long time at bus stops may drive passengers away and make them reluctant to take buses. Meanwhile, accurate arrival time of the next bus will allow passengers to take alternative transport choices instead and improve their satisfaction. So providing accurate information about bus arrival to passengers is of great importance.

Bus arrivals at stops in road networks are stochastic because travel time on roads, dwell time at stops, and delays at intersections fluctuate spatially and temporally. And many unpredictable factors such as traffic conditions or harsh weather may cause buses to delay. Thus, developing such a model that can adapt to time varying traffic and demand conditions is a challenging task.

In this paper, we present a novel bus arrival time prediction approach based on real-time and real-world historical GPS trajectories of buses. In our approach, bus trajectories are firstly aggregated and mined in the cloud to discover some knowledge. Then bus arrival time is predicted based on the knowledge mined from the historical bus trajectories and real-time information. As the bus trajectories are constantly updated in the cloud, the prediction will be smarter and smarter.

The main contributions of the current work are summarized as follows.

  • We propose a graph, which models the knowledge of routes based on the bus trajectories and reduces the online computation of arrival time prediction. We name the graph as path-section graph. This graph has not been used in the works of bus arrival time prediction.

  • We present a clustering algorithm to learn the time-variant distributions of the travel times of path sections between any two bus stops. The algorithm includes two phases. The travel times of a path-section edge are clustered into several categories in the first phase. The proper time partition for each path-section edge is obtained in the second phase.

  • We build a bus arrival time prediction system by using real-world trajectory dataset generated by 1000 buses in a period of 2 months. Experiments show that our method can predict more accurate arrival time with less online computation than competing methods.

The rest of the paper is organized as follows: the second part describes related work of the problem. In the third part, an overview of our approach is presented. Three major components of our approach: bus trajectory processing, path-section graph construction, and arrival time predicting are respectively introduced from the fourth section to the sixth section. Results obtained from experiments on real-world data of Hangzhou are reported in the seventh section. Concluding remarks are drawn in the last section.

2 Related work

Bus arrival time prediction has recently attracted interest of many researchers and practitioners in the area. The approaches for the prediction of bus arrival time can be mainly divided into the following categories: artificial neural network (ANN), Kalman filtering, support vector machines (SVM), regression analysis models and time series models.

ANN models [1,2,3,4,5] are the most commonly used approaches for arrival time prediction. These models are constructed with multiple layers of processing units named artificial neurons, emulating human intelligent data processing abilities. Using the learning process, they can automatically adjust the synaptic weights to map the input-output relationship. Public transport system is a very complex nonlinear system, and neural network model has the ability to deal with complex nonlinear systems, so many researchers use neural network model to predict the bus arrival time. In these works, it is considered that the current bus arrival time has a certain connection with that of buses in front of it. Meanwhile, bus arrival time is affected by road, environment and other factors.

The Kalman filter [6,7,8] is a model-based estimation technology that has the ability to accommodate traffic fluctuations with their time-dependent parameters adequately. It takes into account the stochastic nature of process disturbances and measurement noise. In the previous studies applying the Kalman filter to bus arrival time predictions, AVL, APC or both are employed. The algorithm needs real-time feeds when data is updated. At the same time, data fluctuations might cause difficulties in solving the time lag.

Support vector machine [9,10,11,12] is a specific type of learning algorithm developed on the basis of statistical learning theory, which is characterized by the capacity control of decision functions, the use of kernel functions and the sparsity of solutions. It has been proposed in time series forecasting. By integrating multiple sources of information, previous works proved that support vector machine can accurately predict bus arrival times. Although SVM also depends on the similarity between historic and real-time traffic patterns, it still shows many breakthroughs and plausible performance in forecasting of bus arrival time.

Regression analysis models [13] determine a dependent variable with a complex mathematical function formed by a set of independent variables. In general, the applicability of the regression models is limited because variables in transportation systems are highly inter-correlated. At the same time, their performance is undermined by long computational time because of their reliance on large amounts of historical data.

Time series models [14, 15] are also used for predicting bus arrival time. It is assumed that all external factors of the system either are accounted for or are constant. Variation in the historical data patterns or a dissimilarity in real time data and historical data can lead to results being wrong and erroneous.

There are other methods: multiple data sources [16,17,18], metamodel approach [19], real-time data [20], agent-based approach [21], hybrid method [22,23,24,25,26], dynamic route [27]. Among these methods, some require special conditions and more processing time. Some fully depend on researchers’ experience or knowledge to preprocess data in order to select control parameters. And the accuracy of prediction also needs to be improved.

3 Overview

As shown in Fig. 1, the architecture of our bus arrival time prediction system based on a hybrid model consists of three major components: Bus Trajectory Processing, Path-section Graph Construction, and Arrival Time Predicting. Bus Trajectory Processing and Path-section Graph Construction operate offline and Time Predicting is running online. The offline parts only need to be performed once unless the trajectory archive is updated.

Fig. 1
figure 1

System overview

3.1 Bus trajectory processing

In practice, a GPS log may record a bus’s trajectory for several days, in which the buses always run the regular route. Therefore, we partition a GPS log into some bus trajectories representing individual trips according to the longitude and latitude coordinates of origin stops and terminal stops. Then we employ a map matching algorithm to map each GPS point of a trip to the corresponding road segment. As a result, a bus trajectory is transformed into a sequence of road segments.

3.2 Path-section graph construction

Because traffic patterns of weekday are quite different from the patterns of weekend. We separate weekday trajectories from the weekend ones, and build a path-section graph for weekdays and weekends respectively. Then, we estimate the distribution of travel time of each path-section edge by using a clustering algorithm. In general, a path-section is the path segment between two adjacent bus stops. After the construction phase, a time-dependent path-section graph is available for online computation.

3.3 Arrival time predicting

In this phase, we present a method to compute the predicting arrival time of the bus at the queried stop. Given the bus ID, stop ID and the query time, this method computes the arrival time by using the path-section graph and real-time data. The whole travel time consists of four parts and will be computed section by section.

4 Bus trajectory processing

In this phase, bus GPS trajectories are segmented into trips, and each trip is matched against the road network. Thus a GPS log is partitioned into some bus trajectories representing individual trips. The partition is executed according to the longitude and latitude coordinates of original stations and terminal stations. Then a map matching algorithm is needed because of the well-known error of GPS data. The algorithm maps each GPS point of a trip to the corresponding road segment. The algorithm we employed is illustrated as follows.

figure c

In the map matching algorithm, \(\mathrm{ResideG}\left( {B_i ,G_k } \right) \) is a Boolean function. \(\mathrm{ResideG}\left( {B_i ,G_k } \right) \) is true if link \(L_i \) lies or passes across the region of \(G_k \). \(\mathrm{locate}({\phi },B_i )\) is a function from \(({\phi },\theta ,B_i )\) into \(\phi ^{{\prime }}\), such that the longitude and latitude coordinates of \({\phi }\) can be remedied to \(\phi ^{{\prime }}\), which is located to polyline k associated with \(B_i \). \(\theta \) is the direction of \({\phi }\), which represented by a geographical angle clockwise from the north. Function \(\mathrm{d}\left( {{\phi },{\omega }} \right) \) computes the geographical distance between \({\phi }\) and \({\omega }\). Through experiments, we find that this map matching algorithm has a better performance when dealing with the low-sampling-rate bus trajectories.

5 Path-section graph construction

After the phase of bus trajectory processing, a path-section graph will be constructed. This consists of two parts of work: building path-section graph and travel time estimation. This section first describes the construction of the path-section graph, and then introduces the travel time estimation of path-section graph edges.

5.1 Building the path-section graph

For facilitating further description, several definitions and terms are listed below.

Definition 1 (Path-section) A path-section is the path segment between two adjacent bus stops.

Definition 2 (Path-section Graph) A path-section graph G =(V, E) is a directed graph that consists of a set of bus stops V and a set of path sections E.

According to the definition1, a path-section is a road segment between two adjacent bus stops. So it is not difficult to build the framework of path-section graph according to the road network and the longitude and latitude coordinates of bus stops. As we known that traffic patterns of weekdays are similar while those of weekdays and weekends are different. So we need to build two distinct path-section graphs for weekdays and weekends respectively. In other words, all the bus trajectories are divided into two parts. We put the weekday data part into a weekday path-section graph, and project the weekend data part into a weekend path-section graph. The vertexes of the path-section graph represent the bus stops. As a result, we convert each bus trajectory from a sequence of road segments to a path-section sequence. Then the distribution of travel time of each path-section edge should be estimated.

5.2 Travel time estimation

In this phase, we estimate the distribution of travel time of each path-section edge. The travel time of a path-section is dynamic at least in the following two aspects. Firstly, it is time-dependent. The travel time could be very long in rush hours while be quite short at other times. The time slots 8:00–9:00 and 14:00–15:00 have different distributions. Secondly, it is location-variant. Different path-sections have different time-variant traffic patterns. For example, some roads could be very crowded in the morning rush while be very fast in the evening rush. And some roads may be crowded most of the time. Therefore, we cannot apply the same or a predefined time partition method for all the path-sections.

Meanwhile, we find that the travel times of a path-section edge gather like a set of clusters rather than a typical Gaussian distribution. Some reasons can cause this. Firstly, different drivers may encounter different number of traffic lights. Secondly, different drivers have different personal behavior, skill and preferences. Therefore, we consider travel time of a path-section edge as a set of distributions corresponding to different time slots. This is distinct from some existing methods, which regard the travel time of a road as a single-valued function based on time of day. For example, some drivers need to spend 8–10 min to traverse a path-section in the time slot 8:00–10:00, while some drivers can accomplish this transition with a 5–18 min cost and some of them need less than 5 min.

Because the distribution will change over time of day. To learn different time partitions for different path-section edges based on the bus trajectories, we develop a clustering algorithm. The algorithm includes two phases. In the first phase, the travel times of a path-section edge are clustered into several categories according to the variance of travel times. In the second phase, a proper time partition for each path-section edge is obtained. As a result, the distributions of travel times in different time slots of each path-section edge can be estimated. The clustering algorithm is described as follows.

The first phase of clustering algorithm: x and y are two bus stops with leaving time \(t_x \) and arriving time \(t_y , \mathrm{s=(x,} \mathrm{y;}t_x ,t_y )\) is a transition, let \(s_{xy} \) be the set of the transitions connecting (x, y) and \(\tau _{xy} =\{(t_x ,t_y )|(x,y;t_x ,t_y )\in s_{xy} \}\). In this phase, we first sort \(\tau _{xy} \) according to the values of travel time \(\left( {t_y -t_x } \right) \), and then partition the sorted list L into several sub-lists in a recursive way. In each iteration, the variance of all the travel times in L is computed firstly. Later, the best split point having the minimal weighted average variance(WAV) is found. WAV is defined as Equation (1):

$$\begin{aligned} \mathrm{WAV}\left( {\mathrm{i};\mathrm{L}} \right) =\frac{\left| {L_1 \left( i \right) } \right| }{\left| L \right| }Var\left( {L_1 \left( i \right) } \right) +\frac{\left| {L_2 \left( i \right) } \right| }{\left| L \right| }Var\left( {L_2 \left( i \right) } \right) \end{aligned}$$
(1)

In Eq. (1), \(L_1 \left( i \right) \quad L_2 \left( i \right) \) are two sub-lists of L split at the \(i_{th} \) element and function Var() represents the variance. The best split point is the point which induces a maximum value of \(\Delta \mathrm{V}(\mathrm{i})\) as:

$$\begin{aligned} \Delta \mathrm{V}\left( \mathrm{i} \right) =\mathrm{Val}\left( \mathrm{L} \right) -\mathrm{WAV}\left( {\mathrm{i;L}} \right) \end{aligned}$$
(2)

The first phase of algorithm terminates when max \(max_i \{\Delta \mathrm{V}\left( \mathrm{i} \right) \}\) is less than a threshold \({\delta }\). As a result, a set of split points which divide the whole list L into several clusters \(\mathrm{C}=\{c_1 ,c_2 ,\ldots ,c_m \}\) are found. \(c_i \in \mathrm{C}\) represents a category of travel times.

The second phase of clustering algorithm: The main aim of this phase is to split the x-axis into several time slots such that the travel times have a relatively stable distribution in each slot. After the first phase, each travel time \(y_i \) can be represented with the category it belongs to \(\left( {\left( {y_i } \right) } \right) \). Then the pair collection \(S^{xc}=\{(x_i ,c(y_i ))\}_{i=1}^n \) are be sorted according to \(x_i \), which is arriving time. The information entropy of \(S^{xc}\) is defined as:

$$\begin{aligned} \mathrm{Ent}\left( {S^{xc}} \right) =-\mathop \sum \limits _{i=1}^m p_i \log \left( {p_i } \right) \end{aligned}$$
(3)

In Eq. (3), \(p_i \). is the proportion of \(c_i \). in the collection. The algorithm in the second phase runs iteratively to search a set of split points, which is similar to the algorithm in the first phase. However, we use the weighted average entropy(WAE) of \(S^{xc}\) instead of the WAV. WAE of \(S^{xc}\) is defined as:

$$\begin{aligned} \mathrm{WAE}\left( {\mathrm{i};S^{xc}} \right) =\frac{\left| {s_1^{xc} \left( i \right) } \right| }{\left| {S^{xc}} \right| }Ent\left( {s_1^{xc} \left( i \right) } \right) +\frac{\left| {s_2^{xc} \left( i \right) } \right| }{\left| {S^{xc}} \right| }Ent\left( {s_2^{xc} \left( i \right) } \right) \end{aligned}$$
(4)

In Eq. (4), \(s_1^{xc} \) and \(s_2^{xc} \) are subsets of \(S^{xc}\) when split at the \(i_{th}\) pair. The best split point is the point whicleads to a maximum value of \(\Delta \mathrm{E}(\mathrm{i})\). as:

$$\begin{aligned} \Delta \mathrm{E}\left( \mathrm{i} \right) =\mathrm{Ent}\left( {S^{xc}} \right) -\mathrm{WAE}\left( {\mathrm{i};S^{xc}} \right) \end{aligned}$$
(5)

As a result, the distribution of travel times in each time slot can be obtained after the second phase of the clustering algorithm. In other words, giving the time we can estimate the travel time of each path-section edge. These estimated travel times will be utilized in the next section for predicting arrival time.

6 Arrival time predicting

After the phase of path-section graph construction, bus arrival time will be predicted. In this section, we present the method to compute the predicting arrival time of the bus at the queried stop. The arrival time is then estimated as follows:

$$\begin{aligned} \mathrm{T}=T_{que} +T_1 +T_2 +\mathop \sum \limits _{i=k}^{q-1} TW_i \end{aligned}$$
(6)

In Eq. (6), \(T_{que} \) is the time when querying the arrival time. \(T_1 \) is the time for the bus to travel from its current location to the nearest next bus stop. Because the distance from the current bus location to the nearest next bus stop is in close, the possibility of traffic condition change is reduced greatly. So the bus speed won’t change much and \(T_1 \) is defined as \(T_1 =\frac{D}{V}\), in which D is the distance based on the road to the nearest next bus stop and V is the instantaneous velocity got through the GPS information.

\(T_2 \) represents the time for the bus to travel from its nearest next bus stop to the queried bus stop, which can be obtained from the path-section graph and travel time estimating described in Sect. 5. If there are i edges in the path-section graph from its nearest next bus stop to the queried bus stop, \(T_2 =T_{21} +T_{22} +\ldots +T_{2i} \), \(T_{2i} \) is the travel time of i-th edge in the path-section graph. It can be computed using the method described in Sect. 5.

\(\mathop \sum \limits _{i=k}^{q-1} TW_i \) is the dwelling time at stops from its nearest next bus stop \(Stop_k \) to the queried bus stop \(Stop_q \). The dwelling time of each stop is defined as the average dwelling time at this stop which can be computed from the bus trajectory data.

7 Practical examples

In this section, the proposed method for predicting bus arrival time at bus stops has been evaluated by real-world data in Hangzhou. Hangzhou is the capital of Zhejiang province and one of the most famous tourist cities in China. Hangzhou has a highly developed and sophisticated bus route network that comprises about 600 bus routes. Almost all the buses have been equipped with GPS devices. In Hangzhou, electronic bus stop board system provides the predicting arrival time information. We build a bus arrival time prediction system by using Hangzhou bus trajectory dataset generated by 1000 buses in a period of 2 months. The data set contains approximately one billion records. Each record contains the following information:

  • Bus ID the unique bus ID.

  • Time the sample timestamp “YYYY-MM-DDHH:MM: SS”.

  • GPS position the current longitude and latitude of the bus.

  • Speed the current bus speed in km/h;

  • Orientation the direction the bus is heading in, from 0\(^{circ}\) to 360\(^{circ}\)in clockwise.

  • GPS state this value is set to 1 if the GPS data is incorrect, and 0 otherwise.

In our experiment, all the records with incorrect GPS data are removed.

In our tests, the prediction results are evaluated in terms of the performances of three measures: the mean absolute error (MAE), the mean absolute percentage error (MAPE) and the root mean square error (RMSE). The three terms can judge the difference between the observed and the predicted running time in different aspects. They are calculated as:

$$\begin{aligned} \mathrm{MAE}=\frac{\mathop \sum \nolimits _{k=1}^N |t\left( k \right) -t_m (k)|}{N} \end{aligned}$$
(7)
$$\begin{aligned} \mathrm{MAPE}=\left( {\frac{1}{N}\mathop \sum \limits _{k=1}^N \frac{|t\left( k \right) -t_m (k)|}{t_m (k)}} \right) *100\% \end{aligned}$$
(8)
$$\begin{aligned} \mathrm{RMSE}=\sqrt{\frac{\mathop \sum \nolimits _{k=1}^N \left( \left( k \right) -t_m (k)\right) ^{2}}{N-1}} \end{aligned}$$
(9)

where t(k) is the predicted running time of the test buses and \(t_m (k)\) is the observed running time of the bus. And N is the test times.

We choose 6 routes including 2 test stops, as shown in Fig. 2. We first built the path-section graph by converting trajectory to a sequence. Then we use our method to estimate arrival time. Three of them were marked purple where the stop is Lung Bridge; three were marked as red where the stop is the martial arts small square.

Fig. 2
figure 2

Test route

Table 1 Parameter setting of improved algorithm

Table 1 shows the number of valid observations of each route for each day, and RMSE statistics for the collected travel time. In Table 1, we do the test on two bus stops from June 6 to June 10, which include route K10, K11, K41, K8, K19, K34. The RMSEs of our method are from 18.71 to 22.48. The average value of RMSE for 6 routes is 20.445.

The average value of the MAE, the MAPE and the RMSE of the two stops for all the bus routes are summarized in Fig. 3. MAE, RMSE and MAPE of stop1 is 28.3, 20.66 and 14.2%. The corresponding values of stop1 is 31.1, 20.23 and 16.3%.

We also test the proposed method on all bus stops of one route. The estimated arrival time and the measured arrival time of one bus for the 18 bus stops are compared and the results are presented in Fig. 4. On the whole, we can see that the prediction value is close to the actual value. As distance increases, the deviation increases.

To test the validity of the proposed method, we compare the prediction result of support vector machine (SVM) and our proposed method. The SVM is implemented with SVM MATLAB toolbox (University of Southampton, http://www.isis.ecs.soton.ac.uk/resources/svminfo/) on Microsoft Windows plat. Figure 5 shows the comparison results. We can see that our method exhibits some advantages over SVM model. All of the MAE, RMSE and MAPE values are lower than those of SVM.

Moreover, we performed extensive practical experiments to compare the prediction result of three-layer artificial neural network (ANN) and our proposed method. The three-layer ANN consists of three hidden neurons. A scaled conjugate gradient algorithm is employed for training, and the hidden neurons are optimized by trial and error. Figure 6 shows the comparison results. MAE, RMSE and MAPE values prove that our method is a superior model.

Fig. 3
figure 3

Example of a feasible solution

Fig. 4
figure 4

Comparison of the predicted and measured time for 18 stops of one route

Fig. 5
figure 5

Comparison of SVM and our method

Fig. 6
figure 6

Comparison of ANN and our method

Experiments show that the proposed method has higher predictive accuracy than SVM model and three-layer ANN model. The performance of our method for bus arrival time prediction at the stop is shown to be satisfactory. This can be attributed that our method makes full use of the real-time and historic data.

8 Conclusion

This paper presents an approach that predicts bus arrival time in terms of the knowledge learned from a large number of historical bus trajectories. In our method, a time-dependent path-section graph for weekdays and weekends is constructed respectively, and then arrival time is predicted based on this graph and real-time GPS information. The path-section graph models knowledge of routes based on the bus trajectories and reduces the online computation of arrival time prediction. This graph has not been employed in the previous works of bus arrival time prediction. Then a clustering algorithm is presented to learn the time-variant distributions of the travel times of path sections between any two bus stops. We build a bus arrival time prediction system by using a real-world trajectory dataset generated by 1000 buses in a period of 2 months, and evaluate the system with extensive experiments and in-the-field evaluations. Experiments show that our method can predict more accurate arrival time with less online computation than competing methods.