1 Introduction

In this study, we consider an important data-driven public transportation problem: finding the best bus route that minimizes passengers’ overall commuting time. Given a bus transportation system as well as the requests of specific passengers who commute from different starting locations to a fixed destination, the goal of the problem is designing a new bus route to satisfy the passengers’ demand without any transfer.

Previous studies on bus transfer problems were mostly not data-driven [2] due to data skewness problems [3]. In this work, we proposed a new integer-programming based method, which we call IntRoute, to find the route that minimizes the total time cost of the targeted passengers. Specifically, our IntRoute contains two main phases, i.e., the contraction of bus stops via K-means clustering and the derivation of new bus route via a mixed integer linear programming (ILP). The major contributions of this research are listed below.

  • We design a single bus route in which all passengers with an identical destination are delivered without any transfers.

  • We present a two-phase framework to minimize the total time expense of the specific passengers.

  • We develop a genetic algorithm (GA) to solve the integer linear programming (ILP) for large-scale instances.

2 Methodology

The main framework of our two-phase IntRoute method are shown in Fig. 1.

Fig. 1.
figure 1

The framework of the IntRoute.

Fig. 2.
figure 2

Graph transformation.

Phase 1: Contraction of Bus Stops. In the first phase of IntRoute, we contract multiple requests into a super node by using clustering approaches and determine a pickup bus stop for all the requests inside the super node. In each super node, passengers are asked to walk to the pickup bus stop and wait for buses. The walking time is counted into the total commuting time. We employ K-means clustering, as it consumes the least walking time compared with hierarchical clustering and density-peaks clustering. Then we exploit silhouette method and elbow method to determine the number of pickup stops, i.e., the cluster number n. Both methods indicate the best clustering number should be \(n=20\). Therefore, this phase finds 20 pickup stations that minimizes the passengers total walking time.

Phase 2: Design of One Alternate Route. In the real-world transportation network, an arbitrary node pair (ij) may be connected (Fig. 2 left). To solve the problem via mathematical IP model, we introduce the multi-level graph \(G'\) (Fig. 2 right), where each level represents a possible sub-route from one stop to the next one. The red paths in G and \(G'\) are equal. The graph \(G'\) indicates the order of visiting the pick-up stops directly by the levels.

Modelling via Integer Programming. We denote the node set as \(N=\{ 1,2,\dots ,n \}\cup \{ S,T \}\), where S and T represents the source and destination, respectively. The arc set is \(A=\{ (i,j)|i\in N,\ j\in N, \ \text {and} \ i\ne j \}\). The travel time from node i to node j is denoted as \(c_{ij}\), and \(r_{i}\) is the number of passengers who want to go to the destination. \(x_{ij}^l\in \{ 0,1 \}\) represents a binary variable, indicating whether the bus goes from node i to node j at level l on \(G'\). \(y_{ijk}^l\in \{ 0,1 \}\) represents a binary variable which denotes if request k travels through arc (ij) at level l. \(v_{k}^l\in \{ 0,1 \}\) represents a binary variable, indicating if request k is served at the level l. Formally, the IP model is formulated as:

$$\begin{aligned} \nonumber&\text{ min }\sum \limits _{l}\sum \limits _{k}\sum \limits _{(i,j)\in A}^{} y_{ijk}^{l} c_{ij} r_{k}, \quad s.t.\\&{\sum }_{(i,j)\in A} x_{ij}^{l}=1, for\ l=0,1,\dots ,n \end{aligned}$$
(1)
$$\begin{aligned}&{\sum }_{(j,i)\in A} x_{ji}^{l}={\sum }_{(i,k)\in A}x_{ik}^{l+1},\forall i;\forall l \end{aligned}$$
(2)
$$\begin{aligned}&{\sum }_{(S,i)\in A} x_{Si}^{0}={\sum }_{(i,j)\in A}x_{ij}^{1} \end{aligned}$$
(3)
$$\begin{aligned}&{\sum }_{(i,j)\in A} x_{ij}^{n-1}={\sum }_{(j,T)\in A}x_{jT}^{n}\end{aligned}$$
(4)
$$\begin{aligned}&{\sum }_{i}{\sum }_{l} x_{ij}^{l}\le 1, for\quad j=1,2,\dots ,n,T \end{aligned}$$
(5)
$$\begin{aligned}&v_k^l\le v_k^{(l+1)},\forall k, \forall l \end{aligned}$$
(6)
$$\begin{aligned}&{\sum }_{k} v_k^l= l, for\ l=1,2,\dots ,n\end{aligned}$$
(7)
$$\begin{aligned}&x_{ij}^{l} \le v_{i}^{l},for\ (i,j) \in A, l=1,2,\dots ,n \end{aligned}$$
(8)
$$\begin{aligned}&y_{ijk}^{l} \le (x_{ij}^{l} + v_{k}^{l})/2,\ \forall (i,j) \in A, \forall k, \forall l \end{aligned}$$
(9)
$$\begin{aligned}&y_{ijk}^{l} \ge (x_{ij}^{l} + v_{k}^{l})-1,\ \forall (i,j) \in A, \forall k, \forall l \end{aligned}$$
(10)
$$\begin{aligned}&x_{ij}^{l}\in \{ 0,1 \};\ v_k^l\in \{ 0,1 \};\ y_{ijk}^{l}\in \{ 0,1 \} \end{aligned}$$
(11)

where (1) ensures that each level is exactly passed once. (2)–(4) ensure that flow conservation of the graph. (5) ensures that each node can be entered at most once. (6) ensures that request k must be served at level \(l+1\) if it is served at level l. (7) ensures that l requests are on the bus when the bus is serving level l. (8) ensures that if arc (ij) is picked at level l, request i should be served at level l. (9) and (10) ensure that the request k is served at level l on the arc (ij) if both \(x_{ij}^{l}\) and \(v_{k}^{l}\) are equal to one. (11) shows \(x_{ij}^{l}\), \(v_{k}^{l}\), and \(y_{ijk}^{l}\) are all binaries.

Optimization via Genetic Algorithm. We design a genetic algorithm (GA) to solve this IP problem. Specifically, the chromosome represents possible sequence of the node set \(\{1,2,\dots ,n \}\), and the population is a set of chromosomes.

figure a

As shown in Algorithm 1, a new chromosome can be generated by the crossover between two parent bus routes with a probability of crossover rate \(R_c\). The mutation is defined as the position exchange between two randomly selected near-by bus stoops with a probability of \(R_m\). There are two steps in our GA: (1) the randomly population initialization with a given size M, and (2) the population evolution for T generations by crossover and mutation according to a fitness function. We adopt a 2-OPT technique to avoid the cross sub-paths. Besides, we propose a decomposition technique that clusters the optimal route into three sub-route via k-means. We concatenate the clusters in a reverse order, i.e., from the destination node to the start node. This strategy greatly reduces the travel time.

3 Experiments and Analysis

Data. The experiment is performed based on publicly available real-world commuting data, retrieved from the card-based transit payment systemFootnote 1 in Sydney, Australia, including approximately three million trips.

Table 1. Routing time before optimization
Fig. 3.
figure 3

Bus route without transfer

Results. The routing time before optimization are listed in Table 1, where routes are represented by the IDs of the bus stops. Here C denotes the time cost without transfer and without optimization (also demoed in Fig. 3), and \(C'\) denotes that of the original commute with transfers. The new route after our optimization is shown in Fig. 4, while the time costs of the bus routes optimized by different methods are listed in Table 2.

Conclusions. Our IntRoute method greatly reduces the time expense for passengers from 31.53 min to 18.06 min on average, saving about 43% of commute time. In future, we plan to investigate more optimization methods for further improving our solutions.

Table 2. Commute time
Fig. 4.
figure 4

Bus routes after optimization