1 Introduction

With the ongoing development of intelligent urban public transportation systems in China, the investigation of bus passenger flow has become a key research subject (see [1] for instance). In order to remain competitive in the transportation market and provide high-quality services to passengers, bus companies need to continuously track how passenger demand changes [2]. However, the passenger flow in a bus system is influenced by many factors, including commuting, holidays, weather, temperature, etc. [2]. For example, the volume may increase suddenly on cold and snowy days, so that the bus transport capacity fails to meet passenger demand, which puts tremendous pressure on bus transportation management. Given the limited bus resources, some popular routes are often in short supply, which can result in passenger flow detention and reduced service quality. The bus companies might thus lose competitiveness in the transportation market. Therefore, it is necessary to find an effective solution to the problems caused by such bursts of passenger flow and to adjust the current management policies in support of optimal bus resource allocation, line planning and bus scheduling. Such a solution is of great importance for improving both the service capability and the working efficiency of the public transportation system.

The driving motivation of this work is to find a reliable method to solve the problems above. This work is socially significant, since urban transport plans and policies could be designed or adjusted to adapt to market demand. The implementation involves a combination of big data processing, time series modeling and analysis. The primary objective is to apply time series models and data analytics to explore the passenger demand based on real data and then to predict the daily passenger volume on a given bus line. The study will mainly focus on the following two aspects:

  • Descriptive statistics on the trip characteristics of passengers, including riding date and time, and on the volume and variation characteristics of transit passenger flow at different stations in a given bus line.

  • Time series parameter estimation and passenger volume prediction based on the bus ticket sale records.

In this work, SAS (version 9.4) (see [3] for instance) is used to obtain the descriptive statistics and to perform the time series analysis and prediction.

The rest of the paper is organized as follows. Section 36.2 reviews the development of time series analysis and recent works on its application to public transportation systems. Section 36.3 presents time series related concepts and methods, as well as our data analysis process. Section 36.4 summarizes and evaluates the empirical results. Finally, the conclusion is discussed in Sect. 36.5, and open questions and possible future improvements are proposed in Sect. 36.6.

2 Literature Review

Prior to 1920, time series analysis was limited to drawing lines through a mass of data. In 1927, Yule [4] first introduced the concept of 'autoregressive', in which the variables are related over time but time itself is not a causal factor, and pioneered the autoregressive (AR) model of order two when studying the number of sunspots and exploring the period of the disturbed sequence. The autoregressive model he established is a special kind of stationary time series. In 1931, Walker [5] expanded and generalized the AR model to higher orders. Meanwhile, Slutsky [6] was interested in the randomness of time series, regarded them as perturbations, and proposed the moving average (MA) model. In 1938, Wold [7] proved that a discrete stationary process consists of implicit periodicity and linear regression. The hidden cycle is a deterministic component, while the linear regression part consists of a moving average and an autoregressive process, which are non-deterministic components of random perturbations. Any stationary time series, once its deterministic components are eliminated, can be reduced to a linear combination of random perturbations. This well-known time series decomposition idea is the theoretical basis of the autoregressive moving average (ARMA) model. Taking non-stationarity into consideration, the autoregressive integrated moving average (ARIMA) model was proposed in the landmark work [2]. The book provided a systematic approach to analyzing and forecasting time series and discussed how to identify, estimate and diagnose the ARIMA model.

The application of time series models in modern society has spread rapidly, as the methodology was extended to non-stationary processes (see for instance [8]). A large number of empirical results show that most time series built on socio-economic phenomena are non-stationary and have a trend (see for instance [9]). According to Xia [9], there are two types of time trend: one is deterministic and the other is stochastic. A deterministic time trend is one that can be characterized by a function of time. The commonly used trend functions are linear functions, quadratic parabola functions, exponential functions and logarithmic functions. By contrast, a time series with a stochastic trend cannot be expressed by deterministic functions of time. In this case, the original process is differenced one or more times and the ARIMA model is then used to fit the data.

In the literature, existing research suggests that time series analysis has been used effectively to study different public transportation systems. For the subway system in Shanghai, Zhu [10] constructed an ARIMA model for the daily passenger flow by comparing the change rate of the daily volume with that of the '7-day' average volume. For airport terminal departure passenger traffic, Li et al. [11] took the daily periodicity of the process into consideration and proposed a seasonal autoregressive integrated moving average (SARIMA) model to predict the passenger flow in Kunming Changshui International Airport. For railway passenger flow forecasting, a time series model was established in [12] combining the long-term trend, the seasonal factors and the weather factors. To achieve accurate real-time taxi passenger hotspot prediction, Jamil and Akbar [13] proposed an automatic ARIMA model that determines the model order automatically; the algorithm they designed overcame the common obstacles of subjectivity and complexity. All these applications make use of knowledge of the passenger flow and provide instructive insight into the management of public transportation systems, which is of referential significance for our investigation.

3 Methodology

3.1 Stationary Time Series Models

Time series analysis aims to reveal the underlying dynamics and structures that affect the observable data, thus establishing a suitable theoretical model for monitoring and predicting the data. For the definition of a stationary time series (or simply 'time series'), one can refer for instance to Definition 1.3.2 in [14]. In this work, the daily passenger flow volume \( \{ Z_{t} \} \) at any unit of time t will be regarded as a discrete-time stochastic process. Roughly speaking, assuming that \( \{ Z_{t} \} \) is a stationary time series with mean 0 and that \( Z_{t} \) depends only on its historical records \( Z_{t - 1} , Z_{t - 2} , \ldots \), we can use the observed historical data to estimate the dynamic properties, build optimal models and then use these models for prediction. In this project, we construct discretely sampled time series based on the actual daily records of passenger volume on a given bus line. A detailed description of the database can be found in Sect. 36.4.1. In the rest of this subsection, some fundamental concepts will be introduced. One may refer to [8] for the details.

Autoregressive Model: AR (p). The autoregressive (AR) model is a very common time series model. The general p-order autoregressive model, denoted as AR(p), is given by:

$$ Z_{t} = \varphi_{1} Z_{t - 1} + \varphi_{2} Z_{t - 2} + \cdots + \varphi_{p} Z_{t - p} + a_{t} , $$
(36.1)

where the parameters \( \varphi_{1} ,\varphi_{2} , \ldots ,\varphi_{p} \) are called autoregressive coefficients and are to be estimated. The random error terms \( \left\{ {a_{t} } \right\} \) are white noise, i.e., a sequence of i.i.d. random variables with \( a_{t} \, \sim \,N\left( {0, \sigma_{a}^{2} } \right) \), and \( a_{t} \) is independent of \( Z_{t - 1} , Z_{t - 2} , \ldots , Z_{t - p} \).
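Although the analysis in this paper is carried out in SAS, the mechanics of an AR model can be sketched in a few lines of Python. The following illustrative example (with made-up coefficients, not estimated from the bus data) simulates an AR(2) process and recovers its coefficients by conditional least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: Z_t = 0.6 Z_{t-1} + 0.2 Z_{t-2} + a_t,
# with a_t ~ N(0, 1) white noise (illustrative coefficients only).
n = 500
phi = np.array([0.6, 0.2])
z = np.zeros(n)
for t in range(2, n):
    z[t] = phi[0] * z[t - 1] + phi[1] * z[t - 2] + rng.normal()

# Conditional least squares: regress Z_t on (Z_{t-1}, Z_{t-2}).
X = np.column_stack([z[1:-1], z[:-2]])
y = z[2:]
phi_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(phi_hat)  # close to [0.6, 0.2]
```

In SAS, the analogous estimation is carried out by the ARIMA procedure rather than by hand; the sketch only illustrates what "estimating the autoregressive coefficients" means.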

Moving Average Model: MA (q). The general q-order moving average model, denoted as MA (q), is given by:

$$ Z_{t} = a_{t} - \theta_{1} a_{t - 1} - \theta_{2} a_{t - 2} - \cdots - \theta_{q} a_{t - q} , $$
(36.2)

where \( \theta_{1} ,\theta_{2} , \ldots ,\theta_{q} \) are called moving average coefficients and are to be estimated.

Autoregressive Moving Average Model: ARMA (p, q). The autoregressive moving average (ARMA) model combines an AR model with an MA model to produce a new process that models the time series. The general ARMA model, denoted as ARMA (p, q), is given by

$$ \begin{aligned} Z_{t} & = \varphi_{1} Z_{t - 1} + \varphi_{2} Z_{t - 2} + \cdots + \varphi_{p} Z_{t - p} \\ & \quad + a_{t} - \theta_{1} a_{t - 1} - \theta_{2} a_{t - 2} - \cdots - \theta_{q} a_{t - q} . \\ \end{aligned} $$
(36.3)

Autoregressive Integrated Moving Average Model: ARIMA(p, d, q). Notice that the \( {\text{AR}},{\text{MA}} \), and \( {\text{ARMA }} \) models describe stationary time series. However, a time series is not necessarily stationary; it may, for example, contain a linear trend component. A non-stationary time series must first be transformed into a stationary one by differencing, which is expressed through the backward shift operator. Such a process is called an \( {\mathbf{ARIMA}} \) process, denoted as ARIMA (p, d, q), and is given by

$$ \left( {1 - \varphi_{1} B - \cdots - \varphi_{p} B^{p} } \right)\left( {1 - B} \right)^{d} Z_{t} = \left( {1 - \theta_{1} B - \cdots - \theta_{q} B^{q} } \right)a_{t} , $$
(36.4)

where B is the backward shift operator (lag) defined as \( \left( {1 - B} \right)Z_{t} = Z_{t} - Z_{t - 1} \) and d is the number (order) of the difference to make the process stationary.
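As a concrete illustration of the differencing operator (a Python sketch, not part of the SAS workflow used later in the paper): applying \( (1 - B) \) is simply first differencing, and a d-fold difference removes a polynomial trend of degree d.

```python
import numpy as np

# (1 - B)Z_t = Z_t - Z_{t-1} is the first difference, np.diff in numpy.
t = np.arange(100, dtype=float)
z = 3.0 + 0.5 * t          # a purely linear trend, for illustration

dz = np.diff(z, n=1)       # d = 1: the trend collapses to a constant 0.5
ddz = np.diff(z, n=2)      # d = 2: identically zero

print(dz[:3])   # [0.5 0.5 0.5]
print(ddz[:3])  # [0. 0. 0.]
```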

ARMA Model with a Quadratic Function Trend. Besides a linear trend component, other trend forms may also be taken into account. If the trend of a time series has the shape of a quadratic function, it can be fitted by one. The ARMA model with a quadratic function trend is given by

$$ \begin{aligned} Z_{t} & = {\text{quadratic function}} + {\text{ARMA process}} \\ & = b_{1} t + b_{2} t^{2} + \varphi_{1} Z_{t - 1} + \varphi_{2} Z_{t - 2} + \cdots + \varphi_{p} Z_{t - p} + a_{t} \\ & \quad - \theta_{1} a_{t - 1} - \theta_{2} a_{t - 2} - \cdots - \theta_{q} a_{t - q} , \\ \end{aligned} $$
(36.5)
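A hedged sketch of how such a model can be handled in practice (in Python rather than the SAS procedures used in Sect. 36.4): fit the quadratic trend by ordinary least squares, and the detrended residuals are then what an ARMA model would be fitted to. The trend coefficients below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic series: quadratic trend plus Gaussian noise (made-up values).
t = np.arange(200, dtype=float)
z = 0.04 * t + 0.001 * t**2 + rng.normal(0.0, 1.0, size=t.size)

# Least squares fit of the trend, with regressors t and t^2
# (analogous to the _LINEAR_ and _SQUARE_ variables used later in SAS).
X = np.column_stack([t, t**2])
coef, *_ = np.linalg.lstsq(X, z, rcond=None)

resid = z - X @ coef   # detrended residuals: the ARMA part of the model
print(coef)            # close to [0.04, 0.001]
```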

In the rest of this section, the application of time series method to the passenger flow prediction will be introduced. This can be achieved by the descriptive and inferential studies on the current data.

3.2 Time Series Analysis

According to [8], the main steps of time series analysis and modeling are:

  1. Stationarity and white noise test

  2. Model identification (i.e., specifying the lag order)

  3. Model selection and parameter estimation

  4. Diagnostic checking

  5. Prediction based on the optimal model.

Stationary Test. The first step of time series analysis is to verify whether the series is stationary. There are two main methods: one is the graph test, which illustrates the features shown in the time series plots and autocorrelation diagrams, while the other one is the unit root test.

Graph Test.

  1. Time series plot

According to the property that the mean and variance of a stationary time series are constant, the time series plot should show the process fluctuating randomly around a constant value with a similar range of fluctuation throughout. The time series is usually not stationary if there is a significant trend or periodicity.

  2. Autocorrelation function (ACF) plot

The ACF describes the degree of linear correlation between observations of a time series at different lags. Stationary time series typically exhibit only short-term correlation. The time series is considered stationary if the autocorrelation function declines rapidly to zero and all the values fall into the confidence interval by lag 3. In contrast, the autocorrelation of a non-stationary series declines slowly.
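The sample ACF described above is straightforward to compute directly; the following Python sketch (an illustration, not the SAS output used in this work) estimates \( \hat{\rho}_{k} \) and checks it on simulated white noise:

```python
import numpy as np

def sample_acf(z, max_lag):
    """Sample autocorrelation function rho_k for k = 0..max_lag."""
    z = np.asarray(z, dtype=float)
    zc = z - z.mean()
    denom = np.sum(zc * zc)
    return np.array([np.sum(zc[k:] * zc[:len(zc) - k]) / denom
                     for k in range(max_lag + 1)])

# White noise should give near-zero autocorrelations beyond lag 0;
# roughly 95% of them lie inside +/- 1.96 / sqrt(n).
rng = np.random.default_rng(2)
a = rng.normal(size=1000)
acf = sample_acf(a, 10)
print(acf[0])  # exactly 1.0 by construction
```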

Unit Root Test. The unit root test is used to check whether a time series needs to be differenced. The procedure is described in [15]. Among the unit root tests, the most widely used one is the Dickey–Fuller (DF) test, which is applicable to the AR(1) model:

$$ Z_{t} = \varphi_{1} Z_{t - 1} + a_{t} = \left( {1 - \varphi_{1} B} \right)^{ - 1} a_{t} = \mathop \sum \limits_{k = 0}^{\infty } \varphi_{1}^{k} a_{t - k} , $$
(36.6)

where \( \left| {\varphi_{1} } \right| < 1 \). Since the root of the characteristic equation \( 1 - \varphi_{1} B = 0 \) is \( \varphi_{1}^{ - 1} \), an equivalent statement of the stationarity condition is that the root must lie outside the unit circle. So it suffices to test whether the root of the characteristic equation is outside the unit circle, with null and alternative hypotheses, respectively:

$$ \begin{aligned} & H_{0} :\left\{ {Z_{t} } \right\}\,{\text{is non - stationary}},\,\left| {\varphi_{1} } \right| = 1,\,{\text{a regular difference is needed}} \\ & H_{1} :\left\{ {Z_{t} } \right\}\,{\text{is stationary}},\,\left| {\varphi_{1} } \right| < 1,\,{\text{the series does not need to be differenced}} \\ \end{aligned} $$

The DF test is only applicable to the AR(1) model. In order to generalize the DF test to AR(p) processes, the augmented Dickey–Fuller (ADF) test was proposed in [16], with the same hypotheses and decision rules; it also allows two additional terms, a drift and a deterministic trend.
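For intuition, the DF regression in its simplest no-constant form can be sketched in Python: regress \( \Delta Z_{t} \) on \( Z_{t-1} \) and examine the t-statistic of the slope. This is an illustration only; in practice a statistical package with proper DF critical values (such as the SAS ADF test used in Sect. 36.4) would be employed.

```python
import numpy as np

def df_tstat(z):
    """Dickey-Fuller t-statistic for the no-constant AR(1) case:
    regress dZ_t = Z_t - Z_{t-1} on Z_{t-1}; a strongly negative value
    (compare with DF critical values, e.g. about -1.95 at the 5% level)
    rejects the unit-root null hypothesis."""
    z = np.asarray(z, dtype=float)
    dz, lag = np.diff(z), z[:-1]
    slope = np.sum(lag * dz) / np.sum(lag * lag)
    resid = dz - slope * lag
    s2 = np.sum(resid**2) / (len(dz) - 1)
    se = np.sqrt(s2 / np.sum(lag * lag))
    return slope / se

rng = np.random.default_rng(3)
stationary = np.zeros(500)
for t in range(1, 500):                        # AR(1) with phi = 0.5
    stationary[t] = 0.5 * stationary[t - 1] + rng.normal()
random_walk = np.cumsum(rng.normal(size=500))  # phi = 1 (unit root)

print(df_tstat(stationary))   # strongly negative -> reject H0
print(df_tstat(random_walk))  # near zero -> cannot reject H0
```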

White Noise Test. To verify whether a process is worth further time series modeling and analysis, a white noise test should be performed. By the definition of white noise, the autocorrelation coefficient at any lag k is \( \rho_{k} = 0 \). This is the ideal situation; in practice, most of the estimated autocorrelation coefficients \( \hat{\rho }_{k} \) are not exactly zero due to the finiteness of the sample, but they fluctuate randomly around zero with small amplitude. Following the methods summarized by Wei [17], instead of considering each autocorrelation individually, the first m autocorrelation coefficients are considered as a whole, and a statistic is constructed to determine whether the sequence is white noise or whether there exists correlation between observations. The null and alternative hypotheses for the white noise test are, respectively:

$$ \begin{aligned} & H_{0} :\rho_{1} = \rho_{2} = \cdots = \rho_{m} = 0,\;\forall m \ge 1,\;{\text{so}}\;\left\{ {Z_{t} } \right\}\;{\text{is a white noise sequence}} \\ & H_{1} :\forall m \ge 1,\;\exists \,k \le m\;{\text{with}}\;k \ne 0\;{\text{such that}}\;\rho_{k} \ne 0,\;{\text{so}}\;\left\{ {Z_{t} } \right\}\;{\text{is not a white noise sequence}} \\ \end{aligned} $$

This is an approximate statistical hypothesis test that none of the autocorrelations of the series up to a given lag are significantly different from 0. If this is true for all m lags, then there is no information in the series to model and no \( {\text{ARIMA }} \) model is needed.
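A common portmanteau statistic for this test is the Ljung–Box form, \( Q(m) = n(n+2)\sum_{k=1}^{m}\hat{\rho}_{k}^{2}/(n-k) \), which is approximately chi-square with m degrees of freedom under the null. The Python sketch below is an illustration, not the SAS implementation used in this work:

```python
import numpy as np

def ljung_box_q(z, m):
    """Ljung-Box portmanteau statistic Q(m); approx. chi-square(m) under H0."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    zc = z - z.mean()
    denom = np.sum(zc * zc)
    q = 0.0
    for k in range(1, m + 1):
        rho_k = np.sum(zc[k:] * zc[:n - k]) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q

rng = np.random.default_rng(4)
noise = rng.normal(size=500)     # white noise: Q should be small
ar = np.zeros(500)               # strongly correlated AR(1): Q is large
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()

# Compare each Q(6) with the chi-square(6) 95% quantile (about 12.59).
print(ljung_box_q(noise, 6))
print(ljung_box_q(ar, 6))
```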

Methods of Order Specification. To determine the order (p, q) of an ARMA model, SAS provides a list of candidate order combinations, based mainly on the ESACF, SCAN and MINIC methods.

The extended sample autocorrelation function (ESACF) method. Since the ACF and PACF of an \( {\text{ARMA}}\left( {p,q} \right) \) model are both trailing, these two functions cannot be jointly used to determine the order \( \left( {p, q} \right) \). For this situation, Tsay and Tiao [18] proposed a general iterative regression method that uses the ESACF to estimate the order of the model. If the time series \( Z_{t} \) follows an \( {\text{ARMA}}\left( {p, q} \right) \) process, then fitting an \( {\text{AR}}\left( p \right) \) model to it yields inconsistent estimates of the autoregressive coefficients \( \hat{\varphi }_{i} ,i = 1,2, \ldots ,p \). Therefore, the regression residuals must be introduced into the model as explanatory variables; after q such iterations, the estimated model is as follows:

$$ Z_{t} = \mathop \sum \limits_{i = 1}^{p} \varphi_{i}^{\left( q \right)} Z_{t - i} + \mathop \sum \limits_{i = 1}^{q} \alpha_{i}^{\left( q \right)} \hat{e}_{t - i}^{{\left( {q - i} \right)}} + e_{t}^{\left( q \right)} . $$
(36.7)

Now the estimator \( \widehat{{\varphi_{i} }}^{\left( q \right)} \) is consistent. Based on this idea, for \( m = 0,1,2, \ldots \), let \( \widehat{{\varphi_{i} }}^{\left( j \right)} \) be the autoregressive coefficient of the \( {\text{AR}}\left( m \right) \) model estimated at the \( j{\text{th}} \) iteration; then the ESACF \( \widehat{{\rho_{j} }}^{\left( m \right)} \) is defined as the sample autocorrelation function of the following transformed series:

$$ y_{t} = \left( {1 - \widehat{{\varphi_{1} }}^{\left( j \right)} B - \widehat{{\varphi_{2} }}^{\left( j \right)} B^{2} - \cdots - \widehat{{\varphi_{m} }}^{\left( j \right)} B^{m} } \right)Z_{t} . $$
(36.8)

Regarding the ESACF, there exists the following probabilistic convergence:

$$ \hat{\rho }_{j}^{\left( m \right)} \mathop \to \limits^{p} \left\{ {\begin{array}{*{20}l} {0,} \hfill & {0 \le m - p \le j - q;} \hfill \\ {X \ne 0,} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right.. $$
(36.9)

Because of this property, the ESACF table for an ARMA (1,1) model has the pattern displayed in Table 36.1, characterized by the fact that all zeroes form a triangle with vertex (1,1). Similarly, for a general \( {\text{ARMA }}\left( {p, q} \right) \) model, the vertex of the triangle of zeroes is located at \( \left( {p, q} \right) \), which gives the rule for identifying the order of the model. In fact, SAS provides two tables, one for the ESACF estimates and the other for the significance tests.

Table 36.1 ESACF for \( {\text{ARMA }}\left( {1,1} \right) \) model, where X is a non-zero number

The smallest canonical correlation coefficient (SCAN) method. Tsay and Tiao [19] first put forward this idea, and Choi [20] gave a concrete method for solving and judging the \( {\text{ARMA}}\left( {p,q} \right) \) model. Only the conclusion of the method is given here. First, the SCAN of each order combination is calculated, and a SCAN table similar to the ESACF table is formed. The only difference is that the judgment is based on a rectangle of zeroes, whose vertex position gives the order of the model. In our project, SAS gives two tables, one for the SCAN coefficient estimates and the other for the chi-square tests of coefficient significance.

The minimum information criterion (MINIC) method. The minimum information criterion (MINIC) method, proposed by Hannan and Rissanen [21], can tentatively identify the order of a stationary and invertible ARMA process. The MINIC table is constructed by computing the Bayesian information criterion (BIC) for various autoregressive and moving average orders. Suppose L is the value of the likelihood function evaluated at the parameter estimates of \( {\text{ARMA}}(p,q) \), N is the number of observations, and k is the number of estimated parameters; then the BIC of the \( {\text{ARMA}}(p,q) \) model can be calculated as:

$$ {\text{BIC}}\left( {p,q} \right) = k\,\ln \left( N \right) - 2\ln \left( L \right) $$
(36.10)

Values of \( {\text{BIC}}\left( {p,q} \right) \) that cannot be computed are set to missing. For large autoregressive and moving average test orders with relatively few observations, a nearly perfect fit can result; this condition is signaled by a large negative \( {\text{BIC}}\left( {p,q} \right) \) value. The MINIC table has the form of Table 36.2. The model with the minimum BIC value is chosen as the best fitting one.

Table 36.2 MINIC table

Methods of Parameter Estimation. There are various ways to estimate the parameters, such as moment estimation, least squares estimation and maximum likelihood estimation. In this work, maximum likelihood estimation is adopted, which is the method most commonly recommended when using SAS for prediction.

Maximum likelihood method. According to the discussion of maximum likelihood in time series analysis by Guidolin and Pedio [22], under the maximum likelihood criterion the sample is regarded as coming from the population under which this sample has the highest probability of occurring. The maximum likelihood estimate of the unknown parameters therefore maximizes the likelihood function \( L\left( {\varphi_{1} , \ldots ,\varphi_{p} , \theta_{1} , \ldots ,\theta_{q} } \right) \). Let \( p\left( {z_{1} ,z_{2} , \ldots ,z_{n} , \varphi_{1} , \ldots ,\varphi_{p} , \theta_{1} , \ldots ,\theta_{q} } \right) \) be the joint density function; then \( L \) can be written as:

$$ L\left( {\varphi_{1} , \ldots ,\varphi_{p} , \theta_{1} , \ldots ,\theta_{q} } \right) = p\left( {z_{1} ,z_{2} , \ldots ,z_{n} , \varphi_{1} , \ldots ,\varphi_{p} , \theta_{1} , \ldots ,\theta_{q} } \right) $$
(36.11)

The distribution of the population must be known in order to use maximum likelihood. However, in time series analysis the distribution of the population is often unknown. To facilitate calculation and analysis, it is usually assumed that the sequence follows a multivariate normal distribution:

$$ \begin{aligned} Z_{t} & = \varphi_{1} Z_{t - 1} + \varphi_{2} Z_{t - 2} + \cdots + \varphi_{p} Z_{t - p} + a_{t} - \theta_{1} a_{t - 1} \\ & \quad - \theta_{2} a_{t - 2} - \cdots - \theta_{q} a_{t - q} ,\tilde{z} = \left( {z_{1} ,z_{2} , \ldots ,z_{n} } \right)^{\prime } , \\ \end{aligned} $$
(36.12)
$$ \tilde{\beta } = \left( {\varphi_{1} , \ldots ,\varphi_{p} , \theta_{1} , \ldots ,\theta_{q} } \right)^{\prime } , $$
(36.13)
$$ \Sigma_{n} = E(\tilde{z}\tilde{z}^{\prime}) = \Omega \sigma_{a}^{2} . $$
(36.14)

The likelihood function of \( \tilde{z} \) is

$$ L\left( {\tilde{\beta }} \right) = p\left( {\tilde{z};\tilde{\beta }} \right) = \left( {2\pi } \right)^{ - n/2} \left| {\Sigma_{n} } \right|^{ - 1/2} \exp \left\{ { - \frac{{\tilde{z}^{\prime } \Sigma_{n}^{ - 1} \tilde{z}}}{2}} \right\}. $$
(36.15)

The log likelihood function is

$$ l\left( {\tilde{\beta }} \right) = - \frac{n}{2}\ln \left( {2\pi } \right) - \frac{n}{2}\ln \left( {\sigma_{a}^{2} } \right) - \frac{1}{2}\ln \left|\Omega \right| - \frac{1}{{2\sigma_{a}^{2} }}\left[ {\tilde{z}^{'}\Omega ^{ - 1} \tilde{z}} \right]. $$
(36.16)

The system of likelihood equations can be obtained by computing the partial derivatives of the log likelihood function with respect to the unknown parameters.

Theoretically, solving the likelihood equations yields the maximum likelihood estimates of the unknown parameters. However, since \( \tilde{z}^{\prime }\Omega ^{ - 1} \tilde{z} \) and \( \ln \left|\Omega \right| \) are not explicit expressions of the parameters, the likelihood equations actually consist of \( p + q + 1 \) transcendental equations, and a complex iterative algorithm is usually required to find the maximum likelihood estimates.

The maximum likelihood method makes full use of the information provided by each observation, so its estimation accuracy is high, and it also has good statistical properties such as consistency and asymptotic efficiency.
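For intuition, the conditional (Gaussian) likelihood of an AR(1) model can be maximized numerically in a few lines of Python. Conditioning on the first observation removes the \( \ln \left|\Omega \right| \) term, so this is a simplified stand-in for the exact likelihood above, with invented simulation parameters:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate an AR(1) series with phi = 0.7 (illustrative value only).
n, phi_true = 400, 0.7
z = np.zeros(n)
for t in range(1, n):
    z[t] = phi_true * z[t - 1] + rng.normal()

def cond_loglik(phi, z):
    """Conditional Gaussian log-likelihood of AR(1), with sigma_a^2
    profiled out at its conditional MLE, mean(residual^2)."""
    resid = z[1:] - phi * z[:-1]
    s2 = np.mean(resid**2)
    return -0.5 * len(resid) * (np.log(2 * np.pi * s2) + 1.0)

# Maximize over a grid of candidate phi values in (-1, 1).
grid = np.linspace(-0.99, 0.99, 397)
phi_hat = grid[np.argmax([cond_loglik(p, z) for p in grid])]
print(phi_hat)  # close to 0.7
```

For AR(p)/ARMA(p, q) models with the exact likelihood, the iterative algorithms mentioned above (as implemented in SAS) take the place of this grid search.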

Diagnostic test. In this step, the goodness of fit and the accuracy of the model are measured, and correlation and normality tests on the residual series are performed. The following two criteria will be used to measure the goodness of fit of a model:

Akaike’s information criterion (AIC). Akaike [23] defined AIC as

$$ {\text{AIC}} = - 2\ln \left( L \right) + 2k, $$
(36.17)

where L is the value of the likelihood function evaluated at the parameter estimates, N is the number of observations and k is the number of estimated parameters. The first term of the AIC measures the goodness of fit of the ARMA model to the data, and the second term is called the penalty function of the criterion because it penalizes a candidate model by the number of parameters used. Therefore, the model with the minimum AIC value should be chosen.

Schwarz’s Bayesian information criterion (SBC). Schwarz [24] defined the SBC as

$$ {\text{SBC}} = - 2\ln \left( L \right) + \ln \left( N \right)k, $$
(36.18)

Similarly, the model with the minimum SBC value should be chosen. The penalty for each parameter is 2 for AIC and \( \ln \left( N \right) \) for SBC, so compared to AIC, SBC tends to select a lower-order model when the sample size is moderate or large.
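Both criteria are simple to compute from a model's maximized log-likelihood. The Python sketch below (with hypothetical log-likelihood values, not taken from the bus data) shows how the \( \ln \left( N \right) \) penalty can make SBC prefer a smaller model than AIC:

```python
import math

def aic(log_l, k):
    """Akaike's information criterion: -2 ln(L) + 2k."""
    return -2.0 * log_l + 2.0 * k

def sbc(log_l, k, n):
    """Schwarz's Bayesian information criterion: -2 ln(L) + ln(N) k."""
    return -2.0 * log_l + math.log(n) * k

# Hypothetical nested candidates: the larger model fits slightly better
# (higher log-likelihood) but uses two extra parameters; n = 90 days.
n = 90
aic_small, aic_large = aic(-420.0, 2), aic(-417.0, 4)
sbc_small, sbc_large = sbc(-420.0, 2, n), sbc(-417.0, 4, n)

print(aic_small, aic_large)  # 844.0 842.0 -> AIC prefers the larger model
print(sbc_small, sbc_large)  # SBC prefers the smaller model, since ln 90 > 2
```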

Two other criteria will be used to measure the accuracy of a model's predictions. One can refer to [25] for a detailed description.

Mean absolute percentage error (MAPE). The MAPE is a common measure of forecast error in time series analysis. It usually expresses accuracy as a percentage and is defined by the formula:

$$ {\text{MAPE}} = \frac{100\% }{n}\mathop \sum \limits_{i = 1}^{n} |Z_{t} - F_{t} |/Z_{t} , $$
(36.19)

where \( Z_{t} \) is the actual value and \( F_{t} \) is the forecast value.

Mean square error (MSE). The MSE is a measure of the differences between the predicted values and the actual values. It is defined by:

$$ {\text{MSE}} = \frac{1}{n}\mathop \sum \limits_{t = 1}^{n} (Z_{t} - F_{t} )^{2} , $$
(36.20)

where \( Z_{t} \) is the actual value and \( F_{t} \) is the forecast value.
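Both accuracy measures are direct to implement; the following Python sketch applies them to toy daily volumes and forecasts (made-up numbers for illustration only):

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs(actual - forecast) / actual)

def mse(actual, forecast):
    """Mean square error."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean((actual - forecast) ** 2)

# Toy daily volumes vs. forecasts (invented, not from the bus dataset).
z = [1000.0, 1100.0, 900.0, 1050.0]
f = [ 950.0, 1150.0, 900.0, 1000.0]
print(mape(z, f))  # about 3.577 (percent)
print(mse(z, f))   # 1875.0
```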

4 Empirical Results

4.1 Introduction to Database

In this work, the original data was provided by the bus companies in the city of Jiaozuo, China, consisting of the bus IC card payment records for six bus lines in the city during the period January 01, 2018 to March 31, 2018. Table 36.3 shows part of the originally collected data. The whole dataset consists of 2,874,878 rows (records), each with eight component variables. The meanings of the variables used in this work are shown in Table 36.4.

Table 36.3 Part of raw data in the database
Table 36.4 Descriptions of the variables used in this work

4.2 Data Preprocessing

The first phase of the data analysis is to process the data in order to construct a time series. The steps to obtain the descriptive statistics and the time series plot are as follows.

  Step 1: Standardization of the raw data. The variables which are not straightforwardly numeric or character, such as 'SITE_TIME', need to be standardized in a format that SAS can recognize and interpret.

  Step 2: Data extraction. The daily passenger volume is extracted from the original database. For this, the DATA step and PROC procedures in SAS are mainly used to create the datasets.

  Step 3: Construction of the time series. After the datasets, including the daily passenger flow volume on each bus line, are constructed, the graphical procedure in SAS is used to plot the time series for each bus line. In this work, line No. 18 is chosen for the case study. The time series plot of the daily passenger volume on line No. 18 during the period January 01, 2018 to March 31, 2018 is shown in Fig. 36.1.

    Fig. 36.1

    Time series plot of daily passenger volume in line No. 18, including total passenger volume and the ones in two directions

4.3 Model Building

Case 1: ARMA modeling with the original time series. According to the ADF test results shown in Table 36.5, the p-value is less than 0.05 for lag 0, indicating that the null hypothesis can be rejected and the sequence is stationary, so an ARMA model is suitable for the original data. After calculating the BIC of the models with different order combinations, SAS lists the candidates in ascending order of BIC value, and the three candidate models with the minimum BIC values, namely \( {\text{AR}}\left( 3 \right) \), \( {\text{ARMA}}\left( {1,1} \right) \) and \( {\text{ARMA}}\left( {1,3} \right) \), are chosen. The results of the parameter estimation and the fitting statistics for each candidate model are summarized in Tables 36.6 and 36.7, respectively.

Table 36.5 ADF test results
Table 36.6 Parameter estimation results of three candidate models (in Case 1)
Table 36.7 Fitting statistics of three candidate models (in Case 1)

Although ARMA(1,1) and ARMA(1,3) have the smallest SBC and the smallest AIC values, respectively, their MAPE values are much higher than that of AR(3). In comparison, the \( {\text{AR}}\left( 3 \right) \) model has the smallest MAPE value, indicating the highest prediction precision, while both its AIC and SBC values are only slightly higher than the minima, indicating a relatively good fit. Therefore, \( {\text{AR}}\left( 3 \right) \) is chosen as the optimal model in this case. After the parameter estimation and significance tests in SAS, all the AR coefficients are significant, so the optimal model is determined as below.

$$ Z_{t} = 0.65563Z_{t - 1} + 0.05421Z_{t - 2} + 0.27748Z_{t - 3} + a_{t} $$
(36.21)
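Once the coefficients of (36.21) are in hand, producing a point forecast is a single weighted sum. The Python sketch below (with made-up recent volumes rather than the actual data of Table 36.8) computes the one-step-ahead forecast; the noise term \( a_{t} \) has conditional mean zero and therefore drops out:

```python
# One-step-ahead forecast with the fitted AR(3) model of Eq. (36.21):
# Z_t = 0.65563 Z_{t-1} + 0.05421 Z_{t-2} + 0.27748 Z_{t-3} + a_t.
phi = (0.65563, 0.05421, 0.27748)

def ar3_forecast(z1, z2, z3):
    """Point forecast E[Z_t | history]; the white noise a_t has mean zero."""
    return phi[0] * z1 + phi[1] * z2 + phi[2] * z3

# The three 'recent daily volumes' below are invented for illustration.
print(ar3_forecast(5200.0, 5100.0, 5350.0))  # about 5170.3
```

Multi-step forecasts are obtained by iterating this recursion, feeding each forecast back in as the most recent value; this is what SAS does internally for the ten-step-ahead prediction.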

As shown in Fig. 36.2, the residual diagnostics in SAS show that the ACF values of the residual sequence are almost 0 and the white noise probability is greater than 0.05, which indicates that there is no dependence between the residuals and that the \( {\text{AR}}\left( 3 \right) \) model has extracted all the useful information from the historical time series. Besides, the histogram and QQ-plot of the residuals (Fig. 36.3) show that the residual sequence follows a normal distribution, indicating that the model is adequate. The ten-step-ahead prediction of the passenger flow volume using model (36.21) and the comparison between the actual and predicted values are shown in Table 36.8 and Fig. 36.4, respectively.

Fig. 36.2

Residual correlation diagnostics for AR(3) model

Fig. 36.3

Residual normality diagnostics for AR(3) model

Table 36.8 Prediction based on \( {\text{AR}}\left( 3 \right) \) model
Fig. 36.4

Actual values against prediction based on \( {\text{AR}}\left( 3 \right) \) model

Although the diagnostic results show that \( {\text{AR}}\left( 3 \right) \) is adequate for fitting the sequence, the 95% confidence interval of the prediction is very wide. Since a wider confidence region implies lower prediction accuracy, the prediction, especially in the long term, may not be accurate.

Case 2: ARMA modeling with the first-order differenced time series. According to the ACF plot shown in Fig. 36.5, although the autocorrelation decreases exponentially, it does not fall within the confidence interval until lag 5. Considering that the ACF decays gradually rather than rapidly to zero, the time series is regarded as non-stationary and needs to be differenced, so an ARIMA model is applied to fit the data. The two models with the minimum BIC values, namely \( {\text{ARIMA}}\left( {0,1,1} \right) \) and \( {\text{ARIMA}}\left( {2,1,0} \right) \), are chosen as candidate models. The results of the parameter estimation and the fitting statistics for each candidate model are summarized in Tables 36.9 and 36.10, respectively.

Fig. 36.5

Trend and ACF plots for the original time series

Table 36.9 Parameter estimation results of two candidate models (in Case 2)
Table 36.10 Fitting statistics of two candidate models (in Case 2)

It can be seen that \( {\text{ARIMA}}\left( {2,1,0} \right) \) has a smaller AIC value, indicating a higher goodness of fit, and a smaller MAPE value, indicating a higher prediction precision, while its SBC value is slightly higher than that of \( {\text{ARIMA}}\left( {0,1,1} \right) \). Therefore, the \( {\text{ARIMA}}\left( {2,1,0} \right) \) model is chosen as the optimal one for the first-order differenced time series. After parameter estimation and significance testing, all the coefficients are found to be significant. The optimal model is therefore determined as below.

$$ \left( {1 - B} \right)Z_{t} = - 0.33412\left( {1 - B} \right)Z_{t - 1} - 0.28956\left( {1 - B} \right)Z_{t - 2} + a_{t} $$
(36.22)

By performing the residual diagnostics (similar to Case 1), it is observed that the residuals are uncorrelated and the \( {\text{ARIMA}}\left( {2,1,0} \right) \) model has extracted all the useful information from the time series. Besides, the residual sequence follows a normal distribution, indicating that the model is adequate. The ten-step-ahead prediction of the passenger flow volume using model (36.22) and the comparison between the actual and predicted values are shown in Table 36.11 and Fig. 36.6, respectively.

Table 36.11 Prediction based on \( {\text{ARIMA}}\left( {2,1,0} \right) \) model
Fig. 36.6 Actual values against forecasts based on \( {\text{ARIMA}}\left( {2,1,0} \right) \) model

Although the diagnostic results show that \( {\text{ARIMA}}\left( {2,1,0} \right) \) is adequate for fitting the sequence, the 95% confidence region of its prediction is still very wide.
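The ten-step-ahead point forecast from an ARIMA(2,1,0) model can be computed by a simple recursion: future shocks \( a_t \) are replaced by their mean, zero, and the forecast differences are cumulated back onto the level. The sketch below uses the coefficients reported in (36.22), assuming the AR lags of the differenced series are 1 and 2 as an ARIMA(2,1,0) implies; the recent daily volumes are hypothetical placeholders, not the paper's data.

```python
import numpy as np

# AR coefficients of the differenced series, from model (36.22)
phi1, phi2 = -0.33412, -0.28956

def forecast_arima_210(z, steps):
    """h-step forecast for w_t = phi1*w_{t-1} + phi2*w_{t-2} + a_t,
    where w_t = (1 - B)Z_t, with future shocks a_t set to zero."""
    w = list(np.diff(z))              # first differences of the observed levels
    level, preds = z[-1], []
    for _ in range(steps):
        w_next = phi1 * w[-1] + phi2 * w[-2]
        w.append(w_next)
        level = level + w_next        # integrate the difference back to a level
        preds.append(level)
    return np.array(preds)

# hypothetical recent daily passenger volumes (illustrative only)
z = np.array([5200.0, 5350.0, 5100.0, 5400.0, 5250.0])
preds = forecast_arima_210(z, 10)
print(preds)
```

Because the AR part of the differenced series is stationary, the forecast differences decay toward zero and the level forecast flattens, which is one reason the long-horizon prediction carries little information beyond the trend.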

Case 3: ARMA modeling with a quadratic function trend. According to the original time series plot in Fig. 36.1, the passenger flow time series appears to have a quadratic trend. Two trend variables, _LINEAR_ and _SQUARE_ (as shown in Table 36.12), representing the linear and quadratic relationships, respectively, are pre-generated.

Table 36.12 Part of dataset with two trend variables

Then, the same steps as in the previous two cases are followed to build a quadratic ARMA model. The three models with the minimum BIC values are chosen as candidate models, namely \( {\text{Quadratic}} + {\text{AR}}\left( 3 \right) \), \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,1} \right) \) and \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \). The results of the parameter estimation and the fitting statistics for each candidate model are summarized in Tables 36.13 and 36.14, respectively.

Table 36.13 Parameter estimation results of three candidate models (in Case 3)
Table 36.14 Fitting statistics of three candidate models (in Case 3)

Compared with the other two models, the \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model has the smallest AIC and MAPE, indicating the highest goodness of fit and the highest prediction precision. Although \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,1} \right) \) has the smallest SBC, its MAPE is the highest, indicating the lowest prediction accuracy. Therefore, \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) is chosen as the optimal model with a quadratic trend.

After parameter estimation and significance testing, not only the MA and AR coefficients but also the coefficients of the linear and quadratic trend variables are found to be significant. The optimal model is therefore determined as below.

$$ \begin{aligned} Z_{t} & = 113.79672t - 0.92369t^{2} + 0.76222Z_{t - 1} \\ & \quad + a_{t} - 0.26487a_{t - 1} - 0.05254a_{t - 2} + 0.35486a_{t - 3} \\ \end{aligned} $$
(36.23)

By performing the residual diagnostics (similar to Case 1), it is observed that the residuals are uncorrelated and the \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model has extracted all the useful information from the time series. Besides, the histogram and QQ-plot (obtained using the same method as in Case 1) show that the residual sequence follows a normal distribution, indicating that the model is adequate.

Using model (36.23), the ten-step-ahead prediction and the comparison between the actual and predicted values are shown in Table 36.15 and Fig. 36.7, respectively.

Table 36.15 Forecasts based on \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model
Table 36.16 Fitting statistics of three optimal models
Fig. 36.7 Actual values against forecasts based on \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model

The 95% confidence region is significantly narrower, but the prediction does not capture the rapid growth at the end of the sequence, which is probably caused by external factors such as weather and holiday policies. If further improvement is needed, such external influences must be included in the model.
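The Quadratic + ARMA construction in Case 3 amounts to regressing the series on the _LINEAR_ and _SQUARE_ trend variables and then modeling the residual with an ARMA process. A minimal sketch of the detrending step, on synthetic data with invented trend coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
t = np.arange(1, n + 1, dtype=float)

# synthetic series: quadratic trend plus AR(1) noise (all numbers illustrative)
noise = np.zeros(n)
for i in range(1, n):
    noise[i] = 0.5 * noise[i - 1] + rng.normal()
z = 100 + 12.0 * t - 0.04 * t**2 + noise

# Step 1: OLS on the _LINEAR_ (t) and _SQUARE_ (t^2) regressors, as in Table 36.12
X = np.column_stack([np.ones(n), t, t**2])
beta, *_ = np.linalg.lstsq(X, z, rcond=None)
detrended = z - X @ beta

# Step 2 (not shown): fit an ARMA(p,q) to `detrended`, as the paper does in SAS
print(beta)   # trend coefficients recovered close to (100, 12.0, -0.04)
```

Forecasts then combine the extrapolated quadratic trend with the ARMA forecast of the detrended part, which is why the prediction interval stays roughly constant in width while the trend carries the growth.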

5 Conclusion

In this work, the prediction of the passenger flow volume in the bus transportation system is performed using three kinds of time series models: AR, ARIMA and quadratic ARMA. First, the bus IC card payment records were transformed into a time series representing the daily passenger volume on line No. 18. Then, time series analysis was applied and two optimal models, \( {\text{AR}}\left( 3 \right) \) and \( {\text{ARIMA}}\left( {2,1,0} \right) \), were found. Both models performed well in terms of goodness of fit but failed to attain accurate predictions. In order to achieve a higher prediction accuracy, an ARMA model with a quadratic trend was further explored, and a \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model was established for the time series, which achieves a better balance between fitting and forecasting. The fitting statistics of these models are shown in Table 36.16.

Each model has its own advantages and disadvantages, which are discussed one by one below.

  • \( {\mathbf{AR}}\left( 3 \right) \): The \( {\text{AR}}\left( 3 \right) \) model has no obvious advantages or disadvantages, because its performance is outstanding in neither goodness of fit nor prediction accuracy. Its SBC, MAPE and MSE values are all at the middle level, although its AIC value is slightly higher than those of the other two models. The one advantage worth mentioning is that, since differencing is not needed, this model is the simplest and most straightforward, so its application cost is the lowest.

  • \( {\mathbf{ARIMA}}\left( {2,1,0} \right) \): In terms of fitting effect, the \( {\text{ARIMA}}\left( {2,1,0} \right) \) model has the lowest AIC and SBC values, indicating the highest goodness of fit. However, it has the highest MAPE and MSE among the three models, indicating the greatest deviation between the predicted and true values. Moreover, its prediction confidence region widens over time, so it may perform poorly in long-term prediction. Since it has the best fitting effect, however, it can accurately describe the surge at the end of the original time series, so its prediction is reliable when the model is used to predict the most recent values.

  • \( {\mathbf{Quadratic}} + {\mathbf{ARMA}}\left( {1,3} \right) \): Compared with the other two models, the \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model has the smallest MAPE and MSE values, so it achieves the highest prediction accuracy. Most importantly, it has a unique advantage over the others: its prediction confidence interval is narrower and of roughly constant width over time, so it performs more effectively with high prediction accuracy.
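The MAPE and MSE figures compared in the list above are straightforward to reproduce. A small sketch with made-up actual and predicted values (not the paper's):

```python
import numpy as np

def mape(actual, pred):
    """Mean absolute percentage error, in percent."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return 100.0 * np.mean(np.abs((actual - pred) / actual))

def mse(actual, pred):
    """Mean squared error."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return np.mean((actual - pred) ** 2)

# illustrative daily volumes and forecasts, not the paper's data
actual = [5200, 5300, 5150]
pred   = [5100, 5350, 5200]
print(round(mape(actual, pred), 2))   # -> 1.28 (percent)
print(round(mse(actual, pred), 1))    # -> 5000.0
```

MAPE is scale-free, which makes it the natural accuracy measure for comparing models on passenger counts; MSE penalizes large misses more heavily, which matters when the concern is sudden surges.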

The initial objective of this project, and the main demand from the traffic management, is to improve the forecast accuracy. Accordingly, the accuracy of the prediction is the most important factor in evaluating the solution performance. It can thus be concluded that the \( {\text{Quadratic }} + {\text{ARMA}}\left( {1,3} \right) \) model is the most appropriate of the three. Although the \( {\text{ARIMA}}\left( {2,1,0} \right) \) model fits the current data best, its prediction confidence region widens quickly, so it may be useful mainly for short-term prediction.

6 Open Questions and Potential Improvements

Although the ARMA model with a quadratic function trend performs best in our case, its application range is limited, because the time series must show a quadratic trend. In reality, only the short-term change of passenger flow may show such a trend. For long-term daily passenger flow, if the data span is more than one year, the volume usually fluctuates within a limited range around a fixed value, so a stationary time series model may be more suitable for such data. In addition, in view of the change of daily passenger flow in a given city, a seasonal factor with a weekly cycle may be considered because of the difference in commuting patterns between weekdays and weekends. In this case, a seasonal ARIMA model may be built to fit the series.
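The weekly seasonality suggested here would enter a seasonal ARIMA model through differencing at lag 7. A sketch on simulated daily counts (the weekday/weekend pattern below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 140  # 20 weeks of daily passenger counts
# synthetic daily flow with a period-7 weekday/weekend pattern (illustrative)
weekly = np.tile([1.0, 1.1, 1.1, 1.1, 1.2, 0.7, 0.6], n // 7)
z = 5000 * weekly + rng.normal(scale=50, size=n)

# seasonal differencing at lag s=7: the "D" step of a SARIMA(p,d,q)(P,D,Q)_7
sz = z[7:] - z[:-7]

# the weekly pattern cancels, leaving a far less variable series to model
print(np.std(z), np.std(sz))
```

After seasonal differencing, the remaining series can be checked for stationarity and fitted with the same Box-Jenkins steps used in Cases 1 and 2.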

As mentioned at the end of Sect. 36.4, the time series method has limitations. When the prediction time span is long, only a rough future trend can be obtained, not the specific volatility. In order to accurately describe future fluctuations, more external factors, such as weather, temperature, holidays and events, might be introduced into the model. As the historical data is continuously updated and the sample size grows, the algorithm should be updated and adjusted accordingly.