1 Introduction

With the ongoing development of intelligent urban public transportation systems in China, the investigation of bus passenger flow has become a key research subject (see [1] for instance). In order to remain competitive in the transportation market and provide high-quality services to passengers, bus companies need to continuously track how passenger demand changes [2]. However, the passenger flow in a bus system is influenced by many factors, including commuting, holidays, weather, temperature, etc. [2]. For example, the volume may increase suddenly on cold and snowy days, so that the bus transport capacity fails to meet passenger demand, which puts tremendous pressure on bus transportation management. Given the limited bus resources, some popular routes are often in short supply, which can result in passenger flow detention and reduced service quality. The bus companies might thus lose competitiveness in the transportation market. Therefore, it is necessary to find an effective solution to the problems caused by such bursts of passenger flow and to adjust the current management policies in support of optimal bus resource allocation, line planning and bus scheduling. Such a solution is of great importance for improving both the service capability and the working efficiency of the public transportation system.

The driving motivation of this work is to find a reliable method to solve the problems above. This work is socially significant, since urban transport plans and policies could be designed or adjusted to adapt to market demand. The implementation involves a combination of big data processing, time series modeling and analysis. The primary objective is to apply time series models and data analytics to explore the passenger demand based on real data and then to predict the daily passenger volume on a given bus line. The study will mainly focus on the following two aspects:

  • Descriptive statistics on the trip characteristics of passengers, including riding date and time, and on the volume and variation characteristics of transit passenger flow at different stations in a given bus line.

  • Time series parameter estimation and passenger volume prediction based on the bus ticket sale records.

In this work, SAS (version 9.4) (see [3] for instance) is used to obtain the descriptive statistics and to perform the time series analysis and prediction.

The rest of the paper is organized as follows. Section 36.2 reviews the development of time series analysis and recent works on its application to public transportation systems. Section 36.3 presents time series related concepts and methods, as well as our data analysis process. Section 36.4 summarizes and evaluates the empirical results. Finally, the conclusion is discussed in Sect. 36.5, and open questions and possible future improvements are proposed in Sect. 36.6.

2 Literature Review

Prior to 1920, time series analysis was limited to drawing lines through a mass of data. In 1927, Yule [4] first introduced the concept of 'autoregressive', in which the variables are related over time but time itself is not a causal factor, and pioneered the autoregressive (AR) model of order two when studying the number of sunspots and exploring the period of the disturbed sequence. The autoregressive model he established is a special kind of stationary time series. In 1931, Walker [5] expanded and generalized the AR model to higher orders. Meanwhile, Slutsky [6] was interested in the randomness of time series, regarded them as perturbations, and proposed the moving average (MA) model. In 1938, Wold [7] proved that a discrete stationary process consists of implicit periodicity and linear regression. The hidden cycle is a deterministic component, while the linear regression part consists of a moving average and an autoregressive process, which are non-deterministic components of random perturbations. Any stationary time series, once its deterministic components are eliminated, can be reduced to a linear combination of random perturbations. This well-known time series decomposition idea is the theoretical basis of the autoregressive moving average (ARMA) model. Taking non-stationarity into consideration, the autoregressive integrated moving average (ARIMA) model was proposed in the landmark work [2]. The book provided a systematic approach to analyzing and forecasting time series and discussed how to identify, estimate and diagnose the ARIMA model.

The application of time series models in modern society has spread rapidly, as the methodology was extended to non-stationary processes (see for instance [8]). A large number of empirical results show that most time series built on socio-economic phenomena are non-stationary and have a trend (see for instance [9]). According to Xia [9], there are two types of time trend: one is deterministic and the other is stochastic. A deterministic time trend is one that can be characterized by a function of time. The commonly used trend functions are linear functions, quadratic parabola functions, exponential functions and logarithmic functions. By contrast, a time series with a stochastic trend cannot be expressed by deterministic functions of time. In this case, the original process is differenced one or more times and the ARIMA model is then used to fit the data.

In the literature, existing research suggests that time series analysis has been used effectively to study different public transportation systems. For the subway system in Shanghai, Zhu [10] constructed an ARIMA model for the daily passenger flow by comparing the change rate of the daily volume with that of the '7-day' average volume. For airport terminal departure passenger traffic, Li et al. [11] took the daily periodicity of the process into consideration and proposed a seasonal autoregressive integrated moving average (SARIMA) model to predict the passenger flow in Kunming Changshui International Airport. For railway passenger flow forecasting, a time series model was established in [12] combining the long-term trend, the seasonal factors and the weather factors. To achieve accurate real-time taxi passenger hotspot prediction, Jamil and Akbar [13] proposed an automatic ARIMA model that determines the model order automatically; the algorithm they designed overcame the common obstacles of subjectivity and complexity. All these applications make use of knowledge of the passenger flow and provide instructive insight into the management of public transportation systems, which is of referential significance for our investigation.

3 Methodology

3.1 Stationary Time Series Models

Time series analysis aims to reveal the underlying dynamics and structures that affect the observable data, thus establishing a suitable theoretical model for monitoring and predicting the data. For the definition of a stationary time series (or simply 'time series'), one can refer for instance to Definition 1.3.2 in [14]. In this work, the daily passenger flow volume \( \{ Z_{t} \} \) at any unit of time t will be regarded as a discrete-time stochastic process. Roughly speaking, assuming that \( \{ Z_{t} \} \) is a stationary time series with mean 0 and that \( Z_{t} \) depends only on its historical records \( Z_{t - 1} , Z_{t - 2} , \ldots \), we can use the observed historical data to estimate the dynamic properties, build optimal models and then use these models for prediction. In this project, we construct discretely sampled time series based on the actual daily records of passenger volume on a given bus line. A detailed description of the database can be found in Sect. 36.4.1. In the rest of this subsection, some fundamental concepts will be introduced. One may refer to [8] for the details.

Autoregressive Model: AR (p). The autoregressive (AR) model is a very common time series model. The general p-order autoregressive model, denoted as AR(p), is given by:

$$ Z_{t} = \varphi_{1} Z_{t - 1} + \varphi_{2} Z_{t - 2} + \cdots + \varphi_{p} Z_{t - p} + a_{t} , $$
(36.1)

where the parameters \( \varphi_{1} ,\varphi_{2} , \ldots ,\varphi_{p} \) are called autoregressive coefficients and are to be estimated. The random error terms \( \left\{ {a_{t} } \right\} \) are white noise, i.e., a sequence of i.i.d. random variables with \( a_{t} \, \sim \,N\left( {0, \sigma_{a}^{2} } \right) \), and \( a_{t} \) is independent of \( Z_{t - 1} , Z_{t - 2} , \ldots , Z_{t - p} \).
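Although the analysis in this paper is carried out in SAS, the mechanics of an AR model can be sketched in a few lines of Python. The following illustrative example (with made-up coefficients, not estimated from the bus data) simulates an AR(2) process and recovers its coefficients by conditional least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: Z_t = 0.6 Z_{t-1} + 0.2 Z_{t-2} + a_t,
# with a_t ~ N(0, 1) white noise (illustrative coefficients only).
n = 500
phi = np.array([0.6, 0.2])
z = np.zeros(n)
for t in range(2, n):
    z[t] = phi[0] * z[t - 1] + phi[1] * z[t - 2] + rng.normal()

# Conditional least squares: regress Z_t on (Z_{t-1}, Z_{t-2}).
X = np.column_stack([z[1:-1], z[:-2]])
y = z[2:]
phi_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(phi_hat)  # close to [0.6, 0.2]
```

In SAS, the analogous estimation is carried out by the ARIMA procedure rather than by hand; the sketch only illustrates what "estimating the autoregressive coefficients" means.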

Moving Average Model: MA (q). The general q-order moving average model, denoted as MA (q), is given by:

$$ Z_{t} = a_{t} - \theta_{1} a_{t - 1} - \theta_{2} a_{t - 2} - \cdots - \theta_{q} a_{t - q} , $$
(36.2)

where \( \theta_{1} ,\theta_{2} , \ldots ,\theta_{q} \) are called moving average coefficients and are to be estimated.

Autoregressive Moving Average Model: ARMA (p, q). The autoregressive moving average (ARMA) model combines an AR model with an MA model to produce a new process that models the time series. The general ARMA model, denoted as ARMA (p, q), is given by

$$ \begin{aligned} Z_{t} & = \varphi_{1} Z_{t - 1} + \varphi_{2} Z_{t - 2} + \cdots + \varphi_{p} Z_{t - p} \\ & \quad + a_{t} - \theta_{1} a_{t - 1} - \theta_{2} a_{t - 2} - \cdots - \theta_{q} a_{t - q} . \\ \end{aligned} $$
(36.3)

Autoregressive Integrated Moving Average Model: ARIMA(p, d, q). Notice that the \( {\text{AR}},{\text{MA}} \), and \( {\text{ARMA }} \) models describe stationary time series. However, a time series is not necessarily stationary; it may, for example, contain a linear trend component. A non-stationary time series must first be transformed into a stationary one by differencing, which is expressed through the backward shift operator. Such a process is called an \( {\mathbf{ARIMA}} \) process, denoted as ARIMA (p, d, q), and is given by

$$ \left( {1 - \varphi_{1} B - \cdots - \varphi_{p} B^{p} } \right)\left( {1 - B} \right)^{d} Z_{t} = \left( {1 - \theta_{1} B - \cdots - \theta_{q} B^{q} } \right)a_{t} , $$
(36.4)

where B is the backward shift operator (lag) defined as \( \left( {1 - B} \right)Z_{t} = Z_{t} - Z_{t - 1} \) and d is the number (order) of the difference to make the process stationary.
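As a concrete illustration of the differencing operator (a Python sketch, not part of the SAS workflow used later in the paper): applying \( (1 - B) \) is simply first differencing, and a d-fold difference removes a polynomial trend of degree d.

```python
import numpy as np

# (1 - B)Z_t = Z_t - Z_{t-1} is the first difference, np.diff in numpy.
t = np.arange(100, dtype=float)
z = 3.0 + 0.5 * t          # a purely linear trend, for illustration

dz = np.diff(z, n=1)       # d = 1: the trend collapses to a constant 0.5
ddz = np.diff(z, n=2)      # d = 2: identically zero

print(dz[:3])   # [0.5 0.5 0.5]
print(ddz[:3])  # [0. 0. 0.]
```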

ARMA Model with a Quadratic Function Trend. Besides a linear trend component, other trend forms may also be taken into account. If the trend of a time series has the shape of a quadratic function, it can be fitted by one. The ARMA model with a quadratic function trend is given by

$$ \begin{aligned} Z_{t} & = {\text{quadratic function}} + {\text{ARMA process}} \\ & = b_{1} t + b_{2} t^{2} + \varphi_{1} Z_{t - 1} + \varphi_{2} Z_{t - 2} + \cdots + \varphi_{p} Z_{t - p} + a_{t} \\ & \quad - \theta_{1} a_{t - 1} - \theta_{2} a_{t - 2} - \cdots - \theta_{q} a_{t - q} , \\ \end{aligned} $$
(36.5)
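A hedged sketch of how such a model can be handled in practice (in Python rather than the SAS procedures used in Sect. 36.4): fit the quadratic trend by ordinary least squares, and the detrended residuals are then what an ARMA model would be fitted to. The trend coefficients below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic series: quadratic trend plus Gaussian noise (made-up values).
t = np.arange(200, dtype=float)
z = 0.04 * t + 0.001 * t**2 + rng.normal(0.0, 1.0, size=t.size)

# Least squares fit of the trend, with regressors t and t^2
# (analogous to the _LINEAR_ and _SQUARE_ variables used later in SAS).
X = np.column_stack([t, t**2])
coef, *_ = np.linalg.lstsq(X, z, rcond=None)

resid = z - X @ coef   # detrended residuals: the ARMA part of the model
print(coef)            # close to [0.04, 0.001]
```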

In the rest of this section, the application of time series method to the passenger flow prediction will be introduced. This can be achieved by the descriptive and inferential studies on the current data.

3.2 Time Series Analysis

According to [8], the main steps of time series analysis and modeling are:

  1. Stationarity and white noise test

  2. Model identification (i.e., specifying the lag order)

  3. Model selection and parameter estimation

  4. Diagnostic checking

  5. Prediction based on the optimal model.

Stationary Test. The first step of time series analysis is to verify whether the series is stationary. There are two main methods: one is the graph test, which illustrates the features shown in the time series plots and autocorrelation diagrams, while the other one is the unit root test.

Graph Test.

  1. Time series plot

According to the property that the mean and variance of a stationary time series are constant, the time series plot should show the process fluctuating randomly around a constant value with a similar range of fluctuation throughout. The time series is usually not stationary if there is a significant trend or periodicity.

  2. Autocorrelation function (ACF) plot

The ACF describes the degree of linear correlation between observations of a time series at different lags. Stationary time series typically exhibit only short-term correlation. The time series is considered stationary if the autocorrelation function declines rapidly to zero and all the values fall into the confidence interval by lag 3. In contrast, the autocorrelation of a non-stationary series declines slowly.
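The sample ACF described above is straightforward to compute directly; the following Python sketch (an illustration, not the SAS output used in this work) estimates \( \hat{\rho}_{k} \) and checks it on simulated white noise:

```python
import numpy as np

def sample_acf(z, max_lag):
    """Sample autocorrelation function rho_k for k = 0..max_lag."""
    z = np.asarray(z, dtype=float)
    zc = z - z.mean()
    denom = np.sum(zc * zc)
    return np.array([np.sum(zc[k:] * zc[:len(zc) - k]) / denom
                     for k in range(max_lag + 1)])

# White noise should give near-zero autocorrelations beyond lag 0;
# roughly 95% of them lie inside +/- 1.96 / sqrt(n).
rng = np.random.default_rng(2)
a = rng.normal(size=1000)
acf = sample_acf(a, 10)
print(acf[0])  # exactly 1.0 by construction
```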

Unit Root Test. The unit root test is used to check whether a time series needs to be differenced. The procedure is described in [15]. Among the unit root tests, the most widely used one is the Dickey–Fuller (DF) test, which is applicable to the AR(1) model:

$$ Z_{t} = \varphi_{1} Z_{t - 1} + a_{t} = \left( {1 - \varphi_{1} B} \right)^{ - 1} a_{t} = \mathop \sum \limits_{k = 0}^{\infty } \varphi_{1}^{k} a_{t - k} , $$
(36.6)

where \( \left| {\varphi_{1} } \right| < 1 \). Since the root of the characteristic equation \( 1 - \varphi_{1} B = 0 \) is \( \varphi_{1}^{ - 1} \), an equivalent statement of the stationarity condition is that the root must lie outside the unit circle. So it suffices to test whether the root of the characteristic equation is outside the unit circle, with null and alternative hypotheses, respectively:

$$ \begin{aligned} & H_{0} :\left\{ {Z_{t} } \right\}\,{\text{is non - stationary}},\,\left| {\varphi_{1} } \right| = 1,\,{\text{a regular difference is needed}} \\ & H_{1} :\left\{ {Z_{t} } \right\}\,{\text{is stationary}},\,\left| {\varphi_{1} } \right| < 1,\,{\text{the series does not need to be differenced}} \\ \end{aligned} $$

The DF test is only applicable to the AR(1) model. In order to generalize the DF test to AR(p) processes, the augmented Dickey–Fuller (ADF) test was proposed in [16], with the same hypotheses and decision rules; it also allows two additional terms, a drift and a deterministic trend.
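For intuition, the DF regression in its simplest no-constant form can be sketched in Python: regress \( \Delta Z_{t} \) on \( Z_{t-1} \) and examine the t-statistic of the slope. This is an illustration only; in practice a statistical package with proper DF critical values (such as the SAS ADF test used in Sect. 36.4) would be employed.

```python
import numpy as np

def df_tstat(z):
    """Dickey-Fuller t-statistic for the no-constant AR(1) case:
    regress dZ_t = Z_t - Z_{t-1} on Z_{t-1}; a strongly negative value
    (compare with DF critical values, e.g. about -1.95 at the 5% level)
    rejects the unit-root null hypothesis."""
    z = np.asarray(z, dtype=float)
    dz, lag = np.diff(z), z[:-1]
    slope = np.sum(lag * dz) / np.sum(lag * lag)
    resid = dz - slope * lag
    s2 = np.sum(resid**2) / (len(dz) - 1)
    se = np.sqrt(s2 / np.sum(lag * lag))
    return slope / se

rng = np.random.default_rng(3)
stationary = np.zeros(500)
for t in range(1, 500):                        # AR(1) with phi = 0.5
    stationary[t] = 0.5 * stationary[t - 1] + rng.normal()
random_walk = np.cumsum(rng.normal(size=500))  # phi = 1 (unit root)

print(df_tstat(stationary))   # strongly negative -> reject H0
print(df_tstat(random_walk))  # near zero -> cannot reject H0
```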

White Noise Test. To verify whether a process is worth further time series modeling and analysis, a white noise test should be performed. By the definition of white noise, the autocorrelation coefficient at any lag k is \( \rho_{k} = 0 \). This is the ideal situation; in practice, most of the estimated autocorrelation coefficients \( \hat{\rho }_{k} \) are not exactly zero due to the finiteness of the sample, but they fluctuate randomly around zero with small amplitude. Following the methods summarized by Wei [17], instead of considering each autocorrelation individually, the first m autocorrelation coefficients are considered as a whole, and a statistic is constructed to determine whether the sequence is white noise or whether there exists correlation between observations. The null and alternative hypotheses for the white noise test are, respectively:

$$ \begin{aligned} & H_{0} :\rho_{1} = \rho_{2} = \cdots = \rho_{m} = 0,\;\forall m \ge 1,\;{\text{so}}\;\left\{ {Z_{t} } \right\}\;{\text{is a white noise sequence}} \\ & H_{1} :\forall m \ge 1,\;\exists \,k \le m\;{\text{with}}\;k \ne 0\;{\text{such that}}\;\rho_{k} \ne 0,\;{\text{so}}\;\left\{ {Z_{t} } \right\}\;{\text{is not a white noise sequence}} \\ \end{aligned} $$

This is an approximate statistical hypothesis test that none of the autocorrelations of the series up to a given lag are significantly different from 0. If this is true for all m lags, then there is no information in the series to model and no \( {\text{ARIMA }} \) model is needed.
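A common portmanteau statistic for this test is the Ljung–Box form, \( Q(m) = n(n+2)\sum_{k=1}^{m}\hat{\rho}_{k}^{2}/(n-k) \), which is approximately chi-square with m degrees of freedom under the null. The Python sketch below is an illustration, not the SAS implementation used in this work:

```python
import numpy as np

def ljung_box_q(z, m):
    """Ljung-Box portmanteau statistic Q(m); approx. chi-square(m) under H0."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    zc = z - z.mean()
    denom = np.sum(zc * zc)
    q = 0.0
    for k in range(1, m + 1):
        rho_k = np.sum(zc[k:] * zc[:n - k]) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q

rng = np.random.default_rng(4)
noise = rng.normal(size=500)     # white noise: Q should be small
ar = np.zeros(500)               # strongly correlated AR(1): Q is large
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()

# Compare each Q(6) with the chi-square(6) 95% quantile (about 12.59).
print(ljung_box_q(noise, 6))
print(ljung_box_q(ar, 6))
```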

Methods of Order Specification. To determine the order (p, q) of an ARMA model, SAS provides a list of candidate order combinations, based mainly on the ESACF, SCAN and MINIC methods.

The extended sample autocorrelation function (ESACF) method. Since the ACF and PACF of an \( {\text{ARMA}}\left( {p,q} \right) \) model are both trailing, these two functions cannot be jointly used to determine the order \( \left( {p, q} \right) \). For this situation, Tsay and Tiao [18] proposed a general iterative regression method that uses the ESACF to estimate the order of the model. If the time series \( Z_{t} \) follows an \( {\text{ARMA}}\left( {p, q} \right) \) process, then fitting an \( {\text{AR}}\left( p \right) \) model to it yields inconsistent estimates of the autoregressive coefficients \( \hat{\varphi }_{i} ,i = 1,2, \ldots ,p \). Therefore, the regression residuals must be introduced into the model as explanatory variables; after q such iterations, the estimated model is as follows:

$$ Z_{t} = \mathop \sum \limits_{i = 1}^{p} \varphi_{i}^{\left( q \right)} Z_{t - i} + \mathop \sum \limits_{i = 1}^{q} \alpha_{i}^{\left( q \right)} \hat{e}_{t - i}^{{\left( {q - i} \right)}} + e_{t}^{\left( q \right)} . $$
(36.7)

Now the estimator \( \widehat{{\varphi_{i} }}^{\left( q \right)} \) is consistent. Based on this idea, for \( m = 0,1,2, \ldots \), let \( \widehat{{\varphi_{i} }}^{\left( j \right)} \) be the autoregressive coefficient of the \( {\text{AR}}\left( m \right) \) model estimated at the \( j{\text{th}} \) iteration; then the ESACF \( \widehat{{\rho_{j} }}^{\left( m \right)} \) is defined as the sample autocorrelation function of the following transformed series:

$$ y_{t} = \left( {1 - \widehat{{\varphi_{1} }}^{\left( j \right)} B - \widehat{{\varphi_{2} }}^{\left( j \right)} B^{2} - \cdots - \widehat{{\varphi_{m} }}^{\left( j \right)} B^{m} } \right)Z_{t} . $$
(36.8)

Regarding the ESACF, there exists the following probabilistic convergence:

$$ \hat{\rho }_{j}^{\left( m \right)} \mathop \to \limits^{p} \left\{ {\begin{array}{*{20}l} {0,} \hfill & {0 \le m - p \le j - q;} \hfill \\ {X \ne 0,} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right.. $$
(36.9)

Because of this property, the ESACF table for an ARMA (1,1) model has the pattern displayed in Table 36.1, characterized by the fact that all zeroes form a triangle with vertex (1,1). Similarly, for a general \( {\text{ARMA }}\left( {p, q} \right) \) model, the vertex of the triangle of zeroes is located at \( \left( {p, q} \right) \), which gives the rule for identifying the order of the model. In fact, SAS provides two tables, one for the ESACF estimates and the other for the significance tests.

Table 36.1 ESACF for \( {\text{ARMA }}\left( {1,1} \right) \) model, where X is a non-zero number

The smallest canonical correlation coefficient (SCAN) method. Tsay and Tiao [19] first put forward this idea, and Choi [20] gave a concrete method for solving and judging the \( {\text{ARMA}}\left( {p,q} \right) \) model. Only the conclusion of the method is given here. First, the SCAN of each order combination is calculated, and a SCAN table similar to the ESACF table is formed. The only difference is that the judgment is based on a rectangle of zeroes, whose vertex position gives the order of the model. In our project, SAS gives two tables, one for the SCAN coefficient estimates and the other for the chi-square tests of coefficient significance.

The minimum information criterion (MINIC) method. The minimum information criterion (MINIC) method, proposed by Hannan and Rissanen [21], can tentatively identify the order of a stationary and invertible ARMA process. The MINIC table is constructed by computing the Bayesian information criterion (BIC) for various autoregressive and moving average orders. Suppose L is the value of the likelihood function evaluated at the parameter estimates of \( {\text{ARMA}}(p,q) \), N is the number of observations, and k is the number of estimated parameters; then the BIC of the \( {\text{ARMA}}(p,q) \) model can be calculated as:

$$ {\text{BIC}}\left( {p,q} \right) = k\,\ln \left( N \right) - 2\ln \left( L \right) $$
(36.10)

Values of \( {\text{BIC}}\left( {p,q} \right) \) that cannot be computed are set to missing. For large autoregressive and moving average test orders with relatively few observations, a nearly perfect fit can result; this condition is signaled by a large negative \( {\text{BIC}}\left( {p,q} \right) \) value. The MINIC table has the form of Table 36.2. The model with the minimum BIC value is chosen as the best fitting one.

Table 36.2 MINIC table

Methods of Parameter Estimation. There are various ways to estimate the parameters, such as moment estimation, least squares estimation and maximum likelihood estimation. In this work, maximum likelihood estimation is adopted, which is the method most commonly recommended when using SAS for prediction.

Maximum likelihood method. According to the discussion of maximum likelihood in time series analysis by Guidolin and Pedio [22], under the maximum likelihood criterion the sample is regarded as coming from the population under which this sample has the highest probability of occurring. The maximum likelihood estimate of the unknown parameters therefore maximizes the likelihood function \( L\left( {\varphi_{1} , \ldots ,\varphi_{p} , \theta_{1} , \ldots ,\theta_{q} } \right) \). Let \( p\left( {z_{1} ,z_{2} , \ldots ,z_{n} , \varphi_{1} , \ldots ,\varphi_{p} , \theta_{1} , \ldots ,\theta_{q} } \right) \) be the joint density function; then \( L \) can be written as:

$$ L\left( {\varphi_{1} , \ldots ,\varphi_{p} , \theta_{1} , \ldots ,\theta_{q} } \right) = p\left( {z_{1} ,z_{2} , \ldots ,z_{n} , \varphi_{1} , \ldots ,\varphi_{p} , \theta_{1} , \ldots ,\theta_{q} } \right) $$
(36.11)

The distribution of the population must be known in order to use maximum likelihood. However, in time series analysis the distribution of the population is often unknown. To facilitate calculation and analysis, it is usually assumed that the sequence follows a multivariate normal distribution:

$$ \begin{aligned} Z_{t} & = \varphi_{1} Z_{t - 1} + \varphi_{2} Z_{t - 2} + \cdots + \varphi_{p} Z_{t - p} + a_{t} - \theta_{1} a_{t - 1} \\ & \quad - \theta_{2} a_{t - 2} - \cdots - \theta_{q} a_{t - q} ,\tilde{z} = \left( {z_{1} ,z_{2} , \ldots ,z_{n} } \right)^{\prime } , \\ \end{aligned} $$
(36.12)
$$ \tilde{\beta } = \left( {\varphi_{1} , \ldots ,\varphi_{p} , \theta_{1} , \ldots ,\theta_{q} } \right)^{\prime } , $$
(36.13)
$$ \Sigma_{n} = E(\tilde{z}\tilde{z}^{\prime}) = \Omega \sigma_{a}^{2} . $$
(36.14)

The likelihood function of \( \tilde{z} \) is

$$ L\left( {\tilde{\beta }} \right) = p\left( {\tilde{z};\tilde{\beta }} \right) = \left( {2\pi } \right)^{ - n/2} \left| {\Sigma_{n} } \right|^{ - 1/2} \exp \left\{ { - \frac{{\tilde{z}^{\prime } \Sigma_{n}^{ - 1} \tilde{z}}}{2}} \right\}. $$
(36.15)

The log likelihood function is

$$ l\left( {\tilde{\beta }} \right) = - \frac{n}{2}\ln \left( {2\pi } \right) - \frac{n}{2}\ln \left( {\sigma_{a}^{2} } \right) - \frac{1}{2}\ln \left|\Omega \right| - \frac{1}{{2\sigma_{a}^{2} }}\left[ {\tilde{z}^{'}\Omega ^{ - 1} \tilde{z}} \right]. $$
(36.16)

The system of likelihood equations can be obtained by computing the partial derivatives of the log likelihood function with respect to the unknown parameters.

Theoretically, solving the likelihood equations yields the maximum likelihood estimates of the unknown parameters. However, since \( \tilde{z}^{\prime }\Omega ^{ - 1} \tilde{z} \) and \( \ln \left|\Omega \right| \) are not explicit expressions of the parameters, the likelihood equations actually consist of \( p + q + 1 \) transcendental equations, and a complex iterative algorithm is usually required to find the maximum likelihood estimates.

The maximum likelihood method makes full use of the information provided by each observation, so its estimation accuracy is high, and it also has good statistical properties such as consistency and asymptotic efficiency.
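For intuition, the conditional (Gaussian) likelihood of an AR(1) model can be maximized numerically in a few lines of Python. Conditioning on the first observation removes the \( \ln \left|\Omega \right| \) term, so this is a simplified stand-in for the exact likelihood above, with invented simulation parameters:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate an AR(1) series with phi = 0.7 (illustrative value only).
n, phi_true = 400, 0.7
z = np.zeros(n)
for t in range(1, n):
    z[t] = phi_true * z[t - 1] + rng.normal()

def cond_loglik(phi, z):
    """Conditional Gaussian log-likelihood of AR(1), with sigma_a^2
    profiled out at its conditional MLE, mean(residual^2)."""
    resid = z[1:] - phi * z[:-1]
    s2 = np.mean(resid**2)
    return -0.5 * len(resid) * (np.log(2 * np.pi * s2) + 1.0)

# Maximize over a grid of candidate phi values in (-1, 1).
grid = np.linspace(-0.99, 0.99, 397)
phi_hat = grid[np.argmax([cond_loglik(p, z) for p in grid])]
print(phi_hat)  # close to 0.7
```

For AR(p)/ARMA(p, q) models with the exact likelihood, the iterative algorithms mentioned above (as implemented in SAS) take the place of this grid search.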

Diagnostic test. In this step, the goodness of fit and the accuracy of the model are measured, and correlation and normality tests on the residual series are performed. The following two criteria will be used to measure the goodness of fit of a model:

Akaike’s information criterion (AIC). Akaike [23] defined AIC as

$$ {\text{AIC}} = - 2\ln \left( L \right) + 2k, $$
(36.17)

where L is the value of the likelihood function evaluated at the parameter estimates, N is the number of observations and k is the number of estimated parameters. The first term of the AIC measures the goodness of fit of the ARMA model to the data, and the second term is called the penalty function of the criterion because it penalizes a candidate model by the number of parameters used. Therefore, the model with the minimum AIC value should be chosen.

Schwarz’s Bayesian information criterion (SBC). Schwarz [24] defined the SBC as

$$ {\text{SBC}} = - 2\ln \left( L \right) + \ln \left( N \right)k, $$
(36.18)

Similarly, the model with the minimum SBC value should be chosen. The penalty for each parameter is 2 for AIC and \( \ln \left( N \right) \) for SBC, so compared to AIC, SBC tends to select a lower-order model when the sample size is moderate or large.
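Both criteria are simple to compute from a model's maximized log-likelihood. The Python sketch below (with hypothetical log-likelihood values, not taken from the bus data) shows how the \( \ln \left( N \right) \) penalty can make SBC prefer a smaller model than AIC:

```python
import math

def aic(log_l, k):
    """Akaike's information criterion: -2 ln(L) + 2k."""
    return -2.0 * log_l + 2.0 * k

def sbc(log_l, k, n):
    """Schwarz's Bayesian information criterion: -2 ln(L) + ln(N) k."""
    return -2.0 * log_l + math.log(n) * k

# Hypothetical nested candidates: the larger model fits slightly better
# (higher log-likelihood) but uses two extra parameters; n = 90 days.
n = 90
aic_small, aic_large = aic(-420.0, 2), aic(-417.0, 4)
sbc_small, sbc_large = sbc(-420.0, 2, n), sbc(-417.0, 4, n)

print(aic_small, aic_large)  # 844.0 842.0 -> AIC prefers the larger model
print(sbc_small, sbc_large)  # SBC prefers the smaller model, since ln 90 > 2
```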

Two other criteria will be used to measure the accuracy of a model's predictions. One can refer to [25] for a detailed description.

Mean absolute percentage error (MAPE). The MAPE is a common measure of forecast error in time series analysis. It usually expresses accuracy as a percentage and is defined by the formula:

$$ {\text{MAPE}} = \frac{100\% }{n}\mathop \sum \limits_{i = 1}^{n} |Z_{t} - F_{t} |/Z_{t} , $$
(36.19)

where \( Z_{t} \) is the actual value and \( F_{t} \) is the forecast value.

Mean square error (MSE). The MSE is a measure of the differences between the predicted values and the actual values. It is defined by:

$$ {\text{MSE}} = \frac{1}{n}\mathop \sum \limits_{t = 1}^{n} (Z_{t} - F_{t} )^{2} , $$
(36.20)

where \( Z_{t} \) is the actual value and \( F_{t} \) is the forecast value.
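Both accuracy measures are direct to implement; the following Python sketch applies them to toy daily volumes and forecasts (made-up numbers for illustration only):

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs(actual - forecast) / actual)

def mse(actual, forecast):
    """Mean square error."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean((actual - forecast) ** 2)

# Toy daily volumes vs. forecasts (invented, not from the bus dataset).
z = [1000.0, 1100.0, 900.0, 1050.0]
f = [ 950.0, 1150.0, 900.0, 1000.0]
print(mape(z, f))  # about 3.577 (percent)
print(mse(z, f))   # 1875.0
```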

4 Empirical Results

4.1 Introduction to Database

In this work, the original data was provided by the bus companies in the city of Jiaozuo, China, consisting of the bus IC card payment records for six bus lines in the city during the period January 01, 2018 to March 31, 2018. Table 36.3 shows part of the originally collected data. The whole dataset consists of 2,874,878 rows (records), each with eight component variables. The meanings of the variables used in this work are shown in Table 36.4.

Table 36.3 Part of raw data in the database
Table 36.4 Descriptions of the variables used in this work

4.2 Data Preprocessing

The first phase of the data analysis is to process the data in order to construct a time series. The steps to obtain the descriptive statistics and the time series plot are as follows.

  Step 1: Standardization of the raw data. The variables which are not straightforwardly numeric or character, such as 'SITE_TIME', need to be standardized in a format that SAS can recognize and interpret.

  Step 2: Data extraction. The daily passenger volume is extracted from the original database. For this, the DATA step and PROC procedures in SAS are mainly used to create the datasets.

  Step 3: Construction of the time series. After the datasets, including the daily passenger flow volume on each bus line, are constructed, the graphical procedure in SAS is used to plot the time series for each bus line. In this work, line No. 18 is chosen for the case study. The time series plot of the daily passenger volume on line No. 18 during the period January 01, 2018 to March 31, 2018 is shown in Fig. 36.1.

    Fig. 36.1

    Time series plot of daily passenger volume in line No. 18, including total passenger volume and the ones in two directions

4.3 Model Building

Case 1: ARMA modeling with the original time series. According to the ADF test results shown in Table 36.5, the p-value is less than 0.05 for lag 0, indicating that the null hypothesis can be rejected and the sequence is stationary, so an ARMA model is suitable for the original data. After calculating the BIC of the models with different order combinations, SAS lists the candidates in ascending order of BIC value, and the three candidate models with the minimum BIC values, namely \( {\text{AR}}\left( 3 \right) \), \( {\text{ARMA}}\left( {1,1} \right) \) and \( {\text{ARMA}}\left( {1,3} \right) \), are chosen. The results of the parameter estimation and the fitting statistics for each candidate model are summarized in Tables 36.6 and 36.7, respectively.

Table 36.5 ADF test results
Table 36.6 Parameter estimation results of three candidate models (in Case 1)
Table 36.7 Fitting statistics of three candidate models (in Case 1)

Although ARMA(1,1) and ARMA(1,3) have the smallest SBC and the smallest AIC values, respectively, their MAPE values are much higher than that of AR(3). In comparison, the \( {\text{AR}}\left( 3 \right) \) model has the smallest MAPE value, indicating the highest prediction precision, while both its AIC and SBC values are only slightly higher than the minima, indicating a relatively good fit. Therefore, \( {\text{AR}}\left( 3 \right) \) is chosen as the optimal model in this case. After the parameter estimation and significance tests in SAS, all the AR coefficients are significant, so the optimal model is determined as below.

$$ Z_{t} = 0.65563Z_{t - 1} + 0.05421Z_{t - 2} + 0.27748Z_{t - 3} + a_{t} $$
(36.21)
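Once the coefficients of (36.21) are in hand, producing a point forecast is a single weighted sum. The Python sketch below (with made-up recent volumes rather than the actual data of Table 36.8) computes the one-step-ahead forecast; the noise term \( a_{t} \) has conditional mean zero and therefore drops out:

```python
# One-step-ahead forecast with the fitted AR(3) model of Eq. (36.21):
# Z_t = 0.65563 Z_{t-1} + 0.05421 Z_{t-2} + 0.27748 Z_{t-3} + a_t.
phi = (0.65563, 0.05421, 0.27748)

def ar3_forecast(z1, z2, z3):
    """Point forecast E[Z_t | history]; the white noise a_t has mean zero."""
    return phi[0] * z1 + phi[1] * z2 + phi[2] * z3

# The three 'recent daily volumes' below are invented for illustration.
print(ar3_forecast(5200.0, 5100.0, 5350.0))  # about 5170.3
```

Multi-step forecasts are obtained by iterating this recursion, feeding each forecast back in as the most recent value; this is what SAS does internally for the ten-step-ahead prediction.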

As shown in Fig. 36.2, the residual diagnostics in SAS show that the ACF values of the residual sequence are almost 0 and the white noise probability is greater than 0.05, which indicates that there is no dependence between the residuals and that the \( {\text{AR}}\left( 3 \right) \) model has extracted all the useful information from the historical time series. Besides, the histogram and QQ-plot of the residuals (Fig. 36.3) show that the residual sequence follows a normal distribution, indicating that the model is adequate. The ten-step-ahead prediction of the passenger flow volume using model (36.21) and the comparison between the actual and predicted values are shown in Table 36.8 and Fig. 36.4, respectively.

Fig. 36.2

Residual correlation diagnostics for AR(3) model

Fig. 36.3

Residual normality diagnostics for AR(3) model

Table 36.8 Prediction based on \( {\text{AR}}\left( 3 \right) \) model
Fig. 36.4

Actual values against prediction based on \( {\text{AR}}\left( 3 \right) \) model

Although the diagnostic results show that \( {\text{AR}}\left( 3 \right) \) is adequate for fitting the sequence, the 95% confidence interval of the prediction is very wide. Since a wider confidence region implies lower prediction accuracy, the prediction, especially in the long term, may not be accurate.

Case 2: ARMA modeling with the first-order differenced time series. According to the ACF plot shown in Fig. 36.5, although the autocorrelation decreases exponentially, it does not fall within the confidence interval until lag 5. Considering that the ACF decays gradually rather than rapidly to zero, the time series is regarded as non-stationary and needs to be differenced, so an ARIMA model is applied to fit the data. The two models with the minimum BIC values, namely \( {\text{ARIMA}}\left( {0,1,1} \right) \) and \( {\text{ARIMA}}\left( {2,1,0} \right) \), are chosen as candidate models. The results of the parameter estimation and the fitting statistics for each candidate model are summarized in Tables 36.9 and 36.10, respectively.

Fig. 36.5

Trend and ACF plots for the original time series

Table 36.9 Parameter estimation results of two candidate models (in Case 2)
Table 36.10 Fitting statistics of two candidate models (in Case 2)

It can be seen that \( {\text{ARIMA}}\left( {2,1,0} \right) \) has a smaller AIC value, indicating a higher goodness of fit, and a smaller MAPE value, indicating a higher prediction precision, while its SBC value is slightly higher than that of \( {\text{ARIMA}}\left( {0,1,1} \right) \). Therefore, the \( {\text{ARIMA}}\left( {2,1,0} \right) \) model is chosen as the optimal one for the first-order differenced time series. After parameter estimation and significance testing, all the coefficients are found to be significant. The optimal model is therefore determined as below.

$$ \left( {1 - B} \right)Z_{t} = - 0.33412\left( {1 - B} \right)Z_{t - 1} - 0.28956\left( {1 - B} \right)Z_{t - 2} + a_{t} $$
(36.22)

By performing the residual diagnostics (similar to Case 1), it is observed that the residuals are uncorrelated and the \( {\text{ARIMA}}\left( {2,1,0} \right) \) model has extracted all the useful information from the time series. Besides, the residual sequence follows a normal distribution, indicating that the model is adequate. The ten-step-ahead prediction of the passenger flow volume using model (36.22) and the comparison between the actual and predicted values are shown in Table 36.11 and Fig. 36.6, respectively.

Table 36.11 Prediction based on \( {\text{ARIMA}}\left( {2,1,0} \right) \) model
Fig. 36.6 Actual values against forecasts based on \( {\text{ARIMA}}\left( {2,1,0} \right) \) model

Although the diagnostic results show that \( {\text{ARIMA}}\left( {2,1,0} \right) \) is adequate for fitting the sequence, the 95% confidence region of its prediction is still very wide.
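The ten-step-ahead point forecast from an ARIMA(2,1,0) model can be computed by a simple recursion: future shocks \( a_t \) are replaced by their mean, zero, and the forecast differences are cumulated back onto the level. The sketch below uses the coefficients reported in (36.22), assuming the AR lags of the differenced series are 1 and 2 as an ARIMA(2,1,0) implies; the recent daily volumes are hypothetical placeholders, not the paper's data.

```python
import numpy as np

# AR coefficients of the differenced series, from model (36.22)
phi1, phi2 = -0.33412, -0.28956

def forecast_arima_210(z, steps):
    """h-step forecast for w_t = phi1*w_{t-1} + phi2*w_{t-2} + a_t,
    where w_t = (1 - B)Z_t, with future shocks a_t set to zero."""
    w = list(np.diff(z))              # first differences of the observed levels
    level, preds = z[-1], []
    for _ in range(steps):
        w_next = phi1 * w[-1] + phi2 * w[-2]
        w.append(w_next)
        level = level + w_next        # integrate the difference back to a level
        preds.append(level)
    return np.array(preds)

# hypothetical recent daily passenger volumes (illustrative only)
z = np.array([5200.0, 5350.0, 5100.0, 5400.0, 5250.0])
preds = forecast_arima_210(z, 10)
print(preds)
```

Because the AR part of the differenced series is stationary, the forecast differences decay toward zero and the level forecast flattens, which is one reason the long-horizon prediction carries little information beyond the trend.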

Case 3: ARMA modeling with a quadratic function trend. According to the original time series plot in Fig. 36.1, the passenger flow time series appears to have a quadratic trend. Two trend variables, _LINEAR_ and _SQUARE_ (as shown in Table 36.12), representing the linear and quadratic relationships, respectively, are pre-generated.

Table 36.12 Part of dataset with two trend variables

Then, the same steps as in the previous two cases are followed to build a quadratic ARMA model. The three models with the minimum BIC values are chosen as candidate models, namely \( {\text{Quadratic}} + {\text{AR}}\left( 3 \right) \), \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,1} \right) \) and \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \). The results of the parameter estimation and the fitting statistics for each candidate model are summarized in Tables 36.13 and 36.14, respectively.

Table 36.13 Parameter estimation results of three candidate models (in Case 3)
Table 36.14 Fitting statistics of three candidate models (in Case 3)

Compared with the other two models, the \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model has the smallest AIC and MAPE, indicating the highest goodness of fit and the highest prediction precision. Although \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,1} \right) \) has the smallest SBC, its MAPE is the highest, indicating the lowest prediction accuracy. Therefore, \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) is chosen as the optimal model with a quadratic trend.

After parameter estimation and significance testing, not only the MA and AR coefficients but also the coefficients of the linear and quadratic trend variables are found to be significant. The optimal model is therefore determined as below.

$$ \begin{aligned} Z_{t} & = 113.79672t - 0.92369t^{2} + 0.76222Z_{t - 1} \\ & \quad + a_{t} - 0.26487a_{t - 1} - 0.05254a_{t - 2} + 0.35486a_{t - 3} \\ \end{aligned} $$
(36.23)

By performing the residual diagnostics (similar to Case 1), it is observed that the residuals are uncorrelated and the \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model has extracted all the useful information from the time series. Besides, the histogram and QQ-plot (obtained using the same method as in Case 1) show that the residual sequence follows a normal distribution, indicating that the model is adequate.

Using model (36.23), the ten-step-ahead prediction and the comparison between the actual and predicted values are shown in Table 36.15 and Fig. 36.7, respectively.

Table 36.15 Forecasts based on \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model
Table 36.16 Fitting statistics of three optimal models
Fig. 36.7 Actual values against forecasts based on \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model

The 95% confidence region is significantly narrower, but the prediction does not capture the rapid growth at the end of the sequence, which is probably caused by external factors such as weather and holiday policies. If further improvement is needed, such external influences must be included in the model.
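The Quadratic + ARMA construction in Case 3 amounts to regressing the series on the _LINEAR_ and _SQUARE_ trend variables and then modeling the residual with an ARMA process. A minimal sketch of the detrending step, on synthetic data with invented trend coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
t = np.arange(1, n + 1, dtype=float)

# synthetic series: quadratic trend plus AR(1) noise (all numbers illustrative)
noise = np.zeros(n)
for i in range(1, n):
    noise[i] = 0.5 * noise[i - 1] + rng.normal()
z = 100 + 12.0 * t - 0.04 * t**2 + noise

# Step 1: OLS on the _LINEAR_ (t) and _SQUARE_ (t^2) regressors, as in Table 36.12
X = np.column_stack([np.ones(n), t, t**2])
beta, *_ = np.linalg.lstsq(X, z, rcond=None)
detrended = z - X @ beta

# Step 2 (not shown): fit an ARMA(p,q) to `detrended`, as the paper does in SAS
print(beta)   # trend coefficients recovered close to (100, 12.0, -0.04)
```

Forecasts then combine the extrapolated quadratic trend with the ARMA forecast of the detrended part, which is why the prediction interval stays roughly constant in width while the trend carries the growth.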

5 Conclusion

In this work, the prediction of the passenger flow volume in the bus transportation system is performed using three kinds of time series models: AR, ARIMA and quadratic ARMA. First, the bus IC card payment records were transformed into a time series representing the daily passenger volume on line No. 18. Then, time series analysis was applied and two optimal models, \( {\text{AR}}\left( 3 \right) \) and \( {\text{ARIMA}}\left( {2,1,0} \right) \), were found. Both models performed well in terms of goodness of fit but failed to attain accurate predictions. In order to achieve a higher prediction accuracy, an ARMA model with a quadratic trend was further explored, and a \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model was established for the time series, which achieves a better balance between fitting and forecasting. The fitting statistics of these models are shown in Table 36.16.

Each model has its own advantages and disadvantages, which are discussed one by one below.

  • \( {\mathbf{AR}}\left( 3 \right) \): The \( {\text{AR}}\left( 3 \right) \) model has no obvious advantages or disadvantages, because its performance is outstanding in neither goodness of fit nor prediction accuracy. Its SBC, MAPE and MSE values are all at the middle level, although its AIC value is slightly higher than those of the other two models. The one advantage worth mentioning is that, since differencing is not needed, this model is the simplest and most straightforward, so its application cost is the lowest.

  • \( {\mathbf{ARIMA}}\left( {2,1,0} \right) \): In terms of fitting effect, the \( {\text{ARIMA}}\left( {2,1,0} \right) \) model has the lowest AIC and SBC values, indicating the highest goodness of fit. However, it has the highest MAPE and MSE among the three models, indicating the greatest deviation between the predicted and true values. Moreover, its prediction confidence region widens over time, so it may perform poorly in long-term prediction. Since it has the best fitting effect, however, it can accurately describe the surge at the end of the original time series, so its prediction is reliable when the model is used to predict the most recent values.

  • \( {\mathbf{Quadratic}} + {\mathbf{ARMA}}\left( {1,3} \right) \): Compared with the other two models, the \( {\text{Quadratic}} + {\text{ARMA}}\left( {1,3} \right) \) model has the smallest MAPE and MSE values, so it achieves the highest prediction accuracy. Most importantly, it has a unique advantage over the others: its prediction confidence interval is narrower and of roughly constant width over time, so it performs more effectively with high prediction accuracy.
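The MAPE and MSE figures compared in the list above are straightforward to reproduce. A small sketch with made-up actual and predicted values (not the paper's):

```python
import numpy as np

def mape(actual, pred):
    """Mean absolute percentage error, in percent."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return 100.0 * np.mean(np.abs((actual - pred) / actual))

def mse(actual, pred):
    """Mean squared error."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return np.mean((actual - pred) ** 2)

# illustrative daily volumes and forecasts, not the paper's data
actual = [5200, 5300, 5150]
pred   = [5100, 5350, 5200]
print(round(mape(actual, pred), 2))   # -> 1.28 (percent)
print(round(mse(actual, pred), 1))    # -> 5000.0
```

MAPE is scale-free, which makes it the natural accuracy measure for comparing models on passenger counts; MSE penalizes large misses more heavily, which matters when the concern is sudden surges.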

The initial objective of this project, and the main demand from the traffic management, is to improve the forecast accuracy. Accordingly, the accuracy of the prediction is the most important factor in evaluating the solution performance. It can thus be concluded that the \( {\text{Quadratic }} + {\text{ARMA}}\left( {1,3} \right) \) model is the most appropriate of the three. Although the \( {\text{ARIMA}}\left( {2,1,0} \right) \) model fits the current data best, its prediction confidence region widens quickly, so it may be useful mainly for short-term prediction.

6 Open Questions and Potential Improvements

Although the ARMA model with a quadratic function trend performs best in our case, its application range is limited, because the time series must show a quadratic trend. In reality, only the short-term change of passenger flow may show such a trend. For long-term daily passenger flow, if the data span is more than one year, the volume usually fluctuates within a limited range around a fixed value, so a stationary time series model may be more suitable for such data. In addition, in view of the change of daily passenger flow in a given city, a seasonal factor with a weekly cycle may be considered because of the difference in commuting patterns between weekdays and weekends. In this case, a seasonal ARIMA model may be built to fit the series.
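The weekly seasonality suggested here would enter a seasonal ARIMA model through differencing at lag 7. A sketch on simulated daily counts (the weekday/weekend pattern below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 140  # 20 weeks of daily passenger counts
# synthetic daily flow with a period-7 weekday/weekend pattern (illustrative)
weekly = np.tile([1.0, 1.1, 1.1, 1.1, 1.2, 0.7, 0.6], n // 7)
z = 5000 * weekly + rng.normal(scale=50, size=n)

# seasonal differencing at lag s=7: the "D" step of a SARIMA(p,d,q)(P,D,Q)_7
sz = z[7:] - z[:-7]

# the weekly pattern cancels, leaving a far less variable series to model
print(np.std(z), np.std(sz))
```

After seasonal differencing, the remaining series can be checked for stationarity and fitted with the same Box-Jenkins steps used in Cases 1 and 2.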

As mentioned at the end of Sect. 36.4, the time series method has limitations. When the prediction time span is long, only a rough future trend can be obtained, not the specific volatility. In order to accurately describe future fluctuations, more external factors, such as weather, temperature, holidays and events, might be introduced into the model. As the historical data is continuously updated and the sample size grows, the algorithm should be updated and adjusted accordingly.