1 Introduction

Intelligent transportation systems (ITS) are widely used in public transit to address problems in urban transportation such as pollution, congestion, and inefficiency caused by increasing mobility [1,2,3]. Designing a resilient public transit system and its schedules requires an understanding of the projected demand under both regular and irregular conditions. Hence, traffic demand prediction and anomaly detection become vital processes.

In this study, we design an accurate and efficient system to make short-term predictions and detect anomalous days in urban networks by leveraging the open-source data available for New York City, one of the major urban centers in the world. Taxi ridership forms a considerable proportion of urban mobility in New York and thus can be used as a reliable proxy for urban mobility. A precise taxi pick-up and drop-off volume prediction system supports decision-making in dispatching cabs and for-hire vehicles to improve the taxi service [4, 5]. The widely accepted conception of anomalies comes from Hawkins [6]: “An anomaly is an observation which deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” Although transportation flows follow regular, recurring patterns, they are also subject to anomalous flows due to various events. National holidays, extreme weather, disasters, game days, and religious events cause significant surges [7] in traffic, leading to issues with traffic management and control. Surges and anomalies are crucial stress points for transport resource scheduling systems, often requiring pre-emptive actions from the authorities. This paper focuses on analyzing the behavior and exceptions of the taxi ridership system at two airports in New York City: John F. Kennedy International Airport (JFK) and LaGuardia Airport (LGA). The idea is to leverage relevant urban data, model the ridership prediction, and then find and analyze various anomalies.

Transport data have complex spatial dependencies and nonlinear temporal dynamics. Since transport data comprise connected origins and destinations and have a highly interdependent topological structure, modeling individual nodes cannot capture the interactions. Furthermore, mobility is highly correlated with external factors. In transportation hubs, pick-up volume depends heavily on arrival flows, and it varies with the weather, time, weekdays, and other elements. We build a dataset to capture critical external features. Then, we experiment with different regression models capable of modeling the influence of these external variables.

Our experimentation shows that nonlinear modeling techniques are better suited to mobility data. We evaluate two complex nonlinear models for this task, Random Forest and long short-term memory (LSTM) networks. We use lagged variables with Random Forest to model temporal relationships, whereas LSTM is autoregressive in nature. We design a prediction model that predicts ridership flows while considering the spatial relationships between nodes.

Subsequently, we apply residuals from prediction models to surge detection. The most common approach for anomaly detection is building a probabilistic model of the normal behavior and categorizing anomalies as events that have low likelihood according to the probabilistic model [7, 8]. These models have a shortcoming in that they do not consider the externalities and temporal dependencies of the variable in the study. Unsupervised feature learning approaches like principal component analysis (PCA) and autoencoders which are commonly used before anomaly detection also have the same flaw: they are not capable of modeling externalities like weather and day of the week. We formulate a framework where an initial modeling phase corrects externalities and temporal dependencies in the ridership data. Then, residuals from this stage are used for anomaly detection. Consequently, an anomaly is only detected if the anomalous behavior cannot be explained using the available external data and ridership history. Since we use the taxi zone data and perform zone-level predictions, the high-dimensional residual still poses a problem for anomaly detection. Hence, we experiment with multiple network aggregation techniques like PCA and network community detection to reduce the dimensions for anomaly detection.

Our main findings are summarized as follows:

  1. we show that unsupervised learning-based anomaly detection based on regression residuals outperforms anomaly detection from ground truth ridership values;

  2. we show that mobility networks have strong nonlinear spatial and temporal dependencies, and thus, nonlinear models are better than legacy linear models for modeling purposes;

  3. we provide empirical evidence that dimensionality reduction is necessary before modeling high-dimensional mobility network data;

  4. our results show that community detection is a better network aggregation technique when compared to geographical and administrative aggregations in anomaly detection;

  5. our results show that autoregressive LSTM outperforms Random Forest, which uses lagged variables for temporal modeling in the out-of-sample test.

2 Related work

2.1 Taxi ridership volume prediction

Traffic demand prediction is fundamental to building a stable smart-city transportation system [9, 10]. We use taxicab and for-hire vehicle (FHV, e.g., Uber) data, as these are major components [11, 12] of this network. Taxi demand prediction methodologies place strong emphasis on the complex spatial [13, 14] and temporal [15, 16] relationships in the dataset and on the need for models that can exploit both. One challenge for mobility modeling is that the predictability of traffic flow is heterogeneous across geographical regions [17].

Traditional approaches for traffic forecasting are mostly based on techniques such as the autoregressive integrated moving average (ARIMA) model and Kalman filtering [18, 19]. ARIMA [18] is the most widely used forecasting model and has been extensively applied to traffic forecasting problems [21, 22], including the modeling of urban mobility from taxi data [20]. SARIMA, a variant of ARIMA that also captures seasonal periodicity, has also been used for traffic forecasting [23]. These models produce predictions for individual nodes and fail to capture the relationships between nodes, which is a vital aspect of real-world traffic networks. Vector ARMA (VARMA) [24] and space-time ARIMA (STARIMA) [25] are two more sophisticated variants that perform traffic forecasting for multiple nodes in a network. These were initial steps towards edge-level traffic forecasting that considers spatial and temporal relationships. Ding et al. [26] proposed a STARIMA model to predict 5-min traffic volume on street links, with spatial features that include the average trip duration on links. Ravi et al. [27] applied a vector autoregressive model to predict traffic volumes on freeways, including upstream and downstream locations. These techniques have three shortcomings:

  1. They only work on series that are stationary or stationary after differencing, while the non-stationarity problem in urban transportation is complex and might not necessarily be addressable with differencing.

  2. They do not consider spatial and structural dependencies that traffic networks exhibit and forecast each sensor as an individual time series.

  3. They suffer from the curse of dimensionality.

Most state-of-the-art models are deep learning-based and include the deep multi-task multi-graph learning approach [28], the Spatio-Temporal Encoder–Decoder Residual Multi-Graph Convolutional network [29], and the spatio-temporal graph convolutional network [30]. Models that take both spatial and temporal features into consideration have been shown to yield significantly improved performance [31]. Zone-based taxi prediction [31,32,33] is less challenging than OD-based prediction [29, 34, 35]. Furthermore, mobility is highly correlated with external factors [36]. In transportation hubs, pick-up volume depends heavily on the number of arrivals, and it varies with the weather, time, weekdays, and other elements. We build a dataset for this set of important external features. Then, we experiment with different regression models capable of modeling external variables.

The Graph Neural Network (GNN) [37] is a popular and successful method in traffic prediction [38, 39], as it captures connections between nodes. However, this feature brings no benefit to a star-structured network, as a GNN merely propagates information between adjacent nodes and neglects information between unconnected nodes [40]. Hence, it is not an appropriate method for our project, as we are working on a star network: all nodes in our graph are connected to only one node (JFK or LGA).

A linear regression model with a large number of engineered features [41] can also outperform complex time series models on out-of-sample datasets. We use linear regression models as our baseline for aggregated ridership prediction. Subsequently, we experiment with a more robust nonlinear model, Random Forest. We choose Random Forest regression because it has built-in mechanisms to avoid overfitting: bootstrapping and random subsetting [42] of the feature space are used to build weak learners, which are then ensembled for the final prediction. This leads to good generalization. Furthermore, using Random Forest for edge-level prediction allows modeling the spatial and structural dependencies among the nodes. Recurrent neural networks (RNNs) are able to model nonlinear temporal dynamics in sequence data for forecasting. However, traditional RNNs suffer from exploding and vanishing gradient problems, which make it difficult for them to capture long-term temporal dependencies [43]. Besides, they rely on predetermined time lags. Long short-term memory (LSTM) networks were proposed to address these challenges [43]. Unlike traditional RNNs, they can model long-term dependencies using memory units and gated structures. Furthermore, they can automatically determine optimal time lags for the dataset by learning when to open and close the relevant gates. Traffic data are known to have long-term temporal dependencies, making LSTM a suitable modeling technique.

Even though previous models achieve impressive performance in ridership prediction, our objective is to train a model that not only returns reliable predictions but also provides insights for anomaly detection.

2.2 Anomaly detection in networks

The application of network anomaly detection is a broad research area that is applicable in different domains, including fraudulent activities in transactions [44], social networks data [45, 46], and transportation networks [7, 47, 48]. One challenge in this field is the lack of widely accepted labeled datasets for anomaly detection [44] which can be used to benchmark different approaches. Thus, in this work, we create our own labeled dataset.

If the input data are high-dimensional, a dimension reduction step is usually applied before anomaly detection. Lakhina et al. [49] were the first to apply principal component analysis (PCA) to network anomaly detection, using principal component reconstruction residuals as low-dimensional features. The reconstruction error from an autoencoder [50] is another common means of detecting outliers. Later, Ringberg et al. [51] and Brauckhoff et al. [52] suggested that PCA is not suitable for anomaly detection because of its parameter sensitivity and the lack of temporal correlation in the transformed dataset. An alternative approach to reducing the dimension of graph data is to use a graph aggregation technique such as community detection. A community is defined as a group of nodes in a network that interact with each other particularly frequently, and it is a common substructure in networks [53]. Zhenzhang et al. [54] were the first to apply community detection to complex evolutionary networks. GNNs and their variants have an advantage in network analysis, as they capture relationships between nodes. However, they offer no advantage when there are no connections between neighbors [40].

At the anomaly detection stage, clustering is the most common machine learning technique. Prior work has shown that soft clustering methods are more flexible and feasible, as they consider the nature of the data [55]. The Gaussian Mixture Model (GMM) was introduced to network anomaly detection in 2006 by Tran et al. [56], and it has been shown to be a better choice than K-means when clusters have different sizes and correlations.

Supervised learning methods have been applied to anomaly detection in spatio-temporal datasets as well [47, 57]. Kong et al. applied a one-class support vector machine to detect urban anomalies based on prediction residuals from LSTM. This paper shows that residuals from Random Forest outperform those from LSTM in anomaly detection.

Based on the literature review, we identify the following main challenges in temporal network anomaly detection:

  1. A lack of a generally applicable method that works on different mobility networks.

  2. Low signal-to-noise ratio makes separating normal and abnormal data points harder.

  3. A lack of widely accepted labeled datasets for anomaly detection.

Fig. 1 Experiment pipeline

3 Data and preprocessing

3.1 Outgoing traffic volumes from airports

The outgoing vehicle datasets are collected from two major transportation hubs in New York City, JFK and LGA. The New York City Taxi and Limousine Commission provides trip records for yellow taxis, green taxis, and for-hire vehicles [58], which include fields capturing pick-up and drop-off dates/times and pick-up/drop-off locations. The pick-up/drop-off locations are aggregated into 263 taxi zones, where JFK and LGA act as two taxi zone units. Here, we only take yellow taxis and for-hire vehicles into account, as green taxis are not allowed to pick up passengers in Manhattan. We aggregate the traffic volume by date. The summary of aggregated datasets is in Table 5.

3.2 Arrival flights and trains volumes

We assume that the number of passengers arriving at each transportation hub affects the outgoing traffic volume with a specific time lag. Thus, we collected flight arrival data to predict the traffic volume. The flight datasets for JFK and LGA in 2018 came from the Bureau of Transportation Statistics.

The Bureau of Transportation Statistics only records domestic flights. According to the Port Authority traffic report, about 39% of JFK flights (38,260 of 97,853) and 7% of LGA flights (2,301 of 29,911) are international. Moreover, international flights usually have a higher passenger capacity. Therefore, the lack of international flight data might limit the capacity of our models to explain the associated ridership fluctuations.

3.3 External datasets: weather data and flight arrival data

Taxi ridership volume from transportation hubs is determined by the combination of the number of incoming passengers, the probability of choosing a taxi, and the number of available taxis. The demand for taxis at transportation hubs and in the city influence each other, according to [59]. During inclement weather, the volume of trips within Manhattan increases. As frequent and short trips bring more income to drivers, airports become less attractive to taxi drivers [60]. Our weather dataset is collected from Aviation System Performance Metrics data, supplied by the National Oceanic and Atmospheric Administration.

Besides weather conditions, temporal variations, such as day of week and time of day, significantly impact the taxi supply and demand market. Dr. Kamga found that on weekend nights, taxis have the highest average revenue, which is caused by higher pick-up rates and relatively shorter trips [61].

3.4 Urban anomaly events

To validate our anomaly detection results, we compile a separate events dataset for each transportation hub. Event types include national holidays, extreme weather (snow and storm) days, and major flight delays and cancellations at the airports. Details are in Appendix A Table 7.

Extreme weather includes two types: winter storm and thunderstorm wind. Because snow does not melt away immediately and its impact lasts longer, the days after winter storms are regarded as winter storm days as well. Airport events are collected from news published from 01/01/2018 to 12/31/2018, and most of them are related to extreme weather days in other cities and states. For example, on Dec 21, a winter storm hit the east coast, and even though this day is not recorded as a storm event in New York City, it still led to one-third of flights being delayed at LGA and more than 15% of flights being delayed at JFK [62].

4 Methodology

Our study has two distinct objectives: developing an accurate traffic prediction model and analyzing the usability of this model for traffic surge detection. Both objectives influence our choice of methodology. The whole process is visualized in Fig. 1. Input datasets include taxi ridership data, weather data, and flight arrival data. Initially, we experiment with an array of hypotheses and modeling paradigms on non-spatial, one-dimensional time series ridership prediction. Then, we select the most suitable model for edge-level traffic prediction. Subsequently, we choose the best-performing model and study its utility for surge detection.

We formulate a framework where an initial modeling phase corrects externalities and temporal dependencies in the ridership data. Residuals from this stage are then used for anomaly detection. Consequently, an anomaly is only detected if the anomalous behavior cannot be explained using the available external data and ridership history. Throughout this paper, we refer to taxi-zone-wise data without aggregation as edge-level data.

Mobility network data are usually very high-dimensional, and linear machine learning methods do not perform well with high-dimensional data, especially when the dataset size is not many times larger than the number of dimensions. This is a generic issue with machine learning models and is referred to as the curse of dimensionality. The curse of dimensionality can be detrimental to both the ridership prediction and subsequent anomaly detection. Furthermore, edge-level data have a low signal-to-noise ratio, and some level of aggregation is required to address this issue.

To summarize our pipeline, we use a dimension reduction technique to transform the edge-level network data into a reduced feature space. Then, we perform ridership prediction in this aggregated space and use disaggregation techniques to attain edge-level predictions. Anomaly detection then uses the low-dimensional residuals in the aggregated space for surge isolation.

4.1 Dimension reduction and network aggregation

Once an edge-level prediction model is developed, it is still a challenge to use this high-dimensional residual data for anomaly detection. Most prior works address this issue by adopting a two-stage approach [49, 51, 52], where low-dimensional representation is learned before applying anomaly detection techniques on the latent representation. We address this issue by experimenting with different spatial and structural aggregations and dimension reduction techniques. Then, we perform modeling and residual generation in these aggregated spaces to get low-dimensional residuals for anomaly detection. Since comparing the performance of ridership prediction with modeling at different aggregated spaces is hard, we include a disaggregation step that transforms the aggregated space back to edge level using inverse PCA or multiplying the aggregated demand by edge-level weights. The edge-level weights are the yearly edge-level demand divided by yearly aggregated-level demand.

Different aggregations and dimension reduction techniques that we experimented with are discussed below. At the dimension reduction stage, we aggregated the dataset in three different ways:

  1. spatial aggregation that aggregates 263 taxi zones to 5 boroughs;

  2. topological aggregation that uses a community detection algorithm to aggregate taxi zones to 6 or 24 communities;

  3. linear dimension reduction that applies the principal component analysis method to reduce the raw dataset to 6 or 24 dimensions.

The following subsections introduce more details of dimension reduction and network aggregation techniques.

Principal component analysis: In a pipeline that includes dimension reduction, edge-level predictions are obtained by applying the inverse transformation to the predicted components. We choose PCA because it minimizes the reconstruction error and its transformation is invertible. Moreover, the main purpose of our pipeline is to show the utility of each phase, and PCA is the simplest suitable method for demonstrating the utility of dimension reduction, although more advanced methods such as kernel PCA or autoencoders could be considered. PCA learns a linear transformation that projects the data into another space whose axes are the directions of greatest variance in the data. By restricting the dimension to a certain number of components that account for most of the variance of the dataset, we achieve dimension reduction. We perform PCA to obtain a low-dimensional space, which we use for ridership modeling. Inverse PCA is used to obtain edge-level prediction performance. We try different numbers of principal components and choose the best-performing one for each hub.
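As an illustration of the reduce-then-invert step described above, the following is a minimal sketch using scikit-learn. The synthetic ridership matrix, its sizes, and the use of the true components in place of model predictions are assumptions made for the example and are not taken from our experiments.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical hourly ridership matrix: rows are hours, columns are taxi zones.
n_hours, n_zones, n_components = 500, 263, 24
latent = rng.normal(size=(n_hours, n_components))
mixing = rng.normal(size=(n_components, n_zones))
ridership = latent @ mixing + 0.1 * rng.normal(size=(n_hours, n_zones))

# Learn the projection and move to the low-dimensional space used for modeling.
pca = PCA(n_components=n_components)
components = pca.fit_transform(ridership)

# A prediction model would output predicted components; here the true
# components stand in for them so that only the inverse step is shown.
zone_level = pca.inverse_transform(components)

explained = pca.explained_variance_ratio_.sum()
reconstruction_rmse = np.sqrt(np.mean((ridership - zone_level) ** 2))
```

In such a sketch, the number of retained components trades off reconstruction error against the dimension of the prediction problem, which is why it would be tuned per hub.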

Spatial aggregation: New York City is divided into administrative divisions termed as boroughs. There are five boroughs in New York named Bronx, Brooklyn, Manhattan, Queens, and Staten Island. The spatial breakdown of the city into boroughs is available in the Appendix Fig 5.

Community detection: This technique provides a topological aggregation of taxi zones, whereas borough-level aggregation ignores the interaction between taxi zones.

A community is a collection of nodes in a network that interact significantly. In community detection, the algorithm compares the actual edge weight with the average expected value for each edge in the original network. The edges with positive relative strength scores represent particularly strong network connections and are placed inside the communities, while edges with negative scores are placed between the communities. This process maximizes the modularity score and determines the optimal partitioning. Therefore, community detection provides a way of aggregation that preserves topological and structural information, as opposed to crude spatial aggregation. Our community detection results come from [7], which collected yellow cab, green cab, and for-hire vehicle trips within New York City from Jul 01, 2017, to Dec 31, 2018. The 263 taxi zones are aggregated to 6 or 24 communities. The visualization of community detection results is available in Appendix B Figs. 6 and 7. We try different granularities of community detection as our aggregation step before ridership prediction and use the best-performing granularity for each hub. For these experiments, we use weighted disaggregation, which is described below.
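To make the notion of relative edge strength concrete, the sketch below computes the standard Newman modularity score of a given partition with NumPy. It assumes an undirected weighted network and a toy six-node graph; it only scores partitions and is not the detection procedure of [7], from which our communities are taken.

```python
import numpy as np

def modularity(adjacency, labels):
    """Newman modularity of a partition of an undirected weighted network."""
    adjacency = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    strength = adjacency.sum(axis=1)      # weighted degree of each node
    total = strength.sum()                # equals 2m for an undirected graph
    expected = np.outer(strength, strength) / total
    relative = adjacency - expected       # positive: stronger than expected
    same = labels[:, None] == labels[None, :]
    return float((relative * same).sum() / total)

# Toy network: two triangles {0,1,2} and {3,4,5} joined by one weak link.
A = np.zeros((6, 6))
edges = [(0, 1, 5), (0, 2, 5), (1, 2, 5), (3, 4, 5), (3, 5, 5), (4, 5, 5), (2, 3, 1)]
for i, j, w in edges:
    A[i, j] = A[j, i] = w

good = modularity(A, [0, 0, 0, 1, 1, 1])   # partition along the weak link
bad = modularity(A, [0, 1, 0, 1, 0, 1])    # partition cutting strong links
```

The partition that keeps the strongly connected triangles together obtains the higher score, which is the quantity a modularity-maximizing algorithm searches for.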

Weighted disaggregation: This technique is used to convert the aggregated ridership data in community and borough space back to network space. We learn a linear transform from the historical data that gives the average fractional contribution of each edge to the community or borough in which it resides. For each unit \(u\) of the aggregated space (which can be a single community or a borough)

$$\begin{aligned} R_{u} = \sum _{e\in u}B_e*R_e, \end{aligned}$$
(1)

where \(R_{u}\) is the ridership of unit \(u\) in the aggregated space, \(R_e\) is the ridership of edge \(e\), and \(B_e\) is the fractional contribution that is learned from historical data.
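A minimal sketch of this step is given below, following the description in Sect. 4.1 that the edge-level weights are the historical edge-level demand divided by the historical demand of the aggregated unit. The function names, the toy history, and the assumption that every unit has non-zero historical demand are illustrative only.

```python
import numpy as np

def fit_edge_weights(history, unit_of_edge):
    """Share of each edge in the historical demand of its aggregated unit.

    Assumes every unit has non-zero total demand in the history.
    """
    history = np.asarray(history, dtype=float)
    unit_of_edge = np.asarray(unit_of_edge)
    edge_totals = history.sum(axis=0)
    unit_totals = np.zeros(unit_of_edge.max() + 1)
    np.add.at(unit_totals, unit_of_edge, edge_totals)
    return edge_totals / unit_totals[unit_of_edge]

def disaggregate(unit_prediction, weights, unit_of_edge):
    """Spread aggregated predictions back onto edges using the fixed shares."""
    unit_prediction = np.asarray(unit_prediction, dtype=float)
    return unit_prediction[..., unit_of_edge] * weights

# Three edges: the first two belong to unit 0, the third to unit 1.
unit_of_edge = np.array([0, 0, 1])
history = np.array([[10, 30, 5], [20, 20, 15]])   # rows are time steps
weights = fit_edge_weights(history, unit_of_edge)
edge_prediction = disaggregate([100.0, 8.0], weights, unit_of_edge)
```

By construction, the weights within each unit sum to one, so the disaggregated edge-level values add up to the aggregated prediction of their unit.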

4.2 Ridership prediction

We frame the ridership prediction problem as an hourly edge-level prediction for the mobility network based on external variables and ridership history.

As an initial experiment, we compare different models on the network-wide aggregated, one-dimensional ridership time series at the hourly level. In this stage, we compare a linear regression, a hybrid model (combining linear regression and an autoregressive integrated moving average model), a Random Forest regressor, and LSTM. The success of the Random Forest model and LSTM underscores the long-term temporal dependencies of ridership.

Subsequently, we frame the problem as hourly ridership prediction at edge level. This modeling is performed on both aggregated and original networks. For this task, we only evaluate two highly nonlinear models, Random Forest, and LSTM. Both models perform network-wide prediction enabling them to model relationships between network nodes. Random Forest uses lagged variables, while LSTM uses autoregressive modeling to model temporal relationships. Furthermore, we explore combinations of different models, aggregation, and disaggregation methods. Models evaluated in this experiment include the following.

4.2.1 Linear regression

We formulated the linear regression model for total outgoing traffic prediction from each hub. We use lagged variables to model temporal dependencies. The model formulation is shown below

$$\begin{aligned} \mathrm{RIDERSHIP}_t= & {} \beta _0 + \beta _1 * \mathrm{ARRIVAL}_{t,m,n} \nonumber \\&+ \beta _2 * \mathrm{TimeofDay}_t \nonumber \\&+ \beta _3 * \mathrm{DayofWeek}_t, \end{aligned}$$
(2)

where the variables are explained in Appendix A Table 8.
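The baseline can be sketched as an ordinary least squares fit on lagged arrivals and calendar features. In the sketch below, the synthetic data, the single 1-h arrival lag, and the dummy encoding of time of day and day of week are assumptions for illustration; the actual variables are those listed in Appendix A Table 8.

```python
import numpy as np

rng = np.random.default_rng(1)
n_hours, lag = 24 * 60, 1   # 60 synthetic days; arrivals lead pick-ups by 1 h

hour = np.arange(n_hours) % 24
weekday = (np.arange(n_hours) // 24) % 7
arrivals = rng.poisson(20 + 10 * np.sin(2 * np.pi * hour / 24)).astype(float)

# Synthetic ground truth: ridership follows lagged arrivals plus a weekend bump.
lagged = np.roll(arrivals, lag)
ridership = 3.0 * lagged + 15.0 * (weekday >= 5) + rng.normal(0, 2, n_hours)

# Design matrix: intercept, lagged arrivals, hour-of-day and day-of-week dummies
# (first category dropped to avoid collinearity with the intercept).
X = np.column_stack([
    np.ones(n_hours),
    lagged,
    np.eye(24)[hour][:, 1:],
    np.eye(7)[weekday][:, 1:],
])[lag:]                     # drop the rows whose lag wrapped around
y = ridership[lag:]

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r2 = 1 - np.sum((y - X @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
```

Here `beta[1]` is the estimated coefficient of the lagged arrivals, corresponding to \(\beta _1\) in Eq. (2).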

4.2.2 Hybrid model

The hybrid model combines ordinary least squares (OLS) linear regression with an autoregressive integrated moving average (ARIMA) model. Since the plain linear regression model does not incorporate temporal relationships in ridership prediction, we combined it with ARIMA and introduced a new variable \(\mathrm{arPred}_t\)

$$\begin{aligned} \mathrm{arPred}_t= & {} \hat{\mathrm{RIDERSHIP}'_t} - \hat{\mathrm{RIDERSHIP}'_{t-1}}\nonumber \\ \end{aligned}$$
(3)
$$\begin{aligned} \mathrm{RIDERSHIP}'_t= & {} \mathrm{RIDERSHIP}_t - \mathrm{RIDERSHIP}_{t-1}\nonumber \\ \end{aligned}$$
(4)
$$\begin{aligned} \hat{\mathrm{RIDERSHIP}'_t}= & {} \mu + \phi _1 * \mathrm{RIDERSHIP}'_{t-1} \nonumber \\&+ ... + \phi _8 * \mathrm{RIDERSHIP}'_{t-8} \nonumber \\&+ \theta _1 * \epsilon _{t-1} + ... + \theta _4 * \epsilon _{t-4} \end{aligned}$$
(5)

where \(\phi \) are the parameters of the autoregressive part, \(\theta \) are the parameters of the moving average part, and \(\epsilon \) are error terms. The error terms \(\epsilon \) are independent and identically normally distributed with zero mean.

The ARIMA model is applied here to obtain the time series prediction \(\mathrm{arPred}_t\) from a linear combination of lagged RIDERSHIP values and of error terms that occurred at various times in the past. Then, we add \(\mathrm{arPred}_t\) to the previous OLS model as a new variable. The resulting regression equation is shown below

$$\begin{aligned} \hat{\mathrm{RIDERSHIP}'_t}= & {} \beta _0 + \beta _1 * \mathrm{ARRIVAL} \nonumber \\&+ \beta _2 * \mathrm{arPred}_t + \beta _3 * \mathrm{TimeofDay}_t\nonumber \\&+ \beta _4 * \mathrm{DayofWeek}_t. \end{aligned}$$
(6)

4.2.3 Random Forest

Ensemble learning has been successfully applied to improve the performance of regression and classification tasks. Two branches of ensemble learning in trees are boosting [63] and bagging [42]. In boosting, successive trees give extra weight to examples misclassified by earlier trees, and a weighted vote is taken for the final prediction. In bagging, all trees are constructed independently on bootstrapped samples of the data, and a simple majority vote (or an average, for regression) is taken for the final prediction. Random Forest [64] is a further modification of bagged trees, in which a random subset of the available features is considered when picking the best split for each node in the tree. Random Forests are less prone to overfitting, which is a major problem when modeling high-dimensional data. This is because a Random Forest is an ensemble of weak predictors and incorporates feature subsetting within the algorithm. Furthermore, Random Forests are capable of modeling nonlinear spatial and temporal dynamics.

We use Random Forests for nonlinear modeling for both edge level and aggregated space as they are relatively robust to overfitting which is a general hazard for predictive modeling based on high-dimensional data [65]. Furthermore, they have built-in support for multivariate regression, which we use for direct edge-level prediction using a single model. We perform a random grid search and n-fold cross-validation for the search of optimal model parameters. Our experiments uncovered that the data have strong temporal correlations which we exploited using time-lagged features. Grid search showed that the model performed the best when given ridership history of preceding 12 h as a part of lagged features.
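The lagged-feature setup can be sketched with scikit-learn as follows. The synthetic four-zone series, the helper `make_lagged`, and the fixed hyperparameters are assumptions for illustration; in our experiments the parameters are chosen by grid search and cross-validation rather than fixed as here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_lagged(series, n_lags):
    """Stack the previous n_lags rows of a (time, zones) array as features."""
    series = np.asarray(series, dtype=float)
    rows = [series[t - n_lags:t].ravel() for t in range(n_lags, len(series))]
    return np.array(rows), series[n_lags:]

rng = np.random.default_rng(2)
n_hours, n_zones, n_lags = 24 * 40, 4, 12    # 12 h of history per sample
hours = np.arange(n_hours)
daily = 50 + 30 * np.sin(2 * np.pi * hours / 24)
scales = np.array([1.0, 0.5, 0.8, 0.6])
ridership = daily[:, None] * scales + rng.normal(0, 2, (n_hours, n_zones))

X, y = make_lagged(ridership, n_lags)
split = int(0.8 * len(X))                    # chronological split, no shuffling

# One multi-output forest predicts all zones jointly.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:split], y[:split])

prediction = model.predict(X[split:])
residuals = y[split:] - prediction           # input to the anomaly detection stage
score = model.score(X[split:], y[split:])
```

Because every sample contains the recent history of all zones, a single forest can exploit cross-zone relationships when predicting each zone.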

4.2.4 LSTM

We designed an LSTM model for taxi zone-level ridership prediction. The model is composed of an aggregation layer, LSTM layers, and a disaggregation layer. The aggregation layer is a matrix multiplication that transforms the data from edge level to an aggregated community level; this layer is fixed and initialized to the 24-community partition. The LSTM layers consist of a stack of LSTM cells that consume the sequential data in community space and output the prediction in a low-dimensional latent space. The disaggregation layer is a linear transformation (a fully connected layer without nonlinearity) that projects the prediction to taxi zone space. The whole model is trained end to end using the Adam optimizer. Bayesian optimization is used to tune design parameters such as the sequence context length and the number of LSTM layers, as well as model hyperparameters such as the learning rate.
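The three-part architecture described above can be sketched in PyTorch as follows. The class name `ZoneLSTM`, the hidden size, the random community assignment, and the random training batch are hypothetical placeholders; the sketch shows the layer structure only and does not reproduce our tuned configuration.

```python
import torch
from torch import nn

class ZoneLSTM(nn.Module):
    def __init__(self, membership, hidden_size=32, num_layers=1):
        super().__init__()
        # membership: (n_zones, n_communities) 0/1 matrix. It is registered as
        # a buffer, not a parameter, so the aggregation layer stays fixed.
        self.register_buffer("membership", membership.float())
        n_zones, n_communities = membership.shape
        self.lstm = nn.LSTM(n_communities, hidden_size, num_layers, batch_first=True)
        self.disaggregate = nn.Linear(hidden_size, n_zones)  # no nonlinearity

    def forward(self, x):
        # x: (batch, sequence, n_zones) -> community space -> latent -> zones
        communities = x @ self.membership
        output, _ = self.lstm(communities)
        return self.disaggregate(output[:, -1])

torch.manual_seed(0)
n_zones, n_communities = 263, 24
assignment = torch.randint(0, n_communities, (n_zones,))   # placeholder partition
membership = nn.functional.one_hot(assignment, n_communities)

model = ZoneLSTM(membership)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

history = torch.rand(8, 12, n_zones)    # 8 windows of 12 hourly steps
target = torch.rand(8, n_zones)

# One end-to-end training step.
loss = nn.functional.mse_loss(model(history), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Only the LSTM and the disaggregation layer receive gradient updates in this sketch, since the aggregation matrix is not a trainable parameter.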

4.3 Anomaly detection

Residual analysis is a method for studying the residual between real data and estimated data in regression problems [66]. Generally, large errors are more likely to be anomalies. We applied a log transformation to bring the error distribution closer to normal and to eliminate the impact of the range of the real values [67]. We experiment with residuals with and without log transformation and report the best-performing version

$$\begin{aligned} \mathrm{error}'= & {} \mathrm{log}(\mathrm{Ridership} - {\hat{\mathrm{Ridership}}}\nonumber \\&+\mathrm{min}(\mathrm{Ridership}-{\hat{\mathrm{Ridership}}})). \end{aligned}$$
(7)

Anomaly detection is a special case of imbalanced classification. To avoid overfitting, we applied an unsupervised clustering algorithm to cluster patterns in ridership. A Gaussian mixture model (GMM) is a probability-based clustering model that assumes all data points are drawn from a combination of multiple Gaussian distributions with unknown parameters [68]. Anomalies are defined by a threshold on the rank of the log-likelihood returned by the GMM. We use an iterative model-fitting approach similar to expectation maximization. First, we fit a GMM on the entire residual dataset and obtain the likelihood of each data point. Then, we drop the data points below a likelihood threshold, refit the model on the remaining data points, and apply the refitted model to the whole residual dataset again. Finally, we select the data points whose likelihood is below the previous threshold. This iteration is repeated until the sets of data points selected before and after an iteration coincide. We experimented with 100 threshold values, ranging from the bottom 1% to the 99% percentile of likelihood. The threshold of log-likelihood rank used to divide days into anomalies and normal days is discussed in Sect. 5.3.
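The iterative fitting procedure can be sketched with scikit-learn's `GaussianMixture` as below. The function name, the interpretation of the threshold as a fixed fraction of lowest-likelihood points, the iteration cap, and the synthetic two-dimensional residuals are assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def iterative_gmm_anomalies(residuals, n_components=1, quantile=0.05, max_iter=20):
    """Flag the lowest-likelihood points, refitting on the unflagged points."""
    residuals = np.asarray(residuals, dtype=float)
    n_flag = max(1, int(round(quantile * len(residuals))))
    flagged = np.zeros(len(residuals), dtype=bool)   # first fit uses all points
    for _ in range(max_iter):
        gmm = GaussianMixture(n_components=n_components, random_state=0)
        gmm.fit(residuals[~flagged])
        # Score the whole dataset with the model fitted on unflagged points.
        log_likelihood = gmm.score_samples(residuals)
        new_flagged = np.zeros_like(flagged)
        new_flagged[np.argsort(log_likelihood)[:n_flag]] = True
        if np.array_equal(new_flagged, flagged):     # selection has converged
            break
        flagged = new_flagged
    return flagged, log_likelihood

rng = np.random.default_rng(3)
normal_days = rng.normal(0.0, 1.0, size=(195, 2))
surge_days = rng.normal(8.0, 0.5, size=(5, 2))
residuals = np.vstack([normal_days, surge_days])

flagged, log_likelihood = iterative_gmm_anomalies(residuals, quantile=0.025)
```

Refitting without the flagged points keeps extreme residuals from inflating the estimated covariance, so that the model of normal behavior is not distorted by the anomalies it is meant to expose.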

For the selection of the number of GMM components, we run models with 1–5 components on each residual dataset and use the Bayesian Information Criterion (BIC) to select the optimal number. According to the scikit-learn user guide [69], model selection for Gaussian mixture models can be based on information-theoretic criteria, and the BIC is a better choice. We therefore use the number of components that yields the lowest BIC score for the subsequent iterated GMM. The optimal number of GMM components varies with hubs and prediction methods, and the final choice of this hyper-parameter is available in Sect. 5.2.
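A minimal sketch of this selection step is shown below. The helper name, the use of `n_init=3`, and the synthetic two-cluster data are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_components_by_bic(data, candidates=range(1, 6)):
    """Fit one GMM per candidate size and return the size with the lowest BIC."""
    data = np.asarray(data, dtype=float)
    scores = {}
    for k in candidates:
        gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(data)
        scores[k] = gmm.bic(data)
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic residuals drawn from two well-separated Gaussian clusters.
rng = np.random.default_rng(4)
data = np.vstack([
    rng.normal(-5, 1, size=(300, 2)),
    rng.normal(5, 1, size=(300, 2)),
])
best_k, bic_scores = select_components_by_bic(data)
```

The BIC penalizes each additional component through its extra parameters, so larger mixtures are chosen only when they improve the likelihood substantially.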

4.4 Evaluation

Since we are predicting ridership in multiple dimensions, the performance evaluation metrics are a series of \(R^2\) values, one for each dimension. To obtain a single \(R^2\) that describes the overall performance of a prediction model across the whole city, we average these values. Instead of a uniformly weighted average, we use the variance-weighted \(R^2\) to evaluate all the traffic prediction models. Since it is not possible to fairly compare the models at different aggregation levels directly, we use disaggregation to get taxi zone-level predictions for all models, enabling one-to-one out-of-sample comparison. To evaluate the anomaly detection performance, we use the area under the precision-recall curve, as it is preferred over the area under the ROC curve (AUROC) when dealing with a highly imbalanced dataset [70], which is true in our case, where anomalies are very sparse. Precision and recall are defined as

$$\begin{aligned} \mathrm{Precision}= & {} \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}} \end{aligned}$$
(8)
$$\begin{aligned} \mathrm{Recall}= & {} \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}}. \end{aligned}$$
(9)
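The evaluation metrics of this section can be illustrated with a short pure-Python sketch. All function names are hypothetical. The text does not state how the area under the precision-recall curve is computed, so the step-wise average precision used here is an assumption, as is the absence of tied scores; the variance-weighted \(R^2\) assumes every dimension has non-zero variance.

```python
def variance_weighted_r2(y_true, y_pred):
    """Return (variance-weighted R^2, list of per-dimension R^2).

    y_true, y_pred: lists of rows, one row per time step and one column
    per dimension. Assumes every dimension has non-zero variance.
    """
    n = len(y_true)
    r2_per_dim, variances = [], []
    for j in range(len(y_true[0])):
        actual = [row[j] for row in y_true]
        predicted = [row[j] for row in y_pred]
        mean = sum(actual) / n
        ss_tot = sum((a - mean) ** 2 for a in actual)
        ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
        r2_per_dim.append(1.0 - ss_res / ss_tot)
        variances.append(ss_tot / n)
    # Each dimension's R^2 is weighted by that dimension's variance.
    weighted = sum(v * r for v, r in zip(variances, r2_per_dim)) / sum(variances)
    return weighted, r2_per_dim


def precision_recall(labels, flagged):
    """Precision and recall as in Eqs. (8) and (9) for binary labels."""
    tp = sum(1 for y, f in zip(labels, flagged) if y and f)
    fp = sum(1 for y, f in zip(labels, flagged) if not y and f)
    fn = sum(1 for y, f in zip(labels, flagged) if y and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    scores: higher means more anomalous (e.g. negative log-likelihood).
    Assumes no tied scores (an assumption of this sketch).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(1 for y in labels if y)
    tp, area = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            # Precision at this hit times the recall increment 1 / n_pos.
            area += (tp / rank) / n_pos
    return area
```

With the variance weighting, a dimension with large ridership fluctuations contributes more to the overall score than a nearly constant one. In scikit-learn, `r2_score(..., multioutput='variance_weighted')` and `average_precision_score` provide comparable quantities.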

5 Results and discussion

The prediction models are evaluated first. The model with the best performance in total demand prediction is then applied, in combination with different aggregation methods, to estimate edge-level demand and to perform anomaly detection.

5.1 Demand prediction

Table 1 Total outgoing variance weighted \(R^2\)

Random Forest and LSTM give the best results for total outgoing ridership prediction at each hub (Table 1). Random Forest models capture nonlinear temporal dynamics and spatial dependencies in the dataset. Furthermore, they are an ensemble of multiple regressors trained using bootstrapping, which increases their generalizability to unseen data. LSTM, in turn, is inherently designed for time-series data. Hence, both were able to consistently outperform linear regression.

Table 2 Variance weighted \(R^2\) from different pipelines

Table 2 compares the performance of different pipelines for edge-level prediction. However, not all approaches are compatible or reasonable to combine, for example, LSTM–PCA–inverse PCA and Random Forest–community–learnable disaggregation. PCA and inverse PCA are not combined with an LSTM, because a neural network can naturally perform dimensionality reduction and reconstruction. Moreover, we have an LSTM pipeline that takes community-based aggregated input, uses an LSTM to predict, and disaggregates the prediction results to edge level using a high-dimension output layer. This output-layer disaggregation method cannot be combined with RF, because the layer must be connected to the LSTM prediction layers and trained together with them.

Our proposed LSTM model outperforms RF in prediction at the same aggregation level: the 24-community LSTM is better than the 24-community RF, and the taxi zone-level LSTM is better than the taxi zone-level RF. This shows that an autoregressive nonlinear model is better than a nonlinear model that uses lagged variables.

Modeling in aggregated space and disaggregating to edge level produces better results than direct modeling in taxi zone space. This validates our hypothesis that dimension reduction is vital for modeling high-dimensional transportation networks and shows that our proposed framework of dimensionality reduction, modeling, and dimension expansion is better than direct edge-level modeling. Additionally, we note that other aggregation schemes may not work as well as PCA and learnable aggregation because the disaggregation technique adopted for them was too simple: learnable disaggregation and inverse PCA are better disaggregation steps than weighted disaggregation, resulting in better taxi zone-level \(R^2\).

Comparing our results with two related works shows that our model outperforms previous methods. In Short-Term Forecasting of Passenger Demand under On-Demand Ride Services: A Spatio-Temporal Deep Learning Approach [32], Hangzhou (a major city in China) is divided into a \(7\times 7\) grid. As the area of Hangzhou is 6505 \(\mathrm{mi}^2\) while New York City covers only 307 \(\mathrm{mi}^2\), we compared their results with our 24-community aggregated prediction results. Besides, as the total ridership in Hangzhou is not reported in that paper, we cannot compare RMSE or MAE, and only \(R^2\) is applicable. The highest \(R^2\) reported in that work is 0.820. Our model performance differs between transportation hubs, but it is higher at every hub: 0.865 at JFK and 0.929 at LGA. Another related work is Taxi Demand Prediction Using Parallel Multi-Task Learning Model [34], which predicts NYC taxi ridership at taxi zone level. The difference is that they trained the model on the first 10 months and forecast the following 2 months at a 2-hour temporal granularity, while we only predict the next hour based on external data and ridership in the past 12 h. They did not provide \(R^2\), but as we have the same research area, MAE and RMSE are applicable in this case. In their best model, the average MAE is 14.499 and the RMSE is 22.441, whereas our highest MAE is 1.2 and our highest RMSE is only 3.2. While it is hard to find previous work with identical spatial and temporal granularity for a model performance comparison, our models still show superior prediction performance.

5.2 Anomaly detection

Table 3 shows the normalized area under the precision–recall curve at JFK and LGA. Random selection refers to randomly labeling each observation as an anomaly, that is, a model that fails to capture any signal from the dataset. It serves as a reference point for comparing the performance of our different methodologies, and the scores of the other methods are normalized by its value.

Residuals from Random Forest based on the 24-community aggregated datasets achieve the best performance in anomaly detection at both JFK and LGA. Based on the BIC results, the number of GMM components is 3 for anomaly detection in both JFK and LGA taxi ridership. This establishes the superiority of topological aggregation over crude spatial aggregation and linear dimension reduction. This result makes sense, as the geospatial locations and connections define interactions between taxi zones. For example, even though LGA is located in Queens, the community detection results show that it has a stronger bond with midtown Manhattan than with Queens. Besides, high-dimension residuals outperform low-dimension residuals, as fluctuations caused by anomalies can be diluted in highly aggregated datasets. Furthermore, residuals from prediction models with high \(R^2\) values do not guarantee better performance in anomaly detection. For example, in the 24-community modeling space, LSTM beats Random Forest in prediction but does not perform as well in anomaly detection. A potential reason is that LSTM captures more pre-anomaly signals and adapts quickly to changes in the temporal trend, which leaves smaller residuals on anomalous days and makes them harder to detect. Community detection aggregates networks based on topology instead of arbitrary administrative regions and generates a dataset more likely to capture anomalies in networks.

Table 3 Area under precision–recall curve at three transportation hubs

5.3 Spatial footprint of anomalies

We aim at detecting interpretable and distinctive spatial patterns of anomalies. Anomalies have different types and different scopes of influence. An event could lead to overestimation or underestimation of the prediction, and the effect could be global or local. The following methods are based on outputs from Comm24-RandomForest pipeline as it has the best performance in anomaly detection.

The first step is to find a reasonable threshold from the anomaly detection framework to label anomalies and regular days. The framework runs 100 iterations in total, with the likelihood threshold that divides anomalous days from normal days varying from 0.01 to 1.00 across iterations. To determine a proper threshold, we compare the mean absolute error of the taxi-zone-level predictions between anomalies and ordinary days (Appendix B, Figs. 8 and 9). The threshold at JFK is 13%, which means that days in the bottom 13% of likelihood are labeled as anomalies, and the threshold at LGA is 5%. In total, JFK has 47 anomalous days, and LGA has 18.
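The comparison behind the threshold choice can be sketched as below. This is only an illustration under assumptions: the function name `mae_by_threshold` and its inputs (one log-likelihood and one mean absolute error per day) are hypothetical, and the sketch only tabulates the two group averages for each candidate threshold; the rule used to pick the final threshold from such a table is not specified here.

```python
def mae_by_threshold(log_lik, daily_mae, fractions):
    """Compare the MAE of flagged and unflagged days for each threshold.

    log_lik:   one log-likelihood per day (lower means more anomalous).
    daily_mae: one mean absolute prediction error per day.
    fractions: candidate thresholds, e.g. 0.13 for the bottom 13% of days.
    Returns rows of (fraction, n_flagged, mae_anomalies, mae_normal).
    """
    n = len(log_lik)
    order = sorted(range(n), key=lambda i: log_lik[i])  # least likely first
    rows = []
    for frac in fractions:
        k = round(frac * n)  # number of days flagged at this threshold
        if k == 0 or k == n:
            continue  # one of the two groups would be empty
        flagged, normal = order[:k], order[k:]
        mae_anom = sum(daily_mae[i] for i in flagged) / k
        mae_norm = sum(daily_mae[i] for i in normal) / (n - k)
        rows.append((frac, k, mae_anom, mae_norm))
    return rows
```

A threshold at which the flagged days show a clearly larger prediction error than the remaining days separates anomalies from ordinary days in the sense used above.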

The next step is to investigate the spatial footprints of the anomalies. We applied the Gaussian Mixture Model to cluster the detected anomalous days based on community-level residuals from Comm24-Random Forest. As in anomaly detection, the number of GMM components is the one that returns the lowest BIC. At JFK, we have 4 different types of anomalies: 10 anomalies in type 0, 13 in type 1, and 11 each in type 2 and type 3. LGA anomalies are divided into two types, each with 9 anomalies.

Fig. 2 Normalized residual distribution at JFK

Fig. 3 Consistency distribution at JFK

Table 4 Detected events in JFK ridership

To measure the impact of the different types of anomalies on prediction, we calculated the normalized residual and the consistency:

$$\begin{aligned} R' = \frac{T-T'+1}{T+1}, \end{aligned}$$
(10)

where \(R'\) is the normalized residual, T is the actual ridership, and \(T'\) is the predicted ridership.

$$\begin{aligned} C = \frac{D'}{D}, \end{aligned}$$
(11)

where C is the consistency and D is the number of anomalous days in the cluster. For one community, if the average of the residuals from one cluster of anomalies is greater than 0, then \(D'\) is the number of days on which the normalized residuals are positive, and if the average of the residuals is below 0, then \(D'\) is the number of days on which the normalized residuals are negative.
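Equations (10) and (11) can be illustrated with a small pure-Python sketch. The function names are hypothetical, D is taken to be the number of anomalous days in the cluster, and an average of exactly 0 is treated like the negative case, which is an assumption of this sketch.

```python
def normalized_residual(actual, predicted):
    """Eq. (10): R' = (T - T' + 1) / (T + 1)."""
    return (actual - predicted + 1) / (actual + 1)


def consistency(residuals):
    """Eq. (11): share of days whose sign agrees with the cluster average.

    residuals: normalized residuals of one community over the days of
    one anomaly cluster. D is the number of days in the cluster
    (an assumption of this sketch).
    """
    d = len(residuals)
    mean = sum(residuals) / d
    if mean > 0:
        d_prime = sum(1 for r in residuals if r > 0)
    else:
        # An average of exactly 0 is handled like the negative case.
        d_prime = sum(1 for r in residuals if r < 0)
    return d_prime / d
```

For example, a community with five anomalous days and normalized residuals 0.6, 0.2, -0.1, 0.4, and 0.3 has a positive average, and four of the five days are positive, so its consistency is 0.8.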

We observe four types of anomalies for taxi+FHV ridership at JFK and two at LGA. Figures 2 and 3 show the direction and consistency of the impact of the different clusters of anomalies on JFK ridership. At first glance, we conclude that, on average, type 1 anomalies lead to underestimation across the whole of NYC, while type 0, type 2, and type 3 anomalies have different impact directions in different communities. Also, type 3 anomalies have a consistent impact in most communities, while for the other types, effects are consistent only in some communities. For example, at least 80% (8 days out of 10) of type 0 anomalies lead to overestimation at East Staten Island, Long Beach, Crown Heights, Astoria, Long Island City, and Bronx. In contrast, anomalies of the same type have different impact directions at other locations. For the type 1 anomalies (13 days), the underestimating impact holds on at least 11 out of 13 days at West Staten Island, Long Beach, South Brooklyn, East Queens, and South Bronx; however, their impact is not consistent in all communities. Type 2 anomalies are consistent in the fewest communities, only at East Staten Island, Long Beach, and downtown Brooklyn.

We also explore the temporal patterns in clustered anomalies: consecutive days are more likely to be clustered into the same group. Tables 3 and 4 show the contribution of different types of anomalies in different clusters. Besides our labeled events such as federal holidays, the detected abnormalities include observed holidays such as the day before or after Christmas Day. It can be observed that certain types of events contribute more to some clusters. A future step is to examine the exact events in these clusters to generate a group of representative scenarios for travel demand anomalies.

6 Conclusion

We evaluate prediction and anomaly detection methods for taxi ridership with the destination in two transportation hubs in New York City: JFK and LGA airports. Our study emphasizes that transportation data have strong nonlinear temporal and spatial dependencies, and hence the nonlinear Random Forest prediction model outperforms the baseline linear model. Furthermore, we find that leveraging LSTM deep learning techniques can improve spatio-temporal traffic modeling.

Our work addresses the challenge of modeling high-dimensional network data. The proposed pipeline approach for predictive ridership modeling utilizes appropriate spatial aggregation, nonlinear modeling, and subsequent disaggregation back to the original spatial scale. In this way, it outperforms direct ridership modeling at the original spatial resolution, which leads to poor anomaly detection because it does not consider the underlying network topology. The comparison of different aggregation techniques showed that network community-based aggregation performs the best for prediction as well as subsequent anomaly detection, highlighting the importance of accounting for the network topology.

We consider the assumptions used in preparing the list of the labeled anomalous events as a potential limitation of this study. Since the complete ground truth data for the anomalies are not present, our analysis assumes that the collected events should represent the major anomalies in the mobility network. Besides, as our list of anomalous events is not exhaustive, detecting anomalies other than those listed does not necessarily mean a shortcoming of the approach. Therefore, our ability to evaluate the anomaly detection performance is limited by the comprehensiveness of the labeled anomalous events.

Table 5 Summary of aggregated outgoing traffic data sets
Table 6 Arrival record dataset at transport hubs
Table 7 Events datasets’ description
Table 8 Hybrid model variables explanation

We have observed that our proposed pipeline framework of dimensionality reduction, modeling, and disaggregation outperforms the direct modeling and anomaly detection for the original transportation network. Furthermore, we note that when we jointly optimize the modeling and disaggregation step using gradient descent in our pipeline LSTM model, it results in superior performance. This might warrant further study considering a model where all three stages, including the aggregation, are jointly trained for optimizing prediction accuracy. Further evaluations of the modeling and anomaly detection methodology could also involve the graph neural network embedding of the mobility network [https://arxiv.org/abs/2105.03388].