Data-driven traffic congestion patterns analysis: a case of Beijing

Li, Xiang; Gui, Jiao; Liu, Jiaming

doi:10.1007/s12652-022-04409-4

Original Research
Published: 27 September 2022

Volume 14, pages 9035–9048 (2023)
Journal of Ambient Intelligence and Humanized Computing

Abstract

With the rapid growth of urban populations and motor vehicle numbers, traffic congestion is becoming increasingly serious in megacities. This paper aims to identify traffic congestion patterns and analyze their spatial–temporal variations by conducting cluster analysis on daily traffic congestion index curves. First, since sampling points in different time segments are not equally important, the coefficient of variation is used to assign weights and thereby improve the K-means clustering algorithm. The improved weighted K-means clustering algorithm is proposed to identify the representative traffic congestion patterns. Second, the paired t-test method is used to analyze the spatial and temporal variations of traffic congestion patterns. Finally, case studies are conducted based on real-life traffic congestion index data in Beijing, including over 670,000 records covering six districts from January 1, 2017 to December 31, 2017. The results illustrate that traffic congestion patterns are both temporally and spatially dependent, and that the automobile license plate restriction has a significant influence on the traffic congestion patterns. This study could be instructive for formulating specific traffic optimization and control schemes to alleviate congestion and promote the balance of traffic conditions.

1 Introduction

Traffic congestion is a phenomenon in which the load on urban roads exceeds the designed capacity of the traffic system; it occurs especially during commuting peaks and in poor weather conditions. Traffic congestion is a bane of urban life, especially in megacities: it significantly increases the travel cost for residents (Ke et al. 2020), causes more traffic accidents (Retallack et al. 2019), and makes traffic management extremely difficult (Praveen et al. 2021).

With a large amount of data acquisition equipment densely distributed in the road network, it is possible to assess traffic characteristics by using the collected high-volume, real-time, and high-accuracy data from multiple autonomous sources (Wu et al. 2014). In the case of traffic congestion, this information includes GPS data, map application data, data from massive numbers of sensors, and so on. Big data offers advantages over conventional data sources in terms of volume, velocity, variety, and veracity (Yaqoob et al. 2016). After effective research and analysis, it can reveal potential insights for smart cities (Chauhan et al. 2016). Therefore, it is highly important to analyze traffic congestion and its characteristics via big data technology for better traffic control and management.

Traffic congestion patterns refer to the data curves of traffic congestion index in one day with different curve characteristics (Zhao and Hu 2019). The grasp of urban traffic congestion patterns and their spatial–temporal evolution characteristics is instrumental to the accurate prediction of traffic situation and information provision for urban residents to optimize their daily travel decisions. From the macro perspective of urban management, it can provide the basis for road construction and city planning (Torkjazi et al. 2018). At the same time, understanding the evolution trend of urban traffic situation is helpful to judge and forecast the level and direction of the regional economic development (Li et al. 2019).

The analytical framework of this study is shown in Fig. 1. First, a linear interpolation method is used to fill in the missing values and a 2-sigma rule is used to identify and modify the outliers during data preprocessing. Second, an improved weighted K-means clustering method is proposed to identify the traffic congestion patterns, which applies a weighting operation to the daily traffic congestion index data before conducting the clustering process. Finally, the spatial–temporal variations of traffic congestion patterns arising from the space difference, the time difference, and the automobile license plate restriction are analyzed.

The main contributions of this research are summarized as follows. By modifying the typical K-means clustering method, a novel clustering method for time series data is proposed to identify the traffic congestion patterns. Based on the real-life traffic congestion index data in Beijing, paired t-tests are carried out, which reveal that the traffic congestion patterns are both spatially dependent (there are significant differences in the number and shape of traffic congestion patterns in different regions) and temporally dependent (the variations of dates and the automobile license plate restriction both impact the traffic congestion patterns). This work strengthens the understanding of urban traffic congestion patterns and their spatial–temporal characteristics, which is helpful for the accurate prediction of traffic situations and for precise decisions in traffic operations management.

The remainder of this work is organized as follows. Section 2 reviews the related research. Section 3 describes the traffic congestion index data in Beijing. Section 4 introduces the data preprocessing process. Section 5 presents the methodologies used in this research. Section 6 shows the identification results for traffic congestion patterns and their spatial–temporal characteristics. Section 7 concludes the paper with a brief summary and gives some potential directions for future research.

2 Literature review

In recent years, scholars have paid increasing attention to traffic congestion forecasting from the perspectives of traffic flow (Angayarkanni et al. 2021), traffic velocity (Jiang et al. 2021), delay time (Shelke et al. 2019), traffic cost (Tian et al. 2010), traffic congestion index (Wang et al. 2018), and so on. Here the traffic congestion index is a comprehensive and integrated indicator, defined as a conceptual value that synthetically reflects the traffic conditions. A higher traffic congestion index corresponds to heavier traffic congestion. Su et al. (2019) considered the total number of vehicles in the system varying over time and proposed a dynamic stochastic differential model to describe traffic flow based on Markov chain theory. Using traffic flow data from the I-80 Freeway Dataset of the NGSIM program, they showed that the proposed approach provided more accurate predictions of traffic flow. Sanchez-Cambronero et al. (2017) took advantage of the plate scanning technique to propose an algorithm that minimizes the required number of registering devices and selects their locations in order to identify candidate vehicles for computing and predicting the travel times of a given set of routes (or sub-routes). Wang et al. (2017) pointed out that PageRank values can act as signals for predicting upcoming traffic congestion, and observed this behavior experimentally based on one month of trajectory data from 12,000 taxis in Beijing.

In addition, the existing literature has carried out a large number of analyses of traffic congestion characteristics. For example, by computing urban traffic evolution on a temporal complex network with PageRank, Wang et al. (2017) found that the congestion degree of a local region is affected not only by the traffic states of its neighboring regions but also by those of the whole network. ShirMohammadi et al. (2020) analyzed the traffic density, congestion index and peak hours for the main network of Hamedan communication routes based on collected speed performance data, and simulated the relationship between traffic velocity and congestion index by using a neural network and a genetic algorithm. Kan et al. (2019) proposed a traffic feature analysis and classification approach to detect traffic congestion from taxis’ GPS trajectories at the turn level. The case study in Wuhan supported the feasibility of this approach and showed that it can sense traffic congestion at a lower cost than other approaches. Chen et al. (2021a, b) proposed a new categorization criterion that defines five levels of traffic conditions based on speed performance index values, and applied it in a case study to investigate the daily curve of speed performance index data in Beijing. It was found that the curves vary significantly in shape on different days. Other research detected traffic congestion characteristics from other perspectives, and the results likewise illustrated significant differences across days (Kim et al. 2019).

Traffic congestion patterns refer to the data curves of traffic congestion index in one day with different curve characteristics (Zhao and Hu 2019). The existing literature mainly concerns traffic congestion forecasting and the analysis of traffic congestion characteristics, while analysis of traffic congestion patterns is very rare. Wen et al. (2014) selected eight evaluation indices of traffic congestion in morning and evening peak hours, and then proposed a hierarchical clustering analysis method to classify the pattern characteristics of traffic congestion. The results revealed that weekdays included Normal Weekdays, Key Congested Weekdays, and Most Congested Weekdays. Sun et al. (2019) adopted a hierarchical clustering algorithm to study the congestion patterns in Qingdao based on traffic performance index (TPI) data. The results showed that there were three categories of traffic congestion pattern: Workdays, Latter half of vacation (October 4th–8th), and Weekends and the beginning of vacation (October 1st–3rd). Based on the macro traffic congestion index data in Beijing, Zhao and Hu (2019) revealed that there were two typical traffic congestion patterns on weekdays by applying K-means cluster analysis, i.e., weekday mode A and weekday mode B. The former often appeared on Mondays and was characterized by obvious morning and evening peaks with similar congestion duration, while the latter often appeared on Fridays and was characterized by an evening peak that was significantly higher and longer than the morning peak.

The above research is enlightening for urban traffic management at the strategic level, but does not answer the following questions at the operational level: (1) is the traffic congestion pattern spatially dependent, and should spatially differentiated traffic congestion management therefore be carried out? (2) is the traffic congestion pattern temporally dependent, and should temporally differentiated traffic congestion management therefore be carried out? The motivation of this research is to answer these questions by using the micro traffic congestion index data, which could provide more valuable information for traffic management, planning and policy-making. A comparative analysis of the existing literature and this research is shown in Table 1.

Table 1 Comparative analysis of related works

3 Traffic congestion index data

The traffic congestion index is a conceptual value that synthetically reflects road traffic conditions (Zhao et al. 2019), and it has been widely studied as an urban traffic situation indicator in the literature (Wen et al. 2014; Sun et al. 2019; ShirMohammadi et al. 2020). As the capital of China, Beijing is a typical megacity, with 21.89 million permanent residents and 6.57 million motor vehicles by the end of 2020. The road network of Beijing is a ring road system radiating to the urban districts. Since 2006, Beijing has used the traffic congestion index as the core evaluation indicator of traffic conditions, and publishes the real-time traffic congestion index to the public through the Internet and apps (Footnote 1). As shown in Fig. 2, the traffic conditions are divided into five grades as the traffic congestion index $R$ ranges from 0 to 10, that is, no congestion $\left( 0 \le R < 2 \right)$, less congestion $\left( 2 \le R < 4 \right)$, congestion $\left( 4 \le R < 6 \right)$, medium congestion $\left( 6 \le R < 8 \right)$, and serious congestion $\left( 8 \le R \le 10 \right)$. The higher the traffic congestion index, the heavier the traffic congestion (Wang et al. 2017). For example, the traffic congestion index was 6.6 at time segment [10:00, 10:05) on May 26, 2021, which corresponds to medium congestion. In this study, we collect the traffic congestion index data from January 1, 2017 to December 31, 2017, including over 670,000 records covering six urban districts (Dongcheng, Xicheng, Chaoyang, Haidian, Fengtai, and Shijingshan). The sampling step of the recorded data is five minutes, which means that the system records one data point every five minutes. As a result, the whole day (0:00–24:00) is partitioned into 288 time segments, each 5 min long. An example of the recorded data is shown in Table 2.
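The grading thresholds and the 5-min segmentation described above can be sketched in a few lines of Python. This is only an illustration: the function names `congestion_grade` and `segment_index` and the constant `SEGMENTS_PER_DAY` are hypothetical and do not come from the Beijing publishing system.

```python
def congestion_grade(r: float) -> str:
    """Map a congestion index R in [0, 10] to one of the five grades."""
    if not 0 <= r <= 10:
        raise ValueError("congestion index must lie in [0, 10]")
    if r < 2:
        return "no congestion"
    if r < 4:
        return "less congestion"
    if r < 6:
        return "congestion"
    if r < 8:
        return "medium congestion"
    return "serious congestion"


def segment_index(hour: int, minute: int, step_min: int = 5) -> int:
    """Zero-based index of the time segment that contains hour:minute."""
    return (hour * 60 + minute) // step_min


# Number of 5-min segments in one day (24 * 60 / 5).
SEGMENTS_PER_DAY = 24 * 60 // 5

print(congestion_grade(6.6))   # grade of the example value quoted in the text
print(segment_index(10, 0))    # segment that starts at 10:00
```

The half-open intervals in the text map directly to the chained `<` comparisons, with the last grade closed at 10.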

Table 2 An example of traffic congestion index data in Beijing

4 Data preprocessing

Due to mechanical failure or human error, missing values and outliers inevitably occur in the raw traffic congestion index data. As a result, data preprocessing is necessary before conducting the data analysis. For the Beijing traffic congestion index data, the ratio of missing values is around 2.13%, and more than 80% of them appear as single missing values. An example of a missing value is shown in Table 3, in which the sample with ID 61,094,687 takes the congestion index value “Nan”, indicating that the congestion index at time segment [7:50, 7:55) is missing. An outlier in a dataset is an observation whose value lies far away from the other observations. In Table 4, the traffic congestion index between 3:45 a.m. and 3:50 a.m. in Shijingshan district is generally less than 1.5 from January 9, 2017 to January 18, 2017, while it suddenly rises to 9.8 on January 13, 2017, which could be considered an outlier. Missing values and outliers in the time series may distort the shape of traffic congestion patterns; therefore, the missing values should be filled in and the outliers modified first.

Table 3 An example of missing value

Table 4 An example of outlier

In the literature, a great number of methods have been developed for filling in missing values. The linear interpolation method (Lu et al. 2003) is commonly used for short runs of missing values, while the empirical orthogonal function (Beckers et al. 2003), the Gamma distribution function (Simolo et al. 2009), and the autoregressive model (Kim et al. 2015) are more appropriate for long runs of missing values. For the traffic congestion index data, since only single or short runs of missing values are identified, the linear interpolation method is adopted, which has been widely used in the preprocessing of transportation data (Degen et al. 2007; Zhao et al. 2019; Sun et al. 2021). Assume that missing values are detected at successive time segments $i = 1,2, \cdots ,I$, while $x_{0}$ is the recorded congestion index at time segment $i = 0$ and $x_{I + 1}$ is the observed congestion index at time segment $i = I + 1$. The linear interpolation method approximates the missing values $x_{i}$ as follows

$$x_{i} = x_{0} + \frac{i}{I + 1} \times \left( x_{I + 1} - x_{0} \right), \quad \forall \, i = 1,2, \cdots ,I.$$

(1)

Taking Table 3 for example, there is only one missing value detected at time segment [7:50, 7:55). In this case, we have $I = 1$ and $x_{1} = \left( {x_{0} + x_{2} } \right)/2$. The filled congestion index at time segment [7:50, 7:55) should be $\left( {7.7 + 8.1} \right)/2 = 7.9.$
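A minimal Python sketch of Eq. (1) follows. It assumes missing values are encoded as NaN and that gaps at the very start or end of a series are left untouched, since Eq. (1) needs an observed neighbour on both sides; the function name `fill_missing_linear` is hypothetical.

```python
import math


def fill_missing_linear(series):
    """Fill interior runs of NaN by linear interpolation, following Eq. (1)."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if not math.isnan(filled[i]):
            i += 1
            continue
        start = i                      # first missing position of the run
        while i < n and math.isnan(filled[i]):
            i += 1
        end = i                        # first observed position after the run
        if start == 0 or end == n:
            continue                   # no neighbour on one side: leave as NaN
        x0, x_next = filled[start - 1], filled[end]
        gap = end - start              # I in Eq. (1)
        for j in range(1, gap + 1):
            filled[start - 1 + j] = x0 + j / (gap + 1) * (x_next - x0)
    return filled


# The single-gap case from the text: neighbours 7.7 and 8.1.
print(fill_missing_linear([7.7, float("nan"), 8.1]))
```

For a single missing value the loop reduces to the midpoint of the two neighbours, which is the case worked out in the text.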

The detection and modification of outliers in time series are key steps in data preprocessing. The existing methods mainly include the 2-sigma rule (Li et al. 2015), the 3-sigma rule (Klos et al. 2015), maximum likelihood estimation (Lee et al. 2006), the Bayesian method (Kruschke et al. 2012), and the multilevel model (Shi et al. 2008). To detect outliers sensitively and modify them conveniently, the 2-sigma rule is employed to handle the traffic congestion index data before feeding them into the clustering algorithm; this rule has been widely used in transportation data analysis and has achieved good performance (Li et al. 2015). Denote the daily time series of the traffic congestion index as an M-dimensional vector, where M is the number of sampling points each day. For example, if the sampling step is 5 min, then there are 12 sampling points each hour and 288 sampling points each day. In this case, we have M = 288. Denote by N the number of observation days and by $x_{n}^{m}$ the observed congestion index at sampling time segment m on the $n$th day. Then the traffic congestion index can be written as

$$X_{n} = \left( {x_{n}^{1} ,x_{n}^{2} , \cdots ,x_{n}^{M} } \right), \, \forall \, n = 1,2, \cdots ,N.$$

(2)

The intraday trend $\overline{X}$ among these observation days, which represents the average value of daily traffic congestion index, can be formulated as

$$\overline{X} = \left( \overline{x}^{1}, \overline{x}^{2}, \cdots, \overline{x}^{M} \right) = \left( \frac{1}{N}\sum\limits_{n = 1}^{N} x_{n}^{1}, \frac{1}{N}\sum\limits_{n = 1}^{N} x_{n}^{2}, \cdots, \frac{1}{N}\sum\limits_{n = 1}^{N} x_{n}^{M} \right).$$

(3)

The residual fluctuations of the $n$th day are

$$r_{n} = X_{n} - \overline{X} = \left( r_{n}^{1}, r_{n}^{2}, \cdots, r_{n}^{M} \right), \quad \forall \, n = 1,2, \cdots ,N.$$

(4)

Finally, the sample standard deviation $\sigma^{m}$ is calculated as the root mean square of the residuals $r_{1}^{m}, r_{2}^{m}, \cdots, r_{N}^{m}$, for $m = 1,2, \cdots ,M$. A point $x_{n}^{m}$ is defined as an outlier if the absolute residual $\left| r_{n}^{m} \right|$ is greater than twice the sample standard deviation $\sigma^{m}$. In this case, the observation value $x_{n}^{m}$ is replaced by $\overline{x}^{m} - 2\sigma^{m}$ or $\overline{x}^{m} + 2\sigma^{m}$, whichever bound it violates. Otherwise, it is regarded as a regular point and its value is kept unchanged. The outlier detection and modification procedure is as follows:

$$x_{n}^{m} = \begin{cases} \overline{x}^{m} + 2\sigma^{m}, & \text{if } r_{n}^{m} > 2\sigma^{m}, \\ x_{n}^{m}, & \text{if } -2\sigma^{m} \le r_{n}^{m} \le 2\sigma^{m}, \\ \overline{x}^{m} - 2\sigma^{m}, & \text{if } r_{n}^{m} < -2\sigma^{m}, \end{cases} \quad \forall \, n = 1,2, \cdots ,N, \; m = 1,2, \cdots ,M.$$

(5)
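The 2-sigma rule of Eqs. (3) to (5) can be sketched as follows. The sketch assumes the data are held as a list of N daily curves of equal length M, and that the standard deviation divides by N as in Eq. (10); the function name `clip_two_sigma` and the toy numbers are illustrative and are not taken from Table 4.

```python
import math


def clip_two_sigma(days):
    """Apply Eq. (5) segment by segment to N daily curves of length M."""
    n_days, n_seg = len(days), len(days[0])
    cleaned = [list(day) for day in days]
    for m in range(n_seg):
        column = [day[m] for day in days]
        mean = sum(column) / n_days                       # Eq. (3), segment m
        # Assumption: population form (divide by N), as in Eq. (10).
        sigma = math.sqrt(sum((x - mean) ** 2 for x in column) / n_days)
        lower, upper = mean - 2 * sigma, mean + 2 * sigma
        for n in range(n_days):
            # Values outside [lower, upper] are moved to the violated bound.
            cleaned[n][m] = min(max(column[n], lower), upper)
    return cleaned


# Toy data: nine ordinary days and one spike at the same time segment.
toy = [[1.0] for _ in range(9)] + [[9.8]]
print(clip_two_sigma(toy)[-1])
```

Because the mean and standard deviation are computed from data that still contain the outlier, the replaced value stays well above the ordinary level; the rule limits extreme values rather than removing them.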

5 Research methodologies

Traffic congestion patterns refer to the data curves of traffic congestion index in one day with different curve characteristics (Zhao and Hu 2019). In this section, an improved weighted K-means clustering method is proposed to identify traffic congestion patterns, and the paired t-test method is introduced to analyze the temporal and spatial dependence.

5.1 Improved weighted K-means clustering method

Time series data is multi-dimensional, dynamic and temporally dependent. Although time series data is composed of multiple data samples connected by time points, it can also be expressed as a single object to be clustered in the form of a column vector. Assume that $D = \left\{ X_{1}, X_{2}, \cdots, X_{N} \right\}$ is a set of time series data where $X_{n}$ represents a column vector. The target of time series clustering is to divide the given set into K different clusters represented as $C = \left\{ C_{1}, C_{2}, \cdots, C_{K} \right\}$ in an unsupervised way, where $C_{k}$ is defined as the $k$th cluster and $D = \bigcup\nolimits_{k = 1}^{K} C_{k}$.

The K-means clustering method uses an iterative process to partition a collection of sampling points into subsets known as clusters (Li et al. 2012; Yang et al. 2018; Xu et al. 2020; Chen et al. 2021a, b). Assuming that there are K clusters in the sample dataset, the target of K-means clustering is to minimize the total deviation

$$\sum\limits_{k = 1}^{K} {\sum\limits_{{X_{n} \in C_{k} }}^{{}} {\sum\limits_{m = 1}^{M} {\left( {x_{n}^{m} - u_{k}^{m} } \right)^{2} } } } ,$$

(6)

where $X_{n} = \left( {x_{n}^{1} ,x_{n}^{2} , \cdots ,x_{n}^{M} } \right)$ represents an M-dimension sample, and $U_{k} = \left( {u_{k}^{1} ,u_{k}^{2} , \cdots ,u_{k}^{M} } \right)$ is an M-dimension vector representing the center of cluster k, which is calculated as

$$u_{k}^{m} = \frac{1}{{|C_{k} |}}\sum\limits_{{X_{n} \in C_{k} }} {x_{n}^{m} } , \, \forall \, m = 1,2, \cdots ,M,k = 1,2, \cdots ,K.$$

(7)

A cluster contains the cluster center and the data samples assigned to it. Each time a data sample is allocated, the cluster center is recalculated according to the existing objects in the cluster. This process is repeated until the termination condition is satisfied. The termination condition can be that the cluster centers remain unchanged or that the sum of squared errors reaches a local minimum. Due to its good performance and computing efficiency, the K-means clustering algorithm has been widely used in the field of transportation data analysis (Zhao et al. 2019; Sun et al. 2021).

Based on the preprocessed time series data of traffic congestion index, an improved weighted K-means clustering method is proposed to identify traffic congestion patterns, which assigns differential weights among all M sampling time segments. Specifically, the sampling time segments with higher dispersion among daily congestion index are assigned with greater weights to strengthen their role in the clustering process. Conversely, the sampling time segments with lower dispersion among daily congestion index are assigned with smaller weights to weaken their influence. Here the Coefficient of Variation is taken (Arachchige et al. 2020) to measure the degree of dispersion, that is,

$$CV_{m} = \frac{\sigma_{m}}{\overline{x}^{m}}, \, \forall \, m = 1,2, \cdots ,M,$$

(8)

$$\overline{x}^{m} = \frac{1}{N}\sum\limits_{n = 1}^{N} {x_{n}^{m} } , \, \forall \, m = 1,2, \cdots ,M,$$

(9)

$$\sigma_{m} = \sqrt {\frac{1}{N}\sum\limits_{n = 1}^{N} {\left( {x_{n}^{m} - \overline{x}^{m} } \right)^{2} } } , \, \forall \, m = 1,2, \cdots ,M,$$

(10)

where $CV_{m}$ represents the degree of dispersion at sampling time segment m, $\overline{x}^{m}$ represents the sample mean of $x_{n}^{m}$ at sampling time segment m, and $\sigma_{m}$ represents the sample standard deviation of $x_{n}^{m}$ at sampling time segment m. The coefficient of variation is an appropriate choice of weight because it reflects the stability and volatility of the time series data at the same time.
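The weights of Eqs. (8) to (10) can be computed with a short routine. The sketch below assumes N daily curves of equal length M with a strictly positive mean at every time segment (otherwise the ratio is undefined); the function name `coefficient_of_variation` and the toy curves are hypothetical.

```python
import math


def coefficient_of_variation(days):
    """Return [CV_1, ..., CV_M] for N daily curves of length M."""
    n_days = len(days)
    weights = []
    for m in range(len(days[0])):
        column = [day[m] for day in days]
        mean = sum(column) / n_days                                       # Eq. (9)
        sigma = math.sqrt(sum((x - mean) ** 2 for x in column) / n_days)  # Eq. (10)
        weights.append(sigma / mean)                                      # Eq. (8)
    return weights


# Toy curves: the first segment never varies, the second one does.
toy_curves = [[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]]
print(coefficient_of_variation(toy_curves))
```

A segment whose value is identical on every day receives weight zero and therefore plays no role in the weighted distance, which matches the stated intent of down-weighting low-dispersion segments.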

Based on the value of $CV_{m}$, a weighted K-means clustering method is proposed to partition the time series data of traffic congestion index. The objective is to minimize the total weighted deviations to the cluster centers

$$\sum\limits_{k = 1}^{K} {\sum\limits_{{X_{n} \in C_{k} }}^{{}} {\sum\limits_{m = 1}^{M} {\left( {CV_{m} x_{n}^{m} - u_{k}^{m} } \right)^{2} } } } ,$$

(11)

where $X_{n} = \left( {x_{n}^{1} ,x_{n}^{2} , \cdots ,x_{n}^{M} } \right)$ represents the time series data with M sampling points in a day, and $U_{k} = \left( {u_{k}^{1} ,u_{k}^{2} , \cdots ,u_{k}^{M} } \right)$ represents the weighted center of cluster k, which is defined as

$$u_{k}^{m} = \frac{1}{{|C_{k} |}}\sum\limits_{{X_{n} \in C_{k} }} {CV_{m} x_{n}^{m} , \, \forall \, m = 1,2, \cdots ,M} .$$

(12)

For determining the best number of clusters, i.e., the value of K, Silhouette Coefficient (Rousseeuw, 1987) is taken to evaluate the clustering performance associated with each value of K, and the one that maximizes the clustering performance is selected. First, for each sample $X_{n}$, its Silhouette Coefficient is defined as

$$s\left( {X_{n} } \right) = \frac{{b_{n} - a_{n} }}{{\max \left\{ {a_{n} ,b_{n} } \right\}}},$$

(13)

where $a_{n}$ represents the average Euclidean distance between sample $X_{n}$ and all the other samples in its cluster, and $b_{n}$ represents the average Euclidean distance between sample $X_{n}$ and all samples in its nearest cluster. Note that the Silhouette Coefficient works while the number of clusters is more than or equal to two, i.e., $K \ge 2$. Second, the Silhouette Coefficient for the whole dataset is defined as the mean Silhouette Coefficient among all samples, that is,

$$S = \frac{{s\left( {X_{1} } \right) + s\left( {X_{2} } \right) + \cdots + s\left( {X_{N} } \right)}}{N},$$

(14)

which takes values in [−1, 1]. The closer the value is to 1, the better the clustering results.
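Eqs. (13) and (14) translate into the following Python sketch. It uses `math.dist` (Python 3.8 or later) for the Euclidean distance and follows the usual convention of assigning a silhouette of 0 to a sample that is alone in its cluster; the function name `silhouette_mean` is hypothetical.

```python
import math


def silhouette_mean(points, labels):
    """Mean silhouette coefficient S of Eq. (14) with Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs K >= 2")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: s = 0 by convention
        # a_n: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b_n: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0   # Eq. (13)
    return total / len(points)


pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(silhouette_mean(pts, [0, 0, 1, 1]))  # well-separated labelling
print(silhouette_mean(pts, [0, 1, 0, 1]))  # poor labelling scores lower
```

To choose K as described in the text, one would run the clustering for each candidate K and keep the value with the largest mean silhouette.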

Based on the above description, the general procedure for such a weighted K-means clustering method is summarized in Algorithm 1.
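Since the listing of Algorithm 1 is not reproduced here, the following Python sketch shows one way the weighted clustering of Eqs. (11) and (12) could be implemented. It is an assumption-laden illustration, not the authors' algorithm: it uses batch (Lloyd-style) updates, random initialization from the data with a fixed seed, and hypothetical names (`weighted_kmeans`, `max_iter`, `seed`).

```python
import random


def weighted_kmeans(days, weights, k, max_iter=100, seed=0):
    """Cluster daily curves after scaling segment m by weights[m]."""
    # Scale every curve by the per-segment weights (CV_m * x_n^m in Eq. 11).
    scaled = [[w * x for w, x in zip(weights, day)] for day in days]
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(scaled, k)]  # assumed init scheme
    labels = [-1] * len(scaled)
    for _ in range(max_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in scaled
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers no longer move
        labels = new_labels
        # Update step: each center becomes the mean of its members (Eq. 12).
        for j in range(k):
            members = [p for p, lab in zip(scaled, labels) if lab == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


toy_days = [[0.0, 2.0], [1.0, 2.0], [10.0, 2.0], [11.0, 2.0]]
labels, centers = weighted_kmeans(toy_days, weights=[1.0, 0.5], k=2)
print(labels)
```

K-means only guarantees a local minimum of the objective, so in practice several restarts with different seeds are commonly run and the best result kept.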

5.2 Paired t-test method

The paired t-test method is used to test whether the mean difference between two sets of paired sample data is zero. It can also be applied to observations of the same event under different conditions, in order to evaluate the influence of the conditions on the event (Konietschke et al. 2014). The test is based on the within-pair differences, denoted as $\left\{ d_{1}, d_{2}, \cdots, d_{L} \right\}$, and the test statistic t is calculated as

$$t = \frac{\sum\nolimits_{l = 1}^{L} d_{l}}{\sqrt{\frac{L\left( \sum\nolimits_{l = 1}^{L} d_{l}^{2} \right) - \left( \sum\nolimits_{l = 1}^{L} d_{l} \right)^{2}}{L - 1}}},$$

(15)

where L is the number of paired observations.

If the two-tailed P value corresponding to the test statistic t with L − 1 degrees of freedom is less than the chosen significance level (e.g., 0.10, 0.05, or 0.01), the difference between the two sets of sample data is significant (Hong et al. 2017).
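The statistic in Eq. (15) can be evaluated directly from the within-pair differences. The sketch below uses made-up sample values and a hypothetical function name `paired_t_statistic`; it returns the statistic and the degrees of freedom but not the P value, which requires the t distribution.

```python
import math


def paired_t_statistic(sample_a, sample_b):
    """Return (t, degrees of freedom) for paired samples, following Eq. (15)."""
    if len(sample_a) != len(sample_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n_pairs = len(diffs)                      # L in Eq. (15)
    sum_d = sum(diffs)
    sum_d2 = sum(d * d for d in diffs)
    t = sum_d / math.sqrt((n_pairs * sum_d2 - sum_d ** 2) / (n_pairs - 1))
    return t, n_pairs - 1


# Made-up paired observations, for illustration only.
t_value, dof = paired_t_statistic([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 5.0])
print(t_value, dof)
```

Eq. (15) is algebraically the same as the mean difference divided by its standard error. If SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the two-sided P value.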

6 Case studies

In this section, case studies are presented in detail. First, traffic congestion patterns in different districts of Beijing are identified based on the proposed weighted K-means clustering method. Second, the temporal dependence of traffic congestion patterns is examined using the paired t-test method. Finally, the spatial dependence of traffic congestion patterns is tested by analyzing the indicator values of different traffic congestion patterns in different districts.

6.1 Identification of traffic congestion patterns

Based on the weighted K-means clustering method, this subsection identifies the traffic congestion patterns in different districts of Beijing. The Silhouette Coefficients for different numbers of clusters are shown in Table 5. They imply that the traffic congestion patterns are spatially dependent: the districts closer to downtown (i.e., Chaoyang, Dongcheng, and Xicheng) have three categories of congestion patterns, while the districts farther from downtown (i.e., Haidian, Fengtai, and Shijingshan) have two.

Table 5 The optimal number of clusters at different districts

First, take the clustering results in Haidian district as an example. In Fig. 3a, there are two representative traffic congestion patterns on weekdays. The first pattern is less congested and often appears in the first half or middle of the working week (i.e., Monday, Tuesday, and Wednesday), while the second pattern is more congested and generally appears in the second half of the working week (i.e., Thursday and Friday). For simplicity, they are hereinafter named the MTW pattern and the TF pattern, respectively. The main difference between them occurs from 6:00 to 23:00, and they are relatively consistent from 0:00 to 6:00. The trends of the two curves are basically the same, namely the morning peaks appear around 8:00 and the evening peaks appear around 18:00, but the TF pattern takes clearly higher values than the MTW pattern. The characteristics of traffic congestion patterns in Fengtai district and Shijingshan district, shown in Fig. 3c, e, are similar to those in Haidian district.

In contrast, there are three traffic congestion patterns on weekdays in Chaoyang district. In Fig. 3b, in addition to the MTW pattern and the TF pattern, there is a holiday pattern, hereinafter named the H pattern. Compared with the MTW and TF patterns, the H pattern is the least congested; it often appears on working days within 3 days before and after holidays (e.g., the Spring Festival, the Mid-Autumn Festival, the National Day, the Dragon Boat Festival). Similarly, the trends of these three patterns are basically the same, but their values differ significantly. The characteristics of traffic congestion patterns in Dongcheng district and Xicheng district, shown in Fig. 3d, f, are similar to those in Chaoyang district.

Zhao and Hu (2019) revealed that there were two typical traffic congestion patterns on weekdays in Beijing by applying K-means cluster analysis, i.e., weekday mode A and weekday mode B. The former often appeared on Mondays while the latter often appeared on Fridays. As shown in Table 5, when the number of congestion pattern clusters is two in Chaoyang, the corresponding Silhouette Coefficient is 0.32, which is not its optimal value. In fact, the Silhouette Coefficient reaches its optimal value of 0.34 when the number of clusters is three in Chaoyang. Similar clustering results also occur in Dongcheng and Xicheng. Therefore, it is more reasonable to divide the traffic congestion patterns of Chaoyang, Xicheng, and Dongcheng into three categories rather than two.

6.2 Temporal dependence of traffic congestion patterns

Now the temporal dependence of traffic congestion patterns can be examined. As described in Sect. 6.1, the clustering results of traffic congestion patterns are consistent with the variation of dates, which is also how each pattern is named. Taking Haidian district as an example (see Table 6 and Fig. 4a), the MTW pattern includes 42 Mondays, 32 Tuesdays, 34 Wednesdays, 10 Thursdays and 2 Fridays, so Mondays, Tuesdays and Wednesdays together account for 90% and Thursdays and Fridays for only 10%. The TF pattern includes 4 Mondays, 16 Tuesdays, 16 Wednesdays, 40 Thursdays and 48 Fridays, so Thursdays and Fridays account for 71% and Mondays, Tuesdays and Wednesdays for 29%.
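The proportions quoted for Haidian follow directly from the day counts. A small sketch, with the hypothetical helper `weekday_shares`, reproduces the arithmetic from the counts given in the text:

```python
def weekday_shares(counts):
    """Return (share of Mon-Wed, share of Thu-Fri) for one pattern."""
    total = sum(counts.values())
    early = counts["Mon"] + counts["Tue"] + counts["Wed"]
    return early / total, (total - early) / total


# Day counts for Haidian as stated in the text.
mtw = {"Mon": 42, "Tue": 32, "Wed": 34, "Thu": 10, "Fri": 2}
tf = {"Mon": 4, "Tue": 16, "Wed": 16, "Thu": 40, "Fri": 48}

for name, counts in (("MTW", mtw), ("TF", tf)):
    early, late = weekday_shares(counts)
    print(name, round(early * 100), round(late * 100))
```

The MTW pattern contains 120 days in total (108 of them Monday to Wednesday) and the TF pattern 124 days (88 of them Thursday or Friday).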

Table 6 Statistical results of congestion patterns across weekdays at different districts

Similarly, in Chaoyang district (see Table 6 and Fig. 4b), the MTW pattern includes 38 Mondays, 38 Tuesdays, 27 Wednesdays, 16 Thursdays and 7 Fridays, so the total proportion of Mondays, Tuesdays and Wednesdays is 81% and the total proportion of Thursdays and Fridays is only 19%. The TF pattern includes 0 Mondays, 3 Tuesdays, 12 Wednesdays, 25 Thursdays and 39 Fridays, so the total proportion of Thursdays and Fridays is 81% and the total proportion of Mondays, Tuesdays and Wednesdays is only 19%. The H pattern is evenly distributed from Monday to Friday, but its most significant distribution characteristic is that it includes 32 working days within 3 days before and after holidays, which account for 82% of the total number of days in the H pattern.

Based on the above analysis, it is concluded that the variation of dates greatly affects the congestion patterns but cannot completely explain them. Therefore, it can be inferred that other factors also affect the congestion patterns.

Automobile license plate restriction (ALPR) sets out rules that restrict automobile travel on particular dates. For example, driving can be restricted for private cars based on their license plate numbers. In detail, vehicles with license numbers ending in 0 or 5 are prohibited from driving on Mondays; vehicles with license numbers ending in 1 or 6 are prohibited from driving on Tuesdays; vehicles with license numbers ending in 2 or 7 are prohibited from driving on Wednesdays; vehicles with license numbers ending in 3 or 8 are prohibited from driving on Thursdays; and vehicles with license numbers ending in 4 or 9 are prohibited from driving on Fridays. Generally speaking, the ALPR rules are updated quarterly, and there are no driving restrictions on weekends. In China, ALPR is commonly implemented as a measure to reduce traffic congestion in megacities, e.g., Beijing, Tianjin, Guangzhou, and Chengdu.
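The example rule set above can be written as a small lookup table. The sketch below is only an illustration: the names `RESTRICTED_DIGITS` and `is_restricted` are hypothetical, and the table is a single snapshot of the assignment described in the text, which is updated quarterly.

```python
# Weekday -> restricted last digits, exactly as listed in the text.
# The assignment is updated quarterly, so this is one snapshot only.
RESTRICTED_DIGITS = {
    "Monday": (0, 5),
    "Tuesday": (1, 6),
    "Wednesday": (2, 7),
    "Thursday": (3, 8),
    "Friday": (4, 9),
}

def is_restricted(last_digit, weekday):
    """True if a private car whose plate ends in last_digit may not drive on weekday."""
    # Weekends are absent from the table, so they map to "no restriction".
    return last_digit in RESTRICTED_DIGITS.get(weekday, ())

print(is_restricted(4, "Friday"))    # restricted under scenario (4, 9)
print(is_restricted(4, "Saturday"))  # no restrictions on weekends
```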

The influence of ALPR on the traffic congestion patterns is analyzed in this subsection. The statistical results for the congestion patterns across no-drive days at all six districts are shown in Table 7. They indicate that, in the TF pattern, the days on which license numbers ending in 4 and 9 are restricted account for the highest proportion. The number of such days increases from the MTW pattern to the TF pattern at Haidian, Fengtai, and Shijingshan, and increases from the H pattern to the MTW pattern and then to the TF pattern at Chaoyang, Dongcheng, and Xicheng.

Table 7 Statistical results of congestion patterns across no-drive days

For each district $q$ with restriction scenario $p$, the congestion degree is defined as the sum, over all congestion patterns, of the mean congestion index of each pattern multiplied by its proportion of days. The higher the congestion degree, the more serious the congestion in the restriction scenario. If $w_{pq}$ denotes the congestion degree at district $q$ with restriction scenario $p$, then we have

$$w_{pq} = \sum\limits_{k = 1}^{K} {e_{pq}^{k} \lambda_{pq}^{k} } ,$$

(16)

where $e_{pq}^{k}$ represents the mean congestion index of the $k$th congestion pattern, and $\lambda_{pq}^{k}$ represents the proportion of days of the $k$th congestion pattern with $\lambda_{pq}^{1} + \lambda_{pq}^{2} + \cdots + \lambda_{pq}^{K} = 1$. In Table 8, the numbers in the last row indicate the average congestion degree among the six districts in each restriction scenario. It is shown that the restriction scenario (4, 9) results in the highest congestion degree, while the restriction scenarios (1, 6) and (3, 8) lead to the lowest congestion degree. This can be attributed to the taboo against the number 4 among Chinese people, which makes the number of vehicles with plates ending in 4 very limited compared with other numbers, so fewer vehicles are kept off the road under scenario (4, 9). Conversely, 6 and 8 are regarded as lucky numbers, which makes the number of vehicles with plates ending in 6 or 8 very large.
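Equation (16) is a weighted average of the pattern means. A minimal Python sketch under stated assumptions: the function name `congestion_degree` and the numbers in the example are hypothetical and are not taken from Table 8.

```python
import math

def congestion_degree(mean_index, day_share):
    """Eq. (16): sum over patterns of mean congestion index times share of days."""
    if len(mean_index) != len(day_share):
        raise ValueError("one share is needed per congestion pattern")
    if not math.isclose(sum(day_share), 1.0):
        raise ValueError("the shares of days must sum to 1")
    return sum(e * lam for e, lam in zip(mean_index, day_share))

# Hypothetical district with three patterns (H, MTW, TF) under one scenario.
e = [3.0, 4.0, 5.0]    # mean congestion index of each pattern (made up)
lam = [0.2, 0.5, 0.3]  # proportion of days in each pattern (made up)
w = congestion_degree(e, lam)
print(round(w, 2))
```

Because the shares sum to one, the result always lies between the smallest and largest pattern means, so a scenario whose days shift toward the TF pattern obtains a higher congestion degree.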

Table 8 The congestion degree at all six districts

A paired t-test is used to evaluate the difference in the congestion degree at all six districts under different restriction scenarios. The test statistic for the paired t-test between scenario $p$ and scenario $p^{\prime}$ is calculated as

$$t_{{pp^{\prime}}} = \frac{{\sum\nolimits_{q = 1}^{6} {\left| {w_{pq} - w_{{p^{\prime}q}} } \right|} }}{{\sqrt {1.2 \times \left( {\sum\nolimits_{q = 1}^{6} {\left| {w_{pq} - w_{{p^{\prime}q}} } \right|}^{2} } \right) - 0.2 \times \left( {\sum\nolimits_{q = 1}^{6} {\left| {w_{pq} - w_{{p^{\prime}q}} } \right|} } \right)^{2} } }},$$

(17)

where $w_{pq}$ denotes the congestion degree at district $q$ with restriction scenario $p$, and $w_{p^{\prime}q}$ denotes the congestion degree at district $q$ with restriction scenario $p^{\prime}$.
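The constants in Eq. (17) correspond to $n/(n-1) = 1.2$ and $1/(n-1) = 0.2$ for $n = 6$ districts, which is the usual paired statistic $\bar{d}/(s_d/\sqrt{n})$ after multiplying numerator and denominator by $n$. The Python sketch below evaluates Eq. (17) as written and compares it with the textbook form; the function names and the congestion degrees are hypothetical. Note that Eq. (17) is written with absolute differences, whereas the textbook form uses signed differences, so the two coincide only when all six differences have the same sign, as in this example.

```python
import math
from statistics import mean, stdev

def t_eq17(w_p, w_p2):
    """Eq. (17) as printed, with absolute paired differences and n = 6."""
    d = [abs(a - b) for a, b in zip(w_p, w_p2)]
    if len(d) != 6:
        raise ValueError("the constants 1.2 and 0.2 assume six districts")
    s1 = sum(d)
    s2 = sum(x * x for x in d)
    return s1 / math.sqrt(1.2 * s2 - 0.2 * s1 ** 2)

def t_textbook(w_p, w_p2):
    """Standard paired t statistic: mean(d) / (stdev(d) / sqrt(n)), signed d."""
    d = [a - b for a, b in zip(w_p, w_p2)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical congestion degrees for six districts under two scenarios.
w_a = [4.1, 4.6, 5.0, 3.9, 4.4, 4.8]
w_b = [3.8, 4.4, 4.5, 3.8, 4.0, 4.5]

t1 = t_eq17(w_a, w_b)
t2 = t_textbook(w_a, w_b)
print(t1, t2)  # equal up to rounding, since every difference is positive
```

The two-tailed P value would then be read from a t distribution with $n - 1 = 5$ degrees of freedom.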

The results of the paired t-test are given in Table 9. At a significance level of 0.05, if the two-tailed P value is less than 0.05, it can be concluded that the values of congestion degree are significantly different. Table 9 shows that the restriction scenario (4, 9) is significantly different from all four other scenarios; there is a significant difference between scenario (1, 6) and scenario (2, 7); there is a significant difference between scenario (3, 8) and scenario (2, 7); and there are no significant differences among the other scenarios. Therefore, the ALPR policy has an important influence on traffic congestion patterns.

Table 9 Paired t-test results among different restriction scenarios

6.3 Spatial dependence of traffic congestion patterns

As shown in Fig. 3, traffic congestion patterns are spatial dependent, that is, different districts have different numbers of traffic congestion patterns. A traffic congestion index above 4.0 means that the traffic situation is congested (see Fig. 2). In Table 10, the minimum, maximum, mean, and variance of the congestion index, together with the duration of congestion, are calculated for all congestion patterns across all districts. It is found that the maximum value, mean value, variance, and congestion duration increase gradually when the congestion pattern changes from H and MTW to TF, while the minimum value remains almost unchanged. Most importantly, the maximum value, mean value, variance, and congestion duration differ significantly across districts, which illustrates again that congestion patterns are spatial dependent.
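The indicators of this kind can be computed from a single pattern curve as sketched below. Only the 4.0 threshold comes from the text; the 15-minute sampling interval, the use of the population variance, the function name, and the toy curve are assumptions made for illustration.

```python
from statistics import mean, pvariance

CONGESTION_THRESHOLD = 4.0  # index above this value means congested (from the text)
SAMPLE_MINUTES = 15         # assumed sampling interval, not stated in this section

def pattern_indicators(curve):
    """Minimum, maximum, mean, variance, and congested minutes of one index curve."""
    congested_samples = sum(1 for value in curve if value > CONGESTION_THRESHOLD)
    return {
        "min": min(curve),
        "max": max(curve),
        "mean": mean(curve),
        "variance": pvariance(curve),  # population variance (an assumption)
        "congested_minutes": congested_samples * SAMPLE_MINUTES,
    }

# Toy curve (hypothetical): eight consecutive samples of the congestion index.
curve = [1.2, 2.5, 4.6, 5.1, 3.8, 4.2, 6.0, 3.0]
indicators = pattern_indicators(curve)
print(indicators)
```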

Table 10 The indicator values of different traffic congestion patterns at different districts

Note that the shapes of the H/MTW/TF patterns also differ significantly across districts, as shown in Fig. 3. As the district gets closer to the downtown area, the valley between the morning peak and the evening peak becomes sharper, the peak value becomes greater, and the congestion lasts longer, which can also be observed in Table 10. Interestingly, although Haidian and Chaoyang are at a similar distance from the downtown area, their traffic congestion patterns are extremely different, which is reflected in both the number of congestion patterns and the specific indicator values. This phenomenon can be explained by the functional differences between these two districts in Beijing: Chaoyang is an important center for business and foreign affairs with relatively active traffic, while Haidian is an education center with relatively light traffic congestion.

7 Conclusions

Alleviating traffic congestion has always been an important challenge for the sustainable development of megacities. An accurate understanding of traffic congestion patterns and their characteristics helps in formulating scientific congestion prevention measures. In this paper, a traffic congestion pattern analysis framework was constructed based on congestion index data. First, an improved weighted K-means clustering method was proposed to identify the traffic congestion patterns. Second, based on the identified traffic congestion patterns, the spatial–temporal variations of traffic congestion patterns were analyzed. Case studies with real-life data illustrated that the traffic congestion patterns are both spatial dependent and temporal dependent, and that the automobile license plate restriction has an important influence on the traffic congestion patterns.

On the basis of the results in this study, several issues deserve future study. First, the traffic congestion index data used in this study are collected according to the administrative divisions of Beijing; a finer division, such as congestion pattern analysis for roads or blocks, should be carried out to obtain more valuable information. Second, the analysis of factors influencing congestion patterns should consider more issues, such as weather conditions, geographical conditions, and emergency events. Third, the congestion pattern analysis in this paper is based on the congestion index data in Beijing; the congestion patterns and characteristics of other cities should be analyzed and compared with those of Beijing.

Notes

[1] http://jtw.beijing.gov.cn/

References

Angayarkanni SA, Sivakumar R, Ramana Rao YV (2021) Hybrid grey wolf: Bald eagle search optimized support vector regression for traffic flow forecasting. J Ambient Intell Humaniz Comput 12:1293–1304
Arachchige CNPG, Prendergast LA, Staudte RG (2020) Robust analogs to the coefficient of variation. J Appl Stat. https://doi.org/10.1080/02664763.2020.1808599
Beckers JM, Rixen M (2003) EOF calculations and data filling from incomplete oceanographic datasets. J Atmos Oceanic Tech 20(12):1839–1856
Chauhan S, Agarwal N, Kar AK (2016) Addressing big data challenges in smart cities: a systematic literature review. Info 18(4):1–10
Chen YC, Chen YL, Lu JY (2021a) MK-means: detecting evolutionary communities in dynamic networks. Expert Syst Appl 176:114807
Chen YY, Chen C, Wu Q, Ma JM, Zhang GH, Milton J (2021b) Spatial-temporal traffic congestion identification and correlation extraction using floating car data. J Intell Transp Syst 25(3):263–280
Degen WLF (2007) Sharp error bounds for piecewise linear interpolation of planar curves. Computing 79:143–151
Hong YM, Lee YJ (2017) A general approach to testing volatility models in time series. J Manag Sci Eng 2(1):1–33
Jiang MR, Chen W, Li X (2021) S-GCN-GRU-NN: a novel hybrid model by combining a spatiotemporal graph convolutional network and a gated recurrent units neural network for short-term traffic speed forecasting. J Data, Inform Manag 3:1–20
Kan ZH, Tang LL, Kwan MP, Ren C, Liu D, Li QQ (2019) Traffic congestion analysis at the turn level using taxis’ GPS trajectory data. Comput Environ Urban Syst 74:229–243
Ke JT, Yang H, Zheng ZF (2020) On ride-pooling and traffic congestion. Transp Res Part B: Methodol 142:213–231
Kim J, Kwan MP (2019) Beyond commuting: ignoring individuals’ activity-travel patterns may lead to inaccurate assessments of their exposure to traffic congestion. Int J Environ Res Public Health 16(1):89
Kim J, Ryu JH (2015) Quantifying a threshold of missing values for gap filling processes in daily precipitation series. Water Resour Manag 29:4173–4184
Klos A, Bogusz J, Figurski M, Kosek W (2015) On the handling of outliers in the GNSS time series by means of the noise and probability analysis. Int Assoc Geod Symp 143:657–664
Konietschke F, Pauly M (2014) Bootstrapping and permuting paired t-test type statistics. Stat Comput 24:283–296
Kruschke JK, Aguinis H, Joo H (2012) The time has come: Bayesian methods for data analysis in the organizational sciences. Org Res Methods 15:722–752
Lee SY, Xia YM (2006) Maximum likelihood methods in treating outliers and symmetrically heavy-tailed distributions for nonlinear structural equation models with missing data. Psychometrika 71:565–585
Li X, Wong HS, Wu S (2012) A fuzzy minimax clustering model and its applications. Inf Sci 186:114–125
Li L, Su XN, Wang YW, Lin YT, Li ZH, Li YB (2015) Robust causal dependence mining in big data network and its application to traffic flow predictions. Transp Res Part C: Emerg Technol 58:292–307
Li YC, Xiong WT, Wang XP (2019) Does polycentric and compact development alleviate urban traffic congestion? A case study of 98 Chinese cities. Cities 88:100–111
Lu ZD, Hui YV (2003) L1 linear interpolator for missing values in time series. Ann Inst Stat Math 55:197–216
Praveen DS, Raj DP (2021) Smart traffic management system in metropolitan cities. J Ambient Intell Humaniz Comput 12:7529–7541
Retallack AE, Ostendorf B (2019) Current understanding of the effects of congestion on traffic accidents. Int J Environ Res Public Health 16(18):3400
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Sanchez-Cambronero S, Jimenez P, Rivas A, Gallego I (2017) Plate scanning tools to obtain travel times in traffic networks. J Intell Transp Syst 21(5):390–408
Shelke M, Malhotra A, Mahalle PN (2019) Fuzzy priority based intelligent traffic congestion control and emergency vehicle management using congestion-aware routing algorithm. J Ambient Intell Humaniz Comput. https://doi.org/10.1007/s12652-019-01523-8
Shi L, Chen GM (2008) Detection of outliers in multilevel models. J Stat Plan Inference 138:3189–3199
ShirMohammadi MM, Esmaeilpour M (2020) The traffic congestion analysis using traffic congestion index and artificial neural network in main streets of electronic city (case study: Hamedan city). Program Comput Softw 46:433–442
Simolo C, Brunetti M, Maugeri M, Nanni T (2009) Improving estimation of missing values in daily precipitation series by a probability density function-preserving approach. Int J Climatol 30(10):1564–1576
Su Y, Sun W (2019) Dynamic differential models for studying traffic flow and density. J Ambient Intell Humaniz Comput 10:315–320
Sun QX, Sun YX, Sun L, Li Q, Zhao JL, Zhang Y, He H (2019) Research on traffic congestion characteristics of city business circles based on TPI data: the case of Qingdao, China. Physica A 534:122214
Sun QX, Zhang Y, Sun L, Li Q, Gao P, He H (2021) Spatial–temporal differences in operational performance of urban trunk roads based on TPI data: the case of Qingdao. Physica A 568:125696
Tian Q, Yang H, Huang HJ (2010) Novel travel cost functions based on morning peak commuting equilibrium. Oper Res Lett 38(3):195–200
Torkjazi M, Mirjafari PS, Poorzahedy H (2018) Reliability-based network flow estimation with day-to-day variation: a model validation on real large-scale urban networks. J Intell Transp Syst 22(2):121–143
Wang MJ, Yang S, Sun Y, Gao J (2017) Discovering urban mobility patterns with PageRank based traffic modeling and prediction. Physica A 485:23–34
Wang WX, Guo RJ, Yu J (2018) Research on road traffic congestion index based on comprehensive parameters: taking Dalian city as an example. Adv Mech Eng 10(6):1–8
Wen HM, Sun JP, Zhang X (2014) Study on traffic congestion patterns of large city in China taking Beijing as an example. Procedia Soc Behav Sci 138:482–491
Wu X, Zhu X, Wu GQ, Ding W (2014) Data mining with big data. IEEE Trans Knowl Data Eng 26(1):97–107
Xu SJ, Chan HK, Ch’ng E, Tan KH (2020) A comparison of forecasting methods for medical device demand using trend-based clustering scheme. J Data Inf Manag 2:85–94
Yang Y, Zhou JD, Li X (2018) Energy-efficient stochastic chance-constrained programming model for train timetable optimization. J Syst Eng 33(2):197–211
Yaqoob I, Hashem IAT, Gani A, Mokhtar S, Ahmed E, Anuar NB, Vasilakos AV (2016) Big data: from beginning to future. Int J Inf Manag 36(6):1231–1247
Zhao PJ, Hu HY (2019) Geographical patterns of traffic congestion in growing megacities: big data analytics from Beijing. Cities 92:164–174

Acknowledgements

This work was supported by grants from the National Natural Science Foundation of China (Nos. 71722007 & 71931001), the Funds for First-class Discipline Construction (XK1802-5), the Key Program of NSFC-FRQSC Joint Project (NSFC No. 72061127002 and FRQSC No. 295837), the Fundamental Research Funds for the Central Universities (buctrc201926).

Author information

Authors and Affiliations

School of Economics and Management, Beijing University of Chemical Technology, Beijing, 100029, China
Xiang Li & Jiao Gui
School of International Economics and Management, Beijing Technology and Business University, Beijing, 100048, China
Jiaming Liu

Corresponding author

Correspondence to Xiang Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, X., Gui, J. & Liu, J. Data-driven traffic congestion patterns analysis: a case of Beijing. J Ambient Intell Human Comput 14, 9035–9048 (2023). https://doi.org/10.1007/s12652-022-04409-4

Received: 22 September 2021
Accepted: 12 September 2022
Published: 27 September 2022
Issue Date: July 2023
DOI: https://doi.org/10.1007/s12652-022-04409-4
