
1 Introduction

Many researchers have recently noted that accelerated urbanization has altered the equilibrium state of urban road systems in modern cities. Traditional Transportation Systems (TS) in big cities can no longer meet the needs of today’s complex transport and of the congestion caused by the continuous growth of vehicle use. This global urbanization trend poses a new set of challenges to authorities, who must reconfigure city services according to the new priorities imposed for planning and for mitigating unforeseen traffic incidents. For this reason, the Intelligent TS (ITS) research topic was created with two primary goals: (a) to advance traffic monitoring methodologies and (b) to offer better transportation planning and traffic management in congested cities [1]. Present works include sensor-based monitoring schemes where the sensory equipment is installed and maintained by the city. Additional data generated by a variety of in-vehicle devices (such as GPS receivers, radio transceivers, small-scale collision radars and sensing devices that enhance travel safety) are also used in recent studies towards the aforementioned goals. However, these data cannot be stored centrally, since the devices are designed to operate over short-range communication protocols.

The tremendous shift towards data-driven methodologies is due to the use of social media platforms (local online forums, Facebook, Twitter), which are nowadays the primary and richest source of real-time data [1,2,3]. The utilization of traffic-related social media data (traffic jams, collisions, alternative route suggestions etc.) helps ITS improve traffic monitoring and management. But all this also brings new challenges: (a) ITS have to cope with the Big Data consequences emanating from the rate of data generation (roughly 8,000 tweets per second, i.e., billions per week) and the resulting storage limitations, (b) the intrinsic spatiotemporal properties of Big Data must feed back into innovative machine learning solutions that optimize cloud computing and processing, (c) the open availability of data poses social challenges of geospatial significance and (d) textual analysis must coexist with specialized Deep Learning approaches to decode human responses.

After presenting a concise state of the art in Sect. 2, we formulate in Sect. 3 the methodology of an ITS used for clustering and classification of traffic-related data gathered by a Twitter extractor, which incorporates stemming, IDF and similarity-index techniques to choose traffic-incident related keywords. The classification methodology is presented for both a 2-class and a 3-class classifier. For the 2-class case we provide performance metrics for a Multi-Layer Perceptron Neural Network (MLP-NN) and a Support Vector Machine (SVM); for the 3-class case we do the same using a Deep Convolutional Learning (DCL) network.

2 State of the Art

There has been a variety of initiatives (both academic and commercial) dealing with traffic alert systems. All these systems harvest input information from a variety of sources, including sensory equipment, human reactions, police traffic reports etc. Regarding textual input from social media, while there is a lot of literature on data analytics methodologies, there is hardly any research dealing with the stages of discovery, collection and preparation of textual data [4]. Recently, Twitter started to provide services where users can post geo-tagged tweets via the GPS interface of their smart devices [5, 6]. The reported information, when relevant, can support any traffic monitoring and alerting system simply by logging a repertoire of traffic incidents. Along these lines, TWITRAFFIC [7] is a smart app that monitors and reports traffic events in the UK. MISNIS [8] is another platform that addresses these issues and allows a non-technical user to easily mine Twitter’s corpus for any given topic in order to obtain relevant contents and indicators such as user influence or sentiment, but it is focused mainly on the Portuguese language. Lately, [9] developed a clustering tool called I-TWEC, which exploits the lexical and semantic similarities of Twitter data. I-TWEC uses the Longest Common Subsequence technique as a similarity metric to produce clusters presented with different visualizations, enabling users to merge them based on their semantic similarity. Because traffic topics attract global attention, such data suffer from the Long Tail Effect [10]; an effective textual analytics tool must therefore extract traffic-event data while learning location-specific patterns using only the geolocation attributes of tweets. Even though it is somewhat difficult for travelers to read such posts and for drivers to participate in such activities, experience has shown that during rush hours many drivers and passengers announce traffic conditions on social media. Thus, the optimal solution is to analyze the information available on social network platforms and to apply sentiment analysis and machine learning methodologies in order to classify and cluster traffic cases and to predict traffic situations.

With the improvement of big data processing technologies, we now have the ability to perform traffic sensing and to learn human mobility patterns from updated location information in network interaction log data (mostly GPS and textual). Recently, [11] extracted traffic patterns from big data using regression models. The research in [12] adopts Spark on Hadoop and MongoDB technologies to store, handle and process real-time and historical traffic data from heterogeneous sources, including social media. Similar work appears in [13], where the distributed file system HDFS is used to store urban traffic data and Spark is used to detect road traffic congestion states at lower cost, in shorter time and with more credible results.

K-means is the most popular methodology for data clustering, and for high-dimensional text data it remains one of the few scalable options. The cosine similarity metric [14] is used to measure cohesion between the produced clusters, since it is a similarity measure between two non-zero vectors of an inner product space. Finally, traffic condition recognition is very important for ITS, and K-means methods have recently been applied to this problem [15, 16].
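For reference, the cosine similarity between two non-zero feature vectors \( A \) and \( B \) is

$$ \cos (A,B) = \frac{A \cdot B}{\left\| A \right\|\,\left\| B \right\|} $$

with values near 1 indicating nearly parallel, i.e., semantically close, vectors.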

3 Methodology and Results

Gathering of tweets was achieved using the Twitter4J [17] open source Java library. The Twitter API allows us to mine tweets using criteria based on hashtags, time windows, longitude/latitude and arbitrary keywords; a hedged query sketch is given below. In this paper we focus on two main investigations: (a) clustering of traffic data of numeric nature via the KMeans algorithm with the Euclidean distance as a cost function and (b) classification considering two cases: (i) binary classification of tweets reporting traffic either due to weather conditions or not and (ii) ternary classification of heavy traffic due to accidents, seasonal events (for example, Christmas Eve) or external unexpected events (basketball games, strikes, demonstrations etc.).
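The following minimal sketch illustrates such a query; the keyword list, the rectangle center and the 40 km radius are our own assumptions chosen to roughly match the study area of Sect. 3.1, OAuth credential handling is omitted, and the exact call signatures may vary across Twitter4J versions.

```java
import twitter4j.*;

// Hedged Twitter4J [17] sketch: keyword + geolocation + date-window search.
// Assumes OAuth credentials are configured in twitter4j.properties.
public class TweetCollector {
    public static void main(String[] args) throws TwitterException {
        Twitter twitter = TwitterFactory.getSingleton();
        Query query = new Query("traffic OR congestion OR \"traffic jam\"");
        // Approximate center of the New York study rectangle; 40 km radius (assumption).
        query.setGeoCode(new GeoLocation(40.7687, -73.8331), 40, Query.KILOMETERS);
        query.setSince("2017-12-11");   // start of the Christmas 2017 period
        query.setUntil("2018-01-03");   // end of the period
        query.setCount(100);            // tweets per page
        QueryResult result = twitter.search(query);
        for (Status status : result.getTweets()) {
            GeoLocation loc = status.getGeoLocation(); // null unless the tweet is geo-tagged
            if (loc != null) {
                System.out.println(status.getCreatedAt() + " (" + loc.getLatitude()
                        + ", " + loc.getLongitude() + "): " + status.getText());
            }
        }
    }
}
```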

3.1 Traffic Big Data Clustering Using Unsupervised Machine Learning

We used one of the most common clustering algorithms (KMeans) to cluster Twitter data into a predefined number of K clusters. Data was gathered using the Twitter4J Java library for the city of New York during the Christmas period of 2017 (Dec. 11th 2017–Jan. 3rd 2018). The area of interest was chosen to be the virtual rectangle (upper left corner: Hawthorne NJ, Lat: 40.939825, Lon: –74.160612; lower right corner: Jones Beach State Park NY, Lat: 40.597646, Lon: –73.505552). Apart from the geolocation and the date/time criteria of data acquisition, an additional search criterion included keywords such as congestion, traffic jam, traffic etc. Initial filtering of the gathered tweets was performed to keep those originated by people riding in vehicles and therefore to exclude pedestrians. The method was based on estimating the velocity of the tweet transmitter from two consecutive tweets. We must mention, however, that this does not guarantee the exclusion of all pedestrians, since in heavy traffic conditions vehicles may move at pedestrian speeds. Around 2.7 million tweets were fed to a single-machine Spark ML and SQLContext schema. After setting k = 7 clusters, each geolocated tweet was assigned to its nearest centroid based on the Euclidean distance metric. These centroids depict the epicenters of intensified heavy traffic activity. The centroids were updated in each pass of the algorithm and the process was repeated until the centroids changed by less than a minimum threshold. The structure used in the Spark data frames was (Date, Time, Latitude, Longitude, Keyword), and separate runs were performed for each of the aforementioned keywords. Figure 1 shows the centroid locations for the keyword “congestion” (Table 1); a hedged Spark sketch of this clustering step follows Table 1.

Fig. 1. Centroid locations in Google Maps.

Table 1. Centroid coordinates.
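A minimal single-machine sketch of this clustering step, written against the current Spark ML DataFrame API rather than the older SQLContext interface, is given below; the input file name and column headers are illustrative assumptions that mirror the (Date, Time, Latitude, Longitude, Keyword) structure.

```java
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hedged sketch of k = 7 KMeans clustering over geolocated tweets.
public class TrafficKMeans {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TrafficKMeans").master("local[*]").getOrCreate();

        // Hypothetical input file with columns Date, Time, Latitude, Longitude, Keyword.
        Dataset<Row> tweets = spark.read().option("header", "true")
                .option("inferSchema", "true").csv("tweets.csv");

        // Assemble (Latitude, Longitude) into the feature vector that KMeans
        // clusters with the Euclidean distance, as described in the text.
        Dataset<Row> features = new VectorAssembler()
                .setInputCols(new String[]{"Latitude", "Longitude"})
                .setOutputCol("features")
                .transform(tweets);

        KMeansModel model = new KMeans().setK(7).setSeed(1L).fit(features);

        // The cluster centers are the epicenters of heavy traffic activity (cf. Table 1).
        for (Vector center : model.clusterCenters()) {
            System.out.println(center);
        }
        spark.stop();
    }
}
```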

3.2 Traffic Big Data Classification Using Supervised Machine Learning

Classification Data Set Acquisition.

For the case of the binary classifier, the set of tweets either includes the weather condition of a traffic event or not. The tweets were gathered from the same area as in the clustering case and had the same structure (Date, Time, Latitude, Longitude, Keywords), where the Keywords field is a tuple of at most m keywords. For m up to 10, the candidate keywords were {traffic, rain, snow, sleet, accident, slowdown, congestion, stuck, thunder, crash} when investigating heavy traffic due to extreme weather conditions. For the case of the ternary classifier the same tweet structure was used, with keyword tuples of the form {game, strike, demonstration, flight, Christmas, year, accident, crash, ambulance, shopping}. A hedged sketch of this record structure follows.
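As an illustration of this structure, the following sketch defines the record type with a hypothetical example instance (the field values are invented for illustration and are not taken from the dataset):

```java
import java.util.List;

// Hedged sketch of the tweet record structure used for classification.
public record TrafficTweet(String date, String time,
                           double latitude, double longitude,
                           List<String> keywords) {
    public static void main(String[] args) {
        // Hypothetical example instance.
        TrafficTweet t = new TrafficTweet("2017-12-24", "17:45",
                40.7128, -74.0060, List.of("traffic", "snow", "congestion"));
        System.out.println(t);
    }
}
```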

Data Set Preprocessing.

Data fetching was followed by a set of preprocessing procedures dealing with:

1. Removal of tweet meta-associations using a Java Regular Expression Filter [18] to discard hashtags, links, mentions and user-ids, yielding a set of strings Si, i = 1 … N. The Si’s are further converted to lower-case characters via the tolower(Si) procedure.

2. Tokenization of the Si’s using a Java tokenizer [19], so that all Si’s were transformed into a larger set of syllables or words called tokens, with the simultaneous removal of non-text characters (apostrophes, hyphens etc.).

3. Removal of stop-words [20], i.e. words with no statistical significance: conjunctions, articles, pronouns etc.

4. Stemming of tokens using Porter’s algorithm [21] to remove token suffixes and to group words of similar semantics. The outcome of this process was the set of stemmed tokens STi, i = 1 … N, which forms the training set for the machine learning algorithms used later on and is thus denoted as Ntr. For each stemmed token stj in Ntr we compute its importance in the training set using the Inverse Document Frequency (IDF) as:

    $$ w_{st} = \ln (N_{tr} /N_{st} ) $$
    (1)

    where Nst is the number of occurrences of the stemmed token in Ntr [22].

5. For the set of calculated IDFs we built a feature representation vector \( F = (f_{j_{1}} ,f_{j_{2}} , \ldots ,f_{j_{N_{tr}}} ) \), where each element was set according to:

    $$ f_{j}^{st} = \left\{ \begin{array}{ll} w_{st} & \text{if stemmed token} \in N_{tr} \\ 0 & \text{if stemmed token} \notin N_{tr} \end{array} \right. $$
    (2)
6. Information Gain (IG) calculation for each stemmed token STi over the class vector \( C = \{ c_{1} ,c_{2} , \ldots ,c_{m} \} \). Note that for our aforementioned scenarios |C| = 2 or 3. The IG(STi) is:

    $$ \begin{aligned} IG(ST_{i} ) = & - \sum\nolimits_{m} P(C_{m} )\log P(C_{m} ) + P(ST_{i} )\sum\nolimits_{m} P(C_{m} \mid ST_{i} )\log P(C_{m} \mid ST_{i} ) \\ & + P(\overline{ST_{i}} )\sum\nolimits_{m} P(C_{m} \mid \overline{ST_{i}} )\log P(C_{m} \mid \overline{ST_{i}} ) \end{aligned} $$
    (3)

where P(STi) is the probability that the stemmed token STi occurs, \( \overline{ST_{i}} \) denotes its non-occurrence, P(Cm) is the probability of the mth class value, \( P(C_{m} \mid ST_{i} ) \) is the conditional probability of the mth class value given that STi occurs, and \( P(C_{m} \mid \overline{ST_{i}} ) \) is the conditional probability of the mth class value given that STi does not occur. A hedged sketch of steps 1–4 and of the IDF computation is given below.
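The following minimal sketch illustrates steps 1–4 and Eq. (1) under simplifying assumptions: the regular expression, the stop-word list and the suffix-stripping stem() method are deliberately reduced stand-ins (the latter for Porter’s algorithm [21]), not the exact filters used in our experiments.

```java
import java.util.*;
import java.util.regex.Pattern;

// Hedged sketch of the tweet preprocessing pipeline (steps 1-4) and IDF weights (Eq. 1).
public class TweetPreprocessor {

    // Step 1: meta-associations to discard (links, mentions, hashtags).
    private static final Pattern META = Pattern.compile("(https?://\\S+)|(@\\w+)|(#\\w+)");

    // Step 3: a tiny illustrative stop-word list (a real list would be far larger [20]).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "or", "in", "on", "at", "is", "it"));

    // Step 1: strip meta-associations, then lower-case (tolower).
    static String clean(String tweet) {
        return META.matcher(tweet).replaceAll(" ").toLowerCase();
    }

    // Steps 2-4: tokenize on non-text characters, drop stop-words, stem.
    static List<String> tokenizeAndStem(String cleaned) {
        List<String> stemmed = new ArrayList<>();
        for (String tok : cleaned.split("[^a-z]+")) {
            if (tok.isEmpty() || STOP_WORDS.contains(tok)) continue;
            stemmed.add(stem(tok));
        }
        return stemmed;
    }

    // Simplified stand-in for the Porter stemmer: strips a few common suffixes.
    static String stem(String w) {
        for (String suf : new String[]{"ings", "ing", "ed", "es", "s"}) {
            if (w.length() > suf.length() + 2 && w.endsWith(suf)) {
                return w.substring(0, w.length() - suf.length());
            }
        }
        return w;
    }

    // Eq. (1): w_st = ln(Ntr / Nst), where Ntr is the total token count of the
    // training set and Nst the occurrence count of each stemmed token.
    static Map<String, Double> idfWeights(List<List<String>> stemmedTweets) {
        Map<String, Integer> counts = new HashMap<>();
        int nTr = 0;
        for (List<String> tweet : stemmedTweets) {
            for (String st : tweet) {
                counts.merge(st, 1, Integer::sum);
                nTr++;
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            idf.put(e.getKey(), Math.log((double) nTr / e.getValue()));
        }
        return idf;
    }

    public static void main(String[] args) {
        // Hypothetical tweets for illustration only.
        List<List<String>> corpus = new ArrayList<>();
        corpus.add(tokenizeAndStem(clean("Heavy #traffic on I-95, stuck in congestion @nyc https://t.co/x")));
        corpus.add(tokenizeAndStem(clean("Snow causing slowdowns and crashes near the bridge")));
        System.out.println(idfWeights(corpus));
    }
}
```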

3.3 Classification Using an MLP-NN and an SVM

For the first experiment we used an MLP-NN binary classifier from the April-ANN toolkit [23]. More specifically, we used the [MLP: ann.mlp.all_all.generate] call to create an all-to-all connection between the hidden layers of the NN, concentrating on the performance of a classifier with only two classes, positive and negative. This allowed us to investigate the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) instances and to calculate the performance metrics Accuracy (4), Precision (5), Recall (6) and F-Score (7), computed as sketched after Eq. (7):

$$ Acc = (TP + TN)/(TP + FP + FN + TN) $$
(4)
$$ Prec = TP/(TP + FP) $$
(5)
$$ Rec = TP/(TP + FN) $$
(6)
$$ F{\text{-}}Score = (1 + \beta^{2} )\frac{Prec \cdot Rec}{(\beta^{2} \cdot Prec) + Rec} $$
(7)

where \( \beta = 1 \) for class-balanced datasets.
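As an illustration, the following sketch computes Eqs. (4)–(7) from a confusion matrix; the counts are hypothetical and do not reproduce the paper’s results.

```java
// Hedged sketch: performance metrics of Eqs. (4)-(7) from binary confusion counts.
public class BinaryMetrics {
    public static void main(String[] args) {
        // Hypothetical confusion-matrix counts, for illustration only.
        double tp = 90, tn = 80, fp = 10, fn = 20;
        double beta = 1.0; // beta = 1 for class-balanced datasets

        double acc  = (tp + tn) / (tp + fp + fn + tn);                             // Eq. (4)
        double prec = tp / (tp + fp);                                              // Eq. (5)
        double rec  = tp / (tp + fn);                                              // Eq. (6)
        double f    = (1 + beta * beta) * prec * rec / (beta * beta * prec + rec); // Eq. (7)

        System.out.printf("Acc=%.3f Prec=%.3f Rec=%.3f F-Score=%.3f%n", acc, prec, rec, f);
    }
}
```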

For the second experiment we used an SVM as in [24], noting that the optimization problem under concern makes use of kernels, which map input features into a different space. Since the mapped feature space is accessed only through the kernel, the problem is typically solved in its dual form rather than by direct gradient descent on the primal cost, and the resulting decision function is weighted only by the examples that lie close to the decision boundary (the support vectors). Table 2 depicts the classification results on the 2-class dataset for the two classifiers mentioned above, with the best performance indicated in bold.

Table 2. Classification results for the 2-class dataset.

3.4 Classification Using DCL Network for Sentiment Analysis

For the case of the 3-class classifier we used the deep convolutional neural network of [25]. The network was trained by stochastic gradient descent, using the backpropagation algorithm to compute the gradients. The tendency of the network to overfit the decision function during learning was confronted by augmenting the cost function with a regularization term. The model was evaluated on the pre-processed tweet data using a 70%/30% split between the train and test datasets. Unfortunately the results were inferior to those of the 2-class classifiers, as depicted in Table 3.

Table 3. 3-class deep convolutional learning classifier.

4 Conclusions

With the increase of vehicular traffic observed in recent years in urban areas, there has been a significant degradation of the efficiency of traffic flow. The incorporation of machine learning methodologies is shown to be beneficial in identifying congestion centroids when clustering traffic-congestion related data generated by social media. Furthermore, for the case of classification aimed at discovering the causes of congestion events, the binary classifiers (MLP-NN and SVM) outperform the Deep Learning model. We suspect that this limited performance of the DCL network is due to the fact that we did not use pre-trained embeddings from a neural language model; further investigation is needed to substantiate this comparison.