
1 Social Media Analytics for Traffic Condition Monitoring

Perhaps nowhere has the emergence of big data technology been more disruptive than in transportation and traffic engineering systems. The daily flow of human traffic generates vast amounts of data that have yet to be fully harnessed for real-time estimation and prediction. Lu et al. [1] observed that the rapid development of urban “informatization” in the era of big data reveals details embedded in spatio-temporal characteristics, historical correlations and multistate patterns. Big data have increasingly been used for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data [2]. For these reasons, among others, academia, governments, federal and state agencies, industries and other organizations continue to seek innovations to manage and analyze big data, with the prospect of increasing the accuracy of predictions, improving the management and security of transportation infrastructure, and enabling informed decision-making through better insight into transportation and traffic engineering phenomena [3].

The practical significance of real-time traffic flow state identification and prediction using big data lies in the ability to identify and predict the traffic flow state efficiently, promptly and precisely [1]. Various articles [3,4,5] have employed big data resources to examine traffic demand estimation, traffic flow prediction and performance, as well as integration and validation with existing models. A noteworthy aspect is that the rapidly increasing volume of data from leading social media microblogging services such as Twitter (twitter.com) is challenging in practice, and nearly impossible, to analyze manually [6]. Nevertheless, the huge volume of data derived from Twitter makes it well suited to machine learning.

A few years ago, researchers developed sentiment and cluster analysis to monitor Twitter messages, identify followers and followings, find word resemblances and examine the nature of comments, i.e. positive, negative or neutral. Such promising Twitter analytic tools appear sufficient for addressing the aforementioned traffic flow problems. Our objective in this study is to mine tweets on UK traffic delays and to perform sentiment analysis and cluster classification for traffic congestion prediction. The proposed methodology is based on tweet crawling, preprocessing, feature extraction, and social network generation and clustering.

1.1 Traffic Twitter Sentiment Analysis

Following the launch of Twitter in 2006, sentiment analysis has been applied to various areas of interest, e.g. extracting adverse drug reactions from tweets [7], news coverage of nuclear power issues [8], and capturing sentiment from integrated resort tweets in the tourism sector [6]. Terabytes of Twitter data may come from road users expressing their opinions on traffic jams, road accidents and other information that constitutes general traffic news updates. The question, of course, is how to determine the traffic flow state based on the weight of the opinion contained in a Twitter message (called a “tweet”), a short message posted on Twitter that cannot be longer than 140 characters. According to Abidin et al. [9], certain special characters used in a tweet, including the @, RT and # symbols, create a collective snapshot of what people are saying about a given topic. Sentiment analysis is the process of computationally identifying and automatically extracting opinions from a piece of text to determine whether the writer’s attitude or emotion towards a topic is positive, negative or neutral [10, 11]. The technique is generally expected to yield an accuracy of roughly 70–80% in training-test data matching tasks [12], while seeking useful insights from a large quantity of aggregated data rather than perfect classification of all data points [6]. Sentiment mining using corpus-based and dictionary-based methods for the semantic orientation of opinion words in tweets has been presented by Kumar and Sebastian [13].

In relating Twitter sentiment analysis to traffic flow state prediction, He et al. [14] consider improving long-term traffic prediction with tweet semantics and analyze the correlation between traffic volume and tweet counts at various granularities. Finally, they propose an optimization framework that extracts traffic indicators from tweet semantics using a transformation matrix and integrates them into traffic prediction using linear regression. Real-time traffic improvement by semantic mining of social networks has been captured by Grosenick [15]. Abidin et al. [9] introduce the use of the Twitter API to retrieve traffic data that serve as input to Kalman filter models for route calculations and updates, while fine-tuning the output for new, accurate arrival estimation.

1.2 Traffic Twitter Cluster Classification

Tweets may contain a hashtag, which consists of any word that starts with the “#” symbol. Hashtags help to search for messages containing a particular tag. Also of interest is Part of Speech (POS) tagging in tweets, which has been applied by Elsafoury [16] to monitor urban traffic status. The main idea of POS tagging, also known as word-category disambiguation, is to mark up a word in a corpus and assign it to a corresponding POS based on its definition and its context. The former is an example of exact term search, while the latter, POS, can be considered a typical example of full-text search, which is usually thorough in its search process but can be more challenging to perform than exact term search. One instance of such text search is the classification of tweets into positive and negative sentiments using a multinomial Naïve Bayes unigram model with mutual information based on n-grams and POS, presented by Go et al. [11]. It outperforms the other classifier approaches under consideration. Between exact and full-text search lies phrase text search, which searches for a particular word phrase. For instance, an exact term search might be required to search for the term “delay” in a tweet stream. This would return only tweets containing the term “delay”. On the other hand, a phrase search could use a phrase like “Traffic delay”, which carries more detail about the search term. Phrase text search is often more useful for cluster classification than the other text search methods. It is noteworthy that the choice of a particular search operation is based on measuring the relevance of the query so that the terms are matched appropriately and efficiently. Azam et al. [17] present the functional clustering details of their tweet mining approach, which has the following steps:

  1. Tweet crawling: The process of retrieving tweets from the Twitter server using the Twitter Application Program Interface (API). The crawled tweets are stored on a local machine for further processing.

  2. Tweet pre-processing and tokenization: This involves filtering the crawled tweets of items that are not entirely textual, such as emoticons, URLs, special characters and stop words. A common tokenization method known as the n-gram technique can then be applied to tokenize the tweets into a bag-of-words (n = 1, known as a unigram, is recommended for such tweet tokenization by Broder et al. [18]).

  3. Feature extraction and social network generation: The process of extracting important features from the preprocessed and tokenized tweets while transforming the feature sets into a social network comprising a term-tweet matrix A of order \( m \times n \), where m is the number of candidate terms and n is the number of tweets. The resulting matrix A is used to compute the weight \( w\left( {t_{i,j} } \right) \) using the following two equations:

    $$ w\left( {t_{i,j} } \right) = tf\left( {t_{i,j} } \right) \times idf(t_{i} ) $$
    (1)
    $$ idf(t_{i} ) = \log \frac{|D|}{{|\{ d_{j} : t_{i} \in d_{j} \} |}} + 1 $$
    (2)

    where \( tf\left( {t_{i,j} } \right) \) is the number of times \( t_{i} \) occurs in the jth tweet, \( |D| \) is the total number of tweets and \( |\{ d_{j} : t_{i} \in d_{j} \} | \) is the number of tweets containing the term \( t_{i} \). Matrix A is then normalized so that each tweet vector has unit length.

  4. Social network clustering: After generating the social network for the complete set of tweets, Markov clustering is used to crystallize the network into various clusters, each representing an individual event. The Markov clustering algorithm (introduced by van Dongen [19]) is a fast and scalable unsupervised clustering algorithm for graphs (also known as networks). It is an iterative method that interleaves matrix expansion and inflation steps, based on the simulation of (stochastic) flow in graphs.
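Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Azam et al. [17]: the stop-word list, the regular expressions and the function names `tokenize` and `term_tweet_matrix` are hypothetical, and Eq. (2) is read with the denominator as the number of tweets containing the term.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "on", "a", "to", "of", "in", "at", "is"}


def tokenize(tweet):
    """Drop URLs and @mentions, keep alphabetic unigrams, remove stop words."""
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return [w for w in re.findall(r"[a-z]+", tweet) if w not in STOP_WORDS]


def term_tweet_matrix(tweets):
    """Return (terms, A) with A[i][j] = tf(t_i, j) * idf(t_i), columns of unit length."""
    docs = [Counter(tokenize(t)) for t in tweets]
    terms = sorted(set().union(*docs))
    n_docs = len(docs)
    # idf(t_i) = log(|D| / number of tweets containing t_i) + 1, as in Eq. (2)
    idf = {t: math.log(n_docs / sum(1 for d in docs if t in d)) + 1 for t in terms}
    # Counter returns 0 for terms absent from a tweet, so tf is 0 there.
    A = [[d[t] * idf[t] for d in docs] for t in terms]
    # Normalise each tweet vector (a column of A) to Euclidean length 1.
    for j in range(n_docs):
        norm = math.sqrt(sum(A[i][j] ** 2 for i in range(len(terms))))
        if norm > 0:
            for i in range(len(terms)):
                A[i][j] /= norm
    return terms, A
```

Rows of `A` correspond to candidate terms and columns to tweets, matching the \( m \times n \) layout described in step 3.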

More details on the abovementioned steps can be found in Azam et al. [17]. For traffic flow prediction using big data analysis and visualization, McHugh [20] considered among other approaches the use of traffic tweets to test the effectiveness of geographical location of vehicles to determine the location of an incident. A useful method that analyzes traffic tweets in order to generate real-time city traffic insights and predictions for traffic management and city planning has been introduced by Tejaswin et al. [21].
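The expansion and inflation steps of Markov clustering (step 4 above) can be illustrated with a small pure-Python sketch. It is a simplified reading of the algorithm of van Dongen [19], not a reference implementation: the function names, the inflation value of 2, the fixed iteration count and the threshold used to read off clusters are all assumptions made for illustration.

```python
def normalize_columns(M):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(X, Y):
    """Plain square matrix product."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adj, inflation=2.0, iterations=50):
    """Cluster a graph given as a square adjacency matrix (list of lists)."""
    n = len(adj)
    # Add self-loops, then make the matrix column-stochastic.
    M = normalize_columns([[adj[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        M = matmul(M, M)  # expansion: flow spreads along longer paths
        # inflation: element-wise power, then rescale columns
        M = normalize_columns([[v ** inflation for v in row] for row in M])
    # Rows that keep non-negligible mass define the clusters.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if M[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

For a term or tweet network, `adj` would hold the edge weights between nodes; larger inflation values tend to produce more, smaller clusters.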

2 Using Tweet Traffic Data for Traffic Condition Monitoring

The logs of Twitter traffic data for the sentiment analysis and cluster classification were obtained using the twitteR package. The connection to the Twitter API was established and OAuth authentication was performed using the ROAuth package, all in RStudio. The plyr and stringr packages were used to load a number of crawled tweets into RStudio while ensuring they were clean of unwanted symbols. More details of this Twitter text mining technique can be found in Rais [22]. Detailed documentation of the widely used statistical program for Twitter data mining can be found at cran.r-project.org [23]. We perform a phrase search, using a POS tag, based on the phrase: UK traffic delay. This is made possible with a simplified phrase search algorithm derived from Eckert [24], with the original simplified version by Manning et al. [25], given by the following:

To apply the above algorithm to our problem, a positional index, containing a list of the mined tweets together with a list of positions for each, is used to locate the search phrase. Terms is taken to be a split-normalization tokenizer that splits the phrase into a list of tokens, normalizes them and assigns its output to k as a bag of words. We consider the weighted k-nearest neighbor classifier [26], which assigns a weight \( 1/k \) to the outputs. This is done by finding the vector of nonnegative weights that is asymptotically optimal while minimizing the misclassification error rate, \( R_{R} \) [26]. Essentially, the asymptotic expansion is needed to ensure strong consistency in the search. This is subject to a regularity class distribution condition:

$$ R_{R} \left( {C_{n}^{wnn} } \right) - R_{R} \left( {C^{Bayes} } \right) = (B_{1} s_{n}^{2} + B_{2} t_{n}^{2} )\{ 1 + o(1)\} , $$
(3)

Here \( C_{n}^{wnn} \) is the weighted nearest neighbor classifier with weights \( \{ w_{ni} \}_{i = 1}^{n} \), and \( B_{1} \) and \( B_{2} \) are constants determined by:

$$ \begin{aligned} B_{1} & = \mathop \smallint \limits_{S} \frac{{\bar{f} (x_{o} )}}{{4\left\| {\dot{\eta }(x_{o} )} \right\|}}dVol^{d - 1} (x_{o} ) \\ B_{2} & = \mathop \smallint \limits_{S} \frac{{\bar{f} (x_{o} )}}{{\left\| {\dot{\eta }(x_{o} )} \right\|}}dVol^{d - 1} (x_{o} ), \\ \end{aligned} $$
(4)

\( Vol^{d - 1} \) denotes the natural \( (d - 1) \)-dimensional volume measure inherent in \( S \subset {\mathbb{R}}^{d} \), while, following Samworth [26], \( \bar{f} \left( {x_{o} } \right) \) denotes the marginal density of the feature vectors at the point \( x_{o} \) and \( \dot{\eta }(x_{o} ) \) the gradient of \( \eta \) at \( x_{o} \). The terms \( s_{n}^{2} = \sum\nolimits_{i = 1}^{n} {w_{ni}^{2} } \) and \( t_{n}^{2} \), with \( t_{n} = n^{ - 2/d} \sum\nolimits_{i = 1}^{n} {w_{ni} \{ i^{{1 + \frac{2}{d}}} - (i - 1)^{{1 + \frac{2}{d}}} \} } \), represent the variance and squared bias contributions, respectively. \( C^{Bayes} \) denotes the Bayes classifier, which minimizes the risk over R. Both are given by:

$$ \begin{aligned} C_{n}^{wnn} (x) & = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {if} \hfill & {\sum\nolimits_{i = 1}^{n} {w_{ni} {\mathbb{1}}_{{\{ Y_{(i)} = 1\} }} } \ge 1/2} \hfill \\ {2,} \hfill & {} \hfill & {\quad otherwise} \hfill \\ \end{array} } \right. \\ C^{Bayes} (x) & = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {if} \hfill & {\eta \left( x \right) \ge 1/2} \hfill \\ {2,} \hfill & {} \hfill & {\quad otherwise} \hfill \\ \end{array} } \right. \\ \end{aligned} $$
(5)

Here \( Y_{(i)} \) is the class label of the ith nearest neighbor of x, as in Samworth [26]. The interpretation is that a point \( x \in {\mathbb{R}}^{d} \) is assigned to class 1 by the weighted nearest neighbor classifier if \( \sum\nolimits_{i = 1}^{n} {w_{ni} {\mathbb{1}}_{{\{ Y_{(i)} = 1\} }} } \ge \frac{1}{2} \), and by the Bayes classifier if the regression function \( \eta \left( x \right) = {\text{P}}\left( {Y = 1 | X = x} \right) \ge \frac{1}{2} \); otherwise, both assign class 2. Further interpretation of the asymptotic behavior towards optimal classification can be found in Samworth [26]. Subsequently, provided that the answer returned from the positional index for a single term \( t \) is not empty, we can iterate over the incoming tweets while adapting the document list Forward-Position-Intersect algorithm [24, 25] as follows:

Redefining the variables in Eckert [24], let \( p_{1} \), \( p_{2} \), \( pp_{1} \) and \( pp_{2} \) be pointers into tweet lists. Here \( p_{1} \) and \( p_{2} \) reference the tweet lists of the two terms to be intersected, while \( pp_{1} \) and \( pp_{2} \) reference the inner position lists for each tweet, with tweetId and pos dereferencing the pointers to their actual values in the list. Let positions extract the inner position list from an entry in the tweet list, and let Add append a list identifier and a position to the resulting tweet list. The tweet lists represent the tweet logs of traffic information saved to file.
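The pointer-based intersection described above can be sketched as follows. This is an illustrative Python reading of positional intersection for adjacent terms, in the spirit of Manning et al. [25], and not the exact algorithm of Eckert [24]: the index layout (term to tweet id to positions), whitespace tokenization and all function names are assumptions.

```python
from collections import defaultdict


def build_positional_index(tweets):
    """Map each term to {tweet_id: [positions]} using whitespace tokens."""
    index = defaultdict(dict)
    for tweet_id, text in enumerate(tweets):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(tweet_id, []).append(pos)
    return index


def positional_intersect(list1, list2):
    """Keep tweets where a position of term 2 directly follows one of term 1."""
    result = {}
    ids1, ids2 = sorted(list1), sorted(list2)
    p1 = p2 = 0                      # pointers into the two tweet lists
    while p1 < len(ids1) and p2 < len(ids2):
        if ids1[p1] == ids2[p2]:
            tweet_id = ids1[p1]
            pos1, pos2 = list1[tweet_id], list2[tweet_id]
            pp1 = pp2 = 0            # pointers into the inner position lists
            hits = []
            while pp1 < len(pos1) and pp2 < len(pos2):
                if pos2[pp2] == pos1[pp1] + 1:
                    hits.append(pos2[pp2])
                    pp1 += 1
                    pp2 += 1
                elif pos2[pp2] > pos1[pp1] + 1:
                    pp1 += 1
                else:
                    pp2 += 1
            if hits:
                result[tweet_id] = hits
            p1 += 1
            p2 += 1
        elif ids1[p1] < ids2[p2]:
            p1 += 1
        else:
            p2 += 1
    return result


def phrase_search(index, phrase):
    """Return sorted ids of tweets that contain the phrase as consecutive terms."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    current = index[terms[0]]
    for term in terms[1:]:
        current = positional_intersect(current, index[term])
        if not current:
            return []
    return sorted(current)
```

Each intersection keeps the positions of the last matched term, so a three-word phrase such as "uk traffic delay" is resolved by chaining two pairwise intersections.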

For our sentiment analysis, we adopt the lexicon of opinion words (LOWs) approach of Hu and Liu [27]. Building on the earlier derivations, we posit that indexing sentiment words requires correct interpretation of the word context in relation to the topic of traffic delay and congestion, by scoring the opinion contained in the traffic tweets according to contextual polarity: positive, negative or neutral. The first method of the improved Naïve Bayes algorithm (INB-1) by Kang et al. [28] was helpful in computing the score for the crawled and filtered traffic tweets based on the following conditional probability:

$$ Class\left( {t_{i} } \right) = \arg \hbox{max} \,R_{1} \left( {p_{ij} } \right)P\left( {c_{j} } \right)\mathop \prod \limits_{i = 1}^{d} P\left( {p_{i} |c_{j} } \right) $$
(6)
$$ R_{1} \left( {p_{ij} } \right) = \frac{{\mathop \sum \nolimits_{{p_{ij} \in L_{j} }}^{|L|} C(p_{ij} )}}{{\mathop \sum \nolimits_{{p_{ij} \in L}}^{|L|} C(p_{ij} )}} $$
(7)

where \( Class\left( {t_{i} } \right) \) denotes the function that determines whether a traffic tweet \( t_{i} \) is positive, negative or neutral. \( P\left( {c_{j} } \right) \) is the probability of class \( c_{j} \), while \( P\left( {p_{i} |c_{j} } \right) \) is the probability that pattern \( p_{i} \) belongs to \( c_{j} \). \( R_{1} \left( {p_{ij} } \right) \) denotes the ratio of the pattern counts \( C(p_{ij} ) \) summed over the patterns in class j of the LOWs, \( L_{j} \), to the pattern counts \( C(p_{ij} ) \) summed over all patterns in the LOWs, L. A pattern is essentially an n-gram, i.e. a contiguous sequence of n items from a corpus (widely known as a shingle), and takes the form of an order \( n - 1 \) Markov model. We used the Jaccard index to measure the similarity between sample sets of shingles irrespective of ordering. This is given by:

$$ J\left( {C_{1} , C_{2} } \right) = \frac{{|C_{1} \cap C_{2} |}}{{|C_{1} \cup C_{2} |}} $$
(8)
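As a quick illustration of Eq. (8), the sketch below builds word-level shingles from two short texts and computes their Jaccard index. The helper names `shingles` and `jaccard`, whitespace tokenization and the convention that two empty sets have similarity 1 are assumptions made for this example.

```python
def shingles(text, n=2):
    """Return the set of word n-grams (shingles) in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(c1, c2):
    """Jaccard index |C1 & C2| / |C1 | C2| of two sets."""
    if not c1 and not c2:
        return 1.0  # assumed convention for two empty sets
    return len(c1 & c2) / len(c1 | c2)


a = shingles("serious accident long delays")
b = shingles("long delays serious accident")
print(jaccard(a, b))  # two shared bigrams out of four distinct ones: 0.5
```

Because the shingles are stored in sets, the order in which they occur in a tweet does not affect the score.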

\( J\left( {C_{1} , C_{2} } \right) \) denotes the similarity between sets \( C_{1} \) and \( C_{2} . \) It follows that when \( C_{1} \) and \( C_{2} \) have no elements in common \( J\left( {C_{1} , C_{2} } \right) = 0 \), and when they are identical \( J\left( {C_{1} , C_{2} } \right) = 1 \); in general \( 0 \le J\left( {C_{1} , C_{2} } \right) \le 1 \). The cluster formation provides evidence supporting the interrelations between traffic incidents with regard to the trending causes of traffic congestion. Furthermore, we employ the term frequency-inverse document frequency, tdidf [29], to classify each term in the traffic congestion clusters based on its frequency of occurrence. This is performed by applying TF log-normalization with the smooth tdidf weighting scheme as follows:

$$ tf\left( {t,d} \right) = 1 + { \log }\, (f_{t,d} ) $$
(9)
$$ idf(t,D) = \log \frac{N}{{n_{t} }} $$
(10)

so that the term weight for a tweet document is given by:

$$ tdidf(t,d,D) = tf\left( {t,d} \right) \cdot idf(t,D) $$
(11)

with N = |D| denoting the total number of documents in the corpus and \( n_{t} = 1 + |\{ d \in D:t \in d \} | \), where \( |\{ d \in D:t \in d \} | \) is the number of documents in D that contain the term t. Adding 1 to \( |\{ d \in D:t \in d \} | \) ensures that a division by zero, and hence an infinite value of \( idf\left( {t,D} \right) \), is avoided when t occurs in no document.
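Equations (9) to (11) can be checked numerically with a short sketch. The function name `tdidf`, the representation of documents as token lists and the choice to return 0 when the term is absent from the document (where \( \log f_{t,d} \) is undefined) are assumptions for illustration.

```python
import math
from collections import Counter


def tdidf(term, doc_tokens, corpus):
    """Weight of `term` in one tokenized document, following Eqs. (9)-(11)."""
    f_td = Counter(doc_tokens)[term]
    if f_td == 0:
        return 0.0  # assumed convention: absent terms get zero weight
    tf = 1 + math.log(f_td)                            # Eq. (9)
    n_t = 1 + sum(1 for d in corpus if term in d)      # smoothed document count
    idf = math.log(len(corpus) / n_t)                  # Eq. (10)
    return tf * idf                                    # Eq. (11)


corpus = [
    ["serious", "accident", "m25"],
    ["long", "delays", "delays", "a1"],
    ["looking", "good"],
    ["road", "clear"],
]
print(tdidf("accident", corpus[0], corpus))
```

One consequence of this smoothing is that a term occurring in every document has \( n_{t} = N + 1 \), which gives a slightly negative idf.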

3 Experimental Evaluation

3.1 Discussion of Results

A sample of 121 tweets was retrieved based on the phrase search UK traffic delay. The data were cleaned of irrelevant symbols. After tweet crawling, preprocessing, tokenization and feature extraction, we obtained the sentiment analysis results presented in Table 1.

Table 1 Traffic twitter sentiment analysis

During the period in which the traffic delay tweets were collected, 22 tweets indicating possible severity carried negative sentiment, most likely attributable to serious accidents on the roadway (12 negative sentiments). Other relevant phrases generated in the sentiment analysis include “serious accidents”, “long delays”, “looking good” and “serious delays”. The Jaccard similarity index and tdidf are used to generate the relevant trending traffic events contributing to the cluster classification index, as shown in Fig. 1a, b.

Fig. 1 a Traffic delay trending events. b Cluster classification index

3.2 Classification Accuracy

The sentiment classification accuracy of our model is measured to determine its performance after splitting the traffic tweet dataset into a training set (70%), for which the true values are known; a validation set (15%), for tuning the classifier during training; and a testing set (15%), with unknown values associated with the traffic congestion situation. This is based on the following measures:

$$ {\text{Accuracy}},a = \frac{\mathop \sum \nolimits TP + \mathop \sum \nolimits TN}{{TP_{o} }} $$
(12)
$$ {\text{Precision}},P_{r} = \frac{TP}{TP + FP} $$
(13)

Let TP denote the true positives, i.e. the number of traffic tweets that were correctly identified; TN the true negatives, i.e. the number of traffic tweets correctly rejected; FN the false negatives, i.e. the number of traffic tweets incorrectly rejected; FP the false positives, i.e. the number of traffic tweets incorrectly accepted; \( TP_{o} \) the total count of traffic tweets belonging to a set; \( P_{r} \) the precision, which represents the fraction of retrieved tweets that are relevant to the search query; and a the (overall) accuracy, which is the number of correct queries relative to the total number of queries. The results show an average accuracy and average precision of 0.95 and 0.91, respectively. Table 2 summarizes the performance of the classifiers for each class under consideration with regard to some clusters associated with the traffic congestion delay.
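The two measures can be computed from confusion counts as sketched below. The sketch assumes that \( TP_{o} \) is the total TP + TN + FP + FN and uses the standard definition of precision, TP / (TP + FP); the counts in the example are invented for illustration and are not results from this study.

```python
def accuracy(tp, tn, fp, fn):
    """Share of correct decisions among all decisions (TP_o assumed = tp+tn+fp+fn)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def precision(tp, fp):
    """Share of accepted tweets that were correctly accepted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical confusion counts, for illustration only.
tp, tn, fp, fn = 40, 45, 10, 5
print(accuracy(tp, tn, fp, fn), precision(tp, fp))
```

Per-class figures for the positive, negative and neutral sentiments would be obtained by treating each class in turn as the "accepted" class and counting the other two as rejected.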

Table 2 Sentiment classification accuracy

In the training set, the TP rate reaches its highest value of 0.990 for the positive sentiment traffic tweet classification, with its lowest value of 0.908 in the testing set for the positive sentiment. The classifier of neutral opinion has the lowest FP of 0.006 in the validation set, while the highest value of 0.055 emerges in the testing set for the negative sentiments. The precision reaches its highest value of 0.977 for the neutral sentiment in the validation and testing sets, while its lowest value is for the positive sentiment classification in the testing set. We envisage that correctly classifying traffic congestion based on Twitter sentiments would depend on the location of the user, internet accessibility and the time proximity of the tweets to the actual period over which the congestion persists, relative to the time of the incident leading to it.

3.3 Model Validation

To validate the model, the performance of Latent Dirichlet Allocation (LDA) is compared with the model employing the Naïve Bayes and Jaccard similarity with n-gram (JCn-g). The LDA is a typical example of a topic model that can be used for clustering data points; for instance, Azam et al. [17] applied it for clustering of tweets. It is also considered a generative probabilistic model that allows documents to be represented as random mixtures over latent topics characterized by a distribution over words [30]. Table 3 presents the comparative evaluation of JCn-g using unigram and bigram with LDA.

Table 3 Comparative evaluation of JCn-g with LDA

As observed, JCn-g with bigram yields the best accuracy while LDA yields the most precise result. This can be attributed to the fact that LDA not only serves as a generative probabilistic model but also combines its topic interpretability with a prior Dirichlet distribution. Figure 2 presents the cluster generative probabilistic models for JCn-g and LDA, respectively. It shows the data compression of JCn-g (n = 2) and LDA, as well as the close similarity between them, which supports our earlier statement. In fact, it can be seen that the green and black tweet clusters lie approximately within the same dimensional vector space in JCn-g (n = 2) and LDA. The best precision, observed in LDA, becomes obvious from the data points of the yellow tweet cluster, which share the same vector space with JCn-g (n = 1).

Fig. 2 Tweet cluster generative probabilistic model: JCn-g (n = 1, n = 2), LDA

4 Conclusions and Future Work

Exploring traffic conditions using social media data, which can be readily obtained from Twitter, continues to influence decision makers in traffic information and transportation engineering management. Applying the proposed data mining techniques to different strata of the UK traffic delay tweets yielded interesting results on traffic congestion, incidents and control.

The validation of JCn-g using LDA shows that JCn-g with bigram has better accuracy than LDA; however, LDA maintained higher precision than JCn-g with either unigram or bigram. Previous works have suggested that LDA combines its topic interpretability with a prior Dirichlet distribution.

Future work should seek to improve the precision of our cluster classification algorithm. It should also build on our preliminary results to determine whether a hybrid approach combining JCn-g with LDA is more feasible. In addition, the reliability of seamless integration with well-known traffic management software tools should be investigated.