1 Introduction

Twitter is one of the most popular social media websites, where users post and interact via messages called ‘tweets’, and it has been growing rapidly since its creation in 2006 [1]. The platform’s major benefit is the short time span in which messages can reach a wide network of users, which gives it a major role in real-time analytics [2]. Due to its ease of use, speed and reach, Twitter has become a platform for setting trends and agendas in topics ranging from healthcare and politics to technology, stock market analysis and the entertainment industry. As Twitter has become a source of collective wisdom, many research studies have used it to predict real-world outcomes. Twitter is also more cost-effective and less time-consuming than other information extraction techniques such as surveys and opinion polls.

The high volume of information disseminated through millions of Twitter user accounts presents an interesting opportunity to obtain meaningful insight into population behavioural patterns and to predict future trends. Moreover, gathering information on how people converse about a topic can assist many sectors in real-world applications.

As for the selected case study, nearly 1 in 5 Australians suffered from allergic rhinitis in 2014 to 2015 [3]. The forecasts do not look promising due to climate change, new allergens, worsening air quality and other factors. While meteorological data on an array of hay fever triggers is becoming increasingly available, there is currently no equivalent for estimates of its prevalence and severity at a fine-grained spatial and temporal level. Thus, this study aims to fill the gap by utilising real-time, low-cost and freely available social media data to develop a proxy for pollen allergy prevalence and to explore potential associations with environmental factors.

The remainder of the paper is organised as follows. Section 2 presents a brief overview of text mining for Twitter applications. Section 3 discusses various pre-processing techniques for dealing with noisy and unstructured data. Section 4 surveys real-time analytics of Twitter data and its applications in various domains. Section 5 presents the case study of Australian hay fever prediction from Twitter. Section 6 ends with conclusions.

2 Text Mining for Twitter Application

Unstructured or semi-structured language is commonly used on Twitter and other social media platforms. Hence, various types of ambiguity occur, such as morphological, syntactic or semantic. People tend to ignore grammatical rules and spelling in their posts [4]. In recent years, social media has become an active research area that has drawn considerable attention from the research community for information retrieval and abstract topic discovery. Nonetheless, the following characteristics of Twitter make it challenging for that purpose:

  1. Immense volume, fast arriving rate and short message restriction,
  2. Large number of spelling and grammatical errors,
  3. Use of informal and mixed language,
  4. High content of irrelevant data.

Therefore, extracting meaningful information from such noisy data has become a complex problem. Text mining intends to address the above-mentioned issues. Liu et al. [5] defined text mining as an extension of data mining to text data. Text mining techniques draw on multiple disciplines, including information retrieval, text analysis, clustering and natural language processing. They facilitate the development of models based on interesting patterns and assist prediction.

3 Pre-processing Steps in Text Analysis

Unstructured text data collected from Twitter presents many issues that make it particularly challenging to work with, as described in the previous section. Pre-processing steps are therefore essential for any subsequent analysis. If the data is not cleaned properly, text analysis techniques at a later stage simply lead to the “garbage in, garbage out” phenomenon [6]. Even though pre-processing consumes a great amount of time, it improves the accuracy of the final output [7]. Feature extraction and feature selection are the two basic methods of text pre-processing.

The content of collected tweets varies from useful and meaningful information to incomprehensible text. The former contains people’s opinions and relevant posts regarding the topic, whereas the latter may contain advertisements and does not add value to the analysis. Hence, high-quality information and features are extracted by applying the pre-processing techniques explained briefly in the following subsections.

3.1 Feature Extraction

Feature extraction can be further divided into three methods: morphological analysis, syntactical analysis and semantic analysis. The three categories are briefly explained below. Feature extraction is used in many applications, such as automatic tweet classification [8], opinion analysers [9] and sentiment classification [10].

Morphological Analysis. Morphological analysis deals mainly with tokenization, stop-word removal and stemming [7]. Tokenization is the process of breaking a stream of text into words or phrases called tokens. Stop-word lists contain common English words such as articles, prepositions and pronouns, for example ‘a’, ‘an’, ‘the’ and ‘at’. Saif et al. [11] reported that removing stop words improves classification accuracy in Twitter analysis by reducing data sparsity and shrinking the feature space. Stemming identifies the root of a word by removing its suffixes, which also saves memory. For example, the terms ‘relations’, ‘related’ and ‘relates’ can be stemmed to simply ‘relate’. Different stemming algorithms are available in the literature, such as brute-force, suffix-stripping, affix-removal, successor variety and n-grams [7]. Porter stemming [12] is applied to standardise the appearance of terms and to reduce data sparseness. In addition to the above three steps, non-textual symbols and punctuation marks are removed. Noisy tweets are filtered by eliminating links, non-ASCII characters, user mentions, numbers and hashtags.
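The morphological steps above can be illustrated with a minimal Python sketch. It is not the pipeline used in the case study: the stop-word list is a small hypothetical sample, and the `stem` function is a naive suffix stripper standing in for Porter stemming [12], so it will not reproduce Porter's output in general. All function names are illustrative assumptions.

```python
import re

# Hypothetical, deliberately tiny stop-word list (real lists are much longer).
STOP_WORDS = {"a", "an", "the", "at", "is", "my", "and", "of", "in", "to", "so"}

# Suffixes tried in order by the toy stemmer below (NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def clean_tweet(text):
    """Remove links, mentions, hashtags, non-ASCII characters, digits and punctuation."""
    text = re.sub(r"https?://\S+", " ", text)               # links
    text = re.sub(r"[@#]\w+", " ", text)                    # user mentions and hashtags
    text = text.encode("ascii", "ignore").decode("ascii")   # non-ASCII characters
    text = re.sub(r"[^A-Za-z\s]", " ", text)                # numbers and punctuation
    return text.lower()


def tokenize(text):
    """Split cleaned text into whitespace-separated tokens."""
    return text.split()


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Clean, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = tokenize(clean_tweet(tweet))
    return [stem(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("Sneezing all day at the park!! #hayfever @bob http://t.co/x 123"))
```

Filtering stop words before stemming keeps the stop-word lookup on surface forms; in practice a tested stemmer from an established NLP library would replace the toy `stem` function.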

Syntactical Analysis. Syntactic analysis consists of Part-of-Speech tagging (POS tagging) and parsing techniques [13]. It provides knowledge about the grammatical structure of a sentence and is used to interpret its logical meaning. POS tagging assigns each word in a sentence its contextual grammatical category, such as noun, verb or adjective. Various approaches have been developed to implement POS tagging, such as the Hidden Markov Model [13]. Parsing is another technique of syntactical analysis, in which the sentence is represented as a tree-like structure and analysed to determine which groups of words combine.

Semantic Analysis. Semantic analysis is the primary concern in relationship extraction from unstructured text [14]. It refers to a wide range of processing techniques that identify and extract entities, facts, attributes, concepts and events to populate metadata fields. It is usually based on one of two approaches: rule-based matching or machine learning. The first approach is similar to entity extraction and requires the support of one or more vocabularies. The machine learning approach deals with statistical analysis of the content and derives relationships from the statistical co-occurrence of terms in the document corpus. WordNet-Affect [15] and SentiWordNet [16] are popular resources used to extract useful content from textual messages. Strapparava et al. [15] proposed WordNet-Affect, a linguistic resource for the lexical representation of affective knowledge (affective computing is advancing as a field that allows a new form of human-computer interaction in addition to the use of natural language). SentiWordNet, proposed by Esuli et al. [16], is a publicly available lexical resource for opinion mining.

3.2 Feature Selection

Another essential step after feature extraction is feature selection, which improves the scalability and accuracy of the classifier by constructing a vector space. The main purpose of this approach is to select the most important subset of features from the original documents based on the highest scores. The score is a predetermined measure of the importance of a word [17]. In text mining, the high dimensionality of the feature space is a major hurdle, as it contains many irrelevant and noisy features. Hence, feature selection is widely used to improve the accuracy and efficiency of the classifier. The selected features provide a good understanding of the data and retain their original physical meaning.

A substantial amount of research has evaluated the predictive value of features for use in classification techniques. Among them, Peng et al. [18] studied how to select a compact set of superior features at low cost according to a maximal statistical dependency criterion based on mutual information. Another approach, based on conditional mutual information, is defined as a fast feature selection technique. It favours features that maximise their mutual information and ensures the selection of features that are both individually informative and pairwise weakly dependent [19]. Mihalcea et al. [20] examined several measures to determine the semantic similarity between short collections of text, relying on simple lexical methods such as pointwise mutual information and latent semantic analysis.
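To make the mutual information criterion concrete, the sketch below scores a single feature by its mutual information with the class label, I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x) p(y))], estimated from counts. It is only the basic scoring step, not the full selection procedures of [18] or [19]; the function name and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Estimate I(feature; labels) in bits from two equal-length discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))  # counts of (x, y) pairs
    count_x = Counter(feature)             # marginal counts of x
    count_y = Counter(labels)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    labels = [1, 1, 0, 0]            # hypothetical class labels
    informative = [1, 1, 0, 0]       # term present exactly in the positive class
    uninformative = [1, 0, 1, 0]     # term presence independent of the class
    print(mutual_information(informative, labels))
    print(mutual_information(uninformative, labels))
```

Ranking features by this score and keeping the top ones is a simple filter method; the approaches in [18, 19] additionally account for redundancy among the selected features.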

Another popular approach calculates feature vectors based on two basic measures, namely Term Frequency (TF) and Inverse Document Frequency (IDF). The TF-IDF function combines TF and IDF and is mainly used to estimate both the frequency and the relevance of a given word in a document. Ramos et al. examined the results of applying TF-IDF to determine which terms in a corpus of documents might be more relevant to a query [21].
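As an illustration, the following sketch computes one common TF-IDF variant, with TF as the relative frequency of a term in a document and IDF as log(N / df), where N is the number of documents and df the number of documents containing the term. Many other weighting variants exist, and this one is assumed here purely for illustration; the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    TF  = term count / document length
    IDF = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = [
        ["hay", "fever", "sneezing"],
        ["hay", "fever", "pollen"],
        ["football", "match"],
    ]
    for row in tf_idf(docs):
        print(row)
```

Under this variant a term that occurs in every document receives a weight of zero, while a term concentrated in few documents is weighted up.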

4 Literature Survey on Real-Time Analytics of Twitter Data

Twitter supports real-time analytics in various aspects, such as spatial analytics, temporal analytics and text mining. Spatial analytics provides a visual representation of trending topics across geographical locations, and temporal analytics presents information about seasonal trends or outbreaks of various topics.

As an example, Kathy et al. [22] described a novel real-time flu and cancer surveillance system that applies spatial, temporal and text mining to Twitter data. The real-time analytics results are reported visually in terms of US disease surveillance maps, distributions and timelines of disease types, symptoms and treatments. Several research studies have used Twitter for sentiment analysis [23], opinion mining on political campaigns [24, 25], natural disasters [26], epidemic surveillance [27], event detection [28], topic modeling [29,30,31,32,33,34], and so on. O’Connor et al. [25] and Tumasjan et al. [24] showed that the sentiment of tweets correlated with voters’ political preferences and closely aligned with election results. Public tweets have played a major role not only in politics but also in economics. Sentiment analysis was previously studied on sources such as blogs and forums and has now been extended to social media [35]. Bollen et al. [36, 37] showed that tweet sentiment is correlated with stock trends and can be used to predict them. Bruns et al. [38] and Gaffney et al. [39] observed that Twitter is a powerful tool to gather public opinion and create social change.

Sakaki et al. [26] investigated tweets posted during natural disasters and showed that they can be used to detect earthquakes and send warning alerts to society. They considered each Twitter user in Japan as a mobile sensor, and the probability of an earthquake was computed using the time and geolocation information of the user. Posting time and volume were modelled as an exponential distribution, and earthquake locations were estimated using Kalman and particle filters. Their research further showed that an earthquake can be sensed earlier than the official broadcast.

Culotta et al. [40] analysed Twitter to detect influenza epidemic outbreaks, improving on traditional methods in speed and cost. User data such as gender, age and location can provide more descriptive demographic insights than search queries. They detected influenza using multiple regression models, while Quincey et al. [41] identified swine flu from Twitter using pre-defined keywords and a term co-occurrence method. These methods search tweets for the keywords and detect anomalous changes, such as a rapid rise in message traffic related to the given keywords. The advantage of such a method is that it collects more focused information from the Twitter stream. Twitter has proved to be an effective source for research on healthcare topics and for the analysis of conditions and behaviours such as cholera [27], cardiac arrest [42], alcohol use [43], tobacco [44], drug use [45], mood swings [46] and the Ebola outbreak [47]. Michael et al. proposed a technique called the Ailment Topic Aspect Model [48, 49] to monitor public health through the detection of diseases, symptoms and treatments in tweets.

In summary, this section described real-time applications of Twitter in various sectors such as healthcare, politics, natural disasters, stock market analysis, sentiment analysis and so on.

5 Case Study of Australian Hay Fever Prediction from Twitter

5.1 Experiment

A case study was conducted that aimed to utilise machine learning algorithms to estimate the prevalence of Hay Fever from Twitter data. The steps involved extraction of relevant tweets, followed by standard pre-processing tasks, manual annotation, automatic classification with a logistic regression model, correlation with external data sources and statistical validation.

The tweets were extracted during the high pollen season (mid-August until the end of November 2017) in Australia (using a location bounding box in the extraction criteria) and included either the terms ‘hay fever’ or ‘hayfever’ or one of the symptoms associated with this condition (according to Wikipedia [50]).

The dataset of 681 tweets was manually annotated by the author, producing 402 Hay Fever (HF) tweets (59% of the dataset) and 279 Non-Hay Fever (N-HF) tweets (41% of the dataset). A logistic regression classifier was trained and tested using 5-fold cross-validation repeated 3 times. The TF-IDF weighting function was applied, and feature selection using a filter method was adopted. Uni-grams were used, based on a minimum term frequency threshold set heuristically to 10.
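The two mechanical parts of this setup, the minimum term frequency filter and the repeated 5-fold split, can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the code used in the experiment: the function names, the random seed and the interpretation of the threshold as a corpus-wide term count are all hypothetical.

```python
import random
from collections import Counter


def select_unigrams(token_lists, min_term_freq=10):
    """Keep uni-grams whose total count across all tweets reaches the threshold."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return sorted(t for t, c in counts.items() if c >= min_term_freq)


def repeated_kfold(n_samples, k=5, repeats=3, seed=0):
    """Yield (train_indices, test_indices) for k-fold CV repeated several times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                       # new random partition per repeat
        folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
        for i in range(k):
            test = folds[i]
            train = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield train, test


if __name__ == "__main__":
    splits = list(repeated_kfold(681, k=5, repeats=3))
    print(len(splits))                              # number of train/test splits
    print(len(splits[0][0]), len(splits[0][1]))     # sizes of the first split
```

Each of the 15 resulting splits would be used to fit the classifier on the training indices and evaluate it on the held-out fold, with the reported metrics averaged over all splits.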

Next, potential predictors, namely weather variables that are common triggers of pollen allergies, were identified and daily observations were collected. These were in turn correlated with the HF tweeting intensity in a set of locations (8 major Australian cities) on each day of the analysed period. Pearson’s correlation coefficients were produced for each city and weather variable at a daily temporal level. For spatial pattern discovery and real-time analytics, interactive maps were developed.
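For reference, Pearson's correlation coefficient between two daily series can be computed as in the sketch below. The series shown are invented placeholder values, not data from the study, and the function name is an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical daily values for one city (placeholders only).
    hf_tweets_per_day = [3, 5, 2, 8, 7, 4, 6]
    avg_wind_speed = [10.0, 14.5, 9.0, 20.0, 18.5, 12.0, 15.0]
    print(round(pearson_r(hf_tweets_per_day, avg_wind_speed), 3))
```

Repeating this computation for every city and weather variable yields the matrix of coefficients that the interactive maps visualise.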

5.2 Results

The first step after tweet extraction and pre-processing was training the classifier to automatically identify HF tweets in the collected dataset. The accuracy obtained was 0.925 for 45 features based on the minimum term frequency threshold of 10. The associated Kappa was 0.846.
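Accuracy and Cohen's Kappa are both derived from the confusion matrix: Kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (accuracy) and p_e the agreement expected by chance. The sketch below shows the computation for a binary classifier; the counts in the usage example are hypothetical and are not the confusion matrix of the experiment.

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Return (accuracy, Cohen's kappa) for a binary confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n                          # p_o: observed agreement
    chance_pos = ((tp + fn) / n) * ((tp + fp) / n)    # both say positive by chance
    chance_neg = ((tn + fp) / n) * ((tn + fn) / n)    # both say negative by chance
    expected = chance_pos + chance_neg                # p_e: chance agreement
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    acc, kappa = accuracy_and_kappa(tp=45, fp=5, fn=5, tn=45)
    print(acc, kappa)
```

Because Kappa discounts agreement that would occur by chance, it is a useful complement to accuracy when the classes are not perfectly balanced, as with the 59% / 41% split here.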

Apart from its high accuracy on the test data, an advantage of the logistic regression classifier is the insight it gives into the terms relevant for prediction, which allows future refinement of the selection criteria. As the main goal is to maximise the overall sensitivity and precision of the system, both extraction and classification form an integral part of a continuous improvement cycle.

Properly defined keywords make it possible to increase the ratio of true positives to true negatives. Therefore, further investigation of the terms identified as most predictive by the classifier, together with their coefficients, enables a better understanding of the classification criteria (Table 1). For instance, the word ‘sneezing’ was highly associated with HF-related tweets, whereas ‘allergy’ occurred mostly in false positive posts. Therefore, the term ‘allergy’ is not a recommended search query for future HF data extraction.

Table 1. Terms coefficients (shortlist).

The word association function further facilitated knowledge discovery about Hay Fever in Australia (Table 2). The combination ‘watery eye’ (r = 0.46) occurred with a correlation twice as high as either ‘red eye’ (r = 0.23) or ‘swollen eye’ (r = 0.22). ‘Stuffy nose’ (r = 0.50) was more common than ‘itchy nose’ (r = 0.22), and ‘sore throat’ (r = 0.46) was the only meaningful and at the same time dominant association for ‘throat’. The correlation score indicates the degree of confidence in the word association. The coefficient values fall between 0.22 and 0.60, revealing moderate to strong correlation. The terms relevant to Hay Fever are underlined.

Table 2. Word associations and their corresponding correlation values for ‘eye’, ‘nose’ and ‘throat’ terms.

In terms of validation, the F test for the whole model proved statistically significant with p < 0.001. The highest adjusted R² (adjusted for the number of predictors) was obtained for Melbourne (0.626). In other words, over 60% of the variance in the number of HF-related tweets (as an indicator of its prevalence) could be explained by the weather statistics and pollen data.

Table 3. Multiple regression coefficients and p-values for Melbourne.

As pollen rate information for Melbourne covered only part of the analysed period, the total number of observations included in the model was 67 (62.0% of the total). The regression constant was set to 0.

Finally, an interactive map was developed to visually explore correlations between weather variables and HF tweeting intensity. The strength and direction of each correlation are indicated by the size and colour gradient of the circle (orange indicates a negative association, blue a positive one) (Fig. 1).

Fig. 1. Relative Humidity correlation map. (Color figure online)

5.3 Discussion

In the analysis for the Melbourne area, moderately strong correlations were observed (Table 3). In particular, the positive correlation between Average Wind Speed and the HF tweeting pattern is worth noting, as wind plays a major role in spreading pollen grains, which trigger allergic symptoms. Other positive and significant correlations occurred for Evaporation and Relative Humidity. Usually, plants are more likely to release their pollen into the air on a sunny day than on a rainy one. However, if the rain occurs around a thunderstorm, the humidity can make pollen grains burst open, releasing a high density of pollen into the air [51]. That may explain the coefficient values obtained. Furthermore, the state of Victoria is known for a high probability of a related respiratory condition called thunderstorm asthma. Indeed, the positive correlation for the Relative Humidity variable parallels the findings of a study on thunderstorm asthma predictability conducted in Melbourne, which reported that higher humidity was associated with higher asthma admissions [52].

The correlation between pollen grain count and hay fever tweeting intensity was found to be statistically insignificant (p = 0.117), although the value obtained was weakly positive (0.014). The pollen data for Melbourne was collected from 6 different pollen stations. The analysis used the average across stations, which might have affected the accuracy of the final output due to variations across the locations.

6 Conclusion

This survey provides a high-level overview of the specifics of Twitter data analysis, the challenges present and the current approaches to address them. Numerous applications from previous studies that use the potential of real-time analytics to transform unstructured tweets into valuable knowledge are presented, along with a case study on Australian hay fever prediction. The experiment combined multiple heterogeneous data sources (structured numerical data and unstructured text) in order to obtain instant insight into potential triggering factors of pollen allergy with the use of machine learning algorithms and interactive maps. The correlation values obtained made it possible to measure the impact of specific variables in order to assist future forecasting. Finally, an analysis of the logistic regression outputs (magnitudes and directions of term coefficients) enabled further refinement of the extraction and classification criteria for ongoing real-time analysis in a continuous improvement cycle.