Introduction

Over the last decade, social media has become an integral part of daily life for people around the world. Following the series of lockdowns in 2020, social media and internet usage grew at the fastest rate in recent years, reaching a reported 4.50 billion users. People use platforms such as Facebook, Twitter, and Instagram not only to share their opinions but also to express their beliefs and emotions to the whole world at the click of a button (Pai et al. 2020). Whenever a catastrophic event occurs, there is a surge of text traffic on Twitter, Facebook, and Instagram in the form of informative messages, emotional outbursts, and rumours (Kostakos et al. 2018), as people tend to react faster to negative news than to positive news (Esraa Najjar and Salam Al Augby 2021; Berger and Perez 2016; Conway et al. 2019). Owing to the wide reach of these platforms, terrorist organizations such as ISIS, Hezbollah, and Al-Qaeda have started using Online Social Networks (OSNs) as a tool to spread propaganda and hate speech, raise funds, radicalize, and recruit new members around the world (Berger and Perez 2016; Zerzri 2017).

The weaponization of social media platforms by extremists has led many governments and researchers to focus on developing new methods to counter cyber extremism. Between August 2015 and December 2017, the micro-blogging platform Twitter suspended 1,210,357 accounts for terrorism-related violations (Berger and Perez 2016; Conway et al. 2019; Aleroud et al. 2020). Although such tech giants have been obliged to form regulations and build tools to counter online extremism (Torregrosa et al. 2021), the Christchurch attack in New Zealand was still live-streamed in 2019 (Aleroud et al. 2020; Gaikwad et al. 2021). It is also often claimed that extremists now deploy countermeasures to return to Twitter and have increased their use of it to spread propaganda (Conway et al. 2019). This activity has prompted many governments to increase funding for counter-extremism research. Most extremism-related data published online consists of text and images, so analysing such a large volume of data manually and drawing conclusions from it is tedious. With Artificial Intelligence (AI) technologies, researchers have made numerous contributions to extremism research. Sentiment Analysis (SA) and Opinion Mining (OM) are two emerging areas used to classify the sentiment of vulnerable tweets, to reach appropriate conclusions, and to make predictions about future mass violent events. A large number of research publications analyse the sentiment of individual tweets (Esraa Najjar and Salam Al Augby 2021). Most of these papers use one of two popular approaches: lexicon-based or machine-learning-based. The lexicon-based approach uses manually pre-classified sentiments for certain words and is further divided into dictionary-based and corpus-based approaches, whereas the machine-learning approach uses algorithms such as Maximum Entropy (ME), Naive Bayes (NB), Support Vector Machines (SVM), and Neural Network (NN) models to classify tweets.

Motivations and contributions

With the rise of social media platforms, extremists take the opportunity to portray themselves as saviours and to cultivate and recruit vulnerable youth to commit violent or lone-wolf attacks (Torregrosa et al. 2021; Gaikwad et al. 2021; Dadkhah et al. 2021; Rowe and Saif 2016). To understand such extremist acts and behaviour, several research contributions have been made using both manual and automated techniques. After reviewing the literature from various perspectives, including the availability of datasets, proposed detection techniques, and performance validation methods and tools, we found that very few surveys approached the problem conceptually, whereas others focussed on the identification and classification of extremists. This literature has some limitations. First, there is no standard discussion of data sources or dataset selection criteria, and custom-made datasets need to be studied and standardized to close the research gaps. Second, some studies focussed only on a specific detection process and paid little attention to validation techniques. This study aims to close the gap between existing research and its limitations by shedding light on data sources, identification and classification, tools, and the validation of performance metrics. This article is a systematic review of a large body of literature, analysed using a comparative approach. In addition, it presents the state of the art in counter-terrorism research opportunities based on the following research objectives:

RO1: Outline the various data sources, datasets, and tools available for combating online extremism.

RO2: Summarize how sentiment analysis techniques are used in the field of extremism research.

RO3: Present current topics and contributions of machine learning techniques to extremism research.

RO4: Discuss the data validation techniques and steps required for sentiment analysis in extremism research.

RO5: Highlight future directions and challenges of the domain based on this study.

This study broadly surveys the existing Twitter sentiment analysis methodologies and techniques for combating cyber-terrorist activities. Various performance metrics are discussed to compare the existing approaches and to determine which might be the best fit for predicting terrorist attacks in future. The “Research Methodology” section presents the research methodology, and the “Sentiment Analysis in Twitter” section outlines sentiment analysis, its levels, challenges, and feature selection types. The “Twitter Sentiment Analysis Approaches” section discusses the various approaches to sentiment analysis, and the “Data Sets and Collection Strategies” section covers data sets and collection strategies. The “Discussion and Future Research Directions” section then presents the discussion and future research directions, and the “Conclusion” section concludes this study.

Research methodology

Following a systematic approach (Misra 2021), we conducted a survey of articles that contributed to the detection of extremism using sentiment analysis techniques. The articles were extracted from databases such as ScienceDirect, Scopus, IEEE Transactions, and Web of Science. While searching for articles we used keywords including, but not limited to, the following: ((“Extremism Detection” OR “Cyber Terrorism” OR “ISIS” OR “Jihadist” OR “Propaganda” OR “Radicalization” OR “Classification of Extremist”) AND (“Sentiment Analysis” OR “Polarity Analysis” OR “Topic Detection” OR “Opinion Analysis” OR “Emotion analysis”)). A few key terms are widely used in the extremism domain. Extremism can be defined as “an ideology or supporting belief, not based on civil or ethical values of a society and uses various methods like verbal or physical violence to achieve its goals” (Harb 2019). Radicalization is the adoption of extremist or violent views through a change in belief, and propaganda can be defined as biased information used to justify the point of view of a specific group of people or a political cause (Berger and Perez 2016; Garg et al. 2017; Misra 2021). A general screening process was conducted to check the relevance and quality of the articles by examining the title, abstract, clarity of the description of the proposed methodology and algorithm, validation techniques, and, most importantly, the datasets used. The initial search returned 283 articles. After this screening, 68 articles were included in this survey, as shown in Fig. 1.

Fig. 1 Articles surveyed in the review

Sentiment analysis in Twitter

Sentiment analysis is a process that automates the mining of attitudes, opinions, and emotions from text, audio, video, or other data sources through Natural Language Processing (NLP). Figure 2 shows the basic flow of Sentiment Analysis in Twitter (SAT). Analyzing sentiment in Twitter is important in decision-making processes, where it involves classifying the opinions in a text into categories such as "positive", "negative", or "neutral". It is an automated detection and quantification of the thoughts and emotions in a tweet, and is also referred to as opinion mining or subjectivity analysis. Words such as sentiment, opinion, and belief are often used synonymously, although there are differences among them (Pai et al. 2020; Kharde and Sonawane 2016; Sharma et al. 2018; Giachanou and Crestani 2016).

Fig. 2 Sentiment analysis process

Sentiment analysis levels

Generally, sentiment analysis is concerned with the object, the features of the object, and the opinion about the object. The polarity of an object or tweet can be analysed at three levels (Ali 2015; Kolkur et al. 2015).

Document level

Document-level analysis examines a piece of text or a document to determine whether it carries positive, negative, or neutral sentiment. The entire opinionated document is treated as a single unit of information. This works best for texts such as movie or product reviews.

Sentence level

Sentence-level analysis treats each sentence as an individual unit that may carry a different opinion. It has two sub-tasks, namely subjectivity classification and sentiment classification. In subjectivity classification, every sentence is classified as objective or subjective, where a subjective sentence expresses opinions and an objective sentence states only facts. In sentiment classification, a sentence is classified as positive, negative, or neutral depending on the opinion words present in it (Torregrosa et al. 2021; Kolkur et al. 2015).

Feature level

Feature-level analysis labels each word with its opinion and identifies the target at which the sentiment is directed. Feature engineering is concerned with identifying and extracting aspects or features from the given data.

Sentiment analysis challenges in Twitter

Sentiment analysis on Twitter (SAT) is a special kind of social media monitoring and listening. It is a non-trivial and challenging task that involves mining and preprocessing unstructured text, such as tweets and reviews, for feelings about and evaluations of a specific event or service.

Unlike traditional sentiment analysis of websites, blogs, or forums, sentiment analysis on Twitter poses some unique challenges, such as text length, topic relevance, incorrect English, negation, stop words, and tokenization (Giachanou and Crestani 2016). Figure 3 depicts challenging scenarios in the pre-processing of tweets. Below, we discuss some of the vital challenges of Sentiment Analysis in Twitter.

Fig. 3 Scenarios of pre-processing tweets

Text Length: One of the major differences between conventional sentiment analysis and SAT is the length of a tweet, which can be up to 280 characters. To assess topic relevance when analyzing the sentiment orientation of a tweet, many existing works rely on the presence of a particular word, while a few other studies treat hashtags as a reliable indicator of a tweet’s relevance to a certain topic. That said, because tweets are short, they are easier to classify than longer documents such as review pages and blogs.

Data Sparsity and Negation: Owing to the casual style of contemporary communication and the length restriction, tweets may contain a lot of noise such as incorrect English and misspellings. In addition, negation words play a vital part in determining tweet polarity. Identifying negations is a crucial and challenging task in sentiment analysis because a negation may reverse the sentiment polarity (Giachanou and Crestani 2016).

Multilingual and Multimodal Content: Tweets are written in 34 different languages. A further challenge is that tweets are sometimes written in mixed languages, such as English + Arabic or English + Spanish. Identifying the correct polarity of short mixed-language tweets has yet to be explored by researchers. Moreover, tweets may contain images or videos, and analysing such multimodal content is also an under-explored area in sentiment analysis.
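To make the pre-processing and negation challenges above concrete, the following is a minimal Python sketch that is not taken from any of the surveyed works. It strips URLs and user mentions, tokenizes the tweet, and appends a `_NEG` suffix to every token that follows a negation cue until the next punctuation mark, so that a downstream classifier can distinguish "good" from "not good". The cue list, the function names `tokenize` and `mark_negation`, and the suffix convention are illustrative assumptions.

```python
import re

# Hypothetical, deliberately small cue list (an assumption for illustration).
NEGATION_CUES = {"not", "no", "never", "cannot", "don't", "isn't", "won't"}
CLAUSE_END = {".", ",", "!", "?", ";"}


def tokenize(tweet):
    """Lowercase a tweet, drop URLs and mentions, and split it into tokens."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = text.replace("#", " ")              # keep the hashtag word, drop '#'
    return re.findall(r"[a-z']+|[.,!?;]", text)


def mark_negation(tokens):
    """Suffix tokens inside a negation scope with _NEG.

    The scope starts after a negation cue and ends at the next
    punctuation mark (a simple heuristic, not a full parser).
    """
    marked, in_scope = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            in_scope = False
            marked.append(tok)
        elif tok in NEGATION_CUES:
            in_scope = True
            marked.append(tok)
        else:
            marked.append(tok + "_NEG" if in_scope else tok)
    return marked


tokens = mark_negation(tokenize("@user This is not good, see https://t.co/x #news"))
```

This heuristic handles only explicit cues in English; mixed-language tweets and sarcasm, as noted above, would need considerably more than a cue list.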

Feature extraction

In most SAT methods, the accuracy of sentiment analysis depends on feature selection. The selected features and their combinations play an important role in analyzing the sentiment of a tweet. The selected key features form feature vectors, which are required for the subsequent classification tasks. Here, we present a few key features used in existing works (Kharde and Sonawane 2016).

Syntactic features include n-grams, dependency trees, and part-of-speech tags, which are used to understand subjectivity patterns (Kaati et al. 2015). The most widely used n-gram features are unigrams, bigrams, and higher-order n-grams with their frequency counts. Parts of speech such as adjectives, adverbs, verb clusters, and nouns are good indicators of subjectivity and sentiment. Syntactic dependency patterns can be generated by parsing or from dependency trees (Ngoge 2016).

Opinion Words and Phrases: Apart from individual words, idioms and phrases that convey sentiment can also be used as features (Omer 2015). An opinion word is treated as a binary-valued feature that indicates whether the word is present in the sentence, where 1 denotes the occurrence of the word and 0 denotes its absence (Sharma et al. 2018). Sometimes the frequency of such words is also considered and compared to analyse the sentiment polarity of a sentence.

Stylistic and Twitter-specific features include individual writing style as reflected in emoticons, abbreviations, and intensifiers, as well as hashtags, URLs, followers, and retweets. Feature extraction techniques have many benefits, including improved accuracy, reduced risk of overfitting, faster training, better data visualization, and better explainability of the classification model. However, before deciding on feature vectors, one should analyze which features to use, because using fewer, well-chosen features can improve the model's performance.
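As an illustration of two of the feature types described above, the following Python sketch computes n-gram frequency counts and a binary opinion-word vector for a short text. It is a minimal example under simplifying assumptions (whitespace tokenization, a three-word hypothetical opinion list `OPINION_WORDS`) and does not reproduce the feature pipeline of any surveyed study.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=2):
    """Count all n-grams from unigrams up to max_n-grams."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def binary_features(text, opinion_words):
    """Return 1 for each opinion word present in the text, 0 otherwise."""
    present = set(text.lower().split())
    return [1 if word in present else 0 for word in opinion_words]


# Hypothetical opinion-word list, for illustration only.
OPINION_WORDS = ["good", "bad", "sad"]

example = "sad news and more sad news"
counts = ngram_counts(example)
vector = binary_features(example, OPINION_WORDS)
```

Here `counts` maps each unigram and bigram to its frequency, while `vector` records only presence or absence, which matches the binary-valued representation described for opinion words.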

Performance metrics for sentiment analysis on Twitter

The performance of Twitter sentiment analysis is evaluated with metrics such as accuracy, precision, recall, and F-score. Sentiment analysis is a classification problem that involves classifying the opinions in a tweet into categories such as "positive", "negative", or "neutral" (Giachanou and Crestani 2016; Kolkur et al. 2015). Once the classifier model is developed, the next phase is to measure its performance. A confusion matrix is a tool that records actual and predicted classifications in order to determine the performance of the classifier. For a binary task, its four cells count the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

Figure 4 presents the performance metrics used in existing works on Sentiment Analysis in Twitter (Kostakos et al. 2018).

Fig. 4 Confusion matrix

Sensitivity, or recall, is the proportion of actual positive samples that the classifier labels as positive. It is calculated as:

$$ {\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} $$

Precision is the ratio of the number of correctly classified positive samples to the total number of samples predicted as positive. It shows the correctness achieved in positive predictions. That is:

$$ {\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}} $$

Accuracy is a widely used metric defined as the proportion of all predictions that are correct. That is:

$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} $$

F-Score is the harmonic mean of recall and precision. It is also known as the F1-score or F-measure and is calculated as:

$$ {\text{F - Score}} = 2 \times \frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}} $$
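The four formulas above can be checked with a short Python sketch. It assumes a binary task in which one label (here the string "positive", an arbitrary choice) is treated as the positive class; the function names are hypothetical, and undefined ratios are returned as 0.0 by convention.

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Count TP, TN, FP, FN for a binary task from two label lists."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Compute recall, precision, accuracy, and F-score as defined above."""
    total = tp + tn + fp + fn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f_score": f_score,
    }


actual = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]
metrics = classification_metrics(*confusion_counts(actual, predicted))
```

For a three-class task (positive, negative, neutral), these quantities are usually computed per class and then averaged; that extension is not shown here.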

Twitter sentiment analysis approaches

Sentiment analysis is an emerging field used for classifying the sentiment or polarity of vulnerable tweets to reach appropriate conclusions. There are two widely used approaches, namely lexicon-based and machine-learning-based. Figure 5 shows the classification of sentiment analysis approaches.

Fig. 5 Sentiment analysis approaches

Machine learning based approaches

Omer (2015) presented a machine-learning-based approach using an AdaBoost classifier. The author first collected three datasets: tweets from supporters of ISIS (TW-PRO), tweets from opponents of ISIS (TW-CON), and random tweets with no connection to ISIS (TW-RAND). In total, 135,608 tweets were collected, and 619 features were selected through stylometric, time-based, and sentiment-based feature selection processes. Three classifiers were used, namely support vector machine (SVM), AdaBoost, and Naive Bayes (NB). AdaBoost correctly classified 100% of the instances, while NB reached 99.9% and SVM 99.1%.

Smedt et al. (2018) developed a technique based on NLP and machine learning (ML) to automatically detect jihadist hate speech. They gathered around 45,000 tweets over the period from October 2014 to December 2016. The LIBSVM machine learning algorithm was trained on balanced data. The accuracy varied with the language of the tweets: 80% for French, 79% for English, 80% for Farsi, 84% for Arabic, and 81% for Portuguese, with an overall accuracy of 82%.

Mirani and Sasi (2016) proposed an approach that combines geolocation with data mining algorithms. They contributed a system for categorizing ISIS-related tweets based on polarity classification. Using the “Jeffrey Breen” algorithm, they compared five hashtags: #ISLAMICSTATE, #DAESH, #ISIS, #ISIL, and #IS. Performance was measured by accuracy, F value, recall, and precision. Of the five hashtags, #ISIL provided the highest accuracy when the dataset was tested on SVM, maximum entropy, bagging, random forest, and decision tree classifiers. An average accuracy of 90% was achieved after a tenfold cross-validation test, and the maximum accuracy of 99% was obtained with the maximum entropy classifier.

Nouh et al. (2019) presented an approach to automatically analyze extremist propaganda materials and radical content in tweets. The authors used three datasets: two Kaggle datasets (How ISIS uses Twitter, with 17,000 tweets, and Tweets targeting ISIS, with 122,000 tweets) and one dataset crawled through the Twitter API, consisting of 8000 tweets from 1000 users. They applied TF-IDF and the LIWC dictionary as feature extraction methods and achieved 80% accuracy by training various machine-learning models such as SVM, K Nearest Neighbor, and Random Forest. Davidson et al. applied the same feature engineering techniques to 24,802 tweets and achieved a recall of 0.91 with SVM. Table 1 presents a comparative analysis of machine-learning-based approaches.

Table 1 Comparison of machine learning based approaches

Kaati et al. (2015) demonstrated a machine-learning method to recognize Twitter accounts that support jihadist groups and distribute propaganda content online. Feature engineering was performed by analysing data dependency, and the features were divided into data-dependent features, data-independent features, and the combination of both. The authors used two datasets (English tweeps and Arabic tweeps) with tweets containing hashtags related to jihadists, especially ISIS. They used a binary text classification method to detect tweeps involved in the media mujahideen and applied linguistic features to train the AdaBoost classifier. In validation tests on English tweets, the accuracy for data-dependent features, data-independent features, and their combination was 99.07%, 98.82%, and 99.51%, respectively. For Arabic tweets, the accuracy was 82.4% for data-independent features, 84.66% for data-dependent features, and 86.38% for the combined features. The results for English tweets showed higher accuracy, precision, and recall than those for Arabic tweets.

Ferrara et al. (2016) discussed a sentiment analysis technique to predict the polarity of users' interactions with extremists. They used a public dataset known as Lucky Troll Club, which contains 3,395,901 tweets from 25,000 user accounts. They used metadata as features with a greedy selection algorithm and classified the sentiment using Logistic Regression and Random Forest classifier models. Araque and Iglesias (2020) proposed a machine-learning-based approach to identify radical text on Twitter and in online press or magazines. They used Twitter datasets such as Pro-Anti and Pro-Neu, together with online magazines, and generated distributed representations of the text, which capture the similarity between the analysed text and a particular lexicon and are fed into Linear SVM and Logistic Regression classifiers. In addition, they proposed a novel approach that uses an emotion dictionary to calculate a statistical summary of the emotions in the analysed text. They evaluated performance on Pro-Neu, Pro-Anti, and Magazines using the F1-score and achieved 92.41%, 77.21%, and 72.22%, respectively.

Rehman et al. (2021) worked on identifying radical text in social media, on the premise that religious language plays a major role in radicalization on Twitter. They employed religious features and radical features to train the algorithm. They collected 7000 tweets from 15 October 2019 to 20 October 2019 and performed feature engineering on the radical and religious features using the TF-IDF technique. They then applied Naïve Bayes, SVM, and Random Forest to predict the polarity. A tenfold cross-validation test was applied to validate the results, yielding an F-score of 0.87, which is higher than that of existing works.
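Several of the works above rely on TF-IDF weighting. As a rough illustration only, the Python sketch below computes one common variant, with term frequency normalized by document length and inverse document frequency taken as log(N / df). The surveyed papers do not state which variant they used, so the exact formula, the whitespace tokenization, and the function name `tf_idf` are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        weights.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in tf.items()}
        )
    return weights


weights = tf_idf(["attack news", "attack rumour"])
```

A term that occurs in every document, such as "attack" in this toy corpus, receives a weight of zero, which is why TF-IDF tends to emphasize terms that distinguish one tweet from the rest.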

An et al. (2021) presented a supervised machine-learning technique to foresee terrorist events or potential risks using microblog entries (tweets). The authors combined Word2Vec and K-means clustering to identify topics, which were then used for emotion analysis. A logistic regression classifier was used and achieved 85.8% accuracy. Garg et al. (2017) studied the survival and sentiment of tweets posted after a terror attack. They considered features such as the last retweet, number of retweets, and number of favourites to study the information flow on Twitter. They adopted a combined Naïve Bayes and SVM (NB-SVM) classifier to find the polarity of the information flow in 59,988 tweets collected from 16 September 2016 to 15 October 2016. The results showed that negative tweets lasted longer than positive tweets, although the number of negative tweets was significantly lower than the number of positive tweets.

Smith et al. (2020) created a paradigm to detect and predict changes in users' mindsets associated with psychological group membership, based on their Twitter posts. They analyzed the longitudinal changes in individual users' Twitter posts over time. For this, they collected 40,053 tweets from 110 users who supported Daesh (ISIS) and compared them with baseline Twitter timelines of 215,008 tweets from 109 users. They used a logistic regression classifier to classify the accounts as baseline users or Daesh supporters, validated the results using tenfold cross-validation, and achieved 89% accuracy, 89% F-score, 88% recall, and 90% precision.
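Tenfold cross-validation, which several of the studies above use for validation, splits the data into ten folds and uses each fold once as the test set while training on the remaining nine. The Python sketch below shows only the index bookkeeping; the function name `k_fold_indices` is hypothetical, and the sketch does not shuffle or stratify the data, which a real experiment would normally do.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    When n_samples is not divisible by k, the first n_samples % k folds
    receive one extra sample.
    """
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(k_fold_indices(25, k=10))
```

A metric such as accuracy or F-score is then computed on each test fold and averaged over the ten folds to give the reported figure.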

Omar et al. (2021) developed a technique to find the correlation between hate speech and topics in online social media. They collected 14,000 tweets and 33,000 Facebook posts, developed a multi-label Arabic dataset, and annotated it manually into 11 classes. For multi-label classification, they applied machine-learning classifiers such as Linear SVC, Logistic Regression, and Random Forest with N-gram, TF-IDF, and BoW feature representations to classify the sentiment polarity as positive, negative, or neutral, and achieved the highest accuracy of 97.92% with Linear SVC using N-gram (1,2) features.

Dadkhah et al. (2021) proposed a method to detect online hostile activities automatically by analysing the polarity of online news content. They investigated many datasets along dimensions such as role, level of influence, vulnerabilities, and distribution pattern. The authors implemented the detection system using machine-learning techniques, deep-learning models, NLP, and Social Network Analysis techniques and analysed the data based on bot score, credibility score, classification score, topic modelling, named entity recognition, truthful score, sentiment score, risky score, and community detection. They contributed a visual data analytics framework that provides a complete understanding of cyber activities at several levels; the results were evaluated with a tenfold effectiveness test and reached approximately 95%. Hartung et al. (2017) demonstrated an approach to detect whether a Twitter user is a right-wing extremist or a non-extremist using a dataset of 45,747 tweets. They used Bag of Words (BoW), bi-grams, emoticons, and identity as features and achieved 95% accuracy with an SVM classifier.

Jain and Vaidya (2021) proposed an approach to analyze public sentiment on the Uri, Pulwama, and Surgical Strike attacks. They collected tweets containing hashtags related to these attacks to determine sentiment based on users' geolocation. The authors used the K-Means clustering algorithm to find the geolocation of the users and a Naïve Bayes classifier to classify the orientation of users' sentiment towards these attacks, with emotions such as anger, anxiety, and sadness, into positive, negative, and neutral polarity.

Aleroud et al. (2020) suggested a methodology based on feature augmentation to categorize Twitter accounts into Pro-ISIS and Anti-ISIS accounts. Terms from the tweets are treated as nodes in a graph and clustered based on similar terms. They used two Kaggle datasets, the first containing 17,000 tweets from 112 Pro-ISIS accounts and the second containing 77,813 tweets from 95,725 Anti-ISIS accounts. They tested the data on SVM, KNN, Decision Tree, and Random Forest and achieved an F1-score of 88% on the original data and 94% after data reduction.

Masood and Abbasi (2021) proposed a framework called Supervised Rebel Identification to identify rebel users on Twitter. They developed a methodology to structure the tweets into a directed user graph. The user graph is then converted into a graph embedding so that its semantics can be used by machine-learning classifiers such as SVM, Random Forest, Gaussian Naïve Bayes, and Logistic Regression. They used 284,000 tweets to classify users as rebel, counter-rebel, or normal. Similarly, Abrar et al. (2019) proposed a machine-learning technique for real-time analysis of terrorism-related tweets. They extracted features using n-gram methods on 55,123 tweets and classified them using SVM and Multinomial Logistic Regression models.

Deep-learning based approaches

Deep learning is based on artificial neural networks in which multiple layers of processing are used to progressively extract high-level features from data (Nizzoli et al. 2019). Table 2 summarizes the techniques based on deep learning. Harb and Becker (2018) proposed an approach based on the study of the emotional reactions of Twitter users to a few terrorist events that occurred in the United Kingdom. They used two deep-learning architectures to create an emotion classifier and analysed tweets related to terrorist events to understand whether there is an emotional shift due to a terrorist attack and whether the emotional reactions depend on the incident or on the demographics of the users (Harb et al. 2019). Both models, based on convolutional and recurrent neural network architectures, offered similar performance. The analysis showed an emotional shift due to the events and a difference in the reactions to each specific event, with gender being the most critical factor. The results were obtained with precision above 70%, recall above 70%, and F-measure below 60%.

Table 2 Comparison of deep learning-based approaches

Alhalabi et al. (2021) developed an artificial-intelligence-based terrorist behavior detection system. The authors proposed a value proposition based on a unified methodology to characterize Arabic tweets on Twitter. The system uses advanced social mining techniques to detect terrorist behavioral patterns and provides enhanced visualization and decision-making support. They collected 10,000 tweets from July 2018 to October 2018 and analyzed the polarity using deep-learning models.

Ahmad et al. (2019) presented a deep-learning-based technique that analyses sentiment and classifies tweets into extremist or non-extremist categories. Their proposed work operates in three stages: collection of users’ tweets, pre-processing, and classification of tweets into extremist and non-extremist classes using an LSTM + CNN model, which obtained an accuracy of 92.66%.

Zinovyeva et al. (2020) elaborated on the detection of anti-social online behaviour using NLP deep learning. The authors compared their work with existing deep-learning-based detection methodologies. For this, they considered four datasets from online social media, including Twitter. They used SVM, Random Forest, Logistic Regression, and LightGBM (Light Gradient Boosting Machine) to classify the data and obtained an average precision of approximately 99.6% across all models. Harb et al. (2020) developed a framework to analyse the emotional responses to various mass shooting events and their influential features. They collected tweets from two days before to five days after eight different mass shooting events. They created emotion classifiers using three different deep-learning strategies (Convolutional Neural Network, biLSTM, and BERT), classified the emotions into anger, fear, sadness, surprise, disgust, and no-feel, and achieved an average F-measure of nearly 75%.

Lexicon-based approaches

Lexicon-based methods employ a word list or a dictionary annotated with polarity scores to determine the opinion score of the given data. This method does not require training data. Analysing tweets to find polarity using a lexicon is challenging because of ever-changing colloquial expressions and hashtags (Giachanou and Crestani 2016). Nevertheless, quite a few existing works use lexicon-based approaches, as shown in Table 3. Simon et al. (2014) developed a document-level sentiment analysis of original tweets sent from the field by emergency organizations and their managers during the Kenya Westgate Mall attack. The authors used a corpus-based approach to analyze the positive and negative classifications of the tweets and recommended that emergency organizations' dispatchers in the field and the communication center minimize the use of negative emotion when communicating with the public at such times. They used 67,849 tweets collected from 21 to 25 September 2013 and obtained 59.6% accuracy for positive classifications of manager tweets and 46.5% accuracy for negative classifications of emergency organizations' tweets.

Table 3 Comparison of lexicon-based approaches

Mansour (2018) proposed an approach to examine how people from western and eastern countries view, or sympathize with, ISIS. The author used text sentiment analysis to analyze word frequency and the sentiment of words using Term Frequency-Inverse Document Frequency (TF-IDF) on 6853 tweets collected from Sep 2017 to Dec 2017. The author obtained positive accuracy of 29% to 33% and negative accuracy of 67% to 71%. Fadel and Cemil (2020) created a model for automatically classifying users’ reviews on Twitter after a terrorist attack. The model was developed using lexicon and machine-learning approaches: the lexicon approach was used to create a labelled training dataset, while the machine-learning approach was used to build the model. Scores of some domain-related words were neutralized to avoid their negative effect. Features were selected based on part-of-speech tags such as VER, ADJ, and the combination VER + ADJ. Majority voting between the NB, SVM, and LR machine-learning classification algorithms was applied. The performance of the classification algorithms was measured using accuracy and F1 scores. Negative-polarity tweets were categorized as terrorist supporters and positive-polarity tweets as non-supporters. The results were compared to identify the best classification algorithm for feature selection. This model achieved 94.8% accuracy with a 95.9% F1 score.
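The majority-voting step mentioned above can be illustrated with a short Python sketch. It assumes that each classifier has already produced one label per tweet; the function name `majority_vote` and the example predictions are hypothetical and do not come from Fadel and Cemil (2020).

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions: a list with one label list per classifier, all of the
    same length. On a tie, the label of the earliest-listed classifier
    among the tied labels is returned.
    """
    voted = []
    for labels in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three classifiers on three tweets.
nb_labels = ["negative", "positive", "positive"]
svm_labels = ["negative", "negative", "positive"]
lr_labels = ["positive", "negative", "positive"]

combined = majority_vote([nb_labels, svm_labels, lr_labels])
```

With three classifiers and two classes a tie cannot occur, which is one practical reason for using an odd number of voters.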

Kostakos et al. (2018) carried out a comprehensive study of two events, the Manchester attacks and the Las Vegas shooting, to analyze the reactions shown on Twitter and the way those reactions spread over the incident timeline. They found “echo chambers”, that is, groups of people sharing similar interests about the same event. They assigned positive and negative scores to each tweet using two lexicon-based methods. First, SentiWordNet 3.0 scores were determined for every word by looking up the negative and positive values in the lexicon using the word and its PoS tag. Second, SentiStrength was used to find the final sentiment score, which was calculated by adding the positive and negative scores and dividing the sum by the number of words in the tweet. Although the sentiment analysis technique used by the authors distinguishes real news from fake news, the results should have been reported using performance metrics such as accuracy, precision, and F-score.
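The length-normalized score described above can be sketched as follows. This is an illustrative approximation only: the lexicon entries and their values are invented and are not taken from SentiWordNet or SentiStrength, PoS tags are ignored, and negative scores are stored as magnitudes and subtracted, which is equivalent to adding signed scores.

```python
import re

# Toy lexicon: word -> (positive score, negative score).
# The values are invented for illustration only.
TOY_LEXICON = {
    "safe": (0.6, 0.0),
    "brave": (0.7, 0.0),
    "attack": (0.0, 0.8),
    "horrific": (0.0, 0.9),
}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", text.lower())


def tweet_score(text, lexicon=TOY_LEXICON):
    """Return (sum of positive - sum of negative scores) / number of words."""
    words = tokenize(text)
    if not words:
        return 0.0
    pos = sum(lexicon.get(w, (0.0, 0.0))[0] for w in words)
    neg = sum(lexicon.get(w, (0.0, 0.0))[1] for w in words)
    return (pos - neg) / len(words)


for tweet in ["Horrific attack", "Everyone is safe", ""]:
    print(repr(tweet), round(tweet_score(tweet), 3))
```

Dividing by the number of words keeps long and short tweets comparable, at the cost of diluting the score of tweets that contain many words missing from the lexicon.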

In Rekik et al. (2020), the authors developed a recursive method to detect radical groups on social media, mainly Twitter. Their analysis is based on violent vocabulary and suspicious interactions among anti-social communities, and they computed the danger degrees of the recognized users to find radical communities. They performed an unsupervised learning analysis on tweets from 3325 accounts, iterated the analysis three times, and measured the performance using the F-measure.

The authors in Ngoge (2016) proposed a machine-learning-based technique to determine the level of Twitter terrorism and to identify terrorist activities. To achieve this, they implemented Maximum Entropy, SVM, and Naïve Bayes classifiers together with lexicon-based approaches to classify the trends in 346 tweets pertaining to a terrorist attack over seven days in Kenya. They achieved 73% accuracy, 15% recall and 60% precision while predicting the real-time sentiment about the attack. Simon et al. (2014) developed a methodology to determine the time of radicalization among Twitter users using divergent behavior analysis. They considered 154 K users, created a lexicon-based corpus for the analysis, and found that only 727 users showed interest in pro-ISIS behavior.

In Al-Khalisy and Jehlol (2018), the authors used data mining techniques to extract useful information such as supporter data (for example, location and account name) and terrorism propaganda. They gathered around 10,322 tweets related to the keyword terrorism, preprocessed them, and converted them into a text corpus. Their proposed work consists of two modules: the first analyzes the Twitter data, and the second maps the sentiment with GeoJSON to find the location of terrorists. They used a manually created word list containing synonyms and antonyms from a dictionary to analyze polarity. They also employed a bag-of-words feature, calculating the total number of word points in the tweets to indicate the training data. Based on the training data, a Naive Bayes classifier classified 7122 tweets as negative.
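To make the idea of mapping sentiment with GeoJSON concrete, the following is a minimal sketch and not the implementation of Al-Khalisy and Jehlol (2018). The input field names (`account`, `polarity`, `lon`, `lat`) and the sample records are hypothetical; only the GeoJSON structure itself (a FeatureCollection of Point features with coordinates in longitude, latitude order) follows the standard format.

```python
import json


def to_geojson(tweets):
    """Build a GeoJSON FeatureCollection from tweets that carry coordinates."""
    features = []
    for tweet in tweets:
        if tweet.get("lon") is None or tweet.get("lat") is None:
            continue  # tweets without a location cannot be mapped
        features.append({
            "type": "Feature",
            # GeoJSON stores positions as [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [tweet["lon"], tweet["lat"]],
            },
            "properties": {
                "account": tweet["account"],
                "polarity": tweet["polarity"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical, already-classified tweets.
sample = [
    {"account": "user_a", "polarity": "negative", "lon": 36.82, "lat": -1.29},
    {"account": "user_b", "polarity": "positive", "lon": None, "lat": None},
]

collection = to_geojson(sample)
print(json.dumps(collection, indent=2))
```

The resulting document can be loaded by common mapping tools, which can then colour each point by its `polarity` property.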

Data sets and collection strategies

Dataset collection is a critical step in any research process. Collecting data on online extremist groups and activities has become extremely hard, since online extremism is considered a highly sensitive domain for risk and security reasons. Nevertheless, many researchers have collected their own data or used publicly available datasets for analysis. Because social media platforms gather user data by default, they also act as a data source for researchers. Among social network platforms, Twitter is notorious for having been widely used by extremists, and it has become a popular data source for researchers (Kostakos et al. 2018; Berger and Perez 2016; Conway et al. 2019; Aleroud et al. 2020; Torregrosa et al. 2021; Gaikwad et al. 2021). Many existing works that used their own datasets obtained textual data through web crawling tools (Sheth et al. 2021) such as TwitterEcho or TwitterCrawl, using keywords related to extremism (Kaati et al. 2015; Ngoge 2016; Smedt et al. 2018; Harb et al. 2019; Harb et al. 2020; Kumar et al. 2017). Such datasets contain basic account details along with metadata such as follower and following details, tweet text, retweets, and mentions. However, the collection process behind such datasets has serious limitations, such as failing to characterize account inclusion errors and errors caused by the lack of filtering and of a standard validation process (Rowe and Saif 2016; Ferrara et al. 2016; Deven et al. 2018; Berger and Morgan 2015).

On the other hand, publicly available datasets from Kaggle.com (Tribe et al. 2015; Dataset 2016) have been widely used in online extremism research to avoid the hassle of building a custom dataset. Such datasets are mainly based on supporters and opponents of ISIS (Aleroud et al. 2020; Omer 2015; Sharma and Jain 2020). We found that these standard datasets have some problems. First, even after the mass suspension of Twitter accounts during 2016, the datasets still contain accounts of suspended users. In addition, to gain new insights in counter-terrorism research, authors should try new datasets instead of reusing the same old ones over and over. Table 4 lists the sources of the standard datasets.

Table 4 Sources of publicly available datasets

Discussion and future research directions

Despite ongoing advances in Twitter Sentiment Analysis, it remains an open research area and many issues are underexplored. The most important challenges are the lack of dataset verification, the lack of benchmarks in the field (since most existing systems are based on existing theories), and the handling of multilingual and multimodal content (Fernandez and Alani 2021; Softness 2016). In this section, we discuss the attainment of the research objectives in order to highlight some potential avenues for tackling the problem of online extremism (Torregrosa et al. 2021; Gaikwad et al. 2021; Narula and Jindal 2015). The attainment of these research objectives is the learning outcome of this literature review, and the level of attainment can be checked against the insights obtained from the review process.

RO1: Outline the availability of various data sources or datasets and tools for combating online extremism

Based on this comprehensive review, Twitter is the OSN most often identified in counter-extremism research. Data can therefore be obtained directly from Twitter using the Twitter API or data crawling techniques. Public datasets can also be downloaded from websites such as Kaggle.com. Between August 2015 and December 2017, Twitter suspended 1,210,357 accounts for terrorism-related violations under its safety policies, and most researchers created and used datasets from around that period. As a result, few standard datasets are available on the internet. Moreover, most of the available datasets lack inter-rater agreement, which often results in lower classification accuracy. A summary of the custom-made and public datasets, including dataset size and the articles that used them, is given in the “Data Sets and Collection Strategies” section.
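Inter-rater agreement can be quantified in several ways; one common choice for two annotators is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The sketch below is illustrative only: the surveyed works do not state which agreement measure they would use, and the annotator labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators for six tweets.
annotator_1 = ["pro", "pro", "anti", "anti", "pro", "anti"]
annotator_2 = ["pro", "anti", "anti", "anti", "pro", "pro"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.333
```

In this example the annotators agree on four of six tweets (0.667), but half of that agreement is expected by chance, so kappa is only about 0.33.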

RO2: Summarize how sentiment analysis techniques are used in the field of extremism research

In the literature, feature extraction methods such as TF-IDF, Part-of-Speech tagging, topic modelling (LDA), and N-grams were combined with various machine-learning techniques to identify the sentiment of tweets. Researchers used sentiment analysis techniques not only to identify the most common terms related to extremism, but also to find the polarity of emotions in tweets after real-life events, to study reactions to comments, to analyze message content, and to detect abnormal or hostile activities of extremists.
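As an illustration of two of the feature extraction methods named above, the following sketch computes word N-grams and plain TF-IDF weights. It assumes the textbook definition (term frequency multiplied by the logarithm of the inverse document frequency); libraries often apply smoothed or normalized variants, so the exact values will differ. The three tokenized documents are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Compute one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized tweets.
docs = [
    ["join", "the", "cause"],
    ["the", "attack", "news"],
    ["the", "news", "today"],
]

weights = tf_idf(docs)
print(ngrams(docs[0], 2))  # ['join the', 'the cause']
print(weights[0])
```

A term that occurs in every document, such as "the" here, receives a weight of zero, which is why TF-IDF tends to surface the terms that distinguish one tweet from the rest.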

RO3: Present current topics and contributions from machine-learning techniques to extremism research

Machine-learning-based extremism research has increased over the years, especially after the 2015 Paris attack, as extremists used Twitter to communicate their agenda. Since then, counter-terrorism has become a popular topic among researchers, who have proposed more ideas for preventing radicalization or identifying communities in order to avoid future terrorist attacks. Most existing machine-learning approaches rely on basic feature extraction techniques such as TF-IDF, Bag of Words, N-grams and PoS tagging, and classify polarity using SVM, Naïve Bayes, or Logistic Regression classifiers. In recent years, deep learning methods have been gaining popularity at a fast pace, as researchers use models such as BERT, LSTM and Convolutional Neural Networks together with effective feature representations such as Word2Vec and other word embeddings to classify extremist content.

RO4: Discuss the data validation steps required for sentiment analysis in extremism research

Most existing surveys of extremism research fail to discuss the importance of data validation when working on text mining problems (Ferrara et al. 2016; Tang et al. 2015). During the data collection stage, the researcher should check data quality to avoid including irrelevant accounts and excluding extremist accounts; this can be addressed by collecting the data with appropriate keywords. In the next stage, context identification is critical, as it is the most challenging part of the research. Researchers should choose the right extremism context, such as behavior, religion or psychology, as the core of their work. Finally, the results and performance metrics obtained should be evaluated to identify data imbalance or micro/macro differences in accuracy, precision, recall and F-value.
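The difference between micro- and macro-averaged scores mentioned above is easiest to see on an imbalanced example. The sketch below is a plain-Python illustration with hypothetical labels; it assumes single-label classification and shows the F1 score only, although precision and recall can be averaged in the same two ways.

```python
def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth]["tp"] += 1
        else:
            counts[pred]["fp"] += 1
            counts[truth]["fn"] += 1
    return counts


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is empty."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, so every class counts equally."""
    counts = per_class_counts(y_true, y_pred)
    if not counts:
        return 0.0
    return sum(f1(**c) for c in counts.values()) / len(counts)


def micro_f1(y_true, y_pred):
    """Pool the counts over all classes, so every tweet counts equally."""
    counts = per_class_counts(y_true, y_pred)
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return f1(tp, fp, fn)


# Hypothetical imbalanced test set: the classifier never predicts "extremist".
y_true = ["neutral"] * 8 + ["extremist"] * 2
y_pred = ["neutral"] * 10

print(round(micro_f1(y_true, y_pred), 3))  # 0.8
print(round(macro_f1(y_true, y_pred), 3))  # 0.444
```

The micro average looks acceptable because the majority class dominates, whereas the macro average exposes that the minority class is never detected. A large gap between the two is therefore a useful signal of data imbalance.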

RO5: Throw light on future directions and challenges of the domain based on this study

Preparing a dataset is a critical step in sentiment analysis (Fadel and Cemil 2020), and obtaining a dataset of extremism-related content is especially difficult. The availability of data sources will remain one of the more challenging obstacles to confronting online extremism (Adek and Ula 2021). Moreover, it has been observed that most researchers (Mansour 2018) collected their datasets on their own (Smedt et al. 2018) and presented results based on those datasets. Some of the studied works rely on event-based (Kharde and Sonawane 2016; Ngoge 2016; Harb et al. 2019; Simon et al. 2014) dataset collection (Rehman et al. 2021; Alhalabi et al. 2021; Zinovyeva et al. 2020; Berger and Morgan 2015; Jaki and Smedt 2019; Berger 2016). Hence, if an opportunity arises in future to prepare and share full datasets with other researchers while following proper protocols and ethics, it will open a new dimension for researchers to delve into this area and produce quality outcomes.

Another issue is that work on analyzing online extremism relies mostly on pre-existing feature selection methods rather than on insight extraction. For effective insight extraction, researchers should be aware of the terrorism context, such as psychology, ideology, and belief, before developing models (Gaikwad et al. 2021; Lara Cabrera et al. 2019). At the same time, feature selection is also critical, because one of the main drawbacks of machine-learning algorithms is that the efficacy of the approach depends on the extracted features. Hence, there should be a balance between appropriate feature selection and insight extraction, instead of always choosing the same TF-IDF, N-gram or BoW techniques. In addition, 95% of the surveyed research work did not address the problem of negation detection. Most of the existing analyses are based on SVM, Naïve Bayes, Bayesian Networks and Logistic Regression, and in all of them negation detection remains underexplored (Omer 2015; Berger 2018; Zucco et al. 2019). An interesting direction would therefore be to study the efficiency of neural network algorithms at negation handling in Twitter Sentiment Analysis (Esraa Najjar and Salam Al Augby 2021; Zinovyeva et al. 2020).
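To show what negation detection involves, the following sketch applies a simple fixed-window rule to a lexicon-based score: the polarity of the words that follow a negator is flipped. This is one basic heuristic among many and is not drawn from any surveyed work; the negator list, the toy lexicon, and the window size of three tokens are all assumptions.

```python
import re

# Assumed negation cues and toy polarity lexicon (illustrative values).
NEGATORS = {"not", "no", "never", "n't", "cannot"}
TOY_LEXICON = {"support": 1.0, "good": 1.0, "condemn": -1.0, "bad": -1.0}


def tokenize(text):
    """Lower-case, split contractions such as "don't" into "do" + "n't"."""
    text = re.sub(r"n't\b", " n't", text.lower())
    return re.findall(r"n't|[a-z]+", text)


def score_with_negation(text, window=3):
    """Sum lexicon scores, flipping the sign of words inside a negation scope.

    The scope covers the `window` tokens after a negator. Punctuation does
    not end the scope, which is a deliberate simplification.
    """
    score = 0.0
    remaining = 0  # tokens still inside the current negation scope
    for token in tokenize(text):
        if token in NEGATORS:
            remaining = window
            continue
        value = TOY_LEXICON.get(token, 0.0)
        if remaining > 0:
            value = -value
            remaining -= 1
        score += value
    return score


for tweet in ["I support this", "I don't support this", "We never condemn them"]:
    print(tweet, "->", score_with_negation(tweet))
```

Without the negation rule, the first two tweets would receive the same positive score, which illustrates how ignoring negation can invert the predicted polarity.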

Twitter supports 34 different languages and allows users to use more than one language in the same tweet. The presence of multiple languages in a tweet makes sentiment analysis very challenging. In Twitter terrorism research in particular, only a few researchers have addressed multilingual sentiment analysis of tweets. The use of new approaches such as word embeddings (especially those that recognize word variations) could be the right direction to follow here, together with the creation of specific lexicons for different types of extremism. One more challenge lies in analyzing polarity based on correlations among factors such as geographical location, gender, and age.

The extension of the tweet length limit from 140 to 280 characters opens new possibilities in sentiment analysis by providing more data to analyze. However, this extension may also lead users to use more informal language, such as emoticons and slang. In some cases, longer tweets also mean that different topics are discussed in the same tweet, which poses the new challenge of isolating these topics.

Limitations of the study

This comprehensive review of extremist content on Twitter may have investigated and discussed far fewer papers than an average survey paper in other domains. This is primarily because, after the Paris attack in 2015, Twitter suspended a huge number of extremist accounts on its platform. Another factor that limited the available material is that the scope of this study is restricted to the analysis of Jihadism-related content. In summary, only 32 studies were available for this review, with data ranging from 2014 to 2019.

Conclusion

In this digital age, social networks have an inevitable presence in our daily lives, and it is difficult for people to do without them. Twitter is an extremely popular platform among all kinds of organizations, including terrorist organizations, for reaching the masses with their message. This study was motivated by the continuing increase in online activity by terrorist organizations on Twitter, where automated techniques to predict such terrorism-related activities are lacking. The survey looked at different sentiment classification techniques and algorithms that have been tested by various researchers working with different datasets. This work contributes to a better understanding of Twitter Sentiment Analysis using lexicon-based methods, such as dictionary-based and corpus-based approaches, and machine-learning-based methods, such as SVM, Bayesian Networks, Maximum Entropy, Naive Bayes, and Neural Networks. Based on the above analysis, machine-learning-based approaches were the methods most commonly used by researchers. Although Support Vector Machine and Naïve Bayes were the two most frequently adopted methods, the highest accuracy was achieved by the AdaBoost classifier. Thus, this survey provides a comprehensive overview of the existing SAT methods and highlights promising future research directions in confronting cyber terrorism.