1 Introduction

Social media websites have become platforms and forums where users express emotions and opinions on diverse subjects such as politics, events, individuals, products, dialogue systems, review ranking, and summarization (Bharti et al. 2016). They have also become popular platforms for global interaction and discussion of ideas among users. Many firms have realized the need to analyze social media data in order to gauge customers’ emotions regarding their products, which in turn helps them improve product quality. Subjective and emotional language often requires a specific context in order to comprehend what the user is discussing. The Cambridge English dictionary defines sarcasm as ‘the use of remarks that clearly mean the opposite of what one says, made in order to hurt someone’s feelings or to criticize something in a humorous way’ (Dictionary 2008). Similarly, the Macmillan English dictionary defines sarcasm as ‘the use of remarks in saying or writing the reverse of one’s motive in order to hurt someone’s perception’ (Dictionary and Rundell 2007). Moreover, sarcasm is a form of figurative language often used in spoken and written communication, including microblogs such as Twitter. In sarcastic text, a negative emotion is communicated using positive terms. Sarcasm takes several forms, such as verbal and written sarcasm. Verbal sarcasm, which usually occurs in speech, is also referred to as spoken sarcasm. It is marked by features such as pitch level and variation, speech time and tempo, and acoustic features (intensity, volume, and frequency), as well as by tone and by gestures such as eye and hand movements. In contrast, written sarcasm occurs in media such as official letters, emails, social media, and product reviews.
On the one hand, when sarcasm is used in communication, it is hard to identify efficiently with data mining approaches because its implicit and explicit meanings in a sentence differ (Yee Liau and Pei Tan 2014). On the other hand, when a sarcastic utterance is expressed in text, it is difficult even for an ordinary reader to identify because tone and gesture are absent from textual data (Bharti et al. 2016). Therefore, an efficient natural language processing (NLP) method that classifies sentences possessing sarcastic attributes and properties is required to identify sarcasm (Yavanoglu et al. 2018).

Various authors have defined sarcasm in terms of NLP approaches. For instance, Yavanoglu et al. (2018) defined sarcasm identification as the activity of using NLP techniques to classify a word or sentence sequence that possesses sarcastic attributes and properties. They also referred to it as a system that learns to distinguish between normal and sarcastic sentences at the semantic level. The main objective of sarcasm identification in a sentence is sentiment classification. Machine-learning models are therefore often employed for sarcasm identification, because they are robust and can adapt to the datasets and specifications at hand. Sarcasm identification has played a critical role in several areas. For instance, sarcasm identification experiments enhance research on sentiment analysis, where emotion features serve as a bedrock for sentiment polarity identification and opinion mining classification. In addition, sarcasm identification enables companies to analyse the feelings of customers regarding their products, which could help improve product quality (Saha et al. 2017). It also helps to reduce the miscategorization of consumers’ opinions towards issues, products, and services (Mukherjee and Bala 2017b). Moreover, sarcasm identification is useful in dialogue systems, review ranking, and summarization in human–computer interaction application domains (Davidov et al. 2010).

Lately, a few review and survey studies have been published on sarcasm identification in social media (Wicana et al. 2017; Yavanoglu et al. 2018). For instance, Wicana et al. (2017) presented a machine learning-based review of sarcasm identification, explaining the classification algorithms currently most used for the task, such as support vector machine, maximum entropy, winnow, neural network, semantic, and statistical approaches, among others. Similarly, Yavanoglu et al. (2018) presented a technical review of sarcasm detection algorithms and explained the algorithms currently most used for sarcasm identification. In addition, Joshi et al. (2017) carried out an in-depth survey of automatic sarcasm identification and compared past studies in terms of their approaches, features, classification algorithms, and performance parameters, which is useful for understanding the latest trends in identifying sarcasm. According to their study, three milestones have marked the history of sarcasm identification, namely semi-supervised pattern extraction, supervised learning using hashtags, and the use of context beyond the target text.

However, the reviews mentioned above have inherent limitations. Firstly, they do not provide a comprehensive review of the datasets employed for sarcasm identification. Secondly, none of them provides an extensive and in-depth review of recent pre-processing techniques for sarcasm identification, although pre-processing is a key step in any classification problem. Furthermore, none of them provides a comprehensive review of feature selection and representation schemes for sarcasm identification. A review of performance parameters for sarcasm identification was also omitted. These limitations motivated the authors to undertake a thorough review and study of sarcasm identification approaches using textual data. Moreover, a search of well-known databases such as Scopus, IEEE Xplore, Web of Science, Association for Computing Machinery, Google Scholar, ScienceDirect and Springer found no comprehensive review of sarcasm identification. Hence, there is a need for a systematic study to establish the present state of the art of sarcasm identification in social media.

Consequently, the aim of this review is to present an extensive review and analysis of articles on sarcasm identification published from 2008 to 2019, critically examining sarcasm identification from the following perspectives: the datasets, feature usage, feature engineering techniques (feature extraction, feature selection, and feature representation techniques), classification algorithms, and performance parameters. Forty (40) academic articles were selected after an in-depth search of the six familiar academic databases to carry out the review. The purpose of this review is to help scholars carry out research in the area of sarcasm identification by answering the research questions listed below:

Research Question 1 Are there annotated sarcastic datasets publicly available in the area of sarcasm identification using text classification methods?

Research Question 2 Which features do researchers find most useful for sarcasm identification and why?

Research Question 3 Which feature extraction techniques are most often employed in sarcasm classification methods and why?

Research Question 4 Which feature representation technique is most often applied in sarcasm classification methods and why?

Research Question 5 Which feature selection techniques do researchers in sarcasm classification commonly embrace?

Research Question 6 Which text classification algorithm produces better accuracy and why?

Research Question 7 Which performance measures are most widely used to measure the performance of the classifiers in sarcasm classification?

Research Question 8 What are the directions for future research and challenges in the domain of sarcasm classification?

The major contributions of this systematic literature review to the current body of knowledge in sarcasm identification in the social media are:

  • A comprehensive investigation of characteristics, types, strengths, and weaknesses of datasets for sarcasm identification in the social media textual data.

  • An outline of effective approaches for sarcasm identification, in conjunction with the various feature representation and extraction schemes for efficient algorithm development.

  • A critical analysis of various data preparation (pre-processing) techniques and classification algorithms (classifiers) for sarcasm identification.

  • Identification of recent research challenges and suggestion of open research directions to tackle issues in the sarcasm identification domain.

The remainder of this article is divided into six sections. Section 2 describes the approaches for sarcasm identification such as the lexicon-based, the rule-based, and the machine learning based. Section 3 discusses text classification stages and techniques. Section 4 explains the research methodology for this review. Section 5 provides an extensive review of the selected articles under five different phases such as datasets, feature sets, feature engineering techniques, text classification techniques, and performance measures. Section 6 presents the research challenges and the future research direction on sarcasm identification in the text classification domain. Finally, Sect. 7 provides the conclusion of the study by giving a summary of the review findings. The review structure is shown in Fig. 1 whereas the list of abbreviations used in this review with their full forms is shown in the “Appendix”.

Fig. 1 Review structure

2 Sarcasm identification approaches

Researchers have carried out studies on sarcasm identification in textual data. The approaches for automatic identification of sarcasm found in the literature are lexicon-based (Riloff et al. 2013; Bharti et al. 2015), rule-based NLP (Mukherjee and Bala 2017a), pattern-based (Bouazizi and Ohtsuki 2016), corpus-based (Khodak et al. 2017), statistical-based (Reyes et al. 2013) and machine learning-based (González-Ibánez et al. 2011). Recently, the deep learning approach (Ghosh and Veale 2016; Mehndiratta et al. 2017), which is a new trend, has also gained considerable ground in sarcasm identification, and a few researchers have employed it. These approaches are explained in detail in the subsections below.

2.1 Lexicon-based approach

In the lexicon-based approach, a bag-of-lexicon (comprising unigrams, bigrams, trigrams, etc.) and phrases are used to recognize sarcasm in tweets. For instance, Riloff et al. (2013) utilized a bootstrapping method to construct two bags-of-lexicon that consist of unigram, bigram and trigram phrases. These phrases were employed for sarcasm identification in tweets where a positive sentiment is used in a negative situation. Comparably, four bags-of-lexicon consisting of positive sentiment, negative sentiment, positive situation and negative situation have been developed (Bharti et al. 2015). The authors employed these phrases to recognize the occurrence of sarcasm as negative sentiment in a positive situation and positive sentiment in a negative situation.
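The contrast between a positive sentiment and a negative situation can be illustrated with a minimal Python sketch. The two lexicons below are tiny, hand-written assumptions made for illustration only; they are not the bootstrapped lexicons of Riloff et al. (2013), and the function names are hypothetical.

```python
# Toy lexicons invented for illustration; the cited studies learned theirs
# from large tweet collections.
POSITIVE_SENTIMENT = {"love", "enjoy", "adore", "great"}
NEGATIVE_SITUATION = {"being ignored", "waiting in line", "working on weekends"}


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def is_sarcastic_by_lexicon(text):
    """Flag text that pairs a positive-sentiment term with a negative situation."""
    tokens = text.lower().split()
    candidates = set()
    for n in (1, 2, 3):  # unigrams, bigrams, trigrams
        candidates.update(ngrams(tokens, n))
    has_positive = bool(candidates & POSITIVE_SENTIMENT)
    has_negative_situation = bool(candidates & NEGATIVE_SITUATION)
    return has_positive and has_negative_situation
```

Under these assumed lexicons, "I love being ignored" is flagged, whereas "I love sunny days" is not, because no negative-situation phrase is present.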

2.2 The rule-based approach

In sarcasm identification, the rule-based approach identifies sarcasm by applying a set of specific principles or guidelines. It uses syntactic, semantic and stylistic properties of the sentence, such as phrase patterns and the lexical structure of the sentence, for sarcasm identification in any language. Researchers often employ this approach as a baseline against which to compare the results of the classifier being used. The semantic-based approach, one of the rule-based approaches, places more emphasis on the meaning of a word, its structure, its structural relationships and its contextual usage in the language (Liu 2012). The semantic-based model is the bedrock of the rule-based approach because of its effectiveness. One study that utilized this approach for sarcasm identification was presented by Bharti et al. (2015). The study used a Twitter dataset and feature extraction techniques comprising parsing, parts-of-speech tagging and parse trees to learn the semantic arrangement of a sentence. It employed two algorithms, one to detect contrasting sentiment polarity within a tweet and one to detect tweets that start with an interjection. Their results show that most sarcastic sentences begin with an interjection. Similarly, Riloff et al. (2013) presented a rule-based algorithm that searches for the co-occurrence of a negative situation and a positive verb phrase in a sentence. The study utilized a well-structured iterative algorithm to extract negative situation phrases and carried out the experimental analysis with various sets of rules.
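A single rule of this kind, a tweet that begins with an interjection, might be sketched as follows. This is a deliberate simplification: the interjection list is a hypothetical stand-in for the parts-of-speech tagging used in the cited study, and the function name is an assumption.

```python
import re

# Hypothetical interjection list for illustration; a real system would rely
# on a POS tagger to detect interjections.
INTERJECTIONS = {"wow", "oh", "yay", "aha", "ugh", "yeah", "hurray"}


def starts_with_interjection(tweet):
    """Return True if the first word of the tweet is a listed interjection."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return bool(words) and words[0] in INTERJECTIONS
```

Such a rule would normally be one of several, each contributing evidence rather than deciding the label on its own.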

2.3 Machine learning approach

This approach is one of the most widely applied approaches for sarcasm identification. This is because of its stability and its ability to adapt to a dataset and a given specification. Machine learning approaches deal with the creation of a prediction model using an intelligent method. The effect of pragmatic and lexical aspects in machine learning algorithms was studied by González-Ibánez et al. (2011). The machine learning approach can be further categorized into unsupervised learning, supervised learning, semi-supervised learning, structural and hybrid learning. A brief explanation of these approaches is given below.

2.3.1 Supervised learning

Among the machine learning algorithms, supervised learning is the most used in sarcasm detection because of its ability to build a model by taking a labelled dataset as input (Mohri et al. 2012) and producing labelled output, which helps in the construction of a decent model. This is possible because the training datasets already provide the result that is to be produced by the model. Supervised learning algorithms (like NB, DT, and LR) serve as the bedrock for other learning algorithms with similar precepts (Yavanoglu et al. 2018). Machine learning algorithms (such as SVM and LR), in addition to sequential minimal optimization (SMO), were also employed to differentiate sarcasm from the polarity sentiment occurring in Twitter messages (González-Ibánez et al. 2011).

Furthermore, the popularity of deep learning architectures has created an opportunity for researchers in this domain to study the automatic identification of sarcasm. This form of learning is a subset of machine learning that employs neural networks to learn automatically from large datasets (Nweke et al. 2018). A neural network is a learning algorithm that processes features in a manner similar to the functioning of the nervous system in the human brain. In the neural network, each unit has connections to many other units and can possess a summation function that combines all its input values. The neural network uses real-number values between 0 and 1 in terms of core and axon. Recently, Ghosh and Veale (2016) employed a deep neural network model to identify sarcasm in Twitter datasets. In their work, they combined a convolutional neural network, a long short-term memory (LSTM) network, and a recursive support vector machine, and achieved an F-score of 92%, an impressive improvement over the baseline sarcasm detection method (Schifanella et al. 2016). Similarly, Joshi et al. (2016) used features based on word embedding similarity for sarcasm identification. The features used in their study were augmented with the most congruent and incongruent word pairs, which improved performance.

2.3.2 Semi-supervised learning

This form of machine learning is a mixture of supervised and unsupervised learning that uses a small quantity of annotated data and a vast quantity of unannotated data (Tsur et al. 2010). The availability and use of unlabelled data is the feature that differentiates semi-supervised learning from supervised learning. This type of learning approach was employed by Davidov et al. (2010) for automatic sarcasm identification using Amazon product review datasets. In their study, a total of 66,000 product and book reviews were collected, and both syntactic and pattern-based features were extracted. A sentiment polarity score of 1 to 5 was assigned to each training example in the training phase. The authors reported a promising performance of 77% precision and 83.1% recall in the evaluation phase.

2.3.3 Hybrid learning based

This approach is a mixture of two or more classifiers to form a new one. In other words, it refers to an ensemble classifier. A study that employed the approach is the learning of user-specific context presented by Amir et al. (2016), which uses a convolutional network to learn user embedding features in conjunction with utterance-based embedding features. The resultant features formed a hybrid content and user embedding convolutional neural network (CUE-CNN) model for sarcasm detection, and the study reported a performance increment of 2% over single machine learning approaches for sarcasm identification.

3 Supervised text classification process for sarcasm identification

According to Nithya et al. (2012), supervised text classification makes use of labelled training datasets of text to learn and build a text classifier that can then be used to automatically classify unlabelled test sets. Human observers are still often used to perform text categorization; however, this is inadequate given the huge number of files, email messages and web addresses that are saved every day. Moreover, manual categorization is usually slow and costly to maintain. Inconsistency is another limitation inherent in manual categorization. These limitations have shifted text classification from a manual to an automated basis. Several techniques exist in automated text classification, such as supervised, semi-supervised and unsupervised text classification. However, the supervised approach is the most widely used, as it can build a model using labelled data as input (Mujtaba et al. 2017; Yavanoglu et al. 2018). The supervised text classification experimental process consists of the main steps explained in the subsequent sections.

3.1 Data collection

The data collection phase comes first in any text classification process. The dataset collected depends on the domain the study is considering. For example, when a study seeks to detect sarcasm on Twitter, Twitter data is collected. When a study seeks to analyse disaster response and recovery through sentiment analysis, disaster-related data is collected from social media. In any case, once the raw data is collected, the next phase is to pre-process the data before the actual analysis can be carried out on the dataset.

3.2 Data pre-processing

Raw data collected during the data collection phase contains a lot of noisy information and requires cleaning. The purpose of cleaning is to eliminate the noise from the data before knowledge or features can be extracted from it. In addition, duplicate data are removed during the pre-processing stage, especially in social media data (Eke et al. 2019). Data pre-processing is also referred to as the data preparation phase, in which the training and testing datasets are prepared. In the training sets, Twitter data are labelled as either sarcastic or non-sarcastic and are used to train the model, whereas the labels of the testing datasets are withheld from the model, since these datasets are used mainly for model evaluation. The pre-processing stage mainly seeks to remove unnecessary characters or sequences that have no value for sentiment classification. In this phase, the collected data first undergo tokenization and automatic filtering, which removes retweets, duplicates, stop words, punctuation, numerals, tweets written in other languages and tweets containing only a URL. At the end of the filtering stage, parts-of-speech (POS) tagging and stemming are applied to the remaining tweets in order to convert each word to its root form.
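The filtering steps described above can be sketched in a few lines of Python using only the standard library. The stop-word list, the regular expressions and the function names are illustrative assumptions; language detection, POS tagging and stemming are omitted because they require external resources.

```python
import re

# Small illustrative stop-word list; real pipelines use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "it"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(tweet):
    """Lower-case a tweet and return its tokens without URLs, mentions,
    punctuation, numerals and stop words."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)  # drops punctuation, digits and '#'
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def preprocess(tweets):
    """Drop retweets, URL-only tweets and duplicates; return token lists."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        if tweet.lower().startswith("rt "):  # retweet marker
            continue
        tokens = clean_tweet(tweet)
        key = tuple(tokens)
        if not tokens or key in seen:  # empty after cleaning, or duplicate
            continue
        seen.add(key)
        cleaned.append(tokens)
    return cleaned
```

Duplicates are detected after cleaning, so two tweets that differ only in punctuation or case are treated as the same tweet; whether that is desirable depends on the study.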

3.3 Feature extraction

Feature extraction is the third stage in the supervised learning approach to the text classification task. It is a technique used to reduce the resources required to describe the dataset by transforming the input data into a set of features. The features include linguistic, pragmatic, emotional, psychological and hyperbolic features, among others. Section 5.3 provides more explanation of these features. The most commonly used feature engineering techniques are the bag-of-words and N-gram techniques. The bag-of-words model is a text representation technique that uses the frequency of each word as a feature for classification. It has been widely used for years for document representation in information retrieval and as a tool for feature generation (Salton and McGill 1986; Yavanoglu et al. 2018). In the N-gram technique, n stands for the number of consecutive words that make up a feature. For example, when n is 1 the feature is called a unigram, when n is 2 a bigram, when n is 3 a trigram, and so on. Simplicity and scalability are among the reasons for choosing this technique over the bag-of-words model (Yavanoglu et al. 2018).
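As an illustration of the two techniques, the following minimal sketch extracts word n-grams and bag-of-words counts from a tokenized sentence. The function names are hypothetical and the input is assumed to be already pre-processed.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    """Map each word to its frequency in the token list."""
    return Counter(tokens)


tokens = "i just love monday mornings".split()
unigrams = word_ngrams(tokens, 1)   # n = 1
bigrams = word_ngrams(tokens, 2)    # n = 2
trigrams = word_ngrams(tokens, 3)   # n = 3
```

A sentence of five tokens yields five unigrams, four bigrams and three trigrams, so the feature space grows quickly as n increases.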

3.4 Feature selection

The full feature set extracted from the datasets contains irrelevant features that may degrade the prediction result during the classification stage. For instance, the drawbacks that irrelevant features cause during text classification include reduced accuracy, problems in generating a result, a slower classification process and difficulty in storing and retrieving information. Hence, a feature selection technique is needed to choose the most discriminant feature subset from the extracted feature set for better prediction (Guyon and Elisseeff 2003). This requires a thorough understanding of which aspects of the datasets are relevant to the prediction to be carried out. Feature selection techniques can be sub-divided into wrapper, filter-based and embedded techniques (Guyon and Elisseeff 2003). Among these three categories, the filter-based technique is widely employed (Yang and Pedersen 1997). It uses statistical measures to assign a score to each feature, and features are selected or rejected according to their score. Chi-square (χ2) and information gain (IG) are common examples of filter-based feature selection techniques. The wrapper-based approach, in contrast, searches over different feature combinations and evaluates each combination by the performance of a model trained on it, whereas the embedded method learns which features are essential in the course of building the model.
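A filter-based score can be illustrated with the chi-square statistic computed from a 2 × 2 contingency table of term presence against class membership, in the form commonly used for text categorization. The sketch below is a minimal illustration; the function names and the toy documents in the usage note are assumptions.

```python
def chi_square(docs, labels, term, target_label):
    """Chi-square statistic between a term's presence and a class label.

    docs is a list of token collections; labels holds one label per document.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        in_class = label == target_label
        if present and in_class:
            a += 1          # term present, document in class
        elif present:
            b += 1          # term present, document not in class
        elif in_class:
            c += 1          # term absent, document in class
        else:
            d += 1          # term absent, document not in class
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # term or class never (or always) occurs: no evidence
    return n * (a * d - c * b) ** 2 / denominator


def select_top_k(docs, labels, target_label, k):
    """Keep the k terms with the highest chi-square score (ties broken alphabetically)."""
    vocabulary = sorted({tok for tokens in docs for tok in tokens})
    scored = [(chi_square(docs, labels, t, target_label), t) for t in vocabulary]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]
```

For four toy documents in which "love" occurs only in the two sarcastic ones, the statistic for "love" is 4.0, whereas a term that occurs in only one of them scores about 1.33 and would be discarded first.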

3.5 Feature representation

In text classification, the extracted features are converted into numerical values during the feature representation step (Salton and Buckley 1988). Feature representation techniques are categorized into term frequency (TF), binary representation (BR), and term frequency with inverse document frequency (TF-IDF) (Debole and Sebastiani 2004). In the TF representation, the value of a feature is the total number of occurrences of the feature in the document (Ramos 2003). In the BR technique, the feature value is 0 or 1, where 1 indicates the presence of the feature in the document and 0 signifies its absence (Salton and Buckley 1988). In the TF-IDF representation, the frequency of a term in a particular document is weighted by the inverse of the proportion of documents in the whole collection that contain the term. It is effective in matching a word in a query to documents that are relevant to the query (Ramos 2003).
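The three representations can be compared on a toy corpus with the following sketch. The inverse document frequency is taken here as log(N / df), which is one common variant among several and an assumption of this example; the function names are hypothetical.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF: raw count of each term in one document."""
    return dict(Counter(tokens))


def binary_representation(tokens, vocabulary):
    """BR: 1 if the term occurs in the document, 0 otherwise."""
    present = set(tokens)
    return {term: int(term in present) for term in vocabulary}


def tf_idf(corpus):
    """TF-IDF per document, with idf = log(N / df) (one common variant)."""
    n_docs = len(corpus)
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        weights.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weights


corpus = [["great", "great", "day"], ["bad", "day"]]
weights = tf_idf(corpus)
```

In this corpus the term "day" occurs in every document, so its TF-IDF weight is zero, while "great" keeps a positive weight because it is specific to the first document.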

3.6 Classifier construction

In this phase, the classification model is built from the training datasets using a machine learning algorithm. The resulting model can classify unlabelled data as sarcastic or non-sarcastic. Several algorithms have been implemented for the purpose of sarcasm identification. Some of the algorithms used in the selected studies are Naïve Bayes (NB), support vector machine (SVM), decision tree (DT), random forest (RF), logistic regression (LR) and artificial neural network (ANN) (Yang 1999). These algorithms are described in the subsections below.

3.6.1 Naïve Bayes

Naïve Bayes (NB) is a classification algorithm that uses a probabilistic model of how data is generated within a given class. It is a machine learning algorithm that performs a statistical analysis of numerical data (Sahami et al. 1998). It uses a labelled set of data as input to estimate the parameters of the generative model. It is one of the simplest learning classifiers and assumes that all features are independent of each other given the class (McCallum and Nigam 1998). Moreover, NB is one of the fastest classifiers and performs well when bag-of-words techniques are used for text representation (Rennie et al. 2003).
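The independence assumption can be made concrete with a small multinomial Naïve Bayes sketch over bag-of-words counts. Add-one (Laplace) smoothing is an assumption of this example, as are the class name and the toy training data in the usage note.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {t for counts in self.word_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior of the class
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                # Each word contributes independently given the class;
                # add-one smoothing avoids zero probabilities.
                score += math.log(
                    (counts[token] + 1) / (total_words + len(self.vocabulary))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on two toy sarcastic documents ("love waiting", "love delays") and two literal ones ("nice weather", "good weather"), the model assigns "love delays" to the sarcastic class and "good weather" to the literal class.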

3.6.2 Decision tree

Decision tree (DT) is a core algorithm employed in data mining for classification as well as for prediction. It is an inductive, instance-based learning algorithm that infers classification rules, represented as a decision tree, from a set of unordered and irregular instances (Dai et al. 2016). The tree consists of leaf nodes, paths, decision nodes and edges (Quinlan 1990). A DT classifier is represented as a flow-chart-like tree structure in which each internal node represents a test on an attribute, each branch denotes a test result and each leaf node denotes a class. Thus, the whole tree corresponds to a set of disjunctive rules (van der Aalst 2001). A trained DT classifies an instance according to the values of its attributes. Over-fitting is one of the limitations inherent in a decision tree classifier. This is due to its capability of fitting every category of data along with the noise, which can strongly influence its performance. This problem can be overcome by employing a multiple-classifier model such as the random forest, in which different trees are built and trained on different subsets of the training set, and their predictions are combined.

3.6.3 Random forest

Random forest (RF) is an ensemble classifier that uses sub-training sets to build multiple decision tree classifiers. Each tree in the forest classifies the input vector, and the class predicted by the most trees is selected. Random forest mitigates the over-fitting problem and produces better predictions than a single decision tree (Liaw and Wiener 2002; Fernández-Delgado et al. 2014).
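The two ingredients that distinguish the forest from a single tree, sub-training sets drawn by sampling and a vote over the trees, can be sketched as below. Tree induction itself is omitted, and the helper names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement: one sub-training set per tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)                    # fixed seed for reproducibility
rows = ["t1", "t2", "t3", "t4", "t5"]     # placeholder training instances
sample = bootstrap_sample(rows, rng)
vote = majority_vote(["sarcastic", "literal", "sarcastic"])  # three tree outputs
```

Because each tree sees a different sample, the noise fitted by one tree tends to be outvoted by the others.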

3.6.4 Support vector machine

Support vector machine (SVM) is a supervised learning algorithm that builds a classification model using statistical learning theory. The classification task requires the separation of the data into a training set and a test set. The algorithm uses the training set to build a model that predicts the target value of the test data given only the test data attributes (Hsu et al. 2003). In a support vector machine, a hyperplane, determined by the training points closest to it (the support vectors), is used to separate the data points of the two classes by maximizing the margin between them (Cristianini and Shawe-Taylor 2000). Many applications, such as sarcasm detection, image classification, and bioinformatics, have been successfully carried out using the SVM classifier (Fernández-Delgado et al. 2014).
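A linear SVM can be approximated with sub-gradient descent on the regularized hinge loss, as in the sketch below. This is a simplified illustration rather than the quadratic-programming or SMO solvers used in practice; the function names, hyper-parameter values and toy data are assumptions.

```python
def train_linear_svm(points, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss.

    labels must be +1 or -1. Points inside the margin (y * score < 1)
    pull the hyperplane towards classifying them with a larger margin.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:  # margin violated: hinge loss is active
                w = [wi - learning_rate * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:              # only the regularizer shrinks the weights
                w = [wi - learning_rate * reg * wi for wi in w]
    return w, b


def svm_predict(w, b, x):
    """Classify x by the side of the hyperplane it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


points = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(points, labels)
```

On this linearly separable toy set the learned hyperplane separates the two classes, and the regularization term keeps the weights small, which corresponds to a wide margin.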

3.6.5 Maximum entropy classifier

The maximum entropy classifier chooses, from all the models that fit the training data, the one with the highest entropy. This model does not presume that the features are conditionally independent and, as such, is less restrictive than other classifiers. Training a maximum entropy classifier requires solving an optimization problem in order to calculate the parameters of the model. Consequently, it requires more training time than other classifiers such as the NB classifier (Mukherjee and Bala 2017a).

3.6.6 Artificial neural network

A neural network is a learning algorithm whose operation resembles the functioning of the nervous system in the human brain. An artificial neural network comprises three distinct layers: the input, hidden and output layers. While the input and hidden layers consist of numerous nodes, the output layer is made up of just one node. In the neural network, each unit has connections to many other units and can possess a summation function that combines all its input values. The hidden layer is designed for input processing and connects to the output layer, which produces the output values. The neural network uses real-number values between 0 and 1 in terms of core and axon (Yavanoglu et al. 2018). According to Yao (1999), learning in artificial neural networks is categorized into unsupervised, supervised and reinforcement learning. The unsupervised approach centres on the relationships that exist among the input data; no “correct output” information is available for learning. In the supervised approach, learning is based on a comparison between the actual output and the target output of the artificial neural network, in order to reduce the error between them. To do so, gradient descent-based optimization such as back propagation is employed to iteratively adjust the connection weights in order to reduce the error. Reinforcement learning, on the other hand, is a special case of the supervised approach that provides information only on whether the actual output is correct; the precise desired output is not known. In an artificial neural network, a learning rule is used to modify the weights for each input pattern, and the most commonly used rule is the Delta rule (He and Xu 2010).
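The Delta rule mentioned above can be illustrated on a single linear unit: after each input pattern, every weight is moved in proportion to the error between the target and the actual output. The sketch below assumes a linear activation and a fixed learning rate; the function name and toy data are hypothetical.

```python
def train_delta_rule(inputs, targets, learning_rate=0.1, epochs=100):
    """Train one linear unit with the Delta rule:
    w <- w + learning_rate * (target - output) * x
    """
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(inputs, targets):
            output = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = target - output  # supervised signal: target vs. actual output
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Toy task: the target equals the single input feature.
weights, bias = train_delta_rule([[0.0], [1.0]], [0.0, 1.0], epochs=500)
```

Back propagation extends the same idea to hidden layers by propagating the output error backwards through the network.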

3.7 Classification evaluation

In the evaluation phase, the classifier built from the training datasets predicts the class (sarcastic or non-sarcastic) of unlabelled text. The classifier accuracy can be estimated by evaluating:

  • The instances correctly assigned to the class [true positive (TruPos)].

  • The instances correctly predicted as not belonging to the class [true negative (TruNeg)].

  • The instances that were either incorrectly assigned to the class [false positive (FalsPos)] or incorrectly predicted as not belonging to the class [false negative (FalsNeg)].

These four quantities make up the confusion matrix for binary classification, as indicated in Table 1. Various performance parameters have been employed to evaluate model performance. The most commonly used measures for text classification are accuracy, F-measure, precision and recall. They are briefly described below.

    Table 1 Confusion matrix

3.7.1 Accuracy (Acc)

Accuracy is the proportion of all instances that are correctly classified, and thus measures overall classification correctness. It is defined as

$$ Acc = \frac{TruPos + TruNeg}{TruPos + TruNeg + FalsPos + FalsNeg}. $$
(1)

3.7.2 Precision (Pre)

Precision is the ratio of true positives to all instances predicted as positive.

$$ Pre = \frac{TruPos}{TruPos + FalsPos}. $$
(2)

3.7.3 Recall (Rec)

Recall is the proportion of actual positives that are predicted positive, that is, the ratio of true positives to all actual positive instances.

$$ Rec = \frac{TruPos}{TruPos + FalsNeg}. $$
(3)

3.7.4 F-measure (F-m)

F-measure is the harmonic mean of precision and recall; it is particularly useful when the numbers of false positives and false negatives differ markedly. The standard F-measure is F1, which gives precision and recall equal importance.

$$ F\text{-}m = 2 \times \frac{Pre \times Rec}{Pre + Rec} .$$
(4)
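As a worked illustration of the four measures, the following sketch computes them from the cells of a binary confusion matrix, taking the accuracy denominator as the sum of all four cells. The function name and the example counts are hypothetical and chosen only for illustration.

```python
def classification_metrics(tru_pos, tru_neg, fals_pos, fals_neg):
    """Compute accuracy, precision, recall and F-measure from confusion-matrix counts."""
    total = tru_pos + tru_neg + fals_pos + fals_neg
    acc = (tru_pos + tru_neg) / total if total else 0.0
    # Guard against empty denominators (no predicted or no actual positives).
    predicted_pos = tru_pos + fals_pos
    actual_pos = tru_pos + fals_neg
    pre = tru_pos / predicted_pos if predicted_pos else 0.0
    rec = tru_pos / actual_pos if actual_pos else 0.0
    f_m = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return {"Acc": acc, "Pre": pre, "Rec": rec, "F-m": f_m}


# Hypothetical counts: 40 TruPos, 45 TruNeg, 5 FalsPos, 10 FalsNeg.
metrics = classification_metrics(tru_pos=40, tru_neg=45, fals_pos=5, fals_neg=10)
```

With these counts, accuracy is 85/100, precision is 40/45, recall is 40/50, and the F-measure equals 80/95.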

4 Research methodology

The aim of this study is to conduct a review of sarcasm identification and classification in textual data. This section provides the research methodology adopted for the study. The study adopted the systematic literature review (SLR) guideline established by Kitchenham et al. (2009) for the computer technology field. The guideline consists of four major phases, namely planning of the study, primary study search and selection, data acquisition, and data analysis. Initially, the planning phase establishes the problem statement and formulates the objectives, research questions and review protocol for the study (covered in Sect. 1). The search process consists of the search strategy, the search query, the selection criteria, and the search keywords applied to the screened studies (to be explained in Sect. 4.2). The data collection phase applies the data extraction strategy to the retrieved studies, as explained in Sect. 4.3. Finally, the data analysis stage involves data synthesis and extensive analysis of the selected studies, as explained in Sect. 5. The process of the review methodology is presented in Fig. 2.

Fig. 2 Methodology process

4.1 Search strategy for the study identification

This study carried out an electronic search of seven major academic databases, namely ACM Digital Library, Web of Science, IEEE Xplore, Science Direct, Springer, Google Scholar, and Scopus. The study considers articles published from 2008 to 2019. In the search strategy, suitable keywords were defined to search the chosen databases for literature on “Sarcasm identification on social platform”. The search keywords used are sarcasm identification, sarcasm detection, sarcastic text, sarcastic sentence, sarcasm in microblog, sarcasm, sarcastic, sarcasm in a social platform, sarcasm in social media, and sarcasm in twitter. Synonyms of the formulated keywords were used to create additional search keywords, such as cynicism in a social platform, mockery remark in microblog, and satire utterance in social media. All articles written in English were investigated, irrespective of the language of the data analysed. Screening by article type and language was employed. Finally, an extensive full-text review was carried out on the selected articles to identify suitable studies based on datasets, feature engineering, classification, and evaluation.

4.2 Search result

In this section, the queries based on the search keywords were applied to all seven selected databases to fetch the academic articles. A total of 51,069 articles were gathered. Table 2 shows the detailed search result from each academic database. The duplicate copies obtained across different databases were removed, and only the distinct copies were retained and saved in EndNote.
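The duplicate-removal step can be sketched as follows. This is only an illustrative approach that matches records on a normalized title; the record fields, titles and matching rule are assumptions and do not describe the procedure actually used in the study.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace for matching."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def remove_duplicates(records):
    """Keep the first record seen for each normalized title."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_title(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records retrieved from different databases.
records = [
    {"title": "Sarcasm Detection on Twitter", "source": "Scopus"},
    {"title": "Sarcasm detection on Twitter.", "source": "Google Scholar"},
    {"title": "Irony in Product Reviews", "source": "Springer"},
]
unique_records = remove_duplicates(records)
```

Matching on titles alone can merge distinct papers with identical titles, so in practice a further key such as year or DOI would normally be added.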

Table 2 Search and screening result from the 7 databases

4.3 Screening and selection criteria

After the duplicate studies occurring in more than one database were removed, a total of forty-two (42) articles were further analyzed by reading the title, abstract, and keywords to determine whether they were clearly relevant to the purpose of the systematic review. This process is called screening stage 1. The forty-two (42) articles produced by this stage were then read in full to check whether they met the inclusion criteria. This is called screening stage 2, and it retained thirty-six (36) articles. Lastly, the references of the thirty-six (36) selected articles were scanned to find further related articles that conform to the inclusion criteria. This is called screening stage 3, and it produced four (4) additional articles. Therefore, a total of forty (40) articles were selected from the seven major academic databases for detailed analysis, as indicated in Table 2. The selected articles were extensively reviewed under the following considerations: (1) the dataset for the study, (2) the pre-processing techniques, (3) the feature engineering techniques, (4) the classification techniques, and (5) the performance metrics (Sect. 5 gives the detailed discussion). The selection criteria are shown in Table 3.

Table 3 Selection criteria

The database-wise distribution of the forty (40) selected articles is shown in Fig. 3. Of the 40 articles, 2 were selected from Web of Science, 7 from ACM Digital Library, 8 from IEEE Xplore, 4 from Science Direct, 4 from Springer, 11 from Google Scholar, and 4 from Scopus.

Fig. 3 Distribution of selected articles

Figure 4 represents the selected studies distribution according to the type of article used for the study. The figure shows that 25 articles out of the 40 selected articles are conference proceeding articles, 12 articles are journal articles, and 1 article is a book chapter.

Fig. 4 Article type distribution

The yearly publication count and the yearly citation count of the articles are shown in Fig. 5. In the chart, the vertical axis represents the number of articles published in each year and the citation count obtained by those articles, whereas the horizontal axis represents the year of publication. The largest number of publications among the selected articles was attained in 2013, followed by 2010, 2011, 2015, etc. There is a decreasing trend in publication and citation counts across 2017, 2018 and 2019. The figure also shows that no publication on the targeted topic was identified in 2008, 2009, and 2012.

Fig. 5 Yearly publication and citation count

The country-wise distribution of the selected articles is shown in Fig. 6. The figure shows that the largest number of the selected articles on the topic was published from the USA, followed by India, the Netherlands, Indonesia, Japan, Portugal, China and the UK, the Philippines, Sweden, Australia, Ireland, Tunisia, France, Slovenia, and Vietnam.

Fig. 6 Country-wise distribution of the selected articles

5 Review of sarcasm identification using text classification technique

This section presents a critical review of the selected primary studies with respect to dataset usage, pre-processing techniques, feature engineering techniques, modelling approaches, and performance metrics. The section is divided into subsections. Section 5.1 reviews the various datasets used for sarcasm identification. Section 5.2 presents a review of the pre-processing techniques used for sarcasm detection. Section 5.3 presents a review of the feature engineering techniques used for sarcasm identification. Section 5.4 reviews the different modelling approaches used for sarcasm identification. Lastly, Sect. 5.5 gives a review of the performance metrics used to evaluate classification performance in sarcasm detection.

5.1 Review of datasets for sarcasm identification

The sarcasm identification dataset is an important component of the sarcasm classification task. However, such a dataset is of little value on its own unless features or useful knowledge are extracted from it. Related studies on sarcasm text classification showed that authors collected primary data from social media and employed two main annotation strategies: distant supervision via hashtags (Abercrombie and Hovy 2016) and manual annotation (Riloff et al. 2013). The first stage in a sarcasm identification experiment is the collection of the data to be utilized for building the classification model. The analysis of the selected studies shows that the datasets can be broadly categorized into homogeneous and heterogeneous data. These categories are explained below, while the strengths and weaknesses of deploying the datasets for sarcasm identification are shown in Table 4.

Table 4 Dataset and volume used on the selected studies

5.1.1 Homogeneous data

In homogeneous data, the studies utilized only one type of dataset, mostly from the Twitter platform. For instance, the study on ‘Sentence level sarcasm detection in English and Filipino’ by Samonte et al. (2018) utilized only Twitter datasets. The researchers collected a total of 12,000 tweets, consisting of 6000 Tagalog and 6000 English tweets, on topics such as transportation, government, politics, social media, and weather. In the study, Facepager was utilized to collect the data from Twitter. The Facepager parameters were set as follows: the result type (result_type), which specifies the kind of result preferred by the user (i.e. popular, recent or a mixture of both); the count, which specifies the maximum number of tweets to be retrieved (usually 200 maximum); and the language type, which specifies the language of the returned tweets. The same parameter settings were used for both the English and the Tagalog tweet collections, except for the language specification, in which tl (for Tagalog) was used for the Tagalog dataset. The study indicated that the nature of the datasets (balanced or imbalanced) has a great influence on the accuracy of the model’s sarcasm predictions. In addition, Kumar and Harish (2018) used a content-based feature selection technique to build a classification model for sarcasm identification. The study utilized the Amazon product review datasets created by Filatova (2012) and sourced from the crowdsourcing platform Mechanical Turk. A total of 1254 Amazon product reviews, consisting of 437 sarcastic and 817 non-sarcastic reviews, were used for the classification experiment. The datasets were structured using a star rating (ranging from 1 to 5) and review comments written in English. In another study, Zhang et al. (2016) utilized Twitter datasets for sarcasm identification using a deep neural network. The tweets were obtained using the Twitter streaming API with the sarcasm (#sarcasm) and not (#Not) hashtags as keywords. The study adopted the datasets obtained by Rajadesingan et al. (2015b), in which a total of 9104 tweets annotated by the authors of the tweets were used for the experiment. The tweet IDs provided by Rajadesingan et al. were used to stream the corpus. Similarly, the contextual tweets were obtained by employing the Twitter API for each tweet. However, the #sarcasm and #Not hashtags were removed from the historical tweets to prevent the use of explicit clues for sarcasm prediction. Furthermore, the authors modelled both balanced and imbalanced datasets, and the experimental results show that the accuracies on the imbalanced dataset are greater than those on its balanced counterpart, whereas the F-measure values conflict with this trend. Therefore, imbalanced data create biases in sarcasm identification and in the measured performance of the model.
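The hashtag-based labelling and hashtag removal described above can be sketched as follows. This is a minimal illustration, assuming that a tweet is treated as sarcastic when it carries #sarcasm or #Not and that the marker is stripped before modelling; the function name and the example tweets are hypothetical.

```python
import re

# Matches the explicit sarcasm markers, case-insensitively, as whole hashtags.
SARCASM_TAGS = re.compile(r"#(sarcasm|not)\b", re.IGNORECASE)


def label_by_hashtag(tweet):
    """Return (cleaned_text, is_sarcastic) using the author's own hashtag as label."""
    is_sarcastic = bool(SARCASM_TAGS.search(tweet))
    # Remove the explicit clue so a model cannot simply learn the hashtag.
    cleaned = SARCASM_TAGS.sub("", tweet)
    # Collapse the whitespace left behind by the removal.
    cleaned = " ".join(cleaned.split())
    return cleaned, is_sarcastic


example = label_by_hashtag("I love waiting in line #Not")
```

Because the word boundary is required after the tag, hashtags such as #nothing are left untouched and do not trigger a sarcastic label.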

5.1.2 Heterogeneous data

The datasets in this category are obtained from various social media and other platforms, such as Instagram, Amazon and Tumblr, as well as product reviews from electronic commerce sites, in order to improve the robustness and generalization of the sarcasm identification model. For instance, Schifanella et al. (2016) utilized datasets obtained from Twitter, Tumblr and Instagram for sarcasm detection on multimodal social platforms, comprising text and image data. In an earlier work, Liu et al. (2014) evaluated their model on two corpora of sarcasm features (English and Chinese). English sarcasm was verified on the first corpus, which consists of the news article sets adopted from Davidov et al. (2010), the Twitter datasets used by Reyes et al. (2012), and the Amazon datasets provided by Burfoot and Baldwin (2009). The second corpus, which was used to verify Chinese sarcasm features, also consisted of three different datasets, obtained by crawling various topical comments from Sina Weibo, Tencent Weibo and Netease BBC. The heterogeneous dataset employed in this study is highly imbalanced. Consequently, the area under the curve (AUC) was employed for performance evaluation, as it has proven to be a better performance measure than the F-score for imbalanced datasets because it uses the true positive rate instead of precision. Furthermore, the study by Davidov et al. (2010) focused on sarcasm identification using two multimodal datasets. The datasets consist of tweets (5.9 million tweets) and Amazon product reviews (66,000 product reviews), which were adopted from Tsur et al. (2010). The tweets were streamed using the #sarcasm hashtag included by the tweeter.
However, the hashtag is used inconsistently, since it is not known to all users; hence, most tweeters do not explicitly apply it to tag their sarcastic tweets. For this reason, the tweets that include the hashtag annotation can be regarded as a ‘Secondary gold standard for the detection of sarcastic tweets’. In the same study, the Amazon product reviews covered 120 products. The corpus contains reviews of different books and electronic products. In contrast with the tweets, the Amazon product reviews are longer, as some of the reviews contain about 2000 words. In addition, the sentence structure and grammar of the product reviews are better than those of the tweets.

Table 4 outlines the data types, sources, strengths, and weaknesses of the data utilized for sarcasm identification.

5.2 Review of pre-processing techniques for sarcasm identification

Pre-processing of social media data is necessary because of the irregular and informal form of the data acquired. The purpose of pre-processing is to eliminate problems inherent in such texts, such as misused letters, acronyms, poor grammar and unnecessary repetition (Cotelo et al. 2015). In the pre-processing stage, meaningless data are removed from the acquired dataset in order to enhance the performance of the classification model. According to the previous literature, the pre-processing techniques most used in sarcasm identification research include the removal of stop words, empty spaces, punctuation, and special symbols; conversion of uppercase letters to lowercase; stemming; tokenization; POS tagging; lemmatization; and the removal of URLs and hashtags. The efficiency of these pre-processing techniques is reported in the various studies under consideration. In recent studies, Al-Ghadhban et al. (2017) and Samonte et al. (2018) tested the impact of including or removing URLs, user mentions and stop words in textual data for sarcasm detection on Twitter. The experimental results showed that removing them yields higher classification accuracy than retaining them. Ghosh et al. (2015), Dharwal et al. (2017) and Abulaish and Kamal (2018) illustrated the application of stemming, tokenization and conversion of uppercase letters to lowercase as pre-processing tasks for sarcasm identification. These studies reported that the application of such pre-processing techniques produced better classification performance when compared with other studies. Other scholars (Altrabsheh et al. 2015; Abulaish and Kamal 2018) have also tested the removal of white space characters, punctuation marks, numbers, and emoticons. Their reports showed the effectiveness of applying these pre-processing techniques for improved classification. Nonetheless, Kunneman et al. (2015) tested the usage of punctuation marks as a feature for modelling in their study on ‘Signalling sarcasm from hyperbole to hashtag’. The result of their experiment showed better classification performance when punctuation marks were present than when they were removed. Therefore, we can conclude that researchers should test the various pre-processing techniques on the sarcastic corpus to check their effect on classification accuracy. The summary of the pre-processing techniques applied in the selected studies is given in Table 5. Table 5 shows that many studies made use of basic pre-processing techniques, which indicates the effectiveness of pre-processing in attaining better accuracy in the classification task.
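Several of the basic steps listed above can be combined in a small pipeline such as the following sketch. It is not the procedure of any reviewed study: the stop-word list is a tiny illustrative sample, and the flags `remove_stop_words` and `keep_punctuation` are assumed names that make it easy to compare settings, as the reviewed experiments on stop words and punctuation suggest.

```python
import re
import string

# Tiny illustrative stop-word list; real studies use much larger lists.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}


def preprocess(text, remove_stop_words=True, keep_punctuation=False):
    """Lowercase, strip URLs, mentions and hashtags, then tokenize on whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)        # remove user mentions and hashtags
    if not keep_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                        # whitespace tokenization
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


tokens = preprocess("Wow, the BEST day ever!!! http://t.co/xyz @bob #sarcasm")
```

Calling the function with `keep_punctuation=True` retains tokens such as `ever!!!`, which allows punctuation to be tested as a feature instead of being discarded.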

Table 5 Pre-processing techniques used in the selected studies

5.3 Review of feature engineering techniques for sarcasm identification

Feature engineering is one of the major steps in any classification problem. It involves three stages: feature extraction, feature representation and feature selection (Mujtaba et al. 2018). The output of the feature engineering stage takes the form of feature vectors (in numerical form), which serve as input to the learning algorithm (SVM, RF, DT, etc.) for classification model construction and validation. A detailed explanation of these stages was given in Sect. 3, and the review is presented in the subsequent subsections.

5.3.1 Review of feature extraction techniques for sarcasm identification

In sarcasm identification, feature extraction is the process of extracting relevant and discriminant information from the sarcastic dataset, which helps in training the model for sarcasm identification. The review of the selected studies showed that the semantic properties of sentence features were used in most studies; researchers also utilized automatic feature extraction techniques, based on algorithms and various statistical methods, to extract content-based and linguistic features. The content-based feature extraction techniques consist of bag of words (BoW) (da Silva et al. 2014), word to vector (word2vec) (Lee et al. 2018) and n-gram (Sintsova and Pu 2016). As revealed in Table 6, most of the selected studies utilized the n-gram feature extraction technique. For instance, some authors (González-Ibánez et al. 2011; Rajadesingan et al. 2015a; Kumar and Harish 2018) utilized n-gram feature extraction for sarcasm detection and reported that the technique is useful in extracting lexical features. One motivation for using the n-gram model is its simplicity and scalability (it scales to very large sample datasets). In another study, on sarcasm detection in bilingual text, Suhaimin et al. (2017) used various NLP techniques to extract a combination of lexical, pragmatic, syntactic, prosodic and idiosyncratic features. These features were trained using a non-linear SVM algorithm. The results show that the NLP-selected features outperformed baseline features such as bag-of-words, which demonstrated the better performance of the proposed method. Furthermore, lexicon-based sentiment features and pragmatic features (emoticons and user mentions) were extracted in a study by González-Ibánez et al. (2011) for sarcasm identification. The experimental analysis showed that the combination of such features improved the accuracy of the prediction. The summary of the feature extraction techniques in the selected studies is shown in Table 6.
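The n-gram and bag-of-words ideas can be made concrete with a short sketch. It is a generic illustration, with assumed function names and a hypothetical example sentence, and does not reproduce the feature extractor of any particular study.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, max_n=3):
    """Count all n-grams of length 1..max_n, ignoring their positions."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical tokenized tweet.
tokens = "oh great another monday".split()
features = bag_of_ngrams(tokens, max_n=2)
```

For this four-token example the feature set contains four unigrams and three bigrams; the simplicity of the counting is one reason the technique scales well to large corpora.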

Table 6 Feature extraction techniques used in the selected studies

5.3.2 Review of feature representation techniques

In addition to the feature extraction techniques, the study revealed that the feature representation technique most often used to convert the extracted features into numerical form is term frequency (TF), which is used to determine the frequency of occurrence of sarcastic features among the extracted features. For instance, the contextual features extracted from the target author’s historical tweets in a study by Suhaimin et al. (2019) were represented with TF and inverse document frequency (IDF). The TF-IDF values were used to sort the historical tweets in order to choose a fixed number of contextual tweet words (features) having the greatest TF-IDF values. In another study, on sarcasm detection and sentiment analysis classification, Suhaimin et al. (2019) adopted the three NLP categories of features (pragmatic, syntactic, and prosodic) proposed by Suhaimin et al. (2018), because these had been shown to improve sarcasm detection. The extracted features were represented using term frequency-inverse document frequency (TF-IDF) and binary representation (BR). Out of the 40 selected studies, 12 studies used TF, 8 studies used BR and 20 studies did not report any feature representation technique.
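The representation schemes named above can be illustrated with a short sketch. It assumes one common TF-IDF variant (relative term frequency multiplied by the natural logarithm of the inverse document frequency); the reviewed studies may use other weightings, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def binary_representation(doc, vocabulary):
    """Return 1 for each vocabulary term present in the document, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocabulary]


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized tweets.
docs = [["love", "mondays"], ["love", "rain"], ["hate", "rain", "rain"]]
weights = tf_idf(docs)
```

Terms that occur in fewer documents receive larger weights, which is why sorting by TF-IDF surfaces the most distinctive contextual words.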

5.3.3 Review of feature selection techniques for sarcasm identification

In feature selection, certain criteria are followed to discover suitable feature sets (Guyon and Elisseeff 2003), and the approach is broadly applicable to sarcasm detection. Nevertheless, only a few of the selected studies on sarcasm identification utilized a feature selection technique to investigate the effect of different feature subsets on classification accuracy. The feature selection techniques used in the selected studies are Chi square (χ2), information gain (IG) and mutual information (MI), which are briefly explained below.

Chi square (χ2) Chi square is a statistical test used for measuring the lack of independence between a particular class (c) and a feature term (f) (Kumar and Harish 2018).

Information gain (IG) Information gain is a feature selection technique that measures how much information about the class is gained by knowing the value of an attribute within a feature vector (Yang and Pedersen 1997).

Mutual information (MI) Mutual information is a statistical measure that is commonly used to model the mutual dependence of two random variables, for example in word association and related applications (Yang and Pedersen 1997).
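To make two of these criteria concrete, the sketch below scores a single term against a single class from a 2 × 2 contingency table of document counts. It assumes the commonly used formulations for text categorization (the χ2 statistic for a 2 × 2 table and pointwise mutual information); the variable names and the example counts are hypothetical.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a term/class 2x2 table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def pointwise_mutual_information(a, b, c, d):
    """log of P(term, class) / (P(term) * P(class)), estimated from counts."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log(a * n / ((a + c) * (a + b)))


# Hypothetical counts for one term and the sarcastic class.
score = chi_square(a=30, b=10, c=20, d=40)
```

A score of zero indicates that the term and the class are independent in the sample; features are typically ranked by score and only the highest-scoring ones are retained.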

For instance, Kumar and Harish (2018) employed Chi square (χ2), mutual information (MI), and information gain (IG) as conventional feature selection techniques to select the discriminative features for sarcasm classification. The researchers tested their effect, and the experimental findings show that the use of these feature selection techniques reduced the high-dimensional feature space and also increased the classification accuracy of the classifiers. For example, the SVM and RF classifiers yielded maximum accuracy when the MI and IG selection schemes were applied. In a related study by Muresan et al. (2016), n-gram lexical features were extracted using linguistic inquiry and word count (LIWC) and the WordNet-Affect dictionary (Strapparava and Valitutti 2004; Pennebaker et al. 2015). Furthermore, pragmatic features such as emoticons and punctuation were extracted. The discriminative features among these were then selected by employing the Chi square (χ2) selection scheme before modelling. The review showed that five (5) of the 40 selected studies used Chi square to select discriminative features, three (3) studies used information gain, and one study used Chi square, information gain and mutual information (MI); the remaining 31 studies did not report the use of any feature selection scheme to select the important features from the extracted ones. The summary of the feature representation techniques is shown in Table 7, while the feature selection schemes utilized in the analyzed studies are shown in Table 8.

Table 7 Feature representation techniques used on the selected studies
Table 8 Feature selection techniques used on the selected studies

5.4 Review of classification techniques for sarcasm identification

According to our findings, various classification algorithms have been used for sarcasm identification in social media. The classification algorithms used in the selected studies are summarized in Table 9, which shows that one or more classifiers were utilized by each study. Some studies utilized multiple classifiers in order to compare the performance of each classifier with the proposed method, while others employed only one learning algorithm for classification. Moreover, different researchers on sarcasm identification used different datasets, which makes it difficult to compare the performance of different classifiers across studies. For instance, Liebrecht et al. (2013) and Kunneman et al. (2015) employed only the balanced winnow classifier for sarcasm identification. In these studies, balanced winnow assigns a score to each class label, reflecting its confidence in that label, and good performance was obtained when the area under the curve (AUC) metric was used. In another study, random forest (RF), support vector machine (SVM), K-nearest neighbour (K-NN) and maximum entropy (ME) were used to classify sarcasm in tweet datasets using pattern-related features. The results showed that RF outperformed SVM, K-NN and ME by attaining an F-measure of 81.3%. Ling and Klinger (2016), in their study on the ‘Comparative analysis classification of differences between irony and sarcasm’, compared the performance of the DT, ME and SVM classifiers. The empirical analysis showed that the ME model performed better than the decision tree and SVM classifiers. Sulis et al. (2016) investigated the performance of Naïve Bayes (NB), DT, RF, LR and SVM in modelling the differences among three kinds of figurative messages (#sarcasm, #Not and #Irony) on Twitter.
Among these classifiers, the highest F-measure was obtained by the RF classifier in distinguishing #Irony versus #Not. Moreover, when the same datasets used by Barbieri et al. (2014) were employed for the #Irony versus #Sarcasm classification experiment, the F-measure improved from 0.62 to 0.70. Abulaish and Kamal (2018) compared the performance of NB, DT and bagging (ensemble) classifiers in classifying hyperbolic and self-deprecating features for sarcasm identification in balanced and unbalanced tweet datasets. They reported precision, recall and F-measure for all three classifiers: DT attained the highest F-measure and recall, while the best precision was achieved by the bagging classifier on both datasets. Table 9 shows that support vector machine (SVM) and Naïve Bayes (NB) are the most used classifiers for sarcasm identification on social platforms. Among the 40 selected studies, 22 used the SVM classifier and 14 used NB (Fig. 7).

Table 9 Classification algorithm used on the selected studies
Fig. 7 Frequency of the classification techniques used in the selected studies

5.5 Review of performance measure

The performance of sarcasm classification can be measured using various metrics such as accuracy (ACC), recall (REC), F-measure (F-M), precision (PR), the area under the curve (AUC) and kappa statistics (KS). These metrics can be computed from the values of false positive (FP), false negative (FN), true positive (TP), and true negative (TN), which make up the confusion matrix. The detailed description and computation of these measures are given in Sect. 3.7. The choice of performance metrics depends on the goal for which sarcasm is being identified. Although the review indicated precision, accuracy, recall, and F-measure as the most frequently employed performance metrics, these metrics may be inadequate to evaluate a classifier’s performance correctly because of the class imbalance found in the datasets of most selected studies. In such a situation, AUC would be the best option due to its suitability for evaluating the classification performance related to an individual class (Provost and Fawcett 1997; Provost et al. 1998). For instance, Samonte et al. (2018) collected two sets of tweets (English and Filipino) on a range of domains such as social media, politics, weather, government and transportation to build a model for sarcasm identification on a multilingual platform. In the study, only the accuracy metric was employed to measure the performance of the classification. The English dataset comprised 1101 sarcastic and 13,998 non-sarcastic tweets, whereas the Filipino dataset consisted of 894 sarcastic and 14,229 non-sarcastic tweets. Both datasets are naturally imbalanced, and in such a case using accuracy as the only performance metric may be biased. Thus, a more suitable measure of the performance of the algorithm for sarcasm identification is AUC. In another study, Liu et al. (2014) employed two corpora to classify English and Chinese sarcasm features. The first corpus consists of Twitter, Amazon product review and news article datasets: the Twitter dataset comprised 3200 sarcastic and 36,800 non-sarcastic instances, the Amazon product reviews 471 sarcastic and 5020 non-sarcastic, and the news articles 223 sarcastic and 4000 non-sarcastic. The second corpus consists of three sets of Chinese topic comments crawled from Tencent Weibo (359 sarcastic and 5128 non-sarcastic), Sina Weibo (238 sarcastic and 3621 non-sarcastic), and Netease BBC (546 sarcastic and 9810 non-sarcastic). The class distributions of all the corpora used in the classification experiment are highly imbalanced. Thus, the authors employed AUC to measure the performance of the classification models, because AUC, which uses the true positive rate (TPR) instead of precision, is more resistant than the F-score to skewness in datasets. The summary of the performance measures used in the selected studies is shown in Table 10.
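The effect of class imbalance on accuracy can be illustrated with a small sketch. It uses a scaled-down sample with the same 3200 : 36,800 class ratio as the Twitter dataset reported above (32 sarcastic and 368 non-sarcastic instances) and a trivial classifier that labels everything as non-sarcastic. The pairwise AUC computation shown here is the standard rank-based definition; the function names are assumptions made for illustration.

```python
def accuracy(labels, predictions):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Scaled-down imbalanced sample: 32 sarcastic (1) and 368 non-sarcastic (0).
labels = [1] * 32 + [0] * 368
majority_predictions = [0] * 400   # always predict non-sarcastic
constant_scores = [0.0] * 400      # the same score for every instance

majority_accuracy = accuracy(labels, majority_predictions)
majority_auc = auc(labels, constant_scores)
```

The trivial classifier reaches an accuracy of 0.92 without detecting a single sarcastic instance, whereas its AUC is 0.5, the value expected from uninformative scoring.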

Table 10 The frequency of performance metrics in the selected studies

5.6 Discussion

An extensive review of the academic articles on sarcasm identification and classification published between 2008 and 2019 has been carried out in this study. The review concentrated on dataset usage, pre-processing techniques, feature engineering techniques, classification algorithms and the performance measures used in the selected studies. The study found that sarcasm detection has been applied in many application domains such as product review, sentiment analysis, spam email filtering, and dialogue in human–computer interaction.

The first review question, “Are there annotated sarcastic datasets publicly available in the area of sarcasm identification using text classification methods?”, provides insight into the publicly available datasets for sarcasm identification. The review findings show that researchers collect their own datasets because there are no standard publicly available datasets for sarcasm identification except the Amazon product review datasets, which are available only on request. Many studies have collected their datasets from microblogging sites such as Twitter. The distinctive properties of Twitter have made it the most utilized source in comparison with other types of datasets. Some of the reasons for employing Twitter include the generation of a large volume of tweets in a short period of time and the different characteristics by which Twitter data can be categorized when crawling, such as the domain type, trends (past and current), politics, gender, age, and geographical location. In addition, the use of hashtags and keywords for streaming is another property of Twitter that is of great interest to researchers. A Twitter hashtag is a string preceded by the hash symbol, which can be viewed as a topic marker or the key context expression of the tweet. Thus, users that discuss similar topics make use of the same hashtag (Tsur and Rappoport 2012). One of the issues observed with the datasets deployed in the studies is their imbalanced nature. In such studies, there is an unequal distribution of class instances, which can bias the classification accuracy. Overall, the review showed that there are no standard publicly available annotated datasets in this research domain.
Therefore, it is necessary to have a standard public datasets for the classification experiment on sarcasm identification and to employ the suitable performance metrics such as AUC for the evaluation of classification performance when an imbalance datasets are used.

Furthermore, various pre-processing techniques have been employed to cleanse the extracted data of unwanted items that do not contribute to classification performance. Nevertheless, only a few studies (Riloff et al. 2013; Al-Ghadhban et al. 2017; Bharti et al. 2017; Mukherjee and Bala 2017b; Ranjan et al. 2017; Manjusha and Raseek 2018; Samonte et al. 2018) compared classification with and without stop words, and they reported that removing stop words gave better classification accuracy than retaining them. Also, some of the selected studies indicated that applying word tokenization together with basic pre-processing tasks achieved better classification performance (Riloff et al. 2013; Barbieri et al. 2014; Ghosh et al. 2015; Khattri et al. 2015; Ranjan et al. 2017; Samonte et al. 2018). In some selected studies, stemming (Riloff et al. 2013; Al-Ghadhban et al. 2017; Dharwal et al. 2017; Samonte et al. 2018) and lemmatization (Bouazizi and Ohtsuki 2016; Manohar and Kulkarni 2017) have also been applied, and the effectiveness of these techniques has been demonstrated. Besides, researchers have also demonstrated the text normalization technique, in which the data are brought to a common form using regular expressions. The research findings show that text normalization helps to improve classification performance and mitigates the dimensionality problem (Patro and Sahu 2015). As such, there is a need for empirical evaluation and comparison of the pre-processing techniques on data collected for sarcasm identification, so as to ensure better classification performance.
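The pre-processing steps discussed above (regular-expression normalization, word tokenization and optional stop-word removal) can be illustrated with a minimal sketch. It is not the pipeline of any reviewed study: the stop-word list is a hypothetical, deliberately tiny one, the function names are illustrative, and stemming and lemmatization are left out because they normally rely on an external NLP toolkit.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); a real experiment would use a fuller list from an NLP toolkit.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "i", "it", "so"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # three or more repeats of one character
TOKEN_RE = re.compile(r"[a-z0-9']+")


def normalize(text):
    """Lower-case a tweet and map noisy spans to a common form."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # drop URLs
    text = MENTION_RE.sub(" ", text)      # drop @user mentions
    text = HASHTAG_RE.sub(r"\1", text)    # keep the hashtag word, drop '#'
    text = REPEAT_RE.sub(r"\1\1", text)   # "looove" -> "loove"
    return text


def preprocess(text, remove_stop_words=True):
    """Normalize, tokenize and optionally remove stop words."""
    tokens = TOKEN_RE.findall(normalize(text))
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


tweet = "I LOOOVE waiting in line @bob #sarcasm http://t.co/x"
print(preprocess(tweet))
print(preprocess(tweet, remove_stop_words=False))
```

Exposing `remove_stop_words` as a flag makes it straightforward to run the with/without stop-word comparison that only a few of the reviewed studies reported.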

The review answered the next four research questions (Research Questions 2, 3, 4 and 5, as outlined in Sect. 1). These questions concern the feature engineering techniques (consisting of feature extraction, feature selection, and feature representation) employed in the selected studies. Based on the findings from the review, most researchers employed feature extraction techniques such as N-gram, BoW, Word2vec, and PoS tagging to extract discriminative features from the collected sarcastic datasets before the classification stage. However, as revealed in Table 6, most studies utilized the N-gram extraction technique for sarcasm identification due to its simplicity and scalability. Thus, content-based linguistic features such as unigrams, bigrams and trigrams, among others, were the most useful features in the selected studies. In sarcasm identification, however, relying only on content-based features for classification is not encouraged, because the limitations inherent in those features can limit classification accuracy. First, content-based features disregard word order and grammar, even though word frequency is retained. Second, these features do not account for word-level synonymy and polysemy. To avoid these limitations, other features need to be combined with the content-based features to enhance classification accuracy. In addition to feature extraction, several studies used binary representation (BR) to record whether a sarcastic feature occurs among the extracted features, and the term frequency (TF) representation scheme to record how frequently it occurs. A study by Barbieri et al. (2014) represented sarcastic features using TF and BR and obtained promising results. TF and BR are therefore the most commonly employed feature representation techniques in the selected studies and are recommended for sarcastic feature representation on the strength of the promising results obtained in the studies that used them. It should also be noted that not all available features are useful for improving classification performance, since non-discriminative features may lead to model over-fitting (Forman 2003). Hence, a suitable feature selection scheme is required to find the useful features that can enhance classification accuracy, lower the computation time and reduce noise in the construction of the classification model (Hall and Smith 1998). The review of the selected studies indicated that Chi-square (χ2), information gain (IG) and mutual information (MI) were the most commonly employed feature selection schemes.
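To make the difference between the BR and TF representation schemes concrete, the following minimal sketch extracts word N-grams and encodes a tokenized text under either scheme. The function names and the toy documents are assumptions made for illustration only; they are not taken from any of the reviewed studies, and feature selection (χ2, IG, MI) is not shown.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(tokens, max_n=2):
    """Collect all n-grams for n = 1..max_n (unigrams, bigrams, ...)."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(tokens, n))
    return feats


def build_vocabulary(docs, max_n=2):
    """Map every n-gram seen in the training documents to a column index."""
    vocab = sorted({f for d in docs for f in extract_features(d, max_n)})
    return {feat: idx for idx, feat in enumerate(vocab)}


def vectorize(tokens, vocab, scheme="tf", max_n=2):
    """Encode a document as TF (counts) or BR (0/1 presence) over vocab."""
    counts = Counter(extract_features(tokens, max_n))
    vec = [0] * len(vocab)
    for feat, count in counts.items():
        if feat in vocab:
            vec[vocab[feat]] = count if scheme == "tf" else 1
    return vec


# Toy, hypothetical documents (already tokenized).
docs = [["oh", "great", "great", "monday"], ["great", "day"]]
vocab = build_vocabulary(docs)
print(vectorize(docs[0], vocab, scheme="tf"))
print(vectorize(docs[0], vocab, scheme="br"))
```

In this toy case the two vectors differ only in the column for the repeated unigram "great", which TF counts twice and BR records as present.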

In answering Research Question 6, “Which of the text classification algorithms produces better accuracy and why?”, the review found that various classification algorithms have been employed in the selected studies for the identification of sarcasm on social media platforms. The results reported on the studied datasets, with the features proposed in the corresponding studies, showed that SVM produced the best performance (González-Ibánez et al. 2011; Riloff et al. 2013; Ghosh et al. 2015; Schifanella et al. 2016). For instance, González-Ibánez et al. (2011) evaluated SVM and logistic regression (LR) classifiers for distinguishing sarcasm from positive and negative sentiment in Twitter messages, after using the Chi-squared feature selection scheme to select the most discriminative features, and reported that SVM outperformed the LR model in accuracy. Similarly, Riloff et al. (2013) compared the SVM classifier with a rule-based approach for sarcasm detection and obtained better results than with the rule-based approach alone. Interestingly, the sparse nature of the SVM model has made it suitable for text classification. Reports from several studies also indicated that the NB algorithm produced enhanced classification results in sarcasm identification. Furthermore, only a few of the selected studies applied the KNN algorithm for sarcasm detection, and the experimental results in those studies were inconsistent. Nonetheless, the analysis of the results in the selected studies showed that SVM produced the best performance in sarcasm classification, followed by the NB and KNN classification algorithms, which also performed well. It should be noted that four (4) studies (Amir et al. 2016; Ghosh and Veale 2016; Zhang et al. 2016; Manjusha and Raseek 2018) out of the 40 selected studies used a deep learning approach for sarcasm classification and compared its results with traditional machine learning approaches such as LR, SVM and RF. The results of these experiments showed that deep learning outperformed traditional machine learning. For example, Amir et al. (2016) presented a novel convolutional network-based approach that learnt user-specific context and reported a 2% improvement in performance. In addition, Ghosh and Veale (2016) combined a convolutional neural network (CNN), a deep neural network (DNN) and long short-term memory (LSTM) in their classification approach, and their deep learning architecture showed an improvement when compared with the recursive support vector machine model. The main advantage of deep learning is that features are engineered and learned automatically through a general learning process, unlike shallow learning, which depends on humans for feature engineering. Thus, the deep learning approach is very helpful in sarcasm detection, as it addresses the problem of data dimensionality that usually occurs when features are engineered by hand.
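As an illustration of the supervised classification stage, the sketch below implements one of the traditional algorithms named above, multinomial Naive Bayes with add-one smoothing, on unigram counts. NB is chosen here only because it fits in a few lines of standard-library Python, not because the review ranks it above SVM. The class name, the toy tweets and the labels (1 = sarcastic, 0 = non-sarcastic) are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        self.total_tokens = {
            c: sum(self.token_counts[c].values()) for c in self.classes
        }
        return self

    def log_scores(self, doc):
        """Return the unnormalized log-posterior of each class for doc."""
        n_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        scores = {}
        for c in self.classes:
            score = math.log(self.class_counts[c] / n_docs)  # class prior
            for tok in doc:
                if tok not in self.vocab:
                    continue  # ignore tokens never seen in training
                likelihood = (self.token_counts[c][tok] + 1) / (
                    self.total_tokens[c] + v
                )
                score += math.log(likelihood)
            scores[c] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(self.classes, key=lambda c: scores[c])


# Toy, hypothetical training data: 1 = sarcastic, 0 = non-sarcastic.
train_docs = [
    ["oh", "great", "another", "monday"],
    ["love", "waiting", "in", "traffic"],
    ["great", "weather", "today"],
    ["love", "this", "song"],
]
train_labels = [1, 1, 0, 0]

model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict(["another", "monday", "in", "traffic"]))
print(model.predict(["weather", "today"]))
```

With four training texts the output says nothing about real-world accuracy; the point is only to show where the extracted features enter the classifier.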

With regard to Research Question 7, “Which performance measures are most widely used to measure the performance of the classifiers in sarcasm classification?”, the analysis of the selected studies indicated that precision, accuracy, recall, and F-measure were the most commonly employed performance metrics; yet these metrics may be inadequate to correctly evaluate classifier performance, because of the class imbalance found in many of the datasets in the selected studies. In such a situation, AUC would be a more suitable option due to its suitability for evaluating the classification performance related to an individual class (Provost and Fawcett 1997; Provost et al. 1998). Besides, because it is based on TPR, AUC is strongly resistant to skewness in datasets when compared with F-measure.
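The concern about accuracy on imbalanced data can be seen in a small worked example. The sketch below computes accuracy and AUC in plain Python, with AUC taken as the probability that a randomly chosen sarcastic instance is scored above a randomly chosen non-sarcastic one (ties count as one half). The 5-to-95 class split and the trivial classifier are hypothetical and chosen only to make the effect visible.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative.

    Uses the pairwise (Mann-Whitney) formulation; ties count as 0.5.
    Quadratic in the number of instances, which is fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced test set: 5 sarcastic (1), 95 non-sarcastic (0).
y_true = [1] * 5 + [0] * 95

# A trivial classifier that gives every tweet the same score and label 0.
scores = [0.0] * 100
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # high, although no sarcasm is detected
print(roc_auc(y_true, scores))   # chance level
```

The trivial classifier reaches 95% accuracy while its AUC stays at chance level, which is the situation the paragraph above warns about.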

Based on the review, only one study (Ptáček et al. 2014) out of the 40 selected studies provided a detailed error analysis of misclassification. In that study, which addressed sarcasm detection on English and Czech tweets, an experiment with an imbalanced class distribution was carried out. An English corpus consisting of 100,000 tweets was sampled to obtain a distribution similar to that of the Czech corpus, consisting of 325 sarcastic and 6675 non-sarcastic tweets. The combination of various features yielded an F-measure of 0.734 ± 0.01 with the Maximum Entropy classifier and 0.729 ± 0.01 with SVM, which shows a drop in performance. This indicates that the amount of training data plays a vital role in classification performance (approximately 0.92 on the English corpus versus approximately 0.73 on the Czech corpus). Hence, wrong classification may lead to poor performance. To this end, Research Questions 1 to 7 of this study have been answered, while Research Question 8 is answered in Sect. 6 below.

6 Research challenge and future directions

This review has identified several research issues in previous work on sarcasm identification using text classification approaches. The highlighted research gaps need considerable research effort in order to create an efficient classification model in the domain of sarcasm identification. These challenges and open research directions are discussed below:

  1. Datasets: One of the major problems in the sarcasm identification domain is the lack of a standard dataset. There is no standard publicly available dataset for sarcasm identification, which has led most researchers to create privately owned datasets. This situation introduces bias, since both the training and testing sets are created by the researchers themselves and there is no standard dataset against which a proposed technique can be compared for an unbiased evaluation of its performance. There is also an imbalance in the class distribution of the datasets, so that the numbers of sarcastic and non-sarcastic texts are not equal. This calls for the creation of standard datasets, which would address the problem of bias in the data. Techniques are also needed to balance the datasets before the classification experiment, together with performance metrics such as AUC, which is suitable for evaluating classifier performance on imbalanced datasets.

  2. Tweet typos: According to our review, Twitter data is the most widely used domain for sarcasm detection. Misspelling of words is a common mistake when composing a message in a microblog. Humans can easily correct such mistakes without effort, but it is very difficult for a machine learning model to detect and correct misspelt words. Moreover, a misspelt word may correspond to a dictionary entry that is removed during the pre-processing stage, which can drastically influence the sentence polarity. In addition, a machine learning model may ignore wrongly spelt words or replace them with closely related ones. Such errors are very common in sarcasm detection data. Thus, attention should be paid to finding a technique that can detect and correct wrongly spelt words.

  3. The exploitation of new features: The review shows that most of the existing studies made use of content-based linguistic features in the classification phase for sarcasm identification on social media platforms. Only a few studies (Bharti et al. 2016; Zhang et al. 2016) took advantage of behavioural and contextual features to identify sarcasm. In those studies, promising accuracies were obtained compared with the content-based features. One of the 40 selected studies, Schifanella et al. (2016), also made use of the visual semantics feature (VSF), in which the sarcasm can only be understood through the semantics of the image, and attained a higher accuracy when VSF was combined with N-grams and the SVM classifier. Therefore, it is important for future research to explore novel features such as behavioural, contextual and visual features for sarcasm identification.

  4. Application of deep learning methods: Most researchers in the data mining domain are now shifting from traditional machine learning to deep learning methods, owing to the cumbersome pre-classification phase, especially the feature extraction phase, in traditional machine learning approaches to sarcasm identification. The deep learning approach is required in order to overcome such issues, as the features are not engineered by human intervention. Only four (4) studies (Amir et al. 2016; Ghosh and Veale 2016; Zhang et al. 2016; Manjusha and Raseek 2018) out of the 40 selected studies made use of a deep learning approach. The classification accuracy of sarcasm detection can be enhanced by applying different deep learning techniques together with effective feature extraction methods such as word to vector (word2vec) conversion, n-gram and bag-of-words. Some deep learning classification algorithms, such as recurrent neural networks (RNN) and convolutional neural networks (CNN), have reported good performance when applied to sarcasm identification. Deep learning has also enhanced the performance accuracy in many text and web mining classification tasks (Dumais and Chen 2000). As such, future research can shift attention to the application of deep learning methods.

  5. Intense use of emoji and emoticons: People have become familiar with using emotion symbols such as emoji and emoticons on social media to display their state of mind, especially in microblogs that restrict the number of characters per message. Ambiguity is likely to occur among users with regard to the specific meaning of an emoji. An emoji can therefore change the overall sentiment of a sentence, yet emoji features are not incorporated into current systems. To this end, future research should investigate how to incorporate these features.

  6. Multilingual-based approach: The majority of the existing works on sarcasm identification utilized only English language datasets. However, most people express their emotions better in their native languages than in English. Mining such opinions remains problematic because few researchers have shown interest in such research; that is why most existing works on sarcasm classification paid more attention to textual data expressed only in the English language. Only a few studies worked on languages other than English. For instance, Samonte et al. (2018) worked on sentence-level sarcasm detection in English and Filipino tweets. Classification performances were compared in both languages, and the results showed that the maximum entropy (ME) model obtained a better accuracy of 88.506% for training and 91.994% after validation when applied to the Filipino datasets, compared with the English datasets, which produced an accuracy of 79.91% for training and 78.75% for testing. As such, further research that focuses on feature extraction for other languages and on the modification of classifiers is urgently required, so that the methods can be applied to sarcasm written in other languages.

  7. Clustering-based approach: The clustering-based approach deploys unsupervised learning (Yang 1993), which is mostly applied in pattern recognition but is still in its infancy in the domain of sarcasm identification. Most researchers in the selected studies implemented a supervised learning approach to build a classification model and obtained good results despite the limitations inherent in such approaches. One of the key issues in supervised learning is the labelling of the datasets in order to construct the training sets. Such tasks require linguistic experts and are time-consuming. For instance, in a study conducted by Samonte et al. (2018) on sentence-level detection of sarcasm in English and Filipino, six (6) linguistics experts were engaged to manually label 30,231 tweets (consisting of 15,099 English and 15,132 Filipino tweets) as sarcastic or non-sarcastic. Thus, a tremendous amount of time is required for data preparation, and disagreement can arise when more than one expert is engaged for annotation. Further research in this domain can therefore focus more on the unsupervised approach (clustering) for modelling sarcasm identification, in order to avoid such labelling effort.
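As one possible starting point for the tweet typo challenge listed above, the following minimal sketch maps misspelt tokens to the closest entry of a word list using the similarity matching in Python's standard `difflib` module. It is a sketch under stated assumptions rather than a recommended solution: the lexicon is a hypothetical six-word list, the function names are illustrative, and the 0.8 similarity cutoff is an arbitrary choice.

```python
from difflib import get_close_matches

# Hypothetical mini-lexicon (an assumption for this sketch); a real system
# would use a full dictionary or a domain-specific word list.
LEXICON = ["awesome", "great", "love", "monday", "traffic", "waiting"]


def correct_token(token, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon word, or the token itself if none is close."""
    if token in lexicon:
        return token
    matches = get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_tokens(tokens, lexicon=LEXICON, cutoff=0.8):
    """Apply correct_token to every token of a tokenized tweet."""
    return [correct_token(t, lexicon, cutoff) for t in tokens]


print(correct_tokens(["awsome", "trafic", "gr8"]))
```

Character-level similarity repairs simple omissions such as "awsome", but it leaves abbreviations such as "gr8" unchanged, so slang and shorthand would still need a separate treatment.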

7 Conclusion

The study presents a comprehensive review of classification techniques for sarcasm identification on social media platforms. The review covered articles on sarcasm detection published between 2008 and 2019. The study selected 40 primary studies from 7 different academic databases and critically reviewed the areas of dataset usage, pre-processing techniques, feature engineering techniques (consisting of feature extraction, representation, and selection), the classification approach and the performance metrics. The study showed that there are no standard, publicly available datasets for sarcasm identification in social microblogs such as Twitter, so researchers are required to crawl their own datasets. Content-based features were the most commonly used features, whereas N-gram and POS tagging were the most commonly used feature extraction techniques due to their simplicity. BR and TF were the most commonly used feature representation schemes in the selected studies. The BR technique is very effective in sentiment feature representation, as the occurrence of sarcasm is checked on the textual data. For example, the value 1 is used to indicate the presence of sarcasm in the sentence, whereas the value 0 indicates its absence. TF was also used to check the frequency of occurrence of a feature in the training sets; this has the potential of increasing the likelihood of occurrence of the feature in the test set. In order to eliminate non-discriminative features, various studies applied feature selection schemes such as Chi-squared and information gain. In the classification phase, the majority of the studies applied supervised machine learning algorithms such as SVM, NB, RF, ME and DT. The review showed that the SVM algorithm was the most commonly used, followed by NB, RF, and ME, because SVM obtained better results than the other classifiers. Only a few studies used rule-based and NLP approaches. 
In recent studies, the deep learning approach has gained ground in sarcasm identification owing to the fact that learning and feature engineering are done automatically without human intervention. Performance metrics such as precision, recall, accuracy, and F-measure were used to measure the performance of the classification algorithms, and it was found that accuracy was the most commonly used metric in the selected studies. Relying only on accuracy as a performance measure will not give a reliable result where imbalanced datasets are used. Hence, AUC is a more suitable performance metric where datasets are imbalanced. A comprehensive investigation of the characteristics, types, strengths, and weaknesses of datasets for sarcasm identification in social media textual data was carried out. In addition, an outline taxonomy and various feature representation and extraction techniques for efficient algorithm development are presented. The survey also critically analyzed various data preparation (pre-processing) techniques and recent classification algorithms for sarcasm identification. Finally, in order to set the pace for new developments, the study identifies recent research challenges and proposes open research directions to tackle issues in the sarcasm identification domain. This comprehensive review of sarcasm identification systems should provide valuable insight into the research domain, and researchers are encouraged to further improve sarcasm identification systems using textual data.