
1 Introduction

In a world where people routinely express conflicting emotions, analyzing their sentiments is a difficult task. Sarcasm requires shared knowledge between the speaker and the listener [1]. Detecting sarcasm in text is difficult because gestural and tonal cues are missing. Many researchers collect their datasets from social media text, especially tweets, to detect sarcasm [1]. We applied machine learning algorithms to the dataset provided in Amazon Alexa’s sample set, with the aim of designing a machine learning method that detects sarcasm in text. Naive Bayes, one-class SVM, and Gaussian kernel methods are among the algorithms commonly used for this task [2].

Semi-supervised sarcasm identification has been applied to two different datasets: a collection of millions of tweets from Twitter, and a collection of millions of product reviews from Amazon [3]. On Twitter, a common form of sarcasm contrasts a positive sentiment with a negative situation. For example, many sarcastic tweets include a positive sentiment, such as “love” or “enjoy”, followed by an expression that describes an undesirable activity or state (e.g., “taking exams” or “being ignored”) [4].

Sarcasm flips the polarity of an apparently positive or negative statement into its opposite. Several authors have created corpora of sarcastic Twitter messages in which the sarcasm of each message was determined by its own author. These corpora are used as a reliable benchmark to compare sarcastic expressions on Twitter. Many authors have also investigated the impact of lexical and pragmatic factors on identifying sarcastic statements. Because sarcastic statements are difficult to identify, the performance of machine learning techniques and human judges on this task has been compared. Perhaps unsurprisingly, neither the human judges nor the machine learning techniques [5,6,7] perform very well [8]. Many computational approaches to sarcasm detection using lexical cues have also been proposed [9].

Many properties [10,11,12,13,14] have been explored for finding sarcasm in text, such as theories of sarcasm, syntactic properties [15], and lexical features [16, 17], among others [4, 18, 19]. A model’s accuracy can be improved by identifying positive and negative words, which can be done using bag-of-words. Using bag-of-words for feature extraction increases accuracy [12]. The experimental results show that the proposed method outperforms the existing methods. The rest of the paper is organized as follows: Sect. 2 describes the proposed method, Sect. 3 discusses the experimental results, and Sect. 4 concludes the paper.

2 Proposed Work

In this work, we performed several operations to build our model. The SentiWordNet 3.0 dictionary is preprocessed and transformed into a map of key-value pairs, where each key contains the POS tag and synset of a SentiWordNet entry and each value contains the mean of the positive and negative scores of the corresponding words. We used this map to calculate the sentiment of the provided textual data. The complete steps of the proposed method are shown in Fig. 1.

Fig. 1 Flowchart showing the process of data processing and feature extraction to detect sarcasm
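The SentiWordNet map described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes the tab-separated SentiWordNet 3.0 layout (POS, ID, PosScore, NegScore, SynsetTerms, Gloss), uses made-up toy rows instead of the real dictionary, and interprets "mean of positive and negative values" as the per-word average of each score over all senses. The name build_sentiment_map is hypothetical.

```python
from collections import defaultdict

# Toy rows in the assumed SentiWordNet 3.0 layout; the scores are invented.
SAMPLE_ROWS = [
    "# POS\tID\tPosScore\tNegScore\tSynsetTerms\tGloss",
    "\t".join(["a", "00000001", "0.5", "0.0", "good#1 nice#2", "toy gloss"]),
    "\t".join(["a", "00000002", "0.25", "0.25", "good#2", "toy gloss"]),
    "\t".join(["v", "00000003", "0.0", "0.75", "hate#1", "toy gloss"]),
]


def build_sentiment_map(lines):
    """Map (pos_tag, word) -> (mean positive score, mean negative score)."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # pos sum, neg sum, sense count
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        pos_tag, _synset_id, pos_score, neg_score, terms, _gloss = line.split("\t", 5)
        for term in terms.split():
            word = term.rsplit("#", 1)[0]  # drop the sense number, e.g. "good#2"
            entry = totals[(pos_tag, word)]
            entry[0] += float(pos_score)
            entry[1] += float(neg_score)
            entry[2] += 1
    return {key: (p / n, q / n) for key, (p, q, n) in totals.items()}


sentiment_map = build_sentiment_map(SAMPLE_ROWS)
print(sentiment_map[("a", "good")])
```

A text-level sentiment score could then be obtained by looking up each tagged word in this map and aggregating the values; how the scores are aggregated is not specified in the text.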

TextBlob is a Python library that aims to provide access to common text processing operations. Polarity and subjectivity are its two main sentiment measures. TextBlob objects can be treated like Python strings that are able to do natural language processing. For the provided textual data, polarity and subjectivity are calculated by TextBlob objects to improve the sentiment score. The two methods above were very useful and improved accuracy by up to 5–7%. Apart from these two methods, we implemented a vectorization method.

In the vectorization method, a vector was created to store the counts of nouns, adverbs, adjectives, and verbs in the provided textual data. This method was implemented with the help of POS tagging (part-of-speech tagging), which is available in the Python library NLTK (Natural Language Toolkit). NLTK deals with textual data and simplifies the work of Python programmers. Its POS tagging method returns a list that pairs each word of the provided textual data with its tag (noun, verb, adverb, adjective, etc.).
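The count vector described above might look like the following sketch. To keep the example self-contained, the tagged input is written by hand in the (word, tag) form with Penn Treebank tags that NLTK's pos_tag function returns; in practice it would come from calling nltk.pos_tag on the tokenized text. The function name pos_count_vector and the ordering of the vector are assumptions.

```python
from collections import Counter

# Penn Treebank tag prefixes for the four word classes that are counted.
TAG_PREFIXES = {"noun": "NN", "verb": "VB", "adjective": "JJ", "adverb": "RB"}


def pos_count_vector(tagged_tokens):
    """Return [nouns, adverbs, adjectives, verbs] counts for (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        for name, prefix in TAG_PREFIXES.items():
            if tag.startswith(prefix):
                counts[name] += 1
                break
    return [counts["noun"], counts["adverb"], counts["adjective"], counts["verb"]]


# Hand-tagged stand-in for the output of a POS tagger.
tagged = [
    ("I", "PRP"), ("really", "RB"), ("love", "VBP"),
    ("long", "JJ"), ("boring", "JJ"), ("exams", "NNS"),
]
vector = pos_count_vector(tagged)
print(vector)
```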

To improve our accuracy by about 2–3%, we implemented a technique called capitalization. It focuses on words written in capital letters, so that we can detect the words that would be emphasized when spoken. Plain text otherwise gives no indication of which words are stressed. This technique proved useful for judging the sense of the textual data.
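One possible form of the capitalization feature is sketched below. The text does not say exactly how capitalized words are detected, so this sketch assumes that a word counts as emphasized when it has at least two letters and is written entirely in upper case; the function name capitalization_features is hypothetical.

```python
import re


def capitalization_features(text):
    """Return (number of all-caps words, share of all-caps words) for a text."""
    words = re.findall(r"[A-Za-z']+", text)
    # Single letters such as "I" or "A" are ignored (an assumption of this sketch).
    shouted = [w for w in words if len(w) > 1 and w.isupper()]
    ratio = len(shouted) / len(words) if words else 0.0
    return len(shouted), ratio


count, ratio = capitalization_features("Oh GREAT, I just LOVE waiting in line")
print(count, ratio)
```

Using both the raw count and the ratio keeps the feature comparable between short and long texts; either value alone would also fit the description.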

A matrix containing the features extracted by the above techniques was created for the whole dataset, and the final step was to apply the naive Bayes algorithm. Naive Bayes is commonly used as a baseline for text categorization. The classifier makes the naïve assumption that all features are conditionally independent given the class. Its simplicity makes it a popular machine learning classifier. The classifier is derived from Bayes’ theorem, which for a class C_k and a feature vector x reads:

$$ P\left( {C_{k} |x} \right) = \frac{{P\left( {x|C_{k} } \right)\, \cdot \,P\left( {C_{k} } \right)}}{P\left( x \right)} $$
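Under the naïve independence assumption, the likelihood in the formula above factorizes over the individual features. The decision rule that follows can be sketched as below, where the number of features n and the predicted class ŷ are symbols introduced here for illustration:

$$ P\left( {C_{k} |x} \right) \propto P\left( {C_{k} } \right)\prod\limits_{i = 1}^{n} {P\left( {x_{i} |C_{k} } \right)} ,\quad \hat{y} = \mathop {\arg \max }\limits_{k} \,P\left( {C_{k} } \right)\prod\limits_{i = 1}^{n} {P\left( {x_{i} |C_{k} } \right)} $$

For numeric features, a Gaussian variant would assume that each feature follows a normal distribution within each class, with a per-class mean and variance (again, notation assumed here):

$$ P\left( {x_{i} |C_{k} } \right) = \frac{1}{{\sqrt {2\pi \sigma_{k,i}^{2} } }}\exp \left( { - \frac{{\left( {x_{i} - \mu_{k,i} } \right)^{2} }}{{2\sigma_{k,i}^{2} }}} \right) $$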

Overall, after applying all of these techniques, an accuracy of 70.96% was obtained.
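The classification step can be illustrated with a small Gaussian naive Bayes written from scratch. This is only a sketch under stated assumptions, not the authors' implementation: the feature rows are invented toy values (a sentiment score and a capitalized-word count), the function names are hypothetical, and a small variance floor is added to avoid division by zero.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the log prior, feature means and feature variances per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one feature row."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)


# Toy feature matrix: [sentiment score, capitalized-word count]; 1 = sarcastic.
X = [[0.8, 2], [0.9, 3], [0.7, 2], [0.1, 0], [-0.2, 0], [0.0, 1]]
y = [1, 1, 1, 0, 0, 0]
model = fit_gaussian_nb(X, y)
print(predict(model, [0.85, 2]), predict(model, [0.0, 0]))
```

Working with log probabilities instead of raw products avoids numerical underflow when many features are multiplied together.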

3 Experimental Results

The performance of the proposed method has been tested on the sarcasm dataset, and its accuracy is compared with that of naïve Bayes, decision tree, and SVM. Table 1 shows that the proposed method outperforms the existing methods. A histogram of the accuracy is also plotted in Fig. 2, from which the effectiveness of the proposed method can be observed.

Table 1 Accuracy of the existing method and the proposed method

Fig. 2 Comparison of various models on sarcasm dataset

The histogram in Fig. 2 shows how the accuracy varies across three executions of the same model.
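An accuracy figure of this kind is typically estimated on held-out data. The sketch below shows one way to do this with k-fold cross-validation, the splitting scheme the paper reports using. The fold layout, the toy data, and the stand-in threshold classifier are all assumptions made for illustration; any classifier with a fit and a predict function could be plugged in.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx); fold i takes every k-th sample starting at i."""
    for fold in range(k):
        test_idx = list(range(fold, n_samples, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx


def cross_val_accuracy(X, y, fit, predict, k=3):
    """Return the test accuracy of each of the k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Hypothetical stand-in classifier: a threshold halfway between the class means.
def fit_threshold(X, y):
    pos = [row[0] for row, label in zip(X, y) if label == 1]
    neg = [row[0] for row, label in zip(X, y) if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, row):
    return 1 if row[0] > threshold else 0


# Toy one-feature dataset; the values are invented.
X = [[0.9], [0.1], [0.8], [0.2], [0.7], [0.3]]
y = [1, 0, 1, 0, 1, 0]
scores = cross_val_accuracy(X, y, fit_threshold, predict_threshold, k=3)
mean_accuracy = sum(scores) / len(scores)
print(scores, mean_accuracy)
```

If the samples are shuffled before splitting, repeated runs can give slightly different accuracies, which is one possible source of run-to-run variation.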

4 Conclusion

Automatic sarcasm detection is a formidable task. This paper offers a novel naïve Bayes method to detect sarcasm in the Amazon Alexa dataset [20]. The dataset is divided into training and test sets using cross-validation techniques. The quality of the features extracted from the training set affects the performance of the technique. Therefore, SentiWordNet and TextBlob have been used to extract important features from the dataset, and the model is trained using those features. The test set is evaluated using the Gauss-based naïve Bayes method and three baseline methods, namely naïve Bayes, decision tree, and support vector machine. The experimental results show that the proposed method outperforms the baseline methods.

Sarcasm is closely related to language- or culture-specific traits. Future approaches to identifying sarcasm in new languages can benefit from identifying such traits.