1 Introduction

Social media has risen to prominence as the most popular platform for voicing one's thoughts, feelings, and opinions. As a result, social media platforms such as Twitter, Facebook, and Instagram generate tremendous volumes of data daily (Srinivasarao and Sharaff 2020). Many entrepreneurs use this information to understand and analyze public perception of a specific individual, product, idea, or entity. Consequently, sentiment analysis of social media content has attracted considerable interest. Sentiment analysis identifies the emotions conveyed in textual data, including online discussion forums, product reviews, social media posts, etc. (Jindal and Aron 2021). Organizations employ tools that mine social media to retrieve people's views and analyze the demand for goods and services. Stock trading enterprises also use emotion analysis strategies to acquire information and assess the impression created by various news articles (Zhao et al. 2020). Although sentiment analysis has achieved remarkable success in a wide variety of fields, a few aspects still need further investigation, one of which is sarcasm identification. Detecting sarcasm is difficult, requiring a thorough understanding of the language and the dialogue system, along with skills such as understanding context and content (Kumaran and Chitrakala 2022). Moreover, correctly verifying the presence of sarcasm in a sentence is difficult even for human beings. Consequently, training a machine to differentiate between non-sarcastic and sarcastic comments is a challenging and emerging research problem (Joshi et al. 2017).

The presence of sarcasm is felt when seemingly encouraging words and feelings in tweets carry slang, unfavorable, or undesirable interpretations (Sarsam et al. 2020). Take the following line as an example: "It smells good. How long did you leave it to marinate?". Most people with a basic knowledge of sarcasm can infer a negative emotion from this sentence: it implies that the person has put on far too much fragrance, which is not a compliment. Machines, however, struggle to identify the figurative nature of sarcasm in the text due to the presence of the positive word "good." Therefore, if not handled appropriately, sarcasm can flip the perceived polarity of a statement, causing an intended negative remark to be read as positive (Liebrecht et al. 2013).

Sarcasm detection can be approached in three ways (Yaghoobian et al. 2021): (a) a machine learning (ML)-based classification approach that leverages learning models; (b) a rule-based system that utilizes corpus-based sentiment lexicons, lexical dictionaries, or publicly accessible sentiment lexicons; and (c) a hybrid strategy that blends machine learning and rule-based systems. Meanwhile, deep learning (DL) approaches have proven their significance in text classification (Ghayoomi and Mousavian 2022), speech recognition (Nassif et al. 2019), and computer vision (He et al. 2022). DL algorithms have been observed to outperform conventional machine learning techniques in detecting sarcasm. Furthermore, integrating CNN and BiLSTM has been shown to yield more reliable and effective results: BiLSTM is effective for gathering long-term dependencies, while CNN is excellent at retrieving local characteristics (Ay Karakuş et al. 2018).

Fig. 1 General Flow of Sarcasm Identification Framework

Figure 1 depicts the general architectural framework of the sarcasm identification task. Identifying sarcasm begins with data preprocessing, which entails tokenizing the data points into individual word tokens. The word tokens in each data point are then uniquely mapped to their indexes in the corpus's vocabulary, i.e., each token is assigned an integer. Finally, the embedding layer translates the integer-encoded tokens into fixed-dimensional feature vectors of real values. Classifier models then process these real-valued feature vectors to classify the raw documents as sarcastic or non-sarcastic. A minimal sketch of this pipeline is given below.
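The following is a hedged sketch of the tokenize, integer-encode, and embed steps, assuming Keras/TensorFlow; the example texts and the maximum length of 50 (taken from Sect. 4.2) are illustrative:

```python
# A minimal sketch of the tokenize -> integer-encode -> embed pipeline,
# assuming Keras/TensorFlow; `texts` and MAX_LEN are illustrative choices.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

texts = ["it smells good how long did you leave it to marinate",
         "the weather is pleasant today"]
MAX_LEN = 50  # maximum text length used later in Sect. 4.2

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)                    # build the vocabulary
sequences = tokenizer.texts_to_sequences(texts)  # word tokens -> integer indexes
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")

# The embedding layer maps each integer index to a dense real-valued vector.
embedding = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=300)
vectors = embedding(padded)   # shape: (2, 50, 300)
```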

Textual representation is essential for many NLP applications. Researchers therefore use word embedding strategies to convert raw textual data into numeric word vectors that ML-based sarcasm identification frameworks can process. Word embedding generates dense feature vectors of an appropriate dimension that preserve the contextual and semantic relations between words in text documents. Four extensively used word embedding strategies are Word2Vec (Mikolov et al. 2013), fastText (Bojanowski et al. 2017), GloVe (Pennington et al. 2014), and BERT (Devlin et al. 2018). The first three are static, whereas BERT is contextual. Previous research has relied on either static or contextual embeddings alone, with a single embedding approach for transforming raw text into numeric vectors. Almost no research has been reported on developing a hybrid deep learning model that incorporates multiple kinds of embeddings (a combination of static and contextual embeddings).

An ensemble of models is a collection of learning models merged in fruitful ways to yield a more evident outcome (Rahman and Verma 2013). Ensemble learning approaches aimed at developing more generalizable models are also gaining traction in sarcasm recognition. However, much of the ensemble learning work focuses on homogeneous ensembles; only a few studies have been reported on heterogeneous ensembles (merging diverse models), and almost all of these use different deep learning models (base CNN, RNN variants) with the same or varied embeddings. None of this work focuses on designing an efficient deep learning model and applying it to varied word embeddings (including static and contextual embeddings).

The following are the primary contributions of the proposed investigation:

  1. We propose a novel hybrid deep learning strategy by concatenating the features from heterogeneous word representations (including static and contextual) with the aid of the BiLSTM-CNN.

  2. We propose a novel stacking-based ensemble strategy with heterogeneous word embeddings (including static and contextual) utilizing the BiLSTM-CNN.

  3. Our proposed models enhance sarcasm recognition and generate improved insights after training and evaluation on two publicly available datasets.

The remainder of the paper is organized as follows. We commence with a review of the literature in Sect. 2, followed by Sect. 3, which presents the framework of the proposed fusion and stacking ensemble approaches. Section 4 covers the experimental setup, results, and comparison with earlier research. Finally, Sect. 5 concludes the paper and suggests future research.

2 Related work

This section examines current state-of-the-art methodologies pertinent to the proposed work, focusing primarily on sarcasm detection using hybrid and ensemble approaches. The literature is discussed under four subheadings: the first and second concentrate on existing sarcasm recognition strategies that use static and contextual word embeddings, respectively, while the third and fourth cover existing hybrid-based and ensemble-based text classification works.

2.1 Sarcasm detection with static word embeddings

Azwar et al. (2020) use a six-layer MCAB-BLSTM (multi-channel attention-based BiLSTM) network to recognize sarcasm in news headlines. Two attention-based BiLSTM networks run concurrently: one utilizes GloVe word embeddings, the other fastText. The reported accuracy is 96.64%. Kumar et al. (2019) present sAtt-BLSTM convNet, an eight-layer model hybridizing sAtt-BLSTM (BiLSTM plus soft attention) and convNet (CNN). Their method uses GloVe to produce meaningful word embeddings. Experiments on two datasets produce accuracies of 97.87% on a balanced Twitter corpus and 93.71% on an unbalanced random-tweet corpus.

Misra and Arora (2019) introduce a novel dataset of news headlines to address the failings of Twitter data. This study presents an attention-based hybrid neural architecture that uses Word2vec embeddings as input and improves accuracy by approximately 5% over the baseline. On the SARC dataset, Mehndiratta and Soni (2019) investigated the behavior of several hyperparameters, including epochs, data size, and dropout, for each approach (CNN, LSTM, and a CNN/LSTM blend). This study also examined the influence of word embeddings (fastText and GloVe).

2.2 Sarcasm detection with contextual word embeddings

Bhardwaj and Prusty (2022) present a novel strategy in which BERT is used to preprocess the sentences before they are fed into a blended deep learning model for training and classification. They achieved 99.63% accuracy with tenfold cross-validation on the news headlines dataset. The LMTweets encoder model, introduced by Ahuja and Sharma (2022), captures the dataset's features after training on 500,000 tweets crawled from various social media sites. A CNN then uses the retrieved features to identify each sentence as ironic/non-ironic and sarcastic/non-sarcastic. Six transformer models, six deep learning models, and five machine learning techniques were used in the study. According to the results, the LMTweets + CNN model outperforms all other models tested, with accuracies of 88.3% on the SemEval 2018 Task 3.A corpus, 95.9% on the Riloff corpus, and 80.9% on the SARC (political) corpus.

Shrivastava and Kumar (2021) suggest an innovative BERT-based strategy to represent the text and assess whether it is sarcastic. Several hyperparameters were examined, and the proposed model's F1-score of 69.64% is compared against a set of benchmarks. Potamias et al. (2020) present RCNN-RoBERTa, a hybrid of a recurrent CNN and RoBERTa evaluated on four standard repositories. This model surpasses all other state-of-the-art strategies examined, including XLNet, BERT, USE, and ELMo, on all criteria, some by a considerable margin.

2.3 Hybrid-based text classification works

Pandey and Singh (2023) propose a hybrid model consisting of BERT stacked with LSTM. For a code-mixed English-Hindi dataset, the authors used BERT to create embeddings, and an LSTM network then consumed the resulting vectors. The proposed model achieved an accuracy of 92%, an improvement of nearly 6% over the other baseline methods. Eke et al. (2021) apply three strategies to three benchmark repositories to present a context-based feature solution for sarcasm detection. BiLSTM on GloVe embeddings is used in the first strategy, whereas BERT is used in the second. The third model, on the other hand, employs a feature-fusion strategy that combines BERT and GloVe embedding features with sentiment-related and syntactic features, classified with traditional machine learning. To design a hybrid model for discovering rumors in a given tweet, Albahar (2021) combines an SVM with an RNN with BiGRU: the RNN with BiGRU is utilized in the first stage for learning features, and the SVM in the second stage for classification.

To handle mixed inputs, Yuan et al. (2020) design a generic framework of hybrid deep neural networks (HDNNs), an aggregation of multiple networks (CNN and MLP), demonstrating versatility and adaptability. The HDNN model outperforms the standalone MLP and CNN models in accuracy and generalization. Using fastText and character-level embeddings with CNN and LSTM models, Salur and Aydin (2020) introduce an innovative hybrid strategy for opinion analysis, evaluated on a dataset of 17,289 Turkish tweets with an accuracy of 82.14%.

2.4 Ensemble-based text classification works

Praseed et al. (2023) present an ensemble approach for identifying fake news in Hindi, comprising three transformer models: XLM-RoBERTa, ELECTRA, and mBERT. The suggested model improves efficacy by overcoming the shortcomings of the individual transformer models. Goel et al. (2022) suggest an ensemble model that employs CNN, BiLSTM, and GRU as baselines, obtaining word embeddings from Word2vec, GloVe, and fastText. Using two standard repositories to train and validate the model, they examined three ensemble variants and reported that the GloVe + weighted-average ensemble performed best on the datasets tested. Subba and Kumari (2022) propose a computationally efficient stacking ensemble-based sentiment classification strategy that uses several word embeddings (GloVe, BERT, and Word2vec), multiple deep base-level classifiers (LSTM, GRU, and BiGRU), and an LSTM-based meta-level classifier. They tested their approach on four standard corpora and claimed that it outperformed the strategies reported in the literature.

Fig. 2 Preprocessing Pipeline

Gundapu and Mamidi (2021) provide an ensemble method for detecting fake news that combines three transformer architectures (BERT, ALBERT, and XLNET). This model achieved an F1-score of 0.9855 after being trained and assessed on the ConstraintAI 2021 shared task (Patwa et al. 2021). To identify idioms and literals in an in-house dataset of 1470 data points, Briskilal and Subalalitha (2022) recommend an ensemble of BERT and RoBERTa models. The suggested model has 2% greater accuracy than the benchmarks.

In summary, several hybrid and ensemble approaches have been developed for the sarcasm detection task. Most of these approaches use a single word embedding strategy for converting raw text into numerical vectors; none focuses on developing an efficient hybrid model with multiple word embeddings. Also, only a few works have concentrated on stacking ensembles, and almost all use homogeneous ensembles or a single word embedding strategy with multiple deep learning models. To bridge this gap, the hybrid and stacking ensemble models proposed in this article use three state-of-the-art heterogeneous word embeddings (including static and contextual) together with the BiLSTM-CNN.

3 Methodology

3.1 Data preprocessing

The aim of this submodule is to preprocess the raw text sentences with NLP techniques and prepare them for the subsequent step of extracting valuable features. Figure 2 depicts the preprocessing pipeline. Each tokenized data point is sent through the pipeline to remove or normalize uninformative tokens in the sarcasm recognition dataset. The pipeline comprises the following steps:

  • Handling Hashtags: After identifying hashtags with the pound (#) sign, we separated # from the hashtag. Example: #TheProudFamily \(\rightarrow \) TheProudFamily.

  • Stemming: We stripped off inflectional morphemes such as "est", "ed", "ing", and "s" from the token stems using the Snowball stemmer (also known as Porter2). Example: words like "interested," "interesting," and "interests" are reduced to their base form, "interest".

  • Text cleaning: We perform this step to discard irrelevant data, removing URLs, punctuation marks, HTML tags, digits, non-ASCII glyphs, and special characters from each data point.

We carried out every preprocessing step for the SARC dataset; the hashtag-handling step was skipped for the news headlines because headlines do not include hashtags. A minimal sketch of the pipeline is given below.
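The following hedged sketch implements the three steps above, assuming NLTK's Snowball stemmer; the helper name `preprocess` and the regular expressions are illustrative choices, not our exact implementation:

```python
# A hedged sketch of the preprocessing steps in Fig. 2, assuming NLTK's
# Snowball ("Porter2") stemmer; `preprocess` is an illustrative helper.
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def preprocess(text: str) -> str:
    text = re.sub(r"#(\w+)", r"\1", text)            # #TheProudFamily -> TheProudFamily
    text = re.sub(r"https?://\S+", " ", text)        # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)             # drop HTML tags
    text = text.encode("ascii", "ignore").decode()   # drop non-ASCII glyphs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)         # drop digits/punctuation/specials
    tokens = [stemmer.stem(tok) for tok in text.lower().split()]
    return " ".join(tokens)

print(preprocess("Totally #interested in this... <b>GREAT</b> idea http://t.co/x"))
# -> "total interest in this great idea"
```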

3.2 Word embedding

Fig. 3 BiLSTM-CNN Model

Computer algorithms require quantitative data, so text must be expressed numerically before algorithms can handle it. The numerous ways of doing this fall into three groups: count-based methods (count vectorization, TF-IDF), static word embeddings (Word2vec, GloVe, fastText), and contextual embeddings (ELMo, BERT). Word representations are necessary for many NLP tasks, and effective representations enhance text encoding and classification capabilities. The core premise behind Word2vec is that, rather than expressing words as one-hot representations (count vectorization/TF-IDF) in a high-dimensional space, words are represented in a dense low-dimensional space so that similar words receive similar word vectors and can be mapped to their nearest neighbors. However, as a window-based paradigm, Word2vec neither utilizes information from the entire document nor acquires sub-word information; it also does not address word-sense disambiguation. The former problems are tackled by GloVe and fastText, while the latter is handled by BERT.

This module includes three pre-trained word embeddings: GloVe, fastText, and BERT, yielding three different kinds of word embeddings. We generated two sets of 300-dimensional word embeddings using pre-trained GloVe and fastText models with 300-dimensional word vectors. Similarly, a third set of 768-dimensional word embeddings is generated using BERT-base, the pre-trained base variant of BERT. The word vectors generated by these three embeddings are given as input to the BiLSTM-CNN. A sketch of loading the static embeddings is given below.
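As an illustration, the following sketch loads pre-trained 300-dimensional GloVe vectors into an embedding matrix; the file name `glove.840B.300d.txt` is an assumption (the exact pre-trained model is not restated here), and `word_index` stands in for the tokenizer vocabulary built during preprocessing:

```python
# A minimal sketch of building a static embedding matrix from pre-trained
# GloVe vectors; file name and toy vocabulary are illustrative assumptions.
import numpy as np

EMB_DIM = 300
word_index = {"smells": 1, "good": 2, "marinate": 3}   # toy vocabulary

embeddings_index = {}
with open("glove.840B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings_index[parts[0]] = np.asarray(parts[1:], dtype="float32")

# One row per vocabulary index; out-of-vocabulary words remain all-zero.
embedding_matrix = np.zeros((len(word_index) + 1, EMB_DIM))
for word, idx in word_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[idx] = vec
```

fastText vectors are loaded the same way; BERT, in contrast, produces its 768-dimensional vectors dynamically from its sequence output, so no static matrix is built for it.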

3.3 BiLSTM-CNN model

The deep learning model is a sequential combination of BiLSTM and CNN. CNN and LSTM have delivered efficient results in several applications: CNN is better at handling short phrases and capturing inter-dependencies among all conceivable word combinations, while LSTM can retrieve long-term correlations among word sequences. Both capabilities contribute to the problem of categorizing sarcasm, and our classification results confirm that combining the two yields considerably superior outcomes.

As shown in Fig. 3, our BiLSTM-CNN takes word embeddings as input and begins with a BiLSTM layer with a hidden dimension of 128. Three concurrent convolutional layers with kernel sizes of 2, 3, and 4 and 128 filters each are applied to this output. The output of each convolutional layer feeds a max-pooling layer, and the max-pooled outputs are combined into a single feature map of freshly extracted features. This feature map is then fed into a convolutional layer with 256 filters and a kernel size of 5, followed by an average-pooling layer. A hedged sketch of this architecture follows.
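The sketch below wires up this architecture in Keras under one plausible reading of the pooling steps (local max-pooling after each parallel convolution, then global average-pooling at the end); the pool sizes are assumptions, as they are not stated above:

```python
# A sketch of the BiLSTM-CNN in Fig. 3, assuming Keras; pool sizes are
# illustrative assumptions.
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(50, 300))                       # embedded tokens
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inp)

branches = []
for k in (2, 3, 4):                                       # parallel conv branches
    b = layers.Conv1D(128, kernel_size=k, activation="relu")(x)
    b = layers.MaxPooling1D(pool_size=2)(b)
    branches.append(b)
merged = layers.Concatenate(axis=1)(branches)             # one combined feature map

y = layers.Conv1D(256, kernel_size=5, activation="relu")(merged)
features = layers.GlobalAveragePooling1D()(y)             # 256 features per text

model = Model(inp, features)
model.summary()
```

The global average-pooling at the end yields the 256 features per branch referred to in Sect. 3.4.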

3.4 Proposed hybrid model

Fig. 4 Overview of the architecture of Proposed Hybrid Model

This section briefly presents the proposed hybrid-based sarcasm identification framework. As illustrated in Fig. 4, the overall framework comprises several sub-modules, each of which performs a specific task. Initially, the data pre-processing module pre-processes the raw textual data by performing the steps described in Sect. 3.1. The cleaned data are then given separately to the three word embedding techniques described in Sect. 3.2, two static and one contextual. In the first two branches, GloVe and fastText use their pre-trained embeddings to generate the word vectors; in the third branch, the word vectors are obtained from the sequence output of BERT. The word vectors from each embedding technique are given to the BiLSTM-CNN model separately. All three branches use the same BiLSTM-CNN architecture (described in Sect. 3.3) to acquire three separate sets of contextual features, so each branch yields 256 features.

While the GloVe and fastText branches accept NumPy input, the input type for BERT is a tensor. We adopted the idea of Yuan et al. (2020) to deal with these mixed input types, merging all the branch outcomes to obtain 768 features. These features are then sent to fully connected dense layers. Instead of a sigmoid activation function at the output layer, we use an SVM to classify the features acquired from the hidden layers, since SVM is a robust approach for resolving real-world binary classification tasks (Cervantes et al. 2020). A sketch of this fusion wiring follows.
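The following hedged sketch shows only the fusion wiring; random arrays stand in for the per-branch BiLSTM-CNN features, and the dense width of 128 is an illustrative assumption:

```python
# A hedged sketch of the fusion step: three 256-d feature sets are
# concatenated into 768 features, passed through a dense layer, and the
# hidden representation is classified with an SVM.
import numpy as np
from tensorflow.keras import layers, Model
from sklearn.svm import LinearSVC

n = 64
Xg, Xf, Xb = (np.random.rand(n, 256).astype("float32") for _ in range(3))
y = np.random.randint(0, 2, size=n)               # placeholder labels

f_glove, f_fasttext, f_bert = (layers.Input(shape=(256,)) for _ in range(3))
fused = layers.Concatenate()([f_glove, f_fasttext, f_bert])   # 768 features
hidden = layers.Dense(128, activation="relu")(fused)          # dense layer
fusion_net = Model([f_glove, f_fasttext, f_bert], hidden)

# The SVM replaces the usual sigmoid output layer.
features = fusion_net.predict([Xg, Xf, Xb])
svm = LinearSVC().fit(features, y)
print(svm.score(features, y))
```

In practice the dense layers are trained first and the SVM is fitted on their activations; the sketch only illustrates the data flow.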

3.5 Proposed ensemble model

Figure 5 illustrates the structure of the suggested stacking ensemble-based sarcasm recognition framework that uses BiLSTM-CNN with heterogeneous word embeddings. The entire framework comprises various modules, just like the hybrid approach: we follow precisely the data preprocessing procedure outlined in Sect. 3.1, use the word embedding strategies outlined in Sect. 3.2, and adopt the BiLSTM-CNN architecture outlined in Sect. 3.3.

Fig. 5 Overview of the architecture of proposed stacking ensemble model

Fig. 6 A schematic diagram showing the transformation of textual documents into word-embedded matrices

The stacking ensemble module consists of the BiLSTM-CNN with three kinds of heterogeneous embeddings as the three base-level classifiers and a GRU-based meta classifier. Because SVM is successful in binary classification, we utilize it as the classifier at the output layer instead of sigmoid in all the base-level classifiers. The preprocessed, integer-encoded data points from the data preprocessing module are given as input to the base-level classifiers.

To avoid updating the embedding layer's weights after they have been learned, the "trainable" parameter of the first two base-level classifiers' embedding layers is set to False, as sketched below. BERT's parameters are likewise kept fixed so that its weights do not change while the BiLSTM-CNN model is trained. Figure 6 presents a schematic illustration of the process flow for converting the text documents into the corresponding word-embedding matrices.
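A minimal illustration of keeping the pre-trained weights fixed, assuming Keras; the matrix here is a stand-in for the GloVe/fastText matrices built earlier:

```python
# Freezing a pre-trained embedding layer so its weights are not updated.
import numpy as np
from tensorflow.keras.layers import Embedding

embedding_matrix = np.random.rand(10000, 300)    # placeholder pre-trained matrix
frozen_embedding = Embedding(input_dim=embedding_matrix.shape[0],
                             output_dim=300,
                             weights=[embedding_matrix],
                             trainable=False)    # weights stay fixed

# For the BERT branch (assuming the HuggingFace transformers wrapper):
# bert_model.trainable = False
```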

The integer-encoded data points obtained from the data preprocessing module are divided into training (\(DS_{train}\)) and test (\(DS_{test}\)) sets at a ratio of 80%:20% to evaluate the suggested ensemble model. The base classifiers are trained on 90% of the training set, and the remaining 10% is used to validate them. Each trained base classifier is then evaluated on the test set (\(DS_{test}\)). A single data frame (\(DS^{f}\)) contains four columns: the first three hold the outcomes of the base classifiers on the test data (\(DS_{test}\)), and the fourth holds the labels of the test data (\(DS_{test}\)). \(DS^{f}\) is then partitioned into training (\(DS_{train}^{f}\)) and test (\(DS_{test}^{f}\)) sets with an 80%:20% ratio. Thereafter, 90% of \(DS_{train}^{f}\) is used to train the GRU-based meta classifier, and the remaining 10% to validate it. The trained meta classifier is assessed on \(DS_{test}^{f}\) to determine how well the recommended stacking ensemble-based classifier succeeds overall. A sketch of this procedure follows.
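To make the data flow concrete, the following hedged sketch stubs the three trained base classifiers with random probabilities and trains a small GRU meta classifier on the resulting data frame; the column names, GRU width, and epoch count are illustrative:

```python
# A hedged sketch of the stacking procedure; random probabilities stand in
# for the outputs of the three trained BiLSTM-CNN base classifiers.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models

n = 1000
y_test = np.random.randint(0, 2, size=n)             # labels of DS_test
base_preds = [np.random.rand(n) for _ in range(3)]   # stand-in base outputs

# DS^f: three columns of base-classifier outcomes plus the test labels.
ds_f = pd.DataFrame({"glove": base_preds[0], "fasttext": base_preds[1],
                     "bert": base_preds[2], "label": y_test})

X = ds_f[["glove", "fasttext", "bert"]].to_numpy().reshape(-1, 3, 1)
train_X, test_X, train_y, test_y = train_test_split(X, ds_f["label"],
                                                    test_size=0.2)  # 80%:20%

meta = models.Sequential([layers.GRU(32, input_shape=(3, 1)),
                          layers.Dense(1, activation="sigmoid")])
meta.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
meta.fit(train_X, train_y, validation_split=0.1, epochs=5)  # 90%/10% split
print(meta.evaluate(test_X, test_y))
```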

4 Experimental results and discussion

Numerous tests were carried out to demonstrate the viability of the suggested models. Their efficacy is presented in this section using a variety of evaluation criteria, including accuracy, F1-score, precision, and recall. We performed experiments in a Jupyter notebook running Python 3.9 from the Anaconda distribution on Ubuntu 20.04.4 LTS with an Intel Xeon (R) CPU (W-2133), 64 GB of RAM, and a Quadro RTX 4000 GPU. Section 4.1 describes the datasets used for experimentation; hyperparameters, evaluation metrics, baselines, and results are covered in Sects. 4.2, 4.3, 4.4, and 4.5, respectively.

4.1 Dataset

We have made use of two independent corpora in this work: one containing news headlines and the other Reddit comments. This subsection presents an overview of these two corpora.

4.1.1 News headlines dataset

To overcome the noise limits of Twitter datasets, Misra and Arora (2019) built a news headlines dataset from two famous news sources, TheOnion and HuffPost. The dataset is in JSON format and can be acquired from Kaggle. There are 28,619 news headlines in the collection, of which 47.6% are sarcastic and 52.4% are not. Each data point has three attributes: a binary variable that expresses whether or not the headline is sarcastic, the news headline itself, and the headline's URL. We excluded the URL because the objective is to judge whether a headline is sarcastic.

4.1.2 SARC dataset

Khodak et al. (2017) developed the Self-Annotated Reddit Corpus (SARC), an enormous dataset for sarcasm identification. The dataset comprises a portion of Reddit comments posted between January 2009 and April 2017 and can be downloaded from Kaggle. Each record includes a sarcasm label, the author, the subreddit where the comment first appeared, the user-voted comment score, the date of the comment, and the parent comment. The corpus contains 1.3 million sarcastic utterances and many more non-sarcastic ones. We considered only comments with at least ten words, giving 441,637 comments in total: 225,974 sarcastic and 215,663 non-sarcastic.

4.2 Hyperparameters

Hyperparameters are the settings the user makes to regulate the learning process; optimal hyperparameters must be chosen during training for learning algorithms to produce the best results. Table 1 presents the hyperparameters employed in our suggested approaches. We settled on these values after examining how the suggested methods performed with various sets of hyperparameters.

Table 1 Hyperparameters used for all the experiments

Most headline/comment lengths range from 10 to 50 words in both datasets, so we set the maximum text length to 50. Texts with fewer than 50 words are padded with zeros at the end to guarantee a uniform length of 50, and headlines/comments with more than 50 words are truncated to the first 50 words. These changes did not affect overall performance because few headlines/comments had longer word counts. For each data point in both datasets, GloVe, fastText, and BERT produce embeddings with the following dimensions:

  • GloVe and fastText: a 2-D embedding matrix of dimension \(50\times 300\).

  • BERT: a 2-D embedding matrix of dimension \(50\times 768\).

We used the ModelCheckpoint and EarlyStopping features of the Keras library to save the best model during training, as sketched below. Different learning rates, including 2e-5, 3e-5, and 4e-5, were tested; the optimal value is 2e-5. We also experimented with dropouts of 0.2, 0.3, and 0.4, of which 0.2 is optimal. For the news headlines dataset we used a batch size of 32, and for the SARC dataset a batch size of 64. Table 2 displays the complete counts of the train, validation, and test sets.

Table 2 Dataset statistics
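A sketch of this training configuration follows, assuming the Keras callbacks named above; the monitored metrics, patience value, and checkpoint file name are illustrative assumptions:

```python
# A hedged sketch of the training configuration with Keras callbacks.
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

callbacks = [
    ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                    save_best_only=True),               # keep the best model
    EarlyStopping(monitor="val_loss", patience=3,
                  restore_best_weights=True),           # stop when stalled
]

# Assumed usage with the BiLSTM-CNN model from Sect. 3.3:
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
#               loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=32,            # 64 for SARC
#           validation_data=(X_val, y_val), callbacks=callbacks)
```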

4.3 Evaluation metrics

Quantifiable metrics have been devised to assess the results of the classification algorithms. Accuracy and F1-score were utilized as the assessment measures because the task requires binary classification (1: sarcastic, 0: non-sarcastic). Accuracy is the percentage of correctly predicted data points, whereas the F1-score combines precision and recall into a single statistic as their harmonic mean. The formulas for computing accuracy, precision, recall, and F1-score are shown in Eqs. (1)-(4).

$$\mathrm{Accuracy} = \frac{\mathrm{Tr.P.} + \mathrm{Tr.N.}}{\mathrm{Tr.P.} + \mathrm{Tr.N.} + \mathrm{Fal.P.} + \mathrm{Fal.N.}}$$
(1)
$$\mathrm{Precision} = \frac{\mathrm{Tr.P.}}{\mathrm{Tr.P.} + \mathrm{Fal.P.}}$$
(2)
$$\mathrm{Recall} = \frac{\mathrm{Tr.P.}}{\mathrm{Tr.P.} + \mathrm{Fal.N.}}$$
(3)
$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Tr.P.}}{2 \cdot \mathrm{Tr.P.} + \mathrm{Fal.P.} + \mathrm{Fal.N.}}$$
(4)

where

  • Tr.P. (True Positive) is text correctly identified as sarcastic, i.e., it originally belongs to the sarcastic category.

  • Tr.N. (True Negative) is text originally belonging to the non-sarcastic class for which the model predicts the same.

  • Fal.P. (False Positive) is text originally categorized as non-sarcastic but predicted to be sarcastic.

  • Fal.N. (False Negative) is text mistakenly classified as non-sarcastic although it falls under the sarcastic category.
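For concreteness, the following snippet computes the four metrics of Eqs. (1)-(4) with scikit-learn on a small made-up prediction vector:

```python
# Computing the metrics of Eqs. (1)-(4) with scikit-learn; the labels and
# predictions here are illustrative.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1: sarcastic, 0: non-sarcastic
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # (Tr.P.+Tr.N.)/all
print("Precision:", precision_score(y_true, y_pred))  # Tr.P./(Tr.P.+Fal.P.)
print("Recall   :", recall_score(y_true, y_pred))     # Tr.P./(Tr.P.+Fal.N.)
print("F1-score :", f1_score(y_true, y_pred))         # harmonic mean
```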

4.4 Baselines

Representing the texts accurately is one of the primary goals of a text classification task. GloVe, fastText, and BERT are the three word embedding strategies used in this work to represent the datasets. The efficacy of the proposed fusion and ensemble models was compared with 18 deep learning baselines, named M-1, M-2, ..., M-18; each is a CNN or RNN variant or a combination of both. Three deep neural techniques (CNN, BiLSTM, and BiGRU), serving as feature extractors and classifiers, make up the 18 deep learning models. We adopted the notations M-4-A, M-10-B, and M-16-C since M-4, M-10, and M-16 are the stems of our proposed approaches, as shown in Figs. 4 and 5. Further, the proposed fusion model is denoted M-Hybrid, and the proposed ensemble model M-Ensemble.

We tried six combinations with each of the word embedding approaches. For the CNN and BiLSTM we used the structures discussed in Sect. 3.3, whereas for the BiGRU we used a hidden dimension of 128 in both directions.

4.5 Results and discussion

Tables 3, 4, 5, 6, and 7 detail the classification performance of the presented hybrid and ensemble models against the baselines. Compared to models with GloVe and fastText embeddings, those with BERT embeddings produce better outcomes, while the GloVe and fastText models yield fairly comparable results. Tables 3, 4, and 5 show that the BiLSTM-CNN model produces superior outcomes for all the static and contextual word embeddings considered; the combinations M-4-A, M-10-B, and M-16-C prove the same. Hence, we integrated these three pairs in the M-Hybrid and M-Ensemble frameworks. We built a framework with two inputs and binary output using the Keras functional API to compare the classification performance of the M-Hybrid and M-Ensemble frameworks with that of M-4-A, M-10-B, and M-16-C.

Table 3 Results (%) of deep learning models with GloVe embeddings and SVM as the last layer for classification
Table 4 Results (%) of deep learning models with fastText embeddings and SVM as the last layer for classification
Table 5 Results (%) of deep learning models with sequence output from BERT and SVM as the last layer for classification
Table 6 Results (%) of proposed M-Hybrid, M-Ensemble, and baselines on news headlines with BiLSTM-CNN as the model combination
Table 7 Results (%) of proposed M-Hybrid, M-Ensemble, and baselines on SARC with BiLSTM-CNN as the model combination
Fig. 7 Confusion matrices of M-Hybrid and M-Ensemble on News Headlines Dataset

Fig. 8 Confusion matrices of M-Hybrid and M-Ensemble on SARC Dataset

The comparative results of the suggested models on the news headlines and SARC datasets are shown in Tables 6 and 7. On the news headlines dataset, Table 6 demonstrates that the M-Hybrid and M-Ensemble models, with 95.70% and 96.76% accuracy, respectively, clearly surpass all other standalone models. Table 7 shows that the suggested models also perform well on the SARC dataset, with accuracies of 80.64% and 81.46%, respectively. Figures 7 and 8 depict the M-Hybrid and M-Ensemble confusion matrices on both datasets. The number of data points in the confusion matrix is much smaller for M-Ensemble than for M-Hybrid because 20% of the data was first held out as test data for the base classifiers, and 20% of those predictions were in turn held out to test the meta classifier. Figure 9 gives a graphical depiction of the suggested approaches and their stems for both datasets. Table 8 compares the suggested techniques with relevant state-of-the-art frameworks reported in the literature. On the news headlines dataset, the recommended frameworks perform noticeably better than previous frameworks, and the M-Ensemble technique outperforms all others on the SARC dataset.

Fig. 9 Comparative performance of proposed models with their stems and other state-of-the-art approaches on both datasets

Table 8 Performance (%) comparison of the proposed methods with state-of-the-art methods

The proposed approaches outperform the earlier ones because they use three distinct kinds of word representations supplied by GloVe, fastText, and BERT. This allows the suggested frameworks to accurately represent the contextual dependencies between words in the texts, improving their predictions. Another factor is that the proposed frameworks use an effective deep learning model, a sequential combination of BiLSTM and CNN, rather than the base forms of CNN and RNN used in previous studies.

5 Conclusion

In this paper, we have proposed novel hybrid and stacking-based ensemble models for sarcasm identification with heterogeneous word embeddings and the BiLSTM-CNN. The hybrid model extracts three sets of features by running the BiLSTM-CNN with three heterogeneous word embeddings: GloVe, fastText, and BERT. The extracted feature sets are fused and sent to an SVM classifier for classification. In the proposed stacking ensemble-based framework, on the other hand, the BiLSTM-CNN model with the three types of heterogeneous word embeddings forms the three base-level classifiers. The base-classifier probabilities are then combined into a single data frame, which is subsequently used to train the GRU meta classifier. The hybrid model produced promising results, with 95.70% accuracy on the news headlines repository and 80.64% on the SARC repository. The stacking ensemble-based model performed even better, reaching 96.76% on the news headlines and 81.46% on SARC, and has outperformed all reported works in this area on both datasets. The experiments suggest that combining heterogeneous word embeddings enhances sarcasm identification performance.

As sarcasm detection is a vast and fascinating field, much remains to be explored. The popularity of typographic images (text portrayed as an image) further shows off the expressiveness of online social data, and sarcasm detection in them is a fascinating area for future research. Additionally, the use of code-mixed and code-switched language, intentional ambiguity, unique vocabulary, and "crowd-sourced" or "self-tagged" datasets makes this a vibrant field of study with many research challenges.