1 Introduction

In daily life, people often use sarcasm to express a negative meaning while the literal meaning is affirmative. Take the sentence "You are so right": said with a smile and a sincere tone, it is genuine praise, but accompanied by an exaggerated expression, a contemptuous smile, slow speech and an abnormally high tone, it is most likely negative and disagreeing. Social media is flooded with such comments on products, policies and so on, and identifying their true intent has far-reaching significance. An error in sarcasm detection can have the opposite effect: if an opinion on a product is actually negative but the detection result reads it as a compliment, the situation only gets worse. Therefore, using advanced and efficient methods to study sarcasm detection is of great significance.

Sarcasm has metaphorical characteristics [14] and is often associated with wisdom [34], which makes it difficult to detect [21]. A large amount of research on single-modal sarcasm detection has been carried out [19, 30,31,32], for example, detecting sarcasm in the text modality by capturing the inconsistency of the textual context [34, 39], or judging whether a facial expression is a sincere smile or a contemptuous fake smile by analysing facial expression units. Intuitively, detection based on a single modality cannot outperform detection that considers multiple modalities comprehensively [35, 42]. As long as the inconsistency among text, intonation and facial expression can be detected, the detection accuracy should be greatly improved [25]. However, the inconsistency among the modalities also increases the difficulty of information fusion.

Previous sarcasm detection methods were based on rules and statistical knowledge, manually extracting special vocabulary and punctuation as features. However, such methods require professional domain knowledge and are not robust: different rules and feature-extraction schemes must be designed for different scenarios, so the cost is high. Deep learning has been successful in many other fields, and intuitively it can automatically learn features that are useful for sarcasm detection; when the training samples are comprehensive enough, deep learning should perform better. Therefore, this paper proposes a multi-level multi-modal fusion network with residual connections built on late fusion, which improves the accuracy of sarcasm detection on several data sets. The contributions of our work are as follows:

  (a) We propose a fusion network model with residual connections based on late fusion;

  (b) The proposed model is applied to the MUStARD data set for the first time and obtains competitive performance;

  (c) A more reasonable speaker-independent experimental setup is proposed for the MUStARD data set, which lays a foundation for exploring deep multi-modal fusion on a small data set.

The main work of this article is to transfer a variety of multi-modal fusion methods to detection on the MUStARD data set and to carry out sufficient ablation experiments. The data set is small and sarcasm detection is very difficult; as the saying goes, "even a clever woman cannot cook without rice." Especially in the speaker-independent experimental setting, the improved late-fusion network with residual connections achieves good results. The rest of this paper is organized into seven sections: Section 2 is a literature review of sarcasm detection methods. Section 3 describes our proposed model. Section 4 introduces the data set, the data preprocessing methods and the feature extraction methods we use. Section 5 introduces the details of the three groups of experimental settings and the baseline models. Section 6 presents the detection results of multiple models under three experimental settings. Section 7 reports results on several other data sets and analyses the detection errors in the experiments. Section 8 discusses future work and directions for improvement.

2 Related work

2.1 Multi-modal sarcasm data set

Available multi-modal sarcasm data sets are very rare. Schifanella et al. [33] collected pictures and texts from three social platforms to create a data set, and used SVM to explore multi-modal sarcasm detection for the first time. Castro et al. [5] produced the sarcasm TV-series data set MUStARD, and used SVM for classification after concatenating text, audio and image features. However, this method is only early fusion, which inevitably loses semantic information from the different modalities. Moreover, the support vector machine has limited capacity to model complex sarcastic semantics.

2.2 Single-modal detection methods

Early sarcasm detection work mainly focused on a single modality [11, 13], with detection methods divided into traditional machine learning methods [18] and deep learning methods [48]. Many text-modality detection methods are based on recurrent neural networks such as LSTM [1] and GRU, until the emergence of BERT [10], which achieved the best performance on multiple natural language processing tasks; trained on a large number of samples, BERT can better capture the deep semantic features of text. Many image-modality detection methods are based on CNN networks [16]. Wang et al. [37] propose a simple yet efficient Self-Cure Network that effectively improves the accuracy of facial expression detection. However, single-modal detection cannot adequately handle sarcasm detection in real scenes because it lacks information from the other modalities, so multi-modal detection is the trend of future research.

2.3 Multimodal fusion method

Existing multi-modal fusion methods include early fusion [38, 26], which directly concatenates the single-modal features and feeds them to a classifier [33]. This method ignores the inconsistencies between the modalities; direct concatenation of features from different semantic spaces causes information loss and cannot effectively handle information redundancy and complementarity. Some researchers have therefore proposed late fusion [4]: the features of each modality are selected and extracted separately and then projected into a unified semantic space by a respective fusion network. This step effectively resolves the inconsistency between the modalities, but some useful information is inevitably lost during the separate fusion processes. Pereira et al. [28] proposed a multi-modal method that integrates the facial expressions, audio features and transcripts of hosts and reporters to estimate the degree of tension in news videos and achieved good results, but the model's ability to fuse semantic information across modalities is limited. Cai et al. [3] proposed a hierarchical fusion model that combines text features, image features and image attributes. Xu et al. [40] proposed a cross-modal attention semantic comparison and relational network fusion model, in which the decomposition of relation networks can provide relevant evidence for interpreting sarcasm detection. However, neither of these two models comprehensively considers the collaborative fusion of the original modal semantic spaces with the unified semantic space after fusion. Therefore, we propose an improved fusion algorithm with residual connections [15]. Our idea is that the internal representation obtained by late fusion is repeatedly spliced with the original single-modal features and fused again to obtain the best detection performance. Intuitively, the result of multi-modal comprehensive analysis needs to be compared against the features of each single modality, so that collaborative analysis can draw better conclusions. Extensive experiments on the MUStARD data set show that our idea is feasible.

2.4 Multi-task and multi-modal fusion method

Chauhan et al. [7] proposed a multi-modal multi-task learning framework that uses inter-task and inter-class attention to model the relationships between different types of tasks. Similarly, Chauhan et al. [8] used the MUStARD data set and proposed a multi-task learning framework with two attention mechanisms under additional explicit and implicit emotion labels, showing that emotion and sentiment help to improve sarcasm detection. Yu et al. [41] built a multi-modal data set from film and television works and proposed a multi-modal multi-task learning framework that, by labelling each modality individually, learns the differences and complementary features between modalities. However, multi-task fusion frameworks require additional annotations, which greatly increases the cost of data processing, and these models are not suitable for small sarcasm data sets. Therefore, this paper proposes an improved multi-modal multi-level fusion network with residual connections to improve the accuracy of sarcasm detection.

3 Multi-modal learning framework

As shown in Fig. 1, three modalities are used as input to the model. Take the text input as an example. The text is divided into two parts: the final utterance and its context. We use the pre-trained BERT model [10] to obtain the word embedding of each word and the embedding of the [CLS] token, which represents the semantics of the entire sentence, giving a feature tensor of shape (batch_size, seq_length_text, 768). This feature is transformed into a (batch_size, 128) vector by the Text Sub-Net, a nonlinear mapping from the text semantic space to the unified first-level fusion space. As shown in Fig. 1, there are three fusion levels in total. The feature dimension after fusion should be smaller than before fusion, so as to achieve feature selection and learn deep semantics. In the first-level fusion network we use a design idea similar to residual connections: the 768-dimensional vector extracted by BERT is spliced with the fused 128-dimensional vector. On the one hand, this helps avoid over-fitting caused by deepening the network; on the other hand, considering both the original semantic space of the text modality and the fused semantic space better captures the inconsistency with the other two modalities and improves the accuracy of sarcasm detection. The second-level fusion works in a similar way to the first level. In the third level we combine contextual features [17] and speaker identity features to further improve accuracy; obviously, sarcasm is easier to detect when the context of the final utterance is available.

Fig. 1 Multimodal fusion framework

The processing of audio and images is similar to that of text. The difference is that audio feature extraction uses the librosa library [24] to extract zero_crossing_rate, MFCC and chroma_cqt features, 33 dimensions in total. For video, the clip is first segmented into frames, and image features are extracted with the ResNet-152 model [15] pre-trained on ImageNet [9], giving a feature tensor of shape (batch_size, seq_len_video, 2048). To speed up training, every 10 frames are averaged as one window, reducing the sequence length of the feature tensor. The experiments were carried out on an RTX 3080 GPU; training each model takes only a few minutes.

The sub-network of each mode is to get the internal representation of the mode, which can be formalized as:

$${O}_u={F}_u\left({I}_u\right)$$

where \({I}_u\in {R}^{B\times {D}_i^u}\) and \({O}_u\in {R}^{B\times {D}_o^u}\). Fu(•) is the feature extraction network for modality u, Iu is the input of modality u, and u ∈ {t, v, a}. It is worth noting that the video and audio modalities are averaged over time steps before being fed into the feature extraction sub-net.

As shown in Fig. 2, we give the model diagram of the text sub-network and the first-level fusion network. The sub-networks for audio and images are similar to the text sub-network, and the other two fusion levels are similar to the first-level fusion network. In the text sub-network, we obtain a 768-dimensional feature vector by averaging the word embeddings of the sentence. We use BatchNorm to accelerate the convergence of training, Dropout to avoid over-fitting, and a 3-layer DNN with ReLU activations in the hidden layers for nonlinear feature mapping. The final feature dimension is reduced to 128, and the 768-dimensional and 128-dimensional features are then fed into the first-level fusion network. As the figure shows, the structure of the first-level fusion network is similar to the text sub-net, except that the post-fusion feature has 256 dimensions rather than 128.

Fig. 2 Sub-Net and feature fusion network model diagram
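For concreteness, the text sub-network can be sketched in PyTorch as follows. The layer sizes follow the description above (768-dimensional input, 128-dimensional output, BatchNorm, Dropout and a 3-layer DNN with ReLU); the hidden width and dropout rate are illustrative assumptions rather than values reported here.

```python
import torch
import torch.nn as nn

class TextSubNet(nn.Module):
    """Nonlinear mapping from the 768-d BERT space to the 128-d fusion space."""
    def __init__(self, in_dim=768, hidden_dim=256, out_dim=128, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(in_dim),              # accelerate convergence
            nn.Dropout(dropout),                 # reduce over-fitting
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),      # 3-layer DNN: 768 -> 128
        )

    def forward(self, x):                        # x: (batch_size, 768)
        return self.net(x)

# Example: o_t = TextSubNet()(torch.randn(8, 768)) gives an (8, 128) tensor.
# The audio and video sub-nets follow the same pattern with input sizes 33 and
# 2048, and the fusion networks are built analogously with a 256-d output.
```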

The feature fusion network obtains the representation after fusing the three modalities. We use the idea of residual networks, which can be formalized as:

$${O}_{m1}={G}_1\left({O}_t,{O}_v,{O}_a,{I}_t,{I}_v,{I}_a\right)$$

where \({O}_t,{O}_v,{O}_a\in {R}^{B\times {D}_o^u}\) are the unimodal representations, G1(•) is the first-level feature fusion network, and Om1 is the first-level fusion representation.

$${O}_{m2}={G}_2\left({O}_{m1},{I}_t,{I}_v,{I}_a\right)$$

where Om2 is the second-level fusion representation and G2(•) is the second-level feature fusion network.

$${O}_{m3}={G}_3\left({O}_{m2},{I}_{\mathrm{s}},{O}_{tc}\right)$$

where Om3 is the last-level fusion representation, Is is the one-hot representation of the speaker identifier, and Otc is the context text representation.
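The three fusion levels can be read as the following forward pass. This is a minimal sketch in which text_subnet, video_subnet, audio_subnet, fuse1, fuse2, fuse3 and classifier are assumed to be small DNN blocks of the kind shown in Fig. 2; the variable names mirror the symbols in the equations above.

```python
import torch

def multi_level_fusion(model, i_t, i_v, i_a, i_s, o_tc):
    # Unimodal representations: O_u = F_u(I_u), u in {t, v, a}
    o_t = model.text_subnet(i_t)
    o_v = model.video_subnet(i_v)
    o_a = model.audio_subnet(i_a)

    # Level 1: O_m1 = G1(O_t, O_v, O_a, I_t, I_v, I_a)
    # (residual-style splice of the raw unimodal inputs)
    o_m1 = model.fuse1(torch.cat([o_t, o_v, o_a, i_t, i_v, i_a], dim=-1))

    # Level 2: O_m2 = G2(O_m1, I_t, I_v, I_a)
    o_m2 = model.fuse2(torch.cat([o_m1, i_t, i_v, i_a], dim=-1))

    # Level 3: O_m3 = G3(O_m2, I_s, O_tc), adding the speaker one-hot vector
    # and the context text representation
    o_m3 = model.fuse3(torch.cat([o_m2, i_s, o_tc], dim=-1))

    return model.classifier(o_m3)   # sarcastic / non-sarcastic logits
```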

4 Feature extraction and dataset

Following the work of [5], the text features are extracted with the BERT model, giving a 768-dimensional vector representation for each target sentence. Specifically, we use the average of the internal representations of the first token [CLS] from the last four transformer [12, 20] layers as the final sentence representation. In the later stage of the experiments, we use the [CLS] token of the last layer to represent the whole sentence, which works better.
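A minimal sketch of this extraction with the Hugging Face transformers library is shown below; the bert-base-uncased checkpoint is an assumption used for illustration, not necessarily the exact checkpoint used in the experiments.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def sentence_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # embeddings + 12 layers
    # [CLS] token (position 0) of the last four layers, averaged -> (768,)
    cls_last_four = torch.stack([h[0, 0] for h in hidden_states[-4:]])
    return cls_last_four.mean(dim=0)
```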

To obtain the audio representation, we use the popular librosa library. Because utterance lengths vary across the data set, we divide each utterance into dw non-overlapping windows to obtain a fixed-length representation. In the early stage of the experiments, each window produced a 283-dimensional vector, including MFCC, MFCC delta, Mel spectrogram, Mel spectrogram delta and spectral centroid features, and the representation of the whole utterance was the mean of all window vectors. Later, we adopted a second processing method with better detection performance: we set hop_length = 5120 and select zero_crossing_rate, MFCC and chroma_cqt features, which together constitute a 33-dimensional feature vector.
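A minimal sketch of this second processing method with librosa is given below. The default 20 MFCC coefficients and 12 chroma bins plus the zero-crossing rate give the 33 dimensions per frame described above; the division into dw windows and the averaging are omitted for brevity.

```python
import librosa
import numpy as np

def audio_features(path: str, hop_length: int = 5120) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)
    zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop_length)      # (1, T)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop_length)          # (20, T)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop_length)  # (12, T)
    feats = np.concatenate([zcr, mfcc, chroma], axis=0)                     # (33, T)
    return feats.T                                  # one 33-d vector per frame
```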

To extract the features of the visual modality, we treat every f frames as a window and average the representations \({u}_i^v\) from the pool5 layer of the ResNet-152 model pre-trained on ImageNet, obtaining a 2048-dimensional vector, where \({u}_v=\frac{1}{f}{\sum}_i{u}_i^v\in {R}^{d_v}\).
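The following sketch illustrates this step with torchvision; the preprocessing pipeline and the use of the newer weights API are assumptions for illustration.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()            # keep the 2048-d pool5 representation
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def window_feature(frames, f=10):
    """frames: list of PIL images for one window; returns a (2048,) tensor."""
    batch = torch.stack([preprocess(img) for img in frames[:f]])
    with torch.no_grad():
        feats = resnet(batch)              # (f, 2048)
    return feats.mean(dim=0)               # u_v = (1/f) * sum_i u_i^v
```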

After extracting the text, audio and video features, we truncate and pad the 690 samples so that they can be trained in batches. The final length is the sum of the mean and standard deviation of all sample lengths, i.e. final_len = mean(seq_len) + std(seq_len). In order to visualize the visual modality and fine-tune the model, we tried to design an end-to-end training scheme, that is, extracting features from the raw video and feeding them directly into model training. We found that if the features are not saved to disk as files, training slows down greatly; because the performance of our training equipment is limited, we chose to extract the features and save them as files. However, the end-to-end scheme can be used for model inference, that is, given a video it can quickly judge whether the video contains sarcasm.
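The truncation and padding rule can be sketched as follows, assuming each sample is a (length, dim) feature array; the helper name is our own.

```python
import numpy as np

def pad_or_truncate(sequences):
    """Truncate/zero-pad variable-length features to final_len = mean + std."""
    lengths = np.array([len(s) for s in sequences])
    final_len = int(lengths.mean() + lengths.std())
    dim = sequences[0].shape[-1]
    out = np.zeros((len(sequences), final_len, dim), dtype=np.float32)
    for i, s in enumerate(sequences):
        n = min(len(s), final_len)
        out[i, :n] = s[:n]                 # truncate long, zero-pad short
    return out
```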

We use the MUStARD data set, which is composed of video clips from 4 TV shows: Friends, The Golden Girls, Sarcasmaholics Anonymous and The Big Bang Theory. There are 690 videos in total, of which about half are sarcastic and half are not. As described in previous work, sarcasm detection often requires comprehensive consideration of complementary information from multiple modalities. At the same time, because some characters are themselves written as sarcastic characters, the speaker's identity is also useful for detection. In addition, the context of the target sentence helps detection as well [5]. Each sample is divided into two parts, the final utterance and its context. The statistics are shown in Table 1 [5]. There are 18 speakers in the whole data set.

Table 1 Data set statistics by final utterance and context

5 Experiments

5.1 Experimental setup

Following the work of [5], we set up 4 groups of experiments. The first is the speaker-dependent setting, where we use random five-fold cross-validation with random sample allocation; in other words, speakers in the training set may also appear in the test set.

The second experimental setting is consistent with previous work [5]: all samples from the TV series Friends are used as the test set, and the samples of the other three shows are used as the training set, which guarantees that the speakers of the training set and the test set do not overlap. However, this split causes an imbalance in training, that is, the ratio of sarcastic to non-sarcastic samples differs considerably between the training and test sets, and the precious samples are not fully utilized.

In the third group of experiments, we design our own speaker-independent data split, obtaining a more reasonable setting to verify the generalization ability of the model.

In the fourth group of experiments, we add the fusion of contextual text features and speaker identity information to test the fusion ability of the model.

For all experimental settings we use weighted Precision, Recall and F-Score as evaluation metrics, where the weights are proportional to the number of sarcastic and non-sarcastic samples. In the second group of experiments, the model easily over-fits because the data split is imbalanced and the training and validation sets are small. To avoid over-fitting, we reduce the number of layers of the model, remove the nonlinear activation functions, and use the weighted F-Score as the early-stopping criterion. In the other settings, because five-fold cross-validation implies a balanced data split, there is little difference between weighted and unweighted metrics.
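The weighted metrics correspond to scikit-learn's average="weighted" option, which weights each class by its support, for example (toy labels for illustration):

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]     # 1 = sarcastic, 0 = non-sarcastic
y_pred = [1, 0, 0, 1, 0]
precision, recall, f_score, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
```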

The early-stopping patience is set to 20 epochs, that is, if the highest weighted F-Score is not improved within 20 epochs, training stops. The optimizer, weight decay and learning rate follow the work of [41] and were tuned during the experiments. Because the number of samples in the data set is relatively small, training fluctuates somewhat, but the average performance is better than the early-fusion SVM.
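The early-stopping rule can be sketched as follows, where epoch_f_scores stands in for the per-epoch weighted F-Scores observed on the validation set during training.

```python
def early_stop_epoch(epoch_f_scores, patience=20):
    """Return the epoch at which training would stop under the patience rule."""
    best_f, wait = float("-inf"), 0
    for epoch, f in enumerate(epoch_f_scores):
        if f > best_f:
            best_f, wait = f, 0          # new best weighted F-Score
        else:
            wait += 1
            if wait >= patience:
                return epoch             # 20 epochs without improvement
    return len(epoch_f_scores) - 1       # patience never exhausted
```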

5.2 Baselines

Majority

Following the work of [5], this baseline assigns all instances to the majority class, i.e., non-sarcastic.

Random

Following the work of [5], this baseline makes random predictions across the test set.

SVM

Following the work of [5], we use early fusion to concatenate the three modal features and feed them into a Support Vector Machine (SVM) classifier [27].

EF-LSTM

The early-fusion LSTM model concatenates the raw input features of the three modalities and feeds them into an LSTM to capture deep semantic features. This method requires the modalities to be aligned in time steps.

MFN

The Memory Fusion Network [45] first feeds the three modalities into separate LSTM units, obtains cross-modal attention over the internal states with a window of two time steps, and uses a special gating mechanism to obtain the semantic features of the whole sentence.

LMF

The Low-rank Multimodal Fusion model appends an extra constant dimension to each modality and uses low-rank factors to perform the nonlinear transformation of the fused features.

MULT

The Multimodal Transformer model maps each modality into the space of the other two modalities, which serve as references, and finally obtains the interaction of the three modalities through Transformer encoders.

LF_DNN

The model used in the speaker-dependent experimental setting contains only the first-level fusion network and does not use residual connections.

LF_DNNv1

Following the work of [5], because this data split is imbalanced, only the first-level fusion network is retained in order to avoid over-fitting; the number of layers of the fusion network is reduced and the nonlinear activation function is removed, so the model degenerates to a fully connected neural network without nonlinear activation.

LF_DNNv2

In the speaker-independent experimental setting that we designed, the model needs stronger generalization ability, so we propose a three-level fusion network with residual connections, which achieves competitive performance. The structure of the model is shown in Fig. 1.

6 Ablation study

As shown in Table 2, under the speaker-dependent setting the Majority baseline is the worst because of its arbitrary decision, and the Random baseline matches intuition with a prediction accuracy of about 0.5. In general, the late-fusion LF_DNN model is superior to the early-fusion SVM in both single-modal and multi-modal configurations. Overall, the combination of visual, audio and textual signals significantly improves over the uni-modal variants, with a relative error-rate reduction of up to 17.8%.

Among single-modal signals, the visual signal carries the most information. Because the training and test sets are randomly shuffled, and the visual features include the scene layout of the TV program and the characters' faces and postures [49], these features may be well memorized by the neural network from the training set, so the visual modality performs better than the other modalities on the test set. The 2048-dimensional features contain a wealth of information, and the learning ability of the video modality is fully demonstrated in the late fusion of the neural network.

The multi-modal fusion performance of the model is generally higher than single-modal detection, which indicates that single-modal learning lacks information sources; only by comprehensively considering the three modalities can the F-Score be improved. Comparing the fusion of T + V with the fusion of T + V + A, the difference is small. This may be because the audio modality supplies very little complementary information to the other two modalities; what it provides is largely redundant. Put differently, even in a silent movie with subtitles the audience can clearly feel whether a dialogue is sarcastic, which is intuitive.

We use t-SNE [23] to visualize the single-modal and multi-modal internal representations in the late-fusion model. As can be seen from Fig. 3, among the single-modal internal representations, the text representation is not conducive to sarcasm detection, while the visual representation is more conducive to it. After fusing the three modal features, the model eliminates redundant features, merges complementary features, and learns more useful patterns for detecting sarcasm.

Fig. 3 Visualization of multi-modal fusion. Each sub-picture shows the internal representation of the text, audio, visual and fused modalities. The speaker-dependent setup of Table 2 is used

As shown in Table 3, in the speaker-independent experimental setting the performance of SVM and LF_DNNv1 is generally lower than in Table 2. This is intuitive: this split reduces the training set to 334 samples while the test set has 356, very different from the five-fold cross-validation where one fifth of the data forms the test set, so it is more challenging for the model. The imbalanced data may have little effect on SVM, but it prevents neural networks and deep learning from showing their full advantages. Moreover, because none of the speakers in the test set appear in the training set, the model is more prone to over-fitting, so we use the model variant LF_DNNv1, which reduces the number of layers of the fusion network and removes the ReLU activation function. Its final performance is better than SVM.

Table 2 Speaker-dependent setup. We use five-fold cross-validation, run 5 times and report the average. Precision, Recall and F-Score are all weighted values. The two values in each cell, separated by a comma, are the average and the standard deviation; this also holds for the following tables. ∆multi−unimodal is the difference between the best multi-modal value and the best single-modal value. The error-rate reduction is the relative reduction of the error rate, i.e. this difference divided by the absolute error of the single-modal result, e.g. 5.7/(100–68.1)

Among the single modalities, the visual modality is comparable to the audio modality, and both are superior to the text modality, which shows that sound and imagery are more expressive than text here; the textual modality alone cannot predict sarcasm accurately. The multi-modal fusion results are similar to the speaker-dependent setting, and the contribution of the audio modality in fusion is relatively low. This may be because the complementary information of the text and image modalities is already well fused by the model, while the audio modality has less complementary and more redundant information with respect to the other two modalities.

In general, the model variant LF_DNNv1 improves over the best single-modal result of SVM, but it still does not reach the speaker-dependent performance. In order to better verify the generalization ability of the model, we analysed the data set, as shown in Tables 4 and 5. In the data distribution, the samples belonging to HOWARD and SHELDON in The Big Bang Theory account for exactly one fifth of the whole data set.

If the samples of these two speakers are used as the test set, the ratio of sarcastic to non-sarcastic samples in the test set is approximately 1:1, which happens to be a balanced split, and the ratio of training set to test set is about 4:1, similar to the speaker-dependent five-fold cross-validation setting. At the same time, HOWARD and SHELDON do not appear in the training set, which satisfies the original intention of the speaker-independent experimental setup.

As shown in Table 6, after this more reasonable split of the data set, with sufficient training data, the three-level fusion model variant LF_DNNv2 with residual connections obtains better performance, with a relative error reduction of up to 11.8%. However, there is still a gap compared with the speaker-dependent results in Table 2. This is because predicting speakers who never appear in the training set is inherently challenging, and the visual features describe the entire picture and scene rather than the facial expressions and postures relevant to sarcasm.

Compared with the results in Table 3, it can be seen that even for the more complex LF_DNNv2 variant with stronger fusion capability, the more reasonable data split fails to significantly improve speaker-independent sarcasm detection. This may be because HOWARD and SHELDON have distinctive character settings: the sarcasm habits of scientists and ordinary people are quite different. This also shows that speaker-independent experiments are challenging, but our proposed model is still better than the early-fusion SVM.

Table 3 Speaker-independent setup. Run 5 times and average the results, following the work of [5] for comparison. However, this data split is very imbalanced
Table 4 The data split in [5] is unbalanced, so we define our own split
Table 5 We choose the samples of the two speakers "HOWARD" and "SHELDON" as the test set to obtain a more reasonable speaker-independent data split. This split is reasonable and balanced
Table 6 Speaker-independent setup. We ran the experiment 5 times and report the average results, using our own data split, i.e. "HOWARD" and "SHELDON" form the test set and the remaining samples form the training set. We extend the work of [5] to demonstrate the generalization ability of our proposed model. The LF_DNNv2 variant here uses a three-level late-fusion structure with residual connections
Table 7 Role of contextual text and speaker identifier features under the speaker-independent setup with our own data split
Table 8 Test results of the 7 models on 4 data sets. All results are the average of 5 runs, and the best results are shown in bold. The two values in each cell, separated by a comma, are the mean and standard deviation of the binary classification accuracy. In the speaker-dependent setting we use five-fold cross-validation, and in the speaker-independent setting we use our own data split
Table 9 Examples of misclassification corresponding to Fig. 5. TRUE means sarcasm and FALSE means non-sarcasm

We report the effects of contextual features and speaker identity features on the best single-modal and multi-modal models, as shown in Table 7. In the speaker-dependent setting, the best-performing visual modality is selected, and the multi-modal variant is the combination of text, audio and visual, as determined by the results in Table 2. In the speaker-independent setting, we use the experimental results in Table 6, select the text modality for the single-modal case, and use the same multi-modal setting as in the speaker-dependent setup.

For the addition of contextual features, the performance improvement is not obvious except for the multi-modal variant in the speaker-dependent setting. This may be because the features are averaged before being input into the model, which loses the temporal relationship between the context and the target sentence. Because the lengths of the context and the final utterance differ considerably, and we truncate and pad their features separately, some semantic information is lost. If we instead concatenate the context and the final utterance first and then truncate and pad, the performance may improve; we leave this for future work.

For the addition of speaker identity features, the performance in the speaker-dependent setting improves. This is because the speaker identities in the training and test sets overlap, so the model learns each speaker's tendency towards sarcasm. In the speaker-independent setting, because the speaker identities in the test set never appear in the training set, adding the speaker identity feature cannot effectively improve detection.

In the speaker-independent setup, we observed some fluctuations in the experimental results. This may be because the data set is limited and samples with the sarcasm style of scientists like HOWARD and SHELDON are scarce in the training set.

7 Supplementary experiment

In this section, in order to verify the generalization ability of the model, we supplement experiments on several data sets: CMU-MOSI [43, 46], MELD [29] and CH-SIMS [41]. Because multi-modal sarcasm data sets are very rare, we use the closest sentiment classification data sets for verification. In order to compare deep learning models, we also add the experimental results of several models: EF-LSTM [38], MFN [47], LMF [22], TFN [44] and MULT [36]. The results are shown in Table 8.

As can be seen from Table 8, no model performs best on all data sets. Although our model is simple, it performs well. It is worth noting that the MOSI data set is time-aligned, so EF-LSTM shows its expected advantage. The MULT model is the most time-consuming, but its performance is not good.

As shown in Fig. 4, on the MUStARD data set the performance of LF_DNNv2 is better than all other models except EF-LSTM. The large fluctuation of the curves is because the data set is small. We adopt an early-stopping strategy: if the test accuracy does not improve within 20 iterations, training stops.

Fig. 4 Comparison of training curves between LF_DNNv2 and the remaining six models in the speaker-independent setting of our split. The y-axis is the test-set accuracy and the x-axis is the training epoch; '123' denotes the random seed of one of the 5 runs. We used Weights & Biases [2] for experiment tracking and visualization to develop insights for this paper

As shown in Fig. 5, the best LF_DNNv2 still has much room for improvement in the speaker-independent setting of our split: 22 sarcastic samples were misclassified as non-sarcastic and 27 non-sarcastic samples were misclassified as sarcastic. SHELDON's samples had more classification errors than HOWARD's, 33 versus 16. For SHELDON, more non-sarcastic samples were misclassified as sarcastic, reaching 19, which may be due to SHELDON's distinctive sarcastic style.

Fig. 5 The left subgraph shows the classification results on the test set of LF_DNNv2 in the speaker-independent setting of our split; the right subgraph shows the distribution of classification errors between the two speakers

As shown in Fig. 6 and Table 9, in the 4 samples the model misclassified, SHELDON is always expressionless, speaks quickly, and shows no change in tone, and the sarcasm in the text is rather abstract and requires additional common sense [6]. This is why sarcasm classification on this small data set is so difficult.

Fig. 6 Examples of misclassification

8 Conclusion and future work

This paper proposes a late-fusion model with a three-level fusion structure and residual connections. Experimental results on the MUStARD data set show that this late-fusion model can better integrate the three modalities into a unified semantic space and improve sarcasm detection. The neural network gives full play to the advantages of late fusion and outperforms the early-fusion SVM. For different experimental settings, we propose late-fusion model variants and a more reasonable speaker-independent data split, which will help make full use of this data set in future research. We also analysed how to avoid over-fitting and how to adjust the model variants according to the data split.

In the future, we can consider more advanced fusion methods, such as adding temporal features and performing feature alignment and fusion over time. For the model, we can make it end-to-end and fine-tune the pre-trained models to further improve performance. For image features, we can extract face and pose features instead of general features of the whole frame. The expression of sarcasm may also involve the intentions of and relationships between multiple speakers, and modelling these intentions and relationships can be added to the multi-modal fusion model. We can also use knowledge graphs to model external sarcasm knowledge bases and introduce external knowledge into the fusion process to further improve the performance of multi-modal sarcasm detection. To give full play to the advantages of deep learning, we can collect and expand sarcasm data sets for further research. Sarcasm detection is closely related to emotion; we can use deep learning to learn deep semantic knowledge from emotion detection data sets and transfer this knowledge to sarcasm detection.