
1 Introduction

In recent years, many people have shared their opinions and reviews on the internet through various social media platforms. Fully mining the information in these texts can provide significant help in improving products and increasing efficiency, making aspect-based sentiment analysis (ABSA) [1,2,3] a popular research direction. ABSA aims to detect fine-grained sentiment towards specific aspects of targets. Initially, ABSA focused only on aspect terms and sentiment polarities [4,5,6]. Later, researchers gradually recognized the importance of two other key factors that influence sentiment polarity judgments: opinion terms and categories [2]. Depending on the elements of interest, various ABSA tasks have been proposed, including pair extraction tasks (e.g., aspect-opinion pair extraction, AOPE [7]), triple extraction tasks (e.g., aspect sentiment triplet extraction, ASTE [8, 9]), and quadruple extraction tasks (e.g., aspect sentiment quad prediction, ASQP [10, 11]).

The aforementioned research has focused on short texts such as comments. However, conversational texts are also a significant category of social media content, and conducting sentiment analysis on them is equally meaningful. To perform fine-grained sentiment analysis on conversational texts, conversational aspect-based sentiment quadruple analysis (DiaASQ) [12] has been proposed. As shown in Fig. 1, conversational texts have a naturally special structure. Firstly, a conversation consists of multiple participants who may hold different stances and views [13]. Secondly, the elements of a sentiment quadruple may come from multiple utterances; we refer to these as cross-utterance quadruples. Finally, as the conversation progresses, the topic tends to shift gradually. These characteristics pose new challenges for modeling sentiment analysis on conversational texts.

Fig. 1. Illustration of conversational aspect-based sentiment quadruple analysis. The following four target-aspect-opinion-sentiment quadruples can be extracted from this dialog: (‘Hongmeng’, ‘drop power’, ‘very fast’, ‘neg’), (‘mate40pro+’, ‘battery life’, ‘good’, ‘pos’), (‘Honor 30Pro’, ‘power drop’, ‘fast’, ‘neg’), and (‘Hongmeng’, ‘battery life’, ‘poor’, ‘neg’).

To address DiaASQ, Li et al. [12] proposed a model that uses thread, speaker, and reply views to model the conversation. Their model encodes each utterance separately with pre-trained language models (PLMs) [14] and then models global information through self-attention mechanisms [15] and masking. This approach, however, cannot fully exploit the powerful contextual modeling capabilities of PLMs, losing some of the interaction between adjacent utterances. We therefore propose a context-fusion encoding method for DiaASQ that models the contextual information of an entire thread rather than encoding each utterance separately, which performs better at extracting cross-utterance quadruples. In addition, we treat extremely short conversations as a whole for context encoding. Finally, we incorporate regularized dropout [16] and the fast gradient method [17] to improve the robustness of the model.

In summary, the main contributions of this work are as follows: (1) we propose a context-fusion encoding method that allows the model to better understand context and extract cross-utterance quadruples; (2) we incorporate regularized dropout and the fast gradient method into the model to enhance its performance; (3) experimental results demonstrate that the proposed method achieves an average F1-score of 42.12% on DiaASQ, 6.48% higher than the best comparative model, indicating superior performance.

2 Related Work

In this section, we will provide an overview of related work that focuses on sentiment analysis of short texts and the shared task.

2.1 Aspect-Based Sentiment Quadruples Extraction

In the field of aspect-based sentiment analysis (ABSA) for short texts, aspect sentiment quad prediction (ASQP), also referred to as aspect-based sentiment quadruple extraction, has been an active research area [2, 3]. Cai et al. [10] pioneered the ABSA quadruple extraction task, with a focus on implicit aspects and opinions. They introduced two new datasets with sentiment quadruple annotations and constructed a series of pipeline baselines from existing models to benchmark the task. Zhang et al. [11] proposed a paraphrase modeling strategy that predicts sentiment quadruples end-to-end. They cast the quadruple prediction task as text generation and solved it with a Seq2Seq modeling paradigm, enabling full utilization of label semantics, i.e., the meaning of the sentiment elements. Later methods further formalized the task as generating opinion trees [18, 19] or structured schemas [20].

2.2 Conversational Aspect-Based Sentiment Quadruple Analysis

Conversational aspect-based sentiment quadruple analysis [12] is a new task; previous work had not considered how to extract sentiment quadruples from conversational text. The shared task provided a model with a novel labeling scheme based on the grid-tagging method [8], which divides the labeling task into three sub-tasks: detecting entity boundaries, entity pairs, and sentiment polarity. Compared to pipeline models that require extract-filter-match processing [10], this approach reduces error propagation and accumulation; compared to seq-to-seq approaches [11], it avoids exposure bias. The model first extracts contextual representations of each sentence through an encoding layer. A multi-view interaction layer then constructs a Thread Mask, a Speaker Mask, and a Reply Mask, combined with multi-head self-attention [15], to strengthen awareness of the dialogue discourse. Finally, it fuses Rotary Position Embedding (RoPE) [21] and computes a score for each label between every token pair.

3 Methodology

In this section, we describe our method in detail. The overall model structure is shown in Fig. 2. In the context characterization stage, we propose a context-fusion encoding method based on thread structure and conversation length. We also introduce the adversarial training and regularization strategies we use.

3.1 Task Introduction

The goal of conversational aspect-based sentiment quadruple analysis is to extract target-aspect-opinion-sentiment quadruples from conversational texts. The target, aspect, and opinion are contiguous spans of words extracted from the sentences, and these elements may come from different sentences, in which case the quadruple is called cross-utterance. The sentiment polarity, determined from the three extracted elements, falls into three categories: positive, negative, and neutral. As shown in Fig. 1, a conversation starts from a root post; all subsequent posts are descendants of this root post. A thread refers to a subtree hanging off the root node of the conversation tree; we treat the root post itself as a separate thread. A target denotes a particular object (e.g., a product or service), while an aspect denotes a specific attribute or component of the target. In contrast, a category is a broader concept referring to the class to which the aspect belongs. An opinion term is often an adjective that conveys the speaker's evaluation of the aspect. For instance, in Fig. 1, the aspect "battery life" of the target "mate40pro+" is mentioned.

Specifically, we represent each dialog as a training sample \(D = \{u_1, ... , u_n\}\) with the corresponding reply links \(r = \{l_1, ... , l_n\}\), where \(l_i\) indicates that the \(i\)-th utterance replies to the \(l_i\)-th utterance. Without loss of generality, we take \(u_1\) as the root utterance. \(t_k = \{u_i, u_{i+1}, ... , u_j\}\ (1 \le i \le j \le n)\) denotes the k-th thread, where \(l_i = 1\) and \(\{l_{i+1}, ... , l_j\} \subseteq \{i, i+1, ... , j-1\}\). Each \(u_i = \{w_1, ... , w_{m_i}\}\) denotes the i-th utterance text, and \(m_i\) is its length. DiaASQ aims to extract all possible (target, aspect, opinion, sentiment) quadruples, denoted \(Q = \{t, a, o, p\}\), where t, a, and o are sub-strings of the dialogue D and \(p \in \{pos, neg, other\}\).
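To make the reply structure concrete, the following minimal sketch (our own illustration, with hypothetical names and 0-indexed utterances) groups a dialogue into threads from its reply links, assuming, as in the definition above, that utterance indices are ordered so that each thread forms a contiguous block:

```python
# A minimal sketch of thread grouping from reply links (illustrative only).
# replies[i] is the index of the utterance that utterance i replies to;
# replies[0] = -1 marks the root. The root post is treated as its own thread.
def group_threads(replies):
    """Return a list of threads, each a list of utterance indices."""
    threads = [[0]]                 # the root post forms a separate thread
    for i in range(1, len(replies)):
        if replies[i] == 0:         # a direct reply to the root opens a new thread
            threads.append([i])
        else:                       # otherwise the utterance extends the current thread
            threads[-1].append(i)
    return threads

# Example: u0 is the root; u1 and u2 form one thread; u3 and u4 another.
print(group_threads([-1, 0, 1, 0, 3]))  # [[0], [1, 2], [3, 4]]
```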

Fig. 2. The overall framework of the proposed method.

3.2 Context Fusion Encoding with Adversarial Training

Thread Fusion. A dialogue usually consists of multiple turns and involves multiple speakers, presenting a complex hierarchical structure. As reported in [12], around 22% of the quadruples in the Chinese and English datasets are cross-utterance. If context encoding is performed only on individual utterances, the outstanding performance of PLMs [14] cannot be fully exploited, and there is no interaction between different utterances, which inevitably loses contextual information. We therefore propose a thread-based contextual fusion method, which we call "thread fusion", and use PLMs to better model multiple speakers and different utterances. The method merges the utterances of the same conversation thread into a dialogue segment and encodes each segment as a whole for contextual representation:

$$\begin{aligned} t_k^{\prime } = <[cls], u_i, [sep], u_{i+1}, [sep], ..., [sep], u_j, [sep]>, \end{aligned}$$
(1)
$$\begin{aligned} \boldsymbol{TH}_{k} = \boldsymbol{h}_{cls}, \boldsymbol{H}_{i}, \boldsymbol{h}_{sep}, ..., \boldsymbol{h}_{sep}, \boldsymbol{H}_{j}, \boldsymbol{h}_{sep} = \text{ PLMs }(t_k^{\prime }), \end{aligned}$$
(2)

where \(u_i, ..., u_j\) are the utterances of the k-th thread \(t_k\), [cls] and [sep] are the special tokens of the PLM, and \(\boldsymbol{H}_{i}\) and \(\boldsymbol{TH}_k\) denote the contextual representations of the i-th utterance and the k-th thread, respectively. We observed that the contents discussed within the same thread are often closely related, while the relationships between different threads are relatively weak; this observation motivates thread fusion.
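As a sketch of how thread fusion might be realized with an off-the-shelf encoder (assuming the HuggingFace transformers library and the Chinese-Roberta-wwm checkpoint mentioned in Sect. 4.2; the function name is ours), the utterances of one thread are joined with [sep] and encoded as a single sequence so that self-attention spans the entire thread:

```python
# A minimal thread fusion sketch (Eqs. 1-2), assuming HuggingFace transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

def encode_thread(utterances):
    """Jointly encode all utterances of one thread; returns (seq_len, dim)."""
    # Joining with the [SEP] string yields [CLS] u_i [SEP] u_{i+1} [SEP] ... [SEP]
    # after tokenization, matching Eq. (1); special tokens appearing in the
    # text are recognized by the tokenizer.
    text = tokenizer.sep_token.join(utterances)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.squeeze(0)                          # TH_k in Eq. (2)
```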

Dialog Fusion. Further analysis of the dataset reveals that some threads are very short, containing incomplete quadruples and little information, and thus yield no predictions from the model. As shown in Table 1, the average thread length is around 28 words, with the shortest thread containing only 3 words. A natural remedy is to merge these particularly short threads into longer texts. However, the maximum thread length in the Chinese dataset is 257 words, and the longest conversation contains 462 words, so merging is not applicable to all conversations: some long conversations would exceed the maximum input length of PLMs. Moreover, long conversations are usually more informative and may introduce noise if merged together.

Taking these two points into account, we treat conversations whose length is below a threshold \(\tau \) as a whole and use a PLM to obtain their global context information. The representation of the whole dialog \(\boldsymbol{DH}\) is constructed as follows:

$$\begin{aligned} D^{\prime } = <[cls], u_1, [sep], u_{2}, [sep], ..., [sep], u_n, [sep]>, \end{aligned}$$
(3)
$$\begin{aligned} \boldsymbol{DH} = {\left\{ \begin{array}{ll} \boldsymbol{h}_{cls}, \boldsymbol{H}_1, \boldsymbol{h}_{sep}, ..., \boldsymbol{H}_n, \boldsymbol{h}_{sep} = \text{ PLMs }(D^{\prime }), &{} \text {if}\quad \sum _{i=1}^{n}m_i \le \tau , \\ \boldsymbol{TH}_1||\boldsymbol{TH}_2||...||\boldsymbol{TH}_k, &{} \text {else}, \end{array}\right. } \end{aligned}$$
(4)

where dialog \(D^{\prime }\) is one training sample joined by [cls] and [sep] tokens, \(m_i\) is the length of the i-th utterance, \(\tau \) is a controllable hyperparameter that restricts the scope of the processing, and "||" denotes concatenation.
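A sketch of the dialog fusion rule in Eq. (4) might look as follows, building on the hypothetical encode_thread above; for simplicity, length is measured here in characters rather than tokens:

```python
# A minimal dialog fusion sketch (Eqs. 3-4): short dialogues are encoded as a
# single sequence; longer ones fall back to concatenating thread representations.
import torch

def encode_dialog(threads, tau=128):
    """threads: list of threads, each a list of utterance strings."""
    all_utts = [u for thread in threads for u in thread]
    if sum(len(u) for u in all_utts) <= tau:  # whole-dialogue encoding (upper branch)
        return encode_thread(all_utts)
    # thread-wise encoding, then "||" concatenation (lower branch)
    return torch.cat([encode_thread(thread) for thread in threads], dim=0)
```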

Adversarial Training. To further improve the performance and robustness of the context-fusion encoder, we adopt the Fast Gradient Method (FGM) [17] for adversarial training. FGM is a popular adversarial attack method used in deep learning to generate adversarial examples by perturbing the input so as to maximize the model's loss. It computes the gradient of the loss with respect to the input and perturbs the input in the gradient direction with a fixed magnitude under a norm constraint. The perturbation \(\boldsymbol{r}_{a d v}\) is defined as:

$$\begin{aligned} \boldsymbol{r}_{a d v}=\epsilon \cdot \boldsymbol{g} /\Vert \boldsymbol{g}\Vert _2 \text{ where } \boldsymbol{g}=\nabla _s L(D, y), \end{aligned}$$
(5)

where \(\epsilon \) is a hyperparameter limiting the size of adversarial perturbations \(\boldsymbol{r}_{a d v}\).
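A widely used PyTorch realization of FGM is sketched below, under the common assumption that the perturbation is applied to the word-embedding parameters (the class and parameter names are illustrative):

```python
# A minimal FGM sketch (Eq. 5): perturb the embedding weights along the
# normalized gradient, run a second forward/backward pass, then restore them.
import torch

class FGM:
    def __init__(self, model, epsilon=1.0):
        self.model, self.epsilon, self.backup = model, epsilon, {}

    def attack(self, emb_name="word_embeddings"):
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0:
                    param.data.add_(self.epsilon * param.grad / norm)  # add r_adv

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Typical usage in a training step:
#   loss.backward()                      # gradients on the clean input
#   fgm.attack()                         # perturb embeddings by r_adv
#   model(batch).loss.backward()         # accumulate adversarial gradients
#   fgm.restore(); optimizer.step(); optimizer.zero_grad()
```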

3.3 Quadruple Decoder

Multi-view Interaction. Following Li et al. [12], we construct attention masks \(M^c\) and use multi-head self-attention [15] to extract three types of features: dialogue thread, speaker, and reply, where \(c \in \{Th, Sp, Rp\}\) corresponds to the thread mask, speaker mask, and reply mask, respectively:

$$\begin{aligned} \boldsymbol{H}^{c} = \text{ Masked-Att } (\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}, \boldsymbol{M}^{c}) ={\text {Softmax}}(\frac{\left( \boldsymbol{Q} \cdot \boldsymbol{K}^{T}\right) \odot \boldsymbol{M}^{c}}{\sqrt{d}}) \cdot \boldsymbol{V}, \end{aligned}$$
(6)

where \(\boldsymbol{Q}=\boldsymbol{K}=\boldsymbol{V}=\boldsymbol{DH}\) is the representation of whole dialogue. Thread mask \(\boldsymbol{M}_{i j}^{T h}=1\) if the \(i^{t h}\) and \(j^{th}\) token belong to the same dialogue thread; speaker mask \(\boldsymbol{M}_{i j}^{S p}=1\) if the \(i^{t h}\) and \(j^{t h}\) token are derived from the same speaker; and reply mask \(\boldsymbol{M}_{i j}^{R p}=1\) if the two utterances containing the \(i^{t h}\) and \(j^{\text{ th } }\) token respectively have a replying relation.
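The masked attention of Eq. (6) can be sketched as below; note that while the equation writes an element-wise product with the mask, implementations typically set blocked positions to negative infinity before the softmax so that they receive zero attention weight:

```python
# A minimal sketch of masked self-attention (Eq. 6), single head for clarity.
import math
import torch

def masked_attention(Q, K, V, mask):
    """Q, K, V: (N, d); mask: (N, N) with 1 = attend, 0 = block."""
    scores = (Q @ K.transpose(-2, -1)) / math.sqrt(Q.size(-1))   # (N, N)
    scores = scores.masked_fill(mask == 0, float("-inf"))        # apply M^c
    return torch.softmax(scores, dim=-1) @ V                     # (N, d)

# H^Th, H^Sp, H^Rp are obtained by calling this with the thread, speaker,
# and reply masks respectively, with Q = K = V = DH.
```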

To better guide discourse understanding, the model fuses Rotary Position Embedding (RoPE) [21] into the token representations, which dynamically encodes the relative distance between tokens globally at the dialogue level. The score \(s_{i j}^{r}\), indicating the probability of relation label r between \(w_{i}\) and \(w_{j}\), is then calculated as:

$$\begin{aligned} s_{i j}^{r}=(\boldsymbol{R}(\theta , i) \boldsymbol{v}_{i}^{r})^{T} (\boldsymbol{R}(\theta , j) \boldsymbol{v}_{j}^{r}), \end{aligned}$$
(7)

where \(\boldsymbol{R}(\theta , i)\) is a positioning matrix parameterized by \(\theta \) and the absolute index i of \(\boldsymbol{v}_{i}^{r}\).
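For intuition, a minimal RoPE sketch is given below (our simplification of [21]): each pair of feature dimensions is rotated by an angle proportional to the absolute position, so the dot product of two rotated vectors in Eq. (7) depends only on their relative distance:

```python
# A minimal RoPE sketch: rotate consecutive feature pairs by position-dependent
# angles; rope(v_i, i) @ rope(v_j, j) then depends only on j - i.
import torch

def rope(v, pos, base=10000.0):
    """v: (d,) with d even; pos: absolute token index."""
    d = v.size(0)
    theta = base ** (-torch.arange(0, d, 2).float() / d)  # per-pair frequencies
    angles = pos * theta
    cos, sin = torch.cos(angles), torch.sin(angles)
    v1, v2 = v[0::2], v[1::2]                             # even/odd dimensions
    out = torch.empty_like(v)
    out[0::2] = v1 * cos - v2 * sin                       # 2-D rotation per pair
    out[1::2] = v1 * sin + v2 * cos
    return out

# Score of Eq. (7): s_ij = rope(v_i, i) @ rope(v_j, j)
```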

Regularization. Inspired by Liang et al. [16], we improve the quadruple decoder with Regularized Dropout (R-Drop), a consistency-based regularization technique. Because of the stochastic nature of dropout, the model's predictions vary between forward passes. R-Drop passes each training sample through the model twice and uses the Kullback-Leibler (KL) divergence to constrain the two predictions, defined by the following formula:

$$\begin{aligned} \begin{aligned} \mathcal L_{KL} = \frac{\alpha }{2}\left[ \mathcal {D}_{K L}\left( \mathcal {P}_1^w(y \mid D) \Vert \mathcal {P}_2^w(y \mid D)\right) +\mathcal {D}_{K L}\left( \mathcal {P}_2^w(y \mid D) \Vert \mathcal {P}_1^w(y \mid D)\right) \right] \end{aligned}, \end{aligned}$$
(8)

where \(\mathcal {P}_1^w(y \mid D)\) and \(\mathcal {P}_2^w(y \mid D)\) are two distributions of model predictions, \(\alpha \) is the coefficient weight to control \(\mathcal L_{K L}\).
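A sketch of the symmetric KL term of Eq. (8) in PyTorch is shown below (logits1 and logits2 are the outputs of the two forward passes; since both directions are summed, the argument order of each kl_div call does not affect the total):

```python
# A minimal R-Drop sketch (Eq. 8): symmetric KL between two dropout-perturbed
# predictive distributions over the same input.
import torch.nn.functional as F

def rdrop_kl(logits1, logits2, alpha=1e-4):
    """logits1, logits2: (N, num_labels) from two forward passes."""
    log_p1 = F.log_softmax(logits1, dim=-1)
    log_p2 = F.log_softmax(logits2, dim=-1)
    kl_a = F.kl_div(log_p1, log_p2, log_target=True, reduction="batchmean")
    kl_b = F.kl_div(log_p2, log_p1, log_target=True, reduction="batchmean")
    return alpha / 2 * (kl_a + kl_b)
```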

3.4 Learning

The training loss \(\mathcal L_d\), the sum of the losses of each subtask, is defined as:

$$\begin{aligned} \mathcal {L}_{k}=-\frac{1}{G \cdot N^{2}} \sum _{g=1}^{G} \sum _{i=1}^{N} \sum _{j=1}^{N} {\alpha }^{k} y_{i j}^{k} \log \left( p_{i j}^{k}\right) , \end{aligned}$$
(9)
$$\begin{aligned} \mathcal {L}_d=\mathcal {L}_{\text{ ent } }+\beta \mathcal {L}_{\text{ pair } }+\eta \mathcal {L}_{\text{ pol } }, \end{aligned}$$
(10)

where \(k \in \{\) ent, pair, pol \(\}\) indicates a subtask defined by Li et al. [12], N is the total token length of a dialogue, and G is the number of training instances. \(y_{i j}^{k}\) is the ground-truth label and \(p_{i j}^{k}\) the prediction. A tag-wise weighting hyperparameter \({\alpha }^{k}\) is applied to counteract the imbalance among label types, where \({\alpha }^{pair}=\beta \) and \({\alpha }^{pol}=\eta \) are determined per dataset by experimental tuning. The final loss \(\mathcal L\), including the R-Drop loss, is:

$$\begin{aligned} \mathcal {L}=\mathcal {L}_d^1 + \mathcal {L}_d^2 + \mathcal {L}_{KL}, \end{aligned}$$
(11)

where \(\mathcal {L}_d^1 \) and \( \mathcal {L}_d^2\) represent the loss obtained from the model predicting the same sample twice.
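Putting Eqs. (9)-(11) together, one training step might be sketched as follows (the per-subtask loss attributes on the model output are hypothetical; rdrop_kl is the sketch above):

```python
# A minimal sketch of the full objective (Eq. 11): two forward passes of the
# same dialogue under different dropout masks, each contributing the weighted
# subtask losses of Eq. (10), plus the R-Drop KL term of Eq. (8).
def training_step(model, batch, beta, eta):
    out1, out2 = model(batch), model(batch)       # two stochastic passes
    loss_d1 = out1.ent + beta * out1.pair + eta * out1.pol   # L_d^1 (Eq. 10)
    loss_d2 = out2.ent + beta * out2.pair + eta * out2.pol   # L_d^2
    loss_kl = rdrop_kl(out1.logits, out2.logits)             # L_KL  (Eq. 8)
    return loss_d1 + loss_d2 + loss_kl                       # L     (Eq. 11)
```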

4 Experiment

4.1 Datasets and Metrics

4.1.1 Datasets.

The corpus consists of posts and comments collected from Weibo, the largest Chinese social media platform. The datasets include both Chinese and English, with the English dataset translated from the Chinese one [12]. As shown in Fig. 1, a dialogue starts from a root post and is composed of replies from multiple speakers. Each reply to the root post is considered a thread. From a data-structure perspective, the multi-thread, multi-turn dialogue forms a tree, where each subtree of the root node is a thread. This structure provides clear information about the target of each sentence's reply, which greatly benefits the model's understanding of context.

The statistics of the datasets are shown in Table 1. The English dataset is, on average, slightly longer than the Chinese dataset, and the difference between the shortest and longest samples is very large at every level, whether utterance, thread, or dialogue.

4.1.2 Metrics.

The DiaASQ task uses exact-match F1 as its metric: a predicted quadruple is counted as correct only if all four elements match the gold annotation exactly. Two measures are used: micro F1, which scores the whole quadruple including the sentiment polarity, and identification F1 [22], which ignores the polarity and is therefore better suited to evaluating boundary prediction and entity matching. The final evaluation criterion of the competition is the average of these four scores over the Chinese and English datasets.
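For clarity, exact-match scoring can be sketched as follows (shown per dialogue with quadruples as tuples; corpus-level micro F1 aggregates the true-positive and prediction/gold counts before computing precision and recall):

```python
# A minimal sketch of exact-match F1: a predicted quadruple counts as correct
# only if all four elements match exactly; identification F1 drops the polarity.
def exact_f1(pred_quads, gold_quads, ignore_polarity=False):
    strip = (lambda q: q[:3]) if ignore_polarity else (lambda q: q)
    pred = {strip(q) for q in pred_quads}
    gold = {strip(q) for q in gold_quads}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```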

4.2 Experiment Setting

Because the Chinese and English datasets are similar in content, after an initial parameter search we used the same settings for both. We set the maximum number of epochs to 30 and trained with an early stopping mechanism. The batch size was 1, with evaluation every 100 steps. The initial learning rate was 1e-5, and we applied a dropout rate of 0.1 to the intermediate layers. The R-Drop weight \(\alpha \) was set to 1e-4. For the dialogue length threshold \(\tau \), we experimented with several values: 128, 192, 256, and 512; as shown in Table 1, 512 already exceeds the length of every dialogue. Following prior work, we used Chinese-Roberta-wwm-base [23] and Roberta-Large [24] as the base encoders for the Chinese and English datasets, respectively.

4.3 Baseline System

We mainly compared some of the latest models of short-text ABSA and dialogue ABSA, as shown below:

  • CRF-Extract-Classify [10]. A three-stage system (extract, filter, and combine) proposed for the sentence-level quadruple ABSA.

  • SpERT [25]. A model for joint extraction of entity and relation based on a span-based transformer. The model was slightly modified to support triple-term-extraction and polarity classification.

  • Span-ASTE [26]. A span-based approach for ABSA triplet extraction. It was likewise adapted to the DiaASQ task by editing the last stage of Span-ASTE to enumerate triplets.

  • ParaPhrase [11]. A generative seq-to-seq model for quadruple ABSA extraction. The model outputs are modified to fit the DiaASQ task.

  • DiaASQ\(_{\textbf{MTV}}\) [12]. The model proposed for the DiaASQ benchmark, which encodes each utterance separately.

Table 1. Statistics on the length of utterance, thread, and dialog in the test set. ‘Utt.’, ‘Thd.’, and ‘Dia.’ respectively refer to utterance, thread, and dialog.

4.4 Results and Analysis

4.4.1 Main Experiment.

Table 2 presents the main results of our experiments, showing that our model outperforms all compared models. Our best model combines thread fusion and dialog fusion encoding with \(\tau \) = 128 and is trained with FGM and R-Drop. DiaASQ\(_\text {MTV}\) scores an average of 35.64% over the English and Chinese datasets, while our method exceeds it by approximately 6.48%, demonstrating the effectiveness of our approach. In general, scores on the Chinese dataset are higher than those on the English dataset.

We also conducted ablation experiments. First, to verify the effectiveness of the context fusion method in isolation from FGM and R-Drop, we removed these two modules and obtained an average score of 39.44%. Although lower than the full model, this is still 3.8% higher than DiaASQ\(_\text {MTV}\), further demonstrating that the proposed context fusion method helps the model's context encoding.

Table 2. Performance of the context fusion encoding method in both main experiments and ablation experiments. ‘T-Fusion’ represents the thread fusion method, and ‘D-Fusion\(_{128}\)’ represents the dialog fusion method with a dialogue length threshold of \(\tau \)=128.

In another ablation experiment, we examined the individual contributions of thread fusion and dialog fusion. The model achieves an average F1 of 41.05% with dialog fusion removed, and 41.64% with thread fusion removed, so dialog fusion has the greater effect. This was somewhat unexpected, since dialog fusion only processes short conversations, while thread fusion applies to all conversations. One possible explanation is that thread fusion improves quadruple extraction within a single thread, whereas many quadruples are not only cross-utterance but also cross-thread; for cross-thread cases, dialog fusion can help more.

With this setup, our team achieved third place in NLPCC 2023 Shared Task 4 with an average score of 41.05%, obtained without the dialog fusion method D-Fusion\(_{128}\). Our best score of 42.12%, which could have earned a higher ranking, was not submitted due to the competition's limit of three submission attempts.

4.5 Effectiveness of Dialog Fusion

Fig. 3. Comparison of the model with and without the D-Fusion\(_{128}\) method on dialogs shorter than 128 words.

We believe that dialog fusion improves the score because it enhances the model's ability to understand the context of short dialogues. To verify this, we identified all dialogues shorter than 128 words and compared the model's F1 scores on them before and after adding D-Fusion\(_{128}\). The results are shown in Fig. 3. With D-Fusion\(_{128}\), the micro F1 and identification F1 scores on the Chinese dataset increase by 1.75% and 3.18% respectively, and the micro F1 score on the English dataset increases by 1.47%, supporting our hypothesis. However, the identification F1 score on the English dataset decreases, indicating that boundary prediction for English deteriorates after concatenating the dialogues. This may be because the English dataset is generally longer than the Chinese one (as shown in Table 1) and because English PLMs use WordPiece tokenization, which lengthens the text and makes spans harder to locate. Overall, this indicates that dialog fusion does improve the model's understanding and modeling of short-dialogue context. We also experimented with different thresholds \(\tau \) for further validation: as \(\tau \) increased, the average score showed an overall downward trend, consistent with our hypothesis.

5 Conclusion

This work proposes a context fusion method to enhance conversational aspect-based sentiment quadruple analysis. First, utterances within the same thread are merged through thread fusion, enabling the model to jointly model context from multiple speakers. Then, dialog fusion is applied to particularly short dialogues to obtain global information, which effectively improves performance on shorter dialogues. Our experiments also show that concatenating the entire text of long dialogues has negative effects. Our model achieves an average F1 score of 42.12%, 6.48% higher than DiaASQ\(_\text {MTV}\), indicating the effectiveness of our approach.