
1 Introduction

As a fundamental aspect of human communication, emotions play important roles in our daily lives and are crucial for more natural human-computer interaction. In recent years, with the development of social networks and the construction of large dialogue datasets, emotion recognition in conversations has become an emerging task for the research community due to its applications in several important areas, such as opinion mining over conversations (Kumar et al. [7]) and building emotional and empathetic dialogue systems (Majumder et al. [8], Zhou et al. [16]).

Emotion recognition in conversations aims to identify the emotion of each utterance in conversations involving two or more speakers. Unlike other emotion recognition tasks, conversational emotion recognition depends not only on the utterance itself but also on the context and the states of the speakers. With the development of deep learning technologies, many approaches have been proposed to handle this problem. They can generally be divided into two categories, RNN-based methods and graph-based methods, each with its own disadvantages. RNN-based methods encode the utterances temporally, but because RNNs suffer from long-term information propagation issues, they tend to aggregate relatively limited information from the nearest utterances for the target utterance and thus cannot model long-term dependencies within the conversation. Graph-based models adopt neighborhood-based graph convolutional networks to model the conversational context; they construct relational edges to directly build correlations between utterances, thereby alleviating long-distance dependency issues, but they neglect the sequential characteristic of conversation.

Based on the above discussion, in this paper we combine the advantages of RNN-based models and graph-based models so that they complement each other. We propose a temporal and relational graph attention network, named DialogueTRGAT, to model the conversation as a temporal graph structure. In particular, like RNN-based models, we gather historical context information for each target utterance based on its temporal position in the dialogue. Each target utterance only receives information from previous utterances and cannot propagate information backward. In order to model the inter-speaker dependency and self-dependency between utterances, we follow Ishiwatari et al. [5] and use the message aggregation principle of relational graph attention networks (RGAT) to aggregate context information for the target utterance based on the speaker identities of itself and the previous utterances.

Compared with traditional static graph networks, DialogueTRGAT enables the target utterance to indirectly attend to remote context without stacking too many graph layers, and it can be seen as an extension of traditional graph neural networks with an additional focus on the temporal dimension. We argue that DialogueTRGAT better models the flow of information in dialogue and aggregates more meaningful historical context for each target utterance, leading to better emotion recognition.

2 Related Work

We generally classify related works into two categories according to the method of modeling the dialogue context.

RNN-Based Models: Many works capture contextual information in utterance sequences. ICON [3] uses an RNN-based memory network to model contextual information that incorporates inter-speaker and self-dependency. HiGRU [6] proposes a hierarchical GRU framework, where a lower-level GRU encodes utterances and an upper-level GRU captures their contexts. Considering individual speaker state changes throughout the conversation, Majumder et al. [9] propose DialogueRNN, which utilizes GRUs to update speakers' states, the global state of the conversation, and emotional dynamics. DialogueCRN [4] uses LSTMs to encode conversation-level and speaker-level context for each utterance and applies LSTM-based reasoning modules to extract and integrate clues for emotional reasoning.

Graph-Based Models: Many works model the conversational context by designing a specific graph structure. For example, DialogueGCN [2] models two relations between speakers, self- and inter-speaker dependencies, and utilizes a graph network over the graph constructed from these relations. Based on DialogueGCN, DialogueRGAT [5] uses relational position encodings to incorporate position information into the graph network structure. ConGCN [15] regards both speakers and utterances as graph nodes and models context-sensitive and speaker-sensitive dependences as edges to construct the graph. Shen et al. [12] model the dialogue as a directed acyclic graph and use directed acyclic graph neural networks [14] to model the conversational context. Our work is closely related to the graph-based models, but, like RNN-based models, it focuses more on the temporality of information propagation than the models mentioned above.

3 Methodology

3.1 Problem Definition

Given the transcript of a conversation along with the speaker information of each constituent utterance, the task is to identify the emotion of each utterance from several pre-defined emotions. Formally, given an input sequence of N utterances and corresponding speakers {(\(u_{1}\), \(s_{1}\)), (\(u_{2}\), \(s_{2}\)), . . . , (\(u_{N}\), \(s_{N}\))}, where each utterance \(u_{i}\) = {\(w_{i,1}\), \(w_{i,2}\), . . . , \(w_{i,T}\)} consists of T words \(w_{i,j}\) and is spoken by speaker \(s_{i}\), \(s_{i} \in S\), where S is the set of conversation speakers, the task is to predict the emotion label \(e_{i}\) for each target utterance \(u_{i}\) based on its historical context {\(u_1\), \(u_2\), ..., \(u_{i-1}\)} and the corresponding speaker information.
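As a toy illustration of this input/output format (the utterances, speakers, and labels below are invented for illustration only):

```python
# A conversation is a list of (utterance u_i, speaker s_i) pairs, processed in order.
conversation = [
    ("I finally got the internship!", "s1"),
    ("That's wonderful, congrats!",   "s2"),
    ("Honestly, I didn't expect it.", "s1"),
]

# The task: predict one emotion label e_i per utterance u_i, using only
# u_1, ..., u_i and the corresponding speaker identities.
gold_emotions = ["excited", "happy", "surprised"]   # invented labels for illustration
```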

3.2 Model

Emotion recognition in conversations is a conversational utterance-level understanding task, and most recent methods consist of three common components: (i) feature extraction for utterances, (ii) a conversational context encoder, and (iii) an emotion classifier. Our model also follows this paradigm. Figure 1 shows the overall architecture of our model.

Fig. 1. The overall architecture of our model.

Utterance-Level Feature Extraction. Convolutional neural networks (CNNs) are effective in learning high-level abstract representations of sentences from their constituent words or n-grams. Following previous work (Ghosal et al. [2], Hazarika et al. [3], Majumder et al. [9]), we use a single convolutional layer followed by max-pooling and a fully connected layer to obtain the feature representations of the utterances. We denote \(\{h_{i}\}_{i=1}^{N}\), \(h_{i} \in \mathbb {R}^{d_{u}}\), as the representations of the N utterances.
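A minimal PyTorch sketch of such an extractor follows; the embedding size, number of filters, and kernel size are assumptions for illustration, not the settings used in the experiments:

```python
import torch
import torch.nn as nn

class CNNUtteranceEncoder(nn.Module):
    """Single convolution -> max-pooling over time -> fully connected layer.
    Embedding size, filter count, and kernel size here are assumptions."""

    def __init__(self, vocab_size, emb_dim=300, n_filters=100, kernel_size=3, d_u=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size, padding=kernel_size // 2)
        self.fc = nn.Linear(n_filters, d_u)

    def forward(self, word_ids):                        # word_ids: (batch, T) token indices
        x = self.embedding(word_ids)                    # (batch, T, emb_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))    # (batch, n_filters, T)
        x = x.max(dim=2).values                         # max-pool over time -> (batch, n_filters)
        return self.fc(x)                               # utterance features h_i in R^{d_u}
```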

Sequential and Speaker-Level Context Encoder. We model the conversation as a temporal graph structure and propose a temporal and relational graph attention network (TRGAT) to model the conversational context and gather historical context information for each target utterance. Our graph structure transmits information in temporal order to imitate the dynamics of a conversation, thereby preserving the temporal information of the conversation. The message aggregation principle of the relational graph attention network captures both self-dependency and inter-speaker dependency for the target utterance.

Graph Structure. Nodes: Each utterance in a conversation is represented as a node \(v_{i} \in V\). Each node \(v_{i}\) is initialized with the utterance representation \(h_{i}\). The representation can be updated by aggregating the representations of previous utterances within a certain context window through our TRGAT layers. The updated representation is denoted as \(h_{i}^{l}\), where l denotes the number of TRGAT layers; accordingly, we also denote \(h_{i}\) as \(h_{i}^{0}\).

Edges: For each target utterance \(u_{i}\), its emotion is most likely to be influenced by the utterances between the previous utterance spoken by \(s_{i}\) and the utterance \(u_{i-1}\). We use these utterances as the historical window to aggregate context information for utterance \(u_{i}\), which we argue is more reasonable than using a fixed-size history window. Let \(u_{j}\) be the latest utterance spoken by \(s_{i}\) before \(u_{i}\) (\(s_{j}= s_{i}\)). Then, for each utterance \(u_{\tau }\) between \(u_j\) and \(u_{i-1}\), we create a directed edge from \(u_{\tau }\) to \(u_i\). Depending on whether the speaker of \(u_{\tau }\) is the same as the speaker of \(u_{i}\), we divide the edges into two types. Formally, the above process can be expressed by the following formulas:

$$\begin{aligned} j = \max \{\, j \mid j < i,\; s_{j} = s_{i} \,\} \end{aligned}$$
(1)
$$\begin{aligned} historical\;window = [u_{j}, u_{j + 1},...,u_{i-1}] \end{aligned}$$
(2)
$$\begin{aligned} edges = \{u_{\tau }\rightarrow u_{i}\}_{\tau =j}^{i-1} \end{aligned}$$
(3)
$$\begin{aligned} edge\;type = {\left\{ \begin{array}{ll} 0&{}s_{\tau } = s_{i}\\ 1&{}s_{\tau } \ne s_{i}\\ \end{array}\right. } \qquad \tau \in [j, j + 1,...,i-1] \end{aligned}$$
(4)

To ensure that the representation of the utterance node at layer l is also informed by its own representation at layer \(l-1\), we add a self-loop edge to \(u_i\) and set its edge type to 0.
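As an illustration, a small Python sketch of this edge construction (Eqs. 1-4 plus the self-loop); when no earlier utterance by the same speaker exists, we assume the window starts at the beginning of the dialogue:

```python
def build_edges(speakers, i):
    """Sketch of the edge construction for target utterance u_i (0-indexed).

    speakers[t] is the speaker of utterance u_t. Returns the historical window and a
    list of (source, target, edge_type) triples; edge type 0 = same speaker
    (self-dependency), 1 = different speaker (inter-speaker dependency).
    """
    prev_same = [t for t in range(i) if speakers[t] == speakers[i]]
    j = max(prev_same) if prev_same else 0        # assumption: fall back to the dialogue start

    window = list(range(j, i))                    # [u_j, ..., u_{i-1}]
    edges = [(t, i, 0 if speakers[t] == speakers[i] else 1) for t in window]
    edges.append((i, i, 0))                       # self-loop with edge type 0
    return window, edges

# For the six-utterance dialogue of Fig. 1 (s1: u_1, u_3, u_6; s2: u_2, u_4, u_5):
speakers = ["s1", "s2", "s1", "s2", "s2", "s1"]
window, edges = build_edges(speakers, 5)          # target u_6 -> window covers u_3, u_4, u_5
```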

Node (Utterance) Representation Update Scheme: At each layer of TRGAT, we aggregate historical context information for each utterance in temporal order, allowing each utterance to gather information from its neighbors (the utterances in its historical window) and update its representation. The representations of the utterances are therefore computed recurrently from the first utterance to the last one. Following DialogueRGAT [5], in order to model the self- and inter-speaker dependency between utterances, we use the message aggregation principle of relational graph attention networks (RGAT) to aggregate context information for each utterance.

In the l-th layer, for each target utterance \(u_{i}\), the attention weights between \(u_{i}\) and \(u_{\tau }\), and between \(u_{i}\) and itself, are calculated as follows:

$$\begin{aligned} e_{i,\tau }^{l}= LeakyReLU\left( ({a_r^{l}})^{T}[W_{r}^{l}h_{i}^{l-1}||W_{r}^{l}h_{\tau }^{l}]\right) \quad edge\;type(s_{\tau }, s_{i}) = r \in \{0,1\} \end{aligned}$$
(5)
$$\begin{aligned} e_{i,i}^{l}= LeakyReLU\left( (a_{0}^{l})^{T}[W_{0}^{l}h_{i}^{l-1}||W_{0}^{l}h_{i}^{l-1}]\right) \end{aligned}$$
(6)
$$\begin{aligned} \alpha _{i,\tau }^{l}= softmax_i(e_{i,\tau }^{l}) \end{aligned}$$
(7)
$$\begin{aligned} \alpha _{i,i}^{l}= softmax_i(e_{i,i}^{l}) \end{aligned}$$
(8)

where \(\alpha _{i,\tau }^{l}\) denotes the edge (attention) weight from \(u_{\tau }\) to the target utterance \(u_{i}\) in layer l, and \(\alpha _{i,i}^{l}\) denotes the self-loop edge weight for \(u_{i}\) in layer l. \(W_{r}^{l}\) denotes a parameterized weight matrix for edge type r in layer l, and \(a_r^{l}\) denotes a parameterized weight vector for edge type r in layer l; \(W_{r}\) and \(a_{r}\) are not shared across layers. T represents transposition and || represents vector concatenation. A softmax function normalizes the weights of the incoming edges (including the self-loop) so that they sum to 1.

It is worth noting that the attention weight between \(u_{i}\) and \(u_{\tau }\) is based on \(u_{i}\)'s hidden state in the \(l-1\)-th layer (\(h_{i}^{l-1}\)) and \(u_{\tau }\)'s hidden state in the l-th layer (\(h_{\tau }^{l}\)). The reason is as follows: we update the hidden state of each utterance based on its temporal position, and the temporal position of \(u_{\tau }\) is before that of \(u_i\). The hidden state of \(u_{\tau }\) has therefore already been updated, denoted \(h_{\tau }^{l}\), and when updating the hidden state of \(u_i\) we use this updated hidden state to calculate the attention weight.

Finally, a relational graph attention network propagation module updates the representation of \(u_{i}\) by aggregating the representations of its neighborhood N(i), with an attention mechanism used to attend to the neighbors' representations. We define the propagation module as follows:

$$\begin{aligned} {\begin{matrix} h_{i}^{l} = \left( \sum _{r} \sum _{\tau \in N^{r}(i)}\alpha _{i,\tau }^{l}W_{r}^{l}h_{\tau }^{l} \right) + \alpha _{i,i}^{l}W_{0}^{l}{h_i^{l-1}}\\ \qquad r \in \{0, 1\} \qquad j \le \tau \le i-1 \end{matrix}} \end{aligned}$$
(9)

where \(N^{r}(i)\) denotes the neighborhood of \(u_{i}\) under edge type r.
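The following NumPy sketch illustrates Eqs. (5)-(9) for a single target utterance; the LeakyReLU slope is an assumption, and the self-loop reuses the edge-type-0 parameters as described above:

```python
import numpy as np

def leaky_relu(x, slope=0.2):                     # slope value is an assumption
    return np.where(x > 0, x, slope * x)

def trgat_update(h_i_prev, neighbours, W, a):
    """Sketch of one TRGAT update for a target utterance u_i (Eqs. 5-9).

    h_i_prev   : h_i^{l-1}, shape (d,)
    neighbours : list of (h_tau, r) pairs, where h_tau is the already-updated h_tau^l
                 of a window utterance and r in {0, 1} is its edge type
    W, a       : dicts mapping edge type r to W_r^l of shape (d, d) and a_r^l of shape (2d,)
    """
    scores, messages = [], []
    for h_tau, r in neighbours:                                                   # Eq. (5)
        scores.append(leaky_relu(a[r] @ np.concatenate([W[r] @ h_i_prev, W[r] @ h_tau])))
        messages.append(W[r] @ h_tau)
    # self-loop with edge type 0, Eq. (6)
    scores.append(leaky_relu(a[0] @ np.concatenate([W[0] @ h_i_prev, W[0] @ h_i_prev])))
    messages.append(W[0] @ h_i_prev)

    alpha = np.exp(scores) / np.sum(np.exp(scores))    # Eqs. (7)-(8): softmax over incoming edges
    return sum(w * m for w, m in zip(alpha, messages)) # Eq. (9): new representation h_i^l
```

A full TRGAT layer applies this update to \(u_1\) through \(u_N\) in temporal order, so every \(h_{\tau }^{l}\) passed in has already been computed.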

In each layer, TRGAT can adaptively gather context information for the target utterance from both neighboring and remote utterances, for the following reason: the target utterance can directly interact with the previous utterances in the context window through directed relational edges, and each utterance in the context window has already gathered context information for itself, so the target utterance can indirectly attend to the remote utterances.

Let us take the conversation in Fig. 1 as an example to illustrate the update process of utterance representations. The dialogue consists of six utterances {\(u_1\), \(u_2\), \(u_3\), \(u_4\), \(u_5\), \(u_6\)}, where \(u_1\), \(u_3\), \(u_6\) are spoken by \(s_1\) and \(u_2\), \(u_4\), \(u_5\) are spoken by \(s_2\). The historical context of each utterance is shown in Table 1, and the update process of utterance representations in the l-th TRGAT layer is shown in Fig. 2.

Table 1. The utterances and their historical contexts in the conversation.
Fig. 2. Each utterance updates its hidden state according to its temporal position in the dialogue. Each subgraph represents the computational graph of the currently updated node (utterance). \(h_{i}^{l-1}\) and \(h_{i}^{l}\) represent the hidden state of the i-th utterance in layers \(l-1\) and l, respectively. The two speakers' utterances are colored blue and green, respectively. The edges represent the direction of information flow. The utterances at the source and tail nodes of a red arrow are spoken by the same speaker and model self-dependency between utterances, while the utterances at the source and tail nodes of a black arrow are spoken by different speakers and model inter-speaker dependency. (Color figure online)

Emotion Classification. After obtaining the representation \(h_{i}^{L}\) of each utterance node by stacking L TRGAT layers, we concatenate the non-contextual representation \(h_{i}^{0}\) and the representation \(h_{i}^{L}\) as the final representation of \(u_{i}\), and pass it through a feed-forward neural network and a softmax layer to obtain the emotion distribution:

$$\begin{aligned} H_{i} = h_{i}^{0}||h_{i}^{L} \end{aligned}$$
(10)
$$\begin{aligned} Z_{i} = ReLu(W_{H}H_{i} + b_{H}) \end{aligned}$$
(11)
$$\begin{aligned} P_{i} = Softmax(W_{Z}Z_{i} + b_{Z}) \end{aligned}$$
(12)

where \(W_{H}\) and \(W_{Z}\) denote learnable weight matrices, and \(b_{H}\) and \(b_{Z}\) denote learnable bias vectors.
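A minimal sketch of this classification head (Eqs. 10-12); the hidden size, class count, and dummy inputs below are illustrative assumptions:

```python
import torch
import torch.nn as nn

d_u, n_classes = 100, 6                                 # illustrative sizes
ffn = nn.Sequential(
    nn.Linear(2 * d_u, d_u),                            # W_H, b_H applied to H_i = h_i^0 || h_i^L
    nn.ReLU(),
    nn.Linear(d_u, n_classes),                          # W_Z, b_Z
)

h_0, h_L = torch.randn(d_u), torch.randn(d_u)           # dummy h_i^0 and h_i^L
H_i = torch.cat([h_0, h_L], dim=-1)                     # Eq. (10)
P_i = torch.softmax(ffn(H_i), dim=-1)                   # Eqs. (11)-(12): emotion distribution
```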

4 Experiment

4.1 Datasets and Evaluation Metrics

We evaluate our model on two benchmark datasets: IEMOCAP [1] and MELD [11]. Both are multimodal datasets containing textual, visual, and acoustic information for every utterance of each conversation. In this work, we focus on conversational emotion recognition from textual information only and leave multimodal dialogue emotion recognition as future work; accordingly, when comparing model performance, we only use the results of the different models on the text modality.

The IEMOCAP dataset contains videos of dyadic conversations where actors perform improvisations or scripted scenarios. Each conversation is segmented into utterances, which are annotated with one of the six emotion labels: happy, sad, neutral, angry, excited, and frustrated.

The MELD dataset comes from the Friends TV series with multiple speakers involved in the conversations. The utterances are annotated with one of seven labels: neutral, happiness, surprise, sadness, anger, disgust, and fear.

The statistics of the two datasets are shown in Table 2. Because IEMOCAP has no validation set, we extract a validation set from the randomly shuffled training set with a ratio of 8:2. Following [2, 9], we use the F1-score to evaluate the performance on each emotion class and the weighted F1-score to evaluate the overall performance on the two datasets.
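For reference, these metrics can be computed with scikit-learn; the labels and predictions below are placeholders:

```python
from sklearn.metrics import f1_score

y_true = ["happy", "sad", "neutral", "angry", "sad", "neutral"]      # placeholder gold labels
y_pred = ["happy", "neutral", "neutral", "angry", "sad", "neutral"]  # placeholder predictions

per_class_f1 = f1_score(y_true, y_pred, average=None, labels=sorted(set(y_true)))
weighted_f1 = f1_score(y_true, y_pred, average="weighted")           # overall metric in Tables 3-4
```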

Table 2. Statistics of IEMOCAP, MELD

4.2 Baselines

For a comprehensive performance evaluation, we compared our model with the following baselines:

CNN: As described in Sect. 3.2, it is our utterance representation extractor, trained at the utterance level without contextual information.

scLSTM [10]: It captures contextual information from historical utterances by using a unidirectional LSTM.

Memnet [13]: The current utterance is fed to a memory network, where the memories correspond to historical utterances. The output from the memory network is used as the final utterance representation for emotion classification.

DialogueRNN [9]: It is a recurrent network that uses two GRUs to track individual speaker states and the global context during the conversation. Further, another GRU is employed to track the emotional state through the conversation.

DialogueGCN [2]: It captures self-dependency and inter-speaker dependency between utterances by using two-layer graph neural networks. For a fair comparison, we remove the directed edges from future utterances to current utterances from the original graph structure, so that no dialogue information flows backward.

DialogueRGAT [5]: Based on DialogueGCN and taking the sequential information of the conversation into account, DialogueRGAT proposes relational position encodings that provide RGAT with sequential information. Our handling of its graph structure is consistent with DialogueGCN.

4.3 Implementation Settings

We use the following settings to optimize the model parameters during training: the dimension of the initial utterance representation is set to 100 for IEMOCAP and 600 for MELD. In each TRGAT layer, the size of the hidden states is the same as the utterance representation dimension. To prevent over-fitting, we apply dropout with a rate of 0.4 after each TRGAT layer. We employ AdamW as the optimizer with a learning rate of 0.0005 and use the standard cross-entropy loss to train the model. On both datasets, we train for 100 epochs with a batch size of 32, save the model parameters with the best overall performance on the validation set, and finally report the performance on the test set.
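A sketch of the corresponding optimization setup in PyTorch; the stand-in model and random batch are placeholders, and only the hyper-parameters come from the text above:

```python
import torch
import torch.nn as nn

# Stand-in model: the real model is the CNN extractor + stacked TRGAT layers + classifier.
model = nn.Sequential(nn.Linear(100, 100), nn.ReLU(), nn.Dropout(0.4), nn.Linear(100, 6))

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)   # AdamW, learning rate 0.0005
criterion = nn.CrossEntropyLoss()                            # standard cross-entropy loss

features = torch.randn(32, 100)                              # batch size 32 (placeholder features)
labels = torch.randint(0, 6, (32,))                          # placeholder emotion labels
loss = criterion(model(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```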

For the number of TRGAT layers L, we set \(L=3\) for the overall performance comparison by default, but we also carry out experiments with different layer counts in Sect. 4.5 to explore how they influence the overall performance.

4.4 Experimental Results

Table 3 and Table 4 present the results on the IEMOCAP and MELD test sets, respectively.

Table 3. Performance comparison on the IEMOCAP dataset. The evaluation metric is F1 for each class. Average(w) = weighted F1; \(\dagger \) denotes results taken from the original papers; \(^{*}\) denotes re-implemented results.
Table 4. Performance comparison on the MELD dataset.

IEMOCAP: In Table 3, our model performs better than all compared models on the IEMOCAP dataset. It attains the best overall performance, with improvements over the strongest RNN-based baseline DialogueRNN (+3.8% weighted F1) and the strongest graph-based baseline DialogueRGAT (+1.9% weighted F1).

From the experimental results, the graph-based models (DialogueGCN, DialogueRGAT) perform better than the RNN-based model (DialogueRNN). A likely reason is that DialogueRNN employs gated recurrent units (GRUs) to model the conversational context, and GRU-based modeling can be problematic for the many long conversations in the IEMOCAP dataset. In contrast, DialogueGCN and DialogueRGAT try to overcome this issue by constructing relational edges that directly model the correlation between utterances. Our model acts like a combination of RNN-based and graph-based models and can better model the conversational context.

MELD: The conversations in the MELD dataset contain an average of 10 utterances, and many conversations involve more than 5 speakers. This makes the interaction between speakers more difficult to model than in IEMOCAP, which consists only of dyadic conversations. Under these circumstances, the graph-based models' advantage in encoding context matters less, and we found that the difference in results between RNN-based and graph-based models is not as pronounced as on IEMOCAP; the overall performance is not significantly different.

Nevertheless, our model still outperforms all baseline methods, which suggests the efficacy of our context-modeling approach. Compared with the best baseline model, DialogueRGAT, our model attains a +1.7% weighted-F1 improvement in overall performance. In addition, our model performs best on the two minority classes, fear and disgust, which demonstrates its capability in recognizing minority emotion classes.

4.5 Model Analysis

Pre-trained Models as Utterance Feature Extractor. Given the outstanding performance of pre-trained models on natural language understanding tasks, they are often used as utterance feature extractors in recent works. We replace the CNN-based extractor described in Sect. 3.2 with a RoBERTa-based extractor to demonstrate the effectiveness of our method regardless of the utterance feature extractor used. The experimental results are shown in Table 5. All models gain remarkable improvements by employing the more powerful extractor. Our method attains results comparable to the state-of-the-art model DAG+RoBERTa [14] on the IEMOCAP dataset, and also achieves results comparable to the best baseline models on the MELD dataset.

Table 5. Performance comparison of different models using RoBERTa as the feature extractor on the IEMOCAP and MELD datasets.

Number of TRGAT Layers. We further explore the relationship between model performance and the number of TRGAT layers, and whether using RGAT's message aggregation principle to aggregate contextual information for each utterance outperforms other graph networks. Here, we use the message aggregation principles of the Graph Attention Network (GAT) [16] and the Relational Graph Convolutional Network (RGCN) [13] as comparative experiments, and we denote the corresponding layers as TGAT and TRGCN. As shown in Fig. 3, we vary the number of TRGAT layers on the IEMOCAP and MELD datasets and compare the performance with TGAT and TRGCN.

Fig. 3. Test results of TRGAT, TRGCN, and TGAT on the IEMOCAP and MELD datasets with different numbers of network layers.

For static graph neural network (GNN) based models such as DialogueGCN and DialogueRGAT, the only way for an utterance to receive information from remote utterances is to stack several GNN layers. In our model, however, at every layer of TRGAT each utterance can indirectly gather information from remote utterances because aggregation follows the temporal order. So rather than stacking many TRGAT layers, we can attain competitive performance with few layers on both datasets. Meanwhile, when stacking more TRGAT layers on the IEMOCAP dataset, the model suffers from performance degradation, which is not obvious on the MELD dataset. We believe that as the number of TRGAT layers increases, the number of model parameters also increases, and since the IEMOCAP dataset is relatively small, over-fitting occurs. Moreover, RGAT's message aggregation principle performs better than those of GAT and RGCN. Compared with RGAT, GAT's message aggregation principle does not take the edge relation into consideration, so it cannot model the self-dependency and inter-speaker dependency when gathering historical context information for an utterance. Compared with RGCN, RGAT can more flexibly determine the importance of historical utterances to the current utterance through its attention mechanism.

5 Conclusion

In this paper, we propose a temporal and relational graph attention network, named DialogueTRGAT, for emotion recognition in conversation. DialogueTRGAT gathers context information for each utterance based on its temporal position in the dialogue and uses the message aggregation principle of relational graph attention networks (RGAT) to aggregate historical context information for each utterance. It therefore acts like a combination of an RNN-based model and a graph-based model. We believe this is a more effective way to model the information flow within conversations and that it gains more meaningful context cues for each utterance, leading to better emotion recognition. Extensive experiments show that, compared with previously proposed methods, our model is more competitive.