
1 Introduction

Causal explanation detection (CED) aims to detect whether a causal explanation is present in a given message (e.g., a group of sentences). Linguistically, coherence relations in a message explain how the meanings of different textual units combine to jointly build the discourse meaning of the larger unit. Explanation is an important coherence relation, referring to a textual unit (e.g., a discourse) in a message that expresses explanatory semantics [12]. As shown in Fig. 1, M1 can be divided into three discourses, and D2 is the explanation, expressing why it is advantageous for the equipment to operate at these temperatures. CED is important for tasks that require an understanding of textual expression [25]. For example, in question answering, the answers to questions are most likely to be found in groups of sentences that contain causal explanations [22]. Furthermore, the summarization of event descriptions can be improved by selecting causally motivated sentences [9]. Therefore, CED is a problem worthy of further study.

Fig. 1. Instance of causal explanation analysis (CEA). The top part is a message which contains its segmented discourses and a causal explanation. The bottom part is the syntactic dependency structures of the three discourses divided from M1.

Existing methods mostly regard this task as a classification problem [25]. At present, there are two main kinds of methods, feature-based and neural-based, for similar discourse-level semantic understanding tasks such as opinion sentiment classification and discourse parsing [11, 21, 27]. Feature-based methods can extract features of the relations between discourses. However, they do not handle implicit instances that lack explicit features well. For CED, as shown in Fig. 1, D2 lacks explicit features such as because of or due to, as well as tense features, which makes it difficult for feature-based methods. Neural-based methods mainly include the Tree-LSTM model [30] and the hierarchical Bi-LSTM model [25]. Tree-LSTM models learn the relations between words to capture the semantics of discourses more accurately but lack further understanding of the semantics between discourses. Hierarchical Bi-LSTM models can exploit sequence structure to implicitly learn the relations between words and discourses. However, previous work shows that, compared with Tree-LSTM, Bi-LSTM lacks a direct understanding of the dependency relations between words, so implicitly learning inter-word relations is not prominent in tasks that require understanding the semantic relations of messages [16]. Therefore, how to directly and effectively learn the relations between words, and how to consider discourse-level correlation to further filter the key information, are questions worth studying.

Going one step further, why do the relations between words imply the semantics of the message and its discourses? From the view of computational semantics, the meaning of a text lies not only in the meanings of its words but also in the relations, order, and aggregation of those words. Put simply, the meaning of a text is partially based on its syntactic structure [12]. In CED specifically, the core and subsidiary words of a discourse carry its basic semantics. For example, for D1 in Fig. 1, following the word order in the syntactic structure, we can capture that the ability to work at these temperatures is advantageous; that is, we can understand the basic semantics of D1, which expresses that some kind of ability is advantageous, via the root word advantageous and its affiliated words. Additionally, why are the correlation and key information at the discourse level so important for capturing the causal explanatory semantics of a message? Through observation, different discourses have different statuses with respect to the explanatory semantics of a message. For example, in M1, combined with D1, D2 expresses the explanatory semantics of why the ability to work at these temperatures is advantageous, while D3 expresses transitional semantics. In detail, D1 and D2 are the keys to the explanatory semantics of M1, and if D1, D2, and D3 are not treated differently, the transitional semantics of D3 can interfere with understanding the explanatory semantics of M1. Therefore, how to make better use of the information of keywords in the syntactic structure, and how to pay more attention to the discourses that are key to explanatory semantics, are the problems to be solved.

To this end, we propose a Pyramid Salient-Aware Network (PSAN) which utilizes keywords in the syntactic structure of each discourse and focuses on the discourses that are critical to explanatory semantics to detect causal explanations in messages. First, what are the keywords in a syntactic structure? From the perspective of syntactic dependency, the root word is the central element that dominates the other words while not being dominated by any of them; all other words are subordinate to the root word [33]. Accordingly, the root word and its subsidiary words in the dependency structure are the keywords of each discourse at the syntax level. Specifically, we sample 100 positive sentences from the training data to examine whether the keywords obtained through syntactic dependency contain the causal explanatory semantics, and we find that the causal explanatory semantics of more than 80% of the sentences can be captured by keywords in the dependency structureFootnote 1. Therefore, we extract the root word and its surrounding words in the syntactic dependency of each discourse as its keywords.

Next, we need to consider how to make better use of the information of keywords contained in the syntactic structure. To pay more attention to keywords, a common way is to use attention mechanisms to increase their attention weights. However, such implicitly learned attention is not very interpretable. Inspired by previous research [1, 29], we propose a bottom graph-based word-level salient network which merges the syntactic dependency to capture the salient semantics of discourses contained in their keywords. Finally, how do we consider the correlation at the discourse level and pay more attention to the discourses that are key to the explanatory semantics? Inspired by previous work [18], we propose a top attention-based discourse-level salient network to focus on the key discourses in terms of explanatory semantics.

In summary, the contributions of this paper are as follows:

  • We design a Pyramid Salient-Aware Network (PSAN) to detect causal explanations of messages which can effectively learn the pivotal relations between keywords at word level and further filter the key information at discourse level in terms of explanatory semantics.

  • PSAN can assist in causal explanation detection via capturing the salient semantics of discourses contained in their keywords with a bottom graph-based word-level salient network. Furthermore, PSAN can modify the dominance of discourses via a top attention-based discourse-level salient network to enhance explanatory semantics of messages.

  • Experimental results on openly accessible, commonly used datasets show that our model achieves the best performance. Our experiments also demonstrate the effectiveness of each module.

2 Related Work

Causal Semantic Detection: Recently, causality detection, which detects specific causes and effects and the relations between them, has received increasing attention, e.g., the work of Li [17], Zhang [35], Bekoulis [2], Do [5], Riaz [23], Dunietz [6] and Sharp [24]. Specifically, to extract causal explanatory semantics from messages at a general level, some studies capture the causal semantics in messages from the perspective of discourse structure, such as capturing counterfactual conditionals from social messages with PDTB discourse relation parsing [26], a pre-trained model with the Rhetorical Structure Theory Discourse Treebank (RSTDT) for exploiting discourse structures on movie reviews [10], and a two-step interactive hierarchical Bi-LSTM framework [32] for extracting emotion-cause pairs in messages. Furthermore, Son [25] defines the causal explanation analysis (CEA) task to extract causal explanatory semantics in messages and annotates a dataset for downstream tasks. In this paper, we focus on causal explanation detection (CED), which is the fundamental and important subtask of CEA.

Syntactic Dependency with Graph Network: Syntactic dependency is a vital linguistic feature for natural language processing (NLP). Several studies employ syntactic dependency, such as retrieving question answering passages with the aid of syntactic dependency [4] and mining opinions with syntactic dependency [31]. For tasks that extract causal semantics from text, dependency syntactic information may evoke causal relations between discourse units in the text [8]. Recently, some studies [20, 34] convert the syntactic dependency into a graph and apply a graph convolutional network (GCN) [14] to effectively capture the syntactic dependency semantics between words in context, such as a semantic role labeling model with GCN [20] and a GCN-based model assisted with syntactic dependency to improve relation extraction [34]. In this paper, we capture the salient explanatory semantics based on the syntactic-centric graph.

3 Methodology

The architecture of our proposed model is illustrated in Fig. 2. The Pyramid Salient-Aware Network (PSAN) primarily involves the following three components: (i) the input processing module (IPM), which processes and encodes the input message and its discourses via a self-attention module; (ii) the bottom word-level salient-aware module (B-WSM), which captures the salient semantics of discourses contained in their keywords based on the syntactic-centric graph; (iii) the top discourse-level salient-aware module (T-DSM), which modifies the dominance of different discourses based on the message-level constraint in terms of explanatory semantics via an attention mechanism, and obtains the final causal explanatory representation of the input message s.

Fig. 2. The structure of PSAN. The left side is the detail of the bottom word-level salient-aware module (B-WSM), the top of the right side is the top discourse-level salient-aware module (T-DSM), and the bottom of the right side is the input processing module (IPM).

3.1 Input Processing Module

In this component, we split the input message s into discourses d. Specifically, we utilize a self-attention encoder to encode input messages and their corresponding discourses.

Discourse Extraction. As shown in Fig. 1, we split the message into discourses with the same segmentation method as Son [25], based on semantic coherence. In detail, first, we regard commas, periods, exclamation marks, and question marks as discourse markers. Next, we also extract the discourse connective set from PDTB2 and add the connectives as discourse markers. Specifically, we remove some simple connectives (e.g., and in "I like running and basketball") from the extracted discourse markers. Finally, we divide messages into discourses at the discourse markers.
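Below is a minimal Python sketch of this segmentation step, assuming a small hand-picked connective list; the actual system uses the full connective set extracted from PDTB2 and the segmentation rules of Son [25].

```python
import re

# Hypothetical marker set: sentence punctuation plus a few PDTB-style
# connectives; the real system extracts the full connective list from PDTB2.
CONNECTIVES = {"because", "so", "but", "although", "since"}

def segment_message(message):
    """Split a message into discourses at punctuation and connective markers."""
    # Split on punctuation first, keeping non-empty spans.
    spans = [s.strip() for s in re.split(r"[,.!?]", message) if s.strip()]
    discourses = []
    for span in spans:
        current = []
        for tok in span.split():
            # Start a new discourse when a connective (other than simple
            # coordination like "and") is encountered mid-span.
            if tok.lower() in CONNECTIVES and current:
                discourses.append(" ".join(current))
                current = [tok]
            else:
                current.append(tok)
        if current:
            discourses.append(" ".join(current))
    return discourses

print(segment_message(
    "The ability to work at these temperatures is advantageous, "
    "because the devices need less thermal insulation."))
```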

Embedding Layer. For the input message \(s=\{s_1,...,s_n\}\) and the discourses \(d=\{d_1^d,...,d_m^d\}\) separated from s, we look up the embedding vector of each word \(s_n\) (\(d_m^d\)) as \(\varvec{s_n}\) (\(\varvec{d_m^d}\)) in the pre-trained embeddings. Finally, we obtain the word representation sequence \(\textit{\textbf{s}}=\{\varvec{s_1},...,\varvec{s_n}\}\) of message s and \(\textit{\textbf{d}}=\{\varvec{d_1^d},...,\varvec{d_m^d}\}\) of each discourse d corresponding to s.

Word Encoding. Inspired by the application of self-attention to multiple tasks [3, 28], we exploit a multi-head self-attention encoder to encode the input words. The scaled dot-product attention can be described as follows:

$$\begin{aligned} \text {Attention}(\textit{\textbf{Q}}, \textit{\textbf{K}}, \textit{\textbf{V}})=\text {softmax}\left( \frac{\textit{\textbf{Q}} \textit{\textbf{K}}^{T}}{\sqrt{dim_{k}}}\right) \textit{\textbf{V}} \end{aligned}$$
(1)

where \(\textit{\textbf{Q}} \in \mathbb {R}^{N \times 2dim_{h}}\), \(\textit{\textbf{K}} \in \mathbb {R}^{N \times 2 dim_{h}}\) and \(\textit{\textbf{V}} \in \mathbb {R}^{N \times 2 dim_{h}}\) are the query, key and value matrices, respectively. In our setting, \(\textit{\textbf{Q}} = \textit{\textbf{K}} = \textit{\textbf{V}} = \textit{\textbf{s}}\) for encoding the message, and \(\textit{\textbf{Q}} = \textit{\textbf{K}} = \textit{\textbf{V}} = \textit{\textbf{d}}\) for encoding a discourse.

Multi-head attention first projects the queries, keys, and values h times by using different linear projections. The results of attention are concatenated and once again projected to get the final representation. The formulas are as follows:

$$\begin{aligned} head_{i}=\text {Attention}\left( \textit{\textbf{Q}} \mathbf {W}_{i}^{Q}, \textit{\textbf{K}} \mathbf {W}_{i}^{K}, \textit{\textbf{V}} \mathbf {W}_{i}^{V}\right) \end{aligned}$$
(2)
$$\begin{aligned} \begin{aligned} \mathbf {H}^{\prime }&=\left( h e a d_{i} \oplus \ldots \oplus h e a d_{h}\right) \mathbf {W}_{o} \end{aligned} \end{aligned}$$
(3)

where \(\mathbf {W}_{i}^{Q} \in \mathbb {R}^{2 dim_{h} \times dim_{k}}\), \(\mathbf {W}_{i}^{K} \in \mathbb {R}^{2 dim_{h} \times dim_{k}}\), \(\mathbf {W}_{i}^{V} \in \mathbb {R}^{2 dim_{h} \times dim_{k}}\) and \(\mathbf {W}_{o} \in \mathbb {R}^{2 dim_{h} \times 2 dim_{h}}\) are projection parameters and \(dim_{k}=2 dim_{h} / h\). The outputs are the encoded message \(\textit{\textbf{H}}_{S}^{ed}=\{\textit{\textbf{h}}_{s_1}^{ed},...,\textit{\textbf{h}}_{s_n}^{ed}\}\) and the encoded discourse \(\textit{\textbf{H}}_{D^d}^{ed}=\{\textit{\textbf{h}}_{d^d_1}^{ed},...,\textit{\textbf{h}}_{d^d_m}^{ed}\}\).
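To make the encoder concrete, the following is a minimal PyTorch sketch of the multi-head self-attention in Eqs. (1)-(3). The model width and head count are illustrative assumptions; the paper only specifies that the input width is \(2dim_{h}\) and \(dim_{k}=2dim_{h}/h\).

```python
import math
import torch
import torch.nn as nn

class SelfAttentionEncoder(nn.Module):
    """Multi-head self-attention encoder mirroring Eqs. (1)-(3)."""
    def __init__(self, dim_model=100, num_heads=4):   # dim_model = 2*dim_h (assumed)
        super().__init__()
        assert dim_model % num_heads == 0
        self.dim_k = dim_model // num_heads            # dim_k = 2*dim_h / h
        self.num_heads = num_heads
        self.w_q = nn.Linear(dim_model, dim_model, bias=False)
        self.w_k = nn.Linear(dim_model, dim_model, bias=False)
        self.w_v = nn.Linear(dim_model, dim_model, bias=False)
        self.w_o = nn.Linear(dim_model, dim_model, bias=False)

    def forward(self, x):                              # x: (N, dim_model)
        n, d = x.size()
        # Project and reshape into (heads, N, dim_k), Eq. (2).
        q = self.w_q(x).view(n, self.num_heads, self.dim_k).transpose(0, 1)
        k = self.w_k(x).view(n, self.num_heads, self.dim_k).transpose(0, 1)
        v = self.w_v(x).view(n, self.num_heads, self.dim_k).transpose(0, 1)
        # Scaled dot-product attention, Eq. (1).
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.dim_k)
        heads = torch.matmul(torch.softmax(scores, dim=-1), v)
        # Concatenate heads and project, Eq. (3).
        return self.w_o(heads.transpose(0, 1).reshape(n, d))

encoder = SelfAttentionEncoder()
h_ed = encoder(torch.randn(30, 100))                   # encoded discourse words
```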

3.2 Bottom Word-Level Salient-Aware Module

In this component, we aim to capture the salient semantics of discourses contained in their keywords based on syntactic-centric graphs. For each discourse, the module first extracts the syntactic dependency and constructs the syntactic-centric graph. Second, it collects the keywords and their inter-relations to capture the discourse-level salient semantics based on the syntactic-centric graph.

Syntactic-Centric Graph Construction. We construct a syntactic-centric graph for each discourse based on its syntactic dependency to assist in capturing the semantics of discourses. We utilize the Stanford CoreNLP toolFootnote 2 to extract the syntactic dependency of each discourse and convert it into a syntactic-centric graph. Specifically, in the syntactic-centric graph, the nodes represent words, and the edges represent whether there is a dependency relation between two words. As shown in subplot (a) of Fig. 2, need is the root word in the syntactic dependency of "the devices need less thermal insulation" (D2 in S1), and words which are syntactically dependent on each other are connected with solid lines.

Keywords Collection and Salient Semantic Extraction. For each discourse, we collect the keywords based on the syntactic-centric graph and capture the salient semantics from these keywords. Firstly, as illustrated in Sect. 1, we take the root word and the affiliated words connected to the root word within k hops as the keywords. For example, as shown in Fig. 2, when \(k=1\) the keywords are {need, devices, insulation}, and when \(k=2\) the keywords are {need, devices, insulation, the, thermal}. Secondly, inspired by previous works, we utilize a k-layer graph convolutional network (GCN) [14] to encode the k-hop connected keywords based on the syntactic-centric graph. For example, when \(k=1\), we encode the 1-hop keywords with a 1-layer GCN to capture the salient semantics. By changing the value of k, we can capture salient semantics of different degrees. However, a larger k does not necessarily capture deeper salient semantics; conversely, the larger k is, the more noise is likely to be introduced. For example, when \(k=1\), need, devices and insulation are enough to express the salient semantics of D2 (working at these temperatures needs less insulation). Finally, we select the representation of the root word in the final layer as the discourse-level representation, which contains the salient semantics.
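To make the keyword collection concrete, the following Python sketch builds the syntactic-centric graph from dependency heads and gathers the k-hop keywords around the root. The head indices below are an assumption chosen to match the dependency structure drawn in Fig. 2; in the actual pipeline they come from the Stanford CoreNLP parse.

```python
from collections import deque

def build_adjacency(heads):
    """Build a symmetric adjacency list from dependency heads.

    heads[i] is the index of token i's head, or -1 for the root.
    """
    adj = [[] for _ in heads]
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i].append(h)
            adj[h].append(i)
    return adj

def k_hop_keywords(tokens, heads, k=1):
    """Collect the root word and all words within k hops of it (BFS)."""
    root = heads.index(-1)
    adj = build_adjacency(heads)
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if dist[node] == k:          # do not expand beyond k hops
            continue
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return [tokens[i] for i in sorted(dist)]

# "the devices need less thermal insulation" (D2 of S1); the heads are assumed
# to follow the structure in Fig. 2, with "need" as the root.
tokens = ["the", "devices", "need", "less", "thermal", "insulation"]
heads  = [1, 2, -1, 4, 5, 2]
print(k_hop_keywords(tokens, heads, k=1))  # ['devices', 'need', 'insulation']
print(k_hop_keywords(tokens, heads, k=2))  # ['the', 'devices', 'need', 'thermal', 'insulation']
```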

The graph convolutional network (GCN) [14] is a generalization of the CNN [15] for encoding graphs. In detail, given a syntactic-centric graph with v nodes, we utilize a \(v \times v\) adjacency matrix \(\textit{\textbf{A}}\), where \(A_{ij} = 1\) if there is an edge between node i and node j. In each layer of the GCN, the input for each node is the output \(\textit{\textbf{h}}_i^{k-1}\) of the previous layer (the input of the first layer is the original encoded input words and features), and the output of node i at the k-th layer is \(\textit{\textbf{h}}_i^k\). The formula is as follows:

$$\begin{aligned} \textit{\textbf{h}}_{i}^{k}=\sigma \left( \sum _{j=1}^{v} A_{ij} W^{k} \textit{\textbf{h}}_{j}^{k-1}+b^{k}\right) \end{aligned}$$
(4)

where \(W^k\) is a linear transformation matrix, \(b^k\) is a bias term and \(\sigma \) is a nonlinear function.

However, naively applying the graph convolution operation in Equation (4) could lead to node representations with drastically different magnitudes, because the degree of a token varies a lot. Moreover, the information in \(h_i^{k-1}\) may never be carried over to \(h_i^k\), because nodes never connect to themselves in a dependency graph [34]. To resolve these issues, we follow the method of Zhang [34], which normalizes the activations in the GCN and adds self-loops to each node in the graph:

$$\begin{aligned} \textit{\textbf{h}}_{i}^{k}=\sigma \left( \sum _{j=1}^{v} \tilde{A}_{ij} W^{k} \textit{\textbf{h}}_{j}^{k-1} / d_{i}+b^{k}\right) \end{aligned}$$
(5)

where \(\tilde{\mathbf {A}}=\mathbf {A}+\mathbf {I}\), \(\mathbf {I}\) is the \(v \times v\) identity matrix, and \(d_{i}=\sum _{j=1}^{v} \tilde{A}_{i j}\) is the degree of word i in the graph.

Finally, we select the representation \(\textit{\textbf{h}}^{k}_{d_{root}}\) of the root word in the final GCN layer as the salient representation of the d-th discourse in message s. For example, as shown in subplot (b) of Fig. 2, we choose the representation of need in the final layer as the salient representation of the discourse "the devices need less thermal insulation".
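The following PyTorch sketch implements the normalized GCN of Eq. (5) over a syntactic-centric graph and reads out the root word's final-layer state as the discourse representation. The dimensions, the ReLU nonlinearity, and the toy adjacency are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SyntacticGCN(nn.Module):
    """k-layer GCN over the syntactic-centric graph, following Eq. (5):
    self-loops are added and activations are normalized by node degree."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, h, adj):
        # adj: (v, v) 0/1 adjacency matrix; add self-loops (A~ = A + I).
        a_tilde = adj + torch.eye(adj.size(0))
        degree = a_tilde.sum(dim=1, keepdim=True)          # d_i
        for layer in self.layers:
            # h_i^k = sigma( sum_j A~_ij (W^k h_j^{k-1} + b^k) / d_i )
            h = torch.relu(a_tilde.matmul(layer(h)) / degree)
        return h

# Hypothetical usage: v encoded words of one discourse and its dependency graph;
# the root word's final-layer state serves as the discourse representation.
v, dim, root_idx = 6, 100, 2
h_words = torch.randn(v, dim)
adj = torch.zeros(v, v)
for i, j in [(0, 1), (1, 2), (3, 4), (4, 5), (5, 2)]:
    adj[i, j] = adj[j, i] = 1.0
gcn = SyntacticGCN(dim, num_layers=2)
h_root = gcn(h_words, adj)[root_idx]                       # salient discourse vector
```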

3.3 Top Discourse-Level Salient-Aware Module

How can we make better use of the relations between discourses and extract the message-level salient semantics? We modify the dominance of different discourses based on the message-level constraint in terms of explanatory semantics via an attention mechanism. First, we extract the global semantics of message s, which contain its causal explanatory tendency. Next, we modify the dominance of different discourses based on the global semantics. Finally, we combine the modified representations to obtain the final causal explanatory representation of the input message s.

Global Semantic Extraction. Inspired by previous research [25], the average of the encoded representations of all the words in a message can represent its overall semantics simply and effectively. We apply average pooling to the encoded representation \(\textit{\textbf{H}}_{S}^{ed}\) of message s to obtain the global representation, which contains the global semantics of its causal explanatory tendency. The formula is as follows:

$$\begin{aligned} \textit{\textbf{h}}_{s}^{glo}=\frac{1}{n} \sum _{i=1}^{n} \textit{\textbf{h}}_{s_i}^{ed} \end{aligned}$$
(6)

where \(\textit{\textbf{h}}_{s}^{glo}\) is the global representation of message s via average pooling operation and n is the number of words.

Dominance Modification. We modify the dominance of different discourses based on the global semantics, which contain the causal explanatory tendency, via an attention mechanism. In detail, after obtaining the global representation \(\textit{\textbf{h}}_{s}^{glo}\), we modify the salient representations \(\textit{\textbf{h}}^{k}_{d_{root}}\) of the discourses d under the constraint of \(\textit{\textbf{h}}_{s}^{glo}\). Finally, we obtain the final causal representation \(\textit{\textbf{h}}^{caul}_{s}\) of message s via the attention mechanism:

$$\begin{aligned} \alpha _{ss} = \textit{\textbf{h}}_{s}^{glo} \textit{\textbf{W}}_{f} (\textit{\textbf{h}}_{s}^{glo})^T \end{aligned}$$
(7)
$$\begin{aligned} \alpha _{sd} = \textit{\textbf{h}}_{s}^{glo} \textit{\textbf{W}}_{f} (\textit{\textbf{h}}_{d_{root}}^k)^T \end{aligned}$$
(8)
$$\begin{aligned} \begin{bmatrix} \alpha _{ss}^{'}, \cdots , \alpha _{sd}^{'} \end{bmatrix} = softmax([\alpha _{ss}, ..., \alpha _{sd}]) \end{aligned}$$
(9)
$$\begin{aligned} \textit{\textbf{h}}^{caul}_{s} = \alpha _{ss}^{'} \textit{\textbf{h}}_{s}^{glo} +...+\alpha _{sd}^{'} \textit{\textbf{h}}_{d_{root}}^k, \end{aligned}$$
(10)

where \(\textit{\textbf{W}}_{f}\) is a linear transformation matrix and \(\alpha _{ss}^{'}\), \(\alpha _{sd}^{'}\) are the attention weights. Finally, we map \(\textit{\textbf{h}}^{caul}_{s}\) into a two-dimensional vector and obtain the output via a softmax operation.
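The following PyTorch sketch covers T-DSM end to end (Eqs. (6)-(10)): average pooling for the global vector, bilinear scoring against each discourse's root representation, and the attention-weighted mixture fed to a binary softmax. The dimensions are assumptions, and the bilinear score is written as a learned linear map of the global vector followed by dot products, which matches Eqs. (7)-(8) up to a transpose of \(\textit{\textbf{W}}_{f}\).

```python
import torch
import torch.nn as nn

class DiscourseSalientAware(nn.Module):
    """Sketch of T-DSM: global pooling, dominance modification, classification."""
    def __init__(self, dim):
        super().__init__()
        self.w_f = nn.Linear(dim, dim, bias=False)     # bilinear weight W_f
        self.classifier = nn.Linear(dim, 2)            # binary output

    def forward(self, h_message, h_discourses):
        # h_message: (n, dim) encoded words; h_discourses: (num_d, dim) root vectors.
        h_glo = h_message.mean(dim=0)                  # Eq. (6), global semantics
        candidates = torch.cat([h_glo.unsqueeze(0), h_discourses], dim=0)
        scores = candidates.matmul(self.w_f(h_glo))    # Eqs. (7)-(8), bilinear scores
        alpha = torch.softmax(scores, dim=0)           # Eq. (9), attention weights
        h_causal = (alpha.unsqueeze(1) * candidates).sum(dim=0)   # Eq. (10)
        return torch.softmax(self.classifier(h_causal), dim=-1)

model = DiscourseSalientAware(dim=100)
probs = model(torch.randn(20, 100), torch.randn(3, 100))  # P(causal explanation)
```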

4 Experiment

Dataset. We mainly evaluate our model on the dataset devoted to causal explanation analysis released by Son [25]. This dataset contains 3,268 messages, consisting of 1,598 positive messages that contain a causal explanation and 1,670 randomly selected negative messages. Annotators mark which messages contain causal explanations and which text spans are the causal explanations (a discourse that tends to explain something). Following Son [25], we use the same 80% of the dataset for training, 10% for tuning, and 10% for evaluation. Additionally, to further demonstrate the effectiveness of our proposed model, we regard sentences with causal discourse relations in PDTB2 and sentences containing causal span pairs in the BECauSE Corpus 2.0 [7] as supplemental messages with causal explanations to evaluate our model. In this paper, PDTB-CED and BECauSE-CED denote these two supplementary datasets, respectively.

Parameter Settings. We set the maximum lengths of a sentence and a discourse to 100 and 30, respectively. We set the batch size to 5 and the dimension of the output of each GCN layer to 50. Additionally, we utilize 50-dimensional word vectors pre-trained with GloVe. For optimization, we utilize Adam [13] with a learning rate of 0.001. We set the maximum number of training epochs to 100 and adopt an early stopping strategy based on the performance on the development set. All the results of the compared and ablated models are averaged over five independent runs.

Compared Models. We compare our proposed model with feature-based and neural-based models: (1) Lin et al. [19]: an end-to-end discourse relation parser on PDTB; (2) Linear SVM: an SVM classifier with a linear kernel based on designed features; (3) RBF SVM: an SVM classifier with an RBF kernel based on designed features; (4) Random Forest: a random forest classifier that relies on designed features; (5) Son et al. [25]: a hierarchical LSTM sequence model designed specifically for CEA; (6) H-BiLSTM + BERTFootnote 3\(^{,}\)Footnote 4: model (5) with a fine-tuned language model (BERT), which has been shown to improve performance on some other classification tasks; (7) H-Atten.: a widely used Bi-LSTM model that captures hierarchical key information with a hierarchical attention mechanism; (8) Our model: our proposed pyramid salient-aware network (PSAN). Furthermore, we evaluate models (5), (7), and (8) on the supplemental datasets to prove the effectiveness of our proposed model. Additionally, we design different ablation experiments to demonstrate the effectiveness of the bottom word-level salient-aware module (B-WSM), the top discourse-level salient-aware module (T-DSM), and the influence of different depths in the syntactic-centric graph.

4.1 Main Results

Table 1. Comparisons of the state-of-the-art methods on causal explanation detection.

Table 1 shows the comparison results on the Facebook dataset and two supplementary datasets. From the results, we have the following observations.

  (1) Compared with the current best feature-based and neural-based models on CED, Lin et al. [19], Linear SVM, and Son et al. [25], our model improves the F1 score by 23.0, 7.7, and 11.0 points, respectively. This illustrates that the pyramid salient-aware network (PSAN) can effectively extract and incorporate the word-level key relations and discourse-level key information in terms of explanatory semantics to detect causal explanations. Furthermore, compared with the widely used hierarchical attention model (H-Atten.), our model improves the F1 score by 5.9 points. This confirms the statement in Sect. 1 that directly employing the relations between words via the syntactic structure is more effective than learning them implicitly.

  (2) Comparing Son et al. [25] with its pre-trained language model variant (H-BiLSTM+BERT), there is a 9.2-point improvement in F1. This illustrates that a pre-trained language model (LM) can capture some causal explanatory semantics from a large-scale corpus. Furthermore, our model improves performance by a further 1.8 points over H-BiLSTM+BERT. We believe the reason is that the LM is pre-trained on large-scale regular sentences that do not exclusively contain causal semantics, so it is less specifically suited to CED than our model, which is designed for explanatory semantics. In addition, H-Atten. performs better than Son et al. [25], which indicates that focusing on salient keywords and key discourses helps in understanding explanatory semantics.

  (3) It is worth noting that, setting our proposed model aside, the comparison between Linear SVM and Son et al. [25] shows that a simple feature-based classifier is better than a simple deep learning model for CED on the Facebook dataset. However, when syntactic-centric features are combined with deep learning, we achieve a significant improvement. In other words, our model can effectively combine the interpretable information of feature-based models with the deep understanding of deep learning models.

  (4) To further prove the effectiveness of the proposed model, we evaluate it on supplemental messages with causal semantics from other datasets (PDTB-CED and BECauSE-CED). As shown in Table 1, the proposed model performs significantly better than Son et al. [25] and H-Atten. on these two datasetsFootnote 5, which further demonstrates its effectiveness.

  (5) Moreover, our model trains twice as fast as Son et al. [25] because the computation of self-attention and the GCN is parallelizable. This illustrates that our model consumes less time while achieving a significant improvement in causal explanation detection. In addition, compared with feature-based models, neural-based models rely less on manually designed features.

Table 2. Effectiveness of B-WSM. (w/o B-WSM denotes the model without B-WSM. \(\varvec{+}\) denotes replacing the B-WSM with the module after \(\varvec{+}\). root denotes using the encoded representation of the root word in each discourse to represent it. ave denotes using the average encoded representation of the words in a discourse to represent it.)

4.2 Effectiveness of Bottom Word-Level Salient-Aware Module (B-WSM)

Table 2 shows the effectiveness of the salient information contained in the keywords of each discourse, captured via the proposed B-WSM, for causal explanation detection (Sect. 3.2). The results illustrate that B-WSM can effectively capture the salient information which contains most of the causal explanatory semantics. It is worth noting that when the average encoded word representation is used to represent each discourse (w/o B-WSM + ave), the model also achieves acceptable performance. This confirms the conclusion of Son [25] that the average word representation at the word level contains certain causal explanatory semantics. Furthermore, the root word of each discourse alone also contains some causal semantics (w/o B-WSM + root), which supports capturing salient information from the keywords via syntactic dependency.

4.3 Effectiveness of Top Discourse-Level Salient-Aware Module (T-DSM)

Table 3 shows the effectiveness of the salient information of the key discourses modified and incorporated via T-DSM for causal explanation detection (Sect. 3.3). The comparison with w/o T-DSM + seq D illustrates that our T-DSM can effectively modify the dominance of different discourses based on the global semantic constraint via an attention mechanism to enhance the causal explanatory semantics. Specifically, the results of w/o T-DSM + ave S/D show that both the discourse-level representations and the global representation contain useful causal explanatory semantics, which further proves the effectiveness of the proposed T-DSM.

Table 3. Effectiveness of T-DSM. (w/o T-DSM denotes models without T-DSM. \(\varvec{+}\) denotes replacing the T-DSM with the module after \(\varvec{+}\). seq D denotes mapping the representations of the discourses through a sequential LSTM to represent the whole message. ave S/D denotes using the average encoded representation of the words in the message and its discourses to represent the whole message.)

4.4 Comparisons of Different Depths of Syntactic-Centric Semantic

To demonstrate the influence of the causal explanatory semantics contained in syntactic-centric graphs of different depths, we compare the performance of our proposed model with different numbers of GCN layers. As shown in Fig. 3, when the number of GCN layers is 2, the most useful syntactic-centric information is captured for causal explanation detection.

Fig. 3. Comparisons of different numbers of GCN layers.

4.5 Error Analysis

As shown in Fig. 4, we identify two main difficulties in this task:

Fig. 4. Predictions of the proposed model.

  (1) Emotional tendency. The same expression can convey different semantics under different emotional tendencies, especially in this kind of colloquial expression. For M2 in Fig. 4, make 8 blankets expresses anger about not do any homework, and our model wrongly predicts that make 8 blankets is the reason for not do any homework.

  (2) Excessive semantic parsing. Excessive parsing of causal intent by the model leads to messages that do not contain causal explanations being identified as containing them. As shown in Fig. 4, M3 means that pancakes are awesome, but the model overinterprets pancake as the reason for awesome.

5 Conclusion

In this paper, we devise a pyramid salient-aware network (PSAN) to detect causal explanations in messages. PSAN can effectively learn the key relations between words at the word level and further filter the key information at the discourse level in terms of explanatory semantics. Specifically, we propose a bottom word-level salient-aware module to capture the salient semantics of discourses contained in their keywords based on the syntactic-centric graph. We also propose a top discourse-level salient-aware module to modify the dominance of different discourses under the global explanatory semantic constraint via an attention mechanism. Experimental results on openly accessible, commonly used datasets show that our model achieves the best performance.