
1 Introduction

Causal explanation detection (CED) aims to detect whether a causal explanation is present in a given message (e.g., a group of sentences). Linguistically, coherence relations in a message explain how the meanings of different textual units combine to jointly build the discourse meaning of the larger unit. Explanation is an important coherence relation, referring to a textual unit (e.g., a discourse) in a message that expresses explanatory semantics [12]. As shown in Fig. 1, M1 can be divided into three discourses, and D2 is the explanation, expressing why it is advantageous for the equipment to operate at these temperatures. CED is important for tasks that require an understanding of textual expression [25]. For example, in question answering, the answers to questions are most likely to be found in groups of sentences that contain causal explanations [22]. Furthermore, the summarization of event descriptions can be improved by selecting causally motivated sentences [9]. Therefore, CED is a problem worthy of further study.

Fig. 1. Instance of causal explanation analysis (CEA). The top part is a message which contains its segmented discourses and a causal explanation. The bottom part is the syntactic dependency structures of the three discourses divided from M1.

Existing methods mostly regard this task as a classification problem [25]. At present, there are two main kinds of methods, feature-based and neural-based, for similar discourse-level semantic understanding tasks such as opinion sentiment classification and discourse parsing [11, 21, 27]. Feature-based methods can extract features of the relations between discourses. However, they do not handle implicit instances that lack explicit features well. For CED, as shown in Fig. 1, D2 lacks explicit features such as because of or due to, as well as tense features, which makes it difficult for feature-based methods. Neural-based methods mainly include the Tree-LSTM model [30] and the hierarchical Bi-LSTM model [25]. Tree-LSTM models learn the relations between words to capture the semantics of discourses more accurately but lack further understanding of the semantics between discourses. Hierarchical Bi-LSTM models can exploit sequence structure to implicitly learn the relations between words and discourses. However, previous work shows that, compared with Tree-LSTM, Bi-LSTM lacks a direct understanding of the dependency relations between words, so implicitly learning inter-word relations is not prominent in tasks that require understanding the semantic relations of messages [16]. Therefore, how to directly and effectively learn the relations between words, and how to consider discourse-level correlation to further filter the key information, are questions worth studying.

Going one step further, why do the relations between words imply the semantics of the message and its discourses? From the view of computational semantics, the meaning of a text lies not only in the meanings of its words but also in the relations, order, and aggregation of those words. Put simply, the meaning of a text is partially based on its syntactic structure [12]. In CED specifically, the core and subsidiary words of a discourse carry its basic semantics. For example, for D1 in Fig. 1, following the word order in the syntactic structure, we can capture that the ability to work at these temperatures is advantageous; that is, we can understand the basic semantics of D1, which expresses that some kind of ability is advantageous, via the root word advantageous and its affiliated words. Additionally, why are the correlation and key information at the discourse level so important for capturing the causal explanatory semantics of a message? Through observation, different discourses have different statuses with respect to the explanatory semantics of a message. For example, in M1, combined with D1, D2 expresses the explanatory semantics of why the ability to work at these temperatures is advantageous, while D3 expresses transitional semantics. In detail, D1 and D2 are the keys to the explanatory semantics of M1, and if D1, D2, and D3 are not treated differently, the transitional semantics of D3 can interfere with understanding the explanatory semantics of M1. Therefore, how to make better use of the information of keywords in the syntactic structure, and how to pay more attention to the discourses that are key to explanatory semantics, are the problems to be solved.

To this end, we propose a Pyramid Salient-Aware Network (PSAN) which utilizes keywords in the syntactic structure of each discourse and focuses on the discourses that are critical to explanatory semantics to detect causal explanations in messages. First, what are the keywords in a syntactic structure? From the perspective of syntactic dependency, the root word is the central element that dominates the other words while not being dominated by any of them; all other words are subordinate to the root word [33]. Accordingly, the root word and its subsidiary words in the dependency structure are the keywords of each discourse at the syntax level. Specifically, we sample 100 positive sentences from the training data to examine whether the keywords obtained through syntactic dependency contain the causal explanatory semantics, and we find that the causal explanatory semantics of more than 80% of the sentences can be captured by keywords in the dependency structureFootnote 1. Therefore, we extract the root word and its surrounding words in the syntactic dependency of each discourse as its keywords.

Next, we need to consider how to make better use of the information of keywords contained in the syntactic structure. To pay more attention to keywords, a common way is to use attention mechanisms to increase their attention weights. However, such implicitly learned attention is not very interpretable. Inspired by previous research [1, 29], we propose a bottom graph-based word-level salient network which merges the syntactic dependency to capture the salient semantics of discourses contained in their keywords. Finally, how do we consider the correlation at the discourse level and pay more attention to the discourses that are key to the explanatory semantics? Inspired by previous work [18], we propose a top attention-based discourse-level salient network to focus on the key discourses in terms of explanatory semantics.

In summary, the contributions of this paper are as follows:

  • We design a Pyramid Salient-Aware Network (PSAN) to detect causal explanations of messages which can effectively learn the pivotal relations between keywords at word level and further filter the key information at discourse level in terms of explanatory semantics.

  • PSAN can assist in causal explanation detection via capturing the salient semantics of discourses contained in their keywords with a bottom graph-based word-level salient network. Furthermore, PSAN can modify the dominance of discourses via a top attention-based discourse-level salient network to enhance explanatory semantics of messages.

  • Experimental results on openly accessible, commonly used datasets show that our model achieves the best performance. Our experiments also demonstrate the effectiveness of each module.

2 Related Work

Causal Semantic Detection: Recently, causality detection, which detects specific causes and effects and the relations between them, has received increasing attention, e.g., the work of Li [17], Zhang [35], Bekoulis [2], Do [5], Riaz [23], Dunietz [6] and Sharp [24]. Specifically, to extract causal explanatory semantics from messages at a general level, some studies capture the causal semantics in messages from the perspective of discourse structure, such as capturing counterfactual conditionals from social messages with PDTB discourse relation parsing [26], a pre-trained model with the Rhetorical Structure Theory Discourse Treebank (RSTDT) for exploiting discourse structures on movie reviews [10], and a two-step interactive hierarchical Bi-LSTM framework [32] for extracting emotion-cause pairs in messages. Furthermore, Son [25] defines the causal explanation analysis (CEA) task to extract causal explanatory semantics in messages and annotates a dataset for downstream tasks. In this paper, we focus on causal explanation detection (CED), which is the fundamental and important subtask of CEA.

Syntactic Dependency with Graph Network: Syntactic dependency is a vital linguistic feature for natural language processing (NLP). Several studies employ syntactic dependency, such as retrieving question answering passages with the aid of syntactic dependency [4] and mining opinions with syntactic dependency [31]. For tasks that extract causal semantics from text, dependency syntactic information may evoke causal relations between discourse units in the text [8]. Recently, some studies [20, 34] convert the syntactic dependency into a graph and apply a graph convolutional network (GCN) [14] to effectively capture the syntactic dependency semantics between words in context, such as a semantic role labeling model with GCN [20] and a GCN-based model assisted with syntactic dependency to improve relation extraction [34]. In this paper, we capture the salient explanatory semantics based on the syntactic-centric graph.

3 Methodology

The architecture of our proposed model is illustrated in Fig. 2. The Pyramid Salient-Aware Network (PSAN) primarily involves the following three components: (i) the input processing module (IPM), which processes and encodes the input message and its discourses via a self-attention module; (ii) the bottom word-level salient-aware module (B-WSM), which captures the salient semantics of discourses contained in their keywords based on the syntactic-centric graph; (iii) the top discourse-level salient-aware module (T-DSM), which modifies the dominance of different discourses based on the message-level constraint in terms of explanatory semantics via an attention mechanism, and obtains the final causal explanatory representation of the input message s.

Fig. 2. The structure of PSAN. The left side is the detail of the bottom word-level salient-aware module (B-WSM), the top of the right side is the top discourse-level salient-aware module (T-DSM), and the bottom of the right side is the input processing module (IPM).

3.1 Input Processing Module

In this component, we split the input message s into discourses d. Specifically, we utilize a self-attention encoder to encode input messages and their corresponding discourses.

Discourse Extraction. As shown in Fig. 1, we split the message into discourses with the same segmentation method as Son [25], based on semantic coherence. In detail, first, we regard commas, periods, exclamation marks, and question marks as discourse markers. Next, we also extract the discourse connective set from PDTB2 and add the connectives as discourse markers. Specifically, we remove some simple connectives (e.g., and in "I like running and basketball") from the extracted discourse markers. Finally, we divide messages into discourses at the discourse markers.
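Below is a minimal Python sketch of this segmentation step, assuming a small hand-picked connective list; the actual system uses the full connective set extracted from PDTB2 and the segmentation rules of Son [25].

```python
import re

# Hypothetical marker set: sentence punctuation plus a few PDTB-style
# connectives; the real system extracts the full connective list from PDTB2.
CONNECTIVES = {"because", "so", "but", "although", "since"}

def segment_message(message):
    """Split a message into discourses at punctuation and connective markers."""
    # Split on punctuation first, keeping non-empty spans.
    spans = [s.strip() for s in re.split(r"[,.!?]", message) if s.strip()]
    discourses = []
    for span in spans:
        current = []
        for tok in span.split():
            # Start a new discourse when a connective (other than simple
            # coordination like "and") is encountered mid-span.
            if tok.lower() in CONNECTIVES and current:
                discourses.append(" ".join(current))
                current = [tok]
            else:
                current.append(tok)
        if current:
            discourses.append(" ".join(current))
    return discourses

print(segment_message(
    "The ability to work at these temperatures is advantageous, "
    "because the devices need less thermal insulation."))
```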

Embedding Layer. For the input message \(s=\{s_1,...,s_n\}\) and the discourses \(d=\{d_1^d,...,d_m^d\}\) separated from s, we look up the embedding vector of each word \(s_n\) (\(d_m^d\)) as \(\varvec{s_n}\) (\(\varvec{d_m^d}\)) in the pre-trained embeddings. Finally, we obtain the word representation sequence \(\textit{\textbf{s}}=\{\varvec{s_1},...,\varvec{s_n}\}\) of message s and \(\textit{\textbf{d}}=\{\varvec{d_1^d},...,\varvec{d_m^d}\}\) of each discourse d corresponding to s.

Word Encoding. Inspired by the application of self-attention to multiple tasks [3, 28], we exploit a multi-head self-attention encoder to encode the input words. The scaled dot-product attention can be described as follows:

$$\begin{aligned} \text {Attention}(\textit{\textbf{Q}}, \textit{\textbf{K}}, \textit{\textbf{V}})=\text {softmax}\left( \frac{\textit{\textbf{Q}} \textit{\textbf{K}}^{T}}{\sqrt{dim_{k}}}\right) \textit{\textbf{V}} \end{aligned}$$
(1)

where \(\textit{\textbf{Q}} \in \mathbb {R}^{N \times 2dim_{h}}\), \(\textit{\textbf{K}} \in \mathbb {R}^{N \times 2 dim_{h}}\) and \(\textit{\textbf{V}} \in \mathbb {R}^{N \times 2 dim_{h}}\) are the query, key and value matrices, respectively. In our setting, \(\textit{\textbf{Q}} = \textit{\textbf{K}} = \textit{\textbf{V}} = \textit{\textbf{s}}\) for encoding the message, and \(\textit{\textbf{Q}} = \textit{\textbf{K}} = \textit{\textbf{V}} = \textit{\textbf{d}}\) for encoding a discourse.

Multi-head attention first projects the queries, keys, and values h times by using different linear projections. The results of attention are concatenated and once again projected to get the final representation. The formulas are as follows:

$$\begin{aligned} head_{i}=\text {Attention}\left( \textit{\textbf{Q}} \mathbf {W}_{i}^{Q}, \textit{\textbf{K}} \mathbf {W}_{i}^{K}, \textit{\textbf{V}} \mathbf {W}_{i}^{V}\right) \end{aligned}$$
(2)
$$\begin{aligned} \begin{aligned} \mathbf {H}^{\prime }&=\left( h e a d_{i} \oplus \ldots \oplus h e a d_{h}\right) \mathbf {W}_{o} \end{aligned} \end{aligned}$$
(3)

where \(\mathbf {W}_{i}^{Q} \in \mathbb {R}^{2 dim_{h} \times dim_{k}}\), \(\mathbf {W}_{i}^{K} \in \mathbb {R}^{2 dim_{h} \times dim_{k}}\), \(\mathbf {W}_{i}^{V} \in \mathbb {R}^{2 dim_{h} \times dim_{k}}\) and \(\mathbf {W}_{o} \in \mathbb {R}^{2 dim_{h} \times 2 dim_{h}}\) are projection parameters and \(dim_{k}=2 dim_{h} / h\). The outputs are the encoded message \(\textit{\textbf{H}}_{S}^{ed}=\{\textit{\textbf{h}}_{s_1}^{ed},...,\textit{\textbf{h}}_{s_n}^{ed}\}\) and the encoded discourse \(\textit{\textbf{H}}_{D^d}^{ed}=\{\textit{\textbf{h}}_{d^d_1}^{ed},...,\textit{\textbf{h}}_{d^d_m}^{ed}\}\).
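To make the encoder concrete, the following is a minimal PyTorch sketch of the multi-head self-attention in Eqs. (1)-(3). The model width and head count are illustrative assumptions; the paper only specifies that the input width is \(2dim_{h}\) and \(dim_{k}=2dim_{h}/h\).

```python
import math
import torch
import torch.nn as nn

class SelfAttentionEncoder(nn.Module):
    """Multi-head self-attention encoder mirroring Eqs. (1)-(3)."""
    def __init__(self, dim_model=100, num_heads=4):   # dim_model = 2*dim_h (assumed)
        super().__init__()
        assert dim_model % num_heads == 0
        self.dim_k = dim_model // num_heads            # dim_k = 2*dim_h / h
        self.num_heads = num_heads
        self.w_q = nn.Linear(dim_model, dim_model, bias=False)
        self.w_k = nn.Linear(dim_model, dim_model, bias=False)
        self.w_v = nn.Linear(dim_model, dim_model, bias=False)
        self.w_o = nn.Linear(dim_model, dim_model, bias=False)

    def forward(self, x):                              # x: (N, dim_model)
        n, d = x.size()
        # Project and reshape into (heads, N, dim_k), Eq. (2).
        q = self.w_q(x).view(n, self.num_heads, self.dim_k).transpose(0, 1)
        k = self.w_k(x).view(n, self.num_heads, self.dim_k).transpose(0, 1)
        v = self.w_v(x).view(n, self.num_heads, self.dim_k).transpose(0, 1)
        # Scaled dot-product attention, Eq. (1).
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.dim_k)
        heads = torch.matmul(torch.softmax(scores, dim=-1), v)
        # Concatenate heads and project, Eq. (3).
        return self.w_o(heads.transpose(0, 1).reshape(n, d))

encoder = SelfAttentionEncoder()
h_ed = encoder(torch.randn(30, 100))                   # encoded discourse words
```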

3.2 Bottom Word-Level Salient-Aware Module

In this component, we aim to capture the salient semantics of discourses contained in their keywords based on syntactic-centric graphs. For each discourse, the module first extracts the syntactic dependency and constructs the syntactic-centric graph. Second, it collects the keywords and their inter-relations to capture the discourse-level salient semantics based on the syntactic-centric graph.

Syntactic-Centric Graph Construction. We construct a syntactic-centric graph for each discourse based on its syntactic dependency to assist in capturing the semantics of discourses. We utilize the Stanford CoreNLP toolFootnote 2 to extract the syntactic dependency of each discourse and convert it into a syntactic-centric graph. Specifically, in the syntactic-centric graph, the nodes represent words, and the edges represent whether there is a dependency relation between two words. As shown in subplot (a) of Fig. 2, need is the root word in the syntactic dependency of "the devices need less thermal insulation" (D2 in S1), and words which are syntactically dependent on each other are connected with solid lines.

Keywords Collection and Salient Semantic Extraction. For each discourse, we collect the keywords based on the syntactic-centric graph and capture the salient semantics from these keywords. Firstly, as illustrated in Sect. 1, we take the root word and the affiliated words connected to the root word within k hops as the keywords. For example, as shown in Fig. 2, when \(k=1\) the keywords are {need, devices, insulation}, and when \(k=2\) the keywords are {need, devices, insulation, the, thermal}. Secondly, inspired by previous works, we utilize a k-layer graph convolutional network (GCN) [14] to encode the k-hop connected keywords based on the syntactic-centric graph. For example, when \(k=1\), we encode the 1-hop keywords with a 1-layer GCN to capture the salient semantics. By changing the value of k, we can capture salient semantics of different degrees. However, a larger k does not necessarily capture deeper salient semantics; conversely, the larger k is, the more noise is likely to be introduced. For example, when \(k=1\), need, devices and insulation are enough to express the salient semantics of D2 (working at these temperatures needs less insulation). Finally, we select the representation of the root word in the final layer as the discourse-level representation, which contains the salient semantics.
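To make the keyword collection concrete, the following Python sketch builds the syntactic-centric graph from dependency heads and gathers the k-hop keywords around the root. The head indices below are an assumption chosen to match the dependency structure drawn in Fig. 2; in the actual pipeline they come from the Stanford CoreNLP parse.

```python
from collections import deque

def build_adjacency(heads):
    """Build a symmetric adjacency list from dependency heads.

    heads[i] is the index of token i's head, or -1 for the root.
    """
    adj = [[] for _ in heads]
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i].append(h)
            adj[h].append(i)
    return adj

def k_hop_keywords(tokens, heads, k=1):
    """Collect the root word and all words within k hops of it (BFS)."""
    root = heads.index(-1)
    adj = build_adjacency(heads)
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if dist[node] == k:          # do not expand beyond k hops
            continue
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return [tokens[i] for i in sorted(dist)]

# "the devices need less thermal insulation" (D2 of S1); the heads are assumed
# to follow the structure in Fig. 2, with "need" as the root.
tokens = ["the", "devices", "need", "less", "thermal", "insulation"]
heads  = [1, 2, -1, 4, 5, 2]
print(k_hop_keywords(tokens, heads, k=1))  # ['devices', 'need', 'insulation']
print(k_hop_keywords(tokens, heads, k=2))  # ['the', 'devices', 'need', 'thermal', 'insulation']
```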

The graph convolutional network (GCN) [14] is a generalization of the CNN [15] for encoding graphs. In detail, given a syntactic-centric graph with v nodes, we utilize a \(v \times v\) adjacency matrix \(\textit{\textbf{A}}\), where \(A_{ij} = 1\) if there is an edge between node i and node j. In each layer of the GCN, the input for each node is the output \(\textit{\textbf{h}}_i^{k-1}\) of the previous layer (the input of the first layer is the original encoded input words and features), and the output of node i at the k-th layer is \(\textit{\textbf{h}}_i^k\). The formula is as follows:

$$\begin{aligned} \textit{\textbf{h}}_{i}^{k}=\sigma \left( \sum _{j=1}^{v} A_{ij} W^{k} \textit{\textbf{h}}_{j}^{k-1}+b^{k}\right) \end{aligned}$$
(4)

where \(W^k\) is a linear transformation matrix, \(b^k\) is a bias term and \(\sigma \) is a nonlinear function.

However, naively applying the graph convolution operation in Equation (4) could lead to node representations with drastically different magnitudes, because the degree of a token varies a lot. Moreover, the information in \(h_i^{k-1}\) may never be carried over to \(h_i^k\), because nodes never connect to themselves in a dependency graph [34]. To resolve these issues, we follow the method of Zhang [34], which normalizes the activations in the GCN and adds self-loops to each node in the graph:

$$\begin{aligned} \textit{\textbf{h}}_{i}^{k}=\sigma \left( \sum _{j=1}^{v} \tilde{A}_{ij} W^{k} \textit{\textbf{h}}_{j}^{k-1} / d_{i}+b^{k}\right) \end{aligned}$$
(5)

where \(\tilde{\mathbf {A}}=\mathbf {A}+\mathbf {I}\), \(\mathbf {I}\) is the \(v \times v\) identity matrix, and \(d_{i}=\sum _{j=1}^{v} \tilde{A}_{i j}\) is the degree of word i in the graph.

Finally, we select the representation \(\textit{\textbf{h}}^{k}_{d_{root}}\) of the root word in the final GCN layer as the salient representation of the d-th discourse in message s. For example, as shown in subplot (b) of Fig. 2, we choose the representation of need in the final layer as the salient representation of the discourse "the devices need less thermal insulation".
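The following PyTorch sketch implements the normalized GCN of Eq. (5) over a syntactic-centric graph and reads out the root word's final-layer state as the discourse representation. The dimensions, the ReLU nonlinearity, and the toy adjacency are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SyntacticGCN(nn.Module):
    """k-layer GCN over the syntactic-centric graph, following Eq. (5):
    self-loops are added and activations are normalized by node degree."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, h, adj):
        # adj: (v, v) 0/1 adjacency matrix; add self-loops (A~ = A + I).
        a_tilde = adj + torch.eye(adj.size(0))
        degree = a_tilde.sum(dim=1, keepdim=True)          # d_i
        for layer in self.layers:
            # h_i^k = sigma( sum_j A~_ij (W^k h_j^{k-1} + b^k) / d_i )
            h = torch.relu(a_tilde.matmul(layer(h)) / degree)
        return h

# Hypothetical usage: v encoded words of one discourse and its dependency graph;
# the root word's final-layer state serves as the discourse representation.
v, dim, root_idx = 6, 100, 2
h_words = torch.randn(v, dim)
adj = torch.zeros(v, v)
for i, j in [(0, 1), (1, 2), (3, 4), (4, 5), (5, 2)]:
    adj[i, j] = adj[j, i] = 1.0
gcn = SyntacticGCN(dim, num_layers=2)
h_root = gcn(h_words, adj)[root_idx]                       # salient discourse vector
```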

3.3 Top Discourse-Level Salient-Aware Module

How can we make better use of the relations between discourses and extract the message-level salient semantics? We modify the dominance of different discourses based on the message-level constraint in terms of explanatory semantics via an attention mechanism. First, we extract the global semantics of message s, which contain its causal explanatory tendency. Next, we modify the dominance of different discourses based on the global semantics. Finally, we combine the modified representations to obtain the final causal explanatory representation of the input message s.

Global Semantic Extraction. Inspired by previous research [25], the average of the encoded representations of all the words in a message can represent its overall semantics simply and effectively. We apply average pooling to the encoded representation \(\textit{\textbf{H}}_{S}^{ed}\) of message s to obtain the global representation, which contains the global semantics of its causal explanatory tendency. The formula is as follows:

$$\begin{aligned} \textit{\textbf{h}}_{s}^{glo}=\frac{1}{n} \sum _{i=1}^{n} \textit{\textbf{h}}_{s_i}^{ed} \end{aligned}$$
(6)

where \(\textit{\textbf{h}}_{s}^{glo}\) is the global representation of message s via average pooling operation and n is the number of words.

Dominance Modification. We modify the dominance of different discourses based on the global semantics, which contain the causal explanatory tendency, via an attention mechanism. In detail, after obtaining the global representation \(\textit{\textbf{h}}_{s}^{glo}\), we modify the salient representations \(\textit{\textbf{h}}^{k}_{d_{root}}\) of the discourses d under the constraint of \(\textit{\textbf{h}}_{s}^{glo}\). Finally, we obtain the final causal representation \(\textit{\textbf{h}}^{caul}_{s}\) of message s via the attention mechanism:

$$\begin{aligned} \alpha _{ss} = \textit{\textbf{h}}_{s}^{glo} \textit{\textbf{W}}_{f} (\textit{\textbf{h}}_{s}^{glo})^T \end{aligned}$$
(7)
$$\begin{aligned} \alpha _{sd} = \textit{\textbf{h}}_{s}^{glo} \textit{\textbf{W}}_{f} (\textit{\textbf{h}}_{d_{root}}^k)^T \end{aligned}$$
(8)
$$\begin{aligned} \begin{bmatrix} \alpha _{ss}^{'}, \cdots , \alpha _{sd}^{'} \end{bmatrix} = softmax([\alpha _{ss}, ..., \alpha _{sd}]) \end{aligned}$$
(9)
$$\begin{aligned} \textit{\textbf{h}}^{caul}_{s} = \alpha _{ss}^{'} \textit{\textbf{h}}_{s}^{glo} +...+\alpha _{sd}^{'} \textit{\textbf{h}}_{d_{root}}^k, \end{aligned}$$
(10)

where \(\textit{\textbf{W}}_{f}\) is a linear transformation matrix and \(\alpha _{ss}^{'}\), \(\alpha _{sd}^{'}\) are the attention weights. Finally, we map \(\textit{\textbf{h}}^{caul}_{s}\) into a two-dimensional vector and obtain the output via a softmax operation.
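The following PyTorch sketch covers T-DSM end to end (Eqs. (6)-(10)): average pooling for the global vector, bilinear scoring against each discourse's root representation, and the attention-weighted mixture fed to a binary softmax. The dimensions are assumptions, and the bilinear score is written as a learned linear map of the global vector followed by dot products, which matches Eqs. (7)-(8) up to a transpose of \(\textit{\textbf{W}}_{f}\).

```python
import torch
import torch.nn as nn

class DiscourseSalientAware(nn.Module):
    """Sketch of T-DSM: global pooling, dominance modification, classification."""
    def __init__(self, dim):
        super().__init__()
        self.w_f = nn.Linear(dim, dim, bias=False)     # bilinear weight W_f
        self.classifier = nn.Linear(dim, 2)            # binary output

    def forward(self, h_message, h_discourses):
        # h_message: (n, dim) encoded words; h_discourses: (num_d, dim) root vectors.
        h_glo = h_message.mean(dim=0)                  # Eq. (6), global semantics
        candidates = torch.cat([h_glo.unsqueeze(0), h_discourses], dim=0)
        scores = candidates.matmul(self.w_f(h_glo))    # Eqs. (7)-(8), bilinear scores
        alpha = torch.softmax(scores, dim=0)           # Eq. (9), attention weights
        h_causal = (alpha.unsqueeze(1) * candidates).sum(dim=0)   # Eq. (10)
        return torch.softmax(self.classifier(h_causal), dim=-1)

model = DiscourseSalientAware(dim=100)
probs = model(torch.randn(20, 100), torch.randn(3, 100))  # P(causal explanation)
```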

4 Experiment

Dataset. We mainly evaluate our model on the dataset devoted to causal explanation analysis released by Son [25]. This dataset contains 3,268 messages, consisting of 1,598 positive messages that contain a causal explanation and 1,670 randomly selected negative messages. Annotators mark which messages contain causal explanations and which text spans are the causal explanations (a discourse that tends to explain something). Following Son [25], we use the same 80% of the dataset for training, 10% for tuning, and 10% for evaluation. Additionally, to further demonstrate the effectiveness of our proposed model, we regard sentences with causal discourse relations in PDTB2 and sentences containing causal span pairs in the BECauSE Corpus 2.0 [7] as supplemental messages with causal explanations to evaluate our model. In this paper, PDTB-CED and BECauSE-CED denote these two supplementary datasets, respectively.

Parameter Settings. We set the maximum lengths of a sentence and a discourse to 100 and 30, respectively. We set the batch size to 5 and the dimension of the output of each GCN layer to 50. Additionally, we utilize 50-dimensional word vectors pre-trained with GloVe. For optimization, we utilize Adam [13] with a learning rate of 0.001. We set the maximum number of training epochs to 100 and adopt an early stopping strategy based on the performance on the development set. All the results of the compared and ablated models are averaged over five independent runs.

Compared Models. We compare our proposed model with feature-based and neural-based models: (1) Lin et al. [19]: an end-to-end discourse relation parser on PDTB; (2) Linear SVM: an SVM classifier with a linear kernel based on designed features; (3) RBF SVM: an SVM classifier with an RBF kernel based on designed features; (4) Random Forest: a random forest classifier that relies on designed features; (5) Son et al. [25]: a hierarchical LSTM sequence model designed specifically for CEA; (6) H-BiLSTM + BERTFootnote 3\(^{,}\)Footnote 4: model (5) with a fine-tuned language model (BERT), which has been shown to improve performance on some other classification tasks; (7) H-Atten.: a widely used Bi-LSTM model that captures hierarchical key information with a hierarchical attention mechanism; (8) Our model: our proposed pyramid salient-aware network (PSAN). Furthermore, we evaluate models (5), (7), and (8) on the supplemental datasets to prove the effectiveness of our proposed model. Additionally, we design different ablation experiments to demonstrate the effectiveness of the bottom word-level salient-aware module (B-WSM), the top discourse-level salient-aware module (T-DSM), and the influence of different depths in the syntactic-centric graph.

4.1 Main Results

Table 1. Comparisons of the state-of-the-art methods on causal explanation detection.

Table 1 shows the comparison results on the Facebook dataset and two supplementary datasets. From the results, we have the following observations.

  (1) Compared with the current best feature-based and neural-based models on CED, Lin et al. [19], Linear SVM, and Son et al. [25], our model improves the F1 score by 23.0, 7.7, and 11.0 points, respectively. This illustrates that the pyramid salient-aware network (PSAN) can effectively extract and incorporate the word-level key relations and discourse-level key information in terms of explanatory semantics to detect causal explanations. Furthermore, compared with the widely used hierarchical attention model (H-Atten.), our model improves the F1 score by 5.9 points. This confirms the statement in Sect. 1 that directly employing the relations between words via the syntactic structure is more effective than learning them implicitly.

  (2) Comparing Son et al. [25] with its pre-trained language model variant (H-BiLSTM+BERT), there is a 9.2-point improvement in F1. This illustrates that a pre-trained language model (LM) can capture some causal explanatory semantics from a large-scale corpus. Furthermore, our model improves performance by a further 1.8 points over H-BiLSTM+BERT. We believe the reason is that the LM is pre-trained on large-scale regular sentences that do not exclusively contain causal semantics, so it is less specifically suited to CED than our model, which is designed for explanatory semantics. In addition, H-Atten. performs better than Son et al. [25], which indicates that focusing on salient keywords and key discourses helps in understanding explanatory semantics.

  (3) It is worth noting that, setting our proposed model aside, the comparison between Linear SVM and Son et al. [25] shows that a simple feature-based classifier is better than a simple deep learning model for CED on the Facebook dataset. However, when syntactic-centric features are combined with deep learning, we achieve a significant improvement. In other words, our model can effectively combine the interpretable information of feature-based models with the deep understanding of deep learning models.

  (4) To further prove the effectiveness of the proposed model, we evaluate it on supplemental messages with causal semantics from other datasets (PDTB-CED and BECauSE-CED). As shown in Table 1, the proposed model performs significantly better than Son et al. [25] and H-Atten. on these two datasetsFootnote 5, which further demonstrates its effectiveness.

  (5) Moreover, our model trains twice as fast as Son et al. [25] because the computation of self-attention and the GCN is parallelizable. This illustrates that our model consumes less time while achieving a significant improvement in causal explanation detection. In addition, compared with feature-based models, neural-based models rely less on manually designed features.

Table 2. Effectiveness of B-WSM. (w/o B-WSM denotes the model without B-WSM. \(\varvec{+}\) denotes replacing the B-WSM with the module after \(\varvec{+}\). root denotes using the encoded representation of the root word in each discourse to represent it. ave denotes using the average encoded representation of the words in a discourse to represent it.)

4.2 Effectiveness of Bottom Word-Level Salient-Aware Module (B-WSM)

Table 2 shows the effectiveness of the salient information contained in the keywords of each discourse, captured via the proposed B-WSM, for causal explanation detection (Sect. 3.2). The results illustrate that B-WSM can effectively capture the salient information which contains most of the causal explanatory semantics. It is worth noting that when the average encoded word representation is used to represent each discourse (w/o B-WSM + ave), the model also achieves acceptable performance. This confirms the conclusion of Son [25] that the average word representation at the word level contains certain causal explanatory semantics. Furthermore, the root word of each discourse alone also contains some causal semantics (w/o B-WSM + root), which supports capturing salient information from the keywords via syntactic dependency.

4.3 Effectiveness of Top Discourse-Level Salient-Aware Module (T-DSM)

Table 3 shows the effectiveness of the salient information of the key discourses modified and incorporated via T-DSM for causal explanation detection (Sect. 3.3). The comparison with w/o T-DSM + seq D illustrates that our T-DSM can effectively modify the dominance of different discourses based on the global semantic constraint via an attention mechanism to enhance the causal explanatory semantics. Specifically, the results of w/o T-DSM + ave S/D show that both the discourse-level representations and the global representation contain useful causal explanatory semantics, which further proves the effectiveness of the proposed T-DSM.

Table 3. Effectiveness of T-DSM. (w/o T-DSM denotes models without T-DSM. \(\varvec{+}\) denotes replacing the T-DSM with the module after \(\varvec{+}\). seq D denotes mapping the representations of the discourses through a sequential LSTM to represent the whole message. ave S/D denotes using the average encoded representation of the words in the message and its discourses to represent the whole message.)

4.4 Comparisons of Different Depths of Syntactic-Centric Semantic

To demonstrate the influence of the causal explanatory semantics contained in syntactic-centric graphs of different depths, we compare the performance of our proposed model with different numbers of GCN layers. As shown in Fig. 3, when the number of GCN layers is 2, the most useful syntactic-centric information is captured for causal explanation detection.

Fig. 3. Comparisons of different numbers of GCN layers.

4.5 Error Analysis

As shown in Fig. 4, we identify two main difficulties in this task:

Fig. 4. Predictions of the proposed model.

  (1) Emotional tendency. The same expression can convey different semantics under different emotional tendencies, especially in this kind of colloquial expression. For M2 in Fig. 4, make 8 blankets expresses anger about not do any homework, and our model wrongly predicts that make 8 blankets is the reason for not do any homework.

  (2) Excessive semantic parsing. Excessive parsing of causal intent by the model leads to messages that do not contain causal explanations being identified as containing them. As shown in Fig. 4, M3 means that pancakes are awesome, but the model overinterprets pancake as the reason for awesome.

5 Conclusion

In this paper, we devise a pyramid salient-aware network (PSAN) to detect causal explanations in messages. PSAN can effectively learn the key relations between words at the word level and further filter the key information at the discourse level in terms of explanatory semantics. Specifically, we propose a bottom word-level salient-aware module to capture the salient semantics of discourses contained in their keywords based on the syntactic-centric graph. We also propose a top discourse-level salient-aware module to modify the dominance of different discourses under the global explanatory semantic constraint via an attention mechanism. Experimental results on openly accessible, commonly used datasets show that our model achieves the best performance.