1 Introduction

Event causality extraction aims to identify causal relations between events in natural language text. For example, in Fig. 1, the sparks are the cause of the explosion in the sentence “The explosion was caused by sparks”. The task plays an important role in question answering (Oh et al., 2013; Dalal et al., 2021), event detection (Radinsky et al., 2012), scene generation (Hashimoto et al., 2014), and other applications. In text, causality between events is expressed in complex forms. The causal relationship between some events rests on common sense, so a sentence may lack words such as “because” and “therefore” that clearly signal the causal connection. Such relationships are called implicit causal relationships, and they pose a great challenge to event causality extraction.

Fig. 1 An example of a sentence that contains a causal relationship

Recent work (De Silva et al., 2017; Kadowaki et al., 2019) typically divides causality extraction into two steps: first selecting candidate causal event pairs from the text, and then classifying the relationships between these candidate pairs. However, such pipelines suffer from error propagation (Yan et al., 2021; Chen et al., 2020): errors in extracting candidate causal event pairs degrade the accuracy of the subsequent relationship classification. Since joint extraction can mitigate error propagation (Miwa & Bansal, 2016; Zheng et al., 2017), some researchers use end-to-end models to extract entities and relations simultaneously, letting the two sub-processes share the underlying network parameters so that information about entities and relations can interact. However, most existing work relies only on the given text to analyze causal relationships between events, which makes it difficult to discover implicit causal relationships when explicit causal cue words are absent. Moreover, although existing work considers contextual information when embedding words, features specific to the causality extraction task remain insufficient, so it is difficult to highlight the words in a sentence that are likely to be causally related.

To this end, we introduce external knowledge from a knowledge graph into our model, aiming to enhance its ability to uncover implicit causal links between events by adding knowledge related to those events. Moreover, we integrate the character features and POS properties of words to address the problem of insufficient causal features. A knowledge graph is a semantic network that contains rich entity relationships; through its triples, we can obtain knowledge related to event entities. For example, background knowledge related to “hurricane” can be described as (hurricane - IsA - natural disaster), (hurricane - Causes - house collapse), etc. The model can use the knowledge associated with these events to deduce hidden causal relationships when the text contains no explicit causal cue words, and thus improve the extraction of implicit causal relationships. In addition, most event words indicating causality are nouns and verbs, so the POS of a word is strongly correlated with its causal label. Some causal words also share similar morphological features, as shown in Fig. 2. We therefore capture the character-level morphological features of words and combine them with POS features to enhance the word representation and better distinguish causal event words from other words.

Fig. 2 Some causal words with similar character structures

Specifically, to model the event-related knowledge representation, we consider the neighbors of the current event in the knowledge graph and encode its related knowledge representation from the relations and weights among those nodes. Meanwhile, we use a convolutional neural network to obtain character-level features of the input words and combine them with POS features to obtain an enhanced word representation. After obtaining the related knowledge representation and the enhanced word representation, we fuse them with the word representation generated by BERT (Kenton & Toutanova, 2019) and feed the result into a Bi-directional Gated Recurrent Unit (Bi-GRU) (Cho et al., 2014) to capture global features based on the sentence context. Finally, a Conditional Random Field (CRF) (Lafferty et al., 2001) predicts the causal role label for each word in the text.

The contributions of this paper can be summarized as follows:

  • To solve the problem of lacking explicit causal correlation words, we introduce external knowledge into our model, so that the model can use event-related knowledge to establish implicit causal links between events.

  • To solve the problem that word representations lack features related to the causality extraction task, we propose a word information enhancement method that derives additional information from a word’s POS and character features to highlight possible causal-related words in the sentence.

  • Experimental results and analysis indicate that our proposed model (KLPWE) achieves the best results, outperforming previous state-of-the-art methods.

The structure of the paper is as follows: Sect. 2 introduces related work; Sect. 3 presents the overall framework and each module of our model; Sect. 4 analyzes the experimental results and verifies the effectiveness of our method; finally, Sect. 5 summarizes the work of this paper and discusses possible future research directions.

2 Related work

Our work focuses on using external knowledge to enrich the representation of events and on combining the character morphology and POS of words to enhance their causal semantic features, in order to extract causal relationships between events in text. It is therefore closely related to causal extraction methods, external knowledge-based methods, character-level feature-based methods, and POS-based methods, which we briefly summarize in this section.

2.1 Causal extraction methods

Early event causality extraction mainly adopted pattern-matching methods (Ittoo & Bouma, 2011; Kim et al., 2018; Hashimoto et al., 2015). For instance, Khoo et al. (2000) propose an extraction method that combines syntactic trees for causal relationships in the medical domain, and Mirza et al. (2014) propose a method for causal labeling between event pairs based on the properties of events. Some studies combine syntactic patterns and statistical features to extract causal relationships (Luo et al., 2016; Gao et al., 2019). Girju (2003) proposes an inductive learning approach that learns syntactic and semantic constraints of causality by automatically inducing syntactic patterns; for extracting causality in medical diseases, Lee and Shin (2017) present a method based on causality frequency and the strength of association between causal event pairs. In recent years, many researchers have applied deep learning to event causality extraction. Some works (Feng et al., 2018; Khetan et al., 2022; Kadowaki et al., 2019) convert the causal extraction problem into determining whether a causal relationship exists between two events. However, these methods not only depend on the correctness of the event extraction step, but also need to pair all extracted events. In addition, because they adopt a pipeline approach, they can hardly avoid error propagation and entity redundancy. To address these problems, joint extraction methods based on sequence annotation schemes have emerged (Fu et al., 2011; Martínez-Cámara et al., 2017). Li et al. (2021) propose SCITE and transfer the Flair embedding (Akbik et al., 2018) into their model; Xu et al. (2020) extend syntactic dependency trees to syntactic dependency graphs and propose a graph attention network based on syntactic dependency graphs for identifying event causalities. However, these studies usually analyze causality only from the given text, and it is often difficult to find more causal clues when the text lacks sufficient causal information.

2.2 External knowledge-based methods

With the development of knowledge graphs, many researchers have begun to apply external knowledge to natural language processing. Yang and Mitchell (2017) propose KBLSTM, which uses external knowledge bases to improve recurrent neural networks for machine reading. Majumder et al. (2022) inject external knowledge into the replies of dialogue models. For event causality extraction, Kruengkrai et al. (2017) retrieve descriptions related to a given causality candidate pair from a large number of knowledge sources and feed them into a multi-column convolutional neural network. Cao et al. (2021) propose the Latent Structure Induction Network (LSIN), which learns descriptive knowledge and relational knowledge of events through two different modules and infers the causal relationship of events from the induced structure. Although previous work shows that introducing external knowledge can help models better identify causal relationships between events, not all external knowledge is useful for this task, and different pieces of knowledge also differ in importance.

2.3 Character-level feature-based methods and POS-based methods

For character morphological features, a character-level CNN model was first used for text classification (Zhang et al., 2015). Chung et al. (2016) propose a character-level decoder without explicit segmentation; Lee et al. (2017) propose a fully character-level Neural Machine Translation (NMT) model, showing that character-level CNNs can effectively alleviate the Out-Of-Vocabulary (OOV) problem; Chiu and Nichols (2016) and Santos and Guimarães (2015) use CNNs to learn character-level features of words; Cherry et al. (2018) show that character-level models can outperform word-level models given sufficient time and model capacity; van Noord et al. (2020) combine a character-level model with a contextual language model and find that adding character-level information still improves performance even now that large pre-trained language models are widespread. Different from our approach, these studies focus on using character-level information to improve language models and overlook the fact that morphological similarity among words can also serve as a feature.

For POS, Celli (2010) adds part-of-speech counting in the process of relation extraction and finds that POS information is useful for predicting the position of entities in a relation. Cai et al. (2019) improve the accuracy of entity boundary detection with the help of POS. For Japanese named entity recognition, Suzuki et al. (2018) use POS tagging to fine-tune named entity recognition (NER) and learn a high-performance NER model. Although POS information has been used in many NLP tasks, few researchers have noticed the association between causal words and POS.

2.4 Similarities and differences between KLPWE and other methods

In general, existing causality extraction methods focus on mining as much causal information as possible from a given text, and they struggle to discover deep implicit causality when the information in the text is limited. Our approach therefore introduces external knowledge into this task to provide additional information to the model. Moreover, unlike other external knowledge-based methods, KLPWE also accounts for the varying importance of different knowledge. Finally, existing character-level and POS feature-based approaches have demonstrated that these features are useful in other NLP tasks; we introduce them into the event causality extraction task and use them to improve the differentiation of causal event words.

3 Our KLPWE method

In this section, we introduce the details of our proposed model, KLPWE. We first split a text into a sequence of words, some of which correspond to event entities in a knowledge graph. Then, as shown in Fig. 3, our model is divided into four main modules: (1) the entity embedding module, which forms entity representations from their neighbors in a knowledge graph; (2) the static word embedding module, which generates static word representations from the POS and character-level morphological characteristics of words and shares them across different sentences; (3) the dynamic word embedding module, which outputs dynamic word representations from a BERT model, so that the semantic meaning of each word can differ across sentences; and (4) the causal reasoning module, which combines the entity representations, static word representations, and dynamic word representations into a more informative token representation. The token representation is then fed to a Bi-GRU to evaluate the causal role of each token in a sentence, and finally a CRF jointly decodes the label sequence to assign the corresponding causal label to each word.

Fig. 3 The overall framework of our model

The design of KLPWE incorporates external knowledge graph information, static word morphological characteristics, and dynamic word semantic information simultaneously, which promotes the performance of causal reasoning. Next, we introduce each module of KLPWE in detail.

3.1 Entity embedding module

Just as humans can infer implicit causal connections between two event entities from prior knowledge, a knowledge graph, as a large-scale semantic network built from connections among many entities, provides a rich source of knowledge for computers to make causal inferences through the connections among entities. We use ConceptNet as the source of external knowledge. As one of the most commonly used knowledge graphs, ConceptNet (Speer et al., 2017) contains more than 8 million nodes and 21 million edges, and it assigns each edge a weight according to the strength of association between its nodes. In this module, we encode the knowledge representation of events using Numberbatch, a set of static word vectors built from ConceptNet. To construct Numberbatch, the ConceptNet graph is represented as a sparse matrix, and Speer et al. (2017) compute the Numberbatch word embeddings from this sparse matrix following the same method as Levy et al. (2015). Since it utilizes both the semi-structured knowledge and the textual information in ConceptNet, Numberbatch has semantic features that may not be learned from a text corpus alone. The structure of this module is shown in Fig. 4.

Specifically, the entity embedding module consists of two parts: knowledge filtering and knowledge encoding.

Fig. 4 Entity embedding module

3.1.1 Knowledge filtering

In the knowledge graph, not all neighbor nodes associated with an event entity can serve as the source of the event-related knowledge representation in the causality extraction task. In ConceptNet, edge weights are computed from the credibility of the underlying assertions, so more credible assertions receive higher weights. We therefore consider neighbor nodes connected by higher-weight edges to better represent the knowledge associated with the event; when facing a large amount of related knowledge, selecting the knowledge with higher weights better highlights the event’s associated knowledge features.

Thus, for a given event node E, we search ConceptNet for its neighboring nodes and filter them according to the type of relationship between each neighbor and E. Neighbor nodes connected to E by relations such as “Antonym” and “ExternalURL” are excluded; we keep only relations such as “Causes”, “HasSubevent”, and “CapableOf” that clearly indicate causality or can be used for causal reasoning. We then retain the top n neighbor nodes \(\left\{ N_1,N_2,N_3,\ldots ,N_n \right\}\) with the highest relevance and their corresponding association weights \(\left\{ W_1,W_2,W_3,\ldots ,W_n \right\}\), as sketched below.
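To make the filtering step concrete, the following Python sketch keeps only neighbors connected to E by whitelisted relation types and retains the top-n by edge weight. It assumes the ConceptNet edges of an event node are already available as (neighbor, relation, weight) triples; the whitelist and the sample edges are illustrative assumptions, not the exact configuration of our experiments.

```python
# Sketch of the knowledge-filtering step. The relation whitelist and the sample
# edges are illustrative assumptions; adapt them to the relations listed above.
ALLOWED_RELATIONS = {"Causes", "HasSubevent", "CapableOf", "IsA"}

def filter_neighbors(edges, top_n=10):
    """Keep neighbors whose relation supports causal reasoning, ranked by edge weight."""
    kept = [(neighbor, weight) for neighbor, relation, weight in edges
            if relation in ALLOWED_RELATIONS]
    kept.sort(key=lambda pair: pair[1], reverse=True)   # highest weight first
    return kept[:top_n]                                 # top-n neighbors and their weights

# Hypothetical edges for the event node "hurricane"
edges = [
    ("natural_disaster", "IsA", 4.0),
    ("house_collapse", "Causes", 2.8),
    ("calm_weather", "Antonym", 1.0),                   # excluded by relation type
]
print(filter_neighbors(edges, top_n=2))                 # [('natural_disaster', 4.0), ('house_collapse', 2.8)]
```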

3.1.2 Knowledge encoding

After obtaining the neighbor information related to the given event, we encode the event’s related knowledge representation based on these neighbor nodes and their weights. At this stage, we normalize the weight of each neighbor node; the final event-related knowledge representation is formed by combining the neighbors’ knowledge vectors according to their normalized weights. For each neighbor \(N_j\), we define its normalized weight as follows:

$$\begin{aligned} \alpha _j = \frac{W_{j} }{ {\textstyle \sum _{i=1}^{n}}W_{i} } \end{aligned}$$
(1)

where \(W_{j}\) denotes the weight of the edge between \(N_j\) and E. We then compute the related knowledge representation \(F^{knowl}\) of event node E according to Eq. (2):

$$\begin{aligned} F^{knowl}=\sum _{j=1}^{n}\alpha _{j}v_{j} \end{aligned}$$
(2)

where \(v_{j}\) is the feature vector of \(N_j\) in Numberbatch.
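A minimal sketch of Eqs. (1)-(2) in Python/NumPy: the retained neighbors’ weights are normalized and their Numberbatch vectors are combined by a weighted sum. The `numberbatch` argument stands in for the real pre-trained vector table, whose loading is omitted here; skipping neighbors missing from the table is our own simplification.

```python
import numpy as np

def encode_knowledge(neighbors, numberbatch, dim=300):
    """neighbors: list of (node, weight) from the filtering step;
    numberbatch: dict mapping a node to its 300-d Numberbatch vector."""
    vecs, weights = [], []
    for node, weight in neighbors:
        if node in numberbatch:                          # skip neighbors without a vector
            vecs.append(numberbatch[node])
            weights.append(weight)
    if not vecs:
        return np.zeros(dim)                             # no usable related knowledge
    alpha = np.array(weights) / np.sum(weights)          # Eq. (1): normalized weights
    return np.average(np.stack(vecs), axis=0, weights=alpha)  # Eq. (2): F^knowl
```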

3.2 Static word embedding module

In the event causality extraction task, our goal is to identify the event words in sentences that carry causal semantic role labels and to distinguish them from other words that are not causally related. Previous methods usually obtain the contextual features of words directly from the initial semantic vector, which is not enough to highlight the causal features of words, especially when the context lacks connectives such as “because” and “cause” that explicitly indicate causality. Since causal words often share similar character-level morphological features, we extract the character morphological features of a word as part of its enhanced information representation. In addition, since causal events in sentences are usually expressed by verbs and nouns, there is a correlation between a word’s POS and its causal semantic role label, so we also add the POS feature to the enhanced information. Character morphological features and POS features are combined to construct the static word embedding and highlight the causal features of words. The structure of this module is shown in Fig. 5.

Fig. 5 Static word embedding module

The module consists of two parts, character feature capture and POS feature capture, which are described in detail in the following subsections.

3.2.1 Character feature capture

Causal words with similar morphology often occupy the same place in the syntactic structure of a sentence. Learning the character morphological representation of these words can highlight them locally in the sentence and, in turn, help the model learn the common positions of causal words in the sentence structure.

Previous studies (Santos & Guimarães, 2015; Labeau et al., 2015) have demonstrated the effectiveness of CNNs in extracting character-level features of words. To capture the character features of causal words, we use the same convolutional neural network as Chiu and Nichols (2016), splitting each word into characters before convolution. To avoid information loss during convolution, we first pad the boundaries of the word. For a given word W of length t, we split it into characters to obtain the set \(\left\{ c_{1},c_{2},\ldots ,c_{t} \right\}\). Subsequently, we look up the character feature vector \(v_{i}\) corresponding to each character \(c_{i}\) in the character feature mapping table and, together with the features of the padding characters, construct the \(m\times d\) character feature matrix of W. Let the set of convolution kernels be \(K=\left\{ k_{1},k_{2},\ldots ,k_{n} \right\}\); a local feature \(f_{i}^{c}\) can then be calculated by the following equation:

$$\begin{aligned} f_{i}^{c}=f\left( w\cdot v_{i: i+l-1}+b\right) \end{aligned}$$
(3)

where \(w\in R^{l\times d}\) is the weight matrix of the convolution kernel \(k_{i}\), l is its window length, b is the bias, \(f_{i}^{c}\) denotes the feature obtained at the i-th window position, and f is the ReLU activation function. We compute this convolution for each window that \(k_{i}\) slides over, obtaining \(F_{i}^{c} = \left\{ f_{1}^{c}, f_{2}^{c},\ldots ,f_{m-l+1}^{c} \right\}\), and then apply max pooling to obtain the feature \(\widetilde{F}_{i}^{c}=max\left( F_{i}^{c} \right)\) corresponding to this convolution kernel. Eventually, for a given word W, its character feature under the n convolution kernels in the set K is represented as:

$$\begin{aligned} F^{c}=\left\{ \tilde{F}_{1}^{c}, \tilde{F}_{2}^{c}, \ldots , \tilde{F}_{n}^{c}\right\} \end{aligned}$$
(4)
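A PyTorch sketch of the character-CNN described by Eqs. (3)-(4): characters are embedded, convolved with n kernels of window length l over the padded word, and max-pooled into an n-dimensional feature. The kernel count (128) and window size (3) follow Sect. 4.3; the character vocabulary size and character embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, n_chars=100, char_dim=30, n_kernels=128, window=3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_kernels, kernel_size=window,
                              padding=window - 1)        # pad the word boundaries
        self.act = nn.ReLU()

    def forward(self, char_ids):                          # (num_words, max_word_len)
        x = self.embed(char_ids).transpose(1, 2)          # (num_words, char_dim, len)
        feats = self.act(self.conv(x))                    # Eq. (3) at every window position
        return feats.max(dim=2).values                    # max pooling -> F^c, Eq. (4)

char_ids = torch.randint(1, 100, (4, 12))                 # 4 words, 12 characters each
print(CharCNN()(char_ids).shape)                          # torch.Size([4, 128])
```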

3.2.2 POS feature capture

In general, words with causal role labels are usually the core words in causal sentences. In the event causality extraction task, determiners, gerunds, complements, and other modifying or restricting words are relatively less important and usually do not appear in the key structure of the sentence, so distinguishing words by POS further highlights the influence each word has on the sentence. Based on this, we build a POS table and initialize a feature vector for each POS in the table. For an input sentence, we perform part-of-speech tagging on each word to obtain the POS of word W, and then look up the corresponding POS feature embedding \(F^{p}\) of W in the POS table. Finally, the POS feature is concatenated with the character feature to form the static word embedding:

$$\begin{aligned} F^{word}= F^{p} \oplus F^{c} \end{aligned}$$
(5)
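The sketch below illustrates Eq. (5): each word’s POS tag indexes a trainable POS embedding, which is concatenated with the character feature F^c. The 37-tag inventory comes from NLTK as noted in Sect. 4.3; the POS embedding size (50) and the on-the-fly tag-to-index map are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import nltk   # requires nltk.download('averaged_perceptron_tagger') beforehand

pos_table = nn.Embedding(40, 50)        # ~37 NLTK tags plus slack; 50-d is an assumed size
tag2idx = {}

def static_word_embedding(words, char_feats):
    """words: list of tokens; char_feats: (len(words), 128) character features F^c."""
    tags = [tag for _, tag in nltk.pos_tag(words)]
    idx = torch.tensor([tag2idx.setdefault(t, len(tag2idx)) for t in tags])
    f_pos = pos_table(idx)                               # F^p
    return torch.cat([f_pos, char_feats], dim=-1)        # Eq. (5): F^word = F^p ⊕ F^c
```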

3.3 Dynamic word embedding module

In this module, we combine the related knowledge feature and word information enhancement feature with the dynamic word vector from the pre-trained language model, and get the final word representation as the input of the neural network layer.

To perform the feature fusion, we first convert the input text into a word vector representation. BERT is a pre-trained language model built on the bidirectional Transformer. Since BERT is pre-trained with the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks, it has a powerful capacity for capturing semantics and can effectively handle words with multiple meanings.

In our model, we use BERT-base to encode the text. For each word \(w_{i}\) in the input sentence \(S=\left\{ w_{1},w_{2},\ldots ,w_{t} \right\}\), we take the BERT output \(F_{i}^{bert}\) as its word embedding and fuse it with the previously obtained background knowledge representation and word information enhancement representation to obtain the final word representation:

$$\begin{aligned} F_{i}= F_{i}^{bert} \oplus F_{i}^{knowl} \oplus F_{i}^{word} \end{aligned}$$
(6)
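A sketch of Eq. (6) using the Hugging Face transformers library: the BERT hidden state of each word is concatenated with its knowledge feature F^knowl and static word feature F^word. Aligning WordPiece sub-tokens to words by taking the first sub-token is a simplification, and the feature dimensions follow the earlier sketches rather than a prescribed setting.

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def fuse(words, f_knowl, f_word):
    """words: list of tokens; f_knowl: (n, 300); f_word: (n, 178)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state[0]              # (num_subtokens, 768)
    first_sub = [enc.word_ids().index(i) for i in range(len(words))]
    f_bert = hidden[first_sub]                              # one 768-d vector per word
    return torch.cat([f_bert, f_knowl, f_word], dim=-1)     # Eq. (6): F = F^bert ⊕ F^knowl ⊕ F^word
```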

3.4 Causal reasoning module

In this module, we predict the causal label with the highest probability for each word in the sentence based on the output of the dynamic word embedding module.

Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) is a special recurrent neural network structure. It regulates the flow of sequential information through structures called “gates”, which selectively preserve the contextual information in sentences. We use the Gated Recurrent Unit (GRU), a variant of LSTM with a simpler structure and faster training, to model the global semantic features of words. For the input semantic feature vector, the GRU is computed as follows:

$$\begin{aligned} \begin{aligned}&z_{i}=\sigma \left( W_{z} F_{i}+U_{z} h_{i-1}+b_{z}\right) \\&r_{i}=\sigma \left( W_{r} F_{i}+U_{r} h_{i-1}+b_{r}\right) \\&\tilde{h}_{i}=\tanh \left( W_{h} F_{i}+U_{h}\left( r_{i} \times h_{i-1}\right) +b_{h}\right) \\&h_{i}=z_{i} \times h_{i-1}+\left( 1-z_{i}\right) \times \tilde{h}_{i}\\ \end{aligned} \end{aligned}$$
(7)

where \(\sigma\) is the sigmoid activation function, \(F_{i}\) is the fused feature vector of the i-th word in the input sentence, \(W_{z},W_{r},W_{h},U_{z},U_{r},U_{h}\) are the weight matrices of the GRU, and \(b_{z}\), \(b_{r}\), and \(b_{h}\) are the bias terms.

Since the cause event and effect event in a sentence are context-dependent and a forward GRU only considers the text before the current word, we add a backward GRU and use a bidirectional GRU to obtain global semantic features. The output \(h_{t}\) of the GRU layer is thus determined by both the forward and backward GRUs:

$$\begin{aligned} h_{t}={\text {concat}}\left( \overrightarrow{h_{t}}, \overleftarrow{h_{t}} \right) \end{aligned}$$
(8)

where \(\overrightarrow{h_{t}}\) and \(\overleftarrow{h_{t}}\) denote the output vectors of the forward and backward GRUs for \(F_{i}\), respectively, and concat denotes vector concatenation.
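Eqs. (7)-(8) correspond to a standard bidirectional GRU layer; the sketch below uses PyTorch’s built-in implementation with the 256-dimensional hidden size from Sect. 4.3. The input dimension simply continues the illustrative feature sizes used in the earlier sketches.

```python
import torch
import torch.nn as nn

fused_dim = 768 + 300 + 178                       # BERT + knowledge + static word features (illustrative)
bigru = nn.GRU(input_size=fused_dim, hidden_size=256,
               batch_first=True, bidirectional=True)

fused = torch.randn(1, 20, fused_dim)             # one sentence of 20 tokens
h, _ = bigru(fused)                               # (1, 20, 512): [forward; backward] per token, Eq. (8)
```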

In the causality extraction task, there are usually strong dependencies between the causal semantic role labels of words: for an “Effect” label, there must be a corresponding “Cause” label. To exploit the constraints between causal labels, we use a Conditional Random Field (CRF) to assign the final causal semantic role labels to the words in a sentence and obtain the globally optimal label chain for the given input sequence. A CRF is a special case of a Markov random field that predicts the conditional probability distribution of an output sequence given an input sequence. We take the global semantic feature vectors of sentence S output by the Bi-GRU layer as the input of the CRF layer. For the given sentence \(S=\left\{ w_{1},w_{2},\ldots ,w_{n} \right\}\) and label sequence \(y=\left\{ y_{1},y_{2},\ldots ,y_{n} \right\}\), the CRF computes the following score:

$$\begin{aligned} {\text {score}}(S, y)=\sum _{i=0}^{n} A_{y_{i}, y_{i+1}}+\sum _{i=1}^{n} P_{i, y_{i}} \end{aligned}$$
(9)

where A is the transition matrix, \(A_{y_{i}, y_{i+1}}\) denotes the transition score from label \(y_{i}\) to \(y_{i+1}\), and \(P_{i, y_{i}}\) denotes the probability that the i-th word is labeled \(y_{i}\). For the input sentence S, the probability of label sequence y is computed from the above score:

$$\begin{aligned} p(y \mid S)=\frac{e^{{\text {score}}(S, y)}}{\sum _{\tilde{y} \in Y_{S}} e^{{\text {score}}(S, \tilde{y})}} \end{aligned}$$
(10)

where \(Y_{S}\) denotes the set of all possible label sequences for S and \(\tilde{y}\) ranges over these candidate sequences. The model is trained by maximizing the log-likelihood of the true label sequence y, i.e., \(\log p(y \mid S)\):

$$\begin{aligned} \log (p(y \mid S))={\text {score}}(S, y)-\log \left( \sum _{\tilde{y} \in Y_{S}} e^{\text{ score } (S, \tilde{y})}\right) \end{aligned}$$
(11)

Finally, the highest-scoring label sequence is output as the prediction:

$$\begin{aligned} y^{*}=\arg \max _{\tilde{y} \in Y_{S}} {\text {score}}(S, \tilde{y}) \end{aligned}$$
(12)
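The following is a minimal sketch of the CRF score in Eq. (9): the sum of emission scores \(P_{i, y_{i}}\) (a linear projection of the Bi-GRU output) and transition scores \(A_{y_{i}, y_{i+1}}\) along a label sequence. Start/end transitions, the partition function of Eq. (11), and Viterbi decoding for Eq. (12) are omitted; in practice an off-the-shelf CRF layer would handle these.

```python
import torch

def crf_score(emissions, transitions, labels):
    """emissions: (seq_len, n_labels) scores P; transitions: (n_labels, n_labels) matrix A;
    labels: (seq_len,) long tensor with the label sequence y."""
    emit = emissions[torch.arange(len(labels)), labels].sum()   # sum_i P_{i, y_i}
    trans = transitions[labels[:-1], labels[1:]].sum()          # sum_i A_{y_i, y_{i+1}} (start/end omitted)
    return emit + trans                                         # Eq. (9), up to boundary transitions
```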

4 Experiments

4.1 Dataset

We extend the annotation of causal sentences based on SemEval-2010 Task 8 (Hendrickx et al., 2010). The original annotations of the dataset contain some ambiguities. For example, in the sentence “These <e1>germs</e1> cause illnesses ranging from common ailments, like the cold and <e2>flu</e2>, to disabling.”, “cold” and “flu” are specific cases of the concept “illnesses”, yet the original dataset labels only “flu” as the effect. To address the impact of such ambiguous annotations on reliability and accuracy, we relabeled and extended the original dataset. For ambiguous cases, we use the word at the highest conceptual level in the sentence as the final annotation; in the example above, “illnesses” is labeled as “Effect”. For phrases, we uniformly select the core word of the phrase as the annotation result. The final corpus consists of 3000 sentences with 1331 causal instances, which we divide into training, validation, and test sets in a ratio of 4.5:1:1.

4.2 Evaluation metrics

Following previous work, we use Precision, Recall, and F1-score as evaluation metrics, computed as follows:

$$\begin{aligned} Precision&=\frac{TP}{TP+FP} \end{aligned}$$
(13)
$$\begin{aligned} Recall&=\frac{TP}{TP+FN} \end{aligned}$$
(14)
$$\begin{aligned} F1-Score&=\frac{2*Precision*Recall}{Precision+Recall} \end{aligned}$$
(15)

where TP (True Positive) denotes that both the predicted and actual values are positive, FP (False Positive) denotes that the predicted value is positive while the actual value is negative, and FN (False Negative) denotes that the predicted value is negative while the actual value is positive.
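For completeness, Eqs. (13)-(15) amount to the following small computation (a sketch; the TP/FP/FN counts themselves come from comparing predicted and gold causal labels):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                           # Eq. (13)
    recall = tp / (tp + fn)                              # Eq. (14)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (15)
    return precision, recall, f1
```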

4.3 Experimental settings

We use the “bert-base-uncased” BERT model to obtain the embedding representation of the input text. The batch size is set to 8, the learning rate to \(1\times 10^{-5}\), the number of training epochs to 50, and the hidden size of the GRU to 256. Based on the average sentence length in the dataset, the maximum sentence length is set to 64. In the entity embedding module, we keep the top 10 neighbor nodes with the highest relevance and set the dimension of the background knowledge embedding to 300. In the static word embedding module, we use a CNN with 128 convolution kernels and a convolution window size of 3, and we obtain 37 different POS tags based on NLTK’s tagging library.
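For reference, the hyperparameters above can be gathered into a single configuration sketch; the key names are our own, and only the values stated in this subsection are taken from the paper.

```python
CONFIG = {
    "pretrained_model": "bert-base-uncased",
    "batch_size": 8,
    "learning_rate": 1e-5,
    "epochs": 50,
    "gru_hidden_size": 256,
    "max_sentence_length": 64,
    "top_n_neighbors": 10,        # entity embedding module
    "knowledge_dim": 300,
    "char_cnn_kernels": 128,
    "char_cnn_window": 3,
    "num_pos_tags": 37,           # from NLTK's tag set
}
```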

4.4 Results and analysis

We compare our model with baselines and conduct ablation experiments to demonstrate the effectiveness of our work. Each experiment was performed five times, and the evaluation metrics were computed from the multiple runs. We select IDCNN, CLSTM, and other mainstream methods for comparison:

IDCNN-CRF (Strubell et al., 2017): This model replaces the Bi-LSTM with Iterated Dilated Convolutions, which allow fixed-depth convolutions to run in parallel across the whole document, significantly speeding up training while maintaining the same accuracy as a Bi-LSTM.

CLSTM-BiLSTM-CRF (Lample et al., 2016): This model uses a bidirectional LSTM as a character encoder (Char LSTM) to generate character-derived word embeddings, which are concatenated with pre-trained word vectors from the word table as input to the subsequent layers. The bidirectional LSTM encoder enables the model to benefit from both word-level and character-level representations.

CCNN-BiLSTM-CRF (Ma et al., 2016): Similar to the previous model, except that it uses a CNN as the character encoder (Char CNN) to learn word features instead of a character LSTM.

BERT-BiLSTM-CRF: A widely used model in sequence labeling tasks, extended from Huang et al. (2015). It uses BERT as a pre-trained model to obtain context-dependent dynamic word vectors as input to the subsequent layers, which allows it to handle words with multiple meanings.

Table 1 The performance of different models on the test set

Table 1 shows the experimental results of the different models on the causality extraction task. Our model achieves an F1 score of 0.8175 on the test set, outperforming the other models and confirming the validity of our work. Meanwhile, to verify the roles of the entity embedding module and the static word embedding module, we conduct ablation experiments, testing the model without each module in turn. The results show that adding either module improves performance to a different degree, and both variants achieve better results than the baseline model. Moreover, using the two modules together further improves performance, verifying the effectiveness of the proposed modules for the event causality extraction task.

4.5 The effect of causal connection words

To explore the impact of causal connection words on extraction performance, we select the sentences with causal instances in the test set and manually classify them into explicit causal sentences containing causal connection words and implicit causal sentences without them, obtaining 176 explicit and 52 implicit causal sentences. Using only these sentences as the test set, we evaluate how well different models extract explicit and implicit causal relationships between events, respectively; the results are shown in Fig. 6.

We observe the following: (1) Compared with extracting explicit causality, the performance of every model decreases to a different degree when extracting implicit causality, indicating that the lack of causal cue words makes it difficult to mine the deep implicit causality in sentences. (2) Compared with the baseline models, our model achieves the best results on both tasks, with a larger improvement on implicit causality extraction: it improves by 0.38% on explicit causality extraction and by 2.24% on implicit causality extraction. This confirms that our method can effectively alleviate the problem of missing causal cue words in a sentence and provide more causal clues for the causality extraction task.

Fig. 6 The performance of different models in extracting explicit and implicit causality tasks

4.6 The effect of neighbor node counts

To investigate the effect of the number of relevant neighbor nodes on extraction results, we select the top 3, 5, 10, 15, and 20 neighbor nodes with the highest relevance to the event nodes as the background knowledge sources for knowledge representation encoding and conduct comparative experiments. Fig. 7 shows the experimental results.

Fig. 7 F1 scores with the different numbers of neighbor nodes

In Fig. 7, we observe that the F1 score of our model increases with the number of selected neighbors and peaks when 10 neighbor nodes are selected. As the number of neighbors continues to grow, the F1 score begins to decline; when 20 neighbor nodes are selected, the score is 0.24% lower than that of the benchmark model without the related knowledge representation. Our analysis is that too few relevant neighbor nodes limit the scope that event-related knowledge can cover, leading to a less comprehensive knowledge representation, so a proper number of neighbor nodes helps provide a more adequate representation of event-related knowledge features. When too many neighbor nodes are selected, the strongly associated knowledge features are diluted, which lowers the quality of the generated event knowledge representation; such overly diluted features not only fail to represent the relevant knowledge of the event, but can even harm the model.

4.7 Analysis of static word embedding module

To further analyze the effect of the static word embedding module on the representation of causal semantic features, we divide the test set by sentence length l into Short \(\left( 0<l<15 \right)\), Mid \(\left( 15\le l<25 \right)\), and Long \(\left( 25\le l \right)\) subsets, with a ratio of roughly 2:2:1, and run experiments on each subset with the baseline model (BERT-BiLSTM-CRF), the model without the static word embedding module (KLPWE w/o word), and the complete model (KLPWE). As shown in Fig. 8, on every subset the models using the static word embedding module improve over the baseline to some degree. Notably, compared with the model without the static word embedding module, the complete model improves the F1 score by only 0.17% on the short-sentence subset, but by 0.4% and 0.95% on the mid- and long-sentence subsets, respectively. This demonstrates that for longer sentences the static word embedding module can effectively highlight the causally relevant words among the many words present, verifying its effectiveness in enhancing the representation of word causal semantic features.

Fig. 8 Results on test sets of different lengths

4.8 Case study

In Table 2, we present representative examples to illustrate the differences between our proposed approach and other approaches. For each example, the first line shows the input sentence and the causal event words it contains; the remaining lines show the causality extraction results of our model and the other models.

Sentences 1 and 2 are examples of explicit causality. We observe that explicit causal cue words help causal relationship extraction to a certain extent: most methods can identify explicit causal relationships when the two events are close to each other. However, when the causally related events are far apart, even though “caused” serves as an indicator of causality, methods that do not use pre-trained language models fail to extract the causal relationship correctly. The reason may be that pre-trained language models dynamically generate word vectors based on the sentence context, yielding more accurate semantic representations of words and thus better results than methods without pre-trained language models.

Sentence 3 is an example of implicit causality. It shows that the lack of explicit causal cue words in a sentence presents a significant challenge for learning implicit causal relationships; in this example, only KLPWE correctly extracts the underlying causal relationship between the events.

Table 2 Result of causality extraction, where the correct part is shown in underlined italic, and the wrong part is shown in bold italic. We use bold to highlight cause-and-effect words in sentences

5 Conclusion

In this paper, we propose a method for event causality extraction based on external event knowledge learning and polyhedral word embedding. To alleviate the difficulty of discovering implicit causal associations between events when the text lacks causal clues, we generate related knowledge representations for events from external knowledge. In addition, to address the lack of task-related features in the semantic representations of words, we enhance the information representation of words to highlight causal-related features. The experimental results verify the effectiveness of our proposed method.

In future work, we will try to extract multiple causal relationships from a sentence simultaneously, extending the extraction of causality from one causal pair to multiple causal pairs. Furthermore, event-related knowledge can be extended with nodes on multi-hop paths in the knowledge graph, so investigating how to utilize the relevant knowledge on multi-hop paths for causality extraction is also a potential direction for future research.