
1 Introduction

Generally, event extraction (EE) consists of two subtasks. Event detection aims to identify and classify event triggers, and argument extraction aims to identify the arguments of events and label their roles. In Fig. 1, the event detection task identifies the trigger “dropped” and the event type “Conflict: Attack”; the event argument extraction task then identifies the arguments “U.S. planes” and “Iraqis” and their roles “Attacker” and “Victim”.

Fig. 1. Example documents in the ACE 2005 corpus. Triggers and event types are marked in red; arguments and roles are marked in other colors. The event extraction results of the three sentences are on the far right of the figure. (Color figure online)

There is a large body of research on sentence-level event extraction, yet two critical challenges remain. 1) Ambiguity of Triggers: A word can express different meanings in different sentences and thus trigger different events. In Fig. 1, both S1 and S2 contain the word “dropped”. Without considering information of different granularities, such as argument roles (Weapon, Place, etc.) and context information, it is quite challenging to detect that “dropped” triggers a “Transport” event in S1 but an “Attack” event in S2. Only if the event is detected correctly can its arguments be extracted well. Therefore, distinguishing different event semantics through a comprehensive understanding of multi-granularity information is crucial for improving the accuracy of event extraction. 2) Event Interdependency: A sentence may express several correlated events simultaneously. For example, in the event mention “Three people plus the bomber were killed, and at least 30 others were hurt.”, a “Die” event is triggered by “killed” and an “Injure” event is triggered by “hurt”. This kind of event co-occurrence is also called multiple events in one sentence, and it is common in the ACE 2005 corpus: according to the statistics in [9], nearly 27% of sentences contain more than one event. These events are often associated with each other, having similar event types and arguments with the same roles. Only by modeling the interdependency among them can we extract all events in a sentence correctly, which is fundamental to successful extraction. Therefore, effectively modeling the interdependency among correlated events is another key challenge in event extraction.

For challenge 1), most existing methods only consider information of a single granularity [1, 9, 11, 17, 18], especially intra-sentence information such as entity types or the dependency tree. These methods try to make the best of sentence-granularity information to distinguish semantics. Unfortunately, in many cases ambiguity cannot be resolved by intra-sentence information alone. For example, in Fig. 1, it is impossible to distinguish the event type “Attack” from “End_Position” using only the entity information in S3; the context information “military action” and “casualties” is required. Therefore, we should comprehensively exploit information of different granularities to distinguish different semantics. [2, 7] use document-granularity information, but bring in a lot of redundant information from the document and neglect a sufficient understanding of sentence-granularity information. As a result, they show no obvious improvement, and none of these methods considers the influence of arguments in an event, especially their roles. For challenge 2), some methods use recurrent neural networks to remember previously correlated events [11, 13], but they still suffer from the long-distance dependency problem. Another line of work uses graph neural networks (GNNs), which can effectively model the interdependency among nodes [3, 9, 12, 17]. To model the interdependency of events, they construct a graph over the word nodes of a sentence via its dependency tree. However, they only consider the words of the sentence and are applied only to the event detection subtask [3, 12, 17]. That is, they neglect not only the key multi-granularity information mentioned above, but also the argument extraction subtask and the interaction between the two subtasks.

To address the above problems, inspired by [14], we propose a Multi-granularity Heterogeneous Graph model for sentence-level Event Extraction (MHGEE) that performs event detection and argument extraction simultaneously. Unlike previous works that only take words as nodes, MHGEE contains two additional types of nodes with different granularities: entity and context. Besides, we construct six types of edges. We design the three types of nodes, considering the nearest context of the sentence and intra-sentence information, to learn multi-granularity semantic information. We then use a Relational Graph Convolutional Network (R-GCN) to enable rich interactions among nodes, so as to distinguish the semantics of the same trigger word in different events. Meanwhile, our model constructs a heterogeneous graph to model intra-sentence and inter-sentence event interdependency by aggregating the information of relevant events in the same sentence or context, so as to address the challenge of multiple events in one sentence. The contributions of this paper can be summarized as follows:

  • We propose a novel event extraction model based on a Multi-granularity Heterogeneous Graph (MHGEE). Our MHGEE designs multi-granularity nodes and enables rich interactions among nodes via R-GCN, which strengthens the semantics and helps resolve the ambiguity of triggers.

  • We are the first to construct a heterogeneous graph for the whole event extraction task. Our MHGEE can model intra-sentence and inter-sentence event interdependency and capture multiple events in one sentence effectively.

  • Experiments on the ACE 2005 dataset show that our model outperforms the previous SOTA models by nearly 5% F1 on trigger identification and 2% F1 on argument identification.

2 Related Work

From the perspective of text scope, event extraction can be divided into sentence-level and document-level event extraction, and sentence-level event extraction can further be divided into extraction-based and generation-based methods. Our work focuses on sentence-level event extraction. Below we group the relevant deep learning works by the methods they use.

Event extraction models based on basic neural networks have been widely used to extract features automatically, such as convolutional neural networks (CNNs) [1] and recurrent neural networks (RNNs) [11, 13].

In recent years, some works have adopted BERT as the pre-trained language model [4, 10, 16], since BERT has proven effective at improving the performance of downstream natural language processing tasks, including event extraction.

With the application of GNNs to various fields of natural language processing, some researchers propose to transform the syntactic dependency tree, which contains syntactic information and plays an important role in event extraction, into a graph and employ a GCN [5] to conduct event detection through information propagation over the graph [3, 12, 17]. These works only consider the event detection task and ignore argument information.

However, the existing methods above all focus on single-granularity information at the sentence level, neglecting the aggregation of multi-granularity information across sentences. Although adjacent sentences in the context also carry relevant event information that could address the above challenges, these methods do not integrate such multi-granularity information, which would enhance the event signals of the sentence the triggers belong to.

3 Approach of MHGEE Model

Our MHGEE model consists of the following four modules: 1) Input Layer: we obtain initial vector representations of words, entities, and contexts; 2) Graph Construction: we build a multi-granularity heterogeneous graph with three types of nodes and six types of edges; 3) Information Aggregation over MHG: we use the R-GCN algorithm with a gating mechanism to propagate information among multi-granularity information sources, enhancing the information flow from context and entity nodes for event extraction; 4) Classification Layer: we obtain the final embedding representations of words and entities, derive trigger candidates of certain types from trigger labels in the BIO annotation schema, and then predict the roles that each entity plays in such events after aggregating word embeddings into the trigger candidate vector \( t_i \) and entity vector \( e_i \). Figure 2 gives the architecture of the MHGEE model.

Fig. 2. The architecture of our MHGEE model. Different types of nodes are represented by circles with different colors, and similarly, different types of edges are represented by lines with different numbers in the Graph Construction module. Due to space limitations, not all nodes and edges are represented in the graph.
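To make the data flow concrete, the following minimal PyTorch sketch wires the four modules together. The submodule classes are hypothetical stand-ins for the components described in Sects. 3.1–3.4, not the authors' implementation.

```python
import torch.nn as nn

class MHGEE(nn.Module):
    """Structural sketch only: each submodule is a hypothetical stand-in."""
    def __init__(self, input_layer, graph_builder, rgcn, classifier):
        super().__init__()
        self.input_layer = input_layer      # Sect. 3.1: word/entity/context embeddings
        self.graph_builder = graph_builder  # Sect. 3.2: 3 node types, 6 edge types
        self.rgcn = rgcn                    # Sect. 3.3: gated R-GCN propagation
        self.classifier = classifier        # Sect. 3.4: trigger and argument heads

    def forward(self, sentence, entities, context):
        nodes = self.input_layer(sentence, entities, context)    # initial node vectors
        edges = self.graph_builder(sentence, entities, context)  # typed edge lists
        hidden = self.rgcn(nodes, edges)                          # L rounds of propagation
        return self.classifier(hidden)      # (trigger labels, argument roles)
```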

3.1 Input Layer

We first obtain the initial embedding vectors of words, entities, and contexts. Let \( W= w_1,w_2,\ldots ,w_n \) be a sentence of length n, where \( w_i \) is the i-th word. Similarly, let \( E= m_1,m_2,\ldots ,m_k \) be the entities in the sentence, where \( m_k \) is the k-th entity.

The word embedding vector \( \mathbf {x}_{\mathrm {i}} \). To obtain the word embedding, each token \( w_i \) in the sentence is transformed into a real-valued vector \( \mathbf {x}_{\mathrm {i}} \) by looking up embedding matrices and concatenating the following vectors: 1) The word embedding vector \( \mathbf {w}_{\mathrm {i}} \): obtained by looking up the pre-trained GloVe word embedding matrix; 2) The POS-tagging label embedding vector \( \mathbf {pos}_{\mathrm {i}} \): generated by looking up the randomly initialized POS-tagging label embedding table; 3) The positional embedding vector \( \mathbf {p}_{\mathrm {i}} \): if \( w_c \) is the current word in a sentence, we encode the relative distance \( i-c \) from \( w_i \) to \( w_c \) as a real-valued vector by looking up the randomly initialized position embedding table [11, 12]; 4) The entity type label embedding vector \( \mathbf {n}_{\mathrm {i}} \): similar to the POS-tagging label embedding, we annotate the entities in a sentence using the BIO annotation schema and transform the entity type labels into real-valued vectors by looking up the embedding table. Thus, the input embedding of \( w_i \) is defined as:

$$\begin{aligned} \mathbf {x}_{\mathrm {i}}=\left[ \mathbf {w}_{\mathrm {i}};\mathbf {pos}_{\mathrm {i}};\mathbf {p}_{\mathrm {i}};\mathbf {n}_{\mathrm {i}}\right] \in \mathbb {R}^{d_{w}+d_{pos}+d_{p}+d_{n}} \end{aligned}$$
(1)

where \( d_w \), \( d_p \), \( d_{pos} \) and \( d_n \) denote the dimensions of the word embedding, positional embedding, POS-tagging label embedding and entity type embedding, respectively.
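A minimal PyTorch sketch of Eq. (1) follows. The vocabulary sizes are illustrative placeholders; per the paper, the word table would be initialized from GloVe and the other three tables randomly initialized.

```python
import torch
import torch.nn as nn

class WordEmbedder(nn.Module):
    """Concatenates the four per-token embeddings of Eq. (1)."""
    def __init__(self, n_words, n_pos, n_dist, n_ent,
                 d_w=300, d_pos=50, d_p=50, d_n=50):
        super().__init__()
        self.word = nn.Embedding(n_words, d_w)   # GloVe-initialized in the paper
        self.pos = nn.Embedding(n_pos, d_pos)    # POS-tagging label table
        self.dist = nn.Embedding(n_dist, d_p)    # relative-position table
        self.ent = nn.Embedding(n_ent, d_n)      # BIO entity-type label table

    def forward(self, w_ids, pos_ids, dist_ids, ent_ids):
        # Each argument is a LongTensor of per-token indices for one sentence.
        x = torch.cat([self.word(w_ids), self.pos(pos_ids),
                       self.dist(dist_ids), self.ent(ent_ids)], dim=-1)
        return x  # shape: (n, d_w + d_pos + d_p + d_n)
```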

The entity embedding vector \( \mathbf {e}_{\mathrm {i}} \). We compute the entity embedding vector \( \mathbf {e}_{\mathrm {i}} \) by mean-pooling the vectors of all the words that make up the entity \( m_i \). Thus, the input embedding \( \mathbf {e}_{\mathrm {i}} \) is defined as:

$$\begin{aligned} \mathbf {e}_{\mathrm {i}}={\text {mean-pooling}}\left( \left\{ \mathbf {w}_{\mathrm {j}} \mid w_{j} \in m_{i}\right\} \right) \in \mathbb {R}^{d_{e}} \end{aligned}$$
(2)

where \( d_e \) denotes the dimension of the entity embedding.

The context embedding vector \( \mathbf {c}_{\mathrm {i}} \). We take the Word2Vec vectors of the two sentences above and the two sentences below the current sentence. Each of these four sentences consists of words \( w_1 \) to \( w_m \), and each sentence vector \( \mathbf {W}_{\mathrm {j}} \) is obtained by averaging its word vectors \( \mathbf {w}_{\mathrm {1}} \) to \( \mathbf {w}_{\mathrm {m}} \) element-wise, so that it stays in \( \mathbb {R}^{d_{w}} \). We then concatenate the four sentence vectors \( \mathbf {W}_{\mathrm {1}} \) to \( \mathbf {W}_{\mathrm {4}} \):

$$\begin{aligned} \mathbf {W}_{\mathrm {j}}=\frac{1}{m}\sum _{k=1}^{m} \mathbf {w}_{\mathrm {k}} \in \mathbb {R}^{d_{w}} \end{aligned}$$
(3)
$$\begin{aligned} \mathbf {c}_{\mathrm {i}}=\left[ \mathbf {W}_{\mathrm {1}};\mathbf {W}_{\mathrm {2}};\mathbf {W}_{\mathrm {3}};\mathbf {W}_{\mathrm {4}}\right] \in \mathbb {R}^{d_{c}} \end{aligned}$$
(4)

where \( d_c \) denotes the dimension of the context embedding.
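The two pooling steps can be sketched as follows, reading Eq. (2) as a plain average over the entity's word vectors and Eq. (3) as an element-wise average per neighboring sentence, the reading under which all the stated dimensions are consistent.

```python
import torch

def entity_embedding(word_vecs, entity_token_idx):
    """Eq. (2): mean-pool the vectors of the words composing entity m_i."""
    return word_vecs[entity_token_idx].mean(dim=0)       # shape: (d_e,)

def context_embedding(neighbor_sents):
    """Eqs. (3)-(4): average each neighboring sentence's word vectors,
    then concatenate the two sentences above and the two below."""
    sent_vecs = [s.mean(dim=0) for s in neighbor_sents]  # four (d_w,) vectors
    return torch.cat(sent_vecs, dim=-1)                  # shape: (d_c,) = (4 * d_w,)
```

For example, `entity_embedding(x, torch.tensor([0, 1]))` would pool the two tokens of “U.S. planes” from Fig. 1.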

3.2 Graph Construction

We construct the graph in a multi-granularity way, motivated by the fact that each sentence constituting the context contains multiple entities. We design three types of nodes, word, entity and context, to learn multi-granularity semantic information by considering the nearest context and the intra- and inter-sentence information. Note that entities and words are not in one-to-one correspondence. The MHGEE model is then expected to aggregate information from different granularities, as well as model interactions among these nodes for event extraction.

We also define the following six types of edges to reflect the various structural information and the intra-sentence and inter-sentence event interdependency in MHGEE. 1) Word-Word Edge: two words are connected in the syntactic dependency tree. The remaining edges exist based on the following assumptions: 2) Word-Word Edge: a word may be a trigger because it has been a trigger before; 3) Word-Entity Edge: a word belongs to an entity; 4) Word-Entity Edge: an entity and a word have appeared together in a certain event before; 5) Entity-Entity Edge: the types of the two entities have been arguments involved in the same event before; 6) Context-Entity Edge: an entity appears in the context. These edges connect nodes of different granularities with short paths and enable the MHGEE model to learn node representations specific to different edge types. Different edges are used to learn information of different granularities, depending on the types of nodes they connect. A sketch of this construction is given below.
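The sketch below enumerates the six edge types under stated assumptions: the `seen_*` lookup tables are hypothetical statistics compiled from training-set annotations, and node objects are assumed to expose `.text`, `.type`, and `.tokens` attributes.

```python
from collections import defaultdict

def build_edges(words, entities, context, dep_arcs,
                seen_triggers, seen_word_event, seen_role_pairs):
    """Returns a mapping from each of the six edge types to (u, v) pairs."""
    edges = defaultdict(list)
    edges["ww_dep"] = list(dep_arcs)                  # 1) word-word: dependency arc
    cand = [w for w in words if w.text in seen_triggers]
    edges["ww_trig"] = [(u, v) for u in cand          # 2) word-word: both words have
                        for v in cand if u is not v]  #    been triggers before
    for e in entities:
        edges["we_in"] += [(w, e) for w in e.tokens]  # 3) word belongs to entity
        edges["we_evt"] += [(w, e) for w in words     # 4) word and entity appeared
                            if (w.text, e.type) in seen_word_event]  # in some event
        edges["ce"].append((context, e))              # 6) entity appears in context
    edges["ee"] = [(a, b) for a in entities for b in entities
                   if a is not b and (a.type, b.type) in seen_role_pairs]
                                                      # 5) entity types were arguments
                                                      #    of the same event before
    return edges
```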

3.3 Information Aggregation over MHG

Since previous GNNs only consider node-wise connectivity and ignore edge types, we employ R-GCN to perform information dissemination over our graph. R-GCN handles highly relational data well and distinguishes the six edge types when updating nodes; information dissemination over the graph nodes is achieved by aggregation and combination. The update of the i-th node at the l-th layer is formulated as:

$$\begin{aligned} \mathbf {n}_{i}^{(l)}=\frac{1}{\left| \mathcal {N}_{i}\right| } \sum _{j \in \mathcal {N}_{i}} \sum _{r \in \mathcal {R}_{i j}} f_{r}\left( \mathbf {h}_{j}^{(l)}\right) \end{aligned}$$
(5)
$$\begin{aligned} \mathbf {u}_{i}^{(l)}=f_{s}\left( \mathbf {h}_{i}^{(l)}\right) +\mathbf {n}_{i}^{(l)} \end{aligned}$$
(6)

where \(\mathcal {N}_{i}\) is the set of neighbors of node i, \( \mathcal {R}_{ij} \) is the set of edge types between i and j, and \(\mathbf {h}_{j}^{(l)}\) is the representation of node j at layer l. \( f_r \) is a parametrized function specific to edge type \( r\in \mathcal {R} \); both \( f_r \) and \( f_s \) are implemented as MLPs, and \(\mathbf {u}_{i}^{(l)}\) is the updated representation of node i.

We apply a gating mechanism in this module to prevent completely overwriting past information, since it has been shown that GNNs suffer from the over-smoothing problem when the number of layers is large [5]. Formally:

$$\begin{aligned} \mathbf {g}_{i}^{(l)}=\sigma \left( f_{g}\left( \left[ \mathbf {u}_{i}^{(l)} ; \mathbf {h}_{i}^{(l)}\right] \right) \right) \end{aligned}$$
(7)

where \(\sigma \) is the sigmoid function and \( f_g \) is implemented as an MLP. The gating vector \(\mathbf {g}_{i}^{(l)}\) is then applied to control the amount of information taken from neighbor nodes versus the original node:

$$\begin{aligned} \mathbf {h}_{i}^{(l+1)}=\phi \left( \mathbf {u}_{i}^{(l)}\right) \odot \mathbf {g}_{i}^{(l)}+\mathbf {h}_{i}^{(l)} \odot \left( \mathbf {1}-\mathbf {g}_{i}^{(l)}\right) \end{aligned}$$
(8)

where \( \phi \) is the tanh function and \( \odot \) denotes element-wise multiplication. After L rounds of information dissemination, the information of each node is propagated to nodes up to L hops away, generating L-hop relation-aware node representations.
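Equations (5)–(8) together define one propagation layer. Below is a minimal sketch in which single linear layers stand in for the MLPs \( f_r \), \( f_s \) and \( f_g \), and edges arrive as (relation, target, neighbor) triples; the per-node edge count approximates \( \left| \mathcal {N}_{i}\right| \).

```python
import torch
import torch.nn as nn

class GatedRGCNLayer(nn.Module):
    """One layer of Eqs. (5)-(8): relation-aware aggregation plus gating."""
    def __init__(self, dim, n_relations):
        super().__init__()
        self.f_r = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_relations))
        self.f_s = nn.Linear(dim, dim)       # self transform in Eq. (6)
        self.f_g = nn.Linear(2 * dim, dim)   # gate in Eq. (7)

    def forward(self, h, edges):
        # h: (n_nodes, dim); edges: iterable of (r, i, j) triples.
        n = torch.zeros_like(h)
        deg = torch.zeros(h.size(0), 1)
        for r, i, j in edges:
            n[i] = n[i] + self.f_r[r](h[j])          # numerator of Eq. (5)
            deg[i] += 1                              # neighbor count
        u = self.f_s(h) + n / deg.clamp(min=1)       # Eqs. (5)-(6)
        g = torch.sigmoid(self.f_g(torch.cat([u, h], dim=-1)))  # Eq. (7)
        return torch.tanh(u) * g + h * (1 - g)       # Eq. (8)
```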

3.4 Classification Layer

Following previous works [1, 8, 11, 17], we formulate event extraction as a sequence labeling task: each word in a sentence is assigned a label that contributes to the event annotation. We apply the BIO annotation schema to assign a trigger label \( t_i \) to each token \( w_i \), since some triggers consist of multiple tokens. The tag “O” represents the “Other” tag, meaning that the corresponding word is irrelevant to the target events, while the “B-type” and “I-type” tags encode two parts: the word's position in the trigger and the event type. A small example follows.
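For instance, the trigger labels for S2 in Fig. 1 would look roughly as follows (the tokenization here is hypothetical):

```python
# "dropped" triggers a Conflict:Attack event; all other tokens get "O".
tokens = ["U.S.", "planes", "dropped", "bombs", "on", "Iraqis"]
labels = ["O",    "O",      "B-Conflict:Attack", "O", "O", "O"]
# A multi-token trigger would continue with "I-" tags, e.g.
# ["B-Conflict:Attack", "I-Conflict:Attack"] for a two-word trigger.
```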

After aggregating the word and entity node representations from the R-GCN, we feed each word representation into a fully-connected network followed by a softmax function to compute a distribution over all event types:

$$\begin{aligned} y_{t_{i}}={\text {softmax}}\left( \mathbf {W}_{t} \mathbf {h}+b_{t}\right) \end{aligned}$$
(9)

where \(\mathbf {W}_{t}\) maps the word node representation \( \mathbf {h} \) to a score for each event type and \( b_t \) is a bias term. We choose the event label with the largest probability in \(y_{t_{i}}\) as the classification result.

After obtaining trigger candidates of certain types from the trigger labels, we predict the role that each entity \( e_j \) plays in such events. We aggregate word embeddings into the trigger candidate vector \( t_i \) and entity vector \( e_j \) by average pooling along the sequence length dimension, where the trigger candidate vector \( t_i \) is pooled over the words that form the trigger. We then concatenate them and feed the result into a new fully-connected network to predict the argument role:

$$\begin{aligned} y_{a_{i j}}={\text {softmax}}\left( \mathbf {W}_{a}\left[ t_{i}, e_{j}\right] +b_{a}\right) \end{aligned}$$
(10)

where \( y_{a_{i j}} \) represents the final output of the role the j-th entity plays in the event triggered by the i-th trigger candidate, and \( b_a \) is also a bias term.
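The two heads of Eqs. (9)–(10) can be sketched as follows; the class name and average-pooling helpers are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class ClassificationLayer(nn.Module):
    """Trigger head (Eq. 9) over words; argument head (Eq. 10) over pairs."""
    def __init__(self, dim, n_event_labels, n_role_labels):
        super().__init__()
        self.trigger_head = nn.Linear(dim, n_event_labels)   # W_t, b_t
        self.role_head = nn.Linear(2 * dim, n_role_labels)   # W_a, b_a

    def trigger_logits(self, h_words):
        return self.trigger_head(h_words)   # softmax over these gives Eq. (9)

    def role_logits(self, h_words, trig_idx, ent_idx):
        t = h_words[trig_idx].mean(dim=0)   # average-pool trigger tokens
        e = h_words[ent_idx].mean(dim=0)    # average-pool entity tokens
        return self.role_head(torch.cat([t, e], dim=-1))  # Eq. (10)
```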

3.5 Biased Loss Function

We minimize the joint negative log-likelihood loss function with a bias term as follows:

$$\begin{aligned} J(\theta )=-\sum _{k=1}^{N}\left( \sum _{i=1}^{n_{k}} I(O) \log \left( p\left( y_{t_{i}} \mid \theta \right) \right) +\beta \sum _{i=1}^{t_{k}} \sum _{j=1}^{e_{k}} \log \left( p\left( y_{a_{i, j}} \mid \theta \right) \right) \right) \end{aligned}$$
(11)

where N is the number of sentences in the training dataset; \( n_k \), \( t_k \) and \( e_k \) are the numbers of words, extracted trigger candidates and entities of the k-th sentence; I(O) is a switching function that distinguishes the loss of the tag “O” from that of the event type tags: it outputs 1 if the tag is “O” and 0 otherwise; \( \beta \) is a bias weight.
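A per-sentence sketch of Eq. (11) follows, implementing the switching function I(O) exactly as defined in the text; `o_tag_id` is the index of the “O” tag and the argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def biased_loss(trigger_logits, trigger_gold, role_logits, role_gold,
                o_tag_id, beta=5.0):
    """Negative log-likelihood with the I(O) switch and argument bias beta."""
    log_p = F.log_softmax(trigger_logits, dim=-1)            # (n_words, n_labels)
    gold_lp = log_p.gather(-1, trigger_gold.unsqueeze(-1)).squeeze(-1)
    i_o = (trigger_gold == o_tag_id).float()                 # I(O): 1 iff tag is "O"
    trig_loss = -(i_o * gold_lp).sum()
    role_lp = F.log_softmax(role_logits, dim=-1)             # (t_k * e_k, n_roles)
    arg_loss = -role_lp.gather(-1, role_gold.unsqueeze(-1)).squeeze(-1).sum()
    return trig_loss + beta * arg_loss                       # summed over sentences
```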

4 Experiments and Results

4.1 Experiment Settings

Dataset and Evaluation Metrics. We conduct all experiments on the standard supervised ACE 2005 dataset, which consists of 599 documents annotated with 33 event subtypes and 34 role classes. We add the NONE class and the BIO annotation schema to the labels, so the total number of labels for event detection is 67 and the total number of labels for argument extraction is 37. In both subtasks, the tag “O” represents the “Other” tag, meaning that the corresponding word is irrelevant to any type. We use the same data split as previous works [1, 11, 16, 17] for comparison: 40 articles with 881 sentences for the test set, 30 other documents with 1,087 sentences for the development set, and the remaining 529 documents with 21,090 sentences for the training set. We follow the traditional evaluation metrics: 1) Trigger Identification (TI); 2) Trigger Classification (TC); 3) Argument Identification (AI); 4) Argument Classification (AC), and report the official Precision, Recall and F1-score at the evaluation stage.

Hyper-parameter Setting. The learning rate and batch size in our experiments are set to 2 and 32, respectively. For all experiments below, we use 300 dimensions for word embeddings and 50 dimensions each for the POS-tagging, positional and entity type embeddings. In the R-GCN module, we use two layers. The bias parameter \(\beta \) in the biased loss function is set to 5.

4.2 Baselines

We compare our proposed MHGEE model with a range of state-of-the-art models to comprehensively evaluate the performance gains: 1) DMCNN [1] builds a dynamic multi-pooling convolutional model to learn sentence features; 2) Cross-Event [7] uses document-level information to improve performance; 3) GAIL [18] is based on inverse reinforcement learning; 4) JointBeam [6] extracts events via structured prediction with manually designed features; 5) Joint3EE [15] is based on shared hidden representations; 6) JRNN [11] employs a bidirectional RNN and manually designed features for joint event extraction; 7) Embedding+T uses word embedding vectors and traditional sentence-level features; 8) PSL [8] uses a probabilistic reasoning model to classify events; 9) HBTNGMA [2] models sentence-level event interdependency via a hierarchical and bias tagging model. Some baselines use BERT as the pre-trained language model: 10) BERT_QA [4] is a QA-based model that applies machine reading comprehension to both subtasks; 11) TEXT2EVENT [10] presents a generation-based paradigm; 12) DMBERT [16] mainly focuses on training data augmentation with external unlabeled data through an adversarial mechanism. Other models build a GNN over the dependency tree of a sentence to exploit syntactic information: 13) GCN-ED [12] is the first attempt to explore how to effectively use GCNs in event detection; 14) JMEE [9] enhances GCN with self-attention and highway networks to improve event detection; 15) MOGANED [17] improves GCN with aggregated attention to combine multi-order word representations from different GCN layers.

4.3 Overall Performance and Ablation Analysis

Table 1 shows the overall performance. Our MHGEE model achieves the best F1 scores for event extraction among all compared methods, with a significant gain of nearly 5% on trigger identification and over 2% on argument identification over the best reported models. In addition, our MHGEE model outperforms the BERT-based models without using BERT as a pre-trained language model, even though the BERT encoder has proven effective at improving downstream natural language processing tasks such as event extraction. This demonstrates the effectiveness of aggregating information of different granularities for event extraction. Compared with previous GNN-based models, our MHGEE model also completes the argument extraction subtask, taking argument information and the interaction between the two subtasks into consideration. This information interaction between arguments and triggers clearly improves the performance of event extraction.

Table 1. Overall performance compared with the SOTA methods

Table 2 shows the ablation analysis of our study. If one type of node is removed, the corresponding edges are also removed from the heterogeneous graph. The F1 score drops by more than 2 points regardless of which edge types, context nodes or entity nodes we remove. Removing entity nodes causes a more significant decline in F1 than removing context nodes. This indicates that all kinds of nodes and edges in our MHGEE model play important roles, but entity nodes are the most essential: when alleviating the two challenges, we depend more on entity information, so entity nodes act as key nodes. If no entity information in the context helps determine the triggers, the context itself becomes less necessary.

Table 2. Results of ablation studies on ACE 2005 dataset

Additionally, when using identified triggers instead of gold triggers, the F1 score of the event detection task does not drop significantly, but the F1 score of the event argument extraction task drops by more than 10%. This result explains why we utilize gold triggers rather than identified triggers for the event argument extraction task: identified triggers cause the error propagation problem.

Overall, information of different granularities and all edge types promote the interaction among nodes through R-GCN, helping to capture the information in the multi-granularity heterogeneous graph for event extraction, which ultimately benefits performance.

4.4 Effect on Event Interdependency

Following previous works [1, 9, 11], we split the test data into two parts, 1/1 and 1/N, to evaluate how well our model alleviates the multiple-event phenomenon. 1/1 means that a sentence has only a single trigger; 1/N covers all remaining cases. We perform the evaluations separately.

Table 3. Performance on single event sentences and multiple event sentences

Table 3 reports the F1 scores of Embedding+T [6], CNN [1], JRNN [11], DMCNN [1] and our model for event extraction. CNN is similar to DMCNN except that it applies the standard max-pooling mechanism. Our MHGEE model significantly outperforms all the other methods on the trigger classification subtask. On the 1/N split of triggers, our model is 3.1% better than JMEE. This demonstrates that our model, by utilizing the multi-granularity heterogeneous graph and modeling intra-sentence and inter-sentence event interdependency, can capture multiple events in one sentence effectively.

Table 4. Performance on single event sentences and multiple event sentences.
Fig. 3. The example of the case study. Yellow highlighted content indicates entity information that can be used to resolve ambiguity. The context of the sentence containing the trigger is shown, along with the true and predicted BIO annotations of the sentence in other colors; if the colors match, our model predicts correctly, otherwise it does not. (Color figure online)

Table 4 shows the event extraction performance of our MHGEE model on both 1/1 and 1/N. Our model performs better on 1/N. This indicates that multi-granularity information brings a greater gain in distinguishing the semantics of different triggers in one sentence, that is, in modeling event interdependency, which stems from the fact that multiple events are often associated with each other and have similar event types. We model this intra-sentence and inter-sentence event interdependency through a heterogeneous graph, which not only captures multiple events in one sentence but also mitigates their similarity.

4.5 Case Study and Effect on Ambiguity of Triggers

In Fig. 3, we show two case-study examples of trigger ambiguity. In (a) and (b), the words “discuss” and “fight” each trigger two different events, so ambiguity occurs in both cases. Our MHGEE model resolves the ambiguity in (a) but fails in (b). By the design of our model, we need to learn rich information of different granularities, including entities and contexts, to resolve ambiguity. In (a) there is enough information to do so, but in (b) there is not enough semantic information.

The two event types triggered by “fight” have certain similarities, and “elected” in the context serves as the trigger of a “Personnel: elected” event, which is irrelevant to the two events triggered by “fight” and thus provides no evidence to resolve the ambiguity. The case study shows that multi-granularity information does help alleviate the ambiguity of triggers: there is indeed rich multi-granularity information around some event mentions, and our MHGEE model can resolve the ambiguity by aggregating this information.

5 Conclusions and Future Works

In this paper, we propose a novel model, MHGEE, for event extraction. To disambiguate triggers, our MHGEE model aggregates nodes and edges of different granularities into a heterogeneous graph and enables rich information interactions among them via R-GCN. In addition, we address the multiple-event phenomenon by modeling intra-sentence and inter-sentence event interdependency. The experimental results demonstrate that our MHGEE model achieves new state-of-the-art performance on the ACE 2005 dataset. In the future, we would like to apply MHGEE to other information extraction tasks, such as aspect extraction and named entity recognition.