
1 Introduction

Relation Extraction (RE), the task of automatically extracting relational facts among entities from raw text, is widely used in knowledge base construction [22] and question answering [18]. Previous research mainly focuses on sentence-level RE, which identifies relations between an entity pair within a single sentence. However, a large number of relational facts are expressed across multiple sentences and thus cannot be captured by sentence-level RE. Researchers have therefore paid increasing attention to document-level RE.

Fig. 1. An example document from DocRED. Entities are distinguished by color, with the reasoning clues and relation labels listed alongside.

Doc-level RE not only subsumes sentence-level RE but also captures complex interactions among cross-sentence entities in a document. Recent studies focus on graph-based reasoning techniques [5, 14, 16], where coreference information, especially among mentions, is extensively used for logical inference. However, the coreference information carried by pronouns, which is beneficial for obtaining interactive information across sentences [3] and for multi-hop graph convolution, is ignored.

Figure 1 shows an example from the DocRED dataset [15]. As can be readily seen, only based on the fact that the mention Colette de Jouvebel (in the 1st sentence) and the pronoun she (in the 8th sentence) refer to the same entity can we infer that the relation of the entity pair (Colette de Jouvebel, Lachaise) is place of death. The relational reasoning pattern for the entity pairs (Colette de Jouvebel, Castel-Novel) and (Colette, Lachaise) is the same. Therefore, the pronouns in documents carry rich semantic information, which is vital to Doc-level RE. To verify this hypothesis, we randomly sample 100 documents from the DocRED training set and count the pronouns and mention-pronoun pairs. Table 1 shows that each document contains approximately 32 pronouns ("he", "him", "his", "she", "her", etc.) and 14 mention-pronoun pairs. Clearly, pronouns can provide significant clues for Doc-level RE if suitable strategies are designed.

Table 1. Statistics of pronouns and mention-pronoun pairs.

To capture the features carried by pronouns, we propose CorefDRE, a novel Coref-aware Doc-level RE model based on a Graph Inference Network. CorefDRE is a fine-tuned coreference-aware approach that directly instructs the model to learn the coreference information produced by mentions and pronouns. Specifically, we propose a heterogeneous graph, the Mention-Pronoun Affinity Graph (MPAG), with two types of nodes (mention nodes and pronoun nodes) and three types of edges (intra-sentence, intra-entity, and mention-pronoun edges) to capture the semantic information of pronouns in the document for relation extraction. MPAG is a fusion of the Mention Graph (MG) [20] and the Mention-Pronoun Graph (MPG), the latter constructed with NeuralCoref, an extension to spaCy. We then apply GCNs [6] to MPAG to obtain a representation for each mention and pronoun. Next, we merge mentions and pronouns that refer to the same entity to obtain the Entity Graph (EG), on which we infer multi-hop relations between entities. Meanwhile, to reduce the noise introduced by NeuralCoref, we propose a noise suppression mechanism that first computes the affinity of each mention-pronoun pair as the edge weight in MPAG and then suppresses low-weight edges during the fusion of MPAG into EG.

Our contributions are summarized as follows:

  • We introduce a novel heterogeneous graph, the Mention-Pronoun Affinity Graph (MPAG), which integrates the coreference information produced by mentions and pronouns to better adapt to the Doc-level RE task.

  • We propose a noise suppression mechanism that computes the affinity between a mention and its corresponding pronoun to suppress the noise produced by false mention-pronoun pairs.

  • We conduct experiments on the public datasets DocRED, DialogRE, and MPDD, where our model outperforms the baselines by roughly 1.7–2.0 F1, demonstrating the effectiveness of CorefDRE.

This paper is organized as follows: Sect. 1 outlines research on Doc-level RE and the main contributions of this paper; Sect. 2 and Sect. 3 detail the proposed model and the experimental results, respectively; Sect. 4 describes related work on graph-based Doc-level RE; and Sect. 5 summarizes the strengths and limitations of this work and provides directions for future research.

2 Proposed Approach

We formulate the task of document-level RE (Doc-level RE) as follows. Document D: the document is raw text containing multiple sentences, namely \(D = \left\{ s_{1}, s_{2}, \ldots , s_{n}\right\} \). Entity E: the entity set E consists of the entities that appear in the document. Each entity \(e_{i}\) is represented by a set of mentions in the document together with an entity type: \(e_{i} = \left( \left\{ m_{i1},m_{i2},\dots \right\} ,t_{i}\right) \), where \(t_{i} \in R_{e}\) (the set of predefined entity types in the datasets). Mention m: mentions are the surface forms of entities in a document, and each mention is a span of words: \(m = \{ w_{1},w_{2},\dots \}\). Pronoun p: pronouns are words that can refer to a mention in a document (e.g., it, he, she).
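This formulation maps naturally onto simple data structures. The sketch below is purely illustrative (the class and field names are ours, not part of the paper):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Mention:
    words: List[str]   # the span of words m = {w1, w2, ...}
    sent_id: int       # index of the sentence containing the span

@dataclass
class Entity:
    mentions: List[Mention]  # all mentions of the entity in the document
    etype: str               # entity type t_i from the predefined set R_e

@dataclass
class Document:
    sentences: List[List[str]]  # D = {s1, s2, ..., sn}, tokenized
    entities: List[Entity]      # the entity set E

# A predicted relational fact r_{s,o} = f(e_s, e_o):
RelationFact = Tuple[int, int, str]  # (subject index, object index, relation)
```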

Given the document D and entity set E, Doc-level RE is required to predict the relational facts between entities, namely \(r_{s, o}=f( e_{s},e_{o})\), where \(e_{s}\) and \(e_{o}\) are the subject and object entities in E, and \(r_{s,o}\) is a relational fact in the predefined relation set R. To produce this output, our model, Coref-aware Doc-level RE based on Graph Inference Network (CorefDRE), consists of three modules: the Mention-Pronoun Affinity Graph construction module (Subsect. 2.1), the noise suppression mechanism (Subsect. 2.2), and the graph inference module (Subsect. 2.3), as shown in Fig. 2.

Fig. 2. The architecture of CorefDRE. First, the document is fed into the encoder and MG is constructed. Second, mention-pronoun pairs are found and the noise suppression mechanism calculates the affinity of each pair as the weight of its mention-pronoun edge, from which MPG is constructed. Third, MG and MPG are merged into MPAG, where mention-pronoun pairs with low affinity are inhibited. Finally, after applying GCNs, MPAG is transformed into EG, where paths between entities are identified for reasoning. Entities are shown in different colors, and the number i in each node indicates that it belongs to the i-th sentence.

2.1 Mention-Pronoun Affinity Graph Construction Module

To model coreference relationships and enhance the interactions between entities, the Mention-Pronoun Affinity Graph (MPAG), a combination of MG and MPG, is constructed. MG is constructed following Zeng et al. [20], but without the document node. MPG is constructed from the mention-pronoun pairs generated by NeuralCoref. Specifically, NeuralCoref first identifies the pronouns that refer to the same mention and clusters the mention-pronoun pairs into coreference clusters. For instance, from the sentences shown in Fig. 2 we can obtain mention-pronoun pair clusters such as [(Bel Gazou, she), (Bel Gazou, She), ..., (Colette, her mother)]. The m and p in a pair (m, p) correspond to a mention node m and a pronoun node p of MPG, respectively, connected by a mention-pronoun edge.
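As an illustration, such clusters can be extracted with NeuralCoref roughly as follows. This is a minimal sketch against the public spaCy 2.x / neuralcoref API; pairing each pronoun with the cluster's main mention is our simplification:

```python
import spacy
import neuralcoref  # pip install neuralcoref (requires spaCy 2.x)

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

PRONOUNS = {"he", "him", "his", "she", "her", "hers", "it", "its",
            "they", "them", "their"}

def mention_pronoun_pairs(text):
    """Pair each pronoun in a coreference cluster with the cluster's main mention."""
    doc = nlp(text)
    pairs = []
    for cluster in doc._.coref_clusters:  # e.g. Bel Gazou: [Bel Gazou, she, She, ...]
        main = cluster.main               # representative mention of the cluster
        for span in cluster.mentions:
            if span.text.lower() in PRONOUNS:
                pairs.append((main.text, span.text))
    return pairs
```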

There are two types of nodes and three types of edges in MPAG:

Mention Node: each mention node in the graph corresponds to a particular mention of an entity. The representation of mention node \(m_{i}\) is defined as the concatenation of semantic embedding, coreference embedding, and type embedding [15], namely \(m_{i}=\left[ avg_{w_{k} \in m_{i}}\left( h_{k}\right) ;~t_{m};~c_{m}\right] \), where \(t_{m} \in R_{e}\), \(c_{m}\) indicates which entity the mention refers to, and \(avg_{w_{k} \in m_{i}}\left( h_{k}\right) \) is the average representation of the words in the mention, produced by the encoder.

Pronoun Node: each pronoun (e.g., it, his, she) that refers to a mention in the document corresponds to a pronoun node. The representation of a pronoun node is built like that of a mention, with the type embedding and coreference embedding taken from the corresponding mention node.

Intra-entity Edge: nodes that refer to the same entity are fully connected by intra-entity edges. These edges model the interaction among the different mentions and pronouns of the same entity and establish cross-sentence interactions.

Intra-sentence Edge: if two nodes co-occur in a single sentence, there is an intra-sentence edge between them. These edges model the interaction among mentions and pronouns referring to different entities.

Mention-Pronoun Edge: the mention-pronoun edges are inherited from MPG. These edges strengthen the interaction of semantic information across sentences through coreference information.

To initialize MPAG, we follow GAIN, proposed by Zeng et al. [20], and then dynamically update MPAG by applying a Graph Convolutional Network [6] over the heterogeneous graph. Given node \(n_{i}\) at the l-th layer, the heterogeneous graph convolution is defined as follows:

$$\begin{aligned} n_{i}^{l+1}=\sigma \left( \sum _{e \in E} \sum _{j \in N_{e}(i)} \frac{1}{\left| N_{e}(i)\right| }\left( W_{e}^{l} n_{j}^{l}+b_{e}^{l}\right) \right) \end{aligned}$$
(1)

where \(\sigma (\cdot )\) is the activation function, E denotes the set of edge types, \(N_{e}(i)\) denotes the neighbors of node \(n_{i}\) connected by edges of type e, and \(W_{e}^{l} \in R^{d \times d}\), \(b_{e}^{l} \in R^{d}\) are trainable parameters.

To cover features at all levels, node \(n_{i}\) is represented as the concatenation of its hidden states from every layer:

$$\begin{aligned} n_{i}=\left[ n_{i}^{0} ; n_{i}^{1} ; \ldots ; n_{i}^{N}\right] \end{aligned}$$
(2)

where \(n_{i}^{l}\) is the representation of node \(n_{i}\) at layer l, and \(n_{i}^{0}\) is its initial representation, formed from the document representation produced by the encoder.
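A minimal PyTorch sketch of Eqs. (1)–(2), with one weight matrix per edge type and mean aggregation over neighbors (module and variable names are ours):

```python
import torch
import torch.nn as nn

class HeteroGCNLayer(nn.Module):
    """One layer of Eq. (1): a separate linear map (W_e, b_e) per edge type,
    averaged over each node's neighbors of that type."""
    def __init__(self, dim, edge_types):
        super().__init__()
        self.linears = nn.ModuleDict({e: nn.Linear(dim, dim) for e in edge_types})

    def forward(self, h, adj):
        # h: (num_nodes, dim); adj[e]: (num_nodes, num_nodes) binary adjacency
        out = torch.zeros_like(h)
        for e, a in adj.items():
            deg = a.sum(dim=1, keepdim=True).clamp(min=1)  # |N_e(i)| per node
            out = out + (a @ self.linears[e](h)) / deg     # mean of W_e n_j + b_e
        return torch.relu(out)

def node_representation(h0, layers, adj):
    """Eq. (2): concatenate the hidden states of every layer, [n^0; n^1; ...]."""
    hs = [h0]
    for layer in layers:
        hs.append(layer(hs[-1], adj))
    return torch.cat(hs, dim=-1)
```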

2.2 Noise Suppression Mechanism

Mention-pronoun pairs introduce noise because of the imperfect fit between the datasets and NeuralCoref. Therefore, we propose a noise suppression mechanism to filter this noise during graph inference (Subsect. 2.3). In our framework, we adopt BERT to measure the affinity of each mention-pronoun pair, which serves as the weight of the corresponding mention-pronoun edge. For each (mention, pronoun) pair, we concatenate the contexts of the mention and the pronoun as input and produce a single affinity scalar per pair when constructing MPAG. The input takes the following form:

$$\begin{aligned} \begin{aligned} {[CLS] \left\langle \text { Mention }\right\rangle [SEP]\left\langle \text { Pronoun }\right\rangle [SEP]} \\ \text {where } \left\langle \star \right\rangle :=c_{l}\,[START] \star [END]\,c_{r} \end{aligned} \end{aligned}$$
(3)

where \(\star \) denotes the mention tokens or pronoun tokens, and \(c_{l}\) and \(c_{r}\) denote the text to the left and right of \(\star \), respectively. \([START]\) and \([END]\) are two special tokens, fine-tuned during training, that mark the start and end of \(\star \) in its context.
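Concretely, the input of Eq. (3) can be assembled as below. This is a sketch with the Hugging Face tokenizer; treating [START]/[END] as added special tokens is our reading (the model's embedding matrix must then be resized accordingly):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[START]", "[END]"]})

def mark(span, left_ctx, right_ctx):
    # <*> := c_l [START] * [END] c_r
    return f"{left_ctx} [START] {span} [END] {right_ctx}"

def affinity_input(mention, m_left, m_right, pronoun, p_left, p_right):
    # Two-segment input: [CLS] <Mention> [SEP] <Pronoun> [SEP]
    return tokenizer(
        mark(mention, m_left, m_right),
        mark(pronoun, p_left, p_right),
        return_tensors="pt",
    )
```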

Inspired by Angell et al. [1], we make the affinity symmetric by averaging the representations of (mention, pronoun) and (pronoun, mention). The affinity of a mention-pronoun pair is then computed by passing this enhanced pair representation through a linear layer with sigmoid activation; an affinity close to 1 is a strong signal for the fusion of MPAG. To train the affinity module, we design positive and negative sampling: we screen 300 positive samples \(D_{p}\) from the pairs D produced by NeuralCoref and randomly replace the mention m of each positive sample with another mention \(m^{\prime }\) to construct negative samples. We then minimize the following triplet max-margin loss:

$$\begin{aligned} L_\varphi =\sum _{(m,p_+)\in P^+}\ \sum _{(m,p_-)\in P^-} l\left( m,p_+,p_-\right) \end{aligned}$$
(4)
$$\begin{aligned} l(g,p,n)=\left[ aff(g, n)^{2}-(1-aff(g, p))^{2}\right] _{+} \end{aligned}$$
(5)

where \(aff\left( m,p \right) \) is the affinity between mention m and pronoun p. The g, p, and n in Eq. (5) denote the mention, a positive pronoun referring to it, and a negative pronoun, respectively; the hinge \([\cdot ]_{+}\) is active only when the affinity to the negative pronoun exceeds the margin left by the positive pair.
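Under the reconstruction above (the printed formula is garbled in this copy, so this is our reading, not the authors' exact code), the loss can be transcribed as:

```python
import torch

def triplet_affinity_loss(aff_pos, aff_neg):
    # aff_pos: affinities aff(g, p) of positive mention-pronoun pairs, in [0, 1]
    # aff_neg: affinities aff(g, n) of sampled negatives, paired element-wise
    #          with the positives (one negative per positive; our simplification)
    # Eq. (5): the hinge is active only while aff_neg^2 > (1 - aff_pos)^2.
    return (aff_neg ** 2 - (1.0 - aff_pos) ** 2).clamp(min=0).sum()
```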

2.3 Graph Inference Module

Graph Merging. Inspired by Zeng et al. [20], we predict relational facts between entity pairs by reasoning on the Entity Graph (EG), which is transformed from MPAG. The dynamic process of merging MPAG into EG consists of three steps:

Step 1: pronoun nodes that refer to the same mention are merged with the corresponding mention node to form a new mention node. Note that if the affinity of a mention-pronoun pair is below the threshold \(\theta \), the pronoun does not participate in the merging, so that noise is suppressed. The i-th mention node merged from N pronoun nodes is represented by concatenating the mention representation with the affinity-weighted average of its N pronoun node representations:

$$\begin{aligned} m_{i}=\bar{m}_{i} \oplus \frac{1}{N} \sum _{n:\,aff_{n} \ge \theta } aff_{n}\, p_{n} \end{aligned}$$
(6)

where \(\bar{m}_{i}\) denotes the mention representation, \(p_{n}\) is the n-th pronoun referring to mention \(m_{i}\), \(aff_{n}\) is the affinity of the \((\bar{m}_{i},p_{n})\) pair, and \(\oplus \) denotes concatenation.

Step 2: mention nodes that refer to the same entity are merged into an entity node in EG. The i-th entity node merged from N mention nodes is represented by the average of its N mention node representations:

$$\begin{aligned} e_{i}=\frac{1}{N} \sum _{n} m_{n} \end{aligned}$$
(7)

Step 3: intra-sentence edges between mentions that refer to two entities are merged into a bidirectional edge between the corresponding entity nodes in EG. The directed edge from entity node \(e_{i}\) to \(e_{j}\) in EG is defined as:

$$\begin{aligned} edge_{ij}=\sigma \left( W_{q}\left[ e_{i} ; e_{j}\right] +b_{q}\right) \end{aligned}$$
(8)

where \(W_{q}\) and \(b_{q}\) are trainable parameters and \(\sigma \) is an activation function (e.g., ReLU).
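The three merging steps can be sketched as follows. This is a simplified PyTorch rendering under our reading of Eqs. (6)–(8); the threshold value and the handling of mentions with no retained pronoun are our assumptions:

```python
import torch

def merge_pronouns(m_bar, pronouns, affinities, theta=0.5):
    """Step 1, Eq. (6): concatenate the mention with the affinity-weighted
    average of its pronoun nodes, dropping pairs with aff < theta."""
    kept = [a * p for a, p in zip(affinities, pronouns) if a >= theta]
    if not kept:  # no reliable pronoun: zero-pad so dimensions stay fixed
        return torch.cat([m_bar, torch.zeros_like(m_bar)])
    return torch.cat([m_bar, torch.stack(kept).mean(dim=0)])

def merge_mentions(mentions):
    """Step 2, Eq. (7): an entity node is the average of its mention nodes."""
    return torch.stack(mentions).mean(dim=0)

class EntityEdge(torch.nn.Module):
    """Step 3, Eq. (8): edge representation from the two entity endpoints."""
    def __init__(self, dim):
        super().__init__()
        self.linear = torch.nn.Linear(2 * dim, dim)

    def forward(self, e_i, e_j):
        return torch.relu(self.linear(torch.cat([e_i, e_j], dim=-1)))
```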

We model potential reasoning clues between entity nodes in EG through the paths between them. Based on the edge representations, the two-hop path between entity nodes \(e_{s}\) and \(e_{o}\) through an intermediate node i is defined as:

$$\begin{aligned} p_{s, o}^{i}=[edge_{s, i} ;edge_{i, o} ;edge_{o, i} ;edge_{i, s} ] \end{aligned}$$
(9)

where i denotes the intermediate node. Since there may be multiple paths between two entity nodes, an attention mechanism is introduced to fuse the path information, paying more attention to the stronger paths. The path information between entities in EG is defined as:

$$\begin{aligned} s_{i}=\sigma \left( \left[ e_{s} ; e_{o}\right] \cdot W_{l} \cdot p_{s, o}^{i}\right) \end{aligned}$$
(10)
$$\begin{aligned} \alpha _{i}=\frac{e^{s_{i}}}{\sum _{j} e^{s_{j}}} \end{aligned}$$
(11)
$$\begin{aligned} p_{s, o}=\sum _{i} \alpha _{i} p_{s, o}^{i} \end{aligned}$$
(12)

where \(\alpha _{i}\) is the attention weight of the i-th path and \(\sigma \) is an activation function (e.g., ReLU).
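Eqs. (10)–(12) amount to standard attention pooling over the candidate paths; a minimal sketch (shapes and names are ours):

```python
import torch
import torch.nn.functional as F

def attend_paths(e_s, e_o, paths, W_l):
    """Fuse the multi-hop paths between e_s and e_o, Eqs. (10)-(12)."""
    query = torch.cat([e_s, e_o], dim=-1)             # [e_s; e_o]
    scores = torch.stack(
        [torch.relu(query @ W_l @ p) for p in paths]  # s_i, Eq. (10)
    )
    alpha = F.softmax(scores, dim=0)                  # Eq. (11)
    return (alpha.unsqueeze(-1) * torch.stack(paths)).sum(dim=0)  # Eq. (12)
```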

Relation Inference. Through the fusion of MPAG and the noise suppression mechanism, the homogeneous graph EG is dynamically constructed, and the relations between entity nodes can be predicted by path inference. To identify the relation of an entity pair \(( e_{s},e_{o})\), we concatenate the following representations into \(I_{s,o}\) and compute the probability of each relation r from the predefined relation schema as in Eq. (14):

$$\begin{aligned} I_{s, o}=\left[ e_{s} ; e_{o} ;\left| e_{s}-e_{o}\right| ; e_{s} \odot e_{o} ; p_{s, o}\right] \end{aligned}$$
(13)
$$\begin{aligned} P\left( r \mid \mathbf {e}_{s}, \mathbf {e}_{o}\right) ={\text {sigmoid}}\left( W_{b} \sigma \left( W_{a} I_{s, o}+b_{a}\right) +b_{b}\right) \end{aligned}$$
(14)

where \(e_{s}\) and \(e_{o}\) are the representations of the subject and object entities in EG, \(p_{s, o}\) is the fused inferential path information, \(W_{a}\), \(W_{b}\), \(b_{a}\), \(b_{b}\) are trainable parameters, and \(\sigma \) is an activation function (e.g., ReLU). We use binary cross entropy as the loss function to train our model:

$$\begin{aligned} L=-\sum _{D \in S} \sum _{s \ne o} \sum _{r \in R} { CrossEntropy }\left( P\left( r \mid \mathbf {e}_{s}, \mathbf {e}_{o}\right) , \overline{P}\left( r \mid \mathbf {e}_{s}, \mathbf {e}_{o}\right) \right) \end{aligned}$$
(15)

where S denotes the whole corpus and \(\overline{P}\left( r \mid \mathbf {e}_{s}, \mathbf {e}_{o}\right) \) refers to the ground truth.
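Putting Eqs. (13)–(15) together, the classifier head can be sketched as follows (a minimal PyTorch version; hidden sizes and names are placeholders of ours):

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Scores every relation r for an entity pair, Eqs. (13)-(14)."""
    def __init__(self, ent_dim, path_dim, hidden, num_relations):
        super().__init__()
        in_dim = 4 * ent_dim + path_dim  # [e_s; e_o; |e_s - e_o|; e_s * e_o; p_so]
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_relations),
        )

    def forward(self, e_s, e_o, p_so):
        i_so = torch.cat([e_s, e_o, (e_s - e_o).abs(), e_s * e_o, p_so], dim=-1)
        return torch.sigmoid(self.mlp(i_so))  # independent probability per relation

# Eq. (15): multi-label binary cross entropy against the gold relation set
loss_fn = nn.BCELoss()
```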

3 Experiments

3.1 Dataset and Experimental Settings

DocRED [15]: DocRED consists of 3053 documents for training, 1000 for development, and 1000 for testing; about 40.7\(\%\) of the relational facts require reasoning over multiple sentences. DialogRE [17]: DialogRE includes 1073 dialogues for training, 358 for development, and 357 for testing; 95.6\(\%\) of the relational triples must be inferred across multiple sentences, where pronouns are used extensively. MPDD [2]: a publicly available Chinese dialogue dataset with emotion and interpersonal relation labels and abundant pronouns.

To learn an effective representation for documents and capture the context of each mention, following Yao et al. [15], for each word we concatenate its word embedding, entity type embedding, and entity id embedding, and then feed the word representations into the GloVe-based encoder or BERT to obtain the document representation. We extract mention-pronoun coreference with Huggingface's NeuralCoref and fine-tune BERT to compute the affinity of each mention-pronoun pair. We use 2 layers of GCN to encode MPAG and EG. Our model is optimized with AdamW [9], with the GCN dropout rate set to 0.6 and the learning rate to 0.001.

3.2 Baseline Models

We use the following models as baselines.

CNN & BiLSTM: Yao et al. [15] used CNN and BiLSTM encoders to map the document into a sequence of hidden state vectors. Context-Aware: Yao et al. [15] also used an LSTM to encode the document and an attention mechanism to fuse contextual information for prediction. CorefBERT: a pre-trained model proposed by Ye et al. [16] for word embedding. DocuNet-BERT: Zhang et al. [21] proposed a U-shaped segmentation module to capture global information among relational triples. GAIN-GloVe/GAIN-BERT: Zeng et al. [20] proposed GAIN, which designs a mention graph and an entity graph to predict target relations, using GloVe or BERT for word embeddings and GCNs for graph representations.

Table 2. Performance on DocRED. Models above the first double line do not use pre-trained models. Results with * are reported in their original papers. Ign F1 refers to excluding the relational facts shared by the training and dev/test sets.
Table 3. Performance on the datasets DialogRE and MPDD.

3.3 Main Results

We compare our CorefDRE model with the baselines on the DocRED dataset; the results are shown in Table 2. We use F1 and Ign F1 as evaluation metrics. Compared with the GloVe-based models, CorefDRE outperforms strong baselines by 1.7–2.0 F1 on the development and test sets. Compared with the BERT-base models, CorefDRE outperforms strong baselines by 1.6–1.9 F1. These results suggest that MPAG can capture cross-sentence interactions for better Doc-level RE. Although we only conduct experiments on DocRED, DialogRE, and MPDD (Table 3), our model should fit other datasets as well, since pronouns are essential to the grammar and syntax of natural language.

Table 4. Performance of CorefDRE with different embeddings and submodules.

3.4 Ablation Study

To verify the effectiveness of the different modules in CorefDRE, we further analyze our model; the results of the ablation study are shown in Table 4.

First, we remove the noise suppression mechanism: we set the weight of each mention-pronoun edge directly to 1 and merge all pronoun nodes with their corresponding mention nodes when generating EG. Without the weights between pronoun and mention nodes, the performance of CorefDRE-GloVe/CorefDRE-BERT\(_{base}\) drops sharply, by 1.39 F1 on the development set. This drop shows that the affinity between pronoun and mention nodes plays a vital role in suppressing the noise caused by unsuitable mention-pronoun pairs.

Next, we remove the pronoun nodes; specifically, we convert MPAG into the MG proposed by Zeng et al. [20]. Without pronoun nodes, the result drops by an average of 1.88 F1 on the development set. This suggests that pronoun nodes capture richer information than mention nodes and the document alone can capture effectively.

3.5 Case Study

Fig. 3. Case study on our CorefDRE model and the baseline model. The models take the document as input and predict relations among the entities shown in different colors. Graph Inference shows the reasoning process on the graphs; NA stands for no relation.

Figure 3 illustrates a case study comparing CorefDRE with the baseline. As shown, GAIN cannot predict the relations of the entity pairs (Conrad Johnson, Wiley College) and (Samuel C. Brightman, World War II), while CorefDRE predicts that the relation between Conrad Johnson and Wiley College is educated at and that the relation between Samuel C. Brightman and World War II is conflict, because the pronoun nodes he and He connect the entity pairs (Conrad Johnson, Wiley College) and (Samuel C. Brightman, World War II), respectively. We observe that relation extraction among these entities requires pronouns to connect them across sentences, which indicates that the information introduced by pronouns is beneficial to relation extraction.

4 Related Work

Relation Extraction aims to extract relational facts from a given text. Early research mainly focused on predicting the relation between two entities within a sentence [4, 22]. These approaches include sequence-based, graph-based, and pre-training methods, which tackle sentence-level RE effectively on datasets with a fixed number of relation and entity types. However, large numbers of real-world relational facts can only be extracted from multiple sentences.

Doc-Level Relation Extraction. Researchers have extended sentence-level RE to Doc-level RE [12, 21] along two lines. The first is the sequence-based approach, which uses a pre-trained model to obtain a contextual representation of each word in a document and directly derives the relations between entities [7, 10, 16]. These methods adopt transformers to model long-distance dependencies implicitly, obtain entity embeddings, and feed them into a classifier to get relation labels. However, sequence-based methods cannot capture enough entity interactions when the document length exceeds what the encoder can process at a time. To model these interactions, graph-based methods construct graphs from documents, which can model entity structure more intuitively [11, 20, 23]. These methods use LSTM or BERT to encode the input documents and obtain entity representations, utilize GCNs to update the representations, and finally feed them into a classifier to get relation labels.

Coreference Dependency Relation Reasoning. Some previous efforts on Doc-level RE introduce coreference dependencies for multi-hop inference. Previous works [10, 11, 23] have shown that graph-based coreference resolution is clearly beneficial for constructing dependencies among mentions for relation reasoning. [19] proposed intra- and inter-sentential reasoning based on R-GCN to model multiple paths covering all cases of logical reasoning chains in the graph. [14] introduced a reconstructor to rebuild graph reasoning paths to guide relation inference through multiple reasoning skills, including coreference and entity bridging. However, none of the above methods directly models the influence of pronouns on relation extraction and reasoning. Our CorefDRE model addresses this problem by introducing a novel heterogeneous graph with mention-pronoun coreference resolution and a noise suppression mechanism.

5 Conclusion and Future Work

This paper proposed CorefDRE, which features two novel techniques: a coref-aware heterogeneous graph, MPAG, and a noise suppression mechanism. With them, the model can extract document-level entity pair relations more effectively thanks to the richer semantic information carried by pronouns. Experiments demonstrated that CorefDRE outperforms most previous models significantly and is orthogonal to pre-trained language models. However, some problems remain: the noise introduced by pronouns still limits the performance of our model. In the future, we will explore other methods to construct less noisy mention-pronoun pairs to optimize CorefDRE.