
1 Introduction

Biomedical relation extraction is an important natural language processing task that aims to quickly and accurately detect the relations between medical entities in the vast amount of medical information on the Internet. It plays an important role in clinical diagnosis [1], intelligent medical question answering [2], and medical knowledge graph construction [3]. This research can provide technical support for medical institutions and pharmaceutical companies and offers great benefits for public health. At present, several knowledge bases of entities and relations exist, but many biomedical relations are expressed across sentences within documents, which poses challenges for relation extraction research.

With the rise of neural networks, deep learning models have been widely used in medical relation extraction tasks. Existing methods fall mainly into two categories: semantic-based models and dependency-based models. Semantic-based models, such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), can effectively capture context information by encoding text sequences. Ekbal et al. [4] used a CNN model to classify relations with features extracted by the convolution kernels and max-pooling layer. CNN performs well as a feature extractor, but it is better suited to capturing local features. To better capture long-distance information and reflect the importance of different pieces of information, the attention mechanism [5] has attracted researchers' attention. Zhou et al. [6] proposed an attention-based Bi-directional Long Short-Term Memory (Bi-LSTM) framework that automatically focuses on the words that have a decisive effect on classification and captures the important semantic information in a sentence. The attention-based Bi-LSTM model has since become an important method for natural language processing tasks.

To fully mine the deep information in sentences, syntactic dependency structure has been applied to the relation extraction task. Guo et al. [7] fused an attention mechanism based on the shortest dependency path with CNN and RNN to obtain keywords and sentence features; Zhang et al. [8] used a Graph Convolutional Network to extract relations based on the Lowest Common Ancestor (LCA) rule of entities. Miwa and Bansal [9] encoded the Shortest Dependency Path (SDP) between two entities with a Tree LSTM. Peng et al. [10] divided the input graph into two Directed Acyclic Graphs (DAGs), and Song et al. [11] proposed the Graph Recurrent Network (GRN) to capture the semantic structure. In addition to constructing dependency graphs with parsers, researchers have also proposed methods to construct dependency graphs automatically. Jin et al. [12] proposed a full dependency forest model that constructs a weighted graph adapted to the end task; Guo et al. [13] proposed a "soft pruning" strategy in which an Attention-Guided Graph Convolutional Network learns a better graph representation. Besides, Jin et al. [14] proposed generating dependency forests consisting of the semantic-embedded 1-best dependency tree. Qian et al. [15] proposed an auto-learning convolution-based graph convolutional network that performs convolution over dependency forests, and Tang et al. [16] devised a cross-domain pruning method to equalize local and nonlocal syntactic interactions. In the general domain, Chen et al. [17] proposed exploiting the sequential form of POS tags to naturally fill the gap between the original sentence and an imperfect parse tree, and Zhang et al. [18] proposed a dual attention graph convolutional network with a parallel structure to establish bidirectional information flow.

Based on the above ideas, we propose a novel end-to-end model called Multi-head Attention and Graph Convolutional Networks with R-Drop (RD-MAGCN) for N-ary document-level relation extraction, which combines semantic information with syntactic dependency information. First, we let the input representation interact with the relation representation in the Multi-head Attention Layer to obtain a weighted contextual semantic representation of the text. To make full use of syntactic dependency information for cross-sentence extraction, we construct document-level syntactic dependency trees and encode them with a GCN to address the long-distance dependency problem. The two representations are then concatenated and fed into the decoder. Finally, the network is enhanced with the R-Drop mechanism and the biomedical relation is extracted.

The major contributions of this paper are summarized as follows.

  • We propose a novel end-to-end model (RD-MAGCN) that effectively combines context semantic information and syntactic information.

  • We introduce a regularization method for the randomness of dropout, which can enhance the performance of the network.

  • We evaluate the performance of our model; the experimental results show that it outperforms previous models.

2 Method

In this section, we introduce our proposed method. The input of our model is a long document containing relations between medical entities, and the output is a relation type. There are four steps in our method: (1) preprocessing the corpus, including instance construction and extraction of other information; (2) constructing the document-level syntactic dependency tree; (3) building the Multi-head Attention and Graph Convolutional Networks for relation extraction; (4) utilizing the R-Drop mechanism to enhance the network.

2.1 Data Preprocessing

For the texts in the corpus, we carry out a series of preprocessing steps. We first use the Stanford CoreNLP toolkit to parse each document in the corpus to obtain the syntactic parsing result and POS tag for each word. Then we construct an instance for each pair of entities marked in the dataset; each instance contains the tokens of the text, the directed dependency edges of each word, the POS tags of each word, the absolute position of each entity, and the relation type used as the label.

POS tags and entity positions are used in the Input Representation Layer to enrich the text information, and the syntactic dependency information is used to build dependency trees, which are encoded with the GCN to capture long-range dependency information in the text.
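As a hedged illustration of this preprocessing step, the sketch below builds such an instance with the Stanza library (the paper itself uses the Stanford CoreNLP toolkit); the function and field names are our own.

```python
import stanza

# The paper parses with the Stanford CoreNLP toolkit; here we use Stanza
# (the Python library from the same group) purely to illustrate the kind of
# per-entity-pair record described above.
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse", verbose=False)

def build_instance(text, entity_positions, relation_label):
    doc = nlp(text)
    tokens, pos_tags, dep_edges = [], [], []
    offset = 0  # token offset of the current sentence within the document
    for sent in doc.sentences:
        for word in sent.words:
            tokens.append(word.text)
            pos_tags.append(word.upos)
            dep = offset + word.id - 1                               # 0-based dependent index
            head = offset + word.head - 1 if word.head > 0 else -1   # -1 marks a sentence root
            dep_edges.append((head, dep, word.deprel))
        offset += len(sent.words)
    return {
        "tokens": tokens,                      # tokens of the text
        "pos": pos_tags,                       # POS tag of each word
        "dep_edges": dep_edges,                # directed dependency edge of each word
        "entity_positions": entity_positions,  # absolute position of each entity
        "label": relation_label,               # relation type used as the label
    }
```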

2.2 Dependency Tree Construction

Syntactic analysis [19] is one of the important techniques in natural language processing; it is used to determine the dependencies between words in a sentence. The dependency tree is one form of syntactic analysis, which mainly expresses the dependency relations between words. To capture the syntactic dependency features of documents, we introduce a document-level dependency tree, in which nodes represent words and edges represent intra-sentence and inter-sentence lexical dependency relations. As shown in Fig. 1, we use the following three types of edges between nodes to construct the dependency trees (a construction sketch follows the list):

  1. Syntactic dependency edges: the results of parsing the text with the Stanford CoreNLP toolkit. They denote the dependencies between the words in a sentence.

  2. Adjacent sentence edges: we connect the dependency roots of two adjacent sentences using adjacent sentence edges, labeled "next". Through adjacent sentence edges, the entire document forms a connected graph.

  3. Self-node edges: each node in the dependency tree has a self-node edge, which allows the model to learn information about the node itself during training.
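The sketch below assembles these three edge types into a single edge list; the function and argument names are illustrative, not the original implementation.

```python
def build_document_graph(num_tokens, dep_edges, sentence_roots):
    """Assemble the three edge types of the document-level dependency tree.

    dep_edges:      (head, dependent) pairs from the parser, 0-based, head == -1 for roots
    sentence_roots: 0-based token index of each sentence's dependency root
    """
    edges = []
    # 1. Syntactic dependency edges within each sentence
    for head, dep in dep_edges:
        if head >= 0:
            edges.append((head, dep, "dep"))
    # 2. Adjacent sentence edges: connect the roots of neighbouring sentences ("next")
    for prev_root, next_root in zip(sentence_roots, sentence_roots[1:]):
        edges.append((prev_root, next_root, "next"))
    # 3. Self-node edges: every node is connected to itself
    for i in range(num_tokens):
        edges.append((i, i, "self"))
    return edges
```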

Fig. 1. An example of a constructed dependency tree consisting of the three types of edges.

2.3 Model Structure

In this paper, we propose a novel model, Multi-head Attention and Graph Convolutional Networks with R-Drop (RD-MAGCN), for N-ary document-level relation extraction. As shown in Fig. 2, the overall framework of our model consists of five parts: the Input Representation Layer, Bi-LSTM Layer, Multi-head Attention Layer, GCN Layer, and Output Layer. In addition, we utilize the R-Drop mechanism to enhance the network and further improve performance. The following sections describe the details of our model.

Input Representation Layer.

For the specific domain of biomedical research, we introduce the Bio-BERT pre-trained language model [20] as the text representation encoder for the Multi-head Attention Layer. However, since BERT and the pre-trained models improved upon it use WordPiece tokenization, whereas our dependency tree uses whole words as nodes, we choose the ELMo pre-trained model [21] to represent the input of the GCN Layer. Furthermore, we enrich the text representation with additional information, enabling the model to mine deeper semantics: POS tags strengthen the lexical features of the text, and position embeddings allow the model to locate the entities and better learn the context near them. Therefore, our Input Representation Layer is divided into two parts. The embedding for the Multi-head Attention Layer is the concatenation of the Bio-BERT embedding, POS embedding, and position embedding:

$$ w_{1} = \left[ {w_{Bio - BERT} ;w_{POS} ;w_{position} } \right] $$
(1)

The embedding for the GCN Layer is the concatenation of the ELMo embedding, POS embedding, and position embedding:

$$ w_{2} = \left[ {w_{ELMO} ;w_{POS} ;w_{position} } \right] $$
(2)
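As a minimal sketch (placeholder tensors; dimensions taken from Sect. 3.2), the two concatenations can be written as:

```python
import torch

# Illustrative only: Bio-BERT 768, ELMo 1024, POS and position embeddings 100 each.
# The tensors below are placeholders standing in for the outputs of the encoders.
seq_len = 128
w_biobert  = torch.randn(seq_len, 768)
w_elmo     = torch.randn(seq_len, 1024)
w_pos      = torch.randn(seq_len, 100)    # POS-tag embedding
w_position = torch.randn(seq_len, 100)    # entity-position embedding

w1 = torch.cat([w_biobert, w_pos, w_position], dim=-1)  # Eq. (1), attention branch
w2 = torch.cat([w_elmo, w_pos, w_position], dim=-1)     # Eq. (2), GCN branch
```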
Fig. 2. Overview of our model.

Bi-LSTM Layer.

RNNs are widely used in NLP because they can capture information from the preceding text in a sentence, and the LSTM uses a gating mechanism [22] to alleviate the vanishing gradient, exploding gradient, and long-distance dependency problems of the vanilla RNN. LSTM is therefore well suited to document-level tasks. In this paper, we use two LSTMs, forward and backward, to encode the two input representations and obtain representations that contain both the preceding and the following context. We denote the hidden state of the forward LSTM as \(h_{t}^{f} \) and that of the backward LSTM as \(h_{t}^{b} \); the final hidden state is their concatenation:

$$ h_{t} = \left[ {h_{t}^{f} ;h_{t}^{b} } \right] $$
(3)
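A minimal PyTorch sketch of this encoder, with the hidden size taken from Sect. 3.2 and the input size of the attention branch assumed, is:

```python
import torch
import torch.nn as nn

# Minimal sketch of the Bi-LSTM encoder. The hidden size of 500 follows Sect. 3.2;
# the input size here matches w1 from the attention branch, and the GCN branch
# would use an analogous encoder over w2.
bilstm = nn.LSTM(input_size=968, hidden_size=500,
                 batch_first=True, bidirectional=True)

x = torch.randn(4, 128, 968)   # (batch, seq_len, embedding dim)
h, _ = bilstm(x)               # forward and backward hidden states, concatenated
print(h.shape)                 # torch.Size([4, 128, 1000]), matching Eq. (3)
```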

Multi-head Attention Layer.

The attention mechanism has become increasingly important in NLP. It assigns a weight distribution over the input, which enables the model to learn more valuable information and improves relation extraction performance. Following Li et al. [23], we build a Multi-head Attention Layer that interacts with relation representations. Based on the idea of TransE [24], we take the relation representation to be the difference between the entity representations:

$$ w_{relation} = w_{tail} - w_{head} $$
(4)

When there are only two entities, the relation representation is the tail entity representation minus the head entity representation. When there are three entities, such as drug, gene, and mutation, we use the third entity representation (mutation) minus the first entity representation (drug) as the relation representation.

We then use normalized Scaled Dot-Product Attention to compute a weighted score for the interaction of text with relation representations:

$$ Attention\left( {Q,K,V} \right) = softmax\left( {\frac{{QK^{T} }}{\sqrt d }} \right)V $$
(5)

where \(Q\) indicates the query, taken from the output of the Bi-LSTM Layer and representing the text sequence. \(K\) and \(V\) indicate the key and value, taken from the relation representation. \(d\) is the dimension of the vectors and \(\sqrt d\) is the scaling factor. Introducing the relation representation allows the model to assign higher weights to text representations that are closer to the relation representation, which is helpful for relation extraction.

Finally, the results of the \(n\) heads are concatenated:

$$ h = \left[ {h_{1} ;h_{2} ; \ldots ;h_{n} } \right] $$
(6)

where \(n\) is set to 5 in our experiments. Multiple heads allow the model to learn relevant information from different representation subspaces. Finally, we perform max pooling to get the output \(h_{att}\) of Multi-head Attention Layer.
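To make the computation concrete, the sketch below follows one plausible reading of Eqs. (4)-(6): the Bi-LSTM output serves as the query and the TransE-style relation vector as the key, with the attention weights normalised over the tokens. The exact tensor shapes and the source of the entity representations are not specified above, so the function name, arguments, and this reading are our own assumptions.

```python
import torch
import torch.nn.functional as F

def relation_aware_attention(text_h, ent_reprs, n_heads=5):
    """Hedged sketch of the Multi-head Attention Layer.

    text_h:    (batch, seq_len, d) output of the Bi-LSTM Layer (the query Q)
    ent_reprs: (batch, n_ent, d)   entity representations in document order
    """
    # Eq. (4): relation representation = last entity minus first entity (TransE-style)
    w_rel = ent_reprs[:, -1, :] - ent_reprs[:, 0, :]                  # (batch, d)

    batch, seq_len, d = text_h.shape
    d_head = d // n_heads
    q = text_h.view(batch, seq_len, n_heads, d_head).transpose(1, 2)  # (b, h, len, d_head)
    k = w_rel.view(batch, 1, n_heads, d_head).transpose(1, 2)         # (b, h, 1, d_head)

    # Eq. (5): scaled dot-product scores between each token and the relation vector,
    # normalised over the token dimension so relation-like tokens get higher weight
    scores = torch.matmul(q, k.transpose(-2, -1)).squeeze(-1) / (d_head ** 0.5)
    weights = F.softmax(scores, dim=-1).unsqueeze(-1)                 # (b, h, len, 1)
    h = (weights * q).transpose(1, 2).reshape(batch, seq_len, d)      # Eq. (6): concat heads
    return h.max(dim=1).values                                        # max pooling -> h_att
```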

GCN Layer.

The Graph Convolutional Network (GCN) [25] is a natural extension of ConvNets to graph structures; just as ConvNets extract spatial structure features from images, a GCN applied to the syntactic dependency tree can extract the syntactic structure features of the text and alleviate the problem of long distances between entities in document-level relation extraction.

In this paper, we convert the constructed document dependency tree into an adjacency matrix \(A\), where \(A_{i,j} = 1\) indicates that there is a dependency edge between word \(i\) and word \(j\). Following Zhang et al. [8], we set the adjacency matrix to be symmetric, i.e. \(A_{i,j} = A_{j,i}\), and then add self-node edges for each node, i.e. \(A_{i,i} = 1\), to retain information about the node itself. Furthermore, we normalize the values in the graph convolution to account for the large variation in node degrees in the dependency tree before applying the activation function. Finally, the graph convolution operation for node \(i\) at the \(l\)-th layer with the adjacency matrix of the dependency graph can be defined as follows:

$$ h_{i}^{{\left( l \right)}} = \rho (\frac{{\sum\nolimits_{{j = 1}}^{n} {A_{{ij}} } W^{{\left( l \right)}} h_{j}^{{\left( {l - 1} \right)}} }}{{d_{i} }} + b^{{\left( l \right)}} ) $$
(7)

where \(h_{i}^{{\left( {l - 1} \right)}}\) and \(h_{i}^{\left( l \right)}\) denote the input and output of node \(i\) at the \(l\)-th layer. The inputs of the GCN Layer are the outputs of the Bi-LSTM Layer \(h_{1}^{\left( 0 \right)} , \ldots ,h_{n}^{\left( 0 \right)}\), and the outputs \(h_{1}^{\left( L \right)} , \ldots ,h_{n}^{\left( L \right)}\) are obtained through the graph convolution operation. \(W^{\left( l \right)}\) is the weight matrix, \(b^{\left( l \right)}\) is the bias vector, \(d_{i} = \mathop \sum \limits_{j = 1}^{n} A_{ij}\) is the degree of node \(i\) in the dependency tree, used for normalization, and \(\rho\) is the activation function.

Following Lee et al. [26], we also extract representations of entity nodes and concatenate them with representations of documents to highlight the role of entity nouns in the text structure and improve the performance of relation extraction. Similarly, we perform max pooling to get the output \(h_{GCN}\) of GCN Layer.
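A minimal sketch of the graph convolution in Eq. (7), together with the adjacency construction described above (symmetric edges plus self-loops), might look as follows; the class and function names are ours and the original implementation may differ.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution layer implementing Eq. (7)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)   # holds W^(l) and b^(l)
        self.act = nn.ReLU()                # rho

    def forward(self, h, adj):
        # adj: (batch, n, n) symmetric adjacency with self-loops (A[i, i] = 1)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)      # d_i, node degrees
        return self.act(torch.bmm(adj, self.linear(h)) / deg)  # degree-normalised sum

def dependency_adjacency(edges, n):
    """Build the symmetric adjacency matrix with self-node edges."""
    adj = torch.zeros(n, n)
    for head, dep, _ in edges:
        adj[head, dep] = adj[dep, head] = 1.0   # A_{i,j} = A_{j,i}
    adj += torch.eye(n)                          # A_{i,i} = 1
    return adj.clamp(max=1.0)
```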

Output Layer.

In this paper, our Output Layer is a two-layer perceptron. We concatenate the outputs of the two main modules to obtain \(h_{final}\) and then compute as follows:

$$ h_{final} = \left[ {h_{GCN} ;h_{att} } \right] $$
(8)
$$ h_{1} = ReLU\left( {W_{{h_{1} }} h_{final} + b_{{h_{1} }} } \right) $$
(9)
$$ h_{2} = ReLU\left( {W_{{h_{2} }} h_{1} + b_{{h_{2} }} } \right) $$
(10)

Finally, we apply the Softmax function to \(h_{2}\) to determine the relation category:

$$ o = Softmax\left( {W_{o} h_{2} + b_{o} } \right) $$
(11)
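Under the same assumptions as the earlier sketches, the decoder of Eqs. (8)-(11) can be written as a two-layer perceptron followed by Softmax; the hidden size is left as a parameter since it is not stated above.

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """Sketch of the two-layer perceptron decoder, Eqs. (8)-(11)."""
    def __init__(self, dim_att, dim_gcn, hidden, n_classes):
        super().__init__()
        self.fc1 = nn.Linear(dim_att + dim_gcn, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, h_att, h_gcn):
        h_final = torch.cat([h_gcn, h_att], dim=-1)    # Eq. (8)
        h1 = torch.relu(self.fc1(h_final))             # Eq. (9)
        h2 = torch.relu(self.fc2(h1))                  # Eq. (10)
        return torch.softmax(self.out(h2), dim=-1)     # Eq. (11)
```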

2.4 R-Drop Mechanism

The dropout technique [27] achieves implicit ensembling by randomly deactivating some neurons during neural network training. Liang et al. [28] introduced a simple regularization technique built on dropout, named R-Drop. R-Drop operates on the outputs of sub-models sampled by dropout: in each mini-batch, each data sample undergoes two forward passes, yielding two sub-models. R-Drop forces the two output distributions of the same sample to be consistent by minimizing the bidirectional Kullback-Leibler (KL) divergence between them, and the two sub-models then jointly predict, achieving a model-ensembling effect. Results on multiple datasets show that R-Drop achieves good performance.

In this paper, we use the R-Drop mechanism to enhance our model. For the same batch of data, we pass it through the model twice to obtain two distributions, denoted \(P_{1} (y_{i} |x_{i} )\) and \(P_{2} (y_{i} |x_{i} )\). For each sub-model, we use cross-entropy as the loss function, and bidirectional KL divergence is then used to regularize the predictions of the two sub-models. Finally, the two terms are merged into the final training loss:

$$ L_{CE} = - \log P_{1} \left( {y_{i} |x_{i} } \right) - \log P_{2} \left( {y_{i} |x_{i} } \right) $$
(12)
$$ L_{KL} = \frac{1}{2}\left( {D_{KL} \left( {P_{1} \| P_{2} } \right) + D_{KL} \left( {P_{2} \| P_{1} } \right)} \right) $$
(13)
$$ L = L_{CE} + \alpha L_{KL} $$
(14)

where \(\alpha\) is the weight coefficient, which we set to 0.5 in the experiments. In this way, R-Drop further regularizes the model space and improves the generalization ability of the model. This regularization method can be universally applied to different model structures, as long as there is randomness that can produce different outputs.
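A hedged sketch of this training objective, assuming a classifier that returns log-probabilities, is given below; it mirrors Eqs. (12)-(14) but is not the authors' code.

```python
import torch.nn.functional as F

def rdrop_loss(model, x, labels, alpha=0.5):
    """R-Drop objective: two forward passes, cross-entropy plus symmetric KL."""
    logp1 = model(x)   # first forward pass (one dropout sample)
    logp2 = model(x)   # second forward pass (a different dropout sample)
    # Eq. (12): cross-entropy of both sub-models
    l_ce = F.nll_loss(logp1, labels) + F.nll_loss(logp2, labels)
    # Eq. (13): bidirectional KL divergence between the two output distributions
    l_kl = 0.5 * (F.kl_div(logp1, logp2, log_target=True, reduction="batchmean")
                  + F.kl_div(logp2, logp1, log_target=True, reduction="batchmean"))
    # Eq. (14): final training loss, alpha = 0.5 in our experiments
    return l_ce + alpha * l_kl
```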

3 Experiments

3.1 Dataset

In this paper, we validate our method on the dataset introduced by Peng et al. [10], which contains 6987 drug-gene-mutation ternary relation instances and 6087 drug-mutation binary relation instances extracted from PubMed. The data were obtained via cross-sentence N-ary relation extraction, which extracts the tuples from the biomedical literature. Table 1 shows the statistics of the data. Most instances are documents containing multiple sentences. There are five relation types used as labels: "resistance or nonresponse", "sensitivity", "response", "resistance", and "None". Following Peng et al., we perform relation extraction for all instances under both binary classification and multi-class classification, and obtain results using five-fold cross-validation. In the binary-classification setting, we treat all relation types as positive examples and "None" labels as negative examples.
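As a small illustration of this label scheme (the names below are ours):

```python
# The five relation labels of the dataset; in the binary-class setting,
# "None" is the negative class and every other relation is positive.
RELATIONS = ["resistance or nonresponse", "sensitivity", "response", "resistance", "None"]

def to_binary(label):
    return 0 if label == "None" else 1
```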

Table 1. The statistics of the instances in the training set.

3.2 Parameter Settings

This section describes the details of our experimental setup. We tune the hyperparameters based on the results on the validation set; the final settings are as follows: the dimension of the Bio-BERT pre-trained language model is 768, the dimension of the ELMo pre-trained language model is 1024, and the dimensions of the POS embedding (obtained with Stanford CoreNLP) and the position embedding are both 100. The hidden dimensions of the Bi-LSTM layer and GCN layer are both 500, the number of heads in the Multi-head Attention layer is 5, and its dimension is 1000. All dropout rates in the model are set to 0.5. We train the model with a batch size of 16 using the Adam optimizer [29] with a learning rate \(lr = 1e - 4\).
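For convenience, the settings above can be gathered into a single configuration; the key names below are our own shorthand.

```python
# Hyperparameters from Sect. 3.2, collected as a plain configuration dictionary.
config = {
    "biobert_dim": 768,
    "elmo_dim": 1024,
    "pos_dim": 100,
    "position_dim": 100,
    "bilstm_hidden": 500,
    "gcn_hidden": 500,
    "attention_heads": 5,
    "attention_dim": 1000,
    "dropout": 0.5,
    "batch_size": 16,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "rdrop_alpha": 0.5,
}
```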

We evaluate our method using the same evaluation metric as previous research, namely the average accuracy over the five cross-validation folds.

3.3 Baselines

To verify the effectiveness of our model, we compare it with the following baseline models:

  1. Feature-Based (Quirk and Poon, 2017) [3]: a model based on the shortest dependency path between all entity pairs;

  2. DAG LSTM (Peng et al., 2017) [10]: contains linear chains and the graph structure of the Tree LSTM;

  3. GRN (Song et al., 2018) [11]: a model for encoding graphs using Recurrent Neural Networks;

  4. GCN (Zhang et al., 2018) [8]: a model for encoding pruned trees using Graph Convolutional Networks;

  5. AGGCN (Guo et al., 2019) [13]: a model that uses an attention mechanism to build dependency forests and encodes them with GCN;

  6. LF-GCN (Guo et al., 2020) [30]: a model for automatic induction of dependency structures using a variant of the matrix-tree theorem;

  7. AC-GCN (Qian et al., 2021) [15]: a model that learns weighted graphs using a 2D convolutional network;

  8. SE-GCN (Tang et al., 2022) [16]: a model that uses a cross-domain pruning method to equalize local and nonlocal syntactic interactions;

  9. CP-GCN (Jin et al., 2022) [14]: a model that uses dependency forests consisting of the semantic-embedded 1-best dependency tree and adopts a task-specific causal explainer to prune the dependency forests;

  10. DAGCN (Zhang et al., 2023) [18]: a model that uses a dual attention graph convolutional network with a parallel structure to establish bidirectional information flow.

3.4 Main Results

In the experiments, we report the test accuracies of ternary relation instances and binary relation instances under the binary-class and multi-class settings. In the binary-class experiments, the intra-sentence and inter-sentence cases are reported separately. The results are shown in Table 2.

As can be seen from Table 2, the performance of neural network-based methods is significantly better than that of feature-based methods. Thanks to the powerful encoding ability of GCN for graphs, GCN-based methods generally outperform RNN-based methods. Except for the sentence-level ternary setting in the binary-class task, our model RD-MAGCN achieves state-of-the-art performance.

We first focus on the multi-class relation extraction task. On the ternary relation task, RD-MAGCN achieves an average accuracy of 90.2%, surpassing the previous state-of-the-art method CP-GCN by 5.3%. On the binary relation task, RD-MAGCN achieves an average accuracy of 90.3%, surpassing AC-GCN by 9.3%. This is a substantial improvement, mainly because the R-Drop mechanism brings a larger gain in multi-class tasks.

Table 2. Comparison with related work.

For the binary-class relation extraction task, although RD-MAGCN does not show as large an improvement as in the multi-class task, it still exceeds CP-GCN in almost all settings. These results show that combining contextual semantic features with syntactic structure features and enhancing the model with a regularization method is effective. Next, we present the ablation studies for each module of the method.

3.5 Ablation Study

In this section, we verify the effectiveness of each module in our method. First, we investigate the roles of the three main modules: R-Drop, the Multi-head Attention Layer, and the GCN Layer. We define the following variants of RD-MAGCN:

  • w/o R-Drop: this variant uses the traditional single-model cross-entropy loss function instead of ensembling the two sub-models during training.

  • w/o Attention: this variant denotes removing Multi-head Attention Layer and the corresponding inputs from the model.

  • w/o GCN: this variant denotes removing GCN Layer and the corresponding inputs from the model.

  • w/o \({\varvec{w}}_{{{\varvec{relation}}}}\): this variant applies Self-Attention instead of introducing the relation representation in the Multi-head Attention Layer, i.e., Q, K, and V all come from the text representation.

Table 3 shows the results of the comparison of RD-MAGCN with four variants.

It can be seen from Table 3 that: (1) The R-Drop regularization method clearly enhances the model, especially in the multi-class relation extraction task; removing it costs 4.9% and 6.5% accuracy on the multi-class ternary and binary tasks, respectively. We speculate that in the binary classification task, which is relatively easy, the KL-divergence constraint makes the output distributions of the two sub-models roughly the same, so the ensemble effect is limited, whereas in multi-class tasks the ensemble effect is stronger. (2) Removing the Multi-head Attention Layer degrades performance on every task, indicating the usefulness of interactive contextual semantic information. (3) Removing the GCN Layer also degrades performance across tasks, indicating the usefulness of syntactic structure information; moreover, inter-sentence relation extraction drops more than intra-sentence extraction, indicating that the GCN captures long-distance structural features. (4) Not introducing relation representations in the Multi-head Attention Layer degrades the results, indicating that interacting with relation representations lets the model pay more attention to the text that is closer to the relation.

Table 3. The effect of the main modules of RD-MAGCN.

Next, we discuss the effects of the different input representations. We utilize the same model for the following types of inputs:

Table 4 shows the comparative performance of different input representations.

Original:

The inputs to our proposed model. The input of Multi-head Attention module is the concatenation of Bio-BERT, POS and position embedding, and the input of the GCN module is the concatenation of ELMo, POS and position embedding.

Variant 1:

The input of the Multi-head Attention module is the concatenation of BERT, POS embedding, and position embedding, and the input of the GCN module is the concatenation of ELMo, POS embedding, and position embedding.

Variant 2:

The input of the Multi-head Attention module is the concatenation of Bio-BERT and POS embedding, and the input of the GCN module is the concatenation of ELMo and POS embedding.

Variant 3:

The input of the Multi-head Attention module is the concatenation of Bio-BERT and position embedding, and the input of the GCN module is the concatenation of ELMo and position embedding.

Variant 4:

The input of Multi-head Attention module is Bio-BERT, and the input of the GCN module is ELMo.

Table 4. The effect of the input representation on performance.

From Table 4, we can see that Bio-BERT, a domain-specific language representation model pre-trained on large biomedical corpora, outperforms the original BERT on the biomedical relation extraction task; Bio-BERT enables a better understanding of complex biomedical literature. Besides, POS embedding and position embedding provide additional information that helps the model better learn the semantics and structure of the text and locate the entities that appear in it.

4 Conclusions

In this paper, we propose a novel end-to-end neural network named RD-MAGCN for N-ary document-level relation extraction. We extract weighted contextual features of the corpus via a Multi-head Attention Layer that interacts with relation representations, and we extract syntactic structure features through the document-level dependency tree and the GCN Layer; combining the two types of features makes the model more comprehensive. In addition, we ensemble the two sub-models trained with the R-Drop regularization method and let them jointly predict the relation type, which effectively enhances the performance of the model. Finally, we evaluate the model on multiple tasks over the medical dataset extracted from PubMed, where RD-MAGCN achieves better results than previous methods.

Our research improves the accuracy of biomedical relation extraction, which is helpful for other tasks in the medical field and the development of intelligent medicine. In future research, we will focus on applying more comprehensive techniques such as introducing medical knowledge graphs to study biomedical relation extraction more deeply.