1 Background

The medical literature contains a large quantity of valuable inter-entity relations. Common medical relation extraction tasks currently include protein–protein interactions (PPIs) [1], drug–drug interactions (DDIs) [2], chemical–protein interactions (CPIs) [3], and chemical–disease interactions (CDIs) [4]. This information has important research value and plays a key role in advancing the medical field. Several public datasets are available for medical relation extraction, such as AIMed [5], DDIExtraction 2013 [6], and BioCreative V-CDR [4]. All of these data come from MEDLINE, the largest medical literature database in the world. In contrast to these English datasets, comparable public datasets are scarce in China, so Chinese medical text mining has developed relatively late.

With the rapid development of the Chinese medical field, Chinese medical literature has grown at an explosive rate. Most of it is included in the China National Knowledge Infrastructure (CNKI), which provides a wealth of literature resources for the development of medical science in China. According to CNKI statistics, a large number of medical articles are published every year, and a wealth of medical knowledge has accumulated in this literature. However, few researchers have studied relation extraction in Chinese medical literature, and it remains an issue worthy of attention.

Early medical relation extraction relied mainly on template matching, which requires analyzing text features, manually summarizing grammar rules, and then matching relations in new text. This method places high demands on the annotator's linguistic and medical knowledge. Blaschke et al. [7] built more than ten templates from syntactic, contextual, and other information to identify gene and protein entities in medical literature and judge the relations between them. Corney et al. [8] used custom templates to replace the original templates in GATE and implemented a rule-based information extraction system. Because datasets differ in syntax and language expression, template matching tends to yield low recall, and its generalization ability is therefore very limited.

As medical relation extraction became a focus of research, many institutions conducted relevant evaluations, prompting the application of many machine learning methods to this task. Early machine learning work focused on feature-vector methods, mapping text features into high-dimensional vectors used as the final inputs to classifiers. Alam et al. [9] manually extracted a variety of linguistic features, such as lexical, phrase, and dependency-syntax information, to extract relations from chemical–disease relation data. Kim et al. [10] also extracted rich features and fed them into a linear support vector machine (SVM) for the drug–drug interaction extraction task. Machine learning methods generally have better generalization and portability than template matching. However, the quality of feature selection often determines relation extraction performance, and extracting features requires external resources or tools, such as part-of-speech taggers and dependency parsers. Errors produced by these external tools can cascade and degrade relation extraction performance.

Moreover, machine learning methods rely on manual construction of features, which is laborious and time-consuming. With advances in deep learning and word representation learning, many researchers have applied deep learning methods to medical relation extraction. Peng et al. [11] used a multi-channel dependency-based convolutional neural network (McDepCNN), which achieved the best performance on the PPI dataset in 2017. Zhang et al. [12] integrated the sentence sequence and the shortest dependency path with a hierarchical recurrent neural network (RNN) for the DDI extraction task. Zhang et al. [13] combined CNN and RNN in a hybrid deep neural model for biomedical relation extraction. With stronger data representations, better performance, and less feature engineering, deep learning has become the most popular approach to relation extraction.

Bidirectional long short-term memory (Bi-LSTM) and CNN are the most widely employed neural network structures in medical relation extraction. CNN [14] reduces the number of trainable parameters through parameter sharing and local connections, making it well suited to short-text classification. Bi-LSTM [15] is designed to handle long-distance features and capture that information. However, different parts of a sentence contribute differently to classification, so focusing on the more important information helps improve results. Thus, some studies have added an attention mechanism to select the more important information in the sentence [16]. However, these methods learn only a single sentence vector representation, which cannot capture more complex information in a sentence.

Therefore, in this paper, we design an attention-based model that obtains multiple vector representations of a sentence, each conditioned on the memory of previous self-attention steps. The main contributions are:

  • We are the first to evaluate a multi-hop self-attention model on the task of Chinese medical relation extraction. Each attention step generates a different word-weight distribution over the sentence.

  • The model captures multi-aspect semantic information from Chinese medical literature, and the final prediction combines the results from the different sentence representations.

  • We conduct experiments on Chinese medical literature datasets. Extensive evaluations show that our method establishes a new state of the art on Chinese medical literature.

2 Methods

We apply multi-hop attention to the Chinese medical relation extraction task. Our model learns complex semantic information based on multiple sentence vector representations and comprehensively considers the different information to reach the final classification result. The model is divided into four parts: (1) embedding layer, (2) Bi-LSTM layer, (3) multi-hop self-attention layer, and (4) output layer. The input of the model is the sentence representation, obtained by concatenating word embeddings and position embeddings. Next, the Bi-LSTM layer learns the deep semantic information of the text. Then, the multi-hop self-attention mechanism extracts complex semantic information; through multiple iterations, the model obtains a different representation for different information in the sentence. Finally, each of these vector representations is fed into a fully connected layer, the softmax function yields a classification probability for each, and the probabilities are averaged to produce the final classification result. The model structure is shown in Fig. 1:

Fig. 1 Architecture of our proposed model

2.1 Text preprocessing

After data preprocessing is completed, the next step is to segment each sentence into tokens. However, the original segmentation tool handles domain-specific medical terms poorly. To solve this issue, we expand the word segmentation dictionary with the medical entity dictionary we built, the keywords of the medical literature, and the annotated entity words. Experimental results show that the expanded segmentation dictionary improves performance.

2.2 Feature design

In this paper, we extract two features for model training. For each word in a sentence, the representation \({x}_{i}\) used as model input is obtained by concatenating its word embedding and position embedding: \({x}_{i}=[{E}_{word_{i}};{E}_{pos_{i}}]\).

Word embedding. As the input of the model, the word embedding plays a crucial role in training. Word2Vec [17] is a popular unsupervised word embedding method. It learns low-dimensional dense vector representations for words, avoiding the high dimensionality and sparsity of one-hot encoding. In recent years, Word2Vec has been used in text classification, sentiment analysis, and other natural language processing (NLP) tasks. In our model, we use the skip-gram variant of Word2Vec to pre-train word embeddings. The training data are abstracts of CNKI medical literature collected with web crawlers; the final corpus size is about 2 GB.
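
For clarity, the following is a minimal sketch of how such skip-gram embeddings could be pre-trained with gensim; the corpus path, vector size, window, and minimum count are illustrative assumptions and not the settings reported in this paper.

```python
from gensim.models import Word2Vec

# Illustrative sketch (gensim >= 4.0). Each line of the already
# word-segmented corpus is assumed to be one sentence with tokens
# separated by spaces; file name and hyperparameters are assumptions.
def train_word_embeddings(corpus_path="cnki_abstracts_segmented.txt",
                          dim=100, window=5, min_count=5):
    sentences = []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:
                sentences.append(tokens)
    # sg=1 selects the skip-gram architecture used in the paper
    model = Word2Vec(sentences, vector_size=dim, window=window,
                     min_count=min_count, sg=1, workers=4)
    model.wv.save_word2vec_format("medical_word2vec.txt")
    return model
```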

Position embedding. The closer a word is to the entities, the more important it tends to be. Based on this assumption, we extract the relative distances between each entity and the other words as an auxiliary feature for the relation extraction task. Zeng et al. [18] used position embeddings as part of the input for relation extraction and obtained better performance than without them. The position feature is the relative distance of each word to each entity. For example, the position features calculated after word segmentation are shown in Fig. 2:

Fig. 2 A sample of position feature

For each word \({x}_{i}\) in the sentence, there are two position features, pos1 and pos2 (the relative distances to the two entities), which are concatenated into the final position feature pos = [pos1; pos2]. The position embedding vectors are randomly initialized and updated during training of the relation extraction model.
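
The following PyTorch sketch illustrates how the relative-position features and the concatenated input representation \({x}_{i}\) could be built. The clipping range, vocabulary handling, and the 100-dimensional word embedding are illustrative assumptions; only the 15-dimensional position embedding follows the training details given later.

```python
import torch
import torch.nn as nn

# Sketch only: clipping distances to [-max_dist, max_dist] and shifting
# them to non-negative indices is an assumed implementation detail.
def relative_positions(tokens, entity_idx, max_dist=50):
    return [min(max(i - entity_idx, -max_dist), max_dist) + max_dist
            for i in range(len(tokens))]

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size, word_dim=100, pos_dim=15, max_dist=50):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.pos1_emb = nn.Embedding(2 * max_dist + 1, pos_dim)  # distance to entity 1
        self.pos2_emb = nn.Embedding(2 * max_dist + 1, pos_dim)  # distance to entity 2

    def forward(self, word_ids, pos1_ids, pos2_ids):
        # x_i = [E_word_i ; E_pos1_i ; E_pos2_i]
        return torch.cat([self.word_emb(word_ids),
                          self.pos1_emb(pos1_ids),
                          self.pos2_emb(pos2_ids)], dim=-1)
```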

2.3 Bi-LSTM layer

During training, a traditional RNN feeds in the output of the previous time step to learn the sequence information of the sentence. However, when the sentence is too long, earlier information is lost. LSTM [19] was proposed to address this: it retains earlier information and also alleviates the vanishing gradient problem.

For an input sequence \(X= ({x}_{1},{x}_{2},\dots ,{x}_{n})\), the output of the Bi-LSTM layer is \(H=({h}_{1},{h}_{2},\dots ,{h}_{n})\). At each time step t there is also a cell state \({C}_{t}\), which preserves long-term information; \({C}_{t-1}\) is the cell state at the previous time step t-1, and only a few linear interactions modify it during training. The structure of an LSTM unit incorporates three gates: the input, forget, and output gates. The formulas are:

$${f}_{t}=\sigma ({W}_{f}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{f})$$
(1)
$${i}_{t}=\sigma ({W}_{i}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{i})$$
(2)
$${\stackrel{\sim }{C}}_{t}=tanh({W}_{C}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{C})$$
(3)
$${C}_{t}={f}_{t}*{C}_{t-1}+{i}_{t}*{\stackrel{\sim }{C}}_{t}$$
(4)
$${o}_{t}=\sigma ({W}_{o}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{o})$$
(5)
$${h}_{t}= {o}_{t}* tanh({C}_{t})$$
(6)

where the \(W\) matrices are trainable parameters, \({x}_{t}\) is the input at time step t, \({h}_{t-1}\) is the hidden state of the previous cell, \({h}_{t}\) is the hidden state of the current cell, the \(b\) terms are bias vectors, σ denotes the sigmoid function, and tanh is the activation function.

A forward LSTM can only learn forward information in a text sequence, but classification performance often depends on the full context of the text. Therefore, in this paper, we use forward and backward LSTMs as the feature extraction layer of the relation extraction task. Because the Bi-LSTM obtains richer contextual information, it outperforms a uni-directional LSTM. Finally, we concatenate the two hidden states as the output \({h}_{t}=[\overrightarrow{{h}_{t}};\overleftarrow{{h}_{t}}]\) at time step t.
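
A minimal PyTorch sketch of such a Bi-LSTM feature extraction layer is shown below; the input and hidden dimensions are illustrative assumptions (130 = 100-dim word embedding + 2 × 15-dim position embeddings under the assumptions above).

```python
import torch.nn as nn

# Sketch of the Bi-LSTM encoder; hidden_dim is an assumption.
class BiLSTMEncoder(nn.Module):
    def __init__(self, input_dim=130, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        # h: (batch, seq_len, 2 * hidden_dim), the concatenation of the
        # forward and backward hidden states at each time step
        h, _ = self.bilstm(x)
        return h
```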

2.4 Self-attention mechanism

In NLP, the attention mechanism has developed rapidly, mainly for three reasons. First, neural networks with attention have outperformed previous best models on multiple NLP tasks. Second, although deep learning achieves good performance, its decisions are usually difficult to explain; the weighting mechanism assigns a weight to each word in the sentence and thereby increases the interpretability of neural networks. Finally, the attention mechanism can capture contextual information and alleviate the long-distance dependency problem. With the development of attention research, many variants have been proposed, among which self-attention is the most widely used in text classification [20, 21] and relation extraction [22, 23]. Self-attention automatically learns a weight for each word, assigning different weights so as to distinguish the importance of each word in the sentence.

Given the hidden representation of the sentence \(H=({h}_{1},{h}_{2},\dots ,{h}_{n})\), the word weights under the self-attention mechanism are calculated as follows:

$$\beta =softmax({w}^{T}\mathit{tanh}\left(WH+b\right))$$
(7)
$$u={\sum }_{t}{\beta }_{t}{h}_{t}$$
(8)

where W is a trainable parameter matrix, w is a trainable parameter vector, and b is a bias vector. \({\beta }_{t}\) is the weight of the t-th word in the sentence. Finally, we obtain the sentence vector representation u.
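
The following PyTorch sketch corresponds to Eqs. (7)–(8); the attention dimension \(d_a\) is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of single-hop self-attention, Eqs. (7)-(8).
class SelfAttention(nn.Module):
    def __init__(self, hidden_dim=256, d_a=64):
        super().__init__()
        self.W = nn.Linear(hidden_dim, d_a)      # W and b in Eq. (7)
        self.w = nn.Linear(d_a, 1, bias=False)   # w in Eq. (7)

    def forward(self, H):
        # H: (batch, seq_len, hidden_dim)
        scores = self.w(torch.tanh(self.W(H))).squeeze(-1)  # (batch, seq_len)
        beta = F.softmax(scores, dim=-1)                     # Eq. (7)
        u = torch.sum(beta.unsqueeze(-1) * H, dim=1)         # Eq. (8)
        return u, beta
```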

2.5 Multi-hop self-attention mechanism

The earliest multi-hop attention mechanism was proposed for the question answering task [24]. Multiple attention steps can focus on different parts of the sentence and thus capture several kinds of information. The multi-hop attention mechanism encodes this information as multiple vector representations and considers each of them to select the best answer. Current relation extraction models typically rely on a single semantic representation. For example, a single self-attention pass concentrates on the part of the text most relevant to the task but ignores other information. By applying self-attention multiple times and assigning new word weights based on the memory of previous steps, we can capture more complex semantic information in sentences and obtain multiple sentence vector representations. Inspired by these ideas, we propose a multi-hop self-attention model to improve the effectiveness of medical relation extraction, using multiple attention iterations to capture the complex relationship between the entities.

After the Bi-LSTM layer, we obtain the hidden vectors H of a sentence of length n. We define a memory M that stores the information accumulated across self-attention steps; \({m}^{k}\) denotes the memory vector after the k-th step, which guides the next attention step. The word weights of the k-th self-attention step are calculated from the previous memory \({m}^{k-1}\) as follows:

$${S}^{k}=\mathrm{tanh}({W}_{h}^{k}H)\odot \mathrm{tanh}({W}_{m}^{k}{m}^{k-1})$$
(9)
$${\beta }^{k}=\mathrm{softmax}({W}_{s}^{k}{S}^{k})$$
(10)

where the \(W\) terms are attention weight matrices. The memory m is initialized by averaging the hidden vectors output by the Bi-LSTM, and \({m}^{k}\) is updated recursively by:

$$\left\{\begin{array}{c}{m}^{0}=\frac{1}{n}{\sum }_{t}{h}_{t}\\ {m}^{k}={m}^{k-1}+{u}^{k}\end{array}\right.$$
(11)

where \({u}^{k}\) is the sentence vector representation obtained by the weighted summation of the hidden word vectors:

$${u}^{k}={\sum }_{t}{\beta }_{t}^{k}{h}_{t}$$
(12)

Each self-attention step produces a different \({u}^{k}\), so that k vector representations of the sentence are obtained. Each \({u}^{k}\) is then used to calculate a classification probability, and the multiple classification results are averaged to form the final prediction. The formulas are as follows:

$${R}^{k}=softmax({u}^{k})$$
(13)
$$R=\frac{1}{k}{\sum }_{k}{R}^{k}$$
(14)
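
A minimal PyTorch sketch of Eqs. (9)–(14) is given below. The hidden size, attention size, and the shared fully connected layer before the softmax (mentioned in the model overview but not written explicitly in Eq. (13)) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the multi-hop self-attention classifier, Eqs. (9)-(14);
# dimensions, hop count, and the shared classifier are assumptions.
class MultiHopSelfAttention(nn.Module):
    def __init__(self, hidden_dim=256, d_a=64, num_hops=2, num_classes=2):
        super().__init__()
        self.num_hops = num_hops
        self.W_h = nn.ModuleList([nn.Linear(hidden_dim, d_a, bias=False)
                                  for _ in range(num_hops)])
        self.W_m = nn.ModuleList([nn.Linear(hidden_dim, d_a, bias=False)
                                  for _ in range(num_hops)])
        self.W_s = nn.ModuleList([nn.Linear(d_a, 1, bias=False)
                                  for _ in range(num_hops)])
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, H):
        # H: (batch, seq_len, hidden_dim) from the Bi-LSTM
        m = H.mean(dim=1)                                # m^0, Eq. (11)
        probs = []
        for k in range(self.num_hops):
            s = torch.tanh(self.W_h[k](H)) * \
                torch.tanh(self.W_m[k](m)).unsqueeze(1)            # Eq. (9)
            beta = F.softmax(self.W_s[k](s).squeeze(-1), dim=-1)   # Eq. (10)
            u = torch.sum(beta.unsqueeze(-1) * H, dim=1)           # Eq. (12)
            m = m + u                                              # Eq. (11)
            probs.append(F.softmax(self.classifier(u), dim=-1))    # Eq. (13)
        return torch.stack(probs).mean(dim=0)                      # Eq. (14)
```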

In this paper, we set the number of self-attention steps to 2, which gives the best results. The structure of the multi-hop self-attention mechanism with k = 2 is shown in Fig. 3. As the figure shows, each self-attention step updates \({m}^{k}\) until k reaches 2 and interacts with the deep sentence vector representation to obtain a new sentence representation \({u}^{k}\).

Fig. 3 The multi-hop self-attention mechanism structure

2.6 Output and training

The model is trained with the cross-entropy loss function:

$$\mathrm{C}=-{\sum }_{i}{y}_{i}ln{R}_{i}$$
(15)

The parameters are learned by minimizing this loss, where \({y}_{i}\) is the true label and \({R}_{i}\) is the predicted probability.
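
As a sketch of one training step under the modules sketched above (embedding, Bi-LSTM encoder, multi-hop attention) and an Adam optimizer; the loss is the cross-entropy of Eq. (15) applied to the averaged probabilities of Eq. (14).

```python
import torch
import torch.nn.functional as F

# Example setup (assumed): combine all parameters into one Adam optimizer.
# optimizer = torch.optim.Adam(list(embed.parameters()) +
#                              list(encoder.parameters()) +
#                              list(attention.parameters()))
def train_step(embed, encoder, attention, batch, optimizer):
    word_ids, pos1_ids, pos2_ids, labels = batch
    optimizer.zero_grad()
    x = embed(word_ids, pos1_ids, pos2_ids)   # concatenated input features
    H = encoder(x)                            # Bi-LSTM hidden states
    R = attention(H)                          # averaged class probabilities, Eq. (14)
    # cross-entropy over the averaged probabilities, Eq. (15)
    loss = F.nll_loss(torch.log(R + 1e-9), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```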

3 Results

3.1 Datasets

In this paper, the dataset is extracted from 4000 Chinese medical literature abstracts, all drawn from Chinese core journals indexed by PKU. We used web crawlers to obtain the abstracts from CNKI. Sentence splitting was performed after the abstracts were fully annotated, and the 4000 abstracts yielded 5490 sentences as training samples. Each semantic relation in the dataset consists of a specific type of entity pair, and entity pairs of the same type are grouped together.

Before relation extraction, all entities in the sentences have been annotated. For sentences with multiple entities, we pair entities of different types and generate a new instance for each entity pair, so each instance focuses on exactly two entities. We retain only the sentences that contain a therapeutic or causal relation and therefore focus on the therapeutic relation extraction and causal relation extraction tasks. A therapeutic relation sample is shown in Fig. 4:

Fig. 4 The therapeutic relation sample of data processing

An instance may contain more than two entity mentions. To clearly distinguish the entity pair under judgment in each instance, we replace its drug name and disease name with “drug” and “diseases”. During the experiments, 10% of each dataset was randomly selected as the test set, and sentence-length statistics were computed for each dataset. Details of the two datasets are presented in Table 1.

Table 1 The details of different types of dataset

3.2 Evaluation metrics and training details

The input of the model is the concatenation of word embeddings and position embeddings. Deep semantic information is then obtained through the Bi-LSTM layer and the multi-hop self-attention mechanism, and the results are classified by the softmax function. The model is optimized with the Adam optimizer. Since this is a classification task, we use precision (P), recall (R), and F1-score (F1) to evaluate the performance of our model. For the position embedding dimension, we tried 10, 15, and 20, and 15 gave the best results. For the multi-hop self-attention mechanism, we tried k = 1, 2, and 3 self-attention steps and selected k = 2 based on model performance. The parameter settings are shown in Table 2.

Table 2 Training details of medical relation extraction model based on multi-hop self-attention mechanism
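
For reference, the evaluation metrics can be computed as in the following sketch; treating each dataset as a binary classification problem is an assumption about how P, R, and F1 are reported, not a detail stated in the paper.

```python
from sklearn.metrics import precision_recall_fscore_support

# Sketch: y_true and y_pred are lists of 0/1 labels for one relation type.
def evaluate(y_true, y_pred):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary")
    return {"precision": p, "recall": r, "f1": f1}
```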

3.3 Experimental results

To verify that the Bi-LSTM model based on the multi-hop self-attention mechanism performs better on relation extraction tasks, we select several mainstream relation extraction models for comparison. The experimental results are shown in Table 3.

Table 3 Comparison of performance with other methods

Because relation extraction can be viewed as a classification task, CNN and Bi-LSTM are commonly used as baseline models, and we compare our model with both. The self-attention mechanism is widely used in NLP because it learns important information in the text by assigning different weights to words and performs well on classification tasks, so many researchers have applied it to medical relation extraction [25]. Multi-head attention [26] has also received considerable attention in NLP; it splits the sentence hidden representation into multiple parts, applies attention to each part, and concatenates the results to obtain the final output. Multi-head attention outperforms a single attention mechanism on classification, so we also include it as a baseline for medical relation extraction. The input to all models is the sentence representation formed by concatenating word embeddings and position embeddings.

As shown in Table 3, the multi-hop self-attention mechanism outperforms the other methods on the Chinese medical relation extraction task. In the first two rows, Bi-LSTM achieves higher performance than CNN: although CNN can extract local features of a sentence, the contextual information captured by Bi-LSTM is more important. Adding multi-head attention assigns weights to multiple parts of the Bi-LSTM output vector and captures different aspects of semantic information, so its performance is better than that of the single self-attention mechanism. The Bi-LSTM model based on the multi-hop self-attention mechanism proposed in this paper captures more complex semantic relations in sentences: through multiple iterations of self-attention, it captures different important information in each iteration. Compared with the other methods, our model achieves significant improvements, with F1-scores of 93.19% on the therapeutic relation task and 73.47% on the causal relation task.

3.4 Effect of the number of self-attention steps

To examine the influence of the number of self-attention steps k on the results, Table 4 shows the results with different numbers of steps on the therapeutic relation task, and Table 5 shows the corresponding results on the causal relation task. From Tables 4 and 5, we see that the model performs best when the number of multi-hop self-attention iterations is 2. The optimal number of iterations may also be related to sentence length, so multiple self-attention steps may be especially effective for detecting information in long sentences.

Table 4 Comparison of step of self-attention in therapeutic relation extraction
Table 5 Comparison of step of self-attention in causal relation extraction

3.5 Case study

We visualize heat maps of sentences generated by our multi-hop self-attention model in Fig. 5. For each word in a sentence, the darker the color, the larger the attention weight, indicating that the word plays a more important role in classification. As shown in Fig. 5, we provide heat maps for different numbers of multi-hop self-attention steps. The sentences in Fig. 5a and b are from the therapeutic relation dataset, and the sentences in Fig. 5c and d are from the therapeutic relation dataset.

Fig. 5 The weight of the example on different self-attention steps

Figure 5a and b show the word weights generated by the first and second self-attention hops, respectively. In this example, we observe that the second step focuses on information in the sentence that differs from the first step. The entities in the sentence are ‘family care’ and ‘colorectal cancer’. The word ‘improve’ could indicate a positive relation between the two entities, while the word ‘reduce’ could also play an important role in the classification. Combining these two aspects of the sentence leads to better classification. In Fig. 5c and d, the words ‘improve’ and ‘treatment’ could play an important role in the classification.

4 Conclusions

Although deep learning overcomes machine learning's reliance on manually extracted features, most neural network models are limited to a single vector representation. However, the semantic relationship between two entities in a sentence is often complicated, especially in long sentences. Therefore, we employ the multi-hop self-attention mechanism to generate multiple sentence representations. On the Chinese medical relation extraction task, our model focuses on different semantic information between two entities through multiple iterations of attention. Experimental results on two Chinese medical datasets show that our model is highly effective compared with existing methods, achieving state-of-the-art results with little feature engineering.