
1 Introduction

Relation extraction (RE) is the natural language processing task of automatically extracting relations between named entities in text. One application of relation extraction is automatic database completion and expansion. Constructing databases from textual resources so that humans can easily access important information requires reading a large number of documents comprehensively, which incurs substantial manual cost. Research on relation extraction from text is therefore crucial for achieving advanced human-computer interaction.

Relation extraction from biomedical texts is vital research for supporting biomedical experts. One such task is extracting drug-drug interactions from text. A drug-drug interaction (DDI) is defined as a change in the effects of one drug caused by the presence of another drug [4]. To practice “evidence-based medicine” [16] and prevent drug-related accidents, it is important to extract knowledge about DDIs from pharmaceutical papers comprehensively. Automatic DDI extraction can greatly benefit the pharmaceutical industry by reducing the time healthcare professionals spend reviewing the medical literature.

Classification-based supervised methods [14, 17] have conventionally been adopted for information extraction from biomedical texts; however, with the success of large language models (LLMs), prompt-tuning-based information extraction methods [6] have begun to be studied. In prompt-tuning methods, the input sentence and the prompt, an instruction text for the target downstream task, are fed into the LLM, which then predicts the entities and the relations between them. In recent years, research on prompt-tuning has drawn increasing attention, and various methods such as in-context learning [15] and instruction tuning [9] have been proposed. Because of the extremely large number of parameters in LLMs, it is not practical to update all model parameters by supervised learning. Instead, a few-shot learning approach with only a few supervised examples, or a zero-shot learning approach with no supervised examples, is commonly used to predict answers.

The critical issue is that, despite the success of LLMs in generative tasks such as summarization and question answering, LLMs do not significantly improve performance on information extraction tasks. According to previous surveys [6, 7], the GPT-3.5 model, which has 355B parameters, underperformed traditional classification-based state-of-the-art methods on several biomedical named entity recognition and relation extraction tasks. Furthermore, the GPT-4 model, which has an even larger model size, underperforms a fully supervised PubMedBERT [11], which has only 110M parameters. These results show that existing prompt-based few-shot and zero-shot learning with LLMs is not effective for information extraction in the biomedical domain.

In this study, we propose novel information extraction methods enhanced by LLMs. An overview of our proposed methods is shown in Fig. 1. We investigate three DDI extraction methods that leverage LLMs. In the first method, we examine the ability to extract DDIs in a few-shot learning setting via the extremely large language model Gemini-Pro [20]. In the second method, we enhance seq2seq-based fully fine-tuned DDI extraction with CoT reasoning explanations generated by Gemini-Pro. In the third method, we enhance classification-based fully fine-tuned DDI extraction with drug entity descriptions that are automatically generated by Gemini-Pro. Our contributions are summarized as follows:

  • We propose three DDI extraction methods that leverage the benefits of LLMs.

  • Experimental results on the DDIExtraction-2013 dataset show that entity descriptions generated by LLMs can boost the performance of the classification-based DDI extraction method, achieving a significant F-score improvement.

Fig. 1. An overview of relation extraction methods with LLMs.

2 Related Work

Extracting information from the biomedical literature is an important NLP task that converts unstructured text data, such as academic papers and web articles, into structured data that humans can easily access. One target task is drug-drug interaction (DDI) extraction from the literature. A DDI is broadly defined as a change in the effects of one drug caused by the presence of another drug [4]. DDI detection is an important research area for patient safety, since these interactions can be dangerous and increase healthcare costs. The DDIExtraction-2013 [18] dataset was constructed to promote automatic DDI extraction from the literature via machine learning methods.

On the DDI extraction task, classification-based methods using relatively small encoder-only pre-trained language models (PLMs) have shown high performance. Biomedical-domain PLMs such as BioBERT [13], SciBERT [5], and PubMedBERT [11] have been adopted for the DDI extraction task. Methods combining PLMs with information from external drug databases, e.g., DrugBank [22], have been proposed, and it has been reported that using information from external databases improves extraction performance over considering the context alone [1,2,3].

In general-domain relation extraction, REBEL [12], which adopted seq2seq-based PLMs, showed higher performance than existing pipeline-based methods on the joint extraction of entities and relations. Wadhwa et al. [21] first showed that few-shot learning with GPT-3 yields near state-of-the-art performance on general-domain relation extraction datasets, and then proposed training Flan-T5 with Chain-of-Thought (CoT) style “explanations” (generated automatically by GPT-3) that support relation inferences; this achieved state-of-the-art results on general-domain relation extraction tasks.

On the other hand, Chen et al. [8] reported that LLMs do not significantly improve performance on information extraction tasks in the biomedical domain. The GPT-3.5 model, which has 355B parameters, underperformed traditional classification-based state-of-the-art methods on several biomedical named entity recognition and relation extraction tasks. Furthermore, the GPT-4 model, which has an even larger model size, underperforms a fully supervised PubMedBERT [11], which has only 110M parameters. There has not been sufficient discussion of the effectiveness of LLMs, or of methods for combining LLMs with smaller PLMs, on biomedical information extraction tasks.

3 Method

3.1 Relation Extraction via In-Context Few-Shot Learning with LLMs

We apply instructional in-context few-shot prompting to Gemini-Pro [20]. Figure 2 shows the instructional prompt and examples (“shots”) given as input to the LLM. We examine two approaches: direct prompting, which predicts the relation type directly from the instructional prompt and a few examples, and chain-of-thought prompting, which predicts the relation type after first generating an explanation of the two entities.

Direct Prompting. To construct prompts for relation extraction, we use a prompt that defines the relation types and instructs the LLM to predict the correct relation type from the given text, as shown in Fig. 2 A. Special tokens (<e1>, </e1>, <e2>, </e2>) are used to clarify which drugs in the sentence are targeted. Example sentences are selected from the training set of the relation extraction corpus. Among them, we select the sentences that appear within the annotation guideline for dataset construction, because we consider these examples to be representative of their relation types. A sketch of this prompt construction is shown below.
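
As an illustration, the following Python sketch builds such a few-shot prompt. The instruction wording, label strings, and helper function are our assumptions for illustration; the exact prompt text is the one shown in Fig. 2 A.

```python
LABELS = ["mechanism", "effect", "advise", "int", "negative"]

# Assumed instruction wording; the paper's actual prompt appears in Fig. 2 A.
INSTRUCTION = (
    "Classify the interaction between the two drugs marked with "
    "<e1></e1> and <e2></e2> into one of: " + ", ".join(LABELS) + "."
)

def build_direct_prompt(examples, target_sentence):
    """examples: (tagged_sentence, gold_label) pairs taken from the
    annotation guideline of the dataset."""
    shots = "\n\n".join(
        f"Sentence: {sent}\nRelation: {label}" for sent, label in examples
    )
    return f"{INSTRUCTION}\n\n{shots}\n\nSentence: {target_sentence}\nRelation:"
```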

Chain-of-Thought Prompting. In chain-of-thought (CoT) prompting, the prompt instructs the LLM to first generate an explanation of the entities and then predict the relation type, rather than predicting the relation type directly. Examples for few-shot learning are selected in the same way as in direct prompting, and an explanation is added to each sentence, as shown in Fig. 2 B. As explanations, we adopt the text that describes the relation between entities in the annotation guideline. The sketch below shows how each shot changes under CoT.
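
Continuing the sketch above, each shot now carries an explanation before its label; the added instruction sentence is again an illustrative assumption.

```python
def build_cot_prompt(instruction, examples, target_sentence):
    """examples: (tagged_sentence, explanation, gold_label) triples, with
    explanations taken from the annotation guideline."""
    shots = "\n\n".join(
        f"Sentence: {sent}\nExplanation: {expl}\nRelation: {label}"
        for sent, expl, label in examples
    )
    return (f"{instruction}\nFirst explain the relation between the two "
            f"marked drugs, then output the relation type.\n\n"
            f"{shots}\n\nSentence: {target_sentence}\nExplanation:")
```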

Fig. 2. Model overview of in-context few-shot learning with LLMs.

3.2 Seq2seq-Based Relation Extraction Enhanced by LLMs

We applied the method of Wadhwa et al. [21], which uses LLMs for data augmentation in the full fine-tuning of seq2seq-based PLMs for relation extraction, to the biomedical domain. Figure 3 shows an overview of the method. In this method, relatively small PLMs with fewer than 1B parameters are fine-tuned on the whole training dataset. The relation labels are generated by the seq2seq model, and during fine-tuning on the training dataset we add CoT-style explanations, generated automatically by LLMs, that support the relation inferences. First, we prepare the CoT-style explanations for all examples in the training dataset by feeding the LLM the instructional prompt and examples, as shown in the left part of Fig. 3. Then we fine-tune the seq2seq PLMs on the gold relation labels and the explanations generated by the LLM, as shown in the right part of Fig. 3. A sketch of the target construction is shown below.
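
The following sketch illustrates target construction and the fine-tuning loss; the output formats follow Sect. 4.3, while `gemini_explanations` is a hypothetical mapping from example ids to the explanations prepared in advance by the LLM.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def make_target(example, use_cot, gemini_explanations=None):
    # Formats follow Sect. 4.3: "Relation: xxx" without CoT,
    # "Relation: xxx Explanation: xxx" with CoT.
    if use_cot:
        expl = gemini_explanations[example["id"]]  # generated in advance by the LLM
        return f"Relation: {example['label']} Explanation: {expl}"
    return f"Relation: {example['label']}"

def training_loss(example, use_cot, gemini_explanations=None):
    inputs = tokenizer(example["sentence"], return_tensors="pt", truncation=True)
    labels = tokenizer(make_target(example, use_cot, gemini_explanations),
                       return_tensors="pt", truncation=True).input_ids
    # Standard seq2seq cross-entropy loss over the target sequence.
    return model(**inputs, labels=labels).loss
```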

Fig. 3. Model overview of seq2seq-based relation extraction enhanced by LLMs.

3.3 Classification-Based Relation Extraction Enhanced by LLMs

We propose a classification-based relation extraction method enhanced by LLMs. In this approach, input sentences are converted into pooled representations by encoder-only PLMs, and the resulting vectors are projected to the dimension of the number of relation labels for multi-class classification. We utilize LLMs to augment the entity information during full fine-tuning of the PLMs. Specifically, for the two entities in the sentence, entity descriptions are generated in advance by the LLM with the prompt “Please provide a short description on <ENTITY> in one sentence.”, as shown in Fig. 4. The input sentence, the first entity description, and the second entity description are fed to the PLMs. The three output vectors are concatenated, and the resulting vector is fed to a linear layer for dimension conversion. We prepare two separate PLMs, one for the input sentences and the other for the entity descriptions. A sketch of this architecture is shown below.
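
A minimal PyTorch sketch of the described architecture follows; the [CLS] pooling and the five-label output (four DDI types plus negative) are our assumptions, and `plm_name` would be a PubMedBERT Large checkpoint.

```python
import torch
from torch import nn
from transformers import AutoModel

class DualEncoderDDIClassifier(nn.Module):
    """One encoder for the input sentence, a second (shared) encoder for the
    two entity descriptions; the three pooled vectors are concatenated and
    projected to the relation labels."""

    def __init__(self, plm_name, num_labels=5):
        super().__init__()
        self.sent_encoder = AutoModel.from_pretrained(plm_name)
        self.desc_encoder = AutoModel.from_pretrained(plm_name)
        hidden = self.sent_encoder.config.hidden_size
        self.classifier = nn.Linear(3 * hidden, num_labels)

    def forward(self, sent_inputs, desc1_inputs, desc2_inputs):
        # [CLS] pooling over each of the three inputs (pooling choice assumed).
        h_sent = self.sent_encoder(**sent_inputs).last_hidden_state[:, 0]
        h_desc1 = self.desc_encoder(**desc1_inputs).last_hidden_state[:, 0]
        h_desc2 = self.desc_encoder(**desc2_inputs).last_hidden_state[:, 0]
        return self.classifier(torch.cat([h_sent, h_desc1, h_desc2], dim=-1))
```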

4 Experimental Settings

4.1 DDI Extraction Task Settings

We followed the DDIExtraction-2013 [18] shared task settings. This dataset is composed of input sentences containing drug mention pairs, and one of the following four DDI types is annotated for each interacting drug pair.

  • Mechanism: This type is assigned when a pharmacokinetic interaction is described in an input sentence.

  • Effect: This type is assigned when a pharmacodynamic interaction is described in an input sentence.

  • Advise: This type is assigned when a recommendation or advice regarding the concomitant use of two drugs is described in an input sentence.

  • Interaction (Int.): This type is assigned when the sentence states that an interaction occurs but does not provide any detailed information about the interaction.

Table 1 shows the statistics of the DDI extraction dataset. The dataset is highly imbalanced: there are roughly six times as many pairs not mentioning a relation (negative pairs) as pairs mentioning a relation (positive pairs). Since the official dataset provides no validation split, we split the training data into a smaller training set and a validation set to perform hyper-parameter tuning. After determining the hyper-parameters, we re-trained the model on the whole training set and evaluated it on the test set.

Fig. 4. Model overview of classification-based relation extraction enhanced by LLMs.

Table 1. The statistics of the DDIExtraction-2013 dataset

4.2 LLMs and Prompts

We adopted Gemini-Pro [20] as the LLM. Gemini-Pro is a model optimized for cost and latency that delivers strong performance across a wide range of tasks. In evaluations on a series of text-based academic benchmarks covering reasoning, reading comprehension, STEM, and coding, Gemini-Pro showed higher performance than GPT-3.5. We obtained the outputs from Gemini-Pro via the Google AI API. If the model generated text that did not match any relation label name, it was assumed to predict the negative relation, as in the sketch below.
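
A minimal sketch of this querying and fallback step, using the google-generativeai Python SDK (API key handling and error handling omitted; the fallback mirrors the rule stated above):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro")

VALID_LABELS = {"mechanism", "effect", "advise", "int"}

def predict_relation(prompt: str) -> str:
    text = model.generate_content(prompt).text.strip().lower()
    # Any output that matches no relation label name is treated as negative.
    return text if text in VALID_LABELS else "negative"
```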

To prepare the prompts for few-shot learning, we selected 14 examples from the annotation guideline of the DDIExtraction-2013 dataset. The explanations for CoT reasoning were also extracted from the annotation guideline.

4.3 PLMs for Seq2seq Methods

We adopted the Flan-T5 Large [9] model, which has 783M parameters, as the baseline for the seq2seq-based method. In seq2seq-based DDI extraction, the model generates output in the form Relation: xxx, and the model with CoT generates Relation: xxx Explanation: xxx. The generated explanation part is not used for evaluation; only the generated relation type is used. When the model generates an output that does not match any of the relation types, we assume that the negative label is predicted; this parsing rule is sketched below. The Flan-T5 model parameters are trained on all training samples of the DDIExtraction-2013 dataset. In addition, the model with CoT is trained on the explanations generated in advance by Gemini-Pro. We set the beam size to 5 for generation. We employed the Adafactor optimizer [19] and tuned hyper-parameters on the development dataset.
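
A sketch of the fallback parsing rule, assuming the lowercase label names of the dataset:

```python
import re

VALID_LABELS = {"mechanism", "effect", "advise", "int"}

def parse_prediction(generated: str) -> str:
    """Extract the label from 'Relation: xxx' (with or without a trailing
    'Explanation: ...'); unparseable outputs fall back to the negative label."""
    match = re.search(r"Relation:\s*(\w+)", generated)
    label = match.group(1).lower() if match else ""
    return label if label in VALID_LABELS else "negative"
```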

4.4 PLMs for Classification Methods

We employed PubMedBERT Large [11] as the baseline encoder-only PLM for classification-based relation extraction. We employed the Adafactor optimizer [19] and tuned hyper-parameters on the development dataset. Our significance tests are based on the permutation test [10], with the number of shuffles set to 5,000; a sketch of the procedure is shown below.
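
The following sketch shows one common form of this test (an approximate randomization test over paired predictions); `metric` is assumed to be a function `(golds, preds) -> F-score`, e.g. the micro-averaged F-score over positive labels.

```python
import random

def permutation_test(metric, preds_a, preds_b, golds, n_shuffles=5000, seed=0):
    """Swap the two systems' predictions on a random subset of instances and
    count how often the shuffled F-score gap is at least the observed gap."""
    rng = random.Random(seed)
    observed = abs(metric(golds, preds_a) - metric(golds, preds_b))
    hits = 0
    for _ in range(n_shuffles):
        swapped_a, swapped_b = [], []
        for a, b in zip(preds_a, preds_b):
            if rng.random() < 0.5:  # swap this instance's predictions
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        if abs(metric(golds, swapped_a) - metric(golds, swapped_b)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)  # p-value with add-one smoothing
```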

5 Results and Discussions

5.1 In-Context Few-Shot Learning-Based Relation Extraction by LLMs

Table 2 shows the performance comparison between the traditional classification-based method and the few-shot in-context learning methods via Gemini-Pro with and without CoT. As shown in Table 2, few-shot in-context learning via Gemini-Pro performed far worse than the classification-based method with the smaller PLM (PubMedBERT-Large). The model with CoT showed a higher F-score than the direct prompting model, but its performance is still much lower than that of the fully fine-tuned PubMedBERT. These results are consistent with the report [6] that validated GPT-3.5 on other biomedical relation extraction datasets, indicating that while LLMs have reasonable text generation capacity, it is difficult for them to correctly predict relations between entities from few-shot samples.

We performed further analysis on the relation labels predicted by the LLM. Figure 5 shows the normalized confusion matrices of the gold labels and the predictions from Gemini-Pro with and without CoT. Each row of a matrix shows the distribution of the model's label predictions for one gold label, normalized so that the row sums to one. The diagonal components of the matrix indicate the samples that are correctly predicted, so darker diagonal elements indicate higher model performance. As shown in Fig. 5, the Gemini-Pro model without CoT incorrectly predicts many positive relation instances as negative relations. The Gemini-Pro model with CoT incorrectly predicts positive relations as negative less often; however, it more often incorrectly predicts negative relations as positive. These results show that it is difficult for LLM-based in-context few-shot learning to predict correct relation labels on a highly imbalanced relation extraction dataset. Such matrices can be computed as sketched below.
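
For reference, a row-normalized confusion matrix of this kind can be computed with scikit-learn; the label order and the toy label lists here are illustrative stand-ins for the real evaluation data.

```python
from sklearn.metrics import confusion_matrix

LABEL_ORDER = ["negative", "mechanism", "effect", "advise", "int"]

# Toy per-pair label lists standing in for the real gold and predicted labels.
gold_labels = ["effect", "negative", "int", "mechanism", "advise"]
pred_labels = ["effect", "effect", "negative", "mechanism", "advise"]

# normalize="true" divides each row by its gold-label count, so every row
# shows the prediction distribution for one gold label, as in Fig. 5.
cm = confusion_matrix(gold_labels, pred_labels,
                      labels=LABEL_ORDER, normalize="true")
```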

Table 2. The performance of DDI extraction on in-context few-shot prompt learning methods
Fig. 5. Normalized confusion matrix of the gold labels and predictions from Gemini-Pro with and without CoT.

5.2 Seq2seq-Based Relation Extraction Enhanced by LLMs

Table 3 shows the F-score comparison between the baseline models and the seq2seq-based models. The seq2seq-based DDI extraction model with the Flan-T5 backbone showed an F-score of 82.25%, which is lower than that of the classification-based baseline model. In particular, its precision is much lower than that of the classification-based method. The CoT model with the explanations generated by Gemini-Pro showed a lower F-score than the model without CoT.

Table 3. The performance of DDI extraction on seq2seq-based methods
Table 4. The performance of DDI extraction on classification-based methods. * indicates performance improvement over PubMedBERT (baseline) at a significance level of p < 0.05
Table 5. The comparison of F-scores for individual DDI types on the DDIExtraction-2013 test dataset. Mech. and Int. denote Mechanism and Interaction, respectively.

5.3 Classification-Based Relation Extraction Enhanced by LLMs

Table 4 shows the F-score comparison between the baseline model, the model with entity descriptions generated by Gemini-Pro, and the state-of-the-art method HKG-DDIE [2], which incorporates heterogeneous knowledge graph information into the DDI extraction task. Using the entity descriptions generated by Gemini-Pro improved the F-score by 1.77 pp, a significant improvement under the permutation test. Table 5 shows the performance comparison for the individual DDI types. The model with entity descriptions showed higher performance than the baseline model on the Mechanism, Effect, and Interaction relation labels, while showing lower performance on the Advise relation type. In particular, our proposed model improved the F-score on the Interaction type by 12.41 pp. These results show the effectiveness of leveraging LLMs in classification-based DDI extraction methods.

6 Conclusion

In this paper, we proposed three methods that leverage LLMs for the DDI extraction task. We showed that in-context few-shot learning with LLMs struggles on biomedical relation extraction tasks, consistent with previous reports. We then investigated seq2seq-based relation extraction in the biomedical domain. The seq2seq-based models showed a lower F-score, which stems from their low precision. We added CoT explanations generated by LLMs to the seq2seq-based models, but the CoT explanations did not improve DDI extraction performance. We further showed that entity descriptions generated by LLMs can improve the performance of the classification-based relation extraction method on the DDIExtraction-2013 task.