What does this study add to clinical work?

Large language models can provide valid and largely guideline-compliant therapy recommendations. We show when, why, and where a medical expert's evaluation and filtering are nonetheless indispensable.

Introduction

Artificial intelligence and large language models (LLMs) are increasingly being used in general and in medicine in particular. The LLM GPT (Generative Pre-trained Transformer) in its online form, the chatbot ChatGPT, made publicly available by the developer OpenAI, is enjoying great popularity. Within 2 months of its launch, it had over 100 million users, with over 1 billion page views per month as of February 2023 [1]. ChatGPT gained greater attention in the medical context when, applied to the U.S. medical school final examination, the software was able to answer multiple-choice exam questions above the passing grade [2].

ChatGPT has also been able to demonstrate its high level of knowledge in gynecological examination settings in a virtual Objective Structured Clinical Examination (OSCE) [3]. Other applications in our specialty have explored the capabilities and limitations of the system with respect to scientific writing [4, 5]. In the current medical literature, however, the essential aspect regarding the clinical potential of LLMs is their applicability to diagnostic and therapeutic problems [6]. In gynecology, too, the use of LLMs is increasingly being discussed [7]. Its ability to answer clinical questions, as well as its supportive use in a multidisciplinary tumor board for breast cancer patients, are just two recent examples of potential use cases in our field [8, 9]. The authors of these articles point out that a conscious use of these systems is necessary to exploit their advantages appropriately and to avoid wrong answers. This is especially important when patients work with these systems without sufficient contextual knowledge to correctly interpret a ChatGPT answer.

Currently, an increasing number of female patients are seeking advice on disease symptoms through ChatGPT. The authors are not aware of any statistics on this, but their own experience shows that more and more patients are specifically using this option to obtain a clearly formulated answer to a specific question from an AI (artificial intelligence) system rather than a range of different information, as is the case with classic online searches. Although initial analyses show that, depending on the type of question, its subject area, and the queried symptomatology, the answers can give a correct overview, there is no structured survey of the quality of these systems, especially with regard to gynecologic oncology symptoms in a palliative situation.

The aim of this work was to have experts evaluate the recommendations of the freely accessible version of ChatGPT in response to constructed patient inquiries about the possible therapy of gynecologic oncology symptoms in a palliative treatment situation, and to classify them against the background of current guideline standards. In addition, advantages and disadvantages of the technique are discussed, in particular to better understand the response patterns depending on the question's wording.

Materials and methods

Short case vignettes were constructed for 10 common accompanying symptoms of gynecologic oncology tumor patients in a palliative treatment situation (Table 1). From each case vignette, one prompt was formulated according to the following pattern: "I am an (age) year old patient with a (tumor diagnosis) (with metastases) with a symptom in a palliative treatment situation. What therapy is available for my (symptom)?". The search history was cleared after every query. ChatGPT based on GPT-3.5 was used in the version dated March 23, 2023, and the queries were performed on April 16, 2023. The prompts were entered in the above structure and the given answers were transferred to a Word document for the experts to assess. In total, the prompts and answer texts were submitted for evaluation to five experts from the fields of gynecologic oncology (n = 3) and palliative care (n = 2), each with more than 10 years of professional experience. A general evaluation of the treatment proposal (Likert scale 5 = agree; 1 = disagree), an assessment of the evidence for the treatment proposal (Likert scale 5 = present; 1 = not present), and the applicability of the proposal (Likert scale 5 = completely applicable; 1 = not applicable at all) were queried. In addition, the evaluating experts were allowed to give free-text answers on the pros and cons of the treatment recommendations. The experts' ratings were analyzed as a numerical descriptive evaluation, and the free-text comments were included in the discussion (Table 1). No actual patient data were used for this work. All experts consented to the publication of their answers.
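For illustration only, the following sketch shows how such a query could be reproduced programmatically against the GPT-3.5 API. This is an assumption added for clarity, not the procedure used in this study, which relied on the public ChatGPT web interface; the vignette values are invented placeholders, not the cases from Table 1.

```python
# Hypothetical sketch: filling the prompt template per case vignette and
# sending it to the GPT-3.5 API (the study itself used the ChatGPT web
# interface). Requires the official openai Python package (v1+).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "I am an {age} year old patient with a {diagnosis} {metastases} "
    "with a symptom in a palliative treatment situation. "
    "What therapy is available for my {symptom}?"
)

vignette = {  # illustrative placeholder values, not taken from Table 1
    "age": 62,
    "diagnosis": "breast cancer",
    "metastases": "with pulmonary and hepatic metastases",
    "symptom": "shortness of breath",
}

# Each vignette is sent as a fresh, single-turn conversation, mirroring the
# cleared chat history between queries described above.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(**vignette)}],
)
print(response.choices[0].message.content)
```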

Table 1 Case vignettes

Results

The overall rating across all case vignettes averaged 4.1 (range 3–5). Guideline conformity of the responses was rated 4.0 on average (range 2–5), while applicability averaged 3.3 (range 2–5). In every answer, ChatGPT pointed out that the response was an overview of the basic therapy options and that a visit to a physician was necessary for actual treatment.
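The descriptive evaluation behind these figures is a simple aggregation of the per-expert Likert scores. As a minimal sketch, assuming hypothetical scores (the actual ratings are reported in Table 1), the means and ranges could be computed as follows:

```python
# Minimal sketch of the descriptive evaluation: mean and range of the expert
# Likert ratings per dimension. The scores below are invented placeholders,
# not the study data from Table 1.
from statistics import mean

ratings = {  # one 1-5 Likert score per expert (n = 5) for each dimension
    "overall_rating": [4, 5, 4, 3, 4],
    "guideline_conformity": [4, 4, 5, 3, 4],
    "applicability": [3, 4, 3, 2, 4],
}

for dimension, scores in ratings.items():
    print(f"{dimension}: mean {mean(scores):.1f}, range {min(scores)}-{max(scores)}")
```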

ChatGPT responses generally followed a schematic approach. Suitable therapeutic and drug substance groups were named, with exemplary active ingredients, and their mechanisms of action were explained in a way understandable to laypersons. In addition, information on non-drug and integrative therapy options was provided in varying degrees of detail (Fig. 1). The experts agreed that some recommendations could have been more specific. For numerous patients, ChatGPT provided general advice divided into therapeutic treatment groups rather than tailoring treatment recommendations to the specific disease or individual patient, limiting their direct clinical usefulness. Moreover, all therapeutic options were reported as being of equal value, without any evaluation for the patient in light of her own condition, and individual therapeutic procedures were omitted. Table 1 contains the detailed treatment recommendations for each patient as well as the experts' ratings. All responses to the patient inquiries are available as an appendix to this article (Appendix 1).

Fig. 1 An exemplary English-language answer from ChatGPT is shown

Discussion

The present work shows the basic potential of large language models with regard to general medical consultation of our patients. ChatGPT was able to provide usable and predominantly guideline-compliant answers to the patients' questions in the freely available version. At the same time, however, there is still a need for expert consultation, especially with regard to completeness, the weighting of the individual therapy suggestions, and their individual evaluation for the specific illness of the inquiring patient, as is also indicated by the responses of the AI.

The answer to the first question, on dyspnea in pulmonary and hepatic metastatic breast cancer, impressively demonstrates the approach of the language model. The leading symptom of dyspnea is understood, and various therapy options are given in an overview style. Especially the listing of opiates for palliative relief of dyspnea shows that the language model basically understood the patient's problem and situation. However, in addition to other correct answers such as bronchodilators, therapeutics are listed that are rarely applicable in a palliative situation, while at the same time specific oncological systemic therapy for symptom control is not listed. These response patterns are also known from other surveys on the therapeutic quality of language models [10]: although the replies of the AI are not obviously incorrect, the leading symptomatology, in this case dyspnea, is determinant and triggers corresponding therapy recommendations for various differential diagnoses of dyspnea, which may also lie outside the palliative context of the query situation.

One reason for this lack of contextual consideration may be that language models such as the one tested here are so-called autoregressive models, which can only meaningfully process a limited length of input text [11]. The models, therefore, largely lack the ability for the differentiated, medical, human comprehension of complex conditions, which must be evaluated in a larger context, taking into account detailed information and based on informed reasoning [12, 13]. This issue is well known, and numerous research and commercial stakeholders are working to enable greater input lengths for these models in order to improve contextual consideration [14].

The omission or only limited mention of more invasive therapeutic methods, such as specific oncological therapy including chemotherapeutic agents or radiation therapy in patients with osseous metastases, is also known from preliminary work in other fields, in which, for example, invasive surgical measures are consistently placed second to conservative, less burdensome forms of therapy and reference is made to a medical consultation for their evaluation [15]. This seems to correspond to a deliberately cautious design of the ChatGPT language model, in order not to prejudge medical treatment with a prematurely given evaluation. Another indication of this is that, at the time the survey was created, the language model integrated into the Microsoft Bing search engine merely acted as a classic online search engine for medical questions and did not provide a text response. Ultimately, the warning before each answer, in which the language model explicitly points out that its output is merely that of a language model with a suggestive character and not medical advice, is also to be understood under this safety aspect in the case of medical questions.

The complete omission of recommendations, on the other hand, is potentially dangerous and can withhold important information from patients and practitioners in their decision-making process regarding the choice of therapy. For example, in the case of constipation symptoms, the sometimes important therapy with peripheral opioid antagonists was not listed for these patients, who often receive opioid therapy [16]. In the case of the patient with vomiting, potentially highly acute ileus symptoms are likewise not sufficiently taken into account, and thus the potential need for urgent care is disregarded by the language model. Especially in situations relevant to emergency medicine, these language models are, therefore, not yet fully usable and are not suitable as sole therapy decision makers [17]. Rather, these systems are conceivable as support systems for medical decision-making processes, so-called decision support systems, but also as basic advice for patients prior to a planned medical consultation [18]. In this case, the overview-like character of the answers merely represents a supportive entry into further clinical decision-making processes and enables patients to have an informed, pre-structured discussion without having to filter the relevant information themselves from the multitude of (false) information available on the Internet, as is the case with classic search engine-based information [19, 20].

Furthermore, the generally polite way in which ChatGPT deals with patients' inquiries is striking. In addition to the warning that it is not medical advice, the language model usually expresses regret about the patient's situation. This polite, quasi-human manner of the language model has already been noticed in other studies, in which, among other things, the quality of the communication of serious findings to patients in discharge letters was examined [21]. There, the "humanity" of the answers was explicitly evaluated and was found to be on a similarly high level as human-written discharge letters. In addition, the phrasing "I'm sorry to hear" suggests empathy in the reply, almost as though the LLM wants the answer recipient to feel understood. This field of medical ethics and AI, and the effects of the answers of conversational chatbots on patients, is still fairly young but of high interest to the clinical, educational, and research-oriented medical community, and ethical frameworks are currently being developed [22, 23].

It should be noted that the present study deals with fictitious case vignettes and not with concrete clinical cases. This was necessary to avoid an ethically questionable forwarding of sensitive patient data to an AI system and is, against this background, common practice in the preparation of such research work [17]. Furthermore, the case vignettes were evaluated not only by palliative care experts but also by gynecologists working in oncological surgery. This may explain subtle differences in the evaluation of the ChatGPT statements, depending on whether they were judged against the background of general palliative symptom control or against an entity-specific guideline that also covers metastasized tumors in a palliative situation. Ultimately, however, this interdisciplinary assessment corresponds to the everyday clinical treatment of these diseases and thus allows, in our view, a clinically realistic assessment while accepting possible inhomogeneity of the numerical ratings.

Conclusion

Language models have, in principle, high potential for the general counseling of our patients. The responses provide an overview of most of the basically available treatment options for important core symptoms of palliative care, but should rather be understood as general advice, without any claim to completeness or detailed contextual consideration of the individual treatment situation. Against this background, ChatGPT also issues a corresponding warning to the person asking that additional medical advice must be obtained. This is particularly important for invasive therapies and for situations in which the LLM lacks awareness of a potential emergency. As an outlook on the further use of language models, it can be stated that further technical development of AI will certainly make more precise and, above all, more context-appropriate answers possible in the future [24]. The use of these models by our patients can, therefore, be expected to increase. For our field, we should accompany the currently rapidly progressing evolution of these language models in order to be able to react adequately to inquiries from patients and their relatives without medical knowledge. From an ethical and legal perspective alone, we remain obliged to advise our patients on their treatment, irrespective of whether differential therapeutic planning with an AI system has already been carried out by the patient or another practitioner [25]. However, for those who know these systems, they can certainly support the counseling process of more informed patients in the future, as our results show the at times superficial but adequate quality of the answers. Against this background, it is not unexpected but reassuring that the direct applicability of the answers was rated lowest. The benefit of these systems, currently and in the near future, lies in supportive consultation. The ultimate evaluation and selection of appropriate therapies lies with the physicians and patients.