Web-based large language models (LLMs) are a subset of artificial intelligence and have become widely popular, with ChatGPT reaching over 100 million users shortly after its release [1]. These artificial intelligence platforms undergo multi-stage training on articles, books, and other online content to generate conversational, human-like responses to user queries [2]. Through this process, LLMs iteratively learn language from word associations, enabling them to recognize, interpret, and generate text without fine-tuning [2]. Due to their public accessibility and convenient user interfaces [3, 4], there is significant interest in the ability of LLMs to provide users with recommendations for healthcare-related queries [5,6,7]. Up to 90% of Internet users, including both patients and clinicians, search for health-related information online [8,9,10], as it is immediate, convenient, and generally free. Moreover, up to 80% of Internet users feel that the online health information they retrieve is reliable [11].

However, the responses generated by LLMs are not verified by health professionals, leading to concerns about the accuracy and safety of chatbot medical advice [9]. The provision of inaccurate clinical recommendations by chatbots has the potential to negatively impact patient safety [12]. While most chatbots provide a disclaimer that their responses should not be taken as medical advice, the healthcare community has an obligation to study and report the performance of these tools on behalf of our patients, especially while more rigorous standards for assessment are still in development [13]. The accuracy of clinical recommendations provided by LLM-linked chatbots has health implications for patients with common medical problems.

Gastroesophageal reflux disease (GERD) affects 18.1–27.8% of North Americans [14]. It has been reported that 93% of ChatGPT-derived recommendations for the management of GERD are appropriate based on expert physician opinion [15]. However, GERD can be managed with various medical and surgical options, increasing the difficulty of making treatment decisions [16]. Surgical decision-making in the treatment of GERD is especially multifactorial [17], necessitating various technical considerations [16]. Patient factors, response to treatment, complex diagnostic studies, and patient-tailored assessments are all incorporated into successful strategies. Thus, gastrointestinal surgeons, gastroenterologists, primary care providers, patients, and researchers would benefit from a structured investigation of the ability of chatbots to provide accurate treatment advice for GERD.

Given how recently LLM-linked chatbots have risen to prominence, objective measures of their clinical performance remain early in development and validation. High-quality Chatbot Assessment Studies must report transparent, reproducible methodology to facilitate the interpretation of study findings by readers. In the absence of formal evaluation tools for Chatbot Assessment Studies, the use of standardized patient cases with expert input and assessment against high-quality evidence facilitates the evaluation of chatbot performance in providing clinical recommendations. Thus, the aim of this study was to assess the performance of LLM-linked chatbots in providing recommendations for the surgical management of GERD, using recently published SAGES guidelines as an objective measure of chatbot performance [16].

Materials and methods

Study objectives

The primary objective of this study was to assess whether LLM-linked chatbots could provide accurate recommendations for the surgical management of GERD to both patients and surgeons based on their alignment with SAGES guideline recommendations. Secondary objectives were to evaluate whether LLM-linked chatbots could provide accurate ratings of the certainty of the evidence based on their alignment with SAGES guideline ratings, as well as to identify whether chatbots would provide incongruent recommendations to patients and surgeons. Evidence cited by chatbots to support their recommendations was also explored.

Query strategy

Hypothetical adult and pediatric patient cases and prompts were based on key questions (KQs) from the SAGES guidelines for the surgical treatment of gastroesophageal reflux disease (GERD) [16]. The patient cases were constructed to reflect the population, intervention, and comparator addressed by the clinical recommendations for each KQ. This information was combined with input from expert general and foregut surgeons to develop clinical questions; KQs for surgeon inquiries were phrased with appropriate medical terminology, while KQs for patient inquiries were worded in lay terminology. The pediatric patient case prompts were phrased such that a parent or caregiver was asking the chatbot for recommendations for their child.

On November 16th, 2023, prompts were tested across the LLM-linked chatbots ChatGPT-4 (GPT-4-0613), Copilot (formerly Bing Chat), Google Bard, and Perplexity AI. These LLMs were chosen from among the most frequently assessed chatbots for clinical application based on an internal scoping review. Bing Chat was queried prior to its full rebranding to Copilot on December 1st. Copilot (and Bing Chat on the date of query) is built on OpenAI’s GPT-4 and DALL-E 3 and was set to the “More Precise” mode. Google Bard used an experimental model named PaLM 2 that was last updated on the day of the search query. Perplexity AI accessed its fine-tuned version of OpenAI’s GPT-3.5 using the “co-pilot” mode for all queries. Prompt testing was used to identify generic chatbot responses or responses that did not provide meaningful information, and follow-up prompts were trialed to bypass obstructive responses such as legal disclaimers, which would otherwise dilute the meaningful information obtained. The patient cases and KQs were used in standardized prompts to query the LLM-linked chatbots for clinical recommendations for surgeons (Table 1) and for patients (Table 2). Follow-up prompts were created, and specific scenarios for their use were defined a priori (Table 3). Neither prompts nor follow-up prompts contained any reference to major surgical societies, organizations, or countries, to mitigate bias. All prompts were reviewed by a second team member for grammatical correctness and appropriateness for the study. All prompts were in English.

Table 1 Clinician prompts
Table 2 Patient prompts
Table 3 Follow-up prompts

ChatGPT-4, Copilot, Google Bard, and Perplexity AI were queried on November 16th, 2023, from a computer server in Hamilton, Ontario, Canada. A hotspot program was used to access Google Bard from the USA, as it was not yet accessible in Canada. All LLMs were freely accessible with the exception of ChatGPT-4, which cost approximately $20 USD per month at the time of writing. All chatbots were queried by two different study team members using the same prompts and follow-up prompts to ensure the consistency of recommendations made by the chatbots (Supplementary Appendix 1). All prompts were entered into a fresh chat window with no prior history in the session. Prompts for surgeon inquiries were entered sequentially in separate chat windows from those used for patient inquiries. For surgeons, all prompts began with “I am a surgeon” to prime the chatbots. For patients, all prompts employed layperson terminology, such as “Should I receive surgery for heartburn?” Specific responses for which the use of follow-up prompts was indicated were identified a priori during the prompt testing phase. For surgeons, these included, but were not limited to, medical disclaimers that the chatbot is not a doctor and/or could not provide medical recommendations, being told to seek a surgical consultation, being told that the patient should trial more medications and lifestyle modifications, and being told to seek the opinions of physicians of other specialties or other health professionals (Table 3).
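The querying workflow above can be summarized as a loop over chatbots, audiences, and KQs with conditional follow-up prompting. The sketch below is a hypothetical illustration only: in this study all prompts were entered manually through each chatbot's web interface, so query_chatbot, the obstructive-response patterns, and the follow-up wording are assumed placeholders rather than real APIs or verbatim study prompts.

```python
# Hypothetical sketch of the querying protocol. Prompts were entered manually
# through each chatbot's web interface in this study; query_chatbot() is a
# placeholder, and the patterns and follow-up wording below are illustrative.

OBSTRUCTIVE_PATTERNS = [            # response types defined a priori (Table 3)
    "i am not a doctor",
    "cannot provide medical advice",
    "seek a surgical consultation",
    "trial more medications",
]

FOLLOW_UP_PROMPT = "Please provide a specific recommendation."  # assumed wording


def query_chatbot(chatbot: str, prompt: str, fresh_session: bool) -> str:
    """Stand-in for entering a prompt into a chat window and copying the reply."""
    return ""  # responses were collected manually in this study


def collect_response(chatbot: str, prompt: str) -> str:
    """Enter a standardized prompt in a fresh chat window; apply the follow-up
    prompt only when a predefined obstructive response is detected."""
    response = query_chatbot(chatbot, prompt, fresh_session=True)
    if any(pattern in response.lower() for pattern in OBSTRUCTIVE_PATTERNS):
        response = query_chatbot(chatbot, FOLLOW_UP_PROMPT, fresh_session=False)
    return response


chatbots = ["ChatGPT-4", "Copilot", "Google Bard", "Perplexity AI"]
surgeon_prompts = {"KQ1": "I am a surgeon. ..."}                      # Table 1
patient_prompts = {"KQ1": "Should I receive surgery for heartburn?"}  # Table 2

responses = {
    (bot, audience, kq): collect_response(bot, prompt)
    for bot in chatbots
    for audience, prompts in (("surgeon", surgeon_prompts), ("patient", patient_prompts))
    for kq, prompt in prompts.items()
}
```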

Performance evaluation & response classification

Accurate performance was defined as the alignment of LLM advice with current SAGES guideline recommendations for adult and pediatric patients with GERD. Additionally, we evaluated whether LLM-linked chatbots could accurately cite the certainty of the evidence, based on the alignment of chatbot responses with SAGES guideline statements on the certainty of the evidence. The certainty of the evidence characterizes the strength of the evidence used to make guideline recommendations. A data collection form was developed to collate prompt and response data. Descriptive statistics, including counts and percentages, were used to report dichotomous outcomes. Dichotomous outcomes included whether or not responses to prompts aligned with guideline recommendations. Responses judged not to align with guideline recommendations included those providing guidance conflicting with SAGES guideline recommendations, those without meaningful answers, and those that did not make a recommendation for or against an intervention or comparator. Responses were judged to be successful if they gave guidance concordant with SAGES guideline recommendations or stated that it was “reasonable” or “appropriate” to proceed with a given intervention or comparator in alignment with SAGES recommendations. Surgeon recommendations were compared to patient recommendations to identify incongruent guidance. When comparing information given to surgeons and patients, recommendations were classified as discordant when recommendations given to surgeons contradicted those given to patients, or when no meaningful guidance was given to one group while a recommendation was given to the other. Chatbot guidance that was indifferent between the intervention and comparator was considered correct if the corresponding guideline recommendation also did not recommend one intervention over the other in any situation. Similarly, ratings of the certainty of the evidence were judged to be accurate based on their alignment with the SAGES guideline rating for each recommendation. Two team members evaluated responses in a blinded fashion such that they could not identify the chatbot that produced any given response. Conflicts were resolved in a synchronous session, with a third expert general surgeon team member available to adjudicate as needed. All researchers were trained on response evaluation through exposure to the above criteria and three pilot questions. Evidence cited by chatbots to support their recommendations was described in narrative form.
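As a minimal sketch of how the dichotomous outcomes and discordance classification described above could be tabulated, assume each adjudicated response is reduced to a recommendation direction plus a flag for guideline alignment; the example data and field names below are illustrative assumptions, and the adjudication itself was performed by two blinded human raters, not by code.

```python
# Minimal tabulation sketch, assuming each adjudicated response is reduced to a
# direction ("for", "against", or "none") plus a boolean for guideline alignment.
# The entries below are illustrative, not actual study data.

from collections import defaultdict

# ratings[(chatbot, audience, kq)] = (direction, aligned_with_sages_guideline)
ratings = {
    ("Google Bard", "surgeon", "KQ1"): ("for", True),
    ("Google Bard", "patient", "KQ1"): ("none", False),
    ("Perplexity AI", "surgeon", "KQ1"): ("against", False),
    ("Perplexity AI", "patient", "KQ1"): ("against", False),
}

# Primary outcome: counts and percentages of KQs with guideline-aligned responses.
accuracy = defaultdict(lambda: [0, 0])    # (chatbot, audience) -> [aligned, total]
for (bot, audience, kq), (direction, aligned) in ratings.items():
    accuracy[(bot, audience)][0] += aligned
    accuracy[(bot, audience)][1] += 1
for (bot, audience), (aligned, total) in sorted(accuracy.items()):
    print(f"{bot} ({audience}): {aligned}/{total} ({100 * aligned / total:.1f}%)")

# Secondary outcome: surgeon vs patient discordance for each KQ. Guidance is
# discordant when the directions contradict, or when meaningful guidance was
# given to one audience but not the other.
for bot, kq in sorted({(bot, kq) for (bot, _, kq) in ratings}):
    surgeon_dir = ratings[(bot, "surgeon", kq)][0]
    patient_dir = ratings[(bot, "patient", kq)][0]
    label = "discordant" if surgeon_dir != patient_dir else "concordant"
    print(f"{bot} {kq}: {label}")
```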

Results

Accurate recommendations for the surgical management of GERD in an adult were provided for 5/7 (71.4%) KQs by ChatGPT-4, 3/7 (42.9%) KQs by Copilot, 6/7 (85.7%) KQs by Google Bard, and 3/7 (42.9%) KQs by Perplexity AI according to the SAGES guidelines (Table 4). The certainty of the evidence was appropriately provided for 4/7 (57.1%) KQs by ChatGPT-4, 2/7 (28.6%) KQs by Copilot, 3/7 (42.9%) KQs by Google Bard, and 0/7 (0.0%) KQs by Perplexity AI based on guideline recommendations (Table 4). Patient recommendations for an adult were appropriately given for 3/5 (60.0%) KQs by ChatGPT-4, 2/5 (40.0%) KQs by Copilot, 4/5 (80.0%) KQs by Google Bard, and 1/5 (20.0%) KQs by Perplexity AI (Table 4). ChatGPT-4 gave no clinically meaningful recommendation when asked whether to proceed with laparoscopic versus robotic surgery for a patient who was bothered by their PPI use. Based on the SAGES guidelines, no chatbot provided the correct recommendations for robotic versus laparoscopic fundoplication for an adult concerned about the effectiveness of surgery, the need for reoperation, and postoperative complications.

Table 4 Alignment of chatbot responses with SAGES guideline recommendations for adults

Accurate recommendations for the surgical management of GERD in a child were provided for 2/3 (66.7%) KQs by ChatGPT-4, 3/3 (100.0%) KQs by Copilot, 3/3 (100.0%) KQs by Google Bard, and 2/3 (66.7%) KQs by Perplexity AI according to the SAGES guidelines. The certainty of the evidence was appropriately provided for 0/3 (0.0%) KQs by ChatGPT-4, 0/3 (0.0%) KQs by Copilot, 2/3 (66.7%) KQs by Google Bard, and 0/3 (0.0%) KQs by Perplexity AI based on guideline recommendations (Table 5). Recommendations for a pediatric patient were appropriately given for 2/2 (100.0%) KQs by ChatGPT-4, 2/2 (100.0%) KQs by Copilot, 1/2 (50.0%) KQs by Google Bard, and 1/2 (50.0%) KQs by Perplexity AI based on the SAGES guidelines (Table 5). All chatbots responded with clinically meaningful recommendations. No chatbot was able to appropriately rate the certainty of the evidence underlying the recommendation for minimal versus maximal dissection for a child receiving surgical fundoplication for refractory GERD (Table 5).

Table 5 Alignment of chatbot responses with SAGES guideline recommendations for children

ChatGPT-4 cited recommendations from the American College of Gastroenterology (ACG) 2022, SAGES 2021, and the United European Gastroenterology (UEG)/European Association of Endoscopic Surgery (EAES) 2021 guidelines. Copilot cited SAGES guidelines from 2021. Google Bard cited guidance from the ACG 2021, American College of Physicians (ACP) 2015, SAGES 2021, and joint recommendations from the North American Society for Pediatric Gastroenterology, Hepatology, and Nutrition (NASPGHAN) and the European Society for Pediatric Gastroenterology, Hepatology, and Nutrition (ESPGHAN) guidelines. Perplexity AI cited recommendations from the SAGES 2021 guidelines.

Discussion

This study evaluated the ability of various LLM-linked chatbots to provide recommendations for the surgical management of GERD based on guidelines published by SAGES [16]. We observed that LLM-linked chatbots provided recommendations with inconsistent accuracy and gave discrepant responses to physician and patient inquiries. Google Bard was the most accurate in providing recommendations for both physicians and patients when compared with ChatGPT-4, Copilot, and Perplexity AI, with ChatGPT-4 following closely behind. ChatGPT-4 was marginally more accurate than Google Bard in rating the certainty of the evidence, while Perplexity AI performed the worst in this domain. However, no chatbot provided correct guidance for all clinical questions based on SAGES guideline recommendations. We also found that chatbot-derived ratings of the certainty of the evidence underlying their recommendations were often inaccurate when compared with the ratings in the SAGES guidelines. Though there is promise in the clinical application of chatbots for patient and physician recommendations, significant improvements must be made to optimize safety for both adult and pediatric patients.

Machine learning has been used to identify patients at risk for acute appendicitis and choledocholithiasis in children [18, 19]. In adults, machine learning has been used to predict which patients are at risk for GERD following bariatric surgery [20], as well as to classify the severity of GERD using endoscopic images [21]. However, the accuracy of LLM-linked chatbots in providing clinical recommendations for the management of GERD to patients and clinicians is not well characterized, despite a growing population searching for health advice online [8]. Our findings suggest that chatbots perform differently when providing clinical advice for adult patients compared to pediatric patients. Google Bard and ChatGPT-4 answered the highest proportion of key questions correctly for adults, but Copilot and Google Bard performed best for cases relating to children. It is noteworthy that prompts were initially tested in Google Bard prior to the other LLMs during the development of standardized prompts. Using the tool itself to test the prompts has the potential to significantly impact model output, as the tool learns during this phase [22]. This process may explain the superior performance of Google Bard over ChatGPT-4 demonstrated here. While Rahsepar et al. reported the converse when evaluating lung cancer screening recommendations provided by these LLMs [22], computer scientists recognize that reinforcement learning, such as that occurring during prompt testing, can both strengthen and weaken model behaviors [23]. The unpredictability of these tools remains a challenge to be overcome before their wide adoption in clinical applications for patient care.

No LLM provided consistently accurate recommendations for all key questions based on SAGES guideline recommendations in this study. Similarly, Henson and colleagues interrogated ChatGPT for guidance surrounding the diagnosis and management of GERD and found that 29% of questions were answered with complete appropriateness, while 62.3% were considered mostly appropriate [15]. In both studies, inappropriate or incorrect clinical guidance would have been provided to both surgeons and patients. Even the highest-performing model for accurate recommendations, Google Bard, provided guidance that conflicted with SAGES guideline recommendations. A significant limitation of LLMs is that they are susceptible to hallucinations, that is, generating confident answers that may be false or are not justified by their training data [24]. As many of these models are freely accessible online, this poses a significant risk to patient safety. Many online news reports state that ChatGPT and other LLMs may provide health management comparable to that of a physician, largely based on studies showing that they can pass licensing examinations [25, 26] or even respond to patient inquiries with greater empathy than clinicians [5]. However, it is essential to recognize that these chatbots predict the next word in a phrase based on the language that they have learned from their training datasets [2]. In this context, these models are not synthesizing and interpreting evidence to provide clinical recommendations in the manner used to develop clinical practice guidelines. Our study is the first zero-shot evaluation to highlight this in the clinical context, as the ability of LLMs to rate the certainty of the evidence supporting their clinical recommendations was poor.

Gastrointestinal surgeons who perform anti-reflux and foregut surgery, other clinicians involved in the treatment of GERD such as family physicians, gastroenterologists, and allied health professionals, and patients with GERD should be aware of the limitations of LLM-linked chatbots in providing clinical advice. Despite their accessibility, these chatbots do not synthesize evidence directly, must often be prompted to provide citations, and have been shown to provide inaccurate information. The application of current LLM-linked chatbots in the clinical setting may therefore negatively impact patient care. Prior to the entrance of LLMs into the mainstream, just under half of adults searched the internet for health information or advice [8, 11], using resources such as Google or Wikipedia [27, 28]. Moreover, 80% of patients perceive these online resources to be reliable [11]. Few comparisons have been made between LLMs and traditional online sources. One study reported that health advice for postoperative otolaryngology care generated by ChatGPT scored lower than Google in understandability, actionability, and procedure-specific content [29]. However, different prompts were used to search ChatGPT versus Google, which clouds the interpretation of these findings. In contrast, Hristidis and colleagues found that ChatGPT generated more relevant responses than Google for health information related to dementia [30]. Furthermore, the ability of LLMs to conveniently synthesize online information in a single resource will likely only increase the number of users who turn to them for health advice. Without regulation and quality improvement, the clinical advice from LLMs may impair the ability of patients to understand the treatment plans recommended for them, with the potential to negatively impact their care. Policymakers and hospital managers should take note that LLMs are currently unable to reliably provide accurate recommendations for patients. However, as these models improve, we will likely see their gradual integration into health systems used in the hospital setting. In particular, the use of institutional data to train closed, inaccessible models to generate tailored patient recommendations based on local outcomes is a key area for future research. Mahajan and colleagues successfully trained a machine learning model using hospital network data to develop a surgical risk prediction tool [31]. Furthermore, such models could be trained to develop a publicly accessible LLM that summarizes societal recommendations as a central resource. Still, this innovation must be pursued with the utmost regard for patient safety, balancing the potential of these models to positively transform patient care against their shortcomings.

This study has limitations. Though high-quality guidelines were used as an objective measure of performance, the quality of currently available primary data limits the certainty of the evidence for many guideline recommendations. Certain surgeon prompts, such as minimal versus maximal dissection of the short gastric vessels in adult patients, could not be answered due to the lack of literature available to inform guideline recommendations. Additionally, these LLM-linked chatbots are not trained specifically for medical applications. Most chatbots are closed, proprietary models, and little is known about their inner workings. LLMs are also limited by the information learned from their training datasets, which may further impact their performance in a clinical setting. Notably, LLM-linked chatbots are improving dynamically, and the results of this study reflect the current state of these tools. Finally, prompts were not generated by patients, and the results of this study must be interpreted accordingly. The strength of this study lies in its rigorous methodology, including a transparent testing phase, standardized prompts, and an objective measure of LLM performance. While reporting guidelines are in development [13], it is imperative that future Chatbot Assessment Studies adhere to robust methodology and transparent reporting standards. Emphasis should be placed on the development of “open” or accessible LLMs trained on clinical datasets. The potential for local datasets to be used to develop LLMs capable of supporting surgical decision-making based on institution-tailored outcome data cannot be overstated.

Conclusion

LLM-linked chatbots are a promising technology within the field of artificial intelligence. Though their widespread accessibility and fluent linguistic abilities position them well to support patients and providers with health recommendations, they currently provide recommendations for surgical decision-making with inconsistent accuracy. Gastrointestinal surgeons, gastroenterologists, and other healthcare professionals involved in the management of patients with GERD must be aware that patients may present to their care after using LLMs for health recommendations, and of the current limitations of these tools. Policymakers and hospital managers must recognize the potential of LLMs to greatly improve patient care in the clinical setting and anticipate their gradual integration into health systems and applications, but these advancements must be made with the utmost consideration for patient safety.