Since their introduction, large language models (LLMs) have gained popularity owing to their widespread accessibility and impressive ability to generate rapid answers to prompts in either expert or layperson language [1]. Patients are increasingly turning to LLMs for clinical advice on a range of health topics. Despite tremendous interest, the clinical application and endorsement of LLMs within healthcare remain limited [2]. While LLMs have proven valuable in various domains [3], they have been inconsistent in providing medical advice and recommendations. A major barrier to their acceptance in clinical practice is the clinical accuracy of their responses [4]: LLM-generated recommendations can lack evidence-based support or offer information inconsistent with clinical guidance.

LLMs such as ChatGPT (OpenAI) are trained on extensive datasets [5] covering a wide range of topics, but their training data lack specificity to medical contexts. With growing interest in the clinical application of LLMs, there is a need to develop tailored, medically relevant models. However, developing an LLM is time-consuming and resource-intensive, requiring unique expertise. Recently, OpenAI released a “customize GPT” function within ChatGPT-4 that enables the user to direct the responses of the LLM for specific purposes. The ability to tailor an LLM to use evidence-based recommendations as a primary data source, thereby mitigating clinical inaccuracies and inconsistencies, may provide an opportunity to significantly enhance reliability.

Tailoring pre-trained LLMs, which have already been trained extensively on diverse datasets, for medical applications offers a promising strategy [6]. Such models could offer clinically accurate recommendations after a short induction period of “customization,” avoiding the resource and time constraints associated with developing an LLM from scratch. The aim of this study was to customize ChatGPT using clinical practice guidelines, creating an LLM-linked chatbot that provides accurate clinical recommendations, and to compare its performance with that of a generic, uncustomized GPT model.

Materials and methods

Objective, model customization, & prompt engineering

On July 21st, 2024, ChatGPT-4’s “Create a GPT” feature was used to customize a version of ChatGPT for our purposes. A paucity of literature exists on the reliability of ChatGPT-4’s customization feature and the number of prompts needed to ensure its reliability. First, the model was informed that its purpose would be to guide clinicians on the surgical management of gastroesophageal reflux disease (GERD) based on the SAGES & UEG-EAES clinical practice guidelines [7]. We chose these guidelines because our team members have extensively reviewed the randomized evidence behind the surgical management of GERD. We further trained the model using the 2022 UEG/EAES guideline recommendations on the surgical management of GERD [8]. The model was instructed to use layperson language for patients, but conventional medical terminology and a professional tone when conversing with surgeons. Second, a PDF copy of the SAGES & UEG-EAES clinical practice guideline on the surgical management of GERD was uploaded to the website [7]. Examples were given to the model of patient cases and clinical questions that applied to the first two key recommendations from the 2021 guidelines. Specifically, hypothetical patient cases were posed to the custom model in progress, asking whether patients should receive surgery, or whether surgery should be performed robotically or laparoscopically. Third, feedback on whether the model’s response was correct was provided iteratively. For instance, the model was corrected to make a firm recommendation based on the guideline recommendations. Steps two and three comprised our prompt engineering/testing phase. This iterative input-and-feedback loop was repeated until the model provided a correct response aligning with the guideline recommendations for three consecutive cases in this pilot study. This process was completed by the lead author over 1.5 h. During model training, only four clinician-oriented cases and questions were posed; no patient-oriented questions were posed. Additional queries were entered into the generic ChatGPT-4 to ensure that prompts were structured appropriately to elicit responses from both models and to complete the prompt engineering/testing phase (Fig. 1).
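Although the customization in this study was performed entirely through ChatGPT-4’s web-based “Create a GPT” builder, the same instruct-and-correct loop can be expressed programmatically. The sketch below is a hypothetical analogue using the OpenAI Python SDK; the model name, system instructions, and example case are illustrative assumptions rather than the prompts actually used in the study.

```python
# Hypothetical sketch of the instruct-and-correct loop described above, written
# against the OpenAI Python SDK. The study itself used the web "Create a GPT"
# builder; the model name, instructions, and case text here are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_INSTRUCTIONS = (
    "You guide clinicians on the surgical management of GERD using the "
    "SAGES & UEG-EAES clinical practice guidelines. Use conventional medical "
    "terminology and a professional tone with surgeons, layperson language "
    "with patients, and always make a firm, guideline-based recommendation."
)

def ask(case: str, history: list) -> str:
    """Pose one hypothetical case, keeping prior turns so feedback accumulates."""
    messages = [{"role": "system", "content": SYSTEM_INSTRUCTIONS}] + history
    messages.append({"role": "user", "content": case})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    return reply.choices[0].message.content

# Steps two and three of the text as a loop: pose a case, review the answer
# against the guideline, and append corrective feedback when it deviates,
# repeating until three consecutive responses align with the recommendations.
history: list = []
case = "58-year-old with PPI-refractory GERD and a confirmed diagnosis: offer surgery?"
answer = ask(case, history)
history += [{"role": "user", "content": case},
            {"role": "assistant", "content": answer}]
# Example corrective feedback a reviewer might add:
history.append({"role": "user",
                "content": "Make a firm recommendation grounded in the guideline."})
```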

Fig. 1 Customizing a GPT Model Using ChatGPT-4

Query strategy

Standardized patient cases were developed based on key questions from the SAGES & UEG-EAES guidelines for the surgical treatment of GERD [7]. These cases specified combinations of patient age, clinical history, and clinical questions based on the relevant guideline recommendations. Each case reflected the population, intervention, and comparator addressed by the applicable key question. Clinical question phrasing was refined with input from practicing general and foregut surgeons. These cases were used as prompts to query both our customized version of ChatGPT-4 and the generic version of ChatGPT-4 on July 21st, 2024, from a computer in Hamilton, Ontario, Canada (Table 1). The most recent update to the generic ChatGPT-4 model was May 13th, 2024.

Table 1 Study prompts

Patient prompts were generated from the surgeon prompts by adjusting the phrasing to layperson terminology and limiting medical jargon. No follow-up prompts or medical disclaimers were applied. No prompts contained any reference to professional organizations, societies, or countries. All prompts were constructed in English. Each prompt was entered into a fresh chat window with no prior chat history in the session, to limit additional learning from previous input.
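As a minimal sketch of how the query strategy above could be automated, the snippet below submits each standardized prompt to both models in a fresh context, mirroring the new-chat-window-per-prompt approach. The model identifiers and prompt text are illustrative assumptions; the study queries were entered manually through the ChatGPT interface.

```python
# Illustrative automation of the query strategy described above; not the study's code.
# Each prompt is sent in a fresh message list so no prior chat history carries over.
from openai import OpenAI

client = OpenAI()

MODELS = {
    "GTS (customized)": "gpt-4",   # placeholder identifier for the custom GPT
    "generic ChatGPT-4": "gpt-4",  # placeholder identifier for the generic model
}

PROMPTS = [
    {"audience": "surgeon", "text": "A 52-year-old with PPI-refractory GERD ..."},
    {"audience": "patient", "text": "I am 52 and my reflux pills are not working ..."},
]

results = []
for prompt in PROMPTS:
    for label, model in MODELS.items():
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt["text"]}],
        )
        results.append({
            "audience": prompt["audience"],
            "model": label,
            "response": reply.choices[0].message.content,
        })
```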

Performance evaluation

Accurate performance was defined as alignment of ChatGPT responses with guideline recommendations on the surgical management of GERD [7]. Findings were reported using descriptive statistics, with counts and percentages used to characterize dichotomous outcomes. Responses that did not provide clinically meaningful advice, or that offered guidance conflicting with guideline recommendations, were judged not to align with the guidelines, demonstrating inaccurate model performance. Two team members, blinded to which chatbot model produced each response, evaluated all responses; no conflicts arose.
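For illustration only, the short sketch below shows how the dichotomous outcome described above (guideline-aligned vs. not aligned) could be summarized as counts and percentages; the ratings shown are hypothetical values, not study data.

```python
# Hypothetical summary of dichotomous alignment judgments as a count and percentage.
from collections import Counter

def accuracy_summary(ratings):
    """Return 'correct/total (percent%)' for a list of True/False alignment judgments."""
    counts = Counter(ratings)
    correct, total = counts[True], len(ratings)
    return f"{correct}/{total} ({100 * correct / total:.1f}%)"

# Placeholder ratings from the two blinded evaluators after reconciliation:
surgeon_ratings = [True] * 58 + [False] * 2
print("Surgeon queries:", accuracy_summary(surgeon_ratings))  # -> 58/60 (96.7%)
```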

Results

We analyzed a total of 60 cases presented by a hypothetical surgeon and 40 cases presented by a hypothetical patient to evaluate the recommendations provided by the customized model (GTS) and the generic ChatGPT model. The GTS correctly addressed 100% (60/60) of the surgeon’s queries and 100% (40/40) of the patient’s queries. Conversely, the generic ChatGPT model exhibited a lower accuracy rate, correctly responding to 66.6% (40/60) of the surgeon’s questions and 47.5% (19/40) of the patient’s inquiries (see Table 2). Recommendations on the surgical management of GERD generated by the GTS consistently adhered to the SAGES guidelines, whereas the generic ChatGPT model did not cite evidence to support its recommendations. No identifiable pattern was observed in the nature of the cases to which the generic ChatGPT-4 model provided incorrect responses.

Table 2 Alignment of recommendations with guidelines generated by the GTS & a generic ChatGPT-4 model for the surgical management of GERD

Discussion

This study evaluated the ability of the GTS, a customized ChatGPT model, to provide recommendations for the surgical management of gastroesophageal reflux disease (GERD) to both surgeons and patients. The GTS provided markedly more accurate recommendations than the generic ChatGPT-4 model, answering 100% of both patient and surgeon inquiries correctly while citing the guidelines on which it had been tailored. Conversely, the generic ChatGPT-4 model frequently provided inaccurate recommendations to both surgeons and patients without citing evidence to support its guidance. Surgeons and researchers should note that customizing LLMs such as ChatGPT-4 could overcome the limitations of generic LLMs for simple topics, as demonstrated by the customized GPT model in this study.

The emergence of LLMs like ChatGPT has created new opportunities to access information for patients who already seek clinical advice online [9]. However, the functionality of LLMs is often misunderstood [10]. Users may assume that LLMs access the internet in real time and apply complex algorithms to retrieve the most suitable response to their query. In reality, LLMs rely on complex neural networks developed through an iterative process of training and feedback [3]. They become sophisticated at predicting the most likely next word in a sequence, rather than searching their training data to answer a given input. Because they produce fluent language and sentence structure, their responses can appear confident even when inaccurate. Numerous studies demonstrate that generically trained LLMs provide medical advice with inconsistent reliability. One study [11] found that LLMs occasionally provided incorrect or out-of-date information and cited inappropriate sources, similar to our findings for the generic ChatGPT-4 model. However, through customization, we achieved a substantial improvement in accuracy. Similarly, custom chatbots have outperformed generic LLMs in urology and gastroenterology, often with 100% accuracy, as reported here [12,13,14]. Clinicians and researchers may take interest in this approach to customizing GPTs, which avoids the time- and resource-intensive nature of developing LLMs while potentially enhancing their reliability for clinical decision support [15]. Although OpenAI restricts access to custom GPTs to paid users and, as a closed-source entity, withholds details about the model’s inner workings, open-source LLMs exist that could be similarly customized and integrated into society webpages, clinical workflows via apps, or electronic medical record systems through multidisciplinary collaboration with machine learning researchers and data scientists.
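To make the distinction between next-word prediction and information retrieval concrete, the sketch below (illustrative only, not part of the study) uses the open-source Hugging Face transformers library with the small GPT-2 model to show that a causal language model ranks candidate next tokens rather than searching documents; the prompt text is a hypothetical example.

```python
# Minimal illustration of next-token prediction with an open-source causal LM.
# This is not the study's model; GPT-2 and the prompt below are placeholders used
# only to show that the model ranks likely next words rather than citing sources.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Patients with severe reflux that does not respond to medication may need"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # scores for every vocabulary token at each position

# Take the scores for the position after the last prompt token and list the
# five most probable continuations.
next_token_scores = logits[0, -1]
top5 = torch.topk(next_token_scores, k=5).indices
print([tokenizer.decode(int(t)) for t in top5])
```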

Despite their accessibility, caution is warranted. The guidance provided by LLMs is not consistently grounded in clinical evidence, putting patient safety at risk [16]. Numerous studies have highlighted inaccuracies in LLM decision-making and a lack of verifiable sources to support their recommendations. These findings raise ethical concerns regarding the use of LLMs in healthcare settings. In contrast, several studies have justified and recommended the use of LLMs [17] over traditional internet search engines such as Google [18]. Users must be aware of the training data cutoff of these LLMs, as the models may not reflect the latest treatment guidelines or recommendations depending on the clinical topic and context. Additionally, the lack of regulation of LLMs and their widespread accessibility make it difficult to generalize about the accuracy or reliability of their outputs. Threats to patient safety also include cybersecurity concerns [19], which involve safeguarding patient data and protecting against potential breaches or unauthorized access. Patient data must be handled ethically and in compliance with privacy regulations [20, 21]. Moreover, there is increasing awareness of the potential for bias [22] within LLMs, both in the data they are trained on and in the recommendations they generate. Bias can manifest in various ways, including disparities in the representation of different demographic groups or medical conditions, which could affect the fairness and equity of the clinical advice provided. Addressing these concerns [23] surrounding cybersecurity, data privacy, and bias is therefore crucial to ensuring the safe and ethical use of LLMs in healthcare. The clinical integration of LLMs must proceed with caution, while acknowledging the potential advantages of integrating LLMs into healthcare practices.

Professional societies, tertiary institutions, and hospitals may take interest in developing online platforms that support customized LLMs accessible via their websites or apps, offering medical advice and addressing common patient inquiries in both inpatient and outpatient settings [24]. Similarly, tailored online tools could be designed for healthcare providers, serving as evidence-based resources for guiding patient management, particularly in complex cases. Institutions [25] stand to benefit significantly from leveraging a customized LLM to optimize communication with patients and to automate repetitive tasks. Integrating these clinical tools into healthcare systems could improve patient care and satisfaction while alleviating the workload on healthcare staff. With an appropriate training dataset, the clinical integration of custom LLMs could also lower bias and tailor responses to local populations. Policymakers and healthcare managers should therefore prioritize the exploration and implementation of these solutions to create positive outcomes for both patients and healthcare professionals. While the potential benefits of these customized models are promising, the proportion of patients using LLMs for health advice is currently unknown. Moreover, greater emphasis on objective performance evaluation is needed to ensure that LLM advice aligns with the highest-quality evidence, such as clinical practice guidelines [26], while acknowledging uncertainty in more controversial or complex topics.

This study has limitations. First, all cases applied in this pilot study were hypothetical; patient-centered, prospective studies are needed to evaluate the efficacy of LLMs in providing clinical advice in a pragmatic context. Moreover, despite the advantages of customizing ChatGPT to provide accurate, guideline-based recommendations, there are significant constraints. One important limitation is that, although we can tailor ChatGPT through customization, we remain bound by the data on which ChatGPT was originally trained. The responses generated by ChatGPT may therefore be influenced by the data it was exposed to during training, potentially limiting the accuracy of its outputs in healthcare contexts. It is also essential to acknowledge that ChatGPT was not originally designed or validated for medical use. Thus, while we can customize it to address medical concerns from surgeons or patients, it lacks the formal validation [26] and regulatory approval required for clinical applications. Deploying a customized ChatGPT tool in real-world medical settings is not advisable without thorough validation studies. Further research is required to assess the reliability, consistency, and safety of using a customized LLM in healthcare practice. Validation studies would need to evaluate its performance in providing accurate and clinically relevant recommendations across a diverse range of medical scenarios. Only through rigorous validation can the trustworthiness and effectiveness of a tailored LLM be established as a viable tool for supporting healthcare providers and patients in clinical decision-making.

Conclusion

Clinicians, researchers, and patients may take interest in the ability of a customized version of OpenAI’s ChatGPT-4 to substantially improve the accuracy of advice generated for the surgical management of GERD. Customization of the LLM markedly increased the accuracy of responses to both patient-focused and provider-focused questions. With prior studies illustrating the limitations of LLMs in providing reliably accurate health advice, this approach may be applied to mitigate the time- and resource-intensive nature of developing de novo LLMs. The integration of LLMs into clinical practice must be undertaken with the utmost consideration for patient safety, privacy, and ethical and regulatory factors.