Introduction

In the age of advanced artificial intelligence (AI) and deep learning, Large Language Models (LLMs) represent a significant breakthrough in our ability to understand and generate natural language that closely resembles human writing. LLMs offer tremendous potential for healthcare professionals by providing rapid, accessible entry into the ever-expanding realm of medical knowledge. These models undergo a two-stage training process, starting with self-supervised learning from vast unannotated data and progressing to fine-tuning on small, task-specific, annotated datasets. This fine-tuning enables LLMs to perform specialized tasks tailored to end-users' needs [1].

This training paradigm highlights a fundamental disparity between machine learning and human learning. While humans can swiftly derive general and intricate associations from limited data, machines require extensive data volumes to achieve similar results, primarily because they lack common sense. The capacity of AI to absorb copious amounts of data, learn from it, and access it instantaneously stands in stark contrast to our finite capabilities, which are largely constrained by linear time [2].

One AI model that has recently gained global recognition is the Chat Generative Pre-trained Transformer (Chat-GPT), built on a model with roughly 175 billion parameters. This chatbot was trained on a wealth of text from diverse online sources, including books, articles, and websites, and refines its text generation capabilities through human feedback [3]. OpenAI, an artificial intelligence research organization and company founded in 2015, released Chat-GPT in November 2022.

Obstructive sleep apnea (OSA) is a sleep-related breathing disorder characterized by cyclic partial or complete upper airway obstruction. These cycles lead to intermittent hypoxemia, autonomic fluctuations, and sleep disruption, ultimately culminating in a chronic inflammatory systemic state associated with elevated cardiovascular risk. OSA has been linked to various complications, including hypertension, heart failure, coronary artery disease, cerebrovascular disease, metabolic syndrome, and type 2 diabetes [4]. Despite its prevalence (between 9% and 17% depending on gender) and potential repercussions, OSA remains underdiagnosed and undertreated [5].

The intersection of AI and OSA research holds immense promise for facilitating the diagnosis of this condition, not only among otolaryngologists but also among general practitioners and other medical specialists. This study aims to bridge this gap by conducting a comparative analysis of responses to a specialized OSA survey. By comparing the answers of sleep surgeons with those of Chat-GPT, we aim to contribute to the ongoing discourse in otolaryngology concerning OSA and to shed light on the role of AI-generated content in medical decision-making.

Methods

We designed a prospective, cross-sectional study to assess the level of agreement between responses to a ten-question survey provided by a panel of experts and responses generated by Chat-GPT. All experts included in the study were otolaryngologists specializing in sleep medicine. In addition, ten super-experts were selected on the basis of their exceptional expertise and academic recognition in the field of sleep-related disorders.

Survey design

We developed a survey comprising ten questions related to OSA. Each question was framed as a clinical case with four multiple-choice options. For one question (Question 5), only a single answer was correct, while for the others multiple answers were acceptable.

Data collection

The survey was distributed to a panel of 350 otolaryngologists, all experts in the field of obstructive sleep apnea, representing 25 countries across four continents (Africa, America, Asia, and Europe). Responses were collected between June and July 2023. In parallel, from July 9th to 14th, 2023, we asked Chat-GPT (version 3.5) to answer each of the survey questions. All questions were entered into Chat-GPT 3.5 by a single investigator.

Following this, we submitted the answers from both the experts and Chat-GPT to the super-experts and asked them to review each question and rate their level of agreement with each answer. Assessments were made on a five-point Likert scale (1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree) (Fig. 1).

Fig. 1 Study protocol
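For readers who wish to reproduce this type of evaluation, a minimal sketch of how each super-expert rating could be recorded is shown below. The record structure, field names, and example values are illustrative assumptions, not taken from the study materials.

```python
# Minimal sketch of one way the super-expert ratings could be recorded:
# one record per (super-expert, question, answer source) triple.
# Field names and example values are illustrative, not study data.
from dataclasses import dataclass
from typing import Literal

@dataclass
class Rating:
    super_expert: int                      # 1-10
    question: int                          # 1-10
    source: Literal["experts", "chatgpt"]  # whose consensual answer was rated
    likert: int                            # 1 = strongly disagree ... 5 = strongly agree

# Placeholder example: two ratings given by the first super-expert to Question 1
ratings = [
    Rating(super_expert=1, question=1, source="chatgpt", likert=3),
    Rating(super_expert=1, question=1, source="experts", likert=5),
]
```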

This study did not require ethical approval, as no patient-level data were used.

Statistical analysis

Quantitative and continuous variables were expressed as the mean ± standard deviation (SD). Two-tailed t tests were used to compare the mean super-expert assessments of the experts' and Chat-GPT's answers. The significance threshold was set at p < 0.05. The kappa coefficient of agreement (R) was used to analyze agreement between super-experts, with the following interpretation guidelines: R < 0.4, poor agreement; 0.4 ≤ R ≤ 0.75, intermediate agreement; R > 0.75, good agreement [6].

All statistical analyses were performed using free, validated online tools (http://justusrandolph.net/kappa/ and https://biostatgv.sentiweb.fr/).
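As an illustration of how this analysis could be run programmatically rather than with the online tools, the sketch below implements the same two steps in Python. The rating matrices are random placeholders, the pooled unpaired t test is an assumption (the pairing of ratings is not specified above), and Randolph's free-marginal multirater kappa is used because it is the statistic computed by the justusrandolph.net tool.

```python
# Sketch of the statistical analysis, assuming the super-expert ratings are
# available as two 10 x 10 matrices (rows = questions, columns = super-experts,
# values = 1-5 Likert scores). The values below are random placeholders.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ratings_experts = rng.integers(2, 6, size=(10, 10))  # placeholder ratings of experts' answers
ratings_chatgpt = rng.integers(1, 6, size=(10, 10))  # placeholder ratings of Chat-GPT's answers

# Mean +/- SD for each group, then a two-tailed t test on the pooled ratings
for name, r in [("experts", ratings_experts), ("Chat-GPT", ratings_chatgpt)]:
    print(f"{name}: {r.mean():.2f} +/- {r.std(ddof=1):.2f}")
t, p = ttest_ind(ratings_experts.ravel(), ratings_chatgpt.ravel())
print(f"two-tailed t test: t = {t:.2f}, p = {p:.4f}")

# Free-marginal (Randolph) kappa between super-experts, as computed by the
# justusrandolph.net tool: convert the subjects x raters matrix to category counts
def randolph_kappa(ratings):
    table, _ = aggregate_raters(ratings)           # counts per Likert category per question
    return fleiss_kappa(table, method="randolph")  # free-marginal multirater kappa

print("kappa, experts' answers:", round(randolph_kappa(ratings_experts), 2))
print("kappa, Chat-GPT answers:", round(randolph_kappa(ratings_chatgpt), 2))
```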

Results

A total of 97 responses (response rate 27.7%) from 25 countries across Africa, the Americas, Asia, and Europe were collected between June 26th and July 23rd, 2023 (Fig. 1). The consensus answers derived from both Chat-GPT and the experts are presented in Table 1. Table 2 reports the agreement levels between Chat-GPT and the experts for each question.

Table 1 Most consensual experts' answers and Chat-GPT's answers
Table 2 Agreement between experts' and Chat-GPT's answers

For each multiple-choice question, Chat-GPT and the experts shared a common answer in more than 75% of cases (item-by-item analysis). However, when the whole response was taken into consideration, only four questions reached 75% consensus between the experts and Chat-GPT.
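To make the distinction between the two analyses explicit, the sketch below contrasts them on hypothetical answer sets. The manuscript does not define the item-by-item criterion precisely; the sketch assumes it means sharing at least one selected option, whereas the whole-response criterion requires an exact match of the selected option sets.

```python
# Hypothetical answer sets; the actual survey answers are summarized in Table 1.
chatgpt = {"Q1": {"A", "C"}, "Q2": {"B"},      "Q3": {"A", "B", "D"}}
experts = {"Q1": {"A", "C"}, "Q2": {"B", "D"}, "Q3": {"A", "B", "D"}}

# Item-by-item analysis: the two sources share at least one selected option
item_agreement = sum(bool(chatgpt[q] & experts[q]) for q in chatgpt) / len(chatgpt)

# Whole-response analysis: the selected option sets match exactly
whole_agreement = sum(chatgpt[q] == experts[q] for q in chatgpt) / len(chatgpt)

print(f"item-by-item agreement:   {item_agreement:.0%}")   # 100% in this example
print(f"whole-response agreement: {whole_agreement:.0%}")  # 67% in this example
```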

Ten "super-experts" evaluated the most consensus-driven responses from both experts and Chat-GPT. The “super-experts” rated all expert’s responses at a value of 4/5 or more, while this rating was achieved only for 6 Chat-GPT responses.

The mean agreement level assigned by the super-experts on the Likert scale was 4.07 for Chat-GPT's responses (minimum 1, maximum 5, SD 1.22) and 4.56 for the experts' responses (minimum 2, maximum 5, SD 0.78). The difference between these values was statistically significant (p = 0.0009, Student's t test). Detailed agreement data for each question are provided in Table 3.

Table 3 Assessments provided by the super-experts of Chat-GPT's and the experts' consensual answers
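As a rough consistency check, the reported t test can be approximately reproduced from the summary statistics alone. The sketch below assumes 100 ratings per group (10 super-experts rating 10 questions each) and equal-variance pooling; neither assumption is stated explicitly in the text.

```python
# Reproducing the t test from the reported summary statistics only.
# n = 100 ratings per group (10 super-experts x 10 questions) is an assumption.
from scipy.stats import ttest_ind_from_stats

t, p = ttest_ind_from_stats(
    mean1=4.56, std1=0.78, nobs1=100,  # super-expert ratings of the experts' answers
    mean2=4.07, std2=1.22, nobs2=100,  # super-expert ratings of Chat-GPT's answers
)
print(f"t = {t:.2f}, p = {p:.4f}")     # yields p on the order of 0.0009 under these assumptions
```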

The kappa coefficient of agreement between super-experts was R = 0.44 (95% CI [0.30; 0.58]) for the assessment of the experts' responses and R = 0.17 (95% CI [0.03; 0.30]) for the assessment of Chat-GPT's responses.

Discussion

The integration of LLMs, particularly Chat-GPT, into the field of medicine has shown great promise, offering the potential to revolutionize the way healthcare professionals access and utilize medical knowledge [7, 8]. This study aimed to explore the applicability of Chat-GPT in the domain of obstructive sleep apnea (OSA), a significant health concern associated with various comorbidities and yet often underdiagnosed and undertreated [9, 10].

The results of our study, presented in Tables 1 and 2, reveal a moderate overall degree of consensus between Chat-GPT and the expert panel. For four questions, the level of agreement between Chat-GPT and the experts was high, while for the remaining questions agreement was markedly lower. The consensus answers to the ten survey questions demonstrate that Chat-GPT may be capable of providing responses that align with those of human experts, but it still needs improvement.

Moreover, our findings indicate that the level of agreement between Chat-GPT and the experts, as assessed by the super-experts, is substantial. The mean agreement levels on the Likert scale were 4.07 for Chat-GPT and 4.56 for the experts, with the latter showing slightly higher ratings. It is important to note, however, that this difference was statistically significant (p = 0.0009). This suggests that, while Chat-GPT's responses are generally concordant with expert opinion, there are instances where meaningful distinctions exist.

These distinctions may arise from the inherent limitations of AI models, including their reliance on data patterns and the potential absence of clinical intuition.

The data presented in Table 3 provide valuable insights into the super-expert assessments of Chat-GPT's answers compared to experts' consensual answers for each of the ten survey questions.

Examining the data, several points stand out. For questions Q1 and Q2, super-experts rated Chat-GPT's responses lower than the experts' consensual answers, with mean scores of 2.8 and 3.4 versus 4.1 and 4.6, respectively; the p-values of 0.01 for both questions indicate a significant difference in these assessments. For question Q4, super-experts rated Chat-GPT's response at a mean of 2 compared with 4.1 for the experts' consensual answer (p = 0.0003). This suggests that, although Chat-GPT's responses were broadly aligned with expert consensus, there remains clear room for improvement on these particular questions.

Another aspect that warrants closer examination is the level of agreement among the super-experts themselves when assessing the responses provided by the experts and by Chat-GPT. Agreement was intermediate for the experts' responses and poor for Chat-GPT's responses. These findings underscore the intricate nature of managing obstructive sleep apnea, where numerous therapeutic options exist and the literature offers little conclusive evidence to guide the selection of the optimal approach for a specific clinical presentation.

These results have several implications for the field of OSA diagnosis and treatment. Firstly, they highlight the potential of Chat-GPT as a valuable resource for general practitioners and medical specialists in the initial assessment of OSA cases. Chat-GPT's ability to provide accurate and consensus-driven responses can aid healthcare providers in making informed decisions and recommendations, especially in regions where access to sleep medicine specialists is limited.

Secondly, our study underscores the importance of collaboration between AI systems and human experts. While Chat-GPT can offer valuable insights, it should be seen as a complementary tool rather than a replacement for medical professionals [11,12,13,14]. Combining the strengths of AI, such as rapid data processing, with the clinical expertise of otolaryngologists can enhance the accuracy and efficiency of OSA diagnosis and management.

Finally, our findings contribute to the ongoing discourse in otolaryngology regarding OSA and the role of AI-generated content. By demonstrating the potential of Chat-GPT to align with expert opinions, this study encourages further research and development in AI-driven healthcare applications.

In conclusion, our study highlights the promise of AI, particularly Chat-GPT, in aiding healthcare professionals in the diagnosis of OSA. While Chat-GPT exhibits a commendable level of consensus with expert responses, collaboration between AI and human experts remains essential for optimal patient care. This research represents a significant step towards harnessing AI's capabilities to address the underdiagnosis and undertreatment of OSA and, ultimately, to improve the health outcomes of affected individuals. Further investigation and refinement of AI-based healthcare tools hold great potential for the future of medicine.