Chat GPT for the management of obstructive sleep apnea: do we have a polar star?

Mira, Felipe Ahumada; Favier, Valentin; dos Santos Sobreira Nunes, Heloisa; de Castro, Joana Vaz; Carsuzaa, Florent; Meccariello, Giuseppe; Vicini, Claudio; De Vito, Andrea; Lechien, Jerome R.; Chiesa-Estomba, Carlos; Maniaci, Antonino; Iannella, Giannicola; Rojas, Eduardo Peña; Cornejo, Jenifer Barros; Cammaroto, Giovanni

doi:10.1007/s00405-023-08270-9

Chat GPT for the management of obstructive sleep apnea: do we have a polar star?

Head and Neck
Published: 19 November 2023

Volume 281, pages 2087–2093, (2024)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

European Archives of Oto-Rhino-Laryngology Aims and scope Submit manuscript

Chat GPT for the management of obstructive sleep apnea: do we have a polar star?

Download PDF

Felipe Ahumada Mira^1,13,
Valentin Favier^2,13,
Heloisa dos Santos Sobreira Nunes^3,13,
Joana Vaz de Castro^4,13,
Florent Carsuzaa^5,13,
Giuseppe Meccariello⁶,
Claudio Vicini⁶,
Andrea De Vito⁶,
Jerome R. Lechien^7,13,
Carlos Chiesa-Estomba^8,13,
Antonino Maniaci^9,13,
Giannicola Iannella^10,13,
Eduardo Peña Rojas¹¹,
Jenifer Barros Cornejo¹² &
…
Giovanni Cammaroto ORCID: orcid.org/0000-0002-1618-0048^6,13

1449 Accesses
11 Citations
2 Altmetric
Explore all metrics

This article has been updated

Abstract

Purpose

This study explores the potential of the Chat-Generative Pre-Trained Transformer (Chat-GPT), a Large Language Model (LLM), in assisting healthcare professionals in the diagnosis of obstructive sleep apnea (OSA). It aims to assess the agreement between Chat-GPT's responses and those of expert otolaryngologists, shedding light on the role of AI-generated content in medical decision-making.

Methods

A prospective, cross-sectional study was conducted, involving 350 otolaryngologists from 25 countries who responded to a specialized OSA survey. Chat-GPT was tasked with providing answers to the same survey questions. Responses were assessed by both super-experts and statistically analyzed for agreement.

Results

The study revealed that Chat-GPT and expert responses shared a common answer in over 75% of cases for individual questions. However, the overall consensus was achieved in only four questions. Super-expert assessments showed a moderate agreement level, with Chat-GPT scoring slightly lower than experts. Statistically, Chat-GPT's responses differed significantly from experts' opinions (p = 0.0009). Sub-analysis revealed areas of improvement for Chat-GPT, particularly in questions where super-experts rated its responses lower than expert consensus.

Conclusions

Chat-GPT demonstrates potential as a valuable resource for OSA diagnosis, especially where access to specialists is limited. The study emphasizes the importance of AI-human collaboration, with Chat-GPT serving as a complementary tool rather than a replacement for medical professionals. This research contributes to the discourse in otolaryngology and encourages further exploration of AI-driven healthcare applications. While Chat-GPT exhibits a commendable level of consensus with expert responses, ongoing refinements in AI-based healthcare tools hold significant promise for the future of medicine, addressing the underdiagnosis and undertreatment of OSA and improving patient outcomes.

ChatGPT-4 accuracy for patient education in laryngopharyngeal reflux

Article 16 March 2024

Artificial intelligence chatbots as sources of patient education material for obstructive sleep apnoea: ChatGPT versus Google Bard

Article 02 November 2023

Comparing ChatGPT and a Single Anesthesiologist’s Responses to Common Patient Questions: An Exploratory Cross-Sectional Survey of a Panel of Anesthesiologists

Article 22 August 2024

Discover the latest articles, news and stories from top researchers in related subjects.

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

In the age of advanced artificial intelligence (AI) and deep learning, Large Language Models (LLMs) represent a significant breakthrough in our ability to understand and generate natural language, mimicking human-like text. LLMs offer tremendous potential for healthcare professionals by providing quick and accessible access to the ever-expanding realm of medical knowledge. These models undergo a two-stage training process, starting with self-supervised learning from vast unannotated data and progressing to fine-tuning on small, task-specific, annotated datasets. This fine-tuning enables LLMs to perform specialized tasks tailored to end-users' needs [1].

This distinction highlights the essence of deep machine learning, underscoring the disparity between machine learning and human learning. While humans can swiftly derive general and intricate associations from limited data, machines require extensive data volumes to achieve similar results, primarily due to their lack of common sense. This AI's capacity to absorb copious amounts of data, learn from it, and instantaneously access it stands in stark contrast to our finite capabilities, largely constrained by linear time [2].

One AI model that has recently gained global recognition is the Chat-Generative Pre-Trained Transformer (Chat-GPT), equipped with over 175 billion parameters. This Chatbot extracts a wealth of information from diverse online sources, including books, articles, and websites, and refines its text generation capabilities through human feedback [3]. OpenAI, an artificial intelligence research organization and company founded in 2015, released Chat-GPT in November 2022.

Obstructive sleep apnea (OSA) is a sleep-related breathing disorder characterized by cyclic partial or complete upper airway obstruction. These cycles lead to intermittent hypoxemia, autonomic fluctuations, and sleep disruption, ultimately culminating in a chronic inflammatory systemic state associated with elevated cardiovascular risk. OSA has been linked to various complications, including hypertension, heart failure, coronary artery disease, cerebrovascular disease, metabolic syndrome, and type 2 diabetes [4]. Despite its prevalence (between 9% and 17% depending on gender) and potential repercussions, OSA remains underdiagnosed and undertreated [5].

The intersection of AI and OSA research holds immense promise for facilitating the diagnosis of this condition, not only among otolaryngologists but also among general practitioners and other medical specialists. This study aims to bridge this gap by conducting a comparative analysis of responses to a specialized OSA survey. Through a comparison between sleep surgeons’ skills and Chat-GPT, our objective is to contribute significantly to the ongoing discourse in otolaryngology concerning OSA and shed light on the role of AI-generated content in medical decision-making.

Methods

We designed a prospective, cross-sectional study to assess the level of agreement between responses to a ten-question survey provided by a panel of experts and responses generated by Chat-GPT. All experts included in the study were Otolaryngologists with specialization in sleep medicine. The ten super-experts were selected based on their exceptional expertise and academic recognition in the field of sleep-related disorders.

Survey design

We developed a comprehensive survey comprising ten questions related to OSA. Each question was designed as a clinical case and offered four potential multiple-choice answers. In one case, only one correct answer was possible (Question 5), while in others, multiple answers were acceptable.

Data collection

The survey was distributed to a panel of 350 otolaryngologists, all experts in the field of obstructive sleep apnea, representing 25 countries across four continents (Africa, America, Asia, and Europe). Responses were collected between June and July 2023. Simultaneously, from July 9th to 14th, 2023, we requested Chat-GPT (version 3.5) to provide answers to each of the survey questions. All questions were entered into Chat-GPT 3.5 by a single investigator.

Following this, we submitted the answers from both the experts and Chat-GPT to the super-experts and asked them to review and rate the level of agreement on each question. We employed a Likert-Scale method, ranging from 1 to 5 (1 = Strongly disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, and 5 = Strongly agree), for their assessments (Fig. 1).

This study did not need ethical approval as no patient-level data were used.

Statistical analysis

Quantitative and continuous variables were expressed as the mean ± standard deviation (SD). Two-tailed t tests were used to compare the mean super-expert assessment of experts’ and Chat-GPT’s answers. The significance threshold used was p < 0.05. The kappa correlation coefficient (R) was used to analyze the agreement between super-experts, with the following guidelines for interpretation: R < 0.4: poor correlation; R [0.4–0.75]: intermediate correlation; R > 0.75 good correlation [6].

All statistical analysis were made on free and validated online tools (http://justusrandolph.net/kappa/; and https://biostatgv.sentiweb.fr/).

Results

A total of 97 responses (response rate 27.7%) from 25 countries across Africa, the Americas, Asia, and Europe were collected (refer to Fig. 1) during the period spanning from June 26th, 2023, to July 23rd, 2023. The consensus answers derived from both Chat-GPT and the experts are presented in Table 1. Table 2 showcases the agreement levels between Chat-GPT and the experts for each question.

Table 1 Most consensual experts’ answers and ChatGPT’s answers

Full size table

Table 2 Agreement between experts and Chat-GPT's answers

Full size table

For each multiple choice question, Chat-GPT and experts shared a common answer in more than 75% of cases (item by item analysis). However, when the whole response was taken into consideration, only 4 questions reached the 75% of consensus between experts and Chat-GPT.

Ten "super-experts" evaluated the most consensus-driven responses from both experts and Chat-GPT. The “super-experts” rated all expert’s responses at a value of 4/5 or more, while this rating was achieved only for 6 Chat-GPT responses.

The mean agreement level, as determined by the super-experts using the Likert scale for Chat-GPT's responses, was 4.07 (Minimum 1; Maximum 5; Standard Deviation 1.22). For the experts, the mean agreement level was 4.56 (Minimum 2; Maximum 5; Standard Deviation 0.78). Notably, there was a significant difference between these values (p = 0.0009, as determined by a student t test). Detailed agreement data for each question can be found in Table 3.

Table 3 Assessment provided by super-expert on Chat-GPT’s and experts’ consensual answers

Full size table

The kappa coefficient of agreement between super-experts for expert response assessment was R = 0.44 (CI95% [0.30; 0.58]). For ChatGPT response assessment, the kappa coefficient of agreement was R = 0.17 ([0.03; 0.30]).

Discussion

The integration of LLMs, particularly Chat-GPT, into the field of medicine has shown great promise, offering the potential to revolutionize the way healthcare professionals access and utilize medical knowledge [7, 8]. This study aimed to explore the applicability of Chat-GPT in the domain of obstructive sleep apnea (OSA), a significant health concern associated with various comorbidities and yet often underdiagnosed and undertreated [9, 10].

The results of our study, as presented in Tables 1 and 2, reveal a moderate global degree of consensus between Chat-GPT and the expert panel. In four questions the level of agreement between Chat-GPT and experts was high while in the remaining questions agreement was significantly lower. The consensus answers for the ten survey questions demonstrate that Chat-GPT might be capable of providing responses that align with those of human experts but still needs improvement.

Moreover, our findings indicate that the level of agreement between Chat-GPT and experts, as assessed by the super-experts, is substantial. The mean agreement levels, represented by a Likert scale, were 4.07 for Chat-GPT and 4.56 for the experts, with the latter showing slightly higher agreement levels. However, it is important to note that the differences in agreement between Chat-GPT and experts were statistically significant (p = 0.0009). This suggests that while Chat-GPT's responses are generally in concordance with expert opinions, there are instances where distinctions exist.

These distinctions may arise from the inherent limitations of AI models, including their reliance on data patterns and the potential absence of clinical intuition.

The data presented in Table 3 provide valuable insights into the super-expert assessments of Chat-GPT's answers compared to experts' consensual answers for each of the ten survey questions.

Examining the data, we observe some key points. For questions Q1 and Q2 Super-experts rated Chat-GPT's responses lower than the experts' consensual answers, with means of 2.8 and 3.4 compared to 4.1 and 4.6, respectively. The p-values of 0.01 for both questions indicate a significant difference in these assessments. This suggests that while Chat-GPT provided responses that were generally aligned with expert consensus, super-experts found room for improvement in these particular cases. For question Q4: Super-experts rated Chat-GPT's response lower than experts' consensual answer, with a mean of 2 compared to 4.1. The p-value of 0.0003 indicates a significant difference in these assessments. This suggests that Chat-GPT struggled to align with expert consensus on this question, with room for improvement in its response quality.

Another aspect that warrants a more in-depth examination is the level of agreement among super-experts when assessing the responses provided by both experts and ChatGPT. The degree of agreement was found to be intermediate for expert responses and low for ChatGPT responses. These findings underscore the intricate nature of managing obstructive sleep disorders, where numerous therapeutic choices exist, and there is a dearth of conclusive evidence in the literature to guide the selection of the optimal approach for a specific clinical presentation.

These results have several implications for the field of OSA diagnosis and treatment. Firstly, they highlight the potential of Chat-GPT as a valuable resource for general practitioners and medical specialists in the initial assessment of OSA cases. Chat-GPT's ability to provide accurate and consensus-driven responses can aid healthcare providers in making informed decisions and recommendations, especially in regions where access to sleep medicine specialists is limited.

Secondly, our study underscores the importance of collaboration between AI systems and human experts. While Chat-GPT can offer valuable insights, it should be seen as a complementary tool rather than a replacement for medical professionals [11,12,13,14]. Combining the strengths of AI, such as rapid data processing, with the clinical expertise of otolaryngologists can enhance the accuracy and efficiency of OSA diagnosis and management.

Finally, our findings contribute to the ongoing discourse in otolaryngology regarding OSA and the role of AI-generated content. By demonstrating the potential of Chat-GPT to align with expert opinions, this study encourages further research and development in AI-driven healthcare applications.

In conclusion, our study signifies the promise of AI, particularly Chat-GPT, in aiding healthcare professionals in the realm of OSA diagnosis. While Chat-GPT exhibits a commendable level of consensus with expert responses, the collaboration between AI and human experts is essential for optimal patient care. This research represents a significant step towards harnessing AI's capabilities to address the underdiagnosis and undertreatment of OSA, ultimately improving the health outcomes of affected individuals. Further investigations and refinements in AI-based healthcare tools hold great potential for the future of medicine.

Availability of data and materials

The data are available.

Change history

13 January 2024
First and family name of the author "Carlos Chiesa-Estomba" was published incorrectly and corrected in this version.

References

Shen Y, Heacock L, Elias J et al (2023) ChatGPT and other large language models are double-edged swords. Radiology. 307(2):e230163
Article PubMed Google Scholar
Rajkomar A, Dean J, Kohane I (2019) Machine learning in medicine. N Engl J Med 380(14):1347–1358
Article PubMed Google Scholar
Johnson D, Goodman R, Patrinely J et al (2023) Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the chat-GPT model. Res Sq. https://doi.org/10.21203/rs.3.rs-2566942/v1
Article PubMed PubMed Central Google Scholar
Yeghiazarians Y, Jneid H, Tietjens JR et al (2021) Obstructive sleep apnea and cardiovascular disease: a scientific statement from the american heart association. Circulation 144(3):E56–E67
Article CAS PubMed Google Scholar
Peppard PE, Young T, Barnet JH et al (2013) Increased prevalence of sleep-disordered breathing in adults. Am J Epidemiol 177(9):1006–1014
Article PubMed PubMed Central Google Scholar
Warrens MJ (2010) Inequalities between multi-rater kappas. Adv Data Anal Classif 4(4):271–286
Article MathSciNet Google Scholar
Brown TB, Mann, B, Ryder, N, et al (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Radford A, Wu J, Child R et al (2019) Language models are unsupervised multitask learners. OpenAI Blog 1(8):9
Google Scholar
Peppard PE, Young T, Palta M et al (2000) Prospective study of the association between sleep-disordered breathing and hypertension. N Engl J Med 342(19):1378–1384
Article CAS PubMed Google Scholar
Senaratna CV, Perret JL, Lodge CJ et al (2017) Prevalence of obstructive sleep apnea in the general population: a systematic review. Sleep Med Rev 34:70–81
Article PubMed Google Scholar
.Lyons RJ, Arepalli SR, Fromal O, et al (2023) Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can J Ophthalmol. https://doi.org/10.1016/j.jcjo.2023.07.016
Article PubMed Google Scholar
Xv Y, Peng C, Wei Z et al (2023) Can Chat-GPT a substitute for urological resident physician in diagnosing diseases?: a preliminary conclusion from an exploratory investigation. World J Urol 41(9):2569–2571
Article PubMed Google Scholar
Chiesa-Estomba CM, Lechien JR, Vaira LA et al (2023) Exploring the potential of Chat-GPT as a supportive tool for sialendoscopy clinical decision making and patient information support. Eur Arch Otorhinolaryngol. https://doi.org/10.1007/s00405-023-08104-8
Article PubMed Google Scholar
Chen S, Kann BH, Foote MB et al (2023) Use of artificial intelligence chatbots for cancer treatment information. JAMA Oncol 9(10):1459–1462
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The authors would like to express their gratutide to the following super-experts for having participated to the study Bhik Kotecha, Clemens Heiser, Nico De Vrie, Rodolfo Lugo Saldana, Joachim Maurer, Ofer Jacobowitz, Kenny Pang, Michel Cahali, Ewa Olszewska.

Author information

Authors and Affiliations

ENT Department, Hospital of Linares, Linares, Chile
Felipe Ahumada Mira
ENT Department, University Hospital of Montpellier, Montpellier, France
Valentin Favier
ENT and Sleep Medicine Department, Nucleus of Otolaryngology, Head and Neck Surgery and Sleep Medicine of São Paulo, São Paulo, Brazil
Heloisa dos Santos Sobreira Nunes
ENT Department, Armed Forces Hospital, Lisbon, Portugal
Joana Vaz de Castro
ENT Department, University Hospital of Poitiers, Poitiers, France
Florent Carsuzaa
Head and Neck Department, ENT & Oral Surgery Unity, G.B. Morgagni, L. Pierantoni Hospital, Via Forlanini, 47121, Forlì, Italy
Giuseppe Meccariello, Claudio Vicini, Andrea De Vito & Giovanni Cammaroto
Division of Laryngology and Broncho-Esophagology, Department of Otolaryngology and Head and Neck Surgery, EpiCURA Hospital, UMONS Research Institute for Health Sciences and Technology, University of Mons, Mons, Belgium
Jerome R. Lechien
Department of Otorhinolaryngology, Biodonostia Research Institute, Donostia University Hospital, Osakidetza, 20014, San Sebastian, Spain
Carlos Chiesa-Estomba
Department of Medical and Surgical Sciences and Advanced Technologies “GF Ingrassia”, ENT Section, University of Catania, Piazza Università 2, 95100, Catania, Italy
Antonino Maniaci
Department of ‘Organi di Senso’, University “Sapienza”, Viale Dell’Università 33, 00185, Rome, Italy
Giannicola Iannella
Clínica Lircay, Talca, Chile
Eduardo Peña Rojas
Hospital Clínico UC Christus, Santiago, Chile
Jenifer Barros Cornejo
Young Otolaryngologists-International Federations of Oto-Rhinolaryngological Societies (YO-IFOS), Paris, France
Felipe Ahumada Mira, Valentin Favier, Heloisa dos Santos Sobreira Nunes, Joana Vaz de Castro, Florent Carsuzaa, Jerome R. Lechien, Carlos Chiesa-Estomba, Antonino Maniaci, Giannicola Iannella & Giovanni Cammaroto

Authors

Felipe Ahumada Mira
View author publications
You can also search for this author in PubMed Google Scholar
Valentin Favier
View author publications
You can also search for this author in PubMed Google Scholar
Heloisa dos Santos Sobreira Nunes
View author publications
You can also search for this author in PubMed Google Scholar
Joana Vaz de Castro
View author publications
You can also search for this author in PubMed Google Scholar
Florent Carsuzaa
View author publications
You can also search for this author in PubMed Google Scholar
Giuseppe Meccariello
View author publications
You can also search for this author in PubMed Google Scholar
Claudio Vicini
View author publications
You can also search for this author in PubMed Google Scholar
Andrea De Vito
View author publications
You can also search for this author in PubMed Google Scholar
Jerome R. Lechien
View author publications
You can also search for this author in PubMed Google Scholar
Carlos Chiesa-Estomba
View author publications
You can also search for this author in PubMed Google Scholar
Antonino Maniaci
View author publications
You can also search for this author in PubMed Google Scholar
Giannicola Iannella
View author publications
You can also search for this author in PubMed Google Scholar
Eduardo Peña Rojas
View author publications
You can also search for this author in PubMed Google Scholar
Jenifer Barros Cornejo
View author publications
You can also search for this author in PubMed Google Scholar
Giovanni Cammaroto
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Giovanni Cammaroto.

Ethics declarations

Ethics declaration

The author Jerome R. Lechien is also guest editor of the special issue on ‘ChatGPT and Artifcial Intelligence in Otolaryngology-Head and Neck Surgery’. He was not involved with the peer review process of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection on sleep apnea syndrome. Guest editors: Manuele Casale, Rinaldi Vittorio.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Mira, F.A., Favier, V., dos Santos Sobreira Nunes, H. et al. Chat GPT for the management of obstructive sleep apnea: do we have a polar star?. Eur Arch Otorhinolaryngol 281, 2087–2093 (2024). https://doi.org/10.1007/s00405-023-08270-9

Download citation

Received: 13 September 2023
Accepted: 29 September 2023
Published: 19 November 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s00405-023-08270-9

Chat GPT for the management of obstructive sleep apnea: do we have a polar star?