Abstract
Objective
ChatGPT (Generative Pre-trained Transformer) is an artificial intelligence language tool developed by OpenAI that utilises machine learning algorithms to generate text that closely mimics human language. It has recently taken the internet by storm. There have been several concerns regarding the accuracy of documents it generates. This study compares the accuracy and quality of several ChatGPT-generated academic articles with those written by human authors.
Materials and methods
We performed a study to assess the accuracy of ChatGPT-generated radiology articles by comparing them with the corresponding authentic articles, either published or written and under review. These were independently analysed by two fellowship-trained musculoskeletal radiologists and graded from 1 to 5 (1 being bad and inaccurate, 5 being excellent and accurate).
Results
In total, 4 of the 5 articles written by ChatGPT were significantly inaccurate with fictitious references. One of the papers was well written, with a good introduction and discussion; however, all references were fictitious.
Conclusion
ChatGPT is able to generate coherent research articles, which on initial review may closely resemble authentic articles published by academic researchers. However, all of the articles we assessed were factually inaccurate and had fictitious references. It is worth noting, however, that the articles generated may appear authentic to an untrained reader.
Introduction
ChatGPT is an artificial intelligence (AI) language tool developed by OpenAI that utilises machine learning algorithms to generate text that closely mimics human language [1]. ChatGPT has been trained on large amounts of data to improve its ability to understand and generate natural-language text and can comprehend and respond to natural-language inputs. A tool of this calibre has the ability to influence a diverse array of sectors. In the field of research, ChatGPT has demonstrated the ability to generate scientific papers that are similar to authentic papers written by academic researchers, which has raised many questions about its potential role in the future of academic research. Several concerns, however, have been raised regarding the accuracy of research texts generated by ChatGPT [2, 3]. This study aims to assess the accuracy of radiology research articles generated by ChatGPT by performing a comparison with authentic, published articles or those under review.
Materials and methods
Five articles were randomly selected, either published prior to 2021 or written by senior authors and under review. ChatGPT (version 3.0) was asked to write each article with references. The generated articles were then independently compared with the originals by two fellowship-trained musculoskeletal radiologists. The references generated were cross-referenced with scientific databases (PubMed, Google, and Ovid Medline) to assess their authenticity, and the DOIs (digital object identifiers) of the cited articles were also evaluated.
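The reference-verification step can be partly automated. As a minimal illustrative sketch (not the procedure described above, which relied on manual database searches), a cited DOI can first be screened for plausible syntax before confirming by hand, or over HTTP, that it actually resolves at doi.org:

```python
import re

# Crossref-style DOI pattern: prefix "10." plus a 4-9 digit registrant
# code, a slash, then a non-empty suffix of printable characters.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def is_plausible_doi(doi: str) -> bool:
    """Syntactic sanity check only; a well-formed DOI can still be fabricated."""
    return bool(DOI_RE.match(doi.strip()))

def resolve_doi_url(doi: str) -> str:
    """URL to visit (or request) to confirm the DOI genuinely resolves."""
    return f"https://doi.org/{doi.strip()}"
```

A syntactically valid but non-resolving DOI is exactly the kind of fictitious reference this study encountered, so the syntax check alone is never sufficient.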
Five articles by senior authors were included: three already published in scientific journals and indexed in PubMed, one accepted for publication with revisions, and one manuscript under review. These comprised one original article, one case series, two case reports, and one technical report describing a new technique for bone biopsy. The original article described the BACTIP (Birmingham Atypical Cartilage Imaging Protocol) study, published in 2019 and now widely used across the globe in the management of central cartilage tumours [4]. The case series concerned the ‘rising root sign’ seen in patients with a dural tear after spinal surgery [5]. The case reports comprised one case of an intraneural ganglion of the median nerve and one case of iliotibial band friction syndrome following curettage and cementation for a giant cell tumour of the lateral femoral condyle. The technical report described a new technique to aid bone biopsy, the Birmingham Intervention Tent Technique (BITT).
The articles were graded on a Likert scale of 1–5 (1 = bad and inaccurate, 2 = poor, 3 = average, 4 = good, and 5 = excellent and accurate). Each section of each ChatGPT-generated article, i.e. introduction, main article, conclusion, and references, was assessed for accuracy and quality. Word counts excluding references were also recorded, and comments were documented, including whether an article was entirely incorrect or off-topic.
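The grading record can be sketched as a simple data structure. This is purely illustrative (the study used independent manual grading, not software); the article labels and scores below are hypothetical placeholders, and the agreement measure shown is simple exact-match agreement, not a formal kappa statistic:

```python
# Hypothetical grading record: for each article, the Likert grades
# (1 = bad/inaccurate ... 5 = excellent/accurate) from the two readers.
grades = {
    "original article": (1, 1),
    "case series":      (1, 1),
    "case report 1":    (1, 1),
    "case report 2":    (1, 1),
    "technical report": (1, 1),
}

def exact_agreement(grades: dict) -> float:
    """Proportion of articles on which both readers gave the same grade."""
    matches = sum(1 for r1, r2 in grades.values() if r1 == r2)
    return matches / len(grades)
```

With more than two readers or larger samples, a chance-corrected statistic such as Cohen's or Fleiss' kappa would be the conventional choice instead.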
Results
ChatGPT generated versions of the original article and case series that were factually incorrect (Table 1). The introduction, main body, and conclusion of these articles were well written, although the articles were relatively brief, with word counts of less than 400 words. The article ChatGPT generated on BACTIP instead discussed wound infection and methods to decrease it, even describing the steps involved. The case series article was completely wrong, describing the rising root sign in patients with lumbar disc pathology and spinal canal stenosis and even suggesting surgical intervention for all such cases, which is contrary to what should be done in patients with a rising root sign. The references for both of these articles were fictitious.
The case reports generated by ChatGPT fared relatively better. The introduction, main body, and discussion of the median nerve ganglion report were good, with a differential diagnosis and mechanisms for the development of the ganglion. There were 9 references, none of which could be found on PubMed or Google; all were fictitious.
The case report regarding iliotibial band friction due to cement had a good introduction, main text, and conclusion. It even described two cases of iliotibial band impingement due to cement purportedly reported in the literature, which, on checking PubMed and Google, were found not to exist. Needless to say, the references for the two case reports were also false and did not exist.
The paper describing the new BITT was misrepresented by ChatGPT, which described the technique as one for treating pathological tendon seen in lateral epicondylitis and Achilles tendinopathy, and even outlined the surgical steps by which the pathological tendon is excised. The references, like the others, were fictitious.
All 5 articles were graded as 1 on the Likert scale by both readers. These were deemed to be inaccurate with fictitious references, which to an untrained person might appear genuine.
Discussion
In our short pilot study, ChatGPT could write each of the requested articles within 15 s. Three of the articles it was asked to write had been published by senior authors prior to 2021; the other two, based on recent cases, had been written by senior authors and were under review in peer-reviewed journals. We used a combination of these because the current version of ChatGPT is trained predominantly on pre-2021 data. There was consensus between the two observers about the results. The format of the generated articles was similar to what one would expect in a journal, with an introduction, main body, conclusion, and references, though the articles were relatively short; this could be because we were using the free version (version 3.0). ChatGPT even expanded acronyms and abbreviations at the beginning of the articles it generated by providing the complete terms in brackets, as is normal practice when writing an article. Nevertheless, 4 out of 5 articles generated by ChatGPT were totally inaccurate and unrelated to the actual topic. While the references appeared genuine, on cross-checking with PubMed and Google, all were found to be fictitious except for one from 1976.
ChatGPT is an AI tool that has created a buzz across the globe. It was released in November 2022 and is free to use. The use of AI in various sectors, ranging from banking and transportation to medicine, has been on the rise over the last decade, but this is probably the first time such a tool has been released for the general public to try, and many have been astonished by its efficiency and speed. ChatGPT is built on a large language model with on the order of 175 billion parameters, trained predominantly on pre-2021 data. The next version, due to be released in the near future, is expected to be considerably more capable [1].
The ability of ChatGPT to write seemingly authentic-appearing scientific papers has been a topic of increasing interest in recent times, particularly in the world of medical and academic research. It is quite likely that ChatGPT, being an advanced language processing artificial intelligence, could play an important role in the future of research, particularly in regard to writing papers, as natural language processing and machine learning algorithms continue to advance. It has also been successful in clearing USMLE (US Medical Licensing Examination) and certain law exams [6].
However, concerns have been raised about the authenticity and accuracy of some of the content produced by ChatGPT [7]. As the results of our study indicate, the ChatGPT-generated articles were factually inaccurate compared to the authentic ones. While the language and structure of the papers were convincing, the content was often misleading or outright incorrect. This raises serious concerns about the potential role of ChatGPT, in its current state, in scientific research and publication. Its close resemblance to authentic articles is concerning, as its output could appear genuine to untrained individuals and, given its free accessibility on the internet, contribute to the propagation of scientific misinformation [7].
Plagiarism is another issue to be aware of. To exacerbate the matter, existing plagiarism detection tools such as Grammarly or Turnitin are poor at detecting text generated by advanced AI tools such as ChatGPT [3]. Several societies and publishers have responded by modifying their author instructions and policies accordingly.
It is important to note that while ChatGPT may be able to generate text that closely resembles human writing, it lacks the understanding of scientific concepts and methods needed to produce accurate and reliable results or to analyse scientific data, and it lacks adequate scientific knowledge. It also does not appear to be able to extract factual information from existing web-based information resources or databases to perform an accurate literature review. The articles it generates may nonetheless appear authentic to an untrained reader, including early-career researchers. Clinicians unaware of this might believe what was written and use it in their clinical practice or even teach it to trainees. This has the potential to cascade and spread disinformation and, if used in clinical practice, could result in increased patient harm.
Despite these concerns, there may be benefits to using ChatGPT in scientific research, particularly in the generation of well-worded and comprehensible text, provided factual data is supplied by the user. ChatGPT’s ability to quickly process and analyse large volumes of data could prove useful in certain fields of study. The format of articles generated by ChatGPT can serve as a draft template from which to write an expanded version, with language, grammar, and spelling checked simultaneously. However, it is important to ensure that any findings generated by ChatGPT are verified by human experts to ensure accuracy and reliability.
Overall, while ChatGPT may hold some promise in the future for certain applications in scientific research, it is important to carefully consider its limitations and potential pitfalls, particularly in its current state. Future research should continue to investigate the role of ChatGPT in scientific publication and explore ways to mitigate its limitations and maximise its potential benefits while also maintaining the ethical standards and quality of academic research. It is crucial that researchers exercise caution when utilising ChatGPT as a tool for scientific writing and that any generated papers are carefully scrutinised by human experts before publication if it is to be used.
Our study had several limitations. We used a relatively small sample size, analysing only 5 articles generated by ChatGPT. We also used version 3.0, which may have certain limitations, including in its ability to generate accurate information. Further studies analysing a larger number of articles with more advanced versions of the AI software would ultimately be needed to definitively assess its reliability in generating scientific articles and could be a topic for future research.
In conclusion, while ChatGPT can swiftly generate seemingly authentic scientific articles with minimal input, closely resembling genuine articles written by human authors, our study found that the articles generated were largely inaccurate and unreliable.
References
OpenAI. [Internet]. Introducing ChatGPT. San Francisco, California: OpenAI. 2022. [cited 2023 Feb 27]. Available from: https://openai.com/blog/
Kitamura FC. ChatGPT is shaping the future of medical writing but still requires human judgment. Radiology. 2023;2:230171. https://doi.org/10.1148/radiol.230171.
Biswas S. ChatGPT and the future of medical writing. Radiology. 2023;2:223312. https://doi.org/10.1148/radiol.223312.
Patel A, Davies AM, Botchu R, James S. A pragmatic approach to the imaging and follow-up of solitary central cartilage tumours of the proximal humerus and knee. Clin Radiol. 2019;74(7):517–26. https://doi.org/10.1016/j.crad.2019.01.025.
Bharath A, Uhiara O, Botchu R, et al. The rising root sign: the magnetic resonance appearances of post-operative spinal subdural extra-arachnoid collections. Skeletal Radiol. 2017;46:1225–31. https://doi.org/10.1007/s00256-017-2682-x.
Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, Madriaga M, Aggabao R, Diaz-Candido G, Maningo J, Tseng V. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. https://doi.org/10.1371/journal.pdig.0000198.
Shen Y, Heacock L, Elias J, Hentel KD, Reig B, Shih G, Moy L. ChatGPT and other large language models are double-edged swords. Radiology. 2023;26:230163. https://doi.org/10.1148/radiol.230163.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ariyaratne, S., Iyengar, K.P., Nischal, N. et al. A comparison of ChatGPT-generated articles with human-written articles. Skeletal Radiol 52, 1755–1758 (2023). https://doi.org/10.1007/s00256-023-04340-5