Introduction

ChatGPT is an artificial intelligence (AI) language tool developed by OpenAI that utilises machine learning algorithms to generate text that closely mimics human language [1]. It has been trained on large amounts of data, enabling it to comprehend natural-language inputs and respond with fluent natural-language text. A tool of this calibre has the ability to influence a diverse array of sectors. In the field of research, ChatGPT has demonstrated the ability to generate scientific papers that resemble authentic papers written by academic researchers, which has raised many questions about its potential role in the future of academic research. Several concerns, however, have been raised regarding the accuracy of research texts generated by ChatGPT [2, 3]. This study aims to assess the accuracy of radiology research articles generated by ChatGPT by comparing them with authentic articles that were either published or under review.

Materials and methods

Five articles were randomly selected: articles published prior to 2021 or articles written by senior authors and currently under review. ChatGPT (version 3.0) was asked to write each article with references. The generated articles were then independently compared with the originals by two fellowship-trained musculoskeletal radiologists. The references generated were cross-referenced with scientific databases (PubMed, Google, and Ovid Medline) to assess their authenticity. The DOIs (digital object identifiers) of the cited articles were also evaluated for authenticity.
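The reference check described above was performed manually; purely for illustration, the sketch below shows how such a check could be approximated in code, assuming references are available as plain-text strings. The `extract_doi` helper, the simplified DOI regex, and the example reference are our own assumptions, not part of the study's workflow.

```python
import re
from typing import Optional

# Simplified DOI pattern: directory indicator "10.", a 4-9 digit
# registrant code, a slash, then the suffix (no whitespace or quotes).
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

def extract_doi(reference: str) -> Optional[str]:
    """Return the first DOI-like string found in a reference, or None."""
    match = DOI_PATTERN.search(reference)
    # Strip a trailing full stop that belongs to the sentence, not the DOI.
    return match.group(0).rstrip('.') if match else None

def resolution_url(doi: str) -> str:
    """Build the doi.org URL used to check whether a DOI resolves."""
    return f"https://doi.org/{doi}"
```

In practice, each extracted DOI would be looked up via its doi.org URL (and the citation searched on PubMed); a fabricated reference either yields no DOI at all or one that fails to resolve.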

Five articles, either published in PubMed-indexed journals or written by senior authors and under review, were included: one original article, one case series, two case reports, and one technical report describing a new technique for bone biopsy. Three of these articles had already been published in scientific journals, one had been accepted for publication with revisions, and one manuscript was under review. The original article described the BACTIP (Birmingham Atypical Cartilage Imaging Protocol) study; published in 2019, this protocol is used widely across the globe in the management of central cartilage tumours [4]. The case series concerned the ‘rising root sign’ seen in patients with a dural tear following spinal surgery [5]. The case reports comprised one case of an intraneural ganglion of the median nerve and one case of iliotibial band friction syndrome following curettage and cementation for a giant cell tumour of the lateral femoral condyle. The technical report described a new technique to aid bone biopsy called the Birmingham Intervention Tent Technique (BITT).

The articles were graded using a Likert scale of 1–5 (1 = bad and inaccurate, 2 = poor, 3 = average, 4 = good, and 5 = excellent and accurate). Each section of the articles created by ChatGPT (introduction, main article, conclusion, and references) was assessed in particular for accuracy and quality. Word counts excluding references were also recorded. Comments were documented for each article, including whether it was entirely incorrect or unrelated to the original.

Results

ChatGPT generated counterparts of the original article and the case series that were factually incorrect (Table 1). The introduction, main body, and conclusion of each article were good in structure. The articles generated were relatively brief, with word counts of less than 400 words. However, the article ChatGPT generated on BACTIP discussed wound infection and methods to decrease it, even describing the steps involved. The case series article was completely wrong, as it described the rising root sign in patients with lumbar disc pathology and spinal canal stenosis. It went on to suggest surgical intervention for all these cases, which is contrary to what should be done in patients with a rising root sign. The references for both of these articles were wrong.

Table 1 Summary of the assessment of the articles generated by ChatGPT

The case reports generated by ChatGPT fared relatively better. The introduction, main body, and discussion of the median nerve ganglion report were good, with a differential diagnosis and mechanisms for the development of the ganglion. There were 9 references, none of which existed on PubMed or Google; all were fictitious.

The case report regarding iliotibial band friction due to cement had a good introduction, main text, and conclusion. It even described two cases of iliotibial band impingement due to cement apparently reported in the literature which, on checking using PubMed and Google, were found to be incorrect. Needless to say, the references for the two case reports were false and did not exist.

The paper describing the new BITT was incorrectly rendered by ChatGPT. It described the technique as treatment of pathological tendon seen in lateral epicondylitis and Achilles tendinopathy, even detailing the surgical steps by which the pathological tendon is excised. The references, like the others, were fictitious.

All 5 articles were graded as 1 on the Likert scale by both readers. These were deemed to be inaccurate with fictitious references, which to an untrained person might appear genuine.

Discussion

In our short pilot study, ChatGPT could write each of the relevant articles within 15 s. Three of the articles it was asked to write had been published by senior authors prior to 2021. The two other articles are based on recent cases, written by senior authors and under review in peer-reviewed journals. We used a combination of these to assess accuracy for pre-2021 material, as the current version of ChatGPT was trained predominantly on data from before 2021. The results of our study were quite interesting. There was consensus about the results between the two observers. The format of the articles written is similar to what one would expect in a journal, with an introduction, main body, conclusion, and references, though the articles were relatively short. This could be because we were using the free version (version 3.0). ChatGPT even clarified acronyms and abbreviations at the beginning of the articles it generated by providing the complete terms within brackets, as is normal practice when writing an article. Four out of 5 articles generated by ChatGPT were totally inaccurate and unrelated to the actual topic. While the references appeared genuine, on cross-checking with PubMed and Google, all were found to be fictitious except for one from 1976.

ChatGPT is an AI tool that has created a buzz across the globe. It was released in November 2022 and is free to use. The use of AI in various aspects of industry, ranging from banking and transportation to medicine, has been on the rise over the last decade. This is probably the first time such a tool has been released for the general public to try and use, and many have been astonished by its efficiency and speed. ChatGPT is based on over 700 billion parameters and predominantly on data from before 2021. The next version, due to be released in the near future, is expected to be much better, as it is expected to be based on over 100 trillion parameters [1].

The ability of ChatGPT to write seemingly authentic scientific papers has been a topic of increasing interest in recent times, particularly in the world of medical and academic research. As natural language processing and machine learning algorithms continue to advance, it is quite likely that ChatGPT, being an advanced language-processing artificial intelligence, could play an important role in the future of research, particularly in the writing of papers. It has also passed the USMLE (US Medical Licensing Examination) and certain law exams [6].

However, there have been concerns about the authenticity and accuracy of some of the data produced by ChatGPT [7]. As the results of our study indicate, the ChatGPT-generated articles were factually inaccurate compared with the authentic ones. While the language and structure of the papers were convincing, the content was often misleading or outright incorrect. This raises serious concerns about the potential role of ChatGPT, in its current state, in scientific research and publication. Its close resemblance to authentic articles is concerning, as its output could appear genuine to untrained individuals and, given its free accessibility on the internet, become an adjunct to the propagation of scientific misinformation [7].

Plagiarism is another issue that one needs to be aware of. To exacerbate the matter, current plagiarism detection software such as Grammarly or Turnitin is poor at detecting language generated by advanced AI tools such as ChatGPT [3]. Several societies and publishers have responded by modifying their author instructions and policies to reflect this.

It is important to note that while ChatGPT may have the ability to generate text that closely resembles human writing, it lacks adequate scientific knowledge and the understanding of scientific concepts and methods needed to produce accurate, reliable results and to analyse scientific data. It also does not appear to be able to extract factual information from existing web-based information resources or databases to perform an accurate literature review. The articles it generates may nonetheless appear authentic to an untrained reader, including early-career researchers. Clinicians unaware of this might believe what was written and use it in their clinical practice or even teach it to trainees. This has the potential to cascade and spread disinformation and, if applied in clinical practice, could result in patient harm.

Despite these concerns, there may be potential benefits to using ChatGPT in scientific research, particularly in the generation of well-worded and comprehensible text, provided factual data is entered by the user. ChatGPT’s ability to quickly process and analyse large volumes of data could prove useful in certain fields of study. The format of articles generated by ChatGPT can be used as a draft template from which to write an expanded version of the article, with language, grammar, and spelling checked at the same time. However, it is important to ensure that any findings generated by ChatGPT are verified by human experts for accuracy and reliability.

Overall, while ChatGPT may hold some promise for certain applications in scientific research, it is important to carefully consider its limitations and potential pitfalls, particularly in its current state. Future research should continue to investigate the role of ChatGPT in scientific publication and explore ways to mitigate its limitations and maximise its potential benefits while maintaining the ethical standards and quality of academic research. Researchers should exercise caution when utilising ChatGPT as a tool for scientific writing, and any generated papers must be carefully scrutinised by human experts before publication.

Our study had several limitations. We used a relatively small sample size, analysing only 5 articles generated by ChatGPT. We also used version 3.0, which may have certain limitations, including in its ability to generate accurate information. Further studies analysing a larger number of articles with more advanced versions of the AI software would be needed to definitively assess its reliability in generating scientific articles and could be a topic for future research.

In conclusion, while ChatGPT can swiftly generate seemingly authentic scientific articles with minimal input, resembling genuine articles written by human authors, our study found that the articles generated were largely inaccurate and unreliable.