
Introduction

Electronic health records (EHRs) and electronic medical records (EMRs) have largely become the norm for storing patient data in medicine, enabling the storage of volumes of data that would have been unfeasible with paper records. These large volumes of patient data have the potential for secondary use in bolstering clinical decision making and generating predictive disease stratification models [1]. Yet, owing to heterogeneity in the collected variables, information recording, and data input, it is challenging to use these data to derive meaningful clinical research conclusions [1]. Artificial intelligence (AI) provides a promising means of analyzing these vast amounts of data and has been applied successfully in several fields. For example, within cardiology, AI has been used with EHR/EMR data to assist in the early detection of heart failure and to predict the onset of congestive heart failure [2, 3]. In ophthalmology, AI and machine learning approaches have been used to predict the risk of complications after cataract surgery, assess the risk of diabetic retinopathy, and improve the diagnosis of conditions such as glaucoma and age-related macular degeneration [4,5,6,7].

Despite the promising potential of AI in ophthalmology, reporting standards across studies are not consistent, leading to a lack of clarity and transparency in the literature. As a result, there have been efforts to develop standardized guidelines for AI-specific study reporting. For example, the Consolidated Standards of Reporting Trials (CONSORT) statement provides the basic guidelines for reporting in randomized trials. The Consolidated Standards of Reporting Trials—Artificial Intelligence (CONSORT-AI) extension guideline was developed to provide guidance for reporting in randomized controlled trials (RCTs) specifically evaluating interventions with an AI component, ensuring that the results are transparent, reproducible, and comparable across the literature [8]. Herein we completed a critical analysis of all studies applying AI to data from electronic health and medical records within the field of ophthalmology and vision science. Furthermore, as there are no AI-specific generalized guidelines for non-RCT studies, we used the relevant AI elements from the CONSORT-AI checklist in order to critically appraise the adherence of each included study to the reporting guideline.

Methods

This is a systematic review of all studies applying AI to the analysis of patient data from EHRs/EMRs within the field of ophthalmology and vision science from January 1, 2010 to April 17, 2022, with a search update run on February 23, 2023. This review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The protocol was prospectively registered in PROSPERO (registration number: CRD42022303128). A comprehensive search of the relevant databases MEDLINE, EMBASE, and Cochrane Library was conducted in consultation with an experienced librarian. A combination of keywords and Medical Subject Headings related to the concepts of EHRs/EMRs, ophthalmology, and AI was used to build the search strategy (Appendix 1).

Primary English-language studies of human subjects published after January 2010 were eligible, including observational studies, case reports, and population studies. Articles were included if they provided outcomes regarding the value of AI in the analysis of patient EHR/EMR data, with or without imaging data, in any of the following ocular conditions: corneal disease, lens disease, glaucoma, retinal disease, scleral diseases, uveal diseases, choroid diseases, ocular neoplasms, strabismus, eyelid diseases, and ophthalmic emergencies. Studies were excluded if they focused solely on AI in the evaluation of ophthalmic imaging data, were in a language other than English, or were in the form of a review article, meta-analysis, conference abstract, editorial, short communication, guideline, or research letter. The authors of articles whose full text was not available were contacted directly to request full-text versions.

Screening and data extraction

Two authors (T.J.L, R.S.H) independently conducted an initial title-abstract screening followed by full-text screening of all articles. All conflicts were resolved by consensus in consultation with a third author (E.R.L). Data from the final set of articles included in the review were extracted and recorded in a predetermined datasheet by two authors (T.J.L, R.S.H). Findings extracted from published reports included basic study characteristics, aspects of AI model construction, AI performance, and AI reporting domains. We collected information on baseline variables including country, study design, purpose of study, disease outcome, sample size, and reporting of socioeconomic status. Studies were evaluated for AI reporting based on the 14 AI-specific items from the CONSORT-AI reporting guideline.
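To illustrate the structure of the extraction datasheet described above, a minimal sketch of one extraction record is shown below. The field names are ours and purely illustrative; they are not the exact column headings used in the study.

```python
# Minimal sketch of an extraction record, assuming one row per included study.
# Field names are illustrative only; the actual datasheet fields are those
# listed in the text above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExtractionRecord:
    country: str
    study_design: str
    purpose_of_study: str
    disease_outcome: str
    sample_size: int
    ses_reported: bool       # whether socioeconomic status was reported
    funding_reported: bool   # whether a funding source was reported
    ai_modality: str         # e.g., "machine learning"
    # One 0/1 consensus judgement per AI-specific CONSORT-AI item (see next section)
    consort_ai_items: List[int] = field(default_factory=lambda: [0] * 14)
```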

CONSORT-AI checklist

All included studies were scored independently by two authors (T.J.L, R.S.H) using 14 AI-specific items from the CONSORT-AI checklist. Each item was given equal weight, scoring 1 point each. The resulting mark was termed the ‘CONSORT-AI score.’ After initial scoring, any conflicts were resolved by consensus. The AI-specific items were drawn from the domains of Title and Abstract, Background and Objectives, Methods, Results, and Other Information (Code Availability). The specific reporting requirements were: 1) indicating that the intervention involves AI and specifying the type of model; 2) stating the intended use of the AI intervention; 3) explaining the intended use of the AI intervention in the context of the clinical pathway, including its purpose and its intended users; 4) stating the inclusion and exclusion criteria at the level of participants; 5) stating the inclusion and exclusion criteria at the level of the input data; 6) describing how the AI intervention was integrated into the trial setting; 7) stating which version of the AI algorithm was used; 8) describing how the input data were acquired and selected for the AI intervention; 9) describing how poor quality or unavailable input data were assessed and handled; 10) specifying whether there was human-AI interaction in the handling of the input data, and what level of expertise was required of users; 11) specifying the output of the AI intervention; 12) explaining how the AI intervention’s outputs contributed to decision-making or other elements of clinical practice; 13) describing results of any analysis of performance errors and how errors were identified, where applicable; and 14) stating whether and how the AI intervention and/or its code can be accessed, including any restrictions to access or re-use. Inter-rater reliability was assessed using Cohen’s kappa.
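For concreteness, the sketch below illustrates this scoring and agreement procedure under stated assumptions: binary judgements (1 = reported, 0 = not reported) for the 14 items from two reviewers, with Cohen’s kappa computed on the pre-consensus judgements using scikit-learn. The values shown are hypothetical and do not correspond to any included study.

```python
# Minimal sketch of CONSORT-AI scoring and inter-rater agreement,
# using hypothetical judgements for a single study.
from sklearn.metrics import cohen_kappa_score

N_ITEMS = 14

# Hypothetical pre-consensus judgements from the two reviewers
# (1 = item reported, 0 = not reported), listed in checklist order.
reviewer_1 = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
reviewer_2 = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# Inter-rater reliability is assessed on the pre-consensus judgements.
kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.2f}")

# After conflicts are resolved by discussion, each study receives a single
# consensus judgement per item; with equal weighting (1 point per item),
# the CONSORT-AI score is simply the number of items judged as reported.
consensus = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
consort_ai_score = sum(consensus)
print(f"CONSORT-AI score: {consort_ai_score}/{N_ITEMS}")
```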

Risk of Bias assessment

Two independent reviewers (T.J.L, R.S.H) evaluated the potential for bias in the included studies using the Risk of Bias in Non-randomized Studies of Interventions (ROBINS-I) tool. The assessment covered various bias domains for each study, including confounding factors, processes for selecting participants, classification of interventions, deviations from planned interventions, missing data, measurement of outcomes, and the selection of reported results.

Results

The search strategy yielded a total of 4,968 citations (Fig. 1). Following deduplication and screening, 89 studies met the inclusion and exclusion criteria. The characteristics of the included studies are summarized in Supplemental Table 1.

Fig. 1 PRISMA flowchart diagram for study identification and selection

Table 1 Compliance of included studies with CONSORT-AI

Studies predominantly originated from the US (n = 26, 32.5%) and China (n = 14, 17.5%). The majority of studies retrieved patient data from either clinical records (33.75%) or health records databases (41.25%). Clinical records were defined as records collected from individual clinical practice sites, while health records databases were defined as large-scale repositories storing aggregated health information across multiple health systems or regions. The number of participants included in the AI algorithms ranged from 20 to 407,573 patients [9, 10]. The most commonly used AI modality was machine learning (n = 72, 80.9%). Most studies used AI for ocular disease prediction (n = 41, 46.1%), and diabetic retinopathy was the most studied ocular pathology (n = 19, 21.3%).

The overall mean CONSORT-AI score across the 14 measured items was 12.1 (range 8–14, median 12). Following the initial round of scoring, there were conflicts on 68 items (5.5%). Inter-rater concordance for CONSORT-AI scoring yielded a kappa of 0.89. The compliance rates of the included studies with each of the individual AI-specific items from the CONSORT-AI reporting guideline are shown in Table 1 and organized as a heatmap in Supplemental Fig. 1. The categories with the lowest adherence rates were: describing how poor quality or unavailable input data were assessed and handled (48.3%), reporting the inclusion and exclusion criteria of participants (56.2%), and providing information as to whether and how the AI intervention and/or its code could be accessed, including any restrictions to access or re-use (62.9%). The categories with the highest adherence were: specifying the output of the AI intervention (100%); explaining how the AI intervention’s outputs contributed to decision-making or other elements of clinical practice (100%); stating the intended use of the AI intervention within the trial in the title and/or abstract (98.9%); explaining the intended use of the AI intervention in the context of the clinical pathway, including its purpose and its intended users (97.8%); describing how the input data were acquired and selected for the AI intervention (97.8%); stating which version of the AI algorithm was used (96.6%); describing the results of any analysis of performance errors and how errors were identified (96.6%); indicating that the intervention involves artificial intelligence/machine learning in the title and/or abstract and specifying the type of model (93.3%); and stating the inclusion and exclusion criteria at the level of the input data (92.1%). Almost all studies reported their sources of funding, if applicable (n = 80, 89.9%). The majority of studies did not include socioeconomic status (SES) characteristics of patients within their study reporting (n = 67, 75.2%).
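As a rough sketch of how the study-level scores and per-item compliance rates reported above could be tabulated, the snippet below assumes an 89 × 14 matrix of 0/1 consensus judgements (one row per study, one column per checklist item). The values are randomly simulated for illustration only and do not reproduce the actual results.

```python
# Sketch of summarizing CONSORT-AI judgements across studies,
# assuming a studies-by-items matrix of 0/1 consensus scores.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(89, 14))   # placeholder 0/1 judgements

study_totals = scores.sum(axis=1)            # CONSORT-AI score per study
item_compliance = scores.mean(axis=0) * 100  # % of studies meeting each item

print(f"Mean CONSORT-AI score: {study_totals.mean():.1f}")
print(f"Median (range): {np.median(study_totals):.0f} "
      f"({study_totals.min()}-{study_totals.max()})")
for i, rate in enumerate(item_compliance, start=1):
    print(f"Item {i}: {rate:.1f}% compliance")
```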

Based on the ROBINS-I risk of bias assessment tool, risk of bias was “low” for 49 studies (55%) and associated with “some concerns” for 40 studies (45%) (Fig. 2). The majority of concerns regarding risk of bias were identified in the domains of “confounding,” “selection of participants in the study,” and “missing data.”

Fig. 2 ROBINS-I traffic light plot showing the domain-level risk-of-bias judgements for each included study, formatted according to the ROBINS-I assessment tool. Green indicates “low risk” of bias and yellow indicates “some concerns.”

Discussion

With this review, we aimed to assess the reporting quality of studies utilizing AI and EMRs within ophthalmology by examining their adherence to 14 AI-specific items from the CONSORT-AI reporting guideline. Our study identified a total of 89 studies that utilized AI with EMRs in ophthalmology. The mean CONSORT-AI score of the articles was 12.1/14 (range 8–14, median 12). Of the 89 articles included in our review, 14 (15.7%) received a perfect score on all 14 items.

To the best of our knowledge, our review is the first comprehensive study to evaluate the adherence of articles utilizing AI with EMRs in ophthalmology using AI-specific items from the CONSORT-AI reporting guideline. Adherence among the studies we examined was generally high for the 14 AI-specific items assessed, with an average adherence of 86.4% (range 48.3–100%). However, the criteria with the lowest adherence were describing how poor quality or unavailable input data were assessed and handled (48.3%), reporting the inclusion and exclusion criteria of participants (56.2%), and providing information as to whether and how the AI intervention and/or its code could be accessed, as well as any restrictions to access or re-use (62.9%). A similar recent study on the adherence of randomized controlled trials using AI in ophthalmology to the CONSORT-AI checklist found suboptimal reporting across certain domains, with an average adherence of 53% (range 37–78%) for the included articles [11].

As research utilizing artificial intelligence continues to expand rapidly, tools for the evaluation of research output are necessary in order to maintain high reporting standards amongst publications. Reporting guidelines help to ensure scientific validity, clarity in the presentation of results, greater reproducibility, and adherence to a consistent and ethical set of standards amongst researchers utilizing AI. This push towards standardization in reporting is already reflected in the current literature, with the recent publication of several guidelines for the reporting and quality assessment of AI studies. For randomized trials, CONSORT-AI and SPIRIT-AI are the corresponding AI extensions for CONSORT (Consolidated Standards of Reporting Trials) and SPIRIT (Standard Protocol Items: Recommendations for Interventional Trials) [8, 12]. Several options have also been developed for different types of non-randomized studies. For example, for diagnostic accuracy studies, STARD-AI is the AI-specific version of the Standards for Reporting of Diagnostic Accuracy Studies (STARD) [13]. For prediction model studies on diagnosis and prognosis, there are three upcoming guidelines in development: QUADAS-AI, TRIPOD-AI and PROBAST-AI [14, 15]. These are the AI versions of QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies), the TRIPOD (Transparent Reporting of a multivariable prediction model of Individual Prognosis Or Diagnosis) statement, and PROBAST (Prediction model Risk Of Bias ASsessment Tool) [14, 15]. Once complete, these checklists will provide guidance on reporting standards as well as risk of bias assessment, which will be valuable for meta-analyses comparing various AI studies.

With regards to performance assessment of AI-specific models, MI-CLAIM (Minimum Information about Clinical Artificial Intelligence Modelling) focuses on the technical reproducibility of clinical AI modeling research, and MINIMAR (MINimum Information for Medical AI Reporting) provides guidance on proper data source usage and model evaluation [16, 17]. For early-stage clinical evaluations of AI-based decision support systems, DECIDE-AI is a reporting guideline that helps facilitate the critical appraisal of these studies and the replicability of their findings [19]. There are also guidelines specific to certain topics of research within AI, such as CLAIM (Checklist for Artificial Intelligence in Medical Imaging), which applies to studies using machine learning for medical imaging, and the Radiomics Quality Score (RQS), which is specific to publications on radiomics [20, 21]. There are also initiatives targeted towards AI model development, such as FUTURE-AI, a checklist for use during the conceptualization and development stage. FUTURE-AI is based on six central principles (Fairness, Universality, Traceability, Usability, Robustness and Explainability) and focuses on assessing AI model optimization for real-world practice [18].

Therefore, depending on the type of non-randomized study, there are several potential options for reporting guidelines that could serve as a valuable reference for authors when developing their manuscripts. However, a consolidated generalized checklist applicable for all non-randomized studies would be ideal to provide a standardized framework for AI reporting, and help facilitate easier comparison between different types of non-randomized studies. In the interim, non-RCT studies can utilize the previously mentioned guidelines or even the CONSORT-AI framework as a reference by which to ensure that their reporting includes all relevant details, allowing for greater translation into clinical settings and standardization in the way results are reported between different AI models.

Our study completed a comprehensive search of the literature in order to identify all eligible articles within the field of ophthalmology that have applied AI to the analysis of EHR/EMR data. We were also able to assess the presence of certain characteristics, specifically the purpose and type of the AI model, the ophthalmological disease of focus, the data source used, study design type, country of origin, and whether baseline SES and funding source were reported. The variability we observed in the reporting of these characteristics highlights a need for standardization of AI reporting guidelines, which would enable better reproducibility of AI methodologies and allow for generalizability of results across various ophthalmologic centers. Lastly, certain restrictions in our inclusion criteria, namely the exclusion of non-English publications and of secondary literature such as reviews and conference abstracts, may have limited the identification of additional studies and perspectives on the topic.

Conclusion

Artificial intelligence (AI) offers considerable promise in leveraging large, heterogeneous patient health data sets to inform clinical practice in the management of medical conditions and disease. The digitization and electronic storage of medical information have provided a favourable setting for this application of AI and machine learning. The application of AI techniques in ophthalmology continues to progress rapidly, with new initiatives being developed in a wide variety of areas within the field. However, there is still a lack of standardization in reporting the results of these studies, which can make it difficult to compare and evaluate different AI models. The CONSORT-AI framework holds promise as an effective guideline for the transparent and comprehensive reporting of AI studies by helping to standardize reporting across key aspects such as study design, participant characteristics, interventions, outcomes, and statistical analysis. By adhering to AI-specific reporting guidelines, researchers can improve the clarity and completeness of their reporting, allowing readers to better assess the quality and validity of their study. Standardized and transparent reporting of AI studies in ophthalmology will ultimately aid in the application of AI for enhanced diagnosis and management of ocular conditions.