Introduction

Acute ischemic stroke (AIS) is the leading cause of adult disability in Singapore and, with an ageing population, the burden of stroke is expected to rise [1]. In patients with large artery occlusion in the anterior circulation presenting within 24 h, mechanical thrombectomy may be indicated. Presently, clinicians use a combination of clinical judgment and neuroimaging parameters to decide whether a patient is suitable for endovascular thrombectomy (EVT) [2]. This decision-making has previously been modeled with multivariate logistic regression, by selecting prognostic variables and aggregating them into a usable scale [3]; however, this classical approach is limited because it assumes a linear relationship between the variables and the log odds of the outcome, and it is sensitive to collinearity between the variables [4]. By contrast, machine learning (ML) algorithms are free of these linearity assumptions and can control collinearity through regularization hyperparameters [5].

Machine learning has been shown to successfully incorporate multifactorial events in various fields for clinically relevant outcomes, such as the diagnosis of acute coronary syndromes [6]. Similarly, emerging studies seem to indicate that there is great potential in introducing ML models as a clinical tool to accurately predict the suitability of an AIS patient for an EVT procedure [7,8,9]; however, individual studies may not be statistically powered to evaluate the robustness of the findings or may not adequately account for the small biases in each population. Hence, we conducted a systematic review and meta-analysis to evaluate the effectiveness of current ML models as a clinical tool to predict the clinical outcome of AIS patients undergoing EVT.

Methodology

This diagnostic test accuracy (DTA) meta-analysis was conducted and reported in accordance with the Cochrane DTA handbook [10] and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [11]. We searched PubMed from 1 January 2000 to 14 October 2019 for studies that evaluated ML algorithms for the prediction of outcomes with the modified Rankin scale (mRS), modified thrombolysis in cerebral infarction (modified TICI), symptomatic intracranial cerebral hemorrhage score, and mortality in stroke patients undergoing thrombectomy. The literature search in MEDLINE (PubMed) was performed using the following terms in combination: (large vessel occlusion OR cerebrovascular occlusion OR endovascular thrombectomy OR mechanical thrombectomy) and (artificial intelligence OR machine learning). We included all studies (randomized controlled trials, prospective/retrospective cohort studies, case-control studies), according to the PICOS (Table 1). We excluded all studies not reporting stroke outcomes following thrombectomy. The literature search and data extraction were performed independently by two reviewers, and all disagreements were resolved by mutual consensus. The corresponding authors of three papers were contacted by email to provide data not directly available in the original publication but had not responded by the time of submission; hence, these papers were excluded.

Table 1 PICOs, inclusion criteria and exclusion criteria applied to database search

The study protocol and full-text articles were independently reviewed by an expert team comprising senior neurologists.

Apart from stroke outcomes of patients, we also collected data on age, sex, smoking, the presence of diabetes mellitus, hypertension, hyperlipidemia, atrial fibrillation, previous ischemic stroke and coronary artery disease. Baseline information included National Institutes of Health Stroke Scale (NIHSS), Alberta Stroke Program Early CT Score (ASPECTS), modified Rankin scale (mRS), systolic blood pressure, time of admission (door), time to intravenous thrombolysis (needle), time to procedure start (puncture), time to recanalization, intravenous thrombolysis, endovascular intervention, anesthesia type (general anesthesia vs. conscious sedation), vascular occlusion site and stroke etiology according to TOAST criteria [12, 13]. Outcomes collected included mRS 0–2 at discharge, at 30-day follow-up, and at 90-day follow-up, symptomatic intracranial hemorrhage, mortality, modified TICI score 2b/3, NIHSS at discharge and NIHSS with early clinical improvement. For the prediction models, we collected data on the accuracy metrics, including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), precision, accuracy, true positive rate, false positive rate, true negative rate, false negative rate, receiver operating characteristic (ROC) curves, area under the curve (AUC), conclusion, and other outcomes measured. Where data were unavailable, we requested further data by contacting the corresponding authors of the relevant studies. Quality control was performed by two independent reviewers with the Ottawa grading quality assessment [14], as shown in Appendix 1.
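As an illustration of how the accuracy metrics listed above relate to one another, the following minimal Python sketch derives them from a single 2×2 table. It assumes that a "positive" denotes a good functional outcome (mRS 0–2); the class name, function name, and counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 counts; 'positive' is assumed to mean a good outcome (mRS 0-2)."""
    tp: int  # predicted good, observed good
    fp: int  # predicted good, observed poor
    fn: int  # predicted poor, observed good
    tn: int  # predicted poor, observed poor


def accuracy_metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Return the standard diagnostic accuracy metrics for one 2x2 table."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),  # true positive rate
        "specificity": c.tn / (c.tn + c.fp),  # true negative rate
        "ppv": c.tp / (c.tp + c.fp),          # positive predictive value (precision)
        "npv": c.tn / (c.tn + c.fn),          # negative predictive value
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn),
    }


# Hypothetical counts for illustration only.
example = ConfusionCounts(tp=40, fp=10, fn=10, tn=40)
metrics = accuracy_metrics(example)
```

The false positive rate and false negative rate follow as 1 − specificity and 1 − sensitivity, respectively.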

Statistical Analysis

In this DTA meta-analysis, the results from the five included studies were quantitatively pooled and analyzed in R version 3.6.2 (R Foundation for Statistical Computing, Vienna, Austria) using general approaches laid out by Shim et al. [15]. We used the metaprop command to pool sensitivity, specificity, PPV, NPV, and diagnostic odds ratio (DOR). The logit transformation was applied, and Clopper-Pearson exact confidence intervals were used [16]. The reitsma function (package: mada), which fits the bivariate model, was used to generate the summary ROC curve and compute the area under the curve (AUC) [17]. A random-effects model was used to account for between-study variance. We present between-study heterogeneity using I2 and τ2 statistics. We considered I2 of < 30% to indicate low heterogeneity between studies, 30–60% to indicate moderate heterogeneity, and > 60% to indicate substantial heterogeneity. Two-sided P values of < 0.05 were regarded as indicating nominal statistical significance.
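The pooling step can be sketched in Python for readers who do not use R. This is a simplified illustration and not a reproduction of the metaprop call: it assumes the DerSimonian-Laird estimator for τ2 (the estimator actually used is not stated above), uses Wald intervals on the logit scale for the pooled estimate, and does not compute the Clopper-Pearson intervals. The function names and the event counts are hypothetical.

```python
import math
from typing import Dict, Sequence


def pool_logit_proportions(events: Sequence[int], totals: Sequence[int]) -> Dict[str, float]:
    """Random-effects pooling of proportions on the logit scale.

    Assumes the DerSimonian-Laird estimator and 0 < events < totals
    in every study (no continuity correction is applied).
    """
    y = [math.log(e / (n - e)) for e, n in zip(events, totals)]  # logit of each proportion
    v = [1 / e + 1 / (n - e) for e, n in zip(events, totals)]    # approximate variance of each logit
    w = [1 / vi for vi in v]                                      # fixed-effect weights

    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))      # Cochran's Q
    df = len(y) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w

    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0              # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0          # I2 in percent

    w_re = [1 / (vi + tau2) for vi in v]                          # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    def expit(x: float) -> float:
        return 1 / (1 + math.exp(-x))

    return {
        "pooled": expit(pooled),
        "ci_low": expit(pooled - 1.96 * se),
        "ci_high": expit(pooled + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }


def heterogeneity_label(i2: float) -> str:
    """Apply the thresholds stated in the text: <30 low, 30-60 moderate, >60 substantial."""
    if i2 < 30:
        return "low"
    if i2 <= 60:
        return "moderate"
    return "substantial"


# Hypothetical true-positive counts and numbers of good-outcome patients in five cohorts.
result = pool_logit_proportions(events=[45, 30, 20, 50, 28], totals=[50, 40, 30, 60, 45])
label = heterogeneity_label(result["i2"])
```

Pooling on the logit scale keeps the back-transformed estimate and its interval within 0 and 1, which is the usual motivation for the transformation.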

Results

The PRISMA flowchart is presented in Fig. 1. The MEDLINE literature search retrieved 3143 results, and a search of the reference lists of review articles uncovered two additional studies. After an initial screen, 3110 studies were excluded as they did not include stroke patients or thrombectomy patients, did not use machine learning methods, or were of an inappropriate study type. Of the remaining 33 papers, full-text review excluded 29, and 4 papers were included in our analysis.

Fig. 1

PRISMA flow diagram of study selection

Baseline Characteristics

We retained a total of five studies that evaluated ML. Support vector machine was utilized across four studies and decision tree was utilized in Alawieh 2018 [9]. In the study by Nishi 2019 [5], regularized logistic regression, random forest, and support vector machine were utilized. To reduce heterogeneity, we analyzed the support vector machine model in Nishi 2019. In addition, the Alawieh 2018 paper included two separate cohorts, one retrospective and one prospective. Hence, although the PRISMA flow diagram in Fig. 1 shows four included articles, the number of included studies totaled five.

The five studies (1 prospective and 4 retrospective studies) comprised a combined cohort of 805 patients, of whom 690 patients (mean age: 75.4 years, 52.6% males) had a reported outcome assessment and were included in the analysis. A substantial proportion of patients (42.1%) received IV thrombolysis. The primary outcome measured was mRS at 90 days when available. When unavailable, mRS at the closest time point was obtained and specified. The mRS measures the degree of disability in stroke patients, with a good functional outcome defined as mRS 0–2. A good functional outcome was achieved in 37.7% of patients. As symptomatic intracranial hemorrhage, mortality, modified TICI score 2b/3, NIHSS at discharge, and NIHSS with early clinical improvement were not reported across all studies, they were excluded from the analysis. The participant characteristics of the included studies are shown in Table 2.

Table 2 Participant characteristics

Across the five cohorts, the types of machine learning modalities utilized and how they were derived varied. A summary of the machine learning modalities, comprising machine learning model, software algorithm, training procedure, and optimizing metrics of the included studies, is attached in Appendix 2. The quality of training data, comprising type of study, cohort size, class imbalance, normalization/standardization, and validation, is summarized in Appendix 3. Clinical predictors and outcomes included in the five studies are summarized in Table 3.

Table 3 Clinical predictors and outcomes included in the training of different ML models

Machine Learning Pooled Outcomes

The diagnostic odds ratio in predicting the outcome of AIS patients undergoing endovascular thrombectomy is presented in Fig. 2. The random effects model demonstrated that the pooled diagnostic odds ratio was 12.6 (95% confidence interval: 5.26–30.36, I2 = 68%).

Fig. 2

Forest plot of the diagnostic odds ratio reported by studies that applied a machine learning method to predict clinical outcomes in ischemic stroke patients undergoing thrombectomy

The sensitivity is presented in Fig. 3. The random effects model demonstrated that the pooled sensitivity was 0.795 (95% confidence interval: 0.651–0.889, I2 = 70%).

Fig. 3

Forest plot of the sensitivity reported by studies that applied a machine learning method to predict clinical outcomes in ischemic stroke patients undergoing thrombectomy

The specificity is presented in Fig. 4. The random effects model demonstrated that the pooled specificity was 0.780 (95% confidence interval: 0.634–0.879, I2 = 85%).

Fig. 4

Forest plot of the specificity reported by studies that applied a machine learning method to predict clinical outcomes in ischemic stroke patients undergoing thrombectomy

The random effects model demonstrated that the pooled negative predictive value was 0.874 (95% confidence interval: 0.728–0.947, I2 = 84%) and that the pooled positive predictive value was 0.697 (95% confidence interval: 0.640–0.749, I2 = 0%).

The summary ROC is presented in Fig. 5. The AUC of the summary ROC was 0.846 (95% confidence interval: 0.686–0.902).

Fig. 5

Summary receiver operating characteristic curve (ROC). Curve is not part of the ROC. HSROC Hierarchical Summary Receiver Operating Characteristic

Discussion

This diagnostic test accuracy meta-analysis demonstrates the utility of ML algorithms as an adjunctive tool for identifying good candidates among acute ischemic stroke patients indicated for endovascular thrombectomy (EVT), with moderate to high AUC, sensitivity, specificity, NPV and PPV; however, there is substantial heterogeneity across ML models. The accuracy of ML is expected to improve over time as algorithms are trained on larger and more robust databases, improving patient selection for endovascular thrombectomy.

Traditional predictive models include ASPECTS, baseline NIHSS, M1 occlusion, Boston acute stroke imaging scale (M1-BASIS), and the DEFUSE‑2 trial. The AUC was used to compare the effectiveness of traditional clinical predictive models with that of the ML models [18, 19]. The AUCs for ASPECTS (0.730), M1-BASIS (0.721), NIHSS (0.728), and the DEFUSE‑2 trial (0.730) were lower than that of the ML model (0.846) [18, 19]. This suggests a relative superiority of ML models over traditional predictive models in prognosticating stroke patients who are indicated for thrombectomy.

A pooled diagnostic odds ratio (DOR) of 12.6 demonstrates the potential of ML algorithms as a predictive tool for good clinical outcomes, with higher discriminatory power than conventional models [20]; however, the DOR is usually used as an output statistic rather than a summary test statistic, because the same DOR can be achieved at different combinations of sensitivity and specificity. The random effects model demonstrated a pooled sensitivity and specificity of 0.795 and 0.780, respectively, each nearing 80%. On the other hand, traditional predictive models such as ASPECTS, conventionally used to exclude patients from reperfusion therapy, showed a higher sensitivity of 0.91 and specificity of 0.88 than the ML models [21]. We propose that the differences may be due to high heterogeneity across the studies analyzed. In the forest plot for sensitivity in Fig. 3, the first three cohorts by Alawieh and Asadi (2014a) showed much higher sensitivities of around 0.90 than the remaining two cohorts. Conversely, Nishi and Alawieh (2018b) showed much higher specificities of around 0.90 than the remaining three cohorts. These values are comparable to those of traditional models like ASPECTS. Hence, we note that a large heterogeneity exists across ML models, affecting the accuracy of the pooled prognostic estimate [22].
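The point that one DOR can correspond to several sensitivity and specificity pairs can be made concrete with a short Python sketch. It uses the standard definition DOR = [sensitivity / (1 − sensitivity)] × [specificity / (1 − specificity)]; the function names are hypothetical. Because the DOR, sensitivity, and specificity were each pooled separately in this analysis, the DOR computed from the pooled sensitivity and specificity (about 13.7) need not equal the pooled DOR of 12.6.

```python
def diagnostic_odds_ratio(sensitivity: float, specificity: float) -> float:
    """DOR as the product of the odds of sensitivity and the odds of specificity."""
    return (sensitivity / (1 - sensitivity)) * (specificity / (1 - specificity))


def specificity_for_dor(dor: float, sensitivity: float) -> float:
    """Specificity that yields a given DOR at a chosen sensitivity."""
    spec_odds = dor / (sensitivity / (1 - sensitivity))
    return spec_odds / (1 + spec_odds)


# DOR implied by the pooled sensitivity and specificity reported above.
implied_dor = diagnostic_odds_ratio(0.795, 0.780)

# Two different operating points that share the pooled DOR of 12.6:
# a high-sensitivity test and a high-specificity test.
spec_at_high_sens = specificity_for_dor(12.6, sensitivity=0.90)
spec_at_low_sens = specificity_for_dor(12.6, sensitivity=0.60)
```

At a sensitivity of 0.90 the matching specificity is roughly 0.58, whereas at a sensitivity of 0.60 it is roughly 0.89, which is why the DOR alone does not describe where a model operates.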

Additionally, accounting for the prevalence of good and poor clinical outcomes amongst acute ischemic stroke patients, we derived a pooled positive predictive value and negative predictive value of 0.697 and 0.874, respectively. A high negative predictive value of 87% shows that a high proportion of patients who are predicted by ML to have poor outcomes do in fact suffer poor outcomes post-thrombectomy, hence guiding the risk-benefit decision-making process for thrombectomy [23]. Hence, we propose that current ML models can be used as an adjunctive clinical tool with a high discriminatory value to predict the suitability of an AIS patient for an EVT procedure. Using ML algorithms, patients with a predicted poor mRS at 90 days may be assessed to be at a higher risk of poorer outcomes, corresponding to a lower benefit from the intervention. EVT may therefore be deemed unsuitable if the risk outweighs the potential benefit.
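The dependence of the predictive values on prevalence can be checked with a back-of-the-envelope calculation using Bayes' theorem. The Python sketch below is a consistency check only, not the method used for pooling: it assumes that the pooled sensitivity (0.795) and specificity (0.780) apply jointly and that the 37.7% rate of good functional outcome can be taken as the prevalence. The function name is hypothetical.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) for a given prevalence of the positive class."""
    tp = sensitivity * prevalence              # true positives per patient
    fn = (1 - sensitivity) * prevalence        # false negatives per patient
    tn = specificity * (1 - prevalence)        # true negatives per patient
    fp = (1 - specificity) * (1 - prevalence)  # false positives per patient
    return tp / (tp + fp), tn / (tn + fn)


# Pooled estimates reported above, with good outcome (mRS 0-2) as the positive class.
ppv, npv = predictive_values(sensitivity=0.795, specificity=0.780, prevalence=0.377)
```

Under these assumptions the calculation gives a PPV of about 0.69 and an NPV of about 0.86, close to the directly pooled values of 0.697 and 0.874. In a population with a different proportion of good outcomes, the same sensitivity and specificity would yield different predictive values.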

Recent studies showed that due to its high discriminatory value, ML has the potential to be more effective compared to older regression models in predicting the clinical outcome in other diseases [24]. In stroke patients, ML has been shown to translate data like spatial regularization of diffusion-weighted index into statistical data [25], and imaging algorithms that are able to estimate the extent of potentially salvageable tissues [26]. This implies that with the correct combination of factors and algorithm, ML might be able to accurately predict outcomes of stroke patients undergoing endovascular thrombectomy.

However, we emphasize that the interplay of factors and confounders in the human body is very complex, and ML models need substantial training on large datasets to compensate for these factors [27]. This may partially explain the observed large heterogeneity across ML models in our study. Nonetheless, the converse may also hold true. With sufficient training, ML algorithms may be able to augment physicians' decision-making by accounting for relationships and interactions between different variables that the clinician may not be aware of, hence enabling clinicians to make a more informed decision about a patient’s treatment.

We postulate that the implementation of a machine learning tool into clinical practice may be less challenging than previously thought, owing to its convenience, ease of use and personalization. Currently, indications for a thrombectomy include a contraindication to tissue plasminogen activator thrombolysis, presentation within 6 h of stroke onset, a large intracranial vessel occlusion, and conventional scoring thresholds including NIHSS ≥ 10 and ASPECTS ≥ 6. We propose that machine learning can be considered as a secondary tool to indicate thrombectomy use. On notification of a patient presenting with AIS, demographic data and medical history may be automatically extracted from the hospital database into the ML algorithm. After preliminary clinical assessment and investigations are completed, the clinician may further input relevant parameters and investigation findings into the ML algorithm to quickly determine the potential benefit and hence the suitability of the patient for thrombectomy. The majority of the studies analyzed in this meta-analysis used support vector machine (SVM), a commonly used machine learning method that has previously been applied in cancer genomics owing to the strength of the algorithm and the flexibility in the data presentation [28]. Furthermore, the primary clinical outcome is mRS, a widely used clinical outcome measure in stroke patients. Together, these factors may allow a relatively smooth transition to the uptake of machine learning to aid clinicians.

Limitations

Certain limitations of this study should be acknowledged. First, only a small number of studies were included in the analysis, of which the majority utilized SVM. This may reduce the generalizability of the findings across different ML models and patient cohorts.

Second, some of the utilized clinical predictors may not have been reported fully in the papers. Furthermore, among the reported clinical predictors used, while the categories of predictors are largely similar, the individual predictors included under each category differ across studies. Hence, this may introduce heterogeneity across the studies included in the meta-analysis.

Third, while machine learning has tremendous potential, its “black box” nature is a limitation: ML models require large databases to improve their accuracy, and the true underlying relationships between influential factors remain largely unknown to the user [29, 30].

Fourth, an mRS score of 0–2 was used to represent a good outcome post-thrombectomy. The negative outcome reported may not have been a direct result of the thrombectomy, but rather a complication of an existing comorbidity, which may affect the reliability of the SVM if sufficient variables were not accounted for.

Fifth, while the quality of training data is overall high, most of the studies (n = 4) were retrospective, which may introduce bias into the training of ML models. Furthermore, software algorithms differed across studies. Hence, further prospective studies are needed to improve the training of ML models.

Conclusion

The moderate to high AUC, sensitivity, specificity, negative predictive value and diagnostic odds ratio demonstrate that ML is a good adjunct clinical tool to predict the suitability of an AIS patient for EVT. As seen in the heterogeneity of the studies analyzed, further development is required to improve the accuracy of various ML models. Training the algorithms on larger datasets and allocating more resources towards their refinement may help improve their sensitivity and specificity so that they may potentially be used as a confirmatory tool in the future.