Introduction

Patients with emergent large vessel occlusion (LVO) acute ischemic strokes (AIS) account for 24–46% of all AIS cases [1,2,3]. While there are many confounding factors, timely recanalization of such blockages is essential for optimal patient outcomes [4]. Mechanical thrombectomies (MTs) aim to restore blood flow to ischemic territories through endovascular clot retrieval [5, 6]. MTs are reported to have successful reperfusion rates between 75 and 80% [7], compared to the 30% early recanalization rate when using intravenous thrombolysis with recombinant tissue-type plasminogen activators (thrombus length < 7 mm) [8]. Thus, MTs have markedly reduced severe disability and mortality compared to intravenous thrombolysis [8] and have been established as the standard of care for LVO AIS.

Patients with LVOs undergo computed tomography (CT), magnetic resonance imaging (MRI), CT angiogram, CT perfusion, or MR angiogram (MRA) imaging to determine eligibility for the MT procedure, followed by the procedure itself. While preprocedural AIS imaging has undergone major advancements [9], intra-procedural imaging still lags behind. Currently, intra-procedural MT success is assessed primarily by grading intracranial reperfusion using cerebral digital subtraction angiography (DSA). This is done using the thrombolysis in cerebral infarction (TICI) scale as proposed by Higashida and Furlan [10] or the modified TICI (mTICI) scale [11]. mTICI grading systems have received criticism due to confusing internal inconsistencies [12] as well as bias introduced because grading relies solely on operators’ direct visual estimation [13]. Clinical trials comparing operator-scored mTICIs with core lab–scored mTICIs found only 56% agreement between the two. In 33% of these, the operator-scored mTICIs were overestimated compared to those from the core lab [14,15,16].

Intra-procedural assessment of such endovascular procedures could be improved using quantitative tools similar to CTP. However, such implementation is limited due to the 2D nature of DSA and variability caused by hand injection of contrast. Angiographic parametric imaging (API) has been proposed as an alternative solution. This form of image analysis uses a DSA sequence to semi-quantitatively analyze blood flow through the vasculature and angioarchitectures. Intensity at each pixel across the DSA sequence is measured, resulting in a time density curve (TDC) at each pixel. Parameterizing the TDCs enables the extraction of various parameters such as mean transit time, time to peak, time to arrival, peak height (PH), and area under the TDC. An API map can then be generated for each parameter and analyzed to understand the nature of flow through different vessels and phases in DSAs [17, 18]. Each map encodes one hemodynamic parameter derived from each X-ray pathway, thus making maps less sensitive to subtle flow differences. This suboptimal sensitivity could be improved using a hybrid approach where hemodynamics encoded in API are combined with a data-driven model.
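As an illustration of the TDC parameterization described above, the following minimal sketch computes three of the named parameters (peak height, time to peak, and area under the TDC) for a single pixel. The function name, the example curve, and the use of the trapezoidal rule are assumptions made for illustration; the API software used in the study may define these parameters differently.

```python
import numpy as np

def tdc_parameters(tdc, frame_rate_hz):
    """Return peak height, time to peak (s), and area under one time density curve.

    Assumes higher values mean more contrast and frames are evenly spaced.
    """
    tdc = np.asarray(tdc, dtype=float)
    dt = 1.0 / frame_rate_hz                 # time between consecutive frames
    times = np.arange(tdc.size) * dt         # acquisition time of each frame
    peak_index = int(np.argmax(tdc))
    # Trapezoidal rule written out explicitly to avoid NumPy version differences.
    area = float(np.sum((tdc[1:] + tdc[:-1]) * 0.5 * dt))
    return {
        "peak_height": float(tdc[peak_index]),
        "time_to_peak": float(times[peak_index]),
        "area_under_curve": area,
    }

# Hypothetical TDC sampled at 3 frames per second.
example_tdc = [0.0, 0.1, 0.4, 0.9, 0.6, 0.3, 0.1]
params = tdc_parameters(example_tdc, frame_rate_hz=3.0)
```

Applying such a function to every pixel yields one map per parameter.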

Data-driven models such as convolutional neural networks (CNNs) and other machine learning tools using CT or MRI data have been clinically implemented for automated stroke assessment such as the ASPECT score [19]. Following similar trends, tools have been proposed for DSA using API to make predictions regarding treatment success [20]. In this work, we present a study to test the feasibility of using CNNs with quantitative angiographic information from API to classify cerebral reperfusion during MT procedures. For the data-driven classification, the reperfusion level was assessed using the mTICI scale. However, any other outcomes, including post-procedure MRI or neurologic evaluation, could be used.

Methods

Data collection

Retrospective collection and analysis of patient data was conducted at a single center and approved by our institutional review board. The inclusion criterion was any patient with an LVO undergoing an MT. For each patient, baseline, intra-procedural, and post-MT DSAs were collected. Anteroposterior (AP) and lateral view DSAs were collected for every scan. Patients with posterior circulation occlusions were excluded. DSAs with image artifacts caused by patient motion during the scan, mainly in cases treated under conscious sedation, were also excluded from the study.

We included 192 patients with 383 angiographic runs in our final analysis. Since angiographic runs from the same patients were taken at different time points during MT procedures, they have different levels of reperfusion and can be considered separate cases. Mean patient age was 68.75 years, initial NIH Stroke Scale (NIHSS) score was 12, post-procedure NIHSS score was 4, and NIHSS shift was −7. Patient demographics, locations of LVOs, and summary of mTICI scores are displayed in Fig. 1.

Acquisition of DSA sequences for all patients was conducted using Canon Infinix biplane systems (Canon Medical Systems Corporation, Otawara, Japan). DSAs were acquired at an average tube voltage of 84.3 ± 5.0 kVp (average ± standard deviation), tube current of 149.4 ± 42.7 mA, pulse width of 84.0 ± 12.1 ms, and frame rate of 3 frames per second. Contrast used during acquisitions was iohexol (Omnipaque 350; GE Healthcare, Piscataway, NJ).

The overall study workflow is displayed in Fig. 2. For each DSA included in the analysis, an mTICI label was assigned by two experienced operators (qualified neuro-interventionalists) independently of each other. The operators graded every case and used the AP and lateral full DSA sequences. Operators were not involved in the procedure and were blinded to clinical outcomes and PH maps. Grading was performed according to the following 6 categories: no perfusion (grade 0), partial perfusion beyond initial occlusion but not in distal arteries (grade 1), partial perfusion less than 50% (grade 2a), partial perfusion more than 50% but less than full (grade 2b), complete but delayed perfusion (grade 2c), and complete perfusion (grade 3) (Fig. 2) [21]. Cases with disagreements in mTICI labels were resolved by consensus decision. This was done to remove any bias in labels used for training the network.

Fig. 1

Patient demographic information. p refers to the number of patients, n refers to the number of DSAs. DSA digital subtraction angiography, ICA internal carotid artery

Fig. 2

Workflow of the study. DSA digital subtraction angiography, PH peak height

API map generation

Reperfusion evaluation with mTICI scores was done based on the extent of tissue perfusion as represented by the capillary blush in DSAs [10, 22]. Each DSA was cropped to only include frames where contrast was in the late arterial and capillary phases. Thus, API maps contained a limited number of overlapping structures from early arterial or venous phases. Given the three frames per second acquisitions, arterial structures were always present in the final API maps. The temporal cropping was done by an operator with 3 years of experience working with DSAs.

TDCs were extracted at each pixel by tracking the flow of contrast across frames in the cropped DSA sequence. PH maps were generated by calculating the maximum value of the TDC at each pixel. For the purpose of this feasibility study, only PH maps were considered, as PH reflects the maximum contrast intensity in each pixel across all frames and is thus most reflective of perfusion.
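The PH map computation can be sketched as a per-pixel maximum over the temporally cropped frames. This is a minimal sketch, assuming the DSA run is stored as an array of shape (frames, rows, columns) in which higher values mean more contrast; the function name and the toy data are hypothetical.

```python
import numpy as np

def peak_height_map(dsa_frames, first_frame, last_frame):
    """Per-pixel peak of the time density curve over a cropped frame range.

    dsa_frames: array of shape (frames, rows, cols).
    first_frame, last_frame: inclusive indices chosen by the operator.
    """
    frames = np.asarray(dsa_frames, dtype=float)
    cropped = frames[first_frame:last_frame + 1]   # temporal cropping
    return cropped.max(axis=0)                     # maximum of each pixel's TDC

# Toy sequence: 5 frames of 2 x 2 pixels.
frames = np.zeros((5, 2, 2))
frames[:, 0, 0] = [0, 1, 5, 2, 0]   # peak inside the cropped window
frames[:, 1, 1] = [9, 1, 2, 3, 0]   # largest value lies outside the window
ph_map = peak_height_map(frames, first_frame=1, last_frame=3)
```

In the toy data, the value 9 in the first frame is excluded by the cropping, so the bottom-right pixel takes its peak from the retained frames only.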

Since hand contrast injection was used for these emergent cases, injection parameters such as concentration, volume, and injection rate were highly heterogeneous between DSA acquisitions. To account for this variability, every pixel value in the PH map was divided by the PH value in the main feeding artery, thus normalizing each map to contrast concentration in the respective inlet vessel. The location for normalization was manually selected by an operator with 3 years of experience working with DSAs and API. The location was chosen on a straight portion of the main feeding artery before it split into its respective branches. Care was taken to avoid tortuous or overlapping structures that would affect the quantitative angiographic values, as well as any regions of image artifacts.
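The inlet normalization described above amounts to dividing the whole map by one reference value. The sketch below assumes a single manually chosen pixel serves as the reference; the function name and the example values are hypothetical.

```python
import numpy as np

def normalize_ph_map(ph_map, inlet_row, inlet_col):
    """Divide every pixel of a PH map by the PH value at the inlet location."""
    ph_map = np.asarray(ph_map, dtype=float)
    inlet_value = ph_map[inlet_row, inlet_col]
    if inlet_value <= 0:
        # A non-positive reference would make the normalized map meaningless.
        raise ValueError("inlet PH value must be positive")
    return ph_map / inlet_value

# Hypothetical 2 x 2 PH map with the inlet vessel at pixel (0, 0).
raw_map = np.array([[4.0, 2.0],
                    [1.0, 0.0]])
normalized_map = normalize_ph_map(raw_map, inlet_row=0, inlet_col=0)
```

After normalization the inlet pixel equals 1, and all other pixels are expressed as fractions of the inlet peak height.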

Network development

A CNN was developed using Keras [23] to classify PH maps based on the reperfusion level. CNN architecture development was an iterative process based on the optimization of metrics such as classification accuracy and receiver operating characteristic (ROC) curves. The final architecture is displayed in Fig. 2. The optimizer used during training was the Adam optimizer with an initial learning rate of 10⁻³. The loss function used was the categorical cross-entropy. Keras callbacks were used to reduce the learning rate as training progressed and to automatically terminate training once the loss plateaued. In addition, class imbalance was accommodated by implementing balanced class weighting during training. Thus, the loss was weighted so that under- and overrepresented classes in the training set were penalized equally. CNNs were trained and tested on a single NVIDIA (Nvidia Corporation, Santa Clara, CA) Tesla V100 graphics processing unit.
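One common way to obtain balanced class weights is to set each weight inversely proportional to the class frequency, w_c = N / (K · n_c), where N is the number of cases, K the number of classes, and n_c the count of class c. The sketch below assumes this formula; the exact weighting used in the study is not specified. For illustration it uses the full-dataset 2-outcome counts (169 and 214), whereas in practice the weights would be computed on the training split.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency: N / (K * n_c)."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)             # cases per class index
    n_classes = counts.size
    weights = labels.size / (n_classes * counts)
    return {int(c): float(w) for c, w in enumerate(weights)}

# Illustration: class 0 = mTICI 0,1,2a (169 cases), class 1 = mTICI 2b,2c,3 (214 cases).
labels_2 = np.repeat([0, 1], [169, 214])
weights_2 = balanced_class_weights(labels_2)
```

With these weights, each class contributes the same total weight to the loss. A dictionary of this form can be passed to the `class_weight` argument of Keras `Model.fit`.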

Following guidelines proposed by Radiology [24] and in order to prevent network overfitting, we split the dataset with 70% (268 cases) reserved for training, 10% (39 cases) reserved as a validation set used for hyperparameter tuning during training, and 20% (77 cases) reserved for testing. Hyperparameter tuning during training was done by tracking loss on the validation set. To test network robustness and ensure that results were not based on specific training-testing splits, a 20-fold Monte Carlo cross-validation (MCCV) [25] was conducted. This approach involved randomly splitting the total dataset into training and testing sets 20 times.
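The repeated random splitting can be sketched as follows. This is a minimal sketch, assuming splits are drawn independently at the case level with fixed 70%/10%/20% proportions; the function name, the seed, and the rounding of subset sizes are assumptions, so subset sizes may differ by one case from those reported.

```python
import numpy as np

def monte_carlo_splits(n_cases, n_repeats=20, train_frac=0.7, val_frac=0.1, seed=0):
    """Yield (train, validation, test) index arrays for each random repeat."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_frac * n_cases))
    n_val = int(round(val_frac * n_cases))
    for _ in range(n_repeats):
        order = rng.permutation(n_cases)          # fresh shuffle for every repeat
        train = order[:n_train]
        val = order[n_train:n_train + n_val]
        test = order[n_train + n_val:]            # remaining cases form the test set
        yield train, val, test

splits = list(monte_carlo_splits(n_cases=383))
```

Each repeat trains and evaluates a fresh network, and the reported metrics are averaged over the repeats.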

In order to combine information contained in both the AP and lateral PH maps, we used ensembled networks. This method involves a combination of predictions from multiple networks to give a final prediction by assigning a weight to predictions from each network. One network was trained on AP PH maps, and another network with the same architecture was trained on lateral PH maps. Weights to assign to predictions from individual networks were calculated using a differential evolution optimization algorithm [26] implemented using SciPy [27]. Once the weights were calculated, they were used to combine predictions from the AP and lateral networks to give weighted ensembled predictions.
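The weighting step can be sketched with SciPy's `differential_evolution`. The sketch below is illustrative only: it assumes a single weight w for the AP network (with 1 − w for the lateral network), uses cross-entropy as the objective, and runs on synthetic predictions; the objective, weight parameterization, and data used in the study are not specified in the text.

```python
import numpy as np
from scipy.optimize import differential_evolution

def ensemble(p_ap, p_lat, w_ap):
    """Weighted combination of per-class probabilities from the two networks."""
    return w_ap * p_ap + (1.0 - w_ap) * p_lat

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    picked = probs[np.arange(labels.size), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def fit_ensemble_weight(p_ap, p_lat, labels, seed=0):
    """Search w in [0, 1] that minimizes the ensemble cross-entropy."""
    result = differential_evolution(
        lambda w: cross_entropy(ensemble(p_ap, p_lat, w[0]), labels),
        bounds=[(0.0, 1.0)],
        seed=seed,
    )
    return float(result.x[0])

# Synthetic stand-ins for the softmax outputs of the AP and lateral networks.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)

def fake_probs(noise):
    p1 = np.clip(0.5 + 0.2 * (2 * labels - 1) + rng.normal(0, noise, labels.size), 0.01, 0.99)
    return np.column_stack([1.0 - p1, p1])

p_ap = fake_probs(0.25)
p_lat = fake_probs(0.20)
w_ap = fit_ensemble_weight(p_ap, p_lat, labels)
ensemble_probs = ensemble(p_ap, p_lat, w_ap)
```

Because w = 0 and w = 1 reproduce the individual networks, the fitted ensemble should not have a higher loss than either network on the data used for fitting. Whether that holds on unseen cases has to be checked on a held-out set.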

CNNs can automate outcome predictions and quantitative assessment of lesions such as intracranial aneurysms [20]; however, as a self-controlling minimization technique, they do not allow users to oversee which image features are most important and thus how to improve network predictions. We investigated the use of class activation maps (CAMs) to visualize regions of PH maps that trigger the trained algorithm, thus lending insight into whether the CNN makes decisions based on flow or on some other portion of the image which may not be as predictive of reperfusion. CAMs were generated using a method described by Zhou et al. [28]. They are obtained by taking outputs from the final convolutional layer and passing them through a global average pooling layer. CAMs are heatmaps in which high-intensity pixels mark features weighted highly towards the network classification output.
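In the formulation of Zhou et al., the CAM for a class is the sum of the final convolutional feature maps, each weighted by the corresponding weight of the dense layer that follows global average pooling. The sketch below illustrates that weighted sum with NumPy on toy arrays; the array shapes, names, and the min-max scaling for display are assumptions, and no trained network is involved.

```python
import numpy as np

def class_activation_map(feature_maps, dense_weights, class_index):
    """Weighted sum of final-layer feature maps for one class, scaled to [0, 1].

    feature_maps: array of shape (rows, cols, channels) from the last conv layer.
    dense_weights: array of shape (channels, n_classes) from the layer after pooling.
    """
    cam = np.tensordot(feature_maps, dense_weights[:, class_index], axes=([2], [0]))
    cam = cam - cam.min()                    # shift so the minimum is zero
    peak = cam.max()
    return cam / peak if peak > 0 else cam   # scale for display as a heatmap

# Toy example: two channels, each responding at a different pixel.
features = np.zeros((2, 2, 2))
features[0, 0, 0] = 3.0
features[1, 1, 1] = 2.0
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
cam_class0 = class_activation_map(features, weights, class_index=0)
cam_class1 = class_activation_map(features, weights, class_index=1)
```

In practice the low-resolution CAM is upsampled to the input size and overlaid on the PH map.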

Statistical analysis

The CNN was evaluated using quantitative metrics including classification accuracy, ROC curves, area under the ROC curves (AUROC), sensitivity, specificity, and Matthews correlation coefficient (MCC). Each of these metrics was averaged using results over the 20-fold MCCV. MCC is used in machine learning models to evaluate the quality of binary classifications [29]. It has proven to be advantageous as it takes into account class imbalance and uses every factor in the confusion matrix (true positives, false positives, true negatives, and false negatives) [30].
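For a binary confusion matrix, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below computes it alongside accuracy, sensitivity, and specificity; the function name and the example counts are hypothetical and are not results from the study.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion matrix counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }

# Hypothetical test-set counts, for illustration only.
example_metrics = binary_metrics(tp=40, fp=10, tn=20, fn=7)
perfect_metrics = binary_metrics(tp=10, fp=0, tn=10, fn=0)
chance_metrics = binary_metrics(tp=5, fp=5, tn=5, fn=5)
```

A perfect classifier gives an MCC of 1, and a classifier whose predictions are unrelated to the labels gives an MCC of 0.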

Currently, when classifying intra-procedural DSAs as having sufficient or insufficient reperfusion to determine the need for further treatment, clinicians use either a 2-outcome grouping where mTICI 0,1,2a is clinically insufficient reperfusion and mTICI 2b,2c,3 is sufficient reperfusion [31] or a 3-outcome grouping where mTICI 0,1,2a is insufficient reperfusion, mTICI 2c,3 is sufficient reperfusion, and mTICI 2b is either sufficient or insufficient reperfusion, and the need for further treatment is decided based on other factors [32]. Thus, in addition to a 2-outcome classification between mTICI 0,1,2a and mTICI 2b,2c,3, we also investigated a 3-outcome classification between mTICI 0,1,2a, mTICI 2b, and mTICI 2c,3. In addition, subgroup analysis was conducted using the AP and lateral view networks independently and using both views combined in the ensembled network. Two-tailed McNemar’s test p values were calculated in order to evaluate the significance of any performance differences (significance threshold p < 0.05). McNemar’s test p values were also calculated between networks using temporally cropped DSAs and networks using uncropped DSAs in order to test the effect of temporally cropping out arterial and venous phases from the DSAs prior to PH map generation.
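McNemar's test compares two classifiers on the same cases using only the discordant pairs: cases that one classifier gets right and the other gets wrong. The sketch below implements the exact (binomial) two-sided form; whether the study used the exact or the chi-square form is not stated, and the function names and example predictions are hypothetical.

```python
from math import comb

def discordant_counts(truth, pred_a, pred_b):
    """Count cases where exactly one of the two classifiers is correct."""
    b = sum(1 for t, a, p in zip(truth, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(truth, pred_a, pred_b) if a != t and p == t)
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0                      # the classifiers never disagree in correctness
    k = min(b, c)
    # Binomial tail probability with success probability 0.5 under the null hypothesis.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical labels and predictions from two networks on the same six cases.
truth  = [0, 0, 1, 1, 1, 0]
pred_a = [0, 1, 1, 0, 1, 0]
pred_b = [0, 0, 1, 1, 1, 1]
b, c = discordant_counts(truth, pred_a, pred_b)
p_value = mcnemar_exact(b, c)
```

Cases that both classifiers classify correctly, or both incorrectly, do not influence the p value.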

Results

Network performance

The algorithm took 0.25 s to create and normalize each API map. The AP network took 9.2 min to train, and the lateral network took 9.3 min. The ensemble weights were calculated in 0.65 s, and a single case could be classified by the network in 0.6 ms. Average values for each evaluation metric along with their standard deviations and 95% confidence intervals are displayed in Table 1. Peak network performance was achieved when making a 2-outcome classification using an ensembled network that combined classifications from both the AP and lateral view networks. While better performance was achieved in terms of accuracy, AUROC, MCC, and sensitivity for 2-outcome classifications, better specificity was observed for 3-outcome classifications.

Table 1 Convolutional neural network performance in classifying DSAs based on their level of reperfusion based on the mTICI scale. Performance is displayed in the form of average accuracies, area under the receiver operating characteristic curves (AUROC), Matthews correlation coefficients (MCC), sensitivities, and specificities along with their standard deviations and 95% confidence intervals (CI). (A) Two-outcome classification (mTICI grade 0,1,2a versus mTICI grade 2b,2c,3). (B) Three-outcome classification (mTICI grade 0,1,2a versus mTICI grade 2b versus mTICI grade 2c,3). The 3-outcome classification requires a ROC curve for each outcome; thus, there is an AUROC for each outcome in (B). The best results are in italic. The results indicate that the best performance is achieved when making a 2-outcome classification using an ensembled network

Performance was also evaluated with ROC curves, which are displayed in Fig. 3. The highest AUROC values were achieved when using ensembled networks. In each plot, ROC curves for the subgroups are similar, with overlapping standard deviations; however, McNemar’s test p values indicate a significant advantage towards using the lateral view over the AP view (p < 0.05) and towards using ensembled networks over AP or lateral view networks independently (p < 0.05) for both the 2-outcome and 3-outcome classifications.

Fig. 3

Receiver operating characteristic (ROC) curves generated from the classifications of the convolutional neural network (CNN). a The ROC curves obtained for each subgroup when making a 2-class classification. b The ROC curves for each subgroup for the mTICI 0,1,2a class when making a 3-class classification. c The ROC curves for each subgroup for the mTICI 2b class when making a 3-class classification. d The ROC curves for each subgroup for the mTICI 2c,3 class when making a 3-class classification. The shaded region around the ROC curve depicts the standard deviations at each point. High AUROC values and thin spread of standard deviations in a indicate that the best performance is achieved when making a 2-class classification. In all 4 subplots, while the ROC curves and the standard deviations overlap between the three subgroups, McNemar’s test p values indicate significant improvement in the performance of ensembled networks over AP and lateral view networks

McNemar’s test p values between the temporally cropped and uncropped AP, lateral, and ensembled networks were 0.16, 0.23, and 0.05, respectively. Since these values are above or equal to the 0.05 threshold, there is no significant advantage towards using temporally cropped DSAs.

Class activation maps

CAMs for two different cases were generated to visualize how the CNN makes its classifications and are displayed in Fig. 4. The CAMs were able to display what PH map regions were being used by the CNN to make the classification decision. The contrast in the vasculature activated the network, with larger vessels having a higher activation. Thus, the network is looking at the presence of the contrast and vasculature to make classification decisions. The internal carotid artery (ICA) terminus, middle cerebral artery (MCA) presence, and MCA territory seemed to be the greatest contributors towards network classification tasks.

Fig. 4

a, b Two examples showing the anteroposterior (AP) and lateral digital subtraction angiography (DSA) sequences, the normalized peak height (PH) maps generated from those sequences, the class activation maps (CAMs), and the classifications from the convolutional neural network (CNN) for each view and the ensembled CNN for both views. In a, the AP view CNN incorrectly classifies the PH map as being mTICI 2b,2c,3 while the lateral view CNN and ensembled CNN both correctly classify the PH map as mTICI 0,1,2a. In b, the lateral view CNN incorrectly classifies the PH map as being mTICI 0,1,2a while the AP view CNN and ensembled CNN both correctly classify the PH map as mTICI 2b,2c,3. This shows that misclassifications can occur when either the AP or lateral views are used independently; however, when information from both views is combined using an ensembled network, the tool is able to correctly classify the DSA into the appropriate group. In each CNN classification table, the green highlight indicates the network classification. All the CAMs show that the activation occurs in the vessels, with the larger vessels causing a higher activation

Discussion

An objective and unbiased assessment of reperfusion status during MT is critical for the estimation of clinical prognosis and documentation for research purposes. In this study, we investigated the technical feasibility of using a CNN with quantitative angiographic information to assess the level of cerebral perfusion for patients undergoing an MT to treat an LVO AIS. We successfully classified PH maps generated from DSAs during an MT into 2-outcome categories (mTICI 0,1,2a and mTICI 2b,2c,3) or 3-outcome categories (mTICI 0,1,2a, mTICI 2b, and mTICI 2c,3). This indicates that data-driven models such as CNNs can be used to derive hemodynamic information encoded in API and make decisions regarding the nature of cerebral blood flow.

Numerical results for the five evaluation metrics are displayed in Table 1, and ROC curves are displayed in Fig. 3. Peak performance was observed when making a 2-outcome classification using ensembled networks, where information from both the AP and lateral view networks was used to provide a final classification. This was evident from the high average values for each metric along with small standard deviations and tight confidence intervals. The numerical results, intersecting ROC curves, and overlapping standard deviations may indicate similar performance between each subgroup; however, McNemar’s test p values (p < 0.05) indicate a significant advantage to using ensembled networks over AP and lateral networks independently. In addition to commonly used evaluation metrics, we also calculated MCC. Since MCC is an application of the Pearson correlation coefficient [33], it follows the same patterns in terms of inferring correlation strength between classifications and ground truth [34]. MCC values indicate strong positive relationships for 2-outcome classifications and moderate positive relationships for 3-outcome classifications. Network performance on 3-outcome classifications is lower for each subgroup; however, it is still within an acceptable range given this is a feasibility study. The lower performance on 3-outcome classifications can be attributed to the lower number of cases in each outcome (169 cases:140 cases:74 cases) compared to 2-outcome classification (169 cases:214 cases) and to the increased task complexity of creating a finer classification. Increasing dataset size and increasing the number of cases in each specific class should allow us to achieve higher performance on 3-outcome classifications.

In order to understand which PH map features the CNN uses to make decisions, we generated CAMs. Two specific cases, including input PH maps, CAMs, and final classification probabilities from the CNN for those cases, are analyzed and displayed in Fig. 4. In both cases, the ensembled network was able to correctly classify the input maps into mTICI 0,1,2a or mTICI 2b,2c,3 outcomes. In all 4 maps, we observed that the activated image regions were vessels, with higher activations in larger vessels. Thus, the network is making its decision based on image intensities in the vasculature. Using this method, we are able to interpret the CNN output and accept the result, or reject it if the salient features do not match clinical experience. CAMs could also be used to optimize input data to improve network performance [35]. For instance, in Fig. 4, the activation of some extracranial regions is observed. Since common sense dictates that those regions should have no contribution to classification, cropping the regions should improve the data-driven model performance. Figure 4 also displays the advantages of using an ensembled network which combines information from both the AP and lateral view networks. The figure shows that misclassifications can occur when either the AP or lateral view networks are used independently; however, when information from both networks is combined using an ensembled method, the PH map is correctly classified into the appropriate group.

The level of reperfusion was classified based on the mTICI scale which has its drawbacks [12, 13]. In addition, neuro-interventionalists can currently perform the classification themselves by visual assessment of the DSAs. However, an automated process trained using labels provided by a core lab or experts in the field could provide an objective tool across many institutions and users. This study is useful as it can be replicated for any other outcome scale such as post-op MRI or neurological evaluations; however, these are not intra-procedurally acquired. Further investigations need to be conducted using other outcome scales to classify reperfusion levels.

There are some limitations to this study. First, we are only using 383 angiograms (268 for training, 39 for in-training hyperparameter tuning or validation, and 77 for testing) that were all collected from the same center. Thus, we are currently limited to demonstrating only a technical feasibility study of using CNNs to assess the reperfusion level. Second, we are currently only using PH maps for this feasibility study; other API maps such as mean transit time, time to peak, and area under the TDC can also be derived and may be used in conjunction with PH maps to boost the performance of the network. Third, we are currently doing only a 2- or 3-outcome classification instead of a full-range mTICI scale. This is due to the low number of cases per outcome (mTICI 0, 82; mTICI 1, 5; mTICI 2a, 82; mTICI 2b, 140; mTICI 2c, 48; and mTICI 3, 26) which leads to a decrease in performance, as seen when going from a 2- to 3-outcome classification. Fourth, we are currently not identifying the location of LVOs, rather just the reperfusion status. Fifth, preprocessing methods such as cropping of DSAs to exclude early arterial and venous phases and identifying inlet vessels for normalization of PH maps are currently not automated. Lastly, the current normalization process only uses one point from the main feeding artery; we will investigate averaging over multiple points from the main feeding artery; this may provide a more effective normalization.
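The per-grade counts listed above can be aggregated into the 2- and 3-outcome groupings used in this study, which gives a quick check of the class ratios. The sketch below takes the counts and group definitions from the text; the variable and function names are hypothetical.

```python
# Number of DSAs per mTICI grade, as listed in the text.
GRADE_COUNTS = {"0": 82, "1": 5, "2a": 82, "2b": 140, "2c": 48, "3": 26}

# Grade-to-group assignments for the two classification tasks.
TWO_OUTCOME = {"0": "0,1,2a", "1": "0,1,2a", "2a": "0,1,2a",
               "2b": "2b,2c,3", "2c": "2b,2c,3", "3": "2b,2c,3"}
THREE_OUTCOME = {"0": "0,1,2a", "1": "0,1,2a", "2a": "0,1,2a",
                 "2b": "2b", "2c": "2c,3", "3": "2c,3"}

def group_counts(grade_counts, grouping):
    """Sum per-grade counts into the groups defined by a grade-to-group mapping."""
    totals = {}
    for grade, count in grade_counts.items():
        group = grouping[grade]
        totals[group] = totals.get(group, 0) + count
    return totals

two_outcome_counts = group_counts(GRADE_COUNTS, TWO_OUTCOME)
three_outcome_counts = group_counts(GRADE_COUNTS, THREE_OUTCOME)
```

The totals match the 169:214 and 169:140:74 ratios quoted in the Discussion and sum to the 383 angiograms in the dataset.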

This study demonstrates the feasibility of using CNNs to extract encoded hemodynamic information from API by assessing the level of reperfusion during an MT in patients with an LVO AIS. While this study provides neuro-interventionalists with a more robust tool to evaluate the level of reperfusion during MTs rather than relying solely on subjective assessment of DSAs, it also demonstrates the feasibility of using CNNs with API maps, an approach that could possibly be used for other endovascular interventions.

Conclusion

This is a novel attempt at using a data-driven approach to classify DSAs based on the nature of flow in the neuro-vasculature by extracting hemodynamic information encoded in quantitative angiographic maps. In this study, we demonstrated the feasibility of this approach to make decisions regarding the reperfusion status of patients undergoing an MT to treat an emergent LVO AIS. The CNN succeeded in making this assessment with an accuracy of 81.0%, AUROC of 0.86, and MCC of 0.62 when making a 2-outcome classification. When making a 3-outcome classification, the network succeeded with an accuracy of 64.0%, AUROC of 0.85 for mTICI 0,1,2a, AUROC of 0.74 for mTICI 2b, AUROC of 0.78 for mTICI 2c,3, and an MCC of 0.43.