
1 Introduction

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a global impact like no other pandemic or disease in modern times. COVID-19 patients present a clinical picture similar to that of the other two coronaviruses that have caused pandemic diseases in previous years (SARS-CoV, MERS-CoV) [10]. Patients initially manifest a respiratory infection that can lead to viral pneumonia, which can culminate in acute respiratory distress syndrome (ARDS) [15, 20]. Early prediction of disease progression and severity assessment in at-risk patients can help determine and plan healthcare needs related to hospital admission, mechanical ventilation, and intensive care monitoring.

Several approaches have been taken to implement machine learning methods for COVID-19 patient outcome prediction [14]. For example, Chao et al. [2] developed a framework integrating various features extracted from chest CT scans with patient demographic information, vital signs, and laboratory blood exams to assess disease severity and predict intensive care unit (ICU) admission. In Tang et al. [18], a random forest model was trained using quantitative features from the lungs to assess the severity of the disease. These features were obtained from a deep learning-based chest CT analysis tool. Other approaches have focused on developing similar deep learning-based tools to extract relevant information from chest CT scans and then using the obtained features for disease severity assessment. Fang et al. [4] employed a deep learning approach to segment the lung lobes and pulmonary opacities and computed severity scores from the segmentations. Afterward, random forest, SVM, and logistic regression models were trained with the severity scores to predict the patient outcome. However, the existing works mainly focus on imaging features from the lungs. COVID-19 severity prognosis has been associated with multiple comorbidities through systematic reviews, meta-analyses, and retrospective cohorts [6, 7, 8, 12, 17, 19]. Among the comorbidities, cardiovascular disease (CVD) was found to have a risk ratio ranging from 2.25 to 3.15 [6, 12, 17] and a prevalence ranging from 5.8% to 25% [7, 8, 19]. It is thus important to take CVD risk into account when assessing patient prognosis.

Fig. 1. The proposed framework, which uses both imaging and non-imaging features, including lung severity and CVD risk, for COVID-19 outcome prediction.

In this work, we combine features from CVD analysis, chest CT images, and non-imaging data to predict the outcome of COVID-19 patients through machine learning, as shown in Fig. 1. By leveraging the correlation between CVD and COVID-19 severity, our model can more accurately separate patients with high mortality risk. The rest of this paper is organized as follows. First, the datasets employed are presented, and then an exploratory analysis of the variables is introduced. Next, the feature selection and machine learning methods used for outcome prediction are discussed. Then the experimental results are presented, which show the performance improvement obtained by including the CVD severity score. Finally, the findings and conclusions of this paper are discussed.

2 Materials and Methods

2.1 Datasets

The data employed in this work were obtained from two hospitals: Firoozgar Hospital (Tehran, Iran) and Massachusetts General Hospital (MGH) (Boston, MA, USA). These datasets comprised patients’ demographics (age, sex), laboratory blood tests (white blood cell count, lymphocyte count, lymphocyte ratio), coronary artery calcification (CAC) scores, and the outcome of the patients (discharged, deceased). Moreover, non-contrast chest CT scans (acquired without intravenous contrast injection) were provided. The Firoozgar Hospital dataset included data from 113 patients, while the MGH dataset comprised 125 patients. Both datasets were combined into one larger dataset of 238 patients. CT scans were manually inspected for lines, tubes, and imaging artifacts, and scans containing any of these were removed from the dataset. A resulting total of 208 patients were available for outcome prediction. In the combined dataset, 108 patients were discharged and the remaining 100 were deceased. Other relevant demographic statistics of the patient population are presented in Table 1.

Table 1. Demographic statistics of the dataset.

2.2 Outcome Prediction Methods

To perform the outcome prediction, the imaging and metadata features were divided into three categories. 1) First, all the non-imaging features from demographics, vital signs, and blood examination (DVB) were placed together. These features are age, sex, white blood cell count, lymphocyte count, and lymphocyte ratio. 2) The second category consisted of 64 hierarchical lobe-wise quantification (HLQ) features calculated with the approach introduced in [2]. The five lung lobes and pulmonary opacities are first segmented by a deep neural network [5]. Then, in 8 regions of interest (ROIs) (whole lung, left and right lung, and 5 lung lobes), the volumes of pulmonary opacities grouped under 4 Hounsfield unit (HU) ranges and their ratios to the corresponding ROIs are calculated (\(8 \times 4 \times 2 = 64\)). 3) The third category was the collection of CVD severity scores calculated from the chest CT scans.
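The HLQ feature count can be illustrated with a minimal sketch. The HU range boundaries, the ROI names, the feature naming scheme, and the input layout below are all hypothetical; the text only states that 8 ROIs, 4 HU ranges, and 2 measures (volume and ratio) are used, and the actual computation follows [2].

```python
# Hypothetical HU ranges: the text states only that 4 ranges are used.
HU_RANGES = [(-1000, -750), (-750, -300), (-300, 50), (50, 1000)]

# 8 ROIs: whole lung, left and right lung, and the 5 lobes (names are illustrative).
ROIS = ["whole_lung", "left_lung", "right_lung",
        "lobe_1", "lobe_2", "lobe_3", "lobe_4", "lobe_5"]


def hlq_features(opacity_hu, roi_voxel_counts, voxel_volume_ml=1.0):
    """Compute opacity volume and opacity-to-ROI ratio per ROI and HU range.

    opacity_hu: dict mapping ROI name -> list of HU values of opacity voxels.
    roi_voxel_counts: dict mapping ROI name -> total number of voxels in the ROI.
    """
    features = {}
    for roi in ROIS:
        for lo, hi in HU_RANGES:
            # Count opacity voxels whose intensity falls in [lo, hi).
            n = sum(lo <= hu < hi for hu in opacity_hu[roi])
            features[f"{roi}_vol_{lo}_{hi}"] = n * voxel_volume_ml
            features[f"{roi}_ratio_{lo}_{hi}"] = n / roi_voxel_counts[roi]
    return features


# Toy input: the same four opacity voxels in every ROI of 100 voxels.
toy_hu = {roi: [-800, -500, -500, 0] for roi in ROIS}
toy_counts = {roi: 100 for roi in ROIS}
toy_features = hlq_features(toy_hu, toy_counts)
```

With 8 ROIs, 4 ranges, and 2 measures per pair, the returned dictionary has 64 entries, matching the feature count given above.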

In this paper, we studied two kinds of CVD scores. One is the deep learning-based CVD risk score, automatically calculated with the model introduced in [3]. The other is the CAC score manually assigned by radiologists. A radiologist from MGH categorized all 208 CT images into 4 CAC levels: normal (no calcification), minimal (calcification over less than 1/3 of the coronary arterial length), moderate (calcification over 1/3 to 2/3 of the coronary arterial length), and heavy (calcification over more than 2/3 of the coronary arterial length).
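The four CAC levels can be written as a simple rule on the calcified fraction of the coronary arterial length. This is only an illustration of the category boundaries: in the study the levels were assigned visually by a radiologist, and the function name, the handling of a fraction of exactly 1/3, and the ordinal encoding are assumptions.

```python
CAC_LEVELS = ["normal", "minimal", "moderate", "heavy"]


def cac_level(calcified_fraction):
    """Map the calcified fraction of the coronary arterial length to a CAC level.

    Assumption: a fraction of exactly 1/3 is treated as moderate, since the
    text does not say which level the boundary value belongs to.
    """
    if not 0.0 <= calcified_fraction <= 1.0:
        raise ValueError("calcified_fraction must lie in [0, 1]")
    if calcified_fraction == 0.0:
        return "normal"
    if calcified_fraction < 1 / 3:
        return "minimal"
    if calcified_fraction <= 2 / 3:
        return "moderate"
    return "heavy"


def cac_ordinal(level):
    """Encode a CAC level as an integer 0..3 (one possible model input encoding)."""
    return CAC_LEVELS.index(level)
```

For example, `cac_level(0.5)` falls in the moderate category, and `cac_ordinal("moderate")` encodes it as 2.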

As a first step, an exploratory analysis was performed to examine the relevance of the CVD risk and CAC scores for differentiating the patient outcome groups. Next, the sets of non-imaging (DVB) and imaging (HLQ) features were used as inputs for machine learning models to predict the patient outcome. In further detail, experiments were run with the non-imaging features (DVB) and lung imaging features (HLQ) individually and together to establish a baseline. Afterward, the CVD or the CAC scores were added in the following experiments to check for performance improvement. Three well-known machine learning algorithms (random forest classifier, logistic regression, and support vector machine (SVM)) were used to perform the outcome prediction. Several configurations of their parameters were tested and the one with the best performance was chosen.

2.3 Exploratory Statistical Analysis

While CVDs have been widely associated with increased risk ratios and prevalence among patients with fatal outcomes [6, 7, 8, 12, 17, 19], a key initial step was to test whether the deep learning-based CVD risk scores and the CAC scores showed significant differences between the patients with a fatal outcome and the discharged patients. First, the normality of the distributions of the CVD risk scores and CAC scores of each group was tested with a Shapiro-Wilk test (\(\alpha = 0.05\)). Because the CVD scores were not normally distributed, Mann-Whitney U tests were employed to determine whether the CVD risk scores and CAC scores differed significantly between the two groups (\(\alpha = 0.05\)). Second, to enhance the explainability of the developed model, an exploratory factor analysis [16] was performed to examine the relevance of the features and further granularities in the data. In this analysis, the loadings for each factor were first calculated to determine relations between the features present in the datasets. Groups of features were then obtained for each factor, providing an improved understanding of which features share the most similar information with each other, and especially with the features of interest, the CVD risk score and the CAC score. The factor analysis was implemented with the FactorAnalyzer Python library, using a varimax rotation. Bartlett’s sphericity test [1] and the Kaiser-Meyer-Olkin test [9] were used to test the adequacy of applying factor analysis to the datasets.
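To make the group comparison concrete, the following is a minimal standard-library sketch of a two-sided Mann-Whitney U test. It uses the normal approximation without tie or continuity correction, so its p-values can differ slightly from those of a statistics package; it is not the implementation used in the study, and the function names and toy scores are illustrative. The Shapiro-Wilk step is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def midranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U, p), where U is the smaller of the two U statistics.
    """
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean_u) / sd_u  # z <= 0 because u is the smaller statistic
    p = min(1.0, 2 * NormalDist().cdf(z))
    return u, p


# Toy scores for two fully separated outcome groups (not study data).
discharged = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
deceased = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8]
u_stat, p_value = mann_whitney_u(discharged, deceased)
significant = p_value < 0.05
```

A difference is declared significant when the p-value falls below the chosen level, \(\alpha = 0.05\) in the text.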

2.4 Feature Selection

As the ratio of available features to patients was not favorable, two feature reduction strategies were implemented depending on the machine learning method. For the SVM and logistic regression, a leave-one-out strategy was implemented to select the best set of features. The second feature reduction approach, used for the random forest, ranks features based on their Gini importance [11]. Because of the randomness of the random forest, the Gini importance of each feature might vary between runs. To alleviate such randomness, we calculated the features’ Gini importance 100 times with different random seeds. Each time, the rank of each feature was recorded. The final rank was determined by sorting the features by their total summed rank. Based on the final ranking, the model was trained and tested multiple times on the top \(K \in [1, 100]\) features and its performance was recorded. The combination of features that rendered the highest AUC was selected. Furthermore, due to the small sample size, a traditional train-test split was not ideal for independent assessment; therefore, 5-fold cross-validation was implemented for all experiments to evaluate the differences in the methods’ performance. Finally, the significance of the performance increases was tested using Mann–Whitney U tests.
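The rank aggregation described above can be sketched as follows. The importance vectors are assumed to come from repeated random forest fits with different seeds; here they are small hand-written toy values, and the function names are illustrative.

```python
def rank_features(importances):
    """Return the rank of each feature within one run (0 = most important)."""
    order = sorted(range(len(importances)), key=lambda i: -importances[i])
    ranks = [0] * len(importances)
    for rank, feature_index in enumerate(order):
        ranks[feature_index] = rank
    return ranks


def aggregate_ranking(importance_runs):
    """Sum per-run ranks and return feature indices sorted by total rank."""
    n_features = len(importance_runs[0])
    total_rank = [0] * n_features
    for importances in importance_runs:
        for feature_index, rank in enumerate(rank_features(importances)):
            total_rank[feature_index] += rank
    # A lower summed rank means the feature was consistently more important.
    return sorted(range(n_features), key=lambda i: total_rank[i])


def top_k(final_order, k):
    """Select the indices of the K best-ranked features."""
    return final_order[:k]


# Three toy runs over three features (the study used 100 seeded runs).
toy_runs = [
    [0.50, 0.30, 0.20],
    [0.40, 0.45, 0.15],
    [0.60, 0.10, 0.30],
]
final_order = aggregate_ranking(toy_runs)
```

Summing ranks rather than raw importances keeps a single run with unusually large importance values from dominating the final ordering.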

3 Results

3.1 Exploratory Analysis

The Mann-Whitney U tests revealed a significant difference between the CVD risk scores of the outcome groups in the Iran, MGH, and combined datasets, whereas they revealed no significant difference between the CAC scores in any of the datasets. Therefore, for both the individual datasets and their combination, the CVD risk score is a valuable feature for differentiating patients’ possible outcomes.

The factor analysis revealed granularities in the dataset, of particular interest for the CVD risk and CAC scores. For the Iran dataset, the factor analysis indicated that the CVD risk score (labeled as softmax_pred on the right end of the first subplot of Fig. 2) has a strong correlation with age, and a mild correlation with features such as white blood cell count and the infection volumes and ratios in lobes 2 and 3. These results are expected, as older patients have a higher prevalence of cardiovascular disorders [13]. The relation between the CVD risk score and the lung features could be explained in part by the closeness of the heart and lobe 2 (middle lobe of the right lung). In the low-dose CT scans employed to calculate the HLQ and CVD risk score features, parts of the heart and of the lung, respectively, could appear in the field of view of the neural networks producing each quantitative measurement. On the other hand, for the MGH dataset, relationships were found between the CVD risk score and the infection volumes and ratios of multiple sections of lobes 3 and 4. This can be seen in the right plot of Fig. 2. As with the Iran dataset factor analysis results, this could be explained by anatomical proximity: lobe 4 (superior lobe of the left lung) shares boundaries with the left atrium and aorta, which are major areas of interest for the CVD risk estimator network, as calcification in the aorta is a key driver of this score. Furthermore, the factor loadings also showed a mild correlation between the CVD risk score and age, following a pattern similar to that of the Iran dataset results.

Fig. 2. Factor loadings plots for the Iran (left) and MGH (right) datasets.

3.2 Inclusion of CVD Risk Score in Patient Classification

As mentioned before, multiple parameter configurations were tested for the implemented machine learning methods. All three methods were implemented with the scikit-learn package. For the SVM, the best results were obtained with an RBF kernel. For the logistic regression, the best performance was achieved with the stochastic average gradient descent solver. The random forest classifier obtained its best results with the number of estimators set to 300.
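A minimal scikit-learn sketch of these settings, evaluated with 5-fold cross-validated AUC, could look as follows. Only the RBF kernel, the `sag` solver, and the 300 estimators come from the text; the feature scaling, `max_iter`, the stratified and shuffled folds, the random seeds, and the synthetic data are assumptions added to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_models(seed=0):
    """Return the three classifiers with the settings named in the text."""
    return {
        "rf": RandomForestClassifier(n_estimators=300, random_state=seed),
        # Scaling and max_iter are assumptions that help the solvers converge.
        "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        "lr": make_pipeline(
            StandardScaler(),
            LogisticRegression(solver="sag", max_iter=5000, random_state=seed),
        ),
    }


def mean_cv_auc(X, y, seed=0):
    """Mean ROC AUC over 5 stratified folds for each model."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean())
        for name, model in build_models(seed).items()
    }


# Synthetic stand-in for a 208-patient feature matrix (not the study data).
X, y = make_classification(n_samples=208, n_features=10, random_state=0)
aucs = mean_cv_auc(X, y)
```

Comparing feature configurations then amounts to calling `mean_cv_auc` on matrices built with and without the CVD score column.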

Table 2 reports the results of the 5-fold cross-validation for each best-performing method and each set of features used as input for prediction. Most of the experiments showed increased performance, reflected in the AUC, after the CVD severity score was included in the feature set. Nevertheless, after testing for significance with a Mann–Whitney U test, only 5 of the improvements were found to be significant, mainly those obtained when all features were used. Moreover, while the inclusion of the CAC scores resulted in increased performance in some of the experiments, mainly in the HLQ+CAC category, none of the improvements were found to be significant by the Mann–Whitney U test. The lack of significant improvement when including the CAC score as a feature concurs with the results of the exploratory analysis, in which no significant differences were found in the distributions of the CAC scores between the deceased and discharged patients.

Table 2. Comparison of 5-fold cross-validation results of RF, SVM and LR using various configurations of features. Sensitivity was obtained after setting specificity = 70%, results with p-value > 0.05 are marked with *.

Fig. 3. ROC curves for random forest, SVM, and logistic regression.
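The sensitivity reported in Table 2 is read off at a fixed specificity of 70%. One possible way to compute such a value from predicted scores is sketched below; the thresholding rule (predict the positive class when the score is at or above the threshold), the label convention (1 = deceased), and the toy scores are assumptions, not the authors' procedure.

```python
def sensitivity_at_specificity(scores, labels, target_specificity=0.70):
    """Best sensitivity among thresholds whose specificity meets the target.

    scores: predicted risk scores (higher means more likely positive).
    labels: 1 for the positive class (assumed: deceased), 0 otherwise.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        # Negatives scoring below the threshold are correctly rejected.
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            sensitivity = sum(s >= threshold for s in positives) / len(positives)
            best = max(best, sensitivity)
    return best


# Toy example with 5 negatives and 5 positives (not study data).
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.4, 0.5, 0.8, 0.9, 0.95]
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
toy_sensitivity = sensitivity_at_specificity(toy_scores, toy_labels)
```

With only 5 negatives, the achievable specificities are multiples of 0.2, so the function uses the first operating point at or above 70% (here 80%).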

Table 2 and Fig. 3 show consistently increased performance for all three models when the CVD severity score was added to the DVB+HLQ configuration, with the increases for the random forest and the logistic regression being validated as significant by the Mann–Whitney U test. Moreover, this feature configuration also achieves the best performance for each method, with the highest-performing experiment being the logistic regression using all 3 sets of features. Similar increases in performance can be seen for DVB+CVD when compared to DVB alone. The improvements in AUC were found to be significant for the random forest and the SVM models. On the other hand, while the SVM and logistic regression struggled to achieve increased performance when the CVD severity score was added to the HLQ features, the random forest classifier was able to do so. Additionally, the random forest model achieved significant performance increases when the CVD severity score was included in all 3 feature combinations (DVB+CVD, HLQ+CVD, DVB+HLQ+CVD); this could be due in part to the feature selection strategy and to the robustness of the random forest that comes from its ensemble design. However, it is interesting to note that the best performance of all methods and feature combinations (0.888) was achieved by the logistic regression model when the CVD severity score was included in the DVB+HLQ configuration. This emphasizes the relevance of including the CVD severity score, as this best-performing method had a significant increase in performance compared to the classification based only on the DVB and HLQ features.

4 Discussion and Conclusion

Although the majority of patients with COVID-19 infection are either asymptomatic or present minor symptoms, healthcare systems in most countries, from the most to the least developed, were overwhelmed because the pandemic nature of COVID-19 pneumonia results in a substantial volume of patients with COVID-19 infection. Some patients present with rapid deterioration requiring hospital admission and mechanical ventilation. Therefore, understanding and predicting which patients are at risk of severe and potentially life-threatening infection is key to resource planning and prognosis prediction. One commonly recognized comorbidity associated with adverse COVID-19 outcomes is the presence of cardiovascular disease. In this work, we explored the inclusion of a deep learning-based CVD severity score to improve the prediction accuracy of patient outcome.

Key results from the experiments showed a significant increase in the performance of the random forest classifier after the CVD severity score was included as a feature for prediction. Likewise, a significant performance improvement was observed in the logistic regression when using all features together with the CVD severity score, with the latter model being the best performing of all configurations. The results confirmed that the inclusion of the CVD severity score benefits the models by increasing their ability to predict the patients’ outcome. Additionally, while the sample size used was relatively small for machine learning models, the presented results were strengthened as the experiments were validated on a combination of the datasets. This shows the relevance of including the CVD severity score for COVID-19 patient outcome prediction regardless of the patient population. It also concurs with the clinical studies presented in the introduction, which show cardiovascular diseases to be relevant comorbidities for COVID-19 patients and suggest considering cardiovascular diseases when assigning treatment to a COVID-19 patient.

Moreover, while several of the tested feature and model configurations resulted in significant performance increases, these increases are not overwhelmingly high. As discussed in the exploratory analysis results, this could be due to information shared across the multiple variables employed for the outcome prediction, as shown by the factor loadings. In addition, while the manual CAC score was also tested, the CVD severity score outperformed the CAC score across all experiments, as expected from the non-significant differences in the distributions of the CAC scores. This could be due to two main reasons. First, the CVD severity score might include additional relevant information about the presence of the comorbidity extracted from the CT scans, rather than just the calcification of the coronary artery. Second, as CAC is a categorical variable, in contrast to the continuous values of the CVD severity score, the machine learning methods could be hindered in their prediction performance.

We hope that this work can shed more light on assigning a better prognosis for COVID-19 patients, as well as on improving clinical resource management. The main takeaway is that it adds evidence to a series of clinical and computational investigations, discussed in the introduction, that indicate the significant impact of cardiovascular diseases as a risk factor for COVID-19 patients.