1 Introduction

In December 2019, a novel disease caused by a member of the coronavirus family spread among several people in Wuhan, in China's Hubei Province (Chen et al. 2020). Its clinical manifestations include fever, cough, and dyspnea, and it affects the lungs, causing pneumonia (Chung et al. 2020). The lung becomes inflamed and filled with fluid, and multiple plaque shadows and interstitial changes occur, leading to Ground Glass Opacities (GGO) (Ardakani et al. 2020; Chen et al. 2020). In severe cases, lung consolidations can occur, presenting a phenomenon called “white lung” (Chen et al. 2020). In March 2020, the WHO declared COVID-19, caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), to be a pandemic and a public health emergency of international concern. By November 7th, 2020, the epidemic had spread to more than 200 countries, with more than 49 million individuals having contracted the virus worldwide and more than 1,200,000 reported deaths (WorldOmeter).

There are four common methods to diagnose COVID-19: Reverse Transcription Polymerase Chain Reaction (RT-PCR), Computed Tomography (CT), X-ray, and the C-Reactive Protein (CRP) blood test (Fan et al. 2020). CT can play an important role in the early detection and management of COVID-19 pneumonia (Hani et al. 2020). It is more sensitive than RT-PCR and shows suggestive abnormalities even when the viral load is insufficient, in which case RT-PCR may produce falsely negative results (Hani et al. 2020; Long et al. 2020). It is also more accurate than blood tests: some cases tested twice as negative by the CRP blood test while the first CT scan diagnosed them as positive (Radiopaedia). Moreover, the accuracy of CT has been shown to be higher than that of X-rays, since in the early stages of COVID-19 a chest X-ray may appear normal, while CT conveys early signs of the disease (Rony Kampalath; Zu et al. 2020). In other cases of severe COVID-19, X-ray findings may resemble those of pneumonia or acute respiratory distress syndrome (ARDS) (Rony Kampalath; Zu et al. 2020). Thus, no confident diagnosis of COVID-19 is possible based on a chest X-ray alone (Rony Kampalath; Zu et al. 2020). In CT, overt signs of COVID-19 can be seen, which enables a more accurate and timely diagnosis compared to X-ray (Rony Kampalath; Zu et al. 2020).

Timely diagnosis can lead to better prognosis, especially if the severity of the infection can be assessed on a multilevel scale. A five-level scale was introduced (Chen et al. 2020; Hani et al. 2020) (normal, early, moderate, advanced, severe) according to the percentage of GGO and consolidation in the lung parenchyma, to help identify the risk level. Due to the increase of COVID-19 cases worldwide, the medical system suffers from high workloads that can result in inaccurate decisions (Doi 2007; Lodwick 1966). A computer-aided diagnostic system is therefore needed to support the medical system in detecting COVID-19 infections and determining the severity degrees of these infections (Doi 2007).

In this paper, a hybrid computer-aided framework (COV-CAF), implementing a modified deep learning architecture, is proposed to detect COVID-19 infections and classify the severity of the infection based on the percentage of GGO and the presence of consolidation in the lung parenchyma. The model is based on the fusion of features generated automatically by the modified deep learning architecture with human-articulated features.

The contributions of the paper can be summarized as follows:

  • A slice selection mechanism is proposed for selecting informative candidate frames from the 3D CT-volumes.

  • A Region of Interest (RoI) segmentation phase using unsupervised clustering is introduced.

  • A modified deep learning architecture is proposed that achieves outstanding performance in diagnosing CT scans compared to previously established deep learning architectures.

  • A new robust hybrid machine learning system architecture (COV-CAF) is assembled for accurately diagnosing COVID-19 and its severity level. The system is based on the fusion of the automatic features of the newly proposed deep learning architecture with human-articulated features, which noticeably improves on the performance of the pure deep learning architecture.

  • The proposed models are validated on two benchmark datasets: MosMedData: Chest CT Scans with COVID-19 Related Findings dataset (Morozov et al. 2020), with multi-class classification of degrees of COVID-19 infection, and the SARS-COV-2 CT-Scan Dataset (Soares et al. 2020), with binary classes for detection of COVID-19 infection.

The paper is organized as follows: in Sect. 2, an overall literature background on using machine learning for COVID-19 disease diagnosis based on CT imaging is presented. Section 3 details the proposed system architecture and the implemented modules. In Sect. 4, a full description of the datasets used in the conducted experiments is given. Section 5 presents the experimental setup for conducting the experiments and the results of different experimental scenarios are shown and discussed. Finally, conclusions will be drawn in Sect. 6.

2 Related work

Automatic screening of COVID-19 through machine learning and chest scanning is a vital area of research. Chest scans directly assess the condition of the lungs (Alafif et al. 2021; Kamalov et al. 2021); thus, they can be effectively used for disease monitoring and control (Zu et al. 2020). In addition, machine learning can provide automated preliminary screening of COVID-19, saving physicians' time and allowing them to focus on more critical cases (Doi 2007; Lodwick 1966). Therefore, a lot of work has recently been dedicated to studying the effectiveness of applying machine learning to chest scans for COVID-19 diagnosis. In particular, deep learning-based systems have received the highest attention.

Jaiswal et al. (2020) applied and compared a range of standard deep learning architectures for classifying COVID-19 infected patients. DenseNet201-based deep transfer learning was shown to achieve the highest accuracy of 96.25%. Ardakani et al. (2020) extensively evaluated the performance of ten DL architectures on CT images to distinguish COVID-19 from other atypical viral and pneumonia diseases. The infection area was manually cropped and scaled with the aid of a radiologist, then input to the CNN. Transfer learning was applied to compensate for the limited dataset size of 1020 slices. The best performance was attained by ResNet-101, with COVID-19 sensitivity of 100% and specificity of 99.02%. A similar study was conducted by Koo et al. (2018) using various DL architectures to diagnose COVID-19. ResNet-50 showed the highest diagnostic performance, reaching a sensitivity of 99.58%, specificity of 100.00% and accuracy of 99.87%, followed by Xception, Inception-v3 and VGG16. Binary classification of positive COVID-19 cases vs normal was performed by Singh et al. (2020). CNNs were applied for the classification, where the initial parameters of the CNN were adjusted using multi-objective differential evolution (MODE). The model achieved a sensitivity of 90% given a training-to-testing ratio of 9:1. The discussed systems directly applied standard classification techniques on ready-to-process 2D-CT slices. Nevertheless, available datasets usually require manipulation of 3D-CT volumes and/or segmentation for Region of Interest (RoI) localization.

Zhang et al. (2020) proposed a new model that starts with 14-way data augmentation applied on the training set. The augmented set was input to a 7-layer CNN with enhanced stochastic max pooling, used to overcome the limitations of traditional max pooling. The model was used to diagnose positive COVID-19 CT infections vs normal cases and achieved sensitivity, accuracy and specificity of 94.44%, 94.03% and 93.63%, respectively. Another approach was proposed by Li et al. (2020), who developed the COVNet deep neural network framework for extraction of two-dimensional local and 3D global representative features. The framework included RoI segmentation using U-Net (Ronneberger et al. 2015) and data augmentation before feeding the slices into ResNet-50. The achieved sensitivity and specificity for COVID-19 were 90% and 96%, respectively. A similar approach was presented by Zheng et al. (2020), where a weakly supervised DL technique was proposed for the diagnosis of COVID-19 patients using 3D CT scans. A pre-trained U-Net was also applied for segmentation of the 3D lung images, and the segmented regions were input to the DL architecture for prediction of infected regions. The accuracy obtained by their model was 95.9%. Another segmentation approach, based on attenuation and HU value thresholding, was introduced by Bai et al. (2020); in some cases, manual correction of the segmentation was performed by a radiologist. Following the segmentation phase, EfficientNet B4 was used to separate COVID-19 cases from non-COVID cases and non-COVID pneumonia. The model achieved a higher test accuracy of 96%, sensitivity of 95% and specificity of 96%, compared to the radiologists' respective values of 85%, 79% and 88%. Kang et al. (2020) adopted a traditional machine learning approach of feature extraction, latent multi-view representation and classification. V-Net was used for pre-segmentation. The latent representation of the features together with a Neural Network classifier reached the highest sensitivity and specificity of 96.6% and 93.2%, respectively. A hybrid learning approach was investigated by Hasan et al. (2020), where the CT slices were segmented through histogram thresholding and subsequent morphological operations. They integrated novel Q-Deformed entropy features with DL-extracted features; a Long Short-Term Memory neural network was used as the classifier, attaining 99.68% accuracy. Another COVID-19 model was proposed by Wang et al. (2021) to classify COVID-19 CT infection by introducing a new (L, 2) transfer feature learning (L2TFL) approach that removes the optimal layers of pretrained CNNs before testing. A new selection algorithm was proposed to choose the best two retrained models to be fused using a deep CCT fusion discriminant correlation analysis (DCFDCA) method. This fusion method achieved better results than traditional fusion methods, and the final model, named CCSHNET, achieved a micro-averaged F1 score of 97.04%.

All of the previously mentioned studies determine whether a CT scan presents a negative or a positive case of COVID-19, or differentiate it from community-acquired pneumonia. A further, much-needed step is to determine the severity of the infection and the degree of lung involvement, hence enabling better support for the more serious cases.

Gozes et al. (2020) attempted to determine the severity of the infection using an off-the-shelf system to localize and provide measurements for nodules and opacities. The system was able to trace the changes in nodule and opacity size over time. However, the system was not shown to automatically classify patients by severity level, which is a needed capability. In the work of Wang et al. (2020), a hazard value was predicted for each patient to indicate whether he/she was at high or low risk. The hazard score was calculated from three prognostic features fed into a Cox Proportional Hazards model.

Although Gozes et al. (2020) and Wang et al. (2020) addressed severity, they did not automatically detect severity levels directly from the CT scans. Thus, the automated stratification of COVID-19 severity remains under-studied, which mandates directing more effort into this area of research.

3 Methods

In this section, COV-CAF, a robust COVID-19 integrative diagnostic and severity assessment system architecture, is proposed to detect COVID-19 infections and classify the severity of the infection. The system consists of two main phases: the preparatory phase and the feature analysis and classification phase. In the preparatory phase, data preprocessing and slice selection are performed to handle the characteristics of different datasets, enhancing image properties and improving the dependability of the dataset. The feature analysis and classification phase incorporates RoI segmentation, multi-view feature extraction and classification, which are responsible for producing an effective, accurate diagnosis and severity assessment. RoI segmentation is performed using an unsupervised fuzzy clustering technique. Feature extraction is done through a hybrid technique which fuses the automatic features generated from a modified variant of an existing deep learning architecture named Norm-VGG16 (Ibrahim et al. 2020) with spatial features generated from the segmented RoI. The proposed system architecture is shown in Fig. 1, and each phase is described in the following subsections.

Fig. 1
figure 1

The proposed COV-CAF architecture

3.1 Preparatory phase

In this phase, a range of preprocessing steps is applied to the datasets to increase the system's robustness and to limit the processing requirements. This phase introduces two optional steps, data preprocessing and slice selection, of which both, either or neither may be applied according to the nature of the data used.

3.1.1 Data preprocessing

In the case of 3D-CT volumes, the preparatory phase starts by converting the 3D-CT volumes to 2D slices using the “med2image” library in Python 3.7; a minimal conversion sketch is given below. The dataset volumes contain only the axial view of the lung, as shown in Fig. 2, and all the 2D slices of each patient are saved in Joint Photographic Experts Group (JPEG) format.
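The paper drives the "med2image" tool for this conversion; the sketch below shows an equivalent conversion written directly against nibabel and Pillow. File names, the output directory and the min-max rescaling are illustrative assumptions, not the authors' exact settings.

```python
# Sketch of the 3D-to-2D conversion step. The paper uses "med2image";
# this equivalent uses nibabel + Pillow directly. Paths and the
# min-max rescaling are illustrative assumptions.
import os
import nibabel as nib
import numpy as np
from PIL import Image

def volume_to_axial_jpegs(nifti_path, out_dir):
    """Save every axial slice of a NIfTI volume as an 8-bit JPEG."""
    os.makedirs(out_dir, exist_ok=True)
    volume = nib.load(nifti_path).get_fdata()   # shape: (H, W, n_slices)
    for idx in range(volume.shape[2]):
        axial = volume[:, :, idx]
        lo, hi = axial.min(), axial.max()
        scaled = ((axial - lo) / max(hi - lo, 1e-6) * 255).astype(np.uint8)
        Image.fromarray(scaled).save(os.path.join(out_dir, f"slice_{idx:03d}.jpeg"))

volume_to_axial_jpegs("study_0001.nii.gz", "patient_0001_slices")
```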

Fig. 2
figure 2

2D-CT axial view

3.1.2 Slice selection

Converting 3D-CT volumes to 2D slices generates open and closed lung slices, as shown in Fig. 3. Open lung slices refer to slices containing lung parenchyma, while closed lung slices contain mainly bones. The purpose of the conversion to 2D slices is to efficiently select the correct candidate slices with infections from all slices in the 3D image sequence (Hamadi and Yagoub 2018; Rahimzadeh et al. 2020). Slice selection is used to select the informative slices (open lung slices) and reject the remaining ones, which positively affects training time, model accuracy and precision, yielding an efficient classification model (Hamadi and Yagoub 2018; Rahimzadeh et al. 2020). An automatic slice selection technique is needed to speed up this stage, saving considerable time and effort compared to manually selecting the desired open lung slices based on medical expert decisions (Rahimzadeh et al. 2020).

Fig. 3
figure 3

Open lung slices vs closed lung slices

Automated slice selection is proposed to separate open lung slices from closed lung slices. In the slice selection process, Histogram of Oriented Gradients (HOG) descriptors are extracted from the CT images. Then, a subset of 2000 images of size 180 × 180, equally divided between open and closed lung slices, is labeled. The labeled images are used to train an SVM classifier, and the remaining images are labeled by the trained model accordingly. The images classified as open lung slices are selected.

HOG descriptors are generated by first normalizing colors; the image is then divided into blocks, and each block is divided into smaller units called cells, each covering a number of pixels. First, the gradient magnitude and direction are calculated at each pixel of a cell. For a pixel at location (x, y), the gradient magnitude is calculated from Eq. (1) and the gradient angle from Eq. (2) (Dalal and Triggs 2005).

$$G\left( {x,y} \right) = \sqrt {G_{x} \left( {x,y} \right)^{2} + G_{y} \left( {x,y} \right)^{2} }$$
(1)
$$\theta \left( {x,y} \right) = \arctan \left( {\frac{{G_{y} \left( {x,y} \right)}}{{G_{x} \left( {x,y} \right)}}} \right)$$
(2)

After calculating magnitude and angle, the HOG of each cell is measured by calculating the histogram. Q bins are selected for unsigned orientation angles between 0 and 180 degrees. Normalization is then applied, since different images can have different contrasts (Srinivas et al. 2016). The HOG pipeline is shown in Fig. 4. In our implementation, a [4 × 4] cell size, [2 × 2] cells per block and 9 orientation histogram bins are used.

Fig. 4
figure 4

Histogram of oriented gradients (HOG) descriptor pipeline

After generating the HOG descriptors for the slice set, the trained Support Vector Machine (SVM) classifier is used to differentiate the open lung slices from the closed lung slices, as sketched below. Moreover, sample output slices are inspected and verified by a medical expert to ensure open lung slices are correctly separated from closed ones.
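A minimal sketch of this stage using scikit-image and scikit-learn follows. It assumes the labeled subset is loaded into `images`/`labels` and the remaining slices into `unlabelled_images`; these names, and the LinearSVC choice for the linear SVM, are assumptions, while the HOG parameters are those stated above.

```python
# Sketch of the HOG + SVM slice-selection stage, assuming `images`/`labels`
# hold the 2000 labelled 180x180 slices (1 = open lung, 0 = closed lung)
# and `unlabelled_images` holds the rest.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(img):
    # HOG with the parameters from Sect. 3.1.2: 4x4 cells,
    # 2x2 cells per block, 9 unsigned orientation bins.
    return hog(img, orientations=9, pixels_per_cell=(4, 4),
               cells_per_block=(2, 2), block_norm="L2-Hys")

svm = LinearSVC(C=1.0)
svm.fit(np.array([hog_features(im) for im in images]), labels)

preds = svm.predict(np.array([hog_features(im) for im in unlabelled_images]))
open_lung_slices = [im for im, p in zip(unlabelled_images, preds) if p == 1]
```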

3.2 Feature analysis and classification phase

3.2.1 RoI segmentation

Automatic segmentation of medical images is considered the most important process for RoI extraction (Hawas et al. 2019; Sengur et al. 2019). It divides images into areas based on a specified description, such as segmenting body tissues, border detection, tumor segmentation and mass detection (Hawas et al. 2019; Sengur et al. 2019). Most datasets do not have ground-truth masks for lung parenchyma, because creating masks requires intensive work from physicians, so a common approach is to resort to unsupervised segmentation (Hasan et al. 2020; Wu et al. 2020).

Unsupervised clustering-based segmentation is proposed for lung parenchyma segmentation in COV-CAF to eliminate the need for exhaustive manual annotation. In this stage, automatic unsupervised segmentation is based on a clustering approach. Fuzzy C-means (FCM) and K-means clustering are applied, and the appropriate clustering approach is chosen based on clustering validity measures, namely the Davies-Bouldin index, Silhouette index and Dunn index. The FCM algorithm (Kang et al. 2009) helps identify the boundaries of the lung parenchyma against the surrounding thoracic tissue in 2D-CT axial view slices; FCM is favored in this stage because of its known accurate RoI segmentation of irregular and fuzzy borders compared to other techniques such as K-means (Kang et al. 2009; Wiharto and Suryani 2020). The FCM algorithm is presented in Algorithm 1. It is based on the minimization of the objective function shown in Eq. (3), where D is the number of data points, N is the number of clusters, m is the fuzzy partition matrix exponent controlling the degree of fuzzy overlap, xi is the ith data point, cj is the center of the jth cluster and μij is the degree of membership of xi in the jth cluster (the memberships of each point over all clusters sum to 1) (Bezdek et al. 1984)

$$J_{m} = \sum_{i = 1}^{D} \sum_{j = 1}^{N} \mu_{ij}^{m} \left\| x_{i} - c_{j} \right\|^{2}$$
(3)

The whole segmentation process is described in Algorithm 2, which starts by selecting the best number of clusters (k), after which image enhancement is applied. The best number of clusters is determined experimentally by applying a set of clustering quality measures: the elbow method, the Davies-Bouldin index, the Silhouette index and the Dunn index. Several numbers of clusters (k) are attempted, and the value of k with the best corresponding quality measures is selected for mask generation and segmentation. The mask used for an image (I) is the one generated from the cluster with the highest centroid value, which succeeds in identifying the boundaries of the lung parenchyma correctly (MC). Samples of the generated masks are shown in Figs. 5 and 6. After that, the centroid mask corresponding to each image is inverted, producing MI, and the background is subtracted to generate the MB image. The inverted mask with subtracted background (MB) is preprocessed by a set of morphological operations and filtered by different filtration masks, giving the intermediate images MD and MF, respectively. Finally, small connected objects (due to deficiencies of segmentation) are removed, creating the final mask M. After generating the masks for all images in the dataset, each mask is multiplied by its corresponding image and the RoI is segmented. Samples of the inverted masks after enhancement and the segmented RoIs are shown in Figs. 5 and 6.
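A condensed sketch of the clustering core of Algorithms 1 and 2 is shown below, using the scikit-fuzzy package: the pixel intensities of a slice are clustered with FCM (k = 3, the value selected in Sect. 5.4) and the cluster with the highest centroid is kept as the initial mask MC. The subsequent inversion, background subtraction and morphological clean-up of Algorithm 2 are omitted, and variable names are illustrative.

```python
# Sketch of the FCM clustering step of Algorithm 2 (scikit-fuzzy).
# The morphological post-processing (M_I, M_B, M_D, M_F, M) is omitted.
import numpy as np
import skfuzzy as fuzz

def fcm_initial_mask(slice_2d, k=3, m=2.0):
    pixels = slice_2d.reshape(1, -1).astype(float)         # (1, n_pixels)
    centers, u, *_ = fuzz.cluster.cmeans(pixels, c=k, m=m,
                                         error=1e-5, maxiter=200)
    labels = np.argmax(u, axis=0).reshape(slice_2d.shape)  # hard assignment
    brightest = int(np.argmax(centers[:, 0]))              # highest centroid
    return labels == brightest                             # boolean mask M_C
```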

Fig. 5
figure 5

MosMedData: mask generation and RoI segmentation for different dataset samples for different classes

Fig. 6
figure 6

SARS-COV-2: mask generation and RoI segmentation for different dataset samples for different classes

Algorithm 1 Fuzzy C-means clustering
Algorithm 2 RoI mask generation and segmentation

3.2.2 Modified norm-VGG16 deep learning architecture

Over the years, deep learning architectures have progressed rapidly. Their main advantage is the automatic generation of features without any human intervention. However, one common concern about deep learning architectures is the limited interpretability of the constructed models (Ibrahim et al. 2020), due to their black-box nature, which may lower the usability of a system by medical experts, who are keen on explainable decisions. Among the well-known deep learning architectures are the VGG networks (Simonyan and Zisserman 2014), known for their superior performance compared to CNN architectures like AlexNet (Krizhevsky et al. 2017).

A modified version of VGG16 (Simonyan and Zisserman 2014), named “Norm-VGG16”, is proposed; it is adopted from (Ibrahim et al. 2020) due to its accurate results compared to other architectures such as ResNets, Inceptions and MobileNets. Before training the Norm-VGG16, the RoI area of the lung parenchyma in each 2D-CT slice is cropped by applying a bounding box from Xinitial to Xfinal pixels on the x-axis and Yinitial to Yfinal pixels on the y-axis ([Xinitial, Yinitial] to [Xfinal, Yfinal]). The values of Xinitial, Xfinal, Yinitial and Yfinal are determined experimentally. The cropping focuses on the RoI in the images before passing them to the CNN and starting the training process. After image cropping, all images are normalized, because images may have highly varying pixel ranges that could cause differences in the resultant loss (Ibrahim et al. 2020): a high pixel range will always have a larger influence on updating the kernel weights of the CNN layers than a low pixel range, so normalization decreases the gap and makes a fair competition between high and low pixel ranges (Ibrahim et al. 2020). The structure of Norm-VGG16 is modified to have an input of 180 × 180 followed by 16 convolution layers, each followed by a batch normalization layer. Max pooling and dropout layers are added between convolution blocks, and the CNN ends with a global average pooling layer and a categorical dense layer with a kernel regularizer, as shown in Fig. 7.
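The pattern just described can be sketched in Keras as below. This is not the authors' exact model: the distribution of the 16 convolution layers over five blocks (2-2-4-4-4), the dropout rate and the regularization factor are assumptions for illustration; the per-convolution batch normalization, per-block dropout and L2-regularized dense head follow the description above and Fig. 7.

```python
# Sketch of the Norm-VGG16 building pattern (assumed block split):
# Conv + BatchNorm stacks, MaxPool + Dropout between blocks, and a GAP
# head with an L2 kernel regularizer on the dense layer.
from tensorflow.keras import layers, models, regularizers

def conv_block(x, filters, n_convs, dropout=0.25):
    for _ in range(n_convs):
        x = layers.Conv2D(filters, (3, 3), strides=1, padding="same",
                          activation="relu")(x)
        x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)
    return layers.Dropout(dropout)(x)

inputs = layers.Input(shape=(180, 180, 3))
x = conv_block(inputs, 64, 2)
x = conv_block(x, 128, 2)
x = conv_block(x, 256, 4)   # 2+2+4+4+4 = 16 convolutions (assumed split)
x = conv_block(x, 512, 4)
x = conv_block(x, 512, 4)
x = layers.GlobalAveragePooling2D(name="gap")(x)
outputs = layers.Dense(4, activation="softmax",
                       kernel_regularizer=regularizers.l2(0.01))(x)
model = models.Model(inputs, outputs)
```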

Fig. 7
figure 7

Modified NormVGG16 architecture

  1. (a)

    Convolution layers

Their main role is automatic feature extraction by passing different numbers of kernels (feature maps) over the input image (Srinivas et al. 2016). The kernel weights change during the training stage and are fixed at the end of training for use in the testing stage. In Norm-VGG16, the kernel size is 3 × 3 with stride = 1. The number of kernels (feature maps) differs in each convolution layer, as shown in Fig. 7: the pipeline starts with a convolution layer with 64 feature maps (3 × 3 × 64) and its last convolution layer has 512 feature maps (3 × 3 × 512).

  2. (b)

    Sub-sampling (max pooling/global average pooling) layers

Sub-sampling layers produce a down-sampled version that is robust against noise and distortion (Srinivas et al. 2016). Norm-VGG16 uses two types of sub-sampling layers: max pooling and global average pooling. A max pooling layer keeps the highest activation value in each n × n window of each feature map (Srinivas et al. 2016); the max pooling layers in Norm-VGG16 have a [2 × 2] kernel size. The global average pooling layer computes the mean value of each feature map and forwards it to the SoftMax dense layer, which converts each value to a probability (with the probabilities of all classes summing to 1.0) (Mohsin and Alzubaidi 2020).

  3. (c)

    Batch normalization and dropout layers

Norm-VGG16 is a deep CNN, which is prone to overfitting the training data (Ioffe and Szegedy 2015; Srivastava et al. 2014). Batch normalization layers and dropout layers help prevent overfitting of deep CNNs. In dropout layers, the term “dropout” refers to dropping out units (hidden and visible) in a neural network: dropping a unit out means temporarily removing it from the network, along with all its incoming and outgoing connections (Srivastava et al. 2014). The choice of which units to drop is random; each unit is retained with a fixed probability p, independent of other units (Srivastava et al. 2014). Batch normalization allows using higher learning rates and reduces the dependence on initialization. It also acts as a regularizer and complements dropout layers in avoiding overfitting (Ioffe and Szegedy 2015).

  4. (d)

Kernel regularizers

A kernel regularizer (L2) is added to the dense layer in Norm-VGG16. It is used to decrease overfitting by augmenting the loss during the training phase with a penalty term, as shown in Eq. (4). A training function ŷ = f(x) is first defined as a function that maps an input vector x to an output ŷ, where ŷ is the predicted value for the actual value y; the loss for a single sample is L(f(xi), yi) (Chris 2020). Over all input samples x1 … xn, the total loss is the sum of the loss functions between each input xi and its corresponding output, plus the regularization term. The added penalty is proportional to the sum of squares of the weight coefficients.

    $$L\left( {f\left( {x_{i} } \right), y_{i} } \right) = \mathop \sum \limits_{i = 1}^{n} L_{losscomponent} \left( {f\left( {x_{i} } \right),y_{i} } \right) + \lambda \mathop \sum \limits_{i = 1}^{n} w_{i}^{2}$$
    (4)

Kernel regularizers were first tried in different layers, but the best performance was attained when the regularizer was added to the last (dense) layer.

Overall, the proposed modifications in Norm-VGG16 can be summarized as: increasing the number of convolution layers from 13 to 16, adding a batch normalization layer after each convolution layer, adding a dropout layer after each max pooling layer, and integrating a kernel regularizer into the final dense layer. The added layers (batch normalization and dropout) and the kernel regularizer play an important role in opposing the overfitting of the standard VGG16 network during the training process.

3.2.3 Spatial feature extraction and fusion

The benefit of feature fusion is the detection of correlated feature values generated by different algorithms (Ross 2009). The fusion of features of different properties and families creates a compact set of salient features that can improve the robustness and accuracy of a classification model (Ross 2009). In this stage, spatial feature descriptors of global and local features are extracted from the CT images and fused with the automatic features generated by the modified Norm-VGG16.

After automatic segmentation of the lung from the CT slices, all segmented images are resized to 64 × 64 to minimize the size of the spatial features extracted from the segmented RoI. Two articulated spatial feature types are extracted from the slices: HOG and DAISY descriptors. The HOG feature descriptors were explained in detail in Sect. 3.1.2.

The DAISY descriptor is used because it is designed for efficient dense computation: it is faster than the GLOH and SIFT feature descriptors and, unlike SURF, can be computed efficiently at every pixel (Tola et al. 2010). DAISY generates low-dimensional invariant descriptors from local image regions. Eight orientation maps G, one per direction, are computed for each image to generate its DAISY descriptors (Tola et al. 2010). Gaussian kernels with different standard deviation values convolve each orientation map several times to obtain the convolved orientation maps (Tola et al. 2010). If G(u, v) is the image gradient at location (u, v), then the vector h(u, v), made of the values at location (u, v) in the orientation maps after convolution by Gaussian kernels, is given by Eq. (5) (Tola et al. 2010).

$$h_{\Sigma_1} \left( {u,v} \right) = \left[ {G_{1}^{\Sigma_1} \left( {u,v} \right), \ldots , G_{H}^{\Sigma_1} \left( {u,v} \right) } \right]^{T}$$
(5)

where G1, G2, …, GH denote the Σ-convolved orientation maps (Tola et al. 2010). After that, each histogram is normalized independently to correctly represent the pixels near occlusions. The DAISY descriptors are calculated over layers of concentric circles, as shown in Fig. 8. The full DAISY descriptor D(u0, v0) for location (u0, v0) is then defined as a concatenation of h vectors at the different layers of concentric circles, as shown in Eq. (6).

$$D\left( u_{0}, v_{0} \right) = \left[ \hat{h}_{\Sigma_1}^{T}\left( u_{0}, v_{0} \right),\; \hat{h}_{\Sigma_1}^{T}\left( I_{1}\left( u_{0}, v_{0}, R_{1} \right) \right), \ldots, \hat{h}_{\Sigma_1}^{T}\left( I_{N}\left( u_{0}, v_{0}, R_{1} \right) \right),\; \hat{h}_{\Sigma_2}^{T}\left( I_{1}\left( u_{0}, v_{0}, R_{2} \right) \right), \ldots, \hat{h}_{\Sigma_2}^{T}\left( I_{N}\left( u_{0}, v_{0}, R_{2} \right) \right),\; \hat{h}_{\Sigma_Q}^{T}\left( I_{1}\left( u_{0}, v_{0}, R_{3} \right) \right), \ldots, \hat{h}_{\Sigma_Q}^{T}\left( I_{N}\left( u_{0}, v_{0}, R_{3} \right) \right) \right]^{T}$$
(6)

where Q is the number of convolved orientation layers with different Σ's and Ij(u, v, R) is the location at distance R from (u, v) in the direction given by j, when the directions are quantized into N for each of the layers of concentric circles shown in Fig. 8.
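For illustration, DAISY descriptors can be extracted from a segmented 64 × 64 RoI with scikit-image as below. The paper does not state its DAISY parameters; the values here are assumptions chosen only so that the flattened output has 400 values, matching the dimensionality reported in Sect. 3.2.3.

```python
# Sketch of DAISY extraction on a 64x64 segmented RoI (scikit-image).
# Parameter values are assumptions; this configuration yields a 2x2 grid
# of 100-value descriptors, i.e. 400 values after flattening.
from skimage.feature import daisy

descs = daisy(roi_64x64, step=24, radius=15, rings=3,
              histograms=8, orientations=4)
daisy_vector = descs.reshape(-1)   # -> 400 values for a 64x64 input
```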

Fig. 8
figure 8

DAISY descriptors layers of concentric circles

HOG descriptors belong to the global features family, generating compact texture features, but they are sensitive to clutter and occlusion (Lisin et al. 2005). On the other hand, DAISY descriptors belong to the local features family, generating key descriptors calculated at multiple interest points of the local image, and are not sensitive to clutter and occlusion (Lisin et al. 2005). The point of extracting both HOG and DAISY descriptors is to combine different information from different families of features, which is expected to improve the results (Lisin et al. 2005).

After extracting the spatial features, the 512 features generated by Norm-VGG16 at the global average pooling layer shown in Fig. 7 are fused with the 8100 HOG values and 400 DAISY values generated for each segmented image. After this fusion of automatically generated and hand-crafted features, the resulting 9012 features are used for classification.
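A sketch of this fusion step is given below, reusing `model` (with its GAP layer named "gap") from the earlier architecture sketch; `slices_180`, `hog_feats` and `daisy_feats` are assumed to be precomputed arrays.

```python
# Sketch of the feature-fusion step: 512 CNN features from the GAP layer
# are concatenated with 8100 HOG and 400 DAISY values per slice, giving
# the 9012-D fused vectors. Input arrays are assumed precomputed.
import numpy as np
from tensorflow.keras import models

feature_extractor = models.Model(model.input, model.get_layer("gap").output)
cnn_feats = feature_extractor.predict(slices_180)       # (n, 512)
fused = np.hstack([cnn_feats, hog_feats, daisy_feats])  # (n, 9012)
```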

3.2.4 Classification

After the spatial and automatic features are extracted and fused, the CT slices are classified using a linear Support Vector Machine (SVM) on the merged features. SVM finds the hyperplane that maximizes the margin between classes (Cortes and Vapnik 1995). The SVM classifier has been chosen for its robustness: SVM is trained by solving a constrained quadratic optimization problem (Cortes and Vapnik 1995), so its parameters have a unique optimal solution, unlike classifiers such as standard neural networks trained using backpropagation (Cortes and Vapnik 1995). Because of the large size of the fused feature matrix, 9012 × n, where n is the number of CT slices in the dataset, the full dataset cannot fit completely in RAM. Incremental learning (Diehl and Cauwenberghs 2003) is therefore used: the dataset is divided into batches during SVM training, and batches are loaded one by one from hard disk to RAM to cope with the limited RAM capacity.
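The paper cites the incremental SVM of Diehl and Cauwenberghs (2003); a common practical stand-in, sketched below under that substitution, is scikit-learn's SGDClassifier with hinge loss (a linear-SVM objective) trained batch by batch via partial_fit, so only one batch of fused features is resident in RAM at a time. `load_batch` is a hypothetical loader that reads one batch from disk.

```python
# Sketch of batch-wise (incremental) training of a linear-SVM objective.
# SGDClassifier with hinge loss stands in for the incremental SVM cited
# in the paper; `load_batch` is a hypothetical on-disk batch loader.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge", alpha=1e-4)
classes = np.array([0, 1, 2, 3])              # e.g. CT-0 .. CT-3-4

for epoch in range(5):                        # several passes over the data
    for batch_idx in range(n_batches):
        X_batch, y_batch = load_batch(batch_idx)   # one batch in RAM
        clf.partial_fit(X_batch, y_batch, classes=classes)
```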

4 Materials

In the current study, two benchmark datasets are used in the experiments on the proposed models: the MosMedData: Chest CT Scans with COVID-19 Related Findings dataset and the SARS-COV-2 CT-Scan dataset.

4.1 MosMedData: chest CT scans with COVID-19 related findings dataset

The MosMedData dataset (Morozov et al. 2020) was provided by medical hospitals in Moscow, Russia, and collected at the Center of Diagnostics and Telemedicine. It comprises 1110 3D-CT lung volumes (saved in NIfTI format) of anonymized human lung computed tomography (CT) scans with COVID-19 related findings, as well as without such findings. The dataset includes 42% males, 56% females and 2% others, of ages between 18 and 97 years with a median of 47 years. Each 3D-CT NIfTI volume corresponds to a unique patient. The dataset is characterized by the availability of labeled severity levels, which indicate the impact of the COVID-19 infection on the lungs. This characteristic aids the precise diagnosis of COVID-19 and the identification of high-risk subjects who need immediate intervention.

The 3D CT-volumes are divided into 5 classes depending on the state of the lung tissue:

  • CT-0: normal lung tissue, no CT-signs of viral pneumonia.

  • CT-1: several ground-glass opacifications, involvement of lung parenchyma is less than 25%.

  • CT-2: ground-glass opacifications, involvement of lung parenchyma is between 25 and 50%.

  • CT-3: ground-glass opacifications and regions of consolidation, involvement of lung parenchyma is between 50 and 75%.

  • CT-4: diffuse ground-glass opacifications and consolidation as well as reticular changes in lungs. Involvement of lung parenchyma exceeds 75%.

The distribution of volumes over the severity classes is shown in Table 1.

Table 1 MosMedData: chest CT scans with COVID-19 related findings dataset distribution of 3D-CT volumes studies

In accordance with clinical experts' recommendations, and due to the limited percentage of 3D subjects in CT-4, it is combined with the previous class CT-3, creating a composite class (CT-3–4) that denotes the severe cases with ground-glass opacifications and consolidation and involvement of lung parenchyma exceeding 50%. Following this modification, the new distribution of COVID-19 related findings severity is shown in Table 2. The four classes CT-0, CT-1, CT-2 and CT-3–4 are shown in Fig. 9.

Table 2 MosMedData: chest CT scans with COVID-19 related findings dataset distribution of 3D-CT volumes studies after combination of CT-3 and CT-4 classes
Fig. 9
figure 9

Slices from 3D volumes of MosMedData dataset classes after merging classes CT-3 and CT-4 to CT-3–4

After applying the preparatory phase modules of Sect. 3, the dataset is divided into 90% for training and 10% for testing. The class distribution within the training set is illustrated in Table 3, while that of the testing set is illustrated in Table 4.

Table 3 Training dataset class distribution
Table 4 Testing dataset class distribution

4.2 SARS-COV-2 CT-scan dataset

The SARS-COV-2 2D-CT-Scan dataset (Soares et al. 2020) consists of 2482 CT scan images, divided into 1252 CT scans that are positive for SARS-CoV-2 infection (COVID-19) and 1230 CT scans of normal subjects not infected by COVID-19. The dataset was collected from 120 real patients in hospitals of Sao Paulo, Brazil: 60 patients infected by COVID-19 (32 males and 28 females) and 60 patients not infected (30 males and 30 females). The dataset is publicly available at www.kaggle.com/plameneduardo/sarscov2-ctscan-dataset. The dataset is divided into 90% for training and 10% for testing. The class distribution within the training set is illustrated in Table 5, while that of the testing set is illustrated in Table 6. The two classes of the dataset are shown in Fig. 10.

Table 5 Training Dataset class distribution
Table 6 Testing Dataset class distribution
Fig. 10
figure 10

Sample slices of SARS-COV-2 2D-CT-scan dataset (COVID and non-COVID)

5 Experimental results and discussion

5.1 Experimental environment: tools and setup

Automatic segmentation is performed using MATLAB 2019a, while model implementation, training and testing are done in Python v3.7.6 with the Keras package (TensorFlow backend). Experiments are conducted on an Intel Core i7, 2.21 GHz processor with 16 GB RAM and an Nvidia GTX 1050 Ti with 4 GB of video memory. All deep learning architectures are trained for 50 epochs from scratch using the Adam optimizer with a starting learning rate of 0.001. Inputs are divided into batches of size 32. Validation accuracy and cross-entropy loss are monitored for each epoch. In addition, the learning rate is reduced by a factor of 0.2 after every three epochs without improvement in validation loss. The best model is defined as the one with minimum validation loss; it is stored and applied to the testing set.
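This training configuration maps directly onto standard Keras callbacks, as sketched below; `model`, `x_train`/`y_train` and `x_val`/`y_val` are assumed to exist, and the checkpoint file name is illustrative.

```python
# Sketch of the stated training setup: Adam at 0.001, batches of 32,
# LR reduced by 0.2 after 3 stagnant epochs, best (min val-loss) model kept.
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=3),
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
]
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=50, batch_size=32, callbacks=callbacks)
```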

5.2 Performance metrics

A set of performance measures is used to evaluate the unsupervised clustering techniques used for segmentation in terms of partition quality. In addition, various established metrics are used to contrast the performance of the proposed COV-CAF architecture with state-of-the-art models.

5.2.1 Segmentation performance indicators

Different qualitative metrics are used to evaluate the compactness and the degree of cluster separation of the different unsupervised clustering algorithms, in order to find the best clustering technique and the best number of clusters (Kovács et al. 2006; Youssef et al. 2007). After conducting the comparative experiments, the elbow method is used to confirm the appropriate number of clusters.

  1. (a)

Davies-Bouldin Index

This index measures the ratio of within-cluster scatter to between-cluster separation. Let C = (C1, …, Ck) be the set of cluster centroids for a group of data objects D (Kovács et al. 2006; Youssef et al. 2007). The Davies-Bouldin (DB) index is then given by Eq. (7).

    $$DB = \frac{1}{k} \sum_{i = 1}^{k} R_{i}, \quad \text{where } R_{i} = \max_{j = 1 \ldots k,\; j \ne i} R_{ij}$$
    (7)

    where Rij is the similarity index that measures the within-to-between cluster distance ratio, as in Eq. (8).

    $$R_{ij} = \frac{{\overline{d}_{i} + \overline{d}_{j} }}{{d_{i,j} }}$$
    (8)

    The scatter measure for the centroid ci of cluster Ci is given by Eq. (9).

    $$d \left( {C_{i} } \right) = \frac{1}{{\left| {C_{i} } \right|}} \mathop \sum \limits_{{x \in C_{i} }} \left| {\left| {x - c_{i} } \right|} \right|$$
    (9)

    where ||ci - cj|| is the distance between the centroids of different clusters. Low values of Rij, and hence of DB, arise from low within-cluster scatter combined with high between-cluster distances, so the best clustering is the one with the lowest DB value (Youssef et al. 2007).

  2. (b)

    Silhouette Index

    This index measures the average similarity of objects within a cluster relative to their distance from objects in the nearest other cluster (Wang et al. 2017), as shown in Eq. (10).

    $$s\left( i \right) = \frac{b\left( i \right) - a\left( i \right)}{{\max \left( {a\left( i \right), b\left( i \right)} \right)}}$$
    (10)

    where a(i) represents the average distance (d) of point i to all other points belonging to the same cluster Ci (Eq. 11), while b(i) represents the average distance (d) of point i to all points in the nearest cluster Ck (Eq. 12) (Wang et al. 2017).

    $$a\left( i \right) = \frac{1}{{\left| {{\text{C}}_{{\text{i}}} } \right| - 1}} \mathop \sum \limits_{{j \in C_{i} , i \ne j}} d\left( {i,j} \right)$$
    (11)
    $$b\left( i \right) = \mathop {\min }\limits_{k \ne i} \frac{1}{{\left| {C_{k} } \right|}} \mathop \sum \limits_{{j \in C_{k} }} d\left( {i,j} \right)$$
    (12)

    The calculation of b(i) takes the minimum, over all other clusters, of the average distance from point i to the points of that cluster. The general Silhouette formula for data points 1 … N can then be written as shown in Eq. (13).

    $$S = \frac{1}{{\text{N}}} \mathop \sum \limits_{i = 1}^{N} s_{i}$$
    (13)

    Hence, the higher the ratio, the better the clustering.

  3. (c)

    Dunn Index

    It determines the minimal ratio between the inter-cluster distance and the cluster diameter, calculated for a cluster set C = (c1, …, ck) as shown in Eq. (14).

    $$D = \mathop {\min }\limits_{c,d \in C} \left[ {\frac{{d\left( {\mu_{c} , \mu_{d} } \right)}}{{\mathop {\max }\limits_{c \in C } \left[ {diam\left( c \right)} \right]}}} \right]$$
    (14)

    where diam(c) is the diameter of cluster c, computed as the maximum intra-cluster distance, and d(μc, μd) is the distance between the centroids of clusters c and d. A compact, well-separated clustering is expected to have large distances between clusters and small diameters; hence, the highest Dunn index value indicates the better clustering technique and the best number of clusters.

  4. (d)

    Elbow method

    After computing the clustering quality and validity measures, the elbow method is applied to determine the best number of centroids for lung segmentation (Thorndike 1953). The elbow method confirms the best number of cluster centroids to use (Nanjundan et al. 2019; Thorndike 1953), proceeding as follows (a computation sketch is given after the list):

    1. Compute the clustering algorithm for different k values (1:10).

    2. For each k, calculate the total within-cluster sum of squares (WSS).

    3. Plot the WSS curve against the number of clusters k.

    4. The location of a bend (knee) in the plot is taken as an indicator of the suitable number of clusters.
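The sketch below illustrates the search over k. k-means is shown because scikit-learn exposes its WSS directly (`inertia_`), and the Davies-Bouldin and Silhouette indices are computed with scikit-learn; the Dunn index has no scikit-learn implementation and is omitted. `pixels` is an assumed (n, 1) array of intensities sampled from the slices.

```python
# Sketch of the cluster-number search: WSS (for the elbow plot) plus
# Davies-Bouldin and Silhouette indices for each k. Dunn index omitted
# (no scikit-learn implementation); `pixels` is an assumed (n, 1) array.
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

for k in range(2, 11):            # DB and Silhouette require k >= 2
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    db = davies_bouldin_score(pixels, km.labels_)
    sil = silhouette_score(pixels, km.labels_,
                           sample_size=10000, random_state=0)
    print(f"k={k}  WSS={km.inertia_:.1f}  DB={db:.3f}  Silhouette={sil:.3f}")
```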

5.2.2 Classification performance indicators

  1. (a)

    Confusion matrix

    The confusion matrix is a summary table for visualizing and describing the performance of a model classifying a testing set, as shown in Fig. 11; it summarizes the prediction results of a classification problem (Ibrahim et al. 2020). In a confusion matrix, the values of True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) are defined as follows, taking Ci as one of the four classes in our dataset (a computation sketch is given after this list):

    • TP (Ci) = all the instances of Ci that are classified as Ci.

    • TN (Ci) = all the non-Ci instances that are not classified as Ci.

    • FP (Ci) = all the non-Ci instances that are classified as Ci.

    • FN (Ci) = all the Ci instances that are not classified as Ci.

    Fig. 11
    figure 11

    Confusion matrix

  2. (b)

    Accuracy (Acc)

    It is calculated by dividing the number of images that are correctly labeled by the total number of test images. Equation (15) explains the per-class accuracy measurement.

    $$Accuracy = \frac{TP + TN}{{TP + TN + FP + FN}}$$
    (15)
  3. (c)

    Precision (Prec)

    It is a metric which quantifies the proportion of positive predictions that are correct (Brownlee). Equation (16) explains the single-class precision measurement.

    $$Precision = \frac{TP}{{\left( {TP + FP} \right)}}$$
    (16)
  4. (d)

    Sensitivity (Sens)

    It is a metric which quantifies the number of correct positive predictions made out of all the positive predictions that could have been made (Brownlee). Equation (17) explains the single-class sensitivity measurement.

    $$Sensitivity = \frac{TP}{{\left( {TP + FN} \right)}}$$
    (17)
  5. (e)

    F-measure

    F-measure is a metric that combines precision and sensitivity into a single measure capturing both properties (Brownlee). Equation (18) explains the single-class F-measure.

    $$F - measure = \frac{{\left( {2*Precision*Sensitivity} \right)}}{{\left( {Precision + Sensitivity} \right)}}$$
    (18)
  6. (f)

    Specificity (Spec)

    It is a metric that quantifies the number of correct negative predictions made out of all the negative predictions that could have been made (Brownlee). Equation (19) explains single-class specificity.

    $$Specificity = \frac{TN}{{\left( {TN + FP} \right)}}$$
    (19)
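The per-class quantities of Eqs. (15)-(19) and their macro averages can be computed from a multi-class confusion matrix as sketched below; `y_true` and `y_pred` are assumed label arrays.

```python
# Sketch of per-class TP/TN/FP/FN and the metrics of Eqs. (15)-(19),
# with macro averages as reported in Sect. 5.5. Inputs are assumed arrays.
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)        # rows: true, cols: predicted
TP = np.diag(cm).astype(float)
FP = cm.sum(axis=0) - TP
FN = cm.sum(axis=1) - TP
TN = cm.sum() - (TP + FP + FN)

precision   = TP / (TP + FP)
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
accuracy    = (TP + TN) / (TP + TN + FP + FN)
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)

print("macro precision:  ", precision.mean())
print("macro sensitivity:", sensitivity.mean())
```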

5.3 Experimental scenarios

In this section, the performance of the proposed COV-CAF system architecture and Norm-VGG16 is evaluated. Norm-VGG16 performance is reported as an independent pure DL classifier in the following experiments.

First, an ablation study is conducted on the MosMedData: Chest CT Scans with COVID-19 Related Findings Dataset to emphasize the effect of each feature category and of the feature fusion. Moreover, the effectiveness of the COV-CAF architecture, its backbone DL network (the modified Norm-VGG16) and the fused spatial articulated features on this dataset is demonstrated against four prominent CNN architectures, namely Xception, ResNet-50, MobileNet-v2 and Inception-v3, applied to the entire CT image. In addition, the performance of COV-CAF is compared to traditional ML, where an SVM is applied to the hand-crafted features extracted from the segmented RoI only. COV-CAF is contrasted with both pure DL and traditional ML approaches to elucidate the effect of fusing the two. The four CNNs share a common modification: an input layer of size 180 × 180 is added, and the output layer is modified to four neurons representing the four classes of the MosMedData dataset. The first experiment was conducted using Inception-v3, a model developed by Google; it is a very deep network consisting of 48 layers, starting with 6 convolution layers followed by 10 inception blocks. The full architecture is described in detail in (Szegedy et al. 2016). The second experiment was conducted using MobileNet-v2, a light-weight network of 53 layers: 52 convolution layers followed by a final dense layer. The network starts with 16 residual and bottleneck blocks and ends with one convolution layer followed by the dense layer (Ardakani et al. 2020; Sandler et al. 2018). The third experiment was conducted using ResNet-50, an architecture with 50 convolution layers based on residual blocks of two types: identity residual blocks and shortcut residual blocks (Soares et al. 2020). The full architecture is explained in (Ibrahim et al. 2020). The last experiment uses the Xception network, which has 71 layers; it starts with two convolution layers followed by depthwise separable convolution layers, four convolution layers and a dense layer (Ardakani et al. 2020). The full architecture and specification are explained in Chollet (2017). These four CNNs are chosen for implementation and comparison with our proposed models because of their different depths, subsequent complexities and different theories of operation, as explained previously, and because of their good reported results (Ardakani et al. 2020; Jaiswal et al. 2020).

Second, the effectiveness of the COV-CAF architecture and its backbone DL network (Norm-VGG16) is compared to the deep learning models implemented by Jaiswal et al. (2020) on the SARS-COV-2 CT-Scan dataset.

5.4 Segmentation evaluation

A number of experiments are conducted on the MosMedData: Chest CT Scans with COVID-19 Related Findings Dataset using different clustering indicators to compare the clustering techniques with varying numbers of clusters and determine the best algorithm and the best cluster number.

First, the comparisons between k-means and FCM clustering are shown in Table 7 and Fig. 12a, which show that FCM always scores lower Davies-Bouldin values than k-means for every number of clusters (from 1 to 10). Moreover, the best number of clusters (k) for FCM was found to be three.

Table 7 Davies-Bouldin index, Silhouette index and Dunn index for k-means and FCM for different numbers of clusters
Fig. 12
figure 12

Evaluation curves for different unsupervised clustering techniques. a Davies-Bouldin index curve. b Silhouette index curve. c Dunn index curve

Second, the Silhouette index results in Table 7 and Fig. 12b show that the FCM algorithm achieves the highest score compared to k-means for every number of clusters (from 1 to 10), which confirms that FCM is more suitable for our segmentation, with the number of clusters (k) equal to three.

Finally, Table 7 and Fig. 12c provide further certainty in choosing the FCM algorithm over k-means, as the highest Dunn index score is achieved by the FCM algorithm at three clusters, surpassing k-means with the same number of clusters.

Overall, the experimental findings confirm the superiority of the FCM algorithm over k-means in unsupervised segmentation (Kang et al. 2009; Wiharto and Suryani 2020). In addition, the results in Table 7 show that the best number of clusters (k) equals three for both algorithms on the MosMedData: Chest CT Scans with COVID-19 Related Findings Dataset.

For further confirmation of the best number of cluster centroids, the elbow method is applied; the bend in the curve is found at cluster number (k) equal to three, as shown in Fig. 13.

Fig. 13
figure 13

FCM elbow method using the WSS curve for the optimal number of clusters (k)

5.5 Classification results

5.5.1 MosMedData: chest CT scans with COVID-19 related findings dataset

A preliminary experiment is carried out to verify the candidacy of Norm-VGG16 as the backbone (feature extractor) of the COV-CAF architecture. The mean performance metrics over five different bootstrapped partitions for the described DL architectures are shown in Table 8. Bootstrapping is used to show the performance stability and robustness across a set of different partitions. Overall, Norm-VGG16 achieves the best mean performance metrics except for specificity. It attained an overall accuracy of 96.64% and macro-average precision, sensitivity, specificity and F-measure of 96.92%, 93.65%, 97.2% and 95.26%, respectively. Hence, Norm-VGG16 is used as the feature extractor within the COV-CAF architecture.

Table 8 Mean performance of the pure DL architectures over five bootstrapped partitions applied to the entire MosMedData: chest CT scans with COVID-19 related findings dataset

The Norm-VGG16 learning (training and testing) curves of the best model partition are shown in Fig. 14. The best model achieves a training accuracy of 99.98% and a testing accuracy of 97.09%. The learning curves show that the testing accuracies and losses stabilize around the 25th epoch. The gap between the training and testing curves, in both the accuracy and the loss plots, can be attributed to the small size of the testing set compared to the training set. To compensate for this possibility, the bootstrapping experiment is conducted to provide a robust illustration of the model's accuracy and to ensure that the proposed modified Norm-VGG16 does not overfit a single partition of the bootstrapped experiments.

Fig. 14
figure 14

Norm-VGG16 learning curves. a Training and testing accuracy curves. b Training and testing loss curves

The remaining experiments are conducted using a 90/10 percentage split. Tables 9 and 10 report the performance metrics of Norm-VGG16 applied directly to the entire CT image and of the COV-CAF architecture, respectively, calculated from the best confusion matrices in Fig. 15f, g. In Tables 9 and 10, the accuracy, precision, sensitivity, F-measure and specificity are calculated for each class separately, and a macro average of each performance metric is reported; the macro average computes the metric independently for each class and then averages the values. Tables 9 and 10 reveal that the hybrid model surpasses its counterpart Norm-VGG16 in terms of the macro-average metrics accuracy, sensitivity and F-measure, with differences of 2.51%, 2.51% and 1.37%, respectively.

Table 9 The proposed modified Norm-VGG16 performance metrics calculated on MosMed dataset
Table 10 The proposed COV-CAF architecture performance metrics calculated on MosMed dataset
Fig. 15
figure 15

Confusion Matrices of ablation models applied on MosMed dataset. a HOG + SVM model confusion matrix. b DAISY + SVM model confusion matrix. c Spatial articulated feature fusion (HOG + DAISY) + SVM confusion matrix. d The proposed modified Norm-VGG16 confusion matrix. e The proposed modified Norm-VGG16 + HOG confusion matrix. f The proposed modified Norm-VGG16 + DAISY confusion matrix. g The proposed COV-CAF Model confusion matrix

The performance of the proposed COV-CAF model and the effect of each extracted feature type (HOG, DAISY and the automatic features of the proposed modified Norm-VGG16) and their different combinations are illustrated through an ablation study, where several combinations are experimented with and contrasted. The confusion matrices for the different extracted features and fusions are shown in Fig. 15, from which Table 11 is established for accuracy comparison. From Table 11, it can be seen that the fusion of the extracted spatial articulated features (HOG + DAISY) results in a significant increase in detecting the severe COVID-19 classes, scoring 85.57% and exceeding the accuracies of HOG features alone and DAISY features alone by 5.67% and 3.32%, respectively. Comparing the fused spatial articulated features (HOG + DAISY) to HOG features alone, differences of 18.22% and 11.11% are found in the accuracies of the severe classes CT-2 and CT-3–4, respectively. The fused spatial articulated features also score higher accuracies than DAISY features alone in the CT-2 and CT-3–4 severe classes, with differences of 4.82% and 17.59%, respectively. The proposed modified Norm-VGG16 surpasses the spatial articulated feature fusion in the CT-0 and CT-1 classes by about 40% and 3.05%, respectively, while the spatial articulated feature fusion gives an increase in the CT-2 and CT-3–4 severe infections of 3.78% and 2.78%, respectively. The proposed COV-CAF model achieves the highest accuracies in the most severe COVID-19 classes that need immediate medical intervention, CT-2 and CT-3–4, scoring 95.53% and 95.37%, respectively, as shown in Table 11. Only in the non-severe infection class CT-1 does the proposed modified Norm-VGG16 achieve a slight increase.

Table 11 Comparison between per-class accuracies and overall accuracy for each proposed model applied on the MosMed dataset (ablation study)

The spatial articulated features (HOG + DAISY) fusion model, the COV-CAF model and its backbone, the proposed modified Norm-VGG16, are compared with deep learning architectures of different depths and subsequent complexities: Xception, ResNet-50, MobileNet-v2 and Inception-v3. The best confusion matrix for each of these deep learning models on the MosMed dataset is reported in Fig. 16a-d; these confusion matrices are used to calculate the performance metrics for each model. The best Xception model achieves a testing accuracy of 94.67%, the best ResNet-50 model 93.33%, the best MobileNet-v2 model 93.14% and the best Inception-v3 model 91.13%.

Fig. 16
figure 16

Confusion matrices of different CNN models applied on MosMed dataset. a Xception model confusion matrix. b ResNet-50 model confusion matrix. c MobileNet-v2 model confusion matrix. d Inception-v3 model confusion matrix

Table 12 depicts the per-class and overall accuracy of each of the seven implemented models. It can be noticed that COV-CAF and Norm-VGG16 outperform the other models. Norm-VGG16 surpasses the best per-class and overall accuracies of the four recognized architectures, with the smallest differences being 3.28%, 1.59%, 3.44%, 0.93% and 2.42% for classes CT-0, CT-1, CT-2, CT-3–4 and the overall accuracy, respectively. Although the traditional ML model of spatial articulated feature (HOG + DAISY) fusion scores the worst CT-0 accuracy across all models, it surpasses the standard DL architectures in the accuracy of the critical (high-severity) classes CT-2 and CT-3–4. A possible explanation for this variation in the performance of the spatial articulated features (HOG + DAISY) is that the lack of evident textural changes in the soft tissues of CT-0 and CT-1 cases hinders the extraction of informative features for these classes, whereas the high concentration of GGOs in the high-severity classes (CT-2 and CT-3–4) generates a large number of key features that are easily captured by the HOG and DAISY features, representatives of the global and local feature families, respectively (Ahmed et al. 2017; Walsh et al. 2019). Compared to the proposed modified Norm-VGG16 deep learning architecture, the traditional-learning (HOG + DAISY) fusion model needs a smaller training dataset; thus, spatial articulated feature (HOG + DAISY) fusion produces better accuracy from the limited training set of the high-severity cases (CT-2 and CT-3–4) (Walsh et al. 2019). This finding explains the superior performance of the COV-CAF model, as it combines both feature categories (automatic deep learning features + spatial articulated features), resulting in better overall performance. The proposed COV-CAF model attains the highest per-class accuracies except for CT-1, where Norm-VGG16 scores a minute 0.61% more. A remarkable improvement is reached by the COV-CAF architecture compared to the four standard DL architectures, especially in the highest severity classes CT-2 and CT-3–4: the improvements in CT-2 range from 10.31 to 14.77%, while a bigger range exists for CT-3–4, from 5.56 to 15.74%. The results elucidate the capability of the COV-CAF architecture to stratify the critical minority cases, in contrast to the pure DL architectures. COV-CAF also shows a huge improvement over traditional ML in terms of CT-0 accuracy, with a 44.29% increase.

Table 12 Comparison between per-class accuracies and overall accuracy of the different models applied on the MosMed dataset

Table 13 shows a comparison between the COV-CAF architecture, the modified Norm-VGG16, the spatial articulated features (HOG + DAISY) fusion model and the implemented standard DL architectures. Findings similar to those of Table 12 appear in Table 13, which depicts the great improvement of the COV-CAF architecture relative to the rest of the models: COV-CAF achieves the highest precision, sensitivity, F-measure and specificity. It surpasses Norm-VGG16 in macro-average sensitivity and F-measure by 2.51% and 1.37%, respectively. The modified Norm-VGG16, which comes second after COV-CAF, exceeds the results achieved by Xception in precision, sensitivity, F-measure and specificity by 3.04%, 2.92%, 2.99% and 1.12%, respectively. Moreover, it is noteworthy that the modified Norm-VGG16 architecture attains such significantly higher performance metrics while maintaining a much lower network depth and subsequent complexity compared to the Xception architecture.

Table 13 Comparison between macro average precision, sensitivity, f-measure and specificity for different models applied on MosMed dataset

5.5.2 SARS-COV-2 CT-scan dataset

The performance of the COV-CAF architecture and the proposed modified Norm-VGG16 on the SARS-COV-2 CT-Scan Dataset is compared to the state-of-the-art results of Jaiswal et al. (2020), who reported satisfactory results on the dataset. Table 14 shows the comparison between the models implemented by Jaiswal et al. (2020) and ours. The results show that the modified Norm-VGG16 model matches the results of their best model, DenseNet201, with much lower network depth and subsequent complexity: the modified Norm-VGG16 consists of only 16 convolution layers, contrasted with the 201 convolution layers of DenseNet201 (Jaiswal et al. 2020). Moreover, the results show that COV-CAF achieves a considerable increase in sensitivity, F-measure and specificity over DenseNet201 (Jaiswal et al. 2020), with improvements of 2.12%, 1.3%, 1.61% and 1.34% across the compared metrics. The highest improvement is attained in the sensitivity of the COVID-19 group, which is a crucial measure of the model's ability to correctly identify subjects needing immediate medical attention.

Table 14 Comparison between the proposed models and state-of-the-art models on the SARS-COV-2 CT-scan dataset

6 Conclusion

In this paper, a novel hybrid computer-aided diagnostic system, COV-CAF, is introduced. COV-CAF comprises a preparatory phase and a feature extraction and classification phase. The preparatory phase starts with a preprocessing module converting the 3D volumes into 2D slices, followed by an effective slice selection module to select the CT slices with COVID-19 findings. An unsupervised fuzzy c-means clustering is used to segment the RoI (lung parenchyma), and automatic DL feature extraction is performed by the modified Norm-VGG16 CNN. Moreover, a feature fusion module is introduced in which the automatic features generated by Norm-VGG16 are combined with the spatial articulated features generated from the segmented RoI. The remarkable results achieved by the proposed COV-CAF model in detecting COVID-19 infection and classifying the severity degree of infection from chest CT slices prove the robustness of the model and the importance of the feature fusion phase. Our modified Norm-VGG16 and the hybrid model surpassed traditional ML and four well-known pure deep learning architectures, namely Xception, ResNet-50, MobileNet-v2 and Inception-v3, on the MosMedData dataset. Moreover, COV-CAF surpassed the four deep learning models implemented by Jaiswal et al. on the SARS-COV-2 CT-Scan dataset.

As for future work, different unsupervised techniques can be tested on different datasets to give a full study of the best unsupervised segmentation technique and the best number of clusters. Moreover, different articulated features can be fused into the system to test their effect on the model's performance.

Overall, the proposed COV-CAF diagnostic framework is a robust framework that can aid physicians in stratifying subjects into different risk groups according to their COVID-19 CT findings. Moreover, COV-CAF is a reusable framework that is expected to achieve competitive results on similar problems. It provides effective solutions to different common issues involved in CT lung diagnosis, such as slice selection, RoI segmentation and multi-view feature analysis.

7 Availability of data and material

MosMedData: Chest CT Scans with COVID-19 Related Findings Dataset is available via https://mosmed.ai/en/ and SARS-COV-2 CT-Scan Dataset is available via www.kaggle.com/plameneduardo/sarscov2-ctscan-dataset.