1 Introduction

The outbreak of COVID-19 has increased the need for diagnostic methods that are more effective and faster than manual diagnosis by experts. The huge number of infected people and the insufficient number of medical staff and health facilities in some countries have increased the burden on health systems. At the same time, the widespread use of rapid diagnosis tools, which help in taking measures and suggesting appropriate treatment, is evidence both of how besieged health systems are by the pandemic and of the usefulness of such tools in mitigating the spread of the virus. In recent years, the reliance on machine learning techniques in the medical field has increased dramatically. Roy et al. (2022) discussed the prospects of supervised machine learning (SML) in the healthcare sector, the challenges it faces and how to solve them, and the opportunities that AI and SML offer healthcare in the near future. In general, these techniques have proven to be effective in diagnosing diseases with acceptable accuracy and high speed. Jaiswal et al. (2021) proposed an optimized technique for identifying blindness in retinal images using deep learning models. Ensembles of convolutional neural networks (CNN) have been shown to be an efficient tool for skin cancer detection (Al-Karawi 2022), while segmentation of skin diseases is also possible with CNN-based methods (Huang et al. 2020). Among the many other studies based on medical images, there are applications such as detection and diagnosis of gastric cancer (Cao et al. 2019), breast cancer (Wang et al. 2019), brain tumors (Salçin 2019), pneumonia (Avsar 2021), lung diseases (Kabiraj 2022) and lung cancer (Gunjan et al. 2022; Agarwal 2021). The use of machine learning in the medical field is not limited to diagnosing diseases; it also covers several other domains, such as segmenting medical images (Pal et al. 2022; Rajinikanth et al. 2022) and using the segmented images for specific purposes like predicting the type of the fetal brain (pathological or neurotypical) and the gestational age of the fetus (Gangopadhyay et al. 2022).

Symptoms of COVID-19 vary from person to person; however, the most frequently reported symptoms include fatigue, coughing and shortness of breath. The problem is that these symptoms may also be associated with similar illnesses such as pneumonia (Zayet et al. 2020). Reverse transcription polymerase chain reaction (RT-PCR) tests are currently among the most popular and reliable methods to determine the presence or absence of the virus. However, these tests have several drawbacks. The method is slow, sometimes taking 24 h to return a result. In addition, it puts medical staff at risk of catching the virus because of the physical contact with the patient. It is also expensive and thus inaccessible for poor countries. Therefore, a quick, reliable and cheap way to diagnose COVID-19 and pneumonia infections is necessary and would help in taking the appropriate actions. Chest radiography (chest X-ray) is also a commonly used method for diagnosing lung diseases and detecting COVID-19; however, this method has drawbacks as well. To diagnose diseases from X-ray images, experts are required to inspect them. In addition, it can produce false results because of the similarity between chest X-ray images of people infected with COVID-19 and those with different types of pneumonia.

CNN is a popular machine learning method used to classify images and detect objects. In this work, a sequential CNN architecture is proposed to detect X-ray images belonging to patients with COVID-19, Viral Pneumonia and Bacterial Pneumonia. For benchmarking purposes, the classification performance of the proposed architecture was compared with those obtained by widely used CNN models pretrained on the ImageNet dataset. These benchmark models are MobileNetV2, InceptionResNetV2, ResNetV2, EfficientNet B2, EfficientNet B0, NASNetMobile, InceptionV3, VGG16 and VGG19. These models differ in terms of design, number of parameters and depth, which allows a fair comparison between them and the proposed model. In terms of practical implications, being lightweight allows the proposed model to be used on devices with limited processing capability. In other words, it opens the way to designing and developing cheap auxiliary tools to detect lung diseases.

Many works have been conducted to diagnose COVID-19 and pneumonia; however, most of them merge viral and bacterial lung diseases into one category. This leads to a limited understanding of how CNNs perform in classifying these diseases separately and provides a limited diagnosis scheme. In addition, the number of parameters is not discussed in the models presented in these studies, so it is not clear how well they can work in environments with low resources. Therefore, within the scope of this study, answers to the following research questions are sought:

Q1: How successful is the proposed CNN model in detecting the lung diseases separately (i.e. COVID-19, Viral Pneumonia and Bacterial Pneumonia)?

Q2: Can a light model with a low number of parameters, and thus a low computational cost, perform well for this classification task?

As a result of the experiments performed, a CNN model is proposed to address the limitations of the existing studies. In particular, the contributions of this study are listed below.

  • The proposed model has fewer convolutional layers and parameters than the benchmark models. Therefore, it is a lightweight model that requires a relatively small amount of computation in the training and test phases.

  • It achieves better overall classification results than the benchmark models. In particular, the proposed model is capable of distinguishing COVID-19, viral and bacterial pneumonia cases with a high true detection rate.

  • As a result of being lightweight, the proposed model does not require expensive and powerful hardware, since it includes a relatively low number of parameters. This makes it applicable to devices with low computational power such as edge devices and single-board computers.

The remainder of this paper is organized as follows: In Sect. 2, the existing studies in the related literature are reviewed. In Sect. 3, the dataset, models and performance metrics are introduced. Sections 4 and 5 present the experimental results and the discussion, respectively. Finally, the paper is concluded in Sect. 6.

2 Literature review

Among the various approaches for COVID-19 detection, chest X-ray images are widely used, and hence many studies are available in this context. Thanks to their automated feature extraction capability, convolutional neural networks (CNN) are commonly used for classifying unstructured data such as images. Consequently, there are numerous studies in which chest X-ray images were used together with CNN models for the detection of COVID-19 infections. In some studies, the researchers aim to discriminate the X-ray images of positive COVID-19 cases from healthy X-ray images. However, COVID-19 cases are very likely to be confused with pneumonia infections, which can be bacterial or viral; hence, other studies treat the detection as a three-class or four-class problem in order to detect COVID-19 and pneumonia together.

The number of studies considering a binary problem to distinguish healthy and COVID-19 X-ray images is relatively high. For instance, Reynaldi et al. (2021) used CNN with the ResNet-101 model as an image recognition method to detect COVID-19. The authors used a dataset containing 2562 images categorized as COVID-19 positive (1281 images) and COVID-19 negative (1281 images). Contrast Limited Adaptive Histogram Equalization (CLAHE) was applied as a preprocessing step, and the results showed that the model trained on CLAHE data achieved a better accuracy of 99.61% compared with 99.22% on the raw data. In addition, Hemdan et al. (2003) used several deep convolutional neural network models (VGG19, DenseNet201, InceptionV3, ResNetV2, InceptionResNetV2, Xception and MobileNetV2) to classify X-ray images as COVID-19 positive or negative. The authors used a dataset of 50 chest X-ray images that includes 25 positive and 25 healthy cases. The results showed that VGG19 and DenseNet201 provided the highest classification performance with an accuracy of 90%. Narin et al. (2021) used five pre-trained convolutional neural network models, namely ResNet50, ResNet101, ResNet152, InceptionV3 and InceptionResNetV2, to detect COVID-19. The dataset they used contains 7396 chest X-ray images classified as 341 COVID-19, 2800 Normal, 2772 Bacterial Pneumonia and 1493 Viral Pneumonia images. The dataset was divided into three binary-class datasets: dataset-1 contains the COVID-19 and Normal classes, dataset-2 contains the COVID-19 and Viral Pneumonia classes, while dataset-3 contains the COVID-19 and Bacterial Pneumonia classes. The ResNet50 model achieved the best classification results with accuracies of 96.1%, 99.5% and 99.7% for dataset-1, dataset-2 and dataset-3, respectively. Ohata et al. (2020) used transfer learning models as feature extractors to detect COVID-19. The transfer learning models used in this work are VGG16, VGG19, InceptionV3, InceptionResNetV2, ResNet50, NASNetLarge, NASNetMobile, Xception, MobileNet, DenseNet121, DenseNet169 and DenseNet201. These models were combined with several classifiers such as k-Nearest Neighbor, Bayes, Random Forest, Multilayer Perceptron (MLP) and Support Vector Machine (SVM). The authors used two datasets that share the same images for the COVID-19 class but have different images for the healthy class. The datasets are balanced and consist of 194 images for each class. The results showed that the MobileNet model with the SVM classifier (linear kernel) achieved the best mean accuracy of 98.46% for one of the datasets, while the DenseNet201 model with the MLP classifier was the best for the other dataset with a mean accuracy of 95.64%.

In another work with binary-class images, Breve et al. (2011) performed a set of exhaustive classification experiments. For the COVID-19 detection problem, they used 21 different CNN models from the VGG, ResNet, DenseNet and EfficientNet families and their derivatives (e.g. DenseNet121, EfficientNetB1, ResNet152). In addition, ensembles of these CNN models were also employed. Their dataset contains 16,352 chest X-ray images, where 2358 images are COVID-19 positive and 13,994 are COVID-19 negative. The negative data includes images with non-COVID-19 pneumonia. The results showed that DenseNet169 achieved the best results with an accuracy and F1-score of 98.15% and 98.12%, respectively. The ensemble approach increased the accuracy and F1-score of DenseNet169 to 99.25% and 99.24%, respectively. Maheen et al. (2010) used different pre-trained CNN models to detect COVID-19 from chest X-ray images. The models are AlexNet, VGG-16, MobileNet-V2, SqueezeNet, ResNet-34, ResNet-50 and COVIDX-Net. The dataset contains 406 images distributed evenly between the COVID-19 and healthy classes. ResNet-34 achieved the best prediction accuracy of 98.33%. Shenoy et al. (2010) proposed a new CNN model to detect COVID-19. A dataset containing 4316 chest X-ray images (2158 COVID-19-negative scans and 2158 COVID-19-positive scans) was used, together with data augmentation. The model achieved an accuracy of 95.5%. Hasoon et al. (2021) proposed several methods that combine image processing and classifiers (i.e. K-Nearest Neighbor (KNN) and Support Vector Machine (SVM)) for classification and early detection of COVID-19. The dataset includes normal and COVID-19 pneumonia X-ray images. The method that combines Local Binary Pattern (LBP) and KNN achieved the best accuracy of 99%. Mohammed et al. (2022) proposed an integrated method for selecting the optimal deep learning model based on a novel crow swarm optimization algorithm for COVID-19 diagnosis. The ResNet50 model achieved the best accuracy of 91.46%.

Detection of pneumonia together with COVID-19 has also been considered by many researchers; in that case, it becomes a three-class problem. One of the methods was proposed by Montalbo (2021), where DenseNet121 was modified to classify the Normal, COVID-19 and Pneumonia (Bacterial and Viral) classes. The resulting model, which has fewer parameters and a lower depth than the original one, achieved an accuracy of 97.99%. It did not achieve a better accuracy than the base model but was shown to outperform some state-of-the-art deep convolutional neural network models. In another study, the same author (Montalbo 2022) applied a truncation method to various well-known deep convolutional neural networks to reduce their number of parameters and make them applicable with low computing resources. Chest X-ray images were used, and the results showed that the InceptionResNetV2 model achieved the best accuracy of 97.41% in three-class classification (Normal, COVID-19 and Pneumonia) after truncating it and reducing its parameters to 441 K. In addition, Shome et al. (2021) proposed a vision transformer-based deep learning pipeline for detecting COVID-19 using chest X-ray images. A three-class dataset (Normal, COVID-19 and Pneumonia) containing 30 K chest X-ray images (10 K per class) was used, and the proposed model achieved an accuracy of 98% for binary classification (Normal and COVID-19) and 92% for multi-class classification. Nagi et al. (2022) used a relatively large dataset to evaluate the performance of deep learning. The Xception model was the best in terms of accuracy, achieving 94.21%, while the Custom-Model (the model proposed in that study) achieved an accuracy of 92.38%.

Transfer learning is a widely utilized practical tool in this three-class problem as well. Makris et al. (2020) used several well-known CNN models with a dataset containing 336 chest X-ray images. According to the results, VGG16 and VGG19 achieved the best accuracy score of 95%. El Asnaoui et al. (2020) used well-known CNN architectures, namely DenseNet201, InceptionV3, InceptionResNetV2, ResNet50, MobileNetV2, VGG16 and VGG19, to classify COVID-19. The database used in this work contains 6087 X-ray and CT images (231 COVID-19, 1493 Viral Pneumonia, 2780 Bacterial Pneumonia and 1583 Normal images). The COVID-19 and Viral Pneumonia classes were considered as one class in the classification process. InceptionResNetV2 and DenseNet201 achieved the best results with accuracies of 92.18% and 88.09%, respectively. Alqudah et al. (2020) used pretrained and proposed models such as ShuffleNet, MobileNet and AOCTNet to extract automated features from the images, and then passed these features to Softmax, Support Vector Machine (SVM), Random Forest (RF) and K-Nearest Neighbor (KNN) classifiers. It was shown that the features extracted by MobileNet yielded the best accuracy.

In addition to modifications of available transfer learning models, there are other studies in which specific CNN architectures are proposed. For instance, Antonchuk et al. (2021) proposed a new CNN model for detecting COVID-19 and influenza cases. The model achieved an accuracy score of 93% on a dataset consisting of 4152 X-ray images per class. The CNN architecture proposed by Atitallah et al. (2023) was tested on two different datasets. The first dataset (COVIDx) contains 15,475 chest X-ray images (8851 Normal, 6053 Pneumonia and 571 COVID-19), while the other (Enhanced COVID-19) includes 1092 chest X-ray images (364 images per class). Data augmentation was applied to both datasets, and a class-weighting method was applied to the COVIDx dataset to re-balance it. The results showed that the proposed model achieved accuracies of 94% and 99% for the COVIDx and Enhanced COVID-19 datasets, respectively. Liu et al. (2022) proposed an approach comprising several stages: EfficientNetV2 was used as the backbone network, followed by ResNet101 (feature fusion), a Convolutional Block Attention Module and SVM classifiers, respectively. The dataset contains three classes (COVID-19, Normal and Pneumonia), and data augmentation was applied. The results showed that the system achieved an accuracy of 99.89%.

Different from the studies considering Viral and Bacterial Pneumonia as a single class, it is possible to take them as separate classes and eventually obtain a four-class problem. One example of such work was proposed by Zeiser et al. (2021). In their work, pretrained DenseNet121, InceptionResNetV2, InceptionV3, MobileNetV2, ResNet50 and VGG16 models were used for classification of the X-ray images, together with CLAHE as a preprocessing method. Their dataset contains 5181 images categorized into four classes: COVID-19, Normal, Viral Pneumonia and Bacterial Pneumonia. The results showed that VGG16 achieved the best classification performance with an accuracy of 85.11%, sensitivity of 85.25%, specificity of 85.16% and F1-score of 85.03%. Bolhassani (2021) used an unbalanced chest X-ray dataset together with ResNet50 and DenseNet121 models. To eliminate the effect of class imbalance, they applied data augmentation and achieved an accuracy score of 80.0%. Sait et al. (2021) proposed a model based on InceptionV3 and a multilayer perceptron. A dataset consisting of four classes (COVID-19, Normal, Bacterial and Viral Pneumonia) of chest X-ray images was used without data augmentation. The dataset was split into training and validation sets with a ratio of 80:20. It should be noted that the authors did not set aside part of the dataset as test data, which is important to check the robustness of the model's performance. The proposed model achieved a validation accuracy of 91.3% on the chest X-ray images. In a study focused on determining the seriousness of lung disease using chest X-ray images, Rajinikanth et al. (2022) implemented a pre-trained InceptionV3 scheme with chosen multi-class classifiers to detect pneumonia and check its severity level. The dataset contains four classes (Normal, Mild, Moderate and Severe Pneumonia). The best result in this work was achieved by the K-Nearest Neighbor (KNN) classifier with an accuracy of 85.18%.

Based on the explanations above, the existing studies on pneumonia and COVID-19 detection using X-ray images are summarized in Table 1. As can be seen, there are very different approaches to pneumonia and COVID-19 diagnosis in the literature; however, the majority of these studies either treat it as a binary-class problem or merge viral and bacterial pneumonia into one class. In other words, the analysis of the three mentioned lung diseases is very limited. In addition, most of the works that propose new models do not consider the computational load of the model. Typically, deeper models with more convolutional layers may achieve better feature extraction and eventually more successful classification. However, such models have major drawbacks, such as requiring a large number of images and expensive hardware with heavy computational capability. This problem is present especially in studies considering the four-class problem (healthy, COVID-19, viral pneumonia and bacterial pneumonia). Therefore, this situation is addressed to some extent in this study by proposing a model with a reduced number of convolutional layers and weights. Hence, it becomes more suitable for the detection task to be executed on a wider range of digital devices, including those with relatively low computational power.

Table 1 Summary of related works

3 Methods

3.1 Dataset

In this work, a publicly available dataset of chest X-ray images has been used (Sait et al. 2020). The dataset contains 9207 chest X-ray images categorized as follows: 3269 normal, 1281 COVID-19, 3001 bacterial pneumonia and 1656 viral pneumonia images. Figure 1 shows some sample images from the dataset.

Fig. 1 Samples of a normal, b COVID-19, c Bacterial Pneumonia and d Viral Pneumonia chest X-ray images

The dataset was divided into training, validation and test sets with ratios of 60%, 20% and 20%, respectively. After dividing the dataset, a data augmentation technique (Mikołajczyk 2018) was applied to the training set. Data augmentation is used to increase the number of images in a dataset; this increases its diversity and reduces the risk of overfitting. Horizontal flip and shifting operations were applied, with shift ratios of 10%, 30% and 50% used for both width and height shifts. Table 2 shows the number of images in the training set before and after applying data augmentation.

Table 2 The number of images in training set before and after applying data augmentation
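As an illustration, such an augmentation setup could be configured with Keras' ImageDataGenerator as in the sketch below; the shift ratios mirror the percentages stated above, while the directory path, target size and batch size are illustrative assumptions rather than the exact configuration used in this work.

```python
# Sketch of the described augmentation using Keras. The shift ratios mirror the
# text (10%, 30% and 50%); the directory path, target size and batch size are
# illustrative assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

shift_ratios = [0.1, 0.3, 0.5]

augmenters = [
    ImageDataGenerator(
        horizontal_flip=True,       # horizontal flip, as stated above
        width_shift_range=ratio,    # shift by a fraction of the image width
        height_shift_range=ratio,   # shift by a fraction of the image height
    )
    for ratio in shift_ratios
]

# Example: stream augmented training batches from a (hypothetical) directory.
train_flow = augmenters[0].flow_from_directory(
    "data/train",
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
```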

3.2 The proposed CNN architecture

For detecting COVID-19, viral and bacterial pneumonia samples, a lightweight sequential CNN architecture with a small number of parameters is proposed. The successive convolutional and pooling layers in the model are followed by fully connected layers. Finally, the softmax function is used in the last layer of the classification part for the final prediction. Figure 2 shows a generic CNN architecture with convolutional, pooling and fully connected layers. In the feature extraction part of the proposed model, there are five convolutional and pooling layers, while the classification part involves three dense layers with dropout layers in between.

Fig. 2 A generic architecture of a sequential CNN

As Fig. 2 shows, the convolutional layers receive the input image and convolve it with filters of specific dimensions. This process produces an output known as a feature map. The feature map is then processed by a pooling layer and an activation function. The rectified linear unit (ReLU) was used as the nonlinear activation function due to its ability to accelerate the training process and mitigate the vanishing gradient problem. ReLU maps all negative inputs to zero, while positive inputs pass without any change, as Fig. 3 shows. The mathematical expression of ReLU is:

Fig. 3 ReLU function

$$f(x)=\begin{cases}0, & x<0\\ x, & x\ge 0\end{cases}$$

Pooling layers are responsible for reducing the size of the feature maps; the max-pooling operation was used in the proposed model. It is accomplished by a two-dimensional filter that passes over the feature map, and max-pooling selects the maximum value covered by the filter. This process leads to a lower number of parameters in the model and can speed up the computation. The max-pooling operation with a filter size of 2 × 2 and a stride of 2 is illustrated in Fig. 4.

Fig. 4 Max Pooling operation using filter of 2 × 2 and flattening process

After passing the input through several convolutional and max-pooling layers, the flatten layer converts the resulting two-dimensional, multichannel feature map into a one-dimensional vector. This operation is important because the fully connected layer expects a vector as input. The operation of the flatten layer is shown in Fig. 4.
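A small numerical example, assuming TensorFlow/Keras, illustrates the 2 × 2 max-pooling with stride 2 and the subsequent flattening on a toy 4 × 4 feature map:

```python
import numpy as np
import tensorflow as tf

# A toy 4x4 single-channel feature map (batch size 1) to illustrate Fig. 4.
feature_map = np.arange(16, dtype="float32").reshape(1, 4, 4, 1)

pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)(feature_map)
flattened = tf.keras.layers.Flatten()(pooled)

print(pooled.shape)     # (1, 2, 2, 1): each 2x2 window is reduced to its maximum
print(flattened.shape)  # (1, 4): the pooled map unrolled into a vector for the FC layers
```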

Fully connected (FC) layers are responsible for the final classification. They consist of input, hidden and output layers, each containing many neurons. Softmax was chosen as the activation function of the output layer because it converts the output into a probability distribution. The mathematical expression of softmax is:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
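A brief numerical illustration of the softmax expression is given below; the logits are arbitrary example values for the four classes considered in this study.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum improves numerical stability without changing the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])   # arbitrary raw scores for the four classes
print(softmax(logits))                    # a probability distribution over the classes
print(softmax(logits).sum())              # sums to 1.0
```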

Dropout layers were added to prevent overfitting and provide better generalization of the model. Dropout layers randomly deactivate some neurons in the fully connected layers during the training process; the fraction of such neurons is determined by the user-defined dropout rate. Tables 3 and 4 list the hyperparameter values for each layer in the proposed model. These values, which affect the model performance, were determined empirically while respecting the constraint that the model should have a small number of convolutional layers and weights.

Table 3 The hyperparameters of the feature extraction part in the proposed model
Table 4 The hyperparameters of the classification part in the proposed model
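A structural sketch of the proposed sequential CNN is given below, assuming TensorFlow/Keras. The arrangement follows the description above (five convolution and max-pooling stages, a flatten layer, and three dense layers with dropout in between), but the filter counts, kernel sizes, dense units, dropout rates and the assumed three-channel 224 × 224 input are placeholders; the actual values are those listed in Tables 3 and 4.

```python
# Structural sketch of the proposed sequential CNN: five convolution +
# max-pooling blocks, then flatten, three dense layers with dropout in
# between, and a 4-unit softmax output. All numeric values are placeholders;
# the actual hyperparameters are those reported in Tables 3 and 4.
from tensorflow.keras import layers, models

def build_proposed_cnn(input_shape=(224, 224, 3), num_classes=4):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))

    # Feature extraction part: five convolutional layers, each followed by max pooling.
    for filters in (32, 32, 64, 64, 128):      # placeholder filter counts
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))

    # Classification part: flatten, then three dense layers with dropout in between.
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))   # placeholder units
    model.add(layers.Dropout(0.5))                    # placeholder dropout rate
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

model = build_proposed_cnn()
model.summary()  # inspect the layer shapes and the resulting parameter count
```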

3.3 Transfer learning models

For comparison with the proposed model, transfer learning models were used. Transfer learning is a machine learning technique that uses the weights of pre-trained models as a starting point for training on a new task with a new dataset. Accordingly, the images in our dataset were fed into different pre-trained CNN models, which differ in input image size, number of layers and number of parameters. The models are EfficientNet B0 (Tan and Le 2019), EfficientNet B2 (Tan and Le 2019), InceptionV3 (Szegedy et al. 2016), InceptionResNetV2 (Szegedy et al. 2017), MobileNetV2 (Sandler et al. 2018), NASNetMobile (Zoph et al. 2018), ResNetV2_152 (He et al. 2016), VGG16 (Simonyan and Zisserman 2023) and VGG19 (Simonyan and Zisserman 2023). These models were trained on the ImageNet dataset (Russakovsky et al. 2015). The weights of all layers in these models were frozen except for the output layer, which was replaced with a four-unit layer using a softmax activation function. Table 5 shows the input image dimensions, the total number of parameters and the number of trainable parameters for the transfer learning models and the proposed model.

Table 5 Number of parameters and trainable parameters in transfer learning models and the proposed model
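This benchmark setup can be sketched as follows, using EfficientNet B0 as one representative model; the global-average-pooling head is an assumption, since the text only specifies that the pretrained layers were frozen and the output layer was replaced by four softmax units.

```python
# Transfer-learning sketch with EfficientNet B0 as a representative benchmark:
# ImageNet weights are frozen and a new 4-unit softmax output is attached.
# The global-average-pooling head is an assumption.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.EfficientNetB0(
    weights="imagenet",        # ImageNet-pretrained weights
    include_top=False,         # drop the original 1000-class classifier
    input_shape=(224, 224, 3),
)
base.trainable = False         # freeze all pretrained layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(4, activation="softmax"),  # Normal, COVID-19, Bacterial, Viral Pneumonia
])
```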

3.4 Evaluation metrics

Standard metrics such as accuracy, precision, recall and F1-score were considered for evaluating the pre-trained models and the proposed model. The components of the confusion matrix shown in Table 6 were used for calculating these metrics.

Table 6 The confusion matrix

The variable $P_{CC}$ refers to the number of predictions in which images labeled as COVID-19 were correctly classified as COVID-19; it represents the True Positives ($TP_{Covid19}$) of the COVID-19 class. On the other hand, $P_{NC}$, $P_{BC}$ and $P_{VC}$ represent the COVID-19 images incorrectly labeled as Normal, Bacterial Pneumonia and Viral Pneumonia, respectively.

The True Negatives (TN) for each class can be calculated by summing all entries of the confusion matrix except those in the row and column of the class under study. The following equation shows the True Negatives of the COVID-19 class:

$${TN}_{Covid19}= {P}_{NN}+ {P}_{BN}+ {P}_{VN}+ {P}_{NB}+ {P}_{BB}+ {P}_{VB}+ {P}_{NV}+ {P}_{BV}+ {P}_{VV}$$

The False Positives (FP) are the sum of all values in the column of the class under study except the true positive value. The equation for the false positives of the COVID-19 class is:

$${FP}_{Covid19}= {P}_{CN}+ {P}_{CB}+ {P}_{CV}$$

The False Negatives (FN) are the sum of all values in the row of the class under study except the true positive value. The equation for the false negatives of the COVID-19 class is:

$${FN}_{Covid19}= {P}_{NC}+ {P}_{BC}+ {P}_{VC}$$

The remaining components of the confusion matrix can be explained and calculated in the same manner. Using these values, the metrics are calculated as given below:

$$Precision= \frac{TP}{FP+TP}$$
$$Recall= \frac{TP}{TP+FN}$$
$$F1 Score=2\times \left(\frac{Precision \times Recall}{Precision + Recall}\right)$$

The accuracy was calculated on the basis of the class-specific values, where the total number of true positives is divided by the total number of samples in the test set. As a result, the same accuracy value is obtained for every class using the formula below:

$$Accuracy= \frac{P_{CC}+P_{NN}+P_{BB}+P_{VV}}{\#\text{ samples in the test set}}$$
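The per-class metrics and the overall accuracy defined above can be computed from a 4 × 4 confusion matrix as in the following sketch; the matrix entries are made-up values for illustration only, with rows taken as true classes and columns as predicted classes.

```python
import numpy as np

# Illustrative 4x4 confusion matrix (rows: true class, columns: predicted class)
# in the order COVID-19, Normal, Bacterial Pneumonia, Viral Pneumonia.
# The entries are made-up numbers, not results from this study.
cm = np.array([
    [250,   2,   3,   1],
    [  4, 640,   8,   2],
    [  3,   6, 520,  70],
    [  2,   4,  90, 235],
])

accuracy = np.trace(cm) / cm.sum()            # sum of diagonal over all test samples
for i, name in enumerate(["COVID-19", "Normal", "Bacterial", "Viral"]):
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp                  # column sum minus the diagonal entry
    fn = cm[i, :].sum() - tp                  # row sum minus the diagonal entry
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
print(f"overall accuracy={accuracy:.3f}")
```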

3.5 Hyperparameters tuning

The Adam optimizer (Kingma and Ba 2023) was used for training all the models mentioned in this work. The exponential decay rates (beta 1 and beta 2) for the first and second moment estimates were set to their default values of 0.9 and 0.999, respectively. The learning rate was chosen as 0.001. Batch sizes of 32 and 64 were used for all the experiments. The number of epochs was chosen as 50 for the proposed model, while it was 1, 3 and 5 for the pre-trained models; it was not increased further because no improvement in performance was observed. Table 7 summarizes these hyperparameter values.

Table 7 Hyperparameter configuration
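A sketch of the corresponding training configuration is shown below; the categorical cross-entropy loss and the array-based training data (`x_train`, `y_train`, `x_val`, `y_val`) are assumptions, while the optimizer settings, epochs and batch sizes follow Table 7.

```python
# Training configuration following Table 7. The categorical cross-entropy loss and
# the array-based data (x_train, y_train, x_val, y_val) are assumptions; `model` is
# one of the networks sketched above.
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),  # Table 7 values
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=50,        # 50 for the proposed model; 1, 3 and 5 were used for the pretrained models
    batch_size=32,    # batch sizes of 32 and 64 were both evaluated
)
```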

The code was written in Python (version 3.6.13) using the TensorFlow and Keras libraries. It was executed on a Radeon RX 580 GPU.

4 Experimental results

4.1 Transfer learning results

The models mentioned in Sect. 3.3 were trained, and a confusion matrix was generated for each experiment. Table 8 shows the best result achieved by each model, with the corresponding confusion matrix, prediction accuracy, precision, recall and F1-score. These transfer learning experiments on the widely used benchmark models allow identifying the best performing model together with the appropriate number of epochs and batch size.

Table 8 Prediction results of the benchmark models

As shown in Table 8, the EfficientNet B2 model achieved the best overall prediction accuracy. It was the best at predicting the images labeled as COVID-19, with a precision of 0.99 and a recall of 0.98, and can therefore be considered the most appropriate model for identifying COVID-19 images. The model also showed good performance in predicting the images labeled as Normal (healthy). Its performance declined when predicting Bacterial and Viral Pneumonia, where the number of misclassifications was high. ResNetV2_152 achieved the second-best prediction accuracy; however, it did not achieve satisfying results in COVID-19 prediction compared to the EfficientNet B2 model. On the other hand, ResNetV2_152 was the best among the other models at detecting Viral Pneumonia. With regard to Bacterial Pneumonia, InceptionResNetV2 achieved the best accuracy for this class; however, its accuracy in classifying Viral Pneumonia drops sharply.

On the other hand, the VGG models achieved the lowest overall accuracy. These two models (especially VGG16) were not able to predict COVID-19 images properly and gave very low sensitivity for the Viral Pneumonia class. The large number of parameters and very deep structure of these two models may be the reason for their poor performance. This suggests that models with a large number of parameters may not always be suitable for problems with a relatively small number of classes.

It is notable from Table 8 that most of the misclassifications belong to the Bacterial and Viral Pneumonia classes, which decreased the overall accuracy of the models. It is possible that combining these two classes and making the classification three-class (i.e. Pneumonia, COVID-19 and Normal) instead of four-class would give a higher accuracy, as many studies have shown. However, separating these two classes gives a better overview of the ability of these models to identify the lung diseases and certainly provides a more specific diagnosis.

4.2 The proposed model results

The number of parameters in the proposed model is significantly lower than in the transfer learning models (see Table 5). In addition, the model has fewer layers (lower depth). This helps to highlight the effect of depth and the number of parameters on the prediction results.

The proposed model was trained from scratch (unlike transfer learning) with batch sizes of 32 and 64. The number of epochs was chosen as 50 because no improvement in the training accuracy was observed beyond this number. The images were resized to 224 × 224 × 2. The weights of the epoch that gave the highest validation accuracy were used for testing the model. The corresponding detailed results are given in Tables 9 and 10.

Table 9 Prediction performance of the proposed model
Table 10 Performance metrics of the proposed model
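Keeping the weights of the epoch with the highest validation accuracy, as described above, could be implemented with a checkpoint callback such as the following sketch; the file path and the assumed data arrays are illustrative.

```python
# Keeping the best-validation-accuracy weights via a checkpoint callback; the file
# path is an illustrative assumption.
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(
    "best_weights.h5",
    monitor="val_accuracy",    # track validation accuracy after every epoch
    save_best_only=True,       # overwrite only when validation accuracy improves
    save_weights_only=True,
)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=50, batch_size=64,
          callbacks=[checkpoint])

model.load_weights("best_weights.h5")        # restore the best epoch's weights
test_loss, test_acc = model.evaluate(x_test, y_test)
```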

Table 9 shows that the proposed model with a batch size of 64 achieved the best prediction accuracy of 89.89%. This result indicates that the proposed model was better at detecting the diseases than the transfer learning models listed in Table 5. In particular, the proposed model was better at detecting the Viral and Bacterial Pneumonia classes, for which its precision and recall values are higher than those of the transfer learning models.

On the other hand, when the performance values in Tables 8 and 10 are compared, it can be observed that the EfficientNet B2 and B0 models were slightly better at predicting the COVID-19 class than the proposed model. The precision and recall values for COVID-19 detection with the EfficientNet B2 model were 0.99 and 0.98, respectively, while they were 0.98 and 0.96 for the proposed model.

As mentioned before, the proposed model has a relatively low number of parameters and layers compared to the pre-trained models. Apparently, this relatively small capacity was sufficient for the model to extract good features and perform the classification task, and it therefore achieved a better overall accuracy. In contrast, a high number of layers, as in the benchmark models, may have a negative effect on a classification task with a small number of classes.

The average training and prediction times of the benchmark models and the proposed model for a single image were also calculated and compared (Table 11). Among the benchmark models, EfficientNet B0 and InceptionResNetV2 are the most and least time-consuming models in the training process, respectively. With regard to the prediction time, VGG16 and InceptionResNetV2 are the most and least time-consuming models, respectively. When the time consumption of the proposed model is compared with the benchmark models, it is notable that the proposed model is significantly faster in both the training and prediction phases.

Table 11 The training and prediction time for all the models

4.3 Ablation study

An ablation study was conducted on the proposed model to better understand the network's behavior and to justify its robustness. Ablation in machine learning means deleting part of the network and training the model again to check the function or effect of the deleted layer on the overall performance. For this purpose, one convolutional layer at a time was deleted from the network and the results were recorded. Table 12 shows the details of the best prediction accuracy after deleting each layer separately.

Table 12 Best prediction accuracy after applying the ablation
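The ablation procedure can be sketched as below, assuming the placeholder architecture of Sect. 3.2 and assuming that each convolutional layer is removed together with its pooling layer, which would also be consistent with the parameter growth reported in Table 13; the filter counts and dense units remain placeholders.

```python
# Hedged sketch of one ablation variant: the i-th convolution + pooling block is
# removed and the network is retrained. Removing a pooling stage leaves a larger
# feature map at the flatten layer, so the dense layers gain parameters, which is
# consistent with the growth reported in Table 13. Numeric values are placeholders.
from tensorflow.keras import layers, models

def build_ablated_cnn(skip_block, filters=(32, 32, 64, 64, 128),
                      input_shape=(224, 224, 3), num_classes=4):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for i, f in enumerate(filters):
        if i == skip_block:                      # drop this convolution + pooling block
            continue
        model.add(layers.Conv2D(f, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

# Example: remove the third block and inspect how the parameter count changes.
ablated = build_ablated_cnn(skip_block=2)
ablated.summary()
```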

Comparing the results in Table 12 with the results of the proposed model in Table 9, it is notable that the prediction accuracy decreases when the ablated models are used. As for the confusion matrices, the entries corresponding to false predictions are generally higher. In addition, the gap between the training and validation accuracies increases during the ablation study, meaning that the model becomes prone to overfitting when some layers are removed. Moreover, the difference appears clearly in the number of parameters and in the training and prediction times, as Table 13 shows.

Table 13 The consumed time after the ablation process

As expected, comparing Tables 5, 11 and 13 shows that sequentially ablating the convolutional layers caused a significant increase in the number of parameters, and thus an increase in the time required for training and prediction. This increase in the number of parameters did not lead to an increase in the prediction accuracy, as Table 12 shows. This indicates that the low number of parameters in the original model is sufficient to achieve the prediction task.

4.4 Optimizer effect

Optimizers are algorithms used to update the weights of neural networks to reduce the overall loss and increase the performance. The effect of using different optimizers on the detection performance of the proposed model was also evaluated. For this purpose, optimizers such as Adaptive Gradient (AdaGrad) (Duchi et al. 2011) and Stochastic Gradient Descent (SGD) (Bottou 2012) were used. Table 14 shows the best prediction accuracy obtained by the proposed model with the SGD and AdaGrad optimizers.

Table 14 The results of using different types of optimizers
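The optimizer comparison can be sketched as follows; the learning rates for SGD and AdaGrad, the loss function and the data arrays are assumptions, and `build_proposed_cnn` refers to the placeholder builder of Sect. 3.2.

```python
# Sketch of the optimizer comparison: the same architecture is rebuilt (to reset
# the weights), compiled with a different optimizer and retrained. Learning rates
# and the loss are assumptions; `build_proposed_cnn` and the data arrays are the
# placeholders introduced earlier.
from tensorflow.keras.optimizers import SGD, Adagrad

for name, optimizer in [("SGD", SGD(learning_rate=0.001)),
                        ("AdaGrad", Adagrad(learning_rate=0.001))]:
    model = build_proposed_cnn()
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              epochs=50, batch_size=64, verbose=0)
    loss, acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"{name}: test accuracy = {acc:.4f}")
```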

The results in Table 14 show that the proposed model with a batch size of 64 and the AdaGrad optimizer achieved a prediction accuracy of 89.13%, which is slightly lower than the result obtained with the Adam optimizer. On the other hand, SGD failed to achieve a competitive prediction accuracy. The training accuracy with the SGD optimizer did not exceed 87.32% after 50 epochs of training, which indicates slow convergence.

In general, the AdaGrad optimizer adapts the learning rate at each iteration depending on how the parameters change during the training process. This feature may have led to a better accuracy than the SGD optimizer.

Comparing the confusion matrices obtained with the Adam optimizer (Table 9) and the AdaGrad optimizer (Table 14), it is notable that AdaGrad performs better at detecting Viral Pneumonia. However, the Adam optimizer led to better results in classifying the rest of the classes and gave a better overall accuracy.

5 Discussion

The results show that the proposed model is able to outperform the benchmark models in detecting the lung diseases. The proposed model achieved an overall accuracy of 89.89%, while EfficientNet B2, the best among the benchmark models, had an overall accuracy of 85.7%.

Figure 5 shows that the benchmark models EfficientNet B0 and EfficientNet B2 were slightly better at detecting COVID-19 than the proposed model. In addition, the MobileNetV2 model was slightly more accurate in detecting the Normal class. On the other hand, the proposed model was much better at detecting the Bacterial Pneumonia class than the benchmark models and slightly better at detecting the Viral Pneumonia class.

Fig. 5 Comparison of the number of true positives between the different models

Examining the confusion matrices of the proposed model and the benchmark models in Tables 8 and 9, it is notable that all these models show a significant decrease in accuracy when detecting the pneumonia classes (i.e. Viral and Bacterial Pneumonia) compared with the COVID-19 and Normal classes. The tables show that the accuracy of predicting Viral Pneumonia is much lower than that of predicting Bacterial Pneumonia. In addition, most of the misclassifications in these two classes are due to the models confusing Viral Pneumonia with Bacterial Pneumonia, and vice versa. The reason for this may be the small number of samples in the Viral Pneumonia data compared with the Bacterial Pneumonia data, as shown in Table 2.

The imbalance in the data for these two classes (i.e. Bacterial and Viral Pneumonia) possibly had a negative effect on the training process and made the models unable to differentiate between these two diseases. This limitation can be addressed in future work if new chest X-ray images are obtained and added to the dataset. Obviously, obtaining and accessing data, especially medical data, is one of the difficulties that hamper researchers because of privacy concerns.

The low number of parameters is an additional advantage that characterizes the proposed model in this study. The proposed model has around 1 million parameters, while the EfficientNet B2 model, which achieved the best prediction accuracy among the pre-trained models, has 7.7 million parameters, as Table 5 shows. The low number of parameters and layers in the proposed model leads to lower prediction and training times compared with the pre-trained models (Table 11). This means that the proposed model is faster, needs fewer resources and is better suited to operate in places that do not have high computing power.

Table 5 also shows that the proposed model has fewer parameters than the MobileNetV2 model, which was designed to work with fewer operations. The low depth of the proposed model may have a positive effect on the classification accuracy for chest X-ray images, whereas deeper models can degrade the extracted features and thus lower the accuracy in tasks with a relatively small number of classes.

In the study closest to this work, Sait et al. (2021) (mentioned in Table 1) used the same dataset with the same number and types of classes to check the ability of CNNs to classify lung diseases. However, in their work, the authors did not use data augmentation techniques to increase the diversity of the dataset and did not verify the efficiency of their proposed model on a test set. The results of both studies are summarized in Table 15.

Table 15 Comparison between this work and (Sait et al. 2021)

Table 15 shows that the model proposed in this study has a much lower number of parameters than the model (based on InceptionV3) proposed in Sait et al. (2021). This means that our model requires lighter computing resources and thus trains faster. Another notable property of the other study is that the dataset was split into training and validation sets only; in other words, only the validation set was utilized for the final evaluation of the proposed method, without a test set. It is a well-known convention in such machine learning problems that model performance is assessed on a separate set of samples that are not used during the training process. Using the validation data, which is often used to optimize hyperparameters, to check the performance of the model does not always provide reliable results.

Since the two works could not be compared using test accuracy, the validation accuracy of the model proposed in this work has been included in Table 15. The proposed model outperforms the other in terms of validation accuracy.

6 Conclusion

In this work, a lightweight diagnosis model based on a convolutional neural network was proposed to diagnose lung diseases, namely COVID-19, Bacterial and Viral Pneumonia. All experiments regarding the development and testing of the proposed model were carried out on a publicly available chest X-ray dataset. To validate and highlight the effectiveness of the proposed model, state-of-the-art pre-trained CNN models were used for the same prediction task and their performances were compared. Among these models, the pre-trained EfficientNet B2 achieved the highest classification accuracy of 85.7%. The proposed model outperformed the pre-trained benchmark models by achieving an overall prediction accuracy of 89.89% with a batch size of 64. A notably high accuracy in detecting COVID-19 samples was obtained by both the proposed model and the benchmark models; however, the pre-trained EfficientNet B2 model showed a slightly better result in predicting COVID-19, with a precision and recall of 0.99 and 0.98, respectively. In general, all the models used in this work showed a relatively poor precision for the Viral Pneumonia class and confused it with the Bacterial Pneumonia class, and vice versa, which decreased the overall accuracy.

The low number of samples in the Viral Pneumonia class may have hampered the models from extracting better representations from the image content, resulting in a relatively low prediction performance for this class. This can be considered the main limitation of the study. It is expected that supporting the Viral Pneumonia class with more samples will improve the performance of the models.

Besides the performance of the proposed model, this study contributes to the related literature by presenting a model with a significantly low number of parameters. This advantage makes the model applicable in medical facilities and areas that do not have devices with high computational resources. Furthermore, the system can easily be integrated with a user interface on a regular computer and can be used by medical staff with no technical computer skills.

Since such a deep learning-assisted diagnosis model is suitable for computers with limited computational power, it may be executed on edge devices or single-board computers. Hence, the proposed study has another practical application as a part of Internet of Things (IoT) systems. Provided that the decisions are made with the proposed model on an edge device, the application will have advantages such as saving bandwidth and rapid assessment of the input image to perform the diagnosis. Therefore, such applications may enable more convenient diagnosis practices while helping to prevent the spread of viruses.

As future work, image processing techniques such as Contrast Limited Adaptive Histogram Equalization (CLAHE) can be applied to enhance the quality of the chest X-ray images used in this work. Ensemble methods may also be utilized by considering the class-specific correct detection rates of different classifier models. In addition, new images can be added to the Viral Pneumonia class to balance the data and increase the capability of the models to extract better representations.