1 Introduction

The American Society for Gastrointestinal Endoscopy (ASGE) supports the analysis of endoscopic images of the gastrointestinal (GI) tract to assist clinicians in making correct decisions [1]. Endoscopic imaging technology has refined diagnostic and therapeutic procedures and offers alternative techniques by which patients can avoid biopsy and surgical procedures [2]. Explorations of the digestive system based on endoscopic images are performed using gastroscopy for the upper GI tract and colonoscopy for the lower GI tract. Statistical analyses of GI disorders show that various diseases, such as oesophageal, stomach and colorectal cancer [3], may result in death. The most common cancers in the GI tract are colorectal cancer, representing 1.80 million cases, and stomach cancer, representing 1.03 million cases. In the United States, approximately 862,000 colorectal cancer and 783,000 stomach cancer deaths occur each year [4]. The factors related to GI diseases include environmental factors (Helicobacter pylori infection, a poor diet, food storage), treatment factors (using antibiotics to kill a specific bacterium, poorly qualified gastroenterologists), genetic factors (inherited cancer genes), and unknown factors. In some cases, optical diagnoses by endoscopic imaging examination suffer from endoscopist errors, lengthy procedures and poor-quality images [5]. Therefore, computer-assisted diagnosis systems for GI images can enable accurate and rapid classification by discriminating between a normal and a diseased GI tract, thereby helping to reduce the mortality level for GI diseases [6].

In general, endoscopic images of the GI tract are considered to be biomedical images. It is essential to create deep learning algorithms to process these large volumes of images before the disease diagnosis. A major challenge in biomedical imaging is to classify the low-level visual images obtained from imaging devices. The deep convolutional neural network (CNN) is a common learning algorithm that has achieved success in medical image classification [7]. For example, CNNs have been efficiently applied for polyp detection in colonoscopy videos [8], for lung image classification [9], for pancreas segmentation in CT images [10], and for brain tumour segmentation in magnetic resonance imaging (MRI) scans [11]. Additionally, CNN frameworks running on accelerated hardware have been utilized for medical image retrieval [12] and for medical image segmentation [13]. Thus, the CNN architecture has encouraged rapid automated classification of large numbers of medical images.

This paper presents a CNN model for classifying GI diseases from endoscopic images. The remainder of this paper is structured as follows. Related works are discussed in Sect. 2, especially from the perspective of previous CNN architectures that use batch normalization (BN) [14] and the exponential linear unit (ELU) as the activation function [15]. Section 3 describes the image dataset used in this paper. In Sect. 4, the proposed methodology is explained. The experimental outcomes are reported in Sect. 5. Lastly, a few concluding comments are given in Sect. 6.

2 Related Works

CNNs have been used extensively to solve issues related to computer vision, such as image identification [16], because a CNN is one of the most effective ways to extract features for non-trivial tasks [17]. Numerous variants of CNN architectures that have advanced the results in different image classification tasks can be found in the literature, for example, VGG16 and VGG19 [18], which were runners-up in ILSVRC-2014. VGG16 is a 16-layer network containing 13 convolutional layers, three fully connected layers, and five max-pooling layers, while VGG19 is a 19-layer network containing 16 convolutional layers, three fully connected layers, and five max-pooling layers. Despite the successes of these architectures, one of their drawbacks is that they are difficult to train [19].

In addition, a wide range of techniques have been developed to improve the performance or facilitate the training of CNNs, such as incorporating BN or using an ELU as the activation function. BN is a technique introduced by Ioffe and Szegedy for accelerating deep network training. BN has become a typical element in modern high-performing CNN designs such as Inception V3 [20], which achieved the lowest error rate (3.08%) in the ImageNet challenge. BN helps the network to train faster, achieve higher accuracy, stabilize the distribution of layer inputs and reduce the internal covariate shift [14, 21, 22].

The experimental results in [15] indicated that the ELU activation function accelerates learning in deep neural networks, leads to higher classification accuracies with better generalization performance than other activation functions such as rectified linear units (ReLUs), and, when combined with BN, outperforms ReLU with BN. Motivated by this previous work, the present paper uses a CNN model that incorporates the BN technique and uses ELU as the activation function to accelerate the identification of GI diseases from endoscopic images.

3 Dataset Description

The Kvasir dataset [23] has two versions. Deep learning methods were implemented using version one in [24, 25]; however, version two, which was released in 2017, has not been used in any previous studies in this field. Therefore, in this paper the proposed model is applied to version two of the Kvasir dataset. Kvasir version two has a size of 2.3 GB and contains 8,000 images with 720 × 576 pixels. These data are divided into eight classes with 1,000 images for each class. The Kvasir dataset was created from endoscopic images of GI tract diseases. The descriptions of the eight classes are listed in Table 1. These data consist of three types: anatomical landmarks, pathological findings, and polyp removal, as shown in Fig. 1.

Table 1 Descriptions of the eight classes of endoscopic images of the GI tract
Fig. 1 Endoscopic images of the gastrointestinal (GI) tract for anatomical landmarks, pathological findings, and polyp removal: (a) Z-line, (b) pylorus, (c) caecum, (d) oesophagitis, (e) polyps, (f) ulcerative colitis, (g) dyed and lifted polyps, (h) dyed resection margins

4 The Proposed Approach

The proposed approach involves four phases: image preprocessing, data augmentation, feature extraction, and classification. The feature extraction phase includes convolutional layers, BN layers, ELU layers, and max-pooling layers, while the classification phase contains fully connected layers, a BN layer, an ELU layer, a dropout layer, and a softmax layer. Figure 2 shows a structural representation of the proposed approach. The proposed approach was trained using the RMSProp optimizer [26] with a learning rate of 1e−4, categorical cross-entropy as the loss function [27], a batch size of 32 and 115 epochs, as shown in Table 2.

Fig. 2 Graphical representation of the proposed CNN approach. Conv. Layer = Convolutional layer, BN. Layer = Batch normalization layer, ELU layer = Exponential Linear Unit layer, FC1 = the first fully connected layer, FC2 = the second fully connected layer

Table 2 Hyperparameter values of the proposed CNN approach. FC1 = the first fully connected layer, FC2 = the second fully connected layer

4.1 Image Preprocessing Phase

The dataset was split into three separate file groups: a training set with 700 images of each class, a validation set with 150 images of each class, and a test set with 150 images of each class. Within each group, each class was stored in a separate file.
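The 700/150/150 split per class can be sketched as follows. This is only an illustrative sketch: the paper does not state how images were assigned to each set, so the seeded random shuffle, the function name `split_class` and the image identifiers are assumptions.

```python
import random


def split_class(image_ids, n_train=700, n_val=150, n_test=150, seed=0):
    """Shuffle one class's image identifiers and cut them into three disjoint sets."""
    if len(image_ids) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the number of images")
    shuffled = list(image_ids)
    # Assumption: a seeded random shuffle, so that the split is repeatable.
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers for the 1,000 images of one class.
ids = [f"img_{i:04d}" for i in range(1000)]
train, val, test = split_class(ids)
print(len(train), len(val), len(test))  # 700 150 150
```

Applying the same function to each of the eight classes yields 5,600 training, 1,200 validation and 1,200 test images in total.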

Before loading the images into the proposed approach, all the images in the training, validation and test sets were resized to a resolution of 400 × 400 pixels to decrease the computational time, and they were normalized by dividing the colour value of each pixel by 255 to obtain values in the range [0, 1].

4.2 Data Augmentation Phase

Data augmentation techniques increase the amount of training data available, which is crucial when training a deep learning model from scratch [28]. Data augmentation was used in this paper to overcome the overfitting that can result from small training dataset sizes. Data augmentation encompasses many techniques, such as rotation, width shift, height shift, shear, zoom, horizontal flip and fill mode. These techniques were used in this paper to apply various transformations to the images, as listed in Table 3.

Table 3 Data augmentation techniques and their corresponding values
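To make two of these transformations concrete, the following pure-Python sketch applies a horizontal flip and a width shift to a tiny image stored as a list of pixel rows. It is an illustration rather than the pipeline used in the paper: the function names are hypothetical, and filling vacated pixels with the nearest edge pixel is an assumed fill mode, since the values of Table 3 are not reproduced here.

```python
def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def width_shift_nearest(image, shift):
    """Shift every row right by `shift` pixels (left if `shift` is negative).

    Pixels that would come from outside the frame copy the nearest edge
    pixel, which is one possible fill mode (assumed here).
    """
    width = len(image[0])
    shifted = []
    for row in image:
        new_row = []
        for col in range(width):
            src = min(max(col - shift, 0), width - 1)  # clamp to the valid range
            new_row.append(row[src])
        shifted.append(new_row)
    return shifted


tiny = [[1, 2, 3],
        [4, 5, 6]]
print(horizontal_flip(tiny))         # [[3, 2, 1], [6, 5, 4]]
print(width_shift_nearest(tiny, 1))  # [[1, 1, 2], [4, 4, 5]]
```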

4.3 Feature Extraction Phase

The convolutional layers, BN layers, ELU layers, and max-pooling layers were used to extract important features from the images.

  • Convolutional Layers: the proposed approach involves six convolutional layers [29]. All convolutional layers contain 64 filters except for the last two layers (layer five and layer six), each of which contains 128 filters. A kernel size of 3 × 3, a stride of 2 and "same" padding were used in all convolutional layers (see Table 2).

  • Batch Normalization (BN) is a recent approach for accelerating deep neural network training that normalizes each scalar feature independently so that it has a mean of zero and unit variance, as shown in steps one, two and three of Algorithm 1. Then, the normalized value for each training mini-batch is scaled and shifted by the scale and shift parameters \(\gamma\) and \(\beta\), as shown in step four of Algorithm 1. This transformation ensures that the input distribution of each layer remains unchanged across different mini-batches; thus, BN reduces the internal covariate shift and the number of iterations required for convergence and simultaneously improves the final performance. BN maintains non-trainable weights (the mean and variance vectors) that are updated via layer updates instead of through backpropagation [30]. BN can be considered as another layer that can be inserted into the model architecture, similar to a convolutional layer, an activation layer or a fully connected layer [31]. The proposed CNN approach includes eight BN layers, of which six are used in the feature extraction phase and two are used in the classification phase. In the proposed approach, a BN layer was added before each activation function layer.

    Algorithm 1 Batch normalization transform applied to the activations of a mini-batch
  • An Exponential Linear Unit (ELU) is the activation function used in the proposed approach and is given in [32] as

$$\mathrm{elu}(x) = \begin{cases} \alpha \left( \exp(x) - 1 \right) & \text{if } x \le 0 \\ x & \text{if } x > 0 \end{cases}$$
(1)
    in which the gradient w.r.t. the input is

$$\frac{d}{dx}\,\mathrm{elu}(x) = \begin{cases} \mathrm{elu}(x) + \alpha & \text{if } x \le 0 \\ 1 & \text{if } x > 0 \end{cases}$$
(2)
    where \(\alpha\) = 1.

  • The proposed CNN approach involves seven ELU layers, of which six are used in the feature extraction phase and one is used in the classification phase. Each ELU layer was placed directly after a BN layer, as shown in Fig. 2.

  • Max-Pooling aims to down-sample the input representation in the feature extraction phase [33]. In this paper, six max-pooling layers of size 2 × 2 were used, each placed after an ELU layer in the feature extraction phase.
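The BN transform and the ELU activation described above can be sketched in a few lines of plain Python for a single scalar feature. This is a minimal training-time sketch, not the implementation used in the paper: the small constant `eps` added to the variance for numerical stability and its value are assumptions, and the moving averages that BN uses at inference time are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one scalar feature over a mini-batch (a list of floats)."""
    m = len(batch)
    mean = sum(batch) / m                                   # step 1: mini-batch mean
    var = sum((x - mean) ** 2 for x in batch) / m           # step 2: mini-batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]  # step 3: normalize
    return [gamma * v + beta for v in x_hat]                # step 4: scale and shift


def elu(x, alpha=1.0):
    """ELU activation, Eq. (1)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x, alpha=1.0):
    """Gradient of ELU with respect to its input, Eq. (2)."""
    return 1.0 if x > 0 else elu(x, alpha) + alpha


activations = [0.5, -1.2, 3.0, 0.1]   # made-up pre-activation values
bn_out = batch_norm(activations)
elu_out = [elu(v) for v in bn_out]    # BN first, then ELU, as in the text
```

With `gamma = 1` and `beta = 0`, the normalized values have a mean of zero and a variance very close to one, and the ELU leaves positive values unchanged while mapping negative values into the interval (−α, 0).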

4.4 Classification Phase

The classification phase flattens the output of the feature extraction phase [34] and classifies the images using two fully connected (FC) layers, in which the first (FC1) contains 512 neurons and the second (FC2) contains 8 neurons, together with a BN layer, an ELU layer and a dropout layer [35] with a dropout rate of 0.3 to prevent overfitting. Finally, a softmax layer was added [36].
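As an illustration of the final step, the sketch below converts eight output scores into class probabilities with a softmax and evaluates the categorical cross-entropy for one image. The scores and the true class index are made-up values, and subtracting the maximum score before exponentiation is a standard numerical-stability choice rather than something stated in the paper.

```python
import math


def softmax(logits):
    """Map a list of real-valued scores to probabilities that sum to one."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_index):
    """Loss for one sample whose one-hot label is 1 at `true_index`."""
    return -math.log(probs[true_index])


logits = [2.0, 0.5, 0.1, -1.0, 0.0, 0.3, -0.5, 1.0]  # made-up FC2 outputs, one per class
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the most probable class
loss = categorical_cross_entropy(probs, true_index=0)
```

The predicted class is the index with the largest probability, and the loss shrinks towards zero as the probability assigned to the true class approaches one.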

4.5 Checkpoint Ensemble Phase

When training a neural network model, the checkpoint technique [37] can be used in two ways: either all the model weights are saved and then used to obtain the final prediction, or only the model improvements are checkpointed, so that the best weights alone are saved and used to obtain the final prediction, as shown in Fig. 3. In this paper, checkpointing was applied to save the best network weights (those that gave the lowest classification loss on the validation dataset).
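The "best weights only" policy amounts to overwriting the saved checkpoint whenever the validation loss improves. The sketch below shows that selection logic on a made-up loss history; the function name `best_checkpoint` is hypothetical, and the actual saving of weights is left out.

```python
def best_checkpoint(val_losses):
    """Return (epoch, loss) of the checkpoint that a 'save best only' policy keeps."""
    best_epoch, best_loss = None, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:  # improvement: the saved weights would be overwritten here
            best_epoch, best_loss = epoch, loss
    return best_epoch, best_loss


history = [0.92, 0.61, 0.48, 0.52, 0.45, 0.47]  # made-up validation losses per epoch
print(best_checkpoint(history))  # (5, 0.45)
```

Epochs that do not improve on the best validation loss seen so far (here, epochs 4 and 6) leave the saved weights untouched.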

Fig. 3 The rounded boxes going from left to right represent a model’s weights at each step of a particular training process. The lighter shades represent better weights. When checkpointing the neural network model, all the model weights are saved to obtain the final prediction P. When checkpointing neural network improvements, only the best weights are saved to obtain the final prediction P

5 Experimental Results

The architecture of the proposed approach was built using the Keras library [38] with TensorFlow [39] as the backend. The Keras library’s ImageDataGenerator class [40] was used to normalize the images during the image preprocessing phase and to perform the data augmentation techniques, as shown in Fig. 4, while the ModelCheckpoint callback was used to perform the checkpoint ensemble phase. The proposed approach has 2,702,056 total parameters, of which 2,699,992 are trainable parameters and 2,064 are non-trainable parameters that come from using the BN layers.
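The reported parameter totals can be checked with a short calculation. The sketch below makes several assumptions that the text does not spell out: the input is a 400 × 400 RGB image, the convolutions preserve the spatial size so that only the 2 × 2 max-pooling halves the feature maps (rounding down), and a BN layer follows both FC1 and FC2, each BN layer holding two trainable and two non-trainable values per channel. Under these assumptions the count reproduces the figures quoted above.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one 2-D convolutional layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def count_parameters(input_size=400, in_channels=3,
                     filters=(64, 64, 64, 64, 128, 128), fc_units=(512, 8)):
    trainable = non_trainable = 0
    size, channels = input_size, in_channels
    for f in filters:
        trainable += conv_params(channels, f)
        trainable += 2 * f        # BN scale (gamma) and shift (beta)
        non_trainable += 2 * f    # BN moving mean and moving variance
        size //= 2                # assumed: only the 2 x 2 max-pooling down-samples
        channels = f
    n_in = size * size * channels  # flattened feature vector
    for n_out in fc_units:
        trainable += dense_params(n_in, n_out) + 2 * n_out
        non_trainable += 2 * n_out
        n_in = n_out
    return trainable, non_trainable


trainable, non_trainable = count_parameters()
print(trainable, non_trainable, trainable + non_trainable)  # 2699992 2064 2702056
```

Under these assumptions the flattened feature vector has 6 × 6 × 128 = 4,608 elements, so FC1 alone accounts for 2,359,808 of the trainable parameters.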

Fig. 4 Generation of normal caecum images via random data augmentation

The proposed approach was tested using traditional metrics such as accuracy (Table 4), precision, recall, F1 score (Table 5), and a confusion matrix (Table 6).

Table 4 Accuracy and loss values of the proposed approach
Table 5 Classification results of the proposed approach
Table 6 Confusion matrix of the proposed approach 

Accuracy measures the ratio of correct predictions to the total number of instances evaluated and is calculated by the following formula [41]:

$${\text{Accuracy}} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
(3)

Precision measures the ability of a model to correctly predict values for a particular category and is calculated as follows:

$${\text{Precision}} = \frac{\text{Predictions of a particular category that are correct}}{\text{All predictions of that category}}$$
(4)

Recall measures the fraction of positive patterns that are correctly classified and is calculated by the following formula:

$${\text{Recall}} = \frac{\text{Correctly predicted instances of a category}}{\text{All actual instances of that category}}$$
(5)

The F1 score is the harmonic mean of the precision and recall. The confusion matrix is a matrix that maps the predicted outputs against the actual outputs [42]. Additionally, the pyplot module of matplotlib [43] was used to plot the loss and accuracy of the model over the training and validation data during training to ensure that the model did not suffer from overfitting, as shown in Fig. 5.
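All of these metrics can be derived from a confusion matrix, as the following sketch shows for a made-up three-class example (the values are illustrative and are not the paper's results). The convention that rows are actual classes and columns are predicted classes is an assumption, and the F1 score is computed as the harmonic mean 2PR/(P + R).

```python
def accuracy(confusion):
    """Fraction of all predictions that lie on the diagonal of the matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_metrics(confusion, k):
    """Precision, recall and F1 for class k.

    Assumed layout: confusion[i][j] counts images of actual class i
    that were predicted as class j.
    """
    tp = confusion[k][k]
    predicted_k = sum(row[k] for row in confusion)  # column sum: all predictions of k
    actual_k = sum(confusion[k])                    # row sum: all actual instances of k
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


toy = [[8, 2, 0],   # made-up counts for three classes
       [1, 9, 0],
       [1, 0, 9]]
print(round(accuracy(toy), 3))  # 0.867
precision, recall, f1 = per_class_metrics(toy, 1)
```

For class 1 of this example, 9 of the 11 predictions are correct and 9 of the 10 actual instances are found, so the precision is 9/11 and the recall is 9/10.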

Fig. 5 Accuracy and loss values of the proposed approach on the training and validation data

To evaluate the proposed approach, transfer learning and fine-tuning techniques [44] with data augmentation were applied to the VGG16, VGG19 and Inception-v3 architectures using the same dataset and batch size used in the proposed approach. Transfer learning was conducted first by replacing the fully connected layers with two new fully connected layers, where FC1 contains 512 neurons and FC2 contains 8 neurons, and by adding one dropout layer after the flatten layer, with a dropout rate of 0.3 in the VGG architectures and a dropout rate of 0.5 in the Inception-v3 architecture, to prevent overfitting. During the transfer learning, the RMSProp optimizer with a learning rate of 1e−4 was used, and the models were trained for 15 epochs. Then, the top convolutional block of the VGG16 and VGG19 architectures and the top two blocks of the Inception-v3 architecture were fine-tuned with a small learning rate of 1e−5 and trained for 35 epochs.

The elements used for comparison were the number of convolutional layers, the total number of parameters of the convolutional layers, the number of epochs, the validation accuracy and the test accuracy. Table 7 compares the models’ results when identifying GI diseases. The first and second comparative elements are the number of convolutional layers and the number of parameters of the convolutional layers. The proposed approach includes the fewest convolutional layers and parameters compared to the other architectures, which reduces its computational complexity in the training phase. The third comparative element is the number of epochs in the training phase. The proposed approach has the maximum number of epochs; however, the large number of epochs was expected because, unlike the other architectures, the proposed network was not pre-trained; thus, more epochs are required to achieve a stable accuracy. As shown by the validation accuracy comparison, the proposed approach obtained an accuracy of 88%, which is better than the accuracy of the compared models. Regarding test accuracy, the proposed approach achieved an accuracy of 87%, similar to VGG16 and VGG19, while Inception-v3 achieved an accuracy of only 80%.

Table 7 Comparative results for GI disease identification

6 Conclusion

Automatic classification of GI diseases from imaging is increasingly important; it can assist endoscopists in determining the appropriate treatment for patients who suffer from GI diseases and reduce the costs of disease therapies. In this paper, a CNN with BN and ELU is proposed for this purpose. The comparison results show that the proposed approach, although it was trained on few images and has low computational complexity in the training phase, outperforms the VGG16, VGG19, and Inception-v3 architectures regarding validation accuracy and outperforms the Inception-v3 model in test accuracy.