1 Introduction

The gastrointestinal (GI) tract can be affected by several malignancies. According to the GLOBOCAN 2020 estimates of cancer incidence and mortality produced by the International Agency for Research on Cancer, esophageal, stomach, and colorectal cancer are among the most common cancers worldwide [35]. In 2020, colorectal cancer accounted for an estimated 1.9 million new cases and 935,000 deaths, ranking third in incidence and second in mortality [35]. Stomach cancer, with 1.1 million new cases and 769,000 deaths, ranked fifth in incidence and fourth in mortality globally [35]. Esophageal cancer, with 604,000 new cases and 544,000 deaths, ranked seventh in incidence and sixth in mortality [35].

With the advent of minimally invasive surgery (MIS) techniques such as endoscopy for examining the upper GI tract and colonoscopy for screening the lower GI tract, physicians can detect and remove lesions, ulcers, polyps, and other abnormal findings [28]. Failing to diagnose these abnormalities during GI screening can allow diseases such as GI malignancies to develop over the following years [12]. To reduce the rate of misdiagnosis, previous studies have focused on the automatic identification, classification, and localization of abnormalities through automated analysis of medical images [8, 11, 14, 20, 30, 32].

Previous studies have designed and proposed methods for the automatic detection of polyps [32], tumors [36], cancer [19], erosions and ulcers [2], and bleeding [4, 16, 27].

These methods are based either on conventional machine learning models [19, 23, 36] or on deep neural networks (DNNs) [4, 32], the newer branch of machine learning [5]. Previous studies have shown that DNNs can automatically extract, analyze, and learn valuable features from raw data [5, 11, 20, 30], whereas the principal prerequisite for training conventional machine learning models is the extraction of handcrafted features [5]. Some studies combine handcrafted features with DNN features to enhance performance [10, 16].

Different DNN architectures have been used for the automatic identification of GI tract abnormalities in previous studies, including convolutional neural networks (CNNs) [7, 37], auto-encoders (AEs) [13], region-based convolutional neural networks (R-CNNs) [39], and generative adversarial networks (GANs) [31].

In [7], the researchers proposed a CNN to distinguish bleeding capsule endoscopy video frames from non-bleeding ones. They used a pre-trained AlexNet, retraining only its last dense layers, together with a SegNet for semantic segmentation of the bleeding zones [7].

Another study proposed a method consisting of two sequential convolutional encoder-decoders to extract features from images and detect polyps automatically [13]. The novelty of their approach was placing a hetero-associator (hetero-encoder) in front of the model, which generates labeled images with a specific similarity to the actual image [13].

Another study proposed a feature learning method named stacked sparse auto-encoder with image manifold constraint (SSAEIM) to build a discriminative representation of polyps and recognize them in wireless capsule endoscopy (WCE) images [38]. They assumed that images of the same class should have similar features, while images of different classes should be sufficiently different [38].

Our previous study proposed a semi-supervised deep model for anatomical landmark detection from endoscopic video frames. The semi-supervised convolutional neural network (SSCNN) was useful when labeled video frames were difficult to obtain. That method was evaluated on the KVASIR V1 dataset [28].

Previous research has also used the representational power of convolutional auto-encoder (CAE) networks for feature extraction. In [21], a CNN with three convolutional layers, combined with average pooling layers, was used to extract smooth features from Optical Emission Spectroscopy (OES) data. In another study, the researchers used a CAE to learn audio features in a pipeline for converting source lectures into target ones; the method produces high-level features that form an authentic representation of the audio file [6].

Feature extraction is a necessary step in many machine learning problems. Automatic feature extraction methods convert and represent sophisticated data in a lower-dimensional space without losing essential information, and this approach has inspired and enabled novel technologies in computer vision.

As demonstrated by previous studies, researchers have increasingly adopted DNNs in recent years because of their capabilities in various areas. Combining DNNs so that their individual advantages complement each other can be helpful for finding lesions in endoscopic frames.

Previous studies have shown that some classes are difficult to discriminate from each other, and some automatic models have not achieved desirable performance in distinguishing them [2]. Some researchers have addressed this issue by working on a new, extended version of the dataset [3, 25] or by focusing on specific lesions [15, 29].

The primary purpose of the proposed method is to develop a novel approach, based on a combination of DNNs, for classifying GI tract lesions from endoscopic video frames. The second purpose is to provide a method that extracts high-level features from endoscopic video frames and depicts them as a 2-D data map.

The main novelties of the proposed method can be summarized as follows:

  • Proposing a new approach combining DNNs to classify the GI tract lesions from endoscopic video frames.

  • Utilizing the benefits of CAE and 2-D visualization together.

  • Extracting high-level features with CAE and converting them into 2-D data maps.

  • Training CNN with novel 2-D visualization data maps.

2 Materials and methods

In the proposed method, three different scenarios are designed, proposed, and compared for classifying GI tract lesions from endoscopic video frames, as shown in Fig. 1.

Fig. 1
figure 1

Three designed and proposed scenarios for classifying GI tract lesions from endoscopic video frames. a The first scenario for classifying GI tract lesions from endoscopic video frames (CAE-2DV-CNN). b The second scenario for classifying GI tract lesions from endoscopic video frames (CNN-Sc). c The third scenario for classifying GI tract lesions from endoscopic video frames (Incept_TL)

As illustrated in Fig. 1, the first proposed scenario is composed of encoding N patches of each image using a CAE, visualizing the resulting features as a 2-D data map that is treated as an image, and feeding this 2-D data map into a CNN as its input. Since this scenario combines convolutional auto-encoders, 2-D visualization, and CNNs, we named it the CAE-2DV-CNN scenario. The second scenario consists of training an end-to-end CNN to classify the KVASIR V1 dataset, so we named it CNN-Sc. The last scenario uses Inception-V3 with transfer learning, so we call it Incept_TL.

More details of the main steps of each scenario are described in the following subsections.

2.1 Dataset description

Several studies have used the KVASIR V1 dataset to design and evaluate their proposed methods [2, 28, 29], and we use this dataset as well. It consists of 4,000 images covering anatomical landmarks, pathological findings, and polyp removal [28]. The images are divided into eight classes, each containing 500 images with resolutions ranging from 720 × 576 to 1920 × 1072 pixels [28].

K-fold cross validation (C.V.) is a sampling technique in which the dataset is partitioned into k equal subsets [34]. For each of the k iterations, one subset is used as the test dataset and the remaining subsets form the training dataset. In all scenarios of the proposed method, five-fold cross validation is used to sample the endoscopic video frames into training and test datasets. In each fold, 80% (3,200 images) are used for training and 20% (800 images) for testing.
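As a hedged illustration of this sampling step (not the authors' exact code), the five-fold split can be produced with scikit-learn; the array names and the fixed random seed below are assumptions.

```python
# Minimal sketch of the five-fold split described above; `images` and `labels`
# are assumed to hold the 4,000 KVASIR V1 frames and their class labels.
import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # seed is an assumption
for fold, (train_idx, test_idx) in enumerate(kfold.split(images), start=1):
    x_train, y_train = images[train_idx], labels[train_idx]  # 3,200 images
    x_test, y_test = images[test_idx], labels[test_idx]      # 800 images
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```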

2.2 The first proposed scenario (CAE-2DV-CNN) for classifying GI tract lesions from endoscopic video frames

The first scenario, as shown in Fig. 1, includes image preprocessing and patch extraction, encoding the image with a CAE, 2-D visualization of the feature map as a data map, training a CNN classifier on the data maps, and evaluation and validation.

2.2.1 Image preprocessing and patch extraction

First, images are resized to 64 × 64 pixels. Then N patches (N = 1,089) of size 32 × 32 pixels are extracted from each image.
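Because 1,089 = 33 × 33, the patch count is consistent with sliding a 32 × 32 window over the 64 × 64 image with a stride of 1; the sketch below assumes this stride, which is not stated explicitly in the text.

```python
# Hedged sketch of the preprocessing step: resize to 64x64, then slide a 32x32
# window with stride 1, giving 33 x 33 = 1,089 patches per image.
import numpy as np
import cv2

def extract_patches(image, patch_size=32, stride=1):
    resized = cv2.resize(image, (64, 64))
    patches = [
        resized[y:y + patch_size, x:x + patch_size]
        for y in range(0, resized.shape[0] - patch_size + 1, stride)
        for x in range(0, resized.shape[1] - patch_size + 1, stride)
    ]
    return np.stack(patches)  # shape (1089, 32, 32, 3) for an RGB frame
```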

2.2.2 Encoding image by convolutional auto-encoder (CAE)

An auto-encoder (AE) is a type of artificial neural network (ANN) whose aim is to regenerate its inputs in an unsupervised learning fashion [22]. A typical AE consists of two main blocks: an encoder block that compresses the inputs into a low-dimensional representation, and a decoder block that is trained to reconstruct the inputs from the features extracted by the encoder [22]. The encoder block is a strong feature extractor that can be designed with suitable output layers and then fine-tuned to obtain the desired features [21]. The error of the regenerated input images is minimized by optimizing Eq. (1).

$$J\left(\theta\right)=\sum L\left(x,\;z\right),\qquad\theta=\left\{w,\;w^{\prime},\;b,\;b^{\prime}\right\}$$
(1)

In Eq. (1), x is an input, z is its reconstruction, and L is the reconstruction loss summed over the training samples. In the proposed method, the encoder block takes image patches as inputs, so its first layers should be convolutional; this type of AE is called a convolutional auto-encoder (CAE). CAEs have been used to exploit the power of CNNs in feature extraction [21]. The patches extracted from each image are fed into the CAE as inputs. The best CAE architecture among those examined in the proposed method for classifying gastrointestinal (GI) tract lesions from endoscopic video frames is shown in Table 1.

The CAE is trained with the Adam optimizer [18], a learning rate of 0.001, and a mean squared error (MSE) loss function. Finally, for each image, the features produced by the encoder block of the designed CAE are saved. The hyper-parameter values for each proposed scenario are listed in Table 4.

Table 1 CAE architecture for classifying GI tract lesions from endoscopic video frames
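Since Table 1 is not reproduced here, the following Keras sketch only illustrates the general shape of such a CAE: 32 × 32 × 3 patches in, a 4 × 4 × 8 encoding (the size used in Section 2.2.3), a mirrored decoder, and training with Adam (learning rate 0.001) and an MSE loss; the filter counts are assumptions, not the exact architecture of Table 1.

```python
# Hedged Keras sketch of a CAE with a 4x4x8 bottleneck; filter counts are illustrative.
from tensorflow.keras import layers, models, optimizers

inputs = layers.Input(shape=(32, 32, 3))
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2)(x)                                  # 16x16
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D(2)(x)                                  # 8x8
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
encoded = layers.MaxPooling2D(2, name="encoder_output")(x)     # 4x4x8 bottleneck

x = layers.Conv2D(8, 3, activation="relu", padding="same")(encoded)
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)
decoded = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(x)

cae = models.Model(inputs, decoded)
encoder = models.Model(inputs, encoded)   # used later to produce patch features
cae.compile(optimizer=optimizers.Adam(learning_rate=0.001), loss="mse")
# cae.fit(patches, patches, epochs=..., batch_size=...)
```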

2.2.3 Visualizing data into two dimensions and generating a 2-D data map

In this step, to construct informative inputs for the CNN, the features generated by the encoder block are reshaped to size (N, 4 × 4 × 8). Next, the similarity relationships between the distinct patches extracted from each image are visualized as a scatter plot, producing a 2-dimensional data map with a resolution of 378 × 248 pixels. Samples of the 2-D data maps obtained from the patches of individual images are illustrated in Fig. 2.
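The text does not specify how the 128-dimensional patch encodings are projected onto the two axes of the scatter plot, so the sketch below uses PCA purely as a placeholder projection; the figure size is chosen to reproduce the stated 378 × 248 pixel resolution.

```python
# Hedged sketch of the 2-D data-map step: project the (1089, 128) patch
# encodings to 2-D and save the scatter plot as a ~378x248 pixel image.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def make_data_map(patch_features, out_path):
    flat = patch_features.reshape(len(patch_features), -1)   # (1089, 128)
    points = PCA(n_components=2).fit_transform(flat)          # projection is an assumption
    fig = plt.figure(figsize=(3.78, 2.48), dpi=100)           # ~378x248 pixels
    plt.scatter(points[:, 0], points[:, 1], s=2)
    plt.axis("off")
    fig.savefig(out_path, dpi=100)
    plt.close(fig)
```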

Fig. 2
figure 2

The 2-D data map for the patches of each image

2.2.4 Designing and training end-to-end CNN classifier

The images of the 2-D data maps are fed to a CNN in an end-to-end fashion to be classified into the different classes of GI tract lesions. A CNN is a type of deep neural network that can learn hierarchical features from low to high levels. The grid search method is used for tuning the hyper-parameters, which include the learning rate, the activation function, the dropout rate, the batch size, the optimizer, the number of convolutional blocks, and the number of neurons in the convolutional layers. The CNN architecture with the best performance on the validation dataset among those compared and examined is shown in Table 2.
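As a hedged illustration of this tuning step (the candidate values are not listed in the text and are placeholders here), the grid can be swept with scikit-learn's ParameterGrid over a hypothetical `build_cnn` model builder.

```python
# Hedged grid-search sketch; hyper-parameter grids and `build_cnn` are placeholders.
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "dropout_rate": [0.2, 0.5],
    "batch_size": [256, 512],
    "activation": ["relu", "leaky_relu"],
    "num_conv_blocks": [2, 3, 4],
})

best_score, best_params = -1.0, None
for params in grid:
    model = build_cnn(**params)                     # hypothetical model builder
    model.fit(x_train, y_train, epochs=20,
              batch_size=params["batch_size"],
              validation_data=(x_val, y_val), verbose=0)
    score = model.evaluate(x_val, y_val, verbose=0)[1]   # validation accuracy
    if score > best_score:
        best_score, best_params = score, params
```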

Table 2 CNN architecture of the first scenario for classifying GI tract lesions from endoscopic video frames

According to Table 2, the CNN architecture consists of several layer types, namely convolutional, pooling, dropout, and fully connected (dense) layers. Any change in the structure of the CNN produces a new architecture. The convolutional layers consist of filters of different sizes that slide over the input images; through these layers, the image features are learned and stored in feature maps [26]. If \(W\) denotes the filter height, a square filter has size \(W\times W\). The pre-nonlinearity input of each layer (\(x_{ij}^{l}\)) is computed by the filter as in Eq. (2).

$$y_{i}^{l}=f\left(\sum_{j}x_{j}^{l-1}\otimes w_{ij}^{l}+b_{i}^{l}\right)$$
(2)

In Eq. (2), x is the input, y is the output, w is the convolution filter, and b is the bias. Sliding the convolution filters over the input images extracts features and reduces the number of parameters. The pooling layer also reduces the number of parameters; it passes the salient features on to the following layers and includes variants such as max-pooling, average-pooling, and sum-pooling. The max-pooling used in this study is computed by Eq. (3).

$$P_{jm}=\max_{k=1,\dots,r}\left(x_{j,(m-1)n+k}\right)$$
(3)

In Eq. (3), x is the input matrix. In our method, to mitigate the vanishing and exploding gradient problems, the ReLU activation function is applied after the convolution operations, introducing non-linearity into the DNN. ReLU is calculated as \(max(0, x)\).

Because ReLU units can "die" during training, the Leaky ReLU, given by Eq. (4), is used to overcome this problem.

$$\mathrm{LeakyReLU}\left(x\right)=\left\{\begin{array}{cc}ax&if\;x\leq0\\x&if\;x>0\end{array}\right.$$
(4)

In the last dense layer, the SOFTMAX activation function, calculated as in Eq. (5), is used to classify the inputs.

$$SOFTMAX\left({y}_{i}\right)= \frac{e^{{y}_{i}}}{\sum _{j}e^{{y}_{j}}}$$
(5)

The CNN is trained for 100 epochs with the Adam optimizer [18], a learning rate of 0.001, and a batch size of 512. The activation function for all layers except the last is ReLU [24]; the last layer uses the SOFTMAX activation function.
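A hedged sketch of this training configuration is given below; `build_cnn` stands in for the Table 2 architecture, and the categorical cross-entropy loss and the (248, 378, 3) input shape of the data-map images are assumptions.

```python
# Hedged sketch of the stated training setup: Adam, lr 0.001, 100 epochs, batch 512.
from tensorflow.keras import optimizers

model = build_cnn(input_shape=(248, 378, 3), num_classes=8)   # hypothetical builder
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",                # loss is an assumption
              metrics=["accuracy"])
model.fit(x_train_maps, y_train, epochs=100, batch_size=512,
          validation_data=(x_test_maps, y_test))
```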

As shown in Fig. 1a, the main steps of the proposed CAE-2DV-CNN scenario are described in Algorithm 1.

Algorithm 1
figure a

The steps for training CAE-2DV-CNN

2.2.5 Evaluation and validation

The different scenarios should be assessed with precise metrics to weigh their strength and generalizability. These measures include accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic (ROC) curve (AUC) [9].

Accuracy shows the strength of the model in classifying the data and is given by Eq. (6) [9]:

$$Accuracy= \frac{TP+TN}{N}$$
(6)

TP denotes true positives, TN denotes true negatives, and N is the total number of data records.

Precision denotes what fraction of the samples predicted as positive are actually positive [9]. This measure is calculated as in Eq. (7).

$$Precision= \frac{TP}{TP+FP}$$
(7)

Recall, also known as the true positive rate, shows the fraction of positive samples that are predicted correctly, as given in Eq. (8).

$$Recall = \frac{TP}{TP+FN}$$
(8)

FP denotes false positives and FN denotes false negatives.

The F1-measure is the harmonic mean of precision and recall, as is shown in Eq. (9) [9].

$$F1-measure = 2\times \frac{precision\times recall}{precision+recall}$$
(9)

In the following, Eqs. (10)-(15) give the measures used to estimate the performance of multi-class classification [33]:

$$micro-averaged\;precision=\frac{\sum_{c=1}^{NOC}{TP}_c}{\sum_{c=1}^{NOC}{TP}_c+\sum_{c=1}^{NOC}{FP}_c}$$
(10)
$$micro-averaged\;recall=\frac{\sum_{c=1}^{NOC}{TP}_c}{\sum_{c=1}^{NOC}{TP}_c+\sum_{c=1}^{NOC}{FN}_c}$$
(11)
$$micro-averaged\;F1-score=2\times\frac{micro-averaged\;precision\times micro-averaged\;recall}{micro-averaged\;precision+micro-averaged\;recall}$$
(12)
$$macro-averaged\;precision=\frac1{NOC}\times\sum\nolimits_{c=1}^{NOC}\frac{{TP}_c}{{TP}_c+{FP}_c}$$
(13)
$$macro-averaged\;recall=\frac1{NOC}\times\sum\nolimits_{c=1}^{NOC}\frac{{TP}_c}{{TP}_c+{FN}_c}$$
(14)
$$macro-averaged\;F1-score=\frac1{NOC}\times\sum\nolimits_{c=1}^{NOC}2\times\frac{\frac{{TP}_c}{{TP}_c+{FP}_c}\times\frac{{TP}_c}{{TP}_c+{FN}_c}}{\frac{{TP}_c}{{TP}_c+{FP}_c}+\frac{{TP}_c}{{TP}_c+{FN}_c}}$$
(15)

In the above equations, NOC is the number of different classes.
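As a hedged convenience, the measures of Eqs. (6) and (10)-(15) can be computed directly with scikit-learn; `y_true` and `y_pred` are assumed to hold the integer class labels and predictions of one test fold.

```python
# Computing the per-fold metrics of Eqs. (6) and (10)-(15) with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_true, y_pred)                      # Eq. (6)
micro_p  = precision_score(y_true, y_pred, average="micro")    # Eq. (10)
micro_r  = recall_score(y_true, y_pred, average="micro")       # Eq. (11)
micro_f1 = f1_score(y_true, y_pred, average="micro")           # Eq. (12)
macro_p  = precision_score(y_true, y_pred, average="macro")    # Eq. (13)
macro_r  = recall_score(y_true, y_pred, average="macro")       # Eq. (14)
macro_f1 = f1_score(y_true, y_pred, average="macro")           # Eq. (15)
```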

2.3 The second scenario (CNN-Sc) for classifying GI tract lesions from endoscopic video frames

In the second scenario, an end-to-end CNN model is designed, and the KVASIR V1 dataset is fed into it as input. The hyper-parameters are tuned with the grid search method. The architecture with the best performance among the examined models is listed in Table 3.

Table 3 The second scenario (CNN-Sc) architecture for classifying GI tract lesions

The second scenario is trained for 100 epochs with the Adam optimizer [18], a learning rate of 0.001, and a batch size of 512.

2.4 The third scenario (Incept_TL) for classifying GI tract lesions from endoscopic video frames

For the last scenario, we assess different pre-trained CNN models such as MobileNet [29], Inception-V3 [15], VGG16, and VGG19 [34] and compare their results. Figure 3 shows the performance measures of the different models evaluated in this scenario.

As shown in Fig. 3, Inception-V3 demonstrates the best performance among the compared and examined pre-trained CNNs for transfer learning in the proposed method.

2.4.1 Transfer learning

We use pre-trained CNNs previously trained on the ImageNet dataset. The convolutional layers of the pre-trained CNNs are frozen to prevent their connection weights from changing. To avoid overfitting, dropout layers are added. The last layer of the CNN is replaced with a dense layer with eight neurons and the SOFTMAX activation function. Finally, the root-mean-square propagation (RMSprop) optimizer with a learning rate of 0.01 is used to tune the weights of the last dense layer. The model is trained for 100 epochs with a batch size of 512. The images are resized to 75 × 75 pixels and fed into the model as inputs.
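A hedged Keras sketch of the Incept_TL setup described above follows; the dropout rate and the global average pooling layer are assumptions, while the frozen base, the eight-neuron SOFTMAX layer, RMSprop with a learning rate of 0.01, the 75 × 75 inputs, the 100 epochs, and the batch size of 512 follow the text.

```python
# Hedged sketch of the transfer-learning scenario with a frozen Inception-V3 base.
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False, input_shape=(75, 75, 3))
base.trainable = False                         # freeze the pre-trained conv layers

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dropout(0.5)(x)                     # dropout rate is an assumption
outputs = layers.Dense(8, activation="softmax")(x)

model = models.Model(base.input, outputs)
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.01),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=100, batch_size=512,
          validation_data=(x_test, y_test))
```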

Table 4 lists the best hyper-parameter values of the models in the different scenarios.

Table 4 The hyper-parameters of different scenarios

3 Results and discussion

In this section, different scenarios are compared based on the classification performance measures to find the scenario which has the best efficiency.

In Table 5, the performance measures for each proposed scenario for classifying GI tract lesions are reported based on five-fold cross validation strategy.

Table 5 The performance metrics of the scenarios for classifying GI tract lesions from endoscopic video frames

As illustrated in Table 5, accuracies of 99.87 ± 0.001, 92.07 ± 0.086, and 90.55 ± 0.111 are achieved for CAE-2DV-CNN, CNN-Sc, and Incept_TL, respectively. The first scenario (CAE-2DV-CNN) therefore performs best among the scenarios, which demonstrates that our approach of training the CNN on 2-D data maps instead of the original images is effective.

Several researchers have trained and evaluated their proposed methods on the KVASIR V1 dataset. Their techniques and results are listed and compared with the proposed method in Table 6.

Table 6 Comparing the performance of the scenarios in the proposed method with the previous researches which have worked on KVASIR V1 dataset

From the performance values in Table 6, it can be seen that our first scenario leads to superior performance compared to former studies that used the KVASIR V1 dataset.

Figure 4 illustrates the confusion matrix of each scenario for each data fold of the five-fold C.V. As can be seen from Fig. 4, the first scenario classifies the 2-D data maps correctly. In the last scenario, the misclassification problem reported in previous studies is reduced.

Fig. 3
figure 3

Comparing the performance measures of different pre-trained CNNs examined in the third scenario

Figure 5 displays the ROC curve of each class per fold for the different scenarios. As shown in Fig. 5, the AUC values of the various scenarios are highly desirable.

Fig. 4
figure 4

The confusion matrix for different scenarios per fold. a The first scenario for classifying GI tract lesions from endoscopic video frames (CAE-2DV-CNN). b The second scenario for classifying GI tract lesions from endoscopic video frames (CNN-Sc). c The third scenario for classifying GI tract lesions from endoscopic video frames (Incept_TL)

Figure 6 shows the accuracy and the loss function per epoch in each fold for the various scenarios. As illustrated by Fig. 6, the accuracy and loss curves of the first scenario are smooth for every fold except the first, whereas those of the second and third scenarios fluctuate.

Fig. 5
figure 5

The ROC curve of each class per fold for different scenarios. a The first scenario for classifying GI tract lesions from endoscopic video frames (CAE-2DV-CNN). b The second scenario for classifying GI tract lesions from endoscopic video frames (CNN-Sc). c The third scenario for classifying GI tract lesions from endoscopic video frames (Incept_TL)

Fig. 6
figure 6

Training and validation curves of each fold (accuracy and loss function per epoch) for different scenarios. a The first scenario for classifying GI tract lesions from endoscopic video frames (CAE-2DV-CNN). b The second scenario for classifying GI tract lesions from endoscopic video frames (CNN-Sc). c The third scenario for classifying GI tract lesions from endoscopic video frames (Incept_TL)

Table 7 reports the processing time of each step of every scenario in the proposed method, measured on Google Colab. Each user is assigned a maximum of 12.76 GB of RAM and 68.40 GB of disk space. The GPU models available in Google Colab are NVIDIA K80, P100, P4, T4, and V100. All deep learning models were implemented with Python libraries such as Scikit-learn, TensorFlow, and Keras.

Table 7 The processing time for each step of each scenario

The principal goal of the proposed method is to suggest novel scenarios that use the advantages of DNNs for classifying GI tract lesions from endoscopic video frames.

The first scenario of our proposed method can solve some issues reported by previous studies, such as misclassification between classes and overfitting of the models.

In addition, in the first scenario, visualizing the features as a 2-D data map is an innovative approach that exploits the power of the CAE in extracting high-level features and leads to the outstanding performance of the model.

4 Conclusions

Misdiagnosis of GI tract abnormalities is exceedingly common during the screening of the GI tract. The accurate diagnosis of the mentioned abnormalities is highly related to the expertise level of the physician and the quality of the images captured by the endoscope and shown on the monitor. Several previous studies have proposed methods with the power of automatic recognition of GI tract abnormalities.

However, previous methods have suffered from limitations and drawbacks, such as low performance in diagnosing certain types of abnormalities.

With the advent of computer vision and artificial intelligence, automatic detection has become a common area of research. DNNs are a subset of machine learning methods, based on artificial intelligence, with the ability to learn; extracting high-level features from images enables these models to achieve superior performance.

Therefore, we aimed to use DNN models to visualize the high-level features of each image as a 2-D data map and feed these maps into a CNN.

Accordingly, the proposed method introduces a novel approach to classify GI tract lesions from endoscopic video frames with superior accuracy. The dataset analyzed in the proposed method is KVASIR V1.

In the proposed method, three different scenarios are designed and their results are compared. The average accuracies of the first, second, and third scenarios are 99.87 ± 0.001, 92.07 ± 0.086, and 90.55 ± 0.111, respectively. The experimental results demonstrate the superiority of the first scenario over the other scenarios and over previous related works focused on the KVASIR V1 dataset. The main novelty of the first scenario is visualizing 2-D data maps from the features extracted by the CAE and feeding them into the CNN as inputs.

As a future research direction, we propose using and analyzing extended datasets that cover more abnormalities, such as HyperKvasir, which has been gathered in recent years [10].

Another research opportunity is to use other data analysis techniques to extract significantly correlated features from the original images, visualize them as 2-D data maps for input to the CNN, and evaluate the results.