1 Introduction

Wireless capsule endoscopy (WCE) is a non-invasive method for finding irregularities in the human gastrointestinal (GI) tract [17]. The device is a capsule measuring 26 mm in length and 11 mm in diameter, and its components include an imaging sensor, an optical dome, a battery, an illuminator, and an RF transmitter [15]. During the examination, the patient ingests the WCE, which passes slowly through the small intestine and captures images of the entire GI tract as it moves. These images are transmitted wirelessly to a data-recording device so that doctors can subsequently study them for diagnosis [5].

Disorders of the gastrointestinal system, such as bleeding, Crohn’s disease, cancer, and ulcerated polyps, have become more common in recent years, with ulcers and bleeding being especially widespread [2]. According to a WHO survey, lung cancer (1.1 million fatalities), stomach cancer (765,000 fatalities), rectal cancer (525,000 fatalities), liver cancer (505,000 fatalities), and breast cancer (385,000 fatalities) are the leading causes of cancer death [26]. A study conducted in the United States reports a large number of gastrointestinal tract infection disorders since 2017, with roughly 200,000 new cases reported each year since 2011. If found and diagnosed early enough, such gastrointestinal tract infections can be treated. WCE is the main technology specialist physicians use to detect GI tract infections [7, 23, 27].

Because of the sheer volume of image data, reaching an accurate diagnosis is time-consuming and difficult. Additionally, some small bleeding areas can be very challenging to see with the naked eye [1]. These factors drive the many research projects aimed at speeding up reading by automatically identifying anomalies or other regions of interest (ROIs) in the images [3, 22].

Given Imaging, one of the WCE manufacturers, offers a technology called the suspected bleeding indicator (SBI) for automatically identifying frames with potentially bleeding regions [16, 20, 25]. However, research has revealed that SBI is insufficient to screen for all GI tract illnesses. This opened the way for researchers to create methods for more accurate automatic detection of various kinds of anomalies in WCE images [4, 19].

Even though these studies have greatly improved defective-area segmentation, accuracy is still insufficient because of the limitations of conventional techniques or network capacity [8, 12,13,14]. They lack an architecture tailored to WCE images and have not fully exploited the potential of semantic segmentation techniques on WCE images [18, 21]. This research offers a deep-learning-based system for automatic detection and classification that extends the state of the art to address these issues. Our main contributions include an effective deep-learning architecture for precise detection, along with a comprehensive evaluation against the state of the art showing consistent improvements across a range of performance criteria.

The key contributions of this paper are as follows:

  • For defect-area segmentation in WCE images, we propose a new EM-based method that employs a multi-stage design with attention blocks at every step.

  • We use a multi-stage architecture in which every stage can be treated as a separate network. As a result, every layer in every stage is trained with DeepLab V3+ to extract features that are meaningful for our target.

  • LeNet-5 is employed to classify the GI tract in WCE images and to identify the class present in each image.

  • Various ablation experiments were conducted on the Kvasir-V2 dataset. According to the experimental results, our proposed network outperforms all other approaches in efficiency.

The remainder of the paper is laid out as follows: Section 2 reviews the state of the art in automatic anomaly detection in WCE images. Section 3 presents our method. Section 4 presents and discusses the findings, Section 5 reports the statistical analysis, and Section 6 concludes the paper.

2 Literature review

Here is a review of some previous studies on the automatic detection of anomalies in WCE images.

The Semi-supervised Outlier Detection Model (SODM), a combination of a semi-supervised deep-structured CNN-LSTM and an outlier assessment model (OAM), was introduced in [6] to handle outlier detection of small-intestinal disease in WCE images. Irregular graphical patterns were discovered by evaluating the spatial-scale trends of successive image sections. The outlier assessment model reduces the number of false alarms when declaring outliers. The suggested scheme outperforms prior work.

The authors of [19] introduced a method called BIR that classifies bleeding WCE images by combining MobileNet with a custom-built CNN. Because of its low computational cost, BIR uses MobileNet for the initial-phase computations; the results are then forwarded to the CNN for further processing. For categorization, the ensemble is more accurate than either MobileNet or the CNN alone. In terms of effectiveness, the BIR model surpasses current strategies.

In [3], the CNNs GoogLeNet and AlexNet are used to classify images as ulcer or non-ulcer. The two networks were pre-trained on a section of the ImageNet database to find the optimal parameter combination that allows them to detect ulcers with good accuracy. Compared with other techniques, this approach had the highest detection rate.

The authors of [8] suggested a deep CNN to detect hookworms in WCE images. Pooling layers are introduced to blend the tube regions obtained from the extraction network with the feature maps produced by the hookworm-classification network, improving the feature maps relating to tubular regions. The suggested methodology outperformed other existing methods.

Based on the fusion and selection of deep-learning features, the authors of [12] described a fully automated system for detecting stomach illnesses. An array-based technique collects features from two consecutive fully connected layers of a re-trained model and merges them. The best features were then chosen using the particle swarm optimization (PSO) technique, which ranks candidates with a mean-value-based fitness function. Compared with other feature-selection techniques, the suggested method produced the best results.

In [9], the differential box-counting approach was utilized to extract the fractal dimension (FD) of WCE images, and an RF-based hybrid classifier was employed to identify anomalous frames. The analysis focuses on block-level local characteristics. To address class imbalance, the authors adopted the SMOTE method. The suggested method outperformed several existing state-of-the-art approaches.

By creating masks around polyps detected in still frames, [24] developed a modified R-CNN. Feature extraction is performed with the ResNet-50 and ResNet-101 models. Polyps in images were recognized and segmented accurately, and compared with other methods the technique segments polyps efficiently. Table 1 lists the existing related works.

Table 1 Comparison of existing related works

3 Proposed methodology

In our strategy, the work proceeds in four phases: preprocessing, segmentation, feature extraction, and classification. The raw images are first passed through the preprocessing stage, where resizing, grayscale conversion, and image enhancement are performed; the images are enhanced using the power-law transformation. Once the WCE images are preprocessed, they are fed into the segmentation phase to segment the defective area. The next phase is feature extraction, where the DeepLab V3+ descriptor extracts feature representations from every image. The obtained features are sent to the deep-learning classifier LeNet-5 during the training phase. Finally, test images are used to evaluate the suggested method’s efficiency. Figure 1 represents the architecture diagram of the proposed methodology.

Fig. 1 Architecture diagram of the proposed methodology

3.1 Preprocessing

In medical science, image processing is critical for enhancing images and extracting more information. Image quality is an important aspect of achieving an optimal solution: an image that is distorted, noisy, or of poor quality must be upgraded to achieve a better result. In the proposed model, the input image size is 720 × 576, the two values indicating the height and width of the image. The images are first reduced from 720 × 576 to 256 × 256 pixels, and their quality is then enhanced using the power-law transformation.

3.1.1 Image enhancement based on power law transformation

Image enhancement can be done in the Fourier or spatial domain, with contrast enhancement being one of the most significant considerations. Spatial-domain methods can be further classed as grey-level transformations, histogram processing, and so on. As previously stated, histogram equalization has the disadvantage of reducing contrast in some cases.

Histogram matching or specification can be tailored to the images, but it needs a significant amount of input data. Power-law transformation functions, on the other hand, require far less input. The coefficient in the transfer function must be chosen in the first case, while the gradients and ranges of the line segments that make up the scaling function must be chosen in the second.

Typically, the basic equation is described as follows:

$$ s={cr}^{\gamma } $$
(1)

Here s and r are the grey intensities of the output and input images, respectively, and c is a constant. Because RGB colour information is generally noisier than HSV colour information, the input image is first transformed into the HSV colour space in this paper.

For the multiple-exposure images, the power-law transformation is applied to the V channel of the input image. The ideal gamma value is assumed to be one; however, if gamma varies beyond a certain point, the image degrades entirely. The value of gamma should therefore be kept within a specific range, chosen so that the entropy of the multiple-exposure images does not fall below a certain level.

After the different exposure images are created, linear stretching is applied to them to make the most of the full dynamic range:

$$ {V}_{i\_ LS}\left(m,n\right)=\frac{V_i\left(m,n\right)-\min \left({V}_i\left(m,n\right)\right)}{\max \left({V}_i\left(m,n\right)\right)-\min \left({V}_i\left(m,n\right)\right)} $$
(2)
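As a concrete illustration of this preprocessing pipeline, the sketch below resizes an image, converts it to HSV, applies the power-law transformation of Eq. (1) to the V channel, and linearly stretches the result per Eq. (2). The gamma value and the OpenCV-based implementation are illustrative assumptions, not the paper's exact settings.

```python
# Minimal preprocessing sketch, assuming OpenCV/NumPy.
# The gamma value is an illustrative assumption.
import cv2
import numpy as np

def power_law_enhance(bgr_image, gamma=0.8, c=1.0):
    """Resize, apply s = c * r^gamma to the V channel (Eq. 1),
    then linearly stretch the result to the full range (Eq. 2)."""
    img = cv2.resize(bgr_image, (256, 256))
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float64)
    v = hsv[:, :, 2] / 255.0                  # normalize V channel to [0, 1]
    v = c * np.power(v, gamma)                # power-law (gamma) transformation
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)  # linear stretching, Eq. (2)
    hsv[:, :, 2] = v * 255.0
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```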

3.2 Data augmentation

Because a large number of annotated images is rarely available, image augmentation plays a key role in medical image analysis. Data augmentation improves data quality and reliability. Zoom, rotation, flip, shear, and shifting are some of the transformations applied. By increasing the number of data points, augmented datasets also narrow the gap between the training and testing distributions, which helps prevent overfitting on the training dataset.
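A minimal sketch of such an augmentation pipeline, assuming Keras' ImageDataGenerator; the parameter values are illustrative assumptions rather than the paper's settings.

```python
# Augmentation sketch covering the transformations named above.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,        # random rotation
    zoom_range=0.15,          # random zoom
    shear_range=0.1,          # random shear
    width_shift_range=0.1,    # horizontal shifting
    height_shift_range=0.1,   # vertical shifting
    horizontal_flip=True,     # random flip
)
# flow() yields augmented batches indefinitely during training, e.g.:
# model.fit(augmenter.flow(x_train, y_train, batch_size=32), epochs=50)
```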

3.3 Segmentation based on EM algorithm

The expectation-maximization (EM) algorithm computes maximum-likelihood estimates of unknown parameters in probabilistic models. It is an iterative approach to the maximization problem: starting from random values, the maximum-likelihood approach is employed to discover the "best fit" for a data set. For incomplete or missing data points, the EM method calculates the best fit iteratively in an unsupervised way. For a complete data set θ = ξ, the best fit is determined through maximum-likelihood estimation. The EM technique is efficient enough to estimate model parameters even when part of the data set is absent: it initializes the missing data points at random and uses them to predict the next batch of data.

The new values gained are then used to improve the prediction of the initial set, and the method eventually converges to a fixed point for the best fit. Given an estimate of the latent variable z, the value of θ can be improved. The procedure begins with a forecast for θ, then calculates z, then updates θ using this new value of z, and so on until convergence is achieved.

The EM algorithm can be applied to an estimation problem with a training set {a^(1), a^(2), …, a^(m)} of m independent samples. These training samples must be used to fit a model p(m, z), with the corresponding likelihood defined as

$$ j\left(\varepsilon \right)=\sum \limits_{s=1}^k\log p\left(m,\varepsilon \right) $$
(3)

Maximum-likelihood estimation of ε is challenging, but it becomes tractable if the latent variable z(1) can be observed. The EM algorithm can then be broken into two steps: the E step, which constructs a lower bound on j, and the M step, which optimizes that lower bound.

The EM technique has been successfully utilized to find a likelihood function for missing values in a data collection. The sample density is calculated as follows:

$$ p\left(m|\varepsilon \right)=\prod \limits_{i=1}^Np\left({m}_i|\varepsilon \right)=L\left(\varepsilon |m\right) $$
(4)

The likelihood function is defined as L(ε ∣ m), where m represents the sample data of size N. The procedure’s goal is to find the estimate of ε that maximizes L. Assume z = (m, h) for a complete data set; a joint density function then relates the missing and observable values.

$$ p\left(z|\varepsilon \right)=p\left(m,h|\varepsilon \right)=p\left(h|m,\varepsilon \right)p\left(m|\varepsilon \right) $$
(5)

We can establish a new likelihood function using this joint density function.

$$ L\left(\xi |z\right)=L\left(\xi |m,h\right)=p\left(m,h|\xi \right) $$
(6)

Initially, the technique computes the expected value of the complete-data log-likelihood log p(m, h ∣ ξ), where h is the unobserved data, m the observed data, and ξ^(i−1) the current parameter estimate. The expectation under the current parameters is represented as:

$$ Q\left(\xi, {\xi}^{\left(i-1\right)}\right)=E\left\{\log p\left(m,h|\xi \right)\,\middle|\,m,{\xi}^{\left(i-1\right)}\right\} $$
(7)

where ξ^(i−1) denotes the current parameter estimate used to compute the expectation, and ξ denotes the new parameters optimized to raise the Q value. The M step then maximizes this expectation:

$$ {\xi}^{(i)}=\underset{\xi }{\arg\ \max }\;Q\left(\xi, {\xi}^{\left(i-1\right)}\right) $$
(8)

In this research, the image was segmented iteratively using the EM algorithm until a satisfactory outcome was achieved. Pixels from normal and afflicted tissue are sorted into separate categories based on their likelihood of having comparable intensity. The cluster centre is recomputed at each iteration until it converges to a fixed point. The following equations briefly describe the EM process.

Denote the mean, covariance, and mixing coefficient by m_l, σ_l, and c_l, respectively.

$$ \log I\left(\xi \right)=\sum \limits_{k=1}^N\log \left[\sum \limits_{l=1}^M{c}_lP\left({x}_k|{m}_l,{\sigma}_l\right)\right] $$
(9)

This defines the corresponding log-likelihood function I(ξ). The posterior probability (E step) is then estimated as:

$$ {T}_{kl}=\frac{c_lP\left({x}_k|{m}_l,{\sigma}_l\right)}{\sum_{t=1}^M{c}_tP\left({x}_k|{m}_t,{\sigma}_t\right)},\kern1em 1\le \mathrm{l}\le \mathrm{M},\kern0.5em 1\le \mathrm{k}\le \mathrm{N} $$
(10)

Perform maximization as

$$ {m}_l^{new}=\frac{1}{N_l}\sum \limits_{k=1}^N{T}_{kl}{x}_k,1\le \mathrm{l}\le \mathrm{M} $$
(11)

The improved covariance is calculated as follows:

$$ {\sigma}_l^{new}=\frac{1}{N_l}\sum \limits_{k=1}^N{T}_{kl}\left({x}_k-{m}_l^{new}\right){\left({x}_k-{m}_l^{new}\right)}^t,\kern0.5em 1\le \mathrm{l}\le \mathrm{M} $$
(12)

Here P denotes the probability density function and x the feature vector. The mixture model has the following structure:

$$ P\left(x|\xi \right)=\sum \limits_{l=1}^M{c}_lP\left(x|{m}_l,{\sigma}_l\right) $$
(13)
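As a minimal sketch of EM-based pixel clustering, the snippet below fits a Gaussian mixture model (Eqs. (9)–(13)) using scikit-learn's GaussianMixture, which is itself trained by EM; the two-component assumption (normal vs. defective tissue) and parameter values are illustrative, not the paper's exact configuration.

```python
# EM-based segmentation sketch: cluster pixel intensities with a GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def em_segment(gray_image, n_components=2, max_iter=100):
    """Assign each pixel to a mixture component (e.g. normal vs. defective)."""
    x = gray_image.reshape(-1, 1).astype(np.float64)
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full",
                          max_iter=max_iter, random_state=0)
    labels = gmm.fit_predict(x)   # E and M steps iterate until convergence
    return labels.reshape(gray_image.shape)
```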

3.4 Feature extraction

The extraction of features is an important phase in the model construction process. A system can be built using a variety of features, including color, shape, and geometry, as well as SURF (speeded-up robust features).

To extract spatial-domain information and integrate it with spectral-domain features, we use DeepLab v3+ as the neural network structure. DeepLab v3+ is the fourth generation of Google’s DeepLab semantic segmentation networks and offers the best overall performance to date. In the DeepLab v3+ topology used here, the ASPP module is altered from its original design of independent parallel branches to a densely connected one. The dense connections allow denser pixel sampling, improving the network’s capacity to extract features.

More pixels contribute to the computation with the densely connected ASPP module. Atrous convolution samples pixels more sparsely than regular convolution: extended to two dimensions, a single-layer 3 × 3 atrous convolution involves only 9 pixels, whereas cascaded atrous convolutions involve 49. Consequently, the lowest-rate ASPP output is densely connected to the higher-rate ASPP layers, enhancing the network’s feature-extraction potential.

The densely linked ASPP module provides a bigger receptive field as well as denser pixel sampling; atrous convolution enlarges the receptive field without reducing the feature map’s resolution. For an atrous convolution with rate r and kernel size k, the receptive field size R is:

$$ R=\left(r-1\right)\times \left(k-1\right)+k $$
(14)

By stacking two atrous convolutions, a bigger receptive field can be generated; the size of the stacked receptive field is given by:

$$ R={R}_1+{R}_2-1 $$
(15)

where R1 and R2 are the receptive-field sizes of the two atrous convolutions. The receptive field of the dense connection method is obtained by stacking atrous convolutions in this way. The basic ASPP module runs its branches in parallel with no information shared between them, whereas the upgraded module shares information layer by layer; the distinct ASPP branches become dependent on one another, increasing the receptive field’s range. Using \( {R}_k^r \) to denote the receptive field of a kernel-size-k convolution with rate r, the maximum receptive field Rmax of the parallel atrous method is:

$$ {R}_{\mathrm{max}}=\mathit{\max}\left({R}_3^6,{R}_3^{12},{R}_3^{18}\right)={R}_3^{18} $$
(16)

The maximal receptive field of the modified ASPP module, shown in Eq. (17), is

$$ {R}_{\mathrm{max}}={R}_3^6+{R}_3^{12}+{R}_3^{18}-2 $$
(17)
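To make the receptive-field arithmetic concrete, the short sketch below evaluates Eqs. (14)–(17) for 3 × 3 atrous convolutions with rates 6, 12, and 18.

```python
# Receptive-field arithmetic of Eqs. (14)-(17).
def receptive_field(rate, kernel=3):
    return (rate - 1) * (kernel - 1) + kernel          # Eq. (14)

fields = [receptive_field(r) for r in (6, 12, 18)]     # [13, 25, 37]
parallel_max = max(fields)                             # Eq. (16): 37
# Eq. (15) applied repeatedly: stacking sums the fields, minus 1 per join.
dense_max = sum(fields) - (len(fields) - 1)            # Eq. (17): 73
print(parallel_max, dense_max)
```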

While the densely linked ASPP module enlarges the receptive field, it also increases the number of parameters, reducing the network’s efficiency. To tackle this, a 1 × 1 convolution is applied before every ASPP convolution in the dense connection, reducing the number of channels in the feature map and improving the network’s expressive ability. With s denoting the channel growth rate, the number of input channels to layer l is:

$$ {c}_l={c}_0+s\times \left(l-1\right) $$
(18)

When the improved Xception network is used as the backbone, the ASPP component’s input is the network’s 2048-dimensional output, and each ASPP module has 256 channels. The parameter count S of the enhanced ASPP module can then be computed as follows:

$$ S=\sum \limits_{l=1}^L\left({c}_l\times {1}^2\times \frac{c_0}{2}+\frac{c_0}{2}\times {k}^2\times n\right) $$
(19)

where L denotes the number of ASPP convolutions. DeepLabV3+ employs a pixel-wise cross-entropy loss over the class scores:

$$ {P}_k(x)=\frac{e^{a_k(x)}}{\sum \limits_{k^{\prime }=1}^K{e}^{a_{k^{\prime }}(x)}} $$
(20)
$$ E=\sum \limits_xw(x)\log \left({p}_{l(x)}(x)\right) $$
(21)

The outcome is a pixel-level softmax classification: x is the position of a two-dimensional pixel point, and \( {a}_k(x) \) denotes the activation in channel k at pixel x, so \( {P}_k(x) \) gives the confidence that pixel x belongs to class k. Equation (21) shows that DeepLab’s total loss is based on cross-entropy, where \( {p}_{l(x)} \) is the predicted probability of the true label l(x).

To avoid overfitting and improve model robustness, a regularization term is frequently added to the loss function. Here an L2 term is used to penalize large weights. L2 regularization, often known as ridge regularization, is defined as follows:

$$ {L}_2=\frac{1}{2}\eta \sum \limits_{i=1}^n{\theta}_i^2 $$
(22)

where η is the regularization coefficient and θ_i the weights. During backpropagation, the regularization term is minimized together with the error function.

3.5 Classification

The purpose of classification is to predict the label. In the proposed approach we employ a multi-class classification technique to categorize the input image based on the selected features. LeNet-5 is a compact deep classification network that categorizes extracted features and saves processing time through convolution, parameter sharing, and pooling. The standard LeNet-5 CNN has an input layer, two convolutional layers, two pooling layers, fully connected layers, and a softmax layer; the input to the standard architecture is a 32 × 32 image.

A 32 × 32 grayscale image serves as input for LeNet-5 and is processed by the first convolutional layer (Conv1), comprising six feature maps (filters) with a stride of 1. A 2 × 2 sub-sampling layer with stride 2 follows: the Pool1 layer applies a 2 × 2 pooling operation to the Conv1 output to produce six 14 × 14 image features. The Conv2 layer then applies 16 convolutions of size 5 × 5 to construct 16 feature maps of 10 × 10 pixels, and Pool2 performs a 2 × 2 pooling operation on the Conv2 output to generate 16 image features of 5 × 5 pixels. The FC1 layer is a 120-neuron layer fully connected to Pool2 that generates 120 one-pixel image features. The FC2 layer is an 84-neuron fully connected layer that computes the dot product of the input vector with the weight vectors, adds the bias, and outputs the result through the sigmoid function. The FC3 (softmax) layer has ten neurons and, in the original design, separates the inputs into ten classes corresponding to the digits 0–9. Architecture details of LeNet-5 are given in Table 2.

Table 2 Architecture details of LeNet-5

A dynamic adaptive pooling method is adopted, the ReLU activation function is used to mitigate the vanishing-gradient problem, and the final classifier is trained with softmax. These three optimizations improve both the training time and the detection rate.

The input data is normalized before being transferred to the next layer, so that the layer is unaffected by shifts in the data distribution. The normalization layer is a trainable, adjustable interface; whitening would be the optimal preprocessing, but full whitening requires a very large number of computations. To make the computation easier, an approximate whitening transform is used:

$$ {\hat{m}}^{(s)}=\frac{m^{(s)}-E\left[{m}^{(s)}\right]}{\sqrt{\operatorname{var}\left[{m}^{(s)}\right]}} $$
(23)

In formula (23), E[m^(s)] is the mean of each batch of input data and the denominator is its standard deviation. The mean and variance of every batch are recorded during training so that the statistics of the full dataset can be estimated after training is completed:

$$ E\left[m\right]\leftarrow {E}_B\left[{\mu}_B\right] $$
(24)
$$ Var\left[m\right]\leftarrow \frac{n}{n-1}{E}_B\left[{\delta}_B^2\right] $$
(25)
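A minimal sketch of this normalization step, assuming NumPy: per-batch statistics are recorded during training, and the full-dataset mean and unbiased variance are recovered afterwards per Eqs. (23)–(25).

```python
# Batch-wise normalization with running statistics, per Eqs. (23)-(25).
import numpy as np

def normalize_batch(batch, running, eps=1e-5):
    mu, var = batch.mean(axis=0), batch.var(axis=0)
    running["means"].append(mu)                 # record mu_B      (Eq. 24)
    running["vars"].append(var)                 # record delta_B^2 (Eq. 25)
    return (batch - mu) / np.sqrt(var + eps)    # Eq. (23)

running = {"means": [], "vars": []}
x_hat = normalize_batch(np.random.randn(32, 120), running)
n = 32                                               # batch size
E_m = np.mean(running["means"], axis=0)              # Eq. (24)
Var_m = (n / (n - 1)) * np.mean(running["vars"], axis=0)  # Eq. (25), unbiased
```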

The resulting high-level feature map reduces the dimension and resolution of the original feature map while also avoiding overfitting and related concerns. Mean pooling and max pooling are two common pooling operations.

Accordingly, a mathematical model based on the max-pooling technique is built, approximating the function by interpolation. If the pooling factor is denoted μ, then the upgraded pooling model is represented as

$$ {S}_{ij}=\mu \underset{i=1,j=1}{\mathit{\max}}\left({F}_{ij}\right)+{b}_2 $$
(26)

The pooling factor μ refines the max-pooling procedure so that the improved characteristics represent the features correctly. The bias b_2 is identical to the corresponding parameter of the standard max-pooling model.

$$ \mu =\rho \frac{a\left({v}_{\mathrm{max}}-a\right)}{{v_{\mathrm{max}}}^2}+\theta $$
(27)

The tanh function replaces the m < 0 part of the ReLU, creating a new activation function, described in formula (28):

$$ f(m)=\begin{cases}m, & m\ge 0\\ \alpha \tanh \left(m\right), & \mathrm{otherwise}\end{cases} $$
(28)
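A minimal sketch of the modified activation of Eq. (28); the value of α below is an illustrative assumption.

```python
# Modified ReLU of Eq. (28): identity for m >= 0, alpha * tanh(m) otherwise.
import numpy as np

def relu_tanh(m, alpha=0.1):   # alpha is an assumed, illustrative value
    return np.where(m >= 0, m, alpha * np.tanh(m))
```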

Formally, a set of N images {X_i, y_i}, i = 1, …, N, is taken, where X_i is the original image data and y_i the class category of the image (i.e., 0 or 1). The difference between the predicted label \( {\hat{y}}_i \) and the real label y_i is calculated using the categorical cross-entropy function, defined as follows:

$$ J\left(\omega, b\right)\overset{\varDelta }{=}-\frac{1}{N}\sum \limits_{l=1}^N\left({y}_{l1}\log {\hat{y}}_{l1}+\cdots +{y}_{lK}\log {\hat{y}}_{lK}\right) $$
(29)

The weights and biases of the conventional LeNet-5 are represented by ω and b, respectively. K is the number of class categories, and \( {\hat{y}}_{lk} \) is the softmax value of the kth class category, calculated as:

$$ {\hat{y}}_{lk}=\mathrm{softmax}\left({z}_k\right)=\frac{e^{z_k}}{\sum \limits_{i=1}^K{e}^{z_i}} $$
(30)

where z_i denotes the output of the last fully connected layer for the ith class category. The convolutional layers downsample at a stride of 2 per layer. The loss with respect to all parameters of the network, both the convolutional parameters and the fully connected layers, was trained using backpropagation. The batch size, momentum, learning rate, and weight decay are 1, 0.9, 0.03, and 0.001, respectively, with an initial learning rate of 0.01. Because the network can be either under- or over-fitted, the number of epochs is also a crucial training parameter; for this dataset we trained the network for 50 epochs. The proposed model’s training and testing accuracy ranges from 0.98 to 0.99, and the loss ranges from 0.001 to 0.004. Figure 2 represents the flowchart of the proposed method.

Fig. 2 Flowchart of the proposed methodology

4 Results and discussion

This section explores the analytical outcomes of the proposed approach, which is assessed using criteria such as precision, accuracy, recall, IoU, and F1 score. Dataset description, preprocessing, segmentation, feature extraction, classification, and result analysis are the six stages of this research. In the selected data, images with resolutions varying from 720 × 576 to 1920 × 1072 pixels were resized to a common size. The Kvasir-V2 database was utilized to classify GI tract disorders in this study. The entire dataset is split into two parts: a training set with 80% of the data and a validation set with 20%.

For our classification challenge, we use an implementation of the LeNet-5 convolutional neural network, sketched below. There are three convolutional layers, each with 5 × 5 filters and followed by average pooling over 2 × 2 patches. For activation we employ the ReLU function, which allows faster learning. A dropout layer with a factor of 0.2 deals with overfitting: 20% of the inputs are zeroed at random to avoid strong dependencies across layers. Flattening and two dense layers complete the network; the final dense layer uses the softmax activation function, with one neuron per class, to generate probabilities between 0.0 and 1.0. The network has 70,415 parameters in total. The Adam optimizer, a variant of stochastic gradient descent, is used to improve training speed. The categorical cross-entropy loss function is appropriate for this problem since it yields a multiclass probability distribution. To assess the suggested method’s effectiveness, we utilized the accuracy measures described below.
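A minimal Keras sketch of this classifier under stated assumptions: the input size, filter counts, and dense-layer width are illustrative (the paper reports 70,415 parameters in total), and Kvasir-V2's eight classes are assumed for the output layer.

```python
# LeNet-5-style classifier sketch: three 5x5 conv layers, each followed
# by 2x2 average pooling, ReLU activations, 20% dropout, softmax output.
from tensorflow.keras import layers, models

def build_model(input_shape=(32, 32, 1), num_classes=8):
    model = models.Sequential([
        layers.Conv2D(6, (5, 5), padding="same", activation="relu",
                      input_shape=input_shape),
        layers.AveragePooling2D((2, 2)),
        layers.Conv2D(16, (5, 5), padding="same", activation="relu"),
        layers.AveragePooling2D((2, 2)),
        layers.Conv2D(32, (5, 5), padding="same", activation="relu"),
        layers.AveragePooling2D((2, 2)),
        layers.Dropout(0.2),        # randomly zero 20% of activations
        layers.Flatten(),
        layers.Dense(120, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```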

4.1 Experimental setting

The experiments run on Windows 10 with 4 GB of RAM and an Intel i5 2.60 GHz processor, in an Anaconda3 environment using Python and Keras with TensorFlow as the backend.

4.2 Dataset description

The dataset used in these investigations consists of GI images collected at Vestre Viken (VV) Health Trust. The training data comes from one of the trust’s main gastrointestinal departments. Medical specialists analyzed the dataset extensively and entitled it Kvasir-V2. The dataset was made accessible in the fall of 2017 as a component of the MediaEval Medical Multimedia Challenge, a benchmarking initiative that assigns tasks to research teams. It is divided into eight groups of 1000 images each, covering anatomical landmarks, pathological findings, and polyp removal. The resolution of the images ranges from 720 × 576 to 1920 × 1072. Sample images from the dataset are shown in Fig. 3.

Fig. 3 Sample images from the dataset

The Kvasir-V2 dataset is utilized for validation in this paper to estimate the effectiveness of the proposed approach. The data samples are split into two sections: the first, the training dataset, is used to build the classifier; the second, the testing dataset, is used to evaluate it.

4.3 Performance metrics

Several performance criteria can be used to evaluate gastrointestinal tract detection and categorization. The detection rate, the proportion of infected pixels correctly detected among all pixels, has a significant impact on disease-classifier performance; it is also known as the likelihood of detection. Researchers commonly analyze GI tract classification outcomes using metrics such as recall, accuracy, and precision, built from True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts.

Accuracy

The accuracy score reflects the model’s overall correctness: it is the number of correct predictions divided by the total number of predictions. It can be summarized as follows:

$$ \boldsymbol{Accuracy}=\frac{TP+ TN}{TP+ FP+ TN+ FN} $$
(31)

Here,

  • TP counts images predicted as Normal whose true label is also Normal.

  • TN counts images predicted as affected whose true label is also affected.

  • FP counts images predicted as Normal whose true label is affected.

  • FN counts images predicted as affected whose true label is Normal.

Precision

Precision is defined as the ratio of correctly predicted positive instances to all predicted positive instances:

$$ \boldsymbol{precision}=\frac{TP}{TP+ FP} $$
(32)

Recall

Recall is also known as the True Positive Rate (TPR) or Sensitivity. The recall score reflects the classifier’s ability to find all positive samples: it is TP divided by the sum of TP and FN. It can be described as follows:

$$ \boldsymbol{Recall}=\frac{TP}{FN+ TP} $$
(33)

Intersection over union (IoU)

Intersection over Union compares the predicted and ground-truth pixel sets of the same category against the correctly labelled pixels. Outputs for dyed-lifted-polyps, dyed-resection-margins, esophagitis, polyps, and ulcerative-colitis are shown in Figs. 4, 5, 6, 7 and 8; a sketch computing all four metrics follows Eq. (34).

$$ IOU=\frac{TP}{FP+ TP+ FN} $$
(34)
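As referenced above, the sketch below computes the four metrics of Eqs. (31)–(34) directly from the raw TP, TN, FP, and FN counts.

```python
# Evaluation metrics of Eqs. (31)-(34) from raw confusion counts.
def metrics(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + fp + tn + fn)   # Eq. (31)
    precision = tp / (tp + fp)                    # Eq. (32)
    recall    = tp / (tp + fn)                    # Eq. (33)
    iou       = tp / (tp + fp + fn)               # Eq. (34)
    return accuracy, precision, recall, iou
```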
Fig. 4 Output of dyed-lifted-polyps: (a) input image, (b) contrast-enhanced grayscale image, (c) segmented defective-area image, (d) classified image

Fig. 5 Output of dyed-resection-margins: (a) input image, (b) contrast-enhanced grayscale image, (c) segmented defective-area image, (d) classified image

Fig. 6 Output of esophagitis: (a) input image, (b) contrast-enhanced grayscale image, (c) segmented defective-area image, (d) classified image

Fig. 7 Output of polyps: (a) input image, (b) contrast-enhanced grayscale image, (c) segmented defective-area image, (d) classified image

Fig. 8 Output of ulcerative-colitis: (a) input image, (b) contrast-enhanced grayscale image, (c) segmented defective-area image, (d) classified image

4.4 Segmentation evaluation

In this section, we present the outcome of segmentation using the expectation-maximization algorithm. Six assessment metrics are estimated from the segmented image: IoU_A, accuracy, mIoU, recall, IoU_B, and precision. All evaluation parameters are based on four factors: TP, TN, FP, and FN. We compare against existing segmentation approaches, namely superpixel-based segmentation, the Gaussian mixture model, and VGG-19, using precision, accuracy, and recall [11]. The results of segmentation with the EM algorithm are displayed in Table 3.

Table 3 Comparison of various segmentation methods with the proposed approach

Our proposed method outperforms the existing approaches with 99.59% accuracy, followed by the Gaussian mixture model, superpixel-based segmentation, and VGG-19 with 99.29%, 98.76%, and 97.7%, respectively. On the other metrics, precision and recall, our approach likewise yields the best results; superpixel-based segmentation performs second best, and VGG-19 performs worst among the four approaches [10]. Figure 9 compares the proposed and existing segmentation approaches, and Table 4 summarizes the comparison.

Fig. 9 Differentiation of proposed vs existing segmentation approaches

Table 4 Comparison of segmentation approaches on various metrics

The methods can also be compared on mIoU, IoU_A, and IoU_B, where IoU stands for intersection over union, IoU_B for background IoU, IoU_A for IoU over all classes, and mIoU for mean IoU. Our technique outperforms the more advanced alternatives. The positive results of all metrics on both training and testing images demonstrate that the suggested framework improves image segmentation.

Segmentation is based on the EM algorithm, and the proposed approach showed superior performance over all metrics compared with the existing approaches. Moreover, the proposed model distinguishes well between defective regions and background. The IoU-based comparison of segmentation methods is represented in Fig. 10.

Fig. 10 IOU-based segmentation methods comparison

4.5 Evaluation of training results

The train-accuracy and validation-accuracy curves converge in the end; after 50 epochs we obtained an accuracy of 99.9%, which is quite good. Figure 11 shows the training and testing accuracy. The validation-loss curve jumps up and down a bit, implying that more validation data might be beneficial. Validation loss exceeds training loss after around 25 epochs, indicating some overfitting; however, because the curve does not rise over epochs and the gap between validation and training loss is small, this is acceptable. Figure 12 shows the training and testing loss.

Fig. 11 Proposed training and testing accuracy

Fig. 12 Proposed approach training and testing loss

4.6 Classification evaluation

Existing approaches, namely AlexNet, ResNet, and LSTM-CNN, are compared with our proposed method.

Table 5 compares the proposed classification approach with existing techniques in terms of F1-score, recall, accuracy, and precision on the dataset. It illustrates that the proposed model performs much better than the competing methods.

Table 5 Comparison of classification performance

The graphical comparison of classification performance is shown in Fig. 13. The proposed approach yields the highest results on all metrics: 99.12% accuracy, 98.79% recall, 99.05% precision, and 98.49% F1-score.

Fig. 13 Comparison of classification performance

Table 6 compares the effectiveness of various models with the suggested method and shows that the proposed strategy achieves the best results. Figure 14 shows the proposed method’s effectiveness compared with CNN, MobileNet, and BIR in terms of F1 score, recall, accuracy, and precision.

Table 6 Performance comparison of different models

Fig. 14 Performance comparison of different models with the proposed scheme

5 Statistical analysis

To examine the statistical significance of observed performance results, we used a T-test. The significance level for the test was set to α = 0.05.

Among the evaluated models, the proposed approach produces the best results with good efficiency and accuracy. The statistical tests also show that the improvement of our proposed method is statistically significant. The p value for each pair of models is indicated in the grid cell of the relevant row and column. Comparing the proposed strategy with the other approaches, the enhancement is statistically significant at the 0.01 level: the t-value is −2.51921 and the p value is 0.008661, significant at p < 0.05. Figure 15 represents the statistical results of the paired t-test; dark grey indicates p < 0.01, light grey 0.01–0.05, and blank cells p > 0.05.

Fig. 15 Statistical test results of paired T-Test

The confusion matrix for the proposed end-to-end trained method is shown in Fig. 16. The primary purpose of the confusion matrix, also known as the error matrix, is to highlight the differences and inconsistencies between the actual class and the predicted class. In this article there are five possible classes: dyed-lifted-polyps, dyed-resection-margins, esophagitis, polyps, and ulcerative colitis.

Fig. 16 Confusion matrix

6 Conclusion

WCE image analysis is a crucial field in computer vision because of its substantial role in clinical diagnosis, but it is also a difficult task: WCE images frequently contain numerous small areas that are hard to detect. In this study, a novel LeNet-5-based GI tract classification and detection system is proposed. Its effectiveness is demonstrated through in-depth experimental studies on a standard dataset and comparison with existing state-of-the-art approaches. The proposed system achieves better segmentation and demonstrates 99.12% accuracy, 98.79% recall, 99.05% precision, and a 98.49% F1-score, superior to the existing models.

Future work will cover several key topics. The performance of the proposed expert system will first be assessed on additional datasets. It will also be important to increase the number of comparative deep learning methods.