Introduction

Among the various cancer types, colorectal cancer occurs frequently in both men and women. In 2015 alone, approximately 132,000 new cases of colorectal cancer were registered in the USA [1]. Since 2017, on average 135,430 new gastrointestinal tract (GIT) infections have occurred annually in the USA, most commonly ulcers, bleeding, and polyps, the latter being the most common neoplasm. According to the statistics, 1.6 million people suffer from painful bowel infections and approximately 200,000 new cases appear each year. These GIT infections can be controlled and even cured if diagnosed at an early stage [2]. Doctors now utilize wireless capsule endoscopy (WCE) technology [3], but the complete process involves several challenges: the presence of irrelevant and redundant information, which makes detection more complex, expensive equipment, and long examination times. These infections are life threatening; therefore, it is essential to identify and diagnose GIT diseases at an early stage [4].

Recent articles in the area of computer vision (CV) have introduced various computerized techniques for the diagnosis of GIT infections using WCE images [5]. In existing methods, the preprocessing step is given much importance due to its significant role in achieving high segmentation accuracy. A few well-known segmentation techniques are uniform segmentation [6], normal-distribution-based segmentation [7], optimized weighted segmentation [8], improved binomial thresholding [9], saliency-based techniques [10], and a few more [11]. The segmented images are then used for feature extraction and classification into the relevant class. The well-known feature types are point, shape, texture, and color. Color features perform notably better than other feature types on endoscopy data because of its RGB format, but the actual performance of a system depends on the number of selected features. Various feature selection techniques have been introduced by researchers, such as wavelet fractal and automatic correlation [12], Genetic Algorithm based selection (GAS) [13], the multi-verse optimizer (MVO) [14], and the Fisher criterion [15], to name a few. Recently, deep learning based techniques have outperformed others in the area of CV and have also entered medical imaging. In deep learning, features are extracted in a hierarchy of layers in which higher-level features are built from lower-level convolutional layers. Several pre-trained deep convolutional neural network (DCNN) models have been introduced in the CV community, among which the famous ones are AlexNet [16], ResNet [17], VGG [18], GoogleNet [19], and YOLO [20].

Related work

Many techniques have been introduced in the areas of CV and machine learning for the diagnosis of medical diseases such as breast tumor [21], lung cancer [22, 23], skin cancer [24,25,26,27], blood infections [28], brain tumor from MRI [29,30,31], and stomach abnormalities from WCE images [32, 33], to name but a few [34,35,36]. Among the organs affected by these diseases, the stomach is one of the most important. The prominent stomach infections are ulcer, polyps, and bleeding, among which ulcer and bleeding are the most dangerous types. Sivakumar et al. [37] identified bleeding regions in WCE video frames through superpixel segmentation. The CMYK color format is used to detect the bleeding area clearly. Naive Bayes classification and superpixel segmentation are then combined into an automatic detection process for concealed bleeding. The presented method works efficiently when applied to frames extracted from several endoscopic videos.

Yuan et al. [38] presented a two-phase fully automated system to identify ulcers in WCE images. In the first phase, a multilevel superpixel based saliency map extraction method is presented to segment ulcer candidates. In the second phase, the acquired saliency vector is merged with image features for ulcer image identification. A saliency max-pooling approach is merged with Locality-constrained Linear Coding (LLC) to categorize the images into the relevant category. The acquired results show the effectiveness of the presented approach with 92.65% accuracy and 94.12% sensitivity. Charfi et al. [39] suggested a two-phase technique for ulcer detection in WCE videos. For a better depiction of WCE images, the method incorporates a texture feature extraction phase based on the Complete Local Binary Pattern (CLBP) and a color feature extraction phase based on the Global Local Oriented Edge Magnitude Pattern (GLOEMP). The presented method is evaluated on WCE videos that include both normal and abnormal frames, and the obtained results show its effectiveness with 94.07% accuracy.

Suman et al. [40] proposed an automated color-feature-based method for bleeding identification from WCE frames. The presented system uses statistical color feature analysis, and an SVM classifier is utilized for classification. The test results illustrate the effectiveness of the proposed approach, which gives higher accuracy than existing methods. Sainju et al. [41] introduced a supervised learning technique for computerized identification of bleeding areas in WCE video frames. The approach describes image regions through statistical measures taken from first-order histogram probabilities of the three channels of the RGB color space. A semi-automatic region annotation technique is then presented to build training data efficiently. The extracted features are examined thoroughly to identify the best feature set, and during this process a segmentation technique is applied to extract regions from the images. In the end, a neural network is employed for final detection. The suggested method gives reasonably better recognition results when classifying bleeding and non-bleeding areas in WCE images. Zhang et al. [42] presented an infinite curriculum learning approach for categorizing WCE frames with respect to the occurrence of gastric ulcers. A method is designed to efficiently evaluate the difficulty of each sample through its patch size. The training schedule gradually increases the patch size until it equals the actual image size, and the method achieves promising accuracy.

Fan et al. [43] introduced an automated CNN based method that is able to accurately identify small ulcers and erosions in WCE frames. AlexNet was trained on a dataset of WCE images to classify lesion and normal tissues. The experimental results show notable accuracy compared to recent techniques. Hajabdollahi et al. [44] proposed an automated segmentation technique for bleeding areas in WCE images. In this approach, color channels are selected and classified through an ANN. During the training process, a neural network with specified weights is used without any preprocessing or post-processing. The test results show the efficacy of the presented method with promising accuracy.

Xing et al. [45] suggested automatic bleeding frame identification and region segmentation using a superpixel color histogram feature and a subspace K-nearest neighbor classifier. To minimize the execution cost, superpixel segmentation and key frame extraction methods are presented. The introduced 9-D color feature merges information extracted from the HSV and RGB color channels. The test results show the effectiveness of the proposed system, with 99% accuracy, in contrast to existing methods. Maghsoudi et al. [46] introduced an approach to detect tumor and bleeding regions in frames and categorize normal and affected regions. GLCM, statistical, LBP, and Law's features as well as Gabor filters are applied to extract texture and geometric features, and normal and affected areas are then discriminated through the extracted features. The presented method is compared with a CNN, and the comparison shows the benefit of applying a wide range of features to identify numerous affected regions in WCE images, in contrast to a CNN, which can only be trained on specific databases. Moreover, the CNN's internal features cannot easily be reused across applications (Table 1).

Table 1 Summary of existing ulcer detection and classification techniques

Problem statement & contributions

Finding irregularities in WCE images is a challenging task; even experienced doctors spend on average 2 h analyzing the 55,000 frames of one patient [47]. Additionally, GIT infections vary in shape, size, color, and texture, which increases inter-class feature similarity and, in turn, decreases the chances of improved classification accuracy. An improved classification system relies completely on good features. Therefore, the major challenge in this work is to extract the most discriminative features because, in medical imaging, the performance of a system depends on the correct classification of abnormalities.

In this paper, a novel framework is proposed for the classification of stomach abnormalities in WCE images through a DCNN. The major contributions of this work are:

  1. An ulcer segmentation technique named color features based low-level and high-level saliency (CFbLHS) estimation is introduced. The HSV and LAB color transformations are applied in the first step and SVD features are extracted. Low-level saliency is then computed from these SVD features. Further, a high-level saliency map is constructed and combined with the low-level saliency map to produce an initial map for ulcer detection. Finally, an existing CRF model is employed to refine the segmented image.

  2. The cropped and original images are fed to a modified DenseNet CNN architecture, and activations on the average pooling and fully connected layers are used for deep feature extraction.

  3. The extracted multilayer DCNN features are optimized through a Kapur entropy approach and fused through parallel maximum feature value (PMFV). After fusion, Tsallis entropy features are calculated and sorted in descending order. Finally, the top 50% of features are selected and fed to an MLP for final classification.

  4. For the experimental results, a new WCE dataset of 12,000 images is generated for three stomach abnormalities, namely ulcer, bleeding, and healthy, where each class contains 4000 images.

Proposed work

The proposed automated stomach abnormality detection and recognition system is presented in this section. It includes three primary steps: ulcer detection from ulcer frames, CNN feature extraction and fusion, and best feature selection and recognition. The overall structure is shown in Fig. 1: frames are first extracted from WCE videos and manually separated into ulcer, bleeding, and healthy classes with the help of an endoscopic expert. Ulcers are then detected through a saliency method. Later, CNN features are extracted from the mapped ulcer images and from the original bleeding and healthy frames, followed by PMFV based feature fusion and KEcTF based best feature selection; the selected features are classified through an MLP. A comprehensive explanation of each step is given below.

Fig. 1
figure 1

Proposed flow of automated stomach abnormalities detection and classification

Saliency based ulcer detection

As shown in Fig. 1, color features are extracted in the first step through HSV and LAB color transformations of the ulcer images. The extracted color features are then concatenated for information fusion. After that, low-level and high-level saliency is estimated from the fused image. Finally, a threshold operation is performed to obtain a binary image, which is later refined through an existing CRF model.

Let φ(i) and φ(j) denote the color features extracted through the HSV and LAB color transformations, respectively. Two measures, the variance and the SVD, are calculated for the color features of both transformations; the variance is defined in Eq. (1).

$$ \sigma^{2}=\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{\left(X_{ij}-\mu\right)^{2}}{N} $$
(1)

Where i, j denote the row and column indices of each image, μ is the mean value of the image, and N denotes the total number of pixels of the input image. The SVD is then calculated through Eq. (2).

$$ \check{S}=U_{m\times n}\,d\,V_{n\times n}^{T} $$
(2)

Where U<sub>m × n</sub> is the m × n orthogonal column matrix of eigenvectors of \( \check{S}\,\check{S}^{T} \) and V<sub>n × n</sub> is the n × n matrix of eigenvectors of \( \check{S}^{T}\check{S} \), as defined in Eqs. (3) and (4).

$$ \check{S}\,\check{S}^{T}=U_{m\times n}\,d^{2}\,U_{m\times n}^{T} $$
(3)
$$ \check{S}^{T}\check{S}=V_{n\times n}\,d^{2}\,V_{n\times n}^{T} $$
(4)

Where d is a matrix of positive real values known as the singular value matrix, formulated as in Eq. (5).

$$ d=\lambda_{i}\,\overrightarrow{\alpha_{i}}, $$
(5)

Where i ∈ {1, 2, 3, …, n} and \( \overrightarrow{\alpha_{i}} \) is a column vector. Hence, the final SVD is computed as in Eq. (6).

$$ \check{S}=\sum_{i=1}^{n}\left(\alpha_{i}\,U_{m\times n}\,V_{n\times n}^{T}\right) $$
(6)

The dimension of the resultant SVD feature vector \( \check{S} \) is N × M, where M denotes the number of columns. In this work, a fixed image dimension of 256 × 256 is used; therefore, the resultant vector is 1 × 256 for each channel. Since the HSV and LAB transformations together consist of six channels, the fused vector (256 singular values per channel plus the six per-channel variance values) has dimension 1 × (1536 + 6) = 1 × 1542 and is defined through Eq. (7).

$$ \mathrm{Fv}\left(S_{ij}\right)=\left[\left(S_{i1},S_{i2},\dots,S_{im}\right),\left(S_{1j},S_{2j},\dots,S_{nj}\right)\right] $$
(7)

Where, Fv(Sij) represents the concatenated information image and Sim, Snj ∈ φ(i), φ(j), respectively. The visual output after concatenation of these features is shown in Fig. 2.

Fig. 2
figure 2

Effects after concatenation of HSV and LAB color features
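As a concrete illustration of Eqs. (1)-(7), the following sketch builds the 1 × 1542 fused color vector from the six HSV and LAB channels. OpenCV and NumPy are assumed for the transforms and the SVD, and treating the six extra values as the per-channel variances is an assumption consistent with the two measures defined above.

```python
import cv2
import numpy as np

def fused_color_features(bgr_image):
    """Build the 1 x 1542 fused vector: 256 singular values plus one variance
    per channel, over the six HSV and LAB channels of a 256 x 256 image."""
    img = cv2.resize(bgr_image, (256, 256))
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)

    features = []
    for channel in list(cv2.split(hsv)) + list(cv2.split(lab)):   # six channels in total
        ch = channel.astype(np.float64)
        variance = np.sum((ch - ch.mean()) ** 2) / ch.size         # Eq. (1)
        singular_values = np.linalg.svd(ch, compute_uv=False)      # d of Eq. (2), 256 values
        features.append(np.concatenate([singular_values, [variance]]))

    return np.concatenate(features)[None, :]                       # shape (1, 1542)
```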

Later, the low-level saliency is computed by employing \( \check{S}_{i} \) as in Eq. (8).

$$ \psi_{G/B}^{k}\left(\check{S}_{ij}\right)=\sum_{\check{S}_{ij}\in G/B}W\left(i,j\right)\cdot\chi^{2}\left(i,j\right) $$
(8)

Where W(i, j) is the weighting factor of abnormal regions in the concatenated image, defined through the infection size and the spatial distance of the infected region. G and B denote the abnormal regions and the distance between abnormal and border regions, respectively. W(i, j) is computed by Eq. (9).

$$ W\left(i,j\right)=\left|\check{S}_{ij}\right|\cdot\exp\left[\frac{-\left\Vert \mathrm{Sc}_{i}-\mathrm{Sc}_{j}\right\Vert}{\alpha\cdot L}\right] $$
(9)

Where Sc<sub>i</sub> and Sc<sub>j</sub> represent the spatial center positions of Fv(S<sub>ij</sub>), α is a static parameter with value 0.2, and L is the diagonal length of the concatenated image. The effects of low-level saliency are shown in Fig. 3.

Fig. 3
figure 3

Low-level saliency estimation effects on WCE images. a Concatenated features effects, b Low-level saliency estimation
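The formula-level part of the low-level saliency (Eqs. (8) and (9)) can be sketched as follows, assuming the fused image has already been divided into regions (for example, superpixels) with known centers, sizes, and feature histograms; these region inputs, the use of region size for |S_ij|, and the final normalization are illustrative assumptions.

```python
import numpy as np

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two normalized region histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def low_level_saliency(centers, histograms, sizes, image_shape, alpha=0.2):
    """centers: (R, 2) region centers, histograms: (R, B), sizes: (R,) pixel counts per region."""
    L = np.hypot(*image_shape[:2])                  # diagonal length of the concatenated image
    R = len(centers)
    saliency = np.zeros(R)
    for i in range(R):
        for j in range(R):
            if i == j:
                continue
            # Eq. (9): weight from region size and spatial distance of the region centers
            w = sizes[j] * np.exp(-np.linalg.norm(centers[i] - centers[j]) / (alpha * L))
            # Eq. (8): accumulate the weighted chi-square feature distances
            saliency[i] += w * chi_square(histograms[i], histograms[j])
    return saliency / (saliency.max() + 1e-10)      # normalization (assumed) to [0, 1]
```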

After that, high-level saliency (HLS) is estimated from two key pieces of information: (a) the complex background, and (b) the presence of abnormal regions near the center of the image. HLS is most useful for complex backgrounds and for small, low-contrast abnormal regions. Based on these two points, a location-based object prior (OP) for each region of S<sub>ij</sub> is computed as given in Eq. (10).

$$ \lambda_{\mathrm{OP}}\left(S_{ij}\right)=\left(1-\left(\frac{\Delta B_{ij}}{\Delta B_{\max}}\right)^{\beta}\right)\cdot\exp\left(\frac{-S_{d}}{L/2}\right) $$
(10)

Where S<sub>d</sub> denotes the spatial Euclidean distance, ΔB<sub>ij</sub> is the number of border pixels, ΔB<sub>max</sub> is the maximum number of image pixels on the border regions, and β is a static parameter that scales the location of abnormal regions in the image. Finally, λ<sub>OP</sub>(S<sub>ij</sub>) from Eq. (10) is utilized and the final HLS is estimated as in Eq. (11).

$$ F_{S}\left(\lambda_{\mathrm{OP}}\right)=\lambda_{\mathrm{OP}}\left(S_{ij}\right)\cdot\frac{\sum_{j=1,\,j\ne i}^{n}\lambda_{\mathrm{OP}}\left(S_{ij}\right)\cdot\left(1-\mathrm{N}\chi^{2}\left(i,j\right)/\mathrm{N}\chi_{\max}^{2}\right)}{\sum_{j=1,\,j\ne i}^{m}\left(1-\mathrm{N}\chi^{2}\left(i,j\right)/\mathrm{N}\chi_{\max}^{2}\right)} $$
(11)

Where Nχ²<sub>max</sub> represents the maximum Chi-square distance (χ²(i, j)) between abnormal and healthy regions. Finally, the HLS map F<sub>S</sub>(λ<sub>OP</sub>) and the low-level map are combined and an Otsu thresholding operation is performed; the effects are shown in Fig. 4.

Fig. 4
figure 4

High-level saliency estimation and Otsu thresholding effects on WCE ulcer images. a Concatenated features effects, b HLS estimation effects, and c initial thresholding after combination of low-level and high-level saliency
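A minimal sketch of combining the two saliency maps and applying Otsu's threshold is given below; the element-wise product is an illustrative choice, since the text only states that the maps are combined, and scikit-image is assumed for the threshold.

```python
import numpy as np
from skimage.filters import threshold_otsu

def initial_ulcer_mask(low_level_map, high_level_map):
    """Fuse the low-level and high-level saliency maps and binarize with Otsu's threshold."""
    combined = low_level_map * high_level_map                     # illustrative combination of the maps
    combined = (combined - combined.min()) / (combined.ptp() + 1e-10)
    return combined > threshold_otsu(combined)                    # initial binary ulcer map
```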

The resultant segmented images are then refined through an existing conditional random field (CRF) [48] approach. The CRF removes boundary artifacts and small unwanted pixel groups through the energy function given in Eq. (12).

$$ F_{So}\left(\lambda_{\mathrm{OP}}\right)=\sum_{i}\log P\left(F_{S}\left(\lambda_{i}\right)\right)+\sum_{i,j}\theta_{ij}\left(x_{i},x_{j}\right) $$
(12)

Where x<sub>i</sub>, x<sub>j</sub> denote the label assignments of the abnormal pixels and P(F<sub>S</sub>(λ<sub>i</sub>)) is the probability value of the ith pixel; the pairwise term θ<sub>ij</sub>(x<sub>i</sub>, x<sub>j</sub>) is formulated as in Eq. (13).

$$ \theta_{ij}=\mu\left(x_{i},x_{j}\right)\left[w\cdot\exp\left\Vert P_{i}-P_{j}\right\Vert^{2}\right] $$
(13)

The final ulcer segmentation results after CRF refinement are shown in Fig. 5.

Fig. 5
figure 5

Final saliency effects after CRF refinement. a Original image, b Concatenated features effects, c Low-level saliency, d High-level saliency, e Thresholding, and f CRF refinement
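The refinement step could be sketched with the pydensecrf package, which provides a fully connected CRF; the unary probabilities derived from the initial mask and the pairwise parameters below are illustrative assumptions rather than the exact settings of the model in [48].

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(rgb_image, binary_mask, iterations=5):
    """Refine the thresholded saliency mask with a fully connected CRF (Eq. (12) style energy)."""
    h, w = binary_mask.shape
    # Unary term: foreground/background probabilities derived from the initial mask (assumed values).
    prob_fg = np.where(binary_mask, 0.9, 0.1).astype(np.float32)
    probs = np.stack([1.0 - prob_fg, prob_fg])                    # shape (2, H, W)

    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))
    d.addPairwiseGaussian(sxy=3, compat=3)                        # smoothness (pairwise) term
    d.addPairwiseBilateral(sxy=60, srgb=13, compat=5,
                           rgbim=np.ascontiguousarray(rgb_image.astype(np.uint8)))
    q = np.array(d.inference(iterations)).reshape(2, h, w)
    return q.argmax(axis=0).astype(bool)                          # refined ulcer mask
```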

Deep CNN features

CNN has become an essential machine learning approach over the last few years and has shown great success in object classification. CNNs have also raised new research problems, for example the loss of information and gradients as an input image passes through the many layers of a deep network. Recently, several researchers have tried to resolve this kind of problem and introduced various CNN models, among which a few are AlexNet, VGG, ResNet, Inception, and YOLO, achieving significant accuracy. In this work, an existing CNN model named DenseNet [49] is utilized, which has a simple connectivity pattern and gives the best learning among hidden layers. Each layer is connected to the following layers in a feedforward manner. Compared to existing models, fewer parameters are required and the network avoids learning redundant features. A major advantage is that each layer has direct access to the gradients from the loss function.

The DenseNet architecture was trained on four large publicly available datasets: CIFAR-10, CIFAR-100, SVHN, and ImageNet. A total of 709 layers are involved in this network, including convolutional layers, depth concatenation layers, ReLU layers, batch normalization layers, average pooling layers, fully connected (FC) layers, and a softmax output function. The basic flow of a DenseNet architecture with 3 dense blocks is shown in Fig. 6.

Fig. 6
figure 6

A basic deep DenseNet architecture with 3 dense blocks
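To make the dense connectivity pattern concrete, a minimal PyTorch-style dense block is sketched below; the growth rate, the number of layers, and the composite function are illustrative and do not reproduce the exact architecture used in this work.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Every layer receives the concatenation of all preceding feature maps."""
    def __init__(self, in_channels, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))   # dense connection to all previous outputs
            features.append(out)
        return torch.cat(features, dim=1)
```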

Six components are involved in the implementation of the DenseNet architecture: ResNet-style implementation, dense connectivity, the composite function, pooling layers, the growth rate, and bottleneck layers. In this work, the average pooling (AP) layer and the FC layer are used for deep CNN feature extraction. As shown in Fig. 1, the mapped ulcer segmented images and the original RGB WCE images are utilized for feature extraction through transfer learning [50]. The parameters involved in this process are given in Table 2, which includes input size, pooling size, stride, padding mode, weights, learning factor, and bias learning factor. Softmax is used as the activation function, and feature vectors of dimensions N × 1920 and N × 1000 are extracted from the AP layer and the FC layer, respectively. Later, the best features are selected and fused through the PMFV approach for higher recognition accuracy and minimum computational time.

Table 2 Parameters used for features extraction
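A hedged sketch of the multilayer deep-feature extraction is given below, using a torchvision DenseNet-201 pretrained on ImageNet as a stand-in for the network described above (its average-pooling output is 1920-D and its FC output is 1000-D, matching the dimensions used here); the input preprocessing is an assumption.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as T

model = models.densenet201(pretrained=True).eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def densenet_features(rgb_image):
    """Return the 1 x 1920 average-pooling features and 1 x 1000 FC features for one image."""
    x = preprocess(rgb_image).unsqueeze(0)                  # rgb_image: H x W x 3 uint8 array
    fmap = model.features(x)                                # convolutional feature maps
    ap = F.adaptive_avg_pool2d(F.relu(fmap), 1).flatten(1)  # average-pooling output, (1, 1920)
    fc = model.classifier(ap)                               # fully connected output, (1, 1000)
    return ap, fc
```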

Features selection and fusion

In machine learning, various types of features are utilized for the recognition of objects into their relevant classes [51]. However, not all extracted features are important, and many of them are irrelevant for the classification problem at hand. These unimportant features deteriorate classification performance and also increase the overall computation time. To address these challenges, a new deep CNN feature selection approach named Kapur Entropy controlled Tsallis Entropy along higher Probability (KEcTaHP) is introduced in this work. The proposed KEcTaHP approach is formulated as follows:

There are two extracted CNN feature vectors, the AP layer vector and the FC layer vector, denoted by ξ<sub>i</sub>(F1) and ξ<sub>j</sub>(F2), of dimensions N × 1920 and N × 1000, respectively, where N denotes the total number of samples utilized for feature extraction in the training and testing process. Let ξ<sub>k</sub>(F3) and ξ<sub>k1</sub>(F4) represent the selected feature vectors of dimensions N × K1 and N × K2, respectively, where K1, K2 ∈ R. The Kapur entropy [52] is computed in the first step through Eqs. (14) and (15).

$$ \xi_{E}\left(F1\right)=\sum_{i=1}^{N1}H_{i}\left(F1\right) $$
(14)
$$ \xi_{E}\left(F2\right)=\sum_{j=1}^{N2}H_{j}\left(F2\right) $$
(15)

Where ξ<sub>E</sub>(F1) and ξ<sub>E</sub>(F2) are Kapur's entropy vectors for ξ<sub>i</sub>(F1) and ξ<sub>j</sub>(F2), and N1 and N2 are the numbers of occurrence levels of the features in the F1 and F2 vectors. H<sub>i</sub>(F1) and H<sub>j</sub>(F2) are computed through Eqs. (16) and (17).

$$ H_{i}^{c}=\sum_{i=1}^{N1}\frac{P\left(H_{i}^{c}\left(F1\right)\right)}{W_{N1}^{c}}\ln\left(\frac{H_{i}^{c}\left(F1\right)}{W_{N1}^{c}}\right) $$
(16)
$$ H_{j}^{c}=\sum_{j=1}^{N2}\frac{P\left(H_{j}^{c}\left(F2\right)\right)}{W_{N2}^{c}}\ln\left(\frac{H_{j}^{c}\left(F2\right)}{W_{N2}^{c}}\right) $$
(17)

Where P(.) is the probability distribution of each feature, and W<sub>N1</sub><sup>c</sup> and W<sub>N2</sub><sup>c</sup> denote the probabilities of occurrence for the N1 and N2 levels, respectively. Later, both entropy vectors ξ<sub>E</sub>(F1) and ξ<sub>E</sub>(F2) are fused through the parallel maximum feature value (PMFV) approach. In the PMFV approach, the features are first combined through Eqs. (18) to (20), given below.

$$ \xi_{EF}\left(F1,F2\right)=\left(\xi_{1}\left(F1\right),\xi_{1}\left(F2\right)\right),\left(\xi_{2}\left(F1\right),\xi_{2}\left(F2\right)\right),\dots,\left(\xi_{N1}\left(Fi\right),\xi_{N2}\left(Fj\right)\right) $$
(18)
$$ \xi_{\max}\left(i,j\right)=\sum_{i=1}^{N1}\sum_{j=1}^{N2}\operatorname{Max}\left(\xi_{N1}\left(Fi\right),\xi_{N2}\left(Fj\right)\right) $$
(19)
$$ \xi_{\mathrm{fus}}\left(ij\right)=\xi_{\max}\left(f^{1}\right),\xi_{\max}\left(f^{2}\right),\xi_{\max}\left(f^{3}\right),\dots,\xi_{\max}\left(f^{N}\right) $$
(20)

Where ξ<sub>fus</sub>(ij) denotes the fused entropy feature vector of dimension N × 1920. Later, the discrete probability distribution (DPD) of the fused vector is computed and defined as P<sub>i</sub> = P<sub>0</sub>, P<sub>1</sub>, …, P<sub>N</sub>. Then, using these probability values, the Tsallis entropy is computed through Eqs. (21) to (23).

$$ \xi_{\mathrm{tli}}\left(\xi_{\mathrm{fus}}\right)=f_{u}(1),f_{u}(2),\dots,f_{u}\left(N\right) $$
(21)
$$ =\operatorname{argmax}\left[S_{E}^{1}\left(f_{u}(1)\right)+S_{E}^{2}\left(f_{u}(2)\right)+\dots+S_{E}^{N}\left(f_{u}\left(k\right)\right)\right] $$
(22)
$$ S_{E}^{N}\left(f_{u}\left(k\right)\right)=\frac{1-\sum_{k=1}^{N}P_{k}^{\,q}}{q-1} $$
(23)

Where \( S_{E}^{N}\left(f_{u}(k)\right) \) represents the Tsallis entropy vector, q is the entropic index with 0 < q < 1, and the probabilities \( P_{k} \) of the discrete distribution are subject to the constraint given in Eq. (24).

$$ \left|P^{N}+P^{K-1}\right|-1<S<1-\left|P^{N}+P^{K-1}\right| $$
(24)

Finally, the Tsallis entropy vector (S<sub>E</sub><sup>N</sup>(f<sub>u</sub>(k))) is sorted in descending order, and the top 50% of features are selected for the final recognition process and fed to a multi-layer perceptron (MLP) [53, 54]. The proposed labeled results are shown in Fig. 7; these results are computed after the selection of the best 50% of features.

Fig. 7
figure 7

Proposed labeled results using WCE images
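To make the selection pipeline concrete, the sketch below chains the PMFV fusion of Eqs. (18)-(20), a per-feature Tsallis entropy score in the spirit of Eq. (23), the descending sort with 50% retention, and an MLP classifier. NumPy and scikit-learn are assumed; the zero-padding of the 1000-D FC vector, the entropic index q = 0.5, and the MLP hyperparameters are illustrative choices rather than values stated above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def pmfv_fuse(ap_features, fc_features):
    """Parallel maximum feature value fusion (Eqs. 18-20).
    ap_features: (N, 1920) AP-layer features, fc_features: (N, 1000) FC-layer features.
    The shorter FC vector is zero-padded so the fused vector is N x 1920."""
    padded = np.zeros_like(ap_features)
    padded[:, :fc_features.shape[1]] = fc_features
    return np.maximum(ap_features, padded)

def tsallis_scores(fused, q=0.5, eps=1e-12):
    """Per-feature Tsallis entropy from each column's probability distribution (Eq. 23 style)."""
    p = np.abs(fused) / (np.abs(fused).sum(axis=0, keepdims=True) + eps)
    return (1.0 - np.sum(p ** q, axis=0)) / (q - 1.0)

def select_and_classify(fused_train, y_train, fused_test, keep_ratio=0.5):
    scores = tsallis_scores(fused_train)
    order = np.argsort(scores)[::-1]                        # sort entropy scores in descending order
    keep = order[: int(keep_ratio * fused_train.shape[1])]  # retain the top 50% of features
    mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
    mlp.fit(fused_train[:, keep], y_train)
    return mlp.predict(fused_test[:, keep]), keep
```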

Results and discussion

The proposed system is validated on a newly generated dataset of WCE images which consists of 12,000 video frames of three stomach classes: ulcer, bleeding, and healthy regions, with 4000 frames per class. The resolution of the generated video frames is 760 × 1240. With the proposed feature optimization approach, the performance of the top 50% of features is also compared with the top 70% and with all DCNN features. An MLP neural network is utilized, and its performance is compared with a few popular supervised classification methods: decision trees (DT) [55], cubic SVM (CSVM) [56], weighted KNN (WKNN) [57], and ensemble trees (ET) [58]. The performance of the selected features is analyzed through well-known measures such as recall rate, specificity, precision, AUC, false positive rate, accuracy, and classification computational cost. All simulations are conducted on a personal desktop computer with a Core i7 CPU, 16 GB of RAM, and an 8 GB graphics card.
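For reference, the following sketch shows how the reported measures can be derived from a confusion matrix in a one-vs-rest fashion for the three classes; scikit-learn is assumed.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_rates(y_true, y_pred, labels=("ulcer", "bleeding", "healthy")):
    """Recall (sensitivity), specificity, precision, FPR, and accuracy per class."""
    cm = confusion_matrix(y_true, y_pred, labels=list(labels))
    for i, name in enumerate(labels):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp
        fp = cm[:, i].sum() - tp
        tn = cm.sum() - tp - fn - fp
        print(f"{name}: recall={tp / (tp + fn):.3f} "
              f"specificity={tn / (tn + fp):.3f} "
              f"precision={tp / (tp + fp):.3f} "
              f"FPR={fp / (fp + tn):.3f} "
              f"accuracy={(tp + tn) / cm.sum():.3f}")
```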

Experiment 1: All optimized features

In the first experiment, all optimized features are considered for the computation of classification performance. A 70:30 split is adopted: 2800 WCE frames (70%) of each class are utilized for training and the remaining 1200 WCE frames (30%) of each class are employed for testing the proposed system. All testing results are obtained with 10-fold cross validation (10CV). The classification results of all optimal features extracted through multilayer feature selection (MLFS) are presented in Table 3. The maximum accuracy of the proposed system through MLFS is 97.9%, whereas the other methods, DT, CSVM, WKNN, and ET, obtain accuracies of 77.2%, 96.4%, 96.3%, and 88.1%, respectively. From the results, it is noticed that DT performs worst among these methods, with an error rate of 22.8%. Additionally, the classification performance is also computed for the single FC layer optimal features, showing a best accuracy of 95.0% for MLP and a worst accuracy of 73.8% using DT. The performance of MLP for both FC and MLFS features is also confirmed by the confusion matrices (CMs) in Fig. 8, which report the sensitivity rate, precision, FPR, specificity, and AUC. In addition, the classification execution time is computed for all classifiers as presented in Table 3 and plotted in Fig. 9, which confirms that the proposed system performs best with MLP.

Fig. 8
figure 8

Confusion matrices for all optimized selected features through MLP. a Verification of FC optimized features, b Verification of optimized MLFS

Table 3 Classification results using proposed MLFS and FC layer features
Fig. 9
figure 9

Comparison of classification time among optimal FC and MLFS approach

Experiment 2: 70% optimized features

In this experiment, the top 70% of optimal features are selected for the analysis of classification performance. 10CV is performed on the same 70:30 split for validation of the proposed system. The major intention of this work is to achieve maximum accuracy in less computation time; to do so, it is essential to provide a low-dimensional feature vector which holds only relevant features. The classification results of the 70% optimal features are presented in Table 4 for various classifiers, with a best accuracy of 99.3%, recall rate of 99.07%, specificity of 99%, precision of 99.05%, AUC of 0.993, and FP rate of 0.003. The accuracies of the other classification methods using the proposed optimal MLFS, namely DT, CSVM, WKNN, and ET, are 88%, 98.6%, 98%, and 94.1%, respectively. Additionally, the classification accuracy of the proposed optimal MLFS features is compared with the FC layer features, which obtain a maximum accuracy of 99.2% with MLP. The accuracies of MLP for FC and MLFS features are also verified through the confusion matrices presented in Fig. 10, which clearly demonstrate that fewer features produce good accuracy. Moreover, the comparison of the top 70% features with all features in terms of classification time, plotted in Fig. 11, verifies that the reduction in features decreases the system execution time for all classifiers.

Table 4 Classification accuracy of proposed 70% optimal MLFS features and FC layer features
Fig. 10
figure 10

Confusion matrices for 70% optimized selected features through MLP. a Verification of FC optimized features, b Verification of optimized MLFS

Fig. 11
figure 11

Classification time comparison of proposed optimal MLFS and FC layer features after 70% selection

Experiment 3: 50% optimal features

In the third experiment, the top 50% of features are selected for classification. The 70:30 split is utilized and 10CV is performed. The results are shown in Table 5, with a best accuracy of 99.5% for the MLP classifier. The other classifiers, DT, CSVM, WKNN, and ET, also perform well and give accuracies of 86.3%, 99.3%, 99.1%, and 99.5%, respectively, with the proposed MLFS approach. The performance of the MLP classifier using MLFS is also analyzed through the confusion matrix given in Fig. 12, which shows that each class is recognized with approximately 99% accuracy.

Table 5 Classification accuracy of proposed 50% optimal MLFS and FC layer features
Fig. 12
figure 12

Confusion matrices for 50% optimized selected features through MLP. a Verification of FC accuracy, b Verification of optimized MLFS accuracy

Additionally, the classification accuracy of the FC layer features is also calculated and achieves a highest accuracy of 99.3% for the MLP classifier. The other classifiers also give significant classification performance, with accuracies of 84.1%, 99.1%, 98.3%, and 97.9%, respectively. The accuracy of the MLP classifier using FC layer features is verified through the CM plotted in Fig. 12. The results presented in Table 5 show that selecting 50% of the optimal features provides better classification accuracy and shorter execution time compared to using all or 70% of the selected MLFS features. The time comparison of each classifier for all MLFS, 70% MLFS, and 50% MLFS features is depicted in Fig. 13, which shows that an optimal number of features improves the overall system performance.

Fig. 13
figure 13

Time comparison of all experiments: all MLFS features, 70% MLFS features, and the best 50% optimal selected MLFS features

Analysis

The analysis of the proposed system covers the overall recognition performance (visual and numerical), the change in recognition accuracy over 100 iterations of the proposed algorithm, and finally a comparison with a few recent techniques. In Fig. 1, the overall flow of the proposed system is presented, including saliency estimation for ulcer detection, multilayer CNN feature extraction, best CNN feature selection and fusion, and finally classification through MLP. The initial ulcer detection results are presented in Figs. 2, 3, 4, and 5. CNN features are then extracted through the DenseNet model, whose architecture is shown in Fig. 6. The multilayer features are extracted, the best features are selected through the proposed selection approach, and the results are validated on the newly designed WCE dataset. The results are analyzed in three different rounds: (a) all multilayer features selected (MLFS) and compared with FC selected features; (b) 70% MLFS compared with 70% FC selected features; and (c) 50% MLFS compared with 50% FC selected features.

The results of the first round are presented in Table 3 and confirmed by the CMs in Fig. 8. The best achieved recognition accuracy of the all-features MLFS approach is 97.9%, whereas the FC selected features reach 95.0%. In Table 4, the results of the second round are presented; the best accuracy of 99.3% is achieved by the MLP classifier through the proposed MLFS approach. These results are also confirmed by the CMs given in Fig. 10. In the final round, 50% of features are selected through the MLFS approach, as presented in Table 5, reaching a best accuracy of 99.5%, which is also affirmed by the CMs in Fig. 12. In addition, the recognition time for all classifiers is computed and plotted in Figs. 9, 11, and 13, which shows that a smaller number of useful features gives the best accuracy and minimizes the overall system execution time.

Moreover, an iteration-based comparison of the proposed system with the FC selected features is shown in Fig. 14, in which the system is run for up to 100 iterations; only a very small change is noted, which confirms the consistency of the proposed system.

Fig. 14
figure 14

Change in recognition accuracy over 100 iterations using FC features and the proposed MLFS approach

Finally, the proposed system is also compared with the latest existing techniques, as presented in Table 6, where the best existing accuracy of 98.49% was achieved by Amna et al. [5] in 2018. They classified the same three stomach classes, ulcer, bleeding, and healthy, with a very low computational time of 17.193 s but on only 448 WCE frames (255 healthy, 119 bleeding, and 68 ulcer), whereas in the presented work 12,000 WCE images of the same type are used, reaching a maximum accuracy of 99.50% with a computational time of 21.15 s, which compares favorably with the existing methods.

Table 6 Proposed work comparison with latest existing approaches

Conclusion

Ulcer and bleeding are the most frequently occurring abnormalities of the human digestive tract. Ulcer is the more common GI tract infection, and approximately 10% of people worldwide suffer from it. In this article, a new automated system is proposed based on best CNN feature selection. The ulcer regions are segmented through the CFbLHS approach, and CNN features are then computed. From them, the best features are selected and provided to an MLP for classification. The experiments are performed on a private dataset and achieve a maximum accuracy of 99.5%, recall rate of 99.40%, specificity of 99.20%, and computation time of 21.15 s. From the overall system results, we conclude that the segmentation process helps to extract useful features from the important region. The proposed method also shows that the fusion process increases the classification performance but, on the other hand, increases the computational time due to the larger number of features. This problem is resolved through the selection process, which minimizes the computational time while keeping the accuracy consistent. In the future, the segmentation of ulcer regions will be performed through Mask RCNN before feature extraction. Moreover, a few more recent performance measures will be added for a more precise analysis.