1 INTRODUCTION

Colonoscopy is the imaging standard of choice for visualizing and identifying various abnormalities of the human gastrointestinal (GI) tract. Though widely used, colonoscopy is a painful procedure that requires careful preparation and monitoring of the patients who undergo it. Wireless capsule endoscopy (WCE) is an exciting imaging system that uses a pill-sized camera to transmit hours of video data wirelessly to a body-worn receiver [13]. The video data can then be analyzed on a computer (see Fig. 1a). As the WCE is a wireless imaging system, it moves with the aid of natural peristalsis, which makes the capsule tumble and turn in an unconstrained manner (see Fig. 1b). Despite the opportunity to image the inner GI tract with little or no pain compared to traditional tethered colonoscopy systems, WCE images pose challenges to automatic computer vision systems in terms of image quality [4, 8].

Fig. 1.

(a) The WCE imaging system wirelessly transmits images that are later analyzed on a computer. (b) Unconstrained motion of the capsule due to peristaltic motion within the GI tract.

One of the important image processing tasks in WCE imagery is the enhancement of unevenly illuminated images caused by the unconstrained motion of the camera system. There has been significant progress in contrast enhancement for natural images, and some prior works have applied these methods to contrast enhancement in WCE images. Here we highlight the ones most relevant to our work. The majority of traditional enhancement models rely on image histogram computations, such as histogram equalization [9] or contrast limited adaptive histogram equalization [10]. Other techniques include spectral optimal contrast tone mapping [11] and an inverse diffusion model based on a partial differential equation [12]. Extending histogram equalization to color RGB images is straightforward. Recently, there have been efforts to incorporate human visual system (HVS) based models [13] that mimic how our visual system solves color constancy. However, to the best of our knowledge, these models have not yet been applied to WCE or even to general endoscopy procedures, which usually contain nonuniform illumination, specular artifacts, dark regions, etc. In this work, we propose an HVS-consistent model that not only follows an HVS-driven approach but also obtains improved enhancement compared to traditional histogram-based and other models applied to WCE images. Our work follows the feature-linking model (FLM) and converts the color red, green, and blue (RGB) images into the hue, saturation, and value (HSV) space [14] to boost chromaticity, thereby avoiding the streaking artifacts associated with histogram-type models. The FLM is based on the precise timing of spikes in neuronal elements, inspired by neuroscientific evidence that the explicit times of spikes carry information in neural representations. We choose appropriate parameters for the FLM and apply it frame by frame to WCE videos as well.

Experimental results on various WCE images indicate that our approach obtains better quality enhancement without artifacts such as streaking, saturation, and color mix-up. We also provide comparative experimental results against previous enhancement models [9–11] from the literature. Further, we show quantitatively that our model obtains better results than existing enhancement techniques in terms of image quality metrics, and thereby that it can serve as a viable preprocessing step in automatic computer-aided diagnosis systems for WCE. To demonstrate the value of enhancement prior to automatic image processing tasks such as mucosa-lumen segmentation [15, 16] and 3D reconstruction with a shape from shading technique [17], we show example use cases indicating the improvements obtained.

We organize our paper as follows. In Section 2 we introduce our model for enhancing WCE images via an HVS-consistent neural spiking approach. In Section 3 we give experimental results and comparisons with related enhancement models to highlight the performance of our model. In Section 4 we conclude the paper.

2 CAPSULE ENDOSCOPY ENHANCEMENT WITH HVS CONSISTENT METHOD

2.1 HVS Consistent Model

We follow Zhan et al. [13] for a feature-linking model (FLM) based on spiking neurons, thereby modeling the neuronal mechanism inspired by the human visual system. We can utilize the FLM for enhancing digital images using the timing of the first spike, where the majority of the image information is contained. In the FLM, each neuron corresponds to a given image pixel, and the intensity values are encoded in the stimulus. In this work, we use the WCE frame that needs to be enhanced as input to the FLM, and the computed time matrix provides the enhanced output frame. The FLM is based on the HVS and simulates the Mach band effect, and this processing mechanism is consistent with the Weber-Fechner law. The overall process of the FLM can be described by three main constituents:

▪ membrane potential,

▪ threshold value, and

▪ action potential.

We show the flow of the FLM approach in Fig. 2a and the steps involved in computing the time matrix \(T\), starting from the stimuli matrix \(S\), in Fig. 2b. The terms \(U\), \({\Theta }\), and \(Y\) are described next.

Fig. 2.

(a) Feature-linking model (FLM) schematic [13], (b) overall algorithmic steps.

The neural membrane potential \({{U}_{{ij}}}\) can be written as

$$\begin{gathered} {{U}_{{ij}}}\left( n \right) = f~{{U}_{{ij}}}\left( {n - 1} \right) + {{S}_{{ij}}} + \alpha \mathop \sum \limits_{kl} {{M}_{{ij,kl}}}{{Y}_{{kl}}}\left( {n - 1} \right) \\ + \;\beta {{S}_{{ij}}}\left( {\mathop \sum \limits_{pq} \,{{W}_{{ij,pq}}}{{Y}_{{pq}}}\left( {n - 1} \right) - d} \right). \\ \end{gathered} $$
(1)

Here, the notation \((i,j)\) indexes each neuron, and the neighbors of that neuron are indexed by \((k,l)\) or \((p,q)\); \({{Y}_{{ij}}}\) is the postsynaptic action potential, \({{S}_{{ij}}}\) is the stimulus given to neuron \((i,j)\), \({{M}_{{ij,kl}}}\) is the synaptic weight applied to the feeding inputs, \({{W}_{{ij,pq}}}\) is the synaptic weight applied to the linking inputs, \(f\) is the attenuation time constant, \(d > 0\) is a constant for global inhibition, \(\alpha \) is the feeding strength, and \({\beta }\) is the linking strength. The threshold is driven by the postsynaptic action potential and can be written as

$${{{\Theta }}_{{ij}}}\left( n \right) = g~{{{\Theta }}_{{ij}}}\left( {n - 1} \right) + h~{{Y}_{{ij}}}\left( {n - 1} \right).$$
(2)

Here, \(h\) is the magnitude adjustment, \(g\) is the attenuation time constant, and \({{Y}_{{ij}}}\) is the postsynaptic action potential. An action potential is produced whenever the membrane potential exceeds the threshold \({\Theta }\):

$${{Y}_{{ij}}}\left( n \right) = \left\{ {\begin{array}{*{20}{l}} {1,\quad {\text{if}}\quad {{U}_{{ij}}}\left( n \right) > {{{\Theta }}_{{ij}}}\left( n \right)} \\ {0,\quad {\text{otherwise}}{\text{.}}} \end{array}} \right.$$
(3)
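To make the iteration concrete, the following Python sketch implements Eqs. (1)–(3) with NumPy arrays, computing the neighborhood sums by 2D convolution. It is a minimal illustration under our reading of the model, not a reference implementation; the function name `flm_step` is ours.

```python
from scipy.ndimage import convolve

def flm_step(U, Theta, Y, S, f, M, W, alpha, beta, d, g, h):
    """One FLM iteration, Eqs. (1)-(3); U, Theta, Y, S are r-by-c arrays."""
    # Eq. (1): membrane potential from leak, stimulus, feeding, and linking.
    feed = convolve(Y, M, mode='constant')   # sum_kl M_{ij,kl} Y_kl(n-1)
    link = convolve(Y, W, mode='constant')   # sum_pq W_{ij,pq} Y_pq(n-1)
    U = f * U + S + alpha * feed + beta * S * (link - d)
    # Eq. (2): threshold decays by factor g and jumps by h after a spike.
    Theta = g * Theta + h * Y
    # Eq. (3): fire wherever the membrane potential exceeds the threshold.
    Y = (U > Theta).astype(float)
    return U, Theta, Y
```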

2.2 Overall Flow

In the overall FLM flow (see Fig. 2b), the stimuli matrix \(S\) is assigned from the pixel values of a given WCE frame \(I\) (input image), with a neuron at every pixel location \((i,j)\) and \(r \times c\) the total number of neurons, where \(r\) and \(c\) are the numbers of neurons per row and per column, respectively. First, we convert the given RGB WCE frame \(I\) to the HSV space and apply the min-max normalization

$${{S}_{{ij}}} = \frac{{{{I}_{{ij}}} - \mathop {{\text{min}}}\limits_{ij} \left( {{{I}_{{ij}}}} \right)}}{{\mathop {{\text{max}}}\limits_{ij} \left( {{{I}_{{ij}}}} \right) - \mathop {{\text{min}}}\limits_{ij} \left( {{{I}_{{ij}}}} \right)}} + \epsilon .$$
(4)

Here, we assume that the input image \(I\) is 8-bit coded and set a small value \(\epsilon > 0\). The above normalization keeps all values of the stimuli matrix \(S\) strictly greater than zero, since a neuron that receives zero stimulus can never be captured and never fires when its threshold is positive.
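Putting these pieces together, a minimal sketch of the overall flow could look as follows, assuming the value (V) channel of the HSV representation carries the stimulus and that the enhanced frame is obtained by inverting and rescaling the first-spike time matrix \(T\); both assumptions reflect our reading of [13]. Here `flm_step` is the iteration sketched after Eq. (3), and `make_params` stands for the parameter setup described in Section 3.1.

```python
import numpy as np
from skimage import color

def enhance_frame(rgb, make_params, n_max=100, eps=1e-3):
    """Enhance one WCE frame via the FLM first-spike time matrix T.

    make_params(S) -> (params, Theta0): the FLM constants of Section 3.1
    and the initial threshold of Eq. (6).
    """
    hsv = color.rgb2hsv(rgb)            # work in HSV to preserve chromaticity
    I = hsv[..., 2]                     # value (V) channel as the input image
    # Eq. (4): min-max normalization shifted by eps so that S > 0 everywhere.
    S = (I - I.min()) / (I.max() - I.min()) + eps

    params, Theta = make_params(S)
    U = np.zeros_like(S)
    Y = np.zeros_like(S)
    T = np.zeros_like(S)                # first-spike times; 0 = not yet fired

    for n in range(1, n_max + 1):
        U, Theta, Y = flm_step(U, Theta, Y, S, **params)
        T[(Y > 0) & (T == 0)] = n       # record each neuron's first spike only
        if T.min() > 0:                 # stop once every neuron has fired
            break

    # Early spikes mark strong stimuli: invert and rescale T to [0, 1] to
    # obtain the enhanced V channel (our assumed mapping from T to output).
    hsv[..., 2] = (T.max() - T) / max(T.max() - T.min(), 1e-12)
    return color.hsv2rgb(hsv)
```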

3 EXPERIMENTAL RESULTS

3.1 Setup, Parameters, and Data

We set the parameters of the HVS-consistent FLM approach based on experiments conducted on various WCE videos. We apply presmoothing with a Gaussian filter and use the following stimulus-adaptive attenuation time constant \(f\):

$${{f}_{{ij}}} = {{c}_{0}}{\text{exp}}\left( { - {{{\left( {\frac{{{{S}_{{ij}}}}}{{{{\sigma }_{f}}}}} \right)}}^{2}}} \right) + {{c}_{1}},$$
(5)

with \({{c}_{0}},{{c}_{1}} > 0\) and standard deviation \({{{\sigma }}_{f}} = 1\). We further fix the constants in the FLM method as \(h = {{2}^{{10}}}\), \(d = 2\), \(g = 0.911\), \(\alpha = 0.01\), \(\beta = 0.03\). Next, the initial values of \(U\), \(Y\), and \(T\) are all set to 0. The threshold \({{{\Theta }}_{{ij}}}\) is initialized as follows

$${{{\Theta }}_{{ij}}}\left( 0 \right) = {{S}_{{ij}}} \otimes L + 1,$$
(6)

where \(L\) is the Laplacian derivative operator [17] and \( \otimes \) is the convolution operation. The synaptic weight matrices are given the following numerical values

$$M = W = \left( {\begin{array}{*{20}{c}} {0.1}&1&{0.1} \\ 1&1&1 \\ {0.1}&1&{0.1} \end{array}} \right).$$
(7)
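As a sketch, these settings can be assembled as follows. The 4-neighbor stencil used for the Laplacian operator \(L\) is our assumption, and \({{c}_{0}}\), \({{c}_{1}}\) are left as arguments since the text requires only that they be positive. A frame is then enhanced with, e.g., `enhance_frame(img, lambda S: flm_setup(S, c0, c1))` using the driver sketched in Section 2.2.

```python
import numpy as np
from scipy.ndimage import convolve

def flm_setup(S, c0, c1, sigma_f=1.0):
    """FLM parameters and initial threshold from stimulus S, Eqs. (5)-(7)."""
    # Eq. (5): stimulus-adaptive attenuation time constant (c0, c1 > 0).
    f = c0 * np.exp(-(S / sigma_f) ** 2) + c1

    # Eq. (6): initial threshold; 4-neighbor Laplacian stencil assumed for L.
    L = np.array([[0.0,  1.0, 0.0],
                  [1.0, -4.0, 1.0],
                  [0.0,  1.0, 0.0]])
    Theta0 = convolve(S, L, mode='constant') + 1.0

    # Eq. (7): identical feeding and linking synaptic weight matrices.
    M = W = np.array([[0.1, 1.0, 0.1],
                      [1.0, 1.0, 1.0],
                      [0.1, 1.0, 0.1]])

    # Remaining constants as fixed above.
    params = dict(f=f, M=M, W=W, alpha=0.01, beta=0.03,
                  d=2.0, g=0.911, h=2.0 ** 10)
    return params, Theta0
```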

We experimented with various parameter sweeps; the above parameters were found to be optimal and do not yield drastic changes in the enhanced images across WCE videos. This matches the observations of [13], and the parameter \({\beta }\) is perhaps the most important one, as it directly affects the final contrast of the enhanced images. Setting \({\beta }\) to different values and benchmarking against visual scoring by gastroenterologists is thus an interesting future direction for visual experiments. Here we concentrate on benchmarking against automatic contrast enhancement methods instead, and leave benchmarking with human readers, which involves inter-observer variability across gastroenterologists of various experience levels, as potential future work. We implemented the method in MATLAB and ran it frame by frame on a MacBook Pro laptop with an Intel Core i7 CPU and 8 GB RAM. The method takes 0.50 s to enhance a \(512 \times 512\) color RGB frame; this needs to be reduced further, as state-of-the-art WCE hardware now produces 4 up to 30 frames per second.
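Frame-by-frame video processing then reduces to a simple loop. The sketch below is hypothetical: the `imageio` library, the file names, and the values of \({{c}_{0}}\), \({{c}_{1}}\) are all our choices for illustration.

```python
import imageio.v3 as iio

C0, C1 = 0.7, 0.2  # example values only; the text requires just c0, c1 > 0

for idx, frame in enumerate(iio.imiter("wce_video.mp4")):
    enhanced = enhance_frame(frame, lambda S: flm_setup(S, C0, C1))
    iio.imwrite(f"enhanced_{idx:05d}.png", (enhanced * 255).astype("uint8"))
```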

3.2 Comparison with Other Enhancement Models

We conducted experiments on various WCE videos consisting of frames with normal mucosal regions, dark regions, illumination problems, intestinal juices, bleeding, polyps, diverticula, etc. We use these images to compare and benchmark different enhancement methods: histogram equalization (HISTEQ) [9], contrast limited adaptive histogram equalization (CLAHE) [10], spectral optimal contrast tone mapping (SOCTM) [11], inverse diffusion (INVDIFF) [12], and the FLM method. Figure 3 shows the original input images that need enhancement (top row) together with the optimal outputs of the different enhancement models. The CLAHE results over-amplify the over-illuminated regions, producing a washed-out appearance devoid of mucosal information. This can be detrimental for diagnosis based on vasculature [4, 6]. In contrast, our approach obtains better enhancement of obscured regions of interest and does not over-saturate the bright mucosal surfaces.

Fig. 3.

Comparison of various enhancement models applied to different WCE images from our dataset. (a) Input images, and results of (b) HISTEQ [9], (c) CLAHE [10], (d) INVDIFF [12], (e) SOCTM [11], and (f) our approach.

To quantitatively compare the outputs of the various enhancement models and to assess image quality, we utilized the following blind metrics (as no ground truth is available); a sketch implementing all three follows the list.

▪ Local contrast (LC):

$${\text{LC}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^r \,\mathop \sum \limits_{j = 1}^c \,\left| {{{\mathcal{L}}_{{ij}}}} \right|,$$

where \(\mathcal{L} = \left( {I \circ I} \right) \otimes \mathcal{E} - \left( {{{\mathcal{L}}_{m}} \circ {{\mathcal{L}}_{m}}} \right)\) and \({{\mathcal{L}}_{m}} = I \otimes \mathcal{E}\), with \( \circ \) the Hadamard (elementwise) product, \( \otimes \) the convolution operation, \(I\) the input image, and \(N = r \times c\), where \(r\) is the number of pixels in the vertical dimension and \(c\) is the number of pixels in the horizontal dimension. The matrix \(\mathcal{E}\) is the \(3 \times 3\) averaging kernel

$$\mathcal{E} = \frac{1}{9}\left( {\begin{array}{*{20}{l}} 1&1&1 \\ 1&1&1 \\ 1&1&1 \end{array}} \right).$$
(8)

We note that higher values indicate better enhancement.

▪ Spatial frequency (SF):

$${\text{SF}} = \sqrt {R_{F}^{2} + C_{F}^{2}} ,$$

where \({{R}_{F}}\) is the row frequency, and \({{C}_{F}}\) is the column frequency computed as follows

$$\begin{gathered} {{R}_{F}} = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^r \,\mathop \sum \limits_{j = 2}^c \,{{{\left( {{{I}_{{ij}}} - {{I}_{{i,j - 1}}}} \right)}}^{2}}} , \\ {{C}_{F}} = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 2}^r \,\mathop \sum \limits_{j = 1}^c \,{{{\left( {{{I}_{{ij}}} - {{I}_{{i - 1,j}}}} \right)}}^{2}}} . \\ \end{gathered} $$

▪ Mean gradient (MG):

$${\text{MG}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^r \,\mathop \sum \limits_{j = 1}^c \,{{G}_{{ij}}},$$

where \({{G}_{{ij}}}\) is the gradient magnitude of the image at pixel \((i,j)\); higher MG values represent better quality.
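A minimal sketch computing all three metrics for a single grayscale image follows; the boundary handling and the gradient estimator are our choices.

```python
import numpy as np
from scipy.ndimage import convolve

def blind_metrics(I):
    """No-reference quality metrics LC, SF, and MG for a grayscale image I."""
    I = I.astype(float)
    N = I.size  # N = r * c

    # Local contrast (LC): mean absolute local variance under the 3x3
    # averaging kernel E ('reflect' boundary handling is our choice).
    E = np.ones((3, 3)) / 9.0
    Lm = convolve(I, E, mode='reflect')              # local mean, I (x) E
    LC = np.abs(convolve(I * I, E, mode='reflect') - Lm * Lm).sum() / N

    # Spatial frequency (SF): RMS horizontal and vertical first differences
    # combined in quadrature.
    RF = np.sqrt(((I[:, 1:] - I[:, :-1]) ** 2).sum() / N)
    CF = np.sqrt(((I[1:, :] - I[:-1, :]) ** 2).sum() / N)
    SF = np.hypot(RF, CF)

    # Mean gradient (MG): average gradient magnitude via central differences.
    gy, gx = np.gradient(I)
    MG = np.hypot(gx, gy).sum() / N

    return LC, SF, MG
```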

Table 1 shows quantitative comparisons of the different enhancement models under the no-reference (blind) image quality metrics LC, SF, and MG. Results are averaged across 30 different WCE frames containing various structures, representing the heterogeneous nature of WCE videos. The HVS-driven model outperformed the other models, consistent with the visual results. Further, it obtained higher values in all three metrics, indicating the promise of our approach for WCE video enhancement with structure preservation.

Table 1. Benchmarking results for different image enhancement models on 30 WCE frames: averages of the local contrast (LC), spatial frequency (SF), and mean gradient (MG) metrics

3.3 Applications

As an application, and to show that the enhancement model can serve as a viable preprocessing step in automatic computer-aided diagnosis systems for WCE, we first apply it to a mucosa-lumen segmentation pipeline based on the active contours paradigm [15]. Figure 4 shows an example of the improvement obtained by applying the enhancement model prior to running the lumen segmentation. Figure 4a shows the input (top row) and its segmentation result (bottom row), where the blue curve indicates that the segmentation failed to capture the lumen and instead converged on the illumination-saturated area. After applying the FLM-based model, we obtain Fig. 4b (top row); running the segmentation now yields Fig. 4b (bottom row), where the white curve shows a near-perfect capture of the lumen area.

Fig. 4.

Enhancement improves the mucosa-lumen segmentation [15]. We show (a) the original and (b) the enhanced image (top row) with their corresponding segmentation results (bottom row).

Next, we show that applying the enhancement step before shape from shading for 3D reconstruction of WCE frames [17] improves the visualizations. Figure 5 shows the improvement in 3D visualization for a WCE frame containing a polyp (a precursor to colorectal cancer). Figure 5a shows the input image (top row) and its 3D shape from shading reconstruction (bottom row), indicating that the illumination affects the visualization near the polyp and the surrounding mucosal folds. After applying the FLM-based model, we obtain Fig. 5b (top row), and shape from shading now produces a better reconstruction, Fig. 5b (bottom row). Other possible applications include extracting vessel structures for better polyp recognition [6, 18] and benchmarking other segmentation models [19, 20].

Fig. 5.

Enhancement improves the 3D visualizations from the shape from shading technique [17]. We show (a) the original and (b) the enhanced image (top row) with their 3D visualizations (bottom row).

4 CONCLUSIONS

In this work, we considered an enhancement model based on the human visual system for wireless capsule endoscopy (WCE) videos with illumination problems. The feature-linking model (FLM) inspired approach is well-suited to enhancing WCE frames with inhomogeneous illumination, and it does not create the artifacts associated with global histogram-based techniques or with other tailored WCE enhancement models based on spectral optimal contrast-tone mapping and inverse diffusion. Our experimental results and benchmarking showed that the HVS-consistent model avoids saturation artifacts and obtains better enhancement results with better structure preservation. We are currently working on reducing the computational time of the HVS model toward real-time enhancement of WCE videos, along with testing the model on flexible imaging color enhancement (FICE) imagery [6] and using it to improve colorectal polyp detection [7].