Introduction

With advancements in remote sensing technology, the quality and accuracy of remote sensing applications have improved significantly. The growing e-health industry uses sensor technologies to monitor patient conditions both locally and remotely. It provides improved patient care through early detection of adverse health conditions and can influence patients’ behavior to improve their ongoing health. Power-harvesting technology and improved energy management techniques liberate sensors from bulky power source connections and batteries, allowing them to be used in a wider range of autonomous applications [1, 2]. Disposable low-cost sensor technology is emerging due to miniaturization, embedding, and power harvesting. These sensors provide point-of-care monitoring for a broad range of patient conditions [3, 4]. Wireless capsule endoscopy (WCE) [5] is one example of such a disposable low-cost sensor. WCE is a non-invasive technique that allows doctors to visualize the most inaccessible parts of the human gastrointestinal tract as a means to identify causes of illness. It consists of three components: 1) a disposable wireless capsule, 2) a sensing system with an array of eight antennas, and 3) a battery pack with an image recording unit (IRU). After a patient swallows the disposable wireless capsule, it travels through the patient’s gastrointestinal tract, transmitting video frames wirelessly to an IRU worn by the patient. Traditional endoscopy requires hospitalization and specialized medical staff, whereas WCE is free from such constraints. This property gives it an advantage over traditional endoscopy and marks it as a potential candidate for telemonitoring [6–8]. The WCE system allows patients to live an ordinary life during the diagnostic procedure, which is lengthy: it lasts approximately 8 h and captures around 50,000 frames.

The underlying value of wireless capsule sensing and monitoring lies in the information derived from the acquired data. The sensing data is important for gastroenterologists to make a correct diagnosis. The collected data is large and contains only a small fraction of informative frames, while the rest of the visual data is redundant and non-informative [9]. It is widely agreed that transmission of such huge data volumes is not feasible due to the severe energy and bandwidth constraints of wireless body sensors. The collected data should therefore be processed locally, with only relevant data transmitted to gastroenterologists. However, processing such large sensor data on the resource-constrained WCE becomes a bottleneck. Furthermore, image sensors consume more energy than scalar sensors, and streaming captured WCE videos requires a large share of bandwidth, which causes high power consumption. Addressing the growing volume and complexity of WCE data requires a novel framework for efficiently managing and analyzing the data. In this context, video summarization is the most feasible solution. It is the process of reducing video frames to create a summary that retains the visual contents of the original video [10–12]. Ideally, video summarization techniques should utilize high-level semantic details of the WCE video content. However, it is not feasible to extract exactly the semantically relevant frames (objects such as lesions, bleeding, etc.) from WCE videos because of the highly diverse nature of the contents. The majority of techniques in the literature are therefore either anomaly-specific (identification of polyps, bleeding, Crohn’s disease lesions, etc.) or directly employ low-level features [13–16]. However, the use of low-level features results in the loss of semantic details, so the generated summaries are not consistent with gastroenterologists’ perception because of the semantic gap. To cope with these challenges, a visual attention model is used to bridge the gap between low-level and high-level features. Visual attention is the cognitive process that selectively concentrates on one aspect of an image while ignoring the other contents [17]. In this context, we have previously developed state-of-the-art visual attention driven keyframe extraction schemes [10, 11]. However, computational complexity is one of the main drawbacks of these existing summarization techniques, making them infeasible for the resource-constrained WCE system. To solve this problem, a computationally efficient summarization framework suitable for resource-limited devices such as wireless capsule sensors and mobile phones is vital.

This paper presents a cost-effective framework for managing large sensor data in a tele-endoscopy system. In the proposed framework, a smartphone collects frame sequences from an IRU and performs video summarization to generate keyframes. In parallel, the smartphone transmits the generated keyframes to the corresponding medical specialists for analysis. The proposed video summarization algorithm consists of three modules. First, the collected frame sequences are converted from RGB color space to color opponent-component (COC) space [18]. Second, the integral image is calculated, which makes our summarization method suitable for real-time video processing. Finally, image features, i.e., moments, curvature, and multi-scale contrast, are computed and fused to obtain the saliency map of each frame. The saliency map is then used to select keyframes.

Background

In an effort to reduce the time required for visual inspection of WCE videos, several methods have been presented in the literature. These methods can be classified into two categories: 1) detection and segmentation of gastrointestinal abnormalities and 2) video summarization. For a detailed review of the existing computer-vision based analysis schemes, the readers may refer to a survey [19].

Much attention has been paid to detection and segmentation techniques that assist physicians by reducing the burden of time-consuming and concentration-intensive WCE examination [20, 21]. Concerning segmentation applied to endoscopy video frames, Tjoa et al. [22] utilized chromatic features, which yields good results for colon segmentation since WCE images contain rich color information. Li and Meng [23] comparatively studied tumor detection in endoscopy images using texture features. These techniques are domain-specific and can only identify and segment certain abnormalities such as bleeding and ulceration. Most of them perform well in detecting a single type of abnormality but generally fail to detect multiple abnormalities. Recently, efforts have been made to generate video summaries instead of detecting specific kinds of abnormalities [14, 24–26]. Video summarization of WCE aims to select those frames that contain abnormalities. Prominent summarization methods can be classified into two categories, i.e., clustering and significant content change. In an effort to reduce the time required for the inspection of WCE videos, Iakovidis et al. [27] and Ioannis et al. [26] presented clustering-based WCE video summarization schemes, but these schemes are computationally expensive.

In significant content change based methods, consecutive frames are compared based on the differences in their low-level features, whereas in clustering-based methods, frames are clustered based on low-level features and a single frame is selected from each cluster. The common features used in these schemes are histogram difference, texture, and shape. A significant semantic gap exists between summaries generated by physicians and those produced from low-level features, which leads to unsatisfactory performance. Researchers have used visual attention theories for video summarization to minimize the semantic gap between human vision and low-level methods [28]. The human visual attention model consists of high-level cognitive processes that lead to saliency maps. The saliency map reduces the complexity of scene analysis and focuses on the most relevant objects. In this context, the visual attention model can be used to efficiently summarize the contents of WCE videos.

Proposed system

In WCE, a patient swallows a wireless capsule for diagnosing intestinal diseases. The micro imager of the wireless capsule captures images and sends them, through a radio frequency transmitter and a sensor array fixed at different locations on the anterior abdominal wall, to a data logger (DL). The DL is usually fixed in a belt around the patient’s waist. Due to limited resources in terms of battery life and computational power, implementing high-level signal processing solutions on the WCE’s DL is not feasible. For efficient and real-time analysis, the captured images are forwarded from the DL to a smartphone. The image sequences received by the smartphone are summarized with a real-time and efficient visual attention model. Finally, the summarized images are transferred to the corresponding gastroenterologists, as shown in Fig. 1.

Fig. 1

Overall view of the proposed video summarization based tele-endoscopy service

Consider a WCE frame sequence $f_t$, $t = 1, 2, 3, \ldots, N_T$, where $N_T$ denotes the total number of frames. Each frame is in RGB color space (that is, R, G, and B represent the red, green, and blue channels, respectively). The goal is to extract keyframes from the underlying video and transmit those keyframes to the corresponding gastroenterologists.

Color space conversion

Color information plays a vital role in the analysis of medical images. Perception of color is usually not accurately represented in RGB space. A better model for improved color perception is COC [29]. This color space is in accordance with the human visual system, which helps to efficiently select salient objects in an image. We have incorporated this property in the proposed system by converting frame $f_t$ from RGB color space to COC color space. The RGB channels are converted into four broadly-tuned color channels as $\hat{R} = R - (G + B)/2$, $\hat{G} = G - (R + B)/2$, $\hat{B} = B - (R + G)/2$, and $\hat{Y} = (R + G)/2 - |R - G|/2 - B$ [30]. Using these four color channels, two opponent color pairs, red-green and blue-yellow, are computed as follows:

$$ RG_t = \hat{R}_t - \hat{G}_t $$
(1)
$$ BY_t = \hat{B}_t - \hat{Y}_t $$
(2)

The intensity channel is also computed and fused with the red-green and blue-yellow channels to obtain a final aggregated image $F_t$, as given in Eqs. 3 and 4. The saliency map is then calculated holistically on this single aggregated image.

$$ I_t = \frac{R_t + G_t + B_t}{3} $$
(3)
$$ F_t = RG_t + BY_t + I_t $$
(4)
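To make the conversion concrete, the following minimal Python sketch implements Eqs. 1–4, assuming a frame arrives as an H×W×3 floating-point RGB array; the function name and array layout are illustrative choices, not part of the original system.

```python
import numpy as np

def rgb_to_coc_aggregate(frame):
    """Aggregate an RGB frame into a single COC-based image F_t (Eqs. 1-4).

    frame: HxWx3 float array with channels ordered R, G, B (an assumption).
    Returns an HxW array F_t = RG_t + BY_t + I_t.
    """
    R, G, B = frame[..., 0], frame[..., 1], frame[..., 2]

    # Four broadly-tuned color channels [30].
    R_hat = R - (G + B) / 2.0
    G_hat = G - (R + B) / 2.0
    B_hat = B - (R + G) / 2.0
    Y_hat = (R + G) / 2.0 - np.abs(R - G) / 2.0 - B

    RG = R_hat - G_hat      # red-green opponent pair (Eq. 1)
    BY = B_hat - Y_hat      # blue-yellow opponent pair (Eq. 2)
    I = (R + G + B) / 3.0   # intensity channel (Eq. 3)

    return RG + BY + I      # aggregated image F_t (Eq. 4)
```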

Integral image computation

The integral-image is an attractive concept for real-time image processing applications with memory and power constraints, such as wireless body sensors and smartphones. The integral-image [31] (or summed-area table) is a data structure constructed by replacing each image pixel with the sum of all pixels above and to the left of it (inclusive), as follows:

$$ F(x, y) = \sum_{x' \le x,\; y' \le y} F_t(x', y') $$
(5)

Computing an integral-image is a lightweight process with time complexity $2WH$ (where $W$ and $H$ are the width and height of the image, respectively). Once the integral-image is calculated, we can compute the sum of any rectangular region in constant time, i.e., $O(1)$ rather than $O(n^2)$. This faster computation has been applied to a number of interesting problems [32]. Thus, we utilize the fast computational property of integral-images to generate video summaries in real time.
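The following sketch illustrates both operations in Python with NumPy, assuming images are indexed as (x, y); the helper names are hypothetical.

```python
import numpy as np

def integral_image(F):
    """Summed-area table S of F: S[x, y] = sum of F over all x' <= x, y' <= y (Eq. 5)."""
    return F.cumsum(axis=0).cumsum(axis=1)

def region_sum(S, x1, y1, x2, y2):
    """Sum of F over the rectangle [x1, x2] x [y1, y2] in O(1),
    using at most four array references into the integral image S."""
    total = S[x2, y2]
    if x1 > 0:
        total -= S[x1 - 1, y2]
    if y1 > 0:
        total -= S[x2, y1 - 1]
    if x1 > 0 and y1 > 0:
        total += S[x1 - 1, y1 - 1]
    return total
```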

Visual saliency computation

In this section, visual saliency is computed to obtain an efficient video summary. The proposed saliency model is based on three features: image moments, multi-scale contrast, and curvature.

Image moment

A wireless capsule travels through the gastrointestinal tract by natural peristaltic motion. The camera captures mucosal images at different scales and orientations, which results in highly redundant data. In order to eliminate this redundant visual data, moments of inertia are used [33]. For this purpose, four statistical moments, i.e., mean, standard deviation, skewness, and kurtosis, are computed from each aggregated image $F_t$. These statistical moments qualitatively describe the structural shape of a region, its boundaries, texture, etc. They are computed as:

$$ E_t = \frac{\sum_{x=1}^{W} \sum_{y=1}^{H} F_t(x, y)}{W \times H} $$
(6)
$$ \sigma_t = \sqrt{\frac{\sum_{x=1}^{W} \sum_{y=1}^{H} \left(F_t(x, y) - E_t\right)^2}{W \times H}} $$
(7)
$$ S_t = \frac{1}{\sigma_t^3}\,\frac{\sum_{x=1}^{W} \sum_{y=1}^{H} \left(F_t(x, y) - E_t\right)^3}{W \times H} $$
(8)
$$ K_t = \frac{1}{\sigma_t^4}\,\frac{\sum_{x=1}^{W} \sum_{y=1}^{H} \left(F_t(x, y) - E_t\right)^4}{W \times H} $$
(9)

where $E_t$, $\sigma_t$, $S_t$, and $K_t$ are the mean, standard deviation, skewness, and kurtosis of frame $F_t$, respectively. The sum of pixels within any rectangular region can be determined with only four array references into the integral image of $F_t$, which accelerates the computation of the moment features. A similarity function between two consecutive frames $F_t$ and $F_{t+1}$ can be defined as:

$$ d_{mom}(F_t, F_{t+1}) = \sqrt{(E_t - E_{t+1})^2 + (\sigma_t - \sigma_{t+1})^2 + (S_t - S_{t+1})^2 + (K_t - K_{t+1})^2} $$
(10)

The resultant distance value $d_{mom}$ is used for removing redundant frames from videos: frames whose distance to their predecessor is small carry little new information.
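A compact NumPy sketch of Eqs. 6–10 follows; for clarity it computes the moments directly rather than through the integral-image acceleration described above, and assumes a non-constant frame (so that $\sigma_t > 0$).

```python
import numpy as np

def moment_signature(F):
    """Mean, standard deviation, skewness, and kurtosis of a frame (Eqs. 6-9)."""
    E = F.mean()
    sigma = F.std()                       # population std, matching Eq. 7
    S = ((F - E) ** 3).mean() / sigma ** 3
    K = ((F - E) ** 4).mean() / sigma ** 4
    return np.array([E, sigma, S, K])

def d_mom(F_t, F_next):
    """Euclidean distance between the moment signatures of two frames (Eq. 10)."""
    return np.linalg.norm(moment_signature(F_t) - moment_signature(F_next))
```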

Multi-scale contrast map

Contrast detection is one of the fundamental properties of the human cognitive process that helps gastroenterologists select the most informative regions in an image. In order to deal with anomalies of varying size in WCE frames, multi-scale contrast is used. It analyzes WCE frames at different scales, which helps to efficiently detect salient objects of varying size. For each frame, the multi-scale contrast at pixel $p(x, y)$ is computed on the aggregated image $F_t$ as:

$$ MC^s(x, y) = \left\| F_t(x, y) - \mu \right\| $$
(11)
$$ \mu = \frac{F(x+N,\, y+N) - F(x-N,\, y+N) - F(x+N,\, y-N) + F(x-N,\, y-N)}{(2N)^2} $$
(12)

where $s \in \{1, 2, 3\}$ is the scale of the image’s Gaussian pyramid and $N = 5$ defines the neighborhood around pixel $p(x, y)$. The mean $\mu$ is computed in constant time, requiring only four references into the integral image. Finally, the multi-scale contrast at pixel $p(x, y)$ of frame $F_t$ is obtained by adding the contrast at the three levels of the Gaussian pyramid, producing a gray-scale saliency image:

$$ MC(x, y) = \sum_{s=1}^{3} MC^s(x, y) $$
(13)
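The paper does not prescribe an implementation, so the sketch below approximates Eqs. 11–13 with OpenCV: `cv2.boxFilter` supplies the neighborhood mean $\mu$ (internally it relies on the same summed-area idea as the integral image), and `cv2.pyrDown` builds the Gaussian pyramid; the parameter choices are assumptions.

```python
import numpy as np
import cv2  # OpenCV, used here for the box filter, pyramid, and resizing

def multiscale_contrast(F, N=5, scales=3):
    """Multi-scale contrast map of an aggregated frame F (Eqs. 11-13).

    At each pyramid scale, a pixel's contrast is its distance from the
    mean of its local (2N+1) x (2N+1) neighborhood; the per-scale maps
    are resized to full resolution and summed (Eq. 13).
    """
    H, W = F.shape
    MC = np.zeros((H, W), dtype=np.float32)
    level = F.astype(np.float32)
    for s in range(scales):
        mu = cv2.boxFilter(level, ddepth=-1, ksize=(2 * N + 1, 2 * N + 1))
        contrast = np.abs(level - mu)        # Eq. 11
        MC += cv2.resize(contrast, (W, H))   # accumulate at full size (Eq. 13)
        level = cv2.pyrDown(level)           # next Gaussian pyramid level
    return MC
```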

Curvature feature map

Research in neuroscience suggests that curvature is also an important factor in determining the saliency of an image [34, 35]. Psychophysical studies reveal that curvature influences the gastroenterologist’s analysis of the underlying images. Therefore, a curvature measure is used to compute the saliency of each frame. Before computing the image curvature, a Gaussian filter is applied, which efficiently reduces noise by smoothing the overall image:

$$ G_t = F_t(x, y) * \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/2\sigma^2} $$
(14)

Then, the first-order derivative of $G_t$ is computed, producing a two-dimensional column vector as given in Eq. 15. This vector has an important geometric property: it points in the direction of the maximum rate of change of $G_t$ and thus highlights curvature.

$$ C = \nabla G_t = \begin{bmatrix} G_t^x \\ G_t^y \end{bmatrix} = \begin{bmatrix} \partial G_t / \partial x \\ \partial G_t / \partial y \end{bmatrix} $$
(15)

These partial derivatives are not rotationally invariant; however, their magnitude is. Thus, to obtain a rotationally invariant curvature map, the magnitude of the curvature vector $C$ is calculated as:

$$ C_{mag} = \left\| \nabla G_t \right\| = \sqrt{\left(G_t^x\right)^2 + \left(G_t^y\right)^2} $$
(16)
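A minimal sketch of Eqs. 14–16 using SciPy and NumPy follows; the Gaussian width `sigma` is an assumed parameter, and the partial derivatives are approximated with finite differences.

```python
import numpy as np
from scipy import ndimage

def curvature_map(F, sigma=1.0):
    """Rotationally invariant gradient-magnitude map (Eq. 16).

    F is first smoothed with a Gaussian filter (Eq. 14), then the partial
    derivatives of Eq. 15 are approximated with finite differences.
    """
    G = ndimage.gaussian_filter(F.astype(np.float64), sigma=sigma)  # Eq. 14
    Gx, Gy = np.gradient(G)                                         # Eq. 15
    return np.sqrt(Gx ** 2 + Gy ** 2)                               # Eq. 16
```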

The image moments, multi-scale contrast, and curvature measures are normalized to the range [0, 1] and fused to obtain a final saliency map. The average of the non-zero pixel values of this map is taken as the frame’s saliency value, which is computed for each frame in the video. The proposed keyframe extraction strategy is based on comparing the saliency values of frames: a new keyframe is selected when there is a significant change between the saliency value of the current frame and that of the previous keyframe. The significance threshold can be adjusted by users according to the level of detail they require in the summaries.
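Since the fusion and thresholding are specified only qualitatively, the sketch below makes its assumptions explicit: the two spatial maps from the previous sketches are fused with equal weights (the moment distance $d_{mom}$ could additionally gate redundant frames), and the significance threshold `tau` is a hypothetical user-chosen value.

```python
import numpy as np

def normalize(m):
    """Scale a feature map to the range [0, 1]."""
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo) if hi > lo else np.zeros_like(m)

def saliency_value(F):
    """Per-frame saliency: mean of the non-zero pixels of the fused map.

    Reuses multiscale_contrast and curvature_map from the earlier sketches;
    equal fusion weights are an assumption.
    """
    fused = normalize(multiscale_contrast(F)) + normalize(curvature_map(F))
    nz = fused[fused > 0]
    return nz.mean() if nz.size else 0.0

def select_keyframes(frames, tau=0.1):
    """Keep a frame when its saliency differs from the last keyframe's by more than tau."""
    keyframes, last = [], None
    for i, F in enumerate(frames):
        s = saliency_value(F)
        if last is None or abs(s - last) > tau:
            keyframes.append(i)
            last = s
    return keyframes
```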

Experiments and results

Experiments were conducted to validate the effectiveness of the proposed method. For evaluation, videos were collected from Gastrolab [36] and the WCE Video Atlas [37]. In order to accurately evaluate the proposed framework, three evaluation schemes were used. The subsequent sections provide details of each evaluation scheme.

Significance of the proposed framework

In this section, the results of the proposed method are presented on a single-shot video downloaded from SYNMED. In this video shot, the wireless capsule captures phlebectasia while recording the small bowel of a patient. Phlebectasia is an abnormality caused by unnatural dilation of veins in the human body; it appears as a dark spot in this video shot. The attention curves of the image moments, multi-scale contrast, curvature, and fused saliency are shown in Fig. 2. The phlebectasia captured in this video is a salient object, and the multi-scale contrast and curvature saliency efficiently highlight this abnormality. It is also evident from the fused attention curve that the frames containing phlebectasia have high attention values and are selected as keyframes.

Fig. 2

The visual attention curves for the sample WCE video shot

Subjective evaluation

In this section, the proposed summarization framework is compared with state-of-the-art video summarization techniques. This comparison is based on subjective ratings because video summarization quality assessment is inherently subjective. For this purpose, two gastroenterologists were asked to select the significant frames in each video. The frames selected by the gastroenterologists thus acted as the ground truth. The frames selected by a particular summarization scheme were then compared with the ground truth to compute Recall, Precision, and F-measure. Recall is the ratio of the number of relevant frames chosen as keyframes to the total number of relevant frames in the ground truth. Precision is the ratio of the number of relevant frames chosen as keyframes to the total number of frames (relevant and irrelevant) selected as keyframes. Recall and Precision are complementary to each other; thus, to gain better insight in the comparative analysis, the F-measure was also calculated. A value of F-measure close to 1 indicates high values of both Recall and Precision, and vice versa. These three metrics are computed as:

$$ \mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
(17)
$$ \mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} $$
(18)
$$ \mathrm{F\text{-}measure} = 2\left(\frac{\mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}\right) $$
(19)

where a true positive (TP) is a frame selected by both the gastroenterologists and the technique, a false negative (FN) is a frame chosen as a keyframe by the gastroenterologists but not by the technique, and a false positive (FP) is a frame selected as a keyframe by the technique but not by the gastroenterologists.
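Given ground-truth and machine-selected keyframe sets, the three metrics of Eqs. 17–19 can be computed as in the following sketch; representing frames as sets of indices is an assumption made for illustration.

```python
def evaluation_metrics(selected, ground_truth):
    """Recall, Precision, and F-measure of a summary (Eqs. 17-19).

    selected: set of frame indices chosen by a summarization technique.
    ground_truth: set of frame indices marked relevant by gastroenterologists.
    """
    tp = len(selected & ground_truth)   # relevant frames also chosen as keyframes
    fn = len(ground_truth - selected)   # relevant frames missed by the technique
    fp = len(selected - ground_truth)   # chosen frames not marked relevant
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
    return recall, precision, f_measure
```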

The proposed method was compared with our general-purpose summarization technique [10] and the domain-specific endoscopy video summarization scheme presented by Pan et al. [14]. As shown in Table 1, the proposed technique outperformed the other techniques, yielding higher values of Recall, Precision, and F-measure. This performance is achieved primarily because of the saliency model: image moments effectively removed redundant frames, while the multi-scale contrast and curvature measures precisely identified the important (abnormal) areas in the frames, as shown in Fig. 3. The visual results show that the proposed saliency model is very effective in detecting various gastrointestinal abnormalities, and this saliency map is a key contributor to the overall summarization performance of the proposed scheme.

Table 1 Recall (R), Precision (P), and F-measure (F) achieved by different summarization techniques
Fig. 3

Saliency maps. From top to bottom: input image and its corresponding saliency map

Application

In remote diagnosis applications like WCE, the huge amount of collected data becomes a bottleneck in healthcare services. Difficulties include storage, search, sharing, analysis, and visualization. The proposed method is implemented on smartphones: it automatically extracts diagnostically important visual data and sends it to medical centers for analysis. The proposed telemedicine service is a resource-aware application that provides real-time analysis in order to improve overall patient healthcare. The major contributions and future possibilities of the proposed work are:

  • Provides medical specialists fast and easy access to vital information anytime/anywhere during the WCE procedure.

  • Reduces the storage and transmission cost.

  • Increases access to healthcare opportunities.

  • Once video keyframes are extracted, there are further possibilities of organizing them for browsing and navigation purposes, rather than the strict sequential display of WCE videos.

  • This framework can be extended to other remote diagnostic procedures to monitor and promptly address abnormal findings.

Conclusion

With the proliferation of medical imaging data in remote health monitoring applications such as WCE, remote bio-sensing data processing becomes extremely challenging. This large sensor data also leads to increased demands on storage and transmission bandwidth. Thus, an efficient framework is required to process capsule sensor data and disseminate actionable intelligence. To deal with such data-intensive remote healthcare systems, we have presented an efficient video summarization framework. The proposed framework is based on a mobile-assisted video summarization mechanism that extracts semantically relevant frames from the video sequence captured by a wireless capsule sensor. Smartphones provide a convenient platform for local processing, as they not only have processing capability but can also connect to medical centers through reliable communication networks. In the proposed video summarization, data processing has been streamlined and automated through an efficient visual attention model that allows near real-time extraction of significant video frames. The extracted information is transferred to specialists over the Internet and via wireless capsule sensor-to-mobile communications, empowering gastroenterologists to make correct decisions in real time. The experimental evaluation shows significant advances over traditional tele-endoscopy systems, which are difficult to access. The high performance of the proposed system is further supported by the integral image, which facilitates real-time computation of the visual saliency model. Quantitative and qualitative evaluations validate the effectiveness of the proposed framework.

Future work will investigate the integration of cloud services with mobile computing to further improve performance. We intend to include cloud services for optimized processing, keeping in view the trade-off between energy consumption and computational complexity. The availability of such a high-performance remote data management system will enable a cost-effective solution for mining large wireless capsule sensing data.