Introduction

Wireless capsule endoscopy (WCE) is a microelectromechanical system (MEMS) consisting of a miniature camera, white light-emitting diodes (LEDs), a battery, and a radio frequency emitter. The WCE, which is swallowed by the patient and propelled by gastrointestinal peristalsis [1], is mainly used for investigating the whole small intestine in a noninvasive manner. In the gastrointestinal tract, the WCE captures images of the inner wall and wirelessly transmits them to a receiver worn by the patient. The images can then be downloaded to a workstation for visualization, which helps clinicians diagnose diseases and abnormalities in the small intestine.

In general, WCE captures images at a typical rate of two frames per second while moving forward along the digestive tract. For a complete examination, the WCE takes about 6–8 h to traverse the entire digestive tract and captures approximately 50,000 images of the gastrointestinal wall [2, 3]. Currently, the major weakness of a WCE examination is that a clinician usually needs about 2–3 h to examine the whole WCE video; even for an experienced physician, the review still takes about 1–2 h. Therefore, it is essential to reduce the examination time, and this study was motivated by attempts to devise a method for solving this problem.

In fact, a WCE video contains at least 10,000 frames that carry no useful clinical information. Moreover, clinical observations have revealed that only 20,000–30,000 small-intestine frames in a WCE video are of interest to a clinician. Because the WCE captures two images per second, many images in a WCE video show similar scenes, as shown in Fig. 1. By analyzing this redundant information, the number of images in a WCE video can be significantly reduced.

Fig. 1. Many images in the WCE video have similar scenes

Although some methods have been proposed for WCE video processing, they mainly focus on computer-aided diagnosis, such as the detection of obscure bleeding, polyps, ulcers, and tumors [4–8]. All of these are still in the experimental stage and have not yet been used in the clinic. Recently, some researchers have begun to pay attention to the problem of summarizing the WCE video. The goal is to develop a useful approach that can reduce the number of images in a WCE video.

An image mining approach that applies a computationally intensive clustering scheme (FCM-NMF) to summarizing the WCE video was proposed to reduce the time for visual inspection [9, 10]. The method is time consuming and quite complicated, and the clustering scheme can group images that are not time-related into the same cluster, producing a wrong representative frame. In [11], Li et al. proposed a WCE video reduction scheme that reduces a WCE video sequence using motion features. In that paper, they use two typical motion analysis methods: adaptive rood pattern search block matching (ARPS) [11] and Bayesian multiscale differential optical flow (BMSD). In fact, ARPS assumes that the camera movements are in the horizontal and vertical directions and needs a predicted motion vector (MV) to initialize the rood pattern size; without a predicted MV, ARPS cannot estimate motion accurately. Although BMSD takes a coarse-to-fine scheme to estimate the optical flow vector, it has difficulty estimating larger displacements of WCE camera motion. Therefore, the motion models of ARPS and BMSD are not suitable for WCE imaging motion estimation. Olympus Ltd has announced a new feature of the improved Endo Capsule Software that can condense the entire examination into a maximum of 2,000 still images [12]. However, the method is similar to ARPS in that it keeps images with explicit changes relative to the previous image; in other words, it only detects sudden changes in the WCE image sequence. Since some small-scale diseases do not produce significant scene changes, this method may ignore important information in the WCE image sequence and lead to wrong decisions in medical diagnosis. A registration methodology was presented to reduce the number of frames in [13, 14]. This method utilizes a segmentation scheme and a graph method to register similar regions in two adjacent WCE video frames. However, it is difficult to produce an effective segmentation to register similar regions because the same intestinal scene may produce varying WCE images due to illumination and local nonrigid deformation.

In this paper, we present a new reduction scheme for WCE video consisting of three stages. In the first stage, we estimate motion features in the WCE video. In the second stage, we use the motion features to measure the scene changes between two WCE images. In the third stage, we preserve those images that have obvious scene changes. In order to estimate the motion between two successive WCE images more accurately, we also propose a novel motion model for WCE video. In this model, the motion estimation of WCE video is divided into two levels, a coarse level and a fine level. At the coarse level, the WCE camera motion is estimated as a global rigid motion and treated as large-displacement motion estimation. At the fine level, the gastrointestinal tract deformation is estimated as a local nonrigid motion on top of the WCE camera motion estimate, and is treated as local motion estimation.

The rest of this paper is organized as follows. In “The Proposed Method,” we present the reduction scheme for WCE video in detail. In “Results and Discussions,” experimental results are comprehensively illustrated and evaluated. The last section summarizes the conclusions of our study.

The Proposed Method

Overview of the Method

To achieve the reduction task, we assume that two successive frames have an overlapping area (the movement of the WCE camera is continuous) and that the intrinsic parameters of the WCE camera are constant throughout the entire examination. We divide our scheme into three steps. First, the motion between successive WCE images is estimated based on a novel WCE video motion model. Then, the scene changes are measured with the motion estimate. Finally, images with obvious scene changes are preserved. The framework of our reduction scheme is presented in Fig. 2.

Fig. 2. The framework of our WCE video reduction scheme, which is divided into three steps

The Motion Estimation of WCE Video

In our scheme, motion estimation is used to capture scene changes between successive WCE images. However, this is a difficult task because the WCE takes only two images per second, unlike a general video, which has more continuous motion between frames. Therefore, we need a suitable motion model to describe the motion between successive WCE images. In this paper, we propose a novel WCE video motion model in which the motion consists of two parts: a rigid motion and a nonrigid motion.

The Motion Model of WCE Video

As shown in Fig. 3, when the WCE works in the gastrointestinal tract, it is propelled by gastrointestinal peristalsis. If we consider only the WCE camera movement, the relationship between two successive WCE images can be described with homogeneous coordinates:

Fig. 3. The motion of the WCE in the gastrointestinal tract

$$ {Y_{{i - 1}}} = M \times Y_i^T $$
(1)

where the subscript i is an image index in the WCE video, Y_i and Y_{i−1} are 2D points in the source image i and in a neighborhood image i − 1 of the source image, and × denotes matrix multiplication. M is a 2D rigid transformation, which includes translation T, rotation R, scaling s, and perspective p parameters. These parameters describe the motion of the WCE camera (global rigid motion) between two successive WCE images.

In fact, the motion between successive WCE images depends not only on the movement of the WCE camera but also on the nonrigid deformation (local nonrigid motion) of the gastrointestinal tract due to its peristalsis. Therefore, both the movement of the WCE camera and that of the gastrointestinal tract need to be considered simultaneously when we estimate the motion between successive WCE images. Then, Eq. (1) can be modified as follows:

$$ {Y_{{i - 1}}} = M \times Y_i^T + {\varepsilon_i} $$
(2)

where ε_i is a local gastrointestinal tract displacement vector caused by local nonrigid gastrointestinal movement. We can integrate the rigid and nonrigid transformations into a universal form:

$$ {Y_{{i - 1}}} = \left[ {\begin{array}{*{20}{c}} M & {{\varepsilon_i}} \end{array}} \right] \times \left[ {\begin{array}{*{20}{c}} {Y_i^T} \\ 1 \end{array}} \right] $$
(3)

According to this motion model, WCE image motion can be estimated in two stages. In the first stage, the coarse-level estimation, the motion of the WCE camera, which can be regarded as a large displacement of the WCE image scene, is estimated. In the second stage, the fine-level estimation, the local gastrointestinal tract deformation is estimated based on the result of the first stage. Actually, the first stage can also be thought of as an approximate alignment between two successive WCE images, and the second stage as an alignment of the WCE images in local detail. Once we have chosen a suitable motion model to describe the alignment between a pair of WCE images, we need a method to estimate its motion parameters.

The Motion Estimation of WCE Camera

The WCE is propelled by gastrointestinal peristalsis. Because the gastrointestinal wall is very close to the WCE camera, WCE images show little projective deformation, and the WCE camera motion can be described as a 2D rigid deformation. Therefore, we can focus on the rigid motion and ignore the perspective parameter p and the local gastrointestinal nonrigid displacement ε when estimating the motion of the WCE camera.

A usual approach for this estimation is to extract distinctive features from each image, match the features to establish a global correspondence, and then estimate the transformation between the images (feature-based methods). In reality, it is very difficult to extract robust, stable, and distinctive features from successive WCE images and to establish a matching, mainly because of the low resolution and poor structural information of WCE images. In this paper, we use the Bee Algorithm (BA) [15] to search for the best solution of the WCE camera motion parameters (BAME). BA is a population-based search algorithm for complex multiobjective optimization problems that cannot be solved exactly within polynomially bounded computation time. As opposed to the feature-based methods, this kind of method is often called a direct (pixel-based) alignment method. BAME can be described as seeking a minimal error:

$$ {T^{ * }} = \arg \min \left\{ {{{\text{error}}_n}\left( {{I_{{i - 1}}},{T_n}({I_i})} \right):1 \leqslant n \leqslant m,\;(i,m) \in \mathbb{Z}} \right\} $$
(4)

where I_{i−1} and I_i are a neighborhood image and a source image in the WCE video, T_n is the nth transformation whose deformation parameters are searched for in the solution space, and error_n(I_{i−1}, T_n(I_i)) is an error metric between the neighborhood image and the transformed source image. In our method, we use mutual information (MI) [16, 17] as the error measure; MI is widely used in medical image registration. In this paper, we calculate MI on the gray-scale information because WCE images have similar intensity, color, and hue. A pseudocode of the BAME is shown in Fig. 4.

Fig. 4. The pseudocode of the BAME algorithm. We modified the standard Bee Algorithm to select the optimal WCE camera motion parameters
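To make the search concrete, the following is a minimal Python sketch of a BAME-style search, assuming a 2D similarity transform parameterized by (tx, ty, θ, s) and scored by gray-scale mutual information. The population sizes, parameter ranges, and neighborhood-shrinking schedule are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
from scipy.ndimage import affine_transform

def mutual_information(a, b, bins=32):
    """MI between two gray-scale images from their joint histogram."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])))

def warp(img, p):
    """Warp img by the similarity transform (tx, ty, theta, s) about its center.
    affine_transform uses an inverse-mapping convention; the sign flip this
    introduces is harmless for a symmetric-range parameter search."""
    tx, ty, theta, s = p
    c, si = np.cos(theta), np.sin(theta)
    A = s * np.array([[c, -si], [si, c]])
    center = np.array(img.shape) / 2.0
    offset = center - A @ center - np.array([ty, tx])   # (row, col) offset
    return affine_transform(img.astype(float), A, offset=offset, order=1)

def bame(neigh, src, n_scouts=30, n_best=5, n_recruits=10, iters=40, seed=0):
    """Bee-Algorithm-style search maximizing MI(neigh, warp(src, p))."""
    rng = np.random.default_rng(seed)
    lo = np.array([-40.0, -40.0, -0.5, 0.8])            # assumed parameter bounds
    hi = np.array([40.0, 40.0, 0.5, 1.2])
    pop = rng.uniform(lo, hi, size=(n_scouts, 4))       # scout bees
    radius = (hi - lo) / 4.0                            # flower-patch size
    best_p, best_f = pop[0], -np.inf
    for _ in range(iters):
        fit = np.array([mutual_information(neigh, warp(src, p)) for p in pop])
        order = np.argsort(fit)[::-1]
        if fit[order[0]] > best_f:
            best_p, best_f = pop[order[0]].copy(), fit[order[0]]
        elites = pop[order[:n_best]]
        for e in elites:                                # recruited local search
            cand = rng.normal(e, radius, size=(n_recruits, 4)).clip(lo, hi)
            for cnd in cand:
                f = mutual_information(neigh, warp(src, cnd))
                if f > best_f:
                    best_p, best_f = cnd, f
        radius *= 0.9                                   # shrink neighborhoods
        pop = np.vstack([elites,                        # keep elites, rescout rest
                         rng.uniform(lo, hi, size=(n_scouts - n_best, 4))])
    return best_p, best_f
```

A full implementation would follow the recruitment and site-abandonment rules of the standard Bee Algorithm more closely; the sketch keeps that structure but simplifies the schedule.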

The Motion Estimation of Local Gastrointestinal Tract

As mentioned earlier, gastrointestinal tract movement can be modeled as a nonrigid deformation caused by peristalsis. In this paper, we use SIFT flow [18] to predict the motion of the local gastrointestinal tract. The algorithm assumes that SIFT descriptors are able to establish dense correspondences between a neighborhood image and a source image. SIFT descriptors are largely invariant to local illumination changes and encode local image structure, which makes the matching more robust. SIFT flow can be formulated as an optimization problem on the correspondence search with the cost function:

$$ E(\varepsilon ) = \sum\nolimits_p {{{\left\| {{s_{{i - 1}}}(p) - {s_i}(p + \varepsilon (p))} \right\|}_1}} + \frac{1}{{{\sigma^2}}}\sum\nolimits_p {\left( {u_x^2(p) + u_y^2(p)} \right)} + R(p,q) $$
(5)
$$ R(p,q) = \sum\nolimits_{{(p,q) \in N}} {\min \left( {\alpha \left| {{u_x}(p) - {u_x}(q)} \right|,d} \right) + \min \left( {\alpha \left| {{u_y}(p) - {u_y}(q)} \right|,d} \right)} $$
(6)

where ε(p) = (u_x(p), u_y(p)) is a flow vector at pixel location p = (x, y), s_{i−1}(p) and s_i(p) are the SIFT descriptors extracted at location p in the neighborhood image i − 1 and the source image i of a WCE video, and N is the spatial neighborhood of a pixel. Here, the source image is the image already transformed using the WCE camera motion estimate. R(p,q) constrains the flow vector to be consistent with that of adjacent pixels. The flow vector ε(p) is regarded as the local nonrigid motion component. Finally, the motion estimation between two successive WCE images can be described as follows:

$$ {Z_{{i - 1}}} = M \times Y_i^T + \varepsilon_i^T $$
(7)

The right-hand side of the equation contains two terms. The first term is the global rigid motion estimated by BAME, where the matrix M is the transformation matrix of the WCE camera. The second term is the local nonrigid motion estimated by SIFT flow, where ε is the local flow vector along which the best local match can be found between two successive WCE images. Z_{i−1} is an approximation of the neighborhood image points Y_{i−1}. Eq. (7) can also be written in the universal form:

$$ {Z_{{i - 1}}} = \left[ {\begin{array}{*{20}{c}} M & {\varepsilon_i^T} \end{array}} \right] \times \left[ {\begin{array}{*{20}{c}} {Y_i^T} \\ 1 \end{array}} \right] $$
(8)
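As a small illustration of Eq. (7), the sketch below warps source-image points first by the rigid matrix M from BAME and then by a per-pixel SIFT-flow field. The 3 × 3 homogeneous form of M and the flow arrays flow_u, flow_v are assumed inputs produced by the two estimation stages.

```python
import numpy as np

def warp_points(points, M, flow_u, flow_v):
    """points: (N, 2) array of (x, y) pixel coordinates in source image i.
    M: 3x3 homogeneous rigid transform; flow_u/flow_v: SIFT-flow components."""
    homog = np.hstack([points, np.ones((points.shape[0], 1))])
    rigid = (M @ homog.T).T[:, :2]                      # global rigid term M x Y_i
    xi = points[:, 0].astype(int).clip(0, flow_u.shape[1] - 1)
    yi = points[:, 1].astype(int).clip(0, flow_u.shape[0] - 1)
    eps = np.stack([flow_u[yi, xi], flow_v[yi, xi]], axis=1)  # local term eps_i
    return rigid + eps                                  # Z_{i-1}, Eq. (7)
```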

Measuring Scene Changes in WCE Images

In a WCE video, a scene change means that two successive WCE images have substantially dissimilar content. For a robust measurement of scene changes between successive WCE images, we define the concept of an invalid region. An invalid region is a region in a WCE image (source image) containing a scene of the gastrointestinal wall that cannot be found in the neighborhood images (or key images). Here, key images are images that should be preserved in the reduction process. As shown in Fig. 5, we take two steps to estimate the invalid region. The first step is backward estimation: we estimate the scene deformation from image i to i − 1 using Eq. (7) and seek the points of image i that do not lie in image i − 1 according to the scene deformation. Those points are recorded as a set OB = {Y_i : {M × Y_i} ∉ Y_{i−1}}. The second step is forward estimation: we record the points OF = {Y_i : {M × Y_i} ∉ Y_{i+1}} between image i and image i + 1. Finally, the invalid region of the current frame is the set:

Fig. 5. An invalid region is measured in the forward and backward manner. For example, image i has three points (red, blue, and black); after transformation, the black point cannot be found in images i − 1 and i + 1

$$ {\text{IR}} = {\text{OF}} \cap {\text{OB}} $$
(9)

The area of the invalid region can reflect the scene change between two successive WCE images. Figure 6 shows an invalid region of image i. Even if the number of points in the invalid region is large, it may still not indicate a large displacement, because the invalid region of a WCE image is often distributed along the margin and is usually narrow. The maximum diameter of the invalid region better reflects the potential size of the scene change. Therefore, we simplify the measurement and use the diameter of the max-inscribed circle (DMC) of the invalid region to evaluate whether there is a scene change between two images.
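A hedged sketch of this measurement, assuming the warp_points helper above and the forward/backward motion estimates as inputs: pixels that map outside the frame in both directions form the invalid region (Eq. 9), and the DMC is read off a Euclidean distance transform, whose maximum inside a region equals the radius of its max-inscribed circle.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def invalid_region_dmc(shape, M_bwd, flow_bwd, M_fwd, flow_fwd):
    """shape: (h, w) of image i; returns (invalid-region mask, DMC in pixels)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    masks = []
    for M, (fu, fv) in ((M_bwd, flow_bwd), (M_fwd, flow_fwd)):
        q = warp_points(pts, M, fu, fv)
        outside = (q[:, 0] < 0) | (q[:, 0] >= w) | (q[:, 1] < 0) | (q[:, 1] >= h)
        masks.append(outside.reshape(h, w))
    invalid = masks[0] & masks[1]                   # IR = OF n OB, Eq. (9)
    if not invalid.any():
        return invalid, 0.0
    dmc = 2.0 * distance_transform_edt(invalid).max()   # diameter of max-inscribed circle
    return invalid, float(dmc)
```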

Fig. 6. Color coding for similar scenes: the green and red regions (scenes of image i) can be found only in images i − 1 and i + 1, respectively, whereas the yellow region (scenes of image i) can be found in both. The green circle is the max-inscribed circle of the invalid region

The Scheme of the WCE Video Reduction

In our reduction scheme, the WCE video is represented as the set F = {f_1, …, f_i, f_{i+1}, …, f_n}. We divide the video F into segments named shots. A shot is a finite subset of F: SF = {f_i, f_{i+1}, …, f_m : i, m ≤ n}. Then, we define an extraction function that describes the procedure of extracting images from a shot: KF_j = extract(SF_j), where KF_j is the set of key images (key frames) that should be preserved in the jth shot. If a WCE video is divided into k shots, our reduction scheme can be described by the equation:

$$ {\text{KFS}} = \mathop{ \cup }\limits_{{j = 1}}^k {\text{extract}}({\text{S}}{{\text{F}}_j}) = \mathop{ \cup }\limits_{{j = 1}}^k {\text{K}}{{\text{F}}_j} $$
(10)

In our method, we set the first image in each shot as a key frame (KF); then we estimate the scene motion between the source image i and the key frame as the backward estimation, and between the source image i and the neighborhood image i + 1 as the forward estimation. The invalid region is calculated after the scene motion estimation. Finally, we determine whether the current image should be preserved as a key frame according to the threshold τ on the DMC. If a source image i is preserved as a key frame, the following images perform their backward estimation against this new key frame. The extraction procedure in a shot can be described as follows:

$$ {\text{extract}}({\text{S}}{{\text{F}}_j}) = \left\{ {\exists {\text{KF}} \in {\text{S}}{{\text{F}}_j}:{\text{DMC}}({\text{KF}}) < \tau } \right\} $$
(11)
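The extraction loop can be sketched as follows. Two assumptions are worth flagging: the sketch preserves a frame when its DMC exceeds τ, reading Eq. (11) in line with the stated goal of keeping frames with obvious scene changes; and estimate_motion is a hypothetical wrapper around BAME and SIFT flow returning (M, (flow_u, flow_v)). Keeping the first and last frames follows the experimental setup described later.

```python
def extract_shot(frames, tau=5.0):
    """Return indices of key frames in one shot, per the scheme above."""
    key_idx, key = [0], frames[0]                  # first image is a key frame
    for i in range(1, len(frames) - 1):
        cur, nxt = frames[i], frames[i + 1]
        M_b, flow_b = estimate_motion(cur, key)    # backward, to key frame
        M_f, flow_f = estimate_motion(cur, nxt)    # forward, to neighbor i+1
        _, dmc = invalid_region_dmc(cur.shape, M_b, flow_b, M_f, flow_f)
        if dmc >= tau:                             # obvious scene change
            key_idx.append(i)
            key = cur                              # subsequent frames compare to it
    key_idx.append(len(frames) - 1)                # last image is also kept
    return key_idx
```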

In addition, we use a matrix multiplication technique to handle the case where there is an interval between the source image i and the key frame, as shown in Fig. 7. The motion estimation between source image 2 (CF2) and the key frame (KF) is given by:

Fig. 7. The reduction scheme in a WCE video shot. We use the matrix multiplication technique to calculate scene changes between source image i and the key frame

$$ {Z_{\text{KF}}} = {M_1} \times {M_2} \times Y_{\text{CF2}}^T $$
(12)

Here, M_1 and M_2 are transformation matrices as in Eq. (8), Y_CF2 denotes points in CF2, and Z_KF denotes the points obtained by transforming Y_CF2 with the product of the transformation matrices M_2 and M_1.
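With 3 × 3 homogeneous matrices, the chaining in Eq. (12) amounts to multiplying the pairwise transforms before warping the points; a minimal illustration, reusing the hypothetical warp_points helper from above:

```python
import numpy as np
from functools import reduce

def chain(transforms):
    """Compose pairwise transforms M1, M2, ... into one matrix (Eq. 12)."""
    return reduce(np.matmul, transforms)

# e.g., Z_KF for CF2: warp_points(Y_cf2, chain([M1, M2]), flow_u, flow_v)
```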

Results and Discussions

Material and Evaluation of the Proposed Method

We test our method on various WCE image sequences from different patients, provided by Nanjing General Hospital of Nanjing Military Command. We divide our experiments into two parts: the first verifies whether the proposed WCE video motion model is suitable for WCE image motion estimation, and the second tests whether our reduction scheme is effective on WCE video.

In the motion estimation experiments, we evaluate performance using image registration together with PSNR, MSE, SSIM, and MI. In the reduction experiments, we measure performance with recall and precision, which are widely used in the field of information retrieval, and compare our reduction scheme with ARPS and FCM-NMF. Recall and precision for the reduction of WCE video are given as follows:

$$ {\text{Recall}} = \frac{\text{Number of keyframes correctly detected}}{{\text{Number of keyframes in the ground truth}}} $$
(13)
$$ {\text{Precision}} = \frac{\text{Number of keyframes correctly detected}}{{\text{Number of keyframes detected}}} $$
(14)
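For completeness, here is a direct translation of Eqs. (13) and (14) into code, together with the compression ratio; the CR definition (fraction of frames removed) is our assumption, as the text does not define it explicitly.

```python
def reduction_metrics(detected, ground_truth, n_frames):
    """Recall (Eq. 13), precision (Eq. 14), and compression ratio for one shot."""
    detected, ground_truth = set(detected), set(ground_truth)
    correct = len(detected & ground_truth)         # key frames correctly detected
    recall = correct / len(ground_truth) if ground_truth else 0.0
    precision = correct / len(detected) if detected else 0.0
    cr = 1.0 - len(detected) / n_frames            # assumed: fraction removed
    return recall, precision, cr
```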

Experiments of WCE Video Motion Estimation

Both the WCE camera and the local gastrointestinal tract motion estimation are tested in our experiments. Source and neighborhood image pairs are chosen from WCE videos covering three types of gastrointestinal tract images: stomach, small intestine, and colon, and the motion estimation performance is evaluated with image registration and PSNR, MSE, SSIM, and MI. In the WCE camera motion estimation experiments shown in Fig. 8, we notice that the registration produced by BAME is clearer in structural information and has fewer visual artifacts than direct image alignment without any transformation. We also ran BAME on some nongastrointestinal images and found that it is also effective for general image scenes. The residual error between the neighborhood image and the image transformed with BAME is presented in Fig. 11; BAME shows no noticeable geometric difference, which indicates that it can estimate the WCE camera motion precisely.

Fig. 8. Experiments of WCE camera motion estimation on gastrointestinal and nongastrointestinal images with BAME. Neighborhood and source image pairs were chosen from different parts of WCE video: stomach, jejunum, ileum, colon, and a nongastrointestinal scene. We compare the BAME performance via image registration. The third column is the registration from direct image alignment without any transformation

Then, scale-invariant feature transform (SIFT) matching and shape context (SC) matching are considered as comparisons with BAME in our experiments. SIFT and SC were given the same parameters as described in the papers [19, 20]. In the SIFT matching experiments, we found that SIFT cannot effectively extract dense matching points between two successive images in most cases, possibly because of the low resolution, poor structural information, and textureless regions of WCE images. For SC matching, we first extract the edge (or contour) information of two successive WCE images [13], then apply the SC algorithm to establish dense matching point pairs between these images according to the edge information, and finally estimate the motion from these matching points. In some cases, we could not guarantee that edges (or contours) were extracted correctly from WCE images, nor that the edges (or contours) extracted from the two WCE images were consistent or contained similar structural information. However, nongastrointestinal images have obvious contours and structure; thus, both SIFT and SC work well on WCE camera motion estimation for such images. Experimental results of SIFT and SC matching are presented in Fig. 9. We use PSNR, MSE, SSIM, and MI to compare the registration performance of these methods and notice that BAME performs better than SIFT and SC in most cases. These results are shown in Tables 1 and 2.

Fig. 9. A comparison of the image registration of BAME, SIFT, and shape context (SC). BAME is more effective for WCE motion estimation; its registration image has fewer visual artifacts than those of SIFT and SC

Table 1 The comparison of BAME with SC and SIFT on PSNR and MSE
Table 2 The comparison of BAME with SC and SIFT on SSIM and MI
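For reference, the four scores reported in Tables 1 and 2 can be computed along the following lines, using scikit-image for PSNR and SSIM and the joint-histogram MI helper from the BAME sketch; gray-scale uint8 inputs are assumed.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def registration_scores(reference, registered):
    """PSNR, MSE, SSIM, and MI between a reference and a registered image."""
    diff = reference.astype(float) - registered.astype(float)
    return {
        "PSNR": peak_signal_noise_ratio(reference, registered),
        "MSE": float(np.mean(diff ** 2)),
        "SSIM": structural_similarity(reference, registered),
        "MI": mutual_information(reference, registered),  # from the BAME sketch
    }
```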

Next, we use Eqs. (5) and (6) to estimate the local nonrigid motion of the gastrointestinal tract based on the results of the WCE camera motion estimation. In Fig. 10, we can see that the alignment image produced by BAME-SIFTFlow is less blurred than that produced by BAME alone, which means SIFT flow improves the accuracy of the motion estimation at the fine scale. This also verifies that the proposed motion model is suitable for WCE video. From Fig. 11, we find that the residual error between the neighborhood image and the transformed image is minimized with BAME-SIFTFlow. In particular, the difference is smaller for nongastrointestinal images and for stomach and colon images. However, for images of the jejunum and ileum, we still found some obvious structure in the residual; the reason may be that the dense villi of the small intestine cause uncertain local motion.

Fig. 10. The local nonrigid motion estimation with SIFT flow. The registration image with BAME-SIFTFlow is less blurred than that with BAME alone. The SIFT-flow displacement field is a visualization of pixel displacements using the color-coding scheme of [23]

Fig. 11. The residual errors between the neighborhood image and the transformed image are compared for BAME and BAME-SIFTFlow. BAME-SIFTFlow presents no noticeable geometric difference

To further test the performance of BAME-SIFTFlow, we compare it with several popular nonrigid motion estimation algorithms: standard optical flow (HS), the Demon nonrigid registration method [21], and large displacement optical flow (LDOF) [22]. These methods are applied on top of the WCE camera motion estimated with BAME and are named BAME-HS, BAME-Demon, and BAME-LDOF. We again use PSNR, MSE, SSIM, and MI to evaluate performance. From the comparison, we notice that BAME-HS is the worst local motion estimator among these methods, and BAME-LDOF performs poorly compared with BAME-Demon and BAME-SIFTFlow, whereas our method has the best performance in most situations. The comparison is presented in Tables 3 and 4.

Table 3 The comparison of BAME-SIFTFlow with BAME-HS, BAME-Demon, and BAME-LDOF on PSNR and MSE
Table 4 The comparison of BAME-SIFTFlow with BAME-HS, BAME-Demon, and BAME-LDOF on SSIM and MI

Evaluation of the Reduction Scheme

Once the motion of a WCE video is estimated, we use it as a salient feature to reduce the number of images in the video. In our experiments, we tested the reduction scheme on WCE video clips collected from different gastrointestinal scenes: stomach, jejunum, ileum, and colon. Each WCE video clip contains 100 images, and each shot has 50 images. In each shot, we save the first and last images as initial key frames; from the second image on, we apply our reduction scheme as described in “Measuring Scene Changes in WCE Images.” In the experiments, the DMC threshold is set to 5 pixels empirically. The number of key frames in each shot is shown in Table 5.

Table 5 The number of key frames in each shot

We plot curves of the radius of the max-inscribed circle of each WCE image's invalid region in a shot and show the key frames and their neighborhood images (Fig. 12). A red circle marks an image chosen by the DMC threshold. In Fig. 12, key frames are marked with yellow number labels, such as 1222, 12232, and 16729, and neighborhood images are marked with white number labels. We notice that the key frames show obvious scene changes. Figure 12b is an example of the WCE transition from ileum to colon, and key frame 16729 is the ileocecal valve. Its neighborhood images (16728 and 16730) show obviously different scenes; therefore, both are preserved as key frames. Figure 12b also illustrates that the small intestine moves more than the colon. However, even though the movement is smaller in the colon, food residue may still distort the motion estimation in colon images, which explains why a shot of the transition from ileum to colon preserves more images.

Fig. 12. The curves show the radius of the max-inscribed circle of each WCE image's invalid region in a shot: a an ileum shot and b an ileum–colon shot. Key frames are marked with yellow number labels, and neighborhood images with white number labels

We verify the performance of our reduction scheme by recall (RC), precision (PC), and compression ratio (CR). The ground-truth key frames in the WCE video shots are labeled by a clinician. We also compare our method with the ARPS and FCM-NMF schemes, which were given the same parameters as described in the papers [9, 11]. The results are presented in Tables 6 and 7. For a good WCE video reduction scheme, both recall and precision should be as high as possible. In practice, however, we are more concerned with correctly detecting the key frames in the WCE video, because missed key frames may lead to a failure in medical diagnosis. Thus, recall is more important than precision in our application.

Table 6 The comparison of our reduction scheme with ARPS k = 2 and FCM-NMF c = 5
Table 7 The comparison of our reduction scheme with ARPS k = 3 and FCM-NMF c = 6

From the experimental results, the average performance of our reduction scheme is 74 % recall, 58 % precision, and 68 % CR. Although ARPS and FCM-NMF have a higher compression ratio (CR) than our method, the higher CR also causes these methods to have lower RC and PC. This is because ARPS and FCM-NMF cannot preserve the representative images in the WCE image sequence. For example, in the image sequence 16701 to 16750 (ileum to colon), neither ARPS nor FCM-NMF preserves the ileocecal valve (16728 and 16729) as key frames. In fact, local nonrigid deformation is not considered in ARPS; ARPS is a template search scheme and is not suited for WCE movement because it assumes that the camera moves only along the horizontal and vertical directions, whereas the WCE can actually move in any direction. We also notice that increasing the number of clusters does not yield any obvious improvement for FCM-NMF. Moreover, in the FCM-NMF experiments, we cannot obtain a consistent result from the same image sequence and parameters each time, which may be due to the drift of cluster centers during the clustering procedure. In addition, for the FCM-NMF scheme, it is also difficult to evaluate similarity correctly and to form the geodesic distance matrix between successive images using Euclidean distances alone.

However, the average recall of our method is only 74 % (see Table 6). A major reason could be that the objective criterion of reducing redundant data in a WCE video is very different from the subjective criteria of the clinician. Clinicians are more interested in the image content itself than in the video as a whole. This also explains why the precision is not high: our scheme preserves more images as key frames in a shot than the ground truth marked by the clinician. In addition, different clinicians may have different subjective criteria for key frame extraction; even for a single clinician, it is very difficult to judge which of two similar images should be preserved. Meanwhile, we find that different movement speeds in different parts of the gastrointestinal tract cause different reduction ratios. Increasing the CR may decrease the recall, so we must make a tradeoff between recall and CR.

To verify whether our reduction scheme is more suitable than ARPS and FCM-NMF, we compare the sampling frequency of our scheme with ARPS and FCM-NMF on a WCE image sequence (500 frames) without clinician-labeled ground truth. The result is presented in Fig. 13. The black curve in Fig. 13 is the intensity of the image sequence (500 frames); the changes of intensity can reflect the scene changes in the sequence. In the sampling curves (blue, red, and green), a value of one (1) marks a key frame extracted from the image sequence. We found that our scheme is more consistent with the changes of intensity than ARPS and FCM-NMF: the more the scene changes, the more key frames need to be preserved. This indicates that our scheme can produce a reduced sequence acceptable for browsing and examination of WCE image sequences.

Fig. 13. A comparison of the sampling frequency of ARPS, FCM-NMF, and our scheme. The black curve at the top is the intensity of an image sequence (500 frames); the changes of intensity can reflect the scene changes in the sequence. The sampling frequency of our scheme (blue curve) is consistent with the changes of intensity

Conclusions

We presented a new reduction scheme for WCE video. This scheme makes use of motion features to preserve those images that show obvious scene changes within their temporal neighborhood. To obtain the motion feature from successive WCE images, a new WCE video motion model is proposed. Based on this motion model, the motion estimation of WCE video is divided into two levels, the coarse level and the fine level. At the coarse level, the WCE camera motion is estimated with BAME; at the fine level, the local gastrointestinal tract motion is estimated with SIFT flow.

Through the empirical comparison, we find that the BAME-SIFTFlow method can estimate the motion between WCE images more accurately in most situations, especially when two successive WCE images have a large displacement. This is mainly because the method estimates the motion in two stages. Therefore, BAME-SIFTFlow is consistent with our gastrointestinal tract motion assumption and is robust for practical WCE video applications. Moreover, we notice that SIFT flow performs better for local motion estimation in WCE video than HS, Demon, and LDOF, and we think SIFT flow can be extended to other medical image applications. However, we also find that our reduction scheme has a lower recall in some situations. A major reason could be that the selection of key frames by a clinician is subjective and lacks objective criteria; it therefore remains a challenging task to determine, with unsupervised learning, which images should be preserved as key frames of a WCE video. Another reason may be that motion estimation alone is not sufficient as a feature to capture the scene changes in a WCE image sequence. Combining the motion feature with other image features, such as color and texture, should therefore be considered in future work.