
1 Introduction

Video capsule endoscopy (VCE) enables the examination of the whole gastrointestinal (GI) tract in a non-invasive way. It is performed with a swallowable capsule endoscope (CE), which captures colour images during its battery lifetime of approximately 12 h. Today's commercial CEs are passive, in the sense that they move by exploiting gravity and the peristaltic motion of the GI tract. However, several research prototypes have been proposed for active, robotic capsule endoscopy, which will enable more thorough examinations, easier lesion localization, and drug infusion [1].

A major issue that remains unresolved in both passive and active VCE is the considerable human effort required to manually review the produced videos. Typically, each individual review lasts 45–90 min, during which the reviewer's concentration must remain undivided for a careful inspection of the output video [2]. Such a tiring procedure is prone to human error, with serious consequences for the diagnostic yield, which is alarmingly low [3].

In order to cope with this problem, automated lesion detection methods based on computer vision algorithms have been proposed [4]. Most of these methods exploit supervised machine learning methodologies, capable of learning what constitutes a normal and what constitutes an abnormal finding within the VCE video. The generation of datasets for training the learning machines requires that experts indicate which pixels correspond to normal or abnormal tissues within the VCE images. Considering that the videos produced by a VCE examination are composed of thousands of frames (usually of the order of 10⁴), such a pixel-wise annotation task can prove very time-consuming and can discourage experts from annotating large datasets.

A promising solution that could alleviate this problem is weakly-supervised learning, which involves training a learning machine using weakly annotated data [5, 6]. In this paper, weakly supervised learning is considered using images annotated at image-level instead of pixel-level. This way, a binary semantic label is assigned per video frame, indicating whether its content is normal or abnormal. A drawback of such an approach is that while the abnormal images can be tracked, the localization of the lesion(s) within each abnormal frame remains a challenge. However, it is much more important for the system to robustly detect which frames contain possible lesions than to localize the lesions within these frames, since the latter can be done much more easily by the video reviewers.

The Bag-of-Words or Bag-of-Visual-Words (BoW/BoVW) model can be considered a weakly supervised model built upon the notion of visual vocabularies. A visual vocabulary may be seen as a set of "exemplar" image patches (visual words), in terms of which any given image may be described. Typically, this vocabulary is built using a large corpus of images representative of the domain of interest, and should be closely related to the problem at hand. The vocabulary may be seen as a means of quantizing the feature space, i.e., that of the local descriptors. Any unseen descriptor may then be easily quantized to its nearest visual word. The description of the whole image is formed by a histogram counting the appearances of each visual word within it. Apart from the obvious advantage of BoW, i.e., that it can be used as a weakly supervised approach as discussed above, it also provides a fixed-size representation, a useful property for tasks such as classification using traditional classifiers, e.g., feed-forward neural networks, support vector machines etc. Finally, the visual description provided by BoW may also be used in tasks such as inverted file indexing [7], visual retrieval etc.

An early application of BoW in capsule endoscopy was investigated using speeded-up robust features (SURF) for polyp detection [8]. In [9], the performance of BoW was investigated using scale-invariant feature transform (SIFT) features and local binary patterns (LBP) for ulcer detection. A more complex feature extraction scheme for the construction of the vocabulary required in BoW was proposed in [10]. This scheme was applied to polyp detection and includes the extraction of SIFT, LBP, uniform LBP and histogram of oriented gradients (HoG) features from neighbourhoods of salient points detected using the SIFT key-point detector. In the context of bleeding detection, colour histograms extracted from various colour spaces were considered [11]. Colour along with textural information has also been exploited in [12] for the detection of gastric and oesophageal cancer, gastritis, and oesophagitis. In that study, superpixel segmentation was exploited for the estimation of image descriptors from homogeneous regions. As in [11], the descriptors considered include colour histograms from various colour spaces as well as LBP-based textural signatures. Most of the aforementioned approaches are based on support vector machine (SVM) classifiers.

The BoW model has also been exploited in the context of unsupervised segmentation of capsule endoscopy videos, based on probabilistic latent semantic analysis (pLSA) [13]. In the context of the analysis of higher-resolution endoscopic images, BoW models have been proposed for browsing endoscopic imagery by semantic information [14], colonoscopy image classification [15], and classification of images obtained using chromoendoscopy and narrow-band imaging techniques.

Acknowledging the significance of incorporating an image-level instead of pixel-level annotation process in the development of training datasets for lesion detection systems in VCE, in this paper we investigate a novel BoW-based weakly-supervised learning approach using the state-of-the-art features proposed in [16]. These features represent colour information both at pixel and region level in the CIE-Lab colour space, and despite their simplicity they have proved very effective in the detection of a diverse set of abnormalities [5, 17].

The rest of this paper is organized as follows: In Sect. 2 we describe the methodology followed for the proposed weakly supervised classification scheme, providing a brief description of both the generic BoW methodology and our approach. In Sect. 3 we present and discuss our experimental results. Finally, conclusions are drawn in Sect. 4, where we also discuss plans for future work.

2 Methodology

BoW is a widely used method for modelling generic categories in detection, classification and recognition problems [18]. The method was originally inspired by text document analysis techniques, and consists of calculating word frequencies. The first step of BoW is to describe an image as a set of "words" that capture its visual content. To this end, given an adequately large dataset, a set of features is extracted from every image and typically quantized using a clustering approach, e.g., the k-means algorithm [19]. Upon clustering, the resulting centroids (or, in some approaches, the medoids, which, as opposed to centroids, are actual members of the dataset) are used as a "visual vocabulary" and are often referred to as "visual words."

Each feature is then translated (coded) into one of these visual words, i.e., the nearest one in the feature space (typically in terms of the Euclidean distance). The next step involves the construction of a histogram describing the frequency of appearance of every visual word within an image. This histogram is then used to characterize the visual content of the image. Among the advantages of BoW, we should emphasize that it reduces the problem of classifying a large number of high-dimensional vectors of local point descriptors to that of classifying a single fixed-size, one-dimensional vector, without significant loss of visual information. Finally, any typical classification approach may be used for the classification of these histogram vectors. In this work we choose to use an SVM [20], trained with examples of histograms extracted from both normal and abnormal categories.
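To make this pipeline concrete, the following minimal Python sketch builds a vocabulary by k-means clustering and encodes an image as a normalized word-frequency histogram. It assumes scikit-learn and NumPy; the function names and parameter values are illustrative and not taken from the original implementation.

```python
# Minimal BoW sketch: vocabulary construction and histogram encoding.
# Assumes local descriptors (e.g., SURF or colour features) have already
# been extracted per image; names and defaults are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors_per_image, n_words=700, seed=0):
    """Cluster all training descriptors; the centroids act as visual words."""
    all_descriptors = np.vstack(descriptors_per_image)   # (N, d) matrix
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(all_descriptors)

def bow_histogram(descriptors, vocabulary):
    """Quantize each descriptor to its nearest visual word and count occurrences."""
    words = vocabulary.predict(descriptors)              # nearest-centroid indices
    hist, _ = np.histogram(words, bins=np.arange(vocabulary.n_clusters + 1))
    return hist / max(hist.sum(), 1)                     # fixed-size, L1-normalized
```

The resulting fixed-size histograms can then be fed to any standard classifier; in this work, an SVM.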

We use the well-known SURF (speeded up robust features) algorithm [21] in order to detect interest points and extract descriptions from patches around them. SURF is a powerful and fast description scheme that has been successfully applied to a plethora of computer vision problems. It has been shown to achieve repeatability and performance comparable to other, more sophisticated schemes, at a lower computational cost. It combines a Hessian-Laplace region detector and a gradient orientation-based feature descriptor, and is invariant to several image transformations and robust to illumination variations. For interest point selection, we also make use of a "naïve" approach known as "dense sampling". Following this approach, we select pixels sampled on a regular grid (i.e., one with equal horizontal and vertical inter-pixel distances), which are then used as interest points. Although these points cannot be matched as accurately as, e.g., the SURF interest points, they carry valuable information regarding image content interpretation [22].
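As an illustration of the two sampling strategies, the sketch below contrasts detector-driven SURF points with dense grid sampling in OpenCV. It assumes an opencv-contrib build with the non-free xfeatures2d module enabled; the grid step and patch size values are illustrative.

```python
# Sparse (detector-driven) vs. dense (regular-grid) interest points,
# both described with SURF. Requires opencv-contrib with xfeatures2d.
import cv2

def surf_points(gray, hessian_threshold=400):
    """SURF detector + descriptor on a greyscale image."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
    return surf.detectAndCompute(gray, None)

def dense_points(gray, step=18, size=18.0):
    """Dense sampling: keypoints on a regular grid, described with SURF."""
    surf = cv2.xfeatures2d.SURF_create()
    grid = [cv2.KeyPoint(float(x), float(y), size)
            for y in range(step // 2, gray.shape[0], step)
            for x in range(step // 2, gray.shape[1], step)]
    return surf.compute(gray, grid)
```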

For the extraction of visual descriptions of patches around the interest points, we also evaluate the colour-based features of [16]. Images are first transformed to the CIE-Lab colour space, and then the following colour information is extracted from a square region centred at each point: (i) the Lab values of the interest point itself; and (ii) the minimal and maximal values of each component within the region. This results in a vector consisting of 9 values.
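A minimal sketch of this 9-dimensional descriptor follows, concatenating the Lab triplet at the interest point with the per-channel minima and maxima over the surrounding square region. The window size and function name are illustrative; OpenCV and NumPy are assumed.

```python
# 9-dimensional colour descriptor: (L, a, b) at the point, plus the
# per-channel min and max over a square window centred on it.
import cv2
import numpy as np

def lab_colour_descriptor(image_bgr, x, y, window=18):
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2Lab).astype(np.float32)
    half = window // 2
    patch = lab[max(y - half, 0):y + half, max(x - half, 0):x + half]
    point_vals = lab[y, x]                           # Lab values at the point
    mins = patch.reshape(-1, 3).min(axis=0)          # per-channel minima
    maxs = patch.reshape(-1, 3).max(axis=0)          # per-channel maxima
    return np.concatenate([point_vals, mins, maxs])  # 9 values in total
```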

In Fig. 1(b) and (d) we illustrate the set of SURF interest points extracted from a given VCE image, combined with SURF regions and fixed windows, respectively, whilst in Fig. 1(c) and (e) we illustrate the set of dense interest points, also combined with SURF regions and fixed windows. One may easily observe that the SURF points do not cover the visual properties of the whole image, whereas such coverage is achieved by the dense features.

Fig. 1. Image examples of the different algorithm configurations: (a) a raw VCE image depicting lymphangiectasia; (b) SURF; (c) dense SURF; (d) Lab; and (e) dense Lab.

3 Results

For the evaluation of the proposed weakly-supervised BoW approach, we performed experiments using a subset of dataset 2 from the publicly available KID database [23, 24]. This dataset includes a variety of different kinds of abnormalities. More precisely, the selected subset consists of 227 images of the most common inflammatory lesions, e.g., as in Fig. 2(a), including ulcers, aphthae, mucosal breaks with surrounding erythema, cobblestone mucosa, stenoses and/or fibrotic strictures, and significant mucosal/villous oedema. It also includes a set of 1327 normal images derived from the small bowel (728 images), e.g., as in Fig. 2(b) (right), and the stomach (599 images), e.g., as in Fig. 2(b) (left).

Fig. 2. Representative images from the dataset used in the experiments: (a) inflammatory lesion images; (b) normal images from the stomach (left) and the small bowel (right).

In order to investigate whether BoW can be used as a reliable classification approach, we compare its performance across four different configurations, along with a state-of-the-art baseline. These configurations differ in the method used for the selection of interest points, in the description of the patches around those points, and in the colour space used. For the latter, we used greyscale images as well as transformations of CIE-Lab (using standard illuminant D65) in which the L and b channels were discarded, keeping only the colour information of the a channel; we shall refer to the latter as the "Lab images". More specifically, the performed experiments are as follows: (i) SURF points and features on the greyscale images; (ii) dense points and SURF features on the Lab images; (iii) SURF points and the colour features of [16]; (iv) dense points and the colour features of [16]; and (v) the state-of-the-art method of [10], where image description is based on the combination of SIFT and compound local binary pattern (CLBP) features. In each case, we extract the interest points, then their descriptions, and we create the visual vocabulary, which we use for the BoW image description; finally, we train SVM classifiers. In every experiment we use 6-fold cross-validation and estimate the area under the receiver operating characteristic curve (AUC).
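The "Lab images" preprocessing can be sketched as follows; OpenCV's 8-bit BGR-to-Lab conversion assumes the D65 illuminant, matching the setting described above, and the function name is illustrative.

```python
# Keep only the a (green-red) channel of CIE-Lab, discarding L and b.
import cv2

def a_channel_image(image_bgr):
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2Lab)  # D65 for 8-bit input
    return lab[:, :, 1]                               # index 1 is the a channel
```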

The visual vocabulary size ranged from 300 to 1200 words. For the experiments with dense SURF, we used multi-scale feature extraction with a scale step of 1.6, starting from scale 1.6 up to scale 6.4. We also experimented with various sizes of square regions for the extraction of the colour features, using 18 × 18 and 36 × 36 pixel areas. For dense feature extraction we used grid steps of 4, 10, 18 and 36 pixels, both horizontally and vertically. For the method of [10] we used CLBP with patch sizes of 4 × 4 and 8 × 8. For classification we used an SVM with an RBF kernel.
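The evaluation protocol can be sketched as follows, with X holding the BoW histograms and y the image-level (normal/abnormal) labels. The use of stratified folds is an assumption, as the fold-construction details are not specified above.

```python
# 6-fold cross-validation of an RBF-kernel SVM, scored by AUC.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def evaluate_bow(X, y, n_folds=6, seed=0):
    clf = SVC(kernel='rbf')     # decision_function feeds the AUC scorer
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring='roc_auc')
    return scores.mean(), scores.std()
```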

The most notable results are summarized in Table 1, where we may observe that the best performance was achieved for the case of dense Lab features using a window size of 18 × 18 pixels and a visual vocabulary of 700 words. The best performance of standard SURF features (i.e., applied on greyscale images) was achieved using dense extraction and a vocabulary size of 800 words. However, this advantageous performance comes at a cost in efficiency, since the number of samples obtained by dense SURF is higher (due to the regular sampling process). In addition, our approach obtained better results in comparison with the state-of-the-art method of [10]. In all cases, the application of SURF on the a channel of CIE-Lab leads to an increase in AUC.

Table 1. Experimental results; in dense(x), x denotes the grid step; in SURF(y), y denotes the colour space (g: greyscale, a: the a channel of CIE-Lab). Note that in the case of SURF feature description, image patches are selected by the algorithm, and are thus marked herein as "N/A".

4 Conclusions

In this paper we presented a weakly supervised classification scheme for automated lesion detection in VCE videos. We followed the BoW paradigm and created a visual dictionary encoding all extracted image features into visual words. A novel contribution of this paper is that we extended our state-of-the-art colour features [16, 17] according to the bag-of-visual-words model and created BoW image descriptions, which were used to train SVM classifiers. We evaluated four different feature extraction schemes, as well as a state-of-the-art approach, and investigated, among other factors, the use of colour and different sampling schemes. Our results indicate that standard SURF features are not capable of providing a reliable description for the given problem; however, when applied to the Lab colour space, their performance is boosted. The latter features are able to provide valuable results within the proposed weakly-supervised scheme, which could be used as an alternative to fully-supervised schemes, which are demanding in terms of manual annotation effort.

Open research topics in the area of BoW with application to weakly-supervised lesion detection include the construction of visual vocabularies (flat vs. hierarchical approaches, predefined vs. dynamically selected sizes), the selection of interest points (dense vs. salient vs. hybrid), the selection of the patches surrounding the interest points (shape, size, orientation) and, of course, their description (colour vs. greyscale vs. binary descriptors). We plan to perform a thorough, systematic investigation to assess the effect of each part of the BoW scheme on the overall results, within the context of lesion detection in VCE videos.