1 Introduction

Colorectal cancer is one of the most common malignancies of the digestive system worldwide. Most colorectal cancers originate from adenomatous polyps, and colonoscopy is an important way to screen for colorectal cancer [1]. Colonoscopy-based polyp detection is a key task in medical image computing, and in recent years deep learning detection models have been widely used for it [2,3,4, 8, 16]. However, in the complex environment of the intestinal tract, bubbles, lens reflections, residues, and shadows may display polyp-like features. Such regions form suspected targets that can confuse the model (see Fig. 1).

Fig. 1.

(a) Bubbles; (b) Lens reflection; (c) Residues; (d) Virtual shadow

Currently, two-stage [2,3,4, 6] and one-stage [5, 8, 16, 23] models are the most widely used in object detection. Faster R-CNN [6], the most widely used two-stage object detection model, has been adopted in various polyp detection tasks. Mo et al. [2] provide the first evaluation of polyp detection using the Faster R-CNN framework, which offers a good trade-off between efficiency and accuracy. Shin et al. [4] propose false positive (FP) learning: they first train a network on polyp images, generate FP samples from additional normal videos, and then retrain the network with the generated FP samples added back. Sornapudi et al. [3] propose a modified region-based convolutional neural network (R-CNN) that generates masks around polyps detected in still frames. One-stage models such as You Only Look Once (YOLO) [5] are also widely used for lesion detection because of their efficiency. Wang et al. [8] propose a new anchor-free polyp detector that achieves real-time performance. Liu et al. [23] investigate the potential of the single shot detector (SSD) [18] framework for detecting polyps in colonoscopy videos, assessing three different feature extractors: ResNet50, VGG16, and InceptionV3. Tian et al. [16] propose a one-stage detection and classification approach for a new 5-class polyp classification problem.

To deal with suspected target regions, mechanisms such as the convolutional block attention module (CBAM) [7] have been proposed to make the model focus more on true target regions. Recently, Xiao et al. [10] propose a new sampling method based on the Faster R-CNN model that automatically learns features from the suspected target regions and effectively reduces false positives. Guo et al. [24] propose an active-learning-based method to tackle false positives produced by the CADe system. However, both [4] and [24] retrain the network after adding the discovered FP samples back to the training set, which makes the process complicated. We design a semi-supervised method that automatically learns suspicious targets to solve this problem.

In addition, there are other methods to detect polyps. Tajbakhsh et al. [22] propose a hybrid context-shape approach, which utilizes context information to remove non-polyp structures and shape information to reliably localize polyps. Tian et al. [25] integrate few-shot anomaly detection methods, designed to detect frames containing polyps in colonoscopy videos, with a method that rejects frames containing blurry images, feces, and water jet sprays. Liu et al. [26] propose a consolidated domain adaptive detection and localization framework to effectively bridge the domain gap between different colonoscopic datasets.

In this paper, we propose a novel one-stage polyp detection model based on YOLOv4. Our model is validated on both a private dataset and the public dataset of the MICCAI 2015 challenge [11], including CVC-Clinic 2015 and Etis-Larib; it brings significant performance improvements and outperforms most cutting-edge models. To summarize, our main contributions include: (i) A multi-branch spatial attention mechanism (MSAM) is proposed to make the model focus more on the polyp lesion regions. (ii) The top likelihood loss (Tloss) is designed with a multi-scale sampling strategy to reduce false positives by learning from suspected regions in the background. (iii) A cosine similarity loss (Csimloss) is further proposed to improve the discrimination between positive and negative images. (iv) A cross stage partial connection mechanism is introduced to make the model more efficient. (v) Finally, through extensive experiments on the private and public datasets, we demonstrate that our detection model improves detection performance compared with other recent studies on colonoscopy image datasets.

2 Methods

Our detailed model is shown in Fig. 2. The proposed framework consists of three parts: (1) a multi-branch spatial attention mechanism (MSAM), proposed to make the model pay more attention to the polyp lesion regions (Sect. 2.1); (2) the top likelihood loss and cosine similarity loss, designed for the one-stage model to reduce false positives (Sect. 2.2); (3) the cross stage partial connection, introduced to reduce model parameters through feature fusion (Sect. 2.3). During training, the proposed model jointly optimizes positive and negative images: the positive images are trained with the original loss function, while the negative images are trained with the top likelihood loss added. The pairs of positive and negative images are further optimized by the cosine similarity loss.

Fig. 2.

The architecture of the model. C-Block is the structure after adding the cross stage partial connection, and C-M-Block is the structure after adding both the cross stage partial connection and the multi-branch spatial attention mechanism (MSAM). The numbers represent the convolution kernel sizes; we set \(\mathrm {k}^{\prime } \in \{5,7,9\}\) in our model, corresponding to the three scales of the model.

2.1 Multi-branch Spatial Attention Mechanism

In order to make the model pay more attention to the polyp lesion regions and eliminate the effect of background contents, inspired by the spatial attention mechanism (SAM) [7], which locates the most important information on the feature map, we propose a multi-branch spatial attention mechanism (MSAM). We place MSAM at the three output positions of the feature fusion stage, as shown by the C-M-Block in Fig. 2, where the concrete MSAM structure is detailed. Feature fusion produces feature maps at three different scales, whose receptive fields target objects of different sizes.

Given an input F, we compute the MSAM map \(A_{\mathrm {s}}=\sigma \left( \sum _{\mathrm {k}^{\prime }} f^{\mathrm {k}^{\prime } \times \mathrm {k}^{\prime }}(F)\right) \), where \(f^{\mathrm {k}^{\prime } \times \mathrm {k}^{\prime }}\) represents the convolution operation with kernel size \({\mathrm {k}^{\prime } \times \mathrm {k}^{\prime }} \) and \(\sigma \) represents the sigmoid activation function. We set \(\mathrm {k}^{\prime } \in \{5,7,9\}\) in our model, corresponding to the three scales of the model: the \(\mathrm {9} \times \mathrm {9}\) kernel corresponds to the smaller receptive field, the \(\mathrm {7} \times \mathrm {7}\) kernel to the middle-scale receptive field, and the \(\mathrm {5} \times \mathrm {5}\) kernel to the larger receptive field.
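A minimal PyTorch sketch of MSAM under the formula above is given below. The use of single-channel attention maps and elementwise reweighting of F is our assumption, since the text specifies only the branch kernels, the summation, and the sigmoid.

```python
import torch
import torch.nn as nn

class MSAM(nn.Module):
    """Multi-branch spatial attention: A_s = sigmoid(sum_k' f_{k'xk'}(F))."""
    def __init__(self, in_channels, kernel_sizes=(5, 7, 9)):
        super().__init__()
        # One convolution branch per kernel size; padding preserves H x W.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, 1, k, padding=k // 2) for k in kernel_sizes
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Sum the single-channel branch responses, then squash to (0, 1).
        attn = self.sigmoid(sum(branch(x) for branch in self.branches))
        return x * attn  # spatially reweight the input feature map
```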

2.2 Top Likelihood and Similarity Loss

We design the top likelihood loss and cosine similarity loss to reduce false positives. The implementation details of these losses are summarized in Fig. 3.

Fig. 3.

The illustration of the multi-scale top likelihood loss and cosine similarity loss, where the solid points represent the selected samples: (a) shows the top likelihood loss with the multi-scale sampling strategy; K for each scale is set to 50. (b) Within the same batch, positive and negative samples of the same scale are used to compute the cosine similarity loss.

Top Likelihood Loss. When optimizing negative samples, those images do not carry any annotation information, so all areas would be randomly sampled with equal chance. As a result, the suspected target regions would have little chance of being trained on, since they usually occupy only a small portion of the image. The prediction would then bias towards normal features, leading to false positive detections. To solve this problem, we design the top likelihood loss with a multi-scale sampling strategy in a one-stage model. When dealing with negative images, we use the top likelihood loss and select the proposals with the top confidence scores.

Different from two-stage models, YOLOv4 directly generates the object confidence score, category probability, and border regression. When training on negative images, we compute the confidence scores and select the 50 negative anchor boxes with the highest scores on each scale (150 in total) to calculate the loss. The boxes with high scores are the most likely to represent suspected target regions, and as long as their scores are minimized, all boxes are optimized towards negative regions. The top likelihood loss is defined as:

$$\begin{aligned} \mathrm {L}_{\text{ tloss } }=\frac{1}{N_{\text{ obj } }} \sum _{i \in \text{ tops } } L_{\text{ obj } }\left( p_{i}, p_{i}^{*}=0\right) \end{aligned}$$
(1)

Here, i indexes the selected anchors in a batch, \(N_{\text{ obj }}\) is the number of selected anchor boxes, and \(p_{i}\) represents the predicted score of the i-th anchor. \(L_{obj}\) is the cross-entropy loss.
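A hedged sketch of this loss for one negative image is shown below, assuming `obj_logits_per_scale` holds the flattened objectness logits of the three YOLOv4 scales; the normalization by the number of selected anchors follows Eq. (1).

```python
import torch
import torch.nn.functional as F

def top_likelihood_loss(obj_logits_per_scale, k=50):
    """Top likelihood loss for a negative image: push the k most
    confident anchors per scale towards the background label 0."""
    losses, n_selected = [], 0
    for logits in obj_logits_per_scale:  # e.g. three scales in YOLOv4
        flat = logits.reshape(-1)
        n = min(k, flat.numel())
        topk = torch.topk(flat, n).values  # the "suspected" anchors
        target = torch.zeros_like(topk)    # p_i* = 0 for all of them
        losses.append(F.binary_cross_entropy_with_logits(
            topk, target, reduction='sum'))
        n_selected += n
    return sum(losses) / n_selected  # average over selected anchors (150 here)
```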

Cosine Similarity Loss. We further propose the cosine similarity loss to improve the discrimination between positive and negative images. To train our model sufficiently, we use all pairs of positive and negative images when computing the cosine similarity loss. Specifically, the numbers of positive and negative images in each batch are random. In order to fully learn the characteristics distinguishing positive from negative images, we compute the similarity loss between every positive and negative image within the same batch and take the average. When the network processes a positive image, we take the positive samples with the top K scores; when it processes a negative image, we select the K highest predicted classification scores and pair them with the positive ones. Assuming A positive images and B negative images within one batch, there are \(A\times {B}\) positive-negative pairs. The similarity loss is obtained by computing the cosine similarity of the K paired feature vectors and averaging over the \(A\times {B}\) pairs.

$$\begin{aligned} L_{\text{ csimloss } }\left( X_{1}, X_{2}\right) =\frac{1}{A \times B} \sum _{j=1}^{A \times B}\left[ \frac{1}{K} \sum _{i=1}^{K} {\text {csim}}\left( X_{1}^{i}, X_{2}^{i}\right) \right] \end{aligned}$$
(2)

where \(X_{1}^{i}, X_{2}^{i}\) are the feature vectors from the positive and negative images and csim is the cosine similarity, \({\text {csim}}\left( X_{1}^{i}, X_{2}^{i}\right) =\frac{X_{1}^{i} \cdot X_{2}^{i}}{\left\| X_{1}^{i}\right\| \left\| X_{2}^{i}\right\| }=\frac{\sum _{j=1}^{n} X_{1,j}^{i} X_{2,j}^{i}}{\sqrt{\sum _{j=1}^{n}\left( X_{1,j}^{i}\right) ^{2}} \sqrt{\sum _{j=1}^{n}\left( X_{2,j}^{i}\right) ^{2}}}\).
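The following sketch computes Eq. (2) in PyTorch, assuming the top-K feature vectors of the positive and negative images in a batch have already been gathered into tensors of shape [A, K, D] and [B, K, D]; the tensor names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_loss(pos_feats, neg_feats):
    """Average cosine similarity over all A*B positive-negative image
    pairs and their K paired feature vectors (Eq. 2)."""
    A, K, D = pos_feats.shape
    B = neg_feats.shape[0]
    # Broadcast to all (positive, negative) image pairs.
    p = pos_feats.unsqueeze(1).expand(A, B, K, D)
    n = neg_feats.unsqueeze(0).expand(A, B, K, D)
    sim = F.cosine_similarity(p, n, dim=-1)  # shape [A, B, K]
    # Mean over K vectors per pair, then over the A*B pairs.
    return sim.mean()
```

Minimizing this loss pushes the paired positive and negative feature vectors apart, which is what improves the discrimination between positive and negative images.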

2.3 Cross Stage Partial Connection

We further introduce the Cross Stage Partial Network (CSPNet) [13] into our model. By dividing the gradient flow, CSPNet makes the gradients propagate through different network paths, which improves inference speed. As shown in Fig. 2, the feature fusion part includes five modules: three up-sampling and two down-sampling. As shown by C-Block and C-M-Block at the bottom right of Fig. 2, Block represents the original connection, while C-Block and C-M-Block represent the connection after adding CSP. Through the split-and-merge strategy, the number of gradient paths is doubled, and the cross-stage strategy alleviates the drawbacks of using explicit feature map copies for concatenation. As shown in Table 1, the number of parameters decreases significantly after adding this operation.
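As an illustration of the split-and-merge strategy, a minimal CSP-style block might look as follows; the layers inside the processed path are placeholders, not the exact C-Block design from Fig. 2.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Cross stage partial connection: split channels, process one half,
    concatenate both halves, then fuse (assumes an even channel count)."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.block = nn.Sequential(            # the "processed" path
            nn.Conv2d(half, half, 3, padding=1),
            nn.BatchNorm2d(half),
            nn.LeakyReLU(0.1),
        )
        self.fuse = nn.Conv2d(channels, channels, 1)  # merge after concat

    def forward(self, x):
        part1, part2 = torch.chunk(x, 2, dim=1)  # divide the gradient flow
        return self.fuse(torch.cat([part1, self.block(part2)], dim=1))
```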

3 Experiment

3.1 Datasets

In order to verify the effectiveness of the proposed method, we conduct experiments on two datasets: the private colonic polyp dataset and the public dataset, which includes CVC-Clinic 2015 and Etis-Larib.

Private Polyp Dataset. A private colonic polyp dataset is collected and labeled from the Colorectal and Anorectal Surgery Department of a local hospital; it contains 1720 colon polyp images from 175 patients. The 1720 images are randomly divided into training and testing sets with a ratio of 4:1. We simulate the actual application scenes of colonoscopy and expand the dataset accordingly with blur, brightness changes, deformation, and other augmentations, finally expanding it to 3582 images. These colon polyp images are combined with 1000 normal images without annotation information to build the training set. The original image size varies from \(612\times {524}\) to \(1280\times {720}\), and we resize all images to \(512\times {512}\).

MICCAI 2015 Colonoscopy Polyp Automatic Detection Classification Challenge. The challenge contains two datasets; the model is trained on CVC-Clinic 2015 and evaluated on Etis-Larib. The CVC-Clinic 2015 dataset contains 612 standard well-defined images extracted from 29 different sequences. Each sequence consists of 6 to 26 frames and contains at least one polyp in a variety of viewing angles, distances, and views. Each polyp is manually annotated with a mask that accurately delineates its boundaries. The resolution is 384 \(\times \) 288. The Etis-Larib dataset contains 196 high-resolution images with a resolution of 1225 \(\times \) 966, including 44 distinct polyps obtained from 34 sequences.

3.2 Evaluation and Results

Evaluation Criteria. We use the same evaluation metrics presented in the MICCAI 2015 challenge to perform a fair evaluation of our polyp detector's performance.

Since false negatives are more harmful in this particular medical application, we also calculate the F1 and F2 scores. The evaluation criteria are as follows:

$$\begin{aligned} \begin{aligned} Precision =\frac{T P}{T P+F P} \quad \quad \quad \quad \quad \quad Recall =\frac{T P}{T P+F N} \\ F 1=\frac{2 *{ Precision } * { Recall }}{ { Precision }+ { Recall }} \quad \quad F 2=\frac{5 *{ Precision } * { Recall }}{4 * { Precision }+ { Recall }} \end{aligned} \end{aligned}$$
(3)

where TP, FP, and FN denote the true positive, false positive, and false negative cases, respectively.
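For concreteness, a small helper computing these metrics from raw counts might look as follows (the counts are assumed to give nonzero denominators).

```python
def detection_metrics(tp, fp, fn):
    """Precision, Recall, F1 and F2 from TP/FP/FN counts, per Eq. (3)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    f2 = 5 * precision * recall / (4 * precision + recall)  # weights recall higher
    return precision, recall, f1, f2
```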

Implementation Details. Our model is implemented in the PyTorch framework and runs on NVIDIA GeForce RTX 2080Ti GPU servers. We set the batch size to 8. During training, we use the SGD optimizer and perform random angle rotation and image scaling for data augmentation. The training contains 2000 epochs with 574 iterations each. Normally, the training process starts with a high learning rate that is decreased at certain intervals as training goes on. However, a large learning rate applied to a randomly initialized network may destabilize training. To solve this problem, we apply a smooth cosine learning rate schedule [12]. The learning rate \(\alpha _{t}\) is computed as \(\alpha _{t}=\frac{1}{2}\left( 1+\cos \left( \frac{t \pi }{T}\right) \right) \alpha \), where t represents the current epoch, T the total number of epochs, and \(\alpha \) the initial learning rate.
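A minimal sketch of this schedule, directly implementing the formula above:

```python
import math

def cosine_lr(epoch, total_epochs, base_lr):
    """alpha_t = 0.5 * (1 + cos(t * pi / T)) * alpha."""
    return 0.5 * (1.0 + math.cos(epoch * math.pi / total_epochs)) * base_lr

# e.g. update the optimizer at the start of each epoch:
# for g in optimizer.param_groups:
#     g['lr'] = cosine_lr(epoch, 2000, base_lr)
```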

Fig. 4.

(a) Original image with ground truth label (solid line box); (b) Heatmap generated by the original YOLOv4; (c) Heatmap generated by YOLOv4+MSAM; (d) Original image with ground truth label (solid line box) and suspected target regions (dashed line box); (e) Heatmap generated by YOLOv4+MSAM; (f) Heatmap generated by YOLOv4+MSAM+Tloss (top likelihood loss).

Ablation Experiments on the Private Dataset. In order to study the effect of MSAM and the new loss functions, we conduct ablation experiments on our private dataset. As shown in Table 1, compared to the YOLOv4 baseline, our proposed MSAM increases the Recall by 4.5%, raising F1 and F2 by 2.2% and 4.0%, respectively. Adding the top likelihood loss alone increases the Precision by 4.4%, and combining it with the cosine similarity loss increases both Precision and Recall, by 2.9% and 3.1%, respectively. Finally, the model boosts performance over all the metrics when combining MSAM, the top likelihood and similarity losses, and the CSP module, leading to increases of Precision by 4.4%, Recall by 3.7%, F1 by 4.0%, and F2 by 3.8%. It is also worth noting that CSP makes the model more efficient, decreasing FLOPs by 10.74% (8.66 to 7.73) and parameters by 15.7% (63.94 to 53.9).

We also show some visualization results of the heatmap (the last feature map of YOLOv4) for the ablation comparison (Fig. 4). The results demonstrate that MSAM makes the model focus more on the ground truth areas, while the top likelihood loss lets the model better identify the suspected target regions and pay less attention to such areas.

Table 1. The results on the private polyp datasets.

Results and Comparisons on the Public Dataset. The results on the public dataset are shown in Table 2, where we also test several previous models from the MICCAI 2015 challenge. The results show that our method improves performance on almost all metrics. Compared to the baseline, our proposed approach achieves a great performance boost, yielding increases of Precision by 11.8% (0.736 to 0.854), Recall by 7.5% (0.702 to 0.777), F1 by 9.5% (0.719 to 0.814), and F2 by 8.2% (0.709 to 0.791). It is worth noting that the depth of the CSPDarknet53 backbone of YOLOv4 is almost the same as that of ResNet50. Nevertheless, our approach significantly outperforms the state-of-the-art models of Sornapudi et al. [3], with a ResNet101 backbone, and Liu et al. [23], with an InceptionV3 backbone. Compared with Liu et al. [23], although our method slightly decreases the Recall by 2.6% (0.803 to 0.777), it increases Precision by 11.5% (0.739 to 0.854), F1 by 4.6% (0.768 to 0.814), and F2 by 0.2% (0.789 to 0.791). We also present the frames per second (FPS) of each model, which shows that our one-stage model is much faster than the others: 5.3 times faster than Faster R-CNN (37.2 vs 7), 11.6 times faster than Sornapudi et al. [3] (37.2 vs 3.2), and 1.2 times faster than Liu et al. [23] (37.2 vs 32). Furthermore, the PR curve is plotted in Fig. 5. Compared with the baseline, our proposed approach increases the AP by 5.1% (0.728 to 0.779).

Table 2. Results of the different modes on MICCAI 2015 challenge dataset.
Fig. 5.

Precision-Recall curves for all the methods. The performance of the proposed approach is much better than that of the teams that attended the MICCAI challenge.

4 Conclusions

In this paper, we propose an efficient and accurate object detection method for colonoscopic polyps. We design an MSAM mechanism to make the model pay more attention to the polyp lesion regions and eliminate the effect of background content. To make our network more efficient, we build our method on a one-stage object detection model. Our model is further jointly optimized with the top likelihood and similarity losses to reduce false positives caused by suspected target regions. A cross stage partial connection mechanism is further introduced to reduce the parameters. Our approach boosts performance compared to the state-of-the-art methods on both a private polyp detection dataset and the public MICCAI 2015 challenge dataset. In the future, we plan to extend our model to more complex scenes, such as gastric polyp detection and lung nodule detection, to achieve accurate and real-time lesion detection.