1 Introduction

Gastric cancer is a prevalent disease that poses a serious threat to human health [1]. According to estimates by the International Agency for Research on Cancer (WHO), around 1.09 million individuals worldwide were diagnosed with gastric cancer in 2020, with 769,000 deaths attributed to the disease, ranking it as the sixth most prevalent malignancy and the fourth most deadly worldwide. China accounts for 36.42\(\%\) of new gastric cancer cases and 37.58\(\%\) of deaths worldwide, underscoring the urgency of prevention and treatment. However, gastric cancer is frequently difficult to detect at an early stage because most patients lack specific symptoms, which is one reason for the current low rate of early diagnosis and treatment. In addition, gastric polyps can bleed at the surface due to the corrosive effects of gastric acid, leading to symptoms such as anemia in later stages. If polyps are detected early and managed with regular follow-up or direct surgical treatment, patient survival can be significantly improved. Therefore, effective prevention and treatment measures should be taken to reduce the harm caused by gastric cancer.

The main method for detecting polyps is gastroscopy screening, which can be divided into traditional manual screening and assisted screening based on deep learning technology. Traditional manual screening has clear limitations. Because polyps vary irregularly in texture [2, 3], shape [4, 5], size [5], color [6, 7], and other characteristics, manual screening is not only time-consuming but also requires doctors to have specialized knowledge. Moreover, even experienced experts can miss or misdetect some polyps owing to factors such as fatigue [8] and the characteristics of the polyps themselves [9]. Therefore, developing computer-aided diagnosis technology to assist doctors in polyp detection is of great significance.

Thanks to significant breakthroughs in GPU computing power, researchers have recently devoted considerable effort to computer vision. Compared with conventional manual screening, deep learning-based approaches can help doctors focus their attention on suspected polyps rather than spending time on a massive number of normal images. Deep learning is therefore now widely applied in medical imaging [10, 11], with the aim of improving the precision of polyp detection and reducing the risk of missed and erroneous diagnoses, providing doctors with more comprehensive, reliable, and efficient diagnostic tools.

In 2020, Deeba et al. [12] introduced a computer-assisted algorithm for detecting polyps in both colonoscopy and wireless capsule endoscopy (WCE). The algorithm involved several key components, including image enhancement, the generation of saliency maps, and the extraction of histogram of oriented gradients (HOG) features, all of which played a crucial role in the final classification. By amplifying clinically significant features and reducing the number of search windows through saliency-based selection, the algorithm improved detection efficiency. In 2021, Qadir et al. [13] created an F-CNN-based colon polyp detection system that used a two-dimensional Gaussian mask instead of a binary mask, allowing it to detect flat and small polyps with blurred boundaries between polyp and background, thereby reducing the rate of missed detections. In 2021, Taş et al. [14] suggested a preprocessing approach that applied a super-resolution convolutional neural network (SRCNN) to enhance the resolution of colonoscopy images prior to polyp localization; this improved both the recall and the accuracy of the model compared with the low-resolution case. In 2021, Chen et al. [15] improved the saliency of the polyp area by enhancing the contrast of the input image through the separation of foreground and background. The enhanced data were fed into an improved deep residual convolutional neural network combined with ensemble learning for automatic colon polyp detection; by adding attention modules, the network could focus on useful feature channels and suppress invalid ones, greatly improving detection precision. In 2021, Cao et al. [16] proposed a gastric polyp detection network based on YOLOv3 that incorporated a feature extraction and fusion module. The network exploited the semantic information of both high-level and low-level feature maps, improving the detection of small polyps and achieving a recall of 86.2\(\%\). In 2022, Nisha et al. [17] proposed a dual-path convolutional neural network (DP-CNN) that combined image enhancement, the DP-CNN structure, and a sigmoid classifier to detect polyps, successfully distinguishing polyp from non-polyp patches in colonoscopy images while keeping complexity low with fewer learnable parameters. In 2022, Hu et al. [18] proposed NeutSS-PLP, a novel approach for extracting polyp regions from colonoscopy images. The method combines neutrosophic uncertainty theory with saliency detection to identify and suppress specular reflections in colonoscopy images more accurately, and introduces two-level short connections into the saliency detection network to extract multi-level, multi-scale features for better polyp region extraction.

Several effective strategies have been proposed for the conventional polyp detection problem and have performed well in terms of accuracy, recall, and feasibility. However, because polyp targets are irregular, low in resolution, and carry limited feature information, conventional detection models often miss or falsely detect such small targets. This study therefore optimized the YOLOv5 model for small polyp detection in the following respects. First, to address the loss of information on small polyps, a new network was developed by adding a small target detection head and utilizing the Swin Transformer to increase the network's sensitivity to small targets, thereby improving small polyp detection. Second, to fully exploit information across scales, the new network integrated the ASFF module, extended here to four detection heads. Additionally, to weaken the influence of non-target regions of the image on the model output, a plug-and-play Res-PATM attention module was proposed based on the PATM module. The proposed PATM-YOLO algorithm achieved 91.3\(\%\) precision and 86.6\(\%\) recall on the constructed dataset and 95.6\(\%\) precision and 90.8\(\%\) recall on the public polyp dataset SUN, outperforming the compared algorithms on both datasets. These results demonstrate the effectiveness of the proposed PATM-YOLO algorithm in detecting small polyps.

2 Material and methods

2.1 Dataset

In this study, the polyp-related parts of the collected datasets [23, 24] were extracted and combined into a new dataset in order to test the detection capability of the model on richer data. The constructed dataset consists of 1,759 images, most of which contain small polyps, and can support detection tasks for small polyps. Table 1 displays the distribution of the images: 1,127 were used for training the network model, 281 for validation, and 351 for testing the model's performance. To further validate the model's effectiveness, experiments were also conducted on the publicly accessible polyp dataset SUN [25] using the same techniques. SUN is a public polyp detection dataset containing 49,136 images with polyp information collected from 100 patients and divided into 100 parts by patient. To ensure experimental objectivity, these images were randomly partitioned into training, validation, and test sets: 31,448 images containing polyp information were used for training, 7,862 for validation, and the remaining 9,826 for testing.
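For concreteness, the random partition described above can be sketched as follows. This is a minimal illustration assuming the SUN images are available as a flat directory of JPEG files (the directory layout and file extension are assumptions, not part of the original dataset description); note that 31,448 + 7,862 + 9,826 = 49,136, i.e., roughly a 64/16/20 split.

import random
from pathlib import Path

def split_dataset(image_dir, train_n=31448, val_n=7862, seed=0):
    # Collect all image paths; sorting first makes the shuffle reproducible.
    paths = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(paths)          # fixed seed for reproducibility
    train = paths[:train_n]                     # 31,448 training images
    val = paths[train_n:train_n + val_n]        # 7,862 validation images
    test = paths[train_n + val_n:]              # remaining 9,826 test images
    return train, val, test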

Table 1 Details regarding the public datasets employed in this study

2.2 The improved ultra-small target detection head

Because the dataset contains a high proportion of small polyps with significant irregularity in shape, texture, and size, YOLOv5 does not perform optimally in detecting them. To address this issue, this study added a detection head for small targets to the model's head to counteract the missed detections that YOLOv5 produces on small polyps [19]. This approach enhances the model's detection accuracy for small polyps without compromising its ability to detect polyps of other sizes.
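A minimal sketch of the idea is shown below: a YOLOv5-style head extended from three to four detection scales, where the additional stride-4 (P2) branch serves as the small-target head. The channel counts and feature shapes are illustrative assumptions rather than the exact PATM-YOLO configuration.

import torch
import torch.nn as nn

class MultiScaleHead(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512), num_anchors=3, num_classes=1):
        super().__init__()
        out_ch = num_anchors * (num_classes + 5)   # box (4) + objectness (1) + classes
        # one 1x1 prediction conv per pyramid level: P2, P3, P4, P5
        self.preds = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in channels)

    def forward(self, feats):                      # feats: [P2, P3, P4, P5]
        return [pred(f) for pred, f in zip(self.preds, feats)]

# Feature maps for a 640 x 640 input at strides 4, 8, 16, 32;
# the stride-4 P2 map is the extra ultra-small-target branch.
feats = [torch.randn(1, c, 640 // s, 640 // s)
         for c, s in zip((64, 128, 256, 512), (4, 8, 16, 32))]
outs = MultiScaleHead()(feats)                     # four prediction maps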

2.3 Improved PATM attention module

Attention mechanisms are currently widely used in the vision field. Inspired by this, this study introduces the PATM attention module, which combines the advantages of the smaller inductive bias and simpler architecture of MLPs, to strengthen the network model's attention to valid targets and suppress attention to non-target areas [20], based on the following principle:

The PATM attention module characterizes a token as a wave function that possesses phase and amplitude, defined as follows:

$$\begin{aligned} \tilde{Z}_p = |Z_p| \odot e^{i\theta_p}, \quad p=1,2,\ldots,m, \end{aligned}$$
(1)

where \( i \) is the imaginary unit satisfying \( i^2 = -1 \), \( |\cdot| \) is the absolute value operator, and \( \odot \) is the element-wise product. The amplitude \( |Z_p| \) is the real-valued feature of each token, \( e^{i\theta_p} \) is a periodic function, and the phase \( \theta_p \) corresponds to the current position of the token within a wave period. The phase term \( \theta_p \) affects how different tokens are summed during aggregation.

The amplitude information \( Z_p \) and phase information \( \theta_p \) are computed from the given input features using Formula 2 and Formula 3, respectively.

$$\begin{aligned} Z_p = \text{Channel-FC}(X_p, W^c) = W^c X_p, \quad p=1,2,\ldots,m, \end{aligned}$$
(2)
$$\begin{aligned} \theta _p = \Theta (X_p,W^\theta ), \end{aligned}$$
(3)

where \( W^c \) is a learnable weight of the Channel-FC operation and \( W^\theta \) denotes the learnable parameters of the phase estimation function \( \Theta \).

Fig. 1 Schematic diagram of the Res-PATM principle

As Formula 1 is expressed in the complex domain, Formula 4 is needed to expand it into its real and imaginary parts.

$$\begin{aligned} \tilde{Z}_p = |Z_p| \odot \cos\theta_p + i|Z_p| \odot \sin\theta_p, \quad p=1,2,\ldots,m, \end{aligned}$$
(4)

In the above formula, the real and imaginary parts of the complex-valued tokens are represented by two vectors, respectively. Different tokens \( \tilde{Z}_p \) are then merged using the Token-FC operation, i.e.:

$$\begin{aligned} \tilde{O}_p = \text{Token-FC}(\tilde{Z}, W^t)_p = \sum_q W_{pq}^t \odot \tilde{Z}_q, \quad p=1,2,\ldots,m, \end{aligned}$$
(5)

where \( \tilde{Z}=[\tilde{Z}_1, \tilde{Z}_2,\ldots,\tilde{Z}_m] \) denotes all the wave-like tokens in one layer. In Formula 5, the interaction between tokens takes both their amplitude and phase information into account, and the resulting output \( \tilde{O}_p \) is a complex-valued combination of the features. Following the common quantum measurement approach of projecting a quantum state, characterized by a complex-valued representation, onto an observable real value, the real-valued output \( O_p \) is obtained by weighting and summing the real and imaginary parts of \( \tilde{O}_p \) with learnable parameters. Combining this with Formula 5, the output \( O_p \) is:

$$\begin{aligned} O_p = \sum_q \left( W_{pq}^t Z_q \odot \cos\theta_q + W_{pq}^i Z_q \odot \sin\theta_q \right), \quad p=1,2,\ldots,m, \end{aligned}$$
(6)

where \( W^t \) and \( W^i \) are weights with learnable parameters. In the above formula, the phase \( \theta_q \) adjusts itself dynamically according to the semantic content of the input, so that, in addition to the fixed weights, the phase also modulates the aggregation of different tokens.
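The step from Formulas 4 and 5 to Formula 6 can be made explicit: substituting the expansion of Formula 4 into the aggregation of Formula 5 gives

$$\begin{aligned} \tilde{O}_p = \sum_q W_{pq}^t \odot \tilde{Z}_q = \sum_q W_{pq}^t Z_q \odot \cos\theta_q + i\sum_q W_{pq}^t Z_q \odot \sin\theta_q, \end{aligned}$$

and projecting onto a real value, with the imaginary branch given its own learnable weight \( W^i \) in place of \( W^t \), yields exactly Formula 6.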

As shown in Fig. 1, given the input features, the PATM attention module generates amplitude and phase information using Formula 2 and the phase estimation function in Formula 3, respectively. The features are expanded into wave-like tokens using Formula 4 and aggregated with Formula 6 to obtain the complex-valued representation. The output features are then transformed with another Channel-FC to increase representational power and fed into bottleneck residual modules to deepen the network, thereby improving the detection capability for small targets. This study refers to the improved module as Res-PATM.
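The following PyTorch sketch illustrates the Res-PATM structure described above. It is written from the formulas in this section rather than from any official implementation: the Channel-FC operations are realized as 1x1 convolutions, the Token-FC weights \( W^t \) and \( W^i \) are approximated by depthwise convolutions over neighboring tokens, and two bottleneck residual modules follow the phase-aware mixing, matching the optimum found in Fig. 2. All layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

class PATM(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.amp = nn.Conv2d(dim, dim, 1)      # Channel-FC for the amplitude Z_p (Formula 2)
        self.phase = nn.Conv2d(dim, dim, 1)    # phase estimation Theta (Formula 3)
        # Token-FC weights W^t and W^i for the cos/sin branches (Formula 6),
        # approximated here by depthwise convolutions over neighboring tokens
        self.token_re = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.token_im = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.proj = nn.Conv2d(dim, dim, 1)     # output Channel-FC

    def forward(self, x):
        z, theta = self.amp(x), self.phase(x)
        # real/imaginary expansion (Formula 4) followed by aggregation (Formula 6)
        o = self.token_re(z * torch.cos(theta)) + self.token_im(z * torch.sin(theta))
        return self.proj(o)

class Bottleneck(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 1), nn.SiLU(),
            nn.Conv2d(dim // 2, dim, 3, padding=1), nn.SiLU(),
        )

    def forward(self, x):
        return x + self.body(x)                # residual shortcut

class ResPATM(nn.Module):
    """PATM followed by two bottleneck residual modules (the Fig. 2 optimum)."""
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(PATM(dim), Bottleneck(dim), Bottleneck(dim))

    def forward(self, x):
        return self.block(x)

y = ResPATM(128)(torch.randn(1, 128, 80, 80))  # shape-preserving: (1, 128, 80, 80)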

Fig. 2 Comparison of experimental results of the four different methods

To investigate how the improved PATM attention module should be combined with the bottleneck residual module, this study compared the four ways of adding them shown in Fig. 2. These experiments aimed to maximize the detection results of the network and determine the optimal number of bottleneck residual modules to use in the PATM attention module.

The experimental findings presented in Fig. 2 indicate that the optimal performance of the network is achieved when two bottleneck residual modules are incorporated into the PATM attention module, allowing it to concentrate on more relevant information.

2.4 Determining the location of the Swin Transformer

The Swin Transformer-v2 architecture benefits from the shifted-window operation, which restricts attention to a local window and reduces computational cost. Additionally, the patch merging operation increases the receptive field and yields multi-scale features [21]. Swin Transformer-v2 also keeps activation amplitudes controllable by applying layer normalization after each residual branch. Inspired by these methods, this study replaced some of the original Cross Stage Partial (CSP) modules in YOLOv5 with CSP modules based on the Swin Transformer-v2 architecture (Swin-CSP). The schematic diagram of the Swin-CSP module is shown in Fig. 3.
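As a rough illustration of the Swin-CSP idea, the sketch below combines a CSP-style channel split with a single windowed self-attention block using Swin-v2-style post-normalization. It is deliberately simplified (one un-shifted window stage, no relative position bias, illustrative sizes); the actual module follows the Swin Transformer-v2 block in Fig. 3.

import torch
import torch.nn as nn

class WindowAttentionBlock(nn.Module):
    def __init__(self, dim, window=8, heads=4):    # dim assumed divisible by heads
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)             # post-norm, as in Swin-v2
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, C, H, W); H, W divisible by window
        B, C, H, W = x.shape
        w = self.window
        # partition into non-overlapping w x w windows -> (B * nWindows, w*w, C)
        t = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, w * w, C)
        t = t + self.norm1(self.attn(t, t, t, need_weights=False)[0])
        t = t + self.norm2(self.mlp(t))
        # reverse the window partition back to (B, C, H, W)
        t = t.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return t.reshape(B, C, H, W)

class SwinCSP(nn.Module):
    def __init__(self, dim):                       # dim assumed divisible by 8
        super().__init__()
        self.split1 = nn.Conv2d(dim, dim // 2, 1)
        self.split2 = nn.Conv2d(dim, dim // 2, 1)
        self.swin = WindowAttentionBlock(dim // 2)
        self.fuse = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        a = self.swin(self.split1(x))              # transformer branch
        b = self.split2(x)                         # shortcut branch
        return self.fuse(torch.cat([a, b], dim=1))

y = SwinCSP(256)(torch.randn(1, 256, 64, 64))      # shape-preserving: (1, 256, 64, 64)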

Fig. 3 Illustrations related to Swin Transformer-v2: (a) the schematic diagram of the Swin Transformer-v2 block; (b) the Swin-CSP module

Table 2 Experimental results of optimal placement positions

To further verify the optimal placement of Swin Transformer modules in the network, this study conducted experimental comparisons of the optimal placement positions, as shown in Table 2.

The results presented in Table 2 demonstrate that the detection model performs best when one CSP module in the backbone network and all CSP modules in the neck are replaced, extracting more effective features than the other network structures. This configuration was adopted in the PATM-YOLO model: as depicted in Fig. 5, only one CSP module in the backbone was replaced, whereas all CSP modules in the neck were replaced.

2.5 The improved ASFF module

The original structure of YOLOv5 is affected by the irregular variation in polyp size and shape, so polyps of different sizes differ in detection difficulty. To reduce these fluctuations, this study introduced the ASFF module [22]. The fundamental idea of ASFF is to let the network adaptively learn spatial fusion weights for features at different scales; its implementation can be divided into two parts, feature size unification and scale fusion, as follows:

Unified Feature Size: Because the feature layers in the network head have different resolutions and channel numbers, yet must eventually be summed as in Formula 7, it is crucial that each layer has uniform channel numbers and feature map size. This is achieved by first using a regular convolution to equalize the channel numbers, and then applying level-specific upsampling or downsampling to equalize the feature map size. As shown in Fig. 4, the blue lines indicate downsampling operations and the red lines indicate upsampling operations used to increase resolution.

Fig. 4 Framework of the ASFF module

As shown in Formula 7, the scale fusion operation is illustrated for the l-th level. After dividing the feature layers of different resolutions into levels, the same-resolution feature maps obtained from the other three levels after the feature size unification operation (i.e., up- or downsampling) are weighted and summed to obtain the final features.

$$\begin{aligned} y_{ij}^l = \alpha_{ij}^l \cdot x_{ij}^{1 \rightarrow l} + \beta_{ij}^l \cdot x_{ij}^{2 \rightarrow l} + \gamma_{ij}^l \cdot x_{ij}^{3 \rightarrow l} + \delta_{ij}^l \cdot x_{ij}^{4 \rightarrow l} \end{aligned}$$
(7)

where \( \alpha_{ij}^l \), \( \beta_{ij}^l \), \( \gamma_{ij}^l \), and \( \delta_{ij}^l \) are the spatial fusion weights of the corresponding feature maps relative to the level-l feature map, shared across channels. Note that \( \alpha_{ij}^l \), \( \beta_{ij}^l \), \( \gamma_{ij}^l \), and \( \delta_{ij}^l \) are subject to the constraints \( \alpha_{ij}^l+\beta_{ij}^l+\gamma_{ij}^l+\delta_{ij}^l=1 \) and \( \alpha_{ij}^l, \beta_{ij}^l, \gamma_{ij}^l, \delta_{ij}^l \in [0,1] \) and are defined as follows:

$$\begin{aligned} \alpha _{ij}^l=\frac{e^{\lambda _{\alpha _{ij}}^l}}{e^{\lambda _{\alpha _{ij}}^l}+e^{\lambda _{\beta _{ij}}^l}+e^{\lambda _{\gamma _{ij}}^l}+e^{\lambda _{\delta _{ij}}^l}} \end{aligned}$$
(8)

where \( \alpha_{ij}^l \), \( \beta_{ij}^l \), \( \gamma_{ij}^l \), and \( \delta_{ij}^l \) are defined by the softmax function with \( \lambda_{\alpha_{ij}}^l \), \( \lambda_{\beta_{ij}}^l \), \( \lambda_{\gamma_{ij}}^l \), and \( \lambda_{\delta_{ij}}^l \) as the control parameters.
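A compact sketch of the four-level fusion in Formulas 7 and 8 is given below. It assumes the four input feature maps have already been unified in resolution and channel width; the per-level 1x1 convolutions that produce the weight logits, and the hidden width of 16, are illustrative assumptions.

import torch
import torch.nn as nn

class ASFF4(nn.Module):
    def __init__(self, channels, hidden=16):
        super().__init__()
        # one 1x1 conv per level compresses features into weight-logit maps
        self.logits = nn.ModuleList(nn.Conv2d(channels, hidden, 1) for _ in range(4))
        self.to_weight = nn.Conv2d(4 * hidden, 4, 1)   # lambda_alpha .. lambda_delta

    def forward(self, xs):                             # xs: four (B, C, H, W) maps, same shape
        lam = self.to_weight(torch.cat([g(x) for g, x in zip(self.logits, xs)], dim=1))
        w = torch.softmax(lam, dim=1)                  # Formula 8: weights sum to 1, each in [0, 1]
        # Formula 7: pixel-wise weighted sum, weights shared across channels
        return sum(w[:, i:i + 1] * xs[i] for i in range(4))

fused = ASFF4(256)([torch.randn(1, 256, 40, 40) for _ in range(4)])  # (1, 256, 40, 40)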

2.6 The PATM-YOLO algorithm

Based on the above enhancements, the present study proposes an improved YOLOv5-based algorithm, Phase-Aware Token Module based YOLOv5 (PATM-YOLO), dedicated to reducing the missed detection of dense, small polyps caused by the loss of information on small polyps, the unevenness of polyp texture and size, and the complexity of the detection background. Figure 5 depicts the structural diagram of the PATM-YOLO algorithm.

Fig. 5 Structural diagram of the PATM-YOLO algorithm

3 Results

3.1 Implementation details

3.1.1 Training setting

In this paper, the experimental setup utilizes the Ubuntu 20.04 operating system. The central processing unit (CPU) is a 24 vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz, while the graphics processing unit (GPU) is an RTX 3090 with 24 GB of memory. The experimental environment is configured with the Python 3.8.1 scripting language, the PyTorch 1.10.0 deep learning framework, and the CUDA 11.3 GPU acceleration library.

The key training parameters for the experiments were configured as follows: input image dimensions were set to 640 by 640 pixels, the initial learning rate was established at 0.01, learning rate momentum was assigned a value of 0.937, and the weight decay coefficient was set to 0.0005. The training process spanned 100 epochs with a batch size of 64.
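For reference, the key hyperparameters above can be collected into a single configuration object; the names below mirror common YOLOv5-style hyperparameter files, and any setting not listed is assumed to take the framework default.

train_cfg = {
    "imgsz": 640,            # input image size (640 x 640)
    "epochs": 100,
    "batch_size": 64,
    "lr0": 0.01,             # initial learning rate
    "momentum": 0.937,       # SGD momentum for the learning rate schedule
    "weight_decay": 0.0005,  # weight decay coefficient
}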

3.1.2 Evaluation metrics

In this research, the effectiveness of the network models before and after improvement in detecting dense and small polyp images on the constructed dataset was assessed under identical experimental conditions. Differences in experimental outcomes were compared to assess network performance, i.e., the occurrence of missed and false detections. Three main metrics were chosen: precision, recall, and mean average precision (mAP), calculated as follows:

$$\begin{aligned} Precision=\frac{TP}{TP+FP}, \end{aligned}$$
(9)
$$\begin{aligned} Recall=\frac{TP}{TP+FN}, \end{aligned}$$
(10)
$$\begin{aligned} mAP@0.5=\frac{\sum _{i=1}^N AP_i}{N}, \end{aligned}$$
(11)

In the formulas above, TP denotes true positives (correct detections), FP denotes false positives, and FN denotes false negatives (missed detections). These metrics are frequently employed to assess the overall detection performance of a target detection network model.
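The metrics can be sketched in a few lines of Python. This is an illustration of Formulas 9-11, not the exact evaluation code used in the experiments; the AP computation here uses simple trapezoidal integration of a precision-recall curve, whereas production evaluations typically interpolate the curve first.

import numpy as np

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0      # Formula 9

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0      # Formula 10

def map_at_05(pr_curves):
    """pr_curves: per-class lists of (recall, precision) points at IoU 0.5."""
    aps = []
    for curve in pr_curves:
        r, p = zip(*sorted(curve))                 # sort points by recall
        aps.append(np.trapz(p, r))                 # AP: area under the PR curve
    return float(np.mean(aps))                     # Formula 11: mean over N classes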

3.2 Ablation experiment

To further verify the impact of the proposed modules and optimizations on the detection algorithm in polyp detection tasks, this study conducted a set of ablation experiments. Based on the YOLOv5s network, adding the Swin Transformer network produced YOLOv5s-a, adding the ASFF module produced YOLOv5s-b, adding the PATM attention module produced YOLOv5s-c, and adding the small target detection head produced YOLOv5s-d. The network with all modules added to the YOLOv5s baseline is the proposed PATM-YOLO network. The results of the ablation experiments are shown in Table 3. Compared with the YOLOv5 network, separately adding each module not only yielded a minimum improvement of 1.8\(\%\) in precision but also an increase of at least 1.1\(\%\) in both recall and mAP@0.5, underscoring the feasibility of the enhancements.

Table 3 Ablation experiments of PATM-YOLO

3.3 Experimental comparison between YOLOv5 algorithm and improved algorithms

In this section, the PATM-YOLO algorithm is compared with the YOLOv5 series algorithms. To objectively demonstrate performance on the dataset, all existing YOLOv5 variants were considered, including YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Table 4 shows the performance of the PATM-YOLO algorithm and the YOLOv5 series algorithms on the constructed dataset.

Table 4 Comparison of experimental results between the PATM-YOLO and YOLOv5 series networks

Fig. 6 Comparison of different performance parameters between the PATM-YOLO and YOLOv5 series algorithms

Figure 6 displays a comparison of the performance parameters of the PATM-YOLO algorithm and the YOLOv5 series algorithms, revealing that PATM-YOLO exhibits superior precision, recall, and mAP@0.5. From Table 4, PATM-YOLO attains a precision of 91.3\(\%\), a recall of 86.6\(\%\), and an mAP@0.5 of 92\(\%\) in the polyp detection experiment, an improvement of 8.5\(\%\), 3.3\(\%\), and 4.7\(\%\), respectively, over the YOLOv5s baseline network model. The PATM-YOLO algorithm thus holds a clear advantage on the constructed dataset.

For specific detection tasks involving dense and small targets, the detection performance of the YOLOv5 and PATM-YOLO network models is shown in Figs. 7 and 8. Figure 7 corresponds to dense polyp images with three targets in the original image, of which the original network detected two but missed one, while the improved PATM-YOLO network detected all polyp targets. For the small polyp images in Fig. 8, the original network exhibited missed detections, while the improved network detected all targets.

Fig. 7 Comparison of detection results for dense polyp images: (a) PATM-YOLO; (b) YOLOv5

Fig. 8 Comparison of detection results for small polyp images: (a) PATM-YOLO; (b) YOLOv5

3.4 Comparison of the PATM-YOLO algorithm with other algorithms

To further validate the effectiveness of the PATM-YOLO model, comparative experiments were conducted with other algorithms while keeping the configuration environment and initial hyperparameters as consistent as possible.

Table 5 shows the experimental results of the PATM-YOLO algorithm and the other algorithms on the dataset constructed in this paper. As shown in the table, with an input size of 640 × 640 the PATM-YOLO algorithm achieves outstanding detection performance, surpassing the other algorithms in the polyp detection task.

Table 5 Comparison between the PATM-YOLO algorithm and other algorithms

3.5 Testing of the PATM-YOLO algorithm on the SUN dataset

This section presents the experiments of the proposed PATM-YOLO algorithm on the public polyp dataset SUN. To ensure the fairness of the polyp detection experiments, the same parameter settings as in the previous experiments were used.

Fig. 9 The validation results of model training: (a) mAP@0.5; (b) Recall; (c) Precision

Table 6 The comparison results between PATM-YOLO and other algorithms on the SUN dataset

Figure 9 shows the training process and validation results of YOLOv5, YOLOv7, YOLOv8, and PATM-YOLO. As shown in the figure, the red line representing YOLOv8 has lower recall, precision, and mAP@0.5 than the other three detection algorithms. The proposed PATM-YOLO algorithm reaches higher precision and recall in a shorter time than the baseline YOLOv5 network. Furthermore, although PATM-YOLO and YOLOv7 show similar precision and recall during training, PATM-YOLO outperforms YOLOv7 on the test set; combined with Table 6, PATM-YOLO shows an increase of 0.4\(\%\) in precision and 0.8\(\%\) in recall on the SUN dataset.

It can be observed from Table 6 that the PATM-YOLO algorithm outperformed the other detection networks on the public SUN dataset by a significant margin in terms of recall, making it more suitable for the polyp detection task [26, 27].

4 Conclusions

This study proposes a new method for detecting small polyps in images, called PATM-YOLO, which addresses the missed detection of small polyps. In terms of network architecture, a detection head for small targets is first constructed, followed by an attention mechanism that captures richer information and limits the influence of background areas on the target. Second, the Swin Transformer structure is employed to augment the network's feature extraction capacity. Finally, the ASFF module is incorporated into the network to enhance the integration of multi-scale features and enrich the network's feature diversity. The PATM-YOLO algorithm outperformed the other YOLOv5 series algorithms, with a precision of 91.3\(\%\), a recall of 86.6\(\%\), and an mAP@0.5 of 92\(\%\) on the constructed dataset; it also achieved a precision of 95.6\(\%\) and a recall of 90.8\(\%\) on the public SUN dataset, making it more suitable for polyp detection tasks. The study shows that the PATM-YOLO algorithm can improve polyp detection performance. Regarding computational requirements, the PATM-YOLO algorithm still needs improvement; reducing its computational cost while maintaining detection accuracy, and improving its detection performance to facilitate deployment on resource-constrained devices, will be the focus of our future work.