1 Introduction

Object detection is one of the fundamental tasks of computer vision and aims to predict a set of bounding boxes and category labels for each object instance of interest [1]. According to how candidate bounding boxes are generated, object detectors can be divided into single-stage and two-stage approaches. In the first stage, a two-stage detector mainly uses anchors, region proposals, and non-maximum suppression (NMS) [2] to locate where objects appear and produce proposal bounding boxes; the second stage classifies these proposals to complete the detection process. The R-CNN family [3,4,5,6] is the classical line of two-stage detectors, among which Fast R-CNN [7] and Faster R-CNN [8] are regularly used in object detection by virtue of their excellent capabilities. Although these detectors achieve high accuracy, their speed, measured in frames per second, is limited: even the fastest high-accuracy detector, Faster R-CNN, runs at only 7 frames per second (FPS), making it unsuitable for detection scenarios with strict response-time requirements [9].

To solve the above problem, single-stage object detection emerged. It requires only one forward pass of a single neural network to predict objects' classes and locations directly from the original image. Classical single-stage detectors include YOLO [10,11,12,13,14,15] and SSD [9]. These classical detectors must still rely on hand-designed components to accomplish detection and therefore cannot achieve true end-to-end object detection.

With researchers' continued exploration of object detection, Carion et al. [17] proposed an object detector that regards detection as a direct set prediction problem, named the detection transformer (DETR), which opened a new direction for the development of object detection. DETR is built on the transformer encoder–decoder architecture [18] combined with a bipartite matching algorithm, which eliminates the dependence on many hand-designed components and realizes end-to-end object detection, effectively simplifying the detection pipeline.

Although DETR has a simple structure and achieves end-to-end object detection without relying on hand-designed components, its training converges slowly and inference remains time-consuming, and its performance on small and medium-sized objects is unsatisfactory. Much subsequent research on DETR has therefore been devoted to speeding up its convergence and improving its detection accuracy.

Zhu et al. [19] attribute the slow convergence of DETR to the attention module applying nearly uniform attention weights to all pixels in the feature map at initialization, so the model spends a long time learning the object distribution of the dataset to be detected. Based on this analysis, they proposed deformable DETR, which alleviates DETR's slow convergence. By visualizing the spatial attention weight maps of cross-attention in DETR, Meng et al. [20] concluded that DETR converges slowly because the query's content embedding must match both the content embedding and the spatial embedding in the key when computing cross-attention; DETR therefore needs a large number of epochs to improve the quality of the content embeddings before it can locate objects precisely. Based on this analysis, the authors proposed conditional DETR, which improves DETR's convergence speed by decoupling the cross-attention module in the decoder. Gao et al. [21] proposed SMCA, a plug-and-play spatially modulated cross-attention module, and applied it to DETR. SMCA-DETR introduces a 2D Gaussian-like spatial distribution into the cross-attention mechanism, constraining the search range of each object query to a region close to the target center and thus accelerating DETR's convergence. In addition, the authors integrate multi-head attention and scale-selective attention into SMCA to further improve detection accuracy. Zhang et al. [22] proposed DINO, a strong end-to-end object detector that addresses the slow convergence of DETR-like models and the unclear meaning of query vectors by using contrastive denoising training, a mixed (hybrid) query selection mechanism, and a look-forward-twice scheme. In 2022, Zhang et al. [23] found experimentally that, in the DETR decoder, object queries are projected multiple times in the self-attention modules and FFNs, resulting in a lack of semantic alignment between the object queries and the image features and hampering DETR's convergence. Based on this analysis, they proposed the semantic-aligned matching detection transformer (SAM–DETR), a model that greatly accelerates DETR convergence through semantic-aligned matching without sacrificing accuracy.

Inspired by the effectiveness of multi-head attention [18] and the semantic-aligned matching re-sampling mechanism [23] in accelerating DETR convergence and improving detection performance, we propose a DETR object detector based on feature correction and double sampling (FCDS-DETR). It improves detection accuracy by enhancing the baseline model's ability to perceive target objects. Specifically, we add a feature correction module to the SAM–DETR model, which indirectly affects the positions of the sampling points in the sampling area by explicitly modeling the inter-dependence between feature channels, enhancing the model's ability to locate the edges and extremities of detected objects. At the same time, the double sampling mechanism and attention-map fusion method of FCDS-DETR improve the recognizability of the attention weight maps and reduce the difficulty of the subsequent matching task by fusing the attention maps generated by the two sets of sampling points. The specific contributions of this paper are as follows:

  1. We propose a high-accuracy end-to-end object detector that utilizes the feature correction module and double sampling mechanism to enhance the SAM–DETR model's ability to detect and localize targets, thereby improving detection accuracy.

  2. We thoroughly analyze why the attention maps generated by the SAM–DETR model are blurred and propose a novel attention fusion method that enhances the recognizability of the attention weight maps, achieving more reliable object detection.

  3. We evaluate our proposed model on the improved FDDB [24] and COCO [25] datasets, comprehensively assessing its precision, recall, and parameter count, and compare it with existing DETR-like models such as deformable DETR and the SAM–DETR model.

2 Related Work

2.1 Transformer

In the field of natural language processing (NLP), recurrent neural networks (RNNs) have long been among the most popular neural network architectures, yet they often suffer from long-term dependency problems when processing long sequences, leading to unsatisfactory results. To solve this problem, Vaswani et al. [18] proposed a new deep learning model, the transformer. Compared with RNNs, the transformer is an attention-based neural network that can model long sequences without a recurrent structure and offers better parallelism and computational efficiency. The transformer's success in NLP laid the foundation for subsequent applications in image classification [26,27,28,29], image generation [30,31,32,33], action recognition [34,35,36], and fault diagnosis [37, 38]. The transformer exchanges information among all inputs by means of keys, queries, and values; through iterative learning, it establishes the connection between each input and itself as well as between each input and all other inputs. However, the attention in the transformer has quadratic computational complexity in the sequence length, making the learning process computationally intensive and training time long. To address this problem, a series of follow-up studies have been conducted. Sparse transformer [39] uses sparse attention instead of the dense attention of the traditional transformer, reducing the complexity from \(O(n^2)\) to \(O(n\log n)\). Linformer [40] proposes removing the softmax function in the transformer and performing the matrix multiplication between query and value first, reducing the complexity from \(O(n^2)\) to \(O(n)\). In this paper, our FCDS-DETR follows the design of the original transformer; in future work, we will explore efficient transformers in FCDS-DETR.

Fig. 1 Structure of the SAM–DETR model. N denotes the number of encoder layers, and M denotes the number of decoder layers in the transformer

2.2 Siamese-based architecture for matching

Siamese-based architecture for matching is a deep learning approach for similarity comparison and matching whose main structure consists of Siamese networks: two identical neural networks with shared parameters are trained jointly so that they learn to project two inputs into the same feature space. When performing similarity matching, the model projects the two input vectors into two new vectors through the network and then determines their similarity by computing the Euclidean distance in the embedding space or by other measures. This method is widely used in text matching [41,42,43], image matching [44,45,46], and face verification [47,48,49]. Mueller et al. [42] proposed the Siamese Recurrent Architecture in 2016 and applied it to text matching; its advantage is its ability to learn text representations adaptively and capture the semantic similarity between texts, achieving good generalization when trained on a limited dataset. Chen et al. [43] proposed an enhanced Siamese-based model for natural language inference in 2017 that handles diverse text types and lengths efficiently. Florian et al. [50] proposed a method for learning similarity from data and used it for face verification. Koch et al. [51] applied Siamese networks to few-shot learning, introducing distance metrics to solve small-sample classification. Our FCDS-DETR achieves semantic-aligned matching by projecting object queries into the same embedding space as the encoder output feature map.

Fig. 2 The semantics aligner module generates new query vectors

2.3 Classical feature fusion method in object detection

Feature fusion methods have been widely adopted in object detection due to their superior performance; their main goal is to enhance the model's detection accuracy for targets of varying scales and perspectives. In 2017, Lin et al. [52] proposed the feature pyramid network (FPN), an effective feature fusion architecture that uses a top-down pathway and lateral connections to integrate multi-scale, multi-level feature information, thereby improving detection performance. Subsequently, PANet, proposed by Liu et al. [53], adds a bottom-up pathway to the FPN so that feature fusion occurs at each level, better integrating information from lower and higher layers. Huang et al. [54] designed DenseNet, which establishes dense connections between all layers and offers an effective way to integrate multi-scale, multi-level information, improving the model's generalization capability. In this paper, we draw on these feature fusion methods and apply them to the fusion of cross-attention weight maps, on the basis of which we propose an attention fusion method. This approach improves the detection performance of FCDS-DETR from a new perspective.

3 Proposed methods

3.1 Overview

The FCDS-DETR proposed in this paper integrates the feature correction module and the double sampling mechanism into the semantics aligner module to address the insufficient number of sampling points and their inaccurate localization in the re-sampling process of the SAM–DETR model. The model's accuracy is enhanced while the number of added parameters remains limited and the convergence rate is preserved. In the following sections, we first review the basic architecture of the SAM–DETR model and then introduce the architecture of our proposed FCDS-DETR.

3.2 Review SAM–DETR model

The SAM–DETR model uses ResNet-50 [16] as its feature extraction network to extract a feature map \(F \in \mathbb{R}^{H \times W \times C}\) from the input image \(I \in \mathbb{R}^{H_0 \times W_0 \times 3}\), where \(H_0\), \(W_0\) and H, W denote the heights and widths of the input image and the output feature map, respectively, and C denotes the channel dimension of the output feature map. In the encoder, the feature map F is first combined with the sinusoidal position encoding PE to obtain the feature map \(F_{\rm pe}\), which contains spatial position information. \(F_{\rm pe}\) is used to generate the vectors K and Q, representing the key and query in the transformer, respectively. The sinusoidal position encoding PE is defined in Eqs. (1) and (2),

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
(1)
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
(2)

where pos represents the position coordinate of each pixel of the feature map, 2i and \(2i+1\) index the positions within the corresponding position embedding, and d is the dimension of the position embedding. V represents the value, obtained from F without position information. K, Q, and V are input to self-attention. In the self-attention computation, the matrix dot product between Q and K produces an output containing context information; after softmax normalization and a linear projection, the self-attention output is obtained, realizing information exchange among features at all spatial locations. To increase feature diversity, K, Q, and V are divided into groups along the channel dimension for MHSAttention (multi-head self-attention), as given in Eq. (3),

$$MHSAttention(Q,K,V) = Concat\left(Softmax\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i\right) W^{o}$$
(3)

where \(Q_i\), \(K_i\), and \(V_i\) represent the ith feature groups of Q, K, and V, \(d_k\) is the feature dimension of \(Q_i\) and \(K_i\), and \(W^o\) is the output transformation matrix. The output of MHSAttention is transformed and passed to the transformer's decoder.
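For concreteness, the following is a minimal PyTorch sketch of Eqs. (1)–(3); the sequence length, embedding dimension, head count, and the omission of the input projections for Q, K, and V are simplifying assumptions of the sketch, not the configuration of the actual model.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(num_pos: int, d: int) -> torch.Tensor:
    """Sinusoidal position encoding of Eqs. (1)-(2): one d-dim embedding per position."""
    pos = torch.arange(num_pos, dtype=torch.float32).unsqueeze(1)             # (num_pos, 1)
    div = torch.pow(10000.0, torch.arange(0, d, 2, dtype=torch.float32) / d)  # 10000^(2i/d)
    pe = torch.zeros(num_pos, d)
    pe[:, 0::2] = torch.sin(pos / div)   # even indices 2i
    pe[:, 1::2] = torch.cos(pos / div)   # odd indices 2i+1
    return pe

class MHSAttention(nn.Module):
    """Multi-head self-attention of Eq. (3): per-head scaled dot-product attention,
    concatenation over heads, then the output transformation W^o."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_o = nn.Linear(d_model, d_model)   # output transformation matrix W^o

    def forward(self, q, k, v):                   # each: (batch, seq, d_model)
        b, n, _ = q.shape
        split = lambda x: x.view(b, n, self.h, self.d_k).transpose(1, 2)  # (b, h, n, d_k)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, self.h * self.d_k)  # concat heads
        return self.w_o(out)

# toy usage: a 10x10 feature map flattened into a 100-token sequence with C = 256
f = torch.randn(1, 100, 256)                       # feature map F (flattened)
q = k = f + sinusoidal_pe(100, 256).unsqueeze(0)   # F_pe provides Q and K
out = MHSAttention(d_model=256, num_heads=8)(q, k, f)  # V comes from F alone
print(out.shape)                                   # torch.Size([1, 100, 256])
```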

Fig. 3 Overall structure of the semantics aligner used in FCDS-DETR, including the feature correction module and the double sampling mechanism. The feature correction module improves sampling-point localization accuracy by explicitly modeling the dependencies between the feature channels to be sampled. The double sampling mechanism samples the feature region \(E_{\rm roi}\) to generate the input of multi-head cross-attention

In the decoder, the SAM–DETR model adds a semantics aligner before each multi-head cross-attention, as shown in Fig. 1. The semantics aligner samples the encoder output feature maps to generate new queries; their generation is shown in Fig. 2, where the gray ellipses represent the inputs required by cross-attention and the orange rectangles depict the process of creating a new query vector. The cross-attention module then takes the new queries as input. The semantics aligner ensures that key and query are semantically aligned in cross-attention, since both are derived from the encoder output feature maps. Adding the semantics aligner to DETR accelerates convergence and improves detection accuracy.

The SAM–DETR model's distinctive understanding of the problems of the classical DETR model opened a new direction for accelerating DETR convergence. Obtaining object queries by re-sampling avoids the semantic destruction caused by the multiple projections that object queries undergo across stacked decoder layers, and it improves detection accuracy to a certain extent. However, the semantics aligner of the SAM–DETR model uses few sampling points during re-sampling, and these points do not locate the crucial positions of target objects accurately, which limits the model's detection accuracy.

3.3 Feature correction module

Feature correction is typically employed in feature extraction networks to enhance the robustness and generalization of a model by correcting and refining the features in its middle layers. Since the transformer decoder consists of multiple stacked layers, we introduce a parameter-shared feature correction module into the basic decoder module. Within a single iteration of the model, the stacked decoder layers can recalibrate the input features by applying the feature correction module and continually updating the coordinates of the sampling points. CAM [55] is a classical squeeze-and-excitation style feature correction mechanism that improves the quality of the model's feature representations while adding only a few parameters; we therefore use it as the feature correction module for the features of the ROI-extracted region. Our core idea is to influence the sampling points in the sampling area indirectly by exploiting the differences between the channels of the feature map to be sampled: after the feature correction module is added, the feature map to be sampled is recalibrated along the channel dimension, and when this recalibrated map is used to predict the offsets of the re-sampled coordinate points, more accurate predictions are obtained. The model's ability to perceive the area to be detected and to locate the key points of the object is thus improved, which in turn improves detection accuracy. As shown in Fig. 3, E is the feature map obtained by applying a convolutional transformation to the transformer encoder's image features, and \(E_{\rm roi}\) is the potential object-containing region extracted from E by ROI align [56]. \(F_{\rm sq}\) performs the squeezing operation over the global spatial dimensions of the input features, which we implement with global max pooling and global average pooling; this operation describes the channel dimension of the feature map by aggregating over its spatial dimensions. The \(F_{\rm sq}\) formula is shown in Eq. (4),

$$F_{\rm sq}(E_{\rm roi}) = \left\{ \begin{array}{l} \dfrac{1}{H_{E_{\rm roi}} \times W_{E_{\rm roi}}} \sum\limits_{x=1}^{W_{E_{\rm roi}}} \sum\limits_{y=1}^{H_{E_{\rm roi}}} E_{\rm roi}(x,y) \\[2ex] \max\limits_{x \in W_{E_{\rm roi}},\, y \in H_{E_{\rm roi}}} E_{\rm roi}(x,y) \end{array} \right.$$
(4)

where \({{H_{{E_{\rm roi}}}}}\) represents the height of the ROI align output feature map and \({{W_{{E_{\rm roi}}}}}\) represents the width of the ROI align output feature map.

Fig. 4 Sampling point acquisition method

Fig. 5 Attention weight maps of the cross-attention output. The first and second rows show the attention weight maps generated by the query vectors corresponding to the two sets of re-sampling points after cross-attention; the third row shows the weight maps produced by the attention fusion method

We input the global average pooling and the global max pooling results into the MLP to obtain two sets of \(1 \times 1 \times C\) feature vectors. After that, we add the two sets of feature vectors and use the sigmoid function to implement the excitation operation \({F_{\rm ex}}\). The weight set \(W_{\rm roi}\) for each feature channel corresponding to the feature map \({E_{\rm roi}}\) is ultimately generated through the continuous iterative learning of the MLP. Figure 3 shows that \({F_{\rm adject}}\) multiplies the weight set \(W_{\rm roi}\) with the original \(E_{\rm roi}\) to obtain the recalibrated feature map \(E_{\rm adj}\). The \({F_{\rm ex}}\) and \({F_{\rm adject}}\) formulas are shown in Eqs. (5) and (6),

$$F_{\rm ex}(X) = Sigmoid(MLP(X))$$
(5)
$$F_{\rm adject}(E_{\rm roi}) = E_{\rm roi} \times W_{\rm roi}$$
(6)

where X represents the output of \({E_{\rm roi}}\) after the global max pooling and the average pooling.
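To make the module concrete, below is a minimal PyTorch sketch of the feature correction module described by Eqs. (4)–(6); the module name, the channel-reduction ratio of 16, and the tensor shapes are illustrative assumptions of the sketch.

```python
import torch
import torch.nn as nn

class FeatureCorrection(nn.Module):
    """Channel-attention feature correction in the style of CAM [55], Eqs. (4)-(6):
    squeeze E_roi spatially with average and max pooling (F_sq), excite through a
    shared MLP with a sigmoid (F_ex), and recalibrate the channels (F_adject)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(              # shared MLP for both pooled vectors
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, e_roi: torch.Tensor) -> torch.Tensor:   # (B, C, H, W)
        b, c, _, _ = e_roi.shape
        avg = e_roi.mean(dim=(2, 3))           # F_sq, average-pooling branch: (B, C)
        mx = e_roi.amax(dim=(2, 3))            # F_sq, max-pooling branch: (B, C)
        w_roi = torch.sigmoid(self.mlp(avg) + self.mlp(mx))   # F_ex: weights W_roi
        return e_roi * w_roi.view(b, c, 1, 1)  # F_adject: recalibrated map E_adj

# toy usage: 100 ROI-aligned regions of size 7x7 with 256 channels
e_adj = FeatureCorrection(256)(torch.randn(100, 256, 7, 7))
print(e_adj.shape)  # torch.Size([100, 256, 7, 7])
```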

3.4 Double sampling mechanism

The purpose of re-sampling is to find the most representative key points in the feature map containing potential objects. In the semantics aligner module of the SAM–DETR model, regions containing potential objects are first extracted using ROI align, and a single re-sampling is performed within each region. The sampled results are then sent to multi-head cross-attention as object queries for the attention dot-product computation, which finally produces an attention weight map revealing the matching degree between the object queries and the target region. However, while this single re-sampling scheme speeds up DETR's convergence, it suffers from under-sampling: the few sampling points it uses cannot reliably locate the key points of the target object within the reference box.

Based on the above analysis, this paper proposes a double sampling mechanism that enhances the model's object perception by increasing the number of sampling points, thus improving detection accuracy. First, the feature map containing inter-channel dependencies is used to predict the offsets of two sets of sampling points. Then, bilinear interpolation is used to sample the original ROI-aligned feature map twice according to the predicted offsets. Finally, two sets of query content embeddings are generated. We denote the two sets of offsets as \(P_{{\rm offset}\_i}\), where \(i \in \left\{ {0,1} \right\}\) indicates the group. As shown in Fig. 3, we obtain the sampling-point offsets by passing the recalibrated feature map \(E_{\rm adj}\) through a convolution, a ReLU activation, and an MLP, as given in Eq. (7),

$$\begin{aligned} {P_{{\rm offset}\_i}}({E_{\rm adj}}) = MLP({ReLU(Conv({E_{\rm adj}}))}) \end{aligned}$$
(7)

where Conv denotes the convolution operation, and ReLU is the activation function.

Following that, based on \(P_{{\rm offset}\_i}\), we can easily obtain the two corresponding sets of key sampling points in the feature map to be sampled using an interpolation algorithm. The purpose of grouping is to remain compatible with the number of heads in cross-attention and to facilitate the subsequent attention fusion in Sect. 3.6. We use \(F_{\rm Ds}\) to denote the interpolation re-sampling and \(Q_{{\rm cont}\_i}^{'}\) to denote the re-sampling output of the corresponding group, as given in Eq. (8).

$$\begin{aligned} Q_{{\rm cont}\_i}^{'} = {F_{\rm Ds}}({E_{\rm roi}},{P_{{\rm offset}\_i}}) \end{aligned}$$
(8)

\({P_{{\rm offset}\_i}}\) is also used to update the coordinate boxes of the ROI regions and generate a new position embedding \({Q_{{\rm pos}\_i}^{'}}\).
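The sketch below illustrates, under stated assumptions, how Eqs. (7) and (8) can be realized in PyTorch: the number of sampling points per group, the tanh bounding of the offsets, and the global pooling before the MLP are our illustrative choices, and \(F_{\rm Ds}\) is implemented here with bilinear grid_sample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleSampling(nn.Module):
    """Double sampling (Eqs. (7)-(8)): predict two groups of sampling-point offsets
    from the recalibrated map E_adj, then bilinearly re-sample E_roi at those
    positions to produce two groups of query content embeddings Q'_cont_i."""
    def __init__(self, channels: int, points_per_group: int = 8):
        super().__init__()
        self.p = points_per_group
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # MLP head with doubled output channels: (x, y) offsets for two point groups
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 2 * self.p * 2),
        )

    def forward(self, e_roi, e_adj):                       # both: (B, C, H, W)
        b = e_roi.shape[0]
        feat = F.relu(self.conv(e_adj)).mean(dim=(2, 3))   # Eq. (7) trunk: (B, C)
        # normalized offsets in [-1, 1], relative to the ROI center
        offsets = torch.tanh(self.mlp(feat)).view(b, 2, self.p, 2)
        queries = []
        for i in range(2):                                 # groups i in {0, 1}
            grid = offsets[:, i].unsqueeze(1)              # (B, 1, p, 2)
            sampled = F.grid_sample(e_roi, grid, mode='bilinear',
                                    align_corners=False)   # F_Ds: (B, C, 1, p)
            queries.append(sampled.squeeze(2).transpose(1, 2))  # Q'_cont_i: (B, p, C)
        return queries, offsets

# toy usage: 100 ROI regions with 256 channels and 7x7 spatial size
e_roi = torch.randn(100, 256, 7, 7)
e_adj = torch.randn(100, 256, 7, 7)
q_cont, p_offset = DoubleSampling(256)(e_roi, e_adj)
print(q_cont[0].shape, p_offset.shape)  # torch.Size([100, 8, 256]) torch.Size([100, 2, 8, 2])
```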

Fig. 6 On the improved FDDB dataset, our FCDS-DETR is more sensitive to the target objects, so its detection results are more accurate than those of the baseline SAM–DETR model

This paper does not discard the previous query embeddings in the semantics aligner module. Instead, the number of weights generated by the linear projection is doubled to match the output of the double sampling, as shown in Eqs. (9) and (10),

$$W_{{\rm pre}\_i} = Sigmoid(Linear(Q_{\rm pre}))$$
(9)
$$Q_{{\rm new}\_{\rm cont}\_i} = W_{{\rm pre}\_i} \times Q_{{\rm cont}\_i}^{'}$$
(10)

where \({Q_{\rm pre}}\) represents the previous query embedding, \({W_{{\rm pre}\_i}}\) represents the weights obtained from the previous query embedding after linear projection and sigmoid activation, and \({Q_{{\rm new}\_{\rm cont}\_i}}\) represents the weighted query content embeddings.
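A minimal sketch of the reweighting in Eqs. (9) and (10), assuming for illustration that each query is a single d-dimensional vector:

```python
import torch
import torch.nn as nn

# Sketch of Eqs. (9)-(10): the previous query embedding Q_pre is linearly
# projected to two groups of sigmoid gates W_pre_i that reweight the two
# re-sampled content embeddings Q'_cont_i. Shapes are illustrative.
d_model, num_queries = 256, 100
reweight = nn.Linear(d_model, 2 * d_model)         # doubled output for two groups

q_pre = torch.randn(num_queries, d_model)          # previous query embedding Q_pre
q_cont = [torch.randn(num_queries, d_model) for _ in range(2)]  # Q'_cont_0, Q'_cont_1

w_pre = torch.sigmoid(reweight(q_pre)).chunk(2, dim=-1)  # Eq. (9): W_pre_0, W_pre_1
q_new_cont = [w * q for w, q in zip(w_pre, q_cont)]      # Eq. (10)
print(q_new_cont[0].shape)                         # torch.Size([100, 256])
```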

3.5 Sampling point coordinate information acquisition and position embedding

Before performing sinusoidal position embedding on the sampling points, their coordinates must be obtained from the position offsets predicted by the MLP network in Eq. (7). Compared with the offset prediction network of the original SAM–DETR model, FCDS-DETR doubles the output channels of the last MLP layer of the offset prediction network so as to predict two sets of offsets. As shown in Fig. 4, the yellow reference point \(P(x\_{\rm center},y\_{\rm center})\) is the center of the ROI align output feature map \({E_{\rm roi}}\) and serves as the reference point when the model predicts the sampling points. The red point \(A({x_{\rm sp}},{y_{\rm sp}})\) is a sampling point, and \(\Delta x\), \(\Delta y\) denote its offsets relative to the reference point P in the x and y directions. \(W_{{E_{\rm roi}}}\) and \(H_{{E_{\rm roi}}}\) denote the width and height of the feature map \({E_{\rm roi}}\). The coordinates of the sampling point A are calculated as in Eq. (11),

$$\begin{aligned} \left\{ \begin{array}{l} {x_{\rm sp}} = x\_{\rm center} + \Delta x\\ {y_{\rm sp}} = y\_{\rm center} + \Delta y \end{array} \right. \end{aligned}$$
(11)

The formulas for \(\Delta x\) and \(\Delta y\) are shown in Eq. (12),

$$\begin{aligned} \left\{ \begin{array}{l} \Delta x = \frac{{{W_{{E_{\rm roi}}}}}}{2} \times {P_{{\rm offset}\_x}}\\ \Delta y = \frac{{{H_{{E_{\rm roi}}}}}}{2} \times {P_{{\rm offset}\_y}} \end{array} \right. \end{aligned}$$
(12)

where \({P_{{\rm offset}\_x}}\) and \({P_{{\rm offset}\_y}}\) represent the outputs of the MLP in the offset prediction network.

In this paper, we inherit DETR's position embedding scheme and apply sinusoidal position encoding to the coordinates of the sampling points to generate two corresponding groups of query position embeddings \(Q_{{\rm pos}\_i}^{'}\). We likewise increase the number of weights produced by the linear projection of the previous query embeddings to generate new query position embeddings \(Q_{{\rm new}\_{\rm pos}\_i}\).
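As a small worked example of Eqs. (11) and (12), the sketch below converts the MLP's normalized offsets into absolute sampling coordinates; the assumption that the offsets lie in \([-1, 1]\) is ours.

```python
import torch

def sampling_points(p_offset: torch.Tensor, w_eroi: int, h_eroi: int) -> torch.Tensor:
    """Eqs. (11)-(12): p_offset holds normalized offsets (P_offset_x, P_offset_y)
    in [-1, 1]; returns absolute (x_sp, y_sp) coordinates inside the ROI map."""
    x_center, y_center = w_eroi / 2.0, h_eroi / 2.0        # reference point P
    dx = (w_eroi / 2.0) * p_offset[..., 0]                 # Eq. (12), x direction
    dy = (h_eroi / 2.0) * p_offset[..., 1]                 # Eq. (12), y direction
    return torch.stack((x_center + dx, y_center + dy), dim=-1)  # Eq. (11)

pts = sampling_points(torch.tensor([[0.5, -0.25]]), w_eroi=7, h_eroi=7)
print(pts)  # tensor([[5.2500, 2.6250]]) -> sampling point A in a 7x7 ROI
```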

3.6 Attention fusion

The cross-attention mechanism plays a crucial role in the SAM–DETR model, which achieves target matching and feature extraction by using points sampled from the encoder output feature map as object queries. However, the cross-attention weight map of the SAM–DETR model is based on a single re-sampling and is therefore limited by the accuracy of the sampling points, leaving the attention weight map blurred and unable to locate the target object precisely. Based on this analysis, we propose an attention fusion method: the two sets of sampled points obtained by the double sampling mechanism are fed into the cross-attention module in parallel, and the resulting cross-attention weight maps are fused to improve the model's sensitivity to target objects and its detection accuracy.

We multiply the two sets \(Q_{{\rm new}\_{\rm cont}\_i}\) and the corresponding \(Q_{{\rm new}\_{\rm pos}\_i}\) with their respective weight matrices \(W_{q\_{\rm cont}\_i}\) and \(W_{q\_{\rm pos}\_i}\), respectively. After that, the two query vectors \({Q_i}\) can be obtained by summing. The formula is shown in Eq. (13),

$$\begin{aligned} {Q_i} = ({W_{q\_{\rm cont}\_i}} \times {Q_{{\rm new}\_{\rm cont}\_i}}) + ({W_{q\_{\rm pos}\_i}} \times {Q_{{\rm new}\_{\rm pos}\_i}}) \end{aligned}$$
(13)

where \({W_{q\_{\rm cont}\_i}}\) and \({W_{q\_{\rm pos}\_i}}\) represent the weight matrices obtained after linear projection and sigmoid activation of the self-attention outputs in the decoder, and \({Q_i}\) represents the new query vectors generated by the semantic alignment module. The grouping of object queries does not affect the key and value in MHCAttention (multi-head cross-attention). We denote the attention weight map obtained by multi-head cross-attention as \({F_{{w_i}}}\), given by Eq. (14); attention fusion is then achieved by overlaying the weight maps \(F_{w_0}\) and \(F_{w_1}\).

$$F_{w_i} = MHCAttention(Q_i, K, V) = Softmax\left(\frac{Q_i K^{T}}{\sqrt{d_k}}\right)V$$
(14)

The attention fusion process is shown in Fig. 5, where (a) is \(F_{w_0}\), (b) is \(F_{w_1}\), and (c) is the cross-attention weight map generated after \(F_{w_0}\) and \(F_{w_1}\) are fused. Subfigure (c) of Fig. 5 makes clear that the fused weight map carries richer boundary and content information than subfigures (a) and (b); this is particularly evident for the medium and small targets enclosed by the red boxes. Attention fusion helps distinguish these targets from others, thereby reducing the training pressure on the subsequent FFN.
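A hedged sketch of Eq. (14) and the fusion step follows; since the text specifies only that \(F_{w_0}\) and \(F_{w_1}\) are overlaid, averaging the two maps is our illustrative choice for the overlay, and single-head attention is used for brevity.

```python
import math
import torch

def cross_attn_weights(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Attention weight map of Eq. (14) for one head: softmax(Q K^T / sqrt(d_k))."""
    return torch.softmax(q @ k.T / math.sqrt(q.shape[-1]), dim=-1)  # (n_q, n_k)

d, n_q, n_k = 256, 100, 400                        # 400 = 20x20 encoder tokens
k = torch.randn(n_k, d)                            # shared key from the encoder
v = torch.randn(n_k, d)                            # shared value from the encoder
q0, q1 = torch.randn(n_q, d), torch.randn(n_q, d)  # the two query groups Q_i, Eq. (13)

f_w0, f_w1 = cross_attn_weights(q0, k), cross_attn_weights(q1, k)
fused = 0.5 * (f_w0 + f_w1)                        # fused weight map, cf. Fig. 5(c)
out = fused @ v                                    # attended output passed to the FFN
print(fused.shape, out.shape)  # torch.Size([100, 400]) torch.Size([100, 256])
```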

4 Experiment

4.1 Image dataset preparation

In this paper, experiments are conducted on the improved FDDB dataset and the COCO 2017 dataset. The improved FDDB dataset converts the elliptical face annotations of the original FDDB dataset into circumscribed rectangular annotations and converts the annotation files to COCO format to fit the model's requirements. The improved FDDB dataset contains 2100 training images and 745 validation images with a total of 5171 face targets. The COCO 2017 dataset contains 118k training images and 5k validation images; each image has an average of seven instances, a single training image contains up to 63 instances, and each instance is annotated with a bounding box.

4.2 Experimental setup

For the improved FDDB dataset, we explore the model's performance over 100 epochs. The initial learning rate is set to \(1 \times {10^{ - 5}}\), decreased to 1/10 of the initial value at the 80th epoch, and the batch size is set to 8. For the COCO dataset, we first experiment with 12 epochs, a schedule widely used for ConvNet detectors [8], and second with 50 epochs, following transformer-based detectors [19, 20]. The initial learning rate for both sets of experiments is \(1 \times {10^{ - 5}}\), the AdamW [57] optimizer is used, and the batch size is set to 6. We train our model on 4\(\times\) Nvidia GeForce RTX 3090 GPUs, given the GPU computing power that transformer-based models require during learning. When FCDS-DETR employs multi-scale features in the encoder, the batch size is reduced to 2 because of the significant CUDA memory required. In the 12-epoch comparison experiments, the learning rate is kept at its initial value throughout; in the 50-epoch comparison experiments, it decreases to 1/10 of the initial value at the 40th epoch. The input image size ranges from 480\(\times\)480 to 1333\(\times\)1333 pixels, and the data are augmented with random cropping and horizontal flipping.
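For reference, a sketch of the optimizer and learning-rate schedule of the 50-epoch COCO setting, assuming a standard PyTorch training loop; the placeholder model and the elided data loop are not part of the actual implementation.

```python
import torch

# Sketch of the 50-epoch COCO schedule described above: AdamW with an initial
# learning rate of 1e-5, dropped to 1/10 at the 40th epoch. `model` is a
# placeholder standing in for FCDS-DETR.
model = torch.nn.Linear(256, 256)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40], gamma=0.1)

for epoch in range(50):
    # ... one pass over the COCO training set with batch size 6 would go here ...
    optimizer.step()       # stands in for the per-iteration updates
    scheduler.step()       # per-epoch learning-rate decay
```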

4.3 Evaluation indicators

Both datasets used in our experiments follow the COCO annotation format. Therefore, we use the COCO evaluation indicators to evaluate the model's detection performance objectively; among them, we focus on average precision and average recall. The performance indicators and their meanings are shown in Table 1.

Table 1 Evaluation indicators
Table 2 Comparison experiments—improved FDDB dataset
Table 3 Comparison experiment—COCO dataset
Fig. 7 Convergence curves of FCDS-DETR and other DETR variants trained on the COCO dataset for 50 epochs. Compared with the original DETR, FCDS-DETR significantly improves AP while outperforming the other DETR variants

4.4 Analysis of experiment result

Table 2 shows the training results of the proposed FCDS-DETR model and recently studied DETR variants on the improved FDDB dataset. As observed in Table 2, DETR performs poorly on the improved FDDB dataset and has the lowest AP and AR100 among the DETR variants. FCDS-DETR significantly improves the detection performance of DETR at 100 epochs: +6.8 AP, +2.2 AP0.5, and +6.0 AP0.75, respectively. For medium- and small-scale objects, FCDS-DETR shows substantial improvements over DETR of +13.1 APM and +12.5 APS, respectively. The bolded rows in the table show the results of our proposed FCDS-DETR model and its multi-scale variant. FCDS-DETR also holds significant performance advantages over the baseline SAM–DETR model, with +1.4 AP, +1.1 AP0.5, and +3.0 AP0.75, respectively; the largest gains are on APM and APS, at +2.7 and +5.7. This demonstrates the effectiveness of the feature correction module (Sect. 3.3) and the double sampling mechanism (Sect. 3.4) in improving the detector's performance. FCDS-DETR even outperforms all the DETR variants in the table. In addition, FCDS-DETR and the baseline model are similar in GFLOPs, with FCDS-DETR increasing GFLOPs only slightly (+7\(\%\)). Figure 6 shows the detection results of FCDS-DETR and the baseline SAM–DETR model on the improved FDDB dataset, both using ResNet-50 as the feature extraction network and trained for 100 epochs; column (a) shows the original images, (b) the SAM–DETR detection results, and (c) the FCDS-DETR detection results.

Meanwhile, we also conducted experiments on the COCO dataset, with results shown in Table 3. The convergence speed of FCDS-DETR at 12 epochs is not reduced relative to the baseline model but is in fact improved considerably, which surprised us; we posit that this improvement may be attributed to the enhancing effect of the feature correction module on convergence during the early stages of training. After 50 epochs of training, FCDS-DETR improves on the original DETR by +5.3 AP, +6.0 AP0.5, and +6.0 AP0.75, respectively, obtaining 39.0 AP, a +0.7 AP improvement over the baseline model. With the addition of multi-scale features, the detection accuracy of FCDS-DETR improves further, reaching 39.6 AP. Thanks to the high-quality sampling points and the double sampling mechanism, the model perceives the positions of objects to be detected more sharply during training. The convergence curves of the comparison models trained for 50 epochs on the COCO dataset are shown in Fig. 7. These extensive experiments fully demonstrate the effectiveness of our method. Figure 8 shows the detection results of FCDS-DETR and the baseline SAM–DETR model on the COCO dataset, both using ResNet-50 as the feature extraction network and trained for 50 epochs; column (a) shows the original images, (b) the SAM–DETR detection results, and (c) the FCDS-DETR detection results.

In addition, it is worth noting that the FCDS-DETR and SMCA-DETR [21] models in Table 2 achieve similar AP scores after 100 epochs of training, 76.4 and 76.5, respectively. We attribute this small performance gap to the size of the dataset and the complexity of the scenes in the images: the improved FDDB dataset has only 2.1k training images and low scene complexity, so after 100 epochs both models can achieve good detection results. In Table 3, however, a performance gap appears when the two models are trained for the same number of epochs on the COCO dataset (118k training images) with its complex scenes: FCDS-DETR achieves 39.0 AP after 50 epochs of training on COCO, an improvement of +0.6 over the SMCA-DETR model.

Fig. 8 Detection results of FCDS-DETR and the baseline model on the COCO dataset. FCDS-DETR detects small and medium-sized objects in the image, and its bounding-box localization is more accurate

4.5 Ablation analysis

To validate the role and contribution of the various components proposed in this paper for FCDS-DETR, we conducted an ablation study to assess the importance of the proposed feature correction module and the double sampling mechanism. Additionally, we compared the results with the baseline SAM–DETR model.

We used ResNet-50 as the feature extraction network for both the SAM–DETR model and FCDS-DETR. To eliminate variability, we trained the baseline and the FCDS-DETR variants with different components added for 24 epochs each. The initial learning rate was \(1 \times {10^{ - 5}}\), decreased to 1/10 of its value after 16 epochs, and the number of object queries was set to 100. The experimental results are shown in Table 4; the SAM–DETR model achieves an AP of 34.4 after 24 epochs.

Table 4 Ablation experiments

4.5.1 Effectiveness of feature correction module

As shown in Table 4, with the learning rate, number of epochs, and training schedule kept the same as for the baseline model, we obtain results by adding different feature correction methods to the semantic alignment module. The improved models all gain accuracy over the baseline to different degrees. Adding SENet [58] improves accuracy by +0.5 AP, +0.7 AP0.5, and +0.7 AP0.75, respectively; adding CAM [55] by +1.0 AP, +1.1 AP0.5, and +1.4 AP0.75; and adding CBAM [59] by +0.1 AP, +0.6 AP0.5, and +0.2 AP0.75. The CBAM variant yields limited improvement when the number of epochs is small because it must learn attention along both the channel and spatial dimensions. The results show that adding the feature correction module improves detection accuracy while preserving the baseline model's advantage in convergence speed.

4.5.2 Effectiveness of the double sampling mechanism

As shown in Table 4, we added the double sampling mechanism to the semantic alignment module to sample twice within the region containing potential objects, and we used the attention fusion method described in Sect. 3.6 to handle the attention-map fusion required by the doubled number of sampling points. Comparison with the baseline model shows that the improved model with the double sampling mechanism raises detection accuracy by +0.9 AP for the same number of epochs. This result strongly supports our view that the double sampling mechanism improves the model's sensitivity to target objects.

4.6 Visualization

Figure 9 visualizes the bounding boxes predicted by FCDS-DETR and the corresponding key points found by the double sampling mechanism; the first set of sampling points is marked in green and the second in blue. Meanwhile, to show the advantage of FCDS-DETR in extracting object features, we also visualize the weight maps generated by the FCDS-DETR and SAM–DETR models after 50 epochs on the COCO dataset. Figure 10 illustrates the attention fusion method through the weight maps generated by the multi-head cross-attention module.

Fig. 9 Detection results and weight maps obtained on the COCO dataset. The first row shows the original images; the second row visualizes the bounding boxes; the third row visualizes the sampling points within the bounding boxes; the fourth row visualizes the final weight maps. Our FCDS-DETR locates the edges and extremities of objects more accurately, making the bounding-box localization more precise

Fig. 10 Attention fusion method

It can be observed that double sampling on feature maps processed by the feature correction module allows more sampling points to be located accurately at the edges or extremities of the objects to be detected; these sampling regions carry important features and play a crucial role in subsequent object localization and recognition. Comparing the attention weight maps generated by the cross-attention modules of the two models shows that the weight maps obtained with the attention fusion method clearly separate the object to be detected from the background, so the model can sensitively perceive and locate objects in the image and improve detection accuracy. In contrast, the original SAM–DETR model's single re-sampling produces more scattered and sparse sampling points that cannot locate the edges and extremities of the object well, and its weight maps show lower sensitivity to the objects. These results are consistent with our earlier analysis that too few sampling points and a blurred attention map are the main reasons for the SAM–DETR model's lower detection accuracy.

5 Conclusion

In this paper, we discuss the reasons for the unsatisfactory detection accuracy of the SAM–DETR model, namely too few sampling points and blurred attention maps. We propose FCDS-DETR to solve these problems and obtain better performance. The core idea of FCDS-DETR is to increase the number of sampling points and improve their localization accuracy by adding a feature correction module and a double sampling mechanism, thereby improving the recognizability of the attention maps the model outputs. We demonstrate the effectiveness of the model through extensive experiments.

The limitations of our proposed FCDS-DETR model are twofold. On the one hand, when fusing the two attention weight maps, the cross-attention output may superimpose the background noise present in both maps, which can adversely affect detection performance. On the other hand, the semantics aligner module has not yet been combined with other improved DETR methods, which limits further gains in detection performance. To address these limitations, we will explore more effective noise-reduction algorithms for the attention fusion process and continue to investigate combining FCDS-DETR with other strong improvements to achieve even better detection performance.