
1 Introduction

As an active microwave remote sensing device, synthetic aperture radar (SAR) is capable of generating all-day, all-weather and high-resolution earth observations. SAR images are of great importance in reconnaissance and surveillance missions in both the military and civilian domains. SAR image target detection can be applied to many tasks, such as environmental monitoring, battlefield reconnaissance, geographic survey and ocean monitoring.

Deep learning technology has achieved excellent results in optical image detection and recognition tasks, and has attracted more and more scholars to apply it to SAR image interpretation tasks [1,2,3,4]. However, the complex imaging mechanism of SAR images differs from that of optical images, so algorithms that perform well on optical images may not be perfectly adapted to SAR images. In general, the challenges of applying deep learning to SAR image target detection are mainly as follows: (1) As shown by the statistics in Table 1 and Fig. 1 from Official-SSDD [5, 6] and HRSID [7], two mainstream SAR image datasets, the sparsely distributed targets are generally small and their scales vary greatly, which undoubtedly increases the difficulty of SAR image target detection. (2) SAR images are often accompanied by clutter noise and complex backgrounds such as docks, islands and reefs, resulting in many false or missed detections. (3) The differences between datasets are large, so the generalization of a model trained on a single dataset is weak.

Fig. 1. Distribution of the ratio of the long side to the short side of the target bounding box.

To address these issues, we aim to extract precise target features from complex SAR images to solve the problems of small target detection and multi-scale target detection. We propose an improved Transformer backbone based on Swin-Transformer [8], called WAFormer. The backbone redesigns the window attention module in consideration of the size and shape of SAR image targets. The improved window can better capture targets of various sizes and orientations and distinguish them from the background. WAFormer achieves higher box AP than Swin-Transformer and other classic convolutional neural network (CNN) methods, with lower FLOPs than Swin-Transformer. Meanwhile, we show that the Transformer approach is well suited to SAR image target detection.

The main contributions of this paper are as follows:

(1) We redesign the Transformer window attention module with a variable size window. The resizable window makes feature extraction more suitable for SAR image targets of various postures.

(2) To enhance connections between non-overlapping windows in the abovementioned window attention module, we improve the original shifted window mechanism of Swin-Transformer to make it more reasonable.

(3) To alleviate the computational redundancy caused by the new window attention, we introduce a channel splitting mechanism to compute the window attention of different directions at the same time.

Table 1. Statistical results of multi-scale ships in Official-SSDD and HRSID.

2 Related Works

2.1 SAR Target Detection Based on Deep Learning

The analysis of SAR image data has become a research hotspot because of its significance in the fields of military and civil detection. In recent years, many deep-learning-based SAR image target detection methods have been developed. Cui et al. [9] utilized a dense attention pyramid network (DAPN) to improve the accuracy of multi-scale ship detection. Zhao et al. [10] proposed an attention receptive pyramid network (ARPN) with receptive fields block (RFB) and convolutional block attention module (CBAM) to improve the performance of detecting multi-scale ships. Cui et al. [11] proposed an anchor-free method that introduces a spatial shuffle-group enhance (SSE) attention module into CenterNet to achieve better performance than some classic CNN methods. Fu et al. [12], also following the anchor-free strategy, proposed a feature balancing and refinement network (FBR-Net) that achieves state-of-the-art performance among general anchor-free methods. Guo et al. [13] presented CenterNet++, which consists of a feature refinement module, a feature pyramid fusion module and a head enhancement module, to improve effectiveness and robustness. Tang et al. [14] proposed a scale-aware feature pyramid network comprising a scale-adaptive feature extraction module and a learnable anchor assignment strategy to address feature misalignment and targets' appearance variation. Xu et al. [15] improved YOLOv5 to present Lite-YOLOv5, a lightweight onboard SAR ship detector that decreases FLOPs without sacrificing accuracy. Xia et al. [16] proposed a visual Transformer framework based on contextual joint-representation learning that combines the global information of the Transformer and the local feature representation of CNNs.

2.2 Vision Transformer

Transformer [17] is an encoder-decoder framework with an attention mechanism for natural language processing (NLP). With Transformer's impressive performance in NLP, a growing amount of computer vision research based on the Transformer has emerged. ViT [18] presented a pure Transformer architecture for vision by feeding the patch sequences split from an image into a Transformer, but when the training data is not sufficient ViT does not generalize well. Also based on convolution-free Transformers, DeiT [19] introduced a distillation strategy into the Transformer to achieve competitive performance. DEtection TRansformer (DETR) [20] realized an end-to-end detector including a Transformer encoder-decoder architecture and a global loss computed in the parallel decoder. PVT [21] introduced a pyramid structure into the Transformer to produce an excellent vision Transformer backbone with lower computation than ViT. However, these methods based on global attention have high computational complexity. Swin-Transformer [8] presented a general vision Transformer backbone that innovatively designed shifted windows on a hierarchical architecture. The non-overlapping local window attention mechanism and cross-window connection not only reduce the computational complexity, but also achieve state-of-the-art results on multiple visual tasks. CSWin [22] proposed a cross-shaped window consisting of horizontal and vertical stripes split from the feature in a parallel manner, and introduced the Locally-enhanced Positional Encoding (LePE) to achieve better position encoding ability. However, local window attention is not friendly to large target detection. Inspired by Swin-Transformer and CSWin, our method mitigates this disadvantage.

3 Method

3.1 Motivation

Swin-Transformer [8] is currently a state-of-the-art vision Transformer backbone with higher accuracy and lower cost than others. The excellent feature extraction capability of its window attention mechanism and its advantages for small target detection inspired us to apply it to SAR image target detection. Nevertheless, due to the characteristics of small and diverse target sizes, sparse distribution and different postures, Swin-Transformer cannot be directly applied to SAR images. We therefore redesign the window with variable size and apply it to the original Swin structure, forming an improved backbone for ship target detection in SAR images, called WAFormer.

3.2 Overview

The overall architecture of WAFormer is shown in Fig. 2. Because the proposed method is based on Swin-Transformer, the overall structure of the network is similar. Taking an image as input, as in Swin-Transformer, a patch partition module first splits the image into evenly divided patches. A linear embedding layer then projects the patch tokens to C dimensions. The patch size, the number of tokens and the design of the hierarchical representation are the same as in Swin-Transformer, so we also have \(\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}\) tokens in the \(i^{th}\) stage, with decreasing resolution and increasing channels. The difference is that we replace the original Swin-Transformer block with our WAFormer block, which is described in detail below.
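As a rough illustration of this shared stem, the sketch below projects an input image into (H/4) \(\times\) (W/4) patch tokens of dimension C. The module name `PatchEmbed` and the 4 \(\times\) 4 patch size are assumptions consistent with Swin-Transformer, not taken from the authors' released code.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into non-overlapping 4x4 patches and project them to C channels,
    as in Swin-Transformer; a minimal sketch, not the authors' implementation."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, H/4 * W/4, C) patch tokens
```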

Fig. 2. (a) The overall architecture of our proposed WAFormer; (b) an effective Transformer block for ship detection in SAR images, described in Sect. 3.4. VW-MSA and SVW-MSA are multi-head attention modules with vertical/horizontal and shifted windowing configurations, respectively.

3.3 Variable Size Window Self-attention

Variable Size Window. Based on the local window attention mechanism, we propose a variable size window that is more suitable for ship targets in SAR images. First, to allow multi-scale input, the image is padded. The padded feature is then partitioned into non-overlapping windows. The window size is set as \({M} \times {N}\), meaning that each window contains \({M} \times {N}\) patches. As shown in Fig. 1, statistics indicate that the ratio of the long side to the short side of the target bounding boxes in SAR images mostly lies within 4:1, while the aspect ratio of the Swin-Transformer window is 1:1, which cannot cover all targets and truncates some of them. We therefore set the window size according to this ratio range, as shown in Fig. 3. Specifically, from "Stage 1" to "Stage 4", we empirically set the long and short sides of the window to \(\frac{224}{7\times 2^{i-1}}\) (i = 1, 2, 3, 4), i.e. [32, 16, 8, 4], and [7, 4, 2, 1], respectively. Meanwhile, we set horizontal and vertical windows to capture targets of different postures. Inspired by CSWin [22], we introduce the channel split method to compute horizontal and vertical window attention at the same time to reduce cost.
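A minimal sketch of how such rectangular windows could be formed is given below. The helper name `variable_window_partition`, the padding behavior and the orientation convention (horizontal = wider than tall) are our assumptions; the paper only fixes the M \(\times\) N window shape and the horizontal/vertical channel split.

```python
import torch
import torch.nn.functional as F

def variable_window_partition(x, M, N):
    """Pad the feature map and split it into non-overlapping M x N windows.
    x: (B, H, W, C) -> (num_windows * B, M * N, C). A sketch, not the official code."""
    B, H, W, C = x.shape
    pad_b, pad_r = (M - H % M) % M, (N - W % N) % N
    x = F.pad(x, (0, 0, 0, pad_r, 0, pad_b))            # pad W, then H, so both divide evenly
    Hp, Wp = H + pad_b, W + pad_r
    x = x.view(B, Hp // M, M, Wp // N, N, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * N, C)

# Channel split: half of the channels attend inside horizontal (7 x 32) windows,
# the other half inside vertical (32 x 7) windows, computed in parallel (Stage-1 sizes).
x = torch.randn(2, 56, 56, 96)                          # Stage-1-like feature map, C = 96
x_h, x_v = x.chunk(2, dim=-1)
win_h = variable_window_partition(x_h, 7, 32)           # short side 7, long side 32
win_v = variable_window_partition(x_v, 32, 7)           # long side 32, short side 7
```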

Shifted Window. Since our window no longer has a fixed square size, the original shifted window scheme is not applicable. To increase the connection between non-overlapping windows, we replace the original shift step with \(\left( \left\lfloor \frac{\text{short-side}}{2}\right\rfloor , \left\lfloor \frac{\text{short-side}}{2}\right\rfloor \right) \) to displace the regularly partitioned windows. In other words, the shift size becomes half of the short side of the window, which is proved to be effective by experiments.
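A sketch of the displacement itself, assuming the usual `torch.roll`-based cyclic shift from Swin-Transformer combined with our shift step of half the short side:

```python
import torch

def shift_windows(x, M, N):
    """Cyclically shift the feature map by floor(short_side / 2) in both spatial
    directions before re-partitioning; a sketch of the SVW-MSA displacement."""
    s = min(M, N) // 2                                    # half of the window's short side
    return torch.roll(x, shifts=(-s, -s), dims=(1, 2))    # x: (B, H, W, C)
```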

Convolution Position Encoding. It is well known that position encoding is of great significance to the Transformer model [17, 26, 27]. However, we abandon absolute position encoding and choose relative position encoding, because we notice that absolute position encoding does not lead to a performance improvement. Inspired by the LePE of CSWin, we utilize a learnable additive positional encoding obtained by performing a convolution operation on the value V of the window. We calculate the attention for a window according to the following formula:

$$\begin{aligned} Attention(Q,K,V)=SoftMax(QK^{T}/\sqrt{d})V+Conv(V) \end{aligned}$$
(1)

Experiments show that this position encoding can effectively improve the accuracy.
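The sketch below mirrors Eq. (1): a convolution on V is added to the standard scaled dot-product attention inside each window. The depth-wise 3 \(\times\) 3 kernel and the placement of the convolution on the full channel dimension are assumptions borrowed from CSWin's LePE, not details stated in the text.

```python
import torch
import torch.nn as nn

class WindowAttentionWithConvPE(nn.Module):
    """Window attention with an additive convolutional position encoding on V (Eq. 1).
    A sketch assuming a depth-wise 3x3 convolution, in the spirit of CSWin's LePE."""
    def __init__(self, dim, num_heads, M, N):
        super().__init__()
        self.num_heads, self.scale = num_heads, (dim // num_heads) ** -0.5
        self.M, self.N = M, N
        self.qkv = nn.Linear(dim, dim * 3)
        self.pe = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                     # x: (B_win, M*N, dim)
        B, L, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)                # each (B_win, L, dim)
        # Conv(V): learnable additive position encoding applied on V's spatial layout
        lepe = self.pe(v.transpose(1, 2).reshape(B, D, self.M, self.N))
        lepe = lepe.flatten(2).transpose(1, 2)                # back to (B_win, L, dim)
        def heads(t):                                         # (B_win, heads, L, head_dim)
            return t.view(B, L, self.num_heads, D // self.num_heads).transpose(1, 2)
        q, k, v = heads(q) * self.scale, heads(k), heads(v)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)      # SoftMax(Q K^T / sqrt(d))
        out = (attn @ v).transpose(1, 2).reshape(B, L, D)
        return self.proj(out + lepe)                          # Eq. (1): attention + Conv(V)

attn = WindowAttentionWithConvPE(dim=96, num_heads=3, M=7, N=32)
y = attn(torch.randn(8, 7 * 32, 96))                          # 8 windows of 7x32 patches
```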

Fig. 3. The illustration of the variable size window with channel splitting manner.

Computation Complexity Analysis. Omitting the SoftMax, the computational complexity of a variable size window attention module on a SAR image of \(h\times w\) patches is:

$$\begin{aligned} \varOmega (VW\text {-MSA})=4hwC^{2} + 2MNhwC \end{aligned}$$
(2)

where hw denotes the number of patches. It can be seen that our computational complexity is also linear with hw when MN is set as we design.
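For reference, the two terms in Eq. (2) follow from the standard counting used for W-MSA in Swin-Transformer: per window of MN tokens, the QKV and output projections cost \(4MNC^{2}\) and the two attention products (\(QK^{T}\) and its product with V) cost \(2(MN)^{2}C\), and the feature map contains \(hw/(MN)\) windows, so

$$\begin{aligned} \varOmega (VW\text {-MSA})=\frac{hw}{MN}\left( 4MNC^{2}+2(MN)^{2}C\right) =4hwC^{2} + 2MNhwC. \end{aligned}$$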

3.4 WAFormer Block

Our network is built on the WAFormer block, with the other layers kept the same as in Swin-Transformer. A WAFormer block contains a pair of regular and shifted variable size window attention modules. This block is defined as:

$$\begin{aligned}&\hat{X}^{l} = VW \text{- } MSA(LN(X^{l-1}))+X^{l-1}, \nonumber \\&{X}^{l} = MLP(LN(\hat{X}^{l})) + \hat{X}^{l}, \nonumber \\&\hat{X}^{l+1} = SVW \text{- } MSA(LN(X^{l}))+X^{l}, \nonumber \\&{X}^{l+1} = MLP(LN(\hat{X}^{l+1})) + \hat{X}^{l+1}, \end{aligned}$$
(3)

where VW-MSA and SVW-MSA denote the regular and shifted variable size window attention modules, respectively; \(\hat{X}^{l}\) and \({X}^{l}\) denote the output features of the (S)VW-MSA module and the MLP module for layer l, respectively.
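A structural sketch of how Eq. (3) composes the two attention modules is shown below. The class name `WAFormerBlock` and the MLP expansion ratio of 4 are illustrative assumptions; `vw_msa` and `svw_msa` stand in for the regular and shifted variable-size-window attention modules described above and are assumed to handle window partition and reversal internally.

```python
import torch.nn as nn

class WAFormerBlock(nn.Module):
    """One pair of (regular, shifted) variable-size-window attention sub-blocks, Eq. (3).
    A sketch of the structure only, not the authors' implementation."""
    def __init__(self, dim, vw_msa, svw_msa, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.vw_msa, self.svw_msa = vw_msa, svw_msa
        self.mlp1 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                       # x: (B, num_tokens, dim)
        x = self.vw_msa(self.norm1(x)) + x      # regular variable size window attention
        x = self.mlp1(self.norm2(x)) + x
        x = self.svw_msa(self.norm3(x)) + x     # shifted variable size window attention
        x = self.mlp2(self.norm4(x)) + x
        return x
```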

4 Experiments

4.1 Dataset and Evaluation Metrics

SSDD [6] is the first open dataset widely used in the SAR remote sensing community. It includes 1160 SAR images of about 500 \(\times \) 500 pixels with 1-15 m resolutions. The dataset contains 2456 ship targets of different sizes and materials, in good and bad sea conditions and in offshore and inshore scenes. Official-SSDD [5] is an optimized version of the initial SSDD: compared to SSDD, it revises labels, formulates stricter usage standards and provides a comprehensive data analysis. HRSID [7] includes 5604 SAR images of \(800 \times 800\) pixels at three resolutions (0.5 m, 1 m, 3 m). It contains 16951 ship targets covering different resolutions, polarizations, sea conditions, sea areas and coastal ports. We choose Official-SSDD as the main training and testing dataset, and HRSID as the validation dataset for comparison with Swin-Transformer. For detection evaluation metrics, we apply the mean Average Precision (mAP), the detection rate at IOU = 0.5 (\(AP_{50}\)) and IOU = 0.75 (\(AP_{75}\)), and the detection performance on small, medium and large targets (\(AP_{S}\), \(AP_{M}\), \(AP_{L}\)). The FLOPs and parameters of the models used are also calculated and compared.

4.2 Implementation Details

We implement our proposed network with the PyTorch framework and the MMDetection [23] toolbox. Multi-scale training [20, 24] and data augmentation techniques [19] are adopted, with the largest size set as \({1333 \times 800}\) following Swin-Transformer. The experiments run on an NVIDIA GeForce RTX 3090 GPU and the batch size is set as 4, limited by the compute capability. The initial learning rate and the number of training epochs are set as 0.0001 and 300, respectively. We use the AdamW [25] optimizer and a cosine decay learning rate scheduler with 5 epochs of linear warm-up. The weight decay is set as 0.05.
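The exact configuration file is not given in the paper; the snippet below is a sketch of how these hyper-parameters would typically be expressed in an MMDetection 2.x config. The values follow this section, but the file layout and surrounding settings are assumptions, not the authors' released config.

```python
# Optimization settings in MMDetection 2.x config style (a hedged sketch).
optimizer = dict(type='AdamW', lr=0.0001, weight_decay=0.05)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='CosineAnnealing',        # cosine decay learning rate schedule
    warmup='linear',                 # 5 epochs of linear warm-up
    warmup_iters=5,
    warmup_by_epoch=True,
    min_lr_ratio=0.01)               # assumed floor, not stated in the paper
runner = dict(type='EpochBasedRunner', max_epochs=300)
data = dict(samples_per_gpu=4)       # batch size 4 on a single RTX 3090
```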

4.3 Comparison Results

We compare our proposed WAFormer backbone with Swin-Transformer using the Mask R-CNN [28] object detection framework. Meanwhile, we also choose 5 classic object detection methods, including YOLOv3 [29], SSD-512 [30], RetinaNet [31], Faster R-CNN [32] and Mask R-CNN with ResNet-50 [33] as the backbone. Figure 4 shows the visual results on Official-SSDD of WAFormer and Swin-Transformer with the Mask R-CNN framework, compared with the other classic methods. It can be seen that the detection performance of our method is better than that of Swin-Transformer, and the confidence of the detection boxes is higher than that of the other methods.

Fig. 4. Visual results of the methods involved on Official-SSDD. R-50 denotes ResNet-50 and Swin denotes Swin-Transformer.

Table 2. Detection results on Official-SSDD test set.
Table 3. Parameter size and FLOPs of methods in experiment.

Table 2 shows the performance comparison of WAFormer with Swin-Transformer and the other methods. Our WAFormer architecture achieves the highest detection accuracy among all the methods involved in the comparison. Specifically, our method achieves 74.4% mAP, surpassing Swin-Transformer by +1.0, while \(AP_{50}\) and \(AP_{75}\) also bring advantages of +0.9 and +0.8, respectively. Meanwhile, we achieve the best result on \(AP_{S}\) and a competitive result on \(AP_{M}\), with 73.7% and 77.9%, respectively. The results demonstrate that our method brings improvements in small and multi-scale target detection in SAR images. Table 3 shows the parameters and FLOPs of these methods. When using the Mask R-CNN detection framework, our WAFormer has fewer parameters and FLOPs than Swin-Transformer. Our method achieves the best results with a lighter architecture, which further shows the effectiveness and superiority of WAFormer for target detection in SAR images.

To validate the universality of our method over Swin-Transformer in SAR image target detection, we retrain and test WAFormer and Swin-Transformer with the Mask R-CNN framework on HRSID. Table 4 shows that we still have an advantage compared with Swin-Transformer.

Table 4. Detection results on HRSID test set.

4.4 Related Configuration Adjustment

Window Size and Shift Size. To achieve optimal performance, we conducted experiments with different configurations of the window size and the shift size. Table 5 shows the results of the different configurations. The results show that the highest accuracy is achieved when the long and short sides of the window are set as [32, 16, 8, 4] and [7, 4, 2, 1], and that the shifted window brings optimal performance when the shift size is set as \(\left( \left\lfloor \frac{\text{short-side}}{2}\right\rfloor ,\left\lfloor \frac{\text{short-side}}{2}\right\rfloor \right) \).

Table 5. The performance of different configurations of the window size and the shift size of the shifted window. The long side and short side denote the size of the window.

Convolution Position Encoding. To validate the effect of the convolutional relative position encoding, we also conducted relevant experiments. We calculate the original attention without convolution position encoding, and the attention with additive and with multiplicative convolutional position encoding, respectively. The results in Table 6 show that the additive convolutional position encoding is beneficial for improving the accuracy.

Table 6. The performance of different position encodings. mul conv rel pos.: multiplicative convolutional position encoding; add conv rel pos.: additive convolutional position encoding.

5 Conclusion

In this paper, according to the characteristics of SAR images, we propose a backbone focusing on target size based on Swin-Transformer. Our method improves target detection performance in SAR images while reducing cost. Experiments show that the targeted improvements play an effective role in solving the difficult detection of small and multi-scale targets in SAR images. At the same time, our variable size window is also applicable to other datasets, since it is designed according to the dataset. However, our large target detection results are not excellent; we consider this to be a shortcoming of the window attention mechanism. In future work, we plan to increase the number of large windows in the shallow layers and introduce a channel attention mechanism to increase the information interaction between channels.