1 Introduction

Synthetic aperture radar (SAR) [1] has emerged as a pivotal technology in maritime surveillance, offering all-weather, all-day imaging capabilities with strong penetration through adverse conditions such as clouds and fog [2]. However, ships in SAR images appear at widely varying scales, and the complex background, including interference from sea surfaces, ground clutter, and speckle noise, often leads to missed detections and false alarms during detection [3]. Accurate localization and recognition of ship targets in SAR images is therefore of considerable practical value.

Traditional SAR ship target detection algorithms primarily rely on contrast differences between targets and background clutter. These methods encompass techniques such as constant false alarm rate (CFAR) detection [4], template matching algorithms [5], and trace-based detection algorithms [6]. CFAR detection is one of the most commonly used techniques: the method in [7] balanced accuracy and speed when estimating CFAR parameters, and [8] introduced a bilateral CFAR algorithm for ship detection that reduces the impact of SAR image ambiguity and background clutter. However, these methods rely on handcrafted features, exhibit limited efficiency and poor generalization, and are unsuitable for complex detection scenarios. With the rapid development of deep learning in optical image object detection and recognition [9,10,11,12,13,14], paradigms such as single-stage [15], two-stage [16], anchor-free [17], and Transformer-based [18] detectors have taken shape, and deep learning has begun to be applied to SAR images with significant results [19, 20]. Chen et al. [21] employed deformable convolutional neural networks to enhance feature extraction by altering the convolutional kernel’s sampling points. Bai et al. [22] proposed a shallow feature enhancement network that uses the Inception structure together with dilated convolution to expand the feature map’s receptive field, improving the network’s adaptability to small-scale ship targets.

Features constitute the primary basis for iterative learning in object detection algorithms. Thus, optimizing the features fed into the detection network can most directly improve various algorithm aspects. Widely used modules include attention mechanisms to focus on key features and feature pyramid networks (FPN) for fusing multi-scale features. In terms of attention mechanisms, Li et al. [23] used channel attention mechanisms to convert spatial information in the image into masks, score them to extract crucial information, and provide references for the detection network. Zheng et al. [24] introduced transferable attention mechanisms, designing an attention mask that covers all positions for each attention module, highlighting the correct semantic feature regions. Regarding FPN, Zhao et al. [25] constructed a four-level scale feature pyramid network in a top-down manner. This network leveraged candidate regions and their surrounding contextual information to provide higher-quality classification confidence and final target scores, thereby enhancing semantic information extraction for small targets. Mei et al. [26] extended FPN into four parts, highly integrating features of different scales extracted by the backbone network, thereby improving the network’s ability to detect small ship targets.

Although the above-mentioned methods have improved ship target detection in SAR images in various aspects, certain shortcomings remain. Firstly, these methods may lack precision or generate more false alarms when dealing with extremely small targets, such as distant ships or lifeboats, possibly due to limitations in target resolution. Secondly, when targets closely resemble the color and texture of the background, such as in the presence of sea waves, islands, or other ships, these methods may face challenges [27]. Moreover, these methods may not be robust enough, especially when dealing with noisy, missing data, or other interference in SAR images [28].

To address these issues, this paper introduces the SAR-ShipSwin (synthetic aperture radar ship detection with swin transformer integration) algorithm, building upon the faster R-CNN framework. Our main contributions include:

  1. A novel backbone architecture that merges the swin transformer with a feature pyramid network (FPN), enhanced by an improved W-MSA module, the occlusion perceptive window multihead self-attention (OPW-MSA). This combination is specifically designed to tackle the occlusion and overlap challenges inherent in SAR imagery.

  2. The design of the background modeling network (BMN), primarily for identifying and eliminating complex background features. It comprises a background feature extraction layer, a background attention module, and a background weakening module, effectively reducing background-related false alarms.

  3. The introduction of spatial intensity geometric pooling (SIG-Pooling), a pooling technique that incorporates both spatial and intensity information from the region of interest (ROI). This approach is tailored to preserve the geometric and structural integrity of the original ROI, minimizing information loss and distortion.

  4. The design of the dynamic ship shape adaptive convolution (DSAC) module, which dynamically adjusts the shape of the convolution kernel to conform more closely to the observed target. Compared with conventional convolution, this better captures the true shape of ships, whose forms in SAR images are variable and often irregular.

We have conducted multiple experiments, and the results demonstrate the excellent performance of the proposed algorithm in various scenarios and conditions, effectively improving ship target detection performance and generalization capabilities.

2 Preliminary

This section elaborates on the basic mathematical concepts and theories foundational to this work. It focuses on specialized aspects of synthetic aperture radar (SAR) image processing, advanced neural network architectures, and specific innovations in convolutional operations, providing a direct underpinning for the methodologies developed in this research.

2.1 SAR image processing

SAR imaging involves complex signal processing techniques to resolve features in images, particularly for ship detection. A common preprocessing model applies a feature-enhancement kernel to a transformation of the speckled image, leaving a residual noise term:

$$\begin{aligned} I_\textrm{d} = f(I_\textrm{s}) \otimes K + N \end{aligned}$$
(1)

where \(I_\textrm{d}\) represents the denoised image, \(I_\textrm{s}\) is the speckled SAR image, \(K\) denotes the kernel for convolution embodying the feature enhancement, \(\otimes\) signifies the convolution operation, and \(N\) is the residual noise.
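For illustration, the following sketch instantiates Eq. (1) on a synthetic speckled patch, using a placeholder 3×3 mean kernel for K and the identity for f; the kernel, speckle model, and residual-noise bookkeeping are illustrative assumptions rather than the preprocessing actually used in this work.

```python
# Minimal sketch of Eq. (1): I_d = f(I_s) (x) K + N, with a placeholder
# mean kernel standing in for the feature-enhancement filter.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

# Synthetic "speckled" SAR patch: multiplicative gamma speckle on a clean scene.
clean = np.zeros((64, 64))
clean[28:36, 20:44] = 1.0                                       # a bright ship-like bar
speckle = rng.gamma(shape=4.0, scale=0.25, size=clean.shape)    # unit-mean speckle
I_s = (clean + 0.1) * speckle                                   # speckled image

K = np.full((3, 3), 1.0 / 9.0)                                  # placeholder enhancement kernel
I_d = convolve2d(I_s, K, mode="same", boundary="symm")          # f(I_s) (x) K, with f = identity

N = I_s - I_d                                                   # residual treated as the noise term
print(I_d.shape, float(N.std()))
```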

2.2 Swin transformer for SAR images

The swin transformer is a pivotal innovation in the field of deep learning, introducing a hierarchical structure that significantly enhances the processing of synthetic aperture radar (SAR) images. Its design is tailored to capture the inherent multi-scale nature of SAR images, making it exceptionally suited for tasks requiring fine-grained feature extraction across various scales, such as ship detection in complex maritime environments.

At its core, the swin transformer operates by partitioning the input image into non-overlapping patches, which are then treated as the basic units for the initial layer of the transformer. This patch-based processing reduces the computational complexity, enabling the model to scale to large images efficiently. The key to its hierarchical structure lies in its ability to merge patches progressively at deeper layers of the network, effectively building a pyramid of features with increasing semantic levels and decreasing spatial resolutions. The hierarchical representation can be mathematically formulated as:

$$\begin{aligned} P_{l+1} = M(P_l) \end{aligned}$$
(2)

where \(P_l\) represents the set of non-overlapping patches or their feature representations at layer \(l\), and \(M\) denotes the patch merging operation that combines adjacent patches to form \(P_{l+1}\), the input for the next layer.
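As a concrete reference, the following is a minimal PyTorch sketch of the patch-merging operation M in Eq. (2): features of 2×2 neighbouring patches are concatenated channel-wise and linearly projected, halving the spatial resolution and doubling the channel width. The dimensions shown (56×56×96, typical of Swin-T's first stage) are illustrative.

```python
# Sketch of the patch merging M(P_l) in Eq. (2).
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):               # x: (B, H, W, C), H and W assumed even
        x0 = x[:, 0::2, 0::2, :]        # top-left patch of every 2x2 block
        x1 = x[:, 1::2, 0::2, :]        # bottom-left
        x2 = x[:, 0::2, 1::2, :]        # top-right
        x3 = x[:, 1::2, 1::2, :]        # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

feats = torch.randn(1, 56, 56, 96)      # e.g. a stage-1 feature map of Swin-T
print(PatchMerging(96)(feats).shape)    # torch.Size([1, 28, 28, 192])
```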

The swin transformer introduces the shifted window multi-head self-attention (SW-MSA) mechanism as a means to efficiently compute self-attention within local windows while also facilitating cross-window connection in subsequent layers. This approach significantly reduces the computational demands of traditional self-attention mechanisms, making it feasible to apply transformers to high-resolution images. The SW-MSA can be described as follows:

$$\begin{aligned} \hbox {SW-MSA}(Z_l) = \hbox {W-MSA}(\hbox {SHIFT}(Z_l)) \end{aligned}$$
(3)

where \(Z_l\) is the input feature map to layer \(l\), \(\hbox {SHIFT}\) is an operation that cyclically shifts the window partitions to enable cross-window connections, and \(\hbox {W-MSA}\) denotes the window-based multi-head self-attention.

The adaptive representation of features within SAR images by the swin transformer is achieved through the combination of hierarchical structuring and the SW-MSA mechanism. This dual approach allows the model to maintain high-resolution details in early layers while aggregating more abstract semantic information in deeper layers. The process of feature extraction and representation is encapsulated in the equation:

$$\begin{aligned} Z_{l+1} = \hbox {SW-MSA}(\hbox {Norm}(Z_l)) + \hbox {MLP}(\hbox {Norm}(\hbox {SW-MSA}(Z_l))) + Z_l \end{aligned}$$
(4)

where \(\hbox {Norm}\) represents layer normalization, and \(\hbox {MLP}\) denotes a multi-layer perceptron that is applied to the output of the SW-MSA block, followed by a residual connection that adds the input feature map \(Z_l\) to the output. This formula underscores the iterative refinement of features through self-attention and nonlinear transformations, enabling the model to capture complex dependencies and features relevant for SAR image analysis.
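To make the SHIFT and W-MSA operations in Eqs. (3)–(4) concrete, the sketch below cyclically shifts the feature map with torch.roll, partitions it into non-overlapping windows, and applies standard multi-head attention within each window. The attention mask that blocks interaction between wrapped-around regions is omitted, so this is a simplified illustration rather than a full Swin implementation.

```python
# Simplified sketch of SHIFT + W-MSA from Eq. (3); the mask for wrapped-around
# regions is omitted for brevity.
import torch
import torch.nn as nn

def window_partition(x, ws):                     # x: (B, H, W, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)   # (num_windows*B, ws*ws, C)

B, H, W, C, ws = 1, 56, 56, 96, 7
z = torch.randn(B, H, W, C)

shift = ws // 2
shifted = torch.roll(z, shifts=(-shift, -shift), dims=(1, 2))     # cyclic SHIFT
windows = window_partition(shifted, ws)                           # local windows

attn = nn.MultiheadAttention(embed_dim=C, num_heads=3, batch_first=True)
out, _ = attn(windows, windows, windows)                          # W-MSA within each window
print(out.shape)                                                  # torch.Size([64, 49, 96])
```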

3 Methodology

3.1 Overall framework

In this work, we propose the SAR-ShipSwin (Synthetic Aperture Radar Ship Detection with Swin Transformer Integration) model, built upon the Faster R-CNN framework. The model comprises a backbone network structure and a background modeling network (BMN). The backbone network structure combines the swin transformer and feature pyramid network (FPN) to effectively address ship detection challenges in SAR images, especially when it comes to resolving the resolution requirements for small targets.

Furthermore, to tackle issues related to target occlusion and overlap in SAR images, we introduce the occlusion perceptive window multihead self-attention (OPW-MSA). In order to better capture the features of irregularly shaped ships, the model also employs the spatial intensity geometric pooling method and the dynamic ship shape adaptive convolution module. The overall architecture of the SAR-ShipSwin model is depicted in Fig. 1.

Fig. 1 Model architecture of SAR-ShipSwin

3.2 Backbone network structure

To efficiently address ship detection in SAR images, we propose a backbone network that combines the swin transformer and the feature pyramid network (FPN). This design takes into full consideration the multi-scale nature of images and the resolution requirements for small ship targets.

Traditional feature pyramid networks (FPN) [29] enhance feature representations by combining low-level positional information with high-level semantic information. By incorporating FPN, we can significantly enhance the feature map resolution for small targets, which is crucial for small ship detection in SAR images. Considering the various model options of Swin Transformer, we employ the lightweight Swin-T as the basic unit in this paper. Swin-T consists of four stages, and the features generated in each stage undergo initial feature adjustment with a \(1 \times 1\) convolution. Subsequently, they are fused with features from other stages through upsampling. The fused feature maps then go through a \(3 \times 3\) convolution for further feature extraction and output, as illustrated in Fig. 2.

Fig. 2 Multi-scale fusion network structure of swin-T and FPN

The swin transformer, built on the ViT architecture [30], introduces hierarchical priors, locality, and translation invariance, improving computational efficiency and performance [31]. Its shifted-window operation ensures information interaction between adjacent windows, giving the model the ability to model global information while significantly reducing computation.

Fig. 3 Swin transformer blocks structure

The core structure of swin transformer blocks is depicted in Fig. 3. To enhance the model’s information exchange capability without increasing computational complexity, we make improvements on the original W-MSA module. To address the issue of target occlusion and overlap in SAR images, we propose the occlusion perceptive window multihead self-attention (OPW-MSA).

Before performing multi-head self-attention computation, a small neural network is used to generate an occlusion score for each pixel. This score represents the degree to which the pixel is occluded, helping us identify areas where overlap or occlusion of targets may occur. We refer to this step as occlusion perceptive mapping (OPM). To assess the degree of occlusion for a region or pixel, this paper employs local gradient information from the image. High gradients may indicate the presence of boundaries, and boundaries may signify target occlusion or overlap. The computation formula for occlusion perceptive mapping (OPM) is as follows:

$$\begin{aligned} \hbox {OPM}(x)=\nabla x\cdot w_\textrm{opm}+b_\textrm{opm} \end{aligned}$$
(5)

where \(\nabla x\) represents the gradient of pixel value x, \(w_\textrm{opm}\) and \(b_\textrm{opm}\) are learnable parameters adjusted during training to maximize occlusion recognition.

Based on the output of OPM, weights are dynamically assigned to each pixel. These weights are proportional to the occlusion scores, meaning that highly occluded areas receive higher weights:

$$\begin{aligned} \hbox {DAW}(x)=\frac{\hbox {OPM}(x)}{\sum _{i}{\hbox {OPM}(x_i)}} \end{aligned}$$
(6)

where \(\hbox {DAW}(x)\) represents dynamically allocated weights.
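A minimal sketch of Eqs. (5)–(6) is given below, assuming Sobel filters for the local gradient, a 1×1 convolution standing in for the learnable parameters \(w_\textrm{opm}\) and \(b_\textrm{opm}\), and a softplus to keep the occlusion scores non-negative before normalisation; these specific choices are assumptions for illustration.

```python
# Sketch of occlusion perceptive mapping (Eq. 5) and dynamic weight allocation
# (Eq. 6). Sobel gradients and softplus positivity are assumptions; the paper
# only specifies a learnable affine map of the local gradient.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OcclusionPerceptiveMapping(nn.Module):
    def __init__(self):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", sobel_x.view(1, 1, 3, 3))
        self.register_buffer("ky", sobel_x.t().contiguous().view(1, 1, 3, 3))
        self.affine = nn.Conv2d(1, 1, kernel_size=1)   # plays the role of w_opm, b_opm

    def forward(self, x):                              # x: (B, 1, H, W) intensity map
        gx = F.conv2d(x, self.kx, padding=1)
        gy = F.conv2d(x, self.ky, padding=1)
        grad = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)    # local gradient magnitude
        opm = F.softplus(self.affine(grad))            # non-negative occlusion scores
        daw = opm / opm.sum(dim=(2, 3), keepdim=True)  # Eq. (6): weights sum to 1 per image
        return opm, daw

opm, daw = OcclusionPerceptiveMapping()(torch.rand(2, 1, 64, 64))
print(opm.shape, float(daw[0].sum()))                  # weights sum to ~1.0
```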

The primary stage consists of two layer normalizations (LN), a window multihead self-attention (W-MSA) mechanism, and a multi-layer perceptron (MLP). In this stage, the W-MSA module partitions the image into non-overlapping windows, effectively reducing the model’s computational burden. To overcome the information-exchange barrier caused by non-overlapping windows, the advanced stage replaces the W-MSA module with the occlusion perceptive window multihead self-attention (OPW-MSA) mechanism. The remaining parts retain LN and MLP to construct the residual connections. The specific computation process is as follows:

$$\begin{aligned} \begin{matrix}{\hat{Z}}^l=\hbox {W-MSA}\left( LN(Z^{l-1})\right) +Z^{l-1},\\ Z^l=\hbox {MLP}(LN({\hat{Z}}^l))+{\hat{Z}}^l,\\ {\hat{Z}}^{l+1}=\hbox {OPW-MSA}\left( LN(Z^l)\right) +Z^l,\\ Z^{l+1}=\hbox {MLP}(LN({\hat{Z}}^{l+1}))+{\hat{Z}}^{l+1}\\ \end{matrix} \end{aligned}$$
(7)
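The block pairing in Eq. (7) can be sketched as two pre-norm residual units, the first using W-MSA and the second using OPW-MSA. In the snippet below both attentions are passed in as placeholder callables, since the full OPW-MSA implementation is not reproduced here.

```python
# Sketch of the paired blocks in Eq. (7): a W-MSA block followed by an
# OPW-MSA block, each wrapped with pre-LayerNorm, an MLP, and residuals.
# `w_msa` and `opw_msa` are assumed callables of shape (B, N, C) -> (B, N, C).
import torch
import torch.nn as nn

class SwinPair(nn.Module):
    def __init__(self, dim, w_msa, opw_msa, mlp_ratio=4):
        super().__init__()
        self.w_msa, self.opw_msa = w_msa, opw_msa
        self.ln = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.mlp = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                          nn.Linear(dim * mlp_ratio, dim)) for _ in range(2)])

    def forward(self, z):
        z_hat = self.w_msa(self.ln[0](z)) + z          # W-MSA block
        z = self.mlp[0](self.ln[1](z_hat)) + z_hat
        z_hat = self.opw_msa(self.ln[2](z)) + z        # OPW-MSA block
        return self.mlp[1](self.ln[3](z_hat)) + z_hat

identity = lambda t: t                                  # stand-ins for the two attentions
block = SwinPair(96, identity, identity)
print(block(torch.randn(2, 49, 96)).shape)              # torch.Size([2, 49, 96])
```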

3.3 Background modeling network (BMN)

This sub-network focuses on identifying and eliminating complex background features, allowing the RPN to concentrate more on target extraction. BMN comprises a background feature extraction layer, a background attention module, and a background weakening module.

The background feature extraction layer consists of 2–3 convolutional layers, each followed by Batch Normalization [32] and ReLU [33] activation functions. These layers are primarily used to extract low-level background features. The computation formula for the background feature extraction layer is as follows:

$$\begin{aligned} \begin{matrix}X_1=\textrm{ReLU} (\textrm{BN} (W_1*X+b_1))\\ X_2=\textrm{ReLU} (\textrm{BN} (W_2*X_1+b_2))\\ X_b=\textrm{ReLU} (\textrm{BN} (W_3*X_2+b_3))\\ \end{matrix} \end{aligned}$$
(8)

where \(X\in {\mathbb {R}}^{H\times W\times C}\) is the input feature map, with H and W the height and width of the feature map and C the number of channels, and \(X_b\in {\mathbb {R}}^{H\times W\times C^\prime }\) is the output feature map with \(C'\) channels. \(*\) denotes the convolution operation, \(W_1, W_2, W_3\) and \(b_1, b_2, b_3\) are the convolution weights and biases, and BN denotes batch normalization.

The background attention module, a core component, employs spatial attention mechanisms to help the network focus on prominent background features:

$$\begin{aligned} \begin{matrix}A=\sigma (W_a*X_b+b_a)\\ X_a=X_b\odot A\\ \end{matrix} \end{aligned}$$
(9)

where \(\odot\) denotes element-wise multiplication, \(W_a\) and \(b_a\) are the weights and biases of the attention module, and \(\sigma\) is the sigmoid activation function.

The background weakening module modulates the background intensity so that the region proposal network (RPN) can focus on target extraction. It incorporates a mask layer that generates a binary-like mask aligned with the spatial dimensions of the feature pyramid network (FPN) output. The mask layer is realized through a convolution followed by a sigmoid activation \(\sigma\), which maps the input features into the [0, 1] range: values approaching 0 indicate background regions, while values nearing 1 delineate the foreground, yielding a discriminative representation of scene elements. The operation of the background weakening module can be expressed as follows:

$$\begin{aligned} \begin{matrix}M=\sigma (W_m*X_a+b_m)\\ X_f=X_a\odot (1-M)+X\odot M\\ \end{matrix} \end{aligned}$$
(10)

where \(W_m\) and \(b_m\) are the weights and biases of the mask generation layer, and \(X_f\) is the resulting feature map in which background regions are weakened (weighted by \(1-M\)) while foreground regions remain unchanged (weighted by \(M\)).

Finally, the output mask from the background weakening module is element-wise multiplied with the output of FPN to obtain a feature map with weakened background. This fused feature map serves as the input to RPN for region proposal generation.
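The following is a hedged PyTorch sketch of the BMN pipeline in Eqs. (8)–(10): three Conv–BN–ReLU layers, a sigmoid spatial attention, and a sigmoid mask that blends the attended features with the original input. The channel widths and 3×3 kernel sizes are illustrative assumptions.

```python
# Sketch of the background modeling network (Eqs. 8-10). Channel widths and
# 3x3 kernels are assumptions; only the layer roles follow the text.
import torch
import torch.nn as nn

class BackgroundModelingNetwork(nn.Module):
    def __init__(self, c_in=256, c_mid=256):
        super().__init__()
        def conv_bn_relu(ci, co):
            return nn.Sequential(nn.Conv2d(ci, co, 3, padding=1),
                                 nn.BatchNorm2d(co), nn.ReLU(inplace=True))
        self.extract = nn.Sequential(conv_bn_relu(c_in, c_mid),
                                     conv_bn_relu(c_mid, c_mid),
                                     conv_bn_relu(c_mid, c_in))        # Eq. (8)
        self.attn = nn.Conv2d(c_in, c_in, 3, padding=1)                # Eq. (9)
        self.mask = nn.Conv2d(c_in, 1, 3, padding=1)                   # Eq. (10)

    def forward(self, x):                         # x: FPN feature map (B, C, H, W)
        x_b = self.extract(x)                     # background features
        a = torch.sigmoid(self.attn(x_b))
        x_a = x_b * a                             # attended background features
        m = torch.sigmoid(self.mask(x_a))         # ~0 background, ~1 foreground
        x_f = x_a * (1 - m) + x * m               # weaken background, keep foreground
        return x_f, m

x_f, m = BackgroundModelingNetwork()(torch.randn(1, 256, 64, 64))
print(x_f.shape, m.shape)
```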

3.4 Spatial intensity geometric pooling

The fast R-CNN architecture employs ROI (region of interest) Pooling to extract region proposal features, generating fixed-size feature maps that are then passed to fully connected layers for classification and bounding box regression to complete object detection [34]. However, since the sizes of region proposals generated by the RPN (region proposal network) can vary, performing ROI Pooling with block-wise pooling to obtain fixed \(7 \times 7\)-sized feature maps can disrupt the structural information of the original image, leading to imprecise object localization. Additionally, the feature maps generated by the multi-scale fusion FPN (feature pyramid network) in this paper can have inconsistent sizes, resulting in extreme aspect ratios. This can lead to significant mapping discrepancies, causing feature loss. Particularly in SAR (synthetic aperture radar) images, forcing ROIs of different sizes and shapes into uniform fixed-size feature maps through ROI Pooling may destroy or distort the structural information of small objects, such as small vessels.

To address the aforementioned issues, this paper proposes a geometrically preserving sampling method that avoids the use of traditional max or average pooling. Instead, it introduces a new operation called spatial intensity geometric pooling (SIG-pooling), which takes into account both the spatial distribution and intensity information within ROIs to calculate pooling values.

Consider an ROI region R with dimensions \(h \times w\), which is divided into an \(m \times n\) grid of sub-region cells. For each sub-region cell \(g_{ij}\), a geometric weighting factor \(G_{ij}\) is defined, calculated based on the spatial distribution and intensity information of the ROI.

$$\begin{aligned} G_{ij}=\frac{1}{h\times w}\sum _{x=1}^{h}\sum _{y=1}^{w}I\left( x,y\right) \times D_{ij}\left( x,y\right) \end{aligned}$$
(11)

where \(I(x,y)\) is the intensity value of the ROI at coordinates \((x,y)\), and \(D_{ij}(x,y)\) is the reciprocal of the distance from the center of sub-region cell \(g_{ij}\) to coordinates \((x,y)\). This ensures that pixels closer to the cell center are given higher weight.

For each sub-region cell \(g_{ij}\), its geometrically preserved pooling value \(P_{ij}\) is computed as follows:

$$\begin{aligned} P_{ij}=\frac{1}{h\times w}\sum _{x=1}^{h}\sum _{y=1}^{w}I\left( x,y\right) \times G_{ij} \end{aligned}$$
(12)

After the SIG-Pooling operation, the ROI region R is transformed into a new feature map of size \(m \times n\), where each element represents the geometrically preserved pooling value of its corresponding sub-region cell. This geometrically preserved pooling method, by combining spatial and intensity information, better retains the geometric and structural information of the original ROI, reducing information loss and distortion, especially for small objects like small vessels in SAR images.
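A minimal NumPy sketch of SIG-Pooling for a single ROI is given below, following Eqs. (11)–(12) and taking \(D_{ij}(x,y)\) as the reciprocal of the distance from the centre of cell \(g_{ij}\); the \(7\times 7\) output grid and the small epsilon added to the distance are illustrative choices.

```python
# Sketch of spatial intensity geometric pooling (Eqs. 11-12) for one ROI.
import numpy as np

def sig_pooling(roi: np.ndarray, m: int = 7, n: int = 7) -> np.ndarray:
    h, w = roi.shape
    ys, xs = np.mgrid[0:h, 0:w] + 0.5                        # pixel centres
    pooled = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            cy, cx = (i + 0.5) * h / m, (j + 0.5) * w / n    # centre of cell g_ij
            d = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2) + 1e-6
            g_ij = np.mean(roi * (1.0 / d))                  # Eq. (11): intensity x inverse distance
            pooled[i, j] = np.mean(roi * g_ij)               # Eq. (12), as formulated in the text
    return pooled

roi = np.random.rand(35, 18)                                 # an elongated ship ROI
print(sig_pooling(roi).shape)                                # (7, 7)
```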

3.5 Dynamic ship shape adaptive convolution

Considering that vessels in SAR (synthetic aperture radar) images have variable and often irregular shapes, this paper introduces the dynamic ship shape adaptive convolution (DSAC) module to adapt to different vessel shapes and sizes in a specific manner. Unlike traditional convolutional kernels with fixed shapes, DSAC dynamically adjusts the shape of the convolutional kernel to fit the current target, allowing for more accurate capturing of irregular vessel features. The dynamic ship shape adaptive convolution module comprises three sub-modules: shape recognition, shape-adaptive convolution, and convolution operations.

(1) Shape Recognition Sub-module

For precise object detection, considering the shape information of the target can greatly benefit feature extraction. In SAR images where vessel shapes vary, recognizing their shapes can assist subsequent convolutional operations in extracting features more effectively. This module performs a classification task and can be defined as:

$$\begin{aligned} f_\textrm{shape}\left( x\right) ={\hbox {Softmax}}\left( W_\textrm{shape}\times x+b_\textrm{shape}\right) \end{aligned}$$
(13)

where x is the feature representation of the input ROI, and \(W_\textrm{shape}\) and \(b_\textrm{shape}\) are the weights and biases, respectively.

(2) Shape-Adaptive Convolution

To ensure that convolutional operations are more tailored, we need to take into account the specific shape of the input ROI. Therefore, we propose a dynamic kernel selection strategy that chooses the most suitable convolutional kernel based on the predicted shape of the input ROI. Based on the output of the shape recognition sub-module, we select a convolutional kernel set that best matches the predicted shape:

$$\begin{aligned} K=g(f_\textrm{shape}(x)) \end{aligned}$$
(14)

where g is a function that selects the appropriate convolutional kernel based on the output of the shape recognition sub-module. If \(f_\textrm{shape}(x)_\textrm{long}>f_\textrm{shape}(x)_\textrm{flat}\), \(K_\textrm{long}\) is chosen; otherwise, \(K_\textrm{flat}\) is chosen. \(K_\textrm{long}\) and \(K_\textrm{flat}\) are predefined convolutional kernels suited to “elongated” and “flat” shapes, respectively. For elongated vessels, we define a convolutional kernel \(K_\textrm{long}\) suitable for capturing vertical edges:

$$\begin{aligned} K_\textrm{long}=\left[ \begin{matrix}-1&{} \quad 2&{} \quad -1\\ -1&{} \quad 2&{} \quad -1\\ -1&{} \quad 2&{} \quad -1\\ \end{matrix}\right] \end{aligned}$$
(15)

For flat-shaped vessels, we can define a convolutional kernel \(K_\textrm{flat}\) that is suitable for capturing horizontal edges:

$$\begin{aligned} K_\textrm{flat}=\left[ \begin{matrix}-1&{} \quad -1&{} \quad -1\\ 2&{} \quad 2&{} \quad 2\\ -1&{} \quad -1&{} \quad -1\\ \end{matrix}\right] \end{aligned}$$
(16)

(3) Convolution Operations

After determining the most appropriate convolutional kernel, convolution operations are performed to extract features:

$$\begin{aligned} y(p)=\sum _{q\in K}W_\textrm{conv}\cdot x(p+q)+b_\textrm{conv} \end{aligned}$$
(17)

where p represents a pixel position in the feature map, q represents a position in the kernel K, and \(W_\textrm{conv}\) and \(b_\textrm{conv}\) are the weights and biases of the convolution operation.
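The DSAC pipeline of Eqs. (13)–(17) can be sketched as follows: a small classifier predicts “elongated” versus “flat”, the matching kernel from Eqs. (15)–(16) is selected, and a per-channel convolution is applied. The classifier architecture (global average pooling plus a linear layer) is an assumption.

```python
# Sketch of dynamic ship shape adaptive convolution (Eqs. 13-17). The shape
# classifier is a placeholder; K_long / K_flat follow Eqs. (15)-(16).
import torch
import torch.nn as nn
import torch.nn.functional as F

K_LONG = torch.tensor([[-1., 2., -1.], [-1., 2., -1.], [-1., 2., -1.]])   # vertical edges
K_FLAT = torch.tensor([[-1., -1., -1.], [2., 2., 2.], [-1., -1., -1.]])   # horizontal edges

class DSAC(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.shape_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(channels, 2))            # Eq. (13)
        self.channels = channels

    def forward(self, x):                                   # x: ROI features (B, C, H, W)
        probs = F.softmax(self.shape_head(x), dim=1)        # [p_long, p_flat] per ROI
        outs = []
        for b in range(x.size(0)):
            k = K_LONG if probs[b, 0] > probs[b, 1] else K_FLAT             # Eq. (14)
            weight = k.view(1, 1, 3, 3).repeat(self.channels, 1, 1, 1)
            outs.append(F.conv2d(x[b:b + 1], weight, padding=1,
                                 groups=self.channels))                     # Eq. (17), per channel
        return torch.cat(outs, dim=0)

print(DSAC(64)(torch.randn(2, 64, 7, 7)).shape)             # torch.Size([2, 64, 7, 7])
```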

3.6 Loss function

Similar to the faster R-CNN model, the loss consists of two components: classification loss and regression loss. The cross-entropy loss is adopted as the classification loss. Given a true label y for the target category and the model’s predicted class probability distribution p, the cross-entropy loss can be defined as:

$$\begin{aligned} L_{\textrm{cls}}=-\sum _{i=1}^{C}{y_i\log (p_i)} \end{aligned}$$
(18)

where C is the number of categories, \(y_i\) is the ith element of the true label, taking the value 1 if the target belongs to the ith class and 0 otherwise, and \(p_i\) is the model’s predicted probability for the ith class.

For the regression loss, the Smooth L1 loss is adopted to reduce the impact of outliers when predicting bounding boxes. Given the true bounding box coordinates \(t^*\) and the model’s predicted coordinates t, the Smooth L1 loss can be defined as:

$$\begin{aligned} L_\textrm{reg}=\left\{ \begin{array}{ll} 0.5x^2 &{} \quad \hbox {if}\;\left| x\right| <1\\ \left| x\right| -0.5 &{} \quad \hbox {otherwise} \end{array}\right. \end{aligned}$$
(19)

where \(x=t-t^*\) represents the difference between predicted and true coordinates.

Furthermore, a Geometric Pooling Loss is designed specifically for SIG-Pooling to ensure that the feature map after pooling effectively preserves the geometric and structural information of the original ROI. This is achieved by computing the cosine similarity between the features before and after pooling:

$$\begin{aligned} L_{\textrm{geometric}}=1-\frac{\hbox {feature}_{\mathrm{pre-pooling}}\cdot \hbox {feature}_{\mathrm{post-pooling}}}{\left\| \hbox {feature}_{\mathrm{pre-pooling}}\right\| \left\| \hbox {feature}_{\mathrm{post-pooling}}\right\| } \end{aligned}$$
(20)

The objective of this loss function is to minimize the cosine distance between the two feature vectors, ensuring that the post-pooling feature aligns directionally with the pre-pooling feature, thus preserving the geometric and structural information of the original ROI. \(\hbox {feature}_\mathrm{pre-pooling}\) and \(\hbox {feature}_\mathrm{post-pooling}\) represent the feature vectors before and after the pooling operation, respectively, and \(\left\| \cdot \right\|\) denotes the norm operation.

The overall loss L can be defined as:

$$\begin{aligned} L=L_\textrm{cls}+\lambda _{\textrm{reg}}L_\textrm{reg}+\lambda _\textrm{geo}L_\textrm{geometric} \end{aligned}$$
(21)

where \(\lambda _\textrm{reg}\) and \(\lambda _\textrm{geo}\) are weighting factors used to balance the loss components.
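A hedged sketch of the overall loss in Eq. (21) is shown below, combining built-in cross-entropy and Smooth L1 losses with a cosine-similarity term for the geometric pooling loss; the weighting factors and the assumption that pre- and post-pooling features are flattened to a common dimensionality are placeholders.

```python
# Sketch of the overall loss (Eq. 21): cross-entropy (Eq. 18), Smooth L1
# (Eq. 19), and the geometric pooling loss (Eq. 20). Lambdas are placeholders.
import torch
import torch.nn.functional as F

def sar_shipswin_loss(cls_logits, labels, box_pred, box_target,
                      feat_pre, feat_post, lam_reg=1.0, lam_geo=0.1):
    l_cls = F.cross_entropy(cls_logits, labels)                       # Eq. (18)
    l_reg = F.smooth_l1_loss(box_pred, box_target)                    # Eq. (19)
    cos = F.cosine_similarity(feat_pre.flatten(1), feat_post.flatten(1), dim=1)
    l_geo = (1.0 - cos).mean()                                        # Eq. (20)
    return l_cls + lam_reg * l_reg + lam_geo * l_geo                  # Eq. (21)

loss = sar_shipswin_loss(torch.randn(8, 2), torch.randint(0, 2, (8,)),
                         torch.randn(8, 4), torch.randn(8, 4),
                         torch.randn(8, 49), torch.randn(8, 49))
print(float(loss))
```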

4 Experiments

4.1 Experimental setup

Datasets To validate the effectiveness of the proposed model, experiments were conducted on the SSDD dataset [35] and the HRSID dataset [36]. The SSDD dataset comprises 1160 images with an average of 2.12 ships per image. It includes SAR images from three sensors, Sentinel-1, TerraSAR-X, and RADARSAT-2, captured in HH, VV, VH, and HV polarization modes. The images have resolutions ranging from 1 to 15 m and cover large maritime areas as well as coastal regions containing various ship targets. The HRSID dataset, released in January 2020, is a large SAR dataset for object detection. It contains 16,951 ship instances in 5604 high-resolution SAR images from the Sentinel-1B, TerraSAR-X, and TanDEM-X sensors and supports applications such as semantic segmentation, ship detection, and instance segmentation. Following the construction of the Microsoft COCO dataset, HRSID includes multi-source remote sensing images with different resolutions, polarizations, sea conditions, maritime areas, and coastal ports, with image resolutions ranging from 1 to 5 m.

Hardware and software environment The experimental hardware environment consists of an Intel Core i9-11900K CPU, 32GB of memory, and an NVIDIA GeForce RTX 3080 GPU. The operating system used is Ubuntu 20.04, and the deep learning framework employed is PyTorch.

Hyperparameters The stochastic gradient descent (SGD) algorithm is used to train our network with a batch size of 16. For our ablation experiments, the network undergoes a total of 300 epochs. A learning rate of 0.01, weight decay of 0.0005, and SGD momentum of 0.937 are set. Other unspecified hyperparameters are kept consistent with YOLOv5. Additionally, when comparing with other methods, we configure parameters similarly to ensure a fair comparison.

Evaluation metrics In this experiment, we utilize three metrics, namely Precision, Recall, and mean Average Precision (mAP), to analyze and verify the detection performance of the proposed method. mAP can be calculated from the Precision and Recall metrics.
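For reference, the snippet below sketches how AP can be computed from precision and recall for a single class using all-point interpolation; mAP is then the mean of per-class AP values. This is a generic illustration, not the exact evaluation script used in our experiments.

```python
# Generic sketch of precision/recall/AP from scored detections (single class).
# tp_flags marks each detection (sorted by confidence) as true or false positive.
import numpy as np

def average_precision(tp_flags: np.ndarray, num_gt: int) -> float:
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1 - tp_flags)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # All-point interpolation: make precision monotonically non-increasing.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.trapz(precision, recall))

dets = np.array([1, 1, 0, 1, 0])          # detections sorted by confidence
print(round(average_precision(dets, num_gt=4), 3))
```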

4.2 Comparative experimental results

Table 1 presents a performance comparison of SAR-ShipSwin with other models on the SSDD dataset. In terms of the mAP metric, both LS-SSDD and SAR-ShipSwin exhibit outstanding performance, surpassing the other methods, with SAR-ShipSwin achieving the highest mAP of 98.02%. Regarding precision (P), SAR-ShipSwin leads with 96.53%, followed by LS-SSDD at 96.10%, reaffirming SAR-ShipSwin’s superior detection accuracy. Quad-FPN achieves a relatively high P of 95.77%, but its recall (R) is lower, resulting in a slightly lower overall mAP. HR-SDNet achieves the highest R at 96.49%, but its lower P yields an overall mAP of 90.82%. SAR-ShipSwin ranks third in R at 94.57%, yet its higher P places it first in overall mAP.

Table 1 Comparison with other SAR ship detection methods on the SSDD dataset

Table 2 provides a performance comparison of SAR-ShipSwin with other models on the HRSID dataset. The comparison results on the HRSID dataset demonstrate that SAR-ShipSwin exhibits outstanding performance across different scenarios, including overall scenes, nearshore scenes, and scenes far from the shore. Particularly noteworthy are its achievements in overall scenes and scenes far from the shore, where SAR-ShipSwin achieves the highest mAP values of 92.35% and 97.92%, respectively, outperforming all other models.

SAR-ShipSwin leverages a combination of the Swin Transformer and FPN to handle complex SAR images, providing robust feature extraction capabilities. The inclusion of the OPW-MSA module, in particular, enhances the model’s ability to address the occlusion and overlap challenges prevalent in SAR images, thereby improving detection accuracy in complex scenarios. Complex backgrounds pose a significant challenge in SAR ship detection; the BMN effectively removes complex background features, reducing the impact of background noise on detection. The spatial intensity geometric pooling and dynamic ship shape adaptive convolution modules ensure that the model retains the structural information of ROIs during detection and adapts to various ship shapes and sizes, ultimately enhancing detection accuracy.

Table 2 Comparison with other SAR ship detection methods on the HRSID dataset

4.3 Ablation study results

In this paper, four core modules were designed within the Faster R-CNN framework: the backbone network structure, background modeling network (BMN), spatial intensity geometric pooling, and dynamic ship shape adaptive convolution (DSAC). To better evaluate the contributions of these modules to the SAR-ShipSwin model’s performance, ablation experiments were conducted on the SSDD dataset, and the results are presented in Table 3.

Table 3 Ablation study

We initially used the Faster R-CNN base model as the evaluation starting point, which already achieved a relatively high mAP of 89.15% on the SSDD dataset. Incorporating the backbone network structure, which combines the Swin Transformer and FPN, improved performance to an mAP of 90.45%, underscoring the superiority of this backbone in feature extraction. Introducing the background modeling network (BMN) then helped the model accurately identify targets in SAR images with complex backgrounds, further increasing the mAP to 92.02% and highlighting BMN’s contribution in complex scenarios. Finally, applying spatial intensity geometric pooling (SIG-Pooling) raised the mAP to 98.02%, the largest single improvement, indicating that considering spatial and intensity information within ROIs allows the model to better retain the geometric and structural information of the original ROIs and produce more accurate detections.

To validate the loss function designed in this paper, ablation experiments were conducted to compare the performance of the SAR-ShipSwin model under different loss function configurations. Initially, the SAR-ShipSwin model was trained using the Faster R-CNN loss. Subsequently, the model was trained using two different loss function configurations: classification loss \(+\) regression loss, and classification loss \(+\) regression loss \(+\) Geometric Pooling Loss. To ensure the reliability of the results, experiments were conducted on the SSDD dataset using various configurations of loss functions, and repeated five times. The experimental results are shown in Fig. 4.

Fig. 4 Comparative experiments of different loss functions

As can be seen from the figure, although the performance of all loss function configurations exhibits certain fluctuations, the complete loss function configuration (including classification, regression, and geometric pooling losses) overall demonstrates higher and more stable performance. This validates the effectiveness of our designed loss function in enhancing the accuracy and stability of the SAR-ShipSwin model in detecting ships in complex SAR images.

4.4 Visualization results

Figure 5 displays the visualization results for the SSDD dataset. As seen, our proposed ship detection method exhibits strong performance both near the shore and in offshore areas.

Fig. 5 Visualization of SAR-ShipSwin on the SSDD dataset

Figure 6 showcases the visualization results for the HRSID dataset. It is evident that the ship detection method proposed in this paper performs well in both nearshore and offshore areas.

Fig. 6 Visualization of the HRSID dataset

5 Conclusion

SAR ship detection has long been a hot research topic in maritime target detection. It faces two major challenges: first, the difficulty of distinguishing ship targets from complex backgrounds in SAR images, especially under varied meteorological and sea conditions; second, the variability of ship target shapes in SAR images, along with severe occlusion and overlap [47]. In response to these issues, this paper proposes the SAR-ShipSwin model. Tailored to the characteristics of ship targets in SAR images, the model introduces a backbone network that combines the Swin Transformer and FPN, effectively extracting features while optimizing computational efficiency.

Furthermore, we propose a Background Modeling Network designed specifically to identify and eliminate complex background features, thereby improving the accuracy of target detection. Finally, considering the variability in ship shapes in SAR images, we design the Dynamic Ship Shape Adaptive Convolution module, which dynamically adjusts the shape of convolutional kernels, further enhancing detection accuracy.

Through extensive comparative experiments, ablation studies, and generalization experiments, our SAR-ShipSwin demonstrates superior detection performance compared to existing baselines and some state-of-the-art algorithms. This confirms that our algorithm not only exhibits efficient detection performance but also demonstrates excellent generalization capabilities. In the future, efforts will be directed toward improving the performance of the SAR-ShipSwin model in detecting extremely small or highly occluded targets. Moreover, integrating reputation management mechanisms could further enhance our model’s robustness and reliability in dynamic environments [48]. Additionally, the adoption of ConvLSTM-based approaches for improving signal processing may refine our model’s ability to handle complex noise patterns and thus improve detection accuracy in challenging scenarios [49].