1 Introduction

Ship target detection based on remote sensing images has become an increasingly important research area for coastal countries, driven by advancements in remote sensing technology. Synthetic Aperture Radar (SAR) is an active microwave remote sensing system with several advantages: it operates effectively in all weather conditions and at any time of day, unaffected by natural factors such as extreme weather or light intensity [1]. The adaptability of SAR to the variability of oceanic climates makes it well suited for comprehensive real-time ship target detection [2]. Consequently, the utilization of SAR images for ship detection has emerged as a prominent and research-intensive area within the field of target detection [3].

Ship target detection techniques utilizing SAR images can be categorized into two main groups: traditional detection methodologies and deep learning-based detection procedures [4]. Traditional approaches to ship target detection in SAR images typically eliminate regions unlikely to contain the target from large scene images and select potential regions of interest. However, these selected regions often include numerous false alarms due to noise, necessitating further target identification and classification. Additionally, traditional detection methods generally rely on gray-level statistics, edge information, and other fuzzy edge detection algorithms [5]. When applied to complex backgrounds such as offshore terminals, these methods suffer from missed detections and high false alarm rates. Furthermore, the feature representations employed by traditional ship target detection techniques are typically designed manually, and the interpretation of SAR images depends heavily on the professional expertise and work experience of the personnel involved, leaving the detection and recognition algorithms weak in robustness and generalization ability [6].

The rapid advancement of computing power, coupled with the emergence of artificial intelligence technologies, has significantly enhanced the training efficiency of deep learning approaches. These advances enable efficient processing of multi-dimensional data and demonstrate substantial application potential in the fields of computer vision and object detection [7]. Deep learning, a subset of machine learning, constructs deep neural network models with multiple hidden layers to learn multi-layered feature information from targets. This process entails deep-level feature transformation and extraction from original image data, leading to abstract high-level representations of targets and facilitating target detection [8]. Deep learning techniques excel at extracting deep features through continuous training, demonstrating robust adaptive learning capabilities. Given the hierarchical structure and extensive parameterization of deep learning models, they are well suited to large-scale data environments [9]. By integrating feature extraction and classification into a unified framework and leveraging data-driven feature learning, deep learning approaches effectively address the limitations of manual feature design inherent in traditional SAR target detection methods, which are time-consuming, labor-intensive, and difficult to adapt to complex environments. At present, deep learning methods have been widely adopted in the field of image processing, consistently delivering exceptional performance [10]. Therefore, integrating deep learning technology into ship target detection tasks utilizing SAR images holds significant research value and is instrumental in advancing the development of SAR target detection technology.

This article introduces a ship target detection architecture for SAR images by building on advanced and robust deep learning-based object detection network models. We enhance and optimize the existing model structure to align with the characteristics of ship targets in the images, aiming to improve both the accuracy and efficiency of ship target detection. The enhancements include the implementation of a novel attention mechanism, the refinement of the loss function of the network architecture, and the integration of rotating target detection technology with the circular smooth label algorithm. We design experiments utilizing public SAR image datasets to validate the feasibility and effectiveness of the proposed architecture.

The main contributions of this paper can be summarized as follows:

1. This paper incorporates the coordinate attention mechanism and develops an architecture based on the YOLOv7 object detection network. By integrating the coordinate attention mechanism, the object detection network concentrates on important regions and features in SAR images, thereby enhancing the performance of the ship target detection task. The coordinate attention mechanism also helps the network identify tiny ship targets within SAR images and makes the object detection network versatile across diverse complex scenarios.

2. The SCYLLA Intersection over Union (SIoU) loss is employed to replace the original Complete Intersection over Union (CIoU) regression loss in the YOLOv7 network model. The adjustment reduces the complexity of the loss function and enhances the robustness and positioning accuracy of the network. By accounting for the scale factor of the target during IoU calculation, the SIoU loss minimizes regression biases, improves target localization accuracy, elevates detection accuracy, and reduces false positives and missed targets, making the network well suited to the ship target detection task.

3. Due to the challenges associated with precise feature extraction from SAR images, false detections and missed detections often occur in densely distributed and complex scenes, such as ships in nearshore areas. This research implements a rotating box detection method based on the circular smooth label algorithm to achieve accurate positioning. Angle prediction is transformed into a high-precision classification task, addressing the boundary discontinuity problem and enhancing detection performance. The integration of the circular smooth label algorithm enables the object detection network to analyze the shape and structure of the corresponding ship target, consequently minimizing label boundary ambiguities. The enhancement improves the capability of the network to generalize to ship targets in diverse scenes, scales, and poses.

The remainder of the paper is organized as follows: Section 2 introduces generic object detection methodologies and recent advancements in ship detection techniques. Section 3 provides a detailed description of the designed network model architecture. Section 4 presents the comparative experiments, offering a comprehensive analysis of the experimental results. Potential areas for improvement and future insights are discussed in Section 5, followed by a summary of the paper in Section 6.

2 Related works

Traditional ship target detection approaches based on SAR images have largely relied on the concept of semi-automatic target detection, with numerous studies conducted in this area [11]. Ai et al. introduced a two-parameter Constant False Alarm Rate (CFAR) algorithm and subsequently developed a CFAR algorithm based on the K-distribution [12]. Most CFAR algorithms analyze SAR images pixel by pixel employing local sliding windows. However, this procedure involves multiple calculations for each pixel, leading to low computational efficiency [13]. To mitigate the challenges posed by traditional imaging techniques, which frequently generate strong background clutter and high sidelobe interference, Xu et al. employed machine learning approaches: they introduced a target-centric Bayesian compressive sensing imaging method, complemented by a region-adaptive extractor, which enhances radar image object perception tasks [14]. Nasrabadi et al. proposed a method that employs entropy, concave wavelet transform, and template matching from information theory to detect ship targets [15]. Additionally, Guo et al. developed an algorithm for SAR image target detection based on feature extraction [16]. Despite their contributions, traditional techniques exhibit multiple limitations, including strong dependence on manual intervention, low generalization ability, suboptimal detection accuracy, and extended detection times. Moreover, methods relying on image texture feature extraction necessitate manual design interventions, making the entire process complex, time-consuming, and difficult to apply under real-time constraints [17].

With the advent of Convolutional Neural Networks (CNNs) and the proliferation of artificial intelligence technologies, SAR ship object detection techniques based on deep learning have rapidly evolved and demonstrated impressive detection performance. Advanced deep learning-based SAR ship object detection methods can be broadly categorized into two types: two-stage detection models and single-stage detection models [18]. The two-stage detection model initially employs selective search or a region proposal network to generate and extract proposed regions from the input image; it then utilizes the features of these regions to predict object categories and perform bounding-box regression. For instance, Liu et al. developed a two-stage ship detection algorithm in SAR images based on Region-based Convolutional Neural Networks (R-CNN) [19], while Lin et al. introduced a two-stage Faster R-CNN for ship detection in SAR images [20]. Xu et al. identified that prevailing deep learning-based SAR ship detection methods predominantly focus on single-polarization SAR images, overlooking the potential of dual-polarization characteristics. To overcome this limitation, they introduced a group-wise feature enhancement-and-fusion network incorporating dual-polarization feature enrichment, aiming to enhance the accuracy of dual-polarization SAR ship detection [21]. In contrast, the single-stage detection model treats object detection as a regression problem: it eliminates the region proposal stage and utilizes a single convolutional neural network to directly predict the category probabilities and position coordinates of various objects. Compared to the two-stage detection model, the single-stage model streamlines the entire workflow, resulting in faster inference. Among single-stage detection models, the You Only Look Once (YOLO) series stands out for its speed, particularly excelling at recognizing relatively small targets [22]. As a result, it has gained widespread adoption in SAR ship detection tasks.

Numerous scholars have investigated SAR ship target detection methodologies based on the YOLO model architecture. Gao et al. applied a regression-based approach to establish a depthwise separable convolutional network on the YOLOv4 model, incorporating channel and spatial attention mechanisms to enhance ship detection accuracy in SAR images [23]. Similarly, Guo et al. introduced an improved YOLOv5 detection method to address the multi-scale challenges of ship target detection in complex scenes [24]. Xu et al. used the YOLOv5 algorithm as the foundation and introduced a streamlined onboard SAR ship detector named Lite-YOLOv5; the variant minimizes model size, reduces computational overhead, and achieves onboard ship detection without compromising accuracy [25]. While these studies utilized conventional horizontal label boxes for detection, which require fewer parameters and simplify model training [26], they face limitations in complex scenes: horizontal label boxes often encompass redundant background information, complicating classification and leading to inaccurate target representation. To address these challenges, rotating target detection offers a viable solution. Rotation boxes eliminate overlap during object detection, enable precise target identification and localization amidst complex backgrounds, and largely exclude background information from the detection box, reducing its influence on object classification. Sun et al. incorporated rotating target detection into SAR ship target detection by designing a circular smooth label algorithm and integrating it into the YOLOv5 detection network, achieving precise ship target positioning [27]. Despite these advancements, most rotating target detection models in the existing literature are based on YOLOv4 or YOLOv5. Notably, the YOLOv7 algorithm is a more recent innovation within the YOLO series. YOLOv7 introduces an updated network architecture and an auxiliary detection head for preliminary result screening while retaining the dynamic label assignment strategy of previous versions, further improving computational efficiency and detection accuracy [28, 29]. These properties make it a promising candidate for SAR ship target detection.

To address the aforementioned challenges and enhance the efficiency as well as the accuracy of ship target detection in SAR images, in this research, we make improvements based on the YOLOv7 detection algorithm framework. Tailoring the loss function to align with the characteristics of ship target detection, we optimized the detection network. Inspired by the methodology proposed by Hou et al. [30], we incorporated the coordinate attention mechanism module to bolster ship target detection performance and effectiveness in SAR images. Additionally, we integrated rotating target detection technology into the model to mitigate the impact of target overlap on detection results. To validate the efficacy of the proposed model, we compared it with the aforementioned state-of-the-art techniques.

3 Methodology

3.1 Detection network

The YOLOv7 network model we constructed primarily consists of Input, Backbone, Head, and Prediction modules. The structure of the model is depicted in Fig. 1.

Fig. 1 The constructed YOLOv7 detection network structure

The Input module first scales SAR images to a standardized pixel size to align with the input size requirements of the network architecture. Following a series of preprocessing steps, including data augmentation, the images are forwarded to the Backbone module. The Backbone module comprises multiple BConv convolutional layers, Extended Efficient Layer Aggregation Network (E-ELAN) layers, and Max-Pooling Convolutional (MPConv) layers [31]. The BConv layer consists of a convolutional layer, a Batch Normalization (BN) layer, and an activation function, and serves to extract image features across various scales [32]. The E-ELAN layer is an enhancement of the original ELAN structure: while retaining the transition layer structure of the original ELAN design, E-ELAN introduces diverse feature learning by guiding different feature-set blocks. Through mechanisms such as expand, shuffle, and merge cardinality, E-ELAN enhances network learning capabilities without disrupting the original gradient flow [33]. Lastly, the MPConv layer broadens the receptive field of the current feature layer and combines the expanded feature information with the output of standard convolution to bolster the generalization capability of the network [34].
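For concreteness, the following is a minimal PyTorch sketch of the BConv building block described above. The kernel size, stride, and the SiLU activation are assumptions based on common YOLOv7 implementations rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class BConv(nn.Module):
    """BConv block: convolution + batch normalization + activation.

    A minimal sketch; kernel size, stride, and the SiLU activation
    follow common YOLOv7 implementations and are assumptions here.
    """
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```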

The Backbone module extracts multiple features from the processed images. The extracted features are then fused via the concat operation within the Head module to generate features of varying sizes. The Head module adopts a Path Aggregation Feature Pyramid Network (PAFPN) architecture, facilitating efficient feature fusion across different levels by introducing a bottom-up path that smoothly transfers information from the base to the top [35]. Within the Head module, the architecture incorporates the Spatial Pyramid Pooling Cross Stage Partial Connections (SPPCSPC) structure, which enlarges the receptive field of the network by integrating a Cross Stage Partial (CSP) structure into the standard Spatial Pyramid Pooling (SPP). It also features a substantial residual edge to aid optimization and feature extraction. By running multiple MaxPool operations in parallel with a sequence of convolutions, the design mitigates image distortion from processing operations and addresses the issue of redundant feature extraction in the CNN model [36]; a sketch of this block is given below.

In this research, three Coordinate Attention Mechanism (CAM) modules are integrated into the Head module, positioned before the prediction heads. This design aims to capture crucial feature representations essential for the downstream object detection task. The interior structure of each CAM is displayed in Fig. 1. The placement of the CAM within the Head module is inspired by the approaches proposed by Liu et al. [37] and Raj et al. [38]; while several researchers have instead integrated the attention mechanism into the Backbone module of the object detection network, the experimental results consistently demonstrate similar outcomes [39]. A detailed explanation of the inner components of the attention mechanism is provided in the following section.

Subsequently, the fused features are directed to the Prediction module. The module adjusts the channel count for features of different scales from the PAFPN output employing RepVGG blocks (REP), and then employs convolution to predict confidence scores, categories, and anchor frames [40]. Compared to its predecessors, the YOLOv7 detection network enhances feature extraction capabilities, striking a commendable balance between detection efficiency and accuracy. Figure 1 illustrates the core enhancements introduced in this research, including the coordinate attention mechanism, the SIoU loss, and the rotating target detection technology. The rotational target detection branch is integrated into the multi-tasking pipeline of the prediction component within the object detection network and utilizes the circular smooth label algorithm to predict output results. The placement of the rotating target detection technology within the overarching object detection network architecture is based on the approach proposed by Zhang et al. [41].
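The following is a simplified sketch of the SPPCSPC block, reusing the BConv block sketched earlier. The channel widths and the pooling kernel sizes (5, 9, 13) are assumptions drawn from common YOLOv7 implementations, not settings confirmed by this paper.

```python
class SPPCSPC(nn.Module):
    """Simplified SPPCSPC: SPP-style parallel max-pooling inside a
    CSP-style two-branch structure. Pool sizes and channel widths are
    assumptions based on common YOLOv7 implementations."""
    def __init__(self, c1, c2, pool_sizes=(5, 9, 13)):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = BConv(c1, c_, k=1)  # main branch
        self.cv2 = BConv(c1, c_, k=1)  # CSP shortcut branch (residual edge)
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
             for k in pool_sizes]
        )
        self.cv3 = BConv(c_ * (len(pool_sizes) + 1), c_, k=1)
        self.cv4 = BConv(c_ * 2, c2, k=1)  # fuse the two branches

    def forward(self, x):
        y1 = self.cv1(x)
        # Parallel max-pooling at several scales enlarges the receptive field
        y1 = self.cv3(torch.cat([y1] + [p(y1) for p in self.pools], dim=1))
        y2 = self.cv2(x)
        return self.cv4(torch.cat([y1, y2], dim=1))
```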

3.2 Attention mechanism

YOLOv7 demonstrates exceptional performance by generating a substantial volume of feature information. However, this can lead to information overload, necessitating a focused approach within the network, particularly on the object regions. To address this challenge, attention mechanisms, which are widely employed in deep learning and computer vision tasks, play a pivotal role. Attention mechanisms guide the model to emphasize information and locations crucial to the task, effectively reducing attention to less relevant data and mitigating information overload. This targeted focus enhances both efficiency and accuracy [42]. To bolster the precision of the detection network without introducing significant computational overhead, we incorporate a flexible coordinate attention mechanism. The architecture and flow of the coordinate attention mechanism are illustrated in Fig. 2.

Fig. 2 The flow structure of the coordinate attention mechanism

The input feature map X is the output of the preceding convolutional layer and has dimensions \(C\times H\times W\), where C is the number of channels, H the height, and W the width. Average pooling over spatial extents (H, 1) and (1, W) encodes the information of each channel along the horizontal and vertical dimensions, yielding the outputs of the \(c\)-th channel at height h and at width w, respectively.

The formulas are displayed as follows:

$$\begin{aligned}{} & {} z_c^h (h)=\frac{1}{W}\sum _{0 \le i < W} x_c(h, i) \end{aligned}$$
(1)
$$\begin{aligned}{} & {} z_c^w (w)=\frac{1}{H}\sum _{0 \le j < H} x_c(j, w) \end{aligned}$$
(2)

The two transformations above aggregate features along the two spatial directions, and the resulting feature maps \(z^h\) and \(z^w\) are concatenated. A convolution operation \(F_1\) with a kernel size of 1 is then applied to produce the intermediate feature map f, capturing spatial information in both the horizontal and vertical directions. The formula for this operation is as follows:

$$\begin{aligned} f=\delta (F_1([z^h, z^w])) \end{aligned}$$
(3)

The intermediate feature map f is partitioned into two separate tensors, \(f^h\) and \(f^w\), along the spatial dimension. These tensors are then expanded to match the channel count of the input X using two convolution operations, \(F_h\) and \(F_w\), each with a kernel size of 1. The formulas for these operations are given below.

$$\begin{aligned}{} & {} g^h=\sigma (F_h(f^h)) \end{aligned}$$
(4)
$$\begin{aligned}{} & {} g^w=\sigma (F_w(f^w)) \end{aligned}$$
(5)

In the equations above, \(\sigma \) denotes the Sigmoid activation function, which scales the output to a range between 0 and 1 to indicate the level of importance; \(g^h\) and \(g^w\) serve as attention weights. The final output is given by the following equation:

$$\begin{aligned} y_c(i, j)=x_c(i, j)\times g_c^h (i) \times g_c^w (j) \end{aligned}$$
(6)

Consequently, the detection network can effectively focus on the relevant channels and spatial coordinates. The attention mechanism is integrated into the BConv convolutional layer of the Backbone module and the CatConv convolutional layer of the Head module, allowing the detection network to extract features from the target areas of interest more effectively and thereby improving the efficiency of model training.
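Putting Eqs. (1)-(6) together, a minimal PyTorch sketch of the coordinate attention module follows. The channel reduction ratio, the BN layer, and the ReLU non-linearity standing in for \(\delta \) are assumptions in the spirit of Hou et al. [30], not settings confirmed by this paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of the coordinate attention mechanism of Eqs. (1)-(6).

    The bottleneck ratio and activations are assumptions following
    Hou et al. [30]."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # Eq. (1): average over W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # Eq. (2): average over H
        self.f1 = nn.Conv2d(channels, mid, kernel_size=1)   # F1 in Eq. (3)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU()  # stands in for the non-linearity delta in Eq. (3)
        self.f_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h in Eq. (4)
        self.f_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w in Eq. (5)

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = self.pool_h(x)                      # (n, c, h, 1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)  # (n, c, w, 1)
        # Eq. (3): concatenate along the spatial axis, then 1x1 convolution
        f = self.act(self.bn(self.f1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.f_h(f_h))                      # Eq. (4)
        g_w = torch.sigmoid(self.f_w(f_w.permute(0, 1, 3, 2)))  # Eq. (5)
        return x * g_h * g_w                                    # Eq. (6)
```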

3.3 Loss function

The loss function of the YOLOv7 detection network comprises three components: the localization loss, the confidence loss, and the classification loss. The overall loss is the weighted sum of these three losses, as shown in the equation below. Both the confidence loss and the classification loss use the BCEWithLogits loss function, while the localization loss is computed using the Complete Intersection over Union (CIoU) regression loss.

$$\begin{aligned} \begin{aligned} Loss_{object} =\,&Loss_{localization} \times W_{localization} \\&+ Loss_{confidence} \times W_{confidence} \\&+ Loss_{classification} \times W_{classification} \end{aligned} \end{aligned}$$
(7)
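As a minimal illustration of Eq. (7), the three terms might be combined as follows, with BCEWithLogitsLoss covering the confidence and classification terms; the weight values shown are illustrative assumptions, not the paper's tuned settings.

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # used for both confidence and classification

def total_loss(loc_loss, obj_logits, obj_targets, cls_logits, cls_targets,
               w_loc=0.05, w_conf=1.0, w_cls=0.5):
    """Weighted sum of Eq. (7); the weights are illustrative assumptions."""
    conf_loss = bce(obj_logits, obj_targets)
    cls_loss = bce(cls_logits, cls_targets)
    return loc_loss * w_loc + conf_loss * w_conf + cls_loss * w_cls
```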

The equation of the CIoU regression loss is shown below.

$$\begin{aligned} \begin{aligned} Loss_{CIoU}&= 1 - I_{IoU} + \frac{\rho ^2(b, b_{gt})}{c^2}+\alpha v \end{aligned} \end{aligned}$$
(8)
$$\begin{aligned} \begin{aligned} v&= \frac{4}{\pi ^2}\left( \arctan \frac{w_{gt}}{h_{gt}}-\arctan \frac{w}{h}\right) ^2 \end{aligned} \end{aligned}$$
(9)
$$\begin{aligned} \alpha&= \frac{v}{(1-I_{IoU})+v} \end{aligned}$$
(10)

In the equations provided, b represents the predicted box and \(b_{gt}\) the ground-truth box. c denotes the diagonal distance of the smallest enclosing region that can encompass both the predicted and ground-truth boxes, \(\alpha \) is the equilibrium parameter, and v measures the consistency of aspect ratios between the predicted and ground-truth boxes. When the aspect ratio of the predicted box matches that of the ground-truth box (v is 0), the penalty term for the aspect ratio becomes ineffective, destabilizing the CIoU loss function. To address this issue, we employ the SIoU loss function proposed by Gevorgyan [43] as a substitute in our object detection network. The SIoU loss function integrates an angle cost, redefining the distance cost based on this angle cost and reducing the total degrees of freedom of the loss function. The parameters associated with the SIoU loss function are illustrated in Fig. 3.

Fig. 3 The parameters associated with the SIoU loss function

The SIoU regression loss function consists of four parts: the angle cost, the distance cost, the shape cost, and the IoU cost. The equations of the four parts are shown below, respectively.

$$\begin{aligned} \begin{aligned} \Lambda&= 1-2\sin ^2\left( \arcsin \left( \frac{c_h}{\sigma }\right) -\frac{\pi }{4}\right) \\ {}&= \cos \left( 2\left( \arcsin \left( \frac{c_h}{\sigma }\right) -\frac{\pi }{4}\right) \right) \end{aligned} \end{aligned}$$
(11)
$$\begin{aligned} \begin{aligned} \Delta = 2-e^{(\Lambda -2)\left( \frac{C_h}{{C'}_h}\right) ^2}-e^{(\Lambda -2)\left( \frac{C_w}{{C'}_w}\right) ^2} \end{aligned} \end{aligned}$$
(12)
$$\begin{aligned} \begin{aligned} \Omega = \left( 1-e^{-\frac{|w-w^{GT}|}{\max (w, w^{GT})}}\right) ^{\theta }+\left( 1-e^{-\frac{|h-h^{GT}|}{\max (h, h^{GT})}}\right) ^{\theta } \end{aligned} \end{aligned}$$
(13)
$$\begin{aligned} \begin{aligned} IoU = \frac{\text {Ground Truth Box} \cap \text {Prediction Box}}{\text {Ground Truth Box} \cup \text {Prediction Box}} \end{aligned} \end{aligned}$$
(14)
$$\begin{aligned} \begin{aligned} Loss_{SIoU} = 1-IoU+\frac{\Delta + \Omega }{2} \end{aligned} \end{aligned}$$
(15)

In the equations provided, \(\Lambda \) denotes the angle cost, \(\Delta \) the distance cost, \(\Omega \) the shape cost, and IoU the Intersection over Union. The distance cost incorporates the angle cost between the two boxes, and the variable \(\theta \) is adjustable, determining the weight the network assigns to the shape cost. The angle cost integrated into SIoU primarily facilitates the calculation of the distance cost between the two boxes. During the initial stages of model training, the predicted box and the ground-truth box often do not intersect; incorporating the angle cost accelerates the computation of the distance between these boxes, enabling quicker convergence of their distances. When the angle \(\alpha \) exceeds 45 degrees, the complementary angle \(\beta = \pi /2 - \alpha \) replaces \(\alpha \) in the formula. This allows the network model to first align the center point of the predicted box with that of the ground-truth box; the predicted box is then guided to approach the ground-truth box along the relevant axis.

With the inclusion of the angle cost, the loss function achieves a more comprehensive representation, and the likelihood of the penalty term equating to zero is reduced. As a result, the convergence of the loss function becomes more stable, improving regression accuracy and consequently reducing prediction errors.
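The following PyTorch sketch assembles Eqs. (11)-(15) for boxes in (cx, cy, w, h) format. It is a simplified reading of the SIoU formulation [43], not the authors' exact implementation; the identity \(\Lambda = \sin (2\arcsin (c_h/\sigma ))\) follows from Eq. (11), and the default \(\theta = 4\) is an assumption.

```python
import torch

def siou_loss(pred, target, theta=4, eps=1e-7):
    """Sketch of the SIoU loss (Eqs. 11-15) for (cx, cy, w, h) boxes.

    A simplified reading of Gevorgyan [43]; not the paper's exact code.
    """
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = target.unbind(-1)

    # IoU cost (Eq. 14)
    inter_w = (torch.minimum(px + pw / 2, gx + gw / 2)
               - torch.maximum(px - pw / 2, gx - gw / 2)).clamp(min=0)
    inter_h = (torch.minimum(py + ph / 2, gy + gh / 2)
               - torch.maximum(py - ph / 2, gy - gh / 2)).clamp(min=0)
    inter = inter_w * inter_h
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Angle cost (Eq. 11): c_h is the vertical gap between box centres,
    # sigma the distance between centres; Lambda = sin(2 * arcsin(c_h / sigma)).
    sigma = torch.sqrt((gx - px) ** 2 + (gy - py) ** 2) + eps
    ratio = (torch.abs(gy - py) / sigma).clamp(max=1 - eps)
    lam = torch.sin(2 * torch.arcsin(ratio))

    # Distance cost (Eq. 12), normalized by the enclosing-box dimensions
    enc_w = (torch.maximum(px + pw / 2, gx + gw / 2)
             - torch.minimum(px - pw / 2, gx - gw / 2))
    enc_h = (torch.maximum(py + ph / 2, gy + gh / 2)
             - torch.minimum(py - ph / 2, gy - gh / 2))
    rho_x = ((gx - px) / (enc_w + eps)) ** 2
    rho_y = ((gy - py) / (enc_h + eps)) ** 2
    delta = 2 - torch.exp((lam - 2) * rho_x) - torch.exp((lam - 2) * rho_y)

    # Shape cost (Eq. 13); theta weights the shape term
    omega_w = torch.abs(pw - gw) / torch.maximum(pw, gw)
    omega_h = torch.abs(ph - gh) / torch.maximum(ph, gh)
    omega = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    # Final SIoU loss (Eq. 15)
    return 1 - iou + (delta + omega) / 2
```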

3.4 Rotating target detection

In recent years, rotating object detection technology has gained significant traction, particularly in text and image detection tasks. The technology proves especially effective when detecting targets that are densely distributed and exhibit a certain tilt angle [44]. In SAR images, ship targets often manifest at specific tilt angles, particularly in areas with densely arranged nearshore wharves. Utilizing rotating target detection can mitigate the impact of overlapping bounding boxes on detection outcomes.

While current rotation detection techniques have demonstrated promising results, they still encounter certain challenges. One significant issue is the boundary discontinuity problem stemming from angle regression. To address this problem, this research employs the circular smooth label algorithm proposed by Yang et al., which treats the angle as a classification problem [45]. We innovatively integrate the circular smooth label algorithm with the object detection network to enhance the performance of the architecture on ship target detection tasks in SAR images. The challenges of regression-based angle prediction primarily revolve around two issues: the periodicity of the angle and the exchangeability of the boundary. The angular periodicity problem arises from the cyclic nature of angle parameters, whereas the boundary exchangeability problem is predominantly tied to the definition of the bounding box [46]. The core issue leading to boundary discontinuity is that ideal prediction results can fall outside the predefined angular range; this divergence causes a significant spike in loss at the boundaries, complicating the regression of bounding boxes [47].

To address the boundary issue, the circular smooth label algorithm redefines the angle problem from its original regression format to a classification format. The transformation effectively resolves angular boundary challenges and synergizes well with the long-side definition method. Within the circular smooth label algorithm, the defined angles are segmented for better clarity and efficacy. A comparative analysis of angular classification methods is illustrated in Fig. 4.

Fig. 4 Three kinds of labels for angular classification: One-hot Label, Binary Coded Label, and the implemented Circular Smooth Label

As depicted in Fig. 4, the circular smooth label method employs circular label encoding characterized by periodicity, and the assigned label values transition smoothly within a specified tolerance range. This ensures label continuity at the boundaries, eliminating the accuracy errors that stem from angular periodicity. When the window function is a pulse function, or when its radius is relatively small, the circular smooth label method reduces to the one-hot label technique. The specific formulation of the circular smooth label algorithm is presented in the following equation:

$$\begin{aligned} \begin{aligned} Circular\;Smooth\;Label(x) =\left\{ \begin{aligned} g(x),&\quad \theta - r< x < \theta + r\\ 0,&\quad otherwise \\ \end{aligned} \right. \end{aligned} \end{aligned}$$
(16)

In the given equation, g(x) denotes the window function, r signifies the radius of this window function, and \(\theta \) stands for the angle of the current bounding box. The ideal window function g(x) should satisfy the requirements delineated in the equations.

$$\begin{aligned} \begin{aligned} g(x) = g(x+kT), k \in N \end{aligned} \end{aligned}$$
(17)

where \(T=180/\omega \) signifies the period of the window function, equal to the number of bins into which the angle range is partitioned; with the default bin width \(\omega \) of 1 degree, T equals 180.

$$\begin{aligned} \begin{aligned} 0 \le g(\theta + \epsilon )=g(\theta - \epsilon ) \le 1, \mid \epsilon \mid < r \end{aligned} \end{aligned}$$
(18)

In the equation, \(\theta \) is the center of symmetry.

$$\begin{aligned} \begin{aligned}&g(\theta ) = 1 \end{aligned} \end{aligned}$$
(19)
$$\begin{aligned} \begin{aligned}&0 \le g(\theta \pm \epsilon ) \le g(\theta \pm \zeta ) \le 1, \mid \zeta \mid< \epsilon < r \end{aligned} \end{aligned}$$
(20)

The equation above describes a monotonically non-increasing trend from the center point toward both sides. The aforementioned equations, introduced by Yang et al. [45], demonstrate the four essential properties of an ideal window function g(x). These pivotal properties include periodicity, as indicated in Eq. 17; symmetry, as denoted in Eq. 18; maximum, as represented in Eq. 19; and monotonicity, as demonstrated in Eq. 20.

Given these properties, the label value remains continuous at the boundary, free of the accuracy errors stemming from angular periodicity; as noted above, the one-hot label or vanilla classification is the special case of the circular smooth label algorithm obtained with a pulse window function or a sufficiently small window radius [48]. The angle prediction process within the circular smooth label methodology is delineated by the equations below.

$$\begin{aligned} \begin{aligned} Encode=\,&Circular\;Smooth\;Label(-Round((\theta _{gt} - 90)/\omega )) \\ Decode=\,&90 - \omega (Argmax(Sigmoid(logits))+0.5) \end{aligned} \end{aligned}$$
(21)
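A small NumPy sketch of the circular smooth label encoding and decoding of Eqs. (16) and (21) follows. The Gaussian window is one common choice of g(x) satisfying the properties of Eqs. (17)-(20); the window radius of 6 and the Gaussian width are assumptions, not settings verified in this paper.

```python
import numpy as np

def csl_encode(theta_gt, omega=1.0, r=6, num_bins=180):
    """Encode a ground-truth angle (degrees, long-side definition) into a
    circular smooth label vector (Eqs. 16 and 21). The Gaussian window and
    the radius r = 6 are assumptions."""
    centre = int(round(-(theta_gt - 90) / omega)) % num_bins  # Eq. (21), encode
    bins = np.arange(num_bins)
    # Circular distance from each bin to the centre bin (periodicity, Eq. 17)
    d = np.minimum(np.abs(bins - centre), num_bins - np.abs(bins - centre))
    label = np.exp(-(d ** 2) / (2 * (r / 3.0) ** 2))  # window function g(x)
    label[d >= r] = 0.0  # zero outside the tolerance radius (Eq. 16)
    return label

def csl_decode(logits, omega=1.0):
    """Decode the predicted angle bin back to degrees (Eq. 21, decode).

    Sigmoid is monotonic, so argmax over raw logits matches argmax over
    Sigmoid(logits)."""
    return 90 - omega * (int(np.argmax(logits)) + 0.5)
```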

4 Experiments

4.1 Experimental setup and evaluation metrics

To assess the performance of the designed ship detection network, experiments are conducted on the Capella Open SAR dataset. The dataset comprises 995 images predominantly featuring two types of scenes: far-sea and nearshore. The SAR images have a ground range resolution of 0.73 m, a range resolution of 0.48 m, and an azimuth resolution of 0.5 m. The image dimensions are 21000 \(\times \) 21000 pixels; for training the detection network, images are cropped to a size of 512 \(\times \) 512 pixels. The model is trained across two NVIDIA GeForce RTX 3090 graphics cards. The experiments are implemented in PyTorch 1.12 with a batch size of 24. The Adam optimizer is employed with a learning rate of 0.00125 and a cosine annealing schedule for training over 100 epochs. To evaluate performance, precision, recall, F-measure, and mean average precision (Mean AP) are used as evaluation metrics.
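As a quick reference, a minimal sketch of computing precision, recall, and F-measure from true positive (TP), false positive (FP), and false negative (FN) counts is shown below; Mean AP additionally averages precision over recall levels. The function is purely illustrative.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F-measure from detection counts.

    Mean AP (not shown) additionally averages precision over recall
    levels and over classes."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return precision, recall, f_measure
```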

4.2 Ablation study

We conduct a comparative analysis on the Capella Open SAR dataset, contrasting the proposed approach with several advanced object detection methods, including the two-stage detector Faster R-CNN [20], the one-stage detector YOLOv3 with multi-target tracking (YOLOv3-MT) [49], YOLOv4 with attention mechanism (YOLOv4-AM) [23], and YOLOv5s-CBAM-BiFPN [24]. Additionally, the original YOLOv7-Tiny detection network without any enhancements is also evaluated [50]. Among these baselines, Faster R-CNN, YOLOv4-AM, and YOLOv5s-CBAM-BiFPN are specifically tailored for ship target detection in SAR images. In Table 1, the highest value in each column is bolded and the second-highest is underlined. Our method achieves the top mean average precision while maintaining a high running speed. A qualitative comparison of the methods is illustrated in Fig. 5. As observed in Table 1, the proposed improved YOLOv7 detection network significantly enhances the mean average precision on SAR images compared to classical target detectors.

Table 1 Comparisons with advanced methods on SAR ship detection

From the time consumption data presented in Table 1 for each baseline approach and the proposed enhanced object detection framework, it is evident that as the backbone version of the YOLO network is upgraded, the inference time of the method decreases, reflecting faster inference speed [50]. YOLO performs detection in a single feedforward pass of the network [33], and its efficiency and speed stem from this single-pass methodology. In contrast, Faster R-CNN adopts a two-stage detection process [16, 20]: candidate regions are generated first and then classified. Due to this two-stage process, Faster R-CNN cannot match the speed of YOLO. With the integration of the coordinate attention mechanism and the circular smooth label algorithm, the time consumption of the proposed framework is marginally higher than that of the original, unenhanced YOLOv7-Tiny detection network. Nonetheless, considering the improvements across multiple evaluation metrics, the slight increase in inference time is justifiable. It is worth noting that the actual inference time may vary under different experimental settings.

Fig. 5 Visualization of comparative results: a Original SAR Image. b Ground Truth. c Detection with YOLOv5. d Detection with YOLOv7-Tiny. e Proposed Method. In the images, cyan boxes represent the ground truth, red boxes signify incorrect detections, yellow boxes denote low-confidence detections (misses), and green boxes indicate successful detections

To further validate the effectiveness of the proposed object detection network for ship targets in SAR images, this research conducts experiments on the publicly available Official-SSDD dataset [51]. The SSDD benchmark is notably the first publicly accessible dataset extensively employed by researchers in the SAR ship detection community; its most recent version, termed the Official-SSDD dataset, is used for the experiments. The benchmark comprises 1100 SAR images sourced from the RADARSAT-2, TerraSAR-X, and Sentinel-1 satellites [52]. The images exhibit resolutions ranging from 1 to 10 m and encompass the VV, VH, HH, and HV radar polarizations. In this research, 920 samples are used for training and 180 for testing [53]. The experimental results are presented in Table 2.

Table 2 Comparative analysis of advanced object detection methods for ship targets on the Official-SSDD dataset

Analysis of the experimental results reveals that the proposed enhanced object detection network outperforms the advanced baseline competitors on the publicly available Official-SSDD benchmark across all evaluation metrics, achieving a margin of approximately 2% over the second-best results. These outcomes underscore the effectiveness of the designed network. While there is a slight increase in computational time compared to the fastest baseline techniques, the notable enhancements across all evaluation metrics justify the additional expenditure.

4.3 Rotating frame detection

The aforementioned experiments show that misdetections and missed targets are partly attributable to the relatively dense nearshore targets. To address this issue, we implement a fused rotating frame target detection network: the architecture of the existing object detection network is reconstructed utilizing the circular smooth label algorithm. During training, the batch size is set to 10 and the Adam optimization algorithm is employed for gradient descent. The initial learning rate is set to 0.01, with a cyclic learning rate also set to 0.01, and training consists of 500 iterations. The final training results for the loss components are as follows: \(AngleLoss=0.1956\), \(BoxLoss=0.0758\), and \(ObjectLoss=0.0302\).

We select 265 images featuring nearshore scenes from the Umbra Open dataset for the experiments. The nearshore target environment is inherently more complex, leading to increased detection challenges; consequently, the accuracy of detecting nearshore ship targets significantly influences the overall detection performance. In far-shore scenarios, by contrast, the various deep learning methodologies exhibit comparable ship target detection results, with most techniques yielding satisfactory outcomes. When focusing on nearshore ship target detection, however, the rotating target detection network architecture proposed in this study demonstrates clear advantages over other algorithms, excelling at accurately detecting targets amidst complex environments. A comparative analysis of the detection results obtained by different deep learning techniques on the dataset is presented in Fig. 6.

Fig. 6 The results accurately identified by seven different deep learning techniques on the dataset

For an intuitive comparison of the object detection performance of various deep learning techniques in nearshore target environments, we present visual comparative results utilizing selected SAR images from the experimental dataset across seven object detection algorithms, as depicted in Fig. 7. The visual comparisons underscore the enhancements achieved through updates in the detection network backbone and the incorporation of the rotating target detection technique.

Fig. 7 Visualization of comparative results: a Detection with Faster R-CNN. b Detection with YOLOv3. c Detection with YOLOv4. d Detection with YOLOv5. e Detection with YOLOv7-Tiny. f Detection with enhanced YOLOv7 without CSL. g Proposed Method. In the images, cyan boxes represent the ground truth, red boxes signify incorrect detections, yellow boxes denote low-confidence detections (misses), and green boxes indicate successful detections

Figure 8 illustrates partial detection results of the fused rotating frame target detection network on the dataset. The bounding boxes accurately encircle the targets across various scenarios: whether for a small-scale target in the far sea or a large-scale target near shore, the angle category is effectively predicted by the angle classification process presented in this research, allowing precise selection of the optimal bounding box.

Fig. 8 Partial detection results from the fusion rotating frame target detection network on the dataset. Yellow boxes represent low-confidence detections (misses), while green boxes indicate successful detections

5 Discussion

The study of SAR ship target detection technology holds significant application value, especially in marine resource detection. As high-resolution SAR systems advance, SAR images now encompass more detailed information, laying a robust foundation for precise ship target detection in oceanic environments [54]. In this paper, we propose an enhanced SAR ship target detection model, constructed upon the existing deep learning target detection network architectures. Addressing challenges arising from the diverse and densely packed nature of ship targets in SAR images, we incorporate an attention mechanism architecture. Additionally, we refine the loss function of the original target detection network and integrate a rotating target detection algorithm. The aforementioned enhancements enable the detection of ship targets of varying sizes and significantly reduce the probability of missing ship targets, especially in dense nearshore scenes.

In SAR images, ship targets of various sizes coexist and are often densely arranged, making the detection of small-sized ships particularly challenging. While this research achieves high accuracy in ship target detection utilizing a deep learning approach, several challenges and limitations remain, suggesting areas for future research and improvement.

One significant challenge arises when SAR images are affected by substantial clutter interference during the imaging process. Such interference can degrade image quality, leading to issues such as ghosting and incomplete ship target contours. Consequently, the performance of deep learning-based ship target detection in SAR images may suffer, sometimes resulting in false detections where partial structures of incomplete ships are misinterpreted as whole targets. Addressing this challenge is crucial for the accurate detection of fragmented or incomplete ship targets.

Another challenge emerges when ship targets occupy a relatively small proportion of large-amplitude SAR images. In such scenarios, deep learning-based detection methods may miss these smaller targets. To mitigate the issue, manual image cropping and segmentation are often required to enlarge the target size for subsequent detection. Future research could focus on optimizing the existing models to improve the detection of small-sized ship targets, thereby reducing the dependency on manual intervention.

Moreover, most existing deep learning-based SAR ship detection technologies primarily utilize the amplitude information of SAR images, overlooking the rich phase information inherent in SAR imaging. Unlike optical images, SAR imaging relies on the scattering characteristics of electromagnetic waves, which carry both amplitude and valuable phase information [55]. Incorporating the phase information as an additional input to the detection network could potentially enhance target detection and recognition capabilities, warranting further investigation.

6 Conclusion

This research presents an optimized object detection network, leveraging an enhanced version of the YOLOv7 algorithm, tailored specifically for detecting ship targets in SAR images. To bolster the detection capabilities of the network, we implement several pivotal components.

First, we incorporate the coordinate attention mechanism into the object detection network. Given the sparse presence of ship targets in SAR images, the coordinate attention mechanism enables the network to concentrate on crucial regions and features of the target, aiding the identification and localization of tiny targets. It also amplifies the generalization capability of the model across targets of varying scales, shapes, and orientations, rendering it versatile across diverse complex scenarios.

Second, we replace the conventional CIoU regression loss with the SIoU loss in the object detection network. The substitution elevates detection accuracy while mitigating false positives and missed targets, bolstering the reliability of the network. The SIoU loss adeptly handles variations in object shape and scale, enhancing the robustness of the model in complex scenes and under occlusion. Considering the scale variability of ship targets in SAR images, the SIoU loss accounts for the scale factor of the target during IoU calculation, minimizing regression biases and improving target localization accuracy.

Third, recognizing the challenges posed by complex SAR images featuring densely packed ship targets in nearshore regions, we integrate rotating object detection technology into our framework and incorporate the circular smooth label algorithm to enhance the detection and recognition of closely spaced ship targets. Conventional label approaches can yield imprecise bounding boxes due to variations in the shape, pose, or occlusion of the target; the circular smooth label algorithm allows the object detection network to adapt to the shape and structure of the target, minimizing label boundary ambiguities and enabling a generalized, robust feature representation of ship targets across diverse scenes, scales, and poses.

Through rigorous experimentation on public SAR datasets, the proposed architecture demonstrates superior accuracy without significantly compromising speed. Specifically, it outperforms the second-best baseline competitor by approximately 2% in both precision and recall and by around 1.5% in Mean Average Precision (Mean AP). These results underscore the efficiency and effectiveness of the designed approach for ship target detection in SAR images.

The enhanced object detection network proposed in this research is not without limitations. First, the developed technique focuses solely on ship target detection in SAR images; future work will therefore apply the object detection network to broader categories of images, such as ship target detection in high-resolution remote sensing satellite images. Second, given the rapid advancements in deep learning and its widespread application in object detection, it is worth exploring and incorporating the latest object detection networks to address downstream ship target detection challenges. Finally, while integrating advanced enhancement modules into the object detection network can improve algorithm performance, the associated increase in processing time remains an inevitable limitation; balancing improved detection performance with model efficiency is a crucial challenge for future research.
