1 Introduction

Unmanned surface vessels (USVs) equipped with autonomous navigation and obstacle avoidance functions have become an important tool for assisting or replacing human work, and they have attracted widespread attention as their tasks expand to coastal mapping, environmental and ecological monitoring, and marine rescue. Autonomous navigation requires semantic-level perception of water surface scene targets, including target shape, posture, and the delineation of the feasible driving area, and real-time inference decisions are then made from the recognition results. Because of the limited size and operating mode of unmanned vessels, an efficient and lightweight real-time semantic segmentation network is needed so that targets can be recognized accurately in real time. This remains one of the current research hotspots.

Common semantic segmentation networks focus on the design of the backbone, such as ResNet-101 [17], Xception [8], and HRNet [38], rather than on inference speed and computational cost. These multi-layered, complex, and deep structures can extract integrated features over a larger spatial range and have stronger feature expression capabilities. However, such massive backbones often incur high computational complexity and slow inference, which makes them unsuitable for practical applications that demand both low latency and high prediction accuracy. How to obtain more accurate segmentation results while maintaining a high inference speed has therefore become a challenging task.

Real-time semantic segmentation networks have been proposed in recent years. ENet [31] achieves high efficiency and accuracy with a fast running speed by using depth-wise separable convolution, skip connections, and early downsampling. ERFNet [33] uses a series of effective designs (e.g., an asymmetric encoder and non-bottleneck-1D modules) to improve the accuracy and efficiency of the network. DFANet [25] reduces the parameter count through multi-scale feature propagation: its feature extraction relies on sub-network aggregation and sub-stage aggregation to promote interaction and aggregation between features at different levels, which reduces computation while improving accuracy. BiseNet [46] is a bilateral segmentation network with two paths that collect spatial and semantic information without affecting the segmentation speed. The spatial path uses more channels and a shallower network to retain rich spatial information and generate high-resolution features, while the context path uses fewer channels and a deeper network for rapid downsampling to obtain sufficient contextual information.

BiseNetV2 [45] adds bilateral guided aggregation to BiseNet to integrate the complementary information extracted by the detail and semantic branches. This layer guides the feature response of the detail branch with the contextual information of the semantic branch and captures feature representations at different scales through guidance at different scales, which enables accurate multi-scale information fusion. BiSeNetV3 [37] adopts the STDCNet [12] backbone to remove the time-consuming spatial path of BiseNet. An attention refinement module (ARM) and a feature fusion module (FFM) are added to the original architecture, allowing BiSeNetV3 to outperform existing results on urban scenes together with traditional edge detection methods. ICEG [16] proposes the ESC module to eliminate boundary ambiguity by separating foreground and background features and using attention masks to reduce uncertain boundaries and remove false predictions. The scheme also leverages edge features to adaptively guide segmentation and enhances feature-level edge information to achieve sharp edges in the segmentation results.

To obtain a real-time semantic segmentation algorithm that can be applied in practice, lightweight convolutional neural networks can be broadly grouped into three types: (1) Compression of pre-trained deep networks: a structured sparsity learning (SSL) method is used to learn a compact structure from a large convolutional neural network [40], which reduces the computational cost. (2) Network quantization-based methods: the Quantized CNN method quantizes the weights of the convolutional and fully connected layers and minimizes the response error of each layer, which accelerates and compresses the CNN [43]; the method is fast but yields poor segmentation performance. (3) Lightweight convolutional neural networks: the design focus is a compact network that reduces both computational cost and memory consumption.

Inspired by the Deeplab v3 + network, this work proposes an efficient real-time semantic segmentation network that balances accuracy and speed (Fig. 1). The network retains the ASPP module of Deeplab v3 + and replaces the backbone with the lighter MobileNet v3. On this basis, a new decoder structure is designed, consisting of Laplace feature extraction, a multi-path channel attention feature fusion module (CAFM), a convolutional block attention module (CBAM) [41], and a feature refinement extraction module (FREM). The CAFM fuses the semantic features extracted by the encoding network with low-level features through a channel attention mechanism. The Laplace operator extracts second-order differential information from the input image to obtain edge features. The FREM refines the edge features and gradually decomposes abstract features into high-level semantic features. Finally, the CBAM filters useful features at both the channel and spatial levels, removes useless features, and the resulting features are added back to the final semantic features fused by the CAFM.

Fig. 1

Detailed architecture of LDANet. The Laplace module is applied to extract second-order differential features in the framework. Three FREM layers {16, 64, 128} are used for feature refinement, and the CAFM module is used to fuse deep and shallow features

The main contributions of the work are summarized as follows:

(1) LDANet, a lightweight real-time semantic segmentation network, was proposed for the semantic-level perception of water surface scene targets by USVs. The network has a dual-branch structure consisting of a semantic branch and an edge refinement branch, which ensures good segmentation performance while maintaining inference speed.

(2) An edge refinement branch composed of the Laplace operator, the FREM, and the CBAM was proposed, which constrains the segmentation results of the semantic branch through boundary contour information.

(3) A multi-path channel attention feature fusion module was designed to suppress unimportant features and enhance important features through a channel attention mechanism, fusing contextual and spatial information.

(4) LDANet was tested on an integrated dataset consisting of MaSTr1325 and MID, achieving a mIoU of 98.1%. When tested on the CamVid and PASCAL VOC2012 datasets, LDANet achieved mIoU scores of 73.1% and 81.1%, respectively, and an inference speed of 100.36 frames/second with an input image size of \(720 \times 960\). Compared with other lightweight algorithms such as DSA, LDANet performs better in segmentation performance.

2 Related work

Deep learning-based image semantic segmentation techniques have developed rapidly in recent years. Image classification is the most fundamental task in computer vision, and segmentation networks built on image classification backbones face increasing demands for both accuracy and speed. This section introduces the development of the relevant technologies.

Fully Convolutional Networks (FCN), proposed by Long et al. [29], solved the semantic segmentation problem in image analysis. The FCN, based on VGGNet, replaces fully connected layers with convolutional layers, advancing traditional convolutional neural networks (CNNs). U-Net, proposed by Ronneberger et al. [34], uses an encoder-decoder architecture; its skip connections fuse the feature maps of the encoder with the corresponding feature maps of the decoder, which addresses the coarseness of the FCN's output segmentation. SegNet, proposed by Badrinarayanan et al. [1], also uses an encoder-decoder structure similar to U-Net; however, its encoder adopts the first 13 layers of the VGG16 convolutional network, with each encoder layer corresponding to a decoder layer, which improves the accuracy of image position information.

DeepLabv1, proposed by Chen et al. [5], introduces dilated convolution in place of standard convolution to enlarge the network's receptive field, and a dense conditional random field (CRF) is applied as post-processing to address blurred edges after segmentation. Chen et al. [4] added the atrous spatial pyramid pooling (ASPP) module to DeepLabv1 to extract image features as completely as possible by combining pyramid pooling and atrous convolution; multiple atrous convolutions with different rates are used to obtain features at different scales. Chen et al. [6] proposed DeepLabV3, which adopts a multi-grid strategy, adds batch normalization to the ASPP module, and removes the CRF post-processing. Afterwards, a decoder module was added on top of DeepLabV3 to recover the object boundaries lost by parallel atrous convolution [7]. Meanwhile, the Xception model is adapted to the segmentation task and depthwise separable atrous convolution is applied to the ASPP module and the decoder, which improves both the accuracy and the speed of the network.

Researchers have paid increasing attention to whether networks can be applied to different real-world scenarios. The efficient residual factorized network (ERFNet) [33] is based on residual connections and depth-wise separable convolutions; it replaces the residual module with a non-bottleneck module that uses only 1D asymmetric convolutions internally, which reduces the number of parameters while preserving segmentation accuracy. ELANet [44] is a lightweight network based on an asymmetric encoder framework; a novel efficient context guidance (ECG) strategy is used to learn and capture complex contextual semantic information in the encoder, with good results. The FEDER model [15] addresses the fuzzy boundary problem in the ED challenge through an auxiliary edge reconstruction task, using high-order ODE solvers and ODE-inspired edge reconstruction modules. This approach helps generate accurate segmentation results with precise object boundaries.

2.1 Encoder-decoder

The encoder-decoder architecture has been widely used in deep learning-based image segmentation to improve the accuracy of semantic segmentation networks. Symmetric encoder-decoder structures usually include multiple feature extraction and upsampling steps. ResNet [17] and Xception [8] introduce multi-branch structures or connect shallow and deep features on top of backbone networks with strong feature extraction capabilities. However, optimizing the network structure in this way adds parameters, which limits its use in scenarios with limited computational resources. ENet [31] is a lightweight network that balances accuracy and efficiency using an asymmetric encoder-decoder structure: the encoder has a deeper network and a larger receptive field, while the decoder is relatively shallow. The encoder uses skip connections to extract high-level semantic features, and the decoder refines the feature map details by upsampling the encoder output.

BANet [51] adopts a dual-branch encoder-decoder structure in which two encoders process feature representations at different levels, and the corresponding decoder combines these features to obtain the final detection and segmentation results. This dual-branch structure exploits hierarchical features from different stages to improve task performance. OverSegNet [26] achieves good results in superpixel segmentation. It uses a NAS-based encoder that is more efficient than existing encoders, and the over-segmentation is optimized in two steps: the encoder is first over-parameterized with small convolution kernels, and then a data-driven network architecture search (NAS) selects the kernels best suited to each over-segmentation type, which optimizes network performance. OverSegNet also proposes a clustering-specific decoder to compute the correlation between pixels and superpixels. Since the decoder is optimized for superpixel clustering, the use of skip connections in the structure can significantly improve the accuracy of superpixel clustering without increasing the computational cost.

2.2 Attention mechanism

Attention mechanisms are widely used in neural networks. In the learning process of neural networks, the more parameters a model has, the stronger its expressive power and the more information it can store. Attention mechanisms can quickly scan the global information in an image and, when computational resources are limited, allocate them to the more important regions while reducing or filtering the attention paid to other information.

There are two types of trainable attention: hard/local attention and soft/global attention. Hard attention relies on parameter updates via reinforcement learning, which hinders model training, whereas soft attention is probabilistic and can be trained with standard backpropagation without Monte Carlo sampling. Wu et al. [42] proposed a flexible attention model that uses global pooling and deformable convolutions to model spatial and channel attention, and a dense gating and collaboration method is introduced to make the spatial and channel attention dynamic. Self-attention mechanisms can establish long-range dependencies within sequences; OCNet [49] and DANet [13] use self-attention to obtain relevant contextual information. CCNet [21] proposes a more efficient criss-cross attention (CCA) module, and a category-consistent loss is introduced to force the CCA to acquire more diverse features.

BiSeNet V2 [45] uses two parallel branches to handle low-resolution and high-resolution features: one branch focuses on global information and the other on local fine-grained information, and the results of the two branches are fused with attention mechanisms to improve segmentation accuracy. Qu et al. [30] propose a global attention dual pyramid network (GADPNET), which uses an improved global attention module with a lightweight contextual branch to capture more local information from low-level features. GADPNET captures multi-scale features from a single high-level feature through a pyramid decoder structure, and different low-level multi-scale features are fused under the guidance of the improved global attention module to enhance the multi-scale feature capture of the segmentation network.

2.3 Feature fusion

The fusion of features at different scales is an important way to improve segmentation performance, and feature fusion structures are also effective for obtaining dense features and restoring image information. RefineNet [27] introduces complex refinement modules at each upsampling stage between the encoder and decoder to extract multi-scale features. SENet [20] adopts a "feature recalibration" strategy: it automatically learns the importance of each feature channel, then enhances useful features and suppresses features that are less useful for the current task. SENet is effective for large-scale targets but performs poorly on small-scale targets. Yu et al. [48] extended the dense connection idea to develop more deeply aggregated structures and enhance the feature representation capability of the network. DFANet [25], a deep feature aggregation network, consists of a backbone based on Xception, sub-network aggregation modules, and sub-stage aggregation modules: the sub-networks are aggregated at the network level to combine high-level features, and the sub-stages are aggregated at the stage level to fuse semantic and spatial information. Dai et al. [9] proposed a multi-scale channel attention module (MS-CAM) to better fuse features with inconsistent semantics and scales; channel attention is extracted through two branches at different scales, and an iterative attention feature fusion module (IAFF) alternately integrates the initial feature fusion with another attention module. Zhang et al. [50] proposed the cascade fusion network (CFNet) architecture to improve performance on dense prediction tasks such as detection and segmentation; CFNet inserts feature integration operations into the backbone, which enables more parameters to be used for feature fusion and enriches the fused features.

3 Proposed networks

The work improves the encoder and decoder structures of the Deeplabv3 + network to address the issues in the semantic segmentation of water surface image datasets. Firstly, a multi-path channel attention fusion module and a prior-based boundary feature guidance method are proposed to constrain semantic boundaries. Secondly, the backbone network in the encoder is modified, and the dilation rates of the dilated convolutions in the ASPP module are adjusted. These improvements are detailed in the following sections. Figure 1 shows the structure of the proposed network model.

MobileNetV3 is first used to extract features from the original input, producing feature \({F}_{1}\). \({F}_{1}\) is then fed into the ASPP [47] structure, which consists of one \(1\times 1\) convolution, three \(3\times 3\) dilated convolutions with dilation rates of 2, 7, and 15 that extract multi-scale contextual information from \({F}_{1}\), and a global average pooling module. The features extracted by these branches are concatenated along the channel dimension and fused by a \(1\times 1\) convolution to obtain the high-level feature map \({F}_{2}\).
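A minimal PyTorch sketch of the ASPP layout described above is given below; the channel counts and layer details are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of the ASPP block: one 1x1 conv, three dilated 3x3 convs
    (rates 2, 7, 15), and image-level pooling, fused by a 1x1 conv."""
    def __init__(self, in_ch, out_ch=256, rates=(2, 7, 15)):
        super().__init__()
        self.conv1x1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for r in rates])
        self.pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch * 5, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.conv1x1(x)] + [b(x) for b in self.branches]
        pooled = F.interpolate(self.pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        # concatenate all branches along the channel dimension, then fuse
        return self.fuse(torch.cat(feats + [pooled], dim=1))  # high-level map F2
```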

During the decoding stage, the high-level feature map \({F}_{2}\) is processed hierarchically from low to high resolution by the multi-path channel attention feature fusion (CAFM) module. This process gradually incorporates shallow, high-resolution features into the feature map to obtain feature \({F}_{3}\), which contains rich semantic information; contextual and spatial information are combined to improve segmentation performance throughout the process. \({F}_{3}\) is then upsampled by a factor of 2 using bilinear interpolation and added to the feature \({B}_{3}\) produced by the boundary refinement branch. The fused features are fine-tuned and reduced in dimension by a \(3\times 3\) convolution. This approach enhances the fusion of information from multiple scales, which improves the results of semantic segmentation.

The loss function used in this work is the commonly used cross-entropy loss [36], formulated as

$$L= -\sum\nolimits_{i=1}^{N}\left[{y}^{\left(i\right)}\mathrm{log}\,{\widehat{y}}^{\left(i\right)}+\left(1-{y}^{\left(i\right)}\right)\mathrm{log}\left(1-{\widehat{y}}^{\left(i\right)}\right)\right]$$
(1)

where \(L\) represents the training loss; \(N\) represents the number of samples; \({y}^{(i)}\) represents the actual sample label, taking values in {0, 1}; \({\widehat{y}}^{(i)}\) represents the predicted label value, with a value range of (0, 1).
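As a small numeric sanity check of Eq. (1), the summed binary form above can be reproduced directly in PyTorch; the label and probability values here are purely illustrative.

```python
import torch
import torch.nn.functional as F

# Toy check of Eq. (1): summed binary cross-entropy over N = 3 predictions.
y     = torch.tensor([1.0, 0.0, 1.0])   # ground-truth labels y^(i)
y_hat = torch.tensor([0.9, 0.2, 0.6])   # predicted probabilities y_hat^(i)

loss_manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()
loss_torch  = F.binary_cross_entropy(y_hat, y, reduction='sum')
assert torch.allclose(loss_manual, loss_torch)
```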

3.1 Encoder network

3.1.1 Backbone network

In CNNs, a larger convolutional kernel gives the network a larger receptive field, which leads to more precise segmentation results. However, larger kernels also increase the number of parameters, which hinders training and leads to a larger network. Dilated convolutions solve this problem by enlarging the receptive field without adding parameters. MobileNetV1 [19] builds a lightweight network with depthwise separable convolution, which decomposes a standard convolution into a depthwise convolution and a pointwise convolution, each followed by batch normalization (BN) and ReLU activation. In a depthwise convolution there is one convolutional kernel per input channel, and for the same input the output feature map of the depthwise convolution is calculated as follows.

$${\widehat{G}}_{k,l,m}= \sum\nolimits_{i,j}{\widehat{K}}_{i,j,m}\bullet {F}_{k+i-1,l+j-1,m}$$
(2)

where \(\widehat{K}\) is a depthwise convolutional kernel of size \({D}_{K}\times {D}_{K}\times M\), and the mth filter in \(\widehat{K}\) is applied to the mth channel of \(F\) to produce the mth channel of the filtered output feature map \(\widehat{G}\).

Although depthwise convolution greatly reduces the number of parameters, it cannot combine information across channels. Therefore, the feature maps produced by the depthwise convolution are fused by a \(1\times 1\) pointwise convolution to generate new feature maps. In terms of computational cost, a \(3\times 3\) depthwise separable convolution requires roughly 8–9 times fewer computations than a traditional \(3\times 3\) convolution. Moreover, when deep neural networks are built from depthwise separable convolution modules, pooling layers are not used and strided convolutions handle downsampling.
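The following PyTorch sketch illustrates such a depthwise separable block (per-channel \(3\times 3\) depthwise filtering as in Eq. (2), followed by \(1\times 1\) pointwise fusion, with BN and ReLU after each layer); it is a generic example and does not reproduce the exact MobileNet layer configuration.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv (one filter per input channel, cf. Eq. (2))
    followed by a 1x1 pointwise conv that fuses channels."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Sequential(
            # groups=in_ch -> each channel is filtered independently
            nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                      padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True))
        self.pointwise = nn.Sequential(
            # 1x1 conv fuses information across channels
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```

Setting `stride=2` in the depthwise layer performs the downsampling that would otherwise be done by pooling, as noted above.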

MobileNetV2 [35] introduces the inverted residual with linear bottleneck, which increases the number of layers and improves the overall accuracy and speed of the network. MobileNetV3 [18] adds a "Squeeze-and-Excitation" block to the core architecture. Its core idea is to improve the quality of the network representation by explicitly modeling the interdependencies between convolutional feature channels: the importance of each feature channel is learned automatically and used to enhance useful features and suppress those that are irrelevant to the current task. To adapt the backbone to the semantic segmentation task and preserve more spatial information, the stride of the 5th downsampling module is changed to 1, so the main network performs only 4 downsamplings. To compensate for the reduction in receptive field caused by removing this downsampling operation, dilated convolutions with a dilation rate of 2 are incorporated into the last three bottleneck modules. Table 1 shows the specific structure.

Table 1 Backbone network of the semantic encoding network used for extracting high-level information. SE indicates whether the attention mechanism is used; NL is the non-linear activation function; s is the stride

3.1.2 ASPP module

The ASPP module consists of a series of dilated convolutions with different dilation rates and a global average pooling branch, which can extract multi-scale object features. However, an improper setting of the parallel dilation rates can easily cause the gridding effect. Ref. [39] found that repeatedly using dilated convolutions with the same dilation rate in the ASPP module makes the effective convolution kernels discontinuous, which leads to the gridding effect (Fig. 2 (a)). A reasonable setting of the dilation rates (Fig. 2 (b)) captures the contextual information of targets at different scales and avoids the loss of relevant information.

Fig. 2

Grid artifacts and ideal effects of dilated convolution [39]. a Grid artifacts in dilated convolution. b Ideal effect of dilated convolution

$${M}_{i}=\mathrm{ max}[{M}_{i+1}-2{r}_{i},{M}_{i+1}-2\left({M}_{i+1}-{r}_{i}\right),{r}_{i}]$$
(3)

The dilation rates should follow two principles: 1) the dilation rates should not share a common divisor greater than 1; otherwise, the gridding effect occurs; 2) suppose the dilation rates of \(n\) dilated convolution kernels of size \(k\times k\) are \([ {r}_{1},\cdots ,{r}_{i} ,\cdots , {r}_{n}]\); then, according to Eq. (3), \({M}_{2}\) must be less than or equal to \(k\). Here, \({r}_{i}\) is the dilation rate of the ith dilated convolution, \({M}_{i}\) is the maximum effective dilation rate of the ith dilated convolution, and \({M}_{n}={r}_{n}\) by default. Following these design principles, the work experimentally resets the dilation rates of the dilated convolutions to 1, 2, 7, and 15.
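The rate selection can be checked mechanically against Eq. (3); the short script below is a sketch (not from the paper) that verifies the chosen rates 1, 2, 7, and 15 satisfy both rules for a \(3\times 3\) kernel.

```python
from math import gcd
from functools import reduce

def max_effective_rates(rates):
    """Eq. (3): M_i = max(M_{i+1} - 2*r_i, M_{i+1} - 2*(M_{i+1} - r_i), r_i),
    computed backwards from M_n = r_n."""
    M = [0] * len(rates)
    M[-1] = rates[-1]
    for i in range(len(rates) - 2, -1, -1):
        M[i] = max(M[i + 1] - 2 * rates[i],
                   M[i + 1] - 2 * (M[i + 1] - rates[i]),
                   rates[i])
    return M

rates, k = [1, 2, 7, 15], 3
assert reduce(gcd, rates) == 1   # rule 1: no common divisor greater than 1
M = max_effective_rates(rates)   # M = [1, 3, 7, 15] for these rates
assert M[1] <= k                 # rule 2: M_2 <= kernel size k
```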

3.2 Decoder network

The decoder's role is to decode the feature information encoded by the encoder and provide pixel-level classification. A well-designed decoder can improve model accuracy while maintaining inference speed. To decode the feature information better with fewer parameters, the decoder is designed with a boundary constraint branch based on second-order derivatives and a multi-scale fusion attention mechanism, consisting mainly of the CAFM, CBAM, and FREM.

The CAFM module filters out useful features through channel attention, enhancing their contribution to the model and reducing the influence of irrelevant features on the results. In addition, the low-level features are raised in dimension so that they are not overwhelmed by the high-level features. The Laplace branch is used to constrain the segmentation boundaries, enabling more precise boundary segmentation. Generic semantic segmentation gradually upsamples features obtained after heavy downsampling, which yields a large receptive field but can lose boundary information. Although feature fusion can supplement detailed information, its effect on fine boundaries is limited. Therefore, we propose boundary constraints that restrain the expansion of semantic information while supplementing boundary information as much as possible without introducing additional irrelevant information.

3.2.1 Multi-channel attention fusion module

Semantic segmentation networks usually extract semantic information in deep layers with a larger receptive field. However, the input feature maps of the deep layers lose much detailed information because of repeated downsampling, which affects the final upsampled prediction. Since shallow layers operate on higher-resolution images and produce feature maps rich in detail, it is crucial to fuse these feature maps with those from the deep layers. However, simply fusing low-level features with high-level features, as in SegNet [1] and U-Net [34], does not fully exploit the advantages of both and wastes information. If these features are integrated more appropriately, the result improves to some extent. Therefore, a multi-path channel attention fusion module (CAFM) is proposed to fuse these features (Fig. 3).

Fig. 3

Overall architecture of the multi-path channel attention feature fusion module

The CAFM calculates an attention vector to guide the fusion of semantic features. It is divided into two parts: compression and excitation. The compression part compresses the global spatial information, while the excitation part predicts the importance of each channel. In the compression part, the multi-path features are first concatenated along the channel dimension (Fig. 3); the concatenated feature map has dimensions \(H\times W\times 2C\), where \(H\), \(W\), and \(C\) are the height, width, and number of channels of each feature map before concatenation, respectively. The feature dimensions are then compressed from \(H\times W\times 2C\) to \(1\times 1\times 2C\) through global average pooling and global max pooling, i.e., \(H\times W\) is compressed to \(1\times 1\), which provides a large receptive field with few parameters. The above operation is expressed as follows:

$${Z}_{c}= {F}_{sq}\left({u}_{c}\right)= \frac{1}{H\times W}\sum\nolimits_{i=1}^{H}\sum\nolimits_{j=1}^{W}{u}_{c}(i,j)$$
(4)

In the excitation part, the \(1\times 1\times 2C\) feature tensor obtained from the compression part is passed through fully connected layers to predict the importance of each channel, and the resulting importance values are used to weight the corresponding channels of the feature map. A simple gating mechanism with a sigmoid activation function is used. These operations are expressed as follows:

$$s= {F}_{ex}\left(z,W\right)= \sigma \left(g\left(z,W\right)\right)= \sigma ({W}_{2}\delta ({W}_{1}z))$$
(5)

where \(\delta\) is the ReLU function; \(\sigma\) is the sigmoid function; \({W}_{1}\in {R}^{\frac{2C}{r}\times 2C}\) and \({W}_{2}\in {R}^{2C\times \frac{2C}{r}}\).
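A possible PyTorch sketch of this compression-excitation fusion is shown below. It assumes the two input features have already been brought to the same spatial size and channel count \(C\), that the average- and max-pooled descriptors are summed before the gating MLP, and that a \(1\times 1\) convolution projects the re-weighted \(2C\) channels back to \(C\); these fusion details are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class CAFM(nn.Module):
    """Sketch of the multi-path channel attention fusion described above."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)   # squeeze: H x W x 2C -> 1 x 1 x 2C
        self.gmp = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Sequential(             # excitation, cf. Eq. (5)
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
            nn.Sigmoid())
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False)

    def forward(self, low, high):
        x = torch.cat([low, high], dim=1)           # concatenate to H x W x 2C
        z = (self.gap(x) + self.gmp(x)).flatten(1)  # pooled descriptor of length 2C
        s = self.fc(z).view(x.size(0), -1, 1, 1)    # per-channel importance
        return self.project(x * s)                  # re-weighted fusion
```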

3.2.2 Boundary refinement branch

This article designs an edge detail constraint branch based on the Laplace operator. The input image is processed by the Laplace operator to obtain the low-level boundary feature \({B}_{1}\), a feature map that contains much second-order differential information unrelated to the segmented object. To extract continuous and complete edge information that constrains the edge segmentation of the semantic branch, this article designs the feature refinement extraction module (FREM) (Fig. 4) to extract high-level semantic features from \({B}_{1}\). The module refines \({B}_{1}\) from the whole to the details layer by layer, and each feature map extracts its own features independently without being affected by the other feature maps, yielding the most concise and complete edge feature map. A three-layer {16, 64, 128} FREM is used, where the first layer uses a convolutional kernel with a stride of 2 to downsample the input features; the three layers use \(7\times 7\), \(5\times 5\), and \(3\times 3\) convolutional kernels to refine \({B}_{1}\) and obtain the feature \({B}_{2}\). \({B}_{2}\) then passes through a convolutional block attention module (CBAM) [41] (Fig. 5) to filter out unwanted noise features and ultimately generate the edge constraint. The CBAM consists of a channel attention module and a spatial attention module. The channel attention module assigns different weights to each channel to strengthen the identification of important feature channels, which helps eliminate noise unrelated to the segmented object; the spatial attention module captures global contextual information to help infer ambiguous pixels and enhance the continuity of the target edges. The channel attention module compresses the feature dimensions of the input feature map from \(H\times W\times 128\) to \(1\times 1\times 128\); the compressed feature is passed through a fully connected layer to predict the importance of each channel, which is then applied to the corresponding channels of \({B}_{2}\) to obtain the feature \({B}_{2}^{\prime}\). The spatial attention module compresses the feature dimensions of \({B}_{2}^{\prime}\) from \(H\times W\times 128\) to \(H\times W\times 1\), and a \(7\times 7\) convolution is used to capture dependencies between features, enhancing edge continuity. Finally, the result is applied to the feature \({F}_{3}\) to help obtain a more complete and accurate segmentation.
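For illustration, a minimal sketch of the Laplace feature extraction is given below, applying a fixed \(3\times 3\) Laplacian kernel to each RGB channel; the 4-neighbour kernel is one common discretisation and is an assumption, not necessarily the exact operator used in the paper.

```python
import torch
import torch.nn.functional as F

def laplacian_edges(img):
    """Apply a fixed 3x3 Laplacian kernel to each channel of an RGB image
    (shape N x 3 x H x W) to obtain a second-order differential edge map B1."""
    kernel = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)
    kernel = kernel.repeat(3, 1, 1, 1)                 # one kernel per channel
    return F.conv2d(img, kernel, padding=1, groups=3)  # per-channel filtering
```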

Fig. 4

A 2-layer {4, 8} example of the feature refinement extraction module; the model in this work uses a 3-layer {16, 64, 128} structure

Fig. 5

Overall architecture of the convolutional block attention module including the channel attention module and the spatial attention module

4 Experiments

The MaSTr1325 [2] dataset and MID [28] dataset were used to evaluate the effectiveness of the proposed method in the semantic understanding of the ocean. Besides, the generalization of the proposed method was evaluated in other application scenarios on two famous and challenging datasets (CamVid [3] and PASCAL VOC2012 [11]) which are commonly used for semantic segmentation.

Firstly, detailed information about the datasets and evaluation metrics is given in Sect. 4.1, and the implementation details are described in Sect. 4.2. Subsequently, an ablation study on the CamVid validation set is conducted, and the effectiveness of each component of LDANet is explained in Sect. 4.3. Finally, the inference speed, model parameters, and segmentation accuracy of the proposed model are compared with those of several advanced models on the VOC 2012 and CamVid datasets in Sect. 4.4. The work follows the common semantic segmentation evaluation metric, mean Intersection over Union (mIoU).

4.1 Datasets and evaluation metrics

The datasets used in the experiments were MaSTr1325 [2] and MID [28], in addition to PASCAL VOC 2012 [11] and CamVid [3] datasets for the generalization evaluation of the network.

MaSTr1325 is a new large-scale maritime semantic segmentation training dataset for developing obstacle detection methods for small coastal unmanned surface vehicles (USVs). The dataset contains 1,325 images with a resolution of \(384\times 512\) captured by a real USV over a two-year period, covering a range of realistic scenarios encountered in coastal surveillance tasks. All images were semantically labeled at the pixel level and synchronized with inertial measurements from onboard sensors. The dataset includes three semantic labels and was split into a training set and a test set at a 9:1 ratio.

The MID dataset contained eight video sequences used for marine obstacle detection. There were 2,655 annotated images in the dataset with a resolution of \(640\times 480\), and they were captured by the Shenghai No.8 USV. The obstacles in this dataset were divided into large obstacles (spanning the water’s edge) and small obstacles (surrounded by water). The dataset was originally used for horizon segmentation and was pixel-level annotated in the work as obstacles, ocean, and sky for semantic segmentation. The dataset was split into a training set and a test set at a 9:1 ratio.

The CamVid dataset is a road scene dataset captured from a driving car, and it contains 701 images with a resolution of \(960 \times 720\). The quality of the images and annotations in this dataset is lower than that of Cityscapes, which makes the dataset challenging.

The PASCAL VOC 2012 semantic benchmark contained 20 foreground object classes and one background class. The original dataset had 1,464 and 1,449 images for training and validation, respectively. Additional annotations provided by [14] were also used in the work to enhance the training dataset.

Evaluation Metrics: Four metrics were used to quantitatively and qualitatively evaluate the performance of the proposed network model in the experiments: pixel accuracy (PA), mean Intersection over Union (mIoU), the number of parameters, and frames per second (FPS). The PA metric reflects the proportion of correctly classified pixels to the total number of pixels in the image. Equations (6) and (7) present the PA calculation and the mean IoU calculation, respectively.

$$PA = \frac{\sum\nolimits_{i=0}^{k}{p}_{ii}}{\sum\nolimits_{i=0}^{k}\sum\nolimits_{j=0}^{k}{p}_{ij}}$$
(6)
$$mean\;IoU = \frac{1}{k+1}\sum\nolimits_{i=0}^{k}\frac{{p}_{ii}}{\sum\nolimits_{j=0}^{k}{p}_{ij}+\sum\nolimits_{j=0}^{k}{p}_{ji}-{p}_{ii}}$$
(7)

where \(k\) is the number of categories and \(k+1\) is the number of categories including the background; \({p}_{ij}\) is the number of pixels belonging to category \(i\) but predicted as category \(j\); \({p}_{ji}\) is the number of pixels belonging to category \(j\) but predicted as category \(i\); \({p}_{ii}\) is the number of pixels of category \(i\) that are correctly predicted. Parameters reflects the memory occupied by the network model's parameters (Eq. (8)).

$$Parameters=\left({K}^{2}\times {C}_{i}\right)\times {C}_{0}+{C}_{0}$$
(8)

where \({C}_{i}\) is the number of input channels of the convolutional layer; \(K\) is the size of the convolution kernel; \({C}_{0}\) is the number of output channels of the convolutional layer. FPS reflects the image processing speed of the semantic segmentation model (Eq. (9)).

$$FPS=F\div T$$
(9)

where \(F\) is the number of processed frames; \(T\) is the elapsed time, usually set to 1 s.
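For reference, PA and mIoU as defined in Eqs. (6) and (7) can be computed from a confusion matrix as in the following sketch (a hypothetical helper, not part of the paper's code):

```python
import numpy as np

def pa_and_miou(conf):
    """PA and mIoU from a (k+1) x (k+1) confusion matrix,
    where conf[i, j] counts pixels of class i predicted as class j."""
    diag = np.diag(conf).astype(float)
    pa = diag.sum() / conf.sum()                               # Eq. (6)
    iou = diag / (conf.sum(axis=1) + conf.sum(axis=0) - diag)  # per-class IoU
    return pa, iou.mean()                                      # Eq. (7)
```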

4.2 Implementation details

Training: The Adam [23] optimizer and the SGD optimizer were combined to train all the models. The "poly" learning-rate update strategy [4] was employed, multiplying the learning rate by \({\left(1-\frac{step}{total\_steps}\right)}^{0.9}\) after each step. The images were scaled by a factor of 0.5 to 2 and flipped horizontally with a probability of 0.5. They were normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225] and randomly cropped according to the dataset. The cross-entropy loss was used to train the network.
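A minimal sketch of the "poly" learning-rate schedule using PyTorch's LambdaLR is shown below; the optimizer settings, base learning rate, and step count are placeholders, not the paper's actual hyperparameters.

```python
import torch

# Poly schedule: lr <- base_lr * (1 - step / total_steps)^0.9 after each step.
model = torch.nn.Conv2d(3, 3, 3)  # stand-in for the segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # placeholder
total_steps = 30000                                                      # placeholder
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: (1 - step / total_steps) ** 0.9)

for step in range(total_steps):
    # ... forward pass, loss computation, loss.backward() ...
    optimizer.step()
    scheduler.step()
```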

Setup: All experiments were conducted with CUDA 11.6 on Windows 10, with an Intel® Core™ i7-11700KF CPU @ 3.60 GHz and an NVIDIA GeForce RTX3070 GPU.

4.3 Ablation study

Ablation experiments were conducted on the CamVid dataset to validate the effectiveness of the CBAM, FREM, and CAFM modules.

4.3.1 CBAM module

The CBAM module uses a channel attention module and a spatial attention module to obtain spatial features at the edges of features and to provide spatial constraints for the features obtained by the semantic branch.

To verify the effectiveness of the CBAM module, an experiment with the module removed was performed, in which the edge refinement feature \({B}_{2}\) obtained by the FREM was added directly to the feature \({F}_{3}\) (Table 2). When the CBAM was removed, the network's mIoU on the test set decreased from 73.1 to 71.4%, a drop of 1.7% in segmentation accuracy. Thus, the CBAM module improves the segmentation ability of the network.

Table 2 Results of ablation experiments on CAFM, Laplace, FREM, and CBAM on the Camvid test set

Five experiments (S1, S2, S3, S4, and S5) were performed to find the most appropriate convolution kernel size for the spatial attention module of the CBAM. The kernel sizes used were 3, 5, 9, 11, and 7, respectively. Compared with S5 (the configuration used in this work), the mIoU of S1, S2, S3, and S4 decreased by 0.8, 0.9, 0.8, and 0.9%, respectively (Table 3). In general, increasing the kernel size of the spatial attention module gives each weight a larger receptive field, which brings a certain improvement; however, an overly large receptive field introduces more noise.

Table 3 Experimental results of different Kernel sizes in the spatial attention mechanism in CBAM

4.3.2 FREM

The FREM further extracts features from the shallow second-order differential features, which enables it to quickly extract refined edge information.

The FREM was replaced with a fully-connected convolutional layer to test its impact on the network. The mIoU decreased from 73.1 to 71.5% (Table 2), demonstrating that the FREM can quickly extract gradient features and improve network performance.

However, introducing the FREM or the CBAM alone did not improve segmentation accuracy: when both modules were removed, the mIoU was also 71.5% (Table 2). Using the two modules together significantly increased the accuracy.

4.3.3 Laplace

The Laplace module extracted second-order differential features from the three channels of the original image (i.e., R, G, and B) to obtain boundary information between different semantics and constrain the prediction results.

The entire edge refinement branch was removed to verify the effectiveness of the Laplace module. Compared with removing only the FREM and CBAM modules, the mIoU decreased from 71.5 to 70.1%. The edge-based branch constrains the edges of different semantics and improves segmentation accuracy.

4.3.4 CAFM

The CAFM, an efficient and simple feature fusion module, integrates the low-level and high-level features of the semantic encoding network. The work compared element-wise addition, concatenation, and the CAFM in terms of performance (Table 4). The experiment showed that feature concatenation gave a better inference speed and a higher mIoU than element-wise addition, while the CAFM achieved a further 3.1% increase in mIoU at the cost of 13.83 fps of inference speed. Higher semantic segmentation performance was thus achieved under real-time constraints.

Table 4 Comparison of three types of feature fusion methods

4.4 Comparison to state-of-the-art methods

First, LDANet was compared with advanced networks on the MaSTr1325 and MID datasets (Table 5). Second, LDANet was compared with some methods proposed in recent years on the VOC2012 and CamVid datasets to demonstrate its generalizability.

Table 5 Segmentation results of the three classes on the integrated dataset of the MaSTr1325 dataset and the MID dataset for the model

4.4.1 MaSTr1325 and MID datasets

The MaSTr1325 and MID datasets were combined into one dataset, with a 9:1 ratio between the training set and the test set. Images were uniformly scaled from 0.5 to 2 times using 480 as the base size and then cropped to \(480\times 480\). The mini-batch size was set to 12 for training. Figure 6 shows some example results on the dataset, and Table 5 compares the per-class IoUs of LDANet, Deeplab v3 [6], and Deeplab v3 + [7] on the three classes. LDANet is more accurate than Deeplab v3 on edges and small objects; even though its fps is lower by 55.68, it still meets real-time requirements. Although Deeplab v3 + improves edge detection considerably compared with Deeplab v3, its ability to distinguish sea surface reflections is poor.

Fig. 6

Visual examples from the test sets of the MaSTr1325 dataset and the MID dataset

When shallow features are introduced without being properly fused, useless shallow features degrade the results. The CAFM module of LDANet addresses this problem by selecting the shallow features, while the edge refinement branch constrains the semantic branch to ensure the continuity of semantic edges and accurate semantic boundaries.

4.4.2 PASCAL VOC2012 and CamVid

The performance of LDANet on the PASCAL VOC2012 and CamVid datasets was tested to verify its generalizability to other tasks. The effectiveness of LDANet was demonstrated by comparing it with recent real-time semantic segmentation methods.

PASCAL VOC2012: LDANet was tested on the PASCAL VOC2012 dataset with a mini-batch size of 6, and the images were cropped to \(512\times 512\). The SGD optimizer was used to train for 300 rounds. A mIoU of 81.1% was achieved at 111.46 fps with the extended dataset. Figure 7 shows the visualization results.

Fig. 7

Visual examples from the PASCAL VOC2012 test set

CamVid: LDANet was tested on the CamVid dataset. The mini-batch size was set to 8, and the images were randomly cropped to \(480\times 480\). The SGD optimizer was used to train for 200 rounds. LDANet achieves excellent performance and significantly improves the segmentation of semantic edges (Fig. 8). LDANet can process images with a resolution of \(960\times 720\) at 100.36 fps and achieves a mIoU of 73.1% on the CamVid test set. The comparison considers the pre-training condition, inference speed (fps), parameter count, and mIoU on the test set. Although LDANet's inference speed is not the fastest, it achieved the best segmentation accuracy without pre-training. Compared with BiSeNet2 [46], DSANet [10], and AGLNet [32], LDANet is 4.4, 2.2, and 3.7% higher in mIoU, respectively.

Fig. 8

Visual examples from the Camvid test set

Table 6 shows the IoUs of the 11 classes in the CamVid test set. Compared with other networks, LDANet significantly improves several difficult-to-classify classes, which proves the effectiveness of the method. Table 7 presents the experimental environments and results of LDANet and other methods for an objective evaluation of the network model.

Table 6 Segmentation results of 11 classes in the Camvid dataset
Table 7 Results of the model in the Camvid dataset and experimental environment

5 Conclusions

A new real-time semantic segmentation method was proposed in this work to achieve a balance between speed and accuracy. Efficient semantic segmentation requires the fusion of high-level semantic features and detail features. MobileNetV3 was used to extract high-level semantic features from the input, and multi-scale semantic information was obtained through the ASPP structure.

A new efficient decoder network was designed to integrate high-level and low-level semantic features well while preserving inference speed, solving the feature loss caused by upsampling. Meanwhile, a boundary constraint branch based on second-order differentiation was established to ensure the segmentation accuracy of semantic boundaries; this boundary constraint branch can also be inserted into other segmentation networks.

The work validated the effectiveness of LDANet on the MaSTr1325 and MID datasets and qualitatively and quantitatively evaluated it on two challenging datasets, CamVid and PASCAL VOC2012. The network remained competitive in these other scenarios. Compared with other advanced networks, LDANet significantly improved accuracy, speed, and generalizability.