1 Introduction

Semantic segmentation is a fundamental task in computer vision that aims to precisely predict the label of each pixel in an image. It has been widely applied in many fields, such as autonomous driving, medical image segmentation, video surveillance, and more. As the demand for mobile device deployment grows, real-time semantic segmentation has become a hot research field in recent years (Fig. 1).

Real-time semantic segmentation is a challenging task that requires considering both segmentation accuracy and inference speed. To balance both accuracy and speed, MobileNet [1] reduces computation by using depthwise separable convolutions. BiSeNet [2, 3] proposes a two-pathway architecture to capture spatial and semantic information separately. STDC [4] designs a lightweight backbone and proposes a detail aggregation module to preserve the spatial details in low-level layers. Despite the development and impressive performance of real-time semantic segmentation driven by these methods, it still faces the following challenges:

1) Most methods are inefficient at multi-level feature fusion. A semantic gap commonly exists between features of different levels. Simple feature fusion methods, such as element-wise addition or channel-wise concatenation, require little computation, but they are not capable of effectively bridging the semantic gap, leading to suboptimal segmentation performance. Some fusion methods employing attention mechanisms, such as [2, 5], use only spatial or only channel context information to generate attention weights, ignoring the gains available from generating them jointly. Moreover, sophisticated feature fusion methods [6, 7] can effectively bridge the semantic gap, but they come with high computational costs and are difficult to optimize, rendering them unsuitable for deployment on mobile devices.

2) Most semantic boundary optimizations require additional inference overhead. Semantic boundaries are crucial for segmentation accuracy, and further optimizing semantic boundary prediction can enhance segmentation performance. However, current real-time semantic segmentation methods [2, 3, 8,9,10] usually optimize semantic boundary prediction by adding extra learnable modules or branches, which inevitably leads to additional training costs and inference overhead.

Inspired by the above observations, we propose a novel real-time semantic segmentation network, ZMNet, that balances accuracy and speed by using a lightweight decoder and semantic boundary supervision module. ZMNet adopts an encoder–decoder structure. In the encoding stage, different levels of convolutional blocks generate feature maps of different scales. Specifically, shallow feature maps contain more spatial detail information while deep feature maps contain more semantic information. In the decoding stage, we propose a new feature fusion module to fuse deep and shallow features in a top-down manner with multi-dimensional contextual information of input features. In addition, the semantic boundaries are refined further to improve accuracy. These two processes in the decoding stage are implemented by two modules: the multi-level attention feature fusion module (MLA-FFM) and the semantic boundary supervision module (BSM).

Fig. 1

The comparison of speed–accuracy performance on the Cityscapes test set. Red dots indicate our methods, while blue dots represent other methods. Notably, our proposed ZMNet achieves a state-of-the-art trade-off between segmentation accuracy and inference speed

Our main contributions can be summarized as follows:

  • We propose a lightweight multi-level feature fusion module called MLA-FFM, which is the critical component of our decoder. This module can bridge the semantic gap and fuse multi-level features in an effective and efficient way.

  • We propose a semantic boundary supervision module called BSM, which aims to improve the accuracy of semantic boundaries without increasing the computational cost of the inference stage.

  • Extensive experiments demonstrate that ZMNet achieves a state-of-the-art trade-off between accuracy and speed on both Cityscapes and CamVid datasets. More specifically, on one NVIDIA GTX 1080Ti card, ZMNet achieves 77.4% mIoU on Cityscapes test set at 97.5 FPS, and 74% mIoU on CamVid test set at 156.6 FPS.

2 Related work

Real-time Semantic Segmentation. With the increasing demand for mobile device deployment, real-time semantic segmentation has been an active research area in computer vision for the past few years, with significant progress achieved in terms of accuracy and speed. So far, many deep convolutional neural network-based methods have been proposed and have made great progress in the semantic segmentation of street scenes. MobileNet [1], ShuffleNet [11], and STDC [4] propose lightweight and effective backbones to reduce the parameters and computation of models. BiSeNet [2, 3] proposes a two-pathway architecture consisting of a context path for high-level context information and a spatial path that supplements and strengthens spatial information. Similarly, BFMNet [10] adopts a bilateral network structure to separately encode semantic feature information and detailed feature information, and introduces the attention enhancement fusion module (AEFM) to promote bilateral feature fusion. Fast-SCNN [12] proposes a learning-to-downsample module that contains three layers with stride 2 for fast and efficient multi-branch low-level feature extraction. GDN [13] proposes a guided down-sampling method that decomposes the original image into a set of compressed images, reducing the size of feature maps while retaining most of the spatial information of the original image, thus greatly reducing computational costs. DDRNet [14] maintains high-resolution representations and harvests contextual information simultaneously with a dual-resolution network structure and a deep aggregation pyramid pooling module (DAPPM). The method in [15] proposes a token pyramid module for processing high-resolution images to quickly produce a local feature pyramid. SFANet [16] proposes a Stage-aware Feature Alignment module (SFA) to efficiently align and aggregate two adjacent levels of feature maps. The method in [17] develops a lattice enhanced residual block (LERB) to address the inferior feature extraction capability of lightweight backbone networks, and proposes a feature transformation block (FTB) that mitigates feature interference between different layers at a lower computational cost. Moreover, image segmentation under rain or fog conditions is very practical in real applications, especially in real-time scenarios [18,19,20].

Attention mechanism. The key idea of the attention mechanism is to enhance feature representation and improve network performance by telling the network “what” and “where” to focus on. The attention mechanism has achieved great success and been widely applied in natural language processing tasks due to its high efficiency and simplicity. In recent years, it has also been used in the computer vision field. Wang et al. [21] bridge the self-attention mechanism and non-local operators to capture long-range dependencies with deep neural networks. The method in [22] exploits the relationships between different channels through channel attention. Going further, BAM [23] and CBAM [24] introduce spatial attention and combine it with channel attention to improve the representation power of CNNs. ResNeSt [25] proposes a split-attention block that performs feature map attention across different feature map groups. SENet [26] strengthens the spatial information representation by enhancing the information correlation between feature maps of different resolutions through an attention mechanism. The method in [27] proposes a sparse attention module (SAM) and a class attention module (CAM) to reduce the computational cost of self-attention.

Feature Fusion Module. Feature fusion modules are widely used in encoder–decoder networks. As the network layers deepen, the semantic gap between shallow features and deep features also increases: deep features contain high-level semantic information with low resolution, while shallow features contain high-resolution spatial information. Although traditional feature fusion methods such as element-wise addition or channel-wise concatenation effectively reduce computational and parameter complexity, they may result in poor fusion performance. To better fuse features of different levels, BiSeNet [2] proposes a feature fusion module (FFM) to fuse the output features of the spatial path and context path to make the final prediction. SFNet [28] proposes a novel flow-based align module (FAM) to promote broadcasting high-level features to high-resolution features. DFANet [29] proposes sub-network aggregation and sub-stage aggregation modules to obtain sufficient receptive fields and enhance the model's learning ability. Dai et al. [5] propose an attentional feature fusion (AFF) module, which fuses features of inconsistent semantics and scales by utilizing multi-scale contextual information along the channel dimension. In this paper, we propose the multi-level attention feature fusion module (MLA-FFM), which fuses features of different levels by utilizing the multi-dimensional contextual information of those features.

Fig. 2

The architecture overview of ZMNet. “MLA-FFM” denotes the multi-level attention feature fusion module, and “BSM” denotes the semantic boundary supervision module

3 Proposed method

In this section, we present our proposed methods in detail. We first introduce the overall architecture of ZMNet in Sect. 3.1. Then, we present MLA-FFM in Sect. 3.2, which is used to fuse features from adjacent levels. Finally, we introduce BSM in Sect. 3.3, which is used to optimize the semantic boundary segmentation.

3.1 Network architecture

As shown in Fig. 2, the architecture of ZMNet follows the encoder–decoder paradigm. The backbone of the encoder can be any computationally efficient CNN architecture. We choose STDC [4] as the backbone because of its outstanding performance in real-time semantic segmentation. The decoder is composed of a sequence of MLA-FFMs that integrate features in a top-down manner. To enhance the segmentation accuracy, we incorporate BSM as a supervision signal to guide the learning of semantic boundaries during training. The overall process is as follows:

Firstly, as illustrated in Fig. 2a, the encoder generates a series of multi-level feature maps, denoted as {\(x_1,\ldots ,x_5\)}. It should be noted that, during the decoding stage, involving \(x_1\) in feature fusion would result in a significant computational burden due to its large spatial size, and the abundant spatial information of the shallow feature map may hinder feature aggregation [30]. Therefore, in the decoding stage, we only utilize \(x_2\), \(x_3\), \(x_4\), and \(x_5\) for feature fusion.

Secondly, as shown in Fig. 2b, to integrate spatial details of shallow features during the upsampling process, the decoder utilizes lateral connections to perform top-down feature fusion with MLA-FFMs. To bridge the semantic gap, MLA-FFM exploits input feature maps’ multi-dimensional contextual information to generate a soft attention map, then fuses inputs in a soft selection manner. The decoder of ZMNet comprises three decoding stages, each of which utilizes MLA-FFM to fuse feature maps from different layers and generate corresponding fused feature map, i.e., {\(x_6, x_7, x_8\)}. We denote the process of MLA-FFM as \(\mathcal {F}(\cdot , \cdot )\):

$$\begin{aligned} \begin{aligned} x_6&= \mathcal {F} (x_4, x_5) \\ x_7&= \mathcal {F}(x_3, x_6) \\ x_8&= \mathcal {F}(x_2, x_7) \end{aligned} \end{aligned}$$
(1)
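For concreteness, the following PyTorch-style sketch illustrates the top-down decoding of Eq. (1). The constructor interface, channel counts, and the MLA-FFM signature are illustrative assumptions rather than the exact configuration; MLA-FFM itself is sketched in Sect. 3.2.

```python
import torch
from torch import nn

class Decoder(nn.Module):
    """Top-down decoder of ZMNet: fuses encoder features x2..x5 with three MLA-FFMs (Eq. 1).

    `mlaffm_cls` is assumed to build a module with forward(x_high, x_low); the channel
    counts below are placeholders, not the values used in the paper.
    """

    def __init__(self, mlaffm_cls, channels=(64, 256, 512, 1024), mid=128):
        super().__init__()
        c2, c3, c4, c5 = channels
        self.ffm1 = mlaffm_cls(high_ch=c4, low_ch=c5, out_ch=mid)   # x6 = F(x4, x5)
        self.ffm2 = mlaffm_cls(high_ch=c3, low_ch=mid, out_ch=mid)  # x7 = F(x3, x6)
        self.ffm3 = mlaffm_cls(high_ch=c2, low_ch=mid, out_ch=mid)  # x8 = F(x2, x7)

    def forward(self, x2, x3, x4, x5):
        x6 = self.ffm1(x4, x5)
        x7 = self.ffm2(x3, x6)
        x8 = self.ffm3(x2, x7)
        return x6, x7, x8  # x6 and x7 feed auxiliary SegHeads during training only
```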

In the training phase, each MLA-FFM is followed by a segmentation head (SegHead) and an upsampling operation to produce a segmentation prediction. All segmentation predictions are supervised by the ground-truth semantic labels. In this paper, we adopt the cross-entropy loss with online hard example mining (OHEM) [31] to optimize the models. The joint segmentation loss \(l_s\) is calculated as follows:

$$\begin{aligned} \begin{aligned} pred_i&= \mathcal {I}_u(\mathcal {S}_i(x_i)) \\ l_{s}&= \sum \limits _{i=6}^{8} \mathcal {L}_{s}(pred_i, gt) \end{aligned} \end{aligned}$$
(2)

where \(x_i\) denotes the output feature map at the i-th stage, and \(pred_i \in \mathbb {R}^{N \times H \times W}\) is the corresponding segmentation prediction probability map, where N is the number of classes. \(\mathcal {S}_i(\cdot )\) denotes the i-th SegHead operation, and its structure is illustrated in Fig. 3a. Note that \(\mathcal {S}_6(\cdot )\) and \(\mathcal {S}_7(\cdot )\) are auxiliary SegHeads intended to enhance feature representation during training, and they are discarded in the inference phase. \(\mathcal {I}_u(\cdot )\) denotes the bilinear upsampling operation. \(gt \in \mathbb {R}^{1 \times H \times W}\) denotes the ground-truth semantic labels. \(\mathcal {L}_{s}(\cdot )\) denotes the cross-entropy loss with OHEM [31].
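The following is a minimal sketch of cross-entropy with online hard example mining for \(\mathcal {L}_{s}\). The probability threshold and minimum number of kept pixels are common default choices and are assumptions here, not the settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, target, ignore_index=255, thresh=0.7, min_kept=100000):
    """Cross-entropy computed over the hardest pixels only (OHEM), a sketch.

    logits: (B, N, H, W) raw scores; target: (B, H, W) integer class labels.
    Pixels whose predicted probability for the true class falls below `thresh`
    are treated as hard examples; at least `min_kept` pixels are always kept.
    """
    pixel_loss = F.cross_entropy(logits, target, ignore_index=ignore_index,
                                 reduction="none").flatten()
    with torch.no_grad():
        prob = F.softmax(logits, dim=1)
        gt_prob = prob.gather(1, target.clamp(min=0).unsqueeze(1)).squeeze(1).flatten()
        valid = target.flatten() != ignore_index
        hard = valid & (gt_prob < thresh)
        if hard.sum() < min_kept:
            # fall back to the `min_kept` valid pixels with the lowest true-class probability
            k = max(1, min(min_kept, int(valid.sum())))
            idx = torch.topk(gt_prob.masked_fill(~valid, 1.0), k, largest=False).indices
            hard = torch.zeros_like(hard)
            hard[idx] = True
    return pixel_loss[hard].mean()
```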

Fig. 3

The components of (a) SegHead and (b) AttnHead. “CBR” denotes convolution followed by batch normalization (BN) and ReLU

Thirdly, as shown in Fig. 2d, to obtain more accurate semantic boundaries, we use BSM to derive the semantic boundary ground-truth \(gt_b\) and the semantic boundary probability map \(pred_b\) from the ground-truth gt and the segmentation prediction probability map \(pred_7\) of the 7-th stage. As the pixels on the semantic boundary represent hard samples due to their relatively small proportion compared to non-semantic boundary pixels, we utilize focal loss [32] to increase the weight of hard samples and optimize the model’s learning of semantic boundaries. Notably, BSM is a module designed to guide the learning of semantic boundary prediction during training and is discarded in the inference phase. The above procedure can be formulated as follows:

$$\begin{aligned} \begin{aligned} \{gt_b, pred_b\}&= \mathcal {B} (gt, pred_7) \\ l_{b}&= \mathcal {L}_{b}(gt_b, pred_b) \end{aligned} \end{aligned}$$
(3)

where \(\mathcal {B}(\cdot , \cdot )\) denotes the operation of BSM. \(\mathcal {L}_{b}(\cdot )\) denotes the focal loss [32] function. \(l_{b}\) denotes the semantic boundary loss.

Finally, the overall objective function l is a combination of segmentation loss \(l_s\) and semantic boundary loss \(l_b\). We use the trade-off parameter \(\xi \) to balance the weight of the segmentation loss and semantic boundary loss. In our paper, we set \(\xi = 0.6\).

$$\begin{aligned} \begin{aligned} l&= l_{s} + \xi \cdot l_{b} \end{aligned} \end{aligned}$$
(4)
Fig. 4

The components of MLA-FFM. “Spatial Aggregation” refers to pooling operations along the spatial dimension, and “Channel Aggregation” refers to pooling operations along the channel dimension. The symbol \(\alpha \) represents the weight of the linear combination of \(x_{l'}\) and \(x_{h'}\)

3.2 Multi-level attention feature fusion module

As illustrated in Fig. 4, MLA-FFM receives two feature maps of different levels as inputs, where the shallow-level feature map \(x_h\) has a higher resolution than the deep-level feature map \(x_l\). The sizes of \(x_h\) and \(x_l\) are first unified through channel compression and bilinear interpolation, and the unified feature maps are denoted as \(x_{h'}\) and \(x_{l'}\).

Next, information aggregation is performed through pooling operations to obtain a larger receptive field and capture multi-dimensional contextual information. Spatial aggregation and channel aggregation are carried out separately. Spatial aggregation applies max-pooling over the spatial dimensions to aggregate spatial-wise context information, while channel aggregation applies average-pooling along the channel dimension to aggregate channel-wise context information without losing the spatial location information of the feature maps. After information aggregation, four feature maps are generated, i.e., \(x^{sp}_{l'} \in \mathbb {R}^{C'\times 1\times 1}\), \(x^{cp}_{l'} \in \mathbb {R}^{1\times H' \times W'}\), \(x^{sp}_{h'} \in \mathbb {R}^{C' \times 1\times 1}\), and \(x^{cp}_{h'} \in \mathbb {R}^{1\times H' \times W'}\).

Then, the aggregated spatial and channel information of the two features is mixed to generate the attention weights. Specifically, we multiply \(x^{sp}_{l'}\) and \(x^{cp}_{h'}\) to obtain \(m_1\), and multiply \(x^{sp}_{h'}\) and \(x^{cp}_{l'}\) to obtain \(m_2\). Then, \(m_1\) and \(m_2\) are concatenated and fed into the attention head (AttnHead) to generate an attention weight tensor \(\alpha \):

$$\begin{aligned} \begin{aligned} m_1&= \mathcal{S}\mathcal{P}(x_{l'}) \otimes \mathcal{C}\mathcal{P}(x_{h'}) \\ m_2&= \mathcal{S}\mathcal{P}(x_{h'}) \otimes \mathcal{C}\mathcal{P}(x_{l'}) \\ \alpha&= \mathcal {T}([m_1, m_2]) \\ \end{aligned} \end{aligned}$$
(5)

where \(\mathcal{S}\mathcal{P}(\cdot )\) and \(\mathcal{C}\mathcal{P}(\cdot )\) denote the spatial aggregation and channel aggregation, respectively. \([\cdot ,\cdot ]\) denotes the concatenation operation along the channel dimension. \(\mathcal {T}(\cdot )\) denotes the AttnHead operation, and its structure is illustrated in Fig. 3b.

Finally, we perform a soft selection on \(x_{h'}\) and \(x_{l'}\), and then use element-wise addition followed by a \(3 \times 3\) CBR to obtain the fused feature map. The merge procedure can be formulated as follows:

$$\begin{aligned} \begin{aligned} x_{h+1}&= CBR(x_{h'} \odot \alpha + x_{l'} \odot (1 - \alpha )) \end{aligned} \end{aligned}$$
(6)
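A minimal PyTorch sketch of MLA-FFM following Fig. 4 and Eqs. (5)–(6) is given below. The channel compression (assumed here to be a 1×1 CBR) and the AttnHead layout (assumed to be a 3×3 CBR followed by a 1×1 convolution and sigmoid, cf. Fig. 3b) are illustrative assumptions rather than the exact configuration.

```python
import torch
from torch import nn
import torch.nn.functional as F

def cbr(in_ch, out_ch, k=3):
    """Convolution + batch normalization + ReLU ("CBR" in Fig. 3)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class MLAFFM(nn.Module):
    """Multi-level attention feature fusion module (Sect. 3.2), a sketch."""

    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.compress_h = cbr(high_ch, out_ch, k=1)   # unify channels of x_h
        self.compress_l = cbr(low_ch, out_ch, k=1)    # unify channels of x_l
        # AttnHead: assumed to be a 3x3 CBR followed by a 1x1 conv and sigmoid (cf. Fig. 3b)
        self.attn_head = nn.Sequential(
            cbr(2 * out_ch, out_ch, k=3),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.Sigmoid(),
        )
        self.fuse = cbr(out_ch, out_ch, k=3)          # final 3x3 CBR in Eq. (6)

    def forward(self, x_h, x_l):
        xh = self.compress_h(x_h)
        xl = F.interpolate(self.compress_l(x_l), size=xh.shape[2:],
                           mode="bilinear", align_corners=False)
        sp_h = F.adaptive_max_pool2d(xh, 1)   # spatial aggregation -> (B, C', 1, 1)
        sp_l = F.adaptive_max_pool2d(xl, 1)
        cp_h = xh.mean(dim=1, keepdim=True)   # channel aggregation -> (B, 1, H', W')
        cp_l = xl.mean(dim=1, keepdim=True)
        m1 = sp_l * cp_h                      # Eq. (5)
        m2 = sp_h * cp_l
        alpha = self.attn_head(torch.cat([m1, m2], dim=1))
        return self.fuse(xh * alpha + xl * (1 - alpha))   # Eq. (6)
```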
Fig. 5

The components of BSM. “Laplacian” denotes the application of a 2D convolution using the Laplacian kernel as the weight, and “Dilation” denotes the dilation operation

3.3 Semantic boundary supervision module

The semantic boundary is crucial for semantic segmentation tasks. However, down-sampling operations in deep convolutional neural networks cause the loss of spatial detail information, resulting in rough predictions of semantic boundaries and their adjacent pixels. To alleviate this problem, we propose BSM to supervise the semantic boundaries of the segmentation results. Figure 5 illustrates the structure of BSM.

Firstly, a binary boundary mask \(m_b \in \mathbb {R}^{1 \times H \times W}\) is extracted from the ground-truth semantic labels using a 2D convolution with the Laplacian kernel as the weight. In the binary boundary mask, a pixel's value is 1 if it lies on a semantic boundary and 0 otherwise.

Then, a dilation operation on the binary semantic boundary mask is performed to expand the range of boundary pixels, in order to treat pixels near the semantic boundary as part of the semantic boundary. This operation produces a dilated binary boundary mask, denoted as \(m_d \in \mathbb {R}^{1 \times H \times W}\).

Finally, the dilated binary boundary mask is element-wise multiplied with both the ground-truth semantic labels and the predicted probability map to obtain two semantic boundary maps, i.e., the semantic boundary ground-truth \(gt_b \in \mathbb {R}^{1 \times H \times W}\) and the semantic boundary prediction \(pred_b \in \mathbb {R}^{N \times H \times W}\). It is worth noting that those semantic boundary maps only retain the classification information of boundary elements, while non-boundary elements are set to 0. Therefore, unlike [4], BSM can not only optimize the semantic boundary but also further optimize the classification information of the semantic boundary. We believe this can achieve better semantic boundary results. The above procedure can be formulated as follows:

$$\begin{aligned} \begin{aligned} m_{d}&= \Upsilon (\Gamma (gt)) \\ gt_{b}&= m_{d} \odot gt \\ pred_{b}&= m_{d} \odot pred \end{aligned} \end{aligned}$$
(7)

where \(\Gamma (\cdot )\) denotes the 2D convolution with the Laplacian kernel as the weight. \(\Upsilon (\cdot )\) denotes the dilation operation.
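The sketch below illustrates BSM as described by Eq. (7), together with the boundary focal loss of Eq. (3). The particular Laplacian kernel, the dilation kernel size, and the focal-loss focusing parameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# 3x3 Laplacian kernel used to extract boundaries from the label map (assumed form)
LAPLACIAN = torch.tensor([[-1., -1., -1.],
                          [-1.,  8., -1.],
                          [-1., -1., -1.]]).view(1, 1, 3, 3)

def boundary_supervision(gt, pred, dilate_k=3, gamma=2.0):
    """Derive (gt_b, pred_b) as in Eq. (7) and return the boundary focal loss l_b.

    gt: (B, H, W) integer label map; pred: (B, N, H, W) class probability map.
    """
    gt_f = gt.clamp(min=0).float().unsqueeze(1)                    # (B, 1, H, W)
    edge = F.conv2d(gt_f, LAPLACIAN.to(gt_f.device), padding=1)
    m_b = (edge.abs() > 0).float()                                 # binary boundary mask
    # dilation implemented as max-pooling over the binary mask
    m_d = F.max_pool2d(m_b, kernel_size=dilate_k, stride=1, padding=dilate_k // 2)

    gt_b = (gt_f * m_d).squeeze(1).long()                          # boundary-only labels
    pred_b = pred * m_d                                            # boundary-only probabilities

    # multi-class focal loss restricted to the (dilated) boundary pixels
    ce = F.nll_loss(torch.log(pred_b.clamp(min=1e-6)), gt_b, reduction="none")
    focal = (1 - torch.exp(-ce)) ** gamma * ce
    mask = m_d.squeeze(1) > 0
    return focal[mask].mean() if mask.any() else focal.sum() * 0.0
```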

4 Experiments

4.1 Datasets

Cityscapes. Cityscapes [33] is one of the most popular complex urban street scene datasets. It contains 5,000 finely annotated images and is split into 2,975, 500, and 1,525 images for training, validation, and testing, respectively. The annotation includes 19 classes for the semantic segmentation task. All images share the same resolution of \(1,024 \times 2,048\), which is challenging for real-time semantic segmentation. For a fair comparison, we only use finely annotated images in our experiments.

CamVid. The Cambridge-driving Labeled Video Database (CamVid) [34] is a road scene dataset that contains 701 images, partitioned into 367 training, 101 validation, and 233 test images. All images share the same resolution of \(720 \times 960\). The annotation includes 32 categories, of which a subset of 11 categories is used in our experiments. Following the same setting as [2, 14, 29, 35], we train our model on both the training and validation sets and evaluate on the test set.

Table 1 Performance comparison with and without BSM on Cityscapes test set

4.2 Implementation details

Data Augmentation. In the training phase, we apply color jittering, random horizontal flipping, random resizing, and random cropping. For Cityscapes, the random scale ranges within [0.125, 1.5], and the cropped resolution is \(512 \times 1,024\). For CamVid, the random scale ranges within [0.5, 2.5], and the cropped resolution is \(720 \times 960\).

Fig. 6

Visualization of segmentation results with and without BSM on the Cityscapes validation set. The column with subscript B denotes predictions with BSM. Column (a) shows the input images, (b) the ground truths of the input images, and (c), (d) the predictions without and with BSM, respectively

Table 2 Performance comparison of using different loss functions in BSM and inserting BSM at different stages

Training Settings. As a common configuration, we use mini-batch stochastic gradient descent (SGD) [36] as the optimizer, with momentum set to 0.9 and weight decay set to \(5e^{-4}\). Similar to [37, 38], we also utilize the “poly” learning rate policy: the initial learning rate is set to 0.01 and the power to 0.9, and the learning rate is multiplied by \((1-\frac{iter}{max\_iter} )^{power}\). In addition, we use a warm-up strategy for the first 1,000 and 300 iterations on Cityscapes and CamVid, respectively. For Cityscapes, we set the batch size to 40 and the maximum number of iterations to 70,000. For CamVid, we set the batch size to 30 and the maximum number of iterations to 30,000. We conduct all training experiments on an NVIDIA GeForce 2080Ti.
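A sketch of this schedule is shown below; the linear warm-up shape and the warm-up starting factor are assumptions, since only the warm-up length is specified above.

```python
import torch

def build_poly_scheduler(optimizer, max_iter, power=0.9, warmup_iters=1000,
                         warmup_factor=0.1):
    """'Poly' policy: lr = base_lr * (1 - iter / max_iter) ** power, after a linear warm-up."""
    def lr_lambda(it):
        if it < warmup_iters:
            return warmup_factor + (1 - warmup_factor) * it / warmup_iters
        return (1 - it / max_iter) ** power
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# usage matching the Cityscapes settings above (the model here is a placeholder)
model = torch.nn.Conv2d(3, 19, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = build_poly_scheduler(optimizer, max_iter=70000, warmup_iters=1000)
for it in range(70000):
    # ... forward pass, loss.backward(), optimizer.step() ...
    scheduler.step()
```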

Inference Settings. Similar to [2, 4], for the Cityscapes dataset, we first resize the images from the original \(1,024 \times 2,048\) to \(768 \times 1,536\) for inference, and then resize the prediction back to the original input size. The time taken for resizing is included in our reported inference time. For the CamVid dataset, we take the original image as input. We measure the inference time under PyTorch 1.6, CUDA 10.2, CUDNN 7.6, and TensorRT on a single NVIDIA GeForce 1080Ti GPU with a batch size of 1.
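The measurement protocol can be sketched as follows (TensorRT deployment is omitted; the number of timed iterations and warm-up passes are illustrative choices):

```python
import time
import torch
import torch.nn.functional as F

@torch.no_grad()
def measure_fps(model, iters=200, device="cuda"):
    """Time inference on 1024x2048 inputs resized to 768x1536; resize time is included."""
    model.eval().to(device)
    x = torch.randn(1, 3, 1024, 2048, device=device)
    for _ in range(10):                       # warm-up passes before timing
        model(F.interpolate(x, size=(768, 1536), mode="bilinear", align_corners=False))
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        inp = F.interpolate(x, size=(768, 1536), mode="bilinear", align_corners=False)
        out = model(inp)
        out = F.interpolate(out, size=(1024, 2048), mode="bilinear", align_corners=False)
    torch.cuda.synchronize()
    return iters / (time.time() - start)
```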

Evaluation Metrics. In this paper, we adopt the mean of class-wise Intersection over Union (mIoU) to evaluate segmentation accuracy and frames per second (FPS) to evaluate inference speed.
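For reference, class-wise IoU and mIoU can be accumulated from a confusion matrix as in the sketch below; the ignore label 255 follows the common Cityscapes convention and is an assumption here.

```python
import numpy as np

def update_confusion(conf, pred, gt, num_classes=19, ignore_index=255):
    """Accumulate pixel counts; `pred` and `gt` are flat integer label arrays."""
    mask = gt != ignore_index
    idx = num_classes * gt[mask].astype(np.int64) + pred[mask].astype(np.int64)
    conf += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return conf

def compute_miou(conf):
    """Class-wise IoU and mIoU from an (N, N) confusion matrix (rows: gt, cols: prediction)."""
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return iou, iou.mean()

# conf is initialized once per evaluation, e.g. np.zeros((19, 19), dtype=np.int64)
```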

4.3 Ablation study

Effectiveness of BSM. To investigate the effectiveness of BSM, we conduct ablation experiments with and without BSM. As shown in Table 1, the model with BSM improves the segmentation accuracy for most categories. Through careful observation, we find that the categories with clear edges and lines, such as poles, traffic signs, trucks, buses, and trains, show the most significant improvement in segmentation accuracy. Specifically, the train category has the most noticeable improvement, with mIoU increasing from 75.2% to 79.4%, a 4.2% improvement.

To further demonstrate the effectiveness of BSM, we present some segmentation examples of models with and without BSM for visual comparison. As shown in Fig. 6, comparing columns (c) and (d), we can see that the model with BSM achieves better semantic segmentation performance than the model without BSM.

Comparison of different loss functions in BSM. In order to validate the rationality of the selected loss function, we conduct experimental comparisons in BSM using both the focal [32] and OHEM [31] loss functions. As shown in Table 2, although both the OHEM and focal loss functions are suitable for addressing issues such as class imbalance and hard example learning, the experimental results consistently indicate higher accuracy with the focal loss than with the OHEM loss. Therefore, in this paper, we choose the focal loss as the loss function for BSM.

Comparison of BSM Application Positions. It is worth noting that BSM is a module specifically designed to improve the accuracy of semantic boundaries during model training, which can be incorporated at various stages. As shown in Table 2, when BSM is applied in the 7-th stage, its segmentation accuracy is higher than that of other stages, which increases the mIoU on the Cityscapes validation set from 77.2% to 77.6%.

Table 3 Compare the effect of using auxiliary segmentation supervision in different decoding stages

Effectiveness of Auxiliary Segmentation Supervision. The decoder of ZMNet consists of three fusion stages, namely stage 6, stage 7, and stage 8. During the training phase, the main segmentation supervision is conducted in stage 8, while auxiliary segmentation supervision can be performed in stages 6 and 7. In order to evaluate the impact of auxiliary segmentation supervision on segmentation accuracy, we conduct ablation experiments on the Cityscapes validation set. The results are shown in Table 3. It can be observed that incorporating auxiliary segmentation supervision in either stage 6 or stage 7 contributes to the improvement of segmentation accuracy. The best segmentation performance is achieved when auxiliary segmentation supervision is applied in both stage 6 and stage 7, increasing mIoU from 76.8% to 77.2%, an improvement of 0.4%.

Table 4 Comparison with other feature fusion methods on the Cityscapes validation set

Effectiveness of MLA-FFM. To validate the effectiveness of MLA-FFM, we compare it with previously popular feature fusion methods. For a fair comparison, we set the input resolution to \(768 \times 1,536\) for all methods and only use the feature maps generated by the last three stages of the encoder for feature fusion. The input and output feature maps are aligned using \(3 \times 3\) CBR. The settings of the auxiliary segmentation supervision are also consistent.

As shown in Table 4, the feature fusion method of element-wise addition (EA) is the fastest fusion method among all methods. The channel-wise concatenation (CAT) fusion method performs similarly to EA, but both methods ignore the importance of multi-dimensional contextual information during fusion. Compared with the EA and CAT methods, our proposed MLA-FFM improves mIoU by 0.6% and 0.5%, respectively, at the cost of a small speed sacrifice.

AFF [5] aggregates spatial contextual information through spatial pooling during feature fusion. Our method not only aggregates spatial contextual information through spatial aggregation but also aggregates channel contextual information through channel aggregation, and generates a soft attention map by combining these two pieces of contextual information to better fuse features from different levels. By comparison, we can find that MLA-FFM is faster and more accurate than AFF [5].

FFM [2] uses contextual information of the channel dimension to perform attention weighting on the fused features. Compared with FFM [2], our method is 13.4 FPS faster and has a higher mIoU by 0.7%. We believe that such improvement is due to our use of multi-dimensional contextual information to fuse feature information.

Table 5 Comparison of different pooling operations in MLA-FFM

Comparison of different pooling operations in MLA-FFM. Firstly, as shown in the first row of Table 5, we remove all pooling operations and the corresponding matrix multiplication operations in MLA-FFM. We find that the modified MLA-FFM does not show any significant change in speed but significantly reduces the segmentation accuracy (by 1.5% mIoU). This indicates that multi-dimensional contextual information is beneficial for feature fusion.

Additionally, we compare the effects of different pooling operations in MLA-FFM. As shown in rows two to five of Table 5, the impact of different pooling operations on the results is not significant. Among them, spatial dimension aggregation implemented by max-pooling and channel dimension aggregation implemented by average-pooling achieve the best performance.

Comparison of different trade-off parameter \(\xi \) values. In our paper, the semantic segmentation loss and the semantic boundary loss jointly optimize our model. We employ a parameter \(\xi \) to balance the trade-off between them. As shown in Fig. 7, the accuracy of our network is improved when \(\xi \) takes smaller values such as 0.3, 0.6, or 1.0. However, when \(\xi \) is set to a larger value, such as 1.5 or 2.0, the accuracy of the network decreases. This is because, when \(\xi \) is larger, the network may pay too much attention to the semantic boundary area while reducing attention to non-boundary areas, resulting in a decrease in overall segmentation accuracy. Experimental results show that segmentation performance is optimal when \(\xi \) is set to 0.6.

Fig. 7

Comparison of different trade-off parameter \(\xi \) values

Table 6 Comparison with other state-of-the-art real-time methods on the Cityscapes dataset

4.4 Compare with state-of-the-arts

Results on Cityscapes. To demonstrate the ability of our proposed ZMNet to tackle segmentation of complex street scenes, in Table 6 we report the accuracy and speed of ZMNet on the Cityscapes validation and test sets. We propose ZMNet-S and ZMNet, which use STDC1 [4] and STDC2 [4] as backbones, respectively. ZMNet-S has a faster inference speed, while ZMNet has higher segmentation accuracy. To further demonstrate the effectiveness of our decoder, we also propose ZMNet-R18, which uses ResNet-18 [41] as the backbone. The backbones used in our experiments are pre-trained on the ImageNet [42] dataset. To make a fair comparison, at the test phase, we train on both the training and validation sets and evaluate on the test set. In the end, we submit our test set predictions to the Cityscapes online evaluation server for detailed accuracy results. As shown in Table 6, ZMNet-R18 achieves 75% and 75.2% mIoU at 79.7 FPS on the validation and test sets, respectively. Although its accuracy is slightly lower than SwiftNet [35], its inference speed is almost twice as fast. ZMNet-S, at 130.4 FPS, achieves 75% and 74.9% mIoU on the validation and test sets, respectively, showing competitive performance compared to most methods. Moreover, our ZMNet achieves 77.6% mIoU on the validation set and 77.4% mIoU on the test set at 97.5 FPS. Compared to SENet [26], ZMNet not only has higher segmentation accuracy, but its inference speed is also three times that of SENet. Although ZMNet is slightly inferior to BFMNet2-M [10] in segmentation accuracy, its inference speed is much faster than that of BFMNet2-M.

Results on CamVid. We also present the segmentation accuracy and inference speed of our proposed methods on the CamVid dataset. As shown in Table 7, ZMNet-R18 achieves 73.8% mIoU at 133 FPS, a more competitive speed–accuracy trade-off than other methods that use the same ResNet-18 backbone. ZMNet-S achieves 72.6% mIoU on the CamVid test set at a speed of 213.7 FPS, which is 8.1% faster than STDC1-Seg [4] while sacrificing a small amount of accuracy. ZMNet achieves a remarkable trade-off between accuracy and speed among all methods, reaching 74% mIoU at a speed of 156.6 FPS. Compared to state-of-the-art methods such as STDC [4] and BFMNet [10], our method demonstrates competitiveness in either segmentation accuracy or inference speed. The experimental results demonstrate that our method achieves a good balance between inference speed and segmentation accuracy.

Table 7 Comparison with other state-of-the-art real-time methods on the CamVid dataset

5 Conclusions

In this paper, we propose ZMNet, a novel network that balances segmentation accuracy and inference speed for real-time semantic segmentation. First, we design a lightweight feature fusion module, MLA-FFM, to effectively fuse features from adjacent levels. Second, we propose a semantic boundary supervision module, BSM, to further improve the accuracy of semantic boundaries. Experiments show that our proposed ZMNet achieves a state-of-the-art balance between segmentation accuracy and inference speed. Currently, our method has only been tested for real-time street scene segmentation in ideal conditions. In the future, we plan to extend our approach to real-time scene segmentation in non-ideal conditions such as rainy or foggy weather, aiming to meet the demands of a wider range of practical application scenarios.