
1 Introduction

In computer vision, neural networks are mainly applied to image recognition, object detection, and semantic segmentation. Semantic segmentation classifies every pixel in an image to determine its category (such as background, person, or car). Compared with image recognition and object localization and detection, semantic segmentation provides not only the classification of objects but also their location, which lays the foundation for other computer vision tasks [1, 2]. For example, in autonomous driving, the system can automatically and quickly parse images to avoid obstacles. In medical image analysis, a semantic segmentation system automatically generates a simple disease report to help doctors diagnose. In precision agriculture, a machine segments crops and weeds in the image so that it can weed autonomously and reduce the amount of herbicide sprayed, greatly improving agricultural efficiency [3,4,5].

Traditional semantic segmentation methods generally partition an image by extracting shallow features such as grayscale, texture, shape, and color, grouping pixels that share the same semantic category into the same region. Typical traditional methods are threshold segmentation, edge detection, and region-based segmentation [6]. When the background is complex and contains multiple objects, these methods perform poorly and the segmentation results are rough.
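As a concrete illustration of the threshold-based branch of these traditional methods, the following minimal Python/OpenCV sketch applies Otsu's global threshold to a grayscale image; the file names are placeholders, and this is only an illustrative example, not a method used in this paper.

```python
import cv2

# Read an image in grayscale and smooth it slightly to suppress noise.
image = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input file
blurred = cv2.GaussianBlur(image, (5, 5), 0)

# Otsu's method picks a global threshold from the grayscale histogram alone,
# which is why such methods break down on complex, multi-object backgrounds.
_, mask = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("mask.png", mask)
```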

With the improvement of computer performance and the rapid development of deep learning in computer vision, many image semantic segmentation methods based on deep learning have been proposed [7, 8]. The fully convolutional network (FCN) [9] replaces the fixed-length classification vector with a two-dimensional spatial feature map: it adds an upsampling structure and then predicts each pixel. However, pixel information is lost during upsampling, resulting in rough segmentation results. To address this problem, SegNet [10] restores image details by retaining the max-pooling indices during decoding. In U-Net [11], the low-resolution features from the encoder and the high-resolution features from the decoder are fused during upsampling of the feature map to recover fine object features and refine object edges. However, these networks pay little attention to spatial context, so complex semantic information in an image easily confuses the target. To address this, DeepLabv3 [12] uses atrous convolutions with different dilation rates, and PSPNet [13] aggregates multiscale contextual information with multiscale pooled features. However, atrous convolution may lose some pixel position information, and the pooling in PSPNet may also cause information loss, making segmentation inaccurate. DeepLabV3+ [12], proposed in 2018, therefore improved DeepLabV3 by introducing low-level features in the decoding stage. Nevertheless, how shallow information is used during feature fusion is very important: insufficient use of shallow information limits the segmentation quality, while excessive use of shallow features may lead to information redundancy. Semantic segmentation algorithms that treat all pixels equally clearly differ from the human visual mechanism. To enhance the influence of regions of interest in images and reduce information redundancy, researchers mainly use attention mechanisms to solve such problems [14,15,16,17].

In this paper, a network model based on the attention mechanism, AM-PSPNet, is proposed. In this model, PSPNet is the backbone network, and the efficient channel attention (ECA) module is added in the encoding stage, which effectively learns the channel attention of each convolution block, suppresses noise, weights the feature channels, and improves the feature extraction capability of the network. In the decoding stage, the deep guidance fusion (DGF) module uses deep features to guide the expression of shallow features, strengthens the learning of important features, and restores shallow image edge and texture information to achieve better pixel localization and finer details.

2 AM-PSPNet

Based on PSPNet, AM-PSPNet is proposed in this paper. The ECA module and the DGF module are added to the model in the encoding and decoding stages, respectively, which improves the feature extraction ability of the network and refines the classification results. The structure of the entire network is shown in Fig. 1.

Fig. 1. AM-PSPNet framework.

AM-PSPNet is composed of three subnetworks: feature extraction based on residual modules, multiscale feature extraction, and upsampling pixel recovery. The feature extraction subnetwork uses ResNet50 as the basic feature extractor; the network has five convolution modules with different structures. To avoid damaging the original ResNet structure, this paper adds ECA attention after the third, fourth and fifth convolution modules of ResNet so that the network can extract discriminative features in the channel dimension. The multiscale feature extraction subnetwork uses a pyramid pooling module (PPM) to aggregate multiscale context and obtain global context information. The DGF module is used in the pixel recovery (upsampling) subnetwork to guide the classification of shallow features more accurately through global contextual information.
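The following PyTorch-style sketch shows how the three subnetworks could be wired together. The class and argument names (`AMPSPNetSketch`, `eca_cls`, `ppm_module`, `dgf_module`, `head_channels`) and the exact stage boundaries are assumptions made for illustration; this is not the authors' released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class AMPSPNetSketch(nn.Module):
    """Illustrative wiring of feature extraction, PPM and DGF (hypothetical names)."""
    def __init__(self, num_classes, eca_cls, ppm_module, dgf_module, head_channels=512):
        super().__init__()
        backbone = resnet50(weights=None)
        # Five convolution stages of ResNet50: stem + layer1..layer4.
        self.stage1 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stage2, self.stage3 = backbone.layer1, backbone.layer2
        self.stage4, self.stage5 = backbone.layer3, backbone.layer4
        # ECA attention after the third, fourth and fifth convolution modules.
        self.eca3, self.eca4, self.eca5 = eca_cls(512), eca_cls(1024), eca_cls(2048)
        self.ppm = ppm_module   # pyramid pooling over the deepest features
        self.dgf = dgf_module   # deep features guide the shallow (stage-3) features
        self.classifier = nn.Conv2d(head_channels, num_classes, kernel_size=1)

    def forward(self, x):
        size = x.shape[2:]
        x = self.stage2(self.stage1(x))
        shallow = self.eca3(self.stage3(x))   # shallow features reused by the DGF module
        x = self.eca4(self.stage4(shallow))
        deep = self.ppm(self.eca5(self.stage5(x)))
        fused = self.dgf(deep, shallow)       # deep-guided fusion before prediction
        out = self.classifier(fused)
        return F.interpolate(out, size=size, mode='bilinear', align_corners=False)
```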

2.1 Efficient Channel Attention Module

Adding an attention module to an existing convolutional neural network can improve its performance [18]. Most existing methods pursue more complex attention modules for better performance, but this increases the computational burden of the network. To balance network performance and complexity, the ECA module is introduced in this paper. The ECA module [19] adds very little complexity while improving network performance.

The ECA module is an improvement of the squeeze-and-excitation (SE) module [20]. The SE module learns the channel attention of each convolutional block, which brings significant performance improvements to deep convolutional neural network architectures. The SE module uses dimension reduction to control the complexity of the network, but dimension reduction has a negative effect on predicting channel attention, and it is not necessary to capture dependencies between all channels [21]. The ECA module instead captures local cross-channel interactions efficiently. As shown in Fig. 2, the ECA module applies global average pooling over each channel without dimension reduction. It then captures local cross-channel interactions through a fast one-dimensional convolution of kernel size \(k\), where \(k\) is the coverage of the cross-channel interaction. A sigmoid function generates the weight of each channel, and the channel attention feature is obtained by multiplying the input by the channel weights. The kernel size \(k\) is determined adaptively from the number of input channels \(C\), and its calculation can be expressed as

$$k = \Phi (C) = \left| \frac{\log_{2} C}{\gamma} + \frac{b}{\gamma} \right|_{odd}$$
(1)
$$C = \varphi (k) = 2^{(\gamma \cdot k - b)}$$
(2)

In the equations, \(\left| t \right|_{odd}\) denotes the odd number closest to \(t\), the constant \(\gamma\) is set to 2, and the constant \(b\) is set to 1.
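A minimal PyTorch sketch of such an ECA block is given below, using the adaptive kernel size of Eq. (1) with \(\gamma = 2\) and \(b = 1\). The class name and interface are illustrative rather than the authors' code.

```python
import math
import torch.nn as nn

class ECAModule(nn.Module):
    """Efficient channel attention: global average pooling, 1D conv, sigmoid gating."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size from Eq. (1): nearest odd number to log2(C)/gamma + b/gamma.
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, C, H, W) -> per-channel descriptor of shape (N, C, 1, 1).
        y = self.avg_pool(x)
        # 1D convolution across the channel dimension captures local cross-channel interaction.
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        # Sigmoid weights in (0, 1) rescale each channel of the input.
        return x * self.sigmoid(y)
```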

Fig. 2. ECA module.

2.2 Deep Guidance Fusion Module

Usually, multiscale context information is extracted by the PPM. However, multistage spatial pooling loses much fine-grained information. As a result, features in the deep layers of the network have strong semantic expressiveness but poor pixel accuracy, whereas shallow features contain more pixel information. Directly superimposing deep and shallow features easily produces considerable noise and reduces the segmentation accuracy of the model.

This paper proposes the deep guidance fusion module. As shown in Fig. 3, the DGF module is embedded behind the PPM and performs global average pooling on the deep features to produce an attention map. The shallow features are first passed through a 3 × 3 convolution to reduce the number of feature channels from the CNN. Then, the shallow features are multiplied by the global attention map to screen out effective information. Finally, the output is added element by element to the deep features and upsampled to produce the final prediction. To reconcile the contradiction between improving performance and reducing complexity, the output of the third stage of feature extraction is selected as the shallow feature after extensive experiments.
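A minimal PyTorch sketch of this fusion step is shown below. It assumes the shallow features are mapped by a 3 × 3 convolution to the same channel count as the deep features and that the gating uses a sigmoid after global average pooling; these details, like the class name, are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DGFModule(nn.Module):
    """Deep guidance fusion: deep features gate shallow features via global attention."""
    def __init__(self, shallow_channels, deep_channels):
        super().__init__()
        # 3x3 convolution aligns the shallow feature channels with the deep ones.
        self.reduce = nn.Sequential(
            nn.Conv2d(shallow_channels, deep_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(deep_channels),
            nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, deep, shallow):
        shallow = self.reduce(shallow)
        # Global average pooling of the deep features produces a channel attention map.
        attn = torch.sigmoid(self.gap(deep))
        # The attention map screens effective information from the shallow features.
        shallow = shallow * attn
        # Bring the deep features to the shallow resolution and fuse element-wise.
        deep = F.interpolate(deep, size=shallow.shape[2:], mode='bilinear', align_corners=False)
        return deep + shallow
```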

Fig. 3. DGF module.

3 Experiments and Analysis

3.1 Experimental Design

The PASCAL VOC 2012 dataset and the Cityscapes dataset are used to evaluate the performance of AM-PSPNet. First, an ablation study is carried out on the PASCAL VOC 2012 dataset, and then comparative experiments are run on both datasets to evaluate the performance of the network.

To better reflect the performance of the model, two metrics are used: pixel accuracy (PA), the ratio of correctly segmented pixels to total pixels, and mean intersection over union (mIoU), the ratio of the intersection to the union of the ground truth and the prediction. The calculation equations are as follows:

$$PA = \frac{\sum\nolimits_{i = 1}^{N} n_{ii}}{\sum\nolimits_{i = 1}^{N} \sum\nolimits_{j = 1}^{N} n_{ij}}$$
(3)
$$mIoU = \frac{1}{N} \sum\nolimits_{i = 1}^{N} \frac{n_{ii}}{\sum\nolimits_{j = 1}^{N} n_{ij} + \sum\nolimits_{j = 1}^{N} n_{ji} - n_{ii}}$$
(4)

where \(N\) is the number of category labels, \(n_{ij}\) is the number of pixels whose true category is \(i\) but are predicted as category \(j\), \(n_{ji}\) is defined analogously, and \(n_{ii}\) is the number of correctly predicted pixels of category \(i\).
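A minimal NumPy sketch of both metrics, computed from a confusion matrix built according to the definitions above (function names are illustrative):

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Entry [i, j] counts pixels whose true class is i and predicted class is j."""
    valid = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    # Eq. (3): correctly classified pixels over all pixels.
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm):
    # Eq. (4): per-class IoU = n_ii / (row sum + column sum - n_ii), averaged over classes.
    intersection = np.diag(cm)
    union = cm.sum(axis=1) + cm.sum(axis=0) - intersection
    return np.mean(intersection / np.maximum(union, 1))
```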

3.2 Ablation Study

To test the importance and performance of each part of the model, an ablation study is designed. To simplify the ablation study, all methods are run on the PASCAL VOC 2012 dataset in the same experimental environment to compare performance under different configurations. ResNet-50 is used as the feature extraction network in this paper. The crop size of the input data is set to 380 × 380, and the batch size is set to 8. The performance of each module is compared in a fair way, and the corresponding results are shown in Table 1.

Table 1. Ablation study on the PASCAL VOC 2012 dataset

In this paper, feature extraction is divided into five stages. Stages 2, 3 and 4 in Table 1 indicate whether to add the ECA attention module in the second, third and fourth stages of network feature extraction, respectively, and “ + DGF” indicates that the DGF module is added in the network decoding stage. Table 1 shows that the addition of the DGF module is beneficial to the improvement of network performance. When the ECA attention module is added to the second, third and fourth stages of network feature extraction, the network performance is the best.

3.3 Performance Evaluation on PASCAL VOC 2012

The validity of AM-PSPNet is verified on PASCAL VOC 2012, a public benchmark dataset commonly used for semantic segmentation. It contains 1464, 1456 and 1449 images for training, testing and validation, respectively. The objects fall into four groups (person, animal, vehicle and indoor objects) with 20 object categories in total, giving 21 semantic categories when the background category is included.

To accurately measure the performance of the model, AM-PSPNet, FCN-8s, U-Net, PSPNet and DeepLabV3 are evaluated on the PASCAL VOC 2012 dataset. The prediction results are shown in Table 2 and Table 3.

Table 2. Semantic segmentation results on the PASCAL VOC 2012 dataset
Table 3. Each category results on the PASCAL VOC 2012 testing set.

As seen from Tables 2 and 3, PSPNet achieves good prediction results compared with the other semantic segmentation models, but AM-PSPNet achieves better ones: its PA is 94.6% and its mIoU is 78.8%, which are 0.4% and 1.4% higher, respectively, than those of PSPNet. AM-PSPNet obtains the highest accuracy in 15 of the categories. Compared with PSPNet, the segmentation results of 19 categories are improved, and the improvement is particularly significant for categories with hard-to-distinguish boundaries; for example, the results for the horse and sheep categories improve by 3.8% and 3.1%, respectively. The ECA module enhances the network's ability to discriminate feature classes, and the DGF module helps restore edge details of the image. The experiments verify the effectiveness of these two modules in AM-PSPNet.

Fig. 4. Comparison of prediction results.

To display the segmentation effect of each model more intuitively, the visualization results are compared in Fig. 4. In the first row, PSPNet segments the cow's horns roughly, while AM-PSPNet better preserves the segmentation details, making the prediction more accurate and cleaner. In the second row, PSPNet fails to predict the distant person completely, and several of the models produce incomplete segmentations with a serious loss of detail, whereas AM-PSPNet accurately captures the details of the image. In the third row, AM-PSPNet predicts the cow's legs more delicately. Compared with FCN-8s, U-Net, DeepLabV3 and PSPNet, the overall contour predicted by AM-PSPNet is smooth and fine, and the prediction is closer to the ground truth.

3.4 Performance Evaluation on Cityscapes

AM-PSPNet is also evaluated on the Cityscapes dataset, which contains 5000 images of urban driving scenes recorded in 50 different cities and annotated with 19 categories.

ResNet-50 is used as the backbone feature extraction network for training. Limited by GPU memory, 380 × 380 is selected as the input crop size, and the batch size is set to 6. The prediction results are shown in Table 4. The AM-PSPNet proposed in this paper is superior to the other networks, with the mIoU reaching 69.1% and the PA reaching 95.2%, improvements of 1.6% and 1.1%, respectively, over the original network.

Table 4. Semantic segmentation results on the Cityscapes dataset

4 Conclusions

This paper proposes AM-PSPNet, which uses PSPNet as the backbone network. The DGF module is proposed to guide shallow feature expression with deep features and achieve better pixel localization. The ECA module is added in the feature extraction stage to improve the performance of the convolutional neural network by learning the channel attention of each convolutional block. Experiments on the PASCAL VOC 2012 and Cityscapes datasets show that AM-PSPNet performs well compared with FCN-8s, U-Net, PSPNet and DeepLabV3.