1 Introduction

As one of the most fundamental and critical tasks in analyzing humans in the wild, human parsing, or human semantic segmentation, has become a key enabling technology in a large number of application domains such as video surveillance [1], human behavior analysis [2], human part segmentation [3], medical image segmentation [4], and so on.

Semantic segmentation has recently witnessed great progress driven by the advancement of Convolutional Neural Networks (CNNs) [5], especially Fully Convolutional Networks (FCNs) [6]. Thanks to deeply learned features [7] and large-scale annotations [8], U-Net [9], a CNN-based network, has become the state-of-the-art technology for human image segmentation.

Despite their excellent representational capabilities, CNN-based approaches are generally limited by their inability to model explicit long-range relationships, owing to the inherent locality of convolution operations [4]. As a result, these architectures often perform poorly for target structures that exhibit large differences in texture, shape, and size. To overcome this limitation, we employ the Res2Net [10] module together with the Transformer [11] in this work. The Res2Net module builds hierarchical residual-like connections within a single residual block and fuses the features extracted by each connection, improving the multi-scale representation ability of CNNs over a larger range of receptive fields. As stated in [10], it essentially introduces a new dimension, namely scale (the number of feature groups in the Res2Net block), which is an important and effective factor in addition to the depth, width and cardinality dimensions.

By dispensing with convolution operators entirely and relying solely on attention mechanisms, Transformers have emerged as alternative architectures for sequence-to-sequence prediction [11]. Unlike CNN-based approaches, Transformers are not only powerful at modeling global context, but also exhibit excellent transfer capabilities for downstream tasks when pre-trained at large scale. This success has been widely witnessed in machine translation and natural language processing [11, 12].

However, using the Transformer alone in the encoder leads to a loss of feature resolution [4]. Although the CNN-Transformer combination is a powerful tool for dealing with this feature resolution loss, it does not fully exploit the features of the input image. To tackle this problem, we use a Res2Net-Transformer encoder in this work, which not only eliminates the feature resolution loss but also enlarges the range of receptive fields of the network layers.

To further improve the segmentation performance, we add Coordinate Attention [13] to the decoder process of the network. Coordinate Attention decomposes channel attention into two one-dimensional feature encoding processes that aggregate features along the two spatial directions, respectively. This yields a more aggregated feature map from the encoder. After being upsampled to recover the local spatial information, the aggregated feature maps are combined with the high-resolution Res2Net features from the encoding path at different resolutions to achieve precise localization.

The main contribution of this paper is a semantic segmentation neural network, TRCA-Net, built with TransUNet as the backbone together with Res2Net and Coordinate Attention. Compared with state-of-the-art (SOTA) semantic segmentation networks, namely the original U-Net, DeepLabv3+, and TransUNet, our proposed TRCA-Net has the following advantages.

  • Res2Net module is introduced in the feature extraction process to further improve the feature extraction ability.

  • We use the Res2Net-Transformer for encoding, which not only eliminates the feature resolution loss but also enlarges the range of receptive fields of the network layers.

  • We add coordinate attention to the decoder process in the network to further improve the segmentation performance.

The remainder of this paper is organized as follows: Sect. 2 provides related work; in Sect. 3, the proposed method and the baseline network are presented; Sect. 4 gives the acquisition of input data, the experimental setup and the results; and Sect. 5 presents the conclusions of this study.

2 Related work

2.1 Res2Net

Res2Net was first proposed in [10] as a simple yet effective module to explore the multi-scale capabilities of CNNs over a larger range. It can be conveniently combined with existing state-of-the-art methods. Deng-Ping Fan et al. [14] proposed a new network for lung infection image segmentation. They used Res2Net as the backbone network for CT images to extract two sets of low-level features and three sets of high-level features. Using the powerful multi-scale capability of Res2Net, good CT image segmentation performance was achieved. In [15], a new residual multi-scale module with an attention mechanism drew on the multi-scale capability of Res2Net for single-image super-resolution applications. Inspired by Res2Net, Yan Li et al. [16] developed the multiple temporal aggregation module, which divided spatiotemporal information and associated local convolution layers into a number of subsets.

2.2 Transformer

The Transformer was first proposed in [11] for machine translation applications. Without relying on CNNs, Alexey Dosovitskiy et al. [17] presented the Vision Transformer, which applies a standard Transformer directly to sequences of image patches and performs image classification tasks very well in the computer vision area. An improved Transformer architecture named Longformer was proposed in [18] to address the difficulty that the computational requirements of self-attention increase quadratically with sequence length. By taking advantage of self-attention, Niki Parmar et al. [19] generalized the Transformer of [11] to a sequence modeling formulation of image generation called the Image Transformer. While maintaining much larger receptive fields per layer than traditional CNNs, the Image Transformer significantly increases the size of images the model can process.

2.3 Attention mechanisms

Initially created for machine translation [20], the attention mechanism has gradually grown in significance in the field of neural networks. Attention mechanisms can be explained by an analogy to the human visual system, which tends to concentrate on the specific details in an image that support judgment and to dismiss irrelevant details. By inserting attention modules into CNN architectures, the performance of large-scale image classification tasks has been improved substantially [21,22,23]. An effective module called the Bottleneck Attention Module (BAM), which is placed at each bottleneck of a model, was proposed in [21]. By modeling inter-channel relationships, an attention module called Squeeze-and-Excitation (SE) was presented in [22]. In contrast to SE, Woo et al. [23] proposed the Convolutional Block Attention Module (CBAM), where both spatial and channel-wise attention are exploited. Despite these improvements, only a few studies have used attention mechanisms for image segmentation tasks.

3 Methods

The whole structure of our proposed TRCA-Net based on Res2Net [10], Transformers [11], and Coordinate Attention is shown in Fig. 1.

Fig. 1 Network architecture of TRCA-Net

As shown in Fig. 1, a convolutional layer reduces the channel dimension of the reshaped features to the number of target classes in the last network layer, and the feature maps are then directly bilinearly upsampled to full resolution to predict the final segmentation result.
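To make this final prediction step concrete, a minimal PyTorch sketch is given below. It only illustrates the 1×1 convolution followed by bilinear upsampling described above; the channel number and class count are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the final prediction step: a 1x1 convolution maps the reshaped
# features to the number of target classes, then bilinear upsampling restores
# full resolution. Channel and class counts are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentationHead(nn.Module):
    def __init__(self, in_channels: int = 64, num_classes: int = 20):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feat: torch.Tensor, out_size: tuple) -> torch.Tensor:
        logits = self.classifier(feat)            # (B, num_classes, h, w)
        # Bilinearly upsample to the full input resolution.
        return F.interpolate(logits, size=out_size, mode="bilinear",
                             align_corners=False)


# Example: a 128x128 feature map upsampled to a 512x512 prediction.
head = SegmentationHead()
pred = head(torch.randn(1, 64, 128, 128), out_size=(512, 512))
```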

We can see that the decoder together with the hybrid encoder forms a U-shaped architecture that allows feature aggregation at different resolution levels via skip connections. The detailed architecture of the upsampling and intermediate skip-connection processes is shown with long dashed arrows in Fig. 1.

Though combining a CNN-Transformer encoder [4] with cascaded upsampling has achieved substantial performance, this strategy may not be the optimal choice for segmentation networks because the limited range of receptive fields in the encoder leads to a loss of low-level details. To gain a larger range of receptive fields, our TRCA-Net uses a hybrid Res2Net50-Transformer architecture as the encoder. Moreover, the decoder consists of upsampling and Coordinate Attention blocks, where the upsampling expands the feature maps to the size of the input images and the Coordinate Attention blocks focus on the regions of interest in the feature maps.

The details of Res2Net, Transformer and Coordinate Attention are described below.

3.1 Res2Net

We first describe the encoder of TRCA-Net, i.e., the downsampling path. To combine the strengths of the Res2Net block and the Transformer, we use them together to form the downsampling process.

Fig. 2 The Res2Net block

The details of the Res2Net [10] block are shown in Fig. 2. After the input feature map is processed by a \(1\times 1\) convolution, the resulting feature maps are split evenly into s subsets. Except for \(X_1\), each \(X_i\) has a corresponding \(3\times 3\) convolution, denoted by \(K_i\), whose output we denote by \(y_i\). The feature subset \(X_i\) is added to the output of \(K_{i-1}\) and then fed into \(K_i\). To reduce the number of parameters while increasing the number of feature map subsets, the \(3\times 3\) convolution is omitted for \(X_1\). Each \(3\times 3\) convolution operator can potentially receive feature information from all preceding feature subsets, and each time a feature subset passes through a \(3\times 3\) convolution operator, the output can have a larger receptive field. Owing to this combinatorial effect, the output of the Res2Net module contains different numbers and different combinations of receptive field sizes/scales. In the Res2Net module, the subsets are processed in a multi-scale manner, which facilitates the extraction of both global and local information. All subsets are concatenated and passed through a single \(1\times 1\) convolution to better fuse the information at various scales. This split-and-concatenation strategy enables the convolutions to process features more effectively.
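A minimal sketch of this hierarchical split-and-fuse structure is shown below, assuming a fixed channel count and scale s = 4; the original Res2Net bottleneck additionally uses batch normalization and channel-changing 1×1 convolutions, which are omitted here for brevity.

```python
# Sketch of a Res2Net-style block: split into s subsets, apply 3x3 convs with
# hierarchical residual-like connections, then concatenate and fuse with 1x1.
import torch
import torch.nn as nn


class Res2NetBlock(nn.Module):
    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0, "channels must be divisible by scale"
        self.scale = scale
        width = channels // scale
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        # One 3x3 convolution per subset except the first (X_1 is passed through).
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(scale - 1)
        )
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.relu(self.conv1(x))
        xs = torch.chunk(out, self.scale, dim=1)   # split into X_1, ..., X_s
        ys = [xs[0]]                                # X_1 is not convolved
        y = self.convs[0](xs[1])                    # y_2 = K_2(X_2)
        ys.append(y)
        for i in range(2, self.scale):
            # y_i = K_i(X_i + y_{i-1}): hierarchical residual-like connection.
            y = self.convs[i - 1](xs[i] + y)
            ys.append(y)
        # Concatenate all subsets and fuse them with a 1x1 convolution.
        out = self.conv3(torch.cat(ys, dim=1))
        return self.relu(out + identity)
```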

3.2 Transformer

In the downsampling process, we perform tokenization by reshaping the output of the fortieth layer of the Res2Net50 [24] network into a sequence of flattened 2D patches. We map the vectorized patches into a latent D-dimensional embedding space using a trainable linear projection. To encode the patch spatial information, we learn specific position embeddings, which are added to the patch embeddings to retain positional information and yield the input of the Transformer [11].
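A minimal sketch of this tokenization step is shown below, assuming the CNN feature map is treated as a grid of 1×1 patches and that num_patches equals the spatial size of that feature map; the embedding dimension is an illustrative assumption.

```python
# Sketch of tokenization: project the CNN feature map to D dimensions,
# flatten it into a patch sequence, and add learned position embeddings.
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, in_channels: int, embed_dim: int, num_patches: int):
        super().__init__()
        # Trainable linear projection of each spatial location (patch) to D dims.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        # Learned position embeddings retain the patch spatial information.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) output of the CNN backbone; num_patches must be H*W.
        x = self.proj(feat)                  # (B, D, H, W)
        x = x.flatten(2).transpose(1, 2)     # (B, H*W, D) flattened patches
        return x + self.pos_embed            # z_0, the Transformer input
```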

The Transformer encoder consists of L layers of Multihead Self-Attention (MSA) [25] and Multilayer Perceptron (MLP) [26] blocks. The output of the \({\ell }\)-th (\({\ell =1,...,L}\), with \(L = 12\) in our network) layer can be written as follows:

$$\begin{aligned} \begin{array}{rcl} z^{\prime }_{\ell } = {\text{ MSA }}({\text {LN}}(z_{\ell -1}))+z_{\ell -1}, \end{array} \end{aligned}$$
(1)
$$\begin{aligned} \begin{array}{rcl} z_{\ell } = {\text{ MLP }}({\text {LN}}(z^{\prime }_{\ell }))+z^{\prime }_{\ell }, \end{array} \end{aligned}$$
(2)

where LN\((\cdot )\) denotes the layer normalization operator, \(z_{\ell }\) is the encoded image representation, and \({z_0}\) is obtained by multiplying the concatenated vectorized patches by the patch embedding projection and adding the position embedding.
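The following sketch implements Eqs. (1) and (2) as a standard pre-norm Transformer layer; the hidden sizes (768-dimensional embeddings, 12 heads, 3072-dimensional MLP) are common ViT-style assumptions rather than values stated in the paper.

```python
# Sketch of one Transformer encoder layer: Eq. (1) (MSA with residual) followed
# by Eq. (2) (MLP with residual), both applied after layer normalization.
import torch
import torch.nn as nn


class TransformerLayer(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, mlp_dim: int = 3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Eq. (1): z' = MSA(LN(z)) + z
        h = self.norm1(z)
        z = self.attn(h, h, h, need_weights=False)[0] + z
        # Eq. (2): z = MLP(LN(z')) + z'
        return self.mlp(self.norm2(z)) + z
```

Stacking twelve such layers over the tokenized patch sequence would correspond to the \(L = 12\) encoder described above.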

3.3 Coordinate attention

After reshaping the sequence of hidden features into a 3D feature map, we use a combined decoder consisting of three Coordinate Attention blocks and multiple upsampling steps to go from the 3D feature map size to the original image resolution, where each block is formed by a Coordinate Attention, two upsampling layers and one \(3\times 3\) convolutional layer. The Coordinate Attention structure is shown in Fig. 3.

Fig. 3 The Coordinate Attention. “X Avg Pool” and “Y Avg Pool”, respectively, denote 1D horizontal global pooling and 1D vertical global pooling

Specifically, given the input feature map \(M\in {\mathbb {R}}^{C \times H\times W}\), two spatial pooling kernels of size (H, 1) and (1, W) are used to encode each channel along the horizontal and vertical coordinates, respectively. These two transformations aggregate features along the two spatial directions and produce two direction-aware feature maps, which are concatenated. A shared \(1\times 1\) convolutional transformation \(F_{1}\) is then applied to the concatenated map, followed by a nonlinear activation function, yielding an intermediate feature map that encodes the spatial information in both the horizontal and vertical directions. The above process can be formulated as

$$\begin{aligned} \begin{array}{rcl} f= \delta \left( F_{1}\left( \left[ \displaystyle {\frac{1}{W}}\sum \limits _{0\le {i}<W}m_c(h,i) , \displaystyle {\frac{1}{H}}\sum \limits _{0\le {j}<H}m_c(j,w)\right] \right) \right) , \end{array} \end{aligned}$$
(3)

where \([\cdot ,\cdot ]\) is the concatenation operation along the spatial dimension, \(\delta\) is a nonlinear activation function, \(f\in {\mathbb {R}}^{C/r \times (H + W)}\) is the intermediate feature map, and r is the reduction rate used to control the block size. To reduce the model overhead, an appropriate reduction ratio r is used in [13] to reduce the number of channels of the intermediate feature maps.

We then split f along the spatial dimension into two separate tensors \(f^{h}\in {\mathbb {R}}^{C/r \times H}\) and \(f^{w}\in {\mathbb {R}}^{C/r \times W}\). Another two \(1\times 1\) convolutional transformations, \(F_{h}\) and \(F_{w}\), are used to transform \(f^{h}\) and \(f^{w}\) into tensors with the same number of channels as the input M, and the results are passed through a sigmoid function. The two outputs are then expanded and used as attention weights. Finally, the output Y of our Coordinate Attention block can be written as

$$\begin{aligned} \begin{array}{rcl} y_{c}(i,j)=m_{c}(i,j)\times \sigma (F_{h}(f^{h}))\times \sigma (F_{w}(f^{w})), \end{array} \end{aligned}$$
(4)

where \(\sigma\) is the sigmoid function. This completes the Coordinate Attention operation. As described above, attention along both the horizontal and vertical directions is applied simultaneously to the input tensor. Each element in the two attention maps reflects whether the object of interest exists in the corresponding row or column. This encoding process allows Coordinate Attention to locate the exact position of the object of interest more accurately and hence helps the whole model to segment better.
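The sketch below follows Eqs. (3) and (4): 1D pooling along each spatial direction, a shared 1×1 transformation \(F_1\), a split into \(f^h\) and \(f^w\), and two sigmoid-gated attention maps. A ReLU stands in for the nonlinear activation \(\delta\) and the reduction rate r is illustrative; the original work additionally uses batch normalization and a different activation.

```python
# Sketch of a Coordinate Attention block following Eqs. (3) and (4).
import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, r: int = 32):
        super().__init__()
        mid = max(8, channels // r)
        self.f1 = nn.Conv2d(channels, mid, kernel_size=1)   # shared 1x1 conv F_1
        self.act = nn.ReLU(inplace=True)                    # stand-in for delta
        self.fh = nn.Conv2d(mid, channels, kernel_size=1)   # F_h
        self.fw = nn.Conv2d(mid, channels, kernel_size=1)   # F_w

    def forward(self, m: torch.Tensor) -> torch.Tensor:
        b, c, h, w = m.shape
        # 1D horizontal and vertical global pooling (inside the brackets of Eq. (3)).
        mh = m.mean(dim=3, keepdim=True)                    # (B, C, H, 1)
        mw = m.mean(dim=2, keepdim=True).transpose(2, 3)    # (B, C, W, 1)
        # Concatenate along the spatial dimension, then apply F_1 and activation.
        f = self.act(self.f1(torch.cat([mh, mw], dim=2)))   # (B, C/r, H+W, 1)
        fh, fw = torch.split(f, [h, w], dim=2)
        # Eq. (4): element-wise product with the two sigmoid attention maps.
        ah = torch.sigmoid(self.fh(fh))                     # (B, C, H, 1)
        aw = torch.sigmoid(self.fw(fw.transpose(2, 3)))     # (B, C, 1, W)
        return m * ah * aw
```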

In conclusion, the main mechanism that makes our network more suitable for human image segmentation tasks is that it combines the advantages of Res2Net and Coordinate Attention, as stated below.

  1. Due to the intricate outdoor environment, the structure and texture of human body targets are complex and their sizes and shapes differ greatly, so the network needs a stronger feature extraction ability and an accurate localization ability for the extracted features. The network structure proposed in this paper uses Res2Net to extract richer feature information.

  2. We add Coordinate Attention in the decoder process. In this framework, it allows the decoder to aggregate the encoder features more effectively, locate the information of interest and suppress the useless information, so that the features can be accurately localized.

4 Experiments and discussion

In this paper, we use the CIHP dataset [3] to verify the performance of the proposed TRCA-Net. In this dataset, all images are collected from real-world human activity scenes with 19 semantic labels, and 28,280 images are used as the training set.

For fair comparisons, we use the same training batch size and learning-rate decay schedule for all models. We use nearest-neighbor interpolation to resize the original images to \(512\times 512\). The batch size is four images and the learning rate decreases according to the following formula,

$$\begin{aligned} \begin{array}{rcl} L = \displaystyle {{I_{\text {lr}}}\left( 1-\frac{\text {iter}}{\text {max}\_\text {{iter}}}\right) ^{\text{ power }}}, \end{array} \end{aligned}$$
(5)

where the initial learning rate \(I_{\text {lr}}\) is 0.001, the current training iteration iter ranges from 0 to 212100, the maximum number of training iterations max\(\_\)iter is 212100, and the power is 0.9. We train TRCA-Net with the same settings for 30 epochs. There are 28,280 training images in the CIHP dataset. We believe that the number of samples is large enough, so random data augmentation is not applied in this experiment. Our methods are implemented using the PyTorch [27] framework. All networks are trained on a single NVIDIA GeForce RTX 3090 GPU.
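A minimal sketch of this polynomial ("poly") decay schedule of Eq. (5) is shown below, using the values stated above; the model and optimizer are placeholders, not the actual TRCA-Net training setup.

```python
# Sketch of the poly learning-rate decay of Eq. (5):
# L = I_lr * (1 - iter / max_iter) ** power.
import torch

initial_lr, max_iter, power = 1e-3, 212100, 0.9
model = torch.nn.Conv2d(3, 20, kernel_size=1)          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=initial_lr)

# LambdaLR multiplies the initial lr by the returned factor for each iteration,
# which reproduces Eq. (5).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / max_iter) ** power
)

# Call scheduler.step() once per training iteration, e.g.:
#   for it in range(max_iter):
#       ...train one batch..., optimizer.step()
#       scheduler.step()
```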

4.1 Evaluation metrics

We evaluate all semantic segmentation models using the Mean Pixel Accuracy (MPA) and Mean Intersection over Union (MIoU) criteria, which are statistical measures of the similarity and diversity of sample sets. The Intersection over Union (IoU) measures the similarity of finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets. It is useful when the numbers of pixels of the classes in an image are unbalanced, because it gives the same weight to all classes. MIoU is obtained by averaging, over all categories, the ratio of the intersection to the union of the predicted and true pixel sets. PA is the ratio of correctly classified pixels to the total number of pixels. MPA is an extension of PA that computes the percentage of correctly classified pixels for each category and averages these values over all classes. MPA and MIoU are calculated as evaluation metrics in Eqs. (6) and (7). In (6) and (7), k refers to the number of classes, and \(k+1\) is the total number of classes including the background. \(p_{ij}\) is the number of pixels of class i inferred to belong to class j. In other words, \(p_{ii}\) represents the number of true positives, while \(p_{ij}\) and \(p_{ji}\) are usually interpreted as false positives and false negatives, respectively.

$$\begin{aligned} \begin{array}{rcl} {\text {MPA}}=\displaystyle {\frac{1}{k+1}}\sum _{i=0}^{k}\frac{p_{ii}}{\sum _{j=0}^{k}p_{ij}}, \end{array} \end{aligned}$$
(6)
$$\begin{aligned} \begin{array}{rcl} {\text {MIoU}}=\displaystyle {\frac{1}{k+1}}\sum _{i=0}^{k}\frac{p_{ii}}{\sum _{j=0}^{k}p_{ij}+\sum _{j=0}^{k}p_{ji}-p_{ii}}. \end{array} \end{aligned}$$
(7)
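To make Eqs. (6) and (7) concrete, the short sketch below computes MPA and MIoU from a confusion matrix whose entry [i, j] counts pixels of class i predicted as class j; the 3-class confusion matrix in the example is purely illustrative.

```python
# Sketch of MPA (Eq. (6)) and MIoU (Eq. (7)) computed from a confusion matrix.
import numpy as np


def mpa_miou(conf: np.ndarray, eps: float = 1e-10):
    tp = np.diag(conf).astype(float)                   # p_ii per class
    per_class_acc = tp / (conf.sum(axis=1) + eps)      # p_ii / sum_j p_ij
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp + eps)
    return per_class_acc.mean(), iou.mean()            # MPA, MIoU


# Example with k + 1 = 3 classes (including background).
conf = np.array([[50, 2, 3],
                 [4, 40, 1],
                 [2, 5, 30]])
mpa, miou = mpa_miou(conf)
```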

Adopting the above evaluation metrics, we evaluate our proposed approach on the CIHP dataset. The three baseline methods are the original U-Net [9], DeepLabv3+ [28] and TransUNet [4]. Table 1 summarizes the segmentation performance of each method. It can be seen that the proposed TRCA-Net outperforms the baseline methods in the first three metrics, namely MPA, MIoU, and Recall.

Table 1 Comparison results of original U-Net, DeepLabv3+, TransUNet and proposed TRCA-Net
Table 2 Comparison results of original U-Net, DeepLabv3+, TransUNet and TRCA-Net in ten images
Fig. 4 Segmentation comparisons of TRCA-Net and other methods on ten images: a original image, b ground truth, c TRCA-Net (ours), d TransUNet [4], e DeepLabv3+ [28], and f U-Net [9]

As can be seen in Table 1, TRCA-Net outperforms the other methods. TRCA-Net gains 4.54, 1.57, and 2.89 points in MPA over the original U-Net, DeepLabv3+, and TransUNet, respectively. We also compare the performance of TRCA-Net, the original U-Net, DeepLabv3+, and TransUNet in terms of MIoU; from the table we can see that TRCA-Net has the highest MIoU score, which is significantly better than the other methods. In terms of recall, the original U-Net, DeepLabv3+, TransUNet, and TRCA-Net achieve 83.63, 94.19, 86.73, and 97.06, respectively. Finally, we compare the above networks in terms of floating-point operations (FLOPs) and the number of parameters. Due to the introduction of the attention mechanism and Res2Net, both the computation and the number of parameters of the network increase. Compared with TransUNet, which has the closest performance to TRCA-Net, the parameters increase by 8.785M and the FLOPs increase by 18.082G. Although the amount of computation is increased, it is acceptable because the performance is greatly improved.

Table 2 shows the results of each metric for the four network models on each picture. It can be seen from Table 2 that, for the three performance metrics of each picture, our model outperforms the other models. In addition to the quantitative results, we also provide qualitative comparisons of the images segmented using our method and the original U-Net, DeepLabv3+, and TransUNet, as shown in Fig. 4.

Figure 4 shows the ground truth images and the segmentation results of all methods. From Fig. 4, the original U-Net gives the worst performance, especially in the image of the eighth row in the presence of ambient noise. It can be seen from the images in the fourth and fifth rows that DeepLabv3+ and TransUNet segment human images better than the original U-Net. However, the segmented images of DeepLabv3+ and TransUNet are not as accurate as those produced by TRCA-Net. TRCA-Net and DeepLabv3+ obtain similar segmentation results on Images 1, 2, 5, 8, 9, and 10, while Images 3, 4, 6, and 7 show that TRCA-Net produces better segmentation results than DeepLabv3+.

4.2 Discussion

In this section, we verify our approach through two sets of comparison experiments. All experiments are conducted with the same settings as in Sect. 4.1 and use the TransUNet [4] backbone.

4.2.1 Ablation study

To verify the effectiveness of the proposed modules, we present an ablation study in this part. The performance of TRCA-Net is compared with that of the TransUNet, TransUNet+Res2Net and TransUNet+CA models.

Table 3 Comparison results of TransUNet, TransUNet+Res2Net, TransUNet+CA and TRCA-Net

From Table 3, the MPA, MIoU, and Recall of TransUNet+CA and TransUNet+Res2Net are better than those of TransUNet. When Coordinate Attention is added, MPA, MIoU and Recall increase by 0.55, 0.77 and 1.99, respectively. When Res2Net is added, MPA, MIoU and Recall increase by 1.75, 4.17 and 8.85, respectively. The proposed TRCA-Net combines the advantages of Res2Net and Coordinate Attention and achieves better results: MPA, MIoU and Recall increase by 2.89, 4.72 and 10.33, respectively.

4.2.2 Effectiveness of coordinate attention

In this part, we further discuss the effectiveness of the Coordinate Attention used in our approach. We likewise perform an experiment to demonstrate the effectiveness of Coordinate Attention for improving the feature representation. The performance of TransUNet+CA is compared with that of the TransUNet+SE and TransUNet+CBAM models. It is worth noting that all three models have the same underlying structure except for the attention module and are trained with the same settings for fair comparisons. The experimental results for the different attention modules are shown in Table 4.

Table 4 Comparison results of TransUNet+SE, TransUNet+CBAM and TransUNet+CA

From Table 4, TransUNet+CA outperforms the TransUNet+SE and TransUNet+CBAM models in terms of MIoU by 0.5 and 0.8 points, respectively. These experiments and data demonstrate the effectiveness of Coordinate Attention in the upsampling process.

5 Conclusion

In this paper, we have proposed a semantic segmentation network named TRCA-Net for human semantic segmentation. By combining the advantages of Res2Net, the Transformer and Coordinate Attention, TRCA-Net enhances the feature extraction ability and the precise localization ability for targets with complex structure and texture and large differences in size and shape. The network is therefore well suited to image segmentation tasks similar to outdoor human body segmentation.

To push the research boundary of outdoor human activity analysis for real-world scenes, we use a large-scale benchmark for an instance-level human parsing task, including 38,280 pixel-level annotated images with 19 semantic part labels. The experimental results on the CIHP dataset show that our proposed method outperforms the SOTA segmentation methods, which demonstrates the effectiveness and superiority of TRCA-Net.

Regarding the advantages of the proposed network, it can be applied to most semantic segmentation and image segmentation tasks, such as face recognition and detection, accurate localization of facial biometric features, and detection of different diseased regions in medical image segmentation. However, the computational complexity increases and the real-time capability is compromised because of the introduction of the attention mechanism and Res2Net in the network. Given the performance improvement, these costs are acceptable, and our future work will focus on addressing this issue.