1 Introduction

Human parsing aims to segment a human image into multiple semantic parts. It is a pixel-level prediction task that requires understanding human images at both the global and the local level. Human parsing can be widely applied to human behavior analysis [9], pose estimation [34] and fashion synthesis [40]. Recent advances in human parsing and semantic segmentation [10, 19, 23, 34, 36, 37] mostly explore the potential of convolutional neural networks (CNNs).

Fig. 1. Drawbacks of the pixel-wise classification loss. (a) Local inconsistency, which leads to a hole on the arm. (b) Semantic inconsistency, which causes unreasonable human poses. The inconsistencies are indicated by red arrows.

Based on the CNN architecture, the pixel-wise classification loss is commonly used [10, 19, 34]; it penalizes the classification error at each pixel. Despite providing an effective baseline, the pixel-wise classification loss, which is designed for per-pixel category prediction, has two drawbacks. First, it may lead to local inconsistency, such as holes and blur, because it merely penalizes the false prediction at every pixel without explicitly considering the correlation among adjacent pixels. For illustration, we train a baseline model (see Sect. 3.2) with the pixel-wise classification loss. As shown in Fig. 1(a), some pixels that belong to “arm” are incorrectly predicted as “upper-clothes” by the baseline, a direct consequence of the local inconsistency of the baseline loss. Second, the pixel-wise classification loss may lead to semantic inconsistency in the overall segmentation map, such as unreasonable human poses and incorrect spatial relationships of body parts. Compared to the local inconsistency, the semantic inconsistency is generated from deeper layers: when looking only at a local region, the learned model has no overall sense of the topology of body parts. As shown in Fig. 1(b), the “arm” is merged with an adjacent “leg”, indicating incorrect part topology (three legs). In short, the pixel-wise classification loss does not explicitly enforce semantic consistency, so long-range dependencies may not be well captured.

In an attempt to address these inconsistency problems, conditional random fields (CRFs) [17] can be employed as a post-processing method. However, CRFs usually handle inconsistency in a very limited (local) scope due to their pairwise potentials, and may even generate worse label maps given a poor initial segmentation. As an alternative to CRFs, a recent work proposes the use of an adversarial network [24]. Since the adversarial loss assesses whether a label map is real or fake through the joint configuration of many label variables, it can enforce higher-level consistency that cannot be achieved with pairwise terms or the per-pixel classification loss. An increasing number of works now adopt the routine of combining the cross-entropy loss with an adversarial loss to produce label maps closer to the ground truth [5, 12, 27].

Fig. 2. Two types of convergence in adversarial network training. \(Loss_{D}(real)\) and \(Loss_{D}(fake)\) denote the adversarial losses of the discriminator on real and fake images respectively, and \(Loss_{G}\) denotes the loss of the generator. (a) Good convergence, where \(Loss_{D}(real)\) and \(Loss_{D}(fake)\) converge to 0.5 and \(Loss_{G}\) converges to 0, indicating a successful adversarial training in which G is able to fool D. (b) Poor convergence, where \(Loss_{D}(real)\) and \(Loss_{D}(fake)\) converge to 0 and \(Loss_{G}\) converges to 1, indicating an unbalanced training in which D can easily distinguish generated images from real ones.

Nevertheless, the previous adversarial network also has its limitations. First, a single discriminator back-propagates only one adversarial loss to the generator, whereas local inconsistency is generated from the top layers and semantic inconsistency from the deep layers; the two cannot be separately supervised with only one adversarial loss. Second, a single discriminator has to look at the overall high-resolution image (or a large part of it) in order to supervise global consistency. As reported in the literature [7, 14], it is very difficult for a generator to fool the discriminator on a high-resolution image. As a result, the single discriminator invariably back-propagates a maximum adversarial loss, which makes the training unbalanced. We call this the poor convergence problem, as shown in Fig. 2.

Fig. 3. Top: A brief pipeline of MMAN. Two discriminators are attached to a CNN-based generator (G). Macro D works on the low-resolution label map and has a global receptive field, focusing on semantic consistency. Micro D focuses on multiple patches and has small receptive fields on the high-resolution label map, thus supervising local consistency. The Macro (Micro) discriminator yields “fake” if semantic (local) inconsistency is observed; otherwise it yields “real”. Bottom: qualitative results of using Macro D, Micro D and MMAN, respectively. We observe that Macro D and Micro D correct semantic inconsistency (dashed circle) and local inconsistency (dashed circle), respectively, and that MMAN possesses the merits of both.

In this paper, the basic objective is to improve the local and semantic consistency of label maps in human parsing. We adopt the idea of adversarial training and at the same time aim to address its limitations, i.e., the limited ability of a single adversarial loss to improve parsing consistency, and the poor convergence problem. Specifically, we introduce the Macro-Micro Adversarial Nets (MMAN). MMAN consists of a dual-output generator (G) and two discriminators (D), named Macro D and Micro D. The three modules constitute two adversarial networks (Macro AN, Micro AN), addressing semantic consistency and local consistency, respectively. Given an input human image, the CNN-based generator outputs two segmentation maps at different resolution levels, i.e., low resolution and high resolution. The input of Macro D is the low-resolution segmentation map, and its output is a confidence score of semantic consistency. The input of Micro D is the high-resolution segmentation map, and its output is a confidence score of local consistency. A brief pipeline of the proposed framework is shown in Fig. 3. MMAN departs from previous works in two critical aspects. First, our method explicitly copes with the local and semantic inconsistency problems using two task-specific adversarial networks. Second, our method does not use large FOVs on the high-resolution image, so we avoid the poor convergence problem. A more detailed description of the merits of the proposed network is provided in Sect. 3.5.

Our contributions are summarized as follows:

  • We propose a new framework called Macro-Micro Adversarial Network (MMAN) for human parsing. The Macro AN and Micro AN focus on semantic and local inconsistency respectively, and work in a complementary way to improve the parsing quality.

  • The two discriminators in our framework achieve local and global supervision of the label maps with small fields of view (FOVs), which avoids the poor convergence problem caused by high-resolution images.

  • The proposed adversarial net achieves very competitive mIoU on the LIP and PASCAL-Person-Part datasets, and generalizes well to the relatively small PPSS dataset.

2 Related Works

Our review focuses on the three lines of literature most relevant to our work, i.e., CNN-based human parsing, conditional random fields (CRFs) and adversarial networks.

Human Parsing. Recent progress in human parsing has been driven by two factors: (1) the availability of large-scale datasets [4, 10, 19, 25]. Compared to small datasets, large-scale datasets capture the common visual variance of people and enable comprehensive evaluation. (2) End-to-end learned models. Human parsing demands understanding the person at the pixel level, and recent works apply convolutional neural networks (CNNs) to learn the segmentation result in an end-to-end manner. In [34], human poses are extracted in advance and utilized as strong structural cues to guide the parsing. In [21], four human-related contexts are integrated into a unified network. A novel human-related grammar is presented in [29], which jointly infers human body pose and human part segmentation.

Conditional Random Fields. Trained with the pixel-wise classification loss, CNNs usually ignore the micro context between pixels and the macro context between semantic parts. Conditional random fields (CRFs) [17, 18, 22] are one of the common methods to enforce spatial contiguity in the output label maps. Serving as a post-processing procedure for image segmentation, CRFs further refine the output map. However, the most commonly used CRFs rely on pairwise potentials [2, 26], which have very limited parameters and handle only low-level inconsistencies within a small scope. Higher-order potentials [16, 18] have also been observed to be effective in enforcing semantic validity, but the corresponding energy pattern and clique form are usually difficult to design. In summary, the utilization of context in CNNs remains an open problem.

Adversarial Networks. Adversarial networks have demonstrated their effectiveness in image synthesis [13, 28, 30, 38, 39]. By minimizing the adversarial loss, the discriminator leads the generator to produce high-fidelity images. In [24], Luc et al. add an adversarial loss for training semantic segmentation and yield competitive results. A similar idea has since been applied to street scene segmentation [12] and medical image segmentation [5, 27]. Meanwhile, an increasing body of literature [7, 14] reports the difficulty of training adversarial networks on high-resolution images: the discriminator can easily recognize fake high-resolution images, which unbalances the training, and the generator and discriminator are prone to getting stuck in a local minimum.

The main difference between MMAN and the adversarial learning methods above is that we explicitly endow adversarial training with the macro and micro subtasks. We observe that the two subtasks are complementary, achieving parsing accuracy superior to a baseline with a single adversarial loss while reducing the risk of unbalanced training.

3 Macro-Micro Adversarial Network

Figure 4 illustrates the architecture of the proposed Macro-Micro Adversarial Network. The network consists of three components, i.e., a dual-output generator (G) and two task-specific discriminators (\(D_{Ma}\) and \(D_{Mi}\)). Given an input image of size \(3\,\times \,256\,\times \,256\), G outputs two label maps of size \(C\,\times \, 16\times 16\) and \(C\,\times \,256\times 256\), respectively. \(D_{Ma}\) supervises the entire \(C\,\times \, 16\times 16\) label map, while \(D_{Mi}\) focuses on patches of the \(C\,\times \,256\times 256\) label map, so that global and local inconsistencies are penalized. In Sect. 3.1, we present the training objectives, followed by descriptions of the structures in Sects. 3.2, 3.3 and 3.4. The merits of the proposed network are discussed in Sect. 3.5.

Fig. 4. MMAN has three components: a dual-output generator (dashed box), a Macro discriminator (dashed box) and a Micro discriminator (dashed box). Given an input image of size \(3\,\times \,256\times 256\), the generator G first produces a low-resolution (\(8192\,\times \,16\times 16\)) tensor, from which a low-resolution label map (\(C\,\times \,16\times 16\)) and a high-resolution label map (\(C\,\times \,256\times 256\)) are generated, where C is the number of classes. Finally, each label map (sized \(C\,\times \,16\times 16\), for example) is concatenated with an RGB image (sized \(3\,\times \,16\times 16\)) along the 1st axis (number of channels) and fed into the corresponding discriminator.

3.1 Training Objectives

Given a human image x of shape \(3\,\times \, H\times W\) and a target label map y of shape \(C\,\times \, H\times W\), where C is the number of classes including the background, the traditional pixel-wise classification loss (multi-class cross-entropy loss) can be formulated as:

$$\begin{aligned} \mathcal {L}_{mce}(G) = \sum _{i=1}^{H\,\times \, W}\sum _{c=1}^{C} - y_{ic}\log {\hat{y}_{ic}}, \end{aligned}$$
(1)

where \(\hat{y}_{ic}\) denotes the predicted probability of class c at the i-th pixel, and \(y_{ic}\) denotes the corresponding ground-truth probability: \(y_{ic}=1\) if the i-th pixel belongs to class c, and \(y_{ic}=0\) otherwise.
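For concreteness, a minimal PyTorch sketch of Eq. 1 is given below; the function name and tensor shapes are our own, assuming the targets are given as integer label maps.

```python
import torch.nn.functional as F

def pixelwise_mce(logits, target):
    """logits: (N, C, H, W) raw class scores; target: (N, H, W) class indices."""
    # log_softmax over the class axis gives log(y_hat_ic) at every pixel i
    log_probs = F.log_softmax(logits, dim=1)
    # nll_loss selects -log(y_hat_ic) for the ground-truth class of each pixel,
    # i.e. the inner sum of Eq. 1 with one-hot y; 'sum' matches the formula
    return F.nll_loss(log_probs, target, reduction='sum')
```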

To enforce the spatial consistency, we combine the pixel-wise classification loss with the adversarial loss. It can be formulated as:

$$\begin{aligned} \mathcal {L}_{mix}(G, D) = \mathcal {L}_{mce}(G) + \lambda \mathcal {L}_{adver}(G, D), \end{aligned}$$
(2)

where \(\lambda \) controls the relative importance of the pixel-wise classification loss and the adversarial loss. Specifically, the adversarial loss \(\mathcal {L}_{adver}(G, D)\) is:

$$\begin{aligned} \mathcal {L}_{adver}(G, D) =&\mathbb {E}_{x,y}[\log D(x, y)] + \nonumber \\&\mathbb {E}_{x}[\log (1-D(x, G(x))]. \end{aligned}$$
(3)

As shown in Fig. 4, the proposed MMAN employs the “cross-entropy loss + adversarial loss” scheme to supervise both the bottom and top outputs of the generator G:

$$\begin{aligned} \mathcal {L}_{MMAN}(G,&D_{Ma}, D_{Mi}) = \mathcal {L}_{adver}(G, D_{Ma}) + \lambda _{1} \mathcal {L}_{mce_{l}}(G) \;+ \nonumber \\&\lambda _{2} \mathcal {L}_{adver}(G,D_{Mi}) + \lambda _{3} \mathcal {L}_{mce_{h}}(G), \end{aligned}$$
(4)

where \(\mathcal {L}_{mce_l}(G)\) denotes the cross-entropy loss between the low-resolution output and the small-sized target label map, while \(\mathcal {L}_{mce_h}(G)\) denotes the cross-entropy loss between the high-resolution output and the original ground-truth label map. Similarly, \(\mathcal {L}_{adver}(G, D_{Ma})\) is the adversarial loss on the low-resolution map, and \(\mathcal {L}_{adver}(G, D_{Mi})\) is based on the high-resolution map. The hyperparameters \(\lambda _{1}\), \(\lambda _{2}\) and \(\lambda _{3}\) control the relative importance of the four losses. The training objective of MMAN is:

$$\begin{aligned} G^*, D_{Ma}^*, D_{Mi}^* =&\arg \min _G\max _{D_{Ma}, D_{Mi}} \mathcal {L}_{MMAN}(G, D_{Ma}, D_{Mi}). \end{aligned}$$
(5)

We solve Eq. 5 by alternating between optimizing G, \(D_{Ma}\) and \(D_{Mi}\) until \(\mathcal {L}_{MMAN}(G, D_{Ma}, D_{Mi})\) converges; one such alternating update is sketched below.
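The following PyTorch sketch condenses one alternating update under several assumptions of our own: each discriminator ends with a sigmoid and internally resizes the RGB image before concatenating it with the label map (Fig. 4), opt_D covers the parameters of both discriminators, and the generator uses the common non-saturating form of the adversarial loss.

```python
import torch
import torch.nn.functional as F

def mman_step(G, D_ma, D_mi, opt_G, opt_D, image, y_hi, y_lo,
              lam1=25.0, lam2=1.0, lam3=100.0):
    """image: (N, 3, 256, 256); y_hi: (N, 256, 256) and y_lo: (N, 16, 16)
    integer label maps; lam1/2/3 follow Eq. 4."""
    logit_lo, logit_hi = G(image)                  # dual-output generator
    prob_lo, prob_hi = logit_lo.softmax(1), logit_hi.softmax(1)
    C = logit_lo.size(1)
    onehot = lambda y: F.one_hot(y, C).permute(0, 3, 1, 2).float()

    # Discriminator step: real label maps -> 1, generated maps -> 0 (Eq. 3).
    d_loss = 0.0
    for D, y, p in ((D_ma, y_lo, prob_lo), (D_mi, y_hi, prob_hi)):
        r, f = D(image, onehot(y)), D(image, p.detach())
        d_loss = d_loss + F.binary_cross_entropy(r, torch.ones_like(r)) \
                        + F.binary_cross_entropy(f, torch.zeros_like(f))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: fool both discriminators, plus the two cross-entropy
    # terms of Eq. 4 weighted by lam1 and lam3 (a fuller version would freeze
    # the discriminators here).
    a_ma, a_mi = D_ma(image, prob_lo), D_mi(image, prob_hi)
    g_loss = F.binary_cross_entropy(a_ma, torch.ones_like(a_ma)) \
           + lam2 * F.binary_cross_entropy(a_mi, torch.ones_like(a_mi)) \
           + lam1 * F.cross_entropy(logit_lo, y_lo) \
           + lam3 * F.cross_entropy(logit_hi, y_hi)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```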

3.2 Dual-Output Generator

For the generator (G), we utilize the DeepLab-ASPP [2] framework with a ResNet-101 [11] model pre-trained on the ImageNet dataset [6] as our starting point, due to its simplicity and effectiveness. We augment the DeepLab-ASPP architecture with cascaded upsampling layers and skip-connect them to early layers, similar to U-net [31]. Furthermore, we add a bypass that takes the deep feature tensor from the bottom layers and transforms it into a label map with a convolution layer. This small-sized label map serves as the second output, in parallel with the original-sized label map from the top layer. We refer to the augmented dual-output architecture as Do-DeepLab-ASPP and adopt it as our baseline. For the dual output, we supervise the top layers with a cross-entropy loss against the ground-truth label map at the original size, since it retains visual details, and the bottom layers with a cross-entropy loss against a resized label map at 1/16 of the original size. The shrunken label map emphasizes the coarse-grained human structure. The same strategy is applied to the adversarial losses. We concatenate each label map with the RGB image of the corresponding size along the channel axis as a strong condition for the discriminators.
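A hypothetical sketch of the bypass described above follows; the paper only specifies “a convolution layer”, so the \(1\times 1\) kernel and the module layout are our assumptions.

```python
import torch.nn as nn

class DualOutputHead(nn.Module):
    """Maps the 8192-channel bottleneck tensor (Fig. 4) to the low-resolution
    C x 16 x 16 label map, in parallel with the decoder's C x 256 x 256 map."""
    def __init__(self, bottleneck_ch=8192, num_classes=20):
        super().__init__()
        self.to_low = nn.Conv2d(bottleneck_ch, num_classes, kernel_size=1)

    def forward(self, bottleneck, decoder_logits):
        # bottleneck: (N, 8192, 16, 16) deep features from the bottom layers
        # decoder_logits: (N, C, 256, 256) from the U-net-like upsampling path
        return self.to_low(bottleneck), decoder_logits
```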

3.3 Macro Discriminator

The Macro discriminator (\(D_{Ma}\)) aims to lead the generator to produce realistic label maps that are consistent with high-level human characteristics, such as reasonable human poses and correct spatial relationships of body parts. \(D_{Ma}\) is attached to the bottom layer of G and focuses on the overall low-resolution label map. It consists of 4 convolution layers with a kernel size of \(4\,\times \,4\) and a stride of 2, each followed by an instance-norm layer and a LeakyReLU function. Given an output label map from G, \(D_{Ma}\) downsamples it to \(1\,\times \,1\), achieving global supervision over it. The output of \(D_{Ma}\) is the confidence score of semantic consistency.
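A sketch of \(D_{Ma}\) under our own assumptions (the channel widths, the conditioning interface and the final sigmoid are not specified in the text):

```python
import torch
import torch.nn as nn

class MacroD(nn.Module):
    """Four stride-2 4x4 convolutions shrink the (C+3) x 16 x 16 input
    (label map concatenated with a resized RGB image) to a single score."""
    def __init__(self, num_classes=20, base=64):
        super().__init__()
        ch = [num_classes + 3, base, base * 2, base * 4]
        layers = []
        for c_in, c_out in zip(ch, ch[1:]):
            layers += [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(c_out),
                       nn.LeakyReLU(0.2, inplace=True)]
        # fourth convolution: 2x2 -> 1x1 confidence score
        layers += [nn.Conv2d(ch[-1], 1, 4, stride=2, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, image_16, label_map_16):
        # 16 -> 8 -> 4 -> 2 -> 1: a global receptive field over the label map
        return self.net(torch.cat([image_16, label_map_16], dim=1))
```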

3.4 Micro Discriminator

The Micro discriminator (\(D_{Mi}\)) is designed to enforce local consistency in label maps. We follow the idea of “PatchGAN” [13] in designing \(D_{Mi}\). Different from \(D_{Ma}\), which has a global receptive field on the (shrunken) label map, \(D_{Mi}\) only penalizes local errors at the scale of image patches. The kernel size of \(D_{Mi}\) is \(4\,\times \,4\) and the stride is 2. Micro D has a shallow structure of 3 convolution layers, each followed by an instance-norm layer and a LeakyReLU function. \(D_{Mi}\) aims to classify whether each \(22 \,\times \, 22\) patch in a high-resolution image is real or fake, which is suitable for enforcing local consistency. After running \(D_{Mi}\) convolutionally across the label map, we obtain one response from every receptive field and average all responses to produce the final output of \(D_{Mi}\).
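A corresponding sketch of \(D_{Mi}\), again with assumed channel widths and a sigmoid output; three stride-2 \(4\times 4\) convolutions give each output unit a \(22\times 22\) receptive field on the \(256\times 256\) input:

```python
import torch
import torch.nn as nn

class MicroD(nn.Module):
    """PatchGAN-style discriminator: per-patch scores averaged into one."""
    def __init__(self, num_classes=20, base=64):
        super().__init__()
        ch = [num_classes + 3, base, base * 2]
        layers = []
        for c_in, c_out in zip(ch, ch[1:]):
            layers += [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(c_out),
                       nn.LeakyReLU(0.2, inplace=True)]
        # third convolution: one score per 22x22 receptive field
        layers += [nn.Conv2d(ch[-1], 1, 4, stride=2, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, image, label_map):
        patch_scores = self.net(torch.cat([image, label_map], dim=1))
        return patch_scores.mean(dim=(2, 3))   # average over all patches
```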

3.5 Discussions

In CNN-based human parsing, convolution layers go deep to extract part-level features, and deconvolution layers bring the in-depth features back to pixel-level locations. It is therefore intuitive to attach Macro D to the deeper layers to supervise high-level semantic features, and Micro D to the top layers to focus on low-level visual features. Beyond this intuition, however, the arrangement brings further benefits. The merits of MMAN are summarized in four aspects.

Functional Specialization of Macro D and Micro D. Compared with a single discriminator that attempts to solve both levels of inconsistency alone, Macro D and Micro D each specialize in one of the two consistency problems. Take Macro D as an example. First, Macro D is attached to the deep layers of G. Because semantic inconsistency originates in the deep layers, this design allows the loss to be back-propagated to G more directly. Second, Macro D acts on a low-resolution label map that retains the semantic-level human structure while filtering out pixel-level details. This forces Macro D to focus on global inconsistency without being disturbed by local errors. The same reasoning applies to Micro D. In Sect. 4.5, we validate that MMAN consistently outperforms adversarial networks with a single adversarial loss [5, 24].

Functional Complementarity of Macro D and Micro D. As mentioned in [35], supervising a classification loss at early deep layers can offer a good coarse-grained initialization for the later top layers. Correspondingly, decreasing the loss at the top layers can refine the coarse semantic features with fine-grained visual details. We hypothesize that the adversarial losses exhibit the same complementary pattern, and we verify this hypothesis in Sect. 4.4.

Small FOVs to Avoid the Poor Convergence Problem. As reported in a growing body of literature [7, 14], existing adversarial networks have difficulty coping with complex high-resolution images. In our framework, Macro D acts on a low-resolution label map and Micro D has multiple but small FOVs on a high-resolution label map. As a result, both Macro D and Micro D avoid using large FOVs as the actual input, which effectively reduces the convergence risk caused by high resolution. We show this benefit in Sect. 4.5.

Efficiency. Compared with a single adversarial network [5, 24], MMAN achieves supervision across the overall image with two shallower discriminators that have fewer parameters, owing to their small FOVs. The efficiency of MMAN is shown in the variant study in Sect. 4.5.

4 Experiment

4.1 Dataset

LIP [10] is a recently introduced large-scale dataset, challenging due to its severe pose complexity, heavy occlusion and body truncation. It contains 50,462 images in total, including 30,362 for training, 10,000 for testing and 10,000 for validation. LIP defines 19 human part (clothes) labels, including hat, hair, sunglasses, upper-clothes, dress, coat, socks, pants, gloves, scarf, skirt, jumpsuits, face, right arm, left arm, right leg, left leg, right shoe and left shoe, plus a background class.

PASCAL-Person-Part [4] annotates human part segmentation labels and is a subset of PASCAL-VOC 2010 [8]. It includes 1,716 images for training and 1,817 for testing. In this dataset, an image may contain multiple persons with unconstrained poses in unconstrained environments. Six human body part classes and the background class are annotated.

PPSS [25] includes 3,673 annotated samples, divided into a training set of 1,781 images and a testing set of 1,892 images. It defines seven human parts and a background class. Collected from 171 surveillance videos, the dataset reflects the occlusion and illumination variation of real scenes.

Evaluation Metric. The human parsing accuracy of each class is measured in terms of pixel intersection-over-union (IoU). The mean intersection-over-union (mIoU) is computed by averaging the IoU across all classes. We use both IoU for each class and mIoU as evaluation metrics for each dataset.
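As a reference, the metric computation can be sketched as follows; the helper name and the confusion-matrix route are our own.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """pred, gt: integer label maps of the same shape."""
    conf = np.bincount(num_classes * gt.ravel() + pred.ravel(),
                       minlength=num_classes ** 2).reshape(num_classes,
                                                           num_classes)
    inter = np.diag(conf)                       # true positives per class
    union = conf.sum(0) + conf.sum(1) - inter   # pred + gt - intersection
    iou = inter / np.maximum(union, 1)          # classes absent from both count as 0
    return iou, iou.mean()                      # per-class IoU and mIoU
```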

4.2 Implementation Details

In our implementation, input images are resized so that their shorter side is fixed to 288. A \(256\,\times \,256\) crop is randomly sampled from the image or its horizontally flipped version. The per-pixel mean is subtracted from the cropped image. We adopt instance normalization [32] after each convolution. For the hyperparameters in Eq. 4, we set \(\lambda _{1} = 25\), \(\lambda _{2} = 1\) and \(\lambda _{3} = 100\). For the down-sampling network of the generator, we use the ImageNet [6] pretrained network as initialization. The weights of the rest of the network are initialized from scratch using a Gaussian distribution with a standard deviation of 0.001. We use the Adam optimizer [15] with a mini-batch size of 1, setting \(\beta _1 = 0.9\), \(\beta _2 = 0.999\) and a weight decay of 0.0001. The learning rate starts from 0.0002. On the LIP dataset, the learning rate is divided by 10 after 15 epochs, and the models are trained for 30 epochs. On the PASCAL-Person-Part dataset, the learning rate is divided by 10 after 25 epochs, and the models are trained for 50 epochs. We use dropout in the deconvolution layers, following the practice in [13]. We alternately optimize D and G. During testing, we average the per-pixel classification scores at multiple scales, i.e., testing images are resized to {0.8, 1, 1.2} times their original size.
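These optimizer settings translate directly into PyTorch; the helper below is our own wrapping (the LIP schedule divides the learning rate by 10 after 15 of 30 epochs).

```python
import torch

def make_optimizer(model, epochs_per_decay=15):
    opt = torch.optim.Adam(model.parameters(), lr=2e-4,
                           betas=(0.9, 0.999), weight_decay=1e-4)
    # divide the learning rate by 10 every `epochs_per_decay` epochs
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=epochs_per_decay,
                                            gamma=0.1)
    return opt, sched
```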

4.3 Comparison with the State-of-the-Art Methods

In this section, we compare our results with the state-of-the-art methods on the three datasets. First, on the LIP dataset, we compare MMAN with five state-of-the-art methods in Table 1. The proposed MMAN yields an mIoU of 46.65%, while the mIoU of the five competing methods is 18.17% [1], 28.29% [23], 42.92% [3], 44.13% [2] and 44.73% [10], respectively. For a fair comparison, we further implement ASN [24] and SSL [10] on our baseline, i.e., Do-Deeplab-ASPP. On the same baseline, MMAN outperforms ASN [24] and SSL [10] by +1.40% and +0.62% in terms of mIoU, respectively. This clearly indicates that our method outperforms the state of the art. The comparison of per-class IoU indicates that the improvement mainly comes from classes closely related to human pose, such as arms, legs and shoes. In particular, MMAN is capable of distinguishing between “left” and “right”, which yields a large boost on the corresponding human parts: more than +2.5% improvement on left/right arm, more than +10% on left/right leg and more than +5% on left/right shoe. The comparison implies that MMAN is capable of enforcing the consistency of semantic-level features, i.e., human pose.

Table 1. Method comparison of per-class IoU and mIoU on LIP validation set.

Second, on PASCAL-Person-Part, the comparison is shown in Table 2. We apply the same model structure used on the LIP dataset to the PASCAL-Person-Part dataset. Our model yields an mIoU of 58.45% on the test set, higher than most of the compared methods and only slightly inferior to “Attention+SSL” [10], by 0.91%. This is probably due to the human scale variance in this dataset, which can be addressed by the attention algorithm proposed in [3] and applied in [10].

Therefore, we add a plug-and-play module to our model, i.e., the attention network [3]. In particular, we employ multi-scale inputs and use the attention network to merge the results. The final model, “Attention+MMAN”, improves mIoU to 59.91%, which is higher than the current state-of-the-art method [10] by +0.55%. Looking into the per-class IoU scores, we have observations similar to those on LIP. The largest improvements can be observed on arms and legs. The improvement over the state-of-the-art methods [3, 10, 20] is over +0.6% on upper arms, over +1.8% on lower arms, over +0.4% on upper legs and over +0.9% on lower legs. These comparisons indicate that our method is very competitive.

Table 2. Performance comparison in terms of per-class IoU with five state-of-the-art methods on the PASCAL-Person-Part test set.

Third, we deploy the model trained on LIP to the testing set of the PPSS dataset without any fine-tuning, in order to evaluate the generalization ability of the proposed model.

To make the labels in the LIP and PPSS datasets consistent, we merge the fine-grained labels of LIP into the coarse-grained human part labels defined in PPSS. The evaluation result is reported in Table 3. MMAN yields an mIoU of 52.11%, which significantly outperforms DL [25], DDN [25] and ASN [24] by +16.9%, +4.9% and +1.4%, respectively. Therefore, even when directly tested on another dataset with different image styles, our model still yields good performance.

Table 3. Comparison of human parsing accuracy on the PPSS dataset [25]. Best performance is highlighted.

In Fig. 5, we provide segmentation examples obtained by the baseline (Do-Deeplab-ASPP), Baseline+Macro D, Baseline+Micro D and the full MMAN, respectively, along with the ground-truth label maps. We observe that Baseline+Micro D reduces blur and noise significantly and helps generate sharp boundaries, while Baseline+Macro D corrects unreasonable human poses. The full MMAN integrates the advantages of both Macro AN and Micro AN and achieves higher parsing accuracy. We also present qualitative results on the PPSS dataset in Fig. 6.

Fig. 5. Qualitative parsing results on the PASCAL-Person-Part dataset.

Fig. 6. Qualitative parsing results on the PPSS dataset. RGB images and label maps are shown in pairs.

4.4 Ablation Study

This section presents ablation studies of our method. Since two components are involved, i.e., Macro D and Micro D, we remove them one at a time to evaluate their respective contributions. Results on the LIP and PASCAL-Person-Part datasets are shown in Tables 1 and 2, respectively.

On the LIP dataset, removing Macro D or Micro D from the system decreases mIoU by 1.21% and 1.29%, respectively, compared with the full MMAN system. Meanwhile, compared with the baseline approach, employing Macro D or Micro D alone brings a +0.88% and +0.80% improvement in mean IoU, respectively. Similar observations can be made on the PASCAL-Person-Part dataset.

To further evaluate the respective functions of the two discriminators, we conduct two additional experiments: (1) For Macro D, we compute another mIoU using the low-resolution segmentation maps, which filter out pixel-wise details and retain high-level human structure, so this mIoU is more suitable for evaluating Macro D. (2) For Micro D, we count the “isolated pixels” in high-resolution segmentation maps, which reflect local inconsistencies such as “holes”. The resulting “isolated pixel rate” (IPR) can be viewed as a better indicator for evaluating Micro D. We see from Table 4 that Macro D is better than Micro D at improving “mIoU (low-reso.)”, indicating that Macro D specializes in preserving high-level human structure. We also see that Micro D is better than Macro D at decreasing IPR, suggesting that Micro D specializes in improving the local consistency of the result.
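The paper does not define “isolated pixel” formally; one plausible reading, sketched below under that assumption, counts a pixel as isolated if its label differs from all four of its neighbors.

```python
import numpy as np

def isolated_pixel_rate(label_map):
    """label_map: (H, W) integer labels; border wrap-around ignored for brevity."""
    p = label_map
    up, down = np.roll(p, 1, axis=0), np.roll(p, -1, axis=0)
    left, right = np.roll(p, 1, axis=1), np.roll(p, -1, axis=1)
    isolated = (p != up) & (p != down) & (p != left) & (p != right)
    return isolated.mean()   # fraction of isolated pixels
```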

4.5 Variant Study

We further evaluate three variants of MMAN, i.e., Single AN, Double AN and Multiple AN, on the LIP dataset. Table 5 details the number of parameters, the global FOV (g.FOV) and local FOV (l.FOV) sizes, as well as an architecture sketch of each variant. The result of the original MMAN is also presented for clear comparison.

Single AN refers to the traditional adversarial network with only one discriminator, attached to the top layer with a global receptive field on a \(256\,\times \,256\) label map. As the results show, Single AN yields 45.23% in mean IoU, slightly higher than the baseline but lower than MMAN. This suggests that employing Macro D and Micro D outperforms a single discriminator, supporting the analysis in Sect. 3.5. Moreover, we observe the poor convergence (pc) problem when training Single AN, due to the large FOV on the high-resolution label map.

Double AN has the same number of discriminators as MMAN; the difference is that Double AN attaches Macro D to the top layer. Compared to Double AN, MMAN significantly improves the result, by 0.82%. This illustrates the complementary effects of Macro D and Micro D: Macro D acts on the deep layers and offers a good coarse-grained initialization for the later top layers, while Micro D refines the coarse semantic features with fine-grained visual details.

Multiple AN is designed to evaluate the parsing accuracy when employing more than two discriminators. To this end, we attach an extra discriminator to the 3rd deconvolution layer of G. In particular, this discriminator has the same architecture as Micro D and focuses on \(22\,\times \,22\) patches of a \(64\,\times \,64\) label map. As the results in Table 5 show, employing three discriminators brings a very slight improvement (0.16%) in mean IoU, at the cost of a more complex architecture and more parameters.

Table 4. Comparison of IPR and mIoU.
Table 5. Variant study of MMAN.

5 Conclusions

In this paper, we introduce a novel Macro-Micro Adversarial Network (MMAN) for human parsing, which significantly reduces semantic inconsistency, e.g., misplaced human parts, and local inconsistency, e.g., blur and holes, in the parsing results. Our model achieves parsing accuracy competitive with the state-of-the-art methods on two challenging human parsing datasets and generalizes well to other datasets. The two adversarial losses are complementary and outperform previous methods that employ a single adversarial loss. Furthermore, MMAN achieves both global and local supervision with small receptive fields, which effectively avoids the poor convergence problem of adversarial networks in handling high-resolution images.