
1 Introduction

Recent deep learning research has revealed that properly increasing model capacity often leads to improved performance [6, 7, 10]. Nevertheless, larger models usually come with drawbacks such as long training time, increased inference latency, and high GPU memory consumption.

Fig. 1. The different training objectives in the single-label scene and the multi-label scene. Due to the property of the softmax function, the predicted probabilities in the single-label scene sum up to 1, while those in the multi-label scene have no such property.

To address this issue, knowledge distillation (KD) [11, 21] is often employed, which utilizes a strong teacher network to transfer knowledge to a relatively weaker student network. KD has been extensively studied in single-label image classification, where each sample has only one label. However, it has not been well investigated in multi-attribute and multi-label classification (which we jointly refer to as the multi-label scene in this paper), where each sample may have multiple labels.

In this paper, we focus on exploring logit-based knowledge distillation in the context of the multi-label scene. The logit-based KD method stands out due to its simplicity in both idea and implementation, its independence from the backbone model structure, and its relatively low computational overhead compared to other KD methods [9]. In this method, the term logits refers to the outputs of the neural network's final layer, which are then fed into a softmax function.

However, for the following reasons, we do not directly employ the logit-based KD method [11] (which we call vanilla KD below) commonly used in single-label image classification.

As shown in Fig. 1, in the single-label scene, a softmax function is applied to the logits to generate predictions in terms of probabilities. As the predictions for all classes sum to 1, the logits of different classes within a single sample become highly correlated and therefore contain interdependent information. This entanglement of logits has also been noted by Decoupled KD [28]. In the multi-label scene, by contrast, the logits of different classes are used to compute losses against their own labels and do not explicitly interact with the logits of other classes. Such a lack of interaction weakens the information contained in the relation among logits of different classes. Since vanilla KD distills exactly this relation of logits of different classes within a sample, directly employing it in the multi-label scene transfers a reduced amount of information.

Fig. 2. In the single-label scene, the logit of the same dog can be affected by the presence of a person, as shown in (a) and (b). In the multi-label scene, by contrast, the logit of the same dog tends to be more stable and less influenced by the other classes, as shown in (c) and (d).

On the contrary, when evaluating logits from multiple samples, logits of the same class across samples in the multi-label scene are more comparable than those in the single-label scene. To illustrate this, we present an example in Fig. 2, where two input images contain exactly the same dog, the only difference being a person in the second image. In the single-label scene, the softmax function enforces \(\Sigma _{i=1}^{C}p_i=1\), so the rise in the person probability leads to a decrease in the dog probability. In the multi-label scene, however, such a decrease is far less pronounced because the logit of the person does not explicitly influence the logit of the dog. To conclude, if we look at the logits of the same class across different samples, their values are not comparable in the single-label scene but are comparable in the multi-label scene.

Inspired by the aforementioned two observations, we propose to distill knowledge from logits of the same class across different samples, rather than from different classes within the same sample. We refer to this strategy as the Transpose strategy.

We also note that the multi-label scene typically exhibits class imbalance, where most classes or attributes have fewer (or significantly fewer) positive samples than negative samples. Therefore, under our Transpose strategy, more negative samples are involved than in vanilla KD [11]. We assume that the relative relation among negative samples is less informative: from two positive samples the network can learn saliency information (for example, the cat in the i-th image is more salient than the one in the j-th image), but such information does not exist between two negative samples (since they simply contain no cat). We thus propose another strategy that fills every position whose teacher logit is negative (i.e., a negative sample as predicted by the teacher) with zero before distillation, which we refer to as the Mask strategy.

Based on the above analysis, we propose Transpose and Mask Knowledge Distillation (TM-KD), a simple but effective logit-based knowledge distillation method, as illustrated in Fig. 3. We further validate its effectiveness on three tasks and five datasets, showing that TM-KD outperforms both the vanilla student network and the student network trained with vanilla logit-based KD.

Fig. 3. Illustration of our proposed Transpose and Mask Knowledge Distillation (TM-KD), where logits are first transposed and then masked with 0 according to negative samples predicted by the teacher.

2 Related Work

Knowledge Distillation. Knowledge distillation (KD), proposed by Hinton et al. [11], aims to utilize a strong teacher network to train a better student network. KD methods can be roughly divided into logit-based [11, 13, 15, 26] and feature-based [2, 21] ones. KD was originally proposed for single-label image classification, but recent studies also show its effectiveness in other tasks such as object detection [4, 29], semantic segmentation [19, 24], graph neural networks [8, 25], anomaly detection [3], and some low-level tasks [23].

Multi-attribute and Multi-label Learning. Due to limited space, it is hard to provide a detailed overview of each task, so we list some representative methods here. Label2Label [14] proposes a language modeling framework for clothing attribute recognition and pedestrian attribute recognition. JLAC [22] exploits a graph neural network on top of a convolutional neural network for better pedestrian attribute recognition. Query2Label [17] proposes a simple transformer-based model for multi-label classification. Note that our work does not target state-of-the-art performance on each dataset and is thus orthogonal to these works.

KD for the Multi-label Scene. So far, KD for the multi-label scene has not been well explored. Liu et al. [20] leverage extra information from weakly-supervised detection for KD in the multi-label scene. Zhang et al. [27] propose a feature-based KD method for the multi-label scene that exploits class activation maps. In contrast, our work focuses on better logit-based KD and uses no auxiliary model such as an object detector.

3 Method

3.1 Preliminaries

A training batch with B samples and C classes in the multi-label scene can be described as \(D=\{(x_i,y_i),i=1,2,...,B\}\), where \(x_i\) is the i-th image in the batch and \(y_i\in \{0,1\}^C\) is a binary vector of length C containing the labels of the i-th sample. We use \(y_{ij}\) to denote the j-th attribute label of the i-th sample, with \(y_{ij}=1\) for a positive sample and \(y_{ij}=0\) for a negative sample.

Then, a classification network f is trained to predict a vector \(z_i\in \mathbb {R}^C\) for the i-th sample, and \(z \in \mathbb {R}^{B \times C}\) is called the logits. In the multi-label scene, each logit is fed into a sigmoid function, and the binary cross-entropy (BCE) loss is then calculated. The above process can be formally defined as:

$$\begin{aligned} z_i = f(x_i) \end{aligned}$$
(1)
$$\begin{aligned} p_{ij} = \frac{1}{(1+e^{-z_{ij}})} \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {L_\text {BCE}} = -\frac{1}{BC}\sum _{i=1}^{B}\sum _{j=1}^{C}\left[ y_{ij}\log (p_{ij})+(1-y_{ij})\log (1-p_{ij})\right] \end{aligned}$$
(3)
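For concreteness, a minimal PyTorch-style sketch of Eqs. (1)-(3) is given below; the classifier f and the batch tensors are assumed inputs, and the snippet is illustrative rather than the exact training code.

```python
# A minimal sketch of Eqs. (1)-(3); the model `f` and the batch tensors
# `x`, `y` (labels of shape (B, C)) are assumed to be given.
import torch

def bce_loss(f, x, y):
    z = f(x)              # Eq. (1): logits, shape (B, C)
    p = torch.sigmoid(z)  # Eq. (2): per-class probabilities
    # Eq. (3): binary cross-entropy averaged over all B*C entries
    # (torch.nn.functional.binary_cross_entropy_with_logits(z, y) is an
    #  equivalent, numerically more stable alternative)
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
```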

If KD is applied during training, the final loss can be represented as:

$$\begin{aligned} \mathcal {L} = \mathcal {L_\text {BCE}} + \lambda \mathcal {L_\text {KD}} \end{aligned}$$
(4)

where \(\lambda \) is a hyperparameter to balance BCE loss and KD loss.
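Eq. 4 amounts to a one-line combination of the two loss terms, sketched below for completeness (function and variable names are ours):

```python
# Sketch of Eq. (4): the final loss balances the BCE loss and the KD loss
# with lambda (`lam`); lambda = 20 is the value reported in Sect. 4.1.
def total_loss(bce, kd, lam=20.0):
    return bce + lam * kd
```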

Below we will show the different designs of \(\mathcal {L_\text {KD}}\) in vanilla KD and our TM-KD.

3.2 Vanilla KD

Apart from the student network \(f^s\), KD uses a stronger teacher network \(f^t\), trained beforehand, to help the student network. Their logits can be represented as \(z^s \in \mathbb {R}^{B \times C}\) and \(z^t \in \mathbb {R}^{B \times C}\).

When directly using vanilla KD from single-label classification, a Kullback-Leibler (KL) divergence loss is used to minimize the discrepancy between the teacher's and the student's probabilities over different classes within the same sample, which is:

$$\begin{aligned} p_i^t = softmax(z_i^t/\tau ),\ \ p_i^s = softmax(z_i^s/\tau ),\ \ i=1,2,...,B \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {L_\text {KD}} = \frac{1}{B}\sum _{i=1}^{B}KL(p_i^t,p_i^s) \end{aligned}$$
(6)

where \(\tau \) is a hyperparameter to adjust the smoothness of the two probability distributions.
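A minimal PyTorch-style sketch of Eqs. (5)-(6) follows; the softmax and KL divergence are taken over the class dimension of each sample, and the batch-mean reduction is an assumption rather than a detail stated in the text.

```python
# A minimal sketch of vanilla KD (Eqs. (5)-(6)): softmax over the class
# dimension of each sample, then KL divergence averaged over the batch.
# `z_t` and `z_s` are teacher/student logits of shape (B, C).
import torch.nn.functional as F

def vanilla_kd_loss(z_t, z_s, tau=1.0):
    p_t = F.softmax(z_t / tau, dim=1)          # teacher probabilities per sample
    log_p_s = F.log_softmax(z_s / tau, dim=1)  # student log-probabilities per sample
    # KL(p_t || p_s), averaged over the B samples
    return F.kl_div(log_p_s, p_t, reduction="batchmean")
```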

Algorithm 1. PyTorch-style pseudocode for vanilla KD and TM-KD.

3.3 TM-KD

As illustrated in Fig. 3, our TM-KD consists of two strategies, namely the Transpose and the Mask.

For the Mask, to alleviate the influence of uninformative negative samples of a class, we set a position to zero if the corresponding teacher logit is negative. By doing so, the teacher network only distills knowledge from samples that are positive from its own perspective. Formally:

$$\begin{aligned} \hat{z}^{*}_{ij} = {\left\{ \begin{array}{ll} z^*_{ij}, &{} z^t_{ij}\ge 0 \\ 0, &{} z^t_{ij}< 0 \end{array}\right. } \end{aligned}$$
(7)

where \(* \in \{s,t\}\). Note that the student and teacher networks share the same mask, which is derived from the teacher network.

For the Transpose, we no longer distill across different classes within the same sample, but across different samples of the same class, which is done by:

$$\begin{aligned} p_j^t = softmax(\hat{z}_{*j}^t/\tau ),\ \ p_j^s = softmax(\hat{z}_{*j}^s/\tau ),\ \ j=1,2,...,C \end{aligned}$$
(8)

Here \(*\) ranges over all sample indices \(i\) with \(1\le i\le B\), so \(\hat{z}^t_{*j}, \hat{z}^s_{*j} \in \mathbb {R}^{B}\) can be viewed as the logits of the same class across different samples. We have analyzed in the introduction why these contain more information in the multi-label scene.

We also provide pseudocode for vanilla KD and TM-KD in Algo. 1.
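A minimal PyTorch-style sketch of the TM-KD loss, consistent with Eqs. (7)-(8), is given below; variable names and the final reduction over classes are illustrative assumptions rather than the exact pseudocode of Algo. 1.

```python
# A sketch of TM-KD (Eqs. (7)-(8)); `z_t` and `z_s` are teacher/student
# logits of shape (B, C). The reduction over classes is an assumption.
import torch.nn.functional as F

def tm_kd_loss(z_t, z_s, tau=1.0):
    # Mask (Eq. (7)): set positions with negative teacher logits to zero;
    # the student shares the mask derived from the teacher.
    mask = (z_t >= 0).float()
    z_t_hat = z_t * mask
    z_s_hat = z_s * mask
    # Transpose (Eq. (8)): softmax over the batch dimension (dim=0), i.e.
    # over different samples of the same class, instead of dim=1.
    p_t = F.softmax(z_t_hat / tau, dim=0)
    log_p_s = F.log_softmax(z_s_hat / tau, dim=0)
    # KL divergence, averaged over the C classes
    return F.kl_div(log_p_s, p_t, reduction="sum") / z_t.shape[1]
```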

4 Experiments

In this section, we validate the performance of our TM-KD on three tasks and five datasets. Our TM-KD consistently demonstrates impressive performance across all the datasets. In addition, we conduct ablation studies to demonstrate the effectiveness of our Transpose and Mask strategy.

4.1 Experimental Setting

We conduct our experiments under two KD settings with ResNet [10], where ResNet-101 serves as the teacher for ResNet-50, and ResNet-50 serves as the teacher for ResNet-18. We train a teacher first and then utilize it to help train a student. The only exception is the MS COCO dataset, for which we use ResNet-101 as the teacher and ResNet-34 as the student. Our code is built on top of the codebase of Jia et al. [12], and the teacher network is retrained rather than loaded from released weights. Below we present more experimental details.

Datasets. We evaluate on pedestrian attribute recognition using PETA [5], PETA-zs [5, 12], and PA100k [18], clothing attribute recognition using the Clothing Attributes Dataset [1], and multi-label classification using MS COCO [16]. We list the statistics of the used datasets in Table 1. We use the default dataset splits; more details can be found in the original papers and the codebase of Jia et al. [12]. Note that for the Clothing Attributes Dataset, we use only 22 out of 26 attributes, excluding the attributes of sleeve length, neckline, category, and gender.

Table 1. Statistics of the 5 datasets used, covering 3 different tasks. PAR: pedestrian attribute recognition. CAR: clothing attribute recognition. MLC: multi-label classification. \(N_{train}\): number of samples in the train set. \(N_{test}\): number of samples in the test set. \(N_{attr}\): number of attributes used in this dataset.

Implementation Details. For the KD hyperparameters, we set \(\lambda =20\) in Eq. 4 and \(\tau =1\) in Eq. 8 for all experiments. We find that the order of magnitude of \(\lambda \) and \(\tau \) (in powers of 10) does affect the results, but as long as \(\lambda \) and \(\tau \) stay within the same order of magnitude, their exact values do not affect the result. Since we implement our method on top of the codebase of Jia et al. [12], we use most of its default settings. We guarantee that the hyperparameters for different methods on the same dataset are identical.

Metrics. Following the routine of previous works, we report mean accuracy (mA) and \(micro-F1\) for the pedestrian attribute recognition and clothing attribute recognition datasets, and report mA for the multi-label classification dataset. Since a ReduceLROnPlateau learning rate scheduler is used following the codebase of Jia et al. [12], we report metrics after the first epoch in which the learning rate is reduced to \(10^{-5}\) for the pedestrian attribute recognition datasets, and after the first reduction to \(10^{-6}\) for the clothing attribute recognition dataset. For the multi-label classification dataset, we report the metrics at the last (30th) epoch.

Table 2. The results for pedestrian attribute recognition. The \(\Delta _{*}\) rows represent the performance gains of our TM-KD over the baseline w/o KD and over the baseline with vanilla KD, respectively. The rightmost column reports the average over the 3 datasets.

4.2 Main Results

Pedestrian Attribute Recognition. We report our results in Table 2. It can be seen that vanilla KD has a negligible influence on the baseline. We compare TM-KD with the baseline w/o KD and with vanilla KD in the \(\Delta _*\) rows. Our TM-KD achieves impressive and consistent gains on all datasets w.r.t. mA.

When it comes to F1, our method is not as outstanding as it is w.r.t. mA, but it still obtains an overall positive delta in average performance compared to the baseline w/o KD. We argue that mA is the main metric for pedestrian attribute recognition, since it computes the mean accuracy over classes, whereas \(micro-F1\) treats all samples equally and thus cannot well reflect a model's performance under class imbalance. Jia et al. [12] also note that a trade-off exists between mA and F1, and show that changing the weight function can steer this trade-off toward mA or F1 to some extent.

Table 3. Our results on the Clothing Attributes Dataset. Note that our TM-KD even helps the students surpass their corresponding teachers.

Clothing Attribute Recognition. Our results on clothing attribute recognition are presented in Table 3, where our TM-KD demonstrates even more remarkable performance. Surprisingly, our ResNet-18 student, taught by ResNet-50, outperforms even the ResNet-101 teacher. Additionally, the ResNet-50 taught by ResNet-101 with our TM-KD also achieves significantly better results than all other methods. On the contrary, vanilla KD leads to performance degradation for both ResNet-18 and ResNet-50.

One possible reason for such remarkable performance may be that the Clothing Attributes Dataset contains a very limited number of samples (recall Table 1). Intuitively, when the training data is far from sufficient to train a network, the network has even more room to improve. Consider two college students whose final exam grades are a \(D^-\) and an A: if we teach them in the same way, the former will apparently progress more.

Table 4. Our results on the MS COCO dataset.

Multi-label Classification. As shown in Table 4, our method also boosts the performance of the student ResNet-34 (\(+2.02\)) on the MS COCO multi-label dataset, in contrast to the negative impact of vanilla KD. Although the improvement may not seem large, considering the difficulty of this dataset and the performance degradation caused by vanilla KD, the result is still quite impressive.

Fig. 4. Ablation of the proposed Transpose strategy and Mask strategy on the PETA-zs dataset. The dark blue dashed line assumes a 1:1 trade-off between mA and F1. (Color figure online)

4.3 Ablation Study

To evaluate the effectiveness of the two proposed strategies, we conduct an ablation study by incorporating each of them separately into the vanilla KD method. The results are presented in Fig. 4. As discussed in Sect. 4.2, there exists a trade-off between mA and F1 in pedestrian attribute recognition. For better visual comparison, we assume an equal trade-off ratio of 1:1 and plot a dark blue dashed line; under this assumption, points on the same line are considered equally effective. It can be seen that applying only one of our strategies already improves performance compared with vanilla KD, validating the effectiveness of both strategies, and when they are used together in TM-KD, the performance becomes even better.

5 Conclusion

In this paper, we analyze the logits in the single-label scene and the multi-label scene, and then propose TM-KD (Transpose and Mask Knowledge Distillation), a simple and effective logit-based KD method for multi-attribute and multi-label classification. The proposed method is evaluated on five datasets from three tasks. While vanilla KD usually brings nearly no improvement and sometimes even degradation, TM-KD achieves impressive and consistent results on all datasets, validating its effectiveness.