1 Introduction

Deep neural networks have become a popular choice for various vision tasks due to their strong performance [1,2,3]. However, more powerful networks usually require more storage and have lower inference speed, making them difficult to deploy in tasks with strict real-time requirements such as autonomous driving. To address this issue, knowledge distillation has been proposed [4, 5]. It can improve the performance of lightweight networks so that they can, to a certain extent, replace large networks, making it easier to apply deep neural networks in real-time tasks. Knowledge distillation can be divided into three types: logits-based, feature-based, and relation-based distillation. Since feature-based methods generally achieve better performance, feature-based distillation is the focus of this paper.

The concept of feature-based distillation was first introduced in [6], where the authors used feature maps from the middle layers of the teacher network to guide the student network during training. In [7], a spatial attention mechanism was utilized to obtain an attention map for distillation. Similarly, in [8] and [9], channel attention and spatial attention were used to suppress unimportant background features in object detection. However, these efforts failed to consider the importance of positional information in feature maps. In domains where the position or order of elements matters, positional features help a network understand the input data. In computer vision, for example, positional features can capture the spatial arrangement of pixels in an image, helping the network locate objects or patterns. Incorporating positional encodings can therefore enhance a network's ability to capture spatial or sequential patterns and improve performance on tasks that require understanding positional relationships. This paper takes a step further by using the positional features of the teacher to guide the student during training.

In addition to utilizing attention mechanisms, some efforts have focused on modifying the structure of student networks to enhance the distillation effect. For example, [10] added several branches at different stages of the network and used the knowledge of its deepest layer to guide the training of the student network. Similarly, [11] added a BiFPN (bi-directional feature pyramid network, proposed by [12]) and a classifier for distillation, while [13] proposed a residual learning framework to enhance distillation performance. However, the redundant parts added to the networks in these efforts must be dropped after training. In contrast, [14] simplified the teacher network with a low-rank approximation to improve the distillation effect, avoiding the disadvantage mentioned above. However, it relies on manually setting multiple decomposition ranks, which is cumbersome and makes it difficult to find a suitable teacher for the student network. To address these issues, this paper presents a new generic teacher framework for distillation that can be widely applied to other distillation methods to improve the distillation effect. Building upon this framework, we propose a new two-stage distillation method that further enhances the distillation effect.

Based on the above discussion, this paper presents DFGPD, which consists of three parts: global and positional distillation, a generic teacher framework, and a two-stage distillation method. DFGPD effectively exploits the positional features of neural networks that previous efforts have ignored. DFGPD introduces no additional computation or parameters at inference time, since global distillation and positional distillation are required only during training. Additionally, the proposed teacher framework can be widely applied to other distillation methods to improve the distillation effect. Our experiments reveal an interesting finding: the proposed teacher framework consistently performs better than the original teacher network, despite increasing the teacher–student capacity gap. This is noteworthy because previous work [15, 16] has shown that bigger models are not always better teachers, as the teacher–student capacity gap can influence the distillation effect. We conducted extensive comparative experiments on the classification and segmentation tasks, as well as sensitivity and ablation experiments on the classification task, to demonstrate the effectiveness and stability of DFGPD. We validate it on the CIFAR100, Tiny-ImageNet and ImageNet [21] datasets for classification, and on Cityscapes [33] and Pascal VOC [34] for segmentation. Notably, our method improves the performance of MobilenetV3 on CIFAR100 and Tiny-ImageNet by over 10%. Furthermore, our method alleviates the influence of the teacher–student capacity gap on the distillation effect [15, 16]: as shown in Table 13, the distillation effect of our method increases as the teacher's capacity grows. Finally, our method has only one hyper-parameter, which is insensitive, so little time is needed to tune it for a good distillation effect.

In summary, the contributions of this paper are as follows:

  • We present a new distillation framework DFGPD, which is composed of global and positional distillation loss, a teacher framework for distillation and a two-stage distillation method for better distillation performance.

  • The proposed global and positional distillation loss can effectively transfer global and positional information between teacher and student, and is straightforward and intuitive.

  • The proposed teacher framework could improve distillation performance for both logits-based distillation methods and feature-based distillation methods. Based on the teacher framework, we present a two-stage distillation method for better distillation effect.

  • We provide an analysis of the backpropagation of the teacher framework to explain its effectiveness.

  • DFGPD can effectively alleviate the bigger-models-are-not-always-better-teachers issue, which means we do not need to spend time finding an appropriate teacher for the student model.

2 Related works

Knowledge distillation Given a lightweight model, the aim of vanilla knowledge distillation [17] is to improve its performance by letting it mimic the predictions, or soft labels, of a teacher model. A common explanation for the success of knowledge distillation is that soft labels provide the student model with dark knowledge, which effectively improves its generalization ability. However, vanilla knowledge distillation requires a pre-trained teacher network, which incurs extra training cost. To alleviate this issue, online distillation [18] and self-distillation [5, 10] have been proposed, which do not require a pre-trained teacher model.

Feature-based knowledge distillation Besides mimicking predictions, recent methods attempt to leverage the information contained in the hidden layers of neural networks. The earliest work on feature-based distillation is [6], which encourages the student model to mimic the feature maps of the teacher model by minimizing the L2 loss between student and teacher feature maps. To transfer the teacher's knowledge effectively, a number of efforts designed elaborate knowledge representations. For example, [7] transferred the teacher knowledge using spatial attention, while [8, 9] represented the knowledge of the teacher model with channel-spatial attention maps; channel and spatial attention maps express the global information of the channel and spatial dimensions. Different from these works, this paper goes a step further by considering the positional information of the teacher model and proposes global and positional distillation, which effectively transfers the teacher's global and positional information.

Learning framework in knowledge distillation Vanilla knowledge distillation uses the general teacher–student framework for distillation, which modifies neither the teacher nor the student network. Modifications of the learning framework can be divided into three types: modifying the student model for self-distillation [5, 10, 11], modifying the student model for offline distillation [13], and modifying the teacher model [14]. However, the redundant parts of the student model must be dropped at inference time for [10, 11, 13]. Although [14] avoids this issue, it requires manually setting multiple decomposition ranks. These issues make the above approaches less generic.

Based on the above discussion, the differences between our method and related works are summarized as follows: (1) besides channel and spatial features, this paper further considers the positional information of feature maps; (2) the proposed teacher framework does not require setting any parameters manually and does not require any modification to student networks; (3) [14] simplified the teacher for a better distillation effect, a motivation that is intuitive given the bigger-models-are-not-always-better-teachers issue; we design the teacher framework from a new perspective, i.e., increasing the complexity of the teacher to obtain better teacher feature maps; (4) a two-stage distillation method is proposed to further enhance the distillation effect.

3 Methods

3.1 Global and positional distillation

As shown in Fig. 1, global distillation is achieved by encouraging the student to mimic the spatial attention map (representing the global information of spatial positions) \(A_{s} \in {\mathcal {R}}^{1,H,W}\) and the channel attention map (representing the global information of the channel dimension) \(A_{c} \in {\mathcal {R}}^{C,1,1}\) of the teacher network. Positional distillation is achieved by encouraging the student to mimic the horizontal attention map (reflecting the feature responses of each channel across the entire width, i.e., the positional information of the width for each channel) \(A_{w} \in {\mathcal {R}}^{C,1,W}\) and the vertical attention map (reflecting the feature responses of each channel across the entire height, i.e., the positional information of the height for each channel) \(A_{h} \in {\mathcal {R}}^{C,H,1}\) of the teacher network. Given a feature map \(F \in {\mathcal {R}}^{C,H,W}\), where C, H and W denote its channel number, height and width, respectively, generating the spatial and channel attention maps can be viewed as finding the mapping functions \(\varrho ^{s}:\) \(F \in {\mathcal {R}}^{C,H,W} \rightarrow A_{s} \in {\mathcal {R}}^{1,H,W}\) and \(\varrho ^{c}:\) \(F \in {\mathcal {R}}^{C,H,W} \rightarrow A_{c} \in {\mathcal {R}}^{C,1,1}\), respectively. Similarly, the horizontal and vertical attention maps can be obtained with the mapping functions \(\varrho ^{w}:\) \(F \in {\mathcal {R}}^{C,H,W} \rightarrow A_{w} \in {\mathcal {R}}^{C,1,W}\) and \(\varrho ^{h}:\) \(F \in {\mathcal {R}}^{C,H,W} \rightarrow A_{h} \in {\mathcal {R}}^{C,H,1}\). Note that the superscripts s, c, w and h distinguish 'spatial', 'channel', 'width' and 'height'. Since the absolute value of each element in a feature map represents its importance, we construct \(\varrho ^{s}\) by averaging the absolute values across the channel dimension and \(\varrho ^{c}\) by averaging the absolute values across the height and width dimensions, which can be formulated as \(\varrho ^{s}(F) = \frac{1}{C} {\textstyle \sum _{k=1}^{C}} |F_{k,i,j} |\) and \(\varrho ^{c}(F) = \frac{1}{HW} {\textstyle \sum _{i=1}^{H}} {\textstyle \sum _{j=1}^{W}} |F_{k,i,j} |\), where i, j and k denote the \(i_{th}\), \(j_{th}\) and \(k_{th}\) element in the height, width and channel dimension, respectively. Moreover, \(\varrho ^{h}\) is constructed by averaging the absolute values across the width dimension and \(\varrho ^{w}\) by averaging the absolute values across the height dimension, formulated as \(\varrho ^{h}(F) = \frac{1}{W} {\textstyle \sum _{j=1}^{W}} |F_{k,i,j} |\) and \(\varrho ^{w}(F) = \frac{1}{H} {\textstyle \sum _{i=1}^{H}} |F_{k,i,j} |\). The global and positional distillation loss \(L_{GPD}\) is composed of two components: the global distillation loss \(L_{GD}\), which encourages the student network to mimic the spatial and channel attention of the teacher network, and the positional distillation loss \(L_{PD}\), which encourages the student network to mimic the horizontal and vertical attention of the teacher network. The equations for \(L_{GD}\) and \(L_{PD}\) are as follows.

$$\begin{aligned} L_{GD}&= L_{2}(\varrho ^{s}(F^{T}), \varrho ^{s}(F^{S})) + L_{2}(\varrho ^{c}(F^{T}), \varrho ^{c}(F^{S})) \end{aligned}$$
(1)
$$\begin{aligned} L_{PD}&= L_{2}(\varrho ^{h}(F^{T}), \varrho ^{h}(F^{S})) + L_{2}(\varrho ^{w}(F^{T}), \varrho ^{w}(F^{S})) \end{aligned}$$
(2)

Here \(L_{2}\) represents the \(L_{2}\) norm loss, \(F^{T}\) represents the feature maps of the teacher network, and \(F^{S}\) represents the feature maps of the student network. The equation for \(L_{GPD}\) and the loss function of the student network are shown in Eqs. (3) and (4), respectively.

$$\begin{aligned} L_{GPD}&= L_{GD} + L_{PD} \end{aligned}$$
(3)
$$\begin{aligned} Loss&= CrossEntropy(q,Y) + \alpha \cdot L_{GPD} \end{aligned}$$
(4)

Here CrossEntropy refers to the cross-entropy loss, q represents the prediction of the student network, and \(\alpha \) is utilized to adjust the weight of the teacher knowledge.
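As a reference, a minimal PyTorch sketch of the four attention maps and the GPD loss of Eqs. (1)–(4) is given below. The \(L_{2}\) loss is realized here as a mean squared error, the teacher and student feature maps are assumed to have already been aligned in shape, and the variable names are illustrative rather than part of any released implementation.

```python
import torch
import torch.nn.functional as F

def attention_maps(feat):
    """Attention maps of a feature map with shape (B, C, H, W):
    spatial (B,1,H,W), channel (B,C,1,1), vertical (B,C,H,1) and
    horizontal (B,C,1,W), each an average of absolute activations
    over the remaining dimensions, as in Sect. 3.1."""
    a = feat.abs()
    a_s = a.mean(dim=1, keepdim=True)          # rho^s: average over channels
    a_c = a.mean(dim=(2, 3), keepdim=True)     # rho^c: average over H and W
    a_h = a.mean(dim=3, keepdim=True)          # rho^h: average over width
    a_w = a.mean(dim=2, keepdim=True)          # rho^w: average over height
    return a_s, a_c, a_h, a_w

def gpd_loss(feat_t, feat_s):
    """Global and positional distillation loss (Eqs. 1-3)."""
    maps_t = attention_maps(feat_t.detach())   # no gradient through the teacher
    maps_s = attention_maps(feat_s)
    l_gd = F.mse_loss(maps_s[0], maps_t[0]) + F.mse_loss(maps_s[1], maps_t[1])
    l_pd = F.mse_loss(maps_s[2], maps_t[2]) + F.mse_loss(maps_s[3], maps_t[3])
    return l_gd + l_pd

# Student objective of Eq. (4); alpha = 1.2 follows the setting in Sect. 4.1
# loss = F.cross_entropy(logits_s, labels) + 1.2 * gpd_loss(feat_t, feat_s)
```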

Fig. 1
figure 1

Details of global and positional distillation. Global distillation generates the spatial attention map and channel attention map by performing average pooling across the spatial and channel dimensions, respectively. On the other hand, positional distillation generates the horizontal attention map and vertical attention map by performing average pooling across the width and height dimensions, respectively. The student is encouraged to mimic the attention maps of the teacher network. The GPD loss is then applied to feature maps with different resolutions. The term "Avgpool" is used to denote the operation of average pooling across the corresponding dimension

3.2 Teacher framework for distillation

Figure 2 shows the process of obtaining the teacher feature maps. First, the feature maps at different stages of the original network are fused by the selective dense feature connections (SDFC) module, which adaptively weights the feature maps of different stages. Note that downsampling of feature maps is done with a 3 \(\times \) 3 depthwise separable convolution (DSConv) [19]. The fused feature map is then integrated back into the network. Second, to reduce computational complexity and speed up training, the channel dimension of the feature maps at different stages is mapped to a lower dimension with a 1 \(\times \) 1 convolution (the channel number is set to 128 or 256 in the experiments, and Table 12 shows that the distillation effect is not sensitive to the channel number of the feature maps). Finally, the feature map generated by the 1 \(\times \) 1 convolution at the deepest layer is upsampled with deconvolution and fused with the feature maps of shallower layers by element-wise summation to obtain teacher feature maps at different resolutions. The loss function of the teacher network is not modified and can be written as \(Loss = CrossEntropy(q,Y)\), where q, Y and CrossEntropy represent the prediction of the network, the ground-truth labels and the cross-entropy loss function, respectively.
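The following PyTorch sketch illustrates only the channel reduction and top-down fusion just described; module and variable names are our own, a single deconvolution is reused across levels for brevity, and each stage is assumed to halve the spatial resolution. The SDFC module itself is sketched after the description of Fig. 3 below.

```python
import torch
import torch.nn as nn

class TeacherFeaturePyramid(nn.Module):
    """Sketch of the top-down path in Fig. 2 (names are ours).

    Each stage output is reduced to `mid_ch` channels with a 1x1
    convolution; the deepest map is upsampled with a deconvolution and
    added element-wise to the next shallower map, producing the teacher
    feature maps (FeatureTea) at several resolutions."""

    def __init__(self, in_channels, mid_ch=256):
        super().__init__()
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, mid_ch, kernel_size=1) for c in in_channels])
        # stride-2 deconvolution doubles the spatial resolution
        self.up = nn.ConvTranspose2d(mid_ch, mid_ch, kernel_size=2, stride=2)

    def forward(self, stage_feats):
        # stage_feats: backbone stage outputs, ordered shallow -> deep
        reduced = [conv(f) for conv, f in zip(self.reduce, stage_feats)]
        feature_tea = [reduced[-1]]
        prev = reduced[-1]
        for f in reversed(reduced[:-1]):
            prev = f + self.up(prev)           # element-wise summation
            feature_tea.append(prev)
        return feature_tea[::-1]               # shallow -> deep order
```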

As depicted in Fig. 3, SDFC is a module that fuses features from different stages. The fusion process is as follows. First, the feature maps from different stages are fused by element-wise summation, formulated as \(M = {\textstyle \sum _{i=1}^{N}} F_{i} \), where M, \(F_{i}\) and N represent the preliminary fused feature map, the feature map of the \(i_{th}\) stage and the number of feature maps to be fused, respectively. Then, channel-level average pooling is used to obtain an attention map, expressed as \(F_{s} = \frac{1}{C} {\textstyle \sum _{i=1}^{C}} M(i,h,w)\), where \(F_{s}\), C, h and w represent the attention map, the channel number, the height dimension and the width dimension, respectively. Finally, the weights of the different feature maps are obtained through one MLP layer and a softmax operation, represented by \(W = Softmax(MLP(F_{s}))\), where W, MLP and Softmax refer to the weights of the different feature maps, the MLP layer and the softmax operation, respectively.
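Under our reading of this description, a minimal SDFC sketch in PyTorch is as follows; the single linear layer over the flattened pooled map is our interpretation of "one MLP layer", so the spatial size must be fixed at construction time.

```python
import torch
import torch.nn as nn

class SDFC(nn.Module):
    """Sketch of selective dense feature connections (Fig. 3):
    element-wise summation, channel-level average pooling, then an MLP
    layer and a softmax yielding one fusion weight per input map."""

    def __init__(self, height, width, n_inputs):
        super().__init__()
        self.mlp = nn.Linear(height * width, n_inputs)

    def forward(self, feats):
        # feats: list of N feature maps with identical shape (B, C, H, W)
        stacked = torch.stack(feats, dim=0)            # (N, B, C, H, W)
        m = stacked.sum(dim=0)                         # preliminary fusion M
        f_s = m.mean(dim=1).flatten(1)                 # channel-level avg pool -> (B, H*W)
        w = torch.softmax(self.mlp(f_s), dim=1)        # (B, N) fusion weights
        w = w.t().reshape(len(feats), -1, 1, 1, 1)     # broadcast over C, H, W
        return (stacked * w).sum(dim=0)                # adaptively weighted fusion
```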

Fig. 2
figure 2

Details of the teacher framework. Here we take Resnet with the teacher framework as an example. ResStage indicates the different stages of Resnet. SDFC is utilized to adaptively fuse feature maps at different stages. FeatureTea denotes the feature maps used to guide the training of student networks. The 1 \(\times \) 1 convolution is utilized to adjust the channel number of the feature maps, thus reducing computational complexity. To resolve the discrepancies in width and height between the feature maps of the student and teacher networks, deconvolution is employed as an upsampling technique to align the dimensions of the feature maps

Fig. 3
figure 3

Selective dense feature connections module (C, H, W and N denote channel number, height, width and the number of feature maps that need to be fused, respectively)

3.3 Two-stage distillation

Self-distillation is an effective technique for improving network performance, and a more powerful network has usually learned better features. Building on this, we propose a two-stage distillation method. In the first stage, we train the proposed teacher framework with self-distillation, aiming to let the teacher network learn better features. In the second stage, we use the well-trained teacher to transfer its knowledge via the global and positional distillation loss. As shown in Fig. 4, the teacher self-distillation training framework is indicated by gray boxes. Several branches are added at various stages of the original network for self-distillation. These branches include an attention module, a feature alignment layer and a classifier, configured in the same manner as in [20]. Subsequently, during the network's forward propagation, downsampling and SDFC are used to integrate feature maps from different stages and obtain the final multi-scale feature maps for prediction. Finally, the obtained multi-scale feature maps and predictions are used for distillation. The loss function for self-distillation can be expressed as:

$$\begin{aligned} \begin{aligned} Loss&= \sum _{i=1}^c \Big ( (1-\alpha ) \cdot CrossEntropy(q^i, y) \\&\quad + \alpha \cdot KL(q^i, q^c) \\&\quad + \lambda \cdot \left\| F_i - F_c\right\| _2^2 \Big ) \end{aligned} \end{aligned}$$
(5)
$$\begin{aligned} {F_c=S D F C\left( F_1, \ldots , F_n\right) } \end{aligned}$$
(6)

Here c denotes the number of classifiers. \(q^{i}\) and \(q^{c}\) represent the predictions of the \(i_{th}\) classifier and the deepest classifier, respectively. y indicates the ground-truth labels. \(F_{i}\) and \(F_{c}\) denote the feature maps of the \(i_{th}\) classifier and the deepest classifier. CrossEntropy and KL denote the cross-entropy loss and the Kullback–Leibler divergence, respectively. \(\alpha \) and \(\lambda \) are utilized to adjust the weight of the teacher knowledge.
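A minimal PyTorch sketch of the self-distillation objective of Eqs. (5) and (6), under our reading, is shown below. The feature maps are assumed to have already been aligned to a common shape by the feature alignment layers, the targets are detached (a common choice in self-distillation), and the values of alpha and lambda are placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(logits, feats, labels, sdfc, alpha=0.3, lam=0.03):
    """Sketch of Eqs. (5)-(6). `logits` and `feats` are lists ordered from
    the shallowest auxiliary classifier to the deepest one; `sdfc` is the
    fusion module of Fig. 3."""
    q_c = logits[-1].detach()                  # deepest prediction q^c
    f_c = sdfc(feats).detach()                 # fused feature map F_c (Eq. 6)
    loss = 0.0
    for q_i, f_i in zip(logits, feats):
        ce = F.cross_entropy(q_i, labels)
        kl = F.kl_div(F.log_softmax(q_i, dim=1),
                      F.softmax(q_c, dim=1),
                      reduction="batchmean")
        feat_l2 = (f_i - f_c).pow(2).mean()    # squared L2 term, averaged over elements
        loss = loss + (1 - alpha) * ce + alpha * kl + lam * feat_l2
    return loss
```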

Fig. 4
figure 4

Details of the two-stage distillation framework. Here we take Resnet as an example. ResStage refers to the different stages of ResNet. SDFC is a feature fusion module that can adaptively adjust the weights of feature maps to be fused. Attention refers to an attention mechanism, while the feature alignment layer is used to align feature maps for distillation. FeatureTea denotes the feature maps used to guide the training of student networks. GPDLoss indicates the global and positional distillation loss. The 1 \(\times \) 1 convolution layer is utilized to adjust the channel number of the student and teacher for distillation

4 Experiments

DFGPD is a novel distillation framework that can be easily applied to different models for classification and segmentation tasks. In this paper, we conducted experiments on two tasks: classification and semantic segmentation. We used different datasets and models for the validation experiments of each task, and all models in both tasks achieved significant improvements through DFGPD. All experiments were conducted on a single NVIDIA GeForce RTX 4090 under Linux, with CUDA 11.7 as the backend and models implemented in PyTorch 1.12.1.

4.1 Classification

In the classification task, DFGPD has been evaluated on three datasets: CIFAR-100, Tiny-ImageNet and ImageNet [21]. The teacher–student pairs consist of networks with the same architecture and different architectures. The benchmark networks include seven networks with different lengths and widths, namely ResNet [22], WideResNet [23], VGG [24], MobileNetV1 [19], MobileNetV3 [25], ShuffleNetV1 [26], and ShuffleNetV2 [27]. To demonstrate its effectiveness, our method is compared with six other distillation methods, namely KD [17], DKD [28], MGD [29], USKD [30], LSKD [31] and SRD [32].

To prevent overfitting, data augmentation (image flipping, scaling and cropping), early stopping and L2 regularization were used during training. The networks are optimized using SGD with momentum. On CIFAR-100, all networks were trained for 200 epochs, with the learning rate divided by 10 at the 75th, 130th and 180th epochs; the batch size and initial learning rate were set to 128 and 0.1, respectively. On Tiny-ImageNet, all networks were trained for 100 epochs, with the learning rate divided by 10 at the 30th, 60th and 90th epochs; the batch size and initial learning rate were set to 64 and 0.1, respectively. On ImageNet, all networks were trained for 100 epochs with a weight decay of 0.0001; the learning rate is initialized to 0.1 and decayed every 30 epochs, and the batch size is set to 32. The hyper-parameter \(\alpha \) in our method is set to 1.2.
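For concreteness, a sketch of the CIFAR-100 optimization schedule described above is given below; the stand-in model, the momentum and the weight decay value are placeholders, since they are not fully specified in the text.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

student = nn.Linear(3 * 32 * 32, 100)        # stand-in for the student network
optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# learning rate divided by 10 at the 75th, 130th and 180th epochs
scheduler = MultiStepLR(optimizer, milestones=[75, 130, 180], gamma=0.1)

for epoch in range(200):
    # ... one training epoch over CIFAR-100 with the combined loss of Eq. (4) ...
    scheduler.step()
```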

4.2 Classification results

Comparison experiments on CIFAR100 The experimental results on CIFAR-100 verify the effectiveness of DFGPD compared to other state-of-the-art distillation methods. In Tables 1 and 2, we compare the accuracy, recall and precision of various distillation methods. Specifically, Table 1 contains the results of teacher–student pairs with the same architecture, while Table 2 shows the results of pairs with different architectures. It can be observed that: (1) among teacher–student pairs with the same architecture, both our two-stage distillation method (DFGPD) and the one-stage distillation method (OGPD) consistently outperformed the other methods across all pairs. Notably, two-stage distillation achieved its greatest increase in the three evaluation metrics on the Resnet50-Resnet32 teacher–student pair, with recall, accuracy and precision improving by 5.15%, 5.26% and 5.18%, respectively, while one-stage distillation achieved 4.71%, 4.76% and 4.76%, respectively; (2) compared to other KD methods where the teacher and student share the same network architecture, both of our methods show significant improvements. Specifically, DFGPD achieves an average increase of 3.69% in accuracy, 3.76% in recall and 3.76% in precision, with accuracy improvements ranging from 2.64 to 5.15%, recall from 2.74 to 5.26% and precision from 2.78 to 5.18%. This demonstrates that our methods reach comparable or superior performance to the state-of-the-art KD methods for teacher–student pairs with identical architectures; (3) when the teacher and student come from different series, our two methods achieve the highest performance improvement in the three evaluation metrics on the Resnet34-MobilenetV3 teacher–student pair, namely 10.23%, 10.22% and 9.72%, respectively; (4) compared to other KD methods, regardless of whether the teacher and student share the same architecture, our methods are the best. Specifically, when the teacher and student have the same architecture, averaged over accuracy, recall and precision, DFGPD outperforms the other KD methods by 1.96%, 1.82% and 1.97%, respectively, while OGPD surpasses them by 1.63%, 1.54% and 1.58%, respectively. When the teacher and student have different architectures, DFGPD outperforms the other KD methods by 2.59%, 2.63% and 2.65% on average, while OGPD surpasses them by 2.35%, 2.39% and 2.40%. This shows that both of our distillation strategies offer superior performance compared to the state-of-the-art distillation methods.

Table 1 Comparison with other distillation methods on CIFAR100
Table 2 Comparison with other distillation methods on CIFAR100

Comparison experiments on Tiny-ImageNet Our experimental results on Tiny-ImageNet demonstrate the effectiveness of our two-stage distillation method (DFGPD) and the one-stage distillation method (OGPD). Table 3 shows experiments conducted on networks with the same architecture and those with different architectures. The following observations can be made: (1) the one-stage distillation method (OGPD) is effective for teacher–student pairs with both the same and different architectures. (2) It is worth noting that on the Resnet34-MobilenetV3 teacher–student pair, both of our methods achieved performance improvements exceeding 10% across the three evaluation metrics. For the two-stage distillation, accuracy, recall and precision improved by 10.75%, 11.00% and 10.92%, respectively, while for the one-stage distillation the improvements were 10.50%, 10.67% and 10.57%, respectively. (3) Compared to other KD methods, regardless of whether the teacher and student have the same or different architectures, our methods are the best. Specifically, averaged over accuracy, recall and precision, DFGPD outperforms the other KD methods by 2.41%, 2.37% and 2.52%, respectively, while OGPD surpasses them by 2.74%, 2.72% and 2.81%, respectively. This shows that both of our distillation strategies offer superior performance compared to the state-of-the-art distillation methods.

Table 3 Comparison with other distillation methods on Tiny-ImageNet

Comparison experiments on ImageNet On ImageNet, we compared our methods (DFGPD and OGPD) with other advanced distillation methods to demonstrate the effectiveness of our approach. Table 4 shows experiments conducted on networks with the same architecture and those with different architectures. The following observations can be made: (1) our proposed two-stage distillation (DFGPD) and one-stage distillation (OGPD) methods are effective on ImageNet for teacher–student pairs with both identical and different architectures; (2) our methods achieve the greatest improvement in the three evaluation metrics on the Resnet34-Resnet18 teacher–student pair: for DFGPD, accuracy, recall and precision increase by 3.61%, 3.59% and 3.66%, respectively, and for OGPD by 3.22%, 3.24% and 3.26%, respectively; (3) compared to other KD methods, regardless of whether the teacher and student have the same or different architectures, the two-stage distillation (DFGPD) is the best, with mean improvements of 1.39% in accuracy, 1.42% in recall and 1.23% in precision. Even without self-distillation of the teacher, the one-stage distillation (OGPD) method achieves a higher performance improvement than the other distillation methods, with mean increases of 0.85% in accuracy, 1.01% in recall and 0.82% in precision.

Table 4 Comparison with other distillation methods on ImageNet

4.3 Segmentation

In the semantic segmentation task, we conducted experiments across two datasets and a variety of network architectures.

Datasets and data augmentation methods (1) Cityscapes [33] is a dataset for urban scene parsing that includes 5000 finely annotated images covering 19 classes, of which 2975/500/1525 images are used for training/validation/testing. (2) Pascal VOC [34] is a visual object segmentation dataset with 20 foreground object classes and one background class. It contains 10582/1449/1456 images for training/validation/testing, respectively.

Network architecture For all experiments, we use the segmentation framework DeepLabV3 [35] with a ResNet50 backbone as a powerful teacher network. Specifically, we initialize the backbone of the teacher network with ImageNet [21] pre-trained weights. For the student networks, we use different segmentation architectures to verify the effectiveness of the distillation method. Specifically, we employed DeepLabV3 (DLV3) and PSPNet [36] with different backbones such as ResNet-34 (Res34), ResNet-18 (Res18), and MobileNetV2 (MV2).

Training details To prevent overfitting, data augmentation including image flipping, scaling and cropping was used. All experiments were optimized with SGD with a momentum of 0.9, a batch size of 4 and an initial learning rate of 0.02. The learning rate is adjusted according to a polynomial decay strategy. For the crop size during training, we used 512 \(\times \) 1024 for Cityscapes and 512 \(\times \) 512 for Pascal VOC.
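The polynomial decay mentioned above can be written as a small helper; the power of 0.9 is a conventional choice in semantic segmentation and is an assumption here, not a value stated in the text.

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """Polynomial learning-rate decay; power=0.9 is assumed."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# e.g. starting from the initial rate of 0.02 used above
# lr = poly_lr(0.02, cur_iter=10_000, max_iter=40_000)
```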

Comparative distillation methods We compared our proposed DFGPD with state-of-the-art segmentation distillation methods: SKD [37], IFVD [38], CIRKD [39] and CWD [40]. We re-ran all methods using the code provided by the authors, and all methods use the same pre-trained teacher, DeepLabV3-ResNet50. The results on the two segmentation datasets show that DFGPD achieves comparable or superior performance to the other methods.

4.4 Segmentation results

Comparison experiments on Cityscapes In Table 5, we compare our two proposed distillation methods (DFGPD, a two-stage distillation, and OGPD, a one-stage distillation without teacher self-distillation) with state-of-the-art distillation methods on Cityscapes in terms of validation (Val) and test (Test) mIoU. It can be observed that: (1) under the supervision of the teacher, all methods improved the student networks; our DFGPD achieved the best segmentation performance across student networks with similar or different architectural styles, and OGPD also outperformed previous KD methods; (2) in the best cases, DFGPD achieved performance improvements of 3.3% and 3.34% on the teacher–student pairs of DeepLabV3-ResNet50 and PSPNet-ResNet34, respectively, while OGPD achieved improvements of 3.14% and 3.10% on the same pairs; (3) compared to the other KD methods, our distillation methods showed an average improvement of 1.57% in Val mIoU and 1.58% in Test mIoU for DFGPD, and 1.78% in Val mIoU and 1.80% in Test mIoU for OGPD.

Table 5 Comparing other distillation methods across various student segmentation networks on Cityscapes

Comparison experiments on Pascal VOC We compared our two proposed distillation methods (DFGPD, a two-stage distillation, and OGPD, a one-stage distillation without teacher self-distillation) with state-of-the-art distillation methods on Pascal VOC in terms of validation (Val) and test (Test) mIoU. As shown in Table 6, it can be observed that: (1) under the supervision of the teacher, all methods improved the student networks; our DFGPD achieved the best segmentation performance across student networks with similar or different architectural styles, and OGPD also outperformed previous KD methods; (2) in the best cases, DFGPD achieved performance improvements of 3.40% and 3.45% on the teacher–student pairs of DeepLabV3-ResNet50 and DeepLabV3-ResNet34, respectively, while OGPD achieved improvements of 2.80% and 2.67% on the same pairs; (3) compared to the other KD methods, our distillation methods showed an average improvement of 1.57% in Val mIoU and 1.58% in Test mIoU for DFGPD, and 1.34% in Val mIoU and 1.47% in Test mIoU for OGPD.

Table 6 Comparing other distillation methods across various student segmentation networks on Pascal VOC

4.5 Ablation study and sensitivity study

Effect of teacher framework on other distillation methods We conducted experiments on various types of distillation methods, reported in Tables 7 and 8 for classification and segmentation tasks, respectively, to demonstrate the ability of the teacher framework to enhance performance. These include six knowledge distillation methods for classification and five for segmentation. The results indicate that the teacher framework is effective (1) for both logit-based (such as KD and DKD) and feature-based (such as MGD) distillation methods, and (2) for both classification and segmentation tasks, showing that it is broadly applicable to other distillation methods without any modification. In the classification task, the teacher framework improved the performance of the other distillation methods by an average of 0.86% and 2.00% on the two sets of teacher–student pairs, and in the segmentation task by 0.63% and 0.60%, respectively, demonstrating its effectiveness in enhancing the distillation effect. Surprisingly, in the classification task, the teacher framework further enhances the DKD distillation effect by 6.39%. Despite the increased teacher–student capacity gap when the teacher framework is used, all distillation methods show improved performance in both classification and segmentation tasks, with improvements ranging from 0.34 to 6.39%. This suggests that the proposed teacher framework can still outperform the original teacher network even with an increased teacher–student capacity gap. This is noteworthy because extensive research [15, 16] has shown that larger models are not always better teachers, as the teacher–student capacity gap can affect the distillation effect.

Table 7 Effect of teacher frameworks on other distillation methods in classification tasks
Table 8 Effect of teacher frameworks on other distillation methods in segmentation tasks

Effect of two-stage distillation We conducted experiments on CIFAR-100 to demonstrate the effect of two-stage distillation. The results are reported in Table 9, with a benchmark of five teacher–student pairs. It can be observed that: (1) the combination of the teacher framework and self-distillation (T-SD) improves the accuracy of the teacher networks by an average of 3.69%, with a 3.22% increase on Resnet34 and a 3.95% increase on Resnet50; (2) two-stage distillation enhances the performance of the student by an average of 4.28%; (3) compared to one-stage distillation, two-stage distillation brings additional performance improvements to teacher–student pairs with both the same and different architectures, indicating that two-stage distillation can further improve the distillation effect.

Table 9 Effect of two-stage distillation

Ablation study on different components of DFGPD Table 10 reports detailed ablation studies on CIFAR100 to demonstrate the effect of the different components of the proposed method. The teacher and student networks are Resnet34 and Resnet18, respectively. It can be observed that: (1) global distillation and positional distillation increase the accuracy of Resnet18 by 1.85% and 1.82%, respectively, and their combination yields a 2.37% accuracy improvement, indicating that each type of distillation is individually effective; (2) with the help of the teacher framework, the accuracy of Resnet18 is further improved by 0.91%, and two-stage distillation adds another 0.33% on this basis, showing that two-stage distillation can further improve the distillation effect; (3) the combination of GPD, the teacher framework and two-stage distillation enhances the performance of Resnet18 by 3.61%, which is achieved jointly by all components of DFGPD.

Table 10 Ablation study on different components of DFGPD

Sensitivity study on hyper-parameters Only one hyper-parameter is introduced in DFGPD. Table 11 reports sensitivity experiments on CIFAR100 with two teacher–student pairs: Resnet34-Resnet18 and Resnet34-Resnet10. It can be observed that: (1) for Resnet34-Resnet18, the worst hyper-parameter leads to a 0.38% accuracy drop compared to the highest accuracy, which is still 2.9% higher than the baseline; (2) for Resnet34-Resnet10, the accuracy obtained by the worst hyper-parameter is 0.55% lower than the highest accuracy, which is still 2.86% higher than the baseline. The results indicate that our method is not sensitive to the choice of hyper-parameter, so there is no need to spend much time searching for hyper-parameters to obtain a good distillation effect.

Table 11 Sensitivity study on hyper-parameters

Sensitivity study on channel number As shown in Fig. 2, the 1 \(\times \) 1 convolution is utilized to adjust the number of channels in the feature maps, which reduces computational complexity and speeds up training. To investigate the impact of the channel number on the distillation effect, sensitivity experiments were conducted on two teacher–student pairs: Resnet34-Resnet18 and Resnet34-MobilenetV1. Table 12 presents the distillation effect when the number of channels is adjusted from 128 to 512. The results indicate that the worst accuracies are only 0.27% and 0.32% lower than the top accuracies on Resnet34-Resnet18 and Resnet34-MobilenetV1, respectively, while still higher than the baseline by 3.01% and 4.86%. This suggests that our method is insensitive to the number of channels in the teacher feature maps. Therefore, a smaller number of channels can be selected to accelerate distillation with little accuracy penalty.

Table 12 Sensitivity on channel number of feature maps for teacher networks

4.6 Extension

In this section, we show that DFGPD can, to some extent, alleviate the issue that "large models are not always good teachers." We then conducted ablation studies, visualization experiments, and an analysis of the backpropagation of the network after introducing SDFC. Through these experiments, we discuss why DFGPD performs well and can mitigate this problem. Specifically, the reasons are: (1) SDFC achieves better feature representations by effectively integrating feature maps from different stages; (2) self-distillation enables the network to obtain better feature expression; (3) global and positional distillation transfers the knowledge of the teacher network more effectively.

Alleviating the influence of teacher–student capacity gap on distillation It has been demonstrated in many previous studies [15, 16] that bigger models are not always better teachers. Specifically, larger teacher networks may result in worse performance than smaller ones. These studies attributed this phenomenon to the model capacity gap between teacher and student and proposed teacher assistants [16] or early stopping [15] to alleviate it. However, we found that DFGPD can effectively alleviate this issue. In Table 13, we conducted experiments on a series of teacher models. It can be observed that: (1) when the teacher–student gap increases from 6.28 to 18.76 M, the performance of student networks distilled with vanilla KD drops from 76.98 to 75.93% and then rises to 77.08%. This indicates that as the capacity of the teacher model increases, the performance improvement brought by vanilla KD oscillates, which is consistent with the results in [15, 16]; (2) when the teacher–student gap increases from 6.28 to 18.76 M, the performance of student networks distilled with DFGPD increases from 78.53 to 79.07%, indicating that our method can alleviate the bigger-models-are-not-always-better-teachers issue. This also implies that better distillation effects can be obtained without spending much time finding a suitable teacher for the student network; (3) when only global and positional distillation is used, the performance of the student network gradually improves from 77.53 to 77.94% as the capacity gap increases from 6.28 to 18.76 M, indicating that global and positional distillation transfers the teacher's knowledge more effectively; (4) when only the proposed teacher framework is used, the performance of the student network also gradually improves from 77.34 to 77.83% as the capacity gap increases from 6.28 to 18.76 M, indicating that the proposed teacher framework provides better feature expression. In summary, the reasons why DFGPD can alleviate, to a certain extent, the issue that "large models are not always good teachers" are as follows: (1) SDFC enables the network to effectively integrate feature maps from different stages during training, and self-distillation allows the teacher network to optimize its feature representation before knowledge transfer; their combined effect allows the network to produce better feature maps; (2) global and positional distillation transfers the teacher's knowledge efficiently: global distillation focuses on transferring overall features, while positional distillation concentrates on reinforcing features at specific spatial locations. This combination not only enhances the student network's learning efficiency for key features but also strengthens its ability to recognize features across different regions.

Table 13 The effect of vanilla KD and our method when teacher networks with different capacity are used for distillation

Visualization experiments Figure 5 shows the feature maps used for distillation (i.e., FeatureTea in Fig. 4) for Resnet34 with and without the teacher framework and after self-distillation. It can be observed that: (1) compared to Resnet34 without the teacher framework, Resnet34 with the teacher framework captures better features. For example, in teacher feature map 3 of the first set of images, Resnet34 with the teacher framework pays more attention to the body of the koala, while the attention of plain Resnet34 is partly distracted to the background. Moreover, in teacher feature maps 2 and 3 of the second set of images, Resnet34 with the teacher framework pays more attention to the face of the red panda, whereas the attention of plain Resnet34 is partly distracted elsewhere. (2) Compared to Resnet34 with the teacher framework, the distilled Resnet34 captures better features, especially in teacher feature maps 3 and 4: although the distilled and undistilled networks both focus on the correct regions (i.e., the body of the koala and the face of the red panda), the features captured by the distilled Resnet34 are clearly more evident. Based on the above discussion, it can be concluded that: (1) the ability of the teacher framework and two-stage distillation to enhance distillation can be attributed to the fact that the teacher network captures better features with them; (2) the performance gain of two-stage distillation is mainly attributed to the better teacher feature maps 3 and 4.

Fig. 5
figure 5

Visualization experiment. Two sets of images were used for the experiment. The first row of images in each set shows the different teacher feature maps of Resnet34. The second row shows the different teacher feature maps of Resnet34 with a teacher framework. The third row shows the different teacher feature maps of Resnet34 distilled using our method. The teacher feature map here refers to FeatureTea in Fig. 4. Here Resnet34, Resnet34 with TF and distilled Resnet34 denote the vanilla Resnet34, Resnet34 equipped with the teacher framework, and Resnet34 equipped with the teacher framework and trained with self-distillation, respectively

Analysis of backpropagation in the teacher framework The visualization experiments have shown that the teacher framework produces better feature maps. To further explain its effectiveness, we analyze backpropagation in the teacher framework, taking part of the selective dense feature connections as an example. Figure 6a, b show the last BasicBlock of stages 1 and 2 in Resnet18, and the last BasicBlock of stages 1 and 2 in Resnet18 with the teacher framework, respectively.

Fig. 6
figure 6

The last BasicBlock of stages 1 and 2 of Resnet18 and Resnet18 equipped with teacher framework. Note that BasicBlock is the basic unit of Resnet. The x denotes the output feature map of the last BasicBlock of stage1. G(x) represents the input of the last BasicBlock of stage2. The F(G(x)) indicates the output feature map of the 3 \(\times \) 3 convolution layer. H(x) represents the output of stage2 of Resnet18 without teacher framework. \(F_s\left( x\right) \) and D(x) are the input of SDFC. \(H_{t}(x)\) denotes the output of the stage2 of the Resnet18 with the teacher framework. The feature fusion path denotes the path that feature maps are fused with SDFC. The integration path indicates the original output feature map of stage2 is replaced with the fused feature map

As shown in Fig. 6a, the equation for H(x) can be formulated as \(H\left( x\right) =G\left( x\right) +F\left( G\left( x\right) \right) \). The partial derivative result for H(x) can be denoted as Eq. (7). Representing the loss function with \(\xi \), according to the chain rule of backpropagation [41], the gradient can be formulated as Eq. (8).

$$\begin{aligned} H'(x)= & \frac{\partial H(x)}{\partial x} = \left( 1+\frac{\partial F\left( G\left( x\right) \right) }{\partial G\left( x\right) }\right) \frac{\partial G\left( x\right) }{\partial x} \end{aligned}$$
(7)
$$\begin{aligned} \frac{\partial \xi }{\partial x}= & \frac{\partial \xi }{\partial H\left( x\right) }\frac{\partial H\left( x\right) }{\partial x}=\frac{\partial \xi }{\partial H\left( x\right) }\left( 1+\frac{\partial F\left( G\left( x\right) \right) }{\partial G\left( x\right) }\right) \frac{\partial G\left( x\right) }{\partial x} \end{aligned}$$
(8)

As indicated in Fig. 6b, the equation for \(H_{t}(x)\) can be formulated as \(H_{t}(x) = SDFC(F_{s}(x),D(x))\), where \(D(x)=G(x)+F(G(x))\). As described in Sect. 3.2, SDFC adaptively adjusts the weights of feature maps at different stages. Suppose that the weights for feature maps of stage1 and stage2 are \(W_1\) and \(W_2\), respectively. Then \(H_{t}(x)\) can be denoted as \(H_{t}(x)=W_{1}F_{s}(x)+W_{2}(G(x)+F(G(x)))\). The partial derivative result for \(H_{t}(x)\) can be represented as Eq. (9). Denoting the loss function as \(\xi \), according to the chain rule of backpropagation, the gradient can be denoted as Eq. (10).

$$\begin{aligned} H_{t}'(x)= & \frac{\partial H_{t}(x)}{\partial x}=W_{1}\frac{\partial F_{s}(x)}{\partial x}+W_{2}(1+\frac{\partial F(G(x))}{\partial G(x)})\frac{\partial G(x)}{\partial x} \end{aligned}$$
(9)
$$\begin{aligned} \frac{\partial \xi }{\partial x}&= \frac{\partial \xi }{\partial H_{t}(x)}\frac{\partial H_{t}(x) }{\partial x}\nonumber \\&= \frac{\partial \xi }{\partial H_{t}(x)}\left( W_{1}\frac{\partial F_{s}(x)}{\partial x}+W_{2}\left( 1+\frac{\partial F(G(x))}{\partial G(x)}\right) \frac{\partial G(x)}{\partial x}\right) \end{aligned}$$
(10)

From Eqs. (7) and (9), it can be observed that: (1) compared to \(H'(x)\), \(H_{t}'(x)\) can adaptively adjust the gradient according to the importance of the different feature maps, owing to the weights \(W_{1}\) and \(W_{2}\). Note that \(W_1\) and \(W_2\) are the weights of \(F_{s}(x)\) and D(x) generated by SDFC; the more important \(F_{s}(x)\) is, the larger \(W_{1}\) becomes, meaning that the term \(W_{1}\frac{\partial F_{s}(x)}{\partial x}\) plays a more important role in network optimization; (2) the term \(W_{1}\frac{\partial F_{s}(x)}{\partial x}\) means that information can be propagated directly through the 3 \(\times \) 3 DSConv layer, so features learned at deep layers have a direct influence on the shallow layers. Since features at deep layers contain rich semantic information, Resnet18 with the teacher framework can learn better features. Based on the above discussion, it can be concluded that SDFC and DSConv enhance gradient backpropagation. Therefore, networks equipped with the teacher framework extract better features, which is consistent with the results of the visualization experiments. The strong effect of the teacher framework can thus be attributed to its effective gradient backpropagation and better feature extraction ability.
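To make the two gradient terms of Eq. (9) concrete, the toy autograd check below uses scalar linear maps as stand-ins for the DSConv path \(F_{s}\), the mapping G and the residual branch F, with fixed fusion weights in place of SDFC; it merely illustrates that both paths contribute to \(\partial \xi /\partial x\), each scaled by its weight.

```python
import torch

x = torch.randn(4, requires_grad=True)
w1, w2 = 0.6, 0.4               # stand-ins for the SDFC weights W1 and W2
f_s = 2.0 * x                   # direct (DSConv) path F_s(x)
g = 3.0 * x                     # G(x), input of the last BasicBlock of stage2
d = g + 0.5 * g                 # D(x) = G(x) + F(G(x)) with F(.) = 0.5 * (.)
h_t = w1 * f_s + w2 * d         # H_t(x) = W1*F_s(x) + W2*D(x)

h_t.sum().backward()
# dH_t/dx = w1*2 + w2*(3 + 0.5*3) = 1.2 + 1.8 = 3.0 for every element
print(x.grad)                   # tensor([3., 3., 3., 3.])
```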

5 Conclusion

In this paper, we introduce a new distillation framework, DFGPD, which consists of global and positional distillation, a teacher framework, and a two-stage distillation method. The advantages of DFGPD are as follows: (1) it fuses feature maps from different stages using a selective dense feature connection module and enhances the network’s feature representation through self-distillation; (2) it efficiently transfers knowledge by utilizing the global information of the feature maps from the teacher and student networks across channel and spatial dimensions, as well as positional information along the width and height dimensions. DFGPD can effectively enhance the performance of the student network in classification and segmentation tasks without increasing the complexity of the student network.

Limitations and future research directions Despite the improvements in distillation performance achieved by DFGPD, it also has some limitations. A primary challenge is that self-distillation and the upsampling process reduce the efficiency of the overall distillation process. Moreover, for object detection, more refined distillation strategies need to be considered to improve the performance of DFGPD. This is mainly due to the inherent complexity of object detection, where factors such as the ratio of foreground to background losses, as well as the ratio of losses for large and small objects, must be carefully considered when designing the distillation method. In future research, we will consider designing distillation strategies tailored to different tasks, enabling DFGPD to adapt to a greater variety of tasks and achieve better results.