1 Introduction

Deep neural networks have become a popular choice for various vision tasks due to their strong performance [1,2,3]. However, more powerful networks usually require more storage and have lower inference speed, making them difficult to deploy in tasks with strict real-time requirements such as autonomous driving. To address this issue, knowledge distillation has been proposed [4, 5]. It can improve the performance of lightweight networks so that they can, to a certain extent, replace large networks, making it easier to apply deep neural networks in real-time tasks. Knowledge distillation can be divided into three types: logits-based, feature-based, and relation-based distillation. Since feature-based methods generally achieve better performance, feature-based distillation is the focus of this paper.

The concept of feature-based distillation was first introduced in [6], where the authors used feature maps from the middle layers of the teacher network to guide the student network during training. In [7], a spatial attention mechanism was utilized to obtain an attention map for distillation. Similarly, in [8] and [9], channel attention and spatial attention were used to suppress unimportant background features in object detection. However, these efforts failed to consider the importance of positional information in feature maps. In domains where the position or order of elements matters, positional features help a network understand the input data. In computer vision, for example, positional features can capture the spatial arrangement of pixels in an image, helping the network locate objects or patterns. Incorporating positional encodings can therefore enhance a network's ability to capture spatial or sequential patterns and improve performance on tasks that require understanding positional relationships. This paper takes a step further by using the positional features of the teacher to guide the student during training.

In addition to utilizing attention mechanisms, some efforts have focused on modifying the structure of student networks to enhance the distillation effect. For example, [10] added several branches at different stages of the network and used the knowledge of its deepest layer to guide the training of the student network. Similarly, [11] added a BiFPN (bi-directional feature pyramid network, proposed by [12]) and a classifier for distillation, while [13] proposed a residual learning framework to enhance distillation performance. However, the redundant parts added to the networks in these efforts must be dropped after training. In contrast, [14] simplified the teacher network with a low-rank approximation to improve the distillation effect, avoiding the disadvantage mentioned above. However, it relies on manually setting multiple decomposition ranks, which is cumbersome and makes it difficult to find a suitable teacher for the student network. To address these issues, this paper presents a new generic teacher framework for distillation that can be widely applied to other distillation methods to improve the distillation effect. Building upon this framework, we propose a new two-stage distillation method that further enhances the distillation effect.

Based on the above discussion, this paper presents DFGPD, which consists of three parts: global and positional distillation, a generic teacher framework, and a two-stage distillation method. DFGPD effectively exploits the positional features of neural networks that previous efforts have ignored. DFGPD introduces no additional computation or parameters at inference time, since global distillation and positional distillation are required only during training. Additionally, the proposed teacher framework can be widely applied to other distillation methods to improve the distillation effect. Our experiments reveal an interesting finding: the proposed teacher framework consistently performs better than the original teacher network, despite increasing the teacher–student capacity gap. This is noteworthy because previous work [15, 16] has shown that bigger models are not always better teachers, as the teacher–student capacity gap can influence the distillation effect. We conducted extensive comparative experiments on the classification and segmentation tasks, as well as sensitivity and ablation experiments on the classification task, to demonstrate the effectiveness and stability of DFGPD. We validate it on the CIFAR100, Tiny-ImageNet and ImageNet [21] datasets for classification, and on Cityscapes [33] and Pascal VOC [34] for segmentation. Notably, our method improves the performance of MobilenetV3 on CIFAR100 and Tiny-ImageNet by over 10%. Furthermore, our method alleviates the influence of the teacher–student capacity gap on the distillation effect [15, 16]: as shown in Table 13, the distillation effect of our method increases as the teacher's capacity grows. Finally, our method has only one hyper-parameter, which is insensitive, so little time is needed to tune it for a good distillation effect.

In summary, the contributions of this paper are as follows:

  • We present a new distillation framework DFGPD, which is composed of global and positional distillation loss, a teacher framework for distillation and a two-stage distillation method for better distillation performance.

  • The proposed global and positional distillation loss can effectively transfer global and positional information between teacher and student, and is straightforward and intuitive.

  • The proposed teacher framework could improve distillation performance for both logits-based distillation methods and feature-based distillation methods. Based on the teacher framework, we present a two-stage distillation method for better distillation effect.

  • We provide an analysis of the backpropagation of the teacher framework to explain its effectiveness.

  • DFGPD can effectively alleviate the bigger-models-are-not-always-better-teachers issue, which means we do not need to spend time finding an appropriate teacher for the student model.

2 Related works

Knowledge distillation Given a lightweight model, the aim of vanilla knowledge distillation [17] is to improve its performance by letting it mimic the predictions, or soft labels, of a teacher model. A common explanation for the success of knowledge distillation is that soft labels provide the student model with dark knowledge, which effectively improves its generalization ability. However, vanilla knowledge distillation requires a pre-trained teacher network, which incurs extra training cost. To alleviate this issue, online distillation [18] and self-distillation [5, 10] have been proposed, which do not require a pre-trained teacher model.

Feature-based knowledge distillation Besides mimicking predictions, recent methods attempt to leverage the information contained in the hidden layers of neural networks. The earliest work on feature-based distillation is [6], which encourages the student model to mimic the feature maps of the teacher model by minimizing the L2 loss between student and teacher feature maps. To transfer the teacher's knowledge effectively, a number of efforts designed elaborate knowledge representations. For example, [7] transferred the teacher knowledge using spatial attention, while [8, 9] represented the knowledge of the teacher model with channel-spatial attention maps; channel and spatial attention maps express the global information of the channel and spatial dimensions. Different from these works, this paper goes a step further by considering the positional information of the teacher model and proposes global and positional distillation, which effectively transfers the teacher's global and positional information.

Learning framework in knowledge distillation Vanilla knowledge distillation uses the general teacher–student framework for distillation, which modifies neither the teacher nor the student network. Modifications of the learning framework can be divided into three types: modifying the student model for self-distillation [5, 10, 11], modifying the student model for offline distillation [13], and modifying the teacher model [14]. However, the redundant parts of the student model must be dropped at inference time for [10, 11, 13]. Although [14] avoids this issue, it requires manually setting multiple decomposition ranks. These issues make the above approaches less generic.

Based on the above discussion, the differences between our method and related works are summarized as follows: (1) besides channel and spatial features, this paper further considers the positional information of feature maps; (2) the proposed teacher framework does not require setting any parameters manually and does not require any modification to student networks; (3) [14] simplified the teacher for a better distillation effect, a motivation that is intuitive given the bigger-models-are-not-always-better-teachers issue; we design the teacher framework from a new perspective, i.e., increasing the complexity of the teacher to obtain better teacher feature maps; (4) a two-stage distillation method is proposed to further enhance the distillation effect.

3 Methods

3.1 Global and positional distillation

As shown in Fig. 1, global distillation is achieved by encouraging the student to mimic the spatial attention map (representing the global information of spatial positions) \(A_{s} \in {\mathcal {R}}^{1,H,W}\) and the channel attention map (representing the global information of the channel dimension) \(A_{c} \in {\mathcal {R}}^{C,1,1}\) of the teacher network. Positional distillation is achieved by encouraging the student to mimic the horizontal attention map (reflecting the feature responses of each channel across the entire width, i.e., the positional information of the width for each channel) \(A_{w} \in {\mathcal {R}}^{C,1,W}\) and the vertical attention map (reflecting the feature responses of each channel across the entire height, i.e., the positional information of the height for each channel) \(A_{h} \in {\mathcal {R}}^{C,H,1}\) of the teacher network. Given a feature map \(F \in {\mathcal {R}}^{C,H,W}\), where C, H and W denote its channel number, height and width, respectively, generating the spatial and channel attention maps can be viewed as finding the mapping functions \(\varrho ^{s}:\) \(F \in {\mathcal {R}}^{C,H,W} \rightarrow A_{s} \in {\mathcal {R}}^{1,H,W}\) and \(\varrho ^{c}:\) \(F \in {\mathcal {R}}^{C,H,W} \rightarrow A_{c} \in {\mathcal {R}}^{C,1,1}\), respectively. Similarly, the horizontal and vertical attention maps can be obtained with the mapping functions \(\varrho ^{w}:\) \(F \in {\mathcal {R}}^{C,H,W} \rightarrow A_{w} \in {\mathcal {R}}^{C,1,W}\) and \(\varrho ^{h}:\) \(F \in {\mathcal {R}}^{C,H,W} \rightarrow A_{h} \in {\mathcal {R}}^{C,H,1}\). Note that the superscripts s, c, w and h distinguish 'spatial', 'channel', 'width' and 'height'. Since the absolute value of each element in a feature map represents its importance, we construct \(\varrho ^{s}\) by averaging the absolute values across the channel dimension and \(\varrho ^{c}\) by averaging the absolute values across the height and width dimensions, which can be formulated as \(\varrho ^{s}(F) = \frac{1}{C} {\textstyle \sum _{k=1}^{C}} |F_{k,i,j} |\) and \(\varrho ^{c}(F) = \frac{1}{HW} {\textstyle \sum _{i=1}^{H}} {\textstyle \sum _{j=1}^{W}} |F_{k,i,j} |\), where i, j and k denote the \(i_{th}\), \(j_{th}\) and \(k_{th}\) element in the height, width and channel dimension, respectively. Moreover, \(\varrho ^{h}\) is constructed by averaging the absolute values across the width dimension and \(\varrho ^{w}\) by averaging the absolute values across the height dimension, formulated as \(\varrho ^{h}(F) = \frac{1}{W} {\textstyle \sum _{j=1}^{W}} |F_{k,i,j} |\) and \(\varrho ^{w}(F) = \frac{1}{H} {\textstyle \sum _{i=1}^{H}} |F_{k,i,j} |\). The global and positional distillation loss \(L_{GPD}\) is composed of two components: the global distillation loss \(L_{GD}\), which encourages the student network to mimic the spatial and channel attention of the teacher network, and the positional distillation loss \(L_{PD}\), which encourages the student network to mimic the horizontal and vertical attention of the teacher network. The equations for \(L_{GD}\) and \(L_{PD}\) are as follows.

$$\begin{aligned} L_{GD}&= L_{2}(\varrho ^{s}(F^{T}), \varrho ^{s}(F^{S})) + L_{2}(\varrho ^{c}(F^{T}), \varrho ^{c}(F^{S})) \end{aligned}$$
(1)
$$\begin{aligned} L_{PD}&= L_{2}(\varrho ^{h}(F^{T}), \varrho ^{h}(F^{S})) + L_{2}(\varrho ^{w}(F^{T}), \varrho ^{w}(F^{S})) \end{aligned}$$
(2)

Here \(L_{2}\) represents the \(L_{2}\) norm loss, \(F^{T}\) represents the feature maps of the teacher network, and \(F^{S}\) represents the feature maps of the student network. The equation for \(L_{GPD}\) and the loss function of the student network are shown in Eqs. (3) and (4), respectively.

$$\begin{aligned} L_{GPD}&= L_{GD} + L_{PD} \end{aligned}$$
(3)
$$\begin{aligned} Loss&= CrossEntropy(q,Y) + \alpha \cdot L_{GPD} \end{aligned}$$
(4)

Here CrossEntropy refers to the cross-entropy loss, q represents the prediction of the student network, and \(\alpha \) is utilized to adjust the weight of the teacher knowledge.
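As a reference, a minimal PyTorch sketch of the four attention maps and the GPD loss of Eqs. (1)–(4) is given below. The \(L_{2}\) loss is realized here as a mean squared error, the teacher and student feature maps are assumed to have already been aligned in shape, and the variable names are illustrative rather than part of any released implementation.

```python
import torch
import torch.nn.functional as F

def attention_maps(feat):
    """Attention maps of a feature map with shape (B, C, H, W):
    spatial (B,1,H,W), channel (B,C,1,1), vertical (B,C,H,1) and
    horizontal (B,C,1,W), each an average of absolute activations
    over the remaining dimensions, as in Sect. 3.1."""
    a = feat.abs()
    a_s = a.mean(dim=1, keepdim=True)          # rho^s: average over channels
    a_c = a.mean(dim=(2, 3), keepdim=True)     # rho^c: average over H and W
    a_h = a.mean(dim=3, keepdim=True)          # rho^h: average over width
    a_w = a.mean(dim=2, keepdim=True)          # rho^w: average over height
    return a_s, a_c, a_h, a_w

def gpd_loss(feat_t, feat_s):
    """Global and positional distillation loss (Eqs. 1-3)."""
    maps_t = attention_maps(feat_t.detach())   # no gradient through the teacher
    maps_s = attention_maps(feat_s)
    l_gd = F.mse_loss(maps_s[0], maps_t[0]) + F.mse_loss(maps_s[1], maps_t[1])
    l_pd = F.mse_loss(maps_s[2], maps_t[2]) + F.mse_loss(maps_s[3], maps_t[3])
    return l_gd + l_pd

# Student objective of Eq. (4); alpha = 1.2 follows the setting in Sect. 4.1
# loss = F.cross_entropy(logits_s, labels) + 1.2 * gpd_loss(feat_t, feat_s)
```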

Fig. 1
figure 1

Details of global and positional distillation. Global distillation generates the spatial attention map and channel attention map by performing average pooling across the spatial and channel dimensions, respectively. On the other hand, positional distillation generates the horizontal attention map and vertical attention map by performing average pooling across the width and height dimensions, respectively. The student is encouraged to mimic the attention maps of the teacher network. The GPD loss is then applied to feature maps with different resolutions. The term "Avgpool" is used to denote the operation of average pooling across the corresponding dimension

3.2 Teacher framework for distillation

Figure 2 shows the process of obtaining the teacher feature maps. First, the feature maps at different stages of the original network are fused by the selective dense feature connections (SDFC) module, which adaptively weights the feature maps of different stages. Note that downsampling of feature maps is done with a 3 \(\times \) 3 depthwise separable convolution (DSConv) [19]. The fused feature map is then integrated back into the network. Second, to reduce computational complexity and speed up training, the channel dimension of the feature maps at different stages is mapped to a lower dimension with a 1 \(\times \) 1 convolution (the channel number is set to 128 or 256 in the experiments, and Table 12 shows that the distillation effect is not sensitive to the channel number of the feature maps). Finally, the feature map generated by the 1 \(\times \) 1 convolution at the deepest layer is upsampled with deconvolution and fused with the feature maps of shallower layers by element-wise summation to obtain teacher feature maps at different resolutions. The loss function of the teacher network is not modified and can be written as \(Loss = CrossEntropy(q,Y)\), where q, Y and CrossEntropy represent the prediction of the network, the ground-truth labels and the cross-entropy loss function, respectively.
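The following PyTorch sketch illustrates only the channel reduction and top-down fusion just described; module and variable names are our own, a single deconvolution is reused across levels for brevity, and each stage is assumed to halve the spatial resolution. The SDFC module itself is sketched after the description of Fig. 3 below.

```python
import torch
import torch.nn as nn

class TeacherFeaturePyramid(nn.Module):
    """Sketch of the top-down path in Fig. 2 (names are ours).

    Each stage output is reduced to `mid_ch` channels with a 1x1
    convolution; the deepest map is upsampled with a deconvolution and
    added element-wise to the next shallower map, producing the teacher
    feature maps (FeatureTea) at several resolutions."""

    def __init__(self, in_channels, mid_ch=256):
        super().__init__()
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, mid_ch, kernel_size=1) for c in in_channels])
        # stride-2 deconvolution doubles the spatial resolution
        self.up = nn.ConvTranspose2d(mid_ch, mid_ch, kernel_size=2, stride=2)

    def forward(self, stage_feats):
        # stage_feats: backbone stage outputs, ordered shallow -> deep
        reduced = [conv(f) for conv, f in zip(self.reduce, stage_feats)]
        feature_tea = [reduced[-1]]
        prev = reduced[-1]
        for f in reversed(reduced[:-1]):
            prev = f + self.up(prev)           # element-wise summation
            feature_tea.append(prev)
        return feature_tea[::-1]               # shallow -> deep order
```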

As depicted in Fig. 3, SDFC is a module that fuses features from different stages. The fusion process is as follows. First, the feature maps from different stages are fused by element-wise summation, formulated as \(M = {\textstyle \sum _{i=1}^{N}} F_{i} \), where M, \(F_{i}\) and N represent the preliminary fused feature map, the feature map of the \(i_{th}\) stage and the number of feature maps to be fused, respectively. Then, channel-level average pooling is used to obtain an attention map, expressed as \(F_{s} = \frac{1}{C} {\textstyle \sum _{i=1}^{C}} M(i,h,w)\), where \(F_{s}\), C, h and w represent the attention map, the channel number, the height dimension and the width dimension, respectively. Finally, the weights of the different feature maps are obtained through one MLP layer and a softmax operation, represented by \(W = Softmax(MLP(F_{s}))\), where W, MLP and Softmax refer to the weights of the different feature maps, the MLP layer and the softmax operation, respectively.
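Under our reading of this description, a minimal SDFC sketch in PyTorch is as follows; the single linear layer over the flattened pooled map is our interpretation of "one MLP layer", so the spatial size must be fixed at construction time.

```python
import torch
import torch.nn as nn

class SDFC(nn.Module):
    """Sketch of selective dense feature connections (Fig. 3):
    element-wise summation, channel-level average pooling, then an MLP
    layer and a softmax yielding one fusion weight per input map."""

    def __init__(self, height, width, n_inputs):
        super().__init__()
        self.mlp = nn.Linear(height * width, n_inputs)

    def forward(self, feats):
        # feats: list of N feature maps with identical shape (B, C, H, W)
        stacked = torch.stack(feats, dim=0)            # (N, B, C, H, W)
        m = stacked.sum(dim=0)                         # preliminary fusion M
        f_s = m.mean(dim=1).flatten(1)                 # channel-level avg pool -> (B, H*W)
        w = torch.softmax(self.mlp(f_s), dim=1)        # (B, N) fusion weights
        w = w.t().reshape(len(feats), -1, 1, 1, 1)     # broadcast over C, H, W
        return (stacked * w).sum(dim=0)                # adaptively weighted fusion
```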

Fig. 2
figure 2

Details of the teacher framework. Here we take Resnet with the teacher framework as an example. ResStage indicates the different stages of Resnet. SDFC is utilized to adaptively fuse feature maps at different stages. FeatureTea denotes the feature maps used to guide the training of student networks. The 1 \(\times \) 1 convolution is utilized to adjust the channel number of the feature maps, thus reducing computational complexity. To resolve the discrepancies in width and height between the feature maps of the student and teacher networks, deconvolution is employed as an upsampling technique to align the dimensions of the feature maps

Fig. 3
figure 3

Selective dense feature connections module (C, H, W and N denote channel number, height, width and the number of feature maps that need to be fused, respectively)

3.3 Two-stage distillation

Self-distillation is an effective technique for improving network performance, and a more powerful network has usually learned better features. Building on this, we propose a two-stage distillation method. In the first stage, we train the proposed teacher framework with self-distillation, aiming to let the teacher network learn better features. In the second stage, we use the well-trained teacher to transfer its knowledge via the global and positional distillation loss. As shown in Fig. 4, the teacher self-distillation training framework is indicated by gray boxes. Several branches are added at various stages of the original network for self-distillation. These branches include an attention module, a feature alignment layer and a classifier, configured in the same manner as in [20]. Subsequently, during the network's forward propagation, downsampling and SDFC are used to integrate feature maps from different stages and obtain the final multi-scale feature maps for prediction. Finally, the obtained multi-scale feature maps and predictions are used for distillation. The loss function for self-distillation can be expressed as:

$$\begin{aligned} \begin{aligned} Loss&= \sum _{i=1}^c \Big ( (1-\alpha ) \cdot CrossEntropy(q^i, y) \\&\quad + \alpha \cdot KL(q^i, q^c) \\&\quad + \lambda \cdot \left\| F_i - F_c\right\| _2^2 \Big ) \end{aligned} \end{aligned}$$
(5)
$$\begin{aligned} {F_c=S D F C\left( F_1, \ldots , F_n\right) } \end{aligned}$$
(6)

Here c denotes the number of classifiers. \(q^{i}\) and \(q^{c}\) represent the predictions of the \(i_{th}\) classifier and the deepest classifier, respectively. y indicates the ground-truth labels. \(F_{i}\) and \(F_{c}\) denote the feature maps of the \(i_{th}\) classifier and the deepest classifier. CrossEntropy and KL denote the cross-entropy loss and the Kullback–Leibler divergence, respectively. \(\alpha \) and \(\lambda \) are utilized to adjust the weight of the teacher knowledge.
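A minimal PyTorch sketch of the self-distillation objective of Eqs. (5) and (6), under our reading, is shown below. The feature maps are assumed to have already been aligned to a common shape by the feature alignment layers, the targets are detached (a common choice in self-distillation), and the values of alpha and lambda are placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(logits, feats, labels, sdfc, alpha=0.3, lam=0.03):
    """Sketch of Eqs. (5)-(6). `logits` and `feats` are lists ordered from
    the shallowest auxiliary classifier to the deepest one; `sdfc` is the
    fusion module of Fig. 3."""
    q_c = logits[-1].detach()                  # deepest prediction q^c
    f_c = sdfc(feats).detach()                 # fused feature map F_c (Eq. 6)
    loss = 0.0
    for q_i, f_i in zip(logits, feats):
        ce = F.cross_entropy(q_i, labels)
        kl = F.kl_div(F.log_softmax(q_i, dim=1),
                      F.softmax(q_c, dim=1),
                      reduction="batchmean")
        feat_l2 = (f_i - f_c).pow(2).mean()    # squared L2 term, averaged over elements
        loss = loss + (1 - alpha) * ce + alpha * kl + lam * feat_l2
    return loss
```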

Fig. 4
figure 4

Details of the two-stage distillation framework. Here we take Resnet as an example. ResStage refers to the different stages of ResNet. SDFC is a feature fusion module that can adaptively adjust the weights of feature maps to be fused. Attention refers to an attention mechanism, while the feature alignment layer is used to align feature maps for distillation. FeatureTea denotes the feature maps used to guide the training of student networks. GPDLoss indicates the global and positional distillation loss. The 1 \(\times \) 1 convolution layer is utilized to adjust the channel number of the student and teacher for distillation

4 Experiments

DFGPD is a novel distillation framework that can be easily applied to different models for classification and segmentation tasks. In this paper, we conducted experiments on two tasks: classification and semantic segmentation. We used different datasets and models for the validation experiments of each task, and all models in both tasks achieved significant improvements through DFGPD. All experiments were conducted on a single NVIDIA GeForce RTX 4090 under Linux, with CUDA 11.7 as the backend and models implemented in PyTorch 1.12.1.

4.1 Classification

In the classification task, DFGPD has been evaluated on three datasets: CIFAR-100, Tiny-ImageNet and ImageNet [21]. The teacher–student pairs consist of networks with the same architecture and different architectures. The benchmark networks include seven networks with different lengths and widths, namely ResNet [22], WideResNet [23], VGG [24], MobileNetV1 [19], MobileNetV3 [25], ShuffleNetV1 [26], and ShuffleNetV2 [27]. To demonstrate its effectiveness, our method is compared with six other distillation methods, namely KD [17], DKD [28], MGD [29], USKD [30], LSKD [31] and SRD [32].

To prevent overfitting, data augmentation (image flipping, scaling and cropping), early stopping and L2 regularization were used during training. The networks are optimized using SGD with momentum. On CIFAR-100, all networks were trained for 200 epochs, with the learning rate divided by 10 at the 75th, 130th and 180th epochs; the batch size and initial learning rate were set to 128 and 0.1, respectively. On Tiny-ImageNet, all networks were trained for 100 epochs, with the learning rate divided by 10 at the 30th, 60th and 90th epochs; the batch size and initial learning rate were set to 64 and 0.1, respectively. On ImageNet, all networks were trained for 100 epochs with a weight decay of 0.0001; the learning rate is initialized to 0.1 and decayed every 30 epochs, and the batch size is set to 32. The hyper-parameter \(\alpha \) in our method is set to 1.2.
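For concreteness, a sketch of the CIFAR-100 optimization schedule described above is given below; the stand-in model, the momentum and the weight decay value are placeholders, since they are not fully specified in the text.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

student = nn.Linear(3 * 32 * 32, 100)        # stand-in for the student network
optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# learning rate divided by 10 at the 75th, 130th and 180th epochs
scheduler = MultiStepLR(optimizer, milestones=[75, 130, 180], gamma=0.1)

for epoch in range(200):
    # ... one training epoch over CIFAR-100 with the combined loss of Eq. (4) ...
    scheduler.step()
```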

4.2 Classification results

Comparison experiments on CIFAR100 The experimental results on CIFAR-100 verify the effectiveness of DFGPD compared to other state-of-the-art distillation methods. In Tables 1 and 2, we compare the accuracy, recall and precision of various distillation methods. Specifically, Table 1 contains the results of teacher–student pairs with the same architecture, while Table 2 shows the results of pairs with different architectures. It can be observed that: (1) among teacher–student pairs with the same architecture, both our two-stage distillation method (DFGPD) and the one-stage distillation method (OGPD) consistently outperformed the other methods across all pairs. Notably, two-stage distillation achieved its greatest increase in the three evaluation metrics on the Resnet50-Resnet32 teacher–student pair, with recall, accuracy and precision improving by 5.15%, 5.26% and 5.18%, respectively, while one-stage distillation achieved 4.71%, 4.76% and 4.76%, respectively; (2) compared to other KD methods where the teacher and student share the same network architecture, both of our methods show significant improvements. Specifically, DFGPD achieves an average increase of 3.69% in accuracy, 3.76% in recall and 3.76% in precision, with accuracy improvements ranging from 2.64 to 5.15%, recall from 2.74 to 5.26% and precision from 2.78 to 5.18%. This demonstrates that our methods reach comparable or superior performance to the state-of-the-art KD methods for teacher–student pairs with identical architectures; (3) when the teacher and student come from different series, our two methods achieve the highest performance improvement in the three evaluation metrics on the Resnet34-MobilenetV3 teacher–student pair, namely 10.23%, 10.22% and 9.72%, respectively; (4) compared to other KD methods, regardless of whether the teacher and student share the same architecture, our methods are the best. Specifically, when the teacher and student have the same architecture, averaged over accuracy, recall and precision, DFGPD outperforms the other KD methods by 1.96%, 1.82% and 1.97%, respectively, while OGPD surpasses them by 1.63%, 1.54% and 1.58%, respectively. When the teacher and student have different architectures, DFGPD outperforms the other KD methods by 2.59%, 2.63% and 2.65% on average, while OGPD surpasses them by 2.35%, 2.39% and 2.40%. This shows that both of our distillation strategies offer superior performance compared to the state-of-the-art distillation methods.

Table 1 Comparison with other distillation methods on CIFAR100
Table 2 Comparison with other distillation methods on CIFAR100

Comparison experiments on Tiny-ImageNet Our experimental results on Tiny-ImageNet demonstrate the effectiveness of our two-stage distillation method (DFGPD) and the one-stage distillation method (OGPD). Table 3 shows experiments conducted on networks with the same architecture and those with different architectures. The following observations can be made: (1) the one-stage distillation method (OGPD) is effective for teacher–student pairs with both the same and different architectures. (2) It is worth noting that on the Resnet34-MobilenetV3 teacher–student pair, both of our methods achieved performance improvements exceeding 10% across the three evaluation metrics. For the two-stage distillation, accuracy, recall and precision improved by 10.75%, 11.00% and 10.92%, respectively, while for the one-stage distillation the improvements were 10.50%, 10.67% and 10.57%, respectively. (3) Compared to other KD methods, regardless of whether the teacher and student have the same or different architectures, our methods are the best. Specifically, averaged over accuracy, recall and precision, DFGPD outperforms the other KD methods by 2.41%, 2.37% and 2.52%, respectively, while OGPD surpasses them by 2.74%, 2.72% and 2.81%, respectively. This shows that both of our distillation strategies offer superior performance compared to the state-of-the-art distillation methods.

Table 3 Comparison with other distillation methods on Tiny-ImageNet

Comparison experiments on ImageNet On ImageNet, we compared our methods (DFGPD and OGPD) with other advanced distillation methods to demonstrate the effectiveness of our approach. Table 4 shows experiments conducted on networks with the same architecture and those with different architectures. The following observations can be made: (1) our proposed two-stage distillation (DFGPD) and one-stage distillation (OGPD) methods are effective on ImageNet for teacher–student pairs with both identical and different architectures; (2) our methods achieve the greatest improvement in the three evaluation metrics on the Resnet34-Resnet18 teacher–student pair: for DFGPD, accuracy, recall and precision increase by 3.61%, 3.59% and 3.66%, respectively, and for OGPD by 3.22%, 3.24% and 3.26%, respectively; (3) compared to other KD methods, regardless of whether the teacher and student have the same or different architectures, the two-stage distillation (DFGPD) is the best, with mean improvements of 1.39% in accuracy, 1.42% in recall and 1.23% in precision. Even without self-distillation of the teacher, the one-stage distillation (OGPD) method achieves a higher performance improvement than the other distillation methods, with mean increases of 0.85% in accuracy, 1.01% in recall and 0.82% in precision.

Table 4 Comparison with other distillation methods on ImageNet

4.3 Segmentation

In the semantic segmentation task, we conducted experiments across two datasets and a variety of network architectures.

Datasets and data augmentation methods (1) Cityscapes [33] is a dataset for urban scene parsing that includes 5000 finely annotated images covering 19 classes, of which 2975/500/1525 images are used for training/validation/testing. (2) Pascal VOC [34] is a visual object segmentation dataset with 20 foreground object classes and one background class. It contains 10582/1449/1456 images for training/validation/testing, respectively.

Network architecture For all experiments, we use the segmentation framework DeepLabV3 [35] with a ResNet50 backbone as a powerful teacher network. Specifically, we initialize the backbone of the teacher network with ImageNet [21] pre-trained weights. For the student networks, we use different segmentation architectures to verify the effectiveness of the distillation method. Specifically, we employed DeepLabV3 (DLV3) and PSPNet [36] with different backbones such as ResNet-34 (Res34), ResNet-18 (Res18), and MobileNetV2 (MV2).

Training details To prevent overfitting, data augmentation including image flipping, scaling and cropping was used. All experiments were optimized with SGD with a momentum of 0.9, a batch size of 4 and an initial learning rate of 0.02. The learning rate is adjusted according to a polynomial decay strategy. For the crop size during training, we used 512 \(\times \) 1024 for Cityscapes and 512 \(\times \) 512 for Pascal VOC.
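The polynomial decay mentioned above can be written as a small helper; the power of 0.9 is a conventional choice in semantic segmentation and is an assumption here, not a value stated in the text.

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """Polynomial learning-rate decay; power=0.9 is assumed."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# e.g. starting from the initial rate of 0.02 used above
# lr = poly_lr(0.02, cur_iter=10_000, max_iter=40_000)
```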

Comparative distillation methods We compared our proposed DFGPD with state-of-the-art segmentation distillation methods: SKD [37], IFVD [38], CIRKD [39] and CWD [40]. We re-ran all methods using the code provided by the authors, and all methods use the same pre-trained teacher, DeepLabV3-ResNet50. The results on the two segmentation datasets show that DFGPD achieves comparable or superior performance to the other methods.

4.4 Segmentation results

Comparison experiments on Cityscapes In Table 5, we compare our two proposed distillation methods (DFGPD, a two-stage distillation, and OGPD, a one-stage distillation without teacher self-distillation) with state-of-the-art distillation methods on Cityscapes in terms of validation (Val) and test (Test) mIoU. It can be observed that: (1) under the supervision of the teacher, all methods improved the student networks; our DFGPD achieved the best segmentation performance across student networks with similar or different architectural styles, and OGPD also outperformed previous KD methods; (2) in the best cases, DFGPD achieved performance improvements of 3.3% and 3.34% on the teacher–student pairs of DeepLabV3-ResNet50 and PSPNet-ResNet34, respectively, while OGPD achieved improvements of 3.14% and 3.10% on the same pairs; (3) compared to the other KD methods, our distillation methods showed an average improvement of 1.57% in Val mIoU and 1.58% in Test mIoU for DFGPD, and 1.78% in Val mIoU and 1.80% in Test mIoU for OGPD.

Table 5 Comparing other distillation methods across various student segmentation networks on Cityscapes

Comparison experiments on Pascal VOC We compared our two proposed distillation methods (DFGPD, a two-stage distillation, and OGPD, a one-stage distillation without teacher self-distillation) with state-of-the-art distillation methods on Pascal VOC in terms of validation (Val) and test (Test) mIoU. As shown in Table 6, it can be observed that: (1) under the supervision of the teacher, all methods improved the student networks; our DFGPD achieved the best segmentation performance across student networks with similar or different architectural styles, and OGPD also outperformed previous KD methods; (2) in the best cases, DFGPD achieved performance improvements of 3.40% and 3.45% on the teacher–student pairs of DeepLabV3-ResNet50 and DeepLabV3-ResNet34, respectively, while OGPD achieved improvements of 2.80% and 2.67% on the same pairs; (3) compared to the other KD methods, our distillation methods showed an average improvement of 1.57% in Val mIoU and 1.58% in Test mIoU for DFGPD, and 1.34% in Val mIoU and 1.47% in Test mIoU for OGPD.

Table 6 Comparing other distillation methods across various student segmentation networks on Pascal VOC

4.5 Ablation study and sensitivity study

Effect of teacher framework on other distillation methods We conducted experiments on various types of distillation methods, reported in Tables 7 and 8 for classification and segmentation tasks, respectively, to demonstrate the ability of the teacher framework to enhance performance. These include six knowledge distillation methods for classification and five for segmentation. The results indicate that the teacher framework is effective (1) for both logit-based (such as KD and DKD) and feature-based (such as MGD) distillation methods, and (2) for both classification and segmentation tasks, showing that it is broadly applicable to other distillation methods without any modification. In the classification task, the teacher framework improved the performance of the other distillation methods by an average of 0.86% and 2.00% on the two sets of teacher–student pairs, and in the segmentation task by 0.63% and 0.60%, respectively, demonstrating its effectiveness in enhancing the distillation effect. Surprisingly, in the classification task, the teacher framework further enhances the DKD distillation effect by 6.39%. Despite the increased teacher–student capacity gap when the teacher framework is used, all distillation methods show improved performance in both classification and segmentation tasks, with improvements ranging from 0.34 to 6.39%. This suggests that the proposed teacher framework can still outperform the original teacher network even with an increased teacher–student capacity gap. This is noteworthy because extensive research [15, 16] has shown that larger models are not always better teachers, as the teacher–student capacity gap can affect the distillation effect.

Table 7 Effect of teacher frameworks on other distillation methods in classification tasks
Table 8 Effect of teacher frameworks on other distillation methods in segmentation tasks

Effect of two-stage distillation We conducted experiments on CIFAR-100 to demonstrate the effect of two-stage distillation. The results are reported in Table 9, with a benchmark of five teacher–student pairs. It can be observed that: (1) the combination of the teacher framework and self-distillation (T-SD) improves the accuracy of the teacher networks by an average of 3.69%, with a 3.22% increase on Resnet34 and a 3.95% increase on Resnet50; (2) two-stage distillation enhances the performance of the student by an average of 4.28%; (3) compared to one-stage distillation, two-stage distillation brings additional performance improvements to teacher–student pairs with both the same and different architectures, indicating that two-stage distillation can further improve the distillation effect.

Table 9 Effect of two-stage distillation

Ablation study on different components of DFGPD Table 10 reports detailed ablation studies on CIFAR100 to demonstrate the effect of the different components of the proposed method. The teacher and student networks are Resnet34 and Resnet18, respectively. It can be observed that: (1) global distillation and positional distillation increase the accuracy of Resnet18 by 1.85% and 1.82%, respectively, and their combination yields a 2.37% accuracy improvement, indicating that each type of distillation is individually effective; (2) with the help of the teacher framework, the accuracy of Resnet18 is further improved by 0.91%, and two-stage distillation adds another 0.33% on this basis, showing that two-stage distillation can further improve the distillation effect; (3) the combination of GPD, the teacher framework and two-stage distillation enhances the performance of Resnet18 by 3.61%, which is achieved jointly by all components of DFGPD.

Table 10 Ablation study on different components of DFGPD

Sensitivity study on hyper-parameters Only one hyper-parameter is introduced in DFGPD. Table 11 reports sensitivity experiments on CIFAR100 with two teacher–student pairs: Resnet34-Resnet18 and Resnet34-Resnet10. It can be observed that: (1) for Resnet34-Resnet18, the worst hyper-parameter leads to a 0.38% accuracy drop compared to the highest accuracy, which is still 2.9% higher than the baseline; (2) for Resnet34-Resnet10, the accuracy obtained by the worst hyper-parameter is 0.55% lower than the highest accuracy, which is still 2.86% higher than the baseline. The results indicate that our method is not sensitive to the choice of hyper-parameter, so there is no need to spend much time searching for hyper-parameters to obtain a good distillation effect.

Table 11 Sensitivity study on hyper-parameters

Sensitivity study on channel number As shown in Fig. 2, the 1 \(\times \) 1 convolution is utilized to adjust the number of channels in the feature maps, which reduces computational complexity and speeds up training. To investigate the impact of the channel number on the distillation effect, sensitivity experiments were conducted on two teacher–student pairs: Resnet34-Resnet18 and Resnet34-MobilenetV1. Table 12 presents the distillation effect when the number of channels is adjusted from 128 to 512. The results indicate that the worst accuracies are only 0.27% and 0.32% lower than the top accuracies on Resnet34-Resnet18 and Resnet34-MobilenetV1, respectively, while still higher than the baseline by 3.01% and 4.86%. This suggests that our method is insensitive to the number of channels in the teacher feature maps. Therefore, a smaller number of channels can be selected to accelerate distillation with little accuracy penalty.

Table 12 Sensitivity on channel number of feature maps for teacher networks

4.6 Extension

In this section, we show that DFGPD can, to some extent, alleviate the issue that "large models are not always good teachers." We then conducted ablation studies, visualization experiments, and an analysis of the backpropagation of the network after introducing SDFC. Through these experiments, we discuss why DFGPD performs well and can mitigate this problem. Specifically, the reasons are: (1) SDFC achieves better feature representations by effectively integrating feature maps from different stages; (2) self-distillation enables the network to obtain better feature expression; (3) global and positional distillation transfers the knowledge of the teacher network more effectively.

Alleviating the influence of teacher–student capacity gap on distillation It has been demonstrated in many previous studies [15, 16] that bigger models are not always better teachers. Specifically, larger teacher networks may result in worse performance than smaller ones. These studies attributed this phenomenon to the model capacity gap between teacher and student and proposed teacher assistants [16] or early stopping [15] to alleviate it. However, we found that DFGPD can effectively alleviate this issue. In Table 13, we conducted experiments on a series of teacher models. It can be observed that: (1) when the teacher–student gap increases from 6.28 to 18.76 M, the performance of student networks distilled with vanilla KD drops from 76.98 to 75.93% and then rises to 77.08%. This indicates that as the capacity of the teacher model increases, the performance improvement brought by vanilla KD oscillates, which is consistent with the results in [15, 16]; (2) when the teacher–student gap increases from 6.28 to 18.76 M, the performance of student networks distilled with DFGPD increases from 78.53 to 79.07%, indicating that our method can alleviate the bigger-models-are-not-always-better-teachers issue. This also implies that better distillation effects can be obtained without spending much time finding a suitable teacher for the student network; (3) when only global and positional distillation is used, the performance of the student network gradually improves from 77.53 to 77.94% as the capacity gap increases from 6.28 to 18.76 M, indicating that global and positional distillation transfers the teacher's knowledge more effectively; (4) when only the proposed teacher framework is used, the performance of the student network also gradually improves from 77.34 to 77.83% as the capacity gap increases from 6.28 to 18.76 M, indicating that the proposed teacher framework provides better feature expression. In summary, the reasons why DFGPD can alleviate, to a certain extent, the issue that "large models are not always good teachers" are as follows: (1) SDFC enables the network to effectively integrate feature maps from different stages during training, and self-distillation allows the teacher network to optimize its feature representation before knowledge transfer; their combined effect allows the network to produce better feature maps; (2) global and positional distillation transfers the teacher's knowledge efficiently: global distillation focuses on transferring overall features, while positional distillation concentrates on reinforcing features at specific spatial locations. This combination not only enhances the student network's learning efficiency for key features but also strengthens its ability to recognize features across different regions.

Table 13 The effect of vanilla KD and our method when teacher networks with different capacity are used for distillation

Visualization experiments Figure 5 shows the feature maps used for distillation (i.e., FeatureTea in Fig. 4) for Resnet34 with and without the teacher framework and after self-distillation. It can be observed that: (1) compared to Resnet34 without the teacher framework, Resnet34 with the teacher framework captures better features. For example, in teacher feature map 3 of the first set of images, Resnet34 with the teacher framework pays more attention to the body of the koala, while the attention of plain Resnet34 is partly distracted to the background. Moreover, in teacher feature maps 2 and 3 of the second set of images, Resnet34 with the teacher framework pays more attention to the face of the red panda, whereas the attention of plain Resnet34 is partly distracted elsewhere. (2) Compared to Resnet34 with the teacher framework, the distilled Resnet34 captures better features, especially in teacher feature maps 3 and 4: although the distilled and undistilled networks both focus on the correct regions (i.e., the body of the koala and the face of the red panda), the features captured by the distilled Resnet34 are clearly more evident. Based on the above discussion, it can be concluded that: (1) the ability of the teacher framework and two-stage distillation to enhance distillation can be attributed to the fact that the teacher network captures better features with them; (2) the performance gain of two-stage distillation is mainly attributed to the better teacher feature maps 3 and 4.

Fig. 5
figure 5

Visualization experiment. Two sets of images were used for the experiment. The first row of images in each set shows the different teacher feature maps of Resnet34. The second row shows the different teacher feature maps of Resnet34 with a teacher framework. The third row shows the different teacher feature maps of Resnet34 distilled using our method. The teacher feature map here refers to FeatureTea in Fig. 4. Here Resnet34, Resnet34 with TF and distilled Resnet34 denote the vanilla Resnet34, Resnet34 equipped with the teacher framework, and Resnet34 equipped with the teacher framework and trained with self-distillation, respectively

Analysis of backpropagation in the teacher framework The visualization experiments have shown that the teacher framework produces better feature maps. To further explain its effectiveness, we analyze backpropagation in the teacher framework, taking part of the selective dense feature connections as an example. Figure 6a, b show the last BasicBlock of stages 1 and 2 in Resnet18, and the last BasicBlock of stages 1 and 2 in Resnet18 with the teacher framework, respectively.

Fig. 6
figure 6

The last BasicBlock of stages 1 and 2 of Resnet18 and Resnet18 equipped with teacher framework. Note that BasicBlock is the basic unit of Resnet. The x denotes the output feature map of the last BasicBlock of stage1. G(x) represents the input of the last BasicBlock of stage2. The F(G(x)) indicates the output feature map of the 3 \(\times \) 3 convolution layer. H(x) represents the output of stage2 of Resnet18 without teacher framework. \(F_s\left( x\right) \) and D(x) are the input of SDFC. \(H_{t}(x)\) denotes the output of the stage2 of the Resnet18 with the teacher framework. The feature fusion path denotes the path that feature maps are fused with SDFC. The integration path indicates the original output feature map of stage2 is replaced with the fused feature map

As shown in Fig. 6a, the equation for H(x) can be formulated as \(H\left( x\right) =G\left( x\right) +F\left( G\left( x\right) \right) \). The partial derivative result for H(x) can be denoted as Eq. (7). Representing the loss function with \(\xi \), according to the chain rule of backpropagation [41], the gradient can be formulated as Eq. (8).

$$\begin{aligned} H'(x)= & \frac{\partial H(x)}{\partial x} = \left( 1+\frac{\partial F\left( G\left( x\right) \right) }{\partial G\left( x\right) }\right) \frac{\partial G\left( x\right) }{\partial x} \end{aligned}$$
(7)
$$\begin{aligned} \frac{\partial \xi }{\partial x}= & \frac{\partial \xi }{\partial H\left( x\right) }\frac{\partial H\left( x\right) }{\partial x}=\frac{\partial \xi }{\partial H\left( x\right) }\left( 1+\frac{\partial F\left( G\left( x\right) \right) }{\partial G\left( x\right) }\right) \frac{\partial G\left( x\right) }{\partial x} \end{aligned}$$
(8)

As indicated in Fig. 6b, the equation for \(H_{t}(x)\) can be formulated as \(H_{t}(x) = SDFC(F_{s}(x),D(x))\), where \(D(x)=G(x)+F(G(x))\). As described in Sect. 3.2, SDFC adaptively adjusts the weights of feature maps at different stages. Suppose that the weights for feature maps of stage1 and stage2 are \(W_1\) and \(W_2\), respectively. Then \(H_{t}(x)\) can be denoted as \(H_{t}(x)=W_{1}F_{s}(x)+W_{2}(G(x)+F(G(x)))\). The partial derivative result for \(H_{t}(x)\) can be represented as Eq. (9). Denoting the loss function as \(\xi \), according to the chain rule of backpropagation, the gradient can be denoted as Eq. (10).

$$\begin{aligned} H_{t}'(x)= & \frac{\partial H_{t}(x)}{\partial x}=W_{1}\frac{\partial F_{s}(x)}{\partial x}+W_{2}(1+\frac{\partial F(G(x))}{\partial G(x)})\frac{\partial G(x)}{\partial x} \end{aligned}$$
(9)
$$\begin{aligned} \frac{\partial \xi }{\partial x}&= \frac{\partial \xi }{\partial H_{t}(x)}\frac{\partial H_{t}(x) }{\partial x}\nonumber \\&= \frac{\partial \xi }{\partial H_{t}(x)}\left( W_{1}\frac{\partial F_{s}(x)}{\partial x}+W_{2}\left( 1+\frac{\partial F(G(x))}{\partial G(x)}\right) \frac{\partial G(x)}{\partial x}\right) \end{aligned}$$
(10)

From Eqs. (7) and (9), it can be observed that: (1) compared to \(H'(x)\), \(H_{t}'(x)\) can adaptively adjust the gradient according to the importance of the different feature maps, owing to the weights \(W_{1}\) and \(W_{2}\). Note that \(W_1\) and \(W_2\) are the weights of \(F_{s}(x)\) and D(x) generated by SDFC; the more important \(F_{s}(x)\) is, the larger \(W_{1}\) becomes, meaning that the term \(W_{1}\frac{\partial F_{s}(x)}{\partial x}\) plays a more important role in network optimization; (2) the term \(W_{1}\frac{\partial F_{s}(x)}{\partial x}\) means that information can be propagated directly through the 3 \(\times \) 3 DSConv layer, so features learned at deep layers have a direct influence on the shallow layers. Since features at deep layers contain rich semantic information, Resnet18 with the teacher framework can learn better features. Based on the above discussion, it can be concluded that SDFC and DSConv enhance gradient backpropagation. Therefore, networks equipped with the teacher framework extract better features, which is consistent with the results of the visualization experiments. The strong effect of the teacher framework can thus be attributed to its effective gradient backpropagation and better feature extraction ability.
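To make the two gradient terms of Eq. (9) concrete, the toy autograd check below uses scalar linear maps as stand-ins for the DSConv path \(F_{s}\), the mapping G and the residual branch F, with fixed fusion weights in place of SDFC; it merely illustrates that both paths contribute to \(\partial \xi /\partial x\), each scaled by its weight.

```python
import torch

x = torch.randn(4, requires_grad=True)
w1, w2 = 0.6, 0.4               # stand-ins for the SDFC weights W1 and W2
f_s = 2.0 * x                   # direct (DSConv) path F_s(x)
g = 3.0 * x                     # G(x), input of the last BasicBlock of stage2
d = g + 0.5 * g                 # D(x) = G(x) + F(G(x)) with F(.) = 0.5 * (.)
h_t = w1 * f_s + w2 * d         # H_t(x) = W1*F_s(x) + W2*D(x)

h_t.sum().backward()
# dH_t/dx = w1*2 + w2*(3 + 0.5*3) = 1.2 + 1.8 = 3.0 for every element
print(x.grad)                   # tensor([3., 3., 3., 3.])
```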

5 Conclusion

In this paper, we introduce a new distillation framework, DFGPD, which consists of global and positional distillation, a teacher framework, and a two-stage distillation method. The advantages of DFGPD are as follows: (1) it fuses feature maps from different stages using a selective dense feature connection module and enhances the network’s feature representation through self-distillation; (2) it efficiently transfers knowledge by utilizing the global information of the feature maps from the teacher and student networks across channel and spatial dimensions, as well as positional information along the width and height dimensions. DFGPD can effectively enhance the performance of the student network in classification and segmentation tasks without increasing the complexity of the student network.

Limitations and future research directions Despite the improvements in distillation performance achieved by DFGPD, it also has some limitations. A primary challenge is that self-distillation and the upsampling process reduce the efficiency of the overall distillation process. Moreover, for object detection, more refined distillation strategies need to be considered to improve the performance of DFGPD. This is mainly due to the inherent complexity of object detection, where factors such as the ratio of foreground to background losses, as well as the ratio of losses for large and small objects, must be carefully considered when designing the distillation method. In future research, we will consider designing distillation strategies tailored to different tasks, enabling DFGPD to adapt to a greater variety of tasks and achieve better results.