
1 Introduction

Deep neural networks (DNNs) have been widely studied and applied in various fields such as image classification [7, 36], object detection [11, 26], semantic segmentation [10, 27], etc. One line of research pursues the best possible accuracy, which tends to produce over-parameterized models [24, 26] that demand computation and storage resources often unavailable on edge computing devices. This has triggered intensive research into lightweight yet competent network models in recent years, typically through four different approaches: 1) network pruning [14, 15, 17, 21, 28], 2) network quantization [9, 29], 3) building efficient small networks [6, 19, 20], and 4) knowledge transfer (KT) [4, 5, 12, 18, 30]. Among the four approaches, KT works in a unique way: a large and powerful teacher network is pre-trained, and its features and knowledge are then distilled into a compact student network. Though compact yet powerful student networks can be trained in this manner, conventional distillation is usually a complex multi-stage offline process that requires extra computation and memory.

Online knowledge distillation [3, 22, 25, 37] has attracted increasing interest in recent years. Instead of pre-training a large teacher network in advance, it trains two or more student models simultaneously in a cooperative peer-teaching manner. In other words, the training of the teacher and student networks is merged into a single-phase process, and knowledge is distilled and shared among the peer networks. This online distilling paradigm generalizes better without a clear definition of a teacher/student role, and it has achieved superior performance compared to offline distillation from teacher to student networks. On the other hand, these online methods commonly adopt an outcome-driven distillation strategy that focuses on minimizing the discrepancy among the final predictions. The rich information encoded in the intermediate layers of the peer networks is largely neglected, which leads to various problems such as limited knowledge transfer in deep mutual learning [37], constrained coordination in on-the-fly native ensemble [25], etc.

In this work, we propose a novel adversarial-based mutual learning network (AMLN) that includes both process-driven and outcome-driven learning for optimal online knowledge distillation. Specifically, AMLN introduces a block-wise learning module for process-driven distillation that guides peer networks to learn intermediate features and knowledge from each other in an adversarial manner, as shown in Fig. 1. At the same time, the block-wise module also learns from the final layer of the peer network, which often encodes very useful high-level features and information. In addition, the softened class posterior of each network is aligned with the class probabilities of its peer, which works together with a conventional supervised loss under the outcome-driven distillation. By incorporating supervision from both intermediate and final network layers, AMLN can be trained in a single phase, and the trained student models produce better performance than models trained from scratch in a conventional supervised learning setup. Further, AMLN outperforms state-of-the-art online and offline distillation methods consistently. More details are described in the Experiments and Analysis sections.

Fig. 1. Overview of the proposed adversarial-based mutual learning network (AMLN): AMLN achieves process-driven mutual distillation by dividing each peer network into the same number of blocks and employing a discriminator to align the block-wise learned features adversarially. Additionally, the intermediate features are also guided by the peer’s final output for learning high-level features. The outcome-driven learning instead employs the conventional cross-entropy loss (with one-hot labels) and the Kullback-Leibler (KL) loss (with softened labels). Note that this pipeline focuses on the distillation from Network2 to Network1; for distillation from Network1 to Network2, a similar pipeline applies, as highlighted by the dashed lines.

The contributions of this work are thus threefold. First, it designs an innovative adversarial-based mutual learning network, AMLN, that allows an ensemble of peer student networks to transfer knowledge and learn from each other collaboratively. Second, it introduces a block-wise module that guides the peer networks to learn intermediate features and knowledge from each other, which greatly augments the sole outcome-driven peer learning. Third, AMLN does not require pre-training a large teacher network, and extensive experiments over several public datasets show that it achieves superior performance compared to state-of-the-art online/offline knowledge transfer methods.

2 Related Work

2.1 Knowledge Transfer

Knowledge transfer (KT) is one of the most popular methods used in model compression. The early KT research follows a teacher-student learning paradigm in an offline learning manner [5, 12, 23, 30, 34]. In recent years, online KT is developed to strengthen the student’s performance without a pre-trained teacher network [3, 25, 33, 37]. Our work falls into the online KT learning category.

Fig. 2. Four different mutual learning networks: The architectures in (a), (b) and (c) perform mutual learning from the predictions or features of peer networks. The deep mutual learning (DML) [37] in (a) uses the distilled softened prediction of the peer network. The on-the-fly native ensemble (ONE) [25] in (b) creates a teacher with a gating mechanism for the peer network training. The feature fusion learning (FFL) [22] in (c) applies mutual knowledge learning between peer networks and a fused classifier. Unlike these outcome-driven learning architectures, our adversarial-based mutual learning network (AMLN) in (d) uses mutual knowledge distillation between block-wise output features and the final generated predictions, which enhances the performance of each peer network by distilling more multifarious features from peers.

Offline KT aims to improve the efficiency of the student’s learning from scratch by distilling knowledge from a pre-trained powerful teacher network. The work in [5] first uses soft labels for knowledge distillation, and this idea is further improved by adjusting the temperature of the softmax activation function to provide additional supervision and regularization through higher-entropy soft targets [12]. Recently, various new KT systems have been developed to enhance model capabilities by transferring intermediate features [23, 30, 34] or by optimizing the initial weights of student networks [8, 18].

Online KT trains a student model without the requirement of training a teacher network in advance. With online KT, the networks teach each other mutually by sharing their distilled knowledge and imitating the peer network’s performance during the training process. Deep mutual learning (DML) [37] and on-the-fly native ensemble (ONE) [25] are two representative online KT methods that have demonstrated very promising performance, as illustrated in Fig. 2. DML proposes to train the students by mutually exchanging softened classification information using the Kullback-Leibler (KL) divergence loss. Similar to [37], the codistillation method [3] forces student networks to maintain diversity longer by applying the distillation loss only after a sufficient number of burn-in steps. Rather than mutually distilling between peer networks, ONE generates a gated ensemble logit of the networks during training and adopts it as a target to guide each network. In addition, feature fusion learning (FFL) [22] uses a fusion module to combine the feature maps from sub-networks, aiming to enhance the performance of each sub-network.

All the above methods adopt an outcome-driven distillation approach in which distillation across the intermediate network layers is largely neglected. AMLN addresses this issue by further incorporating process-driven distillation, which guides the sharing and transfer of intermediate knowledge beyond the knowledge from the final outputs. Unlike ONE [25], AMLN is also more broadly applicable, as it can work with peer networks of the same or different architectures.

2.2 Adversarial Learning

Generative adversarial learning [13] was proposed to create realistic-looking images from random noise. Its adversarial training scheme consists of a generator network G and a discriminator network D: G learns to synthesize images that fool D, while D is trained to distinguish the real images in the dataset from the fake images generated by G.

To align the intermediate features, which are updated continually at each training iteration, the \(L_{1}\) or \(L_{2}\) distance is not suitable since it measures pixel-level or point-level differences rather than the distributional difference between features. We therefore introduce adversarial learning for online mutual learning among multiple student networks, where each student tends to generate features with a similar distribution to its peer’s by striving to deceive the discriminators, while the discriminators are trained to distinguish the distributions of the features generated by the different peer student networks.

3 Proposed Method

In this section, we describe how to effectively guide the peer-teaching student networks to learn collaboratively with the proposed Adversarial-based Mutual Learning Network (AMLN). Unlike existing online KT methods, AMLN takes into account not only the distillation based on the final prediction, but also the intermediate mutual supervision between the peer networks. We start by giving the architecture overview in Subsect. 3.1, and introduce our novel online process-driven mutual knowledge distillation in Subsect. 3.2. In Subsect. 3.3, we give an explanation of the outcome-driven mutual learning method. Finally, the whole optimization pipeline is presented in Subsect. 3.4.

3.1 The Architecture of AMLN

We formulate our proposed method by considering two peer networks \(S_{1}\) and \(S_{2}\). As illustrated in Fig. 1, \(S_{1}\) and \(S_{2}\) can adopt identical or different architectures, but should have the same number of blocks for intermediate feature alignment. During training, the process-driven mutual knowledge distillation is implemented with a proposed block-wise module that contains a discriminator and an alignment container. Specifically, each network is trained to fool its corresponding block-wise discriminators so that it produces feature maps that mimic those of its peer network. The alignment container is employed to align the block-wise outputs with the peer network’s final feature maps for high-level information distillation. On the other hand, the outcome-driven mutual knowledge distillation is realised by minimizing the divergence between the peer models’ softened output distributions, which encode higher entropy as extra supervision. Moreover, ground-truth labels are used as conventional supervision for task-specific feature learning.
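
To make the block-wise structure concrete, the sketch below (our own illustration, not the authors’ released code) shows how a peer network can be organized so that its block-wise features \(f_{j}^{b}\) and final logits are both exposed for the distillation losses described next. The three-stage split, channel widths and classifier head are illustrative assumptions.

```python
# Illustrative sketch only: a peer network split into three blocks whose
# intermediate features are returned alongside the final logits.
import torch
import torch.nn as nn

class PeerNetwork(nn.Module):
    """Toy backbone divided into three blocks plus a classifier head."""
    def __init__(self, num_classes=100):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU()),
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        feats = []                      # block-wise features f_j^1, f_j^2, f_j^3
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        logits = self.fc(self.pool(x).flatten(1))
        return feats, logits
```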

3.2 Process-Driven Mutual Knowledge Distillation

Given N samples X = \(\{x_{i}\}_{i=1}^{N}\) from M classes, we denote the corresponding label set as Y = \(\{y_{i}\}_{i=1}^{N}\) with \(y_{i}\) \(\in \) \(\{1, 2, ..., M \}\). As can be seen in Fig. 1, the backbone networks are first divided into the same number of blocks according to their depths. The block-wise generated feature is denoted as \(f_{j}^{b}\), where j and b indicate the network index and block index respectively, i.e. \(j=1,2\) and \(b=1,2,3\). Each block is followed by a block-wise training module, including a discriminator \(D_{j}^{b}\) and an alignment container \(C_{j}^{b}\). The discriminator \(D_{j}^{b}\) is formed by three convolution layers with ReLU activations, where the last layer has two neurons and is responsible for identifying the network index j of the injected feature \(f_{j}^{b}\). Each alignment container \(C_{j}^{b}\) applies a depthwise convolution and a pointwise convolution to align the block-wise generated feature \(f_{j}^{b}\) with the peer’s final output \(f_{3-j}^{3}\) for high-level knowledge distillation. Therefore, there are two loss terms for the process-driven mutual learning, one of which is the adversarial-based distilling loss defined as follows:

$$\begin{aligned} L_{D}^{j} = \mathop {min}\limits _{f_{j}^{b}} \mathop {max}\limits _{D} \sum _{b=1}^{3} E_{{f_{j}^{b}\sim P_{S_{j}}}}[1 - D_{j}^{b}(\sigma (f_{j}^{b}))] + E_{f_{3-j}^{b}\sim P_{S_{3-j}}} [D_{j}^{b}(\sigma (f_{3-j}^{b}))] \end{aligned}$$
(1)

Here, \(\sigma \) denotes the convolution kernel that is utilized to reduce the number of channels of \(f_{j}^{b}\), and \(P_{S_{j}}\) corresponds to the feature distribution of the network \(S_{j}\).
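
A minimal sketch of the block-wise discriminator is given below, assuming that \(\sigma \) is a 1 \(\times \) 1 channel-reduction convolution and folding it into the module for brevity; the exact kernel sizes, strides and channel widths are our assumptions rather than the authors’ specification.

```python
# Sketch of D_j^b: three convolution layers with ReLU, where the last layer
# produces two logits that identify which peer network a feature came from.
import torch
import torch.nn as nn

class BlockDiscriminator(nn.Module):
    def __init__(self, in_channels, reduced_channels=16):
        super().__init__()
        # sigma: channel-reduction convolution applied to f_j^b (assumed 1x1 here)
        self.sigma = nn.Conv2d(in_channels, reduced_channels, kernel_size=1)
        self.body = nn.Sequential(
            nn.Conv2d(reduced_channels, reduced_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(reduced_channels, reduced_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(reduced_channels, 2, kernel_size=1),   # two output neurons
        )
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, feature):
        return self.pool(self.body(self.sigma(feature))).flatten(1)   # (B, 2) logits
```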

The other loss term evaluates the distance between the block-wise distilled feature and the peer’s final generated feature, which can be computed as:

$$\begin{aligned} L_{F}^{j} = \sum _{b=1}^{3} d(C_{j}^{b}(f_{j}^{b}), f_{3-j}^{3}) \end{aligned}$$
(2)

where \(C_{j}^{b}\) denotes the alignment container that transforms \(f_{j}^{b}\) into the same shape as \(f_{3-j}^{3}\), and the distance metric d is the \(L_{2}\) distance throughout.
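
The sketch below illustrates one possible form of the alignment container and the feature loss of Eq. 2; the mean squared error stands in for the \(L_{2}\) distance, and the stride/padding must be chosen so that the spatial size of the aligned feature actually matches the peer’s final feature map, which is an implementation assumption on our part.

```python
# Sketch of C_j^b (depthwise + pointwise convolution) and the loss L_F^j of Eq. 2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentContainer(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1, groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, f):
        return self.pointwise(self.depthwise(f))

def feature_alignment_loss(aligned_feats, peer_final_feat):
    """L_F^j: sum over blocks of the distance to the peer's final feature map."""
    return sum(F.mse_loss(a, peer_final_feat) for a in aligned_feats)
```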

The overall process-driven mutual distillation loss function is then formulated with the weight balance parameter \(\beta \) as:

$$\begin{aligned} L_{S_{j}^{P}} = L_{D}^{j} + \beta L_{F}^{j} \end{aligned}$$
(3)

3.3 Outcome-Driven Mutual Knowledge Distillation

For outcome-driven distillation, two loss terms are employed: one is the conventional cross-entropy (CE) loss and the other is the Kullback-Leibler (KL) divergence loss between the softened predicted outputs. The probability of class m for sample \(x_{i}\) given by \(S_{j}\) is computed as:

$$\begin{aligned} p_{j}^{m}(x_{i}) =\frac{ exp(z_{j}^{m}) }{\sum _{m=1}^{M}exp(z_{j}^{m})} \end{aligned}$$
(4)

where \(z^{m}_{j}\) is the predicted output of \(S_{j}\). Thus, the CE loss between the predicted outputs and one-hot labels for \(S_{j}\) can be evaluated as:

$$\begin{aligned} L_{C}^{j} = - \sum _{i=1}^{N} \sum _{m=1}^{M} u(y_{i}, m)log(p_{j}^{m}(x_{i})) \end{aligned}$$
(5)

Here, u is an indicator function, which returns 1 if \(y_{i} = m\) and 0 otherwise.

To improve the generalization performance of the sub-networks on the test data, we use the peer network to generate softened probabilities with a temperature term T. Given \(z_{j}\), the softened probability is defined as:

$$\begin{aligned} \rho _{j}^{m}(x_{i}, T) =\frac{ exp(z_{j}^{m}/T) }{\sum _{m=1}^{M}exp(z_{j}^{m}/T) } \end{aligned}$$
(6)

When \(T = 1\), \(\rho _{j}^{m}\) is identical to \(p_{j}^{m}\). As the temperature T increases, the probability distribution becomes softer, i.e. the probability mass is distributed more evenly and less dominantly across classes. Following [22, 37], we use \(T=3\) consistently in our experiments.

KL divergence is then used to quantify the alignment of the peer networks’ softened predictions as:

$$\begin{aligned} L_{KL}^{j}(\rho _{j} || \rho _{3-j}) = \sum _{i=1}^{N} \sum _{m=1}^{M} \rho _{j}^{m}(x_{i})log\frac{\rho _{j}^{m}(x_{i})}{\rho _{3-j}^{m}(x_{i})} \end{aligned}$$
(7)

The overall outcome-driven distillation loss function \(L_{S^{R}_{j}}\) is formulated as:

$$\begin{aligned} L_{S_{j}^{R}} = L_{C}^{j} + T^{2} \times L_{KL}^{j} \end{aligned}$$
(8)

Since the scale of the gradients produced by the softened distribution is 1/\(T^{2}\) of the original value, we multiply the KL term by \(T^{2}\), following the KD recommendation [12], to ensure that the relative contributions of the ground truth and the softened peer prediction remain roughly unchanged.
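
A minimal sketch of the outcome-driven loss of Eq. 8 is shown below. Detaching the peer’s logits, so that each network is optimized only through its own branch, and averaging the losses over the mini-batch are our implementation assumptions.

```python
# Sketch of L_{S_j^R} = L_C + T^2 * L_KL (Eqs. 5, 7 and 8).
import torch
import torch.nn.functional as F

def outcome_driven_loss(logits, peer_logits, labels, T=3.0):
    ce = F.cross_entropy(logits, labels)                   # L_C^j, Eq. 5
    p = F.log_softmax(logits / T, dim=1)                   # log rho_j (softened)
    q = F.log_softmax(peer_logits.detach() / T, dim=1)     # log rho_{3-j} (softened)
    kl = (p.exp() * (p - q)).sum(dim=1).mean()             # KL(rho_j || rho_{3-j}), Eq. 7
    return ce + (T ** 2) * kl                              # Eq. 8
```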

Algorithm 1. The optimization of AMLN.

3.4 Optimization

Combining both process-driven and outcome-driven distillation loss, the overall loss for each sub-network \(S_{j}\) is as follows:

$$\begin{aligned} L_{S_{j}} = L_{S_{j}^{P}} + L_{S_{j}^{R}} \end{aligned}$$
(9)

The mutual learning strategy in AMLN works in such a way that the peer networks are closely guided and optimized jointly and collaboratively. At each training iteration, we compute the generated features and predictions of the two peer networks, and update both models’ parameters according to Eq. 9. The optimization details are summarized in Algorithm 1.
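
To make the joint update concrete, the sketch below outlines one training iteration in the spirit of Algorithm 1. It is our own simplified approximation rather than the authors’ exact procedure: the block-wise discriminators are shared between the two peers, the min-max objective of Eq. 1 is approximated with a standard label-swapping update, and the peer’s features and logits are detached when used as targets.

```python
# Simplified sketch of one AMLN training iteration (cf. Algorithm 1 and Eq. 9).
# `opt_students` is assumed to cover the parameters of both peer networks and
# their alignment containers; `opt_disc` covers the block-wise discriminators.
import torch
import torch.nn.functional as F

def train_step(net1, net2, containers1, containers2, discriminators,
               opt_students, opt_disc, images, labels, beta=1.0, T=3.0):
    device = images.device
    feats1, logits1 = net1(images)          # block-wise features and logits of S_1
    feats2, logits2 = net2(images)          # block-wise features and logits of S_2

    # Discriminator update: classify which peer network each block feature came from.
    label1 = torch.zeros(len(images), dtype=torch.long, device=device)   # "from S_1"
    label2 = torch.ones(len(images), dtype=torch.long, device=device)    # "from S_2"
    d_loss = sum(F.cross_entropy(D(f1.detach()), label1) +
                 F.cross_entropy(D(f2.detach()), label2)
                 for D, f1, f2 in zip(discriminators, feats1, feats2))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # Student update with Eq. 9: adversarial + feature-alignment + outcome-driven losses.
    total = 0.0
    for feats, peer_feats, logits, peer_logits, containers, fool_label in [
            (feats1, feats2, logits1, logits2, containers1, label2),
            (feats2, feats1, logits2, logits1, containers2, label1)]:
        adv = sum(F.cross_entropy(D(f), fool_label)              # fool D (cf. Eq. 1)
                  for D, f in zip(discriminators, feats))
        feat = sum(F.mse_loss(C(f), peer_feats[-1].detach())     # L_F, Eq. 2
                   for C, f in zip(containers, feats))
        ce = F.cross_entropy(logits, labels)                     # L_C, Eq. 5
        p = F.log_softmax(logits / T, dim=1)
        q = F.log_softmax(peer_logits.detach() / T, dim=1)
        kl = (p.exp() * (p - q)).sum(dim=1).mean()               # L_KL, Eq. 7
        total = total + adv + beta * feat + ce + (T ** 2) * kl   # Eq. 9
    opt_students.zero_grad(); total.backward(); opt_students.step()
    return total.item()
```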

Table 1. The number of parameters (in millions) of the evaluated networks on the CIFAR100 dataset.

4 Experimental Results and Analysis

4.1 Datasets and Evaluation Setups

AMLN is evaluated over three datasets that have been widely used for evaluating knowledge transfer methods. \(\mathbf{CIFAR10} \) [1] and \(\mathbf{CIFAR100} \) [2] are two publicly accessible datasets that have been widely used for image classification studies. The two datasets have 50,000 training images and 10,000 test images of 10 and 100 image classes, respectively. All images in the two datasets are in RGB format with a size of 32 \(\times \) 32 pixels. \(\mathbf{ImageNet} \) [31] refers to the LSVRC 2015 classification dataset, which consists of 1.2 million training images and 50,000 validation images of 1,000 image classes.

Evaluation Metrics: We use the Top-1 and Top-5 mean classification accuracy (%) for evaluation: the former is reported for all studied datasets while the latter is used for ImageNet only. To measure the computation cost at the inference stage, we report the number of floating point operations (FLOPs) and the per-image inference time for efficiency comparison.
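
For reference, a small helper of our own (not from the paper) for computing the Top-1/Top-5 accuracy is sketched below.

```python
# Top-k accuracy (%) from raw logits and integer class labels.
import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    maxk = max(ks)
    _, pred = logits.topk(maxk, dim=1)            # (B, maxk) predicted class indices
    correct = pred.eq(labels.view(-1, 1))         # (B, maxk) boolean hit matrix
    return [100.0 * correct[:, :k].any(dim=1).float().mean().item() for k in ks]
```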

Networks: The evaluation networks in our experiments include ResNet [16] as well as Wide ResNet (WRN) [35] with different network depths. Table 1 shows the number of parameters of the different AMLN-trained network models that are evaluated on CIFAR100.

Table 2. Comparison with online distillation methods DML [37], ONE [25] and FFL [22] over CIFAR10 in (a) and CIFAR100 in (b) with the same network architecture. ‘\(\uparrow \)’ denotes the accuracy increase over ‘vanilla’, ‘Avg’ denotes the average accuracy of Net1 and Net2, and ‘*’ indicates the accuracies reported in [22] under the same network setup.

4.2 Implementation Details

All experiments are implemented in PyTorch on NVIDIA GPU devices. On the CIFAR datasets, the initial learning rate is 0.1 and is multiplied by 0.1 every 200 epochs. We use SGD as the optimizer with Nesterov momentum 0.9 and a weight decay of 1e−4, and the mini-batch size is set to 128. For ImageNet, we use SGD with a weight decay of 1e−4, a mini-batch size of 128, and an initial learning rate of 0.1. The learning rate is decayed by a factor of 0.1 every 30 epochs, and we train for a total of 90 epochs.
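
A sketch of this optimization setting is given below, using a single optimizer over both peer networks for brevity; whether the authors use one optimizer per network is not specified, so this is an assumption.

```python
# Optimizer and learning-rate schedule matching the CIFAR setting described above.
import torch
import torch.nn as nn

def build_optimizer(net1: nn.Module, net2: nn.Module):
    params = list(net1.parameters()) + list(net2.parameters())
    optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9,
                                weight_decay=1e-4, nesterov=True)
    # CIFAR: decay by 0.1 every 200 epochs; for ImageNet use step_size=30 (90 epochs total).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)
    return optimizer, scheduler
```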

4.3 Comparisons with the Online Methods

Comparisons over CIFAR: This section presents the comparison of AMLN with the state-of-the-art mutual learning methods DML [37], ONE [25] and FFL [22] over CIFAR10 and CIFAR100. Since ONE cannot work with peer networks of different architectures, we evaluate both scenarios where the peer networks have the same and different architectures. Tables 2 and 3 show the experimental results, where ‘vanilla’ denotes the accuracy of the backbone networks trained from scratch with the classification loss alone, ‘Avg’ shows the average accuracy of the two peer networks Net1 and Net2, and the column highlighted with ‘*’ represents the values extracted from [22] under the same setup.

Case 1: Peer Networks with the Same Architecture. Tables 2(a) and 2(b) show the Top-1 accuracy over the datasets CIFAR10 and CIFAR100, respectively, when the peer networks have the same architecture. As Table 2 shows, ONE, DML, and FFL all outperform the ‘vanilla’ baseline consistently, though ONE and FFL achieve larger margins of improvement. In addition, AMLN outperforms all three state-of-the-art methods consistently under different network architectures and datasets. Specifically, the average accuracy improvements (across the four groups of peer networks) over DML, ONE and FFL are up to 0.61%, 0.39% and 0.33% for CIFAR10 and 2.28%, 1.32% and 1.08% for CIFAR100, respectively. Further, the performance improvement over the more challenging CIFAR100 is much larger than that over CIFAR10, demonstrating the good scalability and generalizability of AMLN when applied to complex datasets with more image classes.

Table 3. Comparison with online distillation methods DML [37], ONE [25] and FFL [22] over CIFAR10 in (a) and CIFAR100 in (b) with different network architectures.

Case 2: Peer Networks with Different Architectures. This experiment evaluates the peer networks with different architectures WRN-16-2/ResNet32 and WRN-40-2/ResNet56, where the former pair has relatively lower depths. Table 3 shows experimental results. As Table 3(a) shows, the AMLN-trained Net1 and Net2 outperform the same networks trained by ‘DML’ and ‘FFL’ consistently on CIFAR10. For CIFAR100, AMLN-trained Net2 achieves significant improvements of 1.71% (WRN-16-2/ResNet32) and 1.52% (WRN-40-2/ResNet56) over the state-of-the-art method FFL as shown in Table 3(b). The good performance is largely attributed to the complementary knowledge distillation with both process-driven learning and outcome-driven learning which empower the peer networks to learn and transfer more multifarious and meaningful features from each other.

Comparisons over ImageNet: To demonstrate the potential of AMLN to transfer more complex information, we conduct a large-scale experiment over the ImageNet LSVRC 2015 classification task. For a fair comparison, we choose the same peer networks of ResNet34 as in ONE [25] and FFL [22]. Table 4 shows the experimental results. As Table 4 shows, ONE and FFL achieve similar performance to that observed over the CIFAR datasets. Our AMLN method performs better consistently, with 1.09% and 1.06% improvements in the Top-1 accuracy compared with ONE and FFL, respectively. The consistent strong performance over the large-scale ImageNet dataset further demonstrates the scalability of our proposed method.

Table 4. Comparison of Top-1/Top-5 accuracy (%) with online methods ONE [25] and FFL [22] on the ImageNet dataset with the same network architecture (ResNet34). #FLOPs and the per-image inference time are also provided.
Table 5. Comparison results with offline knowledge transfer methods AT [34], KD [12], FT [23], as well as their hybrid methods AT+KD and FT+KD over CIFAR10 (a) and CIFAR100 (b). The results shown in the last 7 columns are from Table 3 of [23], where the ‘vanilla’ column represents the performance of the backbone network trained from scratch and the last five columns are the Top-1 accuracy of Net2 under the guidance of Net1.

4.4 Comparisons with the Offline Methods

Several experiments have been carried out to compare AMLN with state-of-the-art offline knowledge transfer methods including AT [34], KD [12] and FT [23], as well as the hybrid methods AT+KD and FT+KD. Among the compared methods, KD adopts an outcome-driven learning strategy while AT and FT adopt a process-driven learning strategy.

Tables 5(a) and 5(b) show the experimental results over CIFAR10 and CIFAR100, respectively, where Net1 serves as the teacher to empower the student Net2. Three points can be observed from the experimental results: 1) the AMLN-trained student Net2 consistently outperforms those trained by all other offline distillation methods on both CIFAR10 and CIFAR100, regardless of whether Net1 and Net2 are of different types (WRN-40-1/ResNet20), different widths (WRN-16-2/WRN-16-1) or different depths (ResNet110/ResNet20, ResNet110/ResNet56); 2) compared to the ‘vanilla’ teacher Net1 trained from scratch, the AMLN-trained teacher Net1 (mutually learnt with the student Net2) obtains significantly better performance, with 0.43%-1.26% and 3.03%-3.65% improvements on CIFAR10 and CIFAR100, respectively. This shows that small networks with fewer parameters or smaller depths can empower larger networks effectively by distilling useful features; and 3) the AMLN-trained student Net2 even achieves higher accuracy than its corresponding ‘vanilla’ teacher Net1. Specifically, AMLN-trained ResNet56 (0.86M parameters) achieves a classification accuracy 1.70% higher than that of the teacher ResNet110 (1.74M parameters) trained from scratch (in the ResNet110/ResNet56 setup). This shows that a small network trained with proper knowledge distillation can have the same or even better representation capacity than a large network.

4.5 Ablation Study

In AMLN, we move one step forward from previous research by introducing the block-wise module, which consists of mutual adversarial learning (MDL) and intermediate-final feature learning (MFL). We perform ablation studies on the datasets CIFAR10 and CIFAR100 using two identical peer networks ResNet32 to demonstrate the effectiveness of the proposed method. Table 6 shows the experimental results.

Table 6. Ablation study of AMLN with the same peer network ResNet32.

As Table 6 shows, Cases A and E refer to the models trained from scratch and with AMLN, respectively. Case B refers to the network trained with only the outcome-driven losses \(L_{C}\) (Eq. 5) and \(L_{KL}\) (Eq. 7). Including MFL (\(L_{F}\)) in Case C improves the average accuracy by 0.74% and 2.26% on CIFAR10 and CIFAR100, respectively, compared with Case B. The inclusion of MDL (\(L_{D}\)) on top of the outcome-driven losses in Case D introduces significant improvements of 0.86% and 3.18% on CIFAR10 and CIFAR100, respectively. These improvements indicate that MDL has a greater impact on model performance, which is largely attributed to the convolutional structure of the discriminator that can interpret the spatial information in block-wise intermediate features and map the peer model’s features to a similar probability distribution. As expected, AMLN performs best when both the outcome-driven and process-driven losses are included for mutual learning. This demonstrates that the two learning strategies are complementary in achieving better knowledge distillation and transfer between the collaboratively learning peer networks.

Fig. 3. Analysis of AMLN: The graph in (a) shows the training loss over the first 150 epochs of AMLN and a vanilla model. The graph in (b) shows the test error under the guidance of different transfer losses. The graph in (c) shows the loss fluctuation when adding parameter noise \(\alpha \) during the training of AMLN and a vanilla model.

4.6 Discussion

Benefits of Intermediate Supervision. To evaluate the benefit of combining outcome-driven and process-driven learning in the training procedure, we visualize the training loss (over the first 150 epochs) and the test error with the peer networks of ResNet32 on CIFAR100. As illustrated in Fig. 3(a), our model (the purple line) converges faster than the fully trained vanilla model. Compared to other loss combinations, AMLN (the \(L_{C}+L_{KL}+L_{F}+L_{D}\) case) has a relatively lower test error, especially after 400 epochs; see the zoom-in window in Fig. 3(b) for details. In addition, we compare the training loss of the learned models before and after adding Gaussian noise \(\alpha \) to the model parameters. As shown in Fig. 3(c), the training loss of AMLN increases much less than that of the independently trained model after adding the perturbation. These results clearly indicate that process-driven learning improves model stability and that AMLN provides better generalization performance.

Qualitative Analysis. To provide insight into how AMLN consistently contributes to the improved performance, we visualize the heatmaps of the learned features after the last convolution layer for four different networks: AMLN, FFL, ONE and the vanilla model. We use the Grad-CAM [32] algorithm, which visualizes the important regions that the network focuses on, to examine how our model exploits the learned features. Figure 4 shows the Grad-CAM visualizations from each network together with the predicted class of the highest probability. In the first two columns, where all the evaluated models predict the correct class, AMLN detects the object better and with higher confidence. The last four columns show cases where AMLN predicts the correct answer but the others do not. This again demonstrates the superior performance of our proposed online distillation method AMLN, in which process-driven and outcome-driven learning effectively complement each other for multifarious and discriminative feature distillation.

Fig. 4. Comparison of Grad-CAM [32] visualizations of the proposed AMLN with the state-of-the-art methods FFL and ONE as well as the vanilla model, where the peer networks use the same architecture ResNet32. The label under each heatmap is the corresponding predicted class, with the highest prediction probability in parentheses.

5 Conclusion

In this paper, a novel online knowledge distillation method is proposed, namely the adversarial-based mutual learning network (AMLN). Unlike existing methods, AMLN employs both process-driven and outcome-driven mutual knowledge distillation, where the former is conducted by the proposed block-wise module with a discriminator and an alignment container for intermediate supervision from the peer network. Extensive evaluations of the proposed AMLN are conducted on three challenging image classification datasets, where it clearly outperforms state-of-the-art knowledge transfer methods. In future work, we will investigate how to incorporate different tasks to train the peer networks cooperatively, rather than training the peer networks mutually on the same dataset as in this work.