
1 Introduction

Deep learning methods have been achieving state-of-the-art performance, contributing to the rapid development of applications for a variety of tasks such as image classification [11, 23, 41, 43] and object detection [4, 10, 35]. One of the critical problems with such state-of-the-art models is their complexity, which makes them difficult to deploy in real-world applications. In general, there is a trade-off between model complexity and inference performance (e.g., measured as accuracy), and there are three main approaches to making models deployable: 1) designing lightweight models, 2) model compression/pruning, and 3) knowledge distillation. Lightweight models such as MobileNet [14, 38], MnasNet [40] and the YOLO series [33, 34] often sacrifice inference performance to reduce inference time, compared to complex models e.g., ResNet [11] and Mask R-CNN [10]. Model compression and pruning techniques [9, 21] reduce model size by quantizing parameters and pruning redundant neurons, and such methods are covered by Distiller [54], an open-source library for model compression.

In this paper, our focus is on the last category, knowledge distillation, which trains a simpler (student) model to mimic the behavior of a powerful (teacher) model. Knowledge distillation [13] stems from the study by Buciluǎ et al. [3], which presents a method to compress large, complex ensembles into smaller models with little loss in inference performance. Interestingly, Ba and Caruana [2] report that student models trained to mimic the behavior of the teacher models (soft labels) significantly outperform those trained on the original (hard-label) dataset. Following these studies, knowledge distillation and transfer have been attracting attention from research communities such as computer vision [36] and natural language processing [39].

Table 1. Knowledge distillation frameworks. torchdistill supports modules in PyTorch and torchvision such as losses, datasets and models. ImageNet: ILSVRC 2012 [37], YT Faces: YouTube Faces DB [47], MIT Scenes: Indoor Scenes dataset [32], CUB-2011: Caltech-UCSD Birds-200-2011 [45], Cars: Cars dataset [18], SOP: Stanford Online Products [27]. P: Pretrained models, M: Module abstraction, D: Distributed training.

As summarized in Table 1, some researchers voluntarily publish their knowledge distillation frameworks, e.g., [12, 28, 29, 42, 49, 52], to help other researchers reproduce their original studies. However, such frameworks are usually neither well generalized nor maintained well enough to be built on. Besides, Distiller [54] supports only one method for knowledge distillation, and Catalyst [17] is a framework built on PyTorch with a focus on reproducibility of deep learning research. To support various deep learning methods, these frameworks are well generalized, yet they require users to hard-code (reimplement) critical modules such as models and datasets, even if the implementations are publicly available in popular libraries, to design complex knowledge distillation experiments. As pointed out by Gardner et al. [6], reference methods and models are often reimplemented from scratch, and this makes it difficult to reproduce the reported results. To further advance deep learning research, a new generalized framework is therefore needed, and it should allow researchers to easily try different modules (e.g., models, datasets, loss configurations), implement various approaches, and take care of the reproducibility of their work.

The concept of our framework, torchdistill (Footnote 1), is highly inspired by AllenNLP [6], a platform built on PyTorch [30] for research on deep learning methods in natural language processing. Similar to AllenNLP, torchdistill supports the following features:

  • module abstractions that enable researchers to write higher-level code for experiments e.g., model, dataset, optimizer and loss;

  • declarative PyYAML configuration files, which can be seen as high-level summaries of experiments (training and evaluation), support anchors and aliases to refer to the same objects (e.g., file paths), keep the files concise, and make it easy to change the abstracted components and hyperparameters; and

  • generalized reference code and configurations to apply knowledge distillation methods to PyTorch and torchvision models pretrained on well-known complex benchmark datasets: ImageNet (ILSVRC 2012) [37] and COCO 2017 [22].

Furthermore, torchdistill supports 1) seamless multi-stage training, 2) caching teacher's outputs, and 3) redesigning (pruning) teacher and student models without hard-coding (reimplementation). To the best of our knowledge, this is the first highly generalized open-source framework that can support a variety of knowledge distillation methods and lower barriers to high-quality, reproducible deep learning research [8]. Researchers can explore methods and shape new approaches, building on this generalized framework that makes it easy not only to customize existing methods and models, but also to introduce completely new ones. Using some of our reimplemented methods, we also reproduce the experimental results on the ILSVRC 2012 and COCO 2017 datasets reported in the original studies.

2 Framework Design

Our developed framework, torchdistill, is an open-source framework dedicated to knowledge distillation studies, built on PyTorch [30]. For vision tasks such as image classification and object detection, the framework is designed to support torchvision, which offers many options for datasets, model architectures and common image transformations. The collection of reference models and datasets supported by our framework depends on the version of the user's installed torchvision. For instance, when users find new models in the latest torchvision, they can quickly try those models simply by updating torchvision and the configuration files for their experiments with our framework.

2.1 Module Abstractions

An objective of module abstractions in our framework is to enable researchers to experiment with various modules by simply changing a PyYAML configuration file described in Sect. 2.3. We focus abstraction on the modules critical to experiments, specifically model architectures, datasets, transforms, and losses to be minimized during training. These modules are often hard-coded (see Appendix A) in the authors' published frameworks [12, 28, 29, 42, 49, 52], and many of the hyperparameters are hard-coded as well.

Model Architectures: torchvision offers various model families for vision tasks, from AlexNet [20] to R-CNNs [10, 35], and many of them are pretrained on large benchmark datasets. Specifically, the latest release (v0.8.2) provides about 30 image classification models pretrained on ImageNet (ILSVRC 2012) [37] and 4 object detection models pretrained on COCO 2017 [22]. As our framework supports torchvision for vision tasks, researchers can use such pretrained models as teacher and/or baseline models (e.g., a student trained without a teacher). In addition to the pretrained models available in torchvision, they can use their own pretrained model weights and any model architectures implemented with PyTorch. Moreover, torchdistill supports PyTorch Hub (Footnote 2) and enables users to import modules via the hub by specifying repository names in a PyYAML configuration file.
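
For illustration, such pretrained models can be instantiated as teachers or baselines with a few lines of plain PyTorch; torchdistill drives this kind of instantiation from its configuration files rather than code. The sketch below uses the torchvision v0.8.2 API and a PyTorch Hub call as examples.

import torch
from torchvision import models

# Teacher: ResNet-34 pretrained on ImageNet (ILSVRC 2012), as provided by torchvision (v0.8.2 API)
teacher = models.resnet34(pretrained=True)

# Student/baseline: ResNet-18, randomly initialized and to be trained with the teacher's outputs
student = models.resnet18(pretrained=False)

# The same teacher can also be imported via PyTorch Hub by specifying a repository name
teacher_hub = torch.hub.load('pytorch/vision:v0.8.2', 'resnet34', pretrained=True)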

Datasets: As described above, torchvision also supports a variety of datasets, and previous studies [1, 12, 16, 24, 28, 29, 31, 36, 42, 44, 46, 50, 52] use many of them to validate proposed distillation techniques, such as ImageNet [37], COCO [22], CIFAR-10 and -100 [19], and Caltech101 [5]. Similar to model architectures, torchdistill supports such datasets and can work with any dataset implemented with PyTorch.

Transforms: In vision tasks, there are de facto standard image transform techniques. Taking image classification on the ImageNet dataset as an example, a standard transform pipeline for training with torchvision (Footnote 3) consists of 1) making a crop of random size and random aspect ratio relative to the original image, 2) horizontal reflection with 50% chance for data augmentation to reduce the risk of overfitting [20], 3) PIL-to-Tensor conversion, and 4) channel-wise normalization using (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225) as means and standard deviations, respectively. In torchdistill, users can define their own transform pipeline in a configuration file.
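
For reference, this standard pipeline corresponds to the following torchvision transforms (a minimal sketch mirroring torchvision's ImageNet reference training code).

from torchvision import transforms

# 1) random resized crop, 2) horizontal flip with 50% chance,
# 3) PIL-to-Tensor conversion, 4) channel-wise normalization
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])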

Losses: In the distillation process, student models are trained using outputs from teacher models, and the research community has proposed many unique losses, with or without task-specific losses such as cross entropy loss for classification tasks. PyTorch [30] supports various loss classes/functions, and simple distillation losses can be defined in a configuration file by combining such supported losses using torchdistill's customizable loss module (see Sect. 2.6).
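
As an example, a standard knowledge distillation loss [13] can be assembled from loss modules and functions available in PyTorch. The sketch below is not torchdistill's implementation; the temperature and balancing weight are assumed values for illustration.

import torch.nn as nn
import torch.nn.functional as F

class KDLoss(nn.Module):
    """Knowledge distillation loss: weighted sum of cross entropy on hard labels
    and KL divergence between temperature-softened teacher/student logits."""
    def __init__(self, temperature=4.0, alpha=0.9):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce = nn.CrossEntropyLoss()

    def forward(self, student_logits, teacher_logits, targets):
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            F.softmax(teacher_logits / self.temperature, dim=1),
            reduction='batchmean') * (self.temperature ** 2)
        hard_loss = self.ce(student_logits, targets)
        return self.alpha * soft_loss + (1.0 - self.alpha) * hard_loss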

2.2 Registry

The registry is an important component in torchdistill, as abstracted modules are instantiated by mapping strings in the configuration file to objects in code. Furthermore, it makes it easy for users to plug their own modules/functions into the framework. Similar to AllenNLP [6] and Catalyst [17], this can be done even outside the framework by using a Python decorator. The following example shows that a new model class, MyModel, is added to the framework simply by using @register_model (defined in the framework), and the new class can be instantiated by defining "MyModel" with required parameters at designated places in a configuration file.

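A minimal sketch of such a registration is shown below; only the @register_model decorator, the MyModel class name, and the 'conv1' module path are taken from the description in this paper, while the registry internals and the layer choices are illustrative assumptions.

import torch.nn as nn

# A simplified registry; in torchdistill the dictionary and decorator are provided
# by the framework, so users only need to apply @register_model to their class.
MODEL_DICT = dict()

def register_model(cls):
    MODEL_DICT[cls.__name__] = cls
    return cls

@register_model
class MyModel(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)  # module path 'conv1'
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(32, num_classes)

    def forward(self, x):
        x = self.pool(self.conv1(x)).flatten(1)
        return self.fc(x)

# The registered class can then be instantiated from a string in a configuration file,
# e.g., MODEL_DICT['MyModel'](num_classes=100)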

2.3 Configurations

An experiment can be defined by a PyYAML configuration file (see Appendix B), which allows users to tune hyperparameters and change methods/models without hard-coding. With PyYAML's features, configuration files allow users to leverage anchors and aliases, which help simplify the configurations when users would like to reuse values defined in the configuration file, such as the root directory path for datasets, or parameters and model names as part of checkpoint file paths for better data management. In a configuration file, there are three main components to be defined: datasets, teacher and student models, and training. Each of the key components is defined by using the abstracted and registered modules described in Sects. 2.1 and 2.2. A configuration file gives users a summary of the experiment, and shows all the parameters needed to reproduce the experimental results, except implicit factors such as the hardware specifications used for the experiment.

The following example illustrates how a global teacher model is declared in a PyYAML configuration file. As described in the previous sections, various types of modules are abstracted in our framework, and such modules (classes and functions) in the user's installed torchvision are registered. In this example, the 'resnet34' function (Footnote 4) is used to instantiate an object of type ResNet by using a dictionary of keyword arguments (**params), i.e., num_classes = 1000 and pretrained = True are given as arguments of the 'resnet34' function. For image classification models implemented in torchvision, or those users add to the registry in our framework, users can easily try different models by changing 'resnet34' to, e.g., 'densenet201' [15] or 'mnasnet1_0' [40]. Besides that, ckpt indicates the checkpoint file path, which is './resnet34.pt' in this example and is defined by leveraging some of YAML's features: anchors (&) and aliases (*). For the teacher model, the checkpoint will be used to initialize the model with the user's own model weights if the checkpoint file exists. Otherwise, 'resnet34' in this example will be initialized with torchvision's pretrained weights for ILSVRC 2012.

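A minimal sketch of such a teacher entry is given below, parsed with PyYAML for illustration. The 'resnet34' name, the num_classes and pretrained parameters, the ckpt path, and the anchor/alias usage follow the description above; the surrounding key names are illustrative and not necessarily torchdistill's exact schema.

import yaml

# Illustrative only: key names are not necessarily torchdistill's exact schema.
# The anchor (&) defines the checkpoint path once; the alias (*) reuses it elsewhere in the file.
config_str = """
teacher_model:
  name: 'resnet34'
  params:
    num_classes: 1000
    pretrained: True
  ckpt: &teacher_ckpt './resnet34.pt'
test:
  teacher_ckpt: *teacher_ckpt
"""

config = yaml.safe_load(config_str)
assert config['test']['teacher_ckpt'] == './resnet34.pt'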

Furthermore, torchdistill offers an option to generate log files that monitor the experiments. For instance, a log file records what parameters were used, when the experiment was executed, the trends of training behavior (e.g., training loss, learning rate and validation accuracy) at a frequency set in the configuration file, and the evaluation results.

These configuration and log files (Footnote 5) will also help researchers complete the ML Code Completeness Checklist (Footnote 6), which was recently proposed to facilitate reproducibility in the research community as part of the official code submission process at major machine learning conferences, e.g., NeurIPS, ICML and CVPR.

2.4 Dataset Wrappers

To support a wide variety of knowledge distillation methods, the dataset is an important module to be generalized. Usually, the dataset module in PyTorch and torchvision returns a pair of input batch (e.g., collated image tensors) and targets (ground truth) at each iteration, but some existing knowledge distillation approaches require additional information in the batch. For instance, contrastive representation distillation (CRD) [42] requires an efficient strategy to retrieve a large number of negative samples during training, which requires the dataset module to return an additional object (e.g., negative sample indices). To support such extensions, we design dataset wrappers to return the input batch, targets, and a supplementary dictionary, which can be empty when not used. For the above case, the additional object can be stored in the supplementary dictionary and used when computing the contrastive loss. This design also enables us to support caching the teacher model's outputs against data indices in the original dataset: the teacher's outputs are cached (serialized) per data index at the first epoch, and at the following epochs the teacher's inference is skipped by reading and collating the cached outputs for a batch of data indices.
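
A minimal sketch of such a wrapper is shown below; the class name and the content of the supplementary dictionary are illustrative rather than torchdistill's actual implementation.

from torch.utils.data import Dataset

class DatasetWrapper(Dataset):
    """Wraps an existing dataset so that each item also carries a supplementary
    dictionary (e.g., sample indices for CRD's negative sampling, or keys for
    looking up cached teacher outputs)."""
    def __init__(self, org_dataset):
        self.org_dataset = org_dataset

    def __len__(self):
        return len(self.org_dataset)

    def __getitem__(self, index):
        sample, target = self.org_dataset[index]
        supp_dict = {'index': index}  # left empty when no extra information is needed
        return sample, target, supp_dict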

Fig. 1. Knowledge distillation and FitNet methods. Yellow and blue modules indicate that their parameters are frozen and trainable, respectively (Color figure online).

Table 2. Epoch-level training speed improvement by caching teacher’s outputs at the 1st epoch, using ResNet-18 as student model for knowledge distillation [13].

To demonstrate that caching improves training efficiency, we perform an experiment with knowledge distillation [13], illustrated in Fig. 1a, that caches outputs of the teacher model at the first epoch for training ResNet-18 (student) on the ILSVRC 2012 dataset, and skips the teacher model's inference by loading the outputs cached on disk and feeding them to the loss module. Table 2 suggests that, by spending an extra minute at the 1st epoch to serialize the teacher's outputs, the caching strategy makes the following training process (i.e., from the 2nd epoch) approximately 1.23 – 2.11 times faster at epoch level when using 3 NVIDIA GeForce RTX 2080 Ti GPUs with a batch size of 256. This improvement becomes more significant when using a larger teacher model such as ResNet-152 (approximately 2.11 times faster than training without cache). The ILSVRC 2012 training dataset consists of approximately 1.3 million images, and the cached files consume only about 10 GB whereas the original training dataset uses about 140 GB. Note that caching may not improve training efficiency if the teacher's outputs to be cached are much larger, e.g., hint-based training [36] requires intermediate outputs from teacher and student models. Also, this mode should be turned off when applying data augmentation strategies.
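
The caching strategy can be sketched as follows; the file layout and helper name are assumptions for illustration, not the framework's internals.

import os
import torch

def forward_teacher_with_cache(teacher, x, indices, cache_dir, epoch):
    """Run the teacher only at the first epoch and serialize its outputs per sample index;
    from the second epoch on, load and collate the cached outputs instead."""
    os.makedirs(cache_dir, exist_ok=True)
    if epoch == 0:
        with torch.no_grad():
            outputs = teacher(x)
        for i, idx in enumerate(indices):
            torch.save(outputs[i].cpu(), os.path.join(cache_dir, f'{int(idx)}.pt'))
        return outputs
    cached = [torch.load(os.path.join(cache_dir, f'{int(idx)}.pt')) for idx in indices]
    return torch.stack(cached).to(x.device)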

Fig. 2. Factor transfer with two auxiliary modules.

2.5 Teacher and Student Models

Teacher-student pairs are key components in knowledge distillation experiments, and recently proposed approaches [1, 12, 31, 36, 42, 49, 50, 52, 53] introduce auxiliary modules, which are used only during training. Such auxiliary modules use tensors from intermediate layers in the models, and introducing the modules often results in branching their feedforward paths, as shown in Figs. 1 and 2. This paradigm, however, is also one of the reasons researchers decide to hard-code the models to introduce the auxiliary modules used in their proposed methods (e.g., modifying the original implementations of models in torchvision every time they change the placement of auxiliary modules for preliminary experiments), which makes it difficult for other researchers to build on the published frameworks [12, 28, 29, 42, 49, 52].

Taking advantage of the forward hook paradigm in PyTorch [30], torchdistill supports introducing such auxiliary modules without altering the original implementations of the models. Specifically, users can register the framework's provided forward hooks to specific modules to store their inputs and/or outputs in an I/O dictionary by specifying the module paths (e.g., "conv1" for a MyModel object in Sect. 2.2) in the configuration files. The I/O dictionaries for teacher and student models will be fed to a generalized, customizable loss module described in Sect. 2.6.
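
Under the hood, this relies on PyTorch's register_forward_hook; the sketch below captures the essence of storing a module's input and output in an I/O dictionary keyed by module path (the helper name is illustrative).

from functools import partial

def register_io_hook(model, module_path, io_dict):
    """Attach a forward hook to the module at `module_path` (e.g., 'conv1' or 'layer4.1')
    so that its input and output are stored in `io_dict` at every forward pass."""
    io_dict[module_path] = {}

    def hook(module, inputs, output, key=None):
        io_dict[key]['input'] = inputs[0]
        io_dict[key]['output'] = output

    module = dict(model.named_modules())[module_path]
    return module.register_forward_hook(partial(hook, key=module_path))

# Example: io_dict = {}; handle = register_io_hook(student_model, 'conv1', io_dict)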

For methods that not only require extracting intermediate outputs (see Fig. 1) but also feed the extracted outputs to trainable auxiliary modules in different branches (see Fig. 2b), we define a special module in the framework, which is designed to have a post-forward function. In Fig. 1, for instance, the framework first executes ResNet-18 and extracts an intermediate output with a registered forward hook, and then the extracted output stored in the student's I/O dictionary is fed to the regressor as part of the post-forward process. The concept of the special module gives users more flexibility in designing training methods while leaving the original implementations of the models (ResNet-34 and ResNet-18 in Fig. 2) unaltered.
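
A hypothetical sketch of such a special module is shown below: the wrapper executes the original (unaltered) model in its forward pass, and its post-forward function feeds a hook-captured tensor to an auxiliary module such as the FitNet regressor. Class and attribute names are illustrative, not torchdistill's actual interface.

import torch.nn as nn

class SpecialModule(nn.Module):
    """Wraps the original model and an auxiliary module; the auxiliary branch is
    executed in `post_forward` on tensors captured by forward hooks."""
    def __init__(self, org_model, auxiliary_module, io_dict, module_path):
        super().__init__()
        self.org_model = org_model                 # e.g., unaltered ResNet-18
        self.auxiliary_module = auxiliary_module   # e.g., FitNet regressor
        self.io_dict = io_dict
        self.module_path = module_path

    def forward(self, x):
        return self.org_model(x)

    def post_forward(self):
        # Feed the hook-captured intermediate output to the trainable auxiliary branch
        hidden = self.io_dict[self.module_path]['output']
        self.io_dict['auxiliary'] = {'output': self.auxiliary_module(hidden)}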

2.6 Customizable Loss Module

Leveraging the I/O dictionaries that contain the inputs/outputs of specific modules with registered forward hooks, torchdistill provides a generalized, customizable loss module that allows users to easily combine different loss modules with balancing factors via configuration files, such as those in Fig. 2b. Given a pair of input x and ground truth y, the I/O dictionaries consist of a set of keys J and values \(z_{j}^{\text {S}}\) and \(z_{j}^{\text {T}}\) (\(j \in J\)) extracted from student and teacher models, respectively. Using the I/O dictionaries and the ground truth, the generalized loss is defined as

$$\begin{aligned} \mathcal {L} = \sum _{j \in J} \lambda _{j} \cdot \mathcal {L}_{j}(z_{j}^{\text {S}}, z_{j}^{\text {T}}, y), \end{aligned}$$
(1)

where \(\lambda _{j}\) is a balancing weight (hyperparameter) for \(\mathcal {L}_{j}\), which is either a loss module implemented in PyTorch [30] or a user-defined loss module in the registry.
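
A minimal sketch implementing Eq. (1) follows; the constructor arguments are illustrative rather than torchdistill's exact interface. A term such as the KDLoss sketch in Sect. 2.1 fits the (student, teacher, target) signature used here.

import torch.nn as nn

class GeneralizedCustomLoss(nn.Module):
    """Weighted sum of loss terms, each reading student/teacher tensors
    from the I/O dictionaries by key (Eq. (1))."""
    def __init__(self, loss_modules, weights):
        super().__init__()
        self.loss_modules = nn.ModuleDict(loss_modules)  # key j -> L_j
        self.weights = weights                           # key j -> lambda_j

    def forward(self, student_io_dict, teacher_io_dict, targets):
        total = 0.0
        for key, loss_module in self.loss_modules.items():
            total = total + self.weights[key] * loss_module(
                student_io_dict[key], teacher_io_dict[key], targets)
        return total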

For instance, the loss function to train the student model on the ILSVRC 2015 dataset [37] at the 2nd stage of factor transfer (Fig. 2b) can be defined as:

$$\begin{aligned} \mathcal {L} = \lambda _{\text {cls}}&\cdot \mathcal {L}_{\text {cls}}(z_{\text {cls}}^{\text {S}}, z_{\text {cls}}^{\text {T}}, y) + \lambda _{\text {FT}} \cdot \mathcal {L}_{\text {FT}}(z_{\text {FT}}^{\text {S}}, z_{\text {FT}}^{\text {T}}, y)\\ \nonumber&\mathcal {L}_{\text {cls}}(z_{\text {cls}}^{\text {S}}, z_{\text {cls}}^{\text {T}}, y) = \text {CrossEntropyLoss}(z_{\text {cls}}^{\text {S}}, y) \\ \nonumber&\mathcal {L}_{\text {FT}}(z_{\text {FT}}^{\text {S}}, z_{\text {FT}}^{\text {T}}, y) = \left\Vert \frac{z_{\text {FT}}^{\text {S}}}{\left\Vert z_{\text {FT}}^{\text {S}}\right\Vert _{2}} - \frac{z_{\text {FT}}^{\text {T}}}{\left\Vert z_{\text {FT}}^{\text {T}}\right\Vert _{2}}\right\Vert _{p}, \end{aligned}$$
(2)

where \(\lambda _{\text {cls}} = 1\), \(\lambda _{\text {FT}} = 1,000\) and p = 1, following [16].
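
For reference, the \(\mathcal {L}_{\text {FT}}\) term of Eq. (2) with p = 1 can be sketched as follows; the batch-mean reduction is an assumption.

import torch
import torch.nn.functional as F

def factor_transfer_loss(student_factor, teacher_factor, p=1):
    """L_FT: p-norm distance between L2-normalized (flattened) student/teacher factors."""
    s = F.normalize(student_factor.flatten(1), p=2, dim=1)
    t = F.normalize(teacher_factor.flatten(1), p=2, dim=1)
    return torch.norm(s - t, p=p, dim=1).mean()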

2.7 Stage-wise Training Configuration

In the previous sections, we have described the main features of torchdistill and which modules are configurable in the framework. We emphasize that all the training configurations described above can be defined stage-wise.

Seamless Multi-stage Training Configurations: Specifically, the framework is designed to enable users to configure critical components such as 1) the number of epochs, 2) training and validation datasets, 3) teacher and student models, 4) modules (layers) to be trained/frozen, 5) the optimizer, 6) the learning rate scheduler, and 7) the loss module. These components can be re-defined at each training stage; otherwise, the framework reuses those from the previous stage. Notice that these training configurations can be declared in a configuration file, and this design enables the framework to support not only two-stage training strategies [12, 16, 36, 50], but also more complicated distillation methods such as teacher assistant knowledge distillation (TAKD) [26], which trains teacher assistants (TAs) to fill the gap between student and teacher models. Transfer learning can also be supported by changing models and datasets from stage to stage, and users execute code with a configuration file only once; they therefore do not need to execute code multiple times to perform multi-stage training, including transfer learning.
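
A hypothetical stage-wise configuration is sketched below; the key names and values are illustrative only and do not necessarily match torchdistill's actual schema.

import yaml

# Hypothetical two-stage schedule: stage1 trains with a hint-style loss for a few epochs,
# stage2 switches the loss and the number of epochs; components that are not re-defined
# at a stage are reused from the previous stage.
config_str = """
train:
  stage1:
    num_epochs: 5
    criterion: 'HintLoss'
  stage2:
    num_epochs: 15
    criterion: 'KDLoss'
"""
stages = yaml.safe_load(config_str)['train']
print(list(stages.keys()))  # ['stage1', 'stage2']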

Redesigning Models for Efficient Training: Furthermore, our framework gives users an option to redesign teacher and student models at each stage by specifying the required modules in a configuration file. Specifically, users are allowed to rebuild models by reusing modules in the models, optionally with auxiliary modules. Figure 1 shows an example in which the modules after the 8th and the 5th blocks of the teacher and student models, respectively, can be pruned, as their outputs are not used in hint-training (1st stage) and thus do not need to be executed. In this specific case, the redesigned student model consists of the trainable (blue) modules and a regressor (auxiliary module), as illustrated in Fig. 3, and the teacher and student architectures at the 2nd stage are reverted to the original ones (Fig. 1a) with the parameters learnt at the 1st stage. Also, the redesigned teacher/student model can be an empty module to save execution time. In Fig. 2a, for instance, there is no need to feed the input batch to the student model (which thus can be empty), as at the 1st stage of factor transfer only the teacher model is executed to train the paraphraser.

Fig. 3. Hint-training with teacher and student models pruned simply by specifying the required modules in a configuration file for further efficient training, compared to the naive configuration in Fig. 1.

As introduced in Sect. 2.4, when the teacher's outputs are cacheable (e.g., in terms of available disk space), the teacher's inference can be skipped by loading the cache files produced at the previous epoch. Redesigning models helps users shorten training sessions even when the teacher's outputs are not cacheable. Note that the student model's outputs, however, cannot be cached, as the model's parameters are updated at every iteration. Table 3 suggests that redesigning models using only the modules to be executed for training is an effective approach to saving training time, and this improvement would be even more critical when training models on large datasets and/or for many epochs. We emphasize that users can redesign (minimize) the models by specifying the required modules in a configuration file rather than hard-coding (reimplementing) the pruned models.
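
As an illustration of this kind of pruning in plain PyTorch (not torchdistill's mechanism itself), the sketch below truncates torchvision ResNet models after a chosen block and attaches a regressor, roughly mirroring Fig. 3; the cut points and the regressor shape are assumptions.

import torch.nn as nn
from torchvision import models

def truncate_resnet(model, module_names):
    """Keep only the listed top-level child modules (in order) and drop the rest."""
    children = dict(model.named_children())
    return nn.Sequential(*[children[name] for name in module_names])

# Teacher and student pruned after their 'layer3' blocks; a 1x1-conv regressor maps the
# student's features to the teacher's channel dimension (assumed shapes).
kept = ['conv1', 'bn1', 'relu', 'maxpool', 'layer1', 'layer2', 'layer3']
teacher = truncate_resnet(models.resnet34(pretrained=True), kept)
student_body = truncate_resnet(models.resnet18(pretrained=False), kept)
student = nn.Sequential(student_body, nn.Conv2d(256, 256, kernel_size=1))  # body + regressor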

3 Reference Methods

Here, we describe the reimplementations of knowledge distillation methods and experiments to reproduce the reported results on ImageNet and COCO datasets.

Table 3. Epoch-level training speed improvement by redesigning teacher and student (ResNet-18) models with required modules only for hint-training shown in Fig. 3.

3.1 Reimplementations

Given that the pretrained models in torchvision are trained on large benchmark datasets, ImageNet (ILSVRC 2012) [37] and COCO 2017 [22], we focus our implementations on these datasets, as the pretrained models can be used as teacher models and/or baseline student models (naively trained on human-annotated datasets). Note that some of the methods are not validated on these datasets in their original work.

Table 4 shows a brief summary of the reference distillation methods reimplemented with torchdistill, and indicates what additional modules were implemented and added to the registry to reimplement the methods. We emphasize that methods without any check marks (✓) in the "Required additional modules" columns, such as KD, AT, PKT, RKD, HND, SPKD, Tf-KD, GHND and \(L_2\), can be reimplemented simply by adding the new loss modules to the registry in the framework (Sect. 2.2).

Different from the existing frameworks [12, 28, 29, 42, 49, 52], all the methods in Table 4 are reimplemented independently of the models in torchvision so that users can easily switch models by specifying a model name and its parameters in a configuration file. Taking image classification as an example, the shapes of the inputs and (intermediate) outputs of the models are often fixed (e.g., \(3 \times 224 \times 224\) and 1,000 respectively, for models trained on the ImageNet dataset), which makes it easy to match the shape of the student's output with that of the teacher when computing loss values to be minimized.

3.2 Reproducing ImageNet Experiments

In this section, we attempt to reproduce some experimental results with their proposed distillation methods. In particular, we choose the attention transfer (AT) [52], factor transfer (FT) [16], contrastive representation distillation (CRD) [42], teacher-free knowledge distillation (Tf-KD) [51], self-supervised knowledge distillation (SSKD) [49], \(L_2\) and prime-aware adaptive distillation (PAD-\(L_2\)) [53] methods for the following reasons:

  • these methods are validated with the ImageNet dataset for ResNet-34 and ResNet-18 as teacher and student models in their original work (Footnote 7);

  • the hyperparameters used in the ImageNet experiments are described in the original studies and/or their published source code; and

  • we did not have time to tune hyperparameters for other methods that are not validated on the ImageNet dataset in their original papers.

In addition to these methods, we apply knowledge distillation (KD) [13] to the same teacher-student pair. Note that, except for KD (Footnote 8), we reuse the hyperparameters (e.g., number of epochs) for ImageNet given in the original work to reproduce their experimental results, and we provide the configuration and log files, and trained model weights (see Footnote 5).

Table 4. Reference knowledge distillation methods implemented in torchdistill.

We should also note that Zagoruyko and Komodakis [52] propose attention transfer (AT) and define the following total loss function for their ImageNet experiment:

$$\begin{aligned} \mathcal {L}_{AT} = \mathcal {L}(\textsc {W}_{S}, x) + \frac{\beta }{2}\sum _{j \in \mathcal {I}} \left\Vert \frac{Q_{S}^{j}}{\left\Vert Q_{S}^{j}\right\Vert _{2}} - \frac{Q_{T}^{j}}{\left\Vert Q_{T}^{j}\right\Vert _{2}}\right\Vert _{p}, \end{aligned}$$
(3)

where \(\mathcal {L}(\textsc {W}_{S}, x)\) is a standard cross entropy loss, and \(Q_{S}^{j}\) and \(Q_{T}^{j}\) denote the vectorized forms of the j-th pair of student and teacher attention maps, respectively (refer to their work [52] for more details). In their published framework (Footnote 9), they set \(\beta \) and p to 1,000 and 2, respectively. However, we find a discrepancy between their defined loss function (Eq. (3)) and their implemented loss function (Eq. (4)), which computes the mean squared error (MSE) between the teacher and student attention maps.

$$\begin{aligned} \mathcal {L}_{AT} = \mathcal {L}(\textsc {W}_{S}, x) + \frac{\beta }{2}\sum _{j \in \mathcal {I}} MSE\Bigl (\frac{Q_{S}^{j}}{\left\Vert Q_{S}^{j}\right\Vert _{2}}, \frac{Q_{T}^{j}}{\left\Vert Q_{T}^{j}\right\Vert _{2}}\Bigr ) \end{aligned}$$
(4)

In our preliminary experiment with the hyperparameters the authors provide, the student model did not train well with the loss module based on Eq. (3). For this reason, we used Eq. (4) instead for AT in our experiments.
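
The attention term of Eq. (4) can be sketched as follows, with attention maps computed by summing squared activations over the channel dimension (one of the mappings proposed in the AT work); the \(\beta /2\) weight is omitted.

import torch.nn.functional as F

def attention_map(feature_map):
    """Vectorized attention map: channel-wise sum of squared activations, L2-normalized."""
    return F.normalize(feature_map.pow(2).sum(dim=1).flatten(1), p=2, dim=1)

def at_loss(student_feature, teacher_feature):
    """Eq. (4)-style term: MSE between normalized student/teacher attention maps."""
    return F.mse_loss(attention_map(student_feature), attention_map(teacher_feature))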

Table 5. Validation accuracy of ResNet-18 (student) trained on ILSVRC 2012 dataset with ResNet-34 (teacher), using eight different distillation methods. With the hyperparameters (e.g., # Epochs) either described in the original work or given by the authors, all the reimplemented methods outperform the student model trained without teacher.

Table 5 summarizes the results of the experiments with the training configurations (e.g., teacher-student pair, hyperparameters) described in each of the original studies and/or verified by the authors. In addition to experiments with a single GPU, we perform experiments with a distributed training strategy supported by PyTorch (reported with a dagger mark †) to demonstrate that our framework supports the strategy for saving training time. As for the \(L_2\) and PAD-\(L_2\) methods, the original study [53] uses a batch size of 512 for its ImageNet experiments, which did not fit on our single GPU. Thus, we split the batch into 171 samples per GPU, and report only the results with distributed training (marked with ‡). The same strategy is applied to SSKD (total batch sizes of 256 and 768 for normal and augmented samples, respectively [49]), as it takes at least 4 times as long at epoch level to train a model compared to the other methods due to the 4x augmented training data, and our batch size per GPU is 85 normal samples + 255 augmented samples. Similarly, we apply the same strategy to CRD due to the limited time. We also note that Zhang et al. [53] applied their proposed PAD-\(L_2\) to the student model trained with their proposed \(L_2\) as a pretrained model, and trained the student model with the PAD-\(L_2\) method for 30 more epochs (i.e., 120 epochs in total) (Footnote 10).

Based on the methods we reimplemented with torchdistill, we successfully reproduce the results on the ILSVRC 2012 dataset for the teacher-student pair reported in the original papers of the AT [52], Tf-KD [51], \(L_2\) and PAD-\(L_2\) [53] methods, and the result of PAD-\(L_2\) was recently reported as the state-of-the-art performance for this teacher-student pair on the ILSVRC 2012 dataset [53]. All the results outperform the baseline performance (S: ResNet-18), which is trained with human labels only and whose pretrained model is provided by torchvision. Note that FT was validated on the ILSVRC 2015 dataset in its original work [16], and we confirm FT's improvement over the baseline using the ILSVRC 2012 dataset, as the teacher model (ResNet-34) in torchvision is pretrained on that dataset. The result with the reimplemented CRD is almost comparable to the accuracy reported in the original study [42]. In CRD, both positive and negative samples are leveraged for learning representations, which turns out to make it the most time-consuming method in Table 5. The reimplemented SSKD outperforms the baseline model, although the accuracy does not match the reported result [49]. A potential factor may be a different training configuration forced by our limited computing resources (e.g., a different batch size per GPU, whereas 8 parallel GPUs were used in their work), since we simply refactored the authors' published code and made it compatible with the ILSVRC 2012 dataset. As pointed out by Tian et al. [42], KD [13] is still a powerful method. Our reimplemented KD outperformed their proposed state-of-the-art method, CRD (71.17%), and achieved accuracy comparable to their CRD+KD (71.38%) method.

3.3 Reproducing COCO Experiments

To demonstrate that our framework can 1) be applied to different tasks, and 2) work with model architectures that are not implemented in torchvision, we apply the generalized head network distillation (GHND) to bottleneck-injected R-CNN object detectors for split computing [25], using the COCO 2017 dataset. The bottleneck-injected Faster and Mask R-CNNs with ResNet-50 and FPN are designed to be partitioned into head and tail models, which are deployed on a mobile device and an edge server, respectively, to reduce inference time in resource-constrained edge computing systems. Following the original work on GHND, we apply the method to a pair of the original and bottleneck-injected Faster R-CNNs as teacher and student, respectively, and conduct the same experiment for Mask R-CNN as well. As shown in Table 6, the reproduced mean average precision (mAP) values match those reported in the original study [25].

Table 6. Validation mAP of bottleneck-injected R-CNN models for split computing (student) trained on COCO 2017 dataset by GHND with original Faster/Mask R-CNN models (teacher). Reproduced results match those reported in the original work [25].

4 Conclusions

In this work, we presented torchdistill, an open-source framework dedicated to knowledge distillation studies, which supports efficient training and a configuration system designed to give users a summary of their experiments. Researchers can build on the framework (e.g., by forking the repository) to conduct their knowledge distillation studies, and their studies can be integrated into the framework by sending a pull request. This will help the research community ensure the reproducibility of their work and advance deep learning research, while supporting fair method comparison on benchmarks. Specifically, researchers can publish the log, configuration, and pretrained model weights for their champion performance, which will help them ensure the champion performance for specific datasets and teacher-student pairs.

Furthermore, the configuration files for, and the log files produced by, torchdistill will help researchers complete the ML Code Completeness Checklist (see Footnote 6), and we provide the full configurations (hyperparameters), log files and checkpoints, including model weights, for the experimental results shown in Tables 5 and 6 in our code repository (see Footnote 1). We provide reference code and configurations for image classification and object detection tasks, and plan to extend our framework to different tasks using popular packages, e.g., Transformers [48] for NLP tasks. Our framework will be maintained and updated along with new releases of PyTorch and torchvision so that users can save coding time and use it as a standard framework for reproducible knowledge distillation studies.