1. INTRODUCTION

This paper considers the problem of distilling deep learning models. Optimizing a deep learning model is computationally challenging [12]. The paper investigates a particular optimization problem called knowledge distillation, which uses both the training sample and the information contained in pretrained models. Knowledge distillation [5] is the problem of optimizing model parameters taking into account not only the information contained in the original sample but also the information contained in a teacher model. The teacher model is a model of high complexity: it stores information about the sample as well as about the distributions of the model parameters to be transferred. A model of simpler structure, called the student model, is optimized by transferring the knowledge of the teacher model.

Table 1. Complexity of various metaparameter and hyperparameter optimization methods. Here \(|\mathbf {w}| \) is the number of model parameters, \(|\boldsymbol {\lambda }|\) is the number of metaparameters, \(r \) is the number of runs of stochastic optimization methods, and \(s \) is the complexity of generation from probabilistic models

We investigate the procedure of optimizing metaparameters in the knowledge distillation problem. Metaparameters are the parameters of the optimization problem; their correct assignment can considerably affect the performance of the final model [11]. Unlike [9, 11], this paper distinguishes between hyperparameters, i.e., the parameters of the prior distribution [4], and metaparameters. Among the metaparameter and hyperparameter optimization methods used in deep learning, such as random search [2] or methods based on probabilistic models [3], many approaches sequentially generate random values of the metaparameters and estimate the quality of the model trained with these values. Such approaches may be unsuitable for models that require substantial training time. Table 1 lists the complexity of various approaches to optimizing metaparameters; it can be seen that if parameter optimization takes a long time, approaches requiring several optimization runs are inefficient.

Fig. 1.

Scheme of the proposed method: instead of directly optimizing the values of the metaparameters \(\boldsymbol {\lambda } \), it is proposed to approximate the optimization trajectory by linear models so as to reach a minimum of the loss function \(\mathcal {L}_\text {val}\) on the validation part of the sample. Random metaparameters are not minimum points of the function \(\mathcal {L}_\text {val} \) and yield suboptimal model performance.

It is proposed to treat the metaparameter optimization problem as a bilevel optimization problem: the first level optimizes the model parameters, and the second level optimizes the metaparameters [1, 8, 9]. A greedy gradient method for solving the bilevel problem is described in [8]. Various gradient methods and random search are analyzed in [1]. In the present paper, we analyze an approach that optimizes the metaparameters with gradient methods and predicts the values they take after further gradient steps. It can be seen from Table 1 that for large problems, gradient metaparameter optimization methods are preferable. However, even with a greedy metaparameter optimization algorithm using a difference approximation, metaparameter optimization remains much more computationally demanding than parameter optimization alone, as demonstrated in [7]. To reduce the optimization costs, in this paper we analyze the metaparameter optimization trajectory and predict its values using linear models. The method is illustrated in Fig. 1; it is evaluated and compared with other metaparameter optimization methods on image samples from CIFAR-10 [6] and Fashion-MNIST [14] and on a synthetic sample.

2. STATEMENT OF THE PROBLEM

We solve a classification problem of the form

$$ \mathfrak {D} = \big \{(\mathbf {x}_i, \mathbf {y}_i)\big \}_{i=1}^{m},\quad \mathbf {x}_i \in \mathbb {R}^n,\quad \mathbf {y}_i \in \mathbb {Y} = \big \{\mathbf {e}_k | k = {1,\dots ,K}\big \}, $$

where \(\mathbf {e}_k\) is the \(k \)th column of the identity matrix and \(\mathbf {y}_i \) is the one-hot vector whose unit component indicates the class of \(\mathbf {x}_i \).

We divide the sample \(\mathfrak {D}\) into two subsets, \( {\mathfrak {D} = \mathfrak {D}_\text {train} \sqcup \mathfrak {D}_\text {val}}.\) The subset \(\mathfrak {D}_\text {train} \) will be used to optimize the model parameters, and the subset \( \mathfrak {D}_\text {val}\) will be used to optimize the metaparameters.

Consider a teacher model \(\mathbf {f}(\mathbf {x}) \) that was trained on the sample \(\mathfrak {D}_\text {train} \). We optimize the student model \(\mathbf {g}(\mathbf {x}, \mathbf {w})\), \(\mathbf {w} \in \mathbb {R}^s \), by transferring the knowledge of the teacher model. Let us define this problem formally.

Definition 1.

Let a function \(D: \;{\mathbb {R}^s \to \mathbb {R}_{+}}\) define the distance between models \(\mathbf {g}\) and \(\mathbf {f} \). A \(D \)-distillation of the student model is an optimization of the student model parameters that minimizes the function \(D \).

We define a loss function \(\mathcal {L}_\text {train} \) that takes into account the transfer of knowledge from the model \( \mathbf {f}\) to the model \(\mathbf {g} \),

$$\begin {aligned} \mathcal {L}_\text {train}(\mathbf {w}, \boldsymbol {\lambda }) &= -\lambda _1\thinspace \sum \limits _{(\mathbf {x}, \mathbf {y}) \in \mathfrak {D}_\text {train}} \;\underbrace {\sum \limits _{k=1}^{K}y_k\thinspace \log \frac {e^{\mathbf {g}(\mathbf {x}, \mathbf {w})_k}}{\sum \limits _{j=1}^{K}e^{\mathbf {g}(\mathbf {x}, \mathbf {w})_j}}}_{\text {classification term}} \\[.6em] &\qquad \qquad \qquad \qquad \qquad {}-(1 - \lambda _1)\thinspace \sum \limits _{(\mathbf {x}, \mathbf {y}) \in \mathfrak {D}_\text {train}} \;\underbrace {\sum \limits _{k=1}^{K}\frac {e^{\mathbf {f}(\mathbf {x})_k/T}}{\sum \limits _{j=1}^{K}e^{\mathbf {f}(\mathbf {x})_j/T}}\thinspace \log \frac {e^{\mathbf {g}(\mathbf {x}, \mathbf {w})_k/T}}{\sum \limits _{j=1}^{K}e^{\mathbf {g}(\mathbf {x}, \mathbf {w})_j/T}}}_{\text {distillation term}},\end {aligned} $$
(1)

where \(y_k\) is the \(k \)th component of the response vector and \(T \) is the temperature parameter in the distillation problem. The temperature \(T\) has the following properties:

  1.

    As \(T \rightarrow 0\), we obtain the unit vector \(\left \{{e^{\mathbf {g}(\mathbf {x}, \mathbf {w})_k/T}}\Big /{\sum \nolimits _{j=1}^{K}e^{\mathbf {g}(\mathbf {x}, \mathbf {w})_j/T}}\right \}_{k=1}^K \).

  2.

    As \(T \rightarrow \infty \), we obtain a vector with equal probabilities.
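Both limits are easy to check numerically. The following minimal sketch (NumPy, with arbitrary logits of our choosing) prints a near-one-hot vector for small \(T\) and a near-uniform vector for large \(T\):

```python
import numpy as np

def tempered_softmax(logits, T):
    """Softmax of the logits divided by temperature T."""
    z = logits / T
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(tempered_softmax(logits, 0.01))   # close to a unit (one-hot) vector
print(tempered_softmax(logits, 1.0))    # ordinary softmax
print(tempered_softmax(logits, 100.0))  # close to the uniform vector (1/K, ..., 1/K)
```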

Let us show that the optimization of \(\mathcal {L}_\text {train} \) is a \(D \)-distillation for \(\lambda _1 = 0 \).

Proposition 1.

If \(\lambda _1 = 0 \), then the optimization of the loss function (1) is a \(D\)-distillation with \(D = D_{KL}\left (\sigma \left (\mathbf {f}(\mathbf {x})/T\right ), \sigma \left (\mathbf {g}(\mathbf {x}, \mathbf {w})/T\right )\right )\), where \( \sigma \) is the softmax function, \( \sigma (\mathbf {x})_k = {e^{x_k}}\big /{\sum \nolimits _{j=1}^Ke^{x_j}} \), and \( D_{KL}\) is the Kullback–Leibler divergence.

Proof. For \(\lambda _1 = 0 \), we have

$$\begin {aligned} \mathcal {L}_\text {train}(\mathbf {w}, \boldsymbol {\lambda }) &= -\sum \limits _{(\mathbf {x}, \mathbf {y}) \in \mathfrak {D}_\text {train}}\thinspace \sum \limits _{k=1}^{K}\frac {e^{\mathbf {f}(\mathbf {x})_k/T}} {\sum \limits _{j=1}^{K}e^{\mathbf {f}(\mathbf {x})_j/T}}\thinspace \log \frac {e^{\mathbf {g}(\mathbf {x}, \mathbf {w})_k/T}}{\sum \limits _{j=1}^{K}e^{\mathbf {g}(\mathbf {x}, \mathbf {w})_j/T}} \\[.3em] &= \sum \limits _{(\mathbf {x}, \mathbf {y}) \in \mathfrak {D}_\text {train}} D_{KL}\Big (\sigma \big (\mathbf {f}(\mathbf {x})/T\big ), \sigma \big (\mathbf {g}(\mathbf {x}, \mathbf {w})/T\big )\Big ) + C. \end {aligned} $$

We conclude that \(\mathcal {L}_\text {train}(\mathbf {w}, \boldsymbol {\lambda })\) is equal to \(D_{KL}\left (\sigma (\mathbf {f}(\mathbf {x})/T), \sigma (\mathbf {g}(\mathbf {x}, \mathbf {w})/T)\right ) \) summed over the sample up to an additive constant \(C \) that does not affect optimization; the constant is the entropy of \( \sigma (\mathbf {f}(\mathbf {x})/T)\) summed over the sample. The function \( D_{KL}\left (\sigma \left (\mathbf {f}/T\right ), \sigma \left (\mathbf {g}/T\right )\right )\) defines the distance between the softened outputs of the models \(\mathbf {f}\) and \(\mathbf {g}\), so the definition of \(D \)-distillation is satisfied. \(\quad \blacksquare \)
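The decomposition used in the proof can be verified numerically for a single object (a sketch with arbitrary logits):

```python
import numpy as np

def softmax(x, T=1.0):
    z = x / T
    e = np.exp(z - z.max())
    return e / e.sum()

# arbitrary teacher and student logits for a single object (our assumption)
rng = np.random.default_rng(0)
f_logits, g_logits, T = rng.normal(size=5), rng.normal(size=5), 2.0
p, q = softmax(f_logits, T), softmax(g_logits, T)

cross_entropy = -(p * np.log(q)).sum()   # distillation term of L_train for one object
kl = (p * np.log(p / q)).sum()           # D_KL(sigma(f/T), sigma(g/T))
entropy = -(p * np.log(p)).sum()         # C: entropy of sigma(f/T), independent of w
assert np.isclose(cross_entropy, kl + entropy)
```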

Let us define the metaparameter vector \(\boldsymbol {\lambda } \), whose components are the coefficient \( \lambda _1\) weighting the terms of \(\mathcal {L}_\text {train}\) and the temperature \(T \),

$$ \boldsymbol {\lambda } = [\lambda _1, T]. $$

Define a bilevel problem

$$ \hat {\boldsymbol {\lambda }} = \arg \min \limits _{\boldsymbol {\lambda } \in \mathbb {R}^2} \mathcal {L}_\text {val}(\hat {\mathbf {w}}, \boldsymbol {\lambda }),\;$$
(2)
$$ \hat {\mathbf {w}} = \arg \min \limits _{\mathbf {w} \in \mathbb {R}^s} \mathcal {L}_\text {train}(\mathbf {w}, \boldsymbol {\lambda }),$$
(3)

where \(\mathcal {L}_\text {val} \) is the validation loss function

$$ \mathcal {L}_\text {val}(\mathbf {w}, \boldsymbol {\lambda }) = - \sum \limits _{(\mathbf {x}, \mathbf {y}) \in \mathfrak {D}_\text {val}}\thinspace \sum \limits _{k=1}^{K}y_k\thinspace \log \frac {e^{\mathbf {g}(\mathbf {x}, \mathbf {w})_k/T_\text {val}}}{\sum \limits _{j=1}^Ke^{\mathbf {g}(\mathbf {x}, \mathbf {w})_j/T_\text {val}}} $$

and the metaparameter \(T_\text {val} \) determines the temperature in the validation loss function. Its value has been chosen manually and is not subject to optimization.
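For concreteness, a single-object sketch of both loss functions is given below (NumPy; the function and argument names are ours, and the sums over \(\mathfrak {D}_\text {train}\) and \(\mathfrak {D}_\text {val}\) are omitted). A full implementation would accumulate these values over the respective subsets.

```python
import numpy as np

def log_softmax(x, T=1.0):
    z = x / T
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def train_loss(g_logits, f_logits, y_onehot, lam1, T):
    """Single-object version of L_train: lam1 weights the classification
    term, (1 - lam1) weights the distillation term at temperature T."""
    classification = -(y_onehot * log_softmax(g_logits)).sum()
    teacher_probs = np.exp(log_softmax(f_logits, T))
    distillation = -(teacher_probs * log_softmax(g_logits, T)).sum()
    return lam1 * classification + (1.0 - lam1) * distillation

def val_loss(g_logits, y_onehot, T_val=1.0):
    """Single-object version of L_val at the fixed temperature T_val."""
    return -(y_onehot * log_softmax(g_logits, T_val)).sum()
```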

3. GRADIENT OPTIMIZATION OF METAPARAMETERS

One way to optimize metaparameters is to use gradient methods. Below we describe the scheme of their application and an approach to approximating the metaparameter optimization trajectory.

Definition 2.

Let us define an optimization operator as an algorithm \(U \) that computes the updated parameter vector \(\mathbf {w}^{\prime } \) of the model from the parameter values \(\mathbf {w} \) at the previous step.

Let us optimize parameters \(\mathbf {w}\) using \(\eta \) optimization steps,

$$ \hat {\mathbf {w}} = U \circ U \circ \dots \circ U(\mathbf {w}_0, \boldsymbol {\lambda }) = U^\eta (\mathbf {w}_0, \boldsymbol {\lambda }),$$

where \(\mathbf {w}_0 \) is the initial value of the parameter vector \(\mathbf {w} \) and \(\boldsymbol {\lambda } \) is the set of metaparameters.

Let us restate the optimization problem using the definition of the operator \(U \),

$$ \hat {\boldsymbol {\lambda }} = \arg \min \limits _{\boldsymbol {\lambda } \in \mathbb {R}^2} \mathcal {L}_\text {val}\bigl (U^\eta (\mathbf {w}_0, \boldsymbol {\lambda })\bigr ).$$
Fig. 2.

Scheme of metaparameter optimization.

We solve the optimization problem (2), (3) using the gradient descent operator,

$$ U(\mathbf {w}, \boldsymbol {\lambda }) = \mathbf {w} - \gamma \nabla \mathcal {L}_\text {train}(\mathbf {w}, \boldsymbol {\lambda }),$$

where \(\gamma \) is the gradient descent step length. To optimize the metaparameters, we use the greedy gradient method that depends on the parameter values \(\mathbf {w} \) at the previous step alone. At each iteration, we obtain the following value of the metaparameters:

$$\boldsymbol {\lambda }^{\prime } = \boldsymbol {\lambda } - \gamma _{\boldsymbol {\lambda }}\nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val}\big (U(\mathbf {w}, \boldsymbol {\lambda }), \boldsymbol {\lambda }\big ) = \boldsymbol {\lambda } - \gamma _{\boldsymbol {\lambda }}\nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val}\big (\mathbf {w} - \gamma \nabla \mathcal {L}_\text {train}(\mathbf {w}, \boldsymbol {\lambda }), \boldsymbol {\lambda }\big ).$$
(4)

In this paper, we use a numerical difference approximation for this optimization procedure [7],

$$\begin {gathered} \frac {d\mathcal {L}_\text {val}(\mathbf {w}^{\prime }, \boldsymbol {\lambda })}{d\boldsymbol {\lambda }} = \nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val}(\mathbf {w}^{\prime }, \boldsymbol {\lambda }) - \gamma \nabla ^2_{\boldsymbol {\lambda }, \mathbf {w}}\mathcal {L}_\text {train}(\mathbf {w}, \boldsymbol {\lambda })\nabla _{\mathbf {w}^{\prime }}\mathcal {L}_\text {val}(\mathbf {w}^{\prime }, \boldsymbol {\lambda }), \\[.03em] \nabla ^2_{\boldsymbol {\lambda }, \mathbf {w}}\mathcal {L}_\text {train}(\mathbf {w}, \boldsymbol {\lambda })\nabla _{\mathbf {w}^{\prime }}\mathcal {L}_\text {val}(\mathbf {w}^{\prime }, \boldsymbol {\lambda }) \approx \frac {\nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {train}(\mathbf {w}^+, \boldsymbol {\lambda }) - \nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {train}(\mathbf {w}^-, \boldsymbol {\lambda })}{2\varepsilon }, \\[.03em] \boldsymbol {\lambda }^{\prime } \approx \boldsymbol {\lambda } - \gamma _{\boldsymbol {\lambda }}\nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val}(\mathbf {w}^{\prime }, \boldsymbol {\lambda }) + \gamma _{\boldsymbol {\lambda }}\gamma \frac {\nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {train}(\mathbf {w}^+, \boldsymbol {\lambda }) - \nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {train}(\mathbf {w}^-, \boldsymbol {\lambda })}{2\varepsilon },\end {gathered} $$

where \(\mathbf {w}^{\prime } = \mathbf {w} - \gamma \nabla \mathcal {L}_\text {train}(\mathbf {w}, \boldsymbol {\lambda }) \), \(\mathbf {w}^\pm = \mathbf {w} \pm \varepsilon \nabla _{\mathbf {w}^{\prime }}\mathcal {L}_\text {val}(\mathbf {w}^{\prime }, \boldsymbol {\lambda })\), and \(\varepsilon \) is a small given constant.
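A sketch of one such update, assuming the required gradients are supplied as callables (in practice they would come from automatic differentiation); the default step lengths are placeholder values:

```python
import numpy as np

def greedy_metaparameter_step(w, lam, grad_w_train, grad_lam_train, grad_w_val,
                              gamma=0.1, gamma_lam=0.25, eps=1e-2):
    """One greedy step (4) with the finite-difference approximation above.

    grad_w_train(w, lam)   -- gradient of L_train in w
    grad_lam_train(w, lam) -- gradient of L_train in lam
    grad_w_val(w, lam)     -- gradient of L_val in w
    L_val has no explicit dependence on lam, so its direct lam-gradient
    is taken to be zero here.
    """
    w_new = w - gamma * grad_w_train(w, lam)        # inner step U(w, lam)
    d = grad_w_val(w_new, lam)                      # direction for the perturbation
    w_plus, w_minus = w + eps * d, w - eps * d
    # finite difference for the mixed second derivative applied to d
    fd = (grad_lam_train(w_plus, lam) - grad_lam_train(w_minus, lam)) / (2 * eps)
    lam_new = lam + gamma_lam * gamma * fd          # hypergradient step on lam
    return w_new, lam_new
```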

To further reduce the optimization costs, it is proposed to approximate the metaparameter optimization trajectory: after every \(e_1 \) gradient iterations, a linear model is fitted to the trajectory and then used to predict the metaparameters over the next \(e_2\) iterations,

$$\boldsymbol {\lambda }^{\prime } = \boldsymbol {\lambda } + \mathbf {c}^{\top }\begin {pmatrix}z\\1\end {pmatrix}, $$
(5)

where \(\mathbf {c}\) is the parameter vector of the linear model, optimized by the least squares method, and \(z \) is the optimization iteration number.
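One reading of (5) as a least-squares fit followed by extrapolation, applied independently to each component of \(\boldsymbol {\lambda }\) (a sketch; the trajectory values below are invented for illustration):

```python
import numpy as np

def predict_trajectory(traj, e2):
    """Fit c = (c1, c2) by least squares to one metaparameter component
    recorded over e1 iterations and extrapolate it e2 iterations ahead."""
    e1 = len(traj)
    z = np.arange(1, e1 + 1)
    c1, c2 = np.polyfit(z, traj, deg=1)      # linear model lambda(z) = c1 * z + c2
    z_future = np.arange(e1 + 1, e1 + e2 + 1)
    return c1 * z_future + c2

# hypothetical trajectory of lambda_1 over e1 = 10 gradient iterations
traj = np.array([0.50, 0.48, 0.47, 0.45, 0.44, 0.42, 0.41, 0.40, 0.38, 0.37])
print(predict_trajectory(traj, e2=10))
```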

The diagram in Fig. 2 describes the resulting optimization method. The model parameters are optimized at the first level of a bilevel optimization problem using the subset \(\mathfrak {D}_\text {train} \) and the loss function \(\mathcal {L}_{\text {train}}\). The metaparameters are optimized at the second level using the subset \(\mathfrak {D}_\text {val} \) and the loss function \(\mathcal {L}_\text {val} \). Over \(e_1 \) iterations, the metaparameters are optimized using the stochastic gradient descent method. Over \(e_2\) iterations, they are predicted using linear models.

Algorithm 1 (optimization of metaparameters: algorithm for the proposed method).

Require: the number \(e_1\) of iterations using gradient optimization.

Require: the number \(e_2\) of iterations with prediction of \( \boldsymbol {\lambda }\) by linear models.

while there is no convergence do

\(\qquad \) optimize \(\boldsymbol {\lambda }\) and \(\mathbf {w} \) over \(e_1 \) iterations by solving the bilevel problem;

\(\qquad \) record \(\textbf {traj}\), the trajectory of \(\boldsymbol {\lambda }\) during these iterations;

\(\qquad \) set \(\mathbf {z} = [1,\dots ,e_1]^\mathsf {T}\);

\(\qquad \) optimize \(\mathbf {c} \) by the least squares method,

$$ \hat {\mathbf {c}} = \arg \min \limits _{\mathbf {c} \in \mathbb {R}^2} \big \|\textbf {traj} - (\mathbf {z}\cdot c_1 + c_2)\big \|_2^2;$$

\(\qquad \) optimize \(\mathbf {w} \) and predict \(\boldsymbol {\lambda } \) over \(e_2 \) iterations using the linear model with parameters \(\hat {\mathbf {c}} \)

end while
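To make the control flow of Algorithm 1 concrete, here is a hedged driver loop; the three callables are assumptions standing in for the steps defined above, and the convergence test is replaced by a fixed number of rounds:

```python
import numpy as np

def optimize_metaparameters(w, lam, bilevel_step, inner_step, predict_trajectory,
                            e1=10, e2=10, n_rounds=5):
    """Sketch of Algorithm 1: alternate e1 gradient iterations of the
    bilevel problem with e2 iterations in which lam is predicted.

    bilevel_step(w, lam)          -> (w, lam)  one step of (4)
    inner_step(w, lam)            -> w         one step on L_train only
    predict_trajectory(traj, e2)  -> (e2,) extrapolated values of one component
    """
    for _ in range(n_rounds):                 # stands in for "while there is no convergence"
        traj = []
        for _ in range(e1):                   # gradient optimization of lam and w
            w, lam = bilevel_step(w, lam)
            traj.append(np.copy(lam))
        traj = np.array(traj)                 # shape (e1, |lambda|)
        # least-squares fit and e2-step prediction, componentwise over lambda
        future = np.stack([predict_trajectory(traj[:, j], e2)
                           for j in range(traj.shape[1])], axis=1)
        for lam_pred in future:               # optimize w under the predicted lam
            lam = lam_pred
            w = inner_step(w, lam)
    return w, lam
```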

The following theorem proves the well-posedness of the proposed approximation for a simple case where the parameters \(\mathbf {w}\) of the model \(\mathbf {g}\) have reached the optimum of problem (3), the Hessian \(\mathbf {H} = \nabla _ {\mathbf {w}}^2 \mathcal {L}_\text{train} \) is the identity matrix, and the metaparameters are optimized in a domain where their gradient can be approximated by a constant. Note that in the general case these conditions are not met when optimizing deep learning models. It was shown in [8, 13] that normalizing the intermediate sample representations produced by the nonlinear functions of a deep learning model brings the Hessian of the loss function closer to the identity matrix. An analysis of the performance of gradient metaparameter optimization for the case in which the model parameters have not reached the optimum can be found in [11].

Theorem 1.

If the function \(\mathcal {L}_{\text {train}}(\mathbf {w}, \boldsymbol {\lambda }) \) is smooth and convex and its Hessian \(\mathbf {H} = \nabla _{\mathbf {w}}^2\mathcal {L}_\text {train}\) is the identity matrix, \({\mathbf {H} = \mathbf {I}}, \) and if, in addition, the parameters \(\mathbf {w} \) are equal to \( \mathbf {w}^*\), where \( \mathbf {w}^*\) is a point of local minimum for the current value of \(\boldsymbol {\lambda } \), then the greedy algorithm (4) finds the optimal solution of the bilevel problem. If there exists a domain \( \mathcal {D} \subset \mathbb {R}^2\) in the metaparameter space in which the gradient with respect to the metaparameters can be approximated by a constant, then the optimization trajectory of the metaparameters in \(\mathcal {D}\) is linear.

Proof. In the paper [11], a formula for \(\nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val} = \nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val}(U(\mathbf {w}, \boldsymbol {\lambda })) \) was obtained for the case in which \(\mathcal {L}_\text {train}(\textbf {w}, \boldsymbol {\lambda }) \) is smooth and convex and a point \(\mathbf {w}^* \) of local minimum was found for the current value of \(\boldsymbol {\lambda }\),

$$ \nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val}(\boldsymbol {\lambda }) = \nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val} - (\nabla _{\textbf {w}, \boldsymbol {\lambda }}^2\mathcal {L}_\text {train})^\top (\nabla _{\textbf {w}}^2\mathcal {L}_\text {train})^{-1}\nabla _{\textbf {w}}\mathcal {L}_\text {val}. $$

This formula is simplified by eliminating the first term, since the function \(\mathcal {L}_\text {val}\) does not explicitly depend on the metaparameters,

$$ \nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val}(\boldsymbol {\lambda }) = - (\nabla _{\textbf {w}, \boldsymbol {\lambda }}^2\mathcal {L}_\text {train})^\top (\nabla _{\textbf {w}}^2\mathcal {L}_\text {train})^{-1}\nabla _{\textbf {w}}\mathcal {L}_\text {val}.$$

If \( \nabla _{\textbf {w}}^2 \mathcal {L}_\text {train} \) is equal to the identity matrix, then the greedy algorithm finds the optimal solution of the bilevel problem if its step is given by the following formula [8]:

$$ \boldsymbol {\lambda }_{t+1} = \boldsymbol {\lambda }_{t} + \eta _1(\nabla _{\textbf {w}, \boldsymbol {\lambda }}^2\mathcal {L}_\text {train})^\top \nabla _{\textbf {w}}\mathcal {L}_\text {val}.$$

where \((\nabla _{\textbf {w}}^2 \mathcal {L}_\text {train})^{-1} \) has also been replaced by the identity matrix.

Let us return to the simplified gradient formula

$$ \nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val}(\boldsymbol {\lambda }) = - (\nabla _{\textbf {w}, \boldsymbol {\lambda }}^2\mathcal {L}_\text {train})^\top \nabla _{\textbf {w}}\mathcal {L}_\text {val}.$$

Assume that there exists a domain \(\mathcal {D} \) in which \(\nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val}(\boldsymbol {\lambda }) \) is approximately equal to a constant vector,

$$ \nabla _{\boldsymbol {\lambda }}\mathcal {L}_\text {val}(\boldsymbol {\lambda }) \approx \begin {pmatrix} a_1\\ a_2\end {pmatrix}.$$

Then the optimization step in \(\mathcal {D}\) can be represented in the form

$$ \boldsymbol {\lambda }_{t+1} = \boldsymbol {\lambda }_{t} - \gamma _{\boldsymbol {\lambda }}\begin {pmatrix} a_1\\ a_2\end {pmatrix} $$

and has a form similar to (5). \(\quad \blacksquare \)

4. COMPUTATIONAL EXPERIMENT

The purpose of the experiment is to evaluate the performance of the proposed distillation method and to analyze the resulting models and their metaparameters. The method is evaluated on a synthetic sample as well as on the CIFAR-10 and Fashion-MNIST samples. Two types of experiments were carried out on the CIFAR-10 sample: on the entire sample, \(|\mathfrak {D}_\text {train}|=50\thinspace 000\), and on a reduced training sample, \( |\mathfrak {D}_\text {train}|=12\thinspace 800 \).

The following metaparameter optimization methods were analyzed:

  1.

    Optimization with no distillation.

  2.

    Optimization with random initialization of metaparameters. Metaparameters are generated from the uniform distribution

    $$ \lambda _1 \sim \mathcal {U}(0,\; 1), \quad T \sim \mathcal {U}(0{.}1,\; 10).$$
  3.

    Optimization with “naive” metaparameter assignment,

    $$ \lambda _1 = 0{.}5,\quad T = 1;$$
  4.

    Gradient optimization.

  5.

    The proposed method with \(e_1=e_2=10.\)

  6.

    Optimization using a probabilistic model. For this type of optimization, we used the Hyperopt library [3], which implements optimization based on the tree-structured Parzen estimator. For this method, five runs were performed before the final prediction of the metaparameters.

The entire training set \(\mathfrak {D}\) was used for methods 1–3. For methods 4–6, the sample was divided into training, validation, and test parts, \( \mathfrak {D} =\mathfrak {D}_\text {train} \sqcup \mathfrak {D}_\text {val} \sqcup \mathfrak {D}_\text {test}\).

The “accuracy” metric was used as an external performance criterion,

$$ \text {accuracy} = \frac {1}{m}\sum \limits _{i=1}^m \big [\mathbf {g}(\mathbf {x}_i, \mathbf {w}) = y_i\big ].$$
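With integer class labels, the criterion can be computed as follows (a minimal sketch with toy values):

```python
import numpy as np

def accuracy(g_logits, labels):
    """Share of objects whose predicted class (argmax of the student's
    output) coincides with the true class label."""
    return float((g_logits.argmax(axis=1) == labels).mean())

# toy check: three objects, K = 2 classes, two correct predictions
print(accuracy(np.array([[2.0, 0.1], [0.3, 1.2], [0.9, 0.4]]),
               np.array([0, 1, 1])))  # 0.666...
```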

For all experiments, the initial values of the metaparameters were generated as follows:

$$ \lambda _1 \sim \mathcal {U}(0,1),\quad \log _{10} T \sim \mathcal {U}(-1, 1). $$

Ten runs were carried out for each experiment, and then the results were averaged. The code for the experiment is available at [15].

Fig. 3.

Model accuracy on samples: (a) synthetic, (b) reduced CIFAR-10. Here and below, the points are slightly shifted with respect to the abscissa axis for better readability of the graphs.

Table 2. Experiment results. The numbers in parentheses are the maximum accuracy values obtained in particular experiments

The final results are presented in Table 2. The dependence of accuracy on the iteration number on a synthetic sample and a reduced version of CIFAR-10 is shown in Fig. 3.

4.1. Experiment on a Synthetic Sample

To evaluate the method obtained, an experiment was conducted on a synthetic sample,

$$\begin {gathered} \mathfrak {D} = \big \{(\mathbf {x}_i, y_i)\big \}_{i=1}^{m}, \quad x_{ij} \sim \mathcal {N}(0, 1),\quad j=1, 2, \\ x_{i3} = \big [\text {sgn}(x_{i1})+\text {sgn}(x_{i2})>0\big ], \quad y_i = \text {sgn}(x_{i1}\cdot x_{i2}+\delta ),\end {gathered}$$

where \(\delta \sim \mathcal {N}(0,\; 0{.}5)\) is the noise. The training sample of the student model is much smaller than that of the teacher model. To demonstrate the proposed method correctly, in this experiment the sample was divided into three parts: a training sample for the teacher model consisting of 200 objects, a training sample for the student model consisting of 15 objects, and a validation set of 200 objects that also serves as the test set, \(\mathfrak {D}_\text {val} =\mathfrak {D}_\text {test}\). The sample visualization is depicted in Fig. 4. The teacher model was trained over 20 000 iterations using the stochastic gradient descent method with step length \(10^{-2} \). A modified feature space was used to train it,

$$ x_{i3} = \big [\text {sgn}(x_{i1})+\text {sgn}(x_{i2}) + 0{.}1 >0\big ]. $$

This modification prevents the teacher model from predicting the training sample precisely. In this case, it is preferable to train the student model using only the distillation term, i.e., \( \lambda _1 = 0\). The student model was trained over 2000 iterations using the stochastic gradient descent method with step length 1.0 and \(T_\text {val} = 0{.}1\).
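A sketch of the sample generation under these formulas; we read \(\mathcal {N}(0,\; 0{.}5)\) as a standard deviation of 0.5 for the noise, and the seeds and function names are ours:

```python
import numpy as np

def make_sample(m, shift=0.0, seed=0):
    """Generate the synthetic sample; shift = 0.1 reproduces the modified
    third feature used for the teacher model."""
    rng = np.random.default_rng(seed)
    x12 = rng.normal(size=(m, 2))
    x3 = (np.sign(x12[:, 0]) + np.sign(x12[:, 1]) + shift > 0).astype(float)
    delta = rng.normal(0.0, 0.5, size=m)          # noise, std 0.5 (our reading)
    y = np.sign(x12[:, 0] * x12[:, 1] + delta)
    return np.column_stack([x12, x3]), y

X_teacher, y_teacher = make_sample(200, shift=0.1, seed=1)  # teacher training sample
X_student, y_student = make_sample(15, seed=2)              # student training sample
X_val, y_val = make_sample(200, seed=3)                     # validation = test sample
```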

Fig. 4.

Sample visualization for (a) teacher model, (b) student model, and (c) test sample.

Fig. 5.

Model accuracy with \(e_1 \) and \(e_2 \) values: (a) \(e_1 = e_2 \), (b) a selection of \(e_2 \) with \(e_1 = 10 \).

A series of experiments was carried out to determine the best values of \(e_1 \) and \(e_2 \). Figure 5a shows the accuracy graph for various values of \(e_1 = e_2 \). Figure 5b shows the accuracy for various values of \(e_2 \) with \(e_1 = 10 \). It can be seen that as \(e_1 \) and \(e_2 \) increase, the quality of approximation of the metaparameter update trajectory decreases.

Figure 3a shows the accuracy of the model for various methods. The best results were obtained for the optimized values of the metaparameters and the proposed method. It can be seen how well the proposed method approximates the optimization of metaparameters in this experiment.

4.2. Experiments on CIFAR-10 and Fashion-MNIST Samples

Both samples were split in proportion 9:1 for training and validation. The stochastic gradient descent method with initial step length 1.0 was used to optimize the model parameters. The step length was multiplied by 0.5 every 10 epochs. The value of \(T_\text {val} \) was set to 1.0.

For the experiment on the CIFAR-10 sample, a pretrained ResNet model [10] was used as the teacher model. A CNN model with three convolutional layers and two fully connected layers was used as the student model.

For the experiments on the reduced sample, the step length for metaparameter optimization was 0.25, and the model was trained for 50 epochs. For the experiment on the full sample, a step length of 0.1 was used, and the model was trained for 100 epochs.

For the experiment on the Fashion-MNIST sample, we used architectures of the student and teacher models similar to the architectures in the experiment on the CIFAR-10 sample. The step length 0.1 was used to optimize the metaparameters, and the model was trained for 50 epochs.

It can be seen from the results in Table 2 that the proposed method and the gradient methods give a high accuracy value. However, the disadvantage of gradient methods is that they get stuck at local minimum points, which results in a much higher variance of results than for other methods. This effect can be seen in Fig. 3 and in Table 2.

5. CONCLUSIONS

The problem of optimizing the parameters of a deep learning model was investigated. A generalization of distillation methods based on gradient optimization of metaparameters was proposed: the model parameters are optimized at the first level, and the metaparameters, which specify the type of the optimization problem, are optimized at the second level. A method was proposed that reduces the computational complexity of gradient metaparameter optimization. The properties of the optimization problem and methods for predicting the trajectory of the model metaparameters were studied. The model metaparameters are the parameters of the distillation optimization problem. The proposed generalization made it possible to distill models with better performance in a smaller number of optimization iterations. The approach was illustrated by computational experiments on the CIFAR-10 and Fashion-MNIST samples and on a synthetic sample. The experiments showed the efficiency of gradient optimization for choosing the metaparameters of the distillation loss function. The possibility of approximating the metaparameter optimization trajectory by a locally linear model was analyzed. In the future, we plan to study the optimization problem further and to analyze the quality of approximating the metaparameter optimization trajectory by more complicated predictive models.