1 Introduction

Deep neural networks (DNNs) are known to be overparameterized (Allen-Zhu et al., 2019) as they usually have more learnable parameters than needed for a given learning task. Consequently, a trained DNN contains many ineffectual parameters that can be safely pruned or zeroed out with little or no effect on its performance.

Pruning (LeCun et al., 1989) is an approach to alleviate the overparameterization of a DNN by identifying and removing its ineffectual parameters while preserving its predictive accuracy on the validation/test dataset. Pruning is typically applied to the DNN after training to speed up test-time evaluation and/or to deploy the trained model on resource-constrained devices (e.g., mobile phones, cameras). For standard image classification tasks with the MNIST, CIFAR-10, and ImageNet datasets, it can reduce the number of learnable parameters by \(50\%\) or more while maintaining model performance (Han et al., 2015; Li et al., 2017; Molchanov et al., 2017; Lin et al., 2019).

In particular, the overparameterization of a DNN also means that considerable training cost is wasted on DNN elements (e.g., connection weights, neurons, or convolutional filters) which turn out to be ineffectual after training and could thus have been safely pruned early. This problem is further compounded by the development of ever-larger network architectures which are very expensive to train. These observations motivate the need for early pruning.

The objective of early pruning is to prune during training so as to reduce training cost while minimizing the loss of test-time model accuracy under a fixed final network sparsity (e.g., determined by the resource constraints of the deployment devices). This necessitates considering both test-time model accuracy and training cost, as both metrics are highly desirable to users. Related works on pruning during training (Lym et al., 2019) as well as prune-and-regrow approaches (Bellec et al., 2018; Dettmers & Zettlemoyer, 2019; Mostafa & Wang, 2019) do not jointly consider both competing metrics while offering a fixed final network sparsity. Similarly, pruning at initialization (de Jorge et al., 2021; Lee et al., 2019; Tanaka et al., 2020; Wang et al., 2020a) reduces training cost but overly sacrifices test-time model accuracy by pruning at initialization, when the test-time efficacy of network elements is not well known. Therefore, previous work does not offer a mechanism to trade off training cost against test-time model accuracy according to user-defined sparsity objectives. See Sect. 2 for a detailed literature review.

To offer such a trade-off mechanism, several problems must be addressed. Firstly, early pruning requires minimizing the loss of test-time accuracy while pruning during training, when the test-time efficacy of network elements is unknown. Pruning decisions therefore need to be based on the inferred test-time efficacy of network elements. How to infer this efficacy for making accurate early pruning decisions is an important question that has not been addressed by related work. Secondly, building an inference model requires collecting efficacy observations during training. A more accurate model requires collecting more observations, which increases the DNN training cost; conversely, pruning with few observations incurs a low training cost but sacrifices model performance due to inaccurate efficacy inference. Given this trade-off, when should a predicted poorly performing element be pruned? Finally, as both training cost and test-time accuracy are important metrics to end users, how should these metrics be balanced while addressing the above challenges? These important questions have not been addressed by related works and are the focus of this work.

To answer these questions, this work models the efficacy of network elements during training and performs early pruning based on the predicted element efficacy upon convergence and the confidence of that prediction. A network element is pruned once a sufficiently high degree of confidence is reached regarding its poor performance. To naturally trade off training cost against test-time performance, we formulate early pruning as a constrained optimization problem and propose an efficient algorithm for solving it. The trade-off is achieved by modulating the degree of confidence required before a poorly performing element is pruned. Pruning with a high degree of confidence makes fewer mistakes, yet requires collecting more observations, which increases the training cost. Conversely, pruning with a low degree of confidence makes more mistakes, yet requires fewer observations, and thus less training cost. The specific contributions of this work include:

  • Posing early pruning as a constrained optimization problem to minimize the loss of test-time model accuracy while pruning during training under a fixed final network sparsity (Sect. 4.1);

  • Proposing to infer the efficacy of DNN elements using a multi-output Gaussian process (MOGP) which models the belief of element efficacy conditioned upon efficacy measurements collected during training. This approach not only identifies poorly performing elements, but also provides a measure of confidence by assigning a probabilistic belief to its prediction (Sect. 4.2);

  • Designing a Bayesian early pruning (BEP) algorithm that allows trading off training cost against test-time performance (Sect. 4.3). To the best of our knowledge, this is the first algorithm that naturally satisfies a fixed final network sparsity while dynamically achieving the training time versus performance trade-off. Existing works either require a fixed pruning schedule or cannot achieve a fixed final network sparsity without tuning parameters.

  • Demonstrating strong performance improvements of our BEP algorithm when a large percentage of the network is pruned. This improvement is significant as DNNs continue to grow in size and may require significant pruning to allow training in a short time frame (Sect. 5).

Pruning typically relies on a measure of network element efficacy, termed saliency (LeCun et al., 1989). The development of saliency functions is an active area of research with no clear optimal choice. To accommodate this, our algorithm is agnostic, and therefore flexible, to the choice of saliency function. We use BEP to prune neurons and convolutional filters and demonstrate its ability to capture the training cost versus performance trade-off.

2 Related work

2.1 Pruning and related techniques

Initial works in DNN pruning center around saliency-based pruning after training, including Skeletonization (Mozer & Smolensky, 1988), Optimal Brain Damage and follow-up work (Hassibi & Stork, 1992; LeCun et al., 1989), as well as sensitivity-based pruning (Karnin, 1990). In recent years, saliency functions have been adapted to pruning neurons or convolutional filters. Li et al. (2017) define a saliency function on convolutional filters using the \(L_1\) norm. Molchanov et al. (2017) propose using a first-order Taylor series approximation of the objective function as a saliency measure. Dong et al. (2017) propose layer-wise pruning of weight parameters using a Hessian-based saliency measure. Several variants of pruning after training exist. Han et al. (2015) propose iterative pruning, where pruning is performed in stages alternating with fine-tune training. Guo et al. (2016) suggest dynamic network surgery, where pruning is performed on-the-fly during evaluation time. The works of Li et al. (2017) and He et al. (2018) propose reinforcement learning for pruning decisions. A comprehensive overview may be found in Gale et al. (2019).

Knowledge distillation (Hinton et al., 2015; Lu et al., 2017; Tung & Mori, 2019; Yim et al., 2017) aims to transfer the capabilities of a trained network into a smaller network. Weight sharing (Nowlan & Hinton, 1992; Ullrich et al., 2017) and low-rank matrix factorization (Denton et al., 2014; Jaderberg et al., 2014) aim to compress the parameterization of neural networks. Network quantization (Courbariaux et al., 2015; Hubara et al., 2017; Micikevicius et al., 2018) uses lower-fidelity representations of network elements (e.g., 16-bit) to speed up training and evaluation. Our work is orthogonal to network quantization as we reduce overall training FLOPs.

2.2 Initialization time or training time pruning

Pruning at initialization builds on the work of Frankle and Carbin (2019), which shows that a randomly initialized DNN contains a small subnetwork which, if trained by itself, yields performance equivalent to the original network. SNIP (Lee et al., 2019) and GraSP (Wang et al., 2020a) propose pruning connection weights prior to the training process through first-order and second-order saliency functions, respectively. This technique is improved upon by IterSnip (de Jorge et al., 2021) and SynFlow (Tanaka et al., 2020). PruneFromScratch (Wang et al., 2020b) considers pruning at initialization in order to reduce training cost. In comparison to our approach, the above works on initialization-time pruning do not offer a mechanism to trade off network training time against test-time accuracy.

Dynamic sparse reparameterization considers pruning and regrowing parameter weights during the training process (Bellec et al., 2018; Dettmers & Zettlemoyer, 2019; Liu et al., 2020; Mostafa & Wang, 2019). Sparse evolutionary training (Mocanu et al., 2018) proposes initializing networks with a sparse topology prior to training. Dai et al. (2019) propose a grow-and-prune approach to learning network architecture and connection layout. Other works (Louizos et al., 2018; Narang et al., 2017) propose pruning using heuristics such as \(L_0\) regularization for DNNs and recurrent neural networks. However, as all the above-mentioned works prune individual connection weights, they cannot be utilized to deliver training-time improvements, and they do not offer a principled approach to capture the training cost versus performance trade-off.

PruneTrain (Lym et al., 2019) also proposes pruning filters during training to reduce training cost while minimizing degradation to performance. In contrast to our approach, PruneTrain does not allow specification of the desired network size after training. A specified network size is important when training for resource-constrained devices such as mobile phones or edge devices, which require networks to conform to user-specified size limits. It is unclear how to solve the early pruning problem using PruneTrain; we compare with PruneTrain under the early pruning problem definition in Sect. 5.2. Other works formulate model compression during training as a constrained optimization problem within a general framework encompassing pruning, quantization, and low-rank decomposition (Idelbayev et al., 2021a; Idelbayev & Carreira-Perpiñán, 2021b). We differ from these works as our focus is on minimizing the loss of test-time model accuracy while reducing training cost through pruning during training. We also offer a mechanism to trade off the desirable metrics of test-time model accuracy and training cost.

3 Preliminaries of pruning

Consider a dataset of D training examples \({\mathcal {X}} = \{{\textbf{x}}_1,\ldots ,{\textbf{x}}_D\}, {\mathcal {Y}} = \{y_1,\ldots ,y_D\}\) and a neural network \({\mathcal {N}}_{{\varvec{v}}_t}\) parameterized by a vector of M prunable network elements (e.g., weight parameters, neurons, or convolutional filters) \({\varvec{v}}_t \triangleq [v^a_t]_{a=1,\ldots ,M}\), where \({\varvec{v}}_t\) represents the network elements after t iterations of stochastic gradient descent (SGD) for \(t = 1, \ldots , T\). Let \({\mathcal {L}}({\mathcal {X}}, {\mathcal {Y}}; {\mathcal {N}}_{{\varvec{v}}_t})\) be the loss function for the neural network \({\mathcal {N}}_{{\varvec{v}}_t}\). Pruning aims at refining the network elements \({\varvec{v}}_t\) given some user-specified sparsity budget B while preserving the accuracy of the neural network after convergence (i.e., \({\mathcal {N}}_{{\varvec{v}}_T}\)), which can be stated as a constrained optimization problem (Molchanov et al., 2017):

$$\begin{aligned} \mathop {\min }_{{{\varvec{m}}} \in \{0,1\}^{M}} |{\mathcal {L}}({\mathcal {X}}, {\mathcal {Y}}; {\mathcal {N}}_{{\varvec{m}}\odot {\varvec{v}}_T}) - {\mathcal {L}}({\mathcal {X}}, {\mathcal {Y}}; {\mathcal {N}}_{{\varvec{v}}_T})|\quad \text {s.t.} \quad ||{{\varvec{m}}} ||_0 \le B \end{aligned}$$
(1)

where \(\odot\) is the Hadamard product and \({{\varvec{m}}}\) is a pruning mask. Note that we abuse the Hadamard product for notational simplicity: for \(a = 1, \ldots , M\), \(m^a \times v^a_T\) corresponds to pruning \(v^a_T\) if \(m^a = 0\), and keeping \(v^a_T\) otherwise. Pruning a network element refers to zeroing the network element or the weight parameters which compute it. Any weight parameters which reference the output of the pruned network element are also zeroed since the element outputs a constant 0.
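To make the mask semantics concrete, the following minimal NumPy sketch (the function name and shape conventions are our own, purely illustrative) applies a binary mask to a fully connected layer, zeroing both the weights that compute each pruned neuron and the downstream weights that reference its output:

```python
import numpy as np

def apply_neuron_mask(W_in, W_out, mask):
    """Apply a binary pruning mask m to a fully connected layer.

    W_in  : (n_neurons, n_inputs) weights computing each neuron's output
    W_out : (n_next, n_neurons)   downstream weights reading those outputs
    mask  : (n_neurons,)          m^a = 1 keeps neuron a, m^a = 0 prunes it
    """
    W_in = W_in * mask[:, None]    # zero the weights that compute pruned neurons
    W_out = W_out * mask[None, :]  # zero weights referencing their constant-0 outputs
    return W_in, W_out
```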

The above optimization problem is difficult due to the NP-hardness of combinatorial optimization. This leads to the approach of using a saliency function s which measures the efficacy of network elements at minimizing the loss function. A network element with small saliency can be pruned since it is not salient/important in minimizing the loss function. Consequently, pruning can be done by maximizing the saliency of the network elements given the user-specified sparsity budget B:

$$\begin{aligned} \mathop {\max }_{{{\varvec{m}}} \in \{0,1\}^M} \sum _{a=1}^{M} m^a s(a; {\mathcal {X}}, {\mathcal {Y}}, {\mathcal {N}}_{{\varvec{v}}_T}, {\mathcal {L}}) \quad \text {s.t.} \quad ||{{\varvec{m}}} ||_0 \le B \end{aligned}$$
(2)

where \(s(a; {\mathcal {X}}, {\mathcal {Y}}, {\mathcal {N}}_{{\varvec{v}}_T}, {\mathcal {L}})\) measures the saliency of \(v_T^a\) at minimizing \({\mathcal {L}}\) after convergence through T iterations of SGD. The above optimization problem can be efficiently solved by selecting the B most salient network elements in \({\varvec{v}}_T\).
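As a sketch, this greedy solution to (2) amounts to a top-B selection over precomputed saliency values (NumPy; names are illustrative):

```python
import numpy as np

def greedy_prune_mask(saliency, B):
    """Solve (2) exactly: keep the B most salient of the M network elements."""
    m = np.zeros(saliency.shape, dtype=int)
    m[np.argsort(saliency)[::-1][:B]] = 1   # indices of the top-B saliency values
    return m
```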

The construction of the saliency function has been discussed in many existing works: some approaches derive the saliency function from first-order (LeCun et al., 1989; Molchanov et al., 2017) and second-order (Hassibi & Stork, 1992; Wang et al., 2020a) Taylor series approximations of \({\mathcal {L}}\). Other common saliency functions include the \(L_1\) (Li et al., 2017) or \(L_2\) (Wen et al., 2016) norm of the network element weights, as well as mean activation (Polyak & Wolf, 2015). In this work, we use a first-order Taylor series approximation saliency function defined for neurons and convolutional filters (Molchanov et al., 2017). Due to the first-order (i.e., gradient-based) approximation, this saliency function has minimal memory and computational overhead during DNN training. However, our approach remains flexible to an arbitrary choice of saliency function on a plug-and-play basis.
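For concreteness, here is a sketch of one common variant of this first-order Taylor criterion for convolutional filters, assuming the feature maps and their loss gradients were captured during a backward pass (the array layout is an assumption on our part):

```python
import numpy as np

def taylor_saliency(feature_maps, feature_grads):
    """First-order Taylor saliency per filter (a variant of Molchanov et al., 2017).

    feature_maps, feature_grads : (batch, n_filters, H, W) activations z and dL/dz.
    Returns (n_filters,) scores |E[z * dL/dz]| over batch and spatial positions.
    """
    contrib = feature_maps * feature_grads      # elementwise z * dL/dz
    return np.abs(contrib.mean(axis=(0, 2, 3)))
```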

For ease of reference, we summarize notations that will be used frequently in the remaining sections of this paper in Table 1.

Table 1 Summary of important notations

4 Bayesian early pruning

4.1 Problem statement

As mentioned above, existing saliency-based pruning works typically prune after training convergence (i.e., (2)) to speed up test-time evaluation on resource-constrained devices, which wastes considerable time training network elements that are eventually pruned. To resolve this issue, we extend the pruning problem definition (2) along the temporal dimension, allowing network elements to be pruned during the training process consisting of T iterations of SGD.

Let \(s^a_t \triangleq s(a; {\mathcal {X}}, {\mathcal {Y}}, {\mathcal {N}}_{{\varvec{v}}_t}, {\mathcal {L}})\) be a random variable which denotes the saliency of network element \(v^a_t\) after t iterations of SGD, \({\varvec{s}}_t \triangleq [s^a_t]_{a = 1, \ldots , M}\) for \(t = 1, \ldots , T\), and \({\varvec{s}}_{\tau _1:\tau _2} \triangleq [{\varvec{s}}_t]_{t=\tau _1, \ldots , \tau _2}\) be a matrix of saliency of all the network elements between iterations \(\tau _1\) and \(\tau _2\). Our early pruning algorithm is designed with the goal of maximizing the saliency of the unpruned elements after iteration T, yet allowing for pruning at each iteration t given some computational budget \(B_{t, c}\) for \(t = 1, \ldots , T\):

$$\begin{aligned} \rho _{T}({{\varvec{m}}}_{T-1}, B_{T,c}, B_s) \triangleq \mathop {\max }_{{{\varvec{m}}}_T} {{\varvec{m}}}_T \cdot {\varvec{s}}_T \end{aligned}$$
(3a)
$$\begin{aligned} \text {s.t.}\quad ||{{\varvec{m}}}_T ||_0 \le B_s \end{aligned}$$
(3b)
$$\begin{aligned} {{\varvec{m}}}_{T} \ {\dot{\le }}\ {{\varvec{m}}}_{T-1} \end{aligned}$$
(3c)
$$\begin{aligned} B_{T,c} \ge 0 \end{aligned}$$
(3d)
$$\begin{aligned} \rho _{t}({{\varvec{m}}}_{t-1}, B_{t,c}, B_s) \triangleq \mathop {\max }_{{{\varvec{m}}}_t} {\mathbb {E}}_{p({\varvec{s}}_{t+1}\mid {\tilde{{{\textbf{s}}}}}_{1:t})} \big [\rho _{t+1}({{\varvec{m}}}_t, B_{t,c}-||{{\varvec{m}}}_t ||_0, B_s) \big ] \end{aligned}$$
(4a)
$$\begin{aligned} \text {s.t.}\quad {{\varvec{m}}}_{t} \ {\dot{\le }}\ {{\varvec{m}}}_{t-1} \end{aligned}$$
(4b)

where \(B_s\) is the user-specified sparsity budget of the trained network, \({\tilde{{{\textbf{s}}}}}_{1:t}\) is a matrix of observed values for \({\varvec{s}}_{1:t}\), \({\varvec{m}}_0\) is an M-dimensional all-ones vector, and \({\varvec{m}}_t \ {\dot{\le }} \ {\varvec{m}}_{t-1}\) represents an element-wise comparison between \({\varvec{m}}_t\) and \({\varvec{m}}_{t-1}\): \(m_t^a \le m_{t-1}^a\) for \(a = 1, \ldots , M\). At each iteration t, the saliency \({\varvec{s}}_t\) is observed, and \({\varvec{m}}_t \in \{0,1\}^M\) in (4a) represents a pruning decision performed to maximize the expectation of \(\rho _{t+1}\) over the predictive belief of \({\varvec{s}}_{t+1}\) conditioned upon the saliency measurements \({\tilde{{{\textbf{s}}}}}_{1:t}\) collected up to and including iteration t. This recursive structure terminates with the base case \(\rho _T\), where the saliency of the unpruned elements is maximized after T iterations of training.

In the above early pruning formulation, constraints (3c) and (4b) ensure that pruning is performed in a practical manner whereby once a network element is pruned, it can no longer be recovered in a later training iteration. We define a sparsity budget \(B_s\) (3b), which specifies the desired network size after training completion. This constraint is important as training is often performed on GPUs for resource-constrained devices (e.g., edge devices or mobile phones) which can only support networks of limited size. We also constrain the total computational effort budget from iteration t to T, \(B_{t,c}\), which is reduced per training iteration by the number of unpruned network elements \(||{\varvec{m}}_{t} ||_0\). We constrain \(B_{T,c} \ge 0\) (3d) to ensure training completion within the specified computational budget. Here, we assume that a sparser pruning mask \({\varvec{m}}_t\) corresponds to lower computational effort during training iteration t due to updating fewer network elements. Finally, (3a) maximizes the saliency with a pruning mask \({\varvec{m}}_T\) constrained by the sparsity budget \(B_s\) (3b). Our early pruning formulation balances the saliency of the network elements after convergence against the total computational effort to train the network (i.e., \({\varvec{m}}_T \cdot {\varvec{s}}_T\) versus \(\sum _{t=1}^{T} ||{\varvec{m}}_{t} ||_0\)). This appropriately captures the balancing act of training-time early pruning whereby computational effort is saved by early pruning network elements while preserving the saliency of the remaining network elements after convergence.

4.2 Modeling the saliency with multi-output Gaussian process

To solve the above early pruning problem, we need to model the belief \(p({\varvec{s}}_{1:T})\) of the saliency for computing the predictive belief \(p({\varvec{s}}_{t+1:T} \mid {\tilde{{{\textbf{s}}}}}_{1:t})\) of the future saliency in (4a). At first glance, one may consider decomposing the belief as \(p({\varvec{s}}_{1:T}) \triangleq \prod _{a=1}^{M} p( {\varvec{s}}^a_{1:T} )\) and modeling the saliency \({\varvec{s}}_{1:T}^a \triangleq [s^a_t]_{t=1, \ldots , T}\) of each network element independently. Such independent models, however, ignore the co-adaptation and co-evolution of the network elements, which have been shown to be a common occurrence in DNNs (Hinton et al., 2012; Srivastava et al., 2014; Wang et al., 2020a). Also, modeling the correlations between the saliency of different network elements explicitly is non-trivial since considerable feature engineering is needed to represent diverse network elements such as neurons, connections, or convolutional filters.

To resolve such issues, we use a multi-output Gaussian process (MOGP) to jointly model the belief \(p({\varvec{s}}_{1:T})\) of all saliency measurements. To be specific, we assume that the saliency \(s^a_t\) of the ath network element at iteration t is a linear mixture of Q independent latent functions \(\{u_q(t)\}_{q=1}^Q\): \(s^a_t \triangleq \sum _{q=1}^Q \gamma ^a_q u_q(t).\) If each \(u_q(t)\) is an independent GP with prior zero mean and covariance \(k_q(t, t')\), then the resulting distribution \(p({\varvec{s}}_{1:T})\) is a multivariate Gaussian distribution with prior zero mean and covariance determined by the mixing covariance: \(\text {cov}[s_t^a, s_{t'}^{a'}] = \sum _{q=1}^Q \gamma _q^a \gamma _q^{a'} k_q(t, t').\) This explicit covariance between \(s_t^a\) and \(s_{t'}^{a'}\) helps to exploit the co-evolution and co-adaptation of network elements within the neural network.
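A small NumPy sketch of this mixing covariance (the mixing weights gamma and kernel callables are assumed given):

```python
import numpy as np

def lmc_cov_block(gamma, kernels, t, t_prime):
    """Cross-covariance block cov[s_t^a, s_{t'}^{a'}] = sum_q gamma_q^a gamma_q^{a'} k_q(t, t').

    gamma   : (M, Q) mixing weights, one row per network element
    kernels : list of Q callables k_q(t, t') -> float
    Returns the (M, M) covariance block for the pair of iterations (t, t').
    """
    k_vals = np.array([k_q(t, t_prime) for k_q in kernels])  # (Q,)
    return (gamma * k_vals) @ gamma.T
```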

To capture the horizontally asymptotic trend of \(s_1^a, \ldots , s_T^a\) visualized in Fig. 1, we turn to a kernel used for modeling decaying exponential curves known as the “exponential kernel” (Swersky et al., 2014) and set \(k_q(t, t') \triangleq \frac{{\beta _q}^{\alpha _q}}{(t + t' + \beta _q)^{\alpha _q}}\) where \(\alpha _q\) and \(\beta _q\) are hyperparameters of the MOGP that can be learned via maximum likelihood estimation (Álvarez & Lawrence, 2011).
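As an illustration, the exponential kernel can be written as a custom GPflow kernel; the following is a minimal sketch under our reading of GPflow's kernel API, with placeholder hyperparameter initializations that would subsequently be learned by maximum likelihood:

```python
import gpflow
import tensorflow as tf

class ExponentialDecayKernel(gpflow.kernels.Kernel):
    """k(t, t') = beta^alpha / (t + t' + beta)^alpha (Swersky et al., 2014)."""

    def __init__(self, alpha=1.0, beta=1.0):
        super().__init__()
        self.alpha = gpflow.Parameter(alpha, transform=gpflow.utilities.positive())
        self.beta = gpflow.Parameter(beta, transform=gpflow.utilities.positive())

    def K(self, X, X2=None):
        X2 = X if X2 is None else X2
        s = X + tf.transpose(X2)  # pairwise t + t' for inputs of shape (N, 1)
        return tf.pow(self.beta, self.alpha) / tf.pow(s + self.beta, self.alpha)

    def K_diag(self, X):
        return tf.pow(self.beta, self.alpha) / tf.pow(2.0 * X[:, 0] + self.beta, self.alpha)
```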

Let the prior covariance matrix be \({\varvec{K}}_{\tau _1:\tau _2} \triangleq [\text {cov}[s_t^a, s_{t'}^{a'}]]_{t, t' = \tau _1, \ldots , \tau _2}^{a, a' = 1, \ldots , M}\) for any \(1 \le \tau _1 \le \tau _2 \le T\). Then, given a matrix of observed saliency \({\tilde{{{\textbf{s}}}}}_{1:t}\), the MOGP regression model can provide a Gaussian predictive distribution \(p({\varvec{s}}_{t'}\vert {\tilde{{{\textbf{s}}}}}_{1:t}) = {\mathcal {N}}(\varvec{\mu }_{t'\vert 1:t}, {\varvec{K}}_{t'\vert 1:t})\) for any future saliency \({\varvec{s}}_{t'}\) with the following posterior mean vector and covariance matrix:

$$\begin{aligned} \varvec{\mu }_{t'\vert 1:t} \triangleq {\varvec{K}}_{[t't]}{\varvec{K}}^{-1}_{1:t}{\tilde{{{\textbf{s}}}}}_{1:t}, \quad {\varvec{K}}_{t'\vert 1:t} \triangleq {\varvec{K}}_{t':t'} - {\varvec{K}}_{[t't]}{\varvec{K}}^{-1}_{1:t}{\varvec{K}}^\top _{[t't]} \end{aligned}$$

where \({\varvec{K}}_{[t't]} \triangleq [\text {cov}[s^a_{t'}, s^{a'}_\tau ]]_{\tau =1, \ldots , t}^{a, a' = 1, \ldots , M}\). Then, the ath element \(\mu ^a_{t'\vert 1:t}\) of \(\varvec{\mu }_{t'\vert 1:t}\) is the predictive mean of the saliency \(s^a_{t'}\), and the \([a, a']\)th element of \({\varvec{K}}_{t'\vert 1:t}\), denoted \(\sigma ^{aa'}_{t'\vert 1:t}\), is the predictive (co)variance between the saliency \(s^a_{t'}\) and \(s^{a'}_{t'}\).
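These are the standard Gaussian conditioning equations; a NumPy sketch follows (the jitter term for numerical stability is our addition):

```python
import numpy as np

def saliency_posterior(K_cross, K_obs, K_prior, s_obs, jitter=1e-6):
    """Posterior mean/covariance of future saliency s_{t'} given observations.

    K_cross : (M, N) prior covariance between s_{t'} and the N observed saliencies
    K_obs   : (N, N) prior covariance of the observations (K_{1:t})
    K_prior : (M, M) prior covariance of s_{t'} (K_{t':t'})
    s_obs   : (N,)   flattened observed saliency matrix
    """
    L = np.linalg.cholesky(K_obs + jitter * np.eye(K_obs.shape[0]))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, s_obs))
    mu = K_cross @ alpha                      # mu_{t'|1:t}
    V = np.linalg.solve(L, K_cross.T)
    cov = K_prior - V.T @ V                   # K_{t'|1:t}
    return mu, cov
```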

Fig. 1

Above: Saliency of different convolutional filters over 150 epochs of SGD for a convolutional neural network trained on CIFAR-10 dataset. Below: Function samples drawn from GP with the exponential kernel. Both processes follow an exponentially decaying and asymptotic behavior

4.2.1 On the choice of the “exponential kernel"

We justify our choice of the exponential kernel as a modeling mechanism by presenting visualizations of saliency measurements collected during training, and comparing these to samples drawn from the exponential kernel \(k_q(t, t') \triangleq \frac{{\beta }^{\alpha }}{(t + t' + \beta )^{\alpha }}\), as shown in Fig. 1. Both the saliency of various convolutional filters and the function samples exhibit exponentially decaying behavior, which makes the exponential kernel a strong fit for modeling saliency evolution over time.

Furthermore, we note that the exponential kernel was used to great effect in Swersky et al. (2014) with respect to modeling loss curves as a function of epochs. Similar to the saliency measurement curves, loss curves also exhibit asymptotic behavior, which provides evidence for the exponential kernel being an apt fit for our saliency modeling task.

4.3 Bayesian early pruning (BEP) algorithm

Solving the above optimization problem (3) and (4) is difficult due to the interplay between \([{\varvec{m}}_{t'}]_{t'=t,\ldots ,T}\), \([B_{t',c}]_{t'=t,\ldots ,T}\), and \({\varvec{m}}_{T} \cdot {\varvec{s}}_T\). To overcome this difficulty, we instead analyze a lower bound of \(\rho _t(\cdot )\):

$$\begin{aligned} \rho _t({{\varvec{m}}}_{t-1}, B_{t,c}, B_s) &= \mathop {\max }_{ {{\varvec{m}}}_t} {\mathbb {E}}_{p({\varvec{s}}_{t+1}\mid {\tilde{{{\textbf{s}}}}}_{1:t})} \big [\rho _{t+1}({{\varvec{m}}}_t, B_{t,c}-||{{\varvec{m}}}_t ||_0, B_s) \big ] \\ &\ge \mathop {\max }_{ {{\varvec{m}}}_t} {\mathbb {E}}_{p({\varvec{s}}_T\mid {\tilde{{{\textbf{s}}}}}_{1:t})}[\rho _T({\varvec{m}}_{t}, B_{t,c} - (T-t)||{\varvec{m}}_t ||_0, B_s)]. \end{aligned}$$
(5)

We prove this lower bound in “Appendix 2”. Substituting this lower bound, we define \({\hat{\rho }}_t(\cdot )\):

$$\begin{aligned} {\hat{\rho }}_{t} ({{\varvec{m}}}_{t-1}, B_{t,c}, B_s) \triangleq \mathop {\max }_{ {{\varvec{m}}}_t} {\mathbb {E}}_{p({\varvec{s}}_T\mid {\tilde{{{\textbf{s}}}}}_{1:t})}[\rho _T({\varvec{m}}_{t}, B_{t,c} - (T-t)||{\varvec{m}}_t ||_0, B_s)]\ . \end{aligned}$$
(6)

This approach allows us to lift (3d) from (3); adding a Lagrange multiplier, we obtain:

$$\begin{aligned} {\hat{\rho }}_{t} ({{\varvec{m}}}_{t-1}, B_{t,c}, B_s) \triangleq \mathop {\max }_{ {{\varvec{m}}}_t} {\mathbb {E}}_{p({\varvec{s}}_T\mid {\tilde{{{\textbf{s}}}}}_{1:t})}\left[ {\hat{\rho }}_T({\varvec{m}}_{t}, B_s) \right] + \lambda _t\left( B_{t,c} - (T - t)||{\varvec{m}}_t ||_0\right) \end{aligned}$$
(7)

for \(t = 1, \ldots , T-1\), where \({\hat{\rho }}_T\) is defined as \(\rho _T\) without constraint (3d). Consequently, such a \({\hat{\rho }}_T\) can be solved in a greedy manner as in (2). Hereafter, we omit \(B_{t,c}\) as a parameter of \({\hat{\rho }}_T\) as it no longer constrains the solution of \({\hat{\rho }}_T\). Note that the presence of an additive penalty in a maximization problem is due to the constraint \(B_{T,c} \ge 0 \Leftrightarrow -B_{T,c} \le 0\), which is the form typically expected prior to Lagrangian reformulation.

To proceed with the analysis, we show that the above optimization problem is submodular in \({\varvec{m}}_t\). In (7), the problem of choosing \({\varvec{m}}\) from \(\{0, 1\}^M\) can be considered as selecting a subset A of indices from \(\{1, \ldots , M\}\) such that \(m^a_t = 1\) for \(a \in A\), and \(m^a_t = 0\) otherwise. Therefore, \(P({\varvec{m}}) \triangleq {\mathbb {E}}_{p({\varvec{s}}_T \mid {\tilde{{\varvec{s}}}}_{1:t})}[{\hat{\rho }}_{T}({\varvec{m}}, B_s)]\) can be considered as a set function, which we will show to be submodular. For notational consistency, we keep writing \(P({\varvec{m}})\) instead of representing it as a function of the index subset A.

Lemma 1

Let \({\varvec{m}}',\ {\varvec{m}}'' \in \{0, 1\}^M\), let \(e^{(a)}\) be an arbitrary M-dimensional one-hot vector with \(1 \le a \le M\), and let \(P({\varvec{m}}) \triangleq {\mathbb {E}}_{p({\varvec{s}}_T \mid {\tilde{{\varvec{s}}}}_{1:t})}[{\hat{\rho }}_{T}({\varvec{m}}, B_s)]\). We have \(P({\varvec{m}}' \vee e^{(a)}) - P({\varvec{m}}') \ge P({\varvec{m}}'' \vee e^{(a)}) - P({\varvec{m}}'')\) for any \({\varvec{m}}' \ {\dot{\le }}\ {\varvec{m}}''\), \({\varvec{m}}' \wedge e^{(a)} = 0^M\), and \({\varvec{m}}'' \wedge e^{(a)} = 0^M\).

Here, ‘\(\vee\)’ and ‘\(\wedge\)’ represent bitwise OR and AND operations, respectively. The bitwise OR operation is used to denote the inclusion of \({\varvec{e}}^{(a)}\) in \({\varvec{m}}_t\). The proof for Lemma 1 is presented in “Appendix 3”. Greedy approximations for submodular optimization have a running time of \(O(||{\varvec{m}}_{t-1} ||^2_0)\), which remains far too slow due to the large number of network elements in DNNs. To overcome this, we exploit the strong tail decay of multivariate Gaussian density \({p({\varvec{s}}_T \mid {\tilde{{\varvec{s}}}}_{1:t})}\) to deliver an efficient approximation procedure. Our approach relies on the following lemma (its proof is in “Appendix 4”):

Lemma 2

Let \({\varvec{e}}^{(i)}\) be an M-dimensional one-hot vector with the ith element being 1, and let \(1 \le a,b \le M\) and \({\varvec{m}}\in \{0,1\}^M\) be such that \({\varvec{m}}\wedge ({\varvec{e}}^{(a)} \vee {\varvec{e}}^{(b)}) = 0^M\). Given a matrix of observed saliency \({\tilde{{{\textbf{s}}}}}_{1:t}\ \), if \(\mu ^a_{T\mid 1:t} \ge \mu ^b_{T\mid 1:t}\ \) and \(\mu ^a_{T\mid 1:t} \ge 0\), then

$$\begin{aligned} {\mathbb {E}}_{p({\varvec{s}}_T\mid {\tilde{{{\textbf{s}}}}}_{1:t})} [{\hat{\rho }}_T({\varvec{m}}\vee {\varvec{e}}^{(b)})] - {\mathbb {E}}_{p({\varvec{s}}_T\mid {\tilde{{{\textbf{s}}}}}_{1:t})}[{\hat{\rho }}_T({\varvec{m}}\vee {\varvec{e}}^{(a)})] \le \mu ^b_{T\mid 1:t}\ \Phi (\nu /\theta ) + \theta \ \phi (\nu /\theta ) \end{aligned}$$

where \(\theta \triangleq \sqrt{\sigma ^{aa}_{T\mid 1:t} + \sigma ^{bb}_{T\mid 1:t} - 2\sigma ^{ab}_{T\mid 1:t}}\ \), \(\nu \triangleq \mu ^b_{T\mid 1:t} - \mu ^a_{T\mid 1:t}\ \), and \(\Phi\) and \(\phi\) are standard normal CDF and PDF, respectively.

Due to the strong tail decay of \(\phi\) and \(\Phi\), Lemma 2 indicates that, with high probability, opting for \({\varvec{m}}_t = {\varvec{m}}\vee e^{(a)}\) is not a much worse choice than \({\varvec{m}}_t = {\varvec{m}}\vee e^{(b)}\) given \(\mu ^a_{T\mid 1:t} \ge \mu ^b_{T\mid 1:t}\). This admits the following approach to optimizing \({\hat{\rho }}_t\): starting with \({\varvec{m}}_t = 0^M\), we consider the inclusion of network elements in \({\varvec{m}}_{t}\) in descending order of \(\{\mu ^a_{T\mid 1:t}\}^M_{a=1}\), which can be computed analytically using the MOGP. A network element denoted by \({\varvec{e}}^{(a)}\) is included in \({\varvec{m}}_t\) if it improves the objective in (6). The algorithm terminates once the highest not-yet-included element does not improve the objective function as a consequence of the penalty term outweighing the improvement in \({\mathbb {E}}_{p({\varvec{s}}_T\mid {\tilde{{{\textbf{s}}}}}_{1:t})}[{\hat{\rho }}_T]\). The remaining excluded elements are then pruned.

Following the algorithm sketch above, we define the utility of network element \(v^{a}_t\) with respect to candidate pruning mask \({\varvec{m}}_t {\dot{\le }} {\varvec{m}}_{t-1}\) which measures the improvement in \({\mathbb {E}}_{p({\varvec{s}}_T\mid {\tilde{{{\textbf{s}}}}}_{1:t})}[{\hat{\rho }}_{T}]\) as a consequence of inclusion of \({\varvec{e}}^{(a)}\) in \({\varvec{m}}_t\):

$$\begin{aligned} \Delta (a, {\varvec{m}}_{t}, {\tilde{{{\textbf{s}}}}}_{1:t}, B_s) \triangleq {\mathbb {E}}_{p({\varvec{s}}_T\mid {\tilde{{{\textbf{s}}}}}_{1:t})}[{\hat{\rho }}_{T}({\varvec{e}}^{(a)} \vee {\varvec{m}}_{t}, B_s) - {\hat{\rho }}_{T}({\varvec{m}}_{t}, B_s)]. \end{aligned}$$
(8)

In computing \(\Delta (\cdot )\), we take the expectation over the distribution \(p({\varvec{s}}_T\mid {\tilde{{{\textbf{s}}}}}_{1:t})\), which utilizes both the predictive mean and variance of the network element. Thus, the confidence of the MOGP prediction is considered prior to pruning, as mentioned in Sect. 1. We can now take a Lagrangian approach to make pruning decisions during iteration t by balancing the utility of network element \(v^a_t\) against the change of the penalty (i.e., \(\lambda _t(T-t)\)), as shown in Algorithm 1. Due to the relatively expensive cost of performing early pruning, we choose to early prune every \(T_{step}\) iterations of SGD; typically, \(T_{step}\) is chosen to correspond to 10–20 epochs of training. To compute \(\Delta (\cdot )\), we sample from \(p({\varvec{s}}_T\mid {\tilde{{\varvec{s}}}}_{1:t})\) and use a greedy selection algorithm per sample as in (2). During implementation, we also enforce an additional hard constraint \(||{\varvec{m}}_t ||_0 \ge B_s\), which we believe is desirable for practical reasons. We used a fixed value of \(B_{1,c} = ||{\varvec{m}}_0 ||_0 T_0 + B_s(T-T_0)\) in all our experiments.
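For illustration, here is a condensed sketch of one such pruning round under our reading of Algorithm 1 (the Monte Carlo estimator of \(\Delta (\cdot )\), the sample count, and the helper names are our own assumptions):

```python
import numpy as np

def rho_T_hat(s, m, B_s):
    """Greedy terminal objective: sum of the B_s largest saliencies kept by m."""
    kept = np.sort(s[m.astype(bool)])[::-1]
    return kept[:B_s].sum()

def bep_round(mu, cov, m_prev, B_s, lam_t, T_minus_t, n_samples=64, seed=0):
    """One BEP round: admit elements in descending order of predictive mean while
    the expected utility gain Delta exceeds the penalty lam_t * (T - t)."""
    rng = np.random.default_rng(seed)
    s_samples = rng.multivariate_normal(mu, cov, size=n_samples)  # draws of s_T
    m = np.zeros_like(m_prev)
    for a in np.argsort(mu)[::-1]:            # descending mu^a_{T|1:t}
        if not m_prev[a]:                     # pruned in an earlier round: skip
            continue
        m_try = m.copy()
        m_try[a] = 1
        delta = np.mean([rho_T_hat(s, m_try, B_s) - rho_T_hat(s, m, B_s)
                         for s in s_samples])
        if delta < lam_t * T_minus_t and m.sum() >= B_s:
            break                             # penalty outweighs expected gain
        m = m_try
    return m                                  # elements with m == 0 are pruned
```

The final condition also reflects the hard constraint \(||{\varvec{m}}_t ||_0 \ge B_s\) mentioned above: termination is only allowed once at least \(B_s\) elements have been admitted.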

Finally, we show how \(\lambda _t\) offers a probabilistic guarantee on the poor performance of a pruned network element. Using the standard Markov inequality, we show the following:

Lemma 3

Let \({\varvec{e}}^{(*)}\) represent a pruned element at time t with the highest predictive mean \(\mu ^{*}_{T\mid 1:t} \ge 0\). Given an arbitrary pruned element \({\varvec{e}}^{(a)}\) at time t, then for all \(\delta \in (0,1)\) the following holds:

$$\begin{aligned} p\left( {\hat{\rho }}_{T}({\varvec{e}}^{(a)} \vee {\varvec{m}}_{t}, B_s) - {\hat{\rho }}_{T}({\varvec{m}}_{t}, B_s) < \frac{\lambda _t}{\delta } (T-t + \epsilon ) \right) > 1 - \delta \ \end{aligned}$$

where \(\epsilon \triangleq \left[ \mu ^{a}_{T\mid 1:t}\Phi (\nu /\theta ) + \theta \ \phi (\nu /\theta )\right] / \lambda _t\) with \(\theta \triangleq \sqrt{\sigma ^{**}_{T\mid 1:t} + \sigma ^{aa}_{T\mid 1:t} - 2\sigma ^{*a}_{T\mid 1:t}}\), and \(\nu \triangleq \mu ^a_{T\mid 1:t} - \mu ^{*}_{T\mid 1:t}\).

The proof is in “Appendix 5”. The above lemma shows that \(\lambda _t\) acts as a probabilistic guarantee on the poor performance of a pruned network element. A smaller \(\lambda _t\) offers a higher probability of poor performance of an element prior to pruning. Consequently, \(\lambda _t\) is inversely correlated with training time: a lower \(\lambda _t\) requires more training time as fewer network elements are pruned. This offers a trade-off between training time and performance via the penalty parameter \(\lambda _t\).

Algorithm 1 Bayesian early pruning (BEP)

4.4 BEP-LITE

We may further reduce the training cost of BEP by combining it with pruning at initialization. This combination is motivated by noticing that initialization-time pruning techniques (Lee et al., 2019; Wang et al., 2020a) implicitly utilize the following predictive model of saliency:

$$\begin{aligned} p({\varvec{s}}_{T}) \triangleq \delta ({\varvec{s}}_{0}) \end{aligned}$$
(9)

where \(\delta\) represents the Dirac delta function. We observe in validation that this predictive model is effective at identifying the poorest-performing elements. Thus, we may prune at initialization by solving the following optimization problem:

$$\begin{aligned} \mathop {\max }_{{{\varvec{m}}}_{0} \in \{0,1\}^M} \sum _{a=1}^{M} m_0^a s_0^a \quad \text {s.t.} \quad ||{{\varvec{m}}}_0 ||_0 \le B_0 \end{aligned}$$
(10)

where \(B_0 \ge B_s\) and is chosen to yield 10–20% training cost overhead over pruning at initialization. Consequently, pruning at initialization is used as a permissive heuristic to determine \({\varvec{m}}_0\), with the remainder of the pruning decisions made using BEP as in Algorithm 1. This fusion of techniques, termed BEP-LITE, significantly reduces training cost without adversely affecting test performance (Sect. 5) as BEP is utilized to make difficult pruning decisions by appropriately balancing training cost versus performance.
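A sketch of the BEP-LITE initialization step (the helper name is hypothetical; \(B_0\) and the saliency source are assumed given):

```python
import numpy as np

def bep_lite_init(s0, B0):
    """Permissive initialization step of BEP-LITE: keep the B0 (>= B_s) elements
    most salient at initialization; BEP makes the remaining pruning decisions."""
    m0 = np.zeros(s0.shape, dtype=int)
    m0[np.argsort(s0)[::-1][:B0]] = 1
    return m0
```

The surviving mask then seeds the BEP loop, which prunes the remaining \(B_0 - B_s\) elements over the course of training.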

4.5 Dynamic penalty scaling

In BEP, each optimization problem \({\hat{\rho }}_t(\cdot )\) requires a corresponding Lagrange multiplier \(\lambda _t\). This necessitates several hyperparameters, one for each pruning iteration. Tuning these hyperparameters is costly, and thus undesirable. We therefore propose determining \(\lambda _t\) dynamically using a feedback loop built on a single Lagrange multiplier \(\lambda\).

A proportional feedback loop can be defined as follows:

$$\begin{aligned} \lambda _{t} \triangleq \lambda + K_p \times e(t) \end{aligned}$$
(11)

where \(K_p \ge 0\) is a proportional constant which modulates \(\lambda _t\) according to a signed measure of error \(e(\cdot )\) at time t. Note that \(\lambda _t \ge \lambda\) when \(e(t) \ge 0\), and the opposite holds when \(e(t) \le 0\), which allows the error to serve as feedback in determining \(\lambda _t\). Implicitly, \(\lambda _t\) exerts some control over \(e(t+1)\), thus closing the feedback loop.

Traditional approaches to determining \(K_p\) do not work in our case as \(\lambda\) may vary over several orders of magnitude. Consequently, a natural choice for \(K_p\) is \(\lambda\) itself, which keeps \(K_p\) and \(\lambda\) at the same order of magnitude:

$$\begin{aligned} \lambda _{t} = \lambda + \lambda \times e(t) = \lambda (1 + e(t)). \end{aligned}$$
(12)

Here we make two decisions to adapt the above to our task. First, as \(\lambda\) is likely to be extremely small, we use exponentiation as opposed to multiplication. Secondly, as \(\lambda \le 1\) in practice, we use \(1 - e(t)\) as the exponent:

$$\begin{aligned} \lambda _t = \lambda ^{1 - e(t)} = \lambda \left[ \left( 1/\lambda \right) ^{e(t)} \right] . \end{aligned}$$
(13)

The above derivation is complete with our definition of e(t):

$$\begin{aligned} e(t) \triangleq (T-t)||{\varvec{m}}_t ||_0/{B_{t,c}} - 1. \end{aligned}$$
(14)

The above measures the error as the discrepancy between the anticipated compute required to complete training, \((T-t)||{\varvec{m}}_t ||_0\), and the remaining budget \({B_{t,c}}\), with \(e(t) = 0\) if the two are equal. This is a natural measure of feedback for \(\lambda\) as we expect the two to be equal if \(\lambda\) is serving well to early prune the network.
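Putting (13) and (14) together, the scaling rule reduces to a two-line computation; a sketch:

```python
def dynamic_lambda(lam, n_unpruned, B_tc, T, t):
    """Dynamic penalty scaling, (13)-(14): lambda_t = lambda ** (1 - e(t))."""
    e_t = (T - t) * n_unpruned / B_tc - 1.0   # anticipated compute vs. remaining budget
    return lam ** (1.0 - e_t)
```

When the anticipated compute exceeds the remaining budget (\(e(t) > 0\)), the exponent shrinks and, since \(\lambda \le 1\) in practice, \(\lambda _t\) grows, pruning more aggressively; the converse holds when under budget.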

5 Experiments

This section empirically validates the efficacy of our proposed methods. In particular, we will demonstrate: (a) the effectiveness of our MOGP modeling approach at inferring future saliency measurements; (b) the early pruning performance of BEP compared to related works and the training time versus performance trade-off; and (c) the robustness of BEP's performance to its hyperparameter settings.

To avoid the huge cost of validating the above contributions in various settings, we first evaluate our saliency modeling approach as well as our BEP and BEP-LITE algorithms using a small-scale network: a CNN model trained on the CIFAR-10 and CIFAR-100 datasets. The model architecture is presented in Fig. 2 and consists of 4 convolutional layers followed by a fully connected layer. MaxPooling and Dropout are also utilized in the architecture, similar to VGG-16 (Simonyan & Zisserman, 2015).

Fig. 2

Small-scale model neural network architecture for CIFAR-10. Parentheticals indicate the number of convolutional filters or neurons in a layer. The receptive field size for convolution is (3, 3); max pooling is done with receptive field size (2, 2)

The proposed BEP algorithm is compared with several pruning methods applied at the initialization stage: (a) Random: random pruning; (b) SNIP (Lee et al., 2019); (c) GraSP (Wang et al., 2020a); (d) PFS: PruneFromScratch (Wang et al., 2020b); and (e) EagleEye (Li et al., 2020): a pruning-after-training approach which is applied at the initialization stage for comparison.

Then, we apply BEP and BEP-LITE to prune ResNet-50 on the ImageNet dataset and compare against related works. For ResNet-50 we compare against (a) IterSnip (de Jorge et al., 2021) and (b) SynFlow (Tanaka et al., 2020), as well as the previously mentioned related works. Our ResNet-50 validation shows that BEP is able to train networks with higher accuracy when compared to related work. Furthermore, utilizing the BEP-LITE heuristic, these networks can be trained with only a small training cost overhead when compared to pruning at initialization.

In this work, we use training FLOPs to measure training cost. Due to the continued growth in the training cost of DNNs, we focus on the task of pruning a large percentage of the DNN. Because of the cubic time complexity of MOGPs, we used a variational approximation (Hensman et al., 2015). In all of our models, we used 60 variational inducing points per latent function. The GPflow library (Matthews et al., 2017) is used to build our MOGP models.
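A sketch of such a variational MOGP setup, following GPflow's multi-output API (the base kernels below are stand-ins where the exponential kernel of Sect. 4.2 would be substituted, and all sizes are illustrative):

```python
import numpy as np
import gpflow

M, Q, n_inducing = 128, 8, 60                 # elements, latent functions, inducing pts

kern_list = [gpflow.kernels.Matern32() for _ in range(Q)]  # stand-in base kernels
kernel = gpflow.kernels.LinearCoregionalization(
    kern_list, W=np.random.randn(M, Q))       # W plays the role of the mixing weights

Z = np.linspace(0.0, 1.0, n_inducing)[:, None]  # inducing inputs on the time axis
iv = gpflow.inducing_variables.SharedIndependentInducingVariables(
    gpflow.inducing_variables.InducingPoints(Z))

model = gpflow.models.SVGP(kernel, gpflow.likelihoods.Gaussian(),
                           inducing_variable=iv, num_latent_gps=Q)
# Hyperparameters are then fit by maximizing the variational objective on
# (iteration, saliency) pairs, e.g. with gpflow.optimizers.Scipy().
```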

Table 2 Comparing log likelihood (standard error) of test data for independent GPs (GP) versus MOGP with n latent functions (n-MOGP) on different sizes of collected saliency measurements from CIFAR-10 and CIFAR-100 training

5.1 Small-scale experiments

5.1.1 Saliency modeling evaluation

A key assertion in our approach is the importance of capturing co-adaptation and co-evolution effects among network elements. To verify that our MOGP approach captures these effects, we compare MOGP against GP belief modeling, where the GP assumes independence of saliency measurements across network elements (i.e., \(p({\varvec{s}}_{1:T}) \triangleq \prod _{a=1}^{M} p( {\varvec{s}}^a_{1:T} )\)). To this end, we collect saliency measurements of convolutional filters and neurons (network elements) by instrumenting the training process of our small-scale CNN model on the CIFAR-10/CIFAR-100 datasets. During early pruning, the MOGP infers the future saliency of network elements given a dataset of saliency measurements up to the present; pruning decisions are then made on the inferred saliency of network elements. To verify the robustness of our inference approach, we validate the MOGP approach on inferring the future saliency given past saliency measurements.

We train the belief models with small (\(t=[0, 26]\) epochs), medium (\(t=[0, 40]\) epochs), and large (\(t=[0, 75]\) epochs) training datasets of saliency measurements. For GPs, a separate model was trained per network element (convolutional filter or neuron). For MOGP, all network elements in a single layer shared one MOGP model. We measure these models' ability to infer the future (unobserved) saliency measurements using log likelihood and present the results in Table 2 for CIFAR-10 and CIFAR-100. As can be seen, our MOGP approach better captures the saliency of network elements than the GP approach. The log likelihood of the MOGP improves considerably as the training dataset grows from small to large, which implies the necessity of collecting more saliency measurements before pruning during training and shows evidence of the trade-off between training cost and performance.

Fig. 3

Visualization of qualitative differences between GP and MOGP prediction. Top: GP. Bottom: 18-MOGP. The dataset is separated into a training (green) set of observations, and the future saliency forms the validation (blue) set. The posterior belief of the saliency is visualized as the predictive mean (red line) and \(95\%\) confidence interval (error bar). In many cases (e.g., top left graph), GP is unable to predict the long-term trends of the data due to irregular element saliency observations. However, MOGP is able to overcome data irregularities by utilizing correlations between the saliency of the network elements

We visualize the qualitative differences between GP and MOGP prediction in Fig. 3. We observe that MOGP is able to capture the long-term trend of saliency curves with significantly less data than GP. In many cases, GP is unable to predict the long-term trends of the data due to irregular element saliency observations; MOGP overcomes these irregularities by utilizing correlations between the saliency of the network elements.

5.1.2 Dynamic penalty scaling

Fig. 4

Comparing dynamic penalty scaling versus static on pruning tasks on CIFAR-100 CNN training. Dynamic penalty scaling encourages gradual pruning across a wide variety of settings of \(\lambda\)

We applied the early pruning algorithm to the aforementioned architecture and training regimen, and investigated the behavior of the penalty parameter \(\lambda\). We compare dynamic penalty scaling against a static penalty in Fig. 4 using \(T_0=20\) epochs and \(T_{step}=10\) epochs for two pruning tasks. Dynamic penalty scaling encourages gradual pruning across a wide variety of settings of \(\lambda\). We use dynamic penalty scaling in the remainder of our validation.

5.1.3 BEP-LITE heuristic

For BEP-LITE, we utilize the following predictive model of saliency:

$$\begin{aligned} p({\varvec{s}}_{T}) \triangleq \delta ({\varvec{s}}_{0}) \end{aligned}$$
(15)

where \(\delta\) represents the Dirac delta function. To verify the effectiveness of this model as a permissive heuristic for BEP, we plot the relation between the saliency at initialization and after training.

Using the same experimental setup as Sect. 5.1.1, we plot the saliency measurements collected at initialization and after training completion. This is presented in Fig. 5.

Fig. 5

Correlation between saliency of network elements at initialization, and saliency of network elements after training. Top: CIFAR-10, Bottom: CIFAR-100. Left-to-right: Layers 1 through 5 of the convolutional neural network

As can be seen, saliency at initialization correlates well with saliency after training, demonstrating the validity of our heuristic. Following this observation, we utilize the above predictive model as a permissive heuristic applied at initialization to speed up the BEP algorithm. Due to the relatively better performance offered by GraSP (Wang et al., 2020a), we use it as the saliency measurement at initialization.

5.1.4 BEP on CIFAR-10/CIFAR-100 dataset

Table 3 Performance (standard error) of the tested algorithms with varying inference FLOPs (percentage of pruned FLOPs) for CIFAR-10 and CIFAR-100 on the small scale CNN model
Table 4 Performance (standard error) of the tested algorithms with varying inference FLOPs (percentage of pruned FLOPs) for CIFAR-10 and CIFAR-100 on the VGG-16 model

We apply the tested algorithms to prune a portion of the filters/neurons of each layer and evaluate their performance at various degrees of pruning percentage.

As shown in Table 3, our approach better preserves performance at equivalent pruning percentages. A lower penalty \(\lambda\) yields higher-performing results but larger training FLOPs, which shows that \(\lambda\) in BEP serves well at balancing performance against computational cost. A clearer superiority of BEP in validation accuracy can be observed when the pruning percentage is large (i.e., right column of Table 3). Although PruneFromScratch (Wang et al., 2020b) demonstrates comparable performance to BEP in some cases, we show in more complex experiments (Sect. 5.2) that a significant performance gap emerges. Although BEP incurs larger training FLOPs than the other tested algorithms, we can further reduce the training cost via BEP-LITE, as will be shown in Sect. 5.2. EagleEye achieves much lower validation accuracy than the other tested algorithms, which implies that an after-training pruning technique typically does not work well when applied at the initialization stage for reducing training cost.

To verify the robustness of our approach, we repeat this experiment on the VGG-16 architecture, comparing against the most competitive baselines (SNIP and GraSP). The results are presented in Table 4.

5.1.5 Ablation study

Table 5 Ablation study showing validation accuracy (standard error) with varying early pruning hyperparameters: MOGP variational inducing points (Ind. pnts.), MOGP latent functions (Lat. func.), and \(T_{step}\)

The objective of BEP is to reduce the cost of DNN training. As such, hyperparameter tuning of BEP on a per-architecture basis is not feasible due to its expense. Thus, we check the robustness of BEP and MOGP hyperparameters in this section, demonstrating that tuning is not necessary on a per-DNN basis.

We vary the number of MOGP variational inducing points and MOGP latent functions as well as \(T_{step}\), and test the performance of BEP 1e−4 on CIFAR-10/CIFAR-100 at 95K, 24K, and 6K inference FLOPs, as shown in Table 5 with our small-scale CNN model. We observe that, in general, the validation accuracy of the pruned DNN is robust to changes in all hyperparameters. Degradation is observed in extremal hyperparameter settings: reducing the inducing points and latent functions has a strong effect on the effectiveness of the algorithm in the extremal setting (e.g., 6K inference FLOPs with minimal inducing points or latent functions). However, this can be easily avoided in practice. Pruning with a large \(T_{step}\) offers improved performance; however, this correspondingly increases computational cost. The hyperparameter robustness of our approach demonstrates the feasibility of applying BEP to “never-before-seen” network architectures and datasets without additional hyperparameter tuning.

5.2 ResNet early pruning

Table 6 BEP and BEP-LITE versus related work for ResNet-50 on ImageNet dataset

We train ResNet-50 with BEP and the other tested algorithms for 100 epochs on \(4 \times\) Nvidia Geforce GTX 1080Ti GPUs. More experimental details can be found in “Appendix 6.1”. We used \(\lambda =\) 1e\(-4\) as it showed strong performance in our smaller-scale experiments. As can be observed in Table 6, the proposed methods achieve higher validation accuracy than the other tested algorithms, with BEP-LITE showing only a modest \(15\%\) increase in training FLOPs over pruning at initialization. BEP-LITE achieves an \(85\%\) training cost reduction over BEP at \(1.7\text {e}8\) inference FLOPs while achieving superior validation accuracy. The modeling and pruning overhead of our algorithm is no larger than that of the other tested algorithms. PruneFromScratch (Wang et al., 2020b) shows severe degradation compared to BEP in the \(86.7\%\) pruned FLOPs experiment, and fails to prune altogether in higher sparsity settings. IterSnip (de Jorge et al., 2021) and SynFlow (Tanaka et al., 2020) are unable to prune effectively at high pruning ratios, with severe degradation observed in all tested scenarios. EagleEye continues to show poor performance, which demonstrates the inability of pruning-after-training techniques to be applied to the early pruning problem. In particular, the validation performance of the trained network outshines the competing approaches at larger pruning ratios. This improvement is crucial as DNNs continue to grow in size and require considerable pruning to allow training and inference on commodity hardware. We note that PruneTrain does not provide a mechanism to constrain the trained network size (see constraint (3b)). To compare with PruneTrain, we train ResNet-50 under the varying pruning settings offered by PruneTrain; after training completes for these networks, we train equivalent-inference-cost networks using BEP.

Table 7 Timing evaluation

5.3 Training-time improvements and discussion

Our approach delivers training time improvements in wall-clock time. In Table 7 we show the GPU, wall-clock, overhead, and model time for BEP and BEP-LITE on the ResNet-50 pruning tasks. GPU training time speedup is correlated with the size of the model after training completion. BEP-LITE delivers improved performance in wall-clock and GPU time, with no significant loss of post-training performance when compared to BEP.

The measured wall-clock time is significantly higher than the GPU time due to the disk I/O and image decoding overhead of training. The GPU time reduction is well correlated with the amount of pruning, with higher pruning yielding shorter GPU time. However, due to the constant training overhead, these improvements do not perfectly translate into wall-clock time improvements. Despite this, BEP and BEP-LITE are able to deliver significant improvements in wall-clock training time compared to the unpruned baseline of 47 h. In particular, with \(86.7\%\) pruned FLOPs, BEP-LITE shows a \(40\%\) improvement in wall-clock time with only a \(5.5\%\) drop in accuracy.

The training overhead can be reduced in many ways to deliver further wall-clock time improvements: disk I/O can be reduced by utilizing faster disks or disk arrays for higher throughput, and image decoding overhead can be alleviated by storing predecoded files in bitmap form. Thus, our approach delivers significant, practical reductions in both GPU time and wall-clock time, and the wall-clock time reduction can be further improved with minimal effort.

6 Conclusion

This paper presents a novel, efficient algorithm to prune DNN elements such as neurons or convolutional filters during the training process. To achieve early pruning before training converges while preserving the performance of the DNN upon convergence, a Bayesian model (i.e., MOGP) is used to predict the saliency of DNN elements in future (unseen) training iterations by exploiting the exponentially decaying behavior of the saliency and the correlations between the saliency of different network elements. We then exploit a property (Lemma 2) of the objective function and propose an efficient Bayesian early pruning algorithm. Empirical evaluations on benchmark datasets show that our algorithm performs favorably against related works for pruning convolutional filters and neurons. In particular, BEP shows strong improvement in inference performance with minimal cost overhead when a significant portion of the DNN is pruned. Moreover, the proposed BEP is robust to changes in hyperparameters (see Table 5), which demonstrates its applicability to “never-before-seen” network architectures and datasets without further hyperparameter tuning. Our approach also remains flexible to changes in the saliency function and appropriately balances the training cost versus performance trade-off in DNN pruning.