1 Introduction

Deep learning has proven its power to solve complex tasks and to achieve state-of-the-art results in various domains such as image classification, speech recognition, machine translation, robotics, and control [6, 24]. Over-parameterized artificial neural networks (ANNs), which have more parameters than training samples, achieve state-of-the-art results in various tasks [39, 57]. However, the large number of parameters comes at the expense of computational cost in terms of memory footprint, training time, and inference time on resource-limited devices.

Fig. 1. The generic flow of OAMIP used to remove neurons with an importance score below a specific threshold.

In this context, the pruning of neurons in an over-parameterized neural model has been an active area of research, enabling increased computational efficiency and the uncovering of sub-networks with marginal (or even no) loss in the network’s predictive capacity [1, 9, 17, 28, 41, 42, 45, 50, 51, 56]. The typical sparsification procedure involves training a neural model to convergence, computing the parameters’ importance, pruning some of them according to specific criteria, and fine-tuning the neural model to regain its lost accuracy. Existing pruning and neuron ranking procedures [1, 9, 17, 18, 27, 35, 45, 56] require iterations of fine-tuning on the sparsified model instead of pruning a pre-trained network directly. Moreover, these procedures leave the generalization of sparsified models across different datasets under-explored [31], even though such generalization would be consistent with the lottery ticket hypothesis [13, 34, 37].

We remark that modern network architectures often use sparse neuron connectivity, most notably convolutional layers in image processing. Indeed, the limited size of the parameter space in such cases increases the effectiveness of network training and enables the learning of meaningful semantic features from the input images [15]. Inspired by the benefits of sparsity in such architecture designs, we aim to leverage the neuron sparsity achieved by our framework, Optimizing ANN Architectures using Mixed-Integer Programming (OAMIP), to obtain optimized neural architectures that generalize well across different datasets. For this purpose, we create a sparse sub-network by optimizing on one dataset and then training the same, i.e., masked, architecture on another dataset. Our results indicate a promising direction of future research into the use of combinatorial optimization for effective automatic architecture tuning to augment handcrafted network architecture design.

Contributions and Paper Organization. In OAMIP, illustrated in Fig. 1, we formalize the notion of a neuron importance score with respect to a trained neural network and the associated dataset. The neuron importance score reflects how much a neuron’s activity can be decreased while controlling the loss in the model’s accuracy. To this end, in Sect. 2, we begin by providing background on the constraints that serve as the basis for our Mixed-Integer Programming (MIP) formulation presented in Sect. 3. Concretely, we propose a MIP that computes an importance score for each fully connected neuron and each convolutional feature map. The error propagation associated with pruning between different layers defines each neuron’s importance score. In addition, we discuss the extension of the MIP constraints to other layers besides ReLU-activated fully connected ones. Section 4 describes OAMIP in detail, namely the integration of the neuron importance scores into the pruning procedure. Here, we also propose a methodology to decouple the computation of neuron importance scores per layer, allowing us to represent deeper architectures and, thus, scale up our approach to models like VGG-16 [44]. Furthermore, in Sect. 5, we show OAMIP’s robustness to the choice of input data points, besides its ability to parallelize the computation of importance scores per class. Finally, we show that OAMIP’s importance scores generalize well across various datasets, in line with the lottery ticket hypothesis [13].

1.1 Related Work

Weight Pruning Methods. Early methods in weight pruning relied on the weight magnitude, disabling the lowest-magnitude weights and re-training/fine-tuning the resulting sub-network [16, 29, 37]. Magnitude-based techniques rely on the intuition that large weight values are more critical during inference than smaller ones. [36] devised a greedy criterion-based pruning with fine-tuning by back-propagation; the criterion is the absolute difference between the dense and sparse models’ losses (a ranker), used to avoid a drop in predictive capacity. [43] developed a framework that computes the neurons’ importance at each layer through a single backward pass, as an approximation to the interpretability of each neuron during inference. Other related techniques, using different objectives and interpretations of neuron importance, have been presented [1, 3, 19, 20, 22, 54]; they require either fine-tuning to recover the network’s performance or dynamic re-training and pruning. Another line of research [10, 33, 42, 49, 53, 55] formulates an optimization model to select which neurons to disable without losing performance on the task at hand. With a less conservative perspective, but also using an optimization-based model, OAMIP aims to quantify a generalizable per-neuron importance score for either a pre-trained network or a network at initialization, without re-training or fine-tuning. Similarly, other pruning procedures avoid the fine-tuning step by pruning the network at initialization. In particular, SNIP [28] and GraSP [51] focus on predicting critical weights at initialization via salience scores and then train the sub-network until convergence. SNIP [28] was the first to investigate pruning a network at initialization by computing each connection’s sensitivity to an input batch of data through gradient back-propagation. OAMIP can be applied to the network at initialization or after training, without requiring a long fine-tuning step.

Lottery Ticket. [13] introduced the lottery ticket hypothesis, which shows the existence of a lucky pruned sub-network, a winning ticket, that can be trained effectively with fewer parameters while incurring a marginal loss in accuracy. [37] proposed “one ticket to win them all” for sparsifying over-parameterized trained neural models based on the lottery ticket hypothesis. Searching for the winning ticket involves pruning the model and disabling some of its sub-networks. The pruned model can be trained on a different dataset using the same initialization (winning ticket), achieving good results; to this end, the dataset used for the pruning phase must be sufficiently large. The lucky sub-network is found by iteratively pruning the lowest-magnitude weights and re-training. Another phenomenon, discovered in [40, 52], is the existence of smaller, high-accuracy models that reside within larger random networks. This is called the strong lottery ticket hypothesis and was proven in [34] for fully connected ReLU networks. Furthermore, [51] proposed a technique to select the winning ticket at initialization (before training the ANN) by computing an importance score based on the gradient flow in each unit.

Mixed-Integer Programming. [12] presented a Mixed-Integer Linear Programming big-M formulation to represent trained ReLU neural networks. Later, [4] introduced the strongest possible tightening of the big-M formulation by adding strengthening separation constraints when needed, which reduced the solving time by several orders of magnitude. Recently, [48] presented efficient partitioning strategies that further improved solving time. All the proposed formulations are designed to represent trained ReLU ANNs with fixed parameters. In our framework, we use the formulation from [12], since it performs well thanks to our tight local variable bounds and its polynomial number of constraints (while the models in [4, 48] are non-compact). The interest in representing an ANN as a MIP lies in its use to evaluate robustness, carry out compression, and create adversarial examples for trained ANNs. For instance, [21, 47] used a big-M formulation to evaluate the robustness of neural models against adversarial attacks. [55] modeled an extension of the optimal brain surgeon [18], where the goal is to select and remove the weights that have the least impact on the predictive capacity of the network, as an Integer Quadratic Program. However, the optimal brain surgeon pruning criteria rely heavily on the weight scale, which is sensitive to the architecture used; different normalization layers affect the scale and magnitude of weights in different ways [28]. [42] also used a MIP formulation to maximize the compression of a trained neural network without decreasing predictive accuracy. Lossless compression [42] relies on different compression methods, such as removing neurons and folding layers; however, the reported computational experiments lead only to the removal of inactive neurons. OAMIP can identify such neurons and, moreover, quantify the importance of each neuron with respect to the predictive capacity, while pruning neurons that are non-critical across different datasets. The latter means that the sub-networks found by our framework on a specific dataset generalize to others.

2 Preliminaries

Consider layer \(l\) of a trained ReLU neural network with \({\boldsymbol{W^l}}\) as the weight matrix, \(w_i^l\) as row \(i\) of \({\boldsymbol{W^l}}\), and \(b^l\) as the bias vector. For each input data point \(x\), let \(h^l\) be a decision vector denoting the output value of layer \(l\), i.e., \(h^l = ReLU({\boldsymbol{W^l}} h^{l-1}+b^l)\) for \(l>0\) and \(h^{0}=x\), and let \(z_i^l\) be a binary variable taking value 1 if unit \(i\) is active (\(w_i^l h^{l-1}+b^l_i \ge 0\)) and \(0\) otherwise. Finally, let \(L^{l}_{i}\) and \(U^{l}_{i}\) be constants giving valid lower and upper bounds on the input of each neuron \(i\) in layer \(l\). We discuss the computation of these bounds in Sect. 3.2; for now, we assume that \(L^{l}_{i}\) and \(U^{l}_{i}\) are sufficiently small and large numbers, respectively, i.e., the so-called big-M values. Next, we provide the representation of ReLU neural networks from [12]. Although [4] proposed an ideal MIP formulation with an exponential number of facet-defining constraints that can be separated efficiently, we use the formulation of [12], since it performs well in practice for our purpose. For the sake of simplicity, we describe the formulation for one layer \(l\) of the model at neuron \(i\) and one input data point \(x\):

$$\begin{aligned} h^{0}_i&= x_i \end{aligned}$$
(1a)
$$\begin{aligned} h_{i}^l&\ge 0, \quad \text { for } l>0 \end{aligned}$$
(1b)
$$\begin{aligned} h^{l}_{i} + (1- z^{l}_i) L^{l}_i&\le w_i^{l} h^{l-1} + b^{l}_i \end{aligned}$$
(1c)
$$\begin{aligned} h^{l}_{i}&\le z^{l}_i U^{l}_i \end{aligned}$$
(1d)
$$\begin{aligned} h^{l}_{i}&\ge w_i^{l} h^{l-1} + b^{l}_i \end{aligned}$$
(1e)
$$\begin{aligned} z^{l}_{i} \in \{0, 1\}, \quad h_i^l \in \mathbb {R}. \end{aligned}$$
(1f)

In constraint (1a), the initial decision vector \(h^{0}\) is forced to be equal to the input \(x\) of the first layer. When \(z^{l}_i\) is 0, constraints (1b) and (1d) force \(h^l_i\) to be zero, reflecting a non-active neuron. If \(z^{l}_i\) is 1, then constraints (1c) and (1e) force \(h^l_i\) to be equal to \(w_i^{l} h^{l-1} + b^{l}_i \). After formulating the ReLU, if we relax the binary constraint (1f) on \( z^{l}_i\) to \( [0, 1] \), we obtain a polyhedron, over which it is easier and faster to optimize. The quality (tightness) of such a relaxation highly depends on the choice of tight upper and lower bounds \( U^l_i, L^l_i\). Indeed, tight bounds reduce the search space and, hence, the solving time.
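To make the formulation concrete, below is a minimal CVXPY sketch of constraints (1) for a single fully connected ReLU layer and one input point. The helper name, the loose big-M values, and the toy objective are our own illustrative assumptions; this is a sketch, not the paper's actual implementation.

```python
import cvxpy as cp
import numpy as np

def relu_layer_constraints(W, b, h_prev, L, U):
    """Big-M encoding (1b)-(1f) of h = ReLU(W @ h_prev + b) for one layer.

    W, b   : trained weights/bias of the layer (numpy arrays)
    h_prev : previous layer's output (numpy array or CVXPY expression)
    L, U   : valid per-neuron lower/upper bounds on the pre-activation
    """
    n = W.shape[0]
    h = cp.Variable(n, nonneg=True)        # (1b) h >= 0
    z = cp.Variable(n, boolean=True)       # (1f) activation indicators
    pre = W @ h_prev + b
    cons = [
        h + cp.multiply(1 - z, L) <= pre,  # (1c) if z = 1, h cannot exceed pre
        h <= cp.multiply(z, U),            # (1d) if z = 0, h is forced to 0
        h >= pre,                          # (1e) h is at least the pre-activation
    ]
    return h, z, cons

# Toy usage with random data and loose big-M bounds (illustration only).
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
L, U = np.full(3, -10.0), np.full(3, 10.0)
h, z, cons = relu_layer_constraints(W, b, x, L, U)
prob = cp.Problem(cp.Minimize(cp.sum(h)), cons)
# prob.solve(solver=cp.MOSEK)  # any MIP-capable solver can be used here
```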

3 Neuron Importance Score

In what follows, we adapt constraints (1) to quantify neuron importance, describe the computation of the bounds \(L_i^l\) and \(U_i^l\), and discuss the objective function of our MIP. Our goal is to compute importance scores for all layers of the model in an integrated fashion. In fact, [54] has shown that this integrated perspective leads to better predictive accuracy than a layer-by-layer approach.

3.1 MIP Constraints

In ReLU-activated layers, we keep the previously introduced binary variables \(z^{l}_i\) and continuous variables \(h_i^l\). Recall that these variables are linked to an input data point \(x\), so if more than one data point is considered, copies of these variables must be created. Additionally, we create continuous decision variables \(s^l_i \in \left[ 0,1\right] \) representing the importance score of neuron \(i\) in layer \(l\); unlike \(z^{l}_i\) and \(h_i^l\), no copies of \(s^l_i\) are created for each input data point. We then modify the ReLU constraints (1) by adding the neuron importance decision variable \(s^l_i\) to constraints (1c) and (1e):

$$\begin{aligned} h^{l}_{i} + (1- z^{l}_i) L^{l}_i&\le w_i^l h^{l-1} + b^{l}_i - (1- s^{l}_i) \max {(U^{l}_i, 0)} , \end{aligned}$$
(2a)
$$\begin{aligned} h^{l}_{i}&\ge w_i^l h^{l-1} + b^{l}_i - (1- s^{l}_i) \max {(U^{l}_i, 0)}. \end{aligned}$$
(2b)

Constraints (2) impose that when neuron \(i\) is activated by the input \(h^{l-1}\), i.e., \(z_i^l=1\), then \(h^l_i\) equals the right-hand side of those constraints. This value can be directly decreased by reducing the neuron importance \(s_i^l\). When neuron \(i\) is non-active, i.e., \(z_i^l=0\), constraint (2b) becomes irrelevant as its right-hand side is negative; this fact, together with constraints (1b) and (1d), implies that \(h^l_i\) is zero. Now, we claim that constraint (2a) allows \(s_i^l\) to be zero if that neuron is indeed non-important, i.e., if for all possible input data points, neuron \(i\) is not activated. This claim can be shown through the following observations. Note that the decisions \(h\) and \(z\) must be replicated for each input data point \(x\), as they represent the propagation of \(x\) through the neural network. On the other hand, \(s\) evaluates the importance of each neuron for the main learning task, and thus it must be the same for all input data points. The key ingredients are therefore the bounds \(L_i^l\) and \(U_i^l\), which are computed for each input data point, as explained in Sect. 3.2. In this way, if \(U_i^l\) is non-positive, \(s_i^l\) can be zero without interfering with constraints (2); this behavior is driven by the objective function derived in Sect. 3.3. We designate a neuron as critical with respect to a trained ANN if its importance score is higher than a predefined threshold; otherwise, it is called non-critical.
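Extending the sketch from Sect. 2, constraints (2) amount to replacing (1c) and (1e) by the lines below, with a single importance vector `s` shared by all input points (again an illustrative CVXPY rendering, reusing `n`, `h`, `z`, `pre`, `L`, `U` from the previous snippet):

```python
# One importance score per neuron, shared across all input data points.
s = cp.Variable(n)
U_plus = np.maximum(U, 0.0)                                         # max(U_i, 0)
cons_importance = [
    s >= 0, s <= 1,
    h + cp.multiply(1 - z, L) <= pre - cp.multiply(1 - s, U_plus),  # (2a)
    h >= pre - cp.multiply(1 - s, U_plus),                          # (2b)
]
```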

We now discuss other architectures. Concerning convolutional feature maps, we convert the kernels to Toeplitz matrices and their input images to vectors. This allows us to use simple matrix multiplication, which is computationally efficient and generates the full convolution output. For padded convolutions we use only parts of the output of the full convolution, and for strided convolutions we use a sum of 1-strided convolutions, as proposed by [7]. Hence, we can represent a convolutional layer using the same formulation as for fully connected layers, presented in (2). The importance score of a convolutional layer is associated with each feature map [30, 36].
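The following PyTorch snippet is a small sanity check (our own illustration, not the paper's code) that a 2-D convolution can indeed be written as a single matrix multiplication, which is the view used to embed convolutional layers in the MIP:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)                            # one RGB image
conv = torch.nn.Conv2d(3, 4, kernel_size=3, padding=1, bias=False)

cols = F.unfold(x, kernel_size=3, padding=1)           # im2col: (1, 3*3*3, 64)
W = conv.weight.view(4, -1)                            # flattened kernels: (4, 27)
out_mat = (W @ cols).view(1, 4, 8, 8)                  # convolution as a matmul
out_ref = conv(x)
print(torch.allclose(out_mat, out_ref, atol=1e-6))     # True
```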

We represent both max and average (avg) pooling on multi-input units in our MIP formulation. Pooling layers reduce the spatial representation of input images by applying an arithmetic operation to each feature map of the previous layer. Avg pooling computes the average over each feature map of the previous layer \(l\), which has \(N^l\) neurons. This operation is linear and can therefore be included directly in the MIP constraints:

$$\begin{aligned} h^{l+1} = \text {AvgPool}(h^l_1,\cdots ,h^{l}_{N^l} ) = \frac{1}{N^l} \sum _{i=1}^{N^l} h^l_i. \end{aligned}$$

Max Pooling takes the maximum of each feature map of the previous layer:

$$\begin{aligned} h^{l+1} = \text {MaxPool}(h^l_1,\cdots ,h^l_{N^l} ) = \text {max}\{h^l_1, \cdots , h^l_{N^l}\} . \end{aligned}$$

This operation can be expressed by introducing a set of binary variables \(m_1, \cdots ,m_{N^l}\), where \(m_i=1\) implies \(h^{l+1}=\text {MaxPool}(h^l_1,\cdots ,h^l_{N^l} )\):

$$\begin{aligned} h^{l+1}&\ge h^l_i, \quad i=1,\ldots ,N^l, \\ h^{l+1}&\le h^l_i + (1-m_i)\,\bar{U}^l, \quad i=1,\ldots ,N^l, \\ \sum _{i=1}^{N^l} m_i&= 1, \quad m_i \in \{0,1\}, \end{aligned}$$

where \(\bar{U}^l = \max _i \max (U^l_i, 0)\) is an upper bound on every \(h^l_i\). When \(m_i=1\), the two inequalities for unit \(i\) force \(h^{l+1} = h^l_i\); for the remaining units, the upper inequality is relaxed by the big-M term.

3.2 Bound Propagation

In the previous section, we assumed a large upper bound \(U^l_i\) and a small lower bound \(L^l_i\). However, using large bounds may lead to long computational times and a loss of freedom to reduce the importance score, as discussed above. To overcome these issues, we tailor these bounds to their respective input point \(x\) by considering small perturbations of its value:

$$\begin{aligned} L^0&= x - \epsilon \end{aligned}$$
(3a)
$$\begin{aligned} U^0&= x + \epsilon \end{aligned}$$
(3b)
$$\begin{aligned} L^l&= {\boldsymbol{W^{(l-)}}} U^{l-1} + {\boldsymbol{W^{(l+)}}} L^{l-1} \end{aligned}$$
(3c)
$$\begin{aligned} U^l&= {\boldsymbol{W^{(l+)}}} U^{l-1} + {\boldsymbol{W^{(l-)}}} L^{l-1} \end{aligned}$$
(3d)
$$\begin{aligned} {\boldsymbol{W^{(l-)}}}&\triangleq \min {({\boldsymbol{W^{(l)}}}, 0)} \end{aligned}$$
(3e)
$$\begin{aligned} {\boldsymbol{W^{(l+)}}}&\triangleq \max {({\boldsymbol{W^{(l)}}}, 0)}. \end{aligned}$$
(3f)

Propagating the initial bounds of the input data point through the trained model with simple interval arithmetic yields the desired bounds. The obtained bounds are tight, narrowing the space of feasible solutions.
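A minimal numpy sketch of the interval propagation in (3), for one input point and the weight matrices of the trained network (the helper name and the absence of bias terms mirror Eqs. (3a)-(3f); this is our illustration, not the paper's code):

```python
import numpy as np

def propagate_bounds(x, Ws, eps=0.1):
    """Interval arithmetic of Eqs. (3a)-(3f).

    x  : one input data point (1-D array)
    Ws : list of weight matrices W^1, ..., W^n of the trained model
    Returns per-layer lower/upper bounds (L^l, U^l).
    """
    L, U = x - eps, x + eps                                 # (3a), (3b)
    bounds = [(L, U)]
    for W in Ws:
        W_neg = np.minimum(W, 0.0)                          # (3e)
        W_pos = np.maximum(W, 0.0)                          # (3f)
        L, U = W_neg @ U + W_pos @ L, W_pos @ U + W_neg @ L  # (3c), (3d)
        bounds.append((L, U))
    return bounds
```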

3.3 MIP Objective

Our framework aims at identifying non-critical neurons without significantly decreasing the predictive accuracy of the pruned ANN. To this end, we combine two optimization objectives.

Our first objective is to maximize the number of neurons sparsified from the trained ANN. Recall that \(N^l\) is the number of neurons at layer \(l\), let \(n\) be the number of layers, and let \(I^{l} = \sum _{i = 1}^{N^l} (s^l_i -2)\) be the sum of neuron importance scores at layer \(l\), with each \(s_i^l\) shifted to the range \([-2, -1]\).

In order to relate the neurons’ importance scores across different layers, our objective becomes the maximization of the number of neurons sparsified from the \(n-1\) layers with the highest scores \(I^l\). Hence, we define \(A = \{I^l : l=1,\ldots ,n\}\) and formulate the sparsity loss as

$$\begin{aligned} \text {sparsity} = \frac{\displaystyle \max _{A^{'} \subset A, |A^{'}| = (n-1)} \sum _{I \in A^{'}} I}{\sum _{l=1}^{n} \vert N^l \vert }. \end{aligned}$$
(4)

Here, the goal is to maximize the number of non-critical neurons at each layer relative to the other layers of the trained neural model. Note that only the \(n-1\) layers with the largest importance scores weigh in the objective, which reduces the pruning pressure on layers that naturally have low scores. The total number of neurons then normalizes the sparsity quantification.
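As an illustration of how (4) aggregates the per-layer scores, the following numpy sketch evaluates the sparsity term for given importance scores; inside the MIP, the same quantity is of course expressed through the decision variables \(s^l_i\) rather than computed post hoc (the function name is ours):

```python
import numpy as np

def sparsity_term(scores_per_layer):
    """Eq. (4): sum of the n-1 largest shifted layer scores I^l,
    normalized by the total number of neurons.

    scores_per_layer: list of 1-D arrays, one array of s^l_i in [0, 1] per layer.
    """
    I = np.array([np.sum(s - 2.0) for s in scores_per_layer])  # I^l in [-2*N^l, -N^l]
    total_neurons = sum(len(s) for s in scores_per_layer)
    top = np.sort(I)[1:]        # keep the n-1 largest I^l (drop the smallest)
    return top.sum() / total_neurons
```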

Our second objective is to minimize the loss of important information due to the sparsification of the trained neural model. We aim to do this without relying on the values of the logits, which are closely correlated with the neurons pruned at each layer; otherwise, the MIP would simply give a score of \(1\) to all neurons in order to keep the same output logit values. Instead, we formulate this optimization objective using the marginal softmax proposed in [14]. Using the marginal softmax allows the solver to focus on minimizing the misclassification error without relying on logit values. Moreover, the scale of the logits can differ between the decision vector \(h^n\) computed by the MIP, with some neurons disabled, and the trained neural network’s predictions. To that end, in the proposed marginal softmax loss, the label with the highest logit value is optimized regardless of its value. Formally, we write the objective

$$\begin{aligned} \text {softmax} = \sum _{i = 1}^{N^n} \log \left[ \sum _c \exp (h^n_{i, c}) \right] - \sum _{i =1}^{N^n} \sum _c Y_{i, c} h^n_{i, c}, \end{aligned}$$
(5)

where the index \(c\) stands for the class label. The marginal softmax objective retains the trained model’s correct predictions for the batch of input images \(x\) with one-hot encoded labels \(Y\), without regard to the logit values. Finally, we combine the two objectives to formulate the loss

$$\begin{aligned} \text {loss} = \text {sparsity} + \lambda \cdot \text {softmax} \end{aligned}$$
(6)

as a weighted sum of the sparsification regularizer and the marginal softmax.
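A small numerical sketch of objectives (5) and (6), reusing `sparsity_term` from the sketch above (in the actual MIP, \(h^n\) is a decision variable and these expressions are handed to the solver through the modeling layer; the names and the default \(\lambda\) are illustrative):

```python
import numpy as np

def marginal_softmax(logits, labels_onehot):
    """Eq. (5): log-sum-exp of the logits minus the true-class logit,
    summed over the batch of input points."""
    lse = np.log(np.exp(logits).sum(axis=1))            # log sum_c exp(h_{i,c})
    true_logit = (labels_onehot * logits).sum(axis=1)   # sum_c Y_{i,c} h_{i,c}
    return (lse - true_logit).sum()

def oamip_loss(scores_per_layer, logits, labels_onehot, lam=5.0):
    """Eq. (6): weighted sum of the sparsity regularizer and the marginal softmax."""
    return sparsity_term(scores_per_layer) + lam * marginal_softmax(logits, labels_onehot)
```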

4 OAMIP: Pruning Approach

Given a trained neural network and a dataset, our goal is to identify and prune non-critical neurons based on the importance score \(s_i^l\) of neuron \(i\) at layer \(l\). To this end, we formulate the neural network as a mixed-integer program, including the neuron importance scores in its constraints and objective function. Algorithm 1 summarizes the integration of our formulation within a pruning procedure.

Algorithm 1. The OAMIP pruning procedure.

Fig. 2. Illustration of the auxiliary network attached to each sub-module, along with the signal backpropagation during training, as shown in [5].

[58] highlights the phenomenon of neural collapse, where the features of images from the same distribution as the training set collapse around a class mean and are maximally distant between different classes. Consequently, the neurons that are important for a specific class, as computed on one image, should not change drastically when another image from the same distribution as the training set is used. Besides, using all the training samples as input to the MIP solver is intractable. Hence, we use only a subset of data points, each representing a class of the classification task for which we aim to approximate the neuron importance scores (step 1). Then, OAMIP computes an estimate of the importance score of each neuron (step 2). With a small threshold, tuned to the network’s architecture, we mask (prune) non-critical neurons whose score is lower than the threshold (step 3). Finally, our proposed framework returns a pruned ANN (sub-network) that achieves a marginal loss in accuracy.
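The following Python-style sketch summarizes steps 1-3 of Algorithm 1 as described above; `sample_one_point_per_class`, `solve_importance_mip`, and `apply_neuron_mask` are hypothetical helper names used for illustration only:

```python
def oamip_prune(model, dataset, threshold, lam=5.0):
    """Sketch of the OAMIP pruning flow (steps 1-3)."""
    # Step 1: pick a small balanced batch, e.g. one image per class.
    batch = sample_one_point_per_class(dataset)

    # Step 2: build the MIP (constraints (2), bounds (3), objective (6)) for the
    # given model on this batch and read back the neuron importance scores s.
    scores = solve_importance_mip(model, batch, lam=lam)

    # Step 3: mask (prune) every neuron whose score falls below the threshold.
    mask = {neuron: float(s) >= threshold for neuron, s in scores.items()}
    return apply_neuron_mask(model, mask)
```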

The most time-sensitive step of OAMIP is the optimization of the MIP, whose number of variables and constraints increases with the number of neurons and input data points. Indeed, if large and realistic ANNs are modeled with our MIP, the computation time for determining importance scores is expected to become very large, as observed in the problem tackled in [12]. To overcome this computational issue, we propose computing importance scores independently per layer using auxiliary networks [5]. In particular, we used decoupled greedy learning [5] to train each layer of VGG-16 [44] with a small auxiliary network and, in this way, computed the neuron importance scores independently on each auxiliary network, as shown in Fig. 2. Then, we fine-tuned the masked network for one epoch to propagate across layers the errors resulting from the independent optimization. Decoupled training of each layer allowed us to represent deep models with the MIP formulation and to parallelize the computation per layer.

5 Empirical Results

This section shows experimentally that (i) our approach efficiently finds high-performing sub-networks of ANN architectures, (ii) the computed sub-networks generalize well to new datasets, and (iii) OAMIP outperforms the state-of-the-art approach SNIP with regard to generalization.

Experimental Setting. We used a simple fully connected 3-layer ANN (FC-3) model with 300+100 hidden units, from [26], and another simple fully connected 4-layer ANN (FC-4) model with 200+100+100 hidden units. In addition, we used the convolutional LeNet-5 [26], consisting of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer and two fully connected layers. The largest architecture investigated was VGG-16 [44], consisting of a stack of convolutional (Conv.) layers with a small \(3 \times 3\) receptive field. The VGG-16 was adapted for CIFAR-10 [25], with two fully connected layers of size 512 and average pooling instead of max pooling. Each of these models was trained three times with different initializations.

All models were trained for 30 epochs using the RMSprop [46] optimizer with a 1e-3 learning rate for MNIST and Fashion-MNIST. LeNet-5 [26] on CIFAR-10 was trained using the SGD optimizer with a 1e-2 learning rate for 256 epochs. VGG-16 [44] on CIFAR-10 was trained using Adam [23] with a 1e-2 learning rate for 30 epochs. The hyper-parameters were tuned on the validation set’s accuracy. All images were resized to 32 by 32 and converted to 3 channels so that the pruned networks can be transferred across different datasets. Our experiments revealed that \(\lambda =5\) generally provides the right trade-off between our two objectives in (6), based on the validation set results; see the thesis [11] for details on these experiments.

Computational Environment. The experiments were performed on an Intel(R) Xeon(R) CPU @ 2.30 GHz with 12 GB RAM and a Tesla K80 GPU, using the Mosek 9.1.11 [38] solver on top of CVXPY [2, 8] and PyTorch 1.3.1Footnote 1.

Fig. 3. Effect of changing the validation set of input images.

Fig. 4. Evolution of the computed masked sub-network during model training.

5.1 OAMIP Robustness

We examine the robustness of OAMIP to the batch of input images fed into the MIP in step 2. Namely, we used 25 randomly sampled, class-balanced images from the validation set. Figure 3 shows that changing the input images used by the MIP to compute neuron importance scores in step 2 results in only marginal changes in test accuracy between different batches. We remark that the input batches may contain images that are misclassified by the neural network; in this case, the MIP tries to use the score \(s\) to recover the true label, which explains the variations in the pruning percentage. Furthermore, we show empirically that OAMIP is robust at different convergence levels of the trained neural network, as shown in Fig. 4. Hence, we do not need to wait for the ANN to be fully trained to identify the target sub-network, in line with the strong lottery ticket hypothesis [34].

Additionally, we experiment with parallelizing the per-class neuron importance score computation using balanced and imbalanced sets of images per class. For these experiments, we sampled a random number of images per class (IMIDP) and then averaged the neuron importance scores obtained from solving the MIP on each class separately. The obtained sub-networks were compared to solving the MIP with 1 image per class (IDP) and to solving the MIP with balanced images representing all classes simultaneously (SIM). We achieved comparable results in terms of test accuracy and pruning percentage.

Table 1. Test accuracy of LeNet-5 when neuron importance is computed independently class by class with imbalanced samples (IMIDP), independently class by class with balanced samples (IDP), and simultaneously for all classes (SIM), with a 0.01 threshold and \(\lambda =1\).

To conclude on the robustness of the scores with respect to the input points used in the MIP, Table 1 shows empirically that our method is scalable and that class contributions can be decoupled without deteriorating the approximation of the neuron scores and, thus, the performance of our methodology. Moreover, OAMIP remains robust even when an imbalanced number of data points per class (IMIDP) is used in the MIP formulation.

5.2 Comparison to Random and Critical Pruning

We started by training a reference model (REF.) using the previously described training parameters. After training and evaluating the reference model on the test set, we fed an input batch of images from the validation set to the MIP, and the MIP solver computed the neuron importance scores based on those input images. We used \(10\) images in this experimental setup, one per class.

To validate our pruning policy guided by the computed importance scores, we created different sub-networks of the reference model, removing the same number of neurons in each layer to allow a fair comparison among them. These sub-networks were obtained through different procedures: pruning non-critical neurons (our methodology), pruning critical neurons, and pruning randomly selected neurons. For the VGG-16 experiments, an extra fine-tuning step of 1 epoch is performed on all generated sub-networks. Although we pruned the same number of neurons, which according to [32] should result in similar performance, Table 2 shows that pruning non-critical neurons results in a marginal loss and gives better performance. On the other hand, we observe a significant drop in test accuracy when critical or randomly selected neurons are removed, compared with the reference model. If we fine-tune the sub-network obtained through our method for just 1 epoch, the model’s accuracy can surpass the reference model. This is because the MIP, while computing neuron scores, optimizes its marginal softmax (5) on the true labels.

Table 2. Pruning results on fully connected (FC-3, FC-4) and convolutional (LeNet-5, VGG-16) network architectures using three different datasets. We compare the test accuracy of the unpruned reference network (REF.), a randomly pruned model (RP.), a model pruned by removing the critical neurons selected by the MIP (CP.), and our non-critical pruning approach with (OAMIP + FT) and without (OAMIP) 1 epoch of fine-tuning.

5.3 Generalization Between Different Datasets

Table 3. Cross-dataset generalization: the sub-network mask is computed on the source dataset (\(d_1\)) and then applied to the target dataset (\(d_2\)) by re-training with the same early initialization. Test accuracies are presented for the masked and unmasked (REF.) networks on \(d_2\), as well as the pruning percentage.

In this experiment, we train the model on a dataset \( d_1 \) and create a masked neural model using our approach. After creating the masked model, we reset it to its original initialization. Finally, the new masked model is re-trained on another dataset \( d_2 \), and its generalization is analyzed.
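A minimal sketch of this protocol in PyTorch-style pseudocode; `train`, `evaluate`, and `oamip_prune` (from the sketch in Sect. 4) are placeholder helpers, and `model`, `d1`, `d2`, `threshold` are assumed to be defined:

```python
import copy

init_state = copy.deepcopy(model.state_dict())   # save the original initialization

train(model, d1)                                 # 1. train on the source dataset d1
masked = oamip_prune(model, d1, threshold)       # 2. compute the OAMIP mask on d1
masked.load_state_dict(init_state)               # 3. reset weights to the saved init
train(masked, d2)                                # 4. re-train the masked model on d2
print(evaluate(masked, d2))                      # compare against the unpruned REF.
```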

Table 3 displays these experiments and the respective results. Compared with pruning directly on Fashion-MNIST and CIFAR-10 with our approach, computing the critical sub-network of the LeNet-5 architecture on MNIST creates a sparser sub-network. Moreover, this sub-network achieves better test accuracy than zero-shot pruning without fine-tuning and accuracy comparable to the original ANN. This behavior occurs because the solver optimizes on a batch of images that are classified correctly with high confidence by the trained model. Furthermore, the critical VGG-16 sub-network computed on CIFAR-10 using decoupled greedy learning [5] generalizes well to Fashion-MNIST and MNIST.

5.4 Comparison to SNIP

OAMIP can be viewed as a compression technique for over-parameterized neural models. We compare it to SNIP [28].

SNIP computes connection sensitivities in a data-dependent way before the training. The sensitivity of a connection represents its importance based on the influence of the connection on the loss function. After computing the sensitivity, the connections below a predefined threshold are pruned before training (single shot).

In our methodology, we exclusively identify the importance of neurons and essentially prune all the connections of non-important ones, whereas SNIP focuses on pruning individual connections. Moreover, we highlight that SNIP can only compute connection sensitivity at the initialization of an ANN. Indeed, for a trained ANN, the magnitude of the derivatives of the loss function optimized during training makes SNIP prone to keeping all the parameters. On the other hand, OAMIP can work at different convergence levels, as shown in Sect. 5.1. Furthermore, the connection sensitivity computed by SNIP is only network- and dataset-specific; thus, the sensitivity computed for a single connection does not give a meaningful signal about its general importance for a given task. Rather, it needs to be compared to the sensitivity of other connections.

In order to bridge the differences between the two methods and provide a fair comparison in equivalent settings, we make a slight adjustment to our method: in step 2 of OAMIP, we compute neuron importance scores on the model’s initializationFootnote 2. We note that we used only 10 images as input to the MIP, corresponding to the 10 different classes, and 128 images as input to SNIP, following its original paper [28]. Our algorithm was able to prune neurons from both the fully connected and the convolutional layers of LeNet-5. After creating the sparse networks using SNIP and our methodology, we trained them on the Fashion-MNIST dataset. The difference between SNIP (\( 88.8\% \pm 0.6 \)) and our approach (\( 88.7\% \pm 0.5 \)) was marginal in terms of test accuracy; SNIP pruned \( 55\% \) of the ANN’s parameters and OAMIP \( 58.4\% \).

Table 4. Cross-dataset generalization comparison between SNIP, where the neurons with the lowest sum of connection sensitivities are pruned, and our framework (OAMIP), both applied at initialization; see Sect. 5.3 for the generalization experiment description.

Next, we compare SNIP and OAMIP in terms of generalization. Table 4 shows that our framework outperforms SNIP in this respect. We adjusted SNIP to prune entire neurons based on the sum of their connections’ sensitivities, and our framework was likewise applied to the ANN’s initialization. When our framework is applied at initialization, more neurons are pruned, as the marginal softmax part of the objective function discussed in Sect. 3.3 weighs less (\(\lambda =1\)), driving the optimization to focus on model sparsification.
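For concreteness, a sketch of the neuron-level SNIP adaptation used in this comparison (our reading of [28]: connection sensitivity \(|w \cdot \partial L/\partial w|\), summed over each neuron's incoming connections; `model` is assumed to be a standard PyTorch module and the helper name is ours):

```python
import torch
import torch.nn as nn

def neuron_snip_scores(model, loss):
    """Sum each neuron's (or filter's) incoming connection sensitivities."""
    layers = [m for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    grads = torch.autograd.grad(loss, [m.weight for m in layers])
    scores = []
    for m, g in zip(layers, grads):
        sens = (m.weight * g).abs()                 # connection sensitivity at c = 1
        scores.append(sens.flatten(1).sum(dim=1))   # one score per output neuron/filter
    return scores
```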

Finally, we remark that the adjustments made to SNIP and OAMIP in the previous experiments are solely for comparison, while (unlike SNIP) the primary purpose of our method is to allow optimization at any stage – before, during, or after training. In the specific case of optimizing at initialization and discarding entire neurons based on aggregated connection sensitivity, the SNIP approach may have some advantages, notably in scalability for deep architectures. However, it also has some limitations, as previously discussed.

6 Discussion

We proposed a mixed-integer program to compute neuron importance scores in ReLU-based deep neural networks. Our contributions focus on providing scalable computation of importance scores in fully connected and convolutional layers. We presented results showing that these scores can effectively prune unimportant parts of the network without significantly affecting its predictive capacity. Further, our results indicate that this approach allows the automatic construction of efficient sub-networks that can be transferred and re-trained on different datasets. Knowing a neural network’s critical components can further impact future work beyond the pruning applications presented here.