1 Introduction

Even though more than two and a half decades have passed since the rediscovery of the backpropagation algorithm in the mid-1980s, and despite all the available literature on the MLP network, a beginner soon becomes aware of the difficulties in finding a suitable architecture for real-world applications. In fact, this is a hard task even for an experienced practitioner. An architecture that is too small will not be able to learn from the data properly, no matter which training algorithm is used. An architecture with too many input units and hidden layers/neurons is prone to learning undesirable characteristics (e.g. noise) of the training data.

Hence, a crucial step in the design of an MLP is the network model selection problem [5]. This problem is still a research topic of interest [8, 10, 13, 20, 28, 30, 33] and can be roughly defined as the task of finding the smallest architecture that generalizes well, making good predictions for new data. Generalization can be controlled by varying the number of adjustable parameters (weights and biases) associated with input units and hidden/output neurons. Among the several ways to implement this in practice, we list the following four as possibly the most common approaches.

1.1 Exhaustive search plus early stopping

Early stopping of training is a method that aims to prevent overtraining due to an oversized network, noisy training examples, or a small training set [7]. The performances of several networks, with different numbers of features, hidden layers, and hidden neurons, are evaluated during training on an independent validation set. Training of each network is stopped as soon as its generalization error begins to increase. The optimal architecture is the one providing the smallest generalization error.

1.2 Constructive algorithms

Network training starts with a small number of hidden neurons, and neurons are added during the training process, with the goal of arriving at an optimal network structure [1, 21]. This is the approach behind the Cascade-Correlation network [9, 12, 14, 17, 31].

1.3 Pruning algorithms

Network training can start with a relatively large number of hidden neurons and then, based on a measure of weight relevance, prune out the least significant connections, either by removing individual weights or by removing complete units [6, 11, 24]. This is the approach followed by the Optimal Brain Surgeon (OBS) [15, 29] and the Weight Decay and Elimination (WDE) algorithm [23, 32]. OBS performs weight pruning based on the so-called weight saliencies (see the “Appendix”). The WDE algorithm originates from a regularization method that modifies the error function by introducing a term that penalizes large weights.

1.4 Evolutionary computation

Several authors have used genetic algorithms [2, 27, 34] and other evolutionary algorithms, such as particle swarm optimization [35], for determining the best network topology for a given problem, including the number of hidden layers, the number of neurons per hidden layer, and the number of relevant input features, as well as the connection weights and biases.

The model selection methods above require significant computational effort. For example, even if we restrict the exhaustive search to a specific class of architectures (e.g. one-hidden-layer MLPs), it is still a burdensome task. Evolutionary algorithms are slow by their very population-based nature and usually cannot be used for real-time purposes. The OBS algorithm demands the inversion of the Hessian matrix of the error function, which is computationally intensive [4]. The WDE algorithm requires the specification of a user-defined regularization parameter, which is generally set by cross-validation.

In this paper, we introduce an efficient method for pruning unnecessary weights of a trained MLP classifier without the need for matrix inversions or additional regularization parameters. The proposed method is based on the correlation analysis between the errors of the output neurons and the errors backpropagated to the hidden neurons. Repeated application of the method eventually leads to the complete elimination of all connections arriving at (or emanating from) a given neuron. Computer simulations using synthetic and real-world data sets are carried out to compare the proposed method with the OBS and WDE algorithms on pattern classification and feature subset selection tasks.

The remainder of the paper is organized as follows. In Sect. 2, we briefly review the backpropagation algorithm. In Sect. 3, we introduce the MAXCORE principle as the guiding principle behind the weight pruning method described in Sect. 4. Comprehensive computer simulations are then presented in Sect. 5 for model selection tasks and in Sect. 6 for feature selection tasks. The computational cost of the proposed weight pruning method is discussed in Sect. 7. We conclude the paper in Sect. 8 with a summary of the achievements and suggestions for further developments.

2 The backpropagation algorithm

In this section, we describe the basics of the backpropagation algorithm for training a one-hidden-layer MLP. This learning algorithm requires two passes of computation: a forward pass and a backward pass. During the forward pass, activations and outputs are computed on a neuron-by-neuron basis, while the weights remain unaltered. At iteration t, the activation of the ith hidden neuron, \(i=1, 2, \ldots, Q,\) is given by

$$ u_{i}^{(h)}(t) = \sum_{j=1}^{P} w_{ij}(t) x_j(t) - \theta_i(t) = \sum_{j=0}^{P} w_{ij}(t) x_j(t), $$
(1)

where \(w_{ij}(t)\) is the synaptic weight connecting the jth input unit to the ith hidden neuron, \(\theta_i(t)\) is the bias of the ith hidden neuron, \(Q\ (2 \leq Q < \infty)\) is the number of hidden neurons, and P is the dimension of the input vector (excluding the threshold). To simplify notation, we set \(x_0(t) = -1\) and \(w_{i0}(t) = \theta_i^{(h)}(t)\).

The output of the ith hidden neuron is then defined by

$$ y_{i}^{(h)}(t) = \varphi_i\left[u_{i}^{(h)}(t)\right] = \varphi_i\left[\sum_{j=0}^{P} w_{ij}(t) x_j(t)\right], $$
(2)

where \(\varphi_i(\cdot)\) is usually a sigmoidal function. Similarly, the output values of the output neurons are given by

$$ y_k^{(o)}(t) = \varphi_k\left[u_{k}^{(o)}(t)\right] = \varphi_k\left[\sum_{i=0}^{Q} m_{ki}(t) y_{i}^{(h)}(t)\right], $$
(3)

where \(m_{ki}(t)\) is the synaptic weight connecting the ith hidden neuron to the kth output neuron (\(k=1, \ldots, M\)), and M ≥ 1 is the number of output neurons. Again, to simplify notation, we set \(y_0^{(h)}(t) = -1\) and \(m_{k0}(t) = \theta_k^{(o)}(t)\), where \(\theta_k^{(o)}(t)\) is the threshold of the kth output neuron.
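For concreteness, the forward pass of Eqs. 1–3 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's MATLAB implementation: the variable names (W for the input-to-hidden weights, M_out for the hidden-to-output weights) are ours, and hyperbolic tangent activations are assumed, as used later in Sect. 5.

```python
import numpy as np

def forward_pass(x, W, M_out):
    """Forward pass of a one-hidden-layer MLP (Eqs. 1-3).

    x     : input vector of length P
    W     : (Q, P+1) hidden-layer weights, column 0 holding the thresholds
    M_out : (M, Q+1) output-layer weights, column 0 holding the thresholds
    """
    x_aug = np.concatenate(([-1.0], x))        # x_0 = -1 absorbs the bias
    y_h = np.tanh(W @ x_aug)                   # hidden activations/outputs, Eqs. 1-2
    y_h_aug = np.concatenate(([-1.0], y_h))    # y_0^(h) = -1 absorbs the output bias
    y_o = np.tanh(M_out @ y_h_aug)             # output values, Eq. 3
    return y_h_aug, y_o
```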

The backward pass starts at the output layer by propagating the error signals from the output layer toward the hidden layer. For that, we first need to compute the error value \(e_k^{(o)}(t)\) generated by each output neuron at the current iteration t:

$$ e_k^{(o)}(t) = d_k(t) - y_k^{(o)}(t), \quad k=1, \ldots, M $$
(4)

where \(d_k(t)\) is the target value for the kth output neuron. The error \(e_k^{(o)}(t)\) is multiplied by the derivative \(\varphi^{\prime}_{k}\left[u_{k}^{(o)}(t)\right] = \partial \varphi_{k}/\partial u_{k}^{(o)}\) before being propagated to the hidden layer:

$$ \delta_{k}^{(o)}(t) = \varphi^{\prime}_{k}\left[u_{k}^{(o)}(t)\right]e_{k}^{(o)}(t). $$
(5)

The quantity resulting from this multiplication is known as the local gradient of the kth output neuron. Similarly, the local gradient \(\delta_i^{(h)}(t)\) of the ith hidden neuron is computed as

$$ \begin{aligned} \delta_{i}^{(h)}(t) &= \varphi_i^{\prime}\left[u_{i}^{(h)}(t)\right] \sum_{k=1}^{M} m_{ki}(t) \delta_{k}^{(o)}(t)\\ &= \varphi_i^{\prime}\left[u_{i}^{(h)}(t)\right]e_i^{(h)}(t), \quad i = 0, \ldots, Q, \\ \end{aligned} $$
(6)

where the term \(e_i^{(h)}(t)\) plays the role of a backpropagated (or projected) error signal for the ith hidden neuron.

Finally, the synaptic weights of the output neurons are updated according to the following rule

$$ m_{ki}(t+1) = m_{ki}(t) + \eta \delta_{k}^{(o)}(t) y_{i}^{(h)}(t), \quad i = 0, \ldots, Q, $$
(7)

where \(0 < \eta \ll 1\) is the learning rate. The weights of the hidden neurons are, in turn, adjusted through a similar learning rule

$$ w_{ij}(t+1) = w_{ij}(t) + \eta \delta_{i}^{(h)}(t) x_j(t), \quad j = 0, \ldots, P. $$
(8)

One complete presentation of the entire training set during the learning process is called an epoch. Many epochs may be required until convergence of the backpropagation algorithm is verified. It is good practice to randomize the order of presentation of the training examples from one epoch to the next, in order to make the search in weight space stochastic over the learning cycles.

The simplest way of evaluating convergence is through the average squared error

$$ \varepsilon_{\rm train} = \frac{1}{2N} \sum_{t=1}^{N} \sum_{k=1}^{M} \left[ e_{k}^{(o)}(t) \right]^2 = \frac{1}{2N} \sum_{t=1}^{N} \sum_{k=1}^{M} \left[ d_k(t) - y_k^{(o)}(t) \right]^2, $$
(9)

computed at the end of a training run over the training data vectors. If it falls below a prespecified value, then convergence is declared. The generalization performance of the MLP should be evaluated on a testing set, which contains examples not seen by the network during training.
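A minimal sketch of one on-line backpropagation update (Eqs. 4–8) and its contribution to Eq. 9 is given below, under the same assumptions and naming conventions as the forward-pass sketch above (tanh units, so \(\varphi^{\prime}(u) = 1 - \varphi^{2}(u)\); the learning rate default follows the value used in Sect. 5).

```python
import numpy as np

def backprop_step(x, d, W, M_out, eta=0.001):
    """One on-line backpropagation update (Eqs. 4-8), assuming tanh units.
    W is (Q, P+1), M_out is (M, Q+1); both are updated in place."""
    x_aug = np.concatenate(([-1.0], x))
    y_h_aug = np.concatenate(([-1.0], np.tanh(W @ x_aug)))   # forward pass, Eqs. 1-2
    y_o = np.tanh(M_out @ y_h_aug)                           # Eq. 3

    e_o = d - y_o                                # output errors, Eq. 4
    delta_o = (1.0 - y_o**2) * e_o               # output local gradients, Eq. 5
    e_h = M_out.T @ delta_o                      # backpropagated errors, Eqs. 6/10
    delta_h = (1.0 - y_h_aug**2) * e_h           # hidden local gradients, Eq. 6

    M_out += eta * np.outer(delta_o, y_h_aug)    # output-layer update, Eq. 7
    W += eta * np.outer(delta_h[1:], x_aug)      # hidden-layer update, Eq. 8
    return 0.5 * np.sum(e_o**2)                  # this pattern's contribution to Eq. 9
```

Summing the returned values over the N training patterns and dividing by N yields \(\varepsilon_{\rm train}\) of Eq. 9.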

3 The MAXCORE principle

In Eq. 6, we have defined the backpropagated error signal for the ith hidden neuron as

$$ e_i^{(h)}(t) = \sum_{k=1}^{M} m_{ki}(t) \delta_{k}^{(o)}(t) = \sum_{k=1}^{M} m_{ki}^{*}(t) e_{k}^{(o)}(t), $$
(10)

where \(m_{ki}^{*}(t) = m_{ki}(t) \varphi^{\prime}_{k}\left[u_{k}^{(o)}(t)\right]\) is the weight between the kth output neuron and the ith hidden neuron modulated by the derivative of the kth output activation function.

From Eq. 10, we can easily note that \(e_i^{(h)}(t)\) is the result of a linear combination of the output error signals \(e_{k}^{(o)}(t), k=1,\ldots,M,\) with weights \(m_{ki}^{*}(t)\). Thus, the higher the value of \(m_{ki}^{*}(t)\) and, hence, of \(m_{ki}(t)\), the higher the correlation between \(e_i^{(h)}(t)\) and \(e_k^{(o)}(t)\). A similar linear relationship can be derived for the signals \(e_i^{(h)}(t)\), associated with the hidden neurons, and the signals \(e_j^{(i)}(t)\), associated with the input units (see Eq. 16).

From the above, we can state the following general principle, henceforth called the Principle of Maximum Correlation of Errors (MAXCORE):

The MAXCORE principle: Relevant synaptic weights tend to generate higher correlations between the error signals associated with the neurons of a given layer and the error signals propagated back to the previous layer; nonrelevant (i.e. prunable) weights tend to generate smaller correlations.

MAXCORE is the guiding principle behind the weight pruning procedure described in the next section.
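A quick numerical illustration of the principle (ours, not from the paper): if the modulated weights \(m_{ki}^{*}\) are held fixed over a pass and the output errors are roughly uncorrelated across output neurons, the cross-correlation between \(e_k^{(o)}\) and \(e_i^{(h)}\) grows essentially in proportion to \(|m_{ki}^{*}|\). The sketch below checks this with synthetic errors; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, Q = 1000, 3, 5

E_o = rng.normal(size=(N, M))          # output errors e_k^(o)(t), roughly uncorrelated
M_star = rng.normal(size=(M, Q + 1))   # modulated weights m*_ki, held fixed for the check
E_h = E_o @ M_star                     # backpropagated errors, Eq. 10 (columns i = 0..Q)

C_oh = E_o.T @ E_h                     # cross-correlations, Eqs. 13-14
# MAXCORE: entries with larger |m*_ki| should yield larger |C_oh[k, i]|.
print(np.corrcoef(np.abs(M_star).ravel(), np.abs(C_oh).ravel())[0, 1])
```

The printed correlation between \(|m_{ki}^{*}|\) and \(|{\mathbf{C}}_{oh}[k,i]|\) should come out close to 1, confirming the trend.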

4 The proposed methodology

The procedure to be described is a pruning method. The user should first train an MLP network with a relatively large number of hidden neurons until it achieves the lowest possible value of \(\varepsilon_{\rm train}\). Recall that large MLPs are prone to overfitting due to the local optimization of their cost function, which can lead to poor generalization performance. After that, the application of a pruning procedure can discard nonrelevant weights of an overparameterized network, thus providing a smaller model with equivalent or better generalization performance than the original one.

The main idea behind the pruning procedure to be described is to maintain those connections that produce higher correlations between the errors in a given layer and the errors that are propagated back to the preceding layer. Weights associated with smaller error correlations are candidates to be discarded.

4.1 Pruning hidden-to-output layer weights

The pruning procedure starts with the training data being presented once again to the trained network. No weight updating is allowed from this stage on. Once all N training examples have been presented, we compute the error matrices \(\mathbf{E}_{o}\) and \(\mathbf{E}_{h},\) defined as follows:

$$ {{\mathbf{E}}}_{o} = \left[\begin{array}{cccc} e_{1}^{(o)}(1) & e_{2}^{(o)}(1) & \cdots & e_{M}^{(o)}(1) \\ e_{1}^{(o)}(2) & e_{2}^{(o)}(2) & \cdots & e_{M}^{(o)}(2) \\ \vdots & \vdots & \vdots & \vdots \\ e_{1}^{(o)}(N) & e_{2}^{(o)}(N) & \cdots & e_{M}^{(o)}(N) \end{array} \right]_{N \times M} $$
(11)

and

$$ {{\mathbf{E}}}_{h} = \left[\begin{array}{ccccc} e_{0}^{(h)}(1) & e_{1}^{(h)}(1) & \cdots & e_{Q}^{(h)}(1) \\ e_{0}^{(h)}(2) & e_{1}^{(h)}(2) & \cdots & e_{Q}^{(h)}(2) \\ \vdots & \vdots & \vdots & \vdots \\ e_{0}^{(h)}(N) & e_{1}^{(h)}(N) & \cdots & e_{Q}^{(h)}(N) \end{array} \right]_{N \times (Q+1)}. $$
(12)

The rows of the matrix \({\mathbf{E}}_{o}\) correspond to the errors generated by the output neurons for a given training example. Hence, this matrix is called from now on the matrix of output errors, in contrast to the matrix \({\mathbf{E}}_{h},\) whose rows correspond to the backpropagated errors associated with the hidden neurons. In particular, the first column of \({\mathbf{E}}_{h}\) corresponds to backpropagated errors associated with the thresholds \(m_{k0} = \theta_{k}^{(o)}, k=1, \ldots, M\).

The second step involves the computation of the following matrix product

$$ {{\mathbf{C}}}_{oh} = {{\mathbf{E}}}_{o}^{T}{{\mathbf{E}}}_{h}, $$
(13)

where the superscript T denotes the transpose of a matrix. Note that the (k, i)th entry of \({\mathbf{C}}_{oh},\) denoted by \({\mathbf{C}}_{oh}[k,i],\) corresponds to the scalar product (i.e. the cross-correlation) of the kth column of \({\mathbf{E}}_{o}\) with the ith column of \({\mathbf{E}}_{h},\)

$$ {{\mathbf{C}}}_{oh}[k,i] = \sum_{t=1}^{N} e_{k}^{(o)}(t)e_{i}^{(h)}(t), $$
(14)

for \(k=1, \ldots, M,\) and \(i=0, \ldots, Q\).

The third step requires sorting the absolute values of entries \({\mathbf{C}}_{oh}[k,i]\) in ascending order

$$ |{\mathbf{C}}_{oh}[{\mathbf{r}}_{1}]| < |{\mathbf{C}}_{oh}[{\mathbf{r}}_{2}]| < \cdots < |{\mathbf{C}}_{oh}[{\mathbf{r}}_{L}]|, $$
(15)

where the vector \({\mathbf{r}}_l=(k_{l}, i_{l})\) contains the coordinates of the entry of position l in the ranking, and \(L=\dim({\mathbf{C}}_{oh})=M\times(Q+1)\) is the number of entries in \({\mathbf{C}}_{oh}\).
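The first three steps can be condensed into a few vectorized lines. The sketch below is ours and assumes tanh activations and the same weight layout as the earlier sketches: W is Q × (P+1) and M_out is M × (Q+1), with column 0 holding the thresholds.

```python
import numpy as np

def rank_hidden_to_output(X, D, W, M_out):
    """Steps 1-3 of Sect. 4.1: build E_o and E_h (Eqs. 11-12), compute C_oh (Eqs. 13-14)
    and return the coordinates r_l sorted by increasing |C_oh| (Eq. 15)."""
    N = X.shape[0]
    X_aug = np.hstack([-np.ones((N, 1)), X])                   # x_0 = -1
    Y_h = np.hstack([-np.ones((N, 1)), np.tanh(X_aug @ W.T)])  # y_0^(h) = -1
    Y_o = np.tanh(Y_h @ M_out.T)

    E_o = D - Y_o                              # output errors, Eq. 11 (N x M)
    Delta_o = (1.0 - Y_o**2) * E_o             # local gradients, Eq. 5 (row-wise)
    E_h = Delta_o @ M_out                      # backpropagated errors, Eq. 12 (N x (Q+1))

    C_oh = E_o.T @ E_h                         # cross-correlations, Eqs. 13-14 (M x (Q+1))
    order = np.argsort(np.abs(C_oh), axis=None)                # ascending |C_oh|, Eq. 15
    return [tuple(np.unravel_index(l, C_oh.shape)) for l in order], C_oh
```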

The fourth step involves the execution of the pruning procedure shown in Table 1. In this table, \(J_{\rm train}\) denotes a performance index used to evaluate the network on the training data, such as the average squared error \(\varepsilon_{\rm train}\) or the classification rate CRtrain. The constant \(J_{\rm tol}\) is a user-defined value. If \(J_{\rm train}=\varepsilon_{\rm train},\) then \(J_{\rm tol}\) is the maximum error allowed for the training data; in this case, we eliminate a given connection \(m_{{\mathbf{r}}_{l}}\) only if the value of \(J_{\rm train}\), computed after the elimination of that connection, remains lower than \(J_{\rm tol}\). If \(J_{\rm train}\) = CRtrain, then \(J_{\rm tol}\) is the minimum recognition rate allowed for the training data; in this case, we eliminate a given connection \(m_{{\mathbf{r}}_{l}}\) only if the value of \(J_{\rm train}\), computed after the elimination of that connection, still remains higher than \(J_{\rm tol}\).

Table 1 Pruning procedure for hidden-to-output layer weights
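Table 1 itself is not reproduced here, but the loop it describes can be sketched directly from the text above for the case \(J_{\rm train}\) = CRtrain: candidate weights are tried in the order given by Eq. 15, and a deletion is kept only if the training classification rate stays above \(J_{\rm tol}\) (for \(J_{\rm train}=\varepsilon_{\rm train}\) the comparison is simply reversed). Whether the original procedure stops at the first rejected candidate or continues down the ranking is not specified here; the sketch continues down the ranking.

```python
def prune_hidden_to_output(M_out, ranking, cr_train, J_tol):
    """Tentative pruning of hidden-to-output weights (cf. Table 1).

    M_out    : (M, Q+1) hidden-to-output weight matrix, modified in place
    ranking  : list of (k, i) pairs sorted by increasing |C_oh[k, i]| (Eq. 15)
    cr_train : callable giving the training classification rate for the current M_out
               (the rest of the network and the data are assumed captured in its closure)
    J_tol    : minimum acceptable training classification rate
    """
    for (k, i) in ranking:
        saved = M_out[k, i]
        if saved == 0.0:              # connection already pruned
            continue
        M_out[k, i] = 0.0             # tentatively remove m_ki
        if cr_train(M_out) < J_tol:   # performance dropped below tolerance:
            M_out[k, i] = saved       # restore the connection
    return M_out
```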

4.2 Pruning input-to-hidden layer weights

For pruning the weights connecting the input units to the hidden neurons, \(\{w_{ij}\}\), we now need to propagate the error signals \(e_i^{(h)}(t)\) toward the input units in order to generate the error signals \(e_{j}^{(i)}(t), j = 0, \ldots, P\):

$$ e_{j}^{(i)}(t) = \sum_{i=1}^{Q} w_{ij}(t) \delta_{i}^{(h)}(t) = \sum_{i=1}^{Q} w_{ij}^{*}(t) e_{i}^{(h)}(t), $$
(16)

where \(w_{ij}^{*}(t) = w_{ij}(t) \varphi^{\prime}_{i}\left[u_{i}^{(h)}(t)\right]\) is the weight between the ith hidden neuron and the jth input unit modulated by the derivative of the ith hidden activation function.

After the presentation of N training vectors, the resulting errors can be organized into the error matrix \(\mathbf{E}_{i},\) defined as

$$ {{\mathbf{E}}}_{i} = \left[\begin{array}{ccccc} e_{0}^{(i)}(1) & e_{1}^{(i)}(1) & \cdots & e_{P}^{(i)}(1) \\ e_{0}^{(i)}(2) & e_{1}^{(i)}(2) & \cdots & e_{P}^{(i)}(2) \\ \vdots & \vdots & \vdots & \vdots \\ e_{0}^{(i)}(N) & e_{1}^{(i)}(N) & \cdots & e_{P}^{(i)}(N) \end{array} \right]_{N \times (P+1)}. $$
(17)

Similarly to the procedure described in the previous section, we need to compute the following matrix product

$$ {{\mathbf{C}}}_{hi} = ({{\mathbf{E}}}_h^-)^{T}{{\mathbf{E}}}_i, $$
(18)

where the matrix \({\mathbf{E}}_h^-\) is obtained from the matrix \({\mathbf{E}}_h\) by removing its first column. This does not influence the pruning process, since there are no connections between the constant input of the hidden layer (i.e. \(y_0^{(h)}(t) = -1\)) and the input units.

The (i, j)th entry of \({\mathbf{C}}_{hi},\) denoted by \({\mathbf{C}}_{hi}[i,j],\) is given by

$$ {{\mathbf{C}}}_{hi}[i,j] = \sum_{t=1}^{N} e_{i}^{(h)}(t)e_{j}^{(i)}(t), $$
(19)

for \(i=1, \ldots, Q,\) and \(j=0, \ldots, P\). Then, we sort the absolute values of entries \({\mathbf{C}}_{hi}[i,j]\) in ascending order

$$ |{\mathbf{C}}_{hi}[{\mathbf{s}}_{1}]| < |{\mathbf{C}}_{hi}[{\mathbf{s}}_{2}]| < \cdots < |{\mathbf{C}}_{hi}[{\mathbf{s}}_{L}]|, $$
(20)

where the vector \({\mathbf{s}}_l=(i_{l}, j_{l})\) contains the coordinates of the entry of position l in the ranking, and \(L=\dim({\mathbf{C}}_{hi})=Q\times(P+1)\) is the number of entries in \({\mathbf{C}}_{hi}\).

The final step involves the execution of the pruning procedure shown in Table 2. For obvious reasons, from now on we refer to the proposed method as the CAPE method (Cross-correlation Analysis of back-Propagated Errors). The method must be applied repeatedly until no more weights can be pruned (see the flowchart in Fig. 1 and the sketch after it).

Table 2 Pruning procedure for input-to-hidden layer weights
Fig. 1 Flowchart for the CAPE methodology
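Following the flowchart, the two pruning stages are alternated until a complete pass removes no weight. A schematic outer loop is sketched below, assuming stage routines like the ones sketched in Sects. 4.1 and 4.2 are wrapped into callables; this is our construction, not the paper's code.

```python
def cape(prune_output_stage, prune_hidden_stage, count_nonzero_weights):
    """Outer CAPE loop (cf. Fig. 1): repeat both pruning stages until no
    weight is removed in a complete pass. The three callables are assumed
    to encapsulate the network, the training data, and J_tol."""
    while True:
        before = count_nonzero_weights()
        prune_output_stage()       # Sect. 4.1: hidden-to-output weights
        prune_hidden_stage()       # Sect. 4.2: input-to-hidden weights
        if count_nonzero_weights() == before:
            break                  # no weight pruned in this pass: stop
```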

5 Model selection using the CAPE method

As a proof of concept, we work initially with a synthetic two-dimensional data set in order to visualize the hyperplanes (i.e. lines in this case) due to each hidden neuron and the resulting global decision surface. This data set is composed of 200 pattern vectors divided into two nonlinearly separable classes. For training, 140 of them are randomly chosen while the remaining ones are used for testing the performance of the pruned network.

The second set of simulations involves preliminary tests using two benchmarking data sets (Iris and Wine). The Iris data set is composed of 150 4-dimensional pattern vectors distributed into 3 classes. The Wine data set is composed of 178 13-dimensional vectors also distributed into 3 classes. For the Iris (Wine) data set, 35 (33) patterns per class are randomly selected for training purposes and the remaining 45 (79) ones are used for testing.

For a more demanding classification task, we have used a biomedical data set kindly made available for this research by the Group of Applied Research in Orthopaedics (GARO) of the Centre Médico-Chirurgical de Réadaptation des Massues, Lyon, France. The task consists in classifying patients as belonging to one of three categories: Normal (100 patients), Disk Hernia (60 patients), or Spondylolisthesis (150 patients). Each patient is represented in the database by six biomechanical attributes derived from the shape and orientation of the pelvis and cervical, thoracic, and lumbar spine: pelvic incidence, pelvic tilt, sacral slope, pelvic radius, lumbar lordosis angle, and grade of spondylolisthesis. The training set is formed by randomly selecting 42 pattern vectors per class, and the remaining 184 examples are used for testing purposes. This data set is available upon request. For further details, we recommend the works of Berthonnaud et al. [3] and Rocha Neto and Barreto [26].

Weights and biases are randomly initialized within the range −0.5 to +0.5. All neurons use hyperbolic tangent activation functions, and the inputs are normalized to the range of network activations. Features are normalized to zero mean and unit variance. The output target vectors use the 1-out-of-M binary encoding. The learning rate was set to η = 0.001 for all simulations. All the networks and pruning methods were implemented in MATLAB 7.0.

5.1 Results for the synthetic data set

For the first data set, we initially trained a fully connected MLP network with Q = 12 hidden neurons. The input and output dimensions were set to P = 2 and M = 1, respectively. This network was trained until the MSE value for the training data, \(\varepsilon_{\rm train}\), reached a steady state, i.e. until no significant changes were observed.

Numerical results are shown in Table 3. In this table, \(N_c = (P + 1) \times Q + (Q + 1) \times M\) is the initial number of weights, CRtrain and CRtest stand for the classification rates for the training and testing data, respectively, \(\varepsilon_{\rm test}\) is the mean squared error on the testing data set, and AIC stands for Akaike's Information Criterion (see Footnote 1).

Table 3 Numerical results for the synthetic data set

The application of the CAPE and OBS methods involves several steps. Initially, they are applied to Architecture 1, with Q = 12 hidden neurons and, hence, \(N_c = 49\) adjustable parameters in total. Note that this initial architecture classifies the data accurately. However, we are not only interested in good classification rates, but also in knowing whether there is redundancy, i.e. whether this architecture is oversized. The application of CAPE and OBS resulted in a pruned architecture with Q = 9 hidden neurons, but with fewer connections than a fully connected MLP architecture with Q = 9 neurons (\(N_c = 37\)). The classification rates of both architectures remained the same as before. The resulting AIC value is slightly higher for the architecture pruned by the CAPE method, mainly due to its larger number of parameters.

Since the performance of the pruned architectures with Q = 9 hidden neurons remained at acceptable levels, we wondered what would happen if we applied the pruning methods to another fully connected, one-hidden-layer MLP (Architecture 2) with approximately the same number of connections (\(N_c = 33\)) as the pruned Architecture 1. The rationale was to verify whether there was still room for further weight pruning. The application of CAPE and OBS to Architecture 2 led to the same pruned architecture, with Q = 7 neurons and \(N_c = 22\) connections. Again, note that the pruned Architecture 2 has fewer connections than a fully connected network with Q = 7 neurons (\(N_c = 29\)). We continued the pruning procedure for two more fully connected networks, namely Architectures 3 and 4. The results were identical for both the CAPE and OBS methods.

The results of the application of the WDE algorithm to Architecture 1 are also shown in Table 3. Since this algorithm works during the training phase (see the “Appendix”), it cannot be compared directly with the OBS and CAPE methods, which are applied to already trained networks. Thus, only the final result is shown. The regularization parameter was set to λ = 0.001 after some experimentation. The application of the WDE algorithm resulted in an architecture with 4 neurons and 17 connections. However, the CRtest of the resulting architecture is lower than that achieved by the pruned Architecture 4.

Figure 2 shows the resulting global decision surface and the position of the hyperplanes (i.e. lines) associated with the unpruned Architecture 1. We can clearly see that there are too many lines (i.e. redundant neurons), many more than required to discriminate well between the two classes. For comparison purposes, the decision surface and the hyperplanes for Architecture 4, pruned by the CAPE method, are shown in Fig. 3, where only three lines remain. The successive application of the CAPE method, starting from Architecture 1, eventually led to the minimum number of hidden neurons required to solve this nonlinearly separable problem.

Fig. 2 Decision surface (left) and individual lines (right) for a trained MLP with Q = 12 hidden neurons

Fig. 3 Decision surface (left) and individual lines (right) for a pruned MLP with Q = 3 hidden neurons

Note that the OBS algorithm also led to the same optimal three-hidden-neuron architecture, but at the expense of much higher computational costs. Recall that the OBS algorithm requires the computation of the inverse of a Hessian matrix, one for each neuron in the network, every time the architecture is retrained. The CAPE method requires no matrix inversion at all. In addition, the CAPE method does not require the specification of the regularization parameter λ (WDE algorithm), which has to be estimated by trial and error or by cross-validation.

5.2 Results for the Iris data set

Table 4 shows the results achieved by the pruned architectures on the Iris data set. The initial network topology (Q = 9 hidden neurons) was independently trained and tested four times, using the same training and testing data sets. Only the weight initialization and the order of presentation of the training vectors per epoch were changed from one training/testing run to another. This was necessary to emphasize differences between the two pruning methods.

Table 4 Results of the application of the pruning methods (Iris data)

Weight pruning was carried out using the highest classification rate achieved for the training data as reference (i.e. \(J_{\rm train}\) = CRtrain). In general, both methods presented equivalent performance on the testing data if we consider only average classification rates and mean squared errors in the evaluation. In terms of the final number of weights, the OBS method presented a slight advantage over CAPE, since it ended with a smaller number of weights and hidden neurons. However, as we show later, if we also consider the resulting confusion matrices of the pruned networks, the networks pruned by CAPE performed better than those pruned by OBS.

An interesting behavior was observed in this simulation. The architectures labeled with an asterisk (*) had an output neuron pruned. For all four initial architectures, the application of the CAPE method pruned an output neuron, while the OBS method pruned an output neuron in 3 out of 4 cases. As mentioned earlier, a neuron has to be pruned when all the connections converging to (or emanating from) it have been pruned. In this particular simulation, the fact that an output neuron has been pruned may indicate that the output coding is redundant. As a matter of fact, the 1-out-of-M output encoding used in all simulations in this paper requires three neurons to represent the three classes of the Iris data set. However, it is possible to represent the three classes using only two output neurons. This redundancy in the output class representation was found, without any a priori information about the task, by both the CAPE and OBS methods, but it is a particularly strong feature of the CAPE method.

Figure 4a shows the classification results for Model 1 with Q = 9 hidden neurons and M = 3 output neurons. The points in the figure correspond to the neurons' output values \(y_k^{(o)}, k = 1, 2, 3\); thus, the output space is three-dimensional. The output coordinates \((y_1^{(o)}, y_2^{(o)}, y_3^{(o)})\) corresponding to input vectors classified as belonging to the setosa class (in blue) are concentrated around the coordinates (+1, −1, −1). Points corresponding to input vectors classified as belonging to the versicolor class (in red) and to the virginica class (in black) are concentrated around the coordinates (−1, +1, −1) and (−1, −1, +1), respectively. There is some overlap between points belonging to the versicolor and virginica classes, resulting in a few classification errors.

Fig. 4 Training (asterisk) and testing (circle) classification results for the Iris data set. a Unpruned MLP and b pruned MLP with the pruned output set to zero

Figure 4b shows the resulting classification achieved by Model 1 after the application of the CAPE method. The pruned architecture has \(N_c = 41\) weights, Q = 9 hidden neurons, and still M = 3 output neurons. However, since all the weights from the hidden layer to output neuron 1 have been pruned, the output of this neuron (i.e. \(y_1^{(o)}\)) plays no role and has been set to zero. Hence, the output coordinates for the classes setosa, versicolor, and virginica now spread around the points (0, −1, −1), (0, +1, −1), and (0, −1, +1), respectively.

Numerical results are shown in Table 4. Average classification rates and error values were computed for the pruned networks using the original 1-out-of-M encoding for the desired outputs, i.e. maintaining three output neurons but with the pruned output set to zero (\(y_1^{(o)} = 0\)). Despite that, the classification rates for the testing data remained at acceptable levels (i.e. >93%) for the pruned architectures.

If we retrain the pruned architectures using only two output neurons, with desired outputs encoded as (−1, −1) (setosa), (+1, −1) (versicolor), and (−1, +1) (virginica), the classification rates improve considerably. For example, the classification rate for the CAPE-pruned Model 1 changed from 95.5% (3rd row in Table 4) to 98.9% (not shown in Table 4). A reduction in the mean squared error values was also observed: the testing error changed from \(\varepsilon_{\rm test}=0.2153\) to \(\varepsilon_{\rm test}=0.0829\). Improvements in average classification rates and mean squared errors were also observed for the OBS-pruned Model 1.

We also analyzed the effects of pruning on the confusion matrices of the classifiers. The confusion matrix of the unpruned Model 4 on the testing data (45 pattern vectors, 15 per class) is shown in Table 5. The first row of this table indicates that all 15 pattern vectors of the class setosa (i.e. 33.3% of the total amount of testing data) are correctly assigned to this class. The same occurs for the 15 pattern vectors of the class virginica (third row). Of the 15 data vectors of the class versicolor, 12 (26.7% of the total) are correctly classified as belonging to the class versicolor, while 3 (6.7% of the total) are erroneously assigned to the class virginica.

Table 5 Confusion matrix for unpruned Model 4 (Iris data)

The confusion matrices obtained for Model 4, pruned by the CAPE and OBS methods, are shown in Table 6. It is clear that the confusion matrix obtained from the CAPE-pruned Model 4 is better than that produced by the OBS-pruned Model 4. As a conclusion, we recommend that pruned architectures presenting equivalent classification rates be further evaluated by their confusion matrices.

Table 6 Confusion matrices for the pruned Model 4 (Iris data)

5.3 Results for the vertebral column data set

For this simulation, we set \(J_{\rm tol}\) = 85% as the minimum acceptable recognition rate for training and testing purposes. That is, even if a pruned network produces a CRtrain value higher than 85%, its CRtest value should also be higher than 85%. Network pruning is carried out progressively through successive applications of the CAPE/OBS/PWM methods. Numerical results are summarized in Table 7.

Table 7 Numerical results of the successive application of the pruning methods to vertebral column data set

The CAPE method produced a pruned Architecture 1 with Q = 19 hidden neurons and \(N_c = 126\) connections, i.e. with fewer connections than a fully connected MLP architecture with the same number of hidden neurons (\(N_c = 193\)). The OBS method ended with a pruned network with Q = 22 hidden neurons and \(N_c = 134\) connections. The final classification rates (CRtrain and CRtest) of both methods were coincidentally the same.

Some important facts are worth mentioning with respect to the PWM method [5], which is described in the “Appendix”. Firstly, despite ending with the network with the smallest number of connections (\(N_c = 121\)) for Q = 21 hidden neurons, its CRtest is smaller than 85%. Secondly, due to the small number of weights, it achieved the lowest AIC value, wrongly indicating this architecture as the best one for this problem. These results indicate that the AIC index is possibly not the best model selection criterion for classification problems, since it is based on the average squared error only, not on classification rates.

Similarly to the procedure described in Sect. 5.1, we trained another fully connected, one-hidden-layer MLP, denoted Architecture 2, with a number of hidden neurons (Q = 18) close to the final value achieved by the CAPE-pruned Architecture 1. The training parameters for Architecture 2 were the same as before. The application of the CAPE and OBS methods to Architecture 2 led to the same number of hidden neurons (Q = 14), with the latter achieving fewer connections than the former. However, the classification rate CRtest for the CAPE method was higher. The AIC criterion, in this case, indicated the OBS method as the one providing the best pruned architecture, despite the fact that the CAPE method produced the highest classification rate on the testing data. This was mainly due to the smaller number of connections achieved by the OBS method.

The PWM method performed quite well in terms of classification rates, ending with a pruned architecture with Q = 16 neurons and \(N_c = 96\) connections. It is worth mentioning, however, that the PWM method is highly sensitive to weight initialization, and the numerical results shown in Table 7 for the PWM method correspond to the best ones obtained after 10 training/testing runs. The CAPE and OBS methods are much less sensitive to weight initialization.

As the recognition rate CRtest for the pruned Architecture 2, irrespective of the pruning method, also remained above \(J_{\rm tol}\) = 85%, we decided to start pruning another fully connected, one-hidden-layer MLP with Q = 13 hidden neurons. However, no method was able to further reduce the number of hidden neurons, only the number of weights. In any case, all pruned networks achieved recognition rates on the testing data below the minimum allowed value. Thus, one can infer that the architectures best suited to the problem are the pruned ones obtained from Architecture 2.

For the sake of comparison, we evaluated two SVM classifiers, using linear and RBF kernels, over 50 runs on the vertebral column data set. Both SVM classifiers were trained by the standard SMO algorithm of Platt [22]. These classifiers achieved average recognition rates of 85.1% (linear kernel) and 85.3% (RBF kernel), which are clearly lower than the recognition rate reported in Table 7 for the CAPE-pruned Architecture 2 (88.59%).

6 Feature subset selection using CAPE

If all the weights emanating from one or more input units have been discarded, the features associated with those input units are considered nonrelevant, since they play no role in the computations. Thus, the proposed weight pruning method can also be used for feature selection (see Footnote 2). For this simulation, we used the Wine (13 features and 178 instances) and Dermatology (34 features and 366 instances) benchmarking data sets.
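In practice, a feature can be flagged as nonrelevant by checking whether its column in the pruned input-to-hidden weight matrix is entirely zero. A minimal check is sketched below, using the same assumed weight layout as the earlier sketches (W is Q × (P+1), column 0 holding the thresholds).

```python
import numpy as np

def nonrelevant_features(W):
    """Return the (1-based) indices of input features whose weights were all pruned.
    Column j of W corresponds to feature j; column 0 is the threshold column."""
    return [j for j in range(1, W.shape[1]) if np.all(W[:, j] == 0.0)]
```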

For the Wine data set, we trained an MLP with Q = 9 hidden neurons using all \(N_{\rm feat} = 13\) features and the whole data set (see Footnote 3). The total number of weights is \(N_c = 156\). The network was trained until it achieved CRtrain = 100%. Then, the CAPE method was applied with \(J_{\rm tol}\) = 99.50%.

The results shown in Table 8 suggest that there are at least 3 redundant input features, as can be seen in the rightmost column. Hence, we created another training data set, now with only \(N_{\rm feat} = 10\) features, by excluding features 6, 7, and 9. Another MLP was then trained and pruned. This procedure was repeated until no more features could be discarded. At the end of 4 runs of the CAPE method, only 8 features remained.

Table 8 Progressive feature selection for the Wine data set

Figure 5a, b illustrates CAPE's pruning power. In Fig. 5a, we can see the original fully connected network, Architecture 1 (P = 13, Q = 9, M = 3, and \(N_c = 156\)), while in Fig. 5b, we see the resulting pruned architecture (\(N_{\rm feat} = 8\), Q = 4, M = 3, and \(N_c = 25\)) after the application of the CAPE method. As a final step, we evaluated this pruned architecture by training and testing it using a 10-fold cross-validation scheme. No weight pruning was allowed at this step. The resulting average classification rate was 98.60%.

Fig. 5 a Fully connected MLP (Q = 9 and \(N_c = 156\)) and b pruned MLP (Q = 4 and \(N_c = 25\))

To further evaluate CAPE on feature selection tasks, we trained a fully connected MLP with Q = 10 hidden neurons on the Dermatology data set, which contains 366 patterns distributed into six classes. Each vector initially had \(N_{\rm feat} = 34\) features, used to categorize six skin diseases. The whole data set was used for the training process. A procedure similar to the one applied to the Wine data set was followed. All architectures were trained to achieve CRtrain > 99.50%, with \(J_{\rm tol}\) = 99.50%. The results are summarized in Table 9.

Table 9 Progressive feature selection for the Dermatology data set

The CAPE-based feature subset selection approach led to the elimination of 15 features at the end of the pruning process of Architecture 7, while the OBS-based approach led to the elimination of only 8 features. Using the remaining 19 features, the pruned Architecture 7 achieved a recognition rate of CRtrain = 99.73%.

An interesting situation arises when we further apply the CAPE algorithm to Architecture 8. Elimination of features 16 and 26 is suggested (1st trial). However, when these two features were pruned out, the resulting architecture, a new MLP trained using only 17 features, did not achieve a recognition rate above the acceptable minimum (i.e. CRtrain < \(J_{\rm tol}\) = 99.50%). We then decided to train two other MLPs using 18 features: the first after the elimination of feature 16 only (2nd trial), and the second after the elimination of feature 26 only (3rd trial). The 2nd trial achieved the minimum acceptable recognition rate, while the 3rd trial did not. Therefore, we kept feature 26 and eliminated feature 16.

The final MLP classifier, pruned by CAPE on the Dermatology data set, presents the following numbers: \(N_{\rm feat} = 18\), Q = 10, M = 6, and \(N_c = 109\). Features 1, 7, 11, 12, 13, 16, 17, 18, 20, 24, 25, 29, 30, 32, 33, and 34 can be discarded, since they are considered nonrelevant for the classification task of interest. As a final step, we evaluated the pruned architecture by training and testing it using a tenfold cross-validation scheme. No weight pruning was allowed at this step. The resulting average classification rate was 98.00%.

6.1 Comparison with related methods

In this section, we compare the performance of the CAPE-based feature subset selection method with those reported in some recent works available in the literature.

Luukka [18] introduced a wrapper approach for feature subset selection using fuzzy entropy measures. This approach removed only 5 features from the Dermatology data set, achieving an average classification rate of 98.28%. Rocha et al. [25] proposed two strategies for evolving MLP-like classifiers, named Topology-optimization Evolutionary Neural Network (TENN) and Simultaneous Evolutionary Neural Network (SENN), and evaluated them on several benchmarking data sets, including the Dermatology data set, reporting average classification rates of 95.7% (TENN) and 95.3% (SENN). For the sake of comparison, on the Dermatology data set, the CAPE-based feature selection method resulted in an average classification rate of 98.00% using only 18 (out of 34) features.

Moustakidis and Theocharis [19] introduced SVM-FuzCoC, a powerful SVM-based feature selection method that achieved an average classification rate of 97.12% on the Wine data set using 6 features on average. For the Dermatology data set, SVM-FuzCoC achieved an average classification rate of 94.11% using 12 features on average. For the Dermatology (Wine) data set, the CAPE-based feature selection method resulted in an average recognition rate of 98.00% (98.6%) using 18 (8) features. In conclusion, the performance of the SVM-FuzCoC method in terms of the trade-off between the smallest number of features and the highest classification rate is slightly better than that provided by the CAPE method, but this is achieved at the expense of much higher computational effort.

7 CAPE × OBS: estimating the computational costs

Computational cost is an important issue to be addressed when evaluating algorithms. Sometimes, an algorithm's effectiveness in providing a solution to a given problem requires the execution of complex computations and the use of excessive memory resources, which can create severe difficulties in real-time or low-cost applications.

In this paper, we have evaluated the CAPE and OBS methods on several classification and feature selection tasks. One may argue that the CAPE method performed only slightly better than OBS, which would not justify choosing it as a pruning method in place of the classical OBS/OBD methods. As mentioned before, however, the OBS/OBD methods require the computation of the inverse Hessian matrix for a given neuron in order to determine the weight saliencies (see the “Appendix”). Computation of the inverse Hessian matrix for MLP networks is a tricky task in itself. Exact computation of the Hessian matrix for MLP networks is possible [4], but its inversion is still subject to numerical ill-conditioning. In this regard, the CAPE method demands much less computational effort than the OBS/OBD methods, since no matrix inversion is required.

In the results to be described, we take into account all the essential mathematical operations used to compute the weight saliencies (\(S_i\)), as required by the OBS algorithm, and the cross-correlation indexes (\(\mathbf{C}_{oh}[k,i]\) and \(\mathbf{C}_{hi}[i,j]\)), as required by the CAPE algorithm. Tables 10 and 11 show the number of mathematical operations required by the CAPE and OBS methods, respectively, as a function of the number of pattern vectors (N), the input dimension (P), the number of hidden neurons (Q), and the number of output neurons (M).

Table 10 Mathematical operations demanded by the CAPE method
Table 11 Mathematical operations demanded by the OBS method

Figure 6 shows how the number of operations required by the two pruning methods evolves as a function of the number of weights. For the sake of simplicity, we consider that sums, multiplications, divisions, etc. all have the same computational cost.

Fig. 6 Estimated costs for CAPE × OBS when pruning an MLP(2, Q, 1) classifier

The numbers of operations required by OBS (denoted \(N_{op\_obs}(N_c)\)) and by CAPE (denoted \(N_{op\_cape}(N_c)\)) were computed by applying both algorithms to an MLP with P = 2 input units, M = 1 output neuron, and a number of hidden neurons Q ranging from 1 to 10. The number of instances in the data set is N = 140. One can easily see that the OBS method has a quadratic dependence on \(N_c\), while the CAPE method presents a linear dependence with a very small slope. This result can help a user decide in favor of the CAPE pruning method for pattern classification.
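As a rough back-of-the-envelope check of this linear trend (ours, not the paper's full accounting in Tables 10–11), one can count only the multiply–accumulate operations behind Eqs. 10, 13–14, 16 and 18–19; this gives approximately \(4N[M(Q+1)+Q(P+1)] = 4N\,N_c\) operations per pruning pass, i.e. linear in \(N_c\) for fixed N. The paper's tables include further terms (e.g. the re-evaluation of \(J_{\rm train}\)), so the numbers below underestimate the totals but reproduce the linear growth.

```python
def cape_correlation_ops(N, P, Q, M):
    """Approximate multiply-accumulate count behind the CAPE cross-correlations."""
    ops_backprop_errors = 2 * N * (M * (Q + 1) + Q * (P + 1))  # Eqs. 10 and 16
    ops_correlations = 2 * N * (M * (Q + 1) + Q * (P + 1))     # Eqs. 13-14 and 18-19
    return ops_backprop_errors + ops_correlations

# Rough example matching Sect. 7: P = 2, M = 1, N = 140, Q = 1, ..., 10
for Q in range(1, 11):
    Nc = (2 + 1) * Q + (Q + 1) * 1      # number of weights, as defined in Sect. 5.1
    print(Q, Nc, cape_correlation_ops(140, 2, Q, 1))
```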

8 Conclusions

In this paper, we introduced the CAPE method, an easy-to-apply and efficient procedure for pruning nonrelevant weights of a trained MLP classifier. The CAPE method was motivated by the MAXCORE principle and is based on the correlation analysis of the errors produced by the output neurons and the errors propagated back to the previous layers. Comprehensive computer simulations using synthetic and real-world data indicate that, in terms of performance, the CAPE method compares favorably with standard pruning techniques, such as OBS, WDE, and PWM, with the advantage of requiring much lower computational effort.