1 Introduction

Even though more than two and a half decades have passed since the rediscovery of the backpropagation algorithm in the mid-1980s, and despite all the available literature on the MLP network, a beginner soon becomes aware of the difficulties in finding a suitable architecture for real-world applications. In fact, this is a hard task even for an experienced practitioner. An architecture that is too small will not be able to learn from the data properly, no matter which training algorithm is used. An architecture with too many input units and hidden layers/neurons is prone to learning undesirable characteristics (e.g. noise) of the training data.

Hence, a crucial step in the design of an MLP is the network model selection problem [5]. This problem is still a research topic of interest [8, 10, 13, 20, 28, 30, 33] and can be roughly defined as the task of finding the smallest architecture that generalizes well, making good predictions for new data. Generalization can be controlled by varying the number of adjustable parameters (weights and biases) associated with input units and hidden/output neurons. Among the several ways to implement this in practice, we list the following four as possibly the most common approaches.

1.1 Exhaustive search plus early stopping

Early stopping of training is a method that aims to prevent overtraining due to an oversized network, noisy training examples, or a small training set [7]. The performances of several networks, with different numbers of features, hidden layers, and hidden neurons, are evaluated during training on an independent validation set. Training of each network is stopped as soon as its generalization error begins to increase. The optimal architecture is the one providing the smallest generalization error.

1.2 Constructive algorithms

Network training starts with a small number of hidden neurons, and neurons are added during the training process, with the goal of arriving at an optimal network structure [1, 21]. This is the approach behind the Cascade-Correlation network [9, 12, 14, 17, 31].

1.3 Pruning algorithms

Network training can start with a relatively large number of hidden neurons and then, based on a measure of weight relevance, prune out the least significant connections, either by removing individual weights or by removing complete units [6, 11, 24]. This is the approach followed by the Optimal Brain Surgeon (OBS) [15, 29] and the Weight Decay and Elimination (WDE) algorithm [23, 32]. OBS performs weight pruning based on the so-called weight saliencies (see the “Appendix”). The WDE algorithm originates from a regularization method that modifies the error function by introducing a term that penalizes large weights.

1.4 Evolutionary computation

Several authors have used genetic algorithms [2, 27, 34] and other evolutionary algorithms, such as particle swarm optimization [35], for determining the best network topology for a given problem, including the number of hidden layers, the number of neurons per hidden layer, and the number of relevant input features, as well as the connection weights and biases.

The model selection methods above require significant computational effort. For example, even if we restrict the exhaustive search to a specific class of architectures (e.g. one-hidden-layer MLPs), it is still a burdensome task. Evolutionary algorithms are slow by their very population-based nature and usually cannot be used for real-time purposes. The OBS algorithm demands the inversion of the Hessian matrix of the error function, which is computationally intensive [4]. The WDE algorithm requires the specification of a user-defined regularization parameter, which is generally set by cross-validation.

In this paper, we introduce an efficient method for pruning unnecessary weights of a trained MLP classifier without the need for matrix inversions or additional regularization parameters. The proposed method is based on the correlation analysis between the errors of the output neurons and the errors backpropagated to the hidden neurons. Repeated application of the method eventually leads to the complete elimination of all connections arriving at (or emanating from) a given neuron. Computer simulations using synthetic and real-world data sets are carried out to compare the proposed method with the OBS and WDE algorithms on pattern classification and feature subset selection tasks.

The remainder of the paper is organized as follows. In Sect. 2, we briefly review the backpropagation algorithm. In Sect. 3, we introduce the MAXCORE principle as the guiding principle behind the weight pruning method described in Sect. 4. Comprehensive computer simulations are then presented in Sect. 5 for model selection tasks and in Sect. 6 for feature selection tasks. The computational cost of the proposed weight pruning method is discussed in Sect. 7. We conclude the paper in Sect. 8 with a summary of the achievements and suggestions for further developments.

2 The backpropagation algorithm

In this section, we describe the basics of the backpropagation algorithm for training a one-hidden-layer MLP. This learning algorithm requires two passes of computation: a forward pass and a backward pass. During the forward pass, activations and outputs are computed on a neuron-by-neuron basis, while the weights remain unaltered. At iteration t, the activation of the ith hidden neuron, \(i=1, 2, \ldots, Q,\) is given by

$$ u_{i}^{(h)}(t) = \sum_{j=1}^{P} w_{ij}(t) x_j(t) - \theta_i(t) = \sum_{j=0}^{P} w_{ij}(t) x_j(t), $$
(1)

where \(w_{ij}(t)\) is the synaptic weight connecting the jth input unit to the ith hidden neuron, \(\theta_i(t)\) is the bias of the ith hidden neuron, \(Q\ (2 \leq Q < \infty)\) is the number of hidden neurons, and P is the dimension of the input vector (excluding the threshold). To simplify notation, we set \(x_0(t) = -1\) and \(w_{i0}(t) = \theta_i^{(h)}(t)\).

The output of the ith hidden neuron is then defined by

$$ y_{i}^{(h)}(t) = \varphi_i\left[u_{i}^{(h)}(t)\right] = \varphi_i\left[\sum_{j=0}^{P} w_{ij}(t) x_j(t)\right], $$
(2)

where \(\varphi_i(\cdot)\) is usually a sigmoidal function. Similarly, the output values of the output neurons are given by

$$ y_k^{(o)}(t) = \varphi_k\left[u_{k}^{(o)}(t)\right] = \varphi_k\left[\sum_{i=0}^{Q} m_{ki}(t) y_{i}^{(h)}(t)\right], $$
(3)

where \(m_{ki}(t)\) is the synaptic weight connecting the ith hidden neuron to the kth output neuron (\(k=1, \ldots, M\)), and M ≥ 1 is the number of output neurons. Again, to simplify notation, we set \(y_0^{(h)}(t) = -1\) and \(m_{k0}(t) = \theta_k^{(o)}(t)\), where \(\theta_k^{(o)}(t)\) is the threshold of the kth output neuron.
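For concreteness, the forward pass of Eqs. 1–3 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's MATLAB implementation: the variable names (W for the input-to-hidden weights, M_out for the hidden-to-output weights) are ours, and hyperbolic tangent activations are assumed, as used later in Sect. 5.

```python
import numpy as np

def forward_pass(x, W, M_out):
    """Forward pass of a one-hidden-layer MLP (Eqs. 1-3).

    x     : input vector of length P
    W     : (Q, P+1) hidden-layer weights, column 0 holding the thresholds
    M_out : (M, Q+1) output-layer weights, column 0 holding the thresholds
    """
    x_aug = np.concatenate(([-1.0], x))        # x_0 = -1 absorbs the bias
    y_h = np.tanh(W @ x_aug)                   # hidden activations/outputs, Eqs. 1-2
    y_h_aug = np.concatenate(([-1.0], y_h))    # y_0^(h) = -1 absorbs the output bias
    y_o = np.tanh(M_out @ y_h_aug)             # output values, Eq. 3
    return y_h_aug, y_o
```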

The backward pass starts at the output layer by propagating the error signals from the output layer toward the hidden layer. For that, we first need to compute the error value \(e_k^{(o)}(t)\) generated by each output neuron at the current iteration t:

$$ e_k^{(o)}(t) = d_k(t) - y_k^{(o)}(t), \quad k=1, \ldots, M $$
(4)

where \(d_k(t)\) is the target value for the kth output neuron. The error \(e_k^{(o)}(t)\) is multiplied by the derivative \(\varphi^{\prime}_{k}\left[u_{k}^{(o)}(t)\right] = \partial \varphi_{k}/\partial u_{k}^{(o)}\) before being propagated to the hidden layer:

$$ \delta_{k}^{(o)}(t) = \varphi^{\prime}_{k}\left[u_{k}^{(o)}(t)\right]e_{k}^{(o)}(t). $$
(5)

The quantity resulting from this multiplication is known as the local gradient of the kth output neuron. Similarly, the local gradient \(\delta_i^{(h)}(t)\) of the ith hidden neuron is computed as

$$ \begin{aligned} \delta_{i}^{(h)}(t) &= \varphi_i^{\prime}\left[u_{i}^{(h)}(t)\right] \sum_{k=1}^{M} m_{ki}(t) \delta_{k}^{(o)}(t)\\ &= \varphi_i^{\prime}\left[u_{i}^{(h)}(t)\right]e_i^{(h)}(t), \quad i = 0, \ldots, Q, \\ \end{aligned} $$
(6)

where the term \(e_i^{(h)}(t)\) plays the role of a backpropagated (or projected) error signal for the ith hidden neuron.

Finally, the synaptic weights of the output neurons are updated according to the following rule

$$ m_{ki}(t+1) = m_{ki}(t) + \eta \delta_{k}^{(o)}(t) y_{i}^{(h)}(t), \quad i = 0, \ldots, Q, $$
(7)

where \(0 < \eta \ll 1\) is the learning rate. The weights of the hidden neurons are, in turn, adjusted through a similar learning rule

$$ w_{ij}(t+1) = w_{ij}(t) + \eta \delta_{i}^{(h)}(t) x_j(t), \quad j = 0, \ldots, P. $$
(8)

One complete presentation of the entire training set during the learning process is called an epoch. Many epochs may be required until convergence of the backpropagation algorithm is verified. It is good practice to randomize the order of presentation of the training examples from one epoch to the next, in order to make the search in weight space stochastic over the learning cycles.

The simplest way of evaluating convergence is through the average squared error

$$ \varepsilon_{\rm train} = \frac{1}{2N} \sum_{t=1}^{N} \sum_{k=1}^{M} \left[ e_{k}^{(o)}(t) \right]^2 = \frac{1}{2N} \sum_{t=1}^{N} \sum_{k=1}^{M} \left[ d_k(t) - y_k^{(o)}(t) \right]^2, $$
(9)

computed at the end of a training run over the training data vectors. If it falls below a prespecified value, then convergence is declared. The generalization performance of the MLP should be evaluated on a testing set, which contains examples not seen by the network during training.
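A minimal sketch of one on-line backpropagation update (Eqs. 4–8) and its contribution to Eq. 9 is given below, under the same assumptions and naming conventions as the forward-pass sketch above (tanh units, so \(\varphi^{\prime}(u) = 1 - \varphi^{2}(u)\); the learning rate default follows the value used in Sect. 5).

```python
import numpy as np

def backprop_step(x, d, W, M_out, eta=0.001):
    """One on-line backpropagation update (Eqs. 4-8), assuming tanh units.
    W is (Q, P+1), M_out is (M, Q+1); both are updated in place."""
    x_aug = np.concatenate(([-1.0], x))
    y_h_aug = np.concatenate(([-1.0], np.tanh(W @ x_aug)))   # forward pass, Eqs. 1-2
    y_o = np.tanh(M_out @ y_h_aug)                           # Eq. 3

    e_o = d - y_o                                # output errors, Eq. 4
    delta_o = (1.0 - y_o**2) * e_o               # output local gradients, Eq. 5
    e_h = M_out.T @ delta_o                      # backpropagated errors, Eqs. 6/10
    delta_h = (1.0 - y_h_aug**2) * e_h           # hidden local gradients, Eq. 6

    M_out += eta * np.outer(delta_o, y_h_aug)    # output-layer update, Eq. 7
    W += eta * np.outer(delta_h[1:], x_aug)      # hidden-layer update, Eq. 8
    return 0.5 * np.sum(e_o**2)                  # this pattern's contribution to Eq. 9
```

Summing the returned values over the N training patterns and dividing by N yields \(\varepsilon_{\rm train}\) of Eq. 9.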

3 The MAXCORE principle

In Eq. 6, we have defined the backpropagated error signal for the ith hidden neuron as

$$ e_i^{(h)}(t) = \sum_{k=1}^{M} m_{ki}(t) \delta_{k}^{(o)}(t) = \sum_{k=1}^{M} m_{ki}^{*}(t) e_{k}^{(o)}(t), $$
(10)

where \(m_{ki}^{*}(t) = m_{ki}(t) \varphi^{\prime}_{k}\left[u_{k}^{(o)}(t)\right]\) is the weight between the kth output neuron and the ith hidden neuron modulated by the derivative of the kth output activation function.

From Eq. 10, we can easily note that \(e_i^{(h)}(t)\) is the result of a linear combination of the output error signals \(e_{k}^{(o)}(t), k=1,\ldots,M,\) with weights \(m_{ki}^{*}(t)\). Thus, the higher the value of \(m_{ki}^{*}(t)\) and, hence, of \(m_{ki}(t)\), the higher the correlation between \(e_i^{(h)}(t)\) and \(e_k^{(o)}(t)\). A similar linear relationship can be derived for the signals \(e_i^{(h)}(t)\), associated with the hidden neurons, and the signals \(e_j^{(i)}(t)\), associated with the input units (see Eq. 16).

From the above, we can state the following general principle, henceforth called the Principle of Maximum Correlation of Errors (MAXCORE):

The MAXCORE principle: Relevant synaptic weights tend to generate higher correlations between the error signals associated with the neurons of a given layer and the error signals propagated back to the previous layer; nonrelevant (i.e. prunable) weights tend to generate smaller correlations.

MAXCORE is the guiding principle behind the weight pruning procedure described in the next section.
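A quick numerical illustration of the principle (ours, not from the paper): if the modulated weights \(m_{ki}^{*}\) are held fixed over a pass and the output errors are roughly uncorrelated across output neurons, the cross-correlation between \(e_k^{(o)}\) and \(e_i^{(h)}\) grows essentially in proportion to \(|m_{ki}^{*}|\). The sketch below checks this with synthetic errors; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, Q = 1000, 3, 5

E_o = rng.normal(size=(N, M))          # output errors e_k^(o)(t), roughly uncorrelated
M_star = rng.normal(size=(M, Q + 1))   # modulated weights m*_ki, held fixed for the check
E_h = E_o @ M_star                     # backpropagated errors, Eq. 10 (columns i = 0..Q)

C_oh = E_o.T @ E_h                     # cross-correlations, Eqs. 13-14
# MAXCORE: entries with larger |m*_ki| should yield larger |C_oh[k, i]|.
print(np.corrcoef(np.abs(M_star).ravel(), np.abs(C_oh).ravel())[0, 1])
```

The printed correlation between \(|m_{ki}^{*}|\) and \(|{\mathbf{C}}_{oh}[k,i]|\) should come out close to 1, confirming the trend.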

4 The proposed methodology

The procedure to be described is a pruning method. The user should first train an MLP network with a relatively large number of hidden neurons until it achieves the lowest possible value of \(\varepsilon_{\rm train}\). Recall that large MLPs are prone to overfitting due to the local optimization of their cost function, which can lead to poor generalization performance. After that, the application of a pruning procedure can discard nonrelevant weights of an overparameterized network, thus providing a smaller model with equivalent or better generalization performance than the original one.

The main idea behind the pruning procedure to be described is to maintain those connections that produce higher correlations between the errors in a given layer and the errors that are propagated back to the preceding layer. Weights associated with smaller error correlations are candidates to be discarded.

4.1 Pruning hidden-to-output layer weights

The pruning procedure starts with the training data being presented once again to the trained network. No weight updating is allowed from this stage on. Once all N training examples have been presented, we compute the error matrices \(\mathbf{E}_{o}\) and \(\mathbf{E}_{h},\) defined as follows:

$$ {{\mathbf{E}}}_{o} = \left[\begin{array}{cccc} e_{1}^{(o)}(1) & e_{2}^{(o)}(1) & \cdots & e_{M}^{(o)}(1) \\ e_{1}^{(o)}(2) & e_{2}^{(o)}(2) & \cdots & e_{M}^{(o)}(2) \\ \vdots & \vdots & \vdots & \vdots \\ e_{1}^{(o)}(N) & e_{2}^{(o)}(N) & \cdots & e_{M}^{(o)}(N) \end{array} \right]_{N \times M} $$
(11)

and

$$ {{\mathbf{E}}}_{h} = \left[\begin{array}{ccccc} e_{0}^{(h)}(1) & e_{1}^{(h)}(1) & \cdots & e_{Q}^{(h)}(1) \\ e_{0}^{(h)}(2) & e_{1}^{(h)}(2) & \cdots & e_{Q}^{(h)}(2) \\ \vdots & \vdots & \vdots & \vdots \\ e_{0}^{(h)}(N) & e_{1}^{(h)}(N) & \cdots & e_{Q}^{(h)}(N) \end{array} \right]_{N \times (Q+1)}. $$
(12)

The rows of the matrix \({\mathbf{E}}_{o}\) correspond to the errors generated by the output neurons for a given training example. Hence, this matrix is called from now on the matrix of output errors, in contrast to the matrix \({\mathbf{E}}_{h},\) whose rows correspond to the backpropagated errors associated with the hidden neurons. In particular, the first column of \({\mathbf{E}}_{h}\) corresponds to backpropagated errors associated with the thresholds \(m_{k0} = \theta_{k}^{(o)}, k=1, \ldots, M\).

The second step involves the computation of the following matrix product

$$ {{\mathbf{C}}}_{oh} = {{\mathbf{E}}}_{o}^{T}{{\mathbf{E}}}_{h}, $$
(13)

where the superscript T denotes the transpose of a matrix. Note that the (k, i)th entry of \({\mathbf{C}}_{oh},\) denoted by \({\mathbf{C}}_{oh}[k,i],\) corresponds to the scalar product (i.e. the cross-correlation) of the kth column of \({\mathbf{E}}_{o}\) with the ith column of \({\mathbf{E}}_{h},\)

$$ {{\mathbf{C}}}_{oh}[k,i] = \sum_{t=1}^{N} e_{k}^{(o)}(t)e_{i}^{(h)}(t), $$
(14)

for \(k=1, \ldots, M,\) and \(i=0, \ldots, Q\).

The third step requires sorting the absolute values of entries \({\mathbf{C}}_{oh}[k,i]\) in ascending order

$$ |{\mathbf{C}}_{oh}[{\mathbf{r}}_{1}]| < |{\mathbf{C}}_{oh}[{\mathbf{r}}_{2}]| < \cdots < |{\mathbf{C}}_{oh}[{\mathbf{r}}_{L}]|, $$
(15)

where the vector \({\mathbf{r}}_l=(k_{l}, i_{l})\) contains the coordinates of the entry of position l in the ranking, and \(L=\dim({\mathbf{C}}_{oh})=M\times(Q+1)\) is the number of entries in \({\mathbf{C}}_{oh}\).
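The first three steps can be condensed into a few vectorized lines. The sketch below is ours and assumes tanh activations and the same weight layout as the earlier sketches: W is Q × (P+1) and M_out is M × (Q+1), with column 0 holding the thresholds.

```python
import numpy as np

def rank_hidden_to_output(X, D, W, M_out):
    """Steps 1-3 of Sect. 4.1: build E_o and E_h (Eqs. 11-12), compute C_oh (Eqs. 13-14)
    and return the coordinates r_l sorted by increasing |C_oh| (Eq. 15)."""
    N = X.shape[0]
    X_aug = np.hstack([-np.ones((N, 1)), X])                   # x_0 = -1
    Y_h = np.hstack([-np.ones((N, 1)), np.tanh(X_aug @ W.T)])  # y_0^(h) = -1
    Y_o = np.tanh(Y_h @ M_out.T)

    E_o = D - Y_o                              # output errors, Eq. 11 (N x M)
    Delta_o = (1.0 - Y_o**2) * E_o             # local gradients, Eq. 5 (row-wise)
    E_h = Delta_o @ M_out                      # backpropagated errors, Eq. 12 (N x (Q+1))

    C_oh = E_o.T @ E_h                         # cross-correlations, Eqs. 13-14 (M x (Q+1))
    order = np.argsort(np.abs(C_oh), axis=None)                # ascending |C_oh|, Eq. 15
    return [tuple(np.unravel_index(l, C_oh.shape)) for l in order], C_oh
```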

The fourth step involves the execution of the pruning procedure shown in Table 1. In this table, \(J_{\rm train}\) denotes a performance index used to evaluate the network on the training data, such as the average squared error \(\varepsilon_{\rm train}\) or the classification rate CRtrain. The constant \(J_{\rm tol}\) is a user-defined value. If \(J_{\rm train}=\varepsilon_{\rm train},\) then \(J_{\rm tol}\) is the maximum error allowed for the training data; in this case, we eliminate a given connection \(m_{{\mathbf{r}}_{l}}\) only if the value of \(J_{\rm train}\), computed after the elimination of that connection, remains lower than \(J_{\rm tol}\). If \(J_{\rm train}\) = CRtrain, then \(J_{\rm tol}\) is the minimum recognition rate allowed for the training data; in this case, we eliminate a given connection \(m_{{\mathbf{r}}_{l}}\) only if the value of \(J_{\rm train}\), computed after the elimination of that connection, still remains higher than \(J_{\rm tol}\).

Table 1 Pruning procedure for hidden-to-output layer weights
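Table 1 itself is not reproduced here, but the loop it describes can be sketched directly from the text above for the case \(J_{\rm train}\) = CRtrain: candidate weights are tried in the order given by Eq. 15, and a deletion is kept only if the training classification rate stays above \(J_{\rm tol}\) (for \(J_{\rm train}=\varepsilon_{\rm train}\) the comparison is simply reversed). Whether the original procedure stops at the first rejected candidate or continues down the ranking is not specified here; the sketch continues down the ranking.

```python
def prune_hidden_to_output(M_out, ranking, cr_train, J_tol):
    """Tentative pruning of hidden-to-output weights (cf. Table 1).

    M_out    : (M, Q+1) hidden-to-output weight matrix, modified in place
    ranking  : list of (k, i) pairs sorted by increasing |C_oh[k, i]| (Eq. 15)
    cr_train : callable giving the training classification rate for the current M_out
               (the rest of the network and the data are assumed captured in its closure)
    J_tol    : minimum acceptable training classification rate
    """
    for (k, i) in ranking:
        saved = M_out[k, i]
        if saved == 0.0:              # connection already pruned
            continue
        M_out[k, i] = 0.0             # tentatively remove m_ki
        if cr_train(M_out) < J_tol:   # performance dropped below tolerance:
            M_out[k, i] = saved       # restore the connection
    return M_out
```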

4.2 Pruning input-to-hidden layer weights

For pruning the weights connecting the input units to the hidden neurons, \(\{w_{ij}\}\), we now need to propagate the error signals \(e_i^{(h)}(t)\) toward the input units in order to generate the error signals \(e_{j}^{(i)}(t), j = 0, \ldots, P\):

$$ e_{j}^{(i)}(t) = \sum_{i=1}^{Q} w_{ij}(t) \delta_{i}^{(h)}(t) = \sum_{i=1}^{Q} w_{ij}^{*}(t) e_{i}^{(h)}(t), $$
(16)

where \(w_{ij}^{*}(t) = w_{ij}(t) \varphi^{\prime}_{i}\left[u_{i}^{(h)}(t)\right]\) is the weight between the ith hidden neuron and the jth input unit modulated by the derivative of the ith hidden activation function.

After the presentation of N training vectors, the resulting errors can be organized into the error matrix \(\mathbf{E}_{i},\) defined as

$$ {{\mathbf{E}}}_{i} = \left[\begin{array}{ccccc} e_{0}^{(i)}(1) & e_{1}^{(i)}(1) & \cdots & e_{P}^{(i)}(1) \\ e_{0}^{(i)}(2) & e_{1}^{(i)}(2) & \cdots & e_{P}^{(i)}(2) \\ \vdots & \vdots & \vdots & \vdots \\ e_{0}^{(i)}(N) & e_{1}^{(i)}(N) & \cdots & e_{P}^{(i)}(N) \end{array} \right]_{N \times (P+1)}. $$
(17)

Similarly to the procedure described in the previous section, we need to compute the following matrix product

$$ {{\mathbf{C}}}_{hi} = ({{\mathbf{E}}}_h^-)^{T}{{\mathbf{E}}}_i, $$
(18)

where the matrix \({\mathbf{E}}_h^-\) is obtained from the matrix \({\mathbf{E}}_h\) by removing its first column. This does not influence the pruning process, since there are no connections between the constant input of the hidden layer (i.e. \(y_0^{(h)}(t) = -1\)) and the input units.

The (i, j)th entry of \({\mathbf{C}}_{hi},\) denoted by \({\mathbf{C}}_{hi}[i,j],\) is given by

$$ {{\mathbf{C}}}_{hi}[i,j] = \sum_{t=1}^{N} e_{i}^{(h)}(t)e_{j}^{(i)}(t), $$
(19)

for \(i=1, \ldots, Q,\) and \(j=0, \ldots, P\). Then, we sort the absolute values of entries \({\mathbf{C}}_{hi}[i,j]\) in ascending order

$$ |{\mathbf{C}}_{hi}[{\mathbf{s}}_{1}]| < |{\mathbf{C}}_{hi}[{\mathbf{s}}_{2}]| < \cdots < |{\mathbf{C}}_{hi}[{\mathbf{s}}_{L}]|, $$
(20)

where the vector \({\mathbf{s}}_l=(i_{l}, j_{l})\) contains the coordinates of the entry of position l in the ranking, and \(L=\dim({\mathbf{C}}_{hi})=Q\times(P+1)\) is the number of entries in \({\mathbf{C}}_{hi}\).

The final step involves the execution of the pruning procedure shown in Table 2. For obvious reasons, from now on we refer to the proposed method as the CAPE method (Cross-correlation Analysis of back-Propagated Errors). The method must be applied repeatedly until no more weights can be pruned (see the flowchart in Fig. 1 and the sketch after it).

Table 2 Pruning procedure for input-to-hidden layer weights
Fig. 1 Flowchart for the CAPE methodology
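Following the flowchart, the two pruning stages are alternated until a complete pass removes no weight. A schematic outer loop is sketched below, assuming stage routines like the ones sketched in Sects. 4.1 and 4.2 are wrapped into callables; this is our construction, not the paper's code.

```python
def cape(prune_output_stage, prune_hidden_stage, count_nonzero_weights):
    """Outer CAPE loop (cf. Fig. 1): repeat both pruning stages until no
    weight is removed in a complete pass. The three callables are assumed
    to encapsulate the network, the training data, and J_tol."""
    while True:
        before = count_nonzero_weights()
        prune_output_stage()       # Sect. 4.1: hidden-to-output weights
        prune_hidden_stage()       # Sect. 4.2: input-to-hidden weights
        if count_nonzero_weights() == before:
            break                  # no weight pruned in this pass: stop
```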

5 Model selection using the CAPE method

As a proof of concept, we work initially with a synthetic two-dimensional data set in order to visualize the hyperplanes (i.e. lines in this case) due to each hidden neuron and the resulting global decision surface. This data set is composed of 200 pattern vectors divided into two nonlinearly separable classes. For training, 140 of them are randomly chosen while the remaining ones are used for testing the performance of the pruned network.

The second set of simulations involves preliminary tests using two benchmarking data sets (Iris and Wine). The Iris data set is composed of 150 4-dimensional pattern vectors distributed into 3 classes. The Wine data set is composed of 178 13-dimensional vectors also distributed into 3 classes. For the Iris (Wine) data set, 35 (33) patterns per class are randomly selected for training purposes and the remaining 45 (79) ones are used for testing.

For a more demanding classification task, we have used a biomedical data set kindly made available for this research by the Group of Applied Research in Orthopaedics (GARO) of the Centre Médico-Chirurgical de Réadaptation des Massues, Lyon, France. The task consists in classifying patients as belonging to one of three categories: Normal (100 patients), Disk Hernia (60 patients), or Spondylolisthesis (150 patients). Each patient is represented in the database by six biomechanical attributes derived from the shape and orientation of the pelvis and cervical, thoracic, and lumbar spine: pelvic incidence, pelvic tilt, sacral slope, pelvic radius, lumbar lordosis angle, and grade of spondylolisthesis. The training set is formed by randomly selecting 42 pattern vectors per class, and the remaining 184 examples are used for testing purposes. This data set is available upon request. For further details, we recommend the works of Berthonnaud et al. [3] and Rocha Neto and Barreto [26].

Weights and biases are randomly initialized within the range −0.5 to +0.5. All neurons use hyperbolic tangent activation functions, and the inputs are normalized to the range of network activations. Features are normalized to zero mean and unit variance. The output target vectors use the 1-out-of-M binary encoding. The learning rate was set to η = 0.001 for all simulations. All the networks and pruning methods were implemented in MATLAB 7.0.

5.1 Results for the synthetic data set

For the first data set, we initially trained a fully connected MLP network with Q = 12 hidden neurons. The input and output dimensions were set to P = 2 and M = 1, respectively. This network was trained until the MSE value for the training data, \(\varepsilon_{\rm train}\), reached a steady state, i.e. until no significant changes were observed.

Numerical results are shown in Table 3. In this table, \(N_c = (P + 1) \times Q + (Q + 1) \times M\) is the initial number of weights, CRtrain and CRtest stand for the classification rates for the training and testing data, respectively, \(\varepsilon_{\rm test}\) is the mean squared error on the testing data set, and AIC stands for Akaike's Information Criterion (see Footnote 1).

Table 3 Numerical results for the synthetic data set

The application of the CAPE and OBS methods involves several steps. Initially, they are applied to Architecture 1, with Q = 12 hidden neurons and, hence, \(N_c = 49\) adjustable parameters in total. Note that this initial architecture classifies the data accurately. However, we are not only interested in good classification rates, but also in knowing whether there is redundancy, i.e. whether this architecture is oversized. The application of CAPE and OBS resulted in a pruned architecture with Q = 9 hidden neurons, but with fewer connections than a fully connected MLP architecture with Q = 9 neurons (\(N_c = 37\)). The classification rates of both architectures remained the same as before. The resulting AIC value is slightly higher for the architecture pruned by the CAPE method, mainly due to its larger number of parameters.

Since the performance of the pruned architectures with Q = 9 hidden neurons remained at acceptable levels, we wondered what would happen if we applied the pruning methods to another fully connected, one-hidden-layer MLP (Architecture 2) with approximately the same number of connections (\(N_c = 33\)) as the pruned Architecture 1. The rationale was to verify whether there was still room for further weight pruning. The application of CAPE and OBS to Architecture 2 led to the same pruned architecture, with Q = 7 neurons and \(N_c = 22\) connections. Again, note that the pruned Architecture 2 has fewer connections than a fully connected network with Q = 7 neurons (\(N_c = 29\)). We continued the pruning procedure for two more fully connected networks, namely Architectures 3 and 4. The results were identical for both the CAPE and OBS methods.

The results of the application of the WDE algorithm to Architecture 1 are also shown in Table 3. Since this algorithm works during the training phase (see the “Appendix”), it cannot be compared directly with the OBS and CAPE methods, which are applied to already trained networks. Thus, only the final result is shown. The regularization parameter was set to λ = 0.001 after some experimentation. The application of the WDE algorithm resulted in an architecture with 4 neurons and 17 connections. However, the CRtest of the resulting architecture is lower than that achieved by the pruned Architecture 4.

Figure 2 shows the resulting global decision surface and the position of the hyperplanes (i.e. lines) associated with the unpruned Architecture 1. We can clearly see that there are too many lines (i.e. redundant neurons), many more than required to discriminate well between the two classes. For comparison purposes, the decision surface and the hyperplanes for Architecture 4, pruned by the CAPE method, are shown in Fig. 3, where only three lines remain. The successive application of the CAPE method, starting from Architecture 1, eventually led to the minimum number of hidden neurons required to solve this nonlinearly separable problem.

Fig. 2 Decision surface (left) and individual lines (right) for a trained MLP with Q = 12 hidden neurons

Fig. 3 Decision surface (left) and individual lines (right) for a pruned MLP with Q = 3 hidden neurons

Note that the OBS algorithm also led to the same optimal three-hidden-neuron architecture, but at the expense of much higher computational costs. Recall that the OBS algorithm requires the computation of the inverse of a Hessian matrix, one for each neuron in the network, every time the architecture is retrained. The CAPE method requires no matrix inversion at all. In addition, the CAPE method does not require the specification of the regularization parameter λ (WDE algorithm), which has to be estimated by trial and error or by cross-validation.

5.2 Results for the Iris data set

Table 4 shows the results achieved by the pruned architectures on the Iris data set. The initial network topology (Q = 9 hidden neurons) was independently trained and tested four times, using the same training and testing data sets. Only the weight initialization and the order of presentation of the training vectors per epoch were changed from one training/testing run to another. This was necessary to emphasize differences between the two pruning methods.

Table 4 Results of the application of the pruning methods (Iris data)

Weight pruning was carried out using the highest classification rate achieved for the training data as reference (i.e. \(J_{\rm train}\) = CRtrain). In general, both methods presented equivalent performance on the testing data if we consider only average classification rates and mean squared errors in the evaluation. In terms of the final number of weights, the OBS method presented a slight advantage over CAPE, since it ended with a smaller number of weights and hidden neurons. However, as we show later, if we also consider the resulting confusion matrices of the pruned networks, the networks pruned by CAPE performed better than those pruned by OBS.

An interesting behavior was observed in this simulation. The architectures labeled with an asterisk (*) had an output neuron pruned. For all four initial architectures, the application of the CAPE method pruned an output neuron, while the OBS method pruned an output neuron in 3 out of 4 cases. As mentioned earlier, a neuron has to be pruned when all the connections converging to (or emanating from) it have been pruned. In this particular simulation, the fact that an output neuron has been pruned may indicate that the output coding is redundant. As a matter of fact, the 1-out-of-M output encoding used in all simulations in this paper requires three neurons to represent the three classes of the Iris data set. However, it is possible to represent the three classes using only two output neurons. This redundancy in the output class representation was found, without any a priori information about the task, by both the CAPE and OBS methods, but it is a particularly strong feature of the CAPE method.

Figure 4a shows the classification results for Model 1 with Q = 9 hidden neurons and M = 3 output neurons. The points in the figure correspond to the neurons' output values \(y_k^{(o)}, k = 1, 2, 3\); thus, the output space is three-dimensional. The output coordinates \((y_1^{(o)}, y_2^{(o)}, y_3^{(o)})\) corresponding to input vectors classified as belonging to the setosa class (in blue) are concentrated around the coordinates (+1, −1, −1). Points corresponding to input vectors classified as belonging to the versicolor class (in red) and to the virginica class (in black) are concentrated around the coordinates (−1, +1, −1) and (−1, −1, +1), respectively. There is some overlap between points belonging to the versicolor and virginica classes, resulting in a few classification errors.

Fig. 4 Training (asterisk) and testing (circle) classification results for the Iris data set. a Unpruned MLP and b pruned MLP with the pruned output set to zero

Figure 4b shows the resulting classification achieved by Model 1 after the application of the CAPE method. The pruned architecture has \(N_c = 41\) weights, Q = 9 hidden neurons, and still M = 3 output neurons. However, since all the weights from the hidden layer to output neuron 1 have been pruned, the output of this neuron (i.e. \(y_1^{(o)}\)) plays no role and has been set to zero. Hence, the output coordinates for the classes setosa, versicolor, and virginica now spread around the points (0, −1, −1), (0, +1, −1), and (0, −1, +1), respectively.

Numerical results are shown in Table 4. Average classification rates and error values were computed for the pruned networks using the original 1-out-of-M encoding for the desired outputs, i.e. maintaining three output neurons but with the pruned output set to zero (\(y_1^{(o)} = 0\)). Despite that, the classification rates for the testing data remained at acceptable levels (i.e. >93%) for the pruned architectures.

If we retrain the pruned architectures using only two output neurons, with desired outputs encoded as (−1, −1) (setosa), (+1, −1) (versicolor), and (−1, +1) (virginica), the classification rates improve considerably. For example, the classification rate for the CAPE-pruned Model 1 changed from 95.5% (3rd row in Table 4) to 98.9% (not shown in Table 4). A reduction in the mean squared error values was also observed: the testing error changed from \(\varepsilon_{\rm test}=0.2153\) to \(\varepsilon_{\rm test}=0.0829\). Improvements in average classification rates and mean squared errors were also observed for the OBS-pruned Model 1.

We also analyzed the effects of pruning on the confusion matrices of the classifiers. The confusion matrix of the unpruned Model 4 on the testing data (45 pattern vectors, 15 per class) is shown in Table 5. The first row of this table indicates that all 15 pattern vectors of the class setosa (i.e. 33.3% of the total amount of testing data) are correctly assigned to this class. The same occurs for the 15 pattern vectors of the class virginica (third row). Of the 15 data vectors of the class versicolor, 12 (26.7% of the total) are correctly classified as belonging to the class versicolor, while 3 (6.7% of the total) are erroneously assigned to the class virginica.

Table 5 Confusion matrix for unpruned Model 4 (Iris data)

The confusion matrices obtained for Model 4, pruned by the CAPE and OBS methods, are shown in Table 6. It is clear that the confusion matrix obtained from the CAPE-pruned Model 4 is better than that produced by the OBS-pruned Model 4. As a conclusion, we recommend that pruned architectures presenting equivalent classification rates be further evaluated by their confusion matrices.

Table 6 Confusion matrices for the pruned Model 4 (Iris data)

5.3 Results for the vertebral column data set

For this simulation, we set \(J_{\rm tol}\) = 85% as the minimum acceptable recognition rate for training and testing purposes. That is, even if a pruned network produces a CRtrain value higher than 85%, its CRtest value should also be higher than 85%. Network pruning is carried out progressively through successive applications of the CAPE/OBS/PWM methods. Numerical results are summarized in Table 7.

Table 7 Numerical results of the successive application of the pruning methods to vertebral column data set

The CAPE method produced a pruned Architecture 1 with Q = 19 hidden neurons and \(N_c = 126\) connections, i.e. with fewer connections than a fully connected MLP architecture with the same number of hidden neurons (\(N_c = 193\)). The OBS method ended with a pruned network with Q = 22 hidden neurons and \(N_c = 134\) connections. The final classification rates (CRtrain and CRtest) of both methods were coincidentally the same.

Some important facts are worth mentioning with respect to the PWM method [5], which is described in the “Appendix”. Firstly, despite ending with the network with the smallest number of connections (\(N_c = 121\)) for Q = 21 hidden neurons, its CRtest is smaller than 85%. Secondly, due to the small number of weights, it achieved the lowest AIC value, wrongly indicating this architecture as the best one for this problem. These results indicate that the AIC index is possibly not the best model selection criterion for classification problems, since it is based on the average squared error only, not on classification rates.

Similarly to the procedure described in Sect. 5.1, we trained another fully connected, one-hidden-layer MLP, denoted Architecture 2, with a number of hidden neurons (Q = 18) close to the final value achieved by the CAPE-pruned Architecture 1. The training parameters for Architecture 2 were the same as before. The application of the CAPE and OBS methods to Architecture 2 led to the same number of hidden neurons (Q = 14), with the latter achieving fewer connections than the former. However, the classification rate CRtest for the CAPE method was higher. The AIC criterion, in this case, indicated the OBS method as the one providing the best pruned architecture, despite the fact that the CAPE method produced the highest classification rate on the testing data. This was mainly due to the smaller number of connections achieved by the OBS method.

The PWM method performed quite well in terms of classification rates, ending with a pruned architecture with Q = 16 neurons and \(N_c = 96\) connections. It is worth mentioning, however, that the PWM method is highly sensitive to weight initialization, and the numerical results shown in Table 7 for the PWM method correspond to the best ones obtained after 10 training/testing runs. The CAPE and OBS methods are much less sensitive to weight initialization.

As the recognition rate CRtest for the pruned Architecture 2, irrespective of the pruning method, also remained above \(J_{\rm tol}\) = 85%, we decided to start pruning another fully connected, one-hidden-layer MLP with Q = 13 hidden neurons. However, no method was able to further reduce the number of hidden neurons, only the number of weights. In any case, all pruned networks achieved recognition rates on the testing data below the minimum allowed value. Thus, one can infer that the architectures best suited to the problem are the pruned ones obtained from Architecture 2.

For the sake of comparison, we evaluated two SVM classifiers, using linear and RBF kernels, over 50 runs on the vertebral column data set. Both SVM classifiers were trained by the standard SMO algorithm of Platt [22]. These classifiers achieved average recognition rates of 85.1% (linear kernel) and 85.3% (RBF kernel), which are clearly lower than the recognition rate reported in Table 7 for the CAPE-pruned Architecture 2 (88.59%).

6 Feature subset selection using CAPE

If all the weights emanating from one or more input units have been discarded, the features associated with those input units are considered nonrelevant, since they play no role in the computations. Thus, the proposed weight pruning method can also be used for feature selection (see Footnote 2). For this simulation, we used the Wine (13 features and 178 instances) and Dermatology (34 features and 366 instances) benchmarking data sets.
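In practice, a feature can be flagged as nonrelevant by checking whether its column in the pruned input-to-hidden weight matrix is entirely zero. A minimal check is sketched below, using the same assumed weight layout as the earlier sketches (W is Q × (P+1), column 0 holding the thresholds).

```python
import numpy as np

def nonrelevant_features(W):
    """Return the (1-based) indices of input features whose weights were all pruned.
    Column j of W corresponds to feature j; column 0 is the threshold column."""
    return [j for j in range(1, W.shape[1]) if np.all(W[:, j] == 0.0)]
```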

For the Wine data set, we trained an MLP with Q = 9 hidden neurons using all \(N_{\rm feat} = 13\) features and the whole data set (see Footnote 3). The total number of weights is \(N_c = 156\). The network was trained until it achieved CRtrain = 100%. Then, the CAPE method was applied with \(J_{\rm tol}\) = 99.50%.

The results shown in Table 8 suggest that there are at least 3 redundant input features, as can be seen in the rightmost column. Hence, we created another training data set, now with only \(N_{\rm feat} = 10\) features, by excluding features 6, 7, and 9. Another MLP was then trained and pruned. This procedure was repeated until no more features could be discarded. At the end of 4 runs of the CAPE method, only 8 features remained.

Table 8 Progressive feature selection for the Wine data set

Figure 5a, b illustrates CAPE's pruning power. In Fig. 5a, we can see the original fully connected network, Architecture 1 (P = 13, Q = 9, M = 3, and \(N_c = 156\)), while in Fig. 5b, we see the resulting pruned architecture (\(N_{\rm feat} = 8\), Q = 4, M = 3, and \(N_c = 25\)) after the application of the CAPE method. As a final step, we evaluated this pruned architecture by training and testing it using a 10-fold cross-validation scheme. No weight pruning was allowed at this step. The resulting average classification rate was 98.60%.

Fig. 5 a Fully connected MLP (Q = 9 and \(N_c = 156\)) and b pruned MLP (Q = 4 and \(N_c = 25\))

To further evaluate CAPE on feature selection tasks, we trained a fully connected MLP with Q = 10 hidden neurons on the Dermatology data set, which contains 366 patterns distributed into six classes. Each vector initially had \(N_{\rm feat} = 34\) features, used to categorize six skin diseases. The whole data set was used for the training process. A procedure similar to the one applied to the Wine data set was followed. All architectures were trained to achieve CRtrain > 99.50%, with \(J_{\rm tol}\) = 99.50%. The results are summarized in Table 9.

Table 9 Progressive feature selection for the Dermatology data set

The CAPE-based feature subset selection approach led to the elimination of 15 features at the end of the pruning process of Architecture 7, while the OBS-based approach led to the elimination of only 8 features. Using the remaining 19 features, the pruned Architecture 7 achieved a recognition rate of CRtrain = 99.73%.

An interesting situation arises when we further apply the CAPE algorithm to Architecture 8. Elimination of features 16 and 26 is suggested (1st trial). However, when these two features were pruned out, the resulting architecture, a new MLP trained using only 17 features, did not achieve a recognition rate above the acceptable minimum (i.e. CRtrain < \(J_{\rm tol}\) = 99.50%). We then decided to train two other MLPs using 18 features: the first after the elimination of feature 16 only (2nd trial), and the second after the elimination of feature 26 only (3rd trial). The 2nd trial achieved the minimum acceptable recognition rate, while the 3rd trial did not. Therefore, we kept feature 26 and eliminated feature 16.

The final MLP classifier, pruned by CAPE on the Dermatology data set, presents the following numbers: \(N_{\rm feat} = 18\), Q = 10, M = 6, and \(N_c = 109\). Features 1, 7, 11, 12, 13, 16, 17, 18, 20, 24, 25, 29, 30, 32, 33, and 34 can be discarded, since they are considered nonrelevant for the classification task of interest. As a final step, we evaluated the pruned architecture by training and testing it using a tenfold cross-validation scheme. No weight pruning was allowed at this step. The resulting average classification rate was 98.00%.

6.1 Comparison with related methods

In this section, we compare the performance of the CAPE-based feature subset selection method with those reported in some recent works available in the literature.

Luukka [18] introduced a wrapper approach for feature subset selection using fuzzy entropy measures. This approach removed only 5 features from the Dermatology data set, achieving an average classification rate of 98.28%. Rocha et al. [25] proposed two strategies for evolving MLP-like classifiers, named Topology-optimization Evolutionary Neural Network (TENN) and Simultaneous Evolutionary Neural Network (SENN), and evaluated them on several benchmarking data sets, including the Dermatology data set, reporting average classification rates of 95.7% (TENN) and 95.3% (SENN). For the sake of comparison, on the Dermatology data set, the CAPE-based feature selection method resulted in an average classification rate of 98.00% using only 18 (out of 34) features.

Moustakidis and Theocharis [19] introduced SVM-FuzCoC, a powerful SVM-based feature selection method that achieved an average classification rate of 97.12% on the Wine data set using 6 features on average. For the Dermatology data set, SVM-FuzCoC achieved an average classification rate of 94.11% using 12 features on average. For the Dermatology (Wine) data set, the CAPE-based feature selection method resulted in an average recognition rate of 98.00% (98.6%) using 18 (8) features. In conclusion, the performance of the SVM-FuzCoC method in terms of the trade-off between the smallest number of features and the highest classification rate is slightly better than that provided by the CAPE method, but this is achieved at the expense of much higher computational effort.

7 CAPE × OBS: estimating the computational costs

Computational cost is an important issue to be addressed when evaluating algorithms. Sometimes, an algorithm's effectiveness in providing a solution to a given problem requires the execution of complex computations and the use of excessive memory resources, which can create severe difficulties in real-time or low-cost applications.

In this paper, we have evaluated the CAPE and OBS methods on several classification and feature selection tasks. One may argue that the CAPE method performed only slightly better than OBS, which would not justify choosing it as a pruning method in place of the classical OBS/OBD methods. As mentioned before, however, the OBS/OBD methods require the computation of the inverse Hessian matrix for a given neuron in order to determine the weight saliencies (see the “Appendix”). Computation of the inverse Hessian matrix for MLP networks is a tricky task in itself. Exact computation of the Hessian matrix for MLP networks is possible [4], but its inversion is still subject to numerical ill-conditioning. In this regard, the CAPE method demands much less computational effort than the OBS/OBD methods, since no matrix inversion is required.

In the results to be described, we take into account all the essential mathematical operations used to compute the weight saliencies (\(S_i\)), as required by the OBS algorithm, and the cross-correlation indexes (\(\mathbf{C}_{oh}[k,i]\) and \(\mathbf{C}_{hi}[i,j]\)), as required by the CAPE algorithm. Tables 10 and 11 show the number of mathematical operations required by the CAPE and OBS methods, respectively, as a function of the number of pattern vectors (N), the input dimension (P), the number of hidden neurons (Q), and the number of output neurons (M).

Table 10 Mathematical operations demanded by the CAPE method
Table 11 Mathematical operations demanded by the OBS method

Figure 6 shows how the number of operations required by the two pruning methods evolves as a function of the number of weights. For the sake of simplicity, we consider that sums, multiplications, divisions, etc. all have the same computational cost.

Fig. 6 Estimated costs for CAPE × OBS when pruning an MLP(2, Q, 1) classifier

The numbers of operations required by OBS (denoted \(N_{op\_obs}(N_c)\)) and by CAPE (denoted \(N_{op\_cape}(N_c)\)) were computed by applying both algorithms to an MLP with P = 2 input units, M = 1 output neuron, and a number of hidden neurons Q ranging from 1 to 10. The number of instances in the data set is N = 140. One can easily see that the OBS method has a quadratic dependence on \(N_c\), while the CAPE method presents a linear dependence with a very small slope. This result can help a user decide in favor of the CAPE pruning method for pattern classification.
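As a rough back-of-the-envelope check of this linear trend (ours, not the paper's full accounting in Tables 10–11), one can count only the multiply–accumulate operations behind Eqs. 10, 13–14, 16 and 18–19; this gives approximately \(4N[M(Q+1)+Q(P+1)] = 4N\,N_c\) operations per pruning pass, i.e. linear in \(N_c\) for fixed N. The paper's tables include further terms (e.g. the re-evaluation of \(J_{\rm train}\)), so the numbers below underestimate the totals but reproduce the linear growth.

```python
def cape_correlation_ops(N, P, Q, M):
    """Approximate multiply-accumulate count behind the CAPE cross-correlations."""
    ops_backprop_errors = 2 * N * (M * (Q + 1) + Q * (P + 1))  # Eqs. 10 and 16
    ops_correlations = 2 * N * (M * (Q + 1) + Q * (P + 1))     # Eqs. 13-14 and 18-19
    return ops_backprop_errors + ops_correlations

# Rough example matching Sect. 7: P = 2, M = 1, N = 140, Q = 1, ..., 10
for Q in range(1, 11):
    Nc = (2 + 1) * Q + (Q + 1) * 1      # number of weights, as defined in Sect. 5.1
    print(Q, Nc, cape_correlation_ops(140, 2, Q, 1))
```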

8 Conclusions

In this paper, we introduced the CAPE method, an easy-to-apply and efficient procedure for pruning nonrelevant weights of a trained MLP classifier. The CAPE method was motivated by the MAXCORE principle and is based on the correlation analysis of the errors produced by the output neurons and the errors propagated back to the previous layers. Comprehensive computer simulations using synthetic and real-world data indicate that, in terms of performance, the CAPE method compares favorably with standard pruning techniques, such as OBS, WDE, and PWM, with the advantage of requiring much lower computational effort.