1 Introduction

Artificial neural networks are biologically inspired computational models based on the structure and operation of the brain, composed of several interconnected nodes or neurons. Some of these units exchange information with the environment (input and output units), while others are internal (hidden) units that communicate only with other units within the network. The learning and generalization properties of these models give them a wide applicability in pattern recognition, financial analysis, biology, medicine, image analysis, etc.

Supervised artificial neural networks have evolved from the shallow architectures proposed in the 1980s [21] to the deep learning (DL) models that are nowadays used in many applications and that have revolutionized the field of machine learning [12]. Despite the great advances and superior prediction capabilities shown by DL architectures, certain aspects still leave room for other types of neural architectures: (a) DL architectures are very large and require heavy computational resources, both in terms of memory and computation; (b) DL has not shown a clear improvement over alternative techniques when small data sets are used. Both aspects are particularly relevant to embedded systems, used in sensors, actuators and many other electronic devices, since their implementation requires small neural architectures with good learning and generalization capabilities.

Algorithms that can automatically determine an optimal network architecture according to the complexity of the underlying function embedded in the data set are highly desirable [4, 7]. Efforts toward determining the network size have been made in the literature for many years, and several techniques have been developed, among which we can highlight singular value decomposition [26], techniques based on the study of the geometry of the classes [1, 7, 14, 28] and the analysis of information entropy [27].

Alternative approaches have been proposed by other authors, among which we can distinguish pruning techniques and constructive methods [2, 5]. The latter option offers the possibility of generating networks that grow as the input data are analyzed. Moreover, the training procedure of constructive algorithms, considered a computationally expensive problem in standard feed-forward neural networks, can be done online and relatively fast even in very small electronic devices [17]. Pruning techniques work in the opposite way: a large architecture is used as the starting point, and neurons and connections are pruned following an optimization criterion. Furthermore, constructive algorithms search for small architectures, offering the possibility of finding networks of minimum size that match the complexity of the data.

C-Mantec (Competitive MAjority Network Trained by Error Correction) is a constructive neural network algorithm [23, 24] that has shown exactly these capabilities, through a competitive scheme that leads to very small single-hidden-layer neural architectures, and that has been successfully implemented in limited-resource hardware such as micro-controllers and embedded systems [17, 18], noting that the use of a small architecture also reduces energy consumption [25].

The C-Mantec algorithm builds neural networks with a single hidden layer of neurons and a single output neuron, adding new nodes to the hidden layer until all training patterns are correctly classified. It combines competition between neurons with the thermal perceptron learning rule [6], adding stability to the acquired knowledge while the architecture of the network grows. At the single-neuron level, C-Mantec uses the thermal perceptron rule, while at the global level it applies a competitive strategy between neurons, thus obtaining more compact architectures. The main difference with existing constructive algorithms is that the existing neurons keep learning the information provided by the input patterns at all times, without freezing their synaptic weights as is the standard procedure in most constructive schemes.

Section 2 of this paper presents a study of the training procedure (the "loading problem") carried out on networks trained with the C-Mantec and back-propagation algorithms in order to analyze the observed differences. Section 3 then analyzes the improvements implemented in the C-Mantec algorithm with the aim of increasing its generalization performance by optimizing the hyperplanes of the neurons, and includes the results obtained on a set of benchmark functions. Finally, the conclusions of the studies carried out are presented in Sect. 4.

2 Methods

2.1 C-Mantec neural network constructive algorithm

We give below a brief introduction to the way the C-Mantec algorithm works, as it is relevant for understanding the rest of the results. C-Mantec generates single-hidden-layer network architectures with one output neuron. The activation state (S) of each neuron in the hidden layer depends on the N input signals \(\varPsi _i\), on the current values of the N synaptic weights \(w_i\) and on a bias b as follows:

$$\begin{aligned} S = \left\{ \begin{array}{ll} 1\,(\mathrm{ON}) &amp; \quad \text {if } \varPhi \ge 0 \\ 0\,(\mathrm{OFF}) &amp; \quad \text {otherwise} \end{array}\right. \end{aligned}$$

where \(\varPhi\) is the synaptic potential of the neuron, defined by:

$$\begin{aligned} \varPhi =\sum _{i=1}^{N} w_i\varPsi _i-b \end{aligned}$$

The thermal perceptron rule modifies the synaptic weights by an amount \(\Delta w_i\) after the presentation of a single input pattern according to the following equation:

$$\begin{aligned} \Delta w_i = (t-S)\,\varPsi _i\,T_\mathrm{fac} \end{aligned}$$

where t is the target value of the presented input and \(\varPsi _i\) represents the value of input unit i connected to the output by the weight \(w_i\). The \(T_\mathrm{fac}\) factor can be defined as:

$$\begin{aligned} T_\mathrm{fac}=\frac{T}{T_0} e^{-\frac{|\varPhi |}{T}} \end{aligned}$$

Similar to a simulated annealing process, the value of T decreases with the learning process according to:

$$\begin{aligned} T=T_0 \; \left( 1-\frac{I}{I_\mathrm{max}}\right) , \end{aligned}$$

where \(I_\mathrm{max}\) defines the maximum number of iterations allowed and I is a cycle counter. The thermal perceptron can be seen as a modification of the standard perceptron rule that incorporates a modulation factor forcing the neurons to learn only target examples close to those already learned. For a deeper analysis of the performance of the thermal rule, see the original paper [6].
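To make the single-neuron update concrete, the following is a minimal sketch (in Python, not the authors' code) of one thermal perceptron step built from the equations above; the function names and the treatment of the bias as a weight on a constant -1 input are assumptions for illustration.

```python
import numpy as np

def temperature(T0, I, I_max):
    """Temperature schedule T = T0 * (1 - I / I_max)."""
    return T0 * (1.0 - I / I_max)

def thermal_update(w, b, psi, t, T, T0):
    """One thermal perceptron update for a neuron with weights w and bias b,
    presented with input pattern psi and target t (0 or 1)."""
    phi = float(np.dot(w, psi) - b)            # synaptic potential
    S = 1 if phi >= 0 else 0                   # activation state (ON/OFF)
    T_fac = (T / T0) * np.exp(-abs(phi) / T)   # thermal modulation factor
    delta = (t - S) * T_fac                    # common factor of the update
    w_new = w + delta * psi                    # Delta w_i = (t - S) * psi_i * T_fac
    b_new = b - delta                          # assumption: bias as weight on a -1 input
    return w_new, b_new, S, T_fac
```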

Competition between neurons is implemented by selecting the neuron with the smallest value of \(\varPhi\), which learns the presented input only if certain conditions are met (namely, if its \(T_\mathrm{fac}\) value is larger than the \(g_\mathrm{fac}\) parameter of the algorithm, in order to prevent the unlearning of previously stored information); otherwise, a new neuron is added to the network to learn it. The individual temperature T of each neuron is lowered every time its weights are updated. The addition of a new neuron to the network resets the individual temperature of all neurons to the initial value \(T_{0}\). The learning process continues until the whole set of input patterns is classified correctly by the network.

The C-Mantec output neuron computes the majority function (also called the median operator [11]) of the activations of the hidden nodes, as previous experiments showed that this function has very good capabilities among the set of linearly separable functions [22]. The resulting classifier is defined by a set of hyperplanes that categorizes new examples. The growing factor parameter \(g_\mathrm{fac}\) determines when to stop the current learning cycle and include a new node in the hidden layer.
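Putting the previous pieces together, the sketch below (my own simplified reconstruction, not the reference implementation) illustrates the competition, cooling and growth logic just described; it reuses `thermal_update` and `temperature` from the previous sketch and omits details such as the pattern noise filter and weight initialization, so it carries no convergence guarantee.

```python
import numpy as np

def majority_output(neurons, psi):
    """Network output: majority vote of the hidden-neuron activations."""
    states = [1 if np.dot(w, psi) - b >= 0 else 0 for w, b, T, I in neurons]
    return 1 if 2 * sum(states) >= len(states) else 0

def c_mantec_sketch(X, y, T0=10.0, g_fac=0.05, I_max=10000, max_updates=50000):
    rng = np.random.default_rng(0)
    neurons = [[rng.normal(size=X.shape[1]), 0.0, T0, 0]]   # [w, b, T, I]
    for _ in range(max_updates):
        wrong = [i for i in range(len(X)) if majority_output(neurons, X[i]) != y[i]]
        if not wrong:
            break                                 # every training pattern is learned
        k = int(rng.choice(wrong))
        psi, t = X[k], y[k]
        # competition: among the neurons that misclassify the pattern, select
        # the one whose synaptic potential is closest to the decision boundary
        losers = [n for n in neurons
                  if (1 if np.dot(n[0], psi) - n[1] >= 0 else 0) != t]
        n = min(losers, key=lambda m: abs(np.dot(m[0], psi) - m[1]))
        w, b, T, I = n
        T_fac = (T / T0) * np.exp(-abs(np.dot(w, psi) - b) / T)
        if T_fac > g_fac:                         # learn without unlearning others
            n[0], n[1], _, _ = thermal_update(w, b, psi, t, T, T0)
            n[3] = I + 1
            n[2] = temperature(T0, n[3], I_max)   # cool only the updated neuron
        else:                                     # grow: new neuron, reset all T
            neurons.append([rng.normal(size=X.shape[1]), 0.0, T0, 0])
            for m in neurons:
                m[2], m[3] = T0, 0
    return neurons
```

A call such as `c_mantec_sketch(X, y)` on an array of input patterns X and binary targets y returns the list of hidden neurons found; the number of neurons is the architecture size referred to in the experiments below.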

2.2 Set of benchmark data sets

In order to analyze the performance of the C-Mantec algorithm and of the different strategies to improve the accuracy of the classification results, we have selected 15 data sets with real-valued inputs and Boolean outputs from the UCI Machine Learning Repository [13] and the DELVE project. The selected data sets (name and number of inputs) are indicated in the first two columns of Table 3.

2.3 Analysis of the training procedure (“loading problem”) for C-Mantec and back-propagation algorithms

Probably the most interesting aspect of neural networks is their generalization ability. Even if there are some theories explaining this effect, real results for specific architectures and data are difficult to explain by general theories, and thus there is almost no alternative but to rely on numerical simulations.

Feed-forward networks operate like combinatorial circuits when testing new examples, which is a fast and fully defined phase, but before predictions can be retrieved from a network, a training process is required to load the information into its components. This process of adjusting the synaptic weights so that the system performs the desired input–output mapping is known as the loading problem, and it has been identified in the general case as NP-complete [9, 10]. One way to analyze this training procedure is to train a neural network and then study how each of the neurons present in the hidden layers contributes to the correct performance of the network during the training phase. It has been argued that a balanced loading (as in the back-propagation case) leads to better generalization ability than the unbalanced cases that can occur with constructive algorithms.

In this section, we study how the C-Mantec algorithm distributes the load, comparing the results with those obtained using the back-propagation algorithm, trying to gain insight into how this distribution can influence the generalization capability of the networks.

Table 1 Load distribution information from every separate hyperplane using C-Mantec and back-propagation algorithms

We carried out numerical simulations using the 15 benchmark data sets described in Sect. 2.2, with a network architecture whose size was determined by running the C-Mantec algorithm. As one of the aims of the present study is the possible application of the algorithms in limited-resource devices (microcontrollers, embedded systems, etc.), for a fair comparison the back-propagation algorithm was executed with the same architecture (i.e., a single hidden layer containing the same number of neurons obtained by C-Mantec). The back-propagation algorithm was run using the MATLAB neural network toolbox with the default parameter settings and the Levenberg–Marquardt training algorithm. We used a classical data splitting scheme in which \(75\%\) of the whole set of examples was used for training and the remaining \(25\%\) for testing the prediction performance.

Table 1 shows the results of the experiments carried out to study the loading distribution: the first column is a function identifier, the second column indicates the method used, columns 3–8 show the loading information corresponding to every node in the network for the C-Mantec and back-propagation algorithms, and the last two columns indicate the generalization accuracy on the test set of examples and the standard deviation of the loading information distributed over the nodes. The loading value of each neuron present in the single hidden layer of the architectures used is computed as follows: once the training process of the network is finished, the fraction of times that the hidden neuron's output coincides with the desired target is measured.
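As an illustration of how the loading values in Table 1 could be computed, the following sketch (assuming the hidden layer is stored as a weight matrix W and a bias vector b, an assumed layout rather than the authors' code) measures, for each hidden neuron, the fraction of training patterns on which its binary output coincides with the target, together with the standard deviation reported in the last column.

```python
import numpy as np

def neuron_loading(W, b, X, y):
    """W: (n_hidden, n_inputs) weights, b: (n_hidden,) biases,
    X: (n_patterns, n_inputs) inputs, y: (n_patterns,) binary targets."""
    states = (X @ W.T - b >= 0).astype(int)        # hidden activations per pattern
    loading = (states == y[:, None]).mean(axis=0)  # agreement rate per neuron
    return loading, float(loading.std())           # loadings and their spread
```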

Fig. 1 Generalization results vs. the standard deviation of the loading distribution for the C-Mantec and back-propagation algorithms

Furthermore, Fig. 1 shows the standard deviation of the neuron loadings as a function of the generalization ability for back-propagation (right graph) and C-Mantec (left graph), where it can be seen that the average of both quantities is larger for C-Mantec. The results confirm the balanced loading distribution previously observed for the back-propagation algorithm, while larger values are obtained for the C-Mantec algorithm. The generalization ability observed is also larger for C-Mantec, but it is worth noting that for back-propagation the number of neurons in the hidden layer was not optimized but chosen equal to the one selected by the C-Mantec algorithm.

Table 2 Incremental load distribution obtained from the application of C-Mantec during the training phase

A further analysis was carried out in order to check and analyze the functioning of the C-Mantec algorithm. Table 2 shows the incremental training accuracy for the different analyzed data sets as the architecture grows as needed. The results confirm that the learning process always finishes with the complete training of the patterns, noting that in order to accomplish this, the C-Mantec algorithm removes some patterns considered as noise. In relation to this issue, and in order to avoid overfitting effects, C-Mantec includes a built-in process that removes patterns considered conflictive when they require too many learning modifications (the standard setting eliminates patterns needing more modifications than the average of the rest of the patterns plus two standard deviations). Further details of the procedure can be found in the original C-Mantec papers, where the algorithm was introduced [23, 24].
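The noise-filtering rule mentioned above can be sketched as follows (my reading of the description, not the original implementation): a pattern is flagged as conflictive when the number of weight modifications it has required exceeds the mean plus two standard deviations of the counts of the remaining patterns.

```python
import numpy as np

def conflictive_patterns(update_counts):
    """update_counts[i]: number of learning modifications needed by pattern i."""
    counts = np.asarray(update_counts, dtype=float)
    flagged = []
    for i, c in enumerate(counts):
        rest = np.delete(counts, i)                 # statistics of the other patterns
        if c > rest.mean() + 2.0 * rest.std():
            flagged.append(i)                       # candidate for removal as noise
    return flagged
```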

3 Optimizing C-Mantec hyperplane classification

The general observation about the performance of C-Mantec is that it often generates architectures with good prediction capabilities. A new neuron is added to the hidden layer of the architecture when a new training example cannot be classified correctly by the network within the pre-defined number of iterations. The addition of new nodes to the architecture might allow for new and better hyperplane solutions, and even though the training implemented in C-Mantec permits all neurons to keep learning, it does not guarantee an optimal hyperplane placement. In order to test how close to optimal the hyperplane selection is under the standard C-Mantec settings, we consider in this section different options to optimize the hyperplane positions.

A reasonable strategy is to select hyperplanes that have the largest separation to the nearest data point of each class. This idea is well known in the neural networks area and is a standard machine learning principle, known as maximum margin classification [8]. Support vector machines [3] use this principle to classify data, applying a technique called the kernel trick to convert a non-separable problem into a separable one.

Following this strategy, we present three offline approaches to create new, optimized hyperplanes, named maximum margin (MM), parallel to nearest neighbor (PNN) and Mixed, that offer the possibility of improving the accuracy of the classifier generated by the C-Mantec algorithm. The three approaches take the architecture determined by the application of the C-Mantec method as the starting point and generate a new network configuration by slightly modifying every hyperplane of the original C-Mantec solution through certain operations involving the support vectors of every class.

3.1 Maximum margin strategy (MM)

The MM strategy modifies each C-Mantec hyperplane individually, using the n nearest training examples on each side of it. First, the algorithm considers the patterns classified in one class, removes those that are misclassified and generates a group (\(G_+\)) of patterns sorted by their distance to the original C-Mantec classification surface; only the n nearest elements are kept in this group. The same process is carried out with the opposite class, generating the group \(G_-\).

The algorithm then selects the first pattern of the \(G_+\) group and matches it with the nearest pattern of the opposite group \(G_-\). Once this correspondence is established, the midpoint is computed and the two patterns are dropped from their respective groups. Carrying out the same operation with all patterns in \(G_+\), we obtain the n midpoints used to build the new MM hyperplane \(H_\mathrm{MM}\).

The MM strategy is not applied if the number of examples included in either of the classes is lower than n. Figure 2 shows a scheme of the MM strategy implementation.

Fig. 2 Maximum margin strategy (MM). \(H_\mathrm{CM}\) is the current C-Mantec hyperplane. The group \(G_+\) contains the examples closest to the C-Mantec hyperplane, \(x_i\) and \(x_j\); \(G_-\) is generated following the same procedure. \(H_\mathrm{MM}\) is computed from the midpoints between pairs of the n closest points on different sides of the hyperplane
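The following sketch illustrates the MM construction for a single C-Mantec hyperplane \((w, b)\) with \(\varPhi = w\cdot x - b\). Since the text does not fully specify how the n midpoints define \(H_\mathrm{MM}\), the sketch keeps the original orientation w and simply re-centres the hyperplane on the centroid of the midpoints; this choice, like the function names, is an assumption for illustration.

```python
import numpy as np

def mm_hyperplane(w, b, X, y, n=3):
    """Return an MM-adjusted (w, b) for one hyperplane, given patterns X, labels y."""
    signed = X @ w - b                           # signed distance (up to ||w||)
    correct = (signed >= 0).astype(int) == y     # keep correctly classified patterns
    G_pos = X[correct & (y == 1)]
    G_neg = X[correct & (y == 0)]
    if len(G_pos) < n or len(G_neg) < n:
        return w, b                              # MM not applied (see text)
    # n nearest patterns of each class to the current hyperplane
    G_pos = G_pos[np.argsort(np.abs(G_pos @ w - b))[:n]]
    G_neg = list(G_neg[np.argsort(np.abs(G_neg @ w - b))[:n]])
    midpoints = []
    for x_pos in G_pos:                          # match each x+ with its nearest x-
        j = int(np.argmin([np.linalg.norm(x_pos - x) for x in G_neg]))
        midpoints.append(0.5 * (x_pos + G_neg.pop(j)))
    centre = np.mean(midpoints, axis=0)
    return w, float(w @ centre)                  # new bias: plane through the centroid
```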

3.2 Parallel to nearest neighbor strategy (PNN)

The PNN strategy considers the two groups of patterns computed as in the MM strategy, and it has two parts. Once the \(G_+\) group is generated, the algorithm computes the hyperplane \(H_+\) defined by the n points of \(G_+\) nearest to the C-Mantec hyperplane \(H_\mathrm{CM}\). This hyperplane \(H_+\) is then used to find the nearest point \(x_j\) belonging to the opposite group, and the hyperplane \(H_+'\) through this point and parallel to \(H_+\) is computed. Finally, the bisector hyperplane determined by \(H_+\) and \(H_+'\) is taken as a candidate, labeled \(H_+^*\). Following the same procedure, the algorithm considers the opposite group of patterns, \(G_-\), to determine \(H_-^*\). Figure 3 shows a scheme of the PNN strategy implementation.

Fig. 3 PNN strategy scheme. \(H_+^*\) and \(H_-^*\) are the hyperplane candidates
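A sketch of the PNN candidate construction follows. The text does not specify how \(H_+\) is obtained from the n selected points, so this sketch uses the best-fit hyperplane through them (their centroid plus the direction of least variance, obtained via an SVD); `G_pos` and `G_neg` are the groups built as in the MM sketch, and all names are assumptions for illustration.

```python
import numpy as np

def best_fit_hyperplane(points):
    """Return (normal v, offset c) of a plane v.x = c fitted through the points."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    v = vt[-1]                                   # direction of least variance
    return v, float(v @ centroid)

def pnn_candidate(points_same, points_opposite, w, b, n=3):
    """Build one PNN candidate (H+* or H-*) from a group and its opposite group."""
    pts = points_same[np.argsort(np.abs(points_same @ w - b))[:n]]
    v, c = best_fit_hyperplane(pts)              # H+ (or H-)
    dists = np.abs(points_opposite @ v - c)
    x_j = points_opposite[int(np.argmin(dists))] # nearest point of the opposite group
    c_parallel = float(v @ x_j)                  # H+' through x_j, parallel to H+
    return v, 0.5 * (c + c_parallel)             # bisector hyperplane H+*
```

Calling `pnn_candidate(G_pos, G_neg, w, b)` and `pnn_candidate(G_neg, G_pos, w, b)` produces the two candidates \(H_+^*\) and \(H_-^*\) that are then evaluated on the training set.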

3.3 Mixed strategy

The Mixed strategy starts with a run of the C-Mantec algorithm with the standard set of parameters. Then, considering each hyperplane generated by the C-Mantec method individually, the MM and PNN strategies are computed and their results are compared. The option with the best accuracy on the training set is selected to replace the C-Mantec hyperplane. We note that the application of the Mixed method can leave the C-Mantec architecture unchanged when the computed strategies do not improve the accuracy values for any hyperplane.
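A compact sketch of this selection logic is given below; `candidates_fn` is assumed to return the MM and PNN candidates for a given hyperplane, and `accuracy` is an assumed helper that evaluates the training accuracy of the whole network with one hyperplane replaced.

```python
def mixed_strategy(hyperplanes, candidates_fn, accuracy):
    """hyperplanes: list of (w, b); candidates_fn(w, b) -> list of candidate (w, b);
    accuracy(hyperplanes, idx, h): training accuracy with hyperplane idx replaced by h."""
    new_hyperplanes = []
    for idx, (w, b) in enumerate(hyperplanes):
        options = [(w, b)] + list(candidates_fn(w, b))   # original, MM, PNN
        best = max(options, key=lambda h: accuracy(hyperplanes, idx, h))
        new_hyperplanes.append(best)                     # keep original if no gain
    return new_hyperplanes
```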

3.4 Simulations and results

We have performed exhaustive simulations to find the best set of hyperplanes by applying the MM and PNN strategies, or a combination of them named the Mixed strategy, to modify each hyperplane resulting from the application of the C-Mantec algorithm. In order to carry out the experiments, we considered a data partition scheme with ten random folds, taking nine of them for training and the remaining one for testing. Experiments and analysis were carried out in MATLAB, version 2012b. For further comparison, simulations were also carried out for a standard support vector machine (SVM) with polynomial kernels run with standard settings under MATLAB.
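The experiments themselves were run in MATLAB; purely as an illustration of the evaluation protocol (ten random folds, nine for training and one for testing, plus a polynomial-kernel SVM baseline with default settings), a rough scikit-learn analogue could look as follows.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def svm_baseline_cv(X, y, seed=0):
    """Mean test accuracy of a polynomial-kernel SVM over ten random folds."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                     random_state=seed).split(X):
        model = SVC(kernel='poly')               # polynomial kernel, default settings
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```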

The data and simulation results are summarized in Table 3. The first two columns indicate the name and the number of input variables of each problem. The third and fourth columns report the generalization ability obtained from the application of the standard C-Mantec algorithm together with the final training accuracy, which is always 1 for standard C-Mantec as it learns up to perfect training using the elimination procedure for patterns considered as noise. The following six columns correspond to the accuracy on test examples and training patterns using the MM and PNN strategies and an SVM model used for comparison. The last five columns show the results for the Mixed strategy, where the values in the last three columns correspond to the percentage of hyperplanes of each type used by the Mixed strategy.

A difference in the generalization ability on the test patterns can be appreciated in Table 3, where we can see an improvement in these values for the majority of the problems considered when the MM strategy is applied, while the PNN strategy in isolation yields the worst results. As might be expected, the prediction ability obtained with the Mixed strategy shows an interesting improvement in a high proportion of the considered data sets.

A more detailed analysis of Table 3 reveals an effective performance of the C-Mantec algorithm for data of low and medium complexity (highest generalization values on the test patterns). As the complexity of the data grows, the PNN technique is selected more often than the MM technique. Regarding the comparison with the values obtained from an SVM, we see that the SVM leads to slightly larger prediction values that are very similar to those that can be obtained using the Mixed strategy.

One interesting aspect of the comparison carried out between C-Mantec, back-propagation and SVMs is their possible hardware implementation in microcontrollers and embedded systems such as FPGAs. In previous works [17, 19], we implemented both C-Mantec and the back-propagation algorithm in FPGAs and microcontrollers, and the comparison shows that C-Mantec needs fewer hardware resources than back-propagation (considering the same architecture) due to the different transfer functions they use [20]. Regarding the comparison between C-Mantec and SVM, based on the published bibliography [15] and our own experience, the situation looks similar, and even more favorable to C-Mantec given that SVMs usually require a much larger number of kernel units than C-Mantec.

4 Conclusions and future work

In this work we have analyzed the functioning of the C-Mantec constructive neural network algorithm, carrying out a thorough study of how the training of the patterns is performed, a problem known as the loading problem, which is of interest as it is related to the generalization ability that can be obtained. The analysis in Sect. 2 shows that C-Mantec has a loading distribution among the neurons in the hidden layer that is more unbalanced than that obtained for the well-known back-propagation algorithm, with a standard deviation of 0.17. A more detailed analysis of every individual case shows that in most cases the deviation values were lower (below 0.05), but in some cases in which only very few neurons were needed the deviation values were higher, with several cases around 0.35, noting that even in these cases the observed generalization ability was still high. This part of the study suggests that for complex problems, where many neurons are needed, the loading distribution is quite balanced, as expected, because C-Mantec permits all neurons to learn simultaneously. In the second part of this work, we analyzed different alternative optimization strategies for the separating hyperplanes of the hidden neurons of the C-Mantec algorithm, finding that on average only a relatively small increase in performance can be obtained, but that for complex problems (for which we used the obtained generalization as an indication) some more benefit can be observed, mainly by using the PNN (parallel to nearest neighbor) strategy.

As an overall conclusion, we can say that the standard version of the C-Mantec algorithm performs quite well and that, in order to obtain larger accuracies like the ones observed in recent years through the use of DL models, it might be necessary to include further layers rather than optimizing the current model, as no large improvements have been found with the analyzed alternatives. Nevertheless, one important point observed before and confirmed in this work is that the overall performance of C-Mantec is very competitive in comparison with alternative machine learning models, such as standard back-propagation and SVM, noting that a big advantage of the C-Mantec models is the lower number of units present in the generated models, which makes them ideal for implementation in limited-resource hardware such as micro-controllers and embedded systems.

Table 3 Generalization ability obtained in a set of benchmark data sets for C-Mantec (CM) and three alternative hyperplane optimization solutions (MM, PNN, Mixed)