1 Introduction

Back-propagation is now the most widely used tool in the field of artificial neural networks (NN). However, despite its general success, several major deficiencies still need to be addressed. The major disadvantage of back-propagation is the slow rate of convergence of the net output error [27, 38]. This is especially problematic in class imbalance problems [3, 35], and it is often the cause of the poor classification performance of the NN.

The class imbalance problem occurs when, in a classification problem, there are many more samples of some classes than of others [13]. This problem exists in many real-world domains, such as spotting unreliable telecommunications customers, detection of oil spills in satellite radar images, detection of fraudulent telephone calls, information retrieval and filtering tasks, and so on [24].

Much research has been done in addressing the class imbalance problem [21, 39]. For back-propagation in “batch mode” [19], the use of a cost function to deal with the class imbalance problem is very popular (e.g., see Refs. [2, 6, 25, 26, 31, 35]). In these approaches, the basic idea is to modify the error function of back-propagation by introducing different costs associated with making errors in different classes.

In “sequential” back-propagation (which estimates the error on individual training samples [19]), a common practice is to apply re-sampling techniques to the original training dataset, either by over-sampling or under-sampling or both. The re-sampling methods are the most researched because they are independent of the underlying classifier and can be easily implemented for any problem [32].

The simplest method to increase the size of the minority class is random over-sampling, a non-heuristic method that balances the class distribution through the random replication of positive examples [21, 23]. Nevertheless, since this method replicates existing examples in the minority class, overfitting is more likely to occur.

Other over-sampling methods based on heuristic techniques have been proposed. Chawla et al. [7] propose the synthetic minority over-sampling technique (SMOTE), which generates new synthetic minority samples by interpolating between several preexisting positive examples that lie close together. In Ref. [17] the Borderline-SMOTE was presented, which only over-samples the borderline samples of the minority class. Adaptive synthetic sampling (ADASYN) was proposed as a technique that uses a systematic method for adaptively creating different amounts of synthetic data according to their distributions [20]. García et al. [16] use surrounding neighborhood approaches to generate artificial minority examples, taking both the proximity and the spatial distribution of the examples into account.

On the other hand, random under-sampling is the most popular technique of this nature; it aims to balance the dataset through the random removal of negative examples. Despite its simplicity, it has empirically been shown to be one of the most effective re-sampling methods [23]. However, several works [9, 21, 36] agree that random under-sampling is weakened in multi-class scenarios and can cause a great performance reduction on the majority classes. Many other under-sampling proposals are based on a more intelligent selection of the negative examples to be eliminated [21], for example Tomek links, the nearest neighbor rule (NNR), the condensed NNR [4], Gabriel graphs [2], genetic algorithms [15], and so on. It is also common to combine over- and under-sampling methods [2, 4, 8, 10].

As stated by several authors, the re-sampling methods entail important criticisms and limitations:

  1. How to automatically discover the proper amount of sampling (sampling rate)? [8].

  2. In severe class imbalance problems, over-sampling methods modify the data set probability distribution, cause longer training times and incur a high computational cost in terms of memory [9, 13].

  3. The under-sampling techniques involve a loss of information, which can be detrimental to the classifier performance [9, 21, 36].

The present paper focuses on the first and second points, with the aim of determining a proper over-sampling rate without sacrificing the performance on the minority classes, while also reducing the training time and the computational cost. In brief, we propose a dynamic method that allows the efficient use of an over-sampling strategy on severe multi-class imbalanced problems. The method is based on the back-propagation mean square error (MSE) for automatically identifying the over-sampling rate. More specifically, the main contributions of this paper are:

  1. To deal with severe multi-class imbalance problems, which have been less investigated [36].

  2. To provide an efficient way of over-sampling minority classes on highly imbalanced multi-class problems.

  3. A simple method for automatically finding the over-sampling rate, which does not require additional free parameters and is very easy to implement.

The rest of this paper is organized as follows. Related works are briefly reviewed in Sect. 2. In Sect. 3 we introduce the proposed method for tackling the multi-class imbalance problem, and Sects. 4 and 5 present the experimental setup and results, respectively. Finally, Sect. 6 offers concluding remarks.

2 Related Works

A common practice for dealing with class imbalanced data sets is to re-balance them artificially through re-sampling techniques [21]. However, an ongoing concern in data mining and machine learning research has been to improve the re-sampling methods [8, 34]. Some efforts have addressed one of their main criticisms: finding the proper over- or under-sampling rate.

Fernández Navarro et al. [13, 14] present a dynamic over-sampling algorithm to deal with multi-class imbalance problems on Radial Basis Function and Multilayer Perceptron networks, respectively. In that approach, the data set is modified in two stages. In the first stage, the data set is preprocessed with SMOTE to reduce the class imbalance ratio. The number of samples created by SMOTE in this stage is less than \(1/2*J\), where \(J\) is the number of classes. In the second stage, a memetic algorithm [29] is used to obtain the best parameters for the NN. Next, the SMOTE algorithm is applied to the class with the minimum sensitivity to decrease the imbalance problem. This stage runs until a stopping condition is met. However, the authors do not report the class imbalance ratio resulting from this process or its computational efficiency.

Chawla et al. [8] propose a wrapper paradigm that discovers the amount of re-sampling for a data set based on optimizing evaluation functions such as the f-measure, the area under the ROC curve (AUROC), cost, cost curves, and the cost-dependent f-measure. The base classifiers were C4.5 and RIPPER. To discover the re-sampling amounts, five-fold cross validation is applied to the training set, and two stages are run. First, the wrapper finds the under-sampling percentages for the dataset. The process consists of decreasing the majority class to improve the minority classification performance without sacrificing performance on the majority class. In the second stage, the amount of SMOTE is incremented, and it is evaluated whether the performance increases with the new SMOTE amount. This process repeats, greedily, until no performance gains are observed. Once the re-sampling amounts are obtained, they are used to re-sample the original training dataset and the classifier is trained again. The results presented by the authors demonstrate the effectiveness of the generalization performance of the proposed method. However, the over-sampling amounts reported for some datasets are much higher than those needed to fully balance the dataset; therefore, we consider that the wrapper paradigm might not be efficient for NNs. In this respect, a very similar work is presented by Debowski et al. [10]; however, unlike Chawla et al. [8], it shows important weaknesses. For example, the stopping condition is not clear and we consider that the experimental framework is limited.

Refs. [30, 33] present the snowball method to deal with the class imbalance problem. It is a dynamic over-sampling training method for NNs. The basic idea is to first train the NN only with the examples of the minority class. Next, a dynamic training is used that includes all examples of the minority class and gradually increases the number of majority class examples in the training. In this way, the effect of undoing the learning of the minority class examples can be greatly reduced. However, the authors' results contradict each other: in Ref. [30], Murphey reports that, with back-propagation, the classification accuracy over the minority class increases at the price of a dramatic decrease in classification accuracy over the majority class, whereas Ou et al. [33] show that the snowball method gives the best performance without any loss on the majority classes.

A similar work is presented by Bo-Yu Li [28]. The basic idea is to train the NN with a dynamic threshold learning algorithm. This method uses multiple dynamic threshold parameters to gradually remove training samples that can already be classified correctly by the NN, thereby balancing the classes and improving the classification performance over the minority classes. However, as in other works [2, 30], the cost of improving the performance on the minority classes is a loss of effectiveness of the NN on the majority classes.

We propose a method that differs from previous works in its approach: it seeks to reduce the NN training time as much as possible when training with multi-class imbalanced datasets, without a significant loss of classification performance, thus addressing one of the major disadvantages of back-propagation (the long NN training time). This is very important in scenarios where the NN is trained on datasets of considerable size with highly imbalanced classes, for example in the classification of hyperspectral remote sensing images, where over-sampling strategies may increase the training time excessively and under-sampling may be inapplicable due to the considerable loss of information.

3 Proposed Method for Dealing with Multi-class Imbalance Problems

It is well known that in the back-propagation algorithm the class imbalance problem generates unequal contributions to the mean square error (MSE) in the training phase [2, 3]. As a result, the training process becomes slow and takes a long time to converge to the expected solution.

The main problem is that the majority classes make the largest contribution to the MSE. Consider the following: given a training dataset (TDS) with two classes (\(J=2\)) such that \(Q=\sum _{j}^{J}{Q_{j}}\), where \(Q_{j}\) is the number of samples from class \(j\), and supposing that the MSE by class can be expressed as

$$\begin{aligned} E_{j}(U)=\frac{1}{Q}\sum _{i=1}^{Q_{j}} \sum _{p=1}^{J}(t_{p}^{i}-z_{p}^{i})^{2}\,, \end{aligned}$$
(1)

where \(t_{p}^{i}\) is the desired output and \(z_{p}^{i}\) is the actual output of the network for the sample \(i\). Then the overall MSE can be expressed as

$$\begin{aligned} E(U)=\sum _{j=1}^{J}E_{j}(U) = E_{1}(U)+E_{2}(U)\,. \end{aligned}$$
(2)

If \(Q_1 \ll Q_2\) then \(E_1(U) \ll E_2(U)\) and \(\Vert \nabla E_1(U)\Vert \ll \Vert \nabla E_2(U)\Vert \), where the operator \(\nabla \) denotes the gradient of the error function. Consequently, \(\nabla E(U) \approx \nabla E_2(U)\). So, \(-\nabla E(U)\) is not always the best direction to minimize the MSE in both classes [3].
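To make the argument concrete, the following toy sketch (our own illustration, with hypothetical class sizes and error values, not data from the paper) evaluates the per-class contributions of Eqs. 1 and 2 for a severely imbalanced two-class problem:

```python
import numpy as np

# Toy illustration of Eqs. (1)-(2): per-class contributions to the MSE.
# Class sizes and error magnitudes are hypothetical, chosen only to
# mimic a severely imbalanced two-class problem.
rng = np.random.default_rng(0)
Q1, Q2 = 50, 5000                 # minority vs. majority class sizes (Q1 << Q2)
Q = Q1 + Q2

# Assume every sample contributes a comparable squared error (around 0.25)
# before the network has learned anything useful.
sq_err = rng.uniform(0.2, 0.3, Q)

# E_j(U) = (1/Q) * sum of squared errors over the samples of class j
E1 = sq_err[:Q1].sum() / Q
E2 = sq_err[Q1:].sum() / Q

print(f"E1 (minority) = {E1:.4f}")   # roughly 0.0025
print(f"E2 (majority) = {E2:.4f}")   # roughly 0.2475
print(f"E  = E1 + E2  = {E1 + E2:.4f}")
# E2 dominates the overall MSE, so -grad E points mostly in the direction
# that reduces the majority-class error, as argued in the text.
```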

In batch-mode back-propagation, it is common practice to balance the MSE by including a cost function \(\gamma (j)\), i.e., \(\gamma (1) \Vert \nabla E_{1}(U)\Vert \approx \gamma (2) \Vert \nabla E_{2}(U)\Vert \) [2, 6, 25, 26, 31, 35]. Nevertheless, in the sequential mode of back-propagation this is not a trivial task. For this reason, re-sampling techniques are commonly used to balance the MSE during the training process.

In this work we propose a dynamic over-sampling technique to balance the MSE in the training stage when a multi-class imbalanced dataset is used. The proposed method consists of two steps:

  1. Before training: the TDS is fully balanced (100 %) by means of an effective over-sampling technique; in this work we use SMOTE [7].

  2. During training: the MSE by class (\(E_j\)) is used to determine the number of samples per class (class ratio) to be presented to the NN. The equation employed to obtain the class ratio is defined as

     $$\begin{aligned} \textit{ratio}_j = \frac{E_{max}}{E_j}*\frac{Q_j}{Q_{max}}; \quad \textit{for} \, \, j=1, 2, ...,J \, , \end{aligned}$$
     (3)

     where \(J\) is the number of classes in the dataset and the subscript \(max\) identifies the largest majority class. Equation 3 balances the MSE by class, reducing the impact of the class imbalance problem on the NN. Algorithm 1 shows the implementation of this step.

Algorithm 1
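As a rough orientation only, the sketch below illustrates how the per-epoch sampling step based on Eq. 3 could be implemented. It is not the authors' exact Algorithm 1: details such as whether \(Q_j\) refers to the original or the SMOTE-balanced class sizes, and the clipping of the ratio to \([0,1]\), are our own assumptions, and the helper names (`class_ratios`, `select_epoch_samples`) are illustrative.

```python
import numpy as np

def class_ratios(E, Q):
    """Eq. (3): ratio_j = (E_max / E_j) * (Q_j / Q_max), where the index
    'max' refers to the largest majority class."""
    E = np.maximum(np.asarray(E, dtype=float), 1e-12)   # guard against E_j = 0
    Q = np.asarray(Q, dtype=float)
    m = int(np.argmax(Q))                               # largest majority class
    return (E[m] / E) * (Q / Q[m])

def select_epoch_samples(X, y, E, rng):
    """Choose, for each class, a fraction ratio_j of its samples from the
    (already SMOTE-balanced) training set for the next epoch."""
    classes, counts = np.unique(y, return_counts=True)
    ratios = np.clip(class_ratios(E, counts), 0.0, 1.0)  # clipping is our assumption
    idx = []
    for c, q, r in zip(classes, counts, ratios):
        members = np.flatnonzero(y == c)
        n = max(1, int(round(r * q)))                    # keep at least one sample
        idx.append(rng.choice(members, size=n, replace=False))
    return np.concatenate(idx)

# Hypothetical use inside a sequential back-propagation loop:
#   for epoch in range(max_epochs):
#       subset = select_epoch_samples(X_bal, y_bal, E_per_class, rng)
#       E_per_class = train_one_epoch(net, X_bal[subset], y_bal[subset])
```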

The main peculiarity of our proposed method is that the training stage uses only the samples necessary to deal with the class imbalance problem; in this way, poor classification performance of the NN over the minority classes is avoided and the NN training time is not increased excessively.

The proposed method (detailed in Algorithm 1) has the following advantages:

  1. It is a simple method with a single classifier.

  2. It does not need more free parameters than the standard back-propagation.

  3. It only uses the samples necessary in the training stage to keep the MSE by class relatively balanced.

4 Experimental Setup

In order to evaluate the validity and performance of the proposed technique, we have carried out thorough experiments on several multi-class imbalanced data sets. In this section, we describe the techniques, data sets and experimental framework used in the paper.

4.1 Re-sampling Methods

The class imbalance problem has been addressed by re-sampling techniques, which artificially balance the original data set, either by over-sampling the minority class or under-sampling the majority class, or both. In this work, we have used the renowned over-sampling technique SMOTE, proposed by Chawla et al. [7]. This method generates artificial examples of the minority class by interpolating existing instances that lie close together. For each minority sample, it finds the \(k\) intra-class nearest neighbors, and then synthetic samples are generated in the direction of some or all of those nearest neighbors. As reported in the paper by Chawla et al. [7], in our experiments the \(k\) value has been set to five nearest neighbors, bearing in mind that the aim of the present study is not to find the optimal \(k\) value. Besides, a constant \(k\) value makes it easier to interpret the results in terms of our proposal.
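For reference, the following simplified sketch (our own code, not the implementation of Ref. [7]) shows the interpolation step of SMOTE with \(k=5\) intra-class nearest neighbours; it assumes the minority class has more than \(k\) samples.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Simplified SMOTE: interpolate between a minority sample and one of
    its k intra-class nearest neighbours (Euclidean distance)."""
    if rng is None:
        rng = np.random.default_rng()
    X_min = np.asarray(X_min, dtype=float)

    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest intra-class neighbours

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        p = rng.integers(len(X_min))            # base minority sample
        q = rng.choice(neighbours[p])           # one of its k neighbours
        gap = rng.random()                      # interpolation factor in [0, 1)
        synthetic[i] = X_min[p] + gap * (X_min[q] - X_min[p])
    return synthetic
```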

In addition, we decided to add examples until a balanced distribution was reached. This decision was motivated by two aims: (a) simplicity (to avoid the use of many free parameters) and (b) effectiveness. Results obtained with other classifiers [37] have shown that, when AUC is used as a performance measure, the best class distribution for learning tends to be near the balanced class distribution.

Although the class imbalance problem has been claimed to be the main factor that significantly degrades the performance of classifiers, several studies have pointed out that the degradation is also related to other factors such as small disjuncts, high dimensionality and class overlapping [5]. In order to handle both class imbalance and class overlapping, we have jointly used an over-sampling and a data cleaning method, with the double aim of balancing the skewed classes and removing erroneous and harmful majority examples. Specifically, in this work we have chosen the aforementioned SMOTE and the Gabriel graph editing (GGE) techniques, which have been shown to be suitable to deal with these two issues for the back-propagation learning procedure [2].
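The Gabriel graph editing step can be sketched as follows. This is a generic, brute-force illustration of graph-based editing (quadratic in memory, cubic in time), and the exact editing rule used in Ref. [2] may differ from the majority-vote rule assumed here.

```python
import numpy as np

def gabriel_neighbours(X):
    """Boolean adjacency of the Gabriel graph: p and q are neighbours iff
    no third point r satisfies d(p,q)^2 > d(p,r)^2 + d(q,r)^2."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    n = len(X)
    adj = np.zeros((n, n), dtype=bool)
    for p in range(n):
        for q in range(p + 1, n):
            others = [r for r in range(n) if r != p and r != q]
            if all(d2[p, q] <= d2[p, r] + d2[q, r] for r in others):
                adj[p, q] = adj[q, p] = True
    return adj

def gge_edit(X, y):
    """Graph-based editing: drop samples whose Gabriel neighbours are
    mostly from another class (one common editing rule)."""
    y = np.asarray(y)
    adj = gabriel_neighbours(X)
    keep = []
    for i in range(len(y)):
        neigh = np.flatnonzero(adj[i])
        if len(neigh) == 0 or (y[neigh] == y[i]).mean() >= 0.5:
            keep.append(i)
    return np.asarray(keep)   # indices of retained samples
```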

4.2 Description of the Experimental Data Sets

Five real-world remote sensing data sets were selected to test our proposal, the MSE back-propagation over-sampling (MSEBPOS) technique. The Cayo data set comes from a particular region in the Gulf of Mexico [2]. The Feltwell data set represents an agricultural area near the village of Feltwell (UK) [6]. The Satimage and Segment data sets are from the UCI Machine Learning Database Repository [1]. The 92AV3C data set corresponds to a hyperspectral image (145 \(\times \) 145 pixels, 220 bands, 17 classes) taken over Northwestern Indiana's Indian Pines by the AVIRIS sensor. In this work, we employed a reduced version of this data set with six classes (2, 3, 4, 6, 7 and 8) and 38 attributes, as in [2].

As we are interested in analyzing the proposed technique on highly imbalanced multi-class data sets, each original data set was altered by combining and/or reducing the size of some classes in order to construct fifteen multi-class data sets with diverse class distributions. Table 1 reports a summary of the original classes that were joined to shape the majority and minority classes. The third and fourth columns indicate the original and final classes, respectively. The numbers between parentheses represent the classes that were joined to shape the majority classes. For example, for the MCAA subset, classes 1, 3, 6, 7 and 10 of the original database (Cayo) were joined to form its first majority class, and classes 8, 9 and 11 to form its second majority class. The result of this process is a subset (MCAA) of Cayo with five classes: two majority and three minority classes. The main difference between the subsets obtained from the same original database lies in the classes merged to shape their majority classes; for example, the difference between MCAB and MCAC is that for MCAB class 4 is part of one of its majority classes, whereas in MCAC class 4 is a minority class.

Table 1 Detail of the class merging process used to obtain highly imbalanced data sets

The process described above was performed on all datasets used in this work (Cayo, Feltwell, Satimage, Segment and 92AV3C), and random under-sampling was employed to further reduce the minority classes. The main characteristics of the new benchmarking data sets are shown in Table 2.

Table 2 A brief summary of some characteristics of the data sets used in the experimental stage

4.3 Experimental Framework

An empirical comparison between the MSE back-propagation over-sampling (MSEBPOS) technique proposed here and other re-sampling strategies was performed over a total of fifteen data sets by using the multi-layer perceptron (MLP) neural network trained with the sequential back-propagation (SBP) algorithm. A stratified ten-fold cross validation was adopted for the present study. For each fold, nine parts were pooled as the training data, and the remaining block was employed as an independent test set. All the training sets were preprocessed by the MSEBPOS technique and the original SMOTE. Apart from these methods, two other variants in combination with the GGE were included in the study: MSEBPOS+GGE and SMOTE+GGE. It is worth mentioning that the original data sets and those re-sampled by SMOTE and SMOTE+GGE were used to build the prediction model with the non-modified SBP algorithm.

In the training process, for both the SBP and MSEBPOS, the weights were randomly initialized ten times. Therefore, the results from classifying the test samples were averaged over the ten runs and the ten different weight initializations. The learning rate (\(\eta \)) was set to \(0.1\) and the stopping criterion was established as 5,000 epochs or an MSE value lower than \(0.001\). A single hidden layer was used, where for each data set the number of neurons was obtained by a trial-and-error strategy: Cayo \(= 7\), Feltwell \(= 6\), Satimage \(= 12\), Segment \(= 10\) and 92AV3C \(=10\).
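For completeness, a minimal numpy skeleton of the sequential back-propagation configuration described above (single hidden layer, \(\eta = 0.1\), stopping at 5,000 epochs or MSE \(< 0.001\)) could look as follows. The sigmoid activations, the weight initialization range and the omission of bias terms and of the MSEBPOS sampling step are our own simplifications.

```python
import numpy as np

def train_sbp(X, y_onehot, n_hidden, eta=0.1, max_epochs=5000, mse_goal=1e-3, seed=0):
    """Minimal sequential (pattern-by-pattern) back-propagation for a
    single-hidden-layer MLP with sigmoid units. Illustrative only."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], y_onehot.shape[1]
    W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
    W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out))
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))

    for epoch in range(max_epochs):
        sq_err = 0.0
        for i in rng.permutation(len(X)):        # one pattern at a time
            h = sig(X[i] @ W1)                   # hidden activations
            z = sig(h @ W2)                      # network outputs
            e = y_onehot[i] - z
            sq_err += (e ** 2).sum()
            delta_out = e * z * (1 - z)
            delta_hid = (delta_out @ W2.T) * h * (1 - h)
            W2 += eta * np.outer(h, delta_out)   # on-line weight updates
            W1 += eta * np.outer(X[i], delta_hid)
        if sq_err / len(X) < mse_goal:           # stopping rule quoted in the text
            break
    return W1, W2
```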

4.4 Performance Evaluation

Several empirical and theoretical studies have shown that the plain accuracy and/or error rates are strongly biased with respect to data imbalance, which might produce misleading conclusions [21, 31]. In order to face this shortcoming, alternative performance evaluation measures have been proposed. One of the most widely used graphical evaluation methods is the receiver operating characteristic (ROC) curve, which is a tool for visualizing, organizing and selecting binary classifiers based on their trade-offs between true positive rates and false positive rates [12]. A quantitative representation is the area under the ROC curve (AUC), which summarizes the quality of the classifier. In problems with more than two classes, the AUC can be defined as [18]:

$$\begin{aligned} AUC= \frac{2}{\Vert J\Vert (\Vert J\Vert -1)} \sum _{{j_i,j_k}\epsilon J} AUC_R(j_i,j_k) \, \, , \end{aligned}$$
(4)

where AUC\(_R(j_i,j_k)\) is the area under the curve for each pair of classes \(j_i\) and \(j_k\).
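A compact sketch of Eq. 4 in the spirit of Hand and Till [18] is given below; the rank-based `pairwise_auc` helper and the choice of using only class \(j_i\)'s score within each pair are our own simplifications (ties are not handled).

```python
import numpy as np
from itertools import combinations

def pairwise_auc(scores_pos, scores_neg):
    """AUC for one class pair via the rank-sum (Mann-Whitney) statistic."""
    s = np.concatenate([scores_pos, scores_neg])
    ranks = np.argsort(np.argsort(s)) + 1.0          # plain ranks, ties ignored
    n_pos, n_neg = len(scores_pos), len(scores_neg)
    r_pos = ranks[:n_pos].sum()
    return (r_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def multiclass_auc(proba, y, classes):
    """Eq. (4): average of the pairwise AUCs over all unordered class pairs."""
    pairs = list(combinations(range(len(classes)), 2))
    total = 0.0
    for a, b in pairs:
        mask = (y == classes[a]) | (y == classes[b])
        s = proba[mask, a]                            # score = P(class j_i), pair only
        yy = y[mask]
        total += pairwise_auc(s[yy == classes[a]], s[yy == classes[b]])
    return total / len(pairs)
```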

4.5 Criteria for Evaluating Experimental Results

In this work, we have employed the Friedman test for evaluating the experimental results, with the aim of verifying the hypothesis of improved performance of the re-sampling techniques used here. It is a non-parametric statistical test that performs multiple comparisons among the algorithms considered over a collection of data sets [11]. The procedure starts by computing the ranks of the algorithms or strategies for each dataset separately, where the best performing algorithm gets rank 1, the second best rank 2, and so on. In case of ties, average ranks are computed. Let \(r_i^j\) be the rank of the \(j\)-th of \(K\) algorithms on the \(i\)-th of \(N\) data sets. The next step is to obtain the average ranking for each algorithm, \(R_j= \frac{1}{N}\sum _i r_i^j\). Under the null hypothesis, which states that all algorithms behave similarly and therefore their ranks \(R_j\) should be equal, the Friedman statistic can be computed as follows:

$$\begin{aligned} \chi _F^2 = \frac{12N}{K(K+1)} \left( \sum _j R_j^2 - \frac{K(K+1)^2}{4}\right) \, \, . \end{aligned}$$
(5)

The \(\chi _F^2\) statistic is distributed according to the Chi-square distribution with \(K-1\) degrees of freedom, when \(N\) and \(K\) are big enough. Since \(\chi _F^2\) presents an undesirably conservative behavior, Iman and Davenport [22] devised a better statistic distributed according to the \(F\)-distribution with \(K-1\) and \((K-1)(N-1)\) degrees of freedom,

$$\begin{aligned} F_{F}=\frac{(N-1)\chi _{F}^2}{N(K-1)-\chi _{F}^2} \, \, . \end{aligned}$$
(6)

If the null hypothesis is rejected, we can use the Bonferroni-Dunn post-hoc test, which compares a control algorithm with the other \(K-1\) algorithms. The performance of two algorithms is significantly different if the corresponding difference of average ranks is at least as great as the critical difference,

$$\begin{aligned} { CD} = q_{\alpha } \sqrt{\frac{K (K+1)}{6N}} \, \, , \end{aligned}$$
(7)

where the \(q_{\alpha }\) value is based on the studentized range statistic divided by \(\sqrt{2}\) (Table 5(b) in [11]).
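The statistics of Eqs. 5–7 can be reproduced with a few lines of code, as in the generic sketch below (our own helper names; the critical value \(q_{\alpha }\) must still be taken from Table 5(b) in [11] and is passed in by the caller).

```python
import numpy as np
from scipy import stats

def friedman_iman_davenport(ranks):
    """ranks: (N datasets x K algorithms) matrix of Friedman ranks.
    Returns chi^2_F (Eq. 5), F_F (Eq. 6) and the p-value of F_F."""
    N, K = ranks.shape
    R = ranks.mean(axis=0)                        # average rank per algorithm
    chi2_F = 12 * N / (K * (K + 1)) * (np.sum(R ** 2) - K * (K + 1) ** 2 / 4)
    F_F = (N - 1) * chi2_F / (N * (K - 1) - chi2_F)
    p_value = stats.f.sf(F_F, K - 1, (K - 1) * (N - 1))
    return chi2_F, F_F, p_value

def bonferroni_dunn_cd(q_alpha, K, N):
    """Eq. (7): critical difference of the Bonferroni-Dunn post-hoc test."""
    return q_alpha * np.sqrt(K * (K + 1) / (6 * N))
```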

5 Results and Discussion

Table 3 reports the detailed AUC results for each problem and the average AUC values across all databases using the different strategies explored here: the imbalanced case, the MSE back-propagation over-sampling technique (MSEBPOS), SMOTE, MSEBPOS+GGE and SMOTE+GGE. The average ranks (Friedman scores) are also given. As expected, classification with the imbalanced data set (the non-preprocessed training set) yields the poorest AUC value. The two best performing algorithms correspond to the SMOTE and MSEBPOS techniques. This can be further confirmed by the average ranks, which provide a useful comparison of the algorithms: the imbalanced case has the highest average rank, while SMOTE and MSEBPOS were ranked with \(1.27\) and \(2.67\), respectively.

Table 3 Classification performance on fifteen data sets measured using \(AUC\) and average rank (AR)

In order to detect whether there exist statistical differences between the AUC results of the techniques studied, we employed the Iman-Davenport statistic. This computation produced \(F_F = 17.3745\), distributed according to the F distribution with 4 and 56 degrees of freedom. The \(p\)-value returned by using \(F(4,56)\) was \(25.41E-10\). As the \(p\)-value is lower than the significance level of \(\alpha =0.05\), the null hypothesis, which states that all algorithms explored here behave equally, can be rejected. Hence, we carried out a post-hoc statistical analysis using the Bonferroni-Dunn procedure to compare each strategy against the control classifier, which corresponds to the best strategy.

Figure 1 plots the average ranks for each strategy, sorted according to their ranks. The horizontal line represents the threshold of the critical difference value computed by the Bonferroni-Dunn test with \(\alpha =0.05\). This line is equal to the sum of the lowest rank \((1.27)\) and the CD value \((1.442)\). The bars (algorithms) above this cut line perform significantly worse than the best model. Observing the results in Fig. 1, we can see that only MSEBPOS behaves equally to SMOTE; the other three strategies perform significantly worse. Therefore, MSEBPOS appears to be a suitable and effective approach to deal with multi-class imbalance problems.

Fig. 1 Bonferroni-Dunn graphic for AUC

We have also analysed the final set size obtained by each algorithm. Table 4 reports the set size ratio for each database. It was computed as \(ratio_k = Q_k/Q_{SMOTE}\), where \(Q_k\) and \(Q_{SMOTE}\) are the sizes of the preprocessed data set and of the data set balanced by SMOTE, respectively. As can be observed, the re-sampling methods based on the MSE back-propagation over-sampling technique (MSEBPOS and MSEBPOS+GGE) achieve a set size ratio below \(0.53\), which means significant savings in computing time (see Table 5) and storage requirements when compared with the data set balanced by SMOTE. From Table 5, we can also see that the training time was remarkably reduced, to less than half that of SMOTE.

Table 4 The set size ratio of each dataset obtained by the strategies studied. It was computed taking SMOTE as reference: \(ratio_k = Q_k/Q_{ SMOTE}\), where \(Q_k\) and \(Q_{ SMOTE}\) are the sizes of the data set processed by strategy \(k\) and by SMOTE, respectively
Table 5 Time processing rate obtained for the analyzed strategies. It was computed taking SMOTE as reference: \(ratio_k = TT_k/TT_{ SMOTE}\), where \(TT_k\) and \(TT_{ SMOTE}\) are the training times (measured in minutes) of strategy \(k\) and SMOTE, respectively.

For the sake of a visual comparison, and with the aim of analyzing the performance of a re-sampling approach in terms of the AUC and the set size ratio, we have employed a scatter plot of the size ratio versus the AUC values by means of average ranks. Fig. 2 displays all the strategies studied (including the imbalanced data set), where the \(x\)-axis shows the average ranks of the AUC results and the \(y\)-axis shows the average ranks of the set size ratio. In such a plot, points close to the origin \((0,0)\) correspond to the best methods, with a well-balanced trade-off between performance and size complexity. Similarly, we have plotted the average ranks of the processing time versus the AUC values in Fig. 3.

Fig. 2 Set size ratio versus AUC plot using average ranks

Fig. 3 Time processing rate versus AUC plot using average ranks

From Figs. 2 and 3, one can observe that the MSEBPOS approach lies nearest to the origin of the plot, which suggests that this technique has the most suitable trade-off in terms of performance, size and processing time. MSEBPOS+GGE appears as the second algorithm with a good trade-off. In contrast, the SMOTE and imbalanced approaches lie furthest from the origin \((0,0)\).

6 Conclusions

This paper has proposed an MSE back-propagation over-sampling technique for learning from multi-class imbalanced data sets. The method is based upon the SBP algorithm. The aim of this alternative is to identify a suitable over-sampling rate, whilst reducing the processing time and storage requirements and keeping or increasing the performance of predictive models.

Experimental results over fifteen highly imbalanced multi-class data sets have demonstrated that the MSEBPOS algorithm achieves competitive results in terms of the AUC measure, processing time and storage requirements with respect to the original SMOTE technique. Also, an analysis with the Bonferroni-Dunn post-hoc test has shown that MSEBPOS behaves similarly to SMOTE. When visualizing the AUC results, the processing time rate and the set size ratio, we have found that the strategies based on MSEBPOS yield the most balanced trade-off among the three measures.

Future work will extend this study to find new mechanisms for identifying the most appropriate over-sampling ratio, allowing a significant improvement of the classification performance of the proposed method while keeping its advantage in terms of NN training time. On the other hand, it would be interesting in future work to generalize the proposed method, i.e., to balance the training dataset through an effective over-sampling technique and to use the MSE to automatically identify the optimal number of samples for both the minority and the majority classes, so that the resulting method could be considered a combined under- and over-sampling technique. In addition, we want to delve into the analysis of multi-class learning problems where the dataset is large and exhibits an extreme class imbalance.