
In Artificial Neural Networks (ANNs), the selection of connection weights is a key issue, and genetic algorithms and evolution strategies have been found to be promising approaches to this task. Motivated by that, this study investigates the applicability of two novel Multi-Objective Evolutionary Algorithms (MOEAs): Speed constrained Multi-Objective Particle Swarm Optimization (SMPSO) and Multi-Objective Differential Evolution Algorithm based on Decomposition with Dynamical Resource Allocation (MOEA/D-DE-DRA). ANNs are trained to learn data classification using sensitivity and specificity as objectives on different UCI databases. The results are compared using the Hypervolume quality indicator and statistical tests.

10.1 Introduction

Most training algorithms, such as Backpropagation (BP) and conjugate gradient algorithms, are based on gradient descent [15]. There have been many successful applications of BP in various areas, but BP has drawbacks due to the use of gradient descent. It often gets trapped in a local minimum of the error function and is incapable of finding a global minimum if the error function is multimodal and/or non-differentiable.

On the other hand, Evolutionary Algorithms (EAs) can help to avoid convergence to local minima by performing a global search when training an MLP. EAs can be used effectively to find a near-optimal set of connection weights without computing gradient information. The fitness of an ANN can be defined according to different needs. Moreover, learning the connection weights can be stated as a multi-objective task, and Multi-Objective Evolutionary Algorithms (MOEAs) can be used to solve it.

In this study, two different MOEAs are investigated: Speed constrained Multi-Objective Particle Swarm Optimization (SMPSO) [8] and Multi-Objective Differential Evolution Algorithm Based on Decomposition (MOEA/D-DE) [19] with Dynamical Resource Allocation (DRA - MOEA/D-DE-DRA) [20].

Speed constrained Multi-Objective Particle Swarm Optimization (SMPSO) is an optimization technique based on Particle Swarm Optimization (PSO). PSO, developed by Kennedy and Eberhart [8], is a population-based heuristic inspired by the social behavior of bird flocks searching for food. PSO has some similarities with evolutionary algorithms: both are initialized with a set of solutions, possibly random, and search for optima by updating generations. Despite these similarities, there are two main differences between them. First, there is no notion of offspring in PSO; the search is guided by the use of leaders. Second, PSO has no evolution operators such as crossover or mutation. In PSO, the set of possible solutions is a set of particles, called a swarm, that moves in the search space in a cooperative search procedure. These moves are performed by an operator that is guided by a local and a social component [9]. SMPSO is an extension of PSO for solving multi-objective problems. It is popular because it is easy to implement compared with other MOEAs.

Multi-Objective Differential Evolution Algorithm based on Decomposition (MOEA/D-DE) is an evolutionary algorithm that optimizes multi-objective problems using the idea of decomposition [19]. It decomposes the multi-objective problem into different sub-problems using scalar weight functions. The algorithm then solves these sub-problems simultaneously by evolving a population of solutions with differential evolution operators. In each generation, the population is composed of the best solution found so far for each sub-problem. The relations among sub-problems are set based on the distances between their weight vectors [19]. The MOEA/D-DE-DRA algorithm [20] uses the same concepts as MOEA/D [19], but the amount of computational effort devoted to each sub-problem is based on a utility function. MOEA/D-DRA is currently a state-of-the-art MOEA.

These two MOEAs are used to train ANNs to classify data. For this purpose, two fitness functions are used: sensitivity and specificity, which are directly related to the quality of the classification. An empirical evaluation is made using different UCI databases, and the comparison shows the effectiveness of these algorithms.

This work is structured as follows: Sect. 10.2 presents the basic concepts of ANNs (Sect. 10.2.1), Evolutionary Algorithms (Sect. 10.2.2), SMPSO (Sect. 10.2.2.1), MOEA/D-DE-DRA (Sect. 10.2.2.2), Hypervolume (Sect. 10.2.3) and the classification problem (Sect. 10.2.4); Sect. 10.3 describes the configuration of the experiments and the obtained results. Finally, Sect. 10.4 presents the conclusion and future work.

10.2 Elementary Concepts

In this section, we describe concepts of MLP, multi-objective optimization and the algorithms used in the study. Moreover, elementary concepts of classification are presented.

10.2.1 Artificial Neural Networks - ANN

Research on neural networks looks to the organization of the brain as a model for building intelligent machines. The human brain processes information in an entirely different way from a conventional digital computer [5]. The brain is a highly complex, non-linear and parallel computer. It has the ability to organize its structural components, known as neurons, in order to perform certain tasks, such as pattern recognition, perception and motor control, much faster than the fastest existing digital computer.

An ANN consists of a set of processing elements, also known as neurons or nodes, which are interconnected. It can be described as a directed graph in which each node i performs a transfer function \(f_{i}\) as described by Eq. 10.1

$$\begin{aligned} y_i = f_i\left( \sum _{j=1}^n (w_{ij} \cdot x_{j}) + bias\right) \end{aligned}$$
(10.1)

where \(y_{i}\) is the output of node i, \(x_{j}\) is the \(j\)th input to the node, and \(w_{ij}\) is the connection weight between nodes i and j. The bias acts as the threshold of the node. Usually, \(f_{i}\) is nonlinear, such as a Heaviside, sigmoid, or Gaussian function. Equation 10.2 shows the sigmoid function, where net is the weighted input sum plus bias of Eq. 10.1 and out is the node output.

$$\begin{aligned} out = \frac{1}{1+e^{-net}} \end{aligned}$$
(10.2)
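Equations 10.1 and 10.2 can be illustrated with a minimal Python sketch. The function names and the example weights are hypothetical and chosen only for illustration; they are not taken from the implementation used in this study.

```python
import math

def sigmoid(net):
    """Sigmoid activation of Eq. 10.2."""
    return 1.0 / (1.0 + math.exp(-net))

def neuron_output(weights, inputs, bias):
    """Output of one node as in Eq. 10.1: f(sum_j w_ij * x_j + bias)."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(net)

# Illustrative values: net = 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0
y = neuron_output([0.5, -0.25], [1.0, 2.0], 0.0)
```

With a net input of zero the sigmoid returns 0.5, the midpoint of its output range.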

A neural network topology represents the way in which neurons are connected to form a network. In other words, the topology can be seen as the relationship between the neurons by means of their connections. ANN topologies can be divided into feedforward (FFNN) and recurrent classes according to their connectivity. An ANN is feedforward if the information flow is unidirectional: a unit sends information to another unit from which it does not receive any information, and there are no feedback loops. Feedforward networks are used in pattern generation, recognition and classification. In recurrent ANNs, feedback loops are allowed. They are used in content-addressable memories.

Basically, there are two kinds of FFNN: the single-layer perceptron (SLP) and the multilayer perceptron (MLP). SLP networks consist of a single layer of output nodes, which are fed directly by the input layer via a set of weights. MLP networks consist of multiple layers: an input layer, one or more hidden layers and an output layer. Each layer has nodes, and each node is connected by a weighted link to every node in the subsequent layer. Figure 10.1 shows an illustration of an MLP.

Fig. 10.1 Representation of an ANN - MLP

The most important feature of an ANN is how its learning process occurs. According to Haykin [5], learning is a process in which the free parameters of a neural network are adjusted through stimulation by the environment in which the network is embedded.

In supervised learning, training is performed by presenting a large set of examples, called the training set, to the network. Each example consists of a set of inputs presented to the input layer and the respective set of desired outputs. Although training an ANN can be time-consuming, once this stage is successfully completed, the input–output mapping is evaluated almost instantaneously. However, care must be taken to use an adequate training set, representative of the sampling space. In many cases this is not feasible, and the sampling space must be restricted to a specific sub-domain. This means that ANNs are best applied to specific, well-defined problems [3].

When using an MLP to solve a problem, the first activity is to train the MLP. Training depends on the chosen initial weights and usually applies gradient learning algorithms to adapt the weight values. Among these algorithms, the error Backpropagation (BP) method [15] is one of the most used. In BP, the weight adjustment starts at the output nodes, where the measure of the error is available, and proceeds by propagating this error backwards through the previous layers. BP is based on gradient descent, which means that it is not guaranteed to find a global minimum and can get stuck in a local minimum, where it will stay indefinitely. Nevertheless, BP is popular and widely used for ANN training [17].

As an alternative, evolutionary algorithms can perform a global search within the weight space of a typical feedforward neural network (FFNN), which helps to escape local minima and enables adaptive selection of control parameters [7, 16].

10.2.2 Multi-Objective Evolutionary Algorithms - MOEAs

According to Yao [18], EAs can be used to perform a global search for a set of optimal (or near-optimal) connection weights without computing gradients. The error function can be defined according to the specific needs of the task. A commonly used component of the error function is the difference between the expected output and the actual output.

Two MOEAs are chosen for this study: MOEA/D-DE-DRA, a state-of-the-art MOEA, and SMPSO, which is easy to implement compared with other MOEAs.

10.2.2.1 SMPSO

Particle Swarm Optimization (PSO) is a stochastic meta-heuristic based on the movement of bird flocks looking for food, created to optimize nonlinear functions. In this method a swarm (population) of particles (solutions) moves across the search space (evolves) guided by personal and social leaders. A particle has two components: position and velocity. These components are updated at each generation.

Equations 10.3 and 10.4 present the rules for updating the velocity (\(v_{i}\)) and position (\(p_{i}\)) of a particle i. The first term of Eq. 10.3 is the inertia term, with inertia weight \(\omega \); the second term is a movement towards the personal best position \(pBest_{i}^{t}\); and the third term is a movement towards the global best position \(gBest_{i}^{t}\) (social term). The coefficients \(c_1\) and \(c_2\) weight the influence of the personal and global leaders, and \(r_1^t\) and \(r_2^t\) are random numbers.

To extend PSO to multi-objective problems and create a Multi-Objective Particle Swarm Optimization (MOPSO) algorithm [14], some modifications are needed. The first is the creation of an external archive (repository) to store the best (non-dominated) solutions found so far. Another modification concerns the leader selection scheme, which has to choose from a set of equally good leaders according to some criterion. As the number of non-dominated solutions may become very large, an archiving method is needed to prune the repository and keep only a predefined number of solutions, discarding some non-dominated solutions according to its own criterion.
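The external archive described above can be sketched as follows. This is a minimal illustration under the assumption that all objectives are maximized (as with sensitivity and specificity); the function names are hypothetical, and the pruning step that limits the archive size is deliberately left out.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def add_to_archive(archive, candidate):
    """Offer a candidate to the archive and return the updated archive.

    The candidate is rejected if an archived solution dominates or equals it;
    otherwise it is stored and every solution it dominates is removed.
    """
    if any(dominates(m, candidate) or m == candidate for m in archive):
        return archive
    return [m for m in archive if not dominates(candidate, m)] + [candidate]

# Illustrative (sensitivity, specificity) pairs, not results from this study.
archive = []
for point in [(0.6, 0.7), (0.8, 0.5), (0.5, 0.5), (0.9, 0.7)]:
    archive = add_to_archive(archive, point)
```

In this example the last point dominates everything stored before it, so the archive ends up holding that single solution.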

A MOPSO that has shown very good results in the literature is the Speed-constrained Multi-objective PSO (SMPSO) [11]. It was noted that under some conditions the velocity of the particles in a MOPSO can become too high, generating erratic movements towards the limits of the decision space. To avoid such situations, SMPSO uses a velocity constriction mechanism based on a factor \(\chi \) that depends on the values of the influence coefficients of the personal and global leaders (C1 and C2, respectively). In SMPSO, the (global) leader selection method uses a binary tournament based on the Crowding Distance metric from [2], and the archiving strategy also uses the Crowding Distance.

$$\begin{aligned} \displaystyle v_i^{t+1} = \overbrace{\omega \cdot v_i^{t}} ^{inertia} + \overbrace{c_1 \cdot r_1^t(pBest_i^t - p_i^t)}^{personal} + \overbrace{c_2 \cdot r_2^t(gBest_i^t - p_i^t)}^{social} \end{aligned}$$
(10.3)
$$\begin{aligned} \displaystyle p_i^{t+1} = p_i^t + v_i^{t+1} \end{aligned}$$
(10.4)
Algorithm 9 SMPSO pseudo-code

Algorithm 9 presents the pseudo-code of SMPSO. First, the swarm and the leaders archive (repository) are initialized and the evolutionary process begins. At each generation, for each particle in the population, the leaders are selected and then the speed and position are updated. Next, polynomial mutation is applied to each particle, and the particles are evaluated. Finally, the particles update the leaders archive. The output of SMPSO is the leaders archive.
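The update rules of Eqs. 10.3 and 10.4 can be sketched in a few lines of Python. This is only the plain PSO step: the constriction factor \(\chi \), the polynomial mutation and the leader selection of SMPSO are not included, and the default values of omega, c1 and c2 are illustrative assumptions rather than the settings of Table 10.4.

```python
import random

def update_particle(position, velocity, p_best, g_best,
                    omega=0.1, c1=1.5, c2=1.5, rng=random):
    """Apply Eq. 10.3 (velocity) and Eq. 10.4 (position) to one particle."""
    r1, r2 = rng.random(), rng.random()  # r_1^t and r_2^t drawn in [0, 1)
    new_velocity = [
        omega * v + c1 * r1 * (pb - p) + c2 * r2 * (gb - p)
        for v, p, pb, gb in zip(velocity, position, p_best, g_best)
    ]
    new_position = [p + v for p, v in zip(position, new_velocity)]
    return new_position, new_velocity

# When the particle already sits on both leaders, only inertia remains.
pos, vel = update_particle([0.0, 0.0], [1.0, -2.0],
                           p_best=[0.0, 0.0], g_best=[0.0, 0.0], omega=0.5)
```

The usage line shows the degenerate case in which the personal and social terms vanish, so the new velocity is simply the old one scaled by omega.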

10.2.2.2 MOEA/D-DE-DRA

Decomposition is another way to solve a problem with multiple objectives. MOEA/D-DE-DRA decomposes one multi-objective optimization problem (MOP) into many single-objective sub-problems.

There are two main components in MOEA/D. The first is the mechanism that decomposes the MOP into sub-problems. Normally, weight vectors are generated randomly and each one defines a sub-problem. The objective of each sub-problem is a (linear or nonlinear) weighted aggregation of all the individual objectives in the MOP.

The second main component is the neighborhood relations among these sub-problems. The neighborhood relations are defined based on the distances between their weight vectors. Each sub-problem (i.e., scalar aggregation function) is optimized in MOEA/D by using information from its neighboring sub-problems.

MOEA/D-DE with Dynamical Resource Allocation (DRA) is a version in which different amounts of computational effort are allocated to different sub-problems. In MOEA/D-DE-DRA, the version of MOEA/D used in this paper, the allocation is guided by a utility \(\pi _i\) computed for each sub-problem.

MOEA/D-DE and its variants can use any decomposition approach for defining their sub-problems. This work uses the Tchebycheff [20] approach. Using this decomposition method, each sub-problem can be formulated as in Eq. 10.5:

$$\begin{aligned} \displaystyle \min ~g^{te}(x \mid \lambda , z^*) = \max _{1 \le j \le M}~\{\lambda _j \, |f_j (x) - z^*_j|\}\\ \text {subject to}~x \in \varOmega \nonumber \end{aligned}$$
(10.5)

where \(g^{te}\) is the Tchebycheff function, \(f(x) = (f_{1}(x), \ldots , f_{M}(x))\) is the set of functions to be minimized, \(\lambda = (\lambda _{1}, \ldots , \lambda _{M})\) is the weight vector and \(z^{*}\) is the empirical ideal point.
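As an illustration of Eq. 10.5, the Tchebycheff value of a solution can be computed directly from its objective vector, a weight vector and the ideal point. The function name and the numbers below are hypothetical.

```python
def tchebycheff(objectives, weights, ideal):
    """Tchebycheff aggregation of Eq. 10.5: max_j lambda_j * |f_j(x) - z*_j|."""
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, ideal))

# Two minimized objectives, equal weights, ideal point at the origin:
# max(0.5 * 0.4, 0.5 * 0.2) = 0.2
value = tchebycheff([0.4, 0.2], [0.5, 0.5], [0.0, 0.0])
```

A smaller value means the solution is closer to the ideal point along the direction that the weight vector emphasizes.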

The sub-problems are evolved using Differential Evolution (DE) operators. DE uses a simple mutation operator based on differences between pairs of solutions (called vectors), with the aim of finding a search direction based on the distribution of solutions in the current population. DE also uses a steady-state-like replacement mechanism, in which the newly generated offspring (called the trial vector) competes only against its corresponding parent (the old object vector) and replaces it if the offspring has a higher fitness value.
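The DE step described above can be sketched with the classic rand/1 mutation and binomial crossover. This particular variant, the function names and the default values of the scale factor f and crossover rate cr are assumptions made for illustration; the text does not state which DE strategy or settings were used.

```python
import random

def de_rand_1_bin(parent, a, b, c, f=0.5, cr=1.0, rng=random):
    """Build a trial vector from a parent and three other vectors a, b, c.

    Each mutated coordinate is a[j] + f * (b[j] - c[j]); the remaining
    coordinates are copied from the parent.
    """
    n = len(parent)
    j_rand = rng.randrange(n)  # guarantees at least one mutated coordinate
    trial = []
    for j in range(n):
        if rng.random() < cr or j == j_rand:
            trial.append(a[j] + f * (b[j] - c[j]))
        else:
            trial.append(parent[j])
    return trial

def select(parent, trial, fitness):
    """One-to-one replacement: the trial wins only with a higher fitness."""
    return trial if fitness(trial) > fitness(parent) else parent

trial = de_rand_1_bin([5.0, 5.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0])
```

With cr set to 1.0 every coordinate is mutated, so the trial vector above is a + f * (b - c) in both dimensions.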

Algorithm 10 MOEA/D-DE-DRA pseudo-code

MOEA/D-DE-DRA is presented in Algorithm 10. Its first steps initialize various data structures, as in most MOEA/D variants. The weight vectors \(\lambda _{i}\), \(i = 1, \ldots , N\), representing coefficients associated with each objective, are generated using a uniform distribution. The neighborhood (\(B^{i} = \{i_{1}, \ldots , i_{C}\}\)) of weight vector \(\lambda _{i}\) stores the indexes of the C weight vectors closest to \(\lambda _{i}\). The initial population is randomly generated and evaluated. Each individual (\(x_{i}\)) is associated with the \(i\)th weight vector. The empirical ideal point (\(z^{*}\)) is initialized as the minimum value of each objective found in the initial population, and the generation counter (g) is set to 1.

After the initialization steps, the algorithm enters its main loop. The first step of the main loop is to determine which individuals from the population will be processed. A 10-tournament selection based on the utility value of each sub-problem (\(\pi _{i}\), calculated according to Eq. 10.6) is used to determine the individuals to evolve. Next, the scope used during the generation of the individual and the population update is randomly chosen. DE heuristics (mutation strategies and crossover) are applied to individuals randomly selected from the scope. In this work, the scope can switch between the neighborhood and the entire population: it is composed of the indexes of chromosomes from either the neighborhood \(B^{i}\) (with probability \(\delta \)) or the entire population (with probability \(1 - \delta \)). Based on the chosen strategy, a modified chromosome y is generated in step 19 and modified by polynomial mutation in step 20, generating \(y^{'} = (y_{1}^{'} , \ldots , y_{n}^{'})\) from y.

In step 22, if the new chromosome \(y^{'}\) has an objective value better than the value stored in the empirical ideal point, \(z^{*}\) is updated with this value. The next steps involve the population update process (steps 23–26), which is based on the comparison of the fitness of individuals. In the MOEA/D-DE framework, the fitness of an individual is measured according to a decomposition function. In this work the Tchebycheff function is used (Eq. 10.5). Depending on what was selected for the scope (steps 15 or 17), either the neighborhood or the entire population is updated.

To avoid the proliferation of \(y^{'}\) across a large part of the population, a maximum number of updates (NR) is used. The population update is as follows: while a new replacement may occur (i.e., while \(nr < NR\) and there are unselected indexes in the scope), a random index (k) from the scope is chosen. If \(y^{'}\) has a better Tchebycheff value than \(x_{k}\) (both using the \(k\)th weight vector, \(\lambda _{k}\)), then \(y^{'}\) replaces \(x_{k}\) and the number of updated chromosomes (nr) is incremented. If the current generation is a multiple of 50, the utility value of each sub-problem is updated using Eq. 10.6. The evolutionary process stops when the maximum number of evaluations is reached.
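The bounded population update can be sketched as below. The data layout (parallel lists for solutions, objective vectors and weight vectors) and all names are assumptions made for this illustration; "better" is taken to mean a smaller Tchebycheff value, in line with the minimization in Eq. 10.5.

```python
import random

def tchebycheff(objectives, weights, ideal):
    """Tchebycheff value as in Eq. 10.5 (smaller is better)."""
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, ideal))

def update_population(population, objective_values, weights, ideal,
                      scope, child, child_objectives, nr_max, rng=random):
    """Let the child replace at most nr_max solutions whose indexes are in scope.

    Returns the number of replacements that were made (nr).
    """
    indexes = list(scope)
    rng.shuffle(indexes)  # visit the scope in random order, each index once
    nr = 0
    for k in indexes:
        if nr >= nr_max:
            break
        old = tchebycheff(objective_values[k], weights[k], ideal)
        new = tchebycheff(child_objectives, weights[k], ideal)
        if new < old:
            population[k] = child
            objective_values[k] = child_objectives
            nr += 1
    return nr

# Illustrative data: the child beats all three solutions, but NR = 2.
population = [[0.0], [0.0], [0.0]]
objective_values = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
weights = [[0.5, 0.5], [0.2, 0.8], [0.8, 0.2]]
replaced = update_population(population, objective_values, weights, [0.0, 0.0],
                             scope=[0, 1, 2], child=[1.0],
                             child_objectives=[0.1, 0.1], nr_max=2)
```

Although the child is better for every sub-problem in the scope, only two copies enter the population, which is the purpose of the NR limit.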

$$\begin{aligned} \pi ^{i} = {\left\{ \begin{array}{ll}1, &{} \text {if } \varDelta ^{i} > 0.001 \\ (0.95 + 0.05 \cdot \varDelta ^{i}/0.001) \cdot \pi ^{i}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(10.6)
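The utility rule of Eq. 10.6 translates into a short function. The text does not define \(\varDelta ^{i}\); the sketch below assumes it is the relative improvement of the objective of sub-problem i since the previous utility update, and the function and parameter names are hypothetical.

```python
def update_utility(pi, delta, threshold=0.001):
    """Utility update of Eq. 10.6 for one sub-problem.

    pi is the current utility; delta is assumed to be the relative
    improvement of the sub-problem since the last update.
    """
    if delta > threshold:
        return 1.0
    return (0.95 + 0.05 * delta / threshold) * pi

stalled = update_utility(1.0, 0.0)      # no improvement: utility decays by 5%
improving = update_utility(0.8, 0.01)   # clear improvement: utility resets to 1
```

Sub-problems that keep improving stay at full utility, while stalled ones lose utility gradually and are therefore picked less often by the tournament selection.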

10.2.3 Hypervolume

Comparing the performance of two or more multi-objective optimization methods is a complex task. Multi-objective optimization has two goals: convergence and diversity of solutions.

A widely used metric in the evaluation of multi-objective algorithms is the Hypervolume (HV) indicator. HV measures the volume of the region covered between the solutions on the Pareto front P (non-dominated solutions) and a reference point W. Each solution \(i \in P\) defines a hypercube \(v_i\) with respect to the point W [21]. This reference point can be found by building a vector with the worst values of the objective functions. The volume of the union of all hypercubes is the value of the metric: the higher the HV, the better the results. Higher HV values indicate a wider spread of the solutions in P and a better convergence to the Pareto front.

Hypervolume corresponds to the area formed by the union of all rectangles, as shown in Fig. 10.2.

Fig. 10.2 Hypervolume area
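For two objectives, the union of rectangles can be computed with a simple sweep. The sketch below assumes both objectives are maximized (as with sensitivity and specificity) and uses the origin as the reference point of worst values; the indicator implementation used in the experiments may follow a different convention, and the sample points are illustrative.

```python
def hypervolume_2d(front, reference=(0.0, 0.0)):
    """Area dominated by a two-objective front (maximization) above the reference."""
    rx, ry = reference
    # Keep only points that are strictly better than the reference point.
    points = sorted((p for p in front if p[0] > rx and p[1] > ry),
                    key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ry
    for x, y in points:
        if y > best_y:  # dominated points add no new area
            area += (x - rx) * (y - best_y)
            best_y = y
    return area

# Rectangles 0.9 x 0.2 and 0.5 x 0.6 overlap; the point (0.4, 0.1) is dominated.
hv = hypervolume_2d([(0.9, 0.2), (0.5, 0.6), (0.4, 0.1)])
```

Sorting by the first objective in decreasing order lets each point contribute only the horizontal strip that lies above the strips already counted, so overlapping area is never counted twice.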

10.2.4 Classification Problem

Classification is one of the main tasks of Data Mining. According to Han and Kamber [4] classification is the process of finding a model or function that describes and distinguishes data elements or concepts in order to be able to use the model to predict the class of an object whose class is unknown. The derived model is based on analysis of a set of training data.

The training data consist of pairs of inputs (vectors) and desired outputs. For example, in a classification problem, a hospital may want to classify patients into those who have a high, medium or low risk of acquiring a certain illness.

The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before. Therefore, a key objective of the learning algorithm is to build models with good generalization capability; i.e., models that accurately predict the class labels of previously unknown records.

A general approach for solving classification problems consists of two steps. First, a training set consisting of records whose class labels are known must be provided. The training set is used to build a classification model, which is subsequently applied to the test set, which consists of records with unknown class labels.

Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a table known as a confusion matrix (Table 10.1).

Table 10.1 Confusion matrix

From the confusion matrix (Table 10.1) it is possible to calculate measures such as the True Positive rate (TP rate), True Negative rate (TN rate or specificity), False Positive rate (FP rate) and False Negative rate (FN rate). The TP rate, also called sensitivity, is the proportion of positive examples that are correctly classified (Eq. 10.7). Its complement is the FN rate (i.e., \(FN~rate = 1 - TP~rate\)). Specificity is the proportion of negative examples that are correctly classified (Eq. 10.8). Its complement is the FP rate.

$$\begin{aligned} sensitivity = \frac{TP}{TP+FN} \end{aligned}$$
(10.7)
$$\begin{aligned} specificity = \frac{TN}{TN+FP} \end{aligned}$$
(10.8)
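A small sketch shows how the confusion-matrix counts and the two measures of Eqs. 10.7 and 10.8 can be obtained from lists of actual and predicted labels. The function names, the label encoding (1 for the positive class) and the sample labels are assumptions made for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Eq. 10.7: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Eq. 10.8: TN / (TN + FP)."""
    return tn / (tn + fp)

tp, fp, tn, fn = confusion_counts([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
sens = sensitivity(tp, fn)  # 2 of the 3 positive examples are found
spec = specificity(tn, fp)  # 1 of the 2 negative examples is found
```

Both functions divide by the size of one class, so they are undefined when that class is absent from the data; a production version would need to handle that case.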

For several years, the most used performance measure for classifiers was accuracy [1]. Accuracy is the fraction of examples correctly classified, as shown in Eq. 10.9. Despite its widespread use, accuracy maximization is not an appropriate goal for many real-world tasks [13]. A tacit assumption in the use of classification accuracy as an evaluation metric is that the class distribution among examples is constant and relatively balanced. In the real world this is rarely the case. Moreover, the cost associated with the incorrect classification of each class can be different, because some classifications can lead to actions that have serious consequences [12].

$$\begin{aligned} accuracy = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(10.9)

Classification is one of the most dynamic research and application areas of ANNs. However, as mentioned before, the selection of connection weights is a key issue, and here this issue is tackled with two MOEAs.

10.3 Experimental Evaluation and Results

The experimental evaluation aims at answering the following research questions:

  1. RQ1: Is there a difference in performance among the configurations of each algorithm?

  2. RQ2: Is there a difference in performance between SMPSO and MOEA/D-DE-DRA?

  3. RQ3: What are the advantages of the multi-objective approach over the mono-objective approach for evolving the connection weights of an ANN for a classification task?

To answer RQ1, different configurations of the algorithms are first used to train ANNs on each training database, using sensitivity and specificity as fitness functions. Second, the trained ANNs are applied to the test databases, yielding a new set of sensitivity and specificity values. Finally, the different configurations are compared using the Hypervolume indicator and the Friedman rank test [6].

The goal of RQ2 is to verify whether one algorithm obtains better results than the other. The results obtained in RQ1 are compared using the best configuration found for each algorithm. Again the Friedman rank test is used.

To answer RQ3, the results generated by applying the ANNs to each test database are analyzed using accuracy, sensitivity and specificity.

In order to verify statistical differences among the results found by all algorithms and settings, each of them was run 30 times, and the Friedman [10] and Mann–Whitney tests were applied at the 0.05 significance level.

This section explains the methodology adopted to evolve the connection weights of artificial neural networks using a multi-objective approach and its application to class prediction. The Java language was used to implement the ANN and to compute the two fitness functions: sensitivity and specificity. The implementations of SMPSO and MOEA/D-DE-DRA available in the JMetal framework were used.

The following databases were used:

  1. Breast Cancer Wisconsin (Original) Data Set (referred to as Cancer);

  2. Pima Indians Diabetes Data Set (referred to as Diabetes);

  3. Glass Identification Data Set (referred to as Glass);

  4. Statlog (Heart) Data Set (referred to as Heart).

Each database was divided into two groups of instances, a training set and a testing set. These groups have different sizes depending on the database, as shown in Table 10.2.

Table 10.2 Separation of databases

The topologies of the ANNs were defined according to the databases. The size of the input layer is given by the number of attributes and the size of the output layer by the number of classes. The complete definition of the topologies used is presented in Table 10.3.

Table 10.3 Number of neurons of each layer

The topology defines the size of the individuals evolved by the algorithms: one dimension for each connection plus one for the bias of each neuron. In other words, each individual defines one ANN. The neurons use a sigmoid activation function.
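The encoding described above can be sketched by decoding a flat individual into an MLP and running a forward pass. The gene ordering (for each neuron, its incoming weights followed by its bias), the assumption that input nodes carry no bias, and the 2-2-1 topology in the usage line are illustrative choices, not the layout of the original Java implementation.

```python
import math

def individual_size(layer_sizes):
    """Number of genes: one per connection plus one bias per non-input neuron."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def mlp_forward(individual, layer_sizes, inputs):
    """Decode a flat weight vector and return the output-layer activations."""
    if len(individual) != individual_size(layer_sizes):
        raise ValueError("individual length does not match the topology")
    activations = list(inputs)
    pos = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        next_activations = []
        for _ in range(n_out):
            weights = individual[pos:pos + n_in]
            bias = individual[pos + n_in]
            pos += n_in + 1
            net = sum(w * a for w, a in zip(weights, activations)) + bias
            next_activations.append(1.0 / (1.0 + math.exp(-net)))  # sigmoid
        activations = next_activations
    return activations

# A 2-2-1 network needs (2 + 1) * 2 + (2 + 1) * 1 = 9 genes.
output = mlp_forward([0.0] * 9, [2, 2, 1], [0.3, 0.7])
```

With all genes set to zero every net input is zero, so each neuron outputs 0.5 regardless of the input values.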

The algorithms were executed with two different population sizes (50 and 100) and two different numbers of generations (500 and 1000), giving four different configurations for each algorithm: C1, with a population size of 50 and 500 generations; C2, with a population size of 50 and 1000 generations; C3, with a population size of 100 and 500 generations; and C4, with a population size of 100 and 1000 generations. The remaining parameters were set as presented in Table 10.4, using the default values of JMetal.

Table 10.4 Parameters values used

Next we present and discuss the results of the experiments in order to answer the research questions.

10.3.1 RQ1 - Comparing Different Configurations of Each Algorithm

As mentioned before, different configurations of each algorithm were compared to set the values of the parameters: population size and number of iterations.

Table 10.5 Results of Hypervolume in each configuration
Table 10.6 Kruskal–Wallis at 0.05 significance level for Hypervolume

Table 10.5 shows the mean values and standard deviations of the Hypervolume indicator. The results of SMPSO are reported at the top of the table and the results of MOEA/D-DE-DRA at the bottom. For SMPSO, the best configuration is C1 for Cancer, C4 for Diabetes, C3 for Glass and C4 for Heart. In the case of MOEA/D-DE-DRA, the best configuration is C2 for Cancer, Diabetes and Heart, and C4 for Glass. However, the differences between the Hypervolume values are small. For a deeper analysis of these values, the Kruskal–Wallis test at the 0.05 significance level was applied. These results are reported in Table 10.6 for SMPSO and MOEA/D-DE-DRA. Analyzing the Kruskal–Wallis results, it is possible to observe that for SMPSO the configuration C4 always obtains the best or equivalent results for all databases. For MOEA/D-DE-DRA, the configuration C2 obtains the best or equivalent results for all databases except Glass, where C4 is the best configuration.

These findings are confirmed by the average rankings of the configurations obtained using the Friedman test. These results are shown for SMPSO and MOEA/D-DE-DRA in Tables 10.7 and 10.8, respectively. In summary, the Friedman test points to configuration C4 for SMPSO and C2 for MOEA/D-DE-DRA as the best considering all databases. These configurations were therefore chosen for the following experiments.
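The average rankings that underlie the Friedman test can be illustrated with a short sketch. It only computes the ranks (rank 1 for the highest value in each row, ties sharing the average rank) and not the test statistic or p-value; the function name and the scores are hypothetical and are not the values of Tables 10.7 and 10.8.

```python
def average_ranks(scores):
    """Average rank of each configuration over several databases.

    scores holds one row per database and one column per configuration;
    higher values are better and receive the lower (better) rank.
    """
    n_cfg = len(scores[0])
    totals = [0.0] * n_cfg
    for row in scores:
        order = sorted(range(n_cfg), key=lambda j: -row[j])
        i = 0
        while i < n_cfg:
            k = i
            # Extend the group while the next value ties with the current one.
            while k + 1 < n_cfg and row[order[k + 1]] == row[order[i]]:
                k += 1
            rank = (i + k) / 2 + 1  # mean of positions i+1 .. k+1
            for j in order[i:k + 1]:
                totals[j] += rank
            i = k + 1
    return [t / len(scores) for t in totals]

# Two databases, three configurations (made-up Hypervolume values).
ranks = average_ranks([[0.9, 0.8, 0.7],
                       [0.6, 0.6, 0.5]])
```

In the second row the first two configurations tie, so each receives rank 1.5, and the final averages are 1.25, 1.75 and 3.0.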

Table 10.7 SMPSO average rankings of configurations (Friedman)
Table 10.8 MOEA/D-DE-DRA average rankings of configurations (Friedman)

10.3.2 RQ2 - Comparing Different Algorithms

To answer RQ2, we compared the results of SMPSO with those of MOEA/D-DE-DRA, using the configurations chosen according to the results presented previously. Table 10.9 shows the results of the Wilcoxon test at the 0.05 significance level and the effect size. It is possible to observe that the algorithms present significantly different results for each database. SMPSO presents better results for Cancer, Diabetes and Glass, whereas for Heart the best results are obtained by MOEA/D-DE-DRA.

Table 10.9 Wilcoxon test at 0.05 significance level, SMPSO x MOEA/D-DE-DRA, Hypervolume results
Fig. 10.3 Obtained fronts, SMPSO and MOEA/D-DE-DRA for Diabetes

Table 10.10 SMPSO obtained Pareto Front for Diabetes
Table 10.11 MOEA/D-DE-DRA obtained Pareto Front for Diabetes

Figure 10.3 depicts the fronts obtained using SMPSO and MOEA/D-DE-DRA for Diabetes. Tables 10.10 and 10.11 present the sensitivity and specificity values of each of the solutions in the fronts, for SMPSO and MOEA/D-DE-DRA respectively. These are the fronts obtained after executing the algorithms 30 times and removing dominated and repeated solutions.

For Diabetes, SMPSO clearly outperforms MOEA/D-DE-DRA. For Heart the situation is reversed: MOEA/D-DE-DRA outperforms SMPSO. The average rankings were then obtained using the Friedman test. These results are presented in Table 10.12, where it is possible to observe that SMPSO is slightly better than MOEA/D-DE-DRA considering the Hypervolume.

Table 10.12 Average rankings of the algorithms (Friedman)

10.3.3 RQ3 - Advantages of a Multi-Objective Approach

When learning classification models such as ANNs, the goal is to obtain models with good classification performance. Hence, the great majority of methods aim to optimize classification performance by improving accuracy. Despite its widespread use, accuracy maximization is not an appropriate goal for many real-world tasks [13]. A tacit assumption in the use of classification accuracy as an evaluation metric is that the class distribution among examples is constant and relatively balanced. In the real world this is rarely the case; moreover, the cost of misclassification can differ between classes, because some classifications lead to actions that have serious consequences. Therefore, recent research points to sensitivity and specificity as better metrics for the induction of classification models.

Sensitivity is a relative measure of the instances of the positive class that are correctly classified. Hence, the greater the sensitivity, the greater the number of instances in the positive class that are correctly classified. Specificity is the same measure for negative instances. The greater its value, the lower the number of instances in the negative class that are misclassified. Sensitivity and specificity are conflicting objectives: as sensitivity increases, specificity tends to decrease, and vice versa.

To illustrate the advantages of a multi-objective approach based on these two metrics, the ANNs obtained by SMPSO for Diabetes are analyzed in detail. Table 10.13 presents the TP, FP, TN, FN and accuracy of the ANNs, sorted by increasing value of TP. It is possible to note that as TP increases, TN decreases. The best accuracy achieved is 0.7288135593, with TP \(=\) 127, FP \(=\) 19, TN \(=\) 45 and FN \(=\) 45. In other words, 127 patients who have diabetes were diagnosed as having diabetes, but 19 patients who do not have diabetes were also diagnosed as having it. On the other hand, 45 patients who do not have diabetes were confirmed as not having the disease, but 45 patients who have diabetes were diagnosed as not having it. This can be dangerous, because timely treatment can make a considerable difference to the quality of life of these patients. With access to all this information, a different ANN from the front could be chosen. That is, the user has more freedom to select the ANN that best suits their preferences.
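As a check on the figures quoted above, the measures of Eqs. 10.7–10.9 can be recomputed from the reported counts (values rounded to four decimal places):

$$\begin{aligned} accuracy &= \frac{127+45}{127+45+19+45} = \frac{172}{236} \approx 0.7288 \\ sensitivity &= \frac{127}{127+45} = \frac{127}{172} \approx 0.7384 \\ specificity &= \frac{45}{45+19} = \frac{45}{64} \approx 0.7031 \end{aligned}$$

The accuracy agrees with the reported value, and the sensitivity shows that roughly a quarter of the patients with diabetes are missed by this particular ANN.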

Table 10.13 SMPSO obtained solutions for Diabetes

10.4 Conclusion

ANNs are especially useful for finding general solutions to problems where a pattern needs to be extracted, such as data mining. The main difficulty in applying an ANN to a given problem domain is training it to learn and predict. ANNs provide different ways to solve many nonlinear problems that are hard to solve with conventional techniques.

Evolutionary algorithms have excelled at solving several types of problems that require global search optimization. These algorithms have also been used to train ANNs. This paper describes and compares the results obtained in ANN training with two different algorithms, one based on particle swarm optimization (SMPSO) and one based on differential evolution (MOEA/D-DE-DRA). The ANNs are trained for a classification task and, to properly tackle this task, they need to maximize two metrics: sensitivity and specificity.

An experiment was conducted using different benchmark databases. The first goal was to determine the values of two important parameters of the algorithms: the population size and the number of generations. The best configurations were then compared to determine which algorithm is best for the task. It was possible to observe that the best algorithm depends on the database; however, SMPSO presented slightly better results. Finally, using the results found for Diabetes, the advantages of using sensitivity and specificity were illustrated.

Future work includes analyzing the influence of other parameters of the algorithms, for example by using an adaptive version of MOEA/D-DE-DRA. It is known that an appropriate configuration of parameters can produce better results.