1 Introduction

An artificial neural network (ANN) is a machine learning method whose architecture is built on mathematical models. The growth of ANNs and their achievements in previous research show that they are a reliable solution for many computational applications and models in different application areas [1], especially when addressing very large datasets that have many dimensions [2]. Despite their success with different problems, ANNs cannot reach optimum performance in many nonlinear problems [3] because it is difficult to choose appropriate values for the initial connection weights, the structure of the network (number of hidden nodes), the training error and the convergence of the learning algorithm. Although the choice of parameters is a very important aspect of ANNs, this task is not easy because a single parameter can affect the network performance and the adjustment of all of the parameters depends on the user’s experience. The difficulty of determining optimal network parameters therefore remains a major challenge for users of ANNs. In other words, the question is which parameters should be optimized to make the best use of the ANN, and what their optimum values are. Consequently, the optimization of the connection weights, training process and structure of the network has attracted growing attention in the past few years.

Because ANNs suffer from these problems, evolutionary algorithms (EAs) are used to address them, evolve the ANNs efficiently and improve network performance. EAs can choose the best connection weights and reduce the number of hidden nodes, yielding an effective network structure with positive effects on the network performance [4, 5]. Recent studies have proposed exploiting EA techniques to overcome the above problems [4,5,6,7,8,9]. Most of these studies utilize EAs to evolve simple and accurate ANNs. The integration of EAs and ANNs is still under research; combining the advantages of each can yield a more efficient method. One of the most successful applications of EAs is their use for evolving ANNs, as shown in [10]. Many modern applications involve several incompatible objectives. Such a multi-objective optimization problem (MOP) has no single optimal solution; instead, it has a set of optimal trade-off solutions called the Pareto optimal set [11]. The image of the non-dominated solutions of the Pareto optimal set in the objective space is called the Pareto front.

Multi-objective evolutionary algorithms (MOEAs) are one of the most active research areas in the field of evolutionary computation [12]. MOEAs are used to produce and optimize ANN parameters by optimizing two conflicting objectives, namely the minimization of the ANN’s structural complexity and the maximization of the network’s capacity. These algorithms are applied to improve the generalization ability, that is, the transfer from the training data to data unseen by the network. Moreover, MOEAs are suitable for designing appropriate and accurate ANNs through the simultaneous optimization of two or more conflicting objectives. Hence, due to their ability to improve the structural performance, MOEAs have recently been applied successfully to optimize the network structure and connection weights [13,14,15,16]. Furthermore, they can find multiple solutions in a single run [17,18,19,20]. A considerable number of studies in the literature have applied these techniques. For example, multi-objective genetic algorithm optimization was used by [21, 22] to train a feedforward neural network. Similarly, there are hybrid methods that use ANNs with evolutionary Pareto-based algorithms; this type of research is known as multi-objective evolutionary artificial neural networks (MOEANNs) [23]. Another method based on the generalized multilayer perceptron (MLP) improved the performance of the evolutionary model [24]. A major study [25] proposed a hybrid multi-objective genetic algorithm (MOGA), based on the Strength Pareto Evolutionary Algorithm 2 (SPEA2) and the non-dominated sorting genetic algorithm-II (NSGA-II), to optimize the training and topology of a recurrent neural network (RNN) simultaneously. Recently, [15] applied NSGA-II as a MOGA to train a neural network and optimize its weights and biases with respect to maximum accuracy and minimum dimension.

Memetic algorithms (MAs) have also been developed over the past few years. Several recent studies have combined ANNs, MOEAs and local optimizers to speed up convergence [17, 26, 27]. Abbass [28] concludes that his proposed memetic Pareto artificial neural network (MPANN), which is based on a Pareto optimal solution, has better generalization and lower computational cost. Almeida and Ludermir [29] proposed a multi-objective memetic and hybrid approach for optimizing the parameters and performance of ANNs by using a combination of evolutionary strategies, genetic algorithms and particle swarm optimization. Another study proposed a memetic multi-objective evolutionary neural network algorithm to automatically design ANN models with sigmoid basis units for multi-classification problems [30]. Likewise, [18] presents a memetic Pareto evolutionary approach based on the NSGA-II evolutionary artificial neural network algorithm to optimize two conflicting main objectives: a high correct classification rate and a high classification rate for each class. Qasem and Shamsuddin [31] presented a new memetic multi-objective evolutionary algorithm for the design of radial basis function networks (RBFNs) for classification tasks. Recent work in [32] introduced a multi-objective evolutionary learning algorithm that uses an improved version of NSGA-II hybridized with a local search algorithm for training ANNs with generalized radial basis functions.

Recent studies on self-adaptive properties for multi-objective optimization algorithms indicate that self-adaptation can improve performance. Abbass [33] presented a self-adaptive Pareto differential evolution algorithm for multi-objective optimization problems (SPDE) that self-adapts the crossover and mutation rates. In the reported experiments, the SPDE algorithm outperformed the other evolutionary multi-objective optimizers. A multi-objective self-adaptive differential evolution algorithm with objective-wise learning strategies (OW-MOSaDE) was introduced to solve numerical optimization problems with multiple conflicting objectives [34]. Another study used self-adaptive features for MOEAs and showed the benefit of dynamically adjusting the distribution index of the simulated binary crossover (SBX) operator [35]. Lately, [36] applied adaptive memetic computing to multi-objective optimization; the results showed better optimization performance and demonstrated the efficiency of the proposed adaptive memetic technique.

Memetic adaptive techniques applied to multi-objective optimization in other applications have benefited from two components that improve the search process and lead to better accuracy in the final solutions. The adaptive component dynamically adjusts the distribution index of the SBX crossover at each generation of the genetic process. This arrangement led the algorithm to produce much better results than the original SBX crossover with a fixed index. The local search component speeds up convergence and increases the quality of the Pareto optimal solutions. It has been observed that using both techniques during the evolutionary process can improve the MOEA’s performance by balancing exploration and exploitation during the various stages of the evolutionary search.

Motivated by this observation, in this study, we present a new memetic adaptive multi-objective evolutionary algorithm that is based on a three-term backpropagation (TBP) network (MAMOT) to optimize the parameters of the TBP network and improve the network accuracy. The adaptive non-dominated sorting genetic algorithm (ANSGA-II) is utilized to optimize three objectives (parameters) simultaneously, namely the number of hidden nodes in the hidden layer, the norm of the connection weights and the error of the network, to solve a pattern classification problem. To measure performance, we used metrics that are common for classification problems [37, 38], namely the accuracy, sensitivity, specificity and mean squared error (MSE).

Although EAs have several advantages, these algorithms are slow to converge, which is a major setback [39], and it is difficult to fine-tune the final solutions in the search space [40]. To overcome these setbacks, a global search algorithm combined with a local search technique (a memetic process) offers a better speed of convergence for the evolutionary approach and better accuracy of the final solutions. This approach has yielded very promising results in other complex problems. In addition, the flexibility of the self-adaptive crossover operator gives the proposed method a dynamic nature, which has been the motivation for this study. Previous studies indicate that memetic adaptive methods applied to multi-objective evolutionary algorithms have achieved success in diverse applications. To the best of our knowledge, however, this method has not yet been used in the literature to optimize and automatically design ANNs. Therefore, it can be argued that this is a novel approach in this research area.

The novelty of the proposed method comes from using an adaptive method together with a local search technique for designing an artificial neural network. The adaptive method dynamically adjusts the distribution index of the SBX crossover at each generation of the genetic process, which led the algorithm to produce much better results than the original SBX crossover with a fixed index. The local search technique speeds up convergence and increases the quality of the Pareto optimal solutions. The goal of the proposed method is to design the ANN structure automatically and to reduce the error rate of the TBP network, achieving better performance as well as a better architecture in terms of the hidden nodes.

The remainder of this study is organized as follows: Sect. 2 introduces the materials and methods used in this study. The proposed method and the flowchart of the algorithm are presented in Sect. 3. The datasets, experimental setup, results, discussion and statistical testing are given in Sect. 4. Finally, Sect. 5 concludes the study.

2 Materials and methods

In this section, we highlight the main methods that are used in this paper. The hybrid method trains the three-term backpropagation neural network with a multi-objective evolutionary algorithm that uses the self-adaptive simulated binary crossover and is combined with a local search technique, which can significantly speed up the multi-objective evolutionary algorithm.

2.1 Multi-objective evolutionary artificial neural network

The use of evolutionary approaches for ANN training, known as evolutionary artificial neural networks (EANNs), has been a key research area for the past few years [17]. Researchers have developed methods and techniques to evolve ANNs, attempting to find simple and accurate architectures with good generalization capabilities. The main advantages of the evolutionary approach to ANN training are the ability to escape a local optimum, robustness and the ability to adapt to a changing environment [4, 13, 41]. Research into EANNs has usually taken one of three approaches: first, evolving the weights of the network; second, evolving the network architecture; and last, evolving both the weights and architecture simultaneously [10]. The preliminary work of [17] succeeded in designing networks that have a good generalization capability, and finding a good ANN architecture has also been discussed in the ANN research literature. Multi-objective evolutionary algorithms (MOEAs) are population-based search approaches; hence, many Pareto optimal solutions can be obtained in a single run, which makes this type of algorithm attractive. Current research focuses on the application of MOEAs to solve multi-objective optimization problems in different fields [32, 42,43,44].

2.2 Parameter optimization

To evaluate the three-term backpropagation network performance of the proposed method, three objective functions were used in this study, as follows:

  1. The performance of the network (accuracy) is based on the mean squared error (MSE) on the training set. This is the first objective function and is given below:

    $$f_{1} = \frac{1}{N}\sum\limits_{j = 1}^{N} {(t_{j} - o_{j} )^{2} }$$
    (1)

    where \(o_{j}\) is the network output value of the output unit, \(t_{j}\) is the target value of the output, and N is the number of samples.

  2. The complexity of the network is based on the number of hidden nodes in the hidden layer of the TBP network and is the second objective function. This function is computed as follows:

    $$f_{2} = \sum\limits_{h = 1}^{H} {\rho_{h} }$$
    (2)

    where \(\rho\) is a binary vector of dimension H and H is the maximum number of hidden nodes in the network. Each component \(\rho_{h} \in \{0,1\}\) indicates whether hidden node h exists in the network or not. Turning a hidden unit ON or OFF works like a switch; this mechanism determines how many of the at most H hidden nodes are active in the TBP network.

  3. The complexity of the TBP network is also based on the weights of the network; this objective follows the notion of regularization and represents the smoothness of the model. This function is the last objective of this study (f3) and is given as:

    $$f_{3} = \left\{ {||\omega ||} \right\}$$
    (3)

    where \(\omega\) is the matrix of weights in the network.

In this study, three fitness functions are used: f1 optimizes the performance of the network, f2 minimizes the structure of the network (hidden nodes) and f3 minimizes the connection weights of the TBP network.
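The three objectives in Eqs. (1) to (3) can be sketched for a single candidate network as follows. This is a minimal illustration, not the authors' implementation: the function name `objectives` and its argument layout are hypothetical, and the Euclidean norm is assumed for Eq. (3) because the text does not state which norm is used.

```python
import math


def objectives(targets, outputs, hidden_mask, weights):
    """Return (f1, f2, f3) for one candidate network.

    targets, outputs: equal-length lists of floats (t_j and o_j in Eq. 1).
    hidden_mask: list of 0/1 switches, one per potential hidden node (rho_h in Eq. 2).
    weights: flat list of all connection weights (omega in Eq. 3).
    """
    n = len(targets)
    # Eq. (1): mean squared error over the N samples
    f1 = sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n
    # Eq. (2): number of hidden nodes that are switched ON
    f2 = sum(hidden_mask)
    # Eq. (3): norm of the weights (Euclidean norm is an assumption)
    f3 = math.sqrt(sum(w * w for w in weights))
    return f1, f2, f3


f1, f2, f3 = objectives([1.0, 0.0], [0.5, 0.5], [1, 0, 1, 1], [3.0, 4.0])
```

All three values are to be minimized, so a candidate with fewer active hidden nodes and smaller weights is preferred whenever its error is not worse.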

2.3 Three-term backpropagation algorithm

The three-term backpropagation (TBP) network is a type of ANN proposed by Zweiri et al. [45] to speed up the weight adjustment process by increasing the convergence rate of the algorithm and reducing learning stalls. The TBP network modifies the procedure of the standard backpropagation (BP) algorithm by adding an extra term to increase the BP learning speed [46]. The neurons in all of the layers are connected by connection links. A weight is associated with each connection link and is multiplied by the signal that exits each neuron in the network (from the input to the hidden layer and from the hidden to the output layer), see Fig. 1.

Fig. 1 ANN architecture

In the TBP network, in addition to the learning rate and momentum parameters, a third parameter, called the proportional factor (PF), is introduced. The introduction of the PF has proven to be successful in improving the convergence rate of the algorithm and speeding up the weight adjustment process.

$${\text{net}}_{j} = \sum\limits_{i = 1}^{M} {W_{ij} O_{i} + \theta_{i} }$$
(4)
$$O_{j} = \frac{1}{{1 + {\text{e}}^{{ - {\text{net}}_{j} }} }}$$
(5)

where \({\text{net}}_{j}\) is the summation of the weighted inputs added to the bias, \(W_{ij}\) is the weight between input layer unit i and hidden layer unit j, \(O_{i}\) is the output of input layer unit i, which is also the input to hidden layer unit j, and \(\theta_{i}\) is the bias associated with each connection link between the two respective layers. Equation (5) shows the calculation of \(O_{j}\), which is the output of the activation function at hidden layer unit j.

$$E = \frac{1}{L}\sum\limits_{k = 1}^{L} {(t_{k} - o_{k} )^{2} }$$
(6)

where E is the error function of the network (the mean squared error), \(t_{k}\) is the target output and \(o_{k}\) is the network output at output unit k, and the network has L output neurons.
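A minimal sketch of Eqs. (4) to (6) for one layer is given below. The helper names `sigmoid`, `layer_output` and `mse` are hypothetical, and the code assumes that one bias is added once per receiving unit, which is the usual reading of Eq. (4); the equation as printed places the bias inside the sum.

```python
import math


def sigmoid(x):
    """Logistic activation of Eq. (5)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(inputs, weights, biases):
    """Outputs of one layer.

    weights[i][j] connects sending unit i to receiving unit j (W_ij in Eq. 4).
    biases[j] is added once per receiving unit j (assumption, see lead-in).
    """
    outputs = []
    for j in range(len(biases)):
        net_j = sum(weights[i][j] * inputs[i] for i in range(len(inputs))) + biases[j]
        outputs.append(sigmoid(net_j))
    return outputs


def mse(targets, outputs):
    """Error function of Eq. (6) over the L output units."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


hidden = layer_output([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
error = mse([1.0, 0.0], hidden)
```

With all-zero inputs and biases every net input is zero, so each unit outputs 0.5.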

Let W be the network weight vector, k the iteration number, and ∆W(k) = W(k + 1) − W(k). The weight adjustment in Eq. (7) is modified in Eq. (8) to include the proposed third term, which is proportional to the difference between the desired and calculated output. Thus, Eq. (7) represents the two-term backpropagation, while Eq. (8) represents the three-term backpropagation.

$$\Delta W(k) = \alpha ( - \,\nabla E(W(k))) + \beta \Delta W(k - 1)$$
(7)
$$\Delta W(k) = \alpha ( - \,\nabla E(W(k))) + \beta \Delta W(k - 1) + \gamma e(W(k))$$
(8)

where α and β are the learning rate and momentum, respectively; γ is the proportional factor; and \(e(W(k))\) represents the difference between the output and the target at each iteration.
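The update in Eq. (8) can be sketched elementwise as follows. The function name `tbp_step`, the default parameter values and the assumption that the error term e(W(k)) has already been expanded to one value per weight are all illustrative choices, not details taken from [45].

```python
def tbp_step(w, grad, prev_delta, error, alpha=0.1, beta=0.5, gamma=0.05):
    """One three-term backpropagation update, Eq. (8).

    w, grad, prev_delta, error: equal-length lists of floats.
    alpha: learning rate, beta: momentum, gamma: proportional factor.
    Returns the new weights W(k + 1) and the step Delta W(k).
    """
    delta = [
        alpha * (-g) + beta * d + gamma * e
        for g, d, e in zip(grad, prev_delta, error)
    ]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta


new_w, delta = tbp_step([1.0], [2.0], [0.5], [1.0])
two_term_w, two_term_delta = tbp_step([1.0], [2.0], [0.5], [1.0], gamma=0.0)
```

Setting `gamma=0.0` removes the proportional term and recovers the two-term rule of Eq. (7).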

2.4 Adaptive NSGA-II

The genetic algorithm (GA) simulates biological evolution to explore the search space automatically, and it is a parallel global search method [47]. The non-dominated sorting genetic algorithm-II (NSGA-II) was proposed in [48] and is known for its good performance in global searching. NSGA-II improves on the first version of NSGA [43] by introducing two components: the fast non-dominated sorting approach and the crowded comparison operator. Thus far, many studies regarding optimization and design have been performed [4, 32, 49, 50]. These studies show that the genetic algorithm and its derivatives are feasible for optimal design. Recently, many studies have shown that adaptive multi-objective optimization is valuable and able to achieve better results in a variety of applications [35, 51, 52]. A self-adaptive crossover operator can dynamically adapt the search and create child solutions from the parents in an appropriate way. The distribution index is increased or decreased for the next generation depending on whether the child outperforms (has a better fitness value than) the parents. These processes comprise the adaptive NSGA-II algorithm, which is called ANSGA-II. This methodology is intended to improve the solutions from one generation to the next.

The main idea of the crowding distance is to measure the distance between each individual and its neighbors in the same front, based on their objective values in the m-dimensional objective space. Individuals at the boundary are given an infinite distance and are therefore always selected. The crowding distance is assigned once the non-dominated sort is completed. Because individuals are selected based on rank and crowding distance, every individual in the population is assigned a crowding distance value. The distance is assigned front-wise, so comparing the crowding distance of two individuals from different fronts is meaningless, see Fig. 2.

Fig. 2 Standard NSGA-II technique
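The crowding distance computation can be sketched as follows. This follows the formulation commonly given for NSGA-II, in which the distance of an individual is the sum, over all objectives, of the normalized gap between its two neighbors in that objective; it is a sketch under that assumption, and the function name `crowding_distance` is hypothetical.

```python
def crowding_distance(front):
    """Crowding distance for one non-dominated front.

    front: list of objective tuples, all of the same length.
    Returns one distance per individual, in the input order.
    """
    n = len(front)
    dist = [0.0] * n
    if n == 0:
        return dist
    n_objectives = len(front[0])
    for k in range(n_objectives):
        # Sort the individuals of this front by objective k
        order = sorted(range(n), key=lambda i: front[i][k])
        low, high = front[order[0]][k], front[order[-1]][k]
        # Boundary individuals are always kept
        dist[order[0]] = dist[order[-1]] = float("inf")
        if high == low:
            continue  # all values equal: this objective adds nothing
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][k] - front[order[pos - 1]][k]
            dist[order[pos]] += gap / (high - low)
    return dist


d = crowding_distance([(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)])
```

The function is called once per front, which matches the front-wise assignment described above.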

2.5 Self-adaptive simulated binary crossover

Self-adaptation techniques are based on a population’s diversity. Whereas the adaptation of the operator ensures a good convergence speed, the degree of diversity determines the convergence reliability. Generally, the relationship between the parent and offspring populations is controlled with a self-adaptation technique. A self-adaptive simulated binary crossover (SA-SBX) was proposed by [51] to adjust the distribution index of the simulated binary crossover (SBX) operator dynamically at each generation in NSGA-II. Several studies have reported that SA-SBX produces better solutions than SBX when applied to both single- and multi-objective optimization problems [35, 51]. The important factor in the SBX crossover is finding an appropriate value of the distribution index (ηc) because it affects the convergence speed and whether a local or global optimum is reached; SA-SBX therefore updates ηc adaptively at each generation of the NSGA-II procedure. As is well known, the crossover operator in genetic algorithms produces children by recombining the information from the parents. If the child is better than the parent, then the child is extended further in the hope of creating a better solution, while the opposite occurs if a worse solution is created.

The offspring solutions \(x_{i}^{(1,t + 1)}\) and \(x_{i}^{(2,t + 1)}\) are computed from the parent solutions \(x_{i}^{(1,t)}\) and \(x_{i}^{(2,t)}\) through the spread factor \(\beta_{i}\), which is defined as the ratio of the absolute difference in the offspring values to that of the parents, as given in Eq. (9):

$$\beta_{i} = \left| {\frac{{x_{i}^{(2,t + 1)} - x_{i}^{(1,t + 1)} }}{{x_{i}^{(2,t)} - x_{i}^{(1,t)} }}} \right|$$
(9)

A random number \(u_{i}\) between 0 and 1 is created and used to sample \(\beta_{i}\) from the probability distribution in Eq. (10). This distribution is shown graphically in Fig. 3 for ηc = 2 and 5, for offspring created from two parent solutions (\(x_{i}^{(1,t)} = 2.0\) and \(x_{i}^{(2,t)} = 5.0\)). In Eq. (10), ηc is any nonnegative real number. From Fig. 3, it can be seen that a large value of ηc yields a higher probability of creating near-parent solutions and, consequently, provides a focused search. A small value of ηc permits distant solutions to be chosen as offspring, which permits a diverse search. Please see [51] for more details about this technique.

Fig. 3 Probability density function for creating offspring solutions with the SBX operator [51]

$$P(\beta_{i} ) = \left\{ {\begin{array}{*{20}l} {0.5(\eta_{c} + 1)\beta_{i}^{{\eta_{c} }} ,} \hfill & {{\text{if}}\;\;\beta_{i} \le 1;} \hfill \\ {0.5(\eta_{c} + 1)\frac{1}{{\beta_{i}^{{\eta_{c} + 2}} }},} \hfill & {{\text{otherwise}} .} \hfill \\ \end{array} } \right.$$
(10)
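A sketch of how one offspring pair can be drawn with SBX is given below. The inverse-transform formulas and the offspring equations are the standard SBX expressions from the literature, not equations printed in this section, and the function names `sbx_spread` and `sbx_pair` are hypothetical. The rule that updates ηc in SA-SBX is not reproduced here; see [51].

```python
import random


def sbx_spread(u, eta_c):
    """Spread factor beta obtained by inverting the CDF of Eq. (10), u in [0, 1)."""
    if u <= 0.5:
        return (2.0 * u) ** (1.0 / (eta_c + 1.0))
    return (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta_c + 1.0))


def sbx_pair(x1, x2, eta_c, rng=random):
    """Create two children from parents x1 and x2 for one variable."""
    beta = sbx_spread(rng.random(), eta_c)
    c1 = 0.5 * ((1.0 + beta) * x1 + (1.0 - beta) * x2)
    c2 = 0.5 * ((1.0 - beta) * x1 + (1.0 + beta) * x2)
    return c1, c2, beta


rng = random.Random(0)
c1, c2, beta = sbx_pair(2.0, 5.0, 2.0, rng)
```

By construction, the children keep the mean of the parents, and the ratio of Eq. (9) computed from `c1` and `c2` equals the sampled `beta`.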

2.6 Local search algorithm

A local search algorithm is a meta-heuristic approach that is used to solve hard optimization tasks. In a local search method, the algorithm moves through the search space and searches for a solution from a number of solutions by applying local changes; this process is continued until one of the solutions is considered to be optimal or after the expiration of a specified amount of time.

Local search algorithms are widely used for several problems in different areas, and they have received particular attention in computer science and engineering, especially in artificial intelligence applications. Local methods can find a local optimum when searching in a small region of the space. Therefore, in a combination of EA and local search algorithms, the EA performs a global search within the space of solutions to locate ANNs near the global optimum, and a local search method is then applied to find the best solution quickly and efficiently. This type of hybrid algorithm is known as a memetic algorithm (MA). MAs can improve both the convergence speed of the evolutionary approach and the accuracy of the final solutions [53].

There are several studies [14, 28] that use MOEAs along with ANN local optimizers to adjust the weights. This approach is called lifetime learning, and it consists of updating each individual with respect to the approximation error. The main problem with this type of algorithm is the computational cost. Some studies used a local search algorithm after the crossover and mutation operations for all of the individuals in the population in each iteration [17, 28, 54]. As is well known, the BP algorithm is a learning heuristic for supervised learning in ANNs. Therefore, in this study, we used a classical BP algorithm as a local search method.

3 The proposed memetic adaptive multi-objective genetic evolutionary algorithm

This section introduces a memetic adaptive multi-objective genetic evolutionary algorithm (MAMOGEA). It is adapted from the non-dominated sorting genetic algorithm-II (NSGA-II) [48], replaces the crossover operator with a self-adaptive crossover, and is hybridized with BP as a local search algorithm to optimize the TBP network parameters for solving pattern classification problems. The network architecture and accuracy are evolved simultaneously, with each individual being a fully specified TBP network. MAMOGEA based on the TBP network, which we call MAMOT, is proposed to determine the best parameters, performance and corresponding architecture of the TBP network.

In addition, self-adaptive simulated binary crossover (SA-SBX) [51] gives the proposed method a dynamic distribution index: it automatically updates the crossover operator so that child solutions are created in a promising direction from the parents. If the child solutions produced by this process are worse than the parent solutions, the operator produces a modified child that lies very near to the parents. This process can optimize the balance between exploration and exploitation during the various stages of the evolutionary search. The ANSGA-II is hybridized with the BP algorithm as a local search algorithm to enhance all of the individuals in the population and improve the classification accuracy. Together, these components help the proposed method to produce good final solutions. These solutions are evaluated on three objectives: (1) optimize the performance of the network (f1); (2) minimize the network complexity based on the number of hidden nodes in the hidden layer of the TBP network (f2); and (3) minimize the connection weights of the TBP network (f3), which is based on the notion of regularization and represents the smoothness of the model. To measure the network complexity, the study uses both f2 and f3. Minimizing f2 leads to a lower number of hidden nodes in the hidden layer of the TBP network, while minimizing f3 leads to a smaller norm of the weight matrix. To assist in the TBP network design, GA and MOEA are combined as a rank-density-based GA to perform the fitness evaluation and mating selection schemes.

MAMOT begins by collecting, normalizing and reading the dataset, which fixes the data used for the run. Then, the number of hidden nodes and the maximum number of iterations are set, and the dimension of an individual is determined. Next, a population of TBP networks is generated and initialized; during the experiments, the initialization is performed before each TBP training generation.

Every individual is evaluated in every iteration based on the objective functions. After the maximum number of iterations is reached, the proposed method stops and outputs a set of non-dominated TBP networks. Figure 4 shows the flowchart of MAMOT based on the TBP network. Furthermore, the proposed method is given in the following steps:

Figure a Pseudo-code of the proposed memetic adaptive multi-objective genetic evolutionary algorithm

Fig. 4 Flowchart of the proposed MAMOT

The method starts by generating a random population P(g), g = 0, of size N. The individuals of P(g) are evaluated on the three objectives described in Sect. 2.2. Then, the population is sorted by non-domination, and each solution is ranked by its non-domination level and crowding distance value. Each generation then proceeds as follows. First, the usual binary tournament selection, SA-SBX crossover and mutation are used for the binary encoding, and the mutation and SA-SBX crossover operators are likewise used for the real encoding, to create an offspring population Q(g) of size N. Second, the BP local search is applied to each individual of Q(g), and the individuals of Q(g) are evaluated based on their accuracy and complexity. Elitism is introduced by comparing the current population with the previously found best non-dominated solutions: in each generation, a combined population \(R(g) = P(g) \cup Q(g)\) of size 2N is formed. Afterward, R(g) is sorted according to the non-domination criteria, and the best solutions are those that belong to the best non-dominated set F1. If the size of F1 is smaller than N, then all of the members of F1 are chosen for the new population P(g + 1). The remaining members of P(g + 1) are chosen from subsequent non-dominated fronts in their order of ranking. Thus, solutions from the set F2 are chosen next, followed by solutions from the set F3, and so on. This procedure is continued until no more sets can be accommodated. Then, the new population P(g + 1) is sorted according to the rank and crowding values, and the first N individuals are selected. Finally, we use a binary tournament on P(g + 1) to obtain N individuals (Table 1).
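The front-by-front construction of P(g + 1) from R(g) can be sketched as follows, assuming that all objectives are minimized. This is a simplified illustration with hypothetical function names: it uses a straightforward (not the fast) non-dominated sort, and when the last front does not fit it simply truncates that front, whereas NSGA-II would order it by crowding distance first.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_fronts(objs):
    """Return the fronts F1, F2, ... as lists of indices into objs."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [
            i for i in remaining
            if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)
        ]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts


def select_next(objs, n):
    """Pick n indices from the combined population R(g), best fronts first."""
    chosen = []
    for front in non_dominated_fronts(objs):
        if len(chosen) + len(front) <= n:
            chosen.extend(front)
        else:
            # Simplification: plain truncation instead of a crowding-distance ordering
            chosen.extend(front[: n - len(chosen)])
            break
    return chosen


objs = [(1, 1), (2, 2), (1, 2), (0, 3), (3, 3)]
fronts = non_dominated_fronts(objs)
survivors = select_next(objs, 3)
```

In this toy population, the points (1, 1) and (0, 3) do not dominate each other and form F1; the remaining points fall into later fronts.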

Table 1 A summary of the methods used in the literature

4 Experiments

4.1 Datasets

For the experimental design, we consider 11 real-world datasets for classification tasks. The datasets include two-class, multi-class and complex real-world pattern classification tasks. The breast cancer, diabetes, heart, hepatitis and liver datasets are two-class datasets, while the iris, lung cancer, QAC, segment, wine and yeast datasets are multi-class datasets. All of the datasets are obtained from the UCI machine learning repository [55], except for the Qualitative Analytical Chemistry (QAC) dataset, which is sometimes called BTX and is also considered to be a complex problem. More information about QAC can be found in [56]. Table 2 shows the number of features, classes and instances for all of the datasets. For each dataset, 75% of the data is used for the training set, and the remaining 25% is used for the testing set. In addition, all of the dataset values are normalized to the range [0, 1].
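The preprocessing described above can be sketched as follows. Min-max scaling per feature is assumed for the normalization to [0, 1], and a seeded random shuffle is assumed for the 75%/25% split; the text specifies neither, and the function names are hypothetical.

```python
import random


def min_max_normalize(rows):
    """Scale every feature column of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column is mapped to 0.0 (assumption) to avoid dividing by zero
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle the rows and split them into a training and a testing part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = min_max_normalize(data)
train, test = train_test_split(normalized)
```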

Table 2 Summary of datasets used in the experiments

4.2 Experimental setup

The experiments are conducted to test the efficiency of the proposed method on all of the datasets. The proposed method is evaluated by using the tenfold cross-validation technique. In tenfold cross-validation, the dataset is split into ten equally sized subsets. Nine of those subsets are used as the training dataset, and the one remaining subset is used as the testing dataset. This train and test process is repeated in such a way that each subset is used once as the test dataset. The results of MAMOT for each dataset are compared to those of MOGAT. The number of input and output nodes depends on the dataset and differs from one dataset to another. The maximum number of hidden nodes is set to 10 [17, 23, 54]. The maximum number of neural network training iterations is set to 1000 [2, 46] for all of the datasets. For the local search algorithm, the learning rate is set to 0.01 and the number of iterations is set to 5 [17, 54]. Table 3 presents the other parameter settings. In Table 3, “N” refers to the dimension of the individual. Moreover, there are some parameters of the TBP network that must be specified by the user. In MAMOT, we assign the distribution index ηc = 2 in the initial population. Afterward, this value is updated depending on whether the created child is better or worse than both parents.
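The tenfold procedure can be sketched with index lists as follows. The generator name `k_fold_indices` is hypothetical, the folds are taken as contiguous blocks without shuffling, and fold sizes are allowed to differ by one when the number of samples is not divisible by ten.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one test fold, so every subset is used once for testing, as described above.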

Table 3 Parameter settings for the proposed algorithms

4.3 Results and discussion

This section presents the results of MAMOGEA and MOGA applied to the TBP network (called MAMOT and MOGAT, respectively). MAMOT, the method proposed in this paper, is a new memetic adaptive multi-objective evolutionary algorithm based on a three-term backpropagation (TBP) network. It uses self-adaptive NSGA-II combined with a local search method to optimize the parameters and performance of neural networks. MOGAT, on the other hand, was proposed in [14] and is implemented in this study for comparison with MAMOT; it uses the non-dominated sorting genetic algorithm II (NSGA-II) with a TBP network. Both MAMOT and MOGAT are based on the TBP network. The main differences between them are that MAMOT uses a self-adaptive method to improve the performance of the algorithm and a local search technique to improve all of the individuals in the population.

The results of these algorithms are Pareto optimal solutions intended to improve generalization on unseen data. The training set is used to train the TBP network to obtain the Pareto optimal solutions, while the testing set is used to test the generalization performance of the Pareto TBP network. The results for each dataset focus on the analysis of the hidden nodes, network error and accuracy, and they are analyzed based on the convergence to the Pareto front together with the classification performance. In Tables 4, 5, 6 and 7, the best results are highlighted in bold.
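To make the notion of Pareto optimal solutions concrete, the sketch below filters a set of candidate networks down to its non-dominated members. It assumes two objectives that are both minimized, written here as (error, number of hidden nodes); the tuple layout and the candidate values are hypothetical and are not taken from the experiments.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points):
    """Return the points that no other point dominates (the Pareto set)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates as (error, hidden_nodes); both objectives are minimized.
candidates = [(0.10, 6), (0.12, 4), (0.15, 3), (0.13, 5), (0.20, 3)]
pareto_set = non_dominated(candidates)
```

In this toy set, (0.13, 5) is dominated by (0.12, 4) and (0.20, 3) is dominated by (0.15, 3), so three trade-off solutions remain.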

Table 4 Statistical evaluation of the testing errors for MAMOT and MOGAT methods
Table 5 Average numbers of the hidden nodes in TBP network obtained by MAMOT and MOGAT methods
Table 6 Statistical evaluations for the testing accuracy of the MAMOT and MOGAT methods
Table 7 Sensitivity and specificity for the testing sets for MAMOT and MOGAT methods

Hybridizing the local search algorithm with evolutionary algorithms is a good choice for the problems studied because the hybrid algorithm achieves the best performance on all of the datasets. In addition, the self-adaptive crossover helped MAMOT to obtain better solutions than MOGAT in most cases; the crossover operator therefore benefited from the adaptive process. In fact, MAMOT obtained the best performing networks in all of the problems. Moreover, the networks obtained by this algorithm are, in general, smaller than those obtained by MOGAT because of the selective pressure produced by the objectives in Eqs. (2) and (3) working together. To evaluate the classification capabilities of the proposed method, the average sensitivity, specificity and accuracy of MAMOT and MOGAT were computed; the results are shown in Tables 6 and 7 and in Figs. 8, 9 and 10.

Tables 8 and 9 show the results of the robust tests for normality and of the paired sample tests, respectively. The robust tests for normality are used to check the normality assumption, and a paired sample t test is then used to compare the proposed method MAMOT with MOGAT. This test determines whether there is a statistically significant difference between the mean accuracies obtained by the two methods.

$${\text{Sensitivity}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}} \times 100\%$$
(11)
$${\text{Specificity}} = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}} \times 100\%$$
(12)
$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} \times 100\%$$
(13)

where TP = true positives, FN = false negatives, TN = true negatives, and FP = false positives.
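As a worked illustration of these measures, the sketch below computes them from the four confusion-matrix counts, following the standard definitions in which sensitivity depends on TP and FN and specificity depends on TN and FP. The function name and the example counts are hypothetical; returning 0.0 for an empty denominator is a convention chosen here, not something stated in the paper.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as percentages."""
    positives = tp + fn          # all samples that are truly positive
    negatives = tn + fp          # all samples that are truly negative
    total = positives + negatives
    sensitivity = 100.0 * tp / positives if positives else 0.0
    specificity = 100.0 * tn / negatives if negatives else 0.0
    accuracy = 100.0 * (tp + tn) / total if total else 0.0
    return sensitivity, specificity, accuracy


# Hypothetical counts: 50 positive and 50 negative test samples.
sens, spec, acc = classification_metrics(tp=40, fn=10, tn=45, fp=5)
```

Note that a classifier which never predicts the positive class has TP = 0 and FP = 0, giving 0% sensitivity together with 100% specificity; this combination can arise on strongly unbalanced data.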

Table 8 Results of the normality tests
Table 9 Results of paired t test and Wilcoxon’s signed-ranks test

The performance measures used in this study for the classification of the datasets are the sensitivity, specificity and accuracy. Sensitivity measures the classifier’s ability to identify the positive samples correctly; it depends on the number of true positives and false negatives, as shown in Eq. (11). Specificity measures the classifier’s ability to predict the negative samples correctly; it depends on the number of true negatives and false positives, as shown in Eq. (12). Accuracy measures the proportion of all samples, positive and negative, that the classifier labels correctly; Eq. (13) shows the accuracy formula.

Table 4 reports the performance of the proposed method in terms of the error for all eleven datasets. The mean, SD, max and min entries in Table 4 are, respectively, the average of the result values, the sample standard deviation (which indicates how far the individual values vary from the average), the maximum value and the minimum value. The average mean squared error (MSE) of the methods for all of the datasets is presented in this table, together with the error rates obtained by MAMOT and MOGAT. These results show the generalization error of the methods. From the mean rows of Table 4, we can observe that on all of the datasets MAMOT gives promising results regarding the testing error compared to MOGAT; MAMOT produced the smallest error on all of the datasets. Furthermore, the testing errors shown in the same table are the averages of the errors obtained in a single run of MAMOT and MOGAT applied to the TBP network.

Moreover, Fig. 5 compares the errors obtained on the training and testing sets using MAMOT and MOGAT. In this figure, the Y-axis plots the MSE, while the X-axis plots the datasets used in this study. The error rates are small on all of the datasets, especially on the breast cancer dataset, which has the lowest error, followed by the yeast dataset.

Fig. 5 Comparison of training and testing average error results obtained by MAMOT and MOGAT

Regarding complexity, Table 5 presents the average number of hidden nodes in the TBP network structure. Across the datasets, MAMOT achieved a simpler network structure, with a lower average number of hidden nodes than MOGAT. In addition, Table 5 shows that the average number of hidden nodes in the TBP network does not exceed 4.7 on any dataset when using MAMOT; this value is reached on two datasets, iris and liver. For MOGAT, the average number of hidden nodes does not exceed 5.60, which is reached on the diabetes dataset. On the other hand, the minimum average number of hidden nodes, obtained on the lung cancer dataset, is 2.00 for both algorithms. With more hidden nodes, a network can learn the training set more quickly, but it might not generalize well to the testing set. Therefore, Table 5 and Fig. 6 show that MAMOT is capable of designing simple TBP networks with the lowest number of hidden nodes.

Fig. 6 Comparison of MAMOT and MOGAT for the average number of hidden nodes for all of the datasets

In terms of the classification accuracy, the results are very good in general, especially on the breast cancer and yeast datasets. As shown in Table 6, MAMOT obtained 97.94 and 90.03% on these two datasets, respectively, while MOGAT obtained 96.69 and 90.01%. Table 6 highlights the highest testing accuracy for each dataset in bold. In general, the best classification accuracy on the testing sets is obtained by MAMOT for all datasets. Figure 7 shows the average testing accuracy for all datasets.

Fig. 7 Comparison of MAMOT and MOGAT with respect to the accuracy results obtained on the testing set

Table 7 shows the statistical results for the sensitivity, specificity and classification accuracy of the methods on the training and testing sets for all datasets. In terms of sensitivity, MAMOT produced the best results on the training and testing sets for all of the datasets and on average, except for the lung cancer dataset, on which MOGAT gave better results for both sets. The breast cancer dataset had the best sensitivity among all of the datasets, with 98.87% for the training set and 97.07% for the testing set. The yeast dataset had the lowest sensitivity (0.00%) for both the training and testing sets. Some of the datasets, such as QAC, segment and yeast, achieved very low sensitivity values that did not exceed 6.50%. We infer that improving the sensitivity is very difficult on these datasets because they pose difficult classification problems and are extremely unbalanced. Moreover, Fig. 8 compares MAMOT and MOGAT with respect to the sensitivity obtained on the testing set.

Fig. 8 Comparison of MAMOT and MOGAT with respect to the sensitivity results obtained on the testing set

With regard to the specificity in Table 7, MAMOT provided the best results on the training and testing sets for all of the datasets and on average, except for the QAC and segment datasets. The results reported in Table 7 and the histogram in Fig. 9 show that MAMOT and MOGAT produced the same specificity only for the yeast dataset. MAMOT has better results in training and testing, especially on the iris dataset, with 98.18% in training and 98.21% in testing, and on the yeast dataset, with 100.00% in both training and testing. MOGAT achieved 100.00% specificity on the QAC, segment and yeast datasets. In general, Table 7 shows that high specificity rates were obtained on all of the datasets.

Fig. 9 Comparison of MAMOT and MOGAT with respect to the specificity results obtained on the testing set

From the evaluation carried out in this study, it can be concluded that MAMOT is well suited to be employed as a classifier, since it generally shows the best performance. Optimal parameters are very important for ensuring the accuracy of ANNs, and MAMOT facilitates the search for the optimal parameters of the TBP network and is thus able to produce more accurate results. The results of this study also demonstrate that the use of the adaptive method in estimating the parameters of the TBP network improved the accuracy of the proposed method on all of the datasets used.

4.4 Statistical test

To compare the classifiers on multiple datasets, we used statistical tests to determine whether the algorithms are significantly different. Several well-known statistical tests were examined, and their suitability was studied, in order to determine the significance of the proposed method with respect to the complexity of the TBP network and its accuracy. To test the difference between the results of two classifiers over various datasets, a paired t test was used, which determines whether the average difference in their performance over the datasets is significantly different from zero. In addition, the Wilcoxon signed-ranks test was used to detect significant differences between the behaviors of the pair of algorithms.
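The paired t statistic can be sketched in a few lines. The function below is a generic textbook formulation, not the authors' implementation; its name and the sample values are hypothetical. It returns the statistic and the degrees of freedom, and the p value would then be read from a t distribution table or obtained from a statistics library.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples `a` and `b`."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    # The statistic measures how far the mean difference is from zero,
    # in units of its standard error.
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical paired scores of two methods on three datasets.
t_value, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Here the differences are 1, 2 and 3, so the mean difference is 2, the sample standard deviation is 1, and t equals 2 times the square root of 3 with 2 degrees of freedom.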

Before applying these statistical tests, we checked the normality assumption. The tests used were the robust Jarque–Bera (RJB) test, the Jarque–Bera (JB) test, the SJ test, the popular Shapiro–Wilk (SW) test and the Anderson–Darling (AD) test, which are robust tests for normality with reference to the study [59], as well as the skewness-kurtosis (SK) test. We used these tests to investigate whether the accuracy and complexity values are normally distributed. Table 8 shows the results of these tests, which indicate that the accuracy results of MAMOT and MOGAT are consistent with the normality assumption. Furthermore, the results indicate that the complexity results of MAMOT are normally distributed, while those of MOGAT violate the normality assumption. Based on these results, we used a paired sample t test for the difference in accuracy between MAMOT and MOGAT and the Wilcoxon signed-ranks test for the difference in complexity.
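For the non-parametric case, the Wilcoxon signed-ranks statistic can be sketched as follows. This is a generic formulation that drops zero differences and assigns average ranks to ties; the function name and the hidden-node counts are hypothetical. Only the rank sums are computed; the p value would come from a table or a statistics library.

```python
def wilcoxon_signed_rank(a, b):
    """Return (W_plus, W_minus), the rank sums of positive and negative differences."""
    # Zero differences carry no sign information and are dropped.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of differences tied in absolute value.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ordered) if d < 0)
    return w_plus, w_minus


# Hypothetical hidden-node counts of two methods on five datasets.
w_plus, w_minus = wilcoxon_signed_rank([5, 3, 6, 2, 4], [7, 4, 3, 2, 5])
```

The smaller of the two rank sums is commonly reported as the test statistic.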

MAMOT and MOGAT were compared using the t test for the difference in accuracy and the Wilcoxon signed-ranks test for the difference in complexity. For accuracy, the null hypothesis is that there is no difference between the average accuracy of MAMOT and that of MOGAT. The results of the paired t test are shown in Table 9. The p value of the t test is less than the significance level α = 0.05, which implies the rejection of the null hypothesis. Therefore, the difference is significant, and MAMOT was significantly better than MOGAT. Likewise, the p value of the Wilcoxon signed-ranks test for the difference in complexity was lower than α, which also implies the rejection of the null hypothesis. Thus, the results in Table 9 show that there were statistically significant differences between MAMOT and MOGAT.

4.5 Comparisons with other studies

The performance of the proposed method can be compared with MOGAT, with other memetic and multi-objective genetic algorithm-based ANN algorithms from the literature that used the same datasets, and with some baseline methods: HMOEN L2 and HMOEN HN [57], MPENSGA2E and MPENSGA2S [18], MEPGANf1f2 and MEPGANf1f3 [54], and SVM [54]. Table 10 and Fig. 10 summarize the comparison results. Our proposed method, MAMOT, is the best of all of the methods reported in Table 10 on all of the datasets except diabetes, iris and liver: MPENSGA2E [18] is better than our algorithm on diabetes, while HMOEN L2 and HMOEN HN [57] are better than our algorithm on the iris and liver datasets, respectively. Additionally, on the breast cancer data, MAMOT achieved a better accuracy value than the other methods.

Table 10 Classification accuracies of the proposed method and some of the other methods in the literature
Fig. 10 Performance comparisons of the proposed and existing methods on the testing set for all of the datasets

5 Conclusions

This paper introduces a new memetic adaptive multi-objective evolutionary algorithm based on the TBP network, MAMOT, which optimizes the TBP network parameters by using an adaptive non-dominated sorting genetic algorithm, ANSGA-II. The memetic process introduces the BP algorithm as a local optimizer hybridized with ANSGA-II, which is used to enhance all of the individuals in the population. In addition, the self-adaptive SBX crossover is used to improve the performance of ANSGA-II. The algorithm simultaneously optimizes the neural network parameters, specifically the connection weights, error rate and structure in terms of the number of nodes in the hidden layer. MAMOT not only helps to improve the classification accuracy, but also automatically designs and reduces the network structure during the classification phase of the neural classifier. To assess the performance of MAMOT, experiments were conducted on 11 classification datasets: 10 datasets obtained from the UCI repository and one complex environmental problem from qualitative analytical chemistry. The experimental results show that MAMOT was able to obtain a TBP network with better classification accuracy and a simpler network structure on the classification tasks compared with the other algorithms found in the literature. Based on the evaluation and the statistical tests conducted, it can be concluded that the proposed MAMOT is suitable to be employed as a classifier for classification problems. In future work, we plan to integrate a preferential local search with adaptive weights, as proposed in [58], to improve the performance of the algorithm and obtain better results. Our future work will also investigate the effectiveness of the proposed method on larger datasets. In addition, we plan to evaluate the performance of the proposed method on other types of artificial neural networks.