1 Introduction

Radial basis function (RBF) networks are among the most popular and widely applied types of neural networks. RBF networks are universal approximators and can be considered a special form of multilayer feedforward neural network that contains only one hidden layer with Gaussian-based activation functions. RBF networks were first introduced by Broomhead and Lowe [8] with a strong foundation in conventional approximation theory [13, 16].

The advantages of RBF networks compared to other neural networks include their high generalization capability, simple and compact structure (i.e., only three layers), easier parameter adjustment, very good noise tolerance and high learning speed [26, 60]. Due to these advantages, RBF networks have become a common alternative to MLP networks [16]. Moreover, RBF networks have been successfully applied in many areas such as system identification [2, 36], process fault classification [31], nonlinear control [30] and time series forecasting [11, 18].

Like other neural networks, RBF networks have two major components: the structure and the training method. The training method has a significant influence on the performance of the network. In the literature, researchers have proposed and investigated a wide variety of learning schemes for RBF networks.

Castaño et al. [9] classified the training methods of RBF networks into two categories: quick learning and full learning. The quick learning methods are more popular; in them, the learning process is performed in two independent stages. In the first stage, the structure of the network (i.e., the centers and widths) is identified, usually by an unsupervised learning algorithm such as K-Means, while in the second stage the connection weights between the hidden and output layers are tuned using least mean squares (LMS), gradient-based methods or variations of the backpropagation algorithm [42]. The drawback of using an unsupervised technique to locate the centers is that it depends only on the input features and does not consider the distribution of the class labels [50]. Moreover, using the common K-Means algorithm does not necessarily guarantee that the centers are best located [33]. On the other hand, the main issue with gradient methods is that the search process is highly likely to be trapped in a local minimum. Moreover, Vakil-Baghmisheh and Pavešić reported in [48] that applying a customized version of the backpropagation algorithm to RBF networks can suffer from drawbacks such as slow convergence and over-training, which consequently affects the generalization ability of the model. Alternatively, the full learning methods optimize all RBF parameters simultaneously as a supervised task.

Nature-inspired metaheuristic algorithms have been widely investigated for evolving and training RBF networks. These algorithms are stochastic search techniques inspired by natural systems and phenomena. Most nature-inspired metaheuristics are population based and rely on randomness as an essential principle of their process. The advantages of these search methods are their flexibility, self-adaptation, conceptual simplicity and ability to search for a global optimum rather than a local one [17]. Nature-inspired metaheuristics have been deployed in different ways for training RBF networks. Some were applied to find one parameter of the network, such as the centers [28], others optimized all the parameters [43], while others investigated optimizing all the parameters along with the structure of the network. Algorithms applied to RBF network training include: Genetic Algorithms (GA) [6, 15, 21], particle swarm optimization (PSO) [25, 37, 40, 47], Ant Colony Optimization (ACO) [12, 45], Differential Evolution (DE) [4, 14, 58], the Firefly algorithm (FFA) [24], Cuckoo search (CS) [3, 10], Honey Bee Mating Optimization (HBMO) [23], Artificial Bee Colony (ABC) [29] and the BAT algorithm [46].

According to the No Free Lunch (NFL) theorem, no heuristic algorithm is guaranteed to perform better than all other algorithms on all optimization problems [7, 22, 49]. Motivated by this, in this work we propose a novel RBF training algorithm based on the recent biogeography-based optimizer (BBO) for optimizing the parameters (centers, widths and weights) of the RBF network simultaneously. BBO is an evolutionary algorithm developed by Simon [44]. It was inspired by studies of the geographical distribution of biological organisms in terms of time and space. Recently, BBO has been applied to training neuro-fuzzy networks [38] and feedforward MLPs [35] and has shown high modeling capability. However, to the best of our knowledge, there is no previous work investigating the efficiency of the well-regarded BBO algorithm in training any type of RBF network.

In order to evaluate the efficiency and effectiveness of the new BBO trainer, the proposed trainer is applied to twelve popular benchmark datasets selected from the UCI machine learning repository. The results of the proposed BBO trainer are compared with those obtained by eleven other algorithms. Six of the eleven are classical evolutionary algorithms: GA, PSO, DE, Evolutionary Strategy (ES), ACO and population-based incremental learning (PBIL). Four are recent nature-inspired algorithms: FFA, CS, ABC and the BAT algorithm. The eleventh algorithm is a hybrid two-stage training algorithm based on K-Means and gradient descent optimization.

This paper is organized as follows: in Sect. 2 a description of the RBF network and its classical two-phase learning approach is given. In Sect. 3 the BBO algorithm is explained. Section 4 describes in detail the developed BBO-based approach for training RBF networks. Experimental results are outlined in Sect. 5. Finally, the findings and remarks of this work, as well as future works, are summarized in Sect. 6.

2 Radial basis function neural networks

An RBF neural network is a special type of fully connected feedforward network that consists of only three layers: the input, hidden and output layers. The number of neurons in the input layer depends on the number of dimensions of the input vector, whereas the number of neurons in the output layer depends on the number of class labels in the data. The number of neurons in the hidden layer determines the topology of the network, which in turn determines the decision boundary between data clusters. Each hidden neuron has an RBF activation function that calculates the similarity between the input and a prototype stored in that neuron. Having more prototypes results in a more complex decision boundary, which can yield higher accuracy; however, it requires more computation to evaluate the network.

Figure 1 shows the structure of an RBF network in comparison with a multilayer perceptron (MLP) network. Inspecting this figure, it can be seen that the arrows between the input layer and the hidden neurons in the RBF network represent the Euclidean distance between the input vector and the prototypes stored in the hidden neurons. In contrast, in the MLP the arrows between layers represent weights. Moreover, in RBF networks the activation functions in the hidden nodes are Gaussian basis functions, while in MLPs sigmoidal functions are typically used.

Fig. 1 Representation of ANN networks. a RBF networks. b MLP networks

The RBF network operates as follows. First, the input data enter the network through the input layer. Then, each neuron in the RBF (hidden) layer calculates the similarity between the input data and the prototype stored inside it using the nonlinear Gaussian function shown in Eq. 1.

$$\begin{aligned} \phi (\left\| x-c_{j}\right\| )=\exp \left( -\frac{\left\| x-c_{j}\right\| ^{2}}{2\sigma _{j}^{2}}\right) \end{aligned}$$
(1)

where \(\left\| x-c_{j}\right\| \) is the Euclidean norm, \(c_{j}\) is the prototype (center) of hidden neuron j and \(\sigma _{j}\) is its width.

The output of the RBF network is calculated as a weighted sum of the hidden-layer activations, as given by the following equation:

$$\begin{aligned} y_{i}=\sum _{j=1}^{n}\omega _{ji}\phi _{j}(x) \end{aligned}$$
(2)

where \(\omega _{ji}\) represents the weight connecting hidden node j to output unit i, and n represents the number of hidden nodes.

The output of an RBF neuron is closer to 1 when the similarity between the input and the prototype is high, and close to zero otherwise. Each output-layer neuron takes the weighted sum of all RBF neuron outputs in order to decide the class label, which means that every RBF neuron contributes to the labeling decision, with higher similarity giving a larger contribution.
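To make the forward computation of Eqs. 1 and 2 concrete, the following minimal NumPy sketch (our own illustration; the variable names centers, widths and weights are assumptions, not part of the original formulation) evaluates an RBF network for a single input vector:

```python
import numpy as np

def rbf_forward(x, centers, widths, weights):
    """Forward pass of an RBF network for one input vector x.

    x       : (I,)   input features
    centers : (n, I) prototype c_j stored in each hidden neuron
    widths  : (n,)   width sigma_j of each hidden neuron
    weights : (n, m) weights w_ji between hidden and output layers
    returns : (m,)   network outputs y_i (Eq. 2)
    """
    # Eq. 1: Gaussian activation of each hidden neuron
    dist_sq = np.sum((centers - x) ** 2, axis=1)      # ||x - c_j||^2
    phi = np.exp(-dist_sq / (2.0 * widths ** 2))      # phi_j(x)
    # Eq. 2: weighted sum of the hidden activations for every output unit
    return phi @ weights
```

An input lying close to a prototype drives the corresponding \(\phi _{j}\) toward 1, so that neuron dominates the weighted sum, in line with the interpretation given above.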

2.1 Classical radial basis function network training

Classical RBF training depends mainly on three elements: the prototypes inside the RBF neurons and how they are chosen, the beta value in the similarity equation, and the weights between the hidden and output layers (which determine the final decision). Choosing the prototypes can be done in several ways, such as selecting random data points from the data or using the K-Means clustering algorithm and taking the cluster centers as prototypes, among other approaches. Using K-Means clustering is the most common approach in the literature, as it helps to choose a small number of RBF neurons (K neurons) in an informed way, where each neuron represents a cluster in the data. Moreover, using only K neurons keeps the complexity of the RBF network low without degrading the accuracy of the classification decision.

The beta coefficient in the RBF activation function controls the width of the bell curve and should be determined so as to optimize the fit between the activation function and the data. When K-Means is used to choose the RBF neuron prototypes, beta can be set using the following equation:

$$\begin{aligned} \mathrm{Beta}=\frac{1}{2 \times \sigma ^{2}} \end{aligned}$$
(3)

where \(\sigma \) is the average distance between the points in the cluster and the cluster center.

The third important element to set for the RBF network to work well is the output weights. Training these weights can be done using gradient descent, an optimization technique that takes the outputs of the RBF neurons as input and optimizes the weights accordingly. Gradient descent must be run separately for each output node. The following subsections give more details about the K-Means and gradient descent methods, which are selected in this work as the classical approach for optimizing the connection weights.

2.1.1 K-Means

K-Means is considered one of the most efficient clustering algorithms and is used in many applications in the literature. K-Means clustering has several advantages, such as simplicity of implementation and good performance on large datasets. K-Means is a partitioning clustering algorithm whose objective is to maximize the similarity between the members of each cluster and minimize the similarity between members of different clusters. The main idea of K-Means clustering is to define k centers, one for each cluster. The data points are assigned to the proper cluster based on the minimum distance to all cluster centers. After that, the cluster centers are updated iteratively by calculating the mean of each cluster's members in order to minimize the squared error function. This process continues until the centers no longer change.
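As a concrete illustration of this first training stage, the sketch below (assuming scikit-learn is available; the helper name init_rbf_layer is ours) clusters the training inputs with K-Means, uses the cluster centers as prototypes and sets each width \(\sigma \) to the average distance between the cluster members and their center, from which beta follows via Eq. 3:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_rbf_layer(X, k):
    """First training stage: choose k RBF prototypes and widths from the data X."""
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    centers = km.cluster_centers_                     # one prototype per hidden neuron

    sigmas = np.empty(k)
    for j in range(k):
        members = X[km.labels_ == j]
        d = np.linalg.norm(members - centers[j], axis=1)
        # sigma_j = average distance of the cluster members to their center
        sigmas[j] = d.mean() if d.size and d.mean() > 0 else 1.0  # guard for degenerate clusters

    betas = 1.0 / (2.0 * sigmas ** 2)                 # Eq. 3
    return centers, sigmas, betas
```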

2.1.2 Gradient descent

Gradient descent (GD) is an optimization algorithm that uses first-order derivative information to find a local minimum of a function. The algorithm applies consecutive steps along the negative gradient of the objective function at the current point. The output of an RBF network can be represented as shown in Eqs. 4 and 5, while the error function E is given in Eq. 6, where \(\hat{y}_{i,k}\) is the response value of the ith output unit and \(y_{i,k}\) is the actual response. The gradient descent algorithm can be used to find the solution matrix W using the update rule in Eq. 7, where \(\eta \) is a small decreasing value called the learning rate.

$$\begin{aligned} \hat{y}=(y_{1},y_{2},\ldots ,y_{m})=\left[ \begin{array}{cccc} \omega _{11} & \omega _{12} & \cdots & \omega _{1m}\\ \omega _{21} & \omega _{22} & \cdots & \omega _{2m}\\ \vdots & \vdots & \ddots & \vdots \\ \omega _{I1} & \omega _{I2} & \cdots & \omega _{Im} \end{array}\right] \left[ \begin{array}{c} \phi _{1}(x)\\ \phi _{2}(x)\\ \vdots \\ \phi _{m}(x) \end{array}\right] \end{aligned}$$
(4)
$$\begin{aligned} O = W \cdot H \end{aligned}$$
(5)
$$\begin{aligned} E=\frac{1}{2}\sum _{k=1}^{M}\sum _{i=1}^{L}(y_{i,k}-\hat{y}_{i,k})^{2} \end{aligned}$$
(6)
$$\begin{aligned} \omega _{ij}=\omega _{ij}-\eta \frac{\partial E}{\partial \omega _{ij}} \end{aligned}$$
(7)

In this work, the conjugate gradient (CG) method is used to optimize the weights in the standard RBF network. CG is a special type of gradient descent with regularization that is used to compute search directions. The CG method uses a line search with quadratic and cubic polynomial approximations. The stopping criterion used in CG is the Wolfe–Powell condition, and CG estimates the initial step sizes using the slope ratio method.
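For illustration, the sketch below implements the plain batch gradient descent update of Eq. 7 for the output weights (this work actually uses the conjugate gradient variant described above; the function name and the averaging over samples are our assumptions):

```python
import numpy as np

def train_output_weights(Phi, Y, eta=0.01, epochs=1000):
    """Second training stage: fit the hidden-to-output weights by gradient descent (Eqs. 6-7).

    Phi : (M, n) hidden-layer activations phi_j(x_k) for the M training samples
    Y   : (M, m) desired outputs
    """
    n, m = Phi.shape[1], Y.shape[1]
    W = np.zeros((n, m))                              # weights to be learned

    for _ in range(epochs):
        Y_hat = Phi @ W                               # network outputs (Eq. 5)
        grad = Phi.T @ (Y_hat - Y) / len(Phi)         # gradient of the squared error (Eq. 6), averaged
        W -= eta * grad                               # update rule of Eq. 7
    return W
```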

3 Biogeography-based optimization (BBO)

Evolutionary Algorithms (EAs) belong to the class of stochastic population-based algorithms. As the name implies, such techniques approximate the global optimum of optimization problems using stochastic operators. The optimization process starts with a set of random solutions as candidate solutions for a given problem. This set is then evolved using mechanisms defined by the algorithm to find a better approximation of the global optimum. This framework is common to all EAs, although they differ in the mechanisms used to evolve the solutions.

One of the most recent and well-regarded EAs proposed in the literature is the biogeography-based optimization (BBO) algorithm [44]. This algorithm mimics evolutionary phenomena studied in the field of biogeography to solve optimization problems. The main inspiration of BBO is the way nature balances predator and prey populations through migration within and between habitats.

In BBO, each solution represents a habitat and each variable in the solution represents a habitant (prey or predator). The objective function is called the Habitat Suitability Index (HSI), which indicates how suitable a habitat is. The following rules are considered in order to simulate the evolution of habitats and habitants in nature:

  1. Habitants in any habitat face mutation regardless of their HSI.

  2. Habitants in a habitat with better HSI are more likely to migrate to habitats with worse HSI.

  3. Habitants in a habitat with worse HSI are more likely not to migrate.

  4. Migration is always from a better habitat to a worse habitat.

  5. Each habitat has an immigration rate and an emigration rate, which define the rate of migration to or from other habitats.

Migration between habitats is simulated by exchanging the variables of solutions. In the BBO algorithm, each habitat has different emigration and immigration rates to simulate habitats with different characteristics in nature. With constant migration rates between habitats, BBO would not be able to balance exploration and exploitation. Therefore, the algorithm is equipped with the following adaptive immigration and emigration rates:

$$\begin{aligned} \mu _k=\frac{E \times n}{N} \end{aligned}$$
(8)
$$\begin{aligned} \lambda _k=I\left( 1-\frac{n}{N}\right) \end{aligned}$$
(9)

where n is the number of habitants, N is the maximum number of habitants allowed, E is the maximum emigration rate, and I is the maximum immigration rate. The mutation rate is also required to change adaptively as follows:

$$\begin{aligned} m_n = \textit{M}\left( 1-\frac{p_n}{p_{\mathrm{max}}} \right) \end{aligned}$$
(10)

where M is the initial mutation value, \(p_n\) is the probability of solution n, and \(p_{\mathrm{max}}\) is the maximum of these probabilities.
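The following simplified Python sketch shows how these rules translate into code. It uses the standard rank-based species counts to instantiate Eqs. 8 and 9 and a constant mutation rate in place of the adaptive rate of Eq. 10; all names and simplifications are ours, not the reference implementation:

```python
import numpy as np

def bbo_minimize(fitness, dim, n_hab=50, iters=250, lb=-1.0, ub=1.0,
                 E=1.0, I=1.0, m_rate=0.05):
    """Minimal BBO loop: migration with adaptive rates (Eqs. 8-9) plus mutation."""
    rng = np.random.default_rng()
    pop = rng.uniform(lb, ub, size=(n_hab, dim))      # random habitats
    cost = np.array([fitness(h) for h in pop])

    for _ in range(iters):
        order = np.argsort(cost)                      # best habitat first
        pop, cost = pop[order], cost[order]

        species = n_hab - np.arange(n_hab)            # rank-based species counts (best habitat has most)
        mu = E * species / n_hab                      # emigration rate, high for good habitats (Eq. 8)
        lam = I * (1.0 - species / n_hab)             # immigration rate, high for poor habitats (Eq. 9)

        new_pop = pop.copy()
        for i in range(n_hab):
            for d in range(dim):
                if rng.random() < lam[i]:             # habitat i accepts an immigrant variable
                    j = rng.choice(n_hab, p=mu / mu.sum())   # emigrating habitat, roulette on mu
                    new_pop[i, d] = pop[j, d]
                if rng.random() < m_rate:             # mutation (Eq. 10 would adapt this rate)
                    new_pop[i, d] = rng.uniform(lb, ub)

        new_pop[0] = pop[0]                           # elitism: keep the best habitat unchanged
        pop = new_pop
        cost = np.array([fitness(h) for h in pop])

    best = int(np.argmin(cost))
    return pop[best], cost[best]
```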

A significant number of works in the literature show that the BBO algorithm is able to solve optimization problems effectively. This is due to the high exploration of this algorithm, which originates from the migration mechanism between the habitats. The migration mechanism abruptly changes the solutions, which helps BBO to avoid local solutions and to find an accurate approximation of the global optimum for challenging problems. This motivated our attempt to propose a trainer based on BBO to train RBF networks for the first time in the literature.

4 Biogeography-based optimization for training radial basis function networks

In contrast to the classical approach, where RBF networks are trained in two independent phases, our proposed BBO-based approach searches for all RBF network parameters simultaneously. The parameters are the centers, widths and connection weights, including the bias terms. In the proposed training algorithm, each habitat is encoded to represent these parameters as shown in Fig. 2, where \(C_{i}\) is the center of hidden neuron i, \(\sigma _{i}\) is the width of that neuron and \(\omega _{ij}\) is the weight connecting neuron i to output unit j. Habitats are implemented as real vectors of length D, which is calculated as follows: suppose that n is the number of hidden neurons, I is the number of features in the dataset and m is the number of output units; then D is given by Eq. 11.

$$\begin{aligned} D = (n \times I)+ n + (n \times m) + m \end{aligned}$$
(11)

In order to evaluate the fitness value (HSI) of the habitats (candidate RBF networks), the mean squared error (MSE) is calculated over all training samples for each habitat. The MSE is given in Eq. 12, where \(y_{i}\) is the actual output and \(\hat{y}_{i}\) is the estimated output for the ith training sample, and k is the total number of instances in the training dataset.

$$\begin{aligned} \mathrm{MSE} = \frac{1}{k} \sum _{i=1}^{k}(y_{i} - \hat{y}_{i})^{2} \end{aligned}$$
(12)
Fig. 2 Representation of the BBO individual (habitat) structure
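To illustrate the encoding and the fitness evaluation, the sketch below (our own illustration; the helper names are assumptions) decodes a habitat vector of length D from Eq. 11 into the RBF parameters and scores it with the MSE of Eq. 12:

```python
import numpy as np

def habitat_length(n, I, m):
    """D = (n x I) + n + (n x m) + m: centers, widths, weights and output biases (Eq. 11)."""
    return n * I + n + n * m + m

def decode_habitat(h, n, I, m):
    """Split a flat habitat vector into centers (n, I), widths (n,), weights (n, m) and biases (m,)."""
    i = 0
    centers = h[i:i + n * I].reshape(n, I); i += n * I
    widths  = h[i:i + n];                   i += n
    weights = h[i:i + n * m].reshape(n, m); i += n * m
    biases  = h[i:i + m]
    return centers, widths, weights, biases

def mse_fitness(h, X, Y, n, I, m):
    """HSI of a habitat: MSE of the encoded RBF network over the training set (Eq. 12)."""
    centers, widths, weights, biases = decode_habitat(h, n, I, m)
    err = 0.0
    for x, y in zip(X, Y):
        phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * widths ** 2 + 1e-12))
        err += np.sum((y - (phi @ weights + biases)) ** 2)
    return err / len(X)
```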

Based on the encoding scheme and the fitness evaluation described above, the BBO algorithm is designed to train RBF networks as described in the flowchart in Fig. 3. The figure shows that BBO first creates a set of random candidate solutions, i.e., RBF networks with random centers, widths, connection weights and biases. The algorithm then repeatedly calculates the MSE of all the RBF networks when classifying the training data. The MSE indicates which “random” RBF is better. Based on the rules discussed above, the BBO algorithm creates a new set of RBF networks from the best RBF networks found so far. The process of calculating MSEs and improving the RBFs continues until the end criterion is satisfied, which could be an error threshold or a maximum number of iterations. It should be noted that, for each RBF network, the MSE is calculated by classifying all training samples in the dataset. Therefore, the computational complexity is O(ntd), where n is the number of candidate RBF networks, t is the maximum number of iterations, and d is the number of training samples in the dataset.

Fig. 3 RBF networks using BBO trainer flowchart
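Putting the pieces together, the training loop of Fig. 3 amounts to running a BBO optimizer over habitat vectors with the MSE fitness above. The snippet below only wires up the illustrative helpers sketched earlier (bbo_minimize, habitat_length, mse_fitness, decode_habitat); X_train and Y_train stand for hypothetical training arrays:

```python
# Illustrative wiring of the sketches above; all names are our assumptions.
n = 6                                                 # hidden neurons
I, m = X_train.shape[1], Y_train.shape[1]             # features and output units
D = habitat_length(n, I, m)                           # Eq. 11

best_habitat, best_mse = bbo_minimize(
    fitness=lambda h: mse_fitness(h, X_train, Y_train, n, I, m),
    dim=D, n_hab=50, iters=250)                       # population and iterations as in Sect. 5.1

centers, widths, weights, biases = decode_habitat(best_habitat, n, I, m)
```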

5 Experiments and results

In this section, the BBO training algorithm is evaluated on twelve datasets to verify the power of BBO for RBF neural network training. Furthermore, a comprehensive comparison of BBO with ten other well-known metaheuristic algorithms is conducted. The metaheuristic algorithms used in this experiment are GA, PSO, ACO, ES, PBIL, DE, Firefly, Cuckoo search, ABC and the BAT algorithm, which are the most common metaheuristic-based trainers for RBF networks in the literature. In addition, the BBO trainer is compared with the RBFclassic (gradient-based) technique, which is considered the standard method for training RBF neural networks.

5.1 Experimental setup

MATLAB R2010b is used to implement the proposed BBO trainer and the other algorithms. All datasets are divided into 66 % for training and 34 % for testing. Ten independent runs are executed for each experiment, with 250 iterations per run. Moreover, the population size is fixed to 50 individuals for all algorithms. The parameter settings for each algorithm are shown in Table 1.

In CS, besides the population size, the discovery rate \(p_{\alpha }\) is the only parameter that needs to be tuned. \(p_{\alpha }\) is set to 0.25, since it was stated in [55] that this value is sufficient for most optimization problems. For Firefly, beta is set to 1, as the parametric studies reported in [56] suggest that this value can be used for most applications, while gamma can be set to \(1/\sqrt{L}\), where L is a scaling factor; if the scaling variations are not significant, gamma can be set to O(1). Alpha is roughly tuned and set to 0.2. The same values were used in previous studies such as [52, 53].

For PSO, the acceleration constants are typically set to \(\approx\)2 [1, 57]. We also use a linearly decreasing strategy to update the inertia weight in the interval [0.9, 0.6]. It has been found experimentally in the literature that this strategy improves the efficiency and performance of PSO, achieving excellent results [5, 51].

For GA, the crossover probability is usually set to a high rate, while the mutation probability is set to a low rate [19]. With rough tuning, the crossover and mutation probabilities are set to 0.9 and 0.1, respectively. For DE, the DE/rand/1/bin variant is applied with the crossover probability and differential weight equal to 0.9 and 0.5, as applied and recommended in [34, 59].

For ACO, ES and PBIL, all parameters are set as used in [35, 44], and for ABC and BAT, the default parameters are used [27, 54].

For BBO, we used the same parameters as in [35, 44]: the habitat modification probability is set to 1, the immigration probability bounds per gene are [0, 1], the step size is set to 1, the maximum immigration and emigration rates for each island are set to 1, and the mutation probability is set to 0.05 as in [35].

However, it is worth mentioning that finding the best parameters of these algorithms is itself another optimization problem, known as meta-optimization. Therefore, fine-tuning the optimization algorithms is out of the scope of this work [39]. All datasets are normalized to the interval [0, 1].
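For completeness, normalizing each feature to [0, 1] can be done with a standard min-max transform; the snippet below is a generic sketch rather than the authors' code:

```python
import numpy as np

def minmax_normalize(X):
    """Scale every column of X to the interval [0, 1]; constant columns are mapped to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0                   # avoid division by zero for constant features
    return (X - col_min) / col_range
```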

Table 1 The metaheuristic algorithms with initial parameters

An RBF network with a large number of neurons in the hidden layer may achieve good results on the training data; however, this can lead to poor generalization [41]. In our experiments, we assess the performance of the proposed training algorithm with different numbers of neurons in the hidden layer: 4, 6, 8 and 10.

In our experiments, we use five measurements to evaluate the developed RBF network models: accuracy, specificity, sensitivity, complexity and MSE. MSE was given previously in Eq. 12, while the others are calculated using Eqs. 13, 14, 15 and 16, respectively. Accuracy, specificity, sensitivity and MSE assess the prediction quality, while the complexity measure reflects the network structure complexity based on the number of neurons.

$$\begin{aligned} \mathrm{Accuracy}=\frac{\text {Number of correctly classified instances}}{\text {Total number of instances}} \end{aligned}$$
(13)
$$\begin{aligned} \mathrm{Specificity}=\frac{\text {Number of correctly predicted negative-class instances}}{\text {Number of actual negative instances}} \end{aligned}$$
(14)
$$\begin{aligned} \mathrm{Sensitivity}=\frac{\text {Number of correctly predicted positive-class instances}}{\text {Number of actual positive instances}} \end{aligned}$$
(15)
$$\begin{aligned} \mathrm{Complexity}=\frac{1}{2} \sum _{i=1}^{|w|}(w_i)^{2} \end{aligned}$$
(16)

where \(|w| = 2(n+1)\) and n is the number of neurons.
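For binary labels, the measures of Eqs. 13-15 can be computed from the predictions as in the generic sketch below (variable names are our own):

```python
import numpy as np

def classification_measures(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels in {0, 1} (Eqs. 13-15)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))        # correctly predicted positives
    tn = np.sum((y_pred == 0) & (y_true == 0))        # correctly predicted negatives
    accuracy = np.mean(y_pred == y_true)              # Eq. 13
    sensitivity = tp / max(np.sum(y_true == 1), 1)    # Eq. 15
    specificity = tn / max(np.sum(y_true == 0), 1)    # Eq. 14
    return accuracy, sensitivity, specificity
```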

5.2 Datasets description

The proposed BBO trainer is evaluated using twelve well-known real datasets selected from the UCI repository [32]. All datasets contain two classes. Table 2 describes these datasets in terms of the number of features, training samples, and testing samples, as well as the accuracy of the baseline classifier for each dataset. The baseline classifier is the Zero Rule classifier (ZeroR for short), the simplest classifier, which relies only on the output class by simply predicting the majority class.

Table 2 Summary of the classification datasets

5.3 Results

The proposed BBO trainer is evaluated by comparing its results with the standard RBFclassic and ten other metaheuristic trainers (GA, PSO, ACO, ES, PBIL, DE, FF, Cuckoo, ABC and BAT) using the accuracy, sensitivity, specificity, MSE and complexity evaluation measures.

Table 3 shows the results in terms of the average accuracy (AVE) and standard deviation (STD), as well as the best accuracy, of the proposed BBO and the other algorithms on the Blood dataset. The table reports the results for different numbers of neurons in the hidden layer. The best accuracy results are highlighted in bold. According to the AVE, STD, and best results using 4 neurons, BBO is able to classify 77.1 % of the test samples, which is higher than the PSO, ACO, ES, PBIL, DE and ABC results and only slightly different from GA, Firefly, Cuckoo, BAT and RBFclassic. Furthermore, BBO outperforms all other methods using 6 and 8 neurons, with accuracy rates of 77.29 and 77.45 %, respectively. In addition, the accuracy results of RBFclassic, BAT and BBO using 10 neurons are very close for this dataset, and these three algorithms outperform the other methods.

Table 3 The average accuracy, and standard deviation results of the Blood dataset using different algorithms

The accuracy results of BBO and the other optimizers for the Breast cancer dataset are presented in Table 4. According to the AVE, STD, and best results, BBO outperforms all other methods using 4, 6 and 10 neurons. Moreover, BBO is able to classify 96.55, 97.61, 96.97, and 97.86 % of the test samples using 4, 6, 8, and 10 neurons, respectively.

Table 4 The average accuracy and standard deviation results of the Breast dataset using different algorithms

Table 5 presents the accuracy results for the Hepatitis dataset. The results of BBO are significantly better than those of all the other algorithms using 6 neurons. Moreover, BBO obtains better results than most of the algorithms using 4, 8, and 10 neurons as well.

Table 5 The average accuracy and standard deviation results of the Hepatitis dataset using different algorithms

The accuracy results of BBO and the other training algorithms on the Diabetes and Vertebral datasets are presented in Tables 6 and 7, respectively. According to the AVE, STD, and best results using 4, 6, 8, and 10 neurons, BBO outperforms most of the other methods, with the exception of RBFclassic, which has better accuracy. These results support the merits of the proposed BBO algorithm in training RBF networks.

Table 6 The average accuracy and standard deviation results of the Diabetes dataset using different algorithms
Table 7 The average accuracy and standard deviation results of the Vertebral dataset using different algorithms

The accuracy results of BBO and the other algorithms on the Diagnosis I and Diagnosis II datasets are presented in Tables 8 and 9, respectively. According to the AVE, STD, and best results, the results of BBO on Diagnosis I are significantly better than those of the other algorithms using 4, 6, and 8 neurons. Furthermore, the BBO results on Diagnosis II are very comparable with those of the other optimizers for different numbers of neurons.

Table 8 The average accuracy and standard deviation results of the Diagnosis I dataset using different algorithms
Table 9 The average accuracy and standard deviation results of the Diagnosis II dataset using different algorithms

Tables 10 and 11 show the accuracy results of BBO and the other algorithms on the Parkinsons and Liver datasets, respectively. The accuracy results of BBO on these two datasets outperform those of all other algorithms for all numbers of neurons.

Table 10 The average accuracy and standard deviation results of the Parkinson dataset using different algorithms
Table 11 The average accuracy and standard deviation results of the Liver dataset using different algorithms

The accuracy results of BBO and the other training algorithms on the Sonar and German datasets are presented in Tables 12 and 13, respectively. According to both sets of results using 4, 6, 8, and 10 neurons, BBO comes second after the RBFclassic method and outperforms all other metaheuristic methods.

Table 12 The average accuracy and standard deviation results of the Sonar dataset using different algorithms
Table 13 The average accuracy and standard deviation results of the German dataset using different algorithms

The accuracy results of BBO and the other optimizers for the Australian dataset are presented in Table 14. According to the results using 4, 6, 8, and 10 neurons, BBO has superior classification accuracy compared with the other methods. Moreover, BBO is able to classify 85.32, 84.98, 84.51, and 85.11 % of the test samples using 4, 6, 8, and 10 neurons, respectively.

Table 14 The average accuracy and standard deviation results of the Australian dataset using different algorithms
Table 15 The average sensitivity and specificity results of all datasets using different algorithms with (4 Neurons)
Table 16 The average sensitivity and specificity results of all datasets using different algorithms with (6 Neurons)
Table 17 The average sensitivity and specificity results of all datasets using different algorithms with (8 Neurons)
Table 18 The average sensitivity and specificity results of all datasets using different algorithms with (10 Neurons)

To give better insight into the classification performance for each class label, the specificity and sensitivity are measured and listed in Tables 15, 16, 17 and 18 for RBF networks with 4, 6, 8 and 10 neurons in the hidden layer, respectively. According to these tables, it can be noticed that the RBF networks optimized by BBO have higher and more balanced specificity and sensitivity than most of the other optimizers on the following datasets: Breast cancer, Hepatitis, Vertebral, Diagnosis I, Diagnosis II, Parkinsons, Liver, Sonar and Australian.

To summarize the above results, we note that BBO outperforms most of the other algorithms, which supports the merits of the proposed BBO algorithm in training RBF networks. To support this summary, the Friedman statistical test is applied to check the significance of the accuracy results. The Friedman test is performed by ranking the different trainers (BBO, GA, PSO, ACO, ES, PBIL, DE, Firefly, Cuckoo, ABC, BAT and RBFclassic) based on the average accuracy values for each dataset using different numbers of neurons. Table 19 shows the average rank of each technique according to the Friedman test for 4, 6, 8, and 10 neurons. The Friedman test in Table 19 shows that a significant difference exists between the 12 trainers (a lower rank is better). In terms of the Friedman ranking, BBO outperforms the other trainers for all numbers of neurons used.
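The average ranks of Table 19 can be reproduced by ranking the trainers within each dataset and averaging over datasets, for example with SciPy; the sketch below is our own illustration (accuracies are negated so that rank 1 is best):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_ranks(acc):
    """acc: (datasets x algorithms) matrix of average accuracies.
    Returns the mean rank of each algorithm (1 = best) and the Friedman test p-value."""
    acc = np.asarray(acc, dtype=float)
    ranks = np.array([rankdata(-row) for row in acc])  # per-dataset ranks; higher accuracy -> rank 1
    mean_ranks = ranks.mean(axis=0)                    # average rank of each trainer (Table 19)
    _, p_value = friedmanchisquare(*acc.T)             # one score vector per algorithm
    return mean_ranks, p_value
```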

Table 19 The Average ranking results obtained by each algorithm in the Friedman test using all datasets

Figure 4 shows the complexity of the trained RBF networks and the corresponding MSE on the different datasets for each number of neurons in the hidden layer. As shown in all sub-figures, for all datasets BBO achieves relatively the smallest MSE compared with all other algorithms except RBFclassic. BBO outperforms all algorithms in terms of complexity, having the smallest complexity values. Moreover, RBFclassic has the smallest MSE values, but its complexity values are the largest, which yields complex RBF network structures with very low smoothness. The complexity results show the merits of the proposed BBO algorithm in producing very smooth RBF networks.

Fig. 4 The MSE versus complexity of the classification of different datasets (with 4, 6, 8, and 10 neurons). Results for a Blood, b Breast, c Diabetes, d Hepatitis, e Vertebral, f Diagnosis I, g Diagnosis II, h Parkinson, i Liver, j Sonar, k German, and l Australian datasets, respectively

Convergence graphs for all algorithms are shown in Figs. 5, 6, 7, and 8 for 4, 6, 8, and 10 neurons, respectively. The convergence curves show the MSE averaged over 10 independent runs across 250 iterations. All sub-figures show that BBO is the fastest algorithm to converge on all datasets. Furthermore, most of the other algorithms, such as DE, ACO, ES, and PBIL, suffer from drawbacks such as becoming trapped in local minima and slow convergence rates. Based on the convergence results, BBO has a superior ability to avoid local optima.

Fig. 5 MSE convergence curves with 4 neurons. Convergence curves for a Blood, b Breast, c Diabetes, d Hepatitis, e Vertebral, f Diagnosis I, g Diagnosis II, h Parkinson, i Liver, j Sonar, k German, and l Australian datasets, respectively

Fig. 6 MSE convergence curves with 6 neurons. Convergence curves for a Blood, b Breast, c Diabetes, d Hepatitis, e Vertebral, f Diagnosis I, g Diagnosis II, h Parkinson, i Liver, j Sonar, k German, and l Australian datasets, respectively

Fig. 7 MSE convergence curves with 8 neurons. Convergence curves for a Blood, b Breast, c Diabetes, d Hepatitis, e Vertebral, f Diagnosis I, g Diagnosis II, h Parkinson, i Liver, j Sonar, k German, and l Australian datasets, respectively

Fig. 8 MSE convergence curves with 10 neurons. Convergence curves for a Blood, b Breast, c Diabetes, d Hepatitis, e Vertebral, f Diagnosis I, g Diagnosis II, h Parkinson, i Liver, j Sonar, k German, and l Australian datasets, respectively

In summary, the algorithms employed in this work can be classified into four groups: random, evolutionary, swarm-based, and gradient-based.

The results show that evolutionary trainers (including BBO) outperform the other groups. This is due to the superior local optima avoidance of these algorithms. Evolutionary algorithms mostly have crossover operators that combine the search agents to create new population(s). Such operators abruptly change the individuals in the population, which emphasizes exploration of the search space and local optima avoidance. The gradient-based technique has the least capability to avoid local optima, which resulted in the worst performance on the test cases. The swarm-based algorithms perform better than the gradient-based algorithm because of their higher local optima avoidance, which mostly originates from their population-based nature. However, such algorithms have less intrinsic exploration ability than evolutionary algorithms because there are fewer sudden changes in the search agents.

The results also show that the evolutionary algorithms employed in this work achieve better accuracy and a faster convergence rate on average. This indicates that the large random changes in the search agents of such algorithms do not negatively impact accuracy or convergence. This originates from the fact that evolutionary algorithms reduce randomness and favor gradual changes through a mechanism called mutation. The mutation operator causes small perturbations and consequently local search around the individuals in the population. In other words, the effects of mutation on the overall population are much smaller than those of crossover operators. This operator assists evolutionary algorithms in improving the accuracy of solutions over the course of iterations. The convergence toward the global optimum is also accelerated by the mutation operator.

Among the swarm-based techniques employed in this work, the BAT algorithm outperforms Cuckoo, Firefly, PSO, ABC and ACO. The BAT algorithm is equipped with a frequency-tuning principle that produces solutions close to the ideal ones. Furthermore, the Cuckoo algorithm gives good results that are very close to those of the BAT algorithm. The Cuckoo algorithm is equipped with a Lévy flight, which abruptly changes its search agents; similarly to evolutionary algorithms, this causes extensive exploration of the search space and significant local optima avoidance. However, the other swarm-based algorithms have fewer operators that promote sudden changes. The ACO algorithm utilizes a pheromone matrix, which mostly boosts exploitation and makes this algorithm more suitable for combinatorial problems. The performance of the PSO and ABC algorithms also depends largely on the distribution of the initial population; these algorithms can easily be trapped in local solutions if the initial population is not well distributed. The Firefly algorithm does not employ a random walk or Lévy flight, which gives it a tendency toward local solutions and makes it less able to avoid them.

In contrast, most of the evolutionary algorithms performed well on the test cases and surpassed the swarm-based techniques. Among them, ES and DE showed the poorest performance. In ES, the selection of individuals is deterministic, which reduces the level of randomness and the exploration ability of this algorithm. The main operators in this algorithm are mutations, which favor exploitation and convergence. These are the main factors that contributed to the failure of this algorithm on the datasets. The same statements can be made for DE, but this algorithm has stochastic selection and more crossover operators, which help it show a better performance than ES. The performance of PBIL is better than that of ES and DE but worse than that of GA and BBO. This is because PBIL performs crossover on the entire population combined in a vector, which provides better exploration and local optima avoidance than ES and DE. However, each individual faces fewer sudden random changes than in GA and BBO.

BBO outperformed GA because the random changes in the individuals are much larger in this algorithm. The GA algorithm assigns a similar reproduction rate to all individuals in the population, which leads to the same crossover rate over the course of generations. In contrast, the BBO algorithm assigns each individual unique emigration and immigration rates. This results in different reproduction rates for each individual and consequently promotes exploration and local optima avoidance. Needless to say, this is the main reason for the significant superiority of the BBO-based trainer over all trainers employed on all datasets in this work.

The results and discussion of this section show that the BBO algorithm is able to effectively alleviate the drawbacks of the current algorithms when training RBF networks in terms of local optima entrapment, result accuracy, and convergence rate.

5.4 Comparisons with traditional classifiers in the literature

In this section, we compare the results of the RBF network optimized by BBO with those of five other popular classifiers from the literature: Naive Bayes (NB), the C4.5 decision tree algorithm (J48), Random Forests (RF), Support Vector Machines (SVM), and the Zero Rule classifier (ZeroR), which is the baseline classifier. We used the Java-based open-source data mining framework Weka for their implementation [20].

Table 20 shows the average accuracy results of NB, J48, RF, SVM, ZeroR, and BBO. It can be seen that the optimized RBF network achieves very competitive results and performs reasonably well. The BBO results for the Breast, Liver and Australian datasets are higher and significantly better than those of all other classifiers. Moreover, the results of BBO for the Parkinson and Blood datasets are very close to those of the other classifiers. Examining these results, we notice that the BBO-RBF classifier achieves better results than the baseline classifier on all datasets, and better results than NB and J48 on 7 and 6 datasets, respectively. Moreover, comparing the BBO-RBF network with very powerful classifiers such as RF and SVM, we can see that it stays competitive, with better accuracy results on 5 datasets.

In summary, the obtained results support the merits of the proposed BBO algorithm in training RBF networks and solving data classification problems.

Table 20 The average accuracy results of BBO-RBF network compared to popular classifiers in the literature

6 Conclusion

This paper proposed the use of the well-regarded BBO algorithm for training RBF networks in order to alleviate the drawbacks of conventional and recent training algorithms: local optima entrapment, low result accuracy, and slow convergence speed. After proposing the BBO-based training method, it was employed to solve 12 well-known datasets and compared to 11 training methods from the literature, including gradient-based, evolutionary, and swarm-based algorithms. The algorithms were compared using a statistical test on RBF networks with different numbers of neurons to confidently confirm the performance of the proposed trainer. The results demonstrated that the BBO algorithm is able to substantially outperform the current techniques on the majority of datasets. According to the results, findings, analysis, and discussion of this paper, the following conclusions can be drawn:

  1. BBO shows a fast convergence speed and high result accuracy.

  2. BBO can avoid local optima in the search space of the RBF network training problem.

  3. BBO is able to train RBF networks effectively to classify different datasets with diverse numbers of features and training samples.

  4. BBO is able to train RBF networks with different numbers of neurons.

For future work, this research is planned to be extended along two main lines. First, the proposed BBO-RBF network could be investigated for other data mining tasks such as regression and time series prediction. Second, we plan to study the efficiency of optimizing the structure of the RBF network along with its widths, centers and weights simultaneously. Since the complexity of the search space is expected to increase, the evaluation will consider complexity and execution time in addition to the prediction accuracy of the optimized models.