1 Introduction

Artificial Neural Networks (ANNs) are among the most useful computational modeling tools in machine learning. Consequently, ANNs have attracted attention in many disciplines, including engineering, medicine, agriculture, technology, business, and the arts. To date, they have been applied as a reliable method to challenging tasks such as classification, image processing, speech recognition, and natural language processing. An ANN provides a parallel computing system composed of many simple processors, inspired by biological neural networks and by organizational principles observed in the human brain. In an ANN, the connection weights that store information are combined in parallel and sequential arrangements, which form the network architecture. In many real-world applications, feed-forward neural networks (FNNs) are the most popular type of neural network.

The concept of learning refers to the process of finding optimal weight values in this architecture. The success of ANNs in problem-solving depends largely on the training of the network and the performance of the learning algorithm used in the training phase [1, 8, 13, 23, 24]. Several learning algorithms have been used in the literature. The most popular training methods are based on mathematical error-reduction techniques such as back-propagation (BP) [6], Gradient Descent (GD) [29], Conjugate Gradient [36], Newton's Method [5], and Levenberg–Marquardt (LM) [12]. However, owing to factors such as nonlinearity and/or high dimensionality of the problems, derivative-based learning algorithms may not always be sufficient for training neural networks.

Many metaheuristic optimization algorithms have been widely used for solving NP-hard problems such as the training of FNNs, since these algorithms establish a balance between exploration and exploitation and can thereby provide optimal solutions in the search space [17]. Some of the most-cited algorithms developed to date are Artificial Bee Colony (ABC) [16], Differential Evolution Algorithms (DEA) [32], Genetic Algorithms (GA) [14], Simulated Annealing (SA) [19], Particle Swarm Optimization (PSO) [18], Grey Wolf Optimizer [21], Firefly Algorithm (FA) [40], Butterfly Optimization Algorithm [3], Biogeography-Based Optimization Algorithm [30], and Gravitational Search Algorithm [28]. Apart from these, more than 250 metaheuristic algorithms have been presented in the literature for various purposes, motivated by the No Free Lunch (NFL) theorem [37].

This paper focuses on the Vortex Search (VS) optimization algorithm [9], which is adapted here as a learning method for the training of FNNs. The VS algorithm was recently proposed for numerical function optimization and uses a multivariate Gaussian distribution to generate candidate solutions. The search space is narrowed over the iterations by means of the inverse incomplete gamma function; thus, exploitation is intensified progressively while exploration dominates the early iterations.

To the best of our knowledge, this is the first study in which the efficiency of VS is investigated for the training of FNNs; accordingly, a VS-based learning method for FNNs (VS-FNN) is proposed in this paper. In order to adapt the VS algorithm to the training phase, the training process is treated as an optimization problem, and all weights and bias values in the FNN architecture are systematically stored in a matrix that is generated and optimized as a candidate solution in the search space. In other words, each weight value in this matrix represents a component of the vortex center point in the algorithm.

1.1 The motivation and contribution

As reported in [9], VS was tested on 50 benchmark mathematical functions and the results were compared with both single-solution-based (SA and Pattern Search) and population-based (PSO and ABC) algorithms. It was reported that VS outperformed, or at least competed with, the other algorithms, which indicates that VS maintains a strong balance between exploration and exploitation. Our motivation for this work is that VS had not previously been investigated for the training of FNNs, although it has been reported to be very competitive against state-of-the-art algorithms.

The efficiency of the VS algorithm is examined in the training of FNNs by comparing it with the ABC, PSO, SA, GA, and SGD algorithms. However, finding the optimal network architecture that yields the lowest possible training error is not within the scope of this study. The aim is to demonstrate that the VS algorithm is as competitive as other metaheuristics in training neural networks and to analyze its performance. To crosscheck the accuracies of the trained FNNs, the algorithms are run on six multi-class datasets: 3-bit XOR, Iris Classification, Wine-Recognition, Wisconsin-Breast-Cancer, Thyroid-Disease, and Pima-Indians-Diabetes. In summary, the contributions of the VS-based learning algorithm can be summarized as follows:

  • To make the first attempt to investigate the VS algorithm for the training of FNNs.

  • To demonstrate the effectiveness of its parameter-free nature, which requires no user-tuned parameters for adjusting the weights of an FNN.

  • To show that it can achieve an accuracy at least as high as that of other successful algorithms while consuming less computation time.

The rest of the paper is organized as follows: related previous works are introduced in Sect. 2. Then, the FNN is briefly defined in Sect. 3. The Vortex Search optimization algorithm is described in Sect. 4. After that, the VS-based learning method is presented in Sect. 5. The experimental results are reported and evaluated in Sect. 6. Finally, the conclusions are given in Sect. 7.

2 Related works

The focus here is on adjusting the optimal weights and biases in an FNN structure. In this respect, training neural networks can be regarded as an important nonlinear optimization problem. In the literature, besides gradient-based methods, a large number of metaheuristics have been presented to train neural networks. Most of the commonly used algorithms are based on evolutionary techniques and swarm intelligence.

One of the first studies in this field was presented by [25], who used a GA-trained FNN to classify data from a sonar image dataset. They reported that GA could achieve better results than BP, particularly on domain-specific problems. In 1998, [35] presented a training method combining gradient descent with SA and claimed that it maintains quick convergence without becoming trapped in local optima during the training process.

In the literature, many papers have been presented on training FNNs with PSO. For example, (Gudise and Venayagamoorthy 2003) made a comparative study between BP and PSO for training neural networks and, considering the computational requirements, concluded that PSO converges faster than BP. Then, [41] presented a hybrid learning algorithm combining PSO with BP for the same purpose, exploiting the global search ability of PSO at the beginning of the iterations and later using BP for local search. They used three benchmark problems (3-bit parity, function approximation, and Iris classification) and reported that their proposed method reached better convergence than PSO and BP alone. In another study, [31] adapted ant colony optimization to continuous optimization and applied it to the training of feed-forward neural networks; the obtained results were evaluated by the authors as being as successful as GD. [15] used BP and PSO for training neural networks to solve medical diagnostic problems, evaluating the performance of global versus local techniques on three medical diagnostic problems from the Proben1 dataset; BP was found to be more efficient than PSO. [39] trained neural networks with a combined method involving opposition-based PSO and BP with momentum, examined the hybrid method on eight well-known datasets, and reported training times and classification accuracies that showed the superiority of the method.

ABC is another metaheuristic proposed for FNN training in the literature. [26] designed an ABC-based algorithm to solve the XOR, 3-bit parity, and encoder–decoder test problems as well as classification problems. The performance of the ABC algorithm for training neural networks was compared with the BP, LM, GA, and PSO algorithms, and the ABC-based training method was reported to reach the most successful results by obtaining the highest classification accuracy. (Brajevic and Tuba 2013) used FA to train neural networks with two different transfer functions, sigmoid and sine, and compared the performance of FA with the results of ABC and GA on the XOR, 3-bit parity, and 4-bit parity benchmarks. According to the experimental results, they concluded that the order of superiority among the three algorithms was ABC, FA, and GA, respectively. A new study on FNN training with ABC was recently proposed by [38], who modified ABC to accelerate convergence, added a new selection method to improve performance, and compared the training results with those of ABC variants.

Some of the various metaheuristic optimization algorithms recently proposed as learning methods for FNNs are mentioned below. For example, Mirjalili et al. [22] focused on the Gravitational Search Algorithm (GSA) for training FNNs and proposed a hybrid learning method (PSOGSA) combining GSA and PSO. They compared the hybrid method with the original GSA and PSO on three benchmark problems; PSOGSA achieved better results in terms of convergence speed and accuracy.

In another recent study, the bird mating optimizer (BMO), inspired by the mating behavior of birds, was used by [4] for the training of neural networks. They reported that the BMO-based training method achieved promising results on three benchmark datasets (Iris, Wisconsin-Breast-Cancer, and Pima-Indians-Diabetes) and on a fuel cell system.

Piotrowski [27] investigated the performance of previous DEA-based training methods and argued that stagnation is the main reason why DEA fails compared with other methods. To overcome this problem, he reported achieving satisfactory results by combining global and local neighborhood-based mutation operators with the trigonometric mutation operator.

Tang et al. [34] adapted the Dynamic Group Optimization (DGO) algorithm for training FNNs. They used the DGO algorithm to find the optimal weights and bias values as well as to determine the structure of the FNN. Based on the reported experiments, they claimed that DGO is a suitable method for FNN training.

Swain et al. [33] proposed a hybrid metaheuristic combining the Gravitational Search algorithm and PSO for training FNN to diagnose the faults in wireless sensor networks.

3 Feed-forward neural-network

A Feed-Forward Neural Network (FNN) is the most popular type of neural network; in an FNN, data flow only in the forward direction, and FNNs can solve classification and regression problems effectively. The basic processing unit of an FNN is the neuron, which is inspired by biological nerve cells. Figure 1 shows the structure of an artificial neuron. Each neuron generates one output signal for an n-dimensional input vector. To accomplish a particular task, neural networks are trained by updating the values of the connections between neurons. These connections are usually called weights, and they do not form a cycle.

Fig. 1
figure 1

The structure of a neuron

In an FNN, the neurons are grouped into three types of layers: (i) an input layer containing as many neurons as problem inputs, (ii) one or more hidden layers, each containing at least one neuron, which are required for solving non-linear NP-hard problems, and (iii) an output layer containing as many neurons as problem outputs. The numbers of neurons in the input and output layers depend on the problem, while the number of neurons in each hidden layer and the number of hidden layers are defined by the decision-makers who set the architecture to be applied [2]. The schematic of a general three-layer FNN is shown in Fig. 2.

Fig. 2
figure 2

A general three-layer FNN

First, the weighted sum of the inputs is calculated by Eq. (1), and then the output (\(y\)) is calculated by the sigmoid function given in Eq. (2).

$$net=\sum_{i=1}^{n}{w}_{ij}{x}_{i}+{w}_{bias},$$
(1)
$$y= f\left(net\right)=\frac{1}{1+{e}^{-net}},$$
(2)

where \({x}_{i}\) indicates the i-th input with \(i\in \left\{1,2,\dots ,n\right\}\), and \({w}_{ij}\) is the weight between the i-th input and the j-th neuron. The red input with a constant value of one in Fig. 1 is the bias, a weight of the neuron that is independent of the inputs; it is applied to the hidden and output layers in order to balance the weight values during the training phase. The output of each neuron is calculated by a transfer function such as a linear, radial, sigmoid, or hyperbolic tangent function.
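For concreteness, the computation in Eqs. (1)–(2) can be sketched in a few lines of Python; the weights and inputs below are arbitrary illustrative values, not values used in the experiments.

```python
import numpy as np

def neuron_output(x, w, w_bias):
    """Single-neuron forward pass: weighted sum (Eq. 1) + sigmoid transfer (Eq. 2)."""
    net = np.dot(w, x) + w_bias            # net = sum_i w_i * x_i + w_bias
    return 1.0 / (1.0 + np.exp(-net))      # y = 1 / (1 + exp(-net))

# Illustrative example with three inputs and arbitrary weights.
x = np.array([0.5, 0.1, 0.9])
w = np.array([0.4, -0.7, 0.2])
print(neuron_output(x, w, w_bias=0.1))     # a value in (0, 1)
```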

The aim of the learning procedure is to minimize the error caused by the difference between the calculated and expected output values. The ANN is trained with the inputs in the training set until the output of the network satisfies the expected output; output values for unseen inputs are then estimated using the trained network. To obtain acceptable results for complex, nonlinear, multivariable problems, an ANN should be trained with sufficient data using an appropriate learning algorithm.

4 Vortex search optimization algorithm

Vortex Search (VS) is a single-solution-based algorithm for numerical function optimization, recently proposed by [9]. The VS algorithm is inspired by the vortex pattern created by the rotational flow of stirred fluids. The algorithm narrows the search space over the iterations by using the inverse incomplete gamma function, so that it reaches the global optimum faster. In addition, VS does not require any user-tuned parameters.

Assuming a two-dimensional optimization problem, the schema of a vortex is formed by a series of nested circles. In the first iteration, the center of the largest circle \(\left({\mu }_{0}\right)\) is calculated by Eq. (3) for all dimensions.

$${\mu }_{0}=\frac{UpLimit+LowLimit}{2},$$
(3)

where \(UpLimit\) and \(LowLimit\) are the upper and lower boundaries of decision variables, respectively. Also, the initial radius \(\left({r}_{0}\right)\) is determined by Eq. (4). Initially, a large value for \({r}_{0}\) is selected since the search area must be fully covered by the outer circle.

$${r}_{0}={\sigma }_{0}=\frac{max\left(UpLimit\right)-min(LowLimit)}{2}.$$
(4)

Then the candidate solutions are randomly generated by using Gaussian distribution within the specified circle in d-dimensional search space. The formulation of Gaussian distribution is given in Eq. (5).

$$\begin{gathered} p\left( {x\,|\,\mu ,\Sigma } \right) = \frac{1}{{\sqrt {\left( {2\pi } \right)^{d} \left| \Sigma \right|} }}\exp \left\{ { - \frac{1}{2}\left( {x - \mu } \right)^{T} \Sigma^{ - 1} \left( {x - \mu } \right)} \right\} \hfill \\ \Sigma = \sigma^{2} \cdot \left[ I \right]_{d \times d} \hfill \\ \end{gathered}$$
(5)

where \(x\) is the \(d\times 1\) vector of a random variable, \(\Sigma\) is the covariance matrix, \({\sigma }^{2}\) is the variance of the distribution, and \(I\) denotes the \(d\times d\) identity matrix. After the candidate solutions are generated, the components of solutions that lie outside the boundaries are drawn back into the boundaries. This operation is given in Eq. (6).

$${s}_{m}^{j}=\begin{cases} \mathrm{rnd}\cdot \left({UpLimit}^{j}-{LowLimit}^{j}\right)+{LowLimit}^{j}, & {s}_{m}^{j}<{LowLimit}^{j}\\ {s}_{m}^{j}, & {LowLimit}^{j}\le {s}_{m}^{j}\le {UpLimit}^{j}\\ \mathrm{rnd}\cdot \left({UpLimit}^{j}-{LowLimit}^{j}\right)+{LowLimit}^{j}, & {s}_{m}^{j}>{UpLimit}^{j} \end{cases},$$
(6)

where \(m=\mathrm{1,2},\ldots,n\) and \(j=\mathrm{1,2},\ldots,d\) and \(\mathrm{rnd}\) is a random value in the range [0, 1].
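The initialization and sampling steps of Eqs. (3)–(6) can be summarized with the NumPy sketch below. This is not the authors' implementation; the function names and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def initial_center_and_radius(low, up):
    """Eq. (3): mu0 is the midpoint of the bounds; Eq. (4): r0 = sigma0."""
    mu0 = (up + low) / 2.0
    r0 = (np.max(up) - np.min(low)) / 2.0
    return mu0, r0

def generate_candidates(mu, sigma, n, low, up):
    """Eq. (5): draw n candidates from N(mu, sigma^2 * I); Eq. (6): components
    falling outside the bounds are re-drawn uniformly inside the bounds."""
    s = rng.normal(loc=mu, scale=sigma, size=(n, mu.size))
    lo_b = np.broadcast_to(low, s.shape)
    up_b = np.broadcast_to(up, s.shape)
    out = (s < lo_b) | (s > up_b)
    s[out] = rng.random(out.sum()) * (up_b[out] - lo_b[out]) + lo_b[out]
    return s

# Illustrative run for a 2-dimensional problem bounded by [-50, 50].
low, up = np.full(2, -50.0), np.full(2, 50.0)
mu0, r0 = initial_center_and_radius(low, up)
candidates = generate_candidates(mu0, r0, n=10, low=low, up=up)
```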

The next operation is the selection of the best solution from the candidate solution set; the best solution is the one with the best fitness value among all current solutions. A greedy selection is then applied between this solution and the best solution found so far: the better one is stored in memory and replaces the current circle center \(\mu\). In this phase, the circle radius is decreased so that exploitation increases in the search space. The radius decreases slowly in the early iterations, since exploration is prioritized, and the reduction is accelerated during the second half of the iterations to ensure better exploitation. In the VS algorithm, this adaptive radius reduction is realized with the inverse incomplete gamma function. The gamma function \(\Gamma \left(a\right)\) is given in Eq. (7).

$$\begin{gathered} \Gamma \left( a \right) = \gamma \left( {x,a} \right) + \Gamma \left( {x,a} \right) \hfill \\ \gamma \left( {x,a} \right) = \int_{0}^{x} {e}^{-t}\,{t}^{a-1}\,dt, \quad a > 0 \hfill \\ \Gamma \left( {x,a} \right) = \int_{x}^{\infty } {e}^{-t}\,{t}^{a-1}\,dt, \quad a > 0 \hfill \\ \end{gathered}$$
(7)

where \(\gamma \left(x,a\right)\) and \(\Gamma \left(x,a\right)\) are the incomplete gamma function and its complement, respectively. The parameter \(a\) defines the inflexibility of the search, and \(x\) is a random variable. In each iteration, the radius is updated by Eq. (8).

$$\begin{gathered} r_{t} = \sigma_{0} \cdot \frac{1}{x} \cdot \Gamma \left( {x,a_{t} } \right) \hfill \\ a_{t} = a_{0} - t/MaxItr. \hfill \\ \end{gathered}$$
(8)

Here, in the first iteration \({a}_{0}\) equals 1 in order to cover the entire search space. \(t\) refers to the iteration number and \(MaxItr\) represents the maximum iteration number. This loop is repeated until a termination condition is satisfied.
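As a sketch of Eqs. (7)–(8), the radius schedule can be computed with SciPy's regularized inverse incomplete gamma function. Fixing \(x\) to 0.1 here is an illustrative assumption (the text above only states that \(x\) is a random variable), and \(\sigma_0\) corresponds to the initial radius of Eq. (4).

```python
import numpy as np
from scipy.special import gammaincinv   # inverse of the regularized lower incomplete gamma

def radius_schedule(sigma0, max_itr, x=0.1):
    """Radius values r_t following Eqs. (7)-(8): a_t decreases linearly from
    a0 = 1, and the radius shrinks through the inverse incomplete gamma function.
    x = 0.1 is an assumed illustrative value."""
    a0 = 1.0
    radii = []
    for t in range(max_itr):
        a_t = a0 - t / max_itr                       # stays in (0, 1]
        radii.append(sigma0 * (1.0 / x) * gammaincinv(a_t, x))
    return np.array(radii)

r = radius_schedule(sigma0=50.0, max_itr=100)
print(r[0], r[50], r[99])   # large at the start, approaching zero at the end
```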

5 VS-based feed-forward neural network

Several improvements can be made to increase the performance of FNNs, and the major ones fall into three groups. The first is the architecture of the FNN: the number of hidden layers and the number of neurons in each hidden layer are crucial parameters that influence learning performance. The second is the choice of transfer function and of the user-defined values of the training method (such as epoch size and learning rate) for the problem being addressed. The last is the determination of the best connection weights.

In this study, we focus on the third group: the VS algorithm is used to determine the best weights and biases by minimizing the error of an FNN. It is executed as a training method that tunes all the weights of an FNN with a specified architecture. For this purpose, the weights and biases are the decision variables of the search space in the optimization process; they are arranged in a matrix, and candidate solutions are generated according to that matrix.

Figure 3 indicates the distribution of the weights and biases for an FNN with a 2–2–1 structure. In this study, the matrix encoding strategy shown below is used for training FNNs. An encoded candidate solution matrix (CS) consists of four weight vectors: W1, B1, W2, and B2. These are represented as follows:

$$W1=\left[\begin{array}{cc}{W}_{13}& {W}_{23}\\ {W}_{14}& {W}_{24}\end{array}\right]$$
$$B1=\left[\begin{array}{c}{W}_{B1}\\ {W}_{B2}\end{array}\right]$$
$$W2^{\prime}=\left[\begin{array}{c}{W}_{36}\\ {W}_{46}\end{array}\right]$$
$$B2=\left[{W}_{B3}\right]$$
$$CS=\left\{W1,\;B1,\;W2^{\prime},\;B2\right\}=\left\{{W}_{13}, {W}_{14}, {W}_{23}, {W}_{24}, {W}_{B1}, {W}_{B2}, {W}_{36}, {W}_{46}, {W}_{B3}\right\}$$

where W1 is the hidden-layer weight matrix, B1 is the hidden-layer bias matrix, W2 is the output-layer weight matrix, W2′ is the transpose of W2, and B2 is the output-layer bias matrix.
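The encoding can be illustrated with a short helper that flattens the four blocks of the 2–2–1 network in Fig. 3 into one vector and restores them. The block ordering used here (W1, B1, W2, B2) and the row-major flattening are assumptions for illustration; any consistent ordering serves the same purpose.

```python
import numpy as np

# Block shapes for the 2-2-1 FNN of Fig. 3: W1 (2x2), B1 (2), W2 (2x1), B2 (1).
SHAPES = {"W1": (2, 2), "B1": (2,), "W2": (2, 1), "B2": (1,)}

def encode(params):
    """Flatten W1, B1, W2, B2 into a single candidate-solution vector CS."""
    return np.concatenate([params[k].ravel() for k in SHAPES])

def decode(cs):
    """Rebuild the weight and bias blocks from a candidate-solution vector CS."""
    params, i = {}, 0
    for k, shape in SHAPES.items():
        size = int(np.prod(shape))
        params[k] = cs[i:i + size].reshape(shape)
        i += size
    return params

cs = np.arange(9, dtype=float)       # 4 + 2 + 2 + 1 = 9 decision variables
assert np.allclose(encode(decode(cs)), cs)
```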

Fig. 3
figure 3

A candidate solution for a 2–2–1 FNN

The training of an FNN with VS, i.e., the VS-FNN method, starts after a suitable FNN architecture is defined for the selected dataset. To find the best values of the weights and biases, the search space is constructed according to the CS matrix, which means that the dimension of the problem equals the total number of weights and biases. Therefore, as is done for large-scale optimization problems, the number of candidate solutions generated in one iteration can be increased for complex FNN architectures compared with lower-dimensional FNNs. The next step of the algorithm is to calculate the first center point and radius using Eqs. (3) and (4), respectively. Then, the candidate solutions are randomly generated within the determined radius around this center. The solution with the lowest training error is selected as the new center point for the next iteration, and the search space is narrowed by decreasing the radius. Thus, the center point evolves over the iterations.

Since the training of an FNN is treated as an optimization problem, an objective function has to be defined to evaluate the fitness of the weight values generated during the optimization process. In the proposed approach, the mean training error is used as the objective function. For each sample in the dataset, the feed-forward calculation is applied first; then the errors, i.e., the differences between the calculated and expected values, are obtained. Finally, the mean squared error (MSE) is calculated by Eq. (9).

$$\mathrm{MSE}\left(\vec{\omega}\left(t\right)\right)=\frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{P}{\left({d}_{k}-{o}_{k}\right)}^{2},$$
(9)

where \(\vec{\omega}\left(t\right)\) denotes the connection weights at iteration \(t\), \({d}_{k}\) represents the desired output value and \({o}_{k}\) the produced output value of sample \(j\), \(P\) is the number of output neurons, and \(N\) is the number of samples.
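A hedged sketch of the objective function is given below: it decodes a candidate-solution vector for a generic I–H–O network, performs the feed-forward pass of Sect. 3, and evaluates Eq. (9). The packing order follows the CS encoding sketched above and is an assumption, as is the illustrative 3–4–1 example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_objective(cs, X, D, n_in, n_hid, n_out):
    """Mean squared training error of Eq. (9) for an I-H-O FNN whose weights
    and biases are packed in cs as (W1, B1, W2, B2)."""
    i = 0
    W1 = cs[i:i + n_in * n_hid].reshape(n_in, n_hid);   i += n_in * n_hid
    B1 = cs[i:i + n_hid];                               i += n_hid
    W2 = cs[i:i + n_hid * n_out].reshape(n_hid, n_out); i += n_hid * n_out
    B2 = cs[i:i + n_out]
    H = sigmoid(X @ W1 + B1)                  # hidden-layer outputs
    O = sigmoid(H @ W2 + B2)                  # network outputs o_k
    return np.sum((D - O) ** 2) / X.shape[0]  # (1/N) * sum_j sum_k (d_k - o_k)^2

# Illustrative evaluation on the 3-bit parity inputs with a random 3-4-1 network.
X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)], dtype=float)
D = (X.sum(axis=1) % 2).reshape(-1, 1)        # parity (XOR) targets
cs = np.random.uniform(-50, 50, size=3 * 4 + 4 + 4 * 1 + 1)
print(mse_objective(cs, X, D, n_in=3, n_hid=4, n_out=1))
```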

VS-FNN continues until the maximum number of cycles is reached. The flowchart of VS-FNN can be seen in Fig. 4.

Fig. 4
figure 4

Flowchart of VS-based learning method for FNN

6 Experimental results

In this study, the proposed VS-based learning method (VS-FNN) is compared with the PSO, ABC, GA, SA, and SGD algorithms, all applied to an FNN with the same structure. These methods are referred to as PSO-FNN, ABC-FNN, GA-FNN, SA-FNN, and SGD, respectively. To evaluate the performance of the proposed VS-FNN method, all methods are applied to six benchmark problems. The first is the three-bit parity (3-bit XOR) problem, whose inputs and expected outputs are shown in Table 1. The other problems are five popular datasets available in the UCI Machine Learning Repository of the University of California, Irvine [10]: Iris classification, Wine-Recognition, Wisconsin-Breast-Cancer, Pima-Indians-Diabetes, and Thyroid-Disease. In this paper, for simplicity, the dataset names are abbreviated in some places as iris, wine, WBC, PID, and thyroid, respectively.

Table 1 Three bits parity problem

These datasets are well-known benchmark problems frequently used in the literature to evaluate the classification performance of algorithms. The numbers of attributes, classes, and samples in each dataset are given in Table 2. In addition, the attributes of the datasets are listed in Fig. 5.

Table 2 Benchmark datasets
Fig. 5
figure 5

The list of attributes for each dataset

To compare the results consistently, the same values are assigned to the common parameters of all algorithms; these are listed in Table 3. For these benchmark problems, a single hidden layer is used and every weight value is randomly initialized in the range of [− 50, 50]. The maximum number of iterations is set to 100 for all problems, except for the three-bit parity problem, for which it is set to 500.

Table 3 Common parameters for all algorithms

In addition, there are individual user-tuned parameters for each algorithm. For the PSO algorithm, a fully connected topology is used. The cognitive constant (C1) and the social constant (C2) are both set to 2. The parameters \({r}_{1}\) and \({r}_{2}\) are randomly generated in the range [0, 1], and \(w\) decreases linearly from 0.9 to 0.1. The initial velocities of the particles are set to one-tenth of their initial \(pbest\) values.

For the ABC algorithm, the limit value is set to the product of the problem dimension and food number. For the GA, a real-coded algorithm is run over the problems. Binary tournament selection, arithmetic crossover (probability = 0.7), and uniform mutation (probability = 0.01) operators are used. For the SA algorithm, the initial temperature is set to 1000 and the final temperature is 1. For the SGD algorithm, the learning rate is set to 0.2 and the number of epochs is 2000.

The outcomes of the methods are compared based on the average, standard deviation (Std dev), median, interquartile range (IQR), best, and worst of the Mean Squared Error (MSE) values over 30 independent runs. In addition, the elapsed time for each run on every test problem is recorded, and the average time in seconds is calculated for each algorithm.

6.1 Selected FNN structures

In this study, the attributes of the datasets are normalized to the range [0, 1]. Then, all algorithms are applied to the datasets with the same architecture. In the selected FNN structures, two bias units are included in the network, one for the hidden layer and one for the output layer, and the sigmoid function is used as the activation function for both layers.

For each dataset, the FNNs with structure I–H–O are used to solve the classification problems, where I is the number of attributes, O is the number of classes, and H is the number of hidden nodes. The performances of FNNs are compared with H = 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 20, and 30. For these 12 different structures, each test problem is trained by the algorithms for 30 independent runs. The training phase continues until the maximum number of iterations has been completed.
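For reference, the dimension of the search space implied by this setup, i.e., the total number of weights and biases optimized for an I–H–O structure with one bias unit per layer, can be computed as in the following sketch; the Iris example with H = 10 is only illustrative.

```python
def n_decision_variables(i_nodes, h_nodes, o_nodes):
    """Total weights and biases for an I-H-O FNN with one bias unit feeding the
    hidden layer and one feeding the output layer (W1 + B1 + W2 + B2)."""
    return i_nodes * h_nodes + h_nodes + h_nodes * o_nodes + o_nodes

# Example: the Iris structure 4-10-3 gives 4*10 + 10 + 10*3 + 3 = 83 variables.
print(n_decision_variables(4, 10, 3))
```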

6.2 Results and discussion

The statistical results and the rank-based comparisons of the algorithms for each test problem are provided in Tables 4–21. For each dataset, the best results are indicated in bold type in the tables. Further, the convergence curves are depicted in Figs. 6–11; parts (a)–(l) of each figure are the convergence curves for FNNs with H = 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 20, and 30, respectively.

Table 4 The statistical results of MSE over 30 independent runs for the 3-bit XOR problem
Table 5 The comparison of the ranks of the algorithms for the results of 3-bit XOR problem with respect to average MSE
Table 6 The comparison of the ranks of the algorithms for the results of 3-bit XOR problem with respect to median MSE
Table 7 The statistical results of MSE over 30 independent runs for the Iris benchmark problem
Table 8 The comparison of the ranks of the algorithms for the results of Iris benchmark problem with respect to average MSE
Table 9 The comparison of the ranks of the algorithms for the results of Iris benchmark problem with respect to median MSE
Table 10 The statistical results of MSE over 30 independent runs for the Wine benchmark problem
Table 11 The comparison of the ranks of the algorithms for the results of Wine benchmark problem with respect to average MSE
Table 12 The comparison of the ranks of the algorithms for the results of Wine benchmark problem with respect to median MSE
Table 13 The statistical results of MSE over 30 independent runs for the WBC benchmark problem
Table 14 The comparison of the ranks of the algorithms for the results of WBC benchmark problem with respect to average MSE
Table 15 The comparison of the ranks of the algorithms for the results of WBC benchmark problem with respect to median MSE
Table 16 The statistical results of MSE over 30 independent runs for the PID benchmark problem
Table 17 The comparison of the ranks of the algorithms for the results of PID benchmark problem with respect to average MSE
Table 18 The comparison of the ranks of the algorithms for the results of PID benchmark problem with respect to median MSE
Table 19 The statistical results of MSE over 30 independent runs for the Thyroid benchmark problem
Table 20 The comparison of the ranks of the algorithms for the results of Thyroid problem with respect to average MSE
Table 21 The comparison of the ranks of the algorithms for the results of Thyroid problem with respect to median MSE
Fig. 6
figure 6

Convergence curves of VS-FNN, PSO-FNN, ABC-FNN, GA-FNN, and SA-FNN based on averages of MSE values over 30 independent runs in the 3-bit XOR problem. (a)–(l) are the convergence curves for FNNs with H = 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 20, and 30, respectively

Fig. 7
figure 7

Convergence curves of VS-FNN, PSO-FNN, ABC-FNN, GA-FNN, and SA-FNN based on averages of MSE values over 30 independent runs in the Iris benchmark problem. (a)–(l) are the convergence curves for FNNs with H = 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 20, and 30, respectively

Fig. 8
figure 8

Convergence curves of VS-FNN, PSO-FNN, ABC-FNN, GA-FNN, and SA-FNN based on averages of MSE values over 30 independent runs in the Wine benchmark problem. (a)–(l) are the convergence curves for FNNs with H = 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 20, and 30, respectively

Fig. 9
figure 9

Convergence curves of VS-FNN, PSO-FNN, ABC-FNN, GA-FNN, and SA-FNN based on averages of MSE values over 30 independent runs in the WBC benchmark problem. (a)–(l) are the convergence curves for FNNs with H = 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 20, and 30, respectively

Fig. 10
figure 10

Convergence curves of VS-FNN, PSO-FNN, ABC-FNN, GA-FNN, and SA-FNN based on averages of MSE values over 30 independent runs in the PID benchmark problem. (a)–(l) are the convergence curves for FNNs with H = 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 20, and 30, respectively

Fig. 11
figure 11

Convergence curves of VS-FNN, PSO-FNN, ABC-FNN, and GA-FNN based on averages of MSE values over 30 independent runs in the Thyroid benchmark problem. (a)–(l) are the convergence curves for FNNs with H = 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 20, and 30, respectively

Considering the experimental results, the convergence graphs, and the statistical results given in the tables, the proposed VS-FNN method achieves more successful results than all the compared algorithms in most cases on all datasets. The main reason why VS-FNN can be superior to the other algorithms is its more efficient exploitation process, obtained by iteratively narrowing the search space around the global optimum.

As is well known, the operation of metaheuristics depends on two processes, exploration and exploitation, and the performance of an algorithm depends on how well it balances the two. In general, it is expected that as the iterations progress the algorithm converges towards the global optimum according to its exploration strategy; then, for exploitation to be carried out effectively, the new candidate solutions produced through neighborhood relationships begin to lie close to each other in the search space.

At this point, VS employs strong locality in the exploitation process, unlike many other algorithms: it allows more efficient exploitation by systematically and rapidly narrowing the search space, especially after half of the maximum number of iterations. In this paper, it is shown that VS-FNN converges to the global optimum more effectively when it is applied with suitable control parameter values for a selected dataset. This can be seen from the convergence graphs given below; especially after the 50th iteration, VS-FNN provides faster convergence than the other algorithms.

Another advantage of VS-FNN that can be deduced from the experimental results is that its runtime is often lower than that of the other algorithms, since the VS algorithm has a very simple structure in terms of computational complexity. Instead of generating new candidate solutions through mating or complex neighborhood relationships, VS simply samples from a Gaussian distribution within the narrowed search space. This simple strategy makes the algorithm run fast.

6.2.1 The N bit parity problem

The N-bit parity problem has frequently been used to demonstrate the effectiveness of training algorithms, since it is considered a difficult task for neural networks. The target output is the parity of a binary string, i.e., the XOR of its bits. FNNs with structures 3–H–1 are trained by the algorithms. Table 4 shows the statistical results for the 3-bit parity problem; values below \({10}^{-20}\) are considered zero. For all numbers of hidden nodes, VS-FNN and PSO-FNN achieve better results, especially for the median, best, mean, standard deviation, and IQR of the MSE. The rank-based comparisons of the algorithms for the 3-bit parity problem are provided in Tables 5 and 6 according to the average and median of the MSE, respectively. As shown in these tables, for all numbers of hidden nodes, the VS-FNN and PSO-FNN methods outperform all other methods in the majority of cases. For FNNs with H = 4, 5, 6, and 7, the ABC-FNN method has better results for the average MSE. These results indicate that VS-FNN has a better or at least competitive ability to avoid local minima on this test problem. Figure 6 shows the convergence curves of all algorithms based on the average of the MSE values and confirms that VS-FNN reaches the most accurate results with a steady convergence rate.
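For clarity, the 3-bit parity task of Table 1 can be enumerated in a few lines; this snippet only restates the problem definition and is not part of the experimental setup.

```python
from itertools import product

# Each 3-bit input maps to the XOR (parity) of its bits, as in Table 1.
for bits in product((0, 1), repeat=3):
    print(bits, "->", bits[0] ^ bits[1] ^ bits[2])
```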

6.2.2 The iris classification problem

The iris classification problem involves one of the best-known datasets in the field of machine learning. It contains samples for classifying iris flowers into three species (Setosa, Versicolor, and Virginica) based on measurements of the length and width of the sepals and petals. FNNs with structures 4–H–3 are trained by the algorithms. The statistical results for the Iris benchmark problem are presented in Table 7. In terms of average MSE, VS-FNN has the best values in 7 of the 12 FNNs (H = 7, 8, 9, 10, 15, 20, and 30). Moreover, the proposed method ranks second in 4 of the 12 FNNs (H = 4, 6, 11, and 13) and third only for the FNN with 5 hidden nodes among the six algorithms applied to this problem. Similarly, for the median MSE, the VS-FNN method has the best values in 7 of the 12 FNNs and the second-best values for the remaining structures. The rank-based comparisons are given in Tables 8 and 9 with respect to the mean and median of the MSE, respectively. As seen in these tables, the winning method is VS-FNN, closely followed by PSO-FNN. Figure 7 shows the convergence curves of all algorithms for the Iris problem based on the average of the MSE values; especially after half of the maximum number of iterations, VS provides faster convergence thanks to its strong locality feature.

6.2.3 The wine recognition problem

The wine recognition problem is a well-known classification dataset containing the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. In this section, FNNs with structures 13–H–3 are trained by the algorithms. The comparative training results for this problem are shown in Table 10. It can be seen that the VS-FNN method has the best average MSE values for only two FNNs (H = 6 and 9). Although VS-FNN has the second-best values for the remaining FNN structures, it trails PSO-FNN on this problem. Nevertheless, VS-FNN outperforms the ABC-FNN, GA-FNN, SA-FNN, and SGD methods and obtains results competitive with PSO-FNN; the rank-based comparisons provided in Tables 11 and 12 confirm this. On the other hand, although the GA-FNN method also achieves good results for this problem, it lags behind PSO, VS, and ABC in terms of runtime. Figure 8 shows the convergence curves of all algorithms based on the average of the MSE values and confirms that VS-FNN has a remarkable performance with a competitive convergence rate for all FNNs.

6.2.4 The WBC problem

The Wisconsin-Breast-Cancer (Original) dataset is another well-studied benchmark problem containing samples of the most common malignancy among women, to be classified as benign or malignant [20]. FNNs with structures 10–H–2 are trained by the algorithms. The comparative results for this problem are presented in Table 13. For all numbers of hidden nodes, VS-FNN has the best values for the median and IQR. With regard to the average and standard deviation, VS-FNN and PSO-FNN have very close values and jointly rank best, as can be seen from Tables 14 and 15: the number of first ranks for average MSE is equal for VS-FNN and PSO-FNN, while the number of first ranks for median MSE is 12 for VS-FNN. These results indicate that the proposed VS-based learning method can reach the most accurate results and avoid local minima for this problem. Figure 9 shows the convergence curves of all algorithms for the WBC problem and confirms that VS-FNN has the best convergence rate for almost all values of hidden nodes.

6.2.5 The PID problem

The Pima-Indians-Diabetes dataset consists of diagnostic measurements of female patients of Pima Indian heritage who are at least 21 years old. With this dataset, a classification is made to predict whether a patient has diabetes. The PID dataset has frequently been used as a benchmark in the field of machine learning. In this study, FNNs with structures 8–H–2 are trained by the algorithms. Table 16 shows the statistical results of the MSE values over 30 independent runs for the PID problem. Tables 17 and 18 provide the rank-based comparisons for the PID problem with respect to the average and median MSE, respectively. From these tables, it can be seen that VS-FNN performs better than the other algorithms: it is superior in both average and median MSE for 10 of the 12 FNNs, and in almost all structures it achieves more accurate values than the others. Figure 10 shows the convergence curves of all algorithms based on the average of the MSE values for the PID problem. These curves reflect the ability of VS-FNN to balance exploration and exploitation: it has weak locality until the 50th iteration, and after half of the maximum number of iterations the radius decreases significantly, so that the algorithm has strong locality.

6.2.6 The thyroid-disease problem

The Thyroid-Disease dataset is one of the most commonly used datasets for classification systems in the machine learning literature. It contains samples for classifying whether a given patient is (i) normal, or suffers from (ii) hyperthyroidism or (iii) hypothyroidism. For this purpose, FNNs with structures 21–H–3 are trained by the algorithms. The statistical results for the thyroid-disease benchmark problem are presented in Table 19. It can be seen that VS-FNN performs a competitive learning process together with the GA-FNN and PSO-FNN methods, and it has better results in many of the 12 different FNN structures. The rank-based comparisons for the thyroid-disease problem are provided in Tables 20 and 21 according to the average and median of the MSE, respectively. As can be inferred from these tables, the proposed method achieves one of the best rankings among the state-of-the-art algorithms. Figure 11 shows the convergence curves of all algorithms, except SA, based on the average of the MSE values; the results of SA are not included so that the curves of the other algorithms can be seen in more detail. From these curves, all algorithms except ABC present convergence rates close to each other.

7 Conclusions

Neural networks have always been one of the most important topics in machine learning, and the FNN is the most popular type of neural network in many applications. The success of FNNs used for classification and prediction on many real-world problems depends largely on the correct training of the network and on the performance of the training method. Classical derivative-based learning algorithms can be insufficient for training FNNs due to factors such as the non-linearity of the problems and very large dimensions. In the literature, many metaheuristic optimization algorithms have been proposed to determine the optimal weight values for such problems.

The main contribution of this paper is that it is the first attempt to investigate the performance of the Vortex Search optimization algorithm in the FNN training process. The VS algorithm merits this attention thanks to its noteworthy performance and its features, such as not requiring any user-tuned parameters, low computational complexity, fast convergence, and a simple but effective search method based on the Gaussian distribution.

In this study, the VS algorithm is adapted to the training process of FNNs as a new training method called VS-FNN. The performance of the proposed method was compared with that of FNNs trained with five other algorithms: SGD, ABC, PSO, SA, and GA. Under the same running conditions, all algorithms were applied to six benchmark problems: the three-bit parity (3-bit XOR) problem, Iris classification, Wine-Recognition, Wisconsin-Breast-Cancer, Pima-Indians-Diabetes, and Thyroid-Disease. To compare the effectiveness of the proposed algorithm in more detail, all datasets were run with neural network structures containing twelve different numbers of hidden nodes. The experimental results obtained over 30 independent runs show that the proposed training method delivers better or competitive performance in terms of accuracy, convergence rate, and computation time for all benchmark problems. Therefore, it can be concluded that the VS-FNN method is a suitable algorithm for the training of FNNs.