1 Introduction

Artificial neural networks (ANNs) (Fausett 1994; Gurney 1997) are sophisticated nonlinear modeling techniques inspired by biological nervous systems. ANNs are powerful tools effectively employed in various fields (Anderson 2006) such as time series prediction, classification, pattern recognition, system identification and control, function approximation, and signal processing. The performance of an ANN depends on selecting a proper architecture and learning algorithm.

During recent decades, there have been many studies on solving the problems of ANN learning and architecture optimization. The back-propagation (BP) algorithm (Rumelhart and McClelland 1986) was introduced and widely applied to train multilayer networks. However, algorithms based on gradient-descent methods such as BP are easily trapped in local minima (Brent 1991; Gori and Tesi 1992). Traditional learning algorithms also suffer from shortcomings such as a slow convergence rate and long learning time. Hence, many researchers have employed meta-heuristic algorithms for ANN learning to overcome these limitations, both in training and in obtaining an appropriate ANN architecture. The main advantages of these algorithms over traditional ones are their ability to escape from local minima and to adapt to their environment.

Genetic algorithm (GA) (Tang et al. 1996), artificial immune system (AIS) (Farmer et al. 1986), ant colony optimization (ACO) (Dorigo et al. 1996), particle swarm optimization (PSO) (Kennedy and Eberhart 1995; Shi and Eberhart 1998), artificial bee colony (ABC) (Karaboga 2005), imperialist competitive algorithm (ICA) (Atashpaz-Gargari and Lucas 2007) and gravitational search algorithm (GSA) (Rashedi et al. 2009) belong to the class of meta-heuristic algorithms. The study by Yao (1999) offers a general framework for applying such algorithms to ANN learning.

Samanta et al. (2003) compared the performance of bearing fault detection using two different classifiers, ANN and support vector machine (SVM), where the classifier parameters were optimized by GA. In that method, the time-domain vibration signals of a rotating machine with normal and defective bearings were processed for feature extraction. The features extracted from the original and preprocessed signals were applied as inputs to the classifiers for two-class (normal or fault) recognition. In another study, GA and ANN were developed for loading pattern optimization of advanced gas-cooled reactors (AGR) (Ziver et al. 2004). Oreski et al. (2012) proposed a hybrid system of GA and ANN for retail credit risk assessment; the method provided a feature selection technique for finding an optimal feature subset to improve the accuracy of ANN classification.

Yao et al. (2000) applied the hybrid learning of PSO and a feed-forward neural network (FFNN) to classify two problems in the medical domain. The results showed that the method classified data more accurately than other algorithms (Bennett and Mangasarian 1992; Setiono and Hui 1995; Prechelt 1995). Also, an adapted PSO model was proposed to train an MLP for predicting the outcome of construction claims in Hong Kong (Chau 2007); the successful prediction rate was up to 80 % and the method produced more accurate results than a benchmark BP counterpart. In another study, Huang et al. (2010) proposed an ANN classifier with entropy-based feature selection for breast cancer diagnosis, in which Levenberg–Marquardt (LM) was used for training and the PSO algorithm was applied to obtain proper ANN weights.

Further, a hybrid approach based on ACO and ANN (Sivagaminathan and Ramakrishnan 2007) was presented for feature selection on several medical datasets; the proposed method performed better than BP and GA for ANN. In another study, a hybrid of continuous ACO with classical gradient techniques was applied to train an ANN for medical pattern classification (Socha and Blum 2007).

Moreover, a comparative study was conducted on the diagnosis of chronic obstructive pulmonary disease and pneumonia using hybrids of AIS and different ANN classification models (Er et al. 2009). The results were compared with those of previous similar studies on the diagnosis of these diseases (Ashizawa et al. 1999; Heckerling et al. 2004; Coppini et al. 2007), and it was concluded that AIS and ANN could be successfully used to diagnose chronic obstructive pulmonary disease and pneumonia. A hybrid learning of MLP and AIS was also presented to predict the output of a grid-connected photovoltaic system (Sulaiman et al. 2012); in that method, AIS was selected as the optimizer for the MLP training process to optimize the ANN architecture.

Besides, Ozkan et al. (2011) designed an ANN for modeling daily reference evapotranspiration (ET\(_{0}\)) using the ABC algorithm. The accuracy of the proposed model was compared with those of ANN models trained with LM and standard BP algorithms and with some existing empirical models (Kisi and Ozturk 2007). The proposed model performed better in estimating ET\(_{0}\).

Ahmadi et al. (2013) offered a model based on MLP to predict reservoir permeability and used ICA to determine the initial weights of the network. The performance and generalization capability of the proposed method were compared with those of models developed with the common BP technique, and the results showed that the method outperformed the gradient descent-based neural network. Also, Tayefeh-Mahmoudi et al. (2013) proposed a method to evolve ANN architecture using grammar encoding and ICA. The hybrid method was evaluated on four known regression problems and compared against state-of-the-art optimization methods: GA, resilient BP, Minimum Finder (MinFinder), Broyden–Fletcher–Goldfarb–Shanno (BFGS), and neural network construction (NNC) (Tsoulos et al. 2008). The experimental results showed that the proposed grammar encoding method using ICA outperformed the other methods.

Although many meta-heuristics have been extensively applied to ANN learning, none of them has shown good performance across all applications. The existing meta-heuristics also have several disadvantages (Leung et al. 1997; Hrstka and Kučerová 2004; Liang et al. 2006; Gao et al. 2012; Moslemipour et al. 2012) such as slow convergence speed, trapping in local minima, long computational time, the need to tune many parameters and difficult encoding schemes. Hence, enhancing the performance of existing meta-heuristics or proposing new ones seems necessary to increase the efficiency of ANN learning in different domains.

In this paper, the CAPSO algorithm (Beheshti and Shamsuddin 2013a) is used to improve ANN learning. Several medical datasets (Kuncheva 2005; Bache and Lichman 2013) are selected in order to evaluate the proposed method. The performance metrics of mean square error (MSE), correct classification rate or accuracy (ACC), sensitivity, specificity and area under the receiver operating characteristic curve (AUC) are employed to compare the results of the proposed algorithm with PSO, GSA and ICA on MLP learning.

The structure of this paper is organized as follows. Section 2 briefly describes the background of MLP and the meta-heuristic algorithms used in this study for MLP learning. The details of the proposed approach are elucidated in Sect. 3. Section 4 presents the experimental results of neural network classification on medical datasets using the algorithms. Finally, a conclusion is given in Sect. 5.

2 Background materials

The basic background materials required for a deep understanding of the proposed method are elucidated in this section. First, MLP as a FFNN is described; after that, the PSO, GSA and ICA algorithms are explained, since their results are compared with those of the proposed algorithm for MLP learning on classification problems. Finally, the classification problem is briefly explained.

2.1 MLP network

The FFNN is one of the most common ANNs and maintains a high level of research interest because of its ability to map any function to an arbitrary degree of accuracy. This kind of network includes both the MLP network (Hornik et al. 1989) and the radial basis function (RBF) network (Park and Sandberg 1991). Due to their similarity in functional mapping ability, MLP and RBF networks share the same problem domains.

The multi-layer perceptron has been utilized in a broad range of sciences such as finance, medicine, engineering, geology, physics and biology. In medicine, it has been extensively employed in medical diagnosis, detection and evaluation of medical conditions, and treatment cost estimation (Handels et al. 1999; Folland et al. 2004; Karabatak and Ince 2009; Ari and Saha 2009; Jerez et al. 2010; Abbod et al. 2011). It has also been used to replace conventional pattern recognition methods in disease diagnosis systems (Kayaer and Yildirim 2003; Delen et al. 2005; Temurtas 2007; Elveren and Yumuşak 2011). As Fig. 1 demonstrates, an MLP contains one input layer which receives external inputs, one or more hidden layers, and one output layer which shows the results. All layers except the input layer are made of processing nodes with activation functions. Data are presented at the input layer, and the network nodes carry out calculations through the successive layers until an output value is obtained at each output node.

Fig. 1
figure 1

Architecture of MLP network

Neurons are the processing units of MLP. An artificial neuron is illustrated in Fig. 2. As shown, each neuron or node computes the weighted sum of its inputs plus a bias, and passes this sum through an activation function (such as the sigmoid function) to obtain its output. This process is expressed in Eq. (1).

$$\begin{aligned} U_j&= \sum \limits _{i=1}^n {w_{ji} x_i } +\theta _j , \nonumber \\ O_j&= f_j ( {U_j }) \end{aligned}$$
(1)

where \(w_{ji}\) is the weight connecting node \(i\) to node \(j\), \(U_{j}\) is the linear combination of inputs, \(O_{j}\) and \(\theta _j\) are the output and bias of node \(j\) respectively, and \(f_j(.)\) is an activation function.

Fig. 2
figure 2

An artificial neuron of MLP

\(f_{j}(.)\) is usually taken to be a sigmoid function, and \(O_{j}\) is then given by:

$$\begin{aligned} O_j =f_j (U_j )=\frac{1}{(1+e^{-U_j })}. \end{aligned}$$
(2)
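For illustration, the computation of Eqs. (1) and (2) for a single neuron can be sketched as follows (a minimal Python sketch; the variable names and numeric values are illustrative only and are not taken from the paper):

```python
import numpy as np

def neuron_output(x, w, theta):
    """Output of a single MLP neuron, Eqs. (1) and (2)."""
    u = np.dot(w, x) + theta          # weighted sum of inputs plus bias, Eq. (1)
    return 1.0 / (1.0 + np.exp(-u))   # sigmoid activation, Eq. (2)

# Hypothetical example with three inputs
x = np.array([0.2, 0.5, 0.1])
w = np.array([0.4, -0.3, 0.8])
print(neuron_output(x, w, theta=0.1))
```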

Proper design of an MLP architecture is complex because many elements affect its performance, such as the number of nodes in the input, hidden and output layers, the distribution of layers, the interconnections between neurons and layers (weights), the error function and the activation function.

Selecting a proper learning algorithm is another challenge in using MLP. In the following sections, some meta-heuristic algorithms applied to MLP learning are explained, and their performances are later compared.

2.2 Particle swarm optimization

Particle swarm optimization is a population-based meta-heuristic algorithm inspired by the social behavior of bird flocking or fish schooling. In this algorithm, each solution is called a particle, and a particle is represented by a group of vectors (\(\vec {X}_i ,\vec {V}_i ,\vec {P}_i\)) in a \(d\)-dimensional search space, where \(\vec {X}_i \), \(\vec {V}_i \) and \(\vec {P}_i \) are the position, velocity and personal best position of the \(i\mathrm{th}\) particle respectively. The vectors are defined as follows:

$$\begin{aligned} \vec {X}_i&= (x_{i1} ,x_{i2} ,\ldots ,x_{id} ) \textit{for}\,\, i=1,2,\ldots ,N,\end{aligned}$$
(3)
$$\begin{aligned} \vec {V}_i&= (v_{i1} ,v_{i2} ,\ldots ,v_{id} ) \textit{for}\,\, i=1,2,\ldots ,N,\end{aligned}$$
(4)
$$\begin{aligned} \vec {P}_i&= (p_{i1} ,p_{i2} ,\ldots ,p_{id} ) \textit{for}\,\, i=1,2,\ldots ,N, \end{aligned}$$
(5)

where \(N\) is the number of particles.

The velocity and position of particles are randomly initialized and the next velocity and position of every particle are computed as Eqs. (6) and (7):

$$\begin{aligned}&v_{id} ({t\!+\!1})=w( t)\!\times v_{id} ( t)\!+\!C_1 \times \textit{rand}\times ( {p_{id} ( t)\!-\!x_{id} ( t)})\nonumber \\&\quad +C_2 \times \textit{rand}\times ( {p_{gd} ( t)-x_{id} ( t)}),\end{aligned}$$
(6)
$$\begin{aligned}&x_{id} ( {t+1})=x_{id} ( t)+\mathop v\nolimits _{id} ( {t+1}) \end{aligned}$$
(7)

where \(v_{id}(t)\) and \(x_{id}(t)\) are the current velocity and position of the \(i\mathrm{th}\) particle respectively, \(w(t)\) is the inertia weight, and \(C_{1}\) and \(C_{2}\) are acceleration coefficients. \(\textit{rand}\) is a uniformly distributed random number in the interval [0, 1]. In addition, \(v_{id} ( t)\in [ {-v_{\max } ,v_{\max } }]\), where \(v_{\max}\) is a constant set by the user based on the bounds of the solution space. In Eq. (6), the second and third terms are called the cognitive and social terms respectively. \(p_{gd}\) is the best position achieved by the entire swarm so far in the \(d\mathrm{th}\) dimension. For the population, this position is defined as follows:

$$\begin{aligned} \mathop {Pg}\limits ^\rightarrow =( {p_{g1} ,p_{g2} ,\ldots ,p_{gd} }). \end{aligned}$$
(8)
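To make the update concrete, one PSO iteration over the whole swarm can be sketched as follows (a Python sketch; the array shapes, the per-dimension random numbers and the velocity clipping are implementation assumptions rather than part of Eqs. (6) and (7)):

```python
import numpy as np

def pso_step(x, v, p_best, p_g, w, c1, c2, v_max):
    """One PSO velocity/position update, Eqs. (6) and (7).

    x, v   : current positions and velocities, shape (N, d)
    p_best : personal best positions, shape (N, d)
    p_g    : global best position, shape (d,)
    """
    n, d = x.shape
    r1, r2 = np.random.rand(n, d), np.random.rand(n, d)
    v_new = (w * v
             + c1 * r1 * (p_best - x)        # cognitive term
             + c2 * r2 * (p_g - x))          # social term
    v_new = np.clip(v_new, -v_max, v_max)    # keep velocities in [-v_max, v_max]
    x_new = x + v_new
    return x_new, v_new
```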

Although PSO is easy to implement, it has several disadvantages owing to its poor exploration, such as a slow convergence rate, the need to select parameters, and easily getting trapped in a local optimum (Leung et al. 1997; Hrstka and Kučerová 2004; Liang et al. 2006; Gao et al. 2012; Moslemipour et al. 2012).

2.3 Gravitational search algorithm

GSA is a meta-heuristic algorithm inspired by the Newtonian laws of gravity and motion. According to the law of gravity (Schutz 2003), all particles in the universe attract each other by a gravitational force. This force is directly proportional to the product of their masses and inversely proportional to the square of the distance between them. Accordingly, GSA defines a system of \(N\) agents (masses) in a \(d\)-dimensional search space, and at the beginning the agent positions are initialized randomly. The position of the \(i\mathrm{th}\) agent is given as follows:

$$\begin{aligned} \vec {X}_i =(x_{i1} ,x_{i2} ,\ldots ,x_{id} )\quad \textit{for}\,\, i=1,2,\ldots ,N, \end{aligned}$$
(9)

where \(x_{id}\) is the position of the \(i\mathrm{th}\) agent in the \(d\mathrm{th}\) dimension.

Each mass represents a solution, and the heaviest mass corresponds to the optimum solution in the search space. Over time, masses are attracted toward the heaviest mass. The gravitational force of mass \(j\) on mass \(i\) at a specific time \(t\) is computed as follows:

$$\begin{aligned} F_{ij,d} ( t)=G( t)\frac{M_i ( t)\times M_j ( t)}{R_{ij} ( t)+\varepsilon }( { x_{jd} ( t)- x_{id} ( t)}), \end{aligned}$$
(10)

where \(M_{i}\) and \(M_{j}\) are the masses of agents \(i\) and \(j\) respectively, \(\varepsilon\) is a small constant, and \(G(t)\) is the gravitational constant at time \(t\), initialized at the beginning and decreased over time to control the search accuracy. \(R_{ij}(t)\) is the Euclidean distance between agents \(i\) and \(j\) at time \(t\):

$$\begin{aligned} R_{ij} ( t)= {\left\| {x_i ( t),x_j ( t)} \right\| }_2 . \end{aligned}$$
(11)

\(M_{i}\) in Eq. (10) is calculated as follows:

$$\begin{aligned} m_i ( t)&= \frac{ {\textit{fit}}_i ( t)-\textit{worst}( t)}{\textit{best}( t)-\textit{worst}( t)},\end{aligned}$$
(12)
$$\begin{aligned} M_i ( t)&= \frac{ m_i ( t)}{\sum \nolimits _{j=1}^N { m_j ( t)} }, \end{aligned}$$
(13)

where \(\textit{fit}_{i}(t)\) is the fitness value of agent \(i\), and \(\textit{best}(t)\) and \(\textit{worst}(t)\) are the best and worst fitness values at time \(t\).

Also, the total force acting on agent \(i \) at time \(t \) in the \(d\)th dimension is computed by Eq. (14):

$$\begin{aligned} F_{id} ( t)=\sum \limits _{j\in \textit{kbest},j\ne i}^N { {\textit{rand}_j \times F}_{ij,d} ( t)}, \end{aligned}$$
(14)

where \(\textit{rand}_{j}\) is a random number in the range [0, 1]. \(\textit{Kbest}\) is a function of time, with initial value \(K_{0}\), which is linearly decreased over the iterations. In fact, \(\textit{Kbest}\) is the set of the first \(K\) agents with the best fitness values and the biggest masses.

The force accelerates agent \(i\) according to Eq. (15), and the agent moves from its current position to a new one:

$$\begin{aligned}&a_{id} ( t)=\frac{ F_{id} (t)}{ M_i ( t)},\end{aligned}$$
(15)
$$\begin{aligned}&v_{id} ( {t+1})= v_{id} ( t)+ {\textit{rand}}_i {\times \,a}_{id} ( t),\end{aligned}$$
(16)
$$\begin{aligned}&x_{id} ( {t+1})= x_{id} ( t)+ v_{id} ( {t+1}), \end{aligned}$$
(17)

where \(a_{id}(t)\) represents the acceleration of the \(i\mathrm{th}\) agent, and \(v_{id}(t)\) and \(x_{id}(t)\) are its current velocity and position. Also, \(v_{id}(t+1)\) and \(x_{id}(t+1)\) denote the next velocity and position of agent \(i\) in dimension \(d\) respectively.
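The above equations can be summarized in the following sketch of one GSA iteration for a minimization problem (the double loop over agent pairs and the guards against division by zero are implementation choices of this sketch, not part of the original formulation):

```python
import numpy as np

def gsa_step(x, v, fitness, g, kbest, eps=1e-10):
    """One GSA iteration, Eqs. (10)-(17), for minimization (sketch)."""
    n, d = x.shape
    best, worst = fitness.min(), fitness.max()
    denom = (best - worst) if best != worst else eps
    m = (fitness - worst) / denom                      # Eq. (12)
    M = m / (m.sum() + eps)                            # Eq. (13)

    kbest_idx = np.argsort(fitness)[:kbest]            # the Kbest agents
    F = np.zeros((n, d))
    for i in range(n):
        for j in kbest_idx:
            if j == i:
                continue
            r = np.linalg.norm(x[i] - x[j])            # R_ij, Eq. (11)
            F[i] += (np.random.rand()                  # rand_j, Eq. (14)
                     * g * M[i] * M[j] / (r + eps)
                     * (x[j] - x[i]))                  # Eq. (10)

    a = F / (M[:, None] + eps)                         # Eq. (15)
    v_new = np.random.rand(n, 1) * v + a               # Eq. (16)
    x_new = x + v_new                                  # Eq. (17)
    return x_new, v_new
```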

Similar to other meta-heuristic algorithms, GSA also has some weaknesses, such as complex operators and a long running time (Beheshti and Shamsuddin 2013a, b).

2.4 Imperialist competitive algorithm

ICA is a global search optimization algorithm inspired by imperialistic competition. The algorithm is population-based, and each individual of the population is called a country. Some of the countries with the best costs are chosen as imperialist states, and the others, named colonies, are divided among the imperialists according to their costs. An empire is formed by an imperialist state and its colonies. After the empires are formed, imperialistic competition starts among them, causing weak empires to collapse until the most powerful empire remains. To divide the colonies among the imperialists, the cost and power of each imperialist are normalized and the initial colonies are allocated to the empires according to Eqs. (18)–(20):

$$\begin{aligned}&C_n = c_n - \mathop {\max }\limits _i \left\{ { c_i } \right\} \!,\end{aligned}$$
(18)
$$\begin{aligned}&p_n =\left| {\frac{ C_n }{\sum \nolimits _{i=1}^{ N_{imp} } { C_i } }} \right| ,\end{aligned}$$
(19)
$$\begin{aligned}&{N.C}_n =\textit{round}\left\{ {p_n \cdot N_{col} } \right\} \!, \end{aligned}$$
(20)

where \(c_{n}\) is the cost of the \(n\mathrm{th}\) imperialist and \(C_{n}\) and \(p_{n}\) are the normalized cost and power of the imperialist respectively. Also, \(N.C_{n}\) is the initial number of colonies related to the \(n\mathrm{th}\) empire and \(N_\textit{col}\) is the total number of initial colonies.

To form the \(n\mathrm{th}\) empire, \(N.C_{n}\) of the colonies are randomly selected and allocated to the \(n\mathrm{th}\) imperialist. In the real world, imperialist states try to make their colonies a part of themselves. This process, called assimilation, is modelled by moving all of the colonies toward their imperialist, as illustrated in Fig. 3. In this figure, \(\theta\) and \(x\) are a uniformly distributed random angle and number respectively, and \(d\) is the distance between the imperialist and the colony.

$$\begin{aligned} x&\sim U( {0,\beta \times d}),\end{aligned}$$
(21)
$$\begin{aligned} \theta&\sim U( {-\gamma ,\gamma }), \end{aligned}$$
(22)

where \(\beta\) and \(\gamma\) are parameters that cause the colonies to move toward their relevant imperialist in a randomly deviated direction.
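A simplified sketch of this assimilation move is given below; the angular deviation \(\theta\) of Eq. (22) is approximated here by a small uniform perturbation of the direction vector, which is an assumption of the sketch rather than the exact geometric construction of Fig. 3:

```python
import numpy as np

def assimilate(colony, imperialist, beta=2.0, gamma=0.5):
    """Move a colony toward its imperialist, Eqs. (21) and (22) (sketch)."""
    direction = imperialist - colony
    dist = np.linalg.norm(direction)
    step = np.random.uniform(0.0, beta * dist)                   # x ~ U(0, beta*d), Eq. (21)
    unit = direction / (dist + 1e-12)
    deviation = np.random.uniform(-gamma, gamma, colony.shape)   # stands in for theta, Eq. (22)
    return colony + step * (unit + deviation)
```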

Fig. 3
figure 3

Movement of colonies toward their relevant imperialist

While a colony moves toward an imperialist, it might reach a better position than the imperialist; in this case, the colony and the imperialist exchange positions. Also, a revolution may take place among the colonies, which changes the power or organizational structure of the empire. This revolution is governed by a revolution rate, which specifies the percentage of colonies that randomly change their positions.

Finally, the empires compete in order to possess and control the colonies of other empires, as shown in Fig. 4. A colony of the weakest empire is selected, and the possession probability of each empire, \(p_{p}\), is computed by Eq. (24). The normalized total cost of an empire, \(N.T.C_{n}\), is obtained from Eq. (23) and used to compute the possession probability.

$$\begin{aligned}&N.T. C_n =T. C_n - \mathop {\max }\limits _i \left\{ { {T.C}_i } \right\} .\end{aligned}$$
(23)
$$\begin{aligned}&p_{ p_n } =\left| {\frac{ {N.T.C}_n }{\sum \nolimits _{i=1}^{ N_{imp} } { {N.T.C}_i } }} \right| . \end{aligned}$$
(24)
Fig. 4
figure 4

Imperialistic competition

Vector \(P\) is formed in order to divide the mentioned colonies among empires:

$$\begin{aligned} P=\left[ { p_{ p_1 } , p_{ p_2 } , p_{ p_3 } ,\ldots , p_{ p_{ N_\textit{imp} } } } \right] . \end{aligned}$$
(25)

Vector \(R\) with the same size as \(P\) whose elements are uniformly distributed random numbers is created:

$$\begin{aligned} \begin{aligned}&R=\left[ {r_1, r_2, r_3,\ldots , r_{ N_\textit{imp} } } \right] \\&r_1 , r_2 , r_3 ,\ldots , r_{ N_\textit{imp} } \in U( {0,1}). \end{aligned} \end{aligned}$$
(26)

Vector \(D\) is formed so that the mentioned colony (or colonies) is given to the empire whose corresponding element in \(D\) is the largest.

$$\begin{aligned} D=P-R=\left[ { D_1 , D_2 , D_3 ,\ldots , D_{ N_\textit{imp} } } \right] . \end{aligned}$$
(27)

According to vector \(D\), the process of choosing an empire is similar to the roulette wheel in GA; however, this method is faster because the selection is based directly on probability values.
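A compact sketch of this selection mechanism is given below; total_costs is assumed to hold the current \(T.C_n\) values of all empires, and the small constant in the denominator is only a guard against division by zero:

```python
import numpy as np

def select_winning_empire(total_costs):
    """Choose the empire that takes the weakest empire's colony, Eqs. (23)-(27)."""
    ntc = total_costs - total_costs.max()       # normalized total cost, Eq. (23)
    p = np.abs(ntc / (ntc.sum() + 1e-12))       # possession probabilities, Eq. (24)
    r = np.random.uniform(size=p.shape)         # vector R, Eq. (26)
    d = p - r                                   # vector D, Eq. (27)
    return int(np.argmax(d))                    # empire with the largest element of D wins
```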

The competition gradually changes the power of the empires, making each weaker or stronger. Eventually, all colonies of a weak empire are taken over by more powerful empires and the weak empire is eliminated. The total power (cost) of an empire is modelled as the power (cost) of the imperialist country plus a percentage of the mean power (cost) of its colonies:

$$\begin{aligned} T.C_n&= \textit{Cost}( {\textit{imperialist}_n })\nonumber \\&+\,\xi \,\,\textit{mean}\left\{ {\textit{Cost}( {\textit{colonies}\,\,\textit{of}\,\textit{empire}_n })} \right\} , \end{aligned}$$
(28)

where \(T.C_{n}\) is the total cost of the \(n\mathrm{th}\) empire and \(\xi \) is a positive small number.

The competition continues until only one empire remains or a predetermined maximum number of iterations is reached. ICA also has some drawbacks, such as complex operators, several parameters to tune and a long running time (Beheshti and Shamsuddin 2013a, b). These limitations prevent ICA from being easily applied to different problems.

The described algorithms are applied to MLP learning to classify several medical datasets. In the next section, the classification concept used in medical disease diagnosis is briefly explained.

2.5 Classification

Classification is used to understand existing data and to predict how new instances will behave. In other words, the objective of data classification is to assign data to different classes. Classification techniques include neural networks (Bishop 1995), decision tree induction (Breiman et al. 1984; Quinlan 1993) and support vector machines (Cristianini and Shawe-Taylor 2000). Researchers have employed these methods on a variety of classification problems (Chen et al. 2007; Mitra and Wang 2008; Reynolds and Iglesia 2009; Saxena and Saad 2007; Qasem and Shamsuddin 2011; Awad and Motai 2008; García et al. 2009; Fana et al. 2011; Ekici 2012).

Correct classification is measured by the number of true positives (TP) and true negatives (TN), whereas misclassification is measured by the number of false positives (FP) and false negatives (FN). The accuracy of predicting events is measured by sensitivity, computed as TP divided by the total number of actual positives, while specificity measures the accuracy of predicting nonevents as TN divided by the total number of actual negatives, evaluated over a range of cutoffs. The receiver operating characteristic (ROC) curve is another tool to analyze the performance of algorithms in classification problems: it is a graphical plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity) that illustrates the quality of a classifier. AUC is the area under the ROC curve; since the AUC is a portion of the area of the unit square, its value lies between 0.0 and 1.0. Sensitivity and specificity are computed by the following equations (Fawcett 2006):

$$\begin{aligned} \textit{Sensitivity} = \frac{\textit{TP}}{\textit{TP}+\textit{FN}}(\% ),\end{aligned}$$
(29)
$$\begin{aligned} \textit{Specificity} = \frac{\textit{TN}}{\textit{TN}+\textit{FP}}(\% ), \end{aligned}$$
(30)

where TP \(+\) FN is the number of positive patterns and TN \(+\) FP is the number of negative patterns in a dataset.

Accordingly, accuracy can be defined as follows:

$$\begin{aligned} \textit{Accuracy} = \frac{\textit{TN}+\textit{TP}}{\textit{TN}+\textit{TP}+\textit{FN}+\textit{FP}}(\% ) \end{aligned}$$
(31)
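As a simple illustration, these three measures can be computed directly from the four counts (a minimal sketch; the counts used in the example are hypothetical and not taken from the experiments):

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy in percent, Eqs. (29)-(31)."""
    sensitivity = 100.0 * tp / (tp + fn)                  # Eq. (29)
    specificity = 100.0 * tn / (tn + fp)                  # Eq. (30)
    accuracy = 100.0 * (tn + tp) / (tn + tp + fn + fp)    # Eq. (31)
    return sensitivity, specificity, accuracy

# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN
print(classification_metrics(40, 45, 5, 10))   # (80.0, 90.0, 85.0)
```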

In the next section, a new meta-heuristic algorithm is introduced to enhance ANN learning and is used to train the MLP for classification problems.

3 Methods

This section introduces CAPSO as a new global optimization algorithm to improve ANN accuracy, followed by the proposed hybrid learning of CAPSO and MLP (CAPSO-MLP).

3.1 CAPSO algorithm

The CAPSO algorithm is an efficient optimization algorithm that combines the laws of motion from Newtonian mechanics with the PSO algorithm. According to these laws, if an object with initial velocity \(v_{1}\) moves from position \(x_{1}\) to position \(x_{2}\) and its velocity changes to \(v_{2}\) during the time step \(\Delta t\), the object is accelerated. The new velocity and position are obtained as follows:

$$\begin{aligned} v_2&= v_1 +a\times \Delta t,\end{aligned}$$
(32)
$$\begin{aligned} x_2&= x_1 +\textstyle {1 \over 2}\times a\times \Delta t^2+v_1 \times \Delta t, \end{aligned}$$
(33)

where \(a\) is the acceleration of the object.

Based on this concept, each solution candidate in the CAPSO algorithm, called a particle, has four attributes: position, velocity, acceleration and centripetal acceleration. Considering a \(d\)-dimensional search space with \(N\) particles, the position and velocity of the \(i\mathrm{th}\) particle are as follows:

$$\begin{aligned}&\mathop {X_i }\limits ^\rightarrow =( { x_{i1} , x_{i2} ,\ldots , x_{id} }) \quad {\textit{for}}\,\, {i=1,2,\ldots ,N.}\end{aligned}$$
(34)
$$\begin{aligned}&{\mathop {V_i }\limits ^\rightarrow =( { v_{i1} , v_{i2} ,\ldots , v_{id} })} \quad \textit{for}\,\, i=1,2,\ldots ,N. \end{aligned}$$
(35)

Also, the personal best position of the \(i\mathrm{th}\) particle, \(\mathop {P_i }\limits ^\rightarrow \), and the best position explored by the population so far, \(\mathop {Pg}\limits ^\rightarrow \), are given by Eqs. (36) and (37):

$$\begin{aligned}&\mathop {P_i }\limits ^\rightarrow =( { p_{i1} , p_{i2} ,\ldots , p_{id} })\quad \textit{for} \quad i=1,2,\ldots ,N.\end{aligned}$$
(36)
$$\begin{aligned}&\mathop {Pg}\limits ^\rightarrow =( { p_{g1} , p_{g2} ,\ldots , p_{gd} }). \end{aligned}$$
(37)

Every particle updates its velocity based on current velocity, acceleration and centripetal acceleration:

$$\begin{aligned} v_{id} ( {t+1})= v_{id} ( t)+ a_{id} ( t)+ A_{id} ( t), \end{aligned}$$
(38)

where \(v_{id}(t+1)\) and \(v_{id}(t)\) are the next and current velocities respectively.

Additionally, \(a_{id}(t)\) and \(A_{id}(t)\) represent the acceleration and centripetal acceleration as follows:

$$\begin{aligned} a_{id} ( t)&= \textit{rand}\times ( { p_{\textit{id}} ( t)\!-\! x_{\textit{id}} ( t)})\!+\!\textit{rand}\nonumber \\&\times ( { p_{gd} ( t)\!- \!x_{\textit{id}} ( t)}), \end{aligned}$$
(39)
$$\begin{aligned} A_{\textit{id}} ( t)&= E_i ( t)\times \textit{rand}\times ( { p_{\textit{id}} ( t)- p_{\textit{md}} ( t)- x_{\textit{id}} ( t)}),\nonumber \\ \end{aligned}$$
(40)

where \(\textit{rand}\) is a uniformly distributed random number in the interval [0, 1], \(x_{id}(t)\) is the current position, \(p_{id}(t)\) is the personal best position of the \(i\mathrm{th}\) particle, and \(p_{gd}(t)\) is the global best position explored in the \(d\mathrm{th}\) dimension. Also, \(p_{md}(t)\) is the current median position of the swarm in dimension \(d\), and \(E_{i}(t)\) is an acceleration coefficient computed by Eq. (42). Moreover, following Eq. (33), the next position of the \(i\mathrm{th}\) particle is obtained by Eq. (43):

$$\begin{aligned}&e_i ( t)= {\textit{fit}}_i ( t)-\textit{GWfit}( t),\end{aligned}$$
(41)
$$\begin{aligned}&E_i ( t)=\frac{ e_i ( t)}{\sum \nolimits _{j=1}^N { e_j ( t)} },\end{aligned}$$
(42)
$$\begin{aligned}&x_{\textit{id}} ( {t+1})= x_{\textit{id}} ( t)+\textstyle {1 \over 2} {\times a}_{\textit{id}} ( t)+ v_{\textit{id}} ( {t+1}) \end{aligned}$$
(43)

where \(\textit{fit}_i(t)\) represents the fitness value of particle \(i\) and \(\textit{GWfit}(t)\) is the worst fitness value explored so far by the swarm.
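Putting Eqs. (38)-(43) together, one CAPSO iteration can be sketched as follows; interpreting \(p_{md}(t)\) as the per-dimension median of the current particle positions, assuming a minimization problem and adding a small constant to the denominator of Eq. (42) are assumptions of this sketch:

```python
import numpy as np

def capso_step(x, v, p_best, p_g, fitness, gw_fit, eps=1e-12):
    """One CAPSO velocity/position update, Eqs. (38)-(43) (sketch).

    x, v    : current positions and velocities, shape (N, d)
    p_best  : personal best positions, shape (N, d)
    p_g     : global best position, shape (d,)
    fitness : current fitness values of the N particles
    gw_fit  : worst fitness value explored so far by the swarm
    """
    n, d = x.shape
    p_m = np.median(x, axis=0)                              # median position of the swarm

    e = fitness - gw_fit                                    # Eq. (41)
    E = e / (e.sum() + eps)                                 # Eq. (42)

    a = (np.random.rand(n, d) * (p_best - x)
         + np.random.rand(n, d) * (p_g - x))                # acceleration, Eq. (39)
    A = E[:, None] * np.random.rand(n, d) * (p_best - p_m - x)   # centripetal acc., Eq. (40)

    v_new = v + a + A                                       # Eq. (38)
    x_new = x + 0.5 * a + v_new                             # Eq. (43)
    return x_new, v_new
```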

3.2 Analysis and design of the proposed algorithm

In the CAPSO algorithm, the two terms of acceleration, \(\vec {a}_i\), and centripetal acceleration, \(\vec {A}_i\), play a key role in increasing the convergence rate. As shown in Fig. 5, if a particle is far from the solutions obtained so far by the swarm (\(\vec {P}_g, \vec {P}_i, \vec {P}_m\)), these terms help to move the particle toward those solutions. Also, \(\vec {A}_i\) helps the algorithm escape from local optima. Figure 5a illustrates the term \(\vec {A}_i\) computed by Eq. (40), and Fig. 5b gives the graphical representation of \(\vec {a}_i\) according to Eq. (39).

Fig. 5
figure 5

Graphical representation of a \(\vec {A}_i\) and b \(\vec {a}_i\)

Figure 6 shows the flowchart of the proposed method, and the CAPSO pseudo code is given in Fig. 7. As shown in these figures, the velocities and positions of the particles are randomly initialized and each particle is evaluated based on its fitness value. \(\mathop {P_i }\limits ^\rightarrow \) is computed for each particle and the best position explored by the swarm is chosen as \(\mathop {Pg}\limits ^\rightarrow \). The next velocity and position of each particle are obtained from Eqs. (38)–(43). These steps continue until a stopping criterion is met, and the best solution is returned by the algorithm.

Fig. 6
figure 6

Flowchart of CAPSO

Fig. 7
figure 7

CAPSO pseudo code

3.3 Hybrid learning of CAPSO and MLP network

In this section, CAPSO is used for MLP learning. The approach is the hybrid learning of CAPSO and the MLP network (CAPSO-MLP), designed to enhance the network's accuracy. The algorithm simultaneously determines the set of weights and its corresponding accuracy by training the network. The MLP network can be represented as a vector of dimension \(D\) containing the network weights and biases, where \(D\) is defined by Eq. (44). To optimize the MLP weights using the CAPSO algorithm, the dimension of each particle is set to \(D\):

$$\begin{aligned} D&= ( {\textit{Input}\times \textit{Hidden}})+( {\textit{Hidden}\times \textit{Output}})\nonumber \\&+\,\textit{Hidden}_{\textit{bias}} +\textit{Output}_{\textit{bias}}, \end{aligned}$$
(44)

where \(\textit{Input}\), \(\textit{Hidden}\) and \(\textit{Output}\) refer to the number of input, hidden and output neurons of the MLP network respectively. Also, \(\textit{Hidden}_{\textit{bias}}\) and \(\textit{Output}_{\textit{bias}}\) are the numbers of biases in the hidden and output layers.

Figure 8 shows the flowchart of the hybrid learning of CAPSO and MLP (CAPSO-MLP). CAPSO-MLP starts by collecting, normalizing and reading a dataset. This is followed by setting the desired number of input, output and hidden neurons to determine the dimension of the particles according to Eq. (44). The population is initialized and, after MLP training, the training error is calculated as the objective function. According to the training error, every particle updates its velocity and position. The new positions are the new weights of the MLP network, which should minimize the objective function. The objective function is the mean square error (MSE), computed as Eq. (45). These steps continue until the stopping conditions are met.

$$\begin{aligned} \textit{MSE}=\frac{1}{P}\sum \limits _{j=1}^P { {( { t_j - O_j })}^2 }, \end{aligned}$$
(45)

where \(t_{j}\) and \(O_{j}\) are the desired output and the network output respectively, and \(P\) is the number of patterns.
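To make the particle encoding and the objective function concrete, the following sketch decodes a particle of dimension \(D\) (Eq. 44) into the weights and biases of a single-hidden-layer MLP and evaluates the MSE of Eq. (45); the ordering of the weight and bias segments inside the particle and the sigmoid output units are assumptions of this sketch:

```python
import numpy as np

def particle_dimension(n_in, n_hid, n_out):
    """Dimension D of each particle, Eq. (44)."""
    return n_in * n_hid + n_hid * n_out + n_hid + n_out

def mlp_forward(x, particle, n_in, n_hid, n_out):
    """Decode a particle into MLP weights/biases and compute the network output."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    w1 = particle[:n_in * n_hid].reshape(n_hid, n_in)              # input-to-hidden weights
    b1 = particle[n_in * n_hid:n_in * n_hid + n_hid]               # hidden biases
    off = n_in * n_hid + n_hid
    w2 = particle[off:off + n_hid * n_out].reshape(n_out, n_hid)   # hidden-to-output weights
    b2 = particle[off + n_hid * n_out:]                            # output biases
    h = sig(w1 @ x + b1)
    return sig(w2 @ h + b2)

def mse(particle, X, T, n_in, n_hid, n_out):
    """Objective function of CAPSO-MLP, Eq. (45); X: input patterns, T: desired outputs."""
    outputs = np.array([mlp_forward(x, particle, n_in, n_hid, n_out) for x in X])
    return np.mean(np.sum((T - outputs) ** 2, axis=1))
```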

Fig. 8
figure 8

Flowchart of hybrid learning of CAPSO and MLP network

4 Experimental studies

To evaluate the performance of CAPSO-MLP, nine medical datasets (Kuncheva 2005; Bache and Lichman 2013) are selected, as detailed in Table 1. As shown in this table, the selected datasets (Hepatitis, Heart Disease, Pima Indian Diabetes, Wisconsin Prognostic Breast Cancer, Parkinson's disease, Echocardiogram (Heart attack), Liver Disorders, Laryngeal 1 and Acute Inflammations) are all binary classification problems. Each dataset is represented as input patterns and outputs in the hybrid learning of the MLP network; therefore, the effectiveness and performance of MLP network learning are verified on different datasets. The CAPSO, PSO, GSA and ICA algorithms are used for MLP learning on these datasets to assess the ability of the proposed approach. In this context, the hybrids of the latter algorithms and the MLP network are called PSO-MLP, GSA-MLP and ICA-MLP respectively.

Table 1 Description of datasets

4.1 Datasets design

The Hepatitis dataset is complex and noisy and has a large number of missing values (167 in the whole dataset). The classification task is to predict whether a patient with hepatitis will live or die. There are 19 attributes/inputs, and the output class distribution is live: 123 and die: 32.

The absence or presence of heart disease is predicted with the Statlog (Heart Disease) dataset. This dataset contains 13 attributes/inputs, and the output class distribution is absence: 150 and presence: 120.

The Pima Indian Diabetes dataset is used to diagnose, based on personal data and medical examinations, whether a person has diabetes or not. There are 8 attributes/inputs, and the output class distribution is diabetes positive: 268 and diabetes negative: 500.

The aim of the Wisconsin Prognostic Breast Cancer (WPBC) dataset is to predict whether a tumor is recurrent or non-recurrent. Each record represents follow-up data for one breast cancer case. There are 33 attributes/inputs, and the output class distribution is recurrence: 47 and non-recurrence: 151.

The objective of the Parkinson's disease dataset is to discriminate healthy people from those with Parkinson's disease. This dataset is composed of a range of biomedical voice measurements with 22 attributes/inputs, and the output class distribution is health status: 147 and Parkinson's disease: 48.

In the Echocardiogram dataset, all patients suffered a heart attack at some point in the past; some of them are still alive and some are not. The dataset has 132 samples and 12 attributes/inputs. The number of missing values in the whole dataset is 132, and the distribution of the output attribute is alive: 43, dead: 88 and missing: 1.

The main aim of the Liver Disorders dataset is to diagnose a patient as healthy (normal) or abnormal. It has six attributes/inputs and two output classes (abnormal or normal). The first five variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. This dataset contains 345 examples, of which 200 are negative (normal) and 145 are positive (abnormal). Each line in the Liver Disorders dataset constitutes the record of a single male individual.

The Laryngeal 1 dataset is used to classify voices as normal or pathological for the diagnosis of laryngeal pathology. The number of attributes/inputs is 16 (all continuous-valued). The total number of instances is 213, and the distribution of the output attribute is 81 normal and 132 pathological voices.

The Acute Inflammations dataset was created to diagnose two diseases of the urinary system: inflammation of the urinary bladder and nephritis of renal pelvis origin; it was originally designed for rule detection based on rough sets theory. The dataset has 120 samples and 6 attributes/inputs. In this paper, the algorithms are evaluated on classifying the inflammation of urinary bladder output. The distribution of the output attribute is positive: 59 and negative: 61.

The datasets are preprocessed for better generalization of error and accuracy. Real-world data are typically noisy because they are incomplete and inconsistent. The datasets are normalized, and missing values are replaced by the mean of the corresponding attribute.

4.2 Experimental setup

Each dataset is partitioned into a training set and a testing set. The training set is used to train the network in order to achieve the optimal weights. The testing set is applied to unseen data to test the generalization performance of the meta-heuristic algorithms on the MLP network. For each dataset, 80 % of the data are employed for the training set and the remaining 20 % for the testing set.

The number of input and output neurons in the MLP network is problem-dependent, and the number of hidden nodes is computed from the number of input neurons according to the Kolmogorov theorem (Hecht-Nielsen 1987):

$$\begin{aligned} \textit{Hidden}=2\times \textit{Input}+1. \end{aligned}$$
(46)

The number of iterations is set to 500 for all datasets. The population size is set to 40 in CAPSO, PSO and GSA; in ICA, the numbers of countries and imperialists are 160 and 8 respectively. In PSO, \(w\) is set to 0.7 and linearly decreased to 0.4, and \(V_{max}\) is set to 0.5. The acceleration coefficients \(C_{1}\) and \(C_{2}\) are both equal to 2. In ICA, the revolution rate \(= 0.5\), \(\beta =2\), \(\gamma =0.5\) and \(\xi = 0.02\) (Tayefeh-Mahmoudi et al. 2013). In addition, \(G\) in GSA is computed as Eq. (47):

$$\begin{aligned} G(t)= G_0 e^{\frac{-\alpha t}{T}} \end{aligned}$$
(47)

where \(G_{0}\) and \(\alpha\) are set to 100 and 20 respectively, and \(T\) is the maximum number of iterations. At the beginning, \(\textit{Kbest}\) in Eq. (14) is initialized to 40 (the population size) and then linearly decreased to 1.

4.3 Results and discussion

This section presents the results of CAPSO, PSO, GSA and ICA adapted to the MLP network. The experiments are conducted on the nine medical datasets, each of which is partitioned into a training set and a testing set as described in Sect. 4.2: the training set is used to obtain the optimal weights, and the testing set evaluates the generalization performance of the meta-heuristic algorithms on unseen data.

Each algorithm is independently run 10 times, and the results are shown in Tables 2, 3, 4, 5 and 6. In these tables, Mean, Min and Max indicate the average, minimum and maximum values respectively. The algorithms are also ranked based on the mean solution in each row of a table, and the overall rank of each algorithm is computed over each column.

Table 2 MSE of CAPSO-MLP, PSO-MLP, GSA-MLP and ICA-MLP for all datasets
Table 3 Comparison of datasets in terms of accuracy (%)
Table 4 Comparison of datasets in terms of sensitivity (%)
Table 5 Comparison of datasets in terms of specificity (%)
Table 6 Comparison of datasets in terms of AUC

Table 2 shows the training and testing errors (MSE) of the MLP network trained with the mentioned algorithms on the Hepatitis, Heart Disease, Diabetes, Breast Cancer, Parkinson's disease, Echocardiogram, Liver Disorders, Laryngeal 1 and Acute Inflammations datasets. In this table, the training and testing ranks indicate that CAPSO-MLP has a better convergence rate than the other algorithms, whereas GSA-MLP shows the worst convergence rate. Figure 9 illustrates the training error (MSE) of CAPSO-MLP, PSO-MLP, GSA-MLP and ICA-MLP on the Hepatitis, Heart Disease and Breast Cancer datasets. As observed, CAPSO-MLP provides the lowest training error for all datasets, while GSA-MLP shows the worst MSE among the algorithms.

Fig. 9
figure 9

Training errors (MSE) of CAPSO-MLP, PSO-MLP, GSA-MLP and ICA-MLP for a Hepatitis, b Heart Disease and c Breast Cancer

Table 3 presents the classification accuracy of CAPSO-MLP, PSO-MLP, GSA-MLP and ICA-MLP over ten independent runs on the datasets. The accuracy refers to the percentage of correct classifications on the training and testing data respectively. As depicted in this table, CAPSO-MLP provides the best performance on unseen (testing) data: Hepatitis with 71.29 %, Heart Disease with 81.85 %, Diabetes with 72.99 %, Wisconsin Prognostic Breast Cancer with 80.25 %, Parkinson's disease with 92.56 %, Echocardiogram (Heart attack) with 88.46 %, Liver Disorders with 72.32 %, Laryngeal 1 with 81.63 % and Acute Inflammations with 100 %. It is also noticeable that, compared to the other algorithms, the proposed method has the highest classification accuracy on the Hepatitis and Echocardiogram datasets, which contain missing data, whereas GSA-MLP shows the lowest accuracy among the algorithms.

In addition, to evaluate the classification abilities of CAPSO-MLP, PSO-MLP, GSA-MLP and ICA-MLP, a comparison of the algorithms in terms of sensitivity, specificity and AUC is presented in Tables 4, 5 and 6.

In terms of sensitivity and specificity, the proposed method offers better results than the other algorithms; in particular, CAPSO-MLP yields better sensitivity and specificity on the testing (unseen) data.

As shown in Table 6, the proposed method provides the best AUC on both the training and testing sets, except for the Liver Disorders dataset. Among the algorithms, GSA-MLP has the lowest AUC on both training and testing data. The better results of CAPSO-MLP are also observable in Fig. 10: as the ROC curves for the training and testing Heart Disease data illustrate, CAPSO-MLP yields the best AUC on this dataset.

Fig. 10
figure 10

ROC curve of a training data for Heart Disease and b testing data for Heart Disease

Further, in order to determine whether the results achieved by the algorithms are statistically different from each other, the \(t\)-test (Box et al. 2005) and Wilcoxon's signed-ranks test (Wilcoxon 1945) are conducted between the results obtained by the proposed CAPSO-MLP and those of PSO-MLP, GSA-MLP and ICA-MLP. The tests are performed on the training and testing accuracies for each dataset. In these tests, a \(p\)-value is computed and, as shown in Table 7, if the \(p\)-value is greater than the significance level \(\alpha = 0.05\), there is no significant difference between the algorithms; otherwise, the difference is significant. In the table, the symbols '\(>\)' and '\(=\)' mean that the algorithm on the left side is, respectively, significantly better and not significantly better than the algorithm on the right side. As illustrated, there are significant differences between CAPSO-MLP and the other algorithms.
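As an illustration of how these tests can be computed, the following sketch applies SciPy to hypothetical accuracy values from 10 runs of two algorithms (the numbers are illustrative only and are not taken from Table 7; the use of paired tests is likewise an assumption of the sketch):

```python
import numpy as np
from scipy import stats

# Hypothetical testing accuracies over 10 independent runs (illustrative values only)
capso_acc = np.array([81.9, 82.4, 80.7, 81.5, 82.0, 81.1, 82.6, 81.8, 80.9, 82.2])
pso_acc   = np.array([79.5, 80.1, 78.9, 79.8, 80.4, 79.2, 80.0, 79.6, 78.7, 79.9])

t_stat, t_p = stats.ttest_rel(capso_acc, pso_acc)   # paired t-test
w_stat, w_p = stats.wilcoxon(capso_acc, pso_acc)    # Wilcoxon signed-ranks test

alpha = 0.05
print("t-test p-value:", t_p, "significant:", t_p < alpha)
print("Wilcoxon p-value:", w_p, "significant:", w_p < alpha)
```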

Table 7 Results of \(t\)-test and Wilcoxon’s signed-ranks test taking into account accuracy rate

5 Conclusions

In this paper, an improved scheme combining the particle swarm optimization (PSO) algorithm and Newton's laws of motion, called centripetal accelerated particle swarm optimization (CAPSO), has been proposed to accelerate the learning and convergence of classifiers. CAPSO has been adapted to the MLP network to optimize the weights and accuracy of the network. The results on nine medical disease diagnosis benchmarks indicate that the proposed method outperforms PSO-MLP, GSA-MLP and ICA-MLP in terms of convergence rate and classification accuracy. The superiority of CAPSO-MLP over the other methods is considerable, especially on unseen (testing) data and on datasets with many missing values.

The main advantages of CAPSO for neural network learning are its simple concept, ease of implementation and the fact that no algorithm-specific parameters need to be tuned. The algorithm requires only common control parameters such as the number of generations and the population size. Therefore, it can be concluded that the proposed method is a viable tool for evolving neural network learning and accuracy.