1 Introduction

Support vector machine (SVM) is one of the best-known machine learning methods for classification and regression problems. It was introduced by Vapnik, and it has been used in many applications, such as biometrics (Tharwat et al. 2018), bioinformatics (Byvatov and Schneider 2002), and chemoinformatics (Tharwat and Moemen 2017). In the SVM classifier, the training patterns are used to train the classification model, and this model is then used to classify unknown patterns. SVM has two main types of parameters: the penalty parameter (C) and the kernel parameters. The penalty parameter determines the trade-off between maximizing the classification margin and minimizing the training error, while the kernel parameters control the transformation of the data from the input feature space to a high-dimensional space. Thus, the chosen values of these parameters control the performance of SVM.

Swarm intelligence algorithms have achieved competitive results on optimization problems, including parameter tuning (Wang and Guo 2013; Yang 2014). Particle swarm optimization (PSO) is one of the best-known optimization algorithms. It is a population-based stochastic optimization algorithm with a small number of parameters; hence, it can be easily implemented. In the PSO algorithm, the particles move through the search space with a velocity that is constantly updated by (1) the particle's own experience and (2) the experience of the whole swarm (Eberhart and Kennedy 1995).

PSO has been applied to many optimization problems. It is widely used in image processing; for example, Maitra and Chatterjee used the PSO algorithm to search for optimal thresholding values in image segmentation (Maitra and Chatterjee 2008), and Chander et al. used a new variant of the PSO algorithm for image segmentation (Chander et al. 2011). In power applications, the PSO algorithm has been used for different purposes. Miyatake et al. used the PSO algorithm for maximum power point tracking of multiple photovoltaic arrays and found that it converged rapidly to the global maximum power point (Miyatake et al. 2011). Panda and Padhy used PSO to find the optimal location of the static synchronous compensator and its coordinated design with power system stabilizers (Panda and Padhy 2008); they found that the PSO algorithm improved the stability of the power system. The PSO algorithm has also been used to solve mathematical problems. Liang et al. used the PSO algorithm to find the global solutions of multimodal functions (Liang et al. 2006). Moreover, Xinchao proposed a perturbed PSO algorithm (Xinchao 2010); he used the new algorithm to search for the optimal solutions of 12 functions and found that it achieved better results than the standard PSO algorithm. In machine learning, PSO has been used to search for SVM parameters (Subasi 2013). Moreover, the PSO algorithm has been used to adjust the weights of back-propagation (BP) neural networks, and it achieved good results compared with the BP algorithm (Bashir and El-Hawary 2009). In addition, PSO has been used for feature subset selection (Lin et al. 2008). In clustering, PSO has been used to search for the centroids of a user-specified number of clusters, and it achieved better results than k-means clustering (Merwe et al. 2003). However, due to their stochastic nature, swarm intelligence algorithms such as PSO are never guaranteed to find the optimal solution of a problem, although they often find a good solution if one exists.

Many PSO variants have been proposed that use different probability distributions to generate the random numbers. Richer and Blackwell used the Lévy distribution to improve the performance of the PSO algorithm (Richer and Blackwell 2006). Moreover, Kennedy used a double-exponential distribution, together with variants of the Gaussian distribution, in the PSO algorithm and achieved good results (Kennedy 2004, 2005). The influence of different probability distributions, such as the Gaussian, Cauchy, and exponential distributions, was studied by Krohling and Coelho, who found that these distributions could improve the performance of the PSO algorithm (Krohling 2004; Santos Coelho and Krohling 2005; Krohling and Santos Coelho 2006). Chaos theory has also been used to adapt the inertia weight factor of PSO (Liu et al. 2005); the resulting chaotic PSO efficiently balances the exploration and exploitation phases (Tharwat and Hassanien 2018).

Recently, building on the analysis of particle convergence behavior (Clerc and Kennedy 2002), Sun et al. proposed a new strategy based on quantum mechanics. They used a quantum model to sample around the previous best positions and introduced the mean best position into the resulting variant, quantum-behaved particle swarm optimization (QPSO) (Sun and Feng 2004; Sun et al. 2004, 2005). In the QPSO algorithm, the iterative equations differ from those of PSO. Moreover, QPSO does not require velocity vectors to move the particles, as PSO does, and the number of tuning parameters of QPSO is smaller than that of standard PSO; hence, the QPSO algorithm is easier to implement than PSO. The QPSO algorithm has been used to solve different optimization problems and has achieved competitive results (Mikki and Kishk 2006; Li et al. 2007; Omkar et al. 2009).

The main objective of this paper is to present a novel QPSO-SVM model that uses the QPSO algorithm to optimize the parameters of SVM and thereby improve the classification performance. The proposed approach adjusts the parameter values so that the optimal separating hyperplane can be obtained in both linear and nonlinear classification problems.

The rest of the paper is organized as follows: Section 2 presents the related work. Section 3 introduces the background of the SVM classifier, PSO, and QPSO; the proposed model (QPSO-SVM) is introduced in Section 4; experimental results and discussion are presented in Section 5; concluding remarks and future work are provided in Section 6.

2 Related Work

Many studies tune SVM parameters empirically by trying a finite number of values and selecting those that achieve the minimum classification error. This procedure requires an exhaustive search to find a feasible solution, which is computationally expensive (Friedrichs and Igel 2005). A grid search algorithm can be used to find the optimal parameters of SVM; in this algorithm, the parameters are varied with a fixed step size through the parameter space, and each parameter combination is evaluated. This algorithm is suitable for adjusting very few parameters, but it is time-consuming (Chapelle et al. 2002; Wang 2005; Friedrichs and Igel 2005). A leave-one-out (LOO) error and its gradient were also proposed to optimize SVM parameters (Friedrichs and Igel 2005). Different studies have used evolutionary algorithms to find optimal values of the SVM parameters and thereby improve the accuracy and stability of the classification model. Ali and Smith showed that selecting the optimal degree of a polynomial kernel plays a critical role in ensuring good generalization of SVM (Ali and Smith 2003). In other research, PSO was used for parameter determination of SVM and for feature selection (Lin et al. 2008). Wu et al. used a hybrid genetic algorithm (GA) for SVM parameter optimization to predict the maximum daily electrical load (Wu et al. 2009). Moreover, ant colony optimization (ACO) was used to search for the optimal SVM parameters (Zhang and Chen 2010). Subasi used the PSO algorithm to find SVM parameters for the diagnosis of neuromuscular disorders from EMG signals (Subasi 2013). Social emotional optimization was used to optimize SVM parameters and achieved competitive results (Zhang and Zhang 2015). Bat, dragonfly, and whale optimization algorithms were also used to optimize SVM parameters (Tharwat et al. 2016; Tharwat et al. 2017; Tharwat and Moemen 2017).

3 Preliminaries

3.1 Support Vector Machine Classifier

SVM is one of the most widely used learning algorithms. The goal of SVM is to separate different classes using hyperplanes. The performance of SVM degrades when the data are not linearly separable. However, this problem can be addressed by transforming the data from the input space into a new, higher-dimensional space using one of the kernel functions; after this transformation, the data can be separated more easily. Selecting a suitable kernel function and adjusting its parameters helps the SVM classifier create a linear decision boundary through a nonlinear transformation. Computationally, searching for the best separating hyperplane is an optimization problem (Chapelle et al. 2002).

Assume we have N training patterns X = {x1,x2,…,xN}, where xi is the ith training pattern, each pattern has d attributes, and each pattern belongs to one of two classes, yi ∈ {±1}. Thus, the training set is denoted by {(x1,y1),(x2,y2),…,(xN,yN)}, where y1,y2,…,yN indicate the class labels of x1,x2,…,xN. For linearly separable data, the line wTx + b = 0 represents the decision boundary between the positive and negative classes, where w is a weight vector, b is the threshold or bias, and x is the input vector or training pattern (Tharwat 2016a; Tharwat et al. 2017). This line in two-dimensional space, i.e., wTx + b = 0, becomes a hyperplane in higher-dimensional space, and it divides the space into two half spaces: the positive half space (where the patterns of the positive class are located) and the negative half space (where the patterns of the negative class are located) (Wang 2005).

The goal of the SVM classifier is to select the values of w and b to orientate the hyperplane to be as far as possible from the closest patterns and to construct the two planes, H1: wTxi + b = +1 for yi = +1 and H2: wTxi + b = −1 for yi = −1, where wTxi + b ≥ +1 for the positive class and wTxi + b ≤ −1 for the negative class. These two planes can be combined as follows: yi(wTxi + b) − 1 ≥ 0 ∀i = 1,2,…,N.

The distances from H1 and H2 to the hyperplane are denoted by d1 and d2, respectively, and together they form the margin of SVM. The hyperplane is equidistant from the two planes; hence, \(d_{1}=d_{2}=\frac {1}{\|\mathbf {w}\|}\), and the margin is the sum of d1 and d2. Maximizing the margin width is equivalent to the following optimization problem:

$$\begin{array}{@{}rcl@{}} && \min \frac{1}{2} \|\mathbf{w}\|^{2}\\ && \text{s.t. } y_{i} (\mathbf{w}^{T} \mathbf{x}_{i}+b) - 1\geq 0 \forall i = 1,2,\dots,N \end{array} $$
(1)

Equation 1 represents a quadratic programming problem, where \(\min \frac {1}{2} \|\mathbf {w}\|^{2}\) is the objective function and yi(wTxi + b) − 1 ≥ 0 represents the constraints. This problem can be expressed in Lagrangian form as follows:

$$\begin{array}{@{}rcl@{}} \min L_{P} &=& \frac{\|\mathbf{w}\|^{2}}{2} - \sum\limits_{i} \alpha_{i} (y_{i} (\mathbf{w}^{T} \mathbf{x}_{i}+b)-1)\\ &=& \frac{\|\mathbf{w}\|^{2}}{2} - \sum\limits_{i} \alpha_{i} y_{i}(\mathbf{w}^{T}\mathbf{x}_{i}+b) + \sum\limits_{i = 1}^{N} \alpha_{i} \end{array} $$
(2)

where αi ≥ 0, i = 1,2,…,N, are the Lagrange multipliers, each corresponding to one training pattern, and LP denotes the primal problem. The values of w and b can be calculated by differentiating LP with respect to w and b and setting the derivatives to zero as follows:

$$ \frac{\partial L_{P}}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum\limits_{i = 1}^{N}\alpha_{i}y_{i}\mathbf{x}_{i} $$
(3)
$$ \frac{\partial L_{P}}{\partial b}= 0 \Rightarrow \sum\limits_{i = 1}^{N}\alpha_{i}y_{i}= 0 $$
(4)

Substituting Eqs. 3 and 4 into Eq. 2, the dual problem can be written as follows:

$$\begin{array}{@{}rcl@{}} \max L_{D} &=& \sum\limits_{i = 1}^{N} \alpha_{i} - \frac{1}{2} \sum\limits_{i,j} \alpha_{i} \alpha_{j} y_{i} y_{j} {\mathbf{x}}_{i}^{T} \mathbf{x}_{j}\\ \text{s.t.} \alpha_{i} &\geq& 0, \sum\limits_{i = 1}^{N} \alpha_{i} y_{i}= 0 \forall i = 1,2,\dots,N \end{array} $$
(5)

where LD represents the dual form of LP. In the dual form, the objective function is maximized with respect to αi, instead of being minimized with respect to w and b as in the primal form, as denoted in Eqs. 2 and 5.

The values of w, b, and α are determined by solving Eqs. 3, 4, and 5. The nonzero α's correspond to the support vectors (SVs), which are the patterns closest to the separating hyperplane; thus, the SVs define the maximum-width margin.

When the data are not separable, some patterns are inevitably misclassified. Therefore, a slack variable 𝜖i is added to relax the constraints of the linear SVM, where 𝜖i indicates the distance between the ith training pattern and the corresponding margin hyperplane, and the total slack should be minimized. The objective function of SVM after adding 𝜖i becomes:

$$\begin{array}{@{}rcl@{}} && \min \frac{1}{2} \|\mathbf{w}\|^{2}+ C \sum\limits_{i = 1}^{N} \epsilon_{i}\\ && \text{s.t. } y_{i}(\mathbf{w}^{T} \mathbf{x}_{i}+b) -1+\epsilon_{i} \geq 0 \forall i = 1,2,\dots,N \end{array} $$
(6)

where C represents the penalty parameter, which controls the trade-off between the slack-variable penalty and the size of the SVM margin. Introducing Lagrange multipliers αi for the classification constraints and μi for the nonnegativity constraints 𝜖i ≥ 0, the Lagrangian of Eq. 6 is as follows:

$$ L_{P} = \frac{1}{2} \|\mathbf{w}\|^{2} + C \sum\limits_{i = 1}^{N} \epsilon_{i} - \sum\limits_{i = 1}^{N} \alpha_{i} [y_{i}(\mathbf{w}^{T} \mathbf{x}_{i}+b)-1 + \epsilon_{i}] - \sum\limits_{i = 1}^{N} \mu_{i} \epsilon_{i} $$
(7)

where αi ≥ 0 and μi ≥ 0.

Differentiating LP with respect to 𝜖i and setting the derivative to zero gives:

$$ \frac{\partial L_{P}}{\partial \epsilon_{i}}= 0 \Rightarrow C=\alpha_{i}+\mu_{i} $$
(8)

As denoted in Eq. 8, since μi ≥ 0, αi is bounded above by C. SVs with 0 < αi < C lie exactly on the margin boundary, whereas SVs with αi = C either lie on the margin boundary (𝜖i = 0) or violate the margin (𝜖i > 0).

In the case of nonlinearly separable data, SVM transforms the data into a higher-dimensional space through a nonlinear mapping ϕ, in which the data can become linearly separable; the kernel function K(xi,xj) = 〈ϕ(xi),ϕ(xj)〉 computes the inner products in that space. The objective function of SVM then becomes:

$$\begin{array}{@{}rcl@{}} && \min \frac{1}{2} \|\mathbf{w}\|^{2}+ C \sum\limits_{i = 1}^{N} \epsilon_{i}\\ && \text{s.t. } y_{i}(\mathbf{w}^{T}\phi(\mathbf{x}_{i})+b) -1+\epsilon_{i} \geq 0 \forall i = 1,2,\dots,N \end{array} $$
(9)

Commonly used kernel functions include the following:

  • Linear kernel, K(xi,xj) = 〈xi,xj〉;

  • Radial basis function (RBF) or Gaussian kernel, K(xi,xj) = exp(−||xi − xj||2/(2σ2));

  • Polynomial kernel of degree d, K(xi,xj) = (〈xi,xj〉 + c)d (Schölkopf and Smola 2001).

In this study, the RBF kernel function is applied in the SVM classifier to obtain optimal or near-optimal solutions.
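To make the kernel definitions above concrete, the short NumPy sketch below evaluates the three kernels for a single pair of vectors; the vectors and the values of σ, c, and d are arbitrary choices made here for illustration only.

import numpy as np

# Two example patterns (arbitrary values, for illustration only)
x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([2.0, 0.0, 1.0])

linear = x_i @ x_j                                            # <x_i, x_j>
sigma = 1.5
rbf = np.exp(-np.sum((x_i - x_j) ** 2) / (2.0 * sigma ** 2))  # Gaussian kernel
c, d = 1.0, 3
poly = (x_i @ x_j + c) ** d                                   # polynomial kernel

print(f"linear={linear:.3f}, rbf={rbf:.3f}, poly={poly:.3f}")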

3.2 SVM Parameter Optimization: Illustrative Example

In SVM, the penalty parameter C and kernel function parameters have considerable influence on the classification performance. The parameter C controls the trade-off between the minimization of classification error and maximization of margin. In the RBF kernel function, the value of σ parameter affects the mapping of the input data to a higher dimensional space. In this section, an illustrative example is presented to explain the impact of each parameter on the classification performance of SVM.

In this example, the data consist of 500 patterns and are nonlinearly separable, as shown in Fig. 1; therefore, the RBF kernel function was used. Table 1 lists the training error rate, the number of SVs, and the number of misclassified patterns for different values of C and σ.
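The 500-pattern dataset used in this example is not reproduced here, so the sketch below illustrates the procedure behind Table 1 on a synthetic nonlinearly separable dataset (make_circles is a stand-in assumption); note that scikit-learn parameterizes the RBF kernel with gamma = 1/(2σ2).

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic stand-in for the 500-pattern nonlinearly separable data
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    for sigma in [0.05, 0.5, 5, 10, 500]:
        gamma = 1.0 / (2.0 * sigma ** 2)        # sklearn's gamma for the RBF kernel
        clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
        train_err = 1.0 - clf.score(X, y)       # training error rate
        print(f"C={C:<7} sigma={sigma:<5} train error={train_err:.3f} "
              f"#SVs={clf.support_.size}")

The exact numbers depend on the synthetic data, but the qualitative trends discussed below (the effect of C and σ on the error rate and the number of SVs) can be observed with such a sweep.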

Fig. 1 The effect of the penalty parameter C and kernel parameter σ on the decision boundary of SVM. The decision boundary is shown as a blue line, and misclassified patterns are marked with red squares

Table 1 The influence of C and σ on the classification results of SVM

3.2.1 The Influence of C Parameter

The value of the C parameter controls the classification accuracy and the number of SVs. A small C allows the constraints to be easily ignored, while a large C makes the constraints difficult to ignore (Wang 2005). In this part of the example, the value of the σ parameter was fixed to ten. The results for different values of C are given in Table 1.

As shown in Fig. 1a, a small value of C leads to a high number of misclassified patterns and SVs (see Table 1). In Fig. 1b, the number of misclassified patterns decreased to zero, and the number of SVs was 80 when C = 1. As shown in Table 1, the number of SVs decreased from 80 to 34 when C was increased from one to ten. Moreover, the number of misclassified patterns reached its minimum when C ≥ 0.1.

Generally, when C was small (in our example, C = 0.001 or C = 0.01), the margin was maximized; hence, the numbers of SVs and misclassified patterns increased. Thus, a very small C may lead to severe underfitting, as reported in (Keerthi and Lin 2003). In contrast, a very large C (in our example, C = 100) minimizes the margin's width and increases the weight of the nonseparable patterns. A single outlier can then affect the boundary, which makes the classifier sensitive to noisy data; thus, a large C may lead to severe overfitting (Keerthi and Lin 2003). In addition, increasing C increases the training time.

From these findings, we can conclude that C needs to be searched over a wide range of values, and a suitable value of C can be found by trying a finite number of candidates and keeping the value that obtains the minimum classification error (Ben-Hur et al. 2008).

3.2.2 The Influence of σ Parameter

The value of the kernel parameter, σ in the RBF kernel, affects the partitioning outcome in the feature space, and it has a much greater influence on the classification accuracy than C, as reported in (Schölkopf and Smola 2001). The σ parameter defines how far the influence of a single training pattern reaches: small values confine the influence to nearby patterns, whereas large values let it reach far.

In this example, the value of the C parameter was one. The results for different values of σ are given in Table 1. As shown in Table 1, the training error rate was high when the value of σ was 500, while the error rate decreased when σ = 0.005, 0.05, 0.5, or 5. Moreover, the maximum number of SVs was obtained when σ was very low (0.005 or 0.05) or very high (500). In addition, the number of SVs decreased to its minimum when σ = 5.

From these findings, we can conclude that a very small value of σ makes the kernel value approximately zero for any unknown pattern that is not very close to a training pattern. Mathematically, with a very small σ, the kernel value K(xtest,xi) = exp(−||xtest − xi||2/(2σ2)) is approximately zero whenever the unknown pattern xtest is far from the training pattern xi. In other words, the kernel function with a very small σ maps the similarity between two distant patterns, i.e., xtest and xi, to approximately zero in the new higher-dimensional space. Thus, a very small σ gives low bias and high variance, which is so-called overfitting; the decision boundary becomes strict and sharp, and the number of SVs increases, as shown in Fig. 1d. On the other hand, a very large value of σ drives all kernel values toward one, which increases the number of misclassified patterns and SVs. Therefore, a very large σ gives low variance and high bias, which is so-called underfitting; the decision boundary becomes smoother and less flexible (see Fig. 1c). These findings are in agreement with (Kecman 2001; Wang 2005).

To conclude, the σ parameter controls the flexibility of the resulting SVM in fitting the data; a poorly chosen value may therefore lead to overfitting or underfitting.

3.3 Classical/Standard Particle Swarm Optimization (PSO)

PSO was inspired by Reynolds' simulation of bird flocking and was introduced by Kennedy and Eberhart (Reynolds 1987; Eberhart and Kennedy 1995). The main goal of PSO is to search the search space for positions that are close to the optimal solution(s). In the PSO algorithm, each individual is referred to as a particle, and each particle represents one solution to the problem in d-dimensional space; hence, each particle represents one point in the search space. The number of particles (N) is determined by the user.

The particles’ positions are initialized randomly, and the position of the ith particle in the tth iteration is denoted by X(t)(i) = [X(t)(i,1),…,X(t)(i,d)]. The movement of each particle depends mainly on two kinds of information. The first kind is the personal experience of the particle, represented by its previous best position (the position that achieved the best fitness value for that particle), denoted by P(t)(i) = [P(t)(i,1),…,P(t)(i,d)]. The second kind is the experience of the other particles (the knowledge of the particles around that particle). This is referred to as the global best position, P(t)(g), and it is the position of the particle with the best fitness value, where g indicates the index of the best particle among all particles in the swarm (Santos Coelho 2010).

The particles update their positions as in Eq. 10:

$$ \mathbf{X}^{(t + 1)}(i) = \mathbf{X}^{(t)}(i) + \mathbf{V}^{(t)}(i), $$
(10)

where V(t)(i) is the velocity vector of the i th particle in the t th iteration. The velocity is calculated as in Eq. 11 and it consists of the following terms:

  1. The current motion or original velocity of the particle, wV(t)(i), where w is the inertia weight.

  2. The cognitive component or particle memory, C1r1(P(t)(i) − X(t)(i)), which adjusts the velocity towards the best position visited by that particle, where C1 is the cognitive learning factor and r1 is a uniformly generated random number in the range [0,1].

  3. The social component, C2r2(P(t)(g) − X(t)(i)), which adjusts the velocity towards the global best position, where C2 is the social learning factor and r2 is a uniformly generated random number in the range [0,1] (Hassan et al. 2005):

    $$\begin{array}{@{}rcl@{}} \mathbf{V}^{(t + 1)}(i) &=& w \mathbf{V}^{t}(i) + C_{1} r_{1} (\mathbf{P}^{(t)}(i) - \mathbf{X}^{(t)}(i))\\ && + C_{2} r_{2} (\mathbf{P}^{(t)}(g) - \mathbf{X}^{(t)}(i)). \end{array} $$
    (11)

As denoted in Eq. 11, the velocity of the particles depends on random variables; thus, the particles move randomly, and their motion is therefore called a random walk (Yang 2014). The PSO algorithm searches for better positions by continually moving the particles towards the global solution (Eberhart and Kennedy 1995; Kennedy 2010).
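For concreteness, a minimal Python sketch of the update rules in Eqs. 10 and 11 is shown below; the inertia weight, learning factors, bounds, and swarm size are generic illustrative choices rather than values taken from this paper.

import numpy as np

def pso(fitness, dim, n_particles=20, iters=30, w=0.7, c1=2.0, c2=2.0,
        lb=-5.0, ub=5.0, seed=0):
    """Minimal PSO sketch: positions updated by Eq. 10, velocities by Eq. 11."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (n_particles, dim))   # positions
    V = np.zeros((n_particles, dim))              # velocities
    P = X.copy()                                  # personal best positions
    p_fit = np.array([fitness(x) for x in X])     # personal best fitness values
    g = int(np.argmin(p_fit))                     # index of the global best particle
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        V = w * V + c1 * r1 * (P - X) + c2 * r2 * (P[g] - X)   # Eq. 11
        X = np.clip(X + V, lb, ub)                             # Eq. 10
        f = np.array([fitness(x) for x in X])
        better = f < p_fit
        P[better], p_fit[better] = X[better], f[better]
        g = int(np.argmin(p_fit))
    return P[g], p_fit[g]

For example, pso(lambda x: np.sum(x ** 2), dim=2) minimizes the sphere function used later in Section 3.4.2.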

3.4 Quantum-Behaved Particle Swarm Optimization (QPSO)

In standard PSO, the position and velocity of a particle determine its trajectory, and the particle moves along this trajectory following Newtonian mechanics. In quantum mechanics, the term trajectory is meaningless, because the position and velocity of a particle cannot be determined simultaneously according to the uncertainty principle. Thus, in quantum-behaved PSO (QPSO), the particles move in a different fashion (Sun and Feng 2004; Sun et al. 2005).

In the QPSO algorithm, the state of each particle is described by a wave function ψ(x,t) obtained from the Schrödinger equation, instead of by position and velocity as in the classical or standard PSO algorithm. Hence, the dynamic behavior of the particles in QPSO differs widely from that in PSO, since the exact values of position and velocity cannot be determined simultaneously (Liu et al. 2005). Employing the Monte Carlo method, the particles in the QPSO algorithm move according to Eq. 12 (Sun and Feng 2004; Sun et al. 2005; Xi and Sun 2008; Cai et al. 2008).

$$ \left\{\begin{array}{ll} \mathbf{X}^{(t + 1)}(i)=\mathbf{p}(i)+\alpha . |\mathbf{M}_{\text{best}}-\mathbf{X}^{(t)}(i)|.\ln(\frac{1}{u}) & \text{if } k\geq 0.5\\ \mathbf{X}^{(t + 1)}(i)=\mathbf{p}(i)-\alpha . |\mathbf{M}_{\text{best}}-\mathbf{X}^{(t)}(i)|.\ln(\frac{1}{u}) & \text{if } k< 0.5, \end{array}\right. $$
(12)

where α is a design parameter called the contraction-expansion (CE) coefficient; u and k are uniformly generated random numbers in the range [0,1]; Mbest denotes the global point, or mean best, of the population, i.e., the mean of the previous best positions P of all particles, calculated as in Eq. 13; and p is the local attractor, which guarantees convergence of the QPSO algorithm and is defined as in Eq. 14 (Santos Coelho 2010).

$$\begin{array}{@{}rcl@{}} \mathbf{M}^{(t)}_{\text{best}} &=& \frac{1}{N} \sum\limits_{i = 1}^{N} \mathbf{P}^{(t)}(i)\\ &=& \left( \frac{1}{N} \sum\limits_{i = 1}^{N} P^{(t)}(i,1), \frac{1}{N}\sum\limits_{i = 1}^{N}P^{(t)}(i,2), \ldots,\frac{1}{N}\sum\limits_{i = 1}^{N}P^{(t)}(i,d) \right), \end{array} $$
(13)
$$ \mathbf{p}(\mathbf{i}) = \phi \mathbf{P}^{(t)}(i)+ (1-\phi) \mathbf{P}^{(t)}(g), $$
(14)

where

$$ \phi = \frac{C_{1}r_{1}}{C_{1}r_{1}+C_{2}r_{2}}. $$
(15)

The CE parameter (α) controls the speed of convergence. Some studies used a fixed α, and the results showed that this parameter is sensitive to the population size and the maximum number of iterations; hence, a fixed α is not efficient in practice. Sun et al. proposed using a time-varying CE parameter instead of a fixed value, and they found that linearly decreasing the value from α1 to α0 (α0 < α1) over the course of the iterations is simpler and more efficient than a fixed value (Sun et al. 2012). In this research, the value of α is computed as follows:

$$ \alpha = \alpha_{0} + (\text{Max}_{\text{iter}}-t).(\alpha_{1}-\alpha_{0})/\text{Max}_{\text{iter}} $$
(16)

where α0 is the initial value of α, α1 is the final value of α, t is the current iteration, and Maxiter is the maximum number of iterations. The value of α0 was suggested to be smaller than 0.6, and the value of α1 was suggested to be larger than 0.8 and smaller than 1.2 (Sun et al. 2012).

The steps of QPSO algorithm are listed in Algorithm 1.

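Since the figure containing Algorithm 1 is not reproduced in this text, the following Python sketch summarizes the QPSO procedure using Eqs. 12-16; the swarm size, bounds, and CE limits (α0, α1) are illustrative assumptions consistent with the ranges suggested above, not settings reported by the paper.

import numpy as np

def qpso(fitness, dim, n_particles=20, iters=30, lb=-5.0, ub=5.0,
         alpha0=0.5, alpha1=1.0, c1=2.0, c2=2.0, seed=0):
    """Minimal QPSO sketch following Eqs. 12-16 (no velocity vectors are needed)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (n_particles, dim))   # positions
    P = X.copy()                                  # personal best positions
    p_fit = np.array([fitness(x) for x in X])
    g = int(np.argmin(p_fit))                     # index of the global best particle
    for t in range(iters):
        alpha = alpha0 + (iters - t) * (alpha1 - alpha0) / iters   # Eq. 16
        m_best = P.mean(axis=0)                                    # Eq. 13
        for i in range(n_particles):
            r1, r2 = rng.random(dim), rng.random(dim)
            phi = c1 * r1 / (c1 * r1 + c2 * r2 + 1e-12)            # Eq. 15 (guarded denominator)
            p_loc = phi * P[i] + (1.0 - phi) * P[g]                # Eq. 14 (local attractor)
            u = np.maximum(rng.random(dim), 1e-12)                 # guard against log(1/0)
            step = alpha * np.abs(m_best - X[i]) * np.log(1.0 / u)
            sign = np.where(rng.random(dim) >= 0.5, 1.0, -1.0)     # Eq. 12 (k >= 0.5 or not)
            X[i] = np.clip(p_loc + sign * step, lb, ub)
            f_i = fitness(X[i])
            if f_i < p_fit[i]:
                P[i], p_fit[i] = X[i].copy(), f_i
        g = int(np.argmin(p_fit))
    return P[g], p_fit[g]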

3.4.1 QPSO vs. PSO

Several characteristics allow QPSO to outperform PSO:

  • In QPSO, the exponential distribution of the particles’ positions makes QPSO globally convergent.

  • The PSO particles converge to the global best solution independently. In QPSO, by contrast, a particle cannot converge to the global best position without considering its colleagues. Equation 12 indicates that the distance between a particle's current position and Mbest determines that particle's position in the next iteration. Hence, in QPSO, Mbest may be pulled away from the global best position (P(t)(g)) by the particles that are far from it, called lagged particles. As the lagged particles converge to P(t)(g), Mbest approaches P(t)(g) slowly. The distance between Mbest and the previous best positions of the particles near P(t)(g) therefore decreases slowly, which decelerates the convergence of the particles near P(t)(g); these particles can wait and explore globally around P(t)(g) until the lagged particles come close to it, which is called the waiting phenomenon. Therefore, QPSO never abandons a lagged particle, and hence QPSO appears more cooperative and intelligent. Simply put, Mbest enhances the search capability of the particles and makes them search globally for the optimal solution (Sun et al. 2012).

Figure 2 shows the difference between PSO and QPSO. The black circle represents the global best position, blue circles represent the lagged particles, and red circles represent the other particles. The dashed arrows around the circles in QPSO represent the possible directions of the particles; the solid arrow with a large arrowhead points in the direction in which the particle moves with high probability. As shown in the left panel of Fig. 2, the particles in the PSO algorithm converge to the global best position without waiting for their colleagues. Hence, the influence of the lagged particles on the other particles is very small, and if the lagged particles fail to find positions better than P(t)(g) during an iteration, their influence on the other particles will be null. In contrast, in the QPSO algorithm (right panel of Fig. 2), the lagged particles have a great impact on the other particles through Mbest; hence, the particles around P(t)(g) can fly in any direction. However, the particles move in the direction of P(t)(g) with higher probability, since the overall tendency of the particles' movements is convergence to P(t)(g). In the next section, an illustrative example is presented to explain the difference between the PSO and QPSO algorithms.

Fig. 2 The movements of particles in PSO and QPSO algorithms

3.4.2 QPSO vs. PSO: an Illustrative Example

In this section, a simple example is introduced to show the difference between the QPSO and PSO algorithms. In this example, the PSO and QPSO algorithms were used to search for the optimal solution and the fitness function was as follows:

$$ \min : \mathcal{F}(x) = \sum\limits_{i = 1}^{d} {x^{2}_{i}} , -5 \leq x_{i} \leq 5 $$
(17)

In this example, for both algorithms, the number of particles was 20, and the maximum number of iterations was 30.

Figure 3 shows the average distance (AD) between the lagged particles and the global best position in the PSO and QPSO algorithms. In this example, the five particles farthest from the global best position are taken as the lagged particles. As shown, in four different runs, the lagged particles of the QPSO algorithm become closer to P(t)(g) than those of PSO as the iterations proceed.
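The snippet below shows how the sphere fitness of Eq. 17 and the AD metric plotted in Fig. 3 can be computed with NumPy; the swarm snapshot is a hypothetical example, and the helper names are choices made here for illustration.

import numpy as np

def sphere(x):
    """Fitness function of Eq. 17."""
    return np.sum(x ** 2)

def avg_lagged_distance(positions, g_best, n_lagged=5):
    """Average Euclidean distance between the n_lagged particles farthest
    from the global best position and that position (the AD metric)."""
    d = np.linalg.norm(positions - g_best, axis=1)
    return np.sort(d)[-n_lagged:].mean()

# Hypothetical 20-particle swarm in 2-D within [-5, 5]
rng = np.random.default_rng(0)
swarm = rng.uniform(-5.0, 5.0, (20, 2))
g_best = swarm[np.argmin([sphere(x) for x in swarm])]
print(avg_lagged_distance(swarm, g_best))

Tracking this quantity per iteration with the pso and qpso sketches given earlier would reproduce, qualitatively, the behavior described here.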

Fig. 3 The average distance (AD) between the lagged particles and the global best position in the PSO and QPSO algorithms

From this finding, it can be concluded that the particles in the QPSO algorithm were more cooperative, as they help the lagged particles move closer to the global best position through the Mbest term; hence, the distance between the lagged particles and P(t)(g) decreases as the iterations proceed.

Figure 4 shows the indices of the lagged particles during the last ten iterations, where the top index represents the particle farthest from P(t)(g), and the red arrows indicate changes of index. As shown in the PSO panel, the top two indices remained stable and did not change as the iterations proceeded. Moreover, only a small number of indices changed, and only nine different indices appeared during the last ten iterations. In the QPSO algorithm, by contrast, the indices changed more often: as shown in the QPSO panel, the first index was 8 and then changed to 4, the second index changed many times during the iterations, and 17 different indices appeared during the last ten iterations.

Fig. 4 The indices of lagged particles in the PSO and QPSO algorithms during the last ten iterations

From this finding, it can be concluded that, due to the noncooperative behavior of the PSO algorithm, the lagged particles may remain far from P(t)(g). This noncooperative behavior has a significant negative impact on the performance of the PSO algorithm. In the QPSO algorithm, by contrast, the Mbest term helps the lagged particles converge to P(t)(g); hence, QPSO outperforms PSO, as reported in (Sun et al. 2011, 2012).

4 The Proposed Model: QPSO-SVM

In this section, we describe the proposed QPSO-SVM model (as shown in Fig. 5) to find optimal values of SVM parameters. The detailed explanation is as follows:

Fig. 5 Flowchart of the proposed model (QPSO-SVM)

4.1 Data Preprocessing

The first step in our model is the data preprocessing. This step adopts linear scaling to (1) avoid features in greater numeric ranges dominating those in smaller ranges and (2) avoid numerical difficulties during the calculation (Zhao et al. 2011). Each feature can be scaled to the range [0,1] as follows:

$$ {v}^{\prime} = \frac{v-\min}{\max-\min}, $$
(18)

where v is the original value, min and max are the lower and upper bounds of the feature value, respectively, and v′ is the scaled value.
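A minimal sketch of Eq. 18 with scikit-learn is shown below; the data arrays are hypothetical, and fitting the scaler on the training data only (so that the test data do not influence min and max) is an implementation assumption rather than a detail stated in the paper.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train/test split for illustration
X_train = np.array([[1.0, 200.0], [3.0, 400.0], [2.0, 300.0]])
X_test = np.array([[2.5, 350.0]])

scaler = MinMaxScaler(feature_range=(0, 1))    # Eq. 18, applied per feature
X_train_scaled = scaler.fit_transform(X_train)  # min/max estimated on training data
X_test_scaled = scaler.transform(X_test)        # same transformation reused on test data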

4.2 Data Partitioning

In this model, the data is partitioned using the k-fold cross-validation method. In k-fold cross-validation, the original data is randomly partitioned into k subsets of (approximately) equal size, and the experiment is run k times, as shown in Fig. 5. For each run, one subset is selected as the testing set, and the other k − 1 subsets are used as the training set. The average of the k results from the folds is then calculated to produce a single estimate.
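A sketch of the k-fold partitioning (k = 10) with scikit-learn is shown below; the data arrays and the fixed (C, σ) values are hypothetical, and the use of StratifiedKFold is an assumption made here to keep class proportions roughly equal across folds, in the spirit of the 9:1 ratio discussed in Section 5.1.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Hypothetical dataset: 100 patterns, 4 features, 2 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X_train, y_train)  # placeholder (C, gamma)
    fold_errors.append(1.0 - clf.score(X_test, y_test))              # fold test error

print(np.mean(fold_errors))   # single estimate averaged over the k folds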

4.3 Parameters’ Initialization

In this step, the parameters of the QPSO algorithm, including the number of particles, the maximum number of iterations, and the initial positions, are initialized. In the proposed model, the QPSO algorithm provides the SVM classifier with the values of C and σ used to train SVM on the training set. Since there are only two parameters in our problem, the search space is two-dimensional, and each position represents a combination of C and the kernel parameter σ. The particles' positions were initialized randomly. The search range of the SVM parameter C was bounded by Cmin = 0.01 and Cmax = 3000, and the search range of the σ parameter was bounded by σmin = 0.01 and σmax = 100 (Lin et al. 2008). Increasing these limits enlarges the search space; more particles would then be needed to search for the optimal solution, which results in more computation and a slower convergence rate.
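A sketch of this initialization step is given below; the positions are drawn uniformly within the stated bounds, and the swarm size of 20 is an illustrative choice (the actual setting used in the experiments is discussed in Section 5.2).

import numpy as np

rng = np.random.default_rng(0)
n_particles = 20                       # illustrative swarm size
C_min, C_max = 0.01, 3000.0            # search range of C (Section 4.3)
sigma_min, sigma_max = 0.01, 100.0     # search range of sigma (Section 4.3)

# Each particle position is a 2-D point (C, sigma)
positions = np.column_stack((
    rng.uniform(C_min, C_max, n_particles),
    rng.uniform(sigma_min, sigma_max, n_particles),
))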

4.4 Fitness Evaluation

For each position, the training set is used to train the SVM classifier, while the testing set is used to calculate the misclassification (error) rate, defined as the ratio of the number of misclassified patterns (Ne) to the total number of testing patterns (N), as denoted in Eq. 19. The optimal solution (X∗) is the position (\(X_{i} \in \mathcal {R}^{2}\)) whose values of C and σ achieve the minimum testing error rate.

$$ \text{Minimize}: \mathcal{F} = \frac{N_{e}}{N} $$
(19)
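A sketch of this fitness evaluation is shown below; the mapping gamma = 1/(2σ2) to scikit-learn's RBF parameterization is an implementation assumption, and the data arrays passed to the function are hypothetical.

from sklearn.svm import SVC

def fitness(position, X_train, y_train, X_test, y_test):
    """Eq. 19: error rate N_e / N of an SVM trained with the candidate
    (C, sigma) encoded in a particle position."""
    C, sigma = position
    gamma = 1.0 / (2.0 * sigma ** 2)   # RBF kernel width in sklearn terms
    clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X_train, y_train)
    return 1.0 - clf.score(X_test, y_test)

In the proposed model, this quantity is what the QPSO particles minimize, with the best (C, σ) position retained for each fold.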

4.5 Termination Criteria

The positions of the particles are updated according to Eq. 12. In the proposed model, the QPSO algorithm terminates when the maximum number of iterations is reached or when the best solution has not improved for a given number of iterations. In our experiments, the maximum number of iterations without improvement of the best solution was ten.
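The stall-based stopping rule can be sketched as a small driver function; step is a hypothetical callable that performs one QPSO iteration and returns the best fitness found so far, and the iteration budget shown is an arbitrary example value.

def run_with_early_stop(step, max_iter=100, max_stall=10):
    """Stop when the iteration budget is exhausted or the best solution
    has not improved for max_stall consecutive iterations."""
    best, stall = float("inf"), 0
    for t in range(max_iter):
        current = step(t)            # one QPSO iteration; returns best fitness so far
        if current < best:
            best, stall = current, 0
        else:
            stall += 1               # best solution repeated
        if stall >= max_stall:       # unchanged for max_stall iterations
            break
    return best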

5 Experimental Results and Discussion

The experiments were carried out to evaluate the performance of the proposed QPSO-SVM algorithm for the parameter optimization of SVM.

5.1 Experimental Setup

To obtain an unbiased comparison of CPU times, all experiments were performed on the same PC, whose detailed settings are shown in Table 2.

Table 2 The detailed settings

To evaluate the proposed algorithm, seven standard classification datasets were used. The datasets were obtained from the University of California at Irvine (UCI) Machine Learning Repository (Blake and Merz 1998). The descriptions of all datasets are listed in Table 3. These datasets are widely used in the literature to compare the performance of different classification methods. As shown in Table 3, the iris, wine, and glass datasets have more than two classes; thus, the one-against-all (1AA) multiclass SVM method was used (Friedrichs and Igel 2005). Since the number of patterns in each class is not a multiple of ten, as shown in Table 3, the datasets cannot be partitioned into exactly equal folds using k-fold cross-validation. However, the ratio between the numbers of patterns in the training and testing sets was kept as close as possible to 9:1. The abbreviations D1, D2, …, D7 in Table 3 refer to the datasets listed in the table.

Table 3 Datasets description

5.2 Parameter Setting for QPSO

Tuning the parameters of any optimization algorithm is as important as designing the algorithm itself. In this section, the effect of the number of particles and the maximum number of iterations on the testing error rate and computational time of the proposed model (QPSO-SVM) was investigated using the Iono dataset.

5.2.1 Number of Particles

In this section, the effect of the number of particles on the testing error and computational time of the proposed model was tested with the number of particles ranging from 5 to 30. The results of this experiment are shown in Fig. 6. From the figure, it is observed that increasing the number of particles reduces the testing error, but more computational time is needed. Moreover, the error rate reached its minimum when the number of particles was 20.

Fig. 6 Effect of the number of particles on the performance of the QPSO-SVM model using the Iono dataset: a testing error rate of the QPSO-SVM model with different numbers of particles; b computational time of the QPSO-SVM model with different numbers of particles

5.2.2 Number of Iterations

The number of iterations (generations) also has a great impact on the performance of the proposed model. In this section, the effect of the number of iterations on the testing error and computational time of the proposed model was tested with the number of iterations ranging from 10 to 100. The results of this experiment are shown in Fig. 7. As shown in the figure, when the number of iterations was increased, the testing error decreased until it reached a point at which further increases did not affect the testing error. In addition, the computational time increased when the number of iterations was increased.

Fig. 7 Effect of the number of iterations on the performance of the QPSO-SVM model using the Iono dataset: a testing error rate of the QPSO-SVM model with different numbers of iterations; b computational time of the QPSO-SVM model with different numbers of iterations

Regarding the other parameters of QPSO, Sun et al. suggested that the QPSO algorithm can achieve global convergence when α0 < 0.6 and 0.8 < α1 < 1.2 (Sun et al. 2012). On the basis of the above parameter analysis and research results, Table 4 lists the detailed settings of the QPSO algorithm used in the proposed model.

Table 4 The initial parameters of QPSO algorithm

5.3 Experimental Results

In this section, three experiments were conducted. The aim of the first experiment is to compare the proposed model, QPSO-SVM, with PSO-SVM. The goal of the second experiment is to compare QPSO-SVM with grid search SVM. The third experiment was conducted to compare QPSO-SVM with a well-known optimization algorithm, the genetic algorithm (GA). To compare the classification error rate of the proposed algorithm with those of the other algorithms, we used the nonparametric Wilcoxon signed-rank test on all datasets.
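As an illustration of the statistical comparison, the snippet below applies the paired Wilcoxon signed-rank test from SciPy to two hypothetical vectors of per-fold error rates; the numbers are made up for the example and are not results from the paper.

import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold error rates of two methods on the same 10 folds
err_method_a = np.array([0.03, 0.05, 0.02, 0.04, 0.03, 0.06, 0.02, 0.04, 0.03, 0.05])
err_method_b = np.array([0.05, 0.06, 0.04, 0.06, 0.04, 0.07, 0.03, 0.05, 0.05, 0.06])

stat, p_value = wilcoxon(err_method_a, err_method_b)   # paired, nonparametric test
print(f"Wilcoxon statistic={stat}, p value={p_value:.4f}")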

5.3.1 QPSO vs. Classical PSO

This experiment was conducted to compare our proposed QPSO-SVM algorithm with the PSO-SVM approach proposed in (Lin et al. 2008). In this experiment, the C and σ parameters were optimized by our proposed algorithm. Table 5 shows the average classification error rates obtained by the QPSO and PSO algorithms, reported as average ± standard deviation. As shown, the QPSO-SVM algorithm yielded lower classification error rates than the PSO-SVM algorithm on all datasets. Moreover, the p value for D5 (Sonar dataset) is larger than the chosen statistical significance level of 0.005, but the other p values are smaller than this significance level. Furthermore, the optimization processes of QPSO-SVM and PSO-SVM are given in Fig. 8. The values shown in these figures are the averages of all cross-validation results, and both algorithms started from the same initial solutions. As shown in the figure, the QPSO-SVM algorithm achieved better results than the PSO-SVM algorithm. Figure 8c, e shows clearly how PSO was trapped in a local minimum, while the QPSO algorithm escaped and converged to better solutions. Moreover, the figure shows that QPSO converged faster than the PSO algorithm. For example, in Fig. 8g, the QPSO and PSO algorithms achieved the same solution, but they reached this solution after 24 and 48 iterations, respectively. This reflects the convergence speed of the QPSO algorithm compared to PSO.

Fig. 8 Performance comparison using the datasets listed in Table 3

Table 5 Comparison between the proposed QPSO-SVM algorithm and the PSO-SVM algorithm that was proposed by (Lin et al. 2008) (%)

Generally, from Table 5 and Fig. 8, the proposed QPSO-SVM algorithm achieved a lower classification error rate than PSO-SVM. This is due to (1) the exponential distribution of the particles' positions and (2) the cooperative behavior of the QPSO algorithm, as discussed in Sections 3.4.1 and 3.4.2. Therefore, it is no surprise that QPSO is more efficient than classical PSO.

5.3.2 QPSO-SVM vs. Grid Search

This experiment was performed to compare the proposed QPSO-SVM algorithm with the grid search SVM algorithm. Table 6 shows the test error rate and computational time of this experiment. k-fold cross-validation (k = 10) was used to estimate the test error rate of each approach, and the results are reported as average ± standard deviation. As shown in the table, the running time of the proposed algorithm was much lower than that of the grid search algorithm. As shown in Table 6, the p value for D5 (Sonar dataset) is larger than the chosen statistical significance level of 0.005, but the other p values are smaller than this significance level. Generally, compared with the grid search algorithm, the proposed QPSO-SVM algorithm achieved lower classification error rates.

Table 6 Results of the proposed QPSO-SVM algorithm and the grid search SVM algorithm

5.3.3 QPSO-SVM vs. Other Optimization Algorithms

This experiment was conducted to compare our proposed QPSO-SVM algorithm with the GA-based approach proposed in (Huang and Wang 2006) to optimize the values of the C and σ parameters. Table 7 shows the results of this experiment. As shown, the QPSO-SVM algorithm achieved lower classification error rates than the GA-SVM algorithm.

Table 7 Comparison between the QPSO-SVM algorithm and the GA-SVM algorithm that was proposed by (Huang and Wang 2006) (%)

5.3.4 QPSO-SVM vs. Conventional Classifiers

The aim of this experiment is to test the proposed QPSO-SVM model against different classifiers. In this experiment, the QPSO-SVM algorithm was compared with the k-nearest neighbor (k-NN) (Dudani 1976), multi-layer perceptron (MLP) (Pal and Mitra 1992), and linear discriminant analysis (LDA) (Tharwat 2016a) classifiers. The value of k in k-NN was three, and the MLP used one hidden layer with 15 nodes and 1000 training epochs. The results of this experiment are summarized in Table 8.
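For reference, the baseline classifiers with the hyperparameters stated above can be instantiated as follows; the exact implementations used in the paper are not specified, so scikit-learn equivalents are shown here as an assumption.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Baseline classifiers with the hyperparameters described above
baselines = {
    "k-NN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "MLP (15 hidden nodes, 1000 epochs)": MLPClassifier(hidden_layer_sizes=(15,),
                                                        max_iter=1000, random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
}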

Table 8 Comparison between the QPSO-SVM algorithm and NN, LDA, and k-NN classifiers (%)

As shown in Table 8, the proposed QPSO-SVM algorithm obtained the best results compared to the other classifiers, while the k-NN classifier achieved the worst results. This is because (1) k-NN has no training phase and simply assigns a testing sample to the class with the most neighbors among its k nearest ones, (2) the MLP has several parameters that also need to be tuned, and (3) LDA suffers from the small sample size (SSS) problem; this problem can be addressed by reducing the dimensionality with a method such as principal component analysis (PCA) (Tharwat 2016b), but PCA may remove information that carries discriminative power (Tharwat et al. 2017).

To conclude, the developed QPSO-SVM algorithm yielded more appropriate parameters, giving a lower classification error rate across different datasets. Even so, we cannot guarantee that QPSO-SVM will perform well and outperform other methods in every application. In fact, many factors may have a great impact on the quality of the proposed algorithm, such as the number of classes, the number of training samples, their representativeness, the number of features, and the diversity of the training sets.

6 Conclusion and Future Work

The SVM classifier is widely used in different applications. However, the parameters of SVM affect the test error rate. This study proposes a quantum-behaved approach (QPSO-SVM) that searches for the values of the SVM parameters that minimize the classification error rate. In this research, different experiments were conducted to compare the proposed QPSO-SVM algorithm with the grid search SVM, PSO-SVM, and GA-SVM algorithms on several standard classification datasets. The results of this study showed that the QPSO-SVM algorithm achieved competitive results.

Several directions for future studies are suggested. First, our fitness function is not suitable for imbalanced datasets, where the patterns of one class (the majority class) outnumber those of the other class(es) (the minority class(es)). This is because the accuracy rate does not distinguish between the numbers of correctly classified patterns of the different classes; thus, on imbalanced datasets, accuracy may lead to erroneous conclusions (He and Garcia 2009). Therefore, different assessment measures, such as the geometric mean (GM), could be used in the fitness function. Second, given the performance of QPSO, it would be worthwhile to explore its potential with other classifiers. This is currently being investigated by the authors of this paper.