1 Introduction

Support vector machines (SVMs), first proposed by Vapnik [1], have been applied to a range of problems including pattern recognition, bioinformatics, text categorization and fault diagnosis [2–6]. An SVM classifies data with different class labels by determining a set of support vectors, which are members of the set of training inputs and outline a hyperplane in the feature space [7].

Two major problems arise when using SVM for classification: how to set the optimal parameters for the SVM and how to choose the optimal feature subset of the target dataset. The parameter settings have a direct effect on classification accuracy. The parameters that need to be optimized include the penalty parameter C and the kernel function parameters, such as gamma (γ) for the radial basis function (RBF) kernel [8, 9]. For parameter determination, grid search is often used: the parameters are varied with a fixed step size through a wide range of values, and the performance of each combination is evaluated. Because of its computational complexity, grid search is only suitable when there are very few parameters [10]. With the development of heuristic optimization methods, techniques such as the genetic algorithm (GA) [7], particle swarm optimization (PSO) [9], simulated annealing (SA) [11] and CMA-ES [10] have been adopted for SVM parameter optimization.
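For illustration, a minimal grid search over (C, γ) might look like the following sketch (Python with scikit-learn; the dataset, parameter ranges and step sizes are assumptions chosen for illustration, not settings from this paper):

```python
# Illustrative sketch of grid search over (C, gamma) for an RBF-SVM;
# ranges and step sizes are assumptions, not values from this paper.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

best_score, best_params = -np.inf, None
for C in 2.0 ** np.arange(-5, 16, 2):          # fixed step in log2 space
    for gamma in 2.0 ** np.arange(-15, 4, 2):
        score = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()
        if score > best_score:
            best_score, best_params = score, (C, gamma)

print(best_params, best_score)
```

Even this small grid requires over a hundred SVM trainings, which illustrates why grid search scales poorly beyond a few parameters.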

Classification problems generally involve a number of features, but not all of them are equally important. Some features may be redundant or even irrelevant and, if not eliminated before classification, can increase computational cost or decrease classification accuracy. To achieve better performance, feature selection or feature reduction is therefore necessary for complex datasets. As shown in [12], the feature selection problem is an NP-hard combinatorial problem and requires efficient solution algorithms. Efforts have been made to develop feature selection methods, including the stochastic gradient descent algorithm [12], tabu search [13] and discrete particle swarm optimization [14]. Real-valued PSO has been used for feature selection in kernel clustering to optimize the subsets of classes [15].

Facing the challenges of parameter optimization and feature selection in SVM classification, some researchers have tried to solve both problems simultaneously. Huang and Wang proposed a feature selection and parameter optimization approach based on GA [7]. Huang employed a PSO-SVM model that combines discrete PSO with continuous-valued PSO to simultaneously optimize the input feature subset and the SVM kernel parameter settings [9]. In [16], a PSO-SVM model with a multi-class SVM was constructed, in which PSO optimized the kernel parameters and input feature subsets together; the labels of input features and the kernel parameters of SVM were represented by real-valued particles during optimization, and if the value of a label was less than or equal to 0.5, the corresponding feature was not chosen. In contrast to PSO and GA, the gravitational search algorithm (GSA) is a swarm-based meta-heuristic search algorithm based on the law of Newtonian gravity [17]. Recently, GSA has proved to be an excellent optimization method in many kinds of applications, including parameter identification [18, 19], fuzzy model identification [20, 21], wind turbine control [22], capacitor placement optimization in radial distribution systems [23], economic emission load dispatch [24] and synthesis gas production [25]; in particular, GSA has also shown potential in handling feature subset selection [26] and SVM parameter optimization [27]. Based on this discussion, it is promising to address the problems of SVM classification using GSA.

Sarafrazi and Nezamabadi-pour advanced the application of GSA to these problems of SVM by using GSA to handle feature selection and SVM parameter optimization together [28]. In their work, a GSA-SVM hybrid system was proposed, in which a real-valued GSA optimized the parameters while a discrete GSA simultaneously optimized the feature subsets. Although discrete GSA is a good method for feature subset selection, many other methods have also been reported to be effective for this problem. Among them, chaotic search is an attractive choice.

The superiority of chaotic sequences and chaotic search has been widely reported [29–31]. Recently, chaos embedded methods have been developed and applied to parameter optimization of SVMs. In [32], an optimal selection approach for SVM parameters was put forward based on the mutative scale chaos optimization algorithm (MSCOA). Wu proposed a new PSO method that uses chaotic mappings for parameter adaptation of the wavelet v-support vector machine (Wv-SVM) [33]. In [34], a new chaotic differential evolution optimization approach based on the Ikeda map was proposed to optimize the kernel function parameters of SVM. Chaotic sequences have also been used in feature selection problems: in [35], two kinds of chaotic maps, logistic maps and tent maps, were embedded in PSO to handle feature selection. Since chaotic sequences have been successfully applied to various optimization problems, it is reasonable to expect that they have good potential for optimizing input feature subsets.

Although SVM classification performance has been improved significantly, there is still motivation to push this work further. This study proposes a new hybrid system, the chaos embedded GSA-SVM (CGSA-SVM), which combines GSA and chaotic search: GSA is used to optimize the parameters of SVM, while chaotic search is employed to optimize the input feature subsets. Compared with [28], we introduce chaotic search to replace the discrete GSA for feature selection. Benefiting from the properties of ergodicity and stochasticity, chaos is efficient in both real-valued and discrete-valued optimization.

The remainder of this paper is organized as follows. Section 2 reviews pertinent literature on SVM and GSA. Section 3 presents in detail the developed CGSA-SVM approach for determining the parameter values for SVM and selecting feature subsets. Next, Sect. 4 compares the experimental results with those of existing approaches. Conclusions are finally drawn in Sect. 5.

2 Literature review

2.1 Support vector machine

The SVM technique was first proposed by Vapnik [1]. The principles of SVM stem from statistical learning theory. In this section, the technique is briefly introduced as follows.

Let \((x_{i}, y_{i})\), \(i = 1, \ldots, l\), denote a set of training data, where \(x_{i} \in R^{d}\) is the input data with d dimensions and \(y_{i} \in \{-1, 1\}\) is the corresponding bipolar label. A linear decision surface can be defined by the equation \(f(x) = \langle w, x \rangle + b = 0\), where w is a weight vector orthogonal to the decision surface, b is an offset term and \(\langle \cdot , \cdot \rangle\) is the inner product operator. The original formulation of the SVM algorithm seeks a linear decision surface that separates the two opposite classes with a maximal margin (of half-width \(1/\lVert w \rVert\)) by solving the following optimization problem:

$$\mathop {\hbox{min} }\limits_{w,b} \frac{1}{2}\left\| w \right\|^{2}$$
(1)

Subject to \(y_{i} (\langle w, x_{i} \rangle + b) \ge 1, \quad i = 1, \ldots , l\).

This optimization problem can be transformed into its corresponding dual problem:

$$W(a) = \sum\limits_{i = 1}^{l} {a_{i} } - \frac{1}{2}\sum\limits_{i,j = 1}^{l} {a_{i} a_{j} y_{i} y_{j} \left\langle {x_{i} ,x_{j} } \right\rangle }$$
(2)

With constraints: \(a_{i} \ge 0,\; i = 1, \ldots, l\), and \(\sum\nolimits_{i = 1}^{l} {y_{i} a_{i} } = 0\). The Lagrange multipliers \(a_{i}\) can be obtained by maximizing W(a).

Considering \(w = \sum\nolimits_{i = 1}^{l} {a_{i} y_{i} x_{i} }\), the decision function is expressed as follows:

$$f(x) = \text{sgn} \left( {\sum\limits_{i = 1}^{l} {a_{i} y_{i} \left\langle {x_{i} ,x} \right\rangle } + b} \right)$$
(3)

In order to relax the margin constraints for nonlinearly separable data, slack variables are introduced into the optimization problem. The following widely used soft margin formulation is adopted here:

$$\mathop {\hbox{min} }\limits_{w,\xi } \frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{l} {\xi_{i} }$$
(4)

Subject to \(y_{i} (\langle w, x_{i} \rangle + b) \ge 1 - \xi_{i} , \quad i = 1, \ldots , l, \quad \xi_{i} \ge 0\), where \(\xi_{i}\), i = 1,…,l, are slack variables and C is the penalty parameter of error.

In practice, most problems are linearly inseparable even with the soft margin SVM. Thus, the input data are mapped into a high-dimensional feature space, in which the data are sparse and possibly more separable. Suppose the mapping function is φ( · ); then the inner product \(\langle x_{i}, x_{j} \rangle\) can be replaced by \(\langle \varphi(x_{i}), \varphi(x_{j}) \rangle\). Given a symmetric and positive kernel function K(x, y) that satisfies Mercer’s theorem, the inner product in the feature space can be expressed as \(K(x_{i}, x_{j}) = \langle \varphi(x_{i}), \varphi(x_{j}) \rangle\). Consequently, the decision function becomes:

$$f(x) = \text{sgn} \left( {\sum\limits_{i = 1}^{l} {a_{i} y_{i} K(x_{i} ,x)} + b} \right)$$
(5)

The radial basis function is a common kernel function given as:

$$K(x_{i} ,x_{j} ) = \exp \left( { - \gamma \left\| {x_{i} - x_{j} } \right\|^{2} } \right)$$
(6)

where γ is the parameter of the kernel function.

The performance of an SVM can be controlled through the penalty parameter C and the kernel parameter γ. These parameters influence the number of support vectors and the maximization margin of the SVM.
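As a concrete illustration of how the RBF kernel of Eq. (6) is computed, the following sketch builds the kernel matrix with NumPy (the toy data and the value of γ are assumptions for illustration):

```python
# Minimal sketch: RBF kernel matrix K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2),
# as in Eq. (6). Data and gamma are illustrative.
import numpy as np

def rbf_kernel_matrix(X, gamma):
    # Pairwise squared Euclidean distances via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2<a, b>.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(rbf_kernel_matrix(X, gamma=0.5))
```

A larger γ makes the kernel more local (entries decay faster with distance), which tends to increase the number of support vectors; a larger C penalizes training errors more heavily and narrows the margin.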

2.2 Gravitational search algorithm (GSA)

GSA was first proposed by Rashedi et al. [17]. Assume there are N agents (masses); the position of the ith agent is \({\mathbf{L}}_{i} = (l_{i}^{1}, \ldots, l_{i}^{d}, \ldots, l_{i}^{n})\), i ∈ {1,…,N}. The mass of each agent is calculated from the agent’s fitness function value. Intuitively, a good agent possesses a strong gravitational field and moves slowly, as it has a larger inertial mass.

Based on the fitness function value, the mass of the ith agent in the kth iteration is defined as:

$$M_{i} (k) = \frac{{{\text{fit}}_{i} (k) - {\text{worst}}(k)}}{{\sum\nolimits_{j = 1}^{N} {({\text{fit}}_{j} (k) - {\text{worst}}(k))} }}$$
(7)

where \({\text{fit}}_{i}(k)\) is the fitness function value of the ith agent. For a minimization problem, \({\text{worst}}(k) = \max_{j \in \{1, \ldots, N\}} {\text{fit}}_{j}(k)\).

According to Newton’s theory of gravitation, the dth dimension of the force acting on the ith mass from the jth mass in the kth iteration is defined as:

$$F_{ij}^{d} (k) = G(k)\frac{{M_{i} (k) \times M_{j} (k)}}{{\left\| {{\mathbf{L}}_{i} (k) - {\mathbf{L}}_{j} (k)} \right\|_{2} }}\left( {l_{j}^{d} (k) - l_{i}^{d} (k)} \right)$$
(8)

where \(M_{i}\) and \(M_{j}\) are the masses of the agents, \({\mathbf{L}}_{i}\) is the position vector of the ith agent, \(l_{i}^{d}\) is the dth element of \({\mathbf{L}}_{i}\) and G(k) is the gravitational constant in the kth iteration.

It must be pointed out that the gravitational constant G(k) is important in determining the performance of GSA; it is defined as a function of the iteration number k:

$$G(k) = G_{0} \cdot \exp \left( { - \beta \cdot \frac{k}{{\hbox{max} \_{\text{iter}}}}} \right)$$
(9)

where \(G_{0}\) is the initial value, β is a constant, k is the iteration number and max_iter is the maximum number of iterations.

For the ith agent, the randomly weighted sum of the forces exerted by the other agents is calculated as:

$$F_{i}^{d} (k) = \sum\limits_{j \ne i} {rand_{j} F_{ij}^{d} (k)}$$
(10)

Based on the law of motion, the acceleration of the ith agent can be calculated by:

$$a_{i}^{d} (k) = \frac{{F_{i}^{d} (k)}}{{M_{i} (k)}}$$
(11)

where \(M_{i}\) is the inertial mass of the ith agent.

Thus, the search strategy of GSA can be described by the following equations for velocity and position, respectively:

$$v_{i}^{d} (k + 1) = {\text{rand}}_{i} \cdot v_{i}^{d} (k) + a_{i}^{d} (k)$$
(12)
$$l_{i}^{d} (k + 1) = l_{i}^{d} (k) + v_{i}^{d} (k + 1)$$
(13)

In the above equations, \(l_{i}^{d}\) is the position of the ith agent in the dth dimension, \(v_{i}^{d}\) is its velocity, \(a_{i}^{d}\) is its acceleration and \({\text{rand}}_{i}\) is a random number in [0, 1].
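To make Eqs. (7)–(13) concrete, a minimal real-valued GSA for a minimization problem might be sketched as follows (the sphere objective, bounds and hyperparameters are illustrative assumptions, not values from this paper):

```python
# Minimal real-valued GSA sketch implementing Eqs. (7)-(13) for minimization.
# Objective, bounds, and hyperparameters are illustrative assumptions.
import numpy as np

def gsa(obj, dim=2, n_agents=10, max_iter=50, g0=100.0, beta=20.0,
        lo=-5.0, hi=5.0, seed=0):
    rng = np.random.default_rng(seed)
    L = rng.uniform(lo, hi, (n_agents, dim))      # positions
    V = np.zeros((n_agents, dim))                 # velocities
    for k in range(max_iter):
        fit = np.apply_along_axis(obj, 1, L)
        worst = fit.max()                         # minimization: worst = max fit
        m = worst - fit
        M = m / (m.sum() + 1e-12)                 # Eq. (7)
        G = g0 * np.exp(-beta * k / max_iter)     # Eq. (9)
        A = np.zeros_like(L)
        for i in range(n_agents):
            for j in range(n_agents):
                if i == j:
                    continue
                diff = L[j] - L[i]
                dist = np.linalg.norm(diff) + 1e-12
                # Eq. (8) force, randomly weighted as in Eq. (10);
                # dividing by M_i gives the acceleration of Eq. (11).
                A[i] += rng.random() * G * M[j] * diff / dist
        V = rng.random((n_agents, 1)) * V + A     # Eq. (12)
        L = np.clip(L + V, lo, hi)                # Eq. (13)
    best = np.apply_along_axis(obj, 1, L).argmin()
    return L[best]

print(gsa(lambda x: np.sum(x ** 2)))              # sphere function
```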

3 The hybrid CGSA-SVM classifier

Parameter optimization and feature selection are of great importance for improving the classification ability of SVM. As reported in the literature, PSO, GA, GSA and other optimization methods have been used to improve SVM. This study develops a hybrid approach based on chaotic search and GSA, termed CGSA-SVM, for parameter determination and feature selection in SVM. The scheme of the proposed CGSA-SVM is illustrated in Fig. 1. In this scheme, GSA optimizes the key parameter pair (C, γ) of SVM, while chaotic search is implemented for feature subset optimization. For the parameter optimization process, GSA is well suited to the coding and searching of real-valued parameters. For the feature selection process, if features must be selected from a total of n features, each feature is assigned a variable valued between 0 and 1. If the value of a variable is less than or equal to 0.5, the corresponding feature is not chosen; conversely, if the value is greater than 0.5, the feature is chosen. A chaotic sequence is used to vary these variables, so that successive states of the sequence represent different feature subsets. The solution representation is shown in Fig. 2.

Fig. 1 The scheme of the proposed CGSA-SVM

Fig. 2 Solution representation

3.1 Feature selection based on chaotic search

Chaos is bounded, unstable dynamic behavior that exhibits sensitive dependence on initial conditions and includes infinitely many unstable periodic motions in nonlinear systems. Benefiting from the properties of ergodicity and stochasticity, chaos has been employed in numerous optimization problems. Since the feature selection problem here is an optimization problem with a search range of [0, 1], chaos is well suited to handling it.

An n-dimensional chaotic map is a discrete-time dynamical system that can be expressed as:

$$cx_{i}^{(k + 1)} = f(cx_{i}^{(k)} ),\quad i = 1, \ldots ,n$$
(14)

By defining the initial state \(cx_{i}^{(0)}\), a chaotic sequence can be obtained by iterating the system function. A chaotic sequence is usually denoted by \(\{cx_{i}^{(k)} , k = 0, 1, 2, \ldots\}\).

As a well-known chaotic map, the logistic map was introduced by Robert May in 1976 [36]. It is often used to explain how complex behavior can arise from a simple deterministic dynamic system without any stochastic disturbance. This map is defined as:

$$cx^{(k + 1)} = a \cdot cx^{(k)} (1 - cx^{(k)} )\quad {\text{for}}\quad 0 < a \le 4,\quad cx^{(k)} \in (0,1)$$
(15)

where \(cx^{(k)}\) is the kth chaotic number and a is the control parameter that determines the chaotic behavior of the dynamic system. Typically, a = 4.

In this paper, the logistic map is adopted to represent the selection of features as follows:

$$F_{i} = \left\{ {\begin{array}{*{20}l} 1 \hfill & {cx_{i}^{(k)} > 0.5} \hfill \\ 0 \hfill & {cx_{i}^{(k)} \le 0.5} \hfill \\ \end{array} } \right.$$
(16)

where \(F_{i} = 1\) means the ith feature is selected and \(F_{i} = 0\) means it is not selected.
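A minimal sketch of this selection mechanism (Eqs. (15)–(16)) is given below; the number of features and the initial states are illustrative assumptions:

```python
# Sketch of feature selection via the logistic map, Eqs. (15)-(16).
# One chaotic variable per feature; initial states are illustrative.
import numpy as np

def logistic_step(cx, a=4.0):
    return a * cx * (1.0 - cx)                 # Eq. (15)

n_features = 8
cx = np.linspace(0.11, 0.87, n_features)       # assumed initial states in (0, 1)
for k in range(5):                             # iterate the chaotic sequence
    cx = logistic_step(cx)
    mask = cx > 0.5                            # Eq. (16): F_i = 1 iff cx_i > 0.5
    print(k, mask.astype(int))
```

Because the logistic sequence is ergodic, the resulting masks wander through many different feature subsets rather than cycling through a short repeating pattern.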

3.2 Objective function

Classification accuracy and the number of selected features are the two criteria used to design an objective function for classification [28]. In this paper, an objective function is proposed that combines the two goals into one by assigning them weights. The weight on accuracy can be set to a high value (such as 100 %) if accuracy is the most important goal.

$${\text{obj}} = w \cdot {\text{SVM}}\_{\text{acc}} + (1 - w)\left( {1 - \frac{{\sum\nolimits_{i = 1}^{{n_{f} }} {F_{i} } }}{{n_{f} }}} \right)$$
(17)

where SVM_acc is the SVM classification accuracy, \(n_{f}\) is the number of features, w is the weight of the SVM classification accuracy and \(F_{i} \in \{0, 1\}\), where “1” denotes that feature i is selected and “0” that it is not.

The weight w controls the trade-off between classification accuracy and the number of features. For any SVM-based classifier, the first objective is always to improve classification accuracy; thus, w is usually set close to one. We set w = 0.9 in the following experiments.
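As a direct transcription of Eq. (17), the objective can be sketched as follows (the accuracy value and feature mask in the usage line are illustrative):

```python
# Sketch of the weighted objective of Eq. (17): classification accuracy
# traded off against the fraction of selected features. The accuracy
# evaluator is abstracted away; w = 0.9 as used in the experiments.
import numpy as np

def objective(svm_acc, feature_mask, w=0.9):
    f = np.asarray(feature_mask, dtype=float)
    return w * svm_acc + (1.0 - w) * (1.0 - f.sum() / f.size)

print(objective(0.95, [1, 0, 1, 1, 0, 0, 0, 0]))   # fewer features -> bonus
```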

3.3 The proposed CGSA-SVM

Based on the above description, the proposed CGSA-SVM algorithm is illustrated in Fig. 3. One aspect of this algorithm needs further explanation: the number of iterations in which chaotic search is used. We set the number of chaotic search iterations to one half of the maximum number of GSA iterations. In the first batch of iterations, chaotic search explores feature subsets using its ergodicity and stochasticity; feature selection by chaotic search and SVM parameter search by GSA take place simultaneously. In each iteration, the feature subset represented by the chaotic sequence is used to reduce the original dataset, and SVM parameter optimization is then conducted on the reduced dataset. In the second batch of iterations, the best feature subset found in the first batch is used to reduce the dataset, and SVM parameter optimization continues on this reduced dataset. This design helps maintain the stability of GSA in searching for the optimal parameters.

Fig. 3 The pseudo-code of the CGSA-SVM hybrid system
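The overall flow of Fig. 3 can be sketched as below. This is not the authors' code: for brevity, the GSA move in (C, γ) space is replaced by a random log-uniform proposal (propose_params, a hypothetical helper standing in for the GSA of Sect. 2.2), accuracy is estimated by 3-fold cross-validation, and the breast cancer dataset is an illustrative choice.

```python
# High-level sketch of the CGSA-SVM loop of Fig. 3 (not the authors' code).
# The GSA of Sect. 2.2 is replaced here by a random proposal for brevity;
# a real GSA step can be plugged in where propose_params is called.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

def propose_params(rng):
    # Stand-in for one GSA move in (C, gamma) space (log-uniform sampling).
    return 2.0 ** rng.uniform(-5, 15), 2.0 ** rng.uniform(-15, 3)

def cgsa_svm(X, y, max_iter=50, w=0.9, seed=0):
    rng = np.random.default_rng(seed)
    cx = rng.uniform(0.01, 0.99, X.shape[1])       # chaotic state per feature
    best = (-np.inf, None, None)                   # (objective, mask, params)
    for k in range(max_iter):
        if k < max_iter // 2:                      # first half: chaotic search
            cx = 4.0 * cx * (1.0 - cx)             # Eq. (15)
            mask = cx > 0.5                        # Eq. (16)
            if not mask.any():
                continue
        else:                                      # second half: freeze subset
            mask = best[1]
            if mask is None:                       # safety guard for the sketch
                break
        C, gamma = propose_params(rng)
        acc = cross_val_score(SVC(C=C, gamma=gamma), X[:, mask], y, cv=3).mean()
        obj = w * acc + (1 - w) * (1 - mask.mean())  # Eq. (17)
        if obj > best[0]:
            best = (obj, mask, (C, gamma))
    return best

X, y = load_breast_cancer(return_X_y=True)
print(cgsa_svm(X, y, max_iter=20)[0])
```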

4 Experiments

In order to verify the performance of the proposed CGSA-SVM, classification experiments were designed. The datasets used in the experiments were all obtained from the well-known UCI Machine Learning Repository (Center for Machine Learning and Intelligent Systems, University of California) [37]. The number of features, number of instances and number of classes in each dataset are shown in Table 1.

Table 1 Dataset from the UCI repository

The k-fold method [38] presented by Salzberg was employed in the experiments, with k set to 10. Each dataset was thus split into 10 parts, with nine parts used as training data and the remaining one as testing data in the SVM classification experiments. For each UCI dataset, the experiments were repeated 10 times, so that each of the ten parts was used once as testing data to verify the performance of the hybrid SVM system. The ten classification accuracy results for a dataset were recorded, and the mean accuracy was compared with the classification accuracies of other methods in the literature.
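For reference, this 10-fold protocol corresponds to the following sketch (the dataset and SVM settings are illustrative assumptions):

```python
# Sketch of the 10-fold protocol: each of the ten parts serves once as the
# test set, and the mean accuracy is reported. Settings are illustrative.
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(C=1.0, gamma='scale'), X, y, cv=cv)
print(scores.mean(), scores.std())
```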

In the first group of experiments, the classification results of CGSA-SVM were compared with published results of similar methods. The parameters of CGSA-SVM were set as follows: population size N = 10, maximum number of iterations max_it = 50, \(G_{0}\) = 100 and β = 20. The number of objective function evaluations for the proposed CGSA-SVM is therefore 500. In the comparative algorithms GSA-SVM [28], PSO-SVM [8] and GA-SVM [7], the numbers of objective function evaluations were 100, 2,000 and 5,000, respectively. The number used by CGSA-SVM was thus much lower than those of PSO-SVM and GA-SVM, but higher than that of GSA-SVM. This number was chosen to make the comparison as reliable as possible and to examine the classification capability of the proposed approach thoroughly.

Table 2 presents the results of CGSA-SVM (mean accuracy, standard deviation, optimized γ, optimized C and number of selected features) on the UCI datasets over 10 runs. The results obtained by CGSA-SVM are compared with grid search-based SVM (Grid-SVM) [7], GA-SVM [7], PSO-SVM [8] and GSA-SVM [28] in Table 3, with the results of the latter methods cited from the literature. The results in Table 3 show that the proposed approach is more accurate than the other methods on all datasets except “Ionosphere.” On some datasets, PSO-SVM performs the same as CGSA-SVM; however, the number of objective function evaluations used by CGSA-SVM was much lower than that of PSO-SVM: 500 compared with 2,000. The comparison with GA-SVM and PSO-SVM shows that the proposed algorithm is more efficient and accurate in classification.

Table 2 Experimental results and accuracy statistics of CGSA-SVM in the UCI datasets
Table 3 Classification accuracy comparison between different methods

Comparing CGSA-SVM with GSA-SVM [28] in Table 3, it can be seen that the proposed chaos embedded hybrid system is more accurate in classification, though it also uses a higher number of objective function evaluations. In the next group of experiments, the number of objective function evaluations was therefore set to 100 for a fairer comparison with GSA-SVM.

To further verify the performance of CGSA-SVM, we compared PSO-SVM with CGSA-SVM by running both under the same conditions. In this group of experiments, the parameters of CGSA-SVM were set as follows: population size N = 5, maximum number of iterations max_it = 20, \(G_{0}\) = 100 and β = 20, giving 100 objective function evaluations for the proposed CGSA-SVM. For PSO-SVM, both the cognition learning factor \(c_{1}\) and the social learning factor \(c_{2}\) were set to 2, and the numbers of particles and generations were set to 8 and 200, respectively, giving 1,600 objective function evaluations. Both approaches were applied to all 14 UCI datasets, and the k-fold method was used with k = 10.

Table 4 shows the comparison between CGSA-SVM and PSO-SVM in terms of classification accuracy and number of selected features. The results in Table 4 are based on 10 repeated experiments for each dataset, with means and standard deviations reported. To examine the significance of the experimental results, paired t tests were used. Table 4 shows that the proposed approach performs better than PSO-SVM on most datasets, with the exception of “Ionosphere” and “Wine,” and on most datasets the advantage is significant. The paired t tests show a significant difference between the results of the two approaches, which supports the validity of the comparison.

Table 4 Comparison of CGSA-SVM and PSO-SVM and results of paired t tests

In this group of experiments, the results of PSO-SVM differ slightly from those in [8]: the average accuracy over the first nine datasets is 93.62 % in [8], compared with the 90.59 % we obtained on the same datasets. The likely reason is that the number of fitness function evaluations in [8] was 2,000, whereas it was reduced to 1,600 in our experiments. This also shows that this type of heuristic search-based system is sensitive to the number of fitness evaluations.

Figure 4 shows the average iteration processes of PSO-SVM and CGSA-SVM, whose maximum numbers of iterations are 200 and 20, respectively. The curves are averaged over 10 executions of each approach on all 14 UCI datasets. From Fig. 4, it is evident that CGSA-SVM finds the best fitness value more efficiently and effectively.

Fig. 4 Average iteration process of classification on the 14 UCI datasets

Comparing the results of CGSA-SVM in these experiments with those of GSA-SVM in the literature, CGSA-SVM is more accurate when the number of fitness evaluations for both approaches is set to 100: the average accuracy of CGSA-SVM is 97.53 % in Table 4, greater than the 95.18 % of GSA-SVM in Table 3.

The classification experiments on the UCI datasets show the superiority of the proposed approach over existing similar methods. Compared with PSO and GA, GSA has several advantages [28]: among others, GSA is memoryless, which helps it escape from local optima, and the direction of each agent is adjusted based on the positions of all agents, which makes it more capable of exploring the search space. The chaos embedded GSA-SVM inherits the advantages of GSA in searching for SVM parameters, so it is reasonable that CGSA-SVM performed better than PSO-SVM and GA-SVM.

The difference between CGSA-SVM and GSA-SVM lies in the implementation of the feature selection method: CGSA-SVM employs a chaotic sequence to search for the optimal feature subset, while GSA-SVM uses binary GSA for this task. The chaotic sequence possesses the advantages of ergodicity and stochasticity, allowing CGSA-SVM to search the feature space more efficiently.

5 Conclusions

To solve the problems SVM faces in classification, this study presents a hybrid SVM system based on chaotic search and the gravitational search algorithm. The proposed CGSA-SVM is capable of searching for the optimal SVM parameters and the optimal feature subset simultaneously. Fourteen UCI datasets were employed in experiments to test the performance of the proposed CGSA-SVM system. Comparison of the obtained results with those of other approaches demonstrates that the developed CGSA-SVM approach achieves better classification accuracy than the others tested. With 500 fitness function evaluations, CGSA-SVM obtains much higher classification accuracy than GSA-SVM, PSO-SVM and GA-SVM, even though the latter two approaches use 2,000 and 5,000 evaluations, respectively. With the same number of fitness evaluations, CGSA-SVM performs better than GSA-SVM. The experimental results demonstrate that the proposed approach is an efficient and effective classification method.