1 Introduction

The properties of a dataset and the parameter values of a classifier are two factors that strongly influence the classifier’s accuracy (Li et al. 2015; Huang and Wang 2006). In complex, high-dimensional datasets that are non-separable, kernel functions can be used to map the data into a higher-dimensional space in which it is easier to separate (Phan et al. 2017). However, no kernel function is appropriate for all datasets, and identifying which kernel to use with each dataset is a complex and time-consuming process. Moreover, mapping the data into a new higher-dimensional space increases the time and space complexity (Zhang et al. 2010).

Therefore, many researchers have investigated working with the properties of the dataset (e.g., the number of features) rather than mapping it into higher-dimensional spaces (Li et al. 2015; Al-Zoubi et al. 2019). Feature Selection (FS) and Feature Weighting (FW) are two of the most efficient approaches that have been used to tackle this problem. On the one hand, FS methods focus on eliminating irrelevant and redundant features that may mislead the learning algorithms and, therefore, decrease their performance (e.g., accuracy). On the other hand, FW methods assign a rank (or a weight) to each feature based on its correlation with the class label, i.e., highly correlated features get a high weight (Tahir et al. 2007).

Support Vector Machine (SVM) is one of the most popular supervised machine learning techniques and has been successfully applied to various classification problems such as text categorization, hand-written character recognition, image classification, and fault diagnosis (Shin et al. 2005; Yuan and Chu 2007; Guo et al. 2008; Wu et al. 2010; Sun et al. 2019; Tanveer 2015; Xu et al. 2013). Like other classifiers, the performance of the SVM classifier is highly dependent on its parameter values and settings [e.g., the penalty parameter C and the kernel function’s parameters such as gamma (\(\gamma \)) (Lin et al. 2008; Huang and Dun 2008)], in addition to the properties of the dataset being classified (e.g., which features are used). Many researchers have concentrated on optimizing the SVM parameters using different algorithms, while others have selected the most informative features in order to improve the performance of the SVM classifier. However, few works that combine both problems are available. Grid search is one of the common algorithms used to set the C and \(\gamma \) parameters when using the radial basis function (RBF) kernel. However, this method becomes impractical in terms of time complexity (Hsu et al. 2003; LaValle et al. 2004), and it pays no attention to which features are used in the learning process.

Recently, many metaheuristic algorithms have been employed to tune the SVM parameters and/or select the input subset of features. Some examples of those optimization methods are: Genetic Algorithm (GA) (Huang and Wang 2006), Whale Optimization Algorithm (WOA) (Ala’M et al. 2018), Multiverse optimization (MVO) (Sadiq et al. 2019), Simulated Annealing (SA) (Boardman and Trappenberg 2006), Salp Swarm Algorithm (Ala’M et al. 2020), and Particle Swarm Optimization (PSO) (Huang and Dun 2008). The Competitive Swarm Optimization (CSO) (Cheng and Jin 2015) is a recent optimization algorithm that was fundamentally inspired by PSO with some modifications: neither local best nor global best positions are used in CSO. Instead, a competition mechanism is maintained in the swarm, where pairwise competitions are performed between the particles. The loser learns from the winner instead of learning from the pbest and gbest as in the original PSO. The CSO recorded competitive results in solving many optimization problems, especially large-scale problems.

In this paper, an improved CSO is proposed to perform two main tasks simultaneously: optimizing the SVM parameters and optimizing the weights of the input features. The new mechanism in CSO assists the algorithm in establishing a more stable balance between exploratory and exploitative tendencies by integrating a crossover mechanism into its procedure. Moreover, most of the previous works that combined metaheuristics with SVM performed feature selection, where the input features were either included or excluded. In contrast, the proposed approach automatically quantifies the weights of the input features and consequently gives an insight into the relative importance of these features. Therefore, it can provide a great aid for subject matter experts and decision makers in their domains. It is also expected that the proposed model is best suited for small to medium dimensional datasets. For verification, different experiments are conducted using popular medical datasets with different complexities. The proposed approach is compared with well-regarded algorithms in the literature that were commonly used in combination with SVM. The experimental simulations reveal the superiority of the proposed CSO and its efficient performance compared to other methods.

The rest of this paper is organized as follows: Sect. 2 reviews the related works. In Sect. 3, preliminaries including SVM and CSO basics are discussed. The proposed approach is discussed in Sect. 4. In Sect. 5, the details of the experiments conducted in this paper are discussed, followed by a deep analysis of the obtained results. Finally, the conclusion and future directions are drawn in Sect. 6.

2 Related works

Metaheuristic algorithms are known very well in the literature as high-level procedures designed to handle many complex problems that require searching, generating and selecting. In the literature, the application of metaheuristic algorithms in combination with SVM can be categorized into three main types.

The first type comprises the works designed to optimize the hyper-parameters of the SVM, namely cost and gamma. The SVM usually struggles to reach the optimal solution due to the wide range of these two parameters. Therefore, several works applied metaheuristic algorithms to solve this problem. For example, Friedrichs and Igel (2005) proposed an evolutionary approach to select and tune the SVM parameters; they applied the covariance matrix adaptation evolution strategy (CMA-ES) for this purpose. Additionally, they showed that their approach achieved better results than the standard grid search. Xiaofang and Yaonan (2008) utilized the chaos optimization algorithm to search for the optimal parameter values of the SVM. They argue that the chaos algorithm can effectively reach the global optimum and eventually improve search efficiency and accuracy. Lorena and De Carvalho (2008) investigated how the optimization of the SVM parameters can be enhanced for multiclass problems by applying GA. Another recent work (Tharwat and Hassanien 2019) employed the Bat Optimization Algorithm (BA) alongside other metaheuristic algorithms to determine the best hyper-parameters and, thus, increase the performance of the classification process. BA achieved competitive results when compared with other algorithms such as GA and PSO. Furthermore, many other works covered this area of research on tuning the SVM parameters (Guo et al. 2008; Zhang et al. 2010; Bao et al. 2013; Tharwat and Hassanien 2018). However, optimizing only the hyper-parameters is not effective in all problems, particularly with high-dimensional data.

This is where the second type comes in: it combines hyper-parameter optimization with feature selection simultaneously. Due to the emergence of high-dimensional data in recent years, feature selection has become important for such data (Zhao et al. 2014; Zhang et al. 2020). Simultaneously applying metaheuristic algorithms to the two tasks has shown strong results in the literature. Faris et al. (2018) proposed an automatic approach that optimizes the hyper-parameters and applies a feature selection process to achieve the best result from the SVM. The authors used the Multi-Verse Optimizer (MVO) to achieve this; the MVO was designed as a tuner that improves the main parameters of SVM and finds the best subset of features. Aljarah et al. (2018) presented similar work by applying the recent Grasshopper Optimization Algorithm (GOA). The proposed approach was compared against seven well-known algorithms on 18 high- and low-dimensional benchmark datasets. The results show that their approach outperforms all seven algorithms on most of the datasets. In another study, simultaneous feature subset selection and optimization of SVM parameters were applied to solve multi-objective optimization problems using the multi-objective Genetic Algorithm NSGA-II (Bouraoui et al. 2018). Similarly, several works were implemented using different metaheuristic algorithms, including Chaotic Ant Lion Optimization (CALO) (Tharwat and Hassanien 2018), PSO (Lin et al. 2008; Tu et al. 2007; Liu et al. 2011; Huang and Dun 2008), GA (Huang and Wang 2006; Zhao et al. 2011; Reif et al. 2012), Ant Colony Optimization (ACO) (Huang 2009) and Gravitational Search Algorithm (GSA) (Li et al. 2015).

Like the previous approaches, the final type optimizes the hyper-parameters; however, it differs in the way it deals with the features, as it performs feature weighting instead of selecting a subset of them. The weighting process multiplies the values of each feature, for every instance, by the weight assigned to that feature; the features can then be ordered by their weights. This method is considered more effective than feature selection in a number of cases and problems where the features are very sensitive, so that removing them may negatively affect the classification performance. Phan et al. proposed a GA-SVM model that optimizes the hyper-parameters of SVM and weights the features to solve classification problems efficiently (Phan et al. 2017). The authors ran their approach on several real-world datasets and compared it with grid search and other standard search methods. The work introduced a good technique using the GA. To the best of our knowledge, Phan et al. were the earliest in this category, and since their work there have been no major advancements in this direction. Therefore, we conduct this research in order to boost the performance of SVM and show the capabilities of swarm intelligence algorithms in making a significant advancement in this line of research.

3 Preliminaries

3.1 Support vector machine

Support Vector Machine (SVM) is a well-established supervised machine learning technique, first designed by Vapnik (1999), that handles classification and regression scenarios by examining the given datasets and exploring hidden or visible patterns. SVM has shown high capabilities in dealing with nonlinear classification problems. The SVM method maps the training data into another feature space and then performs linear separation on the data to detect the possible classes (Wang and Chen 2020). SVM performs the separation step by generating a hyper-plane that maximizes the margin between the nearest points of the different classes, which are called support vectors (Shen et al. 2016). SVM tries to find the fittest hyper-plane, as demonstrated in Fig. 1. However, this step is prone to the over-fitting problem, which can lead to the misclassification of new samples. To deal with this problem, this step is managed by using a penalty parameter, the cost (C), which increases the accuracy of classification and prediction.

Fig. 1 Support vectors and optimal hyperplane in support vector machines

The formula used in common SVM can be equipped with kernel functions to deal with the nonlinear separation problem, including polynomial kernel and the Radial Basis Function (RBF). RBF is the most popular kernel function utilized within SVM, which uses a Gamma (\(\gamma \)) parameter.
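The text does not spell out the RBF kernel, so the following minimal sketch assumes its standard form, K(x, z) = exp(-γ‖x − z‖²). The function name `rbf_kernel` and the sample values are illustrative only.

```python
import math

def rbf_kernel(x, z, gamma):
    """Return exp(-gamma * ||x - z||^2) for two equal-length sequences."""
    if len(x) != len(z):
        raise ValueError("x and z must have the same number of features")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Identical points have similarity 1; a larger gamma makes similarity decay faster.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
near = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
far = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=5.0)
```

A small \(\gamma \) makes distant points look similar (smoother decision boundary), whereas a large \(\gamma \) makes the kernel very local, which is why \(\gamma \) has to be tuned together with C.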

The most widely applied technique for finding the parameters of SVM-based classifiers is grid search. Managing the misclassification error and obtaining suitable values of C and \(\gamma \) are both important to reach high performance and avoid over-fitting when using SVM with RBF kernels (Hsu et al. 2003). However, because the grid search technique performs a local search process, it tends to stagnate in local optima (LO). In this method, small search intervals yield poor results, while large intervals require more computational time (LaValle et al. 2004; Hsu and Lin 2002). Additionally, this algorithm cannot be used for Feature Selection (FS).
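To make the cost of grid search concrete, here is a minimal sketch of an exhaustive search over a (C, \(\gamma \)) grid. The exponentially spaced ranges and the `toy_score` function, a hypothetical stand-in for the cross-validated accuracy of an SVM, are assumptions chosen for illustration; they are not taken from the experiments in this paper.

```python
from itertools import product

def grid_search(score, c_values, gamma_values):
    """Evaluate score(C, gamma) on every grid point and return the best triple."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        s = score(c, gamma)
        if best is None or s > best[0]:
            best = (s, c, gamma)
    return best

# Exponentially spaced grids (powers of two); the exact ranges are illustrative.
c_grid = [2.0 ** e for e in range(-5, 16, 2)]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]

# Hypothetical stand-in for cross-validated accuracy, peaking at C=8, gamma=0.125.
def toy_score(c, gamma):
    return -abs(c - 8.0) - abs(gamma - 0.125)

best_score, best_c, best_gamma = grid_search(toy_score, c_grid, gamma_grid)
n_evaluations = len(c_grid) * len(gamma_grid)
```

The number of evaluations is the product of the two grid sizes, so refining both grids multiplies the training cost, and the search can only return values that lie exactly on the grid.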

3.2 Competitive swarm optimization

CSO is known as a successful variant of the PSO technique (Cheng and Jin 2015). Reports and observations have confirmed that the well-known PSO does not perform well on high-dimensional problems because of the many local best positions encountered in these cases (Yang and Pedersen 1997). When PSO cannot maintain a fine balance between exploration and exploitation, premature convergence can occur, which is a frequently observed drawback of metaheuristic approaches including PSO (Chen et al. 2013). To cope with premature convergence and the tendency to stagnate, many variants of PSO were introduced in recent years (Cheng and Jin 2015). However, the majority of these variants may still face the stagnation dilemma, because the global best particle (gbest) has a significant impact on the efficacy of PSO. CSO was designed to decrease the impact of gbest in order to alleviate these convergence problems.

CSO differs from PSO structurally and conceptually in two respects. First, in PSO, the exploration and exploitation mechanisms are driven by gbest and the personal best particle pbest, whereas in CSO there is no pbest or gbest within the optimization steps, and the exploratory and exploitative behaviors emerge from the pairwise competition of particles. In this way, any particle has a chance to be a leader. Second, CSO keeps no history: when particles lose the competition, they learn only from the winner particles in the current set of solutions. In CSO, the set of search particles at iteration t, P(t), is defined as:

$$\begin{aligned} P(t)= & {} \{X_{1}(t),\ldots ,X_{m}(t)\} \end{aligned}$$
(1)
$$\begin{aligned} X_{i}(t)= & {} \{x_{i1}(t),\ldots ,x_{id}(t)\} \end{aligned}$$
(2)

where X denotes a d-dimensional particle (\(X \in R^d\)) and m is the population size. In each iteration, the particles are randomly divided into two groups of size m/2. A base particle (one particle from the first group) competes with its paired particle in the other group. The winner is the particle with the better fitness, and it is transferred directly into \(P(t+1)\), while the other one is the loser and learns from the winner, i.e., it updates its velocity and position vectors with respect to those of the winner before being inserted into the next iteration. In this manner, m/2 competitions are performed to update half of the population. Fifty percent of the particles are winners and are inserted directly into the next iteration, and the rest are treated as loser particles, which are passed on after their status is updated, as shown in Fig. 2.

Fig. 2 CSO competition and learning process

Each particle has a position vector and a velocity vector. In the i-th pairwise race of the t-th iteration, the positions of the winner and loser particles are denoted by \(X_{w,i}(t)\) and \(X_{l,i}(t)\), and their velocities by \(V_{w,i}(t)\) and \(V_{l,i}(t)\), where \(i=1,2,\ldots ,m/2\). After the i-th race, the loser is updated based on:

$$\begin{aligned} V_{l,i}(t+1)= & {} R_1(i,t)V_{l,i}(t) + R_2(i,t)(X_{w,i}(t)-X_{l,i}(t)) \nonumber \\&+\,\varphi R_3(i,t)(\bar{X_i}(t)-X_{l,i}(t)) \end{aligned}$$
(3)
$$\begin{aligned} X_{l,i}(t+1)= & {} X_{l,i}(t) + V_{l,i}(t+1) \end{aligned}$$
(4)

where \(R_1(i,t), R_2(i,t), R_3(i,t) \in [0,1]^d\) denote random vectors and \(\bar{X_i}(t)\) represents the mean position of the relevant particles, which can be calculated either over the whole current swarm or over a predefined set of neighboring particles. \(\varphi \) is the single tunable factor of CSO and controls the impact of \(\bar{X_i}(t)\). The pseudocode of CSO is represented in Algorithm 1.

figure a
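The competition and learning steps of Eqs. 1 to 4 can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: the fitness is minimized (the proposed model maximizes accuracy instead), velocities start at zero, \(\bar{X}(t)\) is the mean of the whole swarm, positions are clipped to the search bounds, and the names and default values (`cso_minimize`, `m`, `phi`, `seed`) are hypothetical.

```python
import random

def cso_minimize(fitness, dim, m=20, iterations=100, phi=0.1,
                 bounds=(0.0, 1.0), seed=0):
    """Sketch of CSO (Eqs. 1-4): losers learn from winners and the swarm mean."""
    rng = random.Random(seed)
    lo, hi = bounds
    X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(m)]
    V = [[0.0] * dim for _ in range(m)]
    for _ in range(iterations):
        # Mean position of the current swarm (global version of X-bar).
        mean = [sum(x[j] for x in X) / m for j in range(dim)]
        order = list(range(m))
        rng.shuffle(order)
        # m/2 pairwise competitions between randomly paired particles.
        for a, b in zip(order[: m // 2], order[m // 2:]):
            w, l = (a, b) if fitness(X[a]) <= fitness(X[b]) else (b, a)
            for j in range(dim):
                r1, r2, r3 = rng.random(), rng.random(), rng.random()
                V[l][j] = (r1 * V[l][j]
                           + r2 * (X[w][j] - X[l][j])
                           + phi * r3 * (mean[j] - X[l][j]))   # Eq. 3
                X[l][j] = min(hi, max(lo, X[l][j] + V[l][j]))  # Eq. 4, clipped
    return min(X, key=fitness)

def sphere(x):
    """Toy objective used only to exercise the sketch."""
    return sum(v * v for v in x)

best = cso_minimize(sphere, dim=5, bounds=(-1.0, 1.0))
```

Because winners pass to the next iteration unchanged, the best fitness in the swarm never deteriorates, even though no gbest or pbest is stored explicitly.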

4 Proposed approach

In this section, we first describe the modified version of CSO (CCSO) which introduces a crossover operator embedded in it. Then, we describe, in detail, the proposed CCSO-SVM classification model which deploys CCSO for optimizing the parameters of SVM and to perform feature weighting simultaneously.

4.1 CSO with crossover (CCSO)

To enhance the efficacy of CSO, we utilize a crossover procedure to combine the particles, as described in Eq. 5.

$$\begin{aligned} X_{l}(t) = \texttt {Crossover} (X_{l}(t) , X_\mathrm{{best}}) \end{aligned}$$
(5)

where Crossover() is the operator that applies crossover to the loser particles, \(X_{l}(t)\) is the loser particle of the tth iteration, and \(X_\mathrm{{best}}\) is the particle with the best fitness found over all iterations so far.

After a pairwise competition is conducted between winner and loser particles, in which the loser particles are updated according to the winners, the crossover mechanism takes place. Each loser particle is crossed with the best particle found so far. This technique encourages further exploration and improvement of the loser particles by giving each position an equal random chance of either taking its value from the best particle or keeping its current value, according to the following equation:

$$\begin{aligned} X_l^{i} ={\left\{ \begin{array}{ll} X_{\mathrm{best}}^{i} &{} \mathrm{rand}(i) = 1\\ X_l^{i} &{} \mathrm{rand}(i) = 0 \end{array}\right. } \end{aligned}$$
(6)

where \(X_l^{i}\) is the ith position in the loser particle, \(X_\mathrm{{best}}^{i}\) is the ith position in the best particle, and rand(i) is the corresponding random binary element (0 or 1).

figure b
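Equation 6 amounts to a uniform crossover driven by a random binary mask. The following minimal sketch illustrates it; the function name `uniform_crossover`, the choice to return the mask, and the sample particles are assumptions made for illustration.

```python
import random

def uniform_crossover(loser, best, rng=random):
    """Eq. 6: each position is copied from `best` when its random bit is 1."""
    if len(loser) != len(best):
        raise ValueError("particles must have the same dimension")
    mask = [rng.randint(0, 1) for _ in loser]
    child = [b if bit == 1 else l for l, b, bit in zip(loser, best, mask)]
    return child, mask

child, mask = uniform_crossover([0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6],
                                rng=random.Random(42))
```

Returning the mask alongside the child is only a convenience for inspecting which positions were inherited from the best particle.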

An illustration of this operation is demonstrated in Fig. 3.

Fig. 3 The used crossover scheme

Fig. 4 A simple example to illustrate the representation of the weighting mechanism in the solution of the proposed model

As shown in Fig. 3, we see that this operator exchanges the values between two particles. By this rule, we can experience sudden fluctuations in the loser particle. This operator tries to generate an intermediate particle within the feature space to assist CSO in exploring the feature weights.

Algorithm 3 explains the new steps for CSO after embedding the crossover technique. As the algorithm shows, the fitness of the updated particle \(X_{l\_\mathrm{cross}}\) is compared against the fitness of both \(X_{\mathrm{best}}\) and \(X_l\), and the particle with the best fitness is moved to the next iteration.

figure c
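Read literally, the selection step described above keeps whichever of the three candidates has the best fitness. The sketch below shows that reading only; since the pseudocode of Algorithm 3 is not reproduced here, the tie-breaking rule, the function name `select_survivor`, and the toy fitness (higher is better, as with accuracy) are assumptions.

```python
def select_survivor(x_loser, x_cross, x_best, fitness):
    """Return the candidate with the highest fitness (ties keep the earliest)."""
    return max((x_loser, x_cross, x_best), key=fitness)

def toy_fitness(x):
    """Hypothetical fitness to maximize: closeness to the point (0.5, 0.5)."""
    return -sum((v - 0.5) ** 2 for v in x)

# The crossed particle survives when it beats both alternatives.
survivor_a = select_survivor([0.0, 0.0], [0.5, 0.0], [1.0, 1.0], toy_fitness)
# The best particle survives when neither alternative improves on it.
survivor_b = select_survivor([0.0, 0.0], [0.5, 0.0], [0.4, 0.4], toy_fitness)
```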

4.2 CCSO-SVM classification model

Before applying the CCSO algorithm for performing the tasks of optimizing SVM parameters and optimizing the weights of the input features, the representation of the solution (also known as particle or individual), and the selection of the fitness function should be resolved. In the following, we discuss each of these important design issues then we describe the overall system architecture of the proposed model.

4.2.1 Solution representation

As a search algorithm designed for solving sophisticated problems, CCSO is simultaneously utilized in two parts. The first part comprises searching for the best C and \(\gamma \) parameters for SVM classifier, and the second part includes weighting the features (see Fig. 5). Therefore, the number of elements generated by CCSO covers both parameters in addition to D number of features for every dataset, all combined in one-dimensional vector of \(D+2\) real numbers originally generated in the interval [0,1].

The first two elements in the vector correspond to the C and \(\gamma \) parameters. The search spaces for these parameters differ from the original [0,1] scale and, therefore, these elements are scaled to the intervals [0,35000] for C and [0,32] for \(\gamma \). This transformation is performed using min-max normalization as given in Eq. 7, where A is a value in the original range \([\min _{A}, \max _{A}]\) and B is the corresponding value in the target range \([\min _{B}, \max _{B}]\).

$$\begin{aligned} B=\frac{A-\min _{A}}{\max _{A}-\min _{A}}(\max _{B}-\min _{B})+\min _{B} \end{aligned}$$
(7)

The remaining elements, which correspond to the features, are kept at the original scale and used for weighting. In the weighting process, each of these elements is multiplied by the value of its matching feature for every instance. For example, given a simple dataset of 3 instances, the values of the first feature of the 3 instances are multiplied by the value of the first weight element of the solution generated by CCSO. The same is applied to the rest of the dataset, as shown in Fig. 4. The overall solution structure and its representation are illustrated in Fig. 5.
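The decoding of a solution vector and the weighting step can be sketched as follows. The function names (`min_max_scale`, `decode_solution`, `weight_features`) and the sample values are hypothetical; the target intervals are the ones stated above.

```python
def min_max_scale(a, min_a, max_a, min_b, max_b):
    """Eq. 7: map a from [min_a, max_a] onto [min_b, max_b]."""
    return (a - min_a) / (max_a - min_a) * (max_b - min_b) + min_b

def decode_solution(vector, c_range=(0.0, 35000.0), gamma_range=(0.0, 32.0)):
    """Split a (D+2)-element vector in [0, 1] into C, gamma and D feature weights."""
    c = min_max_scale(vector[0], 0.0, 1.0, *c_range)
    gamma = min_max_scale(vector[1], 0.0, 1.0, *gamma_range)
    return c, gamma, list(vector[2:])

def weight_features(instances, weights):
    """Multiply every feature value by the weight of its column."""
    return [[v * w for v, w in zip(row, weights)] for row in instances]

# A solution for a dataset with D = 3 features and 3 instances.
c, gamma, weights = decode_solution([0.5, 0.25, 1.0, 0.0, 0.5])
weighted = weight_features(
    [[2.0, 4.0, 6.0], [1.0, 3.0, 5.0], [0.0, 8.0, 10.0]], weights)
```

A weight of 0 effectively removes a feature, so feature selection appears as a special case of this representation. In practice, an element that decodes to exactly C = 0 would probably need guarding, since SVM implementations generally require a strictly positive penalty.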

Fig. 5 Structure of the solution in the proposed CCSO-SVM model

Fig. 6 The proposed CCSO-based process

Table 1 List of used datasets

4.2.2 Fitness evaluation

The assessment of each individual in every iteration is performed by using the fitness function to provide the feedback for CSO and CCSO. The fitness function that is chosen for evaluation is the classification accuracy of the SVM classifier, which is calculated according to the following equation:

$$\begin{aligned} {\mathrm{fitness}}(I_{i}^{t})=\frac{1}{K}\sum _{k=1}^{K}\frac{1}{N}\sum _{j=1}^{N}\delta (c(x_{j}),y_{j}) \end{aligned}$$
(8)

where \(c(x_{j})\) is the class predicted for the jth instance of the testing set, \(y_{j}\) is the actual class label of the jth instance, and \(\delta \) indicates whether the two agree, i.e., if \(c(x_{j})\) = \(y_{j}\), then \(\delta =1\), otherwise \(\delta =0\). The number of instances in the testing set is denoted by N, and K is the number of folds.
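As a worked instance of Eq. 8, the sketch below averages the per-fold accuracy over K folds. The predicted labels are made up for illustration; in the proposed model they would come from the trained SVM.

```python
def fold_accuracy(predicted, actual):
    """Inner sum of Eq. 8: fraction of instances with predicted == actual."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(1 for p, y in zip(predicted, actual) if p == y) / len(actual)

def fitness(folds):
    """Eq. 8: mean accuracy over K folds; `folds` holds (predicted, actual) pairs."""
    return sum(fold_accuracy(p, y) for p, y in folds) / len(folds)

folds = [
    ([1, 0, 1, 1], [1, 0, 0, 1]),  # 3 of 4 correct
    ([0, 0, 1, 1], [0, 0, 1, 1]),  # 4 of 4 correct
]
score = fitness(folds)  # (0.75 + 1.0) / 2
```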

4.2.3 System architecture

The processes that are carried out to fulfill our proposed approach start with splitting every dataset into training and testing sets. The splitting criterion depends on the number of separate experiments. In other words, for k experiments, the dataset is split into k parts: \(k-1\) parts (a fraction \((k-1)/k\) of the data) are used for training, and the remaining part (1/k of the data) is used for testing. This guarantees maximum diversity of both training and testing sets to produce the best possible model.
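The k-way split described above can be sketched as follows, assuming that in each experiment one part is held out for testing and the other k − 1 parts are used for training. The shuffling seed and the function name `k_fold_indices` are illustrative, and no stratification by class is attempted.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(n=10, k=5))
```

With n = 10 and k = 5, every experiment trains on 8 instances and tests on the remaining 2.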

The next step includes involving the CSO and the proposed CCSO algorithms. In this step, CCSO starts its iterations with a randomly generated vector of real numbers, which is then used for setting C, \(\gamma \) and the weights of the features. Then, the SVM classifier starts training using the weighted training set. During the training process, an inner cross-validation is used in order to produce a more robust model.

Table 2 The detailed settings of the utilized system
Table 3 The detail of runs
Table 4 The parameter settings
Table 5 Different population \(\times \) iteration
Table 6 Different Phi parameters of the CCSO
Table 7 Comparison of CCSO and CSO with other metaheuristic methods and grid search in terms of accuracy rates

After finishing the training process, SVM classifier returns the accuracy as the fitness value to the CCSO algorithm. The previous processes are repeated for the same training set until the termination criterion for CCSO is met which, in our case, is the maximum number of iterations.

When the maximum number of iterations is reached, the best individual produced by CCSO is used for testing using the testing set. Finally, all previous steps are repeated for k times and the average values are considered. Figure 6 depicts all the previous processes.

5 Experiments and results

In this section, all experiments and results are reported in detail to show the performance of the different algorithms on different datasets, in addition to a brief description of the importance of feature weighting.

Table 8 P values of CCSO versus other metaheuristic methods and grid search

5.1 Experimental setup and parameters tuning

In this work, in order to have a fair comparison, all tests are conducted under the same conditions, on a personal computer with the specifications reported in Table 2. To investigate the performance of the proposed CCSO-SVM, ten well-known datasets from the UCI repository are utilized (Lichman 2013). Table 1 shows the details of the selected datasets. Note that the class ratio is calculated by dividing the number of instances of the minority class by the number of instances of the majority class.
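The class ratio defined above can be computed as in the following short sketch. The label counts are made up for illustration, and for datasets with more than two classes the sketch assumes the ratio is taken between the smallest and the largest class.

```python
from collections import Counter

def class_ratio(labels):
    """Number of instances in the smallest class divided by the largest class."""
    counts = Counter(labels).values()
    return min(counts) / max(counts)

# 30 minority instances against 70 majority instances.
ratio = class_ratio(["pos"] * 30 + ["neg"] * 70)
```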

Details of tests and runs are shown in Table 3. The parameter settings of the CCSO, CSO, GA, and PSO are shown in Table 4. To set the population size and the number of iterations, different combinations of these two parameters are tested for CCSO, namely \(10\times 30\), \(30\times 50\), and \(50\times 100\). Two datasets are selected for these initial experiments because their accuracy results are highly sensitive to the model settings. As shown in Table 5, a population size of 30 and 50 iterations produce the best accuracy. Furthermore, different values for Phi are tested as well. Table 6 shows that the best results for CCSO are obtained when Phi is 0.2.

Fig. 7 Convergence curves of CCSO, CSO, GA, and PSO techniques

Fig. 8 BoxPlot results for CCSO, CSO, GA and PSO

5.2 Comparison with other well-regarded metaheuristic-based SVM

Table 7 compares the accuracy results of the proposed CCSO against CSO and other peers. As per the results in Table 7, the proposed CCSO outperforms the other competitors in terms of accuracy rates on the Colon cancer, Haberman, Heart, Libras, Parkinsons, Spectf, and WDBC datasets (70% of the datasets). Compared to Grid-SVM, all swarm-based optimizers show a better performance in terms of average accuracy (Acc) and standard deviation (Std) results. The CSO reveals the best results on the Blood and Diabetes cases, while GA obtains the highest rate on the Liver case. The CCSO shows the highest accuracy results for a large dataset like Colon cancer, where it improves on the results of CSO, GA, PSO, and Grid search by up to 1.428%, 3.095%, 5%, and 20.952%, respectively. Based on the overall ranks, the CCSO is ranked first, followed by CSO, GA, PSO, and Grid search. The main reason is that the CCSO has an improved capability in balancing the exploratory and exploitative trends when dealing with more complex feature spaces. The integrated crossover scheme has helped CCSO escape local optima more efficiently than its peers.

Fig. 9 Features weights obtained by CCSO method for blood and colon cancer datasets

Fig. 10 Features weights obtained by CCSO method for diabetes and Haberman datasets

Fig. 11 Features weights obtained by CCSO method for heart and Libras datasets

Fig. 12 Features weights obtained by CCSO method for Liver and Parkinsons datasets

Fig. 13 Features weights obtained by CCSO method for Spectf and WDBC datasets

The nonparametric Wilcoxon rank-sum test is conducted to test the significance of the obtained results of CCSO against the other metaheuristic algorithms. In this work, the test is performed at the 5% significance level. Table 8 shows the statistical test results in terms of p values. Note that the p values that are less than 0.05, which indicate a significant difference, are underlined. As per the obtained p values, the differences are significant in most of the cases: CCSO significantly outperforms GA on 6 datasets, and it outperforms PSO and grid search on 7 datasets.
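For readers who want to reproduce this kind of comparison, the following is a minimal sketch of a two-sided Wilcoxon rank-sum test based on the normal approximation. It is an assumption-laden illustration rather than the exact procedure used here: it applies neither a tie correction nor a continuity correction, the accuracy values are made up, and a dedicated statistics library would normally be preferred.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of the tied ranks i+1 .. j
        ranks_x += avg_rank * sum(1 for v, g in pooled[i:j] if g == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2.0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (ranks_x - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p value
    return z, p

# Hypothetical accuracy rates of two optimizers over five runs each.
acc_a = [0.90, 0.91, 0.92, 0.93, 0.94]
acc_b = [0.80, 0.81, 0.82, 0.83, 0.84]
z, p = rank_sum_test(acc_a, acc_b)
significant = p < 0.05
```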

The convergence curves of CCSO are monitored and compared with CSO, PSO, and GA in Fig. 7. First, the curves of the CSO-based approaches are superior to PSO and GA on most of the datasets. In addition, the curves of CCSO and CSO are very competitive for some datasets such as Colon Cancer, Diabetes, Heart, and WDBC. For some of them, such as Haberman and Heart, GA and PSO also show competitive convergence trends. We observe that CCSO has a better capacity to avoid local optima stagnation; hence it outperforms basic CSO on the Blood, Haberman, Heart, and Liver datasets. However, the basic CSO is ranked first in two cases: Parkinsons and Spectf. Overall, the CCSO shows improved convergence behavior due to its higher exploration capacity in the initial steps and its enriched capability to perform a smoother transition from diversification to intensification in the last steps. The crossover between the loser particles and the leaders has enhanced the quality of the swarm in a gradual manner (Figs. 7, 8).

In order to further exhibit the distribution of the accuracy rates of CCSO-SVM versus the other optimized SVMs and to study the stability of the algorithms, we show the boxplots of the obtained results in Fig. 8. The boxplot results also confirm the satisfactory efficacy of the proposed CCSO-SVM, as it shows very competitive stability compared to the other algorithms in most of the cases, with relatively smaller boxes.

The resulting feature weights calculated by the CCSO method are shown in Figs. 9, 10, 11, 12 and 13. Note that when a dataset has more than 10 input features, only the ten features with the highest weights are shown for clarity. The weighting method is a meaningful way to check the most influential features of every dataset. As can be seen in the figures, the proposed CCSO-SVM automatically identifies the weights of the input features for each dataset.

For example, as shown in the Heart dataset figure (Fig. 11), the features have been weighted by the CCSO algorithm and arranged by their importance in detecting the status of the heart patient, i.e., whether the disease is present or absent. The Age feature, for instance, is the least important among all features; according to the developed model, this indicates that age has less effect on the detection of this disease, unlike a feature such as Sex, which appears to be more important for the detection of heart disease. In addition, the weight of the maximum heart rate achieved feature, which is the maximum number of times the heart should beat per minute, shows that the heart rate is essential for the detection process; according to the American Heart Association, the normal resting heart rate of an adult should be between 60 and 100 bpm.

In this work, we select the medical field as an application to form an essential guide for concerned parties in that field. Such guidance can also be implemented to seek the pattern and hidden information of the selected features, such as the most and least important feature. This method will help in preventing the use of useless and time-consuming features and counting on the important ones. In addition, these weights could help decision makers to have better knowledge about the key factors in identifying a specific disease.

5.3 Comparison with other classification algorithms

In this experiment, we compare the performance of the proposed CCSO-SVM with well-regarded classification algorithms that are widely used in the literature. These classifiers include the decision tree algorithm C4.5, k-nearest neighbor (k-NN), Naïve Bayes (NB), and the Multi-layer Perceptron (MLP) neural network. For C4.5, we used its Java implementation in Weka, which is known as J48. For k-NN, k was set to 1, which gave the best performance. In MLP, the number of hidden neurons was set according to the popular rule that determines this number as the average of the number of input features and the number of outputs. Table 9 tabulates the results of all these classifiers. It can be clearly seen that CCSO-SVM obtained the best results on the majority of the datasets (7 out of 10 datasets), while MLP obtained the best results in two cases, which are the Liver and Spectf datasets. Finally, k-NN achieved the best results in one case only, which is Parkinsons, with a slightly better result than CCSO-SVM.

5.4 Comparison with other methods in the literature

Table 10 shows the results of the CCSO-SVM and other approaches proposed in the literature, namely, CSO-ELM & CSO-RELM (Eshtay et al. 2018), TMGWO2 (Abdel-Basset et al. 2020), GCACO (Moradi and Rostami 2015), and UPFS (Dadaneh et al. 2016). It can be seen that the CCSO-SVM outperforms all the other approaches on five datasets and achieves competitive results on the rest of the datasets. Therefore, the results of this section support the superiority of the proposed method.

As with any metaheuristic algorithm, the complexity of the proposed wrapper method highly depends on the complexity of the fitness function. In our case, this can be expressed by the complexity of the SVM with RBF kernel, which is \(O(n_{\mathrm{sv}}d)\), where \(n_{\mathrm{sv}}\) denotes the number of support vectors and d represents the number of input dimensions.
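One way to read the \(O(n_{\mathrm{sv}}d)\) bound is as the cost of evaluating a trained RBF-SVM on a single sample: the decision function loops over the support vectors and, for each of them, over the d input dimensions. The sketch below illustrates this reading; the coefficients, bias and support vectors are made-up values, and the cost of training the SVM is not covered.

```python
import math

def rbf_decision(x, support_vectors, dual_coefs, bias, gamma):
    """f(x) = sum_i coef_i * exp(-gamma * ||sv_i - x||^2) + bias."""
    total = bias
    for sv, coef in zip(support_vectors, dual_coefs):       # n_sv iterations
        sq_dist = sum((a - b) ** 2 for a, b in zip(sv, x))   # d operations each
        total += coef * math.exp(-gamma * sq_dist)
    return total

svs = [[0.0, 0.0], [1.0, 1.0]]
value = rbf_decision([0.0, 0.0], svs, dual_coefs=[1.0, -1.0], bias=0.0, gamma=1.0)
predicted_class = 1 if value >= 0 else -1
```

Since every fitness evaluation in the wrapper involves training and evaluating an SVM, the total cost grows with the population size, the number of iterations and the number of folds.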

Table 9 Comparison of CCSO-SVM with standard classification models
Table 10 Comparison of CCSO-SVM with other methods in the literature

6 Conclusion and future directions

In this work, we proposed an improved CSO-based hybrid SVM model that enhances the accuracy results of SVM based on the exploration and exploitation mechanisms of a crossover-based CSO technique. The developed system and model are utilized to optimize the parameters of SVM in addition to feature weights in dealing with several classification cases. The proposed hybrid CCSO-based SVM model was compared with CSO, GA, PSO and Grid search methods in terms of accuracy results and convergence behaviors. The statistical results on ten datasets from UCI show that the proposed approach can obtain satisfactory results and the best parameters and attribute weights.

For future works, the developed CCSO-based SVM model can be utilized to deal with different applications in medical diagnosis and geoscience areas, such as remote sensing datasets. In addition, a multi-objective version of the proposed CCSO can be utilized to deal with different objective functions, simultaneously. Parallel computing is another direction that is worth investigating in order to speed up the optimization process when the algorithm is applied on large datasets.