1 Introduction

Any discussion of data mining must briefly touch on the broader concept of Knowledge Discovery in Databases (KDD), the process of extracting useful information from large-scale datasets. Knowledge discovery comprises four main stages: data warehousing, preprocessing, data mining, and evaluation [1, 2]. This process is essential in information acquisition, machine learning, pattern discovery, data visualization, databases, statistics, and artificial intelligence.

Data mining involves multiple tasks such as regression, classification, clustering, deviation and change detection, dependence modeling, and summarization [3]. For data mining to yield the best results, the data must first be prepared through reduction, transformation, normalization, discretization, integration, feature extraction, cleaning, and feature selection. Feature selection (also called variable or subset selection) is the focus of this study; it refers to the process of choosing relevant and essential features (features are also known as attributes, properties, characteristics, or dimensions). It discards irrelevant, redundant, and noisy features that would otherwise reduce classification accuracy and thus degrade both the algorithm's performance and the quality of its output. Consequently, a faulty feature selection result complicates the learning process and raises computational costs [4,5,6].

Feature selection can encounter many problems. The most significant and most commonly encountered is the curse of dimensionality, which arises when the number of features exceeds the number of samples, lowering accuracy and slowing learning. To address this problem, datasets must be reduced to a smaller set of attributes and samples that remain faithful to the original matrix while eliminating noise and redundancy, thereby enhancing discriminative power and classification performance; this process is called dimensionality reduction [2, 7]. Feature selection techniques are applied in several domains such as image processing [8], signal processing [9], pattern recognition [10], text clustering [11,12,13], and machine learning [14, 15].

Feature selection approaches are usually categorized into three broad groups, the wrapper, filter, and embedded models, depending on how the selection algorithm and the model-building process are combined. The wrapper model employs a learning algorithm to find and evaluate the optimal subset, and therefore yields predictors with better performance. Filter approaches are rankers: they rank attributes by their relevance to the output variable using an evaluator that is independent of any learning algorithm. A filter method has adequate generalization capacity and low computational cost, and it can handle large-scale datasets. The embedded approach combines the benefits of the wrapper and filter methods while trying to eliminate their disadvantages. It starts, like the filter method, by independently seeking an optimal subset, and then employs a linear classifier such as a Support Vector Machine (SVM) to refine that subset by finding locally correlated features with better local discrimination, producing a final optimal subset at a lower computational cost than the wrapper model [16].
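For illustration, the sketch below (not from the cited works; it assumes scikit-learn and NumPy, and the function names are our own) contrasts a filter-style ranking, which scores features independently of any classifier, with a wrapper-style evaluation, which scores a candidate subset by training a classifier on it.

```python
# Illustrative sketch: filter vs. wrapper evaluation of features.
import numpy as np
from sklearn.feature_selection import mutual_info_classif   # filter: model-independent relevance score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def filter_rank(X, y, top_k):
    """Filter model: rank features by relevance to the target, independent of any classifier."""
    scores = mutual_info_classif(X, y)
    return np.argsort(scores)[::-1][:top_k]                  # indices of the top_k features

def wrapper_score(X, y, subset):
    """Wrapper model: evaluate a candidate subset with the learning algorithm itself."""
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, subset], y, cv=5).mean()   # cross-validated accuracy
```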

Feature Selection (FS) problems are real-world problems that affect classification accuracy and learning speed. Metaheuristic algorithms are high-level procedures designed to find a good enough solution within a search space. Such algorithms balance two conflicting criteria when determining the best solution: exploration and exploitation of the search space. Exploration searches the whole space for good enough solutions, whereas exploitation refines those solutions toward the optimum. The native Sine Cosine Algorithm suffers from a weak exploration strategy, which degrades its performance while searching the space. However, it can be enhanced or modified through hybridization, producing a new version of the metaheuristic that improves performance by balancing exploration and exploitation of the search space. This is the motivation behind our attempt to create a predictive model based on a hybridization approach for solving feature selection problems by reducing the number of features and removing weakly relevant and irrelevant ones; in practice, an optimal subset is likely to contain only strongly relevant features.

The objectives of the proposed feature selection approach are to reduce dimensionality and eliminate noise from the data, which increases learning speed, simplifies the induced rules, eases visualization of the data, and improves predictive accuracy. This study therefore aims to achieve maximal classification accuracy with a minimal number of features. It assesses the ability of hybridizing metaheuristic algorithms to create a new feature selection approach that solves the feature selection problem by improving search performance. To evaluate the new feature selection method, classification accuracy; best, worst, and mean fitness; Standard Deviation (Std); and the average number of selected features are used as evaluation criteria. The study is significant in that it addresses FS problems by building a new hybrid feature selection approach (SCAGA), which discards redundant, irrelevant, noisy, and weak features from the original dataset, thereby increasing learning speed, simplifying rules, easing dataset visualization, and improving predictive accuracy for the classification task.

The remainder of this paper is organized as follows: Sect. 2 surveys previous studies and related works. Section 3 introduces the procedures and methodology and discusses the proposed scheme. Section 4 presents the evaluation criteria. Section 5 reports and discusses the results, and Sect. 6 presents the conclusions and future work.

2 Related works

The Feature Selection (FS) technique is one technique employed to enhance prediction accuracy in search-space problems [17,18,19]. Search approaches may be summarized as exhaustive search, probabilistic search, heuristic search, and hybrid search algorithms [20]. Metaheuristic algorithms aim to reduce time consumption by searching only a particular path toward the optimal solution [2]. Metaheuristic search is typically applied to real-world problems across a wide range of computer science domains [21]. Heuristics are also well suited to other aspects of massive data, such as diversity and speed [22].

Different metaheuristic methods have been applied to feature selection problems [23]. The Genetic Algorithm (GA) is the most extensively investigated metaheuristic. Both population-based and single-solution-based metaheuristic algorithms have been proposed [24]. Single-solution algorithms such as hill climbing and simulated annealing have been used. Scatter, random, harmony, and hill-climbing searches share a main disadvantage: they are very sensitive to their initial solutions and often fall into local optima [25]. Feature subsets have also been selected using the spider monkey optimization approach: an initial population is generated for the dataset, fitness is assessed using an SVM classifier's accuracy, and a stopping criterion is tested to decide whether to continue or stop; the final subset of attributes with the highest classification accuracy is taken as the optimal result [26].

A hybrid binary technique combining coral reefs optimization and simulated annealing for attribute selection (BCROSAT) achieves maximal accuracy while selecting a minimal number of features on most of the datasets tested [27]. Instance selection is a method that reduces the size of the original training data; combining instance selection with feature extraction greatly reduces the computation time needed to train the classifier [28]. A novel chaotic salp swarm technique serves as both a global optimizer and an FS algorithm, proving efficient for both FS and global optimization problems [29]. FS is the procedure of statistically identifying the most relevant features to improve the predictive capability of classifiers [30]. One method for FS to enhance document clustering uses particle swarm optimization; this approach has also been applied to improve the current implementation of Bayesian calibration for building energy simulation [31].

The Water Wave Optimization (WWO) feature selection technique builds a text FS method based on WWO (WWOTFS) [32]. A hybrid approach based on binary chemical reaction optimization and a Tabu search optimization algorithm has also been developed for FS: once the four essential reactions are performed in each iteration, the best solution is checked, and Tabu search is then applied to explore its neighbors as a local search step. An enhanced FS method based on the Ant Colony Optimization (FACO) algorithm with an SVM classifier has been used to solve FS problems [33]. With the growing volume of data in networks and the increasing number of features, network security is threatened by additional attacks such as APT and DDoS. To detect anomalies in networks quickly, classification techniques are widely used in anomaly detection. However, datasets contain many irrelevant and redundant features that prevent classification algorithms from producing efficient anomaly detection classifiers. To improve classifier performance, the ant colony optimization method searches for the optimal feature subset; it selects relevant features independently of the classifier, which efficiently reduces the complexity of the classification algorithm and improves classification accuracy.

A new technique for feature subset selection in machine learning, FSS-MGSA (Feature Subset Selection by Modified Gravitational Search Algorithm), has been presented. FSS-MGSA is a search algorithm based on the law of gravity and mass interactions, and it can be applied when domain knowledge is not available [34]. The binary bare-bones particle swarm optimization (BPSO) method was motivated by the desire to design a global search technique with few parameters that performs well on feature selection problems and is easy to implement; FS methods have likewise been used to find feature subsets with maximal classification ability [35]. In 2014, Moradi et al. presented a two-phase hybrid approach for feature selection: in the first phase, the original feature set is reduced using a filter model, and in the second phase, a wrapper model selects the best feature subset from the reduced set [36].

In the FS technique combining binary PSO and GA for determining coronary artery disease with a support vector machine (BPSO-FST), each particle consists of 23 binary cells, one per feature in the dataset. A cell's value indicates whether the corresponding feature is selected: 1 means the feature is selected, and 0 means it is not [37]. Feature selection identifies strongly predictive fields within a database and reduces the number of fields presented to the computational process. It affects several aspects of pattern classification, including the accuracy of the learning procedure of classifiers such as the support vector machine [38]. An FS technique based on an improved binary-coded Ant Colony Optimization algorithm (MPACO) was established to increase classification accuracy while reducing redundant features [39]. A novel FS technique using PSO has been applied to cancer microarray datasets; it classifies these high-dimensional datasets after solving the feature selection problem, with the Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Probabilistic Neural Network (PNN) used as classifiers and evaluators [40].

Experimental results showed that a heuristic FS method for text categorization using Chaos Optimization and a Genetic technique (CGFSO) found compact feature subsets that yielded the maximum classification accuracy, and its performance was faster than that of other traditional approaches [41]. Genetic algorithms and particle swarm optimization can be combined in various ways. In the hybrid GA-PSO (HGAPSO) FS method, hybridization is achieved by integrating the standard PSO velocity and position updates with selection, mutation, and crossover from the genetic algorithm [42]. FS based on the antlion optimization algorithm follows the wrapper model, whose main characteristic is using the classifier to guide the feature selection process. Wrapper-based feature selection can be characterized by three chief items: the classification method, the feature-evaluation criteria, and the search method.

One of the first hybrid approaches was proposed by Oh and Lee, who embedded local search methods inside a GA to improve the search by concentrating on the most promising regions discovered by the genetic procedure [43]. Recently, hybrid metaheuristic techniques have shown high performance in solving hard combinatorial optimization problems. For example, combinations such as the hybrid of PSO and GA [44] and of ant colony optimization with a genetic algorithm have been suggested. Simulated Annealing (SA) combined with a genetic algorithm is another example of a local search embedded inside an operative algorithm to balance exploration and exploitation [45]. A new wrapper-based method hybridizing simulated annealing with a crossover operator was also proposed [46]. Moreover, Talbi, Jourdan, Garcia-Nieto, and Alba proposed hybridizing GA with the PSO technique, using SVM as the classifier. Table 1 shows an overview of different FS approaches with their details.

Table 1 Overview of the classification and optimization methods used for feature selection

3 Procedures and methodology

This section provides a detailed description of the methodology used to meet the research objectives. In general, wrapper approaches use a learning method as a classifier to assess the usefulness of a feature subset and thereby obtain the best predictive performance. Wrapper-based approaches achieve better quality measurements than filter-based methods, yielding feature subsets tuned to the learning algorithm that will use them. Choosing the appropriate method requires technical details and a good understanding of the existing algorithms, which is impractical in most situations; within this field, no metaheuristic-based method is capable of solving all FS problems [20]. For newly encountered, unknown datasets, selecting a suitable method is even more intricate. Nevertheless, current algorithms can be enhanced to improve performance by better balancing the search. This incentive motivates us to build a predictive wrapper model based on a hybridization technique for solving the feature selection problem.

3.1 Binary version of Sine Cosine Algorithm (SCA)

Recently, metaheuristic algorithms have proven to deliver high performance in solving real-world problems. Feature selection problems are binary search problems, so the proposed method adapts the continuous Sine Cosine Algorithm (SCA) into a binary version that handles feature selection in the binary domain. The SCA begins with random positions and uses 5 search agents (Xi). Positions are updated according to Eq. (3.1) [47].

$$ x_{ij}^{t + 1} = \begin{cases} x_{ij}^{t} + r_{1}\,\sin (r_{2})\,\left| r_{3}\,Xb_{j}^{t} - x_{ij}^{t} \right| & \text{if}\;R_{1} < 0.5 \\ x_{ij}^{t} + r_{1}\,\cos (r_{2})\,\left| r_{3}\,Xb_{j}^{t} - x_{ij}^{t} \right| & \text{if}\;R_{1} \ge 0.5, \end{cases} $$
(3.1)
$$ r_{1} = a - t\,\frac{a}{T_{\max }} $$
(3.2)

The fitness of an SCA solution improves as the classification performance on the validation data increases and, simultaneously, as the number of selected features decreases; both objectives are combined in a single function.

$$ f_{\theta } = \omega \cdot E + (1 - \omega )\frac{\sum\nolimits_{i} \theta_{i} }{n}, $$
(3.3)

where θ is a vector of size n whose 0/1 elements denote unselected/selected features, n is the number of features in the dataset, E is the classifier error rate, and ω is a constant (set to 0.05) that controls the trade-off between classification performance and the number of selected features. The number of decision variables equals the number of features in the given dataset. Each variable is limited to the range [0, 1], where a value approaching 1 indicates that the corresponding feature is a candidate for selection in the classification [28, 29, 44]. When computing an individual's fitness, a threshold decides which features are actually used, as in Eq. (3.4):

$$ f_{ij} = \begin{cases} 1 & \text{if}\; X_{ij} > 0.5 \\ 0 & \text{otherwise}, \end{cases} $$
(3.4)

where Xij is the value of search agent i at dimension j. While updating a search agent's position, the new value in some dimensions may violate the [0, 1] constraint; hence a simple truncation rule is used to keep each variable within its limits. Each candidate solution is represented as a one-dimensional binary vector, obtained by mapping the continuous values into the [0, 1] interval and thresholding them at 0.5, with upper bound ub = 1 and lower bound lb = 0.
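The following sketch outlines how Eqs. (3.1), (3.2), and (3.4) could be realized; it is a simplified NumPy illustration (assuming a = 2 and r3 drawn from [0, 2], as in the original SCA, with Xb denoted X_best), not the authors' code.

```python
import numpy as np

def sca_update(X, X_best, t, T_max, a=2.0):
    """One SCA position update (Eqs. 3.1-3.2), followed by truncation to [0, 1]."""
    n_agents, n_feats = X.shape
    r1 = a - t * a / T_max                                 # Eq. (3.2): decreases linearly over iterations
    r2 = np.random.uniform(0, 2 * np.pi, (n_agents, n_feats))
    r3 = np.random.uniform(0, 2, (n_agents, n_feats))
    R1 = np.random.rand(n_agents, n_feats)                 # random switch between sine and cosine branches
    step = np.abs(r3 * X_best - X)
    X_new = np.where(R1 < 0.5,
                     X + r1 * np.sin(r2) * step,
                     X + r1 * np.cos(r2) * step)
    return np.clip(X_new, 0.0, 1.0)                        # simple truncation to the [0, 1] limits

def binarize(X, threshold=0.5):
    """Eq. (3.4): a feature is selected when its continuous value exceeds 0.5."""
    return (X > threshold).astype(int)
```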

3.2 Initial population

The Sine Cosine Algorithm (SCA) begins with random positions so that it can converge toward the global optimum. It then calculates the fitness value for every individual and passes the best location found to the FS process as the candidate feature set. Every solution is represented as a one-dimensional binary vector whose length equals the number of features in the original dataset. Each cell of the vector is labeled 0 or 1: a value of 1 indicates that the feature is selected, and 0 indicates that it is ignored.
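A minimal initialization sketch (NumPy-based, illustrative only): continuous positions are drawn uniformly from [0, 1] and thresholded at 0.5 to obtain the initial binary selection vectors.

```python
import numpy as np

def init_population(n_agents, n_features, rng=np.random.default_rng(42)):
    """Random initial positions in [0, 1]; thresholding gives the binary selection vectors."""
    positions = rng.random((n_agents, n_features))    # one row per search agent
    binary = (positions > 0.5).astype(int)            # 1 = feature selected, 0 = feature ignored
    return positions, binary
```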

3.3 Classifier

The K-Nearest Neighbor (KNN) classifier predicts the class of a sample from the classes of its nearest neighbors, with its parameters typically tuned by a trial-and-error process. KNN is used as part of the fitness function in all the experiments because of its excellent classification performance.
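A sketch of how the KNN classifier could supply the error rate E used later in the fitness function (assuming scikit-learn; the hold-out split and the function name are our own choices):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def knn_error_rate(X, y, subset, k=5, test_size=0.2, seed=0):
    """Error rate E of a KNN classifier (K = 5) trained on the selected feature subset only."""
    Xs = X[:, subset]
    X_tr, X_te, y_tr, y_te = train_test_split(Xs, y, test_size=test_size, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    return 1.0 - clf.score(X_te, y_te)
```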

3.4 Fitness function

In this study, the fitness function is applied to assess each feature subset in the search space of the Sine Cosine Algorithm, using K-Nearest Neighbor (KNN) as the classifier with K = 5. The proposed fitness function is calculated using Eq. (3.5):

$$ f_{\theta } = \omega \cdot E + (1 - \omega )\frac{\sum\nolimits_{i} \theta_{i} }{n}, $$
(3.5)

where θ is a vector of size n whose 0/1 elements denote unselected/selected features, n is the number of features in the dataset, E is the classifier error rate, and ω is a constant that controls the trade-off between classification performance and the number of selected features.
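A direct transcription of Eq. (3.5) as a small helper (illustrative; ω = 0.05 as stated in Sect. 3.1):

```python
import numpy as np

def fitness(theta, error_rate, omega=0.05):
    """Eq. (3.5): f = omega * E + (1 - omega) * (#selected / n).
    theta is the 0/1 feature vector; omega is the trade-off constant reported in the paper."""
    theta = np.asarray(theta)
    return omega * error_rate + (1.0 - omega) * theta.sum() / theta.size
```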

3.5 The crossover operator

Crossover is the leading exploration operator in the genetic algorithm. It searches the region of possible solutions around the present solutions [34]. Binary crossover operators exchange bits between two selected parents to reproduce two new individuals; both differ from their parents yet retain some of their parents' features. The type of crossover operator depends on the encoding method, so several variants exist, such as one-point, two-point, and uniform crossover. Uniform crossover is more exploratory and better suited to small populations, while two-point crossover suits large populations. In general, recombining parts of good individuals gives a better chance of producing better individuals.

In this study, the crossover operator is designed using three formulas, single-point, two-point, and uniform crossover, which enhance the worst solution selected by the Sine Cosine Algorithm by recombining it with the best solution obtained in the previous iteration. The appropriate formula is chosen in each iteration by a roulette-wheel selection function so as to exploit the strengths of each crossover type. The crossover is utilized as an internal operator within the Sine Cosine Algorithm, as shown in Fig. 1; Algorithm 3.1 shows the crossover operator's main steps, and a code-level sketch is given below.

Fig. 1 The proposed SCAGA method

Algorithm 3.1 The main steps of the crossover operator (pseudocode)
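The sketch below illustrates one plausible realization of the three crossover formulas and the roulette-wheel choice among them (NumPy-based; the equal operator weights are an assumption, since the paper does not give them):

```python
import numpy as np

def single_point(p1, p2, rng):
    c = rng.integers(1, p1.size)                      # cut point
    return np.concatenate([p1[:c], p2[c:]])

def two_point(p1, p2, rng):
    c1, c2 = np.sort(rng.choice(np.arange(1, p1.size), size=2, replace=False))
    child = p1.copy()
    child[c1:c2] = p2[c1:c2]                          # swap the middle segment
    return child

def uniform(p1, p2, rng):
    mask = rng.random(p1.size) < 0.5                  # pick each bit from either parent
    return np.where(mask, p1, p2)

def crossover(worst, best, weights=(1.0, 1.0, 1.0), rng=np.random.default_rng()):
    """Recombine the worst SCA solution with the best one from the previous iteration.
    The crossover formula is chosen by roulette-wheel selection over the three operators."""
    ops = (single_point, two_point, uniform)
    p = np.asarray(weights, dtype=float)
    idx = rng.choice(len(ops), p=p / p.sum())
    return ops[idx](np.asarray(worst), np.asarray(best), rng)
```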

3.6 The mutation operator

The mutation operator is used to produce new individuals with features not present in their predecessors. Mutation can be applied to integer, binary, or real representations and comes in several types. In general, mutations are generated by randomly selecting one or more bits and flipping their values with a certain probability (pm = 0.02). Here, the mutation operator acts as an internal function within the Sine Cosine Algorithm (SCA): it is applied after the crossover operator to generate a new solution and improve the algorithm's exploration ability. Equation (3.6) expresses the mutation operator:

$$ X_{i}^{t + 1} = Mutation(X_{i}^{t} ). $$
(3.6)
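A bit-flip sketch consistent with the description above (pm = 0.02 as stated; illustrative only):

```python
import numpy as np

def mutate(x, pm=0.02, rng=np.random.default_rng()):
    """Bit-flip mutation (Eq. 3.6): each bit is flipped independently with probability pm."""
    x = np.asarray(x)
    flip = rng.random(x.size) < pm
    return np.where(flip, 1 - x, x)
```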

Metaheuristic algorithms balance two contradictory criteria: exploration of the search space and exploitation to refine the best solutions found. The native Sine Cosine Algorithm exhibits a weak exploration strategy, which degrades its performance while searching the space. The main steps of the proposed framework are shown in Fig. 1. To improve the exploration strategy of the Sine Cosine Algorithm, and thereby its performance when solving feature selection problems, we use the Genetic Algorithm as an internal function within the SCA; the resulting hybrid feature selection method is named SCAGA.


The wrapper model uses the classifier as the evaluation criterion in the FS method, guided by an optimization technique. Here, the SCA is used as the FS optimizer to balance classification accuracy (to be maximized) against the number of selected features (to be minimized) across all solutions. At the start, the population of solutions evolves under the SCA, whose positions are updated using the sine and cosine functions according to Eq. (3.1).

The proposed binary Sine Cosine Algorithm then generates feature subsets, which are evaluated with the fitness function, and the iteration counter is updated. The parameter R1 switches between the sine and cosine functions at random, as applied in Eq. (3.1). The genetic algorithm component then improves the current solution: the crossover operator (embedded within the SCA) generates new offspring from the best and candidate solutions, which are further refined by the mutation operator. The mutation operator acts as an internal function within the SCA to prevent the algorithm from falling into local optima and to reach an optimal or near-optimal solution.

The mutation operator is controlled by the mutation probability. Choosing the optimal mutation rate is a common problem in this area: it has to be set low, and pm = 0.02 proved to be the best value during parameter tuning. If the rate is set too high, the search degenerates into a random search and fails to converge to an optimal solution. All features lie within the range [0, 1], so positions are corrected using ub = 1 and lb = 0. The fitness of each new solution generated by the genetic operators is then computed. After the SCAGA search terminates, we take the best solution, apply the evaluation measurements, and report the results. An example of the solution representation is shown in Fig. 2, where ten features are given: a 1 denotes a selected feature, and a 0 denotes an unselected feature. A code-level outline of the whole loop is given after Fig. 2.

Fig. 2 The solution representation of the feature selection problem
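To make the overall flow concrete, the following outline ties the earlier sketches together (init_population, sca_update, binarize, knn_error_rate, fitness, crossover, mutate). It is a simplified reconstruction under our own design assumptions, not the authors' implementation.

```python
import numpy as np

def scaga(X, y, n_agents=5, max_iter=80, rng=np.random.default_rng(0)):
    """Outline of the hybrid SCAGA loop: SCA position updates plus GA crossover/mutation
    applied to the worst agent, with a KNN-based fitness (smaller is better)."""
    n_feats = X.shape[1]

    def evaluate(pos_row):
        theta = binarize(pos_row)
        if theta.sum() == 0:                      # guard: empty subsets get the worst possible fitness
            return 1.0
        return fitness(theta, knn_error_rate(X, y, np.flatnonzero(theta)))

    positions, _ = init_population(n_agents, n_feats, rng)
    fit = np.array([evaluate(p) for p in positions])
    best_pos, best_fit = positions[fit.argmin()].copy(), fit.min()

    for t in range(1, max_iter + 1):
        positions = sca_update(positions, best_pos, t, max_iter)    # SCA step, truncated to [0, 1]
        fit = np.array([evaluate(p) for p in positions])

        worst = fit.argmax()                                        # GA refinement of the worst agent
        child = mutate(crossover(binarize(positions[worst]), binarize(best_pos), rng=rng), rng=rng)
        child_fit = evaluate(child.astype(float))
        if child_fit < fit[worst]:                                  # accept offspring only if it improves
            positions[worst], fit[worst] = child.astype(float), child_fit

        if fit.min() < best_fit:                                    # track the global best solution
            best_pos, best_fit = positions[fit.argmin()].copy(), fit.min()

    return binarize(best_pos), best_fit
```

Under this design, the GA operators intervene only on the worst agent in each iteration, so the extra cost over the native SCA is one crossover, one mutation, and one additional fitness evaluation per iteration.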

4 Evaluation criteria

The proposed hybrid feature selection method SCAGA is run 20 times with 80 iterations per run; 20 runs proved sufficient for the results to stabilize and allow testing of both the stability and the statistical significance of the FS approach. The informative feature subsets are assessed with the evaluation criteria (classification accuracy; mean, best, and worst fitness; Standard Deviation (Std); and average selected size) so that the best feature subsets are obtained. The proposed FS method achieved maximal classification accuracy with a minimal number of features. The evaluation criteria are explained as follows:

  • Classification accuracy is used to evaluate the performance of the feature selection method on the dataset given to the classifier. It can be calculated by Eq. (4.1) [34]:

    $$ Test = \frac{1}{N}\mathop \sum \limits_{j = 1}^{N} \sqrt {\mathop \sum \limits_{i = 1}^{K} \left( {A_{i} - E_{i} } \right)^{2} } $$
    (4.1)

    where K is the number of test sample points and Ai, Ei are actual and expected class labels for data point i.

  • Best fitness represents the smallest fitness value obtained over the M independent runs of an optimization algorithm, as in Eq. (4.2) [48]:

    $$ Best = \min_{i = 1}^{M} g_{*}^{i} . $$
    (4.2)
  • Worst fitness represents the maximum among the best solutions found over the M runs of each optimization algorithm, as in Eq. (4.3) [48]:

    $$ Worst = \max_{i = 1}^{M} g_{*}^{i} . $$
    (4.3)
  • Mean fitness represents the average of the best solutions acquired over the M runs of an optimization algorithm, as in Eq. (4.4) [48]:

    $$ Mean \, = \frac{1}{M} \mathop \sum \limits_{i = 1}^{M} g_{*}^{i} . $$
    (4.4)
  • Standard Deviation (Std) represents the variation of the best solutions found over the M diverse runs of each optimization algorithm, as in Eq. (4.5) [48]:

    $$ Std = \sqrt {\frac{1}{M - 1}\sum_{i = 1}^{M} \left( g_{*}^{i} - Mean \right)^{2} } $$
    (4.5)
  • Average selection size (the average number of features selected) represents the ratio of the number of selected features to the total number of features, and is formulated as Eq. (4.6) [20]:

    $$ {\text{Average selection size }} = \frac{1}{M}\mathop \sum \limits_{i = 1}^{M} \frac{{size(g_{*}^{i} )}}{D} , $$
    (4.6)

    where size(x) is the number of selected features in the vector x, D is the number of features in the original dataset, and \(g_{*}^{i}\) is the optimal solution obtained in run number i.
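As a worked illustration of Eqs. (4.2)–(4.6), the helper below aggregates the best fitness value and the selected-subset size recorded from each of the M runs (NumPy-based; the dictionary layout is our own choice):

```python
import numpy as np

def summarize_runs(fitness_per_run, selected_sizes, n_features):
    """Evaluation criteria over M independent runs (Eqs. 4.2-4.6)."""
    g = np.asarray(fitness_per_run)                    # best fitness g_*^i found in each run i
    return {
        "best":  g.min(),                              # Eq. (4.2)
        "worst": g.max(),                              # Eq. (4.3)
        "mean":  g.mean(),                             # Eq. (4.4)
        "std":   g.std(ddof=1),                        # Eq. (4.5), with the 1/(M-1) correction
        "avg_selection_size": np.mean(np.asarray(selected_sizes) / n_features),  # Eq. (4.6)
    }
```

For instance, calling this helper with the 20 per-run best fitness values and the corresponding subset sizes yields the criteria of the kind reported in Tables 5–9.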

5 The results and discussion

In this section, the proposed method's performance is evaluated and compared with similar methods on several feature selection datasets. All experiments are conducted under the same conditions: the maximum number of iterations is 80, and the number of search agents is 5. All results are averages over 20 runs, computed in Matlab on an Intel Core i5 computer with a 2.50 GHz CPU and 4.00 GB of RAM running a 64-bit operating system.

5.1 Datasets and parameters

Sixteen datasets, including two high-dimensional ones, were collected from the University of California Irvine (UCI) Machine Learning Repository, available at https://archives.ics.uci.edu/ml/datasets.html [49]. The datasets are described in Table 2. The proposed hybrid method is a wrapper-based procedure: every solution in the population is represented as a binary index vector over the features in the dataset, and only the optimal solution and its fitness, attaining maximal classification accuracy with minimal features, are retained. The parameter settings are summarized in Table 3.

Table 2 Datasets description
Table 3 Experimental parameter settings of the proposed method

When the relevant features are known in advance, the selected features can be validated against this prior knowledge. In most real-world problems, however, the relevant features are unknown beforehand, so classification performance on test datasets must be used to indicate quality and to provide an unbiased evaluation of the final method. Since FS is naturally multiobjective, results are compared both in terms of the number of selected features and the classification accuracy obtained.

To compare the diverse FS methods with our proposed hybrid method, the following indicators are used. First, classification accuracy describes how accurate the classifier is given the selected feature set; it is formulated in Eq. (4.1). Second, best fitness represents the most optimistic solution gained, formulated in Eq. (4.2). Third, worst fitness represents the worst among the solutions obtained from the optimization runs, formulated in Eq. (4.3). Fourth, mean fitness indicates the average of the solutions obtained from running an optimizer over 20 diverse runs, formulated in Eq. (4.4). Fifth, Standard Deviation (Std) refers to the variation of the optimum solutions acquired from running the stochastic optimizer over 20 diverse runs, formulated in Eq. (4.5). Sixth and finally, average selection size represents the average ratio of the number of selected features to the total number of features, defined by Eq. (4.6).

5.2 Experimental results and discussions

In the proposed SCAGA method, a Genetic Algorithm is embedded inside the Sine Cosine Algorithm as an internal function to improve the SCA's exploration ability. The performance of SCAGA was compared with the native SCA and with approaches published in the literature, namely PSO and ALO, on two critical goals: classification accuracy and average selection size. SCAGA was also compared with other FS methods using the following evaluation criteria: worst fitness, mean fitness, best fitness, and Std. All evaluation results are averages over 20 runs of the framework.

As shown in Table 4, the proposed hybrid method (SCAGA) is considerably better than the SCA, ALO, and PSO methods on both goals: the number of selected features and classification accuracy. The comparison shows that SCAGA outperforms SCA, PSO, and ALO on all datasets in terms of classification accuracy. Notably, the optimal solution is obtained on the M-of-n dataset, where the classification accuracy reaches one, as clearly stated in Table 4. This means the proposed method can be considered a suitable FS method for both small-scale and high-dimensional datasets.

Table 4 Comparison between the SCAGA method and other optimization methods in terms of classification accuracy

As Table 4 shows, the proposed SCAGA method achieved maximal classification accuracy on all datasets used, so SCAGA outperforms SCA, PSO, and ALO. Furthermore, the average number of selected features in Table 5 indicates that SCAGA also outperforms the other methods across all datasets.

Table 5 The percentage of the selected features for the comparative methods

As shown in Table 6, the proposed SCAGA obtained the best results for the best-fitness criterion compared with the other methods.

Table 6 Results of best fitness

Table 7 shows that the proposed method (SCAGA) never yields the worst fitness value when compared with the other methods; the bold font marks the worst value in Table 7. The best results are also obtained for mean fitness, as shown in Table 8.

Table 7 Results of worst fitness
Table 8 Results of mean fitness

Table 9 presents the results of the Std evaluation, which reflect the variation of the optimal solutions acquired from running the stochastic optimization over 20 diverse runs; the bold font marks the best value in Table 9. The SCAGA method outperforms the native SCA and related methods from the literature on ten datasets. The table compares the results of the proposed hybrid feature selection method (SCAGA), the native Sine Cosine Algorithm (SCA), and other feature selection approaches drawn from the literature, namely Antlion Optimization (ALO) and Particle Swarm Optimization (PSO). Our results show that SCAGA performs significantly better than SCA, ALO, and PSO, which are commonly used in wrapper-based feature selection. Figure 3 shows that the proposed SCAGA method escapes local optima better than the native Sine Cosine Algorithm.

Table 9 The most remarkable solutions in terms of standard deviation for all optimizers
Fig. 3 Performance of the proposed method in preventing the local optima problem

Figure 4a and b summarizes the empirical results obtained from the proposed methods (SCA and SCAGA). The SCAGA method performs well as a multiobjective optimization method: it achieves two conflicting goals, maximum classification accuracy with the least number of selected attributes, on all datasets. All evaluation results fall within [0, 1], with classification accuracy at its maximum values and average selection size at its minimum values.

Fig. 4 Comparison between the different proposed methods in terms of a accuracy of classification and b average selected size

A high-dimensional dataset is one with a large number of features, which leads to the curse of dimensionality; in such datasets the number of features can exceed the number of observations, making the computations very difficult. FS has therefore become a critical stage in analyzing high-dimensional datasets.

The SCAGA method is also designed for high-dimensional data and has been shown to be effective at discarding redundant and irrelevant features (see Fig. 5). To evaluate the proposed FS method on high-dimensional datasets, we use two massive datasets, Krvskp.EW (3196 objects with 36 attributes) and Waveform.EW (5000 objects with 40 attributes). As shown in Fig. 5, the proposed SCAGA method obtained better results than the other methods (SCA, PSO, and ALO), just as it did on the low-dimensional datasets. It achieves the two conflicting targets, maximum classification accuracy with the least number of features, on all datasets, which means the proposed method can be considered a suitable FS method for high-dimensional datasets.

Fig. 5 Comparison between the different proposed FS methods on high-dimensional datasets: a classification accuracy and b average selected size

6 Conclusion and future work

In this paper, an enhanced version of the Sine Cosine Algorithm (SCA) with a wrapper model, called SCAGA, is proposed to solve feature selection problems. SCAGA uses the genetic algorithm's crossover operator to generate new solutions and then applies the genetic algorithm's mutation operator to enhance them, improving the exploration ability of the SCA. This allows a wider search that prevents falling into local optima and finds better solutions. The reported results show that the proposed hybrid SCAGA significantly improves the performance of the native SCA on the FS problem and demonstrate the robustness of the proposed method for real-world problems with unknown and challenging search spaces. The proposed SCAGA obtained better results than the other methods (SCA, PSO, and ALO) on all datasets, achieving the two contradictory goals, maximal classification accuracy with a minimal number of features, on both small and high-dimensional datasets. For future work, the proposed approach can be applied to other datasets to generalize it to different domains, such as fault diagnosis on wind turbine test-rig datasets, and other new optimizers, such as the Arithmetic Optimization Algorithm (AOA), can be applied to the feature selection problem.