1 Introduction

Datasets are almost constantly generated in different areas, such as industry, social networking, and business. They can be represented in the form of networks to gain insights from many perspectives, including patterns of information diffusion. Nowadays, a lot of attention is being paid to the study of network analysis technologies because of their considerable impact on many fields, such as classification [1], clustering [2], and link prediction [3]. For example, sociologists classify individuals into various social media groups [4], and biologists predict protein–protein links in interaction networks and discover missing relations between proteins [5]. However, real-world networks have grown to such a large-scale, so researchers are trying to embed them into a low-dimensional space while preserving the useful features [6]. Shi et al. [7, 8] proposed two network embedding (NE) procedures for node classification and a multi-label network embedding (MLNE) to learn feature representations where each node contains some feature values and a set of class labels. These large datasets indeed contain irrelevant, redundant, or erroneous features. Computational burden, increasing storage requirements, overfitting, decreasing prediction performance, and poor output comprehensibility are some of the disadvantages engendered by applying large-scale datasets with misleading features. Dimensionality reduction is a crucial preprocessing step in the analysis of high-dimensional datasets, generally including two strategies: feature extraction and feature selection [8]. Feature extraction tries to define new features by transforming or combining the original features. Two conventional feature extraction techniques are principal component analysis (PCA) and independent component analysis (ICA) [9]. Feature selection tries to eliminate redundant or unnecessary features and selects only those features that are sufficient for solving a problem. Hence, it makes the model simpler and comprehensible by removing insignificant features.

However, finding the best-performing feature subset is a complicated process, especially for a high-dimensional problem due to the large search space. In the last few decades, many traditional feature selection algorithms like sequential forward selection (SFS) [10] and sequential backward selection (SBS) [11] have been utilized to provide promising results in dealing with feature selection problems. Nevertheless, they suffer several issues, such as computational complexity [12] and nesting effects [13]. We can consider the feature selection challenge as a multi-objective optimization (MOO) problem. There is a trade-off between some objectives (e.g., the minimum number of features and minimum error value); thus, optimization techniques can be used to strike the right balance between objectives. In some datasets, the error rate of the classifier can be reduced by removing irrelevant and redundant features.

Nonetheless, if a dataset only consists of key features required by the classifier, removing any feature may reduce the accuracy of this classifier and so the two objectives conflict with each other. Notably, investigating the importance of different features requires various levels of costs like time, storage, or other resources. That is, too many features entail higher costs.

Today, the metaheuristic method is becoming increasingly popular in feature selection and machine intelligence [14, 15]. Several metaheuristic based feature selection strategies have been developed over the years. The basic process of such methods is to optimize an objective function that is defined for feature selection to find the optimal feature subset. The metaheuristic-based feature selection algorithms randomly generate new solutions and compute their fitness values. In subsequent iterations, new solutions are created by the best agents of the current iteration. Hence, these methods prevent the creation of a solution similar to the previous one, thereby reducing the computational time in determining the best subset of features. A majority of feature selection methods have been developed to handle classification problems, and there has been less focus on regression problems [16]. Regression involves predicting a real value of output parameter (i.e., continuous target variable) for the given values of other parameters, which is based on previous input–output observations. The focus of this study is to adopt Biology Migration Algorithm (BMA) in a feature selection method to solve regression problems. The experimental results in [17] with several test functions and four real-life engineering problems proved that BMA is more effective compared with other related optimization techniques. These include Whale Optimization Algorithm (WOA) [18], Bat Algorithm (BA) [19], Animal Migration Optimization (AMO) [20], Cuckoo Search (CS) [21], Gravitational Search Algorithm (GSA) [22], Biogeography-Based Optimization (BBO) [23], and Particle Swarm Optimization (PSO) [24].

The proposed strategy applies BMA for solving the dimensionality reduction problem to minimize the number of features and errors. It also introduces a new approach for feature clustering based on the correlation between features and game theory. Game theory is a branch of modern mathematics that has recently received much attention from researchers in the study of cross-discipline [25]. It quantifies interactions among incentive structures or various players for strategic interdependence. When there is more than one agent who independently performs their own behavior in the process, game theory can present a suitable framework for the analysis of interactions between these decisions [26]. In this paper, we apply game theory to deal with the conflict and competition among multiple objectives in the feature selection problem.

The remainder of the paper is structured as follows. Section 2 briefly reviews some related works. Section 3 lists the main contributions of the paper. Section 4 illustrates basic concepts about multi-objective optimization and the Nash heuristic method. Section 5 presents the proposed dimensionality reduction strategy. Section 6 focuses on reducing the computational complexity of the proposed algorithm. Section 7 reports the experimental results based on UCI datasets and some discussions. Section 8 concludes the paper and presents future work.

2 Related works

Although much research has been conducted on dimensionality reduction, regression problems are rarely studied compared with classification problems. One of the reasons may be a simple formulation for feature selection criteria like the broad margin framework with considering class discriminability [27, 28]. Some feature selection algorithms for classification can be extended for regression problems, while others may not. Straightforward adaptation by discretizing (or binning) the output variable into several classes is not always a suitable solution since this can lead to the loss of important information [16]. Moreover, most feature clustering techniques compute the conditional probability of a feature belonging to a particular class. Thus such probabilities cannot be defined for the regression problem.

Consequently, these algorithms cannot be used directly for reducing dimensions of regression problems. On the other hand, most existing methods of regression are based on the computation of the covariance matrix, eigenvalues, and eigenvectors, rendering them inefficient for large problems. For example, the mRmR [29], relief [30], and LDA [31] methods can be used for classification and with some changes for regression problems. These methods are explained in detail as follows.

  • Linear discriminant analysis (LDA): It is originally introduced for supervised learning, especially classification problems, applicable for finding the optimal linear discriminant functions (OLDFs) [31]. LDA attempts to provide linear combinations of features in such a way that the ratio of between-class scatter and within-class scatter be maximized. Unlike classification problems that have discrete output or classes, it is challenging to consider between- and within-class scatter matrices in regression applications because the continuous target variable is defined. One solution for presenting a regressional version of LDA is that we segment the given dataset into several intervals (i.e., virtual classes) according to the output values with the fixed boundaries. The original LDA for classification problems can now be used, yet the results will depend on the number of intervals and the boundary approach. Moreover, this modified LDA strategy does not consider the degrees of similarity among various classes. Kwak and Lee [32] proposed a regressional version of LDA and extended ICA-FX for regression problems [33].

  • Relief: The original version of the Relief method [30] is used only for binary problems. It ranks features according to the determination of feature value differences between the nearest neighbor example pairs. Then, Sikonja and Kononenko [34] proposed a new modified version (ReliefF) for regression problems. Nevertheless, the nearest cannot be used for regression problems because the predicted variable (i.e., class) is continuous. Therefore, Relief-F [35] for regression is introduced. It defines a probability based on the relative distance between the predicted and actual values to determine whether or not the predicted values of two objects are different.

  • Minimum redundancy, maximum relevance (mRmR): The mRmR method [29] tries to choose some features that show the highest correlation with a class, and there is the least correlation between themselves. Obviously, the relevance and redundancy measures for classification and regression must be different. For the calculation of relevance, "F-statistic" and "mutual information (MI)" are suitable for continuous features and discrete features, respectively. To calculate redundancy, Pearson’s correlation coefficient and MI are applied for continuous features and discrete features, respectively [36].

Now, we present a brief literature review on the new feature selection and feature clustering approaches. Xu and Lee [16] developed a feature clustering method to reduce the dimensionality of regression problems. The proposed algorithm creates a group of clusters based on similarity tests such that similar items fall into the same cluster. Then, it constructs a new feature from each cluster using three weighting methods, namely hard, soft, and mixed, which are based on the weighted combination of training samples. One strength of this method is that the number of clusters is automatically obtained during the process without requiring to be adjusted by the user. By providing the experimental result, they demonstrated that the proposed algorithm could significantly reduce dimensionality while the main properties of the primary dataset are retained. The main disadvantage of this method is that it shows the high computational complexity for high-dimensional datasets.

Rao et al. [37] presented a feature selection algorithm by using the bee colony and gradient boosting decision tree. The proposed algorithm preprocesses and reduces the irrelevant features of the original dataset by the bee colony. Then, it tries to forcibly reduce the initial feature set and create a decision tree for determining the weight of features and supporting the accuracy of the model. The experiments indicated that the proposed algorithm could reduce dimensionality without sacrificing classification accuracy. However, the employment of a decision tree in the weighting process is time-consuming and computationally expensive.

Zhang et al. [38] modified the firefly optimization algorithm for feature selection. In population initialization, it uses a chaotic logistic map to increase swarm diversity. In the search process, it applies Simulated Annealing (SA) and optimal global signals. To improve convergence, the proposed algorithm diverts weak solutions to optimal regions by swarm leaders. The experiments based on several classification and regression problems indicated that the proposed firefly model outperforms other classical methods. However, they did not consider outlier detection.

Ghimatgar et al. [39] proposed a new feature selection strategy, i.e., Modified Graph Clustering based Ant Colony Optimization (MGCACO), according to the graph clustering and ant colony optimization. The MGCACO algorithm consists of three main steps. The first step involves investigating the relevance of features to classes to improving pheromone initialization and assigning a higher priority to the more relevant features. The second step includes finding the redundancy of feature subsets based on the multiple discriminant analysis (MDA). Finally, the third step comprises arranging features according to relevance and redundancy analysis by applying a cost function. The experimental results indicated the superior performance of MGCACO in terms of accuracy. However, the main weakness of MGCACO is that there is no specific way of determining the number of clusters.

Table 1 compares general characteristics, main idea, and limitations of the methods reviewed previously. Most dimensionality reduction techniques presented in the literature are feature selection methods, and a few feature extraction approaches are considered. Based on theoretical and empirical results [40], traditional algorithms suffer from three problems. Firstly, they have problems with overhead memory space for large datasets and discovering the relationship between features that significantly affect the efficiency and accuracy of the model. The second drawback of previous feature selection methods is that they are easily get trapped into local optima during search processes. Therefore, these behaviors cause substantial growth in feature selection delay in exploiting the correlation among lots of relevant features. Thirdly, traditional algorithms may not be efficient and robust for high-dimensional heterogeneous datasets with low quality and noisy points. Nevertheless, robustness is an essential characteristic of satisfactory feature selection methods (SFSMs) because the presence of noise in big data is inevitable, and the processing time may increase by high noise levels. Consequently, an effective feature selection algorithm is necessary to overcome several issues in existing algorithms, such as problems of algorithm implementation, convergence, and computational complexity for regression applications.

Table 1 Summary of dimension reduction methods

Although there has been a lot of research on feature selection based on metaheuristics as well as simultaneous feature selection and clustering, little attention has been paid to using game theory in feature selection. The main reason is that metaheuristic techniques have performed remarkably in solving complex problems, thanks to flexibility, faster convergence by focusing on intensification, and local optima avoidance by emphasizing diversification. Furthermore, game theory can help determine an efficient solution by analyzing conflict and cooperation among various objectives. To the best of our knowledge, most game-theoretic feature selection algorithms are proposed for use in classification problems. Consequently, it would be meaningful to investigate the issue of metaheuristic based game theoretic feature selection for identifying the best subset of features, substantially affecting target prediction. In this paper, we propose a new game theory-based dimensionality reduction algorithm to deal with conflict and competition among multiple objectives of data regression. Thanks to its versatility and robustness, game theory has been applied to solve some complex problems in different research areas like task scheduling and resource management [41]. Moreover, algorithms, according to game theory, show low computational complexity and high computational speed.

3 Contributions

Below is a list of some issues raised in Table 1 that require further attention:

  • Feature selection strategies presented in the literature have been focused on classification problems, and these algorithms cannot be used quickly and efficiently to reduce the dimensionality of regression problems. However, the proposed algorithm can be used for both regression and classification problems with a small modification.

  • All studies have referred to the multi-objective nature of the feature selection problem, though they have indeed dealt with it as single-objective. To address this issue, the proposed algorithm solves the feature selection problem to realize three objectives. These are minimizing the regression error rate, minimizing feature cardinality, and adjusting method precision. For this purpose, it takes advantage of BMA, which is a new metaheuristic method inspired by the biology migration phenomenon.

  • The sequence of features, weighting methods for feature extraction, and fitness values are used as heuristic information for the BMA method. Less important features are suppressed based on the information they contribute to decision-making using the BMA algorithm.

  • In the proposed algorithm, a coalitional game occurs between features to generate appropriate clusters. Put differently, features are considered as players, each of which wants to join a group of features and generate a cluster with high correlation among players (i.e., the correlation coefficient is considered as a coalitional function). The majority of previous strategies have user-defined parameters (e.g., number of clusters and threshold value). However, in the proposed method, clusters are generated automatically without determining the number of clusters by the user. Furthermore, three populations are defined as players to play a non-cooperative game, and each of which aims to improve its own fitness function (i.e., model error, number of features, and Mallows' Cp).

  • Most feature selection strategies are extremely time-consuming, rendering them inefficient for large problems. Therefore, the proposed method uses MapReduce and space reduction techniques to reduce time complexity. It is also used to perform preprocessing (i.e., handling missing values and detecting outliers) to improve data quality and increase overall productivity.

  • We have performed extensive experiments based on ten regression datasets. The proposed algorithm (i.e., GBMA) is compared with conventional state-of-the-art methods in two parts. Firstly, various swarm-based techniques, including MGWO, ABCoDT, and CPSOS, are evaluated. Secondly, the non-swarm intelligence-based techniques, including well-known feature extraction algorithms (e.g., LDAr, WPCA, and FC-C-S), are compared. Moreover, a statistical analysis validates the efficacy and superiority of the proposed method.

4 Preliminaries

In this section, we briefly review the basic information about BMA,‬ MOO, and Nash equilibria that are used in the proposed algorithm (i.e., GBMA).

4.1 Biology Migration Algorithm (BMA)

One of the major techniques for solving complex optimization problems is nature-inspired metaheuristic algorithms. Zhang et al. [17] proposed a metaheuristic algorithm called Biology Migration Algorithm (BMA) that is based on the biology migration behavior in nature. BMA consists of three steps as follows:

  • Population generation: This step models the movement of different biotic species to a new location. Population P with N search agents is randomly generated from the problem space [17].

  • Migration phase: It contains two rules: (1) The positions of agents are modified based on the best position of the solution space, and (2) The agent moves to a new location according to its neighborhood solutions.

  • Updating phase: If a solution cannot be improved during a pre-defined number of cycles (Max_C), it will be replaced by a new one in the solution space. The pseudo-code of this phase is presented by Algorithm 1 [17], where Xi (t) indicates the i-th search agent of the population at the t-th iteration.

    figure a

4.2 Multi-Objective Optimization (MOO)

Generally, MOO is defined as finding a vector of variables to minimize (or maximize) a vector of objective functions in a feasible region. Thus, it can be modeled as Eq. (1) [42]:

$$Minimize\,\,\,F(x) = [f_{1} (x),\,f_{2} (x), \ldots ,f_{d} (x)]\,,\,\,\,\,\,\,\,Subject \, to\quad q(x) \le 0,$$
(1)

where x indicates the vector of variables, \(f_{d} (x)\) shows the d-th objective function, and \(q(x)\) indicates the constraint vector.

One of the main challenges in MOO is the conflicting objectives (i.e., a trade-off between objectives). In other words, one objective is achieved only by making concessions to another objective; thus, the optimal solution cannot be found for all m objective functions simultaneously. Consequently, a "compromise solution" is acceptable for these problems.

One of the primary ways to deal with the multi-objective problem is to consider scalar objectives and define the linear combination of weighted objectives [43]. However, this method will be applicable and efficient once the weights can be determined correctly. This method also has some weaknesses, such as sensitivity to weights and sometimes engendering the loss of information [43]. Two obvious contradictory objectives in the feature selection problem are minimizing the number of features and minimizing the error rate. The feature selection process attempts to select a few significant features to obtain similar or even better regression performance, rather than considering all features. Hence, we consider feature selection as a bi-objective minimization problem. In this paper, we apply a non-cooperative schema with the notion of a player to solve the multi-objective feature selection problem.

4.3 Nash heuristic method

In 1950, Nash introduced the fundamental concept of determining equilibria in cases where several competing players seek to optimize their objective functions [43]. If an optimization problem consists of d objectives, the Nash strategy will define d players, each of which with its own optimization criterion. Each player seeks to optimize its own criterion by assuming that the criteria of the other players are fixed. When none of the players can improve their criteria, the system reaches a state of equilibrium called the Nash equilibrium [39]. The idea of combining Nash with heuristic methods is to bring together a heuristic method and the Nash strategy to find the Nash equilibrium as a solution for MOO problems.

5 The proposed method

This paper proposes a novel dimensionality reduction technique based on BMA and game strategies, i.e., GBMA. GBMA is hybridized with two specific game strategies, i.e., the Nash strategy and coalitional game, to find the best solution and solve the multi-objective feature selection problem. Firstly, the preprocessing steps are taken to transform the original dataset into a format that is easily and more effectively applicable to future processing phases. The main processes taking place here handle the missing values and detect outliers [44, 45]. Using interquartile range (IQR) [43], we attempted to detect and remove outliers to provide a model that fits well the reality and undeviated by outliers. IQR indicates how the data spreads about the median and is used for outlier detection. In other words, outliers are defined as objects that fall below Q1 − 1.5 IQR or above Q3 + 1.5 IQR [46]. Handling the missing values is an essential step during model development since they affect conclusions derived from the data. Missing data can be removed when it is limited to a small number of objects, followed by processing the remaining objects. This method is called listwise deletion and considered as a default option in most statistical software packages [47].

5.1 GBMA framework

Figure 1 shows the framework of the proposed method that consists of four major steps as follows:

  • In the first step, the population of agents is initialized as players (see Sect. 5.2).

  • In the second step, the features corresponding to each agent are clustered based on their correlation coefficients. That is, these features are considered as players that play a coalitional game, each of which desires to join the features that increase the correlation coefficient of extracted features (see Sect. 5.3).

  • In the third step, the new features are extracted from each cluster based on the weighting methods (see Sect. 5.4).

  • In the fourth step, the best agent of each population is determined, followed by updating the agents of two populations (see Sect. 5.5).

Fig. 1
figure 1

The framework of GBMA

Moreover, steps 2–4 are repeated until the stopping criterion is satisfied; otherwise, the algorithm is stopped, and the best feature set is reported.

5.2 Definition of populations as players

This section initially defines the fitness function and then introduces the agent representation. For each objective in the fitness function, players (i.e., populations) are defined that participate in a game, each of which attempting to improve their objective. This game is continued until none of the players can improve its objective.

5.2.1 Fitness function

The proposed algorithm deals with the feature selection problem as a MOO problem. We attempt to optimize three different objectives with three players by computing the population of agents for each criterion. The fitness function is calculated by Eq. (2):

$$fitness = W_{r} \times Er(R) + W_{f} \times \frac{\left| L \right|}{{\left| T \right|}} + W_{p} \times \left( {\frac{{SSE_{p} }}{{MSE_{FULL} }} - All\_s + 2h} \right),$$
(2)

where \(Er(R)\) is the regression error rate, \(\left| L \right|\) indicates the length of the feature, \(\left| T \right|\) is the total number of features, \(MSE_{FULL}\) is the MSE for the full model (i.e., the model containing all features). \(SSE_{h}\) indicates the residual sum of squares (RSS) for the subset model containing h features (i.e., h is the number of features after the feature selection process). Moreover, All_s is the sample size. In Eq. (2), \(W_{r}\), \(W_{f}\), and \(W_{p}\) are three parameters corresponding to the importance of regression quality, subset length, and model precision, respectively, where \(W_{r} ,\,\,W_{f} ,\,\,W_{p} \in [0,1][0,1]\).

Figure 2 presents the attempts made to optimize three different objectives participating in fitness calculation (Eq. 2) and how such merging can be achieved with three players. We show the candidate solution for a triple objective function by Eq. (3):

$$S = XX^{\prime}X^{\prime\prime}$$
(3)

where \(X\) \(X^{\prime}\), and \(X^{\prime\prime}\) show the first, second, and third criteria.

Fig. 2
figure 2

A block diagram of Nash strategy for three objectives

The optimization processes of \(X\),\(X^{\prime}\), and \(X^{\prime\prime}\) are assigned to Player1, Player2, and Player3, respectively. Based on the Nash theory, Player1 tries to optimize S according to the first criterion by changing \(X\), while \(X^{\prime}\) and \(X^{\prime\prime}\) are fixed by Player2 and Player3, respectively. Similarly, Player2 seeks to optimize S according to the second criterion by changing \(X^{\prime}\), while \(X\) and \(X^{\prime\prime}\) are fixed by Player1 and Player3, respectively. In this paper, a population is created for each player (i.e., three populations). Player 1’s optimization process is carried out by the first population, whereas Player2’s and Player3s optimization processes are performed by the second and third populations, respectively.

Let \(X_{k - 1}\) be the best value found by Player1 at generation k − 1, \(X^{\prime}_{k - 1}\) and \(X^{\prime\prime}_{k - 1}\) be the best values found by Player2 and Player3, respectively, at generation k − 1. At generation k, Player1 optimizes X using \(X^{\prime}_{k - 1}\) and \(X^{\prime\prime}_{k - 1}\) to evaluate S (i.e., \(S = X_{k} X^{\prime}_{k - 1}\)). Likewise, Player2 optimizes \(X^{\prime}_{k}\) using \(X_{k - 1}\) and \(X^{\prime\prime}_{k - 1}\) (i.e., \(S = X_{k - 1} X^{\prime}_{k} X^{\prime\prime}_{k - 1}\)).

After the optimization process, Player1 forwards the best value \(X_{k}\) to Player2 and Player3, which will apply it at generation k + 1. Likewise, Player2 forwards the best value \(X^{\prime}_{k}\) to Player1 and Player3, which will apply it at generation k + 1. This process is repeated for \(X^{\prime\prime}_{k}\) as the best value for Player3. Nash equilibrium is reached when the three players (i.e., Player1, Player2, Player3) cannot improve their criteria.

5.2.2 Agent representation

In our schema, each agent in a population consists of two parts. The first part is named Sequencing, whose size is equal to the number of features in a dataset. Each agent represents the order of features affecting clustering. The second part determines the weighting method used to extract new features. Each cell of agents in the first part consists of a constant value between [0, M1], where M1 shows the total number of features. In the second part, each cell consists of a constant value between [0, M2], where M2 indicates the total number of weighting methods. Therefore, the problem dimension in this paper is equal to M, which is the sum of M1 and M2. Each cell is rounded to the next highest integer, after which the index of intended features and weighting method is determined.

Figure 3 shows an example of a population and its agents. Here, we can see the agent coding and decoding for a dataset with four features. For example, in the first agent, the indices of Feature1, Feature2, Feature3, and Feature4 are indicated by 0.75, 1.33, 2.76, and 3.1, respectively. At the end of this section, two populations are generated, similar to Fig. 3.

Fig. 3
figure 3

An example of a population and its agents

5.3 Cluster generation

This section illustrates the correlation measure and explains how clusters are generated based on coalitional games.

5.3.1 Correlation coefficients of features

The proposed algorithm seeks to group the most similar features in the same cluster by a coalitional game. In our method, clustering mainly aims at finding relationships between features where the most distinct and informative features are used in feature extraction. Pearson’s correlation coefficient is applied to estimate the correlation between different features. Hence, the correlation coefficient between two features fei and fej is obtained using Eq. (4) [48]. The features in each agent are taken as players, where they collaborate to attain a higher correlation coefficient:

$$Cp_{ij} = \frac{{\sum\limits_{l = 1}^{All\_S} {(x_{i} (l) - \overline{x}_{i} )(x_{j} (l) - \overline{x}_{j} )} }}{{\sqrt {\sum\limits_{l = 1}^{All\_S} {(x_{i} (l) - \overline{x}_{i} )^{2} } } \sqrt {\sum\limits_{l = 1}^{All\_S} {(x_{j} (l) - \overline{x}_{j} )^{2} } } }},$$
(4)

where \(x_{i} (l)\) and \(x_{j} (l)\) are values of feature vector fei and fej for the l-th sample, respectively. \(\overline{x}_{i}\) is the mean of \(x_{i}\) and \(\overline{x}_{j}\) is the mean of \(x_{j}\) over all \(All\_S\) samples. According to Eq. (4), high \(Cp\) values indicate high similarity between the two features.

Upon determining correlation coefficients between features, the correlation value for feature i can be defined by Eq. (5) [48]:

$$Corr_{i} = \frac{{\sum\limits_{j = 1}^{T} {\left| {Cp_{ij} } \right|} }}{T - 1}\,\,\,i \ne j,$$
(5)

where T denotes the total number of features. The low value of this parameter for a feature suggests a low similarity between this feature and other features (i.e., this is a distinctive feature among other features).

5.3.2 A coalitional game among features

This section explains a coalitional game among features. Each feature is considered as a player in the game. A payoff function is defined by Eq. (6):

$$v_{i} (j) = \left| {Cp_{ij} } \right|\quad \forall i,j \in N.$$
(6)

It is assumed that \(v_{i} (i) = 0\quad \forall i\).

There is a preference relation (\(\succ_{i}\)) between each feature and the feature subsets based on Eq. (7):

$$K_{1} \succ_{i} K_{2} \Leftrightarrow \sum\limits_{{j \in C_{1} }} {v_{i} (j)} \ge \sum\limits_{{j \in C_{2} }} {v_{i} (j)} \,\,\,\,\,\,\,\,\forall K_{1} ,K_{2} \in fe_{i} \quad i = 1,2, \ldots ,T,$$
(7)

where K1 and K2 are two features, C1 and C2 are two clusters, T denotes the total number of features, and vi is a payoff function.

While forming a coalition, each feature wants to join the feature group that maximizes its payoff function.

This process is regarded as a coalitional game. Each feature seeks to find a cluster contributing to its high correlation with the features of that cluster while decreasing the correlation between feature and output. Figure 4 depicts an example of a coalitional game envisaged in the proposed method. In this type of game, players try to improve the payoff function. For example, consider a dataset with four original features (i.e., four participating players) and one target. There is an agent with five cells, the first four cells of which indicate the order of features for clustering, and the last one indicates the weighting method for feature extraction. First, a cluster (G1) with Feature4 that is the first feature number in the agent is generated. Then, the relation (i.e., correlation) between the next feature in the solution (i.e., Feature2 in Fig. 4) with the generated cluster and other remaining features are examined.

Fig. 4
figure 4

An example for coalitional game among features

If the correlation is not improved with joining this feature to cluster G1, the aforesaid feature (i.e., Feature2) and feature with the highest correlation (i.e., Feature3 with a correlation value of 0.8) generate a new cluster called G2. So far, the state of three features is determined (i.e., Feature4 is in cluster G1, and Feature2 and Feature3 are in cluster G2). Now, the correlation of the last feature in the agent (i.e., Feature1) with cluster G1 and cluster G2 is checked. Since there is more than one feature in cluster G2, the average correlation between these two features and Feature1 should be determined. Ultimately, the final feature is placed in the cluster with the highest degree of correlation with its members. In other words, cluster G2 with 0.68 is selected for Feature1 in Fig. 4.

5.4 New feature extraction

Upon cluster generation, a new feature is extracted from each cluster. If T original features are grouped into C clusters, they can be represented by C new features since all features within a cluster are quite similar. Let us assume that original features are represented with vector \(fe_{j} = < fe_{1j} ,\,fe_{2j} , \ldots ,fe_{All\_sj} >^{T} ,\,\,\,1 \le j \le T\), where All_s denotes the total number of samples in the dataset. Then, the new features are shown with vector \(D_{h}^{\prime } = < D_{1h}^{\prime } ,\,D_{2h}^{\prime } , \ldots ,D_{All\_sh}^{\prime } >^{T} ,\,\,\,1 \le h \le Number\_new\_features\) and defined as follows:

$$D^{\prime } = Z*Fe,$$
(8)

where

$$D^{\prime } = [D_{1}^{\prime } D_{2}^{\prime } \ldots D_{h}^{\prime } ]$$
(9)
$$Fe = [fe_{1} \,fe_{2} \ldots \,fe_{T} ]^{T}$$
(10)
$$Z = \left[ {\begin{array}{*{20}c} {S_{11} } & \ldots & {S_{1T} } \\ \ldots & \ddots & \ldots \\ {S_{h1} } & \ldots & {S_{hT} } \\ \end{array} } \right].$$
(11)

Thus, Z shows the weighting matrix. From Eq. (8), we have:

$$D_{h}^{\prime } = S_{h1} fe_{1} + S_{h2} fe_{2} + \cdots + S_{hT} fe_{T} \,,\,\,\,\,1 \le h \le Number\_new\_features.$$
(12)

The proposed method assumes three weighting methods as strict, moderate, and hybrid as follows:

Strict method: Here, each feature only affects the creation of a new feature for its own cluster. Hence, for \(1 \le j \le T\), it is defined as follows [24]:

$$S_{hj}^{st} = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {if\,\,h = \arg \mathop {\max }\limits_{1 \le h \le number\_clusters} \,\;Corr_{st} (fe_{j} ,G_{h} )} \hfill \\ {0,} \hfill & {otherwise} \hfill \\ \end{array} ,} \right.$$
(13)

where \(Corr_{st}\) represents the correlation in the strict weighting method. According to Eq. (13), if \(fe_{j}\) belongs to cluster \(G_{h}\),\(S_{hj}^{st}\) is 1, and \(S_{vj}^{st}\) is 0 for \(v \ne h\). However, in this method, a feature cannot contribute to constructing more than one new feature.

Moderate method: Here, each original feature contributes to extracting features according to its correlation value. For moderate method, we have:

$$S_{hj}^{m} = Corr_{mo} (fe_{j} ,G_{h} )\,\,\,\,\,\,1 \le j \le T,\,\,\,\,1 \le h \le number\_clusters,$$
(14)

where \(Corr_{mo}\) denotes correlation in the moderate weighting method for generating new features, and T denotes the total number of features.

Based on Eq. (14), feature \(fe_{j}\) has the greatest effect on the construction of feature \(D_{h}^{\prime }\) when it has the most strong correlation with cluster \(G_{h}\), while other features are less effective.

Hybrid method: It is a combination of the two methods above. Thus, it can be represented as follows:

$$S_{hj} = \beta \times S_{hj}^{st} + (1 - \beta ) \times S_{hj}^{m} ,$$
(15)

where \(S_{hj}^{st}\) and \(S_{hj}^{m}\) are determined based on Eqs. (13) and (14), respectively. The value of \(\beta\) is set by the user in the range [0–1]. According to Eq. (15), the hybrid method operates as a strict method when \(\beta\) is set to 1. Likewise, the hybrid method is identical to a moderate method when \(\beta\) is set to 0.

Once \(D^{\prime }\) is calculated, the new dataset can be obtained as follows:

$$X^{\prime}_{i} = < D_{i1}^{\prime } ,\,D_{i2}^{\prime } , \ldots ,D_{ih}^{\prime } > ,\,\,\,\,1 \le i \le All\_s,\;\,\,1 \le h \le Number\_new\_features.$$
(16)

As indicated, the dimension of the input vector for each sample is reduced to h (h < T). Therefore, the new dataset may be stated as follows:

$$d_{i}^{\prime } = (X^{\prime}_{i} ,y),\,\,1 \le i \le All\_s.$$
(17)

Using the previous example, a new feature generation procedure can be illustrated. In Fig. 4, the hybrid weighting method used for extracting new features is selected based on the rounded value of the last cell. Now, \(S_{hj}^{{}}\) value is calculated based on Eq. (15). Figure 5 shows an example of the hybrid method. Figure 6 indicates the hybrid values and the newly constructed features.

Fig. 5
figure 5

An example for the hybrid method

Fig. 6
figure 6

The new generated features

From Fig. 4, it can be observed that the last agent cell is 2.1. Upon rounding up, the third method (i.e., hybrid method) is used for generating new features. The hybrid weighting method is obtained by Eq. (15). It combines strict and moderate weighting methods (Eqs. 13, 14). Hence, it is necessary to calculate the strict and moderate weights of each feature in all clusters. For example, the moderate weight for Feature1 based on cluster G1 is equal to − 0.13 that is calculated based on Eq. (14). Its strict weight is 0 since Feature1 is not in cluster G1. The moderate and strict weights based on cluster G2 for Feature1 are equal to 1; hence, the hybrid weight is set to 1. This process is repeated for each feature. Upon obtaining the weighting matrix for all features in each cluster, the new features are calculated based on Eq. (12). In Fig. 6, a new feature D1 based on cluster G1 is calculated as follows:

$$( - 0.06 \times Feature1) + ( - 0.13 \times Feature3) + (0.09 \times Feature3) + (1 \times Feature4).$$

This process is repeated for cluster G2 like D1.

5.5 Updating players and populations

In this work, there are three populations, each of which with different objectives for optimization. The first and second populations seek to minimize the number of features and model errors, respectively. The third one attempts to find a good trade-off between a model with all features and a model with selected features in terms of precision value. Thus, the fitness of each agent is evaluated based on its population objective. Three phases for fitness evaluation are presented in Fig. 7. In summary, steps one and two involve agent initialization and the generation of new features based on the agents in populations, respectively. New features are evaluated in step three.

Fig. 7
figure 7

The general schema for evaluating populations

Evaluation of the first population: To evaluate the agents in the first population, the number of generated features is considered as a basic metric. Since the number of features affects the cost of acquiring sample values and model training, the agent with the lowest number of features is determined as the best agent.

Evaluation of the second population: To evaluate the agents in the second population, the obtained dataset with new features from the second population is trained and tested through Support Vector Regression (SVR). A regression version of SVM with a surrogate loss function is commonly used to solve regression problems. Structural risk minimization (SRM) aims at establishing SVR and minimizing the upper bound on the generalization error. SVR is an effective technique in real-valued function estimation. It applies a nonlinear mapping to map input data x into a high-dimensional feature space Fe and then solves a linear regression problem in the new space.

Regression approximation tries to predict the output value based on a given dataset such as \(G = \left\{ {\left( {x_{i} ,y_{i} } \right)} \right\}_{i}^{n}\), where \(x_{i}\) is the input vector (containing n features {fe1, fe2,..., fen}),\(y_{i}\) is the output value, and tot_s is the total number of samples. We seek to find a regression function like \(y = f(x)\) to estimate the outputs \(\left\{ {y_{i} } \right\}\) based on a new set of input–output samples like \(\left\{ {\left( {x_{i} ,y_{i} } \right)} \right\}\). In the second population, an agent with the lowest regression error is considered as the best agent.

Evaluation of the third population: To evaluate the agents in the third population, before the clustering procedure, the MSE of features in each agent is calculated using SVR. Once new features are generated, their RSS is calculated by SVR. Finally, the trade-off between the precision of the model with all features and model with selected features is calculated by Mallows Cp, and the agent with the lowest value is taken as the best agent. After evaluating the fitness of each agent, the best agents of the three populations are exchanged for the next iteration. In a situation with two players (i.e., two populations), this process has occurred with the best agents of two populations.

Upon determining the best agents for the two populations and exchanging their places, the process of updating agents is triggered. In BMA, the updating process of agents consists of two main phases: the migration phase and the updating phase.

(1) Migration phase: It has two stochastic search operators, namely Rule 1 and Rule 2. In Rule 1, each agent moves to the best agent, and thus the optimal solution is found as quickly as possible. In Rule 2, each agent randomly moves to other positions, and hence exploration ability is improved. Two examples of the migration phase are shown in Figs. 8 and 9.

Fig. 8
figure 8

The process of updating with Rule 1

Fig. 9
figure 9

The process of updating with Rule 2

In Fig. 8, attempts have been made to update agent with number one (\(X_{1} (t)\)) based on Rule 1 in the 20-th iteration, where the maximum number of iterations is set to 100. Furthermore, Rand and Pr values are assumed as 0.2 and 0.5, respectively. Firstly, the best agent (i.e., Best-agent in Fig. 8) must be determined from the previous iteration. For example, in Fig. 8, Best-agent equals 0.65, 2.45, 1.92, and 3.33 in the Sequencing part and 1.44 in the last part. Once updated, it would be 0.85, 1.53, 3.32, and 3.66 for the Sequencing part and 0.68 for the last part of the best agent.

In Fig. 9, it is attempted to update agent with number one (\(X_{1} (t)\)) based on Rule 2 in the 20-th iteration, where the maximum number of iterations is set to 100. Additionally, Rand and Pr values are set to 0.2 and 0.5, respectively. First, two random agents must be specified. For example, in Fig. 9, an agent with number two (\(X_{2} (t)\)) and an agent with number forty (\(X_{40} (t)\)) are randomly selected.

Consider an agent with number one equal 0.75, 1.33, 2.76, and 3.1 in the Sequencing part and 0.28 in the last part. Then, it is updated with updating formulas based on \(X_{2} (t)\) and \(X_{40} (t)\). Finally, the values in the Sequencing and the last part are replaced with new values. Hence, it is equal to 1.49, 1.4, 3.36, and 3.42 in the Sequencing part and 0.51 in the last part of the agent with number one.

(2) Updating phase: The updating phase can be used to enhance the diversity of populations. If an agent cannot be improved after a certain number of iterations, it must be replaced by another. Once updated, the validation form of the agents of the two populations must be checked. An agent has a valid form if (1) all integral part of the value in the Sequencing part falls within the range (0, T), where T denotes the total number of original features, (2) the value of No. Weighting falls within the range (0, 3), and (3) the integral part of the value in the Sequencing part is different from each other.

An invalid agent is transformed into a valid form in two steps as follows:

Step 1: The values in Sequencing and No.weighting parts outside the range are modified based on Eq. (18):

$$fixed\_x = \left\{ {\begin{array}{*{20}l} {\bmod (x,N)} \hfill & {if\;x \in \;Sequencing\;part} \hfill \\ {\bmod (x,3)} \hfill & {if\;x \in \;No.weighting\;part} \hfill \\ \end{array} } \right..$$
(18)

where fixed_x shows the corrected form of value x and function mod(a, b) returns the remainder after a is divided by b.

Step 2: First, the values with the same (i.e., repeated) integral parts and the values with missing integral parts are identified. Then, one of the values with the repeated integral part is randomly replaced with the value with missing integral part.

Algorithm 2 shows the main steps of the proposed method.

figure b

6 Complexity reduction

The high computational complexity of an evolutionary algorithm is attributed to two issues: (1) the scale of the population in the swarm and (2) the size of the search space [49]. This section initially illustrates how a swarm population can be scaled by a parallel technique (i.e., MapReduce). Obviously, the more agents evaluated, the more computation time will be needed, and so the population of the swarm should be cut down. Then, the space reduction is explained.

6.1 GBMA based on MapReduce

A parallel model is designed for BMA using the MapReduce technique. There are two dependencies between agents in the migration phase as follows:

  1. (1)

    Exchanging the global best.

  2. (2)

    Moving toward the direction of neighbors.

Now, we explain how GBMA is parallelized (see Fig. 10). In each iteration of GBMA, there are two main phases (i.e., migration and updating). It evaluates the agent at the new point and updates its global best after being compared with its neighbors. Each agent operates independently of the rest of the swarm, except for updating its global best and moving toward neighbors. Due to the limited communication among agents, updating a swarm can be formulated as a MapReduce operation. Once mapped, an agent receives a new position, value, and information of its right and left neighbors. In the reduction phase, it incorporates information from other agents into the swarm to update its global best. In the parallel version of BMA, the global best is communicated only between right and left neighbors rather than all the agents, rendering MapReduce more efficient.

Fig. 10
figure 10

Diagram of parallel BMA based on MapReduce

6.2 Dynamic varying search area (DVSA)

The complexity of the optimization problem is not only linked to its objective functions but also the search space is one of the essential factors [49]. To accelerate the processing of the algorithm, the search area of each population is modified or reduced dynamically. In each population, assume that there are N cooperative agents in the search space. When the minimal distances between agents of each population (i.e., first and second populations) reach a threshold, the real optimal solution should be found in the region around these agents based on the maximum likelihood estimation (MLE). Therefore, the previous search area S is reduced to \(S^{\prime}\). Now, there are two new populations with the same number of agents, and the search space is reduced. Figure 11 indicates the case in which four cooperative agents reduce their search spaces in the second population. First, they search the solution in S and get the best agents \(R_{1}^{ * }\), \(R_{2}^{ * }\), \(R_{3}^{ * }\), and \(R_{4}^{ * }\) included in \(S^{\prime}\). Accordingly, the search area becomes \(S^{\prime}\). The same procedure applies to the subsequent reductions to \(S^{\prime\prime}\).

Fig. 11
figure 11

An example for search space reduction

6.3 Complexity analysis

In GBMA, assume a dataset with N features, and each population with M agents. Since GBMA is implemented in parallel form (e.g., like Fig. 10), the time complexity of the whole method is equal to maximum time complexity for the execution of one agent. The time complexity of the proposed algorithm is estimated based on the pseudo-code (i.e., Algorithm 2) as follows:

Step 1. Generate populations and check agents—lines: 4–9:

For N features and M agents: \(O(3 \times 2 \times M(N + 1))\),

Step 2. Calculate MSE—line 10:

For N features and p training samples: \(O(N^{2} p + N^{3} )\),

Step 3. Cluster features—line 11:

For N features: \(O(N\log N)\),

Step 4. Extract new features (calculate weights and generate new features)—line 12:

For N features and d clusters: \(O(2Nd)\),

Step 5. Calculate RMSE—line 13:

For k extracted features and p training sample: \(O(k^{2} p + k^{3} )\),

Step 6. Calculate fitness—line 14:

For M agents, in population one: \(O(M)\) and in populations two and three: \(O(M \times (k^{2} p + k^{3} ))\).

Step 7. Find the best agent: \(O(M)\).

These six main steps are repeated from line 22 until the end. Consequently, the total computational complexity of the proposed algorithm is \(O(MM + N^{3} + Nd + MN^{3} + N\log N)\).

7 Experimental results and comparisons

The structure of the experimental results is divided into three main subsections. The first subsection (8.1) describes the characteristics of the regression datasets (all datasets are accessible from the UCI data repository [50]) and the parameter setting of the compared algorithms. The second subsection (8.2) investigates the performance of the proposed method in terms of the feature evaluation criteria in three parts (i.e., swarm-based methods, non-swarm based methods, and statistical tests). The third subsection (8.3) studies the appropriate values for parameter L that adjusts the exploration and exploitation in GBMA for different types of datasets. The applicability of the proposed method is also investigated for classification problems. The details of classification datasets and the comparison results are given in Appendix 1.

7.1 Datasets and parameter setting

The proposed method is evaluated using ten regression datasets. Table 2 shows the details of these regression datasets. Table 3 demonstrates the parameter setting procedure of the compared methods.

Table 2 Regression datasets details
Table 3 Parameter setting of the compared algorithms

7.2 Performance evaluation of the compared methods

7.2.1 Swarm-based methods

This section evaluates the convergence and efficiency of the proposed algorithm (i.e., GBMA) using three swarm-based feature selection methods as follows:

  • Artificial bee colony and gradient boosting decision tree (ABCoDT) [37]: It combines bee colony and decision tree to enhance the quality of the selected features.

  • Modified chaotic particle swarm optimization (CPSOS) [51]: It tries to solve feature selection problems by chaotic particle swarm optimization and strikes a balance between exploration and exploitation by the sigmoid-based acceleration coefficients.

  • Modified grey wolf optimization (MGWO) [52]): It modifies the GWO algorithm to set its parameters and control the exploration and exploitation capabilities.

They are compared in terms of the number of selected features (NSF), regression error (RMSE), and fitness value by ten regression datasets. RMSE and the average NSF are two important criteria to identify how well a feature selection method can choose the most relevant features. Generally, a suitable method obtains a low number of features along with a low RMSE value. In our study, SVR with tenfold cross-validation is used as a regression model to predict the observed value. The RMSE of each method is calculated by the subset of selected features using Eq. (19) as follows:

$$RMSE = \sqrt {\frac{{\sum\limits_{i = 1}^{n} {(p_{i} - o_{i} )} }}{tot\_s}} ,$$
(19)

where pi and oi indicate the predicted and observed values of sample i, respectively. The total number of samples in the training set is represented by tot_s.

Figure 12 shows the RMSE value along with the number of selected features for four metaheuristic methods based on the Bh dataset. As seen, the proposed method (i.e., GBMA) has outperformed other methods and achieved a lower number of features. For 30 iterations, the proposed method selects, on average, 4.4 features, outperforming CPSOS, MGWO, and ABCoDT by 32%, 38%, and 50%. Moreover, it shows, on average, 3.2 for RMSE and so reduces the error rate by 33%, 41%, and 51% compared to CPSOS, MGWO, and ABCoDT, respectively. This improvement can be attributed mainly to the fact that GBMA considers the order of features during clustering and then extracts the new features. Therefore, the proposed method can investigate different arrangements of features and appropriately clusters all features.

Fig. 12
figure 12

RMSE and NSF for different methods

Tables 4 and 5 show the average RMSE values and the NSF for different methods based on various datasets, respectively. From Table 4, it can be seen that the GBMA method outperforms the three other approaches for all test datasets, except for two datasets on which MGWO and CPSOS outperform the other methods with a slight performance difference from the proposed approach. GBMA realizes a much lower RMSE value than other similar methods in almost all regression problems. It shows an average RMSE of ≈ 2.9, while the closest method (i.e., CPSOS) indicates an RMSE of 3.2. Tables 4 and 5 show that CPSOS has better performance in terms of the NSF and RMSE value compared with ABCoDT (on average, CPSOS selects 10% smaller number of features and shows 12% lower RMSE value). This is because CPSOS possesses improved global and local search capabilities thanks to its chaotic function and searches the potential high-performance regions of the feature space.

Table 4 Average of RMSE for different methods
Table 5 Average of NSF for different methods

For subsequent experiments, we consider R-squared (\(R^{2}\)), a statistical measure for a regression model that investigates the scattering pattern of data points along the fitted regression line. Once the regression model (e.g., SVR) has been fit, we must investigate how well the model fits the data. One of the goodness-of-fit (GoF) statistics is R-squared for regression analysis. R-squared is defined as Eq. (20) ranging from 0 and 100%:

$$R^{2} = \frac{Explained \, variation}{{Total \, variation}}.$$
(20)

If a model can explain all the variability of response data around its mean, R2 is 100%. For the same dataset, high values of this criterion would indicate a small difference between actual and predicted data. Figure 13 illustrates the performance of the compared methods in terms of R-squared for the Bh dataset. We can see from Fig. 13 that GBMA performs well for the Bh dataset and improves R-squared about 7%, 12%, and 5% compared to MGWO, ABCoDT, and CPSOS, respectively. When a feature selection method achieves the high value for R-squared compared to other methods means that it can find more influential independent variables. We try to determine features that are independent of each other, each of which with highly dependent on the output variable.

Fig. 13
figure 13

R-squared of different methods based on Bh dataset

Table 6 demonstrates the R-squared values achieved by different methods for nine datasets. Here, we can see that GBMA and CPSOS approaches obtain higher R-squared values (18% and 13%, on average) compared to the other methods. In other words, there is a slight difference between the actual and predicted values, and thus a regression model fits the data well. The advantageousness of GBMA can be mainly attributed to the updating phase. The solution is checked in terms of improvement. If a solution cannot achieve better results (i.e., lower error regression in the regression problem), it is removed from the population and replaced with another one. Hence, the wide areas of search space have been investigated to find the best features. The good performance (behavior) of the CPSOS method is associated with its chaotic function that improves the search process and contributes to escape from local optimum.

Table 6 R-squared of different methods

The objective function that is defined based on Eq. (3), is applied to evaluate the fitness of each selected feature subset. The individual (i.e., agent) with low regression error and the small number of features and low Mallows' Cp value shows a better fitness value. The results of fitness for different metaheuristic methods based on various datasets are shown in Fig. 14. It can be observed that the fitness characteristics are almost the same for GBMA and CPSOS from 35 to 85 iterations for the Af dataset. However, afterward, the fitness of GBMA suddenly reduced since GBMA suddenly jumped out of the local optimal where CPSOS was trapped. From Fig. 14, the best performance is achieved by the proposed algorithm (i.e., GBMA) in the fitness value obtained. Thus, it proves the capability of GBMA in adaptively searching the feature space. GBMA shows 19%, 17%, and 11% improvement in terms of fitness value compared to ABCoDT, MGWO, and CPSOS, respectively. As shown in Fig. 14, GBMA outperforms CPSOS on 9 datasets i.e., Af, Bh, Bc, Rct, Ch, Cc, Re, and Oj, with improved fitness values of about 6.7%, 5.2%, 2.7%, 3.1%, 13%, 17%, 2.3%, and 9%, respectively. One of the disadvantages of CSPSO is that it may not reach global optima and improve its best solution regularly. This is because its mutation update procedure for both local best and global best lacks a mechanism to maintain the best previous solution of each firefly. Thus, they move regardless of their previous best situation. Nonetheless, BMA accepts the new (i.e., the updated) position of an individual when the previous fitness improved by changing its position. If fitness does not improve, a new individual is generated; thus, BMA can escape local optimum.

Fig. 14
figure 14figure 14

Fitness values for different datasets

From Table 7, we can see that GBMA has the best performance in most cases (bold numbers). For instance, it has an average fitness value of 4.5, the lowest among the four algorithms on the Re dataset. MGWO also has an acceptable fitness value, 3.7. It is also indicated that ABCoDT and CPSOS are fairly similar in performance on the Ch dataset. Their average fitness values are nearly the same (8 and 8.1, respectively). Having the highest fitness values, ABCoDT does not exhibit excellent performance compared to other algorithms on Oj, Re, Cc, Bh, and Af datasets.

Table 7 Average the best fitness value for the compared methods

7.2.2 Non-swarm-based methods

This section compares the proposed method with non-swarm-based methods (i.e., methods without heuristic search algorithms like BMA). The feature reduction methods are divided into two types: filter methods and feature extraction methods. To show the generality of the proposed method, it was also compared with some filter methods. Filter methods, along with their performance results in classification datasets, are explained in detail in Appendix 2. The three basic feature extraction algorithms consist of three methods, namely FC-C-S [16], weighted principal component analysis (WPCA) [16], and linear discriminant analysis for regression (LDAr) [16].

In this section, the compared methods are evaluated in terms of average R-squared, average NSF, and average RMSE. The performance of the compared methods on different regression datasets can be seen from Fig. 15. As reflected, the proposed method achieves better results in most of the datasets. For example, GBMA improves R-squared and NSF by 18% and 28%, respectively, compared to FC-C-S. Furthermore, GBMA achieves high R-squared value in comparison with other methods since it can find independent variables that are more influential. This advantage can be attributed partly to the fact that the proposed method applies BMA as a search method to detect the appropriate order of features for generating clusters such that features in the same cluster have a high correlation.

Fig. 15
figure 15figure 15

Spider web diagrams for different regression datasets

In Fig. 15, GBMA improves R-squared and RMSE by 35% and 38%, respectively, in comparison with LDAr. Since GBMA defines three weighting methods (i.e., moderate, strict, and hybrid) for generating new features, it can obtain the information of features from other clusters through these weighting methods. While LDAr transforms the input space into a high-dimensional feature space by a kernel trick and tries to maximize the ratio of distances of samples with significant differences in the output variable, it does not consider the relativeness between features.

7.2.3 Statistical tests

In this section, several statistical tests are applied to provide a comparative analysis of the proposed method. The Friedman test is a nonparametric statistical test that functions based on the average rank (\(rank_{i}\)) of each strategy [53]. Nonparametric testing means that no particular distribution is assumed for the data. Moreover, it can evaluate the results of N different strategies for M datasets. The null hypothesis (i.e., there is no statistical difference in the performance of each strategy) is accepted or rejected according to the P value that is determined by chi-square distribution. Holm's sequential Bonferroni posthoc test [54] is a step-down procedure that applies a minimum rank strategy to specify whether its performance is statistically significant based on other strategies. The test is performed by pairwise comparisons as follows:

$$z = \frac{{(rank_{i} - rank_{j} )}}{{\sqrt {\frac{N(N + 1)}{{6M}}} }},$$
(21)

where N is the number of strategies, and the strategy with minimum Friedman rank is defined as a control strategy. The P value is obtained based on the value of z and normal distribution. In Holm’s test, in case the smallest P value is lower than \(\frac{\alpha }{N - 1}\), the hypothesis is rejected, and the next higher value is checked with \(\frac{\alpha }{N - 2}\). This process continues until the hypothesis is accepted or all P values are checked. Statistic tests are performed for the results of the compared methods based on two evaluation parameters (i.e., RMSE, and NSF). The statistical tests for classification accuracy are given in Appendix 3.

The compared methods (i.e., CPSOS, MGWO, ABCoDT, WPCA, FC-C-S, and LDAr) are ranked based on their RMSE. Consequently, the method with the lowest rank would perform best. Figure 16 displays the ranking of the compared feature selection methods based on the Friedman test. As seen, GBMA, FC-C-S, and CPSOS ranked first (1.5), second (2.61), and third, respectively, among all the methods. According to Fig. 17, the P values for the error rate values of all feature reduction methods are less than 0.05 (i.e., 0.04, 0.03, 0.001, 0.0003, 0.00004, and 0.00002 for GBMA vs. FC-C-S, GBMA vs. CPSOS, GBMA vs. MGWO, GBMA vs. ABCoDT, GBMA vs. WPCA and GBMA vs. LDAr, respectively). Hence, these results are statistically significant.

Fig. 16
figure 16

Friedman ranks for RMSE

Fig. 17
figure 17

Holm test for RMSE

For subsequent experiments, the Friedman and Holm tests are investigated for the NSF of all the compared methods (i.e., GBMA, LDA, FC-C-S, WPCA, CPSOS, MGWO, and ABCoDT). Each of the compared methods is run 30 times; the average NSF of each is calculated. Figure 18 shows the Friedman ranks of the 13 methods based on NSF. From Fig. 18, we can see that the NSF through all strategies is significantly different since the attained P value of the test is 3.33E−07, which is lower than the desired significance level (i.e., α = 0.05). The proposed feature selection method (i.e., the highest-ranking method) is considered as a control strategy in Holm's test to check if its performance statistically differs from other strategies. The results of Fig. 19 demonstrate that the GBMA strategy is significantly better than LDA, FC-C-S, WPCA, CPSOS, and ABCoDT.

Fig. 18
figure 18

Friedman ranks for NSF

Fig. 19
figure 19

Holm test for NSF

7.3 Studying GBMA parameters

A set of extensive experiments have been performed to study the impact of the standard parameters of the GBMA. We test different values for the main parameters in the algorithm (i.e., number of search agents, the maximum number of iterations, and L that indicates the step size at each iteration in the migration phase (see Fig. 8). In this section, the datasets are divided into three groups (i.e., small, medium, and large) based on the size of the dimension. The small dataset includes Af, Ch, Re, Cs. The medium dataset includes Bh, Bc, Yp, and the large dataset includes Cc and Rct. To carry out the experiments, two datasets are randomly selected (i.e., two small datasets, two medium datasets, and two large datasets) to evaluate the different combinations of these parameters.

The parameter “number of search agents (i.e., population size)” is allowed to take five different values (i.e., 30, 40, 50, 60, and 70). Furthermore, the number of iterations is allowed to take four different values (i.e., 30, 50, 70, and 100). The last parameter (i.e., vector L) is linearly changed in three different intervals, namely [0.3, 1], [0.3, 2], and [0.3, 4]. A set of independent experiments are performed for each dataset by varying the number of population sizes, number of iterations, and L simultaneously to illustrate the effect of these parameters on the performance of the GBMA algorithm. A total of 60 parameter combinations are adopted for each dataset. The algorithm was run ten times for every set of parameter values and each dataset. Then, we calculate the average RMSE and fitness value and compare the results. Figures 20a–f shows RMSE values in terms of the number of iteration and population size for AF and Cs as two small datasets. Figures 21a–f illustrates RMSE values in terms of the number of iterations and population size for Bh and Bc as two medium-sized datasets. Figures 22a–f depicts RMSE values in terms of the number of iterations and population size for Cc and Rct as two large datasets. Figures 20, 21 and 22 show how the performance of GBMA is changed by setting different population sizes while varying other parameters (i.e., number of iteration and vector L) for small, medium, and large datasets. The study of main parameters for GBMA on different types of datasets shows that 100 iterations are sufficient to obtain the best results in most cases. Moreover, GBMA with a medium number of agents, between 50 and 60, and the parameter L in [0.3, 2] can obtain the lowest RMSE value for most datasets.

Fig. 20
figure 20

Small datasets investigation

Fig. 21
figure 21

Medium datasets investigation

Fig. 22
figure 22

Large datasets investigation

All the source codes developed in this research are available via a GitLab repository [55].

8 Conclusions

Dimensionality reduction is a primary data-preprocessing technique in statistics, machine learning, and related fields. Since selecting a compact subset of features can reduce the computational cost and achieve good generalization abilities. Feature selection seeks to minimize two conflicting objectives, namely error rate and the number of features. Many feature selection strategies have been developed to deal with this problem, but the focus is less on regression than on classification. Motivated by this, we presented a novel game-theoretic feature selection strategy, which obtains the optimal solution via BMA optimization in less time. The proposed method puts the most correlated features in one cluster and considers the appropriate order for checking correlations among input vectors. Moreover, three weighting methods (i.e., moderate, strict, and hybrid) were applied to extract new features based on the generated clusters. We tested the performance of the proposed multi-objective algorithm with several well-known datasets and recent feature selection methods. In almost all cases, the proposed method achieves better performance than all the other strategies both in terms of accuracy and the number of features. In other words, it provides a compact subset of features with a high predictive capability. Future work may focus on exploring feature selection issues for an imbalanced and noisy dataset. The proposed feature selection can be tested by some real industry applications, such as the fault diagnosis.