1 Introduction

With the rapid development of technologies such as big data, the number of attributes (or features) in the data available to decision-makers is growing at an unprecedented rate. Because sufficient prior knowledge about real problems is often lacking, original data usually includes many redundant and/or irrelevant features so that no useful information is lost. Obviously, these irrelevant and redundant features increase the storage and computational burden of a system [29, 34]. Above all, they may degrade the performance of the adopted learning algorithm.

Feature selection (FS) is an effective dimensionality reduction method. It selects a subset of features from the original ones, thereby reducing the learning cost and maximizing classification performance [6, 11, 14, 27, 36]. It has now been applied to various kinds of high-dimensional data [8]. Based on the proportion of labeled to unlabeled samples, existing FS approaches can be classified into three categories: supervised, semi-supervised and unsupervised. Supervised and semi-supervised FS methods mainly use class label information to search for an optimal feature subset. However, in the era of information explosion, not all data can be labeled because of unaffordable costs. Therefore, unsupervised feature selection (UFS) has recently received more and more attention [15].

Up to now, researchers have proposed a variety of UFS methods by introducing correlation, information theory, structural information and clustering techniques. He et al. [16] employed the Laplacian score to evaluate the effectiveness of features and proposed a Laplacian score-based algorithm (LS). Cai et al. [9] proposed another effective UFS algorithm, the multi-cluster feature selection algorithm (MCFS), which uses an L1-norm minimization regularization term. Wang et al. [31] integrated an unsupervised trace ratio formulation with structured sparsity-inducing norm regularization and proposed an improved UFS method (TRACK). By preserving the local manifold structure of data, these algorithms can effectively reduce irrelevant or redundant features, but they do not consider local discrimination information, which has been demonstrated to be an essential property for analyzing data [7]. Li et al. [23] proposed a nonnegative discriminative UFS algorithm (NDFS) that employs spectral clustering to guide feature selection directly. Mitra et al. [26] proposed a UFS algorithm based on the similarity between features (UFSFS). These clustering-based algorithms do not need any exhaustive search technique, but they generally require a properly set parameter to control the size of the reduced feature set. Studying hybrid mechanisms that combine multiple strategies, Tang et al. [30] proposed a unified UFS framework via feature self-representation and robust graph regularization, and Hou et al. [17] proposed a hybrid algorithm with joint embedding learning and sparse regression. These algorithms all demonstrate competitive results, but their solutions may still include redundant features because they lack an effective global search strategy.

Because of their capability to seek solutions with global search strategies, evolutionary algorithms (EAs) have recently received much attention for solving feature selection problems [32, 35]. However, due to the lack of effective strategies for evaluating individuals, there are relatively few studies on the application of EAs to UFS. Tabakhi et al. [28] proposed a new UFS method, UFSACO, which uses ant colony optimization as a global search technique. Owing to its iterative and parallel nature, UFSACO can search a larger feature space and select a good feature subset, but it incurs a large computational cost to construct a fully connected weighted graph among features in advance. To shorten computing time, Kimovski et al. [22] discussed parallel implementations of multi-objective evolutionary optimization approaches for high-dimensional UFS problems. Bhadra and Bandyopadhyay [7] studied the application of a recently developed differential evolution technique, MoDE, to UFS and proposed an improved differential evolution-based UFS algorithm. Abualigah et al. [2] proposed a harmony search-based UFS technique for the text clustering problem. These algorithms significantly improve the search ability of UFS techniques, but they still face the "curse of dimensionality" as the number of features increases.

As a relatively new evolutionary technique, particle swarm optimization (PSO) has advantages such as good global search ability, concise implementation and fast convergence. Researchers have proposed many improved versions and successfully applied them to various real problems, such as multi-objective optimization [25], classifier parameter optimization [4] and robot navigation [38]. Recently, a few researchers have attempted to apply PSO to UFS problems. Wang et al. [33] proposed a UFS algorithm based on Markov blankets and PSO (UFSPSO). This algorithm first filters out irrelevant features using the maximum-entropy principle, and then combines PSO and Markov blankets to remove redundant features. Abualigah et al. [1] introduced a hybrid PSO with genetic operators for unsupervised text feature selection; it uses PSO to remove sparse and non-meaningful features from text. Iranmehr et al. [18] proposed a new PSO-based UFS algorithm for phoneme sound classification. Adeli and Broumandnia [3] studied a novel feature selection approach based on an adaptive inertia weight-based PSO and applied it successfully to image steganalysis. These methods demonstrate the capability of PSO for solving UFS problems. However, some disadvantages remain: (1) most of these algorithms require the population to search for the optimal feature subset in the whole original feature space, and therefore face the "curse of dimensionality" as the number of features increases; (2) most require decision-makers to set appropriate control parameters in advance, such as the inertia weight and acceleration coefficients, to obtain desirable solutions. How to set those parameter values is still a challenge, because suitable values depend on the individual application or optimized problem.

Motivated by these observations, this paper studies a novel unsupervised feature selection method that combines a control-parameter-free PSO algorithm with two filter-based search strategies. The PSO algorithm is designed to find potentially optimal regions, while the two filter-based search strategies are used to improve the convergence speed of the proposed method. The highlights of this paper are as follows:

  1. By integrating the global search capacity of PSO with the fast local search capability of filter-based approaches, a filter-based bare-bones particle swarm optimization algorithm is proposed for unsupervised feature selection problems.

  2. A filter-based strategy, the space reduction strategy based on average mutual information, is given to remove irrelevant and weakly relevant features, thereby reducing the search space of the swarm as well as the cost of fitness evaluation.

  3. To exploit potentially optimal regions found by the swarm, a local search strategy based on feature redundancy is proposed to improve the exploitation capability of the swarm.

  4. Moreover, a feature similarity-based fitness function and a parameter-free particle update strategy are introduced to enhance the algorithm's performance and reduce its dependence on control parameters, respectively.

This paper is organized as follows: Section 2 reviews related basic concepts. Section 3 presents the proposed method, including the filter-based strategy based on average mutual information and the improved PSO. Section 4 reports experimental results on several test datasets. Conclusions are presented in Section 5.

2 Related work

2.1 Problem formulation

Suppose that S is a data set containing K samples and D features, and F is the set of all features. A UFS problem can then be described as follows: select d features (d ≤ D) from all the features so that an appointed evaluation indicator (objective function) H(⋅), such as the classification accuracy, is optimized. However, unlike the supervised and semi-supervised cases, which can directly use label information to evaluate selected features, UFS generally adopts implicit indicators, such as the proportion of predominant features among the selected features [33] or the mean absolute difference [1].

We adopt a binary string to encode a solution in UFS problems:

$$ X=\left(x_1, x_2, \ldots, x_D\right), \quad x_j\in \{0,1\} $$
(1)

where xj = 1 indicates that the j-th feature is selected into the subset represented by X, and xj = 0 indicates that it is not. A UFS problem is then formulated as follows:

$$ \begin{array}{ll} \max/\min & H(X)\\ \mathrm{s.t.} & X=\left(x_1, x_2, \ldots, x_D\right),\; x_j\in \{0,1\},\; j=1,2,\ldots,D,\\ & 1\le \left|X\right| \le D,\; f(X)\in \left[0,1\right]. \end{array} $$
(2)
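To make the encoding of Eqs. (1)-(2) concrete, the following minimal Python sketch builds a binary mask over D features and decodes it into the indices of the selected features. The placeholder objective `H` is purely illustrative (it simply rewards small subsets) and is not the evaluation indicator used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                    # number of original features

# A candidate solution X for Eq. (2): a binary mask over the D features.
X = rng.integers(0, 2, size=D)           # x_j in {0, 1}
if X.sum() == 0:                         # enforce 1 <= |X| <= D
    X[rng.integers(D)] = 1

selected = np.flatnonzero(X == 1)        # indices of selected features

# H(.) is problem-specific; this placeholder rewards small subsets and is
# illustrative only -- it is not the evaluation indicator used in the paper.
def H(mask):
    return 1.0 - mask.sum() / mask.size

print("selected features:", selected, "H(X) =", H(X))
```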

2.2 Particle swarm optimization

As a swarm intelligence optimization technique, PSO was proposed by Kennedy and Eberhart, inspired by the foraging behavior of bird flocks [20]. Compared with other evolutionary algorithms (EAs), PSO abstracts the individuals in the population into particles without mass or volume. Through information sharing and cooperation among the particles, the swarm can find optimal solutions in complex search spaces.

In traditional PSO, each particle has a velocity vector and a position vector, and the swarm looks for optimal solutions by constantly updating the positions of all the particles. When PSO is applied to an FS problem, each particle represents a potential feature subset. Taking the i-th particle as an example, let \( {X}_i^t=\left({x}_{i1},{x}_{i2},\cdots, {x}_{iD}\right) \) and \( {V}_i^t=\left({v}_{i1},{v}_{i2},\cdots, {v}_{iD}\right) \) be its position and velocity, respectively; the particle is then updated as follows:

$$ {v}_{ij}^{t+1}=\omega {v}_{ij}^t+{c}_1{r}_1\left(P{b}_{ij}^t-{x}_{ij}^t\right)+{c}_2{r}_2\left(G{b}_{ij}^t-{x}_{ij}^t\right) $$
(3)
$$ {x}_{ij}^{t+1}={x}_{ij}^t+{v}_{ij}^{t+1} $$
(4)

where D is the number of decision variables; \( P{b}_i^t=\left(P{b}_{i1},P{b}_{i2},\cdots, P{b}_{iD}\right) \) is the personal best position (Pbest) found by the i-th particle, and \( G{b}_i^t=\left(G{b}_{i1},G{b}_{i2},\cdots, G{b}_{iD}\right) \) is the global best position (Gbest) found by the particle's neighbors; r1 and r2 are random numbers uniformly distributed in [0, 1]. The inertia weight ω and the two acceleration coefficients c1 and c2 are three control parameters, used to control the influence of the previous velocity, Pbest and Gbest on the search process of the swarm.
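As an illustration of Eqs. (3) and (4), the sketch below performs one velocity and position update for a single particle. The values of w, c1 and c2 are typical defaults assumed here for illustration; the need to tune such parameters is exactly the burden the parameter-free variant discussed later avoids.

```python
import numpy as np

rng = np.random.default_rng(1)

def pso_update(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One particle update following Eqs. (3) and (4).

    x, v, pbest, gbest are D-dimensional NumPy arrays; w, c1, c2 are the
    inertia weight and acceleration coefficients (typical values assumed
    here, not values prescribed by the paper).
    """
    r1, r2 = rng.random(x.size), rng.random(x.size)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Eq. (3)
    x_new = x + v_new                                               # Eq. (4)
    return x_new, v_new
```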

Traditional PSO was mainly designed to deal with continuous optimization problems. For binary optimization problems, the literature [21] introduced a binary PSO algorithm (BPSO). In BPSO, the position update of Eq. (4) is replaced as follows:

$$ x_{ij}^{t+1}=\begin{cases}1, & \text{if}\; r_3 < s\left(v_{ij}^{t+1}\right)\\ 0, & \text{otherwise}\end{cases}, \qquad s\left(v_{ij}^{t+1}\right)=\frac{1}{1+e^{-v_{ij}^{t+1}}} $$
(5)

where s(⋅) is the sigmoid function and r3 is a random value within [0, 1].
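A possible NumPy rendering of the position rule in Eq. (5) is sketched below; it maps an updated velocity vector to a binary position vector.

```python
import numpy as np

rng = np.random.default_rng(2)

def bpso_position(v_new):
    """Binary position update of BPSO, Eq. (5): a bit is set to 1 with
    probability s(v) = 1 / (1 + exp(-v)), and to 0 otherwise."""
    s = 1.0 / (1.0 + np.exp(-v_new))
    r3 = rng.random(v_new.shape)
    return (r3 < s).astype(int)
```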

Furthermore, Kennedy removed the three control parameters of the traditional PSO and proposed a Gaussian sampling-based method for updating the position of a particle, called bare-bones PSO (BBPSO) [19]. Specifically, a particle is updated as follows:

$$ {x}_{ij}^{t+1}=N\left(\frac{P{b}_{ij}^t+G{b}_{ij}^t}{2},|P{b}_{ij}^t-G{b}_{ij}^t|\right) $$
(6)

In addition, Kennedy proposed an alternative update formula:

$$ x_{ij}^{t+1}=\begin{cases} N\left(\dfrac{Pb_{ij}^t+Gb_{ij}^t}{2},\;\left|Pb_{ij}^t-Gb_{ij}^t\right|\right), & r_4 < 0.5\\ Pb_{ij}^t, & \text{otherwise}\end{cases} $$
(7)
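A compact sketch of both bare-bones updates, assuming NumPy arrays for Pbest and Gbest, might look as follows; Eq. (7) is obtained from Eq. (6) by keeping the Pbest component with probability 0.5.

```python
import numpy as np

rng = np.random.default_rng(3)

def bbpso_position(pbest, gbest, use_eq7=True):
    """Bare-bones position update: Gaussian sampling of Eq. (6), or the
    variant of Eq. (7) that keeps Pbest with probability 0.5."""
    mean = (pbest + gbest) / 2.0
    std = np.abs(pbest - gbest)
    sample = rng.normal(mean, std)                 # Eq. (6)
    if not use_eq7:
        return sample
    r4 = rng.random(pbest.shape)
    return np.where(r4 < 0.5, sample, pbest)       # Eq. (7)
```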

Compared with the traditional PSO defined by Eqs. (3) and (4), BBPSO does not require the three control parameters, so it is more compact and practical. Recently, Zhang et al. [37] proposed an improved BBPSO algorithm and applied it to supervised FS problems. Their BBPSO employs two strategies, uniform mutation and an intensive memory strategy, to improve the search capability of BBPSO.

3 The proposed FBPSO algorithm

The proposed filter-based bare-bones particle swarm optimization algorithm (FBPSO) is introduced in this section. First, a relevance measure is defined by means of mutual information to evaluate the correlation between features, and a space reduction strategy based on average mutual information is built on it. After that, the BBPSO-based unsupervised feature selection procedure is introduced, which integrates several new operators, i.e., the local search strategy based on feature redundancy and the feature similarity-based fitness function, together with several established techniques such as the parameter-free particle update strategy and the real encoding strategy.

3.1 Space reduction strategy based on average mutual information

For a supervised FS problem, the correlation between a feature and the class labels can be calculated directly by various measures, and optimal feature subsets are then determined using those correlation values. In the unsupervised case, however, there are no class labels that can be employed directly, so suitable approaches must be defined to evaluate the correlation between a feature and the underlying classes.

Mutual information can be used to evaluate the interdependence between two features: the more relevant two features are to each other, the larger the mutual information between them. Wang et al. [33] used the average mutual information (AMI) to evaluate the relevance of a feature to all the remaining features, and stated that the greater the average mutual information of a feature, the higher its relevance in the dataset. In view of this, the average mutual information can partly reflect the correlation between a feature and the potential classes, so this paper uses it to delete irrelevant and weakly relevant features in advance.

Suppose that F = {f1, f2, ⋯, fD} is the feature set and Yi = (yi1, yi2, ⋯, yin) is the vector of sample values of the i-th feature fi. The average mutual information of fi is calculated as follows:

$$ AMI\left(f_i\right)=H\left(f_i\right)-\sum_{j=1, j\ne i}^{D}\frac{H\left(f_i\mid f_j\right)}{D-1} $$
(8)

where H(fi) is the information entropy of fi,

$$ H\left({f}_i\right)=-\sum \limits_{y_{ij}\in {Y}_i}p\left({y}_{ij}\right){\log}_2p\left({y}_{ij}\right) $$
(9)

H(fi | fj) is the conditional entropy of fi given fj,

$$ H\left(f_i\mid f_j\right)=H\left(f_i\right)- MI\left(f_i,f_j\right) $$
(10)

and MI(fi, fj) is the mutual information (MI) between the two features fi and fj,

$$ MI\left(f_i,f_j\right)=\sum_{y\in Y_i}\sum_{z\in Y_j} p\left(y,z\right)\log_2\frac{p\left(y,z\right)}{p\left(y\right)p\left(z\right)} $$
(11)

According to the average mutual information defined in Eq. (8), the correlation between a feature and the potential classes can be estimated. Based on these correlation values, all features can be divided into three categories: strongly relevant, weakly relevant and irrelevant features. Since irrelevant and weakly relevant features enlarge the search space of PSO and increase its computational cost, it is necessary to remove them before running PSO. Specifically, a threshold θ is used: if the average mutual information of a feature is smaller than θ, the feature is regarded as irrelevant or weakly relevant and is removed. After removing those features, the search space of the PSO used in Subsection 3.2 is reduced significantly. Algorithm 1 shows the pseudocode of the proposed space reduction strategy.

Algorithm 1 Space reduction strategy based on average mutual information (pseudocode)
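For illustration, the sketch below estimates the pairwise mutual information of Eq. (11) from histograms of discretized feature values and applies the thresholding step of Algorithm 1. The bin count and the toy threshold in the usage line are assumptions of this sketch, not values prescribed by the paper.

```python
import numpy as np

def mutual_info(a, b, bins=10):
    """MI between two feature columns, Eq. (11), estimated from a joint
    histogram of their (discretized) values; the bin count is an assumption."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def space_reduction(data, theta):
    """Algorithm 1 sketch: keep feature i only if its average mutual
    information to the other D-1 features, Eq. (8), is at least theta."""
    D = data.shape[1]
    ami = np.zeros(D)
    for i in range(D):
        ami[i] = np.mean([mutual_info(data[:, i], data[:, j])
                          for j in range(D) if j != i])
    return np.flatnonzero(ami >= theta)        # indices of retained features

# Toy usage on random data; theta would normally be tuned per dataset.
X = np.random.default_rng(4).random((200, 20))
print(space_reduction(X, theta=0.05))
```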

3.2 The improved BBPSO based on local filter search

By running the space reduction strategy described above, we obtain a reduced feature set that contains only strongly relevant features. To further remove redundant features from this set, this section presents an improved BBPSO-based feature selection approach with few control parameters. Compared with the standard PSO, BBPSO does not require the three control parameters (the inertia weight and the two acceleration coefficients) and is therefore more compact and practical. However, the standard BBPSO proposed in [19] still suffers from premature convergence, especially when the Pbest of a particle happens to be close to Gbest. To address this, an improved BBPSO based on local filter search is presented to deal with UFS problems.

3.2.1 Encoding of the particles

In most papers, a binary string is used to represent a particle: if the value of a bit equals "1", the corresponding feature is selected into the feature subset; "0" indicates that it is not. However, this encoding strategy needs the sigmoid function s(⋅) to transform the real-valued velocity of a particle into a binary value.

Unlike binary coding, this paper adopts a more direct real encoding strategy, which uses the probability that each feature is selected as the coding element of a particle. For a data set with D features, the i-th particle is expressed as:

$$ {X}_i=\left({x}_{i1},{x}_{i2},\cdots, {x}_{iD}\right) $$
(12)

where xij denotes the probability that the j-th feature is selected. A threshold of 0.5 is used to decide whether a feature is retained in the current set: xij > 0.5 indicates that the j-th feature is selected; otherwise it is not.
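A minimal sketch of this real encoding and its decoding rule (threshold 0.5) is given below; the function names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)

# Real encoding of Eq. (12): each component is the probability that the
# corresponding feature is selected; 0.5 is the decision threshold.
def init_particle(num_features):
    return rng.random(num_features)

def decode(position, threshold=0.5):
    """Map a real-valued position to the indices of the selected features."""
    return np.flatnonzero(position > threshold)

x = init_particle(10)
print(decode(x))
```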

3.2.2 Feature similarity-based evaluation function

All features can be divided into two parts: selected features (SF) and non-selected features (NSF), where Xi = SF ∪ NSF and SF ∩ NSF = ∅. In this paper, we consider both the dissimilarity of the selected features, fit′, and the similarity of the non-selected features, fit″, to evaluate the fitness of a particle.

The dissimilarity of selected features

To calculate the dissimilarity of the selected features, fit′, the average of the maximal mutual information of each selected feature to the remaining selected features is used:

$$ fi{t}^{\prime }=\frac{1}{\mid SF\mid }{\sum}_{i=1}^{\mid SF\mid}\max \_ NMI\left({f}_i\right) $$
(13)

where max_NMI(fi) is the maximal normalized mutual information of the feature fi to the remaining selected features, i.e.,

$$ \max \_ NMI\left({f}_i\right)=\max \left\{ NMI\left({f}_i,{f}_j\right)|{f}_j\in SF,{f}_i\ne {f}_j\right\} $$
(14)

NMI(fi, fj) is the normalized mutual information,

$$ NMI\left({f}_i,{f}_j\right)=\frac{MI\left({f}_i,{f}_j\right)}{\sqrt{H\left({f}_i\right)H\left({f}_j\right)}} $$
(15)

The smaller the value of fit′, the smaller the redundancy among the selected features.

The similarity of non-selected features

To calculate the similarity of the non-selected features, fit″, the average normalized mutual information between each non-selected feature and its closest selected feature is used:

$$ fi{t}^{{\prime\prime} }=\frac{1}{\mid NSF\mid }{\sum}_{i=1}^{\mid NSF\mid } NMI\left({f}_i,{f}_{\mathrm{min}}\right)\kern1em \left({f}_i\in NSF,{f}_{\mathrm{min}}\in SF\right) $$
(16)

where fmin is the feature in SF that is closest to the non-selected feature fi. A higher value of fit″ indicates that each non-selected feature can be well represented by some feature in SF.

Combining the similarity and dissimilarity terms to ensure the representativeness of the selected features, the fitness function is defined as:

$$ fit= fi{t}^{{\prime\prime} }- fi{t}^{\prime } $$
(17)

The larger the value of fit, the better the quality of the selected features.
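The following sketch assembles Eqs. (13)-(17) into a single fitness routine. The histogram-based estimates of entropy and mutual information (and the bin count) are assumptions of this sketch, it requires |SF| ≥ 2 and |NSF| ≥ 1, "closest" is interpreted as the selected feature with the largest NMI, and, consistent with the convention that a larger fitness is better, the returned value is to be maximized.

```python
import numpy as np

def _entropy(v, bins=10):
    p, _ = np.histogram(v, bins=bins)
    p = p[p > 0] / p.sum()
    return -float(np.sum(p * np.log2(p)))

def _mi(a, b, bins=10):
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def nmi(a, b, bins=10):
    """Normalized mutual information, Eq. (15) (histogram estimate)."""
    return _mi(a, b, bins) / np.sqrt(_entropy(a, bins) * _entropy(b, bins))

def fitness(data, sf, nsf):
    """Feature similarity-based fitness of Eqs. (13)-(17).

    sf / nsf are index lists of selected / non-selected feature columns of
    `data`; |sf| >= 2 and |nsf| >= 1 are assumed.
    """
    # fit': average of each selected feature's maximal NMI to the rest of SF.
    fit_p = np.mean([max(nmi(data[:, i], data[:, j]) for j in sf if j != i)
                     for i in sf])
    # fit'': each non-selected feature's NMI to its most similar selected feature.
    fit_pp = np.mean([max(nmi(data[:, i], data[:, j]) for j in sf)
                      for i in nsf])
    return float(fit_pp - fit_p)                   # Eq. (17), to be maximized
```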

3.2.3 Local filter search based on feature redundancy

To improve the local exploitation performance of the swarm, this subsection proposes a local filter search strategy based on feature redundancy. The strategy mainly comprises two operators: a deleting operator, which removes redundant features from a feature subset, and an adding operator, which inserts missing key features into a feature subset.

For the global best position Gbest found by the swarm, suppose that F is the feature set remaining after removing irrelevant and weakly relevant features by the method in Section 3.1, and Fset is the feature subset determined by the position Gbest. The two operators are then described as follows:

The deleting operator

First, this operator randomly selects two features from the set Fset, denoted fl1 and fl2. Second, for each feature in {fl1, fl2}, the k-nearest-neighbor algorithm is run to select its k nearest neighbors from the set Fset, denoted kNN(fl1) and kNN(fl2), respectively. Next, the average normalized mutual information (A_NMI) between each of these features and its neighbors is calculated as follows:

$$ A\_ NMI\left({f}_i\right)=\sum \limits_{f_j\in kNN\left({f}_i\right)}\frac{NMI\left({f}_i,{f}_j\right)}{k},{f}_i\in \left\{{f}_{l1},{f}_{l2}\right\} $$
(18)

After that, the feature in {fl1, fl2} with the larger A_NMI value is removed from Fset.

The adding operator

First, this operator randomly selects two features from the set F∖Fset, denoted \( {f}_{l1}^{\prime } \) and \( {f}_{l2}^{\prime } \). Then, using the method above, the average normalized mutual information (A_NMI) between each of these features and its neighbors is calculated, and the feature in \( \left\{{f}_{l1}^{\prime },{f}_{l2}^{\prime}\right\} \) with the smaller A_NMI value is added to the set Fset.

Algorithm 2 summarizes the proposed local search strategy. Step 1 implements the deleting operator and removes the more redundant of {fl1, fl2} from Fset. Step 2 implements the adding operator and adds the more important of \( \left\{{f}_{l1}^{\prime },{f}_{l2}^{\prime}\right\} \) to Fset. Step 3 generates a new position, Gbest′, from the features contained in the updated Fset: for each feature in Fset, the corresponding feature bit of Gbest′ is set to 1, and all remaining bits are set to 0. Finally, Step 4 updates the global best position: if Gbest′ is better than Gbest, Gbest is replaced by Gbest′; otherwise, Gbest′ replaces the worst particle in the swarm.

Algorithm 2 Local filter search strategy based on feature redundancy (pseudocode)
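A sketch of the two operators is given below. It reuses the `nmi` estimator from the fitness sketch above; the neighborhood size k and the use of Fset as the neighbor pool for the adding operator are assumptions of this sketch rather than details fixed by the paper.

```python
import numpy as np

rng = np.random.default_rng(6)

def a_nmi(data, i, pool, k=3):
    """Average NMI of feature i to its k most similar features in `pool`,
    Eq. (18); `nmi` is the histogram-based estimator defined earlier."""
    sims = sorted((nmi(data[:, i], data[:, j]) for j in pool if j != i),
                  reverse=True)[:k]
    return float(np.mean(sims))

def local_filter_search(data, fset, full_set, k=3):
    """One application of the deleting and adding operators on the subset
    Fset decoded from Gbest (a sketch of Steps 1-2 of Algorithm 2)."""
    fset = list(fset)
    # Deleting operator: of two random selected features, drop the more
    # redundant one (larger A_NMI within Fset).
    f1, f2 = (int(v) for v in rng.choice(fset, size=2, replace=False))
    fset.remove(f1 if a_nmi(data, f1, fset, k) > a_nmi(data, f2, fset, k) else f2)
    # Adding operator: of two random unselected features, add the one that is
    # less redundant with the current subset (smaller A_NMI w.r.t. Fset --
    # the neighbor pool used here is an assumption of this sketch).
    rest = [f for f in full_set if f not in fset]
    g1, g2 = (int(v) for v in rng.choice(rest, size=2, replace=False))
    fset.append(g1 if a_nmi(data, g1, fset, k) < a_nmi(data, g2, fset, k) else g2)
    return fset
```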

3.3 The steps of the proposed FBPSO

Based on the above components, the steps of the proposed unsupervised feature selection algorithm are described as follows (a code sketch of the main loop is given after the step list):

  • Step 1: Implement the space reduction strategy. Calculate the average mutual information of all features according to Eq. (8), and remove irrelevant and weakly relevant features based on the method in Subsection 3.1;

  • Step 2: Initialize all the particles and evaluate their fitness values by Eq. (17). For each particle, Pbest is initialized to its own position, and Gbest is set to the position of the particle with the best fitness value;

  • Step 3: For every particle in the swarm, repeat the following steps:

  • Step 3.1: Evaluate the fitness values of each particle by Eq. (17).

  • Step 3.2: Update the Pbest and Gbest of each particle by the following standard strategy. For a particle, if its new position is better than the current Pbest, the current Pbest is replaced by the new position. If the new Pbest is better than the current Gbest, Pbest replaces the current Gbest.

  • Step 3.3: Update the position of each particle by Eq. (7);

  • Step 3.4: Run the local filter search proposed in Subsection 3.2;

  • Step 3.5: Loop to Step 3.1 until the termination condition is met.
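The following skeleton shows how these steps fit together. It wires up the helper sketches from the previous subsections (space_reduction, decode, fitness, bbpso_position) and is an illustration under those assumptions, not the authors' implementation; the swarm size and iteration budget follow the settings reported in Section 4.1, and the local filter search of Algorithm 2 is only indicated by a comment.

```python
import numpy as np

rng = np.random.default_rng(7)

def fbpso(data, theta, swarm_size=40, max_iter=100):
    """Skeleton of Steps 1-3, built on the sketches defined above."""
    retained = space_reduction(data, theta)                 # Step 1
    X = data[:, retained]
    D = X.shape[1]

    pos = rng.random((swarm_size, D))                       # Step 2: init

    def evaluate(p):
        sf = decode(p)
        nsf = np.setdiff1d(np.arange(D), sf)
        return -np.inf if sf.size < 2 or nsf.size == 0 else fitness(X, sf, nsf)

    fits = np.array([evaluate(p) for p in pos])
    pbest, pbest_fit = pos.copy(), fits.copy()
    gbest = pos[np.argmax(fits)].copy()

    for _ in range(max_iter):                               # Step 3
        for i in range(swarm_size):
            pos[i] = bbpso_position(pbest[i], gbest)        # Eq. (7)
            f = evaluate(pos[i])
            if f > pbest_fit[i]:                            # update Pbest
                pbest[i], pbest_fit[i] = pos[i].copy(), f
        gbest = pbest[np.argmax(pbest_fit)].copy()          # update Gbest
        # Step 3.4: the local filter search (Algorithm 2) would refine the
        # subset decoded from Gbest here; omitted in this skeleton.
    return retained[decode(gbest)]                          # selected features
```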

4 Experiments

This section verifies the performance of the FBPSO algorithm by comparing it with several existing unsupervised feature selection algorithms on frequently used data sets.

4.1 Experiment settings

The proposed FBPSO was tested on six data sets: the biological data sets Musk1 and Arrhythmia, the image data sets AR10P and ORL10P, and the text data sets PCMAC and BASEHOCK. The first two data sets are from the UCI repository [10]; the rest come from the ASU repository [39]. Table 1 gives general information about these data sets. They span a wide range of sizes and representations, with sample sizes ranging from 100 to 2000 and feature counts from 166 to 10,304, and therefore provide a comprehensive test of the proposed algorithm under different conditions.

Table 1 The selected data sets

We compare the proposed FBPSO algorithm with MCFS [9], TRACK [31], UFSACO [28] and UFSPSO [33]. MCFS and TRACK are non-heuristic approaches and have been used in many studies, e.g., [7, 9, 15, 30, 33]; in our experiments, they are mainly used to demonstrate the advantage of EA-based approaches over non-heuristic ones. UFSACO and UFSPSO are two representatives of existing EA-based approaches; ACO and PSO are two widely used swarm intelligence optimization algorithms, and these two methods are mainly used to validate the effectiveness of our improved PSO-based algorithm against existing EA-based approaches.

Since the data sets have different numbers of features, the maximum correlation-entropy threshold in FBPSO and UFSPSO is set to the entropy relevance value of the ⌈Dlog2D⌉-th ranked feature for each dataset, where ⌈⋅⌉ denotes the ceiling function, and 10-fold cross-validation is used. We set the swarm/population size to 40 and the maximum number of iterations to 100. The 1-nearest-neighbor classifier (1NN) and the decision tree classifier (C4.5) are used to calculate the classification accuracy of a feature subset. In order to reduce the impact of randomness, every algorithm is run 20 times on each data set and statistical results are reported.
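For reference, the ACC of a feature subset under this protocol can be computed, for example, with scikit-learn as sketched below; DecisionTreeClassifier is used here as a stand-in for C4.5, which is an approximation rather than the exact classifier used in the experiments.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def subset_accuracy(data, labels, selected, classifier="1nn"):
    """ACC of a feature subset under 10-fold cross-validation, as used in
    Section 4; the decision tree is only an approximation of C4.5."""
    clf = (KNeighborsClassifier(n_neighbors=1) if classifier == "1nn"
           else DecisionTreeClassifier())
    return cross_val_score(clf, data[:, selected], labels, cv=10).mean()
```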

4.2 Experimental analysis

We evaluate the proposed FBPSO algorithm in two respects: the proportion of selected features to all features (PRO) and the classification accuracy (ACC) [40]. Fig. 1 shows the PRO values obtained by FBPSO. It can be seen that FBPSO significantly reduces the number of features: all PRO values lie within the range of 1.19% to 4.72%. Note that the larger the data set, the smaller the PRO value obtained by FBPSO.

Fig. 1 The PRO values obtained by FBPSO

To compare FBPSO with the other four feature selection methods, Tables 2 and 3 report the ACC values obtained with the two classifiers and the corresponding statistical results. In both tables, the baseline column gives the ACC value obtained by the classifier when all features are used, the other columns give the average ACC values obtained by the different methods, and the last row gives the average classification accuracy of each algorithm over all data sets. The best classification result for each data set is highlighted in boldface. Moreover, following the suggestion in [12], the Wilcoxon rank-sum test at the 0.05 significance level is used to assess the statistical significance of the results. The symbol "+" indicates that the null hypothesis (that the median difference between the two algorithms is zero) is rejected at the 5% level and FBPSO is significantly better than the compared algorithm; the symbol "-" indicates that the null hypothesis is rejected at the 5% level and FBPSO is significantly worse than the compared algorithm; the symbol "=" indicates that the null hypothesis cannot be rejected at the 5% level, i.e., the difference between FBPSO and the compared algorithm is not significant.

Table 2 Average ACC values found by all the algorithms (1NN)
Table 3 Average classification accuracy obtained by all the algorithms (C4.5)

From the 1NN-based classification results in Table 2, it can be seen that for the data sets Arrhythmia, AR10P and BASEHOCK, the ACC values of FBPSO are significantly better than those of the four comparison algorithms, i.e., TRACK, MCFS, UFSACO and UFSPSO. For the data sets Arrhythmia, AR10P, PCMAC and BASEHOCK, FBPSO obtained the highest ACC values, and as shown in the last row of Table 2, it obtained the best average ACC value over all six data sets. UFSPSO obtained ACC values similar to FBPSO on Musk1, ORL10P and PCMAC, and achieved the second-best average classification accuracy. Compared with the baseline, the classification accuracies of TRACK, MCFS, UFSACO, UFSPSO and FBPSO are increased by 1.71%, 0.62%, 1.84%, 2.46% and 2.77%, respectively.

Table 3 shows the ACC values of the different algorithms under the C4.5 classifier. For all six data sets, the performance of FBPSO is significantly better than that of the three comparison algorithms TRACK, MCFS and UFSACO in terms of ACC. UFSPSO obtained ACC values similar to FBPSO on Musk1, Arrhythmia, AR10P and BASEHOCK, but its performance is significantly worse than FBPSO on ORL10P and PCMAC. Moreover, on four of the six data sets, i.e., Musk1, Arrhythmia, AR10P and PCMAC, FBPSO has higher ACC values than all four comparison methods. Compared with the baseline, TRACK, MCFS and UFSACO, the classification accuracy of FBPSO is increased by 0.86%, 0.58%, 1.47% and 0.95%, respectively. Therefore, the performance of FBPSO is again better than that of all four comparison algorithms.

Overall, the above results show that the proposed FBPSO algorithm can optimize the feature selection process and improve the classification accuracy of data, making it a competitive data pre-processing tool.

4.3 Further discussion

This subsection evaluates the effectiveness of the proposed local filter search strategy. FBPSO without the local filter search is denoted FBPSO/LS. Fig. 2 shows the experimental results of both FBPSO and FBPSO/LS. For all data sets, with the help of the local search operator, FBPSO obtains higher classification accuracies than FBPSO/LS. Taking the data set AR10P as an example, FBPSO achieves a classification accuracy of 73.24, which is 1.42 percentage points higher than FBPSO/LS. Overall, the local search strategy plays a key role in improving the performance of FBPSO.

Fig. 2 Average classification accuracy obtained by FBPSO and FBPSO/LS

Moreover, this paper uses a space reduction strategy based on average mutual information to remove irrelevant and weakly relevant features, as described in Subsection 3.1. We design another experiment to test the effectiveness of this strategy, in which 20 new data sets are constructed from the two data sets Musk1 and Arrhythmia. Taking Musk1 as an example, we first randomly select 100 of its features, and then add a certain degree of noise to these features as follows: v(xi) = (1 + λ × rand) × v(xi), i = 1, 2, ⋯, 100, where v(xi) is the value of the i-th selected feature, λ is the noise degree, and rand is a random number within [−1, 1]. In our experiment, λ is set to {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} in turn to generate ten new data sets per original data set.
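A sketch of this noise-injection step is shown below; drawing one random perturbation per feature value (rather than one per feature) is an assumption of this sketch.

```python
import numpy as np

def add_noise(data, feature_idx, lam, rng):
    """Perturb the chosen feature columns as v(x_i) := (1 + lam * rand) * v(x_i),
    with rand drawn uniformly from [-1, 1] for each value (an assumption)."""
    noisy = data.copy()
    rand = rng.uniform(-1.0, 1.0, size=(data.shape[0], len(feature_idx)))
    noisy[:, feature_idx] = (1.0 + lam * rand) * noisy[:, feature_idx]
    return noisy

# Toy usage: perturb 5 randomly chosen columns of a random matrix with lam = 0.2.
rng = np.random.default_rng(8)
X = rng.random((50, 20))
cols = rng.choice(20, size=5, replace=False)
X_noisy = add_noise(X, cols, lam=0.2, rng=rng)
```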

Fig. 3 shows the ratios of the 100 perturbed features removed by the space reduction strategy on the new data sets. For the two original data sets (i.e., Musk1 and Arrhythmia with λ = 0), only 87 and 81 of the 100 features are removed, respectively, mainly because the 100 features still include strongly relevant ones. However, as the noise degree increases, the relevance of the 100 features gradually weakens, and the removal ratios increase rapidly under the space reduction strategy. When the noise degree reaches 0.2, the removal ratios already reach 98% and 97% for Musk1 and Arrhythmia, respectively. When λ ≥ 0.4, all 100 perturbed features are removed, because they have become completely irrelevant under the strong noise. Therefore, the space reduction strategy is effective at removing irrelevant and weakly relevant features.

Fig. 3 The removal ratios on the new data sets with different noise levels

5 Conclusion

This paper presented a new unsupervised feature selection algorithm, the filter-based bare-bones particle swarm optimization algorithm (FBPSO). The space reduction strategy based on average mutual information, the problem-specific local search strategy based on feature redundancy and the feature similarity-based evaluation function proposed in this paper, together with several established techniques such as the parameter-free particle update strategy and the real encoding strategy, make FBPSO effective for UFS problems. Experimental results on six datasets showed that the proposed FBPSO algorithm not only maintains classification accuracy but also significantly reduces the number of selected features, making it a highly competitive unsupervised feature selection method.

Generally, a feature selection problem has two main objectives, i.e., the number of features and the classification accuracy. An important topic for further research is to study the application of typical multi-objective particle swarm optimization techniques, such as vortex multi-objective PSO [24], to feature selection problems. Another avenue of research is to apply the developed algorithms to real feature selection problems arising in cancer diagnosis [5], image recognition [13] and other practical application areas.