1 Introduction

Feature selection is an effective tool for processing high-dimensional data and has shown good performance in many areas of research [1,2,3]. Online feature selection over "streaming" features is becoming popular because it matches how data are produced in practice: features are generated all the time and arrive one after another over time. For example, Weibo hot topics are updated every minute, and a new hot topic may contain a new keyword that is the key feature for identifying it [4]. This trend makes feature selection methods that require the complete feature space inapplicable, because it is impossible to wait for all features to arrive before processing them uniformly.

To handle data with streaming features, researchers have proposed many online feature selection methods, such as Online Streaming Feature Selection (OSFS) [5], an online streaming feature selection method based on the Gap neighborhood relation (OFS-A3M) [6], and Alpha-Investing [7]. OSFS [5] was the first to formalize streaming features and designed a novel framework to handle them. However, the parameter alpha must be specified before feature selection begins, which affects the algorithm's independence tests, and processing features one at a time increases running time, a disadvantage that grows as more features are chosen. Alpha-Investing [7] is a framework for large data sets whose runtime does not grow exponentially with the number of features; processing each feature only once reduces the time overhead, but it increases redundancy among the selected features and lowers accuracy. Rough sets are a powerful tool for handling streaming features; to overcome their inability to handle real-valued data, Hu et al. proposed the neighborhood rough set, which supports both continuous and discrete data [8]. OFS-A3M [6] processes streaming features with NRS and proposes an adaptive neighborhood relation that selects an appropriate number of features for different instances, reducing the influence of parameters on the algorithm. However, none of the above methods consider feature interaction, in which two or more features combined show a higher correlation with the labels than each does alone. Some important features are therefore missed when feature interactions are ignored.

Owing to their swarm-inspired search, evolutionary algorithms have been introduced for feature selection, but most of them consider only a single objective and ignore multiple objectives. DAEA [9] is an EA-based algorithm built on duplication analysis that improves the basic EA framework in three aspects and obtains good classification and generalization results. To address the problems of the methods above, this paper uses PSO to find high-quality feature combinations through a two-layer filtering method. PSO has good exploration ability in supervised feature selection. In the first layer of filtering, PSO chooses the most relevant feature combinations from each arriving group under a multi-objective optimization function [10]. In the second layer of filtering, a neighborhood rough set criterion evaluates the features chosen in the first layer, and the final result is obtained after all groups have been processed. Several algorithms combine PSO with rough sets (RS), such as MMOFS-3S [11], which combines PSO, MI, and RS, but RS is not its core; other algorithms use objective functions that are unrelated to NRS or optimize only a single objective. In contrast, our algorithm designs two NRS-based objective functions for optimization, which improves the quality of the selected features.

The contributions of this paper are as follows: 1) A two-layer filtering architecture is designed by fusing PSO and NRS; it considers the interaction between features and removes redundant features to obtain the best feature set. 2) Two NRS-based objective functions are designed so that the algorithm optimizes multiple objectives during iteration, finally obtaining the feature set most relevant to the label space. 3) PSO-NRS was evaluated against 6 online streaming feature selection algorithms on 14 data sets with 3 classifiers to verify its effectiveness and generalization; the stability of the algorithms and the significance of their differences are verified using spider-web graphs and Critical Difference (CD) diagrams. 4) To verify the effectiveness of the combination of PSO and NRS, we designed ablation experiments and demonstrated the effectiveness by analyzing the experimental results.

The rest of the paper is structured as follows. Section 2 discusses related work. Section 3 introduces theoretical knowledge and symbolic definitions. Section 4 presents the algorithm. Section 5 shows experimental results and Section 6 concludes the paper.

2 Related work

In recent years, online streaming feature selection [12,13,14] has become popular as an important branch of feature selection; it assumes that features arrive one by one over time and processes the incoming features in real time [15, 16]. Without the complete feature space, selecting features efficiently becomes extremely difficult.

  • Feature selection using neighborhood rough sets

    NRS were proposed to address the issue that classical rough sets [17,18,19] are inconvenient for data sets with numerical attributes: the classical rough set must discretize the data, and discretization alters the original attribute properties. NRS [20, 21] can handle both continuous and discrete data. OSFS-ET [22] is an early-terminated online streaming feature selection framework that can stop before the end of the feature stream while guaranteeing competitive performance with the currently selected features; however, it builds on the traditional rough set, which cannot handle real-valued features, limiting its usability. OFS-Density [23], an online streaming feature selection method based on an adaptive density neighborhood relation, derives its neighborhood relation from the density of surrounding instances and requires no prior domain knowledge; however, because the order in which features arrive dominates the online selection process, the algorithm occasionally fails to select the best combination of features. OM-NRS [24], an online multi-label streaming feature selection framework, customizes a criterion to select important features and designs a pairwise correlation constraint between features to filter out redundant ones; although it can handle multi-label data [25,26,27], feature interactions are not considered.

    In conclusion, although using NRS has the advantage of not requiring domain knowledge, most algorithms do not consider the interactions between features. Therefore, this paper uses feature groups to explore the interactions between features and improve the efficiency of the algorithm.

  • Feature selection through Particle Swarm Optimization

    The PSO algorithm is inspired by birds foraging and finds the optimal particles through continuous iteration of the swarm. Using PSO for feature selection has proved effective [10, 28, 29]. The algorithm first initializes N particles, each consisting of a multidimensional position vector and a velocity vector. Each particle then updates itself by pursuing two extremes: the best solution reached by the particle itself, called p(best), and the best solution found by the whole population, called g(best). Both are updated through continuous iterations until the maximum number of iterations is reached or the optimization function has converged.

    A PSO-based feature selection method using mutual information (PSO-FS-MI) [30] treats feature selection as a minimization problem and combines PSO with mutual information (MI) [4, 31, 32] in a wrapper fashion. Although wrapper-based approaches are time-consuming, a good optimization technique can still produce a satisfactory feature set. A multi-objective, multi-label online feature selection algorithm (MMOFS-2S) [11] divides feature selection into three steps and processes streaming features in groups, in line with the idea of online feature selection; it combines PSO, MI, and rough sets in a three-level filtering framework and demonstrates its effectiveness, but the complexity of its optimization function is a disadvantage. PEFS [33] models feature selection as a bi-objective optimization problem over feature relevance and redundancy, based on a filtering strategy and a heterogeneous ensemble approach; however, it cannot process streaming features and needs the full feature space before performing selection. In conclusion, PSO is a reasonable tool for feature selection, but the computation required by repeated iterations is its main drawback, and designing the optimization function well becomes the key.

3 Theory preparation

NRS support both continuous and discrete data sets, which increases the scalability of the algorithm. In this section, we review some basic concepts and notations of NRS; the symbols and definitions used in this section are given in Table 1.

Definition 1

Given DS, Δ(x,y) satisfies the following properties [8]:

  1. Δ(x,y) ≥ 0; Δ(x,y) = 0 if and only if x = y

  2. Δ(x,y) = Δ(y,x)

  3. Δ(x,z) ≤ Δ(x,y) + Δ(y,z)

Definition 2

Given DS and A, the neighborhood of the instance xi on A can be expressed as [8]:

$$ {\theta_{A}}({x_{i}}) = \{ {x_{j}} \mid {x_{j}} \in U, {\Delta}({x_{i}},{x_{j}}) \le \theta \} $$
(1)
Table 1 Symbol definition

The size of the threshold 𝜃 determines the number of neighbors of the instance xi. The value of 𝜃 is taken as 0.35 of the distance from xi to its farthest instance, which leads to an adaptive neighborhood relation R.
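As an illustration, the adaptive neighborhood of Definition 2 can be computed as in the following Python sketch (NumPy is assumed; the function and variable names are ours, not the paper's):

```python
import numpy as np

def neighborhood(X_A, i, theta_ratio=0.35):
    """Neighbors of instance i under feature subset A (the columns of X_A),
    following Definition 2 with the adaptive threshold described above."""
    dist = np.linalg.norm(X_A - X_A[i], axis=1)   # Euclidean distance to every instance
    theta = theta_ratio * dist.max()              # 0.35 of the distance to the farthest instance
    return np.where(dist <= theta)[0]             # indices inside the neighborhood of x_i
```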

Definition 3

Given DS and 𝜃A(xi), let X1,X2,⋯,XN be the sets of instances partitioned according to the values of the decision attribute; the upper and lower approximations of D with respect to A are defined as follows [8]:

$$ \underline{{N_{A}}}D = \bigcup\limits_{i = 1}^{N} {\underline{N_{A}}{X_{i}}} $$
(2)
$$ \overline{ {N_{A}}}D = \bigcup\limits_{i = 1}^{N} {\overline{ {N_{A}}}{X_{i}}} $$
(3)

Among them

$$ \underline {{N_{A}}} X = \{ {x_{i}} \mid {\theta_{A}}({x_{i}}) \subseteq X, {x_{i}} \in U\} $$
(4)
$$ \overline{ {N_{A}}}X = \{ {x_{i}} \mid {\theta_{A}}({x_{i}}) \cap X \ne \emptyset, {x_{i}} \in U\} $$
(5)

\(\underline {{N_{A}}} D\) is the positive region of the decision, denoted by POSA(D).

Definition 4

In DS, the dependence of D on A is represented by γA(D) in the following equation [8]:

$$ {\gamma_{A}}(D) = \frac{{\left| {POS_{A}}(D) \right|}}{{\left| U \right|}} $$
(6)

We can observe that the range of γA(D) is [0,1]. If γA(D) equals 1, the decision D depends entirely on the feature subset A.

Definition 5

In DS, given A and D, the importance of feature f for D is defined as follows [8]:

$$ {\sigma_{A}}(D,f) = {\gamma_{A}}(D) - {\gamma_{A\backslash \{ f\} }}(D) $$
(7)

4 Proposal method

This paper presents a multi-objective optimization algorithm for online group feature selection, which combines the PSO algorithm and NRS so that the algorithm can select a more effective combination of features. Because features arrive in groups, the algorithm can take the interaction between features into account. To select a better combination of features, we divide the algorithm into two parts, which we call two-layer filtering. Intra-group feature selection uses a multi-objective PSO algorithm to filter the features in each group and obtain SF. NRS online feature selection then evaluates these features with the neighborhood rough set criterion and performs redundancy updates on features that meet the redundancy condition. The two-layer filtering eventually produces a feature combination CF with high relevance to the labels and low redundancy; the framework is shown in Fig. 1.

Fig. 1
figure 1

Overview of PSO-NRS online feature selection

4.1 Problem definition

Given:

  • A single-label data set, which can be represented using labels D and feature sets C = {f1,f2,⋯,fn}.

  • The features arrive at different periods and continuously flow into groups.

Target:

  • A subset of features M is selected from C; the number of features in M is much smaller than in C.

  • M can produce efficiency similar to that of the original C.

4.2 Intra-group feature selection

Feature selection methods are classified as batch or online streaming depending on whether the entire feature space is available. We adopt a group-based approach to streaming feature selection to capitalize on the benefits of batch methods while respecting the streaming nature of the features. In the first layer of filtering, the features in the group Gt that arrive over time are randomly combined by the PSO algorithm, which eventually produces, through continuous iterations, a combination of features highly correlated with the decision feature D.

4.2.1 The particle

When initializing a particle swarm, a matrix can be used to represent the swarm and give a clear view of the state of each particle. Each row of the matrix represents a particle, also known as a solution, which has a position vector and a velocity vector. The ultimate goal of the PSO algorithm is to discover the best solution. In (8), N particles are initialized, and the dimensionality of the vectors in the matrix equals the number of features in each group. Every element pj of a particle's position P is initialized in [0,1] and expresses the probability of selecting the corresponding feature; when the probability exceeds 0.6 the feature is selected, and otherwise it is not. The velocity V is initialized in the range [−0.5, 0.5] and is mainly used for the particle update operation.

$$ {\text{position : }}\left( {\begin{array}{*{20}{c}} {{p_{11}}}& {\ldots} &{{p_{1W}}}\\ {\vdots} & {\ddots} & {\vdots} \\ {{p_{N1}}}& {\cdots} &{{p_{NW}}} \end{array}} \right){\text{ }}\text{velocity:}\left( {\begin{array}{*{20}{c}} {{v_{11}}}& {\ldots} &{{v_{1W}}}\\ {\vdots} & {\ddots} & {\vdots} \\ {{v_{N1}}}& {\cdots} &{{v_{NW}}} \end{array}} \right) $$
(8)
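For concreteness, the initialization in (8) can be sketched in Python as follows; the 0.6 selection threshold and the [−0.5, 0.5] velocity range follow the text, while the concrete sizes and names are illustrative assumptions:

```python
import numpy as np

# Illustrative initialization of the swarm in (8).
N, W = 10, 100                                    # number of particles, group size (assumed)
position = np.random.uniform(0.0, 1.0, (N, W))    # selection probability of each feature
velocity = np.random.uniform(-0.5, 0.5, (N, W))   # initial velocities in [-0.5, 0.5]

selected = position > 0.6                         # True marks S, False marks NS
```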

Finding the optimal solution through continuous iterations of the particle swarm is the core of the PSO algorithm. In (9) and (10), the process of updating the position and velocity of particle j from the i th to the i + 1 th iteration is described. In each iteration, the particle moves from its position to a new position to approach the optimal position. The position and velocity matrix of the particle can be updated as:

$$ \begin{array}{@{}rcl@{}} vel_{j}^{i + 1} &=& w \times vel_{j}^{i} + {c_{1}} \times {r_{1}} \times (p(best)_{j}^{i} - pos_{j}^{i})\\ &&+ {c_{2}} \times {r_{2}} \times (g(best)^{i} - pos_{j}^{i}) \end{array} $$
(9)
$$ pos_{j}^{i + 1} = vel_{j}^{i + 1} + po{s_{j}^{i}} $$
(10)

where c1 and c2 are two constants, p(best) is the personal best position of the particle, g(best) is the best position reached by the swarm, and r1 and r2 are random values in the range [0,1] that increase the randomness of the search. w is the inertia weight, which represents the particle's trust in its own previous motion.

When w is large, the algorithm is better at global exploration; when w is small, it is better at local search. As a result, this algorithm employs a decreasing-weight strategy, allowing it to find higher-quality solutions. The update of w can be expressed as:

$$ w = ({w_{init}} - {w_{end}}) \times (Iter_{init} - Iter)/Iter_{init} + {w_{end}} $$
(11)

In this formula, w is the current weight, winit the initial weight, wend the final weight, Iterinit the maximum number of iterations, and Iter the current iteration number.
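A minimal sketch of one update step following (9)–(11) is given below; the hyper-parameter values and the clipping of positions back to [0, 1] are our assumptions, not the paper's tuned settings:

```python
import numpy as np

def update_swarm(position, velocity, pbest, gbest, it, max_iter,
                 w_init=0.9, w_end=0.4, c1=2.0, c2=2.0):
    """One PSO update step following (9)-(11); parameter values are illustrative."""
    w = (w_init - w_end) * (max_iter - it) / max_iter + w_end   # decreasing inertia weight (11)
    r1 = np.random.rand(*position.shape)                        # random factors in [0, 1]
    r2 = np.random.rand(*position.shape)
    velocity = (w * velocity
                + c1 * r1 * (pbest - position)                  # cognitive term of (9)
                + c2 * r2 * (gbest - position))                 # social term of (9)
    position = np.clip(position + velocity, 0.0, 1.0)           # (10), kept as probabilities
    return position, velocity
```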

4.2.2 Optimization functions

In general, multi-objective EC-based (Evolutionary Computation) feature selection algorithms mainly use the archiving and diversity-enhancement techniques of the Pareto dominance mechanism [34]. PSO has the advantage of simplicity compared with other optimization algorithms, but the number of particles and the repeated iterations make its computational cost noticeably higher, so the optimization function must be designed carefully to limit the computational effort. NRS was chosen as the method for calculating the dependency between features and labels because it requires no domain knowledge and no parameters. The algorithm builds the objective functions from the dependency of S and NS (the selected and non-selected features in the group) on the labels. Using NRS to determine the dependency of the labels on multiple features at once greatly reduces the computational overhead. The dependency of a feature subset is computed with (6).

Algorithm 1 details the calculation of feature-to-label dependency. The Euclidean distance between each instance and the other instances is calculated first, and the appropriate neighbors of each instance are returned by the corresponding neighborhood relation. The lower approximation of the feature subset is then found using Definition 3 and divided by the number of instances to obtain the dependency.
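A minimal Python sketch of this dependency computation (our naming; not the paper's implementation) is:

```python
import numpy as np

def dependency(X_A, y, theta_ratio=0.35):
    """Sketch of Algorithm 1: gamma_A(D) = |POS_A(D)| / |U|, computed on the
    feature subset A given as the columns of X_A, with labels y."""
    n = X_A.shape[0]
    positive = 0
    for i in range(n):
        dist = np.linalg.norm(X_A - X_A[i], axis=1)   # Euclidean distance to all instances
        theta = theta_ratio * dist.max()              # adaptive threshold of Definition 2
        nbrs = np.where(dist <= theta)[0]             # neighborhood of x_i under A
        if np.all(y[nbrs] == y[i]):                   # neighborhood inside one decision class
            positive += 1                             # x_i lies in the positive region
    return positive / n
```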

In the initialized particle-swarm probability matrix, we label the elements of each particle as S (selected) or NS (not selected) according to the magnitude of the probability. The dependency of each particle's S and NS on the decision feature is then computed and denoted rS(D) and rNS(D).

$$ Ob{j_{1}} = {r_{S}}(D) - {r_{NS}}(D) $$
(12)

In (12), we want the S combination to be highly correlated with the decision feature while the NS combination is weakly correlated, so the larger the value of Obj1, the better.

If only Obj1 is used as the optimization function, we find that the number of features in S increases with the number of iterations. To restrain the above trend and find the optimal combination of features more accurately, a second objective equation is proposed:

$$ Ob{j_{2}} = {r_{NS}}(D)/{r_{S}}(D) $$
(13)

rS(D) and rNS(D) are the same as in Obj1, so the smaller the value of Obj2, the better. Optimizing with both objectives makes the feature combination obtained after the iterations better.
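Assuming the dependency() helper sketched above, the two objectives can be computed per particle as follows; the handling of an empty S or NS (zero dependency, large penalty) is our assumption:

```python
import numpy as np

def objectives(particle, X_group, y, threshold=0.6):
    """Obj1 (12) and Obj2 (13) for one particle, reusing the dependency() sketch above."""
    S = particle > threshold                      # selected features in the group
    NS = ~S                                       # non-selected features
    r_S = dependency(X_group[:, S], y) if S.any() else 0.0
    r_NS = dependency(X_group[:, NS], y) if NS.any() else 0.0
    obj1 = r_S - r_NS                             # (12), to be maximised
    obj2 = r_NS / r_S if r_S > 0 else 1e6         # (13), to be minimised; large penalty if S is empty
    return obj1, obj2
```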

Algorithm 1
figure g

Calculate dependency.

4.2.3 Pareto optimal

Pareto optimality refers to an ideal state of resource allocation in which no solution dominates another. In multi-objective optimization, the objective values of two solutions constrain each other, making it difficult to choose a single best solution. Therefore, NDS [35] and CW [36] are applied to the algorithm to find the best solution. NDS takes the objective values of all particles as input and ranks the particles, returning a set of fronts, each containing a collection of particles; within a front, no particle dominates any other, and particles in an upper front dominate those in lower fronts. In the algorithm, each update of the position matrix produces new objective values; we apply non-dominated sorting and take front 1 as the set of candidate solutions. To select the most suitable of the many non-dominated solutions, we use the crowding distance as a measure and choose the particle in the least dense region as the optimal solution of the current iteration.
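The following sketch shows one simple way to implement non-dominated sorting and the crowding distance for a small swarm, assuming both objectives have been converted to minimization (e.g., −Obj1 and Obj2); it is illustrative, not the exact procedure of [35, 36]:

```python
import numpy as np

def non_dominated_sort(F):
    """F: (N, M) objective matrix, all objectives minimised.
    Returns a list of fronts, each a list of particle indices."""
    dominates = lambda a, b: np.all(F[a] <= F[b]) and np.any(F[a] < F[b])
    remaining = set(range(F.shape[0]))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(j, i) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(F, front):
    """Crowding distance of each particle in one front; a larger value means
    a less dense neighbourhood, i.e. a preferred candidate."""
    d = np.zeros(len(front))
    for m in range(F.shape[1]):
        order = np.argsort(F[front, m])
        d[order[0]] = d[order[-1]] = np.inf            # boundary solutions always kept
        span = F[front, m].max() - F[front, m].min() or 1.0
        for k in range(1, len(front) - 1):
            d[order[k]] += (F[front[order[k + 1]], m]
                            - F[front[order[k - 1]], m]) / span
    return d
```

Front 1 corresponds to fronts[0]; among its particles, the one with the largest crowding distance (least dense region) is taken as the best of the current iteration.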

4.2.4 The specific process of the first phase

We assume that features arrive as a stream, and a general online streaming feature selection algorithm processes the arriving features in real time. In our method, the arriving features are collected into a group until the number of features reaches the group size or no more features arrive.

Algorithm 2 shows how the PSO algorithm is combined with this group mechanism. N particles are initialized, and each particle is given a P and a V with dimensions equal to the group size. After the objective values of each particle are determined, the N particles undergo NDS and CW calculations, and the best particle is chosen for comparison with g(best). During each iteration, if a more suitable solution appears, g(best) and p(best) are updated, until the maximum number of iterations is reached or the optimization function has converged. In the end, we obtain a set of features with strong combined effect, and the random combination of features overcomes the drawbacks of processing traditional streaming features one by one. A compact sketch of this loop follows Algorithm 2.

Algorithm 2
figure h

Intra-group feature selection.
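Putting the pieces together, a high-level sketch of the intra-group loop might look as follows; it reuses the helpers sketched earlier (objectives, update_swarm, non_dominated_sort, crowding_distance), and the Pareto-dominance rule for updating p(best) as well as re-selecting g(best) every iteration are our simplifications of Algorithm 2:

```python
import numpy as np

def intra_group_selection(X_group, y, n_particles=10, max_iter=20):
    """Simplified sketch of Algorithm 2; -Obj1 and Obj2 are minimised jointly."""
    N, W = n_particles, X_group.shape[1]
    pos = np.random.uniform(0.0, 1.0, (N, W))
    vel = np.random.uniform(-0.5, 0.5, (N, W))
    evaluate = lambda P: np.array(
        [(-o1, o2) for o1, o2 in (objectives(p, X_group, y) for p in P)])
    F = evaluate(pos)
    pbest, Fp = pos.copy(), F.copy()                       # personal bests and their objectives
    gbest = pos[0].copy()
    for it in range(max_iter):
        front = non_dominated_sort(F)[0]                   # front 1: candidate solutions
        best = front[int(np.argmax(crowding_distance(F, front)))]
        gbest = pos[best].copy()                           # best particle of this iteration
        pos, vel = update_swarm(pos, vel, pbest, gbest, it, max_iter)
        F = evaluate(pos)
        improved = np.all(F <= Fp, axis=1) & np.any(F < Fp, axis=1)
        pbest[improved], Fp[improved] = pos[improved], F[improved]
    return np.where(gbest > 0.6)[0]                        # SF: feature indices selected in the group
```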

4.3 NRS online feature selection

The feature set SF has a high dependency on the label, but no redundancy-update operation has been carried out among the features in SF, so some very poor features could easily be added to CF. We therefore introduce a rough-set-based feature evaluation method, which improves the quality of the selected features.

4.3.1 The maximum dependency

In Algorithm 3, to select the optimal feature set, the features in SF are judged one by one. If adding the feature fi to CF increases the overall dependency, fi is added to CF; if the dependency decreases, fi is eliminated; and if the dependency remains the same as before, fi has a redundant relationship with the features in CF and must go through the redundancy update. This evaluation method is theoretically the most effective in rough set theory, but the slow computation and the lack of sufficient experimental samples make it difficult to produce equivalence classes in high-dimensional space.
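Under this reading, the per-feature decision can be sketched as follows (reusing the dependency() helper from Section 4.2.2; names are ours):

```python
def evaluate_feature(X, y, CF, f):
    """Max-dependency gate of Section 4.3.1 as we read it: keep f if it raises
    the dependency of CF, discard it if the dependency drops, and send it to
    the redundancy update if the dependency stays unchanged."""
    before = dependency(X[:, CF], y) if CF else 0.0
    after = dependency(X[:, CF + [f]], y)
    if after > before:
        return "add"
    if after < before:
        return "discard"
    return "redundancy-update"
```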

Algorithm 3
figure i

NRS online feature selection.

4.3.2 Redundant updates

In a feature set there are often redundant features that serve no purpose and can even reduce the efficiency of the algorithm. For instance, if one of two features fi and fj is useful, the other has no additional effect. If redundant features are not removed, the final feature selection process becomes lengthy and inefficient. When a newly arrived feature keeps the overall dependency equal to the previous dependency, it is added to CF and a significance test is run on each feature in CF; if the significance of a feature equals 0, that feature is redundant and is removed.
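A corresponding sketch of the significance-based redundancy update of (7), again reusing dependency(), is:

```python
def redundancy_update(X, y, CF):
    """Drop every feature whose removal leaves the dependency of CF unchanged,
    i.e. whose significance (7) is zero; a sketch of Section 4.3.2."""
    kept = list(CF)
    for f in list(kept):                      # iterate over a snapshot of CF
        rest = [g for g in kept if g != f]
        sigma = dependency(X[:, kept], y) - (dependency(X[:, rest], y) if rest else 0.0)
        if sigma == 0:                        # zero significance: f is redundant
            kept.remove(f)
    return kept
```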

4.4 General overview of algorithm

The proposed algorithm considers features that arrive continuously over time and are saved into a group; groups arrive in an online fashion. Intra-group feature selection and NRS online feature filtering are used to filter the features in each group. PSO-NRS employs the PSO technique as the foundation for feature selection, randomly combining the features in a group to find the most relevant feature set SF. In the second stage, the features in SF are filtered again using the NRS feature evaluation method. The overall PSO-NRS algorithm is illustrated in Fig. 2.

The Intra-group feature selection: :
  • N particles are initialized at the beginning, each particle is represented by a multidimensional position and velocity vector and the dimension of the vector is equal to the group size. The position vector is initialized to some random values of probability for each dimensional element, and the probability values are used to obtain S and NS. The target value of each particle is derived from the target equation as a criterion for evaluating the merit of the particle.

  • P and V need to be updated according to the individual optimal p(best) and global optimal g(best), and the updating process can be seen in Section 4.2.1.

  • When the maximum number of iterations is reached or the function has converged, the optimal particle g(best) in the group has been selected; the feature set SF is obtained from g(best) and passed to the second stage of filtering.

The NRS online feature filtering: :
  • This process adds features fi from SF to CF until there are no more features in SF. Each feature fi is processed in real time; the maximum-dependency and redundancy-update strategies decide whether the feature is needed, as described in Section 4.3.

  • By filtering each set of features at two levels, we end up with a set of features that are highly relevant to the label and have low redundancy.

Fig. 2
figure 2

Overview of online feature selection PSO-NRS

4.5 Time complexity

From the principle, we can see that the time complexity is concentrated in the calculation of the dependency on the label and in the PSO iterations. For one group of features, the time complexity of computing the objective functions in the first layer of filtering is \(O(MN{I^{2}}\log I)\), where M is the number of objective functions, N is the number of particles, and I is the number of instances. The time complexity of Algorithm 1 is \(O({N^{2}}\log N)\), and the non-dominated sorting and crowding-distance calculations can be expressed as \(O({N^{2}}\log N)\), so the total time complexity of the first layer of filtering is \(O(MN{I^{2}}\log I + M{N^{2}}\log N)\), which reduces to \(O(MN{I^{2}}\log I)\). The second layer of filtering mainly performs feature evaluation and redundancy updates, so in the worst case a redundancy update may be required for each feature, with time complexity \(O(SCI\log I)\), where S is the total number of features selected by the first layer of filtering and C is the total number of features in the candidate feature set. The total time complexity is therefore \(O(MN{I^{2}}\log I + SCI\log I)\), which can be summarized as \(O(MN{I^{2}}\log I)\).

5 Experiment

5.1 Data sets

In the experiments, we use three classifiers to evaluate the selected feature sets in MATLAB R2018b. The whole algorithm is written in Python. All experimental results were generated on a computer with an Intel(R) i7-8750H CPU and 8 GB of memory.

To validate the algorithm, 14 benchmark data sets from different domains were collected. The selected data sets have between 62 and 2600 instances, the number of features ranges from 500 to 12582, and the number of categories ranges from 2 to 40. The selected data sets thus cover a wide range of types, from small to large samples and from low to high dimensionality, making it possible to verify the effectiveness of the algorithm. The details of each data set, such as the number of instances, features, and categories, are listed in Table 2.

Table 2 Details of the data sets

5.2 Experiment to verify the effectiveness of the algorithm

5.2.1 Analysis of experimental data

We present the experimental results of the best-performing algorithms in the tables in bold and italics. Tables 3, 4 and 5 list the accuracy of each algorithm with different classifiers. Three classifiers are used in the experiments: K-nearest neighbor (KNN), random forest (RF), and discriminant analysis (DA). Since the initial P and V of the algorithm are randomly generated, different runs produce different results. To remove this randomness, we perform five-fold cross-validation on each data set and use the average of the results as the final experimental result. The classifiers are implemented with different ideas: for example, KNN compares unlabeled instances with labeled ones and selects the K most frequent labels, while the random forest classifier embodies the idea of ensembles through many decision trees, each of which is a classifier. It is this diversity of classifiers that makes the final outputs differ.

Table 3 Prediction accuracy using KNN classifier
Table 4 Prediction accuracy using RF classifier
Table 5 Prediction accuracy using DA classifier

We compare with several state-of-the-art algorithms, such as SAOLA [12], OFS-A3M [6], and OFS-Density [23]. As can be seen in Table 3, the algorithm performs very well with the KNN classifier: its accuracy is better than the other algorithms on 12 of the 14 data sets and above 90% on 10 data sets. With the RF classifier, our algorithm achieves the optimum on ten data sets and differs little from the optimum on the rest. Six data sets achieve the best results with DA; although slightly inferior to the previous two classifiers, the algorithm's efficiency remains stable, rather than performing well on one data set and poorly on another. Comparing different algorithms shows a large difference in performance across data sets from different domains. In our experiments, we found that on one data set the algorithm selected only one feature, resulting in low accuracy; the reason may lie in the data set itself or in the order in which features flow in, which can prevent other important features from being selected. PSO-NRS uses an online group feature selection method and combines features within groups randomly to discover a high-quality feature set. Meanwhile, the use of NRS adapts to various types of data sets and enhances the generalization ability of the algorithm. Therefore, PSO-NRS can effectively overcome the above drawbacks.

To measure the dispersion of the accuracies obtained from the five-fold cross-validation, the standard deviation (SD) was calculated over the five results. Although the data set is the same, the data distribution changes considerably as the training and test sets are varied. The SD results in Tables 3–5 are mostly scattered in the range 0–12, with the majority in the middle of this range, indicating that the results are not overly dispersed and illustrating the necessity and effectiveness of five-fold cross-validation.

To prove the effectiveness of the algorithm, we compared PSO-NRS with the recent NRSIPSO algorithm [37], which also incorporates PSO and NRS. Because its code is unavailable, the experimental results reported in that article are quoted directly in this paper. NRSIPSO uses ten-fold cross-validation while PSO-NRS uses five-fold cross-validation; it has been shown that the results of the two differ little. The results are shown in Table 6. The two algorithms do not differ much in accuracy. On the ISOLET data set, PSO-NRS selects 91 features while NRSIPSO selects 152, indicating that our algorithm is more efficient on this data set. On the PROState data set, PSO-NRS selects 19 features and NRSIPSO selects 4, so PSO-NRS is slightly inferior there. In general, however, PSO-NRS does not differ much from this recent algorithm in efficiency.

Table 6 Results of comparison PSO-NRS with NRSIPSO

5.2.2 Verify the stability of the algorithm

To present the experimental results more clearly, we ranked the algorithms according to their efficiency on each data set and normalized the ranked results (the normalized range is [0, 0.5]).

It can be observed from Fig. 3 that our algorithm is always at the edge of the graph (shown in red), which indicates that the algorithm is not only stable but also very efficient. Although the shapes enclosed under the DA classifier are not very round, on the data sets where the algorithm does not achieve the best efficiency it is very close to the optimal value and is consistently in the top three of the ranking. On the contrary, the shapes enclosed by the lines of the other algorithms are irregular: they achieve high efficiency on some data sets but are inferior on others, reflecting unstable behavior.

Fig. 3
figure 3

Examine the stability between different online feature selection algorithms

5.2.3 A significantly different test between algorithms

We use the Friedman test and Bonferroni-Dunn test for statistical analysis of all online feature selection algorithms to compare the performance between algorithms [38].

The Friedman test can be defined as:

$$ {F_{F}} = \frac{{(N - 1){\chi_{F}^{2}}}}{{N(k - 1) - {\chi_{F}^{2}}}} $$
(14)

Equation (14) follows an F distribution with K − 1 and (K − 1)(N − 1) degrees of freedom, where \({\chi_{F}^{2}} = \frac{{12N}}{{k(k + 1)}}\left[ {\sum\nolimits_{j} {{R_{j}^{2}} - \frac{{k{{(k + 1)}^{2}}}}{4}} } \right]\), \({R_{j}} = \frac{1}{N}\sum\limits_{i = 1}^{N} {{r_{i}^{j}}}\), and \({r_{i}^{j}}\) denotes the ranking of the jth algorithm on the ith data set. When FF exceeds the critical value \(F_{K - 1,(K - 1)(N - 1)}\), the null hypothesis is rejected. From Table 7, we can see that the null hypothesis is rejected for all evaluation metrics at α equal to 0.05, which indicates a significant difference between the algorithms. For the post-hoc test, we chose the Bonferroni-Dunn test and computed the critical difference (CD), which can be denoted as:

$$ {\text{CD}} = {q_{\alpha} }\sqrt {\frac{{k(k + 1)}}{{6N}}} $$
(15)

With K = 7 and N = 14, and qα = 2.638 at α equal to 0.05, CD is 2.153.
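As a quick worked check of (15) with these values:

```python
import math

# Worked check of (15): k = 7 algorithms, N = 14 data sets, q_alpha = 2.638 at alpha = 0.05.
k, N, q_alpha = 7, 14, 2.638
CD = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(round(CD, 3))   # approximately 2.154, consistent with the reported value of 2.153
```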

Table 7 Friedman test for classifiers

To visualize the significant differences between the algorithms, we plotted the CD diagrams using the computed CD and FF values. In Fig. 4, the numerical axis runs from largest to smallest, and the algorithm on the far right is the best. In the KNN and RF classifiers, PSO-NRS and OFS-A3M are connected by a black line on the same side, at the right end beyond the other algorithms, indicating that there is no significant difference between these two algorithms and their effect is similar, whereas they differ significantly from the remaining algorithms. In the DA classifier, there are four algorithms at the right end of the axis; although there is no significant difference among the four, PSO-NRS is still at the rightmost position and therefore remains the best.

Fig. 4
figure 4

Statistical analysis of different algorithms using CD diagram

5.3 Influence of parameters on the PSO-NRS

The use of the PSO algorithm increases the number of algorithm parameters, and selecting the most beneficial parameters greatly improves the performance of the algorithm. To save space, we averaged the accuracies obtained with the three classifiers for the comparison. Parameters such as the group size, the number of particles, and the number of iterations were explored on four data sets.

5.3.1 The effect of group size on the PSO-NRS

The group size is the number of features processed in one pass of the first filtering layer; a larger group allows stronger combinability between features. Table 8 shows that the optimum is obtained on two data sets with a group size of 200, and on one data set each with group sizes of 50 and 100. Figure 5(a) analyzes the number of selected features and the running time, showing that the running time decreases as the group size increases, while the number of selected features stays roughly constant. Therefore, considering both the running efficiency of the algorithm and realistic streaming-feature scenarios, 100 is finally chosen as the group size.

Table 8 Influence of parameters on the PSO-NRS
Fig. 5
figure 5

Examine the stability between different online feature selection algorithms

5.3.2 The effect of the number of particles on the PSO-NRS

Each particle represents a solution in the PSO algorithm; we use NOP to denote the number of particles. Table 8 shows that the algorithm achieves the best accuracy on three data sets when NOP is 15 and on one data set when NOP is 10, implying that the number of particles should be between 10 and 15. Figure 5(b) shows that the running time increases with the number of particles, and in terms of the number of selected features NOP = 10 is relatively low on the four data sets, so taking all considerations into account we finally choose NOP = 10.

5.3.3 Effect of the number of iterations on the PSO-NRS

In the PSO algorithm, each particle approaches the optimal value through continuous iterations, which ensures that the algorithm can achieve good results. Table 8 shows that on three data sets the best results are achieved after 20 iterations. The running time of the algorithm increases significantly with the number of iterations, as shown in Fig. 5(c). To increase the accuracy of the algorithm, we choose to trade running time for a steadier result.

5.4 Analysis of PSO-NRS for ablation experiments and running time

In this section, we design ablation experiments to verify the effectiveness of combining PSO and NRS. The first filtering layer is removed from the original algorithm, and only the second layer (NRS) is used for online streaming feature selection. The NRS variant is applied to 6 data sets of different types, and its accuracy is analyzed with 3 classifiers. Table 9 shows the accuracy of NRS with the KNN, RF, and DA classifiers.

Table 9 Ablation experiments

From Table 9, it can be seen that PSO-NRS has higher accuracy than NRS in all three classifiers on the SRBCT and MLL data sets. On the WARPPIE10P and LYMPHOMA data sets, PSO-NRS has higher accuracy than NRS in two classifiers. On the LUNG and ISOLET data sets, NRS has better accuracy than PSO-NRS. To better weigh the pros and cons, we also compare the number of features selected by NRS and PSO-NRS. As can be seen from Table 10, on the LUNG and ISOLET data sets the number of features selected by NRS is significantly higher than that of PSO-NRS, yet the accuracies do not differ significantly. On the remaining data sets the two algorithms select comparable numbers of features, but the accuracy of PSO-NRS is significantly better than that of NRS. This experiment therefore demonstrates the effectiveness of combining PSO with NRS.

Table 10 Number of features selected by the algorithm in the 6 data sets

We compare PSO-NRS with OFS-A3M, which also uses NRS for online feature selection, in terms of running time. Both algorithms are written in Python. To limit the length of the paper, we choose six data sets for the comparison; the results are shown in Table 11. From Table 11, it is clear that OFS-A3M is significantly faster than PSO-NRS, which is the inevitable consequence of using PSO: the constant iteration makes the computational cost of the algorithm noticeably higher and is a drawback of the algorithm.

Table 11 Comparison with OFS-A3M algorithm in terms of runtime (Unit: second)

6 Conclusion

In this paper, we propose a novel online feature selection algorithm called PSO-NRS. Online group feature selection is used to increase the information available to the algorithm and explore the relationships between features, addressing the lack of available information in online streaming feature selection that makes the selected feature set less efficient. A double-objective optimization function is designed to find the set of features that are highly correlated with the labels. The use of NRS, which requires no domain knowledge, enhances the generalization ability of the algorithm, and the feature search strategy selects a set of features that are highly relevant to the labels with low redundancy. In the experiments, PSO-NRS is shown to be very competitive in accuracy across several classifiers compared with some state-of-the-art algorithms; its stability is examined and found to be high compared with the other algorithms; and statistical tests show a significant difference between PSO-NRS and the other algorithms, demonstrating the algorithm's effectiveness.

There are many scenarios in the field of feature selection that correspond to situations encountered in practice, such as multi-label, semi-supervised, and unsupervised feature selection; these directions can be explored in depth in future work to solve practical problems. PSO requires many iterations to find the optimal value, which incurs a high computational overhead; in future studies of PSO, suitable strategies could be sought to reduce the negative impact of the iterations.