1 Introduction

With the extensive development of information technology, the data collected from practical applications are usually high dimensional. How to learn from such high dimensional data effectively is an important issue in machine learning [1, 5]. As is well known, high dimensionality brings great difficulties to data processing, such as high computational complexity and an increased probability of over-fitting. To strengthen the discrimination of models built on high dimensional data, it is necessary to eliminate redundant and irrelevant features. Therefore, dimensionality reduction techniques have been proposed to find an optimal low-dimensional feature set to represent the original data.

Feature extraction [6, 42] and feature selection [9, 31] are the two major methods used for dimensionality reduction. Feature extraction reduces the dimensionality of data by constructing new features from the original ones. Tan et al. [23] proposed a novel data-driven rule-based generation approach for fuzzy rule interpolation (FRI) through feature extraction. Zhang et al. [39, 40] proposed robust multiview feature extraction methods through the l2,1-norm constraint. Feature selection selects the relevant feature subset from the raw features without changing the semantics of the data. Compared with feature extraction, feature selection methods can preserve the physical meaning of the original features. According to label availability, feature selection methods are classified into supervised [3, 16], semi-supervised [22, 35], and unsupervised ones [11, 30]. Among the three kinds of methods, the supervised methods are most likely to select a more discriminative feature subset because they can utilize the label information and consider the correlation between labels and features. Instead of the traditional vector-based feature selection that is only suitable for binary classification problems, a series of matrix-based structured sparsity-inducing feature selection (SSFS) methods for tackling multi-class problems has been proposed in recent research. Besides, combined with machine learning algorithms, the matrix-based SSFS methods can be applied to specific areas, including multitask feature selection [28], multilabel feature selection [37], and multiview feature selection [15], etc.

It is well known that describing an object from multiple views rather than a single one is more reliable and informative. For example, the description of a web page from multiple views, including texts, images, and links, provides more information than a description from a single view. Multiview feature learning, which can integrate views to select the relevant feature subset, has attracted widespread attention. In this paper, we focus on supervised matrix-based multiview SSFS. Xiao et al. [32] first proposed a two-view feature selection method. Then Wang et al. [27, 29] presented multiview feature selection methods to overcome the limitation of two views. Later, a number of supervised multiview feature selection methods emerged [26, 38, 43]. In these methods, the G1-norm (or G2,1-norm) regularization is introduced to capture the discriminative views for specific classes (or for all classes), and promising performance has been achieved. However, there still exist some limitations. Usually, these methods are built on the practice that features of different views are concatenated into a long vector, which increases the computational complexity. Besides, the concatenated feature vector ignores the individual statistical characteristics of each view and the different importance of each view to a specific task. In addition, data are often corrupted by outliers in real applications, so it is necessary to build a robust model to mitigate the impact of outliers.

In this paper, we propose a robust supervised weighted multiview feature selection method that incorporates the complementary property and the specificity of views into a unified framework to select the relevant views and features. Specifically, we divide the original high-dimensional space into latent low-dimensional subspaces based on each view and separately impose the weighted penalty in these subspaces instead of concatenating all views. We adopt an individual l2,1-norm penalty for each view, which induces robustness to outliers. To obtain the optimal solution of the derived model, we split the original large scale problem into m (m is the number of views) manageable low-dimensional subproblems with reduced computational complexity, and give an iterative algorithm based on the augmented Lagrangian multiplier method.

In summary, the paper makes the following contributions:

  • The proposed method separately imposes an individual weighted l2,1-norm penalty on each view to enhance the robustness and accumulates all views to construct the loss term, which considers both the complementarity and specificity of views.

  • The high-dimensional problem is split into several low-dimensional problems in the process of optimizing the objective function, which reduces the computational complexity, and the proposed iterative algorithm achieves a fast convergence rate.

  • The proposed method is compared with several state-of-the-art SSFS methods on widely used multiview datasets, and its effectiveness is demonstrated through extensive experiments.

The rest of the paper is organized as follows. Section 2 reviews the related SSFS methods. Section 3 elaborates the proposed method and the iterative algorithm, and the convergence and computational complexity of the iterative algorithm are also analyzed. In Section 4, we carry out the extensive experiments on the multiview datasets to evaluate the performance of the proposed method. Section 5 concludes the paper.

2 Related work

Many matrix-based SSFS methods have been proposed. This section reviews some related work, including robust feature selection methods, multiview feature selection methods, and view-weighted feature selection methods.

RFS [18] is a classical robust SSFS method, in which the l2,1-norm penalty instead of the commonly used Frobenius norm penalty is adopted to mitigate the influence of outliers. Qian et al. [19] proposed a robust unsupervised feature selection (RUFS) method to perform clustering and feature selection simultaneously through nonnegative matrix factorization, local learning, and the l2,1-norm penalty. Du et al. [8] proposed a robust unsupervised feature selection model called RUFSM based on matrix factorization, the l2,1-norm, and a manifold regularization term. In [10], He et al. presented an unsupervised feature selection method called SHLFS by combining the low-rank constraint and self-representation into a unified framework. In [13], Lan et al. proposed a robust method called SCM by introducing the capped l2-norm loss, which can effectively reduce the influence of noise and outliers. There also exist other feature selection methods that involve robust constraints [17, 24, 36, 41]. However, none of these methods is specifically designed to solve multiview problems. The features of different views need to be directly concatenated when applying these methods to multiview data, which ignores the relationship among the different views.

An increasing number of multiview feature selection methods have been proposed to consider the relationship of multiple views. Wang et al. [29] proposed a multi-modal multi-task feature selection method which first introduces the G1-norm regularizer to enforce sparsity among multiple modalities. In [26], Wang et al. proposed a sparse multi-modal feature selection method by using the hinge loss instead of the squared loss. Nie et al. [25] proposed a novel multiview feature selection method called SSMVFS to perform both clustering and classification tasks based on the G1-norm. Besides, the G2,1-norm, which has the ability to select the relevant views, is also used for multiview feature selection. Wang et al. [27] proposed a multiview feature selection method to identify the correlation between genotypes and phenotypes by constructing the sparsity across views via the G2,1-norm. Zhu et al. [43] proposed a block-row sparse multiview feature selection method called MVML through the G2,1-norm regularizer and a least square loss term. Zhang et al. [38] established a discriminative feature selection model via sparse multi-modal learning and the ε-dragging technology. Cheng et al. [4] proposed a novel multiview feature selection method named LHFS, in which the low-rank and hypergraph strategies are introduced to enhance the inherent relationship of data. Even though the above methods consider the relationship among different views through the G1-norm or G2,1-norm regularizer, they consider neither the weights of different views nor the influence of outliers.

To further consider the importance of different views, several multiview schemes have been proposed to learn weights for different views [12, 14, 20, 21, 33]. In [33], Xu et al. proposed a weighted multiview clustering with feature selection method (WMCFS) to improve the clustering accuracy via two weighting schemes. Shi et al. [20, 21] proposed a semi-supervised multiview feature selection method with the structured multiview sparse regularization and the Hessian regularization to boost the feature selection performance. Krishnasamy et al. [12] proposed a semi-supervised multiview multitask feature selection framework by exploring the complementary information from different views in each task and the shared knowledge between the related tasks. Li et al. [14] proposed a multiview manifold regularized feature selection method which exploits the label information, label relationship, data distribution, and relationship among multiple features to perform feature selection. These methods are unsupervised or semi-supervised ones, and they lack consideration of the influence of outliers.

3 Robust weighted multiview feature selection

We first define the notations used in this paper, then describe the new model and give an iterative optimization algorithm to solve it. Finally, the convergence and complexity of the algorithm are analyzed.

3.1 Notations

Upper case letters are used to denote matrices, vectors are denoted by bold lower case letters, and scalars are denoted by lower case letters. For example, for a matrix \(M\in R^{p\times q}\), \(\mathbf{m}^{i}\) and \(\mathbf{m}_{j}\) denote the i th row and the j th column, respectively. The matrix \(X=[\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{N}]\in R^{l\times N}\) represents the set of samples belonging to c classes, where l is the dimension of each sample, N is the total number of samples, and \(\mathbf{x}_{i}\in R^{l\times 1}\) represents the i th sample. X can be expressed in the form of m views and denoted as \(X=[X_{1}^{\top },X_{2}^{\top },\cdots ,X_{m}^{\top }]^{\top } \in R^{l\times N}\), where \(X_{v}=[\mathbf {x}_{1v},\mathbf {x}_{2v},\ldots ,\mathbf {x}_{Nv}]\in R^{l_{v}\times N}\) represents the v th view of the samples. Denote by \(Y=[\mathbf {y}^{1\top },\mathbf {y}^{2\top },\ldots ,\mathbf {y}^{N\top }]^{\top }\in \{0,1\}^{N\times c}\) the label matrix of the N samples. If the i th sample belongs to the j th category, then the j th entry of \(\mathbf{y}^{i}\) equals 1 and the rest equal 0. W is the projection matrix made up of sub-blocks \(W_{v}\in R^{l_{v} \times c}, v=1,\cdots ,m\), that is, \(W=[W_{1}^{\top },W_{2}^{\top },\cdots ,W_{m}^{\top }]^{\top }\in R^{l\times c}\).
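To make these conventions concrete, the following minimal sketch (in Python/NumPy, with hypothetical toy sizes) builds the one-hot label matrix Y and splits a stacked data matrix X into per-view blocks; the helper names are illustrative and not part of the paper.

```python
import numpy as np

def one_hot_labels(labels, c):
    # Build the label matrix Y in {0,1}^{N x c}: Y[i, j] = 1 iff sample i belongs to class j
    Y = np.zeros((len(labels), c))
    Y[np.arange(len(labels)), labels] = 1
    return Y

def split_views(X, view_dims):
    # Split the stacked data matrix X (l x N) into per-view blocks X_v (l_v x N)
    idx = np.cumsum(view_dims)[:-1]
    return np.split(X, idx, axis=0)

# hypothetical toy data: N = 4 samples, three views with 2, 3, and 2 features, c = 2 classes
X = np.arange(7 * 4, dtype=float).reshape(7, 4)
X_blocks = split_views(X, [2, 3, 2])
Y = one_hot_labels([0, 1, 0, 1], c=2)
```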

3.2 Model

Inspired by RFS [18], the l2,1-norm penalty is used to alleviate the influence of outliers. But different from RFS and most multiview feature selection methods, which directly take the high-dimensional data matrix X concatenated from multiple views as the input, we calculate the weighted penalty separately on the latent subspace of each view:

$$ \begin{array}{@{}rcl@{}} \min\limits_{\theta_{v},W}&&\sum\limits_{v=1}^{m}(\theta_{v})^{p} \Vert X_{v}^{\top} W_{v}-Y\Vert_{2,1} \\ \text{s.t.}&&\sum\limits_{v=1}^{m}{\theta_{v}}=1, 0\leq \theta_{v} \leq 1 \end{array} $$
(1)

where 𝜃v is the weight of the v th view, and different weights are assigned to distinguish the different importance of views. The more important the role of a view is, the greater its weight should be. p is the exponential parameter used to control the sparsity of the view weight vector 𝜃, and the influence of p is discussed in the later experiments.

In order to eliminate redundant features, we introduce the G2,1-norm sparse regularizer [27] to preserve the important views and the l2,1-norm sparse regularizer [7] to select the discriminative features from the important views. Specifically, the G2,1-norm regularizer is defined as \(\|W\|_{G_{2,1}}={\sum }_{v=1}^{m}\|W_{v}\|_{F}\), which divides the features into m blocks according to the views and applies an l1-norm across views. It enforces block-sparsity on W to eliminate the irrelevant views. The l2,1-norm regularizer ∥W∥2,1 is defined as \(\|W\|_{2,1}={\sum }_{v=1}^{m}\|W_{v}\|_{2,1}={\sum }_{i=1}^{l} \|\mathbf {w}^{i}\|_{2}\), where \(\mathbf{w}^{i}\) is the i th row of W, and it forces the rows of W that correspond to irrelevant features to be zero or close to zero. Thus, the following optimization model is built to implement multiview feature selection:

$$ \begin{array}{@{}rcl@{}} \min\limits_{W,\theta_{v}}&&\sum\limits_{v=1}^{m} (\theta_{v})^{p} \Vert X_{v}^{\top} W_{v}-Y\Vert_{2,1}+\lambda_{1}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{2,1}+\lambda_{2}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{F} \\ \text{s.t.}&&\sum\limits_{v=1}^{m}{\theta_{v}}=1, 0\leq \theta_{v} \leq 1 \end{array} $$
(2)
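For reference, the two regularizers in (2) can be evaluated directly from the per-view blocks of W; the sketch below is a minimal illustration assuming NumPy and hypothetical view sizes.

```python
import numpy as np

def l21_norm(W):
    # ||W||_{2,1} = sum_i ||w^i||_2 : sum of the l2-norms of the rows of W
    return np.sum(np.linalg.norm(W, axis=1))

def g21_norm(W_blocks):
    # ||W||_{G_{2,1}} = sum_v ||W_v||_F : sum of the Frobenius norms of the per-view blocks
    return sum(np.linalg.norm(Wv, 'fro') for Wv in W_blocks)

# hypothetical example: m = 3 views with 4, 3, and 5 features, c = 2 classes
rng = np.random.default_rng(0)
W_blocks = [rng.standard_normal((lv, 2)) for lv in (4, 3, 5)]
W = np.vstack(W_blocks)
print(l21_norm(W), g21_norm(W_blocks))
```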

Formulation (2) is the newly proposed multiview feature selection model, which imposes a sparse projection matrix from the aspects of both views and features. The loss term in (2) can be rewritten as \({\sum }_{v=1}^{m} (\theta _{v})^{p} {\sum }_{i=1}^{N}\Vert \mathbf {x}_{iv}^{\top } W_{v}-\mathbf {y}^{i}\Vert _{2}\), where the residual \(\Vert \mathbf {x}_{iv}^{\top } W_{v}-\mathbf {y}^{i}\Vert _{2}\) is adopted instead of the squared one \(\Vert \mathbf {x}_{iv}^{\top } W_{v}-\mathbf {y}^{i}{\Vert _{2}^{2}}\), which induces robustness and reduces the impact of outliers. In this model, one sample is predicted to the same label in all views, but the residuals vary across views. Such a strategy has the advantage of taking both the complementary property of different views and the specificity of each view into consideration.
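The rewritten loss term can likewise be evaluated view by view; a minimal sketch (assuming NumPy arrays shaped as in Section 3.1, i.e. X_v of size l_v × N, W_v of size l_v × c, and Y of size N × c) is given below.

```python
import numpy as np

def weighted_l21_loss(X_blocks, W_blocks, Y, theta, p):
    # sum_v (theta_v)^p * sum_i ||x_iv^T W_v - y^i||_2 : unsquared row residuals per view
    loss = 0.0
    for Xv, Wv, tv in zip(X_blocks, W_blocks, theta):
        R = Xv.T @ Wv - Y                      # N x c residual matrix of the v-th view
        loss += (tv ** p) * np.sum(np.linalg.norm(R, axis=1))
    return loss
```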

3.3 Optimization algorithm

The Augmented Lagrangian Multiplier (ALM) method [2, 8] is used to solve the objective function (2). First, m slack variables Ev(v = 1,2,…,m) are introduced, and then \( X_{v}^{\top } W_{v}-Y\) is replaced with Ev. So (2) can be reformulated as:

$$ \begin{array}{@{}rcl@{}} \min\limits_{W_{v},\theta_{v},E_{v}}&&\sum\limits_{v=1}^{m} (\theta_{v})^{p} \Vert E_{v}\Vert_{2,1}+\lambda_{1}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{2,1}+\lambda_{2}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{F} \\ \text{s.t.}&& X_{v}^{\top} W_{v}-Y-E_{v}=0, v=1,2,\ldots,m\\ &&\sum\limits_{v=1}^{m}{\theta_{v}}=1, 0\leq \theta_{v} \leq 1 \end{array} $$
(3)

The corresponding augmented Lagrangian function is:

$$ \begin{array}{@{}rcl@{}} L(W_{v},E_{v},\theta_{v})&=&\sum\limits_{v=1}^{m} (\theta_{v})^{p} \Vert E_{v} \Vert_{2,1} + \lambda_{1} \sum\limits_{v=1}^{m} \Vert W_{v} \Vert_{2,1} +\lambda_{2} \sum\limits_{v=1}^{m} \Vert W_{v}\Vert_{F} \\ &&+\sum\limits_{v=1}^{m} tr[{\Lambda}_{v}^{\top}(X_{v}^{\top}W_{v}-Y-E_{v})] \\ &&+\sum\limits_{v=1}^{m} \frac{\mu}{2} \Vert X_{v}^{\top} W_{v}-Y-E_{v}{\Vert_{F}^{2}} \end{array} $$
(4)

where Λv is the Lagrange multiplier, and μ is a penalty parameter. Since

$$ \begin{array}{@{}rcl@{}} && \sum\limits_{v=1}^{m}tr[{\Lambda}_{v}^{\top}(X_{v}^{\top}W_{v}-Y-E_{v})]+\sum\limits_{v=1}^{m}\frac{\mu}{2}\Vert X_{v}^{\top} W_{v}-Y-E_{v}{\Vert_{F}^{2}} \\ &=&\sum\limits_{v=1}^{m} \frac{\mu}{2} \Vert X_{v}^{\top} W_{v}-Y-E_{v}+\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}}-\sum\limits_{v=1}^{m}\frac{\mu}{2}\Vert\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}} \end{array} $$
(5)

the augmented Lagrangian function is equivalent to the following form:

$$ \begin{array}{@{}rcl@{}} L(W_{v},E_{v},\theta_{v})&=&\sum\limits_{v=1}^{m} (\theta_{v})^{p} \Vert E_{v}\Vert_{2,1}+\lambda_{1}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{2,1}+\lambda_{2}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{F} \\ &&+\sum\limits_{v=1}^{m} \frac{\mu}{2} \Vert X_{v}^{\top} W_{v}-Y-E_{v}+\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}}-\sum\limits_{v=1}^{m}\frac{\mu}{2}\Vert\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}} \end{array} $$
(6)

Next, the specific iterative process of solving (3) is given.

First step::

Fix Ev and 𝜃v, update Wv. When Ev and 𝜃v are fixed, Wv is the only variable, and its solution can be obtained by solving the following optimization problem:

$$ \begin{array}{@{}rcl@{}} \min\limits_{W}&&\lambda_{1}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{2,1}+\lambda_{2}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{F} +\sum\limits_{v=1}^{m} \frac{\mu}{2} \Vert X_{v}^{\top} W_{v}-Y-E_{v}+\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}} \end{array} $$
(7)

Setting the derivative with respect to Wv to zero, we have

$$ \begin{array}{@{}rcl@{}} \lambda_{1}D_{1v}W_{v}+\lambda_{2}D_{2v}W_{v}+\mu X_{v}(X_{v}^{\top}W_{v}-Y-E_{v}+\frac{1}{\mu}{\Lambda}_{v}) =0 \end{array} $$
(8)

where \(D_{1v}\in R^{l_{v} \times l_{v}}\) and \(D_{2v}\in R^{l_{v} \times l_{v}}\) are diagonal matrices defined below, and \(\mathbf {w}_{v}^{i}\) is the i th row of Wv.

$$ \begin{array}{@{}rcl@{}} D_{1v}= \begin{bmatrix} \frac{1}{2\| \mathbf{w}_{v}^{1} \|_{2}} & & \\ & \ddots& &\\ & & \frac{1}{2\|\mathbf{w}_{v}^{l_{v}}\|_{2}} \end{bmatrix} &D_{2v}= \begin{bmatrix} \frac {1}{2\Vert W_{v}\Vert_{F}} & & \\ & \ddots& &\\ & & \frac {1}{2\Vert W_{v}\Vert_{F}} \end{bmatrix} \end{array} $$
(9)

According to (8), we obtain

$$ \begin{array}{@{}rcl@{}} W_{v}=\mu(\lambda_{1}D_{1v}+\lambda_{2}D_{2v}+\mu X_{v}X_{v}^{\top})^{-1}X_{v}(Y+E_{v}-\frac{1}{\mu}{\Lambda}_{v}) \end{array} $$
(10)
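A minimal sketch of this update step (assuming NumPy; Lv stands for the multiplier Λv, and the small eps that guards against zero rows is an implementation assumption not discussed in the paper) could look as follows.

```python
import numpy as np

def update_Wv(Xv, Y, Ev, Lv, Wv, mu, lam1, lam2, eps=1e-8):
    # Closed-form update (10). D_1v and D_2v in (9) depend on W_v, so they are
    # built from the W_v of the previous iteration; eps avoids division by zero
    # for all-zero rows (an implementation assumption).
    lv = Wv.shape[0]
    row_norms = np.linalg.norm(Wv, axis=1) + eps                 # ||w_v^i||_2
    D1 = np.diag(1.0 / (2.0 * row_norms))                        # D_1v
    D2 = np.eye(lv) / (2.0 * (np.linalg.norm(Wv, 'fro') + eps))  # D_2v
    A = lam1 * D1 + lam2 * D2 + mu * (Xv @ Xv.T)
    B = Xv @ (Y + Ev - Lv / mu)
    return mu * np.linalg.solve(A, B)
```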
Second step::

Fix Wv and 𝜃v, update Ev

When Wv and 𝜃v are fixed, Ev is the only variable, and its solution can be obtained by solving the following optimization problem:

$$ \begin{array}{@{}rcl@{}} \min\limits_{E_{v}}&&\sum\limits_{v=1}^{m} \left( \frac{(\theta_{v})^{p} }{\mu} \Vert E_{v}\Vert_{2,1}+ \frac{1}{2} \Vert X_{v}^{\top} W_{v}-Y-E_{v}+\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}}\right) \end{array} $$
(11)

Define \(J_{v}=\frac {(\theta _{v})^{p} }{\mu } \Vert E_{v}\Vert _{2,1}+ \frac {1}{2} \Vert X_{v}^{\top } W_{v}-Y-E_{v}+\frac {1}{\mu }{\Lambda }_{v}{\Vert _{F}^{2}}\). Since Jv is only related to the v th view, the problem (11) can be decomposed into m sub-optimization problems:

$$ \begin{array}{@{}rcl@{}} \min\limits_{E_{v}}J_{v}, v=1,2,\ldots,m \end{array} $$
(12)

Denote \(G_{v}=X_{v}^{\top }W_{v}-Y+\frac {1}{\mu }{\Lambda }_{v}\), then (12) can be rewritten as

$$ \begin{array}{@{}rcl@{}} \min\limits_{E_{v}}&&\frac{(\theta_{v})^{p}}{\mu} \Vert E_{v}\Vert_{2,1}+\frac{1}{2} \Vert E_{v}- G_{v} {\Vert_{F}^{2}} \end{array} $$
(13)

Furthermore, (13) can be converted into

$$ \begin{array}{@{}rcl@{}} \min\limits_{\mathbf{e}_{v}^{i}}&&\sum\limits_{i=1}^{N} \left( \frac{(\theta_{v})^{p}}{\mu} \Vert \mathbf{e}_{v}^{i}\Vert_{2}+\frac{1}{2} \Vert \mathbf{e}_{v}^{i}- \mathbf{g}_{v}^{i} {\Vert_{2}^{2}}\right) \end{array} $$
(14)

where \(\mathbf {e}_{v}^{i}\) and \(\mathbf {g}_{v}^{i}\) are the i th rows of the matrices Ev and Gv, respectively. According to [2], the solution to (14) is

$$ \begin{array}{@{}rcl@{}} \mathbf{e}_{v}^{i}= \left\{ {\begin{array}{*{20}{l}} \left( 1-\frac{(\theta_{v})^{p}}{\mu \| \mathbf{g}_{v}^{i}\|_{2} }\right)\mathbf{g}_{v}^{i},& \| \mathbf{g}_{v}^{i}\|_{2}> \frac{(\theta_{v})^{p}}{\mu}\\ \mathbf{0} ,&\| \mathbf{g}_{v}^{i}\|_{2}\leq \frac{(\theta_{v})^{p}}{\mu} \end{array}} \right. \end{array} $$
(15)
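In code, the shrinkage rule (15) is applied row by row to Gv; the following minimal sketch assumes NumPy and adds a small guard against zero rows, which is an implementation assumption.

```python
import numpy as np

def update_Ev(Xv, Wv, Y, Lv, mu, theta_v, p):
    # Row-wise shrinkage (15) applied to G_v = X_v^T W_v - Y + Lambda_v / mu
    G = Xv.T @ Wv - Y + Lv / mu
    tau = (theta_v ** p) / mu
    norms = np.linalg.norm(G, axis=1)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))  # 0 when ||g_v^i||_2 <= tau
    return scale[:, None] * G
```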
Third step::

Fix Ev and Wv, update 𝜃v. When Ev and Wv are fixed, 𝜃v is the only variable, and its solution can be obtained by solving the following problem:

$$ \begin{array}{@{}rcl@{}} \min\limits_{\theta_{v}}&&\sum\limits_{v=1}^{m}(\theta_{v})^{p} \Vert E_{v}\Vert_{2,1} \\ \text{s.t.}&&\sum\limits_{v=1}^{m}{\theta_{v}}=1, 0\leq \theta_{v} \leq 1 \end{array} $$
(16)

The Lagrangian function of (16) is

$$ \begin{array}{@{}rcl@{}} L(\theta_{v},\eta)=\sum\limits_{v=1}^{m}(\theta_{v})^{p}\Vert E_{v}\Vert_{2,1}-\eta\left( \sum\limits_{v=1}^{m}\theta_{v}-1\right) \end{array} $$
(17)

where η is the Lagrange multiplier. By setting the derivatives of (17) with respect to 𝜃v and η to 0, respectively, we have

$$ \begin{array}{@{}rcl@{}} \left\{ {\begin{array}{*{20}{l}} p(\theta_{v})^{p-1}\Vert E_{v} \Vert_{2,1}-\eta=0 \\ \sum\limits_{v=1}^{m}\theta_{v}=1 \end{array}} \right. \end{array} $$
(18)

Thus 𝜃v can be obtained by the following equation:

$$ \begin{array}{@{}rcl@{}} \theta_{v}=\frac{\left( \frac{1}{\Vert E_{v}\Vert_{2,1}}\right)^{\frac{1}{p-1}}}{\sum\limits_{v=1}^{m}\left( \frac{1}{\Vert E_{v}\Vert_{2,1}}\right)^{\frac{1}{p-1}}} \end{array} $$
(19)
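The weight update (19) amounts to normalizing the per-view residual norms; a minimal NumPy sketch (the eps guard for a zero residual is an implementation assumption) follows.

```python
import numpy as np

def update_theta(E_blocks, p, eps=1e-12):
    # Closed-form view weights (19): theta_v proportional to (1 / ||E_v||_{2,1})^{1/(p-1)}
    e = np.array([np.sum(np.linalg.norm(Ev, axis=1)) for Ev in E_blocks])  # ||E_v||_{2,1}
    w = (1.0 / (e + eps)) ** (1.0 / (p - 1))
    return w / w.sum()
```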

Variables Wv, 𝜃v, Ev are updated iteratively according to the above three steps, and the iteration stops when the number of iterations exceeds 20 or the difference between two successive objective function values is less than 10^{-3}. The specific process is summarized in Algorithm 1.

Algorithm 1 The iterative algorithm for solving (3)
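The three update steps can be assembled into the following minimal sketch of the overall procedure, reusing the update_Wv, update_Ev, and update_theta helpers sketched above; the initialization, the multiplier update, and the μ schedule follow common ALM practice and are assumptions rather than details taken from Algorithm 1.

```python
import numpy as np

def multiview_fs(X_blocks, Y, lam1, lam2, p, mu=1.0, rho=1.05, max_iter=20, tol=1e-3):
    # ALM-style solver for (3), reusing update_Wv, update_Ev, and update_theta above.
    # Initialization, the multiplier update, and the mu schedule are assumptions.
    m, c = len(X_blocks), Y.shape[1]
    W = [np.ones((Xv.shape[0], c)) for Xv in X_blocks]
    E = [Xv.T @ Wv - Y for Xv, Wv in zip(X_blocks, W)]
    L = [np.zeros_like(Y) for _ in range(m)]
    theta = np.full(m, 1.0 / m)
    prev_obj = np.inf
    for _ in range(max_iter):
        W = [update_Wv(Xv, Y, Ev, Lv, Wv, mu, lam1, lam2)
             for Xv, Ev, Lv, Wv in zip(X_blocks, E, L, W)]
        E = [update_Ev(Xv, Wv, Y, Lv, mu, tv, p)
             for Xv, Wv, Lv, tv in zip(X_blocks, W, L, theta)]
        theta = update_theta(E, p)
        L = [Lv + mu * (Xv.T @ Wv - Y - Ev)
             for Lv, Xv, Wv, Ev in zip(L, X_blocks, W, E)]
        mu *= rho
        # objective (2), used for the stopping rule described in the text
        obj = sum((tv ** p) * np.sum(np.linalg.norm(Xv.T @ Wv - Y, axis=1))
                  + lam1 * np.sum(np.linalg.norm(Wv, axis=1))
                  + lam2 * np.linalg.norm(Wv, 'fro')
                  for tv, Xv, Wv in zip(theta, X_blocks, W))
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return np.vstack(W), theta
```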

3.4 The analysis of convergence and complexity

In this subsection, the convergence analysis and complexity analysis of the proposed algorithm are presented.

First, the convergence of Algorithm 1 is demonstrated by referring to [13, 18]. To this end, the following lemma is given first [43].

Lemma 1

For any two positive values a and b, the following inequality holds.

$$ \begin{array}{@{}rcl@{}} \sqrt a - \frac{a}{2\sqrt b}\leq \sqrt b - \frac{b}{2\sqrt b} \end{array} $$
(20)

Theorem 1

Algorithm 1 decreases the objective value monotonically in each iteration until convergence.

The proof is given in the Appendix.

Next, the complexity of the proposed algorithm and the related algorithms is analyzed. During the solution process, the update of W is decomposed into m subproblems which can be solved independently. Compared with directly solving for W iteratively, whose complexity is \(O(l^{3}+l^{2}N+l^{2}+lc+lNc)\), the complexity of the Wv-update reduces to \(O({l_{v}^{3}}+{l_{v}^{2}}N+{l_{v}^{2}}+l_{v}c+l_{v}Nc)\), where l and lv are the data dimensions of all views and of the v th view, respectively, and \(l={\sum }_{v=1}^{m} l_{v}\); N is the number of samples; c is the number of categories. Including the updates of W, E, 𝜃, and Λ, the total computational complexity of Algorithm 1 is \(O(t({\sum }_{v=1}^{m}({l_{v}^{3}}+{l_{v}^{2}}N+{l_{v}^{2}}+l_{v}c+l_{v}Nc+Nc)))\), where t is the number of iterations. The complexity of the related methods is as follows:

RFS [18]: \(O(t(l^{3}+N^{3}+l^{2}N+l(c+Nc+N^{2})))\)

SSMVFS [25]: \(O(t(l^{3}+l^{2}(N+c)+l(c+Nc)))\)

LHFS [4]: \(O(t(l^{3}+l^{2}N+l(N^{2}+Nc)))\)

SHLFS [10]: \(O(t(l^{3}+l^{2}N+l(N^{2}+Nc+N+c)))\)

SCM [13]: \(O(t(l^{3}+l^{2}N+l(N^{2}+Nc+c)))\)

ADMC [34]: \(O(t(l^{3}+l^{2}N+l(N^{2}+Nc)+Nc))\)

Compared with these related methods, the proposed method has a lower computational complexity.

4 Experiment

A series of experiments on the widely used multiview datasets are conducted to evaluate the proposed method. First the experimental setup is described in detail, and then the performance of the proposed method is compared with several state-of-the-art feature selection methods.

4.1 Experimental setup

To evaluate the effectiveness of the proposed method, several popular feature selection methods are compared: RFS [18], SSMVFS [25], LHFS [4], SHLFS [10], SCM [13], and ADMC [34]. The grid search is employed to find the optimal parameters. The comparison methods and the ranges of the involved parameters are as follows.

  • AllFea: Using all raw features without feature selection for classification.

  • RFS: It is a typical robust feature selection method which employs the l2,1-norm on both the regularizer and the penalty term. The range of the involved parameter γ is \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\).

  • SSMVFS: It is a multiview clustering feature selection method which combines both the G1-norm and the l2,1-norm as regularizers to enforce sparsity of the projection matrix. It can also perform classification when the labels of samples are available. The ranges of the involved parameters γ1 and γ2 are both \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\).

  • LHFS: It is a multiview feature selection method which employs the low-rank, hypergraph, and l2,1-norm regularizers to select features. The ranges of the involved parameters α and λ are both \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\).

  • SHLFS: It is a robust unsupervised self-representation feature selection method which uses the l2,1-norm as both the loss term and the regularizer to select features. The ranges of the involved parameters α and β are both \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\).

  • SCM: It is a robust feature selection method with the capped l2-norm penalty which can reduce the influence of outliers. The range of the involved parameter λ is \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\), and ε is searched from {0,0.1,0.2,…,1}.

  • ADMC: It is a multiview feature selection model based on adaptive-weighting discriminative regression. The range of the involved parameter α is \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\).

  • Our method: The involved parameters λ1 and λ2 are searched from \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\), and the parameter p is searched from {5,10,15,20,25}.

The detailed descriptions of the multiview datasets for experiments are as follows:

  • Internet Advertisements Dataset (Ads): It is a collection of advertisements on the Internet, including 3279 samples with 1557 original features. Three features (height, width, and attributes) are eliminated from all samples since they are absent in some samples. Each sample includes 457, 495, 472, 111, and 19 attributes in the URL, origurl, ancurl, alt, and caption terms, respectively.

  • Multiple Features Dataset: It consists of 2000 samples of handwritten digits (0-9) with 649 features. The features are described from 6 different views, including 240 pixel averages in 2 × 3 windows, 216 profile correlations, 76 Fourier coefficients, 64 Karhunen-Loève coefficients, 47 Zernike moments, and 6 morphological features.

  • NUS-WIDE-OBJECT: It is a collection of web images with 31 object categories, including 30000 samples with 1134 features. Each sample is described by six types of low-dimensional features: 64-D color histogram, 73-D edge direction histogram, 144-D color correlogram, 128-D wavelet texture, 225-D block-wise color moments, and 500-D bag of words based on SIFT descriptions. We construct three datasets called NWO1, NWO2, and NWO3 for the experiments. NWO1 contains all samples whose names start with the letter b, 4616 samples in total. NWO2 contains all vehicle images, 5346 samples in total. NWO3 contains all animal images, 8339 samples in total.

  • Animal: It is a dataset of photos of 50 animal categories, including 30457 samples with 10940 features. Each sample is described by 6 types of low-dimensional features: 252-D PHOG, 2000-D SIFT, 2000-D rgSIFT, 2000-D SURF, 2000-D local self-similarity histograms, and 2688-D RGB color histograms. We select the first five categories and construct the experimental dataset by randomly selecting 100 samples from each category, yielding 500 samples in total.

In the experiments, the total l features are sorted in descending order based on the values of \(\|\mathbf{w}^{j}\|_{2}, j=1,\cdots,l\), and then the top-R features are selected as the inputs of the subsequent 1-nearest neighbor classifier (1NN) in all experiments, where R = ⌊ratio × l⌋ with the ratio varying from 0.1 to 0.9. The 5-fold cross validation is employed, and the averaged classification accuracies and the averaged macroF1 scores are reported. The experiments are implemented in MATLAB 2014b on a machine running Linux (kernel 4.10.0-38-generic) with a 3.30GHz CPU and 8GB RAM.
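For illustration, the feature ranking and 1NN evaluation described above might be implemented as follows; this is a sketch assuming scikit-learn and sample-by-feature data matrices, not the original MATLAB code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def select_and_classify(W, X_train, y_train, X_test, y_test, ratio=0.1):
    # Rank features by the l2-norm of the corresponding rows of W, keep the top-R
    # (R = floor(ratio * l)), and classify with a 1-nearest-neighbor classifier.
    # Data matrices here are sample-by-feature, following the scikit-learn convention.
    scores = np.linalg.norm(W, axis=1)              # ||w^j||_2 for each feature j
    R = int(np.floor(ratio * W.shape[0]))
    idx = np.argsort(scores)[::-1][:R]              # indices of the top-R features
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X_train[:, idx], y_train)
    return clf.score(X_test[:, idx], y_test)
```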

4.2 Experimental results and discussions

First, to explore the effectiveness of each term in the proposed model, the model is divided into three sub-models. The first sub-model consists of only the loss term of (2) and is denoted "Loss only". The second sub-model, called "Loss + l2,1-norm", consists of the loss term and the l2,1-norm regularizer of (2). The third sub-model, called "Loss + G2,1-norm", combines the loss term and the G2,1-norm regularizer of (2). The comparison experiments between the three sub-models and the proposed integral model of (2) are conducted on the five employed datasets. For each model, the highest results over all ratios are reported in Tables 1 and 2. According to the experimental results, we can get the following findings:

  (1) The results of the sub-models "Loss + l2,1-norm" and "Loss + G2,1-norm" are both better than those of the sub-model "Loss only", which proves the effectiveness of the two regularizers.

  (2) The results of the proposed integral model are better than those of any sub-model, which shows the validity of integrating the loss term and the regularizers.

Table 1 The best classification accuracies of the sub-models and integral model
Table 2 The best macroF1 scores of the sub-models and integral model

Next, the comparison experiments among the eight methods are carried out on the five employed datasets. For each method, the performance at the ratio of selected features that achieves its best result is reported in Tables 3 and 4. According to the experimental results, we can get the following findings:

  (1) All feature selection methods perform better than AllFea on the five datasets, which means redundancy exists in the involved datasets.

  (2) Note that RFS, SHLFS, SCM, and our method are all robust feature selection methods. The classification accuracies and macroF1 scores of RFS, SCM, and our method are better than those of SHLFS on all datasets except the Multiple Features dataset, which is most likely because SHLFS is an unsupervised feature selection method that does not utilize the label information and takes no consideration of the relationship between views.

  (3) Note that SSMVFS, LHFS, ADMC, and our method are all multiview feature selection methods. SSMVFS, ADMC, and our method perform better than LHFS in most cases, since SSMVFS, ADMC, and our method take the correlation between views into account.

  (4) Our method obtains the best classification accuracies on all datasets except the Ads dataset, and obtains the best macroF1 scores on the Multiple Features, NWO1, and NWO3 datasets, which might be attributed to the simultaneous consideration of the label information, the relationship between views, and the effect of outliers. Besides, it achieves the second best classification accuracies on the Ads dataset, and the second best macroF1 scores on the Ads and NWO2 datasets.

Table 3 The best classification accuracies of the compared methods
Table 4 The best macroF1 scores of the compared methods

Then, the experiments are carried out to compare the performance of these feature selection methods when the ratio of selected features varies from 0.1 to 0.9. Figures 1 and 2 show the classification accuracies and macroF1 scores under the optimal parameters, respectively. The curve of AllFea is a straight line in the figures since it does not perform feature selection, and it can be regarded as the baseline. According to these two figures, the following findings can be obtained:

  (1) The performance of these feature selection methods under classification accuracy is consistent with that under the macroF1 score on all datasets except the Ads dataset. In most cases, the feature selection methods outperform AllFea, which shows the necessity of feature selection.

  (2) In most cases, LHFS and SHLFS do not achieve good performance when the selected features are few, but they perform better as the number of selected features increases. The performance of RFS and SCM is comparable to that of the multiview feature selection methods SSMVFS and ADMC, because RFS and SCM can effectively handle outliers.

  (3) As a robust multiview feature selection method that considers the complementarity and specificity of views, the proposed method shows the best performance in most cases.

Fig. 1 Classification accuracies under the different ratios of selected features

Fig. 2 MacroF1 scores under the different ratios of selected features

Compared with other methods, the biggest improvement of the proposed method is that it reduces the computational complexity by dividing the original high-dimensional problem into several low-dimensional problems. This advantage is much more apparent when dealing with high-dimensional data. Therefore, a further experiment is conducted on the Animal dataset with 10940 features to demonstrate this superiority. In the experiment, the top-1094 features are selected as inputs. Table 5 records the time consumed in obtaining the projection matrix under the optimal parameters, together with the corresponding classification accuracies and macroF1 scores of the different feature selection methods. According to Table 5, it can be seen that the proposed method takes the least time and obtains the highest classification accuracy and macroF1 score. Besides, LHFS and SHLFS take the longest time because both of them need to calculate the singular value decomposition of matrices of size d × d.

Table 5 The classification accuracies (CA) and macroF1 scores (F1) of the compared methods on the Animal dataset

4.3 Parameter analysis

The proposed model contains three parameters, where the exponential parameter p controls the distribution of the view weights 𝜃v, and the regularization parameters λ1 and λ2 control the sparsity of the projection matrix W. The effect of these parameters is analyzed when the ratio of selected features is 0.1. First, the influence of p is discussed when both parameters λ1 and λ2 are fixed to the optimal values. Figure 3 shows the performance of the proposed method when p varies in the range of {5,10,15,20,25}. It can be seen that the proposed method is insensitive to p on the involved datasets in the given range. Specifically, the best performance is obtained with p = 10 in almost all cases. Next, the effects of parameters λ1 and λ2 are tested when the parameter p is fixed to the optimal value. The ranges of λ1 and λ2 are both \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\), and the classification accuracies and macroF1 scores are shown in Figs. 4 and 5, respectively. It can be seen that the performance is stable on the Ads dataset, but it is sensitive to the alterations of the parameters λ1 and λ2 on the rest of the datasets, especially on the NWO1 and NWO3 datasets. Moreover, higher performance can be achieved when \(\lambda_{1}\in[10^{-5},10^{-1}]\) and \(\lambda_{2}\in[10^{-5},10^{2}]\).

Fig. 3 The performance of the proposed method with different values of p

Fig. 4 The classification accuracy of the proposed method with the parameters λ1 and λ2 varying in the range of \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\)

Fig. 5 The macroF1 score of the proposed method with the parameters λ1 and λ2 varying in the range of \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\)

4.4 Analysis for the convergence

Now the convergence of Algorithm 1 is analyzed. The values of er in Algorithm 1 are recorded under the optimal values of λ1, λ2, and p when the ratio of selected features is 0.1. The results are shown in Fig. 6, where the x-axis and y-axis denote the number of iterations and the values of er, respectively. It can be seen that Algorithm 1 has a fast convergence speed, and the values of er decrease nearly to zero within twenty iterations.

Fig. 6 Convergence curves of Algorithm 1

5 Conclusion

In this paper, a supervised feature selection method has been proposed that combines robustness with the complementarity and specificity of views in a unified framework. Specifically, the proposed method uses the l2,1-norm as the loss term to induce robustness, adopts a view-weighted scheme to integrate the complementary property and specificity of different views, and utilizes two types of regularizers to select the relevant views and features. An iterative algorithm is given in which the objective is solved via several small-scale subproblems in each iteration. Comparison experiments with several state-of-the-art feature selection methods on multiple widely used multiview datasets demonstrate the effectiveness of the proposed method. Since the manifold regularization term preserves the geometric structure of data and explores the data distribution information, how to incorporate manifold learning into the model will be future work.