1 Introduction

With the extensive development of information technology, the data collected from practical applications are usually high dimensional. How to learn from such high dimensional data effectively is an important issue in machine learning [1, 5]. As is well known, high dimensionality brings great difficulties to data processing, such as high computational complexity and an increased probability of over-fitting. To strengthen the discrimination of models built on high dimensional data, it is necessary to eliminate redundant and irrelevant features. Therefore, dimensionality reduction techniques have been proposed to find an optimal low-dimensional feature set to represent the original data.

Feature extraction [6, 42] and feature selection [9, 31] are the two major methods used for dimensionality reduction. Feature extraction reduces the dimensionality of data by constructing new features from the original ones. Tan et al. [23] proposed a novel data-driven rule-based generation approach for fuzzy rule interpolation (FRI) through feature extraction. Zhang et al. [39, 40] proposed robust multiview feature extraction methods through the l2,1-norm constraint. Feature selection selects the relevant feature subset from the raw features without changing the semantics of the data. Compared with feature extraction, feature selection methods can preserve the physical meaning of the original features. According to label availability, feature selection methods are classified into supervised [3, 16], semi-supervised [22, 35], and unsupervised ones [11, 30]. Among the three kinds of methods, the supervised methods are most likely to select a more discriminative feature subset because they can utilize the label information and consider the correlation between labels and features. Instead of the traditional vector-based feature selection that is only suitable for binary classification problems, a series of matrix-based structured sparsity-inducing feature selection (SSFS) methods for tackling multi-class problems has been proposed in recent research. Besides, combined with machine learning algorithms, the matrix-based SSFS methods can be applied to specific areas, including multitask feature selection [28], multilabel feature selection [37], and multiview feature selection [15], etc.

It is well known that describing an object from multiple views rather than a single one is more reliable and informative. For example, the description of a web page from multiple views, including texts, images, and links, provides more information than a description from a single view. Multiview feature learning, which can integrate views to select the relevant feature subset, has attracted widespread attention. In this paper, we focus on supervised matrix-based multiview SSFS. Xiao et al. [32] first proposed a two-view feature selection method. Then Wang et al. [27, 29] presented multiview feature selection methods to overcome the limitation of two views. Later, a number of supervised multiview feature selection methods emerged [26, 38, 43]. In these methods, the G1-norm (or G2,1-norm) regularization is introduced to capture the discriminative views for specific classes (or for all classes), and promising performance has been achieved. However, there still exist some limitations. Usually, these methods are built on the practice that features of different views are concatenated into a long vector, which increases the computational complexity. Besides, the concatenated feature vector ignores the individual statistical characteristics of each view and the different importance of each view to a specific task. In addition, data are often corrupted by outliers in real applications, so it is necessary to build a robust model to mitigate the impact of outliers.

In this paper, we propose a robust supervised weighted multiview feature selection method that incorporates the complementary property and the specificity of views into a unified framework to select the relevant views and features. Specifically, we divide the original high-dimensional space into latent low-dimensional subspaces based on each view and separately impose the weighted penalty in these subspaces instead of concatenating all views. We adopt an individual l2,1-norm penalty for each view, which induces robustness to outliers. To obtain the optimal solution of the derived model, we split the original large scale problem into m (m is the number of views) manageable low-dimensional subproblems with reduced computational complexity, and give an iterative algorithm based on the augmented Lagrangian multiplier method.

In summary, the paper makes the following contributions:

  • The proposed method separately imposes an individual weighted l2,1-norm penalty on each view to enhance the robustness and accumulates all views to construct the loss term, which considers both the complementarity and specificity of views.

  • The high-dimensional problem is split into several low-dimensional problems in the process of optimizing the objective function, which reduces the computational complexity, and the proposed iterative algorithm achieves a fast convergence rate.

  • The proposed method is compared with several state-of-the-art SSFS methods on widely used multiview datasets, and its effectiveness is demonstrated through extensive experiments.

The rest of the paper is organized as follows. Section 2 reviews the related SSFS methods. Section 3 elaborates the proposed method and the iterative algorithm, and the convergence and computational complexity of the iterative algorithm are also analyzed. In Section 4, we carry out the extensive experiments on the multiview datasets to evaluate the performance of the proposed method. Section 5 concludes the paper.

2 Related work

Many matrix-based SSFS methods have been proposed. This section reviews some related work, including robust feature selection methods, multiview feature selection methods, and view-weighted feature selection methods.

RFS [18] is a classical robust SSFS method, in which the l2,1-norm penalty instead of the commonly used Frobenius norm penalty is adopted to mitigate the influence of outliers. Qian et al. [19] proposed a robust unsupervised feature selection (RUFS) method to perform clustering and feature selection simultaneously through nonnegative matrix factorization, local learning, and the l2,1-norm penalty. Du et al. [8] proposed a robust unsupervised feature selection model called RUFSM based on matrix factorization, the l2,1-norm, and a manifold regularization term. In [10], He et al. presented an unsupervised feature selection method called SHLFS by combining the low-rank constraint and self-representation into a unified framework. In [13], Lan et al. proposed a robust method called SCM by introducing the capped l2-norm loss, which can effectively reduce the influence of noise and outliers. There also exist other feature selection methods that involve robust constraints [17, 24, 36, 41]. However, none of these methods is specifically designed to solve multiview problems. The features of different views need to be directly concatenated when applying these methods to multiview data, which ignores the relationship among the different views.

An increasing number of multiview feature selection methods have been proposed to consider the relationship of multiple views. Wang et al. [29] proposed a multi-modal multi-task feature selection method which first introduces the G1-norm regularizer to enforce sparsity among multiple modalities. In [26], Wang et al. proposed a sparse multi-modal feature selection method by using the hinge loss instead of the squared loss. Nie et al. [25] proposed a novel multiview feature selection method called SSMVFS to perform both clustering and classification tasks based on the G1-norm. Besides, the G2,1-norm, which has the ability to select the relevant views, is also used for multiview feature selection. Wang et al. [27] proposed a multiview feature selection method to identify the correlation between genotypes and phenotypes by constructing the sparsity across views via the G2,1-norm. Zhu et al. [43] proposed a block-row sparse multiview feature selection method called MVML through the G2,1-norm regularizer and a least square loss term. Zhang et al. [38] established a discriminative feature selection model via sparse multi-modal learning and the ε-dragging technology. Cheng et al. [4] proposed a novel multiview feature selection method named LHFS, in which the low-rank and hypergraph strategies are introduced to enhance the inherent relationship of data. Even though the above methods consider the relationship among different views through the G1-norm or G2,1-norm regularizer, they consider neither the weights of different views nor the influence of outliers.

To further consider the importance of different views, several multiview schemes have been proposed to learn weights for different views [12, 14, 20, 21, 33]. In [33], Xu et al. proposed a weighted multiview clustering with feature selection method (WMCFS) to improve the clustering accuracy via two weighting schemes. Shi et al. [20, 21] proposed a semi-supervised multiview feature selection method with the structured multiview sparse regularization and the Hessian regularization to boost the feature selection performance. Krishnasamy et al. [12] proposed a semi-supervised multiview multitask feature selection framework by exploring the complementary information from different views in each task and the shared knowledge between the related tasks. Li et al. [14] proposed a multiview manifold regularized feature selection method which exploits the label information, label relationship, data distribution, and relationship among multiple features to perform feature selection. These methods are unsupervised or semi-supervised ones, and they lack consideration of the influence of outliers.

3 Robust weighted multiview feature selection

We first define the notations used in this paper, then describe the new model and give an iterative optimization algorithm to solve it. Finally, the convergence and complexity of the algorithm are analyzed.

3.1 Notations

Upper case letters are used to denote matrices, vectors are denoted by bold lower case letters, and scalars are denoted by lower case letters. For example, for a matrix \(M\in R^{p\times q}\), \(\mathbf{m}^{i}\) and \(\mathbf{m}_{j}\) denote the i th row and the j th column, respectively. The matrix \(X=[\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{N}]\in R^{l\times N}\) represents the set of samples belonging to c classes, where l is the dimension of each sample, N is the total number of samples, and \(\mathbf{x}_{i}\in R^{l\times 1}\) represents the i th sample. X can be expressed in the form of m views and denoted as \(X=[X_{1}^{\top },X_{2}^{\top },\cdots ,X_{m}^{\top }]^{\top } \in R^{l\times N}\), where \(X_{v}=[\mathbf {x}_{1v},\mathbf {x}_{2v},\ldots ,\mathbf {x}_{Nv}]\in R^{l_{v}\times N}\) represents the v th view of the samples. Denote by \(Y=[\mathbf {y}^{1\top },\mathbf {y}^{2\top },\ldots ,\mathbf {y}^{N\top }]^{\top }\in \{0,1\}^{N\times c}\) the label matrix of the N samples. If the i th sample belongs to the j th category, then the j th entry of \(\mathbf{y}^{i}\) equals 1 and the rest equal 0. W is the projection matrix made up of sub-blocks \(W_{v}\in R^{l_{v} \times c}, v=1,\cdots ,m\), that is, \(W=[W_{1}^{\top },W_{2}^{\top },\cdots ,W_{m}^{\top }]^{\top }\in R^{l\times c}\).
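To make these conventions concrete, the following minimal sketch (in Python/NumPy, with hypothetical toy sizes) builds the one-hot label matrix Y and splits a stacked data matrix X into per-view blocks; the helper names are illustrative and not part of the paper.

```python
import numpy as np

def one_hot_labels(labels, c):
    # Build the label matrix Y in {0,1}^{N x c}: Y[i, j] = 1 iff sample i belongs to class j
    Y = np.zeros((len(labels), c))
    Y[np.arange(len(labels)), labels] = 1
    return Y

def split_views(X, view_dims):
    # Split the stacked data matrix X (l x N) into per-view blocks X_v (l_v x N)
    idx = np.cumsum(view_dims)[:-1]
    return np.split(X, idx, axis=0)

# hypothetical toy data: N = 4 samples, three views with 2, 3, and 2 features, c = 2 classes
X = np.arange(7 * 4, dtype=float).reshape(7, 4)
X_blocks = split_views(X, [2, 3, 2])
Y = one_hot_labels([0, 1, 0, 1], c=2)
```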

3.2 Model

Inspired by RFS [18], the l2,1-norm penalty is used to alleviate the influence of outliers. But different from RFS and most multiview feature selection methods, which directly take the high-dimensional data matrix X concatenated from multiple views as the input, we calculate the weighted penalty separately on the latent subspace of each view:

$$ \begin{array}{@{}rcl@{}} \min\limits_{\theta_{v},W}&&\sum\limits_{v=1}^{m}(\theta_{v})^{p} \Vert X_{v}^{\top} W_{v}-Y\Vert_{2,1} \\ \text{s.t.}&&\sum\limits_{v=1}^{m}{\theta_{v}}=1, 0\leq \theta_{v} \leq 1 \end{array} $$
(1)

where 𝜃v is the weight of the v th view, and different weights are assigned to distinguish the different importance of views. The more important the role of a view is, the greater its weight should be. p is the exponential parameter used to control the sparsity of the view weight vector 𝜃, and the influence of p is discussed in the later experiments.

In order to eliminate redundant features, we introduce the G2,1-norm sparse regularizer [27] to preserve the important views and the l2,1-norm sparse regularizer [7] to select the discriminative features from the important views. Specifically, the G2,1-norm regularizer is defined as \(\|W\|_{G_{2,1}}={\sum }_{v=1}^{m}\|W_{v}\|_{F}\), which divides the features into m blocks according to the views and applies an l1-norm across views. It enforces block-sparsity on W to eliminate the irrelevant views. The l2,1-norm regularizer ∥W∥2,1 is defined as \(\|W\|_{2,1}={\sum }_{v=1}^{m}\|W_{v}\|_{2,1}={\sum }_{i=1}^{l} \|\mathbf {w}^{i}\|_{2}\), where \(\mathbf{w}^{i}\) is the i th row of W, and it forces the rows of W that correspond to irrelevant features to be zero or close to zero. Thus, the following optimization model is built to implement multiview feature selection:

$$ \begin{array}{@{}rcl@{}} \min\limits_{W,\theta_{v}}&&\sum\limits_{v=1}^{m} (\theta_{v})^{p} \Vert X_{v}^{\top} W_{v}-Y\Vert_{2,1}+\lambda_{1}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{2,1}+\lambda_{2}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{F} \\ \text{s.t.}&&\sum\limits_{v=1}^{m}{\theta_{v}}=1, 0\leq \theta_{v} \leq 1 \end{array} $$
(2)
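For reference, the two regularizers in (2) can be evaluated directly from the per-view blocks of W; the sketch below is a minimal illustration assuming NumPy and hypothetical view sizes.

```python
import numpy as np

def l21_norm(W):
    # ||W||_{2,1} = sum_i ||w^i||_2 : sum of the l2-norms of the rows of W
    return np.sum(np.linalg.norm(W, axis=1))

def g21_norm(W_blocks):
    # ||W||_{G_{2,1}} = sum_v ||W_v||_F : sum of the Frobenius norms of the per-view blocks
    return sum(np.linalg.norm(Wv, 'fro') for Wv in W_blocks)

# hypothetical example: m = 3 views with 4, 3, and 5 features, c = 2 classes
rng = np.random.default_rng(0)
W_blocks = [rng.standard_normal((lv, 2)) for lv in (4, 3, 5)]
W = np.vstack(W_blocks)
print(l21_norm(W), g21_norm(W_blocks))
```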

Formulation (2) is the newly proposed multiview feature selection model, which imposes a sparse projection matrix from the aspects of both views and features. The loss term in (2) can be rewritten as \({\sum }_{v=1}^{m} (\theta _{v})^{p} {\sum }_{i=1}^{N}\Vert \mathbf {x}_{iv}^{\top } W_{v}-\mathbf {y}^{i}\Vert _{2}\), where the residual \(\Vert \mathbf {x}_{iv}^{\top } W_{v}-\mathbf {y}^{i}\Vert _{2}\) is adopted instead of the squared one \(\Vert \mathbf {x}_{iv}^{\top } W_{v}-\mathbf {y}^{i}{\Vert _{2}^{2}}\), which induces robustness and reduces the impact of outliers. In this model, one sample is predicted to the same label in all views, but the residuals vary across views. Such a strategy has the advantage of taking both the complementary property of different views and the specificity of each view into consideration.
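The rewritten loss term can likewise be evaluated view by view; a minimal sketch (assuming NumPy arrays shaped as in Section 3.1, i.e. X_v of size l_v × N, W_v of size l_v × c, and Y of size N × c) is given below.

```python
import numpy as np

def weighted_l21_loss(X_blocks, W_blocks, Y, theta, p):
    # sum_v (theta_v)^p * sum_i ||x_iv^T W_v - y^i||_2 : unsquared row residuals per view
    loss = 0.0
    for Xv, Wv, tv in zip(X_blocks, W_blocks, theta):
        R = Xv.T @ Wv - Y                      # N x c residual matrix of the v-th view
        loss += (tv ** p) * np.sum(np.linalg.norm(R, axis=1))
    return loss
```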

3.3 Optimization algorithm

The Augmented Lagrangian Multiplier (ALM) method [2, 8] is used to solve the objective function (2). First, m slack variables Ev(v = 1,2,…,m) are introduced, and then \( X_{v}^{\top } W_{v}-Y\) is replaced with Ev. So (2) can be reformulated as:

$$ \begin{array}{@{}rcl@{}} \min\limits_{W_{v},\theta_{v},E_{v}}&&\sum\limits_{v=1}^{m} (\theta_{v})^{p} \Vert E_{v}\Vert_{2,1}+\lambda_{1}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{2,1}+\lambda_{2}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{F} \\ \text{s.t.}&& X_{v}^{\top} W_{v}-Y-E_{v}=0, v=1,2,\ldots,m\\ &&\sum\limits_{v=1}^{m}{\theta_{v}}=1, 0\leq \theta_{v} \leq 1 \end{array} $$
(3)

The corresponding augmented Lagrangian function is:

$$ \begin{array}{@{}rcl@{}} L(W_{v},E_{v},\theta_{v})&=&\sum\limits_{v=1}^{m} (\theta_{v})^{p} \Vert E_{v} \Vert_{2,1} + \lambda_{1} \sum\limits_{v=1}^{m} \Vert W_{v} \Vert_{2,1} +\lambda_{2} \sum\limits_{v=1}^{m} \Vert W_{v}\Vert_{F} \\ &&+\sum\limits_{v=1}^{m} tr[{\Lambda}_{v}^{\top}(X_{v}^{\top}W_{v}-Y-E_{v})] \\ &&+\sum\limits_{v=1}^{m} \frac{\mu}{2} \Vert X_{v}^{\top} W_{v}-Y-E_{v}{\Vert_{F}^{2}} \end{array} $$
(4)

where Λv is the Lagrange multiplier, and μ is a penalty parameter. Since

$$ \begin{array}{@{}rcl@{}} && \sum\limits_{v=1}^{m}tr[{\Lambda}_{v}^{\top}(X_{v}^{\top}W_{v}-Y-E_{v})]+\sum\limits_{v=1}^{m}\frac{\mu}{2}\Vert X_{v}^{\top} W_{v}-Y-E_{v}{\Vert_{F}^{2}} \\ &=&\sum\limits_{v=1}^{m} \frac{\mu}{2} \Vert X_{v}^{\top} W_{v}-Y-E_{v}+\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}}-\sum\limits_{v=1}^{m}\frac{\mu}{2}\Vert\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}} \end{array} $$
(5)

the augmented Lagrangian function is equivalent to the following form:

$$ \begin{array}{@{}rcl@{}} L(W_{v},E_{v},\theta_{v})&=&\sum\limits_{v=1}^{m} (\theta_{v})^{p} \Vert E_{v}\Vert_{2,1}+\lambda_{1}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{2,1}+\lambda_{2}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{F} \\ &&+\sum\limits_{v=1}^{m} \frac{\mu}{2} \Vert X_{v}^{\top} W_{v}-Y-E_{v}+\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}}-\sum\limits_{v=1}^{m}\frac{\mu}{2}\Vert\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}} \end{array} $$
(6)

Next, the specific iterative process of solving (3) is given.

First step::

Fix Ev and 𝜃v, update Wv. When Ev and 𝜃v are fixed, Wv is the only variable, and its solution can be obtained by solving the following optimization problem:

$$ \begin{array}{@{}rcl@{}} \min\limits_{W}&&\lambda_{1}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{2,1}+\lambda_{2}\sum\limits_{v=1}^{m}\Vert W_{v}\Vert_{F} +\sum\limits_{v=1}^{m} \frac{\mu}{2} \Vert X_{v}^{\top} W_{v}-Y-E_{v}+\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}} \end{array} $$
(7)

Setting the derivative with respect to Wv to zero, we have

$$ \begin{array}{@{}rcl@{}} \lambda_{1}D_{1v}W_{v}+\lambda_{2}D_{2v}W_{v}+\mu X_{v}(X_{v}^{\top}W_{v}-Y-E_{v}+\frac{1}{\mu}{\Lambda}_{v}) =0 \end{array} $$
(8)

where \(D_{1v}\in R^{l_{v} \times l_{v}}\) and \(D_{2v}\in R^{l_{v} \times l_{v}}\) are diagonal matrices defined below, and \(\mathbf {w}_{v}^{i}\) is the i th row of Wv.

$$ \begin{array}{@{}rcl@{}} D_{1v}= \begin{bmatrix} \frac{1}{2\| \mathbf{w}_{v}^{1} \|_{2}} & & \\ & \ddots& &\\ & & \frac{1}{2\|\mathbf{w}_{v}^{l_{v}}\|_{2}} \end{bmatrix} &D_{2v}= \begin{bmatrix} \frac {1}{2\Vert W_{v}\Vert_{F}} & & \\ & \ddots& &\\ & & \frac {1}{2\Vert W_{v}\Vert_{F}} \end{bmatrix} \end{array} $$
(9)

According to (8), we obtain

$$ \begin{array}{@{}rcl@{}} W_{v}=\mu(\lambda_{1}D_{1v}+\lambda_{2}D_{2v}+\mu X_{v}X_{v}^{\top})^{-1}X_{v}(Y+E_{v}-\frac{1}{\mu}{\Lambda}_{v}) \end{array} $$
(10)
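A minimal sketch of this update step (assuming NumPy; Lv stands for the multiplier Λv, and the small eps that guards against zero rows is an implementation assumption not discussed in the paper) could look as follows.

```python
import numpy as np

def update_Wv(Xv, Y, Ev, Lv, Wv, mu, lam1, lam2, eps=1e-8):
    # Closed-form update (10). D_1v and D_2v in (9) depend on W_v, so they are
    # built from the W_v of the previous iteration; eps avoids division by zero
    # for all-zero rows (an implementation assumption).
    lv = Wv.shape[0]
    row_norms = np.linalg.norm(Wv, axis=1) + eps                 # ||w_v^i||_2
    D1 = np.diag(1.0 / (2.0 * row_norms))                        # D_1v
    D2 = np.eye(lv) / (2.0 * (np.linalg.norm(Wv, 'fro') + eps))  # D_2v
    A = lam1 * D1 + lam2 * D2 + mu * (Xv @ Xv.T)
    B = Xv @ (Y + Ev - Lv / mu)
    return mu * np.linalg.solve(A, B)
```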
Second step::

Fix Wv and 𝜃v, update Ev

When Wv and 𝜃v are fixed, Ev is the only variable, and its solution can be obtained by solving the following optimization problem:

$$ \begin{array}{@{}rcl@{}} \min\limits_{E_{v}}&&\sum\limits_{v=1}^{m} \left( \frac{(\theta_{v})^{p} }{\mu} \Vert E_{v}\Vert_{2,1}+ \frac{1}{2} \Vert X_{v}^{\top} W_{v}-Y-E_{v}+\frac{1}{\mu}{\Lambda}_{v}{\Vert_{F}^{2}}\right) \end{array} $$
(11)

Define \(J_{v}=\frac {(\theta _{v})^{p} }{\mu } \Vert E_{v}\Vert _{2,1}+ \frac {1}{2} \Vert X_{v}^{\top } W_{v}-Y-E_{v}+\frac {1}{\mu }{\Lambda }_{v}{\Vert _{F}^{2}}\). Since Jv is only related to the v th view, the problem (11) can be decomposed into m sub-optimization problems:

$$ \begin{array}{@{}rcl@{}} \min\limits_{E_{v}}J_{v}, v=1,2,\ldots,m \end{array} $$
(12)

Denote \(G_{v}=X_{v}^{\top }W_{v}-Y+\frac {1}{\mu }{\Lambda }_{v}\), then (12) can be rewritten as

$$ \begin{array}{@{}rcl@{}} \min\limits_{E_{v}}&&\frac{(\theta_{v})^{p}}{\mu} \Vert E_{v}\Vert_{2,1}+\frac{1}{2} \Vert E_{v}- G_{v} {\Vert_{F}^{2}} \end{array} $$
(13)

Furthermore, (13) can be converted into

$$ \begin{array}{@{}rcl@{}} \min\limits_{\mathbf{e}_{v}^{i}}&&\sum\limits_{i=1}^{N} \left( \frac{(\theta_{v})^{p}}{\mu} \Vert \mathbf{e}_{v}^{i}\Vert_{2}+\frac{1}{2} \Vert \mathbf{e}_{v}^{i}- \mathbf{g}_{v}^{i} {\Vert_{2}^{2}}\right) \end{array} $$
(14)

where \(\mathbf {e}_{v}^{i}\) and \(\mathbf {g}_{v}^{i}\) are the i th rows of the matrices Ev and Gv, respectively. According to [2], the solution to (14) is

$$ \begin{array}{@{}rcl@{}} \mathbf{e}_{v}^{i}= \left\{ {\begin{array}{*{20}{l}} \left( 1-\frac{(\theta_{v})^{p}}{\mu \| \mathbf{g}_{v}^{i}\|_{2} }\right)\mathbf{g}_{v}^{i},& \| \mathbf{g}_{v}^{i}\|_{2}> \frac{(\theta_{v})^{p}}{\mu}\\ \mathbf{0} ,&\| \mathbf{g}_{v}^{i}\|_{2}\leq \frac{(\theta_{v})^{p}}{\mu} \end{array}} \right. \end{array} $$
(15)
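In code, the shrinkage rule (15) is applied row by row to Gv; the following minimal sketch assumes NumPy and adds a small guard against zero rows, which is an implementation assumption.

```python
import numpy as np

def update_Ev(Xv, Wv, Y, Lv, mu, theta_v, p):
    # Row-wise shrinkage (15) applied to G_v = X_v^T W_v - Y + Lambda_v / mu
    G = Xv.T @ Wv - Y + Lv / mu
    tau = (theta_v ** p) / mu
    norms = np.linalg.norm(G, axis=1)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))  # 0 when ||g_v^i||_2 <= tau
    return scale[:, None] * G
```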
Third step::

Fix Ev and Wv, update 𝜃v. When Ev and Wv are fixed, 𝜃v is the only variable, and its solution can be obtained by solving the following problem:

$$ \begin{array}{@{}rcl@{}} \min\limits_{\theta_{v}}&&\sum\limits_{v=1}^{m}(\theta_{v})^{p} \Vert E_{v}\Vert_{2,1} \\ \text{s.t.}&&\sum\limits_{v=1}^{m}{\theta_{v}}=1, 0\leq \theta_{v} \leq 1 \end{array} $$
(16)

The Lagrangian function of (16) is

$$ \begin{array}{@{}rcl@{}} L(\theta_{v},\eta)=\sum\limits_{v=1}^{m}(\theta_{v})^{p}\Vert E_{v}\Vert_{2,1}-\eta\left( \sum\limits_{v=1}^{m}\theta_{v}-1\right) \end{array} $$
(17)

where η is the Lagrange multiplier. By setting the derivatives of (17) with respect to 𝜃v and η to 0, respectively, we have

$$ \begin{array}{@{}rcl@{}} \left\{ {\begin{array}{*{20}{l}} p(\theta_{v})^{p-1}\Vert E_{v} \Vert_{2,1}-\eta=0 \\ \sum\limits_{v=1}^{m}\theta_{v}=1 \end{array}} \right. \end{array} $$
(18)

Thus 𝜃v can be obtained by the following equation:

$$ \begin{array}{@{}rcl@{}} \theta_{v}=\frac{\left( \frac{1}{\Vert E_{v}\Vert_{2,1}}\right)^{\frac{1}{p-1}}}{\sum\limits_{v=1}^{m}\left( \frac{1}{\Vert E_{v}\Vert_{2,1}}\right)^{\frac{1}{p-1}}} \end{array} $$
(19)
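The weight update (19) amounts to normalizing the per-view residual norms; a minimal NumPy sketch (the eps guard for a zero residual is an implementation assumption) follows.

```python
import numpy as np

def update_theta(E_blocks, p, eps=1e-12):
    # Closed-form view weights (19): theta_v proportional to (1 / ||E_v||_{2,1})^{1/(p-1)}
    e = np.array([np.sum(np.linalg.norm(Ev, axis=1)) for Ev in E_blocks])  # ||E_v||_{2,1}
    w = (1.0 / (e + eps)) ** (1.0 / (p - 1))
    return w / w.sum()
```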

Variables Wv, 𝜃v, Ev are updated iteratively according to the above three steps, and the iteration stops when the number of iterations exceeds 20 or the difference between two successive objective function values is less than 10^{-3}. The specific process is summarized in Algorithm 1.

Algorithm 1 The iterative algorithm for solving (3)
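The three update steps can be assembled into the following minimal sketch of the overall procedure, reusing the update_Wv, update_Ev, and update_theta helpers sketched above; the initialization, the multiplier update, and the μ schedule follow common ALM practice and are assumptions rather than details taken from Algorithm 1.

```python
import numpy as np

def multiview_fs(X_blocks, Y, lam1, lam2, p, mu=1.0, rho=1.05, max_iter=20, tol=1e-3):
    # ALM-style solver for (3), reusing update_Wv, update_Ev, and update_theta above.
    # Initialization, the multiplier update, and the mu schedule are assumptions.
    m, c = len(X_blocks), Y.shape[1]
    W = [np.ones((Xv.shape[0], c)) for Xv in X_blocks]
    E = [Xv.T @ Wv - Y for Xv, Wv in zip(X_blocks, W)]
    L = [np.zeros_like(Y) for _ in range(m)]
    theta = np.full(m, 1.0 / m)
    prev_obj = np.inf
    for _ in range(max_iter):
        W = [update_Wv(Xv, Y, Ev, Lv, Wv, mu, lam1, lam2)
             for Xv, Ev, Lv, Wv in zip(X_blocks, E, L, W)]
        E = [update_Ev(Xv, Wv, Y, Lv, mu, tv, p)
             for Xv, Wv, Lv, tv in zip(X_blocks, W, L, theta)]
        theta = update_theta(E, p)
        L = [Lv + mu * (Xv.T @ Wv - Y - Ev)
             for Lv, Xv, Wv, Ev in zip(L, X_blocks, W, E)]
        mu *= rho
        # objective (2), used for the stopping rule described in the text
        obj = sum((tv ** p) * np.sum(np.linalg.norm(Xv.T @ Wv - Y, axis=1))
                  + lam1 * np.sum(np.linalg.norm(Wv, axis=1))
                  + lam2 * np.linalg.norm(Wv, 'fro')
                  for tv, Xv, Wv in zip(theta, X_blocks, W))
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return np.vstack(W), theta
```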

3.4 The analysis of convergence and complexity

In this subsection, the convergence analysis and complexity analysis of the proposed algorithm are presented.

First, the convergence of Algorithm 1 is demonstrated by referring to [13, 18]. To this end, the following lemma is given first [43].

Lemma 1

For any two positive values a and b, the following inequality holds.

$$ \begin{array}{@{}rcl@{}} \sqrt a - \frac{a}{2\sqrt b}\leq \sqrt b - \frac{b}{2\sqrt b} \end{array} $$
(20)

Theorem 1

Algorithm 1 decreases the objective value monotonically in each iteration until convergence.

The proof is given in the Appendix.

Next, the complexity of the proposed algorithm and the related algorithms is analyzed. During the solution process, the update of W is decomposed into m subproblems which can be solved independently. Compared with directly solving for W iteratively, whose complexity is \(O(l^{3}+l^{2}N+l^{2}+lc+lNc)\), the complexity of the Wv-update reduces to \(O({l_{v}^{3}}+{l_{v}^{2}}N+{l_{v}^{2}}+l_{v}c+l_{v}Nc)\), where l and lv are the data dimensions of all views and of the v th view, respectively, and \(l={\sum }_{v=1}^{m} l_{v}\); N is the number of samples; c is the number of categories. Including the updates of W, E, 𝜃, and Λ, the total computational complexity of Algorithm 1 is \(O(t({\sum }_{v=1}^{m}({l_{v}^{3}}+{l_{v}^{2}}N+{l_{v}^{2}}+l_{v}c+l_{v}Nc+Nc)))\), where t is the number of iterations. The complexity of the related methods is as follows:

RFS [18]: \(O(t(l^{3}+N^{3}+l^{2}N+l(c+Nc+N^{2})))\)

SSMVFS [25]: \(O(t(l^{3}+l^{2}(N+c)+l(c+Nc)))\)

LHFS [4]: \(O(t(l^{3}+l^{2}N+l(N^{2}+Nc)))\)

SHLFS [10]: \(O(t(l^{3}+l^{2}N+l(N^{2}+Nc+N+c)))\)

SCM [13]: \(O(t(l^{3}+l^{2}N+l(N^{2}+Nc+c)))\)

ADMC [34]: \(O(t(l^{3}+l^{2}N+l(N^{2}+Nc)+Nc))\)

Compared with these related methods, the proposed method has a lower computational complexity.

4 Experiment

A series of experiments on the widely used multiview datasets are conducted to evaluate the proposed method. First the experimental setup is described in detail, and then the performance of the proposed method is compared with several state-of-the-art feature selection methods.

4.1 Experimental setup

To evaluate the effectiveness of the proposed method, several popular feature selection methods are compared: RFS [18], SSMVFS [25], LHFS [4], SHLFS [10], SCM [13], and ADMC [34]. The grid search is employed to find the optimal parameters. The comparison methods and the ranges of the involved parameters are as follows.

  • AllFea: Using all raw features without feature selection for classification.

  • RFS: It is a typical robust feature selection method which employs the l2,1-norm on both the regularizer and the penalty term. The range of the involved parameter γ is \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\).

  • SSMVFS: It is a multiview clustering feature selection method which combines both the G1-norm and the l2,1-norm as regularizers to enforce sparsity of the projection matrix. It can also perform classification when the labels of samples are available. The ranges of the involved parameters γ1 and γ2 are both \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\).

  • LHFS: It is a multiview feature selection method which employs the low-rank, hypergraph, and l2,1-norm regularizers to select features. The ranges of the involved parameters α and λ are both \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\).

  • SHLFS: It is a robust unsupervised self-representation feature selection method which uses the l2,1-norm as both the loss term and the regularizer to select features. The ranges of the involved parameters α and β are both \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\).

  • SCM: It is a robust feature selection method with the capped l2-norm penalty which can reduce the influence of outliers. The range of the involved parameter λ is \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\), and ε is searched from {0,0.1,0.2,…,1}.

  • ADMC: It is a multiview feature selection model based on adaptive-weighting discriminative regression. The range of the involved parameter α is \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\).

  • Our method: The involved parameters λ1 and λ2 are searched from \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\), and the parameter p is searched from {5,10,15,20,25}.

The detailed descriptions of the multiview datasets for experiments are as follows:

  • Internet Advertisements Dataset (Ads): It is a collection of advertisements on the Internet, including 3279 samples with 1557 original features. Three features (height, width, and attributes) are eliminated from all samples since they are absent in some samples. Each sample includes 457, 495, 472, 111, and 19 attributes in the URL, origurl, ancurl, alt, and caption terms, respectively.

  • Multiple Features Dataset: It consists of 2000 samples of handwritten digits (0-9) with 649 features. The features are described from 6 different views, including 240 pixel averages in 2 × 3 windows, 216 profile correlations, 76 Fourier coefficients, 64 Karhunen-Loève coefficients, 47 Zernike moments, and 6 morphological features.

  • NUS-WIDE-OBJECT: It is a collection of web images with 31 object categories, including 30000 samples with 1134 features. Each sample is described by six types of low-dimensional features: 64-D color histogram, 73-D edge direction histogram, 144-D color correlogram, 128-D wavelet texture, 225-D block-wise color moments, and 500-D bag of words based on SIFT descriptions. We construct three datasets called NWO1, NWO2, and NWO3 for the experiments. NWO1 contains all samples whose names start with the letter b, 4616 samples in total. NWO2 contains all vehicle images, 5346 samples in total. NWO3 contains all animal images, 8339 samples in total.

  • Animal: It is a dataset of photos of 50 animal categories, including 30457 samples with 10940 features. Each sample is described by 6 types of low-dimensional features: 252-D PHOG, 2000-D SIFT, 2000-D rgSIFT, 2000-D SURF, 2000-D local self-similarity histograms, and 2688-D RGB color histograms. We select the first five categories and construct the experimental dataset by randomly selecting 100 samples from each category, yielding 500 samples in total.

In the experiments, the total l features are sorted in descending order based on the values of \(\|\mathbf{w}^{j}\|_{2}, j=1,\cdots,l\), and then the top-R features are selected as the inputs of the subsequent 1-nearest neighbor classifier (1NN) in all experiments, where R = ⌊ratio × l⌋ with the ratio varying from 0.1 to 0.9. The 5-fold cross validation is employed, and the averaged classification accuracies and the averaged macroF1 scores are reported. The experiments are implemented in MATLAB 2014b on a machine running Linux (kernel 4.10.0-38-generic) with a 3.30GHz CPU and 8GB RAM.
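For illustration, the feature ranking and 1NN evaluation described above might be implemented as follows; this is a sketch assuming scikit-learn and sample-by-feature data matrices, not the original MATLAB code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def select_and_classify(W, X_train, y_train, X_test, y_test, ratio=0.1):
    # Rank features by the l2-norm of the corresponding rows of W, keep the top-R
    # (R = floor(ratio * l)), and classify with a 1-nearest-neighbor classifier.
    # Data matrices here are sample-by-feature, following the scikit-learn convention.
    scores = np.linalg.norm(W, axis=1)              # ||w^j||_2 for each feature j
    R = int(np.floor(ratio * W.shape[0]))
    idx = np.argsort(scores)[::-1][:R]              # indices of the top-R features
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X_train[:, idx], y_train)
    return clf.score(X_test[:, idx], y_test)
```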

4.2 Experimental results and discussions

First, to explore the effectiveness of each term in the proposed model, the model is divided into three sub-models. The first sub-model consists of only the loss term of (2) and is denoted "Loss only". The second sub-model, called "Loss + l2,1-norm", consists of the loss term and the l2,1-norm regularizer of (2). The third sub-model, called "Loss + G2,1-norm", combines the loss term and the G2,1-norm regularizer of (2). The comparison experiments between the three sub-models and the proposed integral model of (2) are conducted on the five employed datasets. For each model, the highest results over all ratios are reported in Tables 1 and 2. According to the experimental results, we can get the following findings:

  (1) The results of the sub-models "Loss + l2,1-norm" and "Loss + G2,1-norm" are both better than those of the sub-model "Loss only", which proves the effectiveness of the two regularizers.

  (2) The results of the proposed integral model are better than those of any sub-model, which shows the validity of integrating the loss term and the regularizers.

Table 1 The best classification accuracies of the sub-models and integral model
Table 2 The best macroF1 scores of the sub-models and integral model

Next, the comparison experiments among the eight methods are carried out on the five employed datasets. For each method, the performance at the ratio of selected features that achieves its best result is reported in Tables 3 and 4. According to the experimental results, we can get the following findings:

  (1) All feature selection methods perform better than AllFea on the five datasets, which means redundancy exists in the involved datasets.

  (2) Note that RFS, SHLFS, SCM, and our method are all robust feature selection methods. The classification accuracies and macroF1 scores of RFS, SCM, and our method are better than those of SHLFS on all datasets except the Multiple Features dataset, which is most likely because SHLFS is an unsupervised feature selection method that does not utilize the label information and takes no consideration of the relationship between views.

  (3) Note that SSMVFS, LHFS, ADMC, and our method are all multiview feature selection methods. SSMVFS, ADMC, and our method perform better than LHFS in most cases, since SSMVFS, ADMC, and our method take the correlation between views into account.

  (4) Our method obtains the best classification accuracies on all datasets except the Ads dataset, and obtains the best macroF1 scores on the Multiple Features, NWO1, and NWO3 datasets, which might be attributed to the simultaneous consideration of the label information, the relationship between views, and the effect of outliers. Besides, it achieves the second best classification accuracies on the Ads dataset, and the second best macroF1 scores on the Ads and NWO2 datasets.

Table 3 The best classification accuracies of the compared methods
Table 4 The best macroF1 scores of the compared methods

Then, the experiments are carried out to compare the performance of these feature selection methods when the ratio of selected features varies from 0.1 to 0.9. Figures 1 and 2 show the classification accuracies and macroF1 scores under the optimal parameters, respectively. The curve of AllFea is a straight line in the figures since it does not perform feature selection, and it can be regarded as the baseline. According to these two figures, the following findings can be obtained:

  (1) The performance of these feature selection methods under classification accuracy is consistent with that under the macroF1 score on all datasets except the Ads dataset. In most cases, the feature selection methods outperform AllFea, which shows the necessity of feature selection.

  (2) In most cases, LHFS and SHLFS do not achieve good performance when the selected features are few, but they perform better as the number of selected features increases. The performance of RFS and SCM is comparable to that of the multiview feature selection methods SSMVFS and ADMC, because RFS and SCM can effectively handle outliers.

  (3) As a robust multiview feature selection method that considers the complementarity and specificity of views, the proposed method shows the best performance in most cases.

Fig. 1 Classification accuracies under the different ratios of selected features

Fig. 2 MacroF1 scores under the different ratios of selected features

Compared with other methods, the biggest improvement of the proposed method is that it reduces the computational complexity by dividing the original high-dimensional problem into several low-dimensional problems. This advantage is much more apparent when dealing with high-dimensional data. Therefore, a further experiment is conducted on the Animal dataset with 10940 features to demonstrate this superiority. In the experiment, the top-1094 features are selected as inputs. Table 5 records the time consumed in obtaining the projection matrix under the optimal parameters, together with the corresponding classification accuracies and macroF1 scores of the different feature selection methods. According to Table 5, it can be seen that the proposed method takes the least time and obtains the highest classification accuracy and macroF1 score. Besides, LHFS and SHLFS take the longest time because both of them need to calculate the singular value decomposition of matrices of size d × d.

Table 5 The classification accuracies (CA) and macroF1 scores (F1) of the compared methods on the Animal dataset

4.3 Parameter analysis

The proposed model contains three parameters, where the exponential parameter p controls the distribution of the view weights 𝜃v, and the regularization parameters λ1 and λ2 control the sparsity of the projection matrix W. The effect of these parameters is analyzed when the ratio of selected features is 0.1. First, the influence of p is discussed when both parameters λ1 and λ2 are fixed to the optimal values. Figure 3 shows the performance of the proposed method when p varies in the range of {5,10,15,20,25}. It can be seen that the proposed method is insensitive to p on the involved datasets in the given range. Specifically, the best performance is obtained with p = 10 in almost all cases. Next, the effects of parameters λ1 and λ2 are tested when the parameter p is fixed to the optimal value. The ranges of λ1 and λ2 are both \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\), and the classification accuracies and macroF1 scores are shown in Figs. 4 and 5, respectively. It can be seen that the performance is stable on the Ads dataset, but it is sensitive to the alterations of the parameters λ1 and λ2 on the rest of the datasets, especially on the NWO1 and NWO3 datasets. Moreover, higher performance can be achieved when \(\lambda_{1}\in[10^{-5},10^{-1}]\) and \(\lambda_{2}\in[10^{-5},10^{2}]\).

Fig. 3 The performance of the proposed method with different values of p

Fig. 4 The classification accuracy of the proposed method with the parameters λ1 and λ2 varying in the range of \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\)

Fig. 5 The macroF1 score of the proposed method with the parameters λ1 and λ2 varying in the range of \(\{10^{-5},10^{-4},\ldots ,10^{5}\}\)

4.4 Analysis for the convergence

Now the convergence of Algorithm 1 is analyzed. The values of er in Algorithm 1 are recorded under the optimal values of λ1, λ2, and p when the ratio of selected features is 0.1. The results are shown in Fig. 6, where the x-axis and y-axis denote the number of iterations and the values of er, respectively. It can be seen that Algorithm 1 has a fast convergence speed, and the values of er decrease nearly to zero within twenty iterations.

Fig. 6 Convergence curves of Algorithm 1

5 Conclusion

In this paper, a supervised feature selection method has been proposed that combines robustness with the complementarity and specificity of views in a unified framework. Specifically, the proposed method uses the l2,1-norm as the loss term to induce robustness, adopts a view-weighted scheme to integrate the complementary property and specificity of different views, and utilizes two types of regularizers to select the relevant views and features. An iterative algorithm is given in which the objective is solved via several small-scale subproblems in each iteration. Comparison experiments with several state-of-the-art feature selection methods on multiple widely used multiview datasets demonstrate the effectiveness of the proposed method. Since the manifold regularization term preserves the geometric structure of data and explores the data distribution information, how to incorporate manifold learning into the model will be future work.