
1 Introduction

Many real-world entities are represented by multiple views, such as web pages [1, 33], multi-lingual news [2, 8, 16] and neuroimaging [22,23,24]. Consistency and complementarity, as the bridges linking all views together, are the two main assumptions in current multi-view learning [30]. The consistency assumption suggests that there is consistent information shared by all views [3, 18, 31]. However, it is insufficient to exploit multi-view data using only consistent information, since each view also contains complementary knowledge that the other views do not have [1, 9, 19]. Therefore, investigating the complementarity of views is another important paradigm for learning from multi-view data.

However, a natural question arises: does the derived consistent and/or complementary information always support better classification performance? Our answer is no: empirical pre-experiments indicate that on some real-world data sets, prediction performance with multi-view data can be even worse than with single-view data. The main reason is that consistent or complementary information does not directly capture discriminative information. A classifier constructed from multi-view data may therefore perform even worse if the learned consistent and/or complementary information contains no clear discriminative information.

Fig. 1. Illustration of extracting discriminative information from multi-view data via joint non-negative matrix factorization. Each view of the data matrix is a superposition of four different parts: common discriminative part, common non-discriminative part, specific discriminative part and specific non-discriminative part.

In this paper, towards robust multi-view learning, we examine both the discriminative and the non-discriminative information existing in the consistent and complementary parts, and use only the discriminative information for learning. Following this idea, we propose a new multi-view learning algorithm, called DICS, which explores the Discriminative and non-discriminative Information existing in Common and view-Specific parts among different views via joint non-negative matrix factorization (NMF). Specifically, as usual, multi-view data is factorized into a common part shared across views and view-specific parts existing within each view. Beyond that, the common part and each view-specific part are further factorized into a discriminative part and a non-discriminative part. To better obtain the discriminative parts, a supervised constraint is added to guide the joint NMF factorization. Fig. 1 gives a simple example of the decomposition: each view of the data is factorized into four parts, namely the common discriminative, common non-discriminative, specific discriminative and specific non-discriminative part. To find the optimal decomposition, we follow the block coordinate descent (BCD) framework [14] to solve the objective function of DICS. Finally, only the derived discriminative parts of the common and view-specific parts are used to construct a classifier. Experimental results show that DICS extracts discriminative information and discards non-discriminative information effectively, yielding improved classification performance that outperforms many state-of-the-art algorithms on seven real-world data sets.

2 Related Work

The simplest way to deal with multi-view data is to concatenate the feature vectors of all views into one single long feature vector. However, such a method ignores the relationships among multiple views and may suffer from the curse of dimensionality. To date, many advanced multi-view learning algorithms have been proposed, which can be broadly categorized into two types: the first category aims to exploit the consistency, and the second one focuses on exploiting the complementarity among multiple views.

Studies exploiting consistency generally seek a common representation on which all views have minimum disagreement. For instance, Canonical Correlation Analysis (CCA) related algorithms [3, 6, 11, 12, 26] project two or more views into latent subspaces by maximizing the correlations among the projected views. Spectral methods [5, 16, 20, 29, 33] use weighted summation to merge the graph Laplacian matrices of different views into one optimal graph for further clustering or embedding. Matrix factorization based methods [8, 18, 27] jointly factorize multi-view data into one common centroid representation by minimizing the overall reconstruction loss over the different views. In addition, multiple kernel learning (MKL) [7] can also be considered as exploiting the consistency across different views: each view is mapped into a new space (e.g. a reproducing kernel Hilbert space) using the kernel trick, and all kernel matrices are then combined into one unified kernel by minimizing a pre-defined objective function.

Another paradigm of multi-view learning is to explicitly preserve the complementary information of different views. Co-training style algorithms [1, 15, 28, 32] treat each view as complementary to the others. Generally speaking, they iteratively train two classifiers on two different views, and each classifier generates complementary information to help the other classifier train in the next iteration. Beyond that, the Co-EM algorithm [21] can be considered a probabilistic version of co-training. Subspace related methods are also adopted to learn the complementarity. For instance, [9, 10, 13, 19, 25] learn one shared latent factor and view-specific latent factors to simultaneously capture the consistency and complementarity.

In summary, most existing multi-view learning algorithms focus mainly on learning consistency and complementarity from multi-view data. However, the discriminative information existing in the consistency and complementarity is not fully investigated, even though it is the factor that directly dominates learning performance.

3 The Proposed Method

3.1 Preliminaries

Given a non-negative matrix \(\mathbf {X} \in \mathbb {R}_+^{m \times n}\), where each column represents a data point, NMF aims to approximately factorize the data matrix into two non-negative matrices \(\mathbf {W} \in \mathbb {R}_+^{m \times k}\) and \(\mathbf {H} \in \mathbb {R}_+^{n \times k}\), such that

$$\begin{aligned} \begin{aligned}&\underset{\mathbf {W},\mathbf {H}}{\text {min}} ||\mathbf {X} - \mathbf {W} \mathbf {H}^T||_F^2 \\&\text {s.t.} \;\; \mathbf {W}, \mathbf {H} \ge 0 \end{aligned} \end{aligned}$$
(1)

where \(||\cdot ||_F\) denotes the Frobenius norm. Note that each column of the original data matrix is approximated by a linear combination of the column vectors of \(\mathbf {W}\), weighted by the corresponding entries of \(\mathbf {H}\). Therefore, \(\mathbf {W}\) and \(\mathbf {H}\) are often called the basis matrix and the coefficient matrix, respectively.
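
For concreteness, the following is a minimal NumPy sketch of solving Eq. (1) with the classic multiplicative updates; it is only illustrative (DICS itself uses a BCD scheme, Sect. 3.4), and all variable names are our own.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate X (m x n, non-negative) by W @ H.T with W (m x k), H (n x k)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W, H = rng.random((m, k)), rng.random((n, k))
    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative throughout.
        W *= (X @ H) / (W @ (H.T @ H) + eps)
        H *= (X.T @ W) / (H @ (W.T @ W) + eps)
    return W, H

X = np.abs(np.random.rand(100, 40))          # toy non-negative data, columns are samples
W, H = nmf(X, k=10)
print(np.linalg.norm(X - W @ H.T, "fro"))    # reconstruction error of Eq. (1)
```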

For multi-view data, NMF-based approaches often take either \(\mathbf {W}\) or \(\mathbf {H}\) as a common factor. One representative formulation is as follows.

$$\begin{aligned} \begin{aligned}&\underset{\mathbf {W},\mathbf {H}}{\text {min}} \sum _{v=1}^{n_v} ||\mathbf {X}^{(v)} - \mathbf {W}^{(v)} \mathbf {H}^{T}||_F^2 + \mathrm {\Phi }(\mathbf {W},\mathbf {H}) \\&\text {s.t.} \;\; \mathbf {W},\mathbf {H} \ge 0 \end{aligned} \end{aligned}$$
(2)

where \(n_v\) denotes the number of views, \(\mathbf {W}^{(v)}\) denotes the basis matrix of view v, \(\mathbf {H}\) denotes the common coefficient matrix shared across views, and \(\mathrm {\Phi }(\cdot )\) denotes regularization terms on \(\mathbf {W}\) and \(\mathbf {H}\). This formulation assumes that the different views of one identical object are generated from distinct subspaces, while all views share one centroid latent representation. This paradigm considers the consistency shared by all views; however, it ignores the complementary knowledge existing in each view.
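
To make the shared-coefficient structure of Eq. (2) concrete, the snippet below evaluates its reconstruction loss (with \(\mathrm {\Phi }\) omitted) for two toy views that share one coefficient matrix \(\mathbf {H}\); the dimensions and names are purely illustrative.

```python
import numpy as np

def joint_nmf_loss(X_views, W_views, H):
    """Sum of per-view errors ||X^(v) - W^(v) H^T||_F^2; the regularizer Phi is omitted."""
    return sum(np.linalg.norm(Xv - Wv @ H.T, "fro") ** 2
               for Xv, Wv in zip(X_views, W_views))

rng = np.random.default_rng(0)
# two views of the same 60 samples with feature dimensions 30 and 50, k = 8 latent factors
X_views = [np.abs(rng.random((30, 60))), np.abs(rng.random((50, 60)))]
W_views = [np.abs(rng.random((30, 8))), np.abs(rng.random((50, 8)))]
H = np.abs(rng.random((60, 8)))              # coefficient matrix shared by all views
print(joint_nmf_loss(X_views, W_views, H))
```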

3.2 Discriminant Learning on Multi-view Data

As multiple views have their commonality and distinctiveness, we first decompose the multi-view data into two parts, a common part and view-specific parts, like many existing approaches [9, 10, 13, 19]. Formally, let \(\mathbf {W}_{\mathrm {C}}\) represent the common subspace shared by all views and \(\mathbf {W}_\mathrm {S}^{(v)}\) represent the distinct subspace of each specific view. Each view of the data matrix can then be written as \(\mathbf {X}^{(v)} = \mathbf {W}_\mathrm {C} \mathbf {H}_\mathrm {C}^T + \mathbf {W}_\mathrm {S}^{(v)} \mathbf {H}_\mathrm {S}^{(v)T}\). To derive the common and view-specific information, we can thus formulate our objective function as follows.

$$\begin{aligned} \begin{aligned}&\underset{\mathbf {W},\mathbf {H}}{\text {min}} \sum _{v=1}^{n_v} \bigg |\bigg | \mathbf {X}^{(v)} - \begin{bmatrix} \mathbf {W}_{\mathrm {C}}&\mathbf {W}_{\mathrm {S}}^{(v)} \end{bmatrix} \begin{bmatrix} \mathbf {H}_{\mathrm {C}}^T \\ \mathbf {H}_{\mathrm {S}}^{(v)T} \end{bmatrix}\bigg |\bigg |_F^2 + \mathrm {\Phi }(\mathbf {W},\mathbf {H})\\&\text {s.t.} \; \; \mathbf {W},\mathbf {H} \ge 0 \end{aligned} \end{aligned}$$
(3)

To learn the discriminative information existing in multi-view data, we further leverage the available label information to guide joint matrix factorization in a supervised way. Specifically, we first divide the common part and each view-specific part into the discriminative part and the non-discriminative part, respectively. Namely,

$$\begin{aligned} \mathbf {\widetilde{W}} = \begin{bmatrix} \mathbf {W}_{\mathrm {CD}}&\mathbf {W}_{\mathrm {CN}}&\mathbf {W}_{\mathrm {SD}}^{(v)}&\mathbf {W}_{\mathrm {SN}}^{(v)} \end{bmatrix} \end{aligned}$$
(4)
$$\begin{aligned} \mathbf {\widetilde{H}} = \begin{bmatrix} \mathbf {H}_{\mathrm {CD}}&\mathbf {H}_{\mathrm {CN}}&\mathbf {H}_{\mathrm {SD}}^{(v)}&\mathbf {H}_{\mathrm {SN}}^{(v)} \end{bmatrix} \end{aligned}$$
(5)

where \(\mathbf {W}_{\mathrm {CD}}\) and \(\mathbf {W}_{\mathrm {CN}}\) denote the common discriminative and the common non-discriminative part of matrix \(\mathbf {\widetilde{W}}\), respectively. Similarly, \(\mathbf {W}_{\mathrm {SD}}^{(v)}\) and \(\mathbf {W}_{\mathrm {SN}}^{(v)}\) denote the view-specific parts. The same holds for \(\mathbf {\widetilde{H}}\).

Afterwards, we impose the supervised constraint on the latent coefficient matrix \(\mathbf {H}\). It is worth noting that we add the constraint only on the discriminative part of \(\mathbf {H}\) to derive discriminability. In addition, the discriminative information exists not only in the common part, but also in each view-specific part. Therefore, the objective function is further reformulated as follows.

$$\begin{aligned} \begin{aligned}&\underset{\mathbf {W},\mathbf {H},\mathbf {B}}{\text {min}} \sum _{v = 1}^{n_v} \big |\big |\mathbf {X}^{(v)} - \mathbf {\widetilde{W}} \mathbf {\widetilde{H}}^T \big |\big |_F^2 + \mathrm {\Phi }(\mathbf {W},\mathbf {H}) \\&+ \gamma \bigg |\bigg |\mathbf {Y} - \begin{bmatrix} \mathbf {B}_{\mathrm {CD}}&\mathbf {B}_{\mathrm {SD}}^{(v)} \end{bmatrix} \begin{bmatrix} \mathbf {H}_{\mathrm {CD}}^T \\ \mathbf {H}_{\mathrm {SD}}^{(v)T} \end{bmatrix}\bigg |\bigg |_F^2 \\&\text {s.t.} \; \mathbf {W}, \mathbf {H} \ge 0, \; ||(\mathbf {W})_{\cdot ,i}||_2 = 1 \end{aligned} \end{aligned}$$
(6)

where \(\mathbf {Y} \in \mathbb {R}^{c \times n}\) is the label matrix, c is the number of classes, and n is the number of data instances; \(y_{i,j} = 1\) if instance j belongs to class i and 0 otherwise. \(\mathbf {B} = [\mathbf {B}_{\mathrm {CD}} \; \mathbf {B}_{\mathrm {SD}}^{(v)}] \in \mathbb {R}^{c \times (k1 + k3)}\) is a linear projection matrix which maps the latent representation into the label space. The subscripts “\(\mathrm {C}\)” and “\(\mathrm {S}\)” stand for “common” and “specific”, and “\(\mathrm {D}\)” and “\(\mathrm {N}\)” stand for “discriminative” and “non-discriminative”, respectively. For example, \(\mathbf {W}_{\mathrm {CD}}\) denotes the common discriminative subspace. We normalize each column vector of \(\mathbf {W}\) to ensure a unique solution. The supervised regularization term is imposed on \(\mathbf {H}_{\mathrm {D}} = [\mathbf {H}_{\mathrm {CD}} \;\; \mathbf {H}_{\mathrm {SD}}^{(v)}]\) to make the derived patterns discriminative.
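
As a small illustration of the supervised term in Eq. (6), the following sketch builds the label matrix \(\mathbf {Y}\) from integer class labels; the helper name is ours.

```python
import numpy as np

def label_matrix(labels, n_classes):
    """Y in R^{c x n}: y_ij = 1 if instance j belongs to class i, else 0."""
    Y = np.zeros((n_classes, len(labels)))
    Y[labels, np.arange(len(labels))] = 1.0
    return Y

print(label_matrix(np.array([0, 2, 1, 0, 2]), n_classes=3))
# [[1. 0. 0. 1. 0.]
#  [0. 0. 1. 0. 0.]
#  [0. 1. 0. 0. 0.]]
```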

3.3 Regularization Terms

To further enhance the discriminative power of the latent subspaces, we impose an \(\ell _{1,1}\) norm constraint on \(\mathbf {W}_{\mathrm {D}}\) as \(||\mathbf {W}_{\mathrm {D}}^{T} \mathbf {W}_{\mathrm {D}}||_{1,1}\), where \(\mathbf {W}_{\mathrm {D}} = [\mathbf {W}_{\mathrm {CD}} \; \mathbf {W}_{\mathrm {SD}}^{(v)}]\). This term can be split into two parts: \(||\mathbf {W}_{\mathrm {D}}^{T} \mathbf {W}_{\mathrm {D}}||_{1,1} = \sum _{i} \mathbf {w}_{\mathrm {D}i}^{T} \mathbf {w}_{\mathrm {D}i} + \sum _{i \ne j} \mathbf {w}_{\mathrm {D}i}^{T} \mathbf {w}_{\mathrm {D}j}\). The first term prevents overfitting. The second term encourages the basis vectors to be as orthogonal as possible, which reduces the redundancy of the discriminative bases. In addition, we impose an \(\ell _{1,1}\) norm constraint on \(\mathbf {H}_{\mathrm {D}}\), which encourages the discriminative coefficients to be sparse. The rationale is that data points of different classes should not possess identical latent concepts (i.e. basis vectors); it is reasonable that a latent concept appears only in a certain class but not in the others. With this intuition, a discriminative latent representation vector \(\mathbf {h}_{\mathrm {D}i}\) should be sparse in the elements corresponding to the latent concepts that it does not possess. Finally, putting all terms together, the objective function of DICS is formulated as follows.

$$\begin{aligned} \begin{aligned}&\underset{\mathbf {W},\mathbf {H},\mathbf {B}}{\text {min}} \sum _{v = 1}^{n_v} \big |\big |\mathbf {X}^{(v)} - \mathbf {\widetilde{W}} \mathbf {\widetilde{H}}^T \big |\big |_F^2 + \alpha \big |\big |\mathbf {W}_{\mathrm {D}}^{T} \mathbf {W}_{\mathrm {D}}\big |\big |_{1,1} \\&+ \beta \big |\big |\mathbf {H}_{\mathrm {D}}\big |\big |_{1,1} + \gamma \bigg |\bigg |\mathbf {Y} - \begin{bmatrix} \mathbf {B}_{\mathrm {CD}}&\mathbf {B}_{\mathrm {SD}}^{(v)} \end{bmatrix} \begin{bmatrix} \mathbf {H}_{\mathrm {CD}}^T \\ \mathbf {H}_{\mathrm {SD}}^{(v)T} \end{bmatrix}\bigg |\bigg |_F^2 \\&\text {s.t.} \; \mathbf {W}, \mathbf {H} \ge 0, \; ||(\mathbf {W})_{\cdot ,i}||_2 = 1 \end{aligned} \end{aligned}$$
(7)

where \(\alpha , \beta , \gamma \) are non-negative parameters to balance the regularization terms.
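
The sketch below evaluates the objective of Eq. (7) for given factor blocks, which is also handy for monitoring convergence (Sect. 4.6). For simplicity it stores the common blocks redundantly in every view's dictionary; all names are illustrative and this is not the authors' implementation.

```python
import numpy as np

def dics_objective(X_views, Y, parts, alpha, beta, gamma):
    """parts[v] holds the blocks of view v; the common blocks (*_CD, *_CN) are the same arrays in every view."""
    total = 0.0
    for v, Xv in enumerate(X_views):
        p = parts[v]
        W_tilde = np.hstack([p["W_CD"], p["W_CN"], p["W_SD"], p["W_SN"]])
        H_tilde = np.hstack([p["H_CD"], p["H_CN"], p["H_SD"], p["H_SN"]])
        W_D = np.hstack([p["W_CD"], p["W_SD"]])
        H_D = np.hstack([p["H_CD"], p["H_SD"]])
        B_D = np.hstack([p["B_CD"], p["B_SD"]])
        total += np.linalg.norm(Xv - W_tilde @ H_tilde.T, "fro") ** 2   # reconstruction term
        total += alpha * (W_D.T @ W_D).sum()                            # ||W_D^T W_D||_{1,1}, since W_D >= 0
        total += beta * H_D.sum()                                       # ||H_D||_{1,1}, since H_D >= 0
        total += gamma * np.linalg.norm(Y - B_D @ H_D.T, "fro") ** 2    # supervised term
    return total
```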

3.4 Optimization

The objective function Eq. (7) is not jointly convex in \(\mathbf {W}\) and \(\mathbf {H}\), so it is impractical to find the global optimum. We therefore follow the general BCD framework and divide Eq. (7) into several convex subproblems, one for each column of \(\mathbf {W}\) and \(\mathbf {H}\), and solve each subproblem in turn while fixing the others. In this way, convergence to a local minimum solution can be obtained [4].

Firstly, we represent \(\mathbf {W} \mathbf {H}^T\) as the sum of rank-1 outer products. We can equivalently reformulate the objective function Eq. (7) as follows.

$$\begin{aligned} \begin{aligned}&f(\mathbf {W}, \mathbf {H}, \mathbf {B}) = \\&\sum _{v = 1}^{n_v} \bigg |\bigg |\mathbf {X}^{(v)} - \sum _{i = 1}^{k1} \mathbf {w}_{\mathrm {CD}i}\mathbf {h}_{\mathrm {CD}i}^T - \sum _{i = 1}^{k2} \mathbf {w}_{\mathrm {CN}i}\mathbf {h}_{\mathrm {CN}i}^T - \\&\sum _{i = 1}^{k3} \mathbf {w}_{\mathrm {SD}i}^{(v)}\mathbf {h}_{\mathrm {SD}i}^{(v)T} - \sum _{i = 1}^{k4} \mathbf {w}_{\mathrm {SN}i}^{(v)}\mathbf {h}_{\mathrm {SN}i}^{(v)T} \bigg |\bigg |_F^2 + \\&\alpha ( \sum _{i = 1}^{k1} \sum _{j = 1}^{k1} \mathbf {w}_{\mathrm {CD}i}^T \mathbf {w}_{\mathrm {CD}j} + \sum _{i = 1}^{k3} \sum _{j = 1}^{k3} \mathbf {w}_{\mathrm {SD}i}^{(v)T} \mathbf {w}_{\mathrm {SD}j}^{(v)} + \\&2 \sum _{i = 1}^{k1} \sum _{j = 1}^{k3} \mathbf {w}_{\mathrm {CD}i}^{T} \mathbf {w}_{\mathrm {SD}j}^{(v)}) + \beta \mathbf {1}_{1 \times n}( \sum _{i = 1}^{k1} \mathbf {h}_{\mathrm {CD}i} + \sum _{i = 1}^{k3} \mathbf {h}_{\mathrm {SD}i}^{(v)}) + \\&\gamma \bigg |\bigg | \mathbf {Y} - \sum _{i = 1}^{k1} \mathbf {b}_{\mathrm {CD}i}\mathbf {h}_{\mathrm {CD}i}^T - \sum _{i = 1}^{k3} \mathbf {b}_{\mathrm {SD}i}^{(v)}\mathbf {h}_{\mathrm {SD}i}^{(v)T}\bigg |\bigg |_F^2 \end{aligned} \end{aligned}$$
(8)

where \(\mathbf {w}_{\mathrm {CD}i}\), \(\mathbf {w}_{\mathrm {CN}i}\), \(\mathbf {w}_{\mathrm {SD}i}^{(v)}\), \(\mathbf {w}_{\mathrm {SN}i}^{(v)}\), \(\mathbf {h}_{\mathrm {CD}i}\), \(\mathbf {h}_{\mathrm {CN}i}\), \(\mathbf {h}_{\mathrm {SD}i}^{(v)}\), \(\mathbf {h}_{\mathrm {SN}i}^{(v)}\) are the i-th column vectors of \(\mathbf {W}_{\mathrm {CD}}\), \(\mathbf {W}_{\mathrm {CN}}\), \(\mathbf {W}_{\mathrm {SD}}^{(v)}\), \(\mathbf {W}_{\mathrm {SN}}^{(v)}\), \(\mathbf {H}_{\mathrm {CD}}\), \(\mathbf {H}_{\mathrm {CN}}\), \(\mathbf {H}_{\mathrm {SD}}^{(v)}\), \(\mathbf {H}_{\mathrm {SN}}^{(v)}\) respectively. \(\mathbf {1}_{1 \times n}\) is a row vector of length n with all elements 1.

By fixing all column vectors except the one to be updated, we obtain a convex subproblem with respect to that column, which we then solve within the BCD framework. Note that we use \([\cdot ]_+\) to denote \(\text {max}(0,\cdot )\), which projects negative values onto the boundary of the feasible region, i.e. zero. The update rules are given as follows.

$$\begin{aligned} \mathbf {w}_{\mathrm {CD}i} = \mathbf {w}_{\mathrm {CD}i} + \Big [ \frac{\sum _{v = 1}^{n_v} (\mathbf {R}^{(v)}\mathbf {h}_{\mathrm {CD}i} - \alpha (\mathbf {W}_{\mathrm {CD}} \mathbf {1}_{k1 \times 1} + \mathbf {W}_{\mathrm {SD}}^{(v)} \mathbf {1}_{k3 \times 1}))}{n_v(\mathbf {h}_{\mathrm {CD}i}^T \mathbf {h}_{\mathrm {CD}i} + \alpha )} \Big ]_+ \end{aligned}$$
(9)
$$\begin{aligned} \mathbf {w}_{\mathrm {CN}i} = \mathbf {w}_{\mathrm {CN}i} + \Big [ \frac{\sum _{v = 1}^{n_v} \mathbf {R}^{(v)}\mathbf {h}_{\mathrm {CN}i}}{n_v(\mathbf {h}_{\mathrm {CN}i}^T \mathbf {h}_{\mathrm {CN}i})} \Big ]_+ \end{aligned}$$
(10)
$$\begin{aligned} \mathbf {w}_{\mathrm {SD}i}^{(v)} = \mathbf {w}_{\mathrm {SD}i}^{(v)} + \Big [ \frac{ \mathbf {R}^{(v)}\mathbf {h}_{\mathrm {SD}i}^{(v)} - \alpha (\mathbf {W}_{\mathrm {CD}} \mathbf {1}_{k1 \times 1} + \mathbf {W}_{\mathrm {SD}}^{(v)} \mathbf {1}_{k3 \times 1})}{\mathbf {h}_{\mathrm {SD}i}^{(v)T} \mathbf {h}_{\mathrm {SD}i}^{(v)} + \alpha } \Big ]_+ \end{aligned}$$
(11)
$$\begin{aligned} \mathbf {w}_{\mathrm {SN}i}^{(v)} = \mathbf {w}_{\mathrm {SN}i}^{(v)} + \Big [ \frac{\mathbf {R}^{(v)}\mathbf {h}_{\mathrm {SN}i}^{(v)}}{\mathbf {h}_{\mathrm {SN}i}^{(v)T} \mathbf {h}_{\mathrm {SN}i}^{(v)}} \Big ]_+ \end{aligned}$$
(12)
$$\begin{aligned} \mathbf {h}_{\mathrm {CD}i} = \mathbf {h}_{\mathrm {CD}i} + \Big [ \frac{\sum _{v = 1}^{n_v} (\mathbf {R}^{(v)T} \mathbf {w}_{\mathrm {CD}i} - \frac{\beta }{2} \mathbf {1}_{n \times 1} + \gamma \mathbf {Q}^{(v)T} \mathbf {b}_{\mathrm {CD}i})}{n_v(\mathbf {w}_{\mathrm {CD}i}^T \mathbf {w}_{\mathrm {CD}i} + \gamma \mathbf {b}_{\mathrm {CD}i}^T \mathbf {b}_{\mathrm {CD}i})} \Big ]_+ \end{aligned}$$
(13)
$$\begin{aligned} \mathbf {h}_{\mathrm {CN}i} = \mathbf {h}_{\mathrm {CN}i} + \Big [ \frac{ \sum _{v = 1}^{n_v} \mathbf {R}^{(v)T} \mathbf {w}_{\mathrm {CN}i}}{n_v(\mathbf {w}_{\mathrm {CN}i}^T \mathbf {w}_{\mathrm {CN}i})} \Big ]_+ \end{aligned}$$
(14)
$$\begin{aligned} \mathbf {h}_{\mathrm {SD}i}^{(v)} = \mathbf {h}_{\mathrm {SD}i}^{(v)} + \Big [ \frac{\mathbf {R}^{(v)T} \mathbf {w}_{\mathrm {SD}i}^{(v)} -\frac{\beta }{2} \mathbf {1}_{n \times 1} + \gamma \mathbf {Q}^{(v)T} \mathbf {b}_{\mathrm {SD}i}^{(v)}}{\mathbf {w}_{\mathrm {SD}i}^{(v)T} \mathbf {w}_{\mathrm {SD}i}^{(v)} + \gamma \mathbf {b}_{\mathrm {SD}i}^{(v)T} \mathbf {b}_{\mathrm {SD}i}^{(v)} } \Big ]_+ \end{aligned}$$
(15)
$$\begin{aligned} \mathbf {h}_{\mathrm {SN}i}^{(v)} = \mathbf {h}_{\mathrm {SN}i}^{(v)} + \Big [ \frac{ \mathbf {R}^{(v)T} \mathbf {w}_{\mathrm {SN}i}^{(v)}}{\mathbf {w}_{\mathrm {SN}i}^{(v)T} \mathbf {w}_{\mathrm {SN}i}^{(v)}} \Big ]_+ \end{aligned}$$
(16)

where \(\mathbf {R}^{(v)}\) and \(\mathbf {Q}^{(v)}\) are

$$\begin{aligned} \mathbf {R}^{(v)} = \mathbf {X}^{(v)} - \mathbf {W}_{\mathrm {CD}}\mathbf {H}_{\mathrm {CD}}^T - \mathbf {W}_{\mathrm {CN}}\mathbf {H}_{\mathrm {CN}}^T - \mathbf {W}_{\mathrm {SD}}^{(v)}\mathbf {H}_{\mathrm {SD}}^{(v)T} - \mathbf {W}_{\mathrm {SN}}^{(v)}\mathbf {H}_{\mathrm {SN}}^{(v)T} \end{aligned}$$
(17)
$$\begin{aligned} \mathbf {Q}^{(v)} = \mathbf {Y} - \mathbf {B}_{\mathrm {CD}}\mathbf {H}_{\mathrm {CD}}^T - \mathbf {B}_{\mathrm {SD}}^{(v)}\mathbf {H}_{\mathrm {SD}}^{(v)T} \end{aligned}$$
(18)

Note that we factor out the common terms \(\mathbf {R}^{(v)}\) and \(\mathbf {Q}^{(v)}\) from the equations merely to save space. This form is not efficient for implementation, since the order of computation, i.e. \((\mathbf {W} \mathbf {H}^T) \mathbf {h}_i\) versus \(\mathbf {W} (\mathbf {H}^T \mathbf {h}_i\)), largely affects the computational complexity. The former takes \(mn(k+1)\) multiplications, whereas the latter takes only \((m+n)k\); the latter form is therefore much more efficient in implementation.
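
The point about computation order can be checked directly; the snippet below compares the two equivalent expressions on toy matrices (sizes are arbitrary).

```python
import numpy as np

m, n, k = 2000, 1500, 50
rng = np.random.default_rng(0)
W, H, h = rng.random((m, k)), rng.random((n, k)), rng.random(n)

slow = (W @ H.T) @ h   # materializes an m x n matrix: roughly m*n*(k+1) multiplications
fast = W @ (H.T @ h)   # only k-dimensional intermediates: roughly (m+n)*k multiplications

print(np.allclose(slow, fast))   # True -- same result, very different cost
```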

In addition, when the other variables are fixed, the projection matrices \(\mathbf {B}_{\mathrm {CD}}\) and \(\mathbf {B}_{\mathrm {SD}}^{(v)}\) can be solved in a closed form as follows.

$$\begin{aligned} \mathbf {B}_{\mathrm {CD}} = \frac{\sum _{v = 1}^{n_v}(\mathbf {Y} - \mathbf {B}_{\mathrm {SD}}^{(v)} \mathbf {H}_{\mathrm {SD}}^{(v)T})}{n_v} \mathbf {H}_{\mathrm {CD}} (\mathbf {H}_{\mathrm {CD}}^T \mathbf {H}_{\mathrm {CD}} + \lambda \mathbf {I})^{-1} \end{aligned}$$
(19)
$$\begin{aligned} \mathbf {B}_{\mathrm {SD}}^{(v)} = (\mathbf {Y} - \mathbf {B}_{\mathrm {CD}} \mathbf {H}_{\mathrm {CD}}^{T}) \mathbf {H}_{\mathrm {SD}}^{(v)} (\mathbf {H}_{\mathrm {SD}}^{(v)T} \mathbf {H}_{\mathrm {SD}}^{(v)} + \lambda \mathbf {I})^{-1} \end{aligned}$$
(20)

where \(\mathbf {I}\) is the identity matrix and \(\lambda \) is a small positive number.
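
A direct translation of the closed-form solutions Eqs. (19)-(20) is sketched below, using a linear solve instead of an explicit inverse; the function and argument names are ours.

```python
import numpy as np

def update_B_CD(Y, B_SD_views, H_SD_views, H_CD, lam=1e-6):
    """Eq. (19): ridge-regularized least-squares update of the common projection B_CD."""
    n_v = len(H_SD_views)
    residual = sum(Y - B_SD_views[v] @ H_SD_views[v].T for v in range(n_v)) / n_v
    gram = H_CD.T @ H_CD + lam * np.eye(H_CD.shape[1])
    return np.linalg.solve(gram, (residual @ H_CD).T).T   # = residual @ H_CD @ inv(gram)

def update_B_SD(Y, B_CD, H_CD, H_SD_v, lam=1e-6):
    """Eq. (20): per-view projection B_SD^(v)."""
    gram = H_SD_v.T @ H_SD_v + lam * np.eye(H_SD_v.shape[1])
    return np.linalg.solve(gram, ((Y - B_CD @ H_CD.T) @ H_SD_v).T).T
```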

Algorithm 1. The pseudocode of DICS.

Initialization. Since the NMF objective function is non-convex and has many local minima, a proper initialization is beneficial for learning performance. We develop a heuristic approach to initialize the basis matrices. DICS encourages the discriminative bases to achieve a degree of orthogonality, so we try to initialize them to be as orthogonal as possible. To initialize \(\mathbf {W}_{\mathrm {C}}\), we first compute the mean of the multi-view data, i.e. \(\bar{\mathbf {X}} = \frac{1}{n_v} \sum _v^{n_v} \mathbf {X}^{(v)}\). We then cluster \(\bar{\mathbf {X}}\) into \(k1+k2\) clusters and obtain the corresponding centroids. Next, we compute the pairwise linear correlation coefficients between all pairs of centroids and sort them in ascending order. Finally, we select the k1 centroids corresponding to the top k1 (i.e. smallest) correlation coefficients to initialize \(\mathbf {W}_{\mathrm {CD}}\), and use the remaining k2 centroids to initialize \(\mathbf {W}_{\mathrm {CN}}\). Each \(\mathbf {W}_{\mathrm {S}}^{(v)}\) is initialized in the same way by replacing \(\bar{\mathbf {X}}\) with \(\mathbf {X}^{(v)}\).
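
The following is one possible reading of this heuristic, assuming (as the view-wise mean implies) that all views share the same feature dimensionality; here centroids are ranked by their total absolute correlation with the others, and k-means comes from scikit-learn. It is a sketch, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_common_bases(X_views, k1, k2, seed=0):
    """Initialize W_CD (k1 columns) and W_CN (k2 columns) from centroids of the mean view."""
    X_bar = sum(X_views) / len(X_views)                     # mean of the views (m x n)
    km = KMeans(n_clusters=k1 + k2, n_init=10, random_state=seed).fit(X_bar.T)
    C = km.cluster_centers_.T                               # centroids as columns (m x (k1+k2))
    corr = np.abs(np.corrcoef(C.T))                         # pairwise correlations between centroids
    np.fill_diagonal(corr, 0.0)
    order = np.argsort(corr.sum(axis=1))                    # least-correlated centroids first
    W_CD = np.clip(C[:, order[:k1]], 0.0, None)             # most mutually "orthogonal" -> discriminative
    W_CN = np.clip(C[:, order[k1:]], 0.0, None)
    return W_CD, W_CN
```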

Time Complexity. The computational complexity of DICS is the same as that of solving the standard NMF problem via the hierarchical alternating least squares (HALS) algorithm under the BCD framework [14]. It is \(O(\sum _{v} m_vnk)\) in the multi-view case, where \(m_v\) is the feature dimension of the v-th view. Finally, the pseudocode of DICS is given in Algorithm 1.

4 Experiment

In this section, we first experimentally evaluate the proposed DICS algorithm on the classification task on seven real-world multi-view data sets. Then we empirically investigate whether the discriminative information extracted from the common and the view-specific parts is really helpful for improving learning performance. Finally, the sensitivity to parameters and the convergence of DICS are analyzed.

4.1 Data Sets

Four popular real-world multi-view data sets are used in the experiments, namely WebKB, Reuters, YaleFace and BBC, where the WebKB data set is further divided into four sub data sets: Cornell, Texas, Washington and Wisconsin. Therefore, seven data sets in total are used to evaluate the performance of the proposed algorithm. The statistics of the data sets are summarized in Table 1.

Table 1. Statistics of the data sets

4.2 Selection of Comparison Algorithms

We compare the DICS algorithm with several single-view and multi-view algorithms to demonstrate its effectiveness. For a fair comparison, the source codes of all comparison algorithms were downloaded directly from the authors' websites or requested from the authors by email. The parameters of all algorithms are selected within the ranges suggested by their authors, as listed below. The source code of our proposed DICS algorithm is available on Dropbox (footnote 1).

  • KNN. We use the KNN algorithm (with \(k=1\)) as a baseline, since all NMF-based algorithms can be regarded as a preprocessing step before KNN. We apply KNN to each single view and report the best single-view performance. We also apply KNN to the concatenated feature vector (denoted KNNcat).

  • NMF. We apply the standard NMF algorithm to each single view and to the concatenated feature vector (denoted NMFcat), as another baseline.

  • SSNMF. This is a supervised NMF variant proposed in [17], which incorporates a linear classifier to encode the supervised information. We select the regularization parameter \(\lambda \) within the range of [0.5:0.5:3].

  • GNMF (footnote 2). This is a manifold-regularized version of NMF [2], which preserves local similarity by imposing a graph Laplacian regularization. We use the normalized dot product (cosine similarity) to construct the affinity graph, and select the regularization parameter \(\lambda \) within the set {\(10^{0}\), \(10^{1}\), \(10^{2}\), \(10^{3}\), \(10^{4}\)}.

  • multiNMF (footnote 3). This is a well-known multi-view NMF algorithm proposed in [18]. We select the regularization parameter \(\lambda \) within the set {\(10^{-3}\), \(10^{-2}\), \(10^{-1}\), \(10^{0}\)}.

  • MVCC (footnote 4). MVCC incorporates local manifold regularization for multi-view learning [27]. We set the parameter \(\alpha \) to 100, and select \(\beta \) and \(\gamma \) within the set {50, 100, 200, 500, 1000}.

  • MCL. This is a semi-supervised multi-view NMF variant with a graph-regularized constraint [8]. We select the parameter \(\alpha \) within the range [100:50:250], \(\beta \) within the set {0.01, 0.02, 0.03}, and set \(\gamma \) to 0.005 as the authors suggested.

  • DICS. This is the proposed algorithm. We select parameters: \(\alpha \), \(\beta \) and \(\gamma \) within the set of {\(10^{-2}\), \(10^{-1}\), \(10^{0}\), \(10^{1}\), \(10^{2}\)}.

4.3 Classification on Real-World Data Sets

For DICS and all comparison algorithms, we first perform five-fold cross-validation to select the parameters, and then run 10-fold cross-validation ten times with the selected parameters to obtain the final average classification accuracy and standard deviation. For the NMF-based methods, we do not fix the number of latent factors k to a global constant, since different algorithms may prefer different values of k; instead, we select k within the range [5:5:100] for each algorithm. For DICS, we need to set the numbers of the four groups of latent factors k1, k2, k3 and k4. To avoid searching too large a parameter space, we first select each \(k_i (i=1,2,3,4)\) within the range [5:5:20], and then select the regularization parameters with all \(k_i\) fixed.
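
For reference, the evaluation protocol can be sketched as below; `fit_score` is a hypothetical placeholder that trains the model at hand on the training indices with the given parameters and returns test accuracy.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def select_then_evaluate(n_samples, y, param_grid, fit_score):
    """5-fold CV for parameter selection, then ten runs of 10-fold CV with the chosen setting."""
    dummy = np.zeros((n_samples, 1))                 # StratifiedKFold only needs sample count and labels
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    best = max(param_grid, key=lambda p: np.mean(
        [fit_score(tr, te, p) for tr, te in inner.split(dummy, y)]))
    scores = []
    for rep in range(10):
        outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
        scores += [fit_score(tr, te, best) for tr, te in outer.split(dummy, y)]
    return best, np.mean(scores), np.std(scores)
```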

For classification, we first obtain latent representations from the different NMF-based approaches and then use KNN (\(k =1\)) for classification. Specifically, for the unsupervised algorithms, including NMF, GNMF, multiNMF, MCL and MVCC, we first apply the algorithm to the data set to obtain the latent representation \(\mathbf {H}\), which is then used for training and testing. For a supervised method like DICS, we first obtain the discriminative basis \(\mathbf {W}_{\mathrm {D}}\) on the training data, and then use the Moore-Penrose pseudoinverse of \(\mathbf {W}_{\mathrm {D}}\) as a projection matrix to obtain the new data representation, namely \(\widetilde{\mathbf {X}}^{(v)} = (\mathbf {W}_{\mathrm {D}}^{T}\mathbf {W}_{\mathrm {D}})^{-1}\mathbf {W}_{\mathrm {D}}^{T}\mathbf {X}^{(v)}\). We then concatenate the \(\widetilde{\mathbf {X}}^{(v)}\) as the input for KNN.
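
The classification step for DICS can be sketched as follows, with the pseudoinverse projection applied per view and a 1-NN classifier from scikit-learn on the concatenated features; `W_D_per_view` denotes the learned \([\mathbf {W}_{\mathrm {CD}} \; \mathbf {W}_{\mathrm {SD}}^{(v)}]\) of each view, and all names are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def project(X_view, W_D):
    """X_tilde^(v) = pinv(W_D) @ X^(v), i.e. (W_D^T W_D)^{-1} W_D^T X^(v) when W_D has full column rank."""
    return np.linalg.pinv(W_D) @ X_view

def dics_classify(train_views, test_views, y_train, y_test, W_D_per_view):
    # concatenate the projected views; transpose so that rows are samples
    Z_train = np.vstack([project(Xv, Wd) for Xv, Wd in zip(train_views, W_D_per_view)]).T
    Z_test = np.vstack([project(Xv, Wd) for Xv, Wd in zip(test_views, W_D_per_view)]).T
    knn = KNeighborsClassifier(n_neighbors=1).fit(Z_train, y_train)
    return knn.score(Z_test, y_test)
```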

Table 2 summarizes the classification results of the different multi-view learning algorithms, where the numbers in parentheses denote the standard deviation. The best result on each data set is highlighted in boldface. As the results show, the proposed DICS outperforms the other comparison algorithms on all seven data sets. DICS is slightly better than the other algorithms on Reuters, YaleFace and BBC, but achieves remarkably promising performance on the four WebKB sub data sets, outperforming the second best algorithm by up to 9.01% on Texas. This promising result may be attributed to two factors: (a) DICS not only explores the common and the view-specific information but, more importantly, further extracts the discriminative information existing in these parts, which supports improved prediction performance; (b) by filtering out the non-discriminative information from the common and view-specific parts and adding the supervised constraints on the encoding coefficients, the extracted discriminative information becomes much more effective for classification.

Table 2. Multi-view classification performance on real-world data sets
Fig. 2. Classification accuracy of DICS on different extracted components of multi-view data.

4.4 Empirical Study of DICS Algorithm

DICS assumes that multi-view data can be decomposed into a common part and view-specific parts, and that only the discriminative information in them is essential. To verify this assumption, we first construct the following subspaces: \(\mathbf {W}_{\mathrm {D}} = [\mathbf {W}_{\mathrm {CD}} \; \mathbf {W}_{\mathrm {SD}}^{(v)}]\), \(\mathbf {W}_{\mathrm {N}} = [\mathbf {W}_{\mathrm {CN}} \; \mathbf {W}_{\mathrm {SN}}^{(v)}]\), \(\mathbf {W}_{\mathrm {C}} = [\mathbf {W}_{\mathrm {CD}} \; \mathbf {W}_{\mathrm {CN}}]\) and \(\mathbf {W}_{\mathrm {S}} = [\mathbf {W}_{\mathrm {SD}}^{(v)} \; \mathbf {W}_{\mathrm {SN}}^{(v)}]\), denoted as the “Discriminative”, “Non-discriminative”, “Common” and “Specific” subspace, respectively. Afterwards, we project the original data onto these subspaces to obtain the corresponding components of the data. We perform classification on each component, and the results are given in Fig. 2. The classification performance of the “Common” part is much worse than that of the “Specific” part, which suggests that using only the consistent information of multi-view data is not enough to capture all the discriminative information. Moreover, the performance on the “Discriminative” part is better than on all other parts on all data sets except Reuters. This suggests that extracting the discriminative information from the common as well as the view-specific parts, and discarding the non-discriminative parts, does help improve learning performance.
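
The four component subspaces compared in Fig. 2 can be assembled per view as sketched below; each component representation is then obtained with the same pseudoinverse projection used in Sect. 4.3.

```python
import numpy as np

def component_subspaces(W_CD, W_CN, W_SD_v, W_SN_v):
    """The four subspaces evaluated in Fig. 2 for one view v."""
    return {
        "Discriminative":     np.hstack([W_CD, W_SD_v]),
        "Non-discriminative": np.hstack([W_CN, W_SN_v]),
        "Common":             np.hstack([W_CD, W_CN]),
        "Specific":           np.hstack([W_SD_v, W_SN_v]),
    }

# e.g. Z_component = np.linalg.pinv(subspace) @ X_v, then classify Z_component with 1-NN
```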

4.5 Parameter Study

There are three regularization parameters in DICS, i.e. \(\alpha \), \(\beta \) and \(\gamma \). \(\alpha \) controls the degree of orthogonality of the discriminative bases \(\mathbf {W}_{\mathrm {D}}\), \(\beta \) controls the degree of sparsity of the discriminative latent representation \(\mathbf {H}_{\mathrm {D}}\), and \(\gamma \) balances the importance of the supervised regularization term. To investigate how these parameters affect the final classification accuracy, we vary one parameter at a time within the set {\(10^{-4}\), \(10^{-3}\), \(10^{-2}\), \(10^{-1}\), \(10^{0}\), \(10^{1}\), \(10^{2}\), \(10^{3}\)}, while fixing the others to \(10^{-3}\). Figure 3 shows the variation of classification accuracy over the different parameters on four typical data sets. The classification accuracy is relatively stable when \(\alpha \) and \(\beta \) are less than 1, and then drops sharply as \(\alpha \) and \(\beta \) increase further. As for \(\gamma \), the classification accuracy on BBC increases sharply once \(\gamma \) exceeds \(10^{-2}\) and becomes steady once \(\gamma \) exceeds 1. The trend is similar on the other data sets except YaleFace, where the classification accuracy starts to decrease once \(\gamma \) exceeds 1. Based on these observations, we suggest selecting \(\alpha \) and \(\beta \) within the small range [0, 1], and simply setting \(\gamma =1\) in practice.

Fig. 3. Classification accuracy curve w.r.t. parameters \(\alpha \), \(\beta \) and \(\gamma \).

4.6 Convergence Analysis

Although the original problem Eq. (7) is non-convex, the derived updating rules achieve the optimal minimum of each subproblem, so the original problem Eq. (7) eventually converges to a local minimum solution. To empirically investigate the convergence property of DICS, we plot the convergence curve and the corresponding classification accuracy curve on four typical data sets (see Fig. 4). From all four plots, we can observe that the objective values drop sharply while the classification accuracies increase rapidly within roughly the first 10 iterations. After that, the convergence curves and the accuracy curves change only mildly, and the algorithm eventually converges. Usually, DICS converges in no more than 50 iterations, by which point the corresponding classification accuracy has become stable.

Fig. 4. Convergence and the corresponding classification accuracy curve of DICS on four typical data sets.

5 Conclusion

In this paper, we propose a novel multi-view learning algorithm, called DICS, which exploits the discriminative information existing in multi-view data. To this end, a joint non-negative matrix factorization is employed to factorize multi-view data into a common part and view-specific parts. Beyond that, the discriminative and non-discriminative information in these parts is further separated in a supervised way. In contrast to existing multi-view learning approaches that focus on consistent and/or complementary information, our new approach offers an intuitive and effective way to improve classification performance based directly on discriminative information. The high discriminative power of the derived distinct patterns further demonstrates the effectiveness of DICS on seven real-world multi-view data sets. Although DICS has several desirable properties, it also has drawbacks. One limitation is that tuning the \(k_i\) in DICS is quite troublesome, since inferring the subspace dimensionality is still an open problem for all NMF-based algorithms. We simply tune the \(k_i\) via model selection with a traditional strategy. However, once proper \(k_i\) are set for each subspace, promising results can be obtained, as we have demonstrated.