
1 Introduction

In many real applications, data often come from multiple sources or with multiple modalities, forming multi-view data, which have attracted extensive attention in the data mining field [1, 7, 14]. Since multiple views usually provide complementary and compatible information about each other, it is natural to integrate them to obtain better performance [6, 14]. However, in real applications it is often the case that some or even all of the views suffer from missing information [12, 19]. For example, in speaker grouping, the audio and visual appearances represent two views, and some speakers may lack either the audio or the visual information. Another example is document clustering: different language versions of a document can be regarded as multiple views, but many documents may not be translated into every language. Therefore, it is necessary to explore how to integrate such incomplete multi-view data. In this paper, we focus on multi-view clustering with incomplete data. Existing approaches for this task can be divided into two categories: completion methods [2, 13] and subspace methods [12, 20].

Completion methods are based on matrix completion. For example, singular value decomposition imputation [2] first applies a regression over the complete attributes of an instance, using the eigenvalues, to estimate each missing value, and then applies conventional multi-view clustering methods [7, 10] to derive the clustering results. Different completion methods [7, 9, 10] differ in the principles by which they complete the incomplete data. Their performance is usually unsatisfactory when the data are missing block-wise [12].

In recent years, several subspace methods have been proposed to handle this case. Partial multi-view clustering (PVC) [12] first divides the partial examples into two blocks and then applies non-negative matrix factorization (NMF) [11] to learn a low-dimensional representation for the multi-view data. Incomplete multi-modal visual data grouping (IMG) [20] can be regarded as an extension of PVC that requires the low-dimensional representations to conform to a self-learned manifold structure. Other methods such as [17, 18] also fall into this category. Although these methods achieve good performance, there is still room for improvement. Most of them are based on NMF, which aims at learning latent low-dimensional representations of the data rather than clustering the data directly. As a result, they must rely on a post-clustering step such as k-means or spectral clustering to obtain the clustering results. Besides, they assume that the shared data in different views have exactly the same representation in the latent space, which may negatively affect the intrinsic inner structure of each individual view.

In this paper, with the goal of learning cluster structures directly, we propose the Structured Graph Learning (SGL) method to handle incomplete multi-view data and group them simultaneously. For each view, SGL learns a structured graph that combines the partial and the complete examples. To establish interaction between the different views, we naturally constrain the subgraphs corresponding to the shared data (in different views) to be close. Thus, to some extent, we maintain the intrinsic structure of each individual view as well as the consistency between different views. By improving the mechanism of graph learning, the graph matrices learned by our method have ideal structures, i.e., exactly c connected components, so that the clustering results can be derived from these graphs without any post-clustering. Besides, we propose an efficient optimization strategy to solve the formulated problem. Experimental results on real benchmark data sets validate the advantages of our method.

2 The Proposed SGL

For convenience of presentation, we take two-view data for illustration. As can be seen from the following formulations, the extension to any number of views is straightforward; most simply, we can separate the multiple views into pairs and then solve all the two-view problems. For the input data, each column is a data point and each row is an attribute. The feature dimensions of view 1 and view 2 are \(d_1\) and \(d_2\), respectively. The input data \([{X^{(1)}}^T,{X^{(2)}}^T]^T\in \mathbb {R}^{(d_1+d_2)\times n_3}\), \(\hat{X}^{(1)}\in \mathbb {R}^{d_1\times n_1}\) and \(\hat{X}^{(2)}\in \mathbb {R}^{d_2\times n_2}\) denote the examples appearing in both views, only in view 1, and only in view 2, respectively. \(X^{(1)}\in \mathbb {R}^{d_1\times n_3}\) denotes the shared data in view 1, and \(X^{(2)}\in \mathbb {R}^{d_2\times n_3}\) denotes the shared data in view 2. We let \(X_1=[X^{(1)},\hat{X}^{(1)}]\in \mathbb {R}^{d_1\times (n_3+n_1)}\) and \(X_2=[X^{(2)},\hat{X}^{(2)}]\in \mathbb {R}^{d_2\times (n_3+n_2)}\), so that the n-th (\(\forall n\le n_3\)) columns of \(X_1\) and \(X_2\) belong to the same example, while the remaining columns contain no common example. Figure 1 illustrates the notations.

We denote the initial graphs constructed from \(X_1\) and \(X_2\) as \(A_1\in \mathbb {R}^{(n_3+n_1) \times (n_3+n_1)}\) and \(A_2\in \mathbb {R}^{(n_3+n_2)\times (n_3+n_2)}\) respectively, and the learned graphs that best approximate \(A_1\) and \(A_2\) as \(S_1\in \mathbb {R}^{(n_3+n_1)\times (n_3+n_1)}\) and \(S_2\in \mathbb {R}^{(n_3+n_2)\times (n_3+n_2)}\) respectively. Similarly, \(\bar{A}_1\in \mathbb {R}^{n_3\times n_3}\) and \(\bar{A}_2\in \mathbb {R}^{n_3\times n_3}\) denote the initial graphs corresponding to \(X^{(1)}\) and \(X^{(2)}\), so \(\bar{A}_1\) and \(\bar{A}_2\) are subgraphs of \(A_1\) and \(A_2\) respectively; \(\bar{S}_1\in \mathbb {R}^{n_3\times n_3}\) and \(\bar{S}_2\in \mathbb {R}^{n_3\times n_3}\) denote the learned graphs corresponding to \(X^{(1)}\) and \(X^{(2)}\), so \(\bar{S}_1\) and \(\bar{S}_2\) are subgraphs of \(S_1\) and \(S_2\) respectively. Figure 2 illustrates the learned graphs \(S_1\) and \(S_2\).
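To make the block layout concrete, the following NumPy sketch shows how the data blocks and the shared subgraphs line up; all sizes and the random data are purely hypothetical placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical sizes: n3 shared examples, n1 / n2 view-specific examples.
d1, d2, n3, n1, n2 = 20, 30, 50, 10, 15

X_shared_1 = np.random.rand(d1, n3)   # X^(1): shared examples, view-1 features
X_only_1   = np.random.rand(d1, n1)   # \hat{X}^(1): examples only in view 1
X_shared_2 = np.random.rand(d2, n3)   # X^(2): shared examples, view-2 features
X_only_2   = np.random.rand(d2, n2)   # \hat{X}^(2): examples only in view 2

# X_1 and X_2 stack the shared block first, so for any column index j <= n3
# the j-th columns of X1 and X2 describe the same example.
X1 = np.hstack([X_shared_1, X_only_1])   # d1 x (n3 + n1)
X2 = np.hstack([X_shared_2, X_only_2])   # d2 x (n3 + n2)

# For any graph S1 over X1's columns, the subgraph on the shared examples is
# simply its leading n3 x n3 block (and likewise for S2).
S1 = np.random.rand(n3 + n1, n3 + n1)
S1_bar = S1[:n3, :n3]                    # \bar{S}_1
```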

Fig. 1. Notations of data.

Fig. 2. Learned graphs \(S_1\) and \(S_2\). \(S_1\) and \(S_2\) correspond to \(X_1\) and \(X_2\) respectively; \(\bar{S}_1\) and \(\bar{S}_2\) correspond to \(X^{(1)}\) and \(X^{(2)}\) respectively.

For each individual view, we aim to excavate its intrinsic clustering structure. Motivated by [16], which is effective in learning the clustering structure of complete single-view data, we intend to learn an ideally structured graph from the initial graph. Given initial affinity matrices \(A_1\) and \(A_2\) constructed from \(X_1\) and \(X_2\) respectively, we learn the graph matrices \(S_1\) and \(S_2\) that best approximate \(A_1\) and \(A_2\) respectively. The elementary objectives are:

$$\begin{aligned} \min _{\sum _js_{1ij}=1,s_{1ij}\ge 0} ~ \ {\parallel S_1-A_1\parallel _F^2}, \end{aligned}$$
(1)
$$\begin{aligned} \min _{\sum _js_{2ij}=1,s_{2ij}\ge 0} ~ \ {\parallel S_2-A_2\parallel _F^2}. \end{aligned}$$
(2)

Here we impose the constraints \(\sum _js_{1ij}=1\) and \(\sum _js_{2ij}=1\) \((\forall i)\) to avoid the case that some rows of \(S_1\) or \(S_2\) are all zeros, and to make \(S_1\) and \(S_2\) comparable at the same scale.
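Without the cross-view coupling introduced below, problem (1) decouples over the rows of \(S_1\): each row is the Euclidean projection of the corresponding row of \(A_1\) onto the probability simplex. A minimal NumPy sketch of that projection (a standard sort-based routine, offered here as an illustration rather than the paper's own implementation) is:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of vector v onto {s : s >= 0, sum(s) = 1}."""
    u = np.sort(v)[::-1]                       # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def solve_elementary(A):
    """Row-wise solution of min_S ||S - A||_F^2 s.t. every row lies on the simplex."""
    return np.vstack([project_to_simplex(a) for a in A])
```

The same projection routine reappears later when the rows that involve only view-specific examples are updated.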

For each view, according to Eq. (1) (or Eq. (2)), we can learn a graph that explores and maintains the inner structure of that view. Further, we aim to bridge the different views, for which it is natural to take into account the consistency between the shared data in the different views. Thus \(\bar{S}_1\) is learned not only from \(\bar{A}_1\) but also from \(\bar{S}_2\), and \(\bar{S}_2\) is learned not only from \(\bar{A}_2\) but also from \(\bar{S}_1\). This leads to the following objective function:

$$\begin{aligned} \begin{aligned} \min _{S_1,S_2} ~&\parallel S_1-A_1\parallel _F^2+\parallel S_2-A_2\parallel _F^2 +\mu \parallel \bar{S}_1-\bar{S}_2\parallel _F^2\\ s.t.&\sum _js_{1ij}=1,s_{1ij}\ge 0, \sum _js_{2ij}=1,s_{2ij}\ge 0,\ \end{aligned} \end{aligned}$$
(3)

where \(\mu >0\) is a parameter that balances the first two terms and the last term.

To make the learned graphs have ideal structures directly for clustering tasks, we introduce the following property [4, 15]:

Property 1

The number of connected components in the graph with the similarity matrix S is equal to the multiplicity c of the eigenvalue zero of \(L_S\).

\( L_S \in \mathbb {R}^{n\times n}\) in Property 1 is the Laplacian matrix of the nonnegative similarity matrix S, i.e. \(L_S=D_S-(S^T+S)/2\) (where \(D_S\) is the degree matrix defined as a diagonal matrix whose i-th diagonal element is \(\sum _j{(s_{ij}+s_{ji})/2}\)).
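Property 1 can be checked numerically: for a similarity matrix whose graph splits into c connected components, the Laplacian \(L_S\) has exactly c (near-)zero eigenvalues. A small self-contained sketch with a toy two-component graph:

```python
import numpy as np

def laplacian(S):
    """L_S = D_S - (S + S^T)/2 for a nonnegative similarity matrix S."""
    W = (S + S.T) / 2.0
    return np.diag(W.sum(axis=1)) - W

# Toy graph with c = 2 connected components (two disjoint blocks).
S = np.zeros((6, 6))
S[:3, :3] = 1.0 / 3          # rows sum to one, as required by the constraints
S[3:, 3:] = 1.0 / 3

eigvals = np.linalg.eigvalsh(laplacian(S))
print(np.sum(eigvals < 1e-10))   # -> 2, the number of connected components
```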

Given a graph with graph matrix S, Property 1 indicates that if \(rank(L_S)=n-c\), then the graph contains exactly c connected components, each corresponding to a cluster, so the clustering results can be obtained directly from the learned graph. Thus the problem (3) can be improved to the following problem:

$$\begin{aligned} \begin{aligned} \min _{S_1,S_2} ~&\parallel S_1-A_1\parallel _F^2+\parallel S_2-A_2\parallel _F^2 +\mu \parallel \bar{S}_1-\bar{S}_2\parallel _F^2\\ s.t.\,\,&\sum _js_{1ij}=1,s_{1ij}\ge 0,rank(L_{S_1})=n_1+n_3-c,\\&\sum _js_{2ij}=1,s_{2ij}\ge 0,rank(L_{S_2})=n_2+n_3-c, \end{aligned} \end{aligned}$$
(4)

which is equivalent to the following optimization problem for large enough values of both \(\lambda _1\) and \(\lambda _2\):

$$\begin{aligned} \begin{aligned} \min _{S_1,S_2}~&\parallel S_1-A_1\parallel _F^2+2\lambda _1\sum _{i=1}^c\sigma _{1i}(L_{S_1})+\parallel S_2-A_2\parallel _F^2\\&+2\lambda _2\sum _{i=1}^c\sigma _{2i}(L_{S_2}) +\mu \parallel \bar{S}_1-\bar{S}_2\parallel _F^2\\ s.t.~&\sum _js_{1ij}=1,s_{1ij}\ge 0, \sum _js_{2ij}=1,s_{2ij}\ge 0.\ \end{aligned} \end{aligned}$$
(5)

Here, \(\sigma _{1i}(L_{S_1})\) and \(\sigma _{2i}(L_{S_2})\) denote the i-th smallest eigenvalues of \(L_{S_1}\) and \(L_{S_2}\) respectively. Since \(\sigma _{1i}(L_{S_1})\ge 0\) and \(\sigma _{2i}(L_{S_2})\ge 0\) \((\forall i)\), when \(\lambda _1\) and \(\lambda _2\) are large enough the minimization drives \(\sum _{i=1}^c\sigma _{1i}(L_{S_1})=0\) and \(\sum _{i=1}^c\sigma _{2i}(L_{S_2})=0\), so that the constraints \(rank(L_{S_1})=n_1+n_3-c\) and \(rank(L_{S_2})=n_2+n_3-c\) of the problem (4) are satisfied. Thus, the problem (5) is equivalent to the problem (4). According to Ky Fan's Theorem [5], we have

$$\begin{aligned} \begin{aligned} \sum _{i=1}^c\sigma _{1i}(L_{S_1})=\min _{F_1\in \mathbb {R}^{(n_1+n_3)\times c},F_1^TF_1=I} Tr(F_1^TL_{S_1}F_1),\\ \sum _{i=1}^c\sigma _{2i}(L_{S_2})=\min _{F_2\in \mathbb {R}^{(n_2+n_3)\times c},F_2^TF_2=I} Tr(F_2^TL_{S_2}F_2), \end{aligned} \end{aligned}$$
(6)

where \(F_1\) and \(F_2\) are the intermediate variables. Thus the problem (5) is further equivalent to the following problem:

$$\begin{aligned} \begin{aligned} \min _{S_1,F_1,S_2,F_2} ~&\parallel S_1-A_1\parallel _F^2+2\lambda _1Tr(F_1^TL_{S_1}F_1)+\parallel S_2-A_2\parallel _F^2\\&+2\lambda _2Tr(F_2^TL_{S_2}F_2)+\mu \parallel \bar{S}_1-\bar{S}_2\parallel _F^2\\ s.t.~&\sum _js_{1ij}=1,s_{1ij}\ge 0,\sum _js_{2ij}=1,s_{2ij}\ge 0,\\&F_1\in \mathbb {R}^{(n_1+n_3)\times c},F_1^TF_1=I, F_2\in \mathbb {R}^{(n_2+n_3)\times c},F_2^TF_2=I,\ \end{aligned} \end{aligned}$$
(7)

where both parameters \(\lambda _1\) and \(\lambda _2\) are large enough to guarantee that the sums of the c smallest eigenvalues of both \(L_{S_1}\) and \(L_{S_2}\) are equal to zero.

Note that, different from previous graph-based methods, the graphs in Eq. (7) are learned from both the initial graphs and the consistency between different views. Besides, Eq. (7) is designed to cluster the data points directly by learning graphs whose structures are ideal for clustering. From the learned structured graphs, we can obtain the clustering results of each view directly. Then, by a simple best one-to-one mapping between the clustering results of \(X^{(1)}\) and \(X^{(2)}\), the final clustering results can be derived.
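The final one-to-one mapping between the per-view cluster labels on the shared examples can be found, for instance, with the Hungarian algorithm applied to the label co-occurrence matrix. The sketch below (using SciPy; the function name and interface are our own illustration) aligns view-2 labels to view-1 labels:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(labels_v1, labels_v2, c):
    """Map view-2 cluster ids onto view-1 cluster ids by maximizing agreement
    on the shared examples (both arrays refer to the same n3 examples)."""
    cooc = np.zeros((c, c))
    for a, b in zip(labels_v1, labels_v2):
        cooc[a, b] += 1
    rows, cols = linear_sum_assignment(-cooc)   # maximize total agreement
    mapping = dict(zip(cols, rows))             # view-2 id -> view-1 id
    return np.array([mapping[b] for b in labels_v2])
```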

Since the problem (7) is not convex and the regularization acts only on part of the elements of the graph matrices, it is difficult to solve directly. We propose an efficient algorithm to solve this problem in the next section.

3 Optimization Algorithms

When \(S_1\) and \(S_2\) are fixed, optimizing \(F_1,F_2\), the problem (7) becomes

$$\begin{aligned} \min _{F_1\in \mathbb {R}^{(n_1+n_3)\times c},F_1^TF_1=I} ~ \ Tr(F_1^TL_{S_1}F_1), \end{aligned}$$
(8)
$$\begin{aligned} \min _{F_2\in \mathbb {R}^{(n_2+n_3)\times c},F_2^TF_2=I} ~ \ Tr(F_2^TL_{S_2}F_2). \end{aligned}$$
(9)

The optimal solutions \(F_1\) and \(F_2\) are formed by the c eigenvectors of \(L_{S_1}\) and \(L_{S_2}\) respectively, corresponding to their c smallest eigenvalues.
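In other words, the \(F\)-update is a plain eigen-decomposition of the current Laplacian, keeping the eigenvectors of the c smallest eigenvalues. A minimal sketch:

```python
import numpy as np

def update_F(S, c):
    """Columns of F are eigenvectors of L_S for its c smallest eigenvalues."""
    W = (S + S.T) / 2.0
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    return eigvecs[:, :c]
```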

When \(F_1\), \(F_2\) and \(S_2\) are fixed, optimizing \(S_1\), the problem (7) becomes

$$\begin{aligned} \begin{aligned} \min _{S_1}~&\parallel S_1-A_1\parallel _F^2+2\lambda _1Tr(F_1^TL_{S_1}F_1) +\mu \parallel \bar{S}_1-\bar{S}_2\parallel _F^2\\ s.t.&\sum _js_{1ij}=1,s_{1ij}\ge 0, \end{aligned} \end{aligned}$$
(10)

where \(S_1, A_1\in \mathbb {R}^{(n_3+n_1)\times (n_3+n_1)}\) and \(\bar{S}_1, \bar{S}_2\in \mathbb {R}^{n_3\times n_3}\).

The problem (10) is equivalent to

$$\begin{aligned} \begin{aligned} \min _{\sum _js_{1ij}=1,s_{1ij}\ge 0}~&\sum _{1\le i,j\le n_3+n_1}(s_{1ij}-a_{1ij})^2 +\lambda _1\sum _{1\le i,j\le n_3+n_1}s_{1ij}\parallel f_{1i}-f_{1j}\parallel _2^2\\ +\,&\mu \sum _{1\le i,j\le n_3}(\bar{s}_{1ij}-\bar{s}_{2ij})^2. \end{aligned} \end{aligned}$$
(11)

Since the problem (11) is independent for different i, we can solve it separately for each row i.

Update the first \(n_3\) rows of \(S_1\) (\(1\le i\le n_3\)). For each i, we denote \(v_{1ij}=\parallel f_{1i}-f_{1j}\parallel _2^2\), where \(f_{1i}\) denotes the i-th row of \(F_1\). Then the problem (11) becomes

$$\begin{aligned} \begin{aligned} \min _{\sum _j{s_1}_{ij}=1,{s_1}_{ij}\ge 0} ~&\sum _{j=1}^{n_3}({s_1}_{ij}-{a_1}_{ij})^2+\sum _{j=n_3+1}^{n_3+n_1}({s_1}_{ij}-{a_1}_{ij})^2+\lambda _1\sum _{j=1}^{n_3}{s_1}_{ij}{v_1}_{ij}\\&+\lambda _1\sum _{j=n_3+1}^{n_3+n_1}{s_1}_{ij}{v_1}_{ij} +\mu \sum _{j=1}^{n_3}(s_{1ij}-\bar{s}_{2ij})^2.\\ \end{aligned} \end{aligned}$$
(12)

For each i, we denote \(s_{1i}=[\bar{s}_{1i} , \hat{s}_{1i}]\), \(v_{1i}=[\bar{v}_{1i} , \hat{v}_{1i}]\) and \(a_{1i}=[\bar{a}_{1i} , \hat{a}_{1i}]\), where \(\bar{s}_{1i}\), \(\bar{v}_{1i}\) and \(\bar{a}_{1i}\) contain exactly the first \(n_3\) elements of the vectors \(s_{1i}\), \(v_{1i}\) and \(a_{1i}\) respectively, and \(\hat{s}_{1i}\), \(\hat{v}_{1i}\) and \(\hat{a}_{1i}\) contain exactly the last \(n_1\) elements of those vectors.

The problem (12) can be written in vector form as

$$\begin{aligned} \begin{aligned} \min _{s_{1i}^T\mathbf 1=1,s_{1i}\ge 0}&\parallel \sqrt{1+\mu }\bar{s}_{1i}-\frac{(\bar{a}_{1i}-\frac{1}{2}\lambda _1\bar{v}_{1i}+\mu \bar{s}_{2i})}{\sqrt{1+\mu }}\parallel _2^2 +\parallel \hat{s}_{1i}-(\hat{a}_{1i}-\frac{1}{2}\lambda _1\hat{v}_{1i})\parallel _2^2. \end{aligned} \end{aligned}$$
(13)

We denote \(c_1 = \bar{a}_{1i}-1/2\lambda _1\bar{v}_{1i}+\mu \bar{s}_{2i}\), \(c_2=\hat{a}_{1i}-1/2\lambda _1\hat{v}_{1i}\) and \(b_1=[c_1, c_2]\). \(U_1\) is a diagonal matrix whose first \(n_3\) diagonal elements are all \(1+\mu \) and whose remaining diagonal elements are all 1. Then the problem (13) becomes

$$\begin{aligned} \begin{aligned} \min _{s_{1i}^T\mathbf 1=1,s_{1i}\ge 0} ~ \ s_{1i}U_1s_{1i}^T-2s_{1i}b_1^T. \end{aligned} \end{aligned}$$
(14)

This problem can be solved by the algorithm in [16].
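Because \(U_1\) is diagonal and positive, problem (14) separates per coordinate once the Lagrange multiplier of the sum-to-one constraint is known, and that multiplier can be found by a monotone root search. The paper points to the algorithm in [16]; the bisection below is an assumed, simpler stand-in offered only for illustration:

```python
import numpy as np

def solve_diag_qp_on_simplex(u, b, tol=1e-10, iters=100):
    """min_s s^T diag(u) s - 2 b^T s  s.t. s >= 0, sum(s) = 1, with u > 0.
    KKT conditions give s_j = max((b_j - eta/2) / u_j, 0); eta is found by
    bisection so that the entries sum to one."""
    lo, hi = 2.0 * (b.min() - u.max()), 2.0 * b.max()   # bracket for eta
    s = np.zeros_like(b)
    for _ in range(iters):
        eta = (lo + hi) / 2.0
        s = np.maximum((b - eta / 2.0) / u, 0.0)
        gap = s.sum() - 1.0
        if abs(gap) < tol:
            break
        if gap > 0:
            lo = eta        # entries sum to more than one -> increase eta
        else:
            hi = eta
    return s

# Example with the structure of U_1: weight 1 + mu on the first n3 coordinates.
n3, n1, mu = 4, 3, 2.0
u = np.concatenate([np.full(n3, 1.0 + mu), np.ones(n1)])
b = np.random.rand(n3 + n1)
s = solve_diag_qp_on_simplex(u, b)
```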

Update the last \(n_1\) rows of \(S_1\) (\(n_3<i\le n_3+n_1\)). When \(n_3<i\le n_3+n_1\), denoting \(v_{1ij}=\parallel f_{1i}-f_{1j}\parallel _2^2\), the problem (11) becomes

$$\begin{aligned} \begin{aligned} \min _{\sum _j{s_1}_{ij}=1,{s_1}_{ij}\ge 0} ~&\sum _{j}({s_1}_{ij}-{a_1}_{ij})^2+\lambda _1\sum _{j}{s_1}_{ij}{v_1}_{ij}.\\ \end{aligned} \end{aligned}$$
(15)

The problem (15) can be written in vector form as

$$\begin{aligned} \begin{aligned} \min _{s_{1i}^T\mathbf 1=1,s_{1i}\ge 0} ~ \ \parallel s_{1i}-(a_{1i}-\frac{1}{2}\lambda _1v_{1i})\parallel _2^2. \end{aligned} \end{aligned}$$
(16)

This problem can be solved with an effective iterative algorithm [8], or by a closed-form solution of the same form as Eq. (30) in [16].

When \(F_1\), \(F_2\) and \(S_1\) are fixed, optimizing \(S_2\) is similar to optimizing \(S_1\). We omit the detailed process.

Update the first \(n_3\) rows of \(S_2\) (\(1\le i\le n_3\)). When \(1\le i\le n_3\), the problem (7) becomes

$$\begin{aligned} \begin{aligned} \min _{s_{2i}^T\mathbf 1=1,s_{2i}\ge 0}~ \ s_{2i}U_2s_{2i}^T-2s_{2i}b_2^T, \end{aligned} \end{aligned}$$
(17)

where \(b_2=[c_3, c_4]\), \(U_2\) is a diagonal matrix whose first \(n_3\) diagonal elements are all \(1+\mu \) and whose remaining diagonal elements are all 1, \(c_3=\bar{a}_{2i}-1/2\lambda _2\bar{v}_{2i}+\mu \bar{s}_{1i}\), and \(c_4=\hat{a}_{2i}-1/2\lambda _2\hat{v}_{2i}\).

Update the last \(n_2\) rows of \(S_2\) (\(n_3<i\le n_3+n_2\)). When \(n_3<i\le n_3+n_2\), the problem (7) becomes

$$\begin{aligned} \begin{aligned} \min _{s_{2i}^T\mathbf 1=1,s_{2i}\ge 0} ~ \ \parallel s_{2i}-(a_{2i}-\frac{1}{2}\lambda _2v_{2i})\parallel _2^2. \end{aligned} \end{aligned}$$
(18)

The whole procedure is summarized in Algorithm 1, in which, for each data point, we only update its k nearest similarities in S in order to significantly reduce the complexity of updating S and F. This technique makes our method applicable to data sets of very large scale.

Algorithm 1. The alternating optimization procedure of SGL.
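The original algorithm listing is not reproduced here. The following Python-style sketch shows the alternating loop described above, reusing the helper routines sketched earlier (`update_F`, `project_to_simplex`, `solve_diag_qp_on_simplex`); it is a schematic reconstruction under our own assumptions, and it omits the k-nearest-neighbor sparsification used in the actual algorithm.

```python
import numpy as np

def update_S(S, A, F, S_other, n3, lam, mu):
    """Schematic row-wise update of one view's graph (problems (14)/(16))."""
    n = A.shape[0]
    # v_ij = ||f_i - f_j||_2^2, with f_i the i-th row of F.
    V = np.square(F[:, None, :] - F[None, :, :]).sum(axis=-1)
    S_new = np.zeros_like(S)
    for i in range(n):
        b = A[i] - 0.5 * lam * V[i]
        if i < n3:   # shared rows: coupled with the other view's shared block
            u = np.concatenate([np.full(n3, 1.0 + mu), np.ones(n - n3)])
            b[:n3] += mu * S_other[i, :n3]
            S_new[i] = solve_diag_qp_on_simplex(u, b)
        else:        # view-specific rows: plain simplex projection
            S_new[i] = project_to_simplex(b)
    return S_new

def sgl(A1, A2, n3, c, lam1, lam2, mu, T=30):
    """Schematic alternating optimization for problem (7)."""
    S1, S2 = A1.copy(), A2.copy()
    for _ in range(T):
        F1, F2 = update_F(S1, c), update_F(S2, c)       # problems (8) and (9)
        S1 = update_S(S1, A1, F1, S2, n3, lam1, mu)     # problems (14) and (16)
        S2 = update_S(S2, A2, F2, S1, n3, lam2, mu)     # problems (17) and (18)
    return S1, S2   # with large enough lam1, lam2 each has c connected components
```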

4 Discussion

4.1 Convergence Analysis

Property 2

Algorithm 1 monotonically decreases the objective of the problem in each iteration and converges to a local optimum of the problem.

The main idea of the proof is as follows. Algorithm 1 converges because the objective function value of Eq. (7) decreases as the iterations proceed. In detail, with \(S_1\) and \(S_2\) fixed, the optimal \(F_1\) and \(F_2\) are obtained by solving the problems (8) and (9) respectively, which reduces the objective function value; with \(F_1\) and \(F_2\) fixed, we obtain the optimal \(S_1\) and \(S_2\) by solving the problems (14), (16) and (17), (18) respectively, which also reduces the objective function value. In summary, Algorithm 1 converges to a local optimum of the problem (7).

4.2 Computational Time

Since SGL is solved in an alternating way, we obtain its total computational complexity by analyzing the complexity of solving the corresponding subproblems. Algorithm 1 can be divided into three alternating optimization problems. The problems in Eqs. (8) and (9), which update \(F_1\) and \(F_2\) respectively, can be solved by eigen-decomposition with complexity \(O((n_1+n_3)^3)\) and \(O((n_2+n_3)^3)\) respectively. Therefore, the total computational complexity of this step is \(O(\max \{(n_1+n_3)^3,(n_2+n_3)^3\})\).

The problems in Eqs. (14) and (16), which update \(S_1\) row by row, are subproblems of (10). The problem in Eq. (14) can be solved by Lagrange multipliers and Newton's method, with complexity \(O((n_1+n_3)\times n_3)\). The problem (16) can be solved by an efficient iterative algorithm [8], with complexity \(O((n_1+n_3)\times n_1)\). Therefore, the total computational complexity of this step is \(O((n_1+n_3)^2)\).

The problems in Eqs. (17) and (18) to update \(S_2\) are similar to Eqs. (14) and (16) respectively, and the total computational complexity is \(O((n_2+n_3)^2)\).

As a result, the total computational complexity of SGL is \(O(T\times \max \{(n_1+n_3)^3,(n_2+n_3)^3\})\), where T is the number of iterations. The cubic time complexity is caused by the spectral decompositions. However, in our algorithm the Laplacian matrices on which we perform spectral decomposition are sparse, and there are many efficient alternatives for handling spectral decomposition.

5 Experiment

5.1 Data Sets

MSRCv1 comprises 240 images in 8 classes in total. Two pairs of visual features are extracted: 256-dimensional Local Binary Patterns (LBP) and 512-dimensional GIST, denoted MSRCv1GC, and 1302-dimensional CENTRIST and 200-dimensional SIFT, denoted MSRCv1CS. Ionosphere consists of 351 radar returns collected by a system in Goose Bay, Labrador, whose targets were free electrons in the ionosphere; the data are divided into two classes, where "good" returns show evidence of some type of structure in the ionosphere and "bad" returns do not. Caltech101-7 contains 441 object images in 7 categories in total; we extract two features from each image, 200-dimensional SIFT and 32-dimensional Gabor texture. Handwritten numerals (HW) contains 2000 images for the digit classes 0 to 9, with 200 images per class; we select 216 profile correlations (FAC) and 76 Fourier coefficients of the character shapes (FOU) for our clustering. WebKB comprises 1051 pages collected from four universities; each page has two views, a 334-dimensional citation view and a 2949-dimensional content view. Statistics of the data sets are summarized in Table 1.

Table 1. Data set descriptions

5.2 Comparing Methods

Single-view methods V1CLR and V2CLR: With the partial example ratio being zero, the CLR [16] algorithm is executed separately on view 1 and view 2, denoted V1CLR and V2CLR respectively.

Completion methods CentroidSC and PairwiseSC: First, the partial examples are completed with the Robust Rank-k Matrix Completion method [9]. Then two co-regularization schemes [10] are applied: centroid-based co-regularization, denoted CentroidSC, and pairwise-based co-regularization, denoted PairwiseSC.

Subspace methods PVC [12] and IMG [20].

Other method: CGC [3] is designed to handle many-to-many instance relationships and supports the situation of incomplete views.

We construct the sparse affinity matrices \(A_1\) and \(A_2\) by Eq. (35) in [16]. Following [12], we randomly select a ratio of examples to be partial to simulate the partial-view setting, i.e., they appear in only one of the two views while the remaining examples are described by both views; we evenly assign the partial examples to the two views to simplify the experiment. Each time we randomly select \(10\%\) to \(90\%\) of the examples, with \(10\%\) as the interval, as partial examples. We repeat this process 10 times and record the average and standard deviation of the results. For PVC and IMG, the k-means algorithm is performed to obtain the final clustering result as in the original papers; other clustering methods may also be feasible. We use two standard clustering evaluation metrics to measure the multi-view clustering performance, namely Clustering Accuracy (ACC) and Normalized Mutual Information (NMI). As in [12], we test all the methods under different Partial Example Ratios (PER) varying from 0.1 to 0.9 with an interval of 0.1.
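The partial-view setting can be simulated by randomly marking a fraction of the examples as partial and assigning them evenly to the two views. A hedged sketch of that protocol (our own illustration of the setup described above, not the authors' script):

```python
import numpy as np

def split_partial(n, per, seed=0):
    """Mark a fraction `per` of n examples as partial: half appear only in
    view 1, half only in view 2, and the rest are shared by both views."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_partial = int(round(per * n))
    only_v1 = perm[:n_partial // 2]
    only_v2 = perm[n_partial // 2:n_partial]
    shared = perm[n_partial:]
    return shared, only_v1, only_v2

shared, only_v1, only_v2 = split_partial(n=1000, per=0.3)
# View-1 columns: shared + only_v1; view-2 columns: shared + only_v2.
```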

Table 2. Experimental ACC (the higher the better) results (mean(std)) on six data sets. The best result is highlighted in boldface. Paired t-test results (significance level 5%) between our algorithm and the others are also reported: Win (\(\bullet \)) means ours performs better, Lose (\(\otimes \)) means the other algorithm performs better, and Tie (\(\odot \)) means that neither outperforms the other.

5.3 Clustering Result Comparison

Table 2 and Fig. 3 report the ACC and NMI values respectively on the various data sets under different PER settings. From these results, we make the following observations and discussions.

In almost all settings, our method outperforms both methods executed on a single complete view (V1CLR and V2CLR), even when the PER equals \(30\%\), which confirms that our approach validly synthesizes the information from both views and that the proposed constraint between the subgraphs corresponding to the same samples (from different views) is effective. As the partial example ratio PER varies from \(10\%\) to \(90\%\), the proposed method usually performs much better than the other multi-view clustering baselines. In particular, the improvement over the baselines is largest when PER is less than \(40\%\). With more missing examples, the performance of all methods generally drops.

CentroidSC and PairwiseSC usually perform worst on almost all the data sets, which may be because matrix completion requires the missing locations to be random, whereas the data are missing block-wise in the incomplete multi-view setting. One may wonder why our method performs much better than the other subspace learning methods. This may be because their assumption that the shared data in different views have exactly the same representations in the latent space can damage the inner structure of each view and increase the risk of over-fitting. Besides, our graphs are learned from both the initial graphs and the consistency between views, and the learned graphs contain ideal clustering structures.

One may also be interested in why our method achieves a considerable improvement when PER is high. When PER is high, most subspace learning methods can hardly estimate the common representation \(P_c\) accurately from the few complete shared examples. IMG utilizes a general global structure to remedy this deviation; however, the global structure overlooks the specific intrinsic structure of each individual view, while our learned structures are more detailed and specific.

Fig. 3. The NMI (the higher the better) results for the six data sets. PER (partial example ratio) is the ratio of partial examples; partial examples are evenly distributed to the two views.

5.4 Convergence Study

We experiment on the MSRCv1GC data set to show the convergence property. The convergence curves and the corresponding NMI performance under the \(PER=30\%\) and \(PER=70\%\) settings are plotted in Fig. 4. For the \(PER=30\%\) setting, we set \(\{\lambda _1, \lambda _2, \mu \}\) to \(\{50, 50, 2.0\}\); for the \(PER=70\%\) setting, we set \(\{\lambda _1, \lambda _2, \mu \}\) to \(\{50, 50, 9.0\}\). The objective function value at each iteration is drawn in black; we can see that it decreases rapidly as the number of iterations increases, and it takes only around 12 iterations to converge. The NMI value at each iteration is drawn in red, from which we can see that 12 iterations are enough to obtain good clustering performance.

Fig. 4. Convergence curve of the objective function value and the corresponding NMI performance curve vs. the number of iterations of our method, with \(PER=30\%\) and \(PER=70\%\), on the MSRCv1GC data set. (Color figure online)

5.5 Parameter Study

We study the parameters on four data sets: MSRCv1GC, MSRCv1SW, CaltechSW and HW. There are three parameters to explore: \(\lambda _1\), \(\lambda _2\) and \(\mu \). Following [16], we determine both \(\lambda _1\) and \(\lambda _2\) in a heuristic way: in each iteration, we compute the numbers of zero eigenvalues of \(L_{S_1}\) and \(L_{S_2}\); if one is larger (smaller) than c, we divide (multiply) the corresponding \(\lambda \) by two. Following [12], we tune \(\mu \) for three different PERs, 30%, 50% and 70%. As in the above experiments, we randomly select a ratio of examples to be partial and repeat the process 10 times to record the average results. The effect of the parameter \(\mu \) is shown in Fig. 5.

From Fig. 5, it is easy to see that on all data sets our method achieves steadily good NMI performance over a very large range of \(\mu \) in all settings, which validates the robustness of our method. Our method usually performs relatively well when \(\mu \) is in the range [2.0, 4.0] for the \(PER=30\%\) and \(PER=50\%\) settings, while the best range becomes [7.0, 9.0] for the \(PER=70\%\) setting.

Fig. 5. Effect of the parameter \(\mu \) on four data sets with three different PERs.

6 Conclusion

In this paper, we propose a method to handle the multi-view clustering problem in the case where all the views may suffer from missing data. Different from existing approaches, we simultaneously manipulate and cluster the incomplete multi-view data. We excavate and maintain the intrinsic structure of each individual view, and establish interaction between the different views through the shared data. For each view, a graph with exactly c connected components can be learned, so that the clustering results can be derived from the graphs without any post-clustering. To optimize the proposed objective, we provide a solution that can simultaneously handle both the whole and the partial regularization problems. Experimental results on six real-world multi-view data sets, compared with several baselines, validate the effectiveness of our method. In the future, we will study how to reduce the computational cost of our method.