
1 Introduction

Recently, multi-view data, whose features are collected from multiple heterogeneous but related views, have arisen in a number of fields [1,2,3,4,5,6,7,8], such as pattern recognition, data mining and natural language processing. For instance, a web page can be described in two views: one contains the words occurring in the page and the other contains the words occurring in the hyperlinks pointing to that page [4]. Another example is a multilingual document that is available in several languages, where each language is taken as a separate view [5]. In these fields, data clustering is a basic but widely used technique [9]. When clustering multi-view data, it is difficult to produce good results using only one view of features, since each view usually provides only partial information [10]. Therefore, it is necessary to properly combine the information from all views to improve the clustering performance, which leads to the emergence of multi-view clustering.

Proximity-based methods are a typical class of multi-view clustering methods. They integrate the information from different views by jointly making use of predefined proximity matrices. One of the most straightforward schemes for view integration is weighted combination, which combines the proximity matrices of all views by weighted addition via an adaptive weighting parameter for each view [11, 12]. Besides, several other useful methods have been developed. In [13, 14], co-training based approaches are adopted to share information among views, which improves the proximity matrices to fit multi-view data. Co-regularized approaches are also effective for view integration [15, 16]. Wang et al. propose a method that considers the neighborhood consistency of different views [17], while Xia et al. consider the low-rank and sparse properties of proximity matrices [18].

Despite the success of the aforementioned proximity-based methods, they suffer from some common problems. First, these methods require proximity matrices as inputs, while usually data features are given rather than proximity matrices. In this case, nearest-neighbor-based methods are applied to the data features to construct proximity matrices, such as k-nearest neighbors, Gaussian proximity [19] and self-tuned Gaussian [20]. However, these construction methods do not consider the underlying cluster structures, so the constructed proximities may not exhibit good properties for clustering. Moreover, they consider the information in each view separately, leading to the loss of inter-view correlations.

In order to address these problems, we propose a new multi-view proximity learning (MVPL) method for multi-view clustering. In multi-view proximity learning, both the relations between data objects within individual views and the correlations across different views are considered. For the intra-view relations, a novel idea of data representatives is adopted, so that the cluster structure is also taken into account during the learning process. Besides, a spectral-embedding-based scheme is designed for modeling the inter-view correlations, so that both the view consistency and view complement properties can be utilized to improve the clustering performance. Accordingly, an objective function is designed and an alternating iteration scheme is proposed to optimize it. Extensive experiments conducted on both synthetic and real-world datasets demonstrate the effectiveness of the proposed model.

2 The Proposed Model

In order to address the proximity learning problem for multi-view data, our model considers two parts. One is the intra-view learning quality, which means that the learning process should consider the relations between data objects within each view. Inspired by [21], the proposed model discovers these relations based on the idea of representatives. It transforms the original view features into a representation more suitable for proximity learning, by which the cluster structures are also considered. In particular, in each view, each feature vector has a dedicated representative that is very similar to itself, and the representatives of data objects with higher proximity should be similar to each other. The other part is to model the correlations across different views, such that both the view consistency and view complement properties can be utilized to improve the clustering performance. The view consistency property implies that the proximities learnt from different views reach a certain degree of consistency, while the view complement property implies that one view provides complementary information for the other views. Accordingly, a well-designed inter-view criterion function is proposed based on spectral embedding. For clarity, Fig. 1 illustrates the main idea of our method with a two-view example. From the figure, we can see that the data representatives are determined by both the view features and the learnt proximities. Similarly, the learnt proximities are derived from the intra-view data representatives and further mutually affect each other in an inter-view manner through the spectral embedding. In what follows, we introduce the model in detail.

Fig. 1. Illustration of the main idea of our method. In this simple example, the dataset contains a document view and an image view.

2.1 The Objective Function

Given a dataset containing n objects whose features are collected from m views, the features in the v-th view are represented by matrix \(X^v=[\mathbf {x}^v_1, \mathbf {x}^v_2, \dots , \mathbf {x}^v_n]\in \mathbb {R}^{d^v\times n}\), where \(d^v\) is the dimensionality of the v-th view and \(\mathbf {x}^v_i\) is the feature vector for the i-th object in the v-th view. The goal of the multi-view proximity learning is to learn a proximity set \(\{S^1, S^2, \dots , S^m\}\), where \(S^v=[s^v_{ij}]_{n\times n}\) is the proximity matrix for the v-th view with \(s^v_{ij}\) representing the proximity between the i-th and j-th objects in the v-th view. According to the discussion above, the learning process should consider both intra-view and inter-view criteria.

Intra-view Criterion. In order to discover the relations between data objects in individual views, we introduce the idea of data representatives, which are better representations of the data objects with clearer cluster structures. Intuitively, in this process each original data point is moved to a position more suitable for clustering according to its relations with the other data points. We use \(U^v=[\mathbf {u}^v_1, \mathbf {u}^v_2, \dots , \mathbf {u}^v_n] \in \mathbb {R}^{d^v\times n}\) to denote the representative matrix, where \(\mathbf {u}^v_i\) is the representative of the feature vector \(\mathbf {x}^v_i\) in the v-th view. Treating the original feature as an important basis for learning the representative, \(\mathbf {u}^v_i\) should not be far from \(\mathbf {x}^v_i\), otherwise the topological structure will be destroyed. Besides, the learning of representatives should also consider the proximities between objects: if two data objects have higher proximity in one view, then their representatives should be relatively closer. Similarly, the proximity learning should consider the relations between data representatives: if two representatives \(\mathbf {u}^v_i\) and \(\mathbf {u}^v_j\) are close in the v-th view, then \(s^v_{ij}\) should be relatively large. In other words, the learning processes of representatives and proximities mutually affect each other. According to the above discussion, the intra-view criterion is defined as follows,

$$\begin{aligned} \begin{aligned} \varPhi ^v(U^v, S^v)&= \frac{1}{n}\sum _{i=1}^n\Vert \mathbf {x}^v_i - \mathbf {u}^v_i\Vert _2^2 + \frac{\alpha }{n^2}\left( \sum _{i=1}^n \sum _{j=1}^n s^v_{ij}\Vert \mathbf {u}^v_i-\mathbf {u}^v_j\Vert _2^2 + \beta \Vert S^v\Vert _F^2\right) \\&\text {s.t.} \sum _{j=1}^n s^v_{ij}=1, s^v_{ij}\ge 0, \forall i,j \end{aligned} \end{aligned}$$
(1)

where \(\Vert \cdot \Vert _2\) is the \(L_2\) norm of a vector, \(\Vert \cdot \Vert _F\) is the Frobenius norm of a matrix and \(\alpha , \beta >0\) are trade-off parameters. In our paper, probabilistic proximities are used, and therefore the constraints \(\sum _{j=1}^n s^v_{ij}=1\) and \(s^v_{ij}\ge 0\) are introduced. The term \(\beta \Vert S^v\Vert _F^2\) controls the sparsity of the learnt proximity: a large \(\beta \) makes the learnt proximity matrix relatively dense, while a smaller \(\beta \) makes it sparser.

Inter-view Criterion. The inter-view criterion considers both the view consistency and view complement properties. We design this criterion by introducing the concept of spectral embedding. A spectral embedding is a low-dimensional representation of a data object, obtained through spectral decomposition of a specific matrix. In our model, the spectral embedding is a representation that integrates information from all views. Denoting the embedding matrix as \(F=[\mathbf {f}_1, \mathbf {f}_2, \dots , \mathbf {f}_n] \in \mathbb {R}^{c\times n}\) with \(\mathbf {f}_{i}\) being the c-dimensional spectral embedding of the i-th data object, the relation between F and the learnt proximity \(S^v\) can be modeled by

$$\begin{aligned} \begin{aligned} \frac{1}{2n^2} \sum _{i=1}^n \sum _{j=1}^n s^v_{ij} \Vert \mathbf {f}_i-\mathbf {f}_j\Vert _2^2~~~~~\text {s.t.}~FF^T = I \end{aligned} \end{aligned}$$
(2)

where I is the identity matrix. Here \(FF^T=I\) is a widely used constraint for weakening the relations between the features of the embedding, which makes F a better representation [19]. If the distance between \(\mathbf {f}_i\) and \(\mathbf {f}_j\) is small, it implies that the i-th and j-th data objects may have higher proximity in all views. The smaller the value of (2), the more consistent the learnt proximity of the v-th view is with the spectral embedding F. Since the spectral embedding F carries information from all views, a high consistency between F and \(S^v\) implies that information from the other views is transferred to the v-th view, which reflects the view complement property. Moreover, the proximities of different views can reach a certain degree of consistency through F; here F acts as a medium for inter-view interactions, which reflects the view consistency property. Considering all views together, we obtain the inter-view criterion as follows,

$$\begin{aligned} \begin{aligned} \varPsi (\{S^v\}, F)&= \frac{1}{2n^2} \sum _{v=1}^m \sum _{i=1}^n \sum _{j=1}^n s^v_{ij} \Vert \mathbf {f}_i-\mathbf {f}_j\Vert _2^2~~~~~\text {s.t.}~ FF^T = I \end{aligned} \end{aligned}$$
(3)

which models the inter-view correlations through the spectral embedding.
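As a reading aid, and anticipating the graph-Laplacian property stated as (13) in Sect. 2.3, the inter-view criterion can equivalently be written in trace form (this is the form that reappears in the F-subproblem (18)):

$$\begin{aligned} \varPsi (\{S^v\}, F) = \frac{1}{n^2} \sum _{v=1}^m \mathrm {Tr}\left( FL^v_S F^T\right) = \frac{1}{n^2} \mathrm {Tr}\left( FL_S F^T\right) , \quad L_S = \sum _{v=1}^m L^v_S, \end{aligned}$$

where \(L^v_S\) denotes the unnormalized Laplacian matrix of the symmetrized proximity \((S^v+{S^v}^T)/2\), as defined in Sect. 2.3.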

The Overall Objective Function. According to the discussion above, we can use \(\varPhi ^v(U^v, S^v)\) to measure the intra-view learning quality and \(\varPsi (\{S^v\}, F)\) to measure the inter-view consistency and complement properties. By integrating them together, we get the overall objective function as follows,

$$\begin{aligned} \begin{aligned}&\min _{\{U^v\}, \{S^v\}, F} \sum _{v=1}^m \varPhi ^v(U^v, S^v) + \gamma \varPsi (\{S^v\}, F)\\&\text {s.t.}~ \sum _{j=1}^n s^v_{ij}=1, s^v_{ij}\ge 0, \forall i, j,v, FF^T = I \end{aligned} \end{aligned}$$
(4)

where \(\gamma >0\) is the trade-off parameter balancing the intra-view criterion and the inter-view criterion. By minimizing the objective function (4), both the learning quality of proximities in all views and the interactions between different views are considered, such that suitable proximities for multi-view data can be obtained. Following the convention of spectral clustering, the dimensionality of spectral embedding can be set as the predefined number of clusters [19].

2.2 Determination of Parameter \(\beta \)

In the proposed model, three parameters are needed as inputs for proximity learning. Parameter \(\alpha \) controls the distances between data features and data representatives, while parameter \(\gamma \) controls the view consistency. Both should be determined according to the properties of the dataset. In comparison, \(\beta \) controls the sparsity of the learnt proximities and exhibits less variability across datasets. Therefore, it is desirable to design a method that determines its value more easily.

Inspired by [22], we propose a method based on k-nearest neighbors to determine \(\beta \). It also induces a method for constructing single-view proximities, which will be used in our experiments. Consider the data features in a certain view with data matrix \(X=[\mathbf {x}_1, \dots , \mathbf {x}_n]\in \mathbb {R}^{d\times n}\) (here we omit the superscript specifying the view index for simplicity). We can learn the proximity vector \(\mathbf {w}_i=[w_{i1}, w_{i2},\dots , w_{in}]^T\) associated with \(\mathbf {x}_i\) by solving the following model

$$\begin{aligned} \begin{aligned} \min _{\mathbf {w}_i} \frac{1}{2}\left\| \mathbf {w}_i + \frac{\mathbf {d}^\mathbf {x}_{i}}{2\beta _i}\right\| _2^2~~~~~\text {s.t.}~ {\mathbf {w}_i}^T \varvec{1} = 1, \mathbf {w}_i\ge 0, \end{aligned} \end{aligned}$$
(5)

where \(\beta _i>0\) is the sparsity parameter, \(\varvec{1}\) is the all-one vector and \(\mathbf {w}_i\ge 0\) means that all elements of \(\mathbf {w}_i\) are nonnegative. Denote the original distance vector by \(\hat{\mathbf {d}}^\mathbf {x}_i=[\hat{d}^\mathbf {x}_{i1}, \hat{d}^\mathbf {x}_{i2}, \dots , \hat{d}^\mathbf {x}_{in}]^T\), where \(\hat{d}^\mathbf {x}_{ii}\) is set to a very large number (i.e. \(\mathbf {x}_i\) itself is ignored) and \(\forall j\not = i, \hat{d}^\mathbf {x}_{ij}=\Vert \mathbf {x}_i-\mathbf {x}_j\Vert _2^2\). The distance vector \(\mathbf {d}^\mathbf {x}_i\) in (5) is defined by \(\mathbf {d}^\mathbf {x}_i=[d^\mathbf {x}_{i1}, d^\mathbf {x}_{i2}, \dots , d^\mathbf {x}_{in}]^T\), which is the sorted version of \(\hat{\mathbf {d}}^\mathbf {x}_{i}\) such that \(d^\mathbf {x}_{i1}\le d^\mathbf {x}_{i2}\le \dots \le d^\mathbf {x}_{in}\). In the model, the parameter \(\beta _i\) determines the number of nonzero elements in the proximity vector \(\mathbf {w}_i\). If \(\beta _i\rightarrow 0\), there will be only one nonzero element, corresponding to the nearest neighbor of \(\mathbf {x}_i\). If \(\beta _i\rightarrow \infty \), all elements will be nonzero except the one corresponding to \(\mathbf {x}_i\) itself. In order to solve problem (5), we write down its Lagrangian function as

$$\begin{aligned} \begin{aligned} \mathcal {L}(\mathbf {w}_i, \eta , \mathbf {\mu }_i) = \frac{1}{2}\left\| \mathbf {w}_i + \frac{\mathbf {d}^\mathbf {x}_{i}}{2\beta _i}\right\| _2^2 - \eta \left( {\mathbf {w}_i}^T \varvec{1}-1\right) - \mathbf {\mu }_i^T \mathbf {w}_i \end{aligned} \end{aligned}$$
(6)

where \(\eta \) and \(\mathbf {\mu }_i\ge 0\) are Lagrangian multipliers. According to the KKT condition, the optimal solution of \(\mathbf {w}_i\) is given by

$$\begin{aligned} \begin{aligned} w_{ij} = \max \left( -\frac{d^\mathbf {x}_{ij}}{2\beta _i} + \eta , 0\right) . \end{aligned} \end{aligned}$$
(7)

If there are exactly k nonzero elements in \(\mathbf {w}_i\), the Lagrangian multiplier is \(\eta =\frac{1}{k} + \frac{1}{2k\beta _i} \sum _{j=1}^k d^\mathbf {x}_{ij}\) [22]. These k nonzero elements correspond to the k nearest neighbors of \(\mathbf {x}_i\), and the elements of \(\mathbf {w}_i\) satisfy \(\forall 1\le j\le k, w_{ij}>0\) and \(\forall j\ge k+1, w_{ij}=0\). According to the constraint \({\mathbf {w}_i}^T \varvec{1} = 1\), the sparsity parameter \(\beta _i\) can be set as

$$\begin{aligned} \begin{aligned} \beta _i = \frac{k}{2}d^\mathbf {x}_{i,k+1} - \frac{1}{2}\sum _{j=1}^k d^\mathbf {x}_{ij}, \end{aligned} \end{aligned}$$
(8)

such that the resulting \(\mathbf {w}_i\) will have exactly k nonzero elements. Considering all data objects, the sparsity parameter \(\beta \) can be set as the average of \(\beta _i\), which is given by

$$\begin{aligned} \begin{aligned} \beta = \frac{1}{n} \sum _{i=1}^n \left( \frac{k}{2}d^\mathbf {x}_{i,k+1} - \frac{1}{2}\sum _{j=1}^k d^\mathbf {x}_{ij}\right) . \end{aligned} \end{aligned}$$
(9)
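For concreteness, the following numpy sketch (the function and variable names are our own, not taken from the paper's released code) computes the sparsity parameter via (8)–(9) and the induced weighted k-nearest-neighbor proximity via the closed-form solution (7):

```python
import numpy as np

def knn_sparsity_and_proximity(X, k):
    """Compute the sparsity parameter of Eq. (9) and the weighted kNN
    proximity of Eq. (7) for one view with feature matrix X (d x n)."""
    n = X.shape[1]
    # Pairwise squared Euclidean distances between the columns of X.
    sq = np.sum(X ** 2, axis=0)
    D = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(D, np.inf)              # ignore x_i itself
    idx = np.argsort(D, axis=1)              # neighbors ordered by distance
    Ds = np.take_along_axis(D, idx, axis=1)  # sorted distance vectors d_i

    # Per-object sparsity parameter, Eq. (8), and its average, Eq. (9).
    beta_i = 0.5 * (k * Ds[:, k] - Ds[:, :k].sum(axis=1))
    beta = beta_i.mean()

    # Weighted kNN proximity, Eq. (7): with beta_i chosen as in Eq. (8), the
    # k nonzero weights reduce to (d_{i,k+1} - d_{ij}) / (k d_{i,k+1} - sum d_{ij}),
    # so each row has exactly k nonzero entries and sums to one.
    W = np.zeros((n, n))
    for i in range(n):
        denom = k * Ds[i, k] - Ds[i, :k].sum() + 1e-12
        W[i, idx[i, :k]] = (Ds[i, k] - Ds[i, :k]) / denom
    return beta, W
```

The closed form used in the loop follows from substituting (8) into (7) and simplifying, which also makes the row-sum constraint hold automatically.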

Using the method above, we can determine the sparsity parameter according to the number of neighbors k, which is much easier to tune. Furthermore, the single-view weighted k-nearest neighbors proximity can be constructed after k is determined. For multi-view data, since different views may have different distance distributions, it is more reasonable to use different sparsity parameters for different views. Therefore, the modified intra-view criterion function is given by

$$\begin{aligned} \begin{aligned} \widetilde{\varPhi }^v(U^v, S^v) = \frac{1}{n}\sum _{i=1}^n\Vert \mathbf {x}^v_i - \mathbf {u}^v_i\Vert _2^2 + \frac{\alpha }{n^2}\left( \sum _{i=1}^n \sum _{j=1}^n s^v_{ij}\Vert \mathbf {u}^v_i-\mathbf {u}^v_j\Vert _2^2 + \beta ^v\Vert S^v\Vert _F^2\right) \end{aligned} \end{aligned}$$
(10)

where \(\beta ^v>0\) is the sparsity parameter for the v-th view determined by the aforementioned method via the number of neighbors k. Finally, our objective is given by

$$\begin{aligned} \begin{aligned}&\min _{\{U^v\}, \{S^v\}, F} \sum _{v=1}^m \widetilde{\varPhi }^v(U^v, S^v) + \gamma \varPsi (\{S^v\}, F) \\&\text {s.t.}~ \sum _{j=1}^n s^v_{ij}=1, s^v_{ij}\ge 0, \forall i, j,v, FF^T = I. \end{aligned} \end{aligned}$$
(11)

Although more sparsity parameters are introduced to control the model in (11) compared with (4), they can be determined via the same number of nearest neighbors k.

2.3 Optimization

In this subsection, an alternating iteration scheme is used to solve problem (11).

Update \(U^v\). When \(S^v\) and F are fixed, the subproblem with respect to \(U^v\) is given by

$$\begin{aligned} \begin{aligned} \min _{U^v} \frac{1}{n}\sum _{i=1}^n \Vert \mathbf {x}^v_i - \mathbf {u}^v_i\Vert _2^2 + \frac{\alpha }{n^2}\sum _{i=1}^n\sum _{j=1}^n s^v_{ij}\Vert \mathbf {u}^v_i-\mathbf {u}^v_j\Vert _2^2. \end{aligned} \end{aligned}$$
(12)

In order to rewrite the subproblem in matrix form, we make use of the following property [19],

$$\begin{aligned} \begin{aligned} \mathrm {Tr}\left( HL_GH^T\right) = \frac{1}{2} \sum _{i=1}^n \sum _{j=1}^n g_{ij} \Vert \mathbf {h}_i-\mathbf {h}_j\Vert _2^2 \end{aligned} \end{aligned}$$
(13)

where \(\mathrm {Tr}(\cdot )\) is the trace operator for matrices, \(G=[g_{ij}]\in \mathbb {R}^{a\times a}\) is a symmetric matrix and \(H=[\mathbf {h}_1, \mathbf {h}_2, \dots , \mathbf {h}_a]\in \mathbb {R}^{b\times a}\). \(L_G\) is the unnormalized Laplacian matrix of G defined by \(L_G=D_G-G\), where \(D_G\) is the degree matrix of G. Using the property (13) on the symmetrized proximity, the subproblem can be transformed into

$$\begin{aligned} \begin{aligned} \min _{U^v} \Vert X^v - U^v\Vert _F^2 + \frac{2\alpha }{n} \mathrm {Tr}\left( U^v L^v_S {U^v}^T\right) \end{aligned} \end{aligned}$$
(14)

where \(L^v_S\) is the unnormalized Laplacian matrix of \((S^v+{S^v}^T)/2\). Setting the derivative with respect to \(U^v\) to zero, we find that \(U^v\) satisfies the following equation,

$$\begin{aligned} \begin{aligned} U^v\left( I + \frac{2\alpha }{n} L^v_S\right) = X^v, \end{aligned} \end{aligned}$$
(15)

which can be solved by matrix inversion. Besides, the problem is essentially a least-squares problem, which can also be solved efficiently in many other ways.
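As an illustrative sketch (with our own function names, assuming numpy), the update of \(U^v\) reduces to a single linear solve of (15):

```python
import numpy as np

def update_U(X, S, alpha):
    """Update the representatives of one view by solving Eq. (15):
    U (I + (2*alpha/n) L_S) = X, where L_S is the unnormalized Laplacian
    of the symmetrized proximity (S + S^T) / 2."""
    n = X.shape[1]
    A = 0.5 * (S + S.T)                        # symmetrize the proximity
    L = np.diag(A.sum(axis=1)) - A             # unnormalized Laplacian
    M = np.eye(n) + (2.0 * alpha / n) * L      # coefficient matrix of Eq. (15)
    # U = X M^{-1}; solving the transposed system avoids forming the inverse.
    return np.linalg.solve(M.T, X.T).T
```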

Update \(S^v\). When \(U^v\) and F are fixed, the subproblem with respect to \(S^v\) is given by

$$\begin{aligned} \begin{aligned}&\min _{S^v} \sum _{i=1}^n \sum _{j=1}^n s^v_{ij}\Vert \mathbf {u}^v_i-\mathbf {u}^v_j\Vert _2^2 + \beta ^v\sum _{i=1}^n \sum _{j=1}^n {s^v_{ij}}^2 + \frac{\gamma }{2\alpha } \sum _{i=1}^n \sum _{j=1}^n s^v_{ij} \Vert \mathbf {f}_i-\mathbf {f}_j\Vert _2^2 \\&\text {s.t.}~ \sum _{j=1}^n s^v_{ij}=1, s^v_{ij}\ge 0, \forall i,j. \end{aligned} \end{aligned}$$
(16)

By denoting \(d^v_{ij} = \Vert \mathbf {u}^v_i-\mathbf {u}^v_j\Vert _2^2 + \frac{\gamma }{2\alpha } \Vert \mathbf {f}_i-\mathbf {f}_j\Vert _2^2\), \(\mathbf {d}^v_i=[d^v_{i1}, d^v_{i2}, \dots , d^v_{in}]^T\) and \(\mathbf {s}^v_i=[s^v_{i1}, s^v_{i2}, \dots , s^v_{in}]^T\), we translate the problem into vector form as follows

$$\begin{aligned} \begin{aligned}&\min _{\mathbf {s}^v_i} \left\| \mathbf {s}^v_i + \frac{\mathbf {d}^v_i}{2\beta ^v}\right\| _2^2 \\&\text {s.t.}~ {\mathbf {s}^v_i}^T\varvec{1}=1, \mathbf {s}^v_i\ge 0, \end{aligned} \end{aligned}$$
(17)

which is equivalent to computing the Euclidean projection of the point \(-\mathbf {d}^v_i/(2\beta ^v)\) onto the probability simplex. This problem has a unique solution, which can be computed efficiently with the method proposed in [23].
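A minimal sketch of this step is given below; the simplex projection follows the standard sort-and-threshold procedure (the routine of [23] can be substituted directly), and all names are our own:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {s : s >= 0, sum(s) = 1} using the
    standard sort-and-threshold procedure."""
    u = np.sort(v)[::-1]                       # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def update_S_row(u_dist_row, f_dist_row, alpha, beta_v, gamma):
    """Solve Eq. (17) for one row s_i^v of the proximity matrix.
    u_dist_row[j] = ||u_i - u_j||^2 and f_dist_row[j] = ||f_i - f_j||^2."""
    d = u_dist_row + (gamma / (2.0 * alpha)) * f_dist_row   # d^v_{ij}
    return project_to_simplex(-d / (2.0 * beta_v))
```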

Update F. When \(U^v\) and \(S^v\) are fixed, the subproblem with respect to F is to solve a trace minimization problem as

$$\begin{aligned} \begin{aligned} \min _{FF^T=I} \mathrm {Tr}(FL_S F^T) \end{aligned} \end{aligned}$$
(18)

where \(L_S=\sum _{v=1}^m L^v_S\). The optimal F is a matrix formed by the c eigenvectors of \(L_S\) corresponding to the c smallest eigenvalues.
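A corresponding sketch of the F-update (our own naming, using numpy's symmetric eigensolver) is:

```python
import numpy as np

def update_F(S_list, c):
    """Update the spectral embedding by solving Eq. (18): the rows of F are
    the c eigenvectors of L_S = sum_v L_S^v with the smallest eigenvalues."""
    n = S_list[0].shape[0]
    L_S = np.zeros((n, n))
    for S in S_list:
        A = 0.5 * (S + S.T)                    # symmetrize each proximity
        L_S += np.diag(A.sum(axis=1)) - A      # accumulate the Laplacians
    eigvals, eigvecs = np.linalg.eigh(L_S)     # eigenvalues in ascending order
    return eigvecs[:, :c].T                    # c x n matrix satisfying F F^T = I
```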

By alternately updating \(U^v\), \(S^v\) and F, the objective value decreases and eventually converges as the iterations proceed, from which the solution of problem (11) is obtained. The optimization algorithm is summarized in Algorithm 1. After learning the proximity matrices, spectral clustering is applied to them to obtain the clustering results.

Algorithm 1. The alternating optimization algorithm of MVPL for problem (11).
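Putting the steps together, a compact sketch of the alternating procedure of Algorithm 1 might look as follows. It reuses the illustrative helper functions sketched above (knn_sparsity_and_proximity, update_U, update_S_row, update_F); the default values match the settings used in the experiments (\(\alpha =1\), \(\gamma =0.001\), \(k=30\), about 30 iterations), but are only examples:

```python
import numpy as np

def mvpl(X_list, c, k=30, alpha=1.0, gamma=1e-3, n_iter=30):
    """Illustrative sketch of Algorithm 1 (alternating optimization of (11)).
    X_list: list of d^v x n feature matrices, one per view."""
    m, n = len(X_list), X_list[0].shape[1]
    # Initialization: per-view sparsity beta^v and proximities from Sect. 2.2.
    betas, S_list = zip(*[knn_sparsity_and_proximity(X, k) for X in X_list])
    S_list = [S.copy() for S in S_list]
    F = update_F(S_list, c)

    for _ in range(n_iter):
        for v in range(m):
            U = update_U(X_list[v], S_list[v], alpha)            # Eq. (15)
            # Pairwise squared distances between representatives / embeddings.
            du = np.sum((U[:, :, None] - U[:, None, :]) ** 2, axis=0)
            df = np.sum((F[:, :, None] - F[:, None, :]) ** 2, axis=0)
            for i in range(n):                                    # Eq. (17)
                S_list[v][i] = update_S_row(du[i], df[i], alpha,
                                            betas[v], gamma)
        F = update_F(S_list, c)                                   # Eq. (18)
    return S_list, F
```

After the loop, spectral clustering can be applied to the learnt proximity matrices to obtain the final clusters, as described above.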

3 Experiment

In this section, extensive experiments are conducted to demonstrate the effectiveness of the proposed method on one synthetic dataset and four real-world datasets. On the synthetic dataset, we show how the proposed method works, while on the real-world datasets, parameter analysis, convergence analysis and comparison experiments are conducted. The code of our method and the testing datasets are available on Dropbox (Footnote 1).

Fig. 2. Synthetic experiment. In the figures, points in the first class are shown in blue and those in the second class in red. Green lines are edges representing the proximities between data objects, i.e., if the proximity between two data objects in a certain view is larger than zero, then there is an edge between them. (Color figure online)

3.1 Synthetic Experiment

A synthetic dataset consisting of two views, namely Two-Gaussian and Two-moon, is used to demonstrate how the proposed method works. Figure 2(a) and (d) plot the original data points in both views, with edges representing the initial proximities learnt by the method introduced in Sect. 2.2. In order to show the significance of the inter-view criterion, a variant of our method, called SVPL, is introduced by setting \(\gamma =0\). It is a single-view proximity learning method that considers only the intra-view criterion. Figure 2(b) and (e) show the results of SVPL, where the points denote the learnt representatives. From these subfigures, we find that SVPL transforms each original view into a state more suitable for clustering based on representatives. It is essentially equivalent to moving the data points in such a manner that the intra-class connections become stronger and the inter-class connections become weaker. However, the learnt proximity is not good enough, since there are still edges between the two clusters. Therefore, we need to consider the inter-view information. Figure 2(c) and (f) show the results of MVPL, where the points denote the representatives learnt by MVPL. From these two subfigures, we find that there are no edges between clusters in either view. This implies that much better proximities are learnt by considering both the intra-view and the inter-view criteria. The comparison confirms the significance of the inter-view criterion.

3.2 Real-World Datasets and Evaluation Measures

In this subsection, we will first introduce the four real-world datasets used in experiments.

  1. Handwritten numeral dataset

     The Multiple Features (Mfeat) dataset is a handwritten numeral dataset from the UCI machine learning repository [24]. It contains handwritten digits from 0 to 9, and each category has 200 objects. In our experiment, we use three kinds of features to represent the images, namely 216 profile correlations, 76 Fourier coefficients and 47 Zernike moments, where each kind of feature is regarded as a view.

  2. Multi-source news dataset

     The 3Sources dataset (Footnote 2) is a multi-source news dataset consisting of news collected from three sources, namely BBC, Guardian and Reuters. Although the original dataset contains 984 news articles covering 416 distinct news stories, only 169 stories are reported by all three sources. In our experiment, we only use these 169 news objects, so that each object has three views of features.

  3. Object image datasets

     Caltech101 [25] is an image dataset consisting of 101 categories of images for the object recognition problem. Following previous work [26], two subsets are selected to generate two datasets for experimental purposes. The first subset, called Caltech101-7, contains 1474 images from 7 widely used categories. The second is a larger subset called Caltech101-20, which contains 2386 images from 20 categories. Three kinds of features are extracted from the images to generate three views, namely the 1984-dimensional HOG feature, the 512-dimensional GIST feature and the 928-dimensional LBP feature.

The statistics of the four real-world datasets are shown in Table 1.

Table 1. Statistics of the four real-world datasets.

In order to evaluate the clustering performance of the proposed method and the compared methods, three widely used measures are adopted in our experiments, namely accuracy (ACC), normalized mutual information (NMI) and purity (PUR). For each measure, a higher value indicates better performance [17]. In the comparison experiments, following [21], the average rank of the performance obtained by each method across all datasets is also reported.

3.3 Parameter Analysis

In this subsection, parameter analysis is conducted to show the effect of the three parameters \(\alpha \), \(\gamma \) and k. The first parameter to be analyzed is k, which determines the value of \(\beta ^v\). Fixing \(\alpha =1\) and \(\gamma =0.001\), we tune the value of k in the range [5, 70] with step 5. The performance in terms of all three measures on the four datasets is reported in Fig. 3. From the figure, we find that the method does not perform well when k is too small, due to the failure to preserve the neighborhood structures. As the value of k increases, the performance gradually improves. After reaching the highest point (often around \(k=30\)), the curves gradually decrease. Although the method may perform less well with relatively large k, it still produces acceptable results. The main reason is that the idea of representatives, which transforms the original data into a state more suitable for proximity learning, alleviates the negative impact of noisy neighbors.

Fig. 3. Parameter analysis on the number of neighbors k.

Fig. 4. Parameter analysis on \(\alpha \) and \(\gamma \) on Mfeat.

Fig. 5. Parameter analysis on \(\alpha \) and \(\gamma \) on 3Sources.

Fig. 6. Parameter analysis on \(\alpha \) and \(\gamma \) on Caltech101-7.

Fig. 7. Parameter analysis on \(\alpha \) and \(\gamma \) on Caltech101-20.

Next, we analyze the effect of \(\alpha \) and \(\gamma \) by setting \(k=30\). According to the properties of the datasets, different ranges of \(\gamma \) are used for different datasets, while the same range of \(\alpha \) is used for all datasets. The experimental results are shown in Figs. 4, 5, 6 and 7, respectively. From the figures, we find that our method achieves similar performance for similar ratios \(\gamma /\alpha \). Moreover, the value of \(\alpha \) should not be set too large, since this may lead to the loss of topological-structure information. In practice, users can select the value of \(\alpha \) in [0.5, 1] and the value of \(\gamma \) from \(\{0.01, 0.001, 0.0001\}\), by which satisfactory performance can be obtained.

3.4 Convergence Analysis

In this subsection, convergence analysis is conducted to explore the convergence property of the proposed iterative algorithm by setting \(\alpha =1\), \(\gamma =0.001\) and \(k=30\). Figure 8 plots the log value of the objective as a function of the iteration step. From the subfigures, we find that the log values of the objective decrease rapidly during the iterations on all four datasets. Usually, the algorithm converges within 30 iterations.

Fig. 8. Convergence analysis of the optimization.

3.5 Comparison Experiment

In this subsection, we compare the proposed MVPL method with several state-of-the-art algorithms. Two types of clustering methods are used for comparison, namely traditional single-view clustering methods and multi-view clustering methods. For the single-view methods, three representative algorithms are selected, namely k-means (KM) [27], normalized cut (NCut) [19] and robust continuous clustering (RCC) [21]. These single-view methods operate on each individual view, and the best results among the views are reported. For the multi-view clustering methods, five state-of-the-art algorithms are used, namely multi-view k-means (MVKM) [28], multi-view spectral clustering (MVSC) [12], co-training multi-view clustering (CoTrn) [13], co-regularized multi-view clustering (CoReg) [15] and multi-view learning with adaptive neighbors (MLAN) [29]. Following [13], for the methods that generate multiple view-specific clustering results (i.e. CoTrn, CoReg and MVPL), prior knowledge is used to select the most informative view for comparison purposes. For all the spectral-clustering-like compared methods, we use the method described in Sect. 2.2 to construct the proximity matrices, which has been shown to be a good method for proximity construction [30], and the sparsity of the proximity matrices is determined by the number of neighbors k. We tune k in the range [10, 50] and select the best proximity according to the three measures for all the methods. For all spectral-clustering-like and k-means-like methods, we set the number of clusters c to the ground-truth number. Besides, for all the methods involving k-means, we run each algorithm 50 times under the same parameter setting and report the result with the smallest objective. For all the methods, the best parameters are tuned as suggested by their authors.

Table 2. Clustering results in terms of ACC on all datasets.
Table 3. Clustering results in terms of NMI on all datasets.
Table 4. Clustering results in terms of PUR on all datasets.

The comparison results obtained by all the methods on the four real-world datasets in terms of ACC, NMI and PUR are reported in Tables 2, 3 and 4, respectively. In the tables, the best performance among all the methods is highlighted in bold. From the tables, we find that the proposed MVPL method outperforms all the other methods in terms of ACC and NMI. In particular, our method achieves on average a 14% improvement in ACC and a 12% improvement in NMI over all the datasets. For PUR, although our method does not reach the highest value on every dataset, it still ranks first on average. Overall, the comparison results demonstrate the effectiveness of the proposed method.

4 Conclusion

In this paper, we propose a novel proximity learning method for multi-view clustering, called multi-view proximity learning. Through this method, proximities between data objects described by multiple views of features can be obtained, which are suitable for multi-view clustering. Our method adopts two criteria to fulfill this task, namely the intra-view criterion and the inter-view criterion. For the intra-view part, we not only make use of the relations between data objects but also take the cluster structures within individual views into account. For the inter-view part, we model the correlations between views based on spectral embedding, which utilizes the view consistency and complement properties to improve the learning performance. Extensive experiments conducted on both synthetic and real-world datasets demonstrate the effectiveness of our method.