1 Introduction

As a fundamental research field in machine learning and data mining, clustering has been widely used in various applications. Multiple Kernel Clustering (MKC) methods are widely investigated to incorporate complementary information across kernels and to avoid the kernel selection or design problem. The MKC paradigm has received considerable attention, and several methods have been proposed recently. In general, existing MKC algorithms can be roughly grouped into three categories. The first category takes the early fusion strategy, where one sub-task is to learn a consensus kernel from multiple candidate kernels and another is to perform clustering on this single consensus kernel [5, 7, 8, 10]. Algorithms in the second category also take the early fusion strategy, via the paradigm of multiple graph clustering. One sub-task is to extract multiple affinity graphs from the kernels, and the other is to integrate these affinity graphs into a final consensus graph. These two sub-tasks can be performed separately as in [6, 15] or optimized jointly as in [14, 16]. The third category adopts the late fusion strategy, where multiple base partitions are first generated from individual kernels and then integrated to build the final consensus partition [9, 11, 17].

Based on the inherent connection between consensus kernel learning and clustering on the consensus kernel, most existing MKC algorithms jointly optimize these procedures within a unified learning framework, where the two targets are often balanced manually by additional hyper-parameters. These algorithms often perform well with suitable parameters found by a grid search, which uses the ground truth to determine the proper values. However, the limitation of such a strategy is also clear. On the one hand, the performance of MKC algorithms is less stable and depends heavily on the choice of hyper-parameters. On the other hand, it is less applicable, and often infeasible, to search for the best hyper-parameters via grid search in the unsupervised MKC scenario, where no label information is available. Therefore, it is much preferable to develop MKC methods that require little parameter tuning or are even parameter-free.

To address the aforementioned problems of candidate kernel weighting and hyper-parameter sensitivity, we propose TRMKC, which learns the optimal neighborhood kernel by minimizing the approximation errors between the consensus kernel and the base kernels under the correntropy-induced metric (CIM), so that the large errors caused by poor-quality kernels are largely suppressed, while good kernels, which correspond to small errors, are carefully emphasized. It should be noted that the kernel weights are estimated directly from the optimization problem, unlike the plug-and-play strategy in [13, 19]. TRMKC integrates consensus kernel learning and clustering by maximizing the trace ratio criterion without introducing additional hyper-parameters, so that these two coupled sub-procedures are balanced automatically and their performance can be mutually reinforced. To solve the optimization problem with the trace ratio criterion and the correntropy-induced non-quadratic loss, we present an alternating optimization strategy with a monotonic convergence proof, which reformulates it into a series of sub-problems with trace difference and quadratic programming by utilizing half-quadratic optimization techniques. Extensive experiments on 11 benchmark datasets against ten recent MKC algorithms demonstrate the effectiveness of TRMKC.

2 Design of TRMKC

Given multiple kernel Gram matrices \(\{\mathbf {K}^i\}_{i=1}^{m}\) on n samples, where \(\mathbf {K}^i \in \mathcal {R}^{n \times n}\) is the i-th kernel matrix and is positive semi-definite (PSD) as required, the MKC algorithm aims to identify the consensus structure of c clusters across these candidate kernels.

2.1 Consensus Kernel Learning via CIM

Instead of using a traditional quadratic function as in [13, 19], the approximation error between the consensus kernel \(\mathbf {K}\) and each individual kernel \(\mathbf {K}^i\) is measured by the correntropy-induced non-quadratic loss function (CIM), which leads to the following optimization problem for consensus kernel learning,

$$\begin{aligned} \min _{\mathbf {K}} \quad \sum _{i=1}^{m} \ell (\mathbf {K}, \mathbf {K}^i) = \sum _{i=1}^{m}\left( 1 - \exp (-\frac{||\mathbf {K} - \mathbf {K}^i||^2}{\delta ^2}) \right) , \quad \mathrm {s.t.}\quad \mathbf {K} \succeq 0, \end{aligned}$$
(1)

where \(\delta \) is the bandwidth of the Gaussian function, which is estimated automatically in this paper. From the perspective of the loss function, the benefits of the above non-quadratic loss over the traditional quadratic one are twofold. Firstly, the CIM loss is upper bounded and changes slowly for large errors, so the effect of a low-quality kernel with a large error is largely suppressed. Secondly, the CIM loss changes quickly for small errors; as a result, similar high-quality kernels, which correspond to small errors, can be carefully distinguished. This different behavior of the CIM loss on large and small errors makes it appropriate for characterizing the consensus kernel across the candidate kernels.
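To make this behavior concrete, the following minimal sketch (our own illustration; the bandwidth value and the toy error magnitudes are assumptions, not taken from the paper) contrasts the quadratic loss with the CIM loss of Eq. (1):

```python
import numpy as np

def quadratic_loss(err_norm_sq):
    """Traditional quadratic loss on the squared approximation error."""
    return err_norm_sq

def cim_loss(err_norm_sq, delta):
    """CIM loss of Eq. (1): bounded in [0, 1), saturates for large errors."""
    return 1.0 - np.exp(-err_norm_sq / delta ** 2)

# Illustrative squared errors ||K - K^i||^2 for a "good" and a "bad" base kernel.
errors = np.array([0.5, 50.0])
delta = 1.0  # bandwidth; the paper estimates it automatically, here fixed for illustration

print(quadratic_loss(errors))   # [ 0.5 50. ]      -> the bad kernel dominates the sum
print(cim_loss(errors, delta))  # [~0.39 ~1.0]     -> the bad kernel's influence is capped
```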

From the perspective of kernel weighting, implicit kernel weights can also be derived from the CIM loss according to half-quadratic optimization theory [1]: larger weights are assigned to good kernels with small errors and smaller weights to bad kernels with large errors. Moreover, we further introduce an explicit weight for each kernel by solving the following problem,

$$\begin{aligned} \min _{\mathbf {K}, \mathbf {s}} \quad \sum _{i=1}^{m} \frac{1}{s_i}\left( 1 - \exp (-\frac{||\mathbf {K} - \mathbf {K}^i||^2}{\delta ^2}) \right) , \mathrm {s.t.}\quad \mathbf {K} \succeq 0, \sum _{i=1}^{m}s_i = 1, s_i \ge 0, \text { } \forall i, \end{aligned}$$
(2)

where \(\mathbf {s} \in \mathcal {R}^{m \times 1}\) and \(\frac{1}{s_i}\) can be seen as the explicit weight for the i-th kernel.

2.2 Clustering on Consensus Kernel

Spectral clustering and k-means are two major traditional clustering methods. Ratio cut aims to maximize the mean similarity between data points in the same cluster, while k-means aims to minimize the mean distance between points in the same cluster. Recently, it has been shown that both k-means and ratio cut can be unified into the following framework [12],

$$\begin{aligned} \max _{\mathbf {H}} \quad \mathrm {tr}(\mathbf {H}^T \mathbf {A} \mathbf {H}), \quad \mathrm {s.t.}\quad \mathbf {H}^T \mathbf {H} = \mathbf {I}, \end{aligned}$$
(3)

where \(\mathbf {H} \in \mathcal {R}^{n\times c}\) is the partition matrix and \(\mathbf {A} \in \mathcal {R}^{n \times n }\) is the similarity matrix. The only difference between ratio cut and k-means in this view is that the former usually uses a Gaussian kernel to capture pairwise similarity, while k-means adopts the linear kernel. Based on this unified view, the final clustering on the consensus kernel can also be achieved by solving the optimization problem in Eq. (3) with \(\mathbf {A} = \mathbf {K}\).

2.3 The Proposed TRMKC

We have two sub-tasks, i.e., minimizing Eq. (2) for consensus kernel learning and maximizing Eq. (3) for clustering on the consensus kernel. Instead of introducing an additional hyper-parameter to balance these targets, we take the trace ratio criterion to incorporate the two coupled procedures into a unified learning framework, which can be presented as follows,

$$\begin{aligned} \max _{\mathbf {s}, \mathbf {H}, \mathbf {K}} \quad&\frac{\mathrm {tr}(\mathbf {H}^T \mathbf {K} \mathbf {H})}{\sum _{i=1}^{m} \frac{1}{s_i} (1 - \exp (-\frac{||\mathbf {K} - \mathbf {K}^i||_{F}^2}{\delta ^2}) )} \\ \mathrm {s.t.}\quad&\mathbf {H}^T \mathbf {H} = \mathbf {I}, \mathbf {K} \succeq 0, \sum _{i=1}^{m} s_i =1, s_i \ge 0, \forall i. \nonumber \end{aligned}$$
(4)

Eq. (4) learns the consensus kernel around the individual kernels by minimizing the weighted correntropy-induced non-quadratic loss in the denominator and performs clustering on the consensus kernel by maximizing the quadratic term in the numerator. The consensus kernel \(\mathbf {K}\) appears in both the numerator and the denominator, so the two targets are balanced automatically without any hyper-parameter. Based on these discussions, the optimization problem in Eq. (4) is well suited for MKC.
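For intuition, here is a hedged sketch of how the objective in Eq. (4) could be evaluated for given \(\mathbf {K}\), \(\mathbf {H}\), and \(\mathbf {s}\), e.g., to monitor convergence of the alternating scheme in Sect. 2.4; the function names are ours, not part of the method's specification:

```python
import numpy as np

def cim_denominator(K, kernels, s, delta):
    """Weighted CIM approximation error between the consensus K and the base kernels."""
    errs = np.array([np.linalg.norm(K - Ki, 'fro') ** 2 for Ki in kernels])
    return np.sum((1.0 - np.exp(-errs / delta ** 2)) / s)

def trace_ratio_objective(K, H, kernels, s, delta):
    """Objective of Eq. (4): clustering quality over weighted consensus-kernel error."""
    numerator = np.trace(H.T @ K @ H)
    return numerator / cim_denominator(K, kernels, s, delta)
```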

2.4 Optimization

We present an alternating optimization algorithm to maximize Eq. (4).

Update \(\mathbf {H}\) with fixed \(\mathbf {K}\) and \(\mathbf {s}\). Given \(\mathbf {s}\) and \(\mathbf {K}\), the sub-problem with respect to \(\mathbf {H}\) can be simplified as

$$\begin{aligned} \max _{\mathbf {H}} \quad \mathrm {tr}(\mathbf {H}^T \mathbf {K} \mathbf {H}), \quad \mathrm {s.t.}\quad \mathbf {H}^T \mathbf {H} = \mathbf {I}. \end{aligned}$$
(5)

Since the consensus kernel \(\mathbf {K}\) is PSD, the optimal \(\mathbf {H}\) is given by the c eigenvectors of \(\mathbf {K}\) corresponding to its c largest eigenvalues.
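A minimal sketch of this \(\mathbf {H}\)-update, assuming \(\mathbf {K}\) is stored as a symmetric NumPy array:

```python
import numpy as np

def update_H(K, c):
    """Return the c eigenvectors of K associated with its c largest eigenvalues (Eq. 5)."""
    eigvals, eigvecs = np.linalg.eigh(K)                 # ascending order for symmetric K
    return eigvecs[:, np.argsort(eigvals)[::-1][:c]]     # columns: top-c eigenvectors
```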

Update \(\mathbf {s}\) with fixed \(\mathbf {K}\) and \(\mathbf {H}\). Denoting \(h_i = 1 - \exp (-\frac{||\mathbf {K} - \mathbf {K}^i||_{F}^2}{\delta ^2})\), the remaining problem w.r.t. \(\mathbf {s}\) can be rewritten as

$$\begin{aligned} \min _{\mathbf {s}} \quad&\sum _{i=1}^{m} \frac{h_i}{s_i}, \quad \mathrm {s.t.}\sum _{i=1}^{m} s_i =1, s_i \ge 0, \forall i. \end{aligned}$$
(6)

By the Cauchy-Schwarz inequality (or, equivalently, the KKT conditions of the Lagrangian), the optimal solution is \( s_i = \frac{\sqrt{h_i}}{\sum _{i'=1}^{m} \sqrt{h_{i'}}}\).
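A corresponding sketch of the closed-form \(\mathbf {s}\)-update (helper name and array layout are our assumptions):

```python
import numpy as np

def update_s(K, kernels, delta):
    """Closed-form kernel-weight update of Eq. (6): s_i proportional to sqrt(h_i)."""
    errs = np.array([np.linalg.norm(K - Ki, 'fro') ** 2 for Ki in kernels])
    h = 1.0 - np.exp(-errs / delta ** 2)
    s = np.sqrt(h)
    return s / s.sum()
```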

Update \(\mathbf {K}\) with fixed \(\mathbf {H}\) and \(\mathbf {s}\). Given \(\mathbf {H}\) and \(\mathbf {s}\), the problem w.r.t. \(\mathbf {K}\) can be written as,

$$\begin{aligned} \max _{\mathbf {K}} \quad \frac{\mathrm {tr}(\mathbf {H}^T \mathbf {K} \mathbf {H})}{\sum _{i=1}^{m} \frac{1}{s_i} (1 - \exp (-\frac{||\mathbf {K} - \mathbf {K}^i||_{F}^2}{\delta ^2}) )}, \quad \mathrm {s.t.}\quad \mathbf {K} \succeq 0. \end{aligned}$$
(7)

The trace ratio problem in Eq. (7) can be solved by a series of trace difference problems [3]. Given \(\mathbf {H}\), \(\mathbf {s}\), and \(\mathbf {K}^{t-1}\) from the \((t-1)\)-th iteration, we first introduce the following auxiliary variable at the t-th iteration,

$$\begin{aligned} \lambda ^t = \frac{\mathrm {tr}(\mathbf {H}^{T} \mathbf {K}^{t-1} \mathbf {H})}{\sum _{i=1}^{m} \frac{1}{s_i} (1 - \exp (-\frac{||\mathbf {K}^{t-1} - \mathbf {K}^i||_{F}^2}{\delta ^2}) )}, \end{aligned}$$
(8)

where the superscript \((t-1)\) on \(\mathbf {H}, \mathbf {s}\) is omitted. Equation (7) can then be reformulated into the following trace difference problem,

$$\begin{aligned} \max _{\mathbf {K}} \quad&\mathrm {tr}(\mathbf {H}^{T} \mathbf {K} \mathbf {H}) - \lambda ^t \sum _{i=1}^{m} \frac{1}{s_i} (1 - \exp (-\frac{||\mathbf {K} - \mathbf {K}^i||_{F}^2}{\delta ^2}) ), \quad \mathrm {s.t.}\quad \mathbf {K} \succeq 0. \end{aligned}$$
(9)

According to half-quadratic (HQ) optimization theory, we introduce the auxiliary variable \(\mathbf {g} \in \mathcal {R}^{m\times 1}\) with \(g_i = \exp (-\frac{||\mathbf {K}^{t-1} -\mathbf {K}^i||_{F}^2}{\delta ^2})\). Then Eq. (9) can be rewritten as,

$$\begin{aligned} \min _{\mathbf {K}} \quad&\lambda ^t \sum _{i=1}^{m} \frac{g_i}{s_i}||\mathbf {K} - \mathbf {K}^i||^2 - \mathrm {tr}(\mathbf {K}^T (\mathbf {H} \mathbf {H}^{T})), \quad \mathrm {s.t.}\quad \mathbf {K} \succeq 0. \end{aligned}$$
(10)

By introducing the auxiliary variable \(\mathbf {\hat{K}} = \frac{1}{\sum _{i'=1}^{m}\frac{g_{i'}}{s_{i'}}}\sum _{i=1}^{m} \frac{ g_i }{s_i } \mathbf {K}^i + \frac{1}{2 \lambda ^t\sum _{i'=1}^{m}\frac{g_{i'}}{s_{i'}}} \mathbf {H} \mathbf {H}^{T}\), the problem in Eq. (10) can be simplified as,

$$\begin{aligned} \min _{\mathbf {K}} \quad&||\mathbf {K} - \mathbf {\hat{K}}||^2, \quad \mathrm {s.t.}\quad \mathbf {K} \succeq 0. \end{aligned}$$
(11)

The above problem is the projection of \(\mathbf {\hat{K}}\) onto the positive semi-definite cone. The optimal solution can be obtained by setting

$$\begin{aligned} \mathbf {K} = \mathbf {U} \mathbf {S}^{+} \mathbf {V}^T, \end{aligned}$$
(12)

where \(\mathbf {\hat{K}} = \mathbf {U} \mathbf {S} \mathbf {V}^T\) is the singular value decomposition (SVD) of \(\mathbf {\hat{K}}\) and \(\mathbf {S}^{+}\) is the non-negative projection of \(\mathbf {S}\), i.e., \(\mathbf {S}^{+} = \max (\mathbf {S}, \mathbf {0})\).
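Putting Eqs. (8)-(12) together, the \(\mathbf {K}\)-update could be sketched as follows. The helper name is ours, and we realize the projection via the symmetric eigendecomposition, clamping negative eigenvalues to zero, which is the standard projection onto the PSD cone; the paper formulates this step through the SVD in Eq. (12).

```python
import numpy as np

def update_K(K_prev, H, s, kernels, delta):
    """One K-update: trace-difference surrogate (Eqs. 8-10) plus PSD projection (Eqs. 11-12)."""
    errs = np.array([np.linalg.norm(K_prev - Ki, 'fro') ** 2 for Ki in kernels])
    g = np.exp(-errs / delta ** 2)                              # HQ auxiliary variable
    lam = np.trace(H.T @ K_prev @ H) / np.sum((1.0 - g) / s)    # trace ratio value, Eq. (8)
    w = g / s
    # Closed-form minimizer target K_hat of the quadratic surrogate in Eq. (10).
    K_hat = sum(wi * Ki for wi, Ki in zip(w, kernels)) / w.sum() \
            + (H @ H.T) / (2.0 * lam * w.sum())
    # Project onto the PSD cone: symmetrize, then clamp negative eigenvalues to zero.
    K_hat = (K_hat + K_hat.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(K_hat)
    return (eigvecs * np.maximum(eigvals, 0.0)) @ eigvecs.T
```

In the full alternating scheme, the \(\mathbf {H}\)-, \(\mathbf {s}\)-, and \(\mathbf {K}\)-updates are repeated until the trace ratio value \(\lambda ^t\) stops increasing.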

3 Experiments

We compare TRMKC with 10 MKC algorithms to show its effectiveness and provide a convergence analysis of TRMKC on benchmark datasets.

3.1 Experiment Setup

We use 11 widely used datasets for multiple kernel clustering comparison [2, 4]; the details of these datasets are omitted due to space limitations. We compare TRMKC with 10 recently developed MKC methods, including RMSC [18], RMKKM [2], MKCMR [8], LKAM [5], ONKC [10], JMKSC [19], MKCSS [20], ONALK [7], SPMKC [13], and OPLF [9]. The code for these methods is mostly provided by the authors through public repositories or by email. We adopt clustering accuracy (ACC) to evaluate the clustering results.

Table 1. ACC(%) of 11 algorithms on 11 benchmark datasets

We follow the same strategy as in [2, 19] to build 12 base kernels to evaluate the clustering performance of the MKC methods. Several hyper-parameters need to be set in advance for a fair comparison, and it is non-trivial to choose the multiple hyper-parameters yielding the best clustering performance for LKAM, ONKC, JMKSC, MKCSS, ONALK, and SPMKC across different datasets. As a result, some hyper-parameters of these algorithms are set according to the suggestions in their papers, while others are determined by inspecting the corresponding sensitivity studies and choosing a stable value.
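The precise 12-kernel construction follows [2, 19]; purely as a hedged illustration (the kernel types, polynomial degrees, and bandwidth grid below are our assumptions, not the authors' exact recipe), a typical base-kernel pool can be built as:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def build_base_kernels(X):
    """Illustrative base-kernel pool: linear, polynomial, and Gaussian kernels.
    The actual 12-kernel construction in the paper follows [2, 19]."""
    kernels = []
    lin = X @ X.T
    kernels.append(lin)                                   # linear kernel
    for d in (2, 4):
        kernels.append((lin + 1.0) ** d)                  # polynomial kernels
    dist_sq = squareform(pdist(X, 'sqeuclidean'))
    sigma0 = np.sqrt(np.median(dist_sq))                  # median heuristic for bandwidth
    for t in (0.5, 1.0, 2.0, 4.0):
        kernels.append(np.exp(-dist_sq / (2.0 * (t * sigma0) ** 2)))  # Gaussian kernels
    return kernels
```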

The concrete hyper-parameter settings for all these algorithms are as follows. For all MKC algorithms, the number of clusters is set to the true number of classes on all datasets. For RMSC, the regularization parameter is set to \(\lambda =0.005\) as in [18]. For RMKKM, the parameter \(\gamma \) controlling the kernel weight distribution is set to \(\gamma = 0.3\) as in [2]. For MKCMR [8], the regularization parameter is fixed to \(\lambda = 2^{-2}\). For LKAM [5], the regularization parameter and the neighborhood size are set to \(\lambda =2^{-1}\) and \(\tau =0.05n\). For ONKC [10], the importance of the neighborhood kernel learning and the kernel diversity balancing term are set to \(\rho =2^{-4}\) and \(2^{-7}\). For JMKSC, the hyper-parameters are set to \(\alpha =10^{-1}, \beta =20, \gamma =5\) as suggested in [19]. For MKCSS [20], the constrained rank value, the importance of the Frobenius term, the kernel diversity balancing term, and the number of neighbors are fixed to 0.1n, \(10^{-4}||\mathbf {K}_{\mathrm {avg}}||_\mathrm {F}\), \(2^{-2}\), and 0.01n. For ONALK [7], the kernel diversity balancing term, the importance of the neighborhood kernel learning, and the threshold of neighborhood similarity are set to \(\lambda =2,\rho =2^{-1},\tau =0.1\). For SPMKC [13], the hyper-parameters are set to \(\lambda _1=4, \lambda _2=1, \lambda _3=100,\lambda _4=1000\). OPLF is free of hyper-parameters [9]. For TRMKC, we use the neighborhood kernel as in [20], and the local size is fixed to \(5\log (n)\) on all datasets without tuning. For all methods except OPLF, we run k-means 10 times and report the result corresponding to the minimal value of the k-means objective; the variance for all compared algorithms is therefore 0 in our experiments.

3.2 Clustering Results Analysis

Based on the clustering results in Table 1, we can see that

  • TRMKC outperforms all 10 MKC algorithms on the 11 benchmark datasets in most cases. On average, our method achieves a 13.59% improvement in ACC over the second-best result.

  • The results of MKCMR, LKAM, ONKC, JMKSC, MKCSS, ONALK, and SPMKC degenerate in general and exhibit poor performance in many cases. These algorithms could achieve better results by tuning their parameters via the grid-search strategy, but this is infeasible in practice. Since we run these algorithms with fixed hyper-parameters, the ubiquitous degeneration indicates that they are less stable without careful tuning.

  • Considering the difficulty of hyper-parameter determination in the MKC scenario and the degeneration of existing MKC algorithms with fixed parameters, these experimental results motivate the development of a parameter-free method for the MKC task. Compared with the baseline methods, the results of TRMKC demonstrate not only superior performance but also stability across all datasets.

4 Conclusion

In this paper, we propose a novel parameter-free MKC method based on the trace ratio criterion (TRMKC), which minimizes the approximation errors for optimal kernel learning and maximizes the mean similarities on the consensus kernel. An alternating optimization algorithm has been developed to solve the resulting problem with the trace ratio criterion and the non-quadratic CIM loss. Comprehensive experimental results on 11 widely used datasets show that existing MKC methods with hyper-parameters are indeed less stable and exhibit poor results without proper parameter determination, and that TRMKC consistently outperforms the state-of-the-art competitors in terms of clustering performance.